Data Science Asked on January 4, 2022
I’m using python. Some 2D numpy arrays are stored in individual rows of a Series. They are 30×30 images. It looks something like this:
pixels
0 [[23,4,54...],[54,6,7...],[........]]
1 [[65,54,255,55,...],[43,54,6...],[......]]
...
...
...
7000 [[........]]
For each row in the Series, I want to take these 2D arrays, flatten them to 1D, take the values and assign them to the columns of one row in data frame. Each row will have 30×30 = 900pixels each, storing the values of each pixel. Like this:
pixel1 pixel2 pixel3... pixel900
0 23 4 54 77
1 65 54 255 33
...
...
...
I’m using an elaborate function that extracts one row from the series at a time, flattens the array, converts it to a Series again, and appends it to a dataframe. It takes sooo long. I’m sure there must be a faster way. I’m using this code:
def prep_pixels(X):
# X is a series
df = pd.DataFrame()
for i in range(len(X.index)): #iterate through the whole series
df = df.append(pd.Series(X[i].flatten()), ignore_index=True)
return df
EDIT:
Upon request from a user, I will provide code with how I ended up in this rut in the first place 😀
#reading the files
filepath = 'dataset.pickle'
data_np = pd.read_pickle(filepath)
print(data_np[0])
Output:
[array([[255, 248, 253, 255, 251, 253, 254, 236, 220, 217, 191, 145, 139,
185, 216, 227, 252, 251, 254, 248, 251, 236, 221, 222, 213, 175,
120, 75, 74, 209],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
... 30 rows of 30 pixels
...
... last row coming up ...
[255, 255, 254, 254, 253, 252, 253, 254, 255, 255, 254, 252, 249,
249, 251, 213, 126, 178, 231, 252, 248, 250, 254, 254, 252, 253,
255, 255, 255, 255]], dtype=uint8), 'क']
The last symbol in this list is the character that this image represents. It’s the ‘label’. It’s supervised learning using CNNs. Anyway, I need them to be in the other format I described to be able to work with them. This is how I’m handling this data:
data = pd.DataFrame(data_np, columns=['pixels','labels'])
def prep_pixels(X):
df = pd.DataFrame()
for i in range(len(X.index)): #iterate through whole series
df = df.append(pd.Series(X[i].ravel()), ignore_index=True)
return df
X = prep_pixels(data['pixels'])
y = data['labels']
EDIT: a user suggested that I use a mutable datatype to do this procedure. They said that it might speed things up because the computation does not need to make copies of data. I used some nested for loops and it cut the time to half (1 min 22 sec instead of 3 min). I still feel like its pathetic, given that my dataset has just 7000, 30×30 pixel images. Or, maybe I’m just new to data wrangling.
Here is the code I used. Please let me know if you have any other suggestions:
filepath = 'dataset.pickle'
data_np = pd.read_pickle(filepath)
df = pd.DataFrame()
for row in range(IMG_ROW):
for col in range(IMG_COL):
pixel=[]
for img in range(len(data_np)):
pixel.append(data_np[img][0][row][col])
columns = pd.Series(pixel, name=col)
df = pd.concat([df, columns], ignore_index=True, axis=1)
Or you walk through each column, see Test3. Mind in general:
Results:
(AMD A8-3870 APU with Radeon(tm) HD Graphics, 3000 MHz, 4 cores, 4 logical processors, 8 GB memory, Windows 10 64bit)
import numpy as np
import pandas as pd
# https://stackoverflow.com/questions/7370801/how-to-measure-elapsed-time-in-python
from timeit import default_timer as timer
row = [np.array([
[255, 248, 253, 255, 251, 253, 254, 236, 220, 217, 191, 145, 139,
185, 216, 227, 252, 251, 254, 248, 251, 236, 221, 222, 213, 175,
120, 75, 74, 209],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 255, 248, 252, 255, 202, 88, 15, 16, 14, 11, 11,
12, 12, 20, 40, 46, 38, 43, 40, 25, 21, 19, 17, 35,
53, 58, 64, 124],
[255, 253, 254, 253, 252, 254, 223, 146, 87, 75, 58, 30, 27,
58, 86, 116, 157, 168, 164, 165, 167, 136, 96, 71, 59, 49,
21, 9, 27, 144],
[255, 255, 254, 254, 253, 252, 253, 254, 255, 255, 254, 252, 249,
249, 251, 213, 126, 178, 231, 252, 248, 250, 254, 254, 252, 253,
255, 255, 255, 255]
], dtype='uint8'), "क"]
data_np = []
for i in range(7000):
data_np.append(row)
data = pd.DataFrame(data_np, columns=['pixels','labels'])
# Test1 (@IsuShrestha, flatten each row, append row-wise)
def prep_pixels(X):
df = pd.DataFrame()
for i in range(len(X.index)): #iterate through whole series
df = df.append(pd.Series(X[i].ravel()), ignore_index=True)
return df
start = timer()
test = prep_pixels(data['pixels'])
print(timer()-start)
print(test.shape)
# 232.96580633900157
# (7000, 900)
# Test2 (@BenReiniger, flatten each row)
def prep_pixels2(X):
flatX = X.apply(lambda x: x.flatten())
return pd.DataFrame(row for row in flatX)
start = timer()
test2 = prep_pixels2(data['pixels'])
print(timer()-start)
print(test2.shape)
# 6.919965944998694
# (7000, 900)
# Test3 (append column-wise = over all rows in one go = 29 appends)
def prep_pixels3(X):
test = np.array([x[0] for x in X])
for i in range(len(X[0])-1):
# print(i)
test = np.append(arr=test, values=np.array([x[i+1] for x in X]), axis=1)
return pd.DataFrame(test)
start = timer()
test3 = prep_pixels3(data['pixels'].to_numpy())
print(timer()-start)
print(test3.shape)
# 0.3909464059997845
# (7000, 900)
Answered by questionto42 on January 4, 2022
I'm getting timeit
results of about 1/4 of the time using:
flatX = X.apply(lambda x: x.flatten())
pd.DataFrame(item for item in flatX)
See also https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe for some possibly better options for the second line.
(Regarding my earlier comment, I don't get any real savings by just dataframe-ing at the end.)
Answered by Ben Reiniger on January 4, 2022
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP