Inputting (a lot of )data into a dataframe one row at a time

Question

I'm using python. Some 2D numpy arrays are stored in individual rows of a Series. They are 30x30 images. It looks something like this:

pixels
0    [[23,4,54...],[54,6,7...],[........]]
1    [[65,54,255,55,...],[43,54,6...],[......]]
...
...
...
7000 [[........]]

For each row in the Series, I want to take these 2D arrays, flatten them to 1D, take the values and assign them to the columns of one row in data frame. Each row will have 30x30 = 900pixels each, storing the values of each pixel. Like this:

pixel1    pixel2    pixel3...    pixel900
0       23         4        54             77
1       65        54       255             33
...
...
...

I'm using an elaborate function that extracts one row from the series at a time, flattens the array, converts it to a Series again, and appends it to a dataframe. It takes sooo long. I'm sure there must be a faster way. I'm using this code:

def prep_pixels(X):
    # X is a series
    df = pd.DataFrame()
    for i in range(len(X.index)): #iterate through the whole series
        df = df.append(pd.Series(X[i].flatten()), ignore_index=True) 
    return df

EDIT:
Upon request from a user, I will provide code with how I ended up in this rut in the first place :D

#reading the files
filepath = 'dataset.pickle'
data_np = pd.read_pickle(filepath)
print(data_np[0])

Output:

[array([[255, 248, 253, 255, 251, 253, 254, 236, 220, 217, 191, 145, 139,
        185, 216, 227, 252, 251, 254, 248, 251, 236, 221, 222, 213, 175,
        120,  75,  74, 209],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
         58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
         21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
         12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
         53,  58,  64, 124],
    ... 30 rows of 30 pixels
    ...
    ... last row coming up ...
    [255, 255, 254, 254, 253, 252, 253, 254, 255, 255, 254, 252, 249,
    249, 251, 213, 126, 178, 231, 252, 248, 250, 254, 254, 252, 253,
    255, 255, 255, 255]], dtype=uint8), 'क']

The last symbol in this list is the character that this image represents. It's the 'label'. It's supervised learning using CNNs. Anyway, I need them to be in the other format I described to be able to work with them. This is how I'm handling this data:

data = pd.DataFrame(data_np, columns=['pixels','labels'])
def prep_pixels(X):
    df = pd.DataFrame()
    for i in range(len(X.index)): #iterate through whole series
        df = df.append(pd.Series(X[i].ravel()), ignore_index=True)  
    return df

X = prep_pixels(data['pixels'])
y = data['labels']

EDIT: a user suggested that I use a mutable datatype to do this procedure. They said that it might speed things up because the computation does not need to make copies of data. I used some nested for loops and it cut the time to half (1 min 22 sec instead of 3 min). I still feel like its pathetic, given that my dataset has just 7000, 30x30 pixel images. Or, maybe I'm just new to data wrangling.

Here is the code I used. Please let me know if you have any other suggestions:

filepath = 'dataset.pickle'
data_np = pd.read_pickle(filepath)

df = pd.DataFrame()
for row in range(IMG_ROW):   
    for col in range(IMG_COL):
        pixel=[]
        for img in range(len(data_np)):
            pixel.append(data_np[img][0][row][col]) 
        columns = pd.Series(pixel, name=col)
        df = pd.concat([df, columns], ignore_index=True, axis=1)

questionto42 · Answer

Or you walk through each column, see Test3.
Mind in general:

".apply()" is slower than list comprehension, see https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas.
Column-wise is faster than row-wise in the standard case of #rows > #columns (of course).
Use numpy if you can.

Results:

Test1: 232.96580633900157 (@IsuShrestha, flatten each row, append row-wise)
Test2: 6.919965944998694 (@BenReiniger, flatten each row)
Test3: 0.3909464059997845 (append column-wise = over all rows in one go = 29 appends)

(AMD A8-3870 APU with Radeon(tm) HD Graphics, 3000 MHz, 4 cores, 4 logical processors, 8 GB memory, Windows 10 64bit)
import numpy as np
import pandas as pd
# https://stackoverflow.com/questions/7370801/how-to-measure-elapsed-time-in-python
from timeit import default_timer as timer

row = [np.array([
    [255, 248, 253, 255, 251, 253, 254, 236, 220, 217, 191, 145, 139,
    185, 216, 227, 252, 251, 254, 248, 251, 236, 221, 222, 213, 175,
    120,  75,  74, 209],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],       
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],    
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],
       [255, 255, 255, 248, 252, 255, 202,  88,  15,  16,  14,  11,  11,
     12,  12,  20,  40,  46,  38,  43,  40,  25,  21,  19,  17,  35,
     53,  58,  64, 124],
       [255, 253, 254, 253, 252, 254, 223, 146,  87,  75,  58,  30,  27,
     58,  86, 116, 157, 168, 164, 165, 167, 136,  96,  71,  59,  49,
     21,   9,  27, 144],         
    [255, 255, 254, 254, 253, 252, 253, 254, 255, 255, 254, 252, 249,
    249, 251, 213, 126, 178, 231, 252, 248, 250, 254, 254, 252, 253,
    255, 255, 255, 255]
    ], dtype='uint8'), "क"]

data_np = []
for i in range(7000):
    data_np.append(row)

data = pd.DataFrame(data_np, columns=['pixels','labels'])

# Test1 (@IsuShrestha, flatten each row, append row-wise)
def prep_pixels(X):
    df = pd.DataFrame()
    for i in range(len(X.index)): #iterate through whole series
        df = df.append(pd.Series(X[i].ravel()), ignore_index=True)  
    return df
start = timer()
test = prep_pixels(data['pixels'])
print(timer()-start)
print(test.shape)
# 232.96580633900157
# (7000, 900)

# Test2 (@BenReiniger, flatten each row)
def prep_pixels2(X):
    flatX = X.apply(lambda x: x.flatten())
    return pd.DataFrame(row for row in flatX)
start = timer()
test2 = prep_pixels2(data['pixels'])
print(timer()-start)
print(test2.shape)
# 6.919965944998694
# (7000, 900)

# Test3 (append column-wise = over all rows in one go = 29 appends)
def prep_pixels3(X):
    test = np.array([x[0] for x in X])
    for i in range(len(X[0])-1):
        # print(i)
        test = np.append(arr=test, values=np.array([x[i+1] for x in X]), axis=1)
    return pd.DataFrame(test)
start = timer()
test3 = prep_pixels3(data['pixels'].to_numpy())
print(timer()-start)
print(test3.shape)
# 0.3909464059997845
# (7000, 900)

Ben Reiniger · Answer

I'm getting timeit results of about 1/4 of the time using:

flatX = X.apply(lambda x: x.flatten())
pd.DataFrame(item for item in flatX)

See also 
https://stackoverflow.com/questions/45901018/convert-pandas-series-of-lists-to-dataframe
for some possibly better options for the second line.

(Regarding my earlier comment, I don't get any real savings by just dataframe-ing at the end.)

Inputting (a lot of )data into a dataframe one row at a time

2 Answers

Add your own answers!

Ask a Question