Get values of cells in dataframe column quickly

Question

I have a dataframe in which one of the columns has the format:
Items
3
3
3
5
5
11
11
11
11
16
16
...

I want to quickly get a single instance of every number (so for the example above I would need 3, 5, 11, and 16). Currently I have this:
Item_set = set()

for index, row in df.iterrows():
    Item_set.add(row['Items'])

But the dataframe is ~385,000 rows long, so this process takes 15 minutes. Is there any way to speed this up?

Harshwardhan Nandedkar · Answer

I came across this question while working on something similar. Why not convert your Series to a NumPy array and use np.unique()?
In my timings below, on a 1,281-element Series, that was the fastest of the three options I tried.
%timeit y_data.unique()
105 µs ± 8.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.unique(list(y_data))
220 µs ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.unique(y_data.to_numpy())
23.3 µs ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

len(y_data)
Out[104]: 1281

type(y_data)
Out[109]: pandas.core.series.Series
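
One difference worth knowing when choosing between the two calls: np.unique() returns the distinct values sorted, while Series.unique() returns them in order of first appearance. A minimal sketch, using a made-up Series (not the y_data timed above) whose values are deliberately unsorted:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; values are deliberately out of order
s = pd.Series([5, 3, 3, 11, 5])

# np.unique sorts, so the distinct values come back in ascending order
sorted_unique = np.unique(s.to_numpy())

# Series.unique does not sort; it keeps the order of first appearance
first_seen_unique = s.unique()

print(sorted_unique.tolist())
print(first_seen_unique.tolist())
```

If you only need a set, the ordering does not matter and either call works. Timings depend on data size and dtype, so it is worth re-running %timeit on your own column.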

Jonathan Besomi · Answer

Given your pandas Series s, s.unique() returns each distinct value once:
>>> import pandas as pd
>>> s = pd.Series([3,3,3, 5, 5])
>>> s.unique()
array([3, 5])

If you need a set:
Item_set = set(s.unique())
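
Applied to the question's dataframe, this replaces the iterrows() loop with a single call on the column. A rough sketch to check that both approaches give the same set; the dataframe here is randomly generated stand-in data, and only the column name Items comes from the question:

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): 10,000 random integers in an 'Items' column
rng = np.random.default_rng(0)
df = pd.DataFrame({"Items": rng.integers(0, 100, size=10_000)})

# Original approach from the question: one Python-level iteration per row
loop_set = set()
for index, row in df.iterrows():
    loop_set.add(row["Items"])

# Vectorized approach: deduplicate the whole column at once, then build the set
Item_set = set(df["Items"].unique())

print(Item_set == loop_set)
```

The loop is slow because iterrows() builds a new Series object for every row; unique() works on the underlying column array directly.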
