TransWikia.com

Get values of cells in dataframe column quickly

Stack Overflow Asked by user11035198 on December 30, 2021

I have a dataframe in which one of the columns has the format:

Items
3
3
3
5
5
11
11
11
11
16
16
...

I want to quickly get a single instance of every number (so for the example above I would need 3, 5, 11, and 16). Currently I have this:

Item_set = set()

for index, row in df.iterrows():
    Item_set.add(row['Items'])

But the dataframe is ~385,000 rows long, so this process takes 15 minutes. Is there any way to speed this up?

2 Answers

I came across this question while working on something similar. Why not convert your Series to a NumPy array and call np.unique()? In my timings on a 1,281-element Series, shown below, that was the fastest of the three approaches I tried.

%timeit y_data.unique()
105 µs ± 8.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.unique(list(y_data))
220 µs ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.unique(y_data.to_numpy())
23.3 µs ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

len(y_data)
Out[104]: 1281

type(y_data)
Out[109]: pandas.core.series.Series
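
The timings above come from a 1,281-element Series, so the ranking may not carry over to the ~385,000 rows in the question. A minimal sketch for re-running the comparison at that size follows. The data is synthetic, and the value range of 0 to 999 is an assumption, not something taken from the question.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's column (assumption: integers in [0, 1000))
rng = np.random.default_rng(0)
y_data = pd.Series(rng.integers(0, 1000, size=385_000))

via_pandas = y_data.unique()              # values in order of first appearance
via_numpy = np.unique(y_data.to_numpy())  # values in sorted order

# Both approaches should find the same distinct values
assert sorted(via_pandas.tolist()) == via_numpy.tolist()

runs = 20
t_pandas = timeit.timeit(lambda: y_data.unique(), number=runs)
t_numpy = timeit.timeit(lambda: np.unique(y_data.to_numpy()), number=runs)
print(f"Series.unique: {t_pandas / runs * 1e3:.2f} ms per call")
print(f"np.unique:     {t_numpy / runs * 1e3:.2f} ms per call")
```

Which call wins can depend on the data and on the library versions installed, so it is worth measuring on your own column rather than relying on a small-sample result.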

Answered by Harshwardhan Nandedkar on December 30, 2021

Given your pandas Series s, s.unique() should do the job:

>>> import pandas as pd
>>> s = pd.Series([3,3,3, 5, 5])
>>> s.unique()
array([3, 5])

If you need a set:

Item_set = set(s.unique())
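
Applied to the dataframe from the question, the same idea might look like the sketch below. The small df here is a hypothetical stand-in built from the sample values in the question. The point is that unique() runs on the whole column at once, instead of visiting rows one by one with iterrows().

```python
import pandas as pd

# Hypothetical stand-in for the asker's dataframe
df = pd.DataFrame({"Items": [3, 3, 3, 5, 5, 11, 11, 11, 11, 16, 16]})

# Vectorized: one call on the column, no Python-level loop over rows
unique_items = df["Items"].unique()

# tolist() converts NumPy scalars to plain Python ints before building the set
Item_set = set(unique_items.tolist())
print(Item_set)
```

Calling tolist() before set() is optional. It only ensures the set holds plain Python ints rather than NumPy integer scalars.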

Answered by Jonathan Besomi on December 30, 2021
