Stack Overflow Asked by user11035198 on December 30, 2021
I have a dataframe in which one of the columns has the format:
Items
3
3
3
5
5
11
11
11
11
16
16
...
I want to quickly get a single instance of every number (so for the example above I would need 3, 5, 11, and 16). Currently I have this:
Item_set = set()
for index, row in df.iterrows():
    Item_set.add(row['Items'])
But the dataframe is ~385,000 rows long, so this process takes 15 minutes. Is there any way to speed this up?
I happened to be working on something similar when I came across this question. Why not convert your Series to a NumPy array and use np.unique()? As far as I know that's the fastest option; it was the quickest of the three variants I timed below, on a Series of 1,281 elements.
%timeit y_data.unique()
105 µs ± 8.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.unique(list(y_data))
220 µs ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.unique(y_data.to_numpy())
23.3 µs ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
len(y_data)
Out[104]: 1281
type(y_data)
Out[109]: pandas.core.series.Series
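One behavioral difference is worth keeping in mind alongside the timings: np.unique() returns its result sorted, while Series.unique() returns values in order of first appearance. A minimal sketch, assuming a small made-up Series (the values are illustrative and are not the data used in the benchmark above):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, deliberately unsorted
y_data = pd.Series([11, 3, 5, 3, 16, 11, 5])

by_pandas = y_data.unique()              # order of first appearance
by_numpy = np.unique(y_data.to_numpy())  # sorted ascending

print(by_pandas.tolist())  # [11, 3, 5, 16]
print(by_numpy.tolist())   # [3, 5, 11, 16]
```

The relative speed may also change with size: the pandas documentation describes its hash-based unique as significantly faster than numpy.unique for long enough sequences, so it is probably worth re-running the timings on the full ~385,000 rows before settling on one.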
Answered by Harshwardhan Nandedkar on December 30, 2021
Given your Pandas Series s, s.unique() should do the job:
>>> import pandas as pd
>>> s = pd.Series([3,3,3, 5, 5])
>>> s.unique()
array([3, 5])
If you need a set:
Item_set = set(s.unique())
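Applied to the question's DataFrame, the same idea replaces the iterrows() loop with a single vectorized call. A sketch, assuming a DataFrame named df with an 'Items' column as in the question (the rows here are just the sample values shown there):

```python
import pandas as pd

# Sample rows copied from the question; the real frame has ~385,000 rows
df = pd.DataFrame({"Items": [3, 3, 3, 5, 5, 11, 11, 11, 11, 16, 16]})

# One vectorized call instead of a Python-level loop over iterrows()
Item_set = set(df["Items"].unique())

# Convert to plain ints only for display
print(sorted(int(x) for x in Item_set))  # [3, 5, 11, 16]
```

The loop is slow mainly because iterrows() builds a new Series object for every row; selecting the column once and calling unique() avoids that per-row overhead.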
Answered by Jonathan Besomi on December 30, 2021