Stack Overflow Asked on November 29, 2020
I have a large number of files for which I have to carry out calculations based on string columns. The relevant columns look like this:
import pandas as pd

df = pd.DataFrame({'A': ['A', 'B', 'A', 'B'],
                   'B': ['B', 'C', 'D', 'A'],
                   'C': ['A', 'B', 'D', 'D'],
                   'D': ['A', 'C', 'C', 'B']})
   A  B  C  D
0  A  B  A  A
1  B  C  B  C
2  A  D  D  C
3  B  A  D  B
I have to create new columns containing the number of occurrences of certain strings in each row. I currently do it like this:
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
However, this is taking minutes per file, and I have to do this for around 900 files. Is there any way I can speed this up?
Rather than using apply to loop over each row, I looped over each letter and computed its per-row count with a single vectorized comparison:
for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)
This seems to be an improvement in this example, but (from quick testing not shown) it can be roughly equal or even worse depending on the shape and size of the data, and probably on how many strings you are looking for.
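One caveat with this loop: because it adds each n_ column to df as it goes, later iterations compare against the freshly added count columns as well. That happens to be harmless here (an integer column never equals a letter), but a sketch that restricts the comparison to the original columns makes the intent explicit:

cols = ['A', 'B', 'C', 'D']
for l in cols:
    # compare only the original string columns, not the counts added so far
    df['n_' + l] = (df[cols] == l).sum(axis=1)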
Some time comparisons:
%%timeit
for elem in ['A', 'B', 'C', 'D']:
    df['n_{}'.format(elem)] = df[['A', 'B', 'C', 'D']].apply(lambda x: (x == elem).sum(), axis=1)
#6.77 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
for l in ['A','B','C','D']:
    df['n_' + l] = (df == l).sum(axis=1)
#1.95 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And for the other answers here:
%%timeit
df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))
#3.59 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df1 = df.join(pd.get_dummies(df, prefix='n', prefix_sep='_').sum(1, level=0))
#5.82 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
#5.58 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
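All of these timings are on the tiny 4x4 example frame, and as noted above, the ranking can shift with the shape and size of the data. Since the real files take minutes each, it is worth rerunning the comparison on something closer to their actual size; a minimal sketch, with a hypothetical 100,000-row frame of random letters (substitute it for df in the %%timeit cells above):

import numpy as np

rng = np.random.default_rng(0)  # hypothetical test data, not the asker's real files
big = pd.DataFrame(rng.choice(list('ABCD'), size=(100_000, 4)),
                   columns=['A', 'B', 'C', 'D'])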
Correct answer by Tom on November 29, 2020
You could do:
counts = df.apply(lambda s: s.value_counts(), axis=1).fillna(0)
counts.columns = [f'n_{col}' for col in counts.columns]
df.join(counts)
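One small caveat with this approach: value_counts leaves NaN for letters absent from a row, so the fillna(0) step promotes every count column to float. If integer counts are wanted, a cast (a minor addition, not part of the original answer) restores them:

counts = counts.astype(int)  # the NaNs forced float columns; cast back to int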
Answered by Ayoub ZAROU on November 29, 2020
Try get_dummies and sum with level; here we do not need stack :-)
df = df.join(pd.get_dummies(df, prefix='n', prefix_sep='_').sum(1, level=0))
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
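A note if you are running this on a recent pandas: the level= argument to sum was deprecated in pandas 1.3 and removed in 2.0. A sketch of an equivalent on modern versions, transposing so the duplicated n_ column labels can be grouped on the index:

dummies = pd.get_dummies(df, prefix='n', prefix_sep='_')
# duplicate labels like 'n_A' become duplicate index labels after .T, so
# groupby(level=0).sum() collapses them; the final .T restores the orientation
df = df.join(dummies.T.groupby(level=0).sum().T)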
Answered by BEN_YO on November 29, 2020
Use stack + str.get_dummies, then sum on level=0 and join the result with df:
df1 = df.join(df.stack().str.get_dummies().sum(level=0).add_prefix('n_'))
Result:
print(df1)
   A  B  C  D  n_A  n_B  n_C  n_D
0  A  B  A  A    3    1    0    0
1  B  C  B  C    0    2    2    0
2  A  D  D  C    1    0    1    2
3  B  A  D  B    1    2    0    1
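The same level= caveat applies to sum(level=0) here; on pandas 2.0+ an equivalent, as a sketch, is:

df1 = df.join(df.stack().str.get_dummies().groupby(level=0).sum().add_prefix('n_'))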
Answered by Shubham Sharma on November 29, 2020