Data Science Asked by Math on September 8, 2020
I have a dataset with labels and usernames:
Labels Usernames
1 Londonderry
1 Londoncalling
1 Steveonder43
0 Maryclare_re
1 Patent107391
0 Anonymous
1 _24londonqr
...
It seems that usernames containing the word London are assigned label=1 very frequently.
Do you have any idea how I could prove it?
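You could create a second binary column that flags whether each username contains "london" or not. The check below ignores case, so that _24londonqr is matched as well:
# df is the DataFrame from the question, with columns 'Labels' and 'Usernames'
# 1 if the username contains "london" (case-insensitive), otherwise 0
df['London'] = df['Usernames'].str.contains('london', case=False).astype(int)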
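Before computing any statistic, it can help to look at the raw counts behind the flag. The following is a minimal sketch, assuming only the seven example rows from the question stand in for the full dataset; the variable name table is my own choice, and the case-insensitive match is an assumption about what should count as "containing London".

```python
import pandas as pd

# Only the seven example rows shown in the question; the real dataset is larger.
df = pd.DataFrame({
    "Labels": [1, 1, 1, 0, 1, 0, 1],
    "Usernames": ["Londonderry", "Londoncalling", "Steveonder43", "Maryclare_re",
                  "Patent107391", "Anonymous", "_24londonqr"],
})

# Case-insensitive flag: 1 if the username contains "london", otherwise 0.
df["London"] = df["Usernames"].str.contains("london", case=False).astype(int)

# Rows are the London flag, columns are the label; each cell counts usernames.
table = pd.crosstab(df["London"], df["Labels"])
print(table)
```

If most of the London=1 row sits in the label=1 column while the London=0 row is more evenly split, that is the pattern described in the question.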
Then, since you want a correlation and both variables are binary, you can use the Pearson correlation coefficient, which for two binary variables is the same as the phi coefficient:
from scipy.stats import pearsonr
# pearsonr returns the correlation coefficient and a two-sided p-value
corr, p_value = pearsonr(df['Labels'], df['London'])
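As a rough end-to-end sketch of the correlation step, the snippet below runs pearsonr on the seven example rows from the question (an assumption: the real dataset is larger, and the numbers here are only illustrative). It also adds Fisher's exact test on the 2x2 table of counts. That test is not part of the answer above; it is one common option I am adding for checking association between two binary variables when counts are small.

```python
import pandas as pd
from scipy.stats import fisher_exact, pearsonr

# Only the seven example rows shown in the question; the real dataset is larger.
df = pd.DataFrame({
    "Labels": [1, 1, 1, 0, 1, 0, 1],
    "Usernames": ["Londonderry", "Londoncalling", "Steveonder43", "Maryclare_re",
                  "Patent107391", "Anonymous", "_24londonqr"],
})

# Case-insensitive flag: 1 if the username contains "london", otherwise 0.
df["London"] = df["Usernames"].str.contains("london", case=False).astype(int)

# Pearson correlation of two binary variables (the phi coefficient).
phi, p_pearson = pearsonr(df["Labels"], df["London"])

# Fisher's exact test on the 2x2 contingency table of counts.
table = pd.crosstab(df["London"], df["Labels"])
odds_ratio, p_fisher = fisher_exact(table.to_numpy())

print(f"phi = {phi:.3f}, Pearson p-value = {p_pearson:.3f}")
print(f"Fisher exact p-value = {p_fisher:.3f}")
```

With so few rows, neither p-value should be read as evidence either way; the point is only to show the calls. On the full dataset, a clearly positive coefficient together with a small p-value would support the claim that "London" usernames are associated with label=1.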
Answered by hH1sG0n3 on September 8, 2020