TransWikia.com

Estimate correlation in Python

Data Science Asked by Math on September 8, 2020

I have a dataset with labels and usernames:

Labels    Usernames
1         Londonderry
1         Londoncalling
1         Steveonder43
0         Maryclare_re
1         Patent107391
0         Anonymous
1         _24londonqr
...

It seems that usernames containing the word London are very often assigned label=1.
Do you have any idea how I could prove it?

One Answer

You could create a second binary column for your usernames according to whether they contain "london" or not (matching case-insensitively, so that _24londonqr is also caught):

# 1 if the username contains "london" in any casing, else 0
df['London'] = df['Usernames'].str.contains('london', case=False, na=False).astype(int)
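
As a quick check of the flagging step, here is a minimal sketch that uses only the seven rows shown in the question (the full dataset is elided there, so treat the counts as illustrative). The cross-tabulation is an extra step that is not part of the answer; it simply shows how the flag lines up with the label.

```python
import pandas as pd

# Only the seven rows visible in the question; the real dataset is larger.
df = pd.DataFrame({
    "Labels": [1, 1, 1, 0, 1, 0, 1],
    "Usernames": ["Londonderry", "Londoncalling", "Steveonder43",
                  "Maryclare_re", "Patent107391", "Anonymous", "_24londonqr"],
})

# Case-insensitive substring match, stored as 0/1.
df["London"] = df["Usernames"].str.contains("london", case=False).astype(int)

# Rows: London flag (0/1), columns: label (0/1).
table = pd.crosstab(df["London"], df["Labels"])
print(table)
```

In this sample every flagged username has label 1, while the unflagged usernames are split evenly between the two labels.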

Then, given that you want a correlation and that you are comparing two binary variables, the metric to use is the Pearson correlation coefficient (for two binary variables this is the same as the phi coefficient):

from scipy.stats import pearsonr

# pearsonr returns the coefficient and a two-sided p-value
corr, p_value = pearsonr(df['Labels'], df['London'])
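
Since the question asks how to "prove" the association, a significance test on the 2x2 table may also help. The sketch below again assumes that the seven rows from the question stand in for the full dataset. Fisher's exact test is a suggestion added here, not something the answer proposes.

```python
import pandas as pd
from scipy.stats import fisher_exact, pearsonr

# Only the seven rows visible in the question; the real dataset is larger.
df = pd.DataFrame({
    "Labels": [1, 1, 1, 0, 1, 0, 1],
    "Usernames": ["Londonderry", "Londoncalling", "Steveonder43",
                  "Maryclare_re", "Patent107391", "Anonymous", "_24londonqr"],
})
df["London"] = df["Usernames"].str.contains("london", case=False).astype(int)

# Pearson's r on two 0/1 columns (equal to the phi coefficient).
r, p_r = pearsonr(df["Labels"].to_numpy(), df["London"].to_numpy())

# 2x2 contingency table: rows = London flag, columns = label.
table = pd.crosstab(df["London"], df["Labels"])

# Fisher's exact test on the table (an addition, not from the answer).
odds_ratio, p_fisher = fisher_exact(table.to_numpy())

print(f"r = {r:.3f}, Fisher p = {p_fisher:.3f}")
```

With only seven rows neither number can be conclusive; on the full dataset, a clearly positive coefficient together with a small p-value would support the claim that "london" usernames are more often labelled 1.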

Answered by hH1sG0n3 on September 8, 2020
