Data Science Asked on December 6, 2020
I have a dataset with most columns having Boolean values and categorical values.
A sample of it is:
Name               Country  approved  political
bbc.com            US       true      True
stackoverflow.com  US       true      False
Number.com         US       False     False
...
Based on the values above, I would like to determine whether other websites have been approved or not.
My questions are:
Does a heatmap or correlation matrix make sense with categorical variables?
Would it be possible to predict whether a website is approved or not (target variable), using categorical values?
Is there any other model that should be preferable?
Thanks
Logistic regression is a standard method of performing binary classification, which matches your task here. Categorical variables can be dealt with, depending on the model you choose.
You can see from the Scikit-Learn documentation on logistic regression that your data only really needs to be of a certain shape: (num_samples, num_features). Scikit-Learn estimators expect numerical input, and string columns typically cause an error during fitting, so you should convert e.g. strings to class IDs (e.g. integers) - see below.
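As a rough sketch of what such a fit could look like, assuming a DataFrame with the question's columns: only the first three rows below come from the question, the remaining rows and the `new_sites` frame are invented for illustration. One-hot encoding is swapped in here for the country, instead of integer IDs, because countries have no natural order; `Name` is left out because every value is unique.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: the first three rows are from the question, the rest are made up.
df = pd.DataFrame({
    "Name": ["bbc.com", "stackoverflow.com", "Number.com",
             "example.org", "example.net", "example.com"],
    "Country": ["US", "US", "US", "UK", "UK", "DE"],
    "approved": [True, True, False, True, False, False],
    "political": [True, False, False, True, False, True],
})

# Feature matrix of shape (num_samples, num_features); booleans cast to 0/1.
X = df[["Country", "political"]].astype({"political": int})
y = df["approved"]

# One-hot encode Country, pass the already numeric column through unchanged.
preprocess = ColumnTransformer(
    [("country", OneHotEncoder(handle_unknown="ignore"), ["Country"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)

# Hypothetical unseen websites; "FR" never appeared in training and is
# encoded as all zeros because of handle_unknown="ignore".
new_sites = pd.DataFrame({"Country": ["US", "FR"], "political": [0, 1]})
predictions = model.predict(new_sites)
probabilities = model.predict_proba(new_sites)
```

With only a handful of rows the predictions are not meaningful; the point is just the shape of the input and the encoding step.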
Computing the correlation can make sense for categorical values, but to compute these, you need to provide numerical values; strings like "bbc.com" or "US" won't work.
You can map each of the values to a numerical value and make a new column with that data using pd.factorize, like this:
import pandas as pd
# pd.factorize returns (codes, uniques); [0] selects the integer ID codes
df["Country_id"] = pd.factorize(df.Country)[0]
df["Name_id"] = pd.factorize(df.Name)[0]
You don't really need to do this for the approved and political columns, because they hold boolean values, which are seen by Python as 0 and 1 for False and True, respectively.
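To make the two points above concrete, here is a small sketch using the three sample rows from the question. The last part is an assumption on my side: if the boolean columns were loaded as text (the sample shows both "true" and "True"), they would need converting to real booleans first, and the `raw` series is a made-up stand-in for such a column.

```python
import pandas as pd

# The three sample rows from the question.
df = pd.DataFrame({
    "Name": ["bbc.com", "stackoverflow.com", "Number.com"],
    "Country": ["US", "US", "US"],
    "approved": [True, True, False],
    "political": [True, False, False],
})

# factorize assigns integer codes in order of first appearance
# and also returns the unique values the codes refer to.
codes, uniques = pd.factorize(df["Name"])
df["Name_id"] = codes

# Booleans already behave like 0/1: summing counts the True values.
approved_count = df["approved"].sum()
bool_as_int = df[["approved", "political"]].astype(int)

# Assumption: a column read in as strings with mixed capitalisation.
raw = pd.Series(["true", "True", "False"])
as_bool = raw.str.lower().map({"true": True, "false": False})
```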
Now you can do something like this to see a correlation plot:
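import matplotlib.pyplot as plt # plotting library: pip install matplotlib
# compute the correlation matrix
cols = ["Name_id", "Country_id", "approved", "political"]
corr_mat = df[cols].corr()
# plot it, labelling both axes with the column names
plt.matshow(corr_mat)
plt.xticks(range(len(cols)), cols, rotation=90)
plt.yticks(range(len(cols)), cols)
plt.colorbar()
plt.show()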
Answered by n1k31t4 on December 6, 2020
It looks like there are two parts to your question.
https://en.wikipedia.org/wiki/Data_visualization is a good starting point. I also prefer reading articles on Medium, as papers are too heavy to start with; a few links are below:
https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization
https://eazybi.com/blog/data_visualization_and_chart_types
For this specific data, I would be much more interested in seeing a count plot of the target variable and scatter plots for correlation. Seaborn has a nice feature for viewing pair plots after converting these features to numeric types, as suggested in the first answer.
Answered by BlackCurrant on December 6, 2020