TransWikia.com

Logistic regression for classification?

Data Science Asked on December 6, 2020

I have a dataset in which most columns hold Boolean or categorical values.
A sample of it is:

Name               Country  approved  political
bbc.com            US       true      True
stackoverflow.com  US       true      False
Number.com         US       False     False

...

Based on the values above, I would like to determine whether other websites have been approved or not.

My questions are:

Does a heatmap or correlation matrix make sense with categorical variables?

Would it be possible to predict whether a website is approved or not (the target variable) using categorical values?

Is there any other model that should be preferable?

Thanks

2 Answers

Logistic regression is a standard method of performing binary classification, which matches your task here. Categorical variables can be dealt with, depending on the model you choose.

You can see from the Scikit-Learn documentation on logistic regression that your data only really needs to be of a certain shape: (num_samples, num_features). The values must be numeric, however: fitting on raw string columns raises an error rather than skipping them, so you should convert e.g. strings to class IDs (e.g. integers) - see below.


Computing a correlation can make sense for categorical values, but to compute it you need to provide numerical values; strings like "bbc.com" or "US" won't work.

You can map each of the values to a numerical value and make a new column with that data using pd.factorize like this:

import pandas as pd

df["Country_id"] = pd.factorize(df["Country"])[0]   # factorize returns (codes, uniques); keep the integer codes
df["Name_id"] = pd.factorize(df["Name"])[0]

You don't really need to do this for the approved and political columns, because they hold boolean values, which Python treats as 0 and 1 for False and True, respectively (assuming they are stored as real booleans rather than as strings such as "true").
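To make the mapping concrete, here is a minimal sketch that rebuilds the three sample rows from the question as a DataFrame and applies pd.factorize to them. Building the DataFrame by hand is an assumption made for illustration; real data would be loaded from a file or database. Note that the integer codes are arbitrary labels assigned in order of first appearance, not a ranking.

```python
import pandas as pd

# The three sample rows from the question, with booleans stored as real bools.
df = pd.DataFrame(
    {
        "Name": ["bbc.com", "stackoverflow.com", "Number.com"],
        "Country": ["US", "US", "US"],
        "approved": [True, True, False],
        "political": [True, False, False],
    }
)

# factorize returns (codes, uniques); codes follow the order of first appearance.
country_codes, country_uniques = pd.factorize(df["Country"])
df["Country_id"] = country_codes
df["Name_id"] = pd.factorize(df["Name"])[0]

# Booleans convert cleanly to 0/1 if an explicit integer column is wanted.
df["approved_int"] = df["approved"].astype(int)

print(df[["Name_id", "Country_id", "approved_int"]])
```

Because every sample row has the same country, Country_id is constant here; with more countries each distinct value would get its own code.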

Now you can do something like this to see a correlation plot:

import matplotlib.pyplot as plt    # plotting library: pip install matplotlib

# compute the correlation matrix
corr_mat = df[["Name_id", "Country_id", "approved", "political"]].corr()

# plot it
plt.matshow(corr_mat)
plt.show()
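For the prediction part of the question, the following is a rough sketch of fitting a logistic regression, not a definitive recipe. It makes several assumptions: the first three rows come from the question, while the last three rows (the example-*.org sites and the UK country value) are hypothetical and added only so that there is more than one country. It also swaps integer IDs for one-hot encoding via pd.get_dummies, because a linear model would otherwise read the IDs as an ordering.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# First three rows are from the question; the last three are hypothetical.
df = pd.DataFrame(
    {
        "Name": ["bbc.com", "stackoverflow.com", "Number.com",
                 "example-a.org", "example-b.org", "example-c.org"],
        "Country": ["US", "US", "US", "UK", "UK", "UK"],
        "approved": [True, True, False, True, False, False],
        "political": [True, False, False, True, True, False],
    }
)

# One-hot encode Country so the model does not treat country IDs as ordered.
# Name is left out: it is unique per row, so it cannot generalize to new sites.
X = pd.get_dummies(df[["Country", "political"]], columns=["Country"]).astype(int)
y = df["approved"].astype(int)

model = LogisticRegression()
model.fit(X, y)

# Estimated probability of approval for a new, hypothetical site (US, political).
new_site = pd.DataFrame([{"political": 1, "Country_UK": 0, "Country_US": 1}])[X.columns]
proba_approved = model.predict_proba(new_site)[0, 1]
print(f"P(approved) = {proba_approved:.2f}")
```

With only six rows the fitted probabilities mean very little; the point is the shape of the pipeline: encode, fit, then predict on rows that have the same columns in the same order.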

Answered by n1k31t4 on December 6, 2020

It looks like there are two parts to your question:

  1. You want to explore the data before predicting values, to gain insights about it. This falls under visual analytics, where EDA (exploratory data analysis) helps. For choosing the right kind of plot to see the distribution of categorical data, please refer to the links below, which give a very basic understanding of choosing the right chart type. Later on, you can move on to reading papers on visualization.

https://en.wikipedia.org/wiki/Data_visualization is a good starting point. I also prefer reading articles on Medium, since papers are too heavy to start with. A few articles are below:

https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization
https://eazybi.com/blog/data_visualization_and_chart_types

For this specific data, I would be much more interested in seeing a count plot of the target variable and scatter plots for correlation. Seaborn has a nice feature for viewing pair plots after these features are converted to numeric types, as suggested in the first answer.
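A count plot of the target could look roughly like the sketch below. It assumes seaborn and matplotlib are installed and reuses the three sample rows from the question; the hue split on political is an illustrative choice, not something the question asked for.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The three sample rows from the question.
df = pd.DataFrame(
    {
        "Name": ["bbc.com", "stackoverflow.com", "Number.com"],
        "Country": ["US", "US", "US"],
        "approved": [True, True, False],
        "political": [True, False, False],
    }
)

# Bar chart of how many rows fall in each target class, split by `political`.
ax = sns.countplot(data=df, x="approved", hue="political")
ax.set_title("Approved vs. not approved")

# In an interactive session, display the figure with:
# plt.show()
```

Once the categorical columns have been converted to numbers, sns.pairplot can be called on the numeric DataFrame in the same way to get the pairwise scatter plots.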

  2. On the availability of models for classification: your data is a supervised classification problem, since it has labels. There are choices such as probability-based models (Naive Bayes), neural networks, logistic regression, etc. In practice we try different models, tune the parameters, and evaluate whether the model works well on unseen data, which is called generalization performance. Feel free to try libraries like Scikit-Learn, which lets you implement each option and compare their performance: https://scikit-learn.org/stable/auto_examples/index.html#classification To understand logistic regression specifically, please refer to https://web.stanford.edu/~jurafsky/slp3/5.pdf To understand classification in very simple words, please refer to Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar; the classification chapter is available for free.
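The "try several models and compare them on unseen data" workflow can be sketched as follows. Everything about the data here is an assumption: it is synthetic, generated so that approval depends loosely on political, and stands in for the real dataset. The two candidate models and 5-fold cross-validation are illustrative choices, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data (an assumption, not the asker's dataset):
# approval is made to depend loosely on `political`, plus noise.
rng = np.random.default_rng(0)
n = 200
political = rng.integers(0, 2, size=n)
country = rng.choice(["US", "UK", "DE"], size=n)
approved = (rng.random(n) < np.where(political == 1, 0.3, 0.7)).astype(int)

# One-hot encode the categorical column so both models see 0/1 features.
features = pd.DataFrame({"political": political, "Country": country})
X = pd.get_dummies(features, columns=["Country"]).astype(int)
y = approved

candidates = {
    "logistic_regression": LogisticRegression(),
    "bernoulli_naive_bayes": BernoulliNB(),
}

# Mean 5-fold cross-validated accuracy: an estimate of performance on unseen data.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Comparing cross-validated scores rather than training accuracy is what makes this a check of generalization performance.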

Answered by BlackCurrant on December 6, 2020

