Data Science Asked by thewhitetie on January 14, 2021
I am using the text of comments on a forum to predict how many upvotes each comment will get. I want to be able to say, "Comments with words X, Y, Z are upvoted more." To do this, I want to use text features in a regression. In particular,
What model should I use to maximize interpretability of coefficients?
I suppose you have a binary outcome (upvote: yes/no). In that case you could use simple linear (OLS) regression with a lasso penalty. Each word in a bag-of-words representation is a "dummy" variable here. If you look at the estimated coefficients, you can interpret them directly as "marginal effects": a higher value means a higher chance of getting an upvote when the word is present, and you can read off the magnitude of the expected increase in upvote probability directly.
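A minimal sketch of that idea in Python with scikit-learn; the example comments, the binary target, and the alpha value are made up purely for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso

# Toy data: three comments and a made-up binary upvote outcome
comments = [
    "great post thanks",
    "this is wrong and misleading",
    "thanks very helpful",
]
upvoted = np.array([1, 0, 1])

# Bag of words: each word becomes a 0/1 "dummy" column
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(comments)

# Linear probability model with a lasso penalty for sparsity
model = Lasso(alpha=0.01)
model.fit(X, upvoted)

# Each coefficient reads as a marginal effect on the upvote probability
for word, coef in zip(vectorizer.get_feature_names_out(), model.coef_):
    print(f"{word}: {coef:+.3f}")
```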
One problem is that OLS is unconstrained with respect to y, so you can end up with a predicted upvote probability greater than 1. Use logistic regression if this bothers you. Under the logit model you will get similar results: positive/negative coefficients indicate whether a word increases/decreases the probability of an upvote. But because the logit uses a transformation to squeeze y into the interval from zero to one, the coefficients are log-odds and you cannot directly read off marginal effects; you would need to calculate those separately.
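A corresponding logistic-regression sketch, reusing the vectorizer, X, and upvoted objects from the block above (the C value is again an arbitrary illustration). Exponentiating the coefficients gives odds ratios, not marginal effects on the probability:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# L1 penalty keeps the word coefficients sparse, as with the lasso above
logit = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
logit.fit(X, upvoted)

# Coefficients are log-odds; exp(coef) gives the odds ratio per word
for word, coef in zip(vectorizer.get_feature_names_out(), logit.coef_[0]):
    print(f"{word}: log-odds {coef:+.3f}, odds ratio {np.exp(coef):.2f}")
```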
I tend to say: use OLS to get quick and dirty results if you are not interested in super precise estimates but merely looking for a robust picture of which words are important.
However, if you really want to do this in a sound way, you would also need to think about "interactions" between words (positive or negative effects on y), which are masked in the approaches described above.
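One hypothetical way to surface such interactions is to add pairwise word dummies, for example with scikit-learn's PolynomialFeatures; note that this explodes the feature count quickly, so in practice you would restrict it to a shortlist of words. Again a sketch reusing the objects from the first block:

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

# Pairwise interaction dummies: columns for single words plus word pairs
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X.toarray())
names = poly.get_feature_names_out(vectorizer.get_feature_names_out())

# Lasso again shrinks most interaction coefficients to exactly zero
inter_model = Lasso(alpha=0.01).fit(X_inter, upvoted)
for name, coef in zip(names, inter_model.coef_):
    if coef != 0:
        print(f"{name}: {coef:+.3f}")
```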
Answered by Peter on January 14, 2021