TransWikia.com

Is it possible to build a regression model for predicting movie gross using sections on their wikipedia pages?

Data Science Asked on November 25, 2021

I got this as an assignment from a company recruiter and I’ve successfully scraped a dataset of about 650 movies with their ‘Plot’, ‘Music’ and ‘Marketing’ sections and gross. I’ve tried tfidf and count vectorizers and performed LSA/PCA to reduce the dimensions which originally are around 20k terms.

This is really boggling me, due to less instances(650) I guess the no. features should be around 100 or atleast < 600 but that is a drastic reduction of dimensions using PCA or LSA, is it the right thing to do?

Also my regression model is way off target or in some cases overfitting excessively. I’ve also tried normalizing and de-normalizing the target(gross) for no improvements. I know I should try NN and other models too but the results of linear regression should be atleast comparable too, right?

This makes me ponder if its even possible to model such a problem or am I doing something wrong with latter being more probable as I’m no expert in the field.

PLEASE HELP !

One Answer

If your models suffers from multicollinearity problems, then yes, I'd say go for dimensionality reduction (possibly non-linear, such as t-SNE or Autoencoders). Moreover, 100 variables for 650 observations is a lot, I would apply a much more drastic reduction, but that's up to your preferences.

If your model is overfitting excessively, then I suggest you to try Neural Networks with high dropout probability. That helps a lot in preventing overfitting. Also, try other models, the more you try, the more confident you'll be about results.

Please keep in mind that normally Neural Networks require a lot of data to perform very good. If you have just 650 observations, tree-based models (Random Forest Regressions or XGBoost Regressions) could work even better than Deep Learning. They are ensembles of trees, i.e. desgined to combat overfitting, and could capture non-linear patterns of your dataset.

Answered by Leevo on November 25, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP