Data Science Asked by Haffi112 on December 16, 2020
I’m using supervised learning on monthly activity data to predict when a customer buys a particular product. This product is typically bought infrequently and at the moment my target variable is whether the customer buys the product in the next twelve months.
Assume that for every customer I get a set of features every month, $x_1,x_2,\ldots,x_n$. The goal is to use these features to predict whether $y=0$ or $y=1$ ($y$ is 1 if the customer did buy the product in the next twelve months, otherwise it is zero).
However, this creates a dilemma. If I use this approach for $y$ then my freshest training data is twelve months old as I do not know the true value for $y$ for data that is younger than twelve months old. My main question is thus the following: Is there a way for me to make use of newer data in this setting?
Also, I should note that I have tried changing $y$ into: “Does the customer buy the product in the next month?”. It works but not nearly as well as the other approach. My data is imbalanced so by allowing the target period to be composed of the following twelve months instead of a single month I get many more positive data points.
I face the same problem, and I am afraid the answer is generally no. You have to wait for the full time horizon to elapse before you can use new data in your calibration process. Likewise, to validate a prediction you have to wait out the time horizon.
Building a substitute target from information you already have would not yield interesting results. Because that target would be constructed from some of your features, an ML model fed those features would simply learn the rule you used to build the target.
As you mentioned, the main solution is to build a target with a shorter horizon. A shorter horizon might help with your main problem, but if it is too short you will get stability problems. It can also simply be a bad target: for example, if the sale usually happens 3 months later, a 1-month horizon won't help the model learn the relationship between the features and the sale. You could try other target horizons between 1 month and a year.
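To make the trade-off concrete, here is a minimal sketch of how labels could be built for several candidate horizons. The function name `horizon_labels` and the toy purchase history are assumptions made for illustration only. Months whose look-ahead window is not fully observed are left unlabeled, which shows how a shorter horizon frees up more recent data while yielding fewer positives.

```python
from typing import List, Optional


def horizon_labels(purchases: List[int], horizon: int) -> List[Optional[int]]:
    """Label month t with 1 if any purchase occurs in months t+1..t+horizon.

    Months whose full look-ahead window is not yet observed get None,
    because their true label is still unknown.
    """
    n = len(purchases)
    labels: List[Optional[int]] = []
    for t in range(n):
        if t + horizon >= n:
            # The window t+1..t+horizon extends past the observed data.
            labels.append(None)
        else:
            window = purchases[t + 1 : t + 1 + horizon]
            labels.append(int(any(window)))
    return labels


# Hypothetical customer: 12 observed months, one purchase at month index 7.
purchases = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

for h in (1, 3, 6):
    labels = horizon_labels(purchases, h)
    usable = sum(label is not None for label in labels)
    positives = sum(label == 1 for label in labels)
    print(f"horizon={h}: usable months={usable}, positive labels={positives}")
```

Comparing the usable and positive counts across horizons on your own data is one way to pick an intermediate horizon.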
Note that the time horizon you choose may also depend on how frequently you observe your instances. If the target's horizon is longer than the interval between observations, consecutive targets overlap and become correlated. This can seriously undermine attempts to separate independent subsets, for example when splitting into train and test sets or using cross-validation.
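One common way to limit this leakage is to leave a gap of one horizon between the training and test periods, so that no training label window reaches into the test period. The sketch below assumes months are indexed by integers and that the label for month `m` covers months `m+1` to `m+horizon`; `split_with_gap` is a hypothetical helper, not a library function.

```python
from typing import List, Tuple


def split_with_gap(
    months: List[int], test_start: int, horizon: int
) -> Tuple[List[int], List[int]]:
    """Time-ordered split that drops training months whose label window
    (m+1 .. m+horizon) would extend past test_start."""
    train = [m for m in months if m + horizon <= test_start]
    test = [m for m in months if m >= test_start]
    return train, test


# Hypothetical setup: 24 monthly observations, 3-month target horizon.
months = list(range(24))
train, test = split_with_gap(months, test_start=18, horizon=3)

print("last training month:", train[-1])
print("first test month:", test[0])
print("months dropped in the gap:", sorted(set(months) - set(train) - set(test)))
```

If you use scikit-learn, `TimeSeriesSplit` has a `gap` parameter that serves a similar purpose for cross-validation.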
Answered by lcrmorin on December 16, 2020
There is no way around this other than shortening the target period. For the imbalanced data problem, you could try sampling methods; some algorithms also have parameters that give more weight to the minority class during optimization.
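As an illustration of class weighting, the sketch below computes weights inversely proportional to class frequency, using the heuristic `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the 95/5 label split are assumptions for the example; many libraries accept weights of this kind through a class-weight or sample-weight parameter.

```python
from collections import Counter
from typing import Dict, Hashable, List


def balanced_class_weights(labels: List[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count),
    so rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical imbalanced target: 95 non-buyers, 5 buyers.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# Per-sample weights that could be passed to a weighted loss.
sample_weights = [weights[label] for label in labels]
print(weights)
```

With these weights, both classes contribute the same total weight to the loss, which keeps the majority class from dominating the optimization.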
Answered by Tolga Karahan on December 16, 2020