TransWikia.com

Predicting churn - deal with missing dates in time series and improve modelling result

Data Science Asked on April 10, 2021

This is a follow-up to the question "General approach on time series for customer retention/churn in retail".

I have a time series of data in the following form:

| purchase_date | customer_id | num_purchases | churned |
  2018-10-31      id1           39              0
  2018-11-31      id1           0               0
  2019-01-31      id1           6               0
                  ...
  2019-03-31      id1           88              1
  2019-03-31      id2           300             0
  2018-04-31      id2           2               1
  2019-02-31      id3           1               1
  2019-07-31      id4           100             0
  ...             id5

I grouped the data by month and summed num_purchases per month. The churned column marks the month in which a customer churned; id1, for example, churned in March. Before this, to label who has churned or not, we sampled customers based on a 2-month inactivity period from the churn date. I need to predict whether a user is going to churn within 2 months from now.

I am getting very bad prediction results using, for example, logistic regression with the churned column as the class label. I suspect this is because some users, like id3 and id4, appear only once (or very few times), while others, like id1, appear many times. I am not sure how to approach imputation in this case, because these users simply didn't exist before or after those dates, so I am not sure it would make sense to impute them. Does anyone have an idea of how to improve my model's results? I am getting 0.85 for accuracy, and 0 for precision, recall and F1.
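A recall of 0 means no churned row was predicted as churned. One pattern consistent with these numbers is a model that always predicts class 0 on data where about 15% of the rows are positive. The sketch below only illustrates that arithmetic. The 15/85 split and the `confusion_metrics` helper are assumptions for illustration, not the actual data, and precision is reported as 0 by convention when nothing is predicted positive.

```python
def confusion_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Undefined ratios (zero denominator) are reported as 0.0 by convention.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical labels: 15 churned rows, 85 non-churned rows.
y_true = [1] * 15 + [0] * 85
# A degenerate model that never predicts churn.
y_pred = [0] * 100

metrics = confusion_metrics(y_true, y_pred)
print(metrics)
```

If the real model behaves like this, accuracy alone says little, and precision and recall on the churned class are the numbers to watch.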

One Answer

It would be interesting to treat this as a sequence classification problem. For instance, you could use an HMM (Hidden Markov Model) or an equivalent model to classify the sequences. The data format would be:

ID:  sequence            label
id1: 39, 0, 6, ..., 88   1
id2: 300, 2              1
id3: 1                   1
id4: 100                 0
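A minimal sketch of how the monthly rows could be turned into this per-customer format, using only the standard library. The sample rows, the `build_sequences` name, the chronological sort, and the rule "label is 1 if any row for that customer has churned = 1" are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (month, customer_id, num_purchases, churned).
rows = [
    ("2019-01", "id1", 6, 0),
    ("2018-10", "id1", 39, 0),
    ("2019-03", "id1", 88, 1),
    ("2018-11", "id1", 0, 0),
    ("2019-07", "id4", 100, 0),
]

def build_sequences(rows):
    """Return {customer_id: (list of monthly counts, label)}."""
    by_customer = defaultdict(list)
    for month, customer, count, churned in rows:
        by_customer[customer].append((month, count, churned))
    result = {}
    for customer, records in by_customer.items():
        # Year-month strings sort chronologically (assumed ordering).
        records.sort(key=lambda record: record[0])
        sequence = [count for _, count, _ in records]
        # Assumed labelling rule: churned if any month is flagged.
        label = max(churned for _, _, churned in records)
        result[customer] = (sequence, label)
    return result

sequences = build_sequences(rows)
for customer, (sequence, label) in sequences.items():
    print(customer, sequence, label)
```

Customers with a single month simply get a sequence of length one, so no imputation is needed for them in this layout.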

Some suggestions:

  • Also create more samples to balance your dataset, e.g. by using partial sequences of a customer as extra samples (id1: 39, 0 with label 0)
  • Possibly bin the variables, e.g. round to the nearest ten (6 -> 10, 39 -> 40)
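Both suggestions can be sketched in a few lines. This is one possible reading, not a definitive recipe: it assumes every strict prefix of a history gets label 0 (the customer was still active at that point) and that binning means rounding to the nearest ten, with halves rounded up. `prefix_samples` and `bin_to_ten` are hypothetical helper names.

```python
def prefix_samples(sequence, label):
    """Return (prefix, label) pairs built from one customer's sequence.

    Strict prefixes are labelled 0 (assumption: not yet churned);
    the full sequence keeps its original label.
    """
    samples = [(sequence[:end], 0) for end in range(1, len(sequence))]
    samples.append((list(sequence), label))
    return samples

def bin_to_ten(value):
    """Round a non-negative integer to the nearest ten, halves up."""
    return (value + 5) // 10 * 10

samples = prefix_samples([39, 0, 6, 88], 1)
binned = [([bin_to_ten(v) for v in seq], y) for seq, y in samples]
for seq, y in binned:
    print(seq, y)
```

Note that prefixes add samples of class 0 only, so whether this actually balances the data depends on which class is the minority in your dataset.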

Answered by 20roso on April 10, 2021
