TransWikia.com

Predicting credit applications with timeseries

Data Science Asked by Yatoom on September 16, 2020

I was wondering what the best way is to make a model for predicting credit applications.

I have two tables, which look like this:

   client_nr  yearmonth  total_nr_trx  nr_debit_trx  volume_debit_trx  ... etc.
1          1     201201            94            49           6527529   
2          1     201202            85            58           3475518   
3          1     201203            94            61          31317405   
4          1     201204            85            52          18869967   
5          1     201205            93            53           2893105   

  client_nr  yearmonth  credit_application  nr_credit_applications
1          1     201201                   0                       0
2          1     201202                   0                       0
3          1     201203                   0                       0
4          1     201204                   1                       1
5          1     201205                   0                       1

The goal is to determine which clients are likely to apply for credit. So far, I have built one sequence of shape (months, features) per client from the first table.

My questions:

  • Would it now be a good idea to create train/test folds based on the client_nr? Or should I make splits by month?
  • Should I then select the first $n$ months as features, and create one label from the $k$ months after $n$ that indicates whether the client has applied for credit in those $k$ months? Or is there a better way?
  • Would it be better to use regression using nr_credit_applications, or classification on credit_application?

One Answer

Intuitively, I think the model would need some additional features based on the economic context in order to be more accurate; that is also where the evolution across time really matters.

  • Would it now be a good idea to create train/test folds based on the client_nr? Or should I make splits by month?

If possible, each instance should be the full time series of one client, so splitting on the client number is much better. Select the client numbers randomly from the full set of client IDs, though, because the client numbers may have been assigned in chronological order.
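A minimal sketch of such a client-level split, using only the Python standard library. The function name `split_clients`, the `test_fraction` and `seed` parameters, and the toy rows are assumptions made for illustration; they are not part of the original question.

```python
import random

def split_clients(client_ids, test_fraction=0.2, seed=0):
    """Randomly assign whole clients (not individual months) to train or test."""
    unique_ids = sorted(set(client_ids))
    rng = random.Random(seed)
    # Shuffling breaks any link between client_nr order and time of onboarding.
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids

# Hypothetical rows of (client_nr, yearmonth): 10 clients, 3 months each.
rows = [(c, ym) for c in range(1, 11) for ym in (201201, 201202, 201203)]

train_ids, test_ids = split_clients([c for c, _ in rows], test_fraction=0.2, seed=42)

# Every month of a given client ends up on the same side of the split.
train_rows = [r for r in rows if r[0] in train_ids]
test_rows = [r for r in rows if r[0] in test_ids]
```

Because the split is made on client IDs rather than on rows, no client contributes months to both the training and the test set.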

  • Should I then select the first $n$ months as features, and create one label from the $k$ months after $n$ that indicates whether the client has applied for credit in those $k$ months? Or is there a better way?

That depends on the exact goal and on the kind of algorithm being used, but as far as I know the usual approach with a time series is to predict the label at each time step from the past and current features and the past labels (or past predicted labels).
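One possible way to turn this into training samples is a sliding window: at each month, take the last $n$ months as input and label the sample by whether an application occurs in the following $k$ months. The sketch below is only an illustration under that assumption; the helper `make_windows` is hypothetical, and the values are the two feature columns and the credit_application column shown for client 1 in the question.

```python
def make_windows(features, applied, n, k):
    """Build (window, label) pairs for one client.

    features: per-month feature vectors in chronological order
    applied:  per-month 0/1 flags (credit_application), same length
    n:        number of months in each input window
    k:        number of months after the window covered by the label
    """
    samples = []
    for end in range(n - 1, len(features) - k):
        window = features[end - n + 1:end + 1]          # months up to and including `end`
        label = int(any(applied[end + 1:end + 1 + k]))  # any application in the next k months
        samples.append((window, label))
    return samples

# Client 1 from the question: (total_nr_trx, nr_debit_trx) for 201201..201205.
features = [(94, 49), (85, 58), (94, 61), (85, 52), (93, 53)]
applied = [0, 0, 0, 1, 0]

samples = make_windows(features, applied, n=2, k=1)
labels = [label for _, label in samples]
```

With $n = 2$ and $k = 1$ this yields three samples for client 1, and only the window ending in 201203 gets a positive label. Past labels can be appended to each month's feature vector if the model should also see them.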

  • Would it be better to use regression using nr_credit_applications, or classification on credit_application?

This is mostly a matter of convenience for your application or for the algorithm being used, as the accuracy should be very similar (assuming you use the same kind of method, of course).

Answered by Erwan on September 16, 2020
