How do I prepare data in which each output row depends on multiple input rows?

Question

My goal is to predict the value of Y based on multiple values of X1 and X2 for each observation of Y.

In my example, I want to predict whether a customer will file for bankruptcy (table 1) based on limits and balances of their credit cards (table 2). The challenge is that customer 1 has two credit cards, whereas customer 2 has one credit card.

How do I map table 2 to table 1 in this scenario?

(Of course I can create summary statistics, but I do not want to bias the model.)

Table 1: Bankruptcy filing

customerId | Customer filed for bankruptcy (Y)
1            1
2            0

Table 2: Customer credit cards

customerId | creditCardId | Credit limit (X1) | Credit used (X2)
1            1              5                   5
1            2              9                   8
2            1              10                  1

lcrmorin · Answer

A simple, practical approach is to aggregate your data per customer. The idea is that how credit usage and credit limits are distributed across individual cards might not really matter for overall bankruptcy. You can then build new features to limit the loss of information: the number of maxed-out credit cards, the average interest rate on the cards. This is the general idea:
Table 1: Customer credit usage
customerId | nb_cc | maxed-out_cc | Overall credit limit | Overall credit used | bankruptcy
1            2       1              14                     13                    1
2            1       0              10                     1                     0
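
As a rough illustration, the aggregation above could be sketched with pandas as follows. The column names (`limit`, `used`, `maxed_out`, and so on) and the rule "maxed out means used >= limit" are assumptions made for this example; the data is copied from the question's two tables.

```python
import pandas as pd

# Data copied from the question's two tables.
customers = pd.DataFrame({"customerId": [1, 2], "bankruptcy": [1, 0]})
cards = pd.DataFrame({
    "customerId": [1, 1, 2],
    "creditCardId": [1, 2, 1],
    "limit": [5, 9, 10],
    "used": [5, 8, 1],
})

# Assumption: a card counts as "maxed out" when used >= limit.
cards["maxed_out"] = (cards["used"] >= cards["limit"]).astype(int)

# Collapse the card rows into one feature row per customer.
per_customer = (
    cards.groupby("customerId")
    .agg(
        nb_cc=("creditCardId", "count"),
        maxed_out_cc=("maxed_out", "sum"),
        overall_limit=("limit", "sum"),
        overall_used=("used", "sum"),
    )
    .reset_index()
)

# Attach the target so each row is one (features, Y) observation.
dataset = per_customer.merge(customers, on="customerId")
print(dataset)
```

Note that the default inner merge drops customers who have no cards at all; if those exist in your data, you would probably want a left join from `customers` and an explicit fill value for the missing features.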

The main alternative is to try to predict which credit card a client will default on. Depending on the law and procedure, you might want to build a target at the credit card level (when someone goes bankrupt, do you lose the overall credit used, or only the credit used on the card they failed to repay?). Unlike bankruptcy, this is not a well-defined target (i.e. one defined and enforced by law). This would be the general idea:
Table 2: Credit cards
cc_id | Limit | Used | nb_other_cc | other_limit | other_used |  bankruptcy (Y1) | cc_default (Y2)
1-1     5       5      1             9             8             1                 1 
1-2     9       8      1             5             5             1                 0
2-1     10      1      0             0             0             0                 0
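
A minimal sketch of how the card-level features above might be derived with pandas, assuming the same hypothetical column names (`limit`, `used`). Each card's "other" values are the customer's totals minus that card's own values. Only Y1 is attached here, because Y2 has to come from your own default records.

```python
import pandas as pd

# Data copied from the question's two tables.
customers = pd.DataFrame({"customerId": [1, 2], "bankruptcy": [1, 0]})
cards = pd.DataFrame({
    "customerId": [1, 1, 2],
    "creditCardId": [1, 2, 1],
    "limit": [5, 9, 10],
    "used": [5, 8, 1],
})

# Per-customer totals, broadcast back onto every card row.
grouped = cards.groupby("customerId")
totals = grouped[["limit", "used"]].transform("sum")
counts = grouped["creditCardId"].transform("count")

# "Other card" features = customer total minus this card's own value.
cards["nb_other_cc"] = counts - 1
cards["other_limit"] = totals["limit"] - cards["limit"]
cards["other_used"] = totals["used"] - cards["used"]

# Readable card key such as "1-2" (customer 1, card 2).
cards["cc_id"] = (
    cards["customerId"].astype(str) + "-" + cards["creditCardId"].astype(str)
)

# Attach the customer-level target Y1 to every card row.
card_level = cards.merge(customers, on="customerId", how="left")
print(card_level)
```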

In general, I don't think using Y1 here would be more beneficial than using it in the first approach. (It might even add a modelling problem, as it would overweight people with multiple credit cards.)
Defining and measuring Y2 might be difficult, and it might require further modelling to aggregate predictions at the client level (if customer 1 defaults on card 1, what does that mean for card 2?).
Note: since you didn't mention it in your question, I didn't address it, but generally speaking you'll have to deal with a time horizon. Basically, you might want multiple observations over the life of your products (say, one each year) and a time horizon for your prediction (say, bankruptcy over the next year). Even if that introduces some modelling problems, it might give you better business-oriented metrics (how much will we lose next year?).
