What's the best way to do classification based on two given datasets (annual data and daily data)?

Question

I want to do binary classification based on two given datasets. One is annual statistical data for each company and contains the label I need to predict:

company_id | year | annual sales | something else... | label
0          | 2017 |  2000320     |   ...             |   0
0          | 2018 |  4002530     |   ...             |   0
0          | 2019 |  800050      |   ...             |   1
1          | 2017 |  1024380     |   ...             |   1
1          | 2018 |  7085521     |   ...             |   0
1          | 2019 |  4525252     |   ...             |   0
2          | 2017 |  25258770    |   ...             |   0
2          | 2018 |  95402000    |   ...             |   1
2          | 2019 |  8605200     |   ...             |   0

The other dataset is daily statistical data for each company:

company_id | year | date(MM-dd) | daily sales  | something else...
0          | 2017 | 12-02       | 5210         |   ...
0          | 2017 | 12-03       | 3542         |   ...
0          | 2017 | 12-04       | 8575         |   ...
0          | 2017 | 12-06       | 1254         |   ...
0          | 2017 | ...         | ...          |   ...
0          | 2018 | 12-01       | 1352         |   ...
0          | 2018 | 12-02       | 4856         |   ...
0          | 2018 | ...         | ...          |   ...
0          | 2019 | 12-01       | 4583         |   ...
0          | 2019 | ...         | ...          |   ...
1          | 2017 | 12-01       | 5210         |   ...
1          | 2017 | ...         | ...          |   ...
1          | 2018 | 12-01       | 5202         |   ...
1          | 2018 | ...         | ...          |   ...
1          | 2019 | 12-01       | 8675         |   ...
1          | 2019 | ...         | ...          |   ...

I am wondering what the best way is to make full use of both datasets to predict the label of each company.

Alternatively, is there a related topic I could refer to? I am willing to do some searching on it myself.

I am considering left-joining the annual dataset onto the daily dataset, but then many rows would share the same values in the annual features, and the size of the dataset would grow dramatically.

kkz · Answer

Since the daily dataset does not contain labels, you could aggregate the daily data to the annual level (one row per company and year) and then join it to the annual dataset. It sounds like a (binary) classification problem, which can be handled with methods such as logistic regression. You will, however, have to handle the missing values introduced by the left join. One option is to impute them. Another is to do an inner join instead, provided the data is missing at random without patterns (e.g. companies of a certain type are not missing data more often than other types) and enough rows without missing data remain.
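
As a rough illustration of the steps above, here is a minimal sketch using pandas and scikit-learn. The column names (annual_sales, daily_sales, date), the choice of summary statistics (mean, standard deviation, max, row count), and the median imputation are all assumptions made for the example, not something given in the question; the toy values are copied from the visible rows of the tables, with the elided rows left out.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data shaped like the tables in the question (column names are assumed).
annual = pd.DataFrame({
    "company_id": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "year": [2017, 2018, 2019] * 3,
    "annual_sales": [2000320, 4002530, 800050, 1024380, 7085521,
                     4525252, 25258770, 95402000, 8605200],
    "label": [0, 0, 1, 1, 0, 0, 0, 1, 0],
})
daily = pd.DataFrame({
    "company_id": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    "year": [2017, 2017, 2017, 2017, 2018, 2018, 2019, 2017, 2018, 2019],
    "date": ["12-02", "12-03", "12-04", "12-06", "12-01",
             "12-02", "12-01", "12-01", "12-01", "12-01"],
    "daily_sales": [5210, 3542, 8575, 1254, 1352, 4856, 4583, 5210, 5202, 8675],
})

# 1. Collapse the daily rows to one row per (company_id, year).
#    The std is NaN for groups with a single daily row.
daily_agg = (
    daily.groupby(["company_id", "year"])["daily_sales"]
    .agg(daily_mean="mean", daily_std="std", daily_max="max", n_days="count")
    .reset_index()
)

# 2. Left join keeps every labelled annual row. Company 2 has no daily
#    rows in this toy data, so its daily features come back as NaN.
merged = annual.merge(
    daily_agg, on=["company_id", "year"], how="left", validate="one_to_one"
)
merged["n_days"] = merged["n_days"].fillna(0)  # no daily rows means zero days

feature_cols = ["annual_sales", "daily_mean", "daily_std", "daily_max", "n_days"]
X, y = merged[feature_cols], merged["label"]

# 3. Impute the remaining gaps, scale, and fit a logistic regression.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X, y)

print(merged.shape)      # one row per annual record, so no row blow-up
print(model.predict(X))  # in-sample predictions, for illustration only
```

Because the daily data is aggregated before the join, the merged table has exactly as many rows as the annual table. Switching to how="inner" would instead drop the company-years that have no daily data, which is the alternative described above. Fitting and predicting on the same rows is done here only to keep the sketch short; a real evaluation would need held-out data.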

Answered by kkz on February 17, 2021
