Data Science: Asked by jfiedler on January 13, 2021
I’m concerned with a supervised classification problem for the following type of data. The data consists of $N$ rows (where $N$ is not very large; this is not a big-data problem) and $M$ columns (features), and each row carries a label I’m interested in. Every row belongs to a block, and each block consists of 1-50 rows. The size of a block depends on its duration, so the rows within a block are correlated, while the correlation between different blocks can be neglected.
The aim is to learn a classification algorithm on the data that allows me to classify a new row or a new block.
Two things are important here: the labels are constant within each block, and some features make it possible to identify individual blocks.
My question is: What might be the best way to learn a model on this specific type of data?
To illustrate the problem a bit further, let me describe some issues I encountered when working with Random Forests on this data set.
Suppose I take part of the data as validation data, consisting of whole blocks. If I split the remaining rows randomly into training and test data, the accuracy of the learned Random Forest is very high on the training and test data but very low on the validation data. This is because the Random Forest learns to identify the individual blocks, and since the validation data contains unseen blocks, the accuracy drops. This can also be seen in the resulting feature importances: the most important features are exactly those that make a block easy to identify.
Splitting the remaining data into training and test data while keeping the blocks whole helps a bit, but doesn't fix the problem completely.
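For concreteness, keeping the blocks whole in the split could be done with sklearn's GroupShuffleSplit; here is a minimal sketch (column names like block_id and label are placeholders for my actual data):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# df = pd.read_csv("data.csv")  # hypothetical source; columns include "block_id" and "label"
X = df.drop(columns=["label", "block_id"])
y = df["label"]
groups = df["block_id"]

# GroupShuffleSplit keeps every row of a block on the same side of the split,
# so the test set only contains blocks the model has never seen.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```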
Another approach would be to remove all features that allow the identification of blocks. But this is difficult to do a priori, and some of these features could carry important information for my problem.
A simple way to overcome these problems would be to take the row mean within each block and use the resulting data for the classification problem. However, this loses a lot of information.
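For illustration, that block-mean aggregation could be done with a pandas groupby (again with placeholder column names, and assuming the features are numeric):

```python
import pandas as pd

feature_cols = [c for c in df.columns if c not in ("block_id", "label")]

# Collapse every block to a single row of feature means; the label is constant
# within a block, so taking the first label per block is safe.
block_means = df.groupby("block_id")[feature_cols].mean()
block_means["label"] = df.groupby("block_id")["label"].first()
block_means = block_means.reset_index()
```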
So I’m wondering whether there are more natural ways to approach this classification problem that respect the block structure of the data.
My main point is this: I think your approach is more or less valid, but you're overfitting the model because you don't have much data.
There are many ways in principle to deal with this, but if you have little data, perhaps you simply shouldn't expect great results. If you use the averages you get better results, but (as you put it) the problem isn't solved completely. A priori I'm not sure you have reason to think you can do much better.
Anyway, step 1 should be to simplify the model (fewer and shallower trees). You should find that your test error gets worse but your validation error improves, because rather than simply memorizing records, the model is now forced to learn something about the blocks. That information can be reused on records the model hasn't seen yet.
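As a rough illustration (the hyperparameter values below are just starting points, not a recipe, and X_train / X_val etc. are assumed to come from a block-aware split like the one described in the question):

```python
from sklearn.ensemble import RandomForestClassifier

# A deliberately constrained forest: fewer, shallower trees and a minimum leaf
# size make it harder to memorize individual rows or blocks.
rf = RandomForestClassifier(
    n_estimators=50,      # fewer trees
    max_depth=4,          # shallow trees
    min_samples_leaf=10,  # every leaf must cover several rows
    random_state=0,
)
rf.fit(X_train, y_train)
print("test accuracy:      ", rf.score(X_test, y_test))
print("validation accuracy:", rf.score(X_val, y_val))
```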
You can also look at regularization, but that is essentially an automatic way of finding a simpler yet better model. Also consider data augmentation: is it possible to generate more data by varying the existing data a bit, the way you can generate more images from the ones you have by cropping, flipping, rotating, and so on?
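For tabular data, one simple form of augmentation is to add small Gaussian noise to the continuous features of the training rows; whether that makes sense depends entirely on your data, so treat the following as a sketch only:

```python
import numpy as np
import pandas as pd

def jitter(train_df, feature_cols, scale=0.05, n_copies=2, seed=0):
    """Return the training data plus noisy copies of it.

    Adds Gaussian noise proportional to each feature's standard deviation;
    labels and block ids are copied unchanged. Only sensible for continuous features.
    """
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        noisy = train_df.copy()
        for c in feature_cols:
            noisy[c] = noisy[c] + rng.normal(0.0, scale * train_df[c].std(), size=len(train_df))
        copies.append(noisy)
    return pd.concat([train_df] + copies, ignore_index=True)
```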
Finally, make sure that the split into train, test, and validation sets is random. It depends on your data, but you can't expect the model to learn anything about a block that doesn't appear in the training set at all; on a block that appears only in the validation set, I'd expect the model to do no better than chance.
Answered by Paul on January 13, 2021
Suppose I take part of the data as validation data, consisting of whole blocks. If I split the remaining rows randomly into training and test data, the accuracy of the learned Random Forest is very high on the training and test data but very low on the validation data. This is because the Random Forest learns to identify the individual blocks, and since the validation data contains unseen blocks, the accuracy drops. This can also be seen in the resulting feature importances: the most important features are exactly those that make a block easy to identify.
Splitting the remaining data into training and test data while keeping the blocks whole helps a bit, but doesn't fix the problem completely.
The latter approach is better; at least your test set will be more representative of your desired use case, and so the scores will be more relevant. sklearn has this built in with GroupKFold.
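A minimal sketch of group-aware cross-validation with GroupKFold, assuming a DataFrame df with a block_id column (the names here are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

X = df.drop(columns=["label", "block_id"])
y = df["label"]

# Every fold holds out whole blocks, so each validation fold consists only of
# blocks that were never seen during training for that fold.
scores = cross_val_score(
    RandomForestClassifier(random_state=0),
    X, y,
    groups=df["block_id"],
    cv=GroupKFold(n_splits=5),
)
print(scores.mean(), scores.std())
```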
Another approach would be to remove all features that allow the identification of blocks. But this is difficult to do a priori, and some of these features could carry important information for my problem.
Indeed, this seems problematic. There's interesting related work, "domain adaptive neural networks," which essentially try to simultaneously learn the predictive trends while unlearning the block-specific information; but I'm not sure how relevant that is here, or whether there are similar non-NN approaches.
A simple way to overcome these problems would be to take the row mean within each block and use the resulting data for the classification problem. However, this loses a lot of information.
You could try to extract other relevant features from the blocks. This could work especially well if you have some domain knowledge to guide the feature engineering.
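For example, instead of only the per-block mean you could summarise each block with several statistics; a sketch with placeholder column names:

```python
import pandas as pd

feature_cols = [c for c in df.columns if c not in ("block_id", "label")]

# Several summary statistics per block instead of just the mean, so less
# within-block information is thrown away.
block_features = df.groupby("block_id")[feature_cols].agg(["mean", "std", "min", "max"])
block_features.columns = ["_".join(col) for col in block_features.columns]  # flatten MultiIndex
block_features = block_features.fillna(0.0)  # std is NaN for single-row blocks
block_features["n_rows"] = df.groupby("block_id").size()  # block length itself may be informative
block_labels = df.groupby("block_id")["label"].first()
```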
In general, if each block is associated with a single label (rather than each row having its own label), you're dealing with "multiple-instance learning." How to handle it depends on the specifics of how the blocks are generated. The Wikipedia page, especially the Assumptions and Algorithms sections, is a good place to start.
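One very simple instance-level baseline in that spirit (not a full MIL algorithm): train on individual rows as before, then aggregate the row-level predictions into a block-level prediction by majority vote. The names rf, X_new and blocks_new below are placeholders for a fitted row-level classifier and the rows and block ids of unseen blocks:

```python
import pandas as pd

row_pred = rf.predict(X_new)  # one prediction per row

# Majority vote of the row predictions within each block gives a block-level label.
block_pred = (
    pd.Series(row_pred, index=blocks_new)
      .groupby(level=0)
      .agg(lambda s: s.mode().iloc[0])
)
```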
Answered by Ben Reiniger on January 13, 2021