
How can I perform stratified sampling for multi-label multi-class classification?

Data Science Asked by Divyanshu Shekhar on December 23, 2020

I am asking this question for a few reasons:

  • The dataset in hand is imbalanced
  • I used below code

    x = dataset[['Message']]
    y = dataset[['Label1', 'Label2']]
    train_data, test_data = train_test_split(x, test_size=0.1, stratify=y, random_state=73)
    

    but the error message that I am getting is: `The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.` I removed those classes where the class count is < 2 in each individual label, but the error still occurs.

I am not sure why this error is popping up.

So, I am thinking of implementing the stratified sampling myself. I need help both in deciphering the cause of the problem and in implementing stratified sampling for multi-label classification, so that it also works well for the individual batches during training.

3 Answers

The error here seems to occur because you want train and test data (so two data sets), meaning that each class must be present in each of the data sets. This requires each class to have at least two samples. It is a design choice of whoever implemented train_test_split; I guess the split might not technically be stratified otherwise.

You can see where the check is implemented in the scikit-learn source code, within the class StratifiedShuffleSplit:

classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]

class_counts = np.bincount(y_indices)

if np.min(class_counts) < 2:
    raise ValueError("The least populated class in y has only 1"
                     " member, which is too few. The minimum"
                     " number of groups for any class cannot"
                     " be less than 2.")

np.unique finds the unique classes in y. Because the option return_inverse=True is passed, it also returns an array of indices that allows full reconstruction of the input array y. Counting those indices with np.bincount then gives the number of samples in each class, stored in class_counts.

The final check is whether the smallest entry of class_counts is less than the number of data sets you want to create (here two: train and test). If it is, then you cannot create a properly stratified split of your data, so you get an error.
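A minimal reproduction of that check, with made-up toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 1, 2])  # class 2 has only one member

# Stratifying on y trips the check above: the singleton class cannot
# appear in both the train and the test split.
try:
    train_test_split(X, test_size=0.4, stratify=y, random_state=0)
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
```

Removing or merging the singleton class before splitting makes the same call succeed.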


As to how you might create your own version: one way I implemented stratified sampling was to use histograms, more specifically NumPy's histogram function. It worked well for continuous labels (i.e. not discrete classes), and I was not looking at a multi-label problem, so you might have to adjust my suggestion to accommodate your needs.

The main idea is to split the labels into bins of a histogram and then randomly sample from those bins, with the option to allow for duplicates. That is really the part that will solve your specific problem of < 2 labels in a class. I realise this doesn't specifically answer your problem, but perhaps it will give you some new ideas.

If duplicates don't make sense or are strictly not allowed in your experiment, then you could think about merging the smaller classes together in some way, so that each class has at least 2 samples. This might be more useful than deleting them, but whether it is feasible will depend on your data.
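A minimal sketch of the histogram idea; the label distribution, number of bins, and samples per bin are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy continuous labels; n_bins and n_per_bin are illustrative choices.
y = rng.normal(size=200)
n_bins = 10
n_per_bin = 5

# Build histogram bin edges from the labels, then assign each sample
# to a bin (interior edges only, so bin ids run 0 .. n_bins - 1).
edges = np.histogram_bin_edges(y, bins=n_bins)
bin_ids = np.digitize(y, edges[1:-1])

# Sample the same number of indices from every non-empty bin, with
# replacement so that sparsely populated bins still contribute.
sampled = np.concatenate([
    rng.choice(np.flatnonzero(bin_ids == b), size=n_per_bin, replace=True)
    for b in range(n_bins)
    if np.any(bin_ids == b)
])
```

Sampling with replacement is what sidesteps the "fewer than 2 samples in a bin" problem; without it, the loop would have to skip or merge sparse bins.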

Answered by n1k31t4 on December 23, 2020

You can use multi-label data stratification from the scikit-multilearn (skmultilearn) library.

Answered by Hamada Zahera on December 23, 2020

This is because of the nature of stratification. The stratify parameter tells the splitter to allocate a test_size share of every class to the test set. In this case, you don't have enough samples of one (or more) of your classes to keep the splitting ratio for that class equal to test_size.
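One workaround consistent with this explanation is to stratify on a composite key built from both label columns, after dropping (or merging) label pairs that occur fewer than twice. A hedged sketch with made-up data, reusing the column names from the question:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame mimicking the question's columns; the values are made up.
dataset = pd.DataFrame({
    "Message": [f"msg{i}" for i in range(8)],
    "Label1": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "Label2": ["x", "x", "y", "y", "x", "x", "y", "y"],
})

# One composite class per (Label1, Label2) pair.
key = dataset["Label1"] + "_" + dataset["Label2"]

# Drop pairs with fewer than 2 samples, or the same error returns.
counts = key.value_counts()
keep = key.map(counts) >= 2
dataset, key = dataset[keep], key[keep]

train_data, test_data = train_test_split(
    dataset, test_size=0.5, stratify=key, random_state=73
)
```

Treating each label combination as one class is only feasible when the number of combinations stays small relative to the data; otherwise an iterative-stratification approach scales better.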

Answered by Shayan Amani on December 23, 2020
