TransWikia.com

train_test_split ValueError: Input contains NaN

Data Science Asked by cyanide on May 21, 2021

I tried to do stratified sampling with train_test_split in order to save myself some trouble later. So I wrote the following lines:


from sklearn.model_selection import train_test_split

X = data_df
y = data_df.pop('class')  # pop() removes 'class' from data_df in place, so X holds only the features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.125, stratify=y)

I got the error:

ValueError: Input contains NaN

Any help is welcome!

2 Answers

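Check whether you have any null or NaN values. These show the rows of X that contain at least one missing value, and the missing entries of y:

X[X.isnull().any(axis=1)]
y[y.isnull()]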
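For a quick overview, counting the missing values is often easier to read than printing rows. A minimal sketch on made-up data (the columns a and b and the values are only for illustration, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the asker's X and y.
X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})
y = pd.Series([0, 1, np.nan], name="class")

# Number of missing values in each feature column.
nan_per_column = X.isna().sum()

# Rows of X that contain at least one missing value.
rows_with_nan = X[X.isna().any(axis=1)]

# Number of missing labels in the target.
n_missing_target = int(y.isna().sum())

print(nan_per_column)
print(rows_with_nan)
print(n_missing_target)
```

If the count for y is nonzero, that matters most when you pass stratify=y.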

Then you have to decide what to do with those NaN values. One common option is to forward fill in place:

# ffill() does the same as fillna(method='ffill'), which is deprecated in pandas 2.1+
X.ffill(inplace=True)
y.ffill(inplace=True)

Answered by rigo on May 21, 2021

(Upgrading comment to answer.)

This error message is generally pretty straightforward: you have missing values (typically one of np.nan, pd.NA, None), and whatever method you're trying to use cannot handle them.

Now, train_test_split doesn't usually care about missing values: it's just splitting up the rows, so why should it care what values are in there? But in this case you're asking it to stratify on y (so that the train and test sets have the same proportion of each class in y), and so it does care about the values in y. So the error occurs because you have missing values in y.
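To see the difference for yourself, here is a small sketch on made-up data (the feature column and the labels are hypothetical). The plain split should go through, while the stratified one should raise a ValueError; the exact wording of the message varies between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one missing label in y.
X = pd.DataFrame({"feature": np.arange(8, dtype=float)})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, np.nan])

# Plain split: rows are only shuffled and partitioned, so y is not inspected.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stratified split: y is validated first, because its values define the strata.
try:
    train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
    stratified_error = None
except ValueError as exc:
    stratified_error = str(exc)

print(stratified_error)
```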

Missing the target variable is problematic. The best thing to do is probably to drop those rows, unless there's some additional context (e.g. if your data is time-series, maybe you can impute based on the adjacent rows).
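Dropping those rows before the split could look roughly like this. The toy data_df is a stand-in for yours, and random_state is only there to make the sketch reproducible; the one assumption carried over from the question is that the target column is called 'class'.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data_df: 16 labelled rows plus 2 with a missing target.
data_df = pd.DataFrame({
    "feature": np.arange(18, dtype=float),
    "class": [0, 1] * 8 + [np.nan, np.nan],
})

# Keep only the rows whose target is present.
clean_df = data_df.dropna(subset=["class"])

X = clean_df.drop(columns="class")
y = clean_df["class"]

# With no NaN left in y, stratifying on it works.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.125, stratify=y, random_state=0
)

print(len(X_train), len(X_test))
```

Using dropna(subset=['class']) only removes rows where the target is missing; NaN values in the feature columns are left alone and can be handled separately.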

Answered by Ben Reiniger on May 21, 2021
