Data Science Asked by cyanide on May 21, 2021
I tried to do stratified sampling with train_test_split to save myself some trouble later, so I wrote the following lines:
from sklearn.model_selection import train_test_split

# pop() removes 'class' from data_df in place, so X ends up holding only the features
y = data_df.pop('class')
X = data_df
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.125, stratify=y)
I got the error:
ValueError: Input contains NaN
Any help is welcome!
Check whether you have any null or NaN values, both in the features and in the target:
X[X.isnull().any(axis=1)]
y[y.isnull()]
Then you have to decide what to do with those NaN values. One common option, which makes most sense when the row order is meaningful, is to forward fill in place:
X.ffill(inplace=True)
y.ffill(inplace=True)
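As a rough illustration of the check and the forward fill, here is a minimal sketch on a made-up frame (the column names and values are hypothetical, not the asker's data). It also shows one caveat: a forward fill has nothing to copy into a missing value in the first row, so that one stays NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: column "a" has two missing values, one in the first row.
X = pd.DataFrame({"a": [np.nan, 1.0, np.nan, 3.0],
                  "b": [1.0, 2.0, 3.0, 4.0]})

# Count missing values per column.
missing_per_column = X.isna().sum()

# Select only the rows that contain at least one missing value.
rows_with_missing = X[X.isna().any(axis=1)]

# Forward fill: each NaN takes the last valid value above it in the same column.
X_filled = X.ffill()

# A leading NaN has no earlier value to copy, so it is still missing afterwards.
remaining_missing = int(X_filled.isna().sum().sum())

print(missing_per_column)
print(rows_with_missing)
print(X_filled)
print("still missing after ffill:", remaining_missing)
```

If any values are still missing after the fill, the stratified split would be expected to fail in the same way when they are in y.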
Answered by rigo on May 21, 2021
(Upgrading comment to answer.)
This error message is generally pretty straightforward: you have missing values (generally one of np.nan, pd.NA, None), and whatever method you're trying to use cannot handle that.
Now train_test_split doesn't usually care about missing values: it's just splitting up the rows, so why should it care what values are in there? But in this case you're asking to stratify on y (making the train/test split have the same proportion of each class in y), and so it does care about the values in y. So the error is because you have missing values in y.
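A small sketch of that difference, using made-up data (the frame, the labels, test_size and random_state are all assumptions chosen for illustration): the plain split goes through even though one label is missing, while the stratified one raises a ValueError.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 rows, two classes, one missing label.
X = pd.DataFrame({"feature": np.arange(16, dtype=float)})
y = pd.Series([0.0, 1.0] * 7 + [np.nan, 1.0], name="class")

# Plain split: the rows are shuffled and divided without looking at the values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
plain_split_ok = len(X_train) + len(X_test) == len(X)

# Stratified split: the labels are validated first, so the NaN is rejected.
try:
    train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
    stratified_error = None
except ValueError as exc:
    stratified_error = str(exc)

print("plain split succeeded:", plain_split_ok)
print("stratified split error:", stratified_error)
```

The exact wording of the message varies between scikit-learn versions, but it should mention NaN.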
Missing the target variable is problematic. The best thing to do is probably to drop those rows, unless there's some additional context (e.g. if your data is time-series, maybe you can impute based on the adjacent rows).
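One way to do the dropping, sketched on a made-up frame (the data, test_size and random_state are illustrative assumptions; only the 'class' column name comes from the question): remove the rows whose target is missing before separating X and y, then stratify as before.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data_df: 20 rows, one of them with a missing 'class'.
data_df = pd.DataFrame({
    "feature": np.arange(20, dtype=float),
    "class": [0, 1] * 9 + [np.nan, 1],
})

# Drop only the rows where the target is missing; features are left untouched.
data_df = data_df.dropna(subset=["class"])

# Separate target and features, then stratify on the now-complete target.
y = data_df.pop("class")
X = data_df
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(len(X_train), len(X_test))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Because the rows are dropped before the pop, X and y stay aligned on the same index.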
Answered by Ben Reiniger on May 21, 2021