Data Science Asked on February 26, 2021
I got a ValueError when predicting on test data using a RandomForest model.
My code:
clf = RandomForestClassifier(n_estimators=10, max_depth=6, n_jobs=1, verbose=2)
clf.fit(X_fit, y_fit)
df_test.fillna(df_test.mean())
X_test = df_test.values
y_pred = clf.predict(X_test)
The error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
How do I find the bad values in the test dataset? Also, I do not want to drop these records; can I just replace them with the mean or median?
Thanks.
With np.isnan(X) you get a boolean mask back with True for positions containing NaNs.
With np.where(np.isnan(X)) you get back a tuple with the i, j coordinates of the NaNs.
Finally, with np.nan_to_num(X) you "replace NaN with zero and inf with finite numbers".
Alternatively, you can use pd.DataFrame(X).fillna(value) if you need something other than filling with zeros.
Correct answer by fernando on February 26, 2021
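For illustration, a minimal sketch with made-up values showing what each of these calls does:
import numpy as np
import pandas as pd

# Toy array with a NaN, purely for illustration
X = np.array([[1.0, np.nan], [3.0, 4.0]])

mask = np.isnan(X)                             # boolean mask, True where NaN
rows, cols = np.where(mask)                    # i, j coordinates of the NaNs
X_zeroed = np.nan_to_num(X)                    # NaN -> 0.0, inf -> large finite numbers
X_filled = pd.DataFrame(X).fillna(-1).values   # fill with something other than zero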
Assuming X_test is a pandas DataFrame, you can use DataFrame.fillna to replace the NaN values with the mean:
X_test.fillna(X_test.mean())
Answered by kmandov on February 26, 2021
For anybody happening across this: fillna returns a new DataFrame unless you keep the result. To modify the original in place:
X_test.fillna(X_train.mean(), inplace=True)
Or to overwrite the original by reassigning:
X_test = X_test.fillna(X_train.mean())
To check whether you're working with a copy or a view:
X_test._is_view
Answered by CommonSurname on February 26, 2021
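As a sketch of that pattern, with hypothetical X_train and X_test frames (the fill uses the training-set mean, which is generally what you want for test data):
import numpy as np
import pandas as pd

# Hypothetical train/test frames, just for illustration
X_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
X_test = pd.DataFrame({"a": [np.nan, 4.0], "b": [40.0, np.nan]})

# Fill the test-set gaps with the training-set means and keep the result
X_test = X_test.fillna(X_train.mean())
print(X_test)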
I faced a similar problem and saw that numpy handles NaN and Inf differently.
In case your data has Inf, try this:
np.where(x.values >= np.finfo(np.float64).max)
where x is the pandas DataFrame. This gives back a tuple with the locations of the infinite (too-large) values.
In case your data has NaN, try this:
np.isnan(x.values).any()
Answered by Prakash Vanapalli on February 26, 2021
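For illustration, both checks on a toy DataFrame with invented values:
import numpy as np
import pandas as pd

x = pd.DataFrame({"a": [1.0, np.inf], "b": [np.nan, 2.0]})

inf_locs = np.where(x.values >= np.finfo(np.float64).max)  # locations of infinite values
has_nan = np.isnan(x.values).any()                         # True if any NaN is present
print(inf_locs, has_nan)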
Don't forget
col_mask = df.isnull().any(axis=0)
which returns a boolean mask indicating which columns contain np.nan values, and
row_mask = df.isnull().any(axis=1)
which returns a boolean mask indicating which rows contain np.nan. Then by simple indexing you can inspect all of the points that are np.nan:
df.loc[row_mask, col_mask]
Answered by bmc on February 26, 2021
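As a sketch on an invented frame, this narrows the view down to just the rows and columns that actually contain missing values:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [4.0, 5.0, 6.0],
                   "c": [np.nan, 8.0, 9.0]})

col_mask = df.isnull().any(axis=0)  # columns a and c contain NaN
row_mask = df.isnull().any(axis=1)  # rows 0 and 1 contain NaN
print(df.loc[row_mask, col_mask])   # only the affected rows and columns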
Do not forget to check for inf values as well. The only thing that worked for me:
df[df == np.inf] = np.nan
df.fillna(df.mean(), inplace=True)
And even better if you are using sklearn (SimpleImputer replaces the older Imputer in recent versions):
from sklearn.impute import SimpleImputer

def replace_missing_value(df, number_features):
    imputer = SimpleImputer(strategy="median")
    df_num = df[number_features]
    imputer.fit(df_num)
    X = imputer.transform(df_num)
    res_def = pd.DataFrame(X, columns=df_num.columns)
    return res_def

where number_features is a list of the numerical column labels, for example:
number_features = ['median_income', 'gdp']
Answered by Kohn1001 on February 26, 2021
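A hypothetical usage sketch, assuming the replace_missing_value function above and invented column values:
import numpy as np
import pandas as pd

df = pd.DataFrame({"median_income": [1.5, np.nan, 3.2],
                   "gdp": [2.1, 4.4, np.nan]})

df_clean = replace_missing_value(df, ['median_income', 'gdp'])
print(df_clean)  # NaNs replaced by each column's median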
Here is how to "replace NaN with zero and infinity with large finite numbers" using numpy.nan_to_num:
df[:] = np.nan_to_num(df)
Also see fernando's answer.
Answered by Domi W on February 26, 2021
In most cases, getting rid of infinite and null values solves this problem.
Get rid of infinite values:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
Then get rid of null values however you like: a specific value such as 999, the mean, or your own imputation function:
df.fillna(999, inplace=True)
or
df.fillna(df.mean(), inplace=True)
Answered by Natheer Alabsi on February 26, 2021
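Putting the two steps together on a hypothetical df_test (remember that fillna only takes effect if you reassign the result or pass inplace=True):
import numpy as np
import pandas as pd

# Hypothetical test frame containing both kinds of bad values
df_test = pd.DataFrame({"a": [1.0, np.inf, np.nan], "b": [2.0, 3.0, 4.0]})

df_test.replace([np.inf, -np.inf], np.nan, inplace=True)
df_test.fillna(df_test.mean(), inplace=True)

X_test = df_test.values  # now safe to pass to clf.predict(X_test)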
If your values are larger than float32 can represent, try running a scaler first. It'd be rather unusual for the data to span more than the float32 range.
Answered by Piotr Rarus on February 26, 2021
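As a sketch of that idea, assuming scikit-learn's StandardScaler and invented values that exceed the float32 range; fit the scaler on the training data and reuse it on the test data:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented data whose magnitudes exceed float32's maximum (~3.4e38)
X_fit = np.array([[1e40, 2.0], [3e40, 4.0]])
X_test = np.array([[2e40, 3.0]])

scaler = StandardScaler()
X_fit_scaled = scaler.fit_transform(X_fit)  # fit on training data only
X_test_scaled = scaler.transform(X_test)    # apply the same scaling to test data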
You can list the columns that contain NaN with
df.isnull().sum()
and then fill these NaN values in your dataset file (CSV or Excel).
Answered by Busra Dogan on February 26, 2021