ValueError: Input contains NaN, infinity or a value too large for dtype('float32') on predicting case (similar to titanic predicting)

Question

I am still new to Python and Jupyter Notebook.
I would like to ask how to solve the error "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')".

First, I train several models with this code:

import pandas as pd  # needed for the scores DataFrame further down

from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, roc_curve, auc, confusion_matrix
from xgboost import XGBClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

dec = DecisionTreeClassifier()
ran = RandomForestClassifier(n_estimators=100)
ran2 = RandomForestClassifier(criterion='gini',
                              n_estimators=1750,
                              max_depth=7,
                              min_samples_split=6,
                              min_samples_leaf=6,
                              max_features='sqrt',  # same as 'auto' for classifiers; 'auto' was removed in newer scikit-learn
                              oob_score=True,
                              random_state=42,
                              n_jobs=-1,
                              verbose=1)
# n_neighbors must match the sample size: do not write 100 if there are only 83 samples
knn = KNeighborsClassifier(n_neighbors=50)
sgd = SGDClassifier(max_iter=1000, tol=1e-3)
xgb = XGBClassifier()
naive = GaussianNB()
log = LogisticRegression(random_state=0)
svc_lin = SVC(kernel='linear', random_state=0)
svc_rbf = SVC(kernel='rbf', random_state=0)

models = {"Decision tree" : dec,
          "Random forest" : ran,
          "Random forest Tuning" : ran2,
          "KNN" : knn,
          "SGD" : sgd,
          "Gaussian Naive bayes" : naive,
          "XGBoost" : xgb,
          "Logistic Regression" : log,
          "Linear Classifier" : svc_lin,
          "RBF Classifier" : svc_rbf}
scores = {}

# Fit every model on the training split and record its test accuracy
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = model.score(x_test, y_test)

Then I look at the accuracy scores with:

scores_frame1 = pd.DataFrame(scores, index=["Accuracy Score"]).T
scores_frame1.sort_values(by=["Accuracy Score"], axis=0 ,ascending=False, inplace=True)
scores_frame1

The decision tree had the highest accuracy score, so I used it to make predictions on my own dataset:

preds = dec.predict(df) 
preds

However, I got this error:

> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-34-4034b7c264f2> in <module>
> ----> 1 preds = dec.predict(hasil1_1)
>       2 preds
> 
> ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in predict(self, X, check_input)
>     425         """
>     426         check_is_fitted(self)
> --> 427         X = self._validate_X_predict(X, check_input)
>     428         proba = self.tree_.predict(X)
>     429         n_samples = X.shape[0]
> 
> ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
>     386         """Validate X whenever one tries to predict, apply, predict_proba"""
>     387         if check_input:
> --> 388             X = check_array(X, dtype=DTYPE, accept_sparse="csr")
>     389             if issparse(X) and (X.indices.dtype != np.intc or
>     390                                 X.indptr.dtype != np.intc):
> 
> ~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
>      71                           FutureWarning)
>      72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
> ---> 73         return f(**kwargs)
>      74     return inner_f
>      75 
> 
> ~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
>     643 
>     644         if force_all_finite:
> --> 645             _assert_all_finite(array,
>     646                                allow_nan=force_all_finite == 'allow-nan')
>     647 
> 
> ~\anaconda3\lib\site-packages\sklearn\utils\validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
>      95                 not allow_nan and not np.isfinite(X).all()):
>      96             type_err = 'infinity' if allow_nan else 'NaN, infinity'
> ---> 97             raise ValueError(
>      98                     msg_err.format
>      99                     (type_err,
> 
> ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

The data I want to predict on is only 110 of 1278 rows in total. However, if I split the data into two datasets, rows 1-685 and rows 686-1278, prediction runs without error on each.
Does having too much data cause this error?
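
The error message itself refers to the values in the input rather than to the number of rows, so one thing that could be checked is whether the frame passed to predict contains NaN, infinite, or very large entries. Below is a minimal sketch of such a check, assuming the prediction data is a pandas DataFrame. The helper name `find_non_finite` and the small `sample` frame are hypothetical and only stand in for the real data (`hasil1_1` in the traceback).

```python
import numpy as np
import pandas as pd


def find_non_finite(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `frame` holding NaN, +/-inf, or values beyond float32 range.

    Hypothetical helper for illustration; only numeric columns are inspected.
    """
    numeric = frame.select_dtypes(include=[np.number])
    values = numeric.to_numpy(dtype="float64", na_value=np.nan)
    # np.isfinite is False for NaN, inf and -inf
    not_finite = ~np.isfinite(values)
    # Tree models in scikit-learn convert input to float32, so a finite
    # float64 value above the float32 maximum would also be rejected
    too_large = np.abs(values) > np.finfo(np.float32).max
    bad_mask = not_finite | too_large
    return frame.loc[bad_mask.any(axis=1)]


# Hypothetical stand-in for the real prediction data
sample = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "fare": [7.25, 71.28, np.inf, 8.05],
})

# Rows at index 1 (NaN in "age") and index 2 (inf in "fare") are flagged
bad_rows = find_non_finite(sample)
print(bad_rows)
```

If rows are flagged on the real data, they would need to be dropped or imputed before calling predict. Which of the two is appropriate depends on the dataset, so that choice is left open here.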

Any help would be appreciated.

predictive modeling scikit learn
