PySpark Crossvalidation error

Data Science Asked by Robse on November 13, 2021

I want to do a very simple cross validation using LogisticRegression.
Here is my code:

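from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

logreg = LogisticRegression(labelCol = "churn", featuresCol = "features")

pipeline = Pipeline(stages = [logreg])
paramGrid = ParamGridBuilder().addGrid(logreg.regParam, [.1, .01]).build()

crossval = CrossValidator(
    estimator = pipeline,
    estimatorParamMaps = paramGrid,
    evaluator = BinaryClassificationEvaluator(),
    numFolds = 2)

bestLogReg = crossval.fit(df_train)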

When I run this, I get the following error on bestLogReg = crossval.fit(df_train):

IllegalArgumentException: label does not exist. Available: features, churn, CrossValidator_764038c00edc_rand, rawPrediction, probability, prediction

Here is my df_train dataset’s schema:

root
 |-- features: vector (nullable = true)
 |-- churn: integer (nullable = true)

I have fit a plain LogisticRegression to this dataset before, and it predicts fine.

Can you help me figure out what I did wrong?

One Answer

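In cross-validation you also need to set the label column of the evaluator, even though it's already set for the estimator. BinaryClassificationEvaluator has its own labelCol parameter, which defaults to "label", so it looks for a column called label in the predictions and fails because yours is called churn. So all you need to do is change BinaryClassificationEvaluator() into BinaryClassificationEvaluator().setLabelCol("churn"), where "churn" is the name of your target variable.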

Answered by jared3412341 on November 13, 2021
