PySpark Crossvalidation error

Question

I want to do a very simple cross validation using LogisticRegression.
Here is my code:
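from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

logreg = LogisticRegression(labelCol="churn", featuresCol="features")

pipeline = Pipeline(stages=[logreg])
paramGrid = ParamGridBuilder().addGrid(logreg.regParam, [0.1, 0.01]).build()

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=2)

bestLogReg = crossval.fit(df_train)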

When I run this, I get the following error on bestLogReg = crossval.fit(df_train):
IllegalArgumentException: label does not exist. Available: features, churn, CrossValidator_764038c00edc_rand, rawPrediction, probability, prediction
Here is my df_train dataset's schema:
root
 |-- features: vector (nullable = true)
 |-- churn: integer (nullable = true)

I have fit this to a LogisticRegression before and it predicts fine.
Can you help me figure out what I did wrong?

jared3412341 · Answer

In cross-validation you also need to set the label column of the evaluator, even though it is already set for the estimator. The evaluator has its own labelCol parameter, independent of the estimator's, and it defaults to "label"; that is why the error says that label does not exist. So all you need to do is change BinaryClassificationEvaluator() to BinaryClassificationEvaluator().setLabelCol("churn"), where "churn" is the name of your target variable.

Answered by jared3412341 on November 13, 2021
