Null predictions for ALS in PySpark

Question

I am trying to read from my dataset, which has three columns (User, Repository, and Number of Stars).

In[10]

lines = spark.read.text("Dataset.csv").rdd
print(lines.take(10))

Out[10]

[Row(value='0,0,0,290'), Row(value='1,1,1,112'), Row(value='2,2,2,87.8'), Row(value='3,3,3,69.7'), Row(value='4,4,4,65.7'), Row(value='5,5,5,62'), Row(value='6,6,6,61.6'), Row(value='7,7,7,60.7'), Row(value='8,8,8,57.7'), Row(value='9,9,9,56.2')]

In[11]

# Need to convert p[1] from str to int
parts = lines.map(lambda row: row.value.split(","))

print(parts.take(2))

Out[11]

[['0', '0', '0', '290'], ['1', '1', '1', '112']]

In[12]

# Map each split line to a Row, casting the IDs to int and the star count to float.
# p[0] is not used.
from pyspark.sql import Row

ratingsRDD = parts.map(lambda p: Row(userId=int(p[1]), repoId=int(p[2]), repoCount=float(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
print(ratings.head(10))

Out[12]

[Row(repoCount=290.0, repoId=0, userId=0), Row(repoCount=112.0, repoId=1, userId=1), Row(repoCount=87.8, repoId=2, userId=2), Row(repoCount=69.7, repoId=3, userId=3), Row(repoCount=65.7, repoId=4, userId=4), Row(repoCount=62.0, repoId=5, userId=5), Row(repoCount=61.6, repoId=6, userId=6), Row(repoCount=60.7, repoId=7, userId=7), Row(repoCount=57.7, repoId=8, userId=8), Row(repoCount=56.2, repoId=9, userId=9)]

In[13]

(training, test) = ratings.randomSplit([0.8, 0.2])

In[14]:

from pyspark.ml.recommendation import ALS

# Build the recommendation model using ALS on the training data.
# coldStartStrategy is set to "drop" so that rows with NaN predictions are dropped,
# which keeps the evaluation metric from being NaN.
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="repoId", ratingCol="repoCount",
          coldStartStrategy="drop")
model = als.fit(training)

In[15]

#Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
type(predictions)

predictions.show(3)

Out[15]

My model is giving NULL values. Is there a problem with my dataset, or am I making a wrong assumption about training?

Note that my ratingCol in ALS is the Number of Stars, which is an explicit rating, not an implicit one.

gbdata · Answer

I would need to understand certain aspects of your dataset to give you a better answer but:

The reason you're receiving NULL Values:

Depending on the class distribution (here, the distinct "userId" and "repoId" values), you may receive null predictions because some classes are not present in both samples of the original data. A class that exists in the test data may not exist in the training data, so when you apply transform to the test data the model has no learned factors for it and no basis on which to make a prediction. When you set "coldStartStrategy" to "drop", those records are simply omitted altogether.

I would recommend first setting "coldStartStrategy" to "nan" (the default) instead of "drop", just to see whether all of your records are returning NaN prediction values.

If that is the case, you would then need to check the class distributions in both the training and test samples of your data, in particular for "repoId". Then you would have to sample the data in a way that makes sure each class is present in both samples, and refit the model.

Potential flaws in your process:

Use seeding in "randomSplit" so your samples are always reproducible; that way you can detect issues regardless of when you ran your program.
You may have assumed an equal class distribution in your dataset.
Normalizing your "repoCount" values would benefit the algorithm, but likely has no bearing on the overall results.
