Data Science Asked by Gui Kham on May 23, 2021
I am trying to read from my dataset, which has three columns (User, Repository and Number of Stars).
In[10]
lines = spark.read.text("Dataset.csv").rdd
print(lines.take(10))
Out[10]
[Row(value='0,0,0,290'), Row(value='1,1,1,112'), Row(value='2,2,2,87.8'), Row(value='3,3,3,69.7'), Row(value='4,4,4,65.7'), Row(value='5,5,5,62'), Row(value='6,6,6,61.6'), Row(value='7,7,7,60.7'), Row(value='8,8,8,57.7'), Row(value='9,9,9,56.2')]
In[11]
# Split each line on commas; the fields are still strings here and are converted in the next cell
parts = lines.map(lambda row: row.value.split(","))
print(parts.take(2))
Out[11]
[['0', '0', '0', '290'], ['1', '1', '1', '112']]
In[12]
# Map each record to a Row: IDs as int, star count as float (p[0] appears to be a row index and is skipped)
from pyspark.sql import Row
ratingsRDD = parts.map(lambda p: Row(userId=int(p[1]), repoId=int(p[2]), repoCount=float(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
print(ratings.head(10))
Out[12]
[Row(repoCount=290.0, repoId=0, userId=0), Row(repoCount=112.0, repoId=1, userId=1), Row(repoCount=87.8, repoId=2, userId=2), Row(repoCount=69.7, repoId=3, userId=3), Row(repoCount=65.7, repoId=4, userId=4), Row(repoCount=62.0, repoId=5, userId=5), Row(repoCount=61.6, repoId=6, userId=6), Row(repoCount=60.7, repoId=7, userId=7), Row(repoCount=57.7, repoId=8, userId=8), Row(repoCount=56.2, repoId=9, userId=9)]
In[13]
(training, test) = ratings.randomSplit([0.8, 0.2])
In[14]:
# Build the recommendation model using ALS on the training data.
# coldStartStrategy="drop" removes rows whose prediction would be NaN, so the evaluation metric is not NaN.
from pyspark.ml.recommendation import ALS
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="repoId",
          ratingCol="repoCount", coldStartStrategy="drop")
model = als.fit(training)
In[15]
# Generate predictions on the test data (the RMSE would be computed from these)
predictions = model.transform(test)
type(predictions)
predictions.show(3)
Out[15]
+---------+------+------+----------+
|repoCount|repoId|userId|prediction|
+---------+------+------+----------+
+---------+------+------+----------+
My model is giving NULL values. Is there a problem with my dataset, or am I making wrong assumptions about training?
Note that my ratingCol in ALS is Number of Stars, which is an explicit rating and not an implicit rating.
I would need to understand certain aspects of your dataset to give you a better answer but:
The reason you're receiving NULL Values:
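Depending on how the IDs are distributed, you may receive null predictions because some users or repositories ("userId", "repoId") appear in only one of the two samples of the original data. An ID that exists in the test data may not exist in the training data, so when you apply transform to the test data the model has no learned factors on which to base a prediction and returns NaN. When you set coldStartStrategy to "drop", those records are omitted altogether, which would explain the empty output table. In the sample you posted, every userId and repoId appears only once; if that holds for the whole dataset, no test row can have a match in the training data.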
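To make this concrete, here is a minimal pure-Python sketch. The helper name cold_start_rows is hypothetical, and it assumes the rows are plain (userId, repoId, repoCount) tuples, for example built with [(r.userId, r.repoId, r.repoCount) for r in test.collect()] on a dataset small enough to collect. It returns the test rows whose user or repository never occurs in the training rows, which are the rows ALS cannot score.

```python
def cold_start_rows(train_rows, test_rows):
    """Return the test rows whose userId or repoId is absent from train_rows.

    Rows are assumed to be (userId, repoId, repoCount) tuples.
    """
    seen_users = {user for user, _repo, _count in train_rows}
    seen_repos = {repo for _user, repo, _count in train_rows}
    return [
        row for row in test_rows
        if row[0] not in seen_users or row[1] not in seen_repos
    ]


# Rows shaped like the sample output in the question: every ID occurs once.
rows = [(0, 0, 290.0), (1, 1, 112.0), (2, 2, 87.8), (3, 3, 69.7), (4, 4, 65.7)]
train_rows, test_rows = rows[:4], rows[4:]

unseen = cold_start_rows(train_rows, test_rows)
print(len(unseen), "of", len(test_rows), "test rows are cold-start rows")
```

If the count equals the size of the test set, every prediction is a cold-start prediction and "drop" will leave nothing to show.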
I would recommend first setting coldStartStrategy to "nan" (the default) instead of "drop", just to see whether all of your records are returning NaN prediction values.
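Once the NaN rows are kept, a quick way to quantify the problem is to measure what share of the predictions is NaN. The sketch below is only an illustration: nan_share is a hypothetical helper, and it assumes the prediction values have already been pulled into a plain list of floats, for example with [r.prediction for r in predictions.collect()].

```python
import math


def nan_share(predictions):
    """Return the fraction of prediction values that are NaN (0.0 for empty input)."""
    values = list(predictions)
    if not values:
        return 0.0
    return sum(1 for value in values if math.isnan(value)) / len(values)


# Made-up prediction values, two of which are NaN.
sample = [float("nan"), 3.5, float("nan"), 1.2]
print(nan_share(sample))
```

A share of 1.0 would mean that no test row has both a known user and a known repository.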
If that is the case, you would then need to check how the IDs are distributed across the training and test samples of your data, for "repoId" and also for "userId". Then you would have to sample the data in a way that makes sure every ID in the test sample is also present in the training sample, and refit the model.
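One possible way to sample like this is sketched below in plain Python. It is not a Spark API: split_keeping_known_ids is a hypothetical helper, it assumes the rows fit in memory as (userId, repoId, repoCount) tuples, and it moves a row to the test set only when at least one other row with the same user and one other row with the same repository stay in training.

```python
import random
from collections import Counter


def split_keeping_known_ids(rows, test_fraction=0.2, seed=0):
    """Split (userId, repoId, repoCount) tuples into (training, test).

    A row is moved to test only if another row with the same userId and
    another row with the same repoId remain outside the test set.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Number of rows not held out so far that mention each user / each repo.
    user_left = Counter(user for user, _repo, _count in rows)
    repo_left = Counter(repo for _user, repo, _count in rows)
    target = int(len(rows) * test_fraction)

    training, test = [], []
    for row in rows:
        user, repo = row[0], row[1]
        if len(test) < target and user_left[user] > 1 and repo_left[repo] > 1:
            test.append(row)
            user_left[user] -= 1
            repo_left[repo] -= 1
        else:
            training.append(row)
    return training, test


# Made-up data in which every user has starred every repository.
dense = [(u, r, 1.0) for u in range(5) for r in range(5)]
training, test = split_keeping_known_ids(dense)
print(len(training), len(test))
```

On data where every user and repository occurs only once, as in the ten rows shown in the question, this split returns an empty test set. That would indicate that the dataset itself, rather than the split, leaves ALS nothing to evaluate on.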
Potential flaws in your process:
Answered by gbdata on May 23, 2021