TransWikia.com

Output of machine learning classification prediction on out-of-sample data has too few observations

Data Science Asked by Sage Wyatt on April 26, 2021

Using the mlr package, I have developed a task using word frequency data to categorize tweets into two categories (TRUE and FALSE). Now I am using this task to classify out-of-sample tweets. Sorry, I cannot share my data, but I will show my code here:

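library(mlr)

# Define the classification task; "category" is the TRUE/FALSE target column
task = makeClassifTask(data = train, target = "category")
# Fit a random forest (requires the randomForest package)
mod = train("classif.randomForest", task)

# Predict the category of the out-of-sample tweets
newdata.pred = predict(mod, newdata = outofsample)
newdata.pred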

My output is shown here:

Prediction: 17981 observations
predict.type: response
threshold: 
time: 0.29
... (#rows: 17981, #cols: 1)

And as a dataframe:

  response
  <fctr>
1 TRUE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE

I now want to use this classification to remove any tweets that fall in the "FALSE" category. But if I have 17981 tweets, why do I only see 6 observations? I cannot find anything wrong with the "train" or "outofsample" data frames (they both have the appropriate number of observations and are listed as data frames in the global environment), but I did notice that the object "task" is a list of 6. Is this just a coincidence? How do I retrieve the classification for all of my tweets? And how do I link this information back to the out-of-sample dataset to delete tweets by ID number, if the classification output does not include the ID number?

Please don’t judge me too harshly, I’m very new to machine learning and R in general. Any advice would be a big help.

One Answer

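The printed output is only a preview: printing an mlr prediction object shows just the first 6 rows, and the line "... (#rows: 17981, #cols: 1)" confirms that all 17981 predictions are there. The actual predictions can be found in newdata.pred$data.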

Note that mlr is retired and no longer actively developed:

{mlr} is considered retired from the mlr-org team. We won’t add new features anymore and will only fix severe bugs. We suggest to use the new mlr3 framework from now on and for future projects.

See the docs: https://mlr.mlr-org.com/

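library(mlr)
library(ISLR)

# Example data: predict the number of cylinders in the Auto dataset
df = ISLR::Auto
df$cylinders = as.factor(df$cylinders)  # treat cylinders as a class label
df$name = NULL                          # drop the car name column

task = makeClassifTask(data = df, target = "cylinders")
mod = train("classif.randomForest", task)

# Predicting on df itself (in-sample) is only for illustration
newdata.pred = predict(mod, newdata = df)

# $data holds one row per observation; truth appears because df contains the target
result = data.frame(newdata.pred$data)
head(result, 12)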

    truth response
1       8        8
2       8        8
3       8        8
4       8        8
5       8        8
6       8        8
7       8        8
8       8        8
9       8        8
10      8        8
11      8        8
12      8        8
...
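
The question also asks how to link the predictions back to tweet IDs and drop the "FALSE" tweets. Below is a minimal sketch of one way to do that, assuming the rows of the prediction's $data come back in the same order as the rows passed as newdata (one prediction per input row). The iris-based data, the id column, and the names train.df, features, and kept are hypothetical stand-ins for the real tweet data, which is not available here.

```r
library(mlr)  # the randomForest package must also be installed

# Hypothetical stand-in for the tweet data: a two-class problem with an ID column
set.seed(1)
dat = iris[iris$Species != "setosa", ]
dat$category = factor(dat$Species == "versicolor")  # levels "FALSE" and "TRUE"
dat$Species = NULL
dat$id = seq_len(nrow(dat))  # plays the role of the tweet ID

idx = sample(nrow(dat), 70)
train.df = dat[idx, ]
outofsample = dat[-idx, ]

# Keep the ID out of the model so it is not used as a feature
features = setdiff(names(dat), "id")
task = makeClassifTask(data = train.df[, features], target = "category")
mod = train("classif.randomForest", task)

# Predict without the target column, mimicking unlabeled out-of-sample data
pred = predict(mod, newdata = outofsample[, setdiff(features, "category")])

# Assumption: pred$data is row-aligned with newdata, so the response can be
# attached to the original data frame (which still has the id column)
outofsample$predicted = pred$data$response

# Keep only the rows predicted as "TRUE"; the IDs come along with them
kept = outofsample[outofsample$predicted == "TRUE", ]
```

Because the ID column is left out of the data passed to the model but kept in outofsample, it never influences the classifier and is still available after filtering.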

Answered by Peter on April 26, 2021
