Data Science Asked by funkyFunk on September 3, 2021
I have a dataset with the following columns:
col1 col2 col3 col4 label
7669.533073 7669.533073 7669.695497 7669.922593 1
7669.533043 7669.533072 7669.695487 7669.922596 0
The mean is similar across all 50 columns, and so are the minimum and maximum.
I am trying to build a classifier, and the best model (random forest) gives me a recall of 0.55, which doesn't seem very good. Could I be missing something?
I have thought about normalising the data, but there seems to be no need, as all columns have a similar mean and std.
Is there any statistical technique I could apply to the data to get a better result?
Note: the data comes from a simulated crypto price, and I am trying to predict the price movement (up or down).
Your data points are too close to each other, so it is very hard for any ML model to learn from these inputs: it has no way to tell nearly identical rows apart and assign them to label 1 or label 0. That is why the result looks random and your score is only around one half.
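One rough way to check this for a single column is to compare the gap between the two class means with the spread inside each class. The sketch below is only an illustration: the function name `standardized_gap` and the two toy columns are hypothetical and are not taken from the asker's data. A value near 0 means the classes overlap almost completely in that column.

```python
from statistics import mean, stdev

def standardized_gap(values, labels):
    """Absolute gap between the class means, in units of the pooled std.

    Hypothetical helper: assumes binary labels (0/1) and at least
    two rows per class.
    """
    ones = [v for v, lab in zip(values, labels) if lab == 1]
    zeros = [v for v, lab in zip(values, labels) if lab == 0]
    pooled = ((stdev(ones) ** 2 + stdev(zeros) ** 2) / 2) ** 0.5
    if pooled == 0:
        return 0.0
    return abs(mean(ones) - mean(zeros)) / pooled

labels = [1, 1, 1, 0, 0, 0]

# Toy column where both classes contain the same values (full overlap).
col_overlap = [7669.5330, 7669.5331, 7669.5329, 7669.5331, 7669.5329, 7669.5330]

# Toy column where label 0 is shifted well away from label 1.
col_separated = [7669.53, 7669.54, 7669.52, 7670.53, 7670.54, 7670.52]

print("overlap:  ", standardized_gap(col_overlap, labels))
print("separated:", standardized_gap(col_separated, labels))
```

If every one of the 50 columns gives a gap close to 0, no single column separates the labels. That would be consistent with the near-chance score, though it does not rule out signal in combinations of columns.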
Answered by SrJ on September 3, 2021
If the data values are this close together, the slight differences between them could be due to, or at least masked by, measurement error. If that is the case, you won't be able to model the data accurately, because measurement error is typically random and unrelated to any label attached to a row. I am also curious about the high precision of the data: it has 10 significant digits, with the decimal part going down to the millionths place even though the values are in the thousands.
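To make the scale concrete, the two sample rows from the question can be compared column by column. This is a small sketch using only the four visible columns; the helper names `max_abs_gap` and `max_relative_gap` are made up for illustration.

```python
# The two sample rows shown in the question (first four columns only).
row_label_1 = [7669.533073, 7669.533073, 7669.695497, 7669.922593]
row_label_0 = [7669.533043, 7669.533072, 7669.695487, 7669.922596]

def max_abs_gap(a, b):
    """Largest absolute column-wise difference between two rows."""
    return max(abs(x - y) for x, y in zip(a, b))

def max_relative_gap(a, b):
    """Largest column-wise difference relative to the value in row b."""
    return max(abs(x - y) / abs(y) for x, y in zip(a, b))

gap = max_abs_gap(row_label_1, row_label_0)
rel = max_relative_gap(row_label_1, row_label_0)

print(f"largest absolute gap: {gap:.6f}")
print(f"largest relative gap: {rel:.2e}")
```

For these two rows the largest gap is in the fifth decimal place, on values in the thousands. Any noise of that size or larger would swamp the difference between a row labelled 1 and a row labelled 0.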
Answered by Donald S on September 3, 2021