Cross Validated: Asked on November 8, 2020
I am validating the use of a clinical cardiovascular score to predict the risk of dementia, using data from a longitudinal study. My outcome is therefore binary (dementia: yes or no) and the independent variable (the score) is continuous; of course I also have a whole set of covariates.
I ran a Cox analysis to assess the association between baseline values and the outcome over time, but now I would like to validate the use of the score.
I thought about taking a random sub-sample of my cohort, splitting it into training and test sets, and running some validation statistics (e.g. ROC curves), but I have some concerns about this approach.
How shall I evaluate the use of this score?
Can you suggest tests that are better suited to the problem?
For data analysis I use Stata.
So you basically want to do a cross-validation (CV) of your dataset.
One type of CV is the holdout method, which divides the data into two parts: Dtrain and Dtest. The model is fit on Dtrain and then predicts values for Dtest (x → y); since we know the actual values, i.e. which y-value corresponds to each x-value, we can compare the predicted and actual values and estimate the performance. The Dtrain/Dtest split is chosen somewhat arbitrarily; 70/30 is common.
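Since you mention working in Stata, here is a minimal sketch of the holdout idea there, assuming a binary outcome variable dementia, a score variable cvscore, and covariates age and sex (all names hypothetical; substitute your own):

    * Holdout sketch: fit on ~70% (Dtrain), evaluate on the held-out ~30% (Dtest)
    set seed 20201108
    generate u = runiform()
    generate train = (u < 0.7)                  // 1 = Dtrain, 0 = Dtest
    logistic dementia cvscore age sex if train  // fit on Dtrain only
    predict phat if !train, pr                  // predicted probabilities in Dtest
    roctab dementia phat if !train              // ROC area (AUC) in Dtest

Note this uses a logistic model on the binary outcome for illustration; the same split-fit-evaluate pattern applies to other models.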
But your concern about splitting your dataset into subsets is valid. Why? Because when a fairly small dataset, in your case N = 2500, is split into subparts, there is a much higher chance that Dtest and Dtrain differ from one another.
We can solve this with k-fold CV. K-fold cross-validation divides the dataset into k parts. One of the subparts is used as test data and the rest (k − 1 parts) as training data. The model is refit for each subpart in turn, and an error rate is obtained for each iteration. The mean error rate is then used as the performance value. This solves the problem of having subparts of the data that are not representative of the whole dataset.
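A sketch of 5-fold CV in Stata, under the same hypothetical variable names as above (dementia, cvscore, age, sex):

    * 5-fold CV sketch: each fold is held out once, the rest train the model
    set seed 20201108
    generate u2 = runiform()
    xtile fold = u2, nquantiles(5)             // 5 roughly equal random folds
    generate phat_cv = .
    forvalues i = 1/5 {
        quietly logistic dementia cvscore age sex if fold != `i'
        quietly predict tmp if fold == `i', pr // predict only on the held-out fold
        quietly replace phat_cv = tmp if fold == `i'
        drop tmp
    }
    roctab dementia phat_cv                    // cross-validated AUC over all folds

Here every observation receives an out-of-sample prediction exactly once, so the final AUC is computed on predictions the model never trained on.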
The problem with k-fold cross-validation arises if the subparts happen to be divided so as to give a model that overfits the data: it predicts the data points it was given quite well, but when the model is given new data the error rate is high. This usually happens when there is little data to begin with. To prevent this we can use repeated cross-validation.
Repeated cross-validation: run the whole k-fold procedure several times, each time with a fresh random partition into folds, and average the error rates across all repetitions, so the estimate no longer depends on one particular random split (see the sketch after the figure).
Figure 1. General idea behind CV.
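A Stata sketch of repeated CV in the same spirit: repeat the 5-fold procedure, say, 10 times with a fresh random partition each time and average the resulting AUCs (again with the hypothetical variable names dementia, cvscore, age, sex):

    * Repeated 5-fold CV sketch: 10 repetitions, mean AUC across repetitions
    set seed 20201108
    local sum = 0
    forvalues r = 1/10 {
        capture drop u3 fold_r phat_r          // fresh random partition each time
        generate u3 = runiform()
        xtile fold_r = u3, nquantiles(5)
        generate phat_r = .
        forvalues i = 1/5 {
            quietly logistic dementia cvscore age sex if fold_r != `i'
            quietly predict tmp if fold_r == `i', pr
            quietly replace phat_r = tmp if fold_r == `i'
            drop tmp
        }
        quietly roctab dementia phat_r
        local sum = `sum' + r(area)            // r(area) is the AUC saved by roctab
    }
    display "mean AUC over 10 repetitions = " `sum'/10

Averaging over repetitions smooths out the luck of any single fold assignment, which matters most with a sample of your size.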
Answered by Lennart on November 8, 2020