Cross Validated: Asked on November 8, 2020
I am validating the use of a clinical cardiovascular score to predict the risk of dementia, using data from a longitudinal study. My outcome is therefore binary (dementia: yes or no) and the independent variable (the score) is continuous; of course I also have a whole set of covariates.
I ran a Cox analysis to assess the association between baseline values and the outcome over time, but now I would like to validate the use of the score.
I thought about taking a random sub-sample of my cohort, splitting it into training and test sets, and running some validation statistics (e.g. ROC curves), but I have some concerns about this approach.
How shall I evaluate the use of this score?
Can you suggest tests that are better suited to the problem?
For data analysis I use Stata.
So you basically want to do a cross-validation (CV) of your dataset.
One type of CV is the holdout method, which divides the data into two parts: Dtrain and Dtest. The model is fit on Dtrain and then predicts values for Dtest (x → y); since we know the actual values, i.e. which y-value corresponds to each x-value, we can compare the predicted and actual values and estimate the performance. The Dtrain/Dtest split is chosen somewhat arbitrarily; 70/30 is common.
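Since you mention working in Stata, here is a minimal sketch of the holdout idea there, assuming a binary outcome variable dementia, a score variable cvscore, and covariates age and sex (all names hypothetical; substitute your own):

    * Holdout sketch: fit on ~70% (Dtrain), evaluate on the held-out ~30% (Dtest)
    set seed 20201108
    generate u = runiform()
    generate train = (u < 0.7)                  // 1 = Dtrain, 0 = Dtest
    logistic dementia cvscore age sex if train  // fit on Dtrain only
    predict phat if !train, pr                  // predicted probabilities in Dtest
    roctab dementia phat if !train              // ROC area (AUC) in Dtest

Note this uses a logistic model on the binary outcome for illustration; the same split-fit-evaluate pattern applies to other models.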
But your concern about splitting your dataset into subsets is valid. Why? Because when a fairly small dataset, in your case N = 2500, is split into subparts, there is a much higher chance that Dtest and Dtrain differ from one another.
We can solve this with k-fold CV. K-fold cross-validation divides the dataset into k parts. One of the subparts is used as test data and the rest (k − 1 parts) as training data. The model is refit for each subpart in turn, and an error rate is obtained for each iteration. The mean error rate is then used as the performance value. This solves the problem of having subparts of the data that are not representative of the whole dataset.
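A sketch of 5-fold CV in Stata, under the same hypothetical variable names as above (dementia, cvscore, age, sex):

    * 5-fold CV sketch: each fold is held out once, the rest train the model
    set seed 20201108
    generate u2 = runiform()
    xtile fold = u2, nquantiles(5)             // 5 roughly equal random folds
    generate phat_cv = .
    forvalues i = 1/5 {
        quietly logistic dementia cvscore age sex if fold != `i'
        quietly predict tmp if fold == `i', pr // predict only on the held-out fold
        quietly replace phat_cv = tmp if fold == `i'
        drop tmp
    }
    roctab dementia phat_cv                    // cross-validated AUC over all folds

Here every observation receives an out-of-sample prediction exactly once, so the final AUC is computed on predictions the model never trained on.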
The problem with k-fold cross-validation arises if the subparts happen to be divided so as to give a model that overfits the data: it predicts the data points it was given quite well, but when the model is given new data the error rate is high. This usually happens when there is little data to begin with. To prevent this we can use repeated cross-validation.
Repeated cross-validation: run the whole k-fold procedure several times, each time with a fresh random partition into folds, and average the error rates across all repetitions, so the estimate no longer depends on one particular random split (see the sketch after the figure).
Figure 1. General idea behind CV.
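A Stata sketch of repeated CV in the same spirit: repeat the 5-fold procedure, say, 10 times with a fresh random partition each time and average the resulting AUCs (again with the hypothetical variable names dementia, cvscore, age, sex):

    * Repeated 5-fold CV sketch: 10 repetitions, mean AUC across repetitions
    set seed 20201108
    local sum = 0
    forvalues r = 1/10 {
        capture drop u3 fold_r phat_r          // fresh random partition each time
        generate u3 = runiform()
        xtile fold_r = u3, nquantiles(5)
        generate phat_r = .
        forvalues i = 1/5 {
            quietly logistic dementia cvscore age sex if fold_r != `i'
            quietly predict tmp if fold_r == `i', pr
            quietly replace phat_r = tmp if fold_r == `i'
            drop tmp
        }
        quietly roctab dementia phat_r
        local sum = `sum' + r(area)            // r(area) is the AUC saved by roctab
    }
    display "mean AUC over 10 repetitions = " `sum'/10

Averaging over repetitions smooths out the luck of any single fold assignment, which matters most with a sample of your size.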
Answered by Lennart on November 8, 2020