Data Science Asked on March 27, 2021
Let’s say I have a medical dataset/EHR dataset that is retrospective and longitudinal in nature. Meaning one person has multiple measurements across multiple time points (in the past). I did post here but couldn’t get any response. So, posting it here
This dataset contains information about patients’ diagnosis, mortality flag, labs, admissions, and drugs consumed, etc.
Now, if I would like to find out predictors that can influence mortality, I can use logistic regression (whether the patient will die or not).
But my objective is to find out what are the predictors that can help me predict whether a person will die in the next 30 days or the next 240 days, how can I do this using ML/Data Analysis techniques?
In addition, I would also like to compute a score that can indicate the likelihood that this person will die in the next 30 days? How can I compute the scores? Any tutorials links on how is this score derived?, please?
Can you please let me know what are the different analytic techniques that I can use to address this problem and different approaches to calculate score?
I would like to read and try solving problems like this
This could be seen as a "simple" binary classification problem. I mean the type of problem is "simple", the task itself certainly isn't... And I'm not even going to mention the serious ethical issues about its potential applications!
First, obviously you need to have an entry in your data for a patient's death. It's not totally clear to me if you have this information? It's important that whenever a patient has died this is reported in the data, otherwise you cannot distinguish the two classes.
So the design could be like this:
Ideally I would recommend splitting between training and test data before even preparing the data in this way, typically by picking a period of time for training data and another for test data.
Once the data is prepared, in theory any binary classification method can be applied. Of course a probabilistic classifier can be used to predict a probability, but this can be misleading so be very careful: the probability itself is a prediction, it cannot be interpreted as the true chances of the patient to die or not. For example Naive Bayes is known to empirically always give extreme probabilities, i.e. close to 0 or close to 1, and quite often it's completely wrong in its prediction. This means that in general the predicted probability is only a guess, it cannot be used to represent confidence.
[edit: example]
Let's say we have:
Let's imagine the following data (to simplify I assume the time unit is year):
patientId birthYear year indicator
1 1987 2000 26
1 1987 2001 34
1 1987 2002 18
1 1987 2003 43
1 1987 2004 31
1 1987 2005 36
2 1953 2000 47
2 1953 2001 67
2 1953 2002 56
2 1953 2003 69
2 1953 2004 - DEATH
3 1969 2000 37
3 1969 2001 31
3 1969 2002 25
3 1969 2003 27
3 1969 2004 15
3 1969 2005 - DEATH
4 1936 2000 41
4 1936 2001 39
4 1936 2002 43
4 1936 2003 43
4 1936 2004 40
4 1936 2005 38
That would be transformed into this:
patientId yearT age indicatorT-2 indicatorT-1 indicatorT-0 label
1 2002 15 26 34 18 0
1 2003 16 34 18 43 0
1 2004 17 18 43 31 0
2 2002 49 47 67 56 0
2 2003 50 67 56 69 1
3 2002 33 37 31 25 0
3 2003 34 31 25 27 0
3 2004 35 25 27 15 1
4 2002 66 41 39 43 0
4 2003 67 39 43 43 0
4 2004 68 43 43 40 0
Note that I wrote the first two columns only to show how the data is calculated, these two are not part of the features.
Correct answer by Erwan on March 27, 2021
To clarify the questions raised by the user in response to the correct solution given by Erwan - the solution proposes going back in time to prepare the data across a series of timestamps.
There will be multiple points in time 't' where the input would be all the various features on the patients health, medication, reports etc..you need to see how best they can be converted to representational vectors. The labels would be a binary and indicate whether the patient lived after t+N days..where N can be 30,60,240 etc. 't' itself can be taken week on week or month on month.
Once data is prepared this way, it becomes a binary classification exercise.
The only additional consideration which can be added is - there could be elements of RNN here. The training data is not independent of one another and may contain recurring data of the same patient over multiple timestamps and perhaps there is a scope for capturing this information to model the situation better.
Answered by Allohvk on March 27, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP