Data Science Asked on November 25, 2021
I have a dataset that contains medical data about children, and I am developing a machine learning model to predict adverse pregnancy outcomes. The dataset contains mostly features with a single value per child, e.g. gender = ["Male", "Female"].
However, some features have multiple values per child, such as the abdominal circumference, which has been recorded multiple times per child, like so:
   ChildID  abdomcirc
0        1        273
1        1        267
2        1        294
3        2        136
4        2        248
So in the above table, child 1 has three values for abdomcirc and child 2 has two. Merging this feature into the rest of the dataset (which consists of single-valued features) will result in near-duplicate rows that differ only in the value of abdomcirc, like so:
   ChildID  gender  diabetes  birthroute  abdomcirc
0        1    Male        No      Normal        273
1        1    Male        No      Normal        267
2        1    Male        No      Normal        294
3        2  Female       Yes    csection        136
4        2  Female       Yes    csection        248
I am unsure what the best way to deal with these features is, without merging the data and having near-duplicate rows. I have considered the following:
Using a Python list for abdomcirc. However, I do not know whether a machine learning model can handle this data type. My data would look something like this:
   ChildID  gender  diabetes  birthroute  abdomcirc
0        1    Male        No      Normal  [273, 267, 294]
1        2  Female       Yes    csection  [136, 248]
Transforming abdomcirc into a single observational feature by calculating the mean (although I am not sure how useful this information would be for my predictive model) like so:
   ChildID  gender  diabetes  birthroute  abdomcirc
0        1    Male        No      Normal        278
1        2  Female       Yes    csection        192
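For reference, this mean-based option could be sketched in pandas roughly as follows. The two-table layout and the variable names (`measurements`, `children`, `mean_circ`, `merged`) are assumptions made for illustration; only the column names and values come from the tables above.

```python
import pandas as pd

# Repeated measurements: one row per (child, measurement), as in the first table.
measurements = pd.DataFrame({
    "ChildID": [1, 1, 1, 2, 2],
    "abdomcirc": [273, 267, 294, 136, 248],
})

# Single-valued features: one row per child (hypothetical separate table).
children = pd.DataFrame({
    "ChildID": [1, 2],
    "gender": ["Male", "Female"],
    "diabetes": ["No", "Yes"],
    "birthroute": ["Normal", "csection"],
})

# Collapse the repeated measurements to one mean value per child.
mean_circ = measurements.groupby("ChildID", as_index=False)["abdomcirc"].mean()

# Join back onto the per-child table; the result has one row per child,
# so no near-duplicate rows are created.
merged = children.merge(mean_circ, on="ChildID", how="left")
print(merged)
```

Aggregating before merging is what keeps the result at one row per child; merging first and aggregating afterwards would reproduce the near-duplicate rows.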
I have tried looking for resources to help me with this but have not been very successful, maybe because I am not typing the correct keywords or something. So, I would appreciate your opinions and helpful resources. Many thanks!
A possible resource is featuretools, a library for feature engineering on data that has many records per entity. Its examples are not from medical cases, but I think it should work for you too.
You can also manually build several features. For instance, given a child's list of abdomcirc values, you can compute several summary features from it. These features would capture most of the information in the abdomcirc list, and this should help your modelling.
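A minimal sketch of building such features by hand with pandas is shown below. The particular statistics (min, max, mean, standard deviation, count, first, last, range) and the output column names are assumptions chosen for illustration, not a list taken from this answer; pick whichever summaries make sense clinically.

```python
import pandas as pd

# One row per (child, measurement), using the values from the question.
measurements = pd.DataFrame({
    "ChildID": [1, 1, 1, 2, 2],
    "abdomcirc": [273, 267, 294, 136, 248],
})

# Named aggregation: each keyword becomes one output column per child.
# The choice of statistics here is an assumption, not a prescribed set.
features = measurements.groupby("ChildID")["abdomcirc"].agg(
    abdomcirc_min="min",
    abdomcirc_max="max",
    abdomcirc_mean="mean",
    abdomcirc_std="std",      # NaN for a child with a single measurement
    abdomcirc_count="count",
    abdomcirc_first="first",  # assumes rows are already in recording order
    abdomcirc_last="last",
).reset_index()

# Derived feature: spread between the largest and smallest measurement.
features["abdomcirc_range"] = features["abdomcirc_max"] - features["abdomcirc_min"]

print(features)
```

The resulting table has one row per child and can be merged onto the single-valued features by ChildID. If measurement dates are available, sorting by date before grouping makes the first and last columns meaningful.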
I wouldn't go for the first approach of giving lists to the algorithm. Although it is possible, I think it is relatively advanced, and I would only try it if the simpler approaches don't work.
Answered by David Masip on November 25, 2021