Data Science Asked on March 28, 2021
When machine learning libraries don’t support categorical features, those features can be one-hot encoded into a series of binary feature columns. I have a feature that represents a sequence or permutation of values, and I want to transform it into something scikit-learn or similar ML libraries can use. What are the well-known ways of doing this?
In my problem a physical system is damaged, and I’d like to use ML to recommend the sequence in which the damage should be repaired. Because repair crews and equipment are limited, only a certain number of components can be repaired at any one time. I’ve already determined a rough importance for the various components. Typically, repairing the most important components first works well, but in specific situations non-obvious repair orders work even better. I have a dataset with a couple of million data points. For each data point I have the set of components that were damaged, the order in which the repairs were undertaken, and a metric of how well the strategy worked.
The number of components that can be damaged is fixed and approximately 1600.
In a realistic scenario there would be fewer than 50 damaged components.
Say there are four components: A, B, C and D.
Assume B, A and C were damaged but D was not.
In an example dataset there might be two entries:
[A,C,B] = 11
[B,A,C] = 7
I want to transform the [A,C,B] part into something I can give to a regressor or classifier.
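Approaches I have thought of so far:

Option #1: one column per component holding its position in the repair order (0 = not damaged).

A_repaired_at | B_repaired_at | C_repaired_at | D_repaired_at | result |
---|---|---|---|---|
1 | 3 | 2 | 0 | 11 |
2 | 1 | 3 | 0 | 7 |

Option #2: the same positions, normalized by the number of damaged components.

A_repaired_at | B_repaired_at | C_repaired_at | D_repaired_at | result |
---|---|---|---|---|
1/3 | 3/3 | 2/3 | 0 | 11 |
2/3 | 1/3 | 3/3 | 0 | 7 |

Option #3: one column per pair of components, recording which of the two was repaired first (na if either was not damaged).

A_before_B | A_before_C | A_before_D | B_before_C | B_before_D | C_before_D | result |
---|---|---|---|---|---|---|
True | True | na | False | na | na | 11 |
False | True | na | True | na | na | 7 |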
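
For concreteness, here is a minimal Python sketch of how the three encodings above could be computed for the toy example. The function names (`position_features`, `normalized_position_features`, `pairwise_features`) and the `COMPONENTS` list are made up for illustration, and `None` stands in for "na".

```python
from itertools import combinations

# Toy universe from the example; the real problem has about 1600 components.
COMPONENTS = ["A", "B", "C", "D"]


def position_features(order, components=COMPONENTS):
    """Option #1: 1-based repair position, 0 if the component was not damaged."""
    pos = {c: i + 1 for i, c in enumerate(order)}
    return [pos.get(c, 0) for c in components]


def normalized_position_features(order, components=COMPONENTS):
    """Option #2: repair position divided by the number of damaged components."""
    n = len(order)
    return [p / n if n else 0.0 for p in position_features(order, components)]


def pairwise_features(order, components=COMPONENTS):
    """Option #3: for each pair (x, y), True if x was repaired before y,
    False if after, None ("na") if either component was not damaged."""
    pos = {c: i for i, c in enumerate(order)}
    return [
        (pos[x] < pos[y]) if (x in pos and y in pos) else None
        for x, y in combinations(components, 2)
    ]


rows = [(["A", "C", "B"], 11), (["B", "A", "C"], 7)]
for order, result in rows:
    print(position_features(order), pairwise_features(order), result)
```

The pair columns come out in the same order as the table header (A_before_B, A_before_C, A_before_D, B_before_C, B_before_D, C_before_D) because `combinations` preserves the order of `COMPONENTS`.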
Option #3 is nice because the columns only take the values {True, False, NA}, but it has a nightmare-inducing problem: my 1600 components become roughly 1.28 million feature columns (1600 × 1599 / 2 = 1,279,200).
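
The column count for Option #3 can be checked with a couple of lines of Python. This is only back-of-the-envelope arithmetic using the figures stated above (1600 components, at most 50 damaged per data point); the variable names are illustrative.

```python
from math import comb

N_COMPONENTS = 1600  # fixed number of components, from the question
MAX_DAMAGED = 50     # upper bound on damaged components per data point

# One column per unordered pair of components.
total_pair_columns = comb(N_COMPONENTS, 2)

# Only pairs where both components were damaged get a True/False value.
max_defined_per_row = comb(MAX_DAMAGED, 2)

print(total_pair_columns)   # 1279200
print(max_defined_per_row)  # 1225
```

So even in the worst case, fewer than 0.1% of the pairwise columns in any one row would hold a value other than NA.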
Is there some other way to transform the sequence into useful ML features?