TransWikia.com

How can I transform a sequence into features?

Data Science Asked on March 28, 2021

When machine learning libraries don’t support categorical features, those features can be one-hot encoded into a series of binary feature columns. I have a feature that represents a sequence (a permutation of values), and I want to transform it into something scikit-learn or similar ML libraries can use. What are the well-known ways of doing this?

In my problem a physical system is damaged, and I’d like to use ML to recommend the order in which the damaged components should be repaired. Because resources, repair crews and equipment are limited, only a certain number of components can be repaired at any one time. I’ve already determined a rough importance for the various components. Repairing the most important components first typically works well, but in specific situations non-obvious repair orders work even better. I have a dataset with a couple of million data points. For each data point I have the set of components that were damaged, the order in which the repairs were undertaken, and a metric of how well the strategy worked.
The number of components that can be damaged is fixed at approximately 1600.
In a realistic scenario there would be fewer than 50 damaged components.

Say there were four components A,B,C,D

Assume B, A and C were damaged but D was not.
In an example dataset there might be two entries:
[A,C,B] = 11
[B,A,C] = 7

I want to transform the sequence part (for example [A,C,B], with D undamaged) into something I can give to a regressor or classifier.

Approaches I have thought of so far:

  1. One column per component, holding the position at which the component was repaired. If a component wasn’t damaged, the column might hold 0 or N/A.

| A_repaired_at | B_repaired_at | C_repaired_at | D_repaired_at | result |
|---|---|---|---|---|
| 1 | 3 | 2 | 0 | 11 |
| 2 | 1 | 3 | 0 | 7 |

  2. Instead of using the rank, use a normalized rank. Approach #1 seems unlikely to work well when the number of damaged components changes: being repaired third out of three damaged components means the component was repaired last, but being repaired third out of 50 means it was repaired near the front.

| A_repaired_at | B_repaired_at | C_repaired_at | D_repaired_at | result |
|---|---|---|---|---|
| 1/3 | 3/3 | 2/3 | 0 | 11 |
| 2/3 | 1/3 | 3/3 | 0 | 7 |

  3. Use comes-before attributes directly. Approach #2 seems to make it possible for an ML library to compare repair positions within a row. If that is how we’d like the ML to function, maybe I should add those comparisons as features directly.

| A_before_B | A_before_C | A_before_D | B_before_C | B_before_D | C_before_D | result |
|---|---|---|---|---|---|---|
| True | True | NA | False | NA | NA | 11 |
| False | True | NA | True | NA | NA | 7 |
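
The three encodings listed above can be sketched in plain Python on the four-component toy example. This is only an illustrative sketch: the function names and the `COMPONENTS` list are hypothetical, undamaged components are assumed to be encoded as 0 (approaches 1 and 2) or `None` standing in for NA (approach 3), and nothing here is tied to a particular ML library.

```python
from itertools import combinations

# Toy universe from the example; the real problem has about 1600 components.
COMPONENTS = ["A", "B", "C", "D"]


def rank_features(order, components=COMPONENTS):
    """Approach 1: 1-based repair position, 0 for undamaged components."""
    position = {name: i + 1 for i, name in enumerate(order)}
    return [position.get(name, 0) for name in components]


def normalized_rank_features(order, components=COMPONENTS):
    """Approach 2: repair position divided by the number of damaged components."""
    n_damaged = len(order)
    position = {name: (i + 1) / n_damaged for i, name in enumerate(order)}
    return [position.get(name, 0.0) for name in components]


def comes_before_features(order, components=COMPONENTS):
    """Approach 3: one value per unordered pair; None stands in for NA."""
    position = {name: i for i, name in enumerate(order)}
    features = []
    for first, second in combinations(components, 2):
        if first in position and second in position:
            features.append(position[first] < position[second])
        else:
            # At least one component of the pair was not damaged.
            features.append(None)
    return features


if __name__ == "__main__":
    for order, result in [(["A", "C", "B"], 11), (["B", "A", "C"], 7)]:
        print(order, "result =", result)
        print("  rank:           ", rank_features(order))
        print("  normalized rank:", normalized_rank_features(order))
        print("  comes-before:   ", comes_before_features(order))
```

Each function returns one feature row in the column order of the corresponding table, so stacking the rows for all data points would give a matrix that a regressor could consume, provided the NA values are handled in a way the chosen library accepts.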

Option #3 is nice because the columns only take the values {True, False, NA}, but it has a nightmare-inducing problem: my 1600 components become roughly 1.28 million feature columns (1600 × 1599 / 2 = 1,279,200).
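
To make the size of option #3 concrete, the sketch below counts the pairwise columns and shows that any single row defines only a small fraction of them. The 50-component bound is taken from the scenario described earlier; the `sparse_comes_before` helper and its dictionary output are hypothetical illustrations of storing only the non-NA pairs, not a recommendation of a specific library format.

```python
from itertools import combinations
from math import comb

N_COMPONENTS = 1600  # fixed number of components in the system
MAX_DAMAGED = 50     # assumed upper bound on damaged components per data point

# One comes-before column per unordered pair of components.
total_pair_columns = comb(N_COMPONENTS, 2)

# Only pairs where both components are damaged have a non-NA value.
max_defined_per_row = comb(MAX_DAMAGED, 2)


def sparse_comes_before(order, component_index):
    """Return {(i, j): bool} for pairs of damaged components only, with i < j.

    `component_index` maps a component name to its fixed column index.
    The value is True when component i was repaired before component j.
    """
    position = {component_index[name]: pos for pos, name in enumerate(order)}
    features = {}
    for i, j in combinations(sorted(position), 2):
        features[(i, j)] = position[i] < position[j]
    return features


if __name__ == "__main__":
    print("pairwise columns:", total_pair_columns)
    print("max non-NA values per row:", max_defined_per_row)
    index = {"A": 0, "B": 1, "C": 2, "D": 3}
    print(sparse_comes_before(["A", "C", "B"], index))
```

Under these assumptions fewer than 0.1% of the pairwise columns are defined in any one row, so the matrix is extremely sparse even though it is very wide.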

Is there some other way to transform the sequence into useful ML features?
