TransWikia.com

Determining a correct ML approach

Data Science Asked on December 5, 2021

I have little idea how to choose an ML approach for the following problem. It is a binary classification problem with two classes, positive and negative. There are about 100k samples, and each sample is structured like this:

Period = 1min   Pattern = M1>S1>T1>B2>M2>S2>S3>T3>M3>B3
Period = 5min   Pattern = S1>M1>T1>B2>S2>M2>S3>T3>M3>B3
Period = 10min  Pattern = M1>T1>S1>M2>B2>S2>S3>T3>M3>B3
Period = 15min  Pattern = M1>T1>S1>B2>S3>M2>S2>T3>M3>B3
Period = 20min  Pattern = S1>M1>S3>T1>B2>M2>S2>T3>M3>B3
Period = 30min  Pattern = S1>S3>B2>M1>T1>S2>M2>T3>M3>B3
Period = 60min  Pattern = S1>B2>M1>T1>S2>M2>S3>T3>B3>M3
Period = 120min Pattern = S1>M1>T1>B2>S2>M2>T3>S3>M3>B3 

This sample is classified as negative. A sample is composed of 8 periods. Within each period there is a pattern such as M1>S1>T1>B2>M2>S2>S3>T3>M3>B3. Each pattern has 10 elements, and their positions vary across samples and periods. We need a solution that can tell which periods or orderings of elements are responsible for the classification.

Let’s say we have positive examples p1, p2, p3 and negative examples n1, n2, n3 for the 1min period, like this:

p1: M1>S1>T1>B2>M2>S2>S3>T3>M3>B3
p2: M1>S1>T1>B2>S2>M2>S3>T3>M3>B3
p3: M1>S1>T1>B2>M2>S2>S3>T3>M3>B3

n1: M1>S1>T1>B2>S2>M2>S3>T3>B3>M3
n2: M1>S1>T1>B2>M2>S2>S3>T3>B3>M3
n3: M1>S1>T1>B2>M2>S2>S3>T3>B3>M3

It can be inferred that the first 4 elements (M1, S1, T1, B2) are irrelevant for classification, since they are the same across all samples. The 5th and 6th elements are also irrelevant, since their order is not consistent within a class. However, the order of M3 and B3 is a solid indicator, since M3>B3 in all positive samples and B3>M3 in all negative samples.
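To make that reasoning concrete, here is a minimal sketch that scans every pair of elements and keeps the pairs whose relative order is the same within each class but flipped between classes. The helper names `ranks` and `discriminating_pairs` are made up for illustration; the data is just the six 1min examples above.

```python
from itertools import combinations

# The six 1min toy examples from above.
positives = [
    "M1>S1>T1>B2>M2>S2>S3>T3>M3>B3",
    "M1>S1>T1>B2>S2>M2>S3>T3>M3>B3",
    "M1>S1>T1>B2>M2>S2>S3>T3>M3>B3",
]
negatives = [
    "M1>S1>T1>B2>S2>M2>S3>T3>B3>M3",
    "M1>S1>T1>B2>M2>S2>S3>T3>B3>M3",
    "M1>S1>T1>B2>M2>S2>S3>T3>B3>M3",
]

def ranks(pattern):
    """Map each element to its 0-based position in the pattern."""
    return {element: i for i, element in enumerate(pattern.split(">"))}

def discriminating_pairs(positives, negatives):
    """Return (first, second) pairs where `first` precedes `second` in every
    positive pattern and follows it in every negative pattern."""
    pos_ranks = [ranks(p) for p in positives]
    neg_ranks = [ranks(n) for n in negatives]
    elements = sorted(pos_ranks[0])
    found = []
    for a, b in combinations(elements, 2):
        # Does `a` come before `b`? Collect the answer for each sample.
        pos_orders = {r[a] < r[b] for r in pos_ranks}
        neg_orders = {r[a] < r[b] for r in neg_ranks}
        consistent = len(pos_orders) == 1 and len(neg_orders) == 1
        if consistent and pos_orders != neg_orders:
            a_first_in_positives = next(iter(pos_orders))
            found.append((a, b) if a_first_in_positives else (b, a))
    return found

print(discriminating_pairs(positives, negatives))  # [('M3', 'B3')]
```

On these six examples only the M3/B3 pair survives; the M2/S2 pair is dropped because its order is mixed within each class.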

Thanks.

One Answer

I think all you need is to build proper features.

For each period and element, I would build a categorical feature (for example, the position of that element within the period's pattern). With 8 periods and 10 elements, that is 80 categorical features. It looks like there aren't many possible values per feature; if there are, say, 3 or 4 possible values for each, one-hot encoding would give you 240-320 features.
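A rough sketch of that encoding, assuming the value of each feature is the 1-based position of the element within the period's pattern (one possible reading, not the only one). The function names `categorical_features` and `one_hot` are hypothetical, and the data is the single negative sample from the question.

```python
# The negative sample from the question: period -> pattern.
sample = {
    "1min":   "M1>S1>T1>B2>M2>S2>S3>T3>M3>B3",
    "5min":   "S1>M1>T1>B2>S2>M2>S3>T3>M3>B3",
    "10min":  "M1>T1>S1>M2>B2>S2>S3>T3>M3>B3",
    "15min":  "M1>T1>S1>B2>S3>M2>S2>T3>M3>B3",
    "20min":  "S1>M1>S3>T1>B2>M2>S2>T3>M3>B3",
    "30min":  "S1>S3>B2>M1>T1>S2>M2>T3>M3>B3",
    "60min":  "S1>B2>M1>T1>S2>M2>S3>T3>B3>M3",
    "120min": "S1>M1>T1>B2>S2>M2>T3>S3>M3>B3",
}

def categorical_features(sample):
    """One feature per (period, element); the value is the element's
    1-based position in that period's pattern (an assumed encoding)."""
    features = {}
    for period, pattern in sample.items():
        for position, element in enumerate(pattern.split(">"), start=1):
            features[(period, element)] = position
    return features

def one_hot(rows):
    """Turn a list of feature dicts into 0/1 rows, with one column per
    (feature, value) combination observed in the data."""
    columns = sorted({(name, value) for row in rows for name, value in row.items()})
    matrix = [[1 if row.get(name) == value else 0 for name, value in columns]
              for row in rows]
    return columns, matrix

features = categorical_features(sample)
columns, matrix = one_hot([features])
print(len(features))  # 80 categorical features for one sample
```

With a single sample every feature has exactly one observed value, so there are 80 one-hot columns; the column count grows as more samples with different positions are encoded.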

Then you can apply some kind of feature selection, such as Lasso (L1 regularization), and train your model on the selected features.
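As an illustration of L1-based selection, here is a small self-contained sketch. Since the label is binary, it swaps plain Lasso for L1-penalised logistic regression, fitted with a hand-rolled proximal gradient loop so that no library is needed; in practice a library implementation would likely be preferable. The penalty `lam`, step size `lr`, iteration count, the missing intercept, and the position-based one-hot encoding are all assumptions chosen for the toy 1min examples from the question.

```python
import math

# Toy 1min examples from the question: (pattern, label), 1 = positive, 0 = negative.
examples = [
    ("M1>S1>T1>B2>M2>S2>S3>T3>M3>B3", 1),
    ("M1>S1>T1>B2>S2>M2>S3>T3>M3>B3", 1),
    ("M1>S1>T1>B2>M2>S2>S3>T3>M3>B3", 1),
    ("M1>S1>T1>B2>S2>M2>S3>T3>B3>M3", 0),
    ("M1>S1>T1>B2>M2>S2>S3>T3>B3>M3", 0),
    ("M1>S1>T1>B2>M2>S2>S3>T3>B3>M3", 0),
]

def encode(pattern):
    """Set of (element, 1-based position) pairs present in one pattern."""
    return {(e, i) for i, e in enumerate(pattern.split(">"), start=1)}

rows = [encode(p) for p, _ in examples]
labels = [y for _, y in examples]
columns = sorted(set().union(*rows))  # one-hot column names
X = [[1.0 if c in row else 0.0 for c in columns] for row in rows]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink the value towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(X, y, lam=0.05, lr=0.1, n_iter=1000):
    """L1-penalised logistic regression (no intercept), fitted by
    proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        # Prediction errors under the current weights.
        residuals = [sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - yi
                     for row, yi in zip(X, y)]
        for j in range(d):
            grad = sum(r * row[j] for r, row in zip(residuals, X)) / n
            # Gradient step on the log-loss, then shrinkage for the L1 term.
            w[j] = soft_threshold(w[j] - lr * grad, lr * lam)
    return w

weights = fit_l1_logistic(X, labels)
selected = {c: round(wj, 2) for c, wj in zip(columns, weights) if wj != 0.0}
print(selected)
```

With these settings only the four columns describing the positions of M3 and B3 keep non-zero weights; the constant columns and the M2/S2 columns are shrunk to exactly zero, which matches the reasoning in the question.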

Answered by David Masip on December 5, 2021
