Data Science Asked on December 5, 2021
I have little idea how to choose an ML approach for the following problem. It is a binary classification problem with two classes, positive and negative. There are about 100k samples, and each sample is structured like this:
Period = 1min Pattern = M1>S1>T1>B2>M2>S2>S3>T3>M3>B3
Period = 5min Pattern = S1>M1>T1>B2>S2>M2>S3>T3>M3>B3
Period = 10min Pattern = M1>T1>S1>M2>B2>S2>S3>T3>M3>B3
Period = 15min Pattern = M1>T1>S1>B2>S3>M2>S2>T3>M3>B3
Period = 20min Pattern = S1>M1>S3>T1>B2>M2>S2>T3>M3>B3
Period = 30min Pattern = S1>S3>B2>M1>T1>S2>M2>T3>M3>B3
Period = 60min Pattern = S1>B2>M1>T1>S2>M2>S3>T3>B3>M3
Period = 120min Pattern = S1>M1>T1>B2>S2>M2>T3>S3>M3>B3
This sample is classified as negative. A sample is composed of 8 periods. Within each period there is a pattern such as M1>S1>T1>B2>M2>S2>S3>T3>M3>B3. Each pattern has 10 elements, and their positions change across samples and periods. We need a solution that can tell which periods or orderings of elements are responsible for the classification.
Let’s say we have positive examples p1, p2, p3 and negative examples n1, n2, n3 for the 1min period, like this:
p1: M1>S1>T1>B2>M2>S2>S3>T3>M3>B3
p2: M1>S1>T1>B2>S2>M2>S3>T3>M3>B3
p3: M1>S1>T1>B2>M2>S2>S3>T3>M3>B3
n1: M1>S1>T1>B2>S2>M2>S3>T3>B3>M3
n2: M1>S1>T1>B2>M2>S2>S3>T3>B3>M3
n3: M1>S1>T1>B2>M2>S2>S3>T3>B3>M3
It can be inferred that the first 4 elements, M1, S1, T1 and B2, are irrelevant for classification since they occupy the same positions across all samples. The 5th and 6th elements are also irrelevant since their order is not consistent within either class. However, the relative order of B3 and M3 is a solid signal, since M3>B3 holds for all positive samples and B3>M3 for all negative samples.
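The manual inference above can be sketched in code. This is only an illustration on the six 1min examples: the helper name `discriminating_pairs` is hypothetical, and it assumes every pattern contains the same 10 elements. It reports a pair only when its order is constant within each class and flipped between the classes.

```python
from itertools import combinations

def discriminating_pairs(positives, negatives):
    """Return (a, b) pairs where a precedes b in every positive pattern
    and b precedes a in every negative pattern."""
    def ranks(pattern):
        # Map each element to its 0-based position in the pattern.
        return {element: i for i, element in enumerate(pattern.split(">"))}

    pos = [ranks(p) for p in positives]
    neg = [ranks(n) for n in negatives]
    result = []
    for a, b in combinations(sorted(pos[0]), 2):
        pos_orders = {r[a] < r[b] for r in pos}  # {True}, {False}, or both
        neg_orders = {r[a] < r[b] for r in neg}
        consistent = len(pos_orders) == 1 and len(neg_orders) == 1
        if consistent and pos_orders != neg_orders:
            result.append((a, b) if pos_orders == {True} else (b, a))
    return result

positives = [
    "M1>S1>T1>B2>M2>S2>S3>T3>M3>B3",  # p1
    "M1>S1>T1>B2>S2>M2>S3>T3>M3>B3",  # p2
    "M1>S1>T1>B2>M2>S2>S3>T3>M3>B3",  # p3
]
negatives = [
    "M1>S1>T1>B2>S2>M2>S3>T3>B3>M3",  # n1
    "M1>S1>T1>B2>M2>S2>S3>T3>B3>M3",  # n2
    "M1>S1>T1>B2>M2>S2>S3>T3>B3>M3",  # n3
]

pairs = discriminating_pairs(positives, negatives)
print(pairs)  # [('M3', 'B3')]
```

The M2/S2 pair is skipped because its order varies inside each class, and pairs among the first four elements are skipped because their order is identical in both classes.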
Thanks.
I think all you need is to build proper features.
For each period and element, I would build a categorical feature (for example, the element's position in that period's pattern). With 8 periods and 10 elements, that is 80 categorical features. It looks like there aren't many possible values per feature. Let's say there are 3 or 4 possible values for each feature: one-hot encoding would then give you 240-320 features.
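As a rough sketch of that encoding, the snippet below turns the sample from the question into 80 categorical features. It assumes the feature value is the element's position in the pattern (1 = leftmost); the function name `encode_sample` and the `"period:element"` key format are made up for illustration.

```python
def encode_sample(sample):
    """Map {period: pattern} to {"<period>:<element>": position}."""
    features = {}
    for period, pattern in sample.items():
        for position, element in enumerate(pattern.split(">"), start=1):
            features[f"{period}:{element}"] = position
    return features

# The negative sample shown in the question.
sample = {
    "1min": "M1>S1>T1>B2>M2>S2>S3>T3>M3>B3",
    "5min": "S1>M1>T1>B2>S2>M2>S3>T3>M3>B3",
    "10min": "M1>T1>S1>M2>B2>S2>S3>T3>M3>B3",
    "15min": "M1>T1>S1>B2>S3>M2>S2>T3>M3>B3",
    "20min": "S1>M1>S3>T1>B2>M2>S2>T3>M3>B3",
    "30min": "S1>S3>B2>M1>T1>S2>M2>T3>M3>B3",
    "60min": "S1>B2>M1>T1>S2>M2>S3>T3>B3>M3",
    "120min": "S1>M1>T1>B2>S2>M2>T3>S3>M3>B3",
}

features = encode_sample(sample)
print(len(features))         # 80
print(features["60min:M3"])  # 10

# One-hot form: one binary column per (feature, value) combination.
one_hot = {f"{name}={value}": 1 for name, value in features.items()}
```

Across the whole dataset, the one-hot columns are the union of the `name=value` keys seen in any sample, with 0 filled in where a sample lacks that key.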
Then you can do some kind of feature selection, like Lasso, and train your model with the selected features.
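To show what Lasso-style selection does here, below is a minimal pure-Python sketch on the six 1min examples from the question. It swaps in L1-penalized logistic regression (fitted by proximal gradient descent) because the target is a class label. All function names and the values of `lam`, `lr` and `steps` are assumptions chosen for this toy data. On real data a library implementation such as scikit-learn's `LogisticRegression` with `penalty="l1"` would be the more sensible choice.

```python
import math

def one_hot_positions(patterns):
    """One binary column per (element, position) pair seen in the data."""
    names = sorted({(e, i) for p in patterns
                    for i, e in enumerate(p.split(">"), start=1)})
    index = {name: j for j, name in enumerate(names)}
    rows = []
    for p in patterns:
        row = [0.0] * len(names)
        for i, e in enumerate(p.split(">"), start=1):
            row[index[(e, i)]] = 1.0
        rows.append(row)
    return names, rows

def soft_threshold(value, threshold):
    # Proximal step for the L1 penalty: shrink toward zero, clip at zero.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def l1_logistic(X, y, lam=0.05, lr=0.2, steps=500):
    """Fit logistic regression with an L1 penalty on the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        preds = [1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, row)))))
                 for row in X]
        err = [p - t for p, t in zip(preds, y)]
        b -= lr * sum(err) / n  # the intercept is not penalized
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(err, X)) / n
            w[j] = soft_threshold(w[j] - lr * grad, lr * lam)
    return w, b

positives = ["M1>S1>T1>B2>M2>S2>S3>T3>M3>B3",
             "M1>S1>T1>B2>S2>M2>S3>T3>M3>B3",
             "M1>S1>T1>B2>M2>S2>S3>T3>M3>B3"]
negatives = ["M1>S1>T1>B2>S2>M2>S3>T3>B3>M3",
             "M1>S1>T1>B2>M2>S2>S3>T3>B3>M3",
             "M1>S1>T1>B2>M2>S2>S3>T3>B3>M3"]

names, X = one_hot_positions(positives + negatives)
y = [1.0] * len(positives) + [0.0] * len(negatives)
weights, bias = l1_logistic(X, y)

# Features whose weight survived the L1 shrinkage.
selected = [name for name, w in zip(names, weights) if w != 0.0]
print(selected)  # [('B3', 9), ('B3', 10), ('M3', 9), ('M3', 10)]
```

Only the columns that describe where B3 and M3 sit keep a non-zero weight, which matches the manual reasoning in the question. Columns that are constant, or that vary independently of the class, are shrunk to exactly zero.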
Answered by David Masip on December 5, 2021