What would be the best machine learning approach for sets of varying sizes?

Question

I have the following problem: I have two different sets of labels (extracted using NER), and for a given combination of labels from the first set (a, b, c or d) I have a supervised best combination of labels from the second set (x, y, z) as the "answer".
The problem is that both sets can vary in size.
Hypothetical training data would look something like:
{a1,b2,c4,d1} -> {x2,y4,z5}
{a1,b1,c1} -> {x2,y2,z1}
{a4,b2,c4,d1} -> {x1,y3,z5}
...
{a4,b2,c4,d3} -> {x1,y3,z5,w2}

Of course, new combinations from the first set will appear, and I'd like to use ML to give the best prediction for them.
So, what would be the best machine learning approach for that situation?

Brian Spiering · Answer

This could be modeled as multi-label classification.  The features are nominal values, and the targets are the presence or absence of nominal values.
There is a wide variety of algorithms that can learn multi-label classification. Which one is "best" is an empirical question that depends on the specific dataset. One popular option is a random forest classifier.
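As a rough illustration of the multi-label setup, here is a minimal sketch assuming scikit-learn is available. It uses only the four example rows shown in the question (the elided rows are left out), multi-hot encodes both sides with `MultiLabelBinarizer`, and fits a `RandomForestClassifier` on the resulting 0/1 matrices. The variable names and the `new_input` example are made up for illustration, and four rows are far too few to learn anything meaningful.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data copied from the question: input label sets -> target label sets.
inputs = [
    {"a1", "b2", "c4", "d1"},
    {"a1", "b1", "c1"},
    {"a4", "b2", "c4", "d1"},
    {"a4", "b2", "c4", "d3"},
]
targets = [
    {"x2", "y4", "z5"},
    {"x2", "y2", "z1"},
    {"x1", "y3", "z5"},
    {"x1", "y3", "z5", "w2"},
]

# Multi-hot encode both sides so sets of any size become fixed-length 0/1 rows.
input_encoder = MultiLabelBinarizer()
target_encoder = MultiLabelBinarizer()
X = input_encoder.fit_transform(inputs)
Y = target_encoder.fit_transform(targets)

# Passing a 2-D 0/1 target matrix makes the forest predict every label at once.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, Y)

# Hypothetical unseen combination; labels absent from training would be
# ignored (with a warning) by the fitted input encoder.
new_input = [{"a1", "b2", "c4"}]
predicted = model.predict(input_encoder.transform(new_input))

# Map the predicted 0/1 row back to a variable-size set of target labels.
predicted_sets = target_encoder.inverse_transform(predicted.astype(int))
print(predicted_sets)
```

Because the prediction is one 0/1 flag per target label, the decoded output can have any number of labels, which is what lets the answer set vary in size.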
There is also a "strict" version of the problem where each combination is considered a unique label. Then it would be multi-class classification. But the targets might be too sparse to learn.
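One quick way to judge whether the strict version is feasible is to count how often each whole target combination occurs. The sketch below uses only the standard library and the four example rows from the question; on real data the same count would show how many classes have too few examples to learn from.

```python
from collections import Counter

# Target sets from the four example rows in the question.
targets = [
    {"x2", "y4", "z5"},
    {"x2", "y2", "z1"},
    {"x1", "y3", "z5"},
    {"x1", "y3", "z5", "w2"},
]

# Treat each whole set as one class; frozenset is hashable and ignores order.
combo_counts = Counter(frozenset(t) for t in targets)

# Classes with a single training example are effectively unlearnable.
singletons = sum(1 for n in combo_counts.values() if n == 1)
print(f"{len(combo_counts)} distinct combinations, {singletons} seen only once")
# prints: 4 distinct combinations, 4 seen only once
```

Here every combination appears exactly once, which is the sparsity concern in miniature.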
