Is there any OR way to solve this problem?

Question

For each observations there are 207 variables (binary, either a 'symptom' happened or not), class variable is also binary.
For each variable or symptom there is a weight attached (currently set manually to be between -5 and 50) and for each observation there is a critical line (there are 3 different critical lines). Matrix of dummy variables is multiplied by weights and resulting matrix is added up across different columns for each observations resulting in some score. If this score is higher then particular critical line associated with observation then prediction is 1, otherwise is 0.
The problem is to set up those weights and critical lines optimally. I obviously have a data set to see which symptoms usually correspond to '1' in prediction.
For me it looks like an optimization problem but obviously the prediction itself can be made with machine learning but I am looking for another resource.
Question is: do you guys know any areas of OR or can point out me some keywords to look at on how this type of problems are solved? I am good with Python so if you want to recommend me some packages I am more then happy. The only thing I though about is to randomly generate weights in (-5, 50) interval and for loads of trials maybe I will find ones that correspond to best accuracy (point is to minimize false positives).
Thank you!
-EDIT 20.07
My current formulation is as follows:
max( sum over N (t_i * s_i))
st.
(M x')_i >= L_i then s_i = 1
(M x')_i < L_i then s_i = 0
sum over N(s_i) =< 0.06N
where N is the number of observation, M number of variables
x is a vector of weights, M is a NxM matrix of dummy variables where each row represents one observation therefore Mx' results in a Nx1 vector of cumulative weights for each observation.
As I mentioned in the comment, optimal cut-off line L = [L_1, ..., L_n] is also a part of the problem. Vector of true allocations t is known.
The point is that once I got cut-off lines and weights, the system would process new observations using them.
I also don't want to many positive s_i and this is another constrain of the problem.
Thank you for all comments, I am new to stack exchange so please bear with me.

Blockquote

AirSquid · Answer

This sure sounds like you guys are taking the long road to Logistic Regression....
You have a bunch of observations, presumably with outcomes to do the training or calculate the model, right?
Each observation has 207 data elements that are numerical.  (Some/many of those will likely be dropped in the final model)
And you want to make a model from that to use on new data to predict 1/0 outcomes?
This is classic logistic regression, which should be your starting point (easiest) and then maybe some ML model, but this is not optimization unless you consider the calculation of weights for logistical regression an optimization problem.

huig · Answer

Given weights I can easily calculate how good these weights are for predicting but how can I determine weights?

From the answer above, $x$ would represent the weights and $b$ the cut point to decide whether a sample belongs to $S_0$ or $S_1$, those are the two variables in the OR problem. $a$ represents the observations from the sample. Solving that problem in linear programming would give you the resulting weights as well as the cut point.

huig · Answer

There are multiple ways to solve this problem, in my opinion it would be more of a ML problem but you can do with linear programming.
Let $a_i$ be the array of features for element $i$. Assuming you have a sample where given $a_i$ you are told the class it belongs to ($S_0$ or $S_1$), let $x$ be the matrix of weights and let $bin[0,1]$ be an scalar. Establishing that
begin{equation}
a_i'x geq b Longleftrightarrow a_i'in S_0
end{equation}
begin{equation}
a_i'x lt b Longleftrightarrow a_i'in S_1
end{equation}
Then, we could say the given sample should be classified correctly:
begin{equation*}
a_i'x ge b, hspace{10mm} iin S_0 \
a_i'x lt b, hspace{10mm} iin S_1 
end{equation*}
There is no need for an objective function, although you may need one in case the problem is infeasible (there is no linear separation). In that case your objective function could be to maximize the accuracy of your predictions, recall, f1-score, depends on the problem.

Is there any OR way to solve this problem?

3 Answers

Add your own answers!

Ask a Question