Multiple Regression, Classification and Boundary Poins

Question

I have two gangs which are doing crimes. And i want to classify them.
Lets say I'm looking for a regression function:
 M(x1, x2) = w1x1 + w2x2 + w3

Now I have found all three parameters w1, w2, w3.
Now I want to do classification. I get some boundary points which look like a line and they separate two classes from each other. Should I do another regression over that boundary points so that i have a exact line for my separation?
Because lets say i want Point(5,3). I want to know if its more likely that the crime is done from Gang A or B. But I have just some boundary points to separate. Should I use them for a regression?

WBM · Accepted Answer

I think you want a clustering algorithm rather than regression. You will have a decision boundary between clusters of data points which will determine whether that particular point e.g. 5,3 belongs to group (cluster) A or group (cluster) B
Fit your clustering model in x1 vs x2 feature space. Take the image below, we have x1 and x2 as the x&y axis. And then the black lines are the decision boundaries, essentially the model.

You can of course just cluster between 2 groups also, as in your case.
You can checkout different clustering algorithms here

HS-nebula · Answer

If you want to classify members into two gangs, you should use a classification algorithm, not a regression algorithm. A regression algorithm is used to predict  values for members, while a classification algorithm is used to predict class membership (i.e. which gang would a person who commits these crimes likely belong to?).
Which classification algorithm should you use? Two models immediately come to mind: Logistic regression (yes, called "regression" but it's a classification algorithm) and Support Vector Machines. Both models will give you probabilities for a person who has committed certain crimes belonging to either gang. Both models will give you weights for the features. The Support Vector Machine will create a decision boundary (hyperplane, like the line you drew separating the classes in the picture) based on the data points, with a few points (support vectors) that influence the position and direction of the hyperplane. You can use these to get an equation for the hyperplane. Logistic regression will give you an equation as well, one that is easily interpretable for the effects each feature has on predicting the gang.
How do you decide which to use? Run both and see if their results make sense and give you the appropriate information.

Multiple Regression, Classification and Boundary Poins

2 Answers

Add your own answers!

Ask a Question