Data Science Asked on August 6, 2020
I am not sure if "minimize correlation" is the right title for this question, but I could not find a better phrase to describe what I would like to achieve.
Let's say I have a black box with multiple inputs and a single output. I know one of the inputs and the output, and I have multiple example recordings of both. This known input modifies the output in an undesired way, so I would like to get rid of the "noise" caused by it. The transfer function for this input can safely be assumed to be linear.
What I am doing right now is looping through the example recordings, fitting a linear regression model to predict the unwanted contribution, and subtracting it from the measured output signal of each example. Afterwards, I average all the corrected output signals to reveal the meaningful data beyond the noise.
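The loop described above can be sketched in Python with NumPy. The data here is synthetic and the coefficient 1.7 is an arbitrary assumption, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example recordings: each is (known_input, measured_output).
# The output is a linear contribution from the known input, plus a hidden
# signal of interest, plus measurement noise.
signal = np.sin(np.linspace(0, 2 * np.pi, 500))  # hidden signal of interest
recordings = []
for _ in range(20):
    x = rng.normal(size=500)                     # known input
    y = 1.7 * x + signal + rng.normal(scale=0.1, size=500)
    recordings.append((x, y))

cleaned = []
for x, y in recordings:
    # Fit y ~ a*x + b by least squares and subtract the fitted contribution.
    a, b = np.polyfit(x, y, deg=1)
    cleaned.append(y - (a * x + b))

# Average the corrected recordings to reveal the underlying signal.
average = np.mean(cleaned, axis=0)
```

Because least-squares residuals are orthogonal to the regressor, each cleaned recording has (numerically) zero linear correlation with the known input, and averaging then suppresses the remaining noise.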
This strategy seems to work according to the following plot:
The x-axis is the known input signal and the y-axis is the output signal; blue and green dots represent the averaged data before and after applying the linear regression procedure, respectively. The lines are the best fit for each data set.
You can see that the green line (the "cleaned" dataset) has the smaller slope, meaning that the output variable is considerably less linearly correlated with the input than it was before. Therefore, I assume the regression technique explained above is working as expected.
My question: looking at the plot, is there a mathematical procedure to directly "project" the original dataset so that the correlation between the input and output variables is minimized? Is there a mathematical trick that avoids running the regression on every example dataset while obtaining a similar result?
My written English is not the best, so please feel free to comment on the question if you need further explanation.
Any code is welcome, but Python (pandas, NumPy, etc.) and Matlab are preferred. Theoretical explanations are also very welcome.
Please allow me to paraphrase your question to make sure I get your question right.
Suppose you have 1 million data entries. Each entry consists of three inputs $X_1$, $X_2$, $X_3$ and one output $Y$. One of the three inputs, $X_1$, is noise. You want to find the relation between $X_1$ and $Y$ so that you can remove the impact of $X_1$ on $Y$.
Let's use a simplified example: $Y = 2X_1 + 3X_2 + 4X_3$. If you update each of the 1 million entries with $Y_{new} = Y - 2X_1$, you can then examine the updated entries to find the relation between the inputs $X_2$, $X_3$ and $Y_{new}$: $Y_{new} = 3X_2 + 4X_3$.
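A minimal sketch of this subtraction on synthetic data, assuming the coefficient of $X_1$ (here 2) has already been estimated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical entries following Y = 2*X1 + 3*X2 + 4*X3.
X1, X2, X3 = rng.normal(size=(3, 1000))
Y = 2 * X1 + 3 * X2 + 4 * X3

coef_X1 = 2.0             # assumed already estimated, e.g. by regression
Y_new = Y - coef_X1 * X1  # remove the X1 contribution

# Y_new now follows 3*X2 + 4*X3, with the X1 contribution removed.
```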
The relation between $X_1$ and $Y$ can be computed with different algorithms, such as regression or decision trees. How do you measure which algorithm generates the best result, i.e. truly reflects the relation between $X_1$ and $Y$ and hence removes the impact of $X_1$ on $Y$ as much as possible?
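One simple way to score candidates (a sketch, not part of the original answer): check how much linear correlation with $X_1$ remains in the residual after subtracting a model's predicted $X_1$ contribution; the closer to zero, the more of the impact was removed. The function and data below are hypothetical:

```python
import numpy as np

def residual_correlation(x1, y, predicted_from_x1):
    """Pearson correlation between X1 and the residual left after
    subtracting a model's predicted X1 contribution from Y.
    A value near zero means the model removed most of X1's impact."""
    residual = y - predicted_from_x1
    return np.corrcoef(x1, residual)[0, 1]

# Example with synthetic data: Y depends linearly on X1 plus other effects.
rng = np.random.default_rng(2)
x1 = rng.normal(size=10_000)
other = rng.normal(size=10_000)
y = 2.0 * x1 + other

score_good = residual_correlation(x1, y, 2.0 * x1)  # near-perfect model
score_bad = residual_correlation(x1, y, 0.5 * x1)   # poor model
```

Note that this only measures the *linear* part of the remaining relation; for nonlinear leftovers you would need a different metric (e.g. mutual information).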
The problem you're solving is an interesting one. I would use a free tool such as Knime to find out which algorithm best models the relation between $X_1$ and $Y$; it's faster than coding it in Python or Matlab.
Details: to put the problem intuitively, suppose the price of a house is determined by its location, its size, and its condition, and assume these three factors are independent of each other. You want to remove the impact of "condition of the house" on "price" as much as possible. In essence, you want to find the relation between "condition of the house" and "price", which may not be linear.
You can let Knime read the 1 million entries of "condition of the house" and "price", try Decision Tree, SVM, or even ensemble learning, and see which algorithm generates the best model of $X_1$ and $Y$. In other words, trial and error :)
Answered by hostjc on August 6, 2020