Asked by Viktor Katzy on October 1, 2021
TL;DR: What is the impact of a linear trend on the correlation between time series that are (most likely) not spuriously correlated?
I’m currently trying to reconstruct/cross-validate an analysis delivered by one of my company's contractors.
The data is based on time series of sensor data (approx. 3.5m timestamps). The goal was to find the signals with the highest correlation with one specific signal.
Despite not being an expert in data science, I was able to reproduce their data cleaning (drop columns with zero variance, interpolate linearly over smaller gaps, drop remaining columns containing NaN values). But after that, I'm not sure I can confirm their findings.
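For reference, those cleaning steps can be reproduced with a short pandas sketch like the one below (the interpolation gap limit of 10 samples is an assumed placeholder; the actual threshold is not stated):

```python
import pandas as pd

# df: DataFrame of sensor signals indexed by timestamp (assumed already loaded)

# 1. drop columns with zero variance
df = df.loc[:, df.std() > 0]

# 2. interpolate linearly over smaller gaps (limit of 10 samples is an assumption)
df = df.interpolate(method="linear", limit=10)

# 3. drop any remaining columns that still contain NaN values
df = df.dropna(axis=1, how="any")
```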
They seemingly computed a simple Pearson correlation like
corr = df.corrwith(df['DesiredSignal'])
Yet looking at the data, the signals definitely seem to be trended.
When I then apply a detrend function like
from scipy import signal
import pandas as pd

df_d = signal.detrend(df[column])   # remove the linear trend from one column
df_n = pd.DataFrame(data=df_d)      # collect the detrended values in a new DataFrame
and apply the corrwith function to this new DataFrame, I get totally different results (e.g. a significantly higher number of strongly negative correlations).
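For completeness, one way to detrend every column and then apply corrwith in a single step would be the following sketch (only an illustration, not necessarily the exact code used):

```python
import pandas as pd
from scipy import signal

# remove the linear trend from every column, keeping index and column names
df_detrended = pd.DataFrame(
    signal.detrend(df.values, axis=0),  # detrend along the time axis
    index=df.index,
    columns=df.columns,
)

# correlate every detrended signal with the detrended target signal
corr_detrended = df_detrended.corrwith(df_detrended["DesiredSignal"])
print(corr_detrended.sort_values())
```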
My question now is: can I trust the findings of the contractor, or are they rendered invalid by not considering the influence of trends on correlation, or am I getting something completely wrong?
The four main measures of correlation are Pearson, Kendall rank, Spearman and point-biserial (the last of which is not applicable to this type of problem). For simplicity, I'll only explain how a trend affects measuring Pearson correlation.
Let's assume $X$ represents a sinusoidal time series without trend, $x_t = \sin(t)$; $Y$ represents $X$ with an upwards linear trend, $y_t = t + \sin(t)$; and $Z$ represents $X$ with a downwards linear trend, $z_t = -(t + \sin(t))$. All series have identical timestamps and the same unit of measurement (for ease of plotting).
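A small sketch reproducing this setup and computing the pairwise Pearson coefficients (the time grid length and resolution are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

t = np.linspace(0, 50, 1000)   # arbitrary time grid, for illustration only
series = pd.DataFrame(
    {
        "X": np.sin(t),          # sinusoid without trend
        "Y": t + np.sin(t),      # X plus an upward linear trend
        "Z": -(t + np.sin(t)),   # X plus a downward linear trend
    },
    index=t,
)

print(series.corr(method="pearson"))
# corr(Y, Z) is exactly -1 because Z = -Y, while corr(X, Y) is small and
# misleading: the X-Y relationship is simply not linear.
```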
One of the assumptions behind measuring Pearson correlation between two time series is linearity: when both series are plotted against one another on a scatter plot, there should be a linear relationship.
As you can see, $X$ and $Y$ do not satisfy this condition, so Pearson correlation is the wrong statistical measure to use for them, whereas for $Y$ and $Z$ it is appropriate. Why, though?
Pearson correlation measures the degree to which values deviate from the line of best fit between the two series. If the relationship is not linear, its strength will not be measured accurately. This can be shown by plotting the Pearson correlation coefficient as $t$ increases.
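A rough sketch of that check, computing the coefficient over expanding windows of $t$ (continuing from the series DataFrame built in the snippet above):

```python
# continuing from the series DataFrame built above
# Pearson coefficient over expanding windows of t
for n in (100, 250, 500, 1000):
    window = series.iloc[:n]
    r_xy = window["X"].corr(window["Y"])
    r_yz = window["Y"].corr(window["Z"])
    print(f"first {n} points: corr(X, Y) = {r_xy:+.3f}, corr(Y, Z) = {r_yz:+.3f}")
# corr(Y, Z) stays at -1 for every window, while corr(X, Y) drifts as the
# trend term comes to dominate the variance of Y.
```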
Notably, $X$ and $Y$ also violate the assumption of a monotonic relationship required by both Spearman and Kendall rank correlation, so you cannot measure correlation between $X$ and $Y$ with any of these methods unless the data is transformed to satisfy the underlying assumptions, as you do in the question post.
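As an illustration of such a transformation, removing the linear trends with scipy.signal.detrend recovers the underlying sinusoids, and the Pearson coefficients then behave as expected (again a sketch reusing the series DataFrame from above):

```python
import pandas as pd
from scipy import signal

# remove the best-fit linear trend from each series before correlating
detrended = pd.DataFrame(
    {name: signal.detrend(series[name].values) for name in series.columns},
    index=series.index,
)

print(detrended.corr(method="pearson"))
# after detrending, corr(X, Y) is close to +1 and corr(X, Z) is close to -1,
# which matches the underlying sinusoids.
```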
Linear trends, therefore, don't have a strictly positive or negative impact on measuring correlation. You just have to account for the underlying assumptions of whichever correlation measure you need to use.
To paraphrase Hanlon's razor:
It is better to assume ignorance than malicious intent.
If you provide your feedback, the analyst will have an opportunity to discuss why they chose to pursue a certain route, to realise that what they did was incorrect, or to recognise that they misunderstood the requirements and/or limitations of the project.
Hopefully, this leads to a more positive outcome, given you want the best results and the analyst wants to provide the best service.
Correct answer by mwtmurphy on October 1, 2021