
How to improve the efficiency of periodicity detection for long-time-baseline, gapped datasets

Computational Science Asked on February 16, 2021

Our data set has $10^4$ data points, but it spans a long baseline with many gaps. As the histogram shows, the horizontal axis is time and the vertical axis is data counts; for most of the time span there are no data. The total observing time is in fact short (green lines), but the gaps make the time baseline very long.
[Histogram: data counts versus time, with short observing intervals (green lines) separated by long gaps]

If we bin the data, there would be $10^8$ [t, value] data points, but only about $1\%$ of them are non-zero; after binning, most of the values are zero because of the gaps.

How can we improve the detection efficiency (i.e., find a faster method)?

Is a multi-threaded approach possible (especially for Lomb-Scargle)?

One Answer

One way to approach it could be to do the analysis in two steps. First, do a sweep along your dataset and collect all nonzero values and their times into a (much shorter!) data structure; basically, you collect a list of [t, value] tuples. Every sweep after that will be extremely fast, since you can safely assume that every data point not in your list is zero (a small sketch of this follows below).

Clicking on your link, I have a hard time understanding what I see. These are not simple CSV-style data points. They are of the form:

139459196.2742752731

139462208.5806673169

139462689.1677284241

139467485.6161292493

...

Are these just your times, with the second number (the value) omitted?
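For concreteness, here is a minimal sketch of the two-step idea, assuming the data have already been binned into [t, value] pairs; the file name and column layout are placeholders, not your actual format:

    import numpy as np

    # Placeholder file of binned rows "t value"; name and layout are assumptions.
    t, value = np.loadtxt("binned_data.txt", unpack=True)

    # Step 1: one sweep that keeps only the non-zero bins (~1% of the 10^8 bins).
    mask = value != 0
    t_nz, value_nz = t[mask], value[mask]

    # Step 2: all later passes work on the short arrays; any time that is
    # not present in t_nz can safely be treated as zero.
    print(f"kept {t_nz.size} of {t.size} bins")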

How do you want to go about the periodicity detection? The obvious way here is to use a Fourier transform, which will show you at which frequencies you have periodicity. Have you looked into fast Fourier transforms for Python (scipy.fftpack)? I would be surprised if there were no parallel implementation of a Fourier transform in Python. If that is still not fast enough, there is the FFTW library, boldly calling itself the "Fastest Fourier Transform in the West", but I have to warn you that it is a pain in the ass to use. [EDIT:] There is a Python wrapper for FFTW, pyFFTW.
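As an illustration of the FFT route (using scipy.fftpack, as mentioned above), here is a toy example on an evenly binned series; in practice `value` would be your binned counts and `dt` your bin width:

    import numpy as np
    from scipy import fftpack

    dt = 1.0                                      # bin width in seconds (assumed)
    t = np.arange(100_000) * dt
    value = 1.0 + np.sin(2 * np.pi * t / 500.0)   # toy signal with a 500 s period

    spectrum = fftpack.fft(value)
    freqs = fftpack.fftfreq(value.size, d=dt)

    # Look for the strongest peak among the positive frequencies.
    pos = freqs > 0
    peak = freqs[pos][np.argmax(np.abs(spectrum[pos]))]
    print(f"dominant period ~ {1.0 / peak:.1f} s")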

You seem to have an impressive amount of data. You might get away with doing some averaging, depending on your precision needs. If you take, say, the first 10 data points, take the maximum amplitude, store it in a shorter array, and repeat for the following chunks, you may shorten your dataset significantly (with a loss of precision on your time axis, of course).
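A rough sketch of that reduction, with a chunk size of 10 and simply trimming any leftover samples at the end:

    import numpy as np

    chunk = 10
    value = np.random.rand(100_003)   # placeholder for the binned series

    # Trim so the length is a multiple of the chunk size, then take the
    # maximum of every block of `chunk` consecutive samples.
    n = (value.size // chunk) * chunk
    reduced = value[:n].reshape(-1, chunk).max(axis=1)

    # The time axis is now 10x coarser, but the array is 10x shorter.
    print(value.size, "->", reduced.size)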

Answered by MPIchael on February 16, 2021

