Which predictive model is appropriate?

Question

I'm completely lost when trying to choose the type of predictive model for my problem. Should it be an autoregressive model, a nonlinear time series model, a Markov chain, or something else? Can someone please give me some advice?

78, 
18, 
51, 
89, 
19, 
43, 
62, 
28, 
94, 
49

Suppose that every day I'm given 10 data points, such as the example listed above. They're random numbers generated by two devices, Device A and Device B. Each device generates random digits from 0 to 9.

The first digit of each data point is generated by Device A, while the second digit is generated by Device B. For instance, in the first data point, "78", the "7" was generated by Device A and the "8" by Device B. Similarly, in the last data point, "49", the "4" was generated by Device A and the "9" by Device B.
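As a minimal sketch, the split into one digit sequence per device could be written in Python as follows. The function name `split_devices` is made up for illustration, and the code assumes each data point is stored as an integer from 0 to 99.

```python
def split_devices(values):
    """Split two-digit data points into per-device digit sequences.

    Assumes each value is an integer in 0..99; the tens digit is taken
    as Device A's output and the units digit as Device B's output.
    """
    device_a = [v // 10 for v in values]
    device_b = [v % 10 for v in values]
    return device_a, device_b


# The example day listed above.
day = [78, 18, 51, 89, 19, 43, 62, 28, 94, 49]
device_a, device_b = split_devices(day)
print(device_a)  # tens digits, one per data point
print(device_b)  # units digits, one per data point
```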

I want to be able to predict the next data point after the last "49".

I have a total of 300 historical data points covering 30 days.

From my initial investigation of the 300 data points, each device tends to produce repeated sequences. For instance, Device A repeats the sequence "6-2-9-4" (as in the last 4 data points above): this sequence appears twice within the 300 historical data points for Device A. As another example, the sequence "8-1-9-9" (the 2nd to the 5th data points above) appears twice for Device B. Each device produces at least three repeated sequences.
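A rough way to check such repeats programmatically is to count every window of a fixed length in one device's digit sequence. The sketch below is only illustrative: the function names are hypothetical, and the chance estimate assumes uniform, independent digits, which may not hold for the real devices.

```python
from collections import Counter
from math import comb


def repeated_ngrams(seq, n=4):
    """Return every length-n window that occurs more than once in seq."""
    windows = (tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    counts = Counter(windows)
    return {gram: c for gram, c in counts.items() if c > 1}


def expected_chance_pairs(length, n=4, alphabet=10):
    """Expected number of matching window pairs if digits were uniform and
    independent (an assumption, not a property of the real devices)."""
    windows = length - n + 1
    return comb(windows, 2) / alphabet ** n


# Toy sequence in which 6-2-9-4 appears twice.
toy = [6, 2, 9, 4, 0, 6, 2, 9, 4]
print(repeated_ngrams(toy))
print(expected_chance_pairs(300, n=4))
```

Under that uniformity assumption, 300 digits contain 297 windows of length 4, so about C(297, 2) / 10^4 ≈ 4.4 matching pairs would be expected by chance alone. A handful of repeated 4-digit sequences is therefore not, by itself, evidence of a pattern.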

I'd like to predict the next data point after the last "49". Which model is most appropriate?

Thank you in advance!

Erwan · Answer

I will assume that by "random" you mean that the numbers don't follow any particular mathematical function. If they were really random then there wouldn't be any pattern to discover so there would be no point trying to predict anything.

From your description I understand the following:

The value of the digit doesn't have any numerical property. In particular the natural order doesn't play any role. This suggests that the digits can be considered as categorical variables.
The data is sequential (the order in the sequence of digits matters) but there is no notion of time involved.
Apparently the two devices produce two independent sequences. You might want to check this: if they are independent, you need two distinct models, one for each device; otherwise you should use a single joint model.
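One possible way to check the independence assumption is a permutation test on the empirical mutual information between the two digit sequences. This is a sketch under assumptions, not a definitive test: the function names are hypothetical, the digits are treated as categorical, and shuffling ignores any sequential structure within each device.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) between paired
    categorical sequences xs and ys of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def permutation_p_value(xs, ys, trials=1000, seed=0):
    """Fraction of random pairings whose mutual information is at least as
    large as the observed one. Small values suggest dependence."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

A large p-value is compatible with modelling each device separately; a small one suggests a joint model over digit pairs. With only 300 pairs spread over 100 possible combinations, the test has limited power, so treat the result as a hint rather than proof.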

Based on these observations I would use a simple sequential model such as a Hidden Markov Model or a Conditional Random Field.
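Before fitting either of those, a plain Markov chain over the observed digits makes a useful baseline. Note that this is a simpler technique than the Hidden Markov Model or Conditional Random Field named above: it has no hidden states and only counts which digit followed each context. The function names and the `order` parameter are illustrative assumptions.

```python
from collections import Counter, defaultdict


def fit_transitions(seq, order=1):
    """Count which digit follows each context of `order` previous digits."""
    table = defaultdict(Counter)
    for i in range(len(seq) - order):
        context = tuple(seq[i:i + order])
        table[context][seq[i + order]] += 1
    return table


def predict_next(table, seq, order=1):
    """Return estimated next-digit probabilities given the last `order`
    digits of seq, or None if that context was never observed."""
    context = tuple(seq[-order:])
    counts = table.get(context)
    if not counts:
        return None
    total = sum(counts.values())
    return {digit: c / total for digit, c in counts.most_common()}


# Device A digits from the example day in the question (far too few
# observations for a real model; shown only to illustrate the calls).
device_a = [7, 1, 5, 8, 1, 4, 6, 2, 9, 4]
table = fit_transitions(device_a, order=1)
print(predict_next(table, device_a, order=1))
```

If the predicted distributions on held-out days are no better than guessing each digit with probability 1/10, that is a sign the devices may be effectively random and that a more complex model is unlikely to help.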

ignatius · Answer

What you want to do is discover the patterns behind two pseudo-random number generators, which are supposed to be independent and uncorrelated. Pseudo-random generators, from the most naive to the more sophisticated, are not purely random; if they were, it would be impossible to predict anything effectively, as Erwan has pointed out.

For an example of a generator that is easy to crack, see:

https://github.com/lemire/crackingxoroshiro128plus

Here is an interesting paper on this topic:

https://arxiv.org/ftp/arxiv/papers/1801/1801.01117.pdf
