I have read that BERT was one of the state-of-the-art word embedding methods in 2018, and that XLNet was proposed in 2019 to address BERT's limitations. One limitation of BERT I have seen mentioned is the maximum length of the input, which is 512 tokens (see this link). Does anyone know the reason for this limit?

It's an arbitrary value. It is the longest input sequence the authors assumed they would need. Presumably, they didn't have longer sequences in the training set. Moreover, you can always truncate a sequence and ignore the more distant history, in which case the maximum length is simply the farthest history you consider useful. 512 is a power of two, which also suggests that the value was chosen arbitrarily by a computer-science-minded person.
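To make the truncation idea concrete, here is a minimal sketch in plain Python. The helper name `truncate_for_bert` and the `keep` parameter are hypothetical, not part of any BERT library. It assumes the input is already split into tokens and that two slots are reserved for the `[CLS]` and `[SEP]` markers BERT adds around a single sequence.

```python
MAX_LEN = 512  # the input limit discussed above


def truncate_for_bert(tokens, max_len=MAX_LEN, keep="head"):
    """Hypothetical helper: cut a token list so it fits in max_len positions.

    Two positions are reserved for the [CLS] and [SEP] markers.
    keep="head" keeps the beginning of the text, keep="tail" keeps the end.
    """
    if keep not in ("head", "tail"):
        raise ValueError("keep must be 'head' or 'tail'")
    budget = max_len - 2  # room left for real tokens
    if budget < 0:
        raise ValueError("max_len must be at least 2")
    tokens = list(tokens)
    if len(tokens) > budget:
        if keep == "head":
            tokens = tokens[:budget]
        else:
            # Slice from an explicit start index so budget == 0 gives [].
            tokens = tokens[len(tokens) - budget:]
    return ["[CLS]"] + tokens + ["[SEP]"]


if __name__ == "__main__":
    long_text = ["w%d" % i for i in range(1000)]
    print(len(truncate_for_bert(long_text)))  # never more than MAX_LEN
```

Whether to keep the head or the tail depends on where the useful information sits in your documents, so treat the default here as an assumption rather than a recommendation.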

