
WaveNet - how are the skip connections from the residual blocks utilized?

Data Science Asked by Arham on May 19, 2021

I’ve been attempting to implement the WaveNet paper: https://arxiv.org/pdf/1609.03499v2.pdf

In the paper, the main diagram they use to describe the architecture is this one:
[Figure: WaveNet architecture diagram]

The paper mentions the use of residual and skip connections in order to enable the training of deeper networks, which I understand. But what I do not understand is why they extract the skip values and sum them before passing them to the last portion of the network.
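To make my current understanding concrete, here is a minimal sketch of one residual block as I read the diagram. The channel count, weight shapes, and the collapse of the dilated filter to a 1x1 matrix are all my own simplifications, not the paper's actual configuration; the point is only that every block's skip output has the same shape, which is what makes summing them possible:

```python
import numpy as np

# My reading of one block (hypothetical channel count C=4): gated
# activation, then two 1x1 convs producing a skip branch and a
# residual branch. The dilated filter is collapsed to a per-timestep
# matrix purely to keep the sketch short.
rng = np.random.default_rng(0)
C, T = 4, 16                          # residual channels, time steps

def residual_block(x, w_f, w_g, w_skip, w_res):
    z = np.tanh(w_f @ x) * (1.0 / (1.0 + np.exp(-(w_g @ x))))  # gate
    skip = w_skip @ z                 # branch sent to the skip sum
    residual = x + w_res @ z          # branch passed to the next block
    return residual, skip

x = rng.standard_normal((C, T))
skips = []
for _ in range(3):                    # three stacked blocks
    ws = [rng.standard_normal((C, C)) * 0.1 for _ in range(4)]
    x, s = residual_block(x, *ws)
    skips.append(s)

skip_sum = sum(skips)                 # every skip is (C, T), so this works
```

If this sketch is wrong about where the skip branch is taken from, that may be the root of my confusion.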

In the paper, they state that they are attempting to predict the value of the sequence at time T, given the values x_0, ..., x_(T-1). They mu-law quantize each sample into one of 256 values (classes 0 to 255) and output a probability distribution describing the likelihood that the next sample belongs to each of those 256 classes. Therefore, this last portion of the network should output a probability vector fitting the above description.
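For reference, this is the mu-law companding step from the paper as I've implemented it (mu = 255), mapping waveform samples in [-1, 1] to 256 integer classes:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Mu-law compand a sample in [-1, 1], then bin it into 0..255."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # compressed lies in [-1, 1]; rescale and round to integer bins
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

samples = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
classes = mu_law_encode(samples)   # silence (0.0) maps to class 128
```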

My questions are:

  1. Does the description above have any misunderstandings about how the model functions?
  2. Why do the authors extract the skip values? Why do they sum the skip values? What do these values "mean" in relation to the problem of audio generation?
  3. If they are simply summing the skip vectors how do they control for dimensionality? Shouldn’t each residual block have a different dimension for its output due to the varying dilation factors? If so, how do we end up with a probability vector with 256 entries describing the probability that the next sample is in one of those classes (how are the dimensions tailored to reach the desired output dimension)?
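Regarding question 3, here is how I currently read the post-skip head from the diagram: sum of skips, ReLU, 1x1 conv, ReLU, then a 1x1 conv to 256 channels and a softmax over channels. The channel counts below are made up, and this may be exactly where my understanding breaks down:

```python
import numpy as np

# Hypothetical output head: the skip sum keeps some fixed channel
# count C at every time step, and only the final 1x1 conv projects
# to 256 channels, one per quantized class.
rng = np.random.default_rng(1)
C, T = 4, 16                                 # skip channels, time steps

skip_sum = rng.standard_normal((C, T))       # stand-in for summed skips
w1 = rng.standard_normal((C, C)) * 0.1       # first 1x1 conv
w2 = rng.standard_normal((256, C)) * 0.1     # second 1x1 conv -> 256 classes

h = np.maximum(w1 @ np.maximum(skip_sum, 0), 0)   # ReLU -> conv -> ReLU
logits = w2 @ h                                    # (256, T)
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
next_sample_dist = probs[:, -1]              # distribution for time T
```

If the skip outputs really do share one channel dimension regardless of dilation factor, then this head would explain how the 256-way vector is produced, but I'd appreciate confirmation.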

