I have heard and read the following statements many times, and over time they have left me quite confused.
Statement 1: The goal of machine learning is to get a function from the given data.

Statement 2: The goal of machine learning is to find the underlying distribution function of the given data.
From these two statements I generally take the underlying function to be the probability distribution function of the given data.
But I do not understand the relation between that probability distribution function and the function we want to learn for a particular task.
Let us consider the following example.
A random experiment $E$ has a sample space $\Omega$, and I have defined two random vectors $X_1, X_2$ on $\Omega$. I am using a neural network for my task. The domain of the neural network is the range of $X_1$, and I expect the range of the neural network to be the range of $X_2$, with the correct mapping that satisfies the data. Let $f$ be the actual function, mapping the range of $X_1$ to the range of $X_2$, that we want the neural network to become. Assume that I have the joint probability distribution $P(X_1, X_2)$ for my dataset.
Now, what is the relation between the actual function $f$ that we are approximating with the neural network, which transforms the range of $X_1$ into the range of $X_2$, and the joint distribution $P$?
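To make the setup concrete, here is a minimal sketch in PyTorch using an assumed toy joint distribution (not my actual data): pairs $(x_1, x_2)$ are sampled from $P(X_1, X_2)$, and a small network is trained to map values in the range of $X_1$ to values in the range of $X_2$.

```python
import torch
import torch.nn as nn

# Toy joint distribution P(X1, X2): X1 ~ N(0, 1) and X2 = sin(X1) + noise.
# This is an assumed example for illustration only.
torch.manual_seed(0)
n = 1000
x1 = torch.randn(n, 1)
x2 = torch.sin(x1) + 0.1 * torch.randn(n, 1)

# A small network meant to approximate f: range(X1) -> range(X2).
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(net(x1), x2)  # the loss compares the network's output against x2 only
    loss.backward()
    opt.step()

print("final training loss:", loss.item())
```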
The function you are thinking about in statement 1 and the one in statement 2 are the same function. Allow me to use your framework to describe it.
The domain of the neural network is the range of $X_1$, and there exists some distribution function $F_{X_1}(x)$ that describes how the values of $X_1$ are generated. The same is true for $X_2$ and $F_{X_2}(x)$. Then there exists $f$ which maps:
$$ f: \mathbf{range}(X_1) \rightarrow \mathbf{range}(X_2) $$
Note that this is not $\dot{f}: X_1 \rightarrow X_2$ (I am not sure we can so easily define a function whose domain and range are spaces with different densities; an interesting idea to explore, but off-topic for us here).
Now, here I'll diverge from your framework. We will assume that we have in fact constructed and trained a neural network which estimates (with good confidence) $f: \mathbf{range}(X_1) \rightarrow \mathbf{range}(X_2)$. What we can say about this neural network, and about $f$, is:
$f$ can be used to predict, given a value from the range of $X_1$, a corresponding value in the range of $X_2$. This follows from our understanding that our model is a good predictor.
The model models $F_{X_2}(x)$. In other words, $f$ is $F_{X_2}(x)$, and it models the distribution $P(X_2)$.
In order for $f$ to be a model of $F_{X_2}(x)$, we must have:
$$ f: \mathbf{domain}(X_2) \rightarrow \mathbf{range}(X_2) \Longleftrightarrow f: \mathbf{range}(X_1) \rightarrow \mathbf{range}(X_2) $$
This is true by the way we designed our neural network. We said that the network receives the range of $X_1$ as its input, which is the domain of the neural network. We also said that the output of the network is the range of $X_2$; therefore the network itself is the model of $P(X_2)$.
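As a rough illustration of this claim, continuing the assumed toy setup from the question (so the numbers are only indicative): push fresh samples of $X_1$ through the trained network and compare the empirical distribution of its outputs with the empirical distribution of $X_2$.

```python
# Continuing the assumed toy example: the network's outputs on fresh draws of X1
# give an empirical picture of the distribution of X2.
with torch.no_grad():
    x1_new = torch.randn(5000, 1)                           # fresh draws from F_{X1}
    y_pred = net(x1_new)                                    # the network's image of range(X1)
    x2_new = torch.sin(x1_new) + 0.1 * torch.randn(5000, 1) # fresh draws of X2

print("network outputs:", y_pred.mean().item(), y_pred.std().item())
print("actual X2      :", x2_new.mean().item(), x2_new.std().item())
```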
The neural network receives values drawn from $F_{X_1}(x)$ as input but compares its output against values in $F_{X_2}(x)$, and it can only change (train) itself against values it can compare something to (its output). In other words, the network does not attempt to understand the distribution of its input. If the input is just a single sample from $F_{X_1}(x)$, we should still be able to evaluate $f$ correctly.
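A small sketch of this point, again under the assumed toy setup: the trained network can be evaluated on a single input without any reference to $F_{X_1}(x)$.

```python
# Evaluating f on a single point needs no knowledge of F_{X1}:
# one sample goes in, one prediction comes out (toy model from above).
single_x1 = torch.randn(1, 1)          # a single draw, with no notion of its density
with torch.no_grad():
    single_pred = net(single_x1)       # evaluating f on one point works fine
print(single_x1.item(), "->", single_pred.item())
```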
Reality may sometimes seem different though, and the distribution of the inputs may in practice affect the training of a neural network. For example, classes or regions with very little support would impact the score of the network (and consequently its estimate of $P(X_2)$). Yet the idea of little support is not relevant for us here, since we are thinking of $X_1$ as a random variable whose expected value comes from a big enough number of samples.
The constraint that $X_2$ must be from the same sample space ($\Omega$) as $X_1$ is not needed, although it is possible. This constraint would affect how the network is trained (it is a constraint after all), yet it would be the same constraint on how $P(X_2)$ looks.
Answered by grochmal on January 24, 2021