
Extremely basic question: how are data assumed to be generated in machine learning?

Cross Validated Asked by Fraïssé on December 20, 2021

Given a dataset $\mathcal{D} = \{x_i\}$, $i = 1, \ldots, N$, $x_i \in \mathbb{R}$.

In machine learning, what assumption is made as to how data are generated?

I’ve seen two basic ideas circulating around, and basically no comment on which idea is more valid:

1. There exists a random variable $X$ whose outcomes are the $x_i$, that is, $X \in \{x_1, \ldots, x_N\}$. $X$ is distributed according to some distribution $P_X$, and the data are sampled sequentially from $P_X$ through some independent process.
2. There exists a random vector $X = (X_1, \ldots, X_N)$, where each $X_i$ has a single realization $x_i$. $X$ has a joint distribution $P_X = P_{X_1, \ldots, X_N}$, and the data are sampled once from $P_X$.
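
To make the two views concrete, here is a minimal sketch of what I have in mind. It is only an illustration: the standard normal distribution, the shared-factor construction, the parameter `rho`, and the function names `sample_iid` and `sample_joint` are my own assumptions, not taken from any particular reference.

```python
import math
import random


def sample_iid(n, rng):
    """View 1: draw n independent samples from one distribution P_X.

    P_X = N(0, 1) is an arbitrary choice made for illustration.
    """
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def sample_joint(n, rho, rng):
    """View 2: draw ONE realization of a random vector (X_1, ..., X_n).

    Each X_i is marginally N(0, 1), and any two distinct components have
    correlation rho (0 <= rho <= 1) because they share the common factor z.
    With rho = 0 the components are independent, as in View 1.
    """
    z = rng.gauss(0.0, 1.0)  # shared factor that induces dependence
    return [
        math.sqrt(rho) * z + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        for _ in range(n)
    ]


rng = random.Random(0)
iid_data = sample_iid(5, rng)                   # N draws of a scalar X
joint_data = sample_joint(5, rho=0.9, rng=rng)  # one draw of an N-vector
```

In this sketch both calls return $N$ real numbers, so the two views cannot be told apart from the data alone. The difference is in the assumptions: when the components are independent and identically distributed, the joint distribution factorizes as $P_{X_1, \ldots, X_N}(x_1, \ldots, x_N) = \prod_{i=1}^{N} P_X(x_i)$, so the first view looks like a special case of the second.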

Which generative process is more valid/common in (different models of) machine learning? Please provide a reference as backup if possible.
