Percentage of smaller dataset with respect to bigger dataset

Question

I have two datasets, each a list of multidimensional real-valued vectors. One dataset (call it $A=\{x_1, x_2, x_3, \dots, x_n\}$) is large; the other (call it $B=\{x_1, x_2, x_3, \dots, x_m\}$) is far smaller and is a subset of the larger one ($B \subset A$). The smaller one, $B$, comes from some sampling process, and what I want to calculate is what fraction of the smaller dataset (obtained from sampling) is in the bigger one.
Additionally, since these are real-valued vectors, I can't compare them directly one by one, so a clustering algorithm may be employed. Also, the size of one dataset is much bigger than the other: $|A| \gg |B|$.

Erwan · Answer

Naive approach: define a similarity or distance function, for instance cosine similarity.

Calculate the similarity score between every pair $(x_i \in A, y_j \in B)$.
Define a precision level, say $\epsilon = 0.000001$. The assumption is that it's extremely unlikely that two vectors in $A$ would be this close by chance.
For every $y_j \in B$, find the set $c(y_j) = \{ x_i \in A \mid \mathrm{sim}(x_i, y_j) \geq 1-\epsilon \}$.
Obtain the union: $C(B) = \{ x_i \in A \mid \exists y_j \in B : x_i \in c(y_j) \}$.

The proportion of elements of $A$ which are "equal" to an element in $B$ is:
$$\frac{|C(B)|}{|A|}$$
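
The steps above could be sketched roughly as follows. This is only a minimal illustration in plain Python, not a tuned implementation: the helper names `cosine_similarity` and `matched_fraction` and the toy vectors are made up for the example, and it assumes every vector is non-zero and of the same length.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def matched_fraction(A, B, eps=1e-6):
    """Return |C(B)| / |A|: the share of A with a near-identical vector in B."""
    matched = set()  # indices i of A such that x_i belongs to some c(y_j)
    for y in B:
        for i, x in enumerate(A):
            if cosine_similarity(x, y) >= 1 - eps:
                matched.add(i)
    return len(matched) / len(A)

# Toy data (hypothetical): B repeats two of the four vectors in A.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 2.0]]
B = [[0.0, 1.0], [1.0, 2.0]]
fraction = matched_fraction(A, B)
print(fraction)
```

The double loop costs $|A| \cdot |B|$ similarity evaluations, so for a large $A$ a vectorized or indexed nearest-neighbour search would probably be needed. Note also that cosine similarity ignores vector length, so two vectors pointing in the same direction with different magnitudes count as "equal" here; a distance function such as Euclidean distance avoids that.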
