TransWikia.com

Percentage of smaller dataset with respect to bigger dataset

Data Science Asked on June 22, 2021

I have two datasets, each a list of multidimensional real-valued vectors. One dataset, $A=\{x_1, x_2, x_3, \dots, x_n\}$, is large; the other, $B=\{x_1, x_2, x_3, \dots, x_m\}$, is far smaller and is a subset of the larger one ($B \subset A$). The smaller one, $B$, comes from some sampling process, and what I want to do is calculate what fraction of the smaller dataset (obtained from sampling) is in the bigger one.
Additionally, since these are real-valued vectors, I can't compare them directly one by one, so a clustering algorithm may be employed. Also, one dataset is much bigger than the other: $|A| \gg |B|$.

One Answer

Naive approach: define a similarity or distance function, for instance cosine similarity.

  1. Calculate the similarity score between every pair $(x_i \in A, y_j \in B)$.
  2. Define a precision level, say $\epsilon = 0.000001$. The assumption is that it's extremely unlikely that two vectors would be this close by chance in $A$.
  3. For every $y_j \in B$, find the set $c(y_j) = \{ x_i \in A \mid \mathrm{sim}(x_i, y_j) \geq 1-\epsilon \}$.
  4. Obtain the union: $C(B) = \{ x_i \in A \mid \exists y_j \in B : x_i \in c(y_j) \}$.

The proportion of elements of $A$ which are "equal" to an element in $B$ is:

$$\frac{|C(B)|}{|A|}$$
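
The steps above could be sketched in Python with NumPy roughly as follows. This is only a minimal illustration: the function name `matched_fraction` and its `eps` parameter are made up for the example, the rows of `A` and `B` are assumed to be non-zero vectors, and the full $|A| \times |B|$ similarity matrix is built in memory, so a large $A$ would need to be processed in chunks.

```python
import numpy as np

def matched_fraction(A, B, eps=1e-6):
    """Fraction of rows of A whose cosine similarity to some row of B is >= 1 - eps.

    Hypothetical helper for illustration; assumes no row is the zero vector.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)

    # Normalise every row so that a plain dot product equals cosine similarity.
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)

    # Step 1: similarity for every pair (x_i, y_j); shape is (|A|, |B|).
    sim = A_unit @ B_unit.T

    # Steps 3-4: x_i belongs to C(B) if it is within eps of at least one y_j.
    in_C = (sim >= 1.0 - eps).any(axis=1)

    # Final ratio |C(B)| / |A|.
    return in_C.mean()

# Toy data: two of the four rows of A also appear in B.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]]
B = [[0.0, 1.0], [2.0, 3.0]]
fraction = matched_fraction(A, B)
```

Note that cosine similarity ignores vector length, so `[1, 2]` and `[2, 4]` would count as "equal" here. If magnitude matters, a distance such as the Euclidean norm with a small threshold may be a better fit for the same procedure.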

Answered by Erwan on June 22, 2021
