Data Science Asked on June 22, 2021
I have two datasets, each a list of multidimensional real-valued vectors. One dataset, call it $A=\{x_1, x_2, x_3, \dots, x_n\}$, is large; the other, call it $B=\{x_1, x_2, x_3, \dots, x_m\}$, is far smaller and is a subset of the bigger one ($B \subset A$). The smaller one, $B$, comes from some sampling process, and what I want to calculate is what fraction of the smaller (obtained from sampling) is in the bigger.
Additionally, since these are real-valued vectors, I can't compare them directly one by one, so a clustering algorithm may be employed. Also, one dataset is much larger than the other: $|A| \gg |B|$.
Naive approach: define a similarity or distance function, say for instance cosine similarity.
The proportion of elements of $A$ which are "equal" to an element in $B$ is:
$$\frac{|C(B)|}{|A|}$$
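A minimal sketch of this naive approach in Python with NumPy might look like the following. It rests on assumptions that are not stated in the answer: that $C(B)$ denotes the set of elements of $A$ whose cosine similarity to at least one element of $B$ reaches some threshold, that no vector is all zeros, and that the full $|A| \times |B|$ similarity matrix fits in memory. The function name `matched_fraction` and the default `threshold=0.99` are hypothetical, chosen only for illustration.

```python
import numpy as np


def matched_fraction(A, B, threshold=0.99):
    """Estimate |C(B)| / |A| under the assumed definition of C(B).

    Assumption: an element of A counts as "equal" to an element of B when
    their cosine similarity is at least `threshold` (an arbitrary choice).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)

    # Scale every row to unit length so a dot product is a cosine similarity.
    # Assumes no row is the zero vector.
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)

    # sims[i, j] is the cosine similarity between A[i] and B[j].
    sims = A_unit @ B_unit.T

    # A[i] is matched if its most similar element of B reaches the threshold.
    matched = sims.max(axis=1) >= threshold

    # The mean of a boolean array is the fraction of True entries.
    return float(matched.mean())


if __name__ == "__main__":
    # Toy data: two of the four rows of A point almost exactly along B's row.
    A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.001]]
    B = [[1.0, 0.0]]
    print(matched_fraction(A, B))
```

The result depends heavily on the threshold, so it is worth checking a few values against pairs known to be the same or different. For a very large $A$, the similarity matrix could be computed in row chunks instead of all at once.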
Answered by Erwan on June 22, 2021