TransWikia.com

Dissimilarity Matrix of non-metric proximity data

Data Science Asked by ninji on December 9, 2020

We currently have a coding exercise in which we are asked to implement Constant Shift Embedding (Paper). This in itself is not a big problem. All the algorithm needs is a symmetric, zero-diagonal dissimilarity matrix of some non-metric proximity data. The algorithm then embeds the data into a vector space, so that commonly known denoising and dimensionality reduction methods can be used to improve the results of, for example, k-means clustering.
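
For reference, the embedding step can be sketched roughly as follows. This is only a minimal sketch, assuming the usual formulation of Constant Shift Embedding (centralize the dissimilarities, shift the off-diagonal entries by twice the magnitude of the smallest eigenvalue, then eigendecompose). The names `constant_shift_embedding` and `n_components` are made up for illustration and are not taken from the exercise framework.

```python
import numpy as np

def constant_shift_embedding(D, n_components=None):
    """Sketch of a constant-shift embedding for a symmetric, zero-diagonal D.

    Returns the embedded points X (one row per object) and the shifted
    dissimilarity matrix, whose entries act as squared Euclidean distances.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    S_c = -0.5 * Q @ D @ Q                       # centralized similarity matrix
    lam_min = np.linalg.eigvalsh(S_c)[0]         # smallest eigenvalue (<= 0)
    # Shift every off-diagonal dissimilarity by the constant -2 * lam_min.
    D_shift = D - 2.0 * lam_min * (np.ones((n, n)) - np.eye(n))
    S_shift_c = -0.5 * Q @ D_shift @ Q           # now positive semidefinite
    evals, evecs = np.linalg.eigh(S_shift_c)
    order = np.argsort(evals)[::-1]              # largest eigenvalues first
    evals = np.clip(evals[order], 0.0, None)     # remove tiny negative round-off
    evecs = evecs[:, order]
    p = n if n_components is None else n_components
    X = evecs[:, :p] * np.sqrt(evals[:p])        # coordinates in p dimensions
    return X, D_shift
```

Keeping only the leading components (a small `n_components`) is where the denoising and dimensionality reduction would come in before running k-means on `X`.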

Given the e-mail communications in this data set, how would I go about choosing a reasonable dissimilarity matrix?

The data is simply a list of unique pairs, where at least one e-mail has been sent from node A to node B. This gives rise to a graph of around 1000 nodes and 25000 edges.

Creating an adjacency matrix of this undirected graph might be a first step (which is also already provided in the framework).

I’m thankful for any pointers in the right direction.

EDIT: Overnight I had an idea:

Let’s say we only have 8 nodes. Now compare the proximity vectors of two vertices element by element. Suppose, for example, the two vectors look like this:

1 0 0 0 1 0 1 1

0 1 0 0 0 1 0 1

Their dissimilarity would be 5, since the vectors differ in 5 positions (this is the Hamming distance).

Now normalize by the total number of nodes, which gives 5/8.

This also incorporates how many neighbors two nodes share, instead of only looking at direct edges, and might therefore give better results when we later try to cluster the nodes.
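
A minimal sketch of this idea in NumPy, assuming the proximity vectors are the 0/1 rows of a matrix `P` (for the full graph, the adjacency matrix). The function name `hamming_dissimilarity` is my own choice for illustration:

```python
import numpy as np

def hamming_dissimilarity(P):
    """Normalized Hamming distance between all pairs of 0/1 rows of P."""
    P = np.asarray(P, dtype=np.int64)
    ones = P.sum(axis=1)                 # number of 1s in each row
    common = P @ P.T                     # positions where both rows are 1
    # For binary rows a, b: #differences = |a| + |b| - 2 * (a . b)
    diff = ones[:, None] + ones[None, :] - 2 * common
    return diff / P.shape[1]             # normalize by the vector length

P = np.array([
    [1, 0, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 1],
])
D = hamming_dissimilarity(P)
print(D[0, 1])  # 0.625, i.e. 5/8
```

The result is symmetric with a zero diagonal. The matrix product avoids building an n-by-n-by-n intermediate array, which matters for a graph with around 1000 nodes.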

Let me know what you think.

One Answer

Maybe I did not completely understand your question, but I think the answer you are looking for is one of the following:

  • You may want to fill an n-by-n matrix with 1 if person $i$ has sent e-mail(s) to person $j$, and 0 otherwise.

  • Alternatively, you may want to fill the n-by-n matrix with the number of e-mails sent from person $i$ to person $j$.

Both measures are distances in the mathematical definition.

For clarity:

You could program the dissimilarity matrix as $M[i,j] = 1$ if the pair of people in your data exists.
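
A rough sketch of that construction, assuming the nodes are labeled with integers from 0 to n-1 and the data is available as a list of `(i, j)` pairs. The function name `matrix_from_pairs` and the `symmetric` flag are hypothetical, chosen only for this example:

```python
import numpy as np

def matrix_from_pairs(pairs, n_nodes, symmetric=True):
    """Build M with M[i, j] = 1 if the pair (i, j) appears in the data."""
    M = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j in pairs:
        M[i, j] = 1
        if symmetric:
            M[j, i] = 1      # treat the graph as undirected
    return M

pairs = [(0, 1), (1, 2), (3, 0)]   # made-up example data
M = matrix_from_pairs(pairs, n_nodes=4)
print(M)
```

Note that in this matrix a 1 marks a connected pair, so if larger values are supposed to mean "more different", one option would be to use `1 - M` with the diagonal set to zero instead.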

Answered by Juan Esteban de la Calle on December 9, 2020
