How do I read the cord_19_embeddings_2020-07-16.csv from the COVID-19 Open Research Dataset Challenge (CORD-19) on Kaggle?

Question

What I'm trying to do
I wanted to use the CORD19 word embeddings csv to map them to certain findings from the rest of the dataset, but as we can see there are no stings in the first column.
The way I know word or sentence embeddings, is what they map a word or a sentence to multiple vectors.
The values in the first column look somewhat like hashes, and they are the main problem I can't use the dataset.
Can somebody give me a pointer on what I'm looking at and how to use them?
I have not found documentation, usage examples or submissions on kaggle that have explained or outlined how this file is supposed to be used.

Tobias Kolb · Accepted Answer

So, after a lot of digging, I found something in the comment section.
They are document embeddings.

There is a github repo that specifies an API.

Paper on arxiv

Example usage of a similar approach

Relevant Comments from the Kaggle Comment section on the Data Update Log for the CORD19 Dataset:

Comment 1

Comment 2

Examples how to visualize the embeddings in a Jupyter Notebook:
import pandas as pd
from whatlies import Embedding, EmbeddingSet
#Docs: https://rasahq.github.io/whatlies/api/embeddingset/

#transponse dataframe
sample_df = pd.read_csv('data/cord_embeddings_sample.csv', header=None, delimiter=',', index_col=0).T

def to_ems(df):
    ems_dict = {}
    for columnName, columnData in df.iteritems():
        ems_dict.update({str(columnName): Embedding(columnName, columnData)})
    return EmbeddingSet(ems_dict)

ems = to_ems(sample_df.head(10))

ems.plot_correlation()

ems.plot_interactive("ug7v899j","02tnwd4m")

You can even do NLP with the jsons from the dataset and link them to the embeddings via the UUID and SHA from metadata.csv.
Example:
Find words that relate to smoking and color the respective papers:
I created 2 EmbeddingSets where I filtered the embeddings for papers that have smoking-related words in their text body and subtracted their UUIDs from the list.
Both EmbeddingSets can be displayed in the plot.
from whatlies.transformers import Umap

# add 2 embedding sets

emb1 = non_smoking_ems.add_property('set', lambda d: 'non smoking papers')
emb2 = smoking_ems.add_property('set', lambda d: 'smoking papers')

both = emb1.merge(emb2)

#add a clustering transformer that reduces dimensionality (like umap) and visualise them
both.transform(Umap(2)).plot_interactive('umap_0', 'umap_1',color='set', annot=False)

How do I read the cord_19_embeddings_2020-07-16.csv from the COVID-19 Open Research Dataset Challenge (CORD-19) on Kaggle?

What I’m trying to do

One Answer

Examples how to visualize the embeddings in a Jupyter Notebook:

Add your own answers!

Ask a Question