Data Science Asked by Tobias Kolb on September 5, 2021
I wanted to use the CORD19 word embeddings CSV to map them to certain findings from the rest of the dataset, but as we can see there are no strings in the first column.
As I understand word or sentence embeddings, they map a word or a sentence to a numeric vector.
The values in the first column look somewhat like hashes, and they are the main reason I can't use the dataset.
Can somebody give me a pointer on what I’m looking at and how to use them?
I have not found documentation, usage examples, or submissions on Kaggle that explain or outline how this file is supposed to be used.
So, after a lot of digging, I found something in the comment section.
They are document embeddings.
There is a GitHub repo that specifies an API.
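To make the file layout concrete, here is a minimal sketch, assuming each row holds a paper ID in the first column followed by the components of that paper's embedding vector. The vector values are made up for illustration and the helper name `load_document_embeddings` is hypothetical; the real file has far more components per row.

```python
import csv
import io

# Made-up stand-in for a few rows of the embeddings CSV:
# first column = paper ID, remaining columns = vector components.
sample_csv = io.StringIO(
    "ug7v899j,0.12,-0.40,0.33\n"
    "02tnwd4m,-0.05,0.27,0.91\n"
)

def load_document_embeddings(handle):
    """Return {paper_id: [float, ...]} from a header-less CSV (hypothetical helper)."""
    embeddings = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        paper_id, *values = row
        embeddings[paper_id] = [float(v) for v in values]
    return embeddings

doc_embeddings = load_document_embeddings(sample_csv)
for paper_id, vector in doc_embeddings.items():
    print(paper_id, len(vector), vector)
```

So the hash-like values are not words at all: under this reading, each one identifies a whole paper, and the rest of the row is that paper's vector.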
Relevant Comments from the Kaggle Comment section on the Data Update Log for the CORD19 Dataset:
import pandas as pd
from whatlies import Embedding, EmbeddingSet
# Docs: https://rasahq.github.io/whatlies/api/embeddingset/
# Transpose the dataframe so that each column holds one document's embedding
sample_df = pd.read_csv('data/cord_embeddings_sample.csv', header=None, delimiter=',', index_col=0).T
def to_ems(df):
    ems_dict = {}
    # DataFrame.items() replaces iteritems(), which was removed in pandas 2.0
    for columnName, columnData in df.items():
        ems_dict.update({str(columnName): Embedding(columnName, columnData)})
    return EmbeddingSet(ems_dict)
ems = to_ems(sample_df.head(10))
ems.plot_correlation()
ems.plot_interactive("ug7v899j", "02tnwd4m")
You can even do NLP with the JSONs from the dataset and link them to the embeddings via the UUID and SHA from metadata.csv.
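A rough sketch of that linking step, assuming metadata.csv has an ID column named `cord_uid` plus `sha` and `title` columns (these column names and all sample values are assumptions for illustration, so check them against your copy of the file):

```python
import io
import pandas as pd

# Made-up stand-ins for the two files; column names are assumptions.
embeddings_csv = io.StringIO(
    "ug7v899j,0.12,-0.40,0.33\n"
    "02tnwd4m,-0.05,0.27,0.91\n"
)
metadata_csv = io.StringIO(
    "cord_uid,sha,title\n"
    "ug7v899j,aaa111,Paper A\n"
    "02tnwd4m,bbb222,Paper B\n"
    "zzzz0000,ccc333,Paper without embedding\n"
)

# Paper IDs become the index; the remaining columns are the vector components.
embeddings = pd.read_csv(embeddings_csv, header=None, index_col=0)
metadata = pd.read_csv(metadata_csv)

# Inner join keeps only papers that have both a metadata row and an embedding.
linked = metadata.merge(embeddings, left_on="cord_uid", right_index=True, how="inner")
print(linked[["cord_uid", "sha", "title"]])
```

Each row of `linked` then carries the paper's ID, its SHA, and its vector side by side, so the SHA can be used to find the matching JSON while the ID selects the embedding.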
Example:
Find words that relate to smoking and color the respective papers:
I created 2 EmbeddingSets where I filtered the embeddings for papers that have smoking-related words in their text body and subtracted their UUIDs from the list. Both EmbeddingSets can be displayed in the plot.
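The filtering itself is not shown here, so the following is only a sketch of one way to do it, assuming you already have a mapping from paper ID to body text (extracted from the JSONs). The sample texts, the ID `abcd1234`, and the keyword list are made up for illustration.

```python
import re

# Hypothetical inputs: paper ID -> body text, and a hand-picked keyword list.
body_texts = {
    "ug7v899j": "Smoking history was recorded for all patients.",
    "02tnwd4m": "We sequenced the viral genome from twelve samples.",
    "abcd1234": "Current smokers showed higher ACE2 expression.",
}
smoking_terms = ["smoking", "smoker", "smokers", "tobacco", "cigarette"]

# Whole-word, case-insensitive match on any of the terms.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, smoking_terms)) + r")\b",
    re.IGNORECASE,
)

# IDs of papers whose body mentions at least one term.
smoking_ids = {pid for pid, text in body_texts.items() if pattern.search(text)}
# "Subtract" the matches from the full ID list to get the rest.
non_smoking_ids = set(body_texts) - smoking_ids

print(sorted(smoking_ids))
print(sorted(non_smoking_ids))
```

The two ID sets can then be used to select the matching columns of the transposed dataframe and pass each selection to `to_ems`, giving one EmbeddingSet per group.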
from whatlies.transformers import Umap
# Tag each embedding set with a label so the plot can color by it
emb1 = non_smoking_ems.add_property('set', lambda d: 'non smoking papers')
emb2 = smoking_ems.add_property('set', lambda d: 'smoking papers')
both = emb1.merge(emb2)
# Apply a transformer that reduces dimensionality (like UMAP) and visualise the result
both.transform(Umap(2)).plot_interactive('umap_0', 'umap_1', color='set', annot=False)
Correct answer by Tobias Kolb on September 5, 2021