Data Science Asked by Devarshi Goswami on January 31, 2021
Given two sentences, I want to quantify how similar the two texts are, based on semantic similarity.
Semantic Textual Similarity (STS) assesses the degree to which two sentences are semantically equivalent to each other.
Say my input looks like this:
index  line1                   line2
0      the cat ate the mouse   the mouse was eaten by the cat
1      the dog chased the cat  the alligator is fat
2      the king ate the cake   the cake was ingested by the king
After applying the algorithm, the output needs to be:
index  line1                   line2                              lbl
0      the cat ate the mouse   the mouse was eaten by the cat     1
1      the dog chased the cat  the alligator is fat               0
2      the king ate the cake   the cake was ingested by the king  1
Here lbl=1 means the sentences are semantically similar and lbl=0 means they are not.
How would I implement this in Python?
I read the documentation of bert-as-a-service, but since I am an absolute noob in this regard I couldn't understand it properly.
BERT is trained on a combination of the losses for masked language modeling and next sentence prediction. For this, BERT receives as input the concatenation of the special token [CLS], the first sentence tokens, the special token [SEP], the second sentence tokens and a final [SEP].
[CLS] | First sentence tokens | [SEP] | Second sentence tokens | [SEP]
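As a rough illustration of that layout, here is a minimal sketch that assembles the token sequence for one sentence pair. The function name `build_pair_input` is hypothetical, and whitespace splitting stands in for BERT's real WordPiece tokenizer. The segment ids (0 for the first sentence, 1 for the second) are a detail of BERT's input that the answer does not mention.

```python
def build_pair_input(tokens_a, tokens_b):
    """Concatenate two tokenized sentences in BERT's sentence-pair layout."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids mark which sentence each position belongs to:
    # 0 for [CLS], the first sentence and its [SEP]; 1 for the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Whitespace splitting is only a stand-in for a real subword tokenizer.
tokens, segment_ids = build_pair_input(
    "the cat ate the mouse".split(),
    "the mouse was eaten by the cat".split(),
)
print(tokens)
print(segment_ids)
```

In practice a library tokenizer would produce this sequence for you; the sketch only makes the positions of the special tokens explicit.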
Some of the tokens in the sentences are "masked out" (i.e. replaced with the special token [MASK]).
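A simplified sketch of that masking step is below. The name `mask_tokens` is hypothetical, and the default rate of 15% is an assumption taken from the original BERT setup rather than from this answer. Real BERT pretraining also sometimes keeps the selected token or swaps in a random one, which this sketch leaves out.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of ordinary tokens with [MASK].

    Returns the masked sequence and, per position, the original token
    the model has to recover (None where nothing was masked).
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if tok not in SPECIAL_TOKENS and rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)   # loss is computed at this position
        else:
            masked.append(tok)
            targets.append(None)  # no loss at this position
    return masked, targets


sequence = ["[CLS]", "the", "cat", "ate", "the", "mouse", "[SEP]",
            "the", "mouse", "was", "eaten", "by", "the", "cat", "[SEP]"]
masked, targets = mask_tokens(sequence)
print(masked)
print(targets)
```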
BERT generates as output a sequence of the same length as the input. The masked language modeling loss trains the model to guess the masked tokens correctly. The next sentence prediction loss takes the output at the first position (the one associated with the [CLS] input) and uses it as input to a small classification model that predicts whether the second sentence actually followed the first one in the original text they come from.
Your task is neither masked language modeling nor next sentence prediction, so you need to train on your own training data. Given that your task consists of classification, you should use BERT's first token output (the [CLS] output) and train a classifier to tell whether your first and second sentences are semantically equivalent or not. For this, you can either:
train only the small classification model that takes BERT's first token output as input (reusing BERT-generated features), or
train both the small classification model and the whole of BERT, using a smaller learning rate for BERT (fine-tuning).
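To make the first option concrete, here is a minimal sketch of training a small classifier on fixed features. It assumes the [CLS] output vector for each sentence pair has already been extracted. The three-dimensional vectors below are made-up placeholders (real BERT outputs are much larger), and plain logistic regression stands in for "the small classification model". The second option, fine-tuning, would also update BERT's own weights and is not shown here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier(features, labels, lr=0.5, epochs=500):
    """Fit logistic regression on fixed feature vectors with plain SGD."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y  # gradient of the cross-entropy loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (similar) or 0 (not similar) for one feature vector."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


# Placeholder vectors standing in for the [CLS] output of each sentence pair.
cls_features = [[0.9, 0.1, 0.8], [0.1, 0.9, 0.2], [0.8, 0.2, 0.9], [0.2, 0.8, 0.1]]
labels = [1, 0, 1, 0]

weights, bias = train_classifier(cls_features, labels)
print([predict(weights, bias, x) for x in cls_features])
```

Because the features never change, they only need to be computed once, which is the main practical appeal of this option over fine-tuning.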
In order to decide what's best in your case, you can have a look at this article.
To actually implement it, you could use the popular Hugging Face transformers Python package, which is already prepared for fine-tuning BERT on custom tasks (e.g. see this tutorial).
Answered by noe on January 31, 2021
Another way is to use sentence-transformers (pip install sentence-transformers). I am posting this from mobile, so sorry if there are any indentation issues.
```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Load a pretrained sentence-embedding model
embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

corpus = [
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'A man is eating pasta.',
    'The girl is carrying a baby.',
    'The baby is carried by the woman',
    'A man is riding a horse.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'Someone in a gorilla costume is playing a set of drums.',
    'A cheetah is running behind its prey.',
    'A cheetah chases prey on across a field.',
]

# Encode every sentence into a fixed-size vector
corpus_embeddings = embedder.encode(corpus)

# Group sentences with similar embeddings into clusters
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
```
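The clustering above groups a whole corpus. For the pairwise lbl column in the question, one possible route is to compare the two sentence embeddings directly with cosine similarity and apply a cutoff. The sketch below assumes this; `label_pair` and the 0.8 threshold are hypothetical choices that would need tuning on labeled data, and the short vectors are placeholders for what `embedder.encode` would return for line1 and line2.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_pair(emb1, emb2, threshold=0.8):
    """Return 1 if the embeddings are close enough, else 0.

    The threshold is an assumed value, not one given in the answer.
    """
    return 1 if cosine_similarity(emb1, emb2) >= threshold else 0


# Placeholder embeddings; in practice these would come from the embedder.
pairs = [
    ([0.9, 0.1, 0.3], [0.8, 0.2, 0.3]),  # vectors pointing the same way
    ([0.9, 0.1, 0.3], [0.1, 0.9, 0.1]),  # vectors pointing apart
]
print([label_pair(a, b) for a, b in pairs])
```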
Answered by Syenix on January 31, 2021