
Sentence similarity using Doc2vec

Data Science · Asked by Latent on February 20, 2021

I have a list of 50k sentences such as: "bone is making noise", "nose is leaking", "eyelid is down", etc.

I’m trying to use Doc2Vec to find the most similar sentence from the 50k given a new sentence.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# `data` is the list of 50k sentences; tag each one with its index
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]

max_epochs = 100
vec_size = 20
alpha = 0.025

# gensim 3.x API (`size` was renamed to `vector_size` in gensim 4.x)
model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.025,
                min_count=1,
                dm=0)  # dm=0 selects the PV-DBOW training mode

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

test_data = word_tokenize("The nose is leaking blood after head injury".lower())
v1 = model.infer_vector(test_data)
#print("V1_infer", v1)

similar_doc = model.docvecs.most_similar(positive=[v1], topn=3)

for doc_id, score in similar_doc:
    print(tagged_data[int(doc_id)], score)

For the sentence "The nose is leaking blood after head injury" I would like to get back the sentence with the highest similarity score (I would guess sentences with words like "leaking", or even synonyms like "dripping"). But the sentences I get back are unrelated, and they change on every call of model.infer_vector(test_data).
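The instability is easy to reproduce. A minimal check (assuming gensim 3.x, where infer_vector takes a steps argument; it is named epochs in gensim 4.x):

import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Each infer_vector call starts from a random vector, so repeated calls differ:
v_a = model.infer_vector(test_data)
v_b = model.infer_vector(test_data)
print(cos_sim(v_a, v_b))   # noticeably below 1.0 with the default few steps

# More inference passes make the result far more repeatable:
v_c = model.infer_vector(test_data, steps=200)
v_d = model.infer_vector(test_data, steps=200)
print(cos_sim(v_c, v_d))   # much closer to 1.0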

Any idea what is wrong?

One Answer

Doc2Vec (and word vectors in general) needs a significant amount of data to learn useful vector representations; 50k sentences are not sufficient. To work around this, you can feed pre-trained word vectors in as the initial weights of the Embedding layer of a neural network.

For example, here is the code from the following question:

How to implement LSTM using Doc2Vec vectors?

from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, LSTM

model_doc2vec = Sequential()
# Pre-trained word vectors go in as frozen weights of the Embedding layer
model_doc2vec.add(Embedding(vocabulary_dim, 100, input_length=longest_document,
                            weights=[training_weights], trainable=False))
model_doc2vec.add(LSTM(units=10, dropout=0.25, recurrent_dropout=0.25, return_sequences=True))
model_doc2vec.add(Flatten())
model_doc2vec.add(Dense(3, activation='softmax'))
model_doc2vec.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Output of "Flatten" layer will be vector representation of a sentence / document.
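The snippet above leaves vocabulary_dim, longest_document and training_weights undefined. A minimal sketch of how they could be built from pre-trained gensim word vectors (the names and the Word2Vec model are illustrative assumptions; gensim 4.x API):

import numpy as np
from gensim.models import Word2Vec

# `tokenized_docs` is assumed to be the tokenized corpus from the question.
w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=1)

vocabulary = w2v.wv.index_to_key         # vocabulary words, most frequent first
vocabulary_dim = len(vocabulary) + 1     # +1 reserves index 0 for padding
longest_document = max(len(doc) for doc in tokenized_docs)

# Weight matrix for the Embedding layer: row i holds the vector of word i;
# row 0 stays all-zero and acts as the padding vector.
training_weights = np.zeros((vocabulary_dim, 100))
for i, word in enumerate(vocabulary, start=1):
    training_weights[i] = w2v.wv[word]

Each document is then fed to the network as a zero-padded sequence of these word indices (index 0 = padding, index i = the i-th vocabulary word).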

Article with example code.

Correct answer by Shamit Verma on February 20, 2021
