Data Science Asked by Latent on February 20, 2021
I have a list of 50k sentences such as : ‘bone is making noise’, ‘nose is leaking’ ,’eyelid is down’ etc..
I’m trying to use Doc2Vec to find the most similar sentence from the 50k given a new sentence.
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.025,
                min_count=1,
                dm=0)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
test_data = word_tokenize("The nose is leaking blood after head injury".lower())
v1 = model.infer_vector(test_data)
#print("V1_infer", v1)
similar_doc = model.docvecs.most_similar(positive=[model.infer_vector(test_data)], topn=3)
for i in range(len(similar_doc)):
    print(tagged_data[int(similar_doc[i][0])], similar_doc[i][1])
For the sentence "The nose is leaking blood after head injury" I would like to get back the sentence with the highest similarity score (I would expect sentences containing words like "leaking", or even synonyms such as "dripping"). But the sentences I get back are unrelated, and they change on every call to model.infer_vector(test_data).
Any idea what is wrong?
Doc2Vec (and word vectors in general) needs a significant amount of data to learn useful vector representations; 50k sentences is not sufficient for this. To work around it, you can feed pretrained word vectors as the initial weights of an Embedding layer in a network.
For example, the code from the following question:
How to implement LSTM using Doc2Vec vectors?
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Flatten, Dense

model_doc2vec = Sequential()
model_doc2vec.add(Embedding(vocabulary_dim, 100, input_length=longest_document,
                            weights=[training_weights], trainable=False))
model_doc2vec.add(LSTM(units=10, dropout=0.25, recurrent_dropout=0.25, return_sequences=True))
model_doc2vec.add(Flatten())
model_doc2vec.add(Dense(3, activation='softmax'))
model_doc2vec.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
The output of the "Flatten" layer will be the vector representation of a sentence / document.
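Note that the snippet above assumes a training_weights matrix and a vocabulary_dim that are never defined. A minimal sketch of how such a matrix could be built from pretrained word vectors (a toy dict stands in here for something like a trained gensim model's word vectors; the variable names are chosen to match the snippet above and are otherwise hypothetical):

```python
import numpy as np

# Stand-in for pretrained word vectors (e.g. the word-vector lookup of a
# trained gensim Word2Vec/Doc2Vec model); toy 2-dimensional vectors here.
pretrained = {"nose": np.array([0.1, 0.2]), "leaking": np.array([0.3, 0.4])}
embedding_dim = 2

# Word -> integer index mapping; index 0 is reserved for padding, which is
# the usual convention when feeding padded sequences to an Embedding layer.
vocabulary = {"nose": 1, "leaking": 2, "eyelid": 3}
vocabulary_dim = len(vocabulary) + 1

# One row per vocabulary index; rows for words without a pretrained vector
# stay at zero (they could also be initialised randomly).
training_weights = np.zeros((vocabulary_dim, embedding_dim))
for word, idx in vocabulary.items():
    if word in pretrained:
        training_weights[idx] = pretrained[word]

print(training_weights.shape)  # (4, 2)
```

This matrix is what gets passed as weights=[training_weights] above, with trainable=False keeping the pretrained vectors frozen during training.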
Correct answer by Shamit Verma on February 20, 2021