Data Science Asked by Latent on February 20, 2021
I have a list of 50k sentences such as : ‘bone is making noise’, ‘nose is leaking’ ,’eyelid is down’ etc..
I’m trying to use Doc2Vec to find the most similar sentence from the 50k given a new sentence.
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.025,
                min_count=1,
                dm=0)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
test_data = word_tokenize("The nose is leaking blood after head injury".lower())
v1 = model.infer_vector(test_data)
#print("V1_infer", v1)
similar_doc = model.docvecs.most_similar(positive=[model.infer_vector(test_data)], topn=3)
for i in range(len(similar_doc)):
    print(tagged_data[int(similar_doc[i][0])], similar_doc[i][1])
For the sentence "The nose is leaking blood after head injury" I would like to get back the sentence with the highest similarity score (I would expect sentences containing words like "leaking", or even synonyms such as "dripping"). But the sentences I get back are unrelated, and they change on every call to model.infer_vector(test_data).
Any idea what is wrong?
Doc2Vec (and word vectors in general) needs a significant amount of data to learn useful vector representations; 50k sentences is not sufficient for this. To work around it, you can feed pretrained word vectors as the initial weights of an Embedding layer in a network.
For example, the code from the following question:
How to implement LSTM using Doc2Vec vectors?
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Flatten, Dense

model_doc2vec = Sequential()
model_doc2vec.add(Embedding(vocabulary_dim, 100, input_length=longest_document,
                            weights=[training_weights], trainable=False))
model_doc2vec.add(LSTM(units=10, dropout=0.25, recurrent_dropout=0.25, return_sequences=True))
model_doc2vec.add(Flatten())
model_doc2vec.add(Dense(3, activation='softmax'))
model_doc2vec.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
The output of the "Flatten" layer will be the vector representation of a sentence / document.
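Note that the snippet above assumes a training_weights matrix and a vocabulary_dim that are never defined. A minimal sketch of how such a matrix could be built from pretrained word vectors (a toy dict stands in here for something like a trained gensim model's word vectors; the variable names are chosen to match the snippet above and are otherwise hypothetical):

```python
import numpy as np

# Stand-in for pretrained word vectors (e.g. the word-vector lookup of a
# trained gensim Word2Vec/Doc2Vec model); toy 2-dimensional vectors here.
pretrained = {"nose": np.array([0.1, 0.2]), "leaking": np.array([0.3, 0.4])}
embedding_dim = 2

# Word -> integer index mapping; index 0 is reserved for padding, which is
# the usual convention when feeding padded sequences to an Embedding layer.
vocabulary = {"nose": 1, "leaking": 2, "eyelid": 3}
vocabulary_dim = len(vocabulary) + 1

# One row per vocabulary index; rows for words without a pretrained vector
# stay at zero (they could also be initialised randomly).
training_weights = np.zeros((vocabulary_dim, embedding_dim))
for word, idx in vocabulary.items():
    if word in pretrained:
        training_weights[idx] = pretrained[word]

print(training_weights.shape)  # (4, 2)
```

This matrix is what gets passed as weights=[training_weights] above, with trainable=False keeping the pretrained vectors frozen during training.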
Correct answer by Shamit Verma on February 20, 2021