Two questions about word2vec and gensim

Question

I've written the code below to try gensim's word2vec implementation. I have two questions:

1. Even though I've removed stop words, the word "the" is listed as one of the most similar words to "friend".
2. The most similar words to "friend" are not satisfying (at least by my subjective evaluation). Should I try a larger text (the austen-emma.txt file contains 192427 words), or is the problem something else?

Thanks.
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg
import gensim
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import remove_stopwords
from nltk.tokenize import RegexpTokenizer

text = gutenberg.raw('austen-emma.txt')
text = remove_stopwords(text)        # stop words are removed before lowercasing
tokenizer = RegexpTokenizer(r'\w+')  # keep runs of word characters only
data = []

# Build one list of lowercased tokens per sentence
for i in sent_tokenize(text):
    temp = []
    for j in tokenizer.tokenize(i):
        temp.append(j.lower())
    data.append(temp)

# Note: gensim 4.0+ renamed the `size` parameter to `vector_size`
model = gensim.models.Word2Vec(data, min_count = 1,
                          size = 32, window = 2)

model.wv.most_similar(positive='friend', topn=10)

[('mind', 0.9998476505279541),  
 ('present', 0.9998302459716797),  
 ('till', 0.9998292326927185),  
 ('herself', 0.9998183250427246),  
 ('highbury', 0.999806821346283),  
 ('the', 0.9998062252998352),  
 ('place', 0.9998047351837158),  
 ('house', 0.999799907207489),  
 ('her', 0.9997915029525757),  
 ('me', 0.9997879266738892)]

hssay · Accepted Answer

1. Lowercase the text before removing stop words, since the stop-word removal is case-sensitive. As written, a capitalized "The" survives the filter and only becomes "the" afterwards.
2. Your window size, currently set to 2, is probably too small. Try the default of 5, which lets the model discover more natural similar pairs (words that frequently appear in similar contexts but more than 2 positions apart). It would also help to raise the minimum count threshold to weed out low-frequency words, which can give rise to spurious correlations.
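
To illustrate the first point, here is a minimal sketch that uses only the standard library. The `STOP_WORDS` set and the `drop_stop_words` helper are hypothetical stand-ins for gensim's `remove_stopwords` (the real list is much longer), and the sample sentence is made up. It only shows how the order of the two steps changes the result:

```python
import re

# Hypothetical, tiny stop-word set for illustration only.
STOP_WORDS = {"the", "a", "of", "her", "me"}


def drop_stop_words(text):
    """Case-sensitive filter over whitespace-separated tokens."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def tokenize(text):
    """Split text into runs of word characters."""
    return re.findall(r"\w+", text)


sentence = "The friend of Emma saw the house"

# Order used in the question: filter first, lowercase afterwards.
filter_then_lower = [t.lower() for t in tokenize(drop_stop_words(sentence))]

# Order suggested above: lowercase first, then filter.
lower_then_filter = tokenize(drop_stop_words(sentence.lower()))

print(filter_then_lower)  # still contains "the" (from the capitalized "The")
print(lower_then_filter)  # no stop words left
```

With the first ordering, the sentence-initial "The" does not match the lowercase entry, so it passes through the filter and ends up in the vocabulary as "the".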

In general, word2vec needs some manual tuning before the results start subjectively making sense (unless your dataset is very large).
