Data Science Asked on September 27, 2021
I have thousands of headlines and I would like to build a semantic network using word2vec, specifically google news files.
My sentences look like
Titles
Dogs are humans’ best friends
A dog died because of an accident
You can clean dogs’ paws using natural products.
A cat was found in the kitchen
And so on.
What I would like to do is finding some specific pattern within this data, e.g. similarity in topics on dogs and cats, using semantic networks.
Could you give me some advice on how I can do it?
Code:
import pandas as pd
import gensim
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.manifold import TSNE
main_data.Titles = np.where(main_data.Titles.isnull(),'NA', main_data.Titles)
article_titles = main_data['Titles']
titles_list = [title for title in article_titles]
big_title_string = ' '.join(titles_list)
tokens = word_tokenize(big_title_string)
words = [word.lower() for word in tokens if word.isalpha()]
stop_words = set(stopwords.words('english'))
words = [word for word in words if not a word in stop_words]
model = gensim.models.KeyedVectors.load_word2vec_format('path/GoogleNews-vectors-negative300.bin', binary = True)
model.vector_size
vector_list = [model[word] for word in words if word in model.vocab]
words_filtered = [word for word in words if the word in `model.vocab`]
word_vec_zip = zip(words_filtered, vector_list)
word_vec_dict = dict(word_vec_zip)
df = pd.DataFrame.from_dict(word_vec_dict, orient='index')
tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)
tsne_df = tsne.fit_transform(df[:400])
sns.set()
fig, ax = plt.subplots(figsize = (11.7, 8.27))
sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)
from adjustText import adjust_text
texts = []
words_to_plot = list(np.arange(0, 400, 10))
for word in words_to_plot:
texts.append(plt.text(tsne_df[word, 0], tsne_df[word, 1], df.index[word], fontsize = 14))
adjust_text(texts, force_points = 0.4, force_text = 0.4,
expand_points = (2,1), expand_text = (1,2),
arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))
plt.show()
However, I cannot understand how to interpret the results. I think they are wrong and probably this is not a good approach for building a semantic network. maybe I have been missing something…For instance, this code is still keeping stopwords after the part of
words = [word for word in words if not a word in stop_words]
This is an example of output difficult to read and explain (at least, for me):
I would greatly appreciate it if you could give me some tips and advice on how to perform a semantic network that can show semantic similarity within titles.
You can try converting your word representation into a document representation by simply taking the average over the word vectors for a document. For example, if a document has 9 words with (9, 200) dimensions, by taking an average over the words you can have a document representation with a dimension (1,200).
After you have your document-representation, you can use T-SNE to find similar documents. The documents with a similar topic or theme will cluster near each other. You can always improve the document representation by improving your word vectors.
Check this
Answered by Krishnang K Dalal on September 27, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP