Data Science Asked by user3778289 on August 10, 2021
I have just started using word2vec and I have no idea how to create vectors (using word2vec) of two lists, each containing set of words and phrases and then how to calculate cosine similarity between these 2 lists.
For example :
list1 = ['blogs', 'vmware', 'server', 'virtual', 'oracle update', 'virtualization', 'application', 'infrastructure', 'management']
list2 = ['microsoft visual studio', 'desktop virtualization', 'microsoft exchange server', 'cloud computing', 'windows server 2008']
Any help would be appreciated.
You cannot apply word2vec directly to a multi-word phrase, since it produces one vector per word. You should use something like doc2vec, which gives a single vector for each phrase:
phrase = model.infer_vector(['microsoft', 'visual', 'studio'])
You can also average or sum the vectors of words (from word2vec) in each phrase, e.g.
phrase = w2v('microsoft') + w2v('visual') + w2v('studio')
This way, a phrase vector has the same length as a word vector, so the two can be compared. Still, methods like doc2vec generally work better than a simple average or sum. Finally, you can compare each word in the first list to every phrase in the second list and find the closest phrase.
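The averaging-and-matching idea can be sketched as follows. The random 50-dimensional vectors here are a stand-in for a trained word2vec model (a real `w2v` lookup would come from a trained model, not from a random generator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained word2vec model: one random 50-d vector per word.
vocab = ['blogs', 'vmware', 'server', 'cloud', 'computing',
         'desktop', 'virtualization', 'windows']
w2v = {w: rng.standard_normal(50) for w in vocab}

def phrase_vector(phrase):
    """Average the word vectors of the in-vocabulary tokens in the phrase."""
    vecs = [w2v[t] for t in phrase.split() if t in w2v]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

list1 = ['server', 'virtualization']
list2 = ['cloud computing', 'desktop virtualization', 'windows server']

# For each word in list1, find the closest phrase in list2.
for word in list1:
    best = max(list2, key=lambda p: cosine(w2v[word], phrase_vector(p)))
    print(word, '->', best)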
Note that a phrase like "cloud computing" has a completely different meaning from the word "cloud". Such phrases, especially frequent ones, are therefore better treated as a single token, e.g.
phrase = w2v('cloud_computing')
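For the `cloud_computing` token above to exist in the model, the phrases have to be merged before training. A minimal preprocessing sketch with a hand-picked phrase list (gensim's `Phrases` class can learn such collocations automatically from a corpus instead):

```python
# Hypothetical list of known multi-word phrases to merge into single tokens.
KNOWN_PHRASES = ['cloud computing', 'microsoft visual studio', 'windows server 2008']

def join_phrases(text, phrases=KNOWN_PHRASES):
    """Replace each known phrase with an underscore-joined single token."""
    # Longest phrases first, so shorter phrases cannot break up longer ones.
    for p in sorted(phrases, key=len, reverse=True):
        text = text.replace(p, p.replace(' ', '_'))
    return text

print(join_phrases('experience with cloud computing and microsoft visual studio'))
# experience with cloud_computing and microsoft_visual_studio
```

Run this over the corpus before training so that word2vec sees `cloud_computing` as one vocabulary entry and learns a vector for the phrase as a whole.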
Extra directions:
Here is an answer by Astariul on Stack Overflow that uses a function from the word2vec package to calculate the similarity between two sets of words.
Take a look at fastText, which works better when there are a lot of misspelled or out-of-vocabulary words.
Answered by Esmailian on August 10, 2021
Vector representations of phrases (called term vectors) are used in projects such as search-results optimization and question answering.
A textbook example is "Chinese river" ~ {"Yangtze_River","Qiantang_River"} (https://code.google.com/archive/p/word2vec/)
The example above identifies phrases based on nouns listed in the Freebase DB. There are alternatives, such as:
Filter the candidate list based on usage (e.g., only retain terms that occur at least 500 times in a large corpus such as Wikipedia).
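The frequency filter can be sketched in a few lines. The corpus, candidate terms, and threshold here are toy values standing in for a Wikipedia-scale corpus and the 500-occurrence cutoff mentioned above:

```python
from collections import Counter

# Toy corpus and candidate phrase list (illustrative only).
corpus = [
    'cloud computing is popular',
    'cloud computing and virtualization',
    'desktop virtualization tools',
]
candidates = ['cloud computing', 'desktop virtualization', 'oracle update']

# Count how often each candidate term appears across the corpus.
counts = Counter()
for doc in corpus:
    for term in candidates:
        counts[term] += doc.count(term)

# Retain only terms above the usage threshold (500 in the Wikipedia example).
min_count = 2
retained = [t for t in candidates if counts[t] >= min_count]
print(retained)  # ['cloud computing']
```

Only the terms that survive the filter are merged into single tokens and handed to the word-vector training step.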
Once terms have been identified, the word-vector algorithm works on them as-is.
The following patent from Google has more details:
https://patents.google.com/patent/CN106776713A/en
Other papers with examples of domains where term vectors have been evaluated or used:
https://arxiv.org/ftp/arxiv/papers/1801/1801.01884.pdf
https://www.cs.cmu.edu/~lbing/pub/KBS2018_bing.pdf
https://www.sciencedirect.com/science/article/pii/S1532046417302769
Answered by Shamit Verma on August 10, 2021