Data Science Asked by Luca Frost on December 6, 2020
I’m working on a project using tf-idf values and cosine similarity for clustering. As my database (Elasticsearch) provides the raw term statistics out of the box (term_freq and doc_freq), my code calculates the tf-idf vectors manually from this data and then performs cosine similarity.
However, I’m unable to use the sklearn cosine similarity on these values because they are 1-dimensional rather than 2-dimensional. This suggests that I need to put the values into a matrix or create a vector space model first. How would I achieve this?
Here’s some of my code for illustration 🙂
# cycle through the term vectors provided by elasticsearch
# and append them to their corresponding term in dataframe
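for x in v1:
    # tf: occurrences of the term in this document, normalised by document length
    tf = v1[x]['term_freq'] / len1
    # idf: log of total documents over documents containing the term
    idf = math.log(num_docs / v1[x]['doc_freq'])
    df.at[0, x] = tf*idf
for x in v2:
    tf = v2[x]['term_freq'] / len2
    idf = math.log(num_docs / v2[x]['doc_freq'])
    df.at[1, x] = tf*idf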
df = df.fillna(0.00)
# -----------------------------------------
# create numpy matrix using dataframe
# approach 1
matrix = np.zeros((2, number_unique_terms))
# populate this with dataframe row 0 and row 1
# approach 2
matrix = np.matrix(df.iloc[0], df.iloc[1])
Both of the approaches outlined above still give me the 1D array error when attempting to perform cosine similarity. Am I making a simple mistake when it comes to numpy? Or am I attempting to create a vector space model (or matrix) in the wrong way?
Thanks in advance! 🙂
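For illustration, here is a minimal sketch of one way to get from per-document term statistics to a 2D matrix with one row per document and one column per term. It is not the only approach. The sample data, the helper names (`tfidf`, `to_matrix`, `cosine_similarity_matrix`) and the choice of document length (the sum of the term counts) are all assumptions made for the example. It also assumes that term_freq is the count of the term within the document and doc_freq is the number of documents containing the term. The similarity is computed with plain NumPy so that the shapes are easy to follow.

```python
import math
import numpy as np

# Hypothetical term statistics, shaped like the v1/v2 dicts in the question:
# term -> {"term_freq": count in this document, "doc_freq": documents containing the term}
v1 = {"apple": {"term_freq": 2, "doc_freq": 3}, "pie": {"term_freq": 1, "doc_freq": 5}}
v2 = {"apple": {"term_freq": 1, "doc_freq": 3}, "tart": {"term_freq": 3, "doc_freq": 2}}
num_docs = 10  # assumed total number of documents in the index


def tfidf(term_vector, num_docs):
    """Return {term: tf * idf} for one document."""
    # Assumption: document length is the sum of its term counts.
    length = sum(stats["term_freq"] for stats in term_vector.values())
    return {
        term: (stats["term_freq"] / length) * math.log(num_docs / stats["doc_freq"])
        for term, stats in term_vector.items()
    }


def to_matrix(weighted_docs):
    """Align documents on a shared vocabulary; missing terms become 0.0."""
    vocab = sorted(set().union(*weighted_docs))
    matrix = np.array([[doc.get(term, 0.0) for term in vocab] for doc in weighted_docs])
    return vocab, matrix


def cosine_similarity_matrix(matrix):
    """Pairwise cosine similarity between the rows of a 2D array."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = matrix / norms
    return unit @ unit.T


vocab, matrix = to_matrix([tfidf(v1, num_docs), tfidf(v2, num_docs)])
sims = cosine_similarity_matrix(matrix)

print(vocab)
print(matrix.shape)  # (number of documents, number of unique terms)
print(sims)
```

A 2D array built this way has the (n_samples, n_features) shape that sklearn's `cosine_similarity` expects, so it should be possible to pass it in directly. With the DataFrame from the question, `df.to_numpy()` gives an array of the same shape, while a single row such as `df.iloc[0]` is 1D and would need `.to_numpy().reshape(1, -1)` first.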