TransWikia.com

Convert TFIDF Values to Vector Space Model

Data Science Asked by Luca Frost on December 6, 2020

I’m working on a project using tf-idf values and cosine similarity for clustering. As my database (Elasticsearch) provides the term statistics needed for tf-idf out of the box (term_freq and doc_freq), my code calculates the tf-idf vectors manually from this data and then computes cosine similarity.

However, I’m unable to use the sklearn cosine similarity on these values because they are 1-dimensional rather than 2-dimensional. This suggests to me that I need to put the values into a matrix or create a vector space model in order to do so. How would I achieve this?

Here’s some of my code for illustration 🙂

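import math

import numpy as np

# cycle through the term vectors provided by elasticsearch
# and write each term's tf-idf weight into that term's dataframe column
# (row 0 holds document 1, row 1 holds document 2)

for x in v1:
    # term_freq: occurrences of the term in this document
    tf = v1[x]['term_freq'] / len1
    # doc_freq: number of documents that contain the term
    idf = math.log(num_docs / v1[x]['doc_freq'])
    df.at[0, x] = tf*idf

for x in v2:
    tf = v2[x]['term_freq'] / len2
    idf = math.log(num_docs / v2[x]['doc_freq'])
    df.at[1, x] = tf*idf

# terms missing from a document get a weight of zero
df = df.fillna(0.00)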

# -----------------------------------------

# create numpy matrix using dataframe

# approach 1
matrix = np.zeros((2, number_unique_terms))
    # populate this with dataframe row 0 and row 1

# approach 2
matrix = np.matrix(df.iloc[0], df.iloc[1])

Both of the approaches outlined above still give me the 1D array error when attempting to perform cosine similarity. Am I making a simple mistake when it comes to numpy? Or am I attempting to create a vector space model (or matrix) in the wrong way?

Thanks in advance! 🙂
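
For illustration, here is a minimal sketch of one way the two term vectors could be turned into a single 2D array with one row per document and one column per term, which is the layout a vector space model needs. It assumes each term vector is a dict of the form {term: {'term_freq': ..., 'doc_freq': ...}}, as in the loops above. The helper names tfidf_matrix and cosine_sim, and all of the numbers, are made up for the example and do not come from Elasticsearch or sklearn.

```python
import math

import numpy as np


def tfidf_matrix(term_vectors, doc_lengths, num_docs):
    """Build an (n_docs, n_terms) tf-idf array from per-document term statistics."""
    # Shared vocabulary: every term that appears in any document, in a fixed order.
    vocab = sorted({term for tv in term_vectors for term in tv})
    column = {term: j for j, term in enumerate(vocab)}

    # Terms a document does not contain keep the default weight of zero.
    matrix = np.zeros((len(term_vectors), len(vocab)))
    for i, (tv, length) in enumerate(zip(term_vectors, doc_lengths)):
        for term, stats in tv.items():
            tf = stats["term_freq"] / length
            idf = math.log(num_docs / stats["doc_freq"])
            matrix[i, column[term]] = tf * idf
    return matrix, vocab


def cosine_sim(a, b):
    """Cosine similarity of two 1D vectors (0.0 if either has zero length)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Hypothetical statistics for two short documents.
v1 = {"cat": {"term_freq": 2, "doc_freq": 5}, "dog": {"term_freq": 1, "doc_freq": 10}}
v2 = {"cat": {"term_freq": 1, "doc_freq": 5}, "fish": {"term_freq": 3, "doc_freq": 2}}

matrix, vocab = tfidf_matrix([v1, v2], doc_lengths=[3, 4], num_docs=100)
print(matrix.shape)  # (2, 3): two documents, three distinct terms
similarity = cosine_sim(matrix[0], matrix[1])
print(similarity)
```

With scikit-learn installed, sklearn.metrics.pairwise.cosine_similarity(matrix) should accept this array directly, since it expects input of shape (n_samples, n_features). A single row such as matrix[0] or df.iloc[0] is 1D, so it would need matrix[0].reshape(1, -1) (or df.iloc[[0]] for a dataframe) before being passed in.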
