Assigning a new document to a cluster based on keywords extracted and tf-idf

Question

I have about 40 clusters of documents defined by a combination of the k-means clustering algorithm and hand curation. For example, some of the clusters given by k-means were too noisy, so they have been further subdivided.

Now I want to assign new documents to these clusters.

I found that it is possible to extract keywords using tf-idf based methods as mentioned here.

My approach is to extract key terms from each of these clusters using a tf-idf based method, and then to extract the keywords from the new document using the same method.

My question is, how do I assign the new document to the cluster that has the most similarity?

Edit: I do not have enough reputation to comment on Mark's answer: the input to kmeans is the document vectors (from doc2vec) of all documents -- and I get the centroids of the initial clusters, i.e. centroids = kmeans_model.cluster_centers_. But I have split many of these clusters manually into sub clusters. For example, original cluster 3 is now two clusters -- 3_1 and 3_2. How do I generate a representative vector (like a centroid) for the documents in each of these sub clusters?

Mark.F · Answer

It seems that you are already half-way there.
If you have divided the documents into clusters, it implies that you already have some feature extraction method that was used to map each document to a feature vector.

Now, when you get a new document, you need to extract its features the same way and then use its feature vector to find the best fitting cluster for it.

The best fitting cluster can be determined by using the 1NN (Nearest Neighbor) algorithm on either the cluster centers or the document vectors that belong to each cluster. If you use the document vectors and not the cluster centers, you can also try the KNN algorithm (K-Nearest Neighbors) and assign the new document to the cluster that most of its K nearest neighbors belong to.
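Both options can be sketched in a few lines. The example below is a minimal illustration in plain Python, not a definitive implementation. It assumes every document already has a feature vector and a (possibly hand-edited) cluster label. The function names, the toy 2-D vectors and the labels "3_1", "3_2" and "7" are made up for illustration. For a manually split cluster, it takes the mean of the member vectors as the representative vector, which is the same quantity k-means uses for its centers. Euclidean distance is used here; cosine similarity is a common alternative for doc2vec vectors.

```python
import math
from collections import Counter, defaultdict


def centroids_by_label(vectors, labels):
    """Average the vectors that share a label, giving one centroid per label."""
    groups = defaultdict(list)
    for vec, label in zip(vectors, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def nearest_centroid(vec, centroids):
    """1NN on the centroids: return the label of the closest centroid."""
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


def knn_label(vec, vectors, labels, k=3):
    """KNN on the document vectors: majority label among the k closest."""
    ranked = sorted(zip(vectors, labels), key=lambda pair: math.dist(vec, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D vectors standing in for real document vectors (hypothetical data).
doc_vectors = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.9]]
doc_labels = ["3_1", "3_1", "3_2", "3_2", "7", "7"]

# One representative vector per final cluster, including the manual sub clusters.
centroids = centroids_by_label(doc_vectors, doc_labels)

# The new document must be vectorized with the same feature extraction method.
new_doc = [0.9, 1.0]
assigned_by_centroid = nearest_centroid(new_doc, centroids)
assigned_by_knn = knn_label(new_doc, doc_vectors, doc_labels, k=3)
```

With real data, the same logic is usually written with vectorized array operations or an off-the-shelf nearest-neighbor classifier. The KNN variant tends to cope better with clusters that are not roughly spherical, at the cost of keeping all document vectors around.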
