Matching similar strings

Question

I have a list of conferences on different topics, e.g.

Conference on genomics and neurosciences
Advances in string theory and astrophysics 
Genomics and neuroscience: 20 years of research
Swiss Physics society meeting on string theory and astrophysics
...

They fall into different classes: for example, titles 1 and 3 belong together, as do titles 2 and 4. What is the right tool to group those titles?

n1k31t4 · Answer

I assume you have some training data with labels, i.e. data where the titles are already linked to a given class. This is then supervised learning (as opposed to unsupervised learning), so you could follow these steps:

Step 1: you have words as input, so you will need a method to create numerical representations (vectors). For that you could look into algorithms such as Word2Vec, Doc2Vec, GloVe or something like TF-IDF. If you go for the first, you might consider trying the spaCy library in Python. Here is a tutorial on Word2Vec using spaCy.

Step 2: once you have numerical representations for each of your titles, you need to somehow classify them. You could do this a few ways. Perhaps the simplest would be a clustering algorithm (which groups the titles without using the labels), e.g. the DBSCAN algorithm in scikit-learn - here is a demo.
You could try more complicated methods, such as Support Vector Machines or Neural Networks, but it is probably best to start with a method that will get you to some results more quickly. You are classifying titles, so be sure to frame your problem as classification rather than regression.

Step 3: assess your results and try changing a part of the loop above.
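
As a rough illustration of steps 1 and 2, here is a minimal sketch using only the Python standard library. It builds TF-IDF vectors by hand and then groups titles whose cosine similarity exceeds a threshold. This threshold grouping is a simplified stand-in for a real clustering algorithm such as DBSCAN, not an implementation of it. The stop-word list, the function names and the threshold of 0.1 are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"on", "and", "in", "of", "the", "a"}

def tokenize(title):
    """Lowercase a title and keep alphanumeric tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", title.lower()) if t not in STOP_WORDS]

def tfidf_vectors(titles):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per title."""
    docs = [Counter(tokenize(t)) for t in titles]
    n = len(docs)
    # Document frequency: in how many titles does each term occur?
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Smoothed IDF, so that terms occurring in every title keep a positive weight.
        vec = {term: tf * (math.log((1 + n) / (1 + df[term])) + 1)
               for term, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def group_titles(titles, threshold=0.1):
    """Group titles that are linked by a chain of similarities >= threshold."""
    vectors = tfidf_vectors(titles)
    unvisited = set(range(len(titles)))
    groups = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        group, frontier = [start], [start]
        while frontier:
            i = frontier.pop()
            linked = [j for j in sorted(unvisited)
                      if cosine(vectors[i], vectors[j]) >= threshold]
            for j in linked:
                unvisited.remove(j)
            group.extend(linked)
            frontier.extend(linked)
        groups.append(sorted(group))
    return groups

titles = [
    "Conference on genomics and neurosciences",
    "Advances in string theory and astrophysics",
    "Genomics and neuroscience: 20 years of research",
    "Swiss Physics society meeting on string theory and astrophysics",
]
groups = group_titles(titles)
print(groups)
```

For step 3, the threshold and the tokenizer are the obvious knobs to vary. Note that "neuroscience" and "neurosciences" count as different tokens here, so a stemmer or word embeddings would likely give more robust groupings on real data.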

In the above, I assumed you are talking about the semantic meaning of the conference titles, and not similarity between literal word/letter combinations. That could of course be computed analytically, without the use of a model that learns.
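
For the literal (non-semantic) case, a similarity score can be computed directly, without any training. The sketch below uses `difflib.SequenceMatcher` from the Python standard library, which scores matching character blocks. It is one possible measure among several (it is not an edit distance such as Levenshtein), and the helper name `literal_similarity` is made up for this example.

```python
from difflib import SequenceMatcher

def literal_similarity(a, b):
    """Return a ratio in [0, 1] based on matching character blocks (1.0 = identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Spelling variants score high even though no model was trained.
print(literal_similarity("neuroscience", "neurosciences"))
print(literal_similarity("Physics", "astrophysics"))
```

A score like this catches spelling variants but knows nothing about meaning, so two titles on the same topic with no words in common would still score low.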

In response to OP's comment:
From my experience, using TF-IDF or something called minimal new sets might be a good way to get your titles into representations that allow clustering. Once clusters are formed, it would be up to you to interpret them and assign labels. If you know that there are e.g. only 10 conferences, it shouldn't be too difficult to reach results. Have a look at this master's thesis that does a similar thing - instead of conferences, they want to detect topics. Disclaimer: I supervised that thesis.
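
To help with the interpretation step, one simple aid is to list the terms shared by the most titles in each cluster. The sketch below assumes the clusters have already been found; the cluster names "A" and "B", the stop-word list and the helper `top_terms` are hypothetical and only serve the illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"on", "and", "in", "of", "the", "a"}

def top_terms(titles, k=3):
    """Return the k terms that occur in the most titles of one cluster."""
    counts = Counter()
    for title in titles:
        # Count each term once per title (document frequency within the cluster).
        counts.update(set(re.findall(r"[a-z]+", title.lower())) - STOP_WORDS)
    # Sort by count (descending), then alphabetically to break ties deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Assumed cluster assignment, matching the grouping described in the question.
clusters = {
    "A": ["Conference on genomics and neurosciences",
          "Genomics and neuroscience: 20 years of research"],
    "B": ["Advances in string theory and astrophysics",
          "Swiss Physics society meeting on string theory and astrophysics"],
}
for name, members in clusters.items():
    print(name, top_terms(members))
```

The terms at the top of each list are candidates for a human-readable label; the final naming is still a manual decision.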

user42229 · Answer

If your data is not labeled and you want to transform it into numerical features, you could try a Bourgain embedding. For this you need a distance between two conference titles. This could be a combination of Jaccard distance (on the bag of words) and Levenshtein distance (though the latter only makes sense if you have words that are spelled similarly, for example "Physics" and "astrophysics"). Having such a representation in numerical features allows you to do k-means clustering, for example, or, after labeling, supervised learning.
For more details you might have a look here: http://www.orges-leka.de/automatic_feature_engineering.html (Disclaimer: I have written the blog article.) I have done something similar with search queries at my website: https://www.kaggle.com/orgesleka/keywords-similarity-dataset
The main point when using the Bourgain algorithm is how you define the distance / similarity of two conference titles, which impacts the representation.
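
To make the idea concrete, here is a minimal sketch of one common randomised construction of a Bourgain-style embedding: each coordinate of a title is its distance to a randomly chosen subset of the titles, with subset sizes doubling from one scale to the next. It is not the code from the blog article. It uses only the Jaccard distance on word sets (no Levenshtein component), and the function names, the subset sizes and the number of repetitions are assumptions made for this example.

```python
import math
import random
import re

def words(title):
    """Set of lowercase alphanumeric tokens in a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| on the word sets of two titles."""
    wa, wb = words(a), words(b)
    union = wa | wb
    if not union:
        return 0.0
    return 1.0 - len(wa & wb) / len(union)

def bourgain_embedding(items, distance, reps=None, seed=0):
    """Map each item to a vector of distances to random reference subsets."""
    rng = random.Random(seed)
    n = len(items)
    scales = max(1, math.ceil(math.log2(n)))
    reps = reps or scales
    vectors = [[] for _ in items]
    for j in range(scales):
        size = min(n, 2 ** j)  # subset sizes 1, 2, 4, ...
        for _ in range(reps):
            subset = rng.sample(items, size)
            for vec, item in zip(vectors, items):
                # Distance from an item to a set = distance to its nearest member.
                vec.append(min(distance(item, ref) for ref in subset))
    return vectors

titles = [
    "Conference on genomics and neurosciences",
    "Advances in string theory and astrophysics",
    "Genomics and neuroscience: 20 years of research",
    "Swiss Physics society meeting on string theory and astrophysics",
]
embedding = bourgain_embedding(titles, jaccard_distance)
for title, vec in zip(titles, embedding):
    print([round(x, 2) for x in vec], title)
```

The resulting vectors can be fed to k-means or any other algorithm that expects numerical features. Swapping in a different `distance` function changes the representation, which is exactly the design decision highlighted above.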
