Data Science: Asked by Jonathan Hedger on March 5, 2021
I have a dataset of profiles which contain freeform text describing the work history of a number of individuals.
I would like to attempt to identify frequently used words or groups of words across the set of profiles, so that I can build a taxonomy (of skills) related to the profiles.
For example, if the phrase 'conversion rate optimisation' appears 300 times across all profiles, I would see it on my list as a high-frequency keyphrase. I would expect to be able to filter the list by single keywords, two-word strings and three-word strings.
I would then be able to manually pick out frequently used keyphrases relating to skills that could be added to a master taxonomy list.
I would also need some way of filtering out invalid words like 'I', 'and', etc.
What is the best way to get something like this done?
I would like to attempt to identify frequently used words or groups of words
The difficulty here would be to capture multiword terms, as opposed to single words. This implies using n-grams for various values of $n$, and that can cause a bias when comparing the frequency of two terms of different length (number of words).
I would also need some way of filtering out invalid words like 'I', 'and', etc.
These are called stop words (sometimes also function words or grammatical words). They are characterized by the fact that they appear very frequently even though they form quite a small subset of the vocabulary (this is related to Zipf's law for natural language). These two properties make them easy to collect in a predefined list so that they can be excluded, and many such lists are freely available.
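For instance, NLTK ships with one such predefined list; a minimal sketch, assuming NLTK is installed with its stopwords corpus downloaded, and using made-up tokens:

    from nltk.corpus import stopwords  # needs a one-off nltk.download('stopwords')

    stop_words = set(stopwords.words('english'))

    # made-up tokens from a profile
    tokens = ['i', 'managed', 'conversion', 'rate', 'optimisation', 'and', 'seo']
    content_tokens = [t for t in tokens if t not in stop_words]
    print(content_tokens)  # ['managed', 'conversion', 'rate', 'optimisation', 'seo']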
Since you don't have any predefined list of terms, a baseline approach could go along these lines: lowercase and tokenize the text, discard stop words, extract all the n-grams for n from 1 to 3, count the frequency of every candidate term across the whole set of profiles, then sort by frequency and manually review the most frequent candidates.
This approach is very basic but easily adjustable: you can adapt it to your data, possibly add steps, etc. Otherwise there are probably specialized tools for terminology extraction, but I'm not familiar with any.
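As a rough illustration only, here is a minimal Python sketch of such a baseline; the regex tokenization, NLTK's English stop word list, the rule of skipping candidates that begin or end with a stop word, and the sample profiles are all assumptions of this sketch rather than part of the answer:

    import re
    from collections import Counter
    from nltk.corpus import stopwords  # needs a one-off nltk.download('stopwords')

    STOP_WORDS = set(stopwords.words('english'))

    def candidate_terms(texts, max_n=3, min_count=2):
        """Count 1- to max_n-word candidate terms across a collection of texts,
        skipping candidates that begin or end with a stop word."""
        counts = Counter()
        for text in texts:
            tokens = re.findall(r"[a-z']+", text.lower())   # crude tokenization
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    gram = tokens[i:i + n]
                    if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                        continue
                    counts[' '.join(gram)] += 1
        return [(t, c) for t, c in counts.most_common() if c >= min_count]

    # hypothetical usage: 'profiles' stands for the list of free-text work histories
    profiles = ['Led conversion rate optimisation for e-commerce clients',
                'Hands-on conversion rate optimisation and A/B testing']
    for term, count in candidate_terms(profiles)[:20]:
        print(count, term)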
Answered by Erwan on March 5, 2021
Clustering is the wrong tool for this purpose.
If you want to identify frequent patterns, use frequent pattern mining.
Here, you will want to consider order and locality, so some form of frequent sequence mining is certainly the way to go.
But since you likely only have a few hundred CVs, you can probably afford to simply count all words, 2-grams, 3-grams and 4-grams (which is still linear in the size of the input) and print the most frequent combinations of each length.
If you can afford to load multiple copies of your data into main memory, I suggest you simply use a dict and count all occurrences.
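A minimal sketch of that counting approach, using collections.Counter as the dict (the sample profiles and the crude regex tokenization are made up for illustration):

    import re
    from collections import Counter

    def ngrams(tokens, n):
        # consecutive n-word tuples from a token list
        return zip(*(tokens[i:] for i in range(n)))

    # made-up stand-in for the real collection of CV texts
    profiles = ['Led conversion rate optimisation projects for e-commerce clients',
                'Experience in conversion rate optimisation and A/B testing']

    counts = {n: Counter() for n in range(1, 5)}   # words, 2-grams, 3-grams, 4-grams
    for text in profiles:
        tokens = re.findall(r"[a-z']+", text.lower())
        for n in counts:
            counts[n].update(' '.join(g) for g in ngrams(tokens, n))

    # print the most frequent combinations of each length
    for n, counter in counts.items():
        print(n, counter.most_common(5))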
Answered by Has QUIT--Anony-Mousse on March 5, 2021