
Idf values of English words

Data Science Asked by Thirupathi Thangavel on August 16, 2021

I’m working on keyword/phrase extraction from a single document. I started by doing term frequency analysis, but this returns words like “new” which aren’t very helpful. So I want to penalize the common words and phrases, for which we normally use idf (inverse document frequency). But since it’s for a single document, I’m not sure how to do idf analysis.

Is it possible to use tf-idf method with pre-calculated idf values for (all?) words?
And are such values available somewhere?

3 Answers

The list of the 20,000 most common words in English is available here.

Using Zipf's law, we can estimate the probability of each of these words, as shown below.

Zipf's Law

In the English language, the probability of encountering the r-th most common word is given roughly by P(r) = 0.1/r for r up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges. Pierce's (1980, p. 87) statement that sum P(r) > 1 for r = 8727 is incorrect. Goetz states the law as follows: the frequency of a word is inversely proportional to its statistical rank r such that

P(r) = 1 / (r ln(1.78 R)),

where R is the number of different words.

These probability values can be used as a substitute for idf.
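For example, a minimal sketch of turning such a rank-ordered word list into Zipf-based weights might look like this (the file name and the choice of -log P(r) as the idf substitute are assumptions, not part of the original answer):

```python
import math

# Load a rank-ordered list of common English words, one word per line
# (e.g. the 20k-word list mentioned above; the file name is an assumption).
with open("20k_most_common_words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

R = len(words)  # number of different words in the list

# Zipf/Goetz estimate: P(r) = 1 / (r ln(1.78 R)) for the r-th most common word.
zipf_prob = {
    word: 1.0 / (rank * math.log(1.78 * R))
    for rank, word in enumerate(words, start=1)
}

# Use -log P(r) as an idf-like penalty: common words get small weights,
# rare words get large ones. Words not in the list get no weight here.
pseudo_idf = {word: -math.log(p) for word, p in zipf_prob.items()}

print(pseudo_idf.get("the"), pseudo_idf.get("new"))
```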

Correct answer by Thirupathi Thangavel on August 16, 2021

I don't believe there are any precalculated idf values out there. Inverse document frequency (idf) is the inverse of the number of documents in your corpus in which a particular word appears. If you only have one document, I'm afraid that value is simply 1 for every word.

However, if you are looking to get rid of words such as "the", "as", and "it", which don't carry much meaning, the nltk library in Python has useful tools for removing these "stop words" from your document and might help you.

Here is a helpful example.
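A minimal sketch of that stop-word approach, assuming the nltk stopwords corpus is installed:

```python
import nltk
from nltk.corpus import stopwords

# One-time download (a no-op if the corpus is already present).
nltk.download("stopwords", quiet=True)

text = "The new model extracts keywords from a single document."
stop_words = set(stopwords.words("english"))

# Simple whitespace tokenization; a real pipeline would use a proper tokenizer.
tokens = [t.strip(".,;:!?").lower() for t in text.split()]
content_words = [t for t in tokens if t and t not in stop_words]
print(content_words)  # ['new', 'model', 'extracts', 'keywords', 'single', 'document']
```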

Answered by gingermander on August 16, 2021

Thirupathi is correct that Zipf's law can be used to get a fairly decent set of IDF values from an ordered list of the 20,000 most common words. However, Google's n-grams have been available since 2012, and they contain the data you are looking for, although you have to extract it from their unigrams (i.e. 1-grams) dataset using awk or some other programming language or tool. If you go to the top of the repo mentioned by Thirupathi, they even allude to these Google n-grams, strangely enough, and they also mention that the files in their repo are derived from Peter Norvig's list of the 1/3 million most frequently used words. Norvig claims on his site that these come from Google's "Trillion word corpus". This may be the same corpus that Google uses to generate their n-grams; I'm not sure. But Norvig's 1/3 million word list contains a column with each word's count in the corpus, and this is the column you are looking for to compute your IDF values.

TF-IDF = the frequency (word count) of a term in an individual document divided by that word's frequency (word count) in the larger corpus; the latter is found in column 2 of Norvig's file. It would be superfluous to approximate this column using Zipf's law when you have the original file containing that column available. Here are the links that answer your question:

Google's n-grams, including 1-grams: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

Norvig's more convenient 1/3 million words dataset: https://norvig.com/ngrams/count_1w.txt

There are also many other valuable datasets on Norvig's main n-grams page: https://norvig.com/ngrams/
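As a rough sketch of how those corpus counts might be used, assuming count_1w.txt has been downloaded locally and contains tab-separated word/count pairs (the scoring below simply mirrors the ratio described above, not a standard TF-IDF library implementation):

```python
from collections import Counter

# Load Norvig's unigram counts: each line is "word<TAB>count"
# (download https://norvig.com/ngrams/count_1w.txt first).
corpus_counts = {}
with open("count_1w.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        word, count = line.split("\t")
        corpus_counts[word] = int(count)

def keyword_scores(document_text):
    """Rank terms by document count divided by corpus count, as described above."""
    tokens = [t.strip(".,;:!?").lower() for t in document_text.split()]
    tf = Counter(t for t in tokens if t)
    scores = {}
    for term, count in tf.items():
        corpus_count = corpus_counts.get(term)
        if corpus_count:  # skip terms missing from the corpus list
            scores[term] = count / corpus_count
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(keyword_scores("New keyword extraction methods for a single new document")[:5])
```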

It might be tempting to just use Norvig's very convenient dataset, but I believe it is more consistent with the scientific method (i.e. reproducibility) to do your own extraction from Google's 1-grams. This should actually yield far more than 1/3 of a million words, since Google's n-grams count as words quite a lot of things that most people would not consider words, e.g. 123:45 and the like. You can leave these in the dataset if you have the processing power to do the lookup, or if you can turn the dataset into a fast key-value store. There are many open-source key-value stores available, including Tokyo Cabinet and others, and there is also SQLite. So if you can turn it into a fast key-value store or other database, or if you have the processing power, that may be better than sifting through all those rows to keep just the data that fits your needs. Otherwise you will have to settle on some rule like "no hyphens and no colons", "strictly alphabetical", or "strictly alphanumeric" and prune out everything that doesn't fit. Just make sure to document everything you do if this is for some kind of scientific purpose.
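A possible sketch of that pruning and key-value-store approach, using SQLite; the input file name, the assumed column layout (word, year, match count, volume count per line), and the "strictly alphabetical" rule are assumptions to adjust against the actual extracted 1-gram files:

```python
import sqlite3

# Build a simple key-value store of word -> total count from a downloaded and
# decompressed 1-gram file. The assumed layout is "ngram<TAB>year<TAB>match_count<TAB>...",
# so counts are summed over years; check the actual files before relying on this.
conn = sqlite3.connect("unigrams.db")
conn.execute("CREATE TABLE IF NOT EXISTS unigrams (word TEXT PRIMARY KEY, count INTEGER)")

totals = {}
with open("googlebooks-eng-all-1gram-20120701-a", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        word, match_count = fields[0], int(fields[2])
        # Pruning rule from the answer: keep strictly alphabetical tokens only.
        if word.isalpha():
            totals[word.lower()] = totals.get(word.lower(), 0) + match_count

conn.executemany(
    "INSERT INTO unigrams VALUES (?, ?) "
    "ON CONFLICT(word) DO UPDATE SET count = count + excluded.count",
    totals.items(),
)
conn.commit()

# Fast lookup afterwards:
row = conn.execute("SELECT count FROM unigrams WHERE word = ?", ("example",)).fetchone()
print(row)
```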

Edit: You will need the total word counts from your individual documents as well as the total word count from the corpus. For Google's n-grams, the latter can be found in the files labeled total_counts on the linked page.

Answered by JMW on August 16, 2021
