Text classification using Python

Question

I am working on a text classification task in Python using NLP and scikit-learn.
I need to remove random strings from my text. I know I can remove stop words and punctuation with standard NLP tools, but what I am asking about is completely random strings such as 'ncdjbcjdkckdvcj', 'khsjgcgjcbjbcj', 'kdhjgcjgjc', and 'jsbjsgucgugcus', the kind you get by typing randomly on the keyboard. Note that my text also contains misspelled words and short forms. I don't want to remove those, only the random strings.
Is there a Python module or an external solution that can help me with this problem?

GBG · Answer

LibreOffice offers a collection of word libraries in a variety of languages. You could use the PyEnchant library to check words against the LibreOffice dictionaries to see whether they are valid words or just garbage.
Look here for some clues on using the LibreOffice libraries with PyEnchant.
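
As a rough sketch of the dictionary check described above, assuming PyEnchant and an "en_US" dictionary are installed: the helper names make_enchant_checker and drop_unknown are made up for illustration and are not part of any library. Note that an exact dictionary check on its own would also drop misspellings and short forms, so it probably needs to be combined with something more tolerant.

```python
def make_enchant_checker(lang="en_US"):
    """Return a function that says whether a word is in the given dictionary.

    Assumes PyEnchant (and a dictionary for `lang`) is installed.
    The import is local so the rest of the sketch works without it.
    """
    import enchant

    dictionary = enchant.Dict(lang)
    return dictionary.check


def drop_unknown(tokens, is_word):
    """Keep only the tokens that the checker accepts as real words."""
    return [tok for tok in tokens if is_word(tok)]


# Possible usage (requires PyEnchant):
#   is_word = make_enchant_checker("en_US")
#   cleaned = drop_unknown(["hello", "ncdjbcjdkckdvcj"], is_word)
```

Passing the checker in as a function keeps the filtering step independent of where the word list comes from.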

quassy · Answer

You could use dictionaries for your target language (like nltk.corpus.words), plus word lists of special terms related to your topic, and apply fuzzy string matching (like fuzzywuzzy) to keep all words that are similar to real words.
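
A minimal sketch of this idea, using difflib from the standard library as a stand-in for fuzzywuzzy and a tiny hand-made vocabulary instead of nltk.corpus.words. The function name keep_wordlike, the example vocabulary, and the 0.8 cutoff are assumptions for illustration, and the cutoff would need tuning on real data.

```python
from difflib import get_close_matches


def keep_wordlike(tokens, vocabulary, cutoff=0.8):
    """Keep tokens that are in the vocabulary or closely resemble a word in it."""
    vocab = list(vocabulary)
    kept = []
    for tok in tokens:
        word = tok.lower()
        # get_close_matches returns vocabulary words whose similarity
        # ratio to `word` is at least `cutoff` (an exact match scores 1.0).
        if get_close_matches(word, vocab, n=1, cutoff=cutoff):
            kept.append(tok)
    return kept


# Hypothetical vocabulary; in practice this would be a full word list
# plus domain-specific terms.
vocabulary = {"classification", "python", "text", "remove", "word"}
tokens = ["text", "clasification", "ncdjbcjdkckdvcj", "python"]
cleaned = keep_wordlike(tokens, vocabulary)
```

Comparing every token against a large word list this way is slow, so for a real corpus it probably makes sense to run it only on tokens that fail an exact dictionary lookup.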

Alternatively, depending on the amount and quality of your data, you could just remove all words that are not in any dictionary and occur only once in the whole data set. You will lose some rare misspellings, but also most of the random gibberish.
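
The frequency-based variant could look roughly like this, assuming the documents are already tokenized into lists of strings. The function name remove_rare_unknown and the min_count parameter are made up for this sketch.

```python
from collections import Counter


def remove_rare_unknown(docs, vocabulary, min_count=2):
    """Drop tokens that are not in the vocabulary and occur fewer than
    min_count times across the whole corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [tok for tok in doc if tok in vocabulary or counts[tok] >= min_count]
        for doc in docs
    ]


# Hypothetical toy corpus: "clasify" is a misspelling that occurs twice,
# "kdhjgcjgjc" is gibberish that occurs once.
docs = [["text", "clasify", "kdhjgcjgjc"], ["clasify", "python"]]
vocabulary = {"text", "python"}
cleaned = remove_rare_unknown(docs, vocabulary)
```

Here the repeated misspelling survives because it occurs twice, while the one-off random string is removed.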
