Python to clean miswritten words with repetitive letters such as "wwwwooorrrrddss" to "words"

Question

When cleaning text data in Python 3 for NLP, are there any 'common practices' for handling repeated letters in words, such as turning "wwwwooorrds" into "words" or "fffunnnyyyyyy" into "funny"?
The miswritten words come from an OCR step that I cannot fix upstream, so I wanted to check whether there is anything I can do downstream to correct them.
Thanks!

lys · Answer

A simple two-part solution from this site.
First, reduce any run of three or more identical characters to two (probably not good for Welsh):
import re

def reduce_lengthening(text):
    # (.) captures one character; \1{2,} matches two or more further copies of it
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)

print(reduce_lengthening("finallllllly"))  # finally

Then use pattern.en to check the spelling:
from pattern.en import suggest

word = "amazzziiing"
word_wlf = reduce_lengthening(word)  # calling the function defined above
print(word_wlf)  # amazziing: reducing the lengthening alone does not fix it completely

# suggest() returns a list of (candidate, confidence) pairs
correct_word = suggest(word_wlf)
print(correct_word)
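
If a spell checker is not available, or it struggles with inputs like "wwoorrds" where several letters are still doubled, a dictionary lookup over candidate spellings is another option. The following is only a minimal sketch, not part of the original solution: `VOCAB`, `candidates`, and `fix_repeats` are hypothetical names, the tiny vocabulary is a stand-in for a real word list, and the input is assumed to be a single lowercase word.

```python
import re
from itertools import product

# Hypothetical stand-in vocabulary; in practice, load a real word list.
VOCAB = {"words", "funny", "finally", "amazing"}

def candidates(word):
    """Yield every spelling obtained by shrinking each run of a
    repeated letter to either one or two copies."""
    # Split the word into runs: "fffunnyy" -> ["fff", "u", "nn", "yy"]
    runs = [m.group(0) for m in re.finditer(r"(.)\1*", word)]
    # A single letter stays as it is; a longer run may become 1 or 2 copies.
    options = [(r[0],) if len(r) == 1 else (r[0], r[0] * 2) for r in runs]
    for combo in product(*options):
        yield "".join(combo)

def fix_repeats(word, vocab=VOCAB):
    """Return the first candidate found in vocab, else the word unchanged."""
    for cand in candidates(word):
        if cand in vocab:
            return cand
    return word

print(fix_repeats("wwwwooorrds"))    # words
print(fix_repeats("fffunnnyyyyyy"))  # funny
```

The number of candidates doubles with every repeated run, which is fine for single words but worth keeping in mind. Some inputs are also genuinely ambiguous ("goood" could be "god" or "good"), and this sketch simply returns whichever match comes first.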

NLTK is another common toolkit that can help with this, for example by checking candidate spellings against its WordNet corpus.
