Data Science Asked by Pythoner on April 29, 2021
When cleaning text data in Python 3 for NLP, are there any ‘common practices’ for handling repeated letters in words, such as turning "wwwwooorrds" into "words", or "fffunnnyyyyyy" into "funny"?
The miswritten words come from an OCR step that I cannot fix upstream, so I wanted to check whether there is anything I can do downstream to fix this.
Thanks!
A simple two-part solution from this site:
First, shrink any run of the same letter that is longer than two down to two letters (probably not good for Welsh).
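import re

def reduce_lengthening(text):
    # (.) captures one character; \1{2,} matches two or more further copies of it,
    # so the whole pattern matches runs of three or more identical characters.
    pattern = re.compile(r"(.)\1{2,}")
    # Replace each such run with exactly two copies of the character.
    return pattern.sub(r"\1\1", text)

print(reduce_lengthening("finallllllly"))  # finally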
Then use pattern.en to check the spelling.
from pattern.en import spelling

word = "amazzziiing"
word_wlf = reduce_lengthening(word)  # calling the function defined above
print(word_wlf)  # amazziing: reducing the lengthening alone does not fix it completely
correct_word = spelling(word_wlf)
print(correct_word)
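If pulling in a spell checker is not an option, the same two steps can be sketched with the standard library alone. This is only a rough illustration, not the method from the linked site: `VOCAB` is a hypothetical stand-in for a real word list, and `candidates` and `fix_repeats` are names made up for this example. Instead of always keeping two letters, it tries shrinking each run of repeated letters to one or two copies and keeps a spelling that appears in the vocabulary.

```python
import itertools
import re

# Hypothetical stand-in vocabulary; in practice load a real word list.
VOCAB = {"words", "funny", "finally", "amazing", "book"}

def candidates(word):
    """Yield every spelling made by shrinking each run of a repeated
    character to either one or two copies."""
    # Split the word into runs of identical characters, e.g. "ooo", "rr", "d".
    runs = [m.group(0) for m in re.finditer(r"(.)\1*", word)]
    # A single character has one option; a longer run may become 1 or 2 copies.
    options = [[run[0]] if len(run) == 1 else [run[0], run[0] * 2]
               for run in runs]
    for combo in itertools.product(*options):
        yield "".join(combo)

def fix_repeats(word, vocab=VOCAB):
    """Return the first candidate found in vocab, or the word unchanged."""
    if word in vocab:
        return word
    for candidate in candidates(word):
        if candidate in vocab:
            return candidate
    return word

print(fix_repeats("wwwwooorrds"))    # words
print(fix_repeats("fffunnnyyyyyy"))  # funny
print(fix_repeats("amazzziiing"))    # amazing
```

The number of candidates doubles with every run of repeated letters, so this is only reasonable for single words, and when several candidates are real words it simply returns the first one it finds.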
NLTK is another common toolkit that can help with this.
Answered by lys on April 29, 2021