Data Science Asked by CopyOfA on February 14, 2021
I am new to NLP and have a dataset that has a bunch of (social media) messages on which I would like to try some methods like latent Dirichlet allocation (LDA). First, I need to clean the data of things like punctuation, emojis, etc. I’m not sure how to go about doing this in the most efficient and accurate manner. My code right now is this:
import pandas as pd
import re

class TopicModel():
    def __init__(self, data_path="data.csv"):
        self.data_path = data_path
        self.data = pd.read_csv(self.data_path, low_memory=False)

    def clean_data(self):
        self.remove_message_na()
        self.remove_emojis()
        self.remove_url()
        self.remove_punctuation_and_lower()
        self.remove_empty_messages()

    def remove_message_na(self):
        # drop rows with a missing message
        self.data = self.data.loc[~pd.isna(self.data['message'])]

    def remove_emojis(self):
        # dropping non-ASCII characters also drops emojis
        self.data['message'] = self.data['message'].str.encode("ascii", "ignore").str.decode("utf8")

    def remove_url(self):
        self.data['message'] = [re.sub(r"http\S+", "", ii) for ii in self.data['message'].tolist()]

    def remove_punctuation_and_lower(self):
        p = re.compile('''[!#?,.:";]''')
        self.data['cleaned_data'] = [p.sub("", ii).lower() for ii in self.data['message'].tolist()]

    def remove_empty_messages(self):
        self.data = self.data.loc[self.data['cleaned_data'] != ""]
I don’t want to remove contractions, which is why I left out the apostrophe (') from my punctuation list, though ideally the contractions would be expanded into two separate words. I also wonder about other punctuation marks that carry meaning in social media data, e.g., the # in hashtags. I know this question is a bit general, but I’m wondering if there is a good Python library for performing the kind of data-cleaning operations I want prior to performing topic analysis, sentiment analysis, etc. I’d also like to know which libraries can perform these data-cleaning operations efficiently on a pandas data frame.
I will summarize your questions and then try to answer each one below:
The first go-to is a regular expression, which is used very frequently in data preprocessing. If you want to remove all punctuation from the text, you can use one of these two approaches:
import string
sentence = "hi; how*, @are^ you? wow!!!"
sentence = sentence.translate(sentence.maketrans('', '', string.punctuation))
print(sentence)
output: 'hi how are you wow'
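Since you mentioned wanting to keep contractions, a small variation (just a sketch) is to drop the apostrophe from the punctuation set before translating:
import string
# keep apostrophes so contractions like "don't" survive
punct = string.punctuation.replace("'", "")
sentence = "don't; remove* the @apostrophe!"
print(sentence.translate(str.maketrans('', '', punct)))
output: "don't remove the apostrophe"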
or use a regular expression, for example:
import re
s = "string. With. Punctuation? 3.2"
s = re.sub(r'[^\w\s]', '', s)
print(s)
output: 'string With Punctuation 32'
Generally, NLTK and SciPy are very handy, but for specific purposes other libraries also exist, for example the contractions and inflect libraries.
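For instance, a quick sketch of those two (assuming both packages are installed from PyPI):
import contractions
import inflect

# expand contractions into separate words
print(contractions.fix("you're happy, don't worry"))
output: 'you are happy, do not worry'

# spell numbers out as words
p = inflect.engine()
print(p.number_to_words(42))
output: 'forty-two'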
You can apply any function from any library to your pandas data frame using the apply method. Here is an example:
import pandas as pd

# squeeze=True returns a Series when the CSV has a single column
s = pd.read_csv("stock.csv", squeeze=True)

# adding 5 to each value
new = s.apply(lambda num: num + 5)
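To tie this back to your data, here is a rough sketch of applying a cleaning function to the message column (the toy data frame below is just a stand-in for your data.csv):
import re
import string

import contractions
import pandas as pd

# toy data standing in for data.csv
df = pd.DataFrame({"message": ["I'm so happy!! http://example.com", "Check #this out"]})

# keep apostrophes until contractions are expanded
punct = string.punctuation.replace("'", "")

def clean(text):
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = contractions.fix(text)         # "I'm" -> "I am"
    text = text.translate(str.maketrans('', '', punct))
    return text.lower().strip()

df['cleaned_data'] = df['message'].apply(clean)
print(df['cleaned_data'].tolist())
output: ['i am so happy', 'check this out']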
Answered by Fatemeh Rahimi on February 14, 2021