Ways of filtering erroneous email addresses using NLP?

Question

Background:

I have a database of information about users who registered through a website.

Objective:

I would like to filter out erroneous email addresses, not because they are malformed (e.g. missing an @-sign), but because of "weird strings" in the "local-part" of the address. Examples of erroneous email addresses are things like:

zzzzzzzzzzzzzzzzz@gmail.com
asdfasdfasdf@gmail.com
yourenotgettingmyrealemail@gmail.com
123@yahoo.com
test@test.com

I know most of these require some "human" interpretation to figure out that they're probably not real, but I was wondering if there are any algorithms that can help me out.

Jack · Answer

More important than the algorithm is having an existing corpus of labeled data. Most ML algorithms need to train on a huge amount of text before they start producing useful results (for example, an NLP algorithm was trained to generate fiction by reading the entire Harry Potter series).

Do you have a list of known good and bad e-mail addresses?

You can still try to group email addresses even if your data isn’t labeled, but it’s harder.
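
As a rough sketch of what grouping without labels could start from, you could compute a few simple numbers for each local-part and then sort or cluster on them. The function name local_part_features and the choice of features here are my own assumptions for illustration, not an established method:

```python
import math
from collections import Counter

def local_part_features(email):
    """Return a few simple, label-free features for the part before the last '@'."""
    local_part = email.rsplit("@", 1)[0].lower()
    n = len(local_part)
    counts = Counter(local_part)
    # Shannon entropy of the character distribution (0.0 when every character is the same).
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    return {
        "length": n,
        "digit_ratio": sum(ch.isdigit() for ch in local_part) / n if n else 0.0,
        "distinct_ratio": len(counts) / n if n else 0.0,
        "entropy": entropy,
    }

for address in ["zzzzzzzzzzzzzzzzz@gmail.com", "123@yahoo.com", "test@test.com"]:
    print(address, local_part_features(address))
```

Addresses with a very low distinct_ratio or a digit_ratio of 1.0 would tend to end up near each other, but you would still have to inspect the groups by hand to decide which ones are "bad".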

shepan6 · Answer

So the question is to develop a model / system to detect these "weird strings", which so far we can formalise as:

(1) Strings with a large number of repetitions of a subsection of the string (e.g. zzzzzzzzzzzzzzzzz@gmail.com, asdfasdfasdf@gmail.com)
(2) Strings which contain solely numbers (e.g. 123@yahoo.com)
(3) Strings which contain "meaningless" words (e.g. blah, test)
(I'm sure there are many more conditions)

James C's idea of a baseline system is a good place to start: whenever you change the model, you can test it against the baseline to see whether the change actually improves the weird-text classification.
A simple baseline would be to write regular expressions (Regex 101 is a great place to develop them) that detect conditions (1) and (2). For (3), a starting point is to say that an email address has a "weird string" in it if any of the "meaningless" words (stored in a list) appears in the address.
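
Here is a minimal sketch of such a baseline in Python. The names weird_reasons and MEANINGLESS_WORDS, the word list itself, and the thresholds (a chunk of 1 to 4 characters repeated at least three times) are assumptions I picked for illustration, so tune them against your own data:

```python
import re

# Hypothetical word list for condition (3); extend it with terms from your own data.
MEANINGLESS_WORDS = ["blah", "test"]

# Condition (1): a chunk of 1-4 characters repeated at least three times in a row,
# e.g. "zzz" or "asdfasdfasdf". The chunk size and repeat count are arbitrary choices.
REPEATED_CHUNK = re.compile(r"(.{1,4})\1{2,}")

# Condition (2): the local-part consists solely of digits.
DIGITS_ONLY = re.compile(r"\d+")

def weird_reasons(email):
    """Return the list of baseline conditions that the local-part triggers."""
    local_part = email.rsplit("@", 1)[0].lower()
    reasons = []
    if REPEATED_CHUNK.search(local_part):
        reasons.append("repetition")
    if DIGITS_ONLY.fullmatch(local_part):
        reasons.append("digits_only")
    if any(word in local_part for word in MEANINGLESS_WORDS):
        reasons.append("meaningless_word")
    return reasons

for address in [
    "zzzzzzzzzzzzzzzzz@gmail.com",
    "asdfasdfasdf@gmail.com",
    "yourenotgettingmyrealemail@gmail.com",
    "123@yahoo.com",
    "test@test.com",
]:
    print(address, weird_reasons(address))
```

Note that this baseline misses yourenotgettingmyrealemail@gmail.com entirely, and the plain substring check for (3) will also flag legitimate local-parts such as "contest", which is exactly the kind of weakness a better model would be measured against.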
