TransWikia.com

Ways of filtering erroneous email addresses using NLP?

Data Science Asked by Shazbots on December 24, 2020

Background:

I have a database of information about users who registered through a website.

Objective:

I would like to filter out erroneous email addresses, not because they are malformed (e.g. missing an @-sign), but because of “weird strings” in the “local-part” of the address. So examples of erroneous email addresses are things like:

I know most of these require some “human” interpretation to figure out that they’re probably not real, but I was wondering if there are any algorithms that can help me out.

2 Answers

More important than the algorithm is having an existing corpus of labeled data. Most ML algorithms need to train on a huge amount of data before they start producing useful results (for example, an NLP algorithm was trained to generate fiction by reading the entire Harry Potter series).

Do you have a list of known good and bad e-mail addresses?

You can still try to group email addresses even if your data isn’t labeled, but it’s harder.
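
As a rough illustration of grouping without labels, here is a minimal sketch that buckets addresses by the coarse "shape" of their local part (runs of letters become `a`, runs of digits become `9`). The function names `shape` and `group_by_shape` and the sample addresses are made up for this example; this is just one simple way to group unlabeled strings, not necessarily what a full solution would use.

```python
import re
from collections import defaultdict


def shape(local_part):
    """Collapse a string to a coarse pattern: letter runs -> 'a', digit runs -> '9'."""
    s = re.sub(r"[A-Za-z]+", "a", local_part)
    s = re.sub(r"[0-9]+", "9", s)
    return s


def group_by_shape(emails):
    """Group addresses whose local parts share the same shape (hypothetical helper)."""
    groups = defaultdict(list)
    for email in emails:
        # Everything before the last '@' is treated as the local part.
        local = email.rsplit("@", 1)[0]
        groups[shape(local)].append(email)
    return dict(groups)


# Made-up sample data for illustration only.
sample = [
    "john.smith@example.com",
    "jane.doe@example.com",
    "12345@example.com",
    "jsmith42@example.com",
]
groups = group_by_shape(sample)
for pattern, members in groups.items():
    print(pattern, members)
```

Small or unusual groups (for instance a digits-only shape) are candidates for manual review, which can also be a cheap way to start building a labeled set.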

Answered by Jack on December 24, 2020

So the question is to develop a model / system to detect these "weird strings", which so far we can formalise as:

  1. Strings with a large number of repetitions of a subsection of the string (e.g. [email protected], [email protected])
  2. Strings which contain solely numbers (e.g. [email protected])
  3. Strings which contain "meaningless" words (e.g. blah, test)

(I'm sure there are many more conditions.)

James C's idea of a baseline system is a good place to start: whenever you improve the model, you can test it against the baseline to see whether the change actually improves the weird text classification.

So, one idea for a baseline system would be to simply write regular expressions (Regex 101 is a great place to develop them) that detect conditions (1) and (2). For (3), a starting point is to say that an email address has a "weird string" within it if any of the "meaningless" words (stored in a list) appear in the email address.
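
A minimal sketch of such a baseline in Python, using only the standard `re` module, might look like the following. The repetition threshold (a chunk of 1 to 4 characters appearing three or more times in a row), the contents of `MEANINGLESS_WORDS`, and the function name `weird_reasons` are all assumptions chosen for illustration, so tune them against your own data.

```python
import re

# Hypothetical word list for condition (3); extend it for your own data.
MEANINGLESS_WORDS = ["blah", "test", "asdf"]

# Condition (1): a chunk of 1-4 characters repeated 3+ times in a row.
# The threshold is an assumption, not a recommended value.
REPEAT_RE = re.compile(r"(.{1,4}?)\1{2,}")

# Condition (2): the local part consists solely of digits.
DIGITS_RE = re.compile(r"\d+")


def weird_reasons(email):
    """Return the list of baseline conditions that the local part triggers."""
    # Everything before the last '@' is treated as the local part.
    local = email.rsplit("@", 1)[0].lower()
    reasons = []
    if REPEAT_RE.search(local):
        reasons.append("repetition")
    if DIGITS_RE.fullmatch(local):
        reasons.append("digits_only")
    if any(word in local for word in MEANINGLESS_WORDS):
        reasons.append("meaningless_word")
    return reasons


# Made-up addresses for illustration only.
for address in ["aaaaaa@example.com", "123456@example.com",
                "blahblahblah@example.com", "john.smith@example.com"]:
    print(address, weird_reasons(address))
```

Note that the plain substring check for (3) will also flag legitimate local parts such as "contest", which contains "test", so the flagged addresses are best treated as candidates for review rather than as definite rejects.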

Answered by shepan6 on December 24, 2020
