Data Science Asked by Shazbots on December 24, 2020
Background:
I have a database of information about users who registered through a website.
Objective:
I would like to filter out erroneous email addresses, not because they are malformed (e.g. missing an @-sign), but because of “weird strings” in the “local-part” of the address. Examples of erroneous email addresses are things like:
I know most of these require some “human” interpretation to see that they’re probably not real, but I was wondering whether there are any algorithms that can help me out.
More important than the algorithm is having an existing corpus of labeled data. Most ML algorithms need to train on a huge amount of text before they start producing useful results (for example, an NLP algorithm was trained to generate fiction by reading the entire Harry Potter series).
Do you have a list of known good and bad e-mail addresses?
You can still try to group email addresses even if your data isn’t labeled, but it’s harder.
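As a rough illustration of grouping without labels, the sketch below computes a few character-level features of the local-part that could be fed to a clustering algorithm or simply sorted for manual review. The feature choice (length, character entropy, vowel ratio, digit ratio) and the function names are assumptions made for this example, not something taken from the question.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def local_part(address: str) -> str:
    """Return the lowercased text before the last '@' (the local-part)."""
    return address.rsplit("@", 1)[0].lower()


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for empty text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def features(address: str) -> dict:
    """Hypothetical feature set describing how 'word-like' a local-part is."""
    local = local_part(address)
    letters = [c for c in local if c.isalpha()]
    digits = [c for c in local if c.isdigit()]
    return {
        "length": len(local),
        "entropy": shannon_entropy(local),
        # Share of letters that are vowels; keyboard mashing often scores low.
        "vowel_ratio": sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0,
        # Share of all characters that are digits.
        "digit_ratio": len(digits) / len(local) if local else 0.0,
    }
```

Sorting addresses by, say, `vowel_ratio` would surface candidates for a human to look at first; whether these features actually separate real from fake addresses would have to be checked on your own data.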
Answered by Jack on December 24, 2020
So the question is how to develop a model / system to detect these "weird strings", which so far we can formalise as:
James C's suggestion of a baseline system is a good place to start: whenever you change the model, you can compare it against the baseline to see whether the change actually improves the weird-text classification.
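One way to make that comparison concrete, assuming you have at least a small hand-labeled sample, is to score every candidate with the same metrics. The helper below is a minimal sketch: `precision_recall` and its arguments are hypothetical names, and `predict` stands for any function that returns True when it considers an address weird.

```python
def precision_recall(predict, labeled):
    """Score a classifier on (address, is_bad) pairs.

    predict: callable taking an address and returning True if it is flagged.
    labeled: iterable of (address, is_bad) tuples, labeled by hand.
    Returns (precision, recall); each is 0.0 when its denominator is zero.
    """
    tp = fp = fn = 0
    for address, is_bad in labeled:
        flagged = predict(address)
        if flagged and is_bad:
            tp += 1  # correctly flagged
        elif flagged and not is_bad:
            fp += 1  # good address wrongly flagged
        elif not flagged and is_bad:
            fn += 1  # bad address missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Running this once for the baseline and once for each new model on the same labeled sample gives two pairs of numbers to compare directly.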
So, one idea for a baseline system would be to simply write regular expressions (Regex 101 is a great place to develop them) that detect conditions (1) and (2). For (3), a starting point is to say that an email address contains a "weird string" if any of the "meaningless" words (stored in a list) appears in it.
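A minimal sketch of such a baseline is shown below. The two regular expressions are placeholders only (a character repeated four or more times, and a local-part made entirely of consonants); they are assumptions standing in for conditions (1) and (2) and should be replaced with patterns for the real conditions. The word list is likewise hypothetical.

```python
import re

# Placeholder patterns: assumptions for illustration, NOT the original
# conditions (1) and (2). Swap in patterns for your own rules.
PLACEHOLDER_PATTERNS = {
    # Same character four or more times in a row, e.g. "aaaa".
    "repeated_char": re.compile(r"(.)\1{3,}"),
    # Local-part of five or more letters with no vowels at all.
    "no_vowels": re.compile(r"^[b-df-hj-np-tv-z]{5,}$"),
}

# Hypothetical list of "meaningless" words for condition (3).
MEANINGLESS_WORDS = ["asdf", "qwerty", "zxcv"]


def weird_reasons(address: str) -> list:
    """Return the names of every rule the address's local-part triggers."""
    local = address.rsplit("@", 1)[0].lower()
    reasons = [
        name
        for name, pattern in PLACEHOLDER_PATTERNS.items()
        if pattern.search(local)
    ]
    reasons += [f"word:{word}" for word in MEANINGLESS_WORDS if word in local]
    return reasons


def is_weird(address: str) -> bool:
    """Baseline decision: weird if at least one rule fires."""
    return bool(weird_reasons(address))
```

Returning the reasons rather than only a yes/no makes it easier to see which rule is responsible for a false positive when you review the flagged addresses.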
Answered by shepan6 on December 24, 2020