Data Science Asked by lhy on May 17, 2021
Say I have two email addresses and I would like to see if it is likely that they belong to the same person. For example, [email protected] and [email protected] are likely to be from the same person (it doesn't have to be certain; an estimate of the likelihood would be sufficient).
I had two directions in mind to achieve this: one is a string comparison between the two email addresses, and the other is to first extract the names from the email addresses and then compare whether they might be the same person. In the example above, the extracted names would be Cameron M Thompson and c thompson.
I am also wondering: if one of the email addresses is guaranteed to contain the full name (company email addresses usually do), would that help with extracting the name from the other email address (personal email addresses might not always contain the full name), or with comparing the two email addresses?
I have had a hard time figuring out whether either of these two directions is feasible, especially since email addresses might not have separators, and names vary so much that a list of names might not be sufficient to find a match.
How should I proceed in solving this problem? Would machine learning / deep learning help, or should I go with something simpler like regex and fuzzy string matching?
UPDATE:
I have a dataset with two columns, email address and name, and about 2k rows. I believe this could be used for the second direction (name extraction). For the first direction (string comparison similarity), I am thinking of reshaping the dataset into three columns (email address 1, email address 2, label of whether they are the same person), which should give about 1k rows of data.
Before talking about the solution, why not focus on the email content instead? I think it would be more helpful for your problem, since most emails end with the sender's signature (Name Surname). The probability of failing to obtain this information from an email address is also much higher than the probability of failing to get it from the content. This is especially the case with company email addresses, which might not contain the whole name (only the first letter of the name plus the surname, e.g. John Travolta - [email protected]), whereas the message will normally end with the author's full name, or at least the first name. Furthermore, plenty of email addresses contain only the name, only the surname, or neither of them, but substitute words instead, like superboy122133@+++.com :D. Most email apps, however, add a default signature that includes the name and surname. You can also combine the two techniques: use the email address data together with the email content data, so that if it is infeasible or impractical to obtain the information from one of them, you can fall back on the other.
However, if you have to do it with nothing but the email address, I think using Machine Learning techniques would be overkill for the problem. Also, using non-machine-learning techniques does not mean you are oversimplifying the solution; every technique gives its best results when it is applied in the right context. Imagine a simple situation: if you know or can easily infer that [tax] = 0.2 * [salary] + 20 $, why would you fit this equation using Machine Learning?
Unless you have data in the format |email address, fullname|, you shouldn't start with Machine Learning. (If you did have |email address, fullname| data, one option would be to train a model to learn the general relationship between the email address and the full name, and use it to identify similar email addresses.)
However, in the current situation, one approach would be to identify the possible patterns in the email addresses (for example, first initial plus surname, as in the John Travolta example above).
The features extracted from the email addresses using the identified patterns can then be compared with those of other email addresses, either by hashing or by using string distance algorithms.
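As a rough illustration of the string-distance variant, here is a minimal sketch using only the Python standard library. The separator set, the rule that a single letter counts as a matching initial, and the example.com / example.org addresses are all assumptions made for this example, not something fixed by the problem.

```python
import re
from difflib import SequenceMatcher


def local_tokens(email):
    """Lowercase the part before '@', drop digits, split on common separators."""
    local = email.split("@", 1)[0].lower()
    local = re.sub(r"\d+", "", local)
    return [t for t in re.split(r"[._\-+]+", local) if t]


def token_score(a, b):
    """Score two tokens: an initial matches a name with the same first letter
    (a crude assumption), otherwise use difflib's similarity ratio."""
    if len(a) == 1 or len(b) == 1:
        return 1.0 if a[0] == b[0] else 0.0
    return SequenceMatcher(None, a, b).ratio()


def email_similarity(email_a, email_b):
    """Average, over the tokens of the shorter address, of the best match
    found among the tokens of the other address."""
    ta, tb = local_tokens(email_a), local_tokens(email_b)
    if not ta or not tb:
        return 0.0
    short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    return sum(max(token_score(s, l) for l in long_) for s in short) / len(short)


# Hypothetical addresses built from the names mentioned in the question.
same = email_similarity("cameron.m.thompson@example.com", "c.thompson@example.org")
different = email_similarity("john.travolta@example.com", "superboy122133@example.com")
print(same, different)
```

The score is only a likelihood-style heuristic: treating any matching initial as a full match is generous, so a threshold would need tuning on labelled pairs.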
An alternative approach would be to keep a hashed dictionary of all available names and surnames, then cut pieces (substrings) from the email address and hash them to find the names and surnames in the address (of course, the reverse would be highly inefficient). The email addresses with the most similar properties would then be matched.
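A minimal sketch of the dictionary lookup, assuming a Python set (which is hash-based) stands in for the hashed dictionary. The tiny KNOWN_NAMES set, the minimum substring length of 3, and the example addresses are placeholders chosen for illustration.

```python
def find_names(email, known_names, min_len=3):
    """Return every substring of the local part found in known_names,
    longest first (ties broken alphabetically)."""
    local = email.split("@", 1)[0].lower()
    found = set()
    for start in range(len(local)):
        for end in range(start + min_len, len(local) + 1):
            piece = local[start:end]
            if piece in known_names:  # set membership is a hash lookup
                found.add(piece)
    return sorted(found, key=lambda s: (-len(s), s))


def shared_names(email_a, email_b, known_names):
    """Names from the dictionary that appear in both addresses."""
    return set(find_names(email_a, known_names)) & set(find_names(email_b, known_names))


# Toy dictionary; a real one would hold many thousands of names and surnames.
KNOWN_NAMES = {"john", "travolta", "cameron", "thompson"}

print(find_names("johntravolta99@example.com", KNOWN_NAMES))
print(shared_names("johntravolta99@example.com", "j_travolta12@example.org", KNOWN_NAMES))
```

Enumerating substrings is quadratic in the length of the local part, which is cheap for email addresses; short dictionary entries can cause false hits, which is why the sketch ignores substrings shorter than `min_len`.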
Another solution would be to use the above-mentioned patterns to generate a set of artificial email addresses. Since it is highly probable that no dataset exists that contains people's names and surnames together with one or more of their email addresses, data augmentation is the first order of business. (I am not sure whether the term data augmentation fits this situation; if it does not, let's call it data generation.) Your input would be Name Surname (you can include a middle name, a number, etc.), and the output would be randomly generated email addresses based on the pre-defined patterns. The number of email addresses generated for a single input should also be chosen randomly, but be careful not to generate the same email address more than once. E.g. input -> John Travolta -> output -> j_travolta12@+++.com, john.t.99@+++.com, john.travolta@+++.com (let's suppose that for this example we randomly chose to generate 3 email addresses).
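The generation step might be sketched as follows. The first three patterns mirror the John Travolta example; the remaining two, the 1 to 99 number range, and the example.com domain are assumptions added for illustration.

```python
import random


def generate_emails(first, last, n, rng, domain="example.com"):
    """Generate up to n distinct synthetic addresses for one person.

    rng is a random.Random instance so that runs can be made reproducible.
    """
    first, last = first.lower(), last.lower()
    patterns = [
        lambda: f"{first}.{last}",                            # john.travolta
        lambda: f"{first[0]}_{last}{rng.randint(1, 99)}",     # j_travolta12
        lambda: f"{first}.{last[0]}.{rng.randint(1, 99)}",    # john.t.99
        lambda: f"{first[0]}{last}",                          # assumed pattern
        lambda: f"{first}{last}{rng.randint(1, 99)}",         # assumed pattern
    ]
    emails = set()  # a set guarantees no address is produced twice
    for _ in range(50 * n):  # bounded number of attempts, so the loop always ends
        if len(emails) == n:
            break
        emails.add(f"{rng.choice(patterns)()}@{domain}")
    return sorted(emails)


rng = random.Random(0)
how_many = rng.randint(1, 5)  # the number of addresses per person is random too
print(generate_emails("John", "Travolta", how_many, rng))
```

Each generated address can be stored alongside the name it came from, which yields the |email address, fullname| pairs needed for training.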
Then, after you have created email addresses with (almost) all possible patterns, you can get help from Machine Learning techniques: a model could give you the probability that an email address corresponds to a given name and surname. (You can also configure the output to return the top n names and surnames with the highest probability.)
Another thing to consider is the possibility of two different people having the same name and surname. Lastly, whichever approach you use, your solution cannot be perfect because, for example, it is not possible to tell whether the character 'j' in an email address stands for John or Jake. Thus, if you can integrate the email content into your solution, that will improve the performance drastically.
Update Accordingly: Check this answer which does not exactly answer your problem, but the context is the same.
Correct answer by Shahriyar Mammadli on May 17, 2021