Reverse Engineering Asked by Alex Beals on February 26, 2021
Driver’s License numbers in New Jersey aren’t random. They follow the format: Affff lllii mmyye
, where A
is the first letter of the person’s last name, ffff
is some mapping of the remaining letters of the last name to a four digit numeric, lll
is a mapping of the full first name to a three digit numeric and ii
is a code representing the middle initial (according to the below table:
| | 6 | 7 | 8 |
|---|---|---|---|
| 1 | a | j | |
| 2 | b | k | s |
| 3 | c | l | t |
| 4 | d | m | u |
| 5 | e | n | v |
| 6 | f | o | w |
| 7 | g | p | x |
| 8 | h | q | y |
| 9 | i | r | z |
Where the number corresponding to the initial is 10*column number + row number. mm corresponds to the month born, and yy
to the year born. e
is the eye color (a value 1-8 corresponding to BRO
, BLU
, GRY
, GRN
, BLK
, etc.)
The only thing I don’t understand is how the names are mapped to the integer values. I only have 5 examples for the last name mappings: (ignoring the first letter because it doesn’t play into the mapping
aab -> 0001
ackson -> 0062
eals -> 2024
eimel -> 2278
ounds -> 6810
For first names, I only have four:
Alexander -> 019
Richard -> 655
John -> 407
Matthew -> 529
Does anyone have any ideas how the implementation is done, or even a general mapping function that will hash a max 25 length string to a four digit or three digit number while maintaining lexicographical order (<=, not <).
Things I’ve Tried
Convert each letter to a number 1-26. Then, taking only the first four numbers, create the number by the rule 26^3 * first number + 26^2 * second number + 26 * third + fourth. Then, divide this number by 26^4 + 26^3 + 26^2 + 26, and multiply by 10000 to map the decimal into 0-9999. This produces the following mappings:
aab -> 0000
ackson -> 0035
eals -> 1547
emiel -> 1722
ounds -> 5695
Get a list of the top 10,000 most common surnames. Order by the second letter, and then check the index. This produces the following mappings:
aab -> 0005
ackson -> 0128
eals -> 2813
emiel -> 3235
ounds -> 7588
Each letter subdivides the 10,000. The first number (according to 1-26) cuts it into one of 26 pieces. The second cuts the piece into one of 26, and so on and so forth. This produces the following mappings:
aab -> 0000
ackson -> 0028
eals -> 1536
emiel -> 1648
ounds -> 5656
Convert each of the first four letters to 1-26. Concatenate all of them, multiply the resulting number by 10,000, and divide by 26262626. This produces the following mappings:
aab -> 0003
ackson -> 0392
eals -> 1908
emiel -> 1953
ounds -> 5792
Do the above with 0-25, divide by 25252525. This produces the following mappings:
aab -> 0000
ackson -> 0008
eals -> 1584
emiel -> 1631
ounds -> 5623
Additional Samples
While I believe all of the above samples are correct, I tried to track down more authentic sample data points. Ones that I can guarantee are below:
Last Names
avis -> 0921
eals -> 2024
olff -> 6247
orello -> 6581
First Names
Alexander -> 019
Andrew -> 042
Gabriel -> 270
Lena -> 456
Many states use something called SoundEx to generate license numbers (sometimes you even see SoundEx on government forms and/or computer screens when they ask for drivers license numbers.)
The soundex system was designed to phonetically map names that sound similar to close values, even though they might be spelled wildly differently eg Pheiffer vs Fifer)
See also things like Metaphone. Also, they may not use soundex directly.
Answered by BitShifter on February 26, 2021
This is not yet a complete answer, but perhaps what I've found can be combined with other information to come up with the complete solution.
If we assume a linear encoding, then we have everything needed to figure this out based on your four samples. If we consider letter values as a=0, b=1, ...
regardless of whether they're uppercase or lowercase, your four samples can be turned into four linear equations:
a*0 +b*11+c*4 +d*23 = 19 (Alex)
a*12+b*0 +c*19+d*19 = 529 (Matt)
a*9 +b*14+c*7 +d*13 = 407 (John)
a*17+b*8 +c*2 +d*7 = 655 (Rich)
Since we have four equations and four unknowns, it's easily solved using simple but tedious algebra or in matrix form using Gaussian elimination. (Sorry for the ugly looking math, but unlike other StackExchange sites apparently ReverseEngineering doesn't support MathML, which is unfortunate.)
If you do so, you get the following values:
a = 83700 / 2279
b = 9484 / 2279
c = 16030 / 2279
d = −5441 / 2279
All very neat and accurate, but there's a problem, which is that any four samples would result in some answer. The question is whether it works for all possible names, and unfortunately, the answer is no.
I did some searching on the internet and found a few more samples. Here's an image of a Russian spy's New Jersey license and here is a Police guide (see page 60). This pamphlet from the NJ MVC encodes "Dennis J. Driver" as D4047-16371
If we try the first name equation above on these new samples, they fail, so it's not quite right. The result suggests that the weighting is not quite so simple. When searching, I also found that both Ontario and Québec licenses appear to use the same first and last name encodings. So for example, this temporary Ontario permit verifies that "Dennis" is encoded as 163 in Ontario as well as in New Jersey.
When I run a linear regression on all of the first name values vs. the first letter l
(encoded as a=0, b=1, ...
) I get the equation 32.42*l+52.55
with an R^2 value of 0.986 which shows this to be highly linear.
I tried a very simple experiment with the last name encoding which was a very simplistic method not mentioned in your list of things you have tried. That was to simply consider each character as a base-26 digit. Using the 4 characters following the first, the encodings for "Baab" and "Jackson" are correctly obtained, but no others matched.
I did some searching for existing encoding schemes. Soundex was both easily found and easily discounted, but there are many variations to it and it's possible that some expanded variation was used. I was not able to locate a Soundex variant that produced these particular values, but I learned some interesting things along the way.
First, perhaps not surprisingly, there has long been a need to try to match up names in a database using some kind of encoding. Generically, the problem is called record-linking and is typically thought of as mathing a possibly misspelled name to a subset of possible matches in a database. Soundex has been used for this purpose, but found to be somewhat lacking in effectiveness.
Other schemes I have located, or at least located references to include:
This stringmetric project has what appears to be a nice collection of algorithm implementations with links to the original describing papers, but I haven't tried all of these.
Perhaps if someone does, they can report back here.
Answered by Edward on February 26, 2021
I don't see this above, but male or female is coded in as well. in the last five digits, the first 2 are month of birth. Males are 01-12. Females 50 is added. so the run from 51 (january) to 62 (december) Also, my name is Alexandra, which is also 019 as is your example of alexander. The absence of a middle name is reflected as 00 i know a friend with middle name alexandra has 61 = (ii) another, is Serafina middle name 82 = (ii) another, is Dorothy middle name 64 = (ii) I would suggest collecting more name samples to compare
Answered by alexandra on February 26, 2021
In case you're still trying to figure this out, I've made some progress. With assistance from u/jccool5000 on reddit (post), who has a collection of over 900 samples mostly from Ontario. AFAIK, Ontario and NJ share the same encoding - Quebec, not so sure. I did some data manipulation to figure this out.
Starting with the numbers of the last name, 1st of 4 digits corresponds to the 2nd letter of the last name, as the 1st is already coded directly to the first letter of the license number.
0 = A
1 = B C D
2 = E
3 = F G H
4 = I J K
5 = L M N
6 = O
7 = P Q R
8 = S T
9 = U V W X Y Z
The remaining three numerical digits codes the second letter of the last name as well, from 000-999. However, each second-digit has its own 000-999 range. That is to say:
You can refer to the above table to see when the 999 will reset back to 000. This is just the pattern I've found so far. I don't know how the numbers are distributed to the names.
First name code is a lot simpler, but at the same time, it's also not evenly distributed. The difference with first name code is it only goes from 000 (Aaron) to probably 799 (796 for Zoe). What I mean by not evenly distributed is names that start with A range from 000 to 071, which 071 has some names that start with BA. Meanwhile, names that begin in Y are confined to a small range of no less than 785 to no more than 792.
Answered by boat on February 26, 2021
I believe the name is represented in something called the Soundex Code: the numbers represent the sounds of the last & first name. I can't quickly find the article I read breaking down the code for NJ licenses, but here is a general entry: https://www.tutorialgateway.org/sql-soundex-function/
Answered by bozz on February 26, 2021
I filed a FOIA request with the DMV. As I said many years ago, this was almost definitely protected information and likely to be rejected, but here's the formal response to that effect.
Request #W165699
Under the New Jersey Open Public Records Act, N.J.S.A. 47:1A-1 et seq., I am requesting an opportunity to inspect or obtain copies of public records that describe the algorithm for mapping first and last names to drivers license ID numbers (a Soundex-esque derivative). If there are any fees for searching or copying these records, please inform me if the cost will exceed $10. However, I would also like to request a waiver of all fees in that the disclosure of the requested information is in the public interest. This information is not being sought for commercial purposes.The New Jersey Open Public Records Act requires a response time of seven business days. If access to the records I am requesting will take longer than this amount of time, please contact me with information about when I might expect copies or the ability to inspect the requested records. Preferably I would like to receive all information through electronic records sent to my email address. If you deny any or all of this request, please cite each specific exemption you feel justifies the refusal to release the information and notify me of the appeal procedures available to me under the law. Thank you for considering my request.
Response (Excerpted)
The algorithm information you seek is exempt from disclosure by the Drivers' Privacy Protection Act, the Open Public Records Act, New Jersey Court Rules and Executive Order Number 21.
Further, N.J.S.A. 47:1A-1.1 provides:
A government record shall not include the following information which is deemed to be confidential for the purposes of P.L. 1963, c. 73 (C.47:1A-1 et seq.) as amended and supplemented:...trade secrets and proprietary commercial or financial information obtained from any souce...
Answered by Alex Beals on February 26, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP