Distinguish English and Spanish with regular expressions

Question

The task is to compete for the shortest regex (in bytes) in your preferred programming language which can distinguish between English and Spanish with at least 90% accuracy (raised from the original 60%).
Silvio Mayolo's submission (pinned as Best Answer) has secured his spot as the winner of the original contest, beyond any chance of being contested. To leave room for further submissions, he has generously allowed the scoring requirement to be raised to 90% accuracy.
Links to wordlists have been replaced due to concerns voiced in the comments.
The following word lists (based on these) must be used: English, Spanish
The Spanish word list is already transliterated into ASCII, and no word appears in both lists.
A naive approach to distinguishing Spanish from English might be to match if the word ends in a vowel:
[aeiou]$ with flag i, 9 bytes
Here's a runnable example, where 6 of 8 words are correctly identified, for 75% accuracy:

const regex = /[aeiou]$/i;

const words = [
  'hello',
  'hola',
  'world',
  'mundo',
  'foo',
  'tonto',
  'bar',
  'barra'
];

words.forEach(word => {
  const match = word.match(regex);
  const langs = ['English', 'Spanish'];
  const lang = langs[+!!match];
  console.log(word, lang);
});
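
The 75% figure can be checked by labelling each sample word with its true language and counting how often the regex agrees. The following is a minimal sketch; the `samples` pairs and the `classify` helper are names introduced here for illustration and are not part of the challenge.

```javascript
// Naive regex from the question: a trailing vowel is taken to mean Spanish.
const naiveRegex = /[aeiou]$/i;

// Hypothetical labelled sample: the eight words above with their true language.
const samples = [
  ['hello', 'English'], ['hola', 'Spanish'],
  ['world', 'English'], ['mundo', 'Spanish'],
  ['foo', 'English'],   ['tonto', 'Spanish'],
  ['bar', 'English'],   ['barra', 'Spanish'],
];

// A match means Spanish, no match means English.
const classify = (word) => (naiveRegex.test(word) ? 'Spanish' : 'English');

// Words whose predicted language differs from the label.
const misses = samples.filter(([word, lang]) => classify(word) !== lang);
const accuracy = (samples.length - misses.length) / samples.length;

console.log(misses.map(([word]) => word)); // [ 'hello', 'foo' ]
console.log(`${100 * accuracy}%`);         // 75%
```

The two misses are English words that happen to end in a vowel, which is the weakness a competitive regex has to work around.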

Lynn · Answer

50 bytes, 90.02% accurate
(a(d?|is|r|se?)|dor|eis|ese|je|n|[ns]te|os?|res?)$
For 18,004 out of the 20,000 words in es_clean.json and en_clean.json, this regex matches iff the input word is Spanish.
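
To see what the alternation does, the sketch below applies the regex to a few hand-picked words and checks its byte length. The sample words and the `isSpanish` helper are illustrative assumptions and are not drawn from the contest word lists; the 90.02% figure can only be reproduced against the full lists.

```javascript
// Lynn's regex, copied verbatim from the answer above.
const lynn = /(a(d?|is|r|se?)|dor|eis|ese|je|n|[ns]te|os?|res?)$/;

// Byte length of the pattern source (ASCII only, so bytes equal characters).
const byteLength = new TextEncoder().encode(lynn.source).length;
console.log(byteLength); // 50

// Hypothetical helper: a match is read as "Spanish".
const isSpanish = (word) => lynn.test(word);

// Illustrative words chosen by hand, not taken from the word lists.
for (const word of ['hablar', 'ciudad', 'mundo', 'world', 'house', 'garden']) {
  console.log(word, isSpanish(word) ? 'Spanish' : 'English');
}
```

In this sample, "garden" is reported as Spanish because of the bare n alternative, which illustrates the kind of error that keeps the score below 100%.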

Silvio Mayolo · Answer

Any Language, 0.3677 (60.6064%, 1 byte)

No, I'm not joking. The single-character regular expression "a" successfully identifies Spanish words over English given your input files 60.6064% of the time, which makes it a valid submission.

Here's a complete, runnable Perl script that checks the percentage of this regular expression, assuming you've downloaded english.json and spanish.json into the same folder as the script.

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

my @english;
my @spanish;

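# Collect the first double-quoted word on each line of the two files.
my $fh;
open $fh, '<', 'english.json' or die "Cannot open english.json: $!";
while (<$fh>) {
    push @english, $1 if /"(\w+)"/;
}
close $fh;

open $fh, '<', 'spanish.json' or die "Cannot open spanish.json: $!";
while (<$fh>) {
    push @spanish, $1 if /"(\w+)"/;
}
close $fh;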

my $correct = 0;
my $total = 0;

my $re = qr/a/;

for (@english) {
    $total++;
    $correct++ unless /$re/;
}
for (@spanish) {
    $total++;
    $correct++ if /$re/;
}

say "$correct / $total (@{[100*$correct/$total]}%)";
