Classify and Predict results that I don't understand

Question

I have a list of rules that has as the left hand side the font size and a letter and for the right hand side the space in pixels between this letter and the following letter. The data is from a postscript file. Two examples.
{8.5, "S"} -> 5.5165
{8.5, "251"} -> 2.363

The second example uses a character code to represent the actual character.
Here is the complete list of rules. Apologize for the length but I don't know what parts are and are not important.
letterRules={{8.5, "S"} -> 5.5165, {8.5, "P"} -> 5.6695, {8.5, "E"} -> 
  5.508, {8.5, "C"} -> 6.2985, {8.5, "I"} -> 2.5075, {8.5, "A"} -> 
  5.8225, {8.5, "D"} -> 6.2988, {8.5, "O"} -> 6.613, {8.5, "U"} -> 
  6.298, {8.5, "B"} -> 5.984, {8.5, "L"} -> 5.041, {8.5, "E"} -> 
  5.508, {8.5, "S"} -> 5.5165, {8.5, "I"} -> 2.5075, {8.5, "M"} -> 
  7.7095, {8.5, "P"} -> 5.6695, {8.5, "L"} -> 5.0405, {8.5, "O"} -> 
  6.6131, {8.5, "V"} -> 5.355, {8.5, "E"} -> 5.508, {8.5, "R"} -> 
  6.137, {8.5, "C"} -> 6.298, {8.5, "A"} -> 5.823, {8.5, "L"} -> 
  5.04, {8.5, "J"} -> 4.726, {8.5, "U"} -> 6.2985, {8.5, "M"} -> 
  7.7095, {8.5, "O"} -> 6.613, {8.5, "V"} -> 5.3546, {8.5, "E"} -> 
  5.508, {8.5, "R"} -> 6.137, {8.5, "C"} -> 6.299, {8.5, "A"} -> 
  5.822, {8.5, "L"} -> 5.041, {8.5, "O"} -> 6.613, {8.5, "P"} -> 
  5.6695, {8.5, "E"} -> 5.508, {8.5, "N"} -> 6.2985, {8.5, "I"} -> 
  2.5075, {8.5, "N"} -> 6.2985, {8.5, "P"} -> 5.6697, {8.5, "R"} -> 
  6.137, {8.5, "E"} -> 5.508, {8.5, "E"} -> 5.508, {8.5, "M"} -> 
  7.709, {8.5, "P"} -> 5.67, {8.5, "T"} -> 5.193, {8.5, "N"} -> 
  6.299, {8.5, "O"} -> 6.613, {8.5, "T"} -> 5.193, {8.5, "R"} -> 
  6.137, {8.5, "U"} -> 6.299, {8.5, "M"} -> 7.709, {8.5, "O"} -> 
  6.613, {8.5, "V"} -> 5.355, {8.5, "E"} -> 5.508, {8.5, "R"} -> 
  6.137, {8.5, "C"} -> 6.298, {8.5, "A"} -> 5.823, {8.5, "L"} -> 
  5.04, {8.5, "L"} -> 5.041, {8.5, "D"} -> 6.299, {8.5, "E"} -> 
  5.508, {8.5, "F"} -> 5.04, {8.5, "E"} -> 5.508, {8.5, "N"} -> 
  6.299, {8.5, "S"} -> 5.516, {8.5, "V"} -> 5.355, {8.5, "N"} -> 
  6.298, {8.5, "O"} -> 6.613, {8.5, "T"} -> 5.194, {8.5, "R"} -> 
  6.137, {8.5, "U"} -> 6.298, {8.5, "M"} -> 7.71, {8.5, "O"} -> 
  6.613, {8.5, "V"} -> 5.355, {8.5, "E"} -> 5.508, {8.5, "O"} -> 
  6.613, {8.5, "P"} -> 5.669, {8.5, "P"} -> 5.67, {8.5, "251"} -> 
  2.363, {8.5, "T"} -> 5.194, {8.5, "/"} -> 3.153, {8.5, "D"} -> 
  6.299, {8.5, "O"} -> 6.613, {8.5, "U"} -> 6.298, {8.5, "B"} -> 
  5.984, {8.5, "L"} -> 5.041, {8.5, "D"} -> 6.2985, {8.5, "I"} -> 
  2.5075, {8.5, "R"} -> 6.137, {8.5, "E"} -> 5.508, {8.5, "C"} -> 
  6.2985, {8.5, "C"} -> 6.2988, {8.5, "U"} -> 6.298, {8.5, "E"} -> 
  5.508, {8.5, "B"} -> 5.984, {8.5, "I"} -> 2.508, {8.5, "S"} -> 
  5.4315, {8.5, "L"} -> 4.9555, {8.5, "A"} -> 5.7375, {8.5, "C"} -> 
  6.2135, {8.5, "O"} -> 6.528, {8.5, "N"} -> 6.2135, {8.5, "V"} -> 
  5.27, {8.5, "E"} -> 5.423, {8.5, "N"} -> 6.2135, {8.5, "T"} -> 
  5.1085, {8.5, "I"} -> 2.4224, {8.5, "O"} -> 6.528, {8.5, "N"} -> 
  6.214, {8.5, "L"} -> 5.542, {8.5, "E"} -> 6.0605, {8.5, "A"} -> 
  6.409, {8.5, "D"} -> 6.9275, {9, "P"} -> 6.003, {9, "r"} -> 
  3.501, {9, "i"} -> 2.322, {9, "m"} -> 8.154, {9, "a"} -> 
  5.166, {9, "r"} -> 3.501, {9, "s"} -> 4.833, {9, "i"} -> 
  2.322, {9, "g"} -> 5.499, {9, "n"} -> 5.337, {9, "a"} -> 
  5.166, {9, "t"} -> 3.168, {9, "p"} -> 5.499, {9, "a"} -> 
  5.166, {9, "r"} -> 3.501, {9, "t"} -> 3.168, {9, "n"} -> 
  5.337, {9, "e"} -> 5.166, {9, "r"} -> 3.501, {9, "251"} -> 
  2.502, {9, "l"} -> 2.322, {9, "e"} -> 5.166, {9, "a"} -> 
  5.166, {9, "d"} -> 5.499, {8.5, "D"} -> 6.801, {8.5, "E"} -> 
  5.949, {8.5, "F"} -> 5.441, {8.5, "E"} -> 5.949, {8.5, "N"} -> 
  6.801, {8.5, "S"} -> 5.958, {8.5, "I"} -> 2.711, {8.5, "V"} -> 
  5.78, {8.5, "C"} -> 6.8, {8.5, "A"} -> 6.291, {8.5, "R"} -> 
  6.63, {8.5, "D"} -> 6.8, {8.5, "I"} -> 2.711, {8.5, "N"} -> 6.8}

I use both Classify and Predict with the training set.
letterClassify = Classify[letterRules, Method -> "NaiveBayes"];
letterPredict = Predict[letterRules, Method -> "NeuralNetwork"];

In general I am happy with the results. I created tables to show the results for a rule, for example rule #139 has a 9 pt font size, the letter d and a spacing of 5.499 pixels

The letter N with a font size of 8.5 occurs in a number of rules from 38 up to 153, various spacings with the majority around 6.3 pixels.

The columns are the rule number, the rule, the Classify result and the Predict result. Both Classify and Predict give reasonable results for the rule with one occurrence, d, as well as the rule with multiple occurrences, N.
However Predict gives an awful result for this single occurrence rule with a slash, /.

and in the following example there are two single occurrences (because the font size differs) but Classify fails to treat them separately whereas Predict honors the font size.

Finally in this example there are many samples close to 5.8 but Classify picks the largest one which has only one occurrence.

I hope someone can help me to understand why I get mostly reasonable results but apparent failure in the above four cases.

Classify and Predict results that I don't understand

Add your own answers!

Ask a Question