What's the principle behind the new MoleculeRecognize function?

Question

I played a bit with the new MoleculeRecognize function in the newly released Wolfram Mathematica 12.1. It recognizes a molecule in an image and returns it as a Molecule object:

MoleculeRecognize[image]

It is really an amazing tool, but what I really want to know is the algorithm behind it. Is it a machine-learning engine or a classical image-analysis pipeline?
I would also like to know its accuracy and any limitations.

Carl Lange · Answer

Using ResourceFunction["PrintDefinitions"][MoleculeRecognize], one can determine that the source implementation is MolVec. The specific source code is:

Chemistry`MolVecLink`Private`molStringFromImageFile[path_String] := Block[
    {cleanup, jFile, molString},
    Chemistry`MolVecLink`Private`initialize[];
    JLink`JavaBlock[
        jFile = JLink`JavaNew["java.io.File", path];
        molString = Quiet[
            Chemistry`MolVecLink`Private`runMolvec @ jFile,
            JLink`Java::excptn
        ];
        cleanup[];
        molString
    ]
];

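This is pretty clear: the image file is handed to MolVec, a Java library, through J/Link, and the resulting molecule string is returned. You can check out this demo to see an example of MolVec outside of Wolfram Language.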

Looking at the MolVec code, it seems to use a variety of different methods to extract molecular information, but they all seem to be OCR-related (i.e. image-analysis-based) rather than a machine-learning-based pipeline.

Information about MolVec accuracy and performance can be found in this presentation, linked from the demo above. There does not seem to be any specific detail about its limitations on the web, and I would posit that after a point, this becomes no longer a Mathematica question :)

dkatzel · Answer

I am one of the authors of Molvec. It was designed with a classical approach, using various image-processing heuristics to guess the structure. While it's on my to-do list, we still haven't published a Molvec paper. However, there is a recent third-party paper, which I had no part in, that compares Molvec to other similar tools and includes accuracy and performance metrics. It's available here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7541205/ We have also used the results in this paper to make improvements in newer versions.
As far as limitations go (as of this writing, that's Molvec version 0.9.8), there are some things Molvec does really well and some things it doesn't do well. First, it tries really hard to turn any image you provide into some kind of structure, even when the image is not a structure. Furthermore, text from captions or nearby paragraphs might confuse Molvec as well.
From a machine-learning perspective, having an ML step sit in front of Molvec to find areas that aren't structures would help. For now, I would just recommend being careful about how you crop the image you submit to Molvec.
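A minimal sketch of that cropping advice on the Wolfram Language side, in case it helps. The file name and pixel coordinates below are hypothetical placeholders, not values from this thread; pick the rectangle that tightly encloses your own drawing.

```wolfram
(* Assumption: "page-scan.png" and the pixel coordinates are hypothetical placeholders *)
img = Import["page-scan.png"];

(* Keep only the rectangle around the drawing, dropping captions and body text *)
cropped = ImageTrim[img, {{40, 60}, {420, 380}}];

(* Recognize the structure in the cropped region only *)
mol = MoleculeRecognize[cropped];

(* Plot the result for visual review, since a structure may be returned
   even when the input is not really a structure *)
If[MoleculeQ[mol], MoleculePlot[mol], Missing["NotRecognized"]]
```

Comparing the plot against the source image by eye is still worthwhile, since a tight crop reduces stray text but does not guarantee a correct structure.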
Aside from that, Molvec has problems with structures that have brackets or variable groups, like Markush structures. It thinks the lines for the brackets and labels are meant to be bonds. But this should be fixable in future versions.
These issues haven't been a high priority for us because Molvec's main use case is structure registration and curation during regulatory data entry. The registrars use Molvec as a first pass when entering structure information, then carefully review the results and make edits.
Thanks for your question, and it's great that it's been incorporated into Mathematica.
