How to extract features from long chemical names?

Data Science Asked by Aaron Johnson on April 14, 2021

I have an interesting problem and am unsure how to even get started. I am working on a binary classifier that takes a chemical name, encoded as a string, and predicts whether it is a ‘good’ or ‘bad’ name. I have had quite good success doing this by examining the structure of the chemical directly, but I would like to explore whether I can learn anything from the name itself, since the name can contain structural information about the molecule that my encoding of the molecular structure is missing. I have been searching for anything built into sklearn for text feature extraction. There is quite a bit, but most of it seems designed for encoding whole sentences or paragraphs. My input would be very long, single words such as:

1-(aminoiminomethyl)-N’-[2,3,6-tri-O-benzoyl-4-O-(2,3,4,6-tetra-O-benzoyl-α-D-glucopyranosyl)-β-D-glucopyranosyl]-

2,4,5-trideoxy-2-[(16-mercapto-1-oxohexadecyl)amino]-1,3-O-(1-methylethylidene)-6-O-undecyl-

octahydro-7-hydroxy-1-[[2-O-(4-hydroxybenzoyl)-α-D-allopyranosyl]oxy]-7-methyl-

As such, I’m not certain that a bag of words or a one-hot encoding of the strings would work. Could anyone point me in the right direction on methodologies or algorithms that could extract features from these strings, so that I can train a binary classifier on them?

One Answer

You can train an RNN with character embeddings. Split each name into a sequence of characters and map each character to an integer index, so that every name becomes a sequence of numbers. If you are working with Keras, you can feed these sequences into an Embedding() layer, which learns a vector representation for each character. RNN layers then process the embedded sequence, and the output node of the network performs the classification ('good'/'bad').
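The character-splitting and numeric vectorization step could look roughly like the sketch below. It is only an illustration in plain Python: the helper names (`build_char_vocab`, `encode`), the choice of index 0 for padding and index 1 for unknown characters, and the two shortened example names are assumptions for this example, not part of any library.

```python
from typing import Dict, List

PAD = 0  # index reserved for padding (assumed convention)
UNK = 1  # index reserved for characters not seen when building the vocabulary


def build_char_vocab(names: List[str]) -> Dict[str, int]:
    """Map every distinct character in the training names to an integer >= 2."""
    chars = sorted({ch for name in names for ch in name})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode(name: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Turn a name into a fixed-length list of character indices.

    Names longer than max_len are truncated; shorter ones are right-padded.
    """
    ids = [vocab.get(ch, UNK) for ch in name[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Two shortened fragments used purely as toy input.
names = ["octahydro-7-hydroxy-", "2,4,5-trideoxy-"]
vocab = build_char_vocab(names)
max_len = max(len(n) for n in names)
X = [encode(n, vocab, max_len) for n in names]
```

Each row of `X` is an equal-length integer sequence, which is the kind of input an embedding layer expects. The vocabulary size to pass to such a layer would be `len(vocab) + 2`, to account for the padding and unknown indices.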

Answered by Leevo on April 14, 2021
