Spacy Text classification (Binary Classification)

Data Science Asked by krishna rao gadde on March 30, 2021

I have a dataset split into two folders. One contains documents (text files, PDFs) with personal information (such as names, emails, and addresses); the other contains documents with non-personal information.

I need to train a spaCy model on these two folders so that, given a new document, it predicts which of the two folders the document belongs to.

I have tried writing code based on several examples from GitHub, but nothing has worked.

Can anyone give me a code sample that trains a model on the data described above and then makes predictions?

Here is the code I have experimented with so far:

```python
import spacy
from spacy import displacy
from spacy.util import minibatch, compounding

train_data = [("This has names, emails, addresses ", {'cats': {'POSITIVE': 1}} ), ("This has games, food, etc", {'cats': {'POSITIVE': 0}})]

nlp = spacy.load('en_core_web_sm')

if 'textcat' not in nlp.pipe_names:
    textcat = nlp.create_pipe("textcat")
    nlp.add_pipe(textcat, last=True)
else:
    textcat = nlp.get_pipe("textcat")

textcat.add_label('POSITIVE')

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']

n_iter = 1

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    print("Training model...")
    for i in range(n_iter):
        losses = {}
        batches = minibatch(train_data, size=compounding(4,32,1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer,
                       drop=0.2, losses=losses)
```

The code above trains the model on just two simple sentences, but I need to train it on the two folders described in the question.
As it stands, the code only prints a message saying that the model is training.
How can I also save this model and use it to predict on test documents?

One Answer

You're very close to having a working script. The textcat training example in the spaCy repository shows how to save the model, reload it, and run it on a new text: https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py

From around line 103:

```python
# test the trained model
test_text = "This movie sucked"
doc = nlp(test_text)
print(test_text, doc.cats)

if output_dir is not None:
    with nlp.use_params(optimizer.averages):
        nlp.to_disk(output_dir)
    print("Saved model to", output_dir)

    # test the saved model
    print("Loading from", output_dir)
    nlp2 = spacy.load(output_dir)
    doc2 = nlp2(test_text)
    print(test_text, doc2.cats)
```
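For the two-folder part of the question, one possible approach is to read each folder into the same `(text, {'cats': {...}})` format that the question's `train_data` already uses, and then map the scores in `doc.cats` back to a folder name at prediction time. The following is only a minimal sketch under several assumptions: the helper names `load_examples` and `folder_for`, the folder names `"personal"` and `"non_personal"`, and the 0.5 threshold are all hypothetical, and only `.txt` files are read. PDFs would first need their text extracted with a separate tool, which is not shown here.

```python
from pathlib import Path
import random


def load_examples(personal_dir, other_dir, label="POSITIVE", seed=0):
    """Build textcat examples in the question's train_data format
    from two folders of .txt files (hypothetical helper)."""
    examples = []
    # Documents in personal_dir get score 1.0, documents in other_dir get 0.0.
    for folder, score in ((personal_dir, 1.0), (other_dir, 0.0)):
        for path in sorted(Path(folder).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="ignore").strip()
            if text:  # skip empty files
                examples.append((text, {"cats": {label: score}}))
    # Shuffle so that batches mix both classes; the seed keeps runs repeatable.
    random.Random(seed).shuffle(examples)
    return examples


def folder_for(cats, label="POSITIVE", threshold=0.5):
    """Map a doc.cats-style score dictionary to one of the two folder names.
    The folder names and the 0.5 threshold are assumptions, not spaCy defaults."""
    return "personal" if cats.get(label, 0.0) >= threshold else "non_personal"
```

With something like this in place, `train_data = load_examples("data/personal", "data/non_personal")` (paths hypothetical) could replace the two hard-coded sentences in the question, and after training or reloading the model, `folder_for(nlp2(text).cats)` would give the predicted folder for a new document's text.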

Answered by aab on March 30, 2021
