Data Science Asked by krishna rao gadde on March 30, 2021
I have a dataset split across two folders. One contains documents (text files, PDFs) with personal information (such as name, email, address); the other contains documents with non-personal information.
I have to train a model with spaCy on these two folders, so that for a given document it predicts which of the two folders the document belongs to.
I have tried writing code based on several examples from GitHub, but nothing seems to work.
Can anyone give me a code sample that trains a model on the data described above and makes predictions?
I have experimented with the code below:
import spacy
from spacy import displacy
from spacy.util import minibatch, compounding

train_data = [("This has names, emails, addresses ", {'cats': {'POSITIVE': 1}}), ("This has games, food, etc", {'cats': {'POSITIVE': 0}})]

nlp = spacy.load('en_core_web_sm')
if 'textcat' not in nlp.pipe_names:
    textcat = nlp.create_pipe("textcat")
    nlp.add_pipe(textcat, last=True)
else:
    textcat = nlp.get_pipe("textcat")
textcat.add_label('POSITIVE')

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
n_iter = 1
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    print("Training model...")
    for i in range(n_iter):
        losses = {}
        batches = minibatch(train_data, size=compounding(4, 32, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer,
                       drop=0.2, losses=losses)
In the code above, I trained the model on two simple sentences. I need to train on the two folders mentioned in the question.
This code only reports that the model has been trained.
Also, how can I save this model and test it on documents to get predictions?
You're very close to having a working script. The textcat training example in the spaCy repository shows how to save the model, reload it, and run it on a new text: https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py
From around line 103:
```python
# test the trained model
test_text = "This movie sucked"
doc = nlp(test_text)
print(test_text, doc.cats)

if output_dir is not None:
    with nlp.use_params(optimizer.averages):
        nlp.to_disk(output_dir)
    print("Saved model to", output_dir)

    # test the saved model
    print("Loading from", output_dir)
    nlp2 = spacy.load(output_dir)
    doc2 = nlp2(test_text)
    print(test_text, doc2.cats)
```
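For the two-folder part of the question, one possible approach is to read every document into the same `(text, {'cats': {...}})` format that `train_data` already uses. The following is only a minimal sketch under some assumptions: the documents have already been converted to plain `.txt` files (PDFs would need a separate text-extraction step, not shown), and `load_folder_examples` and `label_from_cats` are hypothetical helper names introduced here for illustration, not part of spaCy. The 0.5 threshold is likewise an arbitrary starting point.

```python
from pathlib import Path


def load_folder_examples(personal_dir, other_dir, label="POSITIVE"):
    """Build textcat-style examples from two folders of .txt files.

    Assumption: every document is already plain text; PDFs must be
    converted beforehand (not handled here).
    """
    examples = []
    # Folder with personal information gets score 1.0, the other 0.0.
    for folder, value in ((personal_dir, 1.0), (other_dir, 0.0)):
        # sorted() keeps the order reproducible across runs.
        for path in sorted(Path(folder).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="ignore").strip()
            if text:  # skip empty files
                examples.append((text, {"cats": {label: value}}))
    return examples


def label_from_cats(cats, label="POSITIVE", threshold=0.5):
    """Map a doc.cats-style score dict to one of the two folders."""
    score = cats.get(label, 0.0)
    return "personal" if score >= threshold else "non-personal"
```

With helpers like these, `train_data = load_folder_examples("personal", "non_personal")` (hypothetical folder names) could replace the two hard-coded sentences, ideally shuffled with `random.shuffle(train_data)` before each iteration, and `label_from_cats(nlp2(text).cats)` would turn the scores of the reloaded model into a folder-level prediction.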
Answered by aab on March 30, 2021