Data Science Asked by Madhur Yadav on April 2, 2021
I’m trying to get word embeddings for clinical data using microsoft/pubmedbert.
I have 3.6 million text rows. Converting texts to vectors for 10k rows takes around 30 minutes, so for 3.6 million rows it would take around 180 hours (roughly 8 days).
Is there any method where I can speed up the process?
My code –
import re
import numpy as np
import pandas as pd
from transformers import AutoTokenizer
from transformers import pipeline

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('feature-extraction', model=model_name, tokenizer=tokenizer)

def lambda_func(row):
    tokens = tokenizer(row['notetext'])
    if len(tokens['input_ids']) > 512:
        # crude fallback: split on word boundaries and keep the first 512 pieces
        tokens = re.split(r'\b', row['notetext'])
        tokens = [t for t in tokens if len(t) > 0]
        row['notetext'] = ''.join(tokens[:512])
    row['vectors'] = classifier(row['notetext'])[0][0]
    return row

def process(progress_notes):
    progress_notes = progress_notes.apply(lambda_func, axis=1)
    return progress_notes

progress_notes = process(progress_notes)

# vectors_length and vectors_breadth (number of notes and embedding size) are defined elsewhere
vectors_2d = np.reshape(progress_notes['vectors'].to_list(), (vectors_length, vectors_breadth))
vectors_df = pd.DataFrame(vectors_2d)
My progress_notes dataframe looks like –
progress_notes = pd.DataFrame({'id': [1, 2, 3], 'progressnotetype': ['Nursing Note', 'Nursing Note', 'Administration Note'], 'notetext': ["Patient's skin is grossly intact with exception of skin tear to r inner elbow and r lateral lower leg", 'Patient with history of Afib with RVR. Patient is incontinent of bowel and bladder.', 'Give 2 tablet by mouth every 4 hours as needed for Mild to moderate Pain Not to exceed 3 grams in 24 hours']})
Note: I'm running the code on an AWS EC2 instance, r5.8xlarge (32 CPUs). I tried using multiprocessing, but the code goes into a deadlock because BERT takes all my CPU cores.
I think the main problem is how you are using BERT, as you are processing your text sentence by sentence. Instead, you should be feeding the input to the model in mini-batches:
Neural networks for NLP are meant to receive not just one sentence at a time but multiple sentences at once. The sentences are stacked together in a single tensor of integer numbers (token IDs) with dimensions number of sentences $\times$ sequence length. As sentences have different lengths, the batch normally uses the length of its longest sequence as the sequence length, and all shorter sentences are filled up with padding tokens. You can have a look at batching in HuggingFace's library here.
Using batch-oriented processing would allow you to benefit from the parallel processing of all the sentences in the batch.
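As an illustration (a minimal sketch, not part of the original answer; the example texts are made up), this is roughly what a padded batch looks like when the same tokenizer is asked to handle several notes at once:

from transformers import AutoTokenizer

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "Patient with history of Afib with RVR.",
    "Give 2 tablets by mouth every 4 hours as needed for mild to moderate pain.",
]

# padding=True pads every sentence to the longest one in the batch;
# truncation=True cuts anything longer than max_length (512 for BERT-style models).
batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
print(batch["input_ids"].shape)  # (number of sentences, sequence length)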
The problem is that, while the HuggingFace Transformers library supports batched inputs for training, it does not support batched inference in some cases. For instance, FeatureExtractionPipeline, which extracts token embeddings like you want, does not support batch processing (unlike TableQuestionAnsweringPipeline, which has the sequential parameter).
Therefore, in order to have batched inference, you would need to feed the data to the model manually instead of relying on the pipeline API. You can find examples of how to do that in this thread.
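A minimal sketch of that manual approach (not taken from the linked thread; embed_batches is an illustrative name, batch_size=32 is an arbitrary choice, and the first token's embedding is kept to mirror classifier(...)[0][0] from the question):

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed_batches(texts, batch_size=32):
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        enc = tokenizer(chunk, padding=True, truncation=True,
                        max_length=512, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc)
        # last_hidden_state has shape (batch, seq_len, hidden);
        # keep the first ([CLS]) token embedding for each note
        vectors.append(out.last_hidden_state[:, 0, :])
    return torch.cat(vectors).numpy()

vectors = embed_batches(progress_notes['notetext'].tolist())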
If you switch from CPU to GPU, batch-oriented processing would also be key to obtaining performance gains.
If you decide to stay on CPU, ensure that your PyTorch build is using MKL-DNN, which is a major performance booster on CPU. You can check this thread on how to do it. If your build does not include it, install a newer version that does.
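One quick way to check (a small sketch, not from the linked thread) is to ask PyTorch directly:

import torch

print(torch.backends.mkldnn.is_available())  # True if MKL-DNN (oneDNN) kernels are available
print(torch.__config__.show())               # full build configuration, including USE_MKLDNN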
Correct answer by noe on April 2, 2021
There are several things that could make your code faster:
Stop using Pandas. Pandas is not designed for large-scale, fast text processing. It would be better to switch to something like Apache Arrow, which is designed for efficient analytic operations on modern hardware.
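For example (a hedged sketch: it assumes the Hugging Face datasets library, which stores data in Apache Arrow, and reuses the embed_batches helper sketched in the accepted answer above):

from datasets import Dataset

ds = Dataset.from_pandas(progress_notes)  # Arrow-backed copy of the dataframe
ds = ds.map(lambda batch: {"vectors": embed_batches(batch["notetext"])},
            batched=True, batch_size=64)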
Refactor code to avoid casting into memory-inefficient data types. progress_notes['vectors'].to_list() converts to a Python list, which will use a lot of memory.
Refactor code to avoid casting back and forth between data types. Replace the list comprehensions and ''.join with compiled regex. Compiled regex processing will be much faster than equivalent Python code.
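A small sketch of that suggestion (TRUNCATE_512 and truncate_words are illustrative names; the pattern keeps roughly the first 512 word-like tokens in a single compiled-regex match instead of the split/filter/join in the question):

import re

TRUNCATE_512 = re.compile(r'^(?:\W*\w+){0,512}')  # compiled once, reused for every row

def truncate_words(text):
    # a single match replaces the list comprehension and ''.join
    return TRUNCATE_512.match(text).group(0)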
Refactor code to replace functions with inline code. Python creates a new stack frame for each function call, so inline code will be more memory efficient.
Answered by Brian Spiering on April 2, 2021