What if my word is not in the BERT model vocabulary?

Data Science Asked on August 2, 2021

I am doing NER using a BERT model. I have encountered some words in my dataset that are not part of the BERT vocabulary, and I get an error when converting those words to IDs. Can someone help me with this?

Below is the code I am using for BERT.

import pandas as pd

df = pd.read_csv("drive/My Drive/PA_AG_123records.csv", sep=",", encoding="latin1").fillna(method='ffill')

!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

import tensorflow_hub as hub
import tokenization

# Load the pretrained BERT layer from TF Hub
module_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(module_url, trainable=True)

# Build the WordPiece tokenizer from the layer's vocabulary file
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

tokens_list=['hrct',
 'heall',
 'government',
 'of',
 'hem',
 'snehal',
 'sarjerao',
 'nawale',
 '12',
 '12',
 '9999',
 'female',
 'mobile',
 'no',
 '1155812345',
 '3333',
 '3333',
 '3333',
 '41st',
 '3iteir',
 'fillow']

max_len = 25
text = tokens_list[:max_len - 2]
input_sequence = ["[CLS]"] + text + ["[SEP]"]
print("After adding the flags [CLS] and [SEP]:")
print(input_sequence)

tokens = tokenizer.convert_tokens_to_ids(input_sequence)
print("tokens to ids")
print(tokens)

One Answer

The problem is that you are not using BERT's tokenizer properly.

Instead of using BERT's tokenizer to actually tokenize the input text, you are splitting the text into tokens yourself in your tokens_list and then asking the tokenizer for the IDs of those tokens. However, if you provide tokens that are not part of BERT's subword vocabulary, the tokenizer cannot handle them.

You must not do this.
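This is most likely the error you are seeing: in the tokenization.py downloaded above, convert_tokens_to_ids does a direct vocabulary lookup, so any out-of-vocabulary token raises a KeyError. A minimal sketch of the failure, reusing the tokenizer from your code ('hrct' is one of the out-of-vocabulary tokens in your list):

# 'hrct' is not in the BERT uncased vocabulary, so the lookup fails
try:
    tokenizer.convert_tokens_to_ids(['hrct'])
except KeyError as err:
    print('Out-of-vocabulary token:', err)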

Instead, you should let the tokenizer tokenize the text and then ask for the token IDs, e.g.:

tokens_list = tokenizer.tokenize('Where are you going?') 
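Every piece produced by tokenize() is guaranteed to be in the vocabulary, so converting to IDs then works. Continuing from the line above, a short sketch of the full flow:

# Add the special tokens and look up the IDs; since the tokens came
# from tokenize(), they are all in the vocabulary and this no longer fails
input_sequence = ["[CLS]"] + tokens_list + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(input_sequence)
print(input_sequence)
print(input_ids)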

Remember, nevertheless, that BERT uses subword tokenization, so it will split the input text into pieces that can be represented with the subwords in its vocabulary.
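For example (this is the split shown in the original BERT repository for the uncased vocabulary; the exact pieces depend on the vocab file):

print(tokenizer.tokenize('embeddings'))
# -> ['em', '##bed', '##ding', '##s']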

Correct answer by noe on August 2, 2021
