Data Science Asked on August 12, 2021
I am new to Hugging Face and have a few basic questions. This post might also be helpful to others who are starting to use the Longformer model from Hugging Face.
I want to create sentence/document embeddings using the Longformer model. Our dataset has no labels, so we want to run clustering on the generated embeddings. Please let me know if the code below is correct.
transformers version: 3.0.2
Model: Longformer (allenai/longformer-base-4096)
Task: generating document embeddings for clustering
import pandas as pd
import torch
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

# Put the model in "evaluation" mode: disables dropout so the forward pass is deterministic.
model.eval()
df = pd.read_csv("inshort_news_data-1.csv")
df.head(5)

# The news_article column is used to generate the embeddings.
all_content = list(df['news_article'])
def sentence_bert():
    list_of_emb = []
    for i in range(len(all_content)):
        SAMPLE_TEXT = all_content[i]  # long input document
        print("length of string: ", len(SAMPLE_TEXT.split()))
        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)
        # How to include a batch size here? (see the batching sketch below)
        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)  # initialize to local attention
        attention_mask[:, [0, -1]] = 2  # global attention on the first and last tokens
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            # With output_hidden_states=True, outputs[2] is the tuple of hidden states:
            # the embedding layer plus one tensor per transformer layer.
            hidden_states = outputs[2]
        token_embeddings = torch.stack(hidden_states, dim=0)
        # Remove dimension 1, the "batches".
        token_embeddings = torch.squeeze(token_embeddings, dim=1)
        # Swap dimensions 0 and 1, giving [tokens, layers, hidden_size].
        token_embeddings = token_embeddings.permute(1, 0, 2)
        token_vecs_sum = []
        # For each token in the document, sum the last four hidden layers
        # to get a single vector per token.
        for token in token_embeddings:
            sum_vec = torch.sum(token[-4:], dim=0)
            # Use `sum_vec` to represent `token`.
            token_vecs_sum.append(sum_vec)
        # Sum all token vectors into one embedding for the document.
        h = 0
        for i in range(len(token_vecs_sum)):
            h += token_vecs_sum[i]
        list_of_emb.append(h)
    return list_of_emb

f = sentence_bert()
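On the batch-size question in the code: here is a minimal sketch of how one might batch documents, assuming the batch tokenizer call (tokenizer(batch, padding=True, ...)) available in transformers 3.x. The helper name embed_in_batches, the batch size, and the mean-pooling over the last hidden state are illustrative choices of mine, not part of the original code:

def embed_in_batches(texts, batch_size=8):
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad every document in the batch to the same length.
        enc = tokenizer(batch, padding=True, truncation=True, max_length=4096, return_tensors='pt')
        input_ids = enc['input_ids']
        attention_mask = enc['attention_mask']  # 1 = local attention, 0 = padding
        attention_mask[:, 0] = 2  # global attention on the first token only
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
        last_hidden = outputs[0]  # [batch, tokens, hidden_size]
        # Mean-pool over real (non-padding) tokens to get one vector per document.
        mask = attention_mask.clamp(max=1).unsqueeze(-1).float()
        doc_vecs = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
        embeddings.extend(doc_vecs)
    return embeddings

Global attention is set only on the first token here because, with padding, the last real token sits at a different index in each row of the batch.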
Expected output: one embedding per document (Document 1: embedding, Document 2: embedding, and so on).
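For the clustering step itself, a minimal sketch with scikit-learn's KMeans on the returned embeddings (scikit-learn and n_clusters=5 are arbitrary illustrative choices; the question does not name a clustering library):

import numpy as np
from sklearn.cluster import KMeans

X = np.stack([emb.numpy() for emb in f])  # shape [num_documents, hidden_size]
kmeans = KMeans(n_clusters=5, random_state=42).fit(X)
print(kmeans.labels_)  # cluster id per document

Note that because the document vector is a sum over tokens, longer documents get larger norms; normalizing the rows of X (or mean-pooling instead of summing) may give more comparable distances for clustering.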