
Are there any good out-of-the-box language models for python?

Data Science Asked on July 6, 2021

I’m prototyping an application and I need a language model to compute perplexity on some generated sentences.

Is there any trained language model in python I can readily use? Something simple like

model = LanguageModel('en')
p1 = model.perplexity('This is a well constructed sentence')
p2 = model.perplexity('Bunny lamp robert junior pancake')
assert p1 < p2

I’ve looked at some frameworks but couldn’t find what I want. I know I can use something like:

from nltk.model.ngram import NgramModel
lm = NgramModel(3, brown.words(categories='news'))

This uses a Good-Turing probability distribution on the Brown Corpus, but I was looking for a well-crafted model trained on some big dataset, like the 1B Words dataset, something whose results I can actually trust for a general domain (not only news).

5 Answers

I also think that the first answer is incorrect, for the reasons that @noob333 explained.

But BERT also cannot be used out of the box as a language model. BERT gives you p(word | context), where the context includes both the left and the right side, whereas what you want is p(word | previous tokens), i.e. the left context only. The author explains here why you cannot use it as a language model.

However, you can adapt BERT and use it as a language model, as explained here.

Alternatively, you can use the OpenAI GPT or GPT-2 pre-trained models from the same repo.

Here is how you can compute the perplexity using the GPT model:

import math
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

# Load pre-trained model (weights)
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

def score(sentence):
    # Passing lm_labels makes the model return the cross-entropy loss;
    # exp(loss) is the sentence perplexity.
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    loss = model(tensor_input, lm_labels=tensor_input)
    return math.exp(loss.item())


a = ['there is a book on the desk',
     'there is a plane on the desk',
     'there is a book in the desk']
print([score(i) for i in a])
[21.31652459381952, 61.45907380241148, 26.24923942649312]

Correct answer by lads on July 6, 2021

The spaCy package has many language models, including ones trained on Common Crawl.

Language model has a specific meaning in Natural Language Processing (NLP). A language model is a probability distribution over sequences of tokens: given a specific sequence of tokens, the model assigns a probability to that sequence appearing. spaCy's language models include more than just a probability distribution.

The spaCy package needs to be installed and the language models need to be downloaded:

$ pip install spacy 
$ python -m spacy download en

Then the language models can be used with a couple of lines of Python:

>>> import spacy
>>> nlp = spacy.load('en')

For a given model and token, a smoothed log-probability estimate of the token's word type can be found with the token.prob attribute.
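As a minimal sketch (assuming a model that actually ships word-probability data, e.g. en_core_web_lg under spaCy v2; small models may just return a flat default value for every token):

import spacy

nlp = spacy.load('en_core_web_lg')  # a model with word-probability data (assumption)
doc = nlp('This is a well constructed sentence')
for token in doc:
    # token.prob is a smoothed log probability of the word type in a large corpus
    print(token.text, token.prob)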

Answered by Brian Spiering on July 6, 2021

I think the accepted answer is incorrect.

token.prob is the log-prob of the token being a particular type. I am guessing 'type' refers to something like a POS tag or a named-entity type (it's not clear from spaCy's documentation), and the score is a confidence measure over the space of all types.

This is not the same as the probabilities assigned by a language model. A language model gives you a probability distribution over all possible tokens (not types), saying which of them is most likely to occur next.

This repo has pretty nice documentation on using BERT (a state-of-the-art model) with pre-trained weights for the neural network.

I think the APIs don't give you perplexity directly, but you should be able to get probability scores for each token quite easily.
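As a rough sketch of my own (using the newer transformers library rather than the linked pytorch-pretrained-bert repo), you can get the probability of each token given its full context by masking one position at a time with the masked-LM head. Note this gives a pseudo-likelihood, not a true left-to-right language-model probability:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

def token_probs(sentence):
    # Probability of each token given both left and right context,
    # obtained by masking one position at a time (a pseudo-likelihood).
    ids = tokenizer(sentence, return_tensors='pt').input_ids
    probs = []
    for pos in range(1, ids.size(1) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits
        p = torch.softmax(logits[0, pos], dim=-1)[ids[0, pos]].item()
        probs.append((tokenizer.convert_ids_to_tokens(int(ids[0, pos])), p))
    return probs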

Answered by noob333 on July 6, 2021

You can use the lm_scorer package to calculate the language model probabilities using GPT-2 models.

First install the package as:

pip install lm-scorer

Then, you can create a scorer by specifying the model size.

from lm_scorer.models.auto import AutoLMScorer
scorer = AutoLMScorer.from_pretrained("gpt2-large")

def score(sentence):
    return scorer.sentence_score(sentence)

Apply it to your text and you get back the probabilities.

>>> score('good luck')
8.658163769270644e-11
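To reproduce the comparison from the original question (no specific output values assumed), a fluent sentence should receive a higher probability than a nonsense one:

p1 = score('This is a well constructed sentence')
p2 = score('Bunny lamp robert junior pancake')
assert p1 > p2  # higher probability under the model = more fluent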

You can also refer to a blog post I wrote a while back if you're looking for more details.

Answered by Amit Chaudhary on July 6, 2021

Here I show how we can use the transformers library and a GPT-2 model to compute the perplexity of a given sentence.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# You can change to gpt2-large or other pretrained models available on the Hugging Face Hub.
tokenizer = GPT2TokenizerFast.from_pretrained('distilgpt2')
model = GPT2LMHeadModel.from_pretrained('distilgpt2')

def perplexity(sentence:str, stride:int=512) -> float:
    encodings = tokenizer(sentence, return_tensors='pt').input_ids
    max_length = model.config.n_positions  # 1024
    lls = []
    for i in range(0, encodings.size(1), stride):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, encodings.size(1))
        trg_len = end_loc - i    # may be different from stride on last loop
        input_ids = encodings[:,begin_loc:end_loc]
        target_ids = input_ids.clone()

        target_ids[:,:-trg_len] = -100
        
        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            log_likelihood = outputs[0] * trg_len  # outputs[0] is the average cross-entropy loss
        
        lls.append(log_likelihood)
    ppl = torch.exp(torch.stack(lls).sum() / end_loc)
    return ppl.item()

Then, you can call it like this:

perplexity("I love you.")

Answered by zijun on July 6, 2021
