
Is there a way to get optimal cutoff points based on topic model probabilities and the outcomes?

Cross Validated Asked by Kuni on January 6, 2021

I have topic probabilities obtained using the LDA topic modeling method. I'd like to use these probabilities for 5 topics to predict a binary outcome (0/1). I'd like to find an optimal cutoff point for each topic with respect to the outcome (0/1). I was thinking about ROC, but it seems like that may not work since I don't have predicted outcomes. Are there any methods that allow me to find the cutoff points for each variable, and does it even make sense to do this?

Background

  1. We collected open-ended responses from participants and performed topic modeling.
  2. We collected outcomes of interest from these participants, e.g., intent to leave, performance, etc.
  3. The hypothesis is that the responses the participants enter are indicative of the outcome.

The question is: at what point in each of these topic model probabilities does the outcome value change, e.g., from 0 to 1 or from 1 to 0?

One Answer

It's seldom that there is a single "point" at which the "outcome value changes." There's usually a gradation with respect to a predictor in terms of the probability of a particular outcome.

So what usually works best for a binary outcome is a probability model for the outcome that takes as much information into account as possible without overfitting. Furthermore, with binary outcomes there's a particular risk of omitted-variable bias: if you leave out any predictor associated with the outcome, it can become harder to identify other predictors associated with the outcome.

In that context, looking for separate cutoffs for each of the predictors based on ROC is throwing away the detailed information about how probability of outcome changes with each topic's probability. It's also treating the topics separately instead of together.

So instead of looking at your topic probabilities separately with respect to outcome, it would seem to make the most sense to combine them into a single model. A simple model might be a logistic regression model that includes each topic probability as a predictor, which in R might be written for a 0/1 outcome like:

# logistic regression with all five topic probabilities as predictors
glm(outcome ~ T1 + T2 + T3 + T4 + T5, family = "binomial", data = dat)

where T1 etc. represent the probabilities found for the corresponding topic mappings and dat is a data frame containing them along with the 0/1 outcome. That provides a probability model for the outcome that uses all the probability information about the topics together at once. In principle this could be extended to a mixed model in which you allow for differences among individuals.
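For concreteness, here is a minimal sketch of that fit in R. The data frame dat, its columns T1 through T5 and outcome, and the simulated values are placeholders assumed for illustration, not part of the original post:

# simulate placeholder data: five topic probabilities and a 0/1 outcome
set.seed(1)
n   <- 200
dat <- data.frame(T1 = runif(n), T2 = runif(n), T3 = runif(n),
                  T4 = runif(n), T5 = runif(n))
dat$outcome <- rbinom(n, 1, plogis(-1 + 2 * dat$T1 - 1.5 * dat$T3))

# logistic regression using all five topic probabilities together
fit <- glm(outcome ~ T1 + T2 + T3 + T4 + T5, family = "binomial", data = dat)
summary(fit)                                 # coefficients on the log-odds scale
dat$pred <- predict(fit, type = "response")  # fitted probability of outcome = 1

# a possible mixed-model extension if participants contribute repeated responses,
# using lme4 and a hypothetical participant identifier:
# lme4::glmer(outcome ~ T1 + T2 + T3 + T4 + T5 + (1 | participant),
#             family = binomial, data = dat)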

Then you can get an ROC curve that combines information from all of your topic mappings at once. You can then use your knowledge of the subject matter and the goals of your project (e.g., how much more costly false negatives are than false positives for you) if you need to choose a probability cutoff that best represents the "point" at which the "outcome value changes."
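As a rough illustration (assuming the pROC package; any ROC implementation would do), the ROC curve and a candidate cutoff can be computed from the fitted probabilities dat$pred obtained above:

library(pROC)

roc_obj <- roc(dat$outcome, dat$pred)  # ROC from the combined model's probabilities
auc(roc_obj)                           # overall discrimination

# Youden-optimal threshold: weights sensitivity and specificity equally
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))

# if, say, a false negative is judged 3 times as costly as a false positive,
# weight the choice via best.weights = c(cost, prevalence)
coords(roc_obj, x = "best", best.method = "youden",
       best.weights = c(3, mean(dat$outcome)),
       ret = c("threshold", "sensitivity", "specificity"))

Whether any such cutoff is needed at all depends on how the model will be used; for many purposes the predicted probabilities themselves are the more useful output.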

Correct answer by EdM on January 6, 2021
