
Is there a way to get optimal cutoff points based on topic model probabilities and the outcomes?

Cross Validated Asked by Kuni on January 6, 2021

I have topic probabilities obtained using the LDA topic modeling method. I'd like to use these probabilities for 5 topics to predict a binary outcome (0/1). I'd like to find an optimal cutoff point for each topic with respect to the outcome (0/1). I was thinking about ROC, but it seems like that may not work since I don't have predicted outcomes. Are there any methods that allow me to find the cutoff points for each variable, and does it even make sense to do this?

Background

  1. We collected open-ended responses from participants and performed topic modeling.
  2. We collected outcomes of interest from these participants, e.g., intent to leave, performance, etc.
  3. The hypothesis is that the responses the participants enter are indicative of the outcome.

The question is: at what point in each of these topic model probabilities does the outcome value change, e.g., from 0 to 1 or from 1 to 0?

One Answer

It's seldom that there is a single "point" at which the "outcome value changes." There's usually a gradation with respect to a predictor in terms of the probability of a particular outcome.

So what usually works best for a binary outcome is a probability model for the outcome that takes as much information into account as possible without overfitting. Furthermore, with binary outcomes there's a particular risk of omitted-variable bias: if you leave out any predictor associated with the outcome, it can become harder to identify other predictors associated with the outcome.

In that context, looking for separate cutoffs for each of the predictors based on ROC is throwing away the detailed information about how probability of outcome changes with each topic's probability. It's also treating the topics separately instead of together.

So instead of looking at your topic probabilities separately with respect to outcome, it would seem to make the most sense to combine them into a single model. A simple model might be a logistic regression model that includes each topic probability as a predictor, which in R might be written for a 0/1 outcome like:

# logistic regression with all five topic probabilities as predictors
glm(outcome ~ T1 + T2 + T3 + T4 + T5, family = "binomial", data = dat)

where T1 etc. represent the probabilities found for the corresponding topic mappings and dat is a data frame containing them along with the 0/1 outcome. That provides a probability model for the outcome that uses all the probability information about the topics together at once. In principle this could be extended to a mixed model in which you allow for differences among individuals.
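For concreteness, here is a minimal sketch of that fit in R. The data frame dat, its columns T1 through T5 and outcome, and the simulated values are placeholders assumed for illustration, not part of the original post:

# simulate placeholder data: five topic probabilities and a 0/1 outcome
set.seed(1)
n   <- 200
dat <- data.frame(T1 = runif(n), T2 = runif(n), T3 = runif(n),
                  T4 = runif(n), T5 = runif(n))
dat$outcome <- rbinom(n, 1, plogis(-1 + 2 * dat$T1 - 1.5 * dat$T3))

# logistic regression using all five topic probabilities together
fit <- glm(outcome ~ T1 + T2 + T3 + T4 + T5, family = "binomial", data = dat)
summary(fit)                                 # coefficients on the log-odds scale
dat$pred <- predict(fit, type = "response")  # fitted probability of outcome = 1

# a possible mixed-model extension if participants contribute repeated responses,
# using lme4 and a hypothetical participant identifier:
# lme4::glmer(outcome ~ T1 + T2 + T3 + T4 + T5 + (1 | participant),
#             family = binomial, data = dat)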

Then you can get an ROC curve that combines information from all of your topic mappings at once. You can then use your knowledge of the subject matter and the goals of your project (e.g., how much more costly false negatives are than false positives for you) if you need to choose a probability cutoff that best represents the "point" at which the "outcome value changes."
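As a rough illustration (assuming the pROC package; any ROC implementation would do), the ROC curve and a candidate cutoff can be computed from the fitted probabilities dat$pred obtained above:

library(pROC)

roc_obj <- roc(dat$outcome, dat$pred)  # ROC from the combined model's probabilities
auc(roc_obj)                           # overall discrimination

# Youden-optimal threshold: weights sensitivity and specificity equally
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))

# if, say, a false negative is judged 3 times as costly as a false positive,
# weight the choice via best.weights = c(cost, prevalence)
coords(roc_obj, x = "best", best.method = "youden",
       best.weights = c(3, mean(dat$outcome)),
       ret = c("threshold", "sensitivity", "specificity"))

Whether any such cutoff is needed at all depends on how the model will be used; for many purposes the predicted probabilities themselves are the more useful output.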

Correct answer by EdM on January 6, 2021
