Cross Validated Asked by Kuni on January 6, 2021
I have topic probabilities obtained from an LDA topic model. I'd like to use these probabilities for 5 topics to predict an outcome (0/1), and to find an optimal cutoff point for each topic with respect to that outcome. I was thinking about ROC, but it seems like it may not work since I don't have a predicted outcome. Are there any methods that allow me to find the cutoff points for each variable, and does it make sense to do so?
Background
… outcome of choice from these participants, e.g., intent to leave, performance, etc.
The question is: at what points in each of these topic model probabilities does the outcome value change, e.g., from 0 to 1 or from 1 to 0?
It's seldom that there is a single "point" at which the "outcome value changes." There's usually a gradation with respect to a predictor in terms of the probability of a particular outcome.
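To make that gradation concrete, here is a minimal R sketch using a made-up intercept and slope on the log-odds scale. The values b0 and b1 are assumptions chosen for illustration, not estimates from any data.

```r
# Hypothetical intercept and slope on the log-odds scale (not estimated from data).
b0 <- -2
b1 <- 4

# Topic probability for a single topic, from 0 to 1.
topic_prob <- seq(0, 1, by = 0.25)

# Logistic curve: probability that outcome = 1 at each topic probability.
p_outcome <- plogis(b0 + b1 * topic_prob)

round(data.frame(topic_prob, p_outcome), 3)
```

With these made-up numbers the probability of outcome rises smoothly from about 0.12 at a topic probability of 0 to about 0.88 at 1, passing 0.5 along the way; no single value marks a switch from 0 to 1.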
So what usually works best for a binary outcome is a probability model for the outcome that takes as much information into account as possible without overfitting. Furthermore, with binary outcomes there's a particular risk of omitted-variable bias; if you leave out any predictor associated with outcome, it can make it harder to identify other predictors associated with outcome.
In that context, looking for separate cutoffs for each of the predictors based on ROC is throwing away the detailed information about how probability of outcome changes with each topic's probability. It's also treating the topics separately instead of together.
So instead of looking at your topic probabilities separately with respect to outcome, it would seem to make the most sense to combine them into a single model. A simple model might be a logistic regression model that includes each topic probability as a predictor, which in R might be written for a 0/1 outcome like:
glm(outcome ~ T1 + T2 + T3 + T4 + T5, family = "binomial")
where T1 etc. represent the probabilities found for the corresponding topic mappings. That provides a probability model for outcome that uses all the probability information about each of the topics together at once. In principle that could be extended to a mixed model in which you allow for differences among individuals.
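A minimal sketch of fitting that model in base R follows. Everything about the data is simulated and assumed: the 8-topic LDA model, the data frame dat, and the "true" coefficients exist only to make the example runnable. The sketch treats the five topics as a subset of a larger model. If they were the complete set of topics, each row would sum to 1, one topic would be redundant with the intercept, and glm would report NA for its coefficient.

```r
# Simulated stand-in data; nothing here comes from the original question.
set.seed(1)
n <- 500
k <- 8                                    # hypothetical number of LDA topics

# Normalised gamma draws give rows that sum to 1, like LDA topic probabilities.
g <- matrix(rgamma(n * k, shape = 0.5), nrow = n)
theta <- g / rowSums(g)

# Keep five of the eight topics as predictors T1..T5.
dat <- as.data.frame(theta[, 1:5])
names(dat) <- paste0("T", 1:5)

# Hypothetical true relationship, used only to generate a 0/1 outcome.
eta <- -0.5 + 3 * dat$T1 - 2 * dat$T3
dat$outcome <- rbinom(n, size = 1, prob = plogis(eta))

# Logistic regression with all five topic probabilities together.
fit <- glm(outcome ~ T1 + T2 + T3 + T4 + T5, family = "binomial", data = dat)

# Coefficients are on the log-odds scale.
print(summary(fit)$coefficients)

# Predicted probability that outcome = 1 for each participant.
dat$p_hat <- fitted(fit)
head(dat$p_hat)
```

The column p_hat is the single combined score that an ROC analysis would then be run on, in place of five separate topic-by-topic analyses.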
Then you can get an ROC that combines information from all of your topic mappings at once. You then use your knowledge of the subject matter and the goals of your project (e.g., how much more costly are false negatives than false positives for you) if you need to choose a probability cutoff that best represents the "point" at which the "outcome value changes."
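One way to turn that cost trade-off into a cutoff is sketched below in base R. The helper names roc_table and best_cutoff, the cost values, and the six-row toy data are all hypothetical. The inputs are a 0/1 outcome vector and predicted probabilities from any fitted model. Minimising total misclassification cost is one possible criterion among several, not the only reasonable one.

```r
# ROC points: true and false positive rates when predicting 1 for p_hat >= cutoff.
roc_table <- function(outcome, p_hat) {
  # Inf corresponds to "never predict 1".
  cuts <- c(Inf, sort(unique(p_hat), decreasing = TRUE))
  data.frame(
    cutoff = cuts,
    tpr = sapply(cuts, function(cc) mean(p_hat[outcome == 1] >= cc)),
    fpr = sapply(cuts, function(cc) mean(p_hat[outcome == 0] >= cc))
  )
}

# Cutoff with the lowest total cost, given hypothetical costs per error type.
best_cutoff <- function(outcome, p_hat, cost_fn = 1, cost_fp = 1) {
  roc <- roc_table(outcome, p_hat)
  n_pos <- sum(outcome == 1)
  n_neg <- sum(outcome == 0)
  roc$cost <- cost_fn * (1 - roc$tpr) * n_pos + cost_fp * roc$fpr * n_neg
  roc[which.min(roc$cost), ]
}

# Toy data small enough to check by hand.
outcome <- c(0, 0, 1, 0, 1, 1)
p_hat   <- c(0.1, 0.2, 0.35, 0.4, 0.7, 0.9)

print(roc_table(outcome, p_hat))
print(best_cutoff(outcome, p_hat, cost_fn = 5, cost_fp = 1))
print(best_cutoff(outcome, p_hat, cost_fn = 1, cost_fp = 5))
```

On the toy data, the chosen cutoff is 0.35 when a false negative is five times as costly as a false positive and 0.7 when the costs are reversed, which shows that the "best" cutoff depends on the costs and not on the data alone.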
Correct answer by EdM on January 6, 2021