Data Science: Asked on December 9, 2021
I am trying to build a user-specific model that predicts whether arbitrary English text is complex for a particular user. Having both complex and easy text samples would allow me to build such a model, but what if I have only complex samples? How can I build the model in that case?
I can detect whether a given text is different (an "outlier") from the texts the user marked as difficult, but that alone does not tell me in which direction it differs: the text could be easier or more difficult.
Currently I see only one option: make an assumption about what easy text looks like. But that seems unsafe, since different people have their own unique areas of a text that they do not understand.
Many ways to measure text complexity have been proposed in the literature; I don't have a particular survey to recommend, but Google is your friend.
Many of these measures are heuristics, i.e. they work in an unsupervised way. I don't remember the details, but I've seen work that combines several of these measures to obtain more accurate results.
A basic approach would be to build a language model on the complex texts, measure any new text against this model, and assume that if the new text is dissimilar then it is not complex. But, as you rightly noticed, that is not a very safe assumption.
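A minimal sketch of that idea, under my own assumptions: a toy bigram model with add-one smoothing trained on the complex samples, with perplexity as the dissimilarity score (low perplexity means the new text resembles the complex samples). The corpus and tokenization below are placeholders, not a real pipeline.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over pre-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams, len(unigrams)

def perplexity(tokens, unigrams, bigrams, vocab_size):
    """Perplexity of a token sequence under the bigram model,
    with add-one (Laplace) smoothing for unseen events."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob, n = 0.0, 0
    for prev, curr in zip(padded, padded[1:]):
        # P(curr | prev) with add-one smoothing
        p = (bigrams[(prev, curr)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Hypothetical usage: 'complex_corpus' stands in for the texts the
# user marked as difficult; in practice it would be much larger.
complex_corpus = [
    ["the", "epistemological", "ramifications", "are", "vast"],
    ["quantum", "decoherence", "precludes", "naive", "realism"],
]
uni, bi, v = train_bigram_lm(complex_corpus)
print(perplexity(["the", "cat", "sat"], uni, bi, v))
```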
At the most basic level, you can use the type-token ratio (TTR): divide the number of types (unique tokens) by the total number of tokens. The TTR is a fairly good indicator of lexical diversity, so complex text is likely to give a high value. It's a very crude measure, but it's useful as a baseline: whatever system you try, if it doesn't give better results than a simple threshold on the TTR, then it's not a good system :)
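For concreteness, a tiny sketch of the TTR baseline (the example texts are made up). One caveat worth keeping in mind: TTR tends to decrease as texts get longer, so it is only comparable across texts of similar length.

```python
def type_token_ratio(tokens):
    """Type-token ratio: number of unique tokens / total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# A repetitive (easy) text scores lower than a lexically varied one.
easy = "the dog saw the dog and the dog ran".split()
hard = "ephemeral paradoxes confound ostensibly rigorous epistemology".split()
print(type_token_ratio(easy))  # 5 types / 9 tokens ~= 0.56
print(type_token_ratio(hard))  # 6 types / 6 tokens = 1.0
```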
Answered by Erwan on December 9, 2021