Clustering with Likert items and N/A option

Question

I am currently evaluating the results of a 72-question survey with response levels ranging from "strongly disagree" to "strongly agree". I would like to cluster the questions by response pattern using R (I'm using "clara").

Here's the rub: There is also a response option "N/A", because some questions are not relevant for some of the respondents, so they couldn't have an opinion of whether to agree or disagree with the premise of the question.

Currently, I have coded the agreement levels from -2 to 2, and "N/A" got a -3 just to see if everything works. It does, so this is not a coding question.

My question is: Do you know of a clever distance function that I could use in this situation to calculate more meaningful clusters? The goal would be to compare only the responses of those for whom the question is relevant. I don't think I can just drop the "N/A"s, because that would give me points of different dimensions, so neither the Euclidean nor the taxicab metric would be happy.

EDIT 1: One possibility would be to apply a chi-squaresque metric that only compares the distributions of the desired responses, but that strikes me as too crude.
EDIT 2: Another possibility would be to adapt the Euclidean or taxicab metric with appropriate weights so that "proper" responses would be given higher consideration.
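
To make EDIT 2 concrete, here is a minimal sketch of one possible weighting, written in Python for brevity rather than R. It assumes that "N/A" is stored as `None` instead of -3, and that the weights are simply 1 when both questions have a real answer from a respondent and 0 otherwise. The function name `weighted_taxicab` and the toy data are made up for illustration; other weightings are equally conceivable.

```python
NA = None  # assumption: "N/A" is stored as None instead of -3


def weighted_taxicab(x1, x2):
    """Mean absolute difference over respondents who answered both questions.

    Each component gets weight 1 if both answers are real and 0 otherwise.
    Dividing by the total weight keeps pairs with few shared answers
    comparable to pairs with many. Returns None if nobody answered both.
    """
    total = 0.0
    weight = 0
    for a, b in zip(x1, x2):
        if a is NA or b is NA:
            continue  # weight 0: this respondent does not contribute
        total += abs(a - b)
        weight += 1
    return total / weight if weight else None


# Toy data: two questions answered by five respondents, coded -2..2.
q1 = [2, 1, NA, -1, 0]
q2 = [1, 1, 2, NA, -2]
print(weighted_taxicab(q1, q2))
```

Because the sum is divided by the number of shared answers, all vectors can keep the same length and the "different dimensions" problem does not arise. Pairs of questions with no shared respondents still need a separate decision.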

PS: No, I cannot eliminate the N/As because I need to be able to calculate a "distance" between the response patterns of any two questions, so I can find out which groups of questions tended to be answered similarly throughout.

Matthias · Answer

The following measure has given good results in practice. I have not looked into its mathematical properties:

Let $x_1, x_2$ be two response vectors in $\mathbb{R}^N$, where $N$ is the number of responses. Let $n$ be the number of components in which neither $x_1$ nor $x_2$ is "N/A". Let $k$ be the number of those components whose absolute difference is at most 1 (say).

Then define the similarity between $x_1$ and $x_2$ as $\frac{k}{n}$, so that $1-\frac{k}{n}$ serves as the dissimilarity and appears to give a reasonably good "metric." (As I said, I haven't checked to what extent this violates the axioms for a metric.)
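
The definition above can be sketched as follows. This is an illustrative Python version, not the code that was actually used. It assumes "N/A" is stored as `None`, and the names `agreement_dissimilarity`, `dissimilarity_matrix` and the tolerance parameter `tol` are hypothetical. The case $n = 0$ is not covered by the definition, so the sketch returns `None` there.

```python
from itertools import combinations

NA = None  # assumption: "N/A" is stored as None rather than as -3


def agreement_dissimilarity(x1, x2, tol=1):
    """Return 1 - k/n for two response vectors of equal length.

    n counts positions where both vectors hold a real answer;
    k counts those positions whose answers differ by at most `tol`.
    Returns None when n == 0, because the measure is undefined there.
    """
    if len(x1) != len(x2):
        raise ValueError("response vectors must have the same length")
    pairs = [(a, b) for a, b in zip(x1, x2) if a is not NA and b is not NA]
    n = len(pairs)
    if n == 0:
        return None
    k = sum(1 for a, b in pairs if abs(a - b) <= tol)
    return 1 - k / n


def dissimilarity_matrix(questions, tol=1):
    """Symmetric matrix of pairwise dissimilarities, zero on the diagonal."""
    m = len(questions)
    d = [[0.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        d[i][j] = d[j][i] = agreement_dissimilarity(questions[i], questions[j], tol)
    return d


# Toy data: three questions answered by five respondents, coded -2..2.
q1 = [2, 1, NA, -1, 0]
q2 = [1, 1, 2, NA, -2]
q3 = [-2, NA, NA, 2, 2]

print(agreement_dissimilarity(q1, q2))
print(dissimilarity_matrix([q1, q2, q3]))
```

One property is easy to check with this sketch: with a tolerance of 1, the single-respondent vectors `[0]`, `[1]` and `[2]` give dissimilarities 0, 0 and 1, so distinct vectors can be at distance 0 and the triangle inequality fails. The measure is therefore a dissimilarity rather than a true metric, which only matters for clustering methods that rely on the metric axioms.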
