
Should one-hot encoded categorical features be scaled when used alongside a text feature for deriving semantic similarity?

Data Science Asked by Bruso on July 7, 2021

My aim is to derive textual similarity using multiple features. Some of the features are textual; for these I am using the Universal Sentence Encoder (TF Hub 2.0). The other features are categorical and are encoded with a one-hot encoder.

For example, for a single record in my dataset, the feature vector looks like this:

  1. text feature’s embedding: a 512-dimensional vector – 1 × 512
  2. categorical (non-ordered) feature vector – 1 × 500 (since the feature has 500 unique values)
  3. final feature vector – 1 × 1012
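The construction above can be sketched as follows. This is a minimal illustration with stand-in values: the random vector plays the role of the real Universal Sentence Encoder output, and the category index is arbitrary.

```python
import numpy as np

# Hypothetical stand-ins for the real pipeline: the Universal Sentence
# Encoder would produce the 512-d text embedding, and a fitted one-hot
# encoder would produce the 500-d categorical vector.
rng = np.random.default_rng(0)
text_embedding = rng.normal(size=512)   # 1 x 512, real-valued

category_index = 42                     # one of 500 unique category values
one_hot = np.zeros(500)
one_hot[category_index] = 1.0           # 1 x 500, binary

# Final feature vector for this record: 1 x 1012
feature_vector = np.concatenate([text_embedding, one_hot])
print(feature_vector.shape)             # (1012,)
```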

After this, I derive a similarity matrix using cosine similarity to decide whether two such records are semantically the same.
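That step can be sketched as below; the two random rows are stand-ins for real 1 × 1012 feature vectors, and the function name is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """X: (n_records, n_features). Returns (n_records, n_records)
    pairwise cosine similarities."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms          # L2-normalize each row
    return X_unit @ X_unit.T    # dot products of unit vectors = cosines

# Two toy records standing in for real concatenated feature vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 1012))
S = cosine_similarity_matrix(X)
print(S.shape)                  # (2, 2); diagonal entries are 1.0
```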

The problem is that the range of values differs between the text feature (real numbers) and the one-hot encoded features (0 or 1).
So should I scale the one-hot encoded vector with a min-max scaler, or use some other technique?

One Answer

No, do not scale the one-hot encoded vector with min-max scaling. That would destroy the meaning of the data: one-hot encoding means a data point either belongs entirely to a category dimension or not at all, so fractional values along those dimensions have no meaning.

A better option for deriving textual similarity from multiple features is to embed all features (including the categorical ones) in the same embedding space. StarSpace is one such embedding method.

Answered by Brian Spiering on July 7, 2021
