
Should one-hot encoded categorical features be scaled when used alongside a text feature for deriving semantic similarity?

Data Science Asked by Bruso on July 7, 2021

My aim is to derive textual similarity using multiple features. Some of the features are textual; for these I am using the Universal Sentence Encoder (TF Hub 2.0). The other features are categorical and are encoded with a one-hot encoder.

For example, for a single record in my dataset, the feature vector looks like this:

  1. text feature’s embedding: a 512-dimensional vector – 1 × 512
  2. categorical (non-ordered) feature vector – 1 × 500 (since the feature has 500 unique values)
  3. final feature vector – 1 × 1012
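The construction above can be sketched as follows. This is a minimal illustration with stand-in values: the random vector plays the role of the real Universal Sentence Encoder output, and the category index is arbitrary.

```python
import numpy as np

# Hypothetical stand-ins for the real pipeline: the Universal Sentence
# Encoder would produce the 512-d text embedding, and a fitted one-hot
# encoder would produce the 500-d categorical vector.
rng = np.random.default_rng(0)
text_embedding = rng.normal(size=512)   # 1 x 512, real-valued

category_index = 42                     # one of 500 unique category values
one_hot = np.zeros(500)
one_hot[category_index] = 1.0           # 1 x 500, binary

# Final feature vector for this record: 1 x 1012
feature_vector = np.concatenate([text_embedding, one_hot])
print(feature_vector.shape)             # (1012,)
```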

After this, I derive a similarity matrix using cosine similarity to decide whether two such records are semantically the same.
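That step can be sketched as below; the two random rows are stand-ins for real 1 × 1012 feature vectors, and the function name is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """X: (n_records, n_features). Returns (n_records, n_records)
    pairwise cosine similarities."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms          # L2-normalize each row
    return X_unit @ X_unit.T    # dot products of unit vectors = cosines

# Two toy records standing in for real concatenated feature vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 1012))
S = cosine_similarity_matrix(X)
print(S.shape)                  # (2, 2); diagonal entries are 1.0
```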

The problem is that the range of values differs between the text feature (real numbers) and the one-hot encoded features (0 or 1).
So should I scale the one-hot encoded vector with a min-max scaler, or use some other technique?

One Answer

No, do not scale the one-hot encoded vector with min-max scaling. That would destroy the meaning of the data: one-hot encoding means a data point either belongs entirely to a category dimension or not at all, so fractional values along those dimensions have no meaning.

A better option for deriving textual similarity from multiple features is to embed all features (including the categorical ones) in the same embedding space. StarSpace is one such embedding method.

Answered by Brian Spiering on July 7, 2021
