Data Science Asked by Bruso on July 7, 2021
My aim is to derive textual similarity using multiple features. Some of the features are textual, for which I am using the Universal Sentence Encoder (TF Hub, TF 2.0). There are other categorical features, which are encoded with a one-hot encoder.
For example, for a single record in my dataset, the feature vector is the sentence embedding concatenated with the one-hot encoded categorical vector.
After this, I derive a similarity matrix using cosine similarity to decide whether two such records are semantically the same.
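A minimal sketch of this setup (the example records, the "color" categorical feature, and the library versions below are placeholders, not my actual data):

```python
import numpy as np
import tensorflow_hub as hub
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity

# Load the Universal Sentence Encoder (512-dimensional embeddings).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

texts = ["red cotton shirt", "crimson cotton t-shirt", "steel water bottle"]
colors = [["red"], ["red"], ["silver"]]  # placeholder categorical feature

# Real-valued text embeddings, shape (3, 512).
text_vecs = embed(texts).numpy()

# One-hot encoded categorical feature, values are 0 or 1.
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(colors).toarray()

# Concatenate both parts into one feature vector per record and
# compute the pairwise cosine-similarity matrix.
features = np.concatenate([text_vecs, one_hot], axis=1)
sim_matrix = cosine_similarity(features)
print(np.round(sim_matrix, 3))
```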
The problem is that the ranges differ: the text features are real numbers, while the one-hot encoded features are 0 or 1.
So should I scale the one-hot encoded vector with a min-max scaler, or use some other technique?
No, do not scale the one-hot encoded vector with min-max scaling; that would lose the meaning of the data. One-hot encoding means a data point either lies entirely on a dimension or not at all, so a value that covers only a fraction of a dimension has no meaning.
A better option for deriving textual similarity from multiple features is to embed all of the features, including the categorical ones, in the same embedding space. StarSpace is one such embedding method.
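StarSpace has its own training pipeline, so as a rough stand-in for the same idea, here is a sketch that folds the categorical values into the text as tokens, so the sentence encoder embeds all features jointly (the token naming and records below are hypothetical, not StarSpace's API):

```python
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def record_to_text(text, color):
    # Render the categorical value as an extra token, e.g. "color_red",
    # so the encoder sees every feature in the same input string.
    return f"{text} color_{color}"

records = [
    ("red cotton shirt", "red"),
    ("crimson cotton t-shirt", "red"),
    ("steel water bottle", "silver"),
]

sentences = [record_to_text(text, color) for text, color in records]
vectors = embed(sentences).numpy()
print(cosine_similarity(vectors).round(3))
```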
Answered by Brian Spiering on July 7, 2021