Asked by tohid mon on February 21, 2021
How should I choose n_features for FeatureHasher in scikit-learn?

Assume that I have 1000 categories in feature "case" and I would like to hash them.
As mentioned in its documentation, it is advisable to use a power of 2 as the number of features; otherwise, the features will not be mapped evenly to the columns. It is also suggested to leave n_features at its default value of 2 ** 20 for real-world settings, and to select a lower value such as 2 ** 18 when memory or downstream model size is an issue.
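For concreteness, here is a minimal sketch of setting n_features on FeatureHasher; the category tokens and the choice of 2 ** 18 are illustrative assumptions, not taken from the question:

```python
from sklearn.feature_extraction import FeatureHasher

# Hash a categorical feature into a fixed-width sparse matrix.
# n_features = 2 ** 18 is the "memory-constrained" choice mentioned above;
# input_type="string" lets us pass raw category tokens directly.
hasher = FeatureHasher(n_features=2 ** 18, input_type="string")

# Hypothetical samples: each row is an iterable of string features.
rows = [["case=A"], ["case=B"], ["case=A"]]
X = hasher.transform(rows)
print(X.shape)  # (3, 262144)
```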
Keep in mind that, as also stated in the documentation, a small number of features is likely to cause hash collisions, while a large number inflates the coefficient dimension in linear learners.
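To make that trade-off concrete, a rough sketch like the following counts how many of 1000 hypothetical categories (names made up for illustration) end up sharing a column at various table sizes:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical category names, one per sample.
categories = [f"case={i}" for i in range(1000)]
rows = [[c] for c in categories]

for n in (2 ** 8, 2 ** 10, 2 ** 14, 2 ** 18):
    X = FeatureHasher(n_features=n, input_type="string").transform(rows)
    # Each row has exactly one nonzero entry, so the number of distinct
    # occupied columns tells us how many categories collided.
    occupied = len(set(X.nonzero()[1]))
    print(f"n_features={n}: {len(categories) - occupied} collisions")
```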
My overall suggestion is to use a power of 2 as the number of features. If the number of input categories is small (e.g., 30, as you mentioned in the comment below), you can treat n_features as a hyperparameter and find the optimal value with cross-validation: test different powers of 2 such as 2, 4, 8, 16, and so on, depending on the size of your data. That is the most reliable approach. Note, however, that the hashing trick pays off mainly when the number of input features is very large; in your case, I would go with the other encoding methods available in the technical literature.
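As a sketch of that tuning loop, assuming a toy dataset with a single hashed categorical feature and a random binary target (all names and sizes below are illustrative), one could grid-search n_features over powers of 2 inside a Pipeline:

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data: 500 samples of a single categorical feature with 30 levels
# and a random binary target (purely illustrative).
rng = np.random.default_rng(0)
X = [[f"case={rng.integers(30)}"] for _ in range(500)]
y = rng.integers(2, size=500)

pipe = Pipeline([
    ("hash", FeatureHasher(input_type="string")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Treat n_features as a hyperparameter: search over powers of 2.
grid = GridSearchCV(
    pipe,
    param_grid={"hash__n_features": [2 ** k for k in range(1, 11)]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```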
Answered by nimar on February 21, 2021