Data Science Asked on March 28, 2021
I have a huge data set with one of the columns named 'mail_id'. The mail_id values are given in a cryptic, hash-like format as shown below:
mail_id
DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
EHNBRbi6i9KO6cMHsuDPFjZVp2cY3RH+BiOKwPwzLQs=
K0y/NW59TJkYc5y0HUwDeAXrewYT0JQlkcozz0s2V5Q=
UGATDXARg7jMEInKH7oXgty2nwxnwD2l0OW/8Nsa0MI=
qE9zgWiITYA97RUiN4X/t9IVWLViLz+lKijaYegyBiQ=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
4+EEK8RbNYwuFCHznY9XSRCV4Yek60bHVgnP3jtjjzk=
After doing a lot of analysis on my data, I have found that I cannot drop this feature from my model, so I have to convert it into something meaningful. Can anyone please explain how to do this efficiently?
I'd say there are pros and cons to using FeatureHasher for this purpose. If you are set on using it, just instantiate it like this:
In [1]:
from sklearn.feature_extraction import FeatureHasher

mail_id = [
    "DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=",
    "BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=",
    "EHNBRbi6i9KO6cMHsuDPFjZVp2cY3RH+BiOKwPwzLQs=",
    "K0y/NW59TJkYc5y0HUwDeAXrewYT0JQlkcozz0s2V5Q=",
    "UGATDXARg7jMEInKH7oXgty2nwxnwD2l0OW/8Nsa0MI=",
]

h = FeatureHasher(n_features=5, input_type='string')
# wrap each id in a list so the whole string is hashed as a single token;
# passing the raw strings would hash them character by character instead
f = h.transform([[m] for m in mail_id])
f.toarray()
Out[1]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0., -1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0., -1.,  0.,  0.],
       [ 0., -1.,  0.,  0.,  0.]])
So, after you have instantiated it, just .transform each of your incoming mail_ids and use the results in downstream applications (online learning, for instance). Obviously n_features is a knob to tune. But this has its flip side: the cardinality of mail ids is a priori high, so unless you have a very limited number of users you will need an enormous n_features to minimize collisions; the sketch below illustrates how the collision rate scales with table size.
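To make that trade-off concrete, here is a minimal sketch (not from the original answer) that estimates the collision rate for a given number of hash buckets. It uses sklearn's murmurhash3_32, the same hash family FeatureHasher relies on internally; the 100,000 random ids are purely illustrative stand-ins for your real mail_ids.

In [2]:
from sklearn.utils import murmurhash3_32
import random, string

def collision_rate(ids, n_features):
    """Fraction of distinct ids that end up sharing a bucket."""
    distinct = set(ids)
    buckets = {murmurhash3_32(i, positive=True) % n_features for i in distinct}
    return 1 - len(buckets) / len(distinct)

# purely illustrative: 100,000 random 44-character ids
ids = [''.join(random.choices(string.ascii_letters + '+/=', k=44))
       for _ in range(100_000)]
for n in (2 ** 10, 2 ** 16, 2 ** 20):
    print(f"n_features={n:>8}: collision rate ~ {collision_rate(ids, n):.3f}")

With 100k distinct ids even a million buckets leaves a noticeable fraction of ids sharing a slot, which is the birthday-problem effect the paragraph above warns about.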
A better option would be to take logs where your ids co-appear and learn an item2vec-style model on them. This will deliver a much denser (and more meaningful) representation of mail_ids than FeatureHasher would.
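As a rough illustration of that idea (not part of the original answer), the sketch below trains a skip-gram model with gensim's Word2Vec over hypothetical session logs; the sessions variable is a placeholder for whatever co-occurrence logs you actually have.

In [3]:
from gensim.models import Word2Vec

# hypothetical co-occurrence logs: each inner list holds the mail_ids
# observed together in one session (replace with your real logs)
sessions = [
    ["DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=",
     "BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls="],
    ["BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=",
     "EHNBRbi6i9KO6cMHsuDPFjZVp2cY3RH+BiOKwPwzLQs=",
     "K0y/NW59TJkYc5y0HUwDeAXrewYT0JQlkcozz0s2V5Q="],
]

# sg=1 selects skip-gram, the usual choice for item2vec; a wide window
# makes within-session order irrelevant
model = Word2Vec(sessions, vector_size=32, window=10, min_count=1, sg=1)

dense = model.wv["BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls="]  # 32-d vector

Each id then gets a dense vector reflecting which other ids it tends to appear with, which is exactly the denser representation mentioned above.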
Also, take a look at this.
Answered by Vast Academician on March 28, 2021