How to discretize certain features within a feature set?

Question

I am working with typing data that has timing features (unit: ms), and some of the features are based on keyboard keyCodes (positive integers, range: [8, 222]). Currently, I use scikit-learn's StandardScaler() to scale all the features, so that my learning models do not overweight the keyCode-based features. I would like to discretize the keyCode-based features and run StandardScaler() on the timing features only. How can I go about this?

glhuilli · Answer

Not sure about your question, but maybe something like the following could help:

1. Create dummy variables for your KeyCodes variable.
2. Normalize only those variables using the StandardScaler.

import pandas as pd
import random
from sklearn.preprocessing import StandardScaler

# Let's assume the following dataframe
data = {
    # keyCodes lie in [8, 222]; randrange excludes the upper bound
    'KeyCodes': [random.randrange(8, 223) for _ in range(10000)],
    'age': [random.randrange(1, 100) for _ in range(10000)],
    'id': [i for i in range(10000)]
}

# First, create the dummy variables based on KeyCodes
df = pd.DataFrame.from_dict(data)
# dtype=float keeps the dummies numeric (newer pandas versions default to bool)
dummies = pd.get_dummies(df['KeyCodes'], dtype=float).rename(columns=lambda x: f'KeyCode_{x}')
df = pd.concat([df, dummies], axis=1)
df.drop(['KeyCodes'], inplace=True, axis=1)

# Then you can apply the normalization to the subset of features you wish to normalize
normalized_df = df.copy()
# Use the dummy columns that were actually created, so a key code that never occurs cannot raise a KeyError
col_names = list(dummies.columns)
features_normalized = df[col_names]
scaler = StandardScaler().fit(features_normalized.values)
features_normalized = scaler.transform(features_normalized.values)
normalized_df[col_names] = features_normalized

# Explore the output
normalized_df['age'][:10]  # you can see it was not normalized
set(normalized_df['KeyCode_8'])  # normalized version of the feature
set(df['KeyCode_8'])  # not normalized version of the feature
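
If what you want is the reverse (scale only the timing columns and leave the one-hot keyCode columns as plain 0/1), a minimal sketch with scikit-learn's ColumnTransformer might look like the following. The column names key_code, hold_time_ms and flight_time_ms and the random data are made up for illustration; swap in your own columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical typing data: one keyCode column and two timing columns (ms)
rng = np.random.default_rng(0)
typing_df = pd.DataFrame({
    'key_code': rng.integers(8, 223, size=1000),       # integers in [8, 222]
    'hold_time_ms': rng.normal(100, 20, size=1000),
    'flight_time_ms': rng.normal(150, 40, size=1000),
})

timing_cols = ['hold_time_ms', 'flight_time_ms']
keycode_cols = ['key_code']

# Each transformer only sees the columns listed for it
preprocess = ColumnTransformer(
    [
        ('timing', StandardScaler(), timing_cols),
        ('keycode', OneHotEncoder(handle_unknown='ignore'), keycode_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: scaled timing features first, then one 0/1 column per key code seen
X = preprocess.fit_transform(typing_df)
```

Because the scaler and the encoder live in a single transformer, the same fitted object can later be applied to new data with preprocess.transform(...), and handle_unknown='ignore' encodes a key code that was not seen during fitting as all zeros instead of raising an error.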
