
How to normalize the data correctly in spam dataset

Data Science, asked on May 22, 2021

  • I’m working on the spam dataset to classify the inputs into binary classes.
  • My problem is that the observations in the first 53 columns are small float numbers, column 54 contains larger float numbers, while the last two columns are integers.

My Question:

How do I normalize this dataset correctly, so that all the observations have the same importance?

import pandas as pd
spam = pd.read_table("spambase.data",sep=',',header=None)
  • One proposed approach, which didn’t seem very convenient to me because it normalizes the whole row input (rather than each column), is:
#========================
# Normalization Function
#========================
from math import sqrt

def Normalize(x):
    '''
    ==================================
    Normalization Function
    ==================================
    -----------
    Parameters:
    -----------
    @Parameter x: Vector
    ---------
    Returns:
    ---------
    Normalized Vector.
    ================================
    '''
    norm = 0.0
    for e in x:
        norm += e ** 2        # sum of squared elements
    norm = sqrt(norm)         # Euclidean (L2) norm, computed once
    for i in range(len(x)):
        x[i] /= norm          # scale each element in place
    return x
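
As a minimal sketch of why this is row-wise: the function gives a single vector unit L2 norm, which is exactly what scikit-learn's Normalizer does per sample, and why it does not put the columns on a common scale:

import numpy as np
from sklearn.preprocessing import Normalizer

row = np.array([[3.0, 4.0]])                     # a single sample (row vector)
print(Normalize(list(row[0])))                   # [0.6, 0.8] -- unit L2 norm
print(Normalizer(norm='l2').fit_transform(row))  # [[0.6 0.8]] -- same result, row-wise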

One Answer

Normalizing so that "all the observations have the same importance" is somewhat ambiguous and ill-defined. In any case, you are strongly advised to avoid re-inventing the wheel and to use one of the several scalers available out there (e.g. in the sklearn.preprocessing module).

Here is an example using MinMaxScaler, which will re-scale your data to [0, 1] column-wise:

import pandas as pd
df = pd.read_csv("spambase.data", header=None)
print(df.head())
# result:
     0     1     2    3     4     5   ...     52     53     54   55    56  57
0  0.00  0.64  0.64  0.0  0.32  0.00  ...  0.000  0.000  3.756   61   278   1
1  0.21  0.28  0.50  0.0  0.14  0.28  ...  0.180  0.048  5.114  101  1028   1
2  0.06  0.00  0.71  0.0  1.23  0.19  ...  0.184  0.010  9.821  485  2259   1
3  0.00  0.00  0.00  0.0  0.63  0.00  ...  0.000  0.000  3.537   40   191   1
4  0.00  0.00  0.00  0.0  0.63  0.00  ...  0.000  0.000  3.537   40   191   1

[5 rows x 58 columns]

from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler() # define the scaler
df_scaled = pd.DataFrame(sc.fit_transform(df)) # fit & transform the data
print(df_scaled.head())
# result:
         0         1         2    3   ...        54        55        56   57
0  0.000000  0.044818  0.125490  0.0  ...  0.002502  0.006007  0.017487  1.0
1  0.046256  0.019608  0.098039  0.0  ...  0.003735  0.010012  0.064836  1.0
2  0.013216  0.000000  0.139216  0.0  ...  0.008008  0.048458  0.142551  1.0
3  0.000000  0.000000  0.000000  0.0  ...  0.002303  0.003905  0.011995  1.0
4  0.000000  0.000000  0.000000  0.0  ...  0.002303  0.003905  0.011995  1.0

[5 rows x 58 columns]
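
Another option among the sklearn.preprocessing scalers is StandardScaler, which standardizes each column to zero mean and unit variance. A minimal sketch (in practice you would usually fit the scaler on the feature columns only, leaving the 0/1 label in the last column untouched, and fit on the training split only to avoid leakage):

from sklearn.preprocessing import StandardScaler

X, y = df.iloc[:, :-1], df.iloc[:, -1]         # features vs. the 0/1 label
sc2 = StandardScaler()                         # define the scaler
X_scaled = pd.DataFrame(sc2.fit_transform(X))  # each column: mean 0, std 1
print(X_scaled.describe().loc[['mean', 'std']])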

Keep in mind that normalization also depends on your choice of model: it is practically necessary for neural networks and k-nn (and for k-means clustering), but it is completely redundant for decision trees and tree-ensemble models (Random Forest, GBM etc).

Correct answer by desertnaut on May 22, 2021
