Data Science Asked on May 22, 2021
How do I normalize this dataset correctly, so that all the observations have the same importance?
import pandas as pd
from math import sqrt   # needed below; missing in the original snippet

spam = pd.read_table("spambase.data", sep=',', header=None)

#========================
# Normalization Function
#========================
def Normalize(x):
    '''
    Normalization Function
    -----------
    Parameters:
    -----------
    @Parameter x: Vector
    ---------
    Returns:
    ---------
    Normalized Vector.
    '''
    norm = 0.0
    for e in x:
        norm += e ** 2        # sum of squares
    norm = sqrt(norm)         # take the root once, not per element
    for i in range(len(x)):
        x[i] /= norm          # divide each entry by the L2 norm
    return x
Normalizing so that "all the observations have the same importance" is somewhat ambiguous and ill-defined. In any case, I would strongly advise against re-inventing the wheel; use one of the several scalers available out there (e.g. in the sklearn.preprocessing module).
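For instance, the row-wise L2 normalization that the question's Normalize function attempts is already available, vectorized, as sklearn.preprocessing.normalize; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# norm='l2' divides each row by its Euclidean norm
X_l2 = normalize(X, norm='l2')
print(X_l2)
# the row [3, 4] has norm 5, so it becomes [0.6, 0.8]
```

Note this scales row-wise (per observation), which is what the hand-written loop does; the scalers discussed below work column-wise (per feature) instead.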
Here is an example using MinMaxScaler, which will re-scale your data in [0, 1] column-wise:
import pandas as pd
df = pd.read_csv("spambase.data", header=None)
print(df.head())
# result:
0 1 2 3 4 5 ... 52 53 54 55 56 57
0 0.00 0.64 0.64 0.0 0.32 0.00 ... 0.000 0.000 3.756 61 278 1
1 0.21 0.28 0.50 0.0 0.14 0.28 ... 0.180 0.048 5.114 101 1028 1
2 0.06 0.00 0.71 0.0 1.23 0.19 ... 0.184 0.010 9.821 485 2259 1
3 0.00 0.00 0.00 0.0 0.63 0.00 ... 0.000 0.000 3.537 40 191 1
4 0.00 0.00 0.00 0.0 0.63 0.00 ... 0.000 0.000 3.537 40 191 1
[5 rows x 58 columns]
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler() # define the scaler
df_scaled = pd.DataFrame(sc.fit_transform(df)) # fit & transform the data
print(df_scaled.head())
# result:
0 1 2 3 ... 54 55 56 57
0 0.000000 0.044818 0.125490 0.0 ... 0.002502 0.006007 0.017487 1.0
1 0.046256 0.019608 0.098039 0.0 ... 0.003735 0.010012 0.064836 1.0
2 0.013216 0.000000 0.139216 0.0 ... 0.008008 0.048458 0.142551 1.0
3 0.000000 0.000000 0.000000 0.0 ... 0.002303 0.003905 0.011995 1.0
4 0.000000 0.000000 0.000000 0.0 ... 0.002303 0.003905 0.011995 1.0
[5 rows x 58 columns]
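One practical caveat: if you later split the data into train and test sets, fit the scaler on the training portion only and reuse it to transform the test portion, so no test-set statistics leak into preprocessing. A sketch with synthetic stand-in data (the array shape here is illustrative, not the spambase one):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the feature matrix
rng = np.random.RandomState(0)
X = rng.rand(100, 5) * 10

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

sc = MinMaxScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn min/max on train only
X_test_scaled = sc.transform(X_test)        # reuse the same min/max

# training columns land exactly in [0, 1]; test values may fall slightly outside
print(X_train_scaled.min(), X_train_scaled.max())
```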
Keep in mind that normalization also depends on your choice of model: it is practically necessary for neural networks and k-nn (and for k-means clustering), but it is completely redundant for decision trees and tree-ensemble models (Random Forest, GBM etc).
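For those distance- and gradient-based models, standardization to zero mean and unit variance via StandardScaler is an equally common alternative to min-max scaling; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

sc = StandardScaler()
X_std = sc.fit_transform(X)  # per column: subtract mean, divide by std

print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # approximately [1, 1]
```

Unlike MinMaxScaler, StandardScaler does not bound the output to a fixed range, but it is less sensitive to a single extreme value stretching the scale.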
Correct answer by desertnaut on May 22, 2021