Data Science Asked by Tanmey Rawal on December 9, 2020
I am training an SVM on UCI's Bank Marketing Data Set (bank-additional-full.csv). Because the data is skewed, I am also interested in recall. I am getting an accuracy of about 87.95%, but my recall is around 51%. I want to know ways to increase recall without sacrificing too much accuracy, using an SVM only.
My code:
from sklearn.svm import SVC
svm_clf = SVC(gamma="auto",class_weight={1: 2.6})
svm_clf.fit(X_transformed, y_train_binary.ravel())
Additional info:
I have not created any new features (i.e., combined existing features) and have treated "unknown" as its own label.
I have also removed the Duration attribute, as suggested by the attribute information.
I have tried different class_weight values; I can increase recall up to 75.32%, but then my accuracy drops to 68%.
How can I increase recall in SVM models without decreasing accuracy so much?
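One SVM-only lever worth noting (my addition, not from the original post): instead of retraining with a new `class_weight` each time, you can fit once and sweep a threshold on `SVC.decision_function`, which traces out the recall/accuracy trade-off directly. A minimal sketch on synthetic stand-in data (the bank dataset and preprocessing are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced data standing in for the bank set (assumption)
X, y = make_classification(n_samples=2000, weights=[0.89, 0.11], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(gamma="auto").fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

# Lowering the threshold below 0 predicts the positive class more often,
# so recall can only go up while accuracy may drop
results = {}
for thr in [0.0, -0.25, -0.5]:
    pred = (scores > thr).astype(int)
    results[thr] = (accuracy_score(y_te, pred), recall_score(y_te, pred))
    print(thr, results[thr])
```

Picking the threshold on a validation set lets you choose the recall/accuracy point you want without refitting the SVM.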
Duplicating minority rows is what RandomOverSampler does, and oversampling did not help much here.
I quickly did a RandomUnderSampler run instead. The scores look good as a baseline to improve on.
I have not done anything yet for model improvement.
Code as it is from my Google Colab -
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip'
from urllib.request import urlretrieve
urlretrieve(url, "data.zip")
from zipfile import ZipFile
file_name = "/content/data.zip"
with ZipFile(file_name, 'r') as zf:
    zf.extractall()
import numpy as np,pandas as pd
data = pd.read_csv("/content/bank-additional/bank-additional-full.csv",delimiter=";")
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
y.value_counts()
X_cat = X.select_dtypes(include='object')
from sklearn.preprocessing import LabelEncoder
lbe = LabelEncoder()
for colname in X_cat.columns:
    X_cat[colname] = lbe.fit_transform(X_cat[colname])
    X[colname] = X_cat[colname]
y = lbe.fit_transform(y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.20,random_state=201, stratify=y)
from imblearn.under_sampling import RandomUnderSampler
rand = RandomUnderSampler(sampling_strategy=.6)
x_train, y_train = rand.fit_resample(x_train, y_train)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200,max_samples=0.05)
model.fit(x_train, y_train)
from sklearn.metrics import accuracy_score
y_pred_train = model.predict(x_train)
####Metrics on train
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(y_train, y_pred_train).ravel() # (55148, 1715, 8,90)
print("Training",fp/(tn+fp),fn/(fn+tp), accuracy_score(y_train, y_pred_train), tn, fp, fn, tp)
####Metrics on test
y_pred = model.predict(x_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Test",fp/(tn+fp),fn/(fn+tp), accuracy_score(y_test, y_pred), tn, fp, fn, tp)
from sklearn.metrics import recall_score
print("Test-recall",recall_score(y_test, y_pred))
Next -
- Try SMOTE and a combination of over- and under-sampling
- Work on feature engineering and dimensionality reduction
- Check other models
Answered by 10xAI on December 9, 2020