
How do I correctly build a model on given data to predict a target parameter?

Data Science Asked by Jhon Patric on December 12, 2020

I have a dataset which contains different parameters, and data.head() looks like this:

[screenshot of data.head() output]

I applied some preprocessing and performed feature ranking:

import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

dataset = pd.read_csv("ML.csv", header=0)

# Get a brief overview of the dataset
print(dataset.shape)
print(dataset.isnull().sum())
#print(dataset.head())

# Data pre-processing
data = dataset.drop('organization_id', axis=1)
data = data.drop('status', axis=1)
data = data.drop('city', axis=1)

# Find the median of each feature that has NaNs
median_zip, median_role_id, median_specialty_id, median_latitude, median_longitude = (
    data['zip'].median(),
    data['role_id'].median(),
    data['specialty_id'].median(),
    data['latitude'].median(),
    data['longitude'].median(),
)

data['zip'].fillna(median_zip, inplace=True)
data['role_id'].fillna(median_role_id, inplace=True)
data['specialty_id'].fillna(median_specialty_id, inplace=True)
data['latitude'].fillna(median_latitude, inplace=True)
data['longitude'].fillna(median_longitude, inplace=True)

#Fill YearOFExp with 0
data['years_of_experience'].fillna(0, inplace=True)
target = dataset.location_id

# Perform Recursive Feature Elimination
svm = LinearSVC()
rfe = RFE(svm, n_features_to_select=1)
# This raises a ConvergenceWarning. When an optimization algorithm does not
# converge, it is usually because the problem is not well-conditioned,
# perhaps due to poor scaling of the decision variables.
rfe = rfe.fit(data, target)

names = list(data)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))

Output

Features sorted by their score:

[(1, 'location_id'), (2, 'department_id'), (3, 'latitude'), (4, 'specialty_id'), (5, 'longitude'), (6, 'zip'), (7, 'shift_id'), (8, 'user_id'), (9, 'role_id'), (10, 'open_positions'), (11, 'years_of_experience')]
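
As a side note, the ConvergenceWarning mentioned in the code comment above can usually be addressed by scaling the features before fitting the linear SVM. A minimal sketch, reusing the `data` and `target` variables from the code above:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

# Standardise the features so the LinearSVC optimiser converges more easily,
# then repeat the same RFE ranking on the scaled data.
scaled = pd.DataFrame(StandardScaler().fit_transform(data), columns=data.columns)
rfe_scaled = RFE(LinearSVC(max_iter=10000), n_features_to_select=1).fit(scaled, target)
print(sorted(zip(rfe_scaled.ranking_, data.columns)))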

From this I understand which parameters are more important.
Is the above processing correct for understanding feature importance? How can I use this information to train a better model?

When I do model training it gives very high accuracy. How come the accuracy is so high?

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dataset = pd.read_csv("prod_data_for_ML.csv", header=0)

# Data pre-processing
data = dataset.drop('location_id', axis=1)
data = data.drop('status', axis=1)
data = data.drop('city', axis=1)

# Find the median of each feature that has NaNs
median_zip, median_role_id, median_specialty_id, median_latitude, median_longitude = (
    data['zip'].median(),
    data['role_id'].median(),
    data['specialty_id'].median(),
    data['latitude'].median(),
    data['longitude'].median(),
)

data['zip'].fillna(median_zip, inplace=True)
data['role_id'].fillna(median_role_id, inplace=True)
data['specialty_id'].fillna(median_specialty_id, inplace=True)
data['latitude'].fillna(median_latitude, inplace=True)
data['longitude'].fillna(median_longitude, inplace=True)

#Fill YearOFExp with 0
data['years_of_experience'].fillna(0, inplace=True)

# Start training

labels = dataset.location_id
train1 = data
algo = LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.20, random_state=1)

# x_train.to_csv("x_train.csv", sep=',', encoding='utf-8')
# x_test.to_csv("x_test.csv", sep=',', encoding='utf-8')

algo.fit(x_train,y_train)
algo.score(x_test,y_test)

Output: 0.981150074104111

from sklearn import ensemble
clf = ensemble.GradientBoostingRegressor(n_estimators = 400,
                                         max_depth = 5,
                                         min_samples_split = 2,
                                         learning_rate = 0.1,
                                         loss = 'ls')
clf.fit(x_train, y_train)
clf.score(x_test,y_test)

Output: 0.99

What I want to do is predict location_id correctly.
Am I doing anything wrong? What is the correct way to build a model for this sort of situation?

I know there is some way that I can get precision, recall, and F1 for each parameter. Can anyone give me a reference link for how to do this?
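
(Assuming the task is framed as classification over location_id, scikit-learn's classification_report prints per-class precision, recall, and F1. A rough sketch, reusing x_train/x_test/y_train/y_test from above and a RandomForestClassifier purely for illustration:)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Treat location_id as a class label and report per-class precision,
# recall, and F1 on the held-out split.
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(x_train, y_train)
print(classification_report(y_test, clf.predict(x_test)))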

2 Answers

The training data includes latitude, longitude, and zip code, and the output variable is the location.

It is trivial to predict the location if zip and lat/lon are known. Try removing these attributes and see whether that has any impact on the validation score.
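
For example, something along these lines would show how much of the score depends on those columns (a sketch, assuming the variable names from the question):

# Drop the attributes that trivially encode the location and re-check the score.
leaky_cols = ['zip', 'latitude', 'longitude']
x_train_reduced = x_train.drop(columns=leaky_cols)
x_test_reduced = x_test.drop(columns=leaky_cols)

clf.fit(x_train_reduced, y_train)
print(clf.score(x_test_reduced, y_test))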

Answered by Shamit Verma on December 12, 2020

Feature ranking

You still have location_id as a feature when you're trying to predict location_id. So of course that comes out as the "most important," and the other features' importance scores are probably mostly meaningless.

After fixing that, the feature ranking gives you some valuable insight into the problem, and depending on your needs you might drop low-performing variables, etc.
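
A quick sketch of that fix, reusing the `data` and `dataset` frames from the question (the point is just that the target column must not be in the feature matrix before ranking):

from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

# Drop the target from the features before running RFE.
features = data.drop('location_id', axis=1)
target = dataset['location_id']

rfe = RFE(LinearSVC(max_iter=10000), n_features_to_select=1).fit(features, target)
print(sorted(zip(rfe.ranking_, features.columns)))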

High performance

(I don't think you're actually computing accuracy in either case.) It is extremely surprising to me that a LinearRegression model does so well; most of your variables seem categorical, even the dependent location_id. Unless there's something predictive in the way the ids are actually assigned? How many unique values does location_id have?
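
(A quick way to check that, as a sketch on the dataset from the question:)

# How many distinct location_id values are there, and how are they distributed?
print(dataset['location_id'].nunique())
print(dataset['location_id'].value_counts().head(10))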

Is the location_id the location of the user, or the job (assuming I've gotten the context right)? In either case, if you have many copies of the user/job and happen to split them across the training and test sets, then you may just be leaking information that way: the model learns the mapping user(/job)->location, and happens to be able to apply that to nearly every row in the test set. (That still doesn't make much sense for LinearRegression, but could in the GBM.) This is pretty similar to what @ShamitVerma has said, but doesn't rely on an interpretable mapping, just that the train/test split doesn't properly separate users/jobs.
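
If that is the issue, a group-aware split is the usual remedy. A minimal sketch using GroupShuffleSplit, assuming `data` and `labels` are as in the question and `data` still contains a user_id column (as in the ranking output):

from sklearn.model_selection import GroupShuffleSplit

# Keep all rows of a given user on the same side of the split, so the model
# cannot simply memorise the user -> location mapping.
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=1)
train_idx, test_idx = next(gss.split(data, labels, groups=data['user_id']))

x_train, x_test = data.iloc[train_idx], data.iloc[test_idx]
y_train, y_test = labels.iloc[train_idx], labels.iloc[test_idx]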

Answered by Ben Reiniger on December 12, 2020
