TransWikia.com

Preprocessing dataset to predict salary

Data Science Asked on June 7, 2021

I’m currently a student in a machine learning course studying for an upcoming exam. Here’s a question I’ve been given for practice:

You have a very large dataset of employees and you’d like to investigate the drivers of employee salaries. Each instance of data contains the following information:

  1. The occupation of the employee (e.g. data scientist or sales representative). Assume that there are thousands of distinct occupations.

  2. Tenure (number of years the employee has been working at the company for)

  3. Company size (number of employees at the company)

  4. Years of experience (Number of years the employee has been in the workforce)

  5. Title (assume there are 6 levels in every company; analyst, associate, manager, etc).

Now you want to build a linear model to incorporate these features to predict salary. The question asks me which of the following are good steps to take to build such a model:

  1. Since occupations are categorial, create dummy variables for each occupation and use a one-hot encoding.

  2. Convert title into an ordinal variable by assigning more senior titles larger values (i.e. map analyst to 1, associate to 2, etc).

  3. Remove the "tenure" feature since it exhibits multicollinearity with the "years of experience" feature.

  4. Create and name clusters of occupation types (e.g. data scientist and data engineer go together), then create dummy variables for each cluster and perform a one-hot encoding.


I think that $1$ is a bad idea since you’ll increase the complexity (number of features) by a lot. The problem statement specifies that there are thousands of occupations. I think that $2$ is a good idea since it allows us to easily work with the rankings. I’m not sure about $3$. I don’t have much experience with multicollinearity, but I think it sounds reasonable to remove the tenure feature, so I think it’s a good idea. Finally, I don’t really know about $4$, but I’d guess that it’s also a good idea.

Can someone please give me some insight into this problem?

Thanks

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP