Data Science Asked on April 23, 2021
I am comfortable applying machine learning to my training and test data and validating the results.
But how do I make a prediction for a single new observation?
Suppose I have done feature selection using lasso, applied PCA, or done feature engineering such as transformation or scaling. When I want to use the model on new data, perhaps a single row, I get confused. Do I have to follow all the steps mentioned above even for that single row of data, and if so, how do I proceed? Please suggest.
You have to create what is known as a modeling pipeline; this includes creating so-called transformers.
These transformers apply all the necessary data transformations to your raw data to make the input "fit" for the trained model.
This might include things like one-hot encoding (OHE), etc. It is important to fit these transformers on your training set only, e.g. OHE only encodes values it knows from the training data set and treats everything else as unknown.
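A minimal sketch of that with scikit-learn's OneHotEncoder (the column name and values are illustrative, not from the question); here unseen categories are encoded as all zeros via handle_unknown:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "green", "red"]})   # training data only
new_row = pd.DataFrame({"color": ["blue"]})                # category unseen in training

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)                             # fit on the training set only
print(ohe.transform(new_row).toarray())    # [[0. 0.]] -> unseen value encoded as all zeros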
Some of your transformers are a bit more involved, though. If you have performed clustering or PCA, for example, and used the resulting dimensions in your model, you cannot simply repeat the same steps on new data. You have to create a suitable transformer, which varies depending on your steps.
For example, with a factor analysis you can usually extract a regression model that predicts the factors from the raw input. You then apply this regression model, fitted on the training data, to the new unseen data.
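A minimal sketch of that idea with scikit-learn's FactorAnalysis, whose transform method scores new observations on the factors learned from the training data (the data here is synthetic):

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.RandomState(0)
X_train = rng.rand(100, 5)            # training data only

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X_train)                       # factor loadings are estimated on the training set

new_row = rng.rand(1, 5)              # a single unseen observation, shape (1, n_features)
print(fa.transform(new_row))          # factor scores for the new row, shape (1, 2)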
Answered by Fnguyen on April 23, 2021
There are a few points one should keep in mind: always prepare the data properly first and bring it into a uniform shape by applying the appropriate transformation technique or other feature engineering methodology. From the problem in your question, I assume you intend to do iterative training. In that case, when new data arrives with new features, the parameters learned during the previous iteration will not help; you may have to handle such a scenario iteration by iteration, training new parameters (a new model) at each iteration. The technique suggested by @Fnguyen is good but not scalable in production.
If your new data comes with the same feature set, you can reuse the existing trained parameters of the PCA or other transformer. If the new data is for training, load the existing PCA, retrain it, transform the new data, and fit the model. Please give feedback and your point of view as a response.
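As a sketch of that reuse, assuming the fitted transformer is persisted with joblib (the file name is illustrative):

import joblib
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(100, 10)
pca = PCA(n_components=3).fit(X_train)
joblib.dump(pca, "pca.joblib")                # persist the fitted parameters

# Later, when new data with the same feature set arrives:
pca_loaded = joblib.load("pca.joblib")
X_new = np.random.rand(1, 10)                 # a single new row
X_new_reduced = pca_loaded.transform(X_new)   # reuse the fitted PCA without refitting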
Answered by Gaurav Koradiya on April 23, 2021
Building a model pipeline is the way to go once you are done experimenting with your model. The following is a toy example:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
# Generate a random dataset
X, y = make_classification(random_state=0)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Define a pipeline with a StandardScaler transformer followed by an SVC estimator
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# The pipeline can be used like any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)

# During prediction on a test set or on fresh data, just use the pipe object
# to run the whole pipeline and get predictions
pipe.score(X_test, y_test)
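Since the original question asks about a single row, note that the same fitted pipe handles that case directly; a minimal sketch reusing the objects above:

single_row = X_test[0].reshape(1, -1)   # a 1-D row reshaped to (1, n_features)
print(pipe.predict(single_row))         # scaling is applied automatically before the SVC predicts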
You can easily create a pipeline in Sklearn for almost every possible transform available in Sklearn. Check the user guide for more details.
Another example with PCA:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe
# Output: Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])
# Now you can use the pipeline's fit method to perform feature engineering and train
# the model, and its predict/score methods for predictions and evaluation
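As a sketch of that usage (reusing the synthetic X_train, y_train, and X_test from the first snippet; this part is not in the original answer):

pipe.fit(X_train, y_train)    # PCA is fitted on the training data, then the SVC is trained
print(pipe.predict(X_test[:1]))   # the same fitted PCA transforms the new row before the SVC predicts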
In Sklearn Pipelines, you can have either a single transformation path or multiple ones (combined together with a FeatureUnion), and the pipeline has to end with an estimator (only a single final estimator is allowed).
While all this is powerful enough for most purposes, sometimes there are custom transformations that you need for your data which are not part of the Sklearn API. In that case you can use Sklearn's base classes to define your own transformer, which can easily become part of the pipeline you define. More information is available in the scikit-learn documentation.
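As an illustration, a minimal sketch of such a custom transformer (the class name and the log transform are illustrative assumptions, not from the original answer):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    """Applies log(1 + x) to all features; illustrative only."""
    def fit(self, X, y=None):
        # Nothing is learned from the data, but fit must return self
        return self

    def transform(self, X):
        return np.log1p(X)

# It can now be used as a step in a Pipeline like any built-in transformer, e.g.
# Pipeline([('log', LogTransformer()), ('clf', SVC())])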
Answered by Akshay Sehgal on April 23, 2021