TransWikia.com

What's the difference between fit and fit_transform in scikit-learn models?

Data Science Asked by Kaggle on June 28, 2021

I’m a newbie to data science, and I do not understand the difference between the fit and fit_transform methods in scikit-learn. Can anybody explain simply why we might need to transform data?

What does it mean, fitting a model on training data and transforming to test data? Does it mean, for example, converting categorical variables into numbers in training and transforming the new feature set onto test data?

10 Answers

To standardize the data (make it have zero mean and unit variance), you subtract the mean and then divide the result by the standard deviation:

$$x' = \frac{x-\mu}{\sigma}$$

You do that on the training set of the data. But then you have to apply the same transformation to your test set (e.g. in cross-validation), or to newly obtained examples before forecasting, using the exact same two parameters $\mu$ and $\sigma$ (values) that you used for centering the training set.

Hence, the fit() method of every scikit-learn transformer just calculates the parameters (e.g. $\mu$ and $\sigma$ in the case of StandardScaler) and saves them as internal object state. Afterwards, you can call its transform() method to apply the transformation to any particular set of examples.

fit_transform() joins these two steps and is used for the initial fitting of parameters on the training set $x$, while also returning the transformed $x'$. Internally, the transformer object just calls fit() first and then transform() on the same data.
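As a concrete sketch of this (using StandardScaler with made-up numbers), the two-step and one-step routes give identical results on the training data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy training data

# Two steps: fit() learns the parameters, transform() applies them
scaler = StandardScaler()
scaler.fit(X_train)                # learns mean_ = 2.5, scale_ ≈ 1.118
X_a = scaler.transform(X_train)

# One step: fit_transform() fits and transforms the same data
X_b = StandardScaler().fit_transform(X_train)

print(np.allclose(X_a, X_b))  # True
```

The fitted scaler keeps $\mu$ and $\sigma$ in its mean_ and scale_ attributes, so the same object can later transform new data with the training-set parameters.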

Correct answer by K3---rnc on June 28, 2021

The following explanation is based on fit_transform of the Imputer class (replaced by SimpleImputer in recent scikit-learn versions), but the idea is the same for fit_transform of other scikit-learn classes like MinMaxScaler.


transform replaces the missing values with a number. By default this number is the mean of the corresponding column of the data you fitted on. Consider the following example:

import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')
# calculating the column means
imp.fit([[1,      3],
         [np.nan, 2],
         [8,      5.5]])

Now the imputer has learned to use the mean $\frac{1+8}{2} = 4.5$ for the first column and the mean $\frac{3+2+5.5}{3} = 3.5$ for the second column when it gets applied to two-column data:

X = [[np.nan, 11], 
     [4,      np.nan], 
     [8,      2],
     [np.nan, 1]]
print(imp.transform(X))

we get

[[4.5, 11], 
 [4, 3.5],
 [8, 2],
 [4.5, 1]]

So with fit the imputer calculates the column means from some data, and with transform it applies those means to some data (i.e. it replaces missing values with the means). If both are the same data (the data used for calculating the means and the data the means are applied to), you can use fit_transform, which is basically a fit followed by a transform.

Now your questions:

Why we might need to transform data?

"For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical" (source)

What does it mean fitting model on training data and transforming to test data?

The fit of an imputer has nothing to do with fit used in model fitting. So using imputer's fit on training data just calculates means of each column of training data. Using transform on test data then replaces missing values of test data with means that were calculated from training data.
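A minimal sketch of that train/test workflow, using SimpleImputer (the current replacement for the old Imputer) and made-up numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 3.0], [np.nan, 2.0], [8.0, 5.5]])
X_test = np.array([[np.nan, 11.0], [4.0, np.nan]])

imp = SimpleImputer(strategy='mean')
imp.fit(X_train)  # learns the training column means: 4.5 and 3.5

# Test-set NaNs are filled with the *training* means, not the test means
print(imp.transform(X_test))
```

Even if the test set had a very different distribution, its missing values would still be filled with the means learned from the training set.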

Answered by LoMaPh on June 28, 2021

In layman's terms, fit_transform means to do some calculation and then do transformation (say calculating the means of columns from some data and then replacing the missing values). So for training set, you need to both calculate and do transformation.

But for the test set, the model applies what it learned from the training set, so it doesn't need to calculate anything again; it just performs the transformation.

Answered by Ashish Anand on June 28, 2021

These methods are used for dataset transformations in scikit-learn:

Let us take an example for scaling values in a dataset:

Here the fit method, when applied to the training dataset, learns the model parameters (for example, mean and standard deviation). We then need to apply the transform method on the training dataset to get the transformed (scaled) training dataset. We could also perform both of these steps in one step by applying fit_transform on the training dataset.

Then why do we need 2 separate methods - fit and transform?

In practice, we need separate training and test datasets, and that is where having separate fit and transform methods helps. We apply fit on the training dataset and use the transform method on both the training dataset and the test dataset. Thus the training as well as the test dataset are transformed (scaled) using the model parameters that were learned by applying the fit method to the training dataset.

Example Code:

from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_train)  # learn mean and std from training data
scaler.transform(X_train)  # scale the training data
scaler.transform(X_test)   # scale the test data with the training parameters

Answered by Prasad Nageshkar on June 28, 2021

By applying transformations you bring your data onto a common scale. For example, if you have two variables $V_1$ and $V_2$ that both measure distance, but $V_1$ is in centimeters and $V_2$ is in kilometers, you have to convert them to the same units before comparing them. Transforming does something similar: it makes features comparable (and some transforms make the data closer to normally distributed).

Coming to the other question: you first build the model on the training set (the model learns the patterns or behavior of your data from the training set), and when you run the same model on the test set it tries to identify similar patterns, then draws its conclusions and gives results according to what it learned from the training data.

Answered by user66487 on June 28, 2021

This isn't a technical answer but, hopefully, it is helpful to build up our intuition:

Firstly, all estimators are trained (or "fit") on some training data. That part is fairly straightforward.

Secondly, all of the scikit-learn estimators can be used in a pipeline and the idea with a pipeline is that data flows through the pipeline. Once fit at a particular level in the pipeline, data is passed on to the next stage in the pipeline but obviously the data needs to be changed (transformed) in some way; otherwise, you wouldn't need that stage in the pipeline at all. So, transform is a way of transforming the data to meet the needs of the next stage in the pipeline.

If you're not using a pipeline, I still think it's helpful to think about these machine learning tools in this way because, even the simplest classifier is still performing a classification function. It takes as input some data and produces an output. This is a pipeline too; just a very simple one.

In summary, fit performs the training, transform changes the data in the pipeline in order to pass it on to the next stage in the pipeline, and fit_transform does both the fitting and the transforming in one possibly optimized step.
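As a sketch of that flow (the dataset and estimator choices here are just placeholders): inside a scikit-learn Pipeline, fit() calls fit_transform() on every intermediate stage and fit() on the final one, while predict()/score() call only transform() on the intermediate stages.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),     # intermediate stage: fit_transform during fit
    ('clf', LogisticRegression()),   # final stage: fit only
])

pipe.fit(X_train, y_train)           # scaler is fit on the training data only
print(pipe.score(X_test, y_test))    # scaler only transforms the test data
```

This is why intermediate pipeline steps must implement both fit and transform: they have to pass changed data on to the next stage.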

Answered by Eric McLachlan on June 28, 2021

Consider a task that requires us to normalize the data. For example, we may use a min-max normalization or z-score normalization. There are some inherent parameters in the model. The minimum and maximum values in min-max normalization and the mean and standard deviation in z-score normalization. The fit() function calculates the values of these parameters.


The transform function applies the values of the parameters on the actual data and gives the normalized value.


The fit_transform() function performs both in the same step.


Note that the same values are obtained whether we perform the operation in two steps or in a single step.
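A quick sketch with MinMaxScaler (made-up numbers) confirming that the two-step and one-step routes agree:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [40.0]])

# Two steps: fit() learns data_min_ = 10 and data_max_ = 40
scaler = MinMaxScaler()
scaler.fit(X)
X_two_step = scaler.transform(X)

# One step
X_one_step = MinMaxScaler().fit_transform(X)

print(np.allclose(X_two_step, X_one_step))  # True
```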

Answered by Lovelyn David on June 28, 2021

fit computes the mean and std to be used for later scaling (just a computation); nothing is returned to you.

transform uses the previously computed mean and std to scale the data (subtract the mean from all values and then divide by the std).

fit_transform does both at the same time, so you can do it with 1 line of code instead of 2.

Now let's look at it in practice:

For the training set X, we do fit_transform because we need to compute the mean and std and then use them to scale the data. For the test set, well, we already have the mean and std, so we only do the transform part.

It's super simple. You are doing great. Keep up your good work my friend :-)

Answered by Salman Tabatabai on June 28, 2021

You don't want your model to learn anything from the test dataset; you just want to apply what was learned from the training dataset. So we only apply the transform operation on the test dataset and the fit_transform operation on the training dataset.

Answered by Sohil Grandhi on June 28, 2021

fit, transform, and fit_transform: keeping the explanation as simple as possible.

When we have two arrays with different elements, we use fit and transform separately: we fit array 1 based on its internal statistics (for SimpleImputer with the mean strategy, the column means). If we fit on array 1 and then transform array 2, the means of array 1 are applied to array 2. In simple words, we transform one array based on the statistics learned from another array.

Showing you with code;

import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

temperature = [32., np.nan, 28., np.nan, 32., np.nan, np.nan, 34., 40.]
windspeed   = [ 6.,  9., np.nan,  7., np.nan, np.nan, np.nan,  8., 12.]

n_arr_1 = np.array(temperature).reshape(3, 3)
print('temperature:\n', n_arr_1)
n_arr_2 = np.array(windspeed).reshape(3, 3)
print('windspeed:\n', n_arr_2)

Output:

temperature:
 [[32. nan 28.]
 [nan 32. nan]
 [nan 34. 40.]]
windspeed:
 [[ 6.  9. nan]
 [ 7. nan nan]
 [nan  8. 12.]]

Fit and transform separately, transforming array 2 using the column means fitted on array 1:

imp.fit(n_arr_1)
imp.transform(n_arr_2)

Output

Check the output below and compare it with the previous two outputs to see the difference. Basically, the imputer takes the mean of every column of array 1 and fills the missing values of array 2 in the corresponding columns.

array([[ 6.,  9., 34.],
       [ 7., 33., 34.],
       [32.,  8., 12.]])

This is what we do when we want to transform one array based on another array. But when we have a single array and want to transform it based on its own mean, we use fit_transform.

See below;

imp.fit_transform(n_arr_2)

Output

array([[ 6. ,  9. , 12. ],
       [ 7. ,  8.5, 12. ],
       [ 6.5,  8. , 12. ]])

Alternatively, the above can be done in two steps:

imp.fit(n_arr_2)
imp.transform(n_arr_2)

Output

array([[ 6. ,  9. , 12. ],
       [ 7. ,  8.5, 12. ],
       [ 6.5,  8. , 12. ]])

Why fit and transform the same array separately when it takes two lines of code? Instead, we can use fit_transform, which fits and transforms the same array in one line. That's the difference between fit plus transform and fit_transform.


Answered by Tariq Hussain on June 28, 2021
