Data Science Asked by Kaggle on June 28, 2021
I’m a newbie to data science, and I do not understand the difference between the fit
and fit_transform
methods in scikit-learn. Can anybody explain simply why we might need to transform data?
What does it mean, fitting a model on training data and transforming to test data? Does it mean, for example, converting categorical variables into numbers in training and transforming the new feature set onto test data?
To center the data (make it have zero mean and unit standard error), you subtract the mean and then divide the result by the standard deviation:
$$x' = frac{x-mu}{sigma}$$
You do that on the training set of the data. But then you have to apply the same transformation to your test set (e.g. in cross-validation), or to newly obtained examples before forecasting. But you have to use the exact same two parameters $mu$ and $sigma$ (values) that you used for centering the training set.
Hence, every scikit-learn's transform's fit()
just calculates the parameters (e.g. $mu$ and $sigma$ in case of StandardScaler) and saves them as an internal object's state. Afterwards, you can call its transform()
method to apply the transformation to any particular set of examples.
fit_transform()
joins these two steps and is used for the initial fitting of parameters on the training set $x$, while also returning the transformed $x'$. Internally, the transformer object just calls first fit()
and then transform()
on the same data.
Correct answer by K3---rnc on June 28, 2021
The following explanation is based on fit_transform
of Imputer
class, but the idea is the same for fit_transform
of other scikit_learn classes like MinMaxScaler
.
transform
replaces the missing values with a number. By default this number is the means of columns of some data that you choose.
Consider the following example:
imp = Imputer()
# calculating the means
imp.fit([
[1, 3],
[np.nan, 2],
[8, 5.5]
])
Now the imputer have learned to use a mean ${{(1+8)}over {2}} = 4.5$ for the first column and mean ${{(2+3+5.5)}over {3}} = 3.5$ for the second column when it gets applied to a two-column data:
X = [[np.nan, 11],
[4, np.nan],
[8, 2],
[np.nan, 1]]
print(imp.transform(X))
we get
[[4.5, 11],
[4, 3.5],
[8, 2],
[4.5, 1]]
So by fit
the imputer calculates the means of columns from some data, and by transform
it applies those means to some data (which is just replacing missing values with the means). If both these data are the same (i.e. the data for calculating the means and the data that means are applied to) you can use fit_transform
which is basically a fit
followed by a transform
.
Now your questions:
Why we might need to transform data?
"For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical" (source)
What does it mean fitting model on training data and transforming to test data?
The fit
of an imputer has nothing to do with fit
used in model fitting.
So using imputer's fit
on training data just calculates means of each column of training data. Using transform
on test data then replaces missing values of test data with means that were calculated from training data.
Answered by LoMaPh on June 28, 2021
In layman's terms, fit_transform
means to do some calculation and then do transformation (say calculating the means of columns from some data and then replacing the missing values).
So for training set, you need to both calculate and do transformation.
But for testing set, machine learning applies prediction based on what was learned during the training set and so it doesn't need to calculate, it just performs the transformation.
Answered by Ashish Anand on June 28, 2021
These methods are used for dataset transformations in scikit-learn:
Let us take an example for scaling values in a dataset:
Here the fit
method, when applied to the training dataset, learns the model parameters (for example, mean and standard deviation). We then need to apply the transform
method on the training dataset to get the transformed (scaled) training dataset. We could also perform both of these steps in one step by applying fit_transform
on the training dataset.
Then why do we need 2 separate methods - fit
and transform
?
In practice, we need to have separate training and testing dataset and that is where having a separate fit
and transform
method helps. We apply fit
on the training dataset and use the transform
method on both - the training dataset and the test dataset. Thus the training, as well as the test dataset, are then transformed(scaled) using the model parameters that were learned on applying the fit
method to the training dataset.
Example Code:
scaler = preprocessing.StandardScaler().fit(X_train)
scaler.transform(X_train)
scaler.transform(X_test)
Answered by Prasad Nageshkar on June 28, 2021
By applying the transformations you are trying to make your data behave normally. For example, if you have two variables $V_1$ and $V_2$ both measure the distances but $V_1$ has centimeters as the units and $V_2$ has kilometers as the units so in order to compare these two you have to convert them to same units...just like that transforming is making similar behavior or making it behave like a normal distribution.
Coming to the other question you first build the model in training set that is (the model learns the patterns or behavior of your data from the training set) and when you run the same model in the test set it tries to identify the similar patterns or behaviors once it identifies it makes its conclusions and gives results according to the training data.
Answered by user66487 on June 28, 2021
This isn't a technical answer but, hopefully, it is helpful to build up our intuition:
Firstly, all estimators are trained (or "fit") on some training data. That part is fairly straightforward.
Secondly, all of the scikit-learn estimators can be used in a pipeline and the idea with a pipeline is that data flows through the pipeline. Once fit at a particular level in the pipeline, data is passed on to the next stage in the pipeline but obviously the data needs to be changed (transformed) in some way; otherwise, you wouldn't need that stage in the pipeline at all. So, transform is a way of transforming the data to meet the needs of the next stage in the pipeline.
If you're not using a pipeline, I still think it's helpful to think about these machine learning tools in this way because, even the simplest classifier is still performing a classification function. It takes as input some data and produces an output. This is a pipeline too; just a very simple one.
In summary, fit
performs the training, transform
changes the data in the pipeline in order to pass it on to the next stage in the pipeline, and fit_transform
does both the fitting and the transforming in one possibly optimized step.
Answered by Eric McLachlan on June 28, 2021
Consider a task that requires us to normalize the data. For example, we may use a min-max normalization or z-score normalization. There are some inherent parameters in the model. The minimum and maximum values in min-max normalization and the mean and standard deviation in z-score normalization.
The fit()
function calculates the values of these parameters.
The transform function applies the values of the parameters on the actual data and gives the normalized value.
The fit_transform()
function performs both in the same step.
Note that the same value is got whether we perform in 2 steps or in a single step.
Answered by Lovelyn David on June 28, 2021
fit
computes the mean and std to be used for later scaling. (jsut a computation), nothing is given to you.
transform
uses a previously computed mean and std to autoscale the data (subtract mean from all values and then divide it by std).
fit_transform
does both at the same time. So you can do it with 1 line of code instead of 2.
Now let's look at it in practice:
For X training set, we do fit_transform
because we need to compute mean and std, and then use it to autoscale the data. For X test set, well, we already have the mean and std, so we only do the transform
part.
It's super simple. You are doing great. Keep up your good work my friend :-)
Answered by Salman Tabatabai on June 28, 2021
You don't want your model to learn anything from the test dataset. You just want to apply the learnings from your trained dataset. So, we only apply transform operation on test dataset and fit_transform
operation on the train dataset.
Answered by Sohil Grandhi on June 28, 2021
fit, transform, and fit_transform. keeping the explanation so simple.
When we have two Arrays with different elements we use 'fit' and transform separately, we fit 'array 1' base on its internal function such as in MinMaxScaler (internal function is to find mean and standard deviation). For example, if we fit 'array 1' based on its mean and transform array 2, then the mean of array 1 will be applied to array 2 which we transformed. In simple words, we transform one array on the basic internal functions of another array.
Showing you with code;
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
temperature = [32., np.nan, 28., np.nan, 32., np.nan, np.nan, 34., 40.]
windspeed = [ 6., 9., np.nan, 7., np.nan, np.nan, np.nan, 8., 12.]
n_arr_1 = np.array(temperature).reshape(3,3)
print('temperature:n',n_arr_1)
n_arr_2 = np.array(windspeed).reshape(3,3)
print('windspeed:n',n_arr_2)
Output:
temperature:
[[32. nan 28.]
[nan 32. nan]
[nan 34. 40.]]
windspeed:
[[ 6. 9. nan]
[ 7. nan nan]
[nan 8. 12.]]
fit and transform separately, transforming array 2 for fitted (based on mean) array 1;
imp.fit(n_arr_1)
imp.transform(n_arr_2)
Output
Check the output below, observe the output based on the previous two outputs you will see the difference. Basically, on Array 1 it is taking the mean of every column and fitting in array 2 according to its column where ever missing value is missed.
array([[ 6., 9., 34.],
[ 7., 33., 34.],
[32., 8., 12.]])
This is what we doing when we want to transform one array based on another array. but when we have a single array and we want to transform it based on its own mean. In this condition, we use fit_transform together.
See below;
imp.fit_transform(n_arr_2)
Output
array([[ 6. , 9. , 12. ],
[ 7. , 8.5, 12. ],
[ 6.5, 8. , 12. ]])
(Above) Alternativily we doing:
imp.fit(n_arr_2)
imp.transform(n_arr_2)
Output
array([[ 6. , 9. , 12. ],
[ 7. , 8.5, 12. ],
[ 6.5, 8. , 12. ]])
Why we fitting and transforming the same array separately, it takes two line code, why don't we use simple fit_transform
which can fit and transform the same array in one line code. That's what the difference is between fit and transform and fit_transform
.
Check this Google Colab link, you can run it by yourself, and can understand it well.
Answered by Tariq Hussain on June 28, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP