How can I count the number of occurrences of a category in a dataset as part of an Sklearn Pipeline?

Question

Let us say we have a dataset with a feature such as Surname.

arr['Surname'] = ['Smith', 'Jones', 'Johnson', 'Smith']

And I want to encode this categorical info as a new feature like

arr['Surname_Count'] = [2, 1, 1, 2]

with the caveat that it is done within an Sklearn pipeline. Are there quick ways to do this that do not involve rolling my own partition counting transformer?

Alexander Wang · Answer

You can check out Featuretools, which is an open-source Python framework for automated feature engineering. For your case specifically, it can generate aggregation features such as count for your dataset.

After generating the new feature matrix with the desired column, you can use the matrix as you normally would within an Sklearn pipeline.
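
If pulling in another library is more than you need, here is a minimal sketch of the same count feature using plain pandas wrapped in scikit-learn's FunctionTransformer. This is not Featuretools, just an alternative. It assumes the input is a pandas DataFrame with a Surname column, and the helper name add_surname_count is made up for illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_surname_count(df):
    """Return a copy of df with a Surname_Count column (hypothetical helper)."""
    out = df.copy()
    # For each row, count how many rows share the same Surname value.
    out["Surname_Count"] = out.groupby("Surname")["Surname"].transform("count")
    return out


# FunctionTransformer is stateless: the counts are recomputed on whatever
# data is passed in, they are not learned from the training set.
pipeline = Pipeline([
    ("surname_count", FunctionTransformer(add_surname_count)),
])

arr = pd.DataFrame({"Surname": ["Smith", "Jones", "Johnson", "Smith"]})
result = pipeline.fit_transform(arr)
```

Note that because the step is stateless, counts on a test set come from the test set itself. If you need the counts from the training data applied to new data, you would have to store them in fit, which does mean a small custom transformer.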
