Wednesday, May 4, 2022

Feature engineering: Imputing missing values in a cell, based on subgroups of other columns

Imputing missing values in a specific cell can be tricky. I had to do this while working on a competition on Kaggle.

The missing values all belonged to one house, and one of them was a categorical variable (see the last one in the picture: 'GarageFinish').

The cells that needed imputing described the garage features of the house. Here's what I did, in brief:

I thought the neighborhood, home type and house style features could be good predictors of the garage features. So I filtered the data down to houses sharing these values and imputed their mean, median or mode, depending on the feature.

As I said above, the interesting part was the 'GarageFinish' feature, which is categorical. The approach I found was to impute the mode (the most frequent category) by chaining value_counts() and index[0].

You can understand it better by checking the code below:
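What follows is a minimal sketch of the idea on a toy DataFrame, not the original code. The column names (Neighborhood, BldgType, HouseStyle, GarageArea, GarageCars, GarageFinish) are assumed from the Kaggle House Prices dataset, and the choice of mean for the area and median for the car count is only illustrative.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "Neighborhood": ["OldTown", "OldTown", "OldTown", "OldTown", "NAmes"],
    "BldgType":     ["1Fam", "1Fam", "1Fam", "1Fam", "1Fam"],
    "HouseStyle":   ["2Story", "2Story", "2Story", "2Story", "1Story"],
    "GarageArea":   [400.0, 440.0, 480.0, np.nan, 300.0],
    "GarageCars":   [2.0, 2.0, 1.0, np.nan, 1.0],
    "GarageFinish": ["Unf", "Unf", "RFn", np.nan, "Fin"],
})

target = 3  # index label of the house with missing garage values
row = df.loc[target]

# Keep only houses that share the predictor values of the target house.
similar = df[
    (df["Neighborhood"] == row["Neighborhood"])
    & (df["BldgType"] == row["BldgType"])
    & (df["HouseStyle"] == row["HouseStyle"])
]

# Numeric features: mean or median of the subgroup (NaN is skipped).
df.loc[target, "GarageArea"] = similar["GarageArea"].mean()
df.loc[target, "GarageCars"] = similar["GarageCars"].median()

# Categorical feature: value_counts() sorts by frequency, so index[0]
# is the most frequent category, i.e. the mode.
df.loc[target, "GarageFinish"] = similar["GarageFinish"].value_counts().index[0]
```

The target house itself passes the filter, but its missing values are ignored by mean(), median() and value_counts(), so it does not affect the result.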


That said, I could also have used the house's build and renovation years to predict the garage's build year.

For another, similar cell, I filtered on more features and made an educated guess from the other observations, plus the build year of the house. The picture below shows that one. The first and last rows show the data before and after running the code. The first filter contains an or condition because the building type alone didn't produce enough observations, so I added another similar type.
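A filter of this kind might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the column names, the 'Duplex' and '2fmCon' building types, the 10-year window around the build year, and the use of the median.

```python
import numpy as np
import pandas as pd

# Toy data; all names and values are hypothetical.
df = pd.DataFrame({
    "Neighborhood": ["Edwards"] * 6,
    "BldgType":     ["Duplex", "Duplex", "2fmCon", "2fmCon", "1Fam", "Duplex"],
    "YearBuilt":    [1960, 1965, 1958, 1990, 1962, 1963],
    "GarageArea":   [480.0, 520.0, 500.0, 700.0, 300.0, np.nan],
})

target = 5  # index label of the house with the missing value
row = df.loc[target]

mask = (
    # The "or": accept the house's own building type or a similar one,
    # because the own type alone gives too few observations.
    ((df["BldgType"] == row["BldgType"]) | (df["BldgType"] == "2fmCon"))
    & (df["Neighborhood"] == row["Neighborhood"])
    # Only houses built within 10 years of the target house.
    & df["YearBuilt"].between(row["YearBuilt"] - 10, row["YearBuilt"] + 10)
)
similar = df[mask]

df.loc[target, "GarageArea"] = similar["GarageArea"].median()
```

The parentheses around the or condition matter: in pandas, & binds more tightly than |, so without them the neighborhood and year conditions would apply only to the second building type.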


