31 Mart 2022 Perşembe

Feature engineering: Imputing missing values in a column, based on subgroups of another column

 Suppose that we have null values in a column and we need to fill them. Filling all null values by simple means or medians may not be the best solution.

Why?

Suppose that we have a car database and a "Number of Seats" column. This column may have a median of 5 and a mean close to 5. However null values of "Mercedes SLK"s (a roadster with 2 seats) and "Cadillac Escalade"s (7 seater) should be treated differently. So we need to group by the car models (or car types such as roadster, suv, mid-size, etc.) and assign values to the null cells by finding the median or mean of these groups.

Assuming that we are using "car type" subgroup, to find the "median" of the subgroups and impute them into the null values, the code for such feature engineering may be as follows:

1st option:

data['Number_of_Seats'].fillna(data.groupby('Car_Type')['Number_of_Seats'].transform("mean")

2nd option: (there are two more options at the end of the post. I tried them with some errors and but didn't end up using them.)

# 1st step
median_seats = data.groupby('Car_Type').median()['Number_of_Seats']
#2nd step
data.loc[pd.isna(data['Number_of_Seats']),'Number_of_Seats'] = median_seats[data.loc[pd.isna(data['Number_of_Seats']),'Car_Type']].values

A good approach to choose between median and mean may be making copies of the workbook before filling the null values, filling the null values and after running the method (regression, decision tree, etc.),  trying other methods on other copied notebooks to fill the null values.

A subgroup may not include a value for our column at all. For example we have 5 Mercedes SLKs and none of them has Number_of_Seats value. In that case, we may try another relevant column (as Car_Type above). The easy solution of taking the column mean/median should be the last resort. If the dataset is big enough, dropping those rows may be a better solution.

Other code options:
3rd option:
data['Number_of_Seats'].fillna(data.Number_of_Seats.median(Car_Type), inplace=True)
4th option to work on:
df.groupby(["A","B"]).Z.median()


Hiç yorum yok:

Yorum Gönder