Suppose we have null values in a column and need to fill them. Filling every null value with the overall mean or median of the column may not be the best solution.
Why?
Suppose that we have a car database with a "Number of Seats" column. This column may have a median of 5 and a mean close to 5. However, the null values for a "Mercedes SLK" (a roadster with 2 seats) and a "Cadillac Escalade" (a 7-seater) should be treated differently. So we need to group by car model (or by car type, such as roadster, SUV, mid-size, etc.) and fill the null cells with the median or mean of each group.
Assuming that we group by "car type", the code to find the median of each subgroup and impute it into the null values may be as follows:
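To make the point concrete, here is a small sketch comparing the overall median with the per-group medians. The column names follow this post, but the rows are made-up toy data, not a real car database.

```python
import pandas as pd

# Hypothetical toy data: column names follow the post, values are invented.
data = pd.DataFrame({
    "Car_Type": ["roadster", "roadster", "suv", "suv",
                 "mid-size", "mid-size", "mid-size"],
    "Number_of_Seats": [2, None, 7, None, 5, 5, None],
})

# One median for the whole column (nulls are skipped by default).
global_median = data["Number_of_Seats"].median()

# One median per car type.
group_medians = data.groupby("Car_Type")["Number_of_Seats"].median()

print(global_median)
print(group_medians)
```

With this toy data the overall median is 5, which would be a poor guess for both the roadster and the SUV, while the per-type medians are 2, 7 and 5.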
1st option:
data['Number_of_Seats'] = data['Number_of_Seats'].fillna(data.groupby('Car_Type')['Number_of_Seats'].transform('median'))
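A minimal end-to-end sketch of this transform-based option, using the median as described above. The DataFrame is made-up toy data; only the column names come from this post.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
data = pd.DataFrame({
    "Car_Type": ["roadster", "roadster", "suv", "suv",
                 "mid-size", "mid-size", "mid-size"],
    "Number_of_Seats": [2, None, 7, None, 5, 5, None],
})

# transform("median") returns a Series aligned with the original rows,
# where every row holds the median of its own Car_Type group.
group_median = data.groupby("Car_Type")["Number_of_Seats"].transform("median")

# fillna only touches the null cells; assign the result back to the column.
data["Number_of_Seats"] = data["Number_of_Seats"].fillna(group_median)

print(data)
```

If a car type has no known seat counts at all, its group median is itself null, so those rows stay null and may need a second fallback such as the overall median.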
2nd option: (There are two more options at the end of the post. I tried them, got some errors, and didn't end up using them.)
median_seats = data.groupby('Car_Type')['Number_of_Seats'].median()  # median seats per car type
data.loc[pd.isna(data['Number_of_Seats']),'Number_of_Seats'] = median_seats[data.loc[pd.isna(data['Number_of_Seats']),'Car_Type']].values
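The lookup-based option can be easier to follow when split into steps. The sketch below assumes that median_seats is a Series of per-type medians indexed by Car_Type; the data is again a made-up toy example.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
data = pd.DataFrame({
    "Car_Type": ["roadster", "roadster", "suv", "suv", "mid-size", "mid-size"],
    "Number_of_Seats": [2, None, 7, None, 5, None],
})

# Assumed definition: median number of seats per car type, indexed by Car_Type.
median_seats = data.groupby("Car_Type")["Number_of_Seats"].median()

# Boolean mask marking the rows whose seat count is missing.
missing = data["Number_of_Seats"].isna()

# Look up the median for the car type of each missing row.
# to_numpy() drops the Car_Type index so values are assigned by position.
lookup = median_seats.loc[data.loc[missing, "Car_Type"]].to_numpy()
data.loc[missing, "Number_of_Seats"] = lookup

print(data)
```

Compared with the first option, this computes the group medians once and keeps them around, which is handy if the same table of medians is needed elsewhere.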
3rd option:
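# As first written, data.Number_of_Seats.median(Car_Type) fails: Car_Type is an undefined name and median() does not take a grouping column.
data['Number_of_Seats'] = data.groupby('Car_Type')['Number_of_Seats'].transform(lambda s: s.fillna(s.median()))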
4th option to work on (grouping by more than one column; "A", "B" and "Z" are placeholder column names):
df.groupby(["A","B"]).Z.median()
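As a rough sketch of where this last idea could lead: grouping by two columns gives one median per (A, B) pair, and the same transform trick from the first option can then fill the nulls. Here A, B and Z are placeholder names and the values are invented.

```python
import pandas as pd

# Hypothetical toy data: A and B are grouping columns, Z has missing values.
df = pd.DataFrame({
    "A": ["m", "m", "m", "c", "c", "c"],
    "B": ["r", "r", "s", "s", "s", "s"],
    "Z": [2, None, 5, 7, None, 7],
})

# One median per (A, B) combination, returned with a two-level index.
medians = df.groupby(["A", "B"])["Z"].median()

# Broadcast each group's median back to its rows and fill the nulls.
df["Z"] = df["Z"].fillna(df.groupby(["A", "B"])["Z"].transform("median"))

print(medians)
print(df)
```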