Monday, May 9, 2022

Creating dummy variables without using get_dummies

get_dummies is a useful function for running regressions with categorical variables, but sometimes you may want to control which level gets dropped. I looked at its documentation but couldn't see how to pick the level to drop (drop_first simply drops the first category). Anyway, here is a 'manual' way of doing it instead of get_dummies. It seems more customizable to me, at least for some cases.


import numpy as np

# One 0/1 indicator column per category of 'MiscFeature'
merged_data['MiscShed'] = np.where(merged_data['MiscFeature']=='Shed', 1, 0)
merged_data['MiscGar2'] = np.where(merged_data['MiscFeature']=='Gar2', 1, 0)
merged_data['MiscOthr'] = np.where(merged_data['MiscFeature']=='Othr', 1, 0)
merged_data['MiscTenC'] = np.where(merged_data['MiscFeature']=='TenC', 1, 0)
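
For comparison, here is a quick sketch of how the same thing could be done with get_dummies itself, by letting it create a column for every level and then dropping whichever one you want to exclude yourself (the prefix and the dropped column below are just examples):

import pandas as pd

# get_dummies creates one 0/1 column per level of 'MiscFeature'
dummies = pd.get_dummies(merged_data['MiscFeature'], prefix='Misc', prefix_sep='')
# Drop the level you don't want to keep in the regression (example: 'MiscOthr')
dummies = dummies.drop(columns=['MiscOthr'])
merged_data = merged_data.join(dummies)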

Sunday, May 8, 2022

List of correlations of a specific column with other columns, filtered by absolute value

Sometimes we may need to list the columns with the lowest correlations to our Y column (or to any other column of interest), to see whether they are mostly noise and whether there is room for improving our model.

This is how I did it lately:

# Build the correlation matrix and flatten it into a sorted list of pairs
correlations = train_data.corr().unstack().sort_values()
# Convert the resulting Series (with its MultiIndex) to a dataframe
correlations = pd.DataFrame(correlations).reset_index()
# Label the columns; the first is named 'SalePrice' only so that the query below reads naturally
correlations.columns = ['SalePrice', 'Col2', 'Correlation']
# Keep only weak correlations (absolute value at most 0.3)
correlations = correlations[abs(correlations['Correlation']) <= 0.3]
# Keep only the pairs that have SalePrice on the left side
correlations.query("SalePrice == 'SalePrice' & Col2 != 'SalePrice'")


And this sorts them by absolute value:



# Build the correlation matrix, take absolute values and sort
correlations = merged_data.corr().unstack().abs().sort_values()
# Convert the resulting Series to a dataframe
correlations = pd.DataFrame(correlations).reset_index()
# Label the columns
correlations.columns = ['SalePrice', 'Col2', 'Corr Abs.Val.']
# Keep only weak correlations (the values are already absolute here, so no abs() is needed)
correlations = correlations[correlations['Corr Abs.Val.'] <= 0.3]
# Keep only the pairs that have SalePrice on the left side
correlations.query("SalePrice == 'SalePrice' & Col2 != 'SalePrice'")

Friday, May 6, 2022

List of correlations between columns without duplicate rows

Getting a proper list of correlations between columns may be needed, especially when you can't generate a heatmap due to a high number of features. You may need this list to see highly correlated features so that you can drop some columns to improve your regression results and avoid possible multicollinearity problems.

This page was very useful for me to get rid of the duplicate rows.

Here is the code to see correlations above 0.7 (of course you can adjust this threshold):


import numpy as np

corr_list = train_data.corr(method='pearson')
# Mask the lower triangle and the diagonal so that each pair appears only once
corr_list = corr_list.mask(np.tril(np.ones(corr_list.shape)).astype(bool))
# Keep only strong correlations and flatten the matrix into a list of pairs
corr_list = corr_list[abs(corr_list) >= 0.7].stack().reset_index()
corr_list = corr_list.rename(columns={'level_0':'Var1','level_1':'Var2'})
corr_list.sort_values(by=0, ascending=False)

Thursday, May 5, 2022

Showing counts of each unique value across the columns of a dataframe

I wanted to see each unique value with its number of occurrences (the value counts) for each categorical column of my dataframe, all in one table. However, I couldn't manage it with the groupby or pivot_table functions. I suspect there is an easier, one-line way that I couldn't devise or find on Google (a melt-based sketch of one possible shortcut follows the code below), but this is what I came up with in the end:


df_N = pd.DataFrame()
for i in cat_cols:
    # Value counts of this column as a two-column frame: unique value and its count
    s3 = train_data[i].value_counts().reset_index()
    s3.columns = ['Variable', 'Count']
    # Column name only in the first row; the remaining rows stay empty (NaN)
    s2 = pd.DataFrame({'Column': [i]})
    df_N = pd.concat([df_N, pd.concat([s2, s3], axis=1)])
# Hide the index and show NaNs as blanks (newer pandas: .style.hide(axis='index'))
df_N.style.hide_index().format(na_rep='')

To create the lists of categorical and numerical columns:

# Group columns into numerical and categorical lists
from sklearn.compose import make_column_selector as selector
num_cols_selector = selector(dtype_exclude=object)
cat_cols_selector = selector(dtype_include=object)
num_cols = num_cols_selector(train_data)
cat_cols = cat_cols_selector(train_data)
# Alternative without sklearn:
# cat_cols = train_data.select_dtypes(include=['object']).columns

And a concise way of seeing the unique values themselves, without their counts:

for column in cat_cols:
    print("For column:", column)
    print(train_data[column].unique())
    print('-' * 50)
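
As for the one-line version I was looking for, here is a sketch of one possible route with melt and groupby (an idea rather than a snippet I have tested on this exact data):

# Stack all categorical columns into (Column, Variable) pairs and count each pair
counts = (train_data[cat_cols]
          .melt(var_name='Column', value_name='Variable')
          .groupby(['Column', 'Variable'])
          .size()
          .reset_index(name='Count'))
print(counts)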




Printing correlation values of a specific column with other columns

If we are not using a ready-made imputation technique (simple or advanced) to fill the null values, looking at other columns may also help us conjecture the missing value.

So which column should we look at? Let's take this dataset, which has more than 80 features, and assume that we have some null values in 'LotArea'. Which other features can we look at to make a conjecture about it? We can look at the features with the highest correlation with 'LotArea', using this code, which also sorts the values:

# Correlations of every column with 'LotArea', highest first
print(df.corr()['LotArea'].sort_values(ascending=False))

So looking at houses with a similar 'LotFrontage' may help the most. The other features do not have high correlations, but we may still take a second or third feature into account.
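
A rough sketch of how that conjecture could be turned into code (the tolerance of 5 and the median are arbitrary choices for illustration, not values from the notebook):

# For each missing LotArea, take the median LotArea of houses with a similar LotFrontage
for idx in df[df['LotArea'].isna()].index:
    frontage = df.loc[idx, 'LotFrontage']
    similar = df[(df['LotFrontage'] - frontage).abs() <= 5]
    df.loc[idx, 'LotArea'] = similar['LotArea'].median()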

Wednesday, May 4, 2022

Feature engineering: Imputing missing values in specific cells, based on subgroups of other columns

Imputing missing values in specific cells can be tricky. I had to do this while working on a Kaggle competition.

I had to impute values for a specific observation, and one of the columns to be imputed was a categorical one: 'GarageFinish'.

The cells that should be imputed were about the garage features of the house. Here's what I did in brief:

I thought the neighborhood, home type and house style features could be good predictors of the garage features. So I filtered the data on those values and imputed the mean, median or mode of the resulting subgroup, depending on the feature.

As I said above, the interesting part was the 'GarageFinish' feature, which is categorical. The way I found was to impute the mode by combining the value_counts and index[0] methods: value_counts sorts by frequency, so index[0] gives the most frequent value.

You can get a better sense of it from the sketch below:
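
Roughly, the idea was the following (a sketch rather than the exact code; idx stands for the index of the observation with the missing garage cells, and the specific garage columns are just examples):

# The observation whose garage cells are missing
row = train_data.loc[idx]

# Houses in the same neighborhood, with the same building type and house style
similar = train_data[(train_data['Neighborhood'] == row['Neighborhood']) &
                     (train_data['BldgType'] == row['BldgType']) &
                     (train_data['HouseStyle'] == row['HouseStyle'])]

# Numerical garage feature: impute the median (or mean) of the subgroup
train_data.loc[idx, 'GarageArea'] = similar['GarageArea'].median()

# Categorical garage feature: impute the mode via value_counts().index[0]
train_data.loc[idx, 'GarageFinish'] = similar['GarageFinish'].value_counts().index[0]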


That said, I could also have used the house's build and renovation years to predict the garage's build year.

For another similar observation, I filtered by more features, plus the build year of the house, and made the conjecture based on the remaining observations, comparing the row before and after running the code. There is an or condition in the first filter, because the building type alone didn't produce enough observations, so I added another similar type; a rough sketch of that filter follows.
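
The filter looked roughly like this (the building types and the year range are placeholders, not the actual values I used):

# 'Or' on the building type because one type alone gave too few observations,
# plus a window around the build year of the house
similar = train_data[((train_data['BldgType'] == '1Fam') | (train_data['BldgType'] == 'TwnhsE')) &
                     (train_data['Neighborhood'] == row['Neighborhood']) &
                     (train_data['YearBuilt'].between(row['YearBuilt'] - 10, row['YearBuilt'] + 10))]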



Sunday, May 1, 2022

Dealing with the problem of infinite values after log transformation

Sometimes, when you apply a log transformation to highly skewed data, your zeros turn into minus infinity (np.log(0) returns -inf).

Here's how I dealt with it: replacing the infinite values with zeros, and checking before and after whether any infinite values remain.


test["Departure_Delay_in_Mins_log"] = np.log(test["Departure_Delay_in_Mins"])
test[test == -np.inf].count()
test=test.replace([np.inf, -np.inf], 0)
test[test == -np.inf].count()
test.drop('Departure_Delay_in_Mins', axis=1, inplace=True)
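
A common alternative, which I didn't use here but which avoids the infinities in the first place, is np.log1p (the log of 1 + x), so zeros simply map to zero:

# log1p(0) == 0, so no infinite values appear
test["Departure_Delay_in_Mins_log"] = np.log1p(test["Departure_Delay_in_Mins"])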

Log transformation, plotting the log column and dropping the original column

Checking the log distribution first:

# Distribution of the log of the feature 'Kilometers_Driven'
# (sns.distplot is deprecated in newer seaborn; sns.histplot works similarly)

sns.distplot(np.log(data["Kilometers_Driven"]), axlabel="Log(Kilometers_Driven)");


And doing the log transformation:

data["kilometers_driven_log"] = np.log(data["Kilometers_Driven"])


Lastly dropping the original column:

data.drop(['Kilometers_Driven'], axis=1, inplace=True)