A proper list of correlations between columns can be useful, especially when the number of features is too high for a readable heatmap. With such a list you can spot highly correlated features and drop some of those columns to improve your regression results and avoid possible multicollinearity problems.
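This page was very useful to me for getting rid of the duplicate rows (each pair of columns appears twice in a correlation matrix, once on either side of the diagonal).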
Here is the code to list correlations with an absolute value of 0.7 or higher (of course, you can adjust this threshold):
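import numpy as np
corr_list = train_data.corr(method='pearson')
# Mask the lower triangle and the diagonal so each pair appears only once
corr_list = corr_list.mask(np.tril(np.ones(corr_list.shape)).astype(bool))
# Keep correlations whose absolute value is at least 0.7, in long format
corr_list = corr_list[abs(corr_list) >= 0.7].stack().dropna().reset_index()
corr_list = corr_list.rename(columns={'level_0':'Var1','level_1':'Var2'})
# sort_values returns a new DataFrame, so assign the result
corr_list = corr_list.sort_values(by=0, ascending=False)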
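To try the same idea without your own dataset, here is a minimal, self-contained sketch on toy data. The function name high_correlation_pairs, the column name Corr, and the toy DataFrame are my own assumptions for illustration and are not part of the original code. It also sorts by absolute value, so strong negative correlations are listed next to strong positive ones.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.7):
    """Return column pairs whose absolute Pearson correlation is >= threshold.

    Hypothetical helper for illustration; not from the original post.
    """
    corr = df.corr(method="pearson")
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is left out.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["Var1", "Var2", "Corr"]
    # Filtering here also discards any NaN entries left from the masked cells.
    pairs = pairs[pairs["Corr"].abs() >= threshold]
    # Sort by absolute value so strong negative correlations are not buried.
    pairs = pairs.sort_values(by="Corr", ascending=False, key=abs)
    return pairs.reset_index(drop=True)


# Toy data (made up): b increases linearly with a, c decreases linearly
# with a, and d is only weakly related to the other columns.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [6, 5, 4, 3, 2, 1],
    "d": [2, 7, 1, 8, 2, 8],
})
result = high_correlation_pairs(toy, threshold=0.7)
print(result)
```

With this toy data, only the pairs among a, b, and c should pass the threshold; lowering the threshold would bring in the pairs involving d as well.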