Cüneyd Yasin's Posts: 2023

Wednesday, 6 December 2023

Lazy Predict Library

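Here's a library that automatically trains a broad set of models with their default settings and ranks them, so you can pick the most promising algorithm for your predictive model. It works for regression and classification problems. Note that it compares untuned baselines; hyperparameter tuning of the winning model is still a separate step.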

Lazy Predict
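To make the idea concrete, here is a minimal sketch of the model-comparison step written directly with scikit-learn. It is not Lazy Predict's own code or API; the dataset, the four candidate models and the scaling step are my assumptions, chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Example dataset bundled with scikit-learn (an assumption for this sketch)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Candidate models, all left at (near) default settings: no tuning happens here
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

# Fit every candidate on the same split and record its test accuracy
scores = {}
for name, model in candidates.items():
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)

# Rank the models from best to worst
leaderboard = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in leaderboard:
    print(f"{name:24s} accuracy = {acc:.3f}")
```

The leaderboard only tells you which model families look promising on this split; the best one still deserves proper cross-validation and tuning afterwards.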

Monday, 10 July 2023

Finding the distribution of a dataset

The fitter library seems to be a very good solution: it fits many candidate distributions to your data and ranks them by goodness of fit. Documentation and source code:

https://fitter.readthedocs.io/

https://github.com/cokelaer/fitter
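As a rough illustration of what such a tool automates, here is a sketch that uses SciPy directly rather than fitter itself. The helper name rank_distributions, the candidate list and the synthetic sample are my assumptions; the ranking criterion (squared error between the histogram and the fitted density) is one common choice, not necessarily the only one fitter offers.

```python
import numpy as np
from scipy import stats


def rank_distributions(data, candidates=("norm", "gamma", "lognorm", "expon"), bins=50):
    """Fit each candidate SciPy distribution and rank by squared error vs. the histogram."""
    data = np.asarray(data, dtype=float)
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(data)              # maximum-likelihood parameter estimates
        pdf = dist.pdf(centers, *params)     # fitted density at the bin centers
        sse = float(np.sum((density - pdf) ** 2))
        results.append((name, sse, params))
    return sorted(results, key=lambda r: r[1])  # smallest error first


# Hypothetical sample drawn from a gamma distribution
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=3.0, size=2000)

ranking = rank_distributions(sample)
for name, sse, _ in ranking:
    print(f"{name:8s} sum of squared errors = {sse:.5f}")
```

A low error only says the shape matches the histogram well; it is still worth plotting the fitted density over the data before trusting the result.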

Sunday, 9 July 2023

Handling and visualizing missing values

There is an important tweet about handling and visualizing missing values here.
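The tweet's content is not reproduced on this page, so the following is only a generic sketch of the topic with pandas, not a summary of the tweet. The toy DataFrame and the median/mode imputation rule are my assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "city": ["Ankara", "Izmir", None, "Bursa", "Izmir"],
    "income": [4200.0, 3900.0, np.nan, 5100.0, 4800.0],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
})
print(missing)

# One simple handling strategy: median for numeric columns, mode for the rest
filled = df.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
```

For a quick visual check, the pct_missing column can be drawn as a bar chart; whether imputing or dropping is the better choice depends on why the values are missing.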

Tuesday, 21 March 2023

Detecting multicollinearity (thanks to ChatGPT)

I just asked ChatGPT how to detect and handle multicollinearity (something I probably knew already). Here's the solution code:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Load data into a pandas DataFrame
data = pd.read_csv('your_data.csv')

# List the column names to check for multicollinearity
# (placeholder names; extend the list with your own numeric columns ...)
cols_to_check = ['feature1', 'feature2', 'feature3']

# Keep only the columns to check; VIF cannot be computed with missing values
data_to_check = data[cols_to_check].dropna()

# Add an intercept column, otherwise statsmodels computes VIFs without a constant
X = add_constant(data_to_check)

# Calculate the VIF score for each feature
vif_scores = pd.DataFrame()
vif_scores["Feature"] = X.columns
vif_scores["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Drop the intercept row and print the VIF scores
vif_scores = vif_scores[vif_scores["Feature"] != "const"]
print(vif_scores)
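To show what the VIF number means, here is a small NumPy-only sketch: each feature is regressed on all the others, and VIF = 1 / (1 - R²) of that regression. The function name vif_scores and the synthetic features x1, x2, x3 are my assumptions, not part of the ChatGPT answer.

```python
import numpy as np


def vif_scores(X):
    """Return the VIF of each column of X (rows = samples, columns = features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    scores = []
    for j in range(n_features):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on an intercept plus all the other features
        A = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals = y - A @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        scores.append(1.0 / (1.0 - r_squared))
    return np.array(scores)


# Hypothetical data: x3 is almost a scaled copy of x1, x2 is independent
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = 2.0 * x1 + rng.normal(scale=0.1, size=500)

vifs = vif_scores(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], vifs.round(1))))
```

A VIF near 1 means the feature is barely explained by the others. Common rules of thumb flag values above 5 or 10; the usual ways to handle that are dropping one of the collinear features, combining them, or switching to a regularized model.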

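Sunday, 19 March 2023

Outlier handling

I got this from ChatGPT again. This is a simple for loop that masks the outliers in a list of columns. We set an upper and a lower limit of our choice (here the mean plus or minus two standard deviations), then replace every value outside those limits with NaN so it can be dropped or imputed later. (The dataset is named sm_df in this example.)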

import numpy as np

outlier_column_list = ["V1", "V4", "V5", "V6", "V7", "V8", "V10", "V11", "V12", "V13", "V14",
                       "V15", "V17", "V18", "V19", "V20", "V21", "V22", "V23", "V26"]

# Here we'll set upper and lower limits and then mask the outliers
# Loop through each column in outlier_column_list
for col in outlier_column_list:

    # Calculate the upper and lower limits
    upper_limit = sm_df[col].mean() + 2 * sm_df[col].std()
    lower_limit = sm_df[col].mean() - 2 * sm_df[col].std()

    # Replace outliers above the upper limit with NaN
    sm_df[col] = np.where(sm_df[col] > upper_limit, np.nan, sm_df[col])

    # Replace outliers below the lower limit with NaN
    sm_df[col] = np.where(sm_df[col] < lower_limit, np.nan, sm_df[col])
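The same idea can be packaged as a small reusable function. This is a sketch under my own assumptions: the helper name mask_outliers, the n_std parameter and the toy column are hypothetical, and it returns a copy instead of modifying sm_df in place.

```python
import numpy as np
import pandas as pd


def mask_outliers(df, columns, n_std=2.0):
    """Return a copy of df where values beyond mean +/- n_std * std are set to NaN."""
    out = df.copy()
    for col in columns:
        mean, std = out[col].mean(), out[col].std()
        lower, upper = mean - n_std * std, mean + n_std * std
        # Series.where keeps values where the condition holds, NaN elsewhere
        out[col] = out[col].where(out[col].between(lower, upper))
    return out


# Hypothetical data: ten ordinary values and one extreme one
sm_df = pd.DataFrame({"V1": [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 100]})

cleaned = mask_outliers(sm_df, ["V1"])
print(cleaned["V1"].isna().sum())

# To actually delete the affected rows, drop the NaNs afterwards
dropped = cleaned.dropna(subset=["V1"])
```

Keep in mind that extreme values inflate the mean and standard deviation themselves, so on heavily contaminated columns a limit based on the median and interquartile range may behave better.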

How to handle highly skewed data according to ChatGPT :)

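import numpy as np
from scipy.stats import boxcox

# assume sm_df is your pandas DataFrame and "A" is the skewed column
# The options below are alternatives: apply one of them to the original column

# Option 1: Box-Cox transformation
# Box-Cox needs strictly positive values, so shift the column first
sm_df["A"] = sm_df["A"] - sm_df["A"].min() + 1
sm_df["A"], _ = boxcox(sm_df["A"])

# Option 2: standardization (z-score)
# This rescales the column but does not change its skewness by itself
sm_df["A"] = (sm_df["A"] - sm_df["A"].mean()) / sm_df["A"].std()

# Option 3: log transformation
# The log also needs strictly positive values, so apply the same shift
sm_df["A"] = sm_df["A"] - sm_df["A"].min() + 1
sm_df["A"] = np.log(sm_df["A"])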

Some other information:

The choice of transformation depends on the characteristics of your data and the requirements of your analysis. For strictly positive data with strong right skew, a log transformation is often a simple and sufficient choice; Box-Cox generalizes it (a lambda of 0 is exactly the log) but also requires strictly positive values. Conversely, if your data is left-skewed or contains zeros or negative values, the Yeo-Johnson transformation is usually more appropriate, because it accepts any real value without shifting.
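Since Yeo-Johnson is mentioned but not shown, here is a minimal sketch using scipy.stats.yeojohnson. The synthetic column and the new column name A_yj are my assumptions; the skewness is measured before and after to check that the transformation actually helped.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, yeojohnson

rng = np.random.default_rng(0)

# Hypothetical right-skewed column that also contains negative values
sm_df = pd.DataFrame({"A": rng.lognormal(mean=0.0, sigma=1.0, size=1000) - 1.0})

skew_before = skew(sm_df["A"])

# With no lambda given, yeojohnson estimates it by maximum likelihood
# and returns both the transformed values and the fitted lambda
transformed, lmbda = yeojohnson(sm_df["A"].to_numpy())
sm_df["A_yj"] = transformed

skew_after = skew(sm_df["A_yj"])
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}, lambda: {lmbda:.2f}")
```

Writing the result to a new column keeps the original values available, which makes it easy to compare several transformations side by side.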

Tuesday, 21 February 2023

Writing ensemble model code with ChatGPT

Utilizing ChatGPT may be the next great skill, as it has the potential to eliminate the need for advanced knowledge of coding.

Here is an example of how I used ChatGPT to create an ensemble model for a classification project:
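The original example is not reproduced on this page. As a stand-in, here is a minimal sketch of what an ensemble for a classification task can look like with scikit-learn's VotingClassifier. The synthetic dataset, the two base models and all parameter values are my assumptions, not the code from the original conversation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real classification dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Soft voting averages the predicted class probabilities of the base models
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

accuracy = accuracy_score(y_test, ensemble.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Soft voting needs base models that can output probabilities; with voting="hard" the ensemble takes a majority vote over the predicted labels instead.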

Wednesday, 18 January 2023

Finding the columns that differ between two DataFrames (columns one has and the other lacks)

Below is the code for finding the name of the column (or columns) that exists in one of two similar DataFrames but not in the other, for example the train and test data of the same project. In our example the result is, naturally, the dependent variable ("y"), that is, the sale price column. (We define train_cols and test_cols beforehand.)

train_cols = train_data.columns
test_cols = test_data.columns

list_difference = []
for item in train_cols:
    if item not in test_cols:
        list_difference.append(item)
print(list_difference)
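pandas can also do this without an explicit loop, using the set-like methods of the column Index. In this sketch the two small DataFrames and the column names (LotArea, YearBuilt, SalePrice) are hypothetical stand-ins for the real train and test data.

```python
import pandas as pd

# Hypothetical train/test frames; "SalePrice" stands in for the target column
train_data = pd.DataFrame({
    "LotArea": [8450, 9600],
    "YearBuilt": [2003, 1976],
    "SalePrice": [208500, 181500],
})
test_data = pd.DataFrame({"LotArea": [11622], "YearBuilt": [1961]})

# Columns in train_data that are missing from test_data
only_in_train = train_data.columns.difference(test_data.columns).tolist()
print(only_in_train)

# Columns that appear in exactly one of the two frames, in either direction
in_one_only = train_data.columns.symmetric_difference(test_data.columns).tolist()
```

Index.difference returns its result sorted by default, so the order may differ from the original column order; the loop version keeps the original order.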
