Saturday, December 31, 2022

XGBoost Parameters

In this blog post, I will write down and update my notes on XGBoost parameters.

First, five excellent videos on XGBoost from the StatQuest channel:

XGBoost Part 1 (of 4): Regression

XGBoost Part 2 (of 4): Classification

XGBoost Part 3 (of 4): Mathematical Details

XGBoost Part 4 (of 4): Crazy Cool Optimizations

XGBoost in Python from Start to Finish

and a bonus here from Pedram Jahangiry's YouTube channel.

Here is a nice introductory Twitter thread on this topic by Martin Bel.

Sunday, December 11, 2022

Fixing the error "'DataFrame' object has no attribute 'Column_Name'"

Sometimes, when we try to do something with a column, we get the warning "error 'DataFrame' object has no attribute 'Column_Name'". A likely cause, especially when the same transformation works on other columns but fails on this one, is that the column's actual name is slightly different and we haven't spotted it. Below is the simple but easy-to-miss way of detecting and fixing this: first we take a look at the column names to find the whitespace we couldn't see at first, and then we correct the names with the second snippet.

Code:

data_train.columns

data_train = data_train.rename(columns={'ChargedOff_Amount ': 'ChargedOff_Amount',
                                        'Gross_Amount_Disbursed  ': 'Gross_Amount_Disbursed',
                                        'Borrower_Name ': 'Borrower_Name',
                                        'Classification_Code ': 'Classification_Code',
                                        'Jobs_Created ': 'Jobs_Created',
                                        'Year_Of_Commitment ': 'Year_Of_Commitment'
                                        })
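
If the problem is simply leading or trailing whitespace in the column names, a shorter alternative is stripping all the column names at once; a minimal sketch:

# strip leading/trailing whitespace from every column name
data_train.columns = data_train.columns.str.strip()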

Source: stackoverflow - How to resolve AttributeError: 'DataFrame' object has no attribute

Converting numerical values to categories

Sometimes we may need to convert numerical values (integers) to categorical values (objects/strings), because there is no point in running a regression as if these numbers represented a mathematical quantity or a ranking (sector codes in bank loan data, for example). Here's how it's done:

Code:

# These columns should not be integers, but objects.
# These are Primary_Loan_Digit and Code_Franchise
data_train['Primary_Loan_Digit'] = pd.Categorical(data_train.Primary_Loan_Digit)
data_train['Code_Franchise'] = pd.Categorical(data_train.Code_Franchise)
data_test['Primary_Loan_Digit'] = pd.Categorical(data_test.Primary_Loan_Digit)
data_test['Code_Franchise'] = pd.Categorical(data_test.Code_Franchise)
data_train.dtypes

In Turkish: "Sayısal veriyi kategorik veriye dönüştürmek."

Saturday, December 10, 2022

Creating a new column by trimming a column's values & converting a category column to a numeric column

A currency value starting with "Rs." showed up in a dataframe. Since the "Rs." prefix (Indian rupee) is irrelevant for me, I need to trim those first three characters. I do the trimming with the first line below:

But trimming the characters does not turn the string into a number, so the column still appears as a categorical variable. Therefore, I also need to convert it into a numerical variable.

I do that with the code below. The part to pay attention to is ".astype(float)": without it, you get an error (ValueError: invalid literal for int() with base 10).

That takes care of this part. To confirm that the column has become a numerical variable, I rerun the code that determines num_cols and cat_cols.


And when I check num_cols, I see that the "Loan_Approved_Gross_New" column has indeed been converted.


Also, since the date column was entered as a string, it appears as an object. I convert it to a date type with the second line below:


Codes:

data_train['Gross_Amount_Balance_New'] = data_train['Gross_Amount_Balance'].str[3:]

data_train['Date_Of_Disbursement_New'] = pd.to_datetime(data_train['Date_Of_Disbursement'])

data_train['Loan_Approved_Gross_New'] = data_train['Loan_Approved_Gross_New'].astype(float).astype(int)


# group cat and num cols
from sklearn.compose import make_column_selector as selector
num_cols_selector = selector(dtype_exclude=object)
cat_cols_selector = selector(dtype_include=object)
num_cols = num_cols_selector(data_train)
cat_cols = cat_cols_selector(data_train)
cat_cols=data_train.select_dtypes(include=['object']).columns
for column in cat_cols:
    print("For column:",column)
    print(data_train[column].unique())
    print('-'*50)

Sources:

StackOverflow - Pandas make new column from string slice of another column

Converting a string to a date type: SparkByExamples






Thursday, November 17, 2022

Creating tables for model comparison

In this post, I will compile code snippets for comparing the results of different models. The post will hopefully grow as I add new ones.

Here is the first one, comparing a decision tree, a random forest, and other tuned regressors:

The code creates this table:
Code:

models_test_comp_df = pd.concat(
    [
        dt_regressor_perf_test.T,
        regressor_perf_test.T,
        bagging_estimator_perf_test.T,
        dtree_tuned_regressor_perf_test.T,
        bagging_tuned_regressor_perf_test.T,
        rf_tuned_regressor_perf_test.T
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision tree regressor",
    "Random Forest regressor",
    "Bagging regressor",
    "Tuned Decision Tree regressor",
    "Tuned Bagging Tree regressor",
    "Tuned Random Forest Regressor"]
print("Test performance comparison:")
models_test_comp_df


A second snippet that retrieves a table for a similar comparison (the table is not included this time):


# defining list of models you have trained
models = [lr, dtree, ridge_model, dtree_tuned, rf_model, rf_model_tuned]
# defining empty lists to add train and test results
r2_train = []
r2_test = []
rmse_train= []
rmse_test= []
# looping through all the models to get the rmse and r2 scores
for model in models:
    # accuracy score
    j = get_model_score(model,False)
    r2_train.append(j[0])
    r2_test.append(j[1])
    rmse_train.append(j[2])
    rmse_test.append(j[3])

# build the comparison dataframe from the collected scores
comparison_frame = pd.DataFrame({'Model':['Linear Regression','Decision Tree','Ridge Model','Tuned Decision Tree','Random Forest','Tuned Random Forest'], 
                                          'Train_r2': r2_train,'Test_r2': r2_test,
                                          'Train_RMSE':rmse_train,'Test_RMSE':rmse_test}) 
comparison_frame

Friday, November 4, 2022

Three SQL Game Web Sites

1. SQL Murder Mystery: Using SQL knowledge to find a killer in SQL city. https://mystery.knightlab.com 

2. SQL Police Department: You join the SQLPD and solve crimes while learning SQL. https://sqlpd.com

3. Schemaverse: A space-based strategy game implemented entirely within a PostgreSQL database. You will be commanding your fleet and competing against other players using raw SQL commands. https://schemaverse.com

Wednesday, November 2, 2022

How XGBoost works

A nice diagram to understand how XGBoost works:

Wednesday, August 17, 2022

Perfect visualizations of Joins in SQL

Here they are..





SQL Cheatsheet

 A nice SQL cheatsheet from here:



What did Acemoglu get wrong

Here's an excellent podcast for those interested in development economics and the drivers of economic growth (Daron Acemoglu interviewed by Alice Evans):

https://open.spotify.com/episode/76fLUVwGM4glpRQ0dwhf58?si=8KuXHzP_TcmtnYbxnF6QfQ&nd=1

Downloading YouTube videos and converting docs to pdf with Python

Here's a tweet (and the video) showing how to download a YouTube video in mp3 format with Python. You can follow the Python coding channel to see such videos.


Another video here shows how to download videos (not just in mp3 format) with the same library:


Lastly, this video shows you how to convert a .docx file to a pdf:

Tuesday, August 2, 2022

Exercises to improve photographic memory (speed reading exercises)

 Here they are: https://www.hizliokuma.org/mho/

I used to do these exercises years ago and was happy to see some positive results in my reading. They are useful for widening your eye span and grasping what you see very quickly.

This is just the first step for speed reading. The next step may be doing flashing blocks or numbers tests. (One example here: https://humanbenchmark.com/tests/memory)

These exercises can make it possible to grasp paragraphs or even pages at first sight. (I'm not at that level, unfortunately :) )

I think everyone should try it. If a chimpanzee can do it, why wouldn't we? (test here: https://humanbenchmark.com/tests/chimp, a better one here: https://chimptest.zone/)

Creating indexes for effective queries in SQL

 Here is a nice video about indexing and saving some time while doing SQL queries.

Wednesday, June 29, 2022

Driving factors behind US Housing Prices and Demand

Here is a brief visualization that I am working on today:

US Average House Prices and Steel & Lumber Price Index 1982-2022

Full-visualization can be seen here:


Sources:





Additional:

Monday, May 9, 2022

Creating dummy variable without using get_dummies

get_dummies is a useful function for running regressions with categorical variables, but sometimes you may want to keep the level it drops. I looked at its documentation but couldn't figure out how to choose which feature gets dropped. Anyway, here is a 'manual' way of doing it instead of get_dummies. It seems more customizable to me, at least for some cases.


merged_data['MiscShed'] = np.where(merged_data['MiscFeature']=='Shed', 1, 0)
merged_data['MiscGar2'] = np.where(merged_data['MiscFeature']=='Gar2', 1, 0)
merged_data['MiscOthr'] = np.where(merged_data['MiscFeature']=='Othr', 1, 0)
merged_data['MiscTenC'] = np.where(merged_data['MiscFeature']=='TenC', 1, 0)
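
For comparison, here is a rough sketch of the get_dummies route, where you drop the chosen level yourself afterwards (the column and prefix names just follow the example above; dropping 'MiscOthr' is only an illustration):

# one-hot encode MiscFeature, then drop whichever level you prefer
dummies = pd.get_dummies(merged_data['MiscFeature'], prefix='Misc', prefix_sep='')
dummies = dummies.drop(columns=['MiscOthr'])
merged_data = pd.concat([merged_data, dummies], axis=1)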

Sunday, May 8, 2022

List of correlations of a specific column with other columns by filtering the absolute value

Sometimes we may want to capture the columns with the lowest correlations with our Y column (or with some other column) to see whether there is noise and room for improving our model.

This is how I did it lately:

# Building the correlation matrix
correlations = train_data.corr().unstack().sort_values()
# Converting the matrix to dataframe
correlations = pd.DataFrame(correlations).reset_index()
# Labeling the columns
correlations.columns = ['SalePrice', 'Col2', 'Correlation']
# Filter by absolute value
correlations=correlations[abs(correlations['Correlation']) <= 0.3]
# Filter by variable
correlations.query("SalePrice == 'SalePrice' & Col2 != 'SalePrice'")


And this sorts them by absolute value:



# Building the correlation matrix
correlations = merged_data.corr().unstack().abs().sort_values()
# Converting the matrix to dataframe
correlations = pd.DataFrame(correlations).reset_index()
# Labeling the columns
correlations.columns = ['SalePrice', 'Col2', 'Corr Abs.Val.']
# Filter by absolute value
correlations=correlations[abs(correlations['Corr Abs.Val.']) <= 0.3]
# Filter by variable
correlations.query("SalePrice == 'SalePrice' & Col2 != 'SalePrice'")

Friday, May 6, 2022

List of correlations between columns without duplicate rows

Getting a proper list of correlations between columns may be needed, especially when you can't generate a heatmap due to a high number of features. You may need this list to see highly correlated features so that you can drop some columns to improve your regression results and avoid possible multicollinearity problems.

This page was very useful for me to get rid of the duplicate rows.

Here is the code and the screenshot to see correlations above 0.7 (of course you can adjust this threshold):


corr_list = train_data.corr(method='pearson')
corr_list = corr_list.mask(np.tril(np.ones(corr_list.shape)).astype(bool))
corr_list = corr_list[abs(corr_list) >= 0.7].stack().reset_index()
corr_list = corr_list.rename(columns={'level_0':'Var1','level_1':'Var2'})
corr_list.sort_values(by=0, ascending=False)

Thursday, May 5, 2022

Showing counts of each unique value across the columns of a dataframe

I wanted to see each unique value together with its number of occurrences (i.e. the value counts) in every column of my dataframe, as shown in the picture below. However, I couldn't manage it with the groupby or pivot_table functions. I suspect there is an easier one-line method that I couldn't devise or find on Google (there really should be one), but this is what I came up with in the end:


To create the lists of categorical and numerical columns:


A concise way of seeing the unique values without value counts is here:


Codes here:

df_N=pd.DataFrame()
for i in cat_cols:
    s1=[i]
    s2=pd.DataFrame(s1)
    s2=s2.reset_index()
    s2=s2.drop('index', axis=1)
    s2 = s2.rename(columns={0: 'Column'})
    s3=train_data[i].value_counts().reset_index()
    s3 = s3.rename(columns={'index': 'Variable', i: 'Count'})
    s4=pd.concat([s2,s3],axis=1)
    df_N=pd.concat([df_N,s4])
df_N.style.hide_index().format(na_rep='')
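
For the record, here is a more compact alternative that I believe produces roughly the same table (just a sketch, not tested on this exact data):

# one value_counts per categorical column, stacked into a single frame
pd.concat({col: train_data[col].value_counts() for col in cat_cols}) \
    .rename_axis(['Column', 'Variable']) \
    .reset_index(name='Count')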

# group cat and num cols
from sklearn.compose import make_column_selector as selector
num_cols_selector = selector(dtype_exclude=object)
cat_cols_selector = selector(dtype_include=object)
num_cols = num_cols_selector(train_data)
cat_cols = cat_cols_selector(train_data)
cat_cols=data.select_dtypes(include=['object']).columns

for column in cat_cols:
    print("For column:",column)
    print(data[column].unique())
    print('-'*50)




Printing correlation values of a specific column with other columns

If we are not using advanced (or simple) imputation techniques to fill the null values, looking at other columns may also help us make a guess about the missing value.

So which column should we look at? Let's take this dataset, which has more than 80 features, and assume that we have some null values in 'LotArea'. Which other features can we look at to make a guess about this feature? We can look at the features with the highest correlations with LotArea, using this code, which also sorts the values:

print(pd.DataFrame(df.corr())['LotArea'].sort_values(ascending=False))

So looking at houses with a similar 'LotFrontage' may help the most. The other features do not have high correlations, but we may still take a second or third feature into account.

Wednesday, May 4, 2022

Feature engineering: Imputing missing values in a cell, based on subgroups of other columns

Imputing missing values in a specific cell can be tricky. I had to do this while working on a competition on Kaggle.

I had to impute values for specific cells, including a categorical variable (the last one in the picture: 'GarageFinish').

The cells that should be imputed were about the garage features of the house. Here's what I did in brief:

I thought the neighborhood, home type, and house style features could be good predictors of the garage features. So I estimated the garage features by filtering on those values and imputing their mean, median, or mode, depending on the feature.

As I said above, the interesting part was the 'GarageFinish' feature, which is categorical. The way I found was imputing the mode by combining the value_counts and index[0] methods.

You can understand it better by checking the code below:
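
The original code was a screenshot; below is only a minimal sketch of the idea, with the Ames housing column names ('Neighborhood', 'BldgType', 'HouseStyle', 'GarageArea', 'GarageFinish'), the filter values and the row index all assumed for illustration:

# rows similar to the one with the missing garage values
similar = merged_data[(merged_data['Neighborhood'] == 'NAmes') &
                      (merged_data['BldgType'] == '1Fam') &
                      (merged_data['HouseStyle'] == '1Story')]

# numerical garage feature: impute the median of the subgroup
merged_data.loc[2576, 'GarageArea'] = similar['GarageArea'].median()

# categorical garage feature: impute the mode via value_counts().index[0]
merged_data.loc[2576, 'GarageFinish'] = similar['GarageFinish'].value_counts().index[0]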


That said, I could also have used the house's build and renovation years to predict the garage's build year.

For another similar row, I filtered by more features and made the guesses based on the other observations, plus the build year of the house. The picture below shows that one: the first and last rows are before and after running the code. There is an 'or' condition in the first filter, because the building type alone didn't produce enough observations and I added another similar type.



Sunday, May 1, 2022

Dealing with the problem of infinite values after log transformation

Sometimes, when you apply a log transformation to highly skewed data, your zeros turn into minus infinity.

Here's how to deal with it (replacing the infinite values with zeros and checking whether any infinite values remain):


test["Departure_Delay_in_Mins_log"] = np.log(test["Departure_Delay_in_Mins"])
test[test == -np.inf].count()
test=test.replace([np.inf, -np.inf], 0)
test[test == -np.inf].count()
test.drop('Departure_Delay_in_Mins', axis=1, inplace=True)

Log transformation, plotting the log column and dropping original column

Checking the log distribution first:

#Log transformation of the feature 'kilometers_driven'

sns.distplot(np.log(data["kilometers_driven"]), axlabel="Log(kilometers_driven)");


And doing the log transformation:

data["kilometers_driven_log"] = np.log(data["Kilometers_Driven"])


Lastly dropping the original column:

data.drop(['kilometers_driven'], axis=1, inplace=True)


Wednesday, April 27, 2022

Boston Crime Data: Map and Bar chart on Tableau Dashboard

Here is a map and bar chart of Boston Crime Data on Tableau.. You can click on this link to see it on Tableau Public. To see the data and codes please check my GitHub.

The data is based on the City of Boston's public data and spans the first three months of 2022. I aggregated many offense descriptions into broader categories, and here I show 5 of those categories: Assault, Drug-Alcohol Related, Murder-Injury, Theft-Damage and Trespassing.

The visualizations show the aggregates by days of the week. So you can have a general understanding of the distribution of those crime types across the districts by each day of the week.

I had published a heatmap here as well.

Monday, April 25, 2022

Converting zeros to null values

Sometimes cells that are not null may need to be treated as null values and be converted to 'NaN', as shown below:
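
The original snippet was a screenshot; the idea is a one-liner along these lines (the column name is just an example):

import numpy as np

# treat zeros in the column as missing values
data['Jobs_Created'] = data['Jobs_Created'].replace(0, np.nan)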

Number of unique observations in each categorical column

 Here's how you get it:
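
The screenshot is missing here, but a minimal sketch of the idea (assuming the dataframe is called data):

# number of unique values in each categorical (object) column
data.select_dtypes(include='object').nunique()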




Sunday, April 24, 2022

Checking row-wise missing value counts and filtering them

If you want to check the number of missing values in each row of your data, you can do it as shown here:


After checking these values, she filters the rows that have more than two missing values and deletes them:
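
The code in the video is shown on screen; a minimal sketch of both steps (assuming the dataframe is called df):

# number of missing values in each row
df.isnull().sum(axis=1)

# keep only the rows that have at most two missing values
df = df[df.isnull().sum(axis=1) <= 2]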

Source: Handling missing values in Python Explained with example Fillna dropna sklearn KNN Model Imputation

In short: checking how many missing values each row contains, and deleting the rows with more than two missing values.





Error: Input contains NaN, infinity or a value too large for dtype

 Sometimes you get this error after running the regression on your data:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The first thing to check is the total number of NaNs and infinite values (the last check was actually unnecessary in this case):
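
The check itself was a screenshot; a sketch of what I mean, assuming the feature matrix is X_train and all of its columns are numeric:

import numpy as np

# total number of NaNs and infinite values in the features
print(X_train.isnull().sum().sum())
print(np.isinf(X_train).sum().sum())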

As you can see there is no problem in those values.

One possible reason: you already created a log version of your Y value (as Y_log) but forgot to drop the Y column itself from the X_train set. In that case you are trying to predict Y_log using the original Y column as an independent variable, and that may lead to such an error.

Another thing to do may be tuning the hyperparameters of the regression (depending on the type of regression).

I also noticed that during one-hot encoding my feature list got too crowded (thousands of features). This was because Python had somehow stored a calculated field with the "object" data type, which turned that column's values into separate boolean features. (An important lesson: check your data types before getting dummies.)

I fixed this as well but I was still getting the same error.

My solution was interesting:
I figured my notebook was too crowded with code. It had become very slow, because the data cleaning part was messy in itself. So I first exported the final dataframe, the one I reached just before running the regressions, as a .csv file, and then ran the regressions in a new notebook.

Sometimes simple solutions are worth trying.

Monday, April 18, 2022

Vlookup in Python (other than JOIN)

If you want to handle everything in Python (instead of using SQL or Excel), there are many vlookup-style options. Here I want to show the one I used with the Boston Crime Data.

I wanted to group many crime definitions to reduce the number of categories in a new "offense description" column. For example, instead of having the "property damage", "auto theft", and "larceny - all others" categories separately, I wanted to group those crimes into one "theft-damage" category.

I created a new df with two columns, namely "Offense_Description" and "Offense_Category". The first column included all the unique values of the offense descriptions, and the second one regrouped them into the new categories.

Then I used the ".map" method to act like Excel's vlookup and create a new column in the dataframe. I am adding this here because at times the merge or join functions do not work as smoothly as expected, and this is a nice option to keep in mind. Here is how it looks:
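
The screenshot is not included here, but the gist was something like the following sketch (the lookup dataframe and column names are assumptions):

# build a dict from the lookup dataframe and map it onto the main data
mapping = offense_lookup.set_index('Offense_Description')['Offense_Category'].to_dict()
data['Offense_Category'] = data['OFFENSE_DESCRIPTION'].map(mapping)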


Another example with .merge method (while converting the district codes to district names):
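
Again only as a sketch, assuming a small lookup dataframe of district codes and names:

# left join the district names onto the main data by district code
data = data.merge(district_names, on='DISTRICT', how='left')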



Wednesday, April 13, 2022

A/B Test Data Munging: Changing a two-column number data to one-column number and group labels

Before doing an A/B test, data may come in different formats such as two data frames or a dataframe with two columns consisting of Group A figures and Group B figures respectively.

In such cases we may need to reorganize the data so that one column consists of the figures and the second column shows their respective groups. Here is a concise method of doing this:




#Data A and Group A
Group_A = np.arange(len(A))
Group_A = pd.DataFrame(Group_A)
Group_A[:] = "A"
A = pd.concat([A, Group_A], axis = 1)

#Data B and Group B
Group_B = np.arange(len(B))
Group_B = pd.DataFrame(Group_B)
Group_B[:] = "B"
B = pd.concat([B, Group_B], axis = 1)

#All data in one DF
AB = pd.concat([A,B])
AB.columns = ["Income","Group"]
print(AB.head())
print(AB.tail())

Monday, April 11, 2022

Boston Crime Data Analysis - 2: A better heatmap with row and column totals

I've just shared a blog post with a heatmap of crimes per district and day. However, I wanted to add the row and column totals and came up with this. So this version also shows the most dangerous districts and days, by aggregating the numbers.

I am also adding the codes below. You will find the source at the top, which was useful for me to reach this solution.

If you add the totals as part of the table itself (as the last row or column), it ruins the colors, as expected, since the aggregated numbers in the totals row and column would be the only dark cells and would make all the other cells look uniformly light.

So, in seaborn, a separate totals column/row can be added with matplotlib's .subplot2grid method.


Code:

# Source:
# https://stackoverflow.com/questions/33379261/how-can-i-have-a-bar-next-to-python-seaborn-heatmap-which-shows-the-summation-of

fig = plt.figure(figsize=(16,12))
ax1 = plt.subplot2grid((14,10), (0,0), rowspan=12, colspan=7)
ax2 = plt.subplot2grid((14,10), (0,8), rowspan=12, colspan=1)
ax3 = plt.subplot2grid((14,10), (13,0), rowspan=1, colspan=7)

table_pivot= data.pivot_table("Hour",
                              ["DISTRICT"],
                              columns="DAY_OF_WEEK",
                              aggfunc = "count")

column_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

table_pivot2 = table_pivot.reindex(column_order, axis=1)
table_pivot2 = table_pivot2.drop("External", axis=0)
cmap = sns.cm.rocket_r

sns.heatmap(table_pivot2,
            ax=ax1,
            annot = True,
            fmt="0.1f",
            linewidths=.8,
            vmin=-0.05,
            cmap = cmap,
            cbar = False).set_title('Number of Crimes per District and Day of Week',
                                   fontsize=14,
                                   fontweight='bold')

y_axis_labels=['A1','A15','A7','B2','B3','C11','C6','D14','D4','D13','E8','E5']
ax1.xaxis.tick_bottom()
ax1.set_xticklabels(table_pivot2.columns,rotation=40)
ax1.yaxis.tick_right()
ax1.set_yticklabels(y_axis_labels, rotation='horizontal')

sns.heatmap((pd.DataFrame(table_pivot2.sum(axis=0))).transpose(),
            ax=ax3,
            annot=True,
            fmt='g',
            cmap=cmap,
            cbar=False,
            xticklabels=False,
            yticklabels=False).set(xlabel='', ylabel='Totals')
sns.heatmap(pd.DataFrame(table_pivot2.sum(axis=1)),
            ax=ax2,
            annot=True,
            fmt='g',
            cmap=cmap,
            cbar=False,
            xticklabels=False,
            yticklabels=False).set(xlabel='', ylabel='', title='Totals');

Another heatmap here (no. of crimes per district/hours):





Boston Crime Data Analysis - 1: Number of Crimes per Weekday and District Heatmap

Here is a heatmap showing the number of crimes per district and day of the week, according to the City of Boston's public data for the first three months of 2022:

Below is the code..

table_pivot= data.pivot_table("Hour",
                 ["DISTRICT"],
                 columns="DAY_OF_WEEK",
                 aggfunc = "count")
column_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
table_pivot2 = table_pivot.reindex(column_order, axis=1)
table_pivot2=table_pivot2.drop("External", axis=0)
cmap = sns.cm.rocket_r
sns.set(rc = {'figure.figsize':(15,8)})
sns.heatmap(table_pivot2,
            annot = True,
            fmt="0.1f",
            linewidths=.8,
            vmin=-0.05,
            cmap = cmap,)

Note: "hour" is counted in pivot table because it doesn't have a null value.

Converting object-type data to the category type while preserving the ordering

When object-type data is converted to the category type, its ordering can get lost. Here I am adding the explanations and the code for converting while keeping the order. The explanations are in the pictures and the code is at the very bottom.


Codes:

# Load the data and take a first look:
import seaborn as sns
diamonds = sns.load_dataset('diamonds')
df = diamonds.copy()
df.head()

# Defining the ordinal: we import the categorical data type.
from pandas.api.types import CategoricalDtype

# A look at the first 5 observations of the "cut" column
df.cut.head()

# Converting the "cut" column from object to category:
df.cut = df.cut.astype(CategoricalDtype(ordered = True))

# Checking the new dtype of the "cut" column:
df.cut.dtypes

# Checking the category ordering of the "cut" column:
df.cut.head(1)

# our list of categories in the correct order:
cut_kategoriler = ["Fair","Good","Very Good","Premium","Ideal"]

# and we register this list as the correct ordering of the categories:
df.cut = df.cut.astype(CategoricalDtype(categories = cut_kategoriler, ordered = True))

# checking whether the categories are now ordered according to the list:
df.cut.head(1)

Sunday, April 10, 2022

Getting unique values with the .unique() method

To see the unique values of a single column:

titanic["embark_town"].unique()

To see the unique values of multiple columns:

To see the unique values of two columns separately, I couldn't find a better way than creating a separate copy of the dataframe consisting of only those two columns.
titanic_2 = titanic[["alive","embark_town"]].copy()
for col in titanic_2:
    print(col)
    print(titanic_2[col].unique())

That said, it is possible to see the unique values of two columns combined, like this:

import pandas as pd
import numpy as np
column_values = titanic[['who','embark_town']].values.ravel()
unique_values =  pd.unique(column_values)

print(unique_values)



To see the unique values of all columns:

1. This code lets us see the unique values of all columns in a very clean and compact way:

pd.Series({col:titanic[col].unique() for col in titanic})


2. If we want to see all the unique values of all columns in detail, we can also use this code:

for col in titanic:
    print(col)
    print(titanic[col].unique())