Cüneyd Yasin's Posts: April 2022

Wednesday, 27 April 2022

Boston Crime Data: Map and Bar chart on Tableau Dashboard

Here is a map and bar chart of Boston Crime Data on Tableau. You can click on this link to see it on Tableau Public. To see the data and code, please check my GitHub.

The data is based on the City of Boston's public data and spans the first three months of 2022. I aggregated many offense descriptions into categories, and here I show 5 of those categories: Assault, Drug-Alcohol Related, Murder-Injury, Theft-Damage and Trespassing.

The visualizations show the aggregates by day of the week, so you can get a general sense of how those crime types are distributed across the districts on each day of the week.

I had published a heatmap here as well.

Monday, 25 April 2022

Converting zeros to null values

Sometimes cells that are not null may need to be treated as null values and be converted to 'NaN' as shown below:
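A minimal sketch of one way to do this, assuming a hypothetical DataFrame `df` in which a zero in `age` or `income` really means "not recorded". The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: in "age" and "income" a 0 means "not recorded"
df = pd.DataFrame({
    "age": [25, 0, 40, 31],
    "income": [3000, 4200, 0, 5100],
    "children": [0, 2, 1, 0],  # here 0 is a legitimate value, so it is left alone
})

# Replace zeros with NaN only in the columns where 0 stands for "missing"
cols_with_fake_zeros = ["age", "income"]
df[cols_with_fake_zeros] = df[cols_with_fake_zeros].replace(0, np.nan)

print(df)
print(df.isnull().sum())
```

Restricting the replacement to selected columns avoids wiping out zeros that are real values, such as the `children` column here.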

Number of unique observations in each categorical column

Here's how you get it:
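A minimal sketch, assuming a small made-up DataFrame `df`. It selects the non-numeric columns and counts the distinct values in each one with `.nunique()`.

```python
import pandas as pd

# Hypothetical data with two categorical columns and one numeric column
df = pd.DataFrame({
    "city": ["Boston", "Boston", "Salem", "Lowell"],
    "day": ["Mon", "Tue", "Mon", "Mon"],
    "count": [3, 5, 2, 7],
})

# Keep only the non-numeric columns, then count distinct values per column
unique_counts = df.select_dtypes(exclude="number").nunique()
print(unique_counts)
```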

Sunday, 24 April 2022

Checking row wise missing value counts and filtering them

If you want to check the number of missing values in each row of your data, you can do it as shown here:
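A minimal sketch of the row-wise count, assuming a made-up DataFrame `df`. Summing `isnull()` with `axis=1` gives one count per row.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing cells
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": ["x", None, "z", "w"],
})

# axis=1 sums across the columns, giving the number of missing values in each row
missing_per_row = df.isnull().sum(axis=1)
print(missing_per_row)
```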

After checking these values, the author of the source below filters the rows that have more than two missing values and deletes them:
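A minimal sketch of that filtering step, assuming a made-up DataFrame `df`. This is not necessarily the exact code used in the source; it shows a boolean mask and the equivalent `dropna(thresh=...)` call.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing all four values
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [7.0, np.nan, np.nan, 2.0],
    "d": [5.0, np.nan, 6.0, np.nan],
})

missing_per_row = df.isnull().sum(axis=1)

# Keep only the rows with at most two missing values
df_filtered = df[missing_per_row <= 2]

# Same result with dropna: keep rows having at least (n_columns - 2) non-null values
df_filtered_alt = df.dropna(thresh=df.shape[1] - 2)

print(df_filtered)
```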

Source: Handling missing values in Python Explained with example Fillna dropna sklearn KNN Model Imputation

In short: count the missing values in each row and delete the rows that contain more than 2 of them.

Error: Input contains NaN, infinity or a value too large for dtype

Sometimes you get this error after running the regression on your data:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The first thing to check is the total number of NaN and infinite values (the check for infinite values actually turned out to be unnecessary in this case):
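A minimal sketch of such a check, assuming a hypothetical, clean feature table `X_train`. The helper name `count_nan_and_inf` is made up for illustration.

```python
import numpy as np
import pandas as pd

def count_nan_and_inf(frame):
    """Return (number of NaN cells, number of +/-inf cells) in the numeric columns."""
    numeric = frame.select_dtypes(include="number")
    nan_count = int(numeric.isnull().sum().sum())
    inf_count = int(np.isinf(numeric.to_numpy(dtype=float)).sum())
    return nan_count, inf_count

# Hypothetical training features without any NaN or infinite values
X_train = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.5, 1.5, 2.5]})
print(count_nan_and_inf(X_train))  # prints (0, 0)
```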

As you can see there is no problem in those values.

One reason may be this: you already created a log-transformed version of your Y value (as Y_log) but forgot to delete the Y column itself from the X_train set. In that case you are trying to predict Y_log using the Y column as an independent variable, and it may lead to such an error.
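A minimal sketch of keeping the target out of the features, assuming a made-up DataFrame with a target column `Y` and two hypothetical predictors `x1` and `x2`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Y is the target, x1 and x2 are predictors
df = pd.DataFrame({
    "Y": [10.0, 100.0, 1000.0],
    "x1": [1.0, 2.0, 3.0],
    "x2": [4.0, 5.0, 6.0],
})
df["Y_log"] = np.log(df["Y"])

# Drop both the log target and the raw column it was derived from
X = df.drop(columns=["Y", "Y_log"])
y = df["Y_log"]

print(X.columns.tolist())
```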

Another thing to do may be tuning the hyperparameters of the regression (depending on the type of regression).

I also noticed that during one-hot encoding my feature list got too crowded (thousands of features). This was because Python had somehow stored a calculated field with the "object" data type, which led to each of that column's values being converted into its own boolean feature. (An important lesson: check your data types before getting dummies.)
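A minimal sketch of that pitfall, assuming a made-up DataFrame where a calculated field `ratio` was stored as text. Converting it with `pd.to_numeric` before calling `pd.get_dummies` keeps it as a single numeric column.

```python
import pandas as pd

# Hypothetical data: "ratio" holds numbers that were stored as strings
df = pd.DataFrame({
    "district": ["A1", "B2", "A1", "C6"],
    "ratio": ["0.25", "0.5", "0.75", "1.0"],
})

print(df.dtypes)

# Without conversion, every distinct "ratio" value becomes its own dummy column
dummies_before = pd.get_dummies(df)

# Convert the calculated field to a numeric dtype first, then encode
df["ratio"] = pd.to_numeric(df["ratio"])
dummies_after = pd.get_dummies(df)

print(dummies_before.shape[1], dummies_after.shape[1])
```

With many distinct values in such a column, the number of dummy features grows with the number of rows, which matches the "thousands of features" symptom.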

I fixed this as well but I was still getting the same error.

My solution was interesting:

I assumed my notebook was too crowded with code. It had gotten too slow because the data cleaning part was messy. So first, I exported the final dataframe, the one I had just before running the regressions, as a .csv file. Then I ran the regressions in a new notebook.

Sometimes simple solutions are worth trying.

Monday, 18 April 2022

Vlookup in Python (other than JOIN)

If you want to handle everything in Python (instead of using SQL or Excel), there are many ways to do a vlookup in Python. Here I want to show the one I used for the Boston Crime Data.

I wanted to group many crime definitions to decrease the number of categories in a new "offense description" column. For example, instead of having the "property damage", "auto theft" and "larceny - all others" categories separately, I wanted to group those crimes into one "theft-damage" category.

I created a new df with two columns, namely "Offense_Description" and "Offense_Category". The first column included all unique values of the offense description and the second one regrouped them in new categories.

Then I used the ".map" method to act like VLOOKUP in Excel and create a new column in the dataframe. I am adding this here because at times the merge or join functions do not work as well as expected, and this may be a nice option to keep in mind. Here is how it looks:
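A minimal sketch of the lookup with `.map`, assuming a small made-up lookup table and a hypothetical `OFFENSE_DESCRIPTION` column in the main data. The real tables are much larger.

```python
import pandas as pd

# Lookup table: each offense description and the category it belongs to
lookup = pd.DataFrame({
    "Offense_Description": ["PROPERTY DAMAGE", "AUTO THEFT",
                            "LARCENY - ALL OTHERS", "TRESPASSING"],
    "Offense_Category": ["Theft-Damage", "Theft-Damage",
                         "Theft-Damage", "Trespassing"],
})

# Hypothetical main data
data = pd.DataFrame({
    "OFFENSE_DESCRIPTION": ["AUTO THEFT", "TRESPASSING",
                            "PROPERTY DAMAGE", "UNKNOWN OFFENSE"],
})

# Turn the lookup table into a Series indexed by description, then map it
mapping = lookup.set_index("Offense_Description")["Offense_Category"]
data["Offense_Category"] = data["OFFENSE_DESCRIPTION"].map(mapping)

print(data)
```

Descriptions that are missing from the lookup table come back as NaN, which makes unmatched values easy to spot.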

Another example with .merge method (while converting the district codes to district names):
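A minimal sketch of the `.merge` variant, assuming a made-up code-to-name table. The district names below are illustrative and the column name `DISTRICT_NAME` is hypothetical.

```python
import pandas as pd

# Hypothetical crime records with district codes
data = pd.DataFrame({
    "DISTRICT": ["A1", "B2", "C6", "A1"],
    "Hour": [1, 14, 22, 9],
})

# Lookup table from district code to district name (illustrative values)
districts = pd.DataFrame({
    "DISTRICT": ["A1", "B2", "C6"],
    "DISTRICT_NAME": ["Downtown", "Roxbury", "South Boston"],
})

# how="left" keeps every row of `data`, like a vlookup
merged = data.merge(districts, on="DISTRICT", how="left")
print(merged)
```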

Wednesday, 13 April 2022

A/B Test Data Munging: Reshaping two columns of numbers into one column of numbers plus group labels

Before doing an A/B test, data may come in different formats such as two data frames or a dataframe with two columns consisting of Group A figures and Group B figures respectively.

In such cases we may need to reorganize the data so that one column consists of the figures and the second column shows their respective groups. Here is a concise method of doing this:

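import pandas as pd

# A and B are assumed to be existing one-column DataFrames (or Series) of figures.

# Data A and Group A: a label column of "A" values that shares A's index
Group_A = pd.DataFrame({"Group": "A"}, index=A.index)
A = pd.concat([A, Group_A], axis=1)
A.columns = ["Income", "Group"]

# Data B and Group B: same idea with "B" labels
Group_B = pd.DataFrame({"Group": "B"}, index=B.index)
B = pd.concat([B, Group_B], axis=1)
B.columns = ["Income", "Group"]

# All data in one DF: stack A on top of B with a fresh index
AB = pd.concat([A, B], ignore_index=True)
print(AB.head())
print(AB.tail())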
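For the case where both groups already sit side by side in one dataframe, `melt` may be a shorter route. A minimal sketch, assuming a made-up wide table whose columns are named "A" and "B":

```python
import pandas as pd

# Hypothetical wide data: one column of figures per group
wide = pd.DataFrame({"A": [100, 120, 90], "B": [110, 95, 130]})

# melt stacks the columns into one value column and one label column
AB = wide.melt(var_name="Group", value_name="Income")[["Income", "Group"]]

print(AB)
```

If the groups have different sizes, the shorter column is padded with NaN in the wide table, so a `dropna()` after melting may be needed.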

Monday, 11 April 2022

Boston Crime Data Analysis - 2: A better heatmap with row and column totals

I've just shared a blog post with a heatmap of the number of crimes per district and day of the week. However, I wanted to add row and column totals and came up with this. By aggregating the numbers, this version also shows the most dangerous districts and days.

I am also adding the code below. The source that helped me reach this solution is at the top of it.

If you add the totals as part of the table (as the last row or column), it ruins the colors, as expected: the aggregated numbers in the totals row and column become the only dark cells and make all the other cells look light, with little visible difference between them.

So in seaborn, a totals column/row can be added by laying out separate axes with matplotlib's plt.subplot2grid function.

Code:

# Source:
# https://stackoverflow.com/questions/33379261/how-can-i-have-a-bar-next-to-python-seaborn-heatmap-which-shows-the-summation-of

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# `data` is the crime DataFrame loaded earlier (DISTRICT, DAY_OF_WEEK and Hour columns)

# Three axes on a 14x10 grid: main heatmap, row totals (right), column totals (bottom)
fig = plt.figure(figsize=(16,12))
ax1 = plt.subplot2grid((14,10), (0,0), rowspan=12, colspan=7)
ax2 = plt.subplot2grid((14,10), (0,8), rowspan=12, colspan=1)
ax3 = plt.subplot2grid((14,10), (13,0), rowspan=1, colspan=7)

# Count the records per district (rows) and day of week (columns)
table_pivot = data.pivot_table("Hour",
                               ["DISTRICT"],
                               columns="DAY_OF_WEEK",
                               aggfunc="count")
column_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
table_pivot2 = table_pivot.reindex(column_order, axis=1)
table_pivot2 = table_pivot2.drop("External", axis=0)

cmap = sns.cm.rocket_r

# Main heatmap
sns.heatmap(table_pivot2,
            ax=ax1,
            annot=True,
            fmt="0.1f",
            linewidths=.8,
            vmin=-0.05,
            cmap=cmap,
            cbar=False).set_title('Number of Crimes per District and Day of Week',
                                  fontsize=14,
                                  fontweight='bold')

# Hard-coded district labels; they must follow the row order of table_pivot2
y_axis_labels = ['A1','A15','A7','B2','B3','C11','C6','D14','D4','D13','E8','E5']
ax1.xaxis.tick_bottom()
ax1.set_xticklabels(table_pivot2.columns, rotation=40)
ax1.yaxis.tick_right()
ax1.set_yticklabels(y_axis_labels, rotation='horizontal')

# Column totals (per day) below the main heatmap
sns.heatmap((pd.DataFrame(table_pivot2.sum(axis=0))).transpose(),
            ax=ax3,
            annot=True,
            fmt='g',
            cmap=cmap,
            cbar=False,
            xticklabels=False,
            yticklabels=False).set(xlabel='', ylabel='Totals')

# Row totals (per district) to the right of the main heatmap
sns.heatmap(pd.DataFrame(table_pivot2.sum(axis=1)),
            ax=ax2,
            annot=True,
            fmt='g',
            cmap=cmap,
            cbar=False,
            xticklabels=False,
            yticklabels=False).set(xlabel='', ylabel='', title='Totals');

Another heatmap here (no. of crimes per district/hours):

Boston Crime Data Analysis - 1: Number of Crimes per weekdays and district Heatmap

Here is a heatmap showing the number of crimes per district and days of week according to City of Boston's public data for the first 3 months of 2022:

Below is the code:

import seaborn as sns

# `data` is the crime DataFrame loaded earlier (DISTRICT, DAY_OF_WEEK and Hour columns)

# Count the records per district (rows) and day of week (columns)
table_pivot = data.pivot_table("Hour",
                               ["DISTRICT"],
                               columns="DAY_OF_WEEK",
                               aggfunc="count")
column_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
table_pivot2 = table_pivot.reindex(column_order, axis=1)
table_pivot2 = table_pivot2.drop("External", axis=0)

cmap = sns.cm.rocket_r
sns.set(rc={'figure.figsize': (15,8)})
sns.heatmap(table_pivot2,
            annot=True,
            fmt="0.1f",
            linewidths=.8,
            vmin=-0.05,
            cmap=cmap)

Note: "Hour" is counted in the pivot table because it has no null values, so the count equals the number of records in each cell.
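A minimal sketch of why the counted column matters, assuming a tiny made-up table with a hypothetical `STREET` column that has gaps. The "count" aggregation skips nulls, so a column with missing values undercounts.

```python
import numpy as np
import pandas as pd

# Hypothetical records: "Hour" is complete, "STREET" has missing values
data = pd.DataFrame({
    "DISTRICT": ["A1", "A1", "B2", "B2"],
    "DAY_OF_WEEK": ["Monday", "Monday", "Monday", "Monday"],
    "Hour": [1, 5, 9, 13],
    "STREET": ["X ST", np.nan, "Y ST", np.nan],
})

# Counting a complete column gives the true number of records per cell
count_hour = data.pivot_table("Hour", ["DISTRICT"],
                              columns="DAY_OF_WEEK", aggfunc="count")

# Counting a column with nulls silently drops the rows where it is missing
count_street = data.pivot_table("STREET", ["DISTRICT"],
                                columns="DAY_OF_WEEK", aggfunc="count")

print(count_hour)
print(count_street)
```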

Converting object-type data to category type while preserving the order

When object-type data is converted to a category, its ordering can be lost. I am adding the explanations and code for converting it while preserving the order here. The explanations are inside the images and the code is at the very bottom.

Code:

# Load the data and take a quick look:
import seaborn as sns
diamonds = sns.load_dataset('diamonds')
df = diamonds.copy()
df.head()

# Ordinal definition: import the categorical data type.
from pandas.api.types import CategoricalDtype

# Look at the first 5 observations in the "cut" column
df.cut.head()

# Convert the "cut" column from object to category type:
df.cut = df.cut.astype(CategoricalDtype(ordered = True))

# Check the new type of the "cut" column:
df.cut.dtypes

# Check the category order of the "cut" column:
df.cut.head(1)

# Our list:
cut_kategoriler = ["Fair","Good","Very Good","Premium","Ideal"]

# and we pass this list to the category type as the correct order:
df.cut = df.cut.astype(CategoricalDtype(categories = cut_kategoriler, ordered = True))

# Check whether it is ordered according to the list:
df.cut.head(1)

Sunday, 10 April 2022

Getting unique values with the .unique() method

To see the unique values in a single column:

titanic["embark_town"].unique()

To see the unique values of multiple columns:

To see the unique values of two columns separately, I could not find any way other than copying a separate dataframe made up of just those two columns.

titanic_2 = titanic[["alive","embark_town"]].copy()

for col in titanic_2:
    print(col)
    print(titanic_2[col].unique())
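One alternative that seems to avoid the copy is to loop over the column names directly. A minimal sketch, assuming a small made-up stand-in for the titanic data:

```python
import pandas as pd

# Small hypothetical stand-in for the titanic dataframe
titanic = pd.DataFrame({
    "alive": ["yes", "no", "no", "yes"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Queenstown"],
    "age": [22.0, 38.0, 26.0, 35.0],
})

# Collect the unique values of only the chosen columns, without copying the frame
unique_by_col = {col: titanic[col].unique() for col in ["alive", "embark_town"]}

for col, values in unique_by_col.items():
    print(col)
    print(values)
```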

That said, it is possible to see the unique values of two columns combined, like this:

import pandas as pd
import numpy as np

column_values = titanic[['who','embark_town']].values.ravel()
unique_values = pd.unique(column_values)
print(unique_values)

To see the unique values of all columns:

1. This code shows the unique values in all columns in a very simple and tidy way:

pd.Series({col:titanic[col].unique() for col in titanic})

2. If we want to see all the unique values in all columns in detail, we can also use this code:

for col in titanic:
    print(col)
    print(titanic[col].unique())

Converting continuous and object variables to categorical variables (with the .cut and Categorical methods) and some pivot techniques

The ".cut" method is very handy for splitting any continuous variable into categories. I learned it from this video (around minute 8).

For example, below we split the values in the "age" column of the titanic dataset into two categories, 0-18 and 18-90.

age=pd.cut(titanic["age"],[0,18,90])
age.head(20)

We can then pass this to a pivot table as a categorical variable too:
titanic.pivot_table("survived", ["sex",age], "class")

An example where we split it into three categories and add labels:

titanic["age_2"]=pd.cut(titanic["age"],[0,18,60,90],labels=["Under_18","Btw18and60","Over_60"])

titanic["age_2"].head(20)

Besides this, it can also be useful to turn some object values into categorical variables. pd.Categorical can be used for that.

For example:

import pandas as pd

titanic.embark_town = pd.Categorical(titanic.embark_town)

titanic['embark_town'].dtypes

Now, on to the pivot table:

Suppose we want to see not the mean but the totals of the 1s and 0s of the categorical variable survived (which consists of 1s and 0s). In other words, we want to see the totals of the 0s and 1s (those who did not survive and those who did) by sex and by "age category". Here the "alive" variable at the beginning is just a variable I used to get the total number of observations, because it has no null values. We could have used any other column without null values instead. If we were taking the mean of the survived variable, "survived" would have to be in place of "alive".

titanic.pivot_table("alive", ["sex", age, "survived"], columns="class", aggfunc = "count")

Tuesday, 5 April 2022

FizzBuzz questions and answers

Write code that lists the numbers from 1 to 50 and prints

'fizz' if the number is a multiple of 3,

'buzz' if the number is a multiple of 5,

'fizzbuzz' if the number is a multiple of both 3 and 5,

or the number itself if the conditions above are not met.

list1 = list(range(1,51))

for i in list1:
    if i%15==0:
        print('fizzbuzz')
    elif i%5==0:
        print('buzz')
    elif i%3==0:
        print('fizz')
    else:
        print(i)

If the question asks to write a function this may help as well:
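A minimal sketch of a function version, assuming the same rules as above. The name `fizzbuzz` and its single-number signature are illustrative choices.

```python
def fizzbuzz(n):
    """Return 'fizzbuzz', 'buzz', 'fizz' or the number itself for one integer."""
    # Check the combined case first, otherwise 15 would match 'buzz' or 'fizz'
    if n % 15 == 0:
        return "fizzbuzz"
    if n % 5 == 0:
        return "buzz"
    if n % 3 == 0:
        return "fizz"
    return n

for i in range(1, 51):
    print(fizzbuzz(i))
```

Returning the value instead of printing inside the function keeps it easy to test and reuse.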
