Thursday, 31 March 2022

Feature engineering: Imputing missing values in a column, based on subgroups of another column

Suppose that we have null values in a column and we need to fill them. Filling every null value with the overall mean or median of the column may not be the best solution.

Why?

Suppose that we have a car database with a "Number of Seats" column. This column may have a median of 5 and a mean close to 5. However, the null values of "Mercedes SLK"s (a roadster with 2 seats) and "Cadillac Escalade"s (a 7-seater) should be treated differently. So we need to group by car model (or by car type, such as roadster, SUV, mid-size, etc.) and fill the null cells with the median or mean of each group.

Assuming that we are using "car type" subgroup, to find the "median" of the subgroups and impute them into the null values, the code for such feature engineering may be as follows:

1st option:

data['Number_of_Seats'] = data['Number_of_Seats'].fillna(data.groupby('Car_Type')['Number_of_Seats'].transform('median'))

2nd option: (there are two more options at the end of the post. I tried them, ran into some errors, and didn't end up using them.)

# 1st step: median number of seats for each car type
median_seats = data.groupby('Car_Type')['Number_of_Seats'].median()
# 2nd step: for rows with a missing value, look up the median of that row's car type
missing = data['Number_of_Seats'].isna()
data.loc[missing, 'Number_of_Seats'] = median_seats[data.loc[missing, 'Car_Type']].values
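
As a minimal sketch, both options can be tried on a toy table. The car data below is made up for illustration and does not come from a real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two car types, each with one missing seat count
data = pd.DataFrame({
    "Car_Type": ["roadster", "roadster", "roadster", "suv", "suv", "suv"],
    "Number_of_Seats": [2, 2, np.nan, 7, np.nan, 7],
})

# Option 1: transform returns one group median per row, aligned with the original index
group_median = data.groupby("Car_Type")["Number_of_Seats"].transform("median")
option_1 = data["Number_of_Seats"].fillna(group_median)

# Option 2: look up the group median of each missing row by its Car_Type
median_seats = data.groupby("Car_Type")["Number_of_Seats"].median()
missing = data["Number_of_Seats"].isna()
option_2 = data["Number_of_Seats"].copy()
option_2[missing] = median_seats[data.loc[missing, "Car_Type"]].values

print(option_1.tolist())  # [2.0, 2.0, 2.0, 7.0, 7.0, 7.0]
```

On this toy table both options give the same filled column; the roadster gap becomes 2 and the SUV gap becomes 7, instead of a global value near 5.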

A good way to choose between median and mean is to make copies of the notebook before filling the null values, fill the null values with a different method in each copy, run the model (regression, decision tree, etc.) in each one, and compare the results.

A subgroup may not include a value for our column at all. For example, we may have 5 Mercedes SLKs, none of which has a Number_of_Seats value. In that case we may try another relevant column (such as Car_Type above). The easy solution of taking the overall column mean/median should be the last resort. If the dataset is big enough, dropping those rows may be a better solution.
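
The fallback order described above can be sketched as a chain of fills. This is only an illustration: the Car_Model column and the rows below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: no "SLK" row has a seat count at all
data = pd.DataFrame({
    "Car_Model": ["SLK", "SLK", "Escalade", "Escalade", "Boxster"],
    "Car_Type": ["roadster", "roadster", "suv", "suv", "roadster"],
    "Number_of_Seats": [np.nan, np.nan, 7, np.nan, 2],
})

seats = data["Number_of_Seats"]
# First try the narrow subgroup (model); groups with no values stay null
by_model = data.groupby("Car_Model")["Number_of_Seats"].transform("median")
# Then fall back to the broader subgroup (type)
by_type = data.groupby("Car_Type")["Number_of_Seats"].transform("median")
# Last resort: the overall column median
filled = seats.fillna(by_model).fillna(by_type).fillna(seats.median())

print(filled.tolist())  # [2.0, 2.0, 7.0, 7.0, 2.0]
```

Here the SLK rows cannot be filled from their own model, so they take the roadster median; the overall median is never needed in this example.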

Other code options:
3rd option:
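# median() does not accept a grouping column, so map each row's Car_Type to its group median instead
data['Number_of_Seats'] = data['Number_of_Seats'].fillna(data['Car_Type'].map(data.groupby('Car_Type')['Number_of_Seats'].median()))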
4th option to work on:
df.groupby(["A","B"]).Z.median()


Monday, 28 March 2022

Playing with text - String munging

Some string exercises are available here:

Splitting a string into its words
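
A minimal sketch of splitting into words, first for a plain Python string and then for a pandas column. The sample strings are invented for illustration:

```python
import pandas as pd

sentence = "Feature engineering with pandas"
# str.split() with no argument splits on any run of whitespace
words = sentence.split()
print(words)  # ['Feature', 'engineering', 'with', 'pandas']

# The same idea for a whole column, via the .str accessor
titles = pd.Series(["Hello World", "String munging in Python"])
word_lists = titles.str.split()
print(word_lists.tolist())
```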






Trimming a string from the right or the left

OCCURRED_ON_DATE example entry: 2022-02-02 00:00:00
What we want:
Date column: 2022-02-02
Hour Column: 00:00:00

import pandas as pd

data["Date"] = data['OCCURRED_ON_DATE'].str[:10]
data["Hour"] = data['OCCURRED_ON_DATE'].str[11:]
data.head() 

Hour column could also be created as:
data["Hour"] = data['OCCURRED_ON_DATE'].str[-8:]
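
Slicing by position works only while every entry has exactly this layout. As a hedged alternative, the sketch below splits on the space or parses the text with pd.to_datetime; the sample rows and the extra column names (Date_from_dt, Hour_from_dt) are made up for illustration:

```python
import pandas as pd

# Hypothetical sample rows in the same layout as the example entry
data = pd.DataFrame({"OCCURRED_ON_DATE": ["2022-02-02 00:00:00", "2022-03-15 13:45:10"]})

# Option A: split on the single space between the date and the time
parts = data["OCCURRED_ON_DATE"].str.split(" ", expand=True)
data["Date"] = parts[0]
data["Hour"] = parts[1]

# Option B: parse to datetime, then format each part back to text
parsed = pd.to_datetime(data["OCCURRED_ON_DATE"])
data["Date_from_dt"] = parsed.dt.strftime("%Y-%m-%d")
data["Hour_from_dt"] = parsed.dt.strftime("%H:%M:%S")

print(data)
```

Option B has the side benefit of raising an error on entries that are not valid timestamps, which slicing would silently accept.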



Reversing a string

txt = "Hello World"[::-1]
print(txt)

This produces the following output: dlroW olleH

Friday, 25 March 2022

Generating random numbers or series from various distributions

How do we generate random numbers or series in Python according to various distributions (for example, the normal or the uniform distribution)?

Let's see:

The following code returns a number drawn from a uniform distribution (by default it picks the number between 0 and 1):

import numpy as np
np.random.uniform()

This one also draws from a uniform distribution, but returns a number within a given range:

np.random.uniform(low=0, high=10)

To create an array, we enter the following code (the first two numbers are the bounds, the last one is the number of elements):

np.random.uniform(0, 10, size=4)

And if we want to create more than one array (for example, 2 arrays of 3 elements each):

np.random.uniform(0, 10, size=(2, 3))

To show these arrays in a histogram:

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
a = np.random.uniform(0, 10, 10000)
sns.histplot(a)
plt.show()

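Now let's do the same with the normal distribution (the only change is writing "normal" instead of "uniform"; this applies to the examples above as well. Note that for "normal" the first two arguments are the mean and the standard deviation, not the bounds.)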

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
a = np.random.normal(0, 10, 10000)
sns.histplot(a)
plt.show()
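
To make the difference between the two sets of arguments concrete, here is a small numeric check without plotting. It uses a seeded np.random.default_rng generator, which is my choice for repeatability and not something the code above relies on:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator so the run is repeatable

u = rng.uniform(0, 10, size=10000)            # bounds: low=0, high=10
n = rng.normal(loc=0, scale=10, size=10000)   # mean=0, standard deviation=10

# Uniform samples stay inside [0, 10)
print(u.min() >= 0, u.max() < 10)
# Normal samples are centred on 0 with a spread of about 10, and are not bounded
print(round(n.mean(), 1), round(n.std(), 1))
# Share of normal samples outside [-10, 10]: roughly a third
outside = (np.abs(n) > 10).mean()
print(outside)
```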

And what code do we write if we want to generate random points in a two- or three-dimensional space? Here it is:

from matplotlib import pyplot as plt
import numpy as np
# Generate 100 random data points along 3 dimensions
x, y, scale = np.random.randn(3, 100)
fig, ax = plt.subplots()
# Map each onto a scatterplot we'll create with Matplotlib
ax.scatter(x=x, y=y, c=scale, s=np.abs(scale)*500)
ax.set(title="Some random data, created with JupyterLab!")
plt.show()

Dropping rows

Dropping rows may be necessary for various reasons (outliers, null values, etc.).

The code for these:

Dropping all rows that contain missing data at once with dropna:

df = df.dropna()

Dropping rows by index:

df = df.drop(index=2)

df = df.drop(index=[2,4,6])

df = df.drop(index=['Item_B','Item_D'])
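
A minimal sketch of the index-based calls on a toy frame, plus dropping by a condition, which is the usual route for outliers. The prices and the threshold of 100 are invented for illustration:

```python
import pandas as pd

# Hypothetical toy data with labelled rows; Item_D plays the role of an outlier
df = pd.DataFrame(
    {"price": [10.0, 20.0, 30.0, 400.0]},
    index=["Item_A", "Item_B", "Item_C", "Item_D"],
)

# Drop rows by index label
by_label = df.drop(index=["Item_B", "Item_D"])

# Drop rows by position: translate positions to labels first
by_position = df.drop(index=df.index[[0, 2]])

# Drop rows by condition: keep only the rows that pass the check
by_condition = df[df["price"] <= 100]

print(by_label.index.tolist())      # ['Item_A', 'Item_C']
print(by_position.index.tolist())   # ['Item_B', 'Item_D']
print(by_condition.index.tolist())  # ['Item_A', 'Item_B', 'Item_C']
```

None of these calls changes df itself; each returns a new frame, which is why the snippets above assign the result back.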

Feature engineering - Organizing the data

I don't know the exact Turkish term for feature engineering (perhaps "veri/nitelik/özellik yönetimi", i.e. data/attribute/feature management, would do).

In this post I will put examples of how data that is missing, or that somehow needs to be completed or transformed, can be handled. I expect it will be mostly Python code.


Finding Missing Values

Finding the number of missing values in each column:

data.isnull().sum()

Finding the number of missing values in a specific column:

data['the_column_in_question'].isnull().sum()
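
A small sketch of these counts on a toy frame (the columns a and b are hypothetical), with the share of missing values added because it is often easier to read than a raw count:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one gap in "a", two gaps in "b"
data = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, "x"]})

per_column = data.isnull().sum()   # missing count for each column
total = int(per_column.sum())      # missing count for the whole table
share = data.isnull().mean()       # fraction of missing values per column

print(per_column.to_dict())  # {'a': 1, 'b': 2}
print(total)                 # 3
```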


Dropping null values:
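
A minimal sketch of the common dropna variants, on a made-up frame. The parameters how, subset, axis and thresh are part of pandas' DataFrame.dropna; the data is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is partly empty, row 2 is completely empty
df = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [4, 5, np.nan],
    "c": [7, 8, np.nan],
})

any_null = df.dropna()               # drop rows with at least one null
all_null = df.dropna(how="all")      # drop rows only if every value is null
subset = df.dropna(subset=["b"])     # only look at column "b" when deciding
cols = df.dropna(axis=1, thresh=2)   # keep columns with at least 2 non-null values

print(any_null.index.tolist())  # [0]
print(all_null.index.tolist())  # [0, 1]
print(subset.index.tolist())    # [0, 1]
print(cols.columns.tolist())    # ['b', 'c']
```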


Imputing missing values in a certain column based on that column:

(here the median is used, but using the mean, for example, is also an option)

data['the_column_in_question'] = data['the_column_in_question'].fillna(data['the_column_in_question'].median())

For more than one column, there are these as well:
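
One way to do this for several columns at once, as a sketch: pass fillna a Series of per-column medians, which pandas matches to the columns by name. The column names and values below are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns with gaps and one text column
data = pd.DataFrame({
    "age": [20, np.nan, 40],
    "income": [100.0, 300.0, np.nan],
    "city": ["A", None, "B"],
})

num_cols = ["age", "income"]
# data[num_cols].median() is a Series indexed by column name,
# so each column is filled with its own median
data[num_cols] = data[num_cols].fillna(data[num_cols].median())

print(data["age"].tolist())     # [20.0, 30.0, 40.0]
print(data["income"].tolist())  # [100.0, 300.0, 200.0]
```

The text column is left untouched here, since a median is not defined for it.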





Imputing/handling missing values in a certain column based on values in other columns:

Example: direct transfer from another column

# fill missing values
df['Postal Address'] = df['Postal Address'].fillna(df['Permanent Address'])
print(df)
# the same pattern with generic column names
df['target_column'] = df['target_column'].fillna(df['other_column'])
print(df)


Example: the product of the values in two other columns

df['C'] = df['C'].fillna(df['A'] * df['B'])
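
A tiny worked example of this fill, with made-up numbers, showing that only the missing cell is computed and existing values are kept even when they disagree with the product:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: C is missing in the second row only
df = pd.DataFrame({
    "A": [2, 3, 4],
    "B": [10, 10, 10],
    "C": [20.0, np.nan, 99.0],
})

# fillna aligns on the index, so each gap takes A * B from its own row
df["C"] = df["C"].fillna(df["A"] * df["B"])

print(df["C"].tolist())  # [20.0, 30.0, 99.0]
```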

Another example:

Here, for the empty values in the City column, the code looks at the other column (State) and fills in whichever City value occurs most often for that state.

import numpy as np
# treat empty strings as missing, then fill each state's gaps with that state's most frequent city
df['City'] = df['City'].replace('', np.nan)
df['City'] = df.groupby('State')['City'].transform(lambda x: x.fillna(x.value_counts().idxmax()))
df
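
A sketch of the same idea on a toy frame, written with a small helper (fill_with_mode is a hypothetical name) that also copes with a state having no city values at all. The rows are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one empty string and one NaN in City
df = pd.DataFrame({
    "State": ["TX", "TX", "TX", "CA", "CA"],
    "City": ["Austin", "Austin", "", "LA", np.nan],
})

# Treat empty strings as missing values
df["City"] = df["City"].replace("", np.nan)

def fill_with_mode(cities):
    # mode() ignores nulls; it is empty when the group has no values at all
    modes = cities.mode()
    if modes.empty:
        return cities
    return cities.fillna(modes.iloc[0])

# transform keeps the original row order and index
df["City"] = df.groupby("State")["City"].transform(fill_with_mode)

print(df["City"].tolist())  # ['Austin', 'Austin', 'Austin', 'LA', 'LA']
```

When two cities are tied for most frequent, mode() returns them sorted and this sketch simply takes the first one.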

Thursday, 24 March 2022

Papers claiming that cash-like anonymity is not possible..


https://www.bankofcanada.ca/2020/06/staff-analytical-note-2020-9/

  • Techniques to achieve cash-like privacy are immature. They have limited deployments, none of which comply with know-your-customer (KYC) and anti–money laundering (AML) regulations. Their risks include hidden vulnerabilities, a lack of scalability and complicated operations.

https://www.bankofengland.co.uk/-/media/boe/files/paper/2020/central-bank-digital-currency-opportunities-challenges-and-design.pdf
p.32

The appropriate degree of anonymity in a CBDC system is a political and social question, rather than a narrow technical question. As discussed above, CBDC would need to be compliant with AML regulations, which rules out truly anonymous payments.


https://www.riksbank.se/globalassets/media/rapporter/staff-memo/engelska/2021/on-the-possibility-of-a-cash-like-cbdc.pdf

Third, legal regulations require digital payments to be non-anonymous. According to current regulations in the EU, account-based systems must have registers that make it possible to establish the identity of the owner of each account (Sveriges Riksbank, 2018). We suspect that any system where CBDCs are stored remotely will fall under this regulation, regardless of whether they involve tokens or not. In the case of locally stored tokens, anti-money laundering regulations at present allow for the payer to make a payment of up to EUR 150 without needing to identify themselves. 

https://www.riksbank.se/globalassets/media/rapporter/e-krona/2018/the-riksbanks-e-krona-project-report-2.pdf
p.16

 At the other end of the scale, there is something similar to deposits, where there are no limits on amounts and the bearer of the value, krona on a card or in an app, is linked to an individual who is a registered user and thus not anonymous.

Same page: Under the title "Payments with e-krona will be traceable":
An account-based e-krona must therefore, according to current legislation, be based on an owner register to be able to establish who the owner of the account is.


Under the title "Value-based e-krona can allow anonymous payments according to anti-money laundering regulations":

At present, this is when the payment amounts to less than EUR 250. (Now lowered to EUR 150)


MiCA Proposal - Two summaries:

https://www.sygna.io/blog/what-is-mica-markets-in-crypto-assets-eu-regulation-guide/

https://www.ashurst.com/en/news-and-insights/legal-updates/10-things-you-need-to-know-about-mica-europes-proposals-for-regulating-crypto-assets/

MiCA Proposal Text

https://eur-lex.europa.eu/resource.html?uri=cellar:f69f89bb-fe54-11ea-b44f-01aa75ed71a1.0001.02/DOC_1&format=PDF

Prohibition of private currencies. Page 93

The operating rules of the trading platform for crypto-assets shall prevent the admission to trading of crypto-assets which have inbuilt anonymisation function unless the holders of the crypto-assets and their transaction history can be identified by the crypto-asset service providers that are authorised for the operation of a trading platform for crypto-assets or by competent authorities. 


FINCEN Proposal

https://decrypt.co/53178/coinbase-square-rally-against-fincens-proposed-crypto-crackdown

A question of civil liberties

The Electronic Frontier Foundation (EFF), a civil liberties organization, has emphasized the perceived impact FinCEN’s proposal would have on privacy. 

“The proposed regulation would undermine the civil liberties of cryptocurrency users,” the EFF said in a prepared statement, adding, “Anonymity is important precisely because financial records can be deeply personal and revealing: they provide an intimate window into a person’s life, revealing familial, political, professional, religious and sexual associations.” 


To underline its point, the EFF pointed to photographs of Hong Kong protests that showed long lines of individuals trying to purchase subway tickets with cash so that electronic purchases would not place them at scenes of protests. “These photos underscore the importance of anonymous transactions for civil liberties,” the EFF said. 

The EFF also urged FinCEN to allow at least 60 days for consultation in order to correct “the serious abnormalities of this rulemaking process.”