pandas - Data Manipulation

Author

Taygun Bulmus

Published

26 August 2024

`pandas` Module Review

Let's list the commands we have learned so far.

import pandas as pd
# Series
s= pd.Series([100, 200, 300, 400, 500],
    index= ["Londra", "Paris", "Roma", "Berlin", "Oslo"])
s.index
s.values
s.name
# Access a Series element by its label (index)
s["Londra"]
s.Londra
s.describe()
s.max()
s.min()
s.std()
s.mean()
s.median()
s.info()
s.plot(kind= 'line') # other kinds: bar, barh, pie
print("---o---o---o---")
# DataFrames
df= pd.DataFrame({
    "Şehir": ["Londra", "Paris", "Roma", "Berlin", "Oslo"],
    "Nüfus": [100, 200, 300, 400, 500],
    "Alan": [1000, 2000, 3000, 4000, 5000]
})
df.columns
df["Nüfus"] # select a column by name
df[["Nüfus","Şehir"]]
df.Nüfus # attribute access to a column; avoid non-ASCII characters in column names
df.loc[0] # no custom row labels (index) were set, so the label is 0
df.loc[[0, 1]] # no custom row labels (index) were set, so the labels are 0 and 1
df.rename(columns= {"Nüfus" : "Nüfuslar"})
df.info()

Loading Data

With pandas we can read data from a local file, and we can just as easily load it directly from a URL.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
print(df)

     Rank        City            State Population Date of census/estimate
0       1   London[2]   United Kingdom  8,615,246             1 June 2014
1       2      Berlin          Germany  3,437,916             31 May 2014
2       3      Madrid            Spain  3,165,235          1 January 2014
3       4        Rome            Italy  2,872,086       30 September 2014
4       5       Paris           France  2,273,305          1 January 2013
..    ...         ...              ...        ...                     ...
100   101        Bonn          Germany    309,869        31 December 2012
101   102       Malmö           Sweden    309,105           31 March 2013
102   103  Nottingham   United Kingdom    308,735            30 June 2012
103   104    Katowice           Poland    308,269            30 June 2012
104   105      Kaunas        Lithuania    306,888          1 January 2013

[105 rows x 5 columns]

Renaming Columns

Now let's rename the "Rank" column of the df DataFrame to "Ranks".

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Columns
print(df.columns)
# Rename a column
df= df.rename(columns={'Rank':'Ranks'})
print(df.columns)

Index(['Rank', 'City', 'State', 'Population', 'Date of census/estimate'], dtype='object')
Index(['Ranks', 'City', 'State', 'Population', 'Date of census/estimate'], dtype='object')

We can do the same thing with inplace=True.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Rename the column in place; no reassignment is needed
df.rename(columns={'Rank':'Ranks'}, inplace=True)
print(df.columns)

Index(['Ranks', 'City', 'State', 'Population', 'Date of census/estimate'], dtype='object')

Many DataFrame methods accept an inplace parameter. Instead of writing df=df. ... every time, we can pass inplace=True.
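The difference between the two styles can be checked on a tiny DataFrame. The following is a minimal sketch; the `cities` frame and its values are made up for illustration and are not the dataset used above.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data, not the european_cities file)
cities = pd.DataFrame({"Rank": [1, 2], "City": ["London", "Berlin"]})

# Without inplace, rename returns a new DataFrame and leaves the original untouched
renamed = cities.rename(columns={"Rank": "Ranks"})
print(list(cities.columns))   # the original still has the old name
print(list(renamed.columns))  # the returned copy has the new name

# With inplace=True the original DataFrame itself is modified
cities.rename(columns={"Rank": "Ranks"}, inplace=True)
print(list(cities.columns))
```

Reassigning the result (df= df.rename(...)) and passing inplace=True lead to the same final DataFrame, so the choice is mostly a matter of style.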

Showing the Head and the Tail

To take a quick look at the contents of a very large DataFrame, we use the head and tail methods.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Show the first 5 rows
print(df.head())
print("---o---o---o---")
# Show the last 5 rows
print(df.tail())
print("---o---o---o---")
# Show the first 7 rows
df.head(7)

   Rank       City            State Population Date of census/estimate
0     1  London[2]   United Kingdom  8,615,246             1 June 2014
1     2     Berlin          Germany  3,437,916             31 May 2014
2     3     Madrid            Spain  3,165,235          1 January 2014
3     4       Rome            Italy  2,872,086       30 September 2014
4     5      Paris           France  2,273,305          1 January 2013
---o---o---o---
     Rank        City            State Population Date of census/estimate
100   101        Bonn          Germany    309,869        31 December 2012
101   102       Malmö           Sweden    309,105           31 March 2013
102   103  Nottingham   United Kingdom    308,735            30 June 2012
103   104    Katowice           Poland    308,269            30 June 2012
104   105      Kaunas        Lithuania    306,888          1 January 2013
---o---o---o---

	Rank	City	State	Population	Date of census/estimate
0	1	London[2]	United Kingdom	8,615,246	1 June 2014
1	2	Berlin	Germany	3,437,916	31 May 2014
2	3	Madrid	Spain	3,165,235	1 January 2014
3	4	Rome	Italy	2,872,086	30 September 2014
4	5	Paris	France	2,273,305	1 January 2013
5	6	Bucharest	Romania	1,883,425	20 October 2011
6	7	Vienna	Austria	1,794,770	1 January 2015

Accessing Rows and Columns Like a `numpy` Array

A pandas DataFrame can also be accessed by position, like a numpy array. For this we use iloc.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Show the first 5 rows
print(df.head())
print("---o---o---o---")
# numpy analogy: mat1= np.array([[1,2],[3,4]])
# mat1[1,0] -> 3
print(df.iloc[2,3])
print("---o---o---o---")
print(df.iloc[0:2,1:3])

   Rank       City            State Population Date of census/estimate
0     1  London[2]   United Kingdom  8,615,246             1 June 2014
1     2     Berlin          Germany  3,437,916             31 May 2014
2     3     Madrid            Spain  3,165,235          1 January 2014
3     4       Rome            Italy  2,872,086       30 September 2014
4     5      Paris           France  2,273,305          1 January 2013
---o---o---o---
3,165,235
---o---o---o---
        City            State
0  London[2]   United Kingdom
1     Berlin          Germany
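To make the contrast with label-based access concrete, here is a small sketch comparing iloc and loc. The `df_small` frame and its string labels are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels so that positions and labels differ
df_small = pd.DataFrame(
    {"City": ["London", "Berlin", "Madrid"], "Rank": [1, 2, 3]},
    index=["a", "b", "c"],
)

# iloc is position based; the end of a slice is excluded, as in numpy
by_position = df_small.iloc[0:2, 0:1]

# loc is label based; the end of a slice is included
by_label = df_small.loc["a":"b", "City":"City"]

print(by_position)
print(by_label)

# Single element: row position 2, column position 1
print(df_small.iloc[2, 1])
```

Both selections return the same two rows and one column, even though the slice bounds are written differently.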

Getting the Shape

Just as in numpy, the shape attribute gives the dimensions of the DataFrame.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Show the shape
df.shape

(105, 5)

Adding and Deleting Columns

To add a new column to a DataFrame, we assign values to df["New Column"].

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Add a NumericPopulation column and copy the Population column into it.
df["NumericPopulation"]= df["Population"]
print(df.head())

   Rank       City            State Population Date of census/estimate  \
0     1  London[2]   United Kingdom  8,615,246             1 June 2014   
1     2     Berlin          Germany  3,437,916             31 May 2014   
2     3     Madrid            Spain  3,165,235          1 January 2014   
3     4       Rome            Italy  2,872,086       30 September 2014   
4     5      Paris           France  2,273,305          1 January 2013   

  NumericPopulation  
0         8,615,246  
1         3,437,916  
2         3,165,235  
3         2,872,086  
4         2,273,305

To delete a column from a DataFrame we use df.drop("Column Name", axis=1) or df.pop("Column Name"); to delete a row we use df.drop("Index Label", axis=0).

df.pop modifies the DataFrame directly and returns the removed column, whereas df.drop returns a new DataFrame and leaves the original unchanged; to change the original we need to add inplace=True.
axis=0 is the default value, so there is no need to write axis=0 explicitly.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Show the column names
print(df.columns)
print("---o---o---o---")
# Delete the Population column
df.drop("Population", axis=1, inplace=True)
#df.pop("Population") # <----- this does the same job
print(df.columns)
print("---o---o---o---")
print(df.head())
print("---o---o---o---")
print(df.drop(0, axis=0).head(3))
print("---o---o---o---")
print(df.head())

Index(['Rank', 'City', 'State', 'Population', 'Date of census/estimate'], dtype='object')
---o---o---o---
Index(['Rank', 'City', 'State', 'Date of census/estimate'], dtype='object')
---o---o---o---
   Rank       City            State Date of census/estimate
0     1  London[2]   United Kingdom             1 June 2014
1     2     Berlin          Germany             31 May 2014
2     3     Madrid            Spain          1 January 2014
3     4       Rome            Italy       30 September 2014
4     5      Paris           France          1 January 2013
---o---o---o---
   Rank    City     State Date of census/estimate
1     2  Berlin   Germany             31 May 2014
2     3  Madrid     Spain          1 January 2014
3     4    Rome     Italy       30 September 2014
---o---o---o---
   Rank       City            State Date of census/estimate
0     1  London[2]   United Kingdom             1 June 2014
1     2     Berlin          Germany             31 May 2014
2     3     Madrid            Spain          1 January 2014
3     4       Rome            Italy       30 September 2014
4     5      Paris           France          1 January 2013
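The different behaviour of drop and pop can be seen side by side in the sketch below. The `toy` frame is a made-up three-column example, not the full dataset.

```python
import pandas as pd

# Hypothetical mini version of the table
toy = pd.DataFrame({
    "City": ["Rome", "Paris"],
    "Rank": [4, 5],
    "Population": [2872086, 2273305],
})

# drop returns a new DataFrame; toy itself keeps all three columns
without_rank = toy.drop("Rank", axis=1)
print(list(without_rank.columns))
print(list(toy.columns))

# pop removes the column from toy and returns it as a Series
population = toy.pop("Population")
print(list(toy.columns))
print(population.tolist())
```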

Unique Values

The unique method returns the distinct (unique) values in a column.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Values in the State column
print(df["State"].values)
print("---o---o---o---")
# Unique values in the State column
print(df["State"].unique())

[' United Kingdom' ' Germany' ' Spain' ' Italy' ' France' ' Romania'
 ' Austria' ' Germany' ' Hungary' ' Poland' ' Spain' ' Germany' ' Italy'
 ' Bulgaria' ' Czech Republic' ' Belgium' ' United Kingdom' ' Germany'
 ' Italy' ' Sweden' ' Italy' ' France' ' Netherlands' ' Croatia' ' Spain'
 ' Poland' ' United Kingdom' ' Poland' ' Germany' ' Latvia' ' Spain'
 ' Italy' ' Spain' ' Greece' ' Poland' ' Netherlands' ' Finland'
 ' Germany' ' United Kingdom' ' Italy' ' Germany' ' Germany' ' Germany'
 ' Spain' ' Denmark' ' United Kingdom' ' Portugal' ' Poland' ' Germany'
 ' Lithuania' ' Germany' ' Germany' ' Sweden' ' Ireland' ' United Kingdom'
 ' Germany' ' Netherlands' ' United Kingdom' ' Belgium' ' United Kingdom'
 ' Germany' ' Germany' ' France' ' United Kingdom' ' Poland' ' France'
 ' Spain' ' Estonia' ' United Kingdom' 'Slovakia Slovak Republic'
 ' Poland' ' Spain' ' Italy' ' Spain' ' Italy' ' Czech Republic' ' Poland'
 ' Germany' ' Spain' ' United Kingdom' ' Poland' ' France' ' Germany'
 ' Bulgaria' ' Bulgaria' ' Spain' ' United Kingdom' ' Netherlands'
 ' Spain' ' Germany' ' United Kingdom' ' Denmark' ' Romania'
 ' United Kingdom' ' Italy' ' Greece' ' United Kingdom' ' Romania'
 ' Italy' ' Spain' ' Germany' ' Sweden' ' United Kingdom' ' Poland'
 ' Lithuania']
---o---o---o---
[' United Kingdom' ' Germany' ' Spain' ' Italy' ' France' ' Romania'
 ' Austria' ' Hungary' ' Poland' ' Bulgaria' ' Czech Republic' ' Belgium'
 ' Sweden' ' Netherlands' ' Croatia' ' Latvia' ' Greece' ' Finland'
 ' Denmark' ' Portugal' ' Lithuania' ' Ireland' ' Estonia'
 'Slovakia Slovak Republic']

Counting Unique Values

The value_counts() method shows how many times each unique value occurs in a column.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Number of occurrences of each unique value in the State column
print(df["State"].value_counts())

State
 Germany                    19
 United Kingdom             16
 Spain                      13
 Italy                      10
 Poland                     10
 France                      5
 Netherlands                 4
 Romania                     3
 Sweden                      3
 Bulgaria                    3
 Lithuania                   2
 Czech Republic              2
 Belgium                     2
 Greece                      2
 Denmark                     2
 Austria                     1
 Hungary                     1
 Croatia                     1
 Latvia                      1
 Finland                     1
 Portugal                    1
 Ireland                     1
 Estonia                     1
Slovakia Slovak Republic     1
Name: count, dtype: int64
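As a small complement, the sketch below relates unique, nunique and value_counts on a hand-made Series. The `states` values are illustrative assumptions; normalize=True is an optional parameter of value_counts that returns fractions instead of counts.

```python
import pandas as pd

# Hypothetical column with repeated values
states = pd.Series(["Germany", "Spain", "Germany", "Italy", "Germany", "Spain"])

# Number of distinct values (the length of unique())
n_distinct = states.nunique()
print(n_distinct)

# Absolute counts, sorted from most to least frequent
counts = states.value_counts()
print(counts)

# Relative frequencies instead of counts
shares = states.value_counts(normalize=True)
print(shares)
```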

Duplicated Values

The duplicated method marks the values in a column that have already appeared earlier in that column. It returns a Series of True and False values.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Mark the repeated values in the State column
print(df["State"].duplicated().head(8))
print("---o---o---o---")
# First 8 values of the State column
print(df.State.head(8))

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
Name: State, dtype: bool
---o---o---o---
0     United Kingdom
1            Germany
2              Spain
3              Italy
4             France
5            Romania
6            Austria
7            Germany
Name: State, dtype: object
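Because duplicated returns booleans, it can be used as a mask. The sketch below shows this on an invented Series (`states` is an assumption for illustration), together with the keep parameter and the related drop_duplicates method.

```python
import pandas as pd

# Hypothetical column with repeats
states = pd.Series(["Germany", "Spain", "Germany", "Italy", "Spain"])

# True for every occurrence after the first one
mask = states.duplicated()
print(mask.tolist())

# Keep only first occurrences by inverting the mask
first_only = states[~mask]
print(first_only.tolist())

# keep=False marks every value that occurs more than once
all_repeats = states.duplicated(keep=False)
print(all_repeats.tolist())

# drop_duplicates gives the same result as the inverted mask
deduped = states.drop_duplicates()
```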

Data Manipulation and Data Cleaning

The data we receive may contain things that need to be changed systematically. Data in this state is called raw data. Raw data may contain unwanted or incorrect rows and badly formatted columns. To clean up such flaws and bring the data into a workable state, we need to manipulate it.

To see this, let's look at df again.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Show the first 3 rows
print(df.head(3))
print("---o---o---o---")
# Summary information about the data
print(df.info())
print("---o---o---o---")
# First 3 values of the State column
print(df["State"].head(3).values)

   Rank       City            State Population Date of census/estimate
0     1  London[2]   United Kingdom  8,615,246             1 June 2014
1     2     Berlin          Germany  3,437,916             31 May 2014
2     3     Madrid            Spain  3,165,235          1 January 2014
---o---o---o---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Rank                     105 non-null    int64 
 1   City                     105 non-null    object
 2   State                    105 non-null    object
 3   Population               105 non-null    object
 4   Date of census/estimate  105 non-null    object
dtypes: int64(1), object(4)
memory usage: 4.2+ KB
None
---o---o---o---
[' United Kingdom' ' Germany' ' Spain']

The flaws visible at first glance are:

1. In the row with index 0, the city is written as London[2].
2. The Population column should be numeric but is stored as object, that is, as strings.
3. The values in the State column have one extra space at the beginning.

Let's fix these problems in order.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
## 1.
df.iloc[0,1]= "London"
print(df.loc[0])
print("---o---o---o---")
## 2.
## The command below fails because the values contain commas as thousands separators and cannot be parsed as numbers.
#df["Population"]= df["Population"].astype(float)
# Remove all commas and convert to int.
df["Population"]=df.Population.apply(lambda x: int(x.replace(',','')))
# Example of str.replace:
# str1='Naber Gizem'
# print(str1.replace('G',''))
print(df.info())
print("---o---o---o---")
print(df.head())
print("---o---o---o---")
## 3.
print(df["State"].head(3).values)
print("---o---o---o---")
# strip() removes whitespace from the beginning and the end of a string.
df.State= df.State.apply(lambda x: x.strip())
# Show
print(df["State"].head(3).values)

Rank                                     1
City                                London
State                       United Kingdom
Population                       8,615,246
Date of census/estimate        1 June 2014
Name: 0, dtype: object
---o---o---o---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Rank                     105 non-null    int64 
 1   City                     105 non-null    object
 2   State                    105 non-null    object
 3   Population               105 non-null    int64 
 4   Date of census/estimate  105 non-null    object
dtypes: int64(2), object(3)
memory usage: 4.2+ KB
None
---o---o---o---
   Rank    City            State  Population Date of census/estimate
0     1  London   United Kingdom     8615246             1 June 2014
1     2  Berlin          Germany     3437916             31 May 2014
2     3  Madrid            Spain     3165235          1 January 2014
3     4    Rome            Italy     2872086       30 September 2014
4     5   Paris           France     2273305          1 January 2013
---o---o---o---
[' United Kingdom' ' Germany' ' Spain']
---o---o---o---
['United Kingdom' 'Germany' 'Spain']

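We can capitalize the first letter of the values in a column with df.State.str.capitalize(). Note that capitalize also lowercases the remaining characters, so "United Kingdom" becomes "United kingdom".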
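The same three fixes can also be written with the vectorized .str accessor instead of apply with a lambda. The following is a sketch under stated assumptions: the two-row CSV text is a hand-made excerpt in the same layout as the file above, and the regular expression for footnote markers such as [2] is an illustrative choice rather than part of the original lesson.

```python
import io
import pandas as pd

# Hand-made excerpt mimicking the raw file (assumption: same layout as european_cities.csv)
raw = io.StringIO(
    "Rank,City,State,Population\n"
    '1,London[2], United Kingdom,"8,615,246"\n'
    '2,Berlin, Germany,"3,437,916"\n'
)
df_raw = pd.read_csv(raw)

# 1. Remove footnote markers like "[2]" from every city name, not just row 0
df_raw["City"] = df_raw["City"].str.replace(r"\[\d+\]", "", regex=True)

# 2. Drop the thousands separators and convert to integers
df_raw["Population"] = (
    df_raw["Population"].str.replace(",", "", regex=False).astype(int)
)

# 3. Strip leading and trailing whitespace
df_raw["State"] = df_raw["State"].str.strip()

print(df_raw)
```

Applying the pattern to the whole City column would also clean entries such as Hamburg[10] and Brussels[17], which the single-cell assignment above leaves untouched.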

Label (Index) Operations

The DataFrame we are using has no meaningful labels, only the default integer index. Since the whole dataset is organized around cities, we can use the city names as labels.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Use the values of City as the index labels
df.index= df["City"]
# Show
df.head()

	Rank	City	State	Population	Date of census/estimate
City
London[2]	1	London[2]	United Kingdom	8,615,246	1 June 2014
Berlin	2	Berlin	Germany	3,437,916	31 May 2014
Madrid	3	Madrid	Spain	3,165,235	1 January 2014
Rome	4	Rome	Italy	2,872,086	30 September 2014
Paris	5	Paris	France	2,273,305	1 January 2013

Looking at the header, we can see that the index is also named City. To keep the index name apart from the column names, it is printed on a separate line below them.

Recall that we can also see the labels with df.index.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Use the values of City as the index labels
df.index= df["City"]
# Show the Rank column. The labels are shown as well.
print(df.Rank.head())
print("---o---o---o---")
# Show the labels as a list
print(list(df.head().index))

City
London[2]    1
Berlin       2
Madrid       3
Rome         4
Paris        5
Name: Rank, dtype: int64
---o---o---o---
['London[2]', 'Berlin', 'Madrid', 'Rome', 'Paris']

We can view or change the name of the index, that is, its header, with df.index.name.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Use the values of City as the index labels
df.index= df["City"]
# Set the name of the index
df.index.name= 'EtiketIsmi'
# To clear the index name instead:
#df.index.name= None
# Show
df.head()
print("---o---o---o---")
# Show the labels of the first 5 rows
print(df.head().index)

---o---o---o---
Index(['London[2]', 'Berlin', 'Madrid', 'Rome', 'Paris'], dtype='object', name='EtiketIsmi')

With reset_index() we can restore the default integer labels. While doing so, it also adds the old labels back as a new column; when the index has no name, that column is called index.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Use the concatenation of the City and State columns as the index labels
df.index= df["City"]+ " " + df["State"]
# Show
print(df.head())
# Reset the index
df=df.reset_index()
print("---o---o---o---")
print(df.head())

                           Rank       City            State Population  \
London[2]  United Kingdom     1  London[2]   United Kingdom  8,615,246   
Berlin  Germany               2     Berlin          Germany  3,437,916   
Madrid  Spain                 3     Madrid            Spain  3,165,235   
Rome  Italy                   4       Rome            Italy  2,872,086   
Paris  France                 5      Paris           France  2,273,305   

                          Date of census/estimate  
London[2]  United Kingdom             1 June 2014  
Berlin  Germany                       31 May 2014  
Madrid  Spain                      1 January 2014  
Rome  Italy                     30 September 2014  
Paris  France                      1 January 2013  
---o---o---o---
                       index  Rank       City            State Population  \
0  London[2]  United Kingdom     1  London[2]   United Kingdom  8,615,246   
1            Berlin  Germany     2     Berlin          Germany  3,437,916   
2              Madrid  Spain     3     Madrid            Spain  3,165,235   
3                Rome  Italy     4       Rome            Italy  2,872,086   
4              Paris  France     5      Paris           France  2,273,305   

  Date of census/estimate  
0             1 June 2014  
1             31 May 2014  
2          1 January 2014  
3       30 September 2014  
4          1 January 2013
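A compact way to see the round trip between a column and the index is sketched below. The `toy` frame is invented for illustration; set_index moves the column into the index (unlike df.index= df["City"], which keeps the column as well), and drop=True is an optional parameter of reset_index that discards the old labels.

```python
import pandas as pd

# Hypothetical two-row table
toy = pd.DataFrame({"City": ["Rome", "Paris"], "Rank": [4, 5]})

# City moves from the columns into the index
indexed = toy.set_index("City")
print(indexed)

# reset_index brings the labels back as a regular column
restored = indexed.reset_index()
print(restored)

# drop=True throws the old labels away instead of keeping them
dropped = indexed.reset_index(drop=True)
print(dropped)
```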

Sorting

We can sort the whole DataFrame by its index labels.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Use the values of City as the index labels
df.index= df["City"]
# Show
df.head()
print("---o---o---o---")
# Sort the whole DataFrame by its index labels
df= df.sort_index()
# Show
df.head()

---o---o---o---

	Rank	City	State	Population	Date of census/estimate
City
Aarhus	92	Aarhus	Denmark	326,676	1 October 2014
Alicante	86	Alicante	Spain	334,678	1 January 2012
Amsterdam	23	Amsterdam	Netherlands	813,562	31 May 2014
Antwerp	59	Antwerp	Belgium	510,610	1 January 2014
Athens	34	Athens	Greece	664,046	24 May 2011

We can also sort the whole DataFrame by a column of our choice.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Use the values of City as the index labels
df.index= df["City"]
# Show
df.head()
print("---o---o---o---")
# Sort the whole DataFrame by a chosen column, here Rank, in descending order
df.sort_values(by="Rank", ascending=False, inplace=True)
# Show
df.head()

---o---o---o---

	Rank	City	State	Population	Date of census/estimate
City
Kaunas	105	Kaunas	Lithuania	306,888	1 January 2013
Katowice	104	Katowice	Poland	308,269	30 June 2012
Nottingham	103	Nottingham	United Kingdom	308,735	30 June 2012
Malmö	102	Malmö	Sweden	309,105	31 March 2013
Bonn	101	Bonn	Germany	309,869	31 December 2012
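sort_values also accepts a list of columns, with a separate direction for each. The sketch below uses a made-up four-row table (`toy` is an assumption for illustration) to sort by State ascending and, within each state, by Rank descending.

```python
import pandas as pd

# Hypothetical excerpt with two cities per country
toy = pd.DataFrame({
    "State": ["Spain", "Germany", "Spain", "Germany"],
    "City": ["Madrid", "Munich", "Alicante", "Berlin"],
    "Rank": [3, 12, 86, 2],
})

# First key: State (A to Z); second key: Rank (largest first)
ordered = toy.sort_values(by=["State", "Rank"], ascending=[True, False])
print(ordered)
```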

Multi-Index

We can group the data in a DataFrame. For example, let the labels contain both the countries and the cities that belong to those countries, that is, two levels of labels. To build such an index we pass two columns to set_index.

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Use 2 index levels.
# 1. the values of State
# 2. the values of City
df.set_index(["State", "City"], inplace=True)
print("DataFrame with 2 index levels")
print(df.head())
print("---o---o---o---")
# Sort the two-level DataFrame by level 0
print("DataFrame with 2 index levels, sorted by level 0")
print(df.sort_index(level=0).head())
print("---o---o---o---")
# Sort the two-level DataFrame by level 1
print("DataFrame with 2 index levels, sorted by level 1")
print(df.sort_index(level=1).head())
print("---o---o---o---")
# Selecting rows by label (e.g. a single country) is covered in the next example.

DataFrame with 2 index levels
                          Rank Population Date of census/estimate
State          City                                              
United Kingdom London[2]     1  8,615,246             1 June 2014
Germany        Berlin        2  3,437,916             31 May 2014
Spain          Madrid        3  3,165,235          1 January 2014
Italy          Rome          4  2,872,086       30 September 2014
France         Paris         5  2,273,305          1 January 2013
---o---o---o---
DataFrame with 2 index levels, sorted by level 0
                       Rank Population Date of census/estimate
State    City                                                 
Austria  Vienna           7  1,794,770          1 January 2015
Belgium  Antwerp         59    510,610          1 January 2014
         Brussels[17]    16  1,175,831          1 January 2014
Bulgaria Plovdiv         84    341,041        31 December 2013
         Sofia           14  1,291,895        14 December 2014
---o---o---o---
DataFrame with 2 index levels, sorted by level 1
                       Rank Population Date of census/estimate
State       City                                              
Denmark     Aarhus       92    326,676          1 October 2014
Spain       Alicante     86    334,678          1 January 2012
Netherlands Amsterdam    23    813,562             31 May 2014
Belgium     Antwerp      59    510,610          1 January 2014
Greece      Athens       34    664,046             24 May 2011
---o---o---o---

To access a DataFrame that has two index levels, we use df.loc[("Label1", "Label2")].

import pandas as pd
# Load the data
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Use 2 index levels.
# 1. the values of State
# 2. the values of City
df.set_index(["State", "City"], inplace=True)
print("DataFrame with 2 index levels")
print(df.head())
print("---o---o---o---")
# Show the rows for Germany
print("Rows with the label Germany")
# This does not work, because the State values start with a space:
#print(df.loc[('Germany')])
print(df.loc[(' Germany')].head())
print("---o---o---o---")
# Show the row for Germany, Berlin
print("Row with the labels Germany, Berlin")
print(df.loc[(' Germany', 'Berlin')])

DataFrame with 2 index levels
                          Rank Population Date of census/estimate
State          City                                              
United Kingdom London[2]     1  8,615,246             1 June 2014
Germany        Berlin        2  3,437,916             31 May 2014
Spain          Madrid        3  3,165,235          1 January 2014
Italy          Rome          4  2,872,086       30 September 2014
France         Paris         5  2,273,305          1 January 2013
---o---o---o---
Rows with the label Germany
             Rank Population Date of census/estimate
City                                                
Berlin          2  3,437,916             31 May 2014
Hamburg[10]     8  1,746,342        30 December 2013
Munich         12  1,407,836        31 December 2013
Cologne        18  1,034,175        31 December 2013
Frankfurt      29    701,350        31 December 2013
---o---o---o---
Row with the labels Germany, Berlin
Rank                                 2
Population                   3,437,916
Date of census/estimate    31 May 2014
Name: ( Germany, Berlin), dtype: object
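The selection patterns for a two-level index can be tried out on a tiny table. This is a minimal sketch with made-up rows (`toy` is an assumption, with the leading spaces already stripped); xs is a pandas method for selecting by a given index level and is shown here as an addition to the loc examples above.

```python
import pandas as pd

# Hypothetical cleaned excerpt with a (State, City) index
toy = pd.DataFrame({
    "State": ["Germany", "Germany", "Spain"],
    "City": ["Berlin", "Munich", "Madrid"],
    "Rank": [2, 12, 3],
}).set_index(["State", "City"]).sort_index()

# Outer label only: all cities of one country, indexed by City
germany = toy.loc["Germany"]
print(germany)

# Both labels as a tuple, plus a column: a single value
berlin_rank = toy.loc[("Germany", "Berlin"), "Rank"]
print(berlin_rank)

# Select by the inner level without knowing the outer label
madrid = toy.xs("Madrid", level="City")
print(madrid)
```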

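Summation

We can sum the values in a DataFrame with df.sum(). The values must be numeric for this to be an arithmetic sum: in the example below the Population column is still of type object (strings), so the strings are concatenated instead of being added as numbers.

import pandas as pd
import numpy as np
# Load the data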
df= pd.read_csv('https://raw.githubusercontent.com/jrjohansson/numerical-python-book-code/master/european_cities.csv')
# Show the Population column
print(df.Population.head())
print("---o---o---o---")
# Show the dtype
print(f"Data type : {df.Population.dtype}")
print("---o---o---o---")
# Sum the Population column
print(df.Population.sum())
print("---o---o---o---")
# Sum the values of the Population column with numpy
print(np.sum(df.Population.values))

0    8,615,246
1    3,437,916
2    3,165,235
3    2,872,086
4    2,273,305
Name: Population, dtype: object
---o---o---o---
Data type : object
---o---o---o---
8,615,2463,437,9163,165,2352,872,0862,273,3051,883,4251,794,7701,746,3421,744,6651,729,1191,602,3861,407,8361,332,5161,291,8951,246,7801,175,8311,092,3301,034,175989,845909,976898,095852,516813,562790,017786,424760,700757,655709,757701,350701,185696,676677,015666,058664,046632,432616,528605,523604,297596,550594,774593,682575,944569,884566,913559,440557,382547,631547,161546,451537,152531,562530,754528,014527,612524,619514,137510,909510,772510,610495,360495,121486,816484,344469,690460,354441,802441,354434,810432,451417,389409,211407,648384,202382,296377,207378,327362,286362,213351,629348,493348,120343,304342,885341,041335,819334,678331,606330,772328,841328,314327,627326,676324,576323,132322,751322,240320,229319,279315,576311,501309,869309,105308,735308,269306,888
---o---o---o---
8,615,2463,437,9163,165,2352,872,0862,273,3051,883,4251,794,7701,746,3421,744,6651,729,1191,602,3861,407,8361,332,5161,291,8951,246,7801,175,8311,092,3301,034,175989,845909,976898,095852,516813,562790,017786,424760,700757,655709,757701,350701,185696,676677,015666,058664,046632,432616,528605,523604,297596,550594,774593,682575,944569,884566,913559,440557,382547,631547,161546,451537,152531,562530,754528,014527,612524,619514,137510,909510,772510,610495,360495,121486,816484,344469,690460,354441,802441,354434,810432,451417,389409,211407,648384,202382,296377,207378,327362,286362,213351,629348,493348,120343,304342,885341,041335,819334,678331,606330,772328,841328,314327,627326,676324,576323,132322,751322,240320,229319,279315,576311,501309,869309,105308,735308,269306,888
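The long output above is a single concatenated string, because the column holds text rather than numbers. The sketch below reproduces the effect on three values taken from the table and then sums them arithmetically after conversion; the Series is built by hand with dtype=object to mirror the loaded column.

```python
import pandas as pd

# Three Population values as strings, stored with object dtype as in the loaded DataFrame
population = pd.Series(["8,615,246", "3,437,916", "3,165,235"],
                       name="Population", dtype=object)

# Summing strings joins them end to end
concatenated = population.sum()
print(concatenated)

# Remove the thousands separators, convert to int, then sum as numbers
numeric = population.str.replace(",", "", regex=False).astype(int)
total = numeric.sum()
print(total)
```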

What Did We Learn?

# Series
s= pd.Series()
s.index
s.values
s.name
s["Londra"]
s.Londra
s.describe()
s.max()
s.min()
s.std()
s.mean()
s.median()
s.plot(kind= 'line')
s.plot(kind= 'bar')
s.plot(kind= 'barh')
s.plot(kind= 'pie')
# ----------------
df= pd.DataFrame()
df.columns
df["Nüfus"] # select a column by name
df.Nüfus # attribute access to a column; avoid non-ASCII characters in column names
df.loc["Londra"] # select a row by label
df.loc[["Londra", "Roma"]]
df.rename(columns= {"Nüfus" : "Nüfuslar"})
df.info()
# ----------------
df= pd.read_csv()
df.head()
df.tail()
df.loc["Index Label"]
df.iloc[0]
df.iloc[1, 2]
df.shape
df["New Column"]
df["Column Name"].unique()
df["Column Name"].value_counts()
df["Column Name"].duplicated()
df.drop("Column Name", axis=1)
df.pop("Column Name")
df.drop("Index Label", axis=0)
df.Population.apply(lambda x: int(x.replace(',','')))
df.State.str.capitalize()
df.index.name
df.sort_index()
df.sort_values(by="Column Name", ascending=True)
df.set_index(["State", "City"]).sort_index(level=0)
df.set_index(["State", "City"]).sort_index(level=1)
df.sum()

Problems

Problem 1

Go to https://www.kaggle.com/datasets/prithusharma1/all-nobel-laureates-1901-present and download the data on Nobel Prize laureates.
Read the data with the pandas package.
Find the unique values in the gender column ("Gender") and assign them to an array.
Find how many times each of these unique values occurs in the "Gender" column.
Draw a bar chart of the total number of prizes per gender ("Gender") using plt.bar().
Before finding the unique values in the "Birth_Country_Code" column, remove the NaN values with dropna().
Draw a bar chart showing the total number of laureates born in each country ("Birth_Country_Code").

Problem 2

Go to https://www.kaggle.com/datasets/abhinand05/daily-sun-spot-data-1818-to-2019 and download the sunspot data.
Read the data with the pandas package.
Delete the first column ("Unnamed: 0").
Replace every value equal to \(-1\) in the "Number of Sunspots" column with np.nan.
Create a new column named "Year-Month-Day". Fill it with strings that contain the year, month and day, for example "1818-01-01". To do this you need to convert the columns to string type, for example df["Day"].astype(str).
Plot the number of sunspots ("Number of Sunspots") per day. Leave the horizontal axis empty.
Create a new DataFrame whose index labels are the years and whose column is the total number of sunspots in that year. You can group the data with df.groupby("<column to group by>")["<column to sum>"].sum().
Plot the new DataFrame. You should be able to observe that the number of sunspots rises roughly every 11 years. This is called the solar cycle.
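As a hint for the replace and groupby steps, here is a sketch on invented data. The column names "Year" and "Count" and all values are hypothetical placeholders, not the columns of the Kaggle file.

```python
import numpy as np
import pandas as pd

# Hypothetical observations; -1 stands for a missing measurement
obs = pd.DataFrame({
    "Year": [1818, 1818, 1819, 1819],
    "Count": [10, -1, 5, 7],
})

# Replace the -1 placeholders with NaN so they are ignored in sums
obs["Count"] = obs["Count"].replace(-1, np.nan)

# Group the rows by Year and sum the Count column within each group
yearly = obs.groupby("Year")["Count"].sum()
print(yearly)
```

The result is a Series indexed by year, which can be turned into a one-column DataFrame with to_frame() if needed.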
