Python Data Analysis with pandas
Python is a great language for data analysis. pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
In this post, we'll learn more about doing data analysis with pandas.
pandas has two main data structures: Series and DataFrames.
pandas Series
A pandas Series is a one-dimensional labeled array.
import pandas as pd
index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge', 'Oxford', 'UCL']
a = pd.Series([100, 98.7, 98.4, 97.7, 95.6, 95.3, 94.6],
index=index_list)
print(a)
MIT 100.0
Stanford 98.7
Harvard 98.4
Caltech 97.7
Cambridge 95.6
Oxford 95.3
UCL 94.6
dtype: float64
Labels can be accessed using the index attribute.
print(a.index)
Index(['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge', 'Oxford', 'UCL'], dtype='object')
You can use positional indexing or labels to access data in the Series.
print(a[1])
print(a['Cambridge'])
98.7
95.6
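A version caveat: recent pandas releases deprecate plain positional indexing like a[1] on a labeled Series, preferring the explicit .iloc and .loc accessors. A minimal sketch with the same data:

```python
import pandas as pd

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge', 'Oxford', 'UCL']
a = pd.Series([100, 98.7, 98.4, 97.7, 95.6, 95.3, 94.6], index=index_list)

# .iloc is always positional, .loc is always label-based:
print(a.iloc[1])           # 98.7
print(a.loc['Cambridge'])  # 95.6
```

Using the explicit accessors makes the intent unambiguous even when the labels themselves are integers.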
You can also apply mathematical operations to a pandas Series.
b = a * 2
c = a ** 1.5
print(b)
print()
print(c)
MIT 200.0
Stanford 197.4
Harvard 196.8
Caltech 195.4
Cambridge 191.2
Oxford 190.6
UCL 189.2
dtype: float64
MIT 1000.000000
Stanford 980.563513
Harvard 976.096258
Caltech 965.699142
Cambridge 934.731414
Oxford 930.334981
UCL 920.103546
dtype: float64
You can even create a series of heterogeneous data.
s = pd.Series(['random', 1.2, 3, 'data'], index=['any', 'thing', 2, '4.3'])
print(s)
any random
thing 1.2
2 3
4.3 data
dtype: object
pandas DataFrame
pandas DataFrame is a 2-dimensional labeled data structure. There are many ways to create DataFrames; we'll go through them one by one.
Creating DataFrame from dictionary of Series
The following method can be used to create a DataFrame from a dictionary of pandas Series.
index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
u = {
'citations': pd.Series([99.9, 99.4, 99.9, 100, 78.4], index=index_list),
'employer': pd.Series([100, 100, 100, 85.4, 100], index=index_list)
}
df = pd.DataFrame(u)
df
citations | employer | |
---|---|---|
MIT | 99.9 | 100.0 |
Stanford | 99.4 | 100.0 |
Harvard | 99.9 | 100.0 |
Caltech | 100.0 | 85.4 |
Cambridge | 78.4 | 100.0 |
print(df.index)
Index(['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge'], dtype='object')
print(df.columns)
Index(['citations', 'employer'], dtype='object')
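When constructing from a dictionary of Series, you can also pass index and columns to select and order a subset at construction time. A small sketch using the same data:

```python
import pandas as pd

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
u = {
    'citations': pd.Series([99.9, 99.4, 99.9, 100, 78.4], index=index_list),
    'employer': pd.Series([100, 100, 100, 85.4, 100], index=index_list)
}

# Only the requested rows and columns are kept, in the given order:
df2 = pd.DataFrame(u, index=['Caltech', 'MIT'], columns=['employer'])
print(df2)
```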
Creating DataFrame from list of dictionaries
l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['value1', 'value2'])
df
apple | banana | carrot | orange | |
---|---|---|---|---|
value1 | 42 | NaN | NaN | 32.0 |
value2 | 34 | 25.0 | 44.0 | NaN |
You might have noticed that we got a DataFrame with NaN values in it. This is because we didn't have data for those particular row and column combinations.
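If the NaNs are unwanted, fillna() can replace them. Here we fill with 0 for illustration, though the right fill value depends on your data:

```python
import pandas as pd

l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['value1', 'value2'])

# Replace every missing entry with 0:
filled = df.fillna(0)
print(filled)
```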
Creating DataFrame from Text/CSV files
pandas comes in handy when you want to load data from a CSV or text file. It has built-in functions to do this for us.
df = pd.read_csv('happiness.csv')
Yes, we created a DataFrame from a CSV file. This dataset contains the outcome of a European quality-of-life survey. This dataset is available here. Now that we have stored the DataFrame in df, we want to see what's inside. First we will check the size of the DataFrame.
print(df.shape)
(105, 4)
It has 105 rows and 4 columns. Instead of printing out all the data, we will look at the first 10 rows.
df.head(10)
Country | Gender | Mean | N= | |
---|---|---|---|---|
0 | AT | Male | 7.3 | 471 |
1 | NaN | Female | 7.3 | 570 |
2 | NaN | Both | 7.3 | 1041 |
3 | BE | Male | 7.8 | 468 |
4 | NaN | Female | 7.8 | 542 |
5 | NaN | Both | 7.8 | 1010 |
6 | BG | Male | 5.8 | 416 |
7 | NaN | Female | 5.8 | 555 |
8 | NaN | Both | 5.8 | 971 |
9 | CY | Male | 7.8 | 433 |
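A loaded DataFrame like this is usually filtered next. Since the CSV itself isn't bundled here, the sketch below builds a tiny frame mimicking the happiness.csv layout shown above, then keeps only the combined ("Both") rows with a boolean mask:

```python
import pandas as pd

# A tiny frame mimicking the happiness.csv layout shown above
# (the real file would be loaded with pd.read_csv):
df = pd.DataFrame({
    'Country': ['AT', None, None, 'BE', None, None],
    'Gender': ['Male', 'Female', 'Both', 'Male', 'Female', 'Both'],
    'Mean': [7.3, 7.3, 7.3, 7.8, 7.8, 7.8],
    'N=': [471, 570, 1041, 468, 542, 1010],
})

# Keep only the rows where Gender is 'Both':
both = df[df['Gender'] == 'Both']
print(both)
```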
There are many more ways to create DataFrames, but for now let's look at some basic operations on them.
Operations on DataFrame
We'll recall the DataFrame we made earlier.
index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
u = {
'citations': pd.Series([99.9, 99.4, 99.9, 100, 78.4], index=index_list),
'employer': pd.Series([100, 100, 100, 85.4, 100], index=index_list)
}
df = pd.DataFrame(u)
df
citations | employer | |
---|---|---|
MIT | 99.9 | 100.0 |
Stanford | 99.4 | 100.0 |
Harvard | 99.9 | 100.0 |
Caltech | 100.0 | 85.4 |
Cambridge | 78.4 | 100.0 |
Now we want to create a new column from the current columns. Let's see how it is done.
df['score'] = (2 * df['citations'] + 3 * df['employer'])/5
df
citations | employer | score | |
---|---|---|---|
MIT | 99.9 | 100.0 | 99.96 |
Stanford | 99.4 | 100.0 | 99.76 |
Harvard | 99.9 | 100.0 | 99.96 |
Caltech | 100.0 | 85.4 | 91.24 |
Cambridge | 78.4 | 100.0 | 91.36 |
We have created a new column score from citations and employer. We'll create one more using a boolean condition.
df['flag'] = df['citations'] > 99.5
df
citations | employer | score | flag | |
---|---|---|---|---|
MIT | 99.9 | 100.0 | 99.96 | True |
Stanford | 99.4 | 100.0 | 99.76 | False |
Harvard | 99.9 | 100.0 | 99.96 | True |
Caltech | 100.0 | 85.4 | 91.24 | True |
Cambridge | 78.4 | 100.0 | 91.36 | False |
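A boolean column like flag doubles as a row filter, which is one common reason to create it. A sketch rebuilding the frame above:

```python
import pandas as pd

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
df = pd.DataFrame({
    'citations': [99.9, 99.4, 99.9, 100, 78.4],
    'employer': [100, 100, 100, 85.4, 100],
}, index=index_list)
df['flag'] = df['citations'] > 99.5

# Boolean indexing keeps only the rows where flag is True:
top = df[df['flag']]
print(top.index.tolist())  # ['MIT', 'Harvard', 'Caltech']
```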
We can also remove columns.
score = df.pop('score')
print(score)
MIT 99.96
Stanford 99.76
Harvard 99.96
Caltech 91.24
Cambridge 91.36
Name: score, dtype: float64
df
citations | employer | flag | |
---|---|---|---|
MIT | 99.9 | 100.0 | True |
Stanford | 99.4 | 100.0 | False |
Harvard | 99.9 | 100.0 | True |
Caltech | 100.0 | 85.4 | True |
Cambridge | 78.4 | 100.0 | False |
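Unlike pop(), which mutates the frame in place, drop() returns a new frame and leaves the original untouched. A brief sketch on the same data:

```python
import pandas as pd

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
df = pd.DataFrame({
    'citations': [99.9, 99.4, 99.9, 100, 78.4],
    'employer': [100, 100, 100, 85.4, 100],
}, index=index_list)

# drop() returns a copy without the column; df itself is unchanged:
trimmed = df.drop(columns=['employer'])
print(trimmed.columns.tolist())  # ['citations']
print(df.columns.tolist())       # both columns still present
```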
Descriptive Statistics using pandas
It's very easy to view descriptive statistics of a dataset using pandas. We're going to use biomass data collected from this source. Let's load the data first.
url = 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/biomass.csv'
df = pd.read_csv(url)
df.head()
Unnamed: 0 | dbh | wood | bark | root | rootsk | branch | species | fac26 | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 90 | 5528.0 | NaN | 460.0 | NaN | NaN | E. maculata | z |
1 | 2 | 106 | 13650.0 | NaN | 1500.0 | 665.0 | NaN | E. pilularis | 2 |
2 | 3 | 112 | 11200.0 | NaN | 1100.0 | 680.0 | NaN | E. pilularis | 2 |
3 | 4 | 34 | 1000.0 | NaN | 430.0 | 40.0 | NaN | E. pilularis | 2 |
4 | 5 | 130 | NaN | NaN | 3000.0 | 1030.0 | NaN | E. maculata | z |
We are not interested in the unnamed column, so let's delete it first. Then we'll see the statistics with one line of code.
del df['Unnamed: 0']
df.describe()
dbh | wood | bark | root | rootsk | branch | |
---|---|---|---|---|---|---|
count | 153.000000 | 133.000000 | 17.000000 | 54.000000 | 53.000000 | 76.000000 |
mean | 26.352941 | 1569.045113 | 513.235294 | 334.383333 | 113.802264 | 54.065789 |
std | 28.273679 | 4071.380720 | 632.467542 | 654.641245 | 247.224118 | 65.606369 |
min | 3.000000 | 3.000000 | 7.000000 | 0.300000 | 0.050000 | 4.000000 |
25% | 8.000000 | 29.000000 | 59.000000 | 11.500000 | 2.000000 | 10.750000 |
50% | 15.000000 | 162.000000 | 328.000000 | 41.000000 | 11.000000 | 35.000000 |
75% | 36.000000 | 1000.000000 | 667.000000 | 235.000000 | 45.000000 | 77.750000 |
max | 145.000000 | 25116.000000 | 1808.000000 | 3000.000000 | 1030.000000 | 371.000000 |
It's as simple as that. We can see all the statistics: count, mean, standard deviation, quartiles, and extremes. We can also compute individual statistics directly instead of going through the full describe() summary.
Mean
print(df.mean())
dbh 26.352941
wood 1569.045113
bark 513.235294
root 334.383333
rootsk 113.802264
branch 54.065789
dtype: float64
Min and Max
print(df.min())
dbh 3
wood 3
bark 7
root 0.3
rootsk 0.05
branch 4
species Acacia mabellae
dtype: object
print(df.max())
dbh 145
wood 25116
bark 1808
root 3000
rootsk 1030
branch 371
species Other
dtype: object
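One version caveat: on a frame with a non-numeric column like species, recent pandas versions require numeric_only=True for aggregations such as mean() or median(). A small sketch with made-up values in the same column layout:

```python
import pandas as pd

df = pd.DataFrame({
    'dbh': [3, 15, 145],
    'wood': [3.0, 162.0, 25116.0],
    'species': ['E. maculata', 'E. pilularis', 'Other'],
})

# Restrict the aggregation to numeric columns:
print(df.median(numeric_only=True))
```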
Pairwise Correlation
df.corr()
dbh | wood | bark | root | rootsk | branch | |
---|---|---|---|---|---|---|
dbh | 1.000000 | 0.905175 | 0.965413 | 0.899301 | 0.934982 | 0.861660 |
wood | 0.905175 | 1.000000 | 0.971700 | 0.988752 | 0.967082 | 0.821731 |
bark | 0.965413 | 0.971700 | 1.000000 | 0.961038 | 0.971341 | 0.943383 |
root | 0.899301 | 0.988752 | 0.961038 | 1.000000 | 0.936935 | 0.679760 |
rootsk | 0.934982 | 0.967082 | 0.971341 | 0.936935 | 1.000000 | 0.621550 |
branch | 0.861660 | 0.821731 | 0.943383 | 0.679760 | 0.621550 | 1.000000 |
Data Cleaning
We need to clean our data. Our data might contain missing values, NaN values, outliers, etc. We may need to remove or replace such data; otherwise, our analysis might not make any sense.
We can find null values using the following method.
print(df.isnull().any())
dbh False
wood True
bark True
root True
rootsk True
branch True
species False
fac26 True
dtype: bool
We have to remove these null values. This can be done with the method shown below.
newdf = df.dropna()
print(newdf.shape)
(4, 8)
Sadly, our dataset has shrunk to just four rows, because dropna() drops every row containing at least one NaN. But you get the point.
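Rather than discarding rows wholesale, a gentler alternative is to fill the gaps, for example with each column's mean. A sketch on a toy frame with the same kind of missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'wood': [5528.0, np.nan, 1000.0],
                   'root': [460.0, 1500.0, np.nan]})

# Fill each NaN with that column's mean instead of dropping the row:
filled = df.fillna(df.mean(numeric_only=True))
print(filled)
```

Whether filling or dropping is appropriate depends on why the values are missing; a filled value is an estimate, not an observation.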
There are many more useful tools in pandas. We'll see more about them in upcoming posts.