Python Data Analysis with pandas

Python is a great language for data analysis. pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

In this post we’ll get to know more about doing data analysis using pandas.

Mainly pandas has two data structures, series and dataframes.

pandas Series

pandas series can be used for one-dimensional labeled array.

import pandas as pd
index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge', 'Oxford', 'UCL']
a = pd.Series([100, 98.7, 98.4, 97.7, 95.6, 95.3, 94.6],
              index=index_list)
print(a)
MIT          100.0
Stanford      98.7
Harvard       98.4
Caltech       97.7
Cambridge     95.6
Oxford        95.3
UCL           94.6
dtype: float64

Labels can accessed using index attribute

print(a.index)
Index(['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge', 'Oxford', 'UCL'], dtype='object')

You can use array indexing or labels to access data in the series

print(a[1])
print(a['Cambridge'])
98.7
95.6

You can also apply mathematical operations on pandas series.

b = a * 2
c = a ** 1.5
print(b)
print()
print(c)
MIT          200.0
Stanford     197.4
Harvard      196.8
Caltech      195.4
Cambridge    191.2
Oxford       190.6
UCL          189.2
dtype: float64

MIT          1000.000000
Stanford      980.563513
Harvard       976.096258
Caltech       965.699142
Cambridge     934.731414
Oxford        930.334981
UCL           920.103546
dtype: float64

You can even create a series of heterogeneous data.

s = pd.Series(['random', 1.2, 3, 'data'], index=['any', 'thing', 2, '4.3'])
print(s)
any      random
thing       1.2
2             3
4.3        data
dtype: object

pandas DataFrame

pandas DataFrame is a 2-dimensional labeled data structure. There are many methods to create DataFrames. We’ll see each one by one.

Creating DataFrame from dictionary of Series

The following method can used to create DataFrames from a dictionary of pandas series.

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
u = {
    'citations': pd.Series([99.9, 99.4, 99.9, 100, 78.4], index=index_list),
    'employer': pd.Series([100, 100, 100, 85.4, 100], index=index_list)
}

df = pd.DataFrame(u)
df
citations employer
MIT 99.9 100.0
Stanford 99.4 100.0
Harvard 99.9 100.0
Caltech 100.0 85.4
Cambridge 78.4 100.0
print(df.index)
Index(['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge'], dtype='object')
print(df.columns)
Index(['citations', 'employer'], dtype='object')

Creating DataFrame from list of dictionaries

l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['value1', 'value2'])
df
apple banana carrot orange
value1 42 NaN NaN 32.0
value2 34 25.0 44.0 NaN

You might have noticed that we got a DataFrame with NaN values in it. This is because we didn’t the data for that particular row and column.

Creating DataFrame from Text/CSV files

pandas tool come in handy when you want to load data from a csv or a text file. It has built in functions to do this for use.

df = pd.read_csv('happiness.csv')

Yes we created a DataFrame from a csv file. This dataset contains outcome of European quality of life survey. This dataset is available here. Now we have stored the DataFrame in df, we want to see what’s inside. First we will see the size of the DataFrame.

print(df.shape)
(105, 4)

It has 105 Rows and 4 Columns. Instead of printing out all the data, we will see the first 10 rows.

df.head(10)
Country Gender Mean N=
0 AT Male 7.3 471
1 NaN Female 7.3 570
2 NaN Both 7.3 1041
3 BE Male 7.8 468
4 NaN Female 7.8 542
5 NaN Both 7.8 1010
6 BG Male 5.8 416
7 NaN Female 5.8 555
8 NaN Both 5.8 971
9 CY Male 7.8 433

There are many more methods to create a DataFrames. But now we will see about basic operation on DataFrames.

Operations on DataFrame

We’ll recall the DataFrame we made earlier.

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
u = {
    'citations': pd.Series([99.9, 99.4, 99.9, 100, 78.4], index=index_list),
    'employer': pd.Series([100, 100, 100, 85.4, 100], index=index_list)
}

df = pd.DataFrame(u)
df
citations employer
MIT 99.9 100.0
Stanford 99.4 100.0
Harvard 99.9 100.0
Caltech 100.0 85.4
Cambridge 78.4 100.0

Now we want to create a new row column from current columns. Let’s see how it is done.

df['score'] = (2 * df['citations'] + 3 * df['employer'])/5
df
citations employer score
MIT 99.9 100.0 99.96
Stanford 99.4 100.0 99.76
Harvard 99.9 100.0 99.96
Caltech 100.0 85.4 91.24
Cambridge 78.4 100.0 91.36

We have created a new column score from citations and employer. We’ll create one more using boolean.

df['flag'] = df['citations'] > 99.5
df
citations employer score flag
MIT 99.9 100.0 99.96 True
Stanford 99.4 100.0 99.76 False
Harvard 99.9 100.0 99.96 True
Caltech 100.0 85.4 91.24 True
Cambridge 78.4 100.0 91.36 False

We can also remove columns.

score = df.pop('score')
print(score)
MIT          99.96
Stanford     99.76
Harvard      99.96
Caltech      91.24
Cambridge    91.36
Name: score, dtype: float64
df
citations employer flag
MIT 99.9 100.0 True
Stanford 99.4 100.0 False
Harvard 99.9 100.0 True
Caltech 100.0 85.4 True
Cambridge 78.4 100.0 False

Descriptive Statistics using pandas

It’s very easy to view descriptive statistics of a dataset using pandas. We are gonna use, Biomass data collected from this source. Let’s load the data first.

url = 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/biomass.csv'
df = pd.read_csv(url)
df.head()
Unnamed: 0 dbh wood bark root rootsk branch species fac26
0 1 90 5528.0 NaN 460.0 NaN NaN E. maculata z
1 2 106 13650.0 NaN 1500.0 665.0 NaN E. pilularis 2
2 3 112 11200.0 NaN 1100.0 680.0 NaN E. pilularis 2
3 4 34 1000.0 NaN 430.0 40.0 NaN E. pilularis 2
4 5 130 NaN NaN 3000.0 1030.0 NaN E. maculata z

We are not interested in the unnamed column. So, let’s delete that first. Then we’ll see the statistics with one line of code.

del df['Unnamed: 0']
df.describe()
dbh wood bark root rootsk branch
count 153.000000 133.000000 17.000000 54.000000 53.000000 76.000000
mean 26.352941 1569.045113 513.235294 334.383333 113.802264 54.065789
std 28.273679 4071.380720 632.467542 654.641245 247.224118 65.606369
min 3.000000 3.000000 7.000000 0.300000 0.050000 4.000000
25% 8.000000 29.000000 59.000000 11.500000 2.000000 10.750000
50% 15.000000 162.000000 328.000000 41.000000 11.000000 35.000000
75% 36.000000 1000.000000 667.000000 235.000000 45.000000 77.750000
max 145.000000 25116.000000 1808.000000 3000.000000 1030.000000 371.000000

It’s simple as that. We can see all the statistics. Count, mean, standard deviation and other statistics. Now we are gonna find some other metrics which are not available in the describe() summary.

Mean

print(df.mean())
dbh         26.352941
wood      1569.045113
bark       513.235294
root       334.383333
rootsk     113.802264
branch      54.065789
dtype: float64

Min and Max

print(df.min())
dbh                      3
wood                     3
bark                     7
root                   0.3
rootsk                0.05
branch                   4
species    Acacia mabellae
dtype: object
print(df.max())
dbh          145
wood       25116
bark        1808
root        3000
rootsk      1030
branch       371
species    Other
dtype: object

Pairwise Correlation

df.corr()
dbh wood bark root rootsk branch
dbh 1.000000 0.905175 0.965413 0.899301 0.934982 0.861660
wood 0.905175 1.000000 0.971700 0.988752 0.967082 0.821731
bark 0.965413 0.971700 1.000000 0.961038 0.971341 0.943383
root 0.899301 0.988752 0.961038 1.000000 0.936935 0.679760
rootsk 0.934982 0.967082 0.971341 0.936935 1.000000 0.621550
branch 0.861660 0.821731 0.943383 0.679760 0.621550 1.000000

Data Cleaning

We need to clean our data. Our data might contain missing values, NaN values, outliers, etc. We may need to remove or replace that data. Otherwise our data might make any sense.

We can find null values using following method.

print(df.isnull().any())
dbh        False
wood        True
bark        True
root        True
rootsk      True
branch      True
species    False
fac26       True
dtype: bool

We have to remove these null values. This can done by method shown below.

newdf = df.dropna()

print(newdf.shape)
(4, 8)

But, sadly our datset reduced to a small one. But, you get the point.

There are many more useful tools in pandas. We’ll see more about them in upcoming posts.