Python Data Analysis with pandas
Python is a great language for data analysis. pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
In this post, we'll learn more about doing data analysis with pandas.
pandas has two main data structures: Series and DataFrames.
pandas Series
A pandas Series is a one-dimensional labeled array.
import pandas as pd
index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge', 'Oxford', 'UCL']
a = pd.Series([100, 98.7, 98.4, 97.7, 95.6, 95.3, 94.6],
index=index_list)
print(a)
MIT 100.0
Stanford 98.7
Harvard 98.4
Caltech 97.7
Cambridge 95.6
Oxford 95.3
UCL 94.6
dtype: float64
Labels can be accessed using the index attribute.
print(a.index)
Index(['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge', 'Oxford', 'UCL'], dtype='object')
You can use positional indexing or labels to access data in the Series.
print(a[1])
print(a['Cambridge'])
98.7
95.6
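A version caveat: recent pandas releases deprecate plain positional indexing like a[1] on a labeled Series, preferring the explicit .iloc and .loc accessors. A minimal sketch with the same data:

```python
import pandas as pd

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge', 'Oxford', 'UCL']
a = pd.Series([100, 98.7, 98.4, 97.7, 95.6, 95.3, 94.6], index=index_list)

# .iloc is always positional, .loc is always label-based:
print(a.iloc[1])           # 98.7
print(a.loc['Cambridge'])  # 95.6
```

Using the explicit accessors makes the intent unambiguous even when the labels themselves are integers.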
You can also apply mathematical operations to a pandas Series.
b = a * 2
c = a ** 1.5
print(b)
print()
print(c)
MIT 200.0
Stanford 197.4
Harvard 196.8
Caltech 195.4
Cambridge 191.2
Oxford 190.6
UCL 189.2
dtype: float64
MIT 1000.000000
Stanford 980.563513
Harvard 976.096258
Caltech 965.699142
Cambridge 934.731414
Oxford 930.334981
UCL 920.103546
dtype: float64
You can even create a series of heterogeneous data.
s = pd.Series(['random', 1.2, 3, 'data'], index=['any', 'thing', 2, '4.3'])
print(s)
any random
thing 1.2
2 3
4.3 data
dtype: object
pandas DataFrame
pandas DataFrame is a 2-dimensional labeled data structure. There are many ways to create DataFrames; we'll go through them one by one.
Creating DataFrame from dictionary of Series
The following method can be used to create a DataFrame from a dictionary of pandas Series.
index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
u = {
'citations': pd.Series([99.9, 99.4, 99.9, 100, 78.4], index=index_list),
'employer': pd.Series([100, 100, 100, 85.4, 100], index=index_list)
}
df = pd.DataFrame(u)
df
citations | employer | |
---|---|---|
MIT | 99.9 | 100.0 |
Stanford | 99.4 | 100.0 |
Harvard | 99.9 | 100.0 |
Caltech | 100.0 | 85.4 |
Cambridge | 78.4 | 100.0 |
print(df.index)
Index(['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge'], dtype='object')
print(df.columns)
Index(['citations', 'employer'], dtype='object')
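When constructing from a dictionary of Series, you can also pass index and columns to select and order a subset at construction time. A small sketch using the same data:

```python
import pandas as pd

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
u = {
    'citations': pd.Series([99.9, 99.4, 99.9, 100, 78.4], index=index_list),
    'employer': pd.Series([100, 100, 100, 85.4, 100], index=index_list)
}

# Only the requested rows and columns are kept, in the given order:
df2 = pd.DataFrame(u, index=['Caltech', 'MIT'], columns=['employer'])
print(df2)
```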
Creating DataFrame from list of dictionaries
l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['value1', 'value2'])
df
apple | banana | carrot | orange | |
---|---|---|---|---|
value1 | 42 | NaN | NaN | 32.0 |
value2 | 34 | 25.0 | 44.0 | NaN |
You might have noticed that we got a DataFrame with NaN values in it. This is because we didn't have data for those particular row and column combinations.
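If the NaNs are unwanted, fillna() can replace them. Here we fill with 0 for illustration, though the right fill value depends on your data:

```python
import pandas as pd

l = [{'orange': 32, 'apple': 42}, {'banana': 25, 'carrot': 44, 'apple': 34}]
df = pd.DataFrame(l, index=['value1', 'value2'])

# Replace every missing entry with 0:
filled = df.fillna(0)
print(filled)
```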
Creating DataFrame from Text/CSV files
pandas comes in handy when you want to load data from a CSV or text file. It has built-in functions to do this for us.
df = pd.read_csv('happiness.csv')
Yes, we created a DataFrame from a CSV file. This dataset contains the outcome of a European quality-of-life survey. This dataset is available here. Now that we have stored the DataFrame in df, we want to see what's inside. First we will check the size of the DataFrame.
print(df.shape)
(105, 4)
It has 105 rows and 4 columns. Instead of printing out all the data, we will look at the first 10 rows.
df.head(10)
Country | Gender | Mean | N= | |
---|---|---|---|---|
0 | AT | Male | 7.3 | 471 |
1 | NaN | Female | 7.3 | 570 |
2 | NaN | Both | 7.3 | 1041 |
3 | BE | Male | 7.8 | 468 |
4 | NaN | Female | 7.8 | 542 |
5 | NaN | Both | 7.8 | 1010 |
6 | BG | Male | 5.8 | 416 |
7 | NaN | Female | 5.8 | 555 |
8 | NaN | Both | 5.8 | 971 |
9 | CY | Male | 7.8 | 433 |
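A loaded DataFrame like this is usually filtered next. Since the CSV itself isn't bundled here, the sketch below builds a tiny frame mimicking the happiness.csv layout shown above, then keeps only the combined ("Both") rows with a boolean mask:

```python
import pandas as pd

# A tiny frame mimicking the happiness.csv layout shown above
# (the real file would be loaded with pd.read_csv):
df = pd.DataFrame({
    'Country': ['AT', None, None, 'BE', None, None],
    'Gender': ['Male', 'Female', 'Both', 'Male', 'Female', 'Both'],
    'Mean': [7.3, 7.3, 7.3, 7.8, 7.8, 7.8],
    'N=': [471, 570, 1041, 468, 542, 1010],
})

# Keep only the rows where Gender is 'Both':
both = df[df['Gender'] == 'Both']
print(both)
```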
There are many more ways to create DataFrames, but for now let's look at some basic operations on them.
Operations on DataFrame
We'll recall the DataFrame we made earlier.
index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
u = {
'citations': pd.Series([99.9, 99.4, 99.9, 100, 78.4], index=index_list),
'employer': pd.Series([100, 100, 100, 85.4, 100], index=index_list)
}
df = pd.DataFrame(u)
df
citations | employer | |
---|---|---|
MIT | 99.9 | 100.0 |
Stanford | 99.4 | 100.0 |
Harvard | 99.9 | 100.0 |
Caltech | 100.0 | 85.4 |
Cambridge | 78.4 | 100.0 |
Now we want to create a new column from the current columns. Let's see how it is done.
df['score'] = (2 * df['citations'] + 3 * df['employer'])/5
df
citations | employer | score | |
---|---|---|---|
MIT | 99.9 | 100.0 | 99.96 |
Stanford | 99.4 | 100.0 | 99.76 |
Harvard | 99.9 | 100.0 | 99.96 |
Caltech | 100.0 | 85.4 | 91.24 |
Cambridge | 78.4 | 100.0 | 91.36 |
We have created a new column score from citations and employer. We'll create one more using a boolean condition.
df['flag'] = df['citations'] > 99.5
df
citations | employer | score | flag | |
---|---|---|---|---|
MIT | 99.9 | 100.0 | 99.96 | True |
Stanford | 99.4 | 100.0 | 99.76 | False |
Harvard | 99.9 | 100.0 | 99.96 | True |
Caltech | 100.0 | 85.4 | 91.24 | True |
Cambridge | 78.4 | 100.0 | 91.36 | False |
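A boolean column like flag doubles as a row filter, which is one common reason to create it. A sketch rebuilding the frame above:

```python
import pandas as pd

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
df = pd.DataFrame({
    'citations': [99.9, 99.4, 99.9, 100, 78.4],
    'employer': [100, 100, 100, 85.4, 100],
}, index=index_list)
df['flag'] = df['citations'] > 99.5

# Boolean indexing keeps only the rows where flag is True:
top = df[df['flag']]
print(top.index.tolist())  # ['MIT', 'Harvard', 'Caltech']
```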
We can also remove columns.
score = df.pop('score')
print(score)
MIT 99.96
Stanford 99.76
Harvard 99.96
Caltech 91.24
Cambridge 91.36
Name: score, dtype: float64
df
citations | employer | flag | |
---|---|---|---|
MIT | 99.9 | 100.0 | True |
Stanford | 99.4 | 100.0 | False |
Harvard | 99.9 | 100.0 | True |
Caltech | 100.0 | 85.4 | True |
Cambridge | 78.4 | 100.0 | False |
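Unlike pop(), which mutates the frame in place, drop() returns a new frame and leaves the original untouched. A brief sketch on the same data:

```python
import pandas as pd

index_list = ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge']
df = pd.DataFrame({
    'citations': [99.9, 99.4, 99.9, 100, 78.4],
    'employer': [100, 100, 100, 85.4, 100],
}, index=index_list)

# drop() returns a copy without the column; df itself is unchanged:
trimmed = df.drop(columns=['employer'])
print(trimmed.columns.tolist())  # ['citations']
print(df.columns.tolist())       # both columns still present
```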
Descriptive Statistics using pandas
It's very easy to view descriptive statistics of a dataset using pandas. We're going to use biomass data collected from this source. Let's load the data first.
url = 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/DAAG/biomass.csv'
df = pd.read_csv(url)
df.head()
Unnamed: 0 | dbh | wood | bark | root | rootsk | branch | species | fac26 | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 90 | 5528.0 | NaN | 460.0 | NaN | NaN | E. maculata | z |
1 | 2 | 106 | 13650.0 | NaN | 1500.0 | 665.0 | NaN | E. pilularis | 2 |
2 | 3 | 112 | 11200.0 | NaN | 1100.0 | 680.0 | NaN | E. pilularis | 2 |
3 | 4 | 34 | 1000.0 | NaN | 430.0 | 40.0 | NaN | E. pilularis | 2 |
4 | 5 | 130 | NaN | NaN | 3000.0 | 1030.0 | NaN | E. maculata | z |
We are not interested in the unnamed column, so let's delete it first. Then we'll see the statistics with one line of code.
del df['Unnamed: 0']
df.describe()
dbh | wood | bark | root | rootsk | branch | |
---|---|---|---|---|---|---|
count | 153.000000 | 133.000000 | 17.000000 | 54.000000 | 53.000000 | 76.000000 |
mean | 26.352941 | 1569.045113 | 513.235294 | 334.383333 | 113.802264 | 54.065789 |
std | 28.273679 | 4071.380720 | 632.467542 | 654.641245 | 247.224118 | 65.606369 |
min | 3.000000 | 3.000000 | 7.000000 | 0.300000 | 0.050000 | 4.000000 |
25% | 8.000000 | 29.000000 | 59.000000 | 11.500000 | 2.000000 | 10.750000 |
50% | 15.000000 | 162.000000 | 328.000000 | 41.000000 | 11.000000 | 35.000000 |
75% | 36.000000 | 1000.000000 | 667.000000 | 235.000000 | 45.000000 | 77.750000 |
max | 145.000000 | 25116.000000 | 1808.000000 | 3000.000000 | 1030.000000 | 371.000000 |
It's as simple as that. We can see all the statistics: count, mean, standard deviation, quartiles, and extremes. We can also compute individual statistics directly instead of going through the full describe() summary.
Mean
print(df.mean())
dbh 26.352941
wood 1569.045113
bark 513.235294
root 334.383333
rootsk 113.802264
branch 54.065789
dtype: float64
Min and Max
print(df.min())
dbh 3
wood 3
bark 7
root 0.3
rootsk 0.05
branch 4
species Acacia mabellae
dtype: object
print(df.max())
dbh 145
wood 25116
bark 1808
root 3000
rootsk 1030
branch 371
species Other
dtype: object
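One version caveat: on a frame with a non-numeric column like species, recent pandas versions require numeric_only=True for aggregations such as mean() or median(). A small sketch with made-up values in the same column layout:

```python
import pandas as pd

df = pd.DataFrame({
    'dbh': [3, 15, 145],
    'wood': [3.0, 162.0, 25116.0],
    'species': ['E. maculata', 'E. pilularis', 'Other'],
})

# Restrict the aggregation to numeric columns:
print(df.median(numeric_only=True))
```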
Pairwise Correlation
df.corr()
dbh | wood | bark | root | rootsk | branch | |
---|---|---|---|---|---|---|
dbh | 1.000000 | 0.905175 | 0.965413 | 0.899301 | 0.934982 | 0.861660 |
wood | 0.905175 | 1.000000 | 0.971700 | 0.988752 | 0.967082 | 0.821731 |
bark | 0.965413 | 0.971700 | 1.000000 | 0.961038 | 0.971341 | 0.943383 |
root | 0.899301 | 0.988752 | 0.961038 | 1.000000 | 0.936935 | 0.679760 |
rootsk | 0.934982 | 0.967082 | 0.971341 | 0.936935 | 1.000000 | 0.621550 |
branch | 0.861660 | 0.821731 | 0.943383 | 0.679760 | 0.621550 | 1.000000 |
Data Cleaning
We need to clean our data. Our data might contain missing values, NaN values, outliers, etc. We may need to remove or replace such data; otherwise, our analysis might not make any sense.
We can find null values using the following method.
print(df.isnull().any())
dbh False
wood True
bark True
root True
rootsk True
branch True
species False
fac26 True
dtype: bool
We have to remove these null values. This can be done with the method shown below.
newdf = df.dropna()
print(newdf.shape)
(4, 8)
Sadly, our dataset has shrunk to just four rows, because dropna() drops every row containing at least one NaN. But you get the point.
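Rather than discarding rows wholesale, a gentler alternative is to fill the gaps, for example with each column's mean. A sketch on a toy frame with the same kind of missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'wood': [5528.0, np.nan, 1000.0],
                   'root': [460.0, 1500.0, np.nan]})

# Fill each NaN with that column's mean instead of dropping the row:
filled = df.fillna(df.mean(numeric_only=True))
print(filled)
```

Whether filling or dropping is appropriate depends on why the values are missing; a filled value is an estimate, not an observation.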
There are many more useful tools in pandas. We'll see more about them in upcoming posts.