How to use Pandas for data analysis in Python

Pandas makes it easy to quickly load, manipulate, align, merge, and even visualize data tables directly in Python.

How to use Pandas for data analysis in Python
Thinkstock

When it comes to working with data in a tabular form, most people reach for a spreadsheet. That's not a bad choice: Microsoft Excel and similar programs are familiar and loaded with functionality for massaging tables of data. But what if you want more control, precision, and power than Excel alone delivers?

In that case, the open source Pandas library for Python might be what you are looking for. It outfits Python with new data types for loading data fast from tabular sources, and for manipulating, aligning, merging, and doing other processing at scale.

Your first Pandas data set

Pandas is not part of the Python standard library. It's a third-party project, so you'll need to install it in your Python runtime with pip install pandas. Once installed, you can import it into Python with import pandas

Pandas gives you two new data types: Series and DataFrame. The DataFrame represents your entire spreadsheet or rectangular data, whereas the Series is a single column of the DataFrame. You can also think of the Pandas DataFrame as a dictionary or collection of Series objects. You'll find later that you can use dictionary- and list-like methods for finding elements in a DataFrame.

You typically work with Pandas by importing data in some other format. A common external tabular data format is CSV, a text file with values separated by commas. If you have a CSV handy, you can use it. For this article, we'll be using an excerpt from the Gapminder data set prepared by Jennifer Bryan from the University of British Columbia.

To begin using Pandas, we first import the library. Note that it's a common practice to alias the Pandas library as pd to save some typing:

import pandas as pd

To start working with the sample data in CSV format, we can load it in as a dataframe using the pd.read_csv function:


df = pd.read_csv("./gapminder/inst/extdata/gapminder.tsv", sep='\t')

The sep parameter lets us specify that this particular file is tab-delimited rather than comma-delimited.

Once the data's been loaded, you can peek at its formatting to make sure it's loaded correctly by using the .head() method on the dataframe:


print(df.head())
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106

Dataframe objects have a shape attribute that reports the number of rows and columns in the dataframe:


print(df.shape)
(1704, 6) # rows, cols

To list the names of the columns themselves, use .columns:


print(df.columns)
Index(['country', 'continent', 'year', 'lifeExp',
'pop', 'gdpPercap'], dtype='object')

Dataframes in Pandas work much the same way as dataframes in other languages, such as Julia and R. Each column, or Series, must be the same type, whereas each row can contain mixed types. For instance, in the current example, the country column will always be a string, and the year column is always an integer. We can verify this by using .dtypes to list the data type of each column:


print(df.dtypes)
country object
continent object
year int64
lifeExp float64
pop int64
gdpPercap float64
dtype: object

For an even more explicit breakdown of your dataframe's types, you can use .info():


df.info() # information is written to console, so no print required
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   country    1704 non-null   object
 1   continent  1704 non-null   object
 2   year       1704 non-null   int64
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB

Each Pandas data type maps to a native Python data type:

  • object is handled as a Python str type.
  • int64 is handled as a Python int. Note that not all Python ints can be converted to int64 types; anything larger than (2 ** 63)-1 will not convert to int64.
  • float64 is handled as a Python float (which is a 64-bit float natively).
  • datetime64 is handled as a Python datetime.datetime object. Note that Pandas does not automatically try to convert things that look like dates into date values; you must tell Pandas you want to do this for a specific column.

Pandas columns, rows, and cells

Now that you’re able to load a simple data file, you want to be able to inspect its contents. You could print the contents of the dataframe, but most dataframes will be too big to inspect by printing.

A better approach is to look at subsets of the data, as we did with df.head(), but with more control. Pandas lets you make excerpts from dataframes, using Python's existing syntax for indexing and creating slices.

Extracting Pandas columns

To examine columns in a Pandas dataframe, you can extract them by their names, positions, or by ranges. For instance, if you want a specific column from your data, you can request it by name using square brackets:


# extract the column "country" into its own dataframe
country_df = df["country"]
# show the first five rows
print(country_df.head())
| 0 Afghanistan
| 1 Afghanistan
| 2 Afghanistan
| 3 Afghanistan
| 4 Afghanistan
Name: country, dtype: object

# show the last five rows
print(country_df.tail())
| 1699  Zimbabwe
| 1700  Zimbabwe
| 1701  Zimbabwe
| 1702  Zimbabwe
| 1703  Zimbabwe
| Name: country, dtype: object

If you want to extract multiple columns, pass a list of the column names:


# Looking at country, continent, and year
subset = df[['country', 'continent', 'year']]

print(subset.head())

       country continent  year
| 0  Afghanistan    Asia  1952
| 1  Afghanistan    Asia  1957
| 2  Afghanistan    Asia  1962
| 3  Afghanistan    Asia  1967
| 4  Afghanistan    Asia  1972
print(subset.tail())

         country continent    year
| 1699  Zimbabwe    Africa    1987
| 1700  Zimbabwe    Africa    1992
| 1701  Zimbabwe    Africa    1997
| 1702  Zimbabwe    Africa    2002
| 1703  Zimbabwe    Africa    2007

Subsetting rows

If you want to extract rows from a dataframe, you can use one of two methods.

.iloc[] is the simplest method. It extracts rows based on their position, starting at 0. For fetching the first row in the above dataframe example, you'd use df.iloc[0].

If you want to fetch a range of rows, you can use .iloc[] with Python's slicing syntax. For instance, for the first 10 rows, you'd use df.iloc[0:10]. And if you wanted to obtain the last 10 rows in reverse order, you'd use df.iloc[::-1].

If you want to extract specific rows, you can use a list of the row IDs; for example, df.iloc[[0,1,2,5,7,10,12]]. (Note the double brackets—that means you're providing a list as the first argument.)

Another way to extract rows is with .loc[]. This extracts a subset based on labels for rows. By default, rows are labeled with an incrementing integer value starting with 0. But data can also be labeled manually by setting the dataframe's .index property.

For instance, if we wanted to re-index the above dataframe so that each row had an index using multiples of 100, we could use df.index = range(0, len(df)*100, 100). Then, if we used, df.loc[100], we'd get the second row.

Subsetting columns

If you want to retrieve only a certain subset of columns along with your row slices, you do this by passing a list of columns as a second argument: 

df.loc[[rows], [columns]] 

For instance, with the above dataset, if we want to get only the country and year columns for all rows, we'd do this:

df.loc[:, ["country","year"]]

The : in the first position means "all rows" (it's Python's slicing syntax). The list of columns follows after the comma.

You can also specify columns by position when using .iloc:

df.iloc[:, [0,2]]

Or, to get just the first three columns:

df.iloc[:, 0:3]

All of these approaches can be combined, as long as you remember loc is used for labels and column names, and iloc is used for numeric indexes. The following tells Pandas to extract the first 100 rows by their numeric labels, and then from that to extract the first three columns by their indexes:

df.loc[0:100].iloc[:, 0:3]

It's generally least confusing to use actual column names when subsetting data. It makes the code easier to read, and you don't have to refer back to the dataset to figure out which column corresponds to what index. It also protects you from mistakes if columns are re-ordered.

Grouped and aggregated calculations

Spreadsheets and number-crunching libraries all come with methods for generating statistics about data. Consider the Gapminder data again:


print(df.head(n=10))
|    country      continent  year  lifeExp  pop       gdpPercap
| 0  Afghanistan  Asia       1952  28.801    8425333  779.445314
| 1  Afghanistan  Asia       1957  30.332    9240934  820.853030
| 2  Afghanistan  Asia       1962  31.997   10267083  853.100710
| 3  Afghanistan  Asia       1967  34.020   11537966  836.197138
| 4  Afghanistan  Asia       1972  36.088   13079460  739.981106
| 5  Afghanistan  Asia       1977  38.438   14880372  786.113360
| 6  Afghanistan  Asia       1982  39.854   12881816  978.011439
| 7  Afghanistan  Asia       1987  40.822   13867957  852.395945
| 8  Afghanistan  Asia       1992  41.674   16317921  649.341395
| 9  Afghanistan  Asia       1997  41.763   22227415  635.341351

Here are some examples of questions we could ask about this data:

  1. What's the average life expectancy for each year in this data?
  2. What if I want averages across the years and the continents?
  3. How do I count how many countries in this data are in each continent?

The way to answer these questions with Pandas is to perform a grouped or aggregated calculation. We can split the data along certain lines, apply some calculation to each split segment, and then re-combine the results into a new dataframe.

Grouped means counts

The first method we'd use for this is Pandas's df.groupby() operation. We provide a column we want to split the data by:

df.groupby("year")

This allows us to treat all rows with the same year value together, as a distinct object from the dataframe itself.

From there, we can use the "life expectancy" column and calculate its per-year mean:


print(df.groupby('year')['lifeExp'].mean())
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423

This gives us the mean life expectancy for all populations, by year. We could perform the same kinds of calculations for population and GDP by year:


print(df.groupby('year')['pop'].mean())
print(df.groupby('year')['gdpPercap'].mean())

So far, so good. But what if we want to group our data by more than one column? We can do this by passing columns in lists:


print(df.groupby(['year', 'continent'])
  [['lifeExp', 'gdpPercap']].mean())
                  lifeExp     gdpPercap
year continent
1952 Africa     39.135500   1252.572466
     Americas   53.279840   4079.062552
     Asia       46.314394   5195.484004
     Europe     64.408500   5661.057435
     Oceania    69.255000  10298.085650
1957 Africa     41.266346   1385.236062
     Americas   55.960280   4616.043733
     Asia       49.318544   5787.732940
     Europe     66.703067   6963.012816
     Oceania    70.295000  11598.522455
1962 Africa     43.319442   1598.078825
     Americas   58.398760   4901.541870
     Asia       51.563223   5729.369625
     Europe     68.539233   8365.486814
     Oceania    71.085000  12696.452430

This .groupby() operation takes our data and groups it first by year, and then by continent. Then it generates mean values from the life-expectancy and GDP columns. This way, you can create groups in your data and rank how they are to be presented and calculated.

If you want to "flatten" the results into a single, incrementally indexed frame, you can use the .reset_index() method on the results:


gb = df.groupby(['year', 'continent'])
[['lifeExp', 'gdpPercap']].mean()
flat = gb.reset_index() 
print(flat.head())
|     year  continent  lifeExp    gdpPercap
| 0   1952  Africa     39.135500   1252.572466
| 1   1952  Americas   53.279840   4079.062552
| 2   1952  Asia       46.314394   5195.484004
| 3   1952  Europe     64.408500   5661.057435
| 4   1952  Oceana     69.255000  10298.085650

Grouped frequency counts

Another thing we often do with data is compute frequencies. The nunique and value_counts methods can be used to get unique values in a series, and their frequencies. For instance, here's how to find out how many countries we have in each continent:


print(df.groupby('continent')['country'].nunique()) 
continent
Africa    52
Americas  25
Asia      33
Europe    30
Oceana     2

Basic plotting with Pandas and Matplotlib

Most of the time, when you want to visualize data, you'll use another library such as Matplotlib to generate those graphics. However, you can use Matplotlib directly (along with some other plotting libraries) to generate visualizations from within Pandas.

To use the simple Matplotlib extension for Pandas, first make sure you've installed Matplotlib with pip install matplotlib.

Now let’s look at the yearly life expectancies for the world population again:


global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean() 
print(global_yearly_life_expectancy) 
| year
| 1952  49.057620
| 1957  51.507401
| 1962  53.609249
| 1967  55.678290
| 1972  57.647386
| 1977  59.570157
| 1982  61.533197
| 1987  63.212613
| 1992  64.160338
| 1997  65.014676
| 2002  65.694923
| 2007  67.007423
| Name: lifeExp, dtype: float64

To create a basic plot from this, use:


import matplotlib.pyplot as plt
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean() 
c = global_yearly_life_expectancy.plot().get_figure()
plt.savefig("output.png")

The plot will be saved to a file in the current working directory as output.png. The axes and other labeling on the plot can all be set manually, but for quick exports this method works fine.

Conclusion

Python and Pandas offer many features you can't get from spreadsheets alone. For one, they let you automate your work with data and make the results reproducible. Rather than write spreadsheet macros, which are clunky and limited, you can use Pandas to analyze, segment, and transform data—and use Python's expressive power and package ecosystem (for instance, for graphing or rendering data to other formats) to do even more than you could with Pandas alone.

Copyright © 2023 IDG Communications, Inc.