Making Data Analysis Simpler With Dictionaries

While long-time Python users have a good sense of the usefulness of dictionaries, explaining why dictionaries are so useful is not so straightforward. Particularly to atmospheric and oceanic sciences (AOS) users, who generally are not that interested in the relative merits of various data structures, and whose main working language is Fortran, it can be difficult to describe what a dictionary buys you. In this article, I will give one example of how using dictionaries can make a data analysis routine simpler, more robust, and extensible.

Let’s say we have the following data analysis problem:

  • You have three data files named data0.txt, data1.txt, and data2.txt. Each data file contains a collection of values on which you will make some calculations. We assume that the function readdata (defined earlier in the program) will return a 1-D NumPy array of the data in the file; a possible sketch of such a function is given after this list.
  • For each of the data arrays, you want to calculate the mean, median, and standard deviation. These values may be used later on in the program.
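
Since readdata is not part of NumPy or the standard library, here is a minimal sketch of what such a function might look like, assuming each data file contains whitespace-delimited numbers (your own readdata may well differ):

import numpy

def readdata(filename):
    #- Read whitespace-delimited numbers from the file and return
    #  them as a 1-D NumPy array
    return numpy.ravel(numpy.loadtxt(filename))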

A (traditional) Fortran programmer, if asked to write a Python program to do the above, might write something like this:

import numpy

#- Read each data file into its own NumPy array
data0 = readdata('data0.txt')
data1 = readdata('data1.txt')
data2 = readdata('data2.txt')

mean0 = numpy.mean(data0)
median0 = numpy.median(data0)
stddev0 = numpy.std(data0)

mean1 = numpy.mean(data1)
median1 = numpy.median(data1)
stddev1 = numpy.std(data1)

mean2 = numpy.mean(data2)
median2 = numpy.median(data2)
stddev2 = numpy.std(data2)

Very simple and straightforward, and if you have only three files and only three metrics to calculate on each, this is probably the fastest way to code it. But what if you have more files? The code very quickly becomes unmanageable and prone to catastrophe-by-typo.

So how can we make the code better? Because we are calculating the same functions for each data file, one approach is to put the mean, median, and standard deviation into arrays, where each element holds the calculated value for the data file with the corresponding index. We then use a for loop to go through each file, reading in the data and making the calculations, and we make use of Python’s string operators to construct the correct data filename at each iteration of the loop. Thus, we would have:

import numpy

#- Preallocate result arrays, one element per data file
mean = numpy.zeros(3)
median = numpy.zeros(3)
stddev = numpy.zeros(3)

for i in range(3):
    filename = 'data' + str(i) + '.txt'    #- Construct the filename for this index
    data = readdata(filename)
    mean[i] = numpy.mean(data)
    median[i] = numpy.median(data)
    stddev[i] = numpy.std(data)

This version is shorter and more easily extensible. If there are 1000 files, you just need to change the for loop iterable from range(3) to range(1000) and redeclare the mean, median, and stddev arrays to be 1000 elements long.

But there are still at least two things I’m unsatisfied with. First, why should I need to redeclare anything at all? Python is an interpreted language; can’t it handle sizing on the fly? Second, storing calculated values in an array makes sense when the filenames differ from each other by characters that correspond to array index addresses, but most of the time, filenames can be anything. Wouldn’t it be clearer if the values of the mean, median, and standard deviation for each file were stored in structures that were indexed by the names of the files? After all, filenames do not necessarily have an order to them, and the indexing system for parameters related to those files (i.e., mean, median, and stddev) also does not need to be ordered.

To fix these last two problems, I’m going to make two changes to the code. First, instead of using arrays to store the mean, median, and standard deviation, I will use dictionaries, which are extensible (meaning you do not have to preset their size) and which accept any unique, immutable value, such as a string, as a key (meaning you can use the filename as the index). Second, instead of looping through array indices, I will loop through the filenames themselves. Here, I’m taking advantage of the fact that for loops in Python will loop through any iterable. Unlike traditional Fortran, where you loop through a range of integers, in Python you can loop through a collection of numbers, strings, even functions and modules! Thus, my code is:

import numpy
mean = {}    #- Initialize as empty dictionaries
median = {}
stddev = {}
list_of_files = ['data0.txt', 'data1.txt', 'data2.txt']

for ifile in list_of_files:
    data = readdata(ifile)
    mean[ifile] = numpy.mean(data)
    median[ifile] = numpy.median(data)
    stddev[ifile] = numpy.std(data)

(With dictionaries, once you specify a variable as a dictionary (which is what initializing mean, median, and stddev as empty dictionaries does), assigning a value to a key that does not yet exist creates the key and fulfills the assignment. If the key already exists, the value is assigned to the key, overwriting the previous value. For more on dictionaries, please see this entry from the Python tutorial.)
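
As a quick illustration of that behavior (the key and values here are purely for demonstration):

mean = {}                   #- Start with an empty dictionary
mean['data1.txt'] = 1.5     #- Key does not exist yet, so it is created
mean['data1.txt'] = 2.0     #- Key already exists, so its value is overwritten
print(mean)                 #- {'data1.txt': 2.0}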

The code is not shorter, but it will now work with any number of files; you only need to add additional filenames to list_of_files, and the calculations will automatically be done and stored. To access, say, the mean of data1.txt, you just type mean['data1.txt']. Filenames of any naming convention (or which completely lack a convention) are referenced similarly.

This new code also has a bonus feature. By changing the for loop from looping through a range of integers to looping through a list of files, it now becomes clear that if I can only automatically generate a list of my files, I won’t even have to type in new filenames to list_of_files. Making such a list, in fact, is very straightforward if you can get a directory listing from the operating system; Python makes that simple to do (and in a cross-platform way) via its built-in os module, coupled with string methods. I will talk about how to do this in a few weeks in a future post.
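
As a preview, here is one possible way to build such a list using os and string methods, assuming the data files live in the current directory and follow the data*.txt naming pattern (adjust the directory and the tests to match your own files):

import os

list_of_files = []
for fname in os.listdir('.'):    #- Everything in the current directory
    if fname.startswith('data') and fname.endswith('.txt'):
        list_of_files.append(fname)
list_of_files.sort()             #- Sort so the processing order is predictable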

In our example, I assumed that the number of metrics to be calculated (mean, median, standard deviation) does not change, and that only the number of files changes. What if the number of metrics to be calculated also changes from three to 30? Are you stuck with writing at least 30 lines of code? In this case, dictionaries, coupled with the object-oriented nature of Python, can help us write more concise code. I will also talk about how to do this in another future post.
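
To give a flavor of that approach, here is a sketch that takes advantage of the fact that Python functions are themselves objects that can be stored in a dictionary (the variable names are purely illustrative). Adding another metric then becomes a one-line change to the metrics dictionary:

import numpy

metrics = {'mean': numpy.mean,
           'median': numpy.median,
           'stddev': numpy.std}

results = {}    #- results[metric][filename] will hold each calculated value
for name in metrics:
    results[name] = {}

for ifile in list_of_files:
    data = readdata(ifile)
    for name in metrics:
        results[name][ifile] = metrics[name](data)

Then results['mean']['data1.txt'] gives the same value that mean['data1.txt'] did before.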

Dictionaries enable us to gather and store data without having to know ahead of time how many elements will need to be stored, and to reference the data using index values that make more sense than integers. The result is code that is more readable and flexible.

Hat tip: Thanks to Yun-Lan Chen for the idea for this article!

  • Neil

    A note for that future post: I have found the glob module (also a built-in module) better than the os module for getting file lists. You can use wildcards, etc. For example:

    import glob
    dsrc = "/home/neil/data/cfsr/grb2files/"
    grblist = glob.glob(dsrc + "cfsr*/cfsr*/*.grb2")
    grblist.sort()

    and now I have a sorted list, with paths, of all the .grb2 files in those directories.

    Apologies if you already know this, otherwise hope it works for you.

  • Johnny Lin (http://www.johnny-lin.com)

    Nope, I hadn’t heard about glob. Thanks much for the tip!