Data Visualization Bad vs Good

Posted on 09/26/2019 in Python Data Science

Lab 3 - Visualization - Solution

In [1]:

%matplotlib inline

The special command above will make all the matplotlib images appear in the notebook.

In [2]:

import numpy as np
import random as py_random
import numpy.random as np_random
import time
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")

from IPython.display import YouTubeVideo, Image

THEME = "darkslategray"

Chart 1.

In [3]:

Image( "resources/chart_01.jpg", width=500)

Out[3]:

The main story here appears to be the total sales of various fast food chains. There is also a comparison with the GDP of Afghanistan. It's not clear why that is a good comparison (although you see this kind of comparison often) because 1. most Americans can't place Afghanistan on a map and 2. even if they can, it's not evident that Afghanistan is a relevant context. Finally, why $21 billion as that context?
The chart adheres to some of the tenets of good charts. For example, it tries to be a bar chart, so the scale does start at 0. Additionally, the values are sorted, which leads us to compare increasing values. There are some negatives. The most obvious is that it has fallen victim to the "graphics as bars" problem, where the width is not constant. The next problem (purely from a visualization point of view, which you should adhere to in this class) is that it includes value labels. If you have a story that can be visualized, it should only be a visualization. If you need to communicate the values as well, they should be in an accompanying table.

In [4]:

sales = [4.1, 4.3, 8.0, 8.2, 9.4, 11.3, 41.0]
chain = ["Starbucks", "Taco Bell", "Pizza Hut", "KFC", "Wendy's", "Burger King", "McDonalds"]
width = 1/1.5

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.

axes = figure.add_subplot(1, 1, 1)

axes.set_title( "McDonald's Outpaces Competitors in Global Sales")
axes.bar(range(0, len(sales)), sales, width, color=THEME, align="center")
axes.set_xticks(range(0, len(sales)))
axes.set_xticklabels(chain)
axes.yaxis.grid(True, which="major")  # pass the flag positionally; the b= keyword was removed in newer matplotlib
axes.set_ylim((0, 50))
axes.set_ylabel( "Sales in Billions US $")

plt.show()

For a newspaper, this could certainly be dressed up a bit by either varying the background, bar or line colors. It might be appropriate to do so in the usual "theme" of the newspaper. For a data science presentation, this is probably sufficient. I like chart titles that communicate the theme, story or context.

I couldn't find a good contextual comparison. My guess is that originally the idea was to "put the money in context, wow, this is more than the GDP of Afghanistan!" I looked at states but the state with the smallest economy, Vermont, had a GSP of $29.8 billion. It would be nice to find a context at around the $10 billion mark and the $40 billion mark.

I think the best context might be something related to food...perhaps a regular sit down chain like Olive Garden.
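If a suitable context did turn up, one way to show it would be a labeled horizontal reference line rather than an extra bar. The following is only a sketch: it uses the Vermont figure quoted above purely as a stand-in, and the names `context_value` and `context_label` are mine, not part of the lab.

```python
import matplotlib.pyplot as plt

chain_sales = [4.1, 4.3, 8.0, 8.2, 9.4, 11.3, 41.0]  # billions US $, same values as the chart above
chain_names = ["Starbucks", "Taco Bell", "Pizza Hut", "KFC", "Wendy's", "Burger King", "McDonalds"]

# Stand-in context: the Vermont GSP mentioned in the text (an assumption, not a recommendation).
context_value = 29.8
context_label = "Vermont GSP"

context_figure = plt.figure(figsize=(10, 6))
context_axes = context_figure.add_subplot(1, 1, 1)

positions = range(len(chain_sales))
context_axes.bar(positions, chain_sales, 1/1.5, color="darkslategray", align="center")
context_axes.set_xticks(positions)
context_axes.set_xticklabels(chain_names)
context_axes.set_ylim((0, 50))
context_axes.set_ylabel("Sales in Billions US $")

# Draw the context as a dashed line across the full width and name it near the left edge.
context_axes.axhline(context_value, color="firebrick", linestyle="--", linewidth=1)
context_axes.text(-0.4, context_value + 0.5, context_label, color="firebrick", va="bottom")

plt.show()
```

Labeling the reference line does not break the "no value labels" rule, since it calls out one special value rather than restating every bar.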

Finally, I don't know when the chart was produced, but it would be interesting to update it. For example, 2015 Data can be found here although the CSS layout is awful.

Chart 2.

In [5]:

Image( "resources/chart_02.jpg", width=500)

Out[5]:

This is an interesting combination of charts. It's trying to tell us two things. First, what percentage of certain foods and food groups are being imported and then who the major exporters are. I think the chart fails in a number of ways. First, why are fruits and nuts in one group but melons included with fresh vegetables? Why do we have food groups (fruits, vegetables, seafood) and individual foods (honey, lamb)? Weird.
From a principles point of view, the chart is a disaster. That is not 52% of that lamb chop; they seem to have partitioned the pictures at random. Additionally, the smaller strawberry and lobster are confusing and useless as decorations to the export tables. On the plus side, they used, basically, tables for the export countries. I'm not sure most people know what Vietnam's flag looks like, so the flags are spurious decoration. Because the pictures themselves encode nothing, everything needs to be labeled when, as a visualization, no data points should be labeled (unless you're calling out something special).

Without getting raw data, we're not going to be able to fix this graphic 100%. Let's concentrate on the top half.

In [6]:

percent = [20.0, 51.0, 52.0, 61.0, 88.0]
food = ["Vegetables/Melons", "Fruits & Nuts", "Lamb", "Honey", "Seafood"]
width = 1/1.5

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.

axes = figure.add_subplot(1, 1, 1)

axes.set_title( "Most Seafood is Imported")
axes.bar(range(0, len(percent)), percent, width, color=THEME, align="center")
axes.set_xticks(range(0, len(percent)))
axes.set_xticklabels(food)
axes.yaxis.grid(True, which="major")  # pass the flag positionally; the b= keyword was removed in newer matplotlib
axes.set_ylim((0, 100))
axes.set_ylabel( "Percent Imported")

plt.show()

I think the best way to fix this chart would be a complete do-over. With the data available, there's not much you can do, and I'm tempted to say this would be more effective as a table: one row per food, with a column for the percent imported and a column listing the top 2 exporters.
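As a rough illustration of the table idea, here is a minimal sketch that prints the percent-imported figures as a sorted plain-text table. The exporter column is left out because those values were not transcribed from the graphic; `format_import_table` is a hypothetical helper name.

```python
# Percent-imported figures read off the original graphic (same values as the chart above).
import_rows = [
    ("Vegetables/Melons", 20.0),
    ("Fruits & Nuts", 51.0),
    ("Lamb", 52.0),
    ("Honey", 61.0),
    ("Seafood", 88.0),
]

def format_import_table(rows):
    """Return a plain-text table sorted from most to least imported."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    name_width = max(len(name) for name, _ in ordered)
    lines = [f"{'Food':<{name_width}}  {'% Imported':>10}"]
    lines.append("-" * (name_width + 12))
    for name, pct in ordered:
        lines.append(f"{name:<{name_width}}  {pct:>10.0f}")
    return "\n".join(lines)

import_table = format_import_table(import_rows)
print(import_table)
```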

IMPORTANT: when copying code, make sure you change the variable names. Failure to do so is a failure.

Chart 3.

In [7]:

Image( "resources/chart_03.jpg", width=500)

Out[7]:

This graphic is supposed to tell the story of perhaps a vote about changing the drinking age in Saskatchewan giving context by showing the drinking ages of the other provinces of Canada.
The main problem is that this is a bar chart that doesn't start at zero and doesn't really encode a lot of information. There are only 13 provinces and territories and 2 drinking ages...it could easily be a table. However, I do like the "headline" approach to the title of the chart.

In [8]:

count = [3.0, 10.0]
age = ["18 years", "19 years"]
width = 1/1.5

figure = plt.figure(figsize=(5, 6)) # first element is width, second is height.

axes = figure.add_subplot(1, 1, 1)

axes.set_title( "Drinking age will remain 19 in Saskatchewan")
axes.bar(range(0, len(count)), count, width, color=THEME, align="center")
axes.set_xticks(range(0, len(count)))
axes.set_xticklabels(age)
axes.yaxis.grid(True, which="major")  # pass the flag positionally; the b= keyword was removed in newer matplotlib
axes.set_ylim((0, 13))
axes.set_ylabel( "Number of Provinces and Territories")
axes.set_xlabel( "Drinking Age")

plt.show()

Although I think this could easily be a table, I'm going to go with a chart to show what I think is the real story: the drinking age is 19 in most of Canada's provinces and territories.

Lab Questions

Follow the directions for Lab Questions and Discussion.

Chart 1

In [9]:

Image( "resources/chart_04.jpg", width=500)

Out[9]:

There are at least two stories here. The first is that "under 25" is the largest proportion of people enrolled in college. The other story is that the proportion of people "25 and over" has increased.
There are a number of problems with this chart. The fake 3d is useless. The scale is broken. The areas are inexplicably filled in (if they weren't, the broken scale wouldn't matter). More strangely, one series is just 100% minus the other. 100% of the possibilities are represented but the data doesn't fill the chart.

In [10]:

percent = [28.0, 29.2, 32.8, 33.6, 33.0]
years = ["1972", "1973", "1974", "1975", "1976"]
xs = list(range( 1, 6))

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.

axes = figure.add_subplot(1, 1, 1)

axes.set_title( "College enrollments of students 25 and older increasing")
axes.plot(xs, percent, "o", color=THEME)
axes.vlines(xs, [0], percent, linestyles='dotted', lw=2)
axes.set_xticks(xs)
axes.set_xlim((0, 6))
axes.set_ylim((25, 40))
axes.set_xticklabels(years)
axes.xaxis.grid(False)
axes.set_ylabel( "Percent Enrolled")

plt.show()

There are certainly a number of ways to redo this chart. One way would be to simply show just the "25 and Over" percentages of enrollment. Depending on the story you're trying to tell, a bar chart may or may not be better. I changed over from a bar chart because when you start at 0, the increases don't look very large. It's ok, if you are emphasizing changes, to pick a chart that emphasizes them as long as you have a scale of some kind.

Perhaps the more interesting question here is why did it increase? The Vietnam War draft ended in January 1973, so it's entirely possible that the increased enrollment is explained by returning, older soldiers. How would you add an annotation to this chart to indicate that?
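One possible answer, sketched with matplotlib's `Axes.annotate`: point an arrow at the 1973 observation and put a short note in the empty space above it. The wording of the note and its position are my own choices, and the variable names are changed from the cell above.

```python
import matplotlib.pyplot as plt

older_percent = [28.0, 29.2, 32.8, 33.6, 33.0]
enroll_years = ["1972", "1973", "1974", "1975", "1976"]
year_xs = list(range(1, 6))

annotated_figure = plt.figure(figsize=(10, 6))
annotated_axes = annotated_figure.add_subplot(1, 1, 1)

annotated_axes.set_title("College enrollments of students 25 and older increasing")
annotated_axes.plot(year_xs, older_percent, "o", color="darkslategray")
annotated_axes.set_xticks(year_xs)
annotated_axes.set_xticklabels(enroll_years)
annotated_axes.set_xlim((0, 6))
annotated_axes.set_ylim((25, 40))
annotated_axes.set_ylabel("Percent Enrolled")

# Point an arrow at the 1973 observation and place the note in empty space above it.
draft_note = annotated_axes.annotate(
    "Draft ends (1973)",
    xy=(year_xs[1], older_percent[1]),   # the data point being called out
    xytext=(year_xs[1] - 0.8, 36.0),     # where the text sits, in data coordinates
    arrowprops=dict(arrowstyle="->", color="gray"),
)

plt.show()
```

An annotation like this is the "calling out something special" exception to the no-labels rule: it adds context the data points cannot carry on their own.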

Chart 2

In [11]:

Image( "resources/chart_05.jpg", width=500)

Out[11]:

This chart is trying to tell a story of job loss. It is inadvertently (?) telling a story of a constant rate of job loss. Since President Obama took office in January 2009, it's also trying to tell a story that the rate of job loss hasn't changed since he took office.
While the job loss data is accurate, there are several items about the chart that make it misleading. First, the dates are not appropriately spaced (the spaces between the dots do not reflect the actual time between the dates). Second, once you fill in below the line, you have created a sort of bar chart and the "start at zero" rule applies. Why? Because it looks like the job loss went from almost nothing to 8 times that amount...when, in fact, it only doubled. Our eyes focus on the two ends of the triangle instead of the scale.
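The gap between "looks like 8 times" and "only doubled" is just arithmetic on the axis baseline. A small sketch, using the first and last values read off the chart (7.0 and 15.0); the truncated baselines tried here are hypothetical, since the original's exact axis start is not known.

```python
def apparent_ratio(first, last, baseline):
    """Ratio of the drawn heights when the axis starts at `baseline` instead of 0."""
    return (last - baseline) / (first - baseline)

first_loss, last_loss = 7.0, 15.0  # first and last values from the chart

true_ratio = apparent_ratio(first_loss, last_loss, baseline=0.0)
print(f"baseline 0.0 -> {true_ratio:.2f}x")

# Hypothetical truncated baselines; the closer the baseline gets to the first
# value, the more exaggerated the filled area looks.
for baseline in (5.0, 6.0):
    print(f"baseline {baseline} -> {apparent_ratio(first_loss, last_loss, baseline):.2f}x")
```

With a baseline of 0 the ratio is a little over 2; moving the baseline up toward the first value inflates the apparent ratio quickly, which is why filled areas need to start at zero.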

In [12]:

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.

# x positions are months since November 2007, so December 2007 is month 1.
xs = [1, 10, 16, 31]
ys = [7.0, 9.0, 13.5, 15.0]

axes = figure.add_subplot(1, 1, 1)
axes.plot(xs, ys, "o-", color=THEME)
axes.xaxis.grid(False)
axes.set_xlim((0, 32))
axes.set_ylim((5.0, 17.0))
axes.set_xticks(xs)
axes.set_xticklabels(["Dec '07", "Sep '08", "March '09", "June '10"])

plt.show()

The main thing here is that you need to have a chart that appropriately expresses the actual time, in distance, between the dates. And so the challenge becomes expressing the dates. I decided to express the underlying data as months since November 2007, which made certain that December 2007 wouldn't appear on the y axis.
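The "months since November 2007" positions can be computed rather than counted by hand. A minimal sketch, assuming each label refers to a whole calendar month (the day of the month is ignored); `months_since` is a hypothetical helper, not part of the lab.

```python
from datetime import date

def months_since(reference, when):
    """Whole calendar months from `reference` to `when` (day of month ignored)."""
    return (when.year - reference.year) * 12 + (when.month - reference.month)

reference_month = date(2007, 11, 1)  # "month zero", chosen so December 2007 lands at x = 1
chart_dates = [date(2007, 12, 1), date(2008, 9, 1), date(2009, 3, 1), date(2010, 6, 1)]

month_offsets = [months_since(reference_month, d) for d in chart_dates]
print(month_offsets)
```

Computing the offsets this way keeps the spacing honest if more dates are added later.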

Note that I did not change the y axis range. Because I did not fill in under the line, I do not need to start at 0.

You can see that this chart tells a very different story than the original. The rate of job loss has declined.

Chart 3

In [13]:

Image( "resources/chart_06.jpg", width=500)

Out[13]:

This chart is telling a story about the trend in the unemployment rate in 2011.
Oh, Fox News. I love you for your charts. The main problem with this chart is that by filling in under the line and creating an area, the "start at zero" rule comes into play. Of course, you have to have the data correct as well. The last observation is 8.6% but it's shown at the same place as 9.0%. The November value is actually the lowest one for the entire year. Additionally, the numbers are included...luckily for us but in a pure visualization they should not be...visualizations should speak for themselves or they are by definition not effective. Communicate the actual data through tables.

In [14]:

percent = [9.0, 8.9, 8.8, 9.0, 9.1, 9.2, 9.1, 9.1, 9.1, 9.0, 8.6]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov"]
xs = list(range( 1, len(months)+1))

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.

axes = figure.add_subplot(1, 1, 1)

axes.set_title( "Unemployment Rate Under Obama at its lowest for 2011")
axes.plot(xs, percent, "o", color=THEME)
axes.vlines(xs, [0], percent, linestyles='dotted', lw=2)
axes.set_xticks(xs)
axes.set_xlim((0, len( months)+1))
axes.set_ylim((8, 10))
axes.set_xticklabels(months)
axes.xaxis.grid(False)
axes.set_ylabel( "Unemployment Rate")

plt.show()

Many charts involving time call for judgement. For example, we generally use bar charts for the months or quarters of a single year, or for a handful of years, and switch to line charts for many years, but not always.

Keeping a non-zero y-axis start while staying close to a bar chart style requires using dots. I'm ambivalent about dot charts because they're not really that common. In the chart below, I used a line chart. Because the area under the line is not shaded, we do not have to start at 0.

In [15]:

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.

axes = figure.add_subplot(1, 1, 1)

axes.set_title( "Unemployment Rate Under Obama at its lowest for 2011")
axes.plot(xs, percent, "o-", color=THEME)
axes.set_xticks(xs)
axes.set_xlim((0, len( months)+1))
axes.set_ylim((8, 10))
axes.set_xticklabels(months)
axes.xaxis.grid(False)
axes.set_ylabel( "Unemployment Rate")

plt.show()

This is not, of course, the story that Fox News wanted to tell. I'm not picking on Fox News...they just seem to be the worst offenders. Speaking of which...

Chart 4

In [16]:

Image( "resources/chart_07.jpg", width=500)

Out[16]:

This chart is trying to tell a story of increases in southwest border apprehensions.
The main problem here is a bar chart that doesn't start at zero. Our pre-attentive awareness sees the last bar as three times the first bar no matter what it says. Additionally, the numbers are shown on the chart and for pure visualizations, this is verboten. If you have to include the numbers, it's not a visualization. If you need to include the numbers, attach a table.

In [17]:

apprehensions = [165224, 170223, 192298]
year = ["2011", "2012", "2013"]
width = 1/1.5

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.

axes = figure.add_subplot(1, 1, 1)

axes.set_title( "Border Apprehensions")
axes.bar(range(0, len(apprehensions)), apprehensions, width, color=THEME, align="center")
axes.set_xticks(range(0, len(apprehensions)))
axes.set_xticklabels(year)
axes.yaxis.grid(True, which="major")  # pass the flag positionally; the b= keyword was removed in newer matplotlib
axes.set_ylim((0, 195000))
axes.set_ylabel( "Border Apprehensions")

plt.show()

The interesting thing here is that Fox News didn't really need to "cheat" in order to show that 2013 had a marked increase in border apprehensions: the 2013 figure is more than 16% higher than the 2011 figure.
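The size of the increase can be checked directly from the three values on the chart. A short sketch; `percent_change` is a hypothetical helper name.

```python
def percent_change(old, new):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0

counts_by_year = {"2011": 165224, "2012": 170223, "2013": 192298}

change_2011_2013 = percent_change(counts_by_year["2011"], counts_by_year["2013"])
change_2012_2013 = percent_change(counts_by_year["2012"], counts_by_year["2013"])

print(f"2011 -> 2013: {change_2011_2013:.1f}%")
print(f"2012 -> 2013: {change_2012_2013:.1f}%")
```

The 16% figure is the change relative to 2011; measured against 2012 alone, the increase is closer to 13%.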

But there do seem to be some problems with the data. I'm not entirely sure what "October-April" means. Federal statistics should be on an "October-September" fiscal year, and it's not clear whether Fox News restricted every year to the same partial fiscal year or whether only 2013 is partial. The thing that the graph does not make clear is that in FY2000, apprehensions were 1,643,679 according to CBP Data compared to 414,397 in FY2013...only one quarter of what they were. So what we're really talking about is a small increase in values that are only one quarter of what they were two decades ago.

You should make that chart.
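A starting point for that chart, as a sketch only: it plots just the two full fiscal-year totals quoted above, and the title is my own headline. A fuller version would include every fiscal year in between, which is not reproduced here.

```python
import matplotlib.pyplot as plt

# Full fiscal-year totals quoted in the text above (attributed there to CBP data).
fiscal_years = ["FY2000", "FY2013"]
fy_apprehensions = [1643679, 414397]

long_view_figure = plt.figure(figsize=(5, 6))
long_view_axes = long_view_figure.add_subplot(1, 1, 1)

long_view_axes.set_title("Apprehensions are about a quarter of FY2000 levels")
long_view_axes.bar(range(len(fy_apprehensions)), fy_apprehensions, 1/1.5,
                   color="darkslategray", align="center")
long_view_axes.set_xticks(range(len(fy_apprehensions)))
long_view_axes.set_xticklabels(fiscal_years)
long_view_axes.yaxis.grid(True, which="major")
long_view_axes.set_ylim((0, 1800000))  # bars, so the scale starts at zero
long_view_axes.set_ylabel("Border Apprehensions")

plt.show()
```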

Chart 5

In [18]:

Image( "resources/chart_08.jpg", width=500)

Out[18]:

I'm not even sure what this is about. They're trying to tell a story about proportions of something.
Ugh...3d rotated pie chart with labeled slices. From a visualization point of view, there aren't many more things you can get wrong in a single chart.

In [19]:

percent =   [29, 42, 19, 8, 2]
frequency = ["Never", "Every Few Months", "Every Few Weeks", "Every Few Days", "Daily"]
width = 1/1.5

figure = plt.figure(figsize=(10, 6)) # first element is width, second is height.

axes = figure.add_subplot(1, 1, 1)

axes.set_title( "Something")
axes.bar(range(0, len(percent)), percent, width, color=THEME, align="center")
axes.set_xticks(range(0, len(percent)))
axes.set_xticklabels(frequency)
axes.yaxis.grid(True, which="major")  # pass the flag positionally; the b= keyword was removed in newer matplotlib
axes.set_ylim((0, 50))
axes.set_ylabel( "Percent")

plt.show()

There's not a lot to say here because there's very little context. I would mention at this point that if your categories are ordered as these are and the values do not follow that order, consider doing two charts. One ordered by the categories and one ordered by the values (do the charts side by side).

I didn't need to do that here because the order of the values almost perfectly matched the order of the categories.

Showing only a single chart, where the categories have a natural order but the bars are sorted by value, can be very disorienting.
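For cases where the two orderings do differ, the side-by-side layout might look like the sketch below. It reuses the survey percentages from the chart above with renamed variables; the panel titles and the rotated tick labels are my own choices.

```python
import matplotlib.pyplot as plt

answer_percent = [29, 42, 19, 8, 2]
answer_labels = ["Never", "Every Few Months", "Every Few Weeks", "Every Few Days", "Daily"]

# Pair up and sort a copy by value, largest first; the category order is left untouched.
by_value = sorted(zip(answer_labels, answer_percent), key=lambda pair: pair[1], reverse=True)
value_labels = [label for label, _ in by_value]
value_percent = [pct for _, pct in by_value]

pair_figure, (category_axes, value_axes) = plt.subplots(1, 2, figsize=(14, 5), sharey=True)

for axes, labels, values, title in [
    (category_axes, answer_labels, answer_percent, "Ordered by category"),
    (value_axes, value_labels, value_percent, "Ordered by value"),
]:
    positions = range(len(values))
    axes.bar(positions, values, 1/1.5, color="darkslategray", align="center")
    axes.set_xticks(positions)
    axes.set_xticklabels(labels, rotation=30, ha="right")
    axes.set_ylim((0, 50))
    axes.set_title(title)

category_axes.set_ylabel("Percent")

plt.show()
```

Sharing the y axis keeps the two panels directly comparable, so the only thing that changes between them is the order of the bars.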

In [20]:

Image( "resources/chart_13.jpg", width=500)

Out[20]:

What's wrong with this set of charts? Can you fix it?

I'm not sure that what's wrong with this set of charts is fixable.

First, the one thing I can see that they did right is have a single legend. Once you get it in your head that "gold" is "4 or more hours per day" then you can easily identify it in all the charts. Additionally, these are what we might call "small multiples".

The bad parts include, well, pie charts. And they know that pie charts are bad because they end up having to label each slice--and there are only 4 slices. You could have had six sorted tables and done as well if you definitely wanted to convey the values. Again the fact that you must include labels shows how bad pie charts are for this sort of thing.

The units are janky:

less than 1 hour per week => This is under 9 minutes per day on average!
1 to 4 hours a week => This is roughly 9 to 34 minutes a day on average!
1 to 3 hours per day
more than 4 hours per day

So our intervals are not of equal sizes. I get that they're trying to get at things that you do every week versus things you do every day but this is not a particularly effective way to do it.
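Putting every boundary on the same scale makes the uneven intervals easy to see. A quick sketch, assuming the weekly totals are averaged over a seven-day week (averaging over a five-day work week would give larger numbers); the boundary labels are paraphrased from the legend.

```python
MINUTES_PER_HOUR = 60
DAYS_PER_WEEK = 7  # assumption: weekly totals are spread over all seven days

def weekly_hours_to_daily_minutes(hours_per_week):
    """Average minutes per day implied by a weekly total."""
    return hours_per_week * MINUTES_PER_HOUR / DAYS_PER_WEEK

# Interval boundaries from the survey legend, all expressed in minutes per day.
boundaries = {
    "1 hour per week": weekly_hours_to_daily_minutes(1),
    "4 hours per week": weekly_hours_to_daily_minutes(4),
    "1 hour per day": 1 * MINUTES_PER_HOUR,
    "3 hours per day": 3 * MINUTES_PER_HOUR,
    "4 hours per day": 4 * MINUTES_PER_HOUR,
}

for label, minutes in boundaries.items():
    print(f"{label:<17} {minutes:6.1f} min/day")
```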

It's difficult to compose the data in such a way that I might determine what the typical data scientist spends their day/week doing. If you take the smallest percent (5%) that does 4 hours or more of something, then you get a picture of someone who does:

4 hours of ETL
4 hours of data cleaning
4 hours of Exploratory data analysis
4 hours of machine learning and statistics
4 hours of creating visualizations
4 hours of presenting results

every single day which is a pretty tiring schedule if you ask me ;)