logo

Computational and Inferential Thinking

  • 1. Data Science
    • 1.1. Introduction
      • 1.1.1. Computational Tools
      • 1.1.2. Statistical Techniques
    • 1.2. Why Data Science?
    • 1.3. Plotting the Classics
      • 1.3.1. Literary Characters
      • 1.3.2. Another Kind of Character
  • 2. Causality and Experiments
    • 2.1. John Snow and the Broad Street Pump
    • 2.2. Snow’s “Grand Experiment”
    • 2.3. Establishing Causality
    • 2.4. Randomization
    • 2.5. Endnote
  • 3. Programming in Python
    • 3.1. Expressions
    • 3.2. Names
      • 3.2.1. Example: Growth Rates
    • 3.3. Call Expressions
    • 3.4. Introduction to Tables
  • 4. Data Types
    • 4.1. Numbers
    • 4.2. Strings
      • 4.2.1. String Methods
    • 4.3. Comparisons
  • 5. Sequences
    • 5.1. Arrays
    • 5.2. Ranges
    • 5.3. More on Arrays
  • 6. Tables
    • 6.1. Sorting Rows
    • 6.2. Selecting Rows
    • 6.3. Example: Population Trends
    • 6.4. Example: Sex Ratios
  • 7. Visualization
    • 7.1. Categorical Distributions
    • 7.2. Numerical Distributions
    • 7.3. Overlaid Graphs
  • 8. Functions and Tables
    • 8.1. Applying Functions to Columns
    • 8.2. Classifying by One Variable
    • 8.3. Cross-Classifying
    • 8.4. Joining Tables by Columns
    • 8.5. Bike Sharing in the Bay Area
  • 9. Randomness
    • 9.1. Conditional Statements
    • 9.2. Iteration
    • 9.3. Simulation
    • 9.4. The Monty Hall Problem
    • 9.5. Finding Probabilities
  • 10. Sampling and Empirical Distributions
    • 10.1. Empirical Distributions
    • 10.2. Sampling from a Population
    • 10.3. Empirical Distibution of a Statistic
    • 10.4. Random Sampling in Python
  • 11. Testing Hypotheses
    • 11.1. Assessing a Model
    • 11.2. Multiple Categories
    • 11.3. Decisions and Uncertainty
    • 11.4. Error Probabilities
  • 12. Comparing Two Samples
    • 12.1. A/B Testing
    • 12.2. Causality
    • 12.3. Deflategate
  • 13. Estimation
    • 13.1. Percentiles
    • 13.2. The Bootstrap
    • 13.3. Confidence Intervals
    • 13.4. Using Confidence Intervals
  • 14. Why the Mean Matters
    • 14.1. Properties of the Mean
    • 14.2. Variability
    • 14.3. The SD and the Normal Curve
    • 14.4. The Central Limit Theorem
    • 14.5. The Variability of the Sample Mean
    • 14.6. Choosing a Sample Size
  • 15. Prediction
    • 15.1. Correlation
    • 15.2. The Regression Line
    • 15.3. The Method of Least Squares
    • 15.4. Least Squares Regression
    • 15.5. Visual Diagnostics
    • 15.6. Numerical Diagnostics
  • 16. Inference for Regression
    • 16.1. A Regression Model
    • 16.2. Inference for the True Slope
    • 16.3. Prediction Intervals
  • 17. Classification
    • 17.1. Nearest Neighbors
    • 17.2. Training and Testing
    • 17.3. Rows of Tables
    • 17.4. Implementing the Classifier
    • 17.5. The Accuracy of the Classifier
    • 17.6. Multiple Regression
  • 18. Updating Predictions
    • 18.1. A "More Likely Than Not" Binary Classifier
    • 18.2. Making Decisions
Powered by Jupyter Book
Contents
  • 6.3.1. Ages 97-100
  • 6.3.2. Percent Change

6.3. Example: Population Trends¶

We are now ready to work with large tables of data. The file below contains “Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.” Notice that read_table can read data directly from a URL.

# As of August 2021, this census file is online here: 
data = 'http://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/nc-est2019-agesex-res.csv'

# A local copy can be accessed here in case census.gov moves the file:
# data = path_data + 'nc-est2019-agesex-res.csv'

full_census_table = Table.read_table(data)
full_census_table
SEX AGE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 POPESTIMATE2014 POPESTIMATE2015 POPESTIMATE2016 POPESTIMATE2017 POPESTIMATE2018 POPESTIMATE2019
0 0 3944153 3944160 3951430 3963092 3926570 3931258 3954787 3983981 3954773 3893990 3815343 3783052
0 1 3978070 3978090 3957730 3966225 3977549 3942698 3948891 3973133 4002903 3972711 3908830 3829599
0 2 4096929 4096939 4090621 3970654 3978925 3991740 3958711 3966321 3991349 4020045 3987032 3922044
0 3 4119040 4119051 4111688 4101644 3981531 3991017 4005928 3974351 3982984 4006946 4033038 3998665
0 4 4063170 4063186 4077346 4121488 4111490 3992502 4004032 4020292 3989750 3997280 4018719 4043323
0 5 4056858 4056872 4064521 4087054 4131049 4121876 4004576 4017589 4035033 4003452 4008443 4028281
0 6 4066381 4066412 4072904 4074531 4096631 4141126 4133372 4017388 4031568 4048018 4014057 4017227
0 7 4030579 4030594 4042990 4082821 4084175 4106756 4152666 4145872 4030888 4044139 4058370 4022319
0 8 4046486 4046497 4025501 4052773 4092559 4094513 4118349 4165033 4158848 4042924 4054236 4066194
0 9 4148353 4148369 4125312 4035319 4062726 4103052 4106068 4130887 4177895 4170813 4053179 4061874

... (296 rows omitted)

Only the first 10 rows of the table are displayed. Later we will see how to display the entire table; however, this is typically not useful with large tables.

A description of the table appears online.

The SEX column contains numeric codes: 0 stands for the total, 1 for male, and 2 for female. The assumptions underlying this binary categorization are questionable. We will discuss them in the next section.

In this section we will focus on age and population size.

The AGE column contains ages in completed years, but the special value 999 is a sum of the total population. The “age” 100 too has a special status. In that category, the Census Bureau includes all people aged 100 or more.

The other columns contain estimates of the US population in each category of sex and age in the years 2010 through 2019. The Census is decennial: it takes place every 10 years. The most recent Census was held in 2020 and the one before that in 2010. The Census Bureau also estimates population changes each year. As explained in the Bureau’s description of its methodology, it “adds [the estimated changes] to the last decennial census to produce updated population estimates every year.”

Typically, a publicly available table will contain more information than necessary for a particular investigation or analysis. To get the large table into a more usable form, we have to do some data cleaning.

Suppose we are only interested in the population changes from 2014 to 2019. Let’s select the relevant columns.

partial_census_table = full_census_table.select('SEX', 'AGE', 'POPESTIMATE2014', 'POPESTIMATE2019')
partial_census_table
SEX AGE POPESTIMATE2014 POPESTIMATE2019
0 0 3954787 3783052
0 1 3948891 3829599
0 2 3958711 3922044
0 3 4005928 3998665
0 4 4004032 4043323
0 5 4004576 4028281
0 6 4133372 4017227
0 7 4152666 4022319
0 8 4118349 4066194
0 9 4106068 4061874

... (296 rows omitted)

We can simplify the labels of the selected columns.

us_pop = partial_census_table.relabeled('POPESTIMATE2014', '2014').relabeled('POPESTIMATE2019', '2019')
us_pop
SEX AGE 2014 2019
0 0 3954787 3783052
0 1 3948891 3829599
0 2 3958711 3922044
0 3 4005928 3998665
0 4 4004032 4043323
0 5 4004576 4028281
0 6 4133372 4017227
0 7 4152666 4022319
0 8 4118349 4066194
0 9 4106068 4061874

... (296 rows omitted)

6.3.1. Ages 97-100¶

As a warm-up, let’s examine the total population, labeled by SEX code 0. Since all these rows will have the same value 0 in the SEX column, we will drop that column.

us_pop_by_age = us_pop.where('SEX', are.equal_to(0)).drop('SEX')
us_pop_by_age
AGE 2014 2019
0 3954787 3783052
1 3948891 3829599
2 3958711 3922044
3 4005928 3998665
4 4004032 4043323
5 4004576 4028281
6 4133372 4017227
7 4152666 4022319
8 4118349 4066194
9 4106068 4061874

... (92 rows omitted)

Now let’s look at the population in the highest ages.

us_pop_by_age.where('AGE', are.between(97, 101))
AGE 2014 2019
97 82948 116969
98 59546 86150
99 41277 57124
100 70685 100322

Not surprisingly, the numbers of people are smaller at higher ages. For example, there are fewer 99-year-olds than 98-year-olds.

But the numbers for AGE 100 are quite a bit larger than those for age 99. That is because the row with AGE 100 doesn’t just represent 100-year-olds. It also includes those who are older than 100.

6.3.2. Percent Change¶

Each column of the table us_pop_by_age is an array of the same length, and so columns can be combined using arithmetic. The array below contains the change in population between 2014 and 2019. There is one entry corresponding to each row of us_pop_by_age.

change = us_pop_by_age.column('2019') - us_pop_by_age.column('2014')
change
array([-171735, -119292,  -36667, ...,   15847,   29637, 9938515])

We can augment us_pop_by_age with a column that contains these changes, both in absolute terms and as percents relative to the value in 2014.

us_pop_change = us_pop_by_age.with_columns(
    'Change', change,
    'Percent Change', change/us_pop_by_age.column('2014')
)
us_pop_change.set_format('Percent Change', PercentFormatter)
AGE 2014 2019 Change Percent Change
0 3954787 3783052 -171735 -4.34%
1 3948891 3829599 -119292 -3.02%
2 3958711 3922044 -36667 -0.93%
3 4005928 3998665 -7263 -0.18%
4 4004032 4043323 39291 0.98%
5 4004576 4028281 23705 0.59%
6 4133372 4017227 -116145 -2.81%
7 4152666 4022319 -130347 -3.14%
8 4118349 4066194 -52155 -1.27%
9 4106068 4061874 -44194 -1.08%

... (92 rows omitted)

Almost all the entries displayed in the Percent Change column are negative, demonstrating a drop in population at the youngest ages. However, the overall population grew by about 9.9 million people, a percent change of just over 3%.

us_pop_change.where('AGE', are.equal_to(999))
AGE 2014 2019 Change Percent Change
999 318301008 328239523 9938515 3.12%

Let us compare this to the change at each age. For ease of interpretation, we will sort the table in decreasing order of the absolute change in population, contained in the column Change.

us_pop_change.where(
    'AGE', are.below(999)
).sort('Change', descending=True)
AGE 2014 2019 Change Percent Change
72 2191642 3191048 999406 45.60%
68 2567511 3345475 777964 30.30%
69 2530460 3252423 721963 28.53%
70 2461426 3136704 675278 27.43%
71 2516392 3083083 566691 22.52%
76 1692960 2222392 529432 31.27%
62 3677408 4156645 479237 13.03%
28 4345247 4818725 473478 10.90%
64 3481789 3950578 468789 13.46%
38 3848856 4305576 456720 11.87%

... (91 rows omitted)

Take a look at the top few rows. While the percent change is about 3% for the overall population, it jumps to well over 20% for the people in their late sixties and early seventies. This stunning change contributes to what is known as the greying of America.

What could explain this large increase? We can explore this question by examining the years in which the relevant groups were born.

  • Those who were in the age group 69 to 72 in 2014 were born in the years 1942 to 1945. The attack on Pearl Harbor was in late 1941, and by 1942 U.S. forces were heavily engaged in a massive war that ended in 1945.

  • Those who were 69 to 72 years old in 2019 were born in the years 1947 to 1950, at the height of the post-WWII baby boom in the United States.

The post-war jump in births is a major reason for the large changes that we have observed.

previous

6.2. Selecting Rows

next

6.4. Example: Sex Ratios

By Ani Adhikari and John DeNero and David Wagner
© Copyright 2021.