In order to illustrate the use of jupyter notebook
and R
, let us prepare for the next lecture using a notebook
.
Notebooks run online (here on syzygy.ca). It uses two (mainly) types of "cells". A cell like this one is a text cell. Text is formatted using markdown
, which is a simple text description language yet has relatively powerful capabilities. See here for details, for instance.
# This is an R code cell
# The sign "#" is used for comments in R, so these lines do nothing
This is a markdown cell.
Cells are evaluated by pressing Shift+Enter or Ctrl+Enter. This current one is an evaluated markdown cell, the one just above is an unevaluated markdown cell, the first an R cell (evaluated or not, since it is only a comment, it is hard to tell the difference)
To illustrate the method, we will consider the evolution of the population of Canada through time. For this, we will grab the census data. We search for "canada historical census data csv", since csv
(comma separated values) is a very easy format to use with R
. Here, we find a csv
for 1851 to 1976. We follow the link to Table A2-14, where we find another link, this time to a csv
file. This is what we use in R
.
The function read.csv
reads in a file (potentially directly from the web). We assign the result to the variable data
. We then use the function head
to show the first few lines in the result.
data_old = read.csv("https://www150.statcan.gc.ca/n1/en/pub/11-516-x/sectiona/A2_14-eng.csv?st=L7vSnqio")
head(data_old)
X | Series.A2.14. | Population.of.Canada..by.province..census.dates..1851.to.1976 | X.1 | X.2 | X.3 | X.4 | X.5 | X.6 | X.7 | ⋯ | X.11 | X.12 | X.13 | X.14 | X.15 | X.16 | X.17 | X.18 | X.19 | X.20 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <chr> | <int> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | ⋯ | <chr> | <int> | <chr> | <int> | <chr> | <int> | <chr> | <chr> | <int> | <lgl> | |
1 | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | ||||||||||||
2 | NA | Year | Canada | NA | Newfound- | Prince | NA | Nova | New | Quebec | ⋯ | Saskat- | NA | Alberta | NA | British | NA | Yukon | Northwest | NA | NA |
3 | NA | NA | land | Edward | NA | Scotia | Brunswick | ⋯ | chewan | NA | NA | Columbia | NA | Territory | Territories | NA | NA | ||||
4 | NA | NA | Island | NA | ⋯ | NA | NA | NA | NA | NA | |||||||||||
5 | NA | 2 | NA | 3 | 4 | NA | 5 | 6 | 7 | ⋯ | 10 | NA | 11 | NA | 12 | NA | 13 | 14 | NA | NA | |
6 | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA |
Obviously, this does not make a lot of sense. This is normal: take a look at the first few lines in the file. They take the form
,Series A2-14.,"Population of Canada, by province, census dates, 1851 to 1976",,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,
,Year,Canada,,Newfound-,Prince,,Nova,New,Quebec,Ontario, Manitoba,,Saskat-,,Alberta,,British,,Yukon,Northwest,,
,,,,land,Edward,,Scotia,Brunswick,,,,,chewan,,,,Columbia,,Territory,Territories,,
,,,,,Island,,,,,,,,,,,,,,,,,
,,2,,3,4,,5,6,7,8,9,,10,,11,,12,,13,14,,
,,,,,,,,,,,,,,,,,,,,,,
This happens often: the first few lines are here to set the information, they lay out a simple version of the so-called metadata
read.csv
takes the optional argument skip=
, which indicates how many lines to skip at the beginningdata_old = read.csv("https://www150.statcan.gc.ca/n1/en/pub/11-516-x/sectiona/A2_14-eng.csv?st=L7vSnqio",
skip = 2)
head(data_old)
X | Year | Canada | X.1 | Newfound. | Prince | X.2 | Nova | New | Quebec | ⋯ | Saskat. | X.4 | Alberta | X.5 | British | X.6 | Yukon | Northwest | X.7 | X.8 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <chr> | <int> | <chr> | <chr> | <int> | <chr> | <chr> | <chr> | ⋯ | <chr> | <int> | <chr> | <int> | <chr> | <int> | <chr> | <chr> | <int> | <lgl> | |
1 | NA | NA | land | Edward | NA | Scotia | Brunswick | ⋯ | chewan | NA | NA | Columbia | NA | Territory | Territories | NA | NA | ||||
2 | NA | NA | Island | NA | ⋯ | NA | NA | NA | NA | NA | |||||||||||
3 | NA | 2 | NA | 3 | 4 | NA | 5 | 6 | 7 | ⋯ | 10 | NA | 11 | NA | 12 | NA | 13 | 14 | NA | NA | |
4 | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA | ||||||||||||
5 | NA | 1976 | 22,992,604 | NA | 557,725 | 118,229 | NA | 828,571 | 677,250 | 6,234,445 | ⋯ | 921,323 | NA | 1,838,037 | NA | 2,466,608 | NA | 21,836 | 42,609 | NA | NA |
6 | NA | NA | NA | ⋯ | NA | NA | NA | NA | NA |
Here, there is the further issue that to make things legible, the table authors used 3 rows (from 2 to 4) to encode for long names (e.g., Prince Edward Island is written over 3 rows). Note, however, that read.csv
has rightly picked up on the first row being the column names
Because we are only interested in the total population of the country and the year, let us simply get rid of the first 4 rows and of all columns except the second (Year) and third (Canada)
data_old = data_old[5:dim(data_old)[1], 2:3]
data_old
Year | Canada | |
---|---|---|
<chr> | <chr> | |
5 | 1976 | 22,992,604 |
6 | ||
7 | 1971 | 21,568,311 |
8 | 1966 | 20,014,880 |
9 | 1961 | 18,238,247 |
10 | 1956 | 16,080,791 |
11 | 1951 | 14,009,429 |
12 | ||
13 | 1941 | 11,506,655 |
14 | 1931 | 10,376,786 |
15 | 1921 | 8,787,949 |
16 | 1911 | 7,206,643 |
17 | 1901 | 5,371,315 |
18 | ||
19 | 1891 | 4,833,239 |
20 | 1881 | 4,324,810 |
21 | 1871 | 3,689,257 |
22 | 1861 | 3,229,633 |
23 | 1851 | 2,436,297 |
24 | ||
25 | Includes 485 members of the Royal Canadian Navy whose province of residence is not known. | |
26 | Included with Northwest Territories. | |
27 | For the discussion of the ambiguities and under-enumeration contained in these figures consult the notes to series A2-14 in original volume. For completeness of enumeration in censuses of 1961 and later years, | |
28 | see notes to series A15-53. | |
29 | 1848 figure. |
Still not perfect:
Let us fix these issues.
For 1 and 2, this is easy: remark that the Canada column is empty for both issues. Now remark as well that below Canada (and Year, for that matter), it is written < chr >. This means that entries in the column are characters
. Looking for empty content therefore means looking for empty character chains.
So to fix 1 and 2, we keep the rows where Canada does not equal the empty chain.
To get rid of commas, we just need to substitute an empty chain for ",".
To sort, we find the order for the years and apply it to the entire table.
Finally, as remarked above, for now, both the year and the population are considered as character chains. This means that in order to plot anything, we will have to indicate that these are numbers, not characters.
data_old = data_old[which(data_old$Canada != ""),]
data_old$Canada = gsub(",", "", data_old$Canada)
order_data = order(data_old$Year)
data_old = data_old[order_data,]
data_old$Year = as.numeric(data_old$Year)
data_old$Canada = as.numeric(data_old$Canada)
data_old
Year | Canada | |
---|---|---|
<dbl> | <dbl> | |
23 | 1851 | 2436297 |
22 | 1861 | 3229633 |
21 | 1871 | 3689257 |
20 | 1881 | 4324810 |
19 | 1891 | 4833239 |
17 | 1901 | 5371315 |
16 | 1911 | 7206643 |
15 | 1921 | 8787949 |
14 | 1931 | 10376786 |
13 | 1941 | 11506655 |
11 | 1951 | 14009429 |
10 | 1956 | 16080791 |
9 | 1961 | 18238247 |
8 | 1966 | 20014880 |
7 | 1971 | 21568311 |
5 | 1976 | 22992604 |
Row numbers are a little weird, so let us fix this.
row.names(data_old) = 1:dim(data_old)[1]
data_old
Year | Canada | |
---|---|---|
<dbl> | <dbl> | |
1 | 1851 | 2436297 |
2 | 1861 | 3229633 |
3 | 1871 | 3689257 |
4 | 1881 | 4324810 |
5 | 1891 | 4833239 |
6 | 1901 | 5371315 |
7 | 1911 | 7206643 |
8 | 1921 | 8787949 |
9 | 1931 | 10376786 |
10 | 1941 | 11506655 |
11 | 1951 | 14009429 |
12 | 1956 | 16080791 |
13 | 1961 | 18238247 |
14 | 1966 | 20014880 |
15 | 1971 | 21568311 |
16 | 1976 | 22992604 |
Well, that looks about right! Let's see what this looks like in a graph.
plot(data_old$Year, data_old$Canada,
type = "b", lwd = 2,
xlab = "Year", ylab = "Population")
But wait, this is only to 1976..! Looking around, we find another table here. There's a download csv link in there, let us see where this leads us. The table is 720KB, so surely there must be more to this than just the population. To get a sense of that, we dump the whole data.frame, not just its head.
data_new = read.csv("https://www12.statcan.gc.ca/census-recensement/2011/dp-pd/vc-rv/download-telecharger/download-telecharger.cfm?Lang=eng&CTLG=98-315-XWE2011001&FMT=csv")
data_new
GEOGRAPHY.NAME | CHARACTERISTIC | YEAR.S. | TOTAL | FLAG_TOTAL |
---|---|---|---|---|
<chr> | <chr> | <chr> | <dbl> | <chr> |
Canada | Population (in thousands) | 1956 | 16081.0 | |
Canada | Population (in thousands) | 1961 | 18238.0 | |
Canada | Population (in thousands) | 1966 | 20015.0 | |
Canada | Population (in thousands) | 1971 | 21568.0 | |
Canada | Population (in thousands) | 1976 | 22993.0 | |
Canada | Population (in thousands) | 1981 | 24343.0 | |
Canada | Population (in thousands) | 1986 | 25309.0 | |
Canada | Population (in thousands) | 1991 | 27297.0 | |
Canada | Population (in thousands) | 1996 | 28847.0 | |
Canada | Population (in thousands) | 2001 | 30007.0 | |
Canada | Population (in thousands) | 2006 | 31613.0 | |
Canada | Population (in thousands) | 2011 | 33477.0 | |
Canada | Population - % change | 1956 to 1961 | 13.4 | |
Canada | Population - % change | 1961 to 1966 | 9.7 | |
Canada | Population - % change | 1966 to 1971 | 7.8 | |
Canada | Population - % change | 1971 to 1976 | 6.6 | |
Canada | Population - % change | 1976 to 1981 | 5.9 | |
Canada | Population - % change | 1981 to 1986 | 4.0 | |
Canada | Population - % change | 1986 to 1991 | 7.9 | |
Canada | Population - % change | 1991 to 1996 | 5.7 | |
Canada | Population - % change | 1996 to 2001 | 4.0 | |
Canada | Population - % change | 2001 to 2006 | 5.4 | |
Canada | Population - % change | 2006 to 2011 | 5.9 | |
Canada | Total private dwellings occupied by usual residents | 1996 | 10820050.0 | |
Canada | Total private dwellings occupied by usual residents | 2001 | 11562975.0 | |
Canada | Total private dwellings occupied by usual residents | 2006 | 12435520.0 | |
Canada | Total private dwellings occupied by usual residents | 2011 | 13320614.0 | |
Canada | % of the population aged 0 to 14 years | 1921 | 34.4 | |
Canada | % of the population aged 0 to 14 years | 1931 | 31.6 | |
Canada | % of the population aged 0 to 14 years | 1941 | 27.8 | |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Quebec | Population share | 2001 | 24.1 | |
Quebec | Population share | 2011 | 23.6 | |
Ontario | Population share | 1951 | 32.8 | |
Ontario | Population share | 1961 | 34.2 | |
Ontario | Population share | 1971 | 35.7 | |
Ontario | Population share | 1981 | 35.4 | |
Ontario | Population share | 1991 | 36.9 | |
Ontario | Population share | 2001 | 38.0 | |
Ontario | Population share | 2011 | 38.4 | |
Prairie Provinces | Population share | 1951 | 18.2 | |
Prairie Provinces | Population share | 1961 | 17.4 | |
Prairie Provinces | Population share | 1971 | 16.4 | |
Prairie Provinces | Population share | 1981 | 17.4 | |
Prairie Provinces | Population share | 1991 | 17.0 | |
Prairie Provinces | Population share | 2001 | 16.9 | |
Prairie Provinces | Population share | 2011 | 17.6 | |
British Columbia | Population share | 1951 | 8.3 | |
British Columbia | Population share | 1961 | 8.9 | |
British Columbia | Population share | 1971 | 10.1 | |
British Columbia | Population share | 1981 | 11.3 | |
British Columbia | Population share | 1991 | 12.0 | |
British Columbia | Population share | 2001 | 13.0 | |
British Columbia | Population share | 2011 | 13.1 | |
Territories | Population share | 1951 | 0.2 | |
Territories | Population share | 1961 | 0.2 | |
Territories | Population share | 1971 | 0.2 | |
Territories | Population share | 1981 | 0.3 | |
Territories | Population share | 1991 | 0.3 | |
Territories | Population share | 2001 | 0.3 | |
Territories | Population share | 2011 | 0.3 |
Haha, this looks quite nice but has way more information than we need (we just want the population of Canada). Also, the population of Canada is expressed in thousands, so once we selected what we want, we will need to multiply by 1,000.
There are many ways to select rows. Let us proceed as follows: we want the rows where the geography is "Canada" and the characteristic is "Population (in thousands)". Let us find those indices of rows that satisfy the first criterion, those that satisfy the second; if we then intersect these two sets of indices, we will have selected the rows we want.
idx_CAN = which(data_new$GEOGRAPHY.NAME == "Canada")
idx_char = which(data_new$CHARACTERISTIC == "Population (in thousands)")
idx_keep = intersect(idx_CAN, idx_char)
idx_keep
Yes, this looks okay, so let us keep only these.
data_new = data_new[idx_keep,]
data_new
GEOGRAPHY.NAME | CHARACTERISTIC | YEAR.S. | TOTAL | FLAG_TOTAL | |
---|---|---|---|---|---|
<chr> | <chr> | <chr> | <dbl> | <chr> | |
1 | Canada | Population (in thousands) | 1956 | 16081 | |
2 | Canada | Population (in thousands) | 1961 | 18238 | |
3 | Canada | Population (in thousands) | 1966 | 20015 | |
4 | Canada | Population (in thousands) | 1971 | 21568 | |
5 | Canada | Population (in thousands) | 1976 | 22993 | |
6 | Canada | Population (in thousands) | 1981 | 24343 | |
7 | Canada | Population (in thousands) | 1986 | 25309 | |
8 | Canada | Population (in thousands) | 1991 | 27297 | |
9 | Canada | Population (in thousands) | 1996 | 28847 | |
10 | Canada | Population (in thousands) | 2001 | 30007 | |
11 | Canada | Population (in thousands) | 2006 | 31613 | |
12 | Canada | Population (in thousands) | 2011 | 33477 |
We want to concatenate this data.frame with the one from earlier. To do this, we need the two data frames to have the same number of columns and, actually, the same column names and entry types (notice that YEAR.S. in data_new is a column of characters).
So what remains to do:
colnames(data_old) = c("year", "population")
data_new = data_new[,c("YEAR.S.","TOTAL")]
colnames(data_new) = c("year", "population")
data_new$year = as.numeric(data_new$year)
data_new = data_new[which(data_new$year>1976),]
data_new$population = data_new$population*1000
data = rbind(data_old,data_new)
OK, we are ready now!!
plot(data$year, data$population,
type = "b", lwd = 2,
xlab = "Year", ylab = "Population")
In case we need the data elsewhere, we can save it.
write.csv(data, file = "Canada_census.csv")