Introduction to R and jupyter notebook¶

In order to illustrate the use of jupyter notebook and R, let us prepare for the next lecture using a notebook.

Notebooks run online (here on syzygy.ca). It uses two (mainly) types of "cells". A cell like this one is a text cell. Text is formatted using markdown, which is a simple text description language yet has relatively powerful capabilities. See here for details, for instance.

In [1]:

# This is an R code cell 
# The sign "#" is used for comments in R, so these lines do nothing

This is a markdown cell.

Cells are evaluated by pressing Shift+Enter or Ctrl+Enter. This current one is an evaluated markdown cell, the one just above is an unevaluated markdown cell, the first an R cell (evaluated or not, since it is only a comment, it is hard to tell the difference)

Grabing the Canadian census data¶

To illustrate the method, we will consider the evolution of the population of Canada through time. For this, we will grab the census data. We search for "canada historical census data csv", since csv (comma separated values) is a very easy format to use with R. Here, we find a csv for 1851 to 1976. We follow the link to Table A2-14, where we find another link, this time to a csv file. This is what we use in R.

The function read.csv reads in a file (potentially directly from the web). We assign the result to the variable data. We then use the function head to show the first few lines in the result.

In [2]:

data_old = read.csv("https://www150.statcan.gc.ca/n1/en/pub/11-516-x/sectiona/A2_14-eng.csv?st=L7vSnqio")
head(data_old)

A data.frame: 6 × 23
	X	Series.A2.14.	Population.of.Canada..by.province..census.dates..1851.to.1976	X.1	X.2	X.3	X.4	X.5	X.6	X.7	⋯	X.11	X.12	X.13	X.14	X.15	X.16	X.17	X.18	X.19	X.20
	<int>	<chr>	<chr>	<int>	<chr>	<chr>	<int>	<chr>	<chr>	<chr>	⋯	<chr>	<int>	<chr>	<int>	<chr>	<int>	<chr>	<chr>	<int>	<lgl>
1	NA			NA			NA				⋯		NA		NA		NA			NA	NA
2	NA	Year	Canada	NA	Newfound-	Prince	NA	Nova	New	Quebec	⋯	Saskat-	NA	Alberta	NA	British	NA	Yukon	Northwest	NA	NA
3	NA			NA	land	Edward	NA	Scotia	Brunswick		⋯	chewan	NA		NA	Columbia	NA	Territory	Territories	NA	NA
4	NA			NA		Island	NA				⋯		NA		NA		NA			NA	NA
5	NA		2	NA	3	4	NA	5	6	7	⋯	10	NA	11	NA	12	NA	13	14	NA	NA
6	NA			NA			NA				⋯		NA		NA		NA			NA	NA

Obviously, this does not make a lot of sense. This is normal: take a look at the first few lines in the file. They take the form

,Series A2-14.,"Population of Canada, by province, census dates, 1851 to 1976",,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,
,Year,Canada,,Newfound-,Prince,,Nova,New,Quebec,Ontario, Manitoba,,Saskat-,,Alberta,,British,,Yukon,Northwest,,
,,,,land,Edward,,Scotia,Brunswick,,,,,chewan,,,,Columbia,,Territory,Territories,,
,,,,,Island,,,,,,,,,,,,,,,,,
,,2,,3,4,,5,6,7,8,9,,10,,11,,12,,13,14,,
,,,,,,,,,,,,,,,,,,,,,,

This happens often: the first few lines are here to set the information, they lay out a simple version of the so-called metadata

The first line here does this; it is easy to deal with this: the function read.csv takes the optional argument skip=, which indicates how many lines to skip at the beginning
The second line is also empty, so let us skip it too

In [3]:

data_old = read.csv("https://www150.statcan.gc.ca/n1/en/pub/11-516-x/sectiona/A2_14-eng.csv?st=L7vSnqio",
                    skip = 2)
head(data_old)

A data.frame: 6 × 23
	X	Year	Canada	X.1	Newfound.	Prince	X.2	Nova	New	Quebec	⋯	Saskat.	X.4	Alberta	X.5	British	X.6	Yukon	Northwest	X.7	X.8
	<int>	<chr>	<chr>	<int>	<chr>	<chr>	<int>	<chr>	<chr>	<chr>	⋯	<chr>	<int>	<chr>	<int>	<chr>	<int>	<chr>	<chr>	<int>	<lgl>
1	NA			NA	land	Edward	NA	Scotia	Brunswick		⋯	chewan	NA		NA	Columbia	NA	Territory	Territories	NA	NA
2	NA			NA		Island	NA				⋯		NA		NA		NA			NA	NA
3	NA		2	NA	3	4	NA	5	6	7	⋯	10	NA	11	NA	12	NA	13	14	NA	NA
4	NA			NA			NA				⋯		NA		NA		NA			NA	NA
5	NA	1976	22,992,604	NA	557,725	118,229	NA	828,571	677,250	6,234,445	⋯	921,323	NA	1,838,037	NA	2,466,608	NA	21,836	42,609	NA	NA
6	NA			NA			NA				⋯		NA		NA		NA			NA	NA

Here, there is the further issue that to make things legible, the table authors used 3 rows (from 2 to 4) to encode for long names (e.g., Prince Edward Island is written over 3 rows). Note, however, that read.csv has rightly picked up on the first row being the column names

Because we are only interested in the total population of the country and the year, let us simply get rid of the first 4 rows and of all columns except the second (Year) and third (Canada)

In [4]:

data_old = data_old[5:dim(data_old)[1], 2:3]
data_old

A data.frame: 25 × 2
	Year	Canada
	<chr>	<chr>
5	1976	22,992,604
6
7	1971	21,568,311
8	1966	20,014,880
9	1961	18,238,247
10	1956	16,080,791
11	1951	14,009,429
12
13	1941	11,506,655
14	1931	10,376,786
15	1921	8,787,949
16	1911	7,206,643
17	1901	5,371,315
18
19	1891	4,833,239
20	1881	4,324,810
21	1871	3,689,257
22	1861	3,229,633
23	1851	2,436,297
24
25	Includes 485 members of the Royal Canadian Navy whose province of residence is not known.
26	Included with Northwest Territories.
27	For the discussion of the ambiguities and under-enumeration contained in these figures consult the notes to series A2-14 in original volume. For completeness of enumeration in censuses of 1961 and later years,
28	see notes to series A15-53.
29	1848 figure.

Still not perfect:

there are some empty rows;
the last few rows need to be removed too, they contain remarks about the data;
the population counts contain commas;
it would be better if years were increasing.

Let us fix these issues.

For 1 and 2, this is easy: remark that the Canada column is empty for both issues. Now remark as well that below Canada (and Year, for that matter), it is written < chr >. This means that entries in the column are characters. Looking for empty content therefore means looking for empty character chains.

So to fix 1 and 2, we keep the rows where Canada does not equal the empty chain.
To get rid of commas, we just need to substitute an empty chain for ",".
To sort, we find the order for the years and apply it to the entire table.
Finally, as remarked above, for now, both the year and the population are considered as character chains. This means that in order to plot anything, we will have to indicate that these are numbers, not characters.

In [5]:

data_old = data_old[which(data_old$Canada != ""),]
data_old$Canada = gsub(",", "", data_old$Canada)
order_data = order(data_old$Year)
data_old = data_old[order_data,]
data_old$Year = as.numeric(data_old$Year)
data_old$Canada = as.numeric(data_old$Canada)
data_old

A data.frame: 16 × 2
	Year	Canada
	<dbl>	<dbl>
23	1851	2436297
22	1861	3229633
21	1871	3689257
20	1881	4324810
19	1891	4833239
17	1901	5371315
16	1911	7206643
15	1921	8787949
14	1931	10376786
13	1941	11506655
11	1951	14009429
10	1956	16080791
9	1961	18238247
8	1966	20014880
7	1971	21568311
5	1976	22992604

Row numbers are a little weird, so let us fix this.

In [6]:

row.names(data_old) = 1:dim(data_old)[1]
data_old

A data.frame: 16 × 2
	Year	Canada
	<dbl>	<dbl>
1	1851	2436297
2	1861	3229633
3	1871	3689257
4	1881	4324810
5	1891	4833239
6	1901	5371315
7	1911	7206643
8	1921	8787949
9	1931	10376786
10	1941	11506655
11	1951	14009429
12	1956	16080791
13	1961	18238247
14	1966	20014880
15	1971	21568311
16	1976	22992604

Well, that looks about right! Let's see what this looks like in a graph.

In [7]:

plot(data_old$Year, data_old$Canada,
    type = "b", lwd = 2,
    xlab = "Year", ylab = "Population")

But wait, this is only to 1976..! Looking around, we find another table here. There's a download csv link in there, let us see where this leads us. The table is 720KB, so surely there must be more to this than just the population. To get a sense of that, we dump the whole data.frame, not just its head.

In [8]:

data_new = read.csv("https://www12.statcan.gc.ca/census-recensement/2011/dp-pd/vc-rv/download-telecharger/download-telecharger.cfm?Lang=eng&CTLG=98-315-XWE2011001&FMT=csv")
data_new

A data.frame: 9960 × 5
GEOGRAPHY.NAME	CHARACTERISTIC	YEAR.S.	TOTAL	FLAG_TOTAL
<chr>	<chr>	<chr>	<dbl>	<chr>
Canada	Population (in thousands)	1956	16081.0
Canada	Population (in thousands)	1961	18238.0
Canada	Population (in thousands)	1966	20015.0
Canada	Population (in thousands)	1971	21568.0
Canada	Population (in thousands)	1976	22993.0
Canada	Population (in thousands)	1981	24343.0
Canada	Population (in thousands)	1986	25309.0
Canada	Population (in thousands)	1991	27297.0
Canada	Population (in thousands)	1996	28847.0
Canada	Population (in thousands)	2001	30007.0
Canada	Population (in thousands)	2006	31613.0
Canada	Population (in thousands)	2011	33477.0
Canada	Population - % change	1956 to 1961	13.4
Canada	Population - % change	1961 to 1966	9.7
Canada	Population - % change	1966 to 1971	7.8
Canada	Population - % change	1971 to 1976	6.6
Canada	Population - % change	1976 to 1981	5.9
Canada	Population - % change	1981 to 1986	4.0
Canada	Population - % change	1986 to 1991	7.9
Canada	Population - % change	1991 to 1996	5.7
Canada	Population - % change	1996 to 2001	4.0
Canada	Population - % change	2001 to 2006	5.4
Canada	Population - % change	2006 to 2011	5.9
Canada	Total private dwellings occupied by usual residents	1996	10820050.0
Canada	Total private dwellings occupied by usual residents	2001	11562975.0
Canada	Total private dwellings occupied by usual residents	2006	12435520.0
Canada	Total private dwellings occupied by usual residents	2011	13320614.0
Canada	% of the population aged 0 to 14 years	1921	34.4
Canada	% of the population aged 0 to 14 years	1931	31.6
Canada	% of the population aged 0 to 14 years	1941	27.8
⋮	⋮	⋮	⋮	⋮
Quebec	Population share	2001	24.1
Quebec	Population share	2011	23.6
Ontario	Population share	1951	32.8
Ontario	Population share	1961	34.2
Ontario	Population share	1971	35.7
Ontario	Population share	1981	35.4
Ontario	Population share	1991	36.9
Ontario	Population share	2001	38.0
Ontario	Population share	2011	38.4
Prairie Provinces	Population share	1951	18.2
Prairie Provinces	Population share	1961	17.4
Prairie Provinces	Population share	1971	16.4
Prairie Provinces	Population share	1981	17.4
Prairie Provinces	Population share	1991	17.0
Prairie Provinces	Population share	2001	16.9
Prairie Provinces	Population share	2011	17.6
British Columbia	Population share	1951	8.3
British Columbia	Population share	1961	8.9
British Columbia	Population share	1971	10.1
British Columbia	Population share	1981	11.3
British Columbia	Population share	1991	12.0
British Columbia	Population share	2001	13.0
British Columbia	Population share	2011	13.1
Territories	Population share	1951	0.2
Territories	Population share	1961	0.2
Territories	Population share	1971	0.2
Territories	Population share	1981	0.3
Territories	Population share	1991	0.3
Territories	Population share	2001	0.3
Territories	Population share	2011	0.3

Haha, this looks quite nice but has way more information than we need (we just want the population of Canada). Also, the population of Canada is expressed in thousands, so once we selected what we want, we will need to multiply by 1,000.

There are many ways to select rows. Let us proceed as follows: we want the rows where the geography is "Canada" and the characteristic is "Population (in thousands)". Let us find those indices of rows that satisfy the first criterion, those that satisfy the second; if we then intersect these two sets of indices, we will have selected the rows we want.

In [9]:

idx_CAN = which(data_new$GEOGRAPHY.NAME == "Canada")
idx_char = which(data_new$CHARACTERISTIC == "Population (in thousands)")
idx_keep = intersect(idx_CAN, idx_char)
idx_keep

1
2
3
4
5
6
7
8
9
10
11
12

Yes, this looks okay, so let us keep only these.

In [10]:

data_new = data_new[idx_keep,]
data_new

A data.frame: 12 × 5
	GEOGRAPHY.NAME	CHARACTERISTIC	YEAR.S.	TOTAL	FLAG_TOTAL
	<chr>	<chr>	<chr>	<dbl>	<chr>
1	Canada	Population (in thousands)	1956	16081
2	Canada	Population (in thousands)	1961	18238
3	Canada	Population (in thousands)	1966	20015
4	Canada	Population (in thousands)	1971	21568
5	Canada	Population (in thousands)	1976	22993
6	Canada	Population (in thousands)	1981	24343
7	Canada	Population (in thousands)	1986	25309
8	Canada	Population (in thousands)	1991	27297
9	Canada	Population (in thousands)	1996	28847
10	Canada	Population (in thousands)	2001	30007
11	Canada	Population (in thousands)	2006	31613
12	Canada	Population (in thousands)	2011	33477

We want to concatenate this data.frame with the one from earlier. To do this, we need the two data frames to have the same number of columns and, actually, the same column names and entry types (notice that YEAR.S. in data_new is a column of characters).

So what remains to do:

Rename the columns in the pruned old data (data_pruned) to year and population. Personally, I prefer lowercase column names.. and population is more informative than Canada.
Keep only the relevant columns in data_new, rename them accordingly and multiply population by 1,000 there.
Transform year in data_new to numbers.
We already have data up to and including 1976 in data_old, so get rid of that in data_new.
Append the rows of data_new to those of data_pruned.

In [11]:

colnames(data_old) = c("year", "population")
data_new = data_new[,c("YEAR.S.","TOTAL")]
colnames(data_new) = c("year", "population")
data_new$year = as.numeric(data_new$year)
data_new = data_new[which(data_new$year>1976),]
data_new$population = data_new$population*1000

data = rbind(data_old,data_new)

OK, we are ready now!!

In [12]:

plot(data$year, data$population,
    type = "b", lwd = 2,
    xlab = "Year", ylab = "Population")

In case we need the data elsewhere, we can save it.

In [13]:

write.csv(data, file = "Canada_census.csv")