R for modellers - Vignette 04

Data types and simple operations

Julien Arino

Department of Mathematics

University of Manitoba*




* The University of Manitoba campuses are located on original lands of Anishinaabeg, Cree, Oji-Cree, Dakota and Dene peoples, and on the homeland of the Métis Nation.

Assignment

Two ways:

X <- 10

or

X = 10


The first version is preferred by R purists... I don’t really care

Lists

A very useful data structure, quite flexible and versatile


Empty list

L <- list()


Convenient for things like parameters

L$a <- 10
L$b <- 3
L[["another_name"]] <- "Plouf plouf"
L[1]
$a
[1] 10
L[[2]]
[1] 3
L$a
[1] 10
L[["b"]]
[1] 3
L$another_name
[1] "Plouf plouf"
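
The output above shows that L[1] and L[[1]] do not return the same thing. A minimal sketch of the difference, reusing the list from this slide (the names sub and val are only for illustration): single brackets return a sublist, while double brackets and $ return the entry itself.

```r
# Rebuild the list from the slide
L <- list()
L$a <- 10
L$b <- 3
L[["another_name"]] <- "Plouf plouf"

sub <- L[1]      # single brackets: a list of length 1, still named "a"
val <- L[[1]]    # double brackets: the entry itself, a number

class(sub)       # "list"
class(val)       # "numeric"
names(sub)       # "a"
val + 1          # arithmetic works on the entry: 11
# sub + 1 would fail: arithmetic is not defined on a list
```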

Accessing subsets of list entries

L = list()
for (i in 1:10) {
  L[[i]] = 2*i
}


Then to access entries 3 and 4

L[3:4]
[[1]]
[1] 6

[[2]]
[1] 8

List names can be parameters


L <- list()
L$a <- 10
L$b <- 3
L[["another_name"]] <- "Plouf plouf"
for (n in names(L)) {
  writeLines(paste0("n=", n, ", L[[n]]=", L[[n]]))
}
n=a, L[[n]]=10
n=b, L[[n]]=3
n=another_name, L[[n]]=Plouf plouf

List of lists

L <- list()
L[["2024"]] = list()
L[["2024"]]$population = 200
L[["2024"]]$v = 1:5
L
$`2024`
$`2024`$population
[1] 200

$`2024`$v
[1] 1 2 3 4 5

Convenient: we could replicate the same list elements for “2023”, for instance
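
One possible way to replicate the structure for several years is a loop over the names. This is only a sketch: the values are the placeholders from the slide, reused for both years.

```r
L <- list()
for (year in c("2023", "2024")) {
  L[[year]] <- list()
  L[[year]]$population <- 200   # placeholder value, same as on the slide
  L[[year]]$v <- 1:5
}

names(L)                 # "2023" "2024"
L[["2023"]]$population   # 200
```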

Vectors

x = 1:10
(y <- c(x, 12))
 [1]  1  2  3  4  5  6  7  8  9 10 12


(Line 2 is wrapped in parentheses so that the result of the assignment is printed)

Concatenating two vectors

  • The c() command is ubiquitous in R

  • Used to make vectors, concatenate them, etc.

x = 1:5
y = 10:12
(z = c(x, y))
[1]  1  2  3  4  5 10 11 12

Vectors have a single entry type


z = c("red", "blue")
(z = c(z, 1))
[1] "red"  "blue" "1"   


Since the first two entries are character strings, the added entry is coerced to a character string as well. Unlike lists, vectors have all entries of the same type
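
A short sketch contrasting the two behaviours (the names z_vec and z_list are only for illustration): the same three values stored in a vector and in a list.

```r
z_vec  <- c("red", "blue", 1)       # the number 1 is coerced to the string "1"
z_list <- list("red", "blue", 1)    # each entry keeps its own type

class(z_vec)                 # "character"
sapply(z_list, class)        # "character" "character" "numeric"

# Coercion also happens between logical and numeric values
c(TRUE, 2)                   # 1 2
```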

Populating an empty vector


v = c()
for (i in 1:10) {
  v = c(v, 2*i)
}
v
 [1]  2  4  6  8 10 12 14 16 18 20


Very useful method to create a vector if you don’t know in advance how many entries it will have
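
A sketch of a case where the length really is unknown beforehand. The halving rule and the threshold are arbitrary choices for illustration.

```r
# Halve a value until it drops below a threshold; the number of steps
# is not known before the loop runs
v <- c()
value <- 100
while (value >= 1) {
  v <- c(v, value)
  value <- value / 2
}

v           # 100 50 25 12.5 6.25 3.125 1.5625
length(v)   # 7
```

Each call to c() copies the vector, so this is fine for short vectors but slow for very long ones.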

Vector operations - Beware!

Say

x = 1:10


Then x+1 gives

x+1
 [1]  2  3  4  5  6  7  8  9 10 11

i.e., adds 1 to all entries in the vector
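
One possible reading of the "Beware" in the title (an assumption, the slide does not spell it out) is that arithmetic on vectors is entrywise and that shorter operands are recycled. A small sketch:

```r
x <- 1:10

x + 1          # the scalar 1 is recycled to all 10 entries
x * x          # entrywise product, not a dot product
x + c(0, 100)  # the length-2 vector is recycled 5 times:
               # 1 102 3 104 5 106 7 108 9 110

sum(x * x)     # the dot product of x with itself: 385
```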

Use seq to make more complex sequences

(x = seq(from = 2, to = 10, by = 1.5))
[1] 2.0 3.5 5.0 6.5 8.0 9.5


The (from, to, by) form is the default; others exist


y = seq(from = 2, to = 100, length.out = 6)

round(y, 2)
[1]   2.0  21.6  41.2  60.8  80.4 100.0

Naming vector entries


It is possible (and often useful) to name vector entries


x = seq(from = 2, to = 10, by = 1.5)
names(x) = sprintf("v%d", 1:length(x))
x
 v1  v2  v3  v4  v5  v6 
2.0 3.5 5.0 6.5 8.0 9.5 


x["v5"]
v5 
 8 
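
Named entries are handy for model parameters. A minimal sketch, where the parameter names and values are hypothetical:

```r
params <- c(beta = 0.5, gamma = 0.25)      # hypothetical parameter names and values

ratio <- params["beta"] / params["gamma"]  # the result keeps the name "beta"
unname(ratio)                              # 2

params[c("gamma", "beta")]                 # subset or reorder by name
```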

Matrices

Matrix (or vector) of zeros


(A <- mat.or.vec(nr = 2, nc = 3))
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0

Matrix with prescribed entries

(B <- matrix(c(1,2,3,4), nr = 2, nc = 2))
     [,1] [,2]
[1,]    1    3
[2,]    2    4
(C <- matrix(c(5,6,7,8), nc = 2, nr = 2, 
             byrow = TRUE))
     [,1] [,2]
[1,]    5    6
[2,]    7    8


Here and elsewhere, naming the arguments (e.g., nr = 2) allows them to be given in any order (in matrix(), nr and nc are accepted as abbreviations of nrow and ncol)

Matrix operations

Probably the biggest annoyance in R compared to other languages!


  • A*B is the Hadamard product \(A\circ B\) (denoted A.*B in matlab), not the standard matrix multiplication


  • Standard matrix multiplication is A %*% B
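
The two products can be compared on the matrices B and C defined earlier (rebuilt here so the sketch runs on its own):

```r
B <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
C <- matrix(c(5, 6, 7, 8), nrow = 2, ncol = 2, byrow = TRUE)

B * C      # Hadamard (entrywise) product
#      [,1] [,2]
# [1,]    5   18
# [2,]   14   32

B %*% C    # standard matrix product
#      [,1] [,2]
# [1,]   26   30
# [2,]   38   44
```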

For the matlab-ers here


  • R does not have the keyword end to access the last entry in a matrix/vector/list


  • Use length (lists or vectors), nchar (character strings), dim (matrices; careful, it returns 2 values)
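
A sketch of how the last entry can be reached in each case; the objects are made up for illustration.

```r
v <- c(2, 4, 6, 8)
v[length(v)]                   # last entry of a vector: 8

L <- list(a = 10, b = "last")
L[[length(L)]]                 # last entry of a list: "last"

s <- "modellers"
substr(s, nchar(s), nchar(s))  # last character of a string: "s"

A <- matrix(1:6, nrow = 2)
A[nrow(A), ncol(A)]            # bottom-right entry: 6
dim(A)                         # 2 3 (rows, then columns)
```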

Concatenating matrices

Dimensions must be compatible

rbind(B, C)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
[3,]    5    6
[4,]    7    8
cbind(B, C)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    6
[2,]    2    4    7    8

Concatenating vectors and matrices

v = c(9, 10)
rbind(B, v)
  [,1] [,2]
     1    3
     2    4
v    9   10
cbind(B, v)
          v
[1,] 1 3  9
[2,] 2 4 10

Naming matrix rows/columns

Can be useful sometimes

rownames(B) = c("before", "after")
colnames(B) = c("Jane", "John")
B
       Jane John
before    1    3
after     2    4


Not assigning a value returns the existing values, if any

rownames(B)
[1] "before" "after" 
colnames(C)
NULL

Access matrix/vector entries

By position

B[1,2]
[1] 3
v[1]
[1] 9


By name, if present, and combining

B["before", "Jane"]
[1] 1
B["before", 2]
[1] 3

Whole rows/columns

B["before", ]
Jane John 
   1    3 
B[, "Jane"]
before  after 
     1      2 
C[,]
     [,1] [,2]
[1,]    5    6
[2,]    7    8
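
Note in the output above that a single row or column comes back as a vector, not as a matrix. The slide does not cover it, but the argument drop = FALSE keeps the matrix shape. A minimal sketch (r1 and r2 are illustrative names):

```r
B <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
rownames(B) <- c("before", "after")
colnames(B) <- c("Jane", "John")

r1 <- B["before", ]                 # a named vector
r2 <- B["before", , drop = FALSE]   # a 1 x 2 matrix

is.matrix(r1)   # FALSE
dim(r2)         # 1 2
```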

Submatrices


D = matrix(data = runif(100), nc = 10)
D[2:3, 5:7]
          [,1]      [,2]       [,3]
[1,] 0.2398867 0.3167605 0.04903629
[2,] 0.8981800 0.7612056 0.60225418


runif(100): generates 100 uniformly distributed random numbers between the default min=0 and max=1


Note that indices are “local” to the result: row 2 of D becomes row 1 of the submatrix

Submatrix of a named matrix


D = matrix(data = runif(100), nc = 10)
rownames(D) = sprintf("R%d", 1:dim(D)[1])
colnames(D) = sprintf("C%d", 1:dim(D)[2])
(E = D[2:3, 5:7])
          C5        C6        C7
R2 0.3275763 0.5880480 0.3150221
R3 0.8573982 0.3703588 0.7112473


Indices are “local” but names are those “extracted”

E["R3", "C6"]
[1] 0.3703588
E[2, 2]
[1] 0.3703588

Data frames


From the R documentation:


[data frames are] tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software

Data frames are lists and matrices


  • Easier to access elements than in lists


  • More flexible than matrices (columns can be of different types)

L3 <- LETTERS[1:3]
fac <- sample(L3, 8, replace = TRUE)
(df <- data.frame(x = 1, y = 1:8, fac = fac))
  x y fac
1 1 1   B
2 1 2   A
3 1 3   A
4 1 4   B
5 1 5   C
6 1 6   B
7 1 7   A
8 1 8   A


is.character(df$x)
[1] FALSE
is.character(df$fac)
[1] TRUE

Data frames are lists and matrices (2)


df$fac
[1] "B" "A" "A" "B" "C" "B" "A" "A"
df[["fac"]]
[1] "B" "A" "A" "B" "C" "B" "A" "A"
df[, "fac"]
[1] "B" "A" "A" "B" "C" "B" "A" "A"
df[, 3]
[1] "B" "A" "A" "B" "C" "B" "A" "A"


df$fac[2]
[1] "A"
df[["fac"]][2]
[1] "A"
df[2, "fac"]
[1] "A"
df[2, 3]
[1] "A"
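
The list-like and matrix-like sides of a data frame can be checked directly. A sketch, assuming a deterministic fac column in place of the random sample used on the slide:

```r
# fac is fixed here (an assumption) so that the results are reproducible
df <- data.frame(x = 1, y = 1:8, fac = rep(c("A", "B"), 4))

# List-like behaviour: a data frame is a list of columns
is.list(df)          # TRUE
names(df)            # "x" "y" "fac"
length(df)           # 3, the number of columns

# Matrix-like behaviour: it has dimensions and accepts [row, column] indexing
dim(df)              # 8 3
df[df$y > 6, ]       # rows 7 and 8, all columns

# Unlike a matrix, each column keeps its own type
sapply(df, class)
```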

which

The which function


  • Extremely useful


  • Important to learn how to use


Give the TRUE indices of a logical object, allowing for array indices

TRUE indices of a logical object?


  • We return to logical tests in Vignette 05 about flow control


  • TRUE indices: those indices for which a property is TRUE


  • E.g., \(x<1\)?

df
  x y fac
1 1 1   B
2 1 2   A
3 1 3   A
4 1 4   B
5 1 5   C
6 1 6   B
7 1 7   A
8 1 8   A
df$y < 5
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
which(df$y < 5)
[1] 1 2 3 4
df$fac == "A"
[1] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
which(df$fac == "A")
[1] 2 3 7 8

which is useful

df$fac[which(df$fac == "A")] = "Z"
df
  x y fac
1 1 1   B
2 1 2   Z
3 1 3   Z
4 1 4   B
5 1 5   C
6 1 6   B
7 1 7   Z
8 1 8   Z
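
The replacement above would also work with the logical vector alone, without which. One case where the two differ is missing values, as this sketch with a made-up vector shows:

```r
y <- c(3, NA, 7, 1)

y < 5                 # TRUE NA FALSE TRUE
which(y < 5)          # 1 4   (the NA is dropped)

y[y < 5]              # 3 NA 1   (logical indexing keeps an NA placeholder)
y[which(y < 5)]       # 3 1
```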

which can return array indices

E = matrix(data = runif(25), nr = 5)
(rc = which(E < 0.1, arr.ind = TRUE))
     row col
[1,]   3   1
[2,]   3   3
[3,]   5   3
E[rc] = Inf
round(E, digits = 2)
     [,1] [,2] [,3] [,4] [,5]
[1,] 0.58 0.87 0.63 0.14 0.87
[2,] 0.11 0.69 0.43 0.33 0.26
[3,]  Inf 0.13  Inf 0.46 0.80
[4,] 0.60 0.67 0.97 0.35 0.29
[5,] 0.36 0.57  Inf 0.30 0.58
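
The same idea on a small fixed matrix, so that the result does not depend on random numbers (M and lin are illustrative names):

```r
M <- matrix(c(5, 0, 7,
              0, 3, 9), nrow = 2, byrow = TRUE)

lin <- which(M == 0)                   # linear (column-major) indices: 2 3
(rc <- which(M == 0, arr.ind = TRUE))  # one (row, col) pair per match
M[rc] <- NA                            # a two-column matrix indexes entries pairwise
M
```

Here the zeros sit at positions (2,1) and (1,2), and both become NA.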

Type checking/casting

Checking types


For most types, a function is.type is defined


is.array, is.atomic, is.character, is.data.frame, is.double, is.function, is.integer, is.list, is.logical, is.matrix, is.numeric, is.object, is.vector


Many packages also define specific types

Casting


Typically, if is.type exists for type type, then as.type also exists


as.array, as.data.frame, as.list, as.matrix, as.numeric, as.vector


Often: matrix \(\leftrightarrow\) data frame, list \(\leftrightarrow\) matrix
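
A few casts in action. This is a sketch with made-up objects (m, d, m2):

```r
as.numeric("3.14")        # character to number: 3.14
as.character(1:3)         # "1" "2" "3"
as.integer(2.9)           # truncates toward zero: 2

# Matrix <-> data frame
m  <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
d  <- as.data.frame(m)    # columns a, b, c
m2 <- as.matrix(d)        # back to a matrix

is.data.frame(d)          # TRUE
is.matrix(m2)             # TRUE

# A cast that cannot succeed gives NA, with a warning (suppressed here)
suppressWarnings(as.numeric("abc"))
```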

Example: data frame to list

to_vary_m = 
  expand.grid(p1 = seq(1, 3, length.out = 10),
              p2 = seq(0.8, 3, length.out = 10))
to_vary_l = split(to_vary_m, seq(nrow(to_vary_m)))


expand.grid: makes a data frame (not a matrix, despite the _m suffix) with one row for every combination of the values of the vectors p1 and p2; split then turns it into a list of one-row data frames

to_vary_m[3,]
        p1  p2
3 1.444444 0.8
to_vary_l[3]
$`3`
        p1  p2
3 1.444444 0.8
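
A list of parameter combinations is convenient for applying a function to each combination. A sketch on a smaller grid, where toy_model is a hypothetical stand-in for an actual model run:

```r
# Same construction as on the slide, with a smaller grid for readability
to_vary_m <- expand.grid(p1 = seq(1, 3, length.out = 3),
                         p2 = seq(0.8, 3, length.out = 2))
to_vary_l <- split(to_vary_m, seq(nrow(to_vary_m)))

# Hypothetical model output: any function of the two parameters would do
toy_model <- function(params) {
  params$p1 * params$p2
}

results <- sapply(to_vary_l, toy_model)
length(results)    # 6, one result per parameter combination
results[["1"]]     # p1 = 1, p2 = 0.8, so 0.8
```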