Department of Mathematics
University of Manitoba*
Department of Mathematics University of Manitoba*
Canadian Centre for Disease Modelling
Data you acquire is rarely in a format that is immediately useful for your purposes
Data wrangling is the process of transforming and mapping data from one “raw” data form into another format to make it more appropriate and valuable for a variety of downstream purposes
dplyr
vs sqldf
Can go “old school”
dplyr
is part of the tidyverse
set of libraries. Loads magrittr
and its pipe %>%
sqldf
allows to use SQL on dataframes.. interesting alternative if you know SQL
Let us load some data for SARS-CoV-1 (2003) that I collected some time back
github_URL = "https://raw.githubusercontent.com/"
repo_URL = "julien-arino/datasets/master/"
file_name = "SARS_data.csv"
SARS_URL = paste0(github_URL, repo_URL, file_name)
SARS = read.csv(SARS_URL)
We want to keep the data for one country (e.g., Canada)
SARS$country == ctry
SARS$country == ctry
is a vector of TRUE
(entries in SARS$country
which do equal ctry
) and FALSE
(entries in SARS$country
which do not equal ctry
)
Using this vector as an index for SARS
keeps only the entries for which the index is TRUE
which(SARS$country == ctry)
Same as before, except that here which
returns the indices of the entries for which SARS$country == ctry
is TRUE
, so idx
takes the form
These are the indices in the original dataframe SARS
of the entries we want to keep
which
We can make different index sets corresponding to different criteria, then combine them using intersect
, union
and setdiff
For instance, if we want to keep only the entries for which the country is Canada and the date is before 2003-04-30
idx_CAN = which(SARS$country == ctry)
idx_date = which(SARS$toDate < "2003-04-30")
idx_CAN_date = intersect(idx_CAN, idx_date)
sqldf
library(sqldf)
query = paste0("SELECT * ",
"FROM SARS ",
"WHERE country = '",
ctry, "'")
SARS_selected = sqldf(query)
dplyr
Create incidence for the selected country. diff does difference one by one, so one less entry than the vector on which it is being used, thus we pad with a 0
Keep only positive incidences (discard 0 or negative adjustments)
SARS_selected = SARS_selected %>%
filter(incidence > 0)
Plot the result. Before plotting, we need to make the dates column we will use be actual dates..
Select the data columns needed
We use a function from incidence2
to format the data as needed for the plot
Finally, we plot using ggplot2
library(ggplot2)
plot(incid) +
labs(fill = "Type") +
xlab("Date") + ylab("Incidence") +
labs(caption =
sprintf("SARS-CoV-1 incidence in %s", ctry)) +
theme(legend.position = "none")
(Note that the following plot uses + ggdark::dark_mode()
)