(This is reposted from my web page, where it has been since the indicated date.)

Introduction

As part of my work on the spatio-temporal spread of infectious diseases, I have had to use the Severe Acute Respiratory Syndrome (SARS) data available on the WHO website (here). This data comes in a challenging form: each page contains the report for a single day and the information presented evolves through time. The latter is to be expected: as the epidemic unfolded, the type of information that was required to deal with the problem evolved. However, in order to be able to use the data, some processing is necessary.

Formatting the data to make it easier to use was a somewhat long and rather annoying process and since this is publicly available data, I thought I would save others interested in this type of problems the time that it took me to do this, by making this processed data public as well.

Caveat

The WHO data was obtained during the course of the SARS epidemic. Some of the data was revised, but only a few days after it was initially posted. Therefore, compared to more thorough post-epidemic investigation and data gathering, the data here cannot be considered as the most up to date.

Epochs in the data

There are three main phases in the data, characterized by different variables being reported. In the processed data, all these variables are present, although they are empty for data in an epoch that does not report about given variables.

  • 17 March to 9 April. During this period, the variables considered are, for each country, the total number of cases, the number of deaths and the presence or absence of local transmission.

  • 10 April to 1 May. To the variables already mentioned are added the number of new cases since the last WHO update and the number of recovered individuals. From 17 April, the date of the last report for each country is listed.

  • 2 May to 11 July. The last epoch in the dataset drops the local transmission information (an additional dataset, that I have not processed yet, lists for each country affected, the regions where transmission is taking place) but adds a variable that indicates the date for which the cumulative number of cases is current.

Tidying up the data

As much of the information as possible was retained, with a few exceptions; see below. Dates were formatted consistently. The times at which the reports were posted are also available. When notes were relative to a given country on a given day, they are included as notes. The only notes that were omitted are relative to the general handling of the data. They are available in the full file, though.

Data files

Several data files are available.

  • The main cleaned-up file, in CSV format (file).

  • The raw, almost unprocessed file obtained by cutting and pasting the content of all the pages on the WHO site into a single file. The notes omitted in the cleaned-up file are present in this version.