ccodwg / covid19canadaarchive

Canadian COVID-19 Data Archive

Home Page: https://opencovid.ca

License: Other

Python 77.36% Shell 22.64%

covid-19 canada dataset covid19 covid19-data covid-data

covid19canadaarchive's Introduction

Canadian COVID-19 Data Archive

The Canadian COVID-19 Data Archive is a collection of datasets, documents and webpages related to the COVID-19 pandemic in Canada, with files spanning March 2020 to January 2024. This project supported automated, daily snapshots of Canadian COVID-19 data from governmental and non-governmental sources beginning August 25, 2020 and concluding January 31, 2024. The Archive is maintained by Jean-Paul R. Soucy on behalf of the COVID-19 Open Data Working Group. It is a sister project to the Timeline of COVID-19 in Canada, a definitive dataset for COVID-19 in Canada.

For a list of available datasets, see the Data catalogue below. For information on how to access the datasets in the archive, see Accessing the data.

File name timestamps are given in ET (America/Toronto) in the following format: %Y-%m-%d_%H-%M. Files were archived nightly beginning around 22:00 ET.
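As an illustration of the timestamp convention above, the following is a minimal sketch of how an archived file name might be converted into a timezone-aware datetime. The helper name `parse_archive_timestamp` and the regular expression are assumptions made for this example and are not part of the archive's tooling; only the format string and the time zone come from the text above.

```python
import re
from datetime import datetime
from zoneinfo import ZoneInfo

# Format and time zone as stated in this README.
TIMESTAMP_FORMAT = "%Y-%m-%d_%H-%M"
ARCHIVE_TZ = ZoneInfo("America/Toronto")

# Assumption: the timestamp appears in the file name as YYYY-MM-DD_HH-MM.
_TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}")


def parse_archive_timestamp(file_name: str) -> datetime:
    """Return the ET timestamp embedded in an archived file name (hypothetical helper)."""
    matches = _TIMESTAMP_RE.findall(file_name)
    if not matches:
        raise ValueError(f"no timestamp found in {file_name!r}")
    # Use the last match in case the dataset name itself contains a date.
    naive = datetime.strptime(matches[-1], TIMESTAMP_FORMAT)
    return naive.replace(tzinfo=ARCHIVE_TZ)


if __name__ == "__main__":
    print(parse_archive_timestamp("covid19-download_2020-11-04_23-38.csv"))
```

Attaching the America/Toronto zone rather than a fixed offset lets the standard library handle the switch between EST and EDT.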

All code in this repository is covered by the MIT License. Archived datasets may be used under the licenses/terms of use assigned to them by the data creators.

Table of contents:

Data catalogue
Accessing the data
Recommended citation
Notes about the data archive
Notes about the archival tool
Acknowledgements

Data catalogue

A searchable catalogue of datasets, sorted by province/territory (and city/organization, if applicable), is available in the Data Explorer. Full details for each dataset, including any notes pertaining to them, are available in the Search list of datasets section of the Data Explorer. Feature requests and bug reports for the Data Explorer should be made in its dedicated GitHub repository.

A note about data from Quebec: when both French and English data files are available, the French dataset should usually be considered definitive (and in most cases, these files have been captured in the archive for a longer duration).

Accessing the data

The easiest way to explore the data in the archive and download individual files is the aforementioned Data Explorer.

The files in the archive are hosted under the domain https://data.opencovid.ca/archive. For example, the PHAC Epidemiology Update from November 4, 2020 may be downloaded at the following URL:

https://data.opencovid.ca/archive/can/epidemiology-update-2/covid19-download_2020-11-04_23-38.csv

Additionally, a complete copy of the index is available as a SQLite database at the following URL:

https://data.opencovid.ca/archive/index.db

This database can easily be queried using a programming language and used to download a list of files.
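For example, the index could be downloaded and inspected with Python's standard library. The sketch below deliberately makes no assumptions about the table or column names inside index.db (they are not documented here); it only lists whatever tables and columns the database contains so that a query can be written afterwards. The function names `download_index` and `describe_tables` are hypothetical.

```python
import sqlite3
import urllib.request
from pathlib import Path

# URL of the index database, as given above.
INDEX_URL = "https://data.opencovid.ca/archive/index.db"


def download_index(dest: Path = Path("index.db")) -> Path:
    """Download the SQLite index to a local file (hypothetical helper)."""
    urllib.request.urlretrieve(INDEX_URL, dest)
    return dest


def describe_tables(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Map each table name in the database to its list of column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {
        table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }


if __name__ == "__main__":
    path = download_index()
    connection = sqlite3.connect(path)
    try:
        for table, columns in describe_tables(connection).items():
            print(table, columns)
    finally:
        connection.close()
```

Once the schema is known, file paths read from the index can be appended to https://data.opencovid.ca/archive to build download URLs.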

Previously, a JSON API was available to search the file index, which supported filtering by UUID and date ranges, as well as removing duplicate files. This API was retired in February 2024.

Recommended citation

COVID-19 Canada Open Data Working Group. Canadian COVID-19 Data Archive. https://github.com/ccodwg/Covid19CanadaArchive. (Access date).

Notes about the data archive

On several occasions, the nightly archival script has failed to run. Depending on when the failure was identified, this may have resulted in a partial or total loss of archival data for that day. A list of these days is provided below:

2020-10-21
2020-11-19

In addition, the method of archiving websites (HTML files) was modified on 2021-12-30. This may have caused a handful of HTML files not to be marked duplicates of the previous day's file when they otherwise would have been. On 2022-03-26, the old method of archiving websites was erroneously used, once again resulting in some HTML files not being marked duplicates when they otherwise would have been.

Notes about the archival tool

Updates to the Canadian COVID-19 Data Archive are managed by the archivist package. Development of archivist originally took place in this repository but has since been migrated to its own repository.

Acknowledgements

Shannon Fiedler created the banner image for the Canadian COVID-19 Data Archive.

Many people are to thank for contributing archived data and code to this repository:

Jens von Bergmann / Simon Coulombe / James E. Wright / Farbod Abolhassani / Shelby L. Sturrock / Safa Ahmad / Jacques Marcoux / Shraddha Pai / Matti Aleve / Scott van Millingen / Robson Fletcher / Les Perreaux / Allen Kwan / Christine Hagyard / Amy Bihari / Razieh Faraji / David Lussier / Matthias Schoettle / Jeremy Moreau

covid19canadaarchive's Issues

Add municipal data from Ottawa

https://www.theglobeandmail.com/canada/article-how-safe-is-school-it-depends-on-your-neighbourhood/

Fix Alberta school status download

AB - school status download failed 2020-09-11 and 2020-09-12.

Each school now has an expandable element, similar to the Fraser school exposures page.

Add codebook: PHAC epidemiology update

https://open.canada.ca/data/en/dataset/261c32ab-4cfd-4f81-9dea-7b64065690dc

Contains:

covid19.csv
codebook (en)
codebook (fr)

Very large screenshots can fail on the server

Frequently failed:
ON - How Ontario is responding to COVID-19 (webpage screenshot)
ON - Cases in schools and childcare centres (webpage)

Add additional Toronto datasets

Check update frequency and adjust if necessary (e.g. 3 times/week):

https://www.toronto.ca/home/covid-19/covid-19-latest-city-of-toronto-news/covid-19-status-of-cases-in-toronto/

Add archived Alberta data

Fix SK data: CSV names change every day

The CSVs are named by number, and the number changes every day. Two possible solutions:

Download the files using the webdriver
Calculate the CSV number for a given day by incrementing from the number observed on the first known date
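The second option could look something like the sketch below. It assumes the number increases by a fixed step (one per day by default) from a reference date, which the issue does not confirm; the function name `csv_number_for` and the reference values in the usage example are placeholders, not real values from the Saskatchewan site.

```python
from datetime import date


def csv_number_for(
    target: date, known_date: date, known_number: int, step: int = 1
) -> int:
    """Estimate the CSV number on `target` from a known (date, number) pair.

    Assumption: the number grows by `step` for every calendar day elapsed.
    """
    days_elapsed = (target - known_date).days
    if days_elapsed < 0:
        raise ValueError("target date is before the first known date")
    return known_number + days_elapsed * step


if __name__ == "__main__":
    # Placeholder reference point: number 100 observed on 2020-09-10.
    print(csv_number_for(date(2020, 9, 15), date(2020, 9, 10), 100))
```

If the numbering ever skips a day or jumps, this estimate drifts silently, so the webdriver approach may be the more robust of the two.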

Add PHAC "locations where you may have been exposed"

https://www.canada.ca/en/public-health/services/diseases/2019-novel-coronavirus-infection/latest-travel-health-advice/exposure-flights-cruise-ships-mass-gatherings.html

Update Toronto datasets

See the new datasets at the link below and remove datasets that are no longer updated.

https://www.toronto.ca/home/covid-19/covid-19-latest-city-of-toronto-news/covid-19-status-of-cases-in-toronto/

Add archived INSPQ data

https://twitter.com/CoulSim/status/1298669814676959233

Add archived Ontario & Toronto data

https://twitter.com/vb_jens/status/1298664720434487296

Make nightly update a single commit rather than committing each file individually

Solution using PyGithub & GitHub API is non-trivial. https://stackoverflow.com/questions/38594717/how-do-i-push-new-files-to-github/39627647#39627647

Interfacing directly with a local git repo would make this task trivial but would require downloading the entire repo every time a commit should be issued: https://stackoverflow.com/questions/38594717/how-do-i-push-new-files-to-github/39627647

Add archived BCCDC data

Scrape Alberta HTML tables

https://www.alberta.ca/stats/covid-19-alberta-statistics.htm

Screenshot function creating instability on server run

Previously, a crash with the screenshot page function (ss_page) was solved by removing driver.implicitly_wait(). See 832de45.

Now that many more screenshots have been added, ss_page can sometimes fail (bringing down the whole script) with the following errors:

selenium.common.exceptions.SessionNotCreatedException: Message: session not created
(Session info: headless chrome=xx.x.xxxx.xxx)
from disconnected: Unable to receive message from renderer

raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: crashed.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /app/.apt/opt/google/chrome/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

In both cases, the errors occurred when trying to screenshot: https://www.northernhealth.ca/health-topics/public-exposures-and-outbreaks#covid-19-public-exposures

Note: Crashes were preceded by several "Error R14 (Memory quota exceeded)". Unsure if related.

The failure of an individual download should not halt the entire script, but should generate an error in the log and possibly an alert.
Downloads should be retryable, especially webdriver-based downloads (success is defined as the file existing).
The webdriver should wait briefly before the rest of the script runs, to ensure the page is fully loaded.
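One way these three requirements might fit together is sketched below. This is not the archival script's actual code: `run_with_retries`, `run_all` and the task tuples are hypothetical names, and the download and file-check callables stand in for whatever the real script does (for example, a call to ss_page).

```python
import logging
import time
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("archive")

# A task is (name, download function, function that checks the file exists).
Task = Tuple[str, Callable[[], None], Callable[[], bool]]


def run_with_retries(
    name: str,
    download: Callable[[], None],
    file_exists: Callable[[], bool],
    attempts: int = 3,
    wait_seconds: float = 5.0,
) -> bool:
    """Try one download several times; return True once the file exists."""
    for attempt in range(1, attempts + 1):
        try:
            download()
            if file_exists():
                return True
            logger.warning("%s: attempt %d produced no file", name, attempt)
        except Exception:
            # Log the traceback instead of letting it stop the whole run.
            logger.exception("%s: attempt %d raised an error", name, attempt)
        if attempt < attempts:
            time.sleep(wait_seconds)  # brief pause before retrying
    logger.error("%s: failed after %d attempts", name, attempts)
    return False


def run_all(
    tasks: Iterable[Task], attempts: int = 3, wait_seconds: float = 5.0
) -> List[str]:
    """Run every task and return the names of those that ultimately failed."""
    failed = []
    for name, download, file_exists in tasks:
        if not run_with_retries(name, download, file_exists, attempts, wait_seconds):
            failed.append(name)
    return failed
```

The list returned by `run_all` could then drive an alert at the end of the nightly run, while the remaining downloads complete normally.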