The goal
- Periodically use Internet Archive's "Save Page Now" capturing service to preserve copies of Data Sources in our database. (What is a data source?)
- Log 404s, timeouts, and other errors in a file. Store the airtable_uid, source_url, and an error message as part of the log, so we can easily update the database if we can't fix the URL.
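
As a rough illustration of the two goals above, here is a minimal Python sketch. It assumes each data source arrives as a dict with airtable_uid and source_url keys, that a capture is requested by fetching https://web.archive.org/save/ followed by the source URL, and that the error log is a CSV file. The function names (save_page, archive_sources) and the log layout are hypothetical, chosen only for this example.

```python
import csv
import os
import urllib.request

# Assumption: Save Page Now is triggered by requesting this prefix + the target URL.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page(source_url, timeout=60):
    """Request a capture of source_url; raises on HTTP errors and timeouts."""
    with urllib.request.urlopen(SAVE_ENDPOINT + source_url, timeout=timeout) as resp:
        return resp.status


def archive_sources(rows, log_path, save=save_page):
    """Try to archive every row; append failures to a CSV log and return them.

    `save` is injectable so the loop can be exercised without network access.
    """
    failures = []
    for row in rows:
        try:
            save(row["source_url"])
        except Exception as exc:  # 404s, timeouts, DNS failures, and so on
            failures.append(
                {
                    "airtable_uid": row["airtable_uid"],
                    "source_url": row["source_url"],
                    "error": f"{type(exc).__name__}: {exc}",
                }
            )

    # Write the header only when the log file is new or empty.
    needs_header = not os.path.exists(log_path) or os.path.getsize(log_path) == 0
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["airtable_uid", "source_url", "error"])
        if needs_header:
            writer.writeheader()
        writer.writerows(failures)
    return failures
```

Calling archive_sources(rows, "archive_errors.csv") on a schedule would cover the "periodically" part. The scheduling mechanism and any rate limiting between requests are left open here.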
Why?
Retention policies can be unforgiving, and important records are lost to time every day. It's bad news when a URL in our data sources database stops working. The best time to plant a tree is 20 years ago; the second best time is now. Same here.
But why though?
It is an incentive to give us data sources. Currently, we ask people to submit them as a favor or because they believe in the cause. Instead, we can say “If you know about an internet data source, submit it to us and tell us how often it’s refreshed. We will create archives automatically and link to them.” Instead of just passively storing URLs, we’re doing our part to preserve data. Fun!
Seriously, explain why this is important
It makes our data sources database more and more useful over time. That URL we saved 404s now? No worries, we archived it. Instead of becoming a useless row we need to delete, it points to a place where information was published and can still be accessed.
Suggested approach