Internet Archive's Projects
Support for writing WARC files with Scrapy
Pure Python HDFS client: Python 3.x version
Internet Archive's Sparkling Data Processing Library
Heritrix frontier files manipulation tool.
Sort-friendly URI Reordering Transform (SURT) Python module
Content drift assessment tool for TARB project.
Soft-404 detection system for the TARB project.
A Streamlit application to visualize Wikipedia IABot statistics
GitHub Actions test for CI/CD pipelines -- see full CI/CD at https://github.com/internetarchive/cicd
(T)he (N)ew (H)otness. Improved full-text search of archival web data.
A mathematical model that calculates a normalized score quantifying the temporal resilience of a web page, treating the historical observations of the page in web archives as time-series data.
Trough: Big data, small databases.
Ultra fast JSON decoder and encoder written in C with Python bindings
A queue-controlled browser automation tool for improving web crawl quality
Full-text indexing pipeline based on Hadoop/Pig scripts.
Python library for reading and writing WARC files
WARC-writing MITM HTTP/S proxy
Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)
IA's public Wayback Machine (moved from SourceForge)
React components to render differences between captures at the Wayback Machine
A Python 3.6+ application that calculates and returns simhash values for Internet Archive's snapshots
Reduce annoying 404 pages by automatically checking for an archived copy in the Wayback Machine. Learn more about this Test Pilot experiment at https://testpilot.firefox.com/