Topic: warc
Something interesting about warc
warc,Mounts WARC files on Windows
User: antiufo
warc,🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Organization: archivebox
Home Page: https://archivebox.io
warc,DigestBox takes any webpage URL (news article, video link, comment thread, etc.) and gives you just the raw content. It's powered by ArchiveBox.io under the hood.
Organization: archivebox
Home Page: https://DigestBox.io
warc,A Rails engine supporting the discovery of web archives.
Organization: archivesunleashed
Home Page: https://archivesunleashed.org/warclight/
warc,The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Organization: archiveteam
warc,Decentralized web archiving
Organization: archiveteam
warc,Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Organization: archiveteam
Home Page: https://www.archiveteam.org/
warc,Bitextor generates translation memories from multilingual websites
Organization: bitextor
Home Page: https://bitextor.readthedocs.io/en/latest/
warc,A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika
User: centic9
warc,A robust web archive analytics toolkit
Organization: chatnoir-eu
Home Page: https://resiliparse.chatnoir.eu
warc,A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Organization: cocrawler
warc,CoCrawler is a versatile web crawler built using modern tools and concurrency.
Organization: cocrawler
warc,News crawling with StormCrawler - stores content as WARC
Organization: commoncrawl
warc,Drill into WARC web archives
Organization: crissyfield
warc,metawarc: a command-line tool for metadata extraction from files from WARC (Web ARChive)
Organization: datacoon
warc,Golang WARC (Web ARChive) Library
Organization: datatogether
warc,ARCHIVED--Docker app to crawl URLs and generate WARCs
Organization: edgi-govdata-archiving
warc,WarcDB: Web crawl data as SQLite databases.
User: florents-tselai
Home Page: https://WarcDB.tselai.com
warc,WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
Organization: harvard-lil
warc,An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
User: helgeho
warc,A Splitable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
User: helgeho
warc,📇 Tools to Work with the Web Archive Ecosystem in R
User: hrbrmstr
warc,Summarize web archive capture index (CDX) files.
Organization: internetarchive
warc,Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Organization: internetarchive
Home Page: https://heritrix.readthedocs.io/
warc,Support for writing WARC files with Scrapy
Organization: internetarchive
warc,⚙️ A Rust library for reading and writing WARC files
User: jedireza
Home Page: https://docs.rs/warc/
warc,🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
User: machawk1
Home Page: https://matkelly.com/wail
warc,Chrome extension to "Create WARC files from any webpage"
User: machawk1
Home Page: https://warcreate.com
warc,🗄️ A simple CLI for converting WARC to Parquet.
User: maxcountryman
warc,Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
User: mikwielgus
warc,Read Web ARChive (WARC) files in PHP.
Organization: mixnode
Home Page: https://www.mixnode.com
warc,Parse And Create Web ARChive (WARC) files with node.js
User: n0tan3rd
warc,Command line tool for digging into WARC files
Organization: nlnwa
Home Page: https://nlnwa.github.io/warchaeology/
warc,InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Organization: oduwsdl
warc,This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
Organization: oduwsdl
warc,🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.
User: pirate
Home Page: https://pirate.github.io/internet-archiving-talk/
warc,Web archiving using Google Chrome
User: promyloph
Home Page: https://6xq.net/crocoite/
warc,Collect and revisit web pages.
Organization: rhizome-conifer
Home Page: https://conifer.rhizome.org
warc,Awesome list dedicated to digital and data preservation tools, sources, services and so on.
Organization: ruarxive
warc,Shepherding our web archives from crawl to access.
Organization: ukwa
warc,A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!
Organization: webrecorder
Home Page: https://chrome.google.com/webstore/detail/webrecorder/fpeoodllldobpkbkabpblcfaogecpndd
warc,Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
Organization: webrecorder
Home Page: https://browsertrix.com
warc,Run a high-fidelity browser-based crawler in a single Docker container
Organization: webrecorder
Home Page: https://crawler.docs.browsertrix.com
warc,CDXJ Indexing of WARC/ARCs
Organization: webrecorder
warc,Serverless replay of web archives directly in the browser
Organization: webrecorder
Home Page: https://replayweb.page
warc,Streaming WARC/ARC library for fast web archive IO
Organization: webrecorder
Home Page: https://pypi.python.org/pypi/warcio
warc,Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Organization: webrecorder