nla / httrack2warc
Converts HTTrack crawls to WARC files
License: Apache License 2.0
HTTrack appears to write the URL escaped in new.txt (e.g. spaces replaced with %20) but unescaped in new.zip. This causes a cache lookup error when the two forms do not match:
Exception in thread "main" java.io.IOException: no cache entry: http://example.org/some%20file.jpg
at au.gov.nla.httrack2warc.httrack.HttrackCrawl.buildRecord(HttrackCrawl.java:148)
It appears that, in new.txt entries, HTTrack escapes the following characters:
Notably this does not include the % character. Therefore this transformation is not safely reversible.
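Because the escaping cannot be reversed reliably, one possible workaround is to try several candidate keys against the cache instead of decoding the URL. The sketch below is only an illustration: `CacheKeyLookup` and its methods are hypothetical names, not part of httrack2warc, and it handles only the `%20` case mentioned above since the full set of escaped characters is not listed here.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper, not part of httrack2warc. */
final class CacheKeyLookup {
    /** Candidate cache keys for a URL read from new.txt, most literal first. */
    static Set<String> candidates(String urlFromNewTxt) {
        Set<String> keys = new LinkedHashSet<>();
        keys.add(urlFromNewTxt);                     // exact form as written in new.txt
        keys.add(urlFromNewTxt.replace("%20", " ")); // assumed new.zip form: spaces unescaped
        return keys;
    }

    /** Returns the first cache entry matching any candidate key. */
    static <V> Optional<V> find(Map<String, V> cache, String urlFromNewTxt) {
        for (String key : candidates(urlFromNewTxt)) {
            V entry = cache.get(key);
            if (entry != null) {
                return Optional.of(entry);
            }
        }
        return Optional.empty();
    }
}
```

Trying the exact form first means a URL that genuinely contains `%20` still resolves to its own entry when one exists. The ambiguity remains if both forms are present in the cache.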
I tried to view a WARC file just now with OpenWayback and it outputs the following. Is this a problem with the WARC or with httrack2warc?
WARNING: Bad Record. Trying skip (Record start 782): Unexpected character 41(Expecting d)
Mar 02, 2020 10:47:38 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
SEVERE: FAILED to index or upload (crawl.warc)
java.lang.RuntimeException: After retry (Offset 782)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:512)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.resourceindex.updater.IndexClient.addSearchResults(IndexClient.java:158)
at org.archive.wayback.resourcestore.indexer.IndexWorker.doWork(IndexWorker.java:111)
at org.archive.wayback.resourcestore.indexer.IndexWorker$WorkerThread.run(IndexWorker.java:244)
Caused by: java.io.IOException: Unexpected character 43(Expecting d)
at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
at org.archive.io.ArchiveReader.get(ArchiveReader.java:144)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.innerNext(ArchiveReader.java:562)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:537)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:505)
... 9 more
The command I used to download the website:
httrack "https://web.archive.org/web/20180611033123/https://github.com/adlio/usgs-waterdata/tree-commit/89c97a80cdd6fba90972fd137fcd5a7a92ad1fff" '-*' '+https://web.archive.org/web/20180611033123*' '+https://archive.org/includes*' '+https://web.archive.org/_static*' '+https://archive.org/images*' '+https://archive.org/services*' '+https://archive.org/components*' '+https://www.archiveteam.org*' -N1005 --advanced-progressinfo --can-go-up-and-down --display --keep-alive --mirror --robots=0 --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' --verbose
The command I used to create the warc:
java -jar /Users/fabiansturm/Documents/projects/httrack2warc/target/httrack2warc-0.4.0-shaded.jar /var/folders/rw/35q09zqj5yv5pz4wwg3yjfkm0000gn/T/webcache-download731331670 -o /var/folders/rw/35q09zqj5yv5pz4wwg3yjfkm0000gn/T/http2warc115706301 -C none
After that I renamed it to crawl.warc since I used -C none.
To run the container:
docker pull iipc/openwayback
docker container run -it --rm -v /tmp/owb:/data -p 8089:8080 iipc/openwayback
Exception in thread "main" java.nio.file.NoSuchFileException: .../hts-ioinfo.txt
...
at java.nio.file.Files.newInputStream(Files.java:152)
at au.gov.nla.httrack2warc.httrack.HttrackCrawl.parseIoinfo(HttrackCrawl.java:50)
at au.gov.nla.httrack2warc.httrack.HttrackCrawl.<init>(HttrackCrawl.java:46)
at au.gov.nla.httrack2warc.Httrack2Warc.convert(Httrack2Warc.java:71)
at au.gov.nla.httrack2warc.Main.main(Main.java:103)
In HTTrack 3.49-2 we have:
hts-cache/new.txt:11:21:41 185/185 ---M-- 301 error ('Moved%20Permanently') text/html date:Tue,%2009%20Jan%202018%2002:21:41%20GMT http://test.example.org/redirect test.example.org/redirect (from http://test.example.org/)
Binary file hts-cache/new.zip matches
hts-ioinfo.txt:[1] request for test.example.org/redirect:
hts-ioinfo.txt:<<< GET /redirect HTTP/1.1
hts-ioinfo.txt:[1] response for test.example.org/redirect:
The new.zip entry comment has:
HTTP/1.1 301 Moved Permanently
X-In-Cache: 1
X-StatusCode: 301
X-StatusMessage: Moved Permanently
X-Size: 185
Content-Type: text/html
Last-Modified: Tue, 09 Jan 2018 02:21:41 GMT
Location: http://test.example.org/another
X-Addr: test.example.org
X-Fil: /redirect
X-Save: test.example.org/redirect
These are converted correctly if hts-ioinfo.txt is present. Without hts-ioinfo.txt, a resource record is currently created instead.
I don't think a cache entry is present at all in early versions of HTTrack. It might be possible to recreate redirects from the log messages though.
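When a cache entry does exist, the header block stored in the new.zip entry comment seems to carry enough information to recognise a redirect without hts-ioinfo.txt. The following is a minimal sketch under that assumption. `CachedHeaders` is a hypothetical class, not the project's actual parser, and it assumes the comment starts with a well-formed status line like the one shown above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical parser for the header block found in a new.zip entry comment. */
final class CachedHeaders {
    final int status;
    final Map<String, String> headers = new LinkedHashMap<>();

    CachedHeaders(String comment) {
        String[] lines = comment.split("\r?\n");
        // Status line, e.g. "HTTP/1.1 301 Moved Permanently"
        String[] parts = lines[0].trim().split(" ", 3);
        status = Integer.parseInt(parts[1]);
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon <= 0) {
                continue; // skip blank or malformed lines
            }
            String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
            headers.put(name, lines[i].substring(colon + 1).trim());
        }
    }

    /** True when the entry is a 3xx response that carries a Location header. */
    boolean isRedirect() {
        return status >= 300 && status < 400 && headers.containsKey("location");
    }
}
```

A converter could call `isRedirect()` before choosing the record type, for example to avoid writing a resource record for `test.example.org/redirect`. Whether that is the right behaviour for httrack2warc is left open here.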
Requests for URLs with an image file extension (e.g. foo.gif) might return an HTML 404 error message. In this case HTTrack appears to write the error message to a file named foo.html but still refers to it as foo.gif in the cache and in new.txt.
I've worked around this for now by skipping missing files when their recorded status is an HTTP error code. Is there a way we can detect and handle this case properly? Maybe we can implement the same conditions HTTrack uses for renaming the files and probe for their existence.
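The probing idea could look roughly like the sketch below. It covers only the single case described here (an expected file whose content was saved with an .html extension) and does not reproduce HTTrack's actual renaming conditions, which are not spelled out in this issue. `RenamedFileProbe` is a hypothetical name.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical probe for files HTTrack may have saved under a different extension. */
final class RenamedFileProbe {
    /**
     * Returns the expected file if it exists, otherwise a sibling with the
     * same base name and an .html extension if that exists.
     */
    static Optional<Path> locate(Path expected) {
        if (Files.exists(expected)) {
            return Optional.of(expected);
        }
        String name = expected.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String stem = dot > 0 ? name.substring(0, dot) : name;
        Path candidate = expected.resolveSibling(stem + ".html");
        return Files.exists(candidate) ? Optional.of(candidate) : Optional.empty();
    }
}
```

An empty result would still fall back to the current workaround of skipping the entry when its status code indicates an error.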
Even when we have the headers from the HTTrack debug log, we don't have the original transfer-encoded bytes of the response message. A WARC file is supposed to contain the encoded response as it was on the wire, so we should remove the Transfer-Encoding header before writing the WARC; otherwise the header would describe an encoding the stored body no longer has.
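A minimal sketch of that filtering step, assuming the header in question is Transfer-Encoding and that the headers are available as one string per header line. `HeaderFilter` is a hypothetical name, and obsolete folded (multi-line) headers are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that drops Transfer-Encoding from a list of header lines. */
final class HeaderFilter {
    static List<String> withoutTransferEncoding(List<String> headerLines) {
        List<String> kept = new ArrayList<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            String name = colon < 0 ? line : line.substring(0, colon);
            // Header names are case-insensitive, so compare in lower case.
            if (!name.trim().toLowerCase(Locale.ROOT).equals("transfer-encoding")) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

All other lines, including the status line if it is part of the list, pass through unchanged.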