toimik / WarcProtocol
Parser for WARC (aka WebArchive) files
License: Apache License 2.0
I have an input WARC file with a response record. This record has a WARC-Identified-Payload-Type header, but its value is not used. Specifically, the ResponseRecord returned by the WarcParser.Parse() method doesn't set its IdentifiedPayloadType to the value of the WARC-Identified-Payload-Type header.
This isn't what I expect. The value exists in the WARC and has a valid format, so it should appear on the parsed response record.
(While I could create a custom PayloadTypeIdentifier, this is wasteful: it requires computation on each parsed record, and its logic for determining the payload type may not be as sophisticated as the logic that originally determined the payload content type when the WARC was created.)
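The expected behavior can be sketched in a library-neutral way. The Python below is only an illustration of the precedence being asked for (use the declared header value first, fall back to identification only when it is missing). The names resolve_payload_type and identify are hypothetical and are not part of WarcProtocol.

```python
from typing import Callable, Mapping, Optional

HEADER = "warc-identified-payload-type"


def resolve_payload_type(
    headers: Mapping[str, str],
    payload: bytes,
    identify: Optional[Callable[[bytes], Optional[str]]] = None,
) -> Optional[str]:
    """Prefer the type declared in the WARC header; identify only as a fallback.

    `identify` stands in for a custom payload type identifier (hypothetical).
    """
    # WARC named fields are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    declared = normalized.get(HEADER, "").strip()
    if declared:
        return declared  # no per-record computation needed
    if identify is not None and payload:
        return identify(payload)
    return None
```

With this precedence, the identifier only runs for records that lack the header, which avoids the repeated computation described above.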
Currently it is impossible to create RequestRecord or ResponseRecord records with any Payload data for non-HTTP traffic. This means that headers like "WARC-Identified-Payload-Type" or "WARC-Payload-Digest" cannot be set for these records when they contain non-HTTP traffic, and there is no way to set them manually.
While PayloadTypeIdentifier can be extended to identify different payload content types, its Identify() method is only called with the payload bytes that have been extracted from the content block bytes, so it depends on payload detection.
Payload detection itself is done in the Utils.IndexOfPayload() method, which is called with the content block bytes. IndexOfPayload() is hardcoded to search the byte array for the index of an HTTP-style double CRLF. Anything after that index is considered the payload. If no HTTP-style double CRLF is found, the Payload byte array is set to an empty array. This means even a custom PayloadTypeIdentifier can't help, since it receives an empty byte array for non-HTTP records.
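The detection logic described above can be illustrated with a short Python sketch. This is a re-creation of the described behavior, not the library's actual C# code. The function names are hypothetical, and the Gemini-style block is only an assumed example of a non-HTTP content block (one status line ending in a single CRLF, followed directly by the body).

```python
DOUBLE_CRLF = b"\r\n\r\n"


def index_of_payload(content_block: bytes) -> int:
    """Return the index just past the first double CRLF, or -1 if there is none."""
    pos = content_block.find(DOUBLE_CRLF)
    return -1 if pos < 0 else pos + len(DOUBLE_CRLF)


def extract_payload(content_block: bytes) -> bytes:
    """Mimic the described behavior: no double CRLF means an empty payload."""
    start = index_of_payload(content_block)
    return b"" if start < 0 else content_block[start:]


# HTTP separates headers from the body with a double CRLF, so a payload is found.
http_block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"

# Assumed non-HTTP example: no double CRLF anywhere, so the payload comes back empty.
gemini_block = b"20 text/gemini\r\n# Hello\n"

print(extract_payload(http_block))
print(extract_payload(gemini_block))
```

Any identifier that runs after this step sees only the empty byte string for the second block, which is why extending the identifier alone does not help.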
Not having headers like "WARC-Identified-Payload-Type" creates major interoperability challenges, since tools in the WARC ecosystem, such as warcio and cdxj-indexer, use those headers when creating CDX files and more.
When using "toimik/WarcProtocol" to parse a "warc.gz" file from CommonCrawl, the HTML body we parsed had extra content at the end (one such case can be seen in the left part of the snapshot).
But when we unzip the "warc.gz" file first and then use "toimik/WarcProtocol" to parse it, we get the correct body.
Not sure if anyone else has met the same issue, as CommonCrawl "warc.gz" files are widely used now.
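For anyone trying to reproduce this, it may help to know that "warc.gz" files are commonly written as a series of concatenated gzip members, typically one per record. The cause of the extra trailing content is not established in this report, so the Python sketch below only shows the two decompression paths side by side on made-up data: decompressing everything up front (the workaround above) and walking the members one at a time. The function name iter_gzip_members is hypothetical.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in `data`, in order."""
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(zlib.MAX_WBITS | 16)
        out = member.decompress(data) + member.flush()
        yield out
        # Bytes after the end of this member belong to the next one.
        data = member.unused_data


# Two independent gzip members concatenated, standing in for two WARC records.
blob = gzip.compress(b"record one") + gzip.compress(b"record two")

whole = gzip.decompress(blob)               # decompress everything first
members = list(iter_gzip_members(blob))     # decompress member by member
```

If a streaming reader hands bytes past a member boundary to the wrong record, content from the next member could show up at the end of a body, so comparing the per-member output with the parsed body is one thing to check.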
WarcProtocol outputs WARC-Target-URI headers using targetUri.ToString(), which does not URL-encode characters, as shown below. This causes URLs with spaces and other disallowed characters to appear in the WARC-Target-URI header, which violates the spec. WARC tools like warcio flag this error:
$ warcio check converted.warc
Replacing spaces in invalid WARC-Target-URI: gemini://multiverse.thruhere.net/library/math_logic_comp/Unix System Administration Handbook.pdf
I believe the code should instead call targetUri.AbsoluteUri to get the URL with proper URL encoding.
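To show what the encoded header value would look like, here is a small Python sketch that percent-encodes the URL from the warcio message. It is an illustration only and not the library's code. The helper name encode_target_uri and the chosen set of safe characters (the URI reserved characters plus "%", so existing escapes are not encoded twice) are assumptions.

```python
from urllib.parse import quote

# Reserved URI characters are left alone; "%" is kept so existing escapes survive.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def encode_target_uri(uri: str) -> str:
    """Percent-encode characters that may not appear raw in a URI, such as spaces."""
    return quote(uri, safe=SAFE_CHARS)


raw = (
    "gemini://multiverse.thruhere.net/library/math_logic_comp/"
    "Unix System Administration Handbook.pdf"
)
encoded = encode_target_uri(raw)
print(encoded)
```

The spaces become %20 while the scheme, host, and path separators are unchanged, which is the form a checker like warcio expects in WARC-Target-URI.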