alard / warc-proxy
Serving content from a WARC
Hello! I'm new to both Python and WARC, so please be gentle.
I've followed steps 1-5: I installed Python 3.6.2 and tornado 4.5.2, ran the proxy, and set my browser's proxy settings as required.
What's next? How do I use this viewer? Can anyone explain in more detail?
Thank you.
HTTPS traffic does not go through the WARC proxy. The proxy settings should be changed to route it there, and the proxy code should be changed so that it can serve HTTPS records.
Instead of being scanned for file system changes when the file selection dialog is opened, the disk is currently scanned only when the web application page is reloaded in the browser (by pressing F5).
As a result, WARC files created after the web application has been launched are not visible in the file selection dialog. People who are new to the application do not necessarily expect that they have to refresh the whole page to see the updates.
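One way to get the behaviour described above would be to rebuild the file list every time the dialog asks for it. A minimal sketch, assuming the list is produced by walking a directory tree; the function name list_warc_files is hypothetical and not taken from warc-proxy:

```python
import os


def list_warc_files(root):
    """Walk root and return the sorted paths that end in .warc or .warc.gz.

    Hypothetical helper: calling it on every dialog open (rather than once
    per page load) would pick up files created after launch.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith((".warc", ".warc.gz")):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

The walk is repeated on each call, so on very large trees it may be worth caching the result for a few seconds.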
I installed the add-on in Firefox 24.0 on Ubuntu Linux, but it does not appear in the Tools menu. It does appear in the Add-ons list. Suggestions?
Using the Firefox extension, the file picker doesn't allow me to select any .warc.gz file.
Firefox 16.0.2 on OS X
https://github.com/alard/warc-proxy/blob/master/firefox-addon/lib/main.js#L22
What is the right procedure to rebuild the XPI?
Thank you
The proxy should remove unknown headers, such as X-Frame-Options SAMEORIGIN.
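A minimal sketch of such a filter, assuming the archived response headers are available as (name, value) pairs. The function name and the blocklist are hypothetical and only contain the header named above; this is not warc-proxy's actual code:

```python
# Assumption: a blocklist holding only the header mentioned in this issue.
BLOCKED_HEADERS = frozenset(["x-frame-options"])


def filter_headers(headers, blocked=BLOCKED_HEADERS):
    """Return the (name, value) pairs whose name is not blocked.

    Header names are compared case-insensitively.
    """
    return [(name, value) for name, value in headers
            if name.lower() not in blocked]


archived = [("Content-Type", "text/html"), ("X-Frame-Options", "SAMEORIGIN")]
replayed = filter_headers(archived)
```

An allowlist of headers the proxy knows how to replay would be the stricter alternative if "unknown" is meant literally.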
I tried to load a WARC containing a few larger (200-300 MB) files. While the WARC was being loaded (indexed), the memory usage of the Python process doing the indexing grew to about 700 MB; the process then ran out of memory and left the following error message in the terminal:
Loading /media/datadisk/upload_queue/hajduvolan_hu_2015_05.warc.gz
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "./warcproxy.py", line 112, in run
    http_response = parse_http_response(record)
  File "./warcproxy.py", line 24, in parse_http_response
    remainder = message.feed(record.content[1])
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 576, in feed
    text = HTTPMessage.feed(self, text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 94, in feed
    text = self.feed_start(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 179, in feed_start
    line, text = self.feed_line(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 159, in feed_line
    text = str(self.buffer[pos:])
MemoryError
The progress bar got stuck and indexing stopped.
I suspect the big files are responsible, as I've been using this great tool for a long time without experiencing such a problem (this was the first time I tried to load a WARC with files larger than a few tens of megabytes). However, I can't imagine why warc-proxy would need 700 MB of memory to index a 250 MB file.
I think you can easily reproduce the problem; the problematic WARC is here: https://archive.org/details/hajduvolan_hu_2015_05. The likely culprits are http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt_2010-2012.flv (249 MB) and http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt.flv (146 MB).
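The traceback ends in text = str(self.buffer[pos:]), which makes a copy of the rest of the buffer. One possible explanation (an assumption, not verified against the hanzo code) is that such copies of a 250 MB payload add up to the observed memory use. A sketch of an alternative that reads only the status line and headers from a file-like object and leaves the body unread; read_http_head is a hypothetical name, not part of hanzo or warc-proxy:

```python
import io


def read_http_head(stream, max_line=65536):
    """Read the status line and headers from a binary stream.

    Only the header lines are buffered; the stream position is left at the
    first byte of the body, so a large payload is never copied here.
    """
    status_line = stream.readline(max_line).rstrip(b"\r\n")
    headers = []
    while True:
        line = stream.readline(max_line)
        if line in (b"\r\n", b"\n", b""):  # blank line or EOF ends the head
            break
        name, _, value = line.partition(b":")
        headers.append((name.strip(), value.strip()))
    return status_line, headers


# Usage with an in-memory stand-in for a record's payload.
payload = io.BytesIO(
    b"HTTP/1.1 200 OK\r\nContent-Type: video/x-flv\r\n\r\n" + b"x" * 10)
status, head = read_http_head(payload)
```

For indexing, the body could then be skipped with a seek instead of being read into memory.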
warc_librarian@acstorage3334:/media/pi/Sinine230GiBUSB/warc_librarian $ Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "./warcproxy.py", line 112, in run
    http_response = parse_http_response(record)
  File "./warcproxy.py", line 24, in parse_http_response
    remainder = message.feed(record.content[1])
  File "/home/warc_librarian/m_local/bin_p/warc_proxy/v2016_11_03/hanzo/httptools/messaging.py", line 576, in feed
    text = HTTPMessage.feed(self, text)
  File "/home/warc_librarian/m_local/bin_p/warc_proxy/v2016_11_03/hanzo/httptools/messaging.py", line 97, in feed
    text = self.feed_headers(text)
  File "/home/warc_librarian/m_local/bin_p/warc_proxy/v2016_11_03/hanzo/httptools/messaging.py", line 191, in feed_headers
    line, text = self.feed_line(text)
  File "/home/warc_librarian/m_local/bin_p/warc_proxy/v2016_11_03/hanzo/httptools/messaging.py", line 159, in feed_line
    text = str(self.buffer[pos:])
MemoryError
ERROR:tornado.application:Uncaught exception POST /load-warc (::1)
HTTPRequest(protocol='http', host='warc', method='POST', uri='/load-warc', version='HTTP/1.1', remote_ip='::1', headers={'Origin': 'http://warc', 'Content-Length': '102', 'Accept-Language': 'en-us;q=0.750', 'Accept-Encoding': 'gzip, deflate', 'Host': 'warc', 'Accept': 'application/json, text/javascript, */*; q=0.01', 'User-Agent': 'Mozilla/5.0 (X11; Linux) AppleWebKit/538.15 (KHTML, like Gecko) Chrome/18.0.1025.133 Safari/538.15 Midori/0.5', 'Connection': 'Keep-Alive', 'X-Requested-With': 'XMLHttpRequest', 'Referer': 'http://warc/static/list.html', 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'})
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/web.py", line 1346, in _when_complete
    callback()
  File "/usr/lib/python2.7/dist-packages/tornado/web.py", line 1367, in _execute_method
    self._when_complete(method(*self.path_args, **self.path_kwargs),
  File "./warcproxy.py", line 344, in post
    index_status = self.warc_proxy.load_warc_file(path)
  File "./warcproxy.py", line 142, in load_warc_file
    self.indices[path] = indexer.records
AttributeError: 'WarcIndexer' object has no attribute 'records'
ERROR:tornado.access:500 POST /load-warc (::1) 30.11ms
The ~560 MiB WARC file that was probably in use when this happened MIGHT be available from
http://temporary.softf1.com/2017/bugs/www.clausewitz.com-2017-02-09-8df72096-00000.warc.gz
It might have happened with some other WARC file, I'm not totally sure, but the referenced one also fails to load for whatever reason.
Usually the bottleneck is HDD/flash speed, but with warc-proxy, "top" on a 4-core Raspberry Pi 3 shows that the "waiting" share is only a few percent, while the Python process that runs warc-proxy consumes only slightly more than one CPU core. The remaining two-point-something cores sit idle.
An initial workaround might be to read the number of cores from /proc/cpuinfo and then start that many very low-level threads, one thread per importable WARC file. A more proper solution is to prepare for utilizing hundreds of CPU cores (archival copy).
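A rough sketch of the one-worker-per-file idea, assuming Python 3 and a per-file indexing function passed in by the caller. The names index_all and index_one are hypothetical and do not exist in warc-proxy. It uses os.cpu_count() instead of parsing /proc/cpuinfo, and defaults to processes rather than threads, because CPU-bound Python threads are limited by the interpreter lock:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def index_all(paths, index_one, executor_cls=ProcessPoolExecutor,
              max_workers=None):
    """Apply index_one to every path in parallel and return {path: result}.

    index_one is supplied by the caller; with ProcessPoolExecutor it has to
    be a picklable, module-level function.
    """
    workers = max_workers or os.cpu_count() or 1
    with executor_cls(max_workers=workers) as pool:
        # pool.map yields results in the same order as paths.
        return dict(zip(paths, pool.map(index_one, paths)))
```

Passing executor_cls=ThreadPoolExecutor gives the thread-based variant described above, which may be enough if most of the time is spent in gzip decompression or disk reads.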