warc-proxy's People

Contributors

alard

warc-proxy's Issues

Don't know how to open WARC file via this viewer :)

Hello! I'm new to Python and WARC, so please be gentle.

I've followed steps 1-5 as set out: I installed Python 3.6.2 and tornado 4.5.2, ran the proxy, and configured my browser with the required proxy settings.

What comes next? How should I use this viewer? Can anyone explain in more detail?

Thank you.

Browsing HTTPS

HTTPS traffic is not routed to the WARC proxy. The proxy settings need to be changed so that it is, and the proxy code needs to be changed so that it can serve HTTPS records.

The File Selection Dialog in the web GUI gets out of sync

Instead of scanning the disk for file system changes each time the file selection dialog is opened, the application scans it only when the web application page is reloaded in the browser (by pressing F5).

As a result, WARC files created after the web application was launched are not visible in the file selection dialog. People who are new to the application do not necessarily expect that they have to refresh the whole page to see the updates.
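
One way to keep the list current would be to walk the directory again on every request from the dialog, rather than once per page load. A minimal sketch, assuming a hypothetical helper `list_warc_files` that the server would call each time the dialog opens; the function name and the suffix filter are illustrative and not taken from warc-proxy:

```python
import os

# Assumed filter for this sketch; warc-proxy may match files differently.
WARC_SUFFIXES = (".warc", ".warc.gz")


def list_warc_files(root):
    """Walk `root` afresh on every call and return the sorted WARC paths."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(WARC_SUFFIXES):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Because nothing is cached between calls, a file created after the page was loaded shows up the next time the dialog asks for the list.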

Firefox addon problem

I installed the addon to Firefox 24.0 on Ubuntu Linux but it does not appear in the Tools menu. It does appear in the Addons list. Suggestions?

Excessive memory usage when loading a WARC with big files

I tried to load a WARC with a few larger (200-300 MB) files in it. While the WARC was loading (being indexed), memory usage of the Python process doing the indexing climbed to roughly 700 MB, and then the process ran out of memory, leaving the following error message in the terminal:

Loading /media/datadisk/upload_queue/hajduvolan_hu_2015_05.warc.gz
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "./warcproxy.py", line 112, in run
    http_response = parse_http_response(record)
  File "./warcproxy.py", line 24, in parse_http_response
    remainder = message.feed(record.content[1])
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 576, in feed
    text = HTTPMessage.feed(self, text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 94, in feed
    text = self.feed_start(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 179, in feed_start
    line, text = self.feed_line(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 159, in feed_line
    text = str(self.buffer[pos:])
MemoryError

The progress bar got stuck and the indexing stopped.
I suspect the big files are responsible, as I've been using this great tool for a long time and haven't experienced such a problem so far (this was the first time I tried to load a WARC with files larger than a few tens of megabytes). However, I can't see why warc-proxy would need 700 MB of memory to index a 250 MB file.

The problem should be easy to reproduce: the problematic WARC is available at https://archive.org/details/hajduvolan_hu_2015_05. The files most likely responsible are http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt_2010-2012.flv (249 MB) and http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt.flv (146 MB).
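
The traceback above ends in `feed_line`, at `text = str(self.buffer[pos:])`, which builds a new string from the rest of the buffer. If that happens once per header line (an assumption, since the surrounding code is not shown here), a record with a 250 MB body is copied repeatedly, which could account for a peak well above the file size. The sketch below illustrates the alternative of tracking an offset and slicing out only the short header lines. `split_headers` is a hypothetical function written for this illustration, not part of hanzo's `httptools`:

```python
def split_headers(raw):
    """Return (header_lines, body_offset) without copying the body.

    `raw` is a complete HTTP message as bytes. Only the header lines are
    sliced out; the body stays where it is and is addressed by offset.
    """
    lines = []
    pos = 0
    while True:
        end = raw.find(b"\r\n", pos)
        if end == -1:
            # No further line terminator: the header section is incomplete,
            # so any trailing partial line is ignored in this sketch.
            return lines, len(raw)
        line = raw[pos:end]   # small copy: one header line only
        pos = end + 2         # advance past the CRLF
        if not line:
            # A blank line marks the end of the headers.
            return lines, pos
        lines.append(line)
```

With the offset in hand, `memoryview(raw)[body_offset:]` gives access to the body without duplicating it.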

AttributeError: 'WarcIndexer' object has no attribute 'records'

warc_librarian@acstorage3334:/media/pi/Sinine230GiBUSB/warc_librarian $ Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "./warcproxy.py", line 112, in run
    http_response = parse_http_response(record)
  File "./warcproxy.py", line 24, in parse_http_response
    remainder = message.feed(record.content[1])
  File "/home/warc_librarian/m_local/bin_p/warc_proxy/v2016_11_03/hanzo/httptools/messaging.py", line 576, in feed
    text = HTTPMessage.feed(self, text)
  File "/home/warc_librarian/m_local/bin_p/warc_proxy/v2016_11_03/hanzo/httptools/messaging.py", line 97, in feed
    text = self.feed_headers(text)
  File "/home/warc_librarian/m_local/bin_p/warc_proxy/v2016_11_03/hanzo/httptools/messaging.py", line 191, in feed_headers
    line, text = self.feed_line(text)
  File "/home/warc_librarian/m_local/bin_p/warc_proxy/v2016_11_03/hanzo/httptools/messaging.py", line 159, in feed_line
    text = str(self.buffer[pos:])
MemoryError

ERROR:tornado.application:Uncaught exception POST /load-warc (::1)
HTTPRequest(protocol='http', host='warc', method='POST', uri='/load-warc', version='HTTP/1.1', remote_ip='::1', headers={'Origin': 'http://warc', 'Content-Length': '102', 'Accept-Language': 'en-us;q=0.750', 'Accept-Encoding': 'gzip, deflate', 'Host': 'warc', 'Accept': 'application/json, text/javascript, */*; q=0.01', 'User-Agent': 'Mozilla/5.0 (X11; Linux) AppleWebKit/538.15 (KHTML, like Gecko) Chrome/18.0.1025.133 Safari/538.15 Midori/0.5', 'Connection': 'Keep-Alive', 'X-Requested-With': 'XMLHttpRequest', 'Referer': 'http://warc/static/list.html', 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'})
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/tornado/web.py", line 1346, in _when_complete
    callback()
  File "/usr/lib/python2.7/dist-packages/tornado/web.py", line 1367, in _execute_method
    self._when_complete(method(*self.path_args, **self.path_kwargs),
  File "./warcproxy.py", line 344, in post
    index_status = self.warc_proxy.load_warc_file(path)
  File "./warcproxy.py", line 142, in load_warc_file
    self.indices[path] = indexer.records
AttributeError: 'WarcIndexer' object has no attribute 'records'
ERROR:tornado.access:500 POST /load-warc (::1) 30.11ms

The ~560 MiB WARC file that was probably in use when this happened MIGHT be available from
http://temporary.softf1.com/2017/bugs/www.clausewitz.com-2017-02-09-8df72096-00000.warc.gz
It may have happened with a different WARC file, I'm not totally sure, but the referenced one also fails to load for whatever reason.
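
The log above shows two failures in a row: an indexing thread dies with `MemoryError`, and then the `/load-warc` handler reads `indexer.records`, which apparently was never set. Plausibly the second is a consequence of the first, although the log alone does not prove it. One possible guard is sketched below. `SafeIndexer` and `load_index` are hypothetical names for this illustration and do not reflect how `WarcIndexer` is actually written:

```python
import threading


class SafeIndexer(threading.Thread):
    """Hypothetical stand-in for an indexer thread (not warc-proxy's WarcIndexer)."""

    def __init__(self, build_index):
        super().__init__()
        self._build_index = build_index  # callable that returns the finished index
        self.records = None  # defined up front, so readers never hit AttributeError
        self.error = None    # set if indexing fails

    def run(self):
        try:
            self.records = self._build_index()
        except Exception as exc:  # MemoryError is a subclass of Exception
            self.error = exc


def load_index(indexer):
    """Wait for the indexer, then return its records or raise a clear error."""
    indexer.join()
    if indexer.error is not None:
        raise RuntimeError("indexing failed: %r" % (indexer.error,))
    return indexer.records
```

With this shape, a failed index surfaces as one explicit error to the caller instead of a secondary `AttributeError`.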

Heavy Inefficiency on Raspberry Pi like Computers due to lack of Multi-core Support

Usually the bottleneck is HDD/flash speed, but with warc-proxy on a 4-core Raspberry Pi 3, "top" shows the "waiting" share at only a few percent, while the Python process that runs warc-proxy uses only slightly more than one CPU core. The remaining two-point-something cores sit idle.

An initial workaround might be to read the number of cores from /proc/cpuinfo and then start that many very low-level threads, one thread per importable WARC file. A more proper solution is to prepare for utilizing hundreds of CPU cores (archival copy).
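
A rough sketch of the one-worker-per-file idea follows. It departs from the proposal in two ways, both assumptions of this sketch. First, it uses `os.cpu_count()` rather than parsing /proc/cpuinfo. Second, it defaults to a process pool, because in standard CPython builds threads do not execute Python bytecode in parallel for CPU-bound work. `index_all` and `index_one` are hypothetical names, not warc-proxy functions:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def index_all(paths, index_one, executor_cls=ProcessPoolExecutor, workers=None):
    """Run `index_one(path)` for every path in a pool; return {path: result}.

    `index_one` is a placeholder for whatever indexes a single WARC file.
    With the default process pool it must be picklable, for example a
    module-level function.
    """
    workers = workers or os.cpu_count() or 1
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(index_one, paths))
    return dict(zip(paths, results))
```

Each WARC file is handed to one worker, so several files can be indexed at the same time. A single large file would still be processed by one core.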
