
brozzler's Issues

Feature request: Pass rendered DOM to youtube-dl instead of asking youtube-dl to download the page from the original URL

Many sites add video dynamically with JavaScript (such as what turned out to be the case with issue #104).

youtube-dl does not execute JavaScript. Instead, where video is added dynamically, youtube-dl relies on custom code for getting the video URLs for a page.

That works great when someone has already written an extractor, but in a lot of cases no such extractor has been written.

If, instead of asking youtube-dl to download from the original URL, the rendered DOM were passed to youtube-dl, then in cases where JavaScript execution has placed <video> and <source> tags in the DOM, youtube-dl should be able to find the video.

In other situations, passing the rendered DOM might instead cause youtube-dl to fail. In that case, perhaps brozzler-new-site could take an option controlling whether to pass the rendered DOM or have youtube-dl download the page from the original URL.

Let me know what you think. I know that feature requests in general are tall orders but I believe this would be useful and I hope you will consider it.
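As a sketch of the idea, here is how video URLs could be pulled out of a rendered DOM string using only the standard library (illustrative only; this is not brozzler's or youtube-dl's actual code):

```python
from html.parser import HTMLParser

class VideoSourceFinder(HTMLParser):
    """Collect src attributes from <video> and <source> tags in rendered HTML."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "source"):
            for name, value in attrs:
                if name == "src" and value:
                    self.urls.append(value)

# A snippet of DOM as it might look after JavaScript execution.
rendered_dom = '<html><body><video src="a.mp4"><source src="b.webm"></video></body></html>'
finder = VideoSourceFinder()
finder.feed(rendered_dom)
print(finder.urls)  # ['a.mp4', 'b.webm']
```

Any URLs found this way could then be handed to youtube-dl directly, skipping its own fetch of the original URL.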

SHA1 Payload-Digest should use base 32 and not base 16

I know that the WARC specification does allow base 16 (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-payload-digest). But since most other WARC tools produce base 32 SHA-1 digests, it would be beneficial for the community to use the same standard.

The following tools all produce base32:
Heritrix
wget
Webrecorder

This was discovered when the warc-indexer in the webarchive-discovery project (https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer) reported payload errors when indexing WARC files generated with Brozzler.
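The two encodings carry the same digest bytes; only the textual representation differs. A quick illustration in Python:

```python
import base64
import hashlib

payload = b"hello"
digest = hashlib.sha1(payload).digest()

base16 = digest.hex()                       # hex form, as reported for brozzler here
base32 = base64.b32encode(digest).decode()  # form produced by Heritrix, wget, Webrecorder

print("sha1:" + base16)
print("sha1:" + base32)
# The encodings are interchangeable: base32-decoding recovers the same 20 bytes.
assert base64.b32decode(base32) == bytes.fromhex(base16)
```

So switching the emitted encoding would not change what is digested, only how the `WARC-Payload-Digest` header is spelled.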

Package conflict to install brozzler[easy]

Hello,

I am trying to install brozzler[easy] (version 1.5.18) on Linux (CentOS with Python 3.6.8 and pip 21.1.2). There seem to be some package conflicts when using pip.
I have also installed setuptools 57.0.0.

When I check the requirements in the setup.cfg file from brozzler-1.5.18.tar.gz, we have:

  • install_requires includes 'jinja2>=2.10'.
  • 'easy' requires pywb 0.33.2 (which depends on jinja2<2.9).
    It seems we have a conflict regarding the required version of jinja2.

Has anyone recently installed brozzler[easy] successfully? Do we have to adapt or pin specific package versions?
Or is this a misunderstanding on my part about the installation?

Thanks for the help

Occasional brozzler hangs while scanning a large site.

I managed to catch the exception.

It's on Debian, using Chromium version 55.0.2883.75-1~deb8u1.

2017-02-04 22:36:24,671 1736 INFO WarcWriterThread(tid=1743) warcprox.writer.WarcWriter.close_writer(writer.py:69) closing brozzler-20170205063139292-00094-vs75mxwt.warc.gz
2017-02-04 22:37:09,419 1736 CRITICAL BrozzlingThread:https://www.ncdc.noaa.gov/sotc brozzler.worker.BrozzlerWorker._brozzle_site(worker.py:360) unexpected exception
Traceback (most recent call last):
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/worker.py", line 341, in _brozzle_site
outlinks = self.brozzle_page(browser, site, page)
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/worker.py", line 285, in brozzle_page
on_screenshot=_on_screenshot)
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/browser.py", line 430, in browse_page
user_agent=user_agent, timeout=300)
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/browser.py", line 480, in navigate_to_page
timeout=timeout)
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/browser.py", line 291, in _wait_for
elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 300.1s waiting for: <function Browser.navigate_to_page.. at 0x7fa2ea0a8d08>
2017-02-04 22:37:09,426 1736 INFO BrozzlingThread:https://www.ncdc.noaa.gov/sotc brozzler.browser.Browser.stop(browser.py:344) shutting down websocket connection
2017-02-04 22:37:09,443 1736 INFO BrozzlingThread:https://www.ncdc.noaa.gov/sotc brozzler.chrome.Chrome.stop(chrome.py:267) terminating chrome pgid 13222
2017-02-04 22:37:09,974 1736 INFO BrozzlingThread:https://www.ncdc.noaa.gov/sotc

Don't depend on rethinkdb

RethinkDB seems to be essentially dead in the water. There are some attempts to get it back on track, but right now it sadly does not seem to be maintained.

Scope rules are not obeyed

Hello!

job-conf.rst lacks info about scope rules so I did my best to try to define them in my job.yml:

id: myjob
time_limit: 60 # seconds
scope:
  surt: http://(com,site,www,)/wanted-path
seeds:
  - url: https://www.site.com

If I understand correctly, brozzler should find all links, e.g. https://www.site.com/wanted-path/extra, and crawl only links with wanted-path in them. Is this correct?

The problem I have is that brozzler crawls all links, e.g. https://www.site.com/other-path, but I don't want that. Is the problem with my config, or does brozzler ignore the path of the provided SURT?
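To make the prefix idea concrete, here is a toy sketch of how SURT-prefix scoping works in general. This is an illustration only, not brozzler's actual scoping code (which relies on the surt library and its own canonicalization rules):

```python
from urllib.parse import urlsplit

def simple_surt(url):
    # Toy SURT transform: strip a leading "www.", reverse the host labels,
    # then append the path. The real surt library handles many more cases.
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")

def in_scope(url, surt_prefix):
    # A discovered link is in scope if its SURT form starts with the prefix.
    return simple_surt(url).startswith(surt_prefix)

print(in_scope("https://www.site.com/wanted-path/extra", "com,site)/wanted-path"))  # True
print(in_scope("https://www.site.com/other-path", "com,site)/wanted-path"))         # False
```

Under this model a path component in the SURT prefix does restrict the crawl, so if all links are being followed, the configured prefix probably does not match the canonicalized form brozzler computes.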

Error Installing brozzler-easy

Hi, I got this error while installing brozzler-easy with Python 3.8.5:

ERROR: pywb 0.33.2 has requirement jinja2<2.9, but you'll have jinja2 3.0.1 which is incompatible.

installation difficulties of brozzler[easy] on cygwin and Linux

I've spent the last couple of days trying to get brozzler[easy] to work on Cygwin and Linux with no success. Brozzler 1.5.18 installs, but brozzler[easy] does not, due to various dependency conflicts. I've tried with Python 3.8 on a fresh Ubuntu 20 image, and Python versions 3.8, 3.7, and 3.5 under Cygwin, with similar results. I've also tried older versions of brozzler, e.g. 1.4, which also seem to have dependency issues. It would be great to have a recipe / detailed documentation for getting brozzler to work on these platforms, as I'd really like to test it.

Any advice?

Unable to get started based on README

I attempted to use this software per the Installation and Usage instructions in the project README and am unable to get far. I first install rethinkdb:

brew update && brew install rethinkdb

then install brozzler per the README instructions:

pip install brozzler

I attempt the first command in the Usage section of the README but receive an error:

$ brozzler-worker -e chromium
Traceback (most recent call last):
  File "/usr/local/bin/brozzler-worker", line 4, in <module>
    __import__('pkg_resources').run_script('brozzler==1.1.dev1', 'brozzler-worker')
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/EGG-INFO/scripts/brozzler-worker", line 8, in <module>
    import brozzler
  File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/__init__.py", line 48, in <module>
    from brozzler.site import Page, Site
  File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/site.py", line 12, in <module>
    class Site(brozzler.BaseDictable):
  File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/site.py", line 13, in Site
    logger = logging.getLogger(__module__ + "." + __qualname__)
NameError: name '__qualname__' is not defined

Figuring this might not be a fatal error, I also create a sample job based on the Job Configuration section of the README:

$ echo -e "id: test\nseeds:\n  - url: http://example.com" > testJob.yaml
$ brozzler-new-job testJob.yaml 
Traceback (most recent call last):
  File "/usr/local/bin/brozzler-new-job", line 4, in <module>
    __import__('pkg_resources').run_script('brozzler==1.1.dev1', 'brozzler-new-job')
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/EGG-INFO/scripts/brozzler-new-job", line 7, in <module>
    import brozzler
  File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/__init__.py", line 48, in <module>
    from brozzler.site import Page, Site
  File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/site.py", line 12, in <module>
    class Site(brozzler.BaseDictable):
  File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/site.py", line 13, in Site
    logger = logging.getLogger(__module__ + "." + __qualname__)
NameError: name '__qualname__' is not defined
  • OS X 10.11.5
  • Python 2.7.10 (Is this the issue due to warcprox needing py3?)
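The NameError itself supports the Python-version guess: __qualname__ is only defined inside class bodies under Python 3, so the failing line in brozzler/site.py cannot work under Python 2.7. A quick check (assuming a Python 3 interpreter):

```python
import logging

class Site:
    # Works under Python 3, where both __module__ and __qualname__ are
    # available inside the class body; raises NameError under Python 2.7.
    logger = logging.getLogger(__module__ + "." + __qualname__)

print(Site.logger.name)
```

Running the same line under Python 2.7 fails exactly as in the traceback above, so installing brozzler into a Python 3 environment seems like the first thing to try.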

How do I get started with brozzler?

dashboard connect from external machine

Hi,

I tried to launch the dashboard on one machine, and to connect to and display the dashboard from another machine using its IP address.

I have this on my server:

2018-10-04 16:01:27,083 48494 INFO MainThread root.run(__init__.py:265) running brozzler-dashboard using gunicorn
2018-10-04 16:01:27,087 48494 INFO MainThread gunicorn.error.info(glogging.py:271) Starting gunicorn 19.8.1
2018-10-04 16:01:27,088 48494 INFO MainThread gunicorn.error.info(glogging.py:271) Listening at: http://127.0.0.1:8000 (48494)
2018-10-04 16:01:27,088 48494 INFO MainThread gunicorn.error.info(glogging.py:271) Using worker: sync
2018-10-04 16:01:27,091 48501 INFO MainThread gunicorn.error.info(glogging.py:271) Booting worker with pid: 48501

and when I try http://ip:8000/ on the other machine, I get this: ERR_CONNECTION_REFUSED

Is it possible to do this with brozzler: launch the dashboard on one machine and connect to it from another one?

Thanks

Facebook authentication fails

I don't think this is a brozzler issue. I guess Facebook has changed things. But if I try to capture a Facebook page I always get the following error:

Your request couldn't be processed. There was a problem with this request. We're working on getting it fixed as soon as we can

Username and password are correct. I have the same issue when trying to create a profile with Browsertrix (but there I can fix it by logging into the mobile version and going back to the web version).

How to connect db entries from the table "sites" to a belonging warc-file?

Hi brozzler-team,

I want to export database entries belonging to a specific warc-file, from the tables jobs, sites and pages.
I know how to connect those tables to each other, but I couldn't find a connection to the table captures or directly to the corresponding WARC file.

Does it work via the "WARC-Date" in the warcinfo record of the WARC file and "last_claimed" in the table sites?

A hint would be great. Thanks.
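One possible link, based purely on the filename layout seen in brozzler logs (an assumption, not a documented brozzler API): warcprox-style WARC names embed the time the writer opened the file, which could be compared against timestamps such as last_claimed. A sketch:

```python
from datetime import datetime

def warc_open_time(filename):
    # Names look like brozzler-20170205063139292-00094-vs75mxwt.warc.gz;
    # the 17-digit field appears to be year month day hour minute second
    # millisecond, i.e. the moment the writer opened the file.
    stamp = filename.split("-")[1]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S%f")

t = warc_open_time("brozzler-20170205063139292-00094-vs75mxwt.warc.gz")
print(t)  # 2017-02-05 06:31:39.292000
```

Matching sites to WARC files by time window is approximate at best; the captures table (keyed by urls and timestamps) is presumably the more reliable join if a path through it exists.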

Rationale for using browser

It may be a newbie question, but I'm missing from the README a rationale for using a browser for scraping and archiving.

Question 1) Why do you need a browser in the first place?
Question 2) If it's for JavaScript, why not use http://phantomjs.org/ ?

brozzler[easy] not found

Hi, I'm trying to install brozzler[easy] in my virtualenv (Python 3.5.2), but pip says that no matches were found if I run this command:

pip install brozzler[easy]

I have successfully installed brozzler (1.1b10) with pip in this virtual environment.

Complete flow:

primoz@computer:~/projects ⇒ pip install brozzler[easy]
zsh: no matches found: brozzler[easy]

And brozzler[dashboard] is not found either.

Random SAML Authentication

I've been using brozzler to archive a SAML-protected site, which I've been able to enter by using user_agent to get the site to accept regular form authentication. However, I frequently encounter an issue where, after passing the form authentication, brozzler archives the actual authentication page that asks for username and password instead of the page "under" the authentication request. Essentially, after the site is entered and SAML authentication of the home page is completed, some of the following pages have this issue.

This occurs on about 1 in 6 pages, and I'm unsure whether this is due to site settings or brozzler itself. Any tips / advice would be very much appreciated.

deadlock-ish due to thread_raise?

We occasionally find brozzler workers frozen such that kill -QUIT (https://github.com/internetarchive/brozzler/blob/506ab0c/brozzler/cli.py#L363) doesn't work, usually at shutdown. To debug this issue I'm running brozzler with python configured with --pydebug (using the python3-dbg package on ubuntu). I waited for the problem to happen, then ran sudo gdb -p 13663 -batch -ex 'thread apply all py-bt' -ex quit. I see the SIGQUIT handler blocked trying to acquire a lock inside of logging. Many other threads are also stuck waiting for the same lock.

Thread 1 (Thread 0x7f9af26b3700 (LWP 13663)):
Traceback (most recent call first):
  <built-in method acquire of _thread.RLock object at remote 0x7f9aeed3ed00>
  File "/usr/lib/python3.5/logging/__init__.py", line 804, in acquire
    self.lock.acquire()
  File "/usr/lib/python3.5/logging/__init__.py", line 853, in handle
    self.acquire()
  File "/usr/lib/python3.5/logging/__init__.py", line 1487, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 1425, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 1415, in _log
    self.handle(record)
  File "/usr/lib/python3.5/logging/__init__.py", line 1279, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.5/logging/__init__.py", line 1838, in info
    root.info(msg, *args, **kwargs)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/cli.py", line 346, in dump_state
    signum, '\n'.join(state_strs)))
  <built-in method acquire of _thread.lock object at remote 0x7f9aec0786c8>
  File "/usr/lib/python3.5/threading.py", line 1070, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/usr/lib/python3.5/threading.py", line 1054, in join
    self._wait_for_tstate_lock()
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/cli.py", line 369, in brozzler_worker
    th.join()
  File "/opt/brozzler-ve3/bin/brozzler-worker", line 11, in <module>
    load_entry_point('brozzler==1.1b13.dev284', 'console_scripts', 'brozzler-worker')()

And yet, no threads seem to be in the midst of logging while owning that lock. On reflection it seems to me that it could be the case that some thread owned the lock, had an exception raised by thread_raise, and did not release the lock.

The code in logging looks pretty disciplined:

    def handle(self, record):
        rv = self.filter(record)
        if rv:
            self.acquire()
            try:
                self.emit(record)
            finally:
                self.release()
        return rv
    def acquire(self):
        if self.lock:
            self.lock.acquire()

But it seems possible the exception was raised between self.lock.acquire() and try:.

One solution would be to avoid any logging (or any locking of any kind) inside of a with brozzler.thread_accept_exceptions() block. But a lot happens in those blocks that we want to log.

Maybe monkey-patching that logging handle() method to do self.acquire() inside the try block (and make sure self.release() proceeds silently in case the lock is not held) would fix this. That may be the simplest thing to try. (It could introduce another bug though in case the RLock is already held, if acquire() does not complete, release() could release the lock when it shouldn't, since it decrements a counter. But we're in logging here, so it seems like the worst that could happen would be some overlapping log messages...)

Even if that works, ensuring that no locking happens inside a with brozzler.thread_accept_exceptions() block is a heavy burden to bear. Sooner or later it may be necessary to rewrite brozzler to use multiprocessing instead of multithreading.

Brozzler scrapes only single page.

Brozzler doesn't crawl, extract, and follow all the links on the website; it only scrapes the main page (the url from the job.yaml configuration file).

I guess something's wrong with my scope surt parameter? But scope rules are not yet documented and I don't know where else I can find the information about it, so I've decided to ask here.

I've tried the following scope configurations:

  1. I've made surt rules with this module: https://github.com/internetarchive/surt

# job.yaml file

# other job parameters
seeds:
  - url: http://iskme.org
    scope:
      surt: org,iskme)/

2. SURT rules according to this documentation: https://webarchive.jira.com/wiki/display/ARIH/SURT+Rules

# job.yaml file

# other job parameters
seeds:
  - url: http://iskme.org
    scope:
      surt: +http://(org,iskme,

But it still scrapes only the one main page, iskme.org. How do I configure brozzler scope rules to crawl the whole website, following all the links for the whole domain?

ModuleNotFoundError: No module named 'pywb.cdx'

I am trying to install brozzler on macOS 10.13.2. After running pip3 install brozzler[easy], I tried to run brozzler-easy, but got the following error:

CRITICAL:root:ModuleNotFoundError: No module named 'pywb.cdx'

You might need to run "pip install brozzler[easy]".
See README.rst for more information.

I checked and pywb is installed. However, it looks like the pywb package removed the pywb.cdx module at some point. Do I need to install an older version of pywb?

Starting and Stopping

How do you stop jobs in brozzler?
How do you stop jobless sites in brozzler?
How do you assign specific workers to specific jobs?

Do you have any tutorials for Ubuntu?

I tried 3 times with a virtual machine and 2 times with just a PC with Ubuntu reinstalled, but all attempts failed...
I did a Google search and didn't find any tutorials...

How to add behaviors?

Does brozzler have support for adding new, or customizing existing behaviors?

From what I understood, this requires both a yaml file matching urls to behaviors, and the actual behaviors in js files.

If there's no support currently, how about adding one or more flags allowing the user to specify additional yaml and js files, or directories containing them? Where would be a good place to implement this, so it could become an official feature of brozzler?

--single-process chrome arg

When trying to run brozzler, this line appears in the logs and chromium crashes:

 DEBUG ChromeOutReaderThread:37385 brozzler.chrome.Chrome._read_stderr_stdout(chrome.py:273) chrome pid 21558 STDERR b'[21558:21558:1016/150743.534943:ERROR:default_network_context_params.cc(64)] Cannot use V8 Proxy resolver in single process mode.\n'

Removing the hard-coded --single-process parameter in brozzler/chrome.py seems to fix the issue. Is this flag necessary for brozzler to work?

Here's some more debug info.

chromium-browser --version
Chromium 69.0.3497.81 Built on Ubuntu , running on Ubuntu 18.04

How brozzler started chromium:

INFO BrozzlingThread:37385 brozzler.chrome.Chrome.start(chrome.py:180) running: 'chromium-browser --remote-debugging-port=37385 --use-mock-keychain --user-data-dir=/tmp/tmp5qcy14wv/chrome-user-data --disable-background-networking --disable-renderer-backgrounding --disable-hang-monitor --disable-background-timer-throttling --mute-audio --disable-web-sockets --disable-cache --single-process --window-size=1100,900 --no-default-browser-check --disable-first-run-ui --no-first-run --homepage=about:blank --disable-direct-npapi-requests --disable-web-security --disable-notifications --disable-extensions --disable-save-password-bubble --ignore-certificate-errors --proxy-server=localhost:8888 about:blank'

With a warcprox instance running at localhost:8888

AttributeError: 'Namespace' object has no attribute 'rethinkdb_dedup_url'

Trying to use the brozzler easy setup on Ubuntu 16.04.
Using Anaconda with a fresh environment:

conda create --name Brozzler python=3.5
source activate Brozzler 
pip install brozzler[easy]

Running:

brozzler-easy

produces the error:

Traceback (most recent call last):
  File "/home/thomaspr/anaconda3/envs/Brozzler/bin/brozzler-easy", line 11, in <module>
    sys.exit(main())
  File "/home/thomaspr/anaconda3/envs/Brozzler/lib/python3.5/site-packages/brozzler/easy.py", line 274, in main
    controller = BrozzlerEasyController(args)
  File "/home/thomaspr/anaconda3/envs/Brozzler/lib/python3.5/site-packages/brozzler/easy.py", line 126, in __init__
    self._warcprox_args(args))
  File "/home/thomaspr/anaconda3/envs/Brozzler/lib/python3.5/site-packages/warcprox/main.py", line 213, in init_controller
    if args.rethinkdb_dedup_url:
AttributeError: 'Namespace' object has no attribute 'rethinkdb_dedup_url'

JavaScript files harvested as partial content (HTTP 206) break playback

I've been testing Brozzler locally using the brozzler-easy option. I have generated a comprehensive list of URLs to visit for a Scalar publication I'm working on (i.e. 0 hops for each seed). The resulting WARC files have a large number of HTTP 206 partial responses for a portion of the JavaScript files, though each JS file has at least one 200 response. The result is, on playback in PyWb some pages load the 206 Partial Content response, others will load 200 OK. If the 206 response is loaded by PyWb, then a blank page is shown and the console has JS errors. I can fix it by removing the 206 rows from the .cdxj index file so it falls back to the 200 copy, then every page loads fine.

I noticed that some JS files don't seem to have this problem. It looks like it's only the ones where the <script> tag declaring the file does not include the type="text/javascript" attribute, which should be optional. That may be a coincidence, but I tried 2 completely different Scalar sites and they did the same thing. I'm running Brozzler on a Mac with Google Chrome. I'm suspecting a possible Chrome behavior that has a negative impact on the WARC, but I'm not sure whether Brozzler, warcprox, pywb, or somewhere else is the best place to handle it. Does this seem like a brozzler issue?

If needed, I can supply a test configuration file for replicating the problem, but wanted to check I'm in the right place and that it's not a known issue or result of incorrect configuration. Thanks!

Evaluation of brozzler's scalability?

I am curious whether there is any data reporting how well brozzler scales as the number of parallel browsers increases.
In my current (very limited) test bed, brozzler takes extremely long to crawl web pages and store the corresponding resources.

Attaching some results from attempting to crawl 20 random web pages with brozzler using the headless Chrome browser.
(figure: scalability results)

I also track all the system resource usage (CPU, network, disk). I am currently running this experiment on a 32-core Linux server with a 1 Gbps NIC, storing data on an underlying HDD with r/w throughput of 150-200 MBps.
(figures: network, disk, and CPU usage)

As you can see, none of the resources is being saturated, and yet brozzler takes on average ~40-50s to crawl and store a single page. Furthermore, the low CPU usage is extremely concerning, since in my experience increasing the number of parallel browsers linearly increases the overall CPU usage of the system. Could this be due to the proxy server used by brozzler?

Also, when I crawl the same corpus of pages using an extremely lightweight custom nodejs crawler (written on top of puppeteer), it can do so about 10x faster than the timings observed above.

default WAYBACK_BASEURL may be incorrect

I installed brozzler via pip and launched it with brozzler-easy in a Debian Jessie VM and was able to scrape a site. (Brozzler 1.1b8, pywb 0.33.0, python3.4)

However, the default page links in the dashboard on the job detail page were pointing to http://localhost:8091/brozzler/. As far as I can tell, nothing started by default was listening on port 8091.

After some investigation I found there was something listening on port 8880 that looked like a wayback process, so I tried launching brozzler-easy like this:

WAYBACK_BASEURL=http://192.168.122.152:8880/brozzler brozzler-easy -d warc/ --dashboard-address 0.0.0.0

(the ip addresses were so I could use my regular browser instead of the VM browser to use the site)

Doing that allowed the wayback links to work, but the thumbnail & screenshot urls are still 404ing.

how do I add a cookies.txt file?

I'm trying to crawl a site that requires Google authentication. Brozzler does not seem to offer an option to add a cookies.txt; or did I miss something?
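I don't see a documented cookies.txt option either, so treat the following as an assumption about how such a feature could work. The standard library can already parse the Netscape cookies.txt format that browser extensions export; the parsed cookies would then need to be injected into the browser, e.g. over the DevTools protocol. A sketch of the parsing half:

```python
import http.cookiejar
import os
import tempfile

# Minimal Netscape-format cookies.txt, as exported by common browser extensions.
cookies_txt = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tFALSE\t2145916800\tsession\tabc123\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(cookies_txt)
    path = f.name

jar = http.cookiejar.MozillaCookieJar(path)
jar.load()
os.unlink(path)

for cookie in jar:
    print(cookie.domain, cookie.name, cookie.value)  # .example.com session abc123
```

Until something like this exists, logging in through a brozzler browser profile or the documented username/password site settings may be the closest available workaround.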

Using WARCs with standard PyWb

I'm trying to take the WARCs saved by warcprox, index them, and play them back with standard pywb, but I'm getting some issues with some of the records inside the WARC. For example:

WARC/1.0^M
WARC-Type: response^M
WARC-Record-ID: <urn:uuid:c603136b-0bb2-4b94-8e49-9e7113b88e15>^M
WARC-Date: 2018-01-23T16:24:36Z^M
WARC-Target-URI: https://www.bbc.co.uk/^M
WARC-IP-Address: 212.58.244.67^M
Content-Type: application/http;msgtype=response^M
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ^M
Content-Length: 377^M
WARC-Block-Digest: sha1:UXTY2PXPGPSY6C3L6QLFUDNGWMS5CU7N^M
^M
HTTP/1.1 200 OK^M
Content-Type: text/html; charset=utf-8^M
ETag: W/"434a6-QjTzOogzAHForrr8966RtzKEhqg"^M
X-Frame-Options: SAMEORIGIN^M
Content-Length: 275622^M
Date: Tue, 23 Jan 2018 16:24:35 GMT^M
Connection: keep-alive^M
X-Cache-Action: HIT^M
X-Cache-Hits: 1580^M
X-Cache-Age: 94^M
Cache-Control: private, max-age=0, must-revalidate^M
Vary: Accept-Encoding, X-CDN, X-BBC-Edge-Scheme^M

is one of the records in the WARC. As you can see, it doesn't have a payload other than the headers. Later in the WARC there is a record for the same URL that does have a payload, but pywb finds this record first, so it shows a blank page.
I don't really understand what is happening with the browser cache/server cache and why this request returns a 200 without a payload. Any ideas? (Or is this more of a warcprox question?)
(I also notice that you give Chrome the flag --disable-cache in the code, but it's not in the list at https://peter.sh/experiments/chromium-command-line-switches/, so I'm wondering if the flag has changed and might be related to this issue?)
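A quick check backs up the empty-payload reading: the WARC-Payload-Digest on the record above is exactly the SHA-1 of zero bytes, base-32 encoded:

```python
import base64
import hashlib

# WARC-Payload-Digest of a record with no payload at all:
empty_digest = base64.b32encode(hashlib.sha1(b"").digest()).decode()
print("sha1:" + empty_digest)  # sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

So the record really was written with headers only (the 377-byte block is just the HTTP headers), and the 275622-byte body promised by Content-Length never made it into the record.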

Can't start a worker

When trying to start a worker I get this error
File "/home/thore/.local/bin/brozzler-worker", line 7, in <module>
  from brozzler.cli import brozzler_worker
File "/home/thore/.local/lib/python2.7/site-packages/brozzler/__init__.py", line 70, in <module>
  logging._levelToName[TRACE] = 'TRACE'
AttributeError: 'module' object has no attribute '_levelToName'

Invalid flag in Chrome

In Chrome 52.0.2743.116 for OS X, I see a dropdown in the browser's viewport stating that a command-line flag passed to it from brozzler was invalid. This occurs after executing brozzler-easy.
(screenshot from 2016-08-11 attached)

Installed brozzler using pip3 install brozzler[easy].

how does worker pick a site after crash?

Scenario: I have warcprox and a brozzler worker running on my local machine. In the middle of archiving a website, the brozzler worker process is killed, either via 'kill -9 <process_id>' or by closing the console session.
After both warcprox and brozzler worker instances are restarted (on the same ports as before), the site will not be picked up for crawling. This is because the claimed property in db('Brozzler').table('sites') is true.

Query:

  • Is there a configuration property that can be set so that the site can be picked up by a brozzler worker even if claimed=true?

Screenshots are completely black

I am running brozzler on Ubuntu 17.10. It's working quite well, except that all screenshots shown in the brozzler dashboard web UI are completely black. Is this a known issue? I am using Chromium Version 65.0.3325.181 (Official Build) Built on Ubuntu , running on Ubuntu 17.10 (64-bit)

`pip3 install brozzler[easy]` fails due to `warcprox>=2.4b2.dev173` requirement

When I run pip3 install brozzler[easy] I get the following error:

  Could not find a version that satisfies the requirement warcprox>=2.4b2.dev173 (from brozzler[easy]) (from versions: 1.0, 1.1, 1.2, 1.3, 1.4, 2.0.dev9, 2.0b1, 2.0b2, 2.0, 2.0.1, 2.1b1.dev60, 2.1b1.dev68, 2.1b1.dev71, 2.1b1.dev86, 2.1b1.dev87, 2.2, 2.3, 2.4b1)
No matching distribution found for warcprox>=2.4b2.dev173 (from brozzler[easy])

I checked on PyPI, and it looks like warcprox 2.4b1 is the most recent release published.

(FYI, I'm running Ubuntu 18.04, and Python 3.6.5, not that this should matter for this error.)

Brozzler in Docker with Rethinker

2017-06-13 06:02:03,368 5 ERROR MainThread rethinkstuff.Rethinker._random_server_connection(init.py:97) will keep trying to get a connection after failure connecting to localhost: Could not connect to localhost:28015. Error: [Errno 99] Cannot assign requested address

RethinkDB has been started with the default configuration. Any suggestions?

Performance Suggestions?

Hello,
I've been using brozzler-easy for testing, and brozzler looks to be working wonderfully. I have a very large website I am trying to archive, and there are a few things I can't figure out from job-conf.rst.

I'm running a local version of the website on my local machine, so the site is not running from its public domain. Is there a way to get brozzler to replace my localhost domain with the actual public domain?

Another question I have: is there any way to boost performance, possibly by configuring it to use more threads? Currently when I set up a brozzler job and monitor it in Brozzler Dashboard, it shows two sites being actively crawled. Is that an example of brozzler running two threads to crawl the site?

Maybe there's a writeup somewhere explaining optimal ways to use brozzler on a local machine?

I'd greatly appreciate any insights. Sorry to post this here; I'm not sure how else to get in touch with people on this project.

Thank you.

Brozzler-easy issue after start

Hi,

I successfully installed brozzler-easy via brew. After entering the command brozzler-easy in zsh the following error is being thrown:

2023-11-20 14:49:48,196 17658 ERROR MainThread root.init_app(wsgi_wrappers.py:169) *** pywb app init FAILED config from "create_wb_router"!
Traceback (most recent call last):
  File "/Library/Python/3.8/site-packages/pywb/framework/wsgi_wrappers.py", line 166, in init_app
    wb_router = init_func(config)
  File "/Library/Python/3.8/site-packages/pywb/webapp/pywb_init.py", line 256, in create_wb_router
    defaults = load_yaml_config(DEFAULT_CONFIG)
  File "/Library/Python/3.8/site-packages/pywb/utils/loaders.py", line 49, in load_yaml_config
    config = yaml.load(configdata)
TypeError: load() missing 1 required positional argument: 'Loader'
Traceback (most recent call last):
  File "/usr/local/bin/brozzler-easy", line 10, in <module>
    sys.exit(main())
  File "/Library/Python/3.8/site-packages/brozzler/easy.py", line 273, in main
    controller = BrozzlerEasyController(args)
  File "/Library/Python/3.8/site-packages/brozzler/easy.py", line 128, in __init__
    self.pywb_httpd = self._init_pywb(args)
  File "/Library/Python/3.8/site-packages/brozzler/easy.py", line 176, in _init_pywb
    wsgi_app = pywb.framework.wsgi_wrappers.init_app(
  File "/Library/Python/3.8/site-packages/pywb/framework/wsgi_wrappers.py", line 166, in init_app
    wb_router = init_func(config)
  File "/Library/Python/3.8/site-packages/pywb/webapp/pywb_init.py", line 256, in create_wb_router
    defaults = load_yaml_config(DEFAULT_CONFIG)
  File "/Library/Python/3.8/site-packages/pywb/utils/loaders.py", line 49, in load_yaml_config
    config = yaml.load(configdata)
TypeError: load() missing 1 required positional argument: 'Loader'

bash-3.2$ brozzler-easy
/Library/Python/3.8/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
2023-11-20 14:50:34,352 17668 INFO MainThread root.stats_processor(controller.py:65) statistics tracking disabled
2023-11-20 14:50:34,353 17668 INFO MainThread warcprox.warcproxy.WarcProxy.__init__(mitmproxy.py:617) 100 proxy threads
2023-11-20 14:50:34,358 17668 NOTICE MainThread warcprox.warcproxy.WarcProxy.server_activate(warcproxy.py:493) listening on 127.0.0.1:56467
2023-11-20 14:50:34,376 17668 ERROR MainThread root.init_app(wsgi_wrappers.py:169) *** pywb app init FAILED config from "create_wb_router"!
[... same two tracebacks as above, ending with ...]
TypeError: load() missing 1 required positional argument: 'Loader'
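The TypeError comes from PyYAML: version 6 made the Loader argument to yaml.load() mandatory, and the pywb code shown in the traceback calls yaml.load(configdata) with no loader. A minimal sketch of the incompatibility, assuming PyYAML 6+ is installed:

```python
import yaml

configdata = "collections:\n  live: $liveweb\n"

# Fails on PyYAML >= 6, exactly as in the traceback above:
#   yaml.load(configdata)  ->  TypeError: load() missing 1 required
#                              positional argument: 'Loader'

# Works on old and new PyYAML alike:
config = yaml.load(configdata, Loader=yaml.SafeLoader)
assert config == yaml.safe_load(configdata)
```

Until a pywb release carrying the fix is installed, pinning PyYAML below 6 (pip install 'PyYAML<6') is a plausible workaround; I haven't verified which pywb version resolves this.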

Can someone help me please?

Best regards,

Steve

Does brozzler pass cookies to youtube-dl?

I'm worried that not all the videos are being downloaded, because the pages require authentication and youtube-dl can't download from them without the cookies (the browser is logged in, but I'm not sure about youtube-dl). Thanks!
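For what it's worth, youtube-dl itself can consume a browser-style cookie jar through its cookiefile option, so in principle the cookies could be bridged even if brozzler doesn't do it for you. A sketch under that assumption (the file path is made up for illustration, and the actual download call is left commented out since it needs youtube-dl installed and a live page):

```python
# youtube-dl (and yt-dlp) accept a Netscape-format cookie file via the
# "cookiefile" option. One could dump the logged-in browser's cookies
# (e.g. collected over the DevTools protocol) into such a file first.
ydl_opts = {
    "cookiefile": "/tmp/brozzler-browser-cookies.txt",  # hypothetical path
    "quiet": True,
}

# with youtube_dl.YoutubeDL(ydl_opts) as ydl:
#     ydl.download([page_url])
```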

brozzler-stop command would be nice

Hi,

I was trying to crawl https://www.ncdc.noaa.gov/sotc and for some reason it never finished; I'm not sure why. I noticed a number of "revisit" tags in the brozzler-easy log messages and suspect it was just cycling over the same pages over and over. (I think I need to understand how to read WARC files before I can confirm that, though.)

I found some code that refers to stop_requested in brozzler/job.py, but nothing ever seemed to set that variable to True.

I stopped the crawl by using the rethinkdb console and changing "state" to "FINISHED". I could probably figure out how to write a command that, given a job id, updates the variable, if that seems like a reasonable solution.
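A sketch of what such a command could write, assuming the stop_requested field in the job row is what the check in brozzler/job.py looks at (the actual database call is commented out because it needs a live RethinkDB):

```python
import datetime

def build_stop_update():
    """Update document a hypothetical brozzler-stop command could apply
    to the job's row in the jobs table."""
    return {"stop_requested": datetime.datetime.now(datetime.timezone.utc)}

# With the rethinkdb driver (not run here):
# r.db("brozzler").table("jobs").get(job_id).update(build_stop_update()).run(conn)
```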

"status_info is missing required field ttl" when running brozzler-worker

Installing brozzler and running it according to the instructions does not work. The problem seems to be in the doublethink module, where the following error occurs when running brozzler-worker:

2017-06-08 22:17:54,993 8363 CRITICAL MainThread brozzler.worker.BrozzlerWorker.run(worker.py:504) thread exiting due to unexpected exception
Traceback (most recent call last):
  File "/venv/lib/python3.6/site-packages/brozzler/worker.py", line 446, in _service_heartbeat
    self.status_info = self._service_registry.heartbeat(status_info)
  File "/venv/lib/python3.6/site-packages/doublethink/services.py", line 142, in heartbeat
    repr(field))
Exception: ('status_info is missing required field %s', "'ttl'")

The ttl field is indeed missing as the status_info dict looks like this:

{'browser_pool_size': 1,
 'browsers_in_use': 0,
 'heartbeat_interval': 20.0,
 'load': 0.0,
 'role': 'brozzler-worker'}

The check for ttl in status_info seems to have been added in this version of doublethink:
internetarchive/doublethink@a1c5a08

Downgrading to doublethink 0.2.0.dev73 seems to solve the problem.
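An alternative to downgrading would be to include the field the newer doublethink demands. A sketch of a status_info dict that would pass the check; the choice of three heartbeat intervals for the ttl is a guess at a sensible value, not something I found documented:

```python
heartbeat_interval = 20.0
status_info = {
    "role": "brozzler-worker",
    "load": 0.0,
    "browser_pool_size": 1,
    "browsers_in_use": 0,
    "heartbeat_interval": heartbeat_interval,
    # newer doublethink refuses to heartbeat without a ttl field
    "ttl": 3 * heartbeat_interval,
}
```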

brozzle-page Not Working With Recent Version of Google Chrome

When I used the brozzle-page command with a recent version of Google Chrome, I noticed that brozzler does not load the web page that should be archived.
brozzle_page_not_loading_web_page_during_crawl_session

This results in the web page not being archived successfully.

WARC file: WARCPROX-20230519163909687-00000-0so5t1md.warc
brozzle_page_failed_to_archive_web_page

This issue also occurred when trying to archive other web pages.

The commands I used are listed below (video example):

warcprox -p 8081 -d ./warcs/IGN/brozzle_page/2023_05_19 --dedup-db-file /dev/null

export BROZZLER_EXTRA_CHROME_ARGS="--ignore-certificate-errors"

brozzle-page --chrome-exe '/usr/bin/google-chrome' --proxy localhost:8081 'https://www.ign.com/articles/the-last-of-us-season-1-review'

A "WebSocketBadStatusException: Handshake status 403 Forbidden" occurred when recently running these commands on Ubuntu (22.04.2 LTS and 20.04.6 LTS) and macOS (Ventura 13.3.1).
WebSocketBadStatusException

When I used these commands earlier this year it was working successfully (video):
Brozzel_Page_Working_Earlier_This_Year

After noticing this issue, I went through the recent stable versions of Google Chrome and found the last stable version that worked with the brozzle-page command was version 109.0.5414.119 which was released on January 24, 2023.

chrome.deb URI (109.0.5414.119): https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_109.0.5414.119-1_amd64.deb

Crawling session: https://youtu.be/A-zr6zVTZSo?t=5569
brozzler_works_with_version_109 0 5414 119_crawling_session

Replay session: https://youtu.be/A-zr6zVTZSo?t=6345
brozzler_works_with_version_109 0 5414 119_replay_session

The first stable version of Chrome that did not work with the brozzle-page command is version 111.0.5563.110 which was released on March 21, 2023.

chrome.deb URI (111.0.5563.110): https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_111.0.5563.110-1_amd64.deb

Crawling session: https://youtu.be/A-zr6zVTZSo?t=4903
brozzler_fails_with_version_111 0 5563 110_crawling_session

Replay session: https://youtu.be/A-zr6zVTZSo?t=4992
brozzler_fails_with_version_111 0 5563 110_replay_session

Chrome release blog post for 111.0.5563.110: https://chromereleases.googleblog.com/2023/03/stable-channel-update-for-desktop_21.html
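The 403 on the DevTools WebSocket handshake matches a behavior change in Chrome 111, which started rejecting connections from origins that are not allow-listed. Chrome added a --remote-allow-origins switch for this; passing it through the environment variable brozzler already reads may restore the connection, though I haven't verified this against every brozzler version:

```python
import os

# --remote-allow-origins=* tells Chrome 111+ to accept DevTools WebSocket
# handshakes regardless of the Origin header, which is what the 403 above
# is rejecting. brozzler appends BROZZLER_EXTRA_CHROME_ARGS to the Chrome
# command line it launches.
os.environ["BROZZLER_EXTRA_CHROME_ARGS"] = (
    "--ignore-certificate-errors --remote-allow-origins=*"
)
```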

Images on Instagram and Twitter captures not shown in pywb

Hi, I'm not sure if I'm adding this issue to the correct repository, but it's about crawls I created with brozzler.

If I view Instagram and Twitter captures in pywb, I notice that the images are not shown (see screenshots). However, the images are present in the WARC file, because I can export them from it.

Screenshot 2020-11-17 at 15 22 13

Crawls are made with brozzler. I have the same issue when opening the WARC files with Webrecorder Player and Replayweb.page. In those applications, I only see the Instagram logo.

Screenshot 2020-11-17 at 15 24 45

Could it be related to #198?

Instagram WARC file: brozzler-20201117134317487-b7gpz5v6-00000.warc.gz

Edit: I crawled Instagram with Browsertrix in the meantime and have no issue replaying it, so maybe it's a brozzler issue.

In Logins, Check remember me box

Sometimes the Chromium browser crashes (or locks up so that I have to close and reload it). That's not what I'm reporting - those are probably not actually bugs (lol).

But when that does happen, it's annoying to have to reauth for all pages - and actually could result in an incomplete crawl.

My proposal: add a bit more logic around the username/password fields to check for "remember me" boxes.

Videos on Twitter captures

Hi,

Before I describe the issue, I will preface by saying I am very new to brozzler and similar tools in general, so perhaps my question is a little bit simplistic.
Anyway, I was wondering if you have any pointers as to why videos on some hashtag feeds I captured do not seem to play when I view them in pywb. Is there any configuration change I could make to solve this, for Twitter or for other social media platforms/websites?
Thank you!
