brozzler - distributed browser-based web crawler
License: Apache License 2.0
Many sites add video dynamically with JavaScript (such as what turned out to be the case with issue #104).
youtube-dl does not execute JavaScript. Instead, where video is added dynamically, youtube-dl relies on custom extractor code to get the video URLs for a page.
That works great when someone has already written an extractor, but in many cases no such extractor exists.
If, instead of asking youtube-dl to download from the original URL, the rendered DOM were passed to youtube-dl, then in cases where JavaScript execution has placed <video> and <source> tags in the DOM, youtube-dl should be able to find the video.
In other situations, passing the rendered DOM might instead cause youtube-dl to fail. In that case, perhaps brozzler-new-site could take an option controlling whether to pass the rendered DOM or have youtube-dl download the page from the original URL.
Let me know what you think. I know that feature requests in general are tall orders but I believe this would be useful and I hope you will consider it.
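As a rough illustration of the idea (this is not brozzler's or youtube-dl's actual code; the class name and sample URLs are made up), scanning a rendered DOM for dynamically inserted video tags could look like this:

```python
# Illustrative sketch: once the browser has executed the page's JavaScript,
# the rendered DOM can be scanned for <video> and <source> tags whose src
# attributes point at media files.
from html.parser import HTMLParser

class VideoSourceFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.video_urls = []

    def handle_starttag(self, tag, attrs):
        # collect src attributes from <video> and <source> elements
        if tag in ("video", "source"):
            src = dict(attrs).get("src")
            if src:
                self.video_urls.append(src)

rendered_dom = """
<html><body>
  <video src="https://cdn.example.com/clip.mp4"></video>
  <video><source src="https://cdn.example.com/clip.webm"></video>
</body></html>
"""
finder = VideoSourceFinder()
finder.feed(rendered_dom)
print(finder.video_urls)
# ['https://cdn.example.com/clip.mp4', 'https://cdn.example.com/clip.webm']
```

The URLs collected this way could then be handed to youtube-dl directly, instead of relying on a per-site extractor.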
I know that the WARC specification does allow base16 (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-payload-digest). But since most other WARC tools produce base32-encoded SHA-1 digests, it would be beneficial for the community to use the same convention.
The following tools all produce base32:
Heritrix
wget
Webrecorder
It was discovered when the warc-indexer in the webarchive-discovery project (https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer) reported payload errors while indexing warc files generated with Brozzler.
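For reference, here is a small sketch of how the base32 SHA-1 payload digest produced by the tools listed above is computed, following the WARC spec's labelled-digest format:

```python
# Compute a WARC-Payload-Digest value in the common "sha1:<base32>" form.
import base64
import hashlib

def warc_payload_digest(payload: bytes) -> str:
    sha1 = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(sha1).decode("ascii")

# An empty payload yields the well-known empty-string digest:
print(warc_payload_digest(b""))  # sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

Note that the empty-payload digest above is exactly the value that appears in response records with header-only payloads, as in one of the WARC records quoted later in this thread.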
Hello,
I am trying to install brozzler[easy] (version 1.5.18) on Linux (CentOS with Python 3.6.8 and pip 21.1.2). It seems there are some package conflicts when using pip.
I also installed setuptools 57.0.0.
When I check the requirements in the setup.cfg file from brozzler-1.5.18.tar.gz, we have:
Has anyone recently installed brozzler[easy] successfully? Do we have to adapt the choice of some package versions, or force specific versions?
Or is it a misunderstanding on my part about the installation?
Thanks for the help.
I managed to catch the exception.
It's on Debian, using Chromium version 55.0.2883.75-1~deb8u1.
2017-02-04 22:36:24,671 1736 INFO WarcWriterThread(tid=1743) warcprox.writer.WarcWriter.close_writer(writer.py:69) closing brozzler-20170205063139292-00094-vs75mxwt.warc.gz
2017-02-04 22:37:09,419 1736 CRITICAL BrozzlingThread:https://www.ncdc.noaa.gov/sotc brozzler.worker.BrozzlerWorker._brozzle_site(worker.py:360) unexpected exception
Traceback (most recent call last):
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/worker.py", line 341, in _brozzle_site
outlinks = self.brozzle_page(browser, site, page)
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/worker.py", line 285, in brozzle_page
on_screenshot=_on_screenshot)
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/browser.py", line 430, in browse_page
user_agent=user_agent, timeout=300)
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/browser.py", line 480, in navigate_to_page
timeout=timeout)
File "/home/diane/.virtualenvs/brozzler/lib/python3.4/site-packages/brozzler-1.1b9.dev175-py3.4.egg/brozzler/browser.py", line 291, in _wait_for
elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 300.1s waiting for: <function Browser.navigate_to_page.. at 0x7fa2ea0a8d08>
2017-02-04 22:37:09,426 1736 INFO BrozzlingThread:https://www.ncdc.noaa.gov/sotc brozzler.browser.Browser.stop(browser.py:344) shutting down websocket connection
2017-02-04 22:37:09,443 1736 INFO BrozzlingThread:https://www.ncdc.noaa.gov/sotc brozzler.chrome.Chrome.stop(chrome.py:267) terminating chrome pgid 13222
2017-02-04 22:37:09,974 1736 INFO BrozzlingThread:https://www.ncdc.noaa.gov/sotc
RethinkDB seems to be essentially dead in the water. There have been some attempts to get the project back on track, but right now it sadly does not appear to be maintained.
Hello!
job-conf.rst lacks info about scope rules, so I did my best to define them in my job.yml:
id: myjob
time_limit: 60 # seconds
scope:
  surt: http://(com,site,www,)/wanted-path
seeds:
- url: https://www.site.com
If I understand correctly, brozzler should find all links, e.g. https://www.site.com/wanted-path/extra, and crawl only links with wanted-path in them. Is this correct?
The problem I have is that brozzler crawls all links, e.g. https://www.site.com/other-path, but I don't want that. Is the problem with my config, or does brozzler ignore the path of the provided SURT?
Hi, I got this error while installing brozzler-easy on Python 3.8.5:
ERROR: pywb 0.33.2 has requirement jinja2<2.9, but you'll have jinja2 3.0.1 which is incompatible.
I've spent the last couple of days trying to get brozzler[easy] to work in Cygwin and Linux with no success. Brozzler 1.5.18 installs, but brozzler[easy] does not, due to various dependency conflicts. I've tried with Python 3.8 on a fresh Ubuntu 20 image, and Python versions 3.8, 3.7, and 3.5 under Cygwin, with similar results. I've also tried older versions of brozzler, e.g. 1.4, which also seem to have dependency issues. It would be great to have a recipe or detailed documentation for getting brozzler to work on these platforms, as I'd really like to test it.
Any advice?
I attempted to use this software per the Installation and Usage instructions in the project README and am unable to get far. I first install rethinkdb:
brew update && brew install rethinkdb
then install brozzler per the README instructions:
pip install brozzler
I attempt the first command in the Usage section of the README but receive an error:
$ brozzler-worker -e chromium
Traceback (most recent call last):
File "/usr/local/bin/brozzler-worker", line 4, in <module>
__import__('pkg_resources').run_script('brozzler==1.1.dev1', 'brozzler-worker')
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 719, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1504, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/EGG-INFO/scripts/brozzler-worker", line 8, in <module>
import brozzler
File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/__init__.py", line 48, in <module>
from brozzler.site import Page, Site
File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/site.py", line 12, in <module>
class Site(brozzler.BaseDictable):
File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/site.py", line 13, in Site
logger = logging.getLogger(__module__ + "." + __qualname__)
NameError: name '__qualname__' is not defined
Figuring this might not be a fatal error, I also create a sample job based on the Job Configuration section of the README:
$ echo -e "id: test\nseeds:\n - url: http://example.com" > testJob.yaml
$ brozzler-new-job testJob.yaml
Traceback (most recent call last):
File "/usr/local/bin/brozzler-new-job", line 4, in <module>
__import__('pkg_resources').run_script('brozzler==1.1.dev1', 'brozzler-new-job')
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 719, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1504, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/EGG-INFO/scripts/brozzler-new-job", line 7, in <module>
import brozzler
File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/__init__.py", line 48, in <module>
from brozzler.site import Page, Site
File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/site.py", line 12, in <module>
class Site(brozzler.BaseDictable):
File "/usr/local/lib/python2.7/site-packages/brozzler-1.1.dev1-py2.7.egg/brozzler/site.py", line 13, in Site
logger = logging.getLogger(__module__ + "." + __qualname__)
NameError: name '__qualname__' is not defined
How do I get started with brozzler?
When trying to capture this page https://jornadas.fccn.pt/agenda/, brozzler hangs trying to crawl the resources in the bottom (ex: https://jornadas.fccn.pt/wp-content/uploads/2019/08/NAU_Rui_Ribeiro_8min.pptx).
The behavior occurs only with the --skip-youtube-dl option!
Hi,
I tried to launch the dashboard on one machine and to connect to and display the dashboard from another machine using its IP address.
I have this in my server :
2018-10-04 16:01:27,083 48494 INFO MainThread root.run(__init__.py:265) running brozzler-dashboard using gunicorn
2018-10-04 16:01:27,087 48494 INFO MainThread gunicorn.error.info(glogging.py:271) Starting gunicorn 19.8.1
2018-10-04 16:01:27,088 48494 INFO MainThread gunicorn.error.info(glogging.py:271) Listening at: http://127.0.0.1:8000 (48494)
2018-10-04 16:01:27,088 48494 INFO MainThread gunicorn.error.info(glogging.py:271) Using worker: sync
2018-10-04 16:01:27,091 48501 INFO MainThread gunicorn.error.info(glogging.py:271) Booting worker with pid: 48501
and when I try http://ip:8000/ on the other machine, I get ERR_CONNECTION_REFUSED.
Is it possible to do this in brozzler: launch the dashboard on one machine and connect to it from another one?
Thanks
I don't think this is a brozzler issue; I guess it's Facebook that has changed things. But if I try to capture a Facebook page I always get the following error:
Your request couldn't be processed. There was a problem with this request. We're working on getting it fixed as soon as we can
Username and password are correct. I have the same issue when trying to create a profile with Browsertrix (but there I can fix it by logging into the mobile version and going back to the web version).
Hi brozzler-team,
I want to export the database entries belonging to a specific warc file from the tables jobs, sites, and pages.
I know how to connect those tables to each other, but I couldn't find a connection to the captures table or directly to the corresponding warc file.
Does it work via the WARC-Date in the warcinfo record of the warc file and last_claimed in the sites table?
A hint would be great. Thanks.
It may be a newbie question, but the README is missing a rationale for using a browser for scraping & archiving.
Question 1) Why do you need a browser in the first place?
Question 2) If it's for JavaScript, why not use http://phantomjs.org/ ?
Hi, I'm trying to install brozzler[easy] in my virtualenv (Python 3.5.2), but pip says that no matches were found if I run this command:
pip install brozzler[easy]
I have successfully installed brozzler (1.1b10) with pip in this virtual environment.
Complete flow:
primoz@computer:~/projects % pip install brozzler[easy]
zsh: no matches found: brozzler[easy]
And brozzler[dashboard] is not found either.
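For what it's worth, the "no matches found" message comes from zsh itself: it treats the square brackets as a glob pattern before pip ever runs. This is general zsh behavior, not specific to brozzler, and quoting the argument avoids it:

```shell
# zsh expands [...] as a glob pattern before pip sees it; when no file
# matches, zsh aborts with "no matches found". Quote the argument:
pip install 'brozzler[easy]'
# or escape the brackets:
pip install brozzler\[easy\]
```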
I've been using Brozzler to archive a SAML-protected site, which I've been able to access by setting user_agent so that the site accepts regular form authentication. However, I frequently encounter an issue where, after passing form authentication, Brozzler archives the authentication page that asks for username and password instead of the page "under" the authentication request. Essentially, after the site is entered and SAML authentication of the home page is completed, some of the following pages have this issue.
This occurs on about 1 in 6 pages, and I'm unsure whether it is due to site settings or Brozzler itself. Any tips or advice would be much appreciated.
We occasionally find brozzler workers frozen such that kill -QUIT (https://github.com/internetarchive/brozzler/blob/506ab0c/brozzler/cli.py#L363) doesn't work, usually at shutdown. To debug this issue I'm running brozzler with Python configured with --pydebug (using the python3-dbg package on Ubuntu). I waited for the problem to happen, then ran sudo gdb -p 13663 -batch -ex 'thread apply all py-bt' -ex quit. I see the SIGQUIT handler blocked trying to acquire a lock inside of logging. Many other threads are also stuck waiting for the same lock.
Thread 1 (Thread 0x7f9af26b3700 (LWP 13663)):
Traceback (most recent call first):
<built-in method acquire of _thread.RLock object at remote 0x7f9aeed3ed00>
File "/usr/lib/python3.5/logging/__init__.py", line 804, in acquire
self.lock.acquire()
File "/usr/lib/python3.5/logging/__init__.py", line 853, in handle
self.acquire()
File "/usr/lib/python3.5/logging/__init__.py", line 1487, in callHandlers
hdlr.handle(record)
File "/usr/lib/python3.5/logging/__init__.py", line 1425, in handle
self.callHandlers(record)
File "/usr/lib/python3.5/logging/__init__.py", line 1415, in _log
self.handle(record)
File "/usr/lib/python3.5/logging/__init__.py", line 1279, in info
self._log(INFO, msg, args, **kwargs)
File "/usr/lib/python3.5/logging/__init__.py", line 1838, in info
root.info(msg, *args, **kwargs)
File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/cli.py", line 346, in dump_state
signum, '\n'.join(state_strs)))
<built-in method acquire of _thread.lock object at remote 0x7f9aec0786c8>
File "/usr/lib/python3.5/threading.py", line 1070, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
File "/usr/lib/python3.5/threading.py", line 1054, in join
self._wait_for_tstate_lock()
File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/cli.py", line 369, in brozzler_worker
th.join()
File "/opt/brozzler-ve3/bin/brozzler-worker", line 11, in <module>
load_entry_point('brozzler==1.1b13.dev284', 'console_scripts', 'brozzler-worker')()
And yet, no threads seem to be in the midst of logging while owning that lock. On reflection, it could be that some thread owned the lock, had an exception raised by thread_raise, and did not release the lock.
The code in logging looks pretty disciplined:
def handle(self, record):
    rv = self.filter(record)
    if rv:
        self.acquire()
        try:
            self.emit(record)
        finally:
            self.release()
    return rv

def acquire(self):
    if self.lock:
        self.lock.acquire()
But it seems possible the exception was raised between self.lock.acquire() and try:.
One solution would be to avoid any logging (or any locking of any kind) inside of a with brozzler.thread_accept_exceptions() block. But a lot happens in those blocks that we want to log.
Maybe monkey-patching that logging handle() method to do self.acquire() inside the try block (and making sure self.release() proceeds silently in case the lock is not held) would fix this. That may be the simplest thing to try. (It could introduce another bug, though: since releasing an RLock decrements a counter, if the lock is already held and acquire() does not complete, release() could release the lock when it shouldn't. But we're in logging here, so the worst that could happen would seem to be some overlapping log messages...)
Even if that works, ensuring that no locking happens inside a with brozzler.thread_accept_exceptions() block is a heavy burden. Sooner or later it may be necessary to rewrite brozzler to use multiprocessing instead of multithreading.
Brozzler doesn't crawl, extract, and follow all the links on the website; it only scrapes the main page (the URL from the job.yaml configuration file).
I guess something's wrong with my scope surt parameter? But scope rules are not yet documented and I don't know where else to find information about them, so I've decided to ask here.
I've tried the following scope configurations:
# job.yaml file
# other job parameters
seeds:
- url: http://iskme.org
  scope:
    surt: org,iskme)/
2. SURT rules according to this documentation: https://webarchive.jira.com/wiki/display/ARIH/SURT+Rules
# job.yaml file
# other job parameters
seeds:
- url: http://iskme.org
  scope:
    surt: +http://(org,iskme,
But it still scrapes only the one main page, iskme.org. How do I configure brozzler scope rules to crawl the whole website, following all links within the domain?
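For intuition about what a SURT prefix is matching against, here is a rough sketch (the function names and the simplified SURT form are my own, not brozzler's actual implementation; real SURTs also include a trailing comma after the host and scheme handling):

```python
# Sketch of SURT-prefix scoping: canonicalize each discovered URL to a
# SURT-like form and keep it in scope if it starts with the seed's prefix.
from urllib.parse import urlsplit

def to_surt(url):
    """Convert a URL to a simplified SURT form like 'org,iskme)/path'."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ",".join(reversed(host.split(".")))
    return "%s)%s" % (reversed_host, parts.path or "/")

def in_scope(url, surt_prefix):
    return to_surt(url).startswith(surt_prefix)

print(to_surt("http://iskme.org/about"))                 # org,iskme)/about
print(in_scope("http://iskme.org/about", "org,iskme)"))  # True
```

Under this model, a prefix covering the whole host (rather than one ending in a specific path) is what keeps every link on the domain in scope.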
I am trying to install brozzler on macOS 10.13.2. After running pip3 install brozzler[easy], I tried to run brozzler-easy, but got the following error:
CRITICAL:root:ModuleNotFoundError: No module named 'pywb.cdx'
You might need to run "pip install brozzler[easy]".
See README.rst for more information.
I checked and pywb is installed. However, it looks like the pywb package removed the pywb.cdx module at some point. Do I need to install an older version of pywb?
I added a site with brozzler-new-site, and its pages are being captured, but video files stored on https://example.s3.amazonaws.com are not archived. How can one specify that example.s3.amazonaws.com should be included when adding a site with brozzler-new-site?
Ideally, this should be mentioned in the README.
How do you stop jobs in brozzler?
How do you stop jobless sites in brozzler?
How do you assign specific workers to specific jobs?
I tried 3 times with a virtual machine and 2 times with just a PC with Ubuntu reinstalled, but all failed...
I did a google search and didn't find any tutorials...
Does brozzler have support for adding new, or customizing existing behaviors?
From what I understood, this requires both a yaml file matching urls to behaviors, and the actual behaviors in js files.
If there's no support currently, how about adding one or more flags allowing additional yaml and js files (or directories containing them) to be specified? Where would be a good place to implement this, so it could become an official feature of brozzler?
When trying to run brozzler, this line appears in the logs and chromium crashes:
DEBUG ChromeOutReaderThread:37385 brozzler.chrome.Chrome._read_stderr_stdout(chrome.py:273) chrome pid 21558 STDERR b'[21558:21558:1016/150743.534943:ERROR:default_network_context_params.cc(64)] Cannot use V8 Proxy resolver in single process mode.\n'
Removing the hard-coded --single-process parameter in brozzler/chrome.py seems to fix the issue. Is this flag necessary for brozzler to work?
Here's some more debug info.
chromium-browser --version
Chromium 69.0.3497.81 Built on Ubuntu , running on Ubuntu 18.04
How brozzler started chromium:
INFO BrozzlingThread:37385 brozzler.chrome.Chrome.start(chrome.py:180) running: 'chromium-browser --remote-debugging-port=37385 --use-mock-keychain --user-data-dir=/tmp/tmp5qcy14wv/chrome-user-data --disable-background-networking --disable-renderer-backgrounding --disable-hang-monitor --disable-background-timer-throttling --mute-audio --disable-web-sockets --disable-cache --single-process --window-size=1100,900 --no-default-browser-check --disable-first-run-ui --no-first-run --homepage=about:blank --disable-direct-npapi-requests --disable-web-security --disable-notifications --disable-extensions --disable-save-password-bubble --ignore-certificate-errors --proxy-server=localhost:8888 about:blank'
With a warcprox instance running at localhost:8888
Trying to use the brozzler easy setup on Ubuntu 16.04.
Using Anaconda with a fresh environment:
conda create --name Brozzler python=3.5
source activate Brozzler
pip install brozzler[easy]
Running:
brozzler-easy
produces the error:
Traceback (most recent call last):
File "/home/thomaspr/anaconda3/envs/Brozzler/bin/brozzler-easy", line 11, in <module>
sys.exit(main())
File "/home/thomaspr/anaconda3/envs/Brozzler/lib/python3.5/site-packages/brozzler/easy.py", line 274, in main
controller = BrozzlerEasyController(args)
File "/home/thomaspr/anaconda3/envs/Brozzler/lib/python3.5/site-packages/brozzler/easy.py", line 126, in __init__
self._warcprox_args(args))
File "/home/thomaspr/anaconda3/envs/Brozzler/lib/python3.5/site-packages/warcprox/main.py", line 213, in init_controller
if args.rethinkdb_dedup_url:
AttributeError: 'Namespace' object has no attribute 'rethinkdb_dedup_url'
I've been testing Brozzler locally using the brozzler-easy option. I have generated a comprehensive list of URLs to visit for a Scalar publication I'm working on (i.e. 0 hops for each seed). The resulting WARC files have a large number of HTTP 206 partial responses for a portion of the JavaScript files, though each JS file has at least one 200 response. The result is that, on playback in pywb, some pages load the 206 Partial Content response and others load the 200 OK. If the 206 response is loaded by pywb, a blank page is shown and the console has JS errors. I can fix it by removing the 206 rows from the .cdxj index file so it falls back to the 200 copy; then every page loads fine.
I noticed that some JS files don't seem to have this problem. It looks like it's only the ones where the <script> tag declaring the file does not include the type="text/javascript" attribute, which should be optional. That may be a coincidence, but I tried 2 completely different Scalar sites and they did the same thing. I'm running Brozzler on a Mac with Google Chrome. I suspect a Chrome behavior that has a negative impact on the WARC, but I'm not sure whether Brozzler, warcprox, pywb, or somewhere else is the best place to handle it. Does this seem like a Brozzler issue?
If needed, I can supply a test configuration file for replicating the problem, but wanted to check I'm in the right place and that it's not a known issue or result of incorrect configuration. Thanks!
Hi,
How can I change the port 8000 in the code?
Thanks
I am curious whether there is any data reporting how well brozzler scales as the number of parallel browsers increases.
In my current (very limited) test bed, brozzler takes extremely long to crawl web pages and store the corresponding resources.
Attaching some results from attempting to crawl 20 random web pages with brozzler with headless Chrome enabled.
Scalability results
I also track all the system resource usage (CPU, network, disk). I am currently running this experiment on a 32-core Linux server with a 1 Gbps NIC, storing data on an HDD with read/write throughput of 150-200 MB/s.
As you can see, none of the resources are saturated, and yet brozzler takes on average ~40-50 s to crawl and store a single page. Furthermore, the low CPU usage is concerning, since in my experience increasing the number of parallel browsers linearly increases the overall CPU usage of the system. Could this be due to the proxy server used by brozzler?
Also, when I crawl the same corpus of pages using an extremely lightweight, custom, nodejs based crawler (written on top of puppeteer), it can do so about 10x faster than the above observed timings.
I installed brozzler via pip and launched it with brozzler-easy in a Debian Jessie VM and was able to scrape a site. (Brozzler 1.1b8, pywb 0.33.0, python3.4)
However, the default page links in the dashboard's job detail page pointed to http://localhost:8091/brozzler/. As far as I can tell, nothing was started by default listening on port 8091.
After some investigation I found there was something listening on port 8880 that looked like a wayback process, so I tried launching brozzler-easy like this:
WAYBACK_BASEURL=http://192.168.122.152:8880/brozzler brozzler-easy -d warc/ --dashboard-address 0.0.0.0
(the ip addresses were so I could use my regular browser instead of the VM browser to use the site)
Doing that allowed the wayback links to work, but the thumbnail & screenshot urls are still 404ing.
I'm trying to crawl a site that requires Google authentication. Brozzler does not seem to offer an option to add a cookies.txt; or did I miss something?
Trying to take the WARCs saved by warcprox, index them, and play them back with standard pywb, but getting issues with some of the records inside the WARC. For example:
WARC/1.0^M
WARC-Type: response^M
WARC-Record-ID: <urn:uuid:c603136b-0bb2-4b94-8e49-9e7113b88e15>^M
WARC-Date: 2018-01-23T16:24:36Z^M
WARC-Target-URI: https://www.bbc.co.uk/^M
WARC-IP-Address: 212.58.244.67^M
Content-Type: application/http;msgtype=response^M
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ^M
Content-Length: 377^M
WARC-Block-Digest: sha1:UXTY2PXPGPSY6C3L6QLFUDNGWMS5CU7N^M
^M
HTTP/1.1 200 OK^M
Content-Type: text/html; charset=utf-8^M
ETag: W/"434a6-QjTzOogzAHForrr8966RtzKEhqg"^M
X-Frame-Options: SAMEORIGIN^M
Content-Length: 275622^M
Date: Tue, 23 Jan 2018 16:24:35 GMT^M
Connection: keep-alive^M
X-Cache-Action: HIT^M
X-Cache-Hits: 1580^M
X-Cache-Age: 94^M
Cache-Control: private, max-age=0, must-revalidate^M
Vary: Accept-Encoding, X-CDN, X-BBC-Edge-Scheme^M
is one of the records in the WARC. As you can see, it doesn't have a payload other than the headers. Later in the WARC there is a record for the same URL that does have a payload, but pywb finds this record first, so it shows a blank page.
I don't really understand what is happening with the browser cache/server cache and why this request returns a 200 without a payload, any ideas? (or is this more of a warcprox question)
(I also notice that you give chrome the flag --disable-cache in the code but it's not in the list https://peter.sh/experiments/chromium-command-line-switches/ so wondering if the flag has changed and might be related to this issue?)
When trying to start a worker I get this error
File "/home/thore/.local/bin/brozzler-worker", line 7, in <module>
  from brozzler.cli import brozzler_worker
File "/home/thore/.local/lib/python2.7/site-packages/brozzler/__init__.py", line 70, in <module>
  logging._levelToName[TRACE] = 'TRACE'
AttributeError: 'module' object has no attribute '_levelToName'
Scenario: I have warcprox and a brozzler worker running on my local machine. In the middle of archiving a website, the brozzler worker process is killed, e.g. with 'kill -9 <process_id>' or by closing the console session.
After both warcprox and the brozzler worker are restarted (on the same ports as before), the site is not picked up for crawling. This is because the claimed property in db('Brozzler').table('sites') remains true.
Query:
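A hypothetical way to release such stuck sites from the RethinkDB Data Explorer (a sketch only; the table and field names follow the description above, and the database name casing should be checked against your own setup before running):

```javascript
// Release sites left claimed by a killed worker so they can be re-crawled.
r.db('brozzler').table('sites')
  .filter({claimed: true})
  .update({claimed: false})
```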
I am running brozzler on Ubuntu 17.10. It's working quite well, except that all screenshots shown in the brozzler dashboard web UI are completely black. Is this a known issue? I am using Chromium Version 65.0.3325.181 (Official Build) Built on Ubuntu , running on Ubuntu 17.10 (64-bit)
When I run pip3 install brozzler[easy]
I get the following error:
Could not find a version that satisfies the requirement warcprox>=2.4b2.dev173 (from brozzler[easy]) (from versions: 1.0, 1.1, 1.2, 1.3, 1.4, 2.0.dev9, 2.0b1, 2.0b2, 2.0, 2.0.1, 2.1b1.dev60, 2.1b1.dev68, 2.1b1.dev71, 2.1b1.dev86, 2.1b1.dev87, 2.2, 2.3, 2.4b1)
No matching distribution found for warcprox>=2.4b2.dev173 (from brozzler[easy])
I checked on PyPI, and it looks like warcprox 2.4b1 is the most recent release published.
(FYI, I'm running Ubuntu 18.04, and Python 3.6.5, not that this should matter for this error.)
2017-06-13 06:02:03,368 5 ERROR MainThread rethinkstuff.Rethinker._random_server_connection(init.py:97) will keep trying to get a connection after failure connecting to localhost: Could not connect to localhost:28015. Error: [Errno 99] Cannot assign requested address
RethinkDB has been started with the default configuration. Any suggestions?
In brozzler, the browser fetches service worker scripts, but this seems to happen in a special context and doesn't get the same devtools treatment as regular URLs. In particular, Network.setExtraHTTPHeaders does not apply, so the request is missing Warcprox-Meta, which means warcprox writes it to the wrong warc file.
https://webarchive.jira.com/browse/AITFIVE-1713 (internal issue tracker)
Hello,
I've been using brozzler-easy for testing and brozzler seems to be working wonderfully. I have a very large website I am trying to archive, and I'm unsure about a few things that I can't figure out from job-conf.rst.
I'm running a local version of the website on my local machine, so the site is not running from its public domain. Is there a way to get brozzler to replace my localhost domain with the actual public domain?
Another question: is there any way to boost performance, possibly by configuring it to use more threads? Currently when I set up a brozzler job and monitor it in the Brozzler Dashboard, it shows two sites being actively crawled. Is that an example of brozzler running two threads to crawl the site?
Maybe there's a writeup somewhere explaining optimal ways to use brozzler on a local machine?
I greatly appreciate any insights. Sorry to post this here; I'm not sure how else to get in touch with people on this project.
Thank you.
Hi,
I successfully installed brozzler-easy via brew. After entering the command brozzler-easy in zsh, the following error is thrown:
bash-3.2$ brozzler-easy
/Library/Python/3.8/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
2023-11-20 14:50:34,352 17668 INFO MainThread root.stats_processor(controller.py:65) statistics tracking disabled
2023-11-20 14:50:34,353 17668 INFO MainThread warcprox.warcproxy.WarcProxy.__init__(mitmproxy.py:617) 100 proxy threads
2023-11-20 14:50:34,358 17668 NOTICE MainThread warcprox.warcproxy.WarcProxy.server_activate(warcproxy.py:493) listening on 127.0.0.1:56467
2023-11-20 14:50:34,376 17668 ERROR MainThread root.init_app(wsgi_wrappers.py:169) *** pywb app init FAILED config from "create_wb_router"!
Traceback (most recent call last):
  File "/Library/Python/3.8/site-packages/pywb/framework/wsgi_wrappers.py", line 166, in init_app
    wb_router = init_func(config)
  File "/Library/Python/3.8/site-packages/pywb/webapp/pywb_init.py", line 256, in create_wb_router
    defaults = load_yaml_config(DEFAULT_CONFIG)
  File "/Library/Python/3.8/site-packages/pywb/utils/loaders.py", line 49, in load_yaml_config
    config = yaml.load(configdata)
TypeError: load() missing 1 required positional argument: 'Loader'
Traceback (most recent call last):
  File "/usr/local/bin/brozzler-easy", line 10, in <module>
    sys.exit(main())
  File "/Library/Python/3.8/site-packages/brozzler/easy.py", line 273, in main
    controller = BrozzlerEasyController(args)
  File "/Library/Python/3.8/site-packages/brozzler/easy.py", line 128, in __init__
    self.pywb_httpd = self._init_pywb(args)
  File "/Library/Python/3.8/site-packages/brozzler/easy.py", line 176, in _init_pywb
    wsgi_app = pywb.framework.wsgi_wrappers.init_app(
  File "/Library/Python/3.8/site-packages/pywb/framework/wsgi_wrappers.py", line 166, in init_app
    wb_router = init_func(config)
  File "/Library/Python/3.8/site-packages/pywb/webapp/pywb_init.py", line 256, in create_wb_router
    defaults = load_yaml_config(DEFAULT_CONFIG)
  File "/Library/Python/3.8/site-packages/pywb/utils/loaders.py", line 49, in load_yaml_config
    config = yaml.load(configdata)
TypeError: load() missing 1 required positional argument: 'Loader'
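For context on the TypeError above: PyYAML 6.0 removed the implicit default loader, so `yaml.load()` now requires an explicit `Loader` argument, and the pywb version pulled in here still calls it with a single argument. A minimal sketch of a compatible call (the alternative being to pin `pyyaml<6` in the environment):

```python
import yaml

def load_yaml_config(configdata):
    # PyYAML >= 6 requires an explicit Loader; SafeLoader avoids
    # executing arbitrary YAML tags in the config data.
    return yaml.load(configdata, Loader=yaml.SafeLoader)

config = load_yaml_config("framed_replay: true")
```

This mirrors the one-line change that would be needed in pywb's `load_yaml_config`; the config key used above is only illustrative.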
Can someone help me please?
Best regards,
Steve
I'm worried that not all the videos are being downloaded, because the pages require authentication, and youtube-dl can't download from them without the cookies (the browser is logged in, but I'm not sure about youtube-dl). Thanks!
Hi,
I was trying to crawl https://www.ncdc.noaa.gov/sotc and for some reason it never finished. I'm not sure why, but I noticed a number of "revisit" tags in the brozzler-easy log messages and suspect it was just cycling over the same pages over and over. (I think I need to understand how to read WARC files before I can confirm that, though.)
I found some code in brozzler/job.py that referred to stop_requested, but nothing seemed to ever set that variable to True.
I stopped the crawl by using the RethinkDB console and changing "state" to "FINISHED". I could probably figure out how to write a command that, given a job id, updates that variable, if that seems like a reasonable solution.
Installing brozzler and running it according to the instructions does not work. The problem seems to be in the doublethink module, where the following error occurs when running brozzler-worker:
2017-06-08 22:17:54,993 8363 CRITICAL MainThread brozzler.worker.BrozzlerWorker.run(worker.py:504) thread exiting due to unexpected exception
Traceback (most recent call last):
  File "/venv/lib/python3.6/site-packages/brozzler/worker.py", line 446, in _service_heartbeat
    self.status_info = self._service_registry.heartbeat(status_info)
  File "/venv/lib/python3.6/site-packages/doublethink/services.py", line 142, in heartbeat
    repr(field))
Exception: ('status_info is missing required field %s', "'ttl'")
The ttl field is indeed missing as the status_info dict looks like this:
{'browser_pool_size': 1,
'browsers_in_use': 0,
'heartbeat_interval': 20.0,
'load': 0.0,
'role': 'brozzler-worker'}
The check for ttl in status_info seems to have been added in this version of doublethink:
internetarchive/doublethink@a1c5a08
Downgrading to doublethink 0.2.0.dev73 seems to solve the problem.
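Besides downgrading doublethink, the other direction would be to add the missing field on the brozzler side before calling heartbeat(). A sketch, under the assumption (not confirmed here) that a sensible ttl is a few heartbeat intervals, so the service is only considered dead after several missed heartbeats:

```python
def with_ttl(status_info, heartbeat_interval=20.0):
    # doublethink's ServiceRegistry.heartbeat() now requires a 'ttl' field.
    # Assumption: ttl = 3 heartbeat intervals; an existing ttl is preserved.
    info = dict(status_info)
    info.setdefault("ttl", 3 * heartbeat_interval)
    return info

status_info = {
    "browser_pool_size": 1,
    "browsers_in_use": 0,
    "heartbeat_interval": 20.0,
    "load": 0.0,
    "role": "brozzler-worker",
}
status_info = with_ttl(status_info, status_info["heartbeat_interval"])
```

With this, the dict shown above would carry a ttl of 60.0 and pass doublethink's required-field check.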
When I recently used the brozzle-page command with a current version of Google Chrome, I noticed that brozzler does not load the web page that should be archived.
This results in the web page not being archived successfully.
WARC file: WARCPROX-20230519163909687-00000-0so5t1md.warc
This issue also occurred when trying to archive other web pages:
The commands I used are listed below (video example):
warcprox -p 8081 -d ./warcs/IGN/brozzle_page/2023_05_19 --dedup-db-file /dev/null
export BROZZLER_EXTRA_CHROME_ARGS="--ignore-certificate-errors"
brozzle-page --chrome-exe '/usr/bin/google-chrome' --proxy localhost:8081 'https://www.ign.com/articles/the-last-of-us-season-1-review'
A "WebSocketBadStatusException: Handshake status 403 Forbidden" occurred when recently running these commands on Ubuntu (22.04.2 LTS and 20.04.6 LTS) and macOS (Ventura 13.3.1).
When I used these commands earlier this year it was working successfully (video):
After noticing this issue, I went through the recent stable versions of Google Chrome and found that the last stable version that worked with the brozzle-page command was 109.0.5414.119, released on January 24, 2023.
chrome.deb URI (109.0.5414.119): https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_109.0.5414.119-1_amd64.deb
Crawling session: https://youtu.be/A-zr6zVTZSo?t=5569
Replay session: https://youtu.be/A-zr6zVTZSo?t=6345
The first stable version of Chrome that did not work with the brozzle-page command is 111.0.5563.110, released on March 21, 2023.
chrome.deb URI (111.0.5563.110): https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_111.0.5563.110-1_amd64.deb
Crawling session: https://youtu.be/A-zr6zVTZSo?t=4903
Replay session: https://youtu.be/A-zr6zVTZSo?t=4992
Chrome release blog post for 111.0.5563.110: https://chromereleases.googleblog.com/2023/03/stable-channel-update-for-desktop_21.html
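For context on the 403: Chrome 111 started rejecting DevTools WebSocket handshakes whose Origin header it does not recognize, which matches the handshake failure described above. Chrome's `--remote-allow-origins` switch relaxes that check, and it can be passed through the same `BROZZLER_EXTRA_CHROME_ARGS` variable used in the commands above. A sketch (whether this alone is sufficient for a given brozzler version is an assumption to verify):

```python
import os

# Assumption: Chrome 111+ rejects DevTools WebSocket connections from
# unrecognized Origins; --remote-allow-origins=* disables that check.
# Set this before launching brozzle-page / brozzler-worker.
os.environ["BROZZLER_EXTRA_CHROME_ARGS"] = "--remote-allow-origins=*"
```

The shell equivalent is `export BROZZLER_EXTRA_CHROME_ARGS="--remote-allow-origins=*"`, analogous to the `--ignore-certificate-errors` export shown earlier.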
Hi, I'm not sure if I'm adding this issue to the correct repository, but it's about crawls I created with brozzler.
If I view Instagram and Twitter captures in pywb, I notice that the images are not shown (screenshots). However, the images are present in the WARC file, because I can export them from it.
The crawls are made with brozzler. I have the same issue when opening the WARC files with Webrecorder Player and ReplayWeb.page. In those applications, I only see the Instagram logo.
Could it be related to #198?
instagram WARC-file: brozzler-20201117134317487-b7gpz5v6-00000.warc.gz
edit: I crawled Instagram with Browsertrix in the meantime and have no issue with replaying it, so maybe it's a brozzler issue.
Is it possible to run brozzler with headless Chromium? If so, how?
Thanks
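Regarding the headless question: brozzler drives Chrome/Chromium over the DevTools protocol, so in principle Chrome's headless flag can be passed through the `BROZZLER_EXTRA_CHROME_ARGS` environment variable mentioned elsewhere in these issues. Note this is a sketch under assumptions, not a confirmed supported configuration, and some sites may render differently under headless mode:

```python
import os
import shlex

# Assumption: Chrome >= 109 supports "--headless=new"; older builds
# use the original "--headless". Set before launching brozzler-worker.
os.environ["BROZZLER_EXTRA_CHROME_ARGS"] = "--headless=new --disable-gpu"

def extra_chrome_args():
    # Splits the extra flags the way a shell would before they are
    # appended to Chrome's command line.
    return shlex.split(os.environ.get("BROZZLER_EXTRA_CHROME_ARGS", ""))
```

The shell equivalent is `export BROZZLER_EXTRA_CHROME_ARGS="--headless=new --disable-gpu"` before starting brozzler-worker.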
Sometimes the Chromium browser crashes (or locks up, so I have to close it to reload it). That's not what I'm reporting; those are probably not actually bugs (lol).
But when that does happen, it's annoying to have to re-authenticate for all pages, and it could actually result in an incomplete crawl.
My proposal: add a bit more logic to the username/password fields to check for "remember me" checkboxes.
Hi,
Before I describe the issue, I should say that I am very new to brozzler and similar tools in general, so perhaps my question is a bit simplistic.
Anyway, I was wondering if you have any pointers as to why videos on some hashtag feeds I captured do not seem to play when I view them in pywb. Is there any configuration that might solve this, for Twitter or for other social media platforms/websites?
Thank you!