peterk / warcworker
A dockerized, queued high fidelity web archiver based on Squidwarc
License: GNU General Public License v3.0
When selecting which user scripts to run, make it possible to configure the order.
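One possible shape for this, as a minimal sketch only: sort the selected script names by a configured order list before they are combined into the job script. The function `order_scripts`, the `preferred_order` setting, and the script names other than `scroll_everything` are hypothetical and not taken from the warcworker code.

```python
def order_scripts(selected, preferred_order):
    """Return the selected script names sorted by a configured order.

    Scripts that do not appear in preferred_order are placed last and keep
    their relative order, because sorted() is stable.
    """
    rank = {name: index for index, name in enumerate(preferred_order)}
    return sorted(selected, key=lambda name: rank.get(name, len(rank)))


if __name__ == "__main__":
    # Hypothetical configuration: run the click script before scrolling.
    configured = ["click_buttons", "scroll_everything"]
    chosen = ["scroll_everything", "my_custom_script", "click_buttons"]
    print(order_scripts(chosen, configured))
```

The order could come from a config file or from drag-and-drop in the form; the sketch only covers the sorting step.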
The worker Dockerfile refers to two nonexistent test files.
Lines 34 to 35 in ce93eda
The screenshot is currently saved in the root archive folder. It would be great to have it saved in the job dir instead.
Following the instructions in the README gives this:
/bin/sh: 1: wget: not found
E: gnupg, gnupg2 and gnupg1 do not seem to be installed, but one of them is required for this operation
ERROR: Service 'worker' failed to build: The command '/bin/sh -c wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -' returned a non-zero code: 255
Hi, I'm exploring tools for crawling social media. I got a FileNotFoundError after starting a crawl. I chose scroll_everything as the script.
FileNotFoundError
FileNotFoundError: [Errno 2] No such file or directory: '/scripts/job/0fe3e4dc888e2f497d59d20ccf551c38.js'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2464, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2450, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1867, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/main.py", line 121, in process
    message = make_job(jobid, output_path, seeds, description, scripts)
  File "/app/main.py", line 29, in make_job
    with open(jobscript_file, 'w') as outfile:
FileNotFoundError: [Errno 2] No such file or directory: '/scripts/job/8c5bad7d59ecfbe5f8aa2a4df4bffa6b.js'
Getting tracebacks when ticking a script checkbox.
FileNotFoundError: [Errno 2] No such file or directory: '/scripts/job/6469c99f84619919ec151be6e5d28a3c.js'
Works as expected when no script boxes are ticked.
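Both reports fail at the same call, `open(jobscript_file, 'w')` in `make_job`. In Python, opening a file for writing creates the file but not missing parent directories, so one plausible cause (an assumption, not confirmed by the reports) is that `/scripts/job` does not exist in the container. A minimal sketch of a guard, where `write_job_script` is a hypothetical helper and not the actual warcworker function:

```python
import os
import tempfile


def write_job_script(jobscript_file, script_body):
    """Write a job script, creating the parent directory first if needed."""
    parent = os.path.dirname(jobscript_file)
    if parent:
        # open(..., 'w') does not create missing directories, so make sure
        # the parent exists; exist_ok avoids an error when it already does.
        os.makedirs(parent, exist_ok=True)
    with open(jobscript_file, 'w') as outfile:
        outfile.write(script_body)
    return jobscript_file


if __name__ == "__main__":
    # A temporary directory stands in for /scripts in this demo.
    with tempfile.TemporaryDirectory() as scripts_root:
        path = os.path.join(scripts_root, "job", "example.js")
        write_job_script(path, "// combined user scripts\n")
        print(os.path.exists(path))
```

If the directory is meant to come from a Docker volume, checking that the volume is mounted at the expected path would be the other place to look.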
Currently the worker uses Python 3.6 compiled from source. It could probably just as well use the JavaScript facilities bundled with the base image to work on queue items. That would reduce dependencies and make installation faster.
Internet Archive runs a service called "Archive-It" that many people doing personal archiving use.
The screenshot (and interface elements when testing) made me initially question what this tool has to do with Archive-It. As a suggestion: maybe put the name of the tool (warcworker) in the header box and something more representative (e.g., Archive URLs) in the button text. This would prevent any confusion and still be descriptive of what the tool accomplishes.
One of the use cases I have wanted to support in Squidwarc is multiple worker crawlers populating and pulling from a single master frontier, as well as a move from the current in-memory frontier to a more scalable frontier scheme.
Since warcworker is light years ahead in this regard (i.e. a frontend for Squidwarc with multiple crawler workers and expandability potential for managing long crawls), I thought it best to see if warcworker has any interest in this functionality and, if so, to coordinate development.
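To make the idea of a single shared frontier concrete, here is a toy in-process sketch: several workers pull from one queue, and a seen set ensures each URL is handed out at most once. The class `SharedFrontier` and the `worker` function are hypothetical, not Squidwarc or warcworker APIs, and a real deployment would presumably need an external store rather than in-memory Python objects.

```python
import queue
import threading


class SharedFrontier:
    """Thread-safe URL frontier that hands each URL out at most once."""

    def __init__(self, seeds):
        self._queue = queue.Queue()
        self._seen = set()
        self._lock = threading.Lock()
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Enqueue url unless it was added before; return True if enqueued."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
        self._queue.put(url)
        return True

    def next(self, timeout=0.1):
        """Return the next URL, or None if nothing arrives within timeout."""
        try:
            return self._queue.get(timeout=timeout)
        except queue.Empty:
            return None


def worker(name, frontier, results):
    # Simplification: a worker stops as soon as the frontier looks empty.
    # A real crawler would also add discovered links via frontier.add()
    # and would need a proper termination condition.
    while True:
        url = frontier.next()
        if url is None:
            break
        results.append((name, url))


if __name__ == "__main__":
    frontier = SharedFrontier(["https://example.com/", "https://example.org/"])
    crawled = []
    threads = [threading.Thread(target=worker, args=("w%d" % i, frontier, crawled))
               for i in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sorted(url for _, url in crawled))
```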