webcomix's Introduction

webcomix

Build Status · Coverage Status · PyPI version

Description

webcomix is a webcomic downloader that can additionally create a .cbz (Comic Book ZIP) file once downloaded.

Notice

This program is for personal use only. Please be aware that by making the downloaded comics publicly available without the permission of the author, you may be infringing upon various copyrights.

Installation

Dependencies

  • Python (3.8 or newer)
  • click
  • scrapy (additional platform-specific installation steps may be required; see the Scrapy installation docs)
  • scrapy-splash
  • scrapy-fake-useragent
  • tqdm
  • Docker (required to download JavaScript-dependent websites with the -j option)

Process

End user

  1. Install Python 3
  2. Install the command line interface tool with pip install webcomix

Developer

  1. Install Python 3
  2. Clone this repository and open a terminal in its directory
  3. Install poetry with pip install poetry
  4. Download the dependencies by running poetry install
  5. Install pre-commit hooks with pre-commit install

Usage

webcomix [OPTIONS] COMMAND [ARGS]

Global Flags

help

Show the help message and exit.

version

Show the version number and exit.

Commands

comics

Shows all predefined comics which can be used with the download command.

download

Downloads a predefined comic. Supports the --cbz flag, which creates a .cbz archive of the downloaded comic.

search

Searches for an XPath that can download the whole comic. Supports the --cbz flag, which creates a .cbz archive of the downloaded comic, -s, which verifies only the provided page of the comic, -y, which skips the verification prompt, and -j, which runs the JavaScript on pages before downloading.

custom

Downloads a user-defined comic. To download a specific comic, you'll need a link to the first page, an XPath expression that yields the link to the next page, and an XPath expression that yields the link to the image. More info here. Supports the --cbz flag, which creates a .cbz archive of the downloaded comic, -s, which verifies only the provided page of the comic, and -y, which skips the verification prompt.

Examples

  • webcomix download xkcd
  • webcomix search xkcd --start-url=http://xkcd.com/1/
  • webcomix custom --cbz (You will be prompted about other needed arguments)
  • webcomix custom xkcd --start-url=http://xkcd.com/1/ --next-page-xpath="//a[@rel='next']/@href" --image-xpath="//div[@id='comic']//img/@src" --cbz (Same as before, but with all arguments declared beforehand)

Making an XPath selector

Using an HTML inspector, locate an HTML path to the next link's href attribute and to the comic image's src attribute.

e.g.: //div[@class='foo']/img/@src will select the src attribute of every img directly inside a div whose class is foo.
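To check a selector outside the browser, here is a minimal sketch using lxml (the HTML parser Scrapy relies on); the markup and the foo class are illustrative, not taken from a real comic site:

```python
from lxml import html

# Illustrative page fragment, not a real comic site's markup
page = html.fromstring("""
<div class='foo'><img src='/comics/001.png'></div>
<a rel='next' href='/comic/2'>Next</a>
""")

# Attribute XPaths return lists of string values
image_urls = page.xpath("//div[@class='foo']/img/@src")
next_links = page.xpath("//a[@rel='next']/@href")

print(image_urls)  # ['/comics/001.png']
print(next_links)  # ['/comic/2']
```

If either list comes back empty, the selector did not match anything and needs adjusting before you hand it to webcomix.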

Note: webcomix works best on static websites, since Scrapy (the framework we use to crawl web pages) doesn't process JavaScript; for JavaScript-dependent sites, use the -j option, which requires Docker.

To make sure your XPath is correct, you can test it in the Scrapy shell, which is installed alongside webcomix.

scrapy shell <website>           --> Opens a shell with the website's page loaded.
> response.body                  --> Shows the HTML from the website.
> response.xpath('<expression>') --> Tests an XPath selection. If you get [], your XPath expression hasn't matched anything on the page.

Contribution

The procedure depends on the type of contribution:

  • If you simply want to request the addition of a comic to the list of supported comics, make an issue with the label "Enhancement".
  • If you want to request the addition of a feature to the system or a bug fix, make an issue with the appropriate label.

Running the tests

To run the tests, you have to use the pytest command in the webcomix folder.

webcomix's People

Contributors

dependabot[bot] · j-cpelletier · thereverend403


webcomix's Issues

Only first image is downloaded

I can't seem to figure out an expression that works for --next-page-xpath. The one I'm using for --image-xpath seems to work fine, though. I've tested both and I can't see what's wrong. Here's a transcript of an interactive session (same results when passing the strings as arguments):

% webcomix custom --cbz "Dead Man at Devil's Cove"
Start url: https://jonnycrossbones.com/comic/page-001a/
Image xpath: //*[@id="comic"]/a/img/@src
Next page xpath: string(//*[@id="sidebar-under-comic"]/div[1]/table/tbody/tr/td[3]/a[2]/@href)
Page 1:
Page URL: https://jonnycrossbones.com/comic/page-001a/
Image URLs:
https://jonnycrossbones.com/wp-content/uploads/2015/06/001A-Dig.png

Verify that the links above are correct.
Are you sure you want to proceed? [y/N]: y
Downloading page https://jonnycrossbones.com/comic/page-001a/
Saving image https://jonnycrossbones.com/wp-content/uploads/2015/06/001A-Dig.png
Finished downloading the images.

I'm not sure if I'm expected to use string() here, but if I don't, it says "Could not find next link". With string() it doesn't give any error, and as you can see the first image downloads correctly, but it seems not to even try advancing to the next page.
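For context on string() here: in XPath, string() collapses a node-set to a single text value, so an empty match comes back as '' rather than [], which can hide a selector that matched nothing. A small sketch with lxml (illustrative markup, not the actual jonnycrossbones.com page):

```python
from lxml import html

# Illustrative fragment, not the actual site's markup
page = html.fromstring("<td><a href='/page-2'>Next</a></td>")

# A node-set expression returns a list of matched values
print(page.xpath("//a/@href"))          # prints: ['/page-2']

# string() collapses the result to a single string; a miss
# becomes '' instead of [], masking a failed selector
print(page.xpath("string(//a/@href)"))  # prints: /page-2
print(page.xpath("string(//b/@href)"))  # prints an empty string
```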

Custom Not Working

Hello,

I've tried a couple custom sites and I can't seem to get it to work. Here are my parameters:
webcomix custom sdamned --start-url=https://www.sdamned.com/comic/prologue --next-page-xpath="//a[@rel='next']/@href" --image-xpath="//div[@id='cc-comicbody']/a/img/@src"
and
webcomix custom funnyfarm --start-url="https://web.archive.org/web/20190719121109/http://funnyfarmcomics.com/index.php?date=2009-01-01" --next-page-xpath="//li[@class='nextlink']/a/@href" --image-xpath="//div[@id='comic-image']/img/@src"

If I open up a scrapy shell and run response.xpath("//div[@id='cc-comicbody']/a/img/@src") for instance, it outputs
Selector xpath="//div[@id='cc-comicbody']/a/img/@src" data='https://www.sdamned.com/comics/153381...'
which appears to be a fully working link. I verified all the xpath parameters and they all seem to work in scrapy, but when I try to run it in webcomix, I get the following error:
sdamned could not be accessed with webcomix.
Chances are the website you're trying to download images from doesn't want to be scraped.
Aborted!

For the sdamned one, I would understand somewhat, but there are known scrapers for Internet Archive, so I don't think the issue is the site blocking scraping.

Feature request: Skip pages

I follow some webcomics that just throw other webcomics into the main rotation. It would be nice if pages could be downloaded or skipped based on an allowlist/blocklist condition (also pausing the page index incrementation).

Bug with custom parameters on 3.8.0

It looks like there are some issues with the latest release version of Webcomix. Below is output from the terminal.

(pv) [noel@noelarch ~]$ webcomix --version
webcomix, version 3.8.0

(pv) [noel@noelarch ~]$ webcomix custom xkcd --start-url=http://xkcd.com/1/ --end-url=http://xkcd.com/5/ --next-page-xpath="//a[@rel='next']/@href" --image-xpath="//div[@id='comic']//img/@src" --cbz
Traceback (most recent call last):
  File "/home/noel/pv/bin/webcomix", line 8, in <module>
    sys.exit(cli())
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/noel/pv/lib/python3.11/site-packages/webcomix/cli.py", line 249, in custom
    validation = comic.verify_xpath()
  File "/home/noel/pv/lib/python3.11/site-packages/webcomix/comic.py", line 164, in verify_xpath
    verification = worker.start()
  File "/home/noel/pv/lib/python3.11/site-packages/webcomix/scrapy/crawler_worker.py", line 46, in start
    raise result[0]
ValueError: XPath error: Invalid expression in http://xkcd.com/5/

Add --end-url

Add a feature similar to --start-url that specifies the last page of a comic to download, possibly --end-url.
A use case would be to download web comics that are chapter based and save each chapter as a cbz, giving a start and end page for the given chapter.

Vgcats different xpaths

Hey, I notice that for vgcats the XPaths are different compared to other webcomics. Is there a way to make these work?

webcomix custom Vgcats --start-url="https://www.vgcats.com/comics/?strip_id=0" --image-xpath="/html/body/center/table/tbody/tr[5]/td/img" --next-page-xpath="/html/body/center/table/tbody/tr[2]/td[4]/a[2]"
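One likely factor, going by the "Making an XPath selector" section above: these expressions select the img and a elements themselves rather than their src/href attributes. A small lxml sketch (illustrative markup, not vgcats.com's actual page) showing the difference:

```python
from lxml import html

# Illustrative fragment, not vgcats.com's markup
page = html.fromstring("<td><img src='/strip.gif'></td>")

# Selecting the element returns Element objects, not URLs
elements = page.xpath("//td/img")
print(elements)  # a list of <Element img> objects

# Appending /@src yields the attribute value webcomix needs
urls = page.xpath("//td/img/@src")
print(urls)      # ['/strip.gif']
```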

Custom: Add a -y (yes) option

Looking to use this as a replacement for Dosage, as this allows for custom comics. I'd like to run this daily (or every few days) on a number of comics to pull the latest comic. The "Are you sure?" prompting on custom comics is a stumbling block to scripting it. Can you maybe add a -y to custom, for auto-acknowledging?

Custom Comics Don't Seem To Work

I spent a few hours trying to get some custom comics to work, just to test out the system, but was unable to get any to work. It's possible that it's my XPaths, but they work when I test them in Chrome. Here's an example:

webcomix custom --comic_name=EFT --start_url=https://www.bigheadpress.com/eft?page=1 --next_page_xpath=//*[@id="pagewrapper976"]/div[2]/div[4]/a/@href --image_xpath=//*[@id="pagewrapper"]/div[2]/table/tbody/tr/td[1]/div[1]/a/img/@src
Traceback (most recent call last):
  File "/home/neuman/.local/lib/python3.6/site-packages/webcomix/comic.py", line 101, in verify_xpath
    next_link = parsed_html.xpath(next_page)[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/neuman/.local/bin/webcomix", line 11, in <module>
    sys.exit(cli())
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/neuman/.local/lib/python3.6/site-packages/webcomix/main.py", line 122, in custom
    comic.comic_image_selector)
  File "/home/neuman/.local/lib/python3.6/site-packages/webcomix/comic.py", line 106, in verify_xpath
    Failed on URL: {}""".format(next_page, image, url))
Exception:

                Next page XPath: //*[@id=pagewrapper976]/div[2]/div[4]/a/@href

                Image XPath: //*[@id=pagewrapper]/div[2]/table/tbody/tr/td[1]/div[1]/a/img/@src

                Failed on URL: https://www.bigheadpress.com/eft?page=1

Am I doing something wrong?

Feature Request - Alt Text

Could you add an option to download the title text (alt text) of an xkcd comic and save it, say, in a text file with the name of the image?
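For reference, xkcd keeps its hover text in the img element's title attribute, so the usual image XPath with @title in place of @src reaches it. A minimal lxml sketch with illustrative markup:

```python
from lxml import html

# Illustrative fragment modeled on xkcd's comic markup
page = html.fromstring(
    "<div id='comic'><img src='/c/1.png' "
    "title='Hover text goes here' alt='Comic name'></div>"
)

hover_text = page.xpath("//div[@id='comic']//img/@title")
print(hover_text)  # ['Hover text goes here']
```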

Nerfnow "Enhancement"

I was wondering if you could add nerfnow to the list of webcomics, or try to integrate with the hive web comic network.

I tried doing it myself, but I am unfamiliar with XPath.

Feature. add --cookie to use browser cookies for comics that require a log in

I've found this software to be incredible and easy to use, and I love it. I'm asking because some comics require you to be logged in to a website in order to view the pages, so webcomix can't download them. I was thinking that maybe using the browser's cookies for the logged-in website would solve this problem.

How to use?

I installed webcomix using pip and can import it in Python, but I can't quite see how it should be used. There seems to be no CLI.

Feature Request - Titles

Is it possible to add an option to name the images after the title of the comic strip? As it is right now, it's difficult to tell which comic each image corresponds to.

Question: Script Stops after awhile?

I am using the CMD to start the script and it runs for a bit, but it always stops prematurely before downloading all the comics in the archive; where it stops is random, too. Am I doing something wrong? (I am using the latest version with the committed nerfnow addition)
