webcomix's Introduction

webcomix

Build Status · Coverage Status · PyPI version

Description

webcomix is a webcomic downloader that can additionally create a .cbz (Comic Book ZIP) file once downloaded.

Notice

This program is for personal use only. Please be aware that by making the downloaded comics publicly available without the permission of the author, you may be infringing upon various copyrights.

Installation

Dependencies

  • Python (3.8 or newer)
  • click
  • scrapy (additional platform-specific installation steps may be required; see the Scrapy installation docs)
  • scrapy-splash
  • scrapy-fake-useragent
  • tqdm
  • Docker (required to download JavaScript-dependent websites with the -j option)

Process

End user

  1. Install Python 3
  2. Install the command line interface tool with pip install webcomix

Developer

  1. Install Python 3
  2. Clone this repository and open a terminal in its directory
  3. Install poetry with pip install poetry
  4. Download the dependencies by running poetry install
  5. Install pre-commit hooks with pre-commit install

Usage

webcomix [OPTIONS] COMMAND [ARGS]

Global Flags

help

Show the help message and exit.

version

Show the version number and exit.

Commands

comics

Shows all predefined comics which can be used with the download command.

download

Downloads a predefined comic. Supports the --cbz flag, which creates a .cbz archive of the downloaded comic.

search

Searches for an XPath that can download the whole comic. Supports the --cbz flag, which creates a .cbz archive of the downloaded comic, -s, which verifies only the provided page of the comic, -y, which skips the verification prompt, and -j, which runs the JavaScript on pages before downloading.

custom

Downloads a user-defined comic. To download a specific comic, you'll need a link to the first page, an XPath expression that yields the link to the next page, and an XPath expression that yields the link to the image. More info here. Supports the --cbz flag, which creates a .cbz archive of the downloaded comic, -s, which verifies only the provided page of the comic, and -y, which skips the verification prompt.

Examples

  • webcomix download xkcd
  • webcomix search xkcd --start-url=http://xkcd.com/1/
  • webcomix custom --cbz (You will be prompted about other needed arguments)
  • webcomix custom xkcd --start-url=http://xkcd.com/1/ --next-page-xpath="//a[@rel='next']/@href" --image-xpath="//div[@id='comic']//img/@src" --cbz (Same as before, but with all arguments declared beforehand)

Making an XPath selector

Using an HTML inspector, locate an HTML path to the next link's href attribute and to the comic image's src attribute.

e.g.: //div[@class='foo']/img/@src will select the src attribute of every img directly inside a div whose class is foo.
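To check a selector outside the browser, here is a minimal sketch using lxml (the HTML parser Scrapy relies on); the markup and the foo class are illustrative, not taken from a real comic site:

```python
from lxml import html

# Illustrative page fragment, not a real comic site's markup
page = html.fromstring("""
<div class='foo'><img src='/comics/001.png'></div>
<a rel='next' href='/comic/2'>Next</a>
""")

# Attribute XPaths return lists of string values
image_urls = page.xpath("//div[@class='foo']/img/@src")
next_links = page.xpath("//a[@rel='next']/@href")

print(image_urls)  # ['/comics/001.png']
print(next_links)  # ['/comic/2']
```

If either list comes back empty, the selector did not match anything and needs adjusting before you hand it to webcomix.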

Note: webcomix works best on static websites, since Scrapy (the framework we use to crawl web pages) doesn't process JavaScript; for JavaScript-dependent sites, use the -j option, which requires Docker.

To make sure your XPath is correct, you can test it in the Scrapy shell, which is installed alongside webcomix.

scrapy shell <website>           --> Opens a shell with the website's page loaded.
> response.body                  --> Shows the HTML from the website.
> response.xpath('<expression>') --> Tests an XPath selection. If you get [], your XPath expression hasn't matched anything on the page.

Contribution

The procedure depends on the type of contribution:

  • If you simply want to request the addition of a comic to the list of supported comics, make an issue with the label "Enhancement".
  • If you want to request the addition of a feature to the system or a bug fix, make an issue with the appropriate label.

Running the tests

To run the tests, you have to use the pytest command in the webcomix folder.

webcomix's People

Contributors

dependabot[bot] · j-cpelletier · thereverend403


webcomix's Issues

Only first image is downloaded

I can't seem to figure out an expression that works for --next-page-xpath. The one I'm using for --image-xpath seems to work fine, though. I've tested both and I can't see what's wrong. Here's a transcript of an interactive session (same results when passing the strings as arguments):

% webcomix custom --cbz "Dead Man at Devil's Cove"
Start url: https://jonnycrossbones.com/comic/page-001a/
Image xpath: //*[@id="comic"]/a/img/@src
Next page xpath: string(//*[@id="sidebar-under-comic"]/div[1]/table/tbody/tr/td[3]/a[2]/@href)
Page 1:
Page URL: https://jonnycrossbones.com/comic/page-001a/
Image URLs:
https://jonnycrossbones.com/wp-content/uploads/2015/06/001A-Dig.png

Verify that the links above are correct.
Are you sure you want to proceed? [y/N]: y
Downloading page https://jonnycrossbones.com/comic/page-001a/
Saving image https://jonnycrossbones.com/wp-content/uploads/2015/06/001A-Dig.png
Finished downloading the images.

I'm not sure if I'm expected to use string() here, but if I don't, it says "Could not find next link". With string() it doesn't give any error, and as you can see the first image downloads correctly, but it seems not to even try advancing to the next page.
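For context on string() here: in XPath, string() collapses a node-set to a single text value, so an empty match comes back as '' rather than [], which can hide a selector that matched nothing. A small sketch with lxml (illustrative markup, not the actual jonnycrossbones.com page):

```python
from lxml import html

# Illustrative fragment, not the actual site's markup
page = html.fromstring("<td><a href='/page-2'>Next</a></td>")

# A node-set expression returns a list of matched values
print(page.xpath("//a/@href"))          # prints: ['/page-2']

# string() collapses the result to a single string; a miss
# becomes '' instead of [], masking a failed selector
print(page.xpath("string(//a/@href)"))  # prints: /page-2
print(page.xpath("string(//b/@href)"))  # prints an empty string
```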

Custom Not Working

Hello,

I've tried a couple custom sites and I can't seem to get it to work. Here are my parameters:
webcomix custom sdamned --start-url=https://www.sdamned.com/comic/prologue --next-page-xpath="//a[@rel='next']/@href" --image-xpath="//div[@id='cc-comicbody']/a/img/@src"
and
webcomix custom funnyfarm --start-url="https://web.archive.org/web/20190719121109/http://funnyfarmcomics.com/index.php?date=2009-01-01" --next-page-xpath="//li[@class='nextlink']/a/@href" --image-xpath="//div[@id='comic-image']/img/@src"

If I open up a scrapy shell and run response.xpath("//div[@id='cc-comicbody']/a/img/@src") for instance, it outputs
Selector xpath="//div[@id='cc-comicbody']/a/img/@src" data='https://www.sdamned.com/comics/153381...'
which appears to be a fully working link. I verified all the xpath parameters and they all seem to work in scrapy, but when I try to run it in webcomix, I get the following error:
sdamned could not be accessed with webcomix.
Chances are the website you're trying to download images from doesn't want to be scraped.
Aborted!

For the sdamned one, I would understand somewhat, but there are known scrapers for Internet Archive, so I don't think the issue is the site blocking scraping.

Feature request: Skip pages

I follow some webcomics that just throw other webcomics into the main rotation. It would be nice if pages could be downloaded or skipped based on an allowlist/blocklist condition (also pausing the page index incrementation).

Bug with custom parameters on 3.8.0

It looks like there are some issues with the latest release version of Webcomix. Below is output from the terminal.

(pv) [noel@noelarch ~]$ webcomix --version
webcomix, version 3.8.0

(pv) [noel@noelarch ~]$ webcomix custom xkcd --start-url=http://xkcd.com/1/ --end-url=http://xkcd.com/5/ --next-page-xpath="//a[@rel='next']/@href" --image-xpath="//div[@id='comic']//img/@src" --cbz
Traceback (most recent call last):
  File "/home/noel/pv/bin/webcomix", line 8, in <module>
    sys.exit(cli())
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/noel/pv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/noel/pv/lib/python3.11/site-packages/webcomix/cli.py", line 249, in custom
    validation = comic.verify_xpath()
  File "/home/noel/pv/lib/python3.11/site-packages/webcomix/comic.py", line 164, in verify_xpath
    verification = worker.start()
  File "/home/noel/pv/lib/python3.11/site-packages/webcomix/scrapy/crawler_worker.py", line 46, in start
    raise result[0]
ValueError: XPath error: Invalid expression in http://xkcd.com/5/

Add --end-url

Add a feature similar to --start-url that specifies the last page of a comic to download, possibly --end-url.
A use case would be to download web comics that are chapter based and save each chapter as a cbz, giving a start and end page for the given chapter.

Vgcats different xpaths

Hey, I notice that for vgcats the XPaths are different compared to other webcomics. Is there a way to make these work?

webcomix custom Vgcats --start-url="https://www.vgcats.com/comics/?strip_id=0" --image-xpath="/html/body/center/table/tbody/tr[5]/td/img" --next-page-xpath="/html/body/center/table/tbody/tr[2]/td[4]/a[2]"
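One likely factor, going by the "Making an XPath selector" section above: these expressions select the img and a elements themselves rather than their src/href attributes. A small lxml sketch (illustrative markup, not vgcats.com's actual page) showing the difference:

```python
from lxml import html

# Illustrative fragment, not vgcats.com's markup
page = html.fromstring("<td><img src='/strip.gif'></td>")

# Selecting the element returns Element objects, not URLs
elements = page.xpath("//td/img")
print(elements)  # a list of <Element img> objects

# Appending /@src yields the attribute value webcomix needs
urls = page.xpath("//td/img/@src")
print(urls)      # ['/strip.gif']
```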

Custom: Add a -y (yes) option

Looking to use this as a replacement for Dosage, as this allows for custom comics. I'd like to run this daily (or every few days) on a number of comics to pull the latest comic. The "Are you sure?" prompting on custom comics is a stumbling block to scripting it. Can you maybe add a -y to custom, for auto-acknowledging?

Custom Comics Don't Seem To Work

I spent a few hours trying to get some custom comics to work, just to test out the system, but was unable to get any to work. It's possible that it's my XPaths, but they work when I test them in Chrome. Here's an example:

webcomix custom --comic_name=EFT --start_url=https://www.bigheadpress.com/eft?page=1 --next_page_xpath=//*[@id="pagewrapper976"]/div[2]/div[4]/a/@href --image_xpath=//*[@id="pagewrapper"]/div[2]/table/tbody/tr/td[1]/div[1]/a/img/@src
Traceback (most recent call last):
  File "/home/neuman/.local/lib/python3.6/site-packages/webcomix/comic.py", line 101, in verify_xpath
    next_link = parsed_html.xpath(next_page)[0]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/neuman/.local/bin/webcomix", line 11, in <module>
    sys.exit(cli())
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/neuman/.local/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/neuman/.local/lib/python3.6/site-packages/webcomix/main.py", line 122, in custom
    comic.comic_image_selector)
  File "/home/neuman/.local/lib/python3.6/site-packages/webcomix/comic.py", line 106, in verify_xpath
    Failed on URL: {}""".format(next_page, image, url))
Exception:

                Next page XPath: //*[@id=pagewrapper976]/div[2]/div[4]/a/@href

                Image XPath: //*[@id=pagewrapper]/div[2]/table/tbody/tr/td[1]/div[1]/a/img/@src

                Failed on URL: https://www.bigheadpress.com/eft?page=1

Am I doing something wrong?

Feature Request - Alt Text

Could you add an option to download the title text (alt text) of an xkcd comic and save it, say, in a text file with the name of the image?
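For reference, xkcd keeps its hover text in the img element's title attribute, so the usual image XPath with @title in place of @src reaches it. A minimal lxml sketch with illustrative markup:

```python
from lxml import html

# Illustrative fragment modeled on xkcd's comic markup
page = html.fromstring(
    "<div id='comic'><img src='/c/1.png' "
    "title='Hover text goes here' alt='Comic name'></div>"
)

hover_text = page.xpath("//div[@id='comic']//img/@title")
print(hover_text)  # ['Hover text goes here']
```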

Nerfnow "Enhancement"

I was wondering if you could add nerfnow to the list of webcomics, or try to integrate with the hive web comic network.

I tried doing it myself, but I am unfamiliar with XPath.

Feature. add --cookie to use browser cookies for comics that require a log in

I've found this software to be incredible and easy to use, and I love it. I'm asking because some comics require you to be logged in to a website in order to view the pages, so webcomix can't download them. I was thinking that maybe using the browser's cookies for the logged-in website would solve this problem.

How to use?

I installed webcomix using pip and can import it in Python, but I can't quite see how it should be used. There seems to be no CLI.

Feature Request - Titles

Is it possible to add an option to name the images after the title of the comic strip? As it is right now, it's difficult to tell which comic each image corresponds to.

Question: Script Stops after awhile?

I am using the CMD to start the script and it runs for a bit, but it always stops prematurely before downloading all the comics in the archive; where it stops is random, too. Am I doing something wrong? (I am using the latest version with the committed nerfnow addition)
