artskydj / comicsrss.com

RSS feeds for comics

Home Page: https://www.comicsrss.com

Languages: JavaScript 1.26%, HTML 98.52%, CSS 0.20%, Batchfile 0.02%
Topics: comic, comics, rss, gocomics, feed, rss-generator, arcamax, hacktoberfest

comicsrss.com's Introduction

comicsrss.com

Source code for the site generator and RSS feed generator for comicsrss.com.

All of the site's content also lives in this repository, since the site is hosted on GitHub Pages.

Support Me

If you'd like to help keep this site going, you can send me a few bucks using Patreon. I'd really appreciate it!

Technical Details

I have received many requests to add more comic series to the site. However, my time is limited. So if you want to help out, you can make a scraper!

To add comic series to Comics RSS, it helps to understand the basics of how it works.

Comics RSS has two parts: the scrapers and the site generator. Each scraper parses a different comic website and writes a temporary JSON file to disk. The site generator reads those temporary JSON files and writes static HTML/RSS files to disk.

How scrapers work

The scrapers make https requests to a website (for example, https://www.gocomics.com), parse the responses, and write temporary JSON files to the disk.

On a multi-comic site like https://www.gocomics.com, a scraper has to get the list of comic series (e.g. Agnes, Baby Blues, Calvin and Hobbes, etc). For example, the scraper might request and parse https://www.gocomics.com/comics/a-to-z.

Then, for each comic series, it fetches the most recent comic strip and walks backwards through the previous days' strips. When it finds a strip it has already seen, it moves on to the next comic series, until it has finished the whole website.

Finally, it writes the list of comic series, each with its list of strips, to a temporary JSON file on disk.
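
As a rough sketch of that flow (not the repository's actual code; fetchSeriesList, fetchStrip, previousDate, and the file path below are hypothetical stand-ins):

```js
const fs = require('fs')

// knownStripUrls is assumed to be a Set of strip URLs seen on a previous run
async function scrapeSite(knownStripUrls) {
	const seriesList = await fetchSeriesList() // e.g. parse https://www.gocomics.com/comics/a-to-z
	const results = []

	for (const series of seriesList) {
		const strips = []
		let date = new Date() // start at today and walk backwards

		while (true) {
			const strip = await fetchStrip(series.slug, date) // request and parse one day's page
			if (!strip || knownStripUrls.has(strip.url)) break // stop at a strip we've already seen
			strips.push(strip)
			date = previousDate(date)
		}

		results.push({ title: series.title, slug: series.slug, strips })
	}

	// Hand the scraped data to the site generator via a temporary JSON file
	fs.writeFileSync('tmp/gocomics.json', JSON.stringify(results, null, '\t'))
}
```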

How the site generator works

The site generator reads the temporary JSON files made by the scrapers into one big list of comic series, each with its list of comic strips. It then uses templates to generate an index.html file and one rss/{comic}.rss file per series.
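
Roughly speaking (again just a sketch; renderIndex and renderRssFeed are hypothetical template helpers, not the repository's real ones):

```js
const fs = require('fs')
const path = require('path')

function generateSite(tmpDir, outDir) {
	// Merge every scraper's temporary JSON file into one big list of series
	const allSeries = fs.readdirSync(tmpDir)
		.filter(file => file.endsWith('.json'))
		.flatMap(file => JSON.parse(fs.readFileSync(path.join(tmpDir, file), 'utf8')))

	// One index.html listing every series...
	fs.writeFileSync(path.join(outDir, 'index.html'), renderIndex(allSeries))

	// ...and one RSS feed per series
	for (const series of allSeries) {
		fs.writeFileSync(path.join(outDir, 'rss', series.slug + '.rss'), renderRssFeed(series))
	}
}
```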

When these updated/new files are committed and pushed to this repository, they get hosted on gh-pages, which is how you view the site today.

Run locally

  1. Fork and clone the repository
  2. Run these commands on your command line:
# in /comicsrss.com
npm install

cd _generator

# If you want to see all the options:
# node bin --help

# Re-generate the site with the cached scraped site data:
node bin --generate

# If you want to run the scrapers (takes a while) then run this:
# node bin --scrape --generate

# I have nginx serving up my whole code directory, so I can go to http://localhost:80/comicsrss.com/
# If you don't have anything similar set up, you can try:
cd ..
npx serve
# Then open http://localhost:3000 in your browser

Run your own auto-updating scraper and website using CircleCI

  1. Fork the repository
  2. Create a GitHub deploy key, then add it to GitHub and to CircleCI
  3. Change .circleci/config.yml from my username, email, and key fingerprint to your username, email, and key fingerprint
  4. Enable the repo in CircleCI
  5. I think that's it? Make a PR if you attempt the above steps and I missed something!

Scraper API

To create a scraper for a single-series website that shows multiple days' comic strips per web page, copy the code from dilbert.js and change it as needed.

To create a scraper for a multi-series website, copy the code from arcamax.js and change it as needed.

If you're not sure which to use, probably start from arcamax.js, or feel free to open a GitHub issue to discuss it with me.
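
For orientation only, a multi-series scraper boils down to something like the sketch below. The URL, regex, and returned shape are made up, and it assumes Node 18+ for the global fetch; the real interface is whatever arcamax.js exports, so copy that file rather than this.

```js
// Purely illustrative scraper skeleton; everything here is hypothetical
module.exports = async function scrapeExampleSite() {
	const response = await fetch('https://comics.example.com/a-to-z')
	const html = await response.text()

	// Pull the series links out of the index page (a real scraper would use a
	// proper HTML parser rather than a quick regex)
	const seriesList = [...html.matchAll(/<a href="\/comics\/([\w-]+)">([^<]+)<\/a>/g)]
		.map(([, slug, title]) => ({ slug, title, strips: [] }))

	// ...then fetch and parse each series' recent strips, as described above...
	return seriesList
}
```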

License

MIT

comicsrss.com's People

Contributors

artskydj, dependabot[bot], nnisarggada


comicsrss.com's Issues

Add feedly and rss buttons

<a href='http://cloud.feedly.com/#subscription%2Ffeed%2Fhttp%3A%2F%2Fwww.comicsrss.com%2Frss%2Fcalvinandhobbes.rss'  target='blank'><img id='feedlyFollow' src='http://s3.feedly.com/img/follows/feedly-follow-logo-black_2x.png' alt='follow us in feedly' width='28' height='28'></a>

A Feedly follow button like the one above (example button images omitted).

Also add an RSS button that copies the RSS link, with some sort of "Copied!" feedback text.

<svg xmlns="http://www.w3.org/2000/svg" version="1.1" width="128px" height="128px" id="RSSicon" viewBox="0 0 256 256">
<rect width="256" height="256" rx="50" ry="50" x="0" y="0" fill="#F49C52"/>
<circle cx="68" cy="189" r="24" fill="#FFF"/>
<path d="M160 213h-34a82 82 0 0 0 -82 -82v-34a116 116 0 0 1 116 116z" fill="#FFF"/>
<path d="M184 213A140 140 0 0 0 44 73 V 38a175 175 0 0 1 175 175z" fill="#FFF"/>
</svg>

Add feed preview icons

You might want to read a few comics instead of just blindly subscribing.

Add a button to the left of the rss button for going to the gocomics page for that comic.


One option is to use the gocomics logo, but I don't think that would be more clear. And there are probably IP issues with that. And if I ever support a site other than gocomics, I would have to use different icons, or something ugly like that. And I don't really like their logo. And I don't think it would work well at 24px square.


I'm thinking some sort of "eye" icon, like GitHub's notification "watching" icon (screenshot omitted).

The SVG is here:

<svg aria-hidden="true" class="octicon octicon-eye" height="16" version="1.1" viewBox="0 0 16 16" width="16">
	<path fill-rule="evenodd" d="M8.06 2C3 2 0 8 0 8s3 6 8.06 6C13 14 16 8 16 8s-3-6-7.94-6zM8 12c-2.2 0-4-1.78-4-4 0-2.2 1.8-4 4-4 2.22 0 4 1.8 4 4 0 2.22-1.78 4-4 4zm2-4c0 1.11-.89 2-2 2-1.11 0-2-.89-2-2 0-1.11.89-2 2-2 1.11 0 2 .89 2 2z"></path>
</svg>

but there might be IP issues...


Found a similar one on wikimedia:

<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24">
	<path d="M12 8c-5 0-11 6-11 6s6 6 11 6 11-6 11-6-6-6-11-6zm0 10c-2.2 0-4-1.8-4-4s1.8-4 4-4 4 1.8 4 4-1.8 4-4 4z"/>
	<circle cx="12" cy="14" r="2"/>
</svg>

I think this option is the best

Make the website decent

The website is just an auto-generated README.md.

Finding feeds

It's hard to find the feed you want from a list of 300+ links. Maybe there could be buttons like
A - B - C - D - E - F - G - H - I - J ... X - Y - Z that would send you to the right anchor.
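
As a sketch of what that could look like (not the site's actual code; it assumes each letter group in the list already has an anchor id like "c"):

```js
// Build an A–Z jump bar and put it at the top of the page
const nav = document.createElement('nav')
for (const letter of 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') {
	const link = document.createElement('a')
	link.href = '#' + letter.toLowerCase() // jumps to the anchor for that letter
	link.textContent = letter
	nav.appendChild(link)
	nav.appendChild(document.createTextNode(' '))
}
document.body.prepend(nav)
```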

Differ from README

The GitHub README should explain the project, not be an alternate place to get the feed links.

Links to me

The site should link to the GitHub repo and to my personal site.

Search

It would have to be JavaScript-based. Take each search word, and add the hidden class to any li element whose li.innerHTML.toLowerCase().indexOf(searchWord) === -1 (i.e. hide the items that don't match).
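
A minimal sketch of that filter (assuming a hidden CSS class that sets display: none; the selectors are guesses, not the site's actual markup):

```js
const searchInput = document.querySelector('input.search')

searchInput.addEventListener('input', () => {
	const words = searchInput.value.toLowerCase().split(/\s+/).filter(Boolean)

	for (const li of document.querySelectorAll('ul li')) {
		const text = li.textContent.toLowerCase()
		// Hide the item unless every search word appears somewhere in it
		const matches = words.every(word => text.indexOf(word) !== -1)
		li.classList.toggle('hidden', !matches)
	}
})
```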

Responsive

More buzzwords, and stuff

Buy a domain for this

This will also be a backwards-incompatible change. Maybe I can add a note to the end of the RSS feeds saying that I will be changing URLs.

Generate site from my VPS

This is surprisingly annoying to accomplish because of pushing to GitHub. I need a way to authenticate the VPS; generating a new SSH key is probably the easiest and best option.

The other part is getting a cron job working.

While setting this up, write a script so that if I ever move to another VPS, I can just run the script again.

Tests

This will take some work, since I didn't build this as well as I should've.

I want both unit tests and integration tests.

  • unit tests
    • each file in _generator, except index.js and generate....bat
  • integration tests:
    • Test against a locally-hosted version of the site with fewer files. Just delete most of sitemap.xml.
    • Assert that the output feed files match.
  • both:
    • Need to take a hostname upon initialization; certain places just assume www.gocomics.com.

gocomics.com is blocking requests

I accessed gocomics.com from my computer, then ran this job, and afterwards I was blocked from gocomics.com. A few hours later, I was no longer blocked. My guess is that they have rate limiting in place, or they are checking the User-Agent header.

To Do:

  1. Find a way around it
  2. Try to hit their servers less
    a. Maybe just keep the last 3 days comics instead of 5?
    b. Limit the request rate to avoid excessive traffic? (See the throttling sketch below.)
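
A sketch of what request throttling could look like (an assumption about one approach, not the repo's current behavior; assumes Node 18+ for the global fetch):

```js
const delay = ms => new Promise(resolve => setTimeout(resolve, ms))

async function fetchPolitely(urls, msBetweenRequests = 2000) {
	const pages = []
	for (const url of urls) {
		pages.push(await fetch(url).then(res => res.text())) // one request at a time
		await delay(msBetweenRequests) // then pause before the next one
	}
	return pages
}
```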

Fri, 26 May 2017 02:16:00 -0400

Auto packing the repository for optimum performance. You may also
run "git gc" manually. See "git help gc" for more information.
events.js:141
throw er; // Unhandled 'error' event
^

Error: socket hang up
at createHangUpError (_http_client.js:200:15)
at Socket.socketOnEnd (_http_client.js:292:23)
at emitNone (events.js:72:20)
at Socket.emit (events.js:166:7)
at endReadableNT (_stream_readable.js:913:12)
at nextTickCallbackWith2Args (node.js:442:9)
at process._tickCallback (node.js:356:17)

Don't throw when unable to parse comicImageUrl

A common issue is that gocomics.com does not have a comicImageUrl right away. This causes the generator to throw an error, which causes me to get an email about it, which causes a github issue to be opened. Of the 8 issues opened with the cron job, only 2 so far (#14, #19) have been unrelated to parsing the comicImageUrl.

  • Perhaps the feed just shouldn't be generated that time?
  • Maybe if the feed doesn't exist yet, then the issue can be ignored?
  • Maybe if it happens to fewer than 1 in 10 then it is ok?

Most likely it isn't actually the comicImageUrl, as it is just an invalid page. Note that comicImageUrl comes first in the validation: get-comic-pages.js, line 42.

Related #20, #18, #17, #16, #15, #13, #12 .
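
A sketch of the "skip it this run" option above (hypothetical names; parseComicImageUrl and the data shape are not the repository's actual code):

```js
function buildStripOrSkip(series, pageHtml) {
	try {
		return { series: series.slug, comicImageUrl: parseComicImageUrl(pageHtml) }
	} catch (err) {
		// gocomics.com sometimes has no comicImageUrl yet: log it and move on,
		// instead of failing the whole cron run and opening a GitHub issue
		console.warn('Skipping ' + series.slug + ' this run: ' + err.message)
		return null
	}
}
```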

Better search

  • An ampersand (&) and the word "and" should be treated as equivalent in a search.
  • If someone searches for "by", it should not match every single comic, like it does now.
  • Searching "en espanol" should pull up comics tagged "en Español".
  • The search should split the query into words and look up each word separately (see the sketch below).
    E.g. "Cow Mark" should get Cow and Boy Classics by Mark Leiknes and Lucky Cow by Mark Pett.
  • Partial words should continue to be OK if they're the beginning of a word.
    E.g. "Cow Ch" should get 2 Cows and a Chicken by Steve Skelton and CowTown by Charlie Podrebarac.
  • If you type in 4 search words and no comics match all 4, but some match 3 of them, pull those up?
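
A sketch of the word-splitting and accent-folding parts (an idea only, not the site's current search code):

```js
function matchesSearch(entryText, query) {
	const normalize = str => str
		.toLowerCase()
		.normalize('NFD').replace(/[\u0300-\u036f]/g, '') // "Español" -> "espanol"
		.replace(/&/g, ' and ')                           // treat "&" and "and" as equivalent
	const escape = word => word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

	const entry = normalize(entryText)
	// Every query word must match the beginning of some word in the entry
	return normalize(query).split(/\s+/).filter(Boolean)
		.every(word => new RegExp('\\b' + escape(word)).test(entry))
}

// e.g. matchesSearch('Cow and Boy Classics by Mark Leiknes', 'cow mark') === true
```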

Cron Daemon Thu, 15 Jun 2017 03:52:40 -0400

Thu, 15 Jun 2017 03:52:40 -0400
Auto packing the repository for optimum performance. You may also
run "git gc" manually. See "git help gc" for more information.
comic no longer exists

Add install script

Add an install script... I think it would look something like this:

(Should I explain the git setup in this file?)

cd ~
git clone git@github.com:ArtskydJ/comicsrss.com.git
crontab -l > ./crontab.txt
echo "[email protected]" >> ./crontab.txt
echo "# Runs at 1:15 CDT. It would work at 12:15, but I don't want to" >> ./crontab.txt
echo "# have to change it for DST. Not sure if I would have to or not..." >> ./crontab.txt
echo "15 2 * * * sh /root/comicsrss.com/_generator/generate-and-push.sh" >> ./crontab.txt
crontab ./crontab.txt

(Should I have quotes around the email address? Do I need the sh command, or can I just call the .sh file?)

Move rss files to a subdirectory

To do this without breaking everyone's feeds, I would need to have 301 redirects. Not sure if there's a way to do that with gh-pages.

Make the colors consistent

There are multiple grays on the page.

| selector | part | color | css rule |
| --- | --- | --- | --- |
| input.search | underline | #dddddd | border-bottom: 1px solid #ddd |
| input.search:focus | underline | #888888 | border-bottom: 1px solid #888 |
| input.search | placeholder | #757575 | Not sure if there is a css rule attached. |
| .icon-link | icon | #808080 | opacity: 0.5; bg-color: #fff; color: #000 |
| ul | underline | #808080 | border-top: 1px solid gray |
| li | underline | #808080 | border-bottom: 1px solid gray |

Make it consistent.

Fri, 9 Jun 2017 02:15:08 -0400

events.js:141
throw er; // Unhandled 'error' event
^

Error: read ECONNRESET
at exports._errnoException (util.js:870:11)
at TCP.onread (net.js:552:26)
