killedmufasa / amputatorbot

Remove AMP from your URLs. AmputatorBot is a highly specialised Reddit and Twitter bot that automatically replies to comments, submissions and tweets containing AMP URLs with the canonical link(s). It's also available as a website and REST API. See also: https://www.reddit.com/r/AmputatorBot/comments/ehrq3z/why_did_i_build_amputatorbot/.

Home Page: https://www.amputatorbot.com/

License: GNU General Public License v3.0

Python 100.00%

amp amp-html reddit reddit-bot amputatorbot bot praw praw-reddit reddit-api open

amputatorbot's Introduction

TL;DR: Remove AMP from your URLs. AmputatorBot is a highly specialised Reddit (and former Twitter) bot that automatically replies to comments and submissions containing AMP URLs with the canonical link(s). It's also available as a website and REST API, but those haven't been made open source here.

FAQ, About & Why

Features

Main features:

- 10 specialised canonical-finding methods, allowing for an accuracy rate of +97%. For example, by:
  - Scanning the HTML contents
  - Detecting and following redirects
  - Guessing, and then checking article similarity with newspaper
  - … and many more!
- Detect AMP links using 14 patterns, and reply to items containing them with the canonical link and some info
- Compare and test canonicals and pick the best
- Stream Reddit comments, submissions and inbox messages
- Extensively tested using a (private) database of over 200K AMP links and their canonicals, also functioning as caching
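
To illustrate the first method in the list, scanning the HTML contents, here is a minimal sketch that looks for a `<link rel="canonical">` tag using only the standard library. The class and function names are hypothetical, and this is not the bot's actual implementation, which combines several methods.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first canonical is kept.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

A page that declares no canonical link yields `None`, which is where the other methods (redirects, guessing and checking) would have to take over.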

Nice bonuses:

- Detect unique URLs with URLExtract and strip them of any artifacts
- Object-oriented, allowing for a handy, free and publicly available API
- Allow users to opt out and undo this
- Send DMs when summoned by a user
- Items interacted with are automatically being tracked
- Log and datafiles are automatically generated

Set up

- Clone the repository
- Run pip install -r requirements.txt to install dependencies
- Change the filename of static.txt to .py (see /static)
- Configure the application by tweaking static.py (required)
- Choose a check-[...].py script to run
- Configure the script's settings in run_bot(). Set everything (guess_and_check, reply_to_post, save_to_database) to False when starting out. Consider deleting or disabling the database canonical method.
- Run the script - All logs and required datafiles should be automatically and dynamically created.
- Stop the script.
- Check out the new files in /data and edit them to your liking.
- Re-run the script and enjoy!

Support the project

Summon AmputatorBot on Reddit, like so: u/AmputatorBot. For more info, see here.
Give feedback: Most new features and improvements are directly influenced by your feedback. So, hit me up if you have any. Contact me on Reddit or file an issue.
Star: By starring the project here on GitHub, we can reach more folks and unlock new options. It also gives me something to brag about :p
Contribute: Pull requests are a great way to contribute directly to the code and functionality.
Spread the word: In the end, the only goal of AmputatorBot is to allow people to have an informed choice. You can help by simply spreading the word!

Sponsor

The server for the bot, website, and API costs about €10 ($12) per month. If you support AmputatorBot's mission and can chip in, any donation would be a huge help. Every bit goes straight into server expenses. Thanks a bunch!

PayPal: https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=EU6ZFKTVT9VH2
Or, donate to our friends in Ukraine instead: https://u24.gov.ua

From the bottom of my heart, huge thanks for the tremendous support! <3

amputatorbot's People

Forkers

pokechu22 gschizas dusky3 nickjamesnorris planet-express-labs mgitre miguelx413 mapsai danilomatosdev bladexo

amputatorbot's Issues

Suggestions Of Changes

This list isn't a complete list but it does include some really important changes.

Miscellaneous

Just have one license file. You don't need both LICENSE and LICENSE.txt. This should be easy enough to fix.
File names should be clearer. Mentions_bot, comment_bot and submissions_bot aren't names that clearly differentiate the scripts. They should probably all be one file anyway, with different threads. You can read more on how multithreading works in Python here.
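
As a rough sketch of the "one file with different threads" idea, the following runs several stream loops concurrently with the standard threading module. The function names and the stand-in item lists are hypothetical; a real bot would iterate over PRAW streams instead.

```python
import threading


def run_stream(name, items, results):
    """Stand-in for a stream loop; a real bot would iterate over a PRAW stream."""
    for item in items:
        results.append((name, item))


def run_all_streams(streams):
    """Start one thread per named stream and wait for all of them to finish."""
    results = []  # list.append is thread-safe in CPython
    threads = [
        threading.Thread(target=run_stream, args=(name, items, results))
        for name, items in streams.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling `run_all_streams({"comments": [...], "submissions": [...], "mentions": [...]})` would replace three separately launched scripts, although the order of the collected results is not guaranteed.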

Code Style

Reading through the code, there's a number of things that should be addressed in terms of coding style.

Very long one-line strings

In Python, you can actually make multi-line strings that are much easier to read.

cleaner_message = """This message is very long.
In this format, newlines are newlines in the code instead of \n.
so I can make a multi-line message actually be multiple lines in code instead of 
one long line"""

Or you can do something like this if you still want it to be just one line:

cleaner_message = ("this message is going to be way too long to put on one "
                   "line of code. In this method, cleaner_message will still be one line, "
                   "but I can write it out in more lines in code")

I'd recommend following the PEP8 style guidelines (editors can verify that you are following them). Using a consistent style is important, and PEP8 is widespread, so it's a good one to use.
- Gives other people reading your code a reference point
- With PEP8 checks enabled, your editor will warn you when your lines get too long. PEP8 limits lines to 79 characters.
- Editors/IDEs nicely support PEP8

There's other feedback to give, but I think this feedback is at least a good start.

Suggestion: Add a public REST API that amputates links

I'd love to be able to use your amputation in my own projects, so it would be really cool if there was a REST API that takes a URL and amputates it.
I know this is a lot to ask and I would offer my help but I don't know Python at all, so I'm just throwing this suggestion out here in case you or anyone else would like to implement it.

Example:

GET to https://amputatorbot.com/api/?url=https%3A%2F%2Fgoogle.com%2Famp%2Fs%2Ffoo.com%2Fbar

would return something like:

{
	"error": false,
	"url": "https://foo.com/bar"
}
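
For illustration only, the requested behaviour for the example URL could be sketched as below. The function name is hypothetical, and the sketch only unwraps Google AMP viewer URLs of the form google.com/amp/s/<host>/<path>; it ignores query strings and every other kind of AMP link.

```python
import json
from urllib.parse import urlsplit

GOOGLE_HOSTS = ("google.com", "www.google.com")
VIEWER_PREFIX = "/amp/s/"  # the "s" marks an https origin in the viewer URL


def amputate(url):
    """Return a response dict shaped like the example in this suggestion."""
    parts = urlsplit(url)
    if parts.hostname in GOOGLE_HOSTS and parts.path.startswith(VIEWER_PREFIX):
        return {"error": False, "url": "https://" + parts.path[len(VIEWER_PREFIX):]}
    return {"error": True, "url": None}


def amputate_json(url):
    """Serialise the response the way a REST endpoint might."""
    return json.dumps(amputate(url))
```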

Config template doesn't contain sample headers

In the function random_headers there's a reference to a config.headers field, which doesn't exist inside the template_config file, therefore you can't really run any script.

Even though the implementation should be trivial, it would be better if the sample list of headers actually existed in the template.
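
A sample of what the template could contain is sketched below. The header values are placeholders, and the shape of random_headers is an assumption based on its name; the real function may differ.

```python
import random

# Hypothetical entries; in the project these would live in the config template.
headers = [
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"},
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0"},
]


def random_headers(available=headers):
    """Return one randomly chosen header dict from the configured list."""
    return random.choice(available)
```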

Website and api uses cloudflare

Hello,
I've noticed that the website and the API use Cloudflare.

This is a problem because of the way Cloudflare works, the fact that the site is inaccessible through Tor, and general privacy concerns.

See http://crimeflare.eu.org for more information.

False Positive when URL ends in 'amp' and has query params

Observed here: https://www.reddit.com/r/mtgcube/comments/103dc4d/is_it_just_me_or_do_people_seem_to_be/j2y91j1/

Then attempted to verify the behavior using https://www.amputatorbot.com/
It appears to be a false positive occurring when the url ends in amp and has query params trailing that.

Flagged:

https://scryfall.com/card/clb/870/skullclamp?utm_source=mtgcardfetcher

Not Flagged:

https://scryfall.com/card/clb/870/skullclamp
https://scryfall.com/card/clb/870/skullclamps?utm_source=mtgcardfetcher

And then interestingly this errors out with a 500 on the website:

https://scryfall.com/card/clb/870/skullclamp?
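
The flagged and unflagged URLs above are consistent with a plain substring test such as "amp?" in url, although that is an assumption about the cause. A minimal sketch of a stricter heuristic follows, in which "amp" only counts when it stands alone in the host or path, or appears as a query key. The names are hypothetical, and the heuristic can still misfire on sites that legitimately use /amp/ as a path segment.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# "amp" must not be glued to other letters or digits on either side.
AMP_TOKEN = re.compile(r"(?<![a-z0-9])amp(?![a-z0-9])", re.IGNORECASE)


def naive_check(url):
    """Substring test that reproduces the reported false positive."""
    return "amp?" in url


def stricter_check(url):
    """Flag only stand-alone "amp" tokens in host/path, or an "amp" query key."""
    parts = urlsplit(url)
    if AMP_TOKEN.search(parts.netloc) or AMP_TOKEN.search(parts.path):
        return True
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return any(key.lower() == "amp" for key, _ in pairs)
```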

False negative on amputatorbot.com when Javascript is not available

When visiting amputatorbot.com with Firefox and Noscript enabled and trying to remove AMP from https://www.bloomberg.com/amp/news/articles/2021-01-16/cyberpunk-2077-what-caused-the-video-game-s-disastrous-rollout only "Error: That doesn't look like a valid AMP URL!" is displayed. Also the "Try with an example"-Button does not do anything. After enabling JS/whitelisting in Noscript it works.

Online version not working due to full storage at PythonAnywhere

Hey, I've been trying to reach the online version for a while and I'm getting the same error, have you also noticed!?

URL encoding breaks the API

https://www.amputatorbot.com/api/v1/convert?gac=true&md=3&q=https://amp.cnn.com/cnn/us/live-news/damar-hamlin-collapse-bills-bengals-game-intl-hnk/index.html yields:

[{'amp_canonical': None, 'canonical': {'domain': 'cnn', 'is_alt': False, 'is_amp': False, 'is_cached': None, 'is_valid': True, 'type': 'REL', 'url': 'https://www.cnn.com/us/live-news/damar-hamlin-collapse-bills-bengals-game-intl-hnk/index.html', 'url_similarity': 0.9473684210526315}, 'canonicals': [{'domain': 'cnn', 'is_alt': False, 'is_amp': False, 'is_cached': None, 'is_valid': True, 'type': 'REL', 'url': 'https://www.cnn.com/us/live-news/damar-hamlin-collapse-bills-bengals-game-intl-hnk/index.html', 'url_similarity': 0.9473684210526315}, {'domain': 'cnn', 'is_alt': False, 'is_amp': False, 'is_cached': None, 'is_valid': True, 'type': 'OG_URL', 'url': 'https://www.cnn.com/us/live-news/damar-hamlin-collapse-bills-bengals-game-intl-hnk/index.html', 'url_similarity': 0.9473684210526315}, {'domain': 'cnn', 'is_alt': False, 'is_amp': False, 'is_cached': None, 'is_valid': True, 'type': 'SCHEMA_MAINENTITY', 'url': 'https://www.cnn.com/us/live-news/damar-hamlin-collapse-bills-bengals-game-intl-hnk/index.html', 'url_similarity': 0.9473684210526315}], 'origin': {'domain': 'cnn', 'is_amp': True, 'is_cached': False, 'is_valid': True, 'url': 'https://amp.cnn.com/cnn/us/live-news/damar-hamlin-collapse-bills-bengals-game-intl-hnk/index.html'}}]

whereas
https://www.amputatorbot.com/api/v1/convert?gac=True&md=3&q=https%3A%2F%2Famp.cnn.com%2Fcnn%2Fus%2Flive-news%2Fdamar-hamlin-collapse-bills-bengals-game-intl-hnk%2Findex.html yields:

{'error_message': "Error: Entry doesn't meet criteria (no AMP link detected)", 'result_code': 'error_no_amp'}
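
Both requests should carry the same q value once the query string is decoded. The sketch below shows that equivalence with the standard library. It does not diagnose the server-side cause; the endpoint and parameter names are copied from the requests above, and the helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_BASE = "https://www.amputatorbot.com/api/v1/convert"  # taken from the report


def build_request_url(amp_url):
    """Build a request URL with a percent-encoded q parameter."""
    query = urlencode({"gac": "true", "md": 3, "q": amp_url})
    return f"{API_BASE}?{query}"


def read_q(request_url):
    """Decode the q parameter the way a server-side framework typically would."""
    return parse_qs(urlsplit(request_url).query)["q"][0]
```

If the decoded values match, as they do here, AMP detection presumably needs to run on the decoded value rather than on the raw query string.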

False Positive in Long Query String

I guess one of the urls here https://www.reddit.com/r/metamagic/comments/17hk4ql/color_spotlight_for_10272023_boros/ is being detected as an amp link?

False positive on twitter links

On reddit, this bot is incorrectly being triggered by many twitter links, because of a tracking entry in the URL. For instance, this URL triggered in this comment on r/spacex:
https://mobile.twitter.com/SpaceX/status/1300037857793290243?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Etweet

False positive, AMP tried to amputate an already canonical link

This problem was found in this thread: https://www.reddit.com/r/hockey/comments/i36dqw/kuzy_scores_on_pp_to_tie_game_at_2/g09gkwn/?context=2

The original comment was quickly deleted for reasons I'm not clear on.

What appears to have happened is it found the string "amp" in a link (https://www.gfycat.com/**amp**lepaltrycardinal) and attempted to cut it down to a canonical link despite it already being so.

AmputatorBot doesn't work with thelocal.it pages

AmputatorBot couldn't reply to the comment or submission you summoned it for.

AmputatorBot ran into the following error: there were no canonical URLs found.

This error has been logged and is being investigated. Common causes for this error are: bot- and geoblocking websites and badly implemented AMP specs.

Feel free to leave feedback by contacting u/killed_mufasa, by posting on r/AmputatorBot or by opening an issue on GitHub.

You're a very good human for trying <3

NEW: With AmputatorBot.com you can remove AMP from your URLs in just one click! You could try it again there but it will probably raise an error again: https://AmputatorBot.com/?https://www.google.com/amp/s/www.thelocal.it/20191115/five-italian-police-officers-jailed-over-death-of-stefano-cucchi/amp

The url

https://www.google.com/amp/s/www.thelocal.it/20191115/five-italian-police-officers-jailed-over-death-of-stefano-cucchi/amp

Strip user agent

Strip out (and change) the user agent from the source files. This will stop issues with other scripts using it and thus Reddit being cranky.

Can either use environment variables, or add it to the config file
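
A minimal sketch of the environment-variable option is shown below. The variable name AMPUTATORBOT_USER_AGENT and the fallback string are placeholders, not values used by the project.

```python
import os

# Placeholder fallback; each deployment should supply its own user agent.
DEFAULT_USER_AGENT = "script:example-amp-bot:v0.1 (by u/your_username)"


def get_user_agent(environ=os.environ):
    """Read the user agent from AMPUTATORBOT_USER_AGENT (hypothetical name)."""
    value = environ.get("AMPUTATORBOT_USER_AGENT", "").strip()
    return value or DEFAULT_USER_AGENT
```

The returned string could then be passed as the user_agent argument when creating the praw.Reddit instance, so no real user agent needs to be committed to the repository.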

Some suggestions

Hello! I want to preface this by thanking you for making this bot. I appreciate this kind of initiative. That being said, I read that you were new to Python when making this, so I looked over it and have a few ideas for improvements. I could go through and make some changes, but I figured you may prefer to architect this yourself.

Line 30 of submissions_bot.py and line 36 of comment_bot.py both contain a similar line:
for _ in r.subreddit('amputatorbot+audio+chrome+degoogle+europe+google+firefox+gaming+history+programming+robotics+security+seo+tech+technology+test+todayilearned+worldnews')

This is a really long, and I might say unruly, line. The contents (the subreddit names) happen to be the same in both places, and I'm assuming they will stay the same in the future. This means that to update the list, somebody has to remember to change it in both spots. Instead, perhaps you could do something like the following (this would probably go in a separate file):

def allowed_subreddits():
    return ["amputatorbot", "audio", "chrome", "degoogle", ... ]

def format_allowed_subreddits():
    return "+".join(allowed_subreddits())

Doing this could help you print the list of subreddits beforehand with something like print("Obtaining the stream of subreddits", ", ".join(allowed_subreddits()))

This could also support other options for specifying the subreddits in the future. You could change allowed_subreddits to read from a file instead if you chose.
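
Reading the list from a file might look like the sketch below. The file name subreddits.txt and the one-name-per-line format are assumptions made for illustration.

```python
from pathlib import Path


def allowed_subreddits(path="subreddits.txt"):
    """Read one subreddit name per line, skipping blank lines and # comments."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    names = (line.strip() for line in lines)
    return [name for name in names if name and not name.startswith("#")]


def format_allowed_subreddits(path="subreddits.txt"):
    """Join the names with "+" as expected by r.subreddit(...)."""
    return "+".join(allowed_subreddits(path))
```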

There are plenty of spots where you can use similar tricks to make the code easier to read and improve in the future. Another good example could be an is_amp_url function to avoid writing if "/amp" in submission.url or ".amp" in submission.url or "amp." in submission.url or "?amp" in submission.url or "amp?" in submission.url or "=amp" in submission.url or "amp=" in submission.url and "https://" in submission.url: whenever it is needed (although this may not be the best name for the function, since in another spot you're looking through an entire comment. Perhaps contains_amp_url would be better?). In general it is considered a good idea to not repeat yourself. That's one reason why laziness is one of the three great virtues of a programmer.
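
Such a helper could be sketched as follows, reusing the same substring markers as the quoted condition. One detail worth noting: in the quoted condition, and binds more tightly than or, so the "https://" test only applies to the final "amp=" check. The sketch assumes the intent was to require "https://" for every marker, which is a guess.

```python
AMP_MARKERS = ("/amp", ".amp", "amp.", "?amp", "amp?", "=amp", "amp=")


def contains_amp_url(text):
    """Return True if the text has "https://" and any of the AMP markers."""
    return "https://" in text and any(marker in text for marker in AMP_MARKERS)
```

The same function works for submission.url and for a whole comment body, so the condition only has to be maintained in one place.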

I'm happy to help make some changes if you'd like. Thanks again for putting your time into developing this.

Bot does not recognize Google pages on sites besides google.com

Hey, quick thing I noticed when I summoned the bot:

The bot will recognize when AMP links are hosted on google.com and add a little note to the end of the message: https://www.reddit.com/r/worldnews/comments/extdrc/thailand_cures_coronavirus_with_antihiv_drug/fge6u3g/?context=3

However, it will not recognize Google websites on foreign-country domains, such as google.co.uk, which is what happened when I summoned it (comment is invisible in the subreddit but shows on the profile): https://www.reddit.com/r/nextfuckinglevel/comments/exqwpv/they_actually_did_it_a_1000_bed_hospital/fge9t18/?context=3
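
One possible way to cover country-specific domains is a hostname pattern instead of a literal google.com comparison. The sketch below is a heuristic with hypothetical names: it accepts google.<tld> and google.<sld>.<cctld>, so it could also match lookalike hosts that happen to fit that shape.

```python
import re
from urllib.parse import urlsplit

# google.<tld> or google.<sld>.<cctld>, optionally prefixed by "www."
GOOGLE_HOST = re.compile(r"^(?:www\.)?google\.[a-z]{2,3}(?:\.[a-z]{2})?$")


def is_google_host(url):
    """Return True if the URL's hostname looks like a Google search domain."""
    host = (urlsplit(url).hostname or "").lower()
    return bool(GOOGLE_HOST.match(host))
```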

Anyways, thank you for your work on the bot!