
codl / forget

147.0 147.0 8.0 8.46 MB

Continuous post deletion for twitter, mastodon, and misskey

Home Page: https://forget.codl.fr/

License: ISC License

Python 84.18% Mako 0.18% HTML 8.47% CSS 1.84% JavaScript 4.59% Shell 0.29% Procfile 0.05% Dockerfile 0.41%
mastodon twitter

forget's Introduction

Offsite projects:

  • Feedplz, a scraper and feed generator for Furaffinity and more.

    git clone https://f.codl.fr/feedplz.git

forget's People

Contributors

codacy-badger, codl, dependabot-preview[bot], dependabot-support, dependabot[bot], johann150, shibaobun

forget's Issues

mark accounts dormant when they have been throwing TemporaryErrors for a long time

There's already a way to make accounts dormant, but it only kicks in when we get a clear signal that the account won't ever work again, i.e. the token was revoked, the account was suspended, etc.

Some high-profile Mastodon instances closed down recently, meaning they either don't respond or respond 410 Gone. This is interpreted as a TemporaryError and causes a retry, but they won't ever be coming back. I think that after 2 weeks of temporary errors an account should be considered dormant and no tasks should be scheduled for it.
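A minimal sketch of what that could look like, assuming a hypothetical last_success timestamp column on the account model (nothing with that name exists today):

from datetime import datetime, timedelta

DORMANT_AFTER = timedelta(days=14)

def maybe_mark_dormant(account):
    # last_success would be updated whenever a fetch or delete for this
    # account completes without raising TemporaryError
    if (account.last_success is not None
            and datetime.utcnow() - account.last_success > DORMANT_AFTER):
        account.dormant = True

Queries elsewhere in this tracker already filter on ~Account.dormant, so flipping the flag should be enough to stop new tasks from being queued.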

replace interval selector ui with something cleaner

The interval selector is a weird mess. It's absolutely unnecessary for it to let users enter "20 years" just as easily as "6 hours" for their delete_every.

I'm thinking I can replace it with a non-linear slider like this:

(screenshot: example of a non-linear slider with numeric pips)

(but with time intervals instead of numbers)

noUiSlider seems like a decent implementation of this at first glance, but it doesn't have any keyboard accessibility built in: https://github.com/leongersen/noUiSlider

As of 2017-09-13 the most common values for intervals are:

forget=> select count(1), policy_keep_younger from accounts where accounts.policy_enabled = 't' group by policy_keep_younger having count(1) > 1 order by count desc;
 count | policy_keep_younger 
-------+---------------------
     3 | 10 days
     3 | 90 days
     3 | 60 days
     3 | 30 days
     3 | 365 days 05:49:12
     2 | 91 days 07:27:18
     2 | 31 days
     2 | 365 days
     2 | 00:00:00
     2 | 180 days
(10 rows)

forget=> select count(1), policy_delete_every from accounts where accounts.policy_enabled = 't' group by policy_delete_every having count(1) > 1 order by count desc;
 count | policy_delete_every 
-------+---------------------
    13 | 00:30:00
     7 | 00:01:00
     4 | 03:00:00
     2 | 02:00:00
     2 | 00:00:00
(5 rows)

I'd propose pips at:

  • 0, 1d, 10d, 1mo, 2mo, 3mo, 6mo, 1yr for keep_younger
  • 0, 1min, 30min, 1hr, 2hr, 3hr, 6hr for delete_every

The handle would snap to pips.

Space in between pips would interpolate linearly between the pip values, quantized to a reasonable unit (no one wants precisely 1yr 2mo 3d 4h 5min 6s).

Also, unlike the example above, the current value would need to be explicitly shown (the example did not interpolate, only snapped to pips), either as a bubble above or below the handle, or next to the slider.

(screenshot: mockup of the proposed slider)
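Mechanically, mapping slider position to an interval could work roughly like this (illustration only; the real widget would be JavaScript, but the snapping and quantizing logic is the same). The pip values are the delete_every pips proposed above:

from datetime import timedelta

PIPS = [timedelta(0), timedelta(minutes=1), timedelta(minutes=30),
        timedelta(hours=1), timedelta(hours=2), timedelta(hours=3),
        timedelta(hours=6)]

def value_for_position(position):
    """Map a slider position in [0, 1] to an interval.

    Positions between pips interpolate linearly between the neighbouring
    pip values, then get rounded to whole minutes so the result is never
    absurdly precise.
    """
    scaled = position * (len(PIPS) - 1)
    low = int(scaled)
    high = min(low + 1, len(PIPS) - 1)
    frac = scaled - low
    seconds = (PIPS[low].total_seconds() * (1 - frac)
               + PIPS[high].total_seconds() * frac)
    return timedelta(minutes=round(seconds / 60))

Snapping would then just mean rounding the handle position to the nearest pip position when it is released.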

deadlock in delete_from_account

DETAIL:  Process 21681 waits for ShareLock on transaction 8612212; blocked by process 21685.
Process 21685 waits for ShareLock on transaction 8612213; blocked by process 21681.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (6804,73) in relation "posts"
 [SQL: 'WITH latest AS \n(SELECT posts.created_at AS created_at, posts.updated_at AS updated_at, posts.id AS id, posts.author_id AS author_id, posts.favourite AS favourite, posts.has_media AS has_media, posts.direct AS direct, posts.favourites AS favourites, posts.reblogs AS reblogs, posts.is_reblog AS is_reblog \nFROM posts \nWHERE %(param_2)s = posts.author_id ORDER BY posts.created_at DESC \n LIMIT %(param_3)s)\n SELECT posts.created_at AS posts_created_at, posts.updated_at AS posts_updated_at, posts.id AS posts_id, posts.author_id AS posts_author_id, posts.favourite AS posts_favourite, posts.has_media AS posts_has_media, posts.direct AS posts_direct, posts.favourites AS posts_favourites, posts.reblogs AS posts_reblogs, posts.is_reblog AS posts_is_reblog \nFROM posts \nWHERE %(param_1)s = posts.author_id AND posts.created_at + %(created_at_1)s <= now() AND posts.id NOT IN (SELECT latest.id \nFROM latest) ORDER BY random() \n LIMIT %(param_4)s FOR UPDATE'] [parameters: {
 'param_1': 'REDACTED', 'created_at_1': datetime.timedelta(14), 'param_2': 'REDACTED', 'param_3': 500, 'param_4': 100}]

forget/tasks.py

Lines 233 to 237 in eae4d06

posts = (
    Post.query.with_parent(account, 'posts')
    .filter(Post.created_at + account.policy_keep_younger <= db.func.now())
    .filter(~Post.id.in_(db.select((latest_n_posts.c.id, ))))
    .order_by(db.func.random())
    .limit(100)
    .with_for_update()
    .all())
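One possible mitigation, assuming the deadlock comes from two overlapping runs locking the same account's rows in different (random) orders: take the FOR UPDATE locks in a deterministic order, or skip rows another transaction already holds. A sketch against the same query:

posts = (
    Post.query.with_parent(account, 'posts')
    .filter(Post.created_at + account.policy_keep_younger <= db.func.now())
    .filter(~Post.id.in_(db.select((latest_n_posts.c.id, ))))
    .order_by(Post.id)                       # deterministic lock order instead of random()
    .limit(100)
    .with_for_update(skip_locked=True)       # needs PostgreSQL 9.5+ and SQLAlchemy 1.1+
    .all())

The random sampling the current code wants could then happen in Python on the already-locked rows, e.g. random.shuffle(posts).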

remember a user's mastodon instance(s)

While showing the most common instances is sensible for a new user, a returning user most likely wants to log in to the same one again, so we should remember which one it is and show it (first? second after mastodon.social? what if there are several of them?) in the log-in buttons.

support for ActivityPub archives

Now that Mastodon supports exporting all of one's data to an archive, it would be neat to support those for parity with the Twitter side.

Because those archives contain media they will get much larger; mine is ~600 MB to date.

I wonder if I can read the archive from the client, strip the media, re-zip it and upload it.

Workers hang

Something is causing celery workers to hang

In the log below, only one of the workers, ForkPoolWorker-15, was still alive and processing tasks; it then froze until I manually restarted it several days later (!)

Sep 25 22:13:21 slyme.codl.fr bash[9950]: 22:13:19 worker.1 | [2017-09-25 22:13:19,751: WARNING/ForkPoolWorker-15] fetching <Account(XXX, XXX, XXX)>
Sep 25 22:13:21 slyme.codl.fr bash[9950]: 22:13:20 worker.1 | [2017-09-25 22:13:20,356: WARNING/ForkPoolWorker-15] fetching <Account(XXX, XXX, XXX)>
Sep 25 22:13:21 slyme.codl.fr bash[9950]: 22:13:21 worker.1 | [2017-09-25 22:13:21,252: WARNING/ForkPoolWorker-15] fetching <Account(XXX, XXX, XXX)>
Sep 30 15:56:55 slyme.codl.fr systemd[376]: Stopping Forget web application server...
Sep 30 15:57:00 slyme.codl.fr systemd[376]: Stopped Forget web application server.
Sep 30 15:57:00 slyme.codl.fr systemd[376]: Started Forget web application server.

Celery has built-in support for time limits, which would be a great first step. I'd also like to know what in the world is causing these workers to hang forever.
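For reference, a per-task limit could be declared like this (a sketch with a made-up task name; the real Celery app object lives in the project). The soft limit raises inside the task so cleanup can happen before the hard limit kills the worker process:

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

celery = Celery('forget')  # stand-in for the project's Celery app

@celery.task(soft_time_limit=60, time_limit=90)  # seconds; values are guesses
def fetch_account(account_id):
    try:
        pass  # the existing fetch logic would go here
    except SoftTimeLimitExceeded:
        # roll back / release resources here before the hard kill at 90s
        raise

The same limits can also be set globally via the task_soft_time_limit and task_time_limit settings rather than per task.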

proper db session isolation in workers

Right now workers just reuse DB sessions all willy-nilly and that's gross. I don't know if Celery has a way to run things before/after every task, or if I'm going to have to make my own decorator.
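Celery does expose hooks for this: the task_prerun/task_postrun signals. A minimal sketch, assuming the Flask-SQLAlchemy db object can be imported in the worker module:

from celery.signals import task_postrun
from app import db  # hypothetical import path for the project's db handle

@task_postrun.connect
def close_db_session(*args, **kwargs):
    # Flask-SQLAlchemy sessions are scoped sessions, so remove() discards the
    # current one and the next task on this worker starts from a clean slate.
    db.session.remove()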

task refresh_account_with_oldest_post is slow

this query:

forget/tasks.py

Lines 388 to 391 in 78c84ed

post = (Post.query
        .outerjoin(Post.author)
        .join(Account.tokens)
        .filter(Account.backoff_until < db.func.now())
        .filter(~Account.dormant)
        .group_by(Post)
        .order_by(db.asc(Post.updated_at))
        .first())

SELECT posts.created_at AS posts_created_at, posts.updated_at AS posts_updated_at, posts.id AS posts_id, posts.author_id AS posts_author_id, posts.favourite AS posts_favourite, posts.has_media AS posts_has_media, posts.direct AS posts_direct, posts.favourites AS posts_favourites, posts.reblogs AS posts_reblogs, posts.is_reblog AS posts_is_reblog 
    FROM posts
        LEFT OUTER JOIN accounts ON accounts.id = posts.author_id -- dunno why this isn't a plain join
        JOIN oauth_tokens ON accounts.id = oauth_tokens.account_id 
    WHERE accounts.backoff_until < now()
        AND NOT accounts.dormant
    GROUP BY posts.created_at, posts.updated_at, posts.id, posts.author_id, posts.favourite, posts.has_media, posts.direct, posts.favourites, posts.reblogs, posts.is_reblog
    ORDER BY posts.updated_at ASC 
    LIMIT 1

... is slow: 5 seconds on my production database and its 1.8M posts.

A majority of that time is spent sorting all the candidate posts.

EXPLAIN output
                                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=327303.91..327303.93 rows=1 width=83)
   ->  Group  (cost=327303.91..363028.72 rows=1299084 width=83)
         Group Key: posts.updated_at, posts.created_at, posts.id, posts.author_id, posts.favourite, posts.has_media, posts.direct, posts.favourites, posts.reblogs, posts.is_reblog
         ->  Sort  (cost=327303.91..330551.62 rows=1299084 width=83)
               Sort Key: posts.updated_at, posts.created_at, posts.id, posts.author_id, posts.favourite, posts.has_media, posts.direct, posts.favourites, posts.reblogs, posts.is_reblog
               ->  Hash Join  (cost=172.85..71061.01 rows=1299084 width=83)
                     Hash Cond: ((posts.author_id)::text = (accounts.id)::text)
                     ->  Seq Scan on posts  (cost=0.00..51109.60 rows=1813260 width=83)
                     ->  Hash  (cost=166.83..166.83 rows=481 width=50)
                           ->  Hash Join  (cost=141.67..166.83 rows=481 width=50)
                                 Hash Cond: ((oauth_tokens.account_id)::text = (accounts.id)::text)
                                 ->  Seq Scan on oauth_tokens  (cost=0.00..18.35 rows=535 width=25)
                                 ->  Hash  (cost=134.08..134.08 rows=607 width=25)
                                       ->  Seq Scan on accounts  (cost=0.00..134.08 rows=607 width=25)
                                             Filter: ((NOT dormant) AND (backoff_until < now()))
(15 rows)
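One possible direction (not verified against the real schema): the GROUP BY over every column only exists to serve group_by(Post) and adds nothing to a LIMIT 1 query, and with an index on posts.updated_at the planner could walk the index instead of sorting every candidate row. Roughly:

post = (Post.query
        .join(Post.author)            # plain join; the outer join gains nothing here
        .join(Account.tokens)
        .filter(Account.backoff_until < db.func.now())
        .filter(~Account.dormant)
        .order_by(db.asc(Post.updated_at))
        .first())

Joining oauth_tokens can still duplicate rows when an account has several tokens, but with LIMIT 1 that only affects which duplicate wins, not which post is oldest.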

Error returning to forget after cancelling authorization

After I click "log in with twitter" and get to the page that says "authorize forget.codl.fr to use your account?" and click cancel, there's a button that says "return to forget.codl.fr." I click it and it takes me to a page with this error:
"Bad Request

The browser (or proxy) sent a request that this server could not understand."

I really like forget and this isn't a big deal at all, but I figured I should mention it! 👍

unify button styles

(screenshots: the different button styles currently in use)

There are two, arguably three, different kinds of buttons. There should be only one.

handle missing anonymous oauth token more gracefully

https://sentry.io/codl/forget/issues/324754633/

Exception: OAuth token has expired
(3 additional frame(s) were not displayed)
...
  File "flask/_compat.py", line 33, in reraise
    raise value
  File "flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "forget/routes.py", line 52, in twitter_login_step2
    token = lib.twitter.receive_verifier(oauth_token, oauth_verifier, **app.config.get_namespace("TWITTER_"))
  File "lib/twitter.py", line 34, in receive_verifier
    raise Exception("OAuth token has expired")

Exception: OAuth token has expired
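A sketch of the graceful path. The route path, endpoint name and the dedicated exception are all hypothetical (today lib.twitter raises a bare Exception), and app/lib are the project's existing objects:

from flask import flash, redirect, request, url_for

@app.route('/login/twitter/callback')   # stand-in for the real route in routes.py
def twitter_login_step2():
    try:
        token = lib.twitter.receive_verifier(
            request.args['oauth_token'], request.args['oauth_verifier'],
            **app.config.get_namespace('TWITTER_'))
    except TokenExpiredError:            # hypothetical dedicated exception
        flash('That login link expired. Please try logging in again.')
        return redirect(url_for('index'))  # endpoint name assumed
    ...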

support full twitter archive format

Archive imports have been disabled because Twitter has discontinued the kind of archive that Forget knew about.

The new archive format is much larger, because it includes media, and it is a privacy nightmare because it includes everything Twitter knows about a user: DMs, ad targeting info, the whole mess. The old strategy of uploading the zip file and letting the server figure it out is not going to work.

The plan is to extract and parse archives in the browser, and send batches of statuses to the server.

Individual issues for tasks to be done as part of this project are #458, #459

Original issue text follows:


reported by [email protected] https://cybre.space/@rrix/100591445673683791

Hey, it looks like the Twitter archive format changed at some point in a way that makes it not work with Forget. There's no longer a data/js/tweets dir with monthly files, just a big JSONP file (67 MiB in my case) in the root of the zip.

Are boosts deleted? And the favourited ones?

I will ask my question in Mastodon vocabulary, hoping that it will be easily translatable for non-Mastodon users. My question is: does Forget delete boosts, i.e. shares of someone else's posts? And if I set “keep favourited posts” to true, will it delete boosts of posts that I also favourited?

Forget Only Media Posts

I'd really like to be able to have every post I make except media posts be permanent. Hopefully that's simple enough!

Nothing to expire

Hi,
Thank you for the work on Forget, it is really a very good idea. Too bad very few tools have a feature like this one: machines should forget more...
My problem is that I tried to set up expiration rules for my Twitter account today from https://forget.codl.fr/ but it targets zero tweets. It says exactly:

Currently keeping track of 0 of your posts, roughly 0 of which currently match your expiration rules.

I uploaded an archive and it returned:

0/72 months imported, 0 failed

Forget got the right authorizations from Twitter's side, I checked. It looks like something fails, but I don't know what to do. Thanks.

backoff

somewhat related to #34

Apply exponential backoff to Mastodon instances when we fail to talk to them. Immediately failing tasks when the backoff isn't over yet is probably enough, or not queuing fetches or deletes at all.

Have a separate backoff for each account, with some randomness to avoid a thundering herd when the server comes back.
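A rough sketch of the per-account backoff, assuming a hypothetical backoff_level counter next to the backoff_until column that already exists:

import random
from datetime import datetime, timedelta

MAX_BACKOFF = timedelta(days=1)

def bump_backoff(account):
    """Double the per-account delay on each failure, with jitter.

    backoff_level is a hypothetical counter column; backoff_until appears
    in the slow-query issue elsewhere in this tracker.
    """
    account.backoff_level = min(account.backoff_level + 1, 16)
    delay = min(timedelta(minutes=2 ** account.backoff_level), MAX_BACKOFF)
    # +/- 25% jitter so all accounts on one instance don't retry at once
    jitter = delay.total_seconds() * random.uniform(-0.25, 0.25)
    account.backoff_until = datetime.utcnow() + delay + timedelta(seconds=jitter)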

warn if archive looks too big

Follow-up to #64: there is potential for confusion between tweet archives and full account data archives.

The former is what we accept; the latter is usually huge because it includes all media. We should warn the user if it looks like they've selected a full archive in the upload form.

accounts with short delete_every but restrictive filters cause lots of refreshes

I've had the occasional complaint that Forget was making a lot of requests and I couldn't say why, but now I know.

In delete_from_account, this query:

forget/tasks.py

Lines 231 to 238 in 41683fc

latest_n_posts = (Post.query.with_parent(account, 'posts')
                  .order_by(db.desc(Post.created_at))
                  .limit(account.policy_keep_latest)
                  .cte(name='latest'))
posts = (
    Post.query.with_parent(account, 'posts')
    .filter(Post.created_at + account.policy_keep_younger <= db.func.now())
    .filter(~Post.id.in_(db.select((latest_n_posts.c.id, ))))
    .order_by(db.func.random())
    .limit(100)
    .all())

... does not apply any of the account's filters, and every post it returns is refreshed until one of them matches all the filters.

If someone has restrictive filters like "delete media only" or "delete faves only", then this will likely fetch 100 posts that don't match, make 100 requests to refresh them, and fail to delete anything.
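A sketch of pushing the filters into the query itself, using the column names visible above; the policy flag names are guesses:

query = (
    Post.query.with_parent(account, 'posts')
    .filter(Post.created_at + account.policy_keep_younger <= db.func.now())
    .filter(~Post.id.in_(db.select((latest_n_posts.c.id, )))))

if account.policy_keep_favourites:        # hypothetical flag name
    query = query.filter(~Post.favourite)
if account.policy_keep_media:             # hypothetical flag name
    query = query.filter(~Post.has_media)

posts = query.order_by(db.func.random()).limit(100).all()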

Support for json file [twitter]

Currently, the system only allows the upload of the zip archive.

But that archive contains all the media ever posted, so the zip is quite massive, and I guess there's no reason for it. Wouldn't the tweet.js JSON file alone be sufficient?

I tried to bypass it by making an archive out of just my JSON file, but I get "The file you uploaded is not a valid tweet archive. No posts have been imported."

More robust fetching

Right now fetching assumes that every post before the most recent known post has already been fetched. If there are no known posts then it will recurse until the remote service doesn't return any more posts, but otherwise it only fetches back to the most recent known post.

Unfortunately this means there is a failure state where posts were fetched from the most recent post down to some arbitrary post that isn't the oldest. This doesn't usually happen, but it can if a worker is killed or throws an unusual exception while fetching.

Recursing as far as possible every time we fetch isn't feasible due to rate limits.

I'm not sure how to fix this; maybe keep track of which time (or id?) ranges have been successfully fetched, and try to fetch whichever ones are missing?
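If it came to tracking ranges, the bookkeeping itself is small. A purely illustrative sketch of merging successfully fetched (oldest_id, newest_id) ranges so that whatever gaps remain are what still needs fetching:

def merge_ranges(ranges):
    """Collapse overlapping (oldest_id, newest_id) ranges that have been
    fetched successfully; nothing like this exists in the codebase yet."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged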

health check should check redis too

I believe /api/health_check only checks that Postgres is alive and well. Earlier today Redis ran out of disk space and stopped accepting writes; that should have made the health check fail.
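A sketch of extending the check (route path from the issue text; app and db are the project's existing Flask and Flask-SQLAlchemy objects, and the Postgres check is paraphrased). Doing a Redis write rather than just a ping means a full disk shows up as a failure:

import redis

@app.route('/api/health_check')
def health_check():
    db.session.execute('SELECT 1')    # existing-style Postgres liveness check
    r = redis.StrictRedis()           # connection settings elided
    r.set('health_check', '1')        # a write, so "out of disk" also fails
    return 'OK'

Any exception would propagate as a 500, which the monitoring side can treat as unhealthy.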

Support GNU social

I have an old GNU social account that I would love to purge messages from, on instances such as quitter.se and quitter.es.

account deletion

People should be able to delete all data associated with their account.

Pretty sure it's even a law, uh oh.

unify redis config

Right now there are, I think, three different things connecting to Redis (lib.brotli, flask-limiter, and celery), and all three have different ways of configuring where Redis is. For my current use it doesn't matter because I'm just using the default Redis unix socket, but there should really be a single config variable that sets all three at once.
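A sketch of what a single source of truth could look like. REDIS_URI is a made-up setting name, the exact keys each library reads depend on the versions in use, and the lib.brotli key is entirely hypothetical:

REDIS_URI = 'redis://localhost:6379/0'

CELERY_BROKER_URL = REDIS_URI        # fed to the Celery app as its broker
RATELIMIT_STORAGE_URL = REDIS_URI    # Flask-Limiter (key name varies by version)
BROTLI_CACHE_REDIS_URI = REDIS_URI   # hypothetical key for lib.brotli's cache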

Ignore fav on retweets

If you fav your own retweet of a tweet, then the RT post also gets faved. This isn't any different from just faving the original tweet in the UI, but it changes Forget's behaviour. Forget should just ignore favs on tweets that are RTs.
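The rule itself is tiny; a sketch using the fields visible elsewhere in this tracker (is_reblog, favourite):

def should_keep_as_favourite(post):
    # a fav on a post that is itself a retweet should not mark it as kept
    return post.favourite and not post.is_reblog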

Support alternate expiration policies

Such as with a regex or a keyword match. It should be able to match the post body, the CW text, or either.

Example: the default policy expires posts after 7 days; all posts that contain "announcement" expire after 30 days; all posts that contain "vent" in the CW expire after 1 hour.
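A purely illustrative sketch of the matching side; per-rule scoping to body versus CW is left out for brevity, and none of these names exist in the codebase:

import re
from datetime import timedelta

RULES = [
    (re.compile('announcement', re.I), timedelta(days=30)),
    (re.compile('vent', re.I),         timedelta(hours=1)),
]
DEFAULT_LIFETIME = timedelta(days=7)

def lifetime_for(body, cw_text):
    # first matching rule wins; fall back to the account's default policy
    for pattern, lifetime in RULES:
        if pattern.search(body or '') or pattern.search(cw_text or ''):
            return lifetime
    return DEFAULT_LIFETIME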

surface information about first-time fetch being in progress

Now that the historical_fetch_finished column, or whatever it is called, exists, it would be nice to surface that in the UI, for example with a message that says "Your past statuses are now being fetched from the service. This may take a while, so why not set up your expiration policy while you wait? You can close this page and Forget will continue to fetch your statuses and delete them when they reach expiration."

Although maybe that last sentence is better suited to somewhere it will still be seen after the first fetch is complete.

keep posts with a hashtag

As suggested by hoodie

Option to preserve posts that have a specific hashtag (e.g. #mastoart).

Would require parsing & storing the hashtags found in the post (and a privacy policy update)

It could be useful to also allow the opposite, i.e. delete only posts with a specific hashtag?
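Extraction itself is cheap; a rough sketch (for Mastodon, the API's structured tags list would be a better source than regexing the rendered body):

import re

HASHTAG_RE = re.compile(r'#(\w+)')

def extract_hashtags(text):
    # lowercase so "#MastoArt" and "#mastoart" match the same keep rule
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}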

big "keep posts by default" "delete posts by default" switch

So, something that came up is that there's no way to keep reblogs if you want to.

This is starting to make for a lot of buttons and a confusing interface, so I think a way to clean everything up is to have a big master switch between "Keep posts" / "Delete posts".

Then every setting can be "unless older/newer than", "unless faved", "unless it has media".

This is a pretty big one. Writing a SQL migration for this is going to be fun 😬

"latest x posts" is unreliable

If some posts have been deleted externally and have not been refreshed since, then the number of posts we have won't match the real count. This probably doesn't matter in most cases, but it is annoying if you're trying to make your post count a funny number.

Delete favs

Although harder to crawl, there's no doubt that favourites are a source of information: by logging favs one could deduce not only interests but also waking hours, account activity, account age, and so on.
These are things that Forget would usually make impossible to crawl at a later time.

Please add the ability to delete favs in a similar fashion to toots & boosts.

radio strips are not accessible to switch users on android

(screenshot: the account switcher radio strips)

These are what I call radio strips.

They're keyboard-accessible thanks to a hack, and apparently the hack is too hacky for Android's switch access support because it just passes over them completely.

They're still accessible through the crosshair thing, but that's messy.
