
codl / forget

147.0 147.0 8.0 8.46 MB

Continuous post deletion for twitter, mastodon, and misskey

Home Page: https://forget.codl.fr/

License: ISC License

Python 84.18% Mako 0.18% HTML 8.47% CSS 1.84% JavaScript 4.59% Shell 0.29% Procfile 0.05% Dockerfile 0.41%
mastodon twitter

forget's Introduction

Offsite projects:

  • Feedplz, a scraper and feed generator for Furaffinity and more.

    git clone https://f.codl.fr/feedplz.git

forget's People

Contributors

codacy-badger, codl, dependabot-preview[bot], dependabot-support, dependabot[bot], johann150, shibaobun

forget's Issues

mark accounts dormant when they have been throwing TemporaryErrors for a long time

There's already a way to make accounts dormant, but it only kicks in when we get a clear signal that the account won't ever work again, i.e. the token was revoked, the account was suspended, etc.

Some high-profile Mastodon instances closed down recently, meaning they either don't respond or respond 410 Gone. This is interpreted as a TemporaryError and causes a retry, but they won't ever be coming back. I think that after 2 weeks of temporary errors an account should be considered dormant and no tasks should be scheduled for it.
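A minimal sketch of what that could look like, assuming a hypothetical last_success timestamp column on the account model (nothing with that name exists today):

from datetime import datetime, timedelta

DORMANT_AFTER = timedelta(days=14)

def maybe_mark_dormant(account):
    # last_success would be updated whenever a fetch or delete for this
    # account completes without raising TemporaryError
    if (account.last_success is not None
            and datetime.utcnow() - account.last_success > DORMANT_AFTER):
        account.dormant = True

Queries elsewhere in this tracker already filter on ~Account.dormant, so flipping the flag should be enough to stop new tasks from being queued.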

replace interval selector ui with something cleaner

The interval selector is a weird mess. It's absolutely unnecessary for it to let users enter "20 years" just as easily as "6 hours" for their delete_every.

I'm thinking I can replace it with a non-linear slider like this:

(screenshot: example of a non-linear slider with numeric pips)

(but with time intervals instead of numbers)

noUiSlider seems like a decent implementation of this at first glance, but it doesn't have any keyboard accessibility built in: https://github.com/leongersen/noUiSlider

As of 2017-09-13 the most common values for intervals are:

forget=> select count(1), policy_keep_younger from accounts where accounts.policy_enabled = 't' group by policy_keep_younger having count(1) > 1 order by count desc;
 count | policy_keep_younger 
-------+---------------------
     3 | 10 days
     3 | 90 days
     3 | 60 days
     3 | 30 days
     3 | 365 days 05:49:12
     2 | 91 days 07:27:18
     2 | 31 days
     2 | 365 days
     2 | 00:00:00
     2 | 180 days
(10 rows)

forget=> select count(1), policy_delete_every from accounts where accounts.policy_enabled = 't' group by policy_delete_every having count(1) > 1 order by count desc;
 count | policy_delete_every 
-------+---------------------
    13 | 00:30:00
     7 | 00:01:00
     4 | 03:00:00
     2 | 02:00:00
     2 | 00:00:00
(5 rows)

I'd propose pips at:

  • 0, 1d, 10d, 1mo, 2mo, 3mo, 6mo, 1yr for keep_younger
  • 0, 1min, 30min, 1hr, 2hr, 3hr, 6hr for delete_every

The handle would snap to pips.

Space in between pips would interpolate linearly between the pip values, quantized to a reasonable unit (no one wants precisely 1yr 2mo 3d 4h 5min 6s).

Also, unlike the example above, the current value would need to be explicitly shown (the example did not interpolate, only snapped to pips), either as a bubble above or below the handle, or next to the slider.

(screenshot: mockup of the proposed slider)
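Mechanically, mapping slider position to an interval could work roughly like this (illustration only; the real widget would be JavaScript, but the snapping and quantizing logic is the same). The pip values are the delete_every pips proposed above:

from datetime import timedelta

PIPS = [timedelta(0), timedelta(minutes=1), timedelta(minutes=30),
        timedelta(hours=1), timedelta(hours=2), timedelta(hours=3),
        timedelta(hours=6)]

def value_for_position(position):
    """Map a slider position in [0, 1] to an interval.

    Positions between pips interpolate linearly between the neighbouring
    pip values, then get rounded to whole minutes so the result is never
    absurdly precise.
    """
    scaled = position * (len(PIPS) - 1)
    low = int(scaled)
    high = min(low + 1, len(PIPS) - 1)
    frac = scaled - low
    seconds = (PIPS[low].total_seconds() * (1 - frac)
               + PIPS[high].total_seconds() * frac)
    return timedelta(minutes=round(seconds / 60))

Snapping would then just mean rounding the handle position to the nearest pip position when it is released.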

deadlock in delete_from_account

DETAIL:  Process 21681 waits for ShareLock on transaction 8612212; blocked by process 21685.
Process 21685 waits for ShareLock on transaction 8612213; blocked by process 21681.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (6804,73) in relation "posts"
 [SQL: 'WITH latest AS \n(SELECT posts.created_at AS created_at, posts.updated_at AS updated_at, posts.id AS id, posts.author_id AS author_id, posts.favourite AS favourite, posts.has_media AS has_media, posts.direct AS direct, posts.favourites AS favourites, posts.reblogs AS reblogs, posts.is_reblog AS is_reblog \nFROM posts \nWHERE %(param_2)s = posts.author_id ORDER BY posts.created_at DESC \n LIMIT %(param_3)s)\n SELECT posts.created_at AS posts_created_at, posts.updated_at AS posts_updated_at, posts.id AS posts_id, posts.author_id AS posts_author_id, posts.favourite AS posts_favourite, posts.has_media AS posts_has_media, posts.direct AS posts_direct, posts.favourites AS posts_favourites, posts.reblogs AS posts_reblogs, posts.is_reblog AS posts_is_reblog \nFROM posts \nWHERE %(param_1)s = posts.author_id AND posts.created_at + %(created_at_1)s <= now() AND posts.id NOT IN (SELECT latest.id \nFROM latest) ORDER BY random() \n LIMIT %(param_4)s FOR UPDATE'] [parameters: {
 'param_1': 'REDACTED', 'created_at_1': datetime.timedelta(14), 'param_2': 'REDACTED', 'param_3': 500, 'param_4': 100}]

forget/tasks.py

Lines 233 to 237 in eae4d06

posts = (
    Post.query.with_parent(account, 'posts')
    .filter(Post.created_at + account.policy_keep_younger <= db.func.now())
    .filter(~Post.id.in_(db.select((latest_n_posts.c.id, ))))
    .order_by(db.func.random())
    .limit(100)
    .with_for_update()
    .all())
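One possible mitigation, assuming the deadlock comes from two overlapping runs locking the same account's rows in different (random) orders: take the FOR UPDATE locks in a deterministic order, or skip rows another transaction already holds. A sketch against the same query:

posts = (
    Post.query.with_parent(account, 'posts')
    .filter(Post.created_at + account.policy_keep_younger <= db.func.now())
    .filter(~Post.id.in_(db.select((latest_n_posts.c.id, ))))
    .order_by(Post.id)                       # deterministic lock order instead of random()
    .limit(100)
    .with_for_update(skip_locked=True)       # needs PostgreSQL 9.5+ and SQLAlchemy 1.1+
    .all())

The random sampling the current code wants could then happen in Python on the already-locked rows, e.g. random.shuffle(posts).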

remember a user's mastodon instance(s)

While showing the most common instances is sensible for a new user, a returning user most likely wants to log in to the same one again, so we should remember which one it is and show it (first? second after mastodon.social? what if there are several of them?) in the log-in buttons.

support for ActivityPub archives

Now that Mastodon supports exporting all of one's data to an archive, it would be neat to support those for parity with the Twitter side.

Because those archives contain media they will get much larger; mine is ~600 MB to date.

I wonder if I can read the archive from the client, strip the media, re-zip it and upload it.

Workers hang

Something is causing celery workers to hang

In the log below, only one of the workers, ForkPoolWorker-15, was still alive and processing tasks; it then froze until I manually restarted it several days later (!)

Sep 25 22:13:21 slyme.codl.fr bash[9950]: 22:13:19 worker.1 | [2017-09-25 22:13:19,751: WARNING/ForkPoolWorker-15] fetching <Account(XXX, XXX, XXX)>
Sep 25 22:13:21 slyme.codl.fr bash[9950]: 22:13:20 worker.1 | [2017-09-25 22:13:20,356: WARNING/ForkPoolWorker-15] fetching <Account(XXX, XXX, XXX)>
Sep 25 22:13:21 slyme.codl.fr bash[9950]: 22:13:21 worker.1 | [2017-09-25 22:13:21,252: WARNING/ForkPoolWorker-15] fetching <Account(XXX, XXX, XXX)>
Sep 30 15:56:55 slyme.codl.fr systemd[376]: Stopping Forget web application server...
Sep 30 15:57:00 slyme.codl.fr systemd[376]: Stopped Forget web application server.
Sep 30 15:57:00 slyme.codl.fr systemd[376]: Started Forget web application server.

Celery has built-in support for time limits, which would be a great first step. I'd also like to know what in the world is causing these workers to hang forever.
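For reference, a per-task limit could be declared like this (a sketch with a made-up task name; the real Celery app object lives in the project). The soft limit raises inside the task so cleanup can happen before the hard limit kills the worker process:

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

celery = Celery('forget')  # stand-in for the project's Celery app

@celery.task(soft_time_limit=60, time_limit=90)  # seconds; values are guesses
def fetch_account(account_id):
    try:
        pass  # the existing fetch logic would go here
    except SoftTimeLimitExceeded:
        # roll back / release resources here before the hard kill at 90s
        raise

The same limits can also be set globally via the task_soft_time_limit and task_time_limit settings rather than per task.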

proper db session isolation in workers

Right now workers just reuse DB sessions all willy-nilly and that's gross. I don't know if Celery has a way to run things before/after every task, or if I'm going to have to make my own decorator.
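Celery does expose hooks for this: the task_prerun/task_postrun signals. A minimal sketch, assuming the Flask-SQLAlchemy db object can be imported in the worker module:

from celery.signals import task_postrun
from app import db  # hypothetical import path for the project's db handle

@task_postrun.connect
def close_db_session(*args, **kwargs):
    # Flask-SQLAlchemy sessions are scoped sessions, so remove() discards the
    # current one and the next task on this worker starts from a clean slate.
    db.session.remove()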

task refresh_account_with_oldest_post is slow

this query:

forget/tasks.py

Lines 388 to 391 in 78c84ed

post = (Post.query
        .outerjoin(Post.author)
        .join(Account.tokens)
        .filter(Account.backoff_until < db.func.now())
        .filter(~Account.dormant)
        .group_by(Post)
        .order_by(db.asc(Post.updated_at))
        .first())

SELECT posts.created_at AS posts_created_at, posts.updated_at AS posts_updated_at, posts.id AS posts_id, posts.author_id AS posts_author_id, posts.favourite AS posts_favourite, posts.has_media AS posts_has_media, posts.direct AS posts_direct, posts.favourites AS posts_favourites, posts.reblogs AS posts_reblogs, posts.is_reblog AS posts_is_reblog 
    FROM posts
        LEFT OUTER JOIN accounts ON accounts.id = posts.author_id -- dunno why this isn't a plain join
        JOIN oauth_tokens ON accounts.id = oauth_tokens.account_id 
    WHERE accounts.backoff_until < now()
        AND NOT accounts.dormant
    GROUP BY posts.created_at, posts.updated_at, posts.id, posts.author_id, posts.favourite, posts.has_media, posts.direct, posts.favourites, posts.reblogs, posts.is_reblog
    ORDER BY posts.updated_at ASC 
    LIMIT 1

... is slow: 5 seconds on my production database and its 1.8M posts.

A majority of that time is spent sorting all the candidate posts.

EXPLAIN output
                                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=327303.91..327303.93 rows=1 width=83)
   ->  Group  (cost=327303.91..363028.72 rows=1299084 width=83)
         Group Key: posts.updated_at, posts.created_at, posts.id, posts.author_id, posts.favourite, posts.has_media, posts.direct, posts.favourites, posts.reblogs, posts.is_reblog
         ->  Sort  (cost=327303.91..330551.62 rows=1299084 width=83)
               Sort Key: posts.updated_at, posts.created_at, posts.id, posts.author_id, posts.favourite, posts.has_media, posts.direct, posts.favourites, posts.reblogs, posts.is_reblog
               ->  Hash Join  (cost=172.85..71061.01 rows=1299084 width=83)
                     Hash Cond: ((posts.author_id)::text = (accounts.id)::text)
                     ->  Seq Scan on posts  (cost=0.00..51109.60 rows=1813260 width=83)
                     ->  Hash  (cost=166.83..166.83 rows=481 width=50)
                           ->  Hash Join  (cost=141.67..166.83 rows=481 width=50)
                                 Hash Cond: ((oauth_tokens.account_id)::text = (accounts.id)::text)
                                 ->  Seq Scan on oauth_tokens  (cost=0.00..18.35 rows=535 width=25)
                                 ->  Hash  (cost=134.08..134.08 rows=607 width=25)
                                       ->  Seq Scan on accounts  (cost=0.00..134.08 rows=607 width=25)
                                             Filter: ((NOT dormant) AND (backoff_until < now()))
(15 rows)
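One possible direction (not verified against the real schema): the GROUP BY over every column only exists to serve group_by(Post) and adds nothing to a LIMIT 1 query, and with an index on posts.updated_at the planner could walk the index instead of sorting every candidate row. Roughly:

post = (Post.query
        .join(Post.author)            # plain join; the outer join gains nothing here
        .join(Account.tokens)
        .filter(Account.backoff_until < db.func.now())
        .filter(~Account.dormant)
        .order_by(db.asc(Post.updated_at))
        .first())

Joining oauth_tokens can still duplicate rows when an account has several tokens, but with LIMIT 1 that only affects which duplicate wins, not which post is oldest.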

Error returning to forget after cancelling authorization

After I click "log in with twitter" and get to the page that says "authorize forget.codl.fr to use your account?" and click cancel, there's a button that says "return to forget.codl.fr." I click it and it takes me to a page with this error:
"Bad Request

The browser (or proxy) sent a request that this server could not understand."

I really like forget and this isn't a big deal at all, but I figured I should mention it! 👍

unify button styles

(screenshots: the different button styles currently in use)

There are two, arguably three, different kinds of buttons. There should be only one.

handle missing anonymous oauth token more gracefully

https://sentry.io/codl/forget/issues/324754633/

Exception: OAuth token has expired
(3 additional frame(s) were not displayed)
...
  File "flask/_compat.py", line 33, in reraise
    raise value
  File "flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "forget/routes.py", line 52, in twitter_login_step2
    token = lib.twitter.receive_verifier(oauth_token, oauth_verifier, **app.config.get_namespace("TWITTER_"))
  File "lib/twitter.py", line 34, in receive_verifier
    raise Exception("OAuth token has expired")

Exception: OAuth token has expired
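A sketch of the graceful path. The route path, endpoint name and the dedicated exception are all hypothetical (today lib.twitter raises a bare Exception), and app/lib are the project's existing objects:

from flask import flash, redirect, request, url_for

@app.route('/login/twitter/callback')   # stand-in for the real route in routes.py
def twitter_login_step2():
    try:
        token = lib.twitter.receive_verifier(
            request.args['oauth_token'], request.args['oauth_verifier'],
            **app.config.get_namespace('TWITTER_'))
    except TokenExpiredError:            # hypothetical dedicated exception
        flash('That login link expired. Please try logging in again.')
        return redirect(url_for('index'))  # endpoint name assumed
    ...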

support full twitter archive format

Archive imports have been disabled because Twitter has discontinued the kind of archive that Forget knew about.

The new archive format is much larger, because it includes media, and it is a privacy nightmare because it includes everything Twitter knows about a user: DMs, ad targeting info, the whole mess. The old strategy of uploading the zip file and letting the server figure it out is not going to work.

The plan is to extract and parse archives in the browser, and send batches of statuses to the server.

Individual issues for tasks to be done as part of this project are #458, #459

Original issue text follows:


reported by [email protected] https://cybre.space/@rrix/100591445673683791

Hey, it looks like the Twitter archive format changed at some point in a way that makes it not work with Forget. There's no longer a data/js/tweets dir with monthly files, just a big JSONP file (67 MiB in my case) in the root of the zip.

Are boosts deleted? And the favourited ones?

I will ask my question in Mastodon vocabulary, hoping that it will be easily translatable for non-Mastodon users. My question is: does Forget delete boosts, i.e. shares of someone else's posts? And if I set “keep favourited posts” to true, will it delete boosts of posts that I also favourited?

Forget Only Media Posts

I'd really like to be able to have every post I make except media posts be permanent. Hopefully that's simple enough!

Nothing to expire

Hi,
Thank you for the work on Forget, it is really a very good idea. Too bad very few tools have a feature like this one: machines should forget more...
My problem is that I tried to set up expiration rules for my Twitter account today from https://forget.codl.fr/ but it targets zero tweets. It says exactly:

Currently keeping track of 0 of your posts, roughly 0 of which currently match your expiration rules.

I uploaded an archive and it returned:

0/72 months imported, 0 failed

Forget got the right authorizations from Twitter's side, I checked. It looks like something fails, but I don't know what to do. Thanks.

backoff

somewhat related to #34

Apply exponential backoff to Mastodon instances when we fail to talk to them. Immediately failing tasks when the backoff isn't over yet is probably enough, or not queuing fetches or deletes at all.

Have a separate backoff for each account, with some randomness to avoid a thundering herd when the server comes back.
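A rough sketch of the per-account backoff, assuming a hypothetical backoff_level counter next to the backoff_until column that already exists:

import random
from datetime import datetime, timedelta

MAX_BACKOFF = timedelta(days=1)

def bump_backoff(account):
    """Double the per-account delay on each failure, with jitter.

    backoff_level is a hypothetical counter column; backoff_until appears
    in the slow-query issue elsewhere in this tracker.
    """
    account.backoff_level = min(account.backoff_level + 1, 16)
    delay = min(timedelta(minutes=2 ** account.backoff_level), MAX_BACKOFF)
    # +/- 25% jitter so all accounts on one instance don't retry at once
    jitter = delay.total_seconds() * random.uniform(-0.25, 0.25)
    account.backoff_until = datetime.utcnow() + delay + timedelta(seconds=jitter)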

warn if archive looks too big

Follow-up to #64: there is potential for confusion between tweet archives and full account data archives.

The former is what we accept; the latter is usually huge because it includes all media. We should warn the user if it looks like they've selected a full archive in the upload form.

accounts with short delete_every but restrictive filters cause lots of refreshes

I've had the occasional complaint that Forget was making a lot of requests and I couldn't say why, but now I know.

In delete_from_account, this query:

forget/tasks.py

Lines 231 to 238 in 41683fc

latest_n_posts = (Post.query.with_parent(account, 'posts')
                  .order_by(db.desc(Post.created_at))
                  .limit(account.policy_keep_latest)
                  .cte(name='latest'))
posts = (
    Post.query.with_parent(account, 'posts')
    .filter(Post.created_at + account.policy_keep_younger <= db.func.now())
    .filter(~Post.id.in_(db.select((latest_n_posts.c.id, ))))
    .order_by(db.func.random())
    .limit(100)
    .all())

... does not apply any of the account's filters, and every post it returns is refreshed until one of them matches all the filters.

If someone has restrictive filters like "delete media only" or "delete faves only", then this will likely fetch 100 posts that don't match, make 100 requests to refresh them, and fail to delete anything.
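A sketch of pushing the filters into the query itself, using the column names visible above; the policy flag names are guesses:

query = (
    Post.query.with_parent(account, 'posts')
    .filter(Post.created_at + account.policy_keep_younger <= db.func.now())
    .filter(~Post.id.in_(db.select((latest_n_posts.c.id, )))))

if account.policy_keep_favourites:        # hypothetical flag name
    query = query.filter(~Post.favourite)
if account.policy_keep_media:             # hypothetical flag name
    query = query.filter(~Post.has_media)

posts = query.order_by(db.func.random()).limit(100).all()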

Support for json file [twitter]

Currently, the system only allows the upload of the zip archive.

But that archive contains all the media ever posted, so the zip is quite massive, and I guess there's no reason for it. Wouldn't the tweet.js JSON file alone be sufficient?

I tried to bypass it by making an archive out of just my JSON file, but I get "The file you uploaded is not a valid tweet archive. No posts have been imported."

More robust fetching

Right now fetching assumes that every post before the most recent known post has already been fetched. If there are no known posts then it will recurse until the remote service doesn't return any more posts, but otherwise it only fetches back to the most recent known post.

Unfortunately this means there is a failure state where posts were fetched from the most recent post down to some arbitrary post that isn't the oldest. This doesn't usually happen, but it can if a worker is killed or throws an unusual exception while fetching.

Recursing as far as possible every time we fetch isn't feasible due to rate limits.

I'm not sure how to fix this; maybe keep track of which time (or id?) ranges have been successfully fetched, and try to fetch whichever ones are missing?
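If it came to tracking ranges, the bookkeeping itself is small. A purely illustrative sketch of merging successfully fetched (oldest_id, newest_id) ranges so that whatever gaps remain are what still needs fetching:

def merge_ranges(ranges):
    """Collapse overlapping (oldest_id, newest_id) ranges that have been
    fetched successfully; nothing like this exists in the codebase yet."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged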

health check should check redis too

I believe /api/health_check only checks that Postgres is alive and well. Earlier today Redis ran out of disk space and stopped accepting writes; that should have made the health check fail.
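A sketch of extending the check (route path from the issue text; app and db are the project's existing Flask and Flask-SQLAlchemy objects, and the Postgres check is paraphrased). Doing a Redis write rather than just a ping means a full disk shows up as a failure:

import redis

@app.route('/api/health_check')
def health_check():
    db.session.execute('SELECT 1')    # existing-style Postgres liveness check
    r = redis.StrictRedis()           # connection settings elided
    r.set('health_check', '1')        # a write, so "out of disk" also fails
    return 'OK'

Any exception would propagate as a 500, which the monitoring side can treat as unhealthy.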

Support GNU social

I have an old GNU social account that I would love to purge messages from, on instances such as quitter.se and quitter.es.

account deletion

People should be able to delete all data associated with their account.

Pretty sure it's even a law, uh oh.

unify redis config

Right now there are, I think, three different things connecting to Redis (lib.brotli, flask-limiter, and celery), and all three have different ways of configuring where Redis is. For my current use it doesn't matter because I'm just using the default Redis unix socket, but there should really be a single config variable that sets all three at once.
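A sketch of what a single source of truth could look like. REDIS_URI is a made-up setting name, the exact keys each library reads depend on the versions in use, and the lib.brotli key is entirely hypothetical:

REDIS_URI = 'redis://localhost:6379/0'

CELERY_BROKER_URL = REDIS_URI        # fed to the Celery app as its broker
RATELIMIT_STORAGE_URL = REDIS_URI    # Flask-Limiter (key name varies by version)
BROTLI_CACHE_REDIS_URI = REDIS_URI   # hypothetical key for lib.brotli's cache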

Ignore fav on retweets

If you fav your own retweet of a tweet, then the RT post also gets faved. This isn't any different from just faving the original tweet in the UI, but it changes Forget's behaviour. Forget should just ignore favs on tweets that are RTs.
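The rule itself is tiny; a sketch using the fields visible elsewhere in this tracker (is_reblog, favourite):

def should_keep_as_favourite(post):
    # a fav on a post that is itself a retweet should not mark it as kept
    return post.favourite and not post.is_reblog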

Support alternate expiration policies

Such as with a regex or a keyword match. It should be able to match the post body, the CW text, or either.

Example: the default policy expires posts after 7 days; all posts that contain "announcement" expire after 30 days; all posts that contain "vent" in the CW expire after 1 hour.
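A purely illustrative sketch of the matching side; per-rule scoping to body versus CW is left out for brevity, and none of these names exist in the codebase:

import re
from datetime import timedelta

RULES = [
    (re.compile('announcement', re.I), timedelta(days=30)),
    (re.compile('vent', re.I),         timedelta(hours=1)),
]
DEFAULT_LIFETIME = timedelta(days=7)

def lifetime_for(body, cw_text):
    # first matching rule wins; fall back to the account's default policy
    for pattern, lifetime in RULES:
        if pattern.search(body or '') or pattern.search(cw_text or ''):
            return lifetime
    return DEFAULT_LIFETIME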

surface information about first-time fetch being in progress

Now that the historical_fetch_finished column, or whatever it is called, exists, it would be nice to surface that in the UI, for example with a message that says "Your past statuses are now being fetched from the service. This may take a while, so why not set up your expiration policy while you wait? You can close this page and Forget will continue to fetch your statuses and delete them when they reach expiration."

Although maybe that last sentence is better suited to somewhere it will still be seen after the first fetch is complete.

keep posts with a hashtag

As suggested by hoodie

Option to preserve posts that have a specific hashtag (e.g. #mastoart).

Would require parsing & storing the hashtags found in the post (and a privacy policy update)

It could be useful to also allow the opposite, i.e. delete only posts with a specific hashtag?
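Extraction itself is cheap; a rough sketch (for Mastodon, the API's structured tags list would be a better source than regexing the rendered body):

import re

HASHTAG_RE = re.compile(r'#(\w+)')

def extract_hashtags(text):
    # lowercase so "#MastoArt" and "#mastoart" match the same keep rule
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}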

big "keep posts by default" "delete posts by default" switch

So, something that came up is that there's no way to keep reblogs if you want to.

This is starting to make for a lot of buttons and a confusing interface, so I think a way to clean everything up is to have a big master switch between "Keep posts" / "Delete posts".

Then every setting can be "unless older/newer than", "unless faved", "unless it has media".

This is a pretty big one. Writing a SQL migration for this is going to be fun 😬

"latest x posts" is unreliable

If some posts have been deleted externally and have not been refreshed since, then the number of posts we have won't match the real count. This probably doesn't matter in most cases, but it is annoying if you're trying to make your post count a funny number.

Delete favs

Although harder to crawl, there's no doubt that favourites are a source of information: by logging favs one could deduce not only interests but also waking hours, account activity, account age, and so on.
These are things that Forget would usually make impossible to crawl at a later time.

Please add the ability to delete favs in a similar fashion to toots & boosts.

radio strips are not accessible to switch users on android

(screenshot: the account switcher radio strips)

These are what I call radio strips.

They're keyboard-accessible thanks to a hack, and apparently the hack is too hacky for Android's switch access support because it just passes over them completely.

They're still accessible through the crosshair thing, but that's messy.
