Forget: continuous post deletion for Twitter, Mastodon, and Misskey.
Home Page: https://forget.codl.fr/
License: ISC License

Offsite projects:
- Feedplz, a scraper and feed generator for Furaffinity and more.
  git clone https://f.codl.fr/feedplz.git
Thoughts on filtering keybase.io proofs (e.g. https://twitter.com/ilianaweller/status/1066886431107371008) out of the list of tweets to delete? And if so, should it be a user-configurable option or not?
there's already a way to make accounts dormant, but it only kicks in when we get a clear signal that the account won't ever work again, i.e. the token was revoked, the account was suspended, etc.
some high-profile mastodon instances closed down recently, meaning they either don't respond or respond 410 Gone. this is interpreted as a TemporaryError and causes a retry, but they won't ever be coming back up. I think after 2 weeks of temporary errors an account should be considered dormant and no tasks should be scheduled for it
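A minimal sketch of what that could look like (accounts.dormant already exists in the schema; the first_failure column and the helper names are my inventions):

import datetime

DORMANT_AFTER = datetime.timedelta(days=14)

def record_temporary_error(account):
    # first_failure marks the start of the current streak of temporary errors
    now = datetime.datetime.utcnow()
    if account.first_failure is None:
        account.first_failure = now
    elif now - account.first_failure > DORMANT_AFTER:
        account.dormant = True  # stop scheduling fetch/delete tasks for it

def record_success(account):
    account.first_failure = None  # streak broken, start over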
the interval selector is a weird mess. it's absolutely unnecessary for it to allow users to enter "20 years" just as easily as "6 hours" for their delete_every
I'm thinking I can replace it with a non-linear slider like this:
(but with time intervals instead of numbers)
noUiSlider seems like a decent implementation of this at first glance, but it doesn't have any keyboard accessibility built in: https://github.com/leongersen/noUiSlider
as of 2017-09-13 the most common values for intervals are
forget=> select count(1), policy_keep_younger from accounts where accounts.policy_enabled = 't' group by policy_keep_younger having count(1) > 1 order by count desc;
count | policy_keep_younger
-------+---------------------
3 | 10 days
3 | 90 days
3 | 60 days
3 | 30 days
3 | 365 days 05:49:12
2 | 91 days 07:27:18
2 | 31 days
2 | 365 days
2 | 00:00:00
2 | 180 days
(10 rows)
forget=> select count(1), policy_delete_every from accounts where accounts.policy_enabled = 't' group by policy_delete_every having count(1) > 1 order by count desc;
count | policy_delete_every
-------+---------------------
13 | 00:30:00
7 | 00:01:00
4 | 03:00:00
2 | 02:00:00
2 | 00:00:00
(5 rows)
I'd propose pips at:
keep_younger
delete_every
the handle would snap to pips
space in between pips would interpolate linearly between the pip values, quantized to a reasonable unit (no one wants 1yr 2mo 3d 4h 5min 6s precisely); a rough sketch of that mapping follows below
also, unlike the example above, the current value would need to be explicitly shown (the example did not interpolate, only snapped to pips), either as a bubble above or below the handle, or next to the slider
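That mapping could look roughly like this (the pip values and quantization steps are placeholders, not decisions; real pips would come from the stats above):

import datetime

# placeholder pips, in seconds
PIPS = [3600, 6 * 3600, 86400, 7 * 86400, 30 * 86400, 365 * 86400]

def slider_to_interval(t, pips=PIPS):
    """Map slider position t in [0, 1] to a duration, linear between pips."""
    scaled = t * (len(pips) - 1)
    i = min(int(scaled), len(pips) - 2)
    seconds = pips[i] + (scaled - i) * (pips[i + 1] - pips[i])
    step = 3600 if seconds < 86400 else 86400  # quantize: hours, then whole days
    return datetime.timedelta(seconds=round(seconds / step) * step)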
DETAIL: Process 21681 waits for ShareLock on transaction 8612212; blocked by process 21685.
Process 21685 waits for ShareLock on transaction 8612213; blocked by process 21681.
HINT: See server log for query details.
CONTEXT: while locking tuple (6804,73) in relation "posts"
[SQL:
WITH latest AS (
    SELECT posts.created_at AS created_at, posts.updated_at AS updated_at, posts.id AS id,
           posts.author_id AS author_id, posts.favourite AS favourite, posts.has_media AS has_media,
           posts.direct AS direct, posts.favourites AS favourites, posts.reblogs AS reblogs,
           posts.is_reblog AS is_reblog
    FROM posts
    WHERE %(param_2)s = posts.author_id
    ORDER BY posts.created_at DESC
    LIMIT %(param_3)s
)
SELECT posts.created_at AS posts_created_at, posts.updated_at AS posts_updated_at,
       posts.id AS posts_id, posts.author_id AS posts_author_id, posts.favourite AS posts_favourite,
       posts.has_media AS posts_has_media, posts.direct AS posts_direct,
       posts.favourites AS posts_favourites, posts.reblogs AS posts_reblogs,
       posts.is_reblog AS posts_is_reblog
FROM posts
WHERE %(param_1)s = posts.author_id
  AND posts.created_at + %(created_at_1)s <= now()
  AND posts.id NOT IN (SELECT latest.id FROM latest)
ORDER BY random()
LIMIT %(param_4)s
FOR UPDATE]
[parameters: {'param_1': 'REDACTED', 'created_at_1': datetime.timedelta(14), 'param_2': 'REDACTED', 'param_3': 500, 'param_4': 100}]
Lines 233 to 237 in eae4d06
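A possible mitigation (my guess, not something the log settles): two workers can select overlapping random sets of the same account's posts and lock them in different orders, which is a classic lock-ordering deadlock. FOR UPDATE SKIP LOCKED (Postgres 9.5+) would make the second worker skip rows the first one already holds instead of waiting on them. In SQLAlchemy terms, roughly:

from sqlalchemy import func

# sketch: Post and session stand in for Forget's model and db session
posts = (session.query(Post)
         .filter(Post.author_id == account_id)
         .order_by(func.random())
         .limit(100)
         .with_for_update(skip_locked=True)
         .all())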
while showing the most common instances is sensible for a new user, a returning user most likely wants to log in to the same one again, so we should remember which one it is and show it (first? second, after mastodon.social? what if there are several of them?) in the log-in buttons
now that Mastodon supports exporting all of one's data to an archive, it would be neat to support those for parity with the Twitter side
because those archives contain media they will get much larger. mine is ~600 MB to date
i wonder if I can read the archive from the client, strip media, re-zip it and upload it
Something is causing celery workers to hang
In the log below only one of the workers, ForkPoolWorker-15, was still alive and processing tasks; it then froze until I manually restarted it several days later (!)
Sep 25 22:13:21 slyme.codl.fr bash[9950]: 22:13:19 worker.1 | [2017-09-25 22:13:19,751: WARNING/ForkPoolWorker-15] fetching <Account(XXX, XXX, XXX)>
Sep 25 22:13:21 slyme.codl.fr bash[9950]: 22:13:20 worker.1 | [2017-09-25 22:13:20,356: WARNING/ForkPoolWorker-15] fetching <Account(XXX, XXX, XXX)>
Sep 25 22:13:21 slyme.codl.fr bash[9950]: 22:13:21 worker.1 | [2017-09-25 22:13:21,252: WARNING/ForkPoolWorker-15] fetching <Account(XXX, XXX, XXX)>
Sep 30 15:56:55 slyme.codl.fr systemd[376]: Stopping Forget web application server...
Sep 30 15:57:00 slyme.codl.fr systemd[376]: Stopped Forget web application server.
Sep 30 15:57:00 slyme.codl.fr systemd[376]: Started Forget web application server.
Celery has built-in support for time limits, which would be a great first step. I'd also like to know what in the world is causing these workers to hang forever.
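For reference, a sketch of what the time limits could look like (the task name and the 10-minute numbers are placeholders):

from celery.exceptions import SoftTimeLimitExceeded

@app.task(soft_time_limit=600, time_limit=660)  # app = the existing Celery app
def fetch_acc(account_id):
    try:
        ...  # existing fetch logic
    except SoftTimeLimitExceeded:
        pass  # soft limit raises here; the hard limit kills the process 60s later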
rn workers just reuse db sessions all willy-nilly and that's gross. I don't know if celery has a way to run things before/after every task or if I'm going to have to make my own decorator
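Celery does have hooks for this: the task_prerun/task_postrun signals. A sketch, assuming a SQLAlchemy scoped_session exposed as db.session the way Flask-SQLAlchemy does it:

from celery.signals import task_postrun

@task_postrun.connect
def reset_db_session(*args, **kwargs):
    # remove() closes the session used by the finished task,
    # so the next task on this worker starts with a fresh one
    db.session.remove()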
this query:
Lines 388 to 391 in 78c84ed
SELECT posts.created_at AS posts_created_at, posts.updated_at AS posts_updated_at, posts.id AS posts_id, posts.author_id AS posts_author_id, posts.favourite AS posts_favourite, posts.has_media AS posts_has_media, posts.direct AS posts_direct, posts.favourites AS posts_favourites, posts.reblogs AS posts_reblogs, posts.is_reblog AS posts_is_reblog
FROM posts
LEFT OUTER JOIN accounts ON accounts.id = posts.author_id -- dunno why this isn't a plain join
JOIN oauth_tokens ON accounts.id = oauth_tokens.account_id
WHERE accounts.backoff_until < now()
AND NOT accounts.dormant
GROUP BY posts.created_at, posts.updated_at, posts.id, posts.author_id, posts.favourite, posts.has_media, posts.direct, posts.favourites, posts.reblogs, posts.is_reblog
ORDER BY posts.updated_at ASC
LIMIT 1
... is slow. 5 seconds on my production database and its 1.8M posts.
A majority of that time is spent sorting all the candidate posts.
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=327303.91..327303.93 rows=1 width=83)
-> Group (cost=327303.91..363028.72 rows=1299084 width=83)
Group Key: posts.updated_at, posts.created_at, posts.id, posts.author_id, posts.favourite, posts.has_media, posts.direct, posts.favourites, posts.reblogs, posts.is_reblog
-> Sort (cost=327303.91..330551.62 rows=1299084 width=83)
Sort Key: posts.updated_at, posts.created_at, posts.id, posts.author_id, posts.favourite, posts.has_media, posts.direct, posts.favourites, posts.reblogs, posts.is_reblog
-> Hash Join (cost=172.85..71061.01 rows=1299084 width=83)
Hash Cond: ((posts.author_id)::text = (accounts.id)::text)
-> Seq Scan on posts (cost=0.00..51109.60 rows=1813260 width=83)
-> Hash (cost=166.83..166.83 rows=481 width=50)
-> Hash Join (cost=141.67..166.83 rows=481 width=50)
Hash Cond: ((oauth_tokens.account_id)::text = (accounts.id)::text)
-> Seq Scan on oauth_tokens (cost=0.00..18.35 rows=535 width=25)
-> Hash (cost=134.08..134.08 rows=607 width=25)
-> Seq Scan on accounts (cost=0.00..134.08 rows=607 width=25)
Filter: ((NOT dormant) AND (backoff_until < now()))
(15 rows)
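A guess at a fix, not verified against this plan: the GROUP BY is only there to dedupe rows multiplied by the oauth_tokens join, so expressing that join as an EXISTS removes the need to sort every candidate row, and an index on posts.updated_at would let the ORDER BY ... LIMIT 1 walk the index and stop early. In SQLAlchemy terms (Post/Account/OauthToken are stand-ins for whatever the models are called):

from sqlalchemy import exists, func

post = (db.session.query(Post)
        .join(Account, Account.id == Post.author_id)
        .filter(Account.backoff_until < func.now(),
                ~Account.dormant,
                exists().where(OauthToken.account_id == Account.id))
        .order_by(Post.updated_at.asc())
        .first())  # plus, in SQL: CREATE INDEX ON posts (updated_at)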
After I click "log in with twitter" and get to the page that says "authorize codl.forget.fr to use your account?" and click cancel, there's a button that says "return to forget.codl.fr." I click it and it takes me to a page with this error:
"Bad Request
The browser (or proxy) sent a request that this server could not understand."
I really like forget and this isn't a big deal at all, but I figured I should mention it! 👍
https://sentry.io/codl/forget/issues/324754633/
Exception: OAuth token has expired
(3 additional frame(s) were not displayed)
...
File "flask/_compat.py", line 33, in reraise
raise value
File "flask/app.py", line 1612, in full_dispatch_request
rv = self.dispatch_request()
File "flask/app.py", line 1598, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "forget/routes.py", line 52, in twitter_login_step2
token = lib.twitter.receive_verifier(oauth_token, oauth_verifier, **app.config.get_namespace("TWITTER_"))
File "lib/twitter.py", line 34, in receive_verifier
raise Exception("OAuth token has expired")
Exception: OAuth token has expired
Archive imports have been disabled because Twitter have disabled the kind of archive that Forget knew about.
The new archive format is much larger because it includes media, and it is a privacy nightmare because it includes everything Twitter knows about a user: DMs, ad targeting info, the whole mess. The old strategy of uploading the zip file and letting the server figure it out is not going to work.
The plan is to extract and parse archives in the browser, and send batches of statuses to the server.
Individual issues for tasks to be done as part of this project are #458 and #459.
Original issue text follows:
reported by [email protected]
https://cybre.space/@rrix/100591445673683791
hey, it looks like the Twitter archive format changed at some point in a way that makes it not work with Forget. there's no longer a data/js/tweets dir with monthly files, just a big JSONP file (67 MiB in my case) in the root of the zip
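For the parsing step itself: the new tweet.js is a JSON array behind a JavaScript assignment, so stripping the prefix is enough. A sketch (the window.YTD.tweet.part0 prefix is what archives used at the time; treat it as an assumption):

import json

def parse_tweet_js(raw: str) -> list:
    # archives wrap the array like: window.YTD.tweet.part0 = [ ... ]
    return json.loads(raw[raw.index('['):])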
I will ask my question in Mastodon vocabulary, hoping that it will be easily translatable for non-Mastodon users. My question is: does Forget delete boosts, i.e. shares of someone else's posts? And if I set "keep favourited posts" to true, will it delete boosts of posts that I also favourited?
I'd really like to be able to have every post I make except media posts be permanent. Hopefully that's simple enough!
Hi,
Thank you for the work on Forget, it is really a very good idea. Too bad very few tools have a feature like this one: machines should forget more...
My problem is that I tried to set up expiration rules for my Twitter account today from https://forget.codl.fr/ but it targets zero tweets. It says, precisely:
Currently keeping track of 0 of your posts, roughly 0 of which currently match your expiration rules.
I uploaded an archive and it returned:
0/72 months imported, 0 failed
Forget got the right authorizations from Twitter's side, I checked. It looks like something fails, but I don't know what to do. Thanks.
"Sorry, something went wrong when logging you in with Twitter. Give it another shot, maybe?"
I've relogged and tried again in 3 different browsers.
flask-limiter https://flask-limiter.readthedocs.io might help
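A minimal sketch of wiring it up (the limits and the route are placeholders, and the constructor signature varies a bit between flask-limiter versions):

from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(get_remote_address, app=app, default_limits=['60 per minute'])

@app.route('/login/twitter')     # hypothetical route name
@limiter.limit('5 per minute')   # tighter limit on the expensive endpoint
def twitter_login():
    ...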
allow users to be logged into >1 account at once and switch between them
somewhat related to #34
apply exponential backoff to mastodon instances when failing to talk to them. immediately failing tasks while the backoff isn't over yet is probably enough, or not queuing fetches or deletes at all
have a separate backoff for each account, with some randomness to avoid a thundering herd when the server comes back
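A sketch of the per-account backoff with jitter (accounts.backoff_until already exists in the schema; backoff_level and the base/cap values are inventions):

import random
import datetime

def bump_backoff(account):
    # double the delay on every consecutive failure, cap it at a day,
    # and jitter it so accounts on a recovering instance don't all wake at once
    account.backoff_level = min(account.backoff_level + 1, 10)
    delay = min(60 * 2 ** account.backoff_level, 86400)
    delay *= random.uniform(0.5, 1.5)
    account.backoff_until = datetime.datetime.utcnow() + datetime.timedelta(seconds=delay)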
follow-up to #64: there is potential for confusion between tweet archives and full account data archives
the former is what we accept; the latter is usually huge because it includes all media. we should warn the user if it looks like they've selected a full archive in the upload form
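A possible heuristic for that warning, based on the layouts described in #64 (the paths are assumptions about the two formats):

import zipfile

def looks_like_full_archive(path):
    # tweet archives ship data/js/tweets/YYYY_MM.js files;
    # full account data archives bundle media and lack that directory
    with zipfile.ZipFile(path) as z:
        return not any(n.startswith('data/js/tweets/') for n in z.namelist())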
I've had the occasional complaint that Forget was making a lot of requests and I couldn't say why, but now I know
in delete_from_account, this query:
Lines 231 to 238 in 41683fc
does not use any of the filters, and every post that is returned from it is refreshed until one of them matches all the filters
if someone has restrictive filters like "delete media only" or "delete faves only" then this will likely get 100 posts that don't match, do 100 requests to refresh them, and fail to delete anything
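Presumably the fix is to push the policy filters into the query itself, something like (the policy flag names are stand-ins; the Post columns are as in the posts table above):

from sqlalchemy import func

query = db.session.query(Post).filter(Post.author_id == account.id)
if account.policy_keep_favourites:   # only fetch candidates the policy can delete
    query = query.filter(~Post.favourite)
if account.policy_keep_media:
    query = query.filter(~Post.has_media)
candidates = query.order_by(func.random()).limit(100).all()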
Currently, the system only allows the upload of the zip archive.
But that archive contains all the media ever posted, so the zip is quite massive, and I guess there's no reason for it. Wouldn't the tweet.js JSON file alone be sufficient?
I tried to bypass it by making an archive out of my JSON file, but I get "The file you uploaded is not a valid tweet archive. No posts have been imported."
As suggested by alyx@witches.town [1]:
a good expiration rule might be X number of boosts/likes can stay
the front half is fine but the back half is either hastily documented or not documented at all
Right now fetching assumes that every post before the most recent known post has already been fetched. If there are no known posts then it will recurse until the remote service doesn't return any posts anymore but otherwise it will only go up to the most recent known posts.
Unfortunately this means there is a failure state where posts were fetched from the most recent post down to some arbitrary post that isn't the oldest. This doesn't usually happen, but it can if a worker is killed or throws an unusual exception mid-fetch.
Recursing as far as possible every time we fetch isn't feasible due to rate limits
I'm not sure how to fix this, maybe keep track of which time (or id?) ranges have been successfully fetched, and try to fetch whichever ones are missing?
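One shape that tracking could take (a sketch; nothing like this exists yet): store one row per contiguous run of fetched post ids, and on each fetch either extend a run or fill the gap between two runs.

def missing_gaps(ranges):
    """Given (oldest_id, newest_id) runs sorted by id, yield the gaps between them."""
    for (_, hi1), (lo2, _) in zip(ranges, ranges[1:]):
        if hi1 < lo2:
            yield (hi1, lo2)  # posts in this id span still need fetching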
it is possible to find out whether someone has a specific set of "known instances" (the instance buttons on the front page)
more details withheld until i fix it
forgot to make an issue for this one, sorry
I believe /api/health_check only checks that postgres is alive and well. earlier today redis ran out of disk space and stopped accepting writes; that should have made the health check fail
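The check could exercise both backends, and notably do a redis write, since a full disk breaks writes while a plain ping still succeeds. Roughly (assuming Flask-SQLAlchemy's db and a redis client named r; the route path is from the issue):

from sqlalchemy import text

@app.route('/api/health_check')
def health_check():
    db.session.execute(text('SELECT 1'))  # postgres reachable?
    r.set('health_check', '1')            # redis accepting writes?
    return 'OK'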
hoodie suggested this
I have an old GNU social account that I would love to purge messages from; instances such as quitter.se and quitter.es.
as requested by dani on twitter https://twitter.com/heartpastels/status/894824457071624193
this is maybe a bit too specific, but is there a good way to keep tweets with media attached fresh forever, too?
people should be able to delete all data associated with their account
pretty sure it's even a law, uh oh
right now there are, I think, three different things connecting to redis (lib.brotli, flask-limiter, and celery), and all three have different ways to configure where redis is. for my current use it doesn't matter because I'm just using the default redis unix socket, but there should really be a single config variable that sets all three at once
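i.e. one REDIS_URI fanned out to all three consumers, something like this (the celery and flask-limiter keys are real config keys, though their spelling varies by version; the lib.brotli key is a stand-in):

REDIS_URI = app.config.get('REDIS_URI', 'unix:///var/run/redis/redis.sock')
app.config.setdefault('CELERY_BROKER_URL', REDIS_URI)      # celery
app.config.setdefault('RATELIMIT_STORAGE_URL', REDIS_URI)  # flask-limiter
app.config.setdefault('BROTLI_REDIS_URI', REDIS_URI)       # lib.brotli, made-up key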
if you fav your own retweet of a tweet, the RT post also gets faved. this isn't any different from just faving the og tweet in the UI, but it changes Forget's behaviour. Forget should just ignore favs on tweets that are RTs
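i.e. when ingesting or refreshing a post, roughly (column names as in the posts table above):

if post.is_reblog:
    post.favourite = False  # the fav really belongs to the og tweet, ignore it here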
Such as with a regex or a keyword match. It should be able to match the post body, the CW text, or either.
Example: default policy expires posts after 7 days, all posts that contain "announcement" expire after 30 days, all posts that contain "vent" in the CW expire after 1 hour
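One way a rule like that could be represented and evaluated (a sketch; Rule and its fields are made up):

import re
import datetime
from dataclasses import dataclass

@dataclass
class Rule:
    pattern: str                  # regex or plain keyword
    field: str                    # 'post', 'cw', or 'either'
    keep_for: datetime.timedelta  # overrides the default policy

def rule_matches(rule, body, cw):
    targets = {'post': [body], 'cw': [cw], 'either': [body, cw]}[rule.field]
    return any(t and re.search(rule.pattern, t, re.IGNORECASE) for t in targets)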
I'd like to specifically choose which posts to keep. Any way we can do that with a specific hashtag of our choosing?
now that the historical_fetch_finished column, or whatever it is called, exists, it would be nice to surface that in the UI, for example with a message that says "Your past statuses are now being fetched from service. This may take a while, why not set up your expiration policy while you wait? You can close this page and Forget will continue to fetch your statuses and delete them when they reach expiration."
although maybe that last sentence is better suited for somewhere where it will still be seen after the first fetch is complete
guh
(builtins.TypeError) Not a boolean value: 'true'
As suggested by hoodie
Option to preserve posts that have a specific hashtag (ie #mastoart)
Would require parsing & storing the hashtags found in the post (and a privacy policy update)
it could be useful to also allow the opposite, i.e. delete only posts with a specific hashtag?
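Parsing could start as simple as this (a sketch; Mastodon's actual hashtag grammar is stricter than \w+):

import re

HASHTAG_RE = re.compile(r'#(\w+)')

def hashtags(text):
    return {t.lower() for t in HASHTAG_RE.findall(text)}

def should_keep(post_text, preserved_tags):
    # keep the post if any preserved tag appears in it
    return bool(hashtags(post_text) & {t.lower() for t in preserved_tags})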
Pinned posts probably shouldn't be deleted
so something that came up is that there's no way to keep reblogs if you want to
this is starting to make for a lot of buttons and a confusing-ass interface, so I think a way to clean everything up is to have a big master switch between "Keep posts" / "Delete posts"
and then every setting can be "unless older/newer than," "unless faved," "unless it has media"
this is a pretty big one. writing a sql migration for this is going to be fun 😬
Hi,
After I authorized Twitter access, when I try to enable the tool I get a plain "403 Forbidden" page when I confirm that I want to delete things.
this test suite is sorely lacking!!!!!!!
if some posts have been deleted externally and have not been refreshed since, then the number of posts we report won't match reality. this probably doesn't matter in most cases but is annoying if you're trying to make your post count a funny number
Although harder to crawl, there's no doubt that favourites are a source of information: by logging favs one could deduce not only interests but also wake time, account activity, account age, ...
Things that Forget would usually make impossible to crawl at a later time.
Please add the ability to delete favs in a similar fashion to toots & boosts
As suggested by maple https://computerfairi.es/@squirrel/99265955755880591
This is pretty much the same thing as the media / no media thing so I already know how to do this 😛