
Support episodes cleanup · podsync · 21 comments · closed

mxpv avatar mxpv commented on May 24, 2024 2
Support episodes cleanup


Comments (21)

kgraefe avatar kgraefe commented on May 24, 2024 4

Yeah, or just do

cd /home/pi/podsync/data/
for f in */*.mp4 */*.mp3; do
    grep -q "$f" *.xml || rm "$f"
done


tuxpeople avatar tuxpeople commented on May 24, 2024 2

@amcgregor Not sure if podsync needs to solve that problem. We're speaking about podcasts here, not someone's wedding video :-P So the files do not need to be saved until the end of the world. But that's just how I use podcasts.

Besides the statement from @psyciknz, just a little addition: if you would like to keep EVERYTHING in your feed, you end up also implementing RFC 5005 (see https://podlove.org/paged-feeds/).

What we would need to discuss is what kind of value keep_size is: it could be a number of episodes, an age of episodes, or both.


DeXteRrBDN avatar DeXteRrBDN commented on May 24, 2024 1

I agree with @tuxpeople's approach. Having two variables would also solve the issue for me:

page_size = 5
keep_size = 10

5 items to download
10 items to keep as downloaded

Regarding what @amcgregor said, I would not mind losing episodes that have not been played (even if already downloaded) if they fall beyond the keep_size value.
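
For illustration, a rough shell equivalent of what keep_size = 10 would mean for a single feed directory (a hypothetical sketch; the path is a placeholder and filenames are assumed to be space-free YouTube IDs):

# Keep only the 10 most recently downloaded episodes in one feed directory.
# ls -t lists newest first; tail -n +11 selects everything after the 10th;
# xargs -r (GNU) avoids running rm when nothing matches.
cd /home/pi/podsync/data/somefeed/ || exit 1
ls -t ./*.mp4 | tail -n +11 | xargs -r rm --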


kgraefe avatar kgraefe commented on May 24, 2024 1

You have to save it as a shell script file and add a crontab entry to run it periodically via cron. (I don't know how to do that on your NAS; on a Linux box I'd run crontab -e.)

E.g. I have in my crontab something like:

  0      3       *       *       *       /bin/sh /home/pi/bin/clean-data.sh

which means "run /bin/sh /home/pi/bin/clean-data.sh every day at 3:00 A.M."

You may have to add PATH settings at the beginning of the script, as cron runs in a different (minimal) environment:

PATH=/bin:/usr/local/bin:/usr/bin

Otherwise it may not find the grep and rm executables.
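
Putting it together, a minimal clean-data.sh sketch combining the loop above with the PATH setting (paths are the ones from the earlier examples; adjust for your setup):

#!/bin/sh
# Remove media files that are no longer referenced by any feed XML.
# cron runs with a minimal environment, so set PATH explicitly.
PATH=/bin:/usr/local/bin:/usr/bin

cd /home/pi/podsync/data/ || exit 1
for f in */*.mp4 */*.mp3; do
    [ -e "$f" ] || continue          # skip unmatched glob patterns
    grep -q "$f" *.xml || rm "$f"
done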


mxpv avatar mxpv commented on May 24, 2024

Is this expected behavior?

Yep, this is how it works as of now. The CLI gets page_size episodes via the API and, for each episode, checks whether it needs to be downloaded to disk (and downloads it if necessary).
Your point makes sense: there should be some kind of cleanup mechanism, especially for video files.
I think we can define an update_strategy field instead of page_size, with values like "keep the entire feed in sync", "keep the last 5 episodes", etc.


DeXteRrBDN avatar DeXteRrBDN commented on May 24, 2024

update_strategy would be an interesting addition. I was looking to add this too. I could help with this.


tuxpeople avatar tuxpeople commented on May 24, 2024

Not sure if I understand correctly. For my use case, the best fit would be two variables: one for how many items to check on YouTube and one for how many downloaded files should be kept (and added to the RSS feed). I would set those two to different values.


amcgregor avatar amcgregor commented on May 24, 2024

There's an interesting chicken-and-egg problem in identifying whether an episode is "safe to remove". My own process generates RSS XML feeds from all video content within a (channel, playlist) download directory; any additions appear after each new run. Watched state is tracked client-side (e.g. in the Podcasts app) and the server side is never informed of it, which is actually a massive reason I prefer the Podsync approach to media consumption: Google doesn't need to know what I watch and when.

Automated clean-up becomes problematic and tricky. If it's tied to the HTTP request to pull in the video file, the episode may be cleaned up after only being partially watched (e.g. pause long enough to break the streaming connection… and your video gets deleted, great job!) If it's tied entirely to episode count, unwatched episodes may be unintentionally removed by an arbitrary metric ("n episodes"). If tied to a sensible metric, e.g. time ("keep all episodes from the last 7 days"), then episodes you just don't quite get to may disappear automatically. I'm fortunate in that I have tens of terabytes of available storage at home, so currently I don't bother to delete episodes at all. (I actually have all digital media I have touched since 2001… just shy of 50 TiB at the moment. The first video: a bad VHS transfer of "blowing up the whale". ;)

A combination of metrics for identifying cullable (able to be safely removed ;) episodes such as "7 days after the last time of access + three months for untouched episodes" might be optimal.


psyciknz avatar psyciknz commented on May 24, 2024

I'm not sure we need to over-complicate this... I thought the main use of a podcast was single-use episodes. So after x days, clean up old episodes... it's not life or death if you miss one.

If you're after "I want to save the entire internet at home", then don't remove anything... or put the podcasts somewhere more permanent.

Just my thought, but it's slanted by how I use podcasts.


psyciknz avatar psyciknz commented on May 24, 2024

I’d say number of episodes. Then it’s up to the user to set it as appropriate depending on the release frequency


DeXteRrBDN avatar DeXteRrBDN commented on May 24, 2024

As page_size is defined as a number of episodes to look at, keep_size should likewise be a number of episodes to keep.

Maybe an option to define what "size" means (items, age) would be useful, but that is probably a task for the future.


amcgregor avatar amcgregor commented on May 24, 2024

So the files do not need to be saved until the end of the world.

In my case, that is explicitly the point. Actually, a step further: my archive needs to survive the end of the world. (It's geographically redundant—offsite backups—and locally redundant—extensive RAID, multiple servers, replicated MongoDB GridFS cluster atop, hosting a non-hierarchical metadata filesystem.) "All digital media I have touched since 2001" includes a complete copy of Wikipedia, most of Wikimedia (e.g. dict, books, so forth), Project Gutenberg, every StackExchange site including StackOverflow, with an additional set of ~45 days of music (24h a day, no repeats, every genre), millions of works of fiction, hundreds of thousands of works of non-fiction, … forming a forever library. A body of knowledge sufficient to learn to read, understand, and propagate that body of knowledge, with instructions sufficient to survive long enough to do so. (Agriculture on up.) Yes, this sounds ludicrous. And yes, it took three months to initially off-site backup. Thank the Elders of the Internet for Backblaze.

I am sadly content in the knowledge that my own requirements will not be met by any of the simplistic "rate limiting" approaches being agreed upon here, and that my own hackish attempt to replicate Podsync functionality already offers substantially more powerful controls over media ingest. For example, a frequent scenario I encounter: download i episodes on each run, preserve j episodes total per channel, covering k months of time at most.

Hypothetically applied: 3 months of episodes, say, 200 episodes within that time period, downloading at most 10 episodes every 6 hours, preserving the 30 most recent. Thus requiring three synchronization periods, or 18 hours, to be "fully caught up" and ready to go with all expected media available for viewing, while not waiting on complete ingest of a channel before continuing on the next, e.g. downloading all 200 episodes for that three-month period in one go and taking 12 hours to do so while processing nothing else, missing/skipping additional synchronization periods. This algorithm can offer guarantees about behavior while being flexible to time, count, and batch sizes, while reducing blocking (improving "turnaround time" on refreshes/updates).

(Edited to add: periodic regular data synchronization, ingress and egress, combined with feed generation, is my literal day job. That template engine used in my hackish recreation was invented at work for the purpose of streaming RSS feed generation. ;)


DeXteRrBDN avatar DeXteRrBDN commented on May 24, 2024

Well, I don't see the issue here. We can implement the new config value with the following option:

keep_size: 0   # disable the episode-cleanup functionality, i.e. keep forever


kgraefe avatar kgraefe commented on May 24, 2024

amcgregor is not even using podsync, yet keeps trying to make discussions difficult.

As for the variable name, I'd name it keep_items. If we later want more flexibility we can add more variables like keep_age (e.g. "keep items newer than x days") and keep_size (e.g. "keep items until storage size exceeds x MiB").


amcgregor avatar amcgregor commented on May 24, 2024

amcgregor is not even using podsync

I did. And I paid for the privilege of deeper archiving and higher quality. Then it ceased functioning. So I eliminated it as a dependency in the operation of my media consumption workflow by examining its mechanism of operation and replicating the essential process using a literal shell script and common, highly functional open-source tools (GNU parallel across multiple machines with a live progress indicator is beautiful), and now rely on functionality the abstraction of podsync does not provide for, despite the underlying mechanism supporting all of my needs. Discussions on exposing related functionality up from the underlying tools are all being driven towards the lowest common denominator of least functionality, with short-sighted implications. I want to use podsync. I have come to realize I might never be able to. (Without rewriting it, as I essentially have… in BASH.)

Well, I don't see the issue here.

Readily apparent, and not unique. (Pardon the snark; bit frustrating on this side, too. ;^P)

We can implement the new config value with the following option…

Giving up entirely is certainly an option. A regression of what my shell script is capable of, though, so I see no point in shooting myself in the foot. Mostly trying to encourage a tool I like the concept of (and early growth of) to be… less… poorly/inflexibly architected… so that I can consider using it once more. Being able to combine criteria (n episodes maximum, m days maximum, j episodes pulled maximum per run) really isn't that big of an ask. youtube-dl already does it through the combination of three command-line switches, and offers even more flexible selection criteria that just aren't exposed currently.
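
For reference, a hedged illustration of that combination with plain youtube-dl (the channel URL and the numbers are placeholders, not a podsync invocation):

# Consider only the 30 newest playlist entries, skip anything older than
# roughly three months, and stop after at most 10 downloads in this run.
youtube-dl --playlist-end 30 --dateafter now-3months --max-downloads 10 \
    "https://www.youtube.com/channel/CHANNEL_ID"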

For clean-up, I'll be adding to my own script—for some channels or playlists—a find -mtime … -exec rm {} + invocation to remove episodes based on relatively long duration (creation) age and a shorter duration modification time age, having nginx touch files it streams to indicate possible watched state to the find call. (Or use the atime, if I can resolve the paradox of fetching the atime without updating the atime during cleanup…)
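
A sketch of that find-based cleanup, with a placeholder path and thresholds, assuming BSD/macOS find's -Btime for the creation-age half (GNU find would need a different test):

# Remove episodes created more than ~90 days ago (-Btime, birth time) that
# also have not been touched -- i.e. streamed and touch(1)-ed by nginx -- in
# the last 7 days.
find /path/to/channel -name '*.mp4' -Btime +90 -mtime +7 -exec rm {} +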

Lastly, to cover the "n episodes" culling criterion, ls *.json | sort -r | tail -n +31 | xargs rm ← keep only the latest 30 episodes, for example. (Yes, I'm aware that last one only cleans the JSON metadata; this needs expanding upon, but even that would be effective in excluding the videos from the generated RSS file.) #TruePowerOfUNIX — pressing the point that these requested features are implementable with system-standard, absolutely basic, freely available GNU/Linux/BSD tools, right now, with only a few minutes of effort.


DeXteRrBDN avatar DeXteRrBDN commented on May 24, 2024

I've been looking at the code, and currently Podsync does not track or look at how many items have already been downloaded, so it does not know what can or cannot be deleted.

It looks at all the requested items, checks whether they exist, and downloads them if not. But it does not know which already-downloaded items fall outside the current page_size.

YouTube returns 50 items per query, so we could check whether any of our downloaded items are within those first 50. That way we can build the XML with the elements inside page_size plus those that are already downloaded but fall outside page_size.

For items that are already downloaded, outside page_size but inside keep_items and within the first 50 items from YouTube, we can take the info from YouTube; we would, however, need to paginate YouTube to get more items if some downloaded items fall outside those first 50.

To avoid querying YouTube for items that are already downloaded, we should "store" the downloaded item info. We already have an XML file with that info, so we could use that file as storage. An initial step would be to modify the current code so that it updates the existing XML instead of creating a new one every time.

Once we can modify the XML file instead of recreating it, we can add new functionality that simply reads the current XML and removes items (including the stored files) using whatever criteria we decide on (max total size, max items, max age, etc.).

So we could split the task into two steps:

  • Modify XML generation from recreate to read & modify.
  • Read the XML items and delete those that do not pass the user configuration.


kgraefe avatar kgraefe commented on May 24, 2024

Oh, I wasn't aware that the XML is currently not read by podsync. That means older items will exist as files but not in the XML file. Deleting those files via cron and controlling the number of episodes with the page_size parameter will be good enough for me.


billflu avatar billflu commented on May 24, 2024

I just created this script (it can probably be cleaned up) to compare the MP4s referenced in the XML files against what has been downloaded. It then removes the extra MP4s. When videos are high quality and 30-60 minutes long, they can take up quite a bit of space.

#Cleans up extra mp4 files which are downloaded by podsync, but not referenced
#These files are likely older than the current length of the feed
#
#Directory of podsync data (with trailing /)
podsyncdata='/home/pi/podsync/data/'
#Find referenced mp4 files in the xml feeds (11-character YouTube IDs)
grep -Eoh '[A-Za-z0-9_-]{11}\.mp4' "$podsyncdata"*.xml | sort -u > xml-mp4.txt
#Find downloaded mp4 files
find "$podsyncdata" -name '*.mp4' -exec basename {} \; | sort -u > mp4.txt
#Compare the lists to see which downloaded files aren't referenced
comm -23 mp4.txt xml-mp4.txt > diff-mp4.txt
#Remove the extra downloaded files
while read -r line; do rm "$podsyncdata"*/"$line"; done < diff-mp4.txt
#Clean up temporary files
rm xml-mp4.txt mp4.txt diff-mp4.txt

Next up might be to create a script to clean up partial downloads. It could then wait for the next run or possibly kick off youtube-dl itself.
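
A possible starting point for that, assuming youtube-dl's default .part suffix for partial downloads and that a .part file untouched for several hours is no longer being written:

#Delete stale partial downloads that haven't been written to in over 6 hours
find /home/pi/podsync/data/ -name '*.part' -mmin +360 -delete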


billflu avatar billflu commented on May 24, 2024

Yeah, or just do

cd /home/pi/podsync/data/
for f in */*.mp4 */*.mp3; do
    grep -q "$f" *.xml || rm "$f"
done

Touché kgraefe! I knew there had to be a way to do it more efficiently.


Rumik avatar Rumik commented on May 24, 2024

Is the above script usable? If so, how so? Auto cleanup would be very handy! Especially because I have no idea where my episodes are being downloaded to! They're not in the data directory! lol

Thanks :)


amcgregor avatar amcgregor commented on May 24, 2024

Ah, for those digging into this problem, there's a slightly more "refined" approach I'm investigating leads on, now. Instead of relying on arbitrary rules around time-based expiry / expunging, pure episode count limits, etc., since the machine acting as my server is also a macOS machine running Podcasts.app locally, why can't I pull the view state from the Podcasts app?

Turns out, you can!

/Users/$USER/Library/Group\ Containers/??????????.groups.com.apple.podcasts/Documents/MTLibrary.sqlite

That's the path to the SQLite database backing Podcasts.app. The question marks are a "globbing pattern" indicating that your particular "group ID" may differ from mine, but this path should work [as-is] for passing to command-line programs, such as the sqlite3 REPL. Now, finding the path to the database is only the first part. The episode data is stored in the ZMTEPISODE table.

I'm choosing to identify episodes that are possible to clean up using a URL prefix match on the domain name I'm hosting the podcasts from—ZENCLOSUREURL column. Another idea that popped to mind was matching the format of the episode ID—ZGUID column—but the URL will be more reliable. This table tracks play status, play head position, saved state, and more, but we only need the path portion of the URL.

Cleanup requires finding episodes that:

  • Have been marked as played at some point in time.
  • Have not been marked as "saved", i.e. marked for intentional preservation.
  • Are not in a (short) list of "archival" playlists/channels, ones that should never be cleaned up.

SELECT substr(ZENCLOSUREURL, 25) FROM ZMTEPISODE
WHERE ZENCLOSUREURL LIKE 'https://cast.example.com/%'
  AND ZSAVED = 0 AND ZUNPLAYEDTAB != 1
  AND ZAUTHOR NOT IN (...archival...)

(The 25-character prefix removal is correct for my domain name in use; this lets me feed this query to sqlite3 then use the output as on-disk file paths, one episode per line.)
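
A sketch of how that output could drive the actual cleanup (the media root is a placeholder, and the archival-exclusion condition from the list above is omitted here for brevity):

# Feed the query to sqlite3, then delete the matching files on disk.
# The ?????????? glob expands to the Podcasts.app group container.
sqlite3 ~/Library/Group\ Containers/??????????.groups.com.apple.podcasts/Documents/MTLibrary.sqlite \
  "SELECT substr(ZENCLOSUREURL, 25) FROM ZMTEPISODE
   WHERE ZENCLOSUREURL LIKE 'https://cast.example.com/%'
   AND ZSAVED = 0 AND ZUNPLAYEDTAB != 1;" |
while IFS= read -r f; do
    rm -- "/var/www/podcasts/$f"     # placeholder media root
done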

Just an idea that's been bouncing around my head this last week, having poked around in the sqlite3 command-line tool a bit to see what information is available there. If this information is available, why not use it? :)

