brokkr / poca Goto Github PK

View Code? Open in Web Editor NEW

23.0 4.0 4.0 730 KB

A fast, multithreaded and highly customizable command line podcast client, written in Python 3

License: GNU General Public License v3.0

Python 100.00%

podcast metadata python cron rss xml filter id3 cli mutagen

poca's Introduction

Poca

Poca is a fast, multithreaded and highly customizable command line podcast client, written in Python 3.

Features

Maximum amount. Specify how many episodes the subscription should get before deleting old episodes to make room for new ones.
Override ID3/MP4/Vorbis metadata. If you want Savage Love to have Dan Savage in the artist field (rather than The Stranger), poca will automatically update the metadata upon download of each new episode. Or set genre to be overwritten by Podcast as a default.
Filter a feed. Only want news reports in the morning or on Wednesdays? Use criteria such as filename and title, or the hour, weekday or date of publishing to filter what you want from a feed.
Rename files automatically. Not all feeds have sensibly named media files. Specify a renaming template like date_title to know what you're dealing with or to get alphabetical ordering to match chronology.
From the top. A latecomer to Serial or other audiobook style podcasts? Specify from_the_top to get oldest episodes first, rather than the latest. To move on to later episodes simply delete old ones and poca will fill up with the next in line.
Keeping track. Poca logs downloads and removals to a local file so you easily see what's changed. Or configure it with an SMTP server and get notified when a feed stops working.
Manage your shows by editing an easy-to-understand xml file. Or use the accompanying tool to add, delete, sort them, or get info about their publishing frequency, average episode length and more.

Poca also: has excellent unicode support for feeds, filenames and tags, gets cover images for feeds, has the ability to spoof user agents, can pause your subscriptions, deals intelligently with interruptions, updates moved feeds (HTTP 301) automatically, and more.

Interface

All configuration is done in a single XML-format file. For cron job compatibility, Poca has a quiet mode in addition to normal and verbose.

Installation

You can install poca from pypi using pip.

pip3 install poca

If you are upgrading from any pre-1.0 release, please see this upgrade notice. To remove Poca simply do:

pip3 uninstall poca

Requirements

Python 3.6 or later
Third-party modules: requests feedparser lxml mutagen
Pip will automatically install any one of these found missing
A unicode capable terminal is recommended but not required
Due to dependencies of third-party modules (lxml requires libxml2 v. 2.9.2 and libxslt 1.1.27) distros no older than Ubuntu 18.04 are recommended.
For use on WSL, the "-g wsl" flag is recommended as it will substitute out characters known not to work on WSL (see microsoft/WSL#75)

Quickstart

[ ~ ] poca
No config file found. Making one at /home/user/.poca/poca.xml.
Please enter the full path for placing media files.
Press Enter to use default (/home/user/poca): /tmp/poca
 ⚠ Default config succesfully written to /home/user/.poca/poca.xml.
Please edit or run 'poca-subscribe' to add subscriptions.

[ ~ ] poca-subscribe add
Url of subscription: http://crateandcrowbar.com/feed/

Author: The Crate and Crowbar                            PUBLISHED / 5 WEEKS
Title:  The Crate and Crowbar

Last episode: Episode 216: Videocrates Crowdog                       ▮
Published:    24 Nov 2017                                            ▮
                                                                     ▮     ▮
Avg. size of episode:   52 Mb                            ▮  ▮     ▮  ▮  ▮  ▮
Avg. length of episode: 1h 52m                           M  T  W  T  F  S  S

Title of subscription: (Enter to use feed title)
Maximum number of files in subscription: (integer/Enter to skip) 5
Get earliest entries first: (yes/no/Enter to skip) no
Category for subscription (Enter to skip): gaming
To add metadata, rename or filters settings, please edit poca.xml

[ ~ ] poca --verbose
THE CRATE AND CROWBAR. 5 ➕
 ⇵ CCEp214.mp3  [56 Mb]
 ⇵ LGCEp004.mp3  [35 Mb]
 ⇵ CCEp215.mp3  [61 Mb]
 ...

poca's People

Contributors

Stargazers

Watchers

Forkers

code-slave adarnimrod newrain7803 nsw42

poca's Issues

Defaults

Some settings are fairly complex to most users. Some may edit them without knowing what they mean. It may therefore be necessary to load defaults and only override them with legitimate settings and settings combinations (utf8/2.4)

'Terse' output

Terse output would mean only outputting actual changes. So the user would only see lines saying episode removed and episode downloaded. And probably the error ones as well. This could be useful for logging, especially as a prerequisite for email logging (issue #26)

Limiting new downloads without deleting old episodes

I would prefer to manually delete outdated episodes, however I don't want to download entire feeds. Is it possible to limit the number of new downloads without automatically deleting old episodes?

poca-subscribe: various questions

~~Should delete with no title/url parameters loop though each and every sub? Or just inform user of option to match using --title/--url? Or both? (yes)~~
~~Should add inform user of defaults? (no there are not enough settings that benefit)~~
~~Should add have a better way to apply metadata/filters to subs? (yes but this depends on the one below - add a new issue to 0.9)~~
~~Should add sample metadata from most recent file? (This belongs in 0.9)~~
~~Should we add a check command to poca-subscribe that runs through the settings, defaults and subscriptions and informs user of validity and consequences? (no)~~

Options to rename files

Many podcasts have either random UID filenames or just name the files inconsistently. This can cause problems with the order in which the files appear in a player.

We should have options for each podcast to rename files based on:

Metadata
feed data
pubDate
serial number running from first downloaded to latest
?

Instead of giving free reins we could simply start by having a few simple prepackaged solutions for misbehaving podcasts.

We might also need sign scrubbing similar to derailleur?

Use variables in metadata (aka make up consecutive track numbers because the podcast's own are useless)

Some podcasts leave out track numbering or play fast and loose with it, occasionally inserting 'special' shows, that do not get a track number. This can be a problem for audio players.

A solution could be to allow the user to draw on variables for insertion into the metadata, specifically:

Consecutive track numbering: We don't care what number the episode has in the mind of the creator, we simply label the first one '001' and take it from there, incrementing one with each new episode.
'Reverse date' into title or album? Or track? (if id3 fields accept So January 18th 2017 would be 20170118 ensuring that ordering by title or album will be ordered in the order they appear.
Other feed data -> metadata?

This raises two related question

whether any or all of this does not apply equally to file naming (see issue #16 )
whether we want to go down the put variables into users' hands route or just add a toggle switch ("yes, please overwrite this subscription's track numbers with made up stuff")

If we want it in file names, variables is the best option. If not it might be best to contain it to a few select scenarios.

Old (ancient) RFI entries are not replaced

Some entries in the RFI feed are not being replaced despite them being from november 2015. How they got in there is a mystery. More improtantly: Why aren't they being replaced by newer ones? They have clearcut entries in both jar.lst and jar.dic - though they may not be conform to 0.5 specs? Maybe 'valid' is not in entry?

Filter: Add per-day-quota to deal with too-frequent updates

Some podcasts, typically news, update more frequently than you might need them to. Limiting the max_number doesn't do anything to combat that as you may only ever have one episode but it will constantly be a new one.

One way to deal with this is using the hour fitler which filters according to pubdate. However, some feeds either vary in the hour of publishing or simply disregard setting the hour on pubdate.

To deal with this we add a quota filter. This will simply instruct poca to filter the feed so that only X entries from any single day remain in the feed. So it will still rely on pubdate but to a lesser extent - hopefully.

Feed failures are saved to .poca db but episode failures are not

The following download failure entries into the file log


2017-02-27 13:10 RADIOAVISEN. Removed: radioavisen-2017-02-24-12-00-2.mp3
2017-02-27 13:10 RADIOAVISEN. Failed: radioavisen-2017-02-27-12-00-2.mp3
2017-02-27 14:10 RADIOAVISEN. Failed: radioavisen-2017-02-27-12-00-2.mp3
2017-02-27 15:10 RADIOAVISEN. Failed: radioavisen-2017-02-27-12-00-2.mp3
2017-02-27 16:10 RADIOAVISEN. Downloaded: radioavisen-2017-02-27-12-00-2.mp3

are not added to the buffer:

In [2]: fname = '/home/mads/.poca/db/.poca'

In [3]: with open(fname, 'rb') as f:
    jar = pickle.load(f)

In [4]: jar.buffer
Out[4]: []

It seems only failures on the feedparser part are added to the buffer. Is this how we want it to work

Filter entries

Similar to other restrictions on combo.lst we could restrict it further by filtering based on

feed info
filename
size
date and time

Unchanged etag prevents update after change in max_no / max_mb

Since all feed requests are made with saved etag, a feed request made after changes to a subscrition's max_mb or max_no attributes in the config file will return an empty feed, causing the program to skip the subscription.

Email log

It should be comparaively easy to add support for mailing the changes to yourself? At least with a local mail server. See dispatches for inspiration.

filters: changes to filters do not cause poca to revisit recent feeds (etags)

When poca notices changes to max_no and max_mb, it resets etags so a reset is forced. This does not happen when filters are changed.

Multiprocessing

Set up a socket for receiving feed updates and fire off one process for each subscription. The feed processes report back to the socket. On that socket runs a single, serial downloader that processes the updates (little Wanted+Unwanted+Lacking etc. packages). The processing includes deletes, downloads, and reports to user. The downloader/main process simply deals with the updates in the order they appear on the socket, i.e. more responsive servers will get first in line.

The proposed distinction between multiple update processes and main process is identifiable in the current code as that between 'plans' and 'execution'.

Since the downloading will still be serial, multiprocessing won't accomplish much in terms of speed gains but it should minimize 'lag' and waiting. We stay away from parallel downloads partly because each dl would steal bandwidth from the others, partly because most updates won't see multiple downloads if your average user subscribes to say 10-20 podcasts and update once an hour (assuming). Finally and most importantly, total multiprocessing invites far more chaos when things go wrong and would require a greater ui rethink.

Syntastic moaning

Syntastic has a ton of (style?) complaints. Go through them and either dismiss or adjust.

download cover: assumes jpg

Download cover should check for extension rather than assume jpg

Option to start a podcast from the beginning rather than the latest episodes

Working your way through: When a narrative podcast - e.g. Welcome to Night Vale, has a large back archive, you'll want to start at the beginning and work your way through. We need a setting that will give you the first ten episodes and then when you send the signal, it will replace those with the next ten and so forth.

Documentation: Architecture, configuration missing

Architecture: A wiki page detailing the inner workings and categories ('fruit' labels and color codes)
Configuration: An overview of the configuration file, a listing of all settings and an example configuration

reinstate mp3 tagging

If no metadata overrides are defined, leave id3 headers alone

download function: '%20' are turned into '20' rather than spaces

Download cover.jpg image from feed

Most podcast MP3s come with an embedded image these days but some seem to rely on some itunes magic with images inserted into the feed (usually itunes speciffic tags) Does feedparser report these? can we access them? Download them as a fallback cover.jpg in the folder?

Assumption: File has id3 tag; ID3NoHeaderError if file has no tag

Check to see if tag exists, if not create empty tag

poca-subscribe: update docs and setup.py before release

We're adding a new script and a new submodule. check that they're added to setup.py.
Also: Update readme.md and wiki

Entries in combo list are re-built every time we check

Everytime a subscription is given the once-over we rebuilt the metadata. Seems kinda pointless and a waste?

Unicode testing

We never really probe for what sort of strings we are tossing around. More testing to make sure that we don't run into trouble with unicode/non-unicode strings in filenames, feeds or tags.

Detect & remove missing feeds

If a feed has been removed from poca.xml, remove the dl folder and the history

Use symbols in output for easier parsing

While the less verbose output is easier to interpret, it is still not immediately obvious when there are changes as opposed to when nothing new is in the pipes. One way to make the output quicker to eye-parse is by changing the output ("No changes", "1 file(s) to download", etc.) to signs indicating what's going on.

There shouldn't be an issue with the encoding seeing as we're running Python 3 and Bash (from which we would be cat/less-reading the log) shouldn't have an issue with it either. I believe.

It isn't customary in a CLI program but why not? It could also be an option in preferences:
<pictogram_output>yes</pictogram_output>

Suggestions:

Error: ⚠ (http://unicode-table.com/en/26A0/)
Download: ➕ (http://unicode-table.com/en/2795/)
Remove: ➖ (http://unicode-table.com/en/2796/)
Exit: ❌ (http://unicode-table.com/en/274C/)
Downloading: ⇵ (http://unicode-table.com/en/21C5/)
Failed download: ☇ (http://unicode-table.com/en/2607/)

Files that by some error drop out of db are never removed

An error in some (previous?) version seems to have caused some files in Savage Love and TAC to 'drop out' of the db. These files are then invisible to poca and are never removed unless by hand.

Solutions:

Ideally not to have files drop out
Abandon db, embrace file-on-disk-is-history
Some check-up/reset loop that cleans up discrepancies

Run with alternate config

Add flag to run with user designated config directory.

Ogg file support

CUrrently only mp3 files are tagged. We should extend support for ogg. (test case: Linux Voice)

poca-subscribe: subscription attributes

Feature: Add tags to subscriptions by way of tag attributes. Specifically:

Categories: <subscription category="news">...</subscription> This ties mostly into the list command that would be able to group subscriptions in the same category together
State: <subscription state="inactive">...</subscription> A way to temporarily opt out of a podcast without having to save it somewhere else. Should delete audio files but keep db.

Assumption: All files are mp3

Add skip audio tagging if file is not mp3 (or implement Ogg tagging)

Global subscription settings

Subscription settings for

max_number
metadata
filters

should inherit global settings for same. Overrides should be possible on a per-subscription basis.

poca-subscribe: online podcast search

Add a podcast search to as-yet-to-be poca-subscribe tool. I think I've bookmarked one with an API somewhere, no?

poca-subscribe: review xmlconf

xmlconf seems a bit antiquated way of doing things - just write out a huge string - when we have lxml.objectify. Also, it's style is a bit different from that output by poca-subscribe. Is there a better way of producing a default template than giant string-writing?

Upgrade path from 0.7 to 0.8 dbs: Does it work?

Save an 0.7 snapshot for testing the upgrade path. Will 0.8 correctly recognise and parse the dbs of 0.7? We haven't changed anything about history but configs are saved...?

timeout: timeout for download of block_size seems to apply to whole update

We have assigned 60s as the timeout for a single block of about 8Kb. However, the timeout seems to kick in if the entire download takes more than 60s, despite the individual block taking much less time than that. Need to investigate how to control those signals.

logging: filter solution on stream logger is a bad hack

In order to avoid getting summaries of file actions on the stream (in addition to the one-per-line +/-/%) we use logger.warn() but filter warnings out from the stream handler. This works but is utterly incomprehensible to anybody not in on it. Requires explanation or reworking.

feature: limit by number of entries as well as/rather than than size

Currently you assign a set limit of MBs to the subscription. Make it an optional setting (if not set, there is no limit) and add an additional optional setting limiting the number of entries to keep (e.g. I always want the latest newshour and the one from the day before)

Looking up file sizes imposes serious lag

When the amount was governed by file sizes we needed file sizes on every file. As part of creating a combo instance an expansion was done on all file entries, including adding information about file size. When this is not included in the feed, we resort to pinging each url in turn to gather this information. For a long feed this can take several minutes.

This should only happen once, because the entryinfo.expand function is only run on entries not in jar. However, it seems to be a returning issue in some cases....?

Options:

Investigate if it is indeed a returning issue or just a one time thing per feed
Remove all references to file size (we aren't using it currently but it might return?)
Work around the fact that some entries will not have file size information

Check validity of config

We do a select few checks on config settings but not in any consistent way. E.g. if an incorrect date format is used in after_date the program simply crashes with a ValueError.

There are actually a number of distinct jobs here:

Checking if needed elements are present (like settings and subscriptions in the global part and title + url in each subscription)
Checking if the values are valid - like a correct url, the proper date format, a path, etc.
Converting certain values, like a string into an integer (max_number) or an XXXX-XX-XX date string into a struct_time instance.

Currently a selection of these tasks are performed in between harvesting XML and creating poca's own data holding objects. Which begs three questions:

Could we make config less of a jungle if we separated these tasks into their own functions to be performed one after the other?
What are the criteria for testing values and element presence? Should we test a select few, all or none?
Should we perform all needed conversions in config or is it ok to pass on max_number as a string, to be converted at convenience?

TOR support

Maybe it could be an option to download via TOR?

http://stackoverflow.com/a/2015649

Download: socket.gaierror is not caught

If a download starts up without internet connection a socket.gaierror is generated but we're unable to catch it. Instead we cuse a genereic catch-all exception.

files.py, line 47:

except: return Outcome(False, "Unknown error")

Outdated man file

Man file makes references to google code and other outdated information.

logging actions: standardize on either entry or filename for both file and stream log

Currently file logging user deletions are being handed uids and logging uids. This is due to the confusion over what sort of entity we're handing over to output. Standardizing on filename or entrys would help avoid this confusion.

Pro filenames:

It's all output needs. Output is a dumb function that should only be given the very basic necessities, unlike the central machinery of the Feed/Combo/Wanted classes.

Pro entries:

Entries are the standard of data exchange throughout the program
entry['poca_filename'] is instantly recognisable - you know what that is. Plain filename could be anything.

Crash on no-enclosures entries

If n entry has no valid enclosures, i.e. a pure text entry, the program crashes.

README.md needs work

The readme could be updated after all these years.

Only Python 3 instructions
Pip install
Overhauling / rewriting the description, making note of the prevalence of smart phones (rsync anybody?)
Jazzing up: Add a recording of the tool working (https://asciinema.org/ ?)

CLI tool to manage subscriptions

Removal should also remove files (and possibly clear out any history?)

history: pickle problems are recorded but not acted upon

Config checks that the db directory is writable but the results of all the checks we do in history.py are simply dropped and all is assumed well.