I've had the BBC 6-o-clock-news feed working for quite some time, but on December 2, g

greg sync hangs on BBC Six O'Clock News feed about greg HOT 23 CLOSED

manolomartinez commented on May 24, 2024

from greg.

Comments (23)

manolomartinez commented on May 24, 2024

Hi there,

If you look at greg info, you'll see that greg has been instructed to sync the feed from Dec 2nd 2016. Note the year :) That's what the feed says about the time of publication of that entry.

One fix is to edit the feed to start downloading from today on. The problem is that, while that entry remains in the rss feed (I think it goes away tomorrow), it will keep pushing poor greg's next download to 2016. As I say, if I understand that feed correctly, that entry will be pushed out of the stack tomorrow and will stop misleading greg.

Going forward, I don't think there's any easy way to avoid this. Perhaps a sanity check, so that greg issues a warning if the published date is in the future. But if, for example, the typo that creeps into the feed is that the published date is said to be, e.g., 2012, it will not be downloaded, sanity check or not.

Any ideas?

commented on May 24, 2024

This is similar to #35 in that it's dealing with typos or errors in the feed. I suppose you could limit the feed to the current year, so that anything dated in the future is pulled back to this year. Realistically, there should be no way to publish a feed item in the future (it doesn't even make sense), so limit to the current year.

For entry errors of previous years, perhaps do a test to look at surrounding entries. If last 3 entries are 2015, this entry is 2015 as well, even if it claims to be 2012. Perhaps show the date as 2015 (modified from 2012) or something to show it's an error.
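The neighbour test described above could be sketched roughly as follows. This is only an illustration: `repair_year`, its `window` parameter, and the assumption that the feed's dates are available as a list of `datetime` objects in feed order are all hypothetical and not part of greg.

```python
from collections import Counter
from datetime import datetime


def repair_year(dates, index, window=3):
    """Return dates[index], with its year replaced if the neighbours disagree.

    Hypothetical helper: `dates` is a list of datetimes in feed order.
    """
    before = dates[max(0, index - window):index]
    after = dates[index + 1:index + 1 + window]
    neighbours = before + after
    if not neighbours:
        return dates[index]
    year, votes = Counter(d.year for d in neighbours).most_common(1)[0]
    # Only override when a clear majority of the neighbours share one year.
    if votes > len(neighbours) / 2 and dates[index].year != year:
        try:
            return dates[index].replace(year=year)
        except ValueError:
            # e.g. 29 February moved into a non-leap year: leave it alone.
            return dates[index]
    return dates[index]
```

One known weakness of a check like this: a genuine year rollover (a 1 January entry surrounded by December entries) would be "repaired" incorrectly, so it would need extra care around the new year.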

n8willis commented on May 24, 2024

Frankly, I think the problem is that sync should not be automatically setting the date based on the contents of the feed at all -- it ought to automatically set it based on the last sync execution time. The actual run-time on my machine is what matters for automation purposes -- if there were no dates at all in the feed metadata, you would still expect sync to download what was new since the previous run. Wouldn't you? I sure would.

I guess I had mistakenly thought this was what it was doing. I don't see any reason why the error-prone metadata should matter for syncing at all; at most it's informative for the user, but these examples show the problem with assuming it is accurate.

There might be valid reasons for the user to set the sync date to the future with 'edit' -- kind of a 'hold my calls', for example. But I would contend that "last time I ran sync on feed X" is what sync should automatically set the date to, not something pulled from anywhere else....

manolomartinez commented on May 24, 2024

Here is one reason why you really cannot get rid of error-prone metadata:

Suppose that you sync today, Dec 9th. How are you going to decide which entries to download in your next sync? Presumably, those that were published after your last sync; but that implies parsing the "published" field. Another way of doing it would be to download everything except those entries that have already been downloaded, but this will clearly not be the right thing to do for the many podcasts that carry the whole history of entries in their feed.

If you are stuck with parsing the published date (and, if I have missed an obvious way to avoid this, do let me know), I thought, one might as well use it to allow things such as --downloadfrom a particular date, which would be impossible otherwise.

Again, I'm open to changing my mind about this if there's a better solution.

n8willis commented on May 24, 2024

Sync could definitely still parse the "published" timestamp on episodes to decide whether they're new or old, but it's better to compare those with the execution timestamp than it is to compare them to an arbitrary, possibly-totally-wrong "published" timestamp. Because at least you know one of the values (the last-sync date) is valid.

Yes, you could still encounter situations where the "published" field misleads greg, but in that case you would still get fewer missed episodes than you do by setting the sync-date to a (potentially) mangled date.

E.g., if only episode 12/02 contained a bad date, then at worst only episode 12/02 would get bungled. As it is now, a metadata typo in a single episode can kill all future downloads.
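A rough sketch of that idea, under stated assumptions: entries are modelled as hypothetical `(title, published)` pairs with timezone-aware datetimes (or `None` when the date could not be parsed), and `sync_once` is an invented name, not greg's code. The upper bound on `now` is also an assumption added here, so that a future-dated entry is skipped rather than picked up on every run.

```python
from datetime import datetime, timezone


def sync_once(entries, last_sync, now=None):
    """Pick entries published in (last_sync, now] and return the new marker.

    `entries` is a list of (title, published) pairs; `published` is an
    aware datetime, or None if the feed's date could not be parsed.
    """
    now = now or datetime.now(timezone.utc)
    new_entries = [
        (title, published)
        for title, published in entries
        if published is not None and last_sync < published <= now
    ]
    # The marker for the next run is the execution time itself,
    # never a date taken from the feed.
    return new_entries, now
```

With this shape, a single entry carrying a bad year is itself missed (or wrongly skipped), but the marker used for the next run stays valid, so later episodes are unaffected.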

Just at a conceptual level, I think it's important that the sync date reflects when the sync actually took place, rather than recording a different piece of info pulled from elsewhere.

commented on May 24, 2024

Suppose that you sync today, Dec 9th. How are you going to decide which entries to download in your next sync? Presumably, those that were published after your last sync; but that implies parsing the "published" field.

Is the "published" field the only one which denotes which entries succeed other entries? Is there another way to determine which items 'come after' preceding items?

Some sort of comparison method of feed states might get around the dates.

Perhaps you could parse the feed into a JSON file each time you sync, and do a compare against the existing data, diff the 2 JSON arrays, and for each given section newly present, download those specific entries?

manolomartinez commented on May 24, 2024

Sync could definitely still parse the "published" timestamp on episodes to decide whether they're new or old, but it's better to compare those with the execution timestamp than it is to compare them to an arbitrary, possibly-totally-wrong "published" timestamp. Because at least you know one of the values (the last-sync date) is valid.

That would have problems too: if you are syncing, say, 5 entries of a feed and interrupt the sync before it finishes, greg would not know to restart it at the correct entry.

I will think about this, and try to come up with something, perhaps building on xHN35RQ's suggestions.

n8willis commented on May 24, 2024

Maybe that's true, but I think that outcome would still be a preferable failure mode to having the feed silently fail to update for a year.

At the very least, that situation would only be a problem if the sync got interrupted, as opposed to happening all the time.

n8willis commented on May 24, 2024

Actually, now that I think about it, isn't it simpler than @xhn35rq's suggestion? You could just compare the publication timestamp to the current time, and if the timestamp is in the future, disregard it just like you would a syntactically malformed date, right? Or am I missing something?
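As an illustration of treating a future date the same way as a malformed one, here is a minimal sketch using only the standard library. `usable_pubdate` is a hypothetical name, and it assumes the raw RFC 822 date string from the feed is at hand; greg's actual parsing path may look quite different.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def usable_pubdate(raw, now=None):
    """Return the parsed date, or None if it is malformed or in the future."""
    now = now or datetime.now(timezone.utc)
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Older Pythons raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Assumption: treat dates without a usable offset as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed if parsed <= now else None
```

A caller would then skip any entry for which this returns `None` when working out the next download date.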

commented on May 24, 2024

@n8willis That's more or less what I suggested here: #36 (comment) and @manolomartinez's reply to that: #36 (comment)

commented on May 24, 2024

To elaborate on my feed state suggestion:

Greg loads in new feed, and captures the feed state (json? hash of some kind?)
Upon each greg sync, capture the feed state, and compare against last stored feed state. Determine what entries are different, and mark those entries as new.
For each different entry, do the appropriate thing with the new entries (download, notify user, etc)

This gets around the date issue, since greg is not using any data from the feed to determine which items to download. Instead it is diffing the feed against itself to determine what is different, and therefore changed.

This has the added benefit of allowing feed publishers to edit existing feed items, and depending on how you capture the feed state, greg should be able to recognize these edited feed items, and re-download them accordingly.
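The feed-state comparison could look something like the sketch below. It is a sketch under assumptions, not greg's implementation: entries are modelled as plain dicts that each carry a stable `"id"` key (real feeds do not always provide one), and the function names are invented for illustration.

```python
import hashlib
import json


def entry_fingerprint(entry):
    """Hash an entry's content so that edits to it can be detected."""
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def capture_state(entries):
    """Map each entry id to a fingerprint of its content."""
    return {entry["id"]: entry_fingerprint(entry) for entry in entries}


def diff_states(old, new):
    """Return (added_ids, changed_ids) between two captured states."""
    added = [key for key in new if key not in old]
    changed = [key for key in new if key in old and old[key] != new[key]]
    return added, changed
```

The captured state could be written to disk with `json.dump` after each sync and loaded again before the next one. Note that on the very first sync every entry counts as "added", which is the concern raised earlier about podcasts that keep their whole history in the feed.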

Just some ideas, I'm not sure if this is the best approach but it feels like it's the most robust. At least with my limited understanding of all this 😄

n8willis commented on May 24, 2024

Well, I don't think that's what I was suggesting at all. I.e., I think there are simpler approaches than trying to track multiple/all entries.

My point was that, right now, the downloadfrom date is set to max(linkdates). It ought to be simple to check that max(linkdates) is less than or equal to the current time; if max(linkdates) is in the future, greg could just set the downloadfrom time to 'right now'. That would fix this issue, AFAICT....
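In code, the check amounts to a clamp. The sketch below assumes `linkdates` is a non-empty list of timezone-aware datetimes; greg's internal representation may differ, and `next_downloadfrom` is a hypothetical name.

```python
from datetime import datetime, timezone


def next_downloadfrom(linkdates, now=None):
    """Return max(linkdates), but never a moment later than `now`."""
    now = now or datetime.now(timezone.utc)
    newest = max(linkdates)
    # A newest date in the future is replaced by the current time.
    return min(newest, now)
```

This only guards against dates in the future; a typo that moves a date into the past would pass through unchanged.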

commented on May 24, 2024

I caused some confusion, I'm sorry. @n8willis my reply to you is this comment: #36 (comment)

The next comment is just me further explaining my feed state idea unrelated to the previous comment: #36 (comment)

That said, I'm still not seeing how what you're describing is different from the date comparison idea I suggested earlier: #36 (comment) - isn't it the same idea? Maybe I'm missing something. If it is the same idea, then see @manolomartinez's response here: #36 (comment)

n8willis commented on May 24, 2024

It's certainly possible we're describing precisely the same thing; sure. It seems -- to me -- like your suggestion involves rather more than mine: (1) altering the metadata of items to change the year to the current year, (2) examining the dates of three other feed items, (3) testing for dates in the past.

I'm not suggesting making any changes to the metadata in any feed item. [Nor, for that matter, adding additional state, though I understand that that, from the later comment, is a separate idea.] I'm just saying "don't set the downloaddate for a feed to be a date in the future when you do a simple sync". It just seems like a minimal sanity check, which is what I was trying to find.

I think that check would have prevented the issue I originally reported, because that was caused by greg assuming the max pubdate should automatically be the new downloaddate to sync from. I know there are other conditions that the comparison wouldn't catch.

Bonus fun fact I found while reading around: apparently, it's not invalid for a pubdate to be in the future: https://validator.w3.org/feed/docs/rss2.html#ltpubdategtSubelementOfLtitemgt

...but the spec says aggregators can choose to ignore those feed items until they arrive in the future.

n8willis commented on May 24, 2024

Oh, and @xhn35rq -- I was also not trying to say your suggestions were not good ideas; I was merely championing consideration for a quick fix (that would hopefully not rule out other enhancements)...!

commented on May 24, 2024

No problem! I appreciate the response. A quick fix is preferable and your idea is the simplest. I understand it now. I was under the impression @manolomartinez addressed something similar to your idea... thus my more complex suggestions.

manolomartinez commented on May 24, 2024

Hey, thanks both for the ideas and your sustained attention to this -- I mean it! I hope I'll have something to show in a few days. Meanwhile, I wouldn't at all mind having to review a PR, just sayin' :)

commented on May 24, 2024

I was thinking about this, and at L228 would something like this gist be on the right track?

https://gist.github.com/xHN35RQ/ac0ca34df5f85628495f

EDIT - updated gist URL

manolomartinez commented on May 24, 2024

Hi, you are just taking out of the if-then two redundant assignments, aren't you?

commented on May 24, 2024

No, I added some new code, in which I introduced those redundant assignments myself, then removed them. (because I'm a noob at python)

Here's another gist comparing the code I am suggesting against the original block at L228

https://gist.github.com/xHN35RQ/ac0ca34df5f85628495f

commented on May 24, 2024

@manolomartinez just now saw your comment on that gist. PR inbound.

commented on May 24, 2024

Should be fixed now, as of 73152e1

Is greg updated on pypi?

manolomartinez commented on May 24, 2024

It is now!
