I have an issue I cant seem to work out on my own, not sure if it is a bug or if it is

Namespace attrs overwritten by dupekeys? (feedparser, open issue)

kurtmckee commented on August 23, 2024

Namespace attrs overwritten by dupekeys?

from feedparser.

Comments (8)

miigotu commented on August 23, 2024

Same with non-namespace nodes: with http://lolo.sickbeard.com/api?t=caps I only get back category 8000, although I do seem to get all of the subcats for the one category it returns, even when there is more than one.

Andy2244 commented on August 23, 2024

Having the same problem with newznab api

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
    <title>example.com</title>
    <description>example.com API results</description>
    <!--
      More RSS content
    -->

    <!-- offset is the current offset of the response
         total is the total number of items found by the query
    -->
    <newznab:response offset="0" total="1234"/>

    <item>
      <!-- Standard RSS 2.0 data -->
      <title>A.Public.Domain.Album.Name</title>
      <guid isPermaLink="true">http://servername.com/rss/viewnzb/e9c515e02346086e3a477a5436d7bc8c</guid>
      <link>http://servername.com/rss/nzb/e9c515e02346086e3a477a5436d7bc8c&amp;i=1&amp;r=18cf9f0a736041465e3bd521d00a90b9</link>
      <comments>http://servername.com/rss/viewnzb/e9c515e02346086e3a477a5436d7bc8c#comments</comments>
      <pubDate>Sun, 06 Jun 2010 17:29:23 +0100</pubDate>
      <category>Music > MP3</category>
      <description>Some music</description>
      <enclosure url="http://servername.com/rss/nzb/e9c515e02346086e3a477a5436d7bc8c&amp;i=1&amp;r=18cf9f0a736041465e3bd521d00a90b9" length="154653309" type="application/x-nzb" />

      <!-- Additional attributes -->
      <newznab:attr name="category" value="3000" />
      <newznab:attr name="category" value="3010" />
      <newznab:attr name="size"     value="144967295" />
      <newznab:attr name="artist"   value="Bob Smith" />
      <newznab:attr name="album"    value="Groovy Tunes" />
      <newznab:attr name="publisher" value="Epic Music" />
      <newznab:attr name="year"     value="2011" />
      <newznab:attr name="tracks"   value="track one|track two|track three" />
      <newznab:attr name="coverurl" value="http://servername.com/covers/music/12345.jpg" />
      <newznab:attr name="review"   value="This album is great" />
    </item>

</channel>
</rss>

All I get is the 'newznab' namespace populated with 'attr' and a single 'category' node. I checked the debug object, and all the other tags are simply lost by feedparser. In contrast, xml.dom.minidom gives me a list of all the nodes if I call 'dom.getElementsByTagNameNS'. From a quick look, the formatting is within the W3C specs.

Given that the issue is over a year old, I assume feedparser development has been halted?
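
The minidom behaviour described above can be checked with a short sketch. It assumes a trimmed-down copy of the sample feed (the `SAMPLE` string and the helper name `list_newznab_attrs` are made up for illustration) and only uses the standard library:

```python
from xml.dom import minidom

NEWZNAB_NS = "http://www.newznab.com/DTD/2010/feeds/attributes/"

# Trimmed-down copy of the sample feed (illustrative data only).
SAMPLE = """<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
  <item>
    <title>A.Public.Domain.Album.Name</title>
    <newznab:attr name="category" value="3000" />
    <newznab:attr name="category" value="3010" />
    <newznab:attr name="size" value="144967295" />
  </item>
</channel>
</rss>"""


def list_newznab_attrs(xml_text):
    """Return (name, value) pairs for every newznab:attr node, in document order."""
    dom = minidom.parseString(xml_text)
    nodes = dom.getElementsByTagNameNS(NEWZNAB_NS, "attr")
    return [(node.getAttribute("name"), node.getAttribute("value")) for node in nodes]


pairs = list_newznab_attrs(SAMPLE)
for name, value in pairs:
    print(name, value)
```

Both `category` nodes show up in the result, which is the behaviour the comment contrasts with feedparser.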

Safihre commented on August 23, 2024

Yeah, same here: I also want to parse newznab attributes, but feedparser won't let me.
I guess I will have to write my own parser, with all the pain that comes with it.
@Andy2244, you might want to check out Python's xml.etree.cElementTree (plain xml.etree.ElementTree on Python 3, which uses the C accelerator automatically); it's blazing fast at parsing XML.

For example:

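from urllib.request import urlopen  # Python 2: from urllib2 import urlopen
import xml.etree.ElementTree as ET  # Python 2: import xml.etree.cElementTree as ET

rss_data = urlopen("https://api.nzbgeek.info/rss?t=2000&dl=1&num=200&r=xx")

tree = ET.parse(rss_data)
root = tree.getroot()

# Need to define namespaces
ns = {'newznab': 'http://www.newznab.com/DTD/2010/feeds/attributes/',
      'nZEDb': 'http://www.newznab.com/DTD/2010/feeds/attributes/'}

# <item> elements live under <rss>/<channel>
for item in root.findall('channel/item'):
    # Compare against None: an Element without children is falsy,
    # so `find(...) or find(...)` would skip a valid first match.
    b = item.find("newznab:attr[@name='size']", ns)
    if b is None:
        b = item.find("nZEDb:attr[@name='size']", ns)
    if b is not None:
        print(b.get('value'))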
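
Building on the example above, here is a minimal sketch of how every newznab attribute of an item could be collected without losing repeated names. The `SAMPLE` string and the helper name `newznab_attrs` are assumptions made for illustration; values are grouped into lists so that two `category` entries both survive:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NEWZNAB_NS = "http://www.newznab.com/DTD/2010/feeds/attributes/"

# Trimmed-down copy of the sample feed (illustrative data only).
SAMPLE = """<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
  <item>
    <title>A.Public.Domain.Album.Name</title>
    <newznab:attr name="category" value="3000" />
    <newznab:attr name="category" value="3010" />
    <newznab:attr name="size" value="144967295" />
  </item>
</channel>
</rss>"""


def newznab_attrs(item):
    """Group the newznab:attr children of one <item> as {name: [values]}."""
    grouped = defaultdict(list)
    # "{namespace-uri}tag" is ElementTree's fully qualified tag syntax.
    for attr in item.findall("{%s}attr" % NEWZNAB_NS):
        grouped[attr.get("name")].append(attr.get("value"))
    return dict(grouped)


root = ET.fromstring(SAMPLE)
results = [newznab_attrs(item) for item in root.findall("channel/item")]
for attrs in results:
    print(attrs)
```

Always using lists, even for single values, keeps the calling code uniform; whether that suits a given application is a design choice.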

Andy2244 commented on August 23, 2024

Here is what I'm doing as a quick fix, since I generally like feedparser's ease of use.

from xml.dom import minidom

NAMESPACE_NAME = 'newznab'
NAMESPACE_URL = 'http://www.newznab.com/DTD/2010/feeds/attributes/'
NAMESPACE_TAGNAME = 'attr'

    # feedparser can't handle namespaced nodes that share a tag name, so rename
    # each <newznab:attr name="X" .../> to <newznab:X .../>.
    # (Method of a larger class; `log` is assumed to be the project's own logger.)
    def make_feedparser_friendly(self, data):
        try:
            dom = minidom.parseString(data)
            items_ns = dom.getElementsByTagNameNS(NAMESPACE_URL, NAMESPACE_TAGNAME)
            for node in items_ns:
                if node.hasAttribute('name') and node.hasAttribute('value'):
                    # toxml() serializes tagName, so this renames the element
                    node.tagName = NAMESPACE_NAME + ':%s' % node.getAttribute('name')
        except Exception as ex:
            log.trace('Unable to rename nodes in XML: %s' % ex)
            return None
        return dom.toxml()
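
To see what this rename does to a feed, here is a standalone sketch of the same idea as a plain function. The function name `rename_attr_nodes` and the `SAMPLE` string are made up for illustration, and error handling is left out:

```python
from xml.dom import minidom

NAMESPACE_NAME = "newznab"
NAMESPACE_URL = "http://www.newznab.com/DTD/2010/feeds/attributes/"

# Trimmed-down copy of the sample feed (illustrative data only).
SAMPLE = """<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
  <item>
    <newznab:attr name="category" value="3000" />
    <newznab:attr name="category" value="3010" />
    <newznab:attr name="size" value="144967295" />
  </item>
</channel>
</rss>"""


def rename_attr_nodes(data):
    """Rename each <newznab:attr name="X" .../> element to <newznab:X .../>."""
    dom = minidom.parseString(data)
    for node in dom.getElementsByTagNameNS(NAMESPACE_URL, "attr"):
        if node.hasAttribute("name") and node.hasAttribute("value"):
            # toxml() writes out tagName, so changing it renames the element.
            node.tagName = "%s:%s" % (NAMESPACE_NAME, node.getAttribute("name"))
    return dom.toxml()


renamed = rename_attr_nodes(SAMPLE)
print(renamed)
```

Note that repeated names such as `category` still come out as two elements with the same tag, so the rename by itself does not guarantee that a downstream parser keeps both values.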

miigotu commented on August 23, 2024

I ditched feedparser altogether because of this, which was fine for my application. Now I parse both XML and HTML pages using bs4.

Andy2244 commented on August 23, 2024

Yeah, I checked BS, but its XML support depends on a working external lxml install, which is a pain to set up on Windows. That's why I picked minidom and just hotfix the XML namespace attributes.

miigotu commented on August 23, 2024

You don't need lxml; I use html5lib as the parser for everything.

miigotu commented on August 23, 2024

This is pretty hacky, but it lets me use feedparser without having to parse the data twice. I'm sure it can be improved and made generic, so that values are automatically converted to lists whenever an overwrite would occur, but this is good enough for me for now:

diff --git a/lib/feedparser/api.py b/lib/feedparser/api.py
index 614bd2d..12eafd2 100644
--- a/lib/feedparser/api.py
+++ b/lib/feedparser/api.py
@@ -60,6 +60,7 @@ from .sanitizer import replace_doctype
 from .sgml import *
 from .urls import _convert_to_idn, _makeSafeAbsoluteURI
 from .util import FeedParserDict
+from . import USER_AGENT
 
 bytes_ = type(b'')
 unicode_ = type('')
diff --git a/lib/feedparser/util.py b/lib/feedparser/util.py
index f7c02c0..df36b3e 100644
--- a/lib/feedparser/util.py
+++ b/lib/feedparser/util.py
@@ -122,9 +122,23 @@ class FeedParserDict(dict):
 
     def __setitem__(self, key, value):
         key = self.keymap.get(key, key)
-        if isinstance(key, list):
-            key = key[0]
-        return dict.__setitem__(self, key, value)
+        if key == 'newznab_attr':
+            if isinstance(value, dict) and set(value.keys()) == {'name', 'value'}:
+                key = value['name']
+                value = value['value']
+
+            if not dict.__contains__(self, 'categories'):
+                dict.__setitem__(self, 'categories', [])
+
+            if key == 'category':
+                self['categories'].append(value)
+            else:
+                dict.__setitem__(self, key, value)
+        else:
+            if isinstance(key, list):
+                key = key[0]
+
+            return dict.__setitem__(self, key, value)
 
     def setdefault(self, key, value):
         if key not in self:
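
The generic variant mentioned above, where a value turns into a list as soon as an overwrite would occur, could look roughly like the following sketch. It is a standalone illustration, not feedparser's actual FeedParserDict; the class names `MultiValueDict` and `MultiValue` are hypothetical:

```python
class MultiValue(list):
    """Marker type so that ordinary list values are not mistaken for merged ones."""


class MultiValueDict(dict):
    """Hypothetical dict that keeps every value assigned to a key.

    The first assignment stores the value as-is; later assignments to the
    same key turn the entry into a MultiValue list and append to it.
    """

    def __setitem__(self, key, value):
        if key not in self:
            dict.__setitem__(self, key, value)
            return
        current = dict.__getitem__(self, key)
        if isinstance(current, MultiValue):
            current.append(value)
        else:
            dict.__setitem__(self, key, MultiValue([current, value]))


entry = MultiValueDict()
for name, value in [("category", "3000"), ("category", "3010"), ("size", "144967295")]:
    entry[name] = value

print(entry["category"])
print(entry["size"])
```

One caveat of this design: `dict.update()` and the `dict(...)` constructor bypass an overridden `__setitem__`, so only item assignment gets the merging behaviour.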