Comments (8)

miigotu avatar miigotu commented on August 23, 2024

The same happens with non-namespaced nodes. With http://lolo.sickbeard.com/api?t=caps I only get back category 8000, but I do seem to get all of the subcats for the category it returns, even when there is more than one.

from feedparser.

Andy2244 avatar Andy2244 commented on August 23, 2024

I'm having the same problem with the newznab API:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
    <title>example.com</title>
    <description>example.com API results</description>
    <!--
      More RSS content
    -->

    <!-- offset is the current offset of the response
         total is the total number of items found by the query
    -->
    <newznab:response offset="0" total="1234"/>

    <item>
      <!-- Standard RSS 2.0 data -->
      <title>A.Public.Domain.Album.Name</title>
      <guid isPermaLink="true">http://servername.com/rss/viewnzb/e9c515e02346086e3a477a5436d7bc8c</guid>
      <link>http://servername.com/rss/nzb/e9c515e02346086e3a477a5436d7bc8c&amp;i=1&amp;r=18cf9f0a736041465e3bd521d00a90b9</link>
      <comments>http://servername.com/rss/viewnzb/e9c515e02346086e3a477a5436d7bc8c#comments</comments>
      <pubDate>Sun, 06 Jun 2010 17:29:23 +0100</pubDate>
      <category>Music > MP3</category>
      <description>Some music</description>
      <enclosure url="http://servername.com/rss/nzb/e9c515e02346086e3a477a5436d7bc8c&amp;i=1&amp;r=18cf9f0a736041465e3bd521d00a90b9" length="154653309" type="application/x-nzb" />

      <!-- Additional attributes -->
      <newznab:attr name="category" value="3000" />
      <newznab:attr name="category" value="3010" />
      <newznab:attr name="size"     value="144967295" />
      <newznab:attr name="artist"   value="Bob Smith" />
      <newznab:attr name="album"    value="Groovy Tunes" />
      <newznab:attr name="publisher" value="Epic Music" />
      <newznab:attr name="year"     value="2011" />
      <newznab:attr name="tracks"   value="track one|track two|track three" />
      <newznab:attr name="coverurl" value="http://servername.com/covers/music/12345.jpg" />
      <newznab:attr name="review"   value="This album is great" />
    </item>

</channel>
</rss>

All I get is the 'newznab' namespace populated with 'attr' and a single 'category' node. I checked the debug object, and all the other tags are simply lost by feedparser. In contrast, xml.dom.minidom gives me a list of all the nodes if I call 'dom.getElementsByTagNameNS'. From a quick look, the formatting is within the W3C specs.

Given that the issue is over a year old, I assume feedparser development has been halted?
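
The minidom comparison mentioned in this comment can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual code: the helper name `list_newznab_attrs` and the trimmed `SAMPLE` string are assumptions made for the example, and only standard-library calls are used.

```python
from xml.dom import minidom

NEWZNAB_NS = "http://www.newznab.com/DTD/2010/feeds/attributes/"

# Trimmed-down copy of the sample feed above (illustrative data only).
SAMPLE = """<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
  <item>
    <title>A.Public.Domain.Album.Name</title>
    <newznab:attr name="category" value="3000" />
    <newznab:attr name="category" value="3010" />
    <newznab:attr name="size" value="144967295" />
  </item>
</channel>
</rss>"""


def list_newznab_attrs(xml_text):
    """Return (name, value) pairs for every newznab:attr node, in document order."""
    dom = minidom.parseString(xml_text)
    nodes = dom.getElementsByTagNameNS(NEWZNAB_NS, "attr")
    return [(n.getAttribute("name"), n.getAttribute("value")) for n in nodes]


pairs = list_newznab_attrs(SAMPLE)
print(pairs)
```

Both category nodes come back here, which is the behaviour the comment contrasts with feedparser's single 'category' entry.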

Safihre avatar Safihre commented on August 23, 2024

Yeah, same here. I also want to parse newznab attributes, but feedparser won't let me.
I guess I will have to write my own parser, with all the pain that comes with it.
@Andy2244, you might want to check out Python's xml.etree.cElementTree; it's blazing fast at parsing XML.

For example:

from urllib.request import urlopen
# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the C accelerator itself
import xml.etree.ElementTree as ET

rss_data = urlopen("https://api.nzbgeek.info/rss?t=2000&dl=1&num=200&r=xx")

tree = ET.parse(rss_data)
root = tree.getroot()

# Need to define namespaces
ns = {'newznab': 'http://www.newznab.com/DTD/2010/feeds/attributes/',
      'nZEDb': 'http://www.newznab.com/DTD/2010/feeds/attributes/'}

for item in root.findall('.//item'):
    # Compare against None: an Element without children is falsy, so "a or b" would skip a valid match
    b = item.find("newznab:attr[@name='size']", ns)
    if b is None:
        b = item.find("nZEDb:attr[@name='size']", ns)
    if b is not None:
        print(b.get('value'))
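
Building on the ElementTree approach above, here is a rough sketch of collecting every newznab attribute of an item, including the repeated 'category' ones, into one dict. The function name `newznab_attrs` and the inline `SAMPLE` feed are assumptions for illustration; every value is stored as a list so repeated names are never overwritten.

```python
import xml.etree.ElementTree as ET

NS = {"newznab": "http://www.newznab.com/DTD/2010/feeds/attributes/"}

# Small stand-in for a real API response (illustrative data only).
SAMPLE = """<rss version="2.0" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/">
<channel>
  <item>
    <title>A.Public.Domain.Album.Name</title>
    <newznab:attr name="category" value="3000" />
    <newznab:attr name="category" value="3010" />
    <newznab:attr name="size" value="144967295" />
  </item>
</channel>
</rss>"""


def newznab_attrs(item):
    """Map each attribute name to the list of all values found on this item."""
    attrs = {}
    for node in item.findall("newznab:attr", NS):
        attrs.setdefault(node.get("name"), []).append(node.get("value"))
    return attrs


root = ET.fromstring(SAMPLE)
results = [newznab_attrs(item) for item in root.iter("item")]
for item, attrs in zip(root.iter("item"), results):
    print(item.findtext("title"), attrs)
```

Keeping every value in a list costs an extra `[0]` for single-valued attributes such as 'size', but it avoids having to know in advance which names can repeat.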

Andy2244 avatar Andy2244 commented on August 23, 2024

Here is what I'm doing as a quick fix, since I generally like feedparser's ease of use.

from xml.dom import minidom

NAMESPACE_NAME = 'newznab'
NAMESPACE_URL = 'http://www.newznab.com/DTD/2010/feeds/attributes/'
NAMESPACE_TAGNAME = 'attr'

    # feedparser can't handle several namespaced nodes with the same tag name, so rename
    # each <newznab:attr name="X" .../> node to <newznab:X .../> before parsing.
    # (Excerpt from a class; 'log' is defined elsewhere in the application.)
    def make_feedparser_friendly(self, data):
        try:
            dom = minidom.parseString(data)
            items_ns = dom.getElementsByTagNameNS(NAMESPACE_URL, NAMESPACE_TAGNAME)
            if items_ns:
                for node in items_ns:
                    if node.attributes and 'name' in node.attributes and 'value' in node.attributes:
                        node.tagName = NAMESPACE_NAME + ':%s' % node.attributes['name'].value
                        node.name = node.attributes['name'].value
                        node.value = node.attributes['value'].value
        except Exception as ex:
            log.trace('Unable to rename nodes in XML: %s' % ex)
            return None
        return dom.toxml()

miigotu avatar miigotu commented on August 23, 2024

I ditched feedparser altogether because of this, since that suited my application. Now I parse both XML and HTML pages using bs4.

Andy2244 avatar Andy2244 commented on August 23, 2024

Yeah, I checked BS, but the XML part depends on an external, working lxml, which is a pain to install on Windows. That's why I picked minidom and just hotfix the XML namespace attributes.

miigotu avatar miigotu commented on August 23, 2024

You don't need lxml; I use html5lib as the parser for everything.

miigotu avatar miigotu commented on August 23, 2024

This is pretty hacky, but it lets me use feedparser without having to double-parse the data. I'm sure it can be improved and made generic, automatically converting values to lists when an overwrite would occur, but this is good enough for me for now:

diff --git a/lib/feedparser/api.py b/lib/feedparser/api.py
index 614bd2d..12eafd2 100644
--- a/lib/feedparser/api.py
+++ b/lib/feedparser/api.py
@@ -60,6 +60,7 @@ from .sanitizer import replace_doctype
 from .sgml import *
 from .urls import _convert_to_idn, _makeSafeAbsoluteURI
 from .util import FeedParserDict
+from . import USER_AGENT
 
 bytes_ = type(b'')
 unicode_ = type('')
diff --git a/lib/feedparser/util.py b/lib/feedparser/util.py
index f7c02c0..df36b3e 100644
--- a/lib/feedparser/util.py
+++ b/lib/feedparser/util.py
@@ -122,9 +122,23 @@ class FeedParserDict(dict):
 
     def __setitem__(self, key, value):
         key = self.keymap.get(key, key)
-        if isinstance(key, list):
-            key = key[0]
-        return dict.__setitem__(self, key, value)
+        if key == 'newznab_attr':
+            if isinstance(value, dict) and value.keys() == ['name', 'value']:
+                key = value['name']
+                value = value['value']
+
+            if not dict.__contains__(self, 'categories'):
+                dict.__setitem__(self, 'categories', [])
+
+            if key == 'category':
+                self['categories'].append(value)
+            else:
+                dict.__setitem__(self, key, value)
+        else:
+            if isinstance(key, list):
+                key = key[0]
+
+            return dict.__setitem__(self, key, value)
 
     def setdefault(self, key, value):
         if key not in self:
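
The "auto convert values to lists when an overwrite would occur" idea mentioned before the patch could look something like the sketch below. `MultiValueDict` is a hypothetical class written for this illustration; it is not part of feedparser and is not what the patch above does.

```python
class MultiValueDict(dict):
    """Hypothetical dict: a second write to an existing key turns the value into a list."""

    def __setitem__(self, key, value):
        if key in self:
            current = dict.__getitem__(self, key)
            if isinstance(current, list):
                # Key already holds several values: just add the new one.
                current.append(value)
                return
            # First overwrite: keep the old value and the new one side by side.
            value = [current, value]
        dict.__setitem__(self, key, value)


entry = MultiValueDict()
entry["category"] = "3000"
entry["category"] = "3010"
entry["size"] = "144967295"
print(entry)
```

One caveat of this shape: a value that is itself a list cannot be told apart from accumulated values, so it only suits data such as these string attributes.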
