uogbuji / amara3-xml
A data processing library built on Python 3 and MicroXML
License: Apache License 2.0
Came across the now archived XML Witch and thought it might be handy to have something like this in Amara. Might even be good to support async with for it, for cases of XML building interleaved with I/O.
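To make the idea concrete, here is a rough sketch of what async with based XML building could look like. Everything in it (XmlBuilder, element, text, result) is a hypothetical name invented for illustration; none of it is an existing Amara or XML Witch API, and asyncio.sleep(0) merely stands in for real I/O.

```python
import asyncio
from xml.sax.saxutils import escape


class XmlBuilder:
    """Hypothetical builder that collects serialized XML fragments."""

    def __init__(self):
        self.parts = []

    def element(self, name):
        # Returns an async context manager that wraps its body in <name>...</name>
        return _Element(self, name)

    def text(self, data):
        # Escape &, < and > so character data stays well-formed
        self.parts.append(escape(data))

    def result(self):
        return ''.join(self.parts)


class _Element:
    def __init__(self, builder, name):
        self.builder = builder
        self.name = name

    async def __aenter__(self):
        self.builder.parts.append(f'<{self.name}>')
        return self.builder

    async def __aexit__(self, exc_type, exc, tb):
        self.builder.parts.append(f'</{self.name}>')
        return False  # Do not swallow exceptions raised in the body


async def build():
    b = XmlBuilder()
    async with b.element('doc'):
        async with b.element('item'):
            await asyncio.sleep(0)  # Stand-in for I/O interleaved with building
            b.text('a < b')
    return b.result()
```

Running asyncio.run(build()) returns the serialized document as a string. Attributes, namespaces and streaming output are left out to keep the sketch short.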
Possibly via Cython? Check on cross-platform deployment issues in any such case.
Idea popped into my head when I stumbled across this thread.
Release notes (update in place, for easy cut & paste):
The Amara saga continues! I don't exactly remember why I decided to dead-end the Amara PyPI project when it hit 2.0, but I moved to a series of Amara 3 generation projects (amara3.iri, amara3.xml & amara3-names). Those were far more lone-wolf efforts, but at Oori Data we're seeing a lot of need for the sorts of capabilities that are inchoate in Amara 3.
Time to re-consolidate the Amara projects, call it a 4th generation, and move it to Oori Data GitHub.
4.0.0a1 (counting up via 4.0.0a2 … 4.0.0a9 … 4.0.0a10 …… 4.0.0b1 …… 4.0.0rc1, and so on). I picked this designation after far too much pondering of SemVer and PEP 440 considerations (e.g. this).
Note: Pip & PyPI are case insensitive. PEP 426 says "All comparisons of distribution names MUST be case insensitive, and MUST consider hyphens and underscores to be equivalent." An amara 4.0.0a1 package will seamlessly supersede Amara 2.0.0.
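The name-comparison rule quoted from PEP 426 can be sketched in a few lines. This is only an illustration of the rule as stated in the note (lowercase, hyphens and underscores equivalent); normalize_name and same_distribution are made-up helper names, not pip or PyPI functions. The normalization in PEP 503 additionally folds periods and runs of separators, which this sketch ignores.

```python
def normalize_name(name: str) -> str:
    """Fold case and treat underscores as hyphens, per the PEP 426 quote."""
    return name.lower().replace('_', '-')


def same_distribution(a: str, b: str) -> bool:
    """True if two distribution names refer to the same project under that rule."""
    return normalize_name(a) == normalize_name(b)
```

Under this rule, same_distribution('Amara', 'amara') holds, which is why a lowercase amara release can follow on from the Amara 2.0.0 one.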
As usual for Python tree-like & graph-like structures, amara.uxml.tree seems to be plagued by memory leaks. I went ahead and replaced child-to-parent links with weakrefs as best I could, but it seems it's still leaking. I've been using the following MARC splitter program with memory profiler.
https://gist.github.com/uogbuji/bccdb1e2fdbb7bb88459#file-marc-split-memprof-py
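For reference, the general child-to-parent weakref pattern looks roughly like the sketch below. Node is a stand-in class written for illustration, not the actual amara.uxml.tree implementation, so it shows the technique rather than the library's code.

```python
import gc
import weakref


class Node:
    """Stand-in tree node: strong refs parent-to-child, weak ref child-to-parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        # Hold the parent weakly so a child never keeps its parent alive
        self._parent = weakref.ref(parent) if parent is not None else None
        if parent is not None:
            parent.children.append(self)

    @property
    def parent(self):
        # Dereference the weakref; yields None once the parent is collected
        return self._parent() if self._parent is not None else None


def demo():
    root = Node('root')
    child = Node('child', root)
    alive = child.parent is root
    del root          # Drop the only strong reference to the parent
    gc.collect()      # Not needed for refcounting in CPython, but harmless
    return alive, child.parent
```

With only weak upward links there is no parent/child reference cycle, so any remaining leak would have to come from somewhere else, for example other long-lived references to nodes.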
from amara3.uxml import html5
from amara3.uxml.treeutil import descendants, select_elements
from amara3.uxml import xml
from amara3.uxml.treeutil import *
from amara3.uxml.tree import *
from amara3.uxml.uxpath import context as xpathcontext, parse as xpathparse
import requests
resp = requests.get('http://garybyker.library.link/resource/5bLglR2qVao/')
root = html5.parse(resp.text)
xpathctx = xpathcontext(root)
ALL_IMAGES = xpathparse('//img')
it = ALL_IMAGES.compute(xpathctx)
i = next(it) #StopIteration
# root => {uxml.element (-9223363273091541553) "html" with 3 children}
# root.xml_children[2] => {uxml.element (-9223363273092099405) "body" with 10 children}
X = xpathparse('/html//img')
it = X.compute(xpathctx)
n = next(it)
Last 3 lines result in:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/uche/.local/pyenv/main/lib/python3.6/site-packages/amara3/uxml/uxpath/ast.py", line 313, in compute
    yield from self.relative.compute(new_ctx)
  File "/home/uche/.local/pyenv/main/lib/python3.6/site-packages/amara3/uxml/uxpath/ast.py", line 234, in compute
    yield from self.right.compute(new_ctx)
  File "/home/uche/.local/pyenv/main/lib/python3.6/site-packages/amara3/uxml/uxpath/ast.py", line 379, in compute
    to_process = list(child.xml_children) + to_process[1:]
AttributeError: 'comment' object has no attribute 'xml_children'
Note that this alternative works fine:
imgs = [ e for e in descendants(root) if e.xml_name == 'img' ]
# imgs => [{uxml.element (8763762573951) "img" with 0 children}, {uxml.element (8763762576069) "img" with 0 children}, {uxml.element (8763762579471) "img" with 0 children}, {uxml.element (-9223363273092194223) "img" with 0 children}, {uxml.element (8763762583444) "img" with 0 children}, {uxml.element (-9223363273092209555) "img" with 0 children}]
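The AttributeError above suggests the descendant traversal reaches a comment node and assumes every node has xml_children. A guarded traversal might look like the following sketch. Element and Comment here are stand-in classes, and descendant_elements is a hypothetical helper; this is not the code in uxpath/ast.py nor a tested patch for it.

```python
class Element:
    """Stand-in for an element node with children."""

    def __init__(self, name, children=()):
        self.xml_name = name
        self.xml_children = list(children)


class Comment:
    """Stand-in for a comment node; note it has no xml_children."""

    def __init__(self, value):
        self.xml_value = value


def descendant_elements(node):
    """Yield descendant elements in document order, skipping childless nodes."""
    to_process = list(getattr(node, 'xml_children', []))
    while to_process:
        child = to_process.pop(0)
        if isinstance(child, Element):
            yield child
        # Comments and text have no xml_children, so fall back to an empty list
        to_process = list(getattr(child, 'xml_children', [])) + to_process


def demo():
    tree = Element('html', [
        Comment('top'),
        Element('body', [
            Element('img'),
            Comment('between'),
            Element('div', [Element('img')]),
        ]),
    ])
    return [e.xml_name for e in descendant_elements(tree)]
```

The getattr fallback is the whole point: nodes without children contribute nothing to the work list instead of raising.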
Also not working:
PARENT_RESOURCE = xpathparse('ancestor::div[@class="thumbnail-holder"]/a/@href')
img = imgs[0]
imgctx = xpathcontext(img) #, force_root=False)
res = next(PARENT_RESOURCE.compute(imgctx), None)
Working alternative:
preparent, parent = img, img.xml_parent
while parent:
    if parent.xml_name == 'div' and 'thumbnail-holder' in parent.xml_attributes.get('class', ''):
        break
    preparent = parent
    parent = parent.xml_parent
I wrote some code using amara3.uxml to modify MARCXML records. I thought I'd be able to use an xmlter.sender coroutine for input and write it out losslessly with uxml.writer. That's not happening, though: when it encounters character references on input, specifically the quot character reference in this case, each one gets turned into a literal quote character, producing non-well-formed output.
Here's a script and some input data to reproduce
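To illustrate why the literal quote matters, here is a small sketch of re-escaping on output, assuming the quote ends up inside a double-quoted attribute value. It uses the standard library's xml.sax.saxutils.escape rather than uxml.writer, and escape_attr_value and serialize_attr are hypothetical helpers, so it shows the general technique and not how amara3 serializes.

```python
from xml.sax.saxutils import escape


def escape_attr_value(value: str) -> str:
    """Escape &, <, > and double quotes for use in a double-quoted attribute."""
    return escape(value, {'"': '&quot;'})


def serialize_attr(name: str, value: str) -> str:
    """Serialize one attribute; an unescaped quote would end the value early."""
    return f'{name}="{escape_attr_value(value)}"'
```

For example, serialize_attr('title', 'The "Big" Book & more') keeps the output well-formed by writing the quotes back out as quot references.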