ozekik / lightrdf
A fast and lightweight Python RDF parser which wraps bindings to Rust's Rio using PyO3
License: Apache License 2.0
Download Starwars Turtle file:
Starwars.ttl
Parse through all triples. Line 3569 causes the parser to crash:
rdfs:label "ประเทศไนเจอร์"@th , "Niġer"@mt , "尼日尔"@zh-SG , "尼日尔"@zh-MY , "尼日尔"@zh-Hans , "尼日尔"@zh-CN , "尼日尔"@zh , "Niyer"@tl , "Ngāika"@mi , "Nigeru"@olo , "ނީޖަރު"@dv , "Nijèr"@gcr , "නයිජර්"@si , "Nnijer"@kab , "Nìger"@co , "Nìger"@pms , "Ниҷер"@tg , "Niiser"@ff , "Níher"@gn , "Niiger"@frr , "Nigerän"@vo , "Nícher"@an , "نایجېر"@ps , "尼日"@zh-classical , "နိုင်ဂျာနိုင်ငံ"@my , "Nìjẹ̀r"@yo , "尼日"@zh-TW , "尼日"@zh-Hant , "尼日"@lzh , "નાઈજર"@gu , "Niijer"@om , "നീഷർ"@ml , "Niseer"@wo , "ኒጄር"@am , "নাইজের"@bpy , "Nixèr"@sc , "नाइजर"@new , "नाइजर"@mai , "नाइजर"@hi , "नाइजर"@dty , "नाइजर"@bho , "नाइजर"@bh , "نیجر"@mzn , "نیجر"@lrc , "نیجر"@fa , "نیجر"@azb , "نيجر"@arz , "نيجر"@ary , "نائجر"@ur , "نائجر"@pnb , "Нигермудин Орн"@xal , "Republiek Niger"@nds , "Pow Nijer"@kw , "Niger"@hif , "Niger"@hak , "Niger"@gsw , "Niger"@gag , "Niger"@fy , "Niger"@fr , "Niger"@fo , "Niger"@fiu-vro , "Niger"@fi , "Niger"@eu , "Niger"@et , "Niger"@en-GB , "Niger"@en-CA , "Niger"@en , "Niger"@ee , "Niger"@dsb , "Niger"@de-CH , "Niger"@de-AT , "Niger"@de , "Niger"@da , "Niger"@cy , "Niger"@cs , "Niger"@crh-Latn , "Niger"@crh , "Niger"@ceb , "Niger"@cdo , "Niger"@bs , "Niger"@br , "Niger"@bjn , "Niger"@ban , "Niger"@az , "Niger"@als , "Niger"@ak , "Niger"@af , "Niger"@ace , "Niger"@bcl , "Niger"@hr , "Niger"@hsb , "Niger"@hu , "Niger"@ia , "Niger"@id , "Niger"@ie , "Niger"@ig , "Niĝero"@eo , "Niger"@ilo , "Niger"@it , "Niger"@jv , "Niger"@kaa , "Niger"@ki , "Niger"@nl , "Niger"@no , "Niger"@simple , "Niger"@ts , "Niger"@uz , "Niger"@vec , "Niger"@lb , "Niger"@vep , "Niger"@lg , "Niger"@li , "Niger"@lij , "Niger"@lmo , "Niger"@vi , "Niger"@vro , "Niger"@war , "Niger"@za , "Niger"@zh-min-nan , "Niger"@min , "Niger"@ms , "Niger"@nah , "Niger"@nan , "Niger"@nb , "Niger"@nds-NL , "Niger"@nn , "Niger"@nov , "Niger"@nso , "Niger"@pam , "Niger"@pap , "Niger"@pl , "Niger"@ro , "Niger"@scn , "Niger"@sco , "Niger"@se , "Niger"@sh , "Niger"@sk , "Niger"@sl , "Niger"@sm , "Niger"@sn , "Niger"@sr-EL , 
"Niger"@st , "Niger"@stq , "Niger"@su , "Niger"@sv , "Niger"@sw , "Niger"@szy , "Niger"@tk , "Niijir"@pih , "Nizëre"@sg , "Nijar"@ha , "नाईजर"@ne , "ניז'ר"@he , "নাইজার"@bn , "Niher"@zea , "Nigeri"@rw , "Nigeri"@sq , "ニジェール"@ja , "Nizer"@ln , "Nîjer"@ku , "Nigèr"@oc , "ନାଇଜର"@or , "Нігер"@uk , "Нігер"@be-x-old , "Нігер"@be-tarask , "Нігер"@be , "Níxer"@ast , "Níxer"@gl , "ನೈಜರ್"@kn , "Nicer"@diq , "INayijari"@ss , "Ніґер"@rue , "Նիգեր"@hy , "ናይጀር"@ti , "Nijier"@jam , "नीजे"@sa , "नीजे"@pi , "ܢܝܓܪ"@arc , "Nijer"@bm , "Nijer"@din , "Nijer"@io , "Nijer"@kg , "Nijer"@lad , "Nijer"@lfn , "Nijer"@tr , "Nìgeir"@gd , "Nijera"@mg , "Nigi"@ext , "Nigeris"@bat-smg , "Nigeris"@lt , "Nigeris"@sgs , "ניזשער"@yi , "ནི་ཇར།"@bo , "Nayjar"@so , "Yn Neegeyr"@gv , "Νίγηρας"@el , "Nigēra"@lv , "ნიგერი"@xmf , "ნიგერი"@ka , "நைஜர்"@ta , "Нигер"@udm , "Нигер"@tt , "Нигер"@sr-EC , "Нигер"@sr , "Nijè"@ht , "Нигер"@sah , "Нигер"@ru , "Нигер"@os , "Нигер"@mrj , "Нигер"@mn , "Нигер"@mk , "Нигер"@ky , "Нигер"@kk , "Нигер"@ce , "Нигер"@bxr , "Нигер"@bg , "Нигер"@ba , "Нигер"@ady , "Nizɛɛrɩ"@kbp , "Nig·èr"@frp , "INayighe"@zu , "An Nígir"@ga , "니제르"@ko , "نیجەر"@ckb , "Res publica Nigritana"@la , "నైజర్"@te , "नायजर"@mr , "النيجر"@ar , "النيجر"@aeb-Arab , "نائيجر"@sd , "نىگېر"@ug , "Niqir"@qu , "Ńiger"@szl , "မိူင်းၼၢႆးၵျႃး"@shn , "尼日爾"@zh-yue , "尼日爾"@zh-MO , "尼日爾"@zh-HK , "尼日爾"@yue , "尼日爾"@wuu , "ਨਾਈਜਰ"@pa , "Níger"@ca , "Níger"@cbk-zam , "Níger"@es , "Níger"@is , "Níger"@pt , "Níger"@pt-BR ;
lightrdf.Error: error while parsing language tag 'zh-classical': A subtag may be eight characters in length at maximum on line 3569 at position 468
Hi @ozekik!
Thank you for the awesome library! 👏
Unfortunately, while using your library, I got the error 🐛 mentioned in the title. 😞
But using rdflib I was not getting a similar error. 🤔
wget -q https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/pathways.rdf.xz
sudo apt install xz-utils
unxz pathways.rdf.xz
python3 count_triples_lightrdf_parser.py pathways.rdf
Traceback (most recent call last):
  File "count_triples_lightrdf_parser.py", line 8, in <module>
    for triple in parser.parse(sys.argv[1]):
lightrdf.Error: error while parsing IRI '': No scheme found in an absolute IRI
Please tell me where I am wrong. Thank you 🙏
Hi @ozekik!
I found a bug when parsing. I noticed it with the generations.rdf file, but a similar bug appears in many other files. For some reason the library reads this tag
<ns2:versionInfo rdf:datatype="http://www.w3.org/2001/XMLSchema#string">An example ontology created by Matthew Horridge</ns2:versionInfo>
as this string
'"An example ontology created by Matthew Horridge"^^<http://www.w3.org/2001/XMLSchema#string>'
in the last item of the triple (triple[-1]).
When using the rdflib library, I was not getting a similar problem.
Thanks.
It would be convenient to pass in a map of prefix-namespace expansions and have the search option return CURIEs where contraction is possible.
While this is trivial to do in a Python wrapper, it would presumably be faster to do at the Rust level.
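As a rough illustration of the Python-wrapper approach mentioned above, here is a minimal sketch of CURIE contraction. The function name `contract` and the `PREFIXES` map are hypothetical and not part of lightrdf; the sketch assumes IRIs arrive as strings that may be wrapped in angle brackets.

```python
def contract(iri, prefix_map):
    """Return a CURIE for `iri` if a namespace in `prefix_map` matches,
    otherwise return the input unchanged.

    `contract` and `prefix_map` are illustrative names, not lightrdf API.
    """
    # Strip surrounding angle brackets if present (assumed input shape).
    bare = iri[1:-1] if iri.startswith("<") and iri.endswith(">") else iri
    best = None
    for prefix, namespace in prefix_map.items():
        # Prefer the longest matching namespace so nested namespaces work.
        if bare.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return iri
    prefix, namespace = best
    return f"{prefix}:{bare[len(namespace):]}"


# Hypothetical prefix map for illustration.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

print(contract("<http://www.w3.org/2000/01/rdf-schema#label>", PREFIXES))
```

Choosing the longest matching namespace is one possible design; a Rust-level implementation could apply the same rule during the search itself.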
Currently the user has to parse the object to be able to do a lot of operations on it
This is relatively straightforward, I think:
- ^riog\d+$ is a blank node
- ^"(.*)"^^<(\S+)>$ is typed
- ^"(.*)"@\w+$ has a language tag
- ^"(.*)"$ is untyped
But it might be nice to centralize this, or do it in Rust for speed. To avoid the overhead of an OO interface, how about a parallel search_statements with arguments s, p, o_uri, o_literal_value, o_datatype, o_lang?
This is my use case:
https://github.com/INCATools/rdf-sql-bulkloader
For now I am doing this in python
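The regular expressions listed above could be centralized in Python roughly as follows. This is only a sketch: `classify_object` is a hypothetical helper, the carets in the datatype pattern are escaped so they match literally, and the language pattern is widened to `[\w-]+` (an assumption) so that tags such as zh-Hans also match.

```python
import re

# Patterns adapted from the ones listed in the issue.
BLANK = re.compile(r'^riog\d+$')
TYPED = re.compile(r'^"(.*)"\^\^<(\S+)>$', re.DOTALL)
LANG = re.compile(r'^"(.*)"@([\w-]+)$', re.DOTALL)
PLAIN = re.compile(r'^"(.*)"$', re.DOTALL)


def classify_object(o):
    """Return (kind, value, datatype, lang) for an object string.

    Hypothetical helper; anything that is neither a blank node nor a
    literal is treated as an IRI.
    """
    if BLANK.match(o):
        return ("blank", o, None, None)
    m = TYPED.match(o)
    if m:
        return ("literal", m.group(1), m.group(2), None)
    m = LANG.match(o)
    if m:
        return ("literal", m.group(1), None, m.group(2))
    m = PLAIN.match(o)
    if m:
        return ("literal", m.group(1), None, None)
    return ("iri", o, None, None)


print(classify_object('"Niger"@zh-min-nan'))
```

The four returned fields line up with the o_uri, o_literal_value, o_datatype, and o_lang arguments proposed above, so the same split could feed a bulk loader.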
When using LightRDF in the Ontology Development Kit, we have come across a very strange bug where LightRDF would fail to parse RDF/XML files that seem completely valid.
Here is a file that LightRDF fails to parse: https://github.com/INCATools/ontology-development-kit/files/10042121/tdm-bad.txt
(Sorry for the size of the file, but I was unable to reduce the error case to a minimal demonstrating example.)
Trying to parse that file with LightRDF as follows:
import sys
from lightrdf import Parser

parser = Parser()
try:
    for triple in parser.parse("tdm-bad.xml"):
        pass
except Exception as e:
    print(e)
    sys.exit(1)
yields the following error: Unexpected EOF during reading Comment.
I have no idea where exactly the bug is. However, rebuilding LightRDF after updating the Rio dependencies (rio_api, rio_turtle, and rio_xml) in Cargo.toml to their latest version (0.8.3) seems enough to fix it.
LightRDF is available as a wheel on PyPI for many combinations of systems and architectures (Windows/MacOS/Linux…, i686/x86_64/arm64…). Thanks for that!
However, one particular combination that is missing is Linux/arm64. Any chance it could be added?
I was looking for a replacement for RDFLib just to parse, do some BGP searching, and write the new triples back.
It seems that LightRDF can handle parsing and BGP searching, but not serializing the RDF triples again to a file.
Are there any plans for this?
Hello,
I am interested in using your library for fast conversion from Turtle to N-Triples.
However, as the current API only supports parsing from a file, I was wondering if it would be possible to extend the library to also parse string objects?
Thanks
Lars
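Until string input is supported, one possible workaround is to write the string to a temporary file and hand the path to the file-based parser. The sketch below assumes only that the parse function accepts a file path, as in the `parser.parse(path)` calls shown elsewhere in these issues; `parse_string` and its `parse_file` argument are hypothetical names.

```python
import os
import tempfile


def parse_string(data, parse_file, suffix=".ttl"):
    """Write `data` to a temporary file, parse it with `parse_file`,
    and return the triples as a list.

    `parse_file` is any callable that takes a path and returns an
    iterable of triples, for example lightrdf.Parser().parse (assumed).
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(data)
        # Materialize the results before the temporary file is removed.
        return list(parse_file(path))
    finally:
        os.remove(path)
```

Materializing the list gives up streaming, so this only suits strings that comfortably fit in memory; the file suffix is kept in case the parser infers the format from the extension.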
First, I have to say nice job on the library. I really love the speed and simplicity of lightrdf, and it worked very well with ChEMBL_27, but I ran into an issue when I tried to read the Wikidata ttl.
Details
wikidata's file latest-all.ttl
lightrdf.Error: error while parsing IRI 'https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew': Invalid IRI code point '#' on line 3135044 at position 92
--- nearby lines from file, including problematic line, if my sed is correct
sed -n '3135042,3135046p;3135047q' latest-all.ttl
ref:8f38c16e1f141b68f172b65f48b0982234890e56 a wikibase:Reference ;
pr:P854 <https://memory-alpha.fandom.com/wiki/USS_Voyager#Crew?oldid=2476206#Command_crew> ;
pr:P813 "2020-03-12T00:00:00Z"^^xsd:dateTime ;
prv:P813 v:ae909ef12942e232eea24326bdd78c8e ;
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
gunzip and parse the file
Note: these are big files. Another strength of lightrdf is its ability to process large files without requiring the entire data set to fit in RAM.
parse the file 'revisions_lang=en_uris.ttl'
for each .ttl in this dataset, I'm simply dumping the triples to a file like this
parser = lightrdf.Parser()
triples = parser.parse(str(input_file), format="ttl", base_iri=None)
for (s, p, o) in triples:
    f_triples.write(f"{s}\t{p}\t{o}\n")
Other notes:
- if there is a way to use a try catch to move past the line I would like to know how that is done
- ubuntu 20.04, conda env pip install lightrdf
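On the try/catch question above: wrapping the loop in try/except keeps the triples already written and reports where parsing stopped, but it does not skip the bad line, because a Python iterator that has raised generally cannot be resumed. The sketch below is generic and makes no claim about lightrdf internals; `write_until_error` and `broken_source` are hypothetical names, and in real use `error_type` would presumably be the `lightrdf.Error` seen in the tracebacks.

```python
import io


def write_until_error(triples, out, error_type=Exception):
    """Write tab-separated triples to `out` until the iterator is
    exhausted or raises. Returns (count, error), where error is None
    if everything was consumed."""
    count = 0
    try:
        for s, p, o in triples:
            out.write(f"{s}\t{p}\t{o}\n")
            count += 1
    except error_type as exc:
        return count, exc
    return count, None


def broken_source():
    """Stand-in for a parser that fails partway through a file."""
    yield ("<s1>", "<p1>", "<o1>")
    yield ("<s2>", "<p2>", "<o2>")
    raise ValueError("Invalid IRI code point")


buffer = io.StringIO()
count, error = write_until_error(broken_source(), buffer)
print(count, error)
```

Actually moving past the offending line would need either preprocessing the input (for example, filtering the line out before parsing) or a recovery mode in the parser itself.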
looks like Rio can handle it, there's just no module for it in lightrdf