oracc / pyoracc
Python tools for working with ORACC
License: GNU General Public License v3.0
Working on #78, I noticed some inconsistencies in the way line labels (numbers) are recognised by the lexer. For example, the set of characters accepted as a prime marker for relative line numbers differs between contexts, and none of the patterns accept labels like 109a. in Q000040 from the sample corpus.
Probably the pattern for the line label should be declared once and concatenated into all the lexer patterns which need to recognize them.
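A sketch of that refactor (the character classes here are assumptions for illustration, not pyoracc's actual definitions): declare the label pattern once and concatenate it into every lexer rule that needs it.

```python
import re

# Assumed patterns, for illustration only: a single shared definition of a
# line label, reused when building each lexer token pattern.
LINE_LABEL = r"[0-9]+[a-z]?"     # plain labels such as "1" or "109a"
PRIME = r"[\u2032']"             # assumed prime markers for relative numbers

# Token patterns built from the shared pieces:
t_LINELABEL = LINE_LABEL + r"\.(?=\s)"
t_RELATIVE_LABEL = LINE_LABEL + PRIME + r"\.(?=\s)"
```

With a single definition, accepting a label like 109a. in one context automatically accepts it in every other context.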
Why is the ruling count saved as an int by the parser, when the serialiser then has to do so much extra work to convert it back to a string? I.e.:

```python
def getRulingType(self):
    typeArr = ["single", "double", "triple"]
    try:
        return typeArr[self.count - 1]
    except TypeError:
        print("Error: Ruling count " + str(self.count) + " must be an integer.")
    except IndexError:
        print("Error: Ruling count (" + str(self.count) +
              ") is out of bounds (" + str(len(typeArr)) + ").")
```
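If the mapping lived in one place, the serialiser would stay trivial. A minimal sketch (names illustrative, not pyoracc's actual API):

```python
# Illustrative alternative: one dictionary mapping ruling counts to the
# keyword the serialiser needs to emit.
RULING_TYPES = {1: "single", 2: "double", 3: "triple"}

def ruling_type(count):
    """Return the ruling keyword for a count, raising a clear error otherwise."""
    try:
        return RULING_TYPES[count]
    except (KeyError, TypeError):
        raise ValueError("Ruling count must be 1, 2 or 3, got %r" % (count,))
```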
MEE15_44.atf
Parsing file /Users/jhn/ucl/oracc/oracc_corpus/dcclt/ebla/00atf/MEE15_44.atf ... Failed with message: 'PyOracc got an illegal character ' ''
Contains a #bib: field, but there is no corresponding reference in the bibliography. This should probably fail validation.
This is a list of various issues that may be due to missing features or to problems within the files themselves.
cams/gkab/00atf/o2/bb_2_115.atf
cams/gkab/00atf/o2/bb_2_116.atf
cams/gkab/00atf/o2/bb_2_117.atf
cams/gkab/00atf/o2/bb_2_118.atf
cams/gkab/00atf/o2/sptu_1_130.atf
cams/gkab/00atf/o2/sptu_1_129.atf
cams/gkab/00atf/o2/sptu_5_314.atf
cams/gkab/00atf/o2/legal-1.atf
cams/gkab/00atf/o2/legal-2.atf
cams/gkab/00atf/o2/legal-3.atf
contains
#atf: script 4
Validates with
The parser tests for some ATF features lack meaningful assertions and therefore pass trivially.
This is because these tests were initially skipped while the relevant features were unsupported, but no assertions were added when they were re-enabled.
The features in question and the corresponding tests in test_atfparser.py are:
| Feature | Test functions |
|---|---|
| `#key` | `test_key_protocol`, `test_double_equals_in_key_protocol`, `test_many_equals_in_key_protocol`, `test_empty_key_in_key_protocol` |
| `use mylines` | `test_mylines_protocol` |
| `use lexical` | `test_lexical_protocol` |
| `#lemmatizer` | `test_lemmatizer_protocol` |
| `={` | `test_line_equalsbrace` |
When opening some files in Nammu, pyoracc returns a parsing error. I've seen these two different error messages:
/Users/raquel/workspace/ORACC/whole_corpus/whole_corpus/cmawro/00atf/cmawro-09-03.atf
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 7, in validate_corpus
File "python/nammu/controller/AtfAreaController.py", line 34, in setAtfAreaText
self.view.syntax_highlight()
File "python/nammu/view/AtfAreaView.py", line 153, in syntax_highlight
for tok in lexer:
File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 419, in next
t = self.token()
File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 320, in token
m = lexre.match(lexdata, lexpos)
RuntimeError: maximum recursion limit exceeded
/Users/raquel/workspace/ORACC/whole_corpus/whole_corpus/dcclt/00atf/2-ed-nmpr-q-l.atf
/Users/raquel/workspace/ORACC/whole_corpus/whole_corpus/dcclt/00atf/3-ob-diri-q-l.atf
/Users/raquel/workspace/ORACC/whole_corpus/whole_corpus/dcclt/00atf/3-ob-eaaa-l.atf
/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/pyoracc/atf/atflex.py:223: UserWarning: Illegal @STRING 'div'
warnings.warn(formatstring, UserWarning)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 7, in validate_corpus
File "python/nammu/controller/AtfAreaController.py", line 34, in setAtfAreaText
self.view.syntax_highlight()
File "python/nammu/view/AtfAreaView.py", line 153, in syntax_highlight
for tok in lexer:
File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 419, in next
t = self.token()
File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 350, in token
newtok = func(tok)
File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/pyoracc/atf/atflex.py", line 206, in t_INITIAL_parallel_labeled_ATID
t.lexer.pop_state()
File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 284, in pop_state
self.begin(self.lexstatestack.pop())
IndexError: pop from empty list
I've tried both of the commands listed in the documentation for installing pyoracc with pip, but both fail because the setup script uses Python 2 syntax for a print statement.
I was able to install it successfully by following the more explicit step-by-step instructions (clone the repo, cd into it, then pip install), but since Python 2 will no longer be supported very soon, I figure this is a minor bug that should be addressed.
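The failure mode can be demonstrated without the actual setup.py (the quoted line below is illustrative, not the file's real content): Python 2's print statement is a SyntaxError under Python 3, which is what makes pip's build step abort.

```python
# The Python 2 print statement fails to even compile under Python 3:
try:
    compile('print "installing pyoracc"', "<setup.py>", "exec")
    py2_syntax_ok = True
except SyntaxError:
    py2_syntax_ok = False

# The function form compiles fine on both Python 2 and Python 3:
compile('print("installing pyoracc")', "<setup.py>", "exec")
```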
Some of the lexer rules specify their regular expressions using string literals that will become invalid in a future Python version (a DeprecationWarning is raised from version 3.6 onwards).
These are:
t_ID
t_transctrl_ID
terminates_para
The simplest fix would be to double-escape (\\.) the special characters there. Converting to raw strings might not be an option, because these strings also contain unicode escape sequences (\u2019), which may or may not behave the same inside raw strings (I haven't tested it).
In fact, some of the characters in them don't need to be escaped at all, but we would need to check what the intention of each regex was (and double-check that it actually matches the intended thing correctly now...).
This might make it simpler to include pyoracc as a dependency when building Nammu, as well as making it more widely available.
Import JH's pyORACC work so far in Eclipse.
This needs installing and configuring:
Eclipse
PyDev
Python
Nosetests
PLY
Mako
Git
At the moment, composites are only discovered when a new &-line is found in a text. At that point, the Text object created so far has to be adapted into a composite, which is a little messy.
This could be improved if we consider all files as composites. Non-composite files would be composites with only one text element.
Right now, the entry point of the grammar is:

```
document : text
         | object
         | composite
```

It should be changed to:

```
document : composite
composite : text
          | object
text : AMPERSAND .........
```
This is not a trivial change and requires close attention and significant development time. It is not high priority, though.
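A hypothetical PLY-style sketch of the new entry point (rule names and bodies are illustrative; the real rules carry many more alternatives):

```python
# In PLY, only the docstrings define the grammar; the bodies here just
# show the idea that a non-composite file becomes a one-element composite.
def p_document(p):
    """document : composite"""
    p[0] = p[1]

def p_composite_single(p):
    """composite : text
                 | object"""
    p[0] = [p[1]]

def p_composite_append(p):
    """composite : composite text"""
    p[0] = p[1] + [p[2]]
```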
This is probably always the case, but pyoracc should verify it.
In the sample corpus the following files currently fail:
sample_corpus/cmawro-01-01.atf SyntaxError: PyOracc could not parse token 'LexToken(SCORELABEL,'A₁_obv_i_1′',1268,371)'
sample_corpus/SAA06_08.atf SyntaxError: PyOracc could not parse token 'LexToken(ID,'space',731,5705)' Fails to parse $ blank space of 3 lines in line 220 is that valid in strict mode?
sample_corpus/SAA10.atf SyntaxError: PyOracc got an illegal character '@' (fails to parse $@(r 1) (first line broken away) on line 13378
Many of the unit tests are written using the Python standard library's unittest framework. For automated testing the external pytest package is used. Normally that's fine, but in preparing #81 I ran into the issue that pytest doesn't support parametrization of unittest methods.
At least for the file I was working on, test_atflex.py, it looked like the unittest functionality could be replaced by a pytest fixture. Moving the tests up to module level would reduce indentation and allow cleaner reporting of parametrized tests.
A less worthy but possibly also expedient motivation is that Travis builds fail on Python 3.7.1 in the unittest loader. I don't see this locally with Python 3.7.3, so the issue may be addressed by more up-to-date Python packages, but dropping the dependency might work around the bug.
Nammu needs to read the project from a parsed ATF. The way it was doing it so far was like this:
```python
parsed_atf = self.parse(nammu_text)
project = parsed_atf.text.project
```
However, this only works for non-composites, since composites can have more than one text element, each having its own PROJECT token.
I haven't looked through the whole corpus but had a look at several of the composites and all the texts inside seem to belong to the same project.
When sending the SOAP envelope for validation, one project name needs to be specified, so it makes sense for the project to be unique throughout the file. It doesn't make sense, though, that the original ORACC grammar includes the PROJECT token in every element of a composite if it is always going to be the same.
Need to clarify this with Steve/Eleanor to make sure the contents of the soap envelope are correct.
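A sketch of how Nammu could extract the project from a composite while verifying the uniqueness assumption (attribute names are assumptions, not pyoracc's actual API):

```python
# Illustrative helper: assume the parsed composite exposes its texts as a
# list, each text carrying a `project` attribute.
def unique_project(texts):
    """Return the single project shared by all texts, or raise."""
    projects = {t.project for t in texts}
    if len(projects) != 1:
        raise ValueError("Expected one project, found: %s" % sorted(projects))
    return projects.pop()
```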
ATF protocols are currently being skipped by yacc:
"""skipped_protocol : ATF USE UNICODE newline
| ATF USE MATH newline
| ATF USE LEGACY newline
| ATF USE MYLINES newline
| ATF USE LEXICAL newline
| key_statement
| BIB ID newline
| lemmatizer_statement"""
The only one being processed at the moment is the language protocol.
All ATF protocols, including language, should be added to an array of protocols in the text object.
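A minimal sketch of that model change (attribute and method names are assumptions, not pyoracc's current API):

```python
# Illustrative model: accumulate every ATF protocol on the text object
# instead of discarding all but the language protocol.
class Text:
    def __init__(self):
        self.protocols = []

    def add_protocol(self, protocol):
        # e.g. ("lang", "akk"), ("use", "unicode"), ("bib", "ABC 123")
        self.protocols.append(protocol)
```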
Can you provide me with the test files (the tiny corpus and the sample corpus) that are used for running the tests?
Set up a pull request job following Jens' instructions, looking at the CookOff configuration, in both staging and production.
Avoid future headaches when developing the serialiser by having all the member variables declared and initialised in the model classes.
Mockup GUI workflow with Balsamiq or similar to demonstrate ideas to client in next review meeting TBC 1/3/2015.
@raquel-ucl to read http://oracc.museum.upenn.edu/doc/help/editinginatf/primer/structuretutorial/index.html etc.
As part of the MTAAC project, we are processing C-ATF data from the CDLI database. PyOracc looks like the tool that best fits our criteria for the job, so we are planning on extending it to fully support C-ATF, partly basing our work on rules from @stinney's ATF Processor. As such, we would be interested in working directly in a dev branch of your repository. We can also fork and create pull requests. Please let me know if there is interest on your side.
Thank you,
Émilie
MTAAC: https://cdli-gh.github.io/mtaac
CDLI: https://cdli.ucla.edu
Team: /cdli-gh/mtaac-cdli-engineers
Go through some tutorials to get up to speed with JH's work so far:
Lex/Yacc http://dinosaur.compilertools.net
PLY http://www.dabeaz.com/ply/
Jython http://www.jython.org
MVC for Jython http://www.jython.org/jythonbook/en/1.0/SimpleWebApps.html
Mako http://www.makotemplates.org
nosetests https://nose.readthedocs.org/en/latest/
Review:
Python http://development.rc.ucl.ac.uk/training/engineering/session01/
Java Swing http://docs.oracle.com/javase/tutorial/uiswing/index.html
Maven http://maven.apache.org/guides/getting-started/maven-in-five-minutes.html
There are two different rules corresponding to milestones and surfaces:
```python
def p_surface_nolabel(self, p):
    '''surface_specifier : OBVERSE
                         | REVERSE
                         | LEFT
                         | RIGHT
                         | TOP
                         | BOTTOM'''
    p[0] = OraccObject(p[1])

def p_milestone_brief(self, p):
    """milestone_name : CATCHLINE
                      | COLOPHON
                      | DATE
                      | EDGE
                      | SIGNATURES
                      | SIGNATURE
                      | SUMMARY
                      | WITNESSES"""
    p[0] = Milestone(p[1])
```
However, this will only parse if EDGE appears at the start of the file, before any surface, due to the precedence of surface over milestone.
This will pass:
&P359293 = Prag 711
#atf: lang akk
@tablet
@edge
1. lu sze2-s,u2-a-at-ma szu-ma-am ra-ma-ni
2. lu ni-isz-ku-un u2 ba-asz2-tam2
3. i-a-ti2 ta-sza-ki-ni _ku3-babbar_ 1(disz) _ma-na_ sza
4. a-la2-hi-nim la2 i-ru-aq
@obverse
1. a-na puzur4-a-szur3 qi2-bi4-ma
2. um-ma bu-za-zu-ma i-na
3. {d}utu-szi a-szur-i-di2 u2-s,a-ni
4. ga-me-er a-wa-tim
5. nu-qa2-di2-isz3 t,up-pa2-asz2-nu ba-ab-_dingir_
6. ni-ha-ri-ma u2 sza szi2-bi
7. ni-ha-ri-ma za-ku-sa3 a-sza-pa2-ra-kum
8. u3 ni-im-ta-lik-ma
9. a-wa-ti2-ni a-mi3-sza-am
10. ni-na-di2-am a-na-kam ta-ni-isz3-tum
11. sza-lu-uk-tum ma-da-at
12. a-wa-tu3-ni ra-ak-ba
13. szu-ma li-bi4-ka3
14. a-ni-na a-ma-kam
15. a-wa-tim sza szi2-be
@reverse
1. sza ra-bi4-s,u2-um
2. isz-ta-u2-lu-szu
3. lu-ba-ti2-qa2-ma szi-it-a-al
4. a-wa-ti2-ni a-mi3-sza-am
5. lu ni-di2-a-am u3 te2-er-ta#-ka3
6. a-pa2-ni-a li-li-kam
7. a-szi2-bu-tim sza a-szur-i-di2
8. li-ba-ka3 la2 i-pa2-ri-id
9. a-szur-i-di2 sza-il5
10. a-bi4 a-ta be-li a-ta
11. a-la2-nu-ka3 a-ba-am sza-nim
12. u2-la2 i-szu / _ku3-babbar_
13. 1(disz) _gin2_ ba-ab2-tum
14. la2 i-[ru]-aq-ma
15. li-bi4#-ni la2 i-ma-ra-as,
16. i-hi-id-ma
17. ba-ab2-tum
This will fail:
&P359293 = Prag 711
#atf: lang akk
@tablet
@obverse
1. a-na puzur4-a-szur3 qi2-bi4-ma
2. um-ma bu-za-zu-ma i-na
3. {d}utu-szi a-szur-i-di2 u2-s,a-ni
4. ga-me-er a-wa-tim
5. nu-qa2-di2-isz3 t,up-pa2-asz2-nu ba-ab-_dingir_
6. ni-ha-ri-ma u2 sza szi2-bi
7. ni-ha-ri-ma za-ku-sa3 a-sza-pa2-ra-kum
8. u3 ni-im-ta-lik-ma
9. a-wa-ti2-ni a-mi3-sza-am
10. ni-na-di2-am a-na-kam ta-ni-isz3-tum
11. sza-lu-uk-tum ma-da-at
12. a-wa-tu3-ni ra-ak-ba
13. szu-ma li-bi4-ka3
14. a-ni-na a-ma-kam
15. a-wa-tim sza szi2-be
@reverse
1. sza ra-bi4-s,u2-um
2. isz-ta-u2-lu-szu
3. lu-ba-ti2-qa2-ma szi-it-a-al
4. a-wa-ti2-ni a-mi3-sza-am
5. lu ni-di2-a-am u3 te2-er-ta#-ka3
6. a-pa2-ni-a li-li-kam
7. a-szi2-bu-tim sza a-szur-i-di2
8. li-ba-ka3 la2 i-pa2-ri-id
9. a-szur-i-di2 sza-il5
10. a-bi4 a-ta be-li a-ta
11. a-la2-nu-ka3 a-ba-am sza-nim
12. u2-la2 i-szu / _ku3-babbar_
13. 1(disz) _gin2_ ba-ab2-tum
14. la2 i-[ru]-aq-ma
15. li-bi4#-ni la2 i-ma-ra-as,
16. i-hi-id-ma
17. ba-ab2-tum
@edge
1. lu sze2-s,u2-a-at-ma szu-ma-am ra-ma-ni
2. lu ni-isz-ku-un u2 ba-asz2-tam2
3. i-a-ti2 ta-sza-ki-ni _ku3-babbar_ 1(disz) _ma-na_ sza
4. a-la2-hi-nim la2 i-ru-aq
@raquel-ucl @epageperron
We need to make a distinction between a composite object and a composite text.
ATF files should only be treated as composites when they have an @composite line in the metadata of the object descriptions, e.g. anzu.atf.
Composites are abstract concepts Assyriologists use to refer to data collected from several sources (fragments of objects), often coming from different places. In the @composite files, there are no references to objects such as @tablet or @object statue.
The grammar as it stands considers composites to be ATF files with more than one &-line, which is not correct.
I need to discuss this with James and see how best to modify the grammar/data model.
Steve, Niek and Eleanor are happy for us to give this low priority for now and focus on an initial draft version of the GUI.
Should the type of (non-interlinear) translation be saved in the translation object? E.g. parallel vs labelled vs unitary. See:
http://oracc.museum.upenn.edu/doc/help/editinginatf/translations/index.html
Otherwise, it becomes a bit awkward for the serializer to print out the translation heading line correctly, e.g.:
@translation ????? en project
Given the recently-announced changes to Travis.
At the moment references are listed in an array as children of lines, notes, etc.
For example, in belsunu there is this line in a translation:
1. Year 63, Ṭebetu (Month X), night of day 2:^1^
@note ^1^ A note to the translation.
The reference is currently stored in an array of references in the line object, and duplicated in an array of references in the note object.
We should come up with a way of:
a) not duplicating the reference
b) linking the reference to both the note and the line
c) linking it to the specific place where it is anchored: an individual word, another part of the line (some random lemma), etc.
d) doing a, b and c in a way that doesn't overcomplicate the serializer
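One possible shape that satisfies a) through d) (all names here are illustrative, not the current model): a single Reference object linked from both the line and the note, carrying its own anchor.

```python
# Illustrative model: one shared Reference object instead of duplicated
# reference arrays on the line and the note.
class Reference:
    def __init__(self, label):
        self.label = label      # e.g. "^1^"
        self.line = None        # the line the marker appears in
        self.note = None        # the note it points to
        self.anchor = None      # word or span within the line, if known

def link(reference, line, note, anchor=None):
    """Attach a reference to its line, note, and optional anchor."""
    reference.line, reference.note, reference.anchor = line, note, anchor
    return reference
```

The serializer then only has to walk each line's references once, following the links, rather than reconciling two duplicated arrays.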