pyoracc's Introduction

pyoracc


Python tools for working with ORACC/C-ATF files

Depends on PLY, Mako and Pytest

Installation

If you don't use pip, you're missing out. Here are installation instructions.

Simply clone the repository and install from it:

    $ git clone https://github.com/cdli-gh/pyoracc.git
    $ cd pyoracc
    $ pip install .

Or install directly from GitHub (note that GitHub no longer serves the legacy git:// protocol, so use HTTPS):

$ pip install git+https://github.com/cdli-gh/pyoracc.git

Upgrading

If you have already installed it and want to upgrade the tool:

    $ cd pyoracc
    $ git pull origin master
    $ pip install . --upgrade

Or upgrade directly from GitHub (again using HTTPS):

$ pip install git+https://github.com/cdli-gh/pyoracc.git --upgrade

Usage

To use it:

$ pyoracc --help

*Only files with the .atf extension can be processed.*

To run it on a file:

$ pyoracc -i ./pyoracc/test/data/cdli_atf_20180104.atf -f cdli

For a fresh copy of CDLI ATF, download the data bundle here: https://github.com/cdli-gh/data/blob/master/cdliatf_unblocked.atf

To run it on an ORACC file:

$ pyoracc -i ./pyoracc/test/data/cdli_atf_20180104.atf -f oracc

To run it on a folder:

$ pyoracc -i ./pyoracc/test/data -f cdli

To see the console messages of the tool, use the --verbose switch:

$ pyoracc -i ./pyoracc/test/data -f cdli --verbose

Note that using the verbose option will also create a parselog.txt file containing the log output, in addition to displaying it on the command line. The verbose output contains the lexical symbols, the parser's grammar table and the LR parsing table states.

Also note that first-time usage with any ATF format will always display the parse tables, irrespective of the verbose switch.

If you don't supply any arguments, the tool will prompt for the path and the ATF file type.

Help

$ pyoracc --help
Usage: pyoracc [OPTIONS]

  My Tool does one work, and one work well.

Options:
  -i, --input_path PATH      Input the file/folder name.  [required]
  -f, --atf_type [cdli|atf]  Input the atf file type.  [required]
  -v, --verbose              Enables verbose mode
  --version                  Show the version and exit.
  --help                     Show this message and exit.

Internal Dev Usage

Development Guidelines

  • ORACC atf based changes will go in pyoracc/atf/oracc
  • CDLI atf based changes will go in pyoracc/atf/cdli
  • Common atf based changes will go in pyoracc/atf/common

To run on a directory:

$ python -m pyoracc.model.corpus ./pyoracc/test/data cdli

To run on an individual file:

$ python -m pyoracc.atf.common.atffile ./pyoracc/test/data/cdli_atf_20180104.atf cdli True

Running Tests

Before running pytest and coverage, install py.test and pytest-cov.

$ py.test --cov=pyoracc --cov-report xml --cov-report html --cov-report annotate

Before running pycodestyle, install pycodestyle.

$ pycodestyle

API Consumption

from pyoracc.atf.common.atffile import file_process
file_process(pathname, atftype, verbose)
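
For example, a call on the bundled CDLI test file might look like this (a minimal sketch mirroring the CLI examples above, assuming it is run from the repository root):

    from pyoracc.atf.common.atffile import file_process

    # Parse a CDLI ATF file with verbose output, as in the CLI examples.
    file_process('./pyoracc/test/data/cdli_atf_20180104.atf', 'cdli', True)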

pyoracc's People

Contributors

ageorgou, epageperron, giordano, jamespjh, jayanthkmr, jenshnielsen, raquelalegre, rillian


pyoracc's Issues

Treat all files as composites to simplify grammar

At the moment, composites are only discovered when a new &-line is found in a text. At that point, the Text object created so far has to be adapted to become a composite, and that's a little messy.
This could be improved if we treated all files as composites: non-composite files would simply be composites with only one text element.

Right now, the entry point of the grammar is:

        document : text
                 | object
                 | composite

It should be changed to:

        document : composite
        composite : text
                  | object
        text : AMPERSAND .........

This is not a trivial change and requires close attention and significant development time. It is not high priority, though.

Non lexing files

MEE15_44.atf

Parsing file /Users/jhn/ucl/oracc/oracc_corpus/dcclt/ebla/00atf/MEE15_44.atf ... Failed with message: 'PyOracc got an illegal character ' ''

The file contains a #bib: field, but there is no corresponding reference in the bibliography. This should probably fail.

Rulings count

Why is the ruling count saved as an int by the parser, when the serialiser has to do so much extra work to convert it back to a string? I.e.:

def getRulingType(self):
    typeArr = ["single", "double", "triple"]
    try:
        return typeArr[self.count - 1]
    except TypeError:
        print("Error: Ruling count " + str(self.count) + " must be an integer.")
    except IndexError:
        print("Error: Ruling count (" + str(self.count) + ") is out of bounds (" + str(len(typeArr)) + ").")

Sanitise regular expressions in lexer

Some of the lexer rules specify their regular expressions using strings which will become invalid in a future Python version (a DeprecationWarning is raised from version 3.6 onwards).

These are:

  • t_ID
  • t_transctrl_ID
  • terminates_para

The simplest way would be to double-escape (\\.) the special characters there. Converting to raw strings might not be an option, because these strings also contain unicode escape sequences (\u2019), which may or may not work within raw strings (I haven't tested it).
In fact, some of the characters in them don't need to be escaped at all, but we will need to check what the intention behind each regex was (and double-check that it actually matches the intended thing correctly now...).
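
As a minimal, self-contained illustration of the problem (a toy pattern, not one of the actual lexer rules):

    import re

    # '\.' in a plain (non-raw) string is an invalid string escape and raises
    # a DeprecationWarning from Python 3.6 onward. Doubling the backslash
    # yields the same regex while remaining a valid string escape:
    pattern = '\\.\u2019'          # regex: a literal '.' followed by U+2019
    assert re.match(pattern, '.\u2019')

    # A raw string would fix the backslash (r'\.'), but it turns '\u2019'
    # from a *string* escape into the *regex* escape \u2019, so the two
    # spellings would need to be verified as equivalent -- which is exactly
    # the untested point this issue raises.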

is EDGE milestone or surface?

There are two different rules corresponding to milestones and surfaces:

def p_surface_nolabel(self, p):
    '''surface_specifier : OBVERSE
                         | REVERSE
                         | LEFT
                         | RIGHT
                         | TOP
                         | BOTTOM
                         '''
    p[0] = OraccObject(p[1])

def p_milestone_brief(self, p):
    """milestone_name : CATCHLINE
                      | COLOPHON
                      | DATE
                      | EDGE
                      | SIGNATURES
                      | SIGNATURE
                      | SUMMARY
                      | WITNESSES"""
    p[0] = Milestone(p[1])

However, @edge will only parse if it appears at the start of the file, before any surface, due to the precedence of surface over milestone.

This will pass:

&P359293 = Prag 711
#atf: lang akk
@tablet
@edge
1. lu sze2-s,u2-a-at-ma szu-ma-am ra-ma-ni
2. lu ni-isz-ku-un u2 ba-asz2-tam2
3. i-a-ti2 ta-sza-ki-ni _ku3-babbar_ 1(disz) _ma-na_ sza
4. a-la2-hi-nim la2 i-ru-aq
@obverse
1. a-na puzur4-a-szur3 qi2-bi4-ma
2. um-ma bu-za-zu-ma i-na
3. {d}utu-szi a-szur-i-di2 u2-s,a-ni
4. ga-me-er a-wa-tim
5. nu-qa2-di2-isz3 t,up-pa2-asz2-nu ba-ab-_dingir_
6. ni-ha-ri-ma u2 sza szi2-bi
7. ni-ha-ri-ma za-ku-sa3 a-sza-pa2-ra-kum
8. u3 ni-im-ta-lik-ma
9. a-wa-ti2-ni a-mi3-sza-am
10. ni-na-di2-am a-na-kam ta-ni-isz3-tum
11. sza-lu-uk-tum ma-da-at
12. a-wa-tu3-ni ra-ak-ba
13. szu-ma li-bi4-ka3
14. a-ni-na a-ma-kam
15. a-wa-tim sza szi2-be
@reverse
1. sza ra-bi4-s,u2-um
2. isz-ta-u2-lu-szu
3. lu-ba-ti2-qa2-ma szi-it-a-al
4. a-wa-ti2-ni a-mi3-sza-am
5. lu ni-di2-a-am u3 te2-er-ta#-ka3
6. a-pa2-ni-a li-li-kam
7. a-szi2-bu-tim sza a-szur-i-di2
8. li-ba-ka3 la2 i-pa2-ri-id
9. a-szur-i-di2 sza-il5
10. a-bi4 a-ta be-li a-ta
11. a-la2-nu-ka3 a-ba-am sza-nim
12. u2-la2 i-szu / _ku3-babbar_
13. 1(disz) _gin2_ ba-ab2-tum
14. la2 i-[ru]-aq-ma
15. li-bi4#-ni la2 i-ma-ra-as,
16. i-hi-id-ma
17. ba-ab2-tum

This will fail:

&P359293 = Prag 711
#atf: lang akk
@tablet
@obverse
1. a-na puzur4-a-szur3 qi2-bi4-ma
2. um-ma bu-za-zu-ma i-na
3. {d}utu-szi a-szur-i-di2 u2-s,a-ni
4. ga-me-er a-wa-tim
5. nu-qa2-di2-isz3 t,up-pa2-asz2-nu ba-ab-_dingir_
6. ni-ha-ri-ma u2 sza szi2-bi
7. ni-ha-ri-ma za-ku-sa3 a-sza-pa2-ra-kum
8. u3 ni-im-ta-lik-ma
9. a-wa-ti2-ni a-mi3-sza-am
10. ni-na-di2-am a-na-kam ta-ni-isz3-tum
11. sza-lu-uk-tum ma-da-at
12. a-wa-tu3-ni ra-ak-ba
13. szu-ma li-bi4-ka3
14. a-ni-na a-ma-kam
15. a-wa-tim sza szi2-be
@reverse
1. sza ra-bi4-s,u2-um
2. isz-ta-u2-lu-szu
3. lu-ba-ti2-qa2-ma szi-it-a-al
4. a-wa-ti2-ni a-mi3-sza-am
5. lu ni-di2-a-am u3 te2-er-ta#-ka3
6. a-pa2-ni-a li-li-kam
7. a-szi2-bu-tim sza a-szur-i-di2
8. li-ba-ka3 la2 i-pa2-ri-id
9. a-szur-i-di2 sza-il5
10. a-bi4 a-ta be-li a-ta
11. a-la2-nu-ka3 a-ba-am sza-nim
12. u2-la2 i-szu / _ku3-babbar_
13. 1(disz) _gin2_ ba-ab2-tum
14. la2 i-[ru]-aq-ma
15. li-bi4#-ni la2 i-ma-ra-as,
16. i-hi-id-ma
17. ba-ab2-tum
@edge
1. lu sze2-s,u2-a-at-ma szu-ma-am ra-ma-ni
2. lu ni-isz-ku-un u2 ba-asz2-tam2
3. i-a-ti2 ta-sza-ki-ni _ku3-babbar_ 1(disz) _ma-na_ sza
4. a-la2-hi-nim la2 i-ru-aq

@raquel-ucl @epageperron

Undocumented syntax to ask about

This is a list of various issues that may be due to missing features and/or problems within the files.

cams/gkab/00atf/o2/bb_2_115.atf
cams/gkab/00atf/o2/bb_2_116.atf
cams/gkab/00atf/o2/bb_2_117.atf
cams/gkab/00atf/o2/bb_2_118.atf
cams/gkab/00atf/o2/sptu_1_130.atf
cams/gkab/00atf/o2/sptu_1_129.atf
cams/gkab/00atf/o2/sptu_5_314.atf
cams/gkab/00atf/o2/legal-1.atf
cams/gkab/00atf/o2/legal-2.atf
cams/gkab/00atf/o2/legal-3.atf

These files contain

    #atf: script 4

and validation fails with:

    'atf: script' is no longer supported; use '#atf: lang' instead

Consolidate line label parsing

Working on #78, I noticed some inconsistencies in the way line labels (numbers) are recognized by the lexer. For example, the list of characters accepted as a prime marker for relative line numbers differs between contexts, and none of them accept labels like 109a. in Q000040 from the sample corpus.

The pattern for the line label should probably be declared once and concatenated into all the lexer patterns that need to recognize it, as sketched below.
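
A minimal sketch of that refactor; the pattern and rule names are illustrative, not the lexer's actual definitions:

    # Declare the label pattern once...
    LINE_LABEL = r'[0-9]+[a-z]?'

    # ...and build every context's rule from it, so all lexer states
    # accept the same set of labels, including '109a.':
    t_INITIAL_ID = LINE_LABEL + r'\.'
    t_transctrl_ID = LINE_LABEL + r'\.'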

Background training

Parsing tests skip some features

The parser tests for some ATF features are missing meaningful assertions and therefore pass trivially.

This is because these tests were initially skipped while the relevant features were unsupported, but no assertions were added when the tests were re-enabled.

The features in question and corresponding tests in test_atfparser.py are:

    Feature        Test functions
    #key           test_key_protocol, test_double_equals_in_key_protocol,
                   test_many_equals_in_key_protocol, test_empty_key_in_key_protocol
    use mylines    test_mylines_protocol
    use lexical    test_lexical_protocol
    #lemmatizer    test_lemmatizer_protocol
    ={             test_line_equalsbrace

References need enhancement

At the moment, references are listed in an array as children of lines, notes, etc.
For example, in belsunu there is this line in a translation:

1. Year 63, Ṭebetu (Month X), night of day 2:^1^

@note ^1^ A note to the translation.

Currently the reference is stored in an array of references in the line object, and duplicated in an array of references in the note object.

We should come up with a way of:
a) not duplicating the reference;
b) linking the reference to both the note and the line;
c) linking it to the specific place where it was written: an individual word, another part of the line (some random lemma), etc.;
d) doing (a), (b) and (c) in a way that doesn't overcomplicate the serializer (see the sketch below).
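
A hedged sketch of one shape (a) and (b) could take; all names here are assumptions for illustration, not pyoracc's data model:

    # Illustrative only: one shared Reference instance instead of two copies.
    class Reference:
        def __init__(self, label):
            self.label = label           # e.g. '^1^'

    ref = Reference('^1^')
    line_references = [ref]              # the translation line points at it...
    note_references = [ref]              # ...and so does the @note: same object.
    assert line_references[0] is note_references[0]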

Errors returned from pyoracc when doing syntax highlight

When opening some files in Nammu, pyoracc returns a parsing error. I've seen two different error messages:

/Users/raquel/workspace/ORACC/whole_corpus/whole_corpus/cmawro/00atf/cmawro-09-03.atf

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in validate_corpus
  File "python/nammu/controller/AtfAreaController.py", line 34, in setAtfAreaText
    self.view.syntax_highlight()
  File "python/nammu/view/AtfAreaView.py", line 153, in syntax_highlight
    for tok in lexer:
  File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 419, in next
    t = self.token()
  File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 320, in token
    m = lexre.match(lexdata, lexpos)
RuntimeError: maximum recursion limit exceeded
/Users/raquel/workspace/ORACC/whole_corpus/whole_corpus/dcclt/00atf/2-ed-nmpr-q-l.atf
/Users/raquel/workspace/ORACC/whole_corpus/whole_corpus/dcclt/00atf/3-ob-diri-q-l.atf
/Users/raquel/workspace/ORACC/whole_corpus/whole_corpus/dcclt/00atf/3-ob-eaaa-l.atf

/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/pyoracc/atf/atflex.py:223: UserWarning: Illegal @STRING 'div'
  warnings.warn(formatstring, UserWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in validate_corpus
  File "python/nammu/controller/AtfAreaController.py", line 34, in setAtfAreaText
    self.view.syntax_highlight()
  File "python/nammu/view/AtfAreaView.py", line 153, in syntax_highlight
    for tok in lexer:
  File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 419, in next
    t = self.token()
  File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 350, in token
    newtok = func(tok)
  File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/pyoracc/atf/atflex.py", line 206, in t_INITIAL_parallel_labeled_ATID
    t.lexer.pop_state()
  File "/Users/raquelalegre/.virtualenvs/ORACC_Jython/Lib/site-packages/ply/lex.py", line 284, in pop_state
    self.begin(self.lexstatestack.pop())
IndexError: pop from empty list

Make pyoracc available on PyPI

This might make it simpler to include pyoracc as a dependency when building Nammu, as well as making the tool more widely available.

Port `unittest` classes to `pytest`.

Many of the unit tests are written using the Python standard library's unittest framework, while automated testing uses the external pytest package. Normally that's fine, but in preparing #81 I ran into the issue that pytest doesn't support parametrization of unittest methods.

At least for the file I was working on, test_atflex.py, it looked like the unittest functionality could be replaced by a pytest fixture, as sketched below. Moving the tests up to module level would reduce indentation and allow cleaner reporting of parametrized tests.
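
A self-contained toy of the pattern (not the actual lexer or tests):

    import pytest

    class ToyLexer:                      # stand-in for whatever setUp() built
        def tokenize(self, text):
            return text.split()

    @pytest.fixture
    def lexer():
        return ToyLexer()                # the fixture replaces setUp

    # Parametrization works here because the test is a module-level
    # function, not a method on a unittest.TestCase.
    @pytest.mark.parametrize('text, expected', [
        ('@tablet', ['@tablet']),
        ('@obverse @reverse', ['@obverse', '@reverse']),
    ])
    def test_tokenize(lexer, text, expected):
        assert lexer.tokenize(text) == expected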

A less worthy but possibly also expedient motivation is that Travis builds fail on Python 3.7.1 in the unittest loader. I don't see this locally with Python 3.7.3, so the issue may be fixed by more up-to-date Python packages, but dropping the unittest dependency might work around the bug.

Fix skipped protocols in grammar

ATF protocols are currently being skipped by yacc:

    """skipped_protocol : ATF USE UNICODE newline
                        | ATF USE MATH newline
                        | ATF USE LEGACY newline
                        | ATF USE MYLINES newline
                        | ATF USE LEXICAL newline
                        | key_statement
                        | BIB ID newline
                        | lemmatizer_statement"""

The only one being processed at the moment is the language protocol.

All ATF protocols, including language, should be added to an array of protocols in the text object.
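
A minimal sketch of what that could look like on the data model; the attribute and method names are assumptions, not pyoracc's current API:

    # Illustrative only: record every protocol line on the text object
    # instead of discarding it in the grammar action.
    class Text:
        def __init__(self):
            self.protocols = []          # e.g. [('lang', 'akk'), ('use', 'unicode')]

        def add_protocol(self, name, value):
            self.protocols.append((name, value))

    text = Text()
    text.add_protocol('lang', 'akk')         # from '#atf: lang akk'
    text.add_protocol('use', 'unicode')      # from '#atf: use unicode'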

oracc test files needed

Can you provide the test files (the tiny corpus and the sample corpus) that are used for running the tests?

Composites concept in the grammar might be wrong

We need to make a distinction between a composite object and a composite text.

ATF files should only be treated as composites when they have an @composite line in the metadata of the object descriptions. E.g.: anzu.atf

Composites are abstract concepts that Assyriologists use to refer to data collected from several sources (fragments of objects), often coming from different places. In @composite files there are no references to objects such as @tablet or @object statue.

The grammar as it stands treats any ATF file with more than one &-line as a composite, and this is not correct.

I need to discuss this with James to see how best to modify the grammar/data model.

Steve, Niek and Eleanor are happy for us to give this low priority for now and focus on an initial draft version of the GUI.

Find out if composites can have several projects

Nammu needs to read the project from a parsed ATF. So far it was doing so like this:

parsed_atf = self.parse(nammu_text)
project = parsed_atf.text.project

However, this only works for non-composites, since composites can have more than one text element, each with its own PROJECT token; a sketch of handling both cases follows.
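
A hedged sketch; the texts attribute is an assumption about the composite data model, not a confirmed pyoracc attribute:

    def get_project(parsed_atf):
        # Composite case: several text elements (attribute name assumed).
        if hasattr(parsed_atf, 'texts'):
            projects = {text.project for text in parsed_atf.texts}
            if len(projects) != 1:
                raise ValueError('Composite spans several projects: %r' % projects)
            return projects.pop()
        # Non-composite case, as in the snippet above.
        return parsed_atf.text.project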

I haven't looked through the whole corpus, but I had a look at several of the composites and all the texts inside seem to belong to the same project.

When sending the SOAP envelope for validation, one project name needs to be specified, so it makes sense for it to be unique throughout the file; although, if it is always the same, it's unclear why the original ORACC grammar includes the PROJECT token in every element of a composite.

We need to clarify this with Steve/Eleanor to make sure the contents of the SOAP envelope are correct.

Pip Install Fails with Python 3

I've tried both of the commands listed in the documentation for installing pyoracc with pip, but both fail because the setup script uses Python 2 syntax for a print statement.

I was able to install it successfully by following the more explicit step-by-step instructions (clone the repo, cd into it, then pip install), but as Python 2 will no longer be supported very soon, I figure this is a minor bug that should be addressed.

C-ATF processing and ORACC "ox" ATF processor rules integration

As part of the MTAAC project, we are processing C-ATF data from the CDLI database. PyOracc looks like the tool that best fits our criteria for the job, so we are planning to extend it to fully support C-ATF, partly basing ourselves on rules from @stinney's ATF Processor. As such, we would be interested in working directly in a dev branch of your repository; we can also fork and create pull requests. Please let me know if there is interest on your side.
Thank you,
Émilie

MTAAC: https://cdli-gh.github.io/mtaac
CDLI: https://cdli.ucla.edu
Team: /cdli-gh/mtaac-cdli-engineers

Make sure that full corpus passes

In the sample corpus, the following files currently fail:

sample_corpus/cmawro-01-01.atf
    SyntaxError: PyOracc could not parse token 'LexToken(SCORELABEL,'A₁_obv_i_1′',1268,371)'

sample_corpus/SAA06_08.atf
    SyntaxError: PyOracc could not parse token 'LexToken(ID,'space',731,5705)'
    (fails to parse "$ blank space of 3 lines" at line 220; is that valid in strict mode?)

sample_corpus/SAA10.atf
    SyntaxError: PyOracc got an illegal character '@'
    (fails to parse "$@(r 1) (first line broken away)" on line 13378)
