annotation / stam

Stand-off Text Annotation Model (STAM) is a data model for stand-off-text annotation where any information on a text is represented as an annotation. This repository contains the model's full specification, extensions, schemas, examples and documentation.

Home Page: https://annotation.github.io/stam/

License: Creative Commons Attribution Share Alike 4.0 International

annotation linguistics stand-off text text-annotation webannotation

stam's Issues

Allow annotation stores to include/depend on other annotation stores (stand-off STAM JSON files)

Currently an annotation store in STAM JSON can reference annotation datasets
and resources in separate stand-off files. What is not yet possible, however,
is to reference annotations defined in other STAM JSON annotation stores.

This use-case was raised in #21 by @tenzin3, see the lead up discussion there.

In such a case, an annotation in store_a.store.stam.json makes reference (via
an annotation selector) to an annotation defined in store_b.store.stam.json.
That is currently not possible. I do think it is a fair use case and more
flexibility in using stand-off files fits nicely with STAM's stand-off
philosophy.

This issue proposes to expand the STAM model to allow this:

The @include mechanism in STAM JSON would be extended to allow including
other annotation stores. In effect, an annotation store can then depend
on another by importing it; these includes are executed before any of its own annotations are loaded.
Recursive includes would be allowed (allowing more complex dependency chains),
but cyclic includes would be explicitly forbidden. Includes may (and are in fact encouraged to) reference the same
stand-off resources and annotation data sets.

Possible syntax for this (not final):

{
    "@type":"AnnotationStore",
    "@include": [ "store_b.stam.store.json", "store_c.stam.store.json" ],
    ...
}

On the implementation side, when loaded into memory there is still always one
AnnotationStore instance to work with at any given time. This would,
however, serialize to multiple files. This requires some extra bookkeeping
to be implemented, as for each annotation we need to know to which
annotation store it should go. The implementation might define
'substores' and keep a map from filenames to lists of annotation handles.
This new bookkeeping would at the same time make splitting stores easier
than it is in the current implementation (where splitting is basically a fairly expensive deletion action).
Merging and splitting become more reversible.
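
The load order and the cycle check described above can be sketched as follows. This is only an illustration under assumptions: the function name resolve_includes and the plain dictionary that maps a store filename to the filenames in its @include list are hypothetical and not part of STAM or any STAM library.

```python
from typing import Dict, List


def resolve_includes(root: str, includes: Dict[str, List[str]]) -> List[str]:
    """Return store filenames in load order: included stores come first.

    `includes` is a hypothetical mapping from a store filename to the
    filenames listed in its @include field. Cyclic includes raise ValueError.
    """
    order: List[str] = []
    done = set()
    in_progress = set()

    def visit(name: str) -> None:
        if name in done:
            return  # already scheduled through another include path
        if name in in_progress:
            raise ValueError(f"cyclic include involving {name}")
        in_progress.add(name)
        for dependency in includes.get(name, []):
            visit(dependency)
        in_progress.remove(name)
        done.add(name)
        order.append(name)

    visit(root)
    return order


stores = {
    "store_a.store.stam.json": ["store_b.store.stam.json", "store_c.store.stam.json"],
    "store_b.store.stam.json": ["store_c.store.stam.json"],
}
load_order = resolve_includes("store_a.store.stam.json", stores)
print(load_order)
```

A store that is included along several paths is loaded only once, which matches the idea that includes may share dependencies.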

Add examples other than "explicit_containment"

Could you add .txt and .json files for all the examples listed in https://github.com/annotation/stam/blob/master/examples/README.md

Thanks for the great work!

the importance of having a coordinate system independent of what the source files offer

I was thinking about "the importance of having a coordinate system independent
of what the source files offer" which @dirkroorda mentioned the other day, and
which was also described in the Unlocking Digital Texts Position
Paper:

From these formats, it is definitely possible to introduce a “glyph-level” fragment addressing
scheme, comprising an offset from the start of the file. This effectively reduces all text
formats to plain-text by stripping away any additional tagging and non-textual components.

This is not an entirely trivial exercise, since some additional complexities around Unicode
normalisation rules and white-space handling will need to be dealt with, in order to ensure
that plain-text conversions are carried out in a consistent manner.

However, at this stage, it appears that it would be advantageous to also have a
higher level scheme that operates in a more “human-friendly” way, with word (or token)
granularity and some sense of semantic structure at a level similar to Markdown or a
light-TEI schema.

How does this relate to STAM? We have higher-order annotations that allow modelling higher-level schemes.
We can annotate a sentence and then annotate a word in that sentence using relative offsets:

The offsets still refer to the unicodepoint level, but no longer relative to
the resource as a whole but to the annotation that is being pointed at (the
sentence in this case).
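
As an illustration of relative offsets, here is a minimal sketch that resolves an offset given relative to a sentence into an offset on the resource. The Offset class and the to_absolute function are hypothetical helpers written for this example, not the API of any STAM implementation; offsets are assumed to be begin-aligned, counted in Unicode code points, with an exclusive end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offset:
    begin: int  # inclusive, in Unicode code points
    end: int    # exclusive


def to_absolute(parent: Offset, relative: Offset) -> Offset:
    """Resolve an offset relative to `parent` into resource coordinates."""
    parent_length = parent.end - parent.begin
    if not (0 <= relative.begin <= relative.end <= parent_length):
        raise ValueError("relative offset falls outside the parent selection")
    return Offset(parent.begin + relative.begin, parent.begin + relative.end)


text = "Hello world. This is STAM."
sentence = Offset(13, 26)  # "This is STAM." relative to the resource
word = Offset(8, 12)       # "STAM" relative to the sentence
absolute = to_absolute(sentence, word)
selected = text[absolute.begin:absolute.end]
print(absolute, selected)
```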

The recent proposal for the STAM Baseoffset
extension is also relevant here
because it allows us to use a start/base offset that deviates from the actual
text (a simple decoupling from the actual coordinate system, though the units
are still the same).

Next we also have our CompositeSelector (and MultiSelector) that would let us
model things the other way round: we can have the sentence be the higher-order
annotation and have it point to annotations that are words, and those in turn point to the resource
(using offsets).

At this point a question arises of something we can't model in STAM yet. Our
offsets are always unicode points (as that's our most atomic unit). If you want
to address things at a higher level like described in the previous paragraph
then that requires explicitly enumerating all the targets in a
CompositeSelector/MultiSelector/DirectionalSelector. But what if we want to use
offsets in another coordinate system here? Say a selector that selects the
second up to the ninth word? Do we want a selector that can express this
whilst automatically interpolating the points in between?

Adding something like that should be possible and adds more flexibility to how
people can use STAM for modelling, but it comes at the cost of adding further
complexity to STAM. So probably it should be an extension.
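
To make the idea of offsets in another coordinate system concrete, here is a rough sketch of a selection by word index that is interpolated down to code point offsets. Everything in it is an assumption for illustration: STAM has no such selector, words would normally be annotations rather than whitespace-separated tokens, and the function names are hypothetical.

```python
import re
from typing import List, Tuple


def word_offsets(text: str) -> List[Tuple[int, int]]:
    """Code point offsets of whitespace-separated tokens (a stand-in for word annotations)."""
    return [match.span() for match in re.finditer(r"\S+", text)]


def select_words(text: str, first: int, last: int) -> Tuple[int, int]:
    """Map a 1-based, inclusive word range onto a single code point offset."""
    words = word_offsets(text)
    if not (1 <= first <= last <= len(words)):
        raise IndexError("word range falls outside the text")
    return (words[first - 1][0], words[last - 1][1])


text = "one two three four five six seven eight nine ten"
begin, end = select_words(text, 2, 9)  # the second up to the ninth word
print(begin, end, text[begin:end])
```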

Eventually we could even go as far as have a universal Selector that points to
something (resource/annotation) that is the result of a whole query. That might
subsume the above use-case as well, but would rely on several extensions (most
notably the query system which will be upcoming anyway) that are not trivial.

Text-Fabric and FoLiA both rely at their core on a coordinate system
more detached from the text; in both, a text is merely an annotation like anything else.

The situation in STAM is a bit different, almost everything is an annotation
but the text itself is the primary thing an annotation points to (a slice
thereof), either directly or indirectly. I do think that's the proper method
for a standoff text annotation model.

Last, a word about complex selectors like those in the Web Annotation model
which can reference XPath and other complex file formats. I do consider these
explicitly out of scope for STAM. We want to untangle text and annotations
completely, so text is its most bare form (plain text, utf-8) and all
annotations reference that, rather than some hybrid.

I just wanted to throw all this out here to voice and hear some thoughts and if
needed have some discussion, I'm especially interested in what @dirkroorda
thinks.

Improve the space-efficiency of complex selectors

I'd like to improve the space-efficiency of the complex selectors
(MultiSelector/CompositeSelector/DirectionalSelector). In earlier discussions,
we already established that the MultiSelector is a valid tool to annotate
multiple targets using only a single annotation. However, in the current
implementation, the selector is still implemented in a way that explicitly
enumerates all the offsets in the text. So if you annotate 100,000 targets with
a single annotation via a MultiSelector (saving yourself 99,999 annotations in
the process), you still have 100,000 subselectors in memory.

This can be done more space-efficiently. In Text-Fabric, @dirkroorda efficiently maps entire
ranges of nodes to annotation content (features):

1-426590 word
426591-426629 book
426630-427558 chapter
427559-515689 clause
515690-606393 clause_atom
606394-651572 half_verse
651573-904775 phrase
904776-1172307 phrase_atom
1172308-1236024 sentence
1236025-1300538 sentence_atom
1300539-1414388 subphrase
1414389-1437601 verse
1437602-1446831 lex

I think we need a similar way to express large ranges in STAM. We too have
'nodes' that are expressed by an internal integer ID (TextSelections,
Annotations, TextResources, AnnotationDataSets), and if there's a large
contiguous range of them we can refer to them by a simple begin intID and end intID
(or multiple if there are non-contiguous parts).

In ideal circumstances, we can then express a complex selector with 100,000
subselectors using just one (new) ranged subselector instead.

Such a ranged subselector may be best kept as a part of STAM's 'extended
model', i.e. parts of its internals and not expressed in canonical
serialisation. This keeps the model simple and easier to interpret for the outside world, but uses
the necessary optimisations internally.

There's one limitation in this approach: When targeting text, using such a
ranged subselector would only work for 'simple' offsets, that is, offsets that
refer directly to the resource using begin-aligned cursors. If the offset is
relative (goes through another annotation) or uses end-aligned cursors, then we
need to store a copy of that offset.
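
The compression of contiguous internal IDs into ranges can be sketched as below. This is an illustrative assumption about how a ranged subselector could store its targets, not a description of stam-rust internals; the function names are hypothetical. Input order is preserved, so it would also work for order-sensitive selectors.

```python
from typing import Iterable, Iterator, List, Tuple


def compress_ids(ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse runs of consecutive IDs into inclusive (begin, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for current in ids:
        if ranges and current == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], current)  # extend the open run
        else:
            ranges.append((current, current))      # start a new run
    return ranges


def expand_ranges(ranges: Iterable[Tuple[int, int]]) -> Iterator[int]:
    """Inverse of compress_ids: enumerate every ID covered by the ranges."""
    for begin, end in ranges:
        yield from range(begin, end + 1)


small = compress_ids([1, 2, 3, 4, 10, 11, 20])
large = compress_ids(range(100_000))
print(small, large)
```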

High-level API design

I want to take the next step towards designing a good high-level API for STAM. In the current implementation, things have grown somewhat organically, but we've reached a stage where things are becoming cluttered or confusing if not well designed, and where some expected high-level methods are still clearly missing.

Please read my API proposal and comment here in this issue. The document is not normative for STAM itself (any implementation may decide to do things differently); STAM as such prescribes only a data model and expected functionality for implementations, but not an API.

I also want to separate the internal API in stam-rust more clearly from the higher-level API that is exposed; right now too many internals are exposed publicly in the library. This means I want to close off parts of the low-level API. Such a decoupling layer allows for easier internal changes without affecting the outside world.

It does imply there's going to be a fairly big API breakage for the next stam-rust and stam-python releases, but that was coming anyway because of other changes, and at this stage that is still manageable. I hope to cover most breaking changes in a single release.

The high-level API design also relates to our aim to formulate a query language (#12) and implementation thereof (annotation/stam-rust#14), because most of the methods are related to searching. The proposed API sits at one level below a full query implementation (which was already underway), but if done right, the query implementation itself becomes less urgent and can delegate a lot to the new high-level API methods.

Support external annotations files to allow selective loading and avoid memory issues

We're working on PechaData, a multilingual Buddhist corpus project in collaboration with bdrc.io and pecha.org. As a format, Stam is a dream for our project, and we're starting to build our project on top of it with a mechanism to update annotation coordinates when the base text is updated.

However, our dataset includes many large texts (>10 MB .txt) featuring multiple annotation layers, often larger than the initial text file, and we are concerned about performance issues when we have to load all the annotations in memory even when we only need a couple of sets of annotations (e.g. we have a file with 15 annotation sets including POS tags and dependencies, but we only need the text and the annotations for the table of contents).

Have you considered externalizing annotations in separate files like the .ann files of BrAT or do you have another solution to load annotations selectively? We thought about patching Stam to find a solution but we would much prefer a solution coming from the creators.

Thanks a lot for your work!

How to deal with resource changes?

Perhaps think about support for dealing with resource changes that possibly break existing Cursors.

timestamp?
checksum?
notification via pub/sub (resource notifies stam)?
stam stores initially selected text and validates if that changed?
rely on persistent identifiers?
....
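
Two of the options listed above, a checksum over the resource and validation against the initially selected text, can be sketched as follows. The function names are hypothetical, and neither check is claimed to be part of STAM; this only shows what such checks could look like.

```python
import hashlib


def checksum(text: str) -> str:
    """SHA-256 digest of the UTF-8 encoded resource text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def resource_changed(text: str, stored_checksum: str) -> bool:
    """True if the resource no longer matches the checksum stored earlier."""
    return checksum(text) != stored_checksum


def selection_intact(text: str, begin: int, end: int, expected: str) -> bool:
    """True if the offsets still select the text that was originally annotated."""
    return text[begin:end] == expected


original = "Hello world"
stored = checksum(original)
print(resource_changed(original, stored))
print(selection_intact("Hello, world", 6, 11, "world"))
```

The checksum only detects that something changed; the per-selection check narrows down which annotations are actually affected.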

How to save spelling variant annotations in STAM?

Annotated data:

Let's go to party(1)<{E1}part{E2}parties>. We will have lots of fun.

Here {E1} and {E2} refer to editions 1 and 2; "part" and "parties" are the spelling variants found in those editions, and "party" is the spelling in the latest edition.
I have a parser which parses the annotation into a dictionary where it saves:

{
    'span': [11, 15],
    'spelling_variant': {
        'E1': 'part',
        'E2': 'parties',
        'LE': 'party'
    }
}

I am able to save the span in the target, but I am not able to save the spelling variants in AnnotationData. Kindly help me.
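
One possible way to model this is to store each edition's spelling as its own key/value pair on a single annotation. The sketch below builds a plain dictionary shaped loosely like a STAM JSON annotation; the field names follow my reading of the STAM JSON examples and should be checked against the specification, and the function name, resource ID and dataset ID are hypothetical.

```python
import json
from typing import Any, Dict


def variants_to_annotation(parsed: Dict[str, Any], resource_id: str, dataset_id: str) -> Dict[str, Any]:
    """Turn the parser output into one annotation with one data item per edition."""
    begin, end = parsed["span"]
    data = [
        {
            "@type": "AnnotationData",
            "set": dataset_id,   # hypothetical dataset ID
            "key": edition,      # e.g. "E1", "E2", "LE"
            "value": {"@type": "String", "value": spelling},
        }
        for edition, spelling in parsed["spelling_variant"].items()
    ]
    return {
        "@type": "Annotation",
        "target": {
            "@type": "TextSelector",
            "resource": resource_id,  # hypothetical resource ID
            "offset": {
                "@type": "Offset",
                "begin": {"@type": "BeginAlignedCursor", "value": begin},
                "end": {"@type": "BeginAlignedCursor", "value": end},
            },
        },
        "data": data,
    }


parsed = {
    "span": [11, 15],
    "spelling_variant": {"E1": "part", "E2": "parties", "LE": "party"},
}
annotation = variants_to_annotation(parsed, "text.txt", "spelling-variants")
print(json.dumps(annotation, indent=2))
```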

Updating offsets when a text resource is altered

One of the main challenges our project faces is that we have multiple copies of the same text resource with varying degrees of cleanliness and annotation. For instance, we will have 50 instances of the Heart Sutra, with the cleanest one not having TOC annotations and a very dirty version having great NER tags. In some cases we might also only have a bad-quality text resource that is being proofread and annotated over a year.

Our goal is to be able to combine the best aspects of all resources and annotations at any given time.

In other words, we see STAM as the pivot format that will link Buddhist data in archives like BDRC, SuttaCentral or CBETA and websites like 84000 and pecha.org, which means that we will have to update, split and merge text resources and annotations on a regular basis.

We are also putting together training datasets for the project monlam.ai which also requires annotation transfer. For instance our MT model currently suffers from a lot of typos in our 2 million aligned sentence dataset and we need to transfer the segment annotations to cleaner versions of texts we are currently producing.

A couple of years ago, our team came up with an "annotation transfer" or "base text update" mechanism combining our CCTV algorithm with Google's Diff Match Patch package.

What would be your approach to tackle this challenge with STAM?
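
As a generic illustration of diff-based annotation transfer, the sketch below maps an offset from an old version of a text onto a new version. It uses Python's standard difflib.SequenceMatcher as a stand-in; it is not the CCTV algorithm or Diff Match Patch mentioned above, nor a STAM feature, and the function names are hypothetical. Positions that fall inside a changed region are snapped to the edge of the replacement text.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Opcode = Tuple[str, int, int, int, int]


def map_position(opcodes: List[Opcode], pos: int, is_end: bool) -> int:
    """Map one position in the old text to the new text."""
    for tag, i1, i2, j1, j2 in opcodes:
        # A begin position belongs to the block it opens, an end position to the block it closes.
        inside = (i1 < pos <= i2) if is_end else (i1 <= pos < i2)
        if inside:
            if tag == "equal":
                return j1 + (pos - i1)
            return j2 if is_end else j1  # snap to the edge of the changed region
    if is_end:
        return 0  # an end position of 0 precedes every block
    return opcodes[-1][4] if opcodes else 0  # a begin position at the very end of the old text


def transfer_span(old: str, new: str, begin: int, end: int) -> Tuple[int, int]:
    """Transfer the half-open span [begin, end) from `old` to `new`."""
    opcodes = SequenceMatcher(None, old, new, autojunk=False).get_opcodes()
    return (map_position(opcodes, begin, False), map_position(opcodes, end, True))


old = "Teh quick brown fox"
new = "The very quick brown fox"
new_begin, new_end = transfer_span(old, new, 4, 9)  # "quick" in the old text
print(new_begin, new_end, new[new_begin:new_end])
```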

Add LIMIT keyword to STAMQL

Add and implement a LIMIT keyword in the query language that limits a result sequence to only n results (LIMIT n) or takes arbitrary ranges (LIMIT m,n). Ideally it supports negative numbers for end-aligned results, e.g. LIMIT -5 gives the last five results whereas LIMIT 5 gives the first five.
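
The intended semantics could be sketched as follows. The issue does not define how LIMIT m,n is interpreted, so this sketch assumes m is an inclusive start index and n an exclusive end index, in the manner of a Python slice; the function name apply_limit is hypothetical.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def apply_limit(results: List[T], n: int, m: Optional[int] = None) -> List[T]:
    """Apply LIMIT n, or LIMIT m,n when `m` is given (assumed slice-like semantics)."""
    if m is not None:
        return results[m:n]   # assumed: from index m up to, but not including, n
    if n >= 0:
        return results[:n]    # the first n results
    return results[n:]        # the last |n| results (end-aligned)


rows = list(range(1, 11))
first_five = apply_limit(rows, 5)
last_five = apply_limit(rows, -5)
middle = apply_limit(rows, 6, m=2)
print(first_five, last_five, middle)
```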

Improve STAM Query Language documentation

Could be a bit nicer and clearer for end-users.

STAM poster and/or presentation

For presentation at the CLARIAH conference 2024

Formulate a STAM Query Language

A query language should be formulated to effectively query a STAM model. The
query language will be formulated as an extension to STAM and effectively
provides a higher-level interface that can be directly exposed to end-users as
the primary means of interacting with a STAM model.

The query language should be able to express (non-exhaustive):

Querying by text
Querying by text relations (overlap, embedding, adjacency, etc) (as implemented via TextSelectionOperator)
Querying by annotation data (with various operators, equality, inequality, greater than, less than, etc)
Querying by resource or annotation dataset
All common logical operators
Adding, editing, and deletion of annotations/annotationdata. That is, the queries are not just used to retrieve data, but also to add/update/delete data.

The query language should be accessible enough for (technical) researchers.

Disallow nesting complex selectors

In the current specification complex selectors (multi selectors, composite
selectors and directional selectors) can be nested at will, including multiple
nested layers. This allows the user to build a whole tree of selectors but
creates the problem that the semantic interpretation of such a tree is not
clearly defined.

I want to prevent this issue from arising by simply forbidding nesting of
complex selectors. If users want to build a tree-like structure then the proper
way to do so in STAM is to create annotations that refer to other
annotations, not through selectors (annotations carry labels, IDs, etc.;
selectors do not). This fits better with our 'everything is an annotation'
principle. It also simplifies implementing selectors.
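
A validation step that enforces this rule could look like the sketch below. It operates on plain dictionaries and assumes that complex selectors list their subselectors under a "selectors" field; the function name is hypothetical and this is not taken from an existing implementation.

```python
from typing import Any, Dict

COMPLEX_SELECTORS = {"MultiSelector", "CompositeSelector", "DirectionalSelector"}


def validate_no_nesting(selector: Dict[str, Any]) -> None:
    """Raise ValueError if a complex selector directly contains another complex selector."""
    if selector.get("@type") not in COMPLEX_SELECTORS:
        return  # simple selectors have no subselectors to check
    for sub in selector.get("selectors", []):  # "selectors" is an assumed field name
        if sub.get("@type") in COMPLEX_SELECTORS:
            raise ValueError(
                f"{sub['@type']} may not be nested inside {selector['@type']}"
            )


flat = {
    "@type": "MultiSelector",
    "selectors": [{"@type": "TextSelector"}, {"@type": "AnnotationSelector"}],
}
nested = {
    "@type": "CompositeSelector",
    "selectors": [{"@type": "MultiSelector", "selectors": []}],
}
validate_no_nesting(flat)  # passes silently
```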

Initial STAM presentation

The time has come for some outreach, a presentation is planned for an internal CLARIAH WP3 meeting, which serves as a nice test bed. I hope to later record and disseminate a STAM presentation in a video.

The slides can be found at https://github.com/annotation/stam/tree/master/docs/presentation

Expand STAM Query language with the ability to ADD and DELETE items

Right now STAMQL is read-only. Add ADD and DELETE statements so that the query language can manipulate data.

Note: an EDIT statement is less likely to be implemented due to the immutable nature of annotations in STAM.

STAMQL: Allow multiple subqueries

Currently a query only allows a single subquery. This was done to ensure all
variables in the results are always bound.

Having multiple subqueries means a combinatorial 'explosion' in the result set.
Given a query (A) with subqueries (X,Y,Z), where X in turn has subqueries (C,D),
we have the following hierarchy:

?A
    ?X
        ?C
        ?D
    ?Y
    ?Z

The result row binds not just a single set of variables now but differs according to the subquery path:

?A * ?X * ?C
?A * ?X * ?D
?A * ?Y
?A * ?Z

Say ?A yields a,b,c and ?Z yields p,q, then for the last bindings alone we get six results:

a p
a q
b p
b q
c p
c q

In spite of the combinatorial explosion, having multiple subqueries may give
necessary flexibility. Most notably, I think we can do away with the concept of
'highlight queries' in visualisation then, as subqueries can take over that role.
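
The subquery paths and the size of the result set can be illustrated with a small sketch. It treats the result rows of one path as the plain Cartesian product of the results per query, as in the example above; in a real query engine a subquery would be constrained by its parent's binding, so this is an upper bound. All names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def subquery_paths(name: str, children: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate every path from the given query down to a leaf subquery."""
    subqueries = children.get(name, [])
    if not subqueries:
        return [[name]]
    return [[name] + rest for sub in subqueries for rest in subquery_paths(sub, children)]


def rows_for_path(path: List[str], results: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    """All combinations of bindings along one path (unconstrained product)."""
    return list(product(*(results[name] for name in path)))


children = {"A": ["X", "Y", "Z"], "X": ["C", "D"]}
paths = subquery_paths("A", children)
rows = rows_for_path(["A", "Z"], {"A": ["a", "b", "c"], "Z": ["p", "q"]})
print(paths)
print(rows)
```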

why must a private identifier start with _?

The second is a private identifier, an internal numeric identifier (starting with an underscore)....

I know this is sometimes used as coding convention (i.e. in environments that do not support scoping), but I am not a fan.

Annotate existing xml resources?

STAM might be very useful for existing XML resources, of which there are many. This could be left to extenders, or of course not be considered at all, leaving STAM purely text-based.

Instead of converting XML to text first (which I expect must often be tailor-made, given whitespace handling and tags to convert to text) and using that as the basis to annotate, you could consider using XPath for pointing, perhaps analogous to Cursor, using XpathBegin and XpathEnd.

At the moment XML extenders of STAM must provide their own model and implementation for this part (except datasetselector):

I think making Cursor abstract simplifies adding support for xml/xpath (and more?).

Write STAM paper

When the library and tooling around STAM are mature enough, I'd like to write
and publish a paper on it, in which we present STAM and evaluate various aspects of it.
This was also suggested by @roelandordelman during the CLARIAH Tech Day recently.

This would happen at the earliest in Q4 2023 (but probably even later).

consider adding remarks or descriptions

Especially with annotationdataset and annotationdata it may be a good idea to add a clarifying remark.