interedition / collatex

CollateX – Software for Collating Textual Sources

License: GNU General Public License v3.0

Java 34.87% Python 12.40% Makefile 0.36% JavaScript 1.85% CSS 0.19% HTML 44.43% Batchfile 0.29% Jupyter Notebook 0.52% Kotlin 2.03% Less 0.25% Pug 2.81%

collatex's Introduction

CollateX is software designed to

read multiple (≥ 2) versions of a text, splitting each version into parts (tokens) to be compared,
identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and
output the alignment results in a variety of formats for further processing, for instance
to support the production of a critical apparatus or the stemmatical analysis of a text's genesis.

It resembles software used to compute differences between files (e.g. diff) or tools for sequence alignment which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results.

As such it is primarily designed for use cases in disciplines like Philology or – more specifically – the field of Textual Criticism where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable.

Please go to http://collatex.net/ for further information.

collatex's Issues

Empty <B> element in SVG output raises warning (error?)

In display_module.py, the following line:

readings = ["<TR><TD ALIGN='LEFT'><B>" + n.label + "</B></TD><TD ALIGN='LEFT'>exact: " + str(rank) + "</TD></TR>"]

raises a warning (error?) when n.label is null.

Possible fix: Test n.label and replace it with &nbsp; (non-breaking space) if null.
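
A minimal sketch of that guard, assuming the row is built by plain string concatenation as in the line quoted above. The helper name reading_row is hypothetical and not part of display_module.py; the sketch uses the numeric reference &#160;, which denotes the same non-breaking space character.

```python
def reading_row(label, rank):
    """Build one label row; hypothetical helper, not the project's own code."""
    # Fall back to a non-breaking space so the <B> element is never empty
    # (covers both None and the empty string).
    safe_label = label if label else "&#160;"
    return ("<TR><TD ALIGN='LEFT'><B>" + safe_label + "</B></TD>"
            "<TD ALIGN='LEFT'>exact: " + str(rank) + "</TD></TR>")


print(reading_row(None, 3))
print(reading_row("koala", 1))
```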

CSV/TSV Output

The ability to output in CSV and/or TSV formats
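
As an illustration of what such output could look like, here is a small sketch using Python's standard csv module. It assumes the alignment is already available as a list of rows (one list of cell strings per witness); that shape and the function name write_delimited are assumptions for the example, not CollateX's API.

```python
import csv
import io


def write_delimited(rows, delimiter=","):
    """Serialize rows of cell strings as CSV (default) or TSV (delimiter='\\t')."""
    buffer = io.StringIO()
    # csv.writer quotes cells that contain the delimiter or a quote character.
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["A", "The", "gray", "koala"], ["C", "The", "", "koala"]]
print(write_delimited(rows))                  # comma-separated
print(write_delimited(rows, delimiter="\t"))  # tab-separated
```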

Alignment result is dependent on witness order.

Case supplied by Hayco de Jong. (Hermans project)

Problem: Two nodes are added to the variant graph for ! which is suboptimal.

W1: a b c d F g h i ! K ! q r s t
W2: a b c d F g h i ! q r s t

Longest sequence is now: a b c d F g h i !

W3: a b c d E g h i ! q r s t

Longest sequence is now: ! q r s t

This case can be solved by using a non-progressive multiple witness aligner.
I am working on a prototype of such an implementation.

Homepage is offline

The CollateX website returns HTTP 404.

Update python bindings to current API

Needed for Python-based UI work.

Error if segmentation used for example

I am getting the following error in both the pypi and github version of the python-port when trying to run the example collation code. I am using Python3 and have not checked this in python2. If segmentation is set to False in the collate() call then the collation completes as expected. I do not need to use the built in segmentation so this does not cause a problem for me but I thought it should be reported as it causes the example code to break.

Error message below:

/srv/itsee/django_project/collation/collatex/collatex-pythonport/collatex/core_functions.py in collate(collation, output, layout, segmentation, near_match, astar, detect_transpositions, debug_scores, properties_filter, indent, scheduler)
71 # join parallel segments
72 if segmentation:
---> 73 join(graph)
74 ranking = VariantGraphRanking.of(graph)
75 # check which output format is requested: graph or table

/srv/itsee/django_project/collation/collatex/collatex-pythonport/collatex/core_classes.py in join(graph)
321 out_edges = graph.out_edges(vertex)
322 if len(out_edges) is 1:
--> 323 (_, join_candidate) = out_edges[0]
324 can_join = join_candidate != end and len(graph.in_edges(join_candidate)) == 1
325 if can_join:

TypeError: 'OutEdgeDataView' object does not support indexing
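
The traceback suggests that graph.out_edges(vertex) returns a view that can be iterated but not indexed. One possible workaround, sketched here without networkx itself, is to take the first element through an iterator instead of out_edges[0]. The helper name first_edge is hypothetical, and a dict items view stands in for the networkx view because it has the same limitation. Separately, the quoted code tests len(out_edges) is 1; comparing with == is the usual way to compare integers.

```python
def first_edge(out_edges):
    """Return the first edge from any iterable of edges, indexable or not."""
    return next(iter(out_edges))


# Stand-in for the networkx view: a dict items view is also iterable
# but raises TypeError on edges[0].
edges = {"v1": "v2"}.items()

if len(edges) == 1:
    (_, join_candidate) = first_edge(edges)
    print(join_candidate)
```

Wrapping the view in list(...) before indexing would work as well, at the cost of copying all edges.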

Documentation bug in tokenized input example

This example from the documentation:

{
  "witnesses" : [
    {
      "id" : "A",
      "tokens" : [
          { "t" : "A", "ref" : 123 },
          { "t" : "black" , "adj" : true },
          { "t" : "cat", "id" : "xyz" }
      ]
    },
    {
      "id" : "B",
      "tokens" : [
          { "t" : "A" },
          { "t" : "white" , "adj" : true },
          { "t" : "kitten.", "n" : "cat" }
      ]
    }
  ]
}

is misleading because the tokens "t" should include trailing whitespace when appropriate. If you use the built-in tokenizer instead, the tokens include whitespace by default. Also, the normalized "n" should be shown to exclude whitespace so as not to fool the token comparators.

Why is this important? Because if you omit whitespace the segment joining phase will run tokens together like this:

digraph G {
  v0 [label = ""];
  v1 [label = "Ablackkitten."];
  v2 [label = ""];
  v0 -> v1 [label = "A, B"];
  v1 -> v2 [label = "A, B"];
  v0 -> v2 [color =  "white"];
}

N.B. This output was generated from the slightly modified (to exercise the segment joiner) input:

{
  "witnesses" : [
    {
      "id" : "A",
      "tokens" : [
          { "t" : "A", "ref" : 123 },
          { "t" : "black" , "adj" : true },
          { "t" : "cat", "id" : "xyz" }
      ]
    },
    {
      "id" : "B",
      "tokens" : [
          { "t" : "A" },
          { "t" : "black" , "adj" : true },
          { "t" : "kitten.", "n" : "cat" }
      ]
    }
  ]
}
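
To illustrate the point about whitespace, the following sketch builds tokens whose "t" values keep a trailing space (except the last token) while "n" holds the bare word. The helper name prepare_tokens is hypothetical, and treating "n" as the word without whitespace is an assumption drawn from the description above, not from the CollateX documentation.

```python
import json


def prepare_tokens(words):
    """Give every token but the last a trailing space in "t"; keep "n" bare."""
    tokens = []
    for index, word in enumerate(words):
        is_last = index == len(words) - 1
        tokens.append({
            "t": word if is_last else word + " ",  # text as it should be rendered
            "n": word,                             # normalized form, no whitespace
        })
    return tokens


witness = {"id": "A", "tokens": prepare_tokens(["A", "black", "cat"])}
print(json.dumps({"witnesses": [witness]}, indent=2))
```

With tokens shaped like this, joined segments would read "A black cat" rather than "Ablackcat".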

Alignment is suboptimal in the following transposition case

1: a, b, c, d, e
2: a, e, c, d
3: a, d, b

Case was brought to our attention by Daniel Stoekl

Integrate CollateX with Juxta

Best of both worlds ...

Stabilize default resolver mechanism for SimpleWitness/SimpleToken

The generic java.util.Map-based resolver for mapping witness and token instances to integer values seems unstable in multithreaded environments. A simpler and more stable approach seems to be an auto-generated integer property of those instances.

Create embedded servlet container application for standalone deployment

Some of Interedition's RESTful web-services are implemented as a JavaEE web application. To ease their deployment, a small Java application with an embedded Java Servlet Container shall be provided which allows for standalone deployment of the services on a desktop computer.

Release 2.1.3rc1 installs networkx 2.0; needs to be 1.11

networkx 2.0 changes the API in a way that breaks collatex. In setup.py, the latest collatex code base correctly specifies networkx version 1.11:

install_requires=['networkx==1.11','prettytable']

but 2.1.3rc1 in pypi installs version 2.0 of networkx.

Fix: This will fix itself on next release.
Workaround:

pip uninstall networkx
pip install -Iv networkx==1.11

MWA problem: sometimes a double "eeex" node is created, other times a double "dddd" node is created

from collatex import *
collation = Collation()
collation.add_plain_witness("A", "aaaa bbbb cccc dddd eeee ffff")
collation.add_plain_witness("B", "aaaa bbbb eeex ffff") # Near-match gap
collation.add_plain_witness("C", "aaaa bbbb cccc eeee ffff")
collation.add_plain_witness("D", "aaaa bbbb eeex dddd ffff") # Transposition
# collation.add_plain_witness("E", "aaa aaa aaa aaa aaa")
# table = collate(collation, segmentation=False, near_match=True)
# print(table)
collate(collation, segmentation=False, output="svg", near_match=False)

CommandLine version of Collatex (v1.7) tries to align in a progressive manner.

It shouldn't do that.

CollationPipe (in CollateX tools module):
final VariantGraph variantGraph = new VariantGraph();
for (SimpleWitness witness : witnesses) {
    collationAlgorithm.collate(variantGraph, witness);
}

Reported by Gioele

Crash with tokenComparator = levenshtein

Given this JSON input:

{
    "tokenComparator": {
        "type": "levenshtein",
        "distance": 1
    },
    "witnesses": [
        {
            "id": "id1",
            "content": "ad capellas dominicas dantur"
        },
        {
            "id": "id2",
            "content": "ad capellam dominicam dantur"
        }
    ]
}

1.7.1 and 1.8.SNAPSHOT both crash with:

Unexpected error
null

Got this stack trace (1.8.SNAPSHOT):

java.lang.NullPointerException
at eu.interedition.collatex.suffixarray.GenericArrayAdapter.buildSuffixArray(GenericArrayAdapter.java:62)
at eu.interedition.collatex.suffixarray.SuffixArrays.createWithLCP(SuffixArrays.java:108)
at eu.interedition.collatex.dekker.token_index.TokenIndex.prepare(TokenIndex.java:49)
at eu.interedition.collatex.dekker.DekkerAlgorithm.collate(DekkerAlgorithm.java:74)
at eu.interedition.collatex.tools.CollationPipe.start(CollationPipe.java:153)
at eu.interedition.collatex.tools.CollateX.main(CollateX.java:45)

Following is a tentative fix that makes EditDistanceTokenComparator act more like a comparator should. (We still have a transitivity problem with this comparator because A == B and B == C do not imply A == C. Don't know if that matters much though.)

--- a/collatex-core/src/main/java/eu/interedition/collatex/matching/EditDistanceTokenComparator.java
+++ b/collatex-core/src/main/java/eu/interedition/collatex/matching/EditDistanceTokenComparator.java
@@ -40,6 +40,7 @@ public class EditDistanceTokenComparator implements Comparator<Token> {
     public int compare(Token base, Token witness) {
         final String baseContent = ((SimpleToken) base).getNormalized();
         final String witnessContent = ((SimpleToken) witness).getNormalized();
-        return (EditDistance.compute(baseContent, witnessContent) <= threshold) ? 0 : -1;
+        return (EditDistance.compute(baseContent, witnessContent) <= threshold) ?
+                0 : baseContent.compareTo(witnessContent);
     }
 }

Regards
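
The transitivity problem mentioned above can be shown with a short, self-contained sketch. It assumes a plain Levenshtein distance and a threshold of 1, as in the JSON input; the function names are illustrative and this is not the Java EditDistance implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def near_match(a, b, threshold=1):
    """Treat two strings as equal when their distance is within the threshold."""
    return edit_distance(a, b) <= threshold


# "cat" matches "cot" and "cot" matches "cog", yet "cat" does not match "cog".
print(near_match("cat", "cot"), near_match("cot", "cog"), near_match("cat", "cog"))
# prints: True True False
```

Any code that sorts or indexes tokens on the assumption that "equal" is transitive may therefore behave unpredictably with a threshold-based comparator.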

Most Python unit tests are failing

I tried to run the Python unit tests in collatex/pythonport. Half of the test files have ImportErrors from calling missing or renamed modules, and one file has a failing test. Let's fix this.

Support editing of the collation results by scholars

Requested e.g. by New Testament Project.

GraphML output in CollateX Python

@tla @rhdekker I'm looking at adding GraphML output to CollateX Python (it's already in CollateX Java), and I'm not confident about the target output format. Specifically, the Java GraphML output for nodes contains three fields, one for the node id, one for the rank, and one that is a concatenation of the t properties of the tokens on that node. For example, using the first example at https://collatex.net/demo/, the Dekker alignment algorithm, and with Segmentation and Transposition both checked, the first non-start node in the GraphML output is:

        <node id="n1">
            <data key="d0">1</data>
            <data key="d2">1</data>
            <data key="d1">This morning </data>
        </node>

This means node 1, rank 1, and the concatenated t value of the tokens is “This morning ”.

This output seems to have two limitations:

It does not persist the n values, which cannot be recreated without knowing how normalization was performed. It is possible to add this as a separate property on (that is, <data> child of) the node. It also does not persist other properties that the user might have added to the token during normalization.
It does not persist the tokenization, which cannot be recreated without knowing how it was performed originally.

The second of these limitations goes away if Segmentation is turned off, so that no node can contain more than one token, but that then restricts the types of CollateX variant graphs that can be exported as GraphML.

It is possible to support complex objects in GraphML by customizing the schema (http://graphml.graphdrawing.org/primer/graphml-primer.html#Complex). In that case, even with Segmentation enabled, each pair of t and n values (and other properties that the user might have added to the token during normalization) could be represented by a complex type. It isn't clear to me, though, whether that is the best strategy, especially because it was not adopted for the CollateX Java output.

Might either of you be able to provide some guidance about the requirements and expectations?
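
For discussion, here is a sketch of the simpler option mentioned above: adding the n values as one more <data> child of the node. The key id "d3" and the helper name node_element are hypothetical; only d0, d1, and d2 appear in the Java output quoted earlier. Joining the n values with a space preserves token boundaries only when no n value itself contains a space, so this does not fully solve the tokenization limitation.

```python
import xml.etree.ElementTree as ET


def node_element(node_id, rank, tokens):
    """Build a GraphML <node> with the existing fields plus a hypothetical n field."""
    node = ET.Element("node", id="n" + str(node_id))
    ET.SubElement(node, "data", key="d0").text = str(node_id)
    ET.SubElement(node, "data", key="d2").text = str(rank)
    # d1: concatenated t values, as in the Java output quoted above.
    ET.SubElement(node, "data", key="d1").text = "".join(tok["t"] for tok in tokens)
    # d3 (assumed key): space-separated n values.
    ET.SubElement(node, "data", key="d3").text = " ".join(tok["n"] for tok in tokens)
    return node


tokens = [{"t": "This ", "n": "this"}, {"t": "morning ", "n": "morning"}]
print(ET.tostring(node_element(1, 1, tokens), encoding="unicode"))
```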

Reading txt in the command line interface

Dear Collatex creators,
thank you so much for making your tool available!! It sounds super useful. Sadly, I haven't been able to run it. It would be lovely if you can show me an example of how to read a .txt file using the Command Line Interface.
Say I have output-adobe.txt + output-tesseract.txt + original.txt and want to compare them.
I open collatex like:
C:\Users\xxxx\Desktop> java -jar collatex-tools-1.7.1.jar
and then?
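
One way to prepare such files, sketched under assumptions: wrap each text file as a witness in the JSON input shape that appears in other reports on this page ("witnesses" with "id" and "content"). The function name witnesses_from_files is hypothetical, and how the resulting file is passed to the command-line tool should be checked against the tool's own help output.

```python
import json
from pathlib import Path


def witnesses_from_files(paths):
    """Wrap plain-text files as witnesses; the file stem becomes the witness id."""
    witnesses = []
    for path in map(Path, paths):
        witnesses.append({
            "id": path.stem,
            "content": path.read_text(encoding="utf-8"),
        })
    return {"witnesses": witnesses}


# Example use with the file names from the question above:
# data = witnesses_from_files(["output-adobe.txt", "output-tesseract.txt", "original.txt"])
# Path("input.json").write_text(json.dumps(data), encoding="utf-8")
```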

Pythonport: claimed python3 compatibility

collatex-pythonport/setup.py (and thus also PyPI) claims the package is Python 3.x compatible, but it is not.

While I'd be inclined to create a fork which is python3 compatible which you could pull, I'm not sure I'm willing or able to create a py2-py3 ambicompatible version.

[Adding info on setup]:

Dual python setup with python2.7 and python3.4. On a recent Lubuntu. (uname -a output: Linux luby-VB 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:45 UTC 2014 i686 i686 i686 GNU/Linux)

I installed networkx, prettytable, and collatex, in this order, from manually downloaded tarballs with sudo pip3 [...] (to install into the python3 dist-packages directory).

When I try to import collatex, errors typical of Python 2-only code are thrown, one after another as I fix each of them locally.

Load CLI result in Python?

Thanks again for a great tool. Is there an easy way to load the result of an alignment created with the command line tool into a Python Collation object? (I want to use the Java tool for its speed, but keep the flexible visualization and postprocessing capabilities that I have in a Jupyter notebook.)

Mysterious error on collation attempt

Hi - I have run into an error when trying to collate the attached JSON input, which doesn't give much idea of the problem. If I remove the first witness (Bz449) then it works, but I can't see anything obviously wrong in the Bz449 input. Any hints would be very welcome.

Taras-iMac:collatex tla$ collatex -t --format json milestone-455.json > /dev/null
Unexpected error
null
Taras-iMac:collatex tla$

milestone-455.json.gz

CollateX Python: Check input for duplicate witness ID and throw error if that is the case

Reported by Elena Spadini. She collated 9 witnesses but accidentally used the following witness IDs: ["W1", "W2", "W3", "W4", "W1", "W2", "W3", "W4", "W1"]

CX core should be more defensive.
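
A sketch of such a defensive check, using the IDs from the report. The function names are hypothetical and the check is written independently of the Collation class, so where it would be called inside CollateX Python is left open.

```python
from collections import Counter


def find_duplicate_ids(witness_ids):
    """Return each witness ID that occurs more than once, in first-seen order."""
    counts = Counter(witness_ids)
    return [wid for wid, count in counts.items() if count > 1]


def check_witness_ids(witness_ids):
    """Raise ValueError when any witness ID is repeated."""
    duplicates = find_duplicate_ids(witness_ids)
    if duplicates:
        raise ValueError("Duplicate witness IDs: " + ", ".join(duplicates))


ids = ["W1", "W2", "W3", "W4", "W1", "W2", "W3", "W4", "W1"]
print(find_duplicate_ids(ids))
```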

Custom matching function

Mike Kestemont asks whether there is a custom matching function that allows using the cosine of two vectors of numbers. It would be even better if the matching were not just true or false and the alignment were scored globally. Both should be possible. He calls the Java library from Python.
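
For reference, a minimal sketch of the measure being asked about. It is plain Python and says nothing about how CollateX would accept such a function; the names cosine_similarity and vectors_match and the 0.9 threshold are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def vectors_match(u, v, threshold=0.9):
    """Boolean match derived from the graded score."""
    return cosine_similarity(u, v) >= threshold


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(vectors_match([1.0, 2.0], [2.0, 4.0]))
```

Returning the score itself rather than the boolean is what a globally scored alignment would need.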

CollateX refuses Json input

Not sure if this repo is being maintained.
Possibly a version of #44
json of tokenized witnesses in order A (working.txt) works; in order B (nonworking.txt) collatex returns an error
Hand editing nonworking.txt so that the witnesses and array of tokens are in the same order returns alignment.
Sending data to collatex via REST

nonworking.txt
working.txt

The Python bindings need to be brought back to life

The Python bindings were donated to the project in 2010. The major API changes in CollateX version 1.3 broke the bindings. The bindings need to be brought up to date.

Requirements:

Uberjar. Added a nodeps module to the project. To generate jar run mvn package.
JPype. Install JPype. $> python ./setup.py

Regex to specify lower-priority collation tokens

It often happens in automated collation that very common / frequent tokens, e.g. punctuation or words like 'and' or 'the', get matched a little too eagerly by the algorithm so that more substantive tokens are misaligned. Moreover, the set of tokens that cause this problem will vary according to language / text type / etc.

At the moment I am dealing with this by assigning random strings of characters in the n field of the JSON object for these tokens, so that CollateX won't match them with anything else. This works, but leads to a bunch of duplicated tokens in the output, which I deal with using a graph search algorithm.

Since what I am doing in post-processing looks and smells an awful lot like collation, it seems like something CollateX should be able to handle internally - match the 'substantive' tokens on a first pass, and the non-substantive ones on a second pass, relative to the alignment that has already been done. The easiest way of specifying these 'unimportant' tokens might be a regular expression, since (as mentioned) they will vary from text to text.
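
The workaround described above can be sketched as follows. This is a pre-processing step on the JSON tokens, not a CollateX feature; the function name, the placeholder format, and the use of a counter instead of random strings are assumptions made so the example is deterministic. The same counter must be shared across all witnesses so that no two placeholders are equal.

```python
import re
from itertools import count


def isolate_low_priority(tokens, pattern, serial):
    """Give every token whose text fully matches pattern a unique n value."""
    regex = re.compile(pattern)
    result = []
    for token in tokens:
        updated = dict(token)  # leave the caller's token untouched
        if regex.fullmatch(token["t"].strip()):
            updated["n"] = "__low_priority_{}__".format(next(serial))
        result.append(updated)
    return result


serial = count()  # shared by all witnesses
witness_a = isolate_low_priority([{"t": "the "}, {"t": "cat "}, {"t": ", "}],
                                 r"the|and|[.,;:]", serial)
witness_b = isolate_low_priority([{"t": "and "}, {"t": "the "}],
                                 r"the|and|[.,;:]", serial)
print(witness_a)
print(witness_b)
```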

Integrate CollateX with eXist XML database

Best of both worlds ...

GraphML output: add ranking information to nodes

Requested by Dirk Roorda to improve post processing

Alignment in linking phase is suboptimal in following example

the cat and the dog
the dog and the cat

There are 3 islands of size 2 that overlap.
One of them should be split up in 2 islands of size 1 to get to the optimal alignment.

Alignment error

Given the following in a Jupyter Notebook, using CollateX Python 2.2:

%reload_ext autoreload
%autoreload 2
from collatex import *
collation1 = Collation()
collation1.add_plain_witness("A", "The gray koala.")
collation1.add_plain_witness("B", "The gray koala.")
collation1.add_plain_witness("C", "The koala lives in a tree.")
table1 = collate(collation1, segmentation=False)
print(table1)
collation2 = Collation()
collation2.add_plain_witness("A", "The gray koala.")
collation2.add_plain_witness("B", "The big gray koala.")
collation2.add_plain_witness("C", "The koala lives in a tree.")
table2 = collate(collation2, segmentation=False)
print(table2)
collation3 = Collation()
collation3.add_plain_witness("C", "The koala lives in a tree.")
collation3.add_plain_witness("A", "The gray koala.")
collation3.add_plain_witness("B", "The big gray koala.")
table3 = collate(collation3, segmentation=False)
print(table3)

The output is:

+---+-----+------+-------+-------+----+---+------+---+
| A | The | gray | koala | -     | -  | - | -    | . |
| B | The | gray | koala | -     | -  | - | -    | . |
| C | The | -    | koala | lives | in | a | tree | . |
+---+-----+------+-------+-------+----+---+------+---+
+---+-----+-------+-------+-------+---+------+---+
| A | The | -     | gray  | koala | - | -    | . |
| B | The | big   | gray  | koala | - | -    | . |
| C | The | koala | lives | in    | a | tree | . |
+---+-----+-------+-------+-------+---+------+---+
+---+-----+-----+------+-------+-------+----+---+------+---+
| C | The | -   | -    | koala | lives | in | a | tree | . |
| A | The | -   | gray | koala | -     | -  | - | -    | . |
| B | The | big | gray | koala | -     | -  | - | -    | . |
+---+-----+-----+------+-------+-------+----+---+------+---+

The first and third collations are correct; the second is incorrect. The second has the same witnesses as the third, but they are added in a different order.

Choosing an algorithm

Hello,
this could be a silly question. According to the documentation, one can choose between three algorithms: Dekker, Needleman-Wunsch and MEDITE. I do not know how to specify that in the python collate function. Any idea?

toTable method should have a different return type RowSortedTable<Witness, Integer, Token>

CollateX crashes with «Unexpected error null»

CollateX will crash with the following error

Unexpected error
null

when run with the following JSON input file:

{
    "witnesses" : [
        {
            "id" : "base",
            "tokens" : [
                { "t" : "id " },
                { "t" : "est " },
                { "t" : "sexaginta ", "n": "[num]" },
                { "t" : "solidos " }
            ]
        },
        {
            "id" : "w1",
            "tokens" : [
                { "t" : "solidos " },
                { "t" : "triplo " }
            ]
        },
        {
            "id" : "w2",
            "tokens" : [
                { "t" : "nostrum " },
                { "t" : "cogatur " },
                { "t" : "id " },
                { "t" : "xl ", "n": "[num]" },
                { "t" : "solidos " },
                { "t" : "in " }
            ]
        },
        {
            "id" : "w3",
            "tokens" : [
                { "t" : "nostrum " },
                { "t" : "in " },
                { "t" : "triplo " },
                { "t" : "conpon " },
                { "t" : ". " },
                { "t" : ". " },
                { "t" : "id " },
                { "t" : "lx ", "n": "[num]" }
            ]
        },
        {
            "id" : "w4",
            "tokens" : [
                { "t" : "est " },
                { "t" : ". " },
                { "t" : ". " },
                { "t" : ". " },
                { "t" : "solidos " },
                { "t" : "in " }
            ]
        }
    ]
}

The error started appearing after we normalized the token sexaginta to [num].

Please note that this JSON file is a reduced, minimal test case; the original was much longer and very different.

Sequences should only contain matches/non-matches

Sequence detection does not make a distinction between matches/non-matches.

TEI output error (CollateX Python 2.2)

Given:

%reload_ext autoreload
%autoreload 2
from collatex import *
collation = Collation()
collation.add_plain_witness("A","The big gray koala.")
collation.add_plain_witness("B", "The big gray koala.")
collation.add_plain_witness("C","The gray fuzzy koala lives in a tree.")
table = collate(collation)
print(table)

table alignment is correct:

+---+-----+-----+------+-------+-------+-----------------+---+
| A | The | big | gray | -     | koala | -               | . |
| B | The | big | gray | -     | koala | -               | . |
| C | The | -   | gray | fuzzy | koala | lives in a tree | . |
+---+-----+-----+------+-------+-------+-----------------+---+

but TEI alignment doesn’t recognize that all instances of “koala” agree. When we run:

tei = collate(collation, output="tei", indent=True)
print(tei)

we get:

<?xml version="1.0" ?>
<cx:apparatus xmlns="http://www.tei-c.org/ns/1.0" xmlns:cx="http://interedition.eu/collatex/ns/1.0">
	The 
	<app>
		<rdg wit="#A #B">big</rdg>
		<rdg wit="#C"/>
	</app>
	 
	gray 
	<app>
		<rdg wit="#C">fuzzy</rdg>
		<rdg wit="#A #B"/>
	</app>
	 
	<app>
		<rdg wit="#A #B">koala</rdg>
		<rdg wit="#C">koala</rdg>
	</app>
	 
	<app>
		<rdg wit="#C">lives in a tree</rdg>
		<rdg wit="#A #B"/>
	</app>
	.
</cx:apparatus>

The “koala” readings all agree, and therefore should be output as plain text, and not inside a <rdg>.

Furthermore, there should not be two <rdg> children of the same <app> that have the same textual content. If we add another witness to remove the exact equality:

%reload_ext autoreload
%autoreload 2
from collatex import *
collation = Collation()
collation.add_plain_witness("A","The big gray koala.")
collation.add_plain_witness("B", "The big gray koala.")
collation.add_plain_witness("D", "The big gray wombat.")
collation.add_plain_witness("C","The gray fuzzy koala lives in a tree.")
table = collate(collation,segmentation=False, near_match=True)
print(table)

The table output is again correct:

+---+-----+-----+------+-------+--------+-------+----+---+------+---+
| A | The | big | gray | -     | koala  | -     | -  | - | -    | . |
| B | The | big | gray | -     | koala  | -     | -  | - | -    | . |
| D | The | big | gray | -     | wombat | -     | -  | - | -    | . |
| C | The | -   | gray | fuzzy | koala  | lives | in | a | tree | . |
+---+-----+-----+------+-------+--------+-------+----+---+------+---+

but the TEI output incorrectly puts the koalas in different <rdg> elements, so that:

tei = collate(collation, output="tei", indent=True, segmentation=False, near_match=True)
print(tei)

outputs:

<?xml version="1.0" ?>
<cx:apparatus xmlns="http://www.tei-c.org/ns/1.0" xmlns:cx="http://interedition.eu/collatex/ns/1.0">
	The 
	<app>
		<rdg wit="#A #B #D">big</rdg>
		<rdg wit="#C"/>
	</app>
	 
	gray 
	<app>
		<rdg wit="#C">fuzzy</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	 
	<app>
		<rdg wit="#A #B">koala</rdg>
		<rdg wit="#C">koala</rdg>
		<rdg wit="#D">wombat</rdg>
	</app>
	 
	<app>
		<rdg wit="#C">lives</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	 
	<app>
		<rdg wit="#C">in</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	 
	<app>
		<rdg wit="#C">a</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	 
	<app>
		<rdg wit="#C">tree</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	.
</cx:apparatus>

These may be consequences of a single problem, the failure to recognize that the koalas belong together.

CollateX cannot read JSON from stdin

CollateX cannot be used in a Unix pipe. I'd like to pipe JSON files into CollateX.

<rdg> value in TEI output (Java) is "n", and should be "t"

The value of the <rdg> element is the n property, rather than the t. This may look correct in situations where the normalization is limited to stripping trailing whitespace (the default), especially with segmentation enabled, but even case-folding produces output that is visibly inconsistent with normal practice in critical editing, and anything more complex (e.g., thesaurus, Soundex) may produce output that is illegible.

which version of java is required

Hi, I have tried a local version of collatex-tools-1.7.0.jar on my computer.

I got "Unsupported major.minor version 52.0".

So what is the minimum requirement? Should it be added to the website?

Also: I read that you are working on a Python version. Will both the Java and Python versions be maintained, or only Python?

Colours in HTML

May be beneficial to have a different HTML2 colour scheme, such as Blue/Yellow to aid those with Red/Green colour-blindness

Misalignment

In CollateX Python 2.1.3rc2, the input:

from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The big, gray, fuzzy koala.")
collation.add_plain_witness("B","The big, old, gray koala:")
collation.add_plain_witness("C","The big, gray, fuzzy wombat.")
table = collate(collation, segmentation=False, near_match=True)
print(table)

produces (with or without near matching):

+---+-----+-----+---+------+---+-------+--------+---+
| A | The | big | , | gray | , | fuzzy | koala  | . |
| B | The | big | , | old  | , | gray  | koala  | : |
| C | The | big | , | gray | , | fuzzy | wombat | . |
+---+-----+-----+---+------+---+-------+--------+---+

This fails to align “gray”, which matches exactly in all witnesses. The desired alignment is:

+---+-----+-----+---+------+---+------+---+-------+--------+---+
| A | The | big | , |      |   | gray | , | fuzzy | koala  | . |
| B | The | big | , | old  | , | gray |   |       | koala  | : |
| C | The | big | , |      |   | gray | , | fuzzy | wombat | . |
+---+-----+-----+---+------+---+------+---+-------+--------+---+

This appears to be a transposition situation, where CollateX aligns the commas in preference to the words. A philologist would prioritize aligning the words.

Input encoding of known gaps

I want to use CollateX on fairly fragmentary text witnesses. In some cases I know from the given data that gaps exist, and sometimes even their length.
Is there a way to inform CollateX about such gaps? I am currently using JSON input with tokens and their normalized forms.

Crash when trying to collate pretokenized witnesses with space or punctuation

If you have a witness that looks like this:

                {
                    "id": "B",
                    "tokens": [
                        {"t": "A"},
                        {"t": "white", "adj": True},
                        {"t": "mousedog bird", "adj": False}
                    ]
                }

with a token that has either a space or a punctuation mark, then collate_pretokenized_json crashes like this:

Error
Traceback (most recent call last):
  File "/Users/tla/Projects/collatex/collatex-pythonport/tests/test_witness_tokens.py", line 42, in testPretokenizedWitness
    result = collate_pretokenized_json(pretokenized_witness)
  File "/Users/tla/Projects/collatex/collatex-pythonport/collatex/core_functions.py", line 69, in collate_pretokenized_json
    new_row.cells.append(tokenized_witness[token_counter])
IndexError: list index out of range

This is because the pretokenized witness gets concatenated together and then re-tokenized on whitespace and punctuation. It shouldn't be doing that in the first place.

Error in documentation on python develop env

Contributing.rst now says:

$ mkvirtualenv collatex
$ cd collatex/
$ python setup.py develop

But I think it should be:

$ mkvirtualenv collatex
$ cd collatex/collatex-pythonport
$ python setup.py develop

Which means that in "4. Create a branch for local development:" line 1 should read "cd .."
(because one needs to check out the project in the main collatex directory), and "5." should start with "cd collatex-pythonport" again.

Bad UTF-8 content in GraphML output on plain text

In constructing a simple test case for other purposes, I ran across a bug in the GraphML output of CollateX, which should be reproducible as follows:

Taras-iMac:MatthewEdessa tla$ cat first.txt 
The quick brown fox jumped over the lazy dogs.
Taras-iMac:MatthewEdessa tla$ cat second.txt 
the quick brown fox jumped over the lazy sleeping dog.
Taras-iMac:MatthewEdessa tla$ cat third.txt 
The quick brown fox jumped over the sleeping cat.
Taras-iMac:MatthewEdessa tla$ file *.txt
first.txt:  ASCII text
second.txt: ASCII text
third.txt:  ASCII text
Taras-iMac:MatthewEdessa tla$ ~/bin/collatex -t -f graphml first.txt second.txt third.txt > test.xml

Here is a zip file of the output - you can see an enormous blob of bad data in the middle of node 14.
test.xml.zip

CollateX Python port, unicode and Python 3

Python 3 is a backwards incompatible API break of Python with the goal to make Unicode Strings the default. See the reasoning on the following page:
http://ncoghlan-devs-python-notes.readthedocs.org/en/latest/python3/questions_and_answers.html

Unicode support is a major problem in the current preview versions of CollateX Python.

I am currently investigating what a clean port to Python 3 would entail.

CSV output contains "melded" tokens

The tokenized JSON input at the end of this report will generate the following incorrect CSV:

BK_Text_Superstruktur-s1,cava-dei-tirreni-bdb-4-s1
Siquis,Siquis
autexleui,qualibet
causaautsinecausahomineminecclesia,causaautsinecausahominemInecclesia
interfecerit,Interficereuoluerit
de,de

As you can see, many tokens have been "melded" in a very long string. This problem does not appear when the JSON output format is chosen.

This is the source JSON:

{
    "witnesses" : [
        {
            "id" : "BK_Text_Superstruktur-s1",
            "tokens" : [
                { "t" : "Si" },
                { "t" : "quis" },
                { "t" : "aut" },
                { "t" : "ex" },
                { "t" : "leui" },
                { "t" : "causa" },
                { "t" : "aut" },
                { "t" : "sine" },
                { "t" : "causa" },
                { "t" : "hominem" },
                { "t" : "in" },
                { "t" : "ecclesia" },
                { "t" : "interfecerit" },
                { "t" : "de" }
            ]
        },
        {
            "id" : "cava-dei-tirreni-bdb-4-s1",
            "tokens" : [
                { "t" : "Si" },
                { "t" : "quis" },
                { "t" : "qualibet" },
                { "t" : "causa" },
                { "t" : "aut" },
                { "t" : "sine" },
                { "t" : "causa" },
                { "t" : "hominem" },
                { "t" : "In" },
                { "t" : "ecclesia" },
                { "t" : "Interficere" },
                { "t" : "uoluerit" },
                { "t" : "de" }
            ]
        }
    ]
}

Feature request for traceability of alignment decisions

A request by Elena and Gioele.

Elena wants to add lots of information to tokens and make complex alignment decisions based on all that information.

No stack trace for "Unexpected error"

When CollateX crashes with "Unexpected error" it does not print any other information. It should print a stack trace.

TypeError OutEdgeDataView

Hi,
When calling the collate() function I get an error message:
File "/Users/ellibleeker/anaconda3/lib/python3.6/site-packages/collatex/core_classes.py", line 329, in join
(_, join_candidate) = out_edges[0]
TypeError: 'OutEdgeDataView' object does not support indexing

Working with latest version (2.0) of NetworkX; this might be the problem?
