interedition / collatex

CollateX – Software for Collating Textual Sources

Home Page: http://collatex.net/

License: GNU General Public License v3.0

Java 34.87% Python 12.40% Makefile 0.36% JavaScript 1.85% CSS 0.19% HTML 44.43% Batchfile 0.29% Jupyter Notebook 0.52% Kotlin 2.03% Less 0.25% Pug 2.81%

collatex's Introduction

CollateX is software to

  1. read multiple (≥ 2) versions of a text, splitting each version into parts (tokens) to be compared,
  2. identify similarities of and differences between the versions (including moved/transposed segments) by aligning tokens, and
  3. output the alignment results in a variety of formats for further processing, for instance to support the production of a critical apparatus or the stemmatical analysis of a text's genesis.

It resembles software used to compute differences between files (e.g. diff) or tools for sequence alignment which are commonly used in Bioinformatics. While CollateX shares some of the techniques and algorithms with those tools, it mainly aims for a flexible and configurable approach to the problem of finding similarities and differences in texts, sometimes trading computational soundness or complexity for the user's ability to influence results.

As such it is primarily designed for use cases in disciplines like Philology or – more specifically – the field of Textual Criticism where the assessment of findings is based on interpretation and therefore can be supported by computational means but is not necessarily computable.
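
The three steps can be illustrated with a deliberately small sketch. It is not CollateX's algorithm: it uses Python's standard difflib.SequenceMatcher as a stand-in aligner for exactly two witnesses, splits on whitespace only, and does not detect transpositions. The names tokenize, align_pair and print_table are made up for this illustration.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Step 1: split a witness into tokens (whitespace only, for illustration)."""
    return text.split()


def align_pair(tokens_a, tokens_b):
    """Step 2: align two token lists; returns rows of (a_segment, b_segment).

    A segment is a list of tokens; an empty list marks a gap.
    """
    rows = []
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for _tag, a1, a2, b1, b2 in matcher.get_opcodes():
        rows.append((tokens_a[a1:a2], tokens_b[b1:b2]))
    return rows


def print_table(rows):
    """Step 3: output the alignment, one row per aligned segment."""
    for seg_a, seg_b in rows:
        print(f"{' '.join(seg_a) or '-':<20} | {' '.join(seg_b) or '-'}")


rows = align_pair(tokenize("The big gray koala"), tokenize("The gray fuzzy koala"))
print_table(rows)
```

A pairwise diff like this has no notion of normalization, near matching or more than two witnesses, which is where a dedicated collation tool differs.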

Please go to http://collatex.net/ for further information.

collatex's People

Contributors

aklsdm, brambg, brambgit, catsmith, dependabot[bot], djbpitt, gertjanf, gioele, gremid, jorisvanzundert, marcelloperathoner, mhbeals, rhdekker, tgriffitts-vs, tla, tparkola

collatex's Issues

Colours in HTML

It may be beneficial to offer an alternative colour scheme for the HTML2 output, such as blue/yellow, to aid those with red/green colour-blindness.

The Python bindings need to be brought back to life

The Python bindings were donated to the project in 2010. The major API changes in CollateX version 1.3 broke the bindings. The bindings need to be brought up to date.

Requirements:

  • Uberjar. Added a nodeps module to the project. To generate jar run mvn package.
  • JPype. Install JPype. $> python ./setup.py

MWA problem: sometimes a double "eeex" node is created, other times a double "dddd" node is created

from collatex import *
collation = Collation()
collation.add_plain_witness("A", "aaaa bbbb cccc dddd eeee ffff")
collation.add_plain_witness("B", "aaaa bbbb eeex ffff") # Near-match gap
collation.add_plain_witness("C", "aaaa bbbb cccc eeee ffff")
collation.add_plain_witness("D", "aaaa bbbb eeex dddd ffff") # Transposition
# collation.add_plain_witness("E", "aaa aaa aaa aaa aaa")
# table = collate(collation, segmentation=False, near_match=True)
# print(table)
collate(collation, segmentation=False, output="svg", near_match=False)

Choosing an algorithm

Hello,
this may be a silly question. According to the documentation, one can choose between three algorithms: Dekker, Needleman-Wunsch and MEDITE. I do not know how to specify the algorithm in the Python collate() function. Any ideas?

Crash with tokenComparator = levenshtein

Given this JSON input:

{
    "tokenComparator": {
        "type": "levenshtein",
        "distance": 1
    },
    "witnesses": [
        {
            "id": "id1",
            "content": "ad capellas dominicas dantur"
        },
        {
            "id": "id2",
            "content": "ad capellam dominicam dantur"
        }
    ]
}

1.7.1 and 1.8.SNAPSHOT both crash with:

Unexpected error
null

Got this stack trace (1.8.SNAPSHOT):

java.lang.NullPointerException
at eu.interedition.collatex.suffixarray.GenericArrayAdapter.buildSuffixArray(GenericArrayAdapter.java:62)
at eu.interedition.collatex.suffixarray.SuffixArrays.createWithLCP(SuffixArrays.java:108)
at eu.interedition.collatex.dekker.token_index.TokenIndex.prepare(TokenIndex.java:49)
at eu.interedition.collatex.dekker.DekkerAlgorithm.collate(DekkerAlgorithm.java:74)
at eu.interedition.collatex.tools.CollationPipe.start(CollationPipe.java:153)
at eu.interedition.collatex.tools.CollateX.main(CollateX.java:45)

Following is a tentative fix that makes EditDistanceTokenComparator act more like a comparator should. (We still have a transitivity problem with this comparator because A == B and B == C do not imply A == C. Don't know if that matters much though.)

--- a/collatex-core/src/main/java/eu/interedition/collatex/matching/EditDistanceTokenComparator.java
+++ b/collatex-core/src/main/java/eu/interedition/collatex/matching/EditDistanceTokenComparator.java
@@ -40,6 +40,7 @@ public class EditDistanceTokenComparator implements Comparator<Token> {
     public int compare(Token base, Token witness) {
         final String baseContent = ((SimpleToken) base).getNormalized();
         final String witnessContent = ((SimpleToken) witness).getNormalized();
-        return (EditDistance.compute(baseContent, witnessContent) <= threshold) ? 0 : -1;
+        return (EditDistance.compute(baseContent, witnessContent) <= threshold) ?
+                0 : baseContent.compareTo(witnessContent);
     }
 }
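
The transitivity remark can be checked with a small sketch. It reimplements a plain Levenshtein distance in Python rather than calling the project's EditDistance class, and the function names are illustrative. Whether the suffix-array code actually depends on transitivity is not established in this report.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def near_match(a, b, threshold=1):
    """Treat two normalized tokens as equal when within the threshold."""
    return edit_distance(a, b) <= threshold


# "cat" ~ "cot" and "cot" ~ "dot", but "cat" and "dot" are two edits apart.
print(near_match("cat", "cot"), near_match("cot", "dot"), near_match("cat", "dot"))
```

With a threshold of 1 the witness tokens from the input above (capellas/capellam, dominicas/dominicam) count as equal, while equality itself does not chain across three tokens.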

Regards

Most Python unit tests are failing

I tried to run the Python unit tests in collatex/pythonport. Half of the test files have ImportErrors from calling missing or renamed modules, and one file has a failing test. Let's fix this.

Which version of Java is required?

Hi, I have tried a local copy of collatex-tools-1.7.0.jar on my computer.

I got the error "Unsupported major.minor version 52.0".

So what is the minimum required Java version? Should it be added to the website?

Also: I read that you are working on a Python version. Will both the Java and Python versions be maintained, or only the Python one?
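
For context, the number in that message is the class-file format version, and major version 52 corresponds to Java 8. The error therefore indicates that the jar was run on an older JVM than the one it was compiled for. The sketch below shows the mapping and how the version can be read from a class file header; the function names are illustrative, and the project's own documentation remains the authority on the supported Java version.

```python
import struct


def java_release_for_major(major):
    """Map a class-file major version to the Java release that introduced it."""
    if major < 45:
        raise ValueError(f"not a valid class-file major version: {major}")
    if major <= 48:
        return f"1.{major - 44}"   # 45 -> 1.1 ... 48 -> 1.4
    return str(major - 44)         # 49 -> 5, 52 -> 8, ...


def class_file_version(header):
    """Read (major, minor) from the first 8 bytes of a .class file."""
    magic, minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major, minor


print(java_release_for_major(52))
```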

TEI output error (CollateX Python 2.2)

Given:

%reload_ext autoreload
%autoreload 2
from collatex import *
collation = Collation()
collation.add_plain_witness("A","The big gray koala.")
collation.add_plain_witness("B", "The big gray koala.")
collation.add_plain_witness("C","The gray fuzzy koala lives in a tree.")
table = collate(collation)
print(table)

table alignment is correct:

+---+-----+-----+------+-------+-------+-----------------+---+
| A | The | big | gray | -     | koala | -               | . |
| B | The | big | gray | -     | koala | -               | . |
| C | The | -   | gray | fuzzy | koala | lives in a tree | . |
+---+-----+-----+------+-------+-------+-----------------+---+

but TEI alignment doesn’t recognize that all instances of “koala” agree. When we run:

tei = collate(collation, output="tei", indent=True)
print(tei)

we get:

<?xml version="1.0" ?>
<cx:apparatus xmlns="http://www.tei-c.org/ns/1.0" xmlns:cx="http://interedition.eu/collatex/ns/1.0">
	The 
	<app>
		<rdg wit="#A #B">big</rdg>
		<rdg wit="#C"/>
	</app>
	 
	gray 
	<app>
		<rdg wit="#C">fuzzy</rdg>
		<rdg wit="#A #B"/>
	</app>
	 
	<app>
		<rdg wit="#A #B">koala</rdg>
		<rdg wit="#C">koala</rdg>
	</app>
	 
	<app>
		<rdg wit="#C">lives in a tree</rdg>
		<rdg wit="#A #B"/>
	</app>
	.
</cx:apparatus>

The “koala” readings all agree, and therefore should be output as plain text, and not inside a <rdg>.

Furthermore, there should not be two <rdg> children of the same <app> that have the same textual content. If we add another witness to remove the exact equality:

%reload_ext autoreload
%autoreload 2
from collatex import *
collation = Collation()
collation.add_plain_witness("A","The big gray koala.")
collation.add_plain_witness("B", "The big gray koala.")
collation.add_plain_witness("D", "The big gray wombat.")
collation.add_plain_witness("C","The gray fuzzy koala lives in a tree.")
table = collate(collation,segmentation=False, near_match=True)
print(table)

The table output is again correct:

+---+-----+-----+------+-------+--------+-------+----+---+------+---+
| A | The | big | gray | -     | koala  | -     | -  | - | -    | . |
| B | The | big | gray | -     | koala  | -     | -  | - | -    | . |
| D | The | big | gray | -     | wombat | -     | -  | - | -    | . |
| C | The | -   | gray | fuzzy | koala  | lives | in | a | tree | . |
+---+-----+-----+------+-------+--------+-------+----+---+------+---+

but the TEI output incorrectly puts the koalas in different <rdg> elements, so that:

tei = collate(collation, output="tei", indent=True, segmentation=False, near_match=True)
print(tei)

outputs:

<?xml version="1.0" ?>
<cx:apparatus xmlns="http://www.tei-c.org/ns/1.0" xmlns:cx="http://interedition.eu/collatex/ns/1.0">
	The 
	<app>
		<rdg wit="#A #B #D">big</rdg>
		<rdg wit="#C"/>
	</app>
	 
	gray 
	<app>
		<rdg wit="#C">fuzzy</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	 
	<app>
		<rdg wit="#A #B">koala</rdg>
		<rdg wit="#C">koala</rdg>
		<rdg wit="#D">wombat</rdg>
	</app>
	 
	<app>
		<rdg wit="#C">lives</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	 
	<app>
		<rdg wit="#C">in</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	 
	<app>
		<rdg wit="#C">a</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	 
	<app>
		<rdg wit="#C">tree</rdg>
		<rdg wit="#A #B #D"/>
	</app>
	.
</cx:apparatus>

These may be consequences of a single problem, the failure to recognize that the koalas belong together.
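
The expected behaviour can be sketched as follows: group the witnesses of one alignment column by their reading, emit plain text when all witnesses agree, and otherwise emit one <rdg> per distinct reading. This is an illustration of the desired output, not the CollateX implementation; the helper name column_to_tei is hypothetical, and it compares the exact reading text.

```python
from xml.sax.saxutils import escape


def column_to_tei(readings):
    """Render one alignment column as TEI.

    `readings` maps witness sigla to the text in this column (None for a gap).
    """
    groups = {}  # reading text -> sigla that share it, in input order
    for siglum, text in readings.items():
        groups.setdefault(text, []).append(siglum)
    if len(groups) == 1 and None not in groups:
        # Every witness agrees: emit plain text, no <app>.
        return escape(next(iter(groups)))
    parts = []
    for text, sigla in groups.items():
        wit = " ".join("#" + s for s in sigla)
        if text is None:
            parts.append(f'<rdg wit="{wit}"/>')
        else:
            parts.append(f'<rdg wit="{wit}">{escape(text)}</rdg>')
    return "<app>" + "".join(parts) + "</app>"


print(column_to_tei({"A": "koala", "B": "koala", "D": "wombat", "C": "koala"}))
```

For the four-witness example this yields a single <rdg> for A, B and C and a second one for D.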

CSV output contains "melded" tokens

The tokenized JSON input at the end of this report will generate the following incorrect CSV:

BK_Text_Superstruktur-s1,cava-dei-tirreni-bdb-4-s1
Siquis,Siquis
autexleui,qualibet
causaautsinecausahomineminecclesia,causaautsinecausahominemInecclesia
interfecerit,Interficereuoluerit
de,de

As you can see, many tokens have been "melded" into very long strings. This problem does not appear when the JSON output format is chosen.

This is the source JSON:

{
    "witnesses" : [
        {
            "id" : "BK_Text_Superstruktur-s1",
            "tokens" : [
                { "t" : "Si" },
                { "t" : "quis" },
                { "t" : "aut" },
                { "t" : "ex" },
                { "t" : "leui" },
                { "t" : "causa" },
                { "t" : "aut" },
                { "t" : "sine" },
                { "t" : "causa" },
                { "t" : "hominem" },
                { "t" : "in" },
                { "t" : "ecclesia" },
                { "t" : "interfecerit" },
                { "t" : "de" }
            ]
        },
        {
            "id" : "cava-dei-tirreni-bdb-4-s1",
            "tokens" : [
                { "t" : "Si" },
                { "t" : "quis" },
                { "t" : "qualibet" },
                { "t" : "causa" },
                { "t" : "aut" },
                { "t" : "sine" },
                { "t" : "causa" },
                { "t" : "hominem" },
                { "t" : "In" },
                { "t" : "ecclesia" },
                { "t" : "Interficere" },
                { "t" : "uoluerit" },
                { "t" : "de" }
            ]
        }
    ]
}

Input encoding of known gaps

I want to use CollateX on fairly fragmentary text witnesses. In some cases I know from the given data that a gap exists, and sometimes even how long it is.
Is there a way to inform CollateX about such gaps? At the moment I am using JSON input with tokens and their normalized forms.

Error when segmentation is used in the example code

I am getting the following error in both the PyPI and GitHub versions of the Python port when trying to run the example collation code. I am using Python 3 and have not checked this in Python 2. If segmentation is set to False in the collate() call, the collation completes as expected. I do not need the built-in segmentation, so this does not cause a problem for me, but I thought it should be reported because it breaks the example code.

Error message below:

/srv/itsee/django_project/collation/collatex/collatex-pythonport/collatex/core_functions.py in collate(collation, output, layout, segmentation, near_match, astar, detect_transpositions, debug_scores, properties_filter, indent, scheduler)
71 # join parallel segments
72 if segmentation:
---> 73 join(graph)
74 ranking = VariantGraphRanking.of(graph)
75 # check which output format is requested: graph or table

/srv/itsee/django_project/collation/collatex/collatex-pythonport/collatex/core_classes.py in join(graph)
321 out_edges = graph.out_edges(vertex)
322 if len(out_edges) is 1:
--> 323 (_, join_candidate) = out_edges[0]
324 can_join = join_candidate != end and len(graph.in_edges(join_candidate)) == 1
325 if can_join:

TypeError: 'OutEdgeDataView' object does not support indexing
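
The message suggests that out_edges() returned a view object rather than a list; other reports in this tracker connect this to NetworkX 2.0. One version-tolerant way to take the first edge is to iterate instead of indexing. The sketch below uses a dict keys view as a stand-in for the edge view so that it runs without NetworkX; the helper name first_out_edge is hypothetical, and this is not necessarily how the project resolved the issue.

```python
def first_out_edge(out_edges):
    """Return the first (source, target) pair from a list or an edge view."""
    return next(iter(out_edges))


# A dict keys view stands in for the NetworkX edge view here: it can be
# iterated and measured with len(), but indexing it raises TypeError.
edges_view = {("v1", "v2"): None}.keys()

if len(edges_view) == 1:                      # `==`, not `is`, for integers
    _, join_candidate = first_out_edge(edges_view)
    print(join_candidate)
```

Wrapping the view in list() before indexing would work as well, at the cost of copying the edges.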

Alignment error

Given the following in a Jupyter Notebook, using CollateX Python 2.2:

%reload_ext autoreload
%autoreload 2
from collatex import *
collation1 = Collation()
collation1.add_plain_witness("A", "The gray koala.")
collation1.add_plain_witness("B", "The gray koala.")
collation1.add_plain_witness("C", "The koala lives in a tree.")
table1 = collate(collation1, segmentation=False)
print(table1)
collation2 = Collation()
collation2.add_plain_witness("A", "The gray koala.")
collation2.add_plain_witness("B", "The big gray koala.")
collation2.add_plain_witness("C", "The koala lives in a tree.")
table2 = collate(collation2, segmentation=False)
print(table2)
collation3 = Collation()
collation3.add_plain_witness("C", "The koala lives in a tree.")
collation3.add_plain_witness("A", "The gray koala.")
collation3.add_plain_witness("B", "The big gray koala.")
table3 = collate(collation3, segmentation=False)
print(table3)

The output is:

+---+-----+------+-------+-------+----+---+------+---+
| A | The | gray | koala | -     | -  | - | -    | . |
| B | The | gray | koala | -     | -  | - | -    | . |
| C | The | -    | koala | lives | in | a | tree | . |
+---+-----+------+-------+-------+----+---+------+---+
+---+-----+-------+-------+-------+---+------+---+
| A | The | -     | gray  | koala | - | -    | . |
| B | The | big   | gray  | koala | - | -    | . |
| C | The | koala | lives | in    | a | tree | . |
+---+-----+-------+-------+-------+---+------+---+
+---+-----+-----+------+-------+-------+----+---+------+---+
| C | The | -   | -    | koala | lives | in | a | tree | . |
| A | The | -   | gray | koala | -     | -  | - | -    | . |
| B | The | big | gray | koala | -     | -  | - | -    | . |
+---+-----+-----+------+-------+-------+----+---+------+---+

The first and third collations are correct; the second is incorrect. The second has the same witnesses as the third, but they are added in a different order.

Empty <B> element in SVG output raises warning (error?)

In display_module.py, the following line:

readings = ["<TR><TD ALIGN='LEFT'><B>" + n.label + "</B></TD><TD ALIGN='LEFT'>exact: " + str(rank) + "</TD></TR>"]

raises a warning (error?) when n.label is null.

Possible fix: Test n.label and replace with &#xa0; (non-breaking space) if null.
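
A minimal sketch of that fix, written as a standalone helper rather than as a patch to display_module.py. The function name reading_row and its parameters are hypothetical; only the row markup is taken from the line quoted above.

```python
NBSP = "&#xa0;"  # non-breaking space, as suggested above


def reading_row(label, rank):
    """Build one table row, substituting a non-breaking space for an empty label."""
    shown = label if label else NBSP
    return ("<TR><TD ALIGN='LEFT'><B>" + shown + "</B></TD>"
            "<TD ALIGN='LEFT'>exact: " + str(rank) + "</TD></TR>")


print(reading_row(None, 0))
```

The check treats both None and the empty string as missing, so the <B> element is never empty.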

Pythonport: claimed python3 compatibility

collatex-pythonport/setup.py (and therefore also PyPI) claims the package is Python 3.x compatible, but it is not.

While I'd be inclined to create a Python 3 compatible fork that you could pull, I'm not sure I'm willing or able to create a version that is compatible with both Python 2 and Python 3.

[Adding info on setup]:

Dual python setup with python2.7 and python3.4. On a recent Lubuntu. (uname -a output: Linux luby-VB 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:45 UTC 2014 i686 i686 i686 GNU/Linux)

I installed networkx, prettytable and collatex, in this order, from the manually downloaded tarballs with sudo pip3 [...] (to install into the python3 dist-packages directory).

When trying to import collatex, errors typical of Python 2-only code are thrown, one after another as each is fixed locally.

Release 2.1.3rc1 installs networkx 2.0; needs to be 1.11

networkx 2.0 changes the API in a way that breaks collatex. In setup.py, the latest collatex code base correctly specifies networkx version 1.11:

install_requires=['networkx==1.11','prettytable']

but 2.1.3rc1 in pypi installs version 2.0 of networkx.

Fix: This will fix itself on next release.
Workaround:

pip uninstall networkx
pip install -Iv networkx==1.11

Alignment result is dependent on witness order.

Case supplied by Hayco de Jong. (Hermans project)

Problem: Two nodes are added to the variant graph for ! which is suboptimal.

W1: a b c d F g h i ! K ! q r s t
W2: a b c d F g h i ! q r s t

Longest sequence is now: a b c d F g h i !

W3: a b c d E g h i ! q r s t

Longest sequence is now: ! q r s t

This case can be solved by using a non progressive multiple witness aligner.
I am working on a prototype of such an implementation.

GraphML output in CollateX Python

@tla @rhdekker I'm looking at adding GraphML output to CollateX Python (it's already in CollateX Java), and I'm not confident about the target output format. Specifically, the Java GraphML output for nodes contains three fields, one for the node id, one for the rank, and one that is a concatenation of the t properties of the tokens on that node. For example, using the first example at https://collatex.net/demo/, the Dekker alignment algorithm, and with Segmentation and Transposition both checked, the first non-start node in the GraphML output is:

        <node id="n1">
            <data key="d0">1</data>
            <data key="d2">1</data>
            <data key="d1">This morning </data>
        </node>

This means node 1, rank 1, and the concatenated t value of the tokens is “This morning ”.

This output seems to have two limitations:

  1. It does not persist the n values, which cannot be recreated without knowing how normalization was performed. It is possible to add this as a separate property on (that is, a <data> child of) the node. It also does not persist other properties that the user might have added to the token during normalization.
  2. It does not persist the tokenization, which cannot be recreated without knowing how it was performed originally.

The second of these limitations goes away if Segmentation is turned off, so that no node can contain more than one token, but that then restricts the types of CollateX variant graphs that can be exported as GraphML.

It is possible to support complex objects in GraphML by customizing the schema (http://graphml.graphdrawing.org/primer/graphml-primer.html#Complex). In that case, even with Segmentation enabled, each pair of t and n values (and other properties that the user might have added to the token during normalization) could be represented by a complex type. It isn't clear to me, though, whether that is the best strategy, especially because it was not adopted for the CollateX Java output.

Might either of you be able to provide some guidance about the requirements and expectations?
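
To make the simpler option concrete, here is a sketch of GraphML with Segmentation turned off, where each node carries one token and its t and n values are persisted as separate <data> children. The key ids and attribute names are illustrative and do not reproduce the Java output, and the function graphml_for_tokens is hypothetical.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def graphml_for_tokens(nodes):
    """Build a GraphML document with one node per token.

    `nodes` is a list of dicts with "id", "rank", "t" and "n" entries.
    """
    ET.register_namespace("", GRAPHML_NS)
    root = ET.Element(f"{{{GRAPHML_NS}}}graphml")
    # Declare one GraphML key per persisted property (ids are illustrative).
    keys = {"rank": ("d0", "int"), "t": ("d1", "string"), "n": ("d2", "string")}
    for name, (key_id, attr_type) in keys.items():
        ET.SubElement(root, f"{{{GRAPHML_NS}}}key", {
            "id": key_id, "for": "node",
            "attr.name": name, "attr.type": attr_type,
        })
    graph = ET.SubElement(root, f"{{{GRAPHML_NS}}}graph",
                          {"id": "g0", "edgedefault": "directed"})
    for node in nodes:
        element = ET.SubElement(graph, f"{{{GRAPHML_NS}}}node", {"id": node["id"]})
        for name, (key_id, _attr_type) in keys.items():
            data = ET.SubElement(element, f"{{{GRAPHML_NS}}}data", {"key": key_id})
            data.text = str(node[name])
    return ET.tostring(root, encoding="unicode")


print(graphml_for_tokens([{"id": "n1", "rank": 1, "t": "This ", "n": "this"}]))
```

Edges and user-defined token properties are left out; the latter would need either further keys or the complex-type approach discussed above.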

CollateX crashes with «Unexpected error null»

CollateX will crash with the following error

Unexpected error
null

when run with the following JSON input file:

{
    "witnesses" : [
        {
            "id" : "base",
            "tokens" : [
                { "t" : "id " },
                { "t" : "est " },
                { "t" : "sexaginta ", "n": "[num]" },
                { "t" : "solidos " }
            ]
        },
        {
            "id" : "w1",
            "tokens" : [
                { "t" : "solidos " },
                { "t" : "triplo " }
            ]
        },
        {
            "id" : "w2",
            "tokens" : [
                { "t" : "nostrum " },
                { "t" : "cogatur " },
                { "t" : "id " },
                { "t" : "xl ", "n": "[num]" },
                { "t" : "solidos " },
                { "t" : "in " }
            ]
        },
        {
            "id" : "w3",
            "tokens" : [
                { "t" : "nostrum " },
                { "t" : "in " },
                { "t" : "triplo " },
                { "t" : "conpon " },
                { "t" : ". " },
                { "t" : ". " },
                { "t" : "id " },
                { "t" : "lx ", "n": "[num]" }
            ]
        },
        {
            "id" : "w4",
            "tokens" : [
                { "t" : "est " },
                { "t" : ". " },
                { "t" : ". " },
                { "t" : ". " },
                { "t" : "solidos " },
                { "t" : "in " }
            ]
        }
    ]
}

The error started appearing after we normalized the token sexaginta to [num].

Please note that this JSON file is a reduced minimal test case, the original was much longer and very different.

CollateX refuses JSON input

Not sure if this repo is being maintained.
Possibly a version of #44.
JSON of tokenized witnesses in order A (working.txt) works; in order B (nonworking.txt) CollateX returns an error.
Hand-editing nonworking.txt so that the witnesses and array of tokens are in the same order returns an alignment.
I am sending the data to CollateX via REST.

nonworking.txt
working.txt

Load CLI result in Python?

Thanks again for a great tool. Is there an easy way to load the result of an alignment created with the command line tool into a Python Collation object? (I want to use the Java tool for its speed, but keep the flexible visualization and postprocessing capabilities that I have in a Jupyter notebook.)

Reading txt in the command line interface

Dear Collatex creators,
thank you so much for making your tool available! It sounds super useful. Sadly, I haven't been able to run it. It would be lovely if you could show me an example of how to read a .txt file using the Command Line Interface.
Say I have output-adobe.txt + output-tesseract.txt + original.txt and want to compare them.
I open collatex like:
C:\Users\xxxx\Desktop> java -jar collatex-tools-1.7.1.jar
and then?
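
Two observations from other reports in this tracker may help here. One invokes the tool as `collatex -t -f graphml first.txt second.txt third.txt > test.xml`, which suggests that plain-text files can be passed directly as arguments; whether the same options apply to collatex-tools-1.7.1.jar is an assumption. Another supplies witnesses as JSON, with an id and a content string per witness. Assuming that JSON format, the sketch below builds such an input document from text files. The helper name build_collatex_input is hypothetical, and how the resulting file is handed to the jar should be checked against the tool's own help output.

```python
import json
from pathlib import Path


def build_collatex_input(paths):
    """Return a JSON document with one plain-text witness per file.

    The witness id is the file name without its extension.
    """
    witnesses = [
        {"id": Path(p).stem, "content": Path(p).read_text(encoding="utf-8")}
        for p in paths
    ]
    return json.dumps({"witnesses": witnesses}, ensure_ascii=False, indent=2)


# Example with the file names from the question above:
# Path("input.json").write_text(
#     build_collatex_input(["output-adobe.txt", "output-tesseract.txt", "original.txt"]),
#     encoding="utf-8")
```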

Documentation bug in tokenized input example

This example from the documentation:

{
  "witnesses" : [
    {
      "id" : "A",
      "tokens" : [
          { "t" : "A", "ref" : 123 },
          { "t" : "black" , "adj" : true },
          { "t" : "cat", "id" : "xyz" }
      ]
    },
    {
      "id" : "B",
      "tokens" : [
          { "t" : "A" },
          { "t" : "white" , "adj" : true },
          { "t" : "kitten.", "n" : "cat" }
      ]
    }
  ]
}

is misleading because the tokens "t" should include trailing whitespace when appropriate. If you use the built-in tokenizer instead, the tokens include whitespace by default. Also, the normalized "n" should be shown to exclude whitespace so as not to fool the token comparators.

Why is this important? Because if you omit whitespace the segment joining phase will run tokens together like this:

digraph G {
  v0 [label = ""];
  v1 [label = "Ablackkitten."];
  v2 [label = ""];
  v0 -> v1 [label = "A, B"];
  v1 -> v2 [label = "A, B"];
  v0 -> v2 [color =  "white"];
}

N.B. This output was generated from the slightly modified (to exercise the segment joiner) input:

{
  "witnesses" : [
    {
      "id" : "A",
      "tokens" : [
          { "t" : "A", "ref" : 123 },
          { "t" : "black" , "adj" : true },
          { "t" : "cat", "id" : "xyz" }
      ]
    },
    {
      "id" : "B",
      "tokens" : [
          { "t" : "A" },
          { "t" : "black" , "adj" : true },
          { "t" : "kitten.", "n" : "cat" }
      ]
    }
  ]
}
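
Until the documentation is corrected, pre-tokenized input can be prepared along the lines described above: keep trailing whitespace in "t" and keep "n" free of whitespace. The sketch below is one way to do that; the helper name to_collatex_tokens is hypothetical, and setting "n" explicitly replaces whatever default normalization would otherwise apply.

```python
def to_collatex_tokens(words):
    """Turn a list of words into token objects with "t" and "n" properties.

    Every token except the last gets one trailing space in "t"; "n" carries
    the word without surrounding whitespace.
    """
    tokens = []
    for index, word in enumerate(words):
        is_last = index == len(words) - 1
        tokens.append({"t": word if is_last else word + " ", "n": word.strip()})
    return tokens


witness = {"id": "A", "tokens": to_collatex_tokens(["A", "black", "cat"])}
print("".join(token["t"] for token in witness["tokens"]))
```

Joining the "t" values of such tokens reproduces readable text instead of a run-together string.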

Error in documentation on python develop env

Contributing.rst now says:

$ mkvirtualenv collatex
$ cd collatex/
$ python setup.py develop

But I think it should be:

$ mkvirtualenv collatex
$ cd collatex/collatex-pythonport
$ python setup.py develop

Which means "4. Create a branch for local development:" should read "cd .." in line 1
(because one needs to check out the project in the main collatex directory) and "5." should start with "cd collatex-pythonport" again.

Crash when trying to collate pretokenized witnesses with space or punctuation

If you have a witness that looks like this:

                {
                    "id": "B",
                    "tokens": [
                        {"t": "A"},
                        {"t": "white", "adj": True},
                        {"t": "mousedog bird", "adj": False}
                    ]
                }

with a token that has either a space or a punctuation mark, then collate_pretokenized_json crashes like this:

Error
Traceback (most recent call last):
  File "/Users/tla/Projects/collatex/collatex-pythonport/tests/test_witness_tokens.py", line 42, in testPretokenizedWitness
    result = collate_pretokenized_json(pretokenized_witness)
  File "/Users/tla/Projects/collatex/collatex-pythonport/collatex/core_functions.py", line 69, in collate_pretokenized_json
    new_row.cells.append(tokenized_witness[token_counter])
IndexError: list index out of range

This is because the pretokenized witness gets concatenated together and then re-tokenized on whitespace and punctuation. It shouldn't be doing that in the first place.
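
The mismatch can be reproduced in isolation. The sketch below joins the supplied tokens and splits them again on whitespace and punctuation; the regular expression is an assumption for illustration, not the tokenizer used in core_functions.py.

```python
import re


def retokenize(tokens):
    """Join the "t" values and split them again on whitespace and punctuation."""
    joined = " ".join(token["t"] for token in tokens)
    return re.findall(r"\w+|[^\w\s]+", joined)


supplied = [{"t": "A"}, {"t": "white"}, {"t": "mousedog bird"}]
print(len(supplied), len(retokenize(supplied)))
```

Three supplied tokens become four after re-tokenizing, so any code that indexes one list with positions from the other can run past the end.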

<rdg> value in TEI output (Java) is "n", and should be "t"

The value of the <rdg> element is the n property, rather than the t. This may look correct in situations where the normalization is limited to stripping trailing whitespace (the default), especially with segmentation enabled, but even case-folding produces output that is visibly inconsistent with normal practice in critical editing, and anything more complex (e.g., thesaurus, Soundex) may produce output that is illegible.

Bad UTF-8 content in GraphML output on plain text

In constructing a simple test case for other purposes, I ran across a bug in the GraphML output of CollateX, which should be reproducible as follows:

Taras-iMac:MatthewEdessa tla$ cat first.txt 
The quick brown fox jumped over the lazy dogs.
Taras-iMac:MatthewEdessa tla$ cat second.txt 
the quick brown fox jumped over the lazy sleeping dog.
Taras-iMac:MatthewEdessa tla$ cat third.txt 
The quick brown fox jumped over the sleeping cat.
Taras-iMac:MatthewEdessa tla$ file *.txt
first.txt:  ASCII text
second.txt: ASCII text
third.txt:  ASCII text
Taras-iMac:MatthewEdessa tla$ ~/bin/collatex -t -f graphml first.txt second.txt third.txt > test.xml

Here is a zip file of the output - you can see an enormous blob of bad data in the middle of node 14.
test.xml.zip

Regex to specify lower-priority collation tokens

It often happens in automated collation that very common / frequent tokens, e.g. punctuation or words like 'and' or 'the', get matched a little too eagerly by the algorithm so that more substantive tokens are misaligned. Moreover, the set of tokens that cause this problem will vary according to language / text type / etc.

At the moment I am dealing with this by assigning random strings of characters in the n field of the JSON object for these tokens, so that CollateX won't match them with anything else. This works, but leads to a bunch of duplicated tokens in the output, which I deal with using a graph search algorithm.

Since what I am doing in post-processing looks and smells an awful lot like collation, it seems like something CollateX should be able to handle internally - match the 'substantive' tokens on a first pass, and the non-substantive ones on a second pass, relative to the alignment that has already been done. The easiest way of specifying these 'unimportant' tokens might be a regular expression, since (as mentioned) they will vary from text to text.
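
For reference, the workaround described above (not the requested feature) can be sketched like this: every token whose text matches a user-supplied regular expression receives a unique n value, so it cannot match any other token. The helper name mask_low_priority, the "__masked_" prefix and the example pattern are all illustrative.

```python
import re
import uuid


def mask_low_priority(tokens, pattern):
    """Give every token whose "t" matches `pattern` a unique "n" value.

    Returns new token dicts; the originals are left untouched.
    """
    regex = re.compile(pattern)
    masked = []
    for token in tokens:
        copy = dict(token)
        if regex.fullmatch(copy["t"].strip()):
            copy["n"] = "__masked_" + uuid.uuid4().hex
        masked.append(copy)
    return masked


tokens = [{"t": "the "}, {"t": "koala "}, {"t": ", "}, {"t": "the "}]
masked = mask_low_priority(tokens, r"(?i)the|and|[.,;:!?]")
```

The post-processing step that re-merges the masked tokens is the part the request proposes moving into CollateX itself.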

TypeError OutEdgeDataView

Hi,
When calling the collate() function I get an error message:
File "/Users/ellibleeker/anaconda3/lib/python3.6/site-packages/collatex/core_classes.py", line 329, in join
(_, join_candidate) = out_edges[0]
TypeError: 'OutEdgeDataView' object does not support indexing

Working with latest version (2.0) of NetworkX; this might be the problem?

Custom matching function

Mike Kestemont asks whether there is a custom matching function that allows using the cosine of two vectors of numbers. It would be even better if the matching were not just true or false and the alignment were scored globally. Both should be possible. He calls the Java library from Python.
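
As a sketch of what such a matching function might compute, independent of any CollateX API: the cosine of two numeric vectors as a graded score, with a thresholded boolean derived from it. The function names, the default threshold of 0.9 and the zero-vector convention are assumptions made for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def tokens_match(u, v, threshold=0.9):
    """Boolean match derived from the graded score."""
    return cosine_similarity(u, v) >= threshold


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A globally scored alignment would consume the graded value directly instead of the boolean.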

Misalignment

In CollateX Python 2.1.3rc2, the input:

from collatex import *
collation = Collation()
collation.add_plain_witness("A", "The big, gray, fuzzy koala.")
collation.add_plain_witness("B","The big, old, gray koala:")
collation.add_plain_witness("C","The big, gray, fuzzy wombat.")
table = collate(collation, segmentation=False, near_match=True)
print(table)

produces (with or without near matching):

+---+-----+-----+---+------+---+-------+--------+---+
| A | The | big | , | gray | , | fuzzy | koala  | . |
| B | The | big | , | old  | , | gray  | koala  | : |
| C | The | big | , | gray | , | fuzzy | wombat | . |
+---+-----+-----+---+------+---+-------+--------+---+

This fails to align “gray”, which matches exactly in all witnesses. The desired alignment is:

+---+-----+-----+---+------+---+------+---+-------+--------+---+
| A | The | big | , |      |   | gray | , | fuzzy | koala  | . |
| B | The | big | , | old  | , | gray |   |       | koala  | : |
| C | The | big | , |      |   | gray | , | fuzzy | wombat | . |
+---+-----+-----+---+------+---+------+---+-------+--------+---+

This appears to be a transposition situation, where CollateX aligns the commas in preference to the words. A philologist would prioritize aligning the words.

Mysterious error on collation attempt

Hi - I have run into an error when trying to collate the attached JSON input, which doesn't give much idea of the problem. If I remove the first witness (Bz449) then it works, but I can't see anything obviously wrong in the Bz449 input. Any hints would be very welcome.

Taras-iMac:collatex tla$ collatex -t --format json milestone-455.json > /dev/null
Unexpected error
null
Taras-iMac:collatex tla$

milestone-455.json.gz
