Tokenizer: A tokenizer for Icelandic text

Overview

Tokenization is a necessary first step in many natural language processing tasks, such as word counting, parsing, spell checking, corpus generation, and statistical analysis of text.

Tokenizer is a compact pure-Python (>= 3.8) executable program and module for tokenizing Icelandic text. It converts input text to streams of tokens, where each token is a separate word, punctuation sign, number/amount, date, e-mail, URL/URI, etc. It also segments the token stream into sentences, considering corner cases such as abbreviations and dates in the middle of sentences.

The package contains a dictionary of common Icelandic abbreviations, in the file src/tokenizer/Abbrev.conf.

Tokenizer is an independent spinoff from the Greynir project (GitHub repository here), by the same authors. The Greynir natural language parser for Icelandic uses Tokenizer on its input.

Note that Tokenizer is licensed under the MIT license while GreynirEngine is licensed under GPLv3.

Deep vs. shallow tokenization

Tokenizer can do both deep and shallow tokenization.

Shallow tokenization simply returns each sentence as a string (or as a line of text in an output file), where the individual tokens are separated by spaces.

Deep tokenization returns token objects that have been annotated with the token type and further information extracted from the token, for example a (year, month, day) tuple in the case of date tokens.

In shallow tokenization, tokens are in most cases kept intact, although consecutive white space is always coalesced. The input strings "800 MW", "21. janúar" and "800 7000" thus become two tokens each, output with a single space between them.

In deep tokenization, the same strings are represented by single token objects, of type TOK.MEASUREMENT, TOK.DATEREL and TOK.TELNO, respectively. The text associated with a single token object may contain spaces, although consecutive whitespace is always coalesced into a single space " ".

By default, the command line tool performs shallow tokenization. If you want deep tokenization with the command line tool, use the --json or --csv switches.

From Python code, call split_into_sentences() for shallow tokenization, or tokenize() for deep tokenization. These functions are documented with examples below.
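
As a quick preview, here is a minimal sketch contrasting the two modes. It assumes the package has been installed as described below; the outputs indicated in the comments are what the descriptions above lead one to expect.

import tokenizer

s = "Aflið var 800 MW."

# Shallow: one sentence string, with tokens separated by single spaces
for sentence in tokenizer.split_into_sentences(s):
    print(sentence)            # expected: 'Aflið var 800 MW .'

# Deep: "800 MW" becomes a single TOK.MEASUREMENT token object
for token in tokenizer.tokenize(s):
    print(tokenizer.TOK.descr[token.kind], repr(token.txt))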

Installation

To install:

$ pip install tokenizer

Command line tool

After installation, the tokenizer can be invoked directly from the command line:

$ tokenize input.txt output.txt

Input and output files are in UTF-8 encoding. If the files are not given explicitly, stdin and stdout are used for input and output, respectively.

Empty lines in the input are treated as hard sentence boundaries.

By default, the output consists of one sentence per line, where each line ends with a single newline character (ASCII LF, chr(10), "\n"). Within each line, tokens are separated by spaces.

The following (mutually exclusive) options can be specified on the command line:

--csv
Deep tokenization. Output token objects in CSV format, one per line. Sentences are separated by lines containing 0,"",""
--json
Deep tokenization. Output token objects in JSON format, one per line.

Other options can be specified on the command line:

-n
--normalize
Normalize punctuation, causing e.g. quotes to be output in Icelandic form and hyphens to be regularized. This option is only applicable to shallow tokenization.
-s
--one_sent_per_line
Input contains strictly one sentence per line, i.e. every newline is a sentence boundary.
-o
--original
Output original token text, i.e. bypass shallow tokenization. This effectively runs the tokenizer as a sentence splitter only.
-m
--convert_measurements
Normalize the degree sign in temperature measurements (200° C -> 200 °C).
-p
--coalesce_percent
Coalesce a number and a following percentage word form (prósent, prósentustig, hundraðshlutar) into a single token.
-g
--keep_composite_glyphs
Do not replace composite glyphs (vowels followed by Unicode COMBINING codes) with their accented/umlaut counterparts.
-e
--replace_html_escapes
Replace HTML escape codes with the characters they represent, e.g. &aacute; -> á.
-c
--convert_numbers
Convert English-style decimal points and thousands separators in numbers to Icelandic style.
-k N
--handle_kludgy_ordinals N
Set the handling of 'kludgy' ordinals (such as 1sti): 0 returns the original mixed word form, 1 returns pure word forms, and 2 returns pure numbers.

Type tokenize -h or tokenize --help to get a short help message.

Example

$ echo "3.janúar sl. keypti   ég 64kWst rafbíl. Hann kostaði € 30.000." | tokenize
3. janúar sl. keypti ég 64kWst rafbíl .
Hann kostaði €30.000 .

$ echo "3.janúar sl. keypti   ég 64kWst rafbíl. Hann kostaði € 30.000." | tokenize --csv
19,"3. janúar","0|1|3"
6,"sl.","síðastliðinn"
6,"keypti",""
6,"ég",""
22,"64kWst","J|230400000.0"
6,"rafbíl",""
1,".","."
0,"",""
6,"Hann",""
6,"kostaði",""
13,"€30.000","30000|EUR"
1,".","."
0,"",""

$ echo "3.janúar sl. keypti   ég 64kWst rafbíl. Hann kostaði € 30.000." | tokenize --json
{"k":"BEGIN SENT"}
{"k":"DATEREL","t":"3. janúar","v":[0,1,3]}
{"k":"WORD","t":"sl.","v":["síðastliðinn"]}
{"k":"WORD","t":"keypti"}
{"k":"WORD","t":"ég"}
{"k":"MEASUREMENT","t":"64kWst","v":["J",230400000.0]}
{"k":"WORD","t":"rafbíl"}
{"k":"PUNCTUATION","t":".","v":"."}
{"k":"END SENT"}
{"k":"BEGIN SENT"}
{"k":"WORD","t":"Hann"}
{"k":"WORD","t":"kostaði"}
{"k":"AMOUNT","t":"€30.000","v":[30000,"EUR"]}
{"k":"PUNCTUATION","t":".","v":"."}
{"k":"END SENT"}

Python module

Shallow tokenization example

An example of shallow tokenization from Python code goes something like this:

from tokenizer import split_into_sentences

# A string to be tokenized, containing two sentences
s = "3.janúar sl. keypti   ég 64kWst rafbíl. Hann kostaði € 30.000."

# Obtain a generator of sentence strings
g = split_into_sentences(s)

# Loop through the sentences
for sentence in g:

    # Obtain the individual token strings
    tokens = sentence.split()

    # Print the tokens, comma-separated
    print("|".join(tokens))

The program outputs:

3.|janúar|sl.|keypti|ég|64kWst|rafbíl|.
Hann|kostaði|€30.000|.

Deep tokenization example

To do deep tokenization from within Python code:

from tokenizer import tokenize, TOK

text = ("Málinu var vísað til stjórnskipunar- og eftirlitsnefndar "
    "skv. 3. gr. XVII. kafla laga nr. 10/2007 þann 3. janúar 2010.")

for token in tokenize(text):

    print("{0}: '{1}' {2}".format(
        TOK.descr[token.kind],
        token.txt or "-",
        token.val or ""))

Output:

BEGIN SENT: '-' (0, None)
WORD: 'Málinu'
WORD: 'var'
WORD: 'vísað'
WORD: 'til'
WORD: 'stjórnskipunar- og eftirlitsnefndar'
WORD: 'skv.' [('samkvæmt', 0, 'fs', 'skst', 'skv.', '-')]
ORDINAL: '3.' 3
WORD: 'gr.' [('grein', 0, 'kvk', 'skst', 'gr.', '-')]
ORDINAL: 'XVII.' 17
WORD: 'kafla'
WORD: 'laga'
WORD: 'nr.' [('númer', 0, 'hk', 'skst', 'nr.', '-')]
NUMBER: '10' (10, None, None)
PUNCTUATION: '/' (4, '/')
YEAR: '2007' 2007
WORD: 'þann'
DATEABS: '3. janúar 2010' (2010, 1, 3)
PUNCTUATION: '.' (3, '.')
END SENT: '-'

Note the following:

  • Sentences are delimited by TOK.S_BEGIN and TOK.S_END tokens.
  • Composite words, such as stjórnskipunar- og eftirlitsnefndar, are coalesced into one token.
  • Well-known abbreviations are recognized and their full expansion is available in the token.val field.
  • Ordinal numbers (3., XVII.) are recognized and their value (3, 17) is available in the token.val field.
  • Dates, years and times, both absolute and relative, are recognized and the respective year, month, day, hour, minute and second values are included as a tuple in token.val.
  • Numbers, both integer and real, are recognized and their value is available in the token.val field.
  • Further details of how Tokenizer processes text can be inferred from the test module in the project's GitHub repository.

The tokenize() function

To deep-tokenize a text string, call tokenizer.tokenize(text_or_gen, **options). The text_or_gen parameter can be a string, or an iterable that yields strings (such as a text file object).

The function returns a Python generator of token objects. Each token object is a simple namedtuple with three fields: (kind, txt, val) (further documented below).

The tokenizer.tokenize() function is typically called in a for loop:

import tokenizer
for token in tokenizer.tokenize(mystring):
    kind, txt, val = token
    if kind == tokenizer.TOK.WORD:
        # Do something with word tokens
        pass
    else:
        # Do something else
        pass

Alternatively, create a token list from the returned generator:

token_list = list(tokenizer.tokenize(mystring))

The split_into_sentences() function

To shallow-tokenize a text string, call tokenizer.split_into_sentences(text_or_gen, **options). The text_or_gen parameter can be a string, or an iterable that yields strings (such as a text file object).

This function returns a Python generator of strings, yielding a string for each sentence in the input. Within a sentence, the tokens are separated by spaces.

You can pass the option normalize=True to the function if you want the normalized form of punctuation tokens. Normalization outputs Icelandic single and double quotes („these“) instead of English-style ones ("these"), converts three-dot ellipsis ... to single character ellipsis …, and casts en-dashes – and em-dashes — to regular hyphens.
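
For example, here is a minimal sketch of the effect of normalize=True; the strings shown in the comments are indicative, based on the description above.

from tokenizer import split_into_sentences

s = 'Hann sagði: "Þú ert ágæt!"'

# Default: English-style quotes are passed through unchanged
print(next(split_into_sentences(s)))                  # Hann sagði : " Þú ert ágæt ! "

# normalize=True: quotes are output in Icelandic form
print(next(split_into_sentences(s, normalize=True)))  # Hann sagði : „ Þú ert ágæt ! “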

The tokenizer.split_into_sentences() function is typically called in a for loop:

import tokenizer
with open("example.txt", "r", encoding="utf-8") as f:
    # You can pass a file object directly to split_into_sentences()
    for sentence in tokenizer.split_into_sentences(f):
        # sentence is a string of space-separated tokens
        tokens = sentence.split()
        # Now, tokens is a list of strings, one for each token
        for t in tokens:
            # Do something with the token t
            pass

The correct_spaces() function

The tokenizer.correct_spaces(text) function returns a string after splitting it up and re-joining it with correct whitespace around punctuation tokens. Example:

>>> import tokenizer
>>> tokenizer.correct_spaces(
... "Frétt \n  dagsins:Jón\t ,Friðgeir og Páll ! 100  /  2  =   50"
... )
'Frétt dagsins: Jón, Friðgeir og Páll! 100/2 = 50'

The detokenize() function

The tokenizer.detokenize(tokens, normalize=False) function takes an iterable of token objects and returns a corresponding, correctly spaced text string, composed from the tokens' text. If the normalize parameter is set to True, the function uses the normalized form of any punctuation tokens, such as proper Icelandic single and double quotes instead of English-type quotes. Example:

>>> import tokenizer
>>> toklist = list(tokenizer.tokenize("Hann sagði: „Þú ert ágæt!“."))
>>> tokenizer.detokenize(toklist, normalize=True)
'Hann sagði: „Þú ert ágæt!“.'

The normalized_text() function

The tokenizer.normalized_text(token) function returns the normalized text for a token. This means that the original token text is returned except for certain punctuation tokens, where a normalized form is returned instead. Specifically, English-type quotes are converted to Icelandic ones, and en- and em-dashes are converted to regular hyphens.
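
A minimal usage sketch; per the description above, the English-style quotes should come out in Icelandic form:

import tokenizer

toklist = list(tokenizer.tokenize('Hann sagði: "Þú ert ágæt!".'))

# Print the normalized text of each token (sentence delimiter tokens have no text)
for t in toklist:
    print(tokenizer.normalized_text(t) or "")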

The text_from_tokens() function

The tokenizer.text_from_tokens(tokens) function returns a concatenation of the text contents of the given token list, with spaces between tokens. Example:

>>> import tokenizer
>>> toklist = list(tokenizer.tokenize("Hann sagði: \"Þú ert ágæt!\"."))
>>> tokenizer.text_from_tokens(toklist)
'Hann sagði : " Þú ert ágæt ! " .'

The normalized_text_from_tokens() function

The tokenizer.normalized_text_from_tokens(tokens) function returns a concatenation of the normalized text contents of the given token list, with spaces between tokens. Example (note the double quotes):

>>> import tokenizer
>>> toklist = list(tokenizer.tokenize("Hann sagði: \"Þú ert ágæt!\"."))
>>> tokenizer.normalized_text_from_tokens(toklist)
'Hann sagði : „ Þú ert ágæt ! “ .'

Tokenization options

You can optionally pass one or more of the following options as keyword parameters to the tokenize() and split_into_sentences() functions:

  • convert_numbers=[bool]

    Setting this option to True causes the tokenizer to convert numbers and amounts with English-style decimal points (.) and thousands separators (,) to Icelandic format, where the decimal separator is a comma (,) and the thousands separator is a period (.). $1,234.56 is thus converted to a token whose text is $1.234,56.

    The default value for the convert_numbers option is False.

    Note that in versions of Tokenizer prior to 1.4, convert_numbers was True.

  • convert_measurements=[bool]

    Setting this option to True causes the tokenizer to convert degrees Kelvin, Celsius and Fahrenheit to a regularized form, i.e. 200° C becomes 200 °C.

    The default value for the convert_measurements option is False.

  • replace_composite_glyphs=[bool]

    Setting this option to False disables the automatic replacement of composite Unicode glyphs with their corresponding Icelandic characters. By default, the tokenizer combines vowels with the Unicode COMBINING ACUTE ACCENT and COMBINING DIAERESIS glyphs to form single character code points, such as 'á' and 'ö'.

    The default value for the replace_composite_glyphs option is True.

  • replace_html_escapes=[bool]

    Setting this option to True causes the tokenizer to replace common HTML escape codes, such as &aacute;, with the character being escaped, such as á. Note that &shy; (soft hyphen) is replaced by an empty string, and &nbsp; is replaced by a normal space. The ligature characters ﬁ and ﬂ are replaced by fi and fl, respectively.

    The default value for the replace_html_escapes option is False.

  • handle_kludgy_ordinals=[value]

    This option controls the way Tokenizer handles 'kludgy' ordinals, such as 1sti, 4ðu, or 2ja. By default, such ordinals are returned unmodified ('passed through') as word tokens (TOK.WORD). However, this can be modified as follows:

    • tokenizer.KLUDGY_ORDINALS_MODIFY: Kludgy ordinals are corrected to become 'proper' word tokens, i.e. 1sti becomes fyrsti and 2ja becomes tveggja.
    • tokenizer.KLUDGY_ORDINALS_TRANSLATE: Kludgy ordinals that represent proper ordinal numbers are translated to ordinal tokens (TOK.ORDINAL), with their original text and their ordinal value. 1sti thus becomes a TOK.ORDINAL token with a value of 1, and 3ja becomes a TOK.ORDINAL with a value of 3.
    • tokenizer.KLUDGY_ORDINALS_PASS_THROUGH is the default value of the option. It causes kludgy ordinals to be returned unmodified as word tokens.

    Note that versions of Tokenizer prior to 1.4 behaved as if handle_kludgy_ordinals were set to tokenizer.KLUDGY_ORDINALS_TRANSLATE.
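
As an illustration, the options are passed as keyword arguments to tokenize() or split_into_sentences(); the following sketch (with an input text chosen purely for demonstration) combines several of them:

import tokenizer

text = "1sti bíllinn kostaði $1,234.56 og mældist 200° C."

for token in tokenizer.tokenize(
    text,
    convert_numbers=True,        # $1,234.56 -> $1.234,56
    convert_measurements=True,   # 200° C -> 200 °C
    handle_kludgy_ordinals=tokenizer.KLUDGY_ORDINALS_MODIFY,  # 1sti -> fyrsti
):
    print(tokenizer.TOK.descr[token.kind], token.txt)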

The token object

Each token is an instance of the class Tok that has three main properties: kind, txt and val.

The kind property

The kind property contains one of the following integer constants, defined within the TOK class:

Constant Value Explanation Examples
PUNCTUATION 1 Punctuation . ! ; % &
TIME 2 Time (h, m, s)
11:35:40
kl. 7:05
klukkan 23:35
DATE * 3 Date (y, m, d) [Unused, see DATEABS and DATEREL]
YEAR 4 Year
árið 874 e.Kr.
1965
44 f.Kr.
NUMBER 5 Number
100
1.965
1.965,34
1,965.34
2⅞
WORD 6 Word
kattaeftirlit
hunda- og kattaeftirlit
TELNO 7 Telephone number
5254764
699-4244
410 4000
PERCENT 8 Percentage 78%
URL 9 URL
ORDINAL 10 Ordinal number
30.
XVIII.
TIMESTAMP * 11 Timestamp [Unused, see TIMESTAMPABS and TIMESTAMPREL]
CURRENCY * 12 Currency name [Unused]
AMOUNT 13 Amount
€2.345,67
750 þús.kr.
2,7 mrð. USD
kr. 9.900
EUR 200
PERSON * 14 Person name [Unused]
EMAIL 15 E-mail [email protected]
ENTITY * 16 Named entity [Unused]
UNKNOWN 17 Unknown token  
DATEABS 18 Absolute date
30. desember 1965
30/12/1965
1965-12-30
1965/12/30
DATEREL 19 Relative date
15. mars
15/3
15.3.
mars 1911
TIMESTAMPABS 20 Absolute timestamp
30. desember 1965 11:34
1965-12-30 kl. 13:00
TIMESTAMPREL 21 Relative timestamp
30. desember kl. 13:00
MEASUREMENT 22 Value with a measurement unit
690 MW
1.010 hPa
220 m²
80° C
NUMWLETTER 23 Number followed by a single letter
14a
7B
DOMAIN 24 Domain name
greynir.is
Reddit.com
HASHTAG 25 Hashtag
#MeToo
#12stig
MOLECULE 26 Molecular formula
H2SO4
CO2
SSN 27 Social security number (kennitala)
591213-1480
USERNAME 28 Twitter user handle
@username_123
SERIALNUMBER 29 Serial number
394-5388
12-345-6789
COMPANY * 30 Company name [Unused]
S_BEGIN 11001 Start of sentence  
S_END 11002 End of sentence  

(*) The token types marked with an asterisk are reserved for the Greynir package and not currently returned by the tokenizer.

To obtain a descriptive text for a token kind, use TOK.descr[token.kind] (see example above).

The txt property

The txt property contains the original source text for the token, with the following exceptions:

  • All contiguous whitespace (spaces, tabs, newlines) is coalesced into single spaces (" ") within the txt string. A date token that is parsed from a source text of "29. \n janúar" thus has a txt of "29. janúar".
  • Tokenizer automatically merges Unicode COMBINING ACUTE ACCENT (code point 769) and COMBINING DIAERESIS (code point 776) with vowels to form single code points for the Icelandic letters á, é, í, ó, ú, ý and ö, in both lower and upper case. (This behavior can be disabled; see the replace_composite_glyphs option described above.)
  • If the appropriate options are specified (see above), it converts kludgy ordinals (3ja) to proper ones (þriðja), and English-style thousand and decimal separators to Icelandic ones (10,345.67 becomes 10.345,67).
  • If the replace_html_escapes option is set, Tokenizer replaces HTML-style escapes (such as &aacute;) with the characters being escaped (á).
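
A small sketch of the whitespace coalescing described in the first bullet above:

import tokenizer

# The date is spread across a newline and extra spaces in the source text
tokens = list(tokenizer.tokenize("29. \n  janúar"))

# tokens[0] is the sentence-begin token; tokens[1] should be the date token,
# with the whitespace in its txt coalesced into a single space
print(repr(tokens[1].txt))   # expected: '29. janúar'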

The val property

The val property contains auxiliary information, corresponding to the token kind, as follows:

  • For TOK.PUNCTUATION, the val field contains a tuple with two items: (whitespace, normalform). The first item (token.val[0]) specifies the whitespace normally found around the symbol in question, as an integer:

    TP_LEFT = 1   # Whitespace to the left
    TP_CENTER = 2 # Whitespace to the left and right
    TP_RIGHT = 3  # Whitespace to the right
    TP_NONE = 4   # No whitespace
    

    The second item (token.val[1]) contains a normalized representation of the punctuation. For instance, various forms of single and double quotes are represented as Icelandic ones (i.e. „these“ or ‚these‘) in normalized form, and ellipsis ("...") are represented as the single character "…".

  • For TOK.TIME, the val field contains an (hour, minute, second) tuple.

  • For TOK.DATEABS, the val field contains a (year, month, day) tuple (all 1-based).

  • For TOK.DATEREL, the val field contains a (year, month, day) tuple (all 1-based), except that at least one of the tuple fields is missing and set to 0. Example: 3. júní becomes TOK.DATEREL with the fields (0, 6, 3), as the year is missing.

  • For TOK.YEAR, the val field contains the year as an integer. A negative number indicates that the year is BCE (fyrir Krist), specified with the suffix f.Kr. (e.g. árið 33 f.Kr.).

  • For TOK.NUMBER, the val field contains a tuple (number, None, None). (The two empty fields are included for compatibility with Greynir.)

  • For TOK.WORD, the val field contains the full expansion of an abbreviation, as a list containing a single tuple, or None if the word is not abbreviated.

  • For TOK.PERCENT, the val field contains a tuple of (percentage, None, None).

  • For TOK.ORDINAL, the val field contains the ordinal value as an integer. The original ordinal may be a decimal number or a Roman numeral.

  • For TOK.TIMESTAMP, the val field contains a (year, month, day, hour, minute, second) tuple.

  • For TOK.AMOUNT, the val field contains an (amount, currency, None, None) tuple. The amount is a float, and the currency is an ISO currency code, e.g. USD for dollars ($ sign), EUR for euros (€ sign) or ISK for Icelandic króna (kr. abbreviation). (The two empty fields are included for compatibility with Greynir.)

  • For TOK.MEASUREMENT, the val field contains a (unit, value) tuple, where unit is a base SI unit (such as g, m, m², s, W, Hz, or K for temperature in Kelvin).

  • For TOK.TELNO, the val field contains a tuple (number, cc), where the first item is the phone number in a normalized NNN-NNNN format, i.e. always including a hyphen, and the second item is the country code, optionally prefixed by +. The country code defaults to 354 (Iceland).
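
Building on the descriptions above, here is a minimal sketch of unpacking token.val for a couple of token kinds; the values in the comments are what one would expect from the rules listed above.

import tokenizer
from tokenizer import TOK

text = "Hann keypti bílinn 30. desember 1965 fyrir kr. 9.900."

for token in tokenizer.tokenize(text):
    if token.kind == TOK.DATEABS:
        year, month, day = token.val
        print("date:", year, month, day)          # expected: date: 1965 12 30
    elif token.kind == TOK.AMOUNT:
        amount, currency = token.val[0], token.val[1]
        print("amount:", amount, currency)        # expected: amount: 9900.0 ISK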

Abbreviations

Abbreviations recognized by Tokenizer are defined in the Abbrev.conf file, found in the src/tokenizer/ directory. This is a text file with abbreviations, their definitions and explanatory comments.

When an abbreviation is encountered, it is recognized as a word token (i.e. having its kind field equal to TOK.WORD). Its expansion(s) are included in the token's val field as a list containing tuples of the format (ordmynd, utg, ordfl, fl, stofn, beyging). An example is o.s.frv., which results in a val field equal to [('og svo framvegis', 0, 'ao', 'frasi', 'o.s.frv.', '-')].

The tuple format is designed to be compatible with the Database of Icelandic Morphology (DIM), Beygingarlýsing íslensks nútímamáls, i.e. the so-called Sigrúnarsnið.
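
For example, a minimal sketch of reading an abbreviation's expansion from a word token:

import tokenizer
from tokenizer import TOK

for token in tokenizer.tokenize("Þetta eru hundar, kettir o.s.frv."):
    if token.kind == TOK.WORD and token.val:
        # Each meaning is an (ordmynd, utg, ordfl, fl, stofn, beyging) tuple
        for meaning in token.val:
            print(token.txt, "->", meaning[0])    # e.g. o.s.frv. -> og svo framvegis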

Development installation

To install Tokenizer in development mode, where you can easily modify the source files (assuming you have git available):

$ git clone https://github.com/mideind/Tokenizer
$ cd Tokenizer
$ # [ Activate your virtualenv here, if you have one ]
$ pip install -e ".[dev]"

Test suite

Tokenizer comes with a large test suite. The file test/test_tokenizer.py contains built-in tests that run under pytest.

To run the built-in tests, install pytest, cd to your Tokenizer subdirectory (and optionally activate your virtualenv), then run:

$ python -m pytest

The file test/toktest_large.txt contains a test set of 13,075 lines. The lines test sentence detection, token detection and token classification. For analysis, test/toktest_large_gold_perfect.txt contains the expected output of a perfect shallow tokenization, and test/toktest_large_gold_acceptable.txt contains the current output of the shallow tokenization.

The file test/Overview.txt (only in Icelandic) contains a description of the test set, including line numbers for each part in both test/toktest_large.txt and test/toktest_large_gold_acceptable.txt, and a tag describing what is being tested in each part.

It also describes, for each part, a perfect shallow tokenization, an acceptable tokenization, and the current behaviour. As such, the description is an analysis of which edge cases the tokenizer can handle and which it cannot.

To test the tokenizer on the large test set, run the following from the command line:

$ tokenize test/toktest_large.txt test/toktest_large_out.txt

To compare it to the acceptable behaviour:

$ diff test/toktest_large_out.txt test/toktest_large_gold_acceptable.txt > diff.txt

The file test/toktest_normal.txt contains a running text from recent news articles, containing no edge cases. The gold standard for that file can be found in the file test/toktest_normal_gold_expected.txt.

Changelog

  • Version 3.4.3: Various minor fixes. Now requires Python 3.8 or later.
  • Version 3.4.2: Abbreviations and phrases added, META_BEGIN token added.
  • Version 3.4.1: Improved performance on long input chunks.
  • Version 3.4.0: Improved handling and normalization of punctuation.
  • Version 3.3.2: Internal refactoring; bug fixes in paragraph handling.
  • Version 3.3.1: Fixed bug where opening quotes at the start of paragraphs were sometimes incorrectly recognized and normalized.
  • Version 3.2.0: Numbers and amounts that consist of word tokens only ('sex hundruð') are now returned as the original TOK.WORD tokens ('sex' and 'hundruð'), not as single coalesced TOK.NUMBER/TOK.AMOUNT/etc. tokens.
  • Version 3.1.2: Changed paragraph markers to [[ and ]] (removing spaces).
  • Version 3.1.1: Minor fixes; added Tok.from_token().
  • Version 3.1.0: Added -o switch to the tokenize command to return original token text, enabling the tokenizer to run as a sentence splitter only.
  • Version 3.0.0: Added tracking of character offsets for tokens within the original source text. Added full type annotations. Dropped Python 2.7 support.
  • Version 2.5.0: Added arguments for all tokenizer options to the command-line tool. Type annotations enhanced.
  • Version 2.4.0: Fixed bug where certain well-known word forms (, fær, mín, ...) were being interpreted as (wrong) abbreviations. Also fixed bug where certain abbreviations were being recognized even in uppercase and at the end of a sentence, for instance Örn.
  • Version 2.3.1: Various bug fixes; fixed type annotations for Python 2.7; the token kind NUMBER WITH LETTER is now NUMWLETTER.
  • Version 2.3.0: Added the replace_html_escapes option to the tokenize() function.
  • Version 2.2.0: Fixed correct_spaces() to handle compounds such as Atvinnu-, nýsköpunar- og ferðamálaráðuneytið and bensínstöðvar, -dælur og -tankar.
  • Version 2.1.0: Changed handling of periods at end of sentences if they are a part of an abbreviation. Now, the period is kept attached to the abbreviation, not split off into a separate period token, as before.
  • Version 2.0.7: Added TOK.COMPANY token type; fixed a few abbreviations; renamed parameter text to text_or_gen in functions that accept a string or a string iterator.
  • Version 2.0.6: Fixed handling of abbreviations such as m.v. (miðað við) that should not start a new sentence even if the following word is capitalized.
  • Version 2.0.5: Fixed bug where single uppercase letters were erroneously being recognized as abbreviations, causing prepositions such as 'Í' and 'Á' at the beginning of sentences to be misunderstood in GreynirPackage.
  • Version 2.0.4: Added imperfect abbreviations (amk., osfrv.); recognized klukkan hálf tvö as a TOK.TIME.
  • Version 2.0.3: Fixed bug in detokenize() where abbreviations, domains and e-mails containing periods were wrongly split.
  • Version 2.0.2: Spelled-out day ordinals are no longer included as a part of TOK.DATEREL tokens. Thus, þriðji júní is now a TOK.WORD followed by a TOK.DATEREL. 3. júní continues to be parsed as a single TOK.DATEREL.
  • Version 2.0.1: Order of abbreviation meanings within the token.val field made deterministic; fixed bug in measurement unit handling.
  • Version 2.0.0: Added command line tool; added split_into_sentences() and detokenize() functions; removed convert_telno option; splitting of coalesced tokens made more robust; added TOK.SSN, TOK.MOLECULE, TOK.USERNAME and TOK.SERIALNUMBER token kinds; abbreviations can now have multiple meanings.
  • Version 1.4.0: Added the **options parameter to the tokenize() function, giving control over the handling of numbers, telephone numbers, and 'kludgy' ordinals.
  • Version 1.3.0: Added TOK.DOMAIN and TOK.HASHTAG token types; improved handling of capitalized month name Ágúst, which is now recognized when following an ordinal number; improved recognition of telephone numbers; added abbreviations.
  • Version 1.2.3: Added abbreviations; updated GitHub URLs.
  • Version 1.2.2: Added support for composites with more than two parts, i.e. „dómsmála-, ferðamála-, iðnaðar- og nýsköpunarráðherra“; added support for ± sign; added several abbreviations.
  • Version 1.2.1: Fixed bug where the name Ágúst was recognized as a month name; Unicode nonbreaking and invisible space characters are now removed before tokenization.
  • Version 1.2.0: Added support for Unicode fraction characters; enhanced handing of degrees (°, °C, °F); fixed bug in cubic meter measurement unit; more abbreviations.
  • Version 1.1.2: Fixed bug in liter (l and ltr) measurement units.
  • Version 1.1.1: Added mark_paragraphs() function.
  • Version 1.1.0: All abbreviations in Abbrev.conf are now returned with their meaning in a tuple in token.val; handling of 'mbl.is' fixed.
  • Version 1.0.9: Added abbreviation 'MAST'; harmonized copyright headers.
  • Version 1.0.8: Bug fixes in DATEREL, MEASUREMENT and NUMWLETTER token handling; added 'kWst' and 'MWst' measurement units; blackened.
  • Version 1.0.7: Added TOK.NUMWLETTER token type.
  • Version 1.0.6: Automatic merging of Unicode COMBINING ACUTE ACCENT and COMBINING DIAERESIS code points with vowels.
  • Version 1.0.5: Date/time and amount tokens coalesced to a further extent.
  • Version 1.0.4: Added TOK.DATEABS, TOK.TIMESTAMPABS, TOK.MEASUREMENT.

Issues

Hyphens split off from words

In its current form, the tokenizer splits a hyphen off from a word when the hyphen occurs at the end of the word:
félags - og menntamálaráðherra.

We at Árnastofnun would prefer that this were not done, so that the output would be:
félags- og menntamálaráðherra

Not enough test coverage

We could use better test coverage of the code.

There is an easy way to measure coverage: the pytest plugin pytest-cov generates coverage reports.

pip install pytest pytest-cov
pytest --cov=src/tokenizer --cov-report=html

The current results look like this:

----------- coverage: platform linux, python 3.8.6-final-0 -----------
Name                           Stmts   Miss  Cover
--------------------------------------------------
src/tokenizer/__init__.py          7      0   100%
src/tokenizer/abbrev.py          157     12    92%
src/tokenizer/definitions.py     121      9    93%
src/tokenizer/main.py            103    103     0%
src/tokenizer/tokenizer.py      1168    200    83%
--------------------------------------------------
TOTAL                           1556    324    79%

and an HTML report that highlights uncovered lines is generated in the htmlcov folder.

The tokenizer is slow when the input string is long.

When the input to the tokenizer is very long, the time it takes to split the string into sentences does not grow linearly with the input length, as one might expect.

Test case to demonstrate the issue:

from time import time

import tokenizer
from tqdm import tqdm


def create_test_sentence(num_words, num_sentences):
    """Create a test string which contains num_words words and num_sentences sentences."""
    sentences = []
    for i in range(num_sentences):
        sentence = ["Hæ"] * num_words
        sentences.append(" ".join(sentence))
        sentences[-1] += "."
    return " ".join(sentences) + "\n"


def run_multi_sentence_test(num_words, num_sentences, num_lines):
    print(f"Running test with {num_words} words and {num_sentences} sentences in each line and {num_lines} lines.")
    start = time()
    test_data = list(create_test_sentence(num_words, num_sentences) for _ in range(num_lines))
    count = 0
    for line in tqdm(test_data):
        for sent in tokenizer.split_into_sentences(line, original=True):
            count += 1
    end = time()
    assert count == num_lines * num_sentences
    print("number of sentences encountered:", count)
    print("time taken:", end - start)
    print(f"{(end - start) / count:.6f} seconds per sentence")


num_words = 20
num_sentences = 40
num_lines = 300
run_multi_sentence_test(num_words, num_sentences, num_lines)
# Running test with 20 words and 40 sentences in each line and 300 lines.
# ...
# number of sentences encountered: 12000
# time taken: 17.760034322738647
# 0.001480 seconds per sentence

num_sentences = 1
num_lines = 300 * 40
run_multi_sentence_test(num_words, num_sentences, num_lines)
# Running test with 20 words and 1 sentences in each line and 12000 lines.
# ...
# number of sentences encountered: 12000
# time taken: 3.183797597885132
# 0.000265 seconds per sentence

num_sentences = 300 * 40
num_lines = 1
run_multi_sentence_test(num_words, num_sentences, num_lines)
# Running test with 20 words and 12000 sentences in each line and 1 lines.
# Does not return a value in 5 minutes.

Twitter handles and @usernames can contain periods (@matur.a.mbl) but are broken into sentences

The following text

Þetta var notandinn @matur.a.mbl á Twitter.

becomes

Tok(kind=11001, txt=None, val=(0, None))
Tok(kind=6, txt='Þetta', val=None)
Tok(kind=6, txt='var', val=None)
Tok(kind=6, txt='notandinn', val=None)
Tok(kind=28, txt='@matur', val='matur')
Tok(kind=1, txt='.', val=(3, '.'))
Tok(kind=11002, txt=None, val=None)
Tok(kind=11001, txt=None, val=(0, None))
Tok(kind=6, txt='a.mbl', val=None)
Tok(kind=6, txt='á', val=None)
Tok(kind=6, txt='Twitter', val=None)
Tok(kind=1, txt='.', val=(3, '.'))
Tok(kind=11002, txt=None, val=None)

Bigger ordinal numbers in the tokenizer

Hi! I'm making a normalizer and have written rules that recognize both cardinal and ordinal numbers up to 999 billion (999.999.999.999. is the highest ordinal number; I don't expect anyone to ever write this, but whatever). I use the tokenizer to split text into sentences, and I was wondering about the reasoning behind when the tokenizer recognizes an ordinal number and when it reads a cardinal number followed by an end of sentence. I did an experiment:

try_ordinal = "Hæ, þetta er 5. dagurinn, þetta er 51. dagurinn, þetta er 512. dagurinn, þetta er 5123. dagurinn, " + \
              "þetta er 5.234. dagurinn, þetta er 51234. dagurinn, þetta er 52.345. dagurinn, þetta er 512345. " + \
              "dagurinn, þetta er 523.456. dagurinn, þetta er 5123456. dagurinn, þetta er 5.234.567. dagurinn."

>>> list(split_into_sentences(try_ordinal))
['Hæ , þetta er 5. dagurinn , þetta er 51. dagurinn , þetta er 512. dagurinn , þetta er 5123. dagurinn , þetta er 5.234 .',
 'dagurinn , þetta er 51234. dagurinn , þetta er 52.345 .',
 'dagurinn , þetta er 512345. dagurinn , þetta er 523.456 .',
 'dagurinn , þetta er 5123456 .',
 'dagurinn , þetta er 5.234.567 .',
 'dagurinn .']

I've generally tried to keep in the periods every third digit: if someone writes 8923402 it is recognized as a sequence of digits (a phone number, átta níu tveir þrír fjórir núll tveir, not átta milljónir níu hundruð tuttugu og þrjú þúsund fjögur hundruð og tvö, that would be ridiculous). However, if someone actually writes 8.923.402 they get the millions, because they were clear with the periods. 🙂

So the normalizer recognizes everything over 9999. ONLY as ordinals with the period separators but the tokenizer wants nothing to do with them. Is there reasoning behind this? Of course my reasoning is only my personal opinion so I'm very open to the conversation. 😊 Have you assessed that no one will ever write such big ordinals? At least I think numbers with periods (like 52.345.) for clarity should work as well as the other numbers!

Thank you 😁

Two dots

A sentence that ends with two dots is split into two sentences, while a sentence that ends with three dots is not:
"This is a sentence.." becomes "This is a sentence. ¦ ."
while
"This is a sentence..." becomes "This is a sentence ..."

Character omitted

When using split_into_sentences() on the string

@@[smiley: Too Funny: [4/4_1_72]]

the result is

@ @ [ smiley : Too Funny : [ 4/4 _ 1 _ 72

so "]]" from the end of the sentence is omitted.

UnboundLocalError: local variable 'unit' referenced before assignment

I was tokenizing the ParIce dataset when I encountered an error:

UnboundLocalError: local variable 'unit' referenced before assignment. tokenizer.py:1455

There are quite a few sentences that will cause this error; here is an example:
test = "framkvæmdastjórnin skal einnig birta skýrslu um framvindu framkvæmdarinnar byggða á yfirlitsskýrslum, sem aðildarríki leggja fram skv2mgr15gr., og leggja hana fyrir evrópuþingið og aðildarríkin eigi síðar en tveimur árum eftir dagsetningarnar sem um getur í 5og 8gr."

This segment can be found on line 3254414 in the ees.tmx:

Framkvæmdastjórnin skal einnig birta skýrslu um framvindu framkvæmdarinnar byggða á yfirlitsskýrslum, sem aðildarríki leggja fram skv2mgr15gr., og leggja hana fyrir Evrópuþingið og aðildarríkin eigi síðar en tveimur árum eftir dagsetningarnar sem um getur í 5og 8gr.

The text is clearly broken, but looking at the code, the error still seems to be valid.

Spaces deleted

When tokenizing ? … the output is ?…
Is it expected behaviour of the tokenizer to delete spaces?

The tokenizer is missing some abbreviations

Hello,

we're making a speech synthesizer and use the shallow tokenizer. We split the text into sentences for normalization, and the sentence structure helps determine where the pauses should fall in the synthesized speech. However, there are some abbreviations (some common, others less so but still allowed 😊) that the tokenizer does not handle, splitting sentences at them, which in the most serious cases could prevent the normalization from happening, as well as obviously making the phrasing weird. This happens when an abbreviation ends with a period and the tokenizer reads it as end-of-sentence instead of as part of the abbreviation. Could you add them?

The list is:

  • The reason I started collecting these cases is the following. The normalizer expands the s in "s. 550-1234" to sími ONLY if it's followed by seven digits. However, the tokenizer splits this up to two sentences, making a break between s. and the number. The same applies to rn. (reikningsnúmer). Would it be possible to add this rule? I feel like I have at least written these abbreviations veeery often. 🤪
  • frák. (fráköst) – normally it's written without a period but it's more correct with the period and the discussion of fráköst feels like the most common one in the whole RMH. (I manually annotated 40,000 random sentences and I think most of them were describing basketball matches.)
  • ath. (athugið) – it's very common to write this both with and without a period, but the tokenizer splits between sentences when the period is there.
  • ps. – this is not normally written with a dot but someone might have the idea, then it's beneficial to handle it (at least not ambiguous with anything else, right? :))
  • B.Sc. is correct and not split between sentences, but M.Sc. (1375 mentions in RMH) is.
  • m.v. (miðað við) – occurs 3867 times in RMH but splits between sentences.
  • vs. (versus) – not so common with the period but occurs 194 times in RMH.
  • km. (mm, dm, hm, sm, cm, etc.) – I wouldn't write these with a following period but according to RMH a LOT of people (2696 just for km.) do.
  • kcal. – another case of not the most common with a period (I wouldn't) but more correct.

Thanks!

Detokenization adds spaces to "o.s.frv."

I am testing the detokenization and noticed that it adds spaces within "o.s.frv.", so it becomes "o. s. frv."

tokenized = list(tokenizer.tokenize("o.s.frv.", normalize=False))
detokenized = tokenizer.detokenize(tokenized, normalize=False)

correct_spaces incorrectly inserts spaces into abbreviations

Using the newest version of Tokenizer, 3.4.2:

>>> from tokenizer import correct_spaces
>>> correct_spaces('Þarna voru t.d. tveir hundar , m.a. hundurinn hans Jóns .')
# Expected output: 'Þarna voru t.d. tveir hundar, m.a. hundurinn hans Jóns.'
# Output:          'Þarna voru t. d. tveir hundar, m. a. hundurinn hans Jóns.'

tokenize() options

Is it worth making the decisions that tokenize() makes optional?

For instance, the "kludgy ordinals" conversion.

Inconsistent application of abbreviation expansion

I noticed different handling of abbreviations between version 1.4.0 and 2.0.0 in a test case of mine.
test = "nr., gr., 1sti fyrsti, 1., 2ja, o.s.frv."
In particular, the handling of "gr." can differ between runs and I've seen it return one of:

  • "greinir"
  • "grein"
  • "grískur"
  • "greiðsla"

I know that the test case is out of context, in the sense that there is no correct answer out of these options. Regardless, I find the inconsistency of outputs troubling.

I briefly looked at the code and saw that "set()" is used to hold abbreviations which is probably the culprit.

KeyError for unknown abbreviations

using the example code and the text:
text = "Best að athuga vedur.is. Er ekki gott veður úti?"

Results in an exception:
KeyError: 'vedur.is.'

Support for citation characters

The tokenizer should support superscripted citation characters. This will also help with GreynirCorrect, which I assume will be heavily used to read student essays and academic papers.


split_into_sentences changes sentences

It would be helpful to have a version of split_into_sentences that does only that and does not touch the sentences in any other way. Two examples:

input: Faxaflói Suðlæg átt , 5 - 10 m/s , él og hiti kringum frostmark .
output: Faxaflói Suðlæg átt , 5 - 10 m / s , él og hiti kringum frostmark . # adding spaces around '/'

input: Áfram hélt fjörið í síðari hálfleik og þegar 3. leikhluti var tæplega hálfnaður var staðan 64-52 .
output: Áfram hélt fjörið í síðari hálfleik og þegar 3. leikhluti var tæplega hálfnaður var staðan 64 - 52 . # adding spaces around '-'
