apertium / apertium-afr-nld

Apertium translation pair for Afrikaans and Dutch

License: GNU General Public License v2.0

Makefile 1.18% Shell 8.45% M4 0.12% Python 0.41% TeX 13.37% PostScript 0.02% XML 76.45%
apertium-trunk

apertium-afr-nld's Introduction

Apertium

Apertium is an open-source rule-based machine translation toolchain and ecosystem. It facilitates the creation of consistent and transparent machine translation systems by relying on deterministic linguistic rules rather than statistical or neural models. Apertium's tools are designed to be language-agnostic and platform-independent, making them suitable for a wide range of languages and applications.

Project Overview

Apertium's framework is based on finite-state transducers, which enable efficient and accurate processing of natural languages. The language data used by Apertium is stored in XML and other human-readable text formats, organized into modular single-language packages and translation pairs. This modularity allows for the reuse of language data across multiple translation systems.

Features

  • Rule-Based Translation: Consistent and understandable translations based on deterministic rules.
  • Finite-State Transducers: Efficient language processing using advanced computational models.
  • Language-Agnostic Tools: Broad applicability across multiple languages.
  • Modular Design: Reusable language packages simplify the development of new translation pairs.

Installation

Apertium provides binaries for several platforms, including Debian, Ubuntu, Fedora, CentOS, OpenSUSE, Windows, and macOS. Both nightly builds and official releases are available. If you are on a supported platform, it is recommended to use the pre-built binaries.

For more information, see the Apertium Installation Guide.

Building from Source

If you need to modify Apertium’s behavior or are on a platform that is not officially supported, follow these steps to build from source.

Requirements

Compiling

$ autoreconf -fvi
$ ./configure
$ make

Usage

Apertium can be used to translate text between supported languages. Assuming the relevant language data (here the Spanish-Catalan translator) has been installed, translation can be achieved with the following command:

$ apertium spa-cat input.txt output.txt

The apertium executable can also use piped streams:

$ echo "La casa es roja." | apertium spa-cat

Language data which has been compiled but not installed can be used with the -d flag:

$ echo "La casa es roja." | apertium -d ./apertium-spa-cat spa-cat

Formats other than plaintext can be specified with the -f flag:

$ apertium -f html spa-cat input.html output.html

Data packages may provide modes besides the main translation mode. Use the -l flag to list them.

$ apertium -l
$ apertium -l -d ./apertium-spa-cat
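
For scripting, the invocations above can be assembled programmatically. The following is a minimal Python sketch and not part of Apertium: the helper names build_apertium_command and translate are hypothetical, and only the -d, -f and -l flags described above are used. Actually running translate assumes that the apertium executable and the relevant language data are installed.

```python
import subprocess


def build_apertium_command(mode=None, data_dir=None, fmt=None, list_modes=False):
    """Build an argument list for the apertium executable.

    Hypothetical helper: it only knows the -l, -d and -f flags shown above.
    """
    cmd = ["apertium"]
    if list_modes:
        cmd.append("-l")            # list the available modes
    if data_dir is not None:
        cmd += ["-d", data_dir]     # compiled but uninstalled language data
    if fmt is not None:
        cmd += ["-f", fmt]          # input format other than plaintext
    if mode is not None:
        cmd.append(mode)            # e.g. "spa-cat"
    return cmd


def translate(text, mode, data_dir=None):
    """Pipe text through apertium and return its output (needs apertium installed)."""
    cmd = build_apertium_command(mode, data_dir=data_dir)
    result = subprocess.run(cmd, input=text, capture_output=True, text=True, check=True)
    return result.stdout
```

Building the argument list separately from running it keeps the flag handling easy to inspect without a working installation.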

Additional Tools

This repository also provides the following executables:

Pipeline Modules

  • apertium-extract-caps, apertium-restore-caps: Handle capitalization
  • apertium-pretransfer: Split compound analyses into separate words for processing by apertium-transfer
  • apertium-posttransfer: Clean up repeated spaces
  • apertium-tagger: Perform statistical part-of-speech tagging
  • apertium-transfer, apertium-interchunk, apertium-postchunk: Structural transfer modules (documentation)
  • apertium-wblank-attach, apertium-wblank-detach, apertium-wblank-mode: Handle word-bound blanks

Build Tools

These programs are used in the process of compiling linguistic data packages.

  • apertium-compile-caps: Compile capitalization-handling rules for use by apertium-restore-caps (documentation)
  • apertium-gen-modes: Process the modes.xml file, which specifies what translation and analysis modes a data package provides
  • apertium-preprocess-transfer: Process structural transfer rule files for use by apertium-transfer
  • apertium-validate-acx, apertium-validate-crx, apertium-validate-dictionary, apertium-validate-interchunk, apertium-validate-modes, apertium-validate-postchunk, apertium-validate-tagger, apertium-validate-transfer: Validators for various XML rule formats

Format Handlers

For each supported file format, there is a deformatter named apertium-des[NAME] (e.g. apertium-deshtml) which reads formatted text from standard input and writes Apertium stream format to standard output. There is also a corresponding set of reformatters which do the reverse and are named apertium-re[NAME] (e.g. apertium-rehtml). These programs rarely need to be invoked directly, since they are handled by the apertium executable.

Most of the format handlers are currently deprecated in favor of Transfuse.

License

This project is licensed under the GNU General Public License v2.0. See the COPYING file for details.

For more information, visit Apertium or the Apertium Wiki.

apertium-afr-nld's People

Contributors

bentley, ftyers, jimregan, marcriera, mr-martian, pimotte, sushain97, tinodidriksen, trondtr, unhammer, wolfgangth

apertium-afr-nld's Issues

Vocabulary issues Dutch --> Afrikaans

Some vocabulary issues Dutch --> Afrikaans

duikoperator --> duikdiensverskaffer
nestplaats --> nesplek
rond (as preposition) --> rondom
e.g.: rond het eiland -- rondom die eiland
cactus --> kaktus
lig, ligt, liggen (verb) --> lê
e.g.: het ligt --> dit lê
licht (noun) --> lig
lichten (plural) --> ligte
recreatieoord --> ontspanningsoord

Participles

Participles are a big problem in Afrikaans. If they are used as part of the verb's conjugation they are usually just the verb with a ge- prefix:

ek loop -- ek het geloop

The ge- prefix is dropped if the verb already has an inseparable prefix:

ek beveel -- ek het beveel.
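
The verbal rule just described can be sketched in a few lines of Python. This is only an illustration, not how Apertium encodes it: the Verb class and its has_inseparable_prefix flag are hypothetical, the flag is assumed to come from the dictionary rather than being guessed from the spelling, and any spelling adjustments are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verb:
    stem: str                              # e.g. "loop", "beveel"
    has_inseparable_prefix: bool = False   # assumed to be recorded in the lexicon


def verbal_participle(verb):
    """Participle as used after 'het': ge- + stem, unless the prefix blocks it."""
    if verb.has_inseparable_prefix:
        return verb.stem
    return "ge" + verb.stem


def perfect(subject, verb):
    """Build a simple perfect-tense phrase such as 'ek het geloop'."""
    return f"{subject} het {verbal_participle(verb)}"
```

With these definitions, perfect("ek", Verb("loop")) gives "ek het geloop" and perfect("ek", Verb("beveel", True)) gives "ek het beveel". The adjectival forms discussed next are not covered by this sketch.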

The problem comes when they are used as attributive (or predicative) adjectives:

verbal usage: hy het dit aanbeveel - hij heeft dit aanbevolen - he has recommended this
adjectival: die daaglikse aanbevole inname - de dagelijks aanbevolen hoeveelheid - the recommended daily intake
predicative: hierdie inname is aanbevole - this intake is recommended

Sometimes the attributive participle is still strong, like Dutch aanbevolen, particularly in formal usage and idioms, but in lower registers don't be surprised to see 'aanbeveelde'. The language seems to be in flux on this point.

Can Apertium be taught to distinguish between verbal and adjectival usage?

Double negation

Afrikaans has double negation, something Dutch does not have. In Afrikaans this means that a sentence that contains a negation typically ends in the (extra) word nie.

Apertium does not add it yet.

Examples:
afr: Johnnie is nie dood nie.
nld: Johnnie is niet dood.

afr: Maar sy het nie omgegee nie.
nld: Maar zij gaf er niet om.

There are cases where Afrikaans only has one negative element in the sentence but that is rare and I'm not quite sure how that works. They are typically short utterances in the present tense. I don't think you ever get two 'nie'-s at the end of the phrase.

The double negation can also be triggered by words other than nie itself that imply a negation, such as:

  • geen, g'n
  • niemand
  • niks
  • nooit
  • moenie

afr: Dis g'n wonder nie.
nld: 't Is geen wonder.

afr: Dit was 'n dag wat ek nooit sal vergeet nie.
nld: Dit was een dag die ik nooit zal vergeten.

afr: Ek het niks gedoen nie.
nld: Ik heb niets gedaan.

Moenie initiates a negative imperative:

afr: Moenie huil nie!
nld: Huil niet!

Sometimes the distance between the two negatives is quite large, e.g. when a subordinate clause intervenes:

afr: Niemand is nog in hegtenis geneem nadat 'n man Maandagaand buite die bekende M-kern-apteek in Bellville in die Wes-Kaap verskeie kere in die been geskiet is nie.
nld: Niemand is er nog in hechtenis genomen nadat een man maandagavond buiten de bekende M-kern-apotheek in Bellville in de provincie West-Kaap verscheidene keren in het been geschoten werd.
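
As a rough illustration of the rule described in this issue, the Python sketch below appends the closing nie to an Afrikaans sentence that contains one of the triggers listed above. It is a surface-string heuristic under strong assumptions, not a proposal for the actual transfer rules: the function name add_final_nie is hypothetical, the closing nie is always placed at the very end of the sentence (so clause-internal placement is not handled), and the rare single-negation cases mentioned above are ignored.

```python
# Negation triggers taken from the list in this issue.
NEGATION_TRIGGERS = {"nie", "geen", "g'n", "niemand", "niks", "nooit", "moenie"}
FINAL_PUNCTUATION = ".!?"


def add_final_nie(sentence):
    """Append the closing 'nie' if the sentence is negated and lacks one."""
    body = sentence.rstrip()
    punct = ""
    # Peel off sentence-final punctuation so 'nie' can go in front of it.
    while body and body[-1] in FINAL_PUNCTUATION:
        punct = body[-1] + punct
        body = body[:-1]
    words = [w.strip(",;:").lower() for w in body.split()]
    if not words or words[-1] == "nie":
        return sentence  # empty, or already closed by 'nie'
    if any(w in NEGATION_TRIGGERS for w in words):
        return body + " nie" + punct
    return sentence  # no negation trigger found
```

For example, add_final_nie("Moenie huil!") returns "Moenie huil nie!", while a sentence that already ends in nie is returned unchanged, which matches the observation that two nie-s never occur together at the end of a phrase.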

Vocabulary issues 4: Euratom

nld: vreedzaam (adj) --> afr: vreedsaam
inflected:
nld: vreedzame --> afr: vreedsame
nld: kernenergie (noun; mf) --> afr: kernkrag
nld: brandstof (noun; mf) --> afr: brandstof
nld: brandstoffen (noun; pl) --> afr: brandstowwe
nld: broeikaseffect (noun; nt) --> afr: kweekhuiseffek

Move to three letter ISO codes

This pair should be moved to three letter ISO codes. The name should probably be apertium-afr-nld.

The following files (at minimum) will need to be checked:

  • Makefile.am
  • configure.ac
  • modes.xml
  • README

The pair should also be checked to see if it can be adapted to work with monolingual language packages in languages/

inflection of adjectives

This is a complicated issue in both languages and the complications are not the same.

The ground rule in both languages is the same:

  • Used as predicate (or as adverb), the adjective remains uninflected:

afr: Soms is kerk regtig vervelig en voorspelbaar. --> nld: Soms is kerk echt vervelend en voorspelbaar.

  • Used as attribute, the adjective gets the inflection -e:

afr: vervelige boeke --> nld: vervelende boeken
afr: Suid-Afrika se taamlik voorspelbare politieke situasie --> nld: Zuid-Afrika's nogal voorspelbare politieke situatie

Unfortunately the latter rule has many exceptions in both languages and they are very different in the two languages. I do not pretend to know the Afrikaans ones all that well.

In nld the biggest exception is a singular neuter noun used in indefinite form, for example with the indefinite article "een" or its negative "geen". In that case the attributive adjective stays uninflected:

het paard is mooi
het mooie paard
een mooi paard

Afrikaans has no neuter gender, so this exception does not apply:

die perd is mooi
die mooie perd
'n mooie perd

In Afrikaans there are different exceptions: a whole bunch of monosyllabic adjectives never get inflected; e.g. groot:

die perd is groot
die groot perd
'n groot perd

For us Dutchies this is really hard: we would agree with 'n groot perd but not with die groot perd. The rules are really different.

I have no idea how you would code this. Perhaps in Afrikaans you can define a predicate form and an attribute form and simply make the two the same for cases like groot? In Dutch the rules would have to include gender and indefiniteness for the attribute form.
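
The idea in the previous paragraph can be sketched in Python as follows. This is an illustrative model only, not Apertium dictionary or transfer syntax: the Adjective class and the functions nld_form and afr_form are hypothetical names, each adjective simply stores both forms, and only the one Dutch exception described above (singular neuter indefinite) is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adjective:
    predicative: str   # uninflected form, e.g. "mooi"
    attributive: str   # form used before a noun; equal to predicative for e.g. afr "groot"


def nld_form(adj, attributive, neuter=False, singular=True, indefinite=False):
    """Dutch: attributive -e, except with a singular neuter indefinite noun."""
    if not attributive:
        return adj.predicative
    if neuter and singular and indefinite:
        return adj.predicative   # een mooi paard
    return adj.attributive       # het mooie paard


def afr_form(adj, attributive):
    """Afrikaans: exceptions are encoded by storing the same form twice."""
    return adj.attributive if attributive else adj.predicative
```

With Adjective("mooi", "mooie") for Dutch, nld_form gives "mooi" for een mooi paard and "mooie" for het mooie paard. With Adjective("groot", "groot") for Afrikaans, afr_form gives "groot" in every position, so the invariable adjectives need no rule of their own.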

There is more on this subject but let me leave it at this.

Migrate to monolingual language packages

This pair currently embeds the following monolingual information:

  • apertium-afr-nld.nld.dix
  • apertium-afr-nld.afr.dix
  • apertium-afr-nld.nld.acx
  • apertium-afr-nld.afr.acx
  • apertium-afr-nld.post-nld.dix
  • apertium-afr-nld.post-afr.dix
  • nld-afr.prob
  • afr-nld.prob

These files should be imported from apertium-afr and apertium-nld. After doing this, testvoc will need to be done. It is recommended to do this on a branch and then merge to master once it is finished and the testvoc results are as good or better.

Vocabulary issues 2

nld: link (masc noun) --> afr: skakel (computer link)
nld: externe links --> afr: eksterne skakels
nld: zich (reflexive pronoun) --> afr: hom (Apertium gives sig; this word is very rare and obsolete)

Upgrade tagger to use vislcg too

So we can take care of examples such as this:

$ echo "zij zijn een man" | apertium -d . nld-afr-tagger
^prpers<prn><subj><p3><mf><pl>$ ^zijn<vbser><inf>$ ^een<det><ind><mf><sg>$ ^man<n><m><sg>$^.<sent>$

Related to #3.

to be / om te wees / om te zijn

<e><p><l>wees<s n="vbser"/></l><r>zijn<s n="vbser"/></r></p></e>

Is there a way to accommodate the following construction: zij zijn <-> hulle is

Vocabulary issues 5: Moezel

nld: verwateren (vb. erg, insep.) --> afr: afwater (vb. sep)
nld: afwateren (vb. abs, sep.) --> afr: tot die stroomgebied behoort
e.g.: dit gebied watert af op de Rijn --> hierdie gebied behoort tot die Ryn se stroomgebied
nld: Vogezen (name, pl) --> afr: Vogese (name, pl)
nld: uitmonden (vb. abs. sep.) --> afr: uitmond (vb. sep)
nld: na (prep) --> afr: ná (prep)
nld: naar (prep) --> afr: na (prep)
nld: Rijn (name masc) --> afr: Ryn (name)
nld: bovenloop (n. masc sg) --> afr: boloop (n. sg.)
nld: bovenlopen (n. pl.) --> afr: bolope (n. pl.)

relative pronouns

The simplest relative pronouns in Dutch are:

  • dat for neuter singular
  • die for masculine/feminine singular and for plural

In Afrikaans the equivalent is:

  • wat in all cases

Apertium currently translates nld: die into afr: dit (the personal pronoun) in the following sentence:

afr: Die gebou is opgebou uit stene wat in die son gedroog word. -- nld: Het gebouw is opgebouwd uit stenen die in de zon gedroogd worden.

Other examples:

afr: dit is 'n lae getal in vergelyking met die getal renosters wat gestroop word -- nld: dit is een laag getal in vergelijking met het aantal neushoorns dat gestroopt wordt.

afr: God gaan elke boom wat nie gesonde vrugte dra nie, afkap en in die vuur gooi -- nld: God zal iedere boom die geen gezonde vruchten draagt, kappen en op het vuur gooien.

afr: en is dit 'n verskynsel wat ons nog hoegenaamd ernstig behoort op te vat? -- nld: en is dit een verschijnsel dat we zelfs maar ernstig behoren op te vatten?
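
The mapping described in this issue is small enough to state as code. The Python sketch below is only an illustration with hypothetical function names; it covers just the simplest relative pronouns named above and assumes the gender and number of the antecedent are already known, which is the hard part in practice.

```python
def nld_relative_pronoun(neuter, plural):
    """Dutch: 'dat' for a neuter singular antecedent, 'die' otherwise."""
    if neuter and not plural:
        return "dat"
    return "die"


def afr_relative_pronoun():
    """Afrikaans uses 'wat' in all cases."""
    return "wat"
```

For the plural antecedent stenen this gives "die", and for the neuter singular verschijnsel it gives "dat". In the nld to afr direction both collapse to "wat".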

vocabulary issues 3

nld: gesteente (fem. noun), pl: gesteentes or gesteenten
afr: gesteente, pl: gesteentes (only)

This holds for most nouns ending in -te.

nld: tienduizendste (ordinal) -> afr: tienduisendste

nld: opgesloten -> afr: opgesluit

This is a participle of opsluit in verbal use. I don't think it is used much as an adjective.

nld: halveringstijd (noun masc) -> afr: halfleeftyd

plural halveringstijden -> halfleeftye

nld: methode (noun fm) -> afr: metode

nld pl: methodes, methoden -> afr: metodes

nld: recent (adj) -> afr: onlangs

In nld onlangs is purely an adverb; in afr it is also an adjective and is inflected as such.
