There are several issues with the font encoding switches for languages "imported" from

8-bit font encodings in *.ini files about babel HOT 7 OPEN

gmilde commented on September 4, 2024

8-bit font encodings in *.ini files

from babel.

Comments (7)

gmilde commented on September 4, 2024

If necessary, the preferred encoding for a language can be set by users, if the font is not a problem or for whatever reason they want.

Loading a font encoding in the document preamble can be interpreted as a statement that the document author wants to use this font encoding at some place in the document.

This is why I propose to switch the preferred font encoding for text parts in a "foreign" language if this font encoding is known. (Avoiding the font-encoding switch is as easy as deleting the respective font encoding from the list of "fontenc" arguments.)

The real limitation is the selected encoding must render all characters (thorn, eth, ogonek, schwa, eng, etc.).

This is why I propose to switch to an "ersatz" font encoding if the preferred font encoding is not known and the current font encoding is not in the list of compatible font encodings. If no compatible font encoding is declared, write a warning and try the current font encoding (maybe it works because the missing characters are not used or the document provides some other workaround). In case of a compilation error, the combination of the actual error message and the preceding warning can give the user sensible feedback.
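
The fallback order proposed above can be sketched roughly as follows. This is a Python illustration only, not babel code: the function name `choose_encoding` and its parameters are hypothetical, and "known" is taken here to mean "loaded in the preamble via fontenc".

```python
import warnings


def choose_encoding(preferred, compatible, loaded, current):
    """Pick a font encoding for a text part in a "foreign" language.

    preferred  -- the language's preferred encoding (hypothetical parameter)
    compatible -- encodings that work with drawbacks, in order of preference
    loaded     -- encodings declared in the preamble, e.g. via fontenc
    current    -- encoding in effect where the language switch happens
    """
    # 1. Switch to the preferred encoding if the document loaded it.
    if preferred in loaded:
        return preferred
    # 2. Keep the current encoding if it is known to be compatible.
    if current in compatible:
        return current
    # 3. Otherwise switch to the first loaded "ersatz" encoding.
    for enc in compatible:
        if enc in loaded:
            return enc
    # 4. Nothing suitable is loaded: warn and try the current encoding.
    warnings.warn(
        f"no suitable font encoding loaded; consider declaring {preferred}",
        stacklevel=2,
    )
    return current
```

With `preferred="T1"`, `compatible=["LY1"]`, `loaded={"OT1", "LY1"}` and `current="OT1"`, the sketch would select LY1; removing LY1 from `loaded` would leave OT1 in place and emit the warning.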


ivankokan commented on September 4, 2024

Some languages currently listed as supporting OT1 use characters or accents that are not supported by OT1. Examples are Polish and Lithuanian; there may be others where OT1 should be stripped from the "encodings" list.

@gmilde @jbezos I guess that Croatian is one of them. However, there are \dj and \DJ implemented in the old days to address this issue, so eventually I think that OT1 can be left in the encodings list:

encodings = T1 OT1 LY1

On the other hand, I am not familiar with LY1. Please clarify.


jbezos commented on September 4, 2024

@gmilde Selecting OT1 as the main encoding and therefore the preferred one effectively makes many other Latin encodings no-ops (T4 and T5 are exceptions), so I wonder if it makes sense, considering T1 (and IIRC, LY1) is a superset. But clearly OT1 cannot be included in the list for Polish and Lithuanian, because there is no glyph for the ogonek at all (there are other languages with ogonek, like Icelandic and Navajo).


gmilde commented on September 4, 2024

Selecting OT1 as the main encoding and therefore the preferred one effectively makes many other Latin encodings no-ops (T4 and T5 are exceptions), so I wonder if it makes sense, considering T1 (and IIRC, LY1) is a superset.

IMO, a font encoding switch suggests itself, if the current encoding only provides a subset of the required encoding.

For every language, we may distinguish two sets of encodings:¹

canonical (hyphenation works, drag-and-drop works, characters are correctly represented in print) and
substitute (no compilation errors but some characters are composites leading to omissions in hyphenation and possibly errors with drag-and-drop/search from the PDF).

We should consider a suitable way to represent these sets in the *.ini files.

Both sets may contain more than one font encoding with variations outside the letters actually used in the respective language.

We may use some term or external list for supersets, e.g.

"canonical" OT1 implies that all standard text encodings and also non-standard but ASCII-compatible ones are "canonical" too (https://hyphenation.org uses the qualifier "ASCII").
"canonical" T1 implies that all standard text encodings (as well as LY1 and probably some more) should work as "substitute" font encodings.

¹There is a grey zone when composite representations have wrong accent glyphs (like Romanian/Latvian characters with comma below in OT1) or misplaced accents (like the comma below in T1).

But clearly OT1 cannot be included in the list for Polish and Lithuanian, because there is no glyph for the ogonek at all (there are other languages with ogonek, like Icelandic and Navajo).

Currently, 173 babel/locale/*.ini files contain OT1 in the "encodings" list. All of them should be tested for OT1 compatibility. (There is no ogonek in Icelandic but thorn and eth fail with OT1, hyphenation requires T1.)
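
A first pass over the locale files could be automated. The sketch below assumes only that each *.ini file lists its font encodings on a line of the form `encodings = T1 OT1 LY1`, as quoted earlier in this thread; the helper names are made up for illustration, and the script only finds candidates, it does not test OT1 compatibility.

```python
import re
from pathlib import Path

# Matches a line such as "encodings = T1 OT1 LY1"
# (line format assumed from the example quoted in this thread).
ENCODINGS_LINE = re.compile(r"^[ \t]*encodings[ \t]*=[ \t]*(.*)$", re.MULTILINE)


def listed_encodings(ini_text):
    """Return the encodings named on the first "encodings = ..." line."""
    match = ENCODINGS_LINE.search(ini_text)
    return match.group(1).split() if match else []


def files_listing(encoding, locale_dir):
    """Return sorted paths of *.ini files below locale_dir that list `encoding`."""
    hits = []
    for path in sorted(Path(locale_dir).rglob("*.ini")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if encoding in listed_encodings(text):
            hits.append(path)
    return hits
```

Calling `files_listing("OT1", "babel/locale")` on a checkout would give the list of files to review, which can be compared against the count mentioned above.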

@ivankokan All non-ASCII chars used in Croatian (č ć ǆ đ ǉ ǌ š ž Č Ć Ǆ ǅ Đ Ǉ ǈ Ǌ ǋ Š Ž) work with OT1, while hyphenation requires T1. (The double letters are automatically decomposed here, but the legacy characters work fine in my example file.)
LY1 is an alternative to the T1 encoding developed by Y&Y and used in their commercial TeX implementation. "encguide.pdf" has an encoding table. For many western European languages it is a "canonical" encoding. For Croatian, it can be used as a "substitute" encoding (as can OT1, T2A and others).

IMO, Babel's on-the-fly/imported languages should

Switch to a canonical font encoding if one is declared in the document.
Otherwise, emit a warning (suggesting to declare one of [...]) and select a known substitute font encoding.
If no substitute font encoding is declared either, emit a warning (suggesting to declare one of [...] or at least [...]).


jbezos commented on September 4, 2024

A choice for the default behavior must be made – prioritize either font or hyphenation. There are ~50 fd files for QX vs. ~800 for T1, and the manual for babel-polish doesn’t even mention the former. The current rules prioritize fonts because a sudden change is usually meaningful, at the cost of some missing hyphens. The real limitation is the selected encoding must render all characters (thorn, eth, ogonek, schwa, eng, etc.). If necessary, the preferred encoding for a language can be set by users, if the font is not a problem or for whatever reason they want.


ivankokan commented on September 4, 2024

@ivankokan All non-ASCII chars used in Croatian (č ć ǆ đ ǉ ǌ š ž Č Ć Ǆ ǅ Đ Ǉ ǈ Ǌ ǋ Š Ž) work with OT1 while hyphenation requires T1.

@gmilde @jbezos So, the issue with Croatian and OT1 is 99 % with the hyphenation (the other 1 % is about missing Đđ which is at least handled somehow)? I leave you two to decide whether the OT1 should be excluded or not (indifferent on this matter but generally tend to be strict :D).


gmilde commented on September 4, 2024

@ivankokan

[...] missing Đđ which is at least handled somehow)?

Đđ are handled exactly like the other "adorned" characters: T1 has slots for pre-composed characters while in OT1 they are created by superposition of the base character and adornment (haček, acute, stroke, ...).
The same holds for, e.g., German umlauts (äöü) and French letters with grave and circumflex.
The legacy ligatures are converted to two characters (like in Unicode) already by "inputenc" (cf. utf8enc.dfu).

This makes T1 the preferred font encoding for these languages and OT1 a "compatibility font encoding" (it works with some drawbacks).
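
The Unicode analogy mentioned above can be checked with Python's standard `unicodedata` module. The snippet illustrates only the Unicode side (compatibility decomposition of the digraph characters, canonical decomposition of an accented letter); it does not reproduce what inputenc or any font encoding does.

```python
import unicodedata

# The legacy digraph characters have compatibility decompositions,
# so NFKC turns each of them into two ordinary letters.
digraphs = "ǆǉǌ"
print(unicodedata.normalize("NFKC", digraphs))  # prints "džljnj"

# An accented letter decomposes canonically into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", "č")
print([unicodedata.name(ch) for ch in decomposed])

# In Unicode, đ (d with stroke) has no decomposition and stays one code point;
# the superposition of base letter and stroke described for OT1 happens at the
# font-encoding level, not in Unicode normalization.
print(len(unicodedata.normalize("NFD", "đ")))
```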

