w3c / ilreq

Former repo for Indic Layout Requirements. See new repo at

Home Page: https://github.com/w3c/iip/

License: Other

HTML 97.14% CSS 2.86%
indic-scripts devanagari bengali hindi typography text-layout

ilreq's People

Contributors

plehegar, r12a, slata

ilreq's Issues

When does the ABNF work for Tamil consonant clusters?

The document largely gives the impression that the ABNF rules indicate what must be kept together for "text segmentation, line breaking, drop letter, letter spacing in horizontal text and vertical text representation".

However, is that true for Tamil? Consonant clusters in Tamil don't interact with left-positioned vowel signs in the same way as Devanagari or Bengali conjuncts. Here are some examples i took from the UDHR.

  1. in these words the left-positioned vowel appears between the two consonants in a cluster:

    யாவற்றையும்
    yāvaṟṟaiyum

    கௌரவத்தையும்
    kauravattaiyum

    அசிரத்தையும் அவற்றை
    acirattaiyum avaṟṟai

    ஏற்கப்பெற்று
    ēṟkappeṟṟu

    எல்லோரும்
    ellōrum

  2. in these the vowel shaping interacts only with the final consonant:

    செயல்களுக்கு
    ceyalkaḷukku

    கேட்டுக்
    kēṭṭuk

The table of examples of the ABNF doesn't include this type of cluster, only conjuncts such as க்ஷ, ஶ்ரீ , and ஸ்ரீ , which are special because they ligate.

So, given examples such as those in the list above, is it or is it not normal to keep consonant clusters together in Tamil for text segmentation, line breaking, drop letter, letter spacing in horizontal text and vertical text representation?
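
To make the question concrete, here is a minimal sketch (not taken from the ilreq document) that applies the `{CH}C[v][m]` rule literally to one of the Tamil words above and compares the result with a hypothetical alternative in which a consonant carrying a pulli always ends a unit. The code point ranges and both rule names are simplifying assumptions chosen for these examples only, and the sketch does not say which behaviour is correct for Tamil.

```python
import re

# Simplified Tamil character classes (assumed ranges, enough for these examples).
V = "[\u0B85-\u0B94]"   # independent vowels
C = "[\u0B95-\u0BB9]"   # consonants
v = "[\u0BBE-\u0BCC]"   # dependent vowel signs
H = "\u0BCD"            # virama (pulli)
m = "\u0B82"            # anusvara

# Literal reading of V[m] | {CH}C[v][m] | CH from the document.
ABNF_RULE = re.compile(f"{V}{m}?|(?:{C}{H})*{C}(?!{H}){v}?{m}?|{C}{H}|.", re.S)

# Hypothetical alternative: a consonant with pulli always ends a unit.
BREAK_AFTER_PULLI = re.compile(f"{V}{m}?|{C}{H}|{C}{v}?{m}?|.", re.S)


def segment(text, rule):
    """Return the list of units that the given rule finds in text."""
    return rule.findall(text)


word = "\u0B85\u0BB5\u0BB1\u0BCD\u0BB1\u0BC8"  # அவற்றை (avaṟṟai)
print(segment(word, ABNF_RULE))          # keeps ற்றை together as one unit
print(segment(word, BREAK_AFTER_PULLI))  # splits it into ற் and றை
```

Under the literal ABNF reading the cluster and its vowel sign form a single unit, even though the vowel sign is displayed between the two consonants; under the alternative the unit boundary falls after the pulli.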

Distinguish languages and scripts clearly

It seems currently in the doc, all discussions are by default based on languages, especially Hindi. However, as a doc of layout requirements, most issues are actually script-related and language-independent.

I suppose it'd be better if:

  • Languages and scripts are distinguished clearly. Even for words like "Gujarati", it should be clarified whether "the Gujarati language" or "the Gujarati script" is being talked about in a certain piece of text if it's unclear from the context.

  • Scripts should be discussed by default. Languages should be discussed only when there're language-dependent issues.

Issues related to 'Initial letter styling'

https://w3c.github.io/ilreq/#h_initial_letter_styling

I have just committed some editorial changes for section 3.3, but while reviewing it i also ran into the following issues, which i think need to be addressed:

[1] Just below fig 4 it says:

In examples of this kind

what kind is meant here?

[2]

In Indic scripts the top reference point is the hanging base line for those scripts that have one, and the bottom alignment point is the text after-edge.

This ignores scripts without a hanging baseline. I suggest the following:

In Indic scripts the top reference point is the hanging base line for those scripts that have one, and the mean/median line for those that don't, and the bottom alignment point is the text after-edge.

[3]

Initial letter wrap property is not applicable for Indian languages. No contour-filling is required in Indian languages.

I think the document should have links to explanations of what 'Initial letter wrap' and 'contour-filling' mean.

[4]
https://w3c.github.io/ilreq/#h_scripts_without_hanging_baseline

The explanatory text in the images (which is crucial to understanding the text) is very hard to read, because of its size. Please redesign the images.

[5]

or X1 to descent of the last line or line XN

The meaning of this is hard to decipher.

Sunken and raised initial letters

The sunken and raised initial letter are not preferred in Indian languages.

What does not preferred mean here? Unless I misunderstood the meaning, I have seen several instances of such letters, e.g.

[Images: initial1, initial2, initial3]

I18N-ISSUE-401: Clarify initial letter requirements and alignment points

[moving here from tracker]

http://www.w3.org/TR/2014/WD-ilreq-20141216/#first-letter
5.1 First Letter

This issue attempts to summarise various questions related to the First Letter section, which needs some clarification about requirements for positioning of highlighted 'initial letters'.

It currently says "in Devanagari the hanging baseline may be preferred".

This wording implies that in other cases the hanging baseline may not be preferred. If so, and if this refers to drop initial letters, what are the alternatives, and what are the font metrics or other considerations involved in alignment?

Are sunken and raised initial letters required, in addition to the drop initial? If so, what are the alignment points to be used for those?

See also Dave Cramer's email request (http://lists.w3.org/Archives/Public/public-i18n-indic/2014OctDec/0038.html)

I'm working on the CSS Line Layout specification [1], which covers
drop caps and other initial letter effects. I was very happy to see
examples of this in the Indic Layout Requirements document [2].

Can you clarify the bottom alignment point of the initial letter? The
text mentions the text-after-edge, but the illustrations are not
completely clear to me. Is there something in the font metrics that
would define this?

[1] http://dev.w3.org/csswg/css-inline/
[2] http://www.w3.org/TR/2014/WD-ilreq-20141216/#first-letter

The text "the primary connection point connects the text-after-edge of the initial letter with the text-after-edge of the nth line, but the secondary connection point connects the hanging baselines of the initial letter and the initial line" isn't very clear, and doesn't seem to clearly relate to what is shown in the pictures.

It will probably help to look through http://dev.w3.org/csswg/css-inline/#initial-letter-styling section 2, to get ideas for the topics to cover in ilreq. It would be good to note which requirements are and are not relevant to indic scripts, and what differences need to be taken into account in the way initial letters are positioned. (For example, is the formula at http://dev.w3.org/csswg/css-inline/#sizing-initial-letters relevant?)

Additional topics needed

By comparing http://w3c.github.io/typography/questionnaire.html with the ilreq document, i came up with the following, non-exhaustive, list of items that i think should be covered. The first (justification) is a hot topic currently for CSS.

Justification & line-end alignment : When text in a paragraph needs to have flush lines down both sides, does it follow the rules for your script? Does the script conform to a grid pattern? Does your script allow punctuation to hang outside the text box at the start or end of a line? Where adjustments are needed to make a line flush, how is that done? Do you shrink/stretch space between words and/or letters? Are word baselines stretched, as in Arabic?

Numbers and digits : Does the script have its own set of number digits? Does the numbering system use base-10, or some other type of base?

Counters, lists, etc : The CSS Counter Styles specification describes a limited set of simple and complex styles for counters to be used in list numbering, chapter heading numbering, etc. Are the details correct? We have another document that provides over 120 templates for user-defined counter styles in over 30 scripts. Are there more? Are there other aspects related to counters and lists that need to be addressed?

Quotations : What is the expected behaviour for quotation marks, especially when nested? Should block quotes be indented or handled specially?

Baselines & inline alignment : What are the requirements for baseline alignment between mixed scripts and in general?

Line decoration : Some aspects related to the drawing of lines alongside or through text involve local typographic considerations. For example, underlines need to be broken in special ways for some scripts, and the height of strike-through may vary depending on the script. What about vertical text?

Emphasis : Bold and italic are not always appropriate for expressing emphasis, and some scripts have their own unique ways of doing it, that are not in the Western tradition at all.

Other paragraph features : Is it normal to indent the start of a paragraph in indic scripts? Are there other features relating to paragraphs that should be mentioned?

Notes, footnotes, etc : How does the script deal with notes, footnotes, endnotes or other necessary annotations of this kind in the way needed for your culture?

Page layout and pagination : Some cultures define page areas and page progression direction very differently from those in the West. Is this an issue for you? Are widows and orphans relevant? Are there special conventions for page numbering, or the way that running headers and the like are handled in your culture?

Intrasyllabic character spacing

The draft says “in case of Indian language, the space needs to be introduced after each syllable for correct representation” and “letter spacing in all Indian languages must follow Indic Orthographic syllable definition.” Here is a counterexample: the Malayalam heading of https://archive.org/stream/englishmalayalam00tobirich#page/1/mode/1up spaces ⟨കോ⟩ into three pieces and ⟨ശം⟩ into two, although each is one syllable. The document’s style is to add space between base glyphs, regardless of syllable boundaries. This makes sense because, for example, the vowel sign ē has the same visual weight as the consonant ka.
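
As a rough illustration of the counterexample (a sketch, not part of the issue as filed): the glyph pieces that a font draws are not defined by Unicode, but for ⟨കോ⟩ the canonical decomposition happens to mirror the three pieces seen in the spaced heading. The helper name `pieces` is hypothetical.

```python
import unicodedata


def pieces(text):
    """Names of the code points in the canonical decomposition (NFD) of text."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFD", text)]


ko = "\u0D15\u0D4B"    # കോ : KA + VOWEL SIGN OO, stored as two code points
sham = "\u0D36\u0D02"  # ശം : SHA + ANUSVARA, stored as two code points

for name in pieces(ko):
    print(name)        # three entries: the consonant and the two vowel sign parts
for name in pieces(sham):
    print(name)        # two entries
```

A syllable-based spacing rule would treat each of these strings as one unit, whereas the heading cited above spaces them as three and two pieces respectively.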

i18n-ISSUE-407: Clarification of initial letter example

[moved here from tracker]

5.1 First Letter
http://www.w3.org/TR/2014/WD-ilreq-20141216/#first-letter

"Note how the vowel sign appears to the left of the first character, not the third. There are three grapheme clusters here. The first includes the SA+VIRAMA,THA+I and T+II. We see that the styling is done on the basis of the syllable, not the first character. A syllable includes a base consonant and any combination of the following characters in the text stream:"

This text is misleading when paired with figure 4 when it talks about 3 graphemes and there are 3 red circles. It also doesn't show first letter styling, as the text says, which is confusing. There is also an error in the romanization.

How about the following wording, based around the example at https://www.flickr.com/photos/ishida/16084553630/
I also suggest renaming the section to Initial Letter Styling, to match the CSS Inline spec

Indic script behavior in initial letter styling is based on syllables, rather than individual letter forms.

Figure 4 shows an example of a drop initial in Hindi. In the first word of the paragraph, स्कूल ('skūl'),
the sequence of characters stored in memory is as follows:

स ‎U+0938 DEVANAGARI LETTER SA
् ‎U+094D DEVANAGARI SIGN VIRAMA
क ‎U+0915 DEVANAGARI LETTER KA
ू ‎U+0942 DEVANAGARI VOWEL SIGN UU
ल ‎U+0932 DEVANAGARI LETTER LA

There are two syllables in this word: SA+VIRAMA+KA+UU and LA. Note, however, that there are three Unicode grapheme clusters here: SA+VIRAMA, KA+UU and LA.

Styling is done on the basis of the whole orthographic syllable, not the first character, nor even the first grapheme.

A syllable includes a base consonant and any combination of the following characters in the text stream:

  • sequences of consonants preceded by virama (i.e. conjuncts).
  • vowel signs
  • visarga, anusvara or candrabindu.

NOTE: The detailed definition of Indic syllables is given in section 2.
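
The distinction between grapheme clusters and orthographic syllables drawn above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an implementation of UAX #29 or of the ilreq definition: `mark_clusters` only approximates the grapheme clusters described in the text (a base character plus the combining marks that follow it), `is_consonant` covers only the main Devanagari consonant range, and all function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"


def is_consonant(ch):
    # Simplified assumption: the main Devanagari consonant range only.
    return "\u0915" <= ch <= "\u0939"


def mark_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def orthographic_syllables(text):
    """Join a cluster ending in virama to a following consonant-initial cluster."""
    syllables = []
    for cluster in mark_clusters(text):
        if syllables and syllables[-1].endswith(VIRAMA) and is_consonant(cluster[0]):
            syllables[-1] += cluster
        else:
            syllables.append(cluster)
    return syllables


word = "\u0938\u094D\u0915\u0942\u0932"  # स्कूल
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(mark_clusters(word))           # three units: SA+VIRAMA, KA+UU, LA
print(orthographic_syllables(word))  # two units: SA+VIRAMA+KA+UU, LA
```

For initial letter styling, the first element of the second list is the unit that would be highlighted.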

Here are some further examples of initial letter styling based on the Indic syllable definition.

...

An alternative would be to take the above text and put it at the bottom of section 3 Text Segmentation, as an illustration of the point made in the last paragraph ("text segmentation should be done as Indic syllable"). This is useful because it clearly distinguishes between grapheme cluster and syllabic units, and could be referred to from other sections, too, such as the section on vertical text.

And then simply say, at the start of section 5.1 that selection of initial letters uses the orthographic syllable as the unit, as illustrated in section 2, and then simply give some examples. The majority of section 5.1 could then focus on more specific requirements, such as what styles of highlighting are common, and what the alignment points, etc, are.

Introduction of ABNF in section 2.2 should be streamlined.

What does this sentence mean: "The motive principle for ABNF is to describe a formal system of a language to be used as a bidirectional communications protocol."?

The entire opening paragraph (above the ABNF expression) in section 2.2 does not contribute to the understanding of the Indic issues. ABNF is something that can simply be referenced as it is done in the third sentence.

In fact, the text would be improved if it read something like this (insertions shown with []):

Augmented Backus–Naur Form (ABNF) is a meta-language based on Backus–Naur Form (BNF), but consisting of its own syntax and derivation rules. The motive principle for ABNF is to describe a formal system of a language to be used as a bidirectional communications protocol.
The linguistic definition of Indic orthographic syllable has been mapped to [the following] ABNF (Augmented Backus–Naur Form)[.]

V[m] |{CH}C[v][m]|CH

[This definition of an Indic orthographic syllable may be used] for the purpose of text segmentation, line breaking , drop letter, letter spacing in horizontal text and vertical text representation. The definition has been elaborated , taking Hindi as an example. [using examples from various Indic scripts in the table below.]

new-updated-ilreq

<title>Indic Layout Requirements</title> <script src='http://www.w3.org/Tools/respec/respec-w3c-common' async class='remove'></script> <style type="text/css"> @import url(http://fonts.googleapis.com/earlyaccess/notosansdevanagari.css); :lang(hi) { font-family:"Devanagari MT", "Kokila", "Noto Sans Devanagari", "Devanagari Sangam MN", Mangal, sans-serif; } .TableGen { margin:0px; padding:0px; width:70%; box-shadow: 10px 10px 5px #888888; border:1px solid #000000; } .TableGen table { width:100%; height:100%; margin:0px; padding:0px; } .TableGen tr:nth-child(odd) { background-color:#BBDDFF; } .TableGen tr:nth-child(even) { background-color:#ffffff; } .TableGen td { vertical-align:middle; border:1px solid #000000; border-width:0px 1px 1px 0px; text-align:left; font-size:18px; font-family:Arial, Helvetica, sans-serif; font-weight:normal; color:#000000; } .TableGen tr:last-child td { border-width:0px 1px 0px 0px; } .TableGen tr td:last-child { border-width:0px 0px 1px 0px; } .TableGen tr:last-child td:last-child { border-width:0px 0px 0px 0px; } .TableGen tr:first-child td { background-color:#0080C0; border:0px solid #000000; text-align:center; border-width:0px 0px 1px 1px; font-size:18px; font-family:Arial, Helvetica, sans-serif; font-weight:bold; color:#ffffff; } .TableGen tr:first-child:hover td { background-color:#0080C0; } .TableGen tr:first-child td:first-child { border-width:0px 0px 1px 0px; } .TableGen tr:first-child td:last-child { border-width:0px 0px 1px 1px; } .tab-format1 table { border:1px solid #000000; border-width:1px 1px 1px 1px; padding:2px; } .tab-format1 td { vertical-align:middle; border:1px solid #000000; border-width:1px 1px 1px 1px; text-align:left; padding:5px; } em.rfc2119 { text-transform: lowercase; font-variant: small-caps; font-style: normal; color: #900; } h1 acronym, h2 acronym, h3 acronym, h4 acronym, h5 acronym, h6 acronym, a acronym, h1 abbr, h2 abbr, h3 abbr, h4 abbr, h5 abbr, h6 abbr, a abbr { border: none; } dfn { font-weight: bold; 
}
a.internalDFN { color: inherit; border-bottom: 1px solid #99c; text-decoration: none; }
a.externalDFN { color: inherit; border-bottom: 1px dotted #ccc; text-decoration: none; }
a.bibref { text-decoration: none; }
cite .bibref { font-style: normal; }
code { color: #C83500; }
/* --- TOC --- */
.toc a, .tof a { text-decoration: none; }
a .secno, a .figno { color: #000; }
ul.tof, ol.tof { list-style: none outside none; }
.caption { margin-top: 0.5em; font-style: italic; }
/* --- TABLE --- */
table.simple { border-spacing: 0; border-collapse: collapse; border-bottom: 3px solid #005a9c; }
.simple th { background: #005a9c; color: #fff; padding: 3px 5px; text-align: left; }
.simple th[scope="row"] { background: inherit; color: inherit; border-top: 1px solid #ddd; }
.simple td { padding: 3px 10px; border-top: 1px solid #ddd; }
.simple tr:nth-child(even) { background: #f0f6ff; }
/* --- DL --- */
.section dd > p:first-child { margin-top: 0; }
.section dd > p:last-child { margin-bottom: 0; }
.section dd { margin-bottom: 1em; }
.section dl.attrs dd, .section dl.eldef dd { margin-bottom: 0; }
@media print { .removeOnSave { display: none; } }
</style>
<script class="remove">
  var respecConfig = {
      // specification status (e.g. WD, LCWD, WG-NOTE, etc.). If in doubt use ED.
      specStatus:           "ED",
      noRecTrack:           true,
      // the specification's short name, as in http://www.w3.org/TR/short-name/
      shortName:            "ilreq",

      // if your specification has a subtitle that goes below the main
      // formal title, define it here
      // subtitle   :  "an excellent document",

      // if you wish the publication date to be other than the last modification, set this
      //publishDate:  "2014-12-16",

      // if the specification's copyright date is a range of years, specify
      // the start date here:
      // copyrightStart: "2005"

      // if there is a previously published draft, uncomment this and set its YYYY-MM-DD date
      // and its maturity status
      // previousPublishDate:  "1977-03-15",
      // previousMaturity:  "WD",

      // if there a publicly available Editor's Draft, this is the link
      edDraftURI:   "http://www.w3.org/International/docs/indic-layout/",

      // if this is a LCWD, uncomment and set the end of its review period
      // lcEnd: "2009-08-05",

      // editors, add as many as you like
      // only "name" is required
      editors:  [
          {
              name:       "Swaran Lata",
              mailto:     "[email protected]",
              company:    "DeitY"
          },
          {
              name:       "Somnath Chandra",
              mailto:     "[email protected]",
              company:    "DeitY"
          },
          {
              name:       "Prashant Verma",
              mailto:     "[email protected]",
              company:    "Web Standardization Initiative, DeitY"
          }
      ],


      // name of the WG
      wg:           "Internationalization Working Group",

      // URI of the public WG page
      wgURI:         "http://www.w3.org/International/core/",

      // name (without the @w3c.org) of the public mailing to which comments are due
      wgPublicList: "public-i18n-indic",

      // URI of the patent status for this WG, for Rec-track documents
      // !!!! IMPORTANT !!!!
      // This is important for Rec-track documents, do not copy a patent URI from a random
      // document unless you know what you're doing. If in doubt ask your friendly neighbourhood
      // Team Contact.
      wgPatentURI:  "http://www.w3.org/2004/01/pp-impl/32113/status",
      // !!!! IMPORTANT !!!! MAKE THE ABOVE BLINK IN YOUR HEAD
       localBiblio: {
    "Code-Charts": {
        title: "Unicode Code Charts",
        href: "http://www.unicode.org/charts/",

    },
    "Evolution-of-Indic-Scripts": {
        title: "Indic Scripts",
        href: "http://www.ciillibrary.org/Sites/Photography/PhotographyHome.html",

    },
    "CLDR": {
        title: "Unicode CLDR",
        href: "http://cldr.unicode.org",

    },
    "South-Asian-Scripts": {
        title: "Unicode Technical note#10 : South Asian Scripts",
        href: "http://www.unicode.org/notes/tn10/",

    },

    "UAX29": {
        title: "Grapheme Cluster boundaries",
        href: "http://www.unicode.org/reports/tr29/",

    },
    "UAX14": {
        title: "Unicode Line Breaking Algorithm",
        href: "http://www.unicode.org/reports/tr14/",

    },

    "Normalization": {
        title: "Unicode Normalization",
        href: "http://unicode.org/reports/tr15/",

    },
    "Draft-Script-Grammar": {
        title: "Draft-Scrip-Grammer Devanagari",
        href: "http://tdil-dc.in/index.php?option=com_vertical&task=view-article&article_id=149&lang=en",

    }


                    }
  };
</script>
</head>
<body id="respecDocument" role="document" class="h-entry">

This document describes the basic layout requirements for displaying text in Indian languages. It discusses some of the major layout requirements for the first-letter pseudo-element, vertical arrangement of characters, letter spacing, text segmentation, line breaking and collation rules in Indic languages.

The minimal requirements presented in this document for Indian-language text layout will also be used in e-publishing and the CSS standard. This document covers major issues of e-content in Indian languages in order to create a standardized text layout format that addresses storage, rendering problems, vertical writing, letter spacing, collation, line breaking etc.

It also defines an ABNF (Augmented Backus–Naur Form) based valid segmentation, the Indic orthographic syllable, in order to get proper display in browsers. The text segmentation [[!UAX29]] and line breaking [[!UAX14]] algorithms are considered in detail. The CSS and digital publication standards will benefit from this document.

This document describes the basic requirements for Indic script layout and text support on the Web and in eBooks. These requirements provide information for Web technologies such as CSS, HTML and SVG about how to support users of Indic scripts. The current document focuses on Devanagari, but there are plans to widen the scope to encompass additional Indian scripts as time goes on.

The editor's draft of this document is being developed by the Indic Layout Task Force, part of the W3C Internationalization Interest Group. It is published by the Internationalization Working Group. The end target for this document is a Working Group Note.

Introduction

Indian language complexities

India has large linguistic diversity, with 22 constitutionally recognized languages and 12 scripts. This document is currently focused on the Devanagari script. The expectation is that over time its scope will widen to cover additional major scripts from the list below.

The mapping between languages and scripts is complex. Multiple languages may have common scripts, while a language can be written in multiple scripts. Each language and script is unique in nature and cannot be easily replicated, even if they share common characteristics. Orthographic changes may also occur in some languages, and the adoption of a new orthography is a gradual process, which poses additional challenges.

Serial No. Language Script
1 Hindi Devanagari
2 Sanskrit Devanagari
3 Marathi Devanagari
4 Konkani Devanagari
5 Nepali Devanagari
6 Maithili Devanagari
7 Sindhi Devanagari, Perso-Arabic
8 Bodo Devanagari
9 Dogri Devanagari
10 Bengali Bengali
11 Assamese Bengali
12 Manipuri Bengali, Meetei (Mayak)
13 Gujarati Gujarati
14 Kannada Kannada
15 Malayalam Malayalam
16 Odia Odia
17 Punjabi Gurmukhi
18 Tamil Tamil
19 Telugu Telugu
20 Urdu Perso-Arabic
21 Santhali Ol-Chiki, Devanagari
22 Kashmiri Devanagari, Perso-Arabic

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letter forms. They are all abugidas in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). The North Indian branch of scripts was, like Brahmi itself, mainly used to write Indo-European languages such as Pali and Sanskrit, and eventually the Hindi, Bengali, and Gujarati languages, though it was also the source for scripts for non-Indo-European languages such as Tibetan, Mongolian, and Lepcha. The South Indian scripts are also derived from Brahmi and, therefore, share many similarities in structural characteristics. For more details visit [[!South-Asian-Scripts]].

The following figure shows the evolution of Indian scripts over time from the Brahmi script.

[Figure: Evolution of Indic Scripts] Development of Indian Scripts. For more details visit [[!Evolution-of-Indic-Scripts]]

Basic components of Indian languages

Unicode & CLDR

Unicode is the universal character encoding standard, used for representing text for information processing. Unicode encodes all of the individual characters used for all the written languages of the world. The standard provides information about the characters and their use.

The Common Locale Data Repository (CLDR) is the largest standard repository of locale data in the world. It is managed by the Unicode Consortium. It provides locale data in an XML format for use in computer applications. It facilitates locale-related information sharing among applications regardless of their domains. Its goal is to provide basic linguistic information for diverse “locales” in an open, interoperable form.

This data is usable for localizing applications.

Some examples of the information that CLDR gathers for languages and territories are:

  • Date formats
  • Time Zones
  • Number formats
  • Currency and its formats
  • Measurement Systems
  • Collation (Sort order) Specification: Sorting, Searching and Matching
  • Translations of names for language, territory, script, time zones, currencies
  • Script and exemplar characters used by a language
  • Calendaring rules, Formats and important dates.

Reference URL: [[!CLDR]]

Unicode Normalization

Unicode normalization [[!UAX15]] is a form of text normalization that transforms equivalent sequences of characters into the same representation. It is important in Unicode text processing applications because it affects the semantics of comparing, searching, and sorting Unicode sequences.

When a unique representation is required, a normalized form of Unicode text can be used to eliminate unwanted distinctions. The key part of normalization is to provide a unique canonical order for visually indistinguishable sequences of combining characters.
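
The effect of normalization on Devanagari text can be sketched with Python's standard `unicodedata` module. The helper `canonically_equivalent` and the sample strings are illustrative assumptions, not part of the draft; the sketch only shows that differently encoded sequences compare equal once normalized, regardless of how a given font renders them.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


qa_single = "\u0958"          # क़ encoded as one code point
qa_sequence = "\u0915\u093C"  # KA followed by NUKTA

nukta_first = "\u0915\u093C\u094D"   # KA + NUKTA + VIRAMA
virama_first = "\u0915\u094D\u093C"  # KA + VIRAMA + NUKTA

print(qa_single == qa_sequence)                          # False: raw strings differ
print(canonically_equivalent(qa_single, qa_sequence))    # True after normalization
print(canonically_equivalent(nukta_first, virama_first)) # True: marks are reordered
```

Comparing, searching, or sorting the raw strings would treat each pair as different, which is why the text is normalized first.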

Canonical & Compatibility Equivalence

Unicode contains numerous characters to maintain compatibility with existing standards, some of which are functionally equivalent to other characters or sequences of characters. Because of this, Unicode defines some code point sequences as equivalent. Unicode provides two notions of equivalence: canonical and compatibility.

Canonical equivalence holds between characters or sequences of characters that are visually and functionally equivalent.

The following figure shows the canonical equivalence:

[Figure: Canonical equivalence in Hindi]

Unicode Code charts- Devanagari & Devanagari Extended

The following Unicode character code chart is taken from The Unicode Standard, Version 7.0:

[Figure: Unicode Devanagari and Devanagari Extended code chart]

The Unicode code charts for other Indic scripts are available at [[!Code-Charts]]

Character Set for Hindi

This section presents the basic alphabetic system of the Devanagari script as used for Hindi: consonants, vowels, modifiers, matras, virama (halant), nukta etc.

Consonant set
क़ ख़
ग़
ज़
ड़ ढ़
फ़
Vowel set
Modifiers
ं - Anusvara: Anusvara, an archinasal, is denoted by a dot above the letter after which it is to be pronounced. It falls under the nasal category.
ँ - Candrabindu: Candrabindu is pure nasalization, as air comes from the nose. It is denoted by a breve with a dot superposed above the letter after which it is to be pronounced. It falls under the nasal category.
ः - Visarga: Visarga (sending forth) is denoted by two dots placed one above the other.
ऽ - Avagraha: Used for extra length with long vowels, as seen in Sanskrit text.
Matras
ि
Virama(्)
Virama is used in most writing systems to signify the lack of an inherent vowel.
Nukta(़)
Nukta is used in Hindi

For more information see [[!Draft-Script-Grammar]]

Indic orthographic syllable boundaries (ABNF valid segmentation: proposed solution for layout issues in Indian languages)

Need for ABNF valid segmentation

An Indic orthographic syllable definition based on ABNF valid segmentation is provided here for correct and standardized layout of Indian-language text. It addresses various issues mentioned in the following sections.

This definition will be useful for obtaining uniform display of Indic layout in browsers, applications, digital publishing etc.

ABNF based definition of Indic orthographic syllable

Augmented Backus–Naur Form (ABNF) is a meta-language based on Backus–Naur Form (BNF), but consisting of its own syntax and derivation rules. The motive principle for ABNF is to describe a formal system of a language to be used as a bidirectional communications protocol.

V[m] |{CH}C[v][m]|CH


The linguistic definition of the Indic orthographic syllable has been mapped to ABNF (Augmented Backus–Naur Form) for the purpose of text segmentation, line breaking, drop letter, letter spacing in horizontal text and vertical text representation. The definition is elaborated below, taking Hindi as an example.

The definition is a combination of 3 rules :

Rule 1 : V[m]

Rule 2 : {CH}C[v][m]

Rule 3 : CH (This rule is applicable only at the end of the word)

V(upper case) is independent vowel

m is modifier(Anusvara/Visarga/Chandrabindu)

C is a consonant which may or may not include a single nukta

v (lower case) is any dependent vowel or vowel sign (mātrā)

H is Virama

| is a rule separator

[ ] - The enclosed item is optional

{} - The enclosed item or items occur zero or more times

Various Use cases of ABNF based Indic orthographic syllable definition for Hindi language as example

Rule 1 : V[m]

Sl. No. Examples Definition
1 अ, ई, उ V (Vowel) is a syllable
2 अं, उँ, आः V+ Modifier is a syllable

Rule 2 : {CH}C[v][m]

Sl. No. Examples Definition
1 र, क, ज, ल, म Consonant is a syllable
2 प्प, क्ख,च्त, ज्ज्व, त्क्ल, त्स्न Zero or more Consonant + Virama sequences followed by consonant is a syllable
3 र्त, र्त्स, र्त्स्न, र्त्स्न्य, फ़्क़ Zero or more Consonant (Nukta) +Virama  followed by consonant is a syllable
4 र्ता, र्त्स्न्या, फ़्जी, क्या Zero or more consonant+ (Nukta)+ virāma sequences followed by a consonant (+Nukta) followed by a vowel sign is a syllable
5 तः,स्तं, स्त्रँ, स्तः, फ़्ज़ँ  zero or more consonant+ (Nukta)+ virāma sequences followed by a consonant (+Nukta) followed by modifier is a syllable
6 र्त्स्न्या: त्स्न्युं, त्स्न्युँ, फ़्ज़ें,हिं zero or more consonant+ (Nukta)+ virāma sequences followed by a consonant (+Nukta) followed by a vowel sign and modifier is a syllable
7 स्थि, ज्जि, ख्वा Zero or more Consonant +Virama sequences followed by a consonant and vowel sign is a syllable

Rule 3 : CH

त्, व्, म्, भ् etc. are syllables in Hindi only at the end of a word

Examples of combination of the rules :

1. स्वागतम् - CHCv + C + C + CH has the following syllables:

स्वा CHCv
ग C
त C
म् CH

2. भरतनाट्यम - C + C + C + Cv + CHC + C has the following syllables:

भ C
र C
त C
ना Cv
ट्य CHC
म C
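
The three rules and the two worked examples above can be checked mechanically. The following is a minimal sketch under stated assumptions, not the document's reference implementation: each Devanagari code point is mapped to one symbol of the rule, and a regular expression over the resulting symbol string applies `V[m] | {CH}C[v][m] | CH`. The code point ranges are simplified, the names `char_class` and `syllables` are hypothetical, and the sketch accepts `CH` wherever no consonant follows rather than strictly at the end of a word.

```python
import re


def char_class(ch):
    """Map one Devanagari code point to a symbol of the rule (simplified ranges)."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "V"   # independent vowel
    if 0x0915 <= cp <= 0x0939 or 0x0958 <= cp <= 0x095F:
        return "C"   # consonant
    if cp == 0x093C:
        return "N"   # nukta, treated as part of the preceding consonant
    if 0x093E <= cp <= 0x094C:
        return "v"   # dependent vowel sign (matra)
    if cp == 0x094D:
        return "H"   # virama
    if 0x0901 <= cp <= 0x0903:
        return "m"   # candrabindu, anusvara, visarga
    return "x"       # anything else


# V[m] | {CH}C[v][m] | CH, with an optional nukta after each consonant.
# The lookahead stops rule 2 from stranding a virama that follows its last consonant.
SYLLABLE = re.compile(r"Vm?|(?:CN?H)*CN?(?!N?H)v?m?|CN?H|.", re.S)


def syllables(text):
    """Return (syllable, pattern) pairs; the pattern omits nukta, as in the tables."""
    classes = "".join(char_class(ch) for ch in text)
    result = []
    for match in SYLLABLE.finditer(classes):
        start, end = match.span()
        result.append((text[start:end], match.group().replace("N", "")))
    return result


swagatam = "\u0938\u094D\u0935\u093E\u0917\u0924\u092E\u094D"          # स्वागतम्
bharatanatyam = "\u092D\u0930\u0924\u0928\u093E\u091F\u094D\u092F\u092E"  # भरतनाट्यम

print(" + ".join(p for _, p in syllables(swagatam)))       # CHCv + C + C + CH
print(" + ".join(p for _, p in syllables(bharatanatyam)))  # C + C + C + Cv + CHC + C
```

Working on the symbol string rather than on the characters keeps the regular expression close to the written rule, and the same function can be pointed at another script by swapping the ranges in `char_class`.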

Use cases of ABNF based Indic orthographic syllable definition for other languages

Rule 1 : V[m]

Sl. No. Examples Definition
1. ಅ , అ, ఇ V (Vowel) is a syllable
2 అం, ఆః V+ Modifier is a syllable

Rule 2 : {CH}C[v][m]

Sl. No. Examples Definition
1 ర, క , ರ, ಕ Consonant is a syllable
2 క్ఖ, చ్త , ಪ್ಪ, ಕ್ಖ Zero or more consonant + virama sequences followed by a consonant is a syllable
3 ర్త్స్న, ర్త , ರ್ತ್ಸ್ನ್ಯ Zero or more consonant (+ nukta) + virama sequences followed by a consonant (+ nukta) is a syllable
4 ర్తా , ర్త్స్న్యా , ಕ್ಯಾ Zero or more consonant (+ nukta) + virama sequences followed by a consonant (+ nukta) and a vowel sign is a syllable
5 తః , స్తం , ಸ್ತಃ, ಹಿಂ Zero or more consonant (+ nukta) + virama sequences followed by a consonant (+ nukta) and a modifier is a syllable
6 హిం Zero or more consonant (+ nukta) + virama sequences followed by a consonant (+ nukta), a vowel sign and a modifier is a syllable
7 స్థి , జ్జి , ఖ్వా , ಖ್ವಾ Zero or more consonant + virama sequences followed by a consonant and a vowel sign is a syllable

Rule 3 : CH

Examples of combination of the rules :

1. స్వాగతమ్ , ಸ್ವಾಗತಮ್ - CHCv + C + C + CH has the following syllables :

స్వా , ಸ್ವಾ CHCv
గ , ಗ C
త , ತ C
మ్ , ಮ್ CH

Text segmentation

A string of Unicode-encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user.

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection, or “move to next word” control-arrow keys), and “Whole Word Search” for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Indic text also has some special sentence boundaries, such as the double poorna virama, possibly accompanied by numbers (as in Sanskrit text, shlokas, etc.).

Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for initial-letter styling, and counting “character” positions within text. [[!UAX29]]

Solution for word boundaries:
User-perceived character boundaries should be based on tailored grapheme cluster boundaries that conform to the Indic orthographic syllable definition.

In Devanagari, the phrase separator called danda or purnaviram (।) and the double danda (।।, used to mark the end of a verse) are handled inconsistently: on double-click, some browsers select the final word together with the purnaviram, while others select the danda as a separate unit. It is recommended that a line should not begin with a purnaviram/danda or a double danda. The danda should therefore have the same properties as the full stop and other punctuation marks, so that a new line never begins with a danda or double danda.

For other characters, text segmentation should follow the Indic orthographic syllable.

Indic script behavior in initial letter styling is based on syllables, rather than individual letter forms.

The figure above shows an example of a drop initial in Hindi. The first word of the paragraph, स्कूल ('skūl'), is stored in memory as the character sequence SA, VIRAMA, KA, UU, LA.

There are two syllables in this word: SA+VIRAMA+KA+UU and LA. Note, however, that there are three Unicode grapheme clusters here: SA+VIRAMA, KA+UU and LA.
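The stored sequence can be inspected directly. The following sketch uses only Python's standard `unicodedata` module to list the code points of स्कूल; the variable names are illustrative.

```python
import unicodedata

word = "\u0938\u094D\u0915\u0942\u0932"  # स्कूल

# One Unicode character name per stored code point, in memory order.
names = [unicodedata.name(ch) for ch in word]

for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}  {name}")
```

The five code points form two orthographic syllables, so any initial-letter selection has to group the first four together.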

Styling is done on the basis of the whole orthographic syllable, not the first character, nor even the first grapheme.

A syllable includes a base consonant and any combination of the following characters in the text stream:

  • sequences of consonants preceded by virama (i.e. conjuncts).
  • vowel signs
  • visarga, anusvara or candrabindu.
</section>

Line breaking

When inline-level content is laid out into lines, it is broken across line boxes. Such a break is called a line break. In most writing systems, in the absence of hyphenation a line break occurs only at word boundaries. Many writing systems use spaces or punctuation to explicitly separate words, and line break opportunities can be identified by these characters. Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area.

Hyphenation

Hyphenation occurs in several different cases; some of them are given below :

Case 1 : Hyphens are commonly used in copulative compound words in Hindi. Hindi also has prefixes and suffixes that are joined to words with a hyphen.

नर- नारी, लाभ- हानि, माता-पिता, ऊंच - नीच

Case 2: A single word can break at the end of a line at an Indic orthographic syllable boundary, using a hyphen.

In the screenshot below, the breaks in the words आकर्षण and विज्ञापन do not follow the Indic orthographic syllable definition in some browsers.


Example of Line breaking

Guiding principles of Line breaking for Indian languages

In Indic writing systems, it is preferred that lines break at word boundaries. Where a break inside a word is required, the following principles may be followed :

Rule 1: A new line cannot begin with the following symbols or punctuation marks. They should be retained with the text they are associated with.

  • Closing brackets
  • Devanagari Danda /Purnaviram
  • Commas
  • Visarga
  • Decimal symbols
  • Semicolon
  • Combinations of punctuation marks, such as a semicolon with closing brackets, a semicolon with single or double quotes, or closing brackets with a comma or semicolon
  • Mathematical operators

Rule 2: The definition of the Indic orthographic syllable may be used to break the line, and a hyphen should be placed at the breaking point so that the word can be read intuitively.

Rule 3: Hyphenated words can be broken at the hyphen, e.g.:

  • नर-नारी should be treated as:
  • नर- on the first line and नारी on the next line

Rule 4: An expression containing a mathematical symbol should be treated as a single unit, so that the expression does not break at an operator at the end of a line.

Rule 5: Breaking should not be allowed inside numerical values such as currency values, years, etc., e.g.

“100.00” or “10,000”, nor in “12:59”
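Rules 1 and 5 can be sketched as a simple check on a candidate break position. This is an illustrative sketch, not a line breaker: the function name `break_is_forbidden`, the exact set of characters in `NO_LINE_START`, and the numeric pattern are assumptions, and the check covers only a subset of the symbols listed in Rule 1.

```python
import re

# Subset of Rule 1: characters that must not start a line (assumed set):
# closing brackets, danda, double danda, comma, semicolon, visarga.
NO_LINE_START = set(")]}\u0964\u0965,;\u0903")

# Rule 5: digit runs optionally joined by '.', ',' or ':' (e.g. 10,000 or 12:59).
NUMERIC = re.compile(r"\d+(?:[.,:]\d+)*")

def break_is_forbidden(text: str, i: int) -> bool:
    """Return True if a line break immediately before text[i] violates
    Rule 1 or Rule 5. Expects 0 < i < len(text)."""
    if text[i] in NO_LINE_START:
        return True
    # A break strictly inside a numeric value is not allowed.
    return any(mt.start() < i < mt.end() for mt in NUMERIC.finditer(text))

print(break_is_forbidden("राम।", 3))     # True: the line would start with a danda
print(break_is_forbidden("10,000", 3))   # True: inside a numeric value
print(break_is_forbidden("नर नारी", 3))  # False
```

A result of `False` only means that these two rules do not object; whether the position is an actual break opportunity still depends on word and syllable boundaries.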

Requirements for Indic Layout

Initial letter styling

Drop initial is a typographic effect emphasizing the initial letter(s) of a block element with a presentation similar to a 'floated' element.

Selecting initial letters

<p>Initial letters are typically a single letter, which can be selected by the ::first-letter pseudo-element. But the drop initial letter in Indic scripts must be selected on the basis of orthographic syllables, rather than individual letter forms (see the example at the end of section 3, Text segmentation). The orthographic syllable may be a single consonant or vowel, or a combination of Unicode code points. A detailed definition of Indic syllables can be found in section 2, Indic Syllable boundaries. In Indian languages the size of the initial letter is determined by the number of lines between the top line of the syllable and the lowest point of the orthographic syllable cluster, where subjoined consonants and other diacritics appear.</p></section>
  <section>
<h4>Typical drop initial usage in Indic scripts</h4>
<p>Most of the Indic drop initial letters in magazines and newspapers use 2 to 4 line drops. Some examples are shown below.</p>
  <figure> <img src="images/dropcap-example2.png"/> 
      <figcaption>Examples of Indic Initial letters</figcaption>
    </figure>
<p>Sunken and raised initial letters are not preferred in Indian languages. In examples of this kind, reference points on the drop cap must align precisely with reference points in the text. In Indic scripts the top reference point is the hanging baseline for those scripts that have one, and the bottom alignment point is the text after-edge.</p>
<p>The initial-letter wrap property is not applicable to Indian languages. No contour-filling is required in Indian languages.</p>
<p>In India, the top line of the non-highlighted characters is commonly aligned with the top of the thicker top line of the initial letter. In some examples the top lines of the initial letter and the following letters don't touch. This is due to the varying technologies and formats used by publishers. It is preferred that the top lines of the initial letter and the neighbouring text touch. Here are some additional examples of initial highlighted letters and drop letters based on the Indic syllable definition.</p>
</section>    




      <table>
  <tr>
<td><img src="images/I-letter2.png" alt="Bengali example" /></td>
 <td><img src="images/I-letter3.png" alt="Tamil example"/></td>
</tr>
<tr>
<td><img src="images/I-letter1.png" width="526" height="338" alt="Bengali example" /></td>
 <td><img src="images/I-letter5.png" alt="Malayalam example"/></td>
</tr>
 <tr>
<td><img src="images/I-letter4.png" alt="Odia example"/></td>
 <td><img src="images/I-letter6.png" alt="Marathi example" width="555" height="255"/></td>
</tr>

The remainder of this section describes the detailed rules for the placement and alignment of Hindi characters with initial letter styling relative to the adjacent text.

Alignment of Initial letter of Indic scripts with hanging baseline

For Indian languages that use a hanging baseline, such as Hindi, Bengali, Gujarati, Marathi and Punjabi, the part between the hanging baseline and the ascent of the initial letter may follow the mechanism below :

Figure: Rule for Indic scripts with a hanging baseline

Where n=h/2

In Indic scripts that have a hanging baseline, the top alignment point is the hanging baseline and the bottom alignment point is the text-after-edge; the initial letter and the first line of text should share the same alignment.

Scripts that don't have a hanging baseline, such as Kannada, Tamil, Telugu, Malayalam and Odia

Publishers in India commonly use the following rules for such scripts :

  • The selection of the Initial letter is based on the Indic orthographic syllable described in section 2.
  • The ascent of the first non-highlighted line is level with the median (mean) line of the initial letter, as shown below :

Figure: Rule for South Indian languages

Based on above observations the general rule for South Indian languages will be :

  • ¼ of the total drop cap height projects above the ascent of the first line.
  • ¾ of the total drop cap height extends below the ascent of the first line (X1), down to the descent of the last line (XN).

            <h3>Initial Letter box formatting in Indian languages</h3>
        <p>Indian publishers commonly use boxes of different heights and characters of different sizes. It is proposed, however, that the syllable within the box be centre-aligned with reference to the box parameters, as shown in the figure below :</p>
       <figure> <img  src="images/initial-letter-box.png" alt="examples of Indic Initial letters within box" />
          <figcaption>Examples of Indic Initial letters within box</figcaption>
        </figure>
    
  <section>

<h3>Letter Spacing</h3>
<p>In styling that involves horizontal spacing, such as C E R T I F I C A T E, English text is spaced between every character. In Indian languages, the space needs to be introduced after each syllable for correct representation.</p>
<p>For letter spacing in Indian languages it is recommended that spacing follow the Indic orthographic syllable definition.</p>
<p>Here are some examples of letter spacing based on that definition :</p>
1.  अं त र्रा ष्ट्री य क र ण , అం త ర్ రా ష్ట్రీ య క ర ణ , ಅಂ ತ ರ್ ರಾ ಷ್ಟ್ರೀ ಯ ಕ ರ ಣ<br />
2.  स्वा ग त म् , స్వా గ త మ్ , ಸ್ವಾ ಗ ತಂ / ಸ್ವಾ ಗ ತ ಮ್<br />
3.  सु स ज्जि त , సు స జ్జి త , ಸು ಸ ಜ್ಜಿ ತ<br />
4.  स म्प्र ति, సం ప్ర తి ,  ಸಂ ಪ್ರ ತಿ</section>
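Syllable-based letter spacing can be illustrated by inserting a gap between orthographic syllables rather than between code points. The sketch below does this for Telugu; the character ranges, the name `letter_space`, and the use of a plain space as the gap are assumptions for illustration, and a real renderer would apply spacing during layout rather than by editing the string.

```python
import re

# Telugu-only character classes (assumed ranges for this sketch).
C = "[\u0C15-\u0C39]"   # consonant
H = "\u0C4D"            # virama
V = "[\u0C05-\u0C14]"   # independent vowel
v = "[\u0C3E-\u0C4C]"   # dependent vowel sign
m = "[\u0C01-\u0C03]"   # candrabindu, anusvara, visarga

# {CH}C[v][m]  |  V[m]  |  CH (word-final consonant + virama)
SYLLABLE = re.compile(f"(?:{C}{H})*{C}(?!{H}){v}?{m}?|{V}{m}?|{C}{H}")

def letter_space(word: str, gap: str = " ") -> str:
    """Join the orthographic syllables of a single Telugu word with `gap`."""
    return gap.join(SYLLABLE.findall(word))

print(letter_space("స్వాగతమ్"))
```

For స్వాగతమ్ this yields four units, matching the Telugu form in example 2 above.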
  <section>

  <h3>Vertical arrangements of characters</h3>
<p>Vertical arrangements of characters are sometimes used in Indian texts. Writing each character on a new line may not be suitable for Indian languages; to form correct arrangements, it is preferred to follow the tailored grapheme cluster approach.
      Variations in the vertical arrangement of characters in Hindi are shown below :</p>
<h4>Variations in vertical arrangements</h4>
<figure> <img src="images/vert2.jpg" width="608" height="250" alt="Example of Vertical arrangements in Hindi" />
      <figcaption>Variations in vertical arrangements</figcaption>
    </figure>
<h4>Vertical representation of the word 'स्वागतम्' based on Indic orthographic syllable definition:</h4>
<table class="tab-format1">
      <tr>
    <td><strong>स्वा</strong></td>
  </tr>
      <tr>
    <td><strong>ग</strong></td>
  </tr>
      <tr>
    <td><strong>त</strong></td>
  </tr>
      <tr>
    <td><strong>म्</strong></td>
  </tr>
    </table>
    <br />
    <table class="tab-format1">
      <tr>
    <td><strong>స్వా</strong></td>
  </tr>
      <tr>
    <td><strong>గ</strong></td>
  </tr>
      <tr>
    <td><strong>త</strong></td>
  </tr>
      <tr>
    <td><strong>మ్</strong></td>
  </tr>
    </table>

Collation

Collation is one of the most important features for Indic languages. It determines the order in which a given culture indexes its characters. This is best seen in dictionary sorting, where words are sorted and arranged in a specific order for easy searching. Within a given script, each allo-script may have a different sort order. Thus in Hindi the conjunct glyph क्ष is sorted along with क, since the first letter of that conjunct is क, and on a similar principle ज्ञ is sorted along with ज. The same is not the case with Marathi and Nepali, which admit a different sort order.

Different scripts admit different sort orders, and for all high-end NLP applications sorting is a crucial feature to ensure that the applications index data according to the cultural perception of the community. In quite a few States, the sort order is clearly defined by the statutory bodies of that State, and hence it is crucial that such sort orders be ascertained and introduced in the document.

The order (left to right) given below is pertinent to sorting by a computer program and is compliant with CLDR as laid down by Unicode.

 ़ \u093C, ॐ \u0950,  ं \u0902,  ँ \u0901,  ः \u0903
अ \u0905, आ \u0906, इ \u0907, ई \u0908, उ \u0909, ऊ \u090A, ऋ \u090B, ऌ \u090C, ऍ \u090D, ए \u090F, ऐ \u0910, ऑ \u0911, ओ \u0913, औ \u0914
क \u0915, ख \u0916, ग \u0917, घ \u0918, ङ \u0919
च \u091A, छ \u091B, ज \u091C, झ \u091D, ञ \u091E
ट \u091F, ठ \u0920, ड \u0921, ढ \u0922, ण \u0923
त \u0924, थ \u0925, द \u0926, ध \u0927, न \u0928
प \u092A, फ \u092B, ब \u092C, भ \u092D, म \u092E
य \u092F, र \u0930, ल \u0932, ळ \u0933, व \u0935, श \u0936, ष \u0937, स \u0938, ह \u0939
ऽ \u093D,  ा \u093E,  ि \u093F,  ी \u0940,  ु \u0941,  ू \u0942,  ृ \u0943,  ॄ \u0944,  ॅ \u0945,  े \u0947,  ै \u0948,  ॉ \u0949,  ो \u094B,  ौ \u094C,  ् \u094D

The following is the sort order for the consonant 'क' :

कँ कं कः का कि की कु कू कृ के कॅ
कै को कॉ कौ क् क़          

Contributors

Serial No. Name Organization
1 Manoj Kumar Jain DeitY
2 Gautam Sengupta University of Hyderabad
3 Girish Nath Jha JNU
4 Rajeev Sangal IIT Varanasi
5 Dipti Misra Sharma IIIT Hyderabad
6 R K Sharma Thapar University
7 Rajat Mohanty IIT Bombay
8 Venkatesh Choppella IIIT Hyderabad
9 Soma Paul IIIT Hyderabad
10 M D Kulkarni C-DAC Pune
11 Panchanan Mohanty University of Hyderabad
12 G. Uma Maheshwar Rao University of Hyderabad
13 Dr. Bisembli P. Hemananda University of Mysore
14 Dr. R. Chandrashekar JNU

Hyphenation case 1 and 2 intentions

4.1 Hyphenation
Case 1
http://w3c.github.io/ilreq/#h_hyphenation

the example

नर- नारी, लाभ- हानि, माता-पिता, ऊंच - नीच

has different configurations of spaces around the hyphen - is this intentional? if so, what does it mean? if not, we should fix the example (and add a few words to introduce it).

Am i right to assume that in case 1 the hyphens always appear in the text? If so, what happens at the end of a line? Does the hyphen remain on the first line or move to the second, or does it disappear?

Presumably case 2 is intended to show a situation where hyphens only appear when a line is broken inside a word. Again, presumably the hyphen remains on the first line(?) - ilreq should clarify that.

Change case 2 example to show intended result

4.1 Hyphenation
Case 2
http://w3c.github.io/ilreq/#h_hyphenation

In the below screenshot, words आकर्षण and विज्ञापन not follow Indic orthographic syllable definition in some of the browsers.

I think it's more important to show what one should see, rather than to point out a browser-specific issue. I think the image should be replaced (by a figure) showing the expected behaviour.

X1 and XN not clear

or X1 to descent of the last line or line XN

The meaning of this is hard to decipher.

Two dandas or double danda ?

Text segmentation
http://w3c.github.io/ilreq/#h_text_segmentation

note that the current text uses two dandas together to show the double danda, rather than the unicode code point - some explanation would be useful here if that is intentional, or if it is commonly found in text – especially around what to do wrt segmentation, given that a two-character sequence is not like the full stop it is equated with at the end of the paragraph.

Inherent vowel sound or transcription?

[i did a quick review of the document. This is the first of a number of issues i'm raising as a result.]

1.2 Indian language complexities
http://w3c.github.io/ilreq/#h-h_indian_language_complexities

They are all abugidas in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/).

The /.../ indicates a phonemic representation, which in this case i believe is actually /ə/, whereas the 'a' is a common transliteration form. I suspect that either the a should be changed to ə, or the slashes should be replaced with quotes.

Inline Alignment

Wanted to know more about how each type of Indic script aligns text content of varying font sizes on a single line, and if there are any variations in the practice.

Reduce duplication in orthographic syllable tables

I just read through the new material in section 2.4:

2.4 Use cases of ABNF based Indic orthographic syllable definition for other languages
http://w3c.github.io/ilreq/#use-cases-of-abnf-based-indic-orthographic-syllable-definition-for-other-languages

I don't think i missed anything, but after a while i realised that actually the only difference between this and section 2.3 on Hindi is the set of examples.

Why don't we just add an extra column to the tables in section 2.2, and give the column the title 'Examples in Kannada/Telugu' (and of course rename the Hindi column to 'Examples in Hindi')?
