tehreer / sheenbidi

A sophisticated implementation of Unicode Bidirectional Algorithm

License: Apache License 2.0

C 72.69% C++ 26.60% Makefile 0.59% Meson 0.12%
unicode bidi unicode-bidirectional-algorithm uax-9 uax-24 c c89 ansi-c c-plus-plus library

sheenbidi's Introduction

SheenBidi

SheenBidi implements the Unicode Bidirectional Algorithm (UBA) specified at http://www.unicode.org/reports/tr9. It is a sophisticated implementation that gives developers an easy way to use the UBA in their applications.

Here are some of the advantages of SheenBidi:

  • Object based.
  • Optimized to the core.
  • Designed to be thread safe.
  • Lightweight API for interaction.
  • Supports UTF-8, UTF-16 and UTF-32 encodings.

API

(Screenshot: a visual representation of the API applied to a sample text.)

SBCodepointSequence

It works as a code point decoder by accepting a string buffer in a specified encoding.
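To illustrate what decoding code points from an encoded buffer involves, here is a minimal standalone sketch of a UTF-8 decoder. It is not SheenBidi's implementation: the name decode_utf8 and the REPLACEMENT constant are hypothetical, and the sketch is deliberately simplified (it does not reject overlong forms or encoded surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu /* U+FFFD, returned for malformed input (assumption) */

/* Hypothetical helper: decodes one code point starting at *index and
   advances *index past the bytes it consumed. */
static uint32_t decode_utf8(const uint8_t *buf, size_t len, size_t *index) {
    assert(buf != NULL && index != NULL && *index < len);

    uint8_t lead = buf[(*index)++];
    uint32_t cp;
    size_t extra; /* number of continuation bytes expected */

    if (lead < 0x80) return lead;                               /* ASCII */
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else return REPLACEMENT;                                    /* stray byte */

    for (size_t i = 0; i < extra; i++) {
        /* Each continuation byte must look like 10xxxxxx. */
        if (*index >= len || (buf[*index] & 0xC0) != 0x80) return REPLACEMENT;
        cp = (cp << 6) | (uint32_t)(buf[(*index)++] & 0x3F);
    }
    return cp;
}
```

Calling it repeatedly until *index reaches len walks the buffer one code point at a time, which is the role the README assigns to SBCodepointSequence.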

SBAlgorithm

It provides the bidirectional type of each code unit in the source string. Paragraph boundaries, as determined by rule P1, can be queried from it. Individual paragraph objects can be created from it either by explicitly specifying the base level or by deriving it from rules P2-P3.
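As a rough illustration of rule P1, the sketch below scans an array of code points for the end of a paragraph. It is a standalone example, not the library's code: paragraph_boundary and is_paragraph_separator are hypothetical names, the input is assumed to be already decoded to code points, and treating CR LF as one two-unit separator is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code points whose bidi class is B (paragraph separator). */
static int is_paragraph_separator(uint32_t cp) {
    return cp == 0x000A || cp == 0x000D || (cp >= 0x001C && cp <= 0x001E)
        || cp == 0x0085 || cp == 0x2029;
}

/* Hypothetical helper: returns the length of the paragraph starting at
   `offset`, including its trailing separator. The separator length is
   always written, and is 0 when the text ends without a separator. */
static size_t paragraph_boundary(const uint32_t *text, size_t length,
                                 size_t offset, size_t *separator_length) {
    assert(text != NULL && separator_length != NULL && offset <= length);

    *separator_length = 0;
    for (size_t i = offset; i < length; i++) {
        if (is_paragraph_separator(text[i])) {
            *separator_length = 1;
            /* Treat CR LF as a single separator spanning two units. */
            if (text[i] == 0x000D && i + 1 < length && text[i + 1] == 0x000A) {
                *separator_length = 2;
            }
            return (i - offset) + *separator_length;
        }
    }
    return length - offset; /* last paragraph, no separator */
}
```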

SBParagraph

It represents a single paragraph of text processed with rules X1-I2. It provides the resolved embedding levels of all the code units of the paragraph.

SBLine

It represents a single line of text processed with rules L1-L2. However, it provides reordered level runs instead of reordered characters.
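The idea of reordering level runs rather than characters can be sketched as follows. This is a standalone illustration of rule L2 applied at run granularity, not SheenBidi's code: LevelRun and reorder_runs are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical run record: a span of text sharing one embedding level. */
typedef struct { size_t offset; size_t length; uint8_t level; } LevelRun;

static void reverse_runs(LevelRun *runs, size_t first, size_t last) {
    while (first < last) {
        LevelRun tmp = runs[first];
        runs[first++] = runs[last];
        runs[last--] = tmp;
    }
}

/* Reorders runs in place from logical order to visual order. */
static void reorder_runs(LevelRun *runs, size_t count) {
    assert(runs != NULL || count == 0);

    uint8_t max_level = 0, min_odd = 255;
    for (size_t i = 0; i < count; i++) {
        if (runs[i].level > max_level) max_level = runs[i].level;
        if ((runs[i].level & 1) && runs[i].level < min_odd) min_odd = runs[i].level;
    }

    /* From the highest level down to the lowest odd level, reverse every
       maximal sequence of runs at that level or higher. With no odd
       level present, the loop body never executes. */
    for (int level = max_level; level >= (int)min_odd; level--) {
        size_t i = 0;
        while (i < count) {
            if (runs[i].level >= level) {
                size_t start = i;
                while (i < count && runs[i].level >= level) i++;
                reverse_runs(runs, start, i - 1);
            } else {
                i++;
            }
        }
    }
}
```

The characters inside each run are left alone; a renderer would draw an odd-level run right-to-left and an even-level run left-to-right.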

SBRun

It represents a sequence of characters that have the same embedding level. The direction of a run is right-to-left if its embedding level is odd.
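A small standalone sketch of this definition: splitting an array of resolved levels into maximal runs of equal level, and deriving each run's direction from the parity of its level. The names LevelRun, split_into_runs and run_is_rtl are hypothetical and only serve the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical run record: a span of text sharing one embedding level. */
typedef struct { size_t offset; size_t length; uint8_t level; } LevelRun;

/* A run is right-to-left exactly when its embedding level is odd. */
static int run_is_rtl(const LevelRun *run) {
    return (run->level & 1) != 0;
}

/* Splits `levels` into maximal runs of equal level. Returns the number of
   runs written, which never exceeds `capacity`. */
static size_t split_into_runs(const uint8_t *levels, size_t length,
                              LevelRun *runs, size_t capacity) {
    assert(levels != NULL || length == 0);

    size_t count = 0;
    size_t start = 0;
    for (size_t i = 1; i <= length; i++) {
        /* Close the current run at the end of input or on a level change. */
        if (i == length || levels[i] != levels[start]) {
            if (count == capacity) break;
            runs[count].offset = start;
            runs[count].length = i - start;
            runs[count].level = levels[start];
            count++;
            start = i;
        }
    }
    return count;
}
```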

SBMirrorLocator

It finds the mirrored characters in a line, as determined by rule L4.
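Rule L4 can be sketched in a few lines: a character with a mirrored counterpart is displayed as that counterpart when its resolved level is odd. The table below is a tiny hand-picked subset used only for illustration (the full data lives in Unicode's BidiMirroring.txt), and mirror_of and display_code_point are hypothetical names, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A tiny illustrative subset of mirrored pairs. */
static const uint32_t MIRROR_PAIRS[][2] = {
    { 0x0028, 0x0029 }, /* ( ) */
    { 0x003C, 0x003E }, /* < > */
    { 0x005B, 0x005D }, /* [ ] */
    { 0x007B, 0x007D }, /* { } */
    { 0x00AB, 0x00BB }  /* guillemets */
};

/* Returns the mirrored counterpart of `cp`, or 0 if the table has none. */
static uint32_t mirror_of(uint32_t cp) {
    size_t count = sizeof MIRROR_PAIRS / sizeof MIRROR_PAIRS[0];
    for (size_t i = 0; i < count; i++) {
        if (MIRROR_PAIRS[i][0] == cp) return MIRROR_PAIRS[i][1];
        if (MIRROR_PAIRS[i][1] == cp) return MIRROR_PAIRS[i][0];
    }
    return 0;
}

/* Applies rule L4: mirror only when the resolved level is odd. */
static uint32_t display_code_point(uint32_t cp, uint8_t level) {
    assert(level <= 126); /* resolved levels never exceed max_depth + 1 */
    uint32_t mirror = (level & 1) ? mirror_of(cp) : 0;
    return mirror ? mirror : cp;
}
```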

SBScriptLocator

It is not directly related to the UBA but can be useful for text shaping. It finds the script runs as specified in UAX #24.

Dependency

SheenBidi does not depend on any external library. It only uses the standard C library headers stddef.h, stdint.h and stdlib.h.

Configuration

The configuration options are available in Headers/SBConfig.h.

  • SB_CONFIG_LOG logs every activity performed while applying the bidirectional algorithm.
  • SB_CONFIG_UNITY builds the library as a single module and lets the compiler decide which functions to inline.

Compiling

SheenBidi can be compiled with any C compiler. The simplest approach is to add all the source files to an IDE project and build it. The one caveat is that, if SB_CONFIG_UNITY is enabled, only Source/SheenBidi.c should be compiled.

Example

Here is a simple example written in C11.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#include <SheenBidi.h>

int main(int argc, const char * argv[]) {
    /* Create code point sequence for a sample bidirectional text. */
    const char *bidiText = "یہ ایک )car( ہے۔";
    SBCodepointSequence codepointSequence = { SBStringEncodingUTF8, (void *)bidiText, strlen(bidiText) };

    /* Extract the first bidirectional paragraph. */
    SBAlgorithmRef bidiAlgorithm = SBAlgorithmCreate(&codepointSequence);
    SBParagraphRef firstParagraph = SBAlgorithmCreateParagraph(bidiAlgorithm, 0, INT32_MAX, SBLevelDefaultLTR);
    SBUInteger paragraphLength = SBParagraphGetLength(firstParagraph);

    /* Create a line consisting of whole paragraph and get its runs. */
    SBLineRef paragraphLine = SBParagraphCreateLine(firstParagraph, 0, paragraphLength);
    SBUInteger runCount = SBLineGetRunCount(paragraphLine);
    const SBRun *runArray = SBLineGetRunsPtr(paragraphLine);

    /* Log the details of each run in the line. */
    for (SBUInteger i = 0; i < runCount; i++) {
        printf("Run Offset: %ld\n", (long)runArray[i].offset);
        printf("Run Length: %ld\n", (long)runArray[i].length);
        printf("Run Level: %ld\n\n", (long)runArray[i].level);
    }

    /* Create a mirror locator and load the line in it. */
    SBMirrorLocatorRef mirrorLocator = SBMirrorLocatorCreate();
    SBMirrorLocatorLoadLine(mirrorLocator, paragraphLine, (void *)bidiText);
    const SBMirrorAgent *mirrorAgent = SBMirrorLocatorGetAgent(mirrorLocator);

    /* Log the details of each mirror in the line. */
    while (SBMirrorLocatorMoveNext(mirrorLocator)) {
        printf("Mirror Index: %ld\n", (long)mirrorAgent->index);
        printf("Actual Code Point: %ld\n", (long)mirrorAgent->codepoint);
        printf("Mirrored Code Point: %ld\n\n", (long)mirrorAgent->mirror);
    }

    /* Release all objects. */
    SBMirrorLocatorRelease(mirrorLocator);
    SBLineRelease(paragraphLine);
    SBParagraphRelease(firstParagraph);
    SBAlgorithmRelease(bidiAlgorithm);

    return 0;
}

License

Copyright (C) 2014-2022 Muhammad Tayyab Akram

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

sheenbidi's Issues

Getting script of each code unit in an array

Currently, the script runs are identified iteratively. But sometimes an array turns out to be a better option for finding the overlapping regions of bidi runs and script runs. Technically, it's not a big effort to write a utility function that fills out the script array by iterating over script runs. However, it would be really helpful if such a utility function were already provided in the library.

For example, it was a simpler approach for Java interoperability in Tehreer-Android. The relevant piece of code is available in ScriptClassifier.cpp
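The kind of utility described here could look roughly like the sketch below. It is standalone and does not call the library: ScriptRun and fill_script_array are hypothetical, and the runs are assumed to have been collected beforehand from whatever iterator produces them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical script run: a span of code units sharing one script value. */
typedef struct { size_t offset; size_t length; uint8_t script; } ScriptRun;

/* Writes the script of every code unit covered by `runs` into `scripts`.
   Positions beyond `length` are ignored rather than written. */
static void fill_script_array(const ScriptRun *runs, size_t run_count,
                              uint8_t *scripts, size_t length) {
    assert(runs != NULL || run_count == 0);

    for (size_t i = 0; i < run_count; i++) {
        size_t end = runs[i].offset + runs[i].length;
        for (size_t j = runs[i].offset; j < end && j < length; j++) {
            scripts[j] = runs[i].script;
        }
    }
}
```

With such an array next to the array of bidi levels, the overlapping regions can be found in a single pass by breaking wherever either value changes.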

Using SheenBidi with UTF-8

Hi,

I am looking for a Bidi library to pair with HarfBuzz and FreeType for rendering text in a game engine. Do you know of any project that uses SheenBidi with those, so I can get a little head start when working with them? Does SheenBidi support emojis? I am using UTF-8.

SBAlgorithmGetParagraphBoundary() leaves separatorLength unset if no separators.

When running SBAlgorithmGetParagraphBoundary() on a string that contains no separators (e.g. "This is a single line string."), separatorLength is left unchanged. For example, if you pass the address of an uninitialized variable, the variable remains uninitialized. If there is no separator in the paragraph, I would expect it to be explicitly set to 0.

Unsafe for allocation failure

At the moment, the library calls malloc throughout and uses the result without checking for NULL.

For my use case this makes it unusable. Would it be possible to add allocation failure handling?
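For reference, the usual pattern for surviving allocation failure in C looks like the sketch below. It is a generic illustration, not a patch for the library: LevelBuffer and its functions are hypothetical, and the point is only that every allocation is checked and partial state is freed before NULL is returned to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical object owning a heap-allocated array of levels. */
typedef struct { uint8_t *levels; size_t length; } LevelBuffer;

/* Returns NULL on allocation failure instead of using a null pointer. */
static LevelBuffer *level_buffer_create(size_t length) {
    LevelBuffer *buffer = malloc(sizeof *buffer);
    if (buffer == NULL) return NULL;

    /* Allocate at least one element so a zero length still yields storage. */
    buffer->levels = calloc(length ? length : 1, sizeof *buffer->levels);
    if (buffer->levels == NULL) {
        free(buffer); /* undo the first allocation before reporting failure */
        return NULL;
    }
    buffer->length = length;
    return buffer;
}

static void level_buffer_release(LevelBuffer *buffer) {
    if (buffer != NULL) {
        assert(buffer->levels != NULL); /* invariant set by level_buffer_create */
        free(buffer->levels);
        free(buffer);
    }
}
```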

extern "C" in public headers?

Hi!
Thank you for this library. It is exactly what I was looking for, as I don't want to include ICU in my project. I'm using it from C++ code, so I needed extern "C", but it would be nicer if this were in the library's public headers.

I'm not sure which option you would prefer: adding it to every public header or only to the ones declaring functions. So I decided not to open a PR about this, as it's a trivial thing anyway :)

PopulateBidiChain in some situations leaves the SBBidiType portion of memory uninitialized.

I checked with valgrind, and on every run it reports
"==2247== Warning: unimplemented fcntl command: 1033
==2247== Thread 32:
==2247== Conditional jump or move depends on uninitialised value(s)
==2247== at 0x14B4107: SBAlgorithmCreateParagraph"
I tried to narrow down which part of memory is uninitialized according to valgrind, and it turned out that if I add the following code in the CreateParagraphContext function, just after BidiChainInitialize,

for (int i = 0;i < length;++i) fixedTypes[i] = SBBidiTypeNil;

valgrind no longer reports any problem.

I followed the PopulateBidiChain code, and it looks like in some cases it can skip initializing some links in the chain. I do not understand SheenBidi well enough to say whether this means there is a data error in such cases. In any case, the application does not crash because of it; the only visible problem so far is the valgrind report.

CJK / Hangul returns SBBidiTypeON, which gets reversed when mixed with Arabic

For example, when the text 阿拉伯語العربية is given to SheenBidi, SBLineGetRunCount returns only 1, so 阿拉伯語 is reversed like the Arabic, because a neutral bidi type is returned for the Chinese characters.

According to UnicodeData.txt, CJK and Hangul have bidi class "L" (LTR), for example 4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;. Maybe something is returned incorrectly elsewhere?

I use this as a workaround:

        if ((codepoint >= 0x4E00 && codepoint <= 0x9FFF) ||
            (codepoint >= 0x3400 && codepoint <= 0x4DBF) ||
            (codepoint >= 0xF900 && codepoint <= 0xFAFF) || // CJK
            (codepoint >= 0xAC00 && codepoint <= 0xD7A3)) // Hangul
            types[firstIndex] = SBBidiTypeL;
        else
            types[firstIndex] = LookupBidiType(codepoint);

Support Meson

Hi, I would like to add support for SheenBidi in libraqm (issue), which uses the increasingly popular Meson build system - as do all of its current dependencies. It would be amazing if SheenBidi supported Meson as well, since it can automatically build libraries that also use it when they are not installed on the host system (details here). Given SheenBidi does not have a package in most distros, Meson really is a must-have for us. Is supporting it something you can do?

Thanks :)

Small documentation mistake

If I'm not mistaken, the README file contains a small mistake. It says this about SBParagraph:

It represents a single paragraph of text processed with rules X1-I2. It provides resolved embedding levels of all the code units of a paragraph.

However, the levels returned by SBParagraphGetLevelsPtr do not seem to include rules L1 and L2; only SBLine applies those rules.

I tested with the following code point sequence:
0661 0009 0028 0662 0029

Request: Is there a demo for UTF-16?

I guess that if I input a string of UTF-8 characters and its direction is reversed, it is hard to handle while it is represented in UTF-8. So is there a demo for UTF-16? I need to process Uyghur characters.
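The README lists UTF-16 among the supported encodings and describes types and levels per code unit, so with UTF-16 input the reported offsets should presumably count 16-bit units; that is an inference from the README, not a tested claim. Independently of the library, the standalone sketch below shows how UTF-16 code units map to code points. The name decode_utf16 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point starting at *index and
   advances *index past the code units it consumed. Unpaired surrogates
   decode to U+FFFD. */
static uint32_t decode_utf16(const uint16_t *buf, size_t len, size_t *index) {
    assert(buf != NULL && index != NULL && *index < len);

    uint16_t unit = buf[(*index)++];

    /* A high surrogate followed by a low surrogate forms one code point. */
    if (unit >= 0xD800 && unit <= 0xDBFF && *index < len) {
        uint16_t low = buf[*index];
        if (low >= 0xDC00 && low <= 0xDFFF) {
            (*index)++;
            return 0x10000u + (((uint32_t)(unit - 0xD800) << 10)
                               | (uint32_t)(low - 0xDC00));
        }
    }

    /* Any surrogate reaching this point is unpaired. */
    if (unit >= 0xD800 && unit <= 0xDFFF) return 0xFFFDu;
    return unit;
}
```

Uyghur letters such as U+06C7 sit in the Basic Multilingual Plane, so each occupies a single UTF-16 code unit; only characters above U+FFFF need the surrogate branch.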

Upgrade data files to Unicode 11

As Unicode 11 has been released, bidi type, general category, script, mirror and bracket lookups should be updated accordingly.
