Comments (6)
Here's a quick update on my progress so far.
I wasn't too familiar with how Unicode worked, but after doing some reading and research the past few days, I think I have a good, general understanding of it now.
To properly support displaying, aligning, and correctly highlighting Unicode text, it appears the program needs a way to iterate over the user-perceived characters, known as grapheme clusters.
I played around with the library linked above, tinyutf8, which allows iterating over a string and getting its size in Unicode code points. It doesn't seem to be able to iterate over grapheme clusters at the moment. Since a single grapheme cluster can consist of multiple Unicode code points, it doesn't seem like the appropriate solution for what's needed.
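To illustrate the gap between code points and grapheme clusters, here is a minimal standard-library sketch that counts code points in a UTF-8 string. The function name count_code_points is hypothetical and is not part of tinyutf8 or Fltrdr, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not taken from tinyutf8 or Fltrdr.
// Counts Unicode code points in a UTF-8 string by counting every byte
// that is not a continuation byte (continuation bytes match 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t count_code_points(std::string_view utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For example, the precomposed "é" (U+00E9, bytes C3 A9) counts as one code point, while "e" followed by the combining acute accent U+0301 (bytes 65 CC 81) counts as two, even though both render as a single user-perceived character. Counting or iterating by code point alone therefore cannot align or highlight such text correctly.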
I also seem to have encountered a bug in the library. The lookup table seems to become corrupted in certain reproducible cases, causing the substr function to return an incorrect value for a multibyte sequence. I'll open an issue over there after I write a minimal test case that demonstrates the problem.
Following that, I checked out Boost.Locale using the ICU backend. Its API, boost::locale::boundary::segment_index with boost::locale::boundary::character as the boundary type, appears to correctly allow iteration over grapheme clusters. I have a local branch using Boost.Locale where Fltrdr now correctly displays Unicode text.
Although it works, the segment index seems to create its own copy of the indexed text, resulting in higher memory usage as the text is stored twice: once in a std::string and once in the segment_index.
I'm trying out another option today using the icu::BreakIterator API. It looks similar to the segment_index class from Boost.Locale. It takes an icu::UnicodeString as its string parameter, which can alias an external array of characters, meaning it doesn't own the array but just points to it.
The plan is to store the text in a std::string, with an icu::UnicodeString pointing to that main string, and to index the Unicode string with icu::BreakIterator. This would allow a single copy of the text to be kept.
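The single-copy idea can be sketched with the standard library alone. This is an illustration of the owning-string plus non-owning-view design, not the ICU-based code: it splits at code point boundaries rather than grapheme cluster boundaries (which need Unicode segmentation data such as ICU provides), and the names sequence_length and split_code_points are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: byte length of the UTF-8 sequence that starts
// with `lead`. Invalid lead bytes are treated as one-byte sequences.
std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical sketch: returns one non-owning view per code point.
// The views point into `text`, so the text itself is stored only once
// and `text` must outlive (and not be modified while using) the views.
std::vector<std::string_view> split_code_points(const std::string& text) {
    std::vector<std::string_view> views;
    const std::string_view all{text};
    std::size_t pos = 0;
    while (pos < all.size()) {
        std::size_t len = sequence_length(static_cast<unsigned char>(all[pos]));
        if (len > all.size() - pos) {
            len = all.size() - pos;  // clamp a truncated trailing sequence
        }
        views.push_back(all.substr(pos, len));
        pos += len;
    }
    return views;
}
```

Each element of the result refers to bytes inside the original std::string, so only the boundary information is additional memory. The trade-off is the usual one for views: the owning string has to stay alive and unchanged for as long as the views are in use.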
Lastly, I've been tinkering with the Boost.Regex library using the ICU backend to support Unicode regex searching with boost::utf8regex_iterator.
In conclusion, Fltrdr will have Unicode text support soon!
from fltrdr.
As of Version 0.2.0, Fltrdr supports UTF-8 Unicode text!
UTF-8 text should now render properly, including full-width CJK Unified Ideographs. I ended up using the ICU library to provide Unicode support for a non-owning/view OB::Text::View class, an owning/string OB::Text::String class, and a non-owning/view regex iterator OB::Text::Regex.
If you get the chance to try it out, please let me know if you have any suggestions or encounter any issues :)
from fltrdr.
At the moment, it can only handle reading ASCII text files. To work around the limitation, you could pipe your text through iconv with cat <file> | iconv -f utf-8 -t ascii//translit, although this will strip the accents from the characters.
To support those characters in the future, I'll need to find or write a Unicode string class along with a regex library that supports Unicode.
from fltrdr.
Thanks, I'll have a look at it.
from fltrdr.
Maybe https://github.com/DuffsDevice/tinyutf8 ?
from fltrdr.
It works for me, that is fantastic. Thank you.
from fltrdr.