ekmett / hyphenation
Knuth-Liang Hyphenation for Haskell based on TeX hyphenation files
License: Other
First of all, I'm not sure whether this is a problem with the library or with the data files from the TeX project. It seems more reasonable to report the issue here first and then post to the hyph-utf8 mailing list if the data files turn out to be at fault.
Examples:
-- Names
hyphenate lithuanian "Darius" ["Da","rius"] -- The result is correct
hyphenate lithuanian "Jonas" ["Jonas"] -- Should be ["Jo", "nas"]
hyphenate lithuanian "Auksė" ["Auks\279"] -- Should be ["Auk", "s\279"]
-- Nouns
hyphenate lithuanian "Bananas" ["Ba","nanas"] -- Should be ["Ba", "na", "nas"]
hyphenate lithuanian "Stalas" ["Stalas"] -- Should be ["Sta", "las"]
-- Verbs
hyphenate lithuanian "Bėgti" ["B\279gti"] -- Should be ["B\279g", "ti"]
hyphenate lithuanian "Nebeprisikiškiakopūsteliaudavome"
["Ne","be","pri","si","ki\353","kia","ko","p\363s","te","liau","da","vome"]
-- Should be ["Ne","be","pri","si","ki\353","kia","ko","p\363s","te","liau","da","vo", "me"]
I was just bitten by #3, in haskell/haskell-language-server#1976, where the entire HLS 1.2.0 release is missing the hyphenation data and thus fails at runtime. CI doesn't catch this, because CI builds everything from source and therefore has the files available.
I can empathize with not wanting to bloat binaries, but default behavior that silently fails at deployment time with no early warnings seems like an exceptionally bad choice.
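One way to turn the silent deployment failure into an early one is to probe for the data files at program startup. The following is only a sketch under stated assumptions: `missingFiles` is a hypothetical helper, not part of hyphenation, and the caller has to supply the paths where the data files are expected on the target machine.

```haskell
import Control.Exception (IOException, try)
import System.IO (IOMode (ReadMode), hClose, openFile)

-- Hypothetical helper (not part of the hyphenation package):
-- return the subset of the given paths that cannot be opened for reading.
missingFiles :: [FilePath] -> IO [FilePath]
missingFiles paths = do
  results <- mapM probe paths
  pure [p | (p, False) <- zip paths results]
  where
    -- Try to open and immediately close the file; any IOException
    -- (missing file, missing directory, no permission) counts as missing.
    probe :: FilePath -> IO Bool
    probe p = do
      r <- try (openFile p ReadMode >>= hClose) :: IO (Either IOException ())
      pure (either (const False) (const True) r)
```

Calling this first thing in `main` and aborting with a clear message when the result is non-empty would surface the problem at startup rather than on the first call that needs a pattern file.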
I recently used hyphenation to add soft hyphens to gwern.net so I could enable fully-justified text on desktop Chrome/Chromium browsers. (Bizarrely, for many years now, Chrome has had hyphenation on mobile Android but not desktop, and the devs have dragged their feet with the excuse that they just can't figure out how to ship dictionary files for desktop browsers; I gave up waiting for them to fix it.)
A user reported that on Safari browsers, Safari would line-break a hyphen-separated word like "compile-time" but there would be two hyphens: the original hyphen and then the smaller line-breaking hyphen, presumably the soft hyphen. Other browsers correctly ignore the soft hyphen and show only the regular hyphen if they need to line-break things like "compile-time". But why was there a soft hyphen there to begin with?
It turns out that hyphenation inserts soft hyphens even at existing hyphens!
> H.hyphenate H.english_US "Compile-time"
["Com","pile-","time"]
I would expect instead a breaking like ["Com", "pile-time"]. There is no need to insert a soft hyphen at the existing hyphen, since that is where a justification algorithm would break anyway. It can only cause problems, and in the case of Safari, does.
My current workaround is a post-processing hack to string-replace any hyphen+soft-hyphen present: Data.Text.replace "-\173" "-" etc. But this does other hyphenation users no good.
On an additional note, it would be nice to have a utility function which takes a String/Text and returns it with soft hyphens inserted. My current implementation goes like this, and it's complex enough that I'm not convinced I'm doing it right:
T.pack $ unwords $ map (intercalate "\173" . H.hyphenate H.english_US{H.hyphenatorLeftMin=3}) $ words $ T.unpack s
(I don't know how much the lack of a native Text version hurts, but I'm sure it does my compile-times no good, anyway.)
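A possible shape for such a utility, as a sketch rather than a proposed API: the names `joinPieces` and `softHyphenateWith` are made up for illustration, and the hyphenator is passed in as a plain function (in real use something like `H.hyphenate H.english_US`), so the block depends only on base. It skips the soft hyphen wherever the preceding piece already ends in a literal hyphen, which covers the "compile-time" case without the post-processing replace.

```haskell
-- U+00AD SOFT HYPHEN
softHyphen :: Char
softHyphen = '\173'

-- Join hyphenation pieces with soft hyphens, except after a piece
-- that already ends in a literal '-', where a break is possible anyway.
joinPieces :: [String] -> String
joinPieces []  = ""
joinPieces [p] = p
joinPieces (p : ps)
  | not (null p) && last p == '-' = p ++ joinPieces ps
  | otherwise                     = p ++ [softHyphen] ++ joinPieces ps

-- Hypothetical utility: apply a hyphenator to every word of the input
-- and return the text with soft hyphens inserted.
-- Caveat: words/unwords normalise all whitespace to single spaces.
softHyphenateWith :: (String -> [String]) -> String -> String
softHyphenateWith hyph = unwords . map (joinPieces . hyph) . words
```

Because `words` splits on anything `isSpace` accepts and `unwords` rejoins with a plain space, this version does not preserve the original whitespace; a Text-based variant that keeps whitespace runs intact would be the more faithful design.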
A non-breaking space is not respected by hyphenate. Example:
λ> hyphenate english_US "the\x00a0table"
["the\160table"]
(expected: ["the ta", "ble"]).
Other REPL experiments to check how hyphenate behaves with multi-word input:
λ> hyphenate english_US "organge dolphin"
["or","gange ","dol","phin"]
λ> hyphenate english_US "the table"
["the table"]
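Until multi-word input is handled by the library, one caller-side workaround is to split the input into whitespace and non-whitespace runs and hyphenate only the latter. This is a minimal sketch with assumptions: `runs` and `hyphenateRuns` are hypothetical names, the hyphenator is passed in as a function (for example `hyphenate english_US`), and whitespace runs are returned as separate list elements, which differs from the expected output quoted above. `Data.Char.isSpace` treats U+00A0 as whitespace, so the non-breaking space is kept as its own run.

```haskell
import Data.Char (isSpace)
import Data.Function (on)
import Data.List (groupBy)

-- Split a string into alternating runs of whitespace and non-whitespace.
-- isSpace is True for '\160' (U+00A0), so NBSP ends up in a whitespace run.
runs :: String -> [String]
runs = groupBy ((==) `on` isSpace)

-- Hypothetical wrapper: hyphenate each word run separately and pass
-- whitespace runs through unchanged, so the pieces concatenate back to
-- the input whenever the hyphenator itself preserves its input.
hyphenateRuns :: (String -> [String]) -> String -> [String]
hyphenateRuns hyph = concatMap step . runs
  where
    step r
      | all isSpace r = [r]
      | otherwise     = hyph r
```

The caller can then decide which whitespace elements are legal break points, for instance by never breaking at a run that contains '\160'.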
This causes Stackage build failures with GHC 7.6.
When building an executable that links to hyphenation on one machine for deployment to a different machine, I need to create a directory tree in production mimicking the path where I built the executable on the dev machine, and copy the data files there, to avoid errors in production like: /home/alberto/src/myproject/.cabal-sandbox/share/x86_64-linux-ghc-7.8.3/hyphenation-0.4/hyph-es.chr.txt: openFile: does not exist (No such file or directory)
Ideally I would like to deploy self-contained executables or at least have a configurable data directory.
Perhaps some Template Haskell could embed those data files in the compiled library itself?
The values for defaultLeftMin and defaultRightMin are set to 2. How do the following (wrong) hyphenations come about?
Ord-nungs-fort-schrit-t
Aa-chen-er-s
Aa-dor-f
Aal-fan-g
There are a couple of things that don't seem right to me regarding the German hyphenation. I don't know whether the hyph-utf8 patterns are at fault.
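To make the complaint checkable, the minimum-length constraint can be written down as a small predicate. This is a sketch of the usual TeX meaning of the two minimums (first fragment at least leftMin characters, last fragment at least rightMin), not code from the library; `splitOnHyphen` and `respectsMins` are hypothetical names.

```haskell
-- Split a display form such as "Aa-dor-f" into its fragments.
splitOnHyphen :: String -> [String]
splitOnHyphen s = case break (== '-') s of
  (piece, [])       -> [piece]
  (piece, _ : rest) -> piece : splitOnHyphen rest

-- True when the first fragment has at least leftMin characters and the
-- last fragment has at least rightMin. An unhyphenated word (zero or one
-- fragment) passes trivially.
respectsMins :: Int -> Int -> [String] -> Bool
respectsMins leftMin rightMin pieces = case pieces of
  (p : rest@(_ : _)) -> length p >= leftMin && length (last rest) >= rightMin
  _                  -> True
```

With both minimums at 2, all four examples above fail the check because each ends in a one-letter fragment, which suggests the right-hand minimum is not being applied to these words.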