C# software developer for customers in various industries. Recent projects include mobile apps, backend services and web APIs.
thomasgalliker / diacritics.net
Finds and replaces diacritics in strings
When installing the NuGet package in a .NET Core project I see warnings at compile time: "Package 'Diacritics 1.0.6' was restored using '.NETFramework,Version=v4.6.1' instead of the project target framework '.NETStandard,Version=v2.0'. This package may not be fully compatible with your project."
It would be nice if it could be compiled under .NET Standard.
I came across this project while searching for an umlaut replacement library. I like the approach to solving the general problem. Unfortunately, I found that something is wrong with the German mapping.
It is correct that the character ß is replaced by ss. But the other umlauts are not handled correctly: Ä should be replaced by Ae, Ö by Oe and Ü by Ue. The same applies to the lower-case letters, of course. In this library, however, each umlaut is replaced by one letter instead of two, which is not correct.
Umlaut | replacement character |
---|---|
Ä | Ae |
Ü | Ue |
Ö | Oe |
ä | ae |
ü | ue |
ö | oe |
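The replacements in the table above turn one character into two, so a per-character lookup has to return a string rather than a single char. Here is a minimal sketch of that idea, independent of the library. `GermanUmlautSketch` and `Replace` are hypothetical names used only for illustration, and the sketch assumes the input uses precomposed characters (for example a single `ü`, not `u` followed by a combining diaeresis).

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the Diacritics.NET API.
public static class GermanUmlautSketch
{
    // Each umlaut maps to a two-letter replacement, as in the table above.
    private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>
    {
        { 'Ä', "Ae" }, { 'Ö', "Oe" }, { 'Ü', "Ue" },
        { 'ä', "ae" }, { 'ö', "oe" }, { 'ü', "ue" },
        { 'ß', "ss" },
    };

    public static string Replace(string input)
    {
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Append the replacement string if one exists, otherwise keep the character.
            if (Replacements.TryGetValue(c, out string replacement))
            {
                result.Append(replacement);
            }
            else
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

With this sketch, `GermanUmlautSketch.Replace("Grüße")` returns `"Gruesse"`.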
Hi,
In French, we have some special words like "œuf" (egg). For these, RemoveDiacritics should return "oeuf" and not "ouf".
I tracked the mapping down to the file FrenchAccentsMapping. On line 22, { 'œ', "o" } should be replaced by { 'œ', "oe" }.
Is it possible to do that?
To make it simple, I submitted a pull request.
Regards
Steeve
I couldn't figure out why ß wasn't being removed until I poked through the source.
Might be nice to give a Decompose example in the README.md.
Doesn't seem to work with German: Feindflug - ...Hinter Feindlichen Linie
ə should be mapped to a
Could you please change the code so that ß is translated to ss?
NuGet license info says Apache 2.0, while README.MD doesn't specify a concrete license (apart from a non-commercial clause, even though Apache 2.0 allows commercial usage).
A LICENSE.MD or LICENSE.TXT file with a proper license would make this project useful for more people.
Hi.
Could you please add a mapping for the Vietnamese letter "ơ"?
Thanks.
Integrate new diacritics source and cross check with:
https://github.com/sindresorhus/transliterate/blob/main/replacements.js
Thank you for your work on this project! My team is working on a project that utilizes your library and is looking at updating from 2.0.19240.3 to 3.3.18. Do you have release notes available anywhere that we could review for possible breaking changes? I didn't see any release information available in GitHub that I could read through.
Hi Thomas, while looking for a solution to normalize diacritics and other digrams, I came across your implementation. I like the way you separated every language into its own set of rules.
However, I don't feel comfortable using it, since it produces unexpected conversions. Say you feed it with "cœur" in French. You'd expect to get "coeur" as an output, but since you map "œ" to "o" you finally get "cour" instead.
Same thing for German words, where "Grüße" might be more appropriately mapped to "Gruesse" (i.e. map ü → ue and ß → ss).
Can you explain why you chose your approach of a one-to-one mapping?
What does commercial use mean?
"For commercial use please contact the author."
Would developing an information system count as commercial use?
If you want to use the extension methods of the library you have to register a global default diacritics mapper.
This is not very pure, and it does not allow different mappings for different strings without switching the global mapper.
StaticDiacritics.SetDefaultMapper(() =>
    new DiacriticsMapper(
        new MyGermanAccentMapping(),
        new GermanAccentsMapping(),
        new ItalianAccentsMapping(),
        new ArabicAccentsMapping()
    )
);
"Thöni".RemoveDiacritics() // "Thoeni"
The current "pure" approach is to instantiate a DiacriticsMapper with accent mappings and use the methods from this instance.
var myMapper = new DiacriticsMapper(
    new MyGermanAccentMapping(),
    new GermanAccentsMapping(),
    new ItalianAccentsMapping(),
    new ArabicAccentsMapping());
myMapper.RemoveDiacritics("Thöni") // "Thoeni"
This is fine.
But it would be convenient to have an overload for the extension methods where the mapper (or single accent mappings) could be passed:
"Thöni".RemoveDiacritics(myMapper) // "Thoeni"
or simply as a params array:
"Thöni".RemoveDiacritics(new MyGermanAccentMapping()) // "Thoeni"
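To make the proposal concrete, here is a rough sketch of what an extension overload with an explicit mapper might look like. It is self-contained and deliberately does not use the library's own types: `ISketchMapper`, `DictionarySketchMapper` and `SketchStringExtensions` are hypothetical stand-ins, and the real overload would presumably accept the library's mapper type instead.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a mapper abstraction; not the library's interface.
public interface ISketchMapper
{
    string RemoveDiacritics(string input);
}

// Hypothetical mapper backed by a plain dictionary of replacements.
public sealed class DictionarySketchMapper : ISketchMapper
{
    private readonly IReadOnlyDictionary<char, string> map;

    public DictionarySketchMapper(IReadOnlyDictionary<char, string> map)
    {
        this.map = map;
    }

    public string RemoveDiacritics(string input)
    {
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Use the mapped replacement if present, otherwise keep the character.
            result.Append(this.map.TryGetValue(c, out string replacement) ? replacement : c.ToString());
        }
        return result.ToString();
    }
}

public static class SketchStringExtensions
{
    // Overload that takes the mapper explicitly, so no global default has to be registered.
    public static string RemoveDiacritics(this string input, ISketchMapper mapper)
    {
        return mapper.RemoveDiacritics(input);
    }
}
```

Because the mapper is an argument, two strings can be processed with two different mappers in the same scope without touching any shared state.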
When characters have a lower-case Latin equivalent the diacritic is not correctly removed.
Take for example the Turkish word "İngiltere" (England): when invoking RemoveDiacritics, the input is converted to lowercase before IndexOfAny is called. At this point the input is transformed to "ingiltere", meaning the İ diacritic is not replaced and the original string is returned.
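One possible way around this is to look up each character in its original case instead of lowercasing the input first. The sketch below illustrates that under stated assumptions: `CaseSensitiveLookupSketch` is a hypothetical helper, not the library's implementation, and its two entries are illustrative rather than the library's Turkish mapping.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not the library's implementation.
public static class CaseSensitiveLookupSketch
{
    // Upper- and lower-case entries are listed separately, so no case conversion is needed.
    private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>
    {
        { '\u0130', "I" }, // İ, Latin capital letter I with dot above
        { '\u0131', "i" }, // ı, Latin small letter dotless i
    };

    public static string Replace(string input)
    {
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Look up the character exactly as it appears in the input.
            if (Replacements.TryGetValue(c, out string replacement))
            {
                result.Append(replacement);
            }
            else
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

With this sketch, `CaseSensitiveLookupSketch.Replace("\u0130ngiltere")` returns `"Ingiltere"`, with the capital preserved.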
The character ü is valid in the Spanish language (e.g., penguin is pingüino in Spanish).
Spanish letter n with tilde (ñ)