juliastrings / utf8proc

a clean C library for processing UTF-8 Unicode data

Home Page: http://juliastrings.github.io/utf8proc/

License: Other

CMake 0.20% Makefile 0.45% C 98.46% Julia 0.87% Shell 0.02%

Contributors

andreas-schniertshauer, benibela, c42f, dundargoc, extrowerk, h2co3, inkydragon, jiahao, keno, madscientist, markus-oberhumer, michael-o, michaelhatherly, mike-glorioso, mwilliamson, nalimilan, nehaljwani, past-due, petercolberg, randy408, ryandesign, scottpjones, stefan-floeren, stevengj, stweil, timgates42, tkelman, tlsa, vlajos, xkszltl


utf8proc's Issues

warning: comparison of integers of different signs

Adding -Wextra yields:

utf8proc.c:439:28: warning: comparison of integers of different signs: 'utf8proc_ssize_t' (aka 'long') and 'unsigned long' [-Wsign-compare]
if (wpos < 0 || wpos > SSIZE_MAX/sizeof(utf8proc_int32_t)/2)
~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is an issue if a project embeds utf8proc and it's compiled with -Wextra -Werror.
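
One possible fix is to cast the signed value after the negativity check, so both operands of the comparison are unsigned. A minimal sketch, not necessarily the patch that was applied:

if (wpos < 0 || (size_t)wpos > SSIZE_MAX/sizeof(utf8proc_int32_t)/2)
    return UTF8PROC_ERROR_OVERFLOW; /* illustrative error path; the real code may differ */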

utf8proc 1.2 release

Now that we have nicer documentation (thanks to #29), as far as I can tell we are ready for a 1.2 release as soon as the version number is updated.

(I don't think it's necessary to wait until we can get the HTML docs into the release tarball; there are equivalent docs in utf8proc.h, after all, and in any case the lack of formatted docs in the tarball is not a regression.)

@jiahao, @tkelman, @nalimilan, @StefanKarpinski, your thoughts?

Compilation on windows

Hi, guys!

I need some help compiling the library on Windows 7.
I use cmake ../ -G"Visual Studio 15 Win64" to create the project for MSVC.
Then I build the solution in MSVC with the Release configuration, but the output folder contains only a .lib file.
As far as I know, to use a library on Windows I need three files (.h, .lib, and .dll), is that right?
I link the static lib and add the header to my project, but as expected I get an unresolved external symbol __imp... error.
How do I build this library as a .dll on Windows? I have tried many ways without success...
Is the .lib necessary if I want to use the .dll? Is there any way to use only the static .lib, or do I need to modify the .h for that?

Sorry for the newbie questions, but I am really confused by this situation.
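
For reference, one way to consume the static .lib without hitting the __imp... errors is to suppress the dllimport declarations. This is a sketch that assumes your copy of utf8proc.h supports the UTF8PROC_STATIC macro (newer versions do; check the header):

/* Define before including the header (or pass /DUTF8PROC_STATIC to cl)
   so the API is not declared __declspec(dllimport). */
#define UTF8PROC_STATIC
#include "utf8proc.h"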

syntax error: 'constant' compilation failure with MSVC 2015

Leaving this as a to-do, mostly for myself. AppVeyor here looks like it's running the newest MSVC it can find, but I think this failure only happens when some other header gets included before utf8proc.h.
ref https://ci.appveyor.com/project/tkelman/julia/build/1.0.688

 /c/projects/julia/deps/srccache/libuv/compile cl -nologo -MD -Z7 -ftls-model=global-dynamic -D_WIN32_WINNT=0x0502 -I../support   -I/c/projects/julia/usr/include -I/c/projects/julia/usr/include  -DLIBRARY_EXPORTS -DUTF8PROC_EXPORTS  -c string.c -o /c/projects/julia/src/flisp/string.o
cl : Command line warning D9002 : ignoring unknown option '-ftls-model=global-dynamic' 
string.c
c:\projects\julia\usr\include\utf8proc.h(95): error C2059: syntax error: 'constant' 
make[2]: *** [/c/projects/julia/src/flisp/string.o] Error 2

This happens when I compile Julia as C, but not when I compile as C++. MSVC 2015 is the first version with enough C99 support to make this easily possible. We probably need to make some of the ifdefs version-specific to avoid a collision with C99 true/false.

utf8proc seems difficult to use efficiently on strings

Maybe this isn't an appropriate issue; if so, please feel free to close it. I have a string implementation and I need to do some basic UTF-8 operations on it: compute the length (in characters, not bytes), compare strings case-insensitively (folding), and convert strings to upper or lower case. These need to be as efficient as possible, as they have a real impact on my system. Then there are a few other, more esoteric things I need, like reversing a UTF-8 string, but those don't need to be super-efficient.

I really would like something small and I only need UTF8, so ICU is too much.

utf8proc seems like a great per-character interface, but it seems difficult to use efficiently on entire strings. For example, there's no simple, fast string length function. Also, the way that the map functions always allocate new memory and can't be used on existing buffers is a major drawback: it necessitates a lot of extra copying in many situations. It seems like a folded comparison function could be written inside utf8proc a good bit more efficiently. Etc.
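
For instance, a character count has to be hand-rolled on top of utf8proc_iterate today. A minimal sketch of what I mean, using only the existing per-character API:

#include "utf8proc.h"

/* Count codepoints in a UTF-8 buffer without allocating. */
utf8proc_ssize_t count_codepoints(const utf8proc_uint8_t *str, utf8proc_ssize_t len)
{
    utf8proc_ssize_t n = 0, pos = 0;
    utf8proc_int32_t cp;
    while (pos < len) {
        utf8proc_ssize_t r = utf8proc_iterate(str + pos, len - pos, &cp);
        if (r < 0) return r; /* propagate the utf8proc error code */
        pos += r;
        n++;
    }
    return n;
}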

Maybe that's a goal of utf8proc: to provide a character-based interface and have users compose their own higher-level (string-based) algorithms using them: simplicity taking priority over efficiency? And/or perhaps the way Julia uses utf8proc just matches well with the current interface; it doesn't have a need for writing into existing buffers etc.?

benchmarking/profiling

It would be good to perform some benchmarking of utf8proc against ICU, and in general to perform some profiling to see if there are any easy targets for optimization.

rename to utf8proc

The Public Software Group has asked us to officially take over maintenance of utf8proc, so we should rename this repo back to utf8proc in order to ease the transition for distros and other packagers/users of utf8proc.

@StefanKarpinski, can you do the honors?

utf8proc_ssize_t

Hi,

If you're defining utf8proc_ssize_t as several different types according to platform, could you also define something like UTF8PROC_SSIZE_MIN and UTF8PROC_SSIZE_MAX, so it would be easier to convert safely from size_t to it? This is mostly because utf8proc_iterate takes its strlen argument as a utf8proc_ssize_t.

Regards,
Matt
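
Something like this sketch of the caller side (UTF8PROC_SSIZE_MAX is the macro being requested here, not an existing one; on POSIX systems SSIZE_MAX from <limits.h> can stand in for it):

#include <stddef.h>
#include <limits.h>   /* SSIZE_MAX; POSIX, not available on MSVC */
#include "utf8proc.h"

/* Reject size_t lengths that cannot be represented as utf8proc_ssize_t
   before handing them to utf8proc_iterate. */
utf8proc_ssize_t iterate_checked(const utf8proc_uint8_t *s, size_t n, utf8proc_int32_t *cp)
{
    if (n > (size_t)SSIZE_MAX)
        return UTF8PROC_ERROR_INVALIDUTF8; /* illustrative choice of error code */
    return utf8proc_iterate(s, (utf8proc_ssize_t)n, cp);
}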

add charwidth property

As discussed in JuliaLang/julia#6939, the wcwidth function is broken on many operating systems. When we import the Unicode data, it would be good to add another field to our database in order to store the character width, so that we can provide an up-to-date character-width function.
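
(For reference, the lookup this issue asks for eventually shipped as utf8proc_charwidth; a sketch of its use, assuming a utf8proc version recent enough to have it:)

#include <stdio.h>
#include "utf8proc.h"

int main(void)
{
    printf("%d %d %d\n",
           utf8proc_charwidth(0x0041),   /* 'A': expect 1 column */
           utf8proc_charwidth(0x4E2D),   /* CJK ideograph: expect 2 columns */
           utf8proc_charwidth(0x0301));  /* combining accent: expect 0 columns */
    return 0;
}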

Tag release with Unicode 9 support

utf8proc fails to build on Debian due to not supporting the current Unicode data (#829236). As the bug is release critical, utf8proc and Julia are scheduled to be removed from Debian testing on 2016-08-14.

Could you tag a new release that includes support for Unicode 9 (#70)?

Update charwidths for Unicode 8

We updated the data tables in #45, but I'm guessing it doesn't have charwidths for many of the new codepoints. Probably we need to wait for a new version of GNU Unifont for up-to-date charwidths, but at the very least we probably shouldn't default to zero for codepoints in letter-like categories.

e.g. this doesn't seem good:

test/printproperty 0x14400
U+0x14400:
  category = Lo
  combining_class = 0
  bidi_class = 1
  decomp_type = 0
  uppercase_mapping = ffffffff
  lowercase_mapping = ffffffff
  titlecase_mapping = ffffffff
  comb1st_index = -1
  comb2nd_index = -1
  bidi_mirrored = 0
  comp_exclusion = 0
  ignorable = 0
  control_boundary = 0
  boundclass = 1
  charwidth = 0

Documentation

Currently, the API is documented in the header file. Maybe we should use Doxygen or similar to generate nicely formatted docs from the header?

What tool has the best GitHub integration?

restore old grapheme_break API

@Keno, upon reflection, I think it was a mistake to break backwards compatibility for utf8proc_grapheme_break in #70. This will make life harder for distros like Debian who need to support older versions of software (see #72).

Can we restore the old utf8proc_grapheme_break API, with the caveat that it will return incorrect results for some Unicode-9 cases, and add a new utf8proc_grapheme_break_stateful function (or similar) with the new API?
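
Concretely, something along these lines (a sketch of the proposed declarations; the final names and signatures may differ):

/* Old, stateless API: restored for backwards compatibility, possibly
   incorrect for some Unicode 9 sequences. */
UTF8PROC_DLLEXPORT utf8proc_bool utf8proc_grapheme_break(
    utf8proc_int32_t codepoint1, utf8proc_int32_t codepoint2);

/* Stateful variant: *state carries the context the Unicode 9 rules need
   across calls; initialize it to 0 before the first call. */
UTF8PROC_DLLEXPORT utf8proc_bool utf8proc_grapheme_break_stateful(
    utf8proc_int32_t codepoint1, utf8proc_int32_t codepoint2,
    utf8proc_int32_t *state);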

non-deterministic Ruby-induced failures on Travis

As seen in e.g. #51 and #45, we are getting frequent Travis failures because the Ruby data-generation script produces different results on Travis than on the machines where @jiahao and I generated the data.

It's not clear what is causing this. If anyone can reproduce this failure locally, with any Ruby version, that would be helpful. Try:

rm -f data/utf8proc_data.c.new
make data/utf8proc_data.c.new
diff utf8proc_data.c data/utf8proc_data.c.new

to see if the diff output is non-empty.

cc: @StefanKarpinski

tag 1.3 release

It seems like we should be ready for this. To do:

  • merge #51 in some form
  • update NEWS and README files as needed
  • push "v1.3" tag
  • update the web site (gh-pages branch): version numbers, release notes (copied from NEWS), documentation (run doxygen to generate).

Is there anything else or should I go ahead and do this?

1.3-dev1 fails to build on 64-bit

I see this only on 64-bit Fedora/RHEL builds. 32-bit builds went fine.

make -j2 'CFLAGS=-O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic'
cc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -c -o utf8proc.o utf8proc.c
rm -f libutf8proc.a
cc  -shared -o libutf8proc.so.1.3.0 -Wl,-soname -Wl,libutf8proc.so.1 utf8proc.o
ar rs libutf8proc.a utf8proc.o
ar: creating libutf8proc.a
/usr/bin/ld: utf8proc.o: relocation R_X86_64_PC32 against symbol `utf8proc_stage1table' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
Makefile:65: recipe for target 'libutf8proc.so.1.3.0' failed
make: *** [libutf8proc.so.1.3.0] Error 1

https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-rawhide-x86_64/utf8proc-1.3-0.dev1.fc23/build.log.gz

I actually don't need the static library, but this makes the whole build fail (make install depends on it).

incorrect extended grapheme segmentation

The UTF8PROC_CHARBOUND map option is supposed to segment a string into graphemes (by inserting 0xFF before each grapheme), but it doesn't seem to follow the UAX #29 extended grapheme rules. [It might be following the legacy rules? But (a) those aren't recommended, and (b) the use of UTF8PROC_BOUNDCLASS_EXTEND in the source code seems to indicate that the extended rules are intended.]

According to UAX #29 and the grapheme break tests provided by the Unicode Consortium, most applications are recommended to use the "extended" grapheme break rules. In particular, any codepoint followed by a spacing mark is supposed to be treated as part of a single grapheme.

For example, "\u0020\u0903" is one of the test cases that is supposed to be treated as a single grapheme, because U+0903 is a spacing combining mark (category Mc). But utf8proc breaks it into two graphemes:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/types.h>
#include "mojibake.h"

int main(void)
{
     uint8_t s[4] = {0x20,0xe0,0xa4,0x83}; /* UTF-8 for "\u0020\u0903" */
     uint8_t *g = 0;
     ssize_t len, i;
     len = utf8proc_map(s, 4, &g, UTF8PROC_CHARBOUND);
     if (len < 0) return 1; /* utf8proc error */
     printf("g[] = [");
     for (i = 0; i < len; ++i) {
          if (i) printf(",");
          printf("%02x", g[i]);
     }
     printf("]\n");
     free(g); /* utf8proc_map allocates the output buffer */
     return 0;
}

which prints g[] = [ff,20,ff,e0,a4,83] (notice the incorrect 0xff break marker inserted after the first codepoint, 0x20).

Julia win64 instances stalled in utf8proc_NFKC

I'm not yet sure of everything that's going on, or whether there's a good way to reproduce this, but I noticed that my Flycheck-spawned Julia lint checks stopped producing output, and that two Julia instances were each pegging a core.

I took some samples with Very Sleepy, and both instances appear to be stuck in utf8proc_NFKC, called by jl_generate_fptr. Most of the time (~90%) is spent in the Windows API calls VirtualProtect and VirtualQuery, with the rest in utf8proc_NFKC itself. Based on the profiler samples, it appears that control is never returned to jl_generate_fptr; the call stack shows 10-deep recursion of utf8proc_NFKC, which doesn't look like it's supposed to happen.

Killing one of the two Julia processes allowed the other to progress.

I'm starting this issue in libmojibake since this is where we appear to be stuck.

I've posted a pure Lua translation of this code

here: https://github.com/differentprogramming/lua-mojibake
(yes, the original was still called mojibake when I got my copy).

It still needs a tiny amount of cleanup and some documentation, but it's been tested.
Currently it works in LuaJIT with 5.2 support enabled, and on Lua 5.2;
I'll do a little cleanup soon so that it also works in Lua 5.1 and 5.3.

It has a few Lua-specific features: since Lua doesn't have mutable strings, it offers other mutable representations for strings, such as arrays of codepoints, arrays of single characters, arrays of multibyte UTF-8-encoded characters, and arrays of graphemes.

The bench test has been expanded a little bit.

It still has a couple of unused routines, too.

Minimizing tables for (de)composition

So I was using a Pascal port of utf8proc to convert UTF-8 strings between the composed and decomposed forms.

Then I noticed the Unicode tables are 600 kB large (partly because the port does no bitpacking), which is not good.

So I dropped every field from the struct that was not used for composition/decomposition (who needs title case or folding anyway?).
Then the tables were only slightly above 300 kB. Much better.

But looking closer at the tables, half the sequence array consists of -1. Those values are not needed for anything. So I removed all the -1 entries, moving the element length to an unused byte in the property array, saving 20 kB and bringing the tables to under 300 kB.

Then there's the combination array. It is even worse: almost all of its entries are -1. Someone lost their -1 collection.
It is a sparse [1st index][2nd index] array, with all the 2nd-level arrays joined together and indexed as [1st index*size + 2nd index]. You could cut the -1 prefix/suffix off each 2nd-level array, making them variable-sized, and index them as [startindex[1st index] + 2nd index]. But since I care more about space than performance, I replaced each 2nd-level array with a list of (2nd index, value) pairs. Now you need a linear search starting at startindex[1st index] to find the 2nd index and then get the value, but it is much smaller.
In fact this reduces the table to 8 kB, saving over 90 kB.
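
In code, the lookup becomes something like this (a sketch with hypothetical table names, not the utf8proc or Pascal sources):

#include <stdint.h>

typedef struct { uint16_t second; int32_t value; } comb_pair_t;

/* comb_start[first] .. comb_start[first+1] delimits the (second, value)
   pairs for one first-index bucket; scan it linearly. */
static int32_t comb_lookup(int first, int second,
                           const comb_pair_t *comb_pairs,
                           const uint16_t *comb_start)
{
    for (uint16_t i = comb_start[first]; i < comb_start[first + 1]; ++i)
        if (comb_pairs[i].second == second)
            return comb_pairs[i].value;
    return -1; /* no composition for this pair */
}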

Next, the property struct contains both a 1st and a 2nd comb index, yet no character has both. I replaced them with a single comb-index field, saving 5 kB; storing both was pointless.

The stage2 array has a curious property: it is full of almost-alternating 0 and 2 elements.
Why? 0 is the default property struct used for missing (?) characters, and 2 is the property struct for characters without special properties. Both structs are identical.
Dropping struct 2 turned these stage2 blocks into blocks of zeros, which were then automatically merged by the data generator, saving 20 kB.

Back to the sequence array. It is a list of 32-bit numbers, but the numbers are small, wasting a lot of zero bits.
It is like a UTF-32 string; I changed it to UTF-16, which saves 10 kB.

The decomposition function does not actually use the decomp_type; it only checks whether it is zero. I do not need the type itself, so replacing it with a bool automatically drops 800 property structs, as they become duplicates.

Now, what exactly is the property array? After removing all that stuff, it is really just a map from codepoint to decomposition index. Some structs have a composition index, but only about 10% of them, so there is no point storing that index in the properties that lack it. I replaced the property array with a DWORD array, bit-encoding the properties: fields with a comp index take 2 entries, while most need only 1.
This saves 20 kB.

This brings us to the final table size: 90 kB.

How did it affect performance? Not much: about 10% faster when decomposing/composing each character once.

Want to have a patch?

rename utf8proc.h?

For #5 (see also #7).

This breaks source compatibility, but on the other hand it is essential if someone wants to install both utf8proc and libmojibake on the same machine.

I think we should keep the utf8proc_* API, though.

shared-library versioning

We should really include a version number in the shared library (this is different from the utf8proc source version: the source version indicates API compatibility, while the library version encodes ABI compatibility). I have used libtool for this in the past, but the whole autotools stack is probably not worth the trouble here (although we could use just the libtool script, which could be bundled by make dist).

Library version versus package release version...

Having utf8proc_version return the API/ABI version string is quite confusing. I like to call the version function during launch and print the output to the console, so I can easily confirm which version of a dependency my system is using. When the string that gets printed doesn't match the tarball I just installed, it throws me off. I'm expecting something akin to what most autotools-based Makefiles call PACKAGE_VERSION. Was the 1.3.0 string in the 1.3.1 tarball, and in the HEAD revision, a mistake? If it was intentional, would you mind adding a utf8proc_release function?

I'm attaching a possible patch that uses the Makefile MAJOR/MINOR/PATCH variables to construct the release version string. It is set up so you can easily add a "-pre" suffix to the release version string when the commit isn't a tagged tarball release.

utf8proc.release.patch.txt
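
The idea is roughly this (my sketch, not the attached patch itself; it assumes MAJOR, MINOR, and PATCH arrive as -D flags from the Makefile, and utf8proc_release is the proposed name, not an existing function):

#define UTF8PROC_STR_(x) #x
#define UTF8PROC_STR(x) UTF8PROC_STR_(x)

/* Returns the package release, e.g. "1.3.1", as opposed to the
   API/ABI string returned by utf8proc_version(). */
const char *utf8proc_release(void)
{
    return UTF8PROC_STR(MAJOR) "." UTF8PROC_STR(MINOR) "." UTF8PROC_STR(PATCH);
}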

update Unicode tables

As discussed in JuliaLang/julia#7582, utf8proc currently has the Unicode 5.0 tables. It would be good to import the database from Unicode 7.

The file data_generator.rb is a Ruby script that outlines how the Unicode 5 tables were imported, though it looks like the process is not fully automated. The first step would be to figure out how to re-run it on the Unicode 5 tables in order to reproduce the current utf8proc_data.c. That will verify that we are importing the data correctly before we move to the new Unicode 7 data tables.

Add example for documentation

I didn't find any examples.

Would it be possible to add an examples folder with everything needed to print, iterate over, create, and read a UTF-8 string with this library?
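
Even something this small would help; a sketch of the kind of example I mean (NFC normalization, assuming the utf8proc_NFC helper; error handling kept minimal):

#include <stdio.h>
#include <stdlib.h>
#include "utf8proc.h"

int main(void)
{
    /* "Cafe" followed by U+0301 COMBINING ACUTE ACCENT */
    const utf8proc_uint8_t input[] = "Cafe\xCC\x81";
    utf8proc_uint8_t *nfc = utf8proc_NFC(input);
    if (!nfc) {
        fprintf(stderr, "normalization failed\n");
        return 1;
    }
    printf("NFC: %s\n", (const char *)nfc); /* "Café" composed as U+00E9 */
    free(nfc); /* the caller owns the buffer returned by utf8proc_NFC */
    return 0;
}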

Infinite recursion for utf8proc_decompose_char

In master branch, this code leads to infinite recursion and stack overflow:

int32_t codepoints[10];
int boundclass;
/* 64257 is U+FB01, LATIN SMALL LIGATURE FI */
utf8proc_ssize_t decomp_result = utf8proc_decompose_char(64257, codepoints, 10,
    (utf8proc_option_t)(UTF8PROC_COMPAT | UTF8PROC_DECOMPOSE), &boundclass);

call stack:

utf8proc_decompose_char(939, ...)
seqindex_write_char_decomposed(65535, ...)
utf8proc_decompose_char(933, ...)
seqindex_write_char_decomposed(9062, ...)
utf8proc_decompose_char(939, ...)
seqindex_write_char_decomposed(65535, ...)
utf8proc_decompose_char(933, ...)

In release-1.2, it works fine.

MSVC build warning (conversion from 'utf8proc_int32_t' to 'utf8proc_uint8_t', possible loss of data)

utf8proc.c(169): warning C4242: '=': conversion from 'utf8proc_int32_t' to 'utf8proc_uint8_t', possible loss of data
utf8proc.c(172): warning C4244: '=': conversion from 'utf8proc_int32_t' to 'utf8proc_uint8_t', possible loss of data
utf8proc.c(178): warning C4244: '=': conversion from 'utf8proc_int32_t' to 'utf8proc_uint8_t', possible loss of data
utf8proc.c(183): warning C4244: '=': conversion from 'utf8proc_int32_t' to 'utf8proc_uint8_t', possible loss of data
utf8proc.c(196): warning C4242: '=': conversion from 'utf8proc_int32_t' to 'utf8proc_uint8_t', possible loss of data
utf8proc.c(199): warning C4244: '=': conversion from 'utf8proc_int32_t' to 'utf8proc_uint8_t', possible loss of data
utf8proc.c(209): warning C4244: '=': conversion from 'utf8proc_int32_t' to 'utf8proc_uint8_t', possible loss of data
utf8proc.c(214): warning C4244: '=': conversion from 'utf8proc_int32_t' to 'utf8proc_uint8_t', possible loss of data

cmake script

It would be nice to include a cmake script (a CMakeLists.txt file) to make life easier for Windows users. Since utf8proc is so simple, we can easily maintain both this and a Makefile.

Feature request: Full Case Folding

Any plans to support full case folding, according to ftp://ftp.unicode.org/Public/UNIDATA/CaseFolding.txt? It is very useful for code that matches search queries entered by users, e.g. converting ß to ss.

Makefile should not set MAKE

It's not really kosher for makefiles to set the MAKE variable themselves. That variable is owned by the make program, which sets it to the appropriate value itself; by forcibly setting it you break any environment where the make program is named something different, such as gmake.

Get numeric value for decimal digits?

I don't see any field for this in the property struct. I was hoping to accept all decimal digits when building numbers in my program, but I need each digit's numeric value to construct the numbers. Other libraries and languages can return this data. I can stick with ASCII for now, though.

Thanks.

Would like to be able to disable DLLEXPORT

In my system I'm linking utf8proc as a static library into my shared library, and I don't want the utf8proc symbols to be visible outside my shared library (in case the user of my shared library has their own version of this library, or other code with conflicting symbols in other libraries, etc.).

However currently there's no way to turn off the DLLEXPORT stuff in utf8proc.h. I'd appreciate it if there were a controlling macro I could add to my CFLAGS that would disable it, maybe something like:

#ifdef UTF8PROC_NO_DLLEXPORT
# define UTF8PROC_DLLEXPORT
#elif defined(_WIN32)
#  ifdef UTF8PROC_EXPORTS
#    define UTF8PROC_DLLEXPORT __declspec(dllexport)
#  else
#    define UTF8PROC_DLLEXPORT __declspec(dllimport)
#  endif
#elif __GNUC__ >= 4
#  define UTF8PROC_DLLEXPORT __attribute__ ((visibility("default")))
#else
#  define UTF8PROC_DLLEXPORT
#endif

or something along those lines; I can add -DUTF8PROC_NO_DLLEXPORT to CFLAGS to avoid any exporting.

need a new name

The utf8proc people are fine with our fork, but prefer that we choose a distinctive name.

Let's try to find a decent name that doesn't contain "utf8proc" as a substring, and isn't already taken (e.g. "libutf8" and "utf8utils" are already taken). I'd prefer to keep "lib" in the name to emphasize that it is a library.

My first thought is libutf8ery, but libutf8myhomework also has a certain appeal.

C99 compatibility of bool declaration

In utf8proc.h, true, false, and bool are defined. These names are reserved in modern C (not just C++): <stdbool.h> has defined them since C99, and C23 makes them keywords. utf8proc.h's definitions clash with that.

#  ifndef __cplusplus
typedef unsigned char utf8proc_bool;
enum {false, true};
#  else
typedef bool utf8proc_bool;
#  endif

This breaks compilation on MSVC, and there is no preprocessor macro I can define to let utf8proc.h know that bool is already defined elsewhere. Aside from the standard-C problem, it also breaks if utf8proc.h is included after a header that also defines false and true (like stdbool.h, which can be detected via the __bool_true_false_are_defined feature macro).
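
One possible guard (my sketch, not the shipped header) is to fall back to the real bool whenever <stdbool.h> has already been included:

#ifndef __cplusplus
#  ifdef __bool_true_false_are_defined
typedef bool utf8proc_bool;  /* stdbool.h already provides bool/true/false */
#  else
typedef unsigned char utf8proc_bool;
enum {false, true};
#  endif
#else
typedef bool utf8proc_bool;
#endif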

utf8proc_utf8class array is unused ... ?

Maybe someone needs this somewhere, but it's not used anywhere in utf8proc; removing it still allows proper linking and all the tests still pass...

 UTF8PROC_DLLEXPORT extern const utf8proc_int8_t utf8proc_utf8class[256] ...

web page

It is probably a good idea to have a web page separate from the github README; at the very least we'll want to host a bunch of tarballs somewhere. What is the best way to do this? Put it on julialang somewhere?

Maybe we should have a julialang.org/utf8proc that redirects here for now, and will be updated later. @StefanKarpinski or @jiahao, can you do this? I want to have a quasi-permanent URL to give to the Public Software Group ASAP so that they can update their web links accordingly.

Add a pkgconfig file

The netsurf project uses a pkgconfig file to find linking flags for utf8proc. It would be nice to have this in the upstream library.

@tlsa this seems to be required to build against upstream instead of netsurf's fork.

Ruby / PostgreSQL plug-ins

These were in the original utf8proc, but I removed them in the libmojibake fork to focus on the C components. We can certainly add them back in easily (since the API is backward-compatible), and should distribute them in some form (bundled or separate?) in any case. My inclination is just to bundle them, but since I don't use those languages I'm worried about bitrot unless we can add a testsuite for them.

Are Ubuntu/Debian and Fedora distributing these plugins in their utf8proc packages, or are they just distributing the C library?

COMBINING GREEK YPOGEGRAMMENI case-folding

The U+0345 combining character needs special handling, according to Jan Behrens (utf8proc author). In particular, you apparently need to do normalization both before and after case-folding (if you are doing normalization+casefolding on a string containing this character).

As a first pass, I'm not sure it's worth trying to solve this in a super-efficient manner. Just set a flag if the character is found (during decomposition?), and then run a second normalization pass after/before case-folding if necessary.
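
Ignoring the flag optimization, the two-pass workaround would look something like this (a sketch, not utf8proc's current behavior):

#include <stdlib.h>
#include "utf8proc.h"

/* Normalize+casefold, then normalize again so strings containing
   U+0345 COMBINING GREEK YPOGEGRAMMENI end up in a stable form. */
utf8proc_uint8_t *normalized_casefold(const utf8proc_uint8_t *s)
{
    utf8proc_uint8_t *folded = NULL, *result = NULL;
    if (utf8proc_map(s, 0, &folded, UTF8PROC_NULLTERM | UTF8PROC_STABLE |
                     UTF8PROC_COMPOSE | UTF8PROC_CASEFOLD) < 0)
        return NULL;
    if (utf8proc_map(folded, 0, &result, UTF8PROC_NULLTERM | UTF8PROC_STABLE |
                     UTF8PROC_COMPOSE) < 0)
        result = NULL;
    free(folded);
    return result;
}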
