duffsdevice / tiny-utf8
Unicode (UTF-8) capable std::string
License: BSD 3-Clause "New" or "Revised" License
I am using your library, and have a forked version here:
https://github.com/bakerstu/tiny-utf8
I use g++ and my project, as a rule, uses -Wall -Werror. Therefore, I have a fix in a branch noexcept that adds pragma statements to disable -Wmaybe-uninitialized. The fix would look something like this in tinyutf8.h.
<include guard>
<includes>
#if defined(__GNUG__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
#endif
<rest of file>
#if defined(__GNUG__)
#pragma GCC diagnostic pop
#endif
<end include guard>
What do you think? Is this a change you would be willing to accept as an isolated change on a new branch as a pull request?
Compilation with MSVC 2015 fails with two issues:
std::max is not found. The solution is simple: add #include <algorithm> to the source file.
__lzcnt64 is not available on 32-bit architectures. The solution is to check for 32-bit mode with #ifdef.
Also, some Microsoft headers can contain #define max, so it would be better to move the get_lut_width implementation into the source file.
I think the implementations of other non-trivial methods could be moved into the source file too.
This is needed to reduce possible conflicts caused by such defines.
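One way to sidestep a `max` macro without moving code out of a header is to parenthesize the function name, which stops the preprocessor from treating the call as a function-like macro invocation. The following is only a sketch: `widest` is a hypothetical helper, and the `#define max` merely simulates what a platform header might do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Simulate a header that defines a function-like `max` macro.
#define max(a, b) (((a) > (b)) ? (a) : (b))

// Hypothetical helper: because `max` is followed by ')' rather than '(',
// the macro is not expanded and the real std::max is called.
inline std::size_t widest(std::size_t a, std::size_t b)
{
    return (std::max)(a, b);
}

#undef max // keep the simulated macro local to this sketch
```

Defining NOMINMAX before including the Windows headers is another commonly used option, but that is a choice for the including project rather than for the library header.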
TinyUTF8 is awesome, however, I am using it in a project where I encounter bidirectional text (Arabic + English) frequently, which is not represented correctly by TinyUTF8. Is there any way to use the bidi algorithm with TinyUTF8's string class?
How do I convert indexes / pointers to raw char* to codepoint / character offsets?
I was also looking at utilising the raw_get method, but I am not sure if it's the right thing to do:
str.raw_get(offsetInBytes) - str.begin();
PS. I posted it in Stackoverflow as well: https://stackoverflow.com/questions/49716774/tiny-utf8-getting-offset-in-characters-codepoints
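Independent of the library, a byte offset into valid UTF-8 can be turned into a code point index by counting the bytes that are not continuation bytes. This is a generic sketch, not tiny-utf8's API; `codepoint_index` is a hypothetical name and the scan is linear in the offset.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count the code points that start before `byte_offset` in a valid UTF-8
// buffer. Continuation bytes have the bit pattern 10xxxxxx and are skipped.
inline std::size_t codepoint_index(const std::string& utf8, std::size_t byte_offset)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset && i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0u) != 0x80u)
            ++count;
    }
    return count;
}
```

For a raw `char*` pointing into the buffer, the byte offset would simply be the pointer minus the start of the buffer.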
I must be overlooking something, but suppose I have a read-only mmap'ed UTF-8 file: how do I pass it to tinyutf8 without triggering a full memcpy? std::string::assign() seems to be doing just that.
Not sure if I am doing something wrong.
Given the following code, it deletes the last character 'd' instead of 'W'. Replacing tiny_utf8::string with std::string works fine.
#include <tinyutf8/tinyutf8.h>
#include <algorithm>
#include <iostream>
int main()
{
tiny_utf8::string str = u8"Hello 🌍 World";
std::cout << str << std::endl;
// Hello 🌍 World
str.erase(std::remove_if(str.begin(), str.end(), [](auto c) { return c == U'W'; }),
str.end());
std::cout << str << std::endl;
// Hello 🌍 Worl
}
Hello,
When using an object of type tiny_utf8::u8string, compilation fails as soon as cpp_str() and/or cpp_str_bom() is used in the code.
Compiler: g++-10, 10.3.0-1ubuntu1~20.04, with the -std=c++2a flag.
../../libs/tiny-utf8/include/tinyutf8/tinyutf8.h:3404:46: error: conversion from ‘basic_string<char>’ to non-scalar type ‘basic_string<char8_t>’ requested
3404 | std::basic_string<data_type> result = std::string( size() + 3 , ' ' );
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
tiny-utf8/include/tinyutf8/tinyutf8.h
Line 3404 in cab426c
union{ SSO t_sso; NON_SSO t_non_sso; };
The SSO / NON_SSO constructors do not zero-initialize their fields.
So a string can sometimes contain stale content before anything is assigned to it.
This issue reproduces randomly (compiler MSVC 2015, 32-bit) when using cpp_str().
Hi, after replacing std::string with utf8_string, the following code produces an error:
utf8_string text = "hi";
stringstream stream;
stream << text; // error: no viable conversion from 'utf8_string' to 'string'
In order to fix it, the last line must be modified:
stream << text.c_str();
Is it possible to make utf8_string compatible with streams?
Thank you
How can I solve it?
I'm trying to replace std::string with utf8string, but some problems occur:
tinyutf8.h gives error C4146: unary minus operator applied to unsigned type, result still unsigned
std::string:
operator+=(char)
operator+(const char*, string)
I am using VS2019, the C++ standard is 17, and the warning level is 4.
utf8_string::swap() allocates 'tmp', a temporary storage. If it is left as an exact copy of 'str', it might deallocate the held storage when it is destructed.
Hello Jakob,
Well done on solving #55 so quickly!
Here is a new issue that I found when trying out the readme code sample on 32bit versions of MSVC.
Example at https://godbolt.org/z/Peh1xq with MSVC 2017 and commit 411dfba
tinyutf8.h(152): error C3861: '_BitScanReverse64': identifier not found
tinyutf8.h(158): note: see reference to function template instantiation 'unsigned int tiny_utf8::tiny_utf8_detail::lzcnt<uint16_t>(T) noexcept' being compiled
with
[
T=uint16_t
]
Thanks :)
Hi Jakob,
Hope things are well! A minor issue with the append(value_type) method and the += operator overload: it does not concatenate non-ASCII characters properly. Example:
utf8_string badAppend;
utf8_string::value_type v1 = U'“', v2 = U'a', v3 = U'b', v4 = U'à', v5 = U'”';
badAppend += v1;
badAppend += v2;
badAppend += v3;
badAppend += v4;
badAppend += v5;
cout << badAppend << endl;
cout.flush();
You get something like &&&&&. Funnily enough, it does not happen in Debug mode, which makes me wonder whether some local variable was not initialized properly.
My workaround is:
badAppend += ' ';
badAppend[badAppend.length() - 1] = my_char;
Hi Jakob,
It's your fan club again.
Looks like there might be an issue in the non-SSO mode of get_num_bytes(). In this case, I called substr() on a Russian string (92 bytes long, 52 code points) with a question mark in the middle, and got it butchered because get_num_bytes() returned 52.
I tried to figure out the logic following if (sso_inactive()), that is, line 839 onwards, but couldn't, and instead substituted something crude but working:
size_type byte_count = 0;
for (size_type current_code_point = index; current_code_point< index + cp_count; current_code_point++)
byte_count += get_codepoint_bytes(at(current_code_point));
return byte_count;
I'm pretty sure it's not as optimised as what you planned originally but that's the best I could do...
I still get some parts butchered further down the line, for some reason, no idea why...
Hi Jakob,
I use your library for a variety of languages, and one of them is Chinese, where most characters consist of 3 bytes. The last part of the method seems to be looking for the end of buffer where, in fact, it should be looking for the end of the fragment. This is the problematic line:
const char* buffer_end = buffer + data_len;
Try looking for any characters of this string (for example): 我给了老张三本书。
。 has to be 8, and it is returned as 6. 了 is a mess.
The correction is simple. You need to substitute data_len by index + byte_count. I would also recommend changing the name to fragment_end to prevent misunderstanding.
The following modified code works for me (I left data_len just in case):
const char* fragment_end = buffer + (index + byte_count < data_len? index + byte_count : data_len);
hello,
how are you?
It is described as a drop-in replacement, but I cannot use + and indexing. Or am I missing something?
thanks
Any plans to add support for iterating not just over code points but also over grapheme clusters?
It would be great to get allocator support (possibly even PMR).
In C++20, u8 string literals no longer default to const char[N]; they become const char8_t[N] instead.
Since there are currently no explicit conversions to or from char8_t*, u8string, etc., examples like the one proposed in the readme (using u8"" string literals) no longer compile successfully when configured to use the latest standard.
The __cpp_char8_t and __cpp_lib_char8_t macros could possibly be used to determine support for the extended types.
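As an illustration of how the `__cpp_char8_t` feature-test macro could be used on the caller's side, the overload set below accepts both kinds of u8 literal and hands back a plain `const char*`. This is a sketch under the assumption that the receiving API takes `const char*`; `as_char_ptr` is a hypothetical helper, not something the library provides.

```cpp
#include <cassert>
#include <cstring>

// Pass-through for ordinary narrow strings (and pre-C++20 u8 literals).
inline const char* as_char_ptr(const char* s) noexcept { return s; }

#if defined(__cpp_char8_t)
// C++20: u8"" literals are const char8_t[N]. Reading the same bytes
// through a char pointer is permitted, so only the pointer type changes.
inline const char* as_char_ptr(const char8_t* s) noexcept
{
    return reinterpret_cast<const char*>(s);
}
#endif
```

Note that the pointer overloads discard the array length, so a literal with embedded NUL characters would be cut at the first one.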
Try calling, for example, my_string.append(U'ü'). The character will be garbled.
The workaround is to convert it to a UTF-8 string in any form.
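For the workaround of converting the character to UTF-8 first, a standalone encoder is enough. The sketch below is generic UTF-8 encoding and not the library's own routine; `encode_utf8` is a hypothetical name, and returning an empty string for invalid input is a choice made here for brevity.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as UTF-8 (1 to 4 bytes).
// Surrogates and values above U+10FFFF yield an empty string in this sketch.
inline std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out; // surrogate code points are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The resulting `std::string` can then be appended as a string instead of as a single `value_type`.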
Hi,
Haven't used tinyutf8 much yet, but it is looking promising :-)
Anyway, I try to crosscompile a windows program using x86_64-w64-mingw32-g++ and get the following error in the included header file:
src/tinyutf8.h: In instantiation of ‘tiny_utf8::basic_utf8_string<ValueType, DataType, Allocator>& tiny_utf8::basic_utf8_string<ValueType, DataType, Allocator>::operator=(tiny_utf8::basic_utf8_string<ValueType, DataType, Allocator>&&) [with ValueType = char32_t; DataType = char; Allocator = std::allocator]’:
src/carchive.cpp:33:43: required from here
src/tinyutf8.h:1216:18: error: invalid cast from type ‘tiny_utf8::basic_utf8_string<char32_t, char>::SSO’ to type ‘void*’
1216 | std::memcpy( (void*)this->t_sso , (void*)&str.t_sso , sizeof(SSO) ); // Copy data
Changing to &this->t_sso allows compilation, but is this a sane thing to do?
I have tried experimenting in a couple of classes in an existing project, but have some strange issues when combining it with empty std::string objects. In the test below I expect all printf output to be empty, but I seem to get unterminated gibberish at b and d on Linux, and at b on Windows.
#include "tinyutf8.h"
using namespace tiny_utf8;
int main(){
std::string stdstring="This is a string";
utf8_string utfstring="This is a string";
std::string a=stdstring.substr(stdstring.length());
utf8_string b=stdstring.substr(stdstring.length());
utf8_string c=utfstring.substr(utfstring.length());
utf8_string d=std::string("");
utf8_string e="";
printf("a=%s\r\n",a.c_str());
printf("b=%s\r\n",b.c_str()); // Fails
printf("c=%s\r\n",c.c_str());
printf("d=%s\r\n",d.c_str()); // Fails
printf("e=%s\r\n",e.c_str());
}
Thanks for a really nice library though :-)
I noticed this recently and I wanted to bring it up to see if this was intentional. Essentially the less than operator (and probably other operators) aren't matching std::string and I was curious what you thought!
Here's an example:
std::string aaa("aaa");
std::string zz("zz");
auto isLessStdString = aaa < zz; // true – expected
auto isLessUtf8String = utf8_string (aaa) < utf8_string(zz); // false – unexpected
Thanks for the great library. It's been really helpful!
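For reference, the ordering std::string produces in the example can be reproduced on raw UTF-8 bytes with a plain lexicographical comparison, as long as the bytes are compared as unsigned values. Byte-wise order of valid UTF-8 matches code point order. This is a sketch of the expected semantics, not the library's implementation; `utf8_less` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Lexicographical "less than" over UTF-8 bytes, compared as unsigned
// values. A shorter string that is a prefix of the other sorts first.
inline bool utf8_less(const std::string& a, const std::string& b)
{
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char x, char y) {
            return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
        });
}
```

With this definition "aaa" sorts before "zz" because the first differing byte decides, regardless of which string is longer.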
Is there any support for a string view handled by tiny-utf8 (tiny_utf8::string_view)?
I did not find an implementation of a string view; it would be nice to have one as a drop-in replacement for std::string_view.
Currently I'm using std::string_view view(tinyutf8str.c_str()); but creating a temporary std::string_view is not the best for performance.
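Part of that cost may come from the constructor used: building a `std::string_view` from a bare `const char*` has to scan for the terminating NUL, while the pointer-plus-length constructor does not. The sketch below assumes the string type exposes `c_str()` and a `size()` that returns the byte count; `as_view` is a hypothetical helper and needs C++17.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Build a byte-oriented view without a strlen-style scan.
// Assumption: String::size() is the number of bytes behind c_str().
template <typename String>
std::string_view as_view(const String& s)
{
    return std::string_view(s.c_str(), s.size());
}
```

Such a view works on bytes, not code points, so indexing into it does not follow the code point semantics of the string class.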
First of all, thanks for this great library!
This is not a bug but rather my own user error that I am hoping you can clarify.
I'm trying to call find_first_of passing a utf8_string as the first parameter.
This generates the following error:
error: no matching function for call to ‘utf8_string::find_first_of(utf8_string&, int&)’
int found_pos = haystack.find_first_of(needle, at_pos);
^
In file included from Phonemizer.cpp:8:0:
tinyutf8.h:1728:12: note: candidate: utf8_string::size_type utf8_string::find_first_of(const value_type*, utf8_string::size_type) const
size_type find_first_of( const value_type* str , size_type start_codepoint = 0 ) const ;
^~~~~~~~~~~~~
tinyutf8.h:1728:12: note: no known conversion for argument 1 from ‘utf8_string’ to ‘const value_type* {aka const char32_t*}’
Can you tell me the correct usage and/or how I can get a char32_t* from my utf8_string?
Thanks!
Shawn
I have following test code:
utf8_string n1("ALF Cen");
utf8_string n2("BET Cen");
utf8_string n3("GAM Cen");
assert(n1 != n2);
assert(n1 != n3);
assert(n2 != n3);
assert(n1 > n2);
assert(n1 > n3);
assert(n2 < n1);
assert(n3 < n1);
assert(n1 >= n2);
assert(n1 >= n3);
assert(n2 <= n1);
assert(n3 <= n1);
However, it fails on assert(n2 < n1); (and also on assert(n2 <= n1);). This indicates that inequality comparisons are unreliable.
Edit: After some additional tests I came to the conclusion that it's due to the wrong usage of difference_type. It's defined as a pointer difference, while compare compares characters, not pointers. Moreover, std::ptrdiff_t may not be used with elements in two different arrays.
Edit 2: I fixed this issue on my machine just by defining difference_type as int.
Hello,
It seems that the code sample shown in the readme can't be compiled with some versions of GCC < 8.
Example at https://godbolt.org/z/vdE453 with GCC 7.4, -std=c++17 and commit 048f74d
tinyutf8.h: In substitution of 'template<class ValueType, class DataType, class Allocator> template<typename std::allocator_traits<_NodeAlloc>::size_type L> using enable_if_small_string = typename std::enable_if<(L <= get_sso_capacity()), bool>::type [with typename std::allocator_traits<_NodeAlloc>::size_type L = LITLEN; ValueType = ValueType; DataType = DataType; Allocator = Allocator]':
tinyutf8.h:1047:140: required from here
tinyutf8.h:668:81: error: 'get_sso_capacity' was not declared in this scope
tinyutf8.h: In substitution of 'template<class ValueType, class DataType, class Allocator> template<typename std::allocator_traits<_NodeAlloc>::size_type L> using enable_if_not_small_string = typename std::enable_if<(L > get_sso_capacity()), bool>::type [with typename std::allocator_traits<_NodeAlloc>::size_type L = LITLEN; ValueType = ValueType; DataType = DataType; Allocator = Allocator]':
tinyutf8.h:1060:144: required from here
tinyutf8.h:670:84: error: 'get_sso_capacity' was not declared in this scope
ASM generation compiler returned: 1
Can you please have a look at this error?
Thanks :)
Hi,
I have found a very interesting bug. When working with a utf8_string of length 25, the string gets corrupted:
utf8_string s = "aaaaaaaaaaaaaaaaaaaaaaaaa";
utf8_string s1(s.cpp_str());
std::cout << s.cpp_str() << std::endl;
std::cout << s1.cpp_str() << std::endl;
This code has following output:
aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaІ�\x03\x01
The same error occurs when using the copy constructor utf8_string s1(s);.
Any idea what could be wrong? Thank you.
Hi,
the following code that compiled with an older tinyutf8 version now generates a compile error with MinGW 7.3.0:
utf8_string teststring = "µm";
if (teststring == "µm")
{
}
The following error is generated:
In file included from ../src/Widget.cpp:3:0:
tinyutf8.h: In instantiation of 'int utf8_string::compare(const char (&)[LITLEN]) const [with unsigned int LITLEN = 4]':
tinyutf8.h:1866:111: required from 'bool utf8_string::operator==(const char (&)[LITLEN]) const [with unsigned int LITLEN = 4]'
../src/Widget.cpp:55:23: required from here
tinyutf8.h:1809:10: error: increment of read-only location '(const char*)str'
++it, ++str;
^~~~~
tinyutf8.h:1809:10: error: lvalue required as increment operand
First of all, thanks a lot for providing this project! It makes it so much easier to work with UTF-8 data.
I'm aware that this might be out of the scope of this project, so I figured I'd just ask what you all think about this. When porting my code from std::string to tiny_utf8::string I encountered various issues caused by the mismatch between size and length.
E.g., my code uses templates with std::size(...) to work on arbitrary data types. It doesn't work on tiny_utf8::strings though, since size() is the raw byte size, but operator[] expects a codepoint index. It would be nice if tiny_utf8 were consistent with other STL containers.
Various other functions also made use of both size() and length(). Yes, it can also be fixed on my side, but (1) it's difficult to get this done correctly in a large code base, and (2) it no longer works as a "quick drop-in replacement" as advertised in the README.
So, what I'm considering is adding a new raw_size() (similar to raw_at, raw iterators, ...) that returns the byte size, and changing the default behavior of size to match length. This is obviously not a backwards-compatible change, but (1) there have also been other non-backwards-compatible changes and (2) there could still be a define parameter to switch between both behaviors.
What do you think? If it's out of scope I'll come up with a different solution. :)
I've tested these cases (Windows 11, MSVC 2022 with Ninja):
tiny_utf8::string utf8str(u8"q🌍");
utf8str.length() is 2.
But if you do this:
tiny_utf8::string utf8str(u8"q🌍\0\0");
utf8str.length() is 4 (this should be 2).
And if the C string is not a string literal but a string in memory, the length of utf8str is not always equal to strlen(cstr).
So I have to do it like this:
const char *str = u8"q🌍\0\0";
tiny_utf8::string utf8str(str, strlen(str));
utf8str.length() is 2.
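One possible explanation, not confirmed in this thread, is that the literal constructor deduces the length from the array type (the compiler output elsewhere on this page shows literal overloads templated on LITLEN), in which case explicit "\0" characters count as content. The sketch below shows the difference between array-deduced length and strlen; `literal_length` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Length deduced from the array type: every char except the implicit
// terminator is counted, including explicit "\0" characters.
template <std::size_t N>
constexpr std::size_t literal_length(const char (&)[N]) noexcept
{
    return N - 1;
}
```

For "q\0\0" the array-deduced length is 3 while strlen stops at the first NUL and reports 1, which mirrors the 4 versus 2 code points observed above.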
When compiling or using the library with C++11 or C++14, the MSVC17 build fails with the following error:
error C2429: Attribute "fallthrough" requires the compiler identification "/std:c++17"
The documentation of your project states:
Tiny-utf8 is a library for extremely easy integration of Unicode into an arbitrary C++11 project
Is the bug in the documentation or in the code? Would be great if this library could be used with C++11
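A common way to keep a header usable before C++17 is to hide the attribute behind a macro that expands to a no-op on older standards. This is a sketch of that technique, not the library's actual fix; `SKETCH_FALLTHROUGH` and `count_steps` are hypothetical names, and `_MSVC_LANG` is checked because MSVC does not update `__cplusplus` by default.

```cpp
#include <cassert>

// Use [[fallthrough]] only when the language level is C++17 or newer.
#if __cplusplus >= 201703L || (defined(_MSVC_LANG) && _MSVC_LANG >= 201703L)
#  define SKETCH_FALLTHROUGH [[fallthrough]]
#else
#  define SKETCH_FALLTHROUGH ((void)0)
#endif

// Hypothetical function with intentional fall-through between cases.
inline int count_steps(int start)
{
    int steps = 0;
    switch (start) {
    case 3: ++steps; SKETCH_FALLTHROUGH;
    case 2: ++steps; SKETCH_FALLTHROUGH;
    case 1: ++steps; break;
    default: break;
    }
    return steps;
}
```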
vcpkg includes an older version of tiny-utf8: https://github.com/microsoft/vcpkg/tree/master/ports/tinyutf8. Would you consider making pull requests there to keep the version in vcpkg up to date?
Many people know about vcpkg already, but for those who don't, it's a cross-platform package manager. It's open source. It works with macOS, Linux, Windows, gcc, clang, MSVC, and probably other combinations of OSes and compilers. It's terrific for discovering libraries. That's how I found tiny-utf8.
When including tinyutf8.h I get these warnings:
[build] In file included from ../include/tokenizers.h:5:
[build] ../thirdparty/tinyutf8.h:139:75: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
[build] static inline unsigned int clz( unsigned int value ) noexcept { return __builtin_clz( value ); }
[build] ~~~~~~ ^~~~~~~~~~~~~~~~~~~~~~
[build] ../thirdparty/tinyutf8.h:140:80: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
[build] static inline unsigned int clz( unsigned long int value ) noexcept { return __builtin_clzl( value ); }
[build] ~~~~~~ ^~~~~~~~~~~~~~~~~~~~~~~
[build] ../thirdparty/tinyutf8.h:142:60: warning: operand of ? changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
[build] return sizeof(char32_t) == sizeof(unsigned long int) ? __builtin_clzl( value ) : __builtin_clz( value );
[build] ~~~~~~ ^~~~~~~~~~~~~~~~~~~~~~~
[build] ../thirdparty/tinyutf8.h:142:86: warning: operand of ? changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
[build] return sizeof(char32_t) == sizeof(unsigned long int) ? __builtin_clzl( value ) : __builtin_clz( value );
[build] ~~~~~~ ^~~~~~~~~~~~~~~~~~~~~~
And later, for a literal utf8_string UNK = u8"[UNK]"; (or without u8):
[build] In file included from ../include/tokenizers.h:8:
[build] ../thirdparty/tinyutf8.h:906:54: warning: implicit conversion loses integer precision: 'unsigned long' to 'unsigned char' [-Wimplicit-int-conversion]
[build] t_sso.data_len = ( sizeof(SSO::data) - data_len ) << 1;
[build] ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
[build] ../thirdparty/tinyutf8.h:1042:5: note: in instantiation of member function 'tiny_utf8::basic_utf8_string<char32_t, char, std::allocator<char> >::set_sso_data_len' requested here
[build] set_sso_data_len( LITLEN );
[build] ^
[build] /mnt/e/MyProgramming/fused-transformer-mobile-1/src/tokenizers.cc:4:19: note: in instantiation of function template specialization 'tiny_utf8::basic_utf8_string<char32_t, char, std::allocator<char> >::basic_utf8_string<6>' requested here
[build] utf8_string UNK = u8"[UNK]";
[build] ^
#include <tinyutf8/tinyutf8.h>
#include <stdio.h>
namespace utf8 = tiny_utf8;
int main() {
utf8::string temp;
for (char32_t c : { 13, 10, 104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 13, 10, 41, 34 }) {
temp += c;
}
auto left = temp.substr(0, temp.length() - 2);
utf8::string right = "\nhello world\n";
printf("%s\n", left.c_str());
printf("%s\n", right.c_str());
printf("left: %zu, right: %zu\n", left.length(), right.length());
if (left == right) {
printf("strings were the same\n");
} else {
printf("strings were not the same\n");
}
}
The expected behaviour would be that both of these strings have the same length and compare equal. Yet left.length() = 15, right.length() = 13, and left != right.
For example, we can make a separate typedef of the std::string class that behaves like a case-insensitive class. I tried the same approach with the tiny-utf8 class and got too many errors. The following code makes a std::string-derived class behave case-insensitively.
Any clue?
struct ci_char_traits : public char_traits<char> {
static bool eq(char c1, char c2) { return toupper(c1) == toupper(c2); }
static bool ne(char c1, char c2) { return toupper(c1) != toupper(c2); }
static bool lt(char c1, char c2) { return toupper(c1) < toupper(c2); }
static int compare(const char* s1, const char* s2, size_t n) {
while( n-- != 0 ) {
if( toupper(*s1) < toupper(*s2) ) return -1;
if( toupper(*s1) > toupper(*s2) ) return 1;
++s1; ++s2;
}
return 0;
}
static const char* find(const char* s, size_t n, char a) {
// Return a pointer to the first match, or nullptr if there is none.
while( n-- > 0 ) {
if( toupper(*s) == toupper(a) ) return s;
++s;
}
return nullptr;
}
};
typedef std::basic_string<char, ci_char_traits> ci_string;
I just switched to the header only version 3 and get the following errors when compiling my 64-bit application:
tinyutf8.h:732:49: error: reference to 'detail' is ambiguous
utf8_string( const char* str , size_type len , detail::read_codepoints_tag );
^~~~~~
This is the result of the gcc -v command:
Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=C:/CodingXP/mingw730_32/bin/../libexec/gcc/i686-w64-mingw32/7.3.0/lto-wrapper.exe Target: i686-w64-mingw32 Configured with: ../../../src/gcc-7.3.0/configure --host=i686-w64-mingw32 --build=i686-w64-mingw32 --target=i686-w64-mingw32 --prefix=/mingw32 --with-sysroot=/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32 --enable-shared --enable-static --disable-multilib --enable-languages=c,c++,fortran,lto --enable-libstdcxx-time=yes --enable-threads=posix --enable-libgomp --enable-libatomic --enable-lto --enable-graphite --enable-checking=release --enable-fully-dynamic-string --enable-version-specific-runtime-libs --enable-libstdcxx-filesystem-ts=yes --disable-sjlj-exceptions --with-dwarf2 --disable-libstdcxx-pch --disable-libstdcxx-debug --enable-bootstrap --disable-rpath --disable-win32-registry --disable-nls --disable-werror --disable-symvers --with-gnu-as --with-gnu-ld --with-arch=i686 --with-tune=generic --with-libiconv --with-system-zlib --with-gmp=/c/mingw730/prerequisites/i686-w64-mingw32-static --with-mpfr=/c/mingw730/prerequisites/i686-w64-mingw32-static --with-mpc=/c/mingw730/prerequisites/i686-w64-mingw32-static --with-isl=/c/mingw730/prerequisites/i686-w64-mingw32-static --with-pkgversion='i686-posix-dwarf-rev0, Built by MinGW-W64 project' --with-bugurl=https://sourceforge.net/projects/mingw-w64 CFLAGS='-O2 -pipe -fno-ident -I/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32/opt/include -I/c/mingw730/prerequisites/i686-zlib-static/include -I/c/mingw730/prerequisites/i686-w64-mingw32-static/include' CXXFLAGS='-O2 -pipe -fno-ident -I/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32/opt/include -I/c/mingw730/prerequisites/i686-zlib-static/include -I/c/mingw730/prerequisites/i686-w64-mingw32-static/include' CPPFLAGS=' -I/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32/opt/include -I/c/mingw730/prerequisites/i686-zlib-static/include -I/c/mingw730/prerequisites/i686-w64-mingw32-static/include' 
LDFLAGS='-pipe -fno-ident -L/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32/opt/lib -L/c/mingw730/prerequisites/i686-zlib-static/lib -L/c/mingw730/prerequisites/i686-w64-mingw32-static/lib -Wl,--large-address-aware' Thread model: posix gcc version 7.3.0 (i686-posix-dwarf-rev0, Built by MinGW-W64 project)
#include <iostream>
#include <cstdlib>
#include "tinyutf8.h"
int main()
{
tiny_utf8::utf8_string str;
str.push_back(U'a');
std::cout << str.size() << "\n" << str[0] << "\n" << str << "\n\n";
std::string str1;
str1.push_back('a');
std::cout << str1.size() << "\n" << str1[0] << "\n" << str1 << "\n";
}
outputs
1
a
a
for the std::string as expected, but
1
0
for tiny_utf8::utf8_string (using Clang 10). It seems \0 is getting appended instead of a.
Hi Jakob,
Happy New Year!
Looks like the issue you kept fixing strikes again.
It's pretty much the same pattern: mostly plain Western European text with one multibyte interloper.
Similar to #14, but at a different point, specifically get_num_bytes_from_start. I came across it when using find_first_of.
Another glitch, which may stem from the same piece of code, is that substr truncates the result. Having said that, if the block starting with if( utf8_string::is_lut_active( lut_iter ) )... under get_num_bytes_from_start is disabled, find_first_of returns a correct result.
Here is the sample snippet demonstrating both:
utf8_string findFirstBug = u8"The project, therefore, “is an investment in the power of the adolescent girls which is so important to breaking the inter-generational transmission of poverty, violence, exclusion and discrimination in building our societies for a better future”";
std::cout << "White space found at: " << findFirstBug.find_first_of(U" \t\r\n", 26) << endl; // returns 29 instead of 27
std::cout << "Total len: " << findFirstBug.length() << " but the substring is truncated by 2 characters: " << findFirstBug.substr(0, 246) << endl;
BTW, I see that we were talking about code reuse in #14. I am wondering if you could encapsulate the LUT block that you use in several functions; it seems like it might save some effort in the future.
Hi, is it possible to do case-insensitive comparison with tiny-utf8?
Basically what this question on SO is asking: https://stackoverflow.com/questions/11635/case-insensitive-string-comparison-in-c.
I'm trying to avoid using ICU in my app.
Thanks!
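Without a Unicode case-folding table, what can be done safely on UTF-8 bytes is an ASCII-only comparison: only A to Z are folded and every other byte must match exactly, so multibyte sequences are never altered. This is a sketch of that limited approach and not a feature of the library; `ascii_lower` and `ascii_iequals` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Fold only the ASCII letters A-Z; all other bytes are left untouched.
inline char ascii_lower(char c) noexcept
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Case-insensitive equality for the ASCII range of two UTF-8 strings.
inline bool ascii_iequals(const std::string& a, const std::string& b) noexcept
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (ascii_lower(a[i]) != ascii_lower(b[i]))
            return false;
    }
    return true;
}
```

Characters such as Ü and ü still compare as different here; treating them as equal requires real Unicode case folding data.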
Hi,
The following code:
utf8_string teststring = "µm";
if (teststring == "µm")
{
std::cout << "teststring == µm --> true" << std::endl;
}
if (teststring == utf8_string("µm"))
{
std::cout << "teststring == utf8_string(\"µm\") --> true" << std::endl;
}
produces the following output:
teststring == utf8_string("µm") --> true
So the comparison with the string constant fails.
https://github.com/tapika/cutf
Haven't checked your library, but I've worked on something similar.
Wondering if it makes any sense to recombine bits and pieces together into one library.
Hello,
C++20 introduced new helper functions ends_with and starts_with, are there any plans to add partial or full support for modern std::basic_string_view operators and utils?
Hi Jakob,
Looks like the part of get_num_codepoints that uses lut_active has issues under some circumstances. I could only fix it by disabling that part altogether.
Here is the example:
utf8_string bad_find = u8"S’pore Starbucks selling. gorgeous";
utf8_string substr = u8".";
std::cout << bad_find.find(substr, 0);
std::cout.flush();
The . character is at position 24. The return value is 26 due to ’ being a "compound" codepoint.
Best regards,
Vadim
The example code doesn't compile with C++20 or above. The only mention of C++20 is a note to use tiny_utf8::u8string instead. That also fails to compile, on MSVC at least, since it's guarded by a check on __cplusplus. That macro is by default not set properly on MSVC, see https://docs.microsoft.com/en-us/cpp/build/reference/zc-cplusplus. Maybe that check could be rewritten in a more universal way, or at least some documentation could be added about this.
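One more universal form of such a check is to prefer `_MSVC_LANG`, which MSVC sets to the selected language level even without /Zc:__cplusplus, and to fall back to `__cplusplus` elsewhere. The macro and constant names below are hypothetical and only sketch the idea.

```cpp
#include <cassert>

// Pick the macro that reflects the active language standard.
#if defined(_MSVC_LANG)
#  define SKETCH_CPP_VERSION _MSVC_LANG
#else
#  define SKETCH_CPP_VERSION __cplusplus
#endif

// True when compiling as C++20 or newer.
constexpr bool sketch_has_cpp20 = SKETCH_CPP_VERSION >= 202002L;
```

For the char8_t case specifically, testing the `__cpp_char8_t` feature macro avoids depending on the version number at all.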
I am using your library, and have a forked version here:
https://github.com/bakerstu/tiny-utf8
I use g++ and my project does not enable C++ exception handling. I would not like to add exception handling because it increases the code size to be larger than I would like. Therefore, I have "hacked" a fix into a branch noexcept that replaces the throw statements with assertions compatible with our project platform.
I acknowledge that this is not a portable fix, as is, and should not be accepted back as a pull request. Therefore, I would like to propose an alternative solution that could possibly be accepted.
I want to replace your throw statements with an inline function call. The inline function call should be ifndef guarded so that by default it throws the exception:
#if !defined(TINYUTF8_NOEXCEPT)
inline void tinyutf8_error(const char *what_arg)
{
throw std::out_of_range(what_arg);
}
#endif
If the user defines TINYUTF8_NOEXCEPT, then it is on them to provide an implementation of tinyutf8_error() that works for their purposes and platform. This is an acceptable solution for my use case.
I'd like to get your thoughts on this, and if favorable, work on a pull request.
Line 82 in 4fc49c8
On processors that do not support the 'lzcnt' instruction, this intrinsic will be interpreted incorrectly as 'bsr' (BitScanReverse), thus giving the bit index instead of the count of leading zeros.
This in turn causes the calculations for codepoint sizes to be incorrect when used in encoding and decoding operations.
If a suitable alternative is not available, sources I have seen recommend basing 'clz' on BitScanReverse instead.
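The relationship between the two results is simple: for a 32-bit value, the leading-zero count is 31 minus the index of the highest set bit. The sketch below uses a portable loop as a stand-in for BitScanReverse so it runs anywhere; `highest_set_bit` and `clz32` are hypothetical names, and returning 32 for zero is a convention chosen here.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for BitScanReverse: index (0..31) of the highest set bit.
// Precondition: value != 0.
inline unsigned int highest_set_bit(std::uint32_t value) noexcept
{
    unsigned int index = 0;
    while (value >>= 1)
        ++index;
    return index;
}

// Count of leading zeros derived from the bit index.
inline unsigned int clz32(std::uint32_t value) noexcept
{
    return value == 0 ? 32u : 31u - highest_set_bit(value);
}
```

On MSVC the loop could be replaced by the real `_BitScanReverse` intrinsic, keeping the same `31 - index` conversion.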
Thank you for your work on this project. I look forward to using it.
I'm trying to use version 3.2.2 with Visual C++ inside Visual Studio Community 2019 version 16.6.0. Here are several errors:
ConsoleApplication2\tinyutf8.h(3202): error C4703: potentially uninitialized local pointer variable 'app_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3198): error C4703: potentially uninitialized local pointer variable 'old_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3533): error C4703: potentially uninitialized local pointer variable 'str_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3484): error C4703: potentially uninitialized local pointer variable 'old_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3932): error C4703: potentially uninitialized local pointer variable 'repl_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3880): error C4703: potentially uninitialized local pointer variable 'old_lut_base_ptr' used
Any tips for using it in Visual C++?