duffsdevice / tiny-utf8

Unicode (UTF-8) capable std::string

License: BSD 3-Clause "New" or "Revised" License

C++ 98.16% CMake 1.84%

unicode cplusplus-11 cplusplus drop-in string-manipulation utf8 utf8-string codepoints utf-8 utf-32

tiny-utf8's Issues

-Wmaybe-uninitialized

I am using your library, and have a forked version here:

https://github.com/bakerstu/tiny-utf8

I use g++ and my project, as a rule, uses -Wall -Werror. Therefore, I have a fix in a branch noexcept that adds pragma statements to disable -Wmaybe-uninitialized. The fix would look something like this in tinyutf8.h:

<include guard>
<includes>

#if defined(__GNUG__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
#endif

<rest of file>

#if defined(__GNUG__)
#pragma GCC diagnostic pop
#endif

<end include guard>

What do you think? Is this a change you would be willing to accept as an isolated change on a new branch as a pull request?

Build issues with MSVC-2015

Compilation with MSVC 2015 fails with two issues:

- std::max is not found. The solution is simple: add #include <algorithm> to the source file.
- The function __lzcnt64 is not available on 32-bit architectures. The solution is to check for 32-bit mode with #ifdef.

Also, some Microsoft headers may contain #define max, so it would be better to move the get_lut_width implementation into a source file.
I think the implementations of other non-trivial methods could be moved into a source file too.
This is needed to reduce possible conflicts with such defines.
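
Another common way to sidestep a function-like max macro, without moving code into a source file, is to parenthesize the function name so that the preprocessor does not see a macro invocation. The following is only a sketch: the macro is defined by hand to imitate the Windows headers, and larger_of is a hypothetical helper, not tiny-utf8's get_lut_width.

```cpp
#include <algorithm>

// Simulate the function-like macro that some Windows headers define
// (an assumption made for this illustration only).
#define max(a, b) (((a) > (b)) ? (a) : (b))

// Hypothetical helper. A function-like macro is only expanded when its name
// is directly followed by '(', so wrapping the name in parentheses makes the
// call resolve to the real std::max template.
inline unsigned int larger_of(unsigned int a, unsigned int b)
{
    return (std::max)(a, b);
}

// Remove the simulated macro again so that later code is unaffected.
#undef max
```

Defining NOMINMAX before including <windows.h> is the other widely used option, but that has to be done by the including project rather than by a header-only library.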

How to use the bidi algorithm with TinyUTF8?

TinyUTF8 is awesome, however, I am using it in a project where I encounter bidirectional text (Arabic + English) frequently, which is not represented correctly by TinyUTF8. Is there any way to use the bidi algorithm with TinyUTF8's string class?

Getting offset in characters / codepoints

How do I convert indexes / pointers to raw char* to codepoint / character offsets?

I was also looking at utilising the raw_get method, but I am not sure if it's the right thing to do:

str.raw_get(offsetInBytes) - str.begin();

PS. I posted it on Stack Overflow as well: https://stackoverflow.com/questions/49716774/tiny-utf8-getting-offset-in-characters-codepoints
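
Independently of what the library offers, a byte offset into well-formed UTF-8 can be turned into a code point offset by counting the bytes that are not continuation bytes. The sketch below works on a raw buffer; codepoint_offset is a hypothetical helper name and it runs in linear time, so it is a fallback rather than a statement about how tiny-utf8 does it.

```cpp
#include <cstddef>

// Hypothetical helper: number of code points that start before `byte_offset`
// in a well-formed UTF-8 buffer. Continuation bytes have the form 10xxxxxx.
inline std::size_t codepoint_offset(const char* data, std::size_t byte_offset)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset; ++i) {
        const unsigned char byte = static_cast<unsigned char>(data[i]);
        if ((byte & 0xC0) != 0x80) // not a continuation byte: a new code point starts here
            ++count;
    }
    return count;
}
```

For the bytes of "Hello 🌍 World", the 'W' sits at byte offset 11 and the function reports code point offset 8.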

Avoid memcpy?

I must be overlooking something, but suppose I have a readonly mmap'ed utf8 file - how do I pass it to tinyutf8 without triggering a full memcpy? std::string::assign() seems to be doing just that.

erase-remove idiom does not work as expected

I am not sure whether I am doing something wrong.
The following code deletes the last character 'd' instead of 'W'. Replacing tiny_utf8::string with std::string works fine.

#include <tinyutf8/tinyutf8.h>

#include <algorithm>
#include <iostream>


int main()
{
    tiny_utf8::string str = u8"Hello 🌍 World";

    std::cout << str << std::endl;
    // Hello 🌍 World
    str.erase(std::remove_if(str.begin(), str.end(), [](auto c) { return c == U'W'; }),
              str.end());

    std::cout << str << std::endl;
    // Hello 🌍 Worl
}

`cpp_str()` impossible to be used with `u8string`

Hello,
when using an object of type tiny_utf8::u8string, compilation fails as soon as cpp_str() and/or cpp_str_bom() is used in the code.

Compiler: g++-10, 10.3.0-1ubuntu1~20.04
With -std=c++2a flag.

../../libs/tiny-utf8/include/tinyutf8/tinyutf8.h:3404:46: error: conversion from ‘basic_string<char>’ to non-scalar type ‘basic_string<char8_t>’ requested
 3404 |   std::basic_string<data_type> result = std::string( size() + 3 , ' ' );
      |                                              ^~~~~~~~~~~~~~~~~~~~~~~~~~

tiny-utf8/include/tinyutf8/tinyutf8.h

Line 3404 in cab426c

std::basic_string<data_type> result = std::string( size() + 3 , ' ' );
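
The error message suggests that the buffer is first built as a std::string and then converted, which only works when data_type is char. A minimal sketch of building the buffer with the target character type directly follows; make_filled_buffer is a hypothetical stand-in, not the library's code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the buffer creation in cpp_str(): construct the
// result with the target character type directly instead of converting
// from std::string.
template <typename CharT>
std::basic_string<CharT> make_filled_buffer(std::size_t count)
{
    return std::basic_string<CharT>(count, static_cast<CharT>(' '));
}
```

The same template instantiates for char, char16_t and (in C++20) char8_t, because no conversion between string types is involved.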

Constructor without parameters does not initialize all fields

union{ SSO t_sso; NON_SSO t_non_sso; };

The SSO / NON_SSO constructors do not zero-initialize their fields.
As a result, a string can sometimes contain content before anything has been assigned to it.
This issue reproduces randomly (compiler MSVC 2015 / 32 bit) when using cpp_str().

stringstream operator <<

Hi, after replacing std::string with utf8_string, the following code produces an error:

utf8_string text = "hi";
stringstream stream;
stream << text; // error: no viable conversion from 'utf8_string' to 'string'

In order to fix it, the last line must be modified:

 stream << text.c_str();

Is it possible to make utf8_string compatible with streams?
Thank you

Compilation error when using std::unordered_map<utf8_string, ~something~>

How can I solve it?

Unable to compile under VS

I'm trying to replace std::string with utf8_string, but some problems occur:

- Line 4336 in tinyutf8.h gives error C4146: unary minus operator applied to unsigned type, result still unsigned.
- After I tried fixing the previous problem, another one occurs: C4703: potentially uninitialized local pointer variable 'app_lut_base_ptr' used (tinyutf8.h, line 3134).
- There are also about 70 warnings. Could you fix them all?
- Some operators are not overloaded as in std::string:
  - operator+=(char)
  - operator+(const char*, string)

I am using VS2019; the C++ standard is 17 and the warning level is 4.

::swap will accidentally free buffer

utf8_string::swap() allocates 'tmp' as temporary storage. Because 'tmp' is left as an exact copy of 'str', it might deallocate the held storage when it is destructed.

MSVC x86: error C3861: '_BitScanReverse64': identifier not found

Hello Jakob,

Well done on solving #55 so quickly!

Here is a new issue that I found when trying out the readme code sample on 32bit versions of MSVC.

Example at https://godbolt.org/z/Peh1xq with MSVC 2017 and commit 411dfba

tinyutf8.h(152): error C3861: '_BitScanReverse64': identifier not found
tinyutf8.h(158): note: see reference to function template instantiation 'unsigned int tiny_utf8::tiny_utf8_detail::lzcnt<uint16_t>(T) noexcept' being compiled

        with
        [
            T=uint16_t
        ]

Thanks :)
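
One way to avoid a 64-bit intrinsic on 32-bit targets is to split the value into two 32-bit halves. The sketch below uses a plain loop for the 32-bit part so that it stays portable; example_clz32 and example_clz64 are hypothetical names and are not taken from tinyutf8.h.

```cpp
#include <cstdint>

// Hypothetical portable helper: count leading zero bits of a 32-bit value.
// Returns 32 for an input of 0.
inline unsigned int example_clz32(std::uint32_t value)
{
    unsigned int count = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (value & mask) == 0; mask >>= 1)
        ++count;
    return count;
}

// 64-bit variant built from two 32-bit halves, so no 64-bit intrinsic is needed.
inline unsigned int example_clz64(std::uint64_t value)
{
    const std::uint32_t high = static_cast<std::uint32_t>(value >> 32);
    if (high != 0)
        return example_clz32(high);
    return 32u + example_clz32(static_cast<std::uint32_t>(value));
}
```

On MSVC the loop could be swapped for the 32-bit _BitScanReverse, which is available on x86 as well.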

append() method with value_type

Hi Jakob,

Hope things are well! A minor issue with an append(value_type) method and the += operator overload. It does not concatenate non-ASCII characters properly. Example:

    utf8_string badAppend;
    utf8_string::value_type v1 = U'“', v2 = U'a', v3 = U'b', v4 = U'à', v5 = U'”';
    badAppend += v1;
    badAppend += v2;
    badAppend += v3;
    badAppend += v4;
    badAppend += v5;
    cout << badAppend << endl;
    cout.flush();

You get something like &&&&&. Funnily enough, it does not happen in Debug mode, which makes me wonder whether some local variable was not initialized properly.

My workaround is:

badAppend += ' ';
badAppend[badAppend.length() - 1] = my_char;

get_num_bytes() falls back to the number of codepoints?

Hi Jakob,

It's your fan club again.

Looks like there might be an issue in the non-sso mode for get_num_bytes(). In this case, I called substr() on a Russian string (92 bytes long, 52 code points) with a question mark in the middle, and got it butchered because get_num_bytes() returned 52.

I tried to figure out the logic following if (sso_inactive()) - that is, line 839 onwards - but couldn't, and instead substituted it by something crude but working:

	size_type byte_count = 0;
	for (size_type current_code_point = index; current_code_point< index + cp_count; current_code_point++)
		byte_count += get_codepoint_bytes(at(current_code_point));
	return byte_count;

I'm pretty sure it's not as optimised as what you planned originally but that's the best I could do...

I still get some parts butchered further down the line, for some reason, no idea why...

utf8_string::get_num_codepoints: incorrect return value for multibytes

Hi Jakob,

I use your library for a variety of languages, and one of them is Chinese, where most characters consist of 3 bytes. The last part of the method seems to be looking for the end of buffer where, in fact, it should be looking for the end of the fragment. This is the problematic line:

const char* buffer_end = buffer + data_len;

Try looking for any characters of this string (for example): 我给了老张三本书。

。 has to be 8, and it is returned as 6. 了 is a mess.

The correction is simple. You need to substitute data_len by index + byte_count. I would also recommend changing the name to fragment_end to prevent misunderstanding.

The following modified code works for me (I left data_len just in case):

const char* fragment_end = buffer + (index + byte_count < data_len? index + byte_count : data_len);

How can I use it like std::string? There are no operators

Hello,

How are you?
It is described as a drop-in component, but I cannot use + or indexing. Or am I missing something?

Thanks

Grapheme cluster

Are there any plans to add support for iterating not just over code points but also over grapheme clusters?

Allocator support.

It would be great to get allocator support (possibly even PMR).

Support for u8 string literal changes in c++20

In c++20 u8 string literals no longer default to const char[N], instead becoming const char8_t[N].

Since there are currently no explicit conversions to or from char8_t*, u8string, etc., examples like the one proposed in the readme (using u8"" string literals) no longer compile successfully when configured to use the latest standard.

The __cpp_char8_t and __cpp_lib_char8_t macros could possibly be used to determine support for the extended types.
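
As a rough sketch of how such a check could look, the language feature-test macro can select the character type that u8 literals have in the current mode. The names u8_char_type and from_u8_literal are hypothetical and only serve the illustration.

```cpp
#include <cstddef>
#include <string>

// __cpp_char8_t is the standard language feature-test macro for char8_t.
#if defined(__cpp_char8_t)
using u8_char_type = char8_t;   // C++20: u8"" literals are const char8_t[N]
#else
using u8_char_type = char;      // before C++20: u8"" literals are const char[N]
#endif

// Hypothetical helper: copy a u8 literal into a plain std::string,
// whichever character type the literal has in the current language mode.
template <std::size_t N>
std::string from_u8_literal(const u8_char_type (&literal)[N])
{
    // Reading the bytes through const char* is permitted for any object type.
    return std::string(reinterpret_cast<const char*>(literal), N - 1);
}
```

With this, from_u8_literal(u8"abc") compiles both before and after the C++20 change.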

append(char32_t) does not convert back to UTF-8?

Try calling, for example, my_string.append(U'ü'). The character will be garbled.

The workaround is to convert it to a UTF-8 string in any form.
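
For reference, converting a single code point to UTF-8 by hand is short. This is a sketch of the standard encoding rules, not tiny-utf8's implementation; append_utf8 is a hypothetical helper and it assumes the input is a valid Unicode scalar value.

```cpp
#include <string>

// Hypothetical helper: append one code point to a byte string as UTF-8.
// Assumes cp is a valid Unicode scalar value (at most 0x10FFFF and not a
// surrogate); no validation is performed.
inline void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

For U+00FC (ü) this yields the two bytes C3 BC, which can then be appended as an ordinary UTF-8 string.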

Compilation issue

Hi,
I haven't used tinyutf8 much yet, but it is looking promising :-)

Anyway, I am trying to cross-compile a Windows program using x86_64-w64-mingw32-g++ and get the following error in the included header file:

src/tinyutf8.h: In instantiation of ‘tiny_utf8::basic_utf8_string<ValueType, DataType, Allocator>& tiny_utf8::basic_utf8_string<ValueType, DataType, Allocator>::operator=(tiny_utf8::basic_utf8_string<ValueType, DataType, Allocator>&&) [with ValueType = char32_t; DataType = char; Allocator = std::allocator]’:
src/carchive.cpp:33:43: required from here
src/tinyutf8.h:1216:18: error: invalid cast from type ‘tiny_utf8::basic_utf8_string<char32_t, char>::SSO’ to type ‘void*’
1216 | std::memcpy( (void*)this->t_sso , (void*)&str.t_sso , sizeof(SSO) ); // Copy data

Changing to &this->t_sso allows compilation, but is this a sane thing to do?
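
On the general point: std::memcpy takes pointers, so an object cannot be cast to void* directly, while passing its address is the usual form. A small sketch with a hypothetical stand-in struct (small_buffer is not tiny-utf8's SSO type, and whether the library's own layout is safe to copy this way is not something this example can confirm):

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for a small-string buffer; not tiny-utf8's SSO type.
struct small_buffer {
    char data[16];
    unsigned char data_len;
};

// std::memcpy works on addresses, so both objects are passed with '&'.
// Copying raw bytes like this is only valid for trivially copyable types.
inline void copy_small_buffer(small_buffer& dst, const small_buffer& src)
{
    static_assert(std::is_trivially_copyable<small_buffer>::value,
                  "memcpy requires a trivially copyable type");
    std::memcpy(&dst, &src, sizeof(small_buffer));
}
```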

Empty std::string problem

I have tried experimenting in a couple of classes in an existing project, but have some strange issues when combining it with empty std::string objects. In the test below I expect all printf outputs to be empty, but I seem to get unterminated gibberish at b and d on Linux, and at b on Windows.

#include "tinyutf8.h"

using namespace tiny_utf8;

int main(){
std::string stdstring="This is a string";
utf8_string utfstring="This is a string";
std::string a=stdstring.substr(stdstring.length());
utf8_string b=stdstring.substr(stdstring.length());
utf8_string c=utfstring.substr(utfstring.length());
utf8_string d=std::string("");
utf8_string e="";
printf("a=%s\r\n",a.c_str());
printf("b=%s\r\n",b.c_str()); // Fails
printf("c=%s\r\n",c.c_str());
printf("d=%s\r\n",d.c_str()); // Fails
printf("e=%s\r\n",e.c_str());
}

Thanks for a really nice library though :-)

Comparison operator doesn't match std::string

I noticed this recently and I wanted to bring it up to see if this was intentional. Essentially the less than operator (and probably other operators) aren't matching std::string and I was curious what you thought!

Here's an example:

std::string aaa("aaa");
std::string zz("zz");
auto isLessStdString = aaa < zz;                             // true – expected
auto isLessUtf8String = utf8_string (aaa) < utf8_string(zz); // false – unexpected

Thanks for the great library. It's been really helpful!
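
For comparison, std::string orders strings element by element, and length only decides the result when one string is a prefix of the other. A minimal sketch of that ordering over code points, using std::u32string as a stand-in (codepoints_less is a hypothetical helper, not part of tiny-utf8):

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: lexicographic "less than" over code points, the same
// kind of ordering std::string applies to its own elements.
inline bool codepoints_less(const std::u32string& lhs, const std::u32string& rhs)
{
    return std::lexicographical_compare(lhs.begin(), lhs.end(),
                                        rhs.begin(), rhs.end());
}
```

Under this ordering "aaa" sorts before "zz" because 'a' < 'z' is decided at the first element, regardless of the lengths.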

Support for string view?

Is there any support for a string view handled by tiny-utf8 (tiny_utf8::string_view)?
I did not find an implementation for string view, maybe that would be nice to have as a drop in replacement for std::string_view.
Currently I'm using std::string_view view(tinyutf8str.c_str()); but creating a temporary std::string_view is not the best for performance.

Question regarding find_first_of

First of all, thanks for this great library!
This is not a bug but rather my own user error that I am hoping you can clarify.
I'm trying to call find_first_of passing a utf8_string as the first parameter.
This generates the following error:
error: no matching function for call to ‘utf8_string::find_first_of(utf8_string&, int&)’
int found_pos = haystack.find_first_of(needle, at_pos);
^
In file included from Phonemizer.cpp:8:0:
tinyutf8.h:1728:12: note: candidate: utf8_string::size_type utf8_string::find_first_of(const value_type*, utf8_string::size_type) const
size_type find_first_of( const value_type* str , size_type start_codepoint = 0 ) const ;
^~~~~~~~~~~~~
tinyutf8.h:1728:12: note: no known conversion for argument 1 from ‘utf8_string’ to ‘const value_type* {aka const char32_t*}’

Can you tell me the correct usage and/or how I can get a char32_t* from my utf8_string?
Thanks!
Shawn

Contradictory inequality

I have the following test code:
utf8_string n1("ALF Cen");
utf8_string n2("BET Cen");
utf8_string n3("GAM Cen");
assert(n1 != n2);
assert(n1 != n3);
assert(n2 != n3);
assert(n1 > n2);
assert(n1 > n3);
assert(n2 < n1);
assert(n3 < n1);
assert(n1 >= n2);
assert(n1 >= n3);
assert(n2 <= n1);
assert(n3 <= n1);
However, it fails on assert(n2 < n1); (and also on assert(n2 <= n1);). This indicates that inequality comparisons are unreliable.
Edit: After some additional tests I came to the conclusion that it's due to a wrong usage of difference_type. It's defined as a pointer difference, while compare compares characters, not pointers. Moreover, std::ptrdiff_t may not be used with elements of two different arrays.
Edit2: I fixed this issue on my machine just by defining difference_type as int.

GCC < 8: error: 'get_sso_capacity' was not declared in this scope

Hello,

It seems that the code sample shown in the readme can't be compiled with some versions of GCC < 8.

Example at https://godbolt.org/z/vdE453 with GCC 7.4, -std=c++17 and commit 048f74d

tinyutf8.h: In substitution of 'template<class ValueType, class DataType, class Allocator> template<typename std::allocator_traits<_NodeAlloc>::size_type L> using enable_if_small_string = typename std::enable_if<(L <= get_sso_capacity()), bool>::type [with typename std::allocator_traits<_NodeAlloc>::size_type L = LITLEN; ValueType = ValueType; DataType = DataType; Allocator = Allocator]':
tinyutf8.h:1047:140:   required from here
tinyutf8.h:668:81: error: 'get_sso_capacity' was not declared in this scope
tinyutf8.h: In substitution of 'template<class ValueType, class DataType, class Allocator> template<typename std::allocator_traits<_NodeAlloc>::size_type L> using enable_if_not_small_string = typename std::enable_if<(L > get_sso_capacity()), bool>::type [with typename std::allocator_traits<_NodeAlloc>::size_type L = LITLEN; ValueType = ValueType; DataType = DataType; Allocator = Allocator]':
tinyutf8.h:1060:144:   required from here
tinyutf8.h:670:84: error: 'get_sso_capacity' was not declared in this scope

ASM generation compiler returned: 1

Can you please have a look at this error?

Thanks :)

too many warnings when compiling

Bad constructor

Hi,
I have found a very interesting bug. When working with a utf8_string of length 25, the string gets corrupted:

utf8_string s = "aaaaaaaaaaaaaaaaaaaaaaaaa";
utf8_string s1(s.cpp_str());

std::cout << s.cpp_str() << std::endl;
std::cout << s1.cpp_str() << std::endl;

This code produces the following output:

aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaІ�\x03\x01

The same error occurs when using the copy constructor utf8_string s1(s);.
Any idea what could be wrong? Thank you.

Comparing with lvalues does not work anymore

Hi,

the following code that compiled with an older tinyutf8 version now generates a compile error with MinGW 7.3.0:

utf8_string teststring = "µm";
if (teststring == "µm")
{
 
}

The following error is generated:

In file included from ../src/Widget.cpp:3:0:
tinyutf8.h: In instantiation of 'int utf8_string::compare(const char (&)[LITLEN]) const [with unsigned int LITLEN = 4]':
tinyutf8.h:1866:111:   required from 'bool utf8_string::operator==(const char (&)[LITLEN]) const [with unsigned int LITLEN = 4]'
../src/Widget.cpp:55:23:   required from here
tinyutf8.h:1809:10: error: increment of read-only location '(const char*)str'
    ++it, ++str;
          ^~~~~
tinyutf8.h:1809:10: error: lvalue required as increment operand

RFC: Introduce raw_size() and make size equal to length.

First of all, thanks a lot for providing this project! It makes it so much easier to work with UTF-8 data.

I'm aware that this might be out of the scope of this project, so I figured I'd just ask what you all think about this. When porting my code from std::string to tiny_utf8::string I encountered various issues, where the mismatch of size and length caused issues.

E.g., my code uses templates with std::size(...) to work on arbitrary data types. It doesn't work on tiny_utf8::strings though, since size() is the raw byte size, but operator[] expects a codepoint index. It would be nice if tiny_utf8 were consistent with other STL containers.

Various other functions also made use of both size() and length(). Yes, it can also be fixed on my side, but (1) it's difficult to get this done correctly in a large code base, and (2) it no longer works as a "quick drop-in replacement" as advertised in the README.

So, what I'm considering is adding a new raw_size() (similar to raw_at, raw iterators, ...) that returns the byte size, and changing the default behavior of size to match the length. This is obviously not a backwards compatible change, but (1) there have also been other non-backwards compatible changes and (2) there could still be a define-parameter to switch between both behaviors.

What do you think? If it's out of scope, I'll come up with a different solution. :)

Incorrect string size from the tiny-utf8 constructor (it should not rely only on LITLEN)

I've tested these cases (Windows 11, MSVC 2022 with Ninja):

tiny_utf8::string utf8str(u8"q🌍");
utf8str.length() is 2.

But if you do this:

tiny_utf8::string utf8str(u8"q🌍\0\0");
utf8str.length() is 4. (This should be 2.)

And if the C string is not a string literal but a string held in memory, the length of the utf8str is not always equal to strlen(cstr).

So I have to do it like this:

const char *str = u8"q🌍\0\0";
tiny_utf8::string utf8str(str, strlen(str));
utf8str.length() is 2.
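
The difference reported here is what one would expect if the literal constructor takes its length from the array bound instead of scanning for the first terminator, which is an assumption based on the LITLEN mentioned in the title. The sketch below shows the two notions of length side by side; literal_length is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: the length a constructor sees when it deduces the
// array bound of a string literal (every character except the final '\0').
template <std::size_t N>
constexpr std::size_t literal_length(const char (&)[N])
{
    return N - 1;
}
```

For the literal "q\0\0" the array bound gives 3, while std::strlen stops at the first terminator and gives 1.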

C++17 compiler required

When compiling or using the library with C++11 or C++14, the MSVC17 build fails with the following error:

error C2429: Attribute "fallthrough" requires the compiler identification "/std:c++17"

The documentation of your project states:

Tiny-utf8 is a library for extremely easy integration of Unicode into an arbitrary C++11 project

Is the bug in the documentation or in the code? Would be great if this library could be used with C++11
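
A common way to keep such code usable before C++17 is to hide the attribute behind a macro that expands to nothing harmful on older standards. This is only a sketch: EXAMPLE_FALLTHROUGH and count_down_steps are hypothetical names, not identifiers from tiny-utf8.

```cpp
// EXAMPLE_FALLTHROUGH is a hypothetical macro name, not one from tiny-utf8.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(fallthrough) && __cplusplus >= 201703L
#    define EXAMPLE_FALLTHROUGH [[fallthrough]]
#  endif
#endif
#ifndef EXAMPLE_FALLTHROUGH
#  define EXAMPLE_FALLTHROUGH ((void)0) /* attribute unavailable: plain no-op */
#endif

// Hypothetical function with intentional fall-through between the cases.
inline int count_down_steps(int start)
{
    int steps = 0;
    switch (start) {
    case 3: ++steps; EXAMPLE_FALLTHROUGH;
    case 2: ++steps; EXAMPLE_FALLTHROUGH;
    case 1: ++steps; break;
    default: break;
    }
    return steps;
}
```

When the reported __cplusplus value is too low, the macro falls back to the no-op, which still compiles; it merely loses the warning suppression.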

Please consider vcpkg support

vcpkg includes an older version of tiny-utf8: https://github.com/microsoft/vcpkg/tree/master/ports/tinyutf8. Would you consider making pull requests there to keep the version in vcpkg up to date?

Many people know about vcpkg already, but for those who don't it's a cross-platform package manager. It's open source. It works with macOS, Linux, Windows, gcc, clang, MSVC, and probably other combinations of OSes and compilers. It's terrific for discovering libraries. That's how I found tiny-utf8.

Warnings when compiling with Clang 9

When including tinyutf8.h I get these warnings

[build] In file included from ../include/tokenizers.h:5:
[build] ../thirdparty/tinyutf8.h:139:75: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
[build]                         static inline unsigned int clz( unsigned int value ) noexcept { return __builtin_clz( value ); }
[build]                                                                                         ~~~~~~ ^~~~~~~~~~~~~~~~~~~~~~
[build] ../thirdparty/tinyutf8.h:140:80: warning: implicit conversion changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
[build]                         static inline unsigned int clz( unsigned long int value ) noexcept { return __builtin_clzl( value ); }
[build]                                                                                              ~~~~~~ ^~~~~~~~~~~~~~~~~~~~~~~
[build] ../thirdparty/tinyutf8.h:142:60: warning: operand of ? changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
[build]                                 return sizeof(char32_t) == sizeof(unsigned long int) ? __builtin_clzl( value ) : __builtin_clz( value );
[build]                                 ~~~~~~                                                 ^~~~~~~~~~~~~~~~~~~~~~~
[build] ../thirdparty/tinyutf8.h:142:86: warning: operand of ? changes signedness: 'int' to 'unsigned int' [-Wsign-conversion]
[build]                                 return sizeof(char32_t) == sizeof(unsigned long int) ? __builtin_clzl( value ) : __builtin_clz( value );
[build]                                 ~~~~~~                                                                           ^~~~~~~~~~~~~~~~~~~~~~

And later for a literal utf8_string UNK = u8"[UNK]"; (or without u8):

[build] In file included from ../include/tokenizers.h:8:
[build] ../thirdparty/tinyutf8.h:906:54: warning: implicit conversion loses integer precision: 'unsigned long' to 'unsigned char' [-Wimplicit-int-conversion]
[build]                         t_sso.data_len = ( sizeof(SSO::data) - data_len ) << 1;
[build]                                        ~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
[build] ../thirdparty/tinyutf8.h:1042:5: note: in instantiation of member function 'tiny_utf8::basic_utf8_string<char32_t, char, std::allocator<char> >::set_sso_data_len' requested here
[build]                                 set_sso_data_len( LITLEN );
[build]                                 ^
[build] /mnt/e/MyProgramming/fused-transformer-mobile-1/src/tokenizers.cc:4:19: note: in instantiation of function template specialization 'tiny_utf8::basic_utf8_string<char32_t, char, std::allocator<char> >::basic_utf8_string<6>' requested here
[build] utf8_string UNK = u8"[UNK]";
[build]                   ^

incorrect string comparison behaviour after a substr operation

#include <tinyutf8/tinyutf8.h>
#include <stdio.h>
namespace utf8 = tiny_utf8;

int main() {
    utf8::string temp;

    for (char32_t c : { 13, 10, 104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 13, 10, 41, 34 }) {
        temp += c;
    }

    auto left = temp.substr(0, temp.length() - 2);

    utf8::string right = "\nhello world\n";

    printf("%s\n", left.c_str());
    printf("%s\n", right.c_str());

    printf("left: %zu, right: %zu\n", left.length(), right.length());

    if (left == right) {
        printf("strings were the same\n");
    } else {
        printf("strings were not the same\n");
    }
}

The expected behaviour would be that both of these strings have the same length and compare equal. Yet left.length() = 15, right.length() = 13 and left != right.

Is it possible to make tiny-utf8 case insensitive?

For example, with std::string we can create a separate typedef that behaves like a case-insensitive class. I tried the same approach with the tiny-utf8 class and got too many errors. The following code makes a std::string-derived class behave case-insensitively.
Any clue?


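#include <cctype>
#include <cstddef>
#include <string>

// Case-insensitive character traits for std::basic_string (single-byte characters only).
struct ci_char_traits : public std::char_traits<char> {
    // std::toupper must be called with a value representable as unsigned char.
    static int up(char c) { return std::toupper(static_cast<unsigned char>(c)); }

    static bool eq(char c1, char c2) { return up(c1) == up(c2); }
    static bool ne(char c1, char c2) { return up(c1) != up(c2); }
    static bool lt(char c1, char c2) { return up(c1) <  up(c2); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        while( n-- != 0 ) {
            if( up(*s1) < up(*s2) ) return -1;
            if( up(*s1) > up(*s2) ) return 1;
            ++s1; ++s2;
        }
        return 0;
    }
    // Returns nullptr when the character is not found, as char_traits::find requires.
    static const char* find(const char* s, std::size_t n, char a) {
        for( ; n != 0; --n, ++s ) {
            if( up(*s) == up(a) ) return s;
        }
        return nullptr;
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;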

error: reference to 'detail' is ambiguous

I just switched to the header only version 3 and get the following errors when compiling my 64-bit application:

tinyutf8.h:732:49: error: reference to 'detail' is ambiguous
  utf8_string( const char* str , size_type len , detail::read_codepoints_tag );
                                                 ^~~~~~

This is the result of the gcc -v command:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=C:/CodingXP/mingw730_32/bin/../libexec/gcc/i686-w64-mingw32/7.3.0/lto-wrapper.exe
Target: i686-w64-mingw32
Configured with: ../../../src/gcc-7.3.0/configure --host=i686-w64-mingw32 --build=i686-w64-mingw32 --target=i686-w64-mingw32 --prefix=/mingw32 --with-sysroot=/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32 --enable-shared --enable-static --disable-multilib --enable-languages=c,c++,fortran,lto --enable-libstdcxx-time=yes --enable-threads=posix --enable-libgomp --enable-libatomic --enable-lto --enable-graphite --enable-checking=release --enable-fully-dynamic-string --enable-version-specific-runtime-libs --enable-libstdcxx-filesystem-ts=yes --disable-sjlj-exceptions --with-dwarf2 --disable-libstdcxx-pch --disable-libstdcxx-debug --enable-bootstrap --disable-rpath --disable-win32-registry --disable-nls --disable-werror --disable-symvers --with-gnu-as --with-gnu-ld --with-arch=i686 --with-tune=generic --with-libiconv --with-system-zlib --with-gmp=/c/mingw730/prerequisites/i686-w64-mingw32-static --with-mpfr=/c/mingw730/prerequisites/i686-w64-mingw32-static --with-mpc=/c/mingw730/prerequisites/i686-w64-mingw32-static --with-isl=/c/mingw730/prerequisites/i686-w64-mingw32-static --with-pkgversion='i686-posix-dwarf-rev0, Built by MinGW-W64 project' --with-bugurl=https://sourceforge.net/projects/mingw-w64 CFLAGS='-O2 -pipe -fno-ident -I/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32/opt/include -I/c/mingw730/prerequisites/i686-zlib-static/include -I/c/mingw730/prerequisites/i686-w64-mingw32-static/include' CXXFLAGS='-O2 -pipe -fno-ident -I/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32/opt/include -I/c/mingw730/prerequisites/i686-zlib-static/include -I/c/mingw730/prerequisites/i686-w64-mingw32-static/include' CPPFLAGS=' -I/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32/opt/include -I/c/mingw730/prerequisites/i686-zlib-static/include -I/c/mingw730/prerequisites/i686-w64-mingw32-static/include' LDFLAGS='-pipe -fno-ident -L/c/mingw730/i686-730-posix-dwarf-rt_v5-rev0/mingw32/opt/lib -L/c/mingw730/prerequisites/i686-zlib-static/lib -L/c/mingw730/prerequisites/i686-w64-mingw32-static/lib -Wl,--large-address-aware'
Thread model: posix
gcc version 7.3.0 (i686-posix-dwarf-rev0, Built by MinGW-W64 project)

Unexpected push_back result

#include <iostream>
#include <cstdlib>
#include "tinyutf8.h"

int main()
{
    tiny_utf8::utf8_string str;
    str.push_back(U'a');
    std::cout << str.size() << "\n" << str[0] << "\n" << str << "\n\n";

    std::string str1;
    str1.push_back('a');
    std::cout << str1.size() << "\n" << str1[0] << "\n" << str1 << "\n";
}

outputs

1
a
a

for the std::string as expected, but

1
0

for tiny_utf8::utf8_string (using Clang 10). It seems \0 is getting appended instead of a.

utf8_string::get_num_bytes_from_start returns incorrect value

Hi Jakob,

Happy New Year!

Looks like the issue you kept fixing strikes again.

It's pretty much the same pattern: mostly plain Western European text with one multibyte interloper.

Similar to #14, but a different point, specifically get_num_bytes_from_start. I came across it when using find_first_of.

Another glitch, which may be stemming from the same piece of code, is that substr truncates the result. Having said that, if the block starting with if( utf8_string::is_lut_active( lut_iter ) )... under get_num_bytes_from_start is disabled, the find_first_of returns a correct result.

Here is the sample snippet demonstrating both:

    utf8_string findFirstBug = u8"The project, therefore, “is an investment in the power of the adolescent girls which is so important to breaking the inter-generational transmission of poverty, violence, exclusion and discrimination in building our societies for a better future”";

    std::cout << "White space found at: " << findFirstBug.find_first_of(U" \t\r\n", 26) << endl; // returns 29 instead of 27
    std::cout << "Total len: " << findFirstBug.length() << " but the substring is truncated by 2 characters: " << findFirstBug.substr(0, 246) << endl;

BTW, I see that we were talking about code reuse in #14. I am wondering if you could encapsulate the LUT block that you use in several functions; it seems like it might save some effort in the future.

Is it possible to do case-insensitive comparison with tiny-utf8?

Hi, is it possible to do case-insensitive comparison with tiny-utf8?

Basically what this question on SO is asking: https://stackoverflow.com/questions/11635/case-insensitive-string-comparison-in-c.

I'm trying to avoid using ICU in my app.

Thanks!
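
Without a Unicode library, the part that can be done reliably by hand is ASCII-only case folding over code points. The sketch below uses std::u32string as a stand-in for a sequence of code points; ascii_fold and equals_ignore_ascii_case are hypothetical helpers, and they deliberately leave non-ASCII letters such as 'Ä' or 'ß' untouched, since full Unicode case folding needs data tables.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: fold ASCII 'A'..'Z' to lowercase and leave every
// other code point unchanged.
inline char32_t ascii_fold(char32_t cp)
{
    return (cp >= U'A' && cp <= U'Z')
        ? static_cast<char32_t>(cp + (U'a' - U'A'))
        : cp;
}

// Compare two code point sequences, ignoring case for ASCII letters only.
inline bool equals_ignore_ascii_case(const std::u32string& lhs,
                                     const std::u32string& rhs)
{
    if (lhs.size() != rhs.size())
        return false;
    for (std::size_t i = 0; i < lhs.size(); ++i) {
        if (ascii_fold(lhs[i]) != ascii_fold(rhs[i]))
            return false;
    }
    return true;
}
```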

Comparing with string constant does not work

Hi,

The following code:

    utf8_string teststring = "µm";
    if (teststring == "µm")
    {
    	std::cout << "teststring == µm --> true" << std::endl;
    }

    if (teststring == utf8_string("µm"))
    {
    	std::cout << "teststring == utf8_string(\"µm\") --> true" << std::endl;
    }

produces the following output:

teststring == utf8_string("µm") --> true

So the comparison with the string constant fails.

cutf + tiny-utf-8 = ?

https://github.com/tapika/cutf

Haven't checked your library, but I've worked on something similar.

Wondering if it makes any sense to recombine bits and pieces together into one library.

C++20

Hello,

C++20 introduced new helper functions ends_with and starts_with, are there any plans to add partial or full support for modern std::basic_string_view operators and utils?
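
Until such members exist, prefix and suffix checks are easy to write as free functions. The sketch below works on the raw bytes in a std::string, which gives the same answer as a code point comparison when both strings are well-formed UTF-8; has_prefix and has_suffix are hypothetical names.

```cpp
#include <string>

// Hypothetical helper: true when `text` begins with `prefix`.
inline bool has_prefix(const std::string& text, const std::string& prefix)
{
    return text.size() >= prefix.size() &&
           text.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: true when `text` ends with `suffix`.
inline bool has_suffix(const std::string& text, const std::string& suffix)
{
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```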

utf8_string::get_num_codepoints returns number of bytes under certain circumstances with ::lut_active

Hi Jakob,

Looks like the part of get_num_codepoints that uses lut_active has issues under some circumstances. I could only fix it by disabling that part altogether.

Here is the example:

utf8_string bad_find = u8"S’pore Starbucks selling. gorgeous";
utf8_string substr = u8".";
std::cout << bad_find.find(substr, 0);
std::cout.flush();

The . character is at position 24. The return value is 26 due to ’ being a "compound" codepoint.

Best regards,
Vadim

Tricky to use with MSVC and C++20

The example code doesn't compile with C++20 or above. The only mention of C++20 is a note to use tiny_utf8::u8string instead. That also fails to compile on MSVC at least since it's guarded by a check on __cplusplus. That macro is by default not set properly on MSVC, see https://docs.microsoft.com/en-us/cpp/build/reference/zc-cplusplus. Maybe that check could be rewritten in a more universal way, or at least some documentation be added about any of this.
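
One possible shape for a more universal check is to consult _MSVC_LANG, the macro MSVC sets according to the /std: option even when __cplusplus is left at its legacy value. This is a sketch only; EXAMPLE_CPLUSPLUS and example_has_cpp20 are hypothetical names.

```cpp
// EXAMPLE_CPLUSPLUS is a hypothetical macro name. _MSVC_LANG is MSVC's own
// language-version macro; it follows /std:... even when __cplusplus is
// left at 199711L.
#if defined(_MSVC_LANG) && _MSVC_LANG > __cplusplus
#  define EXAMPLE_CPLUSPLUS _MSVC_LANG
#else
#  define EXAMPLE_CPLUSPLUS __cplusplus
#endif

// Hypothetical query built on top of the macro.
constexpr bool example_has_cpp20()
{
    return EXAMPLE_CPLUSPLUS >= 202002L;
}
```

On other compilers _MSVC_LANG is undefined, so the macro simply forwards __cplusplus.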

Option for No Exceptions for Small Embedded Systems

I am using your library, and have a forked version here:

https://github.com/bakerstu/tiny-utf8

I use g++ and my project does not enable C++ exception handling. I would not like to add exception handling because it increases the code size to be larger than I would like. Therefore, I have "hacked" a fix into a branch noexcept that replaces the throw statements with assertions compatible with our project platform.

I acknowledge that this is not a portable fix, as is, and should not be accepted back as a pull request. Therefore, I would like to propose an alternative solution that could possibly be accepted.

I want to replace your throw statements with an inline function call. The inline function call should be ifndef guarded so that by default it throws the exception:

#if !defined(TINYUTF8_NOEXCEPT)
inline void tinyutf8_error(const char *what_arg)
{
    throw std::out_of_range(what_arg); 
}
#endif

If the user defines TINYUTF8_NOEXCEPT, then it is on them to provide an implementation of tinyutf8_error() that works for their purposes and platform. This is an acceptable solution for my use case.

I'd like to get your thoughts on this, and if favorable, work on a pull request.
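
To make the proposal concrete, here is a sketch of how a call site might use the hook. tinyutf8_error and TINYUTF8_NOEXCEPT are the names proposed above; checked_at is a hypothetical call site invented for the illustration and does not exist in the library.

```cpp
#include <cstddef>
#include <stdexcept>

#if !defined(TINYUTF8_NOEXCEPT)
// Default behaviour, as proposed above: report the error by throwing.
inline void tinyutf8_error(const char* what_arg)
{
    throw std::out_of_range(what_arg);
}
#else
// With TINYUTF8_NOEXCEPT defined, the project supplies the definition.
// It should not return normally (for example, it can assert or abort).
void tinyutf8_error(const char* what_arg);
#endif

// Hypothetical call site: a bounds-checked access that reports through the hook.
inline char checked_at(const char* data, std::size_t size, std::size_t index)
{
    if (index >= size)
        tinyutf8_error("checked_at: index out of range");
    return data[index];
}
```

With the default definition, an out-of-range index still surfaces as std::out_of_range, so existing callers would see no change in behaviour.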

Incorrect codepoint size calculation on processors that do not support 'lzcnt'

tiny-utf8/tinyutf8.h

Line 82 in 4fc49c8

static inline unsigned int clz( uint32_t val ){ return __lzcnt( val ); }

On processors that do not support the 'lzcnt' instruction, this intrinsic will be interpreted incorrectly as 'bsr' (BitScanReverse), thus giving the bit index instead of the count of leading zeros.

This in turn causes the calculations for codepoint sizes to be incorrect when used in encoding and decoding operations.

If a suitable alternative is not available, sources I have seen recommend basing 'clz' on BitScanReverse instead.
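
The relationship between the two results is simple: for a non-zero 32-bit value, the number of leading zeros equals 31 minus the index of the highest set bit. The sketch below emulates the bit index portably instead of calling the MSVC intrinsic; highest_set_bit and clz_from_bit_index are hypothetical names.

```cpp
#include <cstdint>

// Portable stand-in for what a bit-scan-reverse reports: the index of the
// highest set bit. Precondition: value != 0.
inline unsigned int highest_set_bit(std::uint32_t value)
{
    unsigned int index = 0;
    while (value >>= 1)
        ++index;
    return index;
}

// Leading-zero count derived from the bit index. Zero is handled separately
// because a bit scan has no defined result for it.
inline unsigned int clz_from_bit_index(std::uint32_t value)
{
    return value == 0 ? 32u : 31u - highest_set_bit(value);
}
```

For the value 1 the bit index is 0 and the leading-zero count is 31, which illustrates how far apart the two interpretations are.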

Errors building with Visual C++

Thank you for your work on this project. I look forward to using it.

I'm trying to use version 3.2.2 with Visual C++ inside Visual Studio Community 2019 version 16.6.0. Here are several errors:

ConsoleApplication2\tinyutf8.h(3202): error C4703: potentially uninitialized local pointer variable 'app_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3198): error C4703: potentially uninitialized local pointer variable 'old_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3533): error C4703: potentially uninitialized local pointer variable 'str_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3484): error C4703: potentially uninitialized local pointer variable 'old_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3932): error C4703: potentially uninitialized local pointer variable 'repl_lut_base_ptr' used
ConsoleApplication2\tinyutf8.h(3880): error C4703: potentially uninitialized local pointer variable 'old_lut_base_ptr' used

Any tips for using it in Visual C++?
