Before the code gets too large, algorithms and policies for processing UTF8 should be

UTF8 support about mutgos_server HOT 12 CLOSED

mutgos commented on August 24, 2024

UTF8 support

from mutgos_server.

Comments (12)

zelerin commented on August 24, 2024

Let's start the discussion on this. In the core code itself, strings are mostly passed around as-is, with very little processing (except in specific areas like the command interpreters, colorizing, lock parsing, etc). The areas that do perform processing currently look for certain ASCII characters, so if they were to encounter UTF8 they should just pass over it no problem. Incoming and outgoing strings both have to be encapsulated as ExternalText, and AngelScripts use a custom string class that could easily process UTF8 behind the scenes.

Here are the problems to be solved (as I see it):

Malicious (or incorrectly formatted) UTF8. Web browsers, upon encountering such things, immediately drop the connection. We have to protect against this by removing or rejecting badly formatted strings.
size(), substring(), and any chopping off of excess characters need to be UTF8 aware. This is easy to do in AngelScript, but everything else in MUTGOS uses std::string, meaning support would not be integrated.
Certain parsers may need to be UTF8 aware.
Efficiently handling UTF8. My experience so far is that it's a very intensive process, though the strings are all in the CPU cache meaning it may not be as bad as I think. If there are small, efficient, standalone libraries that do this stuff for us, they should be considered. Boost does not appear to have one.

Possible solutions:

Create global functions that can be used to check for and/or modify strings that have corrupt/invalid UTF8. This will require calling them at certain entry points in comminterface and all over AString. Assuming every place where user-generated strings enter the system is checked, the other areas shouldn't require any changes.
Convert the whole system to use wide characters (std::wstring, etc) internally. This is not a favorite solution for me and might be really difficult; however, in theory ExternalText could do the conversion to and from the outside world and hide it.
AString can be modified such that size() will show the size of the string as if each multi-byte UTF8 sequence is a single character. substring(), etc would work similarly. This makes everything transparent to the scripts. Similar functions can be made available as free functions for C++ code.
When lopping off the ends of strings due to a byte limit exceeded, etc, a free function can be made that checks for UTF8 at the end and chops it off at a boundary.
Create our own C++ string class that inherits from std::string and hides UTF8 stuff within. This could both be reused in AString and be used throughout MUTGOS. In theory a search and replace could be done to fix almost all areas. There could be subtle compatibility issues with external libraries, however, requiring converting back and forth.
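
As a rough illustration of the "chop it off at a boundary" idea from the list above, here is a minimal sketch of such a free function. The names truncate_utf8 and is_continuation_byte are hypothetical and not part of MUTGOS, and the sketch assumes the input has already been validated as UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of MUTGOS): true if the byte is a UTF-8
// continuation byte, i.e. it has the bit pattern 10xxxxxx.
inline bool is_continuation_byte(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Hypothetical helper: shortens `text` to at most `max_bytes` bytes without
// cutting a multi-byte sequence in half.
// Assumption: `text` is already valid UTF-8.
std::string truncate_utf8(const std::string &text, std::size_t max_bytes)
{
    if (text.size() <= max_bytes)
    {
        return text;
    }

    std::size_t cut = max_bytes;

    // If the first byte that would be dropped is a continuation byte, the cut
    // lands inside a sequence, so back up until it sits on a lead byte.
    while (cut > 0 && is_continuation_byte(static_cast<unsigned char>(text[cut])))
    {
        --cut;
    }

    return text.substr(0, cut);
}
```

With a 3-byte limit, the string "a" followed by the 4-byte fox emoji comes back as just "a", since keeping any part of the emoji would leave an invalid sequence.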

I wonder if @Nyanster has any thoughts on how FuzzBall is currently handling this.

Ideas? Online articles of interest? I'd like to design an approach early, and do any stubs and such as needed while the codebase is relatively small, even if actual UTF8 processing code won't be implemented yet.

Somewhat related, I was hoping to make MUTGOS more internationalized from the start, but that's always an ambitious thing even for professional products. I think that can wait until after v1.0; most of that would be within the AngelScripts anyway.

commented on August 24, 2024

I wonder if @Nyanster has any thoughts on how FuzzBall is currently handling this.

I actually don't know the internals of FuzzBall all that well. :B I am mostly a user, a user who wants to become more knowledgeable.
Well, AFAIK, fb doesn't handle unicode or utf8. Oddly enough, though, I have seen people send 8-bit characters through just fine in says and poses, like guillemets and accented characters. Probably https://en.wikipedia.org/wiki/ISO/IEC_8859-1 but really whatever xterm is recognising by default for those characters.

@wyld-sw would know more than I do.

zelerin commented on August 24, 2024

@Nyanster , I have definitely seen that situation, where it allows 'extended ascii' (there's a tune param that turns on 8 bit support), though I figured that means it can also do UTF8. I'm guessing it's not something most people even think to try.

hyena commented on August 24, 2024

http://utf8everywhere.org/ has some comments on supporting utf-8 in a C++ codebase that may be relevant. The tl;dr is that the approach they suggest seems to be using std::string in a proper way and avoiding writing a new subclass to represent it.

hyena commented on August 24, 2024

I'm just beginning to look into this but it's a little depressing how little we have in the standard library for some basic UTF-8 operations.

Q: If we check that all user input is valid UTF-8 on the way in and treat invalid input as an error, do we miss any input?

Q: Within the C++ codebase are there any circumstances where we'd care about a string as more than just bytes? e.g. are there limits we'd impose on character length instead of byte length?

The part that gives me pause is that reading some of the AngelScript docs, it looks like it will be happy with UTF-8 literals in the scripts (so e.g. a user could make their script println("🦊")) but we have to register our own string type for representation. Which means, I think, that "🦊".length() will return 4 (because a fox is b'\xf0\x9f\xa6\x8a') not 1. This would complicate matters for, for example, code that wants to do column output or wordwrapping.
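
To make the byte-versus-character gap concrete, here is a small sketch in plain C++. The function count_code_points is a hypothetical name, and it assumes the string is already valid UTF-8; it simply counts every byte that is not a continuation byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by counting every byte that is NOT
// a continuation byte (bit pattern 10xxxxxx).
// Assumption: `text` is already valid UTF-8.
std::size_t count_code_points(const std::string &text)
{
    std::size_t count = 0;

    for (unsigned char byte : text)
    {
        if ((byte & 0xC0) != 0x80)
        {
            ++count;
        }
    }

    return count;
}
```

For the fox emoji stored as the bytes F0 9F A6 8A, std::string::size() reports 4 while count_code_points reports 1. Code points are still not the same as visible characters: an "e" followed by a combining acute accent (U+0301) counts as 2 code points, so column output based on this count is only an approximation.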

hyena commented on August 24, 2024

I'm starting to feel like our lives might be simpler if we register an angelscript string type that will only accept valid utf-8 (i.e. not arbitrary bytes) and provide means to try to convert an Array of bytes to it if absolutely necessary. The .length() value should return the length of the string in code points (or is it graphemes?) rather than bytes.

zelerin commented on August 24, 2024

Luckily, I have already created (and we are using) our own string type in AngelScript: AString. The intention was that we could modify it to handle UTF-8 operations seamlessly. Based on your thoughts and mine, as I see it, we need just a few new functions that can be used everywhere, including in AString:

Check for a valid UTF-8 string. Any invalid strings should be rejected and an error thrown or returned back to the user.
Check for the presence of UTF-8 at all. This may be used in the future to optionally block UTF-8 usernames, etc.
The 'size' of a UTF-8 string (which may not be perfect, but may be close enough for most uses and can always be improved). It would probably just count normal ASCII and UTF-8 code points for simplicity. For the fox emoji example, it would return 1, since it's counting the code point start to end as a single 'character'.
The ability to chop/separate a string based on the 'size' - in other words, don't allow a code point to be chopped in half, but chop around the code points.
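
The first two functions in the list above might look something like the following sketch. The names is_valid_utf8 and contains_non_ascii are hypothetical, not existing MUTGOS functions. The validity rules follow the usual UTF-8 definition: correct lead and continuation bytes, no overlong encodings, no surrogates, and nothing above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: true if any byte is outside the 7-bit ASCII range.
bool contains_non_ascii(const std::string &text)
{
    for (unsigned char byte : text)
    {
        if (byte >= 0x80)
        {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: true if `text` is well-formed UTF-8.
bool is_valid_utf8(const std::string &text)
{
    const std::size_t size = text.size();
    std::size_t i = 0;

    while (i < size)
    {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t extra = 0;         // continuation bytes expected
        std::uint32_t code_point = 0;  // decoded value
        std::uint32_t minimum = 0;     // smallest value allowed at this length

        if (lead < 0x80)                { extra = 0; code_point = lead;        minimum = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; code_point = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; code_point = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; code_point = lead & 0x07; minimum = 0x10000; }
        else { return false; }  // stray continuation byte or invalid lead byte

        if (size - i - 1 < extra)
        {
            return false;  // sequence is cut off at the end of the string
        }

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char next = static_cast<unsigned char>(text[i + k]);
            if ((next & 0xC0) != 0x80)
            {
                return false;  // expected a continuation byte
            }
            code_point = (code_point << 6) | (next & 0x3F);
        }

        if (code_point < minimum) { return false; }   // overlong encoding
        if (code_point > 0x10FFFF) { return false; }  // beyond Unicode range
        if (code_point >= 0xD800 && code_point <= 0xDFFF) { return false; }  // surrogate

        i += extra + 1;
    }

    return true;
}
```

A caller at an input entry point could reject the line when is_valid_utf8 returns false, and use contains_non_ascii for optional policies such as ASCII-only usernames.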

I think if we confirm data coming in (and generated with AString) is UTF-8 compliant, and prevent the user from making string cuts or substrings that are in the middle of code points, we shouldn't have to worry about strings going invalid elsewhere. Since I think (?) we're going with the idea @Nyanster proposed of simply rejecting strings that are too long (as applicable), internally strings will just be passed around and searched without modification. That should make things really simple.

Any disagreements? If not, I think this could be considered our way forward. Once those common functions are done it wouldn't be all that hard to do.

hyena commented on August 24, 2024

I think I'm in agreement. My take on it, tell me what you think:

AString should be backed internally by std::string. Notably, not std::wstring.
AString should only ever be allowed to contain valid UTF-8. That means checks on its constructors, that a split operation can't split a codepoint (so it needs to probably take an AString as its split parameter), etc. Concatenating AStrings should be safe.
The .length() call on an AString is the length in codepoints. We can have a .size() call to return the size in bytes.
I'm not sure I understand the idea of rejecting strings that are 'too long'.
In theory if AngelScript only has access to AString and AStrings only permit UTF-8 and all things that set/check properties are in AngelScript, we shouldn't have to do more scrubbing of input. But I do think that it would be prudent and a sanity check even so to check all inputs from a user to make sure they're UTF-8 compliant.
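
A minimal sketch of how the .length() versus .size() split and a codepoint-safe substring could sit on top of std::string. Utf8Text is a hypothetical stand-in, not the real AString, and it leaves out the constructor validation and the AngelScript registration discussed above; it assumes the bytes it is given are already valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper for illustration only; NOT the real MUTGOS AString.
class Utf8Text
{
public:
    // Assumption: the caller has already validated `bytes` as UTF-8.
    explicit Utf8Text(std::string bytes) : bytes_(std::move(bytes)) {}

    // Size in bytes.
    std::size_t size() const { return bytes_.size(); }

    // Length in code points.
    std::size_t length() const
    {
        std::size_t count = 0;
        for (unsigned char byte : bytes_)
        {
            if (!is_continuation(byte))
            {
                ++count;
            }
        }
        return count;
    }

    // Returns up to `count` code points starting at code point index `start`.
    // Both arguments are in code points, so a sequence is never split.
    Utf8Text substr(std::size_t start, std::size_t count) const
    {
        const std::size_t begin = advance(0, start);
        const std::size_t end = advance(begin, count);
        return Utf8Text(bytes_.substr(begin, end - begin));
    }

    const std::string &bytes() const { return bytes_; }

private:
    static bool is_continuation(unsigned char byte)
    {
        return (byte & 0xC0) == 0x80;
    }

    // Moves forward `code_points` code points from byte offset `from` (which
    // must be on a boundary) and returns the new byte offset, clamped to the
    // end of the string.
    std::size_t advance(std::size_t from, std::size_t code_points) const
    {
        std::size_t pos = from;
        while (code_points > 0 && pos < bytes_.size())
        {
            ++pos;  // step over the lead byte
            while (pos < bytes_.size() &&
                   is_continuation(static_cast<unsigned char>(bytes_[pos])))
            {
                ++pos;  // step over its continuation bytes
            }
            --code_points;
        }
        return pos;
    }

    std::string bytes_;
};
```

Because the storage is still a plain std::string, bytes() can be handed to existing code unchanged, which fits the point about not needing std::wstring.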

zelerin commented on August 24, 2024

Sounds good to me!

The 'too long' bit is because we want to have a maximum line length for most things. For instance, we don't want to allow Entities to have a 1 megabyte name, right? I also wouldn't want to allow the comm modules to accept a 1 gigabyte line from the user. There would have to be some reasonable limit set. All it would do is fail to accept any strings above a certain byte size (not codepoints) appropriate for the operation/field, and require the user to retry after slimming it down.
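
As a sketch of that policy (reject rather than truncate, with the limit measured in bytes), something like the following could be called per field. FieldCheck and check_field_size are hypothetical names, and the actual limits would be whatever is appropriate for each operation or field.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for illustration.
enum class FieldCheck
{
    Ok,
    TooLong
};

// Hypothetical helper: rejects any value whose size in BYTES (not code
// points) exceeds the limit. The value is never modified or truncated; the
// caller is expected to ask the user to retry with something shorter.
FieldCheck check_field_size(const std::string &value, std::size_t max_bytes)
{
    return (value.size() > max_bytes) ? FieldCheck::TooLong : FieldCheck::Ok;
}
```

Since the limit is in bytes, an 8-byte limit accepts eight ASCII letters but rejects three Thai characters, which take 3 bytes each in UTF-8.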

hyena commented on August 24, 2024

Good. I like the idea of byte-size limits on fields. We just have to remember that languages like Thai take several times more byte storage.

Alright. Then it sounds like all we need is AString (as a class to be registered with AngelScript) with UTF-8 checking and UTF-8 aware .length() and then a checker on user input lines...

Maybe that goes in SocketClientConnection::process_raw_incoming_data() which already checks for a valid length? May need to do something similar for the websocket.

text_Utf8Tools will probably go away.

hyena commented on August 24, 2024

One nice thing is as long as it's backed by std::string all the existing datamodels and even the generated database dumps are still valid by virtue of being valid UTF-8.

zelerin commented on August 24, 2024

Making good progress on this. The code is done; am currently reviewing it and then will test.
