Hi, I appreciate UTF family encodings enforcement in TextEnc

Benefits of "Legacy" Encodings – Byte Counter (issue in encoding, closed)

zomp commented on September 15, 2024

Benefits of "Legacy" Encodings – Byte Counter

from encoding.

Comments (11)

mathiasbynens commented on September 15, 2024

windows-1250 and koi8-u are listed in the spec as legacy single-byte encodings, so assuming the input is valid (i.e. can be represented in that encoding), you can just get its .length to count the bytes.

Alternatively, use a library that implements the encoding you need (e.g. https://github.com/mathiasbynens/windows-1250 for windows-1250; see https://www.npmjs.com/browse/keyword/legacy-encoding for the full list of single-byte legacy encodings) and use library.encode(input).length.

zomp commented on September 15, 2024

@mathiasbynens, thanks for the reply. Do you mean something like myTextarea.value.length, please? If I understand it correctly this does not work – JavaScript converts the string to UTF-16, so the .length represents only the number of UTF-16 code units. The measurement could be done on a byte stream, but I do not know how to obtain it – it cannot be done via (new TextEncoder('big5')).encode(myTextarea.value) because of the UTF exclusivity.

The libraries look very nice, I will examine them. A native solution would still be better – I need support for dozens of encodings, including multi-byte ones – and that would require megabytes of libraries, which is not very elegant for such a simple task as byte counting (moreover, browsers already have this functionality built in).

(The counter page by itself is in UTF-8 – the final text encoding can be set independently.)
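
To make the distinction in this comment concrete, here is a minimal sketch. It assumes a runtime with the standard TextEncoder (current browsers, Node.js); the sample string and variable names are only illustrative. It shows that .length counts UTF-16 code units and that TextEncoder yields UTF-8 bytes, neither of which is a byte count in a legacy encoding.

```javascript
// Illustrative sample (written with escapes so it is unambiguous):
// the Czech word "žluťoučký", a space, and one emoji outside the BMP.
const text = "\u017Elu\u0165ou\u010Dk\u00FD \u{1F600}";

// String.prototype.length counts UTF-16 code units, not bytes.
const utf16CodeUnits = text.length;

// Iterating a string yields code points, so the emoji counts once here.
const codePoints = [...text].length;

// TextEncoder encodes to UTF-8; it offers no way to pick a legacy target.
const encoder = new TextEncoder();
const utf8Bytes = encoder.encode(text).length;

console.log(encoder.encoding, utf16CodeUnits, codePoints, utf8Bytes);
// prints: utf-8 12 11 18
```

None of the three numbers is the size the same text would have in, say, windows-1250 or big5, which is the gap this thread is about.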

mathiasbynens commented on September 15, 2024

@zomp

Do you mean something like myTextarea.value.length, please?

Exactly.

If I understand it correctly this does not work – JavaScript converts the string to UTF-16, so the .length represents only the number of UTF-16 code units.

Yes, but the Unicode symbols that can be represented in any of those legacy single-byte encodings are all within the Basic Multilingual Plane, and would thus each have a length of 1 anyways.

The measurement could be done on a byte stream, but I do not know how to obtain it – it cannot be done via (new TextEncoder('big5')).encode(myTextarea.value) because of the UTF exclusivity.

It could be done using the libraries + code snippet I mentioned.
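
The ".length is the byte count if the input is valid" idea for single-byte encodings can be sketched without a library. This is only a sketch under assumptions, not something proposed in the thread: it assumes the runtime's TextDecoder accepts the legacy label (browsers do; Node.js builds without full ICU may not), and the helper names buildRepertoire and countSingleByteBytes are hypothetical. It decodes each of the 256 byte values once to learn which characters the encoding can represent, then validates the input against that set.

```javascript
// Hypothetical helper: collect every character a single-byte encoding
// can represent by decoding each possible byte value once.
function buildRepertoire(label) {
  const decoder = new TextDecoder(label); // throws RangeError if unsupported
  const repertoire = new Set();
  for (let byte = 0; byte < 256; byte++) {
    const ch = decoder.decode(Uint8Array.of(byte));
    // Skip bytes that do not decode to exactly one real character.
    if (ch.length === 1 && ch !== "\uFFFD") {
      repertoire.add(ch);
    }
  }
  return repertoire;
}

// Hypothetical helper: returns the byte count, or null when some
// character cannot be represented in the encoding at all.
function countSingleByteBytes(text, repertoire) {
  for (const ch of text) { // iterates by code point
    if (!repertoire.has(ch)) {
      return null;
    }
  }
  // Every character is in the BMP and maps to one byte,
  // so the UTF-16 length equals the encoded byte length.
  return text.length;
}
```

A call such as countSingleByteBytes(myTextarea.value, buildRepertoire("windows-1250")) would then give the count for valid input. The approach does not carry over to multi-byte encodings such as big5, where a single byte does not decode on its own.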

zomp commented on September 15, 2024

@mathiasbynens, thank you, I am on the same page now. Unfortunately the .length approach does not work for the legacy multi-byte encodings which I need to handle.

The libraries are promising; I just doubt that, given their size (and running at a fraction of the speed of a native solution), they could keep up with as-you-type byte counting on mobile devices. It's time to check it. :-)

annevk commented on September 15, 2024

Why do you need the byte count for legacy encodings?

zomp commented on September 15, 2024

Hi @annevk, I work on a tool for professional translators (editors...) who need to know such things (I suppose mainly because of subsequent text applications – "will it fit into the provided disk/DB space?").

I do not mind implementing the byte counting differently, but I see no other way on the client.

annevk commented on September 15, 2024

Why would such tools not use utf-8 throughout? They could probably use an overhaul if that's really still the case...

zomp commented on September 15, 2024

The text processing tool we are working on uses UTF-8 (and converts to the target encoding on output), but the tens of thousands of Oracle* projects it processes cannot simply be converted to UTF-8. I just thought that without this artificial restriction** we could implement the character counting in an easy way (otherwise we will have to call a server-side character counter on input, which will be slow).

* I was just told there is no problem with naming our company; I just cannot say this is an official statement.

** Browsers are (or should be) capable of handling "legacy" encodings, e.g. because of the form's accept-charset attribute; the standard just does not allow them to encode.

annevk commented on September 15, 2024

I see, fair enough. I suspect there is little appetite for adding this though and getting it in browsers would take a relatively long time.

zomp commented on September 15, 2024

Personally I like the current state; the only pity is that it cannot handle some scenarios. I also started this thread hoping that someone would recommend a workaround, and that could still happen (even now I can build on Mathias's libraries). Thank you all!

annevk commented on September 15, 2024

Closing this. Thank you @zomp, sorry we could not be of more help.
