Comments (11)

mathiasbynens commented on September 15, 2024

windows-1250 and koi8-u are listed in the spec as legacy single-byte encodings, so assuming the input is valid (i.e. can be represented in that encoding), you can just get its .length to count the bytes.

Alternatively, use a library that implements the encoding you need (e.g. https://github.com/mathiasbynens/windows-1250 for windows-1250; see https://www.npmjs.com/browse/keyword/legacy-encoding for the full list of single-byte legacy encodings) and use library.encode(input).length.
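As a rough illustration of the two approaches above, here is a minimal sketch. The `encodeSingleByteStandIn` function is a hypothetical stand-in for a real single-byte encoder, not the API of any actual library: it only accepts code points up to U+00FF and throws on anything else.

```javascript
// Hypothetical stand-in for a single-byte legacy encoder (NOT a real library):
// it maps code points U+0000..U+00FF to one byte each and rejects the rest.
function encodeSingleByteStandIn(input) {
  const bytes = new Uint8Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const codeUnit = input.charCodeAt(i);
    if (codeUnit > 0xff) {
      const hex = codeUnit.toString(16).toUpperCase().padStart(4, '0');
      throw new RangeError(`Cannot encode U+${hex}`);
    }
    bytes[i] = codeUnit;
  }
  return bytes;
}

// Approach 1: if the input is known to be representable in a single-byte
// encoding, the string's .length is already the byte count.
function countBytesAssumingValid(input) {
  return input.length;
}

// Approach 2: encode first, then count; this also surfaces unencodable input.
function countBytesByEncoding(input, encode) {
  return encode(input).length;
}

const sample = 'caf\u00E9'; // "café" with a precomposed é
console.log(countBytesAssumingValid(sample));
console.log(countBytesByEncoding(sample, encodeSingleByteStandIn));
```

With a real encoder library, the second function would presumably be called with that library's own encode function in place of the stand-in.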

from encoding.

zomp commented on September 15, 2024

@mathiasbynens, thanks for the reply. Do you mean something like myTextarea.value.length, please? If I understand it correctly this does not work – JavaScript converts the string to UTF-16, so the .length represents only the number of UTF-16 code units. The measurement could be done on a byte stream, but I do not know how to obtain it – it cannot be done via (new TextEncoder('big5')).encode(myTextarea.value) because of the UTF exclusivity.
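To make the difference between code units and bytes concrete, here is a small sketch. It assumes a runtime with a global TextEncoder (current browsers and Node.js). The sample character is only an illustration: it is one UTF-16 code unit and three UTF-8 bytes, while Big5 would need two bytes for it.

```javascript
// TextEncoder always produces UTF-8, whatever target encoding you have in mind.
const encoder = new TextEncoder();

const text = '\u4E2D'; // "中", a single character in the Basic Multilingual Plane

const utf16CodeUnits = text.length;            // what .length reports
const utf8Bytes = encoder.encode(text).length; // byte count in UTF-8, not Big5

console.log(encoder.encoding);          // "utf-8"
console.log(utf16CodeUnits, utf8Bytes); // 1 3
```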

The libraries look very nice; I will examine them. A native solution would still be better: I need support for dozens of encodings, including multi-byte ones, which would require megabytes of libraries. That is not very elegant for such a simple task as byte counting (moreover, browsers already have access to this functionality).

(The counter page by itself is in UTF-8 – the final text encoding can be set independently.)

mathiasbynens commented on September 15, 2024

@zomp

Do you mean something like myTextarea.value.length, please?

Exactly.

If I understand it correctly this does not work – JavaScript converts the string to UTF-16, so the .length represents only the number of UTF-16 code units.

Yes, but the Unicode symbols that can be represented in any of those legacy single-byte encodings are all within the Basic Multilingual Plane, and would thus each have a length of 1 anyways.
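A quick sketch of why this holds: a BMP character occupies one UTF-16 code unit, while a character outside the BMP occupies two. The helper name `isAllBmp` is made up for this example.

```javascript
// True when every code point in the string lies in the Basic Multilingual
// Plane, in which case .length equals the number of code points.
function isAllBmp(input) {
  for (const ch of input) {          // iterates by code point, not code unit
    if (ch.codePointAt(0) > 0xffff) return false;
  }
  return true;
}

const bmpChar = '\u017E';      // "ž", representable in windows-1250
const astralChar = '\u{1F600}'; // an emoji outside the BMP

console.log(bmpChar.length, isAllBmp(bmpChar));       // 1 true
console.log(astralChar.length, isAllBmp(astralChar)); // 2 false
```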

The measurement could be done on a byte stream, but I do not know how to obtain it – it cannot be done via (new TextEncoder('big5')).encode(myTextarea.value) because of the UTF exclusivity.

It could be done using the libraries + code snippet I mentioned.

zomp commented on September 15, 2024

@mathiasbynens, thank you, we are on the same wavelength now. Unfortunately the .length approach does not work for the legacy multi-byte encodings which I need to handle.
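To show why .length breaks down for multi-byte encodings, here is a toy sketch. The two-entry table and the "ASCII is one byte, everything else in the table is two bytes" rule are assumptions made up for this example; a real encoder would need the full index for the target encoding.

```javascript
// Toy byte-length rule for a hypothetical ASCII-compatible double-byte
// encoding: ASCII takes 1 byte, characters in the (tiny, made-up) table take 2.
const TOY_DOUBLE_BYTE_CHARS = new Set(['\u4E2D', '\u6587']); // "中", "文"

function countBytesToyDoubleByte(input) {
  let total = 0;
  for (const ch of input) {
    if (ch.codePointAt(0) <= 0x7f) {
      total += 1;
    } else if (TOY_DOUBLE_BYTE_CHARS.has(ch)) {
      total += 2;
    } else {
      throw new RangeError(`Not encodable in the toy table: ${ch}`);
    }
  }
  return total;
}

const mixed = 'ab\u4E2D\u6587';
console.log(mixed.length);                   // 4 code units
console.log(countBytesToyDoubleByte(mixed)); // 6 bytes under the toy rule
```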

The libraries are promising, but given their size (and running at a fraction of the speed of a native solution), I doubt they could handle as-you-type byte counting on mobile devices. It's time to check it. :-)

annevk commented on September 15, 2024

Why do you need the byte count for legacy encodings?

zomp commented on September 15, 2024

Hi @annevk, I work on a tool for professional translators (editors...) who need to know such things (I suppose mainly because of subsequent text applications – "will it fit into the provided disk/DB space?").

I do not mind implementing the byte counting differently, but I see no other way to do it on the client.

annevk commented on September 15, 2024

Why would such tools not use utf-8 throughout? They could probably use an overhaul if that's really still the case...

zomp commented on September 15, 2024

The text processing tool we are working on uses UTF-8 (and converts to the target encoding on output), but the tens of thousands of Oracle* projects it processes cannot simply be converted to UTF-8. I just thought that if there were no such artificial restriction,** we could implement the byte counting in an easy way (otherwise we will have to call a server-side counter on input, which will be slow).

* I was just told there is no problem with naming our company; I just cannot say this is an official statement.

** Browsers are (or should be) capable of handling "legacy" encodings, e.g. because of the form's accept-charset attribute; the standard just does not allow them to encode via this API.
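The asymmetry can be seen in a short sketch: decoding accepts legacy labels, while encoding is UTF-8 only. This assumes a runtime whose TextDecoder supports the windows-1250 label (browsers do; Node.js needs a build with full ICU data). The `decodeLegacy` helper is a name made up for this example.

```javascript
// Decoding from a legacy encoding is available through TextDecoder...
function decodeLegacy(label, bytes) {
  return new TextDecoder(label).decode(bytes);
}

// ...but TextEncoder has no counterpart: it only ever reports and emits UTF-8.
const encoderEncoding = new TextEncoder().encoding;

// 0x9E is "ž" (U+017E) in windows-1250.
const decoded = decodeLegacy('windows-1250', new Uint8Array([0x9e]));

console.log(encoderEncoding);           // "utf-8"
console.log(decoded === '\u017E');
```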

annevk commented on September 15, 2024

I see, fair enough. I suspect there is little appetite for adding this though and getting it in browsers would take a relatively long time.

zomp commented on September 15, 2024

Personally I like the current status; the only pity is that it cannot handle some scenarios. I also started this thread hoping someone would recommend a workaround, and that could still happen (even now I can build on Mathias's libraries). Thank you all!

annevk commented on September 15, 2024

Closing this. Thank you @zomp, sorry we could not be of more help.
