Comments (11)
windows-1250 and koi8-u are listed in the spec as legacy single-byte encodings, so assuming the input is valid (i.e. can be represented in that encoding), you can just get its `.length` to count the bytes.
Alternatively, use a library that implements the encoding you need (e.g. https://github.com/mathiasbynens/windows-1250 for windows-1250; see https://www.npmjs.com/browse/keyword/legacy-encoding for the full list of single-byte legacy encodings) and use `library.encode(input).length`.
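The `.length` shortcut above only holds when every character of the input is representable in the target encoding. A minimal sketch of that check follows, assuming the runtime's `TextDecoder` accepts the legacy label (browsers do; Node.js needs a build with full ICU). The function names `representableChars` and `singleByteLength` are made up for illustration and are not part of any library mentioned here.

```javascript
// Build the set of characters a single-byte encoding can represent by
// decoding every possible byte value once.
// Assumption: TextDecoder supports the given legacy label in this runtime.
function representableChars(label) {
  const decoder = new TextDecoder(label);
  const chars = new Set();
  for (let byte = 0; byte < 256; byte++) {
    const ch = decoder.decode(new Uint8Array([byte]));
    // Skip bytes the decoder reports as unmapped (U+FFFD).
    if (ch !== '\uFFFD') chars.add(ch);
  }
  return chars;
}

// Return the byte count of `input` in a single-byte encoding,
// or null if some character cannot be represented.
function singleByteLength(input, label) {
  const allowed = representableChars(label);
  for (const ch of input) {
    if (!allowed.has(ch)) return null;
  }
  // Every character is one byte and one UTF-16 code unit.
  return input.length;
}

console.log(singleByteLength('\u017Elu\u0165ou\u010Dk\u00FD', 'windows-1250'));
```

In an as-you-type counter the set would be built once per encoding and cached rather than rebuilt on every call.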
from encoding.
@mathiasbynens, thanks for the reply. Do you mean something like `myTextarea.value.length`, please? If I understand it correctly, this does not work: JavaScript strings are UTF-16, so `.length` represents only the number of UTF-16 code units. The measurement could be done on a byte stream, but I do not know how to obtain one; it cannot be done via `(new TextEncoder('big5')).encode(myTextarea.value)` because `TextEncoder` only encodes to UTF-8.
The libraries look very nice, I will examine them. A native solution would still be better: I need support for tens of encodings, including multi-byte ones, and that would require megabytes of libraries, which is not very elegant for such a simple task as byte counting (moreover, the browsers already have access to such functionality).
(The counter page itself is in UTF-8; the final text encoding can be set independently.)
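To illustrate the distinction drawn above, here is a small sketch comparing UTF-16 code units with UTF-8 bytes. The sample string is arbitrary; `TextEncoder` takes no encoding label and always produces UTF-8.

```javascript
// Sample text: "a", "é", "中", and an emoji outside the BMP.
const text = 'a\u00E9\u4E2D\u{1F600}';

// .length counts UTF-16 code units: 1 + 1 + 1 + 2.
const utf16Units = text.length;

// TextEncoder always encodes to UTF-8: 1 + 2 + 3 + 4 bytes.
const encoder = new TextEncoder();
const utf8Bytes = encoder.encode(text).length;

console.log(utf16Units, utf8Bytes, encoder.encoding); // 5 10 utf-8
```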
> Do you mean something like `myTextarea.value.length`, please?

Exactly.

> If I understand it correctly, this does not work: JavaScript strings are UTF-16, so `.length` represents only the number of UTF-16 code units.

Yes, but the Unicode symbols that can be represented in any of those legacy single-byte encodings are all within the Basic Multilingual Plane, and would thus each have a length of `1` anyway.

> The measurement could be done on a byte stream, but I do not know how to obtain one; it cannot be done via `(new TextEncoder('big5')).encode(myTextarea.value)` because `TextEncoder` only encodes to UTF-8.

It could be done using the libraries + code snippet I mentioned.
@mathiasbynens, thank you, we are on the same wavelength now. Unfortunately, the `.length` approach does not work for the legacy multi-byte encodings that I need to handle.
The libraries are promising. I just doubt that, given their size, they would be fast enough for as-you-type byte counting on mobile devices (they likely run at only a fraction of the speed of a native solution). It's time to check it. :-)
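One possible workaround for a multi-byte encoding, sketched here for Big5 only, is to derive a character-to-byte-length table at runtime from the decoder the browser already ships, instead of downloading an index. This is an illustration under several assumptions: `TextDecoder` accepts the `big5` label in the runtime, ASCII characters take one byte, and every other character takes a lead byte in 0x81..0xFE plus a trail byte in 0x40..0xFE. The names `buildBig5Lengths` and `big5ByteLength` are hypothetical. The spec's Big5 encoder rejects some byte pairs that the decoder accepts, so the sketch may count a few characters that a real encoder would refuse.

```javascript
// Build a map from character to encoded byte length for Big5 by decoding
// candidate byte pairs with the built-in decoder.
function buildBig5Lengths() {
  const decoder = new TextDecoder('big5');
  const lengths = new Map();
  // Assumption: ASCII characters are encoded as a single byte.
  for (let b = 0; b < 0x80; b++) lengths.set(String.fromCharCode(b), 1);
  const pair = new Uint8Array(2);
  for (let lead = 0x81; lead <= 0xFE; lead++) {
    for (let trail = 0x40; trail <= 0xFE; trail++) {
      pair[0] = lead;
      pair[1] = trail;
      const s = decoder.decode(pair);
      // Keep only pairs that decode to exactly one real character.
      if (s !== '\uFFFD' && Array.from(s).length === 1 && !lengths.has(s)) {
        lengths.set(s, 2);
      }
    }
  }
  return lengths;
}

// Build the table once and reuse it on every keystroke.
const big5Lengths = buildBig5Lengths();

// Return the Big5 byte count of `input`, or null if a character
// is not found in the derived table.
function big5ByteLength(input) {
  let total = 0;
  for (const ch of input) {
    const n = big5Lengths.get(ch);
    if (n === undefined) return null;
    total += n;
  }
  return total;
}

console.log(big5ByteLength('a\u4E2D'));
```

The table costs roughly 24,000 decode calls to build, after which each count is a single pass over the input. Encodings with more complex structure (for example gb18030 with its four-byte ranges) would need a different enumeration.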
Why do you need the byte count for legacy encodings?
Hi @annevk, I work on a tool for professional translators (editors...) who need to know such things (I suppose mainly because of how the text is used afterwards: "will it fit into the provided disk/DB space?").
I do not mind implementing the byte counting differently, but I see no other way on the client.
Why would such tools not use utf-8 throughout? They could probably use an overhaul if that's really still the case...
The text processing tool we are working on uses UTF-8 (and converts to the target encoding on output), but the tens of thousands of Oracle* projects it processes cannot simply be converted to UTF-8. I just thought that if there were no such artificial restriction,** we could implement the character counting in an easy way (otherwise we will have to call a server-side character counter on input, which will be slow).
* I was just told there is no problem with naming our company; I just cannot say this is an official statement.
** The browsers are (or should be) capable of encoding to "legacy" encodings, e.g. because of the form's accept-charset attribute; the standard just does not let scripts use them for encoding.
I see, fair enough. I suspect there is little appetite for adding this though and getting it in browsers would take a relatively long time.
Personally I like the current status; the only pity is that it cannot handle some scenarios. I also started this thread hoping that someone would recommend a workaround, and that could still happen (even now I can build on Mathias's libraries). Thank you all!
Closing this. Thank you @zomp, sorry we could not be of more help.