Comments (15)
I wonder how easy the C++ APIs make it to expose that state. @vyv03354, @hsivonen, @inexorabletash?
from encoding.
Many decoders are stateful, so there's more than just the pending bytes in the internal stream.
Blink relies on ICU for non-UTF/non-Latin1 conversions. I'm not an ICU expert (or even a novice, really). It's unclear to me if the API allows for access to the pending buffers (it does give access to the counts of the buffers), and I don't see anything giving access to the other internal state.
Even if it did give access to the internal state in an opaque way, priming that state with user-supplied byte data seems like security bugs waiting to happen.
So... unless someone who knows far more about ICU than I do chimes in (I'll ping jshin), I'd rate this as "non-trivial".
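For readers less familiar with what "pending bytes" means here, a minimal sketch using the standard `TextDecoder` streaming mode may help. It only shows that state exists between calls; it says nothing about how ICU or any other engine stores it. The variable names are illustrative.

```typescript
// "€" is U+20AC, encoded in UTF-8 as the three bytes E2 82 AC.
const euro = new Uint8Array([0xe2, 0x82, 0xac]);

const streaming = new TextDecoder("utf-8");
// The first two bytes are an incomplete sequence, so nothing is emitted yet;
// the decoder keeps them as internal state.
const first = streaming.decode(euro.subarray(0, 2), { stream: true });
// The final byte completes the sequence and the character is emitted.
const second = streaming.decode(euro.subarray(2), { stream: true });

// A fresh decoder has no memory of the first two bytes, so the lone
// continuation byte decodes to U+FFFD REPLACEMENT CHARACTER.
const fresh = new TextDecoder("utf-8").decode(euro.subarray(2));
```

Resuming a stream in a new decoder without the old decoder's state therefore corrupts any character that straddles the cut.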
Oh, I missed the bit about priming. Yeah, that would be cumbersome. But it would be really nice for streaming.
@domenic, I'd like to hear a more detailed explanation of the use case: Why do streams need to stay logically open across a service worker upgrade?
@annevk In Gecko, the way the state is stored is considered private to the decoder and the different decoders don't have a consistent representation. One way to deal with this would be to rewrite all our decoders to be stateless (turn them from classes to mere functions basically) and to introduce a separate common state management wrapper for all the callers that don't need to deal with the state themselves.
Apart from requiring rewriting all the decoders (which I think we should do anyway--in Rust; I'm working on a proposal), there's the sadness that ISO-2022-JP's state is different from all the others. For everything else, the state would be zero to three bytes. Except for ISO-2022-JP. :-(
Now, just because I'm working on a proposal for a rewrite doesn't mean we'll do one or that we'll do one soon. And even if a rewrite made serializing the state feasible, I'm still skeptical of the merit of the use case for making the state serializable.
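As a rough illustration of the "decoders as mere functions" idea, here is a sketch for UTF-8 only, where the explicit state is the zero to three bytes of an unfinished sequence. Everything named here (`Utf8State`, `decodeStep`, `incompleteTailLength`) is hypothetical and not taken from Gecko; the sketch also assumes well-formed input and leans on the built-in `TextDecoder` for the actual conversion.

```typescript
// Hypothetical explicit state: the bytes of a not-yet-complete UTF-8 sequence.
type Utf8State = { readonly pending: Uint8Array };

const INITIAL_STATE: Utf8State = { pending: new Uint8Array(0) };

// Number of trailing bytes that begin a sequence which is not yet complete.
// Assumes well-formed UTF-8; invalid lead bytes are not treated specially.
function incompleteTailLength(bytes: Uint8Array): number {
  // A lead byte can be at most three bytes from the end and still be waiting.
  for (let back = 1; back <= Math.min(3, bytes.length); back++) {
    const b = bytes[bytes.length - back];
    if ((b & 0xc0) === 0x80) continue; // continuation byte, keep looking back
    const needed = b >= 0xf0 ? 4 : b >= 0xe0 ? 3 : b >= 0xc0 ? 2 : 1;
    return needed > back ? back : 0;
  }
  return 0;
}

// A pure function: same state and chunk in, same text and next state out.
function decodeStep(
  state: Utf8State,
  chunk: Uint8Array,
): { text: string; state: Utf8State } {
  // Prepend the carried-over bytes to the new chunk (this copies every call).
  const input = new Uint8Array(state.pending.length + chunk.length);
  input.set(state.pending, 0);
  input.set(chunk, state.pending.length);

  const tail = incompleteTailLength(input);
  const complete = input.subarray(0, input.length - tail);
  // ignoreBOM: true keeps U+FEFF in the output, so a mid-stream U+FEFF is not
  // dropped just because it happens to start a chunk.
  const text = new TextDecoder("utf-8", { ignoreBOM: true }).decode(complete);
  return { text, state: { pending: input.slice(input.length - tail) } };
}

// Usage: split "a€b" in the middle of the three-byte euro sign.
const sample = new TextEncoder().encode("a€b");
const step1 = decodeStep(INITIAL_STATE, sample.subarray(0, 2));
const step2 = decodeStep(step1.state, sample.subarray(2));
```

Because the state is plain data, a caller could store it anywhere. A stateful encoding such as ISO-2022-JP would need more than leftover bytes, which is the irregularity noted above.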
@hsivonen that is really a question for @bsittler; I was relaying the use case from him...
> One way to deal with this would be to rewrite all our decoders to be stateless (turn them from classes to mere functions basically) and to introduce a separate common state management wrapper for all the callers that don't need to deal with the state themselves.
I thought about this more. The storage idea I had in mind when I wrote the above would make things inefficient on the subsequent call. So I withdraw the notion that if we did a rewrite, state wouldn't need to be private and different on a per decoder type basis.
I'm still really curious about a super-important use case. At this point I'd close this as WONTFIX due to too much hassle for too little benefit.
(Aside: Is there an example of another platform, even a single-implementation one, that supports serializing and deserializing in-flight decoders?)
One use case might be that you download large files and save the result, decoded, to disk. If at some point the connection gets cut you might want to have state available so you know how to properly resume. In particular when you want to support resumption across application reboots.
I still don't know how we'd pull this off without a rewrite (i.e. move off of or upstream changes to ICU), but regarding my comments above about priming decoder state being a security issue: one approach would be to ask the decoder to output a byte sequence (i.e. any mode switch bytes + buffered lead bytes) that would correctly initialize a new instance to the current state when passed in as the start of a stream, rather than via some special initialization API.
(That's probably obvious but documenting it for posterity since I didn't consider it initially.)
I'm not sure this is "super important" but it could avoid some duplicated effort when dealing with large inputs where the overall text size does not fit in a JS string or when processing of chunks of bytes is offloaded to a ServiceWorker which might not live as long as the caller which owns the stream. Text file readers could also benefit from this to avoid having to reparse the whole preceding input prior to the previously saved "current reading position". I think representing state as shifts + trailing "incomplete" bytes replayable at startup to reach equivalent state is attractive, but I wonder whether opaque serialization mightn't be easier to support and potentially more compact. Of course replayable bytes have the advantage of even being potentially portable to a different implementation.
I'll leave this issue open for now and we can revisit in a year (or two) to see what the state of implementations is then and whether we still want it. If it becomes higher priority meanwhile we can reconsider.
> One use case might be that you download large files and save the result, decoded, to disk. If at some point the connection gets cut you might want to have state available so you know how to properly resume. In particular when you want to support resumption across application reboots.
But who is downloading massive text files and wishing to incrementally save the decoded UTF-16 (where? in IndexedDB?) instead of saving the bytes?
@bsittler, I was hoping to see a description of a concrete use case and not just "large inputs". I still don't know what kind of app would have large inputs and why.
@inexorabletash, Storing a byte sequence that needs to be prepended to the current input was what I had in mind, but then I figured that doing the prepending would be a perf or, alternatively, complexity problem. (FWIW, the java.nio API can run in a mode where the caller needs to manage such prepending. It's not cool.)
@hsivonen - re: prepending byte sequence - agreed, that would not be cool. I'd make it a little more opaque from the point of view of script, e.g. an ArrayBuffer getInitState() method and an {initState: BufferSource} option for the constructor. But moot until we figure out if/how we'd support this.
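To make the proposed shape concrete, here is a hedged sketch of what a `getInitState()` method and an `initState` constructor option could look like, built on the replayable-bytes idea from earlier in the thread. `ResumableUtf8Decoder`, `ResumableOptions`, and `pendingLength` are hypothetical names, the sketch covers well-formed UTF-8 only, and it narrows `initState` to `ArrayBuffer` for brevity. Nothing here is part of the Encoding Standard.

```typescript
// Hypothetical option bag; the thread suggests BufferSource, narrowed here.
interface ResumableOptions { initState?: ArrayBuffer }

// Trailing bytes that start an unfinished UTF-8 sequence (well-formed input).
function pendingLength(bytes: Uint8Array): number {
  for (let back = 1; back <= Math.min(3, bytes.length); back++) {
    const b = bytes[bytes.length - back];
    if ((b & 0xc0) === 0x80) continue; // continuation byte, keep looking back
    const needed = b >= 0xf0 ? 4 : b >= 0xe0 ? 3 : b >= 0xc0 ? 2 : 1;
    return needed > back ? back : 0;
  }
  return 0;
}

class ResumableUtf8Decoder {
  private readonly decoder = new TextDecoder("utf-8", { ignoreBOM: true });
  private pending: Uint8Array = new Uint8Array(0);

  constructor(options: ResumableOptions = {}) {
    // Priming is just ordinary decoding of the saved bytes, so no special
    // initialization path touches the real decoder's internals.
    if (options.initState) this.decode(new Uint8Array(options.initState));
  }

  decode(chunk: Uint8Array): string {
    const text = this.decoder.decode(chunk, { stream: true });
    // Shadow-track the unfinished tail, since TextDecoder does not expose it.
    const joined = new Uint8Array(this.pending.length + chunk.length);
    joined.set(this.pending, 0);
    joined.set(chunk, this.pending.length);
    this.pending = joined.slice(joined.length - pendingLength(joined));
    return text;
  }

  // Bytes that bring a new instance to this instance's current state.
  getInitState(): ArrayBuffer {
    const out = new ArrayBuffer(this.pending.length);
    new Uint8Array(out).set(this.pending);
    return out;
  }
}

// Usage: cut "a€b" inside the euro sign, save state, resume in a new instance.
const input = new TextEncoder().encode("a€b");
const before = new ResumableUtf8Decoder();
const head = before.decode(input.subarray(0, 3));
const saved = before.getInitState();
const after = new ResumableUtf8Decoder({ initState: saved });
const rest = after.decode(input.subarray(3));
```

For UTF-8 the saved state is at most three bytes. An encoding with mode switches would also have to emit the relevant escape sequence, which this sketch does not attempt.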
@inexorabletash, I agree with you about ICU. ICU converters are designed to work with streaming input, but that does not mean that it's easy to serialize/deserialize its internal states, I'm afraid.
Another question: is this bug also aiming at an interoperable serialization format (a result serialized by one implementation that can be read by another implementation)? Perhaps not, just checking.
If we indeed wanted to do this I think it would make more sense to expose state as an opaque object (and therefore not an ArrayBuffer as @inexorabletash suggests) and define a way to structure clone such an opaque object so it can be stored in IDB, but cannot be transferred across the network.
But @hsivonen's question about the exact need for this would have to be answered first.
Closing this since nothing really materialized here. If this needs to be reopened please leave a comment with a more concrete need.