Serializing internal TextDecoder state? (encoding, closed)

whatwg commented on August 12, 2024

Serializing internal TextDecoder state?

from encoding.

Comments (15)

annevk commented on August 12, 2024

I wonder how easy the C++ APIs make it to expose that state. @vyv03354, @hsivonen, @inexorabletash?

inexorabletash commented on August 12, 2024

Many decoders are stateful, so there's more than just the pending bytes in the internal stream.
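To make the statefulness concrete, here is a minimal sketch using the standard `TextDecoder` streaming mode. It splits the three UTF-8 bytes of "€" across two calls; the variable names are illustrative only.

```typescript
// UTF-8 for "€" is E2 82 AC. Feed the first two bytes, then the last one.
const decoder = new TextDecoder("utf-8");
const first = decoder.decode(new Uint8Array([0xe2, 0x82]), { stream: true });
const second = decoder.decode(new Uint8Array([0xac]), { stream: true });

// A fresh decoder has no buffered lead bytes, so the same final byte
// on its own is malformed and becomes U+FFFD.
const fresh = new TextDecoder("utf-8").decode(new Uint8Array([0xac]));

console.log(JSON.stringify([first, second, fresh]));
```

The two buffered bytes held between the calls are the kind of internal state this issue is asking to serialize.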

Blink relies on ICU for non-UTF/non-Latin1 conversions. I'm not an ICU expert (or even a novice, really). It's unclear to me if the API allows for access to the pending buffers (it does give access to the counts of the buffers), and I don't see anything giving access to the other internal state.

Even if it did give access to the internal state in an opaque way, it seems like security bugs waiting to happen to prime that state with user-supplied byte data.

So... unless someone who knows far more about ICU than I do chimes in (I'll ping jshin), I'd rate this as "non-trivial".

annevk commented on August 12, 2024

Oh, I missed the bit about priming. Yeah, that would be cumbersome. But it would be really nice for streaming.

hsivonen commented on August 12, 2024

@domenic, I'd like to hear a more detailed explanation of the use case: Why do streams need to stay logically open across a service worker upgrade?

@annevk In Gecko, the way the state is stored is considered private to the decoder and the different decoders don't have a consistent representation. One way to deal with this would be to rewrite all our decoders to be stateless (turn them from classes to mere functions basically) and to introduce a separate common state management wrapper for all the callers that don't need to deal with the state themselves.

Apart from requiring rewriting all the decoders (which I think we should do anyway--in Rust; I'm working on a proposal), there's the sadness that ISO-2022-JP's state is different from all the others. For everything else, the state would be zero to three bytes. Except for ISO-2022-JP. :-(
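As a rough illustration of why ISO-2022-JP is the odd one out: its state includes the current mode selected by an escape sequence, not just a few pending bytes. The sketch below assumes a runtime whose `TextDecoder` supports the `iso-2022-jp` label (browsers do; Node.js needs full ICU data).

```typescript
const ESC = 0x1b;

// ESC $ B switches the decoder to JIS X 0208. No text is produced yet,
// but the mode persists across calls when streaming.
const stateful = new TextDecoder("iso-2022-jp");
const afterEscape = stateful.decode(new Uint8Array([ESC, 0x24, 0x42]), { stream: true });
const inJisMode = stateful.decode(new Uint8Array([0x24, 0x22]), { stream: true });

// A fresh decoder starts in ASCII mode and reads the same two bytes as '$' and '"'.
const inAsciiMode = new TextDecoder("iso-2022-jp").decode(new Uint8Array([0x24, 0x22]));

console.log(JSON.stringify([afterEscape, inJisMode, inAsciiMode]));
```

No bytes are pending after the escape sequence, yet the next two bytes decode differently, so a "zero to three pending bytes" representation is not enough for this encoding.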

Now, just because I'm working on a proposal for a rewrite doesn't mean we'll do one or that we'll do one soon. And even if a rewrite made serializing the state feasible, I'm still skeptical of the merit of the use case for making the state serializable.

domenic commented on August 12, 2024

@hsivonen that is really a question for @bsittler; I was relaying the use case from him...

hsivonen commented on August 12, 2024

> One way to deal with this would be to rewrite all our decoders to be stateless (turn them from classes to mere functions basically) and to introduce a separate common state management wrapper for all the callers that don't need to deal with the state themselves.

I thought about this more. The storage idea I had in mind when I wrote the above would make things inefficient on the subsequent call. So I withdraw the notion that if we did a rewrite, state wouldn't need to be private and different on a per decoder type basis.

I'm still really curious about a super-important use case. At this point I'd close this as WONTFIX due to too much hassle for too little benefit.

(Aside: Is there an example of another platform, even a single-implementation one, that supports serializing and deserializing in-flight decoders?)

annevk commented on August 12, 2024

One use case might be that you download large files and save the result, decoded, to disk. If at some point the connection gets cut you might want to have state available so you know how to properly resume. In particular when you want to support resumption across application reboots.

inexorabletash commented on August 12, 2024

I still don't know how we'd pull this off without a rewrite (i.e. move off of or upstream changes to ICU), but regarding my comments above about priming decoder state being a security issue: one approach would be to ask the decoder to output a byte sequence (i.e. any mode switch bytes + buffered lead bytes) that would correctly initialize a new instance to the current state when passed in as the start of a stream, rather than via some special initialization API.
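A minimal sketch of that replay idea, for UTF-8 only, where there are no mode switch bytes and the prefix is just the buffered lead bytes. `utf8ReplayPrefix` is a hypothetical helper, not an existing API, and it assumes the input is well-formed UTF-8 that has merely been cut off mid-sequence.

```typescript
// Hypothetical helper: returns the trailing bytes of `bytes` that start a
// multi-byte UTF-8 sequence which has not been completed yet.
// Assumes well-formed input; it does not validate continuation ranges.
function utf8ReplayPrefix(bytes: Uint8Array): Uint8Array {
  // A pending sequence is at most 3 bytes long (lead byte plus two continuations).
  for (let back = 1; back <= 3 && back <= bytes.length; back++) {
    const b = bytes[bytes.length - back];
    if ((b & 0xc0) === 0x80) continue; // continuation byte: keep looking for the lead
    let needed = 0; // total sequence length announced by the lead byte
    if ((b & 0xe0) === 0xc0) needed = 2;
    else if ((b & 0xf0) === 0xe0) needed = 3;
    else if ((b & 0xf8) === 0xf0) needed = 4;
    // ASCII bytes and complete sequences leave nothing pending.
    return needed > back ? bytes.slice(bytes.length - back) : new Uint8Array(0);
  }
  return new Uint8Array(0);
}

// Cut the input inside the 3-byte "€", then resume with a brand-new decoder.
const all = new TextEncoder().encode("héllo €");
const cut = all.length - 1;
const prefix = utf8ReplayPrefix(all.subarray(0, cut));

const resumed = new TextDecoder("utf-8");
resumed.decode(prefix, { stream: true }); // replay the prefix as the start of the stream
const tail = resumed.decode(all.subarray(cut));
console.log(prefix.length, JSON.stringify(tail));
```

For a stateful encoding such as ISO-2022-JP the prefix would also need the escape sequence for the current mode, which this sketch does not attempt.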

(That's probably obvious but documenting it for posterity since I didn't consider it initially.)

bsittler commented on August 12, 2024

I'm not sure this is "super important" but it could avoid some duplicated
effort when dealing with large inputs where the overall text size does not
fit in a JS string or when processing of chunks of bytes is offloaded to a
ServiceWorker which might not live as long as the caller which owns the
stream. Text file readers could also benefit from this to avoid having to
reparse the whole preceding input prior to the previously saved "current
reading position". I think representing state as shifts+trailing
"incomplete" bytes replayable at startup to reach equivalent state is
attractive, but I wonder whether opaque serialization mightn't be easier to
support and potentially more compact. Of course replayable bytes have the
advantage of even being potentially portable to a different implementation.

annevk commented on August 12, 2024

I'll leave this issue open for now and we can revisit in a year (or two) to see what the state of implementations is then and whether we still want it. If it becomes higher priority meanwhile we can reconsider.

hsivonen commented on August 12, 2024

@annevk,

> One use case might be that you download large files and save the result, decoded, to disk. If at some point the connection gets cut you might want to have state available so you know how to properly resume. In particular when you want to support resumption across application reboots.

But who is downloading massive text files and wishing to incrementally save the decoded UTF-16 (where? in IndexedDB?) instead of saving the bytes?

@bsittler, I was hoping to see a description of a concrete use case and not just "large inputs". I still don't know what kind of app would have large inputs and why.

@inexorabletash, Storing a byte sequence that needs to be prepended to the current input was what I had in mind, but then I figured that doing the prepending would be a perf or, alternatively, complexity problem. (FWIW, the java.nio API can run in a mode where the caller needs to manage such prepending. It's not cool.)

inexorabletash commented on August 12, 2024

@hsivonen - re: prepending byte sequence - agreed, that would not be cool. I'd make it a little more opaque from the point of view of script, e.g. an ArrayBuffer getInitState() method and an {initState: BufferSource} option for the constructor. But moot until we figure out if/how we'd support this.
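For illustration only, the API shape above could be approximated in script today with a wrapper around the real `TextDecoder`. Everything named here (`ResumableUtf8Decoder`, `getInitState()`, `initState`) is hypothetical; the sketch handles UTF-8 only, assumes well-formed input, and uses `Uint8Array` rather than a general `BufferSource` to stay short.

```typescript
// Hypothetical wrapper; neither getInitState() nor initState exists on TextDecoder.
class ResumableUtf8Decoder {
  // ignoreBOM keeps a leading U+FEFF in the output so the byte accounting below holds.
  private readonly inner = new TextDecoder("utf-8", { ignoreBOM: true });
  private readonly encoder = new TextEncoder();
  private pending = new Uint8Array(0);

  constructor(options: { initState?: Uint8Array } = {}) {
    // Priming is just feeding the saved bytes in as the start of the stream.
    if (options.initState) this.decode(options.initState);
  }

  decode(chunk: Uint8Array): string {
    const text = this.inner.decode(chunk, { stream: true });
    const joined = new Uint8Array(this.pending.length + chunk.length);
    joined.set(this.pending);
    joined.set(chunk, this.pending.length);
    // For well-formed UTF-8, re-encoding the output tells us how many input
    // bytes were consumed; whatever is left is still buffered in the decoder.
    const consumed = this.encoder.encode(text).length;
    this.pending = joined.slice(consumed);
    return text;
  }

  getInitState(): Uint8Array {
    return this.pending.slice();
  }
}

const bytes = new TextEncoder().encode("naïve €uro");
const a = new ResumableUtf8Decoder();
const head = a.decode(bytes.subarray(0, 8)); // stops two bytes into "€"
const saved = a.getInitState();

const b = new ResumableUtf8Decoder({ initState: saved });
const rest = b.decode(bytes.subarray(8));
console.log(JSON.stringify(head + rest), saved.length);
```

A native implementation would not need the re-encoding trick, and it would have to cover mode state for encodings like ISO-2022-JP, which is where the difficulty discussed in this thread lies.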

jungshik commented on August 12, 2024

@inexorabletash, I agree with you about ICU. ICU converters are designed to work with streaming input, but that does not mean that it's easy to serialize/deserialize their internal state, I'm afraid.

Another question: is this bug also aiming at an interoperable serialization format (a serialized result from one implementation that can be read by another implementation)? Perhaps not, just checking.

annevk commented on August 12, 2024

If we indeed wanted to do this I think it would make more sense to expose state as an opaque object (and therefore not an ArrayBuffer as @inexorabletash suggests) and define a way to structure clone such an opaque object so it can be stored in IDB, but cannot be transferred across the network.

But @hsivonen's question about the exact need for this would have to be answered first.

annevk commented on August 12, 2024

Closing this since nothing really materialized here. If this needs to be reopened please leave a comment with a more concrete need.
