Coder Social home page Coder Social logo

Comments (20)

andrenatal avatar andrenatal commented on July 21, 2024 2

If the above is the case, it demonstrates a weakness of the engine itself: it shouldn't be the presentation layer responsibility to determine the type of the content being input, the user should just be able to input whatever they and and the engine itself infer that. There will be use cases where this won't be possible, like SOA applications for example. Unfortunately this is a bad design decision that can lead to even more issues

from firefox-translations.

kpu avatar kpu commented on July 21, 2024 1

Thanks @abhi-agg!

from firefox-translations.

andrenatal avatar andrenatal commented on July 21, 2024

@jelmervdl Maybe this is related to #35, where I'm seeing multiple exceptions in the engine but can't reproduce on different languages?

from firefox-translations.

jelmervdl avatar jelmervdl commented on July 21, 2024

@andrenatal it could be very well related. There are paragraphs with literal < and > in them. Those paragraphs would cause the html parsing bit in bergamot-translator to throw (but because of how emscripten is configured… abort()?) since it tries to parse that as an open tag and fail.

from firefox-translations.

kpu avatar kpu commented on July 21, 2024

The extension will be called upon to translate text/plain and text/HTML documents, so the extension will need to toggle HTML on/off anyway, using this:

https://github.com/browsermt/bergamot-translator/blob/c0f311a8c067372057a6f301c42b40bbe30a9c1a/src/translator/response_options.h#L22
(The TODO comment is outdated and will be fixed.)

from firefox-translations.

andrenatal avatar andrenatal commented on July 21, 2024

@abhi-agg could you please take a look at this?

from firefox-translations.

abhi-agg avatar abhi-agg commented on July 21, 2024

@andrenatal I am on it.

from firefox-translations.

abhi-agg avatar abhi-agg commented on July 21, 2024

The extension will be called upon to translate text/plain and text/HTML documents, so the extension will need to toggle HTML on/off anyway

@kpu By documents, do you literally mean the user will upload documents to translate or do you only mean web pages?

from firefox-translations.

kpu avatar kpu commented on July 21, 2024

The user could browse to a web page with a text/plain content type. The browser displays the text, which may include stray < and > that should not be interpreted as HTML.

from firefox-translations.

kpu avatar kpu commented on July 21, 2024

As in, a user with a German Firefox UI visits https://neural.mt/test.txt . The browser offers to translate it to German (or at least it should, but that's not the point here). The text should be sent to the engine with HTML off.

from firefox-translations.

abhi-agg avatar abhi-agg commented on July 21, 2024

Did some quick tests via wasm test page:

  1. Assuming right model config was set and response options are set to do html translation
    1. Hello &lt; world. returned following for various models:
      1. De: Hallo Welt.
      2. Es: Hola, mundo.
      3. Cz: Ahoj, svět.
      4. Et: Tere &lt; maailm.
        (Seems like &lt; is gone in De, Es and Cz language)
    2. Hello < world causes engine to assert (because given string is a bad-formed html)
  2. Non-html translation for Hello &lt; world. Hello < world. returned follwing:
    1. De: Hello &lt; Welt. Hallo Welt.
    2. Es: Hola &lt; mundo. Hola, mundo.
    3. Cz: Hello &lt; world. Ahoj, svět.
    4. Et: Tere &lt; maailm. Tere < maailm.

from firefox-translations.

kpu avatar kpu commented on July 21, 2024

That looks consistent with expectations. In HTML mode the processing looks like:

  1. Input Hello &lt; world.
  2. HTML maps to internal text Hello < world.
  3. Internal Estonian target Tere < maailm. (Which I note is what happened in text mode.)
  4. HTML escapes to Tere &lt; maailm.

from firefox-translations.

andrenatal avatar andrenatal commented on July 21, 2024

Wasn't this fixed by: #53?

Can we close it?

from firefox-translations.

kpu avatar kpu commented on July 21, 2024

There's a few things going on:

  1. alignment: soft was not passed at load time so HTML didn't know what to do with word alignments, resulting in poor HTML placement quality. This was mentioned as part of the initial issue and is now resolved as #53.
  2. The extension is currently extracting text and sending it to the engine with HTML on. This means a stray &lt; in the HTML is converted by the extension to < which then confuses HTML mode in the engine. This was also mentioned as part of the issue and remains open to my knowledge. We believe this is the cause of the crash in #35.
  3. The extension has no means to toggle the HTML flag. This will be required at some point since users may browse to text/plain and text/html pages. I have suggested doing this as part of the fix of point 2 above. My understanding is @abhi-agg has a draft version of a fix; he posted test output above.
  4. The extension should use HTML mode instead of sending text, which is #52.

So I think this issue could be retitled as "text is being sent to HTML mode"

from firefox-translations.

andrenatal avatar andrenatal commented on July 21, 2024

What happens in the case of a legit instance of < existing in the page's text being sent to the engine?

from firefox-translations.

kpu avatar kpu commented on July 21, 2024

If you have HTML mode on and provide ill-formed HTML, it will throw an informative exception to the request and the engine will be fine (to take more requests). However, the -fno-exceptions limitation of WASM maps this exception to process death. In Thursday's plenary, I raised the possibility of an API change to allow returning an error code, as the current API does not have the means to express errors by anything other than exceptions. See browsermt/bergamot-translator#316 .

HTML mode was originally scoped to be used with Firefox's innerHTML where it's not possible to pass erroneous HTML.

We could of course improvise something to just skip the character or treat it as if it were &lt; but this seemed like it would only cause trouble of being inconsistent with Firefox's interpretation of broken HTML.

from firefox-translations.

jelmervdl avatar jelmervdl commented on July 21, 2024

Note that you can specify whether to use HTML mode every time you call translate(). For translating innerHTML it should be on, but for translating textContent or the outbound form inputs it should not be on.

Right now the extension does some batching, calling translate() with a VectorString with multiple input strings, but only one ResponseOptions object which specifies whether they are all HTML or not. I.e.

const responseOptions = {qualityScores: true, alignment: true, html: true};
let input = new this.WasmEngineModule.VectorString();
messages.forEach(message => {
    input.push_back(message.sourceParagraph);
});
let result = this.translationService.translate(translationModel, input, responseOptions);

We can alter the emscripten bindings of bergamot-translator a bit and accept a ResponseOptions object per input string, if that makes the extension's code easier. Something like:

const responseOptions = {qualityScores: true, alignment: true};
let input = new this.WasmEngineModule.VectorSomethingSomething();
messages.forEach(message => {
    input.push_back(message.sourceParagraph, {...responseOptions, html: message.isHTML});
});
let result = this.translationService.translate(translationModel, input);

We could of course improvise something to just skip the character or treat it as if it were < but this seemed like it would only cause trouble of being inconsistent with Firefox's interpretation of broken HTML.

Trying to parse HTML with a fallback sounds riksy. For example, if someone were to type When output<input and checked>unchecked into their textarea, that will be interpreted as valid HTML (and input and checked would go untranslated)

from firefox-translations.

andrenatal avatar andrenatal commented on July 21, 2024

Based on my conversation with Abhi, we'll need the InPageTranslation.js parser to always determine if the string being passed to the engine contains plain or html text and set the proper flag on ResponseOptions. Is that right @abhi-agg ?

from firefox-translations.

kpu avatar kpu commented on July 21, 2024

All you need to do to fix this issue for the time being is turn the HTML flag to off. This is a one line change. When you have HTML translation again, turn it on.

The HTML feature is designed to operate on snippets of HTML. It does not make sense to expect automatic content identification from text vs snippets of HTML because that would introduce bugs in text translation where words are mysteriously not translated inside angle brackets.

Does Firefox render text/plain content containing some tags as HTML?

from firefox-translations.

abhi-agg avatar abhi-agg commented on July 21, 2024

All you need to do to fix this issue for the time being is turn the HTML flag to off. This is a one line change. When you have HTML translation again, turn it on.

Submitted the PR that should close this issue.

If the above is the case, it demonstrates a weakness of the engine itself: it shouldn't be the presentation layer responsibility to determine the type of the content being input, the user should just be able to input whatever they and and the engine itself infer that. There will be use cases where this won't be possible, like SOA applications for example. Unfortunately this is a bad design decision that can lead to even more issues

This leaves ⬆️ up for discussion.

from firefox-translations.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.