Right now, we simply split the text on `\n\n+`, but thi


Extract text more carefully in `mdbook-xgettext` (google/comprehensive-rust)

Comments (19)

mgeisler commented on July 23, 2024

When doing this, it's critically important that we run the same transformations on the existing po/*.po files. That way we can keep the work of the translators intact.

from comprehensive-rust.

moutikabdessabour commented on July 23, 2024

Why not use pulldown_cmark, the Markdown parser used by mdbook, to extract the paragraphs and then reconstruct them?


mgeisler commented on July 23, 2024

@moutikabdessabour we should definitely use pulldown_cmark for this!


djmitche commented on July 23, 2024

I'd like to work on this, if you don't mind assigning it to me. I can see how to replace the existing extract_paragraphs with a more sophisticated thing that emits a bunch of textual chunks.

> run the same transformations on the existing po/*.po files

Did you have something "easy" in mind for this? My thinking was that this would be a kind of half-automated process, where with some iteration I could find a one-off way to translate all of the old msgid's to new msgid's, and then apply those to the po/*.po files, keeping the existing translation.


jooyunghan commented on July 23, 2024

While working on the Korean translation, I found that keeping the Markdown syntax (i.e. bullets) was helpful because it gave me the freedom to do whatever fits better in the target language, like splitting a single bullet into two when necessary. It's probably the same reason why mgeisler@ thought trimming links would be a poor idea.


djmitche commented on July 23, 2024

That makes a lot of sense. I think we could adjust the chunk-extraction to collapse adjacent list-item chunks into a single chunk.


mgeisler commented on July 23, 2024

> Did you have something "easy" in mind for this? My thinking was that this would be a kind of half-automated process, where with some iteration I could find a one-off way to translate all of the old msgid's to new msgid's, and then apply those to the po/*.po files, keeping the existing translation.

Yes, that was also roughly my idea. Basically that the new extraction functionality can be accessed from some temporary tool which will iterate over pairs of msgid and msgstr and apply the same extraction to those, producing yet more pairs. As you say, the idea is that a fully translated .po file should remain translated after running a new mdbook-xgettext followed by msgmerge.
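To make the "pairs in, pairs out" idea concrete, here is a minimal sketch of such a temporary tool. It is only an illustration: `strip_heading` is a hypothetical stand-in for whatever the new extraction does, and `migrate_pair` is an assumed name, not something that exists in mdbook-xgettext.

```python
import re
from typing import Callable


def strip_heading(text: str) -> list[str]:
    """Hypothetical stand-in for the new extraction: drop leading '#' markers."""
    return [re.sub(r"^#+\s+", "", text)]


def migrate_pair(
    msgid: str, msgstr: str, extract: Callable[[str], list[str]]
) -> list[tuple[str, str]]:
    """Apply the same extraction to msgid and msgstr and pair up the results."""
    new_ids = extract(msgid)
    new_strs = extract(msgstr)
    if len(new_ids) != len(new_strs):
        # The translation has a different structure than the source text.
        # Keep the old pair untouched so a human can review it.
        return [(msgid, msgstr)]
    return list(zip(new_ids, new_strs))


pairs = migrate_pair("# Running the Course", "# 강의 진행 방식", strip_heading)
print(pairs)  # [('Running the Course', '강의 진행 방식')]
```

If the two sides ever split into a different number of pieces, the sketch leaves the pair alone, which matches the "half-automated" expectation: most entries migrate mechanically and the rest need manual editing.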

I'm thinking this should be done in smaller steps and that each step should be carried out on the .po files in lock step:

Perhaps we can start by teaching mdbook-xgettext to do proper Markdown parsing via pulldown_cmark first. Use new_cmark_parser to get a Parser and then probably into_offset_iter to get something which has the needed byte offsets.

Next, I imagine it would be easy to extract fenced code blocks as a unit, and probably also easy to strip away # from headings.

I've been dabbling a bit with this myself and I think the biggest trouble will be to parse all of the different Tag variants. So one thought I had was to only parse the simple stuff at first and bail out if you see anything else. Bailing out would mean fall back to the naive \n\n+ splitting of the file. Most pages in the course are very simple: a heading, some text, a code block. At least they used to be like that, but many of them now have "speaker notes" which is a trailing <details> ... </details> block at the end.
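The "handle the simple cases, otherwise bail out" strategy could look roughly like the sketch below. Note the assumptions: this is plain line-based Python, not pulldown_cmark, and the names `simple_extract`, `naive_split`, `extract` and `Unsupported` are made up for illustration. It only understands headings, paragraphs and fenced code blocks, and falls back to `\n\n+` splitting for everything else (lists, HTML such as `<details>`).

```python
import re

FENCE = "`" * 3  # a code fence marker, built here to keep this example readable


class Unsupported(Exception):
    """Raised when the page contains something the simple extractor skips."""


def naive_split(text: str) -> list[str]:
    """The existing behavior: split on runs of blank lines."""
    return [chunk for chunk in re.split(r"\n\n+", text.strip("\n")) if chunk]


def simple_extract(text: str) -> list[str]:
    messages: list[str] = []
    paragraph: list[str] = []
    code = None  # list of lines while inside a fenced code block

    def flush() -> None:
        if paragraph:
            messages.append("\n".join(paragraph))
            paragraph.clear()

    for line in text.splitlines():
        if code is not None:
            if line.startswith(FENCE):
                messages.append("\n".join(code))  # whole block is one message
                code = None
            else:
                code.append(line)
        elif line.startswith(FENCE):
            flush()
            code = []
        elif line.startswith("<") or re.match(r"\s*([*+-]|\d+\.)\s", line):
            raise Unsupported(line)  # HTML or a list: too complex for now
        elif re.match(r"#+\s", line):
            flush()
            messages.append(re.sub(r"^#+\s+", "", line))  # strip the '#'
        elif line.strip() == "":
            flush()
        else:
            paragraph.append(line)
    if code is not None:
        raise Unsupported("unterminated code fence")
    flush()
    return messages


def extract(text: str) -> list[str]:
    try:
        return simple_extract(text)
    except Unsupported:
        return naive_split(text)
```

With this shape, a code block that contains blank lines stays a single message, while a page with speaker notes in `<details>` simply gets the old behavior until that case is implemented.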

> While working on the Korean translation, I found that keeping the Markdown syntax (i.e. bullets) was helpful because it gave me the freedom to do whatever fits better in the target language, like splitting a single bullet into two when necessary.

I knew that the current system gives us that freedom, but I didn't know the freedom was used 😄 Can you tell us more about where you had to do this? My gut feeling is that we should try to improve the original English text in those cases.


djmitche commented on July 23, 2024

I got a start on this today in #449.

I think this can get pretty close to producing the existing set of messages. This is probably a good place to start, and then update the .po files where they differ (number of newlines, maybe some funny business around <details>, etc.). Then a followup could wrap paragraphs, remove # from headers, break up bullet lists (if desired), and so on. The transformations on the .po files for these followups should be pretty straightforward.


djmitche commented on July 23, 2024

On experimenting a bit, I think we should leave lists as a unit for translation. The reason is, otherwise indentation is very hard to get right. For example, given

 * Always takes a single set of parameter types

we get

msgid: "Always takes a single set of parameter types"

If the translation goes onto multiple lines, it's not at all obvious to the translator that this must be

msgstr: "Always"
"   takes"
"   a" ...

in order to keep the indentation correct. So, I will include lists in their entirety.
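For comparison, this is roughly what a tool would have to do if list items were extracted without their marker and the indentation had to be restored automatically. It is a sketch under assumptions: `as_list_item` is a hypothetical helper, and it only handles a single, non-nested list item.

```python
def as_list_item(message: str, marker: str = " * ") -> str:
    """Put the marker on the first line and indent continuation lines to match."""
    indent = " " * len(marker)
    first, *rest = message.split("\n")
    # Blank lines stay empty so no trailing whitespace is introduced.
    continuation = [indent + line if line else line for line in rest]
    return "\n".join([marker + first] + continuation)


print(as_list_item("Always\ntakes\na"))
```

The helper itself is short, but it has to know the marker width and nesting depth of every item, which is the part that is hard to get right in general.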


djmitche commented on July 23, 2024

Also, I don't think there's any automated way to re-break these messages. Some lists were broken into multiple messages by having \n\n between them, and some were not. I think the only way to go about this is manually editing the translation files :(


mgeisler commented on July 23, 2024

> If the translation goes onto multiple lines, it's not at all obvious to the translator that this must be
>
> msgstr: "Always"
> "   takes"
> "   a" ...
>
> in order to keep the indentation correct. So, I will include lists in their entirety.

Long-term, I would like to unwrap such paragraphs. So

* This is
  a single
  list item.

  Second paragraph
  in first item.

Becomes two messages in the .po file:

"This is a single list item."
"Second paragraph in first item."

Indentation and wrapping have been taken away. When translating the original text, we end up with

"* "
The translation of the first message
"\n\n "
The translation of the second message

This should work as long as there are no new \n characters in the two messages.
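A small sketch of this unwrap-then-rebuild round trip, assuming a single top-level list item with a `* ` marker. The function names are hypothetical and nested lists are ignored.

```python
import re


def unwrap(paragraph: str) -> str:
    """Join softly wrapped lines into one line, dropping indentation."""
    return " ".join(line.strip() for line in paragraph.splitlines())


def item_messages(item: str, marker: str = "* ") -> list[str]:
    """Split a list item into one unwrapped message per paragraph."""
    body = item[len(marker):]
    return [unwrap(p) for p in re.split(r"\n\s*\n", body)]


def rebuild_item(translations: list[str], marker: str = "* ") -> str:
    """Reassemble translated messages; later paragraphs get the item indent."""
    indent = " " * len(marker)
    return marker + ("\n\n" + indent).join(translations)


item = (
    "* This is\n"
    "  a single\n"
    "  list item.\n"
    "\n"
    "  Second paragraph\n"
    "  in first item."
)
print(item_messages(item))
# ['This is a single list item.', 'Second paragraph in first item.']
```

`rebuild_item` only produces valid Markdown when the translated messages contain no newline of their own, which is the same condition stated above.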

The goal (for me) is to remove the possibility of errors in the translations, and also to make the translations robust against changes in the formatting.

I would like to hear from @jooyunghan, @jiyongp, @rastringer, @hugojacob, and @ronaldfw if this is a good goal?


jiyongp commented on July 23, 2024

I think it's okay to not unwrap softly wrapped text. It is sometimes even useful especially when translating a code fragment having translatable comments. What is annoying with po is that it doesn't support multi-line strings. Ideally, I wish the following. Not sure po file format supports it (but we could preprocess if not).

Markdown:

# This is a heading

A _little_
paragraph.

```rust,editable
fn main() { // translatable_comment_here
    println!("Hello world!");
}
```

* First
* Second

po file:

msgid "This is a heading"

msgid """A _little_
paragraph."""

msgid """fn main() { // translatable_comment_here
    println!("Hello world!");
}"""

msgid "First"

msgid "Second"


mgeisler commented on July 23, 2024

> What is annoying with po is that it doesn't support multi-line strings.

The PO format uses C-style strings and C-style string concatenation. So

msgid ""
"f"
"o"
"o"

is a msgid of "foo", with no newlines. This means that there are a myriad of different ways to represent the same string in the PO file.

When using msgmerge to update a PO file, it will wrap strings at 80 columns by default, but it will also use embedded \n as good places to wrap.
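To illustrate the concatenation rule, here is a minimal sketch that decodes the quoted fragments of one PO string. It assumes the fragments only use escapes that C and Python string literals share (such as `\n`, `\t`, `\"` and `\\`), so `ast.literal_eval` can stand in for a real PO parser. `po_string` is a hypothetical helper name.

```python
import ast


def po_string(lines: list[str]) -> str:
    """Concatenate the quoted fragments that make up one msgid or msgstr."""
    return "".join(ast.literal_eval(line.strip()) for line in lines if line.strip())


# Line breaks between fragments are layout only and add nothing to the value.
print(repr(po_string(['""', '"f"', '"o"', '"o"'])))  # 'foo'

# Only an explicit \n escape puts a newline into the string.
print(repr(po_string(['""', '"f\\n"', '"o\\n"', '"o"'])))  # 'f\no\no'
```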

I don't understand how having support for newlines in the strings in the PO file helps you here?


jiyongp commented on July 23, 2024

I know that. But there are a few problems here:

* Having to wrap each line with "..." is annoying when you edit the po file with ordinary editors (e.g. vim). A multi-line string is much easier to deal with.
* poedit doesn't support this. It forcibly adds \n to every line you make. For example:

A translated text entered in poedit

f
o
o

becomes

msgid ""
"f\n"
"o\n"
"o"


jooyunghan commented on July 23, 2024

IMO, the problem with working on the PO file directly is that we need to handle a stack of two encodings: the PO file's C-style string literals (with escaping) on top of the Markdown text. I think that's why @jiyong wished for a PO file that supports raw text.

My workflow now is:

* Use poedit as the main editor for the "Markdown text" (no need to think about PO file formats).
* Since I don't want poedit to do extra work, I unchecked both "line wrapping" and "preserve formatting".


mgeisler commented on July 23, 2024

> Having to wrap each line with "..." is annoying when you edit the po file with ordinary editors (e.g. vim). A multi-line string is much easier to deal with.

Thanks, I see what you mean now!

For that use case, I would suggest writing a tiny tool which transforms a .po file into a .yaml file or a similar format which can have multi-line strings. There are many services that can convert PO files to some sort of YAML, but I think you'll want something which simply outputs the msgid and msgstr fields in a long list. For fun, I wrote a po2yaml tool. This only converts one way — you'll want to also convert back to a PO file after editing the YAML file. The output looks like this:

- msgid: '# Running the Course'
  msgstr: '# 강의 진행 방식'
- msgid: '> This page is for the course instructor.'
  msgstr: '> 강사를 위한 안내 페이지입니다.'
- msgid: |-
    Here is a bit of background information about how we've been running the course
    internally at Google.
  msgstr: 다음은 구글 내부에서 이 과정을 어떤식으로 운영해왔는지에 대한 배경 정보입니다.
- msgid: 'To run the course, you need to:'
  msgstr: '강의를 실행하기 위한 준비:'
- msgid: |-
    1. Make yourself familiar with the course material. We've included speaker notes
       on some of the pages to help highlight the key points (please help us by
       contributing more speaker notes!). You should make sure to open the speaker
       notes in a popup (click the link with a little arrow next to "Speaker
       Notes"). This way you have a clean screen to present to the class.
  msgstr: 1. 강의 자료를 숙지합니다. 주요 요점을 강조하기 위해 일부 페이지에 강의 참조노트를 포함하였습니다. (추가적인 노트를 작성하여 제공해 주시면 감사하겠습니다.) 강의 참조 노트의 링크를 누르면 강의노트가 별도의 팝업으로 분리가 되며, 메인 화면에서는 사라집니다.

As you can see, multi-line inputs end up as multi-line literal blocks in the YAML file — ready to be edited using your favorite tool 😄

If you think this is useful, then we can probably put it somewhere.


jiyongp commented on July 23, 2024

@mgeisler Yes, that looks great. I'd use it.

One question though: which file will be the source of truth? yaml, or po?


mgeisler commented on July 23, 2024

> @mgeisler Yes, that looks great. I'd use it.
>
> One question though: which file will be the source of truth? yaml, or po?

I was thinking that you would generate the YAML file locally whenever you want and then export back to .po via a save hook in your editor. I was not thinking that it would be used by others, but if you find such a format useful, then go for it 😄

We would need the YAML-to-PO conversion as well, but that should be trivial — we need the fuzzy markers as well, but the source lines (the filenames and line numbers) can be skipped since they come from the messages.pot file anyway.
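As a rough idea of the YAML-to-PO direction, the sketch below formats one entry as PO text. It assumes the YAML has already been loaded into plain `msgid`, `msgstr` and `fuzzy` values (the loading step is left out), it does not wrap at 80 columns, and it leaves out the source-reference comments as suggested above. `po_quote` and `format_entry` are hypothetical names, not part of po2yaml.

```python
def po_quote(text: str) -> str:
    """Render a string as one or more C-style quoted PO fragments."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\t", "\\t")
    parts = escaped.split("\n")
    # Every fragment except the last one ends with an explicit \n escape.
    fragments = [part + "\\n" for part in parts[:-1]] + [parts[-1]]
    if len(fragments) == 1:
        return f'"{fragments[0]}"'
    # Multi-line values start with an empty string, one fragment per line.
    return "\n".join(['""'] + [f'"{frag}"' for frag in fragments if frag])


def format_entry(msgid: str, msgstr: str, fuzzy: bool = False) -> str:
    """Format a single PO entry, optionally carrying the fuzzy flag."""
    lines = []
    if fuzzy:
        lines.append("#, fuzzy")
    lines.append("msgid " + po_quote(msgid))
    lines.append("msgstr " + po_quote(msgstr))
    return "\n".join(lines)


print(format_entry("# Running the Course", "# 강의 진행 방식", fuzzy=True))
```

Running msgmerge afterwards would normalize the wrapping, so the exact line layout produced here should not matter much.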


jiyongp commented on July 23, 2024

ack!


Extract text more carefully in `mdbook-xgettext` about comprehensive-rust HOT 19 CLOSED
