Comments (19)

mgeisler commented on July 23, 2024

When doing this, it's critically important that we run the same transformations on the existing po/*.po files. That way we can keep the work of the translators intact.

from comprehensive-rust.

moutikabdessabour commented on July 23, 2024

Why not use pulldown_cmark, the Markdown parser used by mdbook, to extract paragraphs and then reconstruct them?

mgeisler commented on July 23, 2024

@moutikabdessabour we should definitely use pulldown_cmark for this!

djmitche commented on July 23, 2024

I'd like to work on this, if you don't mind assigning it to me. I can see how to replace the existing extract_paragraphs with a more sophisticated thing that emits a bunch of textual chunks.

> run the same transformations on the existing po/*.po files

Did you have something "easy" in mind for this? My thinking was that this would be a kind of half-automated process, where with some iteration I could find a one-off way to translate all of the old msgid's to new msgid's, and then apply those to the po/*.po files, keeping the existing translation.

jooyunghan commented on July 23, 2024

While working on the Korean translation I found that keeping the Markdown markup (i.e. bullets) was helpful because it gave me the freedom to do whatever fits better in the target language, like splitting a single bullet into two when necessary. It's probably the same reason why mgeisler@ thought trimming links would be a poor idea.

djmitche commented on July 23, 2024

That makes a lot of sense. I think we could adjust the chunk-extraction to collapse adjacent list-item chunks into a single chunk.

mgeisler commented on July 23, 2024

> Did you have something "easy" in mind for this? My thinking was that this would be a kind of half-automated process, where with some iteration I could find a one-off way to translate all of the old msgid's to new msgid's, and then apply those to the po/*.po files, keeping the existing translation.

Yes, that was also roughly my idea. Basically that the new extraction functionality can be accessed from some temporary tool which will iterate over pairs of msgid and msgstr and apply the same extraction to those, producing yet more pairs. As you say, the idea is that a fully translated .po file should remain translated after running a new mdbook-xgettext followed by msgmerge.
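
A minimal sketch of that temporary tool, under stated assumptions: `resplit_pair` and `split_on_blank_lines` are hypothetical names, and the blank-line splitter only stands in for whatever the real extraction ends up being. The idea shown is to run the same extraction over a msgid and its msgstr and pair the resulting chunks up, leaving the entry untouched when the two sides do not line up.

```rust
/// Re-split one translated entry by running the same extraction on both sides.
///
/// `extract` is a stand-in for the new extraction logic. If msgid and msgstr
/// yield a different number of chunks, they cannot be aligned automatically,
/// so the original pair is returned unchanged for manual review.
fn resplit_pair(
    msgid: &str,
    msgstr: &str,
    extract: fn(&str) -> Vec<String>,
) -> Vec<(String, String)> {
    let ids = extract(msgid);
    let strs = extract(msgstr);
    if ids.len() == strs.len() {
        ids.into_iter().zip(strs).collect()
    } else {
        vec![(msgid.to_string(), msgstr.to_string())]
    }
}

/// Placeholder extraction: split on blank lines and drop empty chunks.
fn split_on_blank_lines(text: &str) -> Vec<String> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|chunk| !chunk.is_empty())
        .map(String::from)
        .collect()
}
```

Calling `resplit_pair("A\n\nB", "X\n\nY", split_on_blank_lines)` produces the pairs `("A", "X")` and `("B", "Y")`. A translation that merged or split paragraphs comes back as the original single pair.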

I'm thinking this should be done in smaller steps and that each step should be carried out on the .po files in lock step:

Perhaps we can start by teaching mdbook-xgettext to do proper Markdown parsing via pulldown_cmark first. Use new_cmark_parser to get a Parser and then probably into_offset_iter to get something which has the needed byte offsets.

Next, I imagine it would be easy to extract fenced code blocks as a unit, and probably also easy to strip away # from headings.

I've been dabbling a bit with this myself and I think the biggest trouble will be to parse all of the different Tag variants. So one thought I had was to only parse the simple stuff at first and bail out if you see anything else. Bailing out would mean falling back to the naive \n\n+ splitting of the file. Most pages in the course are very simple: a heading, some text, a code block. At least they used to be like that, but many of them now have "speaker notes", which is a trailing <details> ... </details> block at the end.
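
For reference, the naive `\n\n+` fallback can be sketched with the standard library alone. This is an illustration, not the existing mdbook-xgettext code, and `split_paragraphs` is a hypothetical name.

```rust
/// Split `text` wherever two or more consecutive newlines occur (`\n\n+`).
/// Leading and trailing newlines are ignored.
fn split_paragraphs(text: &str) -> Vec<&str> {
    let mut paragraphs = Vec::new();
    let mut rest = text.trim_matches('\n');
    while let Some(pos) = rest.find("\n\n") {
        let (head, tail) = rest.split_at(pos);
        if !head.is_empty() {
            paragraphs.push(head);
        }
        // Skip the whole run of newlines, however long it is.
        rest = tail.trim_start_matches('\n');
    }
    if !rest.is_empty() {
        paragraphs.push(rest);
    }
    paragraphs
}
```

The sketch also shows why the fallback is only a fallback: a fenced code block that contains a blank line is cut into two chunks, which is what extracting code blocks as a unit would avoid.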

> While working on the Korean translation I found that keeping the Markdown markup (i.e. bullets) was helpful because it gave me the freedom to do whatever fits better in the target language, like splitting a single bullet into two when necessary.

I knew that the current system gives us that freedom, but I didn't know the freedom was used 😄 Can you tell us more about where you had to do this? My gut feeling is that we should try to improve the original English text in those cases.

djmitche commented on July 23, 2024

I got a start on this today in #449.

I think this can get pretty close to producing the existing set of messages. This is probably a good place to start, and then update the .po files where they differ (number of newlines, maybe some funny business around <details>, etc.). Then a followup could wrap paragraphs, remove # from headers, break up bullet lists (if desired), and so on. The transformations on the .po files for these followups should be pretty straightforward.

djmitche commented on July 23, 2024

On experimenting a bit, I think we should leave lists as a unit for translation. The reason is, otherwise indentation is very hard to get right. For example, given

 * Always takes a single set of parameter types

we get

msgid: "Always takes a single set of parameter types"

If the translation goes onto multiple lines, it's not at all obvious to the translator that this must be

msgstr: "Always"
"   takes"
"   a" ...

in order to keep the indentation correct. So, I will include lists in their entirety.

djmitche commented on July 23, 2024

Also, I don't think there's any automated way to re-break these messages. Some lists were broken into multiple messages by having \n\n between them, and some were not. I think the only way to go about this is manually editing the translation files :(

mgeisler commented on July 23, 2024

> If the translation goes onto multiple lines, it's not at all obvious to the translator that this must be
>
> msgstr: "Always"
> "   takes"
> "   a" ...
>
> in order to keep the indentation correct. So, I will include lists in their entirety.

Long-term, I would like to unwrap such paragraphs. So

* This is
  a single
  list item.

  Second paragraph
  in first item.

Becomes two messages in the .po file:

  • "This is a single list item."
  • "Second paragraph in first item."

Indentation and wrapping have been taken away. When translating the original text, we end up with

  • "* "
  • The translation of the first message
  • "\n\n "
  • The translation of the second message

This should work as long as there are no new \n characters in the two messages.
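
A rough sketch of the two halves of that idea, assuming plain `* ` bullets: `unwrap_paragraph` and `rebuild_list_item` are hypothetical helper names, and the continuation indent is taken to be two spaces, matching the width of `* ` in the Markdown example.

```rust
/// Unwrap a softly wrapped paragraph: strip each line's indentation and
/// join the lines with single spaces.
fn unwrap_paragraph(paragraph: &str) -> String {
    paragraph
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Rebuild a list item from its translated paragraphs: "* " before the
/// first paragraph, then a blank line plus two spaces before each later one.
/// This only stays correct if no paragraph contains a newline of its own.
fn rebuild_list_item(paragraphs: &[&str]) -> String {
    format!("* {}", paragraphs.join("\n\n  "))
}
```

With this, `unwrap_paragraph("This is\n  a single\n  list item.")` gives the first message from the example, and `rebuild_list_item` puts the markup back around the translated messages.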

The goal (for me) is to remove the possibility of errors in the translations, and also to make the translations robust against changes in the formatting.

I would like to hear from @jooyunghan, @jiyongp, @rastringer, @hugojacob, and @ronaldfw if this is a good goal?

jiyongp commented on July 23, 2024

I think it's okay to not unwrap softly wrapped text. It is sometimes even useful especially when translating a code fragment having translatable comments. What is annoying with po is that it doesn't support multi-line strings. Ideally, I wish the following. Not sure po file format supports it (but we could preprocess if not).

Markdown:

# This is a heading

A _little_
paragraph.

```rust,editable
fn main() { // translatable_comment_here
    println!("Hello world!");
}
```

* First
* Second

po file:

msgid "This is a heading"

msgid """A _little_
paragraph."""

msgid """fn main() { // translatable_comment_here
    println!("Hello world!");
}"""

msgid "First"

msgid "Second"

mgeisler commented on July 23, 2024

> What is annoying with po is that it doesn't support multi-line strings.

The PO format uses C-style string and C-style string concatenation. So

msgid ""
"f"
"o"
"o"

is a msgid of "foo", with no newlines. This means that there are a myriad of different ways to represent the same string in the PO file.
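
To make the concatenation rule concrete, here is a small sketch that decodes quoted fragments and joins them. It is not a full PO parser: `decode_fragment` and `concat_fragments` are hypothetical names, and only the escapes `\n`, `\t`, `\"` and `\\` are handled.

```rust
/// Decode one quoted PO fragment such as `"f\n"` into its text.
/// Returns None for a line that is not a quoted fragment or that uses an
/// escape this sketch does not cover.
fn decode_fragment(line: &str) -> Option<String> {
    let inner = line.trim().strip_prefix('"')?.strip_suffix('"')?;
    let mut out = String::new();
    let mut chars = inner.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next()? {
            'n' => out.push('\n'),
            't' => out.push('\t'),
            '"' => out.push('"'),
            '\\' => out.push('\\'),
            _ => return None,
        }
    }
    Some(out)
}

/// Concatenate adjacent fragments, as the PO format does for one msgid.
fn concat_fragments(lines: &[&str]) -> Option<String> {
    lines.iter().map(|line| decode_fragment(line)).collect()
}
```

Feeding it the four fragments `""`, `"f"`, `"o"`, `"o"` yields `foo`; a newline only appears in the result where a fragment contains an explicit `\n` escape.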

When using msgmerge to update a PO file, it will wrap strings at 80 columns by default, but it will also use embedded \n as good places to wrap.

I don't understand how having support for newlines in the strings in the PO file helps you here?

jiyongp commented on July 23, 2024

I know that. But there are a few problems here:

  1. having to wrap each line with "..." is annoying, when you edit the po file with ordinary editors (e.g. vim). Multi-line string is much easier to deal with.
  2. poedit doesn't support this. It forcibly adds \n to every line you make. For example:

A translated text entered in poedit

f
o
o

becomes

msgid ""
"f\n"
"o\n"
"o"

jooyunghan commented on July 23, 2024

IMO, the problem with working on the PO file directly is that we have to handle a stack of two encodings: the PO file's C-style string literals (with escaping) on top of the Markdown text. I think that's why @jiyong wished that the PO format supported raw text.

My workflow now is:

  • Use poedit as a main editor for "MarkDown text". (no need to think about PO file formats)
  • Since I don't want poedit to do extra work, I unchecked both "line wrapping" and "preserve formatting"

mgeisler commented on July 23, 2024

> having to wrap each line with "..." is annoying, when you edit the po file with ordinary editors (e.g. vim). Multi-line string is much easier to deal with.

Thanks, I see what you mean now!

For that use case, I would suggest writing a tiny tool which transforms a .po file into a .yaml file or a similar format which can have multi-line strings. There are many services that can convert PO files to some sort of YAML, but I think you'll want something which simply outputs the msgid and msgstr fields in a long list. For fun, I wrote a po2yaml tool. This only converts one way; you'll also want to convert back to a PO file after editing the YAML file. The output looks like this:

- msgid: '# Running the Course'
  msgstr: '# 강의 진행 방식'
- msgid: '> This page is for the course instructor.'
  msgstr: '> 강사를 위한 안내 페이지입니다.'
- msgid: |-
    Here is a bit of background information about how we've been running the course
    internally at Google.
  msgstr: 다음은 구글 내부에서 이 과정을 어떤식으로 운영해왔는지에 대한 배경 정보입니다.
- msgid: 'To run the course, you need to:'
  msgstr: '강의를 실행하기 위한 준비:'
- msgid: |-
    1. Make yourself familiar with the course material. We've included speaker notes
       on some of the pages to help highlight the key points (please help us by
       contributing more speaker notes!). You should make sure to open the speaker
       notes in a popup (click the link with a little arrow next to "Speaker
       Notes"). This way you have a clean screen to present to the class.
  msgstr: 1. 강의 자료를 숙지합니다. 주요 요점을 강조하기 위해 일부 페이지에 강의 참조노트를 포함하였습니다. (추가적인 노트를 작성하여 제공해 주시면 감사하겠습니다.) 강의 참조 노트의 링크를 누르면 강의노트가 별도의 팝업으로 분리가 되며, 메인 화면에서는 사라집니다.

As you can see, multi-line inputs end up as multi-line literal blocks in the YAML file, ready to be edited using your favorite tool 😄
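
The formatting rule visible in that output can be sketched as follows. This is not the po2yaml tool itself: `yaml_field` and `yaml_entry` are hypothetical names, every single-line value is single-quoted for simplicity, and text whose first line starts with a space or which ends in a newline would need extra YAML indicators that the sketch leaves out.

```rust
/// Format one `key: value` field: a `|-` literal block for multi-line text,
/// a single-quoted scalar otherwise.
fn yaml_field(key: &str, value: &str, indent: &str) -> String {
    if value.contains('\n') {
        let mut out = format!("{indent}{key}: |-\n");
        for line in value.lines() {
            // Block content sits two spaces deeper than its key.
            out.push_str(&format!("{indent}  {line}\n"));
        }
        out
    } else {
        // Inside single quotes, YAML escapes a quote by doubling it.
        format!("{indent}{key}: '{}'\n", value.replace('\'', "''"))
    }
}

/// Format one list entry holding a msgid and its msgstr.
fn yaml_entry(msgid: &str, msgstr: &str) -> String {
    let first = yaml_field("msgid", msgid, "  ");
    let second = yaml_field("msgstr", msgstr, "  ");
    // Replace the first field's two-space indent with the list marker "- ".
    format!("- {}{}", &first[2..], second)
}
```

A two-line msgid comes out as a `|-` block indented four spaces under `- msgid:`, the same shape as the entries in the sample output.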

If you think this is useful, then we can probably put it somewhere.

jiyongp commented on July 23, 2024

@mgeisler Yes, that looks great. I'd use it.

One question though: which file will be the source of truth? yaml, or po?

mgeisler commented on July 23, 2024

> @mgeisler Yes, that looks great. I'd use it.
>
> One question though: which file will be the source of truth? yaml, or po?

I was thinking that you would generate the YAML file locally whenever you want and then export back to .po via a save hook in your editor. I was not thinking that it would be used by others, but if you find such a format useful, then go for it 😄

We would need the YAML-to-PO conversion as well, but that should be trivial: we need the fuzzy markers as well, but the source lines (the filenames and line numbers) can be skipped since they come from the messages.pot file anyway.
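
As a sketch of the PO-writing half of that conversion (the YAML parsing is left out), the following formats one entry with an optional fuzzy flag. The names `escape_po`, `po_field` and `po_entry` are hypothetical. Strings are broken only at embedded newlines, with no wrapping of long lines, so a later msgmerge run may reflow the result.

```rust
/// Escape text for use inside a PO string literal.
fn escape_po(text: &str) -> String {
    text.replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\t', "\\t")
        .replace('\n', "\\n")
}

/// Format a field on a single line when the text has no newline; otherwise
/// as `keyword ""` followed by one quoted fragment per source line.
fn po_field(keyword: &str, text: &str) -> String {
    if !text.contains('\n') {
        return format!("{keyword} \"{}\"\n", escape_po(text));
    }
    let mut out = format!("{keyword} \"\"\n");
    for piece in text.split_inclusive('\n') {
        out.push_str(&format!("\"{}\"\n", escape_po(piece)));
    }
    out
}

/// Format one entry, preceded by a `#, fuzzy` flag line when requested.
fn po_entry(msgid: &str, msgstr: &str, fuzzy: bool) -> String {
    let flag = if fuzzy { "#, fuzzy\n" } else { "" };
    format!("{flag}{}{}", po_field("msgid", msgid), po_field("msgstr", msgstr))
}
```

Source-location comments are deliberately not written, since they can be restored from messages.pot as described above.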

jiyongp commented on July 23, 2024

ack!
