google / mdbook-i18n-helpers

Translation support for mdbook. The plugins here give you a structured way to maintain a translated book.

License: Apache License 2.0

Rust 100.00%
i18n l10n mdbook mdbook-preprocessor gettext mdbook-renderer rust

mdbook-i18n-helpers's Introduction

Internationalization and Rendering extensions for mdbook


This repository contains the following crates that provide extensions and infrastructure for mdbook:

Showcases

mdbook-i18n-helpers

Please add your project below if it uses mdbook-i18n-helpers for translations:

Installation

mdbook-i18n-helpers

Run

cargo install mdbook-i18n-helpers

Please see USAGE for how to translate your mdbook project.

Please see the i18n-helpers/CHANGELOG for details on the changes in each release.

mdbook-tera-backend

Run

$ cargo install mdbook-tera-backend

Contact

For questions or comments, please contact Martin Geisler or start a discussion. We would love to hear from you.


This is not an officially supported Google product.

mdbook-i18n-helpers's People

Contributors

antoniolinhart, dalance, dependabot[bot], dyoo, friendlymatthew, henrif75, kdarkhan, mgeisler, qwandor, sakex, zachcmadsen


mdbook-i18n-helpers's Issues

Filter out empty `msgid` entries

I tried running mdbook-xgettext on the Rust Book. After removing formatting from the SUMMARY.md file, I end up with a messages.pot file which is almost correct:

% msgcat -o po/messages.pot po/messages.pot
po/messages.pot:1660: duplicate message definition...
po/messages.pot:3: ...this is the location of the first definition
msgcat: found 1 fatal error

The problem is this entry:

#: src/appendix-04-useful-development-tools.md:120
#: src/appendix-04-useful-development-tools.md:149
msgid ""
msgstr ""

which in turn originates from a non-empty HTML tag:

<span class="filename">Filename: src/main.rs</span>

The empty msgid is also produced by empty HTML tags:

<span id="ferris"></span>

We should handle such entries correctly. This would mean

  1. Ensure we don't add duplicate entries to the PO file. This is perhaps something that needs fixing in polib.
  2. Perhaps we should extract the text inside inline HTML tags such as the span above?
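The deduplication in step 1 could look roughly like this. This is a minimal sketch under stated assumptions: `Message` is a stand-in struct, not the real polib API, and the empty-msgid filter reflects the convention that an empty msgid is reserved for the PO header.

```rust
use std::collections::BTreeMap;

/// Stand-in for a PO catalog entry (hypothetical; not the polib type).
#[derive(Debug, Clone)]
struct Message {
    msgid: String,
    sources: Vec<String>,
}

/// Drop empty msgids and merge duplicate msgids, concatenating their
/// source references, while preserving first-seen order.
fn dedup_messages(messages: Vec<Message>) -> Vec<Message> {
    let mut merged: BTreeMap<String, Message> = BTreeMap::new();
    let mut order = Vec::new();
    for msg in messages {
        if msg.msgid.is_empty() {
            continue; // an empty msgid is reserved for the PO header
        }
        match merged.get_mut(&msg.msgid) {
            Some(existing) => existing.sources.extend(msg.sources),
            None => {
                order.push(msg.msgid.clone());
                merged.insert(msg.msgid.clone(), msg);
            }
        }
    }
    order.into_iter().map(|id| merged.remove(&id).unwrap()).collect()
}
```

With this shape, the two `src/appendix-04-useful-development-tools.md` entries above would simply be skipped, and repeated non-empty msgids would collapse into one entry with both source lines.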

`group_events` fuzz test is flaky

After #129, the group_events fuzz test seems to have become flaky. @dalance, would you be able to take a look?

An example failure is here, where the failing input can be minimized to

"````d\n```"

The diff of the failure is

<    "````d\n```\n````",
>    "```d\n```\n```",

meaning that reconstruct_markdown(&events, None) returned

````d
```
````

whereas reconstruct_markdown(&flattened_groups, None) returned

```d
```
```

My guess is that this is because the counting of consecutive ` characters is slightly off.

Indeed, looking at flattened_groups in the fuzz test, I see

[fuzz_targets/group_events.rs:22:5] &flattened_groups = [
    (
        1,
        Start(
            CodeBlock(
                Fenced(
                    Borrowed(
                        "d",
                    ),
                ),
            ),
        ),
    ),
    (
        2,
        Text(
            Borrowed(
                "`",
            ),
        ),
    ),
    (
        2,
        Text(
            Borrowed(
                "`",
            ),
        ),
    ),

with lots of lone ` characters.

A related question: is this not something which should be fixed in pulldown-cmark-to-cmark instead of here? @dalance, could you create an issue in that repository and see if you can move the fix from here to there?

Lift restriction on having no formatting in `SUMMARY.md`

Right now, mdbook-xgettext forbids formatting in the SUMMARY.md file. It does this because the formatting is stripped by mdbook, meaning that the plugin receives the chapter name without formatting and so we cannot find the chapter name with a literal search in SUMMARY.md.

We should remove this restriction since it makes mdbook-xgettext incompatible with mdbook projects in the wild, such as the Rust Book.

We should instead do something like this

  • Fall back on an inaccurate line number for chapter titles which we cannot find (easy),
  • Parse SUMMARY.md ourselves, strip the formatting, and do the comparison this way (slightly harder, but we can reuse code from mdbook).

Code coverage uploads sometimes fail

Hi @kdarkhan, I've seen a few Codecov uploads fail. An example is in this run. The error is

[2024-01-15T13:13:27.254Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 503 - upstream connect error or disconnect/reset before headers. reset reason: connection failure
Error: Codecov: Failed to properly upload: The process '/home/runner/work/_actions/codecov/codecov-action/v3/dist/codecov' failed with exit code 255

Do you have any idea what could trigger this?

Support splitting the PO files into smaller units

We currently work with monolithic files: mdbook-xgettext writes all messages to a single POT file (typically messages.pot) and mdbook-gettext reads a single PO file as well. We've seen in Comprehensive Rust how these files can grow very large: the PO files are around 800 kB and 24k lines in size.

It would be interesting to support splitting the POT file into smaller units. We could let the book outline decide the splitting and split on each top-level chapter. More elegantly, we could have a splitting depth: the current behavior would be splitting at depth 0, depth 1 would split at top-level chapters, depth 2 would go one level further, etc.

This could in principle be implemented today by scripting calls to msggrep --location (to split the POT file into smaller units) and msgcat (to compile multiple PO files). However, a native solution will likely be cleaner overall. The splitting with msggrep and combining with msgcat suggests an implementation strategy: keep everything like today, and split the Catalog into units before saving them in mdbook-xgettext. Similarly, create a large Catalog from many smaller ones in mdbook-gettext.
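The depth-based splitting could be driven by a small helper that maps each message's source path to the unit it belongs in. This is a sketch under stated assumptions: `split_unit` is a hypothetical function, and the naming of units (directory prefix, or file stem for files at or above the split depth) is one possible convention, not the decided design.

```rust
/// Map a message source path to the POT "unit" it belongs in.
/// Depth 0 keeps everything in one file (the current behavior);
/// depth 1 splits per top-level chapter; deeper values split further.
fn split_unit(source_path: &str, depth: usize) -> String {
    if depth == 0 {
        return "messages".to_string();
    }
    let rel = source_path.strip_prefix("src/").unwrap_or(source_path);
    let parts: Vec<&str> = rel.split('/').collect();
    if parts.len() <= depth {
        // The file itself sits at or above the split depth: use its stem.
        parts.last().unwrap().trim_end_matches(".md").to_string()
    } else {
        // Otherwise, group by the leading directory components.
        parts[..depth].join("/")
    }
}
```

mdbook-xgettext would then bucket its Catalog by this key before saving, and mdbook-gettext would concatenate all buckets back into one Catalog, mirroring the msggrep/msgcat strategy natively.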

Deduplicate repeated sources

With #145 merged, we can now end up in a situation where a message has the same source repeated multiple times:

#: src/SUMMARY.md:10 src/SUMMARY.md:10 src/SUMMARY.md:70 src/SUMMARY.md:90
msgid "Welcome"
msgstr "Te damos la bienvenida"

We should add logic to remove the duplicates in this case. The situation becomes even worse when the line numbers are removed completely (when granularity is set to 0).
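The fix could be as small as deduplicating the source list while preserving order. A minimal sketch, assuming the sources are stored as a whitespace-separated string (the exact representation in the catalog type may differ):

```rust
use std::collections::HashSet;

/// Remove repeated source references such as "src/SUMMARY.md:10
/// src/SUMMARY.md:10", keeping the first occurrence of each.
fn dedup_sources(source: &str) -> String {
    let mut seen = HashSet::new();
    source
        .split_whitespace()
        .filter(|loc| seen.insert(loc.to_string()))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Note that with granularity 0 (line numbers stripped), every reference to the same file collapses to an identical string, so this same pass also handles that worse case.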

Advance POT-Creation-Date by 1 second in `mdbook-i18n-normalize`

When mdbook-i18n-normalize is used on a PO file, we currently leave the metadata alone. However, this can cause problems for publishing pipelines like the one in Comprehensive Rust: there we expect that the POT-Creation-Date field has changed if any of the msgid fields change.

To handle this situation in a smooth way, I propose incrementing the timestamp by 1 second, if possible. This means: parse the field, increment it by 1 second if it's a valid timestamp, and finally write it back to the file. If it's missing or invalid, then leave it alone.
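The "if possible" part could be sketched with plain string manipulation, without pulling in a date library. This is a deliberately conservative sketch: it only bumps a seconds field that is already present in a "YYYY-MM-DD HH:MM:SS+ZZZZ" timestamp, and returns None (leave the header alone) in every other case, including a carry past :59 that a real implementation would handle with a date library.

```rust
/// Bump the seconds of "YYYY-MM-DD HH:MM:SS+ZZZZ" by one, or return
/// None when the field is missing, malformed, or would need a carry.
fn bump_timestamp(ts: &str) -> Option<String> {
    // The format with a seconds field has exactly two colons
    // (the "+ZZZZ" timezone has none).
    if ts.matches(':').count() != 2 {
        return None;
    }
    let colon = ts.rfind(':')?;
    let secs: u32 = ts.get(colon + 1..colon + 3)?.parse().ok()?;
    if secs >= 59 {
        return None; // carrying into minutes needs a real date library
    }
    Some(format!("{}{:02}{}", &ts[..colon + 1], secs + 1, &ts[colon + 3..]))
}
```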

mdbook-tera-backend not found in crates.io registry

Description

When installing the mdbook-tera-backend as described in the README, it is not found in the crates.io registry:

$ cargo install mdbook-tera-backend
    Updating crates.io index
error: could not find `mdbook-tera-backend` in registry `crates-io` with version `*`

rustup info

Note: I've also tried to install it on my WSL as a sanity check.

Debian server

Default host: x86_64-unknown-linux-gnu
rustup home: /home/user/.rustup

stable-x86_64-unknown-linux-gnu (default)
rustc 1.76.0 (07dca489a 2024-02-04)

WSL

$ rustup show
Default host: x86_64-unknown-linux-gnu
rustup home:  /home/dev/.rustup

stable-x86_64-unknown-linux-gnu (default)
rustc 1.76.0 (07dca489a 2024-02-04)

OS info

Note: I've also tried to install it on my WSL as a sanity check.

Debian server

OS: Debian GNU/Linux 12 (bookworm) x86_64
Kernel: 6.1.0-10-amd64
Shell: bash 5.2.15
Terminal: /dev/pts/0
CPU: Intel Xeon Silver 4210R (1) @ 2.394GHz
GPU: 00:0f.0 VMware SVGA II Adapter

WSL

OS: Kali GNU/Linux Rolling on Windows 10 x86_64
Kernel: 5.15.146.1-microsoft-standard-WSL2
Shell: bash 5.2.15
Terminal: Windows Terminal
CPU: 11th Gen Intel i7-1185G7 (2) @ 1.804GHz
GPU: 2ea0:00:00.0 Microsoft Corporation Basic Render Driver

Add support for comments

We should teach mdbook-xgettext to react to comments in the Markdown. Comments could be used for:

  • Marking the next paragraph or code block as non-translatable. This could save translators a lot of work if there are many code blocks which shouldn't be translated because they don't have any strings or comments.
  • Adding translator comments. Gettext supports comments on each message, and xgettext extracts them using a special syntax (which we should try to replicate).
  • Adding a message context. This is again a Gettext feature: the same message can appear multiple times, but with a different context each time.
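The first bullet could be sketched as a small filter over the extracted blocks. Note that the directive name `<!-- i18n:skip -->` is purely hypothetical here; the actual syntax would need to be decided.

```rust
/// Drop blocks that are preceded by a (hypothetical) skip directive,
/// so they are never emitted into the POT file.
fn filter_translatable(blocks: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    let mut skip_next = false;
    for block in blocks {
        if block.trim() == "<!-- i18n:skip -->" {
            skip_next = true; // the directive itself is not translatable
            continue;
        }
        if skip_next {
            skip_next = false; // swallow exactly one following block
            continue;
        }
        out.push(block.to_string());
    }
    out
}
```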

Backslash error when `xgettext` handles $$ formula

I think this is an xgettext bug:

when I run

MDBOOK_OUTPUT='{"xgettext": {"pot-file": "messages.pot"}}' \
  mdbook build -d po

to generate the messages.pot file:

My source .md file

// my_md_file.md:

$$
\begin{array}{|c|c|c|c|c|}
\hline
1 & x_1 & x_2 & x_3 & out \\
\hline
0 & 1 & 1 & 0 & 0 \\
\hline
\end{array}
$$

When converted to messages.pot:

// message.pot
#: src/plonk-arithmetization.md:35
msgid ""
"$$ \\\\begin{array}{|c|c|c|c|c|} \\\\hline 1 & x_1 & x_2 & x_3 & out \\\\ \\"
"\\hline 0 & 1 & 1 & 0 & 0 \\\\ \\\\hline \\\\end{array} $$"
msgstr ""

As a result, there are many extra backslashes, and the formula can no longer be rendered by mdbook-katex.

Write tool which can convert translated files back to PO

This idea is from rust-embedded/book#326: we should write a converter tool which takes two Markdown files as input and outputs a PO file.

More concretely, the tool should take an en/foo.md and xx/foo.md file and output a xx.po file. The tool will call extract_messages on both files and line up the results. It will use the messages from en/foo.md as the msgid and the corresponding message from xx/foo.md as the msgstr.

The output is marked fuzzy to ensure that a human translator double-checks it all before publication.
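The lining-up step could be sketched like this, assuming extract_messages has already reduced each file to an ordered list of message strings (the function name `pair_messages` and the bail-out on mismatched lengths are illustrative choices, not the final design):

```rust
/// Pair up messages extracted from the English and translated files.
/// Returns None when the two files don't have the same structure, in
/// which case a human must reconcile them first.
fn pair_messages(english: &[&str], translated: &[&str]) -> Option<Vec<(String, String)>> {
    if english.len() != translated.len() {
        return None;
    }
    Some(
        english
            .iter()
            .zip(translated)
            .map(|(en, tr)| (en.to_string(), tr.to_string()))
            .collect(),
    )
}
```

Each resulting (msgid, msgstr) pair would then be written to xx.po with the fuzzy flag set, per the paragraph above.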

Write `mdbook` renderer with strong templating system

Currently, the mdbook templates are very limited: rust-lang/mdBook#2016. I believe this is on purpose to prevent templates from becoming too unwieldy.

However, a limited template engine only works when it's easy to modify the logic behind the templates — which means modifying the upstream mdbook code. There are a number of features which we would like to implement, but which require modifying the generated HTML:

The common theme in these issues is that they require us to inject more data into the template (a list of languages, a canonical URL) and that we need to loop over these values and extract various parts.

A stronger mdbook renderer would allow us to do this inside the theme itself. The renderer should be generic and expose the same values to the theme as the HtmlHandlebars renderer does today — plus data configured in the book.toml file. This would be how users will inject more data.

The templating engine should then allow users to

  • define helper functions (Handlebars comes with a limited set of built-in helpers).
    • A much-needed helper would be a function which computes a consistent canonical URL for a given .md file path, e.g., it would turn /index.html into just /. This is similar to the url tag in Django.
  • read and write variables (Handlebars seems to only work with the variables defined in the context passed in from mdbook).
  • include extra files defined in the theme (Handlebars seems to only work with an existing set of files defined in the Rust code).

Ideally, the new renderer would be very small: it converts the chapter content from Markdown to HTML and sends this plus the chapter information to the template engine. All of the static file copying in HtmlHandlebars would be replaced by the theme.

Template Engine

Take a look at https://lib.rs/template-engine and https://crates.io/categories/template-engine. I would look at askama and tera, both of which are inspired by Django. Since we target offline rendering, we don't need the fastest engine in the world; we need something which is flexible and extensible without having to touch the mdbook renderer all the time.

Ideally, we would prototype a set of template helper functions in a theme first (in Comprehensive Rust, most likely). If they become widely useful, we will move those helper definitions to a "standard library" for the new renderer. This way others can reuse the helpers in their own themes.
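To make the wishlist concrete, a language picker in a Tera-style theme could look something like this. This is a hypothetical fragment: `config.extra.languages` and the `canonical_url` helper are invented names standing in for data injected via book.toml and a theme-defined helper function, neither of which exists yet.

```tera
{# Hypothetical theme fragment: loop over languages from book.toml #}
<ul class="language-picker">
  {% for lang in config.extra.languages %}
    <li>
      <a href="{{ canonical_url(path=page.path, lang=lang.code) }}">
        {{ lang.name }}
      </a>
    </li>
  {% endfor %}
</ul>
```

This is exactly the combination the bullets above ask for: a custom helper function, data read from configuration, and a loop that Handlebars cannot express today without upstream changes.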

Add a GitHub Action

We should publish our own GitHub Action to make it easy to use mdbook-i18n-helpers.

This will move logic from the YAML file (such as publish.yml) to the action. The action would probably be written in TypeScript (I'm not sure we can write an action in Rust?), which will be a more comfortable language than inline scripts in the YAML file.

HTML tag abbr seems to confuse the parser

I was redirected here from google/comprehensive-rust#1284.

The source:
https://github.com/google/comprehensive-rust/blob/a38a33c8fba58678d0a9127d9644242bce41ed94/src/bare-metal/microcontrollers/probe-rs.md

The po file (a bit outdated, but updating it does not resolve the issue):
https://github.com/google/comprehensive-rust/blob/a38a33c8fba58678d0a9127d9644242bce41ed94/po/ja.po#L13420

Now, if I make this change (just for testing purposes)

diff --git a/po/ja.po b/po/ja.po
index b2baaf215743..2a9eb0c872d4 100644
--- a/po/ja.po
+++ b/po/ja.po
@@ -13418,7 +13418,7 @@ msgstr ""
 
 #: src/bare-metal/microcontrollers/probe-rs.md:10
 msgid "`cargo-embed` is a cargo subcommand to build and flash binaries, log "
-msgstr ""
+msgstr "`cargo-embed`"
 
 #: src/bare-metal/microcontrollers/probe-rs.md:11
 msgid "RTT"

The strange thing about the result is that the string cargo-embed got merged into the last item in the list.

codecov-action upgrade broke coverage reports

Coverage reports started failing after upgrade from v3 to v4.

Example failure here

Error message

==> Running version v0.4.6
==> Running command '/home/runner/work/_actions/codecov/codecov-action/v4/dist/codecov create-commit'
/home/runner/work/_actions/codecov/codecov-action/v4/dist/codecov create-commit -C 4a213d1e5bb64fed6de71036439f53d1c796564e --pr 155 -Z
info - 2024-02-06 18:22:00,803 -- ci service found: github-actions
warning - 2024-02-06 18:22:00,806 -- No config file could be found. Ignoring config.
info - 2024-02-06 18:22:00,951 -- Process Commit creating complete
error - 2024-02-06 18:22:00,951 -- Commit creating failed: ["Service not found: none"]
Traceback (most recent call last):
  File "codecov_cli/main.py", line 81, in <module>
  File "codecov_cli/main.py", line 77, in run
  File "click/core.py", line 1157, in __call__
  File "click/core.py", line 1078, in main
  File "click/core.py", line 1688, in invoke
  File "click/core.py", line 1434, in invoke
  File "click/core.py", line 783, in invoke
  File "click/decorators.py", line 33, in new_func
  File "codecov_cli/commands/commit.py", line 64, in create_commit
  File "codecov_cli/services/commit/__init__.py", line 39, in create_commit_logic
  File "codecov_cli/helpers/request.py", line 133, in log_warnings_and_errors_if_any
NameError: name 'exit' is not defined
[1739] Failed to execute script 'main' due to unhandled exception!
Error: Codecov: Failed to properly create commit: The process '/home/runner/work/_actions/codecov/codecov-action/v4/dist/codecov' failed with exit code 1

Add a fuzz tests for all binaries

We should add fuzz tests for each binary:

  • mdbook-xgettext: test that we don't crash on random Markdown input
  • mdbook-gettext: test that we don't crash on random PO files
  • mdbook-i18n-normalize: same as for mdbook-gettext (this could have caught #56)

Add support for translation comments

We should teach mdbook-xgettext to react to comments in the Markdown. Comments could be used for:

  • Marking the next paragraph or code block as non-translatable. This could save translators a lot of work if there are many code blocks which shouldn't be translated because they don't have any strings or comments.
  • Adding translator comments. Gettext supports comments on each message, and xgettext extracts them using a special syntax (which we should try to replicate).
  • Adding a message context. This is again a Gettext feature: the same message can appear multiple times, but with a different context each time.

Fuzz test the generated `.pot` file

This is a followup to #57 which would hopefully have caught #64: we should fuzz test the output of mdbook-xgettext.

We can do this in at least two ways:

  • We can send the output through msgcat -o /dev/null to get an independent verification that the output can be parsed.
  • We can load the generated output with polib, assuming it will detect similar problems as msgcat. We might want to add a fuzzer for this to the upstream library: BrettDong/polib#6.

Add support for only publishing a language if it is more than NN% translated

When the source material keeps changing, the translations will naturally lag behind. In that case, it could be nice to only publish a new version if it is mostly up-to-date, meaning it is more than NN% translated.

This kind of functionality can be built today by looking at the output of msgfmt since it shows the number of translated and untranslated messages. We should try to package the functionality in a reusable fashion.
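Parsing the msgfmt --statistics summary line is the core of such a tool. A sketch under stated assumptions: `translated_percent` is a hypothetical helper, and it relies on the fact that msgfmt prints the categories in a fixed order and simply omits categories whose count is zero.

```rust
/// Compute the translated percentage from `msgfmt --statistics` output
/// such as "123 translated messages, 4 fuzzy translations, 5
/// untranslated messages." Returns None when no counts are found.
fn translated_percent(stats: &str) -> Option<f64> {
    // Grab the count for the first comma-separated part mentioning the
    // keyword. Because parts come in the order translated / fuzzy /
    // untranslated, "translated message" safely matches the first part
    // even though it is also a substring of "untranslated messages".
    let grab = |keyword: &str| -> u64 {
        stats
            .split(',')
            .find(|part| part.contains(keyword))
            .and_then(|part| part.split_whitespace().next())
            .and_then(|n| n.parse().ok())
            .unwrap_or(0)
    };
    let translated = grab("translated message");
    let total = translated + grab("fuzzy") + grab("untranslated");
    if total == 0 {
        None
    } else {
        Some(100.0 * translated as f64 / total as f64)
    }
}
```

A publishing script would then compare the result against the NN% threshold before deciding to build that language.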

Add support for a translation context

Yeah, we haven't added that, though I suspect it can follow a similar pattern to how we're doing comments.

So probably a sketch of supporting this would be:

  • Define another HTML comment directive to let authors specify the ctxt, similar to what we're doing here

  • Collect and thread that state when we're gathering groups for translation.

  • Finally, pass the ctxt here and call the builder method for setting the ctxt.

So I think this would be a fairly straightforward extension.

Originally posted by @dyoo in google/comprehensive-rust#1973 (comment)

Add tool which can build translations for all available languages

The mdbook-gettext preprocessor translates a book into a single language. To translate your book into all available languages, you need to build a loop yourself. An example of this can be found in the publish.yml GitHub action for Comprehensive Rust 🦀:

      - name: Build all translations
        run: |
          for po_lang in ${{ env.LANGUAGES }}; do
              echo "::group::Building $po_lang translation"
              MDBOOK_BOOK__LANGUAGE=$po_lang \
              MDBOOK_OUTPUT__HTML__SITE_URL=/comprehensive-rust/$po_lang/ \
              mdbook build -d book/$po_lang
              echo "::endgroup::"
          done

We should make this easier somehow. Idea:

  • Build a small command line tool (perhaps called mdbook-i18n) where mdbook-i18n build would do the looping seen above.

Duplicate rows in message.pot

when I run

# gen message.po
MDBOOK_OUTPUT='{"xgettext": {"pot-file": "messages.pot"}}' \
  mdbook build -d po

there are many duplicate rows, and I don't know why.

Normalize Markdown in `.pot` files

When mdbook-xgettext extracts translatable text, it would be great if it could normalize the strings. This would make it possible for us to reformat the entire course without fearing that the translations get destroyed while doing so.

The normalization would take Markdown like this

# This is a heading

This is another heading
=======================

A _little_
paragraph.

```rust,editable
fn main() {
    println!("Hello world!");
}
```

* First
* Second

and turn it into these messages in the .pot file:

  • "This is a heading" (atx heading is stripped)
  • "This is another heading" (setext heading is stripped)
  • "A _little_ paragraph." (soft-wrapped lines are unfolded)
  • "fn main() {\n println!("Hello world!");\n}" (info string is stripped, we should instead use a #, flag)
  • "First" (bullet point extracted individually)
  • "Second"

Like in google/comprehensive-rust#318, we should do this in a step-by-step fashion and make sure to apply the transformations to the existing translations. It would also be good if we have a way to let translators update their not-yet-submitted translations.

How to display a toggle button

After configuring the language folders, how do I display a toggle button on my homepage so that readers can switch languages?

I looked at dojo's mdbook, and they seem to have manually written some JavaScript DOM code.

Is there a convenient way to add this toggle button to mdbook?

Package a language selector

Currently, you have to edit the mdbook theme directly to add a language selector. An example of this can be seen in Comprehensive Rust 🦀.

We should package this up in some way to make it easy for people to apply. This seems non-trivial because the templating system used in mdbook doesn't seem to make it easy to include new blocks of code without editing the main theme.

Some ideas:

  • Inject JavaScript into the pages and let this code build the menu client-side.
  • Write a tool which can modify the generated HTML to include the menu when mdbook build is called.

Add scripts for using and updating translations

The instructions for our translation pipeline in TRANSLATIONS.md don't match the actual steps in publish.yml.

The difference is mostly because of how the GitHub Actions allows us to set environment variables using a different syntax.

I would like to unify the two via small scripts. The scripts could be shell scripts (though that probably doesn't work well on Windows?) or they could be Rust "scripts" (more setup time).

I'm imagining something like

  • build-translation which takes a xx locale and outputs a book in book/xx.
  • update-translation which runs both mdbook-xgettext and msgmerge for you.

A serve-translation would probably also be nice to have.

Instead of several scripts, a single script with subcommands could also be nice. That could probably live nicely in the i18n-helpers project since it would be tightly coupled to the other binaries there.

Extract only strings and comments from code blocks

As a further step after #75, we should offer an option to only extract literal strings and comments from the code.

For this example:

fn pick_one<T>(a: T, b: T) -> T {
    if std::process::id() % 2 == 0 { a } else { b }
}

fn main() {
    println!("coin toss: {}", pick_one("heads", "tails"));
    println!("cash prize: {}", pick_one(500, 1000));
}

we would end up with just four small strings

  • "coin toss: {}"
  • "heads"
  • "tails"
  • "cash prize: {}"

in the POT file.

This would require us to process Tag::CodeBlock in a more fine-grained way, but I think it could be worth it.

The fun part would be to find a cross-language solution. I suspect our best bet would be to use a syntax highlighting library: they normally detect strings and comments and so such a library should have the necessary machinery.
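Until a syntax highlighting library is wired in, the idea can be illustrated with a deliberately naive, Rust-only scanner. This sketch ignores escape sequences and would misfire on `//` inside a string literal; it only shows the shape of the extraction, not a robust tokenizer.

```rust
/// Naively pull double-quoted string literals and `//` comments out of
/// a code block, in source order within each line.
fn extract_translatable(code: &str) -> Vec<String> {
    let mut found = Vec::new();
    for line in code.lines() {
        let (code_part, comment) = match line.find("//") {
            Some(pos) => (&line[..pos], Some(line[pos + 2..].trim())),
            None => (line, None),
        };
        // Odd-numbered segments between '"' delimiters are the string
        // literals (escapes are not handled in this sketch).
        for (i, segment) in code_part.split('"').enumerate() {
            if i % 2 == 1 {
                found.push(segment.to_string());
            }
        }
        if let Some(text) = comment {
            if !text.is_empty() {
                found.push(text.to_string());
            }
        }
    }
    found
}
```

Applied to the pick_one example above, this yields exactly the four strings listed.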

Reference links are escaped

Because we sometimes feed small text fragments into extract_messages, we can end up in a situation where a reference link (such as [foo][1]) is parsed in isolation, without the corresponding link definition (the [1]: http://example/ part).

The result is an escaped link. When parsing, pulldown-cmark will call the broken_link_callback (if any), but this callback does not have a way to pass through the reference (1 above); it must return a proper URL (http://example/ above).

There are already problems with translating reference links: the translation can easily go out of sync with the link definitions (this is documented in the Comprehensive Rust style guide).

For this reason, we will probably end up forbidding using reference links when translating.

Wrong normalization of message with HTML

This is from google/comprehensive-rust#1471: running mdbook-i18n-normalize on a translated message that contains HTML gives the wrong output.

I can reproduce this with

diff --git a/i18n-helpers/src/normalize.rs b/i18n-helpers/src/normalize.rs
index 7fed94d..4c0fd5c 100644
--- a/i18n-helpers/src/normalize.rs
+++ b/i18n-helpers/src/normalize.rs
@@ -451,6 +451,15 @@ mod tests {
         );
     }
 
+    #[test]
+    fn test_normalize_html() {
+        let catalog = create_catalog(&[("foo <span>bar</span>", "FOO <span>BAR</span>")]);
+        assert_normalized_messages_eq(
+            catalog,
+            &[exact("foo <span>bar</span>", "FOO <span>BAR</span>")],
+        );
+    }
+
     #[test]
     fn test_normalize_disappearing_html() {
         // Normalizing "<b>" results in no messages.

This tests normalizing a catalog looking like this:

msgid "foo <span>bar</span>"
msgstr "FOO <span>BAR</span>"

The test fails with

---- normalize::tests::test_normalize_html stdout ----
thread 'normalize::tests::test_normalize_html' panicked at i18n-helpers/src/normalize.rs:457:9:
assertion failed: `(left == right)`

Diff < left / right > :
 [
     (
         false,
<        "foo ",
<        "FOO ",
<    ),
<    (
<        false,
<        "bar",
<        "BAR",
>        "foo <span>bar</span>",
>        "FOO <span>BAR</span>",
     ),
 ]

which tells us that the normalized catalog contains two messages:

msgid "foo"
msgstr "FOO"

msgid "bar"
msgstr "BAR"

In other words, the original message was split into two messages.

In case only the translation contains HTML, the normalization will end up with a different number of messages: 1 for the msgid field, and 2 for the msgstr field. This is seen as an error, so a fallback kicks in: the normalized message is marked fuzzy and we accumulate the left-over messages into the final message.

This behavior can be seen in this unit test in normalize.rs:

    #[test]
    fn test_normalize_fuzzy_list_items_too_many() {
        let catalog = create_catalog(&[(
            "* foo\n\
             * bar",
            "* FOO\n\
             * BAR\n\
             * BAZ",
        )]);
        assert_normalized_messages_eq(catalog, &[fuzzy("foo", "FOO"), fuzzy("bar", "BAR\n\nBAZ")]);
    }

This is what happened in google/comprehensive-rust#1471 where we transform

#: src/SUMMARY.md:92
msgid "Double Frees in Modern C++"
msgstr "آزاد سازی مضاعف در<span dir=ltr>C++</span> مدرن"

into

#: src/SUMMARY.md:92
#, fuzzy
msgid "Double Frees in Modern C++"
msgstr ""
"آزاد سازی مضاعف در\n"
"\n"
"C++\n"
"\n"
"مدرن"

Add links between translations

When outputting translations, we need to add links between the corresponding pages in the translations. This will take two forms:

  1. A language picker for users. This is done via google/comprehensive-rust#411.
  2. Links for robots so that search engines can index the pages reliably.

The second part is done by adding

<link rel="alternate" hreflang="en" href="https://google.github.io/comprehensive-rust/" />
<link rel="alternate" hreflang="pt-BR" href="https://google.github.io/comprehensive-rust/pt-BR/" />

to the head element of every page. The alternatives will link each page with its translated siblings. See this documentation for details.
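Generating these link elements is mechanical once the language list is known. A minimal sketch, assuming the site URL and language codes come from book.toml, and assuming (as in the example above) that the default language lives at the site root while translations live under a per-language subdirectory:

```rust
/// Render `<link rel="alternate" hreflang=…>` lines for one page.
/// The default language "en" is assumed to live at the site root.
fn hreflang_links(site_url: &str, langs: &[&str], page: &str) -> String {
    langs
        .iter()
        .map(|lang| {
            let href = if *lang == "en" {
                format!("{site_url}{page}")
            } else {
                format!("{site_url}{lang}/{page}")
            };
            format!("<link rel=\"alternate\" hreflang=\"{lang}\" href=\"{href}\" />")
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

The renderer would inject the returned block into the head element of every generated page.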

Support recursing into `{{#include foo}}` directives

Currently, we use mdbook-xgettext as a renderer, e.g., as an output format which sees the Markdown files after all of the preprocessors have run. To match this, we recommend installing the mdbook-gettext preprocessor with

[preprocessor.gettext]
after = ["links"]

This means that the mdbook-gettext preprocessor sees any files included via the built-in links preprocessor, just like the mdbook-xgettext renderer does.

This works well, and with #95 fixed, we even minimize the strings extracted this way. However, we could improve on this a bit by handling the include directives ourselves (hopefully by calling out to the links preprocessor). The advantages of this would be:

  • Correct source lines for the translations: right now the source line points back to the Markdown file, but with a post-expansion line number.
  • Fewer updates of line numbers: if a Markdown file includes a.txt first and then b.txt, then a change to a.txt will disrupt the line numbers of all following messages. This causes unnecessary churn in the translation files.

To implement this, we should instead run mdbook-xgettext as a preprocessor, and run it before the links preprocessor.

This might of course not be worth it, especially as links.rs doesn't seem to expose its functionality.

Create binaries to support stand-alone usage

The mdbook-xgettext plugin can currently only be used from mdbook since it reads a special JSON format on standard input. However, it would be quite easy to use the same code to extract text from any Markdown file on disk. We should create a small binary which does this.

Similarly for mdbook-gettext: we should create a small binary which will translate a Markdown file into another language using a PO file.

Having these binaries would make it slightly easier to test our own code:

echo '> foo **bar** *baz*' | mdbook-i18n-xgettext

should be enough to produce a PO file on standard output (it would show that *baz* is normalized to _baz_, for example). I could have used this kind of ad-hoc testing when I was battling corner-cases in #33.

Images with alt text show up twice

See https://google.github.io/comprehensive-rust/pt-BR/.

The underlying HTML looks weird, with the images being duplicated:

<p>
<a href="https://github.com/google/comprehensive-rust/actions/workflows/build.yml?query=branch%3Amain">
  <img src="https://img.shields.io/github/actions/workflow/status/google/comprehensive-rust/build.yml?style=flat-square" alt="Build workflow">
</a>
Build workflow
<a href="https://github.com/google/comprehensive-rust/actions/workflows/build.yml?query=branch%3Amain">
  <img src="https://img.shields.io/github/actions/workflow/status/google/comprehensive-rust/build.yml?style=flat-square" alt="Build workflow">
</a>
<a href="https://github.com/google/comprehensive-rust/graphs/contributors">
  <img src="https://img.shields.io/github/contributors/google/comprehensive-rust?style=flat-square" alt="GitHub contributors">
</a>
GitHub contributors
<a href="https://github.com/google/comprehensive-rust/graphs/contributors">
  <img src="https://img.shields.io/github/contributors/google/comprehensive-rust?style=flat-square" alt="GitHub contributors">
</a>
<a href="https://github.com/google/comprehensive-rust/stargazers">
  <img src="https://img.shields.io/github/stars/google/comprehensive-rust?style=flat-square" alt="GitHub stars">
</a>
GitHub stars
<a href="https://github.com/google/comprehensive-rust/stargazers">
  <img src="https://img.shields.io/github/stars/google/comprehensive-rust?style=flat-square" alt="GitHub stars">
</a>
</p>

Support rounding line numbers

When updating the translation files for Comprehensive Rust, we have a lot of small diffs like this:

-#: src/SUMMARY.md:19 src/SUMMARY.md:79 src/SUMMARY.md:134 src/SUMMARY.md:192
-#: src/SUMMARY.md:218 src/SUMMARY.md:268
+#: src/SUMMARY.md:19 src/SUMMARY.md:80 src/SUMMARY.md:135 src/SUMMARY.md:193
+#: src/SUMMARY.md:231 src/SUMMARY.md:281
 msgid "Welcome"
 msgstr "Velkommen"

We could avoid some of this churn if we would round the line numbers. I'm thinking rounding them to nearest multiple of 5 or 10 will help a lot. Jumping to a rounded line number would still bring you close to the actual position.

One could also remove the line numbers completely. The whole source location can be removed by sending the files through msgcat --no-location.

What do people think of this?
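The nearest-multiple rounding proposed above is a one-liner; the only wrinkle is keeping small line numbers from rounding down to 0. A sketch (the function name and the granularity-0 passthrough are illustrative choices):

```rust
/// Round a line number to the nearest multiple of `granularity`,
/// clamping to at least 1 so the location still points into the file.
/// Granularity 0 leaves the line number untouched.
fn round_line(line: usize, granularity: usize) -> usize {
    if granularity == 0 {
        return line;
    }
    let rounded = (line + granularity / 2) / granularity * granularity;
    rounded.max(1)
}
```

With granularity 10, src/SUMMARY.md:79 and src/SUMMARY.md:80 both become src/SUMMARY.md:80, so the diff shown above would disappear. Removing locations entirely (msgcat --no-location) remains the other end of the same spectrum.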

Normalization does not respect line number granularity

As a side-quest ( 😄 ) to #154, I've realized that mdbook-i18n-normalize doesn't use the same rounding logic... that will come back and cause problems at some point.

It would be great to have it round the normalized message sources the same way.
