hsivonen / encoding_rs

A Gecko-oriented implementation of the Encoding Standard in Rust

Home Page: https://docs.rs/encoding_rs/

License: Other

Rust 98.28% Python 1.71% Shell 0.01%
rust encoding charset unicode web

encoding_rs

encoding_rs is an implementation of the (non-JavaScript parts of) the Encoding Standard written in Rust.

The Encoding Standard defines the Web-compatible set of character encodings, which means this crate can be used to decode Web content. encoding_rs is used in Gecko starting with Firefox 56. Due to the notable overlap between the legacy encodings on the Web and the legacy encodings used on Windows, this crate may be of use for non-Web-related situations as well; see below for links to adjacent crates.

Additionally, the mem module provides various operations for dealing with in-RAM text (as opposed to data that's coming from or going to an IO boundary). The mem module is a module instead of a separate crate due to internal implementation detail efficiencies.

Functionality

Due to the Gecko use case, encoding_rs supports decoding to and encoding from UTF-16 in addition to supporting the usual Rust use case of decoding to and encoding from UTF-8. Additionally, the API has been designed to be FFI-friendly to accommodate the C++ side of Gecko.

Specifically, encoding_rs does the following:

  • Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t).
  • Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16 (units of u16 / char16_t) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko's UTF-16 is potentially invalid.)
  • Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid UTF-8.
  • Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding Standard-defined character encoding. (Rust's UTF-8 is guaranteed-valid.)
  • Does the above in streaming (input and output split across multiple buffers) and non-streaming (whole input in a single buffer and whole output in a single buffer) variants.
  • Avoids copying (borrows) when possible in the non-streaming cases when decoding to or encoding from UTF-8.
  • Resolves textual labels that identify character encodings in protocol text into type-safe objects representing those encodings conceptually.
  • Maps the type-safe encoding objects onto strings suitable for returning from document.characterSet.
  • Validates UTF-8 (in common instruction set scenarios a bit faster for Web workloads than the standard library; hopefully will get upstreamed some day) and ASCII.
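
The "as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER" rule for potentially-invalid UTF-16 can be illustrated with the standard library alone. This is a minimal sketch, not the crate's encoder: the function name utf16_to_string_lossy is hypothetical, and the code only shows the replacement semantics when the target is UTF-8.

```rust
/// Converts potentially-invalid UTF-16 into a `String`.
/// Each unpaired surrogate becomes U+FFFD, which mirrors the
/// "as if replaced before encoding" rule described in the list above.
/// (Hypothetical helper name; std-only, not encoding_rs API.)
fn utf16_to_string_lossy(units: &[u16]) -> String {
    std::char::decode_utf16(units.iter().cloned())
        // `Err` marks an unpaired surrogate; substitute U+FFFD for it.
        .map(|r| r.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect()
}
```

The standard library's String::from_utf16_lossy applies the same replacement rule; the explicit version is shown only to make the per-code-unit decision visible.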

Additionally, encoding_rs::mem does the following:

  • Checks if a byte buffer contains only ASCII.
  • Checks if a potentially-invalid UTF-16 buffer contains only Basic Latin (ASCII).
  • Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer contains only Latin1 code points (below U+0100).
  • Checks if a valid UTF-8, potentially-invalid UTF-8 or potentially-invalid UTF-16 buffer or a code point or a UTF-16 code unit can trigger right-to-left behavior (suitable for checking if the Unicode Bidirectional Algorithm can be optimized out).
  • Combined versions of the above two checks.
  • Converts valid UTF-8, potentially-invalid UTF-8 and Latin1 to UTF-16.
  • Converts potentially-invalid UTF-16 and Latin1 to UTF-8.
  • Converts UTF-8 and UTF-16 to Latin1 (if in range).
  • Finds the first invalid code unit in a buffer of potentially-invalid UTF-16.
  • Makes a mutable buffer of potentially-invalid UTF-16 contain valid UTF-16.
  • Copies ASCII from one buffer to another up to the first non-ASCII byte.
  • Converts ASCII to UTF-16 up to the first non-ASCII byte.
  • Converts UTF-16 to ASCII up to the first non-Basic Latin code unit.
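
To make the semantics of a few of these operations concrete, here is a std-only sketch. The function names are hypothetical and are not the mem module's API, and the code is written for clarity rather than speed. In particular, returning None for out-of-range input in utf16_to_latin1 is a choice made for this sketch.

```rust
/// Length of the longest ASCII-only prefix of `bytes`.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    bytes.iter().position(|&b| b >= 0x80).unwrap_or(bytes.len())
}

/// True if every UTF-16 code unit is a Latin1 code point (below U+0100).
fn is_utf16_latin1(units: &[u16]) -> bool {
    units.iter().all(|&u| u < 0x100)
}

/// Latin1 to UTF-8: each byte value is the code point of the same value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// UTF-16 to Latin1, or `None` if any code unit is out of range.
fn utf16_to_latin1(units: &[u16]) -> Option<Vec<u8>> {
    units
        .iter()
        .map(|&u| if u < 0x100 { Some(u as u8) } else { None })
        .collect()
}
```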

Integration with std::io

Notably, the above feature list doesn't include the capability to wrap a std::io::Read, decode it into UTF-8 and present the result via std::io::Read. The encoding_rs_io crate provides that capability.

no_std Environment

The crate works in a no_std environment. By default, the alloc feature, which assumes that an allocator is present, is enabled. For a no-allocator environment, the default features (i.e. alloc) can be turned off. This makes the part of the API that returns Vec/String/Cow unavailable.

Decoding Email

For decoding character encodings that occur in email, use the charset crate instead of using this one directly. (It wraps this crate and adds UTF-7 decoding.)

Windows Code Page Identifier Mappings

For mappings to and from Windows code page identifiers, use the codepage crate.

DOS Encodings

This crate does not support single-byte DOS encodings that aren't required by the Web Platform, but the oem_cp crate does.

Preparing Text for the Encoders

Normalizing text into Unicode Normalization Form C prior to encoding text into a legacy encoding minimizes unmappable characters. Text can be normalized to Unicode Normalization Form C using the icu_normalizer crate.

The exception is windows-1258, which after normalizing to Unicode Normalization Form C requires tone marks to be decomposed in order to minimize unmappable characters. Vietnamese tone marks can be decomposed using the detone crate.

Licensing

TL;DR: (Apache-2.0 OR MIT) AND BSD-3-Clause for the code and data combination.

Please see the file named COPYRIGHT.

The non-test code that isn't generated from the WHATWG data in this crate is under Apache-2.0 OR MIT. Test code is under CC0.

This crate contains code/data generated from WHATWG-supplied data. The WHATWG upstream changed its license for portions of specs incorporated into source code from CC0 to BSD-3-Clause between the initial release of this crate and the present version of this crate. The in-source licensing legends have been updated for the parts of the generated code that have changed since the upstream license change.

Documentation

Generated API documentation is available online.

There is a long-form write-up about the design and internals of the crate.

C and C++ bindings

An FFI layer for encoding_rs is available as a separate crate. The crate comes with a demo C++ wrapper using the C++ standard library and GSL types.

The bindings for the mem module are in the encoding_c_mem crate.

For the Gecko context, there's a C++ wrapper using the MFBT/XPCOM types.

There's a write-up about the C++ wrappers.

Sample programs

Optional features

There are currently these optional cargo features:

simd-accel

Enables SIMD acceleration using the nightly-dependent portable_simd standard library feature.

This is an opt-in feature, because enabling this feature opts out of Rust's guarantees of future compilers compiling old code (aka. "stability story").

Currently, this has not been tested to be an improvement except for these targets and enabling the simd-accel feature is expected to break the build on other targets:

  • x86_64
  • i686
  • aarch64
  • thumbv7neon

If you use nightly Rust, you use targets whose first component is one of the above, and you are prepared to have to revise your configuration when updating Rust, you should enable this feature. Otherwise, please do not enable this feature.

Used by Firefox.

serde

Enables support for serializing and deserializing &'static Encoding-typed struct fields using Serde.

Not used by Firefox.

fast-legacy-encode

A catch-all option for enabling the fastest legacy encode options. Does not affect decode speed or UTF-8 encode speed.

At present, this option is equivalent to enabling the following options:

  • fast-hangul-encode
  • fast-hanja-encode
  • fast-kanji-encode
  • fast-gb-hanzi-encode
  • fast-big5-hanzi-encode

Adds 176 KB to the binary size.

Not used by Firefox.

fast-hangul-encode

Changes encoding of precomposed Hangul syllables into EUC-KR from binary search over the decode-optimized tables to lookup by index, making Korean plain-text encode about 4 times as fast as without this option.

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-hanja-encode

Changes encoding of Hanja into EUC-KR from linear search over the decode-optimized table to lookup by index. Since Hanja is practically absent in modern Korean text, this option doesn't affect performance in the common case and mainly makes sense if you want to make your application resilient against denial of service by someone intentionally feeding it a lot of Hanja to encode into EUC-KR.

Adds 40 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-kanji-encode

Changes encoding of Kanji into Shift_JIS, EUC-JP and ISO-2022-JP from linear search over the decode-optimized tables to lookup by index making Japanese plain-text encode to legacy encodings 30 to 50 times as fast as without this option (about 2 times as fast as with less-slow-kanji-encode).

Takes precedence over less-slow-kanji-encode.

Adds 36 KB to the binary size (24 KB compared to less-slow-kanji-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-kanji-encode

Makes JIS X 0208 Level 1 Kanji (the most common Kanji in Shift_JIS, EUC-JP and ISO-2022-JP) encode less slow (binary search instead of linear search) making Japanese plain-text encode to legacy encodings 14 to 23 times as fast as without this option.

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-gb-hanzi-encode

Changes encoding of Hanzi in the CJK Unified Ideographs block into GBK and gb18030 from linear search over a part of the decode-optimized tables followed by a binary search over another part of the decode-optimized tables to lookup by index, making Simplified Chinese plain-text encode to the legacy encodings 100 to 110 times as fast as without this option (about 2.5 times as fast as with less-slow-gb-hanzi-encode).

Takes precedence over less-slow-gb-hanzi-encode.

Adds 36 KB to the binary size (24 KB compared to less-slow-gb-hanzi-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-gb-hanzi-encode

Makes GB2312 Level 1 Hanzi (the most common Hanzi in gb18030 and GBK) encode less slow (binary search instead of linear search) making Simplified Chinese plain-text encode to the legacy encodings about 40 times as fast as without this option.

Adds 12 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

fast-big5-hanzi-encode

Changes encoding of Hanzi in the CJK Unified Ideographs block into Big5 from linear search over a part of the decode-optimized tables to lookup by index, making Traditional Chinese plain-text encode to Big5 105 to 125 times as fast as without this option (about 3 times as fast as with less-slow-big5-hanzi-encode).

Takes precedence over less-slow-big5-hanzi-encode.

Adds 40 KB to the binary size (20 KB compared to less-slow-big5-hanzi-encode).

Does not affect decode speed.

Not used by Firefox.

less-slow-big5-hanzi-encode

Makes Big5 Level 1 Hanzi (the most common Hanzi in Big5) encode less slow (binary search instead of linear search) making Traditional Chinese plain-text encode to Big5 about 36 times as fast as without this option.

Adds 20 KB to the binary size.

Does not affect decode speed.

Not used by Firefox.

Performance goals

For decoding to UTF-16, the goal is to perform at least as well as Gecko's old uconv. For decoding to UTF-8, the goal is to perform at least as well as rust-encoding. These goals have been achieved.

Encoding to UTF-8 should be fast. (UTF-8 to UTF-8 encode should be equivalent to memcpy and UTF-16 to UTF-8 should be fast.)

Speed is a non-goal when encoding to legacy encodings. By default, encoding to legacy encodings should not be optimized for speed at the expense of code size as long as form submission and URL parsing in Gecko don't become noticeably too slow in real-world use.

In the interest of binary size, by default, encoding_rs does not have encode-specific data tables beyond 32 bits of encode-specific data for each single-byte encoding. Therefore, encoders search the decode-optimized data tables. This is a linear search in most cases. As a result, by default, encode to legacy encodings varies from slow to extremely slow relative to other libraries. Still, with realistic work loads, this seemed fast enough not to be user-visibly slow on Raspberry Pi 3 (which stood in for a phone for testing) in the Web-exposed encoder use cases.
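
The size-versus-speed trade-off described above can be sketched with a toy table. Everything in this example is hypothetical: the four-entry DECODE_TABLE is made-up data rather than a real encoding, and the functions are not the crate's internals. The sketch contrasts encoding by linear search over a decode-optimized table (no extra data) with encoding by binary search over a separate table sorted by code point (extra data, faster lookups).

```rust
/// Toy decode-optimized table (made-up data, not a real encoding):
/// the index is `byte - 0x80`, the value is the decoded code point.
const DECODE_TABLE: [u16; 4] = [0x20AC, 0x0411, 0x00E9, 0x03A9];

/// Decoding is a direct index into the table.
fn decode(byte: u8) -> Option<u16> {
    let index = usize::from(byte.checked_sub(0x80)?);
    DECODE_TABLE.get(index).cloned()
}

/// Encode without extra data: linear search over the decode table.
fn encode_linear(code_point: u16) -> Option<u8> {
    DECODE_TABLE
        .iter()
        .position(|&cp| cp == code_point)
        .map(|i| 0x80 + i as u8)
}

/// Extra encode-specific data: (code point, byte) pairs sorted by code point.
fn build_encode_table() -> Vec<(u16, u8)> {
    let mut table: Vec<(u16, u8)> = DECODE_TABLE
        .iter()
        .enumerate()
        .map(|(i, &cp)| (cp, 0x80 + i as u8))
        .collect();
    table.sort();
    table
}

/// Encode with the sorted table: binary search instead of linear search.
fn encode_binary(table: &[(u16, u8)], code_point: u16) -> Option<u8> {
    table
        .binary_search_by_key(&code_point, |&(cp, _)| cp)
        .ok()
        .map(|i| table[i].1)
}
```

Both encode functions return the same results; the second pays for its speed with a second table, which is the kind of cost the optional cargo features make opt-in.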

See the cargo features above for optionally making CJK legacy encode fast.

A framework for measuring performance is available separately.

Rust Version Compatibility

It is a goal to support the latest stable Rust, the latest nightly Rust and the version of Rust that's used for Firefox Nightly.

At this time, there is no firm commitment to support a version older than what's required by Firefox, and there is no commitment to treat MSRV changes as semver-breaking, because this crate depends on cfg-if, which doesn't appear to treat MSRV changes as semver-breaking, so it would be useless for this crate to treat MSRV changes as semver-breaking.

As of 2024-04-04, MSRV appears to be Rust 1.36.0 for using the crate and 1.42.0 for doc tests to pass without errors about the global allocator. With the simd-accel feature, the MSRV is even higher.

Compatibility with rust-encoding

A compatibility layer that implements the rust-encoding API on top of encoding_rs is provided as a separate crate (cannot be uploaded to crates.io). The compatibility layer was originally written with the assumption that Firefox would need it, but it is not currently used in Firefox.

Regenerating Generated Code

To regenerate the generated code:

  • Have Python 2 installed.
  • Clone https://github.com/hsivonen/encoding_c next to the encoding_rs directory.
  • Clone https://github.com/hsivonen/codepage next to the encoding_rs directory.
  • Clone https://github.com/whatwg/encoding next to the encoding_rs directory.
  • Checkout revision be3337450e7df1c49dca7872153c4c4670dd8256 of the encoding repo. (Note: f381389 was the revision of encoding used from before the encoding repo license change. So far, only output changed since then has been updated to the new license legend.)
  • With the encoding_rs directory as the working directory, run python generate-encoding-data.py.

Roadmap

  • Design the low-level API.
  • Provide Rust-only convenience features.
  • Provide an stl/gsl-flavored C++ API.
  • Implement all decoders and encoders.
  • Add unit tests for all decoders and encoders.
  • Finish BOM sniffing variants in Rust-only convenience features.
  • Document the API.
  • Publish the crate on crates.io.
  • Create a solution for measuring performance.
  • Accelerate ASCII conversions using SSE2 on x86.
  • Accelerate ASCII conversions using ALU register-sized operations on non-x86 architectures (process an usize instead of u8 at a time).
  • Split FFI into a separate crate so that the FFI doesn't interfere with LTO in pure-Rust usage.
  • Compress CJK indices by making use of sequential code points as well as Unicode-ordered parts of indices.
  • Make lookups by label or name use binary search that searches from the end of the label/name to the start.
  • Make labels with non-ASCII bytes fail fast.
  • Parallelize UTF-8 validation using Rayon. (This turned out to be a pessimization in the ASCII case due to memory bandwidth reasons.)
  • Provide an XPCOM/MFBT-flavored C++ API.
  • Investigate accelerating single-byte encode with a single fast-tracked range per encoding.
  • Replace uconv with encoding_rs in Gecko.
  • Implement the rust-encoding API in terms of encoding_rs.
  • Add SIMD acceleration for Aarch64.
  • Investigate the use of NEON on 32-bit ARM.
  • Investigate Björn Höhrmann's lookup table acceleration for UTF-8 as adapted to Rust in rust-encoding.
  • Add actually fast CJK encode options.
  • Investigate Bob Steagall's lookup table acceleration for UTF-8.
  • Provide a build mode that works without alloc (with lesser API surface).
  • Migrate to std::simd once it is stable and declare 1.0.
  • Migrate unsafe slice access by larger types than u8/u16 to align_to.
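
The roadmap item about ALU register-sized operations refers to testing several bytes per step instead of one. A minimal sketch of the idea follows, assuming a fixed 64-bit word instead of usize and ignoring alignment, so it illustrates the technique rather than the crate's implementation. The function name is hypothetical.

```rust
/// Checks whether `bytes` is all ASCII, testing eight bytes per step.
fn is_ascii_wordwise(bytes: &[u8]) -> bool {
    // The high bit of every byte lane in a 64-bit word.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;
    let mut chunks = bytes.chunks_exact(8);
    for chunk in chunks.by_ref() {
        let mut word = [0u8; 8];
        word.copy_from_slice(chunk);
        // A set high bit in any lane means the chunk has a non-ASCII byte.
        if u64::from_ne_bytes(word) & HIGH_BITS != 0 {
            return false;
        }
    }
    // Fewer than eight bytes remain: check them one at a time.
    chunks.remainder().iter().all(|&b| b < 0x80)
}
```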

Release Notes

0.8.34

  • Use the portable_simd nightly feature of the standard library instead of the packed_simd crate. Only affects the simd-accel optional nightly feature.
  • Internal documentation improvements and minor code improvements around unsafe.
  • Added rust-version to Cargo.toml.

0.8.33

  • Use packed_simd instead of packed_simd_2 again now that updates are back under the packed_simd name. Only affects the simd-accel optional nightly feature.

0.8.32

  • Removed build.rs. (This removal should resolve false positives reported by some antivirus products. This may break some build configurations that have opted out of Rust's guarantees against future build breakage.)
  • Internal change to what API is used for reinterpreting the lane configuration of SIMD vectors.
  • Documentation improvements.

0.8.31

  • Use SPDX with parentheses now that crates.io supports parentheses.

0.8.30

  • Update the licensing information to take into account the WHATWG data license change.

0.8.29

  • Make the parts that use an allocator optional.

0.8.28

  • Fix error in Serde support introduced as part of no_std support.

0.8.27

  • Make the crate work in a no_std environment (with alloc).

0.8.26

  • Fix oversights in edition 2018 migration that broke the simd-accel feature.

0.8.25

  • Do pointer alignment checks in a way where intermediate steps aren't defined to be Undefined Behavior.
  • Update the packed_simd dependency to packed_simd_2.
  • Update the cfg-if dependency to 1.0.
  • Address warnings that have been introduced by newer Rust versions along the way.
  • Update to edition 2018, since even prior to 1.0 cfg-if updated to edition 2018 without a semver break.

0.8.24

  • Avoid computing an intermediate (not dereferenced) pointer value in a manner designated as Undefined Behavior when computing pointer alignment.

0.8.23

  • Remove year from copyright notices. (No features or bug fixes.)

0.8.22

  • Formatting fix and new unit test. (No features or bug fixes.)

0.8.21

  • Fixed a panic with invalid UTF-16[BE|LE] input at the end of the stream.

0.8.20

  • Make Decoder::latin1_byte_compatible_up_to return None in more cases to make the method actually useful. While this could be argued to be a breaking change due to the bug fix changing semantics, it does not break callers that had to handle the None case in a reasonable way anyway.

0.8.19

  • Removed a bunch of bound checks in convert_str_to_utf16.
  • Added mem::convert_utf8_to_utf16_without_replacement.

0.8.18

  • Added mem::utf8_latin1_up_to and mem::str_latin1_up_to.
  • Added Decoder::latin1_byte_compatible_up_to.

0.8.17

  • Update bincode (dev dependency) version requirement to 1.0.

0.8.16

  • Switch from the simd crate to packed_simd.

0.8.15

  • Adjust documentation for simd-accel (README-only release).

0.8.14

  • Made UTF-16 to UTF-8 encode conversion fill the output buffer as closely as possible.

0.8.13

  • Made the UTF-8 to UTF-16 decoder compare the number of code units written with the length of the right slice (the output slice) to fix a panic introduced in 0.8.11.

0.8.12

  • Removed the clippy:: prefix from clippy lint names.

0.8.11

  • Changed minimum Rust requirement to 1.29.0 (for the ability to refer to the interior of a static when defining another static).
  • Explicitly aligned the lookup tables for single-byte encodings and UTF-8 to cache lines in the hope of freeing up one cache line for other data. (Perhaps the tables were already aligned and this is placebo.)
  • Added 32 bits of encode-oriented data for each single-byte encoding. The change was performance-neutral for non-Latin1-ish Latin legacy encodings, improved Latin1-ish and Arabic legacy encode speed somewhat (new speed is 2.4x the old speed for German, 2.3x for Arabic, 1.7x for Portuguese and 1.4x for French) and improved non-Latin1, non-Arabic legacy single-byte encode a lot (7.2x for Thai, 6x for Greek, 5x for Russian, 4x for Hebrew).
  • Added compile-time options for fast CJK legacy encode options (at the cost of binary size (up to 176 KB) and run-time memory usage). These options still retain the overall code structure instead of rewriting the CJK encoders totally, so the speed isn't as good as what could be achieved by using even more memory / making the binary even larger.
  • Made UTF-8 decode and validation faster.
  • Added method is_single_byte() on Encoding.
  • Added mem::decode_latin1() and mem::encode_latin1_lossy().

0.8.10

  • Disabled a unit test that tests a panic condition when the assertion being tested is disabled.

0.8.9

  • Made --features simd-accel work with stable-channel compiler to simplify the Firefox build system.

0.8.8

  • Made the is_foo_bidi() functions not treat U+FEFF (ZERO WIDTH NO-BREAK SPACE aka. BYTE ORDER MARK) as right-to-left.
  • Made the is_foo_bidi() functions report true if the input contains Hebrew presentation forms (which are right-to-left but not in a right-to-left-roadmapped block).

0.8.7

  • Fixed a panic in the UTF-16LE/UTF-16BE decoder when decoding to UTF-8.

0.8.6

  • Temporarily removed the debug assertion added in version 0.8.5 from convert_utf16_to_latin1_lossy.

0.8.5

  • If debug assertions are enabled but fuzzing isn't enabled, lossy conversions to Latin1 in the mem module assert that the input is in the range U+0000...U+00FF (inclusive).
  • In the mem module provide conversions from Latin1 and UTF-16 to UTF-8 that can deal with insufficient output space. The idea is to use them first with an allocation rounded up to jemalloc bucket size and do the worst-case allocation only if the jemalloc rounding up was insufficient as the first guess.
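
The "first guess, then worst case" allocation idea from the 0.8.5 notes can be sketched as follows. This is a std-only illustration under stated assumptions: both function names are hypothetical (they are not the mem module's functions), the headroom policy in the first guess is arbitrary, and no jemalloc bucket rounding is modeled.

```rust
/// Converts Latin1 to UTF-8, stopping when `dst` cannot hold the next
/// character. Returns `(bytes_read, bytes_written)`.
fn latin1_to_utf8_partial(src: &[u8], dst: &mut [u8]) -> (usize, usize) {
    let mut written = 0;
    for (read, &b) in src.iter().enumerate() {
        let needed = if b < 0x80 { 1 } else { 2 };
        if dst.len() - written < needed {
            return (read, written);
        }
        if b < 0x80 {
            dst[written] = b;
        } else {
            // Two-byte UTF-8 sequence for U+0080..=U+00FF.
            dst[written] = 0xC0 | (b >> 6);
            dst[written + 1] = 0x80 | (b & 0x3F);
        }
        written += needed;
    }
    (src.len(), written)
}

/// Tries an optimistic buffer first and allocates for the worst case
/// only if the first guess turned out to be too small.
fn latin1_to_string_two_pass(src: &[u8]) -> String {
    // First guess: input length plus some headroom (arbitrary policy).
    let mut buf = vec![0u8; src.len() + src.len() / 8 + 1];
    let (read, mut written) = latin1_to_utf8_partial(src, &mut buf);
    if read < src.len() {
        // Worst case for the rest: two output bytes per input byte.
        buf.resize(written + (src.len() - read) * 2, 0);
        let (_, more) = latin1_to_utf8_partial(&src[read..], &mut buf[written..]);
        written += more;
    }
    buf.truncate(written);
    String::from_utf8(buf).expect("output is valid UTF-8 by construction")
}
```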

0.8.4

  • Fix SSE2-specific, simd-accel-specific memory corruption introduced in version 0.8.1 in conversions between UTF-16 and Latin1 in the mem module.

0.8.3

  • Removed an #[inline(never)] annotation that was not meant for release.

0.8.2

  • Made non-ASCII UTF-16 to UTF-8 encode faster by manually omitting bound checks and manually adding branch prediction annotations.

0.8.1

  • Tweaked loop unrolling and memory alignment for SSE2 conversions between UTF-16 and Latin1 in the mem module to increase the performance when converting long buffers.

0.8.0

  • Changed the minimum supported version of Rust to 1.21.0 (semver breaking change).
  • Flipped around the defaults vs. optional features for controlling the size vs. speed trade-off for Kanji and Hanzi legacy encode (semver breaking change).
  • Added NEON support on ARMv7.
  • SIMD-accelerated x-user-defined to UTF-16 decode.
  • Made UTF-16LE and UTF-16BE decode a lot faster (including SIMD acceleration).

0.7.2

  • Add the mem module.
  • Refactor SIMD code which can affect performance outside the mem module.

0.7.1

  • When encoding from invalid UTF-16, correctly handle U+DC00 followed by another low surrogate.

0.7.0

  • Make replacement a label of the replacement encoding. (Spec change.)
  • Remove Encoding::for_name(). (Encoding::for_label(foo).unwrap() is now close enough after the above label change.)
  • Remove the parallel-utf8 cargo feature.
  • Add optional Serde support for &'static Encoding.
  • Performance tweaks for ASCII handling.
  • Performance tweaks for UTF-8 validation.
  • SIMD support on aarch64.

0.6.11

  • Make Encoder::has_pending_state() public.
  • Update the simd crate dependency to 0.2.0.

0.6.10

  • Reserve enough space for NCRs when encoding to ISO-2022-JP.
  • Correct max length calculations for multibyte decoders.
  • Correct max length calculations before BOM sniffing has been performed.
  • Correctly calculate max length when encoding from UTF-16 to GBK.

0.6.9

0.6.8

  • Correctly handle the case where the first buffer contains potentially partial BOM and the next buffer is the last buffer.
  • Decode byte 7F correctly in ISO-2022-JP.
  • Make UTF-16 to UTF-8 encode write closer to the end of the buffer.
  • Implement Hash for Encoding.

0.6.7

0.6.6

  • Correct max length calculation when a partial BOM prefix is part of the decoder's state.

0.6.5

  • Correct max length calculation in various encoders.
  • Correct max length calculation in the UTF-16 decoder.
  • Derive PartialEq and Eq for the CoderResult, DecoderResult and EncoderResult types.

0.6.4

  • Avoid panic when encoding with replacement and the destination buffer is too short to hold one numeric character reference.

0.6.3

  • Add support for 32-bit big-endian hosts. (For real this time.)

0.6.2

  • Fix a panic from subslicing with bad indices in Encoder::encode_from_utf16. (Due to an oversight, it lacked the fix that Encoder::encode_from_utf8 already had.)
  • Micro-optimize error status accumulation in non-streaming case.

0.6.1

  • Avoid panic near integer overflow in a case that's unlikely to actually happen.
  • Address Clippy lints.

0.6.0

  • Make the methods for computing worst-case buffer size requirements check for integer overflow.
  • Upgrade rayon to 0.7.0.

0.5.1

  • Reorder methods for better documentation readability.
  • Add support for big-endian hosts. (Only 64-bit case actually tested.)
  • Optimize the ALU (non-SIMD) case for 32-bit ARM instead of x86_64.

0.5.0

  • Avoid allocating excessively long buffers in non-streaming decode.
  • Fix the behavior of ISO-2022-JP and replacement decoders near the end of the output buffer.
  • Annotate the result structs with #[must_use].

0.4.0

  • Split FFI into a separate crate.
  • Performance tweaks.
  • CJK binary size and encoding performance changes.
  • Parallelize UTF-8 validation in the case of long buffers (with optional feature parallel-utf8).
  • Borrow even with ISO-2022-JP when possible.

0.3.2

  • Fix moving pointers to alignment in ALU-based ASCII acceleration.
  • Fix errors in documentation and improve documentation.

0.3.1

  • Fix UTF-8 to UTF-16 decode for byte sequences beginning with 0xEE.
  • Make UTF-8 to UTF-8 decode SSE2-accelerated when feature simd-accel is used.
  • When decoding and encoding ASCII-only input from or to an ASCII-compatible encoding using the non-streaming API, return a borrow of the input.
  • Make encode from UTF-16 to UTF-8 faster.

0.3

  • Change the references to the instances of Encoding from const to static to make the referents unique across crates that use the references.
  • Introduce non-reference-typed FOO_INIT instances of Encoding to allow foreign crates to initialize static arrays with references to Encoding instances even under Rust's constraints that prohibit the initialization of &'static Encoding-typed array items with &'static Encoding-typed statics.
  • Document that the above two points will be reverted if Rust changes const to work so that cross-crate usage keeps the referents unique.
  • Return Cows from Rust-only non-streaming methods for encode and decode.
  • Encoding::for_bom() returns the length of the BOM.
  • ASCII-accelerated conversions for encodings other than UTF-16LE, UTF-16BE, ISO-2022-JP and x-user-defined.
  • Add SSE2 acceleration behind the simd-accel feature flag. (Requires nightly Rust.)
  • Fix panic with long bogus labels.
  • Map 0xCA to U+05BA in windows-1255. (Spec change.)
  • Correct the end of the Shift_JIS EUDC range. (Spec change.)

0.2.4

  • Polish FFI documentation.

0.2.3

  • Fix UTF-16 to UTF-8 encode.

0.2.2

  • Add Encoder.encode_from_utf8_to_vec_without_replacement().

0.2.1

  • Add Encoding.is_ascii_compatible().

  • Add Encoding::for_bom().

  • Make == for Encoding use name comparison instead of pointer comparison, because uses of the encoding constants in different crates result in different addresses and the constants cannot be turned into statics without breaking other things.

0.2.0

The initial release.

encoding_rs's People

Contributors

annevk, arnej, atouchet, burntsushi, dholbert, ede1998, fschutt, gelbpunkt, glandium, hsivonen, jyn514, lucacasonato, manishearth, qnighy, ralfjung, riking, ubnt-intrepid, workingjubilee, yoshikitakashima

encoding_rs's Issues

Allocating three times the size of the input seems excessive.

The Encoding::decode_* methods need in some cases to allocate a String, and decide how much capacity to give it. Other than *_without_replacement (2984a8b#commitcomment-20990260), this is based on Encoding::max_utf8_buffer_length which assumes the worst case. For many encodings, that’s when every byte of the input is an error that emits a three-byte U+FFFD code point.

In short, as soon as there’s an error, these methods allocate three times the size of the (remaining) input. Assuming the worst case simplifies the code, which only needs to allocate once, but it seems excessive that a single bit flip near the beginning of the input could triple memory usage.

So a more adaptive allocation scheme might be desirable, but admittedly there is no obvious answer as to what it should be.
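
The factor of three can be checked with a little arithmetic. The sketch below is an illustration only: worst_case_utf8_len is a hypothetical helper, not the crate's max_utf8_buffer_length, and the lossy decode uses the standard library's UTF-8 decoder rather than encoding_rs.

```rust
/// U+FFFD takes three bytes in UTF-8.
const REPLACEMENT_LEN: usize = 3;

/// Worst-case output length if every input byte becomes U+FFFD.
/// Returns `None` on integer overflow. (Hypothetical helper.)
fn worst_case_utf8_len(input_len: usize) -> Option<usize> {
    input_len.checked_mul(REPLACEMENT_LEN)
}

/// Returns `(input_len, output_len)` for a lossy decode of `bytes` as
/// UTF-8 with the standard library, to show the actual expansion.
fn lossy_expansion(bytes: &[u8]) -> (usize, usize) {
    (bytes.len(), String::from_utf8_lossy(bytes).len())
}
```

For an input where every byte is 0xFF the expansion reaches the full factor of three, while for an input with a single bad byte the output grows by only two bytes, which is the gap between the worst case and the common case that this issue is about.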

Add explicit `include` to `Cargo.toml`

I noticed that encoding-rs is the largest download in my crate graph at 1.4MB. While that is not a lot in relative terms, it could be reduced significantly by adding an include to the Cargo.toml and only including what is needed to build the crate, plus the licence and copyright files. I'm not too familiar with the project structure, so more might be needed, but this seems like everything.

include = [
    "COPYRIGHT",
    "LICENSE-APACHE",
    "LICENSE-MIT",
    "build.rs",
    "src/**/*",
]

implementation for io::Read/io::Write

What are your thoughts on providing implementations of the io::Read/io::Write traits as a convenience for handling stream encoding/decoding?

Here is the specific problem I'd like to solve. Simplifying, I have a function that looks like the following:

fn search<R: io::Read>(rdr: R) -> io::Result<SearchResults> { ... }

Internally, the search function limits itself to the methods of io::Read to execute a search on its contents. The search is exhaustive, but is guaranteed to use a constant amount of heap space. The search routine expects the buffer to be UTF-8 encoded (and will handle invalid UTF-8 gracefully). I'd like to use this same search routine even if the contents of rdr are, say, UTF-16. I claim that this is possible if I wrap rdr in something that satisfies io::Read but uses an encoding_rs::Decoder internally to convert UTF-16 to UTF-8. I would expect the callers of search to do that wrapping. If there's invalid UTF-16, then inserting replacement characters is OK.

Does this sound like something you'd be willing to maintain? I would be happy to take an initial crack at an implementation if so. (In fact, I must do this. The point of this issue is asking whether I should try to upstream it or not.) However, I think there are some interesting points worth mentioning. (There may be more!)

  1. Is this type of API useful in the context of the web? If not, then maybe it shouldn't live in this crate.
  2. The io::Read interface feels not-quite-right in some respects. For example, the io::Read primarily operates on a &[u8]. But if encoding_rs is used to provide an io::Read implementation, then it necessarily guarantees that all consumers of that implementation will read valid UTF-8, which means converting the &[u8] bytes to &str safely will incur an unnecessary cost. I'm not sure what to make of this and how much one might care, but it seems worth pointing out. (This particular issue isn't a problem for me, since the search routine itself handles UTF-8 implicitly.)
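
To make the proposal concrete, here is a minimal sketch of such a wrapper. It is an assumption-laden illustration, not encoding_rs code: the type name `Utf16LeToUtf8` is hypothetical, it handles only UTF-16LE, and it does the transcoding with the standard library instead of an `encoding_rs::Decoder` so that it stays self-contained. Malformed input (unpaired surrogates, a trailing odd byte) becomes U+FFFD, and the internal buffer is bounded by the fixed chunk size, so heap use stays constant.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// Hypothetical adapter: reads UTF-16LE bytes from `inner`, yields UTF-8 bytes.
pub struct Utf16LeToUtf8<R> {
    inner: R,
    odd_byte: Option<u8>, // first half of a code unit split across reads
    high: Option<u16>,    // high surrogate waiting for its low surrogate
    out: VecDeque<u8>,    // transcoded UTF-8 not yet handed to the caller
    eof: bool,
}

impl<R: Read> Utf16LeToUtf8<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, odd_byte: None, high: None, out: VecDeque::new(), eof: false }
    }

    fn push_char(&mut self, c: char) {
        let mut tmp = [0u8; 4];
        self.out.extend(c.encode_utf8(&mut tmp).as_bytes());
    }

    fn push_unit(&mut self, u: u16) {
        if let Some(high) = self.high.take() {
            if (0xDC00..=0xDFFF).contains(&u) {
                // Combine the surrogate pair into one scalar value.
                let cp = 0x10000 + ((high as u32 - 0xD800) << 10) + (u as u32 - 0xDC00);
                self.push_char(char::from_u32(cp).unwrap_or('\u{FFFD}'));
                return;
            }
            self.push_char('\u{FFFD}'); // unpaired high surrogate
        }
        match u {
            0xD800..=0xDBFF => self.high = Some(u),
            0xDC00..=0xDFFF => self.push_char('\u{FFFD}'), // unpaired low surrogate
            _ => self.push_char(char::from_u32(u as u32).unwrap_or('\u{FFFD}')),
        }
    }
}

impl<R: Read> Read for Utf16LeToUtf8<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if buf.is_empty() {
            return Ok(0);
        }
        let mut chunk = [0u8; 4096];
        while self.out.is_empty() && !self.eof {
            let n = self.inner.read(&mut chunk)?;
            if n == 0 {
                self.eof = true;
                // Flush state left over from truncated input.
                let truncated = self.odd_byte.take().is_some();
                let dangling = self.high.take().is_some();
                if truncated || dangling {
                    self.push_char('\u{FFFD}');
                }
                break;
            }
            for &b in &chunk[..n] {
                match self.odd_byte.take() {
                    Some(lo) => self.push_unit(u16::from_le_bytes([lo, b])),
                    None => self.odd_byte = Some(b),
                }
            }
        }
        let n = buf.len().min(self.out.len());
        for slot in &mut buf[..n] {
            *slot = self.out.pop_front().unwrap();
        }
        Ok(n)
    }
}
```

A caller would wrap the reader before passing it on, e.g. `search(Utf16LeToUtf8::new(file))`. Point 2 above still applies: the consumer sees `&[u8]` and cannot skip UTF-8 validation without `unsafe`.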

Consider renaming the crate

I really don't think having both encoding and encoding_rs actively maintained is a good idea at all, and there is nothing in encoding_rs' name that screams "Gecko" to me.

Encoding::decode_to_utf16 ?

I’ve just written this function:

use encoding_rs::Encoding;
use std::slice;

fn decode_to_utf16(bytes: &[u8], encoding: &'static Encoding) -> Vec<u16> {
    let mut decoder = encoding.new_decoder();
    // Worst-case number of UTF-16 code units needed for this input.
    let capacity = decoder.max_utf16_buffer_length(bytes.len()).expect("Overflow");
    let mut utf16 = Vec::with_capacity(capacity);
    // View the vector's spare capacity as the output buffer.
    let uninitialized = unsafe {
        slice::from_raw_parts_mut(utf16.as_mut_ptr(), capacity)
    };
    let last = true;
    let (_, read, written, _) = decoder.decode_to_utf16(bytes, uninitialized, last);
    assert!(read == bytes.len());
    // Only the first `written` code units were produced by the decoder.
    unsafe {
        utf16.set_len(written)
    }
    utf16
}

Do you think it would belong as a method of Encoding?

The mem module should have a Latin1 vs. bidi vs. other check

The mem module should have a SIMD-optimized check that takes a &[u16] and returns a three-way enum saying whether the content is all-Latin1, contains non-Latin1 but not bidi, or contains bidi.

There should probably be a version that takes UTF-8 input, too.

Use case: optimizing text nodes in Gecko.
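
As a rough illustration of the requested semantics, here is a scalar (non-SIMD) sketch. The names `TextClass`, `is_code_unit_bidi` and `classify_utf16` are hypothetical, and the set of right-to-left code units is an assumption: the BMP ranges and RTL high surrogates are the ones usually cited for this kind of check, and 0x200F, 0x202B, 0x202E and 0x2067 are assumed to be the four right-to-left controls. A real implementation would process several code units per step with SIMD.

```rust
/// Hypothetical three-way result of the check.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum TextClass {
    /// Every code unit is <= 0xFF.
    Latin1,
    /// Some code unit is above 0xFF, but none is right-to-left.
    LeftToRight,
    /// At least one right-to-left code unit (or RTL control) is present.
    Bidi,
}

/// Assumed set of code units that trigger bidi handling.
fn is_code_unit_bidi(u: u16) -> bool {
    matches!(
        u,
        0x0590..=0x08FF      // Hebrew, Arabic and neighbouring RTL blocks
            | 0xFB1D..=0xFDFF // Hebrew presentation forms, Arabic Presentation Forms-A
            | 0xFE70..=0xFEFE // Arabic Presentation Forms-B, excluding U+FEFF
            | 0xD802 | 0xD803 // high surrogates of supplementary RTL blocks
            | 0xD83A | 0xD83B
            | 0x200F | 0x202B | 0x202E | 0x2067 // right-to-left controls (assumed set)
    )
}

/// Classifies a UTF-16 buffer in one pass, returning early on the first
/// right-to-left code unit.
pub fn classify_utf16(buffer: &[u16]) -> TextClass {
    let mut class = TextClass::Latin1;
    for &u in buffer {
        if u > 0xFF {
            if is_code_unit_bidi(u) {
                return TextClass::Bidi;
            }
            class = TextClass::LeftToRight;
        }
    }
    class
}
```

The early return means a buffer containing bidi content is never scanned past the first offending code unit, which matches the "contains bidi" wording of the request.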

UTF-16 decoder can panic on invalid input

This program panics (playground):

use encoding_rs::CoderResult;

fn main() {
    let d = &mut encoding_rs::UTF_16BE.new_decoder_without_bom_handling();
    let b = &mut [0; 4];

    assert_eq!(
        d.decode_to_utf8(&[217, 99], b, false),
        (CoderResult::InputEmpty, 2, 0, false)
    );
    
    let _ = d.decode_to_utf8(&[217, 99], b, true);
}
thread 'main' panicked at 'index out of bounds: the len is 4 but the index is 4', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/encoding_rs-0.8.20/src/lib.rs:3884:21

This is a minimized example of a bug found by a fuzzer in another crate. Let me know if you want full details.

ppc64le not supported?

Hello,

I recently spent a few hours trying to track down why a relatively simple (though very advanced in terms of performance) Python package was unable to build on my POWER8 machine (ppc64le). If you're curious, the package is orjson

From what I can tell, the problem is a combination of two things-

  1. encoding_rs relies on rust simd to provide a compatibility layer/abstraction for simd instructions
  2. rust simd offers no fallback mode: it explicitly supports x86, x86_64 and aarch64, and if you have any other architecture, you're out of luck

In my view, the proper way to address this (if someone was willing to commit time to do so) would be to fix it downstream in the library doing the abstraction (simd) but is there any interest/appetite for addressing this on the encoding_rs side by making use of simd optional?

Today is my first experience with the rust toolchain and rust packages, so please forgive me if I'm getting something wrong here. I'm confident I have the general issue correct, though it's possible I'm missing some obvious way to "turn off" the dependency on simd via some build flag in encoding_rs

Thanks!

API review for decode_to_utf8/str

There is currently a naming asymmetry between decode_to_utf8() & decode_to_str() and encode_from_utf8(). The latter takes &str while decode_to_utf8() takes &mut [u8], and the &mut str version has str in place of utf8 in the name.

It's also unclear if the current formulation of decode_to_utf8() has Rust use cases that justify it being offered in the Rust API.

Realistic usage from Rust needs to be understood better and the naming of these as well as the Rust visibility of decode_to_utf8() should probably be adjusted.

Potential Unsound: 1 out-of-bound read and 5 unaligned memory access.

Hello.

I'm Yoshiki, a PhD student at CMU.

We are testing a tool to automatically generate test cases from API data and existing tests.

A few of our generated test cases were reported as "unsound" by Miri, mostly due to unaligned or out-of-bound memory. I've attached a Tarball that contains the test cases that induce this behavior.

Please note that, because the framework leverages existing tests as templates, some of the test cases overlap with existing test cases for the library. In particular,

decode(BIG5, b"", &"");//LAYER:0

also shows up in the manually written test cases.

In case this is intended behavior, or you would prefer if I focused on other parts of the code, please let me know.

Thanks.
~Yoshiki

Re-add license field to Cargo.toml

This was removed in 3a4033e#diff-2e9d962a08321605940b5a657135052fbcef87b5e360662bb527c96d9a615542 and causes automated tooling like cargo deny to fail detecting the license.

It should probably be something like (Apache-2.0 OR MIT) AND BSD-3 but I'm not sure the expression syntax allows parentheses. If it doesn't then we have a problem and you might want to reconsider if dual-licensing warrants the increased license complexity here. Having to worry about 3 different licenses for a single crate is a bit suboptimal, even if MIT and BSD-3 are approximately the same.

set_len on a Vec<u8> of uninit is UB

encoding_rs currently has UB in the form of creating uninitialized u8's via set_len
Here are 2 examples where the UB is crystal clear:

encoding_rs/src/mem.rs

Lines 2007 to 2010 in dd9d99b

let mut vec = Vec::with_capacity(capacity);
unsafe {
    vec.set_len(capacity);
}

encoding_rs/src/mem.rs

Lines 2044 to 2047 in dd9d99b

let mut vec = Vec::with_capacity(capacity);
unsafe {
    vec.set_len(capacity);
}

set_len is also used in 7 functions in lib.rs, but I haven't looked at them very closely.

The docs for set_len explicitly say https://doc.rust-lang.org/std/vec/struct.Vec.html#method.set_len :

The elements at old_len..new_len must be initialized.

Some relevant discussion can be found here rust-lang/unsafe-code-guidelines#71

rustc itself has a lint specifically for this kind of thing: rust-lang/rust#75968

Using MaybeUninit::uninit().assume_init() is instant UB unless the target type is itself composed entirely of MaybeUninit

My understanding is that this is currently considered UB, but the rule may be relaxed in the future to allow types for which all bit patterns are valid to hold uninitialized values as long as they are not read.
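
For comparison, here is a minimal sketch of two ways to get the same effect without calling set_len over uninitialized elements. The converter functions `upper_into` and `upper_into_uninit` are hypothetical stand-ins for a conversion routine and are not encoding_rs code. The first variant pays for zero-initialization; the second keeps the memory uninitialized behind `MaybeUninit` and only calls set_len for the elements that were actually written.

```rust
use std::mem::MaybeUninit;

/// Hypothetical converter writing into initialized storage; returns bytes written.
fn upper_into(src: &[u8], dst: &mut [u8]) -> usize {
    let n = src.len().min(dst.len());
    for (d, s) in dst[..n].iter_mut().zip(src) {
        *d = s.to_ascii_uppercase();
    }
    n
}

/// The same converter, writing into possibly-uninitialized storage.
fn upper_into_uninit(src: &[u8], dst: &mut [MaybeUninit<u8>]) -> usize {
    let n = src.len().min(dst.len());
    for (d, s) in dst[..n].iter_mut().zip(src) {
        d.write(s.to_ascii_uppercase());
    }
    n
}

/// Variant 1: zero-initialize, convert, then shrink to what was written.
pub fn convert_zeroed(src: &[u8]) -> Vec<u8> {
    let mut vec = vec![0u8; src.len()];
    let written = upper_into(src, &mut vec);
    vec.truncate(written);
    vec
}

/// Variant 2: write through the spare capacity, then set the length.
pub fn convert_spare(src: &[u8]) -> Vec<u8> {
    let mut vec: Vec<u8> = Vec::with_capacity(src.len());
    let written = upper_into_uninit(src, vec.spare_capacity_mut());
    // SAFETY: `written` is at most the capacity, and the first `written`
    // elements were initialized by `upper_into_uninit`.
    unsafe { vec.set_len(written) };
    vec
}
```

Variant 2 requires the conversion routine to accept `&mut [MaybeUninit<u8>]`, which is an API change for any function that currently takes `&mut [u8]`; variant 1 is a drop-in change at the cost of a memset.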

Feature Request: Support IBM OEM code pages (e.g. CP437)

IBM OEM code pages live on today, for example in file names stored in zip archives.
The OEM code pages for East and Southeast Asian languages are the same as the ANSI code pages, which this library already includes, but the OEM code pages for other languages, including European ones, are not included.

Code pages list (OEM codepages are included): https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers?source=docs
Characters list (CP437; replace 437 for other code pages): https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT

0.8.29 should be a major bump

Commit cecda92 introduces an alloc feature and adds it to the defaults. Unfortunately, this breaks semver, in particular for crates that use the Cargo parameter default-features = false. As an example, you can easily reproduce it with a rust-fontconfig = "0.1.5" dependency, whose build output is:

error[E0599]: no method named `decode_to_string_without_replacement` found for struct `Decoder` in the current scope
  --> /home/kitsu/.cargo/registry/src/github.com-1ecc6299db9ec823/allsorts_no_std-0.5.2/src/get_name.rs:94:36
   |
94 |         let (res, _read) = decoder.decode_to_string_without_replacement(data, &mut s, true);
   |                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: there is an associated function with a similar name: `decode_to_str_without_replacement`

error: aborting due to previous error

and manually downgrading encoding_rs = "0.8.28" in Cargo.lock solves the problem.

Don't include test_data in published packages

The test data is only needed when running tests, and when people get the package off of crates.io they just want to depend on encoding_rs, they don't want to run its tests. Removing it would reduce the size of the resulting .crate.

Add BOM sniffing functionality

Decoder should, by default, sniff the BOM and swap its VariantEncoder if needed. There should be a way to opt out of BOM sniffing and there should be a static method on Encoding for performing isolated BOM sniffing.
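
A minimal sketch of what the isolated sniffing step could look like, using only the standard library. The names `BomEncoding` and `sniff_bom` are hypothetical and not the crate's API; the function only recognizes the UTF-8, UTF-16LE and UTF-16BE byte order marks and reports how many bytes the BOM occupies so that the caller can skip it.

```rust
/// Hypothetical result type: which encoding the BOM indicates.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum BomEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the encoding indicated by a leading BOM and the BOM's length
/// in bytes, or `None` if the input does not start with a BOM.
pub fn sniff_bom(bytes: &[u8]) -> Option<(BomEncoding, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((BomEncoding::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((BomEncoding::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((BomEncoding::Utf16Be, 2))
    } else {
        None
    }
}
```

A streaming decoder cannot call this directly on the first chunk, because the BOM may be split across chunks; it has to buffer up to three bytes before deciding.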

Convert generating_encoding_data.py to a Rust build script

This is basically the entire purpose of build scripts. From the book:

Some example use cases of the build command are: […]

  • Generating a Rust module from a specification.

Would it be better to include an Encoding Standard .json in the codebase, and use a build script to generate data.rs, instead of having a Python script doing that job? (I don't know how powerful git submodules are, but it might be possible to get them to automatically update the JSON too.)
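
To sketch what the generation step of such a build script might involve: the helpers below render a table of code units as Rust source and write it to an output directory. The function names, the table name and the output layout are hypothetical, and parsing the Encoding Standard JSON is left out because it would need a JSON parser as a build dependency.

```rust
use std::fmt::Write as _;
use std::fs;
use std::io;
use std::path::Path;

/// Renders a `pub static` array of code units as Rust source text.
fn render_table(name: &str, values: &[u16]) -> String {
    let mut src = String::new();
    writeln!(src, "pub static {}: [u16; {}] = [", name, values.len()).unwrap();
    for v in values {
        writeln!(src, "    0x{:04X},", v).unwrap();
    }
    src.push_str("];\n");
    src
}

/// Writes all tables into `data.rs` inside `out_dir`. In a build script,
/// `out_dir` would come from the `OUT_DIR` environment variable set by Cargo.
fn write_generated(out_dir: &Path, tables: &[(&str, &[u16])]) -> io::Result<()> {
    let mut module = String::from("// Generated file; do not edit.\n");
    for (name, values) in tables {
        module.push_str(&render_table(name, values));
    }
    fs::write(out_dir.join("data.rs"), module)
}
```

In a build.rs, `main` would read `OUT_DIR` with `std::env::var`, call `write_generated`, and the crate would pull the result in with `include!(concat!(env!("OUT_DIR"), "/data.rs"))`. One trade-off is that every downstream build then pays for the generation step, whereas a checked-in data.rs does not.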

Version 0.8.8 is missing

There is a version 0.8.8 on crates.io, which doesn't correspond to current master. The diff against master looks like (excluding Cargo.toml):

diff --git a/src/lib.rs b/src/lib.rs
index 2056e91..48bdee7 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -8,7 +8,7 @@
 // except according to those terms.
 
 #![cfg_attr(feature = "cargo-clippy", allow(doc_markdown, inline_always, new_ret_no_self))]
-#![doc(html_root_url = "https://docs.rs/encoding_rs/0.8.7")]
+#![doc(html_root_url = "https://docs.rs/encoding_rs/0.8.8")]
 
 //! encoding_rs is a Gecko-oriented Free Software / Open Source implementation
 //! of the [Encoding Standard](https://encoding.spec.whatwg.org/) in Rust.
diff --git a/src/mem.rs b/src/mem.rs
index 7c6e302..b3a448a 100644
--- a/src/mem.rs
+++ b/src/mem.rs
@@ -675,7 +675,8 @@ pub fn is_utf16_latin1(buffer: &[u16]) -> bool {
 /// the four RIGHT-TO-LEFT FOO controls in General Punctuation are checked
 /// for. Control characters that are technically bidi controls but do not
 /// cause right-to-left behavior without the presence of right-to-left
-/// characters or right-to-left controls are not checked for.
+/// characters or right-to-left controls are not checked for. As a special
+/// case, U+FEFF is excluded from Arabic Presentation Forms-B.
 ///
 /// Returns `true` if the input is invalid UTF-8 or the input contains an
 /// RTL character. Returns `false` if the input is valid UTF-8 and contains
@@ -708,8 +709,8 @@ pub fn is_utf8_bidi(buffer: &[u8]) -> bool {
     //
     // U+FE6F: EF B9 AF
     // U+FE70: EF B9 B0
+    // U+FEFE: EF BB BE
     // U+FEFF: EF BB BF
-    // U+FF00: EF BC 80
     //
     // U+107FF: F0 90 9F BF
     // U+10800: F0 90 A0 80
@@ -812,6 +813,10 @@ pub fn is_utf8_bidi(buffer: &[u8]) -> bool {
                                     if third > 0xAF {
                                         return true;
                                     }
+                                } else if second == 0xBB {
+                                    if third != 0xBF {
+                                        return true;
+                                    }
                                 } else {
                                     return true;
                                 }
@@ -1028,6 +1033,10 @@ pub fn is_utf8_bidi(buffer: &[u8]) -> bool {
                             if third > 0xAF {
                                 return true;
                             }
+                        } else if second == 0xBB {
+                            if third != 0xBF {
+                                return true;
+                            }
                         } else {
                             return true;
                         }
@@ -1091,7 +1100,8 @@ pub fn is_utf8_bidi(buffer: &[u8]) -> bool {
 /// the four RIGHT-TO-LEFT FOO controls in General Punctuation are checked
 /// for. Control characters that are technically bidi controls but do not
 /// cause right-to-left behavior without the presence of right-to-left
-/// characters or right-to-left controls are not checked for.
+/// characters or right-to-left controls are not checked for. As a special
+/// case, U+FEFF is excluded from Arabic Presentation Forms-B.
 #[inline]
 pub fn is_str_bidi(buffer: &str) -> bool {
     // U+058F: D6 8F
@@ -1111,8 +1121,8 @@ pub fn is_str_bidi(buffer: &str) -> bool {
     //
     // U+FE6F: EF B9 AF
     // U+FE70: EF B9 B0
+    // U+FEFE: EF BB BE
     // U+FEFF: EF BB BF
-    // U+FF00: EF BC 80
     //
     // U+107FF: F0 90 9F BF
     // U+10800: F0 90 A0 80
@@ -1197,6 +1207,11 @@ pub fn is_str_bidi(buffer: &str) -> bool {
                                     if third > 0xAF {
                                         return true;
                                     }
+                                } else if second == 0xBB {
+                                    let third = bytes[read + 2];
+                                    if third != 0xBF {
+                                        return true;
+                                    }
                                 } else {
                                     return true;
                                 }
@@ -1240,7 +1255,8 @@ pub fn is_str_bidi(buffer: &str) -> bool {
 /// the four RIGHT-TO-LEFT FOO controls in General Punctuation are checked
 /// for. Control characters that are technically bidi controls but do not
 /// cause right-to-left behavior without the presence of right-to-left
-/// characters or right-to-left controls are not checked for.
+/// characters or right-to-left controls are not checked for. As a special
+/// case, U+FEFF is excluded from Arabic Presentation Forms-B.
 ///
 /// Returns `true` if the input contains an RTL character or an unpaired
 /// high surrogate that could be the high half of an RTL character.
@@ -1260,7 +1276,8 @@ pub fn is_utf16_bidi(buffer: &[u16]) -> bool {
 /// the four RIGHT-TO-LEFT FOO controls in General Punctuation are checked
 /// for. Control characters that are technically bidi controls but do not
 /// cause right-to-left behavior without the presence of right-to-left
-/// characters or right-to-left controls are not checked for.
+/// characters or right-to-left controls are not checked for. As a special
+/// case, U+FEFF is excluded from Arabic Presentation Forms-B.
 #[inline(always)]
 pub fn is_char_bidi(c: char) -> bool {
     // Controls:
@@ -1274,8 +1291,9 @@ pub fn is_char_bidi(c: char) -> bool {
     // BMP RTL:
     // https://www.unicode.org/roadmaps/bmp/
     // U+0590...U+08FF
-    // U+FB50...U+FDFF Arabic Presentation Forms A
-    // U+FE70...U+FEFF Arabic Presentation Forms B
+    // U+FB1D...U+FDFF Hebrew presentation forms and
+    //                 Arabic Presentation Forms A
+    // U+FE70...U+FEFE Arabic Presentation Forms B (excl. BOM)
     //
     // Supplementary RTL:
     // https://www.unicode.org/roadmaps/smp/
@@ -1305,8 +1323,8 @@ pub fn is_char_bidi(c: char) -> bool {
         // Between astral RTL blocks
         return false;
     }
-    if in_range32(code_point, 0xFF00, 0x10800) {
-        // Above Arabic Presentations Forms B and below first
+    if in_range32(code_point, 0xFEFF, 0x10800) {
+        // Above Arabic Presentations Forms B (excl. BOM) and below first
         // astral RTL
         return false;
     }
@@ -1326,7 +1344,8 @@ pub fn is_char_bidi(c: char) -> bool {
 /// the four RIGHT-TO-LEFT FOO controls in General Punctuation are checked
 /// for. Control characters that are technically bidi controls but do not
 /// cause right-to-left behavior without the presence of right-to-left
-/// characters or right-to-left controls are not checked for.
+/// characters or right-to-left controls are not checked for. As a special
+/// case, U+FEFF is excluded from Arabic Presentation Forms-B.
 ///
 /// Since supplementary-plane right-to-left blocks are identifiable from the
 /// high surrogate without examining the low surrogate, this function returns
@@ -1357,8 +1376,8 @@ pub fn is_utf16_code_unit_bidi(u: u16) -> bool {
         // Between RTL high surragates
         return false;
     }
-    if u > 0xFEFF {
-        // Above Arabic Presentation Forms
+    if u > 0xFEFE {
+        // Above Arabic Presentation Forms (excl. BOM)
         return false;
     }
     if in_range16(u, 0xFE00, 0xFE70) {
@@ -2380,13 +2399,14 @@ mod tests {
         assert!(!is_char_bidi('\u{1F4A9}'));
         assert!(!is_char_bidi('\u{FE00}'));
         assert!(!is_char_bidi('\u{202C}'));
+        assert!(!is_char_bidi('\u{FEFF}'));
         assert!(is_char_bidi('\u{0590}'));
         assert!(is_char_bidi('\u{08FF}'));
         assert!(is_char_bidi('\u{061C}'));
         assert!(is_char_bidi('\u{FB50}'));
         assert!(is_char_bidi('\u{FDFF}'));
         assert!(is_char_bidi('\u{FE70}'));
-        assert!(is_char_bidi('\u{FEFF}'));
+        assert!(is_char_bidi('\u{FEFE}'));
         assert!(is_char_bidi('\u{200F}'));
         assert!(is_char_bidi('\u{202B}'));
         assert!(is_char_bidi('\u{202E}'));
@@ -2405,6 +2425,7 @@ mod tests {
         assert!(!is_utf16_code_unit_bidi(0xD801));
         assert!(!is_utf16_code_unit_bidi(0xFE00));
         assert!(!is_utf16_code_unit_bidi(0x202C));
+        assert!(!is_utf16_code_unit_bidi(0xFEFF));
         assert!(is_utf16_code_unit_bidi(0x0590));
         assert!(is_utf16_code_unit_bidi(0x08FF));
         assert!(is_utf16_code_unit_bidi(0x061C));
@@ -2412,7 +2433,7 @@ mod tests {
         assert!(is_utf16_code_unit_bidi(0xFB50));
         assert!(is_utf16_code_unit_bidi(0xFDFF));
         assert!(is_utf16_code_unit_bidi(0xFE70));
-        assert!(is_utf16_code_unit_bidi(0xFEFF));
+        assert!(is_utf16_code_unit_bidi(0xFEFE));
         assert!(is_utf16_code_unit_bidi(0x200F));
         assert!(is_utf16_code_unit_bidi(0x202B));
         assert!(is_utf16_code_unit_bidi(0x202E));
@@ -2431,13 +2452,14 @@ mod tests {
         assert!(!is_str_bidi("abcdefghijklmnop\u{1F4A9}abcdefghijklmnop"));
         assert!(!is_str_bidi("abcdefghijklmnop\u{FE00}abcdefghijklmnop"));
         assert!(!is_str_bidi("abcdefghijklmnop\u{202C}abcdefghijklmnop"));
+        assert!(!is_str_bidi("abcdefghijklmnop\u{FEFF}abcdefghijklmnop"));
         assert!(is_str_bidi("abcdefghijklmnop\u{0590}abcdefghijklmnop"));
         assert!(is_str_bidi("abcdefghijklmnop\u{08FF}abcdefghijklmnop"));
         assert!(is_str_bidi("abcdefghijklmnop\u{061C}abcdefghijklmnop"));
         assert!(is_str_bidi("abcdefghijklmnop\u{FB50}abcdefghijklmnop"));
         assert!(is_str_bidi("abcdefghijklmnop\u{FDFF}abcdefghijklmnop"));
         assert!(is_str_bidi("abcdefghijklmnop\u{FE70}abcdefghijklmnop"));
-        assert!(is_str_bidi("abcdefghijklmnop\u{FEFF}abcdefghijklmnop"));
+        assert!(is_str_bidi("abcdefghijklmnop\u{FEFE}abcdefghijklmnop"));
         assert!(is_str_bidi("abcdefghijklmnop\u{200F}abcdefghijklmnop"));
         assert!(is_str_bidi("abcdefghijklmnop\u{202B}abcdefghijklmnop"));
         assert!(is_str_bidi("abcdefghijklmnop\u{202E}abcdefghijklmnop"));
@@ -2468,6 +2490,9 @@ mod tests {
         assert!(!is_utf8_bidi(
             "abcdefghijklmnop\u{202C}abcdefghijklmnop".as_bytes()
         ));
+        assert!(!is_utf8_bidi(
+            "abcdefghijklmnop\u{FEFF}abcdefghijklmnop".as_bytes()
+        ));
         assert!(is_utf8_bidi(
             "abcdefghijklmnop\u{0590}abcdefghijklmnop".as_bytes()
         ));
@@ -2487,7 +2512,7 @@ mod tests {
             "abcdefghijklmnop\u{FE70}abcdefghijklmnop".as_bytes()
         ));
         assert!(is_utf8_bidi(
-            "abcdefghijklmnop\u{FEFF}abcdefghijklmnop".as_bytes()
+            "abcdefghijklmnop\u{FEFE}abcdefghijklmnop".as_bytes()
         ));
         assert!(is_utf8_bidi(
             "abcdefghijklmnop\u{200F}abcdefghijklmnop".as_bytes()
@@ -2541,6 +2566,10 @@ mod tests {
             0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x202C, 0x62, 0x63, 0x64, 0x65, 0x66,
             0x67, 0x68, 0x69,
         ]));
+        assert!(!is_utf16_bidi(&[
+            0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0xFEFF, 0x62, 0x63, 0x64, 0x65, 0x66,
+            0x67, 0x68, 0x69,
+        ]));
         assert!(is_utf16_bidi(&[
             0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x0590, 0x62, 0x63, 0x64, 0x65, 0x66,
             0x67, 0x68, 0x69,
@@ -2570,7 +2599,7 @@ mod tests {
             0x67, 0x68, 0x69,
         ]));
         assert!(is_utf16_bidi(&[
-            0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0xFEFF, 0x62, 0x63, 0x64, 0x65, 0x66,
+            0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0xFEFE, 0x62, 0x63, 0x64, 0x65, 0x66,
             0x67, 0x68, 0x69,
         ]));
         assert!(is_utf16_bidi(&[
@@ -2638,6 +2667,10 @@ mod tests {
             check_str_for_latin1_and_bidi("abcdefghijklmnop\u{202C}abcdefghijklmnop"),
             Latin1Bidi::Bidi
         );
+        assert_ne!(
+            check_str_for_latin1_and_bidi("abcdefghijklmnop\u{FEFF}abcdefghijklmnop"),
+            Latin1Bidi::Bidi
+        );
         assert_eq!(
             check_str_for_latin1_and_bidi("abcdefghijklmnop\u{0590}abcdefghijklmnop"),
             Latin1Bidi::Bidi
@@ -2663,7 +2696,7 @@ mod tests {
             Latin1Bidi::Bidi
         );
         assert_eq!(
-            check_str_for_latin1_and_bidi("abcdefghijklmnop\u{FEFF}abcdefghijklmnop"),
+            check_str_for_latin1_and_bidi("abcdefghijklmnop\u{FEFE}abcdefghijklmnop"),
             Latin1Bidi::Bidi
         );
         assert_eq!(
@@ -2726,6 +2759,10 @@ mod tests {
             check_utf8_for_latin1_and_bidi("abcdefghijklmnop\u{202C}abcdefghijklmnop".as_bytes()),
             Latin1Bidi::Bidi
         );
+        assert_ne!(
+            check_utf8_for_latin1_and_bidi("abcdefghijklmnop\u{FEFF}abcdefghijklmnop".as_bytes()),
+            Latin1Bidi::Bidi
+        );
         assert_eq!(
             check_utf8_for_latin1_and_bidi("abcdefghijklmnop\u{0590}abcdefghijklmnop".as_bytes()),
             Latin1Bidi::Bidi
@@ -2751,7 +2788,7 @@ mod tests {
             Latin1Bidi::Bidi
         );
         assert_eq!(
-            check_utf8_for_latin1_and_bidi("abcdefghijklmnop\u{FEFF}abcdefghijklmnop".as_bytes()),
+            check_utf8_for_latin1_and_bidi("abcdefghijklmnop\u{FEFE}abcdefghijklmnop".as_bytes()),
             Latin1Bidi::Bidi
         );
         assert_eq!(
@@ -2832,6 +2869,13 @@ mod tests {
             ]),
             Latin1Bidi::Bidi
         );
+        assert_ne!(
+            check_utf16_for_latin1_and_bidi(&[
+                0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0xFEFF, 0x62, 0x63, 0x64, 0x65,
+                0x66, 0x67, 0x68, 0x69,
+            ]),
+            Latin1Bidi::Bidi
+        );
         assert_eq!(
             check_utf16_for_latin1_and_bidi(&[
                 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x0590, 0x62, 0x63, 0x64, 0x65,
@@ -2883,7 +2927,7 @@ mod tests {
         );
         assert_eq!(
             check_utf16_for_latin1_and_bidi(&[
-                0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0xFEFF, 0x62, 0x63, 0x64, 0x65,
+                0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0xFEFE, 0x62, 0x63, 0x64, 0x65,
                 0x66, 0x67, 0x68, 0x69,
             ]),
             Latin1Bidi::Bidi
@@ -2959,7 +3003,7 @@ mod tests {
         match c {
             '\u{0590}'...'\u{08FF}'
             | '\u{FB1D}'...'\u{FDFF}'
-            | '\u{FE70}'...'\u{FEFF}'
+            | '\u{FE70}'...'\u{FEFE}'
             | '\u{10800}'...'\u{10FFF}'
             | '\u{1E800}'...'\u{1EFFF}'
             | '\u{200F}'
@@ -2975,7 +3019,7 @@ mod tests {
         match u {
             0x0590...0x08FF
             | 0xFB1D...0xFDFF
-            | 0xFE70...0xFEFF
+            | 0xFE70...0xFEFE
             | 0xD802
             | 0xD803
             | 0xD83A
@@ -3071,6 +3115,19 @@ mod tests {
         }
     }
 
+    #[test]
+    fn test_is_utf16_bidi_thoroughly() {
+        let mut buf = [0; 32];
+        for i in 0..0x10000u32 {
+            let u = i as u16;
+            buf[15] = u;
+            assert_eq!(
+                is_utf16_bidi(&buf[..]),
+                reference_is_utf16_code_unit_bidi(u)
+            );
+        }
+    }
+
     #[test]
     fn test_is_utf8_bidi_edge_cases() {
         assert!(!is_utf8_bidi(b"\xD5\xBF\x61"));
diff --git a/src/simd_funcs.rs b/src/simd_funcs.rs
index 198ecf6..8974a16 100644
--- a/src/simd_funcs.rs
+++ b/src/simd_funcs.rs
@@ -279,7 +279,7 @@ pub fn is_u16x8_bidi(s: u16x8) -> bool {
 
     (in_range16x8!(s, 0x0590, 0x0900)
         | in_range16x8!(s, 0xFB1D, 0xFE00)
-        | in_range16x8!(s, 0xFE70, 0xFF00)
+        | in_range16x8!(s, 0xFE70, 0xFEFF)
         | in_range16x8!(s, 0xD802, 0xD804)
         | in_range16x8!(s, 0xD83A, 0xD83C)
         | s.eq(u16x8::splat(0x200F))

Compilation issues under 1.43.0 nightly

I've encountered a compilation issue when building under 1.43.0 nightly of the rust toolchain. I noticed the problem when building the dependent orjson which uses the nightly toolchain for compilation.

I don't know much about rust, but it seems that an error occurs within a macro and the rust compiler subsequently panics.

I was able to reproduce the issue with the following commands, the features are the ones used by orjson. I'm filing this issue here, as I don't quite understand what is happening with regards to macros, user code and compiler code.

$ docker run --rm -it --entrypoint /bin/bash konstin2/maturin:master
(docker) $ git clone https://github.com/hsivonen/encoding_rs.git
(docker) $ cd encoding_rs/
(docker) $ git checkout v0.8.22
(docker) $ echo nightly > rust-toolchain
(docker) $ cargo --version
cargo 1.43.0-nightly (e02974078 2020-02-18)
(docker) $ rustc --version
rustc 1.43.0-nightly (7760cd0fb 2020-02-19)
(docker) $ RUST_BACKTRACE=full cargo build --features simd-accel --no-default-features
...

--verbose does not give much more information. -Z macro-backtrace does not seem to be a valid flag.

cargo build output
info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'
info: latest update on 2020-02-20, rust version 1.43.0-nightly (7760cd0fb 2020-02-19)
info: downloading component 'cargo'
info: downloading component 'clippy'
info: downloading component 'rust-docs'
info: downloading component 'rust-std'
info: downloading component 'rustc'
info: downloading component 'rustfmt'
info: installing component 'cargo'
info: installing component 'clippy'
info: installing component 'rust-docs'
info: installing component 'rust-std'
info: installing component 'rustc'
info: installing component 'rustfmt'
    Updating crates.io index
  Downloaded packed_simd v0.3.3
   Compiling packed_simd v0.3.3
   Compiling encoding_rs v0.8.22 (/io/encoding_rs)
   Compiling cfg-if v0.1.10
warning: unused label
   --> src/macros.rs:878:41
    |
878 |   ...                   'innermost: loop {
    |                         ^^^^^^^^^^
    | 
   ::: src/euc_jp.rs:77:5
    |
77  | /     euc_jp_decoder_functions!(
78  | |         {
79  | |             let trail_minus_offset = byte.wrapping_sub(0xA1);
80  | |             // Fast-track Hiragana (60% according to Lunde)
...   |
220 | |         handle
221 | |     );
    | |______- in this macro invocation
    |
    = note: `#[warn(unused_labels)]` on by default
    = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

warning: unused label
   --> src/macros.rs:878:41
    |
878 |   ...                   'innermost: loop {
    |                         ^^^^^^^^^^
    | 
   ::: src/euc_jp.rs:77:5
    |
77  | /     euc_jp_decoder_functions!(
78  | |         {
79  | |             let trail_minus_offset = byte.wrapping_sub(0xA1);
80  | |             // Fast-track Hiragana (60% according to Lunde)
...   |
220 | |         handle
221 | |     );
    | |______- in this macro invocation
    |
    = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

warning: unused label
   --> src/macros.rs:574:41
    |
574 |   ...                   'innermost: loop {
    |                         ^^^^^^^^^^
    | 
   ::: src/gb18030.rs:111:5
    |
111 | /     gb18030_decoder_functions!(
112 | |         {
113 | |             // If first is between 0x81 and 0xFE, inclusive,
114 | |             // subtract offset 0x81.
...   |
294 | |         handle,
295 | |         'outermost);
    | |____________________- in this macro invocation
    |
    = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

warning: unused label
   --> src/macros.rs:574:41
    |
574 |   ...                   'innermost: loop {
    |                         ^^^^^^^^^^
    | 
   ::: src/gb18030.rs:111:5
    |
111 | /     gb18030_decoder_functions!(
112 | |         {
113 | |             // If first is between 0x81 and 0xFE, inclusive,
114 | |             // subtract offset 0x81.
...   |
294 | |         handle,
295 | |         'outermost);
    | |____________________- in this macro invocation
    |
    = note: this warning originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

warning: unused label
   --> src/mem.rs:279:17
    |
279 |                 'inner: loop {
    |                 ^^^^^^

warning: `...` range patterns are deprecated
   --> src/mem.rs:743:26
    |
743 |                         0...0x7F => {
    |                          ^^^ help: use `..=` for an inclusive range
    |
    = note: `#[warn(ellipsis_inclusive_range_patterns)]` on by default

warning: `...` range patterns are deprecated
   --> src/mem.rs:749:29
    |
749 |                         0xC2...0xD5 => {
    |                             ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
   --> src/mem.rs:770:36
    |
770 |                         0xE1 | 0xE3...0xEC | 0xEE => {
    |                                    ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
   --> src/mem.rs:879:29
    |
879 |                         0xF1...0xF4 => {
    |                             ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
   --> src/mem.rs:942:18
    |
942 |                 0...0x7F => {
    |                  ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
   --> src/mem.rs:948:21
    |
948 |                 0xC2...0xD5 => {
    |                     ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
   --> src/mem.rs:985:28
    |
985 |                 0xE1 | 0xE3...0xEC | 0xEE => {
    |                            ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
    --> src/lib.rs:2686:29
     |
2686 |                         b'A'...b'Z' => {
     |                             ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
    --> src/lib.rs:2691:29
     |
2691 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
     |                             ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
    --> src/lib.rs:2691:43
     |
2691 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
     |                                           ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
    --> src/lib.rs:2714:29
     |
2714 |                         b'A'...b'Z' => {
     |                             ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
    --> src/lib.rs:2723:29
     |
2723 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
     |                             ^^^ help: use `..=` for an inclusive range

warning: `...` range patterns are deprecated
    --> src/lib.rs:2723:43
     |
2723 |                         b'a'...b'z' | b'0'...b'9' | b'-' | b'_' | b':' | b'.' => {
     |                                           ^^^ help: use `..=` for an inclusive range

warning: use of deprecated item 'std::mem::uninitialized': use `mem::MaybeUninit` instead
  --> src/simd_funcs.rs:19:20
   |
19 |     let mut simd = ::std::mem::uninitialized();
   |                    ^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(deprecated)]` on by default

warning: use of deprecated item 'std::mem::uninitialized': use `mem::MaybeUninit` instead
  --> src/simd_funcs.rs:43:20
   |
43 |     let mut simd = ::std::mem::uninitialized();
   |                    ^^^^^^^^^^^^^^^^^^^^^^^^^

warning: use of deprecated item 'std::mem::uninitialized': use `mem::MaybeUninit` instead
   --> src/handles.rs:113:30
    |
113 |             let mut u: u16 = ::std::mem::uninitialized();
    |                              ^^^^^^^^^^^^^^^^^^^^^^^^^

warning: unnecessary `unsafe` block
  --> src/utf_8.rs:91:12
   |
91 |         if unsafe { likely(read + 4 <= src.len()) } {
   |            ^^^^^^ unnecessary `unsafe` block
   |
   = note: `#[warn(unused_unsafe)]` on by default

warning: unnecessary `unsafe` block
  --> src/utf_8.rs:98:20
   |
98 |                 if unsafe { likely(in_inclusive_range8(byte, 0xC2, 0xDF)) } {
   |                    ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:107:24
    |
107 |                     if unsafe { likely(read + 4 <= src.len()) } {
    |                        ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:117:20
    |
117 |                 if unsafe { likely(byte < 0xF0) } {
    |                    ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:132:28
    |
132 |                         if unsafe { likely(read + 4 <= src.len()) } {
    |                            ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:137:32
    |
137 | ...                   if unsafe { likely(byte < 0x80) } {
    |                          ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:162:20
    |
162 |                 if unsafe { likely(read + 4 <= src.len()) } {
    |                    ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:261:12
    |
261 |         if unsafe { likely(read + 4 <= src.len()) } {
    |            ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:271:20
    |
271 |                 if unsafe { likely(in_inclusive_range8(byte, 0xC2, 0xDF)) } {
    |                    ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:288:24
    |
288 |                     if unsafe { likely(read + 4 <= src.len()) } {
    |                        ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:300:20
    |
300 |                 if unsafe { likely(byte < 0xF0) } {
    |                    ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:323:28
    |
323 |                         if unsafe { likely(read + 4 <= src.len()) } {
    |                            ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:328:32
    |
328 | ...                   if unsafe { likely(byte < 0x80) } {
    |                          ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:370:20
    |
370 |                 if unsafe { likely(read + 4 <= src.len()) } {
    |                    ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:657:20
    |
657 |                 if unsafe { likely(unit_minus_surrogate_start > (0xDFFF - 0xD800)) } {
    |                    ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:668:20
    |
668 |                 if unsafe { likely(unit_minus_surrogate_start <= (0xDBFF - 0xD800)) } {
    |                    ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:687:24
    |
687 |                     if unsafe { likely(second_minus_low_surrogate_start <= (0xDFFF - 0xDC00)) } {
    |                        ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/utf_8.rs:729:16
    |
729 |             if unsafe { unlikely(unit < 0x80) } {
    |                ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
   --> src/mem.rs:913:32
    |
913 | ...                   if unsafe { unlikely(second == 0x90 || second == 0x9E) } {
    |                          ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
    --> src/mem.rs:1171:28
     |
1171 |                         if unsafe { unlikely(byte >= 0xD6) } {
     |                            ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
    --> src/mem.rs:1195:24
     |
1195 |                     if unsafe { unlikely(!in_inclusive_range8(byte, 0xE3, 0xEE) && byte != 0xE1) } {
     |                        ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
    --> src/mem.rs:1244:24
     |
1244 |                     if unsafe { unlikely(byte == 0xF0 && (second == 0x90 || second == 0x9E)) } {
     |                        ^^^^^^ unnecessary `unsafe` block

warning: unnecessary `unsafe` block
    --> src/mem.rs:1658:8
     |
1658 |     if unsafe { likely(read == src.len()) } {
     |        ^^^^^^ unnecessary `unsafe` block

error: internal compiler error: src/librustc_codegen_ssa/mir/block.rs:622: shuffle indices must be constant
   --> src/simd_funcs.rs:289:28
    |
289 |           let first: u8x16 = shuffle!(
    |  ____________________________^
290 | |             s,
291 | |             u8x16::splat(0),
292 | |             [0, 16, 1, 17, 2, 18, 3, 19, 4, 20, 5, 21, 6, 22, 7, 23]
293 | |         );
    | |_________^
    |
    = note: this error: internal compiler error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

thread 'rustc' panicked at 'Box<Any>', <::std::macros::panic macros>:2:4
stack backtrace:
   0:     0x7fce48a8a634 - backtrace::backtrace::libunwind::trace::h0743ecf0c905ca1e
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/libunwind.rs:86
   1:     0x7fce48a8a634 - backtrace::backtrace::trace_unsynchronized::h0e046f0811b0ae4d
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.44/src/backtrace/mod.rs:66
   2:     0x7fce48a8a634 - std::sys_common::backtrace::_print_fmt::h5fcd1fd3d0e5d79e
                               at src/libstd/sys_common/backtrace.rs:78
   3:     0x7fce48a8a634 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h85ffb53d56efd098
                               at src/libstd/sys_common/backtrace.rs:59
   4:     0x7fce48ac37dc - core::fmt::write::h231e5515e704e96b
                               at src/libcore/fmt/mod.rs:1052
   5:     0x7fce48a7bf97 - std::io::Write::write_fmt::h56f503f924d6c255
                               at src/libstd/io/mod.rs:1428
   6:     0x7fce48a8f425 - std::sys_common::backtrace::_print::hf64c641be26866a9
                               at src/libstd/sys_common/backtrace.rs:62
   7:     0x7fce48a8f425 - std::sys_common::backtrace::print::h16b5d561563c7498
                               at src/libstd/sys_common/backtrace.rs:49
   8:     0x7fce48a8f425 - std::panicking::default_hook::{{closure}}::h8363003bce1deb1a
                               at src/libstd/panicking.rs:204
   9:     0x7fce48a8f166 - std::panicking::default_hook::hb365b24076d7b200
                               at src/libstd/panicking.rs:224
  10:     0x7fce490f9c39 - rustc_driver::report_ice::h2624db039b9cfba9
  11:     0x7fce48a8fb55 - std::panicking::rust_panic_with_hook::h2adc1d4c38cb25af
                               at src/libstd/panicking.rs:474
  12:     0x7fce494cf363 - std::panicking::begin_panic::h6fca9fdb6d23f676
  13:     0x7fce493e488c - rustc_errors::HandlerInner::span_bug::h6840991938d37012
  14:     0x7fce493e4c40 - rustc_errors::Handler::span_bug::h107187c882152f33
  15:     0x7fce49478c69 - rustc::util::bug::opt_span_bug_fmt::{{closure}}::hf73fd7e05df26a89
  16:     0x7fce4947715b - rustc::ty::context::tls::with_opt::{{closure}}::h0c4fdf5a849e88e3
  17:     0x7fce49477106 - rustc::ty::context::tls::with_opt::h92cfac8e0dd8f2c9
  18:     0x7fce49478b58 - rustc::util::bug::opt_span_bug_fmt::haf8b4183f62d8df3
  19:     0x7fce49478b0a - rustc::util::bug::span_bug_fmt::h0be341af60d13d91
  20:     0x7fce49573f1a - <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::fold::h333c620e944c2a61
  21:     0x7fce495532cc - rustc_codegen_ssa::mir::block::<impl rustc_codegen_ssa::mir::FunctionCx<Bx>>::codegen_call_terminator::h61b66235d798dc9e
  22:     0x7fce4954e212 - rustc_codegen_ssa::mir::block::<impl rustc_codegen_ssa::mir::FunctionCx<Bx>>::codegen_block::h977ed6f45937d617
  23:     0x7fce4956055e - rustc_codegen_ssa::base::codegen_instance::h1faa821de1d9e487
  24:     0x7fce4947f6b5 - <rustc::mir::mono::MonoItem as rustc_codegen_ssa::mono_item::MonoItemExt>::define::h0b6bdfededc22107
  25:     0x7fce4940668a - rustc_codegen_llvm::base::compile_codegen_unit::module_codegen::h469c76d782c84352
  26:     0x7fce494b3227 - rustc::dep_graph::graph::DepGraph::with_task::h29956dbbd3cd6e7c
  27:     0x7fce49406254 - rustc_codegen_llvm::base::compile_codegen_unit::hc09ab7897a17060a
  28:     0x7fce4955d55a - rustc_codegen_ssa::base::codegen_crate::h80e90e6d82f0580d
  29:     0x7fce494f1715 - <rustc_codegen_llvm::LlvmCodegenBackend as rustc_codegen_utils::codegen_backend::CodegenBackend>::codegen_crate::hbcef469c00126974
  30:     0x7fce492e0710 - rustc_session::utils::<impl rustc_session::session::Session>::time::h101a151e306dd79b
  31:     0x7fce4938b2ef - rustc_interface::passes::QueryContext::enter::hc499d446e1b9ab96
  32:     0x7fce492bbf4b - rustc_interface::queries::Queries::ongoing_codegen::h201d0ed995ada5da
  33:     0x7fce491632be - rustc_interface::interface::run_compiler_in_existing_thread_pool::hdde65f8eb6e34231
  34:     0x7fce4911d29d - scoped_tls::ScopedKey<T>::set::h774e12e87074d2a2
  35:     0x7fce49104d82 - syntax::attr::with_globals::hd6f4e6fb8aaadb66
  36:     0x7fce4911e963 - std::sys_common::backtrace::__rust_begin_short_backtrace::h2e517a7b74830ac8
  37:     0x7fce48aa1447 - __rust_maybe_catch_panic
                               at src/libpanic_unwind/lib.rs:86
  38:     0x7fce49164ef6 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h3814fa1c62419cc0
  39:     0x7fce48a6c31f - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h8e917a822ffc0592
                               at /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a/src/liballoc/boxed.rs:1017
  40:     0x7fce48a9fd50 - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h8aa486ee72f31ff1
                               at /rustc/7760cd0fbbbf2c59a625e075a5bdfa88b8e30f8a/src/liballoc/boxed.rs:1017
  41:     0x7fce48a9fd50 - std::sys_common::thread::start_thread::h8407e13fad90fc7e
                               at src/libstd/sys_common/thread.rs:13
  42:     0x7fce48a9fd50 - std::sys::unix::thread::Thread::new::thread_start::h55e6429cb8ed2e9f
                               at src/libstd/sys/unix/thread.rs:80
  43:     0x7fce4880883d - start_thread
  44:     0x7fce48170fdd - clone

note: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report: https://github.com/rust-lang/rust/blob/master/CONTRIBUTING.md#bug-reports

note: rustc 1.43.0-nightly (7760cd0fb 2020-02-19) running on x86_64-unknown-linux-gnu

note: compiler flags: -C debuginfo=2 -C incremental --crate-type lib

note: some of the compiler flags provided by cargo are hidden

query stack during panic:
end of query stack
error: aborting due to previous error

error: could not compile `encoding_rs`.

To learn more, run the command again with --verbose.
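
The `...` range pattern warnings in the log above look unrelated to the internal compiler error, and the fix the compiler suggests is mechanical. A minimal sketch of the `..=` form follows. The function name and what each arm returns are hypothetical; only the pattern shapes are taken from the lines quoted at src/lib.rs:2686 and 2691.

```rust
/// Hypothetical helper: lower-cases ASCII letters and passes through the
/// other accepted bytes. Only the match patterns mirror the log; the arm
/// bodies are made up for illustration.
fn normalize_label_byte(b: u8) -> Option<u8> {
    match b {
        // Previously written as b'A'...b'Z'
        b'A'..=b'Z' => Some(b + 0x20),
        // Previously written as b'a'...b'z' | b'0'...b'9' | ...
        b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b':' | b'.' => Some(b),
        _ => None,
    }
}
```

Both spellings denote an inclusive range, so the change should not alter which bytes match.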

minimum Rust version?

Is there any policy for this crate with respect to the minimum Rust version supported? In particular, it looks like CI always runs against whatever the current stable/beta/nightly releases are. So if a change gets merged that requires a newer version of Rust, you might not even realize that it happens.

(N.B. As an ecosystem, the "right" policy here is terribly unclear. I personally have been operating under a conservative policy whereby bumping the minimum Rust version requires a semver bump, but I fear this won't always be tenable.)

Enhancement: get read access to the decoder's inner state

This is also about Stringsext, a GNU Strings Alternative with Multi-Byte-Encoding Support which I migrated from rust-encoding to encoding_rs.

In order to keep anchors between the input and the output stream, I would need to know, when the decoder has finished, whether it still has some bytes stored in its inner state. Ideally I would know how many bytes are held back, but even knowing that there are any would already help.

Is there a way to access this information?
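
One caller-side approximation, for UTF-8 input only, is to inspect the chunk yourself and count how many trailing bytes form an incomplete sequence. The sketch below uses only the standard library. The function name is hypothetical, it is not an encoding_rs API, and it assumes the chunk starts on a character boundary. Whether the count matches what the decoder actually holds internally is an assumption.

```rust
/// Hypothetical helper: returns how many bytes at the end of `bytes` are the
/// start of a UTF-8 sequence that has not been completed yet.
/// Invalid sequences in the middle are skipped, not counted.
fn pending_utf8_tail(mut bytes: &[u8]) -> usize {
    loop {
        match std::str::from_utf8(bytes) {
            // Everything decoded cleanly, so nothing is held back.
            Ok(_) => return 0,
            Err(e) => match e.error_len() {
                // `None` means the input ended in the middle of a sequence.
                None => return bytes.len() - e.valid_up_to(),
                // A definitely invalid sequence: skip it and keep scanning.
                Some(n) => bytes = &bytes[e.valid_up_to() + n..],
            },
        }
    }
}
```

For example, the chunk `b"ab\xE2\x82"` ends with the first two bytes of a three-byte sequence, so the function reports 2. This rescans the chunk, so it costs extra work on top of the decode itself.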

E0494 when trying to compile

The situation

I'm dealing with a rather confusing and frustrating set of errors. I've been trying to install the cargo-edit crate, but it gets hung up on the same errors as below. The errors below are actually from when I directly cloned this repo, and ran cargo build.

Is this potentially a version requirement issue?

Version info

  • rustc: rustc 1.27.0 (3eda71b00 2018-06-19)
  • cargo: cargo 1.27.0 (1e95190e5 2018-05-27)

Stdout

$ cargo build
   Compiling encoding_rs v0.8.13 (file:///coder/mnt/rick/test/encoding_rs)
error[E0494]: cannot refer to the interior of another static, use a constant instead
   --> src/lib.rs:928:42
    |
928 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.ibm866, 0x0440, 96, 16),
    |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
   --> src/lib.rs:998:42
    |
998 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_10, 0x00DA, 90, 6),
    |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1032:42
     |
1032 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_13, 0x00DF, 95, 1),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1066:42
     |
1066 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_14, 0x00DF, 95, 17),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1100:42
     |
1100 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_15, 0x00BF, 63, 65),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1133:42
     |
1133 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_16, 0x00DF, 95, 4),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1167:42
     |
1167 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_2, 0x00DF, 95, 1),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1199:42
     |
1199 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_3, 0x00DF, 95, 4),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1231:42
     |
1231 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_4, 0x00DF, 95, 1),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1263:42
     |
1263 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_5, 0x040E, 46, 66),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1295:42
     |
1295 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_6, 0x0621, 65, 26),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1328:42
     |
1328 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_7, 0x03A3, 83, 44),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1365:42
     |
1365 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_8, 0x05D0, 96, 27),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1400:42
     |
1400 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.iso_8859_8, 0x05D0, 96, 27),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1435:42
     |
1435 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.koi8_r, 0x044E, 64, 1),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1467:42
     |
1467 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.koi8_u, 0x044E, 64, 1),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1667:42
     |
1667 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.macintosh, 0x00CD, 106, 3),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1736:42
     |
1736 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_1250, 0x00DC, 92, 2),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1768:42
     |
1768 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_1251, 0x0410, 64, 64),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1800:42
     |
1800 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_1252, 0x00A0, 32, 96),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1833:42
     |
1833 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_1253, 0x03A3, 83, 44),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1867:42
     |
1867 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_1254, 0x00DF, 95, 17),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1900:42
     |
1900 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_1255, 0x05D0, 96, 27),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1934:42
     |
1934 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_1256, 0x0621, 65, 22),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1966:42
     |
1966 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_1257, 0x00DF, 95, 1),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:1999:42
     |
1999 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_1258, 0x00DF, 95, 4),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:2036:42
     |
2036 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.windows_874, 0x0E01, 33, 58),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0494]: cannot refer to the interior of another static, use a constant instead
    --> src/lib.rs:2069:42
     |
2069 |     variant: VariantEncoding::SingleByte(&data::SINGLE_BYTE_DATA.x_mac_cyrillic, 0x0430, 96, 31),
     |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to 28 previous errors

For more information about this error, try `rustc --explain E0494`.
error: Could not compile `encoding_rs`.

To learn more, run the command again with --verbose.
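
Every error above has the same shape: a static whose initializer borrows a field of another static. A reduced version of that pattern is sketched below. The type layout and the two-entry tables are placeholders, not the crate's real data. Current stable compilers accept this code, which is consistent with the 1.27.0 toolchain in the report simply being too old, though the exact minimum version is not something the log confirms.

```rust
/// Placeholder stand-in for the crate's table struct (two entries per table
/// instead of the real data).
struct SingleByteData {
    ibm866: [u16; 2],
    koi8_r: [u16; 2],
}

static SINGLE_BYTE_DATA: SingleByteData = SingleByteData {
    ibm866: [0x0410, 0x0411],
    koi8_r: [0x2500, 0x2502],
};

/// Placeholder stand-in for an encoding description.
struct Encoding {
    name: &'static str,
    table: &'static [u16; 2],
}

// Each static borrows the interior of SINGLE_BYTE_DATA. This is the
// construct that E0494 rejects on the older compiler.
static IBM866: Encoding = Encoding {
    name: "IBM866",
    table: &SINGLE_BYTE_DATA.ibm866,
};

static KOI8_R: Encoding = Encoding {
    name: "KOI8-R",
    table: &SINGLE_BYTE_DATA.koi8_r,
};
```

If this snippet builds with your toolchain but the crate does not, the problem is probably elsewhere; if it fails with E0494 too, updating Rust is the likely fix.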

Release builds broken with Rust 1.23.0-beta.1

Builds with the current rust beta fail tests:

$ rustup update beta
$ cargo +beta test --release
[...]
failures:

---- test_labels_names::test_all_labels stdout ----
	thread 'test_labels_names::test_all_labels' panicked at 'assertion failed: `(left == right)`
  left: `Some(Encoding { windows-1252 })`,
 right: `Some(Encoding { windows-1252 })`', src/test_labels_names.rs:11:4
note: Run with `RUST_BACKTRACE=1` for a backtrace.

---- tests::test_label_resolution stdout ----
	thread 'tests::test_label_resolution' panicked at 'assertion failed: `(left == right)`
  left: `Some(Encoding { UTF-8 })`,
 right: `Some(Encoding { UTF-8 })`', src/lib.rs:4419:8


failures:
    test_labels_names::test_all_labels
    tests::test_label_resolution

test result: FAILED. 106 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out

error: test failed, to rerun pass '--lib'
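
The odd part of the failure is that `left` and `right` print identically. That can happen when equality is defined by address and not by content. The sketch below shows the effect with standard-library code only. The types are hypothetical, and whether the crate compares encodings by address is an assumption here, not something the test output proves.

```rust
use std::fmt;

/// Hypothetical type whose equality is identity (same address), not content.
struct Encoding {
    name: &'static str,
    // Differs between the two statics so they are distinct objects.
    #[allow(dead_code)]
    id: u8,
}

impl fmt::Debug for Encoding {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Prints only the name, in the same style as the test output.
        write!(f, "Encoding {{ {} }}", self.name)
    }
}

impl PartialEq for Encoding {
    fn eq(&self, other: &Encoding) -> bool {
        // Two values are equal only if they are the same object in memory.
        std::ptr::eq(self, other)
    }
}

static FIRST: Encoding = Encoding { name: "UTF-8", id: 0 };
static SECOND: Encoding = Encoding { name: "UTF-8", id: 1 };
```

Here `Some(&FIRST) == Some(&SECOND)` is false even though both sides format as `Some(Encoding { UTF-8 })`. Under that assumption, a compiler that ends up with two copies of what should be one object would produce exactly this kind of assertion message.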

UTF_16LE.encode does not encode string to UTF-16 LE correctly?

Environment

rustc --version output:

rustc 1.27.0-nightly (0b72d48f8 2018-04-10)

and my encoding_rs version is 0.7.2.

Steps to reproduce

Run the following program:

extern crate encoding_rs;

use encoding_rs::UTF_16LE;

fn main() {
    let s = "aa";
    let (bytes, enc, unmappable) = UTF_16LE.encode(s);
    let (dec, enc, unmappable) = UTF_16LE.decode(&bytes);
    for i in dec.chars() {
        println!("{}", i as i32)
    }
    println!("{}", dec);
}

Expected

The program outputs the following text:

97
0
97
0
aa

Actual

The program outputs the following text (24929 = 0x6161):

24929
慡
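
The reported output is what you would get if `encode` had produced the two UTF-8 bytes 0x61 0x61 and `decode` then read them as one little-endian 16-bit unit. As far as I understand the Encoding Standard, encoders for the UTF-16 encodings output UTF-8, so this may be by design; please check the crate documentation before relying on that. The sketch below uses only the standard library to show both the misreading and one way to get real UTF-16LE bytes. The function names are hypothetical.

```rust
/// Encodes `s` as UTF-16LE bytes using only the standard library.
fn to_utf16le_bytes(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(s.len() * 2);
    for unit in s.encode_utf16() {
        // Each 16-bit code unit becomes two bytes, low byte first.
        out.extend_from_slice(&unit.to_le_bytes());
    }
    out
}

/// Reads `bytes` as UTF-16LE, ignoring a trailing odd byte.
/// Unpaired surrogates become U+FFFD.
fn read_as_utf16le(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}
```

With these, `to_utf16le_bytes("aa")` gives the bytes 0x61 0x00 0x61 0x00, while `read_as_utf16le("aa".as_bytes())` gives the single character U+6161, matching the 24929 in the report.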

Excessive copying

The web browser in 1999 could run on 16MB of RAM.

The web browser in 2018 needs 2GB of RAM.

This is mainly caused by excessive copying. Excessive copying is good in languages such as C and C++ where you can't guarantee the lifetimes of pointers, but this is Rust. We shouldn't need to copy everything all the time.

The fix? Cow<str> is a nice one, and what I'd recommend.

Software should go back to being efficient and fast. Excessive copying is NOT the way to go.
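
For readers unfamiliar with the suggestion, here is a small sketch of what `Cow<str>` buys, using only the standard library. The function name is hypothetical. As far as I can tell, the crate's non-streaming `decode` already returns a `Cow<str>`, so treat this as an illustration of the idea and not as a description of what the crate lacks.

```rust
use std::borrow::Cow;

/// Hypothetical helper: returns the input as text, copying only when
/// invalid UTF-8 has to be replaced with U+FFFD.
fn to_text(bytes: &[u8]) -> Cow<'_, str> {
    // Valid input is borrowed as-is; invalid input allocates a new String.
    String::from_utf8_lossy(bytes)
}
```

Calling `to_text(b"plain ascii")` yields `Cow::Borrowed`, so no bytes are copied, while `to_text(b"bad \xFF")` yields `Cow::Owned` because the replacement character has to be written somewhere.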

Can encode_from_utf16 store pending high surrogate?

use encoding_rs;

fn utf16_to_utf8() {
    let mut encoder = encoding_rs::UTF_8.new_encoder();

    let src = [0xD83Du16];
    let mut dst = [0u8;4];
    encoder.encode_from_utf16(&src, &mut dst, false);
    println!("{:?}", dst);

    let src = [0xDC99u16];
    let mut dst = [0u8;4];
    encoder.encode_from_utf16(&src, &mut dst, true);
    println!("{:?}", dst);
}

fn utf16_to_utf8_2() {
    let mut decoder = encoding_rs::UTF_16LE.new_decoder();

    let src = [0x3Du8, 0xD8u8];
    let mut dst = [0u8;4];
    decoder.decode_to_utf8(&src, &mut dst, false);
    println!("{:?}", dst);

    let src = [0x99u8, 0xDCu8];
    let mut dst = [0u8;4];
    decoder.decode_to_utf8(&src, &mut dst, true);
    println!("{:?}", dst);
}

fn main() {
    utf16_to_utf8();
    utf16_to_utf8_2();
}

Per this sample code, it seems that only the decoder stores the surrogate, while the encoder does not. This is counterintuitive to me. Is this intentional, is there a way to do the equivalent, or is it a bug?
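
One way to get the equivalent on the caller side is to hold back a trailing high surrogate yourself and prepend it to the next chunk. The sketch below does that with the standard library only. The type and method names are hypothetical, and it says nothing about whether the encoder's behaviour is intentional.

```rust
/// Hypothetical helper: converts UTF-16 chunks to UTF-8 and carries an
/// unpaired trailing high surrogate over to the next chunk.
struct Utf16Chunker {
    pending_high: Option<u16>,
}

impl Utf16Chunker {
    fn new() -> Self {
        Utf16Chunker { pending_high: None }
    }

    /// Appends the UTF-8 form of `chunk` to `out`. Pass `last = true` for
    /// the final chunk so a dangling surrogate is flushed as U+FFFD.
    fn push(&mut self, chunk: &[u16], last: bool, out: &mut String) {
        let mut units: Vec<u16> = Vec::with_capacity(chunk.len() + 1);
        // Start with whatever was held back from the previous call.
        if let Some(high) = self.pending_high.take() {
            units.push(high);
        }
        units.extend_from_slice(chunk);

        // Hold back a trailing high surrogate unless the stream is ending.
        let ends_with_high =
            matches!(units.last(), Some(&u) if (0xD800..=0xDBFF).contains(&u));
        if !last && ends_with_high {
            self.pending_high = units.pop();
        }

        for decoded in std::char::decode_utf16(units.iter().copied()) {
            out.push(decoded.unwrap_or(std::char::REPLACEMENT_CHARACTER));
        }
    }
}
```

Feeding `[0xD83D]` with `last = false` and then `[0xDC99]` with `last = true` appends nothing on the first call and U+1F499 (bytes F0 9F 92 99) on the second. The extra `Vec` per call keeps the sketch short; a real implementation would probably avoid it.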

decoding valid UTF-16 (to UTF-8) panics with an output buffer of size 4

Here's the full program:

extern crate encoding_rs;

fn main() {
    let enc = encoding_rs::UTF_16LE;
    let mut decoder = enc.new_decoder_with_bom_removal();

    let mut dst = [0u8; 4];

    let src = &[9];
    let (res, nin, nout, err) = decoder.decode_to_utf8(src, &mut dst, false);
    eprintln!(
        "res: {:?}, nin: {:?}, nout: {:?}, err: {:?}",
        res, nin, nout, err,
    );
    assert_eq!(dst, [0x00, 0x00, 0x00, 0x00]);

    let src = &[60, 103, 62];
    let (res, nin, nout, err) = decoder.decode_to_utf8(src, &mut dst, false);
    eprintln!(
        "res: {:?}, nin: {:?}, nout: {:?}, err: {:?}",
        res, nin, nout, err,
    );
    assert_eq!(dst, [0xE3, 0xB0, 0x89, 0x00]);
}

And the output:

[andrew@Cheetah encodingrsbug]$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/encodingrsbug`
res: InputEmpty, nin: 1, nout: 0, err: false
thread 'main' panicked at 'attempt to subtract with overflow', /home/andrew/.cargo/registry/src/github.com-1ecc6299db9ec823/encoding_rs-0.8.6/src/handles.rs:277:31
note: Run with `RUST_BACKTRACE=1` for a backtrace.

[andrew@Cheetah encodingrsbug]$ cargo run --release
   Compiling encodingrsbug v0.1.0 (/home/andrew/tmp/ripgrep/issues/1052/encodingrsbug)
    Finished release [optimized] target(s) in 0.38s
     Running `target/release/encodingrsbug`
res: InputEmpty, nin: 1, nout: 0, err: false
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', /home/andrew/.cargo/registry/src/github.com-1ecc6299db9ec823/encoding_rs-0.8.6/src/handles.rs:309:21
note: Run with `RUST_BACKTRACE=1` for a backtrace.

This looks like the same symptom as #32, however, I filed a new issue since the input data is different, and because I actually seem to need two decode_to_utf8 calls here to trigger the panic. Namely, if I prepend 9 to the second src buffer and remove the first call to decode_to_utf8, then the program works as expected. Moreover, I believe the program in #32 is actually invalid since the docs for this crate state that the output buffer for UTF-8 must be at least 4 bytes, and the buffer size in #32 is 2 bytes, so it seems like a panic there is legal. However, in this program, the output buffer size is 4 bytes, which I think makes this a correct program with respect to satisfying the preconditions of decode_to_utf8. However, I am not 100% sure.

If this is a correct program, then I think the panic is a bug. I briefly looked at the code but couldn't immediately see the appropriate fix.

If this is an incorrect program, then I think that provokes a question: is it possible to write a correct program that uses encoding_rs with a statically known fixed size buffer? If so, what is the minimum size of said buffer?

To give you context, I tripped this bug in encoding_rs_io. Specifically, this is code that is handling the implementation of Read when the caller-provided buffer is too small to fit a single decoded codepoint. Namely, it falls over to the "tiny" transcoder: https://github.com/BurntSushi/encoding_rs_io/blob/a322c14c9ea48303fb883668a605f99aaf734357/src/util.rs#L6-L22
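
To make the two-call setup concrete: the first call supplies one byte of a two-byte unit, so a decoder has to remember it until the next call. The sketch below models only that bookkeeping with the standard library. It is a hypothetical helper, not how encoding_rs is implemented, and it collects whole code units instead of writing into a fixed-size buffer.

```rust
/// Hypothetical helper: gathers UTF-16LE code units from byte chunks of any
/// length, carrying an odd trailing byte over to the next call.
struct Utf16LeCollector {
    pending: Option<u8>,
    units: Vec<u16>,
}

impl Utf16LeCollector {
    fn new() -> Self {
        Utf16LeCollector { pending: None, units: Vec::new() }
    }

    /// Consumes `chunk`, pairing each byte with a held-back byte if any.
    fn push(&mut self, chunk: &[u8]) {
        for &byte in chunk {
            match self.pending.take() {
                // Second byte of a unit: combine low byte first.
                Some(low) => self.units.push(u16::from_le_bytes([low, byte])),
                // First byte of a unit: hold it until its partner arrives.
                None => self.pending = Some(byte),
            }
        }
    }

    /// Ends the stream. A dangling byte or unpaired surrogate becomes U+FFFD.
    fn finish(mut self) -> String {
        let mut text = String::from_utf16_lossy(&self.units);
        if self.pending.take().is_some() {
            text.push('\u{FFFD}');
        }
        text
    }
}
```

With the inputs from the program above, pushing `[9]` leaves one byte pending and no units, and pushing `[60, 103, 62]` then completes the units 0x3C09 and 0x3E67. The UTF-8 form of U+3C09 is E3 B0 89, which is three bytes, so a second three-byte character cannot fit in what remains of a four-byte output buffer.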

cc @sinkuu
