
iconv-lite's Introduction

iconv-lite: Pure JS character encoding conversion

  • No need for native code compilation. Quick to install, works on Windows, Web, and in sandboxed environments.
  • Used in popular projects like Express.js (body-parser), Grunt, Nodemailer, Yeoman and others.
  • Faster than node-iconv (see below for performance comparison).
  • Intuitive encode/decode API, including Streaming support.
  • In-browser usage via browserify or webpack (~180kb gzip compressed with Buffer shim included).
  • Typescript type definition file included.
  • React Native is supported (install the stream module to enable the Streaming API).
  • License: MIT.


Usage

Basic API

var iconv = require('iconv-lite');

// Convert from an encoded buffer to a js string.
str = iconv.decode(Buffer.from([0x68, 0x65, 0x6c, 0x6c, 0x6f]), 'win1251');

// Convert from a js string to an encoded buffer.
buf = iconv.encode("Sample input string", 'win1251');

// Check if encoding is supported
iconv.encodingExists("us-ascii")

Streaming API

// Decode stream (from binary data stream to js strings)
http.createServer(function(req, res) {
    var converterStream = iconv.decodeStream('win1251');
    req.pipe(converterStream);

    converterStream.on('data', function(str) {
        console.log(str); // Do something with decoded strings, chunk-by-chunk.
    });
});

// Convert encoding streaming example
fs.createReadStream('file-in-win1251.txt')
    .pipe(iconv.decodeStream('win1251'))
    .pipe(iconv.encodeStream('ucs2'))
    .pipe(fs.createWriteStream('file-in-ucs2.txt'));

// Sugar: all encode/decode streams have .collect(cb) method to accumulate data.
http.createServer(function(req, res) {
    req.pipe(iconv.decodeStream('win1251')).collect(function(err, body) {
        assert(typeof body == 'string');
        console.log(body); // full request body string
    });
});

Supported encodings

  • All node.js native encodings: utf8, ucs2 / utf16-le, ascii, binary, base64, hex.
  • Additional unicode encodings: utf16, utf16-be, utf-7, utf-7-imap, utf32, utf32-le, and utf32-be.
  • All widespread singlebyte encodings: Windows 125x family, ISO-8859 family, IBM/DOS codepages, Macintosh family, KOI8 family, all others supported by iconv library. Aliases like 'latin1', 'us-ascii' also supported.
  • All widespread multibyte encodings: CP932, CP936, CP949, CP950, GB2312, GBK, GB18030, Big5, Shift_JIS, EUC-JP.

See all supported encodings on wiki.

Most singlebyte encodings are generated automatically from node-iconv. Thank you Ben Noordhuis and libiconv authors!

Multibyte encodings are generated from Unicode.org mappings and WHATWG Encoding Standard mappings. Thank you, respective authors!

Encoding/decoding speed

Comparison with node-iconv module (1000x256kb, on MacBook Pro, Core i5/2.6 GHz, Node v0.12.0). Note: your results may vary, so please always check on your hardware.

operation             iconv         iconv-lite
----------------------------------------------
encode('win1251')     ~96 Mb/s      ~320 Mb/s
decode('win1251')     ~95 Mb/s      ~246 Mb/s

BOM handling

  • Decoding: the BOM is stripped by default, unless overridden by passing stripBOM: false in options (e.g. iconv.decode(buf, enc, {stripBOM: false})). A callback may also be given as the stripBOM parameter; it will be called if a BOM character is actually found.
  • If you want to detect UTF-8 BOM when decoding other encodings, use node-autodetect-decoder-stream module.
  • Encoding: No BOM added, unless overridden by addBOM: true option.

UTF-16 Encodings

This library supports UTF-16LE, UTF-16BE and UTF-16 encodings. The first two are straightforward; UTF-16 tries to be smart about endianness in the following ways:

  • Decoding: uses BOM and 'spaces heuristic' to determine input endianness. Default is UTF-16LE, but can be overridden with defaultEncoding: 'utf-16be' option. Strips BOM unless stripBOM: false.
  • Encoding: uses UTF-16LE and writes BOM by default. Use addBOM: false to override.

UTF-32 Encodings

This library supports UTF-32LE, UTF-32BE and UTF-32 encodings. Like UTF-16 above, UTF-32 defaults to UTF-32LE but uses the BOM and the 'spaces heuristic' to determine input endianness.

  • The default of UTF-32LE can be overridden with the defaultEncoding: 'utf-32be' option. Strips BOM unless stripBOM: false.
  • Encoding: uses UTF-32LE and writes BOM by default. Use addBOM: false to override. (defaultEncoding: 'utf-32be' can also be used here to change encoding.)

Other notes

When decoding, be sure to supply a Buffer to the decode() method, otherwise bad things usually happen.
Untranslatable characters are set to � or ?. No transliteration is currently supported.
Node versions 0.10.31 and 0.11.13 are buggy, don't use them (see #65, #77).

Testing

$ git clone git@github.com:ashtuchkin/iconv-lite.git
$ cd iconv-lite
$ npm install
$ npm test

$ # To view performance:
$ node test/performance.js

$ # To view test coverage:
$ npm run coverage
$ open coverage/lcov-report/index.html

iconv-lite's People

Contributors

adamansky, amoiseev, ashtuchkin, atinux, chalker, david50407, dougwilson, felixbuenemann, felixfbecker, fengmk2, ivan-kalatchev, jardicc, jenkinv, jiangzhuo, kshetline, larssn, lastonesky, leetreveil, lmlb, mithgol, mscdex, nleush, oldj, pekim, redchair123, rokoroku, stagas, tlhunter, vain0x, yosion-p


iconv-lite's Issues

Add ISO-2022-JP encoding

Needs a separate, stateful codec, so it probably won't be implemented without signs of significant usage.

My performance test result is different from readme.md.

Processor 2.4 GHz Intel Core 2 Duo
Memory 8 GB 1067 MHz DDR3
node v0.8.7

$ node test/perfomance.js 

Encoding 262144 chars 1000 times:
iconv: 4063ms, 63.01 Mb/s.
iconv-lite: 6005ms, 42.63 Mb/s.

Decoding 262144 bytes 1000 times:
iconv: 4459ms, 57.41 Mb/s.
iconv-lite: 10232ms, 25.02 Mb/s.

US-ASCII not supported?

Hello,

I tried to use the US-ASCII encoding with iconv like following:

var buf = iconv.encode(string, "US-ASCII");
return buf.toString();

and I get this exception. Is this normal?

Error: Encoding not recognized: 'US-ASCII' (searched as: 'usascii')
at Object.getCodec (/home/templth/work/repositories/git/restlet-framework-js/tests/org.restlet.js.tests/src/nodejs/server/node_modules/iconv-lite/index.js:36:23)
at Object. (/home/templth/work/repositories/git/restlet-framework-js/tests/org.restlet.js.tests/src/nodejs/server/node_modules/iconv-lite/index.js:4:22)

Thanks for your help,
Thierry

There is a mistake in /encodings/dbcs-codec.js.

When I try to decode a GBK buffer, it just throw an error:

path-to/iconv-lite/encodings/dbcs-codec.js:506
            throw new Error("Unknown table value when decoding: " + val);
                                                                    ^
ReferenceError: val is not defined
    at Object.decoderDBCSWrite [as write] (path-to/iconv-lite/encodings/dbcs-codec.js:506:69)
    at Object.decode (path-to/iconv-lite/lib/index.js:36:23)

Maybe you just forgot to define val?

Add user callback for handling invalid characters.

Inspired by ICU, something like:

  buf = iconv.encode(str, 'win1251', { invalidCharHandler: function(char) {
    // Here you can either throw an exception, which will be propagated outside,
    // or return a replacement char.
  }});

Probably, some default handlers would also be nice (always throwing, always returning '?', trying to transliterate).

Streaming conversion?

Hello, iconv-lite seems quite promising! Thanks a lot for working on it.
I have seen in your README that you plan on adding the streaming support.
This is great news. Anything we could help with?

Also, in our specific case, since we want to convert windows-1256 to utf-8, we're thinking that a full streaming implementation may not be necessary, as windows-1256 is single-byte.
This means we can just convert each chunk of bytes as we receive it from the underlying stream, without caring much about cut-offs. Am I right?

Thanks!

Conversion to lesser encoding

I am making a text editor and plan to use iconv-lite. When opening a file, I use jschardet to detect the encoding, and when I save, I would like to use the same encoding if possible. The editor itself works in UTF-8, so it has to convert back to win1251 (or something else) when saving the file. I would therefore like iconv-lite to tell me whether some symbols cannot be translated to the specified encoding, in which case I would save the file as UTF-8 instead. I would like to propose the following:

iconv.encode(str, encoding, throwError)

which would throw an error if some symbol cannot be converted to the specified encoding. This would happen if you try to convert some quirky UTF-8 string to ISO for example.

Shift JIS

Just to let you know that I'd appreciate having Shift-JIS in the supported encodings.

utf-8 to gbk is error

fs.readFile(path, {encoding: 'utf-8'}, function(err, data) {
    buf = iconv.encode(data, 'gbk');
    fs.writeFile(outName, buf, function() {
        callback();
    });
});

input:  var s = 1;
output: ?var s = 1;

A '?' is always there.

Good way to decode UTF-16?

Hi, I was just wondering if this module (or another helper module) could help with decoding bytes declared as "UTF-16", where the byte order is not specified and the BOM would need to be examined to determine it.

Convert from/to encodings with iconv-lite only

Take this example:

mystr = new Buffer("base64 string with ISO-8859-1 encoding goes here", "base64").toString();

buffer = new Buffer(mystr, "ISO-8859-1"); // <--- this will fail as Buffer doesn't support that encoding

buffer = iconv.decode(buffer, "ISO-8859-1");
buffer = iconv.encode(buffer, "utf8").toString("utf8");

This code should be able to convert from a ISO-8859-1 (or any other encoding) string to UTF8, but it will fail because of the second line.

Can that be done using iconv-lite?

add a method to split a string into an array of encodable and non-encodable substrings

I'd like to propose an alternative to #53.

It is supposed in #53 that “invalid” characters (i.e. characters that cannot be encoded with the given encoding) should be dealt with individually. Sometimes, however, it is more useful to deal with whole substrings of such characters. For such cases I propose a method that would split any given string into an array of encodable and non-encodable substrings following each other.

Example:

var iconvLite = require('iconv-lite');
console.log(
   iconvLite.split('Хлѣбъ です。', 'cp866')
); // output: ['Хл', 'ѣ', 'бъ ', 'です。']

The above suggested method is inspired by a behaviour of String.prototype.split when it is given a regular expression enclosed in a single set of capturing parentheses:

console.log(
   'foo-bar'.split(/(-+)/)
); // output: [ 'foo', '-', 'bar' ]
console.log(
   '--foo-bar'.split(/(-+)/)
); // output: [ '', '--', 'foo', '-', 'bar' ]

The proposed method should remind its users of String.prototype.split (hence the name .split) and thus be understood by analogy.

To make a complete similarity, it should also behave similarly, i.e. the even array indices (0, 2, 4…) should always correspond to encodable substrings while the odd array indices (1, 3, 5…) should always correspond to non-encodable substring. (To achieve that, the first substring in the returned array could sometimes be intentionally left blank, like String.prototype.split does it in the [ '', '--', 'foo', '-', 'bar' ] example above, to preserve the meaning of odd and even indices.)

Truncate part of file on decode, replacing to "...(length:"

I'm using Grunt, and grunt uses iconv-lite.

In my Gruntfile, I'm using this line of code to read a file and assign its content to a variable:

var content = grunt.file.read(file);

Reading the file works, but if the file is large, like 10K+ chars, it shows roughly the first 10k chars and replaces the rest with something like this:

[*Something around the first 10k char appears here (OK) and then ...*]
...(length: 62967)

Is there a limit on the number of chars iconv-lite can decode?

gbk encode error

When I use iconv-lite to encode & decode files between UTF8 & GBK, I have so far found 2 characters that cannot be encoded to GBK correctly.

The characters "·" & "×" cannot be encoded to GBK. I think the GBK table has some mistakes.

How to convert from one encoding to another?

Using the iconv package, I would do the following:

buffer = new Iconv(fromEncoding, toEncoding).convert(buffer)

However, I can't figure what the equivalent conversion would be for iconv-lite.

Code snippet taken from here and here

EUC-JP, EUC-KR not recognized?

I see in the source code that "eucjp" and "euckr" are supported. How do I get it to work?

var iconv = require('iconv-lite');
str = iconv.encode("Sample input string", 'eucjp');
console.log (str);
$ node test.js

/test/node_modules/iconv-lite/index.js:45
                throw new Error("Encoding not recognized: '" + encoding + "' (
                      ^
Error: Encoding not recognized: 'eucjp' (searched as: 'eucjp')
    at Object.module.exports.getCodec (/test/node_modules/iconv-lite/index.js:45:23)
    at Object.module.exports.toEncoding (/test/node_modules/iconv-lite/index.js:5:22)
    at Object.<anonymous> (/test/test.js:3:13)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:902:3

How to convert from utf-8 to ISO-8859-8 ?

Hi, I am trying to convert a string in utf-8 to a string in iso-8859-8. I am doing this:

var str = 'this is a string';
var buf = iconv.encode(str, 'ISO-8859-8');
str = buf.toString();

Is this the right way of using your lib ?

Thanks 👍

The new SHIFTJIS table in 0.4pre is not complete.

I just tried to convert 0x87,0x40 (glyph is ① ), which should map to U+24EA, but it's not working.

So I found a table from a Japanese high school's site: http://www.seiai.ed.jp/sys/text/java/shiftjis_table.html
In this table, characters in RED COLOR are machine-dependent, making the problem here interesting.

I'm thinking of converting those characters according to this table; they have a copyright page: http://www.seiai.ed.jp/sys/text/home.html . I think it would be better to send them an e-mail and cite the source in the documentation.

latin1 encoding and binary buffer

I don't get how exactly the utf8 to Latin1 encoding works, and why it succeeds.
Also note that the 'binary' encoding is deprecated, so is this the right time to implement a replacement?

convert directly from binary string

It'd be great to have a way to convert directly from a binary string (containing only \u0000-\u00FF) instead of converting to Buffer first.

list all of the supported encodings and their aliases

Consider updating your README with a list of all the supported encodings and their aliases (or making a hyperlink to such list).

  • For example, instead of “IBM/DOS codepages” mention 'cp437', 'cp808', 'cp850', 'cp858', 'cp866', 'cp1125', 'cp1252', each of them, a complete list of them.
  • For example, instead of “KOI8 family” mention specifically 'koi8-r', 'koi8-u', 'koi8-ru', etc.

I am creating this issue because currently I am not sure whether iconv-lite supports every encoding that my singlebyte package supports, or whether there are any differences in the encodings' names. I wish I could stop reinventing that wheel and be sure that nothing would break.

Unexpected stop when doing many consecutive decodings

Hello,

I'm experiencing the following issue -

I'm using iconv-lite like this:
var iconv = require('iconv-lite');
iconv.decode(buf, 'utf-8'); // "buf" is a buffer ~300KB of size.

If I'm doing as follows multiple times, node just stops with no error:
iconv.decode(buf, 'utf-8');
iconv.decode(buf, 'utf-8');
iconv.decode(buf, 'utf-8');
// ... (x12 times)
iconv.decode(buf, 'utf-8');

Thank you.

Browser support

As far as I know, you don't have support for browsers. Shall I make a branch for browser support?

Add a warning when decoding a string

This is a major misunderstanding of how things work. For example, #36, #37, #40, #43,
http://stackoverflow.com/questions/13456307/getting-correct-string-from-windows-1250-encoded-web-page-with-node-js
http://stackoverflow.com/questions/5135450/nodejs-http-response-encoding

The plan:

Make browser version without Browserify

  • Use Uint8Array or just Array as a Buffer (make a single place for codecs to create resulting objects and make a switch there)
  • Reimplement utf-8, utf-16, base64, hex, binary.
  • Make separate file for dbcs codec and its tables.
  • Make interface compatible with Encoding Standard.

unable to properly decode a known charset

I'm requesting a page with the request module to handle it with cheerio. If I open the page in a browser (with the proper user agent set), I can see that its encoding is iso-8859-1.

Despite knowing this, I cannot get the response body properly decoded and sent out. Do you have any ideas how to decode/encode this properly?

var request = require('request');
var cheerio = require('cheerio');
var iconv  = require('iconv-lite');

// var iconv = new Iconv('latin2', 'latin2//TRANSLIT//IGNORE');

function updateNapimenu(db, cb) {
    request({
        url: 'http://napimenu.eu/?1=1Zzk0Nztc6',
        headers: {
            'User-Agent': 'Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0',
            'encoding': null
        }
    }, function(err, resp, body) {
        if (err)
            throw err;

        $ = cheerio.load(iconv.decode(iconv.encode(body, 'iso-8859-1'), 'utf8'));
        // $ = cheerio.load(body);
        var body = '';
        $('.b1, .b2').each(function(idx) {
            body += '<div>' + $(this).html() + '</div>';
        })
        cb(null, body);
    });
}

exports.updateNapimenu = updateNapimenu;

Very slow on Node v0.11.13 (previous are fine)

node --trace-deopt shows that in this version the codecs are deoptimized (previous versions don't have that):

[deoptimizing (DEOPT eager): begin 0x1bfe98051541 encoderSBCSWrite (opt #11) @17, FP to SP delta: 40]
  translating encoderSBCSWrite => node=40, height=16
    0x7fff5fbff298: [top + 56] <- 0x1bfe980fb101 ; rsi 0x1bfe980fb101 <an Object with map 0x32648080fa59>
    0x7fff5fbff290: [top + 48] <- 0x1bfe98004c79 ; rcx 0x1bfe98004c79 <Very long string[262144]>
    0x7fff5fbff288: [top + 40] <- 0x3b45c8dcb1e5 ; caller's pc
    0x7fff5fbff280: [top + 32] <- 0x7fff5fbff2d8 ; caller's fp
    0x7fff5fbff278: [top + 24] <- 0x1bfe98051489; context
    0x7fff5fbff270: [top + 16] <- 0x1bfe98051541; function
    0x7fff5fbff268: [top + 8] <- 0x1bfe980fb189 ; rbx 0x1bfe980fb189 <a Buffer with map 0x32648080fbc1>
    0x7fff5fbff260: [top + 0] <- 0 ; rax (smi)
[deoptimizing (eager): end 0x1bfe98051541 encoderSBCSWrite @17 => node=40, pc=0x3b45c8dd34a4, state=NO_REGISTERS, alignment=no padding, took 0.037 ms]
[removing optimized code for: encoderSBCSWrite]
Node version    v8 version
0.11.12         3.22.24
0.11.13         3.25.30

Probably something with isolates?

Latin1 encode error

Hi, I’m trying to decode a Latin1 (ISO-8859-1) encoded string and it doesn’t work, as you can see below:

> iconv.decode( iconv.encode( 'é', 'latin1' ).toString(), 'latin1' )
'�'

Instead of returning 'é', it returns '�'.

I have [email protected].
