
iconv-lite's Introduction

iconv-lite: Pure JS character encoding conversion

  • No need for native code compilation. Quick to install, works on Windows, Web, and in sandboxed environments.
  • Used in popular projects like Express.js (body-parser), Grunt, Nodemailer, Yeoman and others.
  • Faster than node-iconv (see below for performance comparison).
  • Intuitive encode/decode API, including Streaming support.
  • In-browser usage via browserify or webpack (~180kb gzip compressed with Buffer shim included).
  • Typescript type definition file included.
  • React Native is supported (install the stream module to enable the Streaming API).
  • License: MIT.


Usage

Basic API

var iconv = require('iconv-lite');

// Convert from an encoded buffer to a js string.
str = iconv.decode(Buffer.from([0x68, 0x65, 0x6c, 0x6c, 0x6f]), 'win1251');

// Convert from a js string to an encoded buffer.
buf = iconv.encode("Sample input string", 'win1251');

// Check if encoding is supported
iconv.encodingExists("us-ascii")

Streaming API

// Decode stream (from binary data stream to js strings)
http.createServer(function(req, res) {
    var converterStream = iconv.decodeStream('win1251');
    req.pipe(converterStream);

    converterStream.on('data', function(str) {
        console.log(str); // Do something with decoded strings, chunk-by-chunk.
    });
});

// Convert encoding streaming example
fs.createReadStream('file-in-win1251.txt')
    .pipe(iconv.decodeStream('win1251'))
    .pipe(iconv.encodeStream('ucs2'))
    .pipe(fs.createWriteStream('file-in-ucs2.txt'));

// Sugar: all encode/decode streams have .collect(cb) method to accumulate data.
http.createServer(function(req, res) {
    req.pipe(iconv.decodeStream('win1251')).collect(function(err, body) {
        assert(typeof body == 'string');
        console.log(body); // full request body string
    });
});

Supported encodings

  • All node.js native encodings: utf8, ucs2 / utf16-le, ascii, binary, base64, hex.
  • Additional unicode encodings: utf16, utf16-be, utf-7, utf-7-imap, utf32, utf32-le, and utf32-be.
  • All widespread singlebyte encodings: Windows 125x family, ISO-8859 family, IBM/DOS codepages, Macintosh family, KOI8 family, all others supported by iconv library. Aliases like 'latin1', 'us-ascii' also supported.
  • All widespread multibyte encodings: CP932, CP936, CP949, CP950, GB2312, GBK, GB18030, Big5, Shift_JIS, EUC-JP.

See all supported encodings on wiki.

Most singlebyte encodings are generated automatically from node-iconv. Thank you Ben Noordhuis and libiconv authors!

Multibyte encodings are generated from Unicode.org mappings and WHATWG Encoding Standard mappings. Thank you, respective authors!

Encoding/decoding speed

Comparison with node-iconv module (1000x256kb, on MacBook Pro, Core i5/2.6 GHz, Node v0.12.0). Note: your results may vary, so please always check on your hardware.

operation             iconv         iconv-lite
----------------------------------------------
encode('win1251')     ~96 Mb/s      ~320 Mb/s
decode('win1251')     ~95 Mb/s      ~246 Mb/s

BOM handling

  • Decoding: the BOM is stripped by default, unless overridden by passing stripBOM: false in options (e.g. iconv.decode(buf, enc, {stripBOM: false})). A callback may also be given as the stripBOM parameter; it will be called if a BOM character is actually found.
  • If you want to detect UTF-8 BOM when decoding other encodings, use node-autodetect-decoder-stream module.
  • Encoding: No BOM added, unless overridden by addBOM: true option.

UTF-16 Encodings

This library supports UTF-16LE, UTF-16BE and UTF-16 encodings. The first two are straightforward; UTF-16 tries to be smart about endianness in the following ways:

  • Decoding: uses BOM and 'spaces heuristic' to determine input endianness. Default is UTF-16LE, but can be overridden with defaultEncoding: 'utf-16be' option. Strips BOM unless stripBOM: false.
  • Encoding: uses UTF-16LE and writes BOM by default. Use addBOM: false to override.

UTF-32 Encodings

This library supports UTF-32LE, UTF-32BE and UTF-32 encodings. Like UTF-16 above, UTF-32 defaults to UTF-32LE but uses the BOM and the 'spaces heuristic' to determine input endianness.

  • The default of UTF-32LE can be overridden with the defaultEncoding: 'utf-32be' option. Strips BOM unless stripBOM: false.
  • Encoding: uses UTF-32LE and writes BOM by default. Use addBOM: false to override. (defaultEncoding: 'utf-32be' can also be used here to change encoding.)

Other notes

When decoding, be sure to supply a Buffer to the decode() method, otherwise bad things usually happen.
Untranslatable characters are set to � or ?. No transliteration is currently supported.
Node versions 0.10.31 and 0.11.13 are buggy, don't use them (see #65, #77).

Testing

$ git clone git@github.com:ashtuchkin/iconv-lite.git
$ cd iconv-lite
$ npm install
$ npm test

$ # To view performance:
$ node test/performance.js

$ # To view test coverage:
$ npm run coverage
$ open coverage/lcov-report/index.html

iconv-lite's People

Contributors

adamansky, amoiseev, ashtuchkin, atinux, chalker, david50407, dougwilson, felixbuenemann, felixfbecker, fengmk2, ivan-kalatchev, jardicc, jenkinv, jiangzhuo, kshetline, larssn, lastonesky, leetreveil, lmlb, mithgol, mscdex, nleush, oldj, pekim, redchair123, rokoroku, stagas, tlhunter, vain0x, yosion-p


iconv-lite's Issues

Add ISO-2022-JP encoding

Needs a separate, stateful codec, so it probably won't be implemented without signs of significant usage.

My performance test result is different from readme.md.

Processor 2.4 GHz Intel Core 2 Duo
Memory 8 GB 1067 MHz DDR3
node v0.8.7

$ node test/perfomance.js 

Encoding 262144 chars 1000 times:
iconv: 4063ms, 63.01 Mb/s.
iconv-lite: 6005ms, 42.63 Mb/s.

Decoding 262144 bytes 1000 times:
iconv: 4459ms, 57.41 Mb/s.
iconv-lite: 10232ms, 25.02 Mb/s.

US-ASCII not supported?

Hello,

I tried to use the US-ASCII encoding with iconv like following:

var buf = iconv.encode(string, "US-ASCII");
return buf.toString();

and I get this exception. Is this normal?

Error: Encoding not recognized: 'US-ASCII' (searched as: 'usascii')
at Object.getCodec (/home/templth/work/repositories/git/restlet-framework-js/tests/org.restlet.js.tests/src/nodejs/server/node_modules/iconv-lite/index.js:36:23)
at Object. (/home/templth/work/repositories/git/restlet-framework-js/tests/org.restlet.js.tests/src/nodejs/server/node_modules/iconv-lite/index.js:4:22)

Thanks for your help,
Thierry

There is a mistake in /encodings/dbcs-codec.js.

When I try to decode a GBK buffer, it just throw an error:

path-to/iconv-lite/encodings/dbcs-codec.js:506
            throw new Error("Unknown table value when decoding: " + val);
                                                                    ^
ReferenceError: val is not defined
    at Object.decoderDBCSWrite [as write] (path-to/iconv-lite/encodings/dbcs-codec.js:506:69)
    at Object.decode (path-to/iconv-lite/lib/index.js:36:23)

Maybe you just forgot to define val?

Add user callback for handling invalid characters.

Inspired by ICU, something like:

  buf = iconv.encode(str, 'win1251', { invalidCharHandler: function(char) {
    // Here you can either throw an exception, which will be propagated outside,
    // or return a replacement char.
  }});

Probably, some default handlers would also be nice (always throwing, always returning '?', trying to transliterate).

Streaming conversion?

Hello, iconv-lite seems quite promising! Thanks a lot for working on it.
I have seen in your README that you plan on adding the streaming support.
This is great news. Anything we could help with?

Also, in our specific case, since we want to convert windows-1256 to utf-8, we're thinking that a full streaming implementation may not be necessary, as windows-1256 is single-byte.
This means we can just convert each chunk of bytes as we receive it from the underlying stream, without caring much about cut-offs. Am I right?

Thanks!

Conversion to lesser encoding

I am making a text editor and plan to use iconv-lite. When opening a file, I use jschardet to detect the encoding, and when I save, I would like to use the same encoding if possible. The editor itself works in UTF-8, so it has to convert back to win1251 (or something else) when saving the file. I would therefore like iconv-lite to tell me whether some symbols cannot be translated to the specified encoding, in which case I would save the file as UTF-8 instead. I would like to propose the following:

iconv.encode(str, encoding, throwError)

which would throw an error if some symbol cannot be converted to the specified encoding. This would happen if you try to convert some quirky UTF-8 string to ISO for example.

Shift JIS

Just to let you know that I'd appreciate having Shift-JIS in the supported encodings.

utf-8 to gbk is error

fs.readFile(path, {encoding: 'utf-8'}, function(err, data) {
    buf = iconv.encode(data, 'gbk');
    fs.writeFile(outName, buf, function() {
        callback();
    });
});

input:  var s = 1;
output: ?var s = 1;

A '?' is always there.

Good way to decode UTF-16?

Hi, I was just wondering if this module (or another helper module) could help with decoding bytes declared as "UTF-16", where the byte order is not specified and the BOM would need to be examined to determine it.

Convert from/to encodings with iconv-lite only

Take this example:

mystr = new Buffer("base64 string with ISO-8859-1 encoding goes here", "base64").toString();

buffer = new Buffer(mystr, "ISO-8859-1"); // <--- this will fail as Buffer doesn't support that encoding

buffer = iconv.decode(buffer, "ISO-8859-1");
buffer = iconv.encode(buffer, "utf8").toString("utf8");

This code should be able to convert from a ISO-8859-1 (or any other encoding) string to UTF8, but it will fail because of the second line.

Can that be done using iconv-lite?

add a method to split a string into an array of encodable and non-encodable substrings

I'd like to propose an alternative to #53.

It is supposed in #53 that “invalid” characters (i.e. characters that cannot be encoded with the given encoding) should be dealt with individually. Sometimes, however, it is more useful to deal with whole substrings of such characters. For such cases I propose a method that would split any given string into an array of encodable and non-encodable substrings following each other.

Example:

var iconvLite = require('iconv-lite');
console.log(
   iconvLite.split('Хлѣбъ です。', 'cp866')
); // output: ['Хл', 'ѣ', 'бъ ', 'です。']

The above suggested method is inspired by a behaviour of String.prototype.split when it is given a regular expression enclosed in a single set of capturing parentheses:

console.log(
   'foo-bar'.split(/(-+)/)
); // output: [ 'foo', '-', 'bar' ]
console.log(
   '--foo-bar'.split(/(-+)/)
); // output: [ '', '--', 'foo', '-', 'bar' ]

The proposed method should remind its users of String.prototype.split (hence the name .split) and thus be understood by analogy.

To make a complete similarity, it should also behave similarly, i.e. the even array indices (0, 2, 4…) should always correspond to encodable substrings while the odd array indices (1, 3, 5…) should always correspond to non-encodable substring. (To achieve that, the first substring in the returned array could sometimes be intentionally left blank, like String.prototype.split does it in the [ '', '--', 'foo', '-', 'bar' ] example above, to preserve the meaning of odd and even indices.)

Truncate part of file on decode, replacing to "...(length:"

I'm using Grunt, and grunt uses iconv-lite.

In my Gruntfile, I'm using this line of code to read a file and assign its content to a variable:

var content = grunt.file.read(file);

Reading the file works, but if the file is large, like 10K+ chars, it shows roughly the first 10k chars and replaces the rest with something like this:

[*Something around the first 10k char appears here (OK) and then ...*]
...(length: 62967)

Is there a limit on the number of chars iconv-lite can decode?

gbk encode error

When I use iconv-lite to encode & decode files between UTF8 & GBK, I have so far found 2 characters that cannot be encoded to GBK correctly.

The characters "·" & "×" cannot be encoded to GBK. I think the GBK table has some mistakes.

How to convert from one encoding to another?

Using the iconv package, I would do the following:

buffer = new Iconv(fromEncoding, toEncoding).convert(buffer)

However, I can't figure what the equivalent conversion would be for iconv-lite.

Code snippet taken from here and here

EUC-JP, EUC-KR not recognized?

I see in the source code that "eucjp" and "euckr" are supported. How do I get it to work?

var iconv = require('iconv-lite');
str = iconv.encode("Sample input string", 'eucjp');
console.log (str);
$ node test.js

/test/node_modules/iconv-lite/index.js:45
                throw new Error("Encoding not recognized: '" + encoding + "' (
                      ^
Error: Encoding not recognized: 'eucjp' (searched as: 'eucjp')
    at Object.module.exports.getCodec (/test/node_modules/iconv-lite/index.js:45:23)
    at Object.module.exports.toEncoding (/test/node_modules/iconv-lite/index.js:5:22)
    at Object.<anonymous> (/test/test.js:3:13)
    at Module._compile (module.js:456:26)
    at Object.Module._extensions..js (module.js:474:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Function.Module.runMain (module.js:497:10)
    at startup (node.js:119:16)
    at node.js:902:3

How to convert from utf-8 to ISO-8859-8 ?

Hi, I am trying to convert a string in utf-8 to a string in iso-8859-8. I am doing this:

var str = 'this is a string';
var buf = iconv.encode(str, 'ISO-8859-8');
str = buf.toString();

Is this the right way of using your lib ?

Thanks 👍

The new SHIFTJIS table in 0.4pre is not complete.

I just tried to convert 0x87,0x40 (glyph is ① ), which should map to U+24EA, but it's not working.

So I found a table from a Japanese high school's site: http://www.seiai.ed.jp/sys/text/java/shiftjis_table.html
In this table, characters in RED COLOR are machine-dependent, making the problem here interesting.

I'm thinking of converting those characters according to this table; they have a copyright page: http://www.seiai.ed.jp/sys/text/home.html . I think it would be better to send them an e-mail and cite the source in the documentation.

latin1 encoding and binary buffer

I don't get how exactly the utf8 to Latin1 encoding works, and why it succeeds.
Also note that the 'binary' encoding is deprecated, so is this the right time to implement a replacement?

convert directly from binary string

It'd be great to have a way to convert directly from a binary string (containing only \u0000-\u00FF) instead of converting to Buffer first.

list all of the supported encodings and their aliases

Consider updating your README with a list of all the supported encodings and their aliases (or making a hyperlink to such list).

  • For example, instead of “IBM/DOS codepages” mention 'cp437', 'cp808', 'cp850', 'cp858', 'cp866', 'cp1125', 'cp1252', each of them, a complete list of them.
  • For example, instead of “KOI8 family” mention specifically 'koi8-r', 'koi8-u', 'koi8-ru', etc.

I am creating this issue because currently I am not sure whether iconv-lite supports every encoding that my singlebyte package supports, or whether there are any differences in the encodings' names. I wish I could stop reinventing that wheel and be sure that nothing would break.

Unexpected stop when doing many consecutive decodings

Hello,

I'm experiencing the following issue -

I'm using iconv-lite like this:
var iconv = require('iconv-lite');
iconv.decode(buf, 'utf-8'); // "buf" is a buffer ~300KB of size.

If I'm doing as follows multiple times, node just stops with no error:
iconv.decode(buf, 'utf-8');
iconv.decode(buf, 'utf-8');
iconv.decode(buf, 'utf-8');
// ... (x12 times)
iconv.decode(buf, 'utf-8');

Thank you.

Browser support

As far as I know, you don't have support for browsers. Shall I make a branch for browser support?

Add a warning when decoding a string

This is a major misunderstanding of how things work. For example, #36, #37, #40, #43,
http://stackoverflow.com/questions/13456307/getting-correct-string-from-windows-1250-encoded-web-page-with-node-js
http://stackoverflow.com/questions/5135450/nodejs-http-response-encoding

The plan:

Make browser version without Browserify

  • Use Uint8Array or just Array as a Buffer (make a single place for codecs to create resulting objects and make a switch there)
  • Reimplement utf-8, utf-16, base64, hex, binary.
  • Make separate file for dbcs codec and its tables.
  • Make interface compatible with Encoding Standard.

unable to properly decode a known charset

I'm requesting a page with the request module to handle it with cheerio. If I open the page in a browser (with the proper user agent set), I can see that its encoding is iso-8859-1.

Despite knowing this, I cannot get the response body properly decoded and sent out. Do you have any ideas how to decode/encode this properly?

var request = require('request');
var cheerio = require('cheerio');
var iconv  = require('iconv-lite');

// var iconv = new Iconv('latin2', 'latin2//TRANSLIT//IGNORE');

function updateNapimenu(db, cb) {
    request({
        url: 'http://napimenu.eu/?1=1Zzk0Nztc6',
        headers: {
            'User-Agent': 'Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0',
            'encoding': null
        }
    }, function(err, resp, body) {
        if (err)
            throw err;

        $ = cheerio.load(iconv.decode(iconv.encode(body, 'iso-8859-1'), 'utf8'));
        // $ = cheerio.load(body);
        var body = '';
        $('.b1, .b2').each(function(idx) {
            body += '<div>' + $(this).html() + '</div>';
        })
        cb(null, body);
    });
}

exports.updateNapimenu = updateNapimenu;

Very slow on Node v0.11.13 (previous are fine)

node --trace-deopt shows that in this version the codecs are deoptimized (previous versions don't have that):

[deoptimizing (DEOPT eager): begin 0x1bfe98051541 encoderSBCSWrite (opt #11) @17, FP to SP delta: 40]
  translating encoderSBCSWrite => node=40, height=16
    0x7fff5fbff298: [top + 56] <- 0x1bfe980fb101 ; rsi 0x1bfe980fb101 <an Object with map 0x32648080fa59>
    0x7fff5fbff290: [top + 48] <- 0x1bfe98004c79 ; rcx 0x1bfe98004c79 <Very long string[262144]>
    0x7fff5fbff288: [top + 40] <- 0x3b45c8dcb1e5 ; caller's pc
    0x7fff5fbff280: [top + 32] <- 0x7fff5fbff2d8 ; caller's fp
    0x7fff5fbff278: [top + 24] <- 0x1bfe98051489; context
    0x7fff5fbff270: [top + 16] <- 0x1bfe98051541; function
    0x7fff5fbff268: [top + 8] <- 0x1bfe980fb189 ; rbx 0x1bfe980fb189 <a Buffer with map 0x32648080fbc1>
    0x7fff5fbff260: [top + 0] <- 0 ; rax (smi)
[deoptimizing (eager): end 0x1bfe98051541 encoderSBCSWrite @17 => node=40, pc=0x3b45c8dd34a4, state=NO_REGISTERS, alignment=no padding, took 0.037 ms]
[removing optimized code for: encoderSBCSWrite]
Node version    v8 version
0.11.12         3.22.24
0.11.13         3.25.30

Probably something with isolates?

Latin1 encode error

Hi, I’m trying to decode a Latin1 (ISO-8859-1) encoded string and it doesn’t work, as you can see below:

> iconv.decode( iconv.encode( 'é', 'latin1' ).toString(), 'latin1' )
'�'

Instead of returning 'é', it returns '�'.

I have [email protected].
