fgribreau / node-unidecode

ASCII transliterations of Unicode text

License: BSD 3-Clause "New" or "Revised" License

JavaScript 99.54% Shell 0.46%

node-unidecode's Introduction

Unidecode for NodeJS

Unidecode is a JavaScript port of the Perl module Text::Unidecode. It takes UTF-8 data and tries to represent it in US-ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F). The representation is almost always an attempt at transliteration, i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system.

See Text::Unidecode for the original README file, including methodology and limitations.

Note that all the files named 'x??.js' in data are derived directly from the equivalent Perl files, and both sets of files are distributed under the Perl license, not the BSD license.

Installation

$ npm install unidecode

Usage

$ node
> var unidecode = require('unidecode');
> unidecode("aéà)àçé");
'aea)ace'
> unidecode("に間違いがないか、再度確認してください。再読み込みしてください。");
'niJian Wei iganaika, Zai Du Que Ren sitekudasai. Zai Du miIp misitekudasai. '

Advanced Usage

Custom Substitution Values

Characters that cannot be transliterated are replaced with an empty string by default. You can override this behavior by passing a custom substitution value as the second argument to unidecode:

$ node
> var unidecode = require('unidecode');
> unidecode("ab\uFFFFc", "X");
'abXc'
> unidecode("ab\uFFFFc");
'abc'

Donate

I maintain this project in my free time. If it helped you, please support my work via PayPal or bitcoin. Thanks a lot!

I accept pull requests!

node-unidecode's Issues

Feature Request: Transliterate Unicode "Fonts"

Using special Unicode glyphs, people can produce custom "fonts" such as:

𝐍𝐨𝐰 𝐢𝐬 𝐭𝐡𝐞 𝐭𝐢𝐦𝐞 𝐟𝐨𝐫 𝐚𝐥𝐥 𝐠𝐨𝐨𝐝 𝐦𝐞𝐧.
𝓝𝓸𝔀 𝓲𝓼 𝓽𝓱𝓮 𝓽𝓲𝓶𝓮 𝓯𝓸𝓻 𝓪𝓵𝓵 𝓰𝓸𝓸𝓭 𝓶𝓮𝓷.
ℕ𝕠𝕨 𝕚𝕤 𝕥𝕙𝕖 𝕥𝕚𝕞𝕖 𝕗𝕠𝕣 𝕒𝕝𝕝 𝕘𝕠𝕠𝕕 𝕞𝕖𝕟.
𝙽𝚘𝚠 𝚒𝚜 𝚝𝚑𝚎 𝚝𝚒𝚖𝚎 𝚏𝚘𝚛 𝚊𝚕𝚕 𝚐𝚘𝚘𝚍 𝚖𝚎𝚗.

Source: http://qaz.wtf/u/convert.cgi?text=Now+is+the+time+for+all+good+men.

This has become wildly popular in recent years. It would be wonderful if unidecode could transliterate these back to ASCII. Desired behavior:

const unidecode = require('unidecode');
console.log( unidecode("𝓝𝓸𝔀 𝓲𝓼 𝓽𝓱𝓮 𝓽𝓲𝓶𝓮 𝓯𝓸𝓻 𝓪𝓵𝓵 𝓰𝓸𝓸𝓭 𝓶𝓮𝓷.") );
// Desired Output: Now is the time for all good men.

Much thanks!
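
One possible workaround, offered as a sketch rather than a feature of unidecode: these styled letters have Unicode compatibility decompositions to plain letters, so the built-in String.prototype.normalize('NFKC') folds them back to ASCII. The helper name foldStyledLetters is hypothetical, and NFKC also rewrites other compatibility characters (ligatures, full-width forms), which may or may not be wanted.

```javascript
// Hypothetical helper, not part of unidecode: fold "font-like" Unicode letters
// to their plain equivalents using the built-in NFKC normalization.
function foldStyledLetters(text) {
  return String(text).normalize('NFKC');
}

// "Now" written with mathematical bold script letters (U+1D4DD, U+1D4F8, U+1D500).
const styled = '\u{1D4DD}\u{1D4F8}\u{1D500}';
console.log(foldStyledLetters(styled)); // prints "Now"
```

The folded string can then be passed to unidecode as usual to handle any remaining non-ASCII characters.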

Error: Cannot find module './data/x1d.js'

H = 29

Support Custom Substitution

Hi @FGRibreau!

First of all, I love this module and use it extensively - so thank you for putting it together!

I'm working with some legacy systems that are very picky about the length of strings, and the problem I'm having with this module is that untranslatable characters are stripped out from the string, which changes the length and makes it invalid for the legacy systems I work with.

What I'd like to be able to do is supply a custom substitution value to use when no valid translation is found, for example a space character or an underscore. If I could do that, the length would be maintained and I'd be able to use this module for my projects again.

I'll submit a PR in a minute for the change so you can take a look.

TypeError: Cannot read property 'replace' of undefined

Full error:
C:\Users\Miha\MBot\node_modules\unidecode\unidecode.js:20
return str.replace(utf8_rx, unidecode_internal_replace);
^

TypeError: Cannot read property 'replace' of undefined
at module.exports (C:\Users\Miha\MBot\node_modules\unidecode\unidecode.js:20:14)
at module.exports (C:\Users\Miha\MBot\handlers\responses.js:14:26)
at C:\Users\Miha\MBot\index.js:27:85
at Array.forEach (<anonymous>)
at Object.<anonymous> (C:\Users\Miha\MBot\index.js:27:46)
at Module._compile (internal/modules/cjs/loader.js:1158:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:1178:10)
at Module.load (internal/modules/cjs/loader.js:1002:32)
at Function.Module._load (internal/modules/cjs/loader.js:901:14)
at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:74:12)

code part:
const cleanmsg = unidecode(message.content).toLowerCase().replace(/0/g,"o");
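
The trace suggests that unidecode received undefined (here, presumably a message without content) and called .replace on it. A minimal defensive sketch follows. makeSafeTransliterator is a hypothetical wrapper, not part of the unidecode API, and treating missing input as an empty string is an assumption about what the caller wants.

```javascript
// Hypothetical wrapper: coerce missing input to '' before transliterating.
// `transliterate` is expected to be the function exported by unidecode.
function makeSafeTransliterator(transliterate) {
  return function (value) {
    if (value === undefined || value === null) {
      return '';
    }
    return transliterate(String(value));
  };
}

// Usage, assuming the unidecode package is installed:
//   const safeUnidecode = makeSafeTransliterator(require('unidecode'));
//   const cleanmsg = safeUnidecode(message.content).toLowerCase().replace(/0/g, 'o');
```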

Multi-pass required to correctly unidecode

In some cases, several passes are needed to convert a string completely to ASCII. Take this example:

var unidecode = require('unidecode')
var s = 'Roc\u00C3\u00ADo Mart\u00C3\u00ADn-Valero'; // each Ã (U+00C3) is followed by an invisible soft hyphen (U+00AD)
console.log(unidecode(s)); // prints RocÃo MartÃn-Valero (the soft hyphens are removed), but the result is still not ASCII
console.log(unidecode(unidecode(s))); // a second pass prints RocAo MartAn-Valero

Here is the hexdump of the above string:

00000000  52 6f 63 c3 83 c2 ad 6f  20 4d 61 72 74 c3 83 c2  |Roc....o Mart...|
00000010  ad 6e 2d 56 61 6c 65 72  6f 0a                    |.n-Valero.|
0000001a

So it seems it can't convert the two sequences c3 83 (Ã) and c2 ad (soft hyphen) when they are back to back.
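
As a stopgap, the double call above generalizes to repeating the conversion until the output stops changing. The sketch below is a workaround under that assumption, not a fix for the underlying matching. unidecodeUntilStable and maxPasses are hypothetical names.

```javascript
// Hypothetical helper: re-apply a transliteration function until the result
// no longer changes, or until maxPasses is reached.
function unidecodeUntilStable(transliterate, input, maxPasses = 5) {
  let current = String(input);
  for (let pass = 0; pass < maxPasses; pass++) {
    const next = transliterate(current);
    if (next === current) {
      break; // fixed point reached
    }
    current = next;
  }
  return current;
}

// Usage, assuming the unidecode package is installed:
//   const unidecode = require('unidecode');
//   unidecodeUntilStable(unidecode, 'Roc\u00C3\u00ADo Mart\u00C3\u00ADn-Valero');
```

The pass limit keeps the loop bounded even if a transliteration function never settles.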

Korean: ㅊ should be romanized to "ch", but it's romanized to "c"

Korean Romanization is Wrong · Issue #19907 · TryGhost/Ghost

I found that the corresponding code is this: https://github.com/FGRibreau/node-unidecode/blob/master/data/xcd.js

But I can't understand how the array is laid out, so I'll stop digging into it here.
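
For readers in the same position, the file name appears to encode the high byte of the character's code point, with the array inside indexed by the low byte. That layout is an assumption inferred from the x??.js naming and the Perl module the files derive from, not something confirmed in this thread. locateTableEntry is a hypothetical helper that covers only code points up to U+FFFF.

```javascript
// Hypothetical helper: guess which data file and array index would hold the
// transliteration of a character, assuming files are named after the high
// byte of the code point and indexed by the low byte. BMP characters only.
function locateTableEntry(ch) {
  const codePoint = ch.codePointAt(0);
  if (codePoint > 0xffff) {
    throw new RangeError('only code points up to U+FFFF are handled in this sketch');
  }
  const high = codePoint >> 8;  // selects the file, e.g. 0xcd
  const low = codePoint & 0xff; // selects the entry within that file
  return {
    file: 'x' + high.toString(16).padStart(2, '0') + '.js',
    index: low,
  };
}

// The Hangul syllable 최 (U+CD5C) begins with the consonant ㅊ.
console.log(locateTableEntry('\uCD5C')); // { file: 'xcd.js', index: 92 }
```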

Using String#codePointAt()?

I noticed that the code converts UTF-8 to UTF-16 manually; however, String#codePointAt() has been standardized for quite a while.

I think switching to the built-in version should give unidecode a huge performance boost.
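
To illustrate the suggestion, the sketch below walks a string by code point using only built-ins, so surrogate pairs are combined without manual decoding. It shows the iteration only. toCodePoints is a hypothetical name, and whether this is actually faster inside unidecode would need measuring.

```javascript
// Hypothetical helper: list the code points of a string. The string iterator
// used by for...of yields whole code points, so a surrogate pair counts once.
function toCodePoints(text) {
  const result = [];
  for (const ch of text) {
    result.push(ch.codePointAt(0));
  }
  return result;
}

console.log(toCodePoints('a\u00E9\u{1D4DD}')); // [ 97, 233, 120029 ]
```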

Option to customise or add mappings?

Hey @FGRibreau, this is just a quick question about something for the future. As you know, Ghost uses unidecode as the basis for its slugification code, and I was thinking about the future of that feature and what we might do with it.

How would you feel about having options added to node-unidecode, so that it was easy to pass in a set of custom mappings?

I was also wondering about whether it would be possible to group the characters into their logical groups with names, to make it possible to enable/disable conversion of certain chars.

These are just ideas for changes that would help with a couple of things in Ghost, although there may be better solutions. We have been getting requests not to convert certain characters for slugs, as URLs can now contain non-ASCII characters. We also get requests to change the mappings. I remember one about umlauts in German, and where it did or didn't make sense to transliterate with an additional 'e'. That's a bit vague, I know, but the upshot is that it might be nice, one day in the future, to provide user-configurable mappings in Ghost.

Also, I'm tentatively thinking about splitting this bit of the codebase out into its own npm module, so that others can use it.

I was taking a look at another module which does slugification, https://github.com/dodo/node-slug. However, it doesn't do quite the same things as the Ghost one, and because it has its own mappings, it's not as complete as ours, which is based on unidecode.
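
One way the idea could look, sketched as a wrapper rather than a change to unidecode itself: apply user-supplied replacements first, then hand the rest to the regular transliteration. The function name, the shape of the mappings object, and the German example are all hypothetical.

```javascript
// Hypothetical wrapper: apply custom single-character mappings before the
// default transliteration. `transliterate` would be the unidecode function.
function withCustomMappings(transliterate, mappings) {
  return function (text) {
    let replaced = '';
    for (const ch of String(text)) {
      const hasOverride = Object.prototype.hasOwnProperty.call(mappings, ch);
      replaced += hasOverride ? mappings[ch] : ch;
    }
    return transliterate(replaced);
  };
}

// Usage, assuming the unidecode package is installed:
//   const german = withCustomMappings(require('unidecode'), { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue' });
//   german('Müller');
```

Keeping the overrides outside the library means the default tables stay untouched, at the cost of one extra pass over the string.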

var unidecode = require('unidecode');
unidecode('SȾÁ,SEN') // returns 'S[?]A,SEN'

todo-note about test in readme

The package already has tests. Is this note outdated?