
fgribreau / node-unidecode


ASCII transliterations of Unicode text

Home Page: http://twitter.com/FGRibreau

License: BSD 3-Clause "New" or "Revised" License

JavaScript 99.54% Shell 0.46%

node-unidecode's Introduction

Unidecode for NodeJS



Unidecode is a JavaScript port of the Perl module Text::Unidecode. It takes UTF-8 data and tries to represent it in US-ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F). The representation is almost always an attempt at transliteration, i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system.

See Text::Unidecode for the original README file, including methodology and limitations.

Note that all the files named 'x??.js' in data are derived directly from the equivalent Perl files, and both sets of files are distributed under the Perl license, not the BSD license.

Installation

$ npm install unidecode

Usage

$ node
> var unidecode = require('unidecode');
> unidecode("aéà)àçé");
'aea)ace'
> unidecode("に間違いがないか、再度確認してください。再読み込みしてください。");
'niJian Wei iganaika, Zai Du Que Ren sitekudasai. Zai Du miIp misitekudasai. '

Advanced Usage

Custom Substitution Values

For values that cannot be translated, empty strings are returned. You can override this behavior by passing a custom substitution value as the second argument to unidecode:

$ node
> var unidecode = require('unidecode');
> unidecode("ab\uFFFFc", "X");
'abXc'
> unidecode("ab\uFFFFc");
'abc'

Donate

I maintain this project in my free time. If it helped you, please support my work via PayPal or bitcoins. Thanks a lot!

I accept pull requests!

node-unidecode's People

Contributors

amagid, c089, commenthol, fgribreau, halfdan, iamstarkov, medikoo

node-unidecode's Issues

Feature Request: Transliterate Unicode "Fonts"

Using special Unicode glyphs, people can produce custom "fonts" such as:

  • 𝐍𝐨𝐰 𝐢𝐬 𝐭𝐡𝐞 𝐭𝐢𝐦𝐞 𝐟𝐨𝐫 𝐚𝐥𝐥 𝐠𝐨𝐨𝐝 𝐦𝐞𝐧.
  • 𝓝𝓸𝔀 𝓲𝓼 𝓽𝓱𝓮 𝓽𝓲𝓶𝓮 𝓯𝓸𝓻 𝓪𝓵𝓵 𝓰𝓸𝓸𝓭 𝓶𝓮𝓷.
  • ℕ𝕠𝕨 𝕚𝕤 𝕥𝕙𝕖 𝕥𝕚𝕞𝕖 𝕗𝕠𝕣 𝕒𝕝𝕝 𝕘𝕠𝕠𝕕 𝕞𝕖𝕟.
  • 𝙽𝚘𝚠 𝚒𝚜 𝚝𝚑𝚎 𝚝𝚒𝚖𝚎 𝚏𝚘𝚛 𝚊𝚕𝚕 𝚐𝚘𝚘𝚍 𝚖𝚎𝚗.

Source: http://qaz.wtf/u/convert.cgi?text=Now+is+the+time+for+all+good+men.

This has become wildly popular in recent years. It would be wonderful if unidecode could transliterate these back to ASCII. Desired behavior:

const unidecode = require('unidecode');
console.log( unidecode("𝓝𝓸𝔀 𝓲𝓼 𝓽𝓱𝓮 𝓽𝓲𝓶𝓮 𝓯𝓸𝓻 𝓪𝓵𝓵 𝓰𝓸𝓸𝓭 𝓶𝓮𝓷.") );
// Desired Output: Now is the time for all good men.

Much thanks!
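
One possible workaround, independent of unidecode, is Unicode compatibility normalization, which is built into JavaScript strings. The sketch below assumes the styled letters come from the Mathematical Alphanumeric Symbols and Letterlike Symbols blocks, whose characters have compatibility mappings to plain letters. `foldStyledLetters` is a hypothetical helper name, not part of the unidecode API.

```javascript
// Hypothetical helper (not part of unidecode): fold "font-like" Unicode
// letters to their plain forms via compatibility normalization (NFKC).
function foldStyledLetters(text) {
  return text.normalize('NFKC');
}

const samples = [
  '𝐍𝐨𝐰 𝐢𝐬 𝐭𝐡𝐞 𝐭𝐢𝐦𝐞 𝐟𝐨𝐫 𝐚𝐥𝐥 𝐠𝐨𝐨𝐝 𝐦𝐞𝐧.',
  '𝓝𝓸𝔀 𝓲𝓼 𝓽𝓱𝓮 𝓽𝓲𝓶𝓮 𝓯𝓸𝓻 𝓪𝓵𝓵 𝓰𝓸𝓸𝓭 𝓶𝓮𝓷.',
  'ℕ𝕠𝕨 𝕚𝕤 𝕥𝕙𝕖 𝕥𝕚𝕞𝕖 𝕗𝕠𝕣 𝕒𝕝𝕝 𝕘𝕠𝕠𝕕 𝕞𝕖𝕟.',
  '𝙽𝚘𝚠 𝚒𝚜 𝚝𝚑𝚎 𝚝𝚒𝚖𝚎 𝚏𝚘𝚛 𝚊𝚕𝚕 𝚐𝚘𝚘𝚍 𝚖𝚎𝚗.',
];

// Each line prints: Now is the time for all good men.
for (const sample of samples) {
  console.log(foldStyledLetters(sample));
}
```

Note that NFKC also rewrites other compatibility characters (ligatures, full-width forms, and so on), so it may change more than the styled letters. Look-alike glyphs borrowed from other scripts have no such mapping and would pass through unchanged.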

Support Custom Substitution

Hi @FGRibreau!

First of all, I love this module and use it extensively - so thank you for putting it together!

I'm working with some legacy systems that are very picky about the length of strings, and the problem I'm having with this module is that untranslatable characters are stripped out from the string, which changes the length and makes it invalid for the legacy systems I work with.

What I'd like to be able to do is supply a custom substitution value to use when no valid translation is found, for example a space character or an underscore. If I could do that, the length would be maintained and I'd be able to use this module for my projects again.

I'll submit a PR in a minute for the change so you can take a look.

TypeError: Cannot read property 'replace' of undefined

Full error:
C:\Users\Miha\MBot\node_modules\unidecode\unidecode.js:20
return str.replace(utf8_rx, unidecode_internal_replace);
^

TypeError: Cannot read property 'replace' of undefined
    at module.exports (C:\Users\Miha\MBot\node_modules\unidecode\unidecode.js:20:14)
    at module.exports (C:\Users\Miha\MBot\handlers\responses.js:14:26)
    at C:\Users\Miha\MBot\index.js:27:85
    at Array.forEach (<anonymous>)
    at Object.<anonymous> (C:\Users\Miha\MBot\index.js:27:46)
    at Module._compile (internal/modules/cjs/loader.js:1158:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1178:10)
    at Module.load (internal/modules/cjs/loader.js:1002:32)
    at Function.Module._load (internal/modules/cjs/loader.js:901:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:74:12)

code part:
const cleanmsg = unidecode(message.content).toLowerCase().replace(/0/g,"o");
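
The trace suggests that `unidecode` was called with `undefined`, which would happen if `message.content` is not set for some messages. That is an inference from the error, not a confirmed diagnosis. A minimal defensive sketch follows. `cleanMessage` is a hypothetical helper, and the `transliterate` parameter stands in for the `unidecode` function so that the example runs on its own.

```javascript
// Hypothetical helper: validate the input before transliterating.
// `transliterate` stands in for unidecode and is passed in so this
// example is self-contained.
function cleanMessage(content, transliterate) {
  if (typeof content !== 'string') {
    return ''; // nothing to clean, e.g. a message without text content
  }
  return transliterate(content).toLowerCase().replace(/0/g, 'o');
}

// Identity stand-in used only for this demonstration.
const identity = (s) => s;

console.log(cleanMessage(undefined, identity)); // prints an empty line
console.log(cleanMessage('HELL0', identity));   // prints hello
```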

Multi-pass required to correctly unidecode

In some cases several passes are needed to convert a string completely to ASCII. Take this example:

var unidecode = require('unidecode')
var s = 'Roc\u00C3\u00ADo Mart\u00C3\u00ADn-Valero'; // \u00AD is a soft hyphen: a hidden "-" after both Ã's if you paste the raw string in a console
console.log(unidecode(s))  // prints RocÃo MartÃn-Valero (removes the soft hyphen), but still not ASCII
console.log(unidecode(unidecode(s))) // a second pass prints RocAo MartAn-Valero

Here is the hexdump of the above string:

00000000  52 6f 63 c3 83 c2 ad 6f  20 4d 61 72 74 c3 83 c2  |Roc....o Mart...|
00000010  ad 6e 2d 56 61 6c 65 72  6f 0a                    |.n-Valero.|
0000001a

So it seems it can't convert the two sequences c3 83 and c2 ad when they are back to back.
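
The byte pattern in the hexdump (c3 83 c2 ad) is what UTF-8 text looks like after it has been decoded as Latin-1 and encoded as UTF-8 a second time. Assuming that is what happened to this input, which the report does not confirm, the text could be repaired before transliteration. `repairLatin1Mojibake` is a hypothetical helper name, and the sketch relies on Node's global `Buffer`.

```javascript
// The reported string, written with escapes so the soft hyphen (U+00AD)
// after each "Ã" (U+00C3) is visible.
const s = 'Roc\u00C3\u00ADo Mart\u00C3\u00ADn-Valero';

// Its UTF-8 bytes match the start of the hexdump: 52 6f 63 c3 83 c2 ad 6f
console.log(Buffer.from(s, 'utf8').toString('hex').slice(0, 16));

// Assumption: the text is UTF-8 that was decoded as Latin-1 at some point.
// Re-encoding as Latin-1 recovers the original bytes, which are then
// decoded as UTF-8.
function repairLatin1Mojibake(text) {
  return Buffer.from(text, 'latin1').toString('utf8');
}

console.log(repairLatin1Mojibake(s)); // prints Rocío Martín-Valero
```

This is only safe when every character in the input is below U+0100, since the Latin-1 step keeps just the low byte of each code unit. It does not address the multi-pass behavior itself.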


Using String#codePointAt()?

I noticed that the code converts UTF-8 to UTF-16 manually, although String#codePointAt() has been standardized for quite a while.

I think switching to the built-in version should give unidecode a huge performance boost.
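
As a rough illustration of the idea, the sketch below walks a string by code point with `String#codePointAt()` and splits each code point into a block number and an offset. The block/offset split is an assumption suggested by the 'x??.js' data file naming, not a description of the current implementation, and both function names are hypothetical. Whether this is actually faster has not been measured here.

```javascript
// Sketch: walk a string by Unicode code point rather than by UTF-16 code unit.
function codePointsOf(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push(cp);
    i += cp > 0xffff ? 2 : 1; // astral code points occupy two code units
  }
  return result;
}

// Assumption: a table lookup would split the code point into a block
// number (high bits) and an offset within a 256-entry block (low 8 bits).
function splitCodePoint(cp) {
  return { block: cp >> 8, offset: cp & 0xff };
}

for (const cp of codePointsOf('a\u00E9\u{1D4DD}')) {
  const { block, offset } = splitCodePoint(cp);
  console.log(cp.toString(16), block.toString(16), offset.toString(16));
}
```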


Option to customise or add mappings?

Hey @FGRibreau, this is just a quick question about something for the future. As you know, Ghost uses unidecode as the basis for its slugification code, and I was thinking about the future of that feature and what we might do with it.

How would you feel about having options added to node-unidecode, so that it was easy to pass in a set of custom mappings?

I was also wondering about whether it would be possible to group the characters into their logical groups with names, to make it possible to enable/disable conversion of certain chars.

These are just ideas for changes that would help with a couple of things in Ghost, although there may be better solutions. We have been getting requests not to convert certain characters for slugs, as URLs can now contain non-ASCII characters. We also get requests to change the mappings. I remember one to do with umlauts in German, and where it did or didn't make sense to transliterate with an additional 'e'. That's a bit vague, I know, but the upshot is that it might be nice, one day in the future, to provide user-configurable mappings in Ghost.

Also, I'm tentatively thinking about splitting this bit of the codebase out into its own npm module, so that others can use it.

I was taking a look at another module which does slugification, https://github.com/dodo/node-slug. However, it doesn't do quite the same things as the Ghost one, and as it has its own mappings, it's not as complete as ours, which is based on unidecode.
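
To make the custom-mappings idea concrete, here is a minimal sketch of one way it could look from the caller's side: user-supplied overrides are applied first, and everything else goes to a general transliterator. All names are hypothetical and none of this is an existing unidecode option. The `fallback` parameter stands in for `unidecode`, and the demo uses a simple accent-stripping stand-in so the example runs on its own.

```javascript
// Sketch: apply user-supplied mappings first, then fall back to a general
// transliterator for everything else. `fallback` stands in for unidecode.
function makeTransliterator(overrides, fallback) {
  return function transliterate(text) {
    let out = '';
    for (const ch of text) { // for...of iterates by code point
      out += Object.prototype.hasOwnProperty.call(overrides, ch)
        ? overrides[ch]
        : fallback(ch);
    }
    return out;
  };
}

// Example overrides: German umlauts with an additional 'e'.
const german = { '\u00E4': 'ae', '\u00F6': 'oe', '\u00FC': 'ue', '\u00DF': 'ss' };

// Stand-in fallback for this demo only: strips combining marks after NFD.
const stripMarks = (s) => s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');

const toSlugText = makeTransliterator(german, stripMarks);
console.log(toSlugText('M\u00FCller caf\u00E9')); // prints Mueller cafe
```

The overrides here are keyed by single precomposed characters, so input in decomposed form would need to be normalized to NFC first. Calling the fallback one character at a time also assumes it does not depend on surrounding context.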



Some characters get rendered as [?]

The README states that characters that can't be transliterated are turned into the empty string, but some characters seem to return [?] instead:

var unidecode = require('unidecode');
unidecode('SȾÁ,SEN') // returns 'S[?]A,SEN'
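
Until the behavior is made consistent, a caller could normalize the placeholder on their side. The sketch below transliterates one code point at a time and swaps a `[?]` result for a chosen substitute, so a literal "[?]" typed in the input is left alone. `transliterateStrict` is a hypothetical helper, and `fakeTransliterate` is a stand-in that only mimics the reported behavior for 'Ⱦ' (U+023E). In real use the `unidecode` function would be passed instead.

```javascript
// Hypothetical helper: transliterate per code point and replace the
// '[?]' placeholder with a caller-chosen substitute.
function transliterateStrict(text, transliterate, substitute = '') {
  let out = '';
  for (const ch of text) { // for...of iterates by code point
    const t = transliterate(ch);
    out += t === '[?]' ? substitute : t;
  }
  return out;
}

// Stand-in for this demo only: mimics the reported '[?]' result for U+023E
// and returns every other character unchanged.
const fakeTransliterate = (ch) => (ch === '\u023E' ? '[?]' : ch);

console.log(transliterateStrict('S\u023EA,SEN', fakeTransliterate));      // prints SA,SEN
console.log(transliterateStrict('S\u023EA,SEN', fakeTransliterate, '_')); // prints S_A,SEN
console.log(transliterateStrict('why [?]', fakeTransliterate));           // prints why [?]
```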
