
djy's Introduction


What's this?

This is a library of character utility functions for Clojure, inspired by useful built-in string and character libraries from other languages, most significantly Haskell's Data.Char library.

It is currently somewhat cumbersome to work with characters in Clojure. Complicating matters is the inherent complexity of dealing with supplementary characters on the JVM. Java characters are 16-bit, so only code points in the range U+0000 to U+FFFF, known as the Basic Multilingual Plane (BMP), can be expressed as single characters. Unicode has since expanded beyond that range, so some characters need more than 16 bits. Java represents these supplementary characters as pairs of 16-bit characters (surrogate pairs), for a combined total of 32 bits.
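To make the surrogate-pair representation concrete, here is a minimal sketch that uses only java.lang.Character and java.lang.String through interop. The function name utf16-info is made up for this illustration and is not part of this library.

```clojure
;; Hypothetical helper (not part of djy): describes how the JVM
;; stores a single Unicode code point.
(defn utf16-info
  [cp]
  (let [units (Character/toChars (int cp))   ; UTF-16 encoding as a char[]
        s     (String. units)]
    {:supplementary? (Character/isSupplementaryCodePoint (int cp))
     :char-count     (count s)                        ; 16-bit chars
     :code-points    (.codePointCount s 0 (count s))  ; actual characters
     :units          (mapv int units)}))              ; numeric value of each char

(utf16-info 97)
;; => {:supplementary? false, :char-count 1, :code-points 1, :units [97]}

(utf16-info 135641)
;; => {:supplementary? true, :char-count 2, :code-points 1, :units [55364 56793]}
```

The second result shows the mismatch this library tries to hide: one character, but two Java chars (a high surrogate followed by a low surrogate).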

This library aims to provide convenient wrappers for standard Java Character library functions, as well as some new utility functions to facilitate working with characters.

Many of these functions are polymorphic in nature, by way of a HasCodePoint protocol exposing a code-point-of function, which can take as an argument a character, an integer representing a Unicode code point, or a string beginning with a supplementary character (i.e. two 16-bit Java characters). This allows us to work with BMP and supplementary characters without having to think about whether they are BMP or supplementary -- they're just characters™.
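A protocol of this shape could look roughly like the sketch below. It reuses the names HasCodePoint and code-point-of from the description above, but the bodies are assumptions written for illustration, not the library's actual source.

```clojure
;; Sketch only: one possible protocol-based code-point-of.
(defprotocol HasCodePoint
  (code-point-of [x] "Returns the Unicode code point of x."))

(extend-protocol HasCodePoint
  Character
  (code-point-of [c] (int c))
  ;; Clojure integer literals are Longs, while boxed ints coming back
  ;; from Java interop are Integers, so both are covered here.
  Integer
  (code-point-of [n] (int n))
  Long
  (code-point-of [n] (int n))
  ;; Only the first code point of the string is considered.
  String
  (code-point-of [s] (.codePointAt ^String s 0)))

(code-point-of \a)    ; => 97
(code-point-of 97)    ; => 97
(code-point-of (String. (Character/toChars (int 135641)))) ; => 135641
```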

Among the new utility functions is char' (by analogy with clojure.core's +' and other "enhanced" arithmetic operators that support arbitrary precision), an extension of clojure.core/char that will return a string containing a supplementary character if provided with a code point above U+FFFF, e.g. (char' 135641) => "𡇙"
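The behavior described for char' can be approximated in a few lines of interop. The sketch below uses the made-up name char-or-string so that it is not mistaken for the library's own implementation, which may differ.

```clojure
;; Hypothetical stand-in for char': returns a Character for BMP code
;; points and a two-char String for supplementary code points.
(defn char-or-string
  [cp]
  (let [cp (int cp)]
    (if (Character/isSupplementaryCodePoint cp)
      (String. (Character/toChars cp))
      (char cp))))

(char-or-string 97)               ; => \a
(count (char-or-string 135641))   ; => 2, i.e. one surrogate pair
```

Returning two different types from one function is exactly what makes the polymorphic code-point-of useful: callers can pass either result back in without checking which one they got.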

Another convenient function is char-range, which returns the range (inclusive) between two characters, e.g. (char-range \a \z) => (\a \b \c ... \x \y \z). This provides a concise, readable syntax for representing ranges of characters, as compared to, e.g., (map char (range (int \a) (inc (int \z)))). As a bonus, this function also supports supplementary characters, as it uses char' internally.
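As a rough sketch of how such a range function can be built on top of a code-point conversion, consider the following. All three names are hypothetical helpers for this example, and the real char-range may be implemented differently.

```clojure
;; Hypothetical helper: code point of a char, an integer, or a string
;; that starts with the character of interest.
(defn- ->code-point [x]
  (if (string? x)
    (.codePointAt ^String x 0)
    (int x)))

;; Hypothetical helper: Character for BMP, String for supplementary.
(defn- code-point->char [cp]
  (if (Character/isSupplementaryCodePoint (int cp))
    (String. (Character/toChars (int cp)))
    (char cp)))

(defn char-range-sketch
  "Lazy seq of every character from start to end, inclusive."
  [start end]
  (map code-point->char
       (range (->code-point start) (inc (->code-point end)))))

(char-range-sketch \a \e)   ; => (\a \b \c \d \e)
```

A range that crosses U+FFFF simply switches from Characters to Strings partway through, for example (char-range-sketch (char 0xFFFE) 0x10001) yields two Characters followed by two Strings.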

My hope is that this library will end up in clojure.contrib or (my pipe dream) as a part of Clojure proper as "clojure.char."

Any feedback and suggestions would be very welcome -- feel free to join the discussion going on in the Clojure dev Google group.

Enjoy!

- Dave Yarwood, 10/8/14

To do:

  • Remedy potential performance issues caused by dynamic type introspection, as noted by Mikera.

djy's Issues

Benchmarking with Criterium

Use Criterium to build a range of benchmarks to test the performance of this library's functions vs. their pure Java equivalents.

Examples of things to benchmark:

  • Find some large body of text that will likely contain at least some supplementary characters, e.g. a Chinese blog. Break the string of text down into characters using char-seq. Get the code-point-of each character.
  • Convert a large number of integers in the range 0 - 1114111 (the BMP and supplementary ranges combined) to characters using char'. The current implementation of char' relies on dynamic type inspection to determine if the argument is a character or a string (supplementary characters have to be represented in string form), so it will be interesting to see if this function's performance can be improved (see issue #2).
  • Benchmark char-range for large ranges of characters in the BMP range, the Supplementary range, and ranges spanning both.
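One way the second benchmark could be set up is sketched below. The names all-code-points and java-baseline are invented for this example. The Criterium calls sit inside a comment form because they need Criterium (and, for the last one, this library required as c) on the classpath, which is assumed here rather than shown.

```clojure
;; Every code point from 0 to 1114111, skipping the surrogate block
;; U+D800..U+DFFF, which does not hold characters of its own.
(def all-code-points
  (vec (remove #(<= 0xD800 % 0xDFFF) (range 0 1114112))))

(defn java-baseline
  "Converts each code point to a String with plain Java interop and
  returns the total number of 16-bit chars produced, so the work
  cannot be optimized away."
  [cps]
  (reduce (fn [total cp]
            (+ total (.length (String. (Character/toChars (int cp))))))
          0
          cps))

(comment
  ;; Assumes Criterium is available as a dependency.
  (require '[criterium.core :as crit])

  ;; Pure Java baseline.
  (crit/quick-bench (java-baseline all-code-points))

  ;; Library under test, assuming (require '[djy.char :as c]).
  (crit/quick-bench (doseq [cp all-code-points] (c/char' cp))))
```

quick-bench gives a fast first impression; Criterium's bench macro runs longer and is the better choice once the benchmarks have settled.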

`code-point-of` with Integers

char-seq doesn't seem to work for me. This is on Clojure 1.6, Java 1.7.

user=> (require '[djy.char :as c])
nil
user=> (c/char' 120121)
"𝔹"
user=> (c/char-seq (c/char' 120121))
IllegalArgumentException No method in multimethod 'code-point-of' for dispatch value: class java.lang.Integer  clojure.lang.MultiFn.getFn (MultiFn.java:160)

Adding a defmethod to code-point-of for java.lang.Integer fixes the issue for me locally.
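The error can be reproduced with a small stand-in multimethod. The sketch below is a reconstruction based only on the error message (dispatch on class, no method for java.lang.Integer); it is not the library's source. The likely trigger is that int values boxed on their way back from Java interop arrive as java.lang.Integer, while integer literals typed at the REPL are java.lang.Long.

```clojure
;; Stand-in multimethod, dispatching on the argument's class.
(defmulti code-point-of class)

(defmethod code-point-of Character [c] (int c))
(defmethod code-point-of String [^String s] (.codePointAt s 0))

;; Helper for this example: true if the call fails to dispatch.
(defn dispatch-fails? [x]
  (try
    (code-point-of x)
    false
    (catch IllegalArgumentException _ true)))

;; (int ...) is boxed as java.lang.Integer when passed to a function.
(def failed-before-fix? (dispatch-fails? (int 120121)))  ; true

;; The local fix described above: add a method for Integer.
(defmethod code-point-of Integer [n] n)

(code-point-of (int 120121))   ; => 120121
```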

Write more tests

Write tests covering the rest of the API functions. So far I have only had time to write tests for the code-point-of multimethod -- see char_test.clj to get an idea of the kind of generative tests I want to write.

Maybe consider whether test.check might be helpful for this.
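As an illustration of the kind of property worth checking, here is a dependency-free sketch that uses clojure.test with random sampling instead of test.check. The helpers code-point->str and random-code-point are made up for this example and exercise plain Java interop; the library's own functions would be swapped in for a real test.

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

;; Hypothetical helper: String holding the single code point cp.
(defn code-point->str [cp]
  (String. (Character/toChars (int cp))))

;; Hypothetical helper: random code point outside the surrogate block.
(defn random-code-point []
  (let [cp (rand-int 0x110000)]
    (if (<= 0xD800 cp 0xDFFF)
      (recur)
      cp)))

(deftest code-point-round-trip
  (dotimes [_ 1000]
    (let [cp (random-code-point)
          s  (code-point->str cp)]
      ;; Converting to a string and back must give the same code point.
      (is (= cp (.codePointAt ^String s 0)))
      ;; BMP code points take one char, supplementary ones take two.
      (is (= (Character/charCount (int cp)) (count s))))))

(run-tests)
```

With test.check, the same round-trip property would be expressed once and the library would take care of generating and shrinking the inputs.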

Improve performance

This library's functions currently work for both BMP and supplementary characters by using a multimethod internally to determine whether the argument is a character, an integer or a string.

As Mikera brought up in the Google group discussion, this will probably cause significant performance issues, since we're relying on dynamic type inspection for each individual character processed.

Some careful thought is needed about how to make this library both easy to use and as fast as it needs to be for the performance-sensitive domains where it will likely be used (parsing, text analysis, data conversion).

Per Mikera, we would want to plan Clojure language changes to be aligned with this goal:

Examples of language changes that might help:

  • More analysis of argument types at call sites to use primitive functions / short-circuit multi-methods / eliminate type checks where possible
  • Suppression of the warnings that we currently get when replacing a core function (I'm thinking clojure.core/char here)
  • Compiler macros?

Ideas:

  • Split the library into two separate namespaces, allowing/forcing the programmer to choose between BMP-specific functions and the dynamic (but less performant) functions that can also accept supplementary characters (in string form). This would only partially solve the problem, as it would give better performance when working only with BMP characters, but the same issues would still exist when working with supplementary characters.
  • Replace the current multimethod with a protocol-based approach. This should significantly improve performance compared to the multimethod, but would not be an ideal solution. (Done.)
  • Consider if this problem might be solved somehow by writing lower-level Java methods and giving them Clojure wrappers? The idea would be to make this library more "static" for performance purposes.
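To give a feel for what a more "static" code path might look like, the sketch below walks a string by code point using a type hint and a primitive loop, so no per-character dispatch is involved. The function name code-points is hypothetical, and this is only one possible direction, not a decided design.

```clojure
;; Sketch: collect the code points of a string without any dynamic
;; type inspection. The ^String hint lets the compiler call
;; codePointAt directly instead of using reflection.
(defn code-points
  [^String s]
  (let [n (.length s)]
    (loop [i 0
           acc (transient [])]
      (if (< i n)
        (let [cp (.codePointAt s i)]
          ;; Advance by 1 for BMP characters, 2 for surrogate pairs.
          (recur (+ i (Character/charCount cp))
                 (conj! acc cp)))
        (persistent! acc)))))

(code-points "abc")   ; => [97 98 99]
```

The trade-off is the one raised in the ideas above: this version is fast because it only accepts strings, so the convenience of passing characters, integers, or strings interchangeably is lost on this path.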
