cl-babel / babel
Babel is a charset encoding/decoding library, not unlike GNU libiconv, written in pure Common Lisp.
Home Page: http://common-lisp.net/project/babel
License: Other
As far as I can tell, only the encoding part of an external format is actually used when converting to octets; the :eol-style appears to be ignored:
CL-USER> (babel:string-to-octets "foo
bar
baz" :encoding (babel:make-external-format :ascii :eol-style :crlf))
=> #(102 111 111 10 98 97 114 10 98 97 122)
Am I misunderstanding how babel is meant to be used?
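If the :eol-style really is not applied by string-to-octets, one possible workaround is to expand the line endings before encoding. The sketch below uses only standard Common Lisp. lf-to-crlf is a hypothetical helper name, not part of babel, and it assumes the input uses bare #\Newline line endings.

```lisp
;; Hypothetical helper (not part of babel): expand each #\Newline into
;; #\Return #\Newline so the encoded octets carry CRLF line endings.
;; Assumes the input does not already contain #\Return characters.
(defun lf-to-crlf (string)
  (with-output-to-string (out)
    (loop for ch across string
          do (when (char= ch #\Newline)
               (write-char #\Return out))
             (write-char ch out))))
```

The result can then be passed to babel:string-to-octets with a plain :encoding :ascii argument.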
I have the following test case that shows the GBK encoding to be broken:
echo "高怡雯" | iconv -t gbk -o /tmp/gao
# in Clozure CL
(let ((v (make-array 6 :element-type '(unsigned-byte 8))))
  (with-open-file (in "/tmp/gao"
                      :element-type '(unsigned-byte 8))
    (read-sequence v in)
    v))
→ #(184 223 226 249 246 169)
(babel:octets-to-string * :encoding :gbk)
→ "高恂霎"
Where is the proper documentation? babel.texi contains mainly "Bla bla bla, bla bla bla."
Fix babel to work with :invert readtable case
Make symbol munging use the correct case by using '#:~a-symbol-name syntax. Also fixed some references to T when they should have been t.
Signed-off-by: Jyrki Jaakkola
diff --git a/src/enc-unicode.lisp b/src/enc-unicode.lisp
index 1a8375b..1c90a19 100644
--- a/src/enc-unicode.lisp
+++ b/src/enc-unicode.lisp
@@ -520,9 +520,9 @@ code points for each invalid byte."
(check-type name keyword)
(let ((swap-var (gensym "SWAP"))
(code-point-counter-name
- (intern (format nil "~a-CODE-POINT-COUNTER" name)))
- (encoder-name (intern (format nil "~a-ENCODER" name)))
- (decoder-name (intern (format nil "~a-DECODER" name))))
+ (intern (format nil (string '#:~a-code-point-counter) (string name))))
+ (encoder-name (intern (format nil (string '#:~a-encoder) (string name))))
+ (decoder-name (intern (format nil (string '#:~a-decoder) (string name)))))
(labels ((make-bom-check-form (end start getter seq)
(if (null endianness)
``((,',swap-var
@@ -536,14 +536,14 @@ code points for each invalid byte."
(case endianness
(:le ``(,,getter ,,src ,,i 2 :le))
(:be ``(,,getter ,,src ,,i 2 :be))
- (T ``(if ,',swap-var
+ (t ``(if ,',swap-var
(,,getter ,,src ,,i 2 :re)
(,,getter ,,src ,,i 2 :ne)))))
(make-setter-form (setter code dest di)
(case endianness
(:be ``(,,setter ,,code ,,dest ,,di 2 :be))
(:le ``(,,setter ,,code ,,dest ,,di 2 :le))
- (T ``(,,setter ,,code ,,dest ,,di 2 :ne)))))
+ (t ``(,,setter ,,code ,,dest ,,di 2 :ne)))))
`(progn
(define-octet-counter ,name (getter type)
`(utf16-octet-counter ,getter ,type))
@@ -691,11 +691,11 @@ written in big-endian byte-order without a leading byte-order mark."
(check-type endianness (or null (eql :le) (eql :be)))
(let ((swap-var (gensym "SWAP"))
(code-point-counter-name
- (intern (format nil "~a-CODE-POINT-COUNTER" name)))
+ (intern (format nil (string '#:~a-code-point-counter) (string name))))
(encoder-name
- (intern (format nil "~a-ENCODER" name)))
+ (intern (format nil (string '#:~a-encoder) (string name))))
(decoder-name
- (intern (format nil "~a-DECODER" name))))
+ (intern (format nil (string '#:~a-decoder) (string name)))))
(labels ((make-bom-check-form (end start getter src)
(if (null endianness)
``(when (not (zerop (- ,,end ,,start)))
@@ -703,8 +703,8 @@ written in big-endian byte-order without a leading byte-order mark."
(#.+byte-order-mark-code+
(incf ,,start ,',bytes) nil)
(#.+swapped-byte-order-mark-code-32+
- (incf ,,start ,',bytes) T)
- (T #+little-endian T)))
+ (incf ,,start ,',bytes) t)
+ (t #+little-endian t)))
'()))
(make-setter-form (setter code dest di)
``(,,setter ,,code ,,dest ,,di ,',bytes
There are currently calls like (format-symbol t '#:~a-code-point-counter (string name))
(all in this file: https://github.com/cl-babel/babel/blob/master/src/enc-unicode.lisp#L527 ). Alexandria hands these off to format, and CCL's format raises a type-error.
Wrapping all those symbols in (string ...) will resolve the issue.
This affects the current (2013-06-15) Quicklisp release, and in turn makes compiling difficult for those who grovel.
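The effect of the (string ...) wrapper can be seen with plain format, without babel or Alexandria. The standard describes the control argument of format as a format control (a string or a function), so handing it an uninterned symbol is not portable, while converting the symbol with string first is. encoder-name-string is a hypothetical name used only for this illustration, and the example assumes the default :upcase readtable case.

```lisp
;; Assumes the standard readtable (readtable-case :upcase), under which
;; '#:~a-encoder reads as a symbol named "~A-ENCODER".
(defun encoder-name-string (name)
  "Build the name string for NAME's encoder from a symbol used as a template."
  ;; STRING turns both the template symbol and NAME into real strings,
  ;; so FORMAT receives a proper format control.
  (format nil (string '#:~a-encoder) (string name)))

;; (encoder-name-string :utf-16) returns "UTF-16-ENCODER"
```

Writing the template as a symbol means it goes through the same reader case conversion as the encoding name, which is what makes the approach work with an :invert readtable as well.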
Ideally we should implement the flexi-streams API.
The following formats have decoders that can emit #\Replacement_Character even though their encoders don't accept that: :cp1251, :iso-8859-3, :iso-8859-6, :iso-8859-7, :iso-8859-8, :iso-8859-11. :ebcdic-international has a similar issue, but with #\U+FFFF instead. :ebcdic-us seems to substitute various Latin-1 code points such as the private use characters, but from what little I know about EBCDIC, that might actually be the correct behavior.
I would expect octets-to-string output to be valid input to string-to-octets, even if chaining the two need not result in the same bytes. It's not quite clear what the behavior should be because the only other encodings in babel that run into this edge case (:cp1252, :gbk, :eucjp, :cp932) lack error checks for it entirely. I actually have a patch more or less prepared for that already, but it should be consistent with the rest.
In my opinion, signalling an error is the right thing to do when errorp is set; otherwise the ASCII substitution byte (which seems to be available in all supported encodings) could be used. decoding-error conveniently does this out of the box.
Note that this overlaps heavily with the first half of #41. Both have the same underlying issue.
Line 173 in f892d05
with-simple-vector doesn't check for arrays with fill-pointer, and calls the call-with-array-data/fast for such an array, which goes wrong (at least on LispWorks).
CL-USER 60 > (progn (setq str (make-array 10 :fill-pointer 4 :element-type 'character))
(replace str "docs")
(BABEL:STRING-TO-OCTETS str))
#(244 130 155 134 0 0 0)
It can be fixed by changing the condition to (or (adjustable-array-p ,vector) (array-has-fill-pointer-p ,vector)), and with this it works as expected:
CL-USER 63 > (progn (setq str (make-array 10 :fill-pointer 4 :element-type 'character))
(replace str "docs")
(BABEL:STRING-TO-OCTETS str))
#(100 111 99 115)
You get this problem if you do:
(ql:quickload "quri")
(quri:url-encode (quri:url-encode "docs"))
Because quri:url-encode returns a string with a fill-pointer.
I actually found it in the tests of cl-ses4, because they do this double call:
https://github.com/Jach/cl-ses4/blob/14b9dc5ffb2fe93db82312e3eefbdd4164572b71/src/canonicalize.lisp#L49
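The condition suggested above, and a caller-side workaround, can be sketched in standard Common Lisp. Both function names are hypothetical and only illustrate the idea; this is not babel's actual code.

```lisp
;; A string with a fill pointer is not a simple array, even on
;; implementations that do not report it as adjustable, so a "simple"
;; fast path has to test for both properties.
(defun needs-slow-path-p (vector)
  "Hypothetical predicate mirroring the condition suggested above."
  (or (adjustable-array-p vector)
      (array-has-fill-pointer-p vector)))

(defun simple-copy (string)
  "Caller-side workaround: COPY-SEQ returns a fresh simple vector holding
only the active elements, that is, those below the fill pointer."
  (copy-seq string))
```

Until the check is fixed, passing (simple-copy str) instead of str to babel:string-to-octets should avoid the fast path being applied to a non-simple string, at the cost of one extra copy.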
Gbk-map.lisp does not appear to be working in Lispworks 6.1.
Error: #\啊 is not of type BASE-CHAR.
So it happens more often than I'd like that I want to serialise longer pieces of text to/from an encoding. Having to round-trip through an array copy to do so is quite cumbersome. It would be great if there was instead an API that works either via callbacks, or even better, via a resumable state machine. The callback API and the current copying API could both be implemented in terms of the state machine API quite trivially, I think.
Naturally this would require refactoring most things, and is as such a big undertaking. Still, I feel like this is a very valuable feature, since having to copy megabytes if not gigabytes of text around is often not just slow, but also prohibitively taxing on memory. A state machine API would allow processing text in a streaming fashion, too, without needing to keep anything at all in memory.
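To make the idea concrete, here is a minimal sketch of what a resumable decoder could look like for UTF-8 alone. It is not an existing babel API: make-utf-8-feeder is a hypothetical name, the state is just a closure, and the sketch deliberately skips checks for overlong forms and surrogates.

```lisp
(defun make-utf-8-feeder (on-code-point)
  "Return a function of one octet. It calls ON-CODE-POINT with each completed
code point and returns the number of octets still needed to finish the
current sequence. Hypothetical sketch, UTF-8 only."
  (let ((code 0)
        (pending 0))
    (lambda (octet)
      (cond ((plusp pending)
             ;; Inside a multi-octet sequence: expect a continuation octet.
             (unless (<= #x80 octet #xBF)
               (setf pending 0)
               (error "Unexpected octet ~D inside a UTF-8 sequence." octet))
             (setf code (logior (ash code 6) (logand octet #x3F)))
             (when (zerop (decf pending))
               (funcall on-code-point code)))
            ;; Single-octet (ASCII) code point.
            ((< octet #x80) (funcall on-code-point octet))
            ;; Lead octets: remember the payload bits and how many follow.
            ((<= #xC2 octet #xDF) (setf code (logand octet #x1F) pending 1))
            ((<= #xE0 octet #xEF) (setf code (logand octet #x0F) pending 2))
            ((<= #xF0 octet #xF4) (setf code (logand octet #x07) pending 3))
            (t (error "Octet ~D cannot start a UTF-8 sequence." octet)))
      pending)))
```

Because the state lives between calls, the octets of one character may arrive in different chunks, which is the property the copying API lacks. A callback API falls out directly, and a copying API could be built by feeding a whole vector and collecting the code points.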
In some situations it would be nice to be able to decode a single character from a position in a vector, or encode a single character into a vector.
Due to a missing stream-write-string method, I get the following stack trace in sldb:
There is no applicable method for the generic function
#<STANDARD-GENERIC-FUNCTION STREAM-WRITE-STRING (5)>
when called with arguments
(#<BABEL-STREAMS:VECTOR-OUTPUT-STREAM {10086524A3}>
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">"
0 NIL).
[Condition of type SIMPLE-ERROR]
Restarts:
5: CONTINUE-ERROR-HANDLING Continue processing the error as if the debugger was not available
4: RETRY Retry calling the generic function.
3: RETRY-HANDLING-REQUEST Try again handling this HTTP request
2: *ABORT-SERVER-REQUEST Abort processing request 1 by simply closing the network socket
1: REMOVE-WORKER Stop and remove worker #<WORKER {1008D0F2B3}>
0: ABORT Abort thread (#<THREAD "http worker 0 / serving request 1 / HANDLE-LEVEL-1-ERROR / HANDLE-TOPLEVEL-ERROR" RUNNING {1008D0FAF3}>)
Backtrace:
0: (HU.DWIM.UTIL::INVOKE-SLIME-DEBUGGER #<SIMPLE-ERROR "~@<There is no applicable method for the generic function ~2I~_~S~ ..)
1: (HU.DWIM.UTIL:MAYBE-INVOKE-DEBUGGER #<SIMPLE-ERROR "~@<There is no applicable method for the generic function ~2I~_~S~ ..)
2: ((SB-PCL::EMF HU.DWIM.WEB-SERVER:HANDLE-TOPLEVEL-ERROR) #<unavailable argument> #<unavailable argument> #<HU.DWIM.WEB-SERVER:BROKER-BASED-SERVER listen: 0.0.0.0/11080, 0.0.0.0/8443; brokers: 2 {100C47..
3: ((:METHOD HU.DWIM.WEB-SERVER:HANDLE-TOPLEVEL-ERROR :AROUND (T T)) #<HU.DWIM.WEB-SERVER:BROKER-BASED-SERVER listen: 0.0.0.0/11080, 0.0.0.0/8443; brokers: 2 {100C47D493}> #<SIMPLE-ERROR "~@<There is no ..
4: ((FLET HU.DWIM.WEB-SERVER::HANDLE-REQUEST-ERROR :IN HU.DWIM.WEB-SERVER::WORKER-LOOP/SERVE-ONE-REQUEST) #<SIMPLE-ERROR "~@<There is no applicable method for the generic function ~2I~_~S~ ..)
5: ((LABELS HU.DWIM.UTIL::HANDLE-LEVEL-1-ERROR :IN HU.DWIM.UTIL::CALL-WITH-LAYERED-ERROR-HANDLERS) #<SIMPLE-ERROR "~@<There is no applicable method for the generic function ~2I~_~S~ ..)
6: (SIGNAL #<SIMPLE-ERROR "~@<There is no applicable method for the generic function ~2I~_~S~ ..)
7: (ERROR "~@<There is no applicable method for the generic function ~2I~_~S~ ..)
8: ((:METHOD NO-APPLICABLE-METHOD (T)) #<STANDARD-GENERIC-FUNCTION STREAM-WRITE-STRING (5)> #<BABEL-STREAMS:VECTOR-OUTPUT-STREAM {10086524A3}> "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http:..
9: (SB-PCL::CALL-NO-APPLICABLE-METHOD #<STANDARD-GENERIC-FUNCTION STREAM-WRITE-STRING (5)> (#<BABEL-STREAMS:VECTOR-OUTPUT-STREAM {10086524A3}> "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http:..
10: (SB-IMPL::%WRITE-STRING "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">" #<BABEL-STREAMS:VECTOR-OUTPUT-STREAM {10086524A3}> 0 NIL)
Locals:
SB-DEBUG::ARG-0 = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">"
SB-DEBUG::ARG-1 = #<BABEL-STREAMS:VECTOR-OUTPUT-STREAM {10086524A3}>
SB-DEBUG::ARG-2 = 0
SB-DEBUG::ARG-3 = NIL
That macro calls substitute to replace ? with a non-base-char in a string. This does not work if the string is a base-string. It is allowed, by the standard, for string constants to be simple-base-strings if all the characters in them are base-chars. The macro should coerce that string constant to a one dimensional simple array of characters.
This is relevant because a potential space optimization in SBCL is to make the double quote reader macro return a simple-base-string, when possible. If SBCL does this, this test will break. I understand Clasp already experienced this issue.
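The coercion suggested above might look like the following. The helper name is hypothetical, and the replacement character is left as a parameter because the one the macro actually uses is not shown here.

```lisp
(defun substitute-wide (new-char old-char string)
  "Replace OLD-CHAR by NEW-CHAR in a copy of STRING that can hold any character."
  ;; Coercing first guarantees an element type of CHARACTER, so the result
  ;; of SUBSTITUTE can hold NEW-CHAR even if STRING was read as a
  ;; simple-base-string.
  (substitute new-char old-char
              (coerce string '(simple-array character (*)))))
```

On implementations where base-char and character are the same type, the coercion is effectively a no-op, so the change should be harmless there.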
With a fresh checkout of babel from github, I get the following errors:
; file: /home/raison/quicklisp/local-projects/babel/src/enc-unicode.lisp
; in: DEFINE-UTF-16 :UTF-16
; (BABEL-ENCODINGS::DEFINE-UTF-16 :UTF-16)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UTF-16 :UTF-16))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UTF-16 :UTF-16LE
; (BABEL-ENCODINGS::DEFINE-UTF-16 :UTF-16LE :LE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UTF-16 :UTF-16LE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UTF-16 :UTF-16BE
; (BABEL-ENCODINGS::DEFINE-UTF-16 :UTF-16BE :BE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UTF-16 :UTF-16BE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UTF-32
; (BABEL-ENCODINGS::DEFINE-UCS :UTF-32 4)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UTF-32 ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UTF-32LE
; (BABEL-ENCODINGS::DEFINE-UCS :UTF-32LE 4 :LE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UTF-32LE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UTF-32BE
; (BABEL-ENCODINGS::DEFINE-UCS :UTF-32BE 4 :BE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UTF-32BE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UCS-2
; (BABEL-ENCODINGS::DEFINE-UCS :UCS-2 2 NIL 65536)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UCS-2 ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UCS-2LE
; (BABEL-ENCODINGS::DEFINE-UCS :UCS-2LE 2 :LE 65536)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UCS-2LE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UCS-2BE
; (BABEL-ENCODINGS::DEFINE-UCS :UCS-2BE 2 :BE 65536)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UCS-2BE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
I saw some similar issues and tried to apply their resolution, but it did not work.
This is SBCL 1.0.57.0.debian, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (ql:quickload "babel")
To load "babel":
Load 1 ASDF system:
babel
; Loading "babel"
; file: /home/recruiterbox/quicklisp/dists/quicklisp/software/babel-20121125-git/src/enc-unicode.lisp
; in: DEFINE-UTF-16 :UTF-16
; (BABEL-ENCODINGS::DEFINE-UTF-16 :UTF-16)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UTF-16 :UTF-16))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UTF-16 :UTF-16LE
; (BABEL-ENCODINGS::DEFINE-UTF-16 :UTF-16LE :LE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UTF-16 :UTF-16LE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UTF-16 :UTF-16BE
; (BABEL-ENCODINGS::DEFINE-UTF-16 :UTF-16BE :BE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UTF-16 :UTF-16BE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
.
; in: DEFINE-UCS :UTF-32
; (BABEL-ENCODINGS::DEFINE-UCS :UTF-32 4)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UTF-32 ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UTF-32LE
; (BABEL-ENCODINGS::DEFINE-UCS :UTF-32LE 4 :LE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UTF-32LE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UTF-32BE
; (BABEL-ENCODINGS::DEFINE-UCS :UTF-32BE 4 :BE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UTF-32BE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UCS-2
; (BABEL-ENCODINGS::DEFINE-UCS :UCS-2 2 NIL 65536)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UCS-2 ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UCS-2LE
; (BABEL-ENCODINGS::DEFINE-UCS :UCS-2LE 2 :LE 65536)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UCS-2LE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UCS-2BE
; (BABEL-ENCODINGS::DEFINE-UCS :UCS-2BE 2 :BE 65536)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UCS-2BE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
debugger invoked on a ASDF:COMPILE-ERROR in thread
#<THREAD "main thread" RUNNING {AAF87A1}>:
Error while invoking #<COMPILE-OP (:VERBOSE NIL) {C8830E1}> on
#<CL-SOURCE-FILE "babel" "src" "enc-unicode">
Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [RETRY ] Retry compiling #<CL-SOURCE-FILE "babel" "src" "enc-unicode">.
1: [ACCEPT] Continue, treating
compiling #<CL-SOURCE-FILE "babel" "src" "enc-unicode"> as having
been successful.
2: [ABORT ] Give up on "babel"
3: Exit debugger, returning to top level.
((SB-PCL::FAST-METHOD ASDF:PERFORM (ASDF:COMPILE-OP ASDF:CL-SOURCE-FILE))
#<unavailable argument>
#<unavailable argument>
#<ASDF:COMPILE-OP (:VERBOSE NIL) {C8830E1}>
#<ASDF:CL-SOURCE-FILE "babel" "src" "enc-unicode">)
0]
-- thanks
ps:
(ql:update-all-dists)
1 dist to check.
You already have the latest version of "quicklisp": 2012-12-23.
NIL
The form
(babel:octets-to-string (coerce #(237 189 177) '(simple-array (unsigned-byte 8) (3))) :encoding :utf-8)
fails with the error:
value NIL is not of the expected type CHARACTER.
[Condition of type TYPE-ERROR]
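The three octets are a UTF-8-style encoding of a surrogate code point, which is probably where the NIL comes from: on implementations where code-char returns NIL for surrogates, the decoder would end up with NIL where a character is expected. That last step is an inference from the error message, not something confirmed in babel's source. The arithmetic itself can be checked with a small sketch (both function names are hypothetical):

```lisp
(defun decode-3-octets (o1 o2 o3)
  "Combine a three-octet UTF-8 sequence into its code point, without validation."
  (logior (ash (logand o1 #x0F) 12)   ; 4 payload bits from the lead octet
          (ash (logand o2 #x3F) 6)    ; 6 bits from each continuation octet
          (logand o3 #x3F)))

(defun surrogate-code-p (code)
  "True if CODE lies in the UTF-16 surrogate range U+D800..U+DFFF."
  (<= #xD800 code #xDFFF))

;; (decode-3-octets 237 189 177) returns #xDF71, a low surrogate.
```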
Characters such as #\№ can not be converted to octets.
CL-USER> (babel:string-to-octets "№あいう" :encoding :cp932)
; Evaluation aborted on #<TYPE-ERROR #xCE7F6BE>.
CL-USER> (ccl:encode-string-to-octets "№あいう" :external-format :cp932)
#(250 89 130 160 130 162 130 164)
8
CL-USER> (lisp-implementation-version)
"Version 1.10-r16196 (WindowsX8632)"
CL-USER>
This patch seems to fix this error.
bash-3.2$ diff -u enc-jpn.lisp new-enc-jpn.lisp
--- enc-jpn.lisp 2015-04-14 13:36:44.000000000 +0900
+++ new-enc-jpn.lisp 2015-05-19 22:04:36.000000000 +0900
@@ -43,8 +43,9 @@
(+ (ash mid 8) low))))))
(dolist (i *eucjp*)
(let ((cp932 (euc-cp932 (first i))))
- (setf (gethash cp932 *cp932-to-ucs-hash*) (second i))
- (setf (gethash (second i) *ucs-to-cp932-hash*) cp932))))
+ (when cp932
+ (setf (gethash cp932 *cp932-to-ucs-hash*) (second i))
+ (setf (gethash (second i) *ucs-to-cp932-hash*) cp932)))))
;ascii
(loop for i from #x00 to #x7f do
bash-3.2$
CL-USER> (ql:quickload :babel)
To load "babel":
Load 1 ASDF system:
babel
; Loading "babel"
; file: /home/walker/lisp/babel/src/enc-unicode.lisp
; in: DEFINE-UTF-16 :UTF-16
; (BABEL-ENCODINGS::DEFINE-UTF-16 :UTF-16)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UTF-16 :UTF-16))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UTF-16 :UTF-16LE
; (BABEL-ENCODINGS::DEFINE-UTF-16 :UTF-16LE :LE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UTF-16 :UTF-16LE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UTF-16 :UTF-16BE
; (BABEL-ENCODINGS::DEFINE-UTF-16 :UTF-16BE :BE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UTF-16 :UTF-16BE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
.
; in: DEFINE-UCS :UTF-32
; (BABEL-ENCODINGS::DEFINE-UCS :UTF-32 4)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UTF-32 ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UTF-32LE
; (BABEL-ENCODINGS::DEFINE-UCS :UTF-32LE 4 :LE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UTF-32LE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UTF-32BE
; (BABEL-ENCODINGS::DEFINE-UCS :UTF-32BE 4 :BE)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UTF-32BE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UCS-2
; (BABEL-ENCODINGS::DEFINE-UCS :UCS-2 2 NIL 65536)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UCS-2 ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UCS-2LE
; (BABEL-ENCODINGS::DEFINE-UCS :UCS-2LE 2 :LE 65536)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UCS-2LE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
; in: DEFINE-UCS :UCS-2BE
; (BABEL-ENCODINGS::DEFINE-UCS :UCS-2BE 2 :BE 65536)
;
; caught ERROR:
; (during macroexpansion of (DEFINE-UCS :UCS-2BE ...))
; #:~A-CODE-POINT-COUNTER fell through ETYPECASE expression.
; Wanted one of (STRING SIMPLE-STRING).
Error while invoking #<COMPILE-OP (:VERBOSE NIL) {100C4F0963}>
on #<CL-SOURCE-FILE "babel" "src" "enc-unicode">
[Condition of type ASDF:COMPILE-ERROR]
; Evaluation aborted on NIL
IT IS FIXED IN CURRENT RELEASE, SORRY, MY BAD!
The following example works out of the box in sbcl, but fails in lispworks unless I use a workaround (that is NOT going to work in all cases of course). I don't consider the workaround to be a fix, but a way to pinpoint a bug.
When I invoke the following without the patch (workaround) below:
(puri::decode-escaped-encoding "/tal%2Dstatic%2Dfullscreen%2Dforall/toc" t)
I get:
Error: Illegal :UTF-8 character starting at position 0.
with patch (workaround) it evaluates to:
"/tal-static-fullscreen-forall/toc"
I have babel-20140316-git from quicklisp and lispworks professional linux 6.1.1
My workaround for URLs specifically:
(in-package :babel)
(defparameter *ascii-codes*
'((32 " ")(33 "!")(34 "\"")(35 "#")(36 "$")(37 "%")(38 "&")(39 "'")(40 "(")(41 ")")
(42 "*")(43 "+")(44 ",")(45 "-")(46 ".")(47 "/")(48 "0")(49 "1")(50 "2")(51 "3")
(52 "4")(53 "5")(54 "6")(55 "7")(56 "8")(57 "9")(58 ":")(59 ";")(60 "<")(61 "=")
(62 ">")(63 "?")(64 "@")(65 "A")(66 "B")(67 "C")(68 "D")(69 "E")(70 "F")(71 "G")
(72 "H")(73 "I")(74 "J")(75 "K")(76 "L")(77 "M")(78 "N")(79 "O")(80 "P")(81 "Q")
(82 "R")(83 "S")(84 "T")(85 "U")(86 "V")(87 "W")(88 "X")(89 "Y")(90 "Z")(91 "[")
(92 "\\")(93 "]")(94 "^")(95 "_")(96 "`")(97 "a")(98 "b")(99 "c")(100 "d")(101 "e")
(102 "f")(103 "g")(104 "h")(105 "i")(106 "j")(107 "k")(108 "l")(109 "m")(110 "n")(111 "o")
(112 "p")(113 "q")(114 "r")(115 "s")(116 "t")(117 "u")(118 "v")(119 "w")(120 "x")(121 "y")
(122 "z")(123 "{")(124 "|")(125 "}")(126 "~")))
#+lispworks
(defun octets-to-string (vector &key (start 0)
end
errorp
encoding )
(let ((retval (make-array `(,(length vector)) :element-type 'character :initial-element #\Space)))
(dotimes (i (length vector))
(setf (aref retval i) (aref (second (assoc (aref vector i) *ascii-codes*)) 0)))
retval))
We would like to know when we have accumulated enough bytes in a buffer to decode a character, depending on the current encoding.
babel doesn't provide a convenient (efficient) API to test that, but I hoped to be able to use OCTETS-TO-STRING for that.
Unfortunately, the handling of incomplete code sequences is not consistent across the different encodings.
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 2 :errorp nil :encoding :utf-8)
"¶"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 1 :errorp nil :encoding :utf-8)
"�"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 2 :errorp nil :encoding :utf-16)
"슶"
cl-user> (babel:OCTETS-TO-STRING (coerce #(194 182) '(vector (unsigned-byte 8))) :start 0 :end 1 :errorp nil :encoding :utf-16)
> Debug: Failed assertion: (= babel-encodings::i babel-encodings::end)
> While executing: (:internal swank::invoke-default-debugger), in process new-repl-thread(1481).
> Type cmd-/ to continue, cmd-. to abort, cmd-\ for a list of available restarts.
> If continued: test the assertion again.
> Type :? for other options.
1 > :q
; Evaluation aborted on #<simple-error #x302006CBABDD>.
cl-user> (babel:octets-to-string (babel:string-to-octets "こんにちは 世界" :encoding :eucjp) :start 0 :end 2 :encoding :eucjp)
"こ"
cl-user> (babel:octets-to-string (babel:string-to-octets "こんにちは 世界" :encoding :eucjp) :start 0 :end 1 :encoding :eucjp)
> Debug: Illegal :eucjp character starting at position 0.
> While executing: (:internal swank::invoke-default-debugger), in process repl-thread(3921).
> Type cmd-. to abort, cmd-\ for a list of available restarts.
> Type :? for other options.
1 > :q
; Evaluation aborted on #<babel-encodings:end-of-input-in-character #x302006CA4EAD>.
cl-user>
I would suggest adding a keyword parameter to specify what to do in such a case:
| :on-invalid-code substitution-character | would insert the given substitution-character in place of the code. |
| :on-invalid-code :ignore | would ignore the code and go on. |
| :on-invalid-code :error | would signal a babel-encodings:character-decoding-error condition. |
I would also propose providing an efficient function to query the length of the code sequence for the next character:
(babel:decode-character bytes &key start end encoding)
--> character ;
sequence-valid-p ;
length
character: if a character can be decoded, it is returned as the primary value; otherwise NIL.
sequence-valid-p: NIL if the code sequence is definitely invalid, else T. Notably, if the sequence is just too short but could become a valid code sequence once completed, T should be returned.
length: if the character is decoded and returned, the length of the decoded code sequence; if no character is returned but sequence-valid-p is true, the minimal length of a valid code sequence with the given prefix; otherwise the minimal length of a valid code sequence.
| character | sequence-valid-p | length |
|-----------+------------------+----------------------------------------------------------------|
| ch | T | length of the decoded sequence |
| ch | NIL | --impossible-- |
| NIL | T | minimal length of a valid code sequence with the given prefix. |
| NIL | NIL | minimal length of a valid code sequence. |
For example, in the case NIL T len, if len <= (- end start), then the given code sequence is valid, but the decoded code is not the code of a character. E.g. #(#xED #xA0 #x80) is UTF-8 for 55296, but (code-char 55296) --> nil.
(babel:decode-character (coerce #(65 32 66) '(vector (unsigned-byte 8)))
:start 0 :end 3 :encoding :utf-8)
--> #\A
T
1
(babel:decode-character (coerce #(195 128 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
:start 0 :end 3 :encoding :utf-8)
--> #\À
T
2
(babel:decode-character (coerce #(195 128 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
:start 0 :end 1 :encoding :utf-8)
--> NIL
T
2
(babel:decode-character (coerce #(195 195 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
:start 0 :end 1 :encoding :utf-8)
--> NIL
T
2
(babel:decode-character (coerce #(195 195 32 80 97 114 105 115) '(vector (unsigned-byte 8)))
:start 0 :end 2 :encoding :utf-8)
--> NIL
NIL
1
(babel:decode-character (coerce #(#xED #xA0 #x80) '(vector (unsigned-byte 8)))
:start 0 :end 3 :encoding :utf-8)
--> NIL
T
3
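As a rough illustration of the proposed return values, the following sketch handles UTF-8 only and uses standard Common Lisp. sketch-decode-character and utf-8-lead-length are hypothetical names, there is no :encoding parameter, and overlong forms are not rejected; it is meant to reproduce the examples above, not to be a complete decoder.

```lisp
(defun utf-8-lead-length (octet)
  "Length announced by a UTF-8 lead OCTET, or NIL if OCTET cannot start a sequence."
  (cond ((< octet #x80) 1)
        ((<= #xC2 octet #xDF) 2)
        ((<= #xE0 octet #xEF) 3)
        ((<= #xF0 octet #xF4) 4)
        (t nil)))

(defun sketch-decode-character (bytes &key (start 0) (end (length bytes)))
  "Return CHARACTER, SEQUENCE-VALID-P and LENGTH as in the table above."
  (let ((available (- end start)))
    (when (zerop available)
      ;; Nothing to look at yet: any character needs at least one octet.
      (return-from sketch-decode-character (values nil t 1)))
    (let* ((lead (aref bytes start))
           (needed (utf-8-lead-length lead)))
      (cond
        ;; The first octet cannot start a sequence at all.
        ((null needed) (values nil nil 1))
        ;; One of the octets already available is not a continuation octet.
        ((loop for i from (1+ start) below (+ start (min needed available))
               thereis (not (<= #x80 (aref bytes i) #xBF)))
         (values nil nil 1))
        ;; Valid so far, but more octets are required.
        ((< available needed) (values nil t needed))
        (t
         (let ((code (ecase needed
                       (1 lead)
                       (2 (logand lead #x1F))
                       (3 (logand lead #x0F))
                       (4 (logand lead #x07)))))
           (loop for i from (1+ start) below (+ start needed)
                 do (setf code (logior (ash code 6)
                                       (logand (aref bytes i) #x3F))))
           ;; Surrogates and out-of-range codes are well-formed sequences
           ;; that do not denote a character.
           (if (or (<= #xD800 code #xDFFF) (>= code char-code-limit))
               (values nil t needed)
               (values (code-char code) t needed))))))))
```

A buffering reader could call this in a loop: when the first value is NIL and the second is T, it reads until at least length octets are available and tries again.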
(defparameter b1 (babel:string-to-octets "string 1"))
(defparameter b2 (babel:string-to-octets "string 2"))
(defparameter delim (babel:string-to-octets "|"))
(defparameter b3 (concatenate 'vector b1 delim b2))
(babel:octets-to-string b3)
;; gives this error message:
The value of VECTOR is #(115 116 114 105 110 103 32 49 124 115
116 114 105 110 103 32 50), which is not of type (VECTOR
(UNSIGNED-BYTE
8)).
[Condition of type SIMPLE-TYPE-ERROR]
----
But it works in flexi-streams and cl-base64:
(flexi-streams:octets-to-string b3)
=> "string 1|string 2"
(cl-base64:base64-string-to-string (cl-base64:usb8-array-to-base64-string b3))
=> "string 1|string 2"
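The likely cause is the result type passed to concatenate rather than babel itself: (concatenate 'vector ...) builds a general vector with element type T, while the error message shows that octets-to-string expects a (VECTOR (UNSIGNED-BYTE 8)). Asking concatenate for an octet vector avoids the mismatch. join-octets is a hypothetical helper name.

```lisp
(defun join-octets (&rest vectors)
  "Concatenate VECTORS into a fresh vector specialized to (unsigned-byte 8)."
  ;; The result type given to CONCATENATE decides the element type;
  ;; plain 'vector would produce a general vector instead.
  (apply #'concatenate '(vector (unsigned-byte 8)) vectors))
```

With this helper, (babel:octets-to-string (join-octets b1 delim b2)) should return "string 1|string 2".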
Hey all,
on first loading babel, I receive the following warnings regarding undefined variables:
; file: C:/Users/Zulu/quicklisp/dists/quicklisp/software/yason-20230214-git/encode.lisp
; in: DEFMETHOD YASON:ENCODE (SYMBOL)
; (EQ YASON::OBJECT YASON:FALSE)
;
; caught WARNING:
; undefined variable: YASON:FALSE
; (EQ YASON::OBJECT YASON:TRUE)
;
; caught WARNING:
; undefined variable: YASON:TRUE
;
; compilation unit finished
; Undefined variables:
; YASON:FALSE YASON:TRUE
; caught 2 WARNING conditions
; printed 1 note
I'm assuming it's an issue with the load order of the files in the project, as those variables do exist.
The NOTES file admits to lifting some code from OpenMCL, which is LLGPL. Doesn't that conflict with the stated MIT license of Babel?
There is concern among some Lisp users, that since Babel is pulled in by CFFI, this licensing situation creates a problem for distributing application binaries that use foreign dependencies. See this thread for context.
Since NOTES file marks this as an open issue, maybe it's possible to eventually close it? As I understand, OpenMCL today is CCL, so maybe Clozure Associates can help rectify the legal uncertainty here?
It would be very helpful if you could change babel-encodings::*cp932-to-ucs-hash* to incorporate the workaround described here: https://support.microsoft.com/en-us/kb/170559/en-us
BABEL> (setf *print-base* 16
*print-radix* t)
T
BABEL> (mapcar #'char-code (coerce "ⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹ" 'list))
(#x2170 #x2171 #x2172 #x2173 #x2174 #x2175 #x2176 #x2177 #x2178 #x2179)
BABEL> (string-to-octets "ⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹ" :encoding :cp932)
#(#xEE #xEF #xEE #xF0 #xEE #xF1 #xEE #xF2 #xEE #xF3 #xEE #xF4 #xEE #xF5 #xEE #xF6 #xEE #xF7 #xEE #xF8)
BABEL> (load "babel-cp932-workaround.lisp")
#P"c:/lispbox-0.7/babel-cp932-workaround.lisp"
BABEL> (string-to-octets "ⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹ" :encoding :cp932)
#(#xFA #x40 #xFA #x41 #xFA #x42 #xFA #x43 #xFA #x44 #xFA #x45 #xFA #x46 #xFA #x47 #xFA #x48 #xFA #x49)
BABEL>
"babel-cp932-workaround.lisp" is a patch that I use.
(in-package #:babel-encodings)
;; This is quoted from https://support.microsoft.com/en-us/kb/170559/en-us
(let ((kb170559 "0x8790 -> U+2252 -> 0x81e0 Approximately Equal To Or The Image Of
0x8791 -> U+2261 -> 0x81df Identical To
0x8792 -> U+222b -> 0x81e7 Integral
0x8795 -> U+221a -> 0x81e3 Square Root
0x8796 -> U+22a5 -> 0x81db Up Tack
0x8797 -> U+2220 -> 0x81da Angle
0x879a -> U+2235 -> 0x81e6 Because
0x879b -> U+2229 -> 0x81bf Intersection
0x879c -> U+222a -> 0x81be Union
0xed40 -> U+7e8a -> 0xfa5c CJK Unified Ideograph
0xed41 -> U+891c -> 0xfa5d CJK Unified Ideograph
0xed42 -> U+9348 -> 0xfa5e CJK Unified Ideograph
0xed43 -> U+9288 -> 0xfa5f CJK Unified Ideograph
0xed44 -> U+84dc -> 0xfa60 CJK Unified Ideograph
0xed45 -> U+4fc9 -> 0xfa61 CJK Unified Ideograph
0xed46 -> U+70bb -> 0xfa62 CJK Unified Ideograph
0xed47 -> U+6631 -> 0xfa63 CJK Unified Ideograph
0xed48 -> U+68c8 -> 0xfa64 CJK Unified Ideograph
0xed49 -> U+92f9 -> 0xfa65 CJK Unified Ideograph
0xed4a -> U+66fb -> 0xfa66 CJK Unified Ideograph
0xed4b -> U+5f45 -> 0xfa67 CJK Unified Ideograph
0xed4c -> U+4e28 -> 0xfa68 CJK Unified Ideograph
0xed4d -> U+4ee1 -> 0xfa69 CJK Unified Ideograph
0xed4e -> U+4efc -> 0xfa6a CJK Unified Ideograph
0xed4f -> U+4f00 -> 0xfa6b CJK Unified Ideograph
0xed50 -> U+4f03 -> 0xfa6c CJK Unified Ideograph
0xed51 -> U+4f39 -> 0xfa6d CJK Unified Ideograph
0xed52 -> U+4f56 -> 0xfa6e CJK Unified Ideograph
0xed53 -> U+4f92 -> 0xfa6f CJK Unified Ideograph
0xed54 -> U+4f8a -> 0xfa70 CJK Unified Ideograph
0xed55 -> U+4f9a -> 0xfa71 CJK Unified Ideograph
0xed56 -> U+4f94 -> 0xfa72 CJK Unified Ideograph
0xed57 -> U+4fcd -> 0xfa73 CJK Unified Ideograph
0xed58 -> U+5040 -> 0xfa74 CJK Unified Ideograph
0xed59 -> U+5022 -> 0xfa75 CJK Unified Ideograph
0xed5a -> U+4fff -> 0xfa76 CJK Unified Ideograph
0xed5b -> U+501e -> 0xfa77 CJK Unified Ideograph
0xed5c -> U+5046 -> 0xfa78 CJK Unified Ideograph
0xed5d -> U+5070 -> 0xfa79 CJK Unified Ideograph
0xed5e -> U+5042 -> 0xfa7a CJK Unified Ideograph
0xed5f -> U+5094 -> 0xfa7b CJK Unified Ideograph
0xed60 -> U+50f4 -> 0xfa7c CJK Unified Ideograph
0xed61 -> U+50d8 -> 0xfa7d CJK Unified Ideograph
0xed62 -> U+514a -> 0xfa7e CJK Unified Ideograph
0xed63 -> U+5164 -> 0xfa80 CJK Unified Ideograph
0xed64 -> U+519d -> 0xfa81 CJK Unified Ideograph
0xed65 -> U+51be -> 0xfa82 CJK Unified Ideograph
0xed66 -> U+51ec -> 0xfa83 CJK Unified Ideograph
0xed67 -> U+5215 -> 0xfa84 CJK Unified Ideograph
0xed68 -> U+529c -> 0xfa85 CJK Unified Ideograph
0xed69 -> U+52a6 -> 0xfa86 CJK Unified Ideograph
0xed6a -> U+52c0 -> 0xfa87 CJK Unified Ideograph
0xed6b -> U+52db -> 0xfa88 CJK Unified Ideograph
0xed6c -> U+5300 -> 0xfa89 CJK Unified Ideograph
0xed6d -> U+5307 -> 0xfa8a CJK Unified Ideograph
0xed6e -> U+5324 -> 0xfa8b CJK Unified Ideograph
0xed6f -> U+5372 -> 0xfa8c CJK Unified Ideograph
0xed70 -> U+5393 -> 0xfa8d CJK Unified Ideograph
0xed71 -> U+53b2 -> 0xfa8e CJK Unified Ideograph
0xed72 -> U+53dd -> 0xfa8f CJK Unified Ideograph
0xed73 -> U+fa0e -> 0xfa90 CJK compatibility Ideograph
0xed74 -> U+549c -> 0xfa91 CJK Unified Ideograph
0xed75 -> U+548a -> 0xfa92 CJK Unified Ideograph
0xed76 -> U+54a9 -> 0xfa93 CJK Unified Ideograph
0xed77 -> U+54ff -> 0xfa94 CJK Unified Ideograph
0xed78 -> U+5586 -> 0xfa95 CJK Unified Ideograph
0xed79 -> U+5759 -> 0xfa96 CJK Unified Ideograph
0xed7a -> U+5765 -> 0xfa97 CJK Unified Ideograph
0xed7b -> U+57ac -> 0xfa98 CJK Unified Ideograph
0xed7c -> U+57c8 -> 0xfa99 CJK Unified Ideograph
0xed7d -> U+57c7 -> 0xfa9a CJK Unified Ideograph
0xed7e -> U+fa0f -> 0xfa9b CJK compatibility Ideograph
0xed80 -> U+fa10 -> 0xfa9c CJK compatibility Ideograph
0xed81 -> U+589e -> 0xfa9d CJK Unified Ideograph
0xed82 -> U+58b2 -> 0xfa9e CJK Unified Ideograph
0xed83 -> U+590b -> 0xfa9f CJK Unified Ideograph
0xed84 -> U+5953 -> 0xfaa0 CJK Unified Ideograph
0xed85 -> U+595b -> 0xfaa1 CJK Unified Ideograph
0xed86 -> U+595d -> 0xfaa2 CJK Unified Ideograph
0xed87 -> U+5963 -> 0xfaa3 CJK Unified Ideograph
0xed88 -> U+59a4 -> 0xfaa4 CJK Unified Ideograph
0xed89 -> U+59ba -> 0xfaa5 CJK Unified Ideograph
0xed8a -> U+5b56 -> 0xfaa6 CJK Unified Ideograph
0xed8b -> U+5bc0 -> 0xfaa7 CJK Unified Ideograph
0xed8c -> U+752f -> 0xfaa8 CJK Unified Ideograph
0xed8d -> U+5bd8 -> 0xfaa9 CJK Unified Ideograph
0xed8e -> U+5bec -> 0xfaaa CJK Unified Ideograph
0xed8f -> U+5c1e -> 0xfaab CJK Unified Ideograph
0xed90 -> U+5ca6 -> 0xfaac CJK Unified Ideograph
0xed91 -> U+5cba -> 0xfaad CJK Unified Ideograph
0xed92 -> U+5cf5 -> 0xfaae CJK Unified Ideograph
0xed93 -> U+5d27 -> 0xfaaf CJK Unified Ideograph
0xed94 -> U+5d53 -> 0xfab0 CJK Unified Ideograph
0xed95 -> U+fa11 -> 0xfab1 CJK compatibility Ideograph
0xed96 -> U+5d42 -> 0xfab2 CJK Unified Ideograph
0xed97 -> U+5d6d -> 0xfab3 CJK Unified Ideograph
0xed98 -> U+5db8 -> 0xfab4 CJK Unified Ideograph
0xed99 -> U+5db9 -> 0xfab5 CJK Unified Ideograph
0xed9a -> U+5dd0 -> 0xfab6 CJK Unified Ideograph
0xed9b -> U+5f21 -> 0xfab7 CJK Unified Ideograph
0xed9c -> U+5f34 -> 0xfab8 CJK Unified Ideograph
0xed9d -> U+5f67 -> 0xfab9 CJK Unified Ideograph
0xed9e -> U+5fb7 -> 0xfaba CJK Unified Ideograph
0xed9f -> U+5fde -> 0xfabb CJK Unified Ideograph
0xeda0 -> U+605d -> 0xfabc CJK Unified Ideograph
0xeda1 -> U+6085 -> 0xfabd CJK Unified Ideograph
0xeda2 -> U+608a -> 0xfabe CJK Unified Ideograph
0xeda3 -> U+60de -> 0xfabf CJK Unified Ideograph
0xeda4 -> U+60d5 -> 0xfac0 CJK Unified Ideograph
0xeda5 -> U+6120 -> 0xfac1 CJK Unified Ideograph
0xeda6 -> U+60f2 -> 0xfac2 CJK Unified Ideograph
0xeda7 -> U+6111 -> 0xfac3 CJK Unified Ideograph
0xeda8 -> U+6137 -> 0xfac4 CJK Unified Ideograph
0xeda9 -> U+6130 -> 0xfac5 CJK Unified Ideograph
0xedaa -> U+6198 -> 0xfac6 CJK Unified Ideograph
0xedab -> U+6213 -> 0xfac7 CJK Unified Ideograph
0xedac -> U+62a6 -> 0xfac8 CJK Unified Ideograph
0xedad -> U+63f5 -> 0xfac9 CJK Unified Ideograph
0xedae -> U+6460 -> 0xfaca CJK Unified Ideograph
0xedaf -> U+649d -> 0xfacb CJK Unified Ideograph
0xedb0 -> U+64ce -> 0xfacc CJK Unified Ideograph
0xedb1 -> U+654e -> 0xfacd CJK Unified Ideograph
0xedb2 -> U+6600 -> 0xface CJK Unified Ideograph
0xedb3 -> U+6615 -> 0xfacf CJK Unified Ideograph
0xedb4 -> U+663b -> 0xfad0 CJK Unified Ideograph
0xedb5 -> U+6609 -> 0xfad1 CJK Unified Ideograph
0xedb6 -> U+662e -> 0xfad2 CJK Unified Ideograph
0xedb7 -> U+661e -> 0xfad3 CJK Unified Ideograph
0xedb8 -> U+6624 -> 0xfad4 CJK Unified Ideograph
0xedb9 -> U+6665 -> 0xfad5 CJK Unified Ideograph
0xedba -> U+6657 -> 0xfad6 CJK Unified Ideograph
0xedbb -> U+6659 -> 0xfad7 CJK Unified Ideograph
0xedbc -> U+fa12 -> 0xfad8 CJK compatibility Ideograph
0xedbd -> U+6673 -> 0xfad9 CJK Unified Ideograph
0xedbe -> U+6699 -> 0xfada CJK Unified Ideograph
0xedbf -> U+66a0 -> 0xfadb CJK Unified Ideograph
0xedc0 -> U+66b2 -> 0xfadc CJK Unified Ideograph
0xedc1 -> U+66bf -> 0xfadd CJK Unified Ideograph
0xedc2 -> U+66fa -> 0xfade CJK Unified Ideograph
0xedc3 -> U+670e -> 0xfadf CJK Unified Ideograph
0xedc4 -> U+f929 -> 0xfae0 CJK compatibility Ideograph
0xedc5 -> U+6766 -> 0xfae1 CJK Unified Ideograph
0xedc6 -> U+67bb -> 0xfae2 CJK Unified Ideograph
0xedc7 -> U+6852 -> 0xfae3 CJK Unified Ideograph
0xedc8 -> U+67c0 -> 0xfae4 CJK Unified Ideograph
0xedc9 -> U+6801 -> 0xfae5 CJK Unified Ideograph
0xedca -> U+6844 -> 0xfae6 CJK Unified Ideograph
0xedcb -> U+68cf -> 0xfae7 CJK Unified Ideograph
0xedcc -> U+fa13 -> 0xfae8 CJK compatibility Ideograph
0xedcd -> U+6968 -> 0xfae9 CJK Unified Ideograph
0xedce -> U+fa14 -> 0xfaea CJK compatibility Ideograph
0xedcf -> U+6998 -> 0xfaeb CJK Unified Ideograph
0xedd0 -> U+69e2 -> 0xfaec CJK Unified Ideograph
0xedd1 -> U+6a30 -> 0xfaed CJK Unified Ideograph
0xedd2 -> U+6a6b -> 0xfaee CJK Unified Ideograph
0xedd3 -> U+6a46 -> 0xfaef CJK Unified Ideograph
0xedd4 -> U+6a73 -> 0xfaf0 CJK Unified Ideograph
0xedd5 -> U+6a7e -> 0xfaf1 CJK Unified Ideograph
0xedd6 -> U+6ae2 -> 0xfaf2 CJK Unified Ideograph
0xedd7 -> U+6ae4 -> 0xfaf3 CJK Unified Ideograph
0xedd8 -> U+6bd6 -> 0xfaf4 CJK Unified Ideograph
0xedd9 -> U+6c3f -> 0xfaf5 CJK Unified Ideograph
0xedda -> U+6c5c -> 0xfaf6 CJK Unified Ideograph
0xeddb -> U+6c86 -> 0xfaf7 CJK Unified Ideograph
0xeddc -> U+6c6f -> 0xfaf8 CJK Unified Ideograph
0xeddd -> U+6cda -> 0xfaf9 CJK Unified Ideograph
0xedde -> U+6d04 -> 0xfafa CJK Unified Ideograph
0xeddf -> U+6d87 -> 0xfafb CJK Unified Ideograph
0xede0 -> U+6d6f -> 0xfafc CJK Unified Ideograph
0xede1 -> U+6d96 -> 0xfb40 CJK Unified Ideograph
0xede2 -> U+6dac -> 0xfb41 CJK Unified Ideograph
0xede3 -> U+6dcf -> 0xfb42 CJK Unified Ideograph
0xede4 -> U+6df8 -> 0xfb43 CJK Unified Ideograph
0xede5 -> U+6df2 -> 0xfb44 CJK Unified Ideograph
0xede6 -> U+6dfc -> 0xfb45 CJK Unified Ideograph
0xede7 -> U+6e39 -> 0xfb46 CJK Unified Ideograph
0xede8 -> U+6e5c -> 0xfb47 CJK Unified Ideograph
0xede9 -> U+6e27 -> 0xfb48 CJK Unified Ideograph
0xedea -> U+6e3c -> 0xfb49 CJK Unified Ideograph
0xedeb -> U+6ebf -> 0xfb4a CJK Unified Ideograph
0xedec -> U+6f88 -> 0xfb4b CJK Unified Ideograph
0xeded -> U+6fb5 -> 0xfb4c CJK Unified Ideograph
0xedee -> U+6ff5 -> 0xfb4d CJK Unified Ideograph
0xedef -> U+7005 -> 0xfb4e CJK Unified Ideograph
0xedf0 -> U+7007 -> 0xfb4f CJK Unified Ideograph
0xedf1 -> U+7028 -> 0xfb50 CJK Unified Ideograph
0xedf2 -> U+7085 -> 0xfb51 CJK Unified Ideograph
0xedf3 -> U+70ab -> 0xfb52 CJK Unified Ideograph
0xedf4 -> U+710f -> 0xfb53 CJK Unified Ideograph
0xedf5 -> U+7104 -> 0xfb54 CJK Unified Ideograph
0xedf6 -> U+715c -> 0xfb55 CJK Unified Ideograph
0xedf7 -> U+7146 -> 0xfb56 CJK Unified Ideograph
0xedf8 -> U+7147 -> 0xfb57 CJK Unified Ideograph
0xedf9 -> U+fa15 -> 0xfb58 CJK compatibility Ideograph
0xedfa -> U+71c1 -> 0xfb59 CJK Unified Ideograph
0xedfb -> U+71fe -> 0xfb5a CJK Unified Ideograph
0xedfc -> U+72b1 -> 0xfb5b CJK Unified Ideograph
0xee40 -> U+72be -> 0xfb5c CJK Unified Ideograph
0xee41 -> U+7324 -> 0xfb5d CJK Unified Ideograph
0xee42 -> U+fa16 -> 0xfb5e CJK compatibility Ideograph
0xee43 -> U+7377 -> 0xfb5f CJK Unified Ideograph
0xee44 -> U+73bd -> 0xfb60 CJK Unified Ideograph
0xee45 -> U+73c9 -> 0xfb61 CJK Unified Ideograph
0xee46 -> U+73d6 -> 0xfb62 CJK Unified Ideograph
0xee47 -> U+73e3 -> 0xfb63 CJK Unified Ideograph
0xee48 -> U+73d2 -> 0xfb64 CJK Unified Ideograph
0xee49 -> U+7407 -> 0xfb65 CJK Unified Ideograph
0xee4a -> U+73f5 -> 0xfb66 CJK Unified Ideograph
0xee4b -> U+7426 -> 0xfb67 CJK Unified Ideograph
0xee4c -> U+742a -> 0xfb68 CJK Unified Ideograph
0xee4d -> U+7429 -> 0xfb69 CJK Unified Ideograph
0xee4e -> U+742e -> 0xfb6a CJK Unified Ideograph
0xee4f -> U+7462 -> 0xfb6b CJK Unified Ideograph
0xee50 -> U+7489 -> 0xfb6c CJK Unified Ideograph
0xee51 -> U+749f -> 0xfb6d CJK Unified Ideograph
0xee52 -> U+7501 -> 0xfb6e CJK Unified Ideograph
0xee53 -> U+756f -> 0xfb6f CJK Unified Ideograph
0xee54 -> U+7682 -> 0xfb70 CJK Unified Ideograph
0xee55 -> U+769c -> 0xfb71 CJK Unified Ideograph
0xee56 -> U+769e -> 0xfb72 CJK Unified Ideograph
0xee57 -> U+769b -> 0xfb73 CJK Unified Ideograph
0xee58 -> U+76a6 -> 0xfb74 CJK Unified Ideograph
0xee59 -> U+fa17 -> 0xfb75 CJK compatibility Ideograph
0xee5a -> U+7746 -> 0xfb76 CJK Unified Ideograph
0xee5b -> U+52af -> 0xfb77 CJK Unified Ideograph
0xee5c -> U+7821 -> 0xfb78 CJK Unified Ideograph
0xee5d -> U+784e -> 0xfb79 CJK Unified Ideograph
0xee5e -> U+7864 -> 0xfb7a CJK Unified Ideograph
0xee5f -> U+787a -> 0xfb7b CJK Unified Ideograph
0xee60 -> U+7930 -> 0xfb7c CJK Unified Ideograph
0xee61 -> U+fa18 -> 0xfb7d CJK compatibility Ideograph
0xee62 -> U+fa19 -> 0xfb7e CJK compatibility Ideograph
0xee63 -> U+fa1a -> 0xfb80 CJK compatibility Ideograph
0xee64 -> U+7994 -> 0xfb81 CJK Unified Ideograph
0xee65 -> U+fa1b -> 0xfb82 CJK compatibility Ideograph
0xee66 -> U+799b -> 0xfb83 CJK Unified Ideograph
0xee67 -> U+7ad1 -> 0xfb84 CJK Unified Ideograph
0xee68 -> U+7ae7 -> 0xfb85 CJK Unified Ideograph
0xee69 -> U+fa1c -> 0xfb86 CJK compatibility Ideograph
0xee6a -> U+7aeb -> 0xfb87 CJK Unified Ideograph
0xee6b -> U+7b9e -> 0xfb88 CJK Unified Ideograph
0xee6c -> U+fa1d -> 0xfb89 CJK compatibility Ideograph
0xee6d -> U+7d48 -> 0xfb8a CJK Unified Ideograph
0xee6e -> U+7d5c -> 0xfb8b CJK Unified Ideograph
0xee6f -> U+7db7 -> 0xfb8c CJK Unified Ideograph
0xee70 -> U+7da0 -> 0xfb8d CJK Unified Ideograph
0xee71 -> U+7dd6 -> 0xfb8e CJK Unified Ideograph
0xee72 -> U+7e52 -> 0xfb8f CJK Unified Ideograph
0xee73 -> U+7f47 -> 0xfb90 CJK Unified Ideograph
0xee74 -> U+7fa1 -> 0xfb91 CJK Unified Ideograph
0xee75 -> U+fa1e -> 0xfb92 CJK compatibility Ideograph
0xee76 -> U+8301 -> 0xfb93 CJK Unified Ideograph
0xee77 -> U+8362 -> 0xfb94 CJK Unified Ideograph
0xee78 -> U+837f -> 0xfb95 CJK Unified Ideograph
0xee79 -> U+83c7 -> 0xfb96 CJK Unified Ideograph
0xee7a -> U+83f6 -> 0xfb97 CJK Unified Ideograph
0xee7b -> U+8448 -> 0xfb98 CJK Unified Ideograph
0xee7c -> U+84b4 -> 0xfb99 CJK Unified Ideograph
0xee7d -> U+8553 -> 0xfb9a CJK Unified Ideograph
0xee7e -> U+8559 -> 0xfb9b CJK Unified Ideograph
0xee80 -> U+856b -> 0xfb9c CJK Unified Ideograph
0xee81 -> U+fa1f -> 0xfb9d CJK compatibility Ideograph
0xee82 -> U+85b0 -> 0xfb9e CJK Unified Ideograph
0xee83 -> U+fa20 -> 0xfb9f CJK compatibility Ideograph
0xee84 -> U+fa21 -> 0xfba0 CJK compatibility Ideograph
0xee85 -> U+8807 -> 0xfba1 CJK Unified Ideograph
0xee86 -> U+88f5 -> 0xfba2 CJK Unified Ideograph
0xee87 -> U+8a12 -> 0xfba3 CJK Unified Ideograph
0xee88 -> U+8a37 -> 0xfba4 CJK Unified Ideograph
0xee89 -> U+8a79 -> 0xfba5 CJK Unified Ideograph
0xee8a -> U+8aa7 -> 0xfba6 CJK Unified Ideograph
0xee8b -> U+8abe -> 0xfba7 CJK Unified Ideograph
0xee8c -> U+8adf -> 0xfba8 CJK Unified Ideograph
0xee8d -> U+fa22 -> 0xfba9 CJK compatibility Ideograph
0xee8e -> U+8af6 -> 0xfbaa CJK Unified Ideograph
0xee8f -> U+8b53 -> 0xfbab CJK Unified Ideograph
0xee90 -> U+8b7f -> 0xfbac CJK Unified Ideograph
0xee91 -> U+8cf0 -> 0xfbad CJK Unified Ideograph
0xee92 -> U+8cf4 -> 0xfbae CJK Unified Ideograph
0xee93 -> U+8d12 -> 0xfbaf CJK Unified Ideograph
0xee94 -> U+8d76 -> 0xfbb0 CJK Unified Ideograph
0xee95 -> U+fa23 -> 0xfbb1 CJK compatibility Ideograph
0xee96 -> U+8ecf -> 0xfbb2 CJK Unified Ideograph
0xee97 -> U+fa24 -> 0xfbb3 CJK compatibility Ideograph
0xee98 -> U+fa25 -> 0xfbb4 CJK compatibility Ideograph
0xee99 -> U+9067 -> 0xfbb5 CJK Unified Ideograph
0xee9a -> U+90de -> 0xfbb6 CJK Unified Ideograph
0xee9b -> U+fa26 -> 0xfbb7 CJK compatibility Ideograph
0xee9c -> U+9115 -> 0xfbb8 CJK Unified Ideograph
0xee9d -> U+9127 -> 0xfbb9 CJK Unified Ideograph
0xee9e -> U+91da -> 0xfbba CJK Unified Ideograph
0xee9f -> U+91d7 -> 0xfbbb CJK Unified Ideograph
0xeea0 -> U+91de -> 0xfbbc CJK Unified Ideograph
0xeea1 -> U+91ed -> 0xfbbd CJK Unified Ideograph
0xeea2 -> U+91ee -> 0xfbbe CJK Unified Ideograph
0xeea3 -> U+91e4 -> 0xfbbf CJK Unified Ideograph
0xeea4 -> U+91e5 -> 0xfbc0 CJK Unified Ideograph
0xeea5 -> U+9206 -> 0xfbc1 CJK Unified Ideograph
0xeea6 -> U+9210 -> 0xfbc2 CJK Unified Ideograph
0xeea7 -> U+920a -> 0xfbc3 CJK Unified Ideograph
0xeea8 -> U+923a -> 0xfbc4 CJK Unified Ideograph
0xeea9 -> U+9240 -> 0xfbc5 CJK Unified Ideograph
0xeeaa -> U+923c -> 0xfbc6 CJK Unified Ideograph
0xeeab -> U+924e -> 0xfbc7 CJK Unified Ideograph
0xeeac -> U+9259 -> 0xfbc8 CJK Unified Ideograph
0xeead -> U+9251 -> 0xfbc9 CJK Unified Ideograph
0xeeae -> U+9239 -> 0xfbca CJK Unified Ideograph
0xeeaf -> U+9267 -> 0xfbcb CJK Unified Ideograph
0xeeb0 -> U+92a7 -> 0xfbcc CJK Unified Ideograph
0xeeb1 -> U+9277 -> 0xfbcd CJK Unified Ideograph
0xeeb2 -> U+9278 -> 0xfbce CJK Unified Ideograph
0xeeb3 -> U+92e7 -> 0xfbcf CJK Unified Ideograph
0xeeb4 -> U+92d7 -> 0xfbd0 CJK Unified Ideograph
0xeeb5 -> U+92d9 -> 0xfbd1 CJK Unified Ideograph
0xeeb6 -> U+92d0 -> 0xfbd2 CJK Unified Ideograph
0xeeb7 -> U+fa27 -> 0xfbd3 CJK compatibility Ideograph
0xeeb8 -> U+92d5 -> 0xfbd4 CJK Unified Ideograph
0xeeb9 -> U+92e0 -> 0xfbd5 CJK Unified Ideograph
0xeeba -> U+92d3 -> 0xfbd6 CJK Unified Ideograph
0xeebb -> U+9325 -> 0xfbd7 CJK Unified Ideograph
0xeebc -> U+9321 -> 0xfbd8 CJK Unified Ideograph
0xeebd -> U+92fb -> 0xfbd9 CJK Unified Ideograph
0xeebe -> U+fa28 -> 0xfbda CJK compatibility Ideograph
0xeebf -> U+931e -> 0xfbdb CJK Unified Ideograph
0xeec0 -> U+92ff -> 0xfbdc CJK Unified Ideograph
0xeec1 -> U+931d -> 0xfbdd CJK Unified Ideograph
0xeec2 -> U+9302 -> 0xfbde CJK Unified Ideograph
0xeec3 -> U+9370 -> 0xfbdf CJK Unified Ideograph
0xeec4 -> U+9357 -> 0xfbe0 CJK Unified Ideograph
0xeec5 -> U+93a4 -> 0xfbe1 CJK Unified Ideograph
0xeec6 -> U+93c6 -> 0xfbe2 CJK Unified Ideograph
0xeec7 -> U+93de -> 0xfbe3 CJK Unified Ideograph
0xeec8 -> U+93f8 -> 0xfbe4 CJK Unified Ideograph
0xeec9 -> U+9431 -> 0xfbe5 CJK Unified Ideograph
0xeeca -> U+9445 -> 0xfbe6 CJK Unified Ideograph
0xeecb -> U+9448 -> 0xfbe7 CJK Unified Ideograph
0xeecc -> U+9592 -> 0xfbe8 CJK Unified Ideograph
0xeecd -> U+f9dc -> 0xfbe9 CJK compatibility Ideograph
0xeece -> U+fa29 -> 0xfbea CJK compatibility Ideograph
0xeecf -> U+969d -> 0xfbeb CJK Unified Ideograph
0xeed0 -> U+96af -> 0xfbec CJK Unified Ideograph
0xeed1 -> U+9733 -> 0xfbed CJK Unified Ideograph
0xeed2 -> U+973b -> 0xfbee CJK Unified Ideograph
0xeed3 -> U+9743 -> 0xfbef CJK Unified Ideograph
0xeed4 -> U+974d -> 0xfbf0 CJK Unified Ideograph
0xeed5 -> U+974f -> 0xfbf1 CJK Unified Ideograph
0xeed6 -> U+9751 -> 0xfbf2 CJK Unified Ideograph
0xeed7 -> U+9755 -> 0xfbf3 CJK Unified Ideograph
0xeed8 -> U+9857 -> 0xfbf4 CJK Unified Ideograph
0xeed9 -> U+9865 -> 0xfbf5 CJK Unified Ideograph
0xeeda -> U+fa2a -> 0xfbf6 CJK compatibility Ideograph
0xeedb -> U+fa2b -> 0xfbf7 CJK compatibility Ideograph
0xeedc -> U+9927 -> 0xfbf8 CJK Unified Ideograph
0xeedd -> U+fa2c -> 0xfbf9 CJK compatibility Ideograph
0xeede -> U+999e -> 0xfbfa CJK Unified Ideograph
0xeedf -> U+9a4e -> 0xfbfb CJK Unified Ideograph
0xeee0 -> U+9ad9 -> 0xfbfc CJK Unified Ideograph
0xeee1 -> U+9adc -> 0xfc40 CJK Unified Ideograph
0xeee2 -> U+9b75 -> 0xfc41 CJK Unified Ideograph
0xeee3 -> U+9b72 -> 0xfc42 CJK Unified Ideograph
0xeee4 -> U+9b8f -> 0xfc43 CJK Unified Ideograph
0xeee5 -> U+9bb1 -> 0xfc44 CJK Unified Ideograph
0xeee6 -> U+9bbb -> 0xfc45 CJK Unified Ideograph
0xeee7 -> U+9c00 -> 0xfc46 CJK Unified Ideograph
0xeee8 -> U+9d70 -> 0xfc47 CJK Unified Ideograph
0xeee9 -> U+9d6b -> 0xfc48 CJK Unified Ideograph
0xeeea -> U+fa2d -> 0xfc49 CJK compatibility Ideograph
0xeeeb -> U+9e19 -> 0xfc4a CJK Unified Ideograph
0xeeec -> U+9ed1 -> 0xfc4b CJK Unified Ideograph
0xeeef -> U+2170 -> 0xfa40 Small Roman Numeral One
0xeef0 -> U+2171 -> 0xfa41 Small Roman Numeral Two
0xeef1 -> U+2172 -> 0xfa42 Small Roman Numeral Three
0xeef2 -> U+2173 -> 0xfa43 Small Roman Numeral Four
0xeef3 -> U+2174 -> 0xfa44 Small Roman Numeral Five
0xeef4 -> U+2175 -> 0xfa45 Small Roman Numeral Six
0xeef5 -> U+2176 -> 0xfa46 Small Roman Numeral Seven
0xeef6 -> U+2177 -> 0xfa47 Small Roman Numeral Eight
0xeef7 -> U+2178 -> 0xfa48 Small Roman Numeral Nine
0xeef8 -> U+2179 -> 0xfa49 Small Roman Numeral Ten
0xeef9 -> U+ffe2 -> 0x81ca Fullwidth Not Sign
0xeefa -> U+ffe4 -> 0xfa55 Fullwidth Broken Bar
0xeefb -> U+ff07 -> 0xfa56 Fullwidth Apostrophe
0xeefc -> U+ff02 -> 0xfa57 Fullwidth Quotation Mark
0xfa4a -> U+2160 -> 0x8754 Roman Numeral One
0xfa4b -> U+2161 -> 0x8755 Roman Numeral Two
0xfa4c -> U+2162 -> 0x8756 Roman Numeral Three
0xfa4d -> U+2163 -> 0x8757 Roman Numeral Four
0xfa4e -> U+2164 -> 0x8758 Roman Numeral Five
0xfa4f -> U+2165 -> 0x8759 Roman Numeral Six
0xfa50 -> U+2166 -> 0x875a Roman Numeral Seven
0xfa51 -> U+2167 -> 0x875b Roman Numeral Eight
0xfa52 -> U+2168 -> 0x875c Roman Numeral Nine
0xfa53 -> U+2169 -> 0x875d Roman Numeral Ten
0xfa54 -> U+ffe2 -> 0x81ca Fullwidth Not Sign
0xfa58 -> U+3231 -> 0x878a Parenthesized Ideograph Stock
0xfa59 -> U+2116 -> 0x8782 Numero Sign
0xfa5a -> U+2121 -> 0x8784 Telephone Sign
0xfa5b -> U+2235 -> 0x81e6 Because"))
  (with-input-from-string (s kb170559)
    (do ((line (read-line s nil)
               (read-line s nil)))
        ((null line))
      (let* ((*read-base* 16)
             (ucs (read-from-string (subseq line 14 18)))
             (cp932 (read-from-string (subseq line 26 30))))
        (setf (gethash ucs *ucs-to-cp932-hash*) cp932)))))
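The fixed offsets passed to `subseq` only work while every table line keeps the exact column layout of the original source. As a hedged sketch, the fields could instead be located by their `U+` and `-> 0x` markers, assuming each line has the form `0xSSSS -> U+UUUU -> 0xCCCC name`. `parse-mapping-line` is a hypothetical helper for illustration, not part of babel.

```lisp
;; Hypothetical helper, not part of babel: extract the Unicode code point
;; and the CP932 code from one mapping line without relying on column offsets.
(defun parse-mapping-line (line)
  "Return (values ucs cp932) for a line such as
\"0x8791 -> U+2261 -> 0x81df Identical To\", or NIL if the line does not match."
  (let* ((u-pos (search "U+" line))
         (arrow (and u-pos (search "-> 0x" line :start2 u-pos))))
    (when (and u-pos arrow
               (<= (+ u-pos 6) (length line))
               (<= (+ arrow 9) (length line)))
      (values
       ;; Four hex digits right after "U+".
       (parse-integer line :start (+ u-pos 2) :end (+ u-pos 6) :radix 16)
       ;; Four hex digits right after the second "-> 0x".
       (parse-integer line :start (+ arrow 5) :end (+ arrow 9) :radix 16)))))
```

Using `parse-integer` with `:radix 16` also avoids running the Lisp reader over the table text.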
Hi, I'm wondering whether it would be possible to add encoding and decoding of compound text. There are some examples in the swank backends.
Namely GBK and some others.
Such code points (for example the surrogate U+D800 used below) do not represent Unicode characters.
This also breaks the non-ambiguity of the :utf-8 encoding:
(babel:string-to-octets (string (code-char #xd800)))
; => #(237 160 128)
(babel:octets-to-string *)
; Evaluation aborted on #<BABEL-ENCODINGS:CHARACTER-OUT-OF-RANGE {10053D9533}>.
For example, SBCL signals an error in this case:
(sb-ext:string-to-octets (string (code-char #xd800)))
; Evaluation aborted on #<SB-IMPL::OCTETS-ENCODING-ERROR {10013BEA23}>.
This seems to affect some other UTF/UCS encodings as well (like :utf-16be or :utf-16le).
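Until the encoders reject such input, a caller could screen strings for surrogate code points before encoding. The following is a minimal sketch in portable Common Lisp. The function names are hypothetical and not part of babel, and the check assumes that the implementation's `char-code` returns the Unicode code point.

```lisp
;; Hypothetical helpers, not part of babel.
(defun surrogate-code-point-p (code)
  "True if CODE lies in the UTF-16 surrogate range U+D800..U+DFFF."
  (<= #xD800 code #xDFFF))

(defun string-has-surrogates-p (string)
  "True if any character of STRING has a surrogate code point."
  (and (some (lambda (ch) (surrogate-code-point-p (char-code ch))) string)
       t))

(defun check-no-surrogates (string)
  "Return STRING unchanged, or signal an error if it contains a surrogate."
  (when (string-has-surrogates-p string)
    (error "String contains a surrogate code point and cannot be encoded."))
  string)
```

A caller would then wrap the argument, as in `(babel:string-to-octets (check-no-surrogates s))`, so that invalid input fails at encoding time rather than when the octets are decoded later.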