patperry / r-utf8

UTF-8 Text Processing (R Package)

License: Apache License 2.0

Makefile 0.63% R 3.88% C 91.59% Python 3.90%

r-utf8's Introduction

utf8

utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R’s UTF-8 handling.

Installation

Stable version

utf8 is available on CRAN. To install the latest released version, run the following command in R:

install.packages("utf8")

Development version

To install the latest development version, run the following:

devtools::install_github("patperry/r-utf8")

Usage

library(utf8)

Validate character data and convert to UTF-8

Use as_utf8() to validate input text and convert to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:

# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails
#> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4

# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
#> [1] "façile" "façile" "façile"
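
To locate the offending entries before converting, the package's utf8_valid() can serve as a pre-check. The following is a minimal sketch, assuming the utf8 package is installed; the variable names ok and bad are illustrative only.

```r
library(utf8)

# second entry holds latin-1 bytes but is declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# TRUE where an entry can be translated to valid UTF-8
ok <- utf8_valid(x)

# indices of the entries whose declared encoding needs fixing
bad <- which(!ok)
bad
```

Entries listed in bad can then be given the correct Encoding() before calling as_utf8().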

Normalize data

Use utf8_normalize() to convert to Unicode composed normal form (NFC). Optionally apply compatibility maps for NFKC normal form or case-fold.

# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
#> [1] "Å" "Å" "Å"
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE

# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)
#> [1] "grösse"

# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.",
               map_compat = TRUE)
#> [1] "Yo Unicode l herd U like typefaces so we put some codepoints in your Supplementary Wultilingval Plane so you can encode fonts in your fonts."

Print emoji

On some platforms (including macOS), the R implementation of print() uses an outdated version of the Unicode standard to determine which characters are printable. Use utf8_print() for an updated print function:

print(intToUtf8(0x1F600 + 0:79)) # with default R print function
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"

utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫…"

utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"

Citation

Cite utf8 with the following BibTeX entry:

@Manual{,
  title = {utf8: Unicode Text Processing},
  author = {Patrick O. Perry},
  year = {2018},
  note = {R package version 1.1.4},
  url = {https://github.com/patperry/r-utf8},
}

Contributing

The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you’d like to contribute, either

fork the repository and submit a pull request;
file an issue;
or contact the maintainer via e-mail.

This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.


r-utf8's Issues

ranlib sometimes not available on Solaris?

I'm having trouble checking pillar on Solaris with R-hub: utf8 can't be installed because ranlib can't be found: https://builder.r-hub.io/status/pillar_1.2.2.9001.tar.gz-1ea56dcff8e44d29be07bbd460c0a7d2#L2037

The CRAN checks seem to work OK, though. I wonder if ranlib can be made optional so that the package can be installed on R-hub too, or perhaps substituted with ar -s (found in man ranlib).

# install.packages("rhub")
rhub::check(platform = "solaris-x86-patched")

Compile error - utf8lite.h not found

When I use devtools::install_github("patperry/r-utf8") I get the missing file error on a new (clean) install of R/Rtools/mingw64 etc.

I noticed the sub-project utf8lite has the file. Could you add some directions on how to include it via devtools::install_github?

OS Windows 10 / 64bit build 12699

Emoji vs Emoji_Presentation

AFAICT in determining whether a code-point should be wide or not the code (or at least the code that gens the tables) relies on checking whether it's emoji or emoji_presentation (not 100% certain):

# https://www.unicode.org/reports/tr51/#def_basic_emoji_set
emoji = ((emoji_props['Emoji'] - emoji_props['Emoji_Component'])
         | emoji_props['Emoji_Presentation'])

But the current tr51 states:

The emoji code points are those with property values Emoji=Yes, Emoji_Component=No, and Emoji_Presentation=Yes.

It seems the | above is effectively doing an or, though I do not know python so I have no idea if the code is doing what I think it's doing. However, I do see:

utf8::utf8_width(c('\u2139', '\u2728'))
## [1] 2 2

u2139 is not in the "Emoji Presentation" section of Emoji_data, but u2728 is.
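
The difference between the two readings can be illustrated with toy sets in R. This is only a sketch: the sets below hold just the two code points from the example (their Emoji_Presentation membership follows the statement above), and the variable names are made up for illustration. In Python, | on sets is a union and - is a set difference.

```r
# toy property sets containing only the two code points discussed here
emoji              <- c("2139", "2728")  # both have Emoji=Yes
emoji_component    <- character(0)       # neither is an Emoji_Component
emoji_presentation <- c("2728")          # only U+2728 has Emoji_Presentation=Yes

# what the Python expression computes:
# (Emoji - Emoji_Component) | Emoji_Presentation
by_union <- union(setdiff(emoji, emoji_component), emoji_presentation)

# the tr51 wording:
# Emoji=Yes, Emoji_Component=No, and Emoji_Presentation=Yes
by_intersection <- intersect(setdiff(emoji, emoji_component), emoji_presentation)

by_union         # contains both code points
by_intersection  # contains only "2728"
```

Under the union reading both code points count as emoji, which would be consistent with both getting width 2.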

On my system (mojave OS X terminal) this is what I see:

I don't pretend that my system is the end all be-all in terms of the correct display computation, but it appears to behave as per tr51.

There is additional ambiguity with some emoji with text presentation that actually have a wide-ish text presentation:

Though clearly the terminal treats it as 1-wide (FWIW, until recently the terminal also treated normal emojis as 1-wide...).

pillar support

utf8_encode(quote = NA), meaning "quotes if needed"
enhance utf8_encode(escapes = ...) to support subtle quotes, perhaps passage of full open-close strings, and perhaps fix #34 by offering an alternative option
Understand difference between utf8_encode() and utf8_format(); only the latter has a chars argument (but #33?), only the former has escapes
When abbreviating, never break escapes and honor wide characters

For r-lib/pillar#563.

CC @patperry.

Compute width ignoring ANSI sequences

Width computations on strings with ANSI escapes currently consume a very substantial part of the time needed to format tibbles:

http://rpubs.com/krlmlr/pillar-print-timing

This is even after replacing crayon::strip_style() with the faster fansi::strip_sgr(). I wonder if UTF-8 could provide a function that computes the width but ignores these non-printable codepoints. I currently can't use utf8_width() for this:

withr::with_options(
  list(crayon.enabled = TRUE),
  utf8::utf8_width(crayon::magenta("four"))
)
#> [1] 24

Created on 2018-07-04 by the reprex package (v0.2.0).

CC @brodieG.
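
For reference, the strip-then-measure approach described above can be sketched in base R. The helper name strip_sgr_simple and its regular expression are assumptions made for illustration; they cover only SGR color/style sequences, and this is not the implementation used by crayon or fansi. The styled string is written out by hand and is assumed to match what crayon::magenta("four") emits when colors are enabled.

```r
# remove SGR sequences such as "\033[35m" (illustrative helper, not a library function)
strip_sgr_simple <- function(x) {
  gsub("\033\\[[0-9;]*m", "", x, perl = TRUE)
}

# hand-written styled string: magenta on, text, foreground reset
styled <- "\033[35mfour\033[39m"

# strip the escapes, then measure the display width of what is left
plain <- strip_sgr_simple(styled)
width <- nchar(plain, type = "width")
width
#> [1] 4
```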

Some Unicode characters are lost after encoding

Hello. While working on a web scraping application, I encountered an error: when I converted a matrix of Vietnamese sentences to a tibble, some characters were mis-converted. It looks like the problem comes from the utf8 package, so I hope you can shed light on why this occurs and how to fix it.

Here are the Vietnamese characters with diacritics that were encoded incorrectly.

> utf8::utf8_encode("ă â đ ê ô ơ ư")
[1] "a â d ê ô o u "

Then I verified by comparing the outputs, and here is the result:

> utf8::utf8_encode("ă â đ ê ô ơ ư") == utf8::utf8_encode("a â d ê ô o u")
[1] TRUE

Indeed, the "ă", "đ", "ơ" and "ư" are mis-encoded.

Thank you.
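
One way to narrow this down is to check whether the characters are still intact before they reach utf8_encode(). The sketch below is an assumption about where to look, not a confirmed diagnosis: it writes the string with \u escapes so that it does not depend on the source file or console encoding, then inspects the stored code points with base R.

```r
# the same seven letters, written with escapes instead of literal characters
x <- "\u0103 \u00e2 \u0111 \u00ea \u00f4 \u01a1 \u01b0"

# code points actually stored in the string (32 is the space separator)
cps <- utf8ToInt(x)
letters_only <- cps[cps != 32L]
sprintf("U+%04X", letters_only)

# declared encoding of the string
Encoding(x)
```

If typing the literal characters yields different code points than the escaped form, that would suggest the loss happens when R reads the input in the native encoding rather than inside utf8_encode().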

`utf8_format(quote = TRUE)` broken?

I'm wondering what the effect of the quote argument is. @patperry: is this intentional?

utf8::utf8_format('"')
#> [1] "\""
utf8::utf8_format('"', quote = TRUE)
#> [1] "\""

Created on 2022-06-30 by the reprex package (v2.0.1)

The only user on CRAN seems to be the corpus package, which threads it through.

Unexpected utf8_width output

I'll admit I still haven't figured out exactly what the correct column width for monospaced fonts should be (I guess partly because it isn't specified by Unicode), but I was surprised by some odd widths (6, 10) I see coming out of utf8_width. These are for two non-spacing marks (Mn); uall is a raw import of the Unicode DB (12.1). There are several others that produce that type of value. Maybe these are error codes? If so, is there documentation for them?

> utf8::utf8_width(c('\u07fd', '\U00010d24'))
[1]  6 10
> subset(uall, V1 %in% c(0x7fd, 0x10d24))
         V1                            V2 V3  V4  V5 V6 V7 V8 V9 V10 V11 V12
1988  007fd                NKO DANTAYALAN Mn 220 NSM    NA NA      N      NA
19015 10d24 HANIFI ROHINGYA SIGN HARBAHAY Mn 230 NSM    NA NA      N      NA
      V13 V14 V15
1988             
19015
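
A possible explanation, offered as an assumption rather than a confirmed answer: utf8_width() has an encode argument, and if the input is encoded before measuring, a code point the package does not treat as printable would be measured as its escape sequence. The escape strings for these two code points have exactly those lengths, which can be checked in base R:

```r
# the escape sequences as literal backslash strings
esc <- c("\\u07fd", "\\U00010d24")

# number of characters in each escape
n <- nchar(esc)
n
#> [1]  6 10
```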

NEWS not showing up in CRAN

Probably because .Rbuildignore has an entry ^NEWS[.]md$. Not a big deal since we can see the NEWS on the repo.

Error during the installation of the utf8 package

I am trying to install utf8 package (version 1.1.3) on my laptop but I got an error. I am using R 3.4.3 and my OS is macOS Sierra 10.12.6.

Error: package or namespace load failed for ‘utf8’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/Library/Frameworks/R.framework/Versions/3.4/Resources/library/utf8/libs/utf8.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.4/Resources/library/utf8/libs/utf8.so, 6): Symbol not found: _utf8lite_graph_measure
  Referenced from: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/utf8/libs/utf8.so
  Expected in: flat namespace
 in /Library/Frameworks/R.framework/Versions/3.4/Resources/library/utf8/libs/utf8.so
Errore: loading failed
Esecuzione interrotta
ERROR: loading failed
* removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/utf8’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/utf8’

Get R to keep UTF-8 Codepoint representation

I have a weird problem: I want emojis in a data set I'm working with to stay in codepoint representation (i.e. as '\U0001f602'). I want to use the FindReplace function from the DataCombine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output as an object in R, the nice escaped form generated by utf8_encode, which I can match against my dictionary, disappears...

First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)

Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)

This works great, gives output of \U0001f595 (etc.) that can be matched with dictionary entries when it 'prints' in the console.
utf8_encode(emojimovie$test)

BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)

and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)

I get all NAs. When I look at the output in $text2 with View I don't see the \U0001f595, I see actual emojis. I think this is why the FindReplace function isn't working -- when it gets saved to an object it just gets represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do this for all ~2,000 or so emojis.... I've tried reading as much as I can about utf-8, but I can't understand why this is happening. Apologies if this is the wrong place to ask this question, but I'm stumped! :P
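
One way to get a plain-ASCII escape form that survives assignment is to build it from the code points directly. This is a sketch in base R; the helper names to_escape_one and to_escape are hypothetical and are not part of utf8 or DataCombine, and NA inputs are not handled.

```r
# convert one string to ASCII with \uXXXX / \UXXXXXXXX escapes (illustrative helper)
to_escape_one <- function(s) {
  cps <- utf8ToInt(s)                        # integer code points
  chars <- intToUtf8(cps, multiple = TRUE)   # one string per code point
  out <- ifelse(cps < 128L, chars,
         ifelse(cps <= 0xFFFFL, sprintf("\\u%04x", cps),
                                sprintf("\\U%08x", cps)))
  paste(out, collapse = "")
}

# vectorized wrapper over a character vector
to_escape <- function(x) {
  vapply(x, to_escape_one, character(1), USE.NAMES = FALSE)
}

escaped <- to_escape("lol \U0001f602")

# the escape is now stored as literal characters, so exact matching can find it
grepl("\\U0001f602", escaped, fixed = TRUE)
```

Because the result contains a real backslash followed by hex digits, it stays the same whether it is printed, viewed, or stored in a data frame column.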

use lower default chars

Use a lower default when chars = NULL, so that output looks good in a data frame. Possibly chars = 60.

get_width() in non-UTF-8 locales

In non-UTF-8 locales R will translate characters it can't display to "<U+xxxx>" sequences. Should utf8_width() respect this, or maybe we can provide another entry point? This is important for pillar.

rlang::mut_latin1_locale()
#> Locale codeset is now latin1
"\u6211\u662f\u8c01"
#> [1] "<U+6211><U+662F><U+8C01>"
utf8::get_width("\u6211\u662f\u8c01")
#> Error: 'get_width' is not an exported object from 'namespace:utf8'

Failure on R-devel on Windows

https://www.r-project.org/nosvn/R.check/r-devel-windows-ix86+x86_64/utf8-00check.html

Other platforms seem fine.

Weird output in C locale

Download https://github.com/lyons7/emojidictionary/blob/master/Emoji%20Dictionary%202.1.csv

Then run this code:

library(utf8)

emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)

# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)

# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
                      keep.source = FALSE), eval)

Sys.setlocale("LC_CTYPE", "C")
utf8_print(codes) # weird output

Here is the tail of the output:

[2143] "\U0001f64d\u200d\u2642\ufe0f"                              
[2157] "\U0001f645\U0001f3fc\u200d\u2642\ufe0f"                    
[2171] "\U0001f481\U0001f3fe\u200d\u2642\ufe0f"                    
[2185] "\U0001f469\u200d\U0001f467\u200d\U0001f467"                
[2199] "\U0001f468\u200d\U0001f467" "\U0001f468\u200d\U0001f466\u200d\U0001f466"
[2199] "\U0001f468\u200d\U0001f467\u200d\U0001f466"
[2199] "\U0001f468\u200d\U0001f467\u200d\U0001f467"
[2199] "\U0001f3f3\ufe0f\u200d\U0001f308"

Session information:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin16.7.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/C/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.2  utf8_1.0.0.9000

`utf8_encode(escapes = "31")` colors escapes and backslashes/quotes differently

This splits the output by opening/closing styling. In the first two examples, only the backslash is colored red. In the third example, \n is colored red.

@patperry: Is this intentional? Should we consistently color all characters in \" and \\ in red?

utf8::utf8_encode(c("\\", '"', "\n"), quote = TRUE, escapes = "31", display = TRUE) |>
  strsplit("\033[[][^m]*m")
#> [[1]]
#> [1] "\""   "\\"   "\\\""
#> 
#> [[2]]
#> [1] "\""   "\\"   "\"\""
#> 
#> [[3]]
#> [1] "\""  "\\n" "\""