alistaire47 / passport

Travel smoothly between country name and code formats

Home Page: https://alistaire47.github.io/passport/

License: Other

R 100.00%
r country-names country-codes country-data package

passport's Introduction

passport

passport smooths the process of working with country names and codes via powerful parsing, standardization, and conversion utilities arranged in a simple, consistent API. Country name formats are drawn from multiple sources, including the Unicode CLDR common-sense standardizations in hundreds of languages.

Installation

Install from CRAN with

install.packages("passport")

or the development version from GitHub with

# install.packages("remotes")
remotes::install_github("alistaire47/passport")

Travel smoothly between country name and code formats

Working with country data can be frustrating. Even with well-curated data like gapminder, there are some oddities:

library(passport)
library(gapminder)
library(dplyr)    # Works equally well in any grammar.
library(tidyr)
set.seed(47)

grep("Korea", unique(gapminder$country), value = TRUE)
#> [1] "Korea, Dem. Rep." "Korea, Rep."
grep("Yemen", unique(gapminder$country), value = TRUE)
#> [1] "Yemen, Rep."

passport offers a framework for working with country names and codes without manually editing data or scraping codes from Wikipedia.

I. Standardize

If data has non-standardized names, standardize them to an ISO 3166-1 code or other standardized code or name with parse_country:

gap <- gapminder %>% 
    # standardize to ISO 3166 Alpha-2 code
    mutate(country_code = parse_country(country))

gap %>%
    select(country, country_code, year, lifeExp) %>%
    sample_n(10)
#> # A tibble: 10 x 4
#>    country                  country_code  year lifeExp
#>    <fct>                    <fct>        <int>   <dbl>
#>  1 France                   FR            2002    79.6
#>  2 Ireland                  IE            1997    76.1
#>  3 Honduras                 HN            1982    60.9
#>  4 Iran                     IR            1967    52.5
#>  5 Central African Republic CF            1972    43.5
#>  6 Madagascar               MG            1997    55.0
#>  7 Albania                  AL            1952    55.2
#>  8 Jamaica                  JM            2002    72.0
#>  9 Philippines              PH            1997    68.6
#> 10 Libya                    LY            1972    52.8

If country names are particularly irregular, in unsupported languages, or are even just unique location names, parse_country can use Google Maps or Data Science Toolkit geocoding APIs to parse instead of regex:

parse_country(c("somewhere in Japan", "日本", "Japon", "जापान"), how = "google")
#> [1] "JP" "JP" "JP" "JP"

parse_country(c("1600 Pennsylvania Ave, DC", "Eiffel Tower"), how = "google")
#> [1] "US" "FR"

II. Convert

If data comes with countries already coded,

  • convert them to ISO or other codes with as_country_code()
  • convert them to country names with as_country_name()
  • convert them to other languages with as_country_name()

# NATO member defense expenditure data; see `?nato`
data("nato", package = "passport")

nato %>% 
    select(country_stanag) %>% 
    distinct() %>%
    mutate(
        country_iso = as_country_code(country_stanag, from = "stanag"),
        country_name = as_country_name(country_stanag, from = "stanag", short = FALSE),
        country_name_tamil = as_country_name(country_stanag, from = "stanag", to = "ta-my")
    )
#> # A tibble: 29 x 4
#>    country_stanag country_iso country_name country_name_tamil
#>    <chr>          <chr>       <chr>        <chr>             
#>  1 ALB            AL          Albania      அல்பேனியா         
#>  2 BEL            BE          Belgium      பெல்ஜியம்          
#>  3 BGR            BG          Bulgaria     பல்கேரியா         
#>  4 CAN            CA          Canada       கனடா             
#>  5 CZE            CZ          Czechia      செசியா           
#>  6 DEU            DE          Germany      ஜெர்மனி           
#>  7 DNK            DK          Denmark      டென்மார்க்          
#>  8 ESP            ES          Spain        ஸ்பெயின்           
#>  9 EST            EE          Estonia      எஸ்டோனியா         
#> 10 FRA            FR          France       பிரான்ஸ்           
#> # … with 19 more rows

Language formats largely follow the IETF BCP 47 language tag format. For all available formats, run DT::datatable(codes) for an interactive widget of format names and further information.

III. Format

A particularly common hangup with country data is presentation. While “Yemen, Rep.” may be fine for exploratory work, to create a plot to share, such names need to be changed to something more palatable either by editing the data or manually overriding the labels directly on the plot.

If the existing format is already standardized, passport offers another option: use a formatter function created with country_format, just like for thousands separators or currency formatting. Reorder simply with order_countries:

library(ggplot2)

living_longer <- gap %>% 
    group_by(country_code) %>% 
    summarise(start_life_exp = lifeExp[which.min(year)], 
              stop_life_exp = lifeExp[which.max(year)], 
              diff_life_exp = stop_life_exp - start_life_exp) %>% 
    top_n(10, diff_life_exp) 
#> `summarise()` ungrouping output (override with `.groups` argument)

# Plot country codes...
ggplot(living_longer, aes(x = country_code, y = stop_life_exp - 3.3,
                          ymin = start_life_exp, 
                          ymax = stop_life_exp - 3.3, 
                          colour = factor(diff_life_exp))) + 
    geom_point(pch = 17, size = 15) + 
    geom_linerange(size = 10) + 
                     # ...just pass `labels` a formatter function!
    scale_x_discrete(labels = country_format(),
                     # Easily change order
                     limits = order_countries(living_longer$country_code, 
                                              living_longer$diff_life_exp)) + 
    scale_y_continuous(limits = c(30, 80)) + 
    labs(title = "Life gets better",
         subtitle = "Largest increase in life expectancy",
         x = NULL, y = "Life expectancy") + 
    theme(axis.text.x = element_text(angle = 30, hjust = 1), 
          legend.position = "none")

By default country_format will use Unicode CLDR (see below) English names, which are intelligible and suitable for most purposes. If desired, other languages or formats can be specified just like in as_country_name.
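The formatter pattern described above can be illustrated without passport. The following is a minimal base-R sketch, not passport's implementation: make_code_formatter and the lookup vector are hypothetical names invented for illustration. It shows how a factory can return a closure that a scale's labels argument would then call on the axis breaks.

```r
# Hypothetical lookup table; passport ships far more complete data.
lookup <- c(FR = "France", KR = "South Korea", YE = "Yemen")

# Hypothetical factory: returns a function mapping codes to display names.
make_code_formatter <- function(lookup) {
    function(codes) {
        labels <- unname(lookup[as.character(codes)])
        # Fall back to the raw code when no name is known
        ifelse(is.na(labels), as.character(codes), labels)
    }
}

fmt <- make_code_formatter(lookup)
fmt(c("YE", "KR", "ZZ"))
#> [1] "Yemen"       "South Korea" "ZZ"
```

Because the closure only needs the breaks it is handed, the underlying data can keep its standardized codes while the plot shows readable names.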


Data

The data underlying passport comes from a number of sources, including Unicode CLDR and the regular expressions from the countrycode package (see Licensing).

Licensing

passport is licensed as open-source software under GPL-3. Unicode CLDR data is licensed according to its own license, a copy of which is included. countrycode regexes are used in modified form under GPL-3; see the included aggregation script for the modifying code and date.

passport's People

Contributors

alistaire47

passport's Issues

add google API key support for parse_country

The Google Places API no longer allows keyless access, and changed from allowing 2500 free calls a day to giving each user $200 of free usage each month. Given the change, it seems worthwhile to allow a key to be specified. Short of very heavy usage (e.g. using it to find countries for thousands of addresses) the current structure should always be well under $200/month, but since this could cost users actual money, some clear warnings/confirmations are in order.
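One possible shape for the key handling and warnings is sketched below in base R. Everything here is an assumption made for illustration: get_google_key, confirm_batch, the GOOGLE_MAPS_API_KEY environment variable name, and the threshold of 1000 requests are all hypothetical and not part of passport.

```r
# Hypothetical helper: take a key from an argument or an environment
# variable, and fail with a clear message instead of making a keyless call.
get_google_key <- function(key = Sys.getenv("GOOGLE_MAPS_API_KEY")) {
    if (!nzchar(key)) {
        stop("No Google API key found. Set GOOGLE_MAPS_API_KEY or pass `key`; ",
             "note that calls beyond the free monthly credit are billed.",
             call. = FALSE)
    }
    key
}

# Hypothetical guard: warn before sending a large batch of billable requests.
# The default threshold is arbitrary.
confirm_batch <- function(n_requests, threshold = 1000) {
    if (n_requests > threshold) {
        warning(n_requests, " geocoding requests may incur charges.",
                call. = FALSE)
    }
    invisible(n_requests)
}
```

Reading the key from an environment variable keeps it out of scripts that get shared or committed.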

add administrative division support

It would be nice to add administrative division support, including names in as many languages as possible (start with UN-official?) and codes where possible (at least ISO 3166-2), in parallel with as_country_code, as_country_name, and parse_country, maybe with as_division_code etc. ("state" is country-specific). parse_division (or whatever it's called) should probably take a country parameter to limit result scope.

Code can be recycled and refactored to be multi-purpose, but new data will have to be assembled. ISO 3166-2 codes are easy enough to grab, but I don't think CLDR yet has administrative divisions, so non-English names may be hard.

Going beyond principal subdivisions (e.g. not just U.S. states, but down to counties/parishes) seems unlikely unless a spectacular data source appears. AFAIK most (all?) codes for them are country-specific (e.g. FIPS), so aggregating data would be a pain. Machine translation of names is possible, if it's useful.

  • Aggregate data
  • Build conversion functions
  • Build parsing function
  • Add way to mutate from division to country without geocoding
  • Build attribute function (capitals, at least)
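To make the role of the country parameter concrete, here is a minimal base-R sketch. It is only an illustration under stated assumptions: parse_division and division_country are hypothetical functions, the matching is exact rather than regex-based, and the divisions table is a four-row sample of ISO 3166-2 codes, not data shipped with passport.

```r
# Illustrative subset of ISO 3166-2 codes; not data shipped with passport.
divisions <- data.frame(
    country = c("IN", "PK", "US", "US"),
    code = c("IN-PB", "PK-PB", "US-GA", "US-CA"),
    name = c("Punjab", "Punjab", "Georgia", "California"),
    stringsAsFactors = FALSE
)

# Hypothetical parse_division(): exact, case-insensitive name matching,
# optionally restricted to one country to resolve ambiguous names.
parse_division <- function(x, country = NULL, data = divisions) {
    if (!is.null(country)) {
        data <- data[data$country == country, ]
    }
    vapply(x, function(name) {
        hits <- data$code[tolower(data$name) == tolower(name)]
        # Ambiguous or unknown names return NA rather than a guess
        if (length(hits) == 1) hits else NA_character_
    }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: ISO 3166-2 codes begin with the alpha-2 country code,
# so moving from division to country needs no geocoding.
division_country <- function(code) substr(code, 1, 2)

parse_division("Punjab")                  # ambiguous without a country
#> [1] NA
parse_division("Punjab", country = "IN")  # scoped to India
#> [1] "IN-PB"
division_country("IN-PB")
#> [1] "IN"
```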

non-logical datasets shouldn't start with "is_"

Currently is_developed and is_independent return character vectors:

unique(passport:::countries$is_developed)
#> [1] NA           "Developed"  "Developing"

unique(passport:::countries$is_independent)
#>  [1] NA                       "Yes"                   
#>  [3] "Territory of GB"        "International"         
#>  [5] "Territory of US"        "Part of NL"            
#>  [7] "Part of FI"             "Part of FR"            
#>  [9] "Territory of NO"        "Territory of AU"       
#> [11] "Associated with NZ"     "In contention"         
#> [13] "Part of DK"             "Crown dependency of GB"
#> [15] "Part of CN"             "Commonwealth of US"    
#> [17] "Territory of FR"        "Territory of NZ"       
#> [19] "Territories of US"

If they're going to start with is_, they should really return logical vectors. To address the issue, they could

  • drop information to actually return a logical
  • get renamed
  • be split in two, e.g. is_independent and dependency_status
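The third option might look like the following base-R sketch. The values are a subset of the is_independent output above; the split itself, including treating "In contention" as not independent, is an assumption made for illustration rather than a decided design.

```r
# Subset of the values returned by `is_independent` above
status <- c(NA, "Yes", "Territory of GB", "Part of NL", "In contention")

# Hypothetical split: a true logical column plus a character status column.
# Everything other than "Yes" is treated as not independent (an assumption).
is_independent <- status == "Yes"
dependency_status <- ifelse(is_independent, NA_character_, status)

is_independent
#> [1]    NA  TRUE FALSE FALSE FALSE
dependency_status
#> [1] NA                NA                "Territory of GB" "Part of NL"     
#> [5] "In contention"
```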

None of these options is really ideal, as the expectation of as_country_code and as_country_name is usually to return a character vector or factor. They are not the only exceptions:

code_types <- sapply(passport:::countries, typeof) 

code_types[code_types != 'character']
#>                        gaul              un_region_code 
#>                    "double"                   "integer" 
#>           un_subregion_code un_intermediate_region_code 
#>                   "integer"                   "integer" 
#>                         m49                         ldc 
#>                   "integer"                   "logical" 
#>                        lldc                        sids 
#>                   "logical"                   "logical"

Numeric country codes (gaul, un_*_code, m49) are a different issue. Perhaps they should be strings, as they should not be operated upon, but converting them to factors is potentially very confusing and may merit a warning or message.
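The factor confusion mentioned above is easy to reproduce in base R. This short sketch uses three M49 codes (840 for the United States, 4 for Afghanistan, 250 for France) and shows why zero-padded strings may be the safer representation; it does not touch passport's data.

```r
m49 <- c(840L, 4L, 250L)  # United States, Afghanistan, France

# Converting a factor back with as.integer() returns level indices,
# not the original codes.
as.integer(factor(m49))
#> [1] 3 1 2

# Going through character preserves the codes.
as.integer(as.character(factor(m49)))
#> [1] 840   4 250

# Zero-padded strings keep the three-digit M49 form and cannot be
# used in arithmetic by accident.
sprintf("%03d", m49)
#> [1] "840" "004" "250"
```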

Country groupings (ldc, lldc, sids, un_*_code) will be addressed by #1 (though they face the same type issue).

These two (plus a lot more) should be split into a separate set of country attributes (#3), but the issue will still have to be addressed within that dataset.

This will be a breaking change, but integrating the change with #3 will minimize disruption.

Rebuild data and release patch version

It's been a while since the data for this package has been recompiled, and some of it does change, so it's time to rebuild and release a patch version to CRAN.

split country attributes into separate data.frame and function

Country attributes that are not strictly country names or codes (tld, currency, capital, independence and territorial status (#2), etc.) should be split out of countries and accessed with a more apt function, maybe as_country_attr.

Reverse conversion (where sensible, e.g. tld or currency code) should be enabled through separate helpers to as_country_code and as_country_name akin to convert_country in order to avoid losing functionality. Conversion can probably use iso2c as a key to bridge the datasets. Some observations lack an iso2c code, though; once a list of attributes is collected, it remains to be seen whether any of those observations has attribute data that would justify a separate key column.

TODO:

  • Assemble list of attributes to separate
  • Check if all with data have iso2c to use as key column
  • Rebuild aggregate.Rmd as necessary to generate separate data.frames
  • Build separate parser—reuse convert_country?
  • Add separate helper functions to as_country_code and as_country_name to allow back-conversion
    • Add warning for non-unique types or limit reverse conversion
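The bridging idea can be sketched in base R as follows. This is a rough illustration only: the two data frames are three-row samples, attr_to_code is a hypothetical name, and as_country_attr here is a stand-in for the proposed function, not an existing passport export.

```r
# Illustrative subsets; the column names are assumptions.
codes <- data.frame(iso2c = c("FR", "DE", "JP"),
                    iso3c = c("FRA", "DEU", "JPN"),
                    stringsAsFactors = FALSE)
attrs <- data.frame(iso2c = c("FR", "DE", "JP"),
                    tld = c(".fr", ".de", ".jp"),
                    currency = c("EUR", "EUR", "JPY"),
                    stringsAsFactors = FALSE)

# Hypothetical as_country_attr(): convert to the iso2c key, then look up.
as_country_attr <- function(x, from, to) {
    key <- codes$iso2c[match(x, codes[[from]])]
    attrs[[to]][match(key, attrs$iso2c)]
}

# Hypothetical reverse helper: refuse to guess when the mapping is not unique.
attr_to_code <- function(x, from, to = "iso2c") {
    vapply(x, function(value) {
        keys <- attrs$iso2c[attrs[[from]] == value]
        if (length(keys) > 1) {
            warning("'", value, "' matches multiple countries; returning NA",
                    call. = FALSE)
            return(NA_character_)
        }
        if (length(keys) == 0) return(NA_character_)
        codes[[to]][match(keys, codes$iso2c)]
    }, character(1), USE.NAMES = FALSE)
}

as_country_attr(c("FRA", "JPN"), from = "iso3c", to = "currency")
#> [1] "EUR" "JPY"
attr_to_code(".de", from = "tld", to = "iso3c")
#> [1] "DEU"
```

A shared currency such as "EUR" is the non-unique case: attr_to_code("EUR", from = "currency") warns and returns NA instead of picking one country.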

Failed Parse for Regex Option

The Regex option of parse_country is unable to parse the following countries:

  • North Ireland
  • St. Martin
  • Saint Martin
  • Channel Islands
  • Eswatini
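As an illustration of the kind of pattern that would cover one of these names, the base-R sketch below matches both Eswatini and its former name, Swaziland. The pattern is hypothetical and is not the regex used by passport or countrycode.

```r
# Hypothetical pattern covering both the current and the former name
eswatini_regex <- "swaziland|eswatini"

grepl(eswatini_regex,
      c("Eswatini", "Kingdom of Eswatini", "Swaziland", "Switzerland"),
      ignore.case = TRUE, perl = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE
```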

handle country groupings more effectively

It would be useful to handle containment and country groupings more effectively, e.g. when moving from countries to regions:

passport::as_country_name(c('FR', 'DE', 'SG'), 'continent')
#> Multiple unique values aggregated to single output
#> [1] "EU" "EU" "AS"

passport::as_country_name(c('FR', 'DE', 'SE', 'SG'), 'en_un_subregion')
#> Multiple unique values aggregated to single output
#> [1] "Western Europe"     "Western Europe"     "Northern Europe"   
#> [4] "South-eastern Asia"

Issues with current implementation:

  • Regions are obviously not countries.
  • Changes are not reversible, as indicated by the message.
    • Apart from introducing NAs, reversibility should be a goal for the conversion functions.
  • For group membership (EU, UN, G20, etc.) it would be nice to have vectors for filtering and generating logical columns

Some of these are already in countries (above, UN SIDS, UN development status, etc.) and CLDR has some data, though there are certainly more country groups that should be added.

Use cases:

  1. Converting to a superset, maybe with new function as_region
    • Are all groupings regions?
    • "Region" has a lot of definitions, so this name may clash with other packages
    • Is as_ the right prefix? It's conversion, yes, but levels are being aggregated.
  2. Adding a logical column for whether a country is in a group
    • Could have its own function, but if groups are exposed as vectors (via an accessor?), users can just use %in%
  3. Filtering to countries in a group, e.g. the OECD countries
    • Requires groups be exposed and in same format as existing country data

Addressing 2 and 3 requires group vectors, but does it make more sense to

  • make an accessor function that returns a vector of the specified group converted to the specified format, or
  • add a bunch of iso2c vectors of groups as package data which can be converted with existing conversion functions, or
  • both?

To codify, the TODO:

  • Reproducibly generate vectors of country groupings (necessary even if they're not exported)
  • Aggregate group vectors into internal data structure?
  • Make group accessor and/or document group data exposed
  • Separate regions from existing data; deduplicate new data (alts are irrelevant)
  • Write as_region (or whatever it ends up going by)
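Use cases 2 and 3 can be sketched in base R once a group is available as a vector. In this illustration, country_group is a hypothetical accessor and the G7 membership vector stands in for whatever group data the package might eventually expose.

```r
# G7 members as ISO 3166-1 alpha-2 codes
g7 <- c("CA", "FR", "DE", "IT", "JP", "GB", "US")

# Hypothetical accessor; a real one would also convert to a requested format.
country_group <- function(group) {
    groups <- list(g7 = g7)
    groups[[group]]
}

countries <- data.frame(iso2c = c("FR", "SG", "JP", "SE"),
                        stringsAsFactors = FALSE)

# Use case 2: logical membership column
countries$in_g7 <- countries$iso2c %in% country_group("g7")

# Use case 3: filter to members
countries[countries$in_g7, "iso2c"]
#> [1] "FR" "JP"
```

With groups exposed as plain vectors in the same format as the data, %in% covers both the logical column and the filter without a dedicated function.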
