traitecoevo / apcalign
R package for accessing, matching and updating species names of Australian flora
Home Page: https://traitecoevo.github.io/APCalign/
License: Other
The update_taxonomy() function takes an argument output = x to specify the location and name of the .csv file written by that function, so that the csv file is not just written to the working directory.
Is it possible to add a similar output = x argument to the align_taxa() function, which currently automatically writes "taxonomic_updates.csv" to the working directory?
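One way such an argument could behave, as a sketch only: `align_taxa_output_sketch()` and its `output` parameter are hypothetical stand-ins rather than the package's API, and the "alignment" is a placeholder. The file is written only when a path is supplied.

```r
# Hypothetical sketch: an `output` argument that controls where the CSV goes.
# align_taxa_output_sketch() is a stand-in, not the real align_taxa().
align_taxa_output_sketch <- function(original_name, output = NULL) {
  # Placeholder "alignment": trim surrounding whitespace only
  aligned <- data.frame(
    original_name = original_name,
    aligned_name = trimws(original_name),
    stringsAsFactors = FALSE
  )
  # Write to disk only if the caller supplied a path
  if (!is.null(output)) {
    dir.create(dirname(output), showWarnings = FALSE, recursive = TRUE)
    utils::write.csv(aligned, output, row.names = FALSE)
  }
  aligned
}

out_file <- file.path(tempdir(), "updates", "taxonomic_updates.csv")
res <- align_taxa_output_sketch(c(" Banksia serrata", "Acacia longifolia "),
                                output = out_file)
```

With `output = NULL` nothing is written, so the working directory is left untouched unless the user opts in.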
For align_taxa(), are the arguments (max_distance_abs = 3, max_distance_rel = 0.2) actually being used? I suspect the arguments aren't being passed appropriately.
It looks like align_taxa() calls match_taxa(), which calls fuzzy_match(max_distance_abs, max_distance_rel) many, many times.
Should I try to pass max_distance_abs and max_distance_rel across all calls?
Originally posted by @fontikar in #64 (comment)
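A minimal sketch of what passing the limits across all calls might look like. The three functions below are simplified, hypothetical stand-ins for align_taxa(), match_taxa() and fuzzy_match(); the real internals may differ, and dividing the distance by the input length is an assumption about how the relative limit is defined.

```r
# Hypothetical stand-ins showing the distance limits forwarded at every level.
fuzzy_match_sketch <- function(txt, accepted_list, max_distance_abs, max_distance_rel) {
  # Edit distance from the input to every candidate name
  d <- drop(utils::adist(txt, accepted_list))
  ok <- d <= max_distance_abs & d / nchar(txt) <= max_distance_rel
  if (!any(ok)) return(NA_character_)
  # Closest candidate that satisfies both limits
  accepted_list[ok][which.min(d[ok])]
}

match_taxa_sketch <- function(names, accepted_list, max_distance_abs, max_distance_rel) {
  vapply(names, fuzzy_match_sketch, character(1),
         accepted_list = accepted_list,
         max_distance_abs = max_distance_abs,
         max_distance_rel = max_distance_rel,
         USE.NAMES = FALSE)
}

align_taxa_with_limits <- function(names, accepted_list,
                                   max_distance_abs = 3, max_distance_rel = 0.2) {
  # Defaults are declared once here and forwarded unchanged
  match_taxa_sketch(names, accepted_list, max_distance_abs, max_distance_rel)
}

accepted <- c("Banksia serrata", "Acacia longifolia")
loose <- align_taxa_with_limits("Banksia serata", accepted)
strict <- align_taxa_with_limits("Banksia serata", accepted, max_distance_abs = 0)
```

Because the limits are explicit parameters at each level, a value set by the user in the top-level call cannot be silently replaced by a hard-coded one further down.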
Hello!
I am trying to use this package and have run into a strange error.
I downloaded the two packages (ausflora and datastorr) from GitHub and ran the following code from one of the examples:
create_taxonomic_update_lookup(c("Banksia integrifolia","Acacia longifolia","Commersonia rosea"),full=FALSE)
The code yields the following error message which I think has something to do with a bug in the APC dataset:
trying URL 'https://github.com/traitecoevo/ausflora/releases/download/0.0.2.9000/apc.parquet'
Content type 'application/octet-stream' length 10895737 bytes (10.4 MB)
downloaded 10.4 MB
Error: IOError: Couldn't deserialize thrift: TProtocolException: Invalid data
Please let me know if this is the case, and if so when it might be fixed.
The R version I am using is
R version 4.3.0
Cheers,
Sam
Lizzy has figured this out for austraits but needs to be generalized. code below
not sure what we want to do there
Need a readme, documenting installation and usage
Basic readme examples:
Data is downloaded from https://biodiversity.org.au/nsl/services/export/index They don’t currently hold a link to the actual download file.
Advice on acknowledgements and citations from Anne Fuchs:
Data is provided as CC-BY3. Also include the attribute ccAttributionIRI with the data. This provides a link back to the source data.
APNI
“Australian Plant Name Index (continuously updated), Centre of Australian National Biodiversity Research, www.biodiversity.org.au/nsl/services/apni (date of extract)”,
APC (taxon file): is from APC, which changes constantly. The file downloaded corresponds to the tree version in your file. If you look at the ccAttributionIRI, the following part of the URI, https://id.biodiversity.org.au/tree/{id}, is a resolvable identifier back to the version that was used for this download. So suggest a citation of
“Australian Plant Census, Centre of Australian National Biodiversity Research, Council of Heads of Australasian Herbaria, {date} https://id.biodiversity.org.au/tree/{id}”
Sending Eucalyptus deglupta to the create_taxonomic_update_lookup function returns a fuzzy-matched taxonomic synonym, Eucalyptus decepta, which is a completely different species. Eucalyptus deglupta does not appear to be considered native by the APC, and 'deglupta' and 'decepta' are close enough for fuzzy matching.
Maybe have a native check before doing the name matching? The native_anywhere_in_australia function returns FALSE for Eucalyptus deglupta.
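To see why the two names collide, the edit distance can be checked with base R. This is only a sketch: the limits of 3 and 0.2 are the max_distance_abs and max_distance_rel values quoted in another issue, and dividing by the input length is an assumption about how the relative limit is computed.

```r
# Edit distance between the submitted name and the fuzzy-matched candidate
input <- "Eucalyptus deglupta"
candidate <- "Eucalyptus decepta"
d_abs <- drop(utils::adist(input, candidate))  # absolute edit distance
d_rel <- d_abs / nchar(input)                  # relative to input length (assumed definition)
within_limits <- d_abs <= 3 && d_rel <= 0.2
```

Under those assumptions the pair sits right at the absolute limit, so lowering max_distance_abs to 2 would be enough to reject this particular match.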
Code is here to parse the taxonDistribution column, but it could be organized for particular use cases.
library(tidyverse)
library(stringr)
apc <- read_csv("data/APC-taxon-2022-02-14-5132.csv")
apc_species <- filter(apc, taxonRank == "Species",taxonomicStatus=="accepted")
# separate the states
sep_state_data <-
str_split(unique(apc_species$taxonDistribution), ",")
#get unique places
all_codes <- unique(str_trim(unlist(sep_state_data)))
apc_places <- unique(word(all_codes[!is.na(all_codes)], 1, 1))
# make a table to fill in: one character column per place
species_df <- tibble(species = apc_species$scientificName)
species_df[apc_places] <- NA_character_
#look for all possible entries after each state
state_parse_and_add_column <- function(species_df, state, apc_species){
print(all_codes[grepl(state,all_codes)]) # checking for weird ones
species_df[,state] <- case_when(
grepl(paste0("\\b",state," \\(uncertain origin\\)"), apc_species$taxonDistribution) ~ "uncertain origin",
grepl(paste0("\\b",state," \\(naturalised\\)"), apc_species$taxonDistribution) ~ "naturalised",
grepl(paste0("\\b",state," \\(doubtfully naturalised\\)"), apc_species$taxonDistribution) ~ "doubtfully naturalised",
grepl(paste0("\\b",state," \\(native and naturalised\\)"), apc_species$taxonDistribution) ~ "native and naturalised",
grepl(paste0("\\b",state," \\(formerly naturalised\\)"), apc_species$taxonDistribution) ~ "formerly naturalised",
grepl(paste0("\\b",state," \\(presumed extinct\\)"), apc_species$taxonDistribution) ~ "presumed extinct",
grepl(paste0("\\b",state," \\(native and doubtfully naturalised\\)"), apc_species$taxonDistribution) ~ "native and doubtfully naturalised",
grepl(paste0("\\b",state," \\(native and uncertain origin\\)"), apc_species$taxonDistribution) ~ "native and uncertain origin",
grepl(paste0("\\b",state), apc_species$taxonDistribution) ~ "native", #no entry = native, it's important this is last in the list
TRUE ~ "not present"
)
return(species_df)
}
#bug checking
#species_df<-state_parse_and_add_column(species_df,"LHI",apc_species)
#species_df<-state_parse_and_add_column(species_df,"HI",apc_species)
#go through the states one by one
for (i in 1:length(apc_places)){
species_df <- state_parse_and_add_column(species_df,apc_places[i],apc_species)
}
write_csv(species_df,"data/states_islands_species_list.csv")
Can we provide some kinda explanation for these different terms? @wcornwell said these are calculated from raw data. Maybe worth sticking in an article.Rmd for the pkgdown website.
We need the methods on how these are calculated pls 😄
library(purrr)
library(janitor)
status_matrix |>
  select(-species) |>
  flatten_chr() |>
  tabyl()
 flatten_chr(select(status_matrix, -species))      n      percent
 doubtfully naturalised                         1120 2.371003e-03
 formerly naturalised                            277 5.863998e-04
 native                                        40336 8.538997e-02
 native and doubtfully naturalised                 9 1.905270e-05
 native and naturalised                          136 2.879075e-04
 native and uncertain origin                       2 4.233933e-06
 naturalised                                    8765 1.855521e-02
 not present                                  421606 8.925258e-01
 presumed extinct                                101 2.138136e-04
 uncertain origin                                 22 4.657327e-05
Will work on website branch for this!
loading resources
print(n = 6) in vignette
When loading files via datastorr, can we use a consistent name, even if the file name changes?
e.g. APNI-names-2020-05-14-1341.csv -> APNI?
idea from @fontikar 👍
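A small sketch of how a stable name could be derived from a dated file name. `stable_name()` is a hypothetical helper, and it assumes file names follow the pattern seen in this thread, with the resource name before the first hyphen.

```r
# Sketch: derive a stable resource name from a dated file name.
# Assumes names look like "<RESOURCE>-<description>-<date>-<id>.csv".
stable_name <- function(file_name) {
  # Keep everything before the first hyphen
  sub("-.*$", "", basename(file_name))
}

resource_names <- stable_name(c("APNI-names-2020-05-14-1341.csv",
                                "data/APC-taxon-2022-02-14-5132.csv"))
```

Downstream code could then refer to the resource by this short name and stay unchanged when a newer dated file is released.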
library(ausflora)
library(tidyverse)
resources<-load_taxonomic_resources()
# resources must be passed explicitly: a default of `resources = resources` refers to itself and fails
plot_taxa_heat_map <- function(taxa, resources) {
ss <- create_species_state_origin_matrix(resources = resources)
ss %>%
pivot_longer(2:19, names_to = "State") %>%
filter(grepl(taxa, species)) %>%
filter(value != "not present") %>%
filter(
value %in% c(
"native",
"presumed extinct",
"naturalised",
"formerly naturalised",
"doubtfully naturalised"
)
) %>%
filter(State %in% c("WA", "Qld", "NT", "NSW", "Vic", "Tas", "SA", "ACT")) %>%
group_by(State, value) %>%
summarise (`number of species` = n()) %>%
ggplot(aes(x = State, y = value, fill = `number of species`)) +
geom_tile(color = "black") +
scale_fill_gradient2(
low = "#075AFF",
mid = "#FFFFCC",
high = "#FF0000"
) +
coord_fixed() + ggtitle(paste(taxa, " species"))
}
code from @dfalster :
create_lookup <- function(species_list, fuzzy_matching = FALSE, ver = "0.0.1.9000") {
tmp <- dataset_access_function(ver)
aligned_data <-
unique(species_list) %>%
align_taxa(fuzzy_matching = fuzzy_matching, ver=ver)
aligned_species_list_tmp <-
aligned_data$aligned_name %>% update_taxonomy()
aligned_species_list <-
aligned_data %>% select(original_name, aligned_name) %>%
left_join(aligned_species_list_tmp, by = c("aligned_name"), multiple= "first") %>%
filter(!is.na(taxonIDClean)) %>%
mutate(genus = word(canonicalName,1,1))
return(aligned_species_list)
}
Currently, it is unclear which files refer to which functions.
Best practice is to name each .R file after its main function, e.g. align_taxa.R
Sub-functions, i.e. those called from switch() or helper functions, are best stored in the same .R file as the main function.
Will work on new branch clean-r
Check they are happy with use of the data and workflow for reconciling taxon names
Trying with test data where there are NA in species, align_taxa will throw error if you don't drop NA!
library(tidyverse)
remotes::install_github("traitecoevo/ausflora", ref = "vignette")
#> Skipping install of 'ausflora' from a github remote, the SHA1 (12bd620b) has not changed since last install.
#> Use `force = TRUE` to force installation
library(ausflora)
dim(gbif_lite)
#> [1] 129 7
gbif_lite
#> # A tibble: 129 × 7
#> species infraspecificepithet taxonrank decimalLongitude decimalLatitude
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 Tetratheca c… <NA> SPECIES 145. -37.4
#> 2 Peganum harm… <NA> SPECIES 139. -33.3
#> 3 Calotis mult… <NA> SPECIES 115. -24.3
#> 4 Leptospermum… <NA> SPECIES 151. -34.0
#> 5 Lepidosperma… <NA> SPECIES 142. -37.3
#> 6 Enneapogon p… <NA> SPECIES 129. -17.8
#> 7 Acacia verti… <NA> SPECIES 144. -38.6
#> 8 Banksia serr… <NA> SPECIES 149. -37.8
#> 9 Glischrocary… <NA> SPECIES 136. -34.3
#> 10 Senna artemi… artemisioides SUBSPECI… 142. -25.9
#> # ℹ 119 more rows
#> # ℹ 2 more variables: scientificname <chr>, verbatimscientificname <chr>
resources <- load_taxonomic_resources(stable_or_current_data = "stable")
#> Loading resources...
#> ...done
gbif_lite |>
# tidyr::drop_na(species) |>
dplyr::pull(species) |>
align_taxa(resources = resources)
#> Checking alignments of 129 taxa
#> -> 0 names already matched; 0 names checked but without a match; 122 taxa yet to be checked
#> Error in `dplyr::mutate()`:
#> ℹ In argument: `fuzzy_match_genus = fuzzy_match_genera(genus,
#> resources$genera_accepted$canonicalName)`.
#> Caused by error in `purrr::map_chr()`:
#> ℹ In index: 72.
#> Caused by error in `if (words_in_text > 1) ...`:
#> ! missing value where TRUE/FALSE needed
#> Backtrace:
#> ▆
#> 1. ├─ausflora::align_taxa(dplyr::pull(gbif_lite, species), resources = resources)
#> 2. │ └─ausflora:::match_taxa(taxa, resources)
#> 3. │ └─taxa$tocheck %>% ...
#> 4. ├─dplyr::mutate(...)
#> 5. ├─dplyr:::mutate.data.frame(...)
#> 6. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
#> 7. │ ├─base::withCallingHandlers(...)
#> 8. │ └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
#> 9. │ └─mask$eval_all_mutate(quo)
#> 10. │ └─dplyr (local) eval()
#> 11. ├─ausflora (local) fuzzy_match_genera(genus, resources$genera_accepted$canonicalName)
#> 12. │ └─purrr::map_chr(x, ~fuzzy_match(.x, y, 2, 0.35, n_allowed = 1))
#> 13. │ └─purrr:::map_("character", .x, .f, ..., .progress = .progress)
#> 14. │ ├─purrr:::with_indexed_errors(...)
#> 15. │ │ └─base::withCallingHandlers(...)
#> 16. │ ├─purrr:::call_with_cleanup(...)
#> 17. │ └─ausflora (local) .f(.x[[i]], ...)
#> 18. │ └─ausflora:::fuzzy_match(.x, y, 2, 0.35, n_allowed = 1)
#> 19. └─base::.handleSimpleError(...)
#> 20. └─purrr (local) h(simpleError(msg, call))
#> 21. └─cli::cli_abort(...)
#> 22. └─rlang::abort(...)
gbif_lite |>
tidyr::drop_na(species) |>
dplyr::pull(species) |>
align_taxa(resources = resources)
#> Checking alignments of 127 taxa
#> -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked
#> # A tibble: 121 × 28
#> original_name cleaned_name aligned_name source known checked stripped_name
#> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr>
#> 1 Tetratheca cili… Tetratheca … Tetratheca … <NA> TRUE TRUE tetratheca c…
#> 2 Peganum harmala Peganum har… Peganum har… <NA> TRUE TRUE peganum harm…
#> 3 Calotis multica… Calotis mul… Calotis mul… <NA> TRUE TRUE calotis mult…
#> 4 Leptospermum tr… Leptospermu… Leptospermu… <NA> TRUE TRUE leptospermum…
#> 5 Lepidosperma la… Lepidosperm… Lepidosperm… <NA> TRUE TRUE lepidosperma…
#> 6 Enneapogon poly… Enneapogon … Enneapogon … <NA> TRUE TRUE enneapogon p…
#> 7 Acacia verticil… Acacia vert… Acacia vert… <NA> TRUE TRUE acacia verti…
#> 8 Banksia serrata Banksia ser… Banksia ser… <NA> TRUE TRUE banksia serr…
#> 9 Glischrocaryon … Glischrocar… Glischrocar… <NA> TRUE TRUE glischrocary…
#> 10 Senna artemisio… Senna artem… Senna artem… <NA> TRUE TRUE senna artemi…
#> # ℹ 111 more rows
#> # ℹ 21 more variables: stripped_name2 <chr>, trinomial <chr>, binomial <chr>,
#> # genus <chr>, aligned_reason <chr>, fuzzy_match_genus <chr>,
#> # fuzzy_match_genus_known <chr>, fuzzy_match_genus_APNI <chr>,
#> # fuzzy_match_binomial <chr>, fuzzy_match_binomial_APC_known <chr>,
#> # fuzzy_match_trinomial <chr>, fuzzy_match_trinomial_known <chr>,
#> # fuzzy_match_cleaned_APC <chr>, fuzzy_match_cleaned_APC_known <chr>, …
Created on 2023-07-19 with reprex v2.0.2
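Until missing values are handled inside the function, one workaround that keeps the output the same length as the input is to match only the non-missing names and put NA back afterwards. A sketch, where `align_skipping_na()` is a hypothetical wrapper and `toupper()` merely stands in for the real matching step:

```r
# Sketch: skip missing names before matching, then restore NA in place.
# `match_fun` is any function mapping non-missing names to aligned names.
align_skipping_na <- function(names, match_fun) {
  out <- rep(NA_character_, length(names))
  keep <- !is.na(names)
  out[keep] <- match_fun(names[keep])
  out
}

# toupper() is a placeholder for the matching step
res <- align_skipping_na(c("Banksia serrata", NA, "Acacia longifolia"), toupper)
```

Unlike tidyr::drop_na(), this keeps row positions intact, so the result can be bound straight back onto the original data.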
A small number of species from my test dataset didn't match using apcnames but did match when I searched for them on https://biodiversity.org.au/nsl/services/APC using the predictive text function. I'm unsure whether these didn't match because of the version of APC used by apcnames or because of some sort of syntax issue. All were species with a "sp" in the middle of the species name (see below).
e.g. Agrostis mulleriana dwarf form should become Agrostis muelleriana
Agrostis aff_hyemalis needs to be Agrostis sp. aff. hiemalis
This also depends on what happens with #34 in large part, but should remember to do this
If we progress, we may want a more enticing package name. What about ausflora, austaxa, or aus_plant_taxa?
Thoughts @wcornwell ?
Currently it looks like a large number of unmatched species from my dataset are genus-level identifications (i.e. Senna sp.). This makes sense given the APC search itself currently doesn't return anything for sp.'s, unless you give it something explicitly genus only.
Is it possible to convert "Genus sp." searches to genus-level searches and return a genus-level value for them, such that e.g. a Dryandra sp. entry would be returned as Banksia sp.?
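A rough sketch of the conversion being asked for. `to_genus_level()` is a hypothetical helper, and the one-entry synonym table is a toy built from the Dryandra example above rather than data taken from the APC.

```r
# Sketch: turn "Genus sp." names into a genus-level lookup.
# Toy synonym table based on the Dryandra -> Banksia example.
genus_synonyms <- c(Dryandra = "Banksia")

to_genus_level <- function(name) {
  # Names that are just a genus followed by "sp", "sp.", "spp" or "spp."
  is_genus_only <- grepl("^[A-Z][a-z]+ (sp|spp)\\.?$", name)
  genus <- sub(" .*$", "", name)
  # Swap in the accepted genus where a synonym is known
  accepted <- ifelse(genus %in% names(genus_synonyms),
                     genus_synonyms[genus], genus)
  ifelse(is_genus_only, paste(unname(accepted), "sp."), name)
}

genus_level <- to_genus_level(c("Dryandra sp.", "Senna sp.", "Banksia serrata"))
```

Names that carry a species epithet pass through unchanged, so the helper could run before the normal matching step.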
After meeting with @cboettig, we learnt about https://github.com/cboettig/contentid
This looks like a promising option for locally caching downloads. Plays nice with zenodo.
If you run this code:
create_taxonomic_update_lookup(
c(
"Banksia integrifolia integrifolia"
),
resources = resources
)
it retrieves Banksia integrifolia integrifolia as expected. Similarly, asking for Banksia integrifolia subsp. integrifolia also works
But if I ask for Banksia integrifolia ssp. integrifolia, it only retrieves the species name, not the subspecies
So 'ssp.' needs to be entered as an accepted notation for subspecies
Would have to check for other infraspecific ranks too, eg
v. or var.
form. or forma. or f.
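A sketch of how these variants could be normalised before matching. `standardise_ranks()` is hypothetical, the target forms ("subsp.", "var.", "f.") are an assumption about what the matching step expects, and the second and third inputs are placeholder names.

```r
# Sketch: rewrite common infraspecific rank abbreviations to one form each.
standardise_ranks <- function(name) {
  name <- gsub(" (ssp|subsp)\\.? ", " subsp. ", name)
  name <- gsub(" (v|var)\\.? ", " var. ", name)
  name <- gsub(" (f|form|forma)\\.? ", " f. ", name)
  name
}

cleaned <- standardise_ranks(c("Banksia integrifolia ssp. integrifolia",
                               "Genus species v. epithet",
                               "Genus species forma epithet"))
```

Each pattern requires a space on both sides, so the abbreviation is only rewritten when it sits between the species and the infraspecific epithet.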
Hello! I was wondering if it's possible to include a column in the output from update_taxonomy that gives a taxon's subclass, in addition to the column for family etc.?
Subclass is quite useful, e.g. when you are filtering a list for Magnoliidae only, to remove non-flowering plants.
I suspect that inserting "subclass" at line 323 in clean_names.R might achieve this. I will have a go at this and see if it works.
As I was writing the vignette, I noticed there are two different match_06 entries:
match_06. Automatic alignment with synonymous term among accepted canonical names in APC (2023-07-20)
match_06. Automatic alignment with synonymous term among known canonical names APC (2023-07-20)
Can these be collapsed into one, or distinguished as match_06A and match_06B to retain their nuances?
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
library(ausflora)
resources <- load_taxonomic_resources(stable_or_current_data = "stable")
#> Loading resources...
#> Warning: Error in curl::curl_fetch_memory(url, handle = handle): Timeout was reached: [hash-archive.carlboettiger.info] Operation timed out after 2002 milliseconds with 0 bytes received
#> ...done
aligned_gbif_taxa <- gbif_lite |>
tidyr::drop_na(species) |>
dplyr::pull(species) |>
align_taxa(resources = resources)
#> Checking alignments of 127 taxa
#> -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked
aligned_gbif_taxa |>
pull(aligned_reason) |>
tabyl() |>
tibble()
#> # A tibble: 5 × 3
#> `pull(aligned_gbif_taxa, aligned_reason)` n percent
#> <chr> <int> <dbl>
#> 1 match_06. Automatic alignment with synonymous term among accept… 112 0.926
#> 2 match_06. Automatic alignment with synonymous term among known … 6 0.0496
#> 3 match_08. Automatic alignment with synonymous name in APNI (202… 1 0.00826
#> 4 match_14. Automatic alignment with species-level canonical name… 1 0.00826
#> 5 match_20. Rewording name to be recognised as genus rank, with g… 1 0.00826
Created on 2023-07-20 with reprex v2.0.2
Hi team,
Anne Fuchs (@ afuchs1) advised "On a different topic, are you aware that the acronym of “AusFlora” is being used for the Flora of Australia https://ausflora.net/ and <www.ausflora.org.au> will this become confusing in the long term as the two products are somewhat different."
So suggest we find a new package name.
How about ausfloralign?
Suggestions welcome @wcornwell @rubysaltbush @ehwenk!
I'll put some time into updating this package sometime soon. It would be helpful to better understand use cases. If anyone could document use cases for the package, that would be very helpful. Particularly if they're not met by current functionality.
Thoughts @wcornwell @rubysaltbush @eflower @ehwenk
Thanks!
specifically aligned_name versus canonical_name.
i agree with @yangsophieee point here
this could be as simple as
read_csv("https://biodiversity.org.au/nsl/services/export/taxonCsv")
Hello!
Would it be possible to add an argument in align_taxa to set fuzzy matching to FALSE?
Perhaps it is an unusual use case, but I'm trying to run apcnames on a list of species that includes both Australian and a large number of international taxa. I'm only interested in the Australian taxa, but as the APC is the best source of info on what taxa are found in Australia, I thought apcnames might be a good way to separate Australian from international taxa. Unfortunately, the fuzzy matching in align_taxa coerced a large number of genera/species not found in Australia into adjacent Australian taxa.
I've figured out a workaround, but thought perhaps it could be useful to have an argument to turn fuzzy matching in align_taxa on or off? Or could this be dealt with by changing the max_distance settings?
The update_taxonomy function loads a large list of 6 variables into the global environment while it's running. Is it possible to unload this taxonomic_resources list when the function finishes if it's no longer needed?
The update_taxonomy function returned duplicate rows for two species in my data set, both orchids matched via APNI (Caladenia tentaculata and Pterostylis aff. nana). I suspect this is because of problematic taxonomy for these taxa leading to duplicated records in APNI. The duplicate rows are identical in each column, so I have removed them by calling dplyr::distinct on the output from update_taxonomy. Not sure if this could/should be worked into the function or not.
If I understand right we're only using APNI when the name is not in APC, so APNI could be reduced to just the rows that don't duplicate APC?
which would take us from 132385 to 29059 rows
going to just leave this in an issue for the moment
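The reduction described above amounts to an anti-join of APNI against APC. A sketch with toy data frames, assuming `canonicalName` is the column shared by both tables:

```r
# Sketch: keep only APNI rows whose name is absent from APC.
# Toy data; `canonicalName` is assumed to be the shared key column.
apc <- data.frame(canonicalName = c("Banksia serrata", "Acacia longifolia"),
                  stringsAsFactors = FALSE)
apni <- data.frame(canonicalName = c("Banksia serrata", "Acacia longifolia",
                                     "Name only in APNI"),
                   stringsAsFactors = FALSE)

# Base R equivalent of an anti-join on canonicalName
apni_only <- apni[!apni$canonicalName %in% apc$canonicalName, , drop = FALSE]
```

Whether the real tables should be compared on this column alone, or on a name identifier instead, would need checking against the actual data.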
move loading to top level
Currently we're storing copies of the data in GitHub releases, but ideally we'd archive copies of the APC/APNI outputs on Zenodo. These files would then be drawn upon by the package.
I checked in with @afuchs1, who's part of the team preparing and publishing the data. They have been thinking along similar lines and are currently working towards this goal, so we can stay tuned for updates.
I'd like to update the DESCRIPTION file at some point (not urgent!)
Leaving this here for reference
I would assume @ehwenk did all the heavy lifting for this package! 💪
Would Anna Monroe and Anne Fuchs like to be named on package as key players in data distribution 📊
Same with Carl with contentID 🗄️
and everyone wonderful found here
When running the function ausflora::native_anywhere_in_australia(), inputting a misspelt taxon returns FALSE for the outcome native_anywhere_in_aus.
For example, ausflora::native_anywhere_in_australia("Banks").
Could this instead return, as an example, "not in list" or NA?
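A sketch of the three-way behaviour being requested. `native_or_na()` is hypothetical and `native_lookup` is a toy table for illustration, not real APC data.

```r
# Sketch: unknown names give NA instead of FALSE.
# `native_lookup` is a toy table used only for illustration.
native_lookup <- c("Banksia serrata" = TRUE, "Eucalyptus deglupta" = FALSE)

native_or_na <- function(species) {
  # Indexing a named vector by an unknown name yields NA, not FALSE
  unname(native_lookup[species])
}

status <- native_or_na(c("Banksia serrata", "Eucalyptus deglupta", "Banks"))
```

Returning NA keeps "not native" and "name not recognised" distinguishable, which also matters for the fuzzy-matching issue raised elsewhere in this thread.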
Hello!
Tried installing current Github version of ausflora to run on a new dataset, but got the following dplyr::bind_rows error when trying to run update_taxonomy
> updated_sisspec <- ausflora::update_taxonomy(aligned_sisspec$aligned_name)
Error in `dplyr::bind_rows()`:
! Can't combine `..1$source` <character> and `..2$source` <logical>.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/vctrs_error_incompatible_type>
Error in `dplyr::bind_rows()`:
! Can't combine `..1$source` <character> and `..2$source` <logical>.
---
Backtrace:
1. ausflora::update_taxonomy(aligned_sisspec$aligned_name)
5. dplyr::bind_rows(taxa_APC, taxa_APNI)
8. vctrs::vec_rbind(!!!dots, .names_to = .id)
I'm not sure if this is a problem caused by a particular name in my list of names (sister_species.csv)?
I've tried removing the few non-Australian taxa that returned NAs for aligned_name but this did not fix it.
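One plausible cause, offered as a guess rather than a diagnosis: if one of the two tables has no matches, its `source` column may contain only NA, which R stores as logical. The sketch below reproduces that situation with toy data and shows an explicit coercion; the table names are borrowed from the backtrace, everything else is made up.

```r
# Sketch of a possible cause: a column holding only NA defaults to logical.
taxa_APC <- data.frame(name = "Banksia serrata", source = "APC",
                       stringsAsFactors = FALSE)
taxa_APNI <- data.frame(name = "Unmatched name", source = NA,
                        stringsAsFactors = FALSE)
source_class_before <- class(taxa_APNI$source)  # logical, not character

# Coerce explicitly so both tables agree before they are combined
taxa_APNI$source <- as.character(taxa_APNI$source)
combined <- rbind(taxa_APC, taxa_APNI)
```

Base rbind() would coerce silently, whereas dplyr::bind_rows() refuses to combine character with logical, which matches the error message above.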
Just creating this issue so I can tag my commits and PR.
Documentation we need:
usethis::use_package_doc
usethis::use_vignette
usethis::use_article: accessed via the pkgdown website but not via R using vignette("ausflora"). Articles are suited for longer-form documentation.
Source is NA in align_taxa but seems to return nicely in update_taxonomy
user_data <- tibble( my_species_names = c(
"Eucalyptus regnans", "Acacia melanoxylon",
"Banksia integrifolia", "Commersonia rosea",
"Not a species"),other_trait_data = rnorm(5))
aligned<- align_taxa(user_data$my_species_names, resources = resources)
aligned |> select(source)
updated <- aligned$aligned_name |> update_taxonomy(resources = resources)
updated |> select(source)