traitecoevo / apd

The Australian Plant Traits Dictionary

Home Page: https://traitecoevo.github.io/APD/

R 0.16% HTML 99.84%

apd's Issues

Add links to RVA and Zenodo

Once we've posted a proper release, we should add something in the header (possibly the navbar) linking to:

RVA vocab
Zenodo release

Typos in correspondence with GIFT

Hi @ehwenk & @dfalster,

while building the trait correspondence network I noticed some issues in the correspondence with GIFT traits (I still have to do the same for BIEN and TRY).
I checked that the GIFT trait codes and trait names provided by AusTraits both exist in GIFT, and that each provided GIFT trait name matches its provided GIFT trait code.

The script I used is below, but first I'll detail my findings.

For trait_0030020 & trait_0030015, GIFT_close contains multiple traits on a single line. Is this on purpose? Other matched traits span multiple lines.
For trait_0030215, there is a typo in the GIFT_exact name: it is written "Fuiting time", missing an r.
For trait_0030020, there is a typo in the GIFT code: 'leaf_thorns_1 [GIFT:4:14.1]' should be 'leaf_thorns_1 [GIFT:4.14.1]'.
Several GIFT trait names are written following AusTraits' convention rather than GIFT's, e.g. 'seed_height' instead of 'Seed height'.
Capitalization of trait names doesn't follow GIFT's names: APD tends to use snake_case while GIFT uses Camel_snake_case. For example, the GIFT name referenced in APD is 'flower_colour' [APD:trait_0012417], while in GIFT the trait is 'Flower_colour' [GIFT:3.21.1].
There is an error in the GIFT match for trait_0030060: GIFT_close matches GIFT 1.4.1 (Climber_1) while it should match GIFT 3.4.1 (Reproduction_sexual_1). This was the trait that triggered my systematic search for potential mismatches, because the correspondence network contained a much larger connected component than expected, linking traits that shouldn't match.

Maybe you could use an adaptation of the script below to perform semi-automated quality checks when updating the APD?

For the sake of completeness, I'll try performing the same checks for TRY and BIEN.

Matching script

library("dplyr")

gift_trait_meta = GIFT::GIFT_traits_meta()

apd_gift_detailed = tibble::as_tibble(read.csv("APD_traits_input.csv")) |>
  select(identifier:label, starts_with("GIFT")) |>
  rename(trait_id = identifier) |>
  tidyr::pivot_longer(
    starts_with("GIFT"), names_to = "match_type", values_to = "matched_trait"
  ) |>
  filter(matched_trait != "") |>
  mutate(
    # Split for traits that have multiple matches on one line
    split_traits = purrr::map(stringr::str_split(matched_trait, ";"), trimws),
    # Extract GIFT trait name
    extracted_trait = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "^(.*)\\s\\[", group = 1)
    ),
    # Extract GIFT trait code
    extracted_code = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "\\[GIFT:(.+)\\]", group = 1)
    ),
    # Get level
    gift_lvl = purrr::map(
      extracted_code, \(x) stringr::str_count(x, stringr::fixed(".")) + 1L
    )
  ) |>
  # Put everything in a tidy format
  tidyr::unnest(split_traits:gift_lvl)

## Level 2 traits
# Matching code at level 2
apd_gift_lvl2 = apd_gift_detailed |>
  filter(gift_lvl == 2) |>
  left_join(
    gift_trait_meta |>
      distinct(Lvl2, Trait1),
    by = c(extracted_code = "Lvl2")
  )

# Problematic traits
apd_gift_lvl2 |>
  filter((extracted_trait != Trait1) | is.na(Trait1))


## Level 3 traits
apd_gift_lvl3 = apd_gift_detailed |>
  filter(gift_lvl == 3) |>
  left_join(
    gift_trait_meta |>
      distinct(Lvl3, Trait2),
    by = c(extracted_code = "Lvl3")
  )

# Problematic traits
apd_gift_lvl3 |>
  filter((extracted_trait != Trait2) | is.na(Trait2))
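
A format check could flag malformed references such as 'leaf_thorns_1 [GIFT:4:14.1]' before any lookup against GIFT. The following is a minimal base-R sketch, assuming each reference has already been split on ";" and that GIFT codes are dot-separated integers, as in the examples in this issue. `is_valid_gift_ref` is a hypothetical helper, not part of the GIFT package or the APD code.

```r
# Hypothetical helper: TRUE when a reference looks like "name [GIFT:4.14.1]",
# i.e. a name, a space, then dot-separated integers inside "[GIFT:...]".
# Assumption: codes contain only digits and dots.
is_valid_gift_ref = function(x) {
  grepl("^.+ \\[GIFT:[0-9]+(\\.[0-9]+)*\\]$", x)
}

refs = c("leaf_thorns_1 [GIFT:4.14.1]", "leaf_thorns_1 [GIFT:4:14.1]")

# References that fail the format check
refs[!is_valid_gift_ref(refs)]
```

This only checks the shape of the reference. Whether the code exists in GIFT and matches the name still needs the joins in the script above.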

Ideas from TERN repository

Files/folders within the TERN GitHub repository that are good examples of what we might like to create for the APT:

https://ternaustralia.github.io/ontology_tern/
https://github.com/ternaustralia/dawe-rlp-vocabs/blob/1b4e3daa611deae25ec90445be27[…]9c12cc/vocab_files/categorical_collections/luts/growth-form.ttl
https://github.com/ternaustralia/dawe-rlp-vocabs
https://linkeddata.tern.org.au/viewers/dawe-vocabs

Typos in correspondence with TRY

Similar to #28, this time for the correspondence with TRY.

I've performed a similar matching of codes and names in TRY, and found a few typos (see the detailed script below).

Same remark as for GIFT: some APD traits have several matching TRY traits on the same line, e.g., trait_0030810 has two traits matching on GIFT_close.
Names mostly match, but some are off because TRY silently changed the trait names. They can be updated by matching on TraitID against an up-to-date TRY trait table (downloadable from the TRY website: https://www.try-db.org/de/DnldTraitList.php).
More serious are the codes that do not correspond. Some matches appear to be wrong because of this.
For example, 'leaf_cell_wall_N_per_cell_wall_dry_mass' [APD:trait_0001511] is listed in APD as a close match to 'Leaf cell wall nitrogen (N) per unit cell wall dry mass' with code [TRY:96]. However, TRY trait 96 is 'Seed oil content per seed mass'; given the matched name, the code should be [TRY:3377]. See the script for more examples.
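
The TRY:96 case can be reproduced on a toy table to show how the two-way check works: look up the name from the code, and the code from the name, then compare both against what APD records. This is a base-R sketch under stated assumptions: `try_lookup` is a hypothetical two-row stand-in for the TRY trait list, holding only the two traits named in this issue, and the column names mirror those used in the matching script.

```r
# Hypothetical stand-in for the TRY trait list (only the two traits cited here)
try_lookup = data.frame(
  TraitID = c(96, 3377),
  Trait = c(
    "Seed oil content per seed mass",
    "Leaf cell wall nitrogen (N) per unit cell wall dry mass"
  ),
  stringsAsFactors = FALSE
)

# The reference as currently recorded in APD for trait_0001511
apd_ref = data.frame(
  extracted_trait = "Leaf cell wall nitrogen (N) per unit cell wall dry mass",
  extracted_code = 96,
  stringsAsFactors = FALSE
)

# Name that TRY gives for the recorded code
apd_ref$name_matched_on_code =
  try_lookup$Trait[match(apd_ref$extracted_code, try_lookup$TraitID)]
# Code that TRY gives for the recorded name
apd_ref$code_matched_on_name =
  try_lookup$TraitID[match(apd_ref$extracted_trait, try_lookup$Trait)]

# Rows where the recorded name and the name behind the code disagree
mismatch = apd_ref[apd_ref$extracted_trait != apd_ref$name_matched_on_code, ]
mismatch
```

Here the recorded name disagrees with the name behind code 96, and the name-based lookup points to 3377 instead.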

Matching script

```r
library("dplyr")

# TRY trait list downloaded from the TRY website
try_traits = readr::read_delim("tde2024422162351.txt", skip = 3, col_select = -6)

apd_try_detailed = tibble::as_tibble(read.csv("APD_traits_input.csv")) |>
  select(identifier:label, starts_with("TRY")) |>
  rename(trait_id = identifier) |>
  tidyr::pivot_longer(
    starts_with("TRY"), names_to = "match_type", values_to = "matched_trait"
  ) |>
  filter(matched_trait != "") |>
  mutate(
    # Split for traits that have multiple matches on one line
    split_traits = purrr::map(stringr::str_split(matched_trait, ";"), trimws),
    # Extract TRY trait name
    extracted_trait = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "^(.*)\\s\\[", group = 1)
    ),
    # Extract TRY trait code
    extracted_code = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "\\[TRY:(.+)\\]", group = 1) |>
        as.numeric()
    )
  ) |>
  # Put everything in a tidy format
  tidyr::unnest(split_traits:extracted_code)

apd_try_smaller = apd_try_detailed |>
  # Match names based on trait code
  left_join(
    try_traits |>
      distinct(TraitID, name_matched_on_code = Trait),
    by = c(extracted_code = "TraitID")
  ) |>
  # Match code based on trait name
  left_join(
    try_traits |>
      distinct(code_matched_on_name = TraitID, Trait),
    by = c(extracted_trait = "Trait")
  ) |>
  select(trait, extracted_trait, extracted_code, name_matched_on_code, code_matched_on_name)

## Potentially problematic traits
# Non-matching names according to code
apd_try_smaller |>
  filter(extracted_trait != name_matched_on_code)

# Non-matching code according to name
apd_try_smaller |>
  filter(extracted_code != code_matched_on_name)
```

Typos in some correspondence with TRY traits?

Hi @ehwenk and @dfalster 👋

As mentioned in PR #24, I'm using the raw APD_traits_input.csv to get trait correspondence across databases.

I noticed some issues in some of the TRY columns (or at least entries that are non-standard?).
I'm unsure how to tackle these, so I'd rather open an issue about them.

My routine is the following:

apd_try_traits = read.csv("APD_traits_input.csv") |>
  tibble::as_tibble() |>
  select(
    trait_id = identifier, trait, label, contains("BIEN"), contains("GIFT"),
    contains("TRY")
  ) |>
  # Get all traits for which there is an equivalent in TRY
  filter(if_any(contains("TRY"), \(x) x != "")) |>
  select(trait_id, trait, label, contains("TRY")) |>
  # Making data tidy
  tidyr::pivot_longer(
    contains("TRY"), names_to = "match_type", values_to = "match_value"
  ) |>
  filter(match_value != "") |>
  # Extract TRY TraitIDs
  mutate(
    extracted_trait = match_value |>
      stringr::str_extract_all("\\[TRY:\\d+\\]") |>
      purrr::map(stringr::str_remove, "\\[TRY:") |>
      purrr::map(stringr::str_remove,"\\]"),
    match_type =  stringr::str_extract(match_type, "[:alpha:]+"),
    # Count number of match traits
    length_extracted = purrr::map_int(extracted_trait, length)
  )

If I count the number of matched traits given the columns I get the following:

> apd_try_traits |>
+     count(length_extracted)
# A tibble: 4 × 2
  length_extracted     n
             <int> <int>
1                0     5
2                1   316
3                2     7
4                3     1

So 5 AusTraits traits with non-empty columns have 0 matches given my extraction of TRY IDs.

Looking at the strings in those columns, I get:

> apd_try_traits |>
+     filter(length_extracted == 0) |>
+     pull(match_value)
[1] "specific leaf area [TO:0000562] (https://www.try-db.org/de/de.php)"                                  
[2] "Leaf epidermis cell area; Leaf mesophyll cell area [TRY:338; 573] (https://www.try-db.org/de/de.php)"
[3] "Bark thickness [TRY:24, TRY:3355, TRY:3356] (https://www.try-db.org/de/de.php)"                      
[4] "Bark thickness [TRY:24, 3355, 3356] (https://www.try-db.org/de/de.php)"                              
[5] "plant lifespan and age of first flowering [LEDA:1.3] (https://www.try-db.org/de/de.php)"

The first line points to a Trait Ontology (TO) term, not to a TRY trait.
For leaf epidermis cell area and bark thickness, it's a matter of how the TRY IDs are written. Also, bark thickness is written in two different ways.
For plant lifespan, it's a link to a LEDA trait. Is this relevant here?

I've checked, and these issues propagate to the RDF file.
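
If the non-standard ID styles are kept for now, a more tolerant extraction could still recover the IDs from them. The following is a base-R sketch, not a proposed fix for the APD itself: `extract_try_ids` is a hypothetical helper that takes a single string, finds the first "[TRY:...]" bracket, and returns every run of digits inside it. It assumes that anything numeric inside such a bracket is a TRY TraitID.

```r
# Hypothetical helper: integer TRY IDs found inside the first "[TRY:...]"
# bracket of a single string; integer(0) when there is no such bracket.
extract_try_ids = function(x) {
  bracket = regmatches(x, regexpr("\\[TRY:[^]]*\\]", x))
  if (length(bracket) == 0) {
    return(integer(0))
  }
  as.integer(regmatches(bracket, gregexpr("[0-9]+", bracket))[[1]])
}

# Three of the strings reported above
problem_values = c(
  "Leaf epidermis cell area; Leaf mesophyll cell area [TRY:338; 573] (https://www.try-db.org/de/de.php)",
  "Bark thickness [TRY:24, TRY:3355, TRY:3356] (https://www.try-db.org/de/de.php)",
  "specific leaf area [TO:0000562] (https://www.try-db.org/de/de.php)"
)

# One integer vector of IDs per string
ids = lapply(problem_values, extract_try_ids)
ids
```

Entries that point to another vocabulary, such as the TO and LEDA references, still return no IDs, so they remain visible as cases to review by hand.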

Add vocabulary metadata

We still need to add metadata for the APD. What follows is what Rowan coded into one of the drafts for ARDC RVA:

<https://github.com/traitecoevo/APD>
    a rdfs:Resource , skos:ConceptScheme , owl:Ontology , <http://terminologies.gfbio.org/terms/ETS/TraitData> ;
    rdfs:label "AusTraits"@en ;
    rdfs:seeAlso "https://github.com/traitecoevo/APD_values"^^xsd:anyURI ;
    dcterms:description "AusTraits is an open-source, harmonized database of Australian plant trait data. It synthesises data on nearly 500 traits across more than 30,000 taxa from field campaigns, published literature, taxonomic monographs, and individual taxon descriptions. Begun in 2016 as an initiative between three lab groups, it has grown to be the largest collation of plant trait data for Australian plants. AusTraits integrates plant trait data collected by researchers from diverse disciplines, including functional plant biology, plant physiology, plant taxonomy, and conservation biology. By harmonizing and error checking values, linking all AusTraits data entries to detailed metadata, and documenting trait and trait values definitions, AusTraits is a resource researchers can trust and use for their research agendas with minimal additional filtering or manipulations."@en ;
    dcterms:license "https://creativecommons.org/licenses/by/4.0/"^^xsd:anyURI ;
    dcterms:publisher "https://austraits.org/"^^xsd:anyURI ;
    dcterms:title "AusTraits"@en ;
    skos:hasTopConcept <https://github.com/traitecoevo/APD#0000000> ;
    skos:prefLabel "AusTraits"@en .

Trait suggestion: inter-annual variability in seed production (masting)

trait concept: seed production inter-annual variability (i.e. masting)
reference: https://doi.org/10.1071/BT22043 (Wright, B. R., Franklin, D. C., & Fensham, R. J. (2022). The ecology, evolution and management of mast reproduction in Australian plants. Australian Journal of Botany, 70(8), 509-530.)
available data: some in manuscript
suggested by: Matt White
