apd's Issues

Add links to RVA and Zenodo

Once we've posted a proper release, we should add something in the header (possibly navbar) linking to

  • RVA vocab
  • Zenodo release

Typos in correspondence with GIFT

Hi @ehwenk & @dfalster,

While building the trait correspondence network, I noticed some issues in the correspondence with GIFT traits (I still have to do the same for BIEN and TRY).
I checked that the trait codes and trait names provided by AusTraits both exist in GIFT, and that each provided GIFT trait name matches its provided GIFT trait code.

The script I used is below, but I'll first detail my findings.

  1. For trait_0030020 & trait_0030015, GIFT_close contains multiple traits on a single line. Is this on purpose? Other matched traits span multiple lines.
  2. For trait_0030215, there is a typo in the GIFT_exact name: it is written "Fuiting time", missing an r.
  3. For trait_0030020, there is a typo in the GIFT code: 'leaf_thorns_1 [GIFT:4:14.1]' should be 'leaf_thorns_1 [GIFT:4.14.1]'.
  4. Several GIFT trait names are written following AusTraits' convention rather than GIFT's, e.g. 'seed_height' instead of 'Seed height'.
  5. Capitalization of trait names doesn't follow GIFT's: APD tends to use snake_case while GIFT uses Camel_snake_case. For example, the GIFT name referenced in APD is 'flower_colour' [APD:trait_0012417], while in GIFT the trait is 'Flower_clour' [GIFT:3.21.1].
  6. There is an error in the GIFT match for trait_0030060: GIFT_close matches GIFT 1.4.1 (Climber_1), while it should match GIFT 3.4.1 (Reproduction_sexual_1). This was the trait that triggered my systematic search for potential mismatches, as the correspondence network contained a much larger connected component than expected, linking traits that shouldn't match.
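
A malformed code such as the one in finding 3 can be caught by a format check alone, before any lookup against GIFT. A minimal base-R sketch, assuming each entry has the form 'name [GIFT:code]' and that a valid code is two or three dot-separated integers (the level 2 and level 3 codes used in this issue); `check_gift_entry` is a hypothetical helper, not part of APD or the GIFT package:

```r
# Hypothetical helper (not part of APD or the GIFT package).
# Assumes each entry looks like "trait_name [GIFT:4.14.1]".
check_gift_entry = function(entry) {
  # Capture the name before " [" and the code inside "[GIFT:...]"
  pattern = "^(.*)[[:space:]]\\[GIFT:(.+)\\]$"
  parts = regmatches(entry, regexec(pattern, entry))
  # Element 1 is the full match, 2 the name, 3 the code; NA if no match
  pick = function(i) {
    vapply(parts, \(x) if (length(x) == 3) x[i] else NA_character_, character(1))
  }
  code = pick(3)
  data.frame(
    entry = entry,
    name = pick(2),
    code = code,
    # Assumed format: two or three integers separated by dots
    code_ok = !is.na(code) & grepl("^[0-9]+(\\.[0-9]+){1,2}$", code)
  )
}

checked = check_gift_entry(c(
  "leaf_thorns_1 [GIFT:4:14.1]",  # typo reported above (colon instead of dot)
  "leaf_thorns_1 [GIFT:4.14.1]"   # corrected form
))

# Entries whose code does not follow the assumed format
checked[!checked$code_ok, ]
```

This only validates the shape of the code; whether the code exists in GIFT and matches the name still needs the join against the GIFT metadata.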

Maybe you could use an adaptation of the script below to perform semi-automated quality checks when updating the APD?

For the sake of completeness, I'll try performing the same checks for TRY and BIEN.

Matching script
library("dplyr")

gift_trait_meta = GIFT::GIFT_traits_meta()

apd_gift_detailed = tibble::as_tibble(read.csv("APD_traits_input.csv")) |>
  select(identifier:label, starts_with("GIFT")) |>
  rename(trait_id = identifier) |>
  tidyr::pivot_longer(
    starts_with("GIFT"), names_to = "match_type", values_to = "matched_trait"
  ) |>
  filter(matched_trait != "") |>
  mutate(
    # Split for traits that have multiple matches on one line
    split_traits = purrr::map(stringr::str_split(matched_trait, ";"), trimws),
    # Extract GIFT trait name
    extracted_trait = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "^(.*)\\s\\[", group = 1)
    ),
    # Extract GIFT trait code
    extracted_code = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "\\[GIFT:(.+)\\]", group = 1)
    ),
    # Get level
    gift_lvl = purrr::map(
      extracted_code, \(x) stringr::str_count(x, stringr::fixed(".")) + 1L
    )
  ) |>
  # Put everything in a tidy format
  tidyr::unnest(split_traits:gift_lvl)

## Level 2 traits
# Matching code at level 2
apd_gift_lvl2 = apd_gift_detailed |>
  filter(gift_lvl == 2) |>
  left_join(
    gift_trait_meta |>
      distinct(Lvl2, Trait1),
    by = c(extracted_code = "Lvl2")
  )

# Problematic traits
apd_gift_lvl2 |>
  filter((extracted_trait != Trait1) | is.na(Trait1))


## Level 3 traits
apd_gift_lvl3 = apd_gift_detailed |>
  filter(gift_lvl == 3) |>
  left_join(
    gift_trait_meta |>
      distinct(Lvl3, Trait2),
    by = c(extracted_code = "Lvl3")
  )

# Problematic traits
apd_gift_lvl3 |>
  filter((extracted_trait != Trait2) | is.na(Trait2))

Typos in correspondence with TRY

Similar to #28, let's look at the correspondence with TRY.

I've performed a similar matching of codes and names in TRY and found a few typos (see the detailed script below).

  1. As for GIFT, some APD traits have several matching TRY traits on the same line, e.g., trait_0030810 has two traits matching on GIFT_close.
  2. Names mostly match, but some name correspondences are off because TRY silently modified trait names. The names can be updated by matching the TraitID against an updated TRY traits table (downloadable from the TRY website: https://www.try-db.org/de/DnldTraitList.php).
  3. More serious are the non-corresponding codes. Some matches appear to be wrong because of this.
    For example, 'leaf_cell_wall_N_per_cell_wall_dry_mass' [APD:trait_0001511] is listed in APD as a close match to 'Leaf cell wall nitrogen (N) per unit cell wall dry mass' with code [TRY:96]. However, [TRY:96] corresponds to 'Seed oil content per seed mass'; given the matched name, it should be [TRY:3377]. See the script for more examples.
Matching script

```r
library("dplyr")

try_traits = readr::read_delim("tde2024422162351.txt", skip = 3, col_select = -6)

apd_try_detailed = tibble::as_tibble(read.csv("APD_traits_input.csv")) |>
  select(identifier:label, starts_with("TRY")) |>
  rename(trait_id = identifier) |>
  tidyr::pivot_longer(
    starts_with("TRY"), names_to = "match_type", values_to = "matched_trait"
  ) |>
  filter(matched_trait != "") |>
  mutate(
    # Split for traits that have multiple matches on one line
    split_traits = purrr::map(stringr::str_split(matched_trait, ";"), trimws),
    # Extract TRY trait name
    extracted_trait = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "^(.*)\\s\\[", group = 1)
    ),
    # Extract TRY trait code
    extracted_code = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "\\[TRY:(.+)\\]", group = 1) |>
        as.numeric()
    )
  ) |>
  # Put everything in a tidy format
  tidyr::unnest(split_traits:extracted_code)

apd_try_smaller = apd_try_detailed |>
  # Match names based on trait code
  left_join(
    try_traits |>
      distinct(TraitID, name_matched_on_code = Trait),
    by = c(extracted_code = "TraitID")
  ) |>
  # Match code based on trait name
  left_join(
    try_traits |>
      distinct(code_matched_on_name = TraitID, Trait),
    by = c(extracted_trait = "Trait")
  ) |>
  select(trait, extracted_trait, extracted_code, name_matched_on_code, code_matched_on_name)

## Potentially problematic traits
# Non-matching names according to code
apd_try_smaller |>
  filter(extracted_trait != name_matched_on_code)

# Non-matching code according to name
apd_try_smaller |>
  filter(extracted_code != code_matched_on_name)
```

Typos in some correspondence with TRY traits?

Hi @ehwenk and @dfalster 👋

As mentioned in PR #24, I'm using the raw APD_traits_input.csv to get trait correspondence across databases.

I noticed some issues with some of the TRY columns (or at least some entries that are non-standard?).
I'm unsure how to tackle these, so I'd rather open an issue about them.

My routine is the following:

library("dplyr")

apd_try_traits = read.csv("APD_traits_input.csv") |>
  tibble::as_tibble() |>
  select(
    trait_id = identifier, trait, label, contains("BIEN"), contains("GIFT"),
    contains("TRY")
  ) |>
  # Get all traits for which there is an equivalent in TRY
  filter(if_any(contains("TRY"), \(x) x != "")) |>
  select(trait_id, trait, label, contains("TRY")) |>
  # Making data tidy
  tidyr::pivot_longer(
    contains("TRY"), names_to = "match_type", values_to = "match_value"
  ) |>
  filter(match_value != "") |>
  # Extract TRY TraitIDs
  mutate(
    extracted_trait = match_value |>
      stringr::str_extract_all("\\[TRY:\\d+\\]") |>
      purrr::map(stringr::str_remove, "\\[TRY:") |>
      purrr::map(stringr::str_remove,"\\]"),
    match_type =  stringr::str_extract(match_type, "[:alpha:]+"),
    # Count number of match traits
    length_extracted = purrr::map_int(extracted_trait, length)
  )

If I count the number of matched traits given the columns I get the following:

> apd_try_traits |>
+     count(length_extracted)
# A tibble: 4 × 2
  length_extracted     n
             <int> <int>
1                0     5
2                1   316
3                2     7
4                3     1

So 5 AusTraits traits with non-empty columns have 0 matches given my extraction of TRY IDs.

If I look at the strings in those columns I get:

> apd_try_traits |>
+     filter(length_extracted == 0) |>
+     pull(match_value)
[1] "specific leaf area [TO:0000562] (https://www.try-db.org/de/de.php)"                                  
[2] "Leaf epidermis cell area; Leaf mesophyll cell area [TRY:338; 573] (https://www.try-db.org/de/de.php)"
[3] "Bark thickness [TRY:24, TRY:3355, TRY:3356] (https://www.try-db.org/de/de.php)"                      
[4] "Bark thickness [TRY:24, 3355, 3356] (https://www.try-db.org/de/de.php)"                              
[5] "plant lifespan and age of first flowering [LEDA:1.3] (https://www.try-db.org/de/de.php)"          

The first line points to a Trait Ontology definition, not to a TRY trait.
For leaf epidermis cell area and bark thickness, it's a matter of how the TRY IDs are written. Also, bark thickness is written in two different ways?!
For plant lifespan, it's a link to a LEDA trait. Is this relevant here?

I've checked, and these issues propagate to the RDF file.
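
Until the entries are normalised, the ID styles shown above can be read with a more lenient extraction. A minimal base-R sketch, assuming every run of digits inside a bracket that opens with "TRY:" is a TraitID; `extract_try_ids` is a hypothetical helper, and whether lenient parsing or fixing APD_traits_input.csv is preferable is left to the maintainers:

```r
# Hypothetical helper: pull every TRY TraitID out of one reference string.
# Accepts "[TRY:24]", "[TRY:24, TRY:3355]", "[TRY:24, 3355]" and "[TRY:338; 573]".
extract_try_ids = function(x) {
  lapply(x, function(s) {
    # Keep only a bracket that opens with "TRY:"; other sources (TO, LEDA) yield no IDs
    bracket = regmatches(s, regexpr("\\[TRY:[^\\]]*\\]", s, perl = TRUE))
    if (length(bracket) == 0) return(integer(0))
    # Assumption: every run of digits inside that bracket is a TraitID
    as.integer(regmatches(bracket, gregexpr("[0-9]+", bracket))[[1]])
  })
}

# Strings taken from the output above
ids = extract_try_ids(c(
  "Bark thickness [TRY:24, 3355, 3356] (https://www.try-db.org/de/de.php)",
  "specific leaf area [TO:0000562] (https://www.try-db.org/de/de.php)",
  "Leaf epidermis cell area; Leaf mesophyll cell area [TRY:338; 573] (https://www.try-db.org/de/de.php)"
))

# Number of TRY IDs found per string
lengths(ids)
```

Entries that still return zero IDs (the Trait Ontology and LEDA references) are then the ones that need a manual decision.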

Add vocabulary metadata

We still need to add vocabulary metadata for APD. What follows is what Rowan coded into one of the drafts for ARDC RVA:


<https://github.com/traitecoevo/APD>
  a rdfs:Resource , skos:ConceptScheme , owl:Ontology , <http://terminologies.gfbio.org/terms/ETS/TraitData> ;
  rdfs:label "AusTraits"@en ;
  rdfs:seeAlso "https://github.com/traitecoevo/APD_values"^^xsd:anyURI ;
  dcterms:description "AusTraits is an open-source, harmonized database of Australian plant trait data. It synthesises data on nearly 500 traits across more than 30,000 taxa from field campaigns, published literature, taxonomic monographs, and individual taxon descriptions. Begun in 2016 as an initiative between three lab groups, it has grown to be the largest collation of plant trait data for Australian plants. AusTraits integrates plant trait data collected by researchers from diverse disciplines, including functional plant biology, plant physiology, plant taxonomy, and conservation biology. By harmonizing and error checking values, linking all AusTraits data entries to detailed metadata, and documenting trait and trait values definitions, AusTraits is a resource researchers can trust and use for their research agendas with minimal additional filtering or manipulations."@en ;
  dcterms:license "https://creativecommons.org/licenses/by/4.0/"^^xsd:anyURI ;
  dcterms:publisher "https://austraits.org/"^^xsd:anyURI ;
  dcterms:title "AusTraits"@en ;
  skos:hasTopConcept <https://github.com/traitecoevo/APD#0000000> ;
  skos:prefLabel "AusTraits"@en .
