Coder Social home page Coder Social logo

Comments (7)

MichalTorma avatar MichalTorma commented on June 22, 2024

Prepacte za neskoru odpoved - minuly tyzden som toho mal vela.
Tu to mam s upravami a komentarmi:

SELECT
-- MT: Tu je otazka ci je tento identifikator dostatocne stabilny aby sme ho puzivali. standarne pouzivame uuid alebo triplet institutionCode:collectionCode:catalogNumber
    CONCAT(p.institution_id, ':', p.abbrev, ':', r.id) AS occurrenceID,
-- VK: chybně t.scientific_name AS scientificName,
-- MT: pravda - netusim odkial som nabral scientific_name :)
    t.name_lat AS scientificName,
    r.latitude AS decimalLatitude,
    r.longitude AS decimalLongitude,
    r.datum AS eventDate,
    r.locality,
    r.altitude_min AS minimumElevationInMeters,
    r.altitude_max AS maximumElevationInMeters,
    r.comment AS occurrenceRemarks,
-- VK: Co tady potřebujete? Záznam má více autorů. Je to vazba M : N. Předpokládám, že id z databáze Pladias Vám nepomůže nebo ano?
-- MT: Mate pravdu - to id nepotrebujem - skoda ze nepouzivate ORCID. Mohlo byt is ist takto? :
    string_agg(CONCAT(ra.name, ' ', ra.surname), '|') AS recordedBy
    r.source,
    r.environment,
-- VK: Zde Vám stačí id integer z tabulky atlas.projects?
-- MT: Toto mozeme vypustit - to bola GPT halucinacia
    -- r.project_id AS associatedSequences, -- Example of mapping project ID to associated sequences; adjust based on actual data mapping
-- VK: Co by mělo být zde? Jedná se o id z tabulky atlas.record_validation_status
-- MT: tu neviem ake su moznosti - v priklade co ste mi poslal mali vsetky hodnotu 3. viete mi o tom povedat viac?
    r.validation_status AS identificationVerificationStatus,
    r.original_name AS verbatimScientificName,
    r.gps_coords_precision AS coordinatePrecision,
-- VK: možné hodnoty Y, M nebo D, které určují, jestli je z pole datum platný pouze rok nebo i měsíc nebo celé datum.
-- I see takze r.datum je vzdy ISO date ale podla tohoto stlpca treba ignorovat mesiac/den ?
    -- r.datum_precision AS verbatimEventDate,
-- VK: duplicitní s r.locality
    -- r.locality AS locality,
-- VK: jedná se o integer id z tabulky fytochorionů (geodata.phytochorions)
-- MT: Kedze sa nechystame publikovat aj tabulku fytochorionov, tak mozeme tento stlpec zhodit tiez
    -- r.phytochorion_id AS locationID -- Example of using phytochorion_id as a location ID
FROM
    atlas.records AS r
LEFT JOIN public.taxons AS t ON r.taxon_id = t.id
LEFT JOIN atlas.records_authors AS ra ON r.id = ra.records_id
LEFT JOIN public.projects AS p ON r.records_id = p.id
WHERE
-- VK: U tohoto příznaku je to komplikovanější, protože v určitých případech chtěli v českém Pladiasu, aby bylo možné vložit do mapy i odmítnuté záznamy. Možná by bylo vhodnější vybírat podle validation status. 
    r.include_in_map = TRUE;
-- pridame toto nech vieme robit agregaciu
GROUP BY r.id

from pladias-dwc-export.

Jules- avatar Jules- commented on June 22, 2024

Upravil jsem:

SELECT
-- MT: Tu je otazka ci je tento identifikator dostatocne stabilny aby sme ho puzivali. standarne pouzivame uuid alebo triplet institutionCode:collectionCode:catalogNumber
-- VK2: Myslím, že by se to nemělo brát z projektu. Projekt je pro nás například databáze ze které jsou záznamy importovány.
--      Dával by mi smysl konstantní prefix pro identifikaci instituce. Např. BU-SAV.
CONCAT(p.institution_id, ':', p.abbrev, ':', r.id) AS occurrenceID_wrong,
CONCAT('BU-SAV:', r.id) AS occurrenceID,
-- VK: chybně t.scientific_name AS scientificName,
-- MT: pravda - netusim odkial som nabral scientific_name :)
t.name_lat AS scientificName,
r.latitude AS decimalLatitude,
r.longitude AS decimalLongitude,
r.datum AS eventDate,
r.locality,
r.altitude_min AS minimumElevationInMeters,
r.altitude_max AS maximumElevationInMeters,
r.comment AS occurrenceRemarks,
-- VK: Co tady potřebujete? Záznam má více autorů. Je to vazba M : N. Předpokládám, že id z databáze Pladias Vám nepomůže nebo ano?
-- MT: Mate pravdu - to id nepotrebujem - skoda ze nepouzivate ORCID. Mohlo byt is ist takto? :
-- VK2: zkoušel jsem to a zdá se, že subquery je výrazně efektivnější
(
    SELECT string_agg(CASE WHEN a.name = '' THEN a.surname ELSE CONCAT(a.name, ' ', a.surname) END, '|')
    FROM atlas.records_authors
    INNER JOIN atlas.authors AS a ON records_authors.authors_id = a.id
    WHERE records_id = r.id) AS recordedBy,
r.source,
r.environment,
-- VK: Zde Vám stačí id integer z tabulky atlas.projects?
-- MT: Toto mozeme vypustit - to bola GPT halucinacia
-- r.project_id AS associatedSequences, -- Example of mapping project ID to associated sequences; adjust based on actual data mapping
-- VK: Co by mělo být zde? Jedná se o id z tabulky atlas.record_validation_status
-- MT: tu neviem ake su moznosti - v priklade co ste mi poslal mali vsetky hodnotu 3. viete mi o tom povedat viac?
r.validation_status AS identificationVerificationStatus,
r.original_name AS verbatimScientificName,
r.gps_coords_precision AS coordinatePrecision
-- VK: možné hodnoty Y, M nebo D, které určují, jestli je z pole datum platný pouze rok nebo i měsíc nebo celé datum.
-- I see takze r.datum je vzdy ISO date ale podla tohoto stlpca treba ignorovat mesiac/den ?
-- r.datum_precision AS verbatimEventDate,
-- VK: duplicitní s r.locality
-- r.locality AS locality,
-- VK: jedná se o integer id z tabulky fytochorionů (geodata.phytochorions)
-- MT: Kedze sa nechystame publikovat aj tabulku fytochorionov, tak mozeme tento stlpec zhodit tiez
-- r.phytochorion_id AS locationID -- Example of using phytochorion_id as a location ID
FROM
    atlas.records AS r
        LEFT JOIN public.taxons AS t ON r.taxon_id = t.id
        LEFT JOIN atlas.projects AS p ON r.project_id = p.id
WHERE
-- VK: U tohoto příznaku je to komplikovanější, protože v určitých případech chtěli v českém Pladiasu, aby bylo možné vložit do mapy i odmítnuté záznamy. Možná by bylo vhodnější vybírat podle validation status.
r.include_in_map = TRUE
LIMIT 1000;

from pladias-dwc-export.

MichalTorma avatar MichalTorma commented on June 22, 2024

Dakujem za promptnu odpoved. myslim ze nam ostava len zopar detailov tak ich tu rozpisem:

  • occurrenceID: Co si myslite o niecom takomto BU-SAV:PLADIAS:{r.id}? v tom pripade ak by BU-SAV publikoval v buducnosti nejaky iny dataset, mohli by pouzit podobnu strukturu.
  • recordedBy: zaujimave - ja som vzdy pokladal subquery ako najpomalsi :) clovek sa uci kazdy den
  • r.validation_status: ake moznosti su tam?
  • r.datum_precision: ak tomu rozumiem spravne tak v r.datum mate vzdy plny datum ale kedze DATE nepodporuje nekompletne datumy, tak to mate ako flag. Vec je ze v eventDate by som idealne potreboval ISO datum ktory podporuje aj reduced precision. Rozumiem tomu spravne?

from pladias-dwc-export.

Jules- avatar Jules- commented on June 22, 2024
  • occurrenceID: BU-SAV:PLADIAS:{r.id} je za mě v pohodě, ale myslím, že bych neměl rozhodovat, jak bude vypadat identifikátor slovenských dat
  • recordedBy: Přemýšlel jsem nad tím a mohlo to být rychlejší jen díky tomu, že jsem použil LIMIT 1000. V případě GROUP BY musel procházet všechny záznamy, které vyhovují podmínce, ale v případě subquery prošel prvních 1000 a měl hotovo.
  • r.validation_status: posílal jsem na e-mail s výstupem skriptu
  • r.datum_precision: na toto jsem zapomněl odpovědět. Je to jak píšete. Pokud má být výstupem string, tak je potřeba ho poskládat podle datum_precision.

from pladias-dwc-export.

MichalTorma avatar MichalTorma commented on June 22, 2024

Tak som to precistil a implementoval komentare. vyzera to rozumne? Trebalo by skusit ktore je rychleisie na celom datasete.
subquerry:

SELECT
    CONCAT('BU-SAV:PLADIAS:', r.id) AS occurrenceID,
    t.name_lat AS scientificName,
    r.latitude AS decimalLatitude,
    r.longitude AS decimalLongitude,
    CASE
        WHEN r.datum_precision = 'Y' THEN TO_CHAR(r.datum, 'YYYY')
        WHEN r.datum_precision = 'M' THEN TO_CHAR(r.datum, 'YYYY-MM')
        WHEN r.datum_precision = 'D' THEN TO_CHAR(r.datum, 'YYYY-MM-DD')
        ELSE NULL
    END AS eventDate,
    r.locality,
    r.altitude_min AS minimumElevationInMeters,
    r.altitude_max AS maximumElevationInMeters,
    r.comment AS occurrenceRemarks,
    (
        SELECT string_agg(CASE WHEN a.name = '' THEN a.surname ELSE CONCAT(a.name, ' ', a.surname) END, '|')
        FROM atlas.records_authors
        INNER JOIN atlas.authors AS a ON records_authors.authors_id = a.id
        WHERE records_id = r.id) AS recordedBy,
    r.source,
    r.environment,
    CASE
        WHEN r.validation_status IN (0, 1) THEN 'verification required'
        WHEN r.validation_status = 3 THEN 'verified'
        ELSE NULL
    END AS identificationVerificationStatus
    r.original_name AS verbatimScientificName,
    r.gps_coords_precision AS coordinatePrecision
FROM
    atlas.records AS r
        LEFT JOIN public.taxons AS t ON r.taxon_id = t.id
        LEFT JOIN atlas.projects AS p ON r.project_id = p.id
WHERE
    r.validation_status != 2

alebo aggregate

SELECT
    CONCAT('BU-SAV:PLADIAS:', r.id) AS occurrenceID,
    t.name_lat AS scientificName,
    r.latitude AS decimalLatitude,
    r.longitude AS decimalLongitude,
    CASE
        WHEN r.datum_precision = 'Y' THEN TO_CHAR(r.datum, 'YYYY')
        WHEN r.datum_precision = 'M' THEN TO_CHAR(r.datum, 'YYYY-MM')
        WHEN r.datum_precision = 'D' THEN TO_CHAR(r.datum, 'YYYY-MM-DD')
        ELSE NULL
    END AS eventDate,
    r.locality,
    r.altitude_min AS minimumElevationInMeters,
    r.altitude_max AS maximumElevationInMeters,
    r.comment AS occurrenceRemarks,
    string_agg(CONCAT(ra.name, ' ', ra.surname), '|') AS recordedBy
    r.source,
    r.environment,
    CASE
        WHEN r.validation_status IN (0, 1) THEN 'verification required'
        WHEN r.validation_status = 3 THEN 'verified'
        ELSE NULL
    END AS identificationVerificationStatus
    r.original_name AS verbatimScientificName,
    r.gps_coords_precision AS coordinatePrecision
FROM
    atlas.records AS r
        LEFT JOIN public.taxons AS t ON r.taxon_id = t.id
        LEFT JOIN atlas.projects AS p ON r.project_id = p.id
WHERE
    r.validation_status != 2
GROUP BY r.id

from pladias-dwc-export.

MichalTorma avatar MichalTorma commented on June 22, 2024

Este som si vsimol, r.gps_coords_precision je v metroch alebo v stupnoch?

from pladias-dwc-export.

Jules- avatar Jules- commented on June 22, 2024

Ta r.gps_coords_precision je v metrech.

Upravil jsem SQL, aby se daly spustit. Otestoval jsem export na 14 mil. záznamech z českého Pladiasu. Dopadlo to lépe pro subquery (3:26) oproti group by(12:20). Možná je problém, že group by seskupuje podle všech sloupců včetně double latitude a longitude. Pro úplnost jsem ještě vymyslel jeden přístup přes join(4:19). Ten je o něco horší než subquery.

subquery

SELECT
    CONCAT('BU-SAV:PLADIAS:', r.id) AS occurrenceID,
    t.name_lat AS scientificName,
    r.latitude AS decimalLatitude,
    r.longitude AS decimalLongitude,
    CASE
        WHEN r.datum_precision = 'Y' THEN TO_CHAR(r.datum, 'YYYY')
        WHEN r.datum_precision = 'M' THEN TO_CHAR(r.datum, 'YYYY-MM')
        WHEN r.datum_precision = 'D' THEN TO_CHAR(r.datum, 'YYYY-MM-DD')
        ELSE NULL
        END AS eventDate,
    r.locality,
    r.altitude_min AS minimumElevationInMeters,
    r.altitude_max AS maximumElevationInMeters,
    r.comment AS occurrenceRemarks,
    (
        SELECT string_agg(CASE WHEN a.name = '' THEN a.surname ELSE CONCAT(a.name, ' ', a.surname) END, '|')
        FROM atlas.records_authors
                 INNER JOIN atlas.authors AS a ON records_authors.authors_id = a.id
        WHERE records_id = r.id) AS recordedBy,
    r.source,
    r.environment,
    CASE
        WHEN r.validation_status IN (0, 1) THEN 'verification required'
        WHEN r.validation_status = 3 THEN 'verified'
        ELSE NULL
        END AS identificationVerificationStatus,
    r.original_name AS verbatimScientificName,
    r.gps_coords_precision AS coordinatePrecision
FROM
    atlas.records AS r
        LEFT JOIN public.taxons AS t ON r.taxon_id = t.id
        LEFT JOIN atlas.projects AS p ON r.project_id = p.id
WHERE
    r.validation_status != 2

group by

SELECT
    CONCAT('BU-SAV:PLADIAS:', r.id) AS occurrenceID,
    t.name_lat AS scientificName,
    r.latitude AS decimalLatitude,
    r.longitude AS decimalLongitude,
    CASE
        WHEN r.datum_precision = 'Y' THEN TO_CHAR(r.datum, 'YYYY')
        WHEN r.datum_precision = 'M' THEN TO_CHAR(r.datum, 'YYYY-MM')
        WHEN r.datum_precision = 'D' THEN TO_CHAR(r.datum, 'YYYY-MM-DD')
        ELSE NULL
        END AS eventDate,
    r.locality,
    r.altitude_min AS minimumElevationInMeters,
    r.altitude_max AS maximumElevationInMeters,
    r.comment AS occurrenceRemarks,
    string_agg(CASE WHEN a.name = '' THEN a.surname ELSE CONCAT(a.name, ' ', a.surname) END, '|') AS recordedBy,
    r.source,
    r.environment,
    CASE
        WHEN r.validation_status IN (0, 1) THEN 'verification required'
        WHEN r.validation_status = 3 THEN 'verified'
        ELSE NULL
        END AS identificationVerificationStatus,
    r.original_name AS verbatimScientificName,
    r.gps_coords_precision AS coordinatePrecision
FROM
    atlas.records AS r
        LEFT JOIN public.taxons AS t ON r.taxon_id = t.id
        LEFT JOIN atlas.projects AS p ON r.project_id = p.id
        LEFT JOIN atlas.records_authors AS ra ON ra.records_id = r.id
        LEFT JOIN atlas.authors AS a ON ra.authors_id = a.id
WHERE
    r.validation_status != 2
GROUP BY r.id, t.name_lat, r.latitude, r.longitude, r.datum_precision, r.locality, r.altitude_min,
         r.altitude_max, r.comment, r.source, r.environment, r.validation_status,
         r.original_name, r.gps_coords_precision;

join

SELECT
    CONCAT('BU-SAV:PLADIAS:', r.id) AS occurrenceID,
    t.name_lat AS scientificName,
    r.latitude AS decimalLatitude,
    r.longitude AS decimalLongitude,
    CASE
        WHEN r.datum_precision = 'Y' THEN TO_CHAR(r.datum, 'YYYY')
        WHEN r.datum_precision = 'M' THEN TO_CHAR(r.datum, 'YYYY-MM')
        WHEN r.datum_precision = 'D' THEN TO_CHAR(r.datum, 'YYYY-MM-DD')
        ELSE NULL
        END AS eventDate,
    r.locality,
    r.altitude_min AS minimumElevationInMeters,
    r.altitude_max AS maximumElevationInMeters,
    r.comment AS occurrenceRemarks,
    a.names AS recordedBy,
    r.source,
    r.environment,
    CASE
        WHEN r.validation_status IN (0, 1) THEN 'verification required'
        WHEN r.validation_status = 3 THEN 'verified'
        ELSE NULL
        END AS identificationVerificationStatus,
    r.original_name AS verbatimScientificName,
    r.gps_coords_precision AS coordinatePrecision
FROM
    atlas.records AS r
        LEFT JOIN public.taxons AS t ON r.taxon_id = t.id
        LEFT JOIN atlas.projects AS p ON r.project_id = p.id
        LEFT JOIN (
            SELECT records_id, string_agg(CASE WHEN a.name = '' THEN a.surname ELSE CONCAT(a.name, ' ', a.surname) END, '|') AS names
            FROM atlas.records_authors
                INNER JOIN atlas.authors AS a ON records_authors.authors_id = a.id
            GROUP BY records_id) AS a ON r.id = a.records_id
WHERE
    r.validation_status != 2
;

from pladias-dwc-export.

Related Issues (2)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.