saved_fisdat's Introduction

Fish Data Utilities

Pre-requisites / caveats:

A previous version of this document had a git submodule comprising our data model, based on LinkML. There is no longer an external dependency on this, nor any other git submodules, and the directory which it occupied should be removed:

rm -r ./fisdat/data_model/

The manifest files have also changed format again: they are now YAML files which can be edited directly. When uploading, they are converted to RDF (TTL), but all of the tooling now uses the YAML format. Refer to the updated examples below.

This is a Python package. It can be installed in any of the usual ways for Python packages, perhaps using a virtual environment like so:

python -m venv /some/where/env
source /some/where/env/bin/activate

The utilities can then be installed with:

pip install --editable .

The --editable or -e flag is important as it means that updates to the source files (i.e. those fetched with git) take effect immediately, without reinstalling. Having done this, some new programs are available:

fisdat - validating and working with data files

Operation

The fisdat program is for preparing data files to be published. It takes a CSV file and a schema and checks that the CSV file matches the schema. It then adds the file and schema to a manifest.

For example, in the examples/sentinal_cages directory, one can run,

fisdat sentinel_cages_sampling.yaml \
    sentinel_cages_cleaned.csv \
    manifest.yaml

This produces a slew of warnings about entries in the file that do not match the datatype specified in the schema; adding the -s or --strict flag turns these warnings into errors. It also creates or updates the manifest.yaml file, which records which data belongs to which schema.

If you do /not/ wish to validate the data file, perhaps because it is not a CSV file, you can give the program the -n argument.

Do this for each file that should be added to the manifest.
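When several files need adding, the repeated invocations can be scripted. The following is a minimal sketch, not part of the package: the helper name fisdat_commands is hypothetical, and it only assumes the argument order (schema, data file, manifest) and the --strict flag described above.

```python
import subprocess


def fisdat_commands(pairs, manifest="manifest.yaml", strict=False):
    """Build one fisdat command line per (schema, data file) pair."""
    commands = []
    for schema, data in pairs:
        argv = ["fisdat", schema, data, manifest]
        if strict:
            # Turn validation warnings into errors.
            argv.append("--strict")
        commands.append(argv)
    return commands


pairs = [
    ("fo_farms.yaml", "fo_farms.csv"),
    ("fo_lice.yaml", "fo_lice_data.csv"),
]
for argv in fisdat_commands(pairs):
    print(" ".join(argv))
    # Uncomment to actually run fisdat (it must be installed and on the PATH):
    # subprocess.run(argv, check=True)
```

The commands are built as argument lists rather than shell strings so that file names containing spaces do not need quoting.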

Dealing with missing data (important for validation)

In the sentinel cages example data, empty/missing values were indicated using the string "NA". LinkML is unable to accept these as empty (we have opened an issue to try and move this forward). In the meantime, LinkML will happily accept empty fields.

In the sentinel cages data, we have added an example R script called prep.R which reads in the CSV, then re-exports a new table with the "NA" string replaced by an empty field. In the density count model, which partly uses the sentinel cages data, we have likewise replaced the "NA" string with an empty field.
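For those not using R, the same clean-up can be sketched in Python with only the standard library. This is an illustration rather than a translation of prep.R; the function name blank_na and the sample data are made up for the example.

```python
import csv
import io


def blank_na(src, dst, na="NA"):
    """Copy CSV rows from src to dst, replacing cells equal to `na` with an empty field."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow(["" if cell == na else cell for cell in row])


# Small in-memory demonstration; with real files, pass file objects
# opened with newline="" instead of the StringIO buffers.
raw = io.StringIO("site,count\nA,3\nB,NA\n")
cleaned = io.StringIO()
blank_na(raw, cleaned)
print(cleaned.getvalue())
```

Only cells that are exactly "NA" are blanked, so values that merely contain those letters are left alone.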

Debugging / extra information about running state

Providing the --verbose flag (or -v for short) will print messages about running state, e.g.:

fisdat sentinel_cages_sampling.yaml \
    sentinel_cages_cleaned.csv \
    manifest.yaml \
    --verbose

To see even more information, use the --extra-verbose flag (or -vv for short), e.g.:

fisdat sentinel_cages_sampling.yaml \
    sentinel_cages_cleaned.csv \
    manifest.yaml \
    --extra-verbose

The program version number and associated git commit are always printed.

fisup - uploading data

Operation

Once the manifest is full, uploading the data can be done with the program fisup. It is used like this,

fisup manifest.yaml

You will need to set an environment variable to where you have saved your access credentials. It needs to be the full path to the file. If you do not have access credentials, you will need to ask for them.

export GOOGLE_APPLICATION_CREDENTIALS=/some/where/fisdat.key

It will do some basic checks on the files and then upload them to cloud storage. Use the -d command line option to specify a particular directory path if you do not want one to be randomly generated. It is a good idea to make a note of the generated path. For example, from the examples/farm_site_af_source directory,

$ fisdat fo_farms.yaml fo_farms.csv manifest.yaml
$ fisdat fo_lice.yaml fo_lice_data.csv manifest.yaml
$ fisup manifest.yaml
Uploading gs://saved-fisdat/2d6bf8f4-c6cc-11ee-9969-7aa465704562/manifest.yaml ...
Uploading gs://saved-fisdat/2d6bf8f4-c6cc-11ee-9969-7aa465704562/fo_farms.csv ...
Uploading gs://saved-fisdat/2d6bf8f4-c6cc-11ee-9969-7aa465704562/fo_lice_data.csv ...
Successfully uploaded your dataset to gs://saved-fisdat/2d6bf8f4-c6cc-11ee-9969-7aa465704562

Now the dataset bundle has been uploaded and can be further processed.
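If fisup fails to authenticate, it is worth checking the credentials variable described above before anything else. The following is a small sketch of such a check; the function credential_problems is hypothetical and is not provided by the package.

```python
import os


def credential_problems(environ=os.environ):
    """Return a list of problems with GOOGLE_APPLICATION_CREDENTIALS (empty if none)."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    problems = []
    if not os.path.isabs(path):
        # The variable needs to hold the full path to the key file.
        problems.append(f"{path} is not a full (absolute) path")
    if not os.path.isfile(path):
        problems.append(f"no file found at {path}")
    return problems


for problem in credential_problems():
    print("Problem:", problem)
```

An empty result only means that the variable points at an existing file; it does not confirm that the credentials themselves are valid.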

Usage notes

Neither the name nor the file extension of the manifest matters. When uploaded, manifests are always serialised as RDF (TTL). However, older manifests in JSON can no longer be uploaded, so make sure to re-generate them.

The --verbose and --extra-verbose flags have the same effect as in fisdat. They print debugging information about running state. Similarly, the version number and associated git commit are always printed.

LinkML YAML usage

Many of the LinkML schema fields are vague.

The id and name fields

The id field must be a URI, pointing somewhere. This does not need to be active, e.g. I put 'https://marine.gov.scot/metadata/saved/marinescot/sentinel_cages/sampling' in one of the examples.

The name field is a short identifier or 'atom'. It cannot have spaces or most special characters, although underscores are valid. Put longer text titles in the title field, and longer still free text descriptions in the description field. (Unlike id and name, the description field is optional.)

Prefixes and imports

Prefixes in the LinkML schema are used as the start of URIs in the generated schema. In the sentinel cages YAML example, we define saved as one such prefix, and then set it as the default prefix with the default_prefix keyword. The effect of this is that, by default, the classes and slots have a URI prepended to them in the generated documentation, which is this default prefix.

For example, suppose we declared a slot called infection_pressure, declare a prefix saved_new with URI "https://marine.gov.scot/metadata/saved/new_schema/", and set saved_new as the value of default_prefix. The slot infection_pressure would then be given the URI saved_new:infection_pressure which would expand to "https://marine.gov.scot/metadata/saved/new_schema/infection_pressure".
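The expansion in this example can be reproduced in a few lines of Python. This is only a sketch of the string manipulation involved, not how LinkML implements it; the names PREFIXES, DEFAULT_PREFIX and slot_uri are made up for the illustration.

```python
# Prefix declarations and default prefix, as in the example above.
PREFIXES = {"saved_new": "https://marine.gov.scot/metadata/saved/new_schema/"}
DEFAULT_PREFIX = "saved_new"


def slot_uri(slot_name, prefixes=PREFIXES, default_prefix=DEFAULT_PREFIX):
    """Return the (CURIE, expanded URI) pair for a slot under the default prefix."""
    curie = f"{default_prefix}:{slot_name}"
    return curie, prefixes[default_prefix] + slot_name


curie, uri = slot_uri("infection_pressure")
print(curie)
print(uri)
```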

The imports take a prefix and import resources from it. It is sufficient to leave this as in the examples for now, as linkml:types and our own schema declare everything we need.

URI and CURIE prefixes

These are in the format prefix:atom. There must be no space on either side of the colon. These are typically used in the various mappings attributes of slots, or when overriding a URI.
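A quick way to catch a stray space is to split the value on its first colon and inspect both halves. The helper below is a hypothetical sketch for CURIEs only; it is not suitable for full URIs, which also contain a colon.

```python
def split_curie(text):
    """Split 'prefix:atom' into (prefix, atom), rejecting spaces next to the colon."""
    prefix, sep, atom = text.partition(":")
    if not sep:
        raise ValueError(f"{text!r} has no colon")
    if not prefix or not atom:
        raise ValueError(f"{text!r} is missing a prefix or an atom")
    if prefix != prefix.rstrip() or atom != atom.lstrip():
        raise ValueError(f"{text!r} has a space next to the colon")
    return prefix, atom


print(split_curie("saved:infection_pressure"))
try:
    split_curie("saved: infection_pressure")
except ValueError as err:
    print("Rejected:", err)
```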

Indentation

Indentation does matter in most circumstances, because it is how YAML distinguishes between sections. Getting the indentation right also makes the document easier to read, although it can sometimes be difficult to see where the indentation is wrong.

Indentation may take any number of spaces; the suggested number is two or four.
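One indentation mistake that is hard to spot by eye is a tab character, which YAML does not allow for indentation. The sketch below lists the lines whose leading whitespace contains a tab; the function name tab_indented_lines is made up for the example.

```python
def tab_indented_lines(text):
    """Return the 1-based numbers of lines whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(number)
    return bad


sample = "classes:\n  TableSchema:\n\tdescription: indented with a tab\n"
print(tab_indented_lines(sample))
```

This only detects tabs; it does not check that nested sections are indented consistently.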

saved_fisdat's People

Contributors

druimalban, wwaites, trondurt

saved_fisdat's Issues

UTF-8 error when uploading binary files

When uploading a binary file using fisup, I get a UTF-8 error.

(fisdat) C:[...]\data>fisup manifest.json
Checking merged_tables_Lus2.nc ...
Uploading gs://saved-fisdat/f463442a-d56f-11ee-8c83-e470b8f02723/manifest.json ...
Uploading gs://saved-fisdat/f463442a-d56f-11ee-8c83-e470b8f02723/merged_tables_Lus2.nc ...
Traceback (most recent call last):
  File "C:\ProgramData\miniconda3\envs\fisdat\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\miniconda3\envs\fisdat\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\miniconda3\envs\fisdat\Scripts\fisup.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\miniconda3\envs\fisdat\lib\site-packages\fisdat\cmd_up.py", line 91, in cli
    url = upload_files(args, [basename(args.manifest)] + data + schemas)
  File "C:\ProgramData\miniconda3\envs\fisdat\lib\site-packages\fisdat\cmd_up.py", line 28, in upload_files
    stuff = fp.read(BUFSIZ)
  File "C:\ProgramData\miniconda3\envs\fisdat\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 135: character maps to <undefined>

Feature Request: simulated vs measured data

The fisdat program should be able to specify that data is simulated or measured.

The fisup program may require that a manifest.json has both simulated and measured data that should be compared.

Write operation timing out

See attached text document for full details.

I've pulled a copy of fisdat this morning, and tried to upload 1 file using fisup. It's quite small, but it's still timing out on me.

Thanks for your help,
M
Fisup_upload error.txt

Installation error

I'm installing the app as instructed by the README file:

saved_fisdat/README.md

Lines 9 to 19 in caef665

git submodule init && git submodule update
This is a Python package. It can be installed in any of the usual ways
for Python packages, perhaps using a virtual environment like so,
python -m venv /some/where/env
. /some/where/env/bin/activate
whence installing the utilities is done as,
python setup.py install

my commands:

git clone [email protected]:wwaites/saved_fisdat.git
cd saved_fisdat/
git submodule init && git submodule update
python -m venv .venv
source .venv/bin/activate
python setup.py install

I get

error: The 'linkml' distribution was not found and is required by   fisdat

reset

cd ..
rm -rf saved_fisdat

But using pip to install the package:

git clone [email protected]:wwaites/saved_fisdat.git
cd saved_fisdat/
git submodule init && git submodule update
python -m venv .venv
source .venv/bin/activate
python -m pip install .

it installs

cd ..
ls $(python -c 'import pathlib, fisdat; print(pathlib.Path(fisdat.__file__).parent)')/..|grep data-model

but fisdat/../data-model is not to be found, as required by:

, default = str(Path(__file__).parent / "../data-model/src/model/meta.yaml"))

Problems with pip install

env) [email protected]:saved_fisdat$ python --version
Python 3.6.8

(env) [email protected]:saved_fisdat$ pip install --editable .
Obtaining file:///home/philip.gillibrand%40marineharvest.net/work/SAVED/saved_fisdat
Installing build dependencies ... error
ERROR: Command errored out with exit status 1:
command: /home/[email protected]/work/SAVED/env/bin/python /home/[email protected]/work/SAVED/env/lib64/python3.6/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-ktl_3nfo/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools setuptools-scm 'versioneer[toml]==0.29'
cwd: None
Complete output (6 lines):
Collecting setuptools
Using cached setuptools-59.6.0-py3-none-any.whl (952 kB)
Collecting setuptools-scm
Using cached setuptools_scm-6.4.2-py3-none-any.whl (37 kB)
ERROR: Could not find a version that satisfies the requirement versioneer[toml]==0.29 (from versions: 0.9, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22)
ERROR: No matching distribution found for versioneer[toml]==0.29

WARNING: Discarding file:///home/philip.gillibrand%40marineharvest.net/work/SAVED/saved_fisdat. Command errored out with exit status 1: /home/[email protected]/work/SAVED/env/bin/python /home/[email protected]/work/SAVED/env/lib64/python3.6/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-ktl_3nfo/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools setuptools-scm 'versioneer[toml]==0.29' Check the logs for full command output.
ERROR: Command errored out with exit status 1: /home/[email protected]/work/SAVED/env/bin/python /home/[email protected]/work/SAVED/env/lib64/python3.6/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-ktl_3nfo/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools setuptools-scm 'versioneer[toml]==0.29' Check the logs for full command output.
(env) [email protected]:saved_fisdat$

Is there a version requirement on the python environment installation?

Early exit

If there is an error in the header, it does not make sense to validate the data.

Merged YAML manifest format

The change which I just merged on the call just now does a couple of things:

The manifest files generated are now in the LinkML YAML format. This is because we then edit these files directly, to specify jobs. In terms of usage of the programs, the only change is to specify a '.yaml' extension instead of '.ttl' or similar (note that these are converted when uploading to TTL).

The local data model git submodule is no longer necessary since we got it hosted on https://marine.gov.scot/metadata/saved/schema/. If you pull in the changes, do remove the directory ./fisdat/data_model/.

There is an empty/ignored example job added to generated YAML manifest files. It is in this section that we'd describe real jobs (which I will document elsewhere).

I am tracking this as an issue as the changes aren't necessarily obvious. I'll need to update the documentation to note the changes, but refer to this issue in the meantime if there's anything there which is wrong/incomplete.

AttributeError: type object 'Draft201909Validator' has no attribute 'FORMAT_CHECKER'

Error when running a test example using fisdat.

System:
Windows 10 v 22H2
Python 3.10.4
linkml==1.7.10
linkml-dataops==0.1.0
linkml-runtime==1.7.5

$ fisdat density.yaml density.csv manifest1.yaml
C:\Users\tadams\OneDrive - Scottish Sea Farms\Documents\projects\202312_SAVED\saved_fisdat\fisdat\data_model.py:371: FutureWarning: Possible nested set at position 10
pattern=re.compile(r'^:?[a-z]+[[a-z]|_|]*$'))
WARNING [2024-05-24 15:16:35,089] [`term.py' `__new__' (l.287)] C:\Users\tadams\AppData\Local\Programs\Python\Python310\lib\site-packages\linkml_runtime\linkml_model\model\schema\types does not look like a valid URI, trying to serialize this will break.''
WARNING [2024-05-24 15:16:35,462] [`term.py' `__new__' (l.287)] C:\Users\tadams\AppData\Local\Programs\Python\Python310\lib\site-packages\linkml_runtime\linkml_model\model\schema\types does not look like a valid URI, trying to serialize this will break.''
Traceback (most recent call last):
  File "C:\Users\tadams\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\tadams\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\tadams\AppData\Local\Programs\Python\Python310\Scripts\fisdat.exe\__main__.py", line 7, in <module>
  File "C:\Users\tadams\OneDrive - Scottish Sea Farms\Documents\projects\202312_SAVED\saved_fisdat\fisdat\cmd_dat.py", line 315, in cli
    manifest_wrapper (data = args.csvfile
  File "C:\Users\tadams\OneDrive - Scottish Sea Farms\Documents\projects\202312_SAVED\saved_fisdat\fisdat\cmd_dat.py", line 224, in manifest_wrapper
    validation_check = validation_helper (data, schema, "TableSchema")
  File "C:\Users\tadams\OneDrive - Scottish Sea Farms\Documents\projects\202312_SAVED\saved_fisdat\fisdat\utils.py", line 44, in validation_helper
    report = validate_file (data, schema, target_class, strict = True)
  File "C:\Users\tadams\AppData\Local\Programs\Python\Python310\lib\site-packages\linkml\validator\__init__.py", line 113, in validate_file
    return validator.validate_source(loader, target_class)
  File "C:\Users\tadams\AppData\Local\Programs\Python\Python310\lib\site-packages\linkml\validator\validator.py", line 68, in validate_source
    return ValidationReport(results=list(self.iter_results_from_source(loader, target_class)))
  File "C:\Users\tadams\AppData\Local\Programs\Python\Python310\lib\site-packages\linkml\validator\validator.py", line 107, in iter_results_from_source
    for result in plugin.process(instance, context):
  File "C:\Users\tadams\AppData\Local\Programs\Python\Python310\lib\site-packages\linkml\validator\plugins\jsonschema_validation_plugin.py", line 43, in process
    validator = context.json_schema_validator(
  File "C:\Users\tadams\AppData\Local\Programs\Python\Python310\lib\site-packages\linkml\validator\validation_context.py", line 56, in json_schema_validator
    return validator_cls(json_schema, format_checker=validator_cls.FORMAT_CHECKER)
AttributeError: type object 'Draft201909Validator' has no attribute 'FORMAT_CHECKER'
This is fisdat version 0.5+4.g9f4151c, commit 9f4151c

When I run the same thing on a remote machine (Windows Server 10 Standard v 1809, Python 3.10.9), after updating to the latest version of fisdat etc., it still warns about the URIs but doesn't fail. LinkML versions match those above.

D:\OneDrive Profiles\Tom\OneDrive - Scottish Sea Farms\Documents\projects\202312_SAVED\saved_fisdat\fisdat\data_model.py:371: FutureWarning: Possible nested set at position 10
pattern=re.compile(r'^:?[a-z]+[[a-z]|_|]*$'))
WARNING [2024-05-24 15:50:26,016] [`term.py' `__new__' (l.287)] D:\ProgramData\TAdams\miniconda3\lib\site-packages\linkml_runtime\linkml_model\model\schema\types does not look like a valid URI, trying to serialize this will break.''
WARNING [2024-05-24 15:50:26,420] [`term.py' `__new__' (l.287)] D:\ProgramData\TAdams\miniconda3\lib\site-packages\linkml_runtime\linkml_model\model\schema\types does not look like a valid URI, trying to serialize this will break.''
This is fisdat version 0.5+4.g9f4151c, commit 9f4151c
Wrote to manifest.yaml:

| data URI    | data schema  | data hash |
| density.csv | density.yaml | 817813d04 |

date validator

the date validator validates everything:/
test script in link /manual_test
