
tcy's Introduction

TCY

TsvtoCondaYml. A package for easy creation of conda .yml files using a .tsv file as input.

Aims

Using .yml files as recipes to create conda environments is already a good step towards reproducible scientific computing environments. However, sometimes we want to know why a particular package was included (or not), what it does (improving transparency), and whether it runs without errors on all common operating systems (Linux, macOS, Windows). Spreadsheet files offer far more possibilities for documenting this. The goal of this repository is to combine the documentation capabilities of a .tsv file with the ability to export the packages described in it to a .yml file.

Use this repository for your own work

The easiest way to use tcy is to create your own repository by using this repository as a template. This has two advantages over using tcy locally on your machine:

  1. Your "recipes" will be stored in a GitHub repository and are therefore available from any machine, as long as you have an internet connection.
  2. Creating environments can take a lot of time, depending on the number of packages that need to be included. Using the following approach, the computationally heavy solving process is outsourced to a GitHub runner, so your personal machine can be used for other things.

If you want to use this approach, then follow these steps:

  1. Create your own repository by clicking on the Use this template button in the upper right.

  2. Make sure to allow GitHub runners to push changes to your repository by going to Settings → Actions → General → Workflow permissions → check "Read and write permissions".

  3. Clone your repository to your local machine.

  4. Make local changes to environments/packages.tsv.

  5. Push your changes. This will start a GitHub Actions workflow (that uses tcy and micromamba) to create .yml files with fully solved package specifications. The workflow will automatically push the files to your repo, so wait until it has finished.

  6. After the workflow has finished, pull the latest changes to your local repository.

  7. Create your conda environment using either the ubuntu-latest_solved.yml or the windows-latest_solved.yml file (depending on your OS) as follows:

    • Set the name of the environment by overwriting the name: attribute in the .yml file (a hypothetical example follows after this list).
    • After that, execute the following command to create your environment: conda env create -f ubuntu-latest_solved.yml (or conda env create -f windows-latest_solved.yml). (Note: there is no need to specify -n environment_name in this command because the name of the environment was already set in the first step. More information can be found here.)
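
A hypothetical header of a solved .yml file after setting the name (the environment name my_analysis_env and the listed packages are invented for illustration):

    name: my_analysis_env
    channels:
    - conda-forge
    dependencies:
    - python=3.11.5
    - numpy=1.26.0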

For developers

How to generate a custom .yml file using tcy

tcy can be pip-installed using pip install tcy. There are two ways to use tcy:

  1. You can import the run function into your own codebase using from tcy import run.
  2. tcy can also be used as a command-line application by simply running tcy in the terminal.

The following positional arguments have to be specified in both cases:

  • {linux,windows} (Operating system under which the .yml file will be used to create a conda environment. Can be 'linux' or 'windows'. Depending on the input, only packages that run bug-free under the specified OS are selected. Packages that are flagged with 'cross_platform' in the bug_flag column of the input .tsv file are never included.)

The following optional arguments can be set for further customization (a combined usage example follows after the list):

  • --yml_name (Sets the "name:" attribute of the .yml file. If not given, the .yml file will not have a "name:" attribute. This is useful if the file should only be used for updating an existing environment that already has a name, i.e. not to create a new one)
  • --yml_file_name (Sets the name of the .yml file. The default is 'environment.yml')
  • --pip_requirements_file (Write pip packages to a separate requirements.txt file. This file will always be placed in the same directory as the .yml file)
  • --write_conda_channels (Specifies conda channels directly for each conda package (e.g. conda-forge::spyder). In this case the 'defaults' channel is the only channel that appears in the 'channels:' section. See: this link for a preview)
  • --tsv_path (Optional path to the .tsv file. If not given, the function will expect a "packages.tsv" file to be in the current working directory)
  • --yml_dir (Path to a valid directory where the .yml file should be placed. If not given, the file will be placed in the current working directory. If a requirements.txt for pip is generated, it will always be placed in the same directory as the .yml file)
  • --cran_installation_script (If set, generates a bash script install_cran_packages.sh that allows installing CRAN packages within the conda environment. Only valid when --yml_name is set)
  • --cran_mirror (A valid URL to a CRAN mirror from which packages should be downloaded. The default is https://cloud.r-project.org)
  • --languages (Filter for certain languages. Valid inputs are 'python', 'r', 'julia' or 'all'. The default is 'all')
  • --necessity (Filter for necessity. Valid inputs are 'optional' and 'required').
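
For illustration, here is a CLI invocation that combines several of these options (all values are examples):

    tcy linux --yml_name my_env --yml_dir envs --languages python --necessity required

The same options can be passed to the run function when using tcy from Python. This is a minimal sketch that assumes run accepts keyword arguments mirroring the CLI flags; the exact signature may differ:

    from tcy import run

    # hypothetical keyword arguments mirroring the CLI flags
    run('linux', yml_name='my_env', yml_dir='envs',
        languages='python', necessity='required')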

The packages.tsv file

The input spreadsheet file needs to have the following columns (a hypothetical excerpt follows after the list):

  • package_name (the official name of the package)
  • version (specify the version of the package you need, following the package match specification syntax)
  • package_manager (can be 'pip', 'conda', or 'cran')
  • conda_channel (which conda channel to install from)
  • necessity (can be 'required' or 'optional')
  • language (can be 'python', 'r', or 'julia')
  • bug_flag (can be 'linux', 'windows', or 'cross_platform')
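
A hypothetical excerpt of such a file (tab-separated; all entries are invented for illustration):

    package_name	version	package_manager	conda_channel	necessity	language	bug_flag
    spyder	>=5	conda	conda-forge	required	python
    plotly		conda	conda-forge	optional	python
    lme4		cran		required	r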

Automatic testing of the packages.tsv file

This repository includes a testing pipeline that checks the integrity of / valid entries in packages.tsv. Which tests run is decided by the test_configs.json file. Each test corresponds to a key within the JSON file. If the corresponding value is null, the test is not executed. Here's an explanation of each test and the rules for how its value should be provided in case the test should be executed (a hypothetical configuration file is sketched after the list).

  • valid_columns (list of column names): the .tsv file must contain only these columns, in this exact order
  • filled_out_columns (list of column names): cells in these columns must not contain NaNs, i.e. every row within these columns must contain a value
  • valid_options (dict with column names as keys and lists of valid options as values): cells in these columns must only contain these values
  • column_dependencies (dict with column names as keys and lists of other columns as values): if a cell in such a column is filled out, the cells in the other column(s) must also be filled out
  • conditional_column_dependencies (dict of dicts of lists): if a cell in such a column has a given value, the cells in the associated column(s) must be filled out
  • multi_option_columns (dict with column names as keys and lists of valid options as values): cells in these columns must only contain valid options, separated by commas
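
A hypothetical test_configs.json illustrating the value shapes described above (column names taken from packages.tsv; set a value to null to skip the corresponding test):

    {
      "valid_columns": ["package_name", "version", "package_manager", "conda_channel", "necessity", "language", "bug_flag"],
      "filled_out_columns": ["package_name", "package_manager"],
      "valid_options": {"necessity": ["required", "optional"]},
      "column_dependencies": {"conda_channel": ["package_manager"]},
      "conditional_column_dependencies": {"package_manager": {"conda": ["conda_channel"]}},
      "multi_option_columns": {"language": ["python", "r", "julia"]}
    }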

CRAN-packages

EDIT: Still in development! Some R packages are not (yet) available as conda packages. To semi-automate the installation of these packages in your conda environment, run install_cran_packages.sh. This script will activate the conda environment, start R within it, and then install the CRAN packages via install.packages(). Note that this is not the recommended way to do it, but some R packages are simply not available as conda packages (though this should be checked on a regular basis).

Q & A

What about dependencies?

It's not necessary to specify dependencies in the .tsv file! Conda will take care of that. For example, there's no need to put numpy in the .tsv file, because numpy is a common dependency of most scientific Python packages (e.g. scikit-learn, pytorch, etc.). There might, however, be cases with optional dependencies that can, but do not have to, be installed (example: the plotting package plotly works completely fine if we install it as is, but if we want the nice feature of creating interactive plots, we also have to install the dependency orca). Optional dependencies should be marked as dependency in the area column of the .tsv file.

Why not create the environment and share the exported .yml file?

Theoretically, there would be an even better option than everyone creating the same environment over and over: the environment would be created only once (which can take a long time, because conda has to resolve a dependency graph in which each package is 'happy' with the versions of all other packages). This environment could then be exported via conda env export > environment.yml. Finally, other users could take this .yml file to create the environment without having to resolve the dependency graph one more time, because the file already contains the 'solution'. More information on that can be found here.

But here comes the catch: this file will probably not work across operating systems and their versions (e.g. your own laptop, which might run Windows, vs. your server, which runs Linux). The reason is that complex dependency graphs contain packages that are only available for a specific OS or OS version.

The long-term solution for this problem is to create a containerized setup that includes a conda environment, as aimed for in csp_docker.


tcy's Issues

Order of channels in the channels section changes when there are ties

Currently, when channels are not specified directly, we write the unique channels to the channels section in order of frequency. For that we sort them by count. But when two or more channels have an equal count, the value_counts method returns them in arbitrary order. So we should also sort them alphabetically afterwards; see the excerpt below (tcy/tcy.py, lines 39 to 49 at commit 6b4f8b9) and the sketch that follows it:

    # write channel attribute
    # if conda channels are not specified directly, all conda channels will be
    # written into the channels section. The order of appearance defines the priority.
    # Therefore, we sort the data frame according to how often a particular channel is needed.
    # The most needed channel appears first and the least needed appears last.
    # The default channel always comes last
    f.write('channels:\n')
    if write_conda_channels == False:
        conda_channel_counts = df['conda_channel'].value_counts().index.to_list()
        for channel in conda_channel_counts:
            f.write(f"- {channel}\n")
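
A minimal sketch of the proposed fix, reusing the names from the excerpt above (the two-level sort key is an assumption about the intended behavior):

    counts = df['conda_channel'].value_counts()
    # sort by descending frequency first, then alphabetically to break ties deterministically
    conda_channel_counts = sorted(counts.index, key=lambda ch: (-counts[ch], ch))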

rtools doesn't work under Linux

I get a ResolvePackageNotFoundError. rtools is made for Windows, so this makes sense. Add an option to ignore this package when creating a conda environment under Linux.

install_pip_packages.sh and install_cran_packages.sh should be runnable from Anaconda prompt

install_cran_packages.sh should activate the conda environment and then install CRAN packages in it (in one go). I created the bash scripts install_pip_packages.sh and install_cran_packages.sh to be able to install pip and CRAN packages AFTER the environment was created. But the line conda activate csp_surname_name does not work from within a bash script (https://stackoverflow.com/questions/47246350/conda-activate-not-working).

Besides, the scripts should also work as Windows shell scripts, because bash does not work on Windows (maybe it would work if the bash scripts are executed from Git Bash? But then we have to tell Git Bash how to activate conda).

Avoid pip ssl-errors

Currently, when trying to create the environment from environment.yml, one sometimes gets a ResolvePackageNotFound error for the pip packages. When commenting out the pip part, everything works (in other words: for the conda packages everything seems to work fine, but not for the pip packages). However, this breaks the logic, as the user then has to install the pip packages in their environment using install_pip_packages.sh.

This is the current workaround for install_pip_packages.sh:
https://stackoverflow.com/questions/25981703/pip-install-fails-with-connection-error-ssl-certificate-verify-failed-certi
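
A sketch of that workaround (the --trusted-host flags are the fix suggested in the linked thread; the requirements file name is an example):

    pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org -r requirements.txt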

Update base-image

It would be nice to have a newer base image, but we run into trouble with images like debian-bookworm or newer versions of neurodebian (either we are not able to install packages with neurodocker using arguments like --fsl version=6.0.4, or we are stuck in user surveys where we cannot answer "yes" or "no"). See also these comments.

Add neurokit2

Specifically needed for computing multiscale sample entropy

Allow setting --yml_file_name

Currently, we can specify a directory for the exported .yml file but not a file name. But if we want to create explicit & dynamic file names (for example with GitHub Actions, where we run tcy for different OSes), we need to be able to do something like:

tcy.py linux --yml_file_name=environment_linux.yml

Error with scikit-learn and semopy

Some package in the pip section apparently has scikit-learn as a dependency in its requirements, but the install command is now different. We need to find that pip package and ask the maintainers to switch to the new install command.

Environment creation can take a long time

Currently, the creation of an environment takes a really long time. This might have something to do with the channel priority arguments. See whether this problem can be solved:
solving environment for 6 hours · Issue #7690 · conda/conda · GitHub, Set the channel_priority in Conda environment.yml - Stack Overflow, Allow configuration setting to be set in environment.yml file · Issue #8675 · conda/conda · GitHub. The solution might be to set conda config --set channel_priority strict, but it would be nice if this could be set within the .yml file. Another option seems to be changing the order of the channels (though I do not know why this improves speed); maybe this also helps? https://stackoverflow.com/questions/61239956/is-it-possible-to-specify-a-conda-config-into-an-environment-file/

Make tcy package pip-installable and add standalone-functionality

It would be nice if this package could be pip-installable (see my nisupply repo and the included pyproject.toml file) and if users could invoke it like:

tcy ... instead of python tcy.py ...

See: https://packaging.python.org/en/latest/guides/installing-stand-alone-command-line-tools/#installing-stand-alone-command-line-tools
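
A minimal sketch of the pyproject.toml section that would enable a standalone tcy command (the module path tcy.tcy:main is a hypothetical entry point):

    [project.scripts]
    tcy = "tcy.tcy:main"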

How did the fmriprep peepz do this? https://fmriprep.org/en/20.2.0/installation.html#the-fmriprep-docker-wrapper

  • Make pip-installable
  • Create standalone application

Allow user to filter out programming languages

Would be nice if you could run something like

python tcy.py --languages python to only get Python packages. But how would we deal with dependencies between Python and R? For example, the pymer package needs R.

Add ahba

Needed to define gene receptor expression values using the Allen Human Brain Atlas and an input atlas of choice

Run tests locally when running tcy.py

The tests that run in GitHub Actions should also run locally with every execution of tcy.py. This is especially useful if people use their own packages.tsv file.

Let users further specify bug_flag

Allow users to further specify the bug flag. It should be possible to specify on which OSes certain packages don't work. Depending on the OS used, tcy would then automatically filter for the right packages. Related to #15.

Switch to the latest version of Python (3.11)

Python 3.11 is a lot faster (40-60%) than older versions. Make sure that you also get a stable version of R when switching over to 3.11. Packages like rpy2 and pymer will possibly complain because they are not yet stable under Python 3.11. It would probably also be nice to have a more elaborate parsing of the version column in the .tsv file for that.

See:

https://stackoverflow.com/questions/72178646/upgrading-a-conda-environment-yml-file-to-current-version-that-supports-pandas-1

And:

https://docs.conda.io/projects/conda-build/en/latest/resources/package-spec.html#package-match-specifications

Make name and surname optional keyword arguments

Right now, the parser also accepts --ignore_yml_name, but even when using this flag, users are still forced to provide surname and name as positional arguments, which contradicts the --ignore_yml_name flag. Besides, not everyone wants to name their environment in the CSP style (csp_surname_name); we should allow users to set whatever name: they like.

Solution:

1.) Merge surname and name into --name: whatever string the user then provides should be set as the name: attribute.
2.) If --name is not provided, simply don't set it (a sketch follows below).
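
A minimal sketch of this proposal (assuming argparse is used; the file-writing part is illustrative):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('os', choices=['linux', 'windows'])
    parser.add_argument('--name', default=None,
                        help='value for the name: attribute of the .yml file')
    args = parser.parse_args()

    with open('environment.yml', 'w') as f:
        # only write the name: attribute when the user provided one
        if args.name is not None:
            f.write(f'name: {args.name}\n')
        # ... channels: and dependencies: sections would follow here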

Allow filtering for necessity

Would be nice to filter for only the necessary packages and leave the others out. Of course, what is necessary depends on the user, but people can still create their own packages.tsv file.

Use jupyterlab + jupyter lab variable explorer as alternative for RStudio

When installing RStudio via conda, the R version gets downgraded to 3.x because the RStudio build on conda is not well maintained. This in consequence leads to a lot of incompatibility issues, because some packages only work with R versions > 4.x. The current solution is to NOT install RStudio in the first place but to rely completely on JupyterLab as the IDE for R. For that, it would be nice to get the variable inspector in JupyterLab working, because having no way to inspect variables sucks.

Merge parsing function

Not exactly sure, but I think the functions write_conda_package, write_pip_package, write_to_pip_requirements, and write_pip_packages could be merged into one single function (e.g. parse_tsv_row) that handles everything. This function should be used as a callable for df.apply(). It should check whether the current row is a "conda" or a "pip" row and take the necessary steps. With this, there would also be no need anymore to subset the dataframe into a pip and a conda dataframe. A rough sketch follows below.
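
A rough sketch of the merged function (column names from packages.tsv; how the file handle and pip list are passed around is an assumption):

    import pandas as pd

    df = pd.read_csv('packages.tsv', sep='\t')
    pip_lines = []

    def parse_tsv_row(row, yml_handle, pip_lines):
        # dispatch on the package_manager column instead of pre-splitting the dataframe
        if row['package_manager'] == 'conda':
            yml_handle.write(f"- {row['conda_channel']}::{row['package_name']}\n")
        elif row['package_manager'] == 'pip':
            pip_lines.append(row['package_name'])

    with open('environment.yml', 'a') as f:
        df.apply(parse_tsv_row, axis=1, args=(f, pip_lines))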

Allow yml-file to have no name

This is useful if the .yml file should simply be used to update an existing environment (not to create a new one). This is needed for csp_neurodocker, because the current approach there is to update the base environment instead of creating a new one (this is a workaround, though).

Add option to choose output directory for environment.yml file

This is necessary so that when tcy gets added as a submodule somewhere else and tcy.py is run with custom arguments, we don't overwrite the environment.yml file in the tcy submodule. Otherwise git would think something has changed, and we would overwrite the default environment.yml, which should stay as it is (i.e. pip packages in a separate requirements.txt, name: header, etc.).

GitHub-Actions: Run tcy before solving environment

In the long run we want to parametrize the GitHub Actions workflows and solve the environment under different base images (Ubuntu, macOS, Windows). But certain packages described in the .tsv file do not run under certain OSes or are buggy. Therefore, before each solving process, run python tcy.py linux/windows/mac --yml_name linux/windows/mac. The generated .yml file should then be piped into the solving process. But make sure to push back only the solved environment, not the generated input environment. This can probably be achieved with the file-pattern flag of the git-auto-commit workflow:

file_pattern: '*.php src/*.js tests/*.js'

Related: #51
