exasol / ai-lab

Development environment for data science developers

License: MIT License

Python 54.04% Jupyter Notebook 45.35% Jinja 0.06% Shell 0.53% Dockerfile 0.02%
exasol exasol-integration language-container script-languages

ai-lab's People

Contributors

ahsimb, ckunki, dejanmihajlovic, marlenekress79789, shmuma, tkilias, tomuben


ai-lab's Issues

Create CI build workflow to build and push Docker Image

Acceptance criteria

  • Tests have been renamed / moved:
    • git mv test/aws test/integration/aws
    • git mv test/ci test/codebuild (requires updating script aws-code-build/ci/buildspec.yaml, too)
  • CI Build has been enhanced to additionally execute all tests in folder test/integration
  • Developer Guide has been enhanced to describe CodeBuild triggering in more detail
  • Generated release letter is enhanced to also name the Docker image

Details on CodeBuild triggering

On release

  • The user calls release-droid,
  • which in turn runs GitHub action release_droid_upload_github_release_assets.yml.
  • This calls exasol.ds.sandbox.main start-release-build,
  • which executes lib/release_build/run_release_build.py.

Currently, the webhook is configured in GitHub so that GitHub sends a POST request on specific events, e.g. push.
Apparently the event contains the name of the current branch, e.g. refs/heads/main, and the commit message.
The CodeBuild project in the AWS stack is configured to evaluate the event and to start on commit message "[CodeBuild]".
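
For illustration, a minimal Python sketch of which payload fields such a trigger evaluates (branch ref and commit message); the actual filtering happens in the CodeBuild webhook configuration, not in DSS code:

import json

# Sketch only: decide from a GitHub push-event payload whether the build
# should start; field names follow the GitHub push-event schema.
def should_trigger_codebuild(event_body: str) -> bool:
    event = json.loads(event_body)
    ref = event.get("ref", "")                                  # e.g. "refs/heads/main"
    message = (event.get("head_commit") or {}).get("message", "")
    return ref.startswith("refs/heads/") and "[CodeBuild]" in message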

Open questions:

  • The web hooks in the GitHub repo are created by AWS template templates/ci_code_build.jinja.yaml
  • The ci user used by AWS CodeBuild has therefore been granted access to the GitHub repo, including the permission to create web hooks

Details on release letter

Create a table-of-contents notebook

This should be the header notebook of the set of tutorials. It should

  • Include links to all other notebooks.
  • Provide the structure of the tutorials.

DSS Rename classes and packages in lib/ansible

Replace from exasol.ds.sandbox.lib.ansible.ansible_run_context import AnsibleRunContext
by import exasol.ds.sandbox.lib.ansible and call ansible.RunContext.

This could be done by creating a file ansible/__init__.py with content import ...ansible.repository,
which enables the usage ansible.repository.default.
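
A possible sketch of such an __init__.py (the exact set of re-exported names is an assumption; only AnsibleRunContext and AnsibleAccess are mentioned in this ticket):

# exasol/ds/sandbox/lib/ansible/__init__.py (sketch)
from exasol.ds.sandbox.lib.ansible.ansible_run_context import AnsibleRunContext as RunContext
from exasol.ds.sandbox.lib.ansible.ansible_access import AnsibleAccess as Access

# Usage at call sites:
# from exasol.ds.sandbox.lib import ansible
# ctx = ansible.RunContext(...)
# access = ansible.Access(...)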

@Nicoretti :

Having a module ansible_access which just contains a class AnsibleAccess does not make much sense and only increases noise and verbosity imho. (ansible_access.AnsibleAccess vs ansible.Access)

ansible.Access would be just as descriptive.

Fix setup_db.ipynb

  • make it reusable from other notebooks
  • make parameters redefinable

Make systemd call entrypoint.py

Ticket #69 described adding an entry point to the DSS Docker Container Images in order to copy notebook files and start the Jupyter server.

The entrypoint was implemented as Python script entrypoint.py.
The current ticket requests to call entrypoint.py for the other image types (AMI, vmdk, vhd), too.

Please note: This is not required for calling magic %pip, see #273

Rename all occurrences of script-languages-developer-sandbox to data-science-sandbox

Tasks

Remove unnecessary apt-update

Currently, there are multiple Ansible tasks calling apt update (update_cache: true or update_cache: yes).
This takes some time and is only required once at the beginning.

The current ticket requests to remove unnecessary calls to apt update.

Use a non-root user to run Jupyter in the Docker Image

Security best practices recommend not running services in Docker as root. We should prepare a non-root user that is used to run Jupyter.

https://cheatsheetseries.owasp.org/cheatsheets/Docker_Security_Cheat_Sheet.html#rule-2-set-a-user

In total we will need two users besides root:

  • A user with sudo permission, used for the Ansible installation using sudo where required:
    • named ansible for the Docker Edition
    • named ubuntu for AMI and VM images
  • A user jupyter without sudo permission, used for running the Docker container and JupyterLab.

Investigate persistent storage of user modified files, e.g. Jupyter notebooks

Use case / requirements

  • DSS should ship some files with a-priori default content
  • Users of DSS should be able to modify these files by using the DSS
  • DSS should enable users to save the modified files persistently
  • When starting the DSS the next time, users should be able to continue with the last saved state of their files

The current ticket requests to investigate how these requirements could be achieved.

Idea: Maybe Docker VOLUME can be used to achieve this functionality?
See https://docs.docker.com/engine/reference/builder/#volume

CLI: setup-vm-bucket-waf: default value for cli option allowed-ip

DSS stores AMI images in a publicly accessible AWS S3 bucket. In case of automated downloads, Exasol faces the risk of significant costs, as AMI images are quite large (several GB).

To avoid this risk the S3 bucket is protected by a web application firewall (WAF) which asks the downloader to solve a captcha. This should prevent at least simple implementations of automated downloaders.

According to AWS documentation, WAF with captcha is only supported for region us-east-1 (N. Virginia), which should already be covered by the default setting for waf_region in file exasol/ds/sandbox/lib/config.py.

Additionally, DSS does not require any access to the S3 bucket without WAF, but the AWS CLI requires specifying at least one IP address to be excluded.

poetry install # if you modified the template
export AWS_DEFAULT_REGION=us-east-1
poetry run exasol/ds/sandbox/main.py setup-vm-bucket-waf --allowed-ip 127.0.0.1
poetry run exasol/ds/sandbox/main.py setup-vm-bucket --aws-profile ci4_mfa

The current ticket therefore asks to

  1. Enhance the DSS developer guide to describe the region restriction and how to handle it
  2. Add a default value to CLI option --allowed-ip of DSS command setup-vm-bucket-waf

Current implementation:

@click.option('--allowed-ip', type=str)

Proposal:

@click.option('--allowed-ip', type=str, default="127.0.0.1", show_default=True)

See https://click.palletsprojects.com/en/8.1.x/options/

Move SLC cloning to notebook

In the past SLC was cloned by Ansible task exasol/ds/sandbox/runtime/ansible/roles/script_languages/tasks/main.yml

In the future this should be done by the related SLC notebook in the DSS.

This will require changes to

  • user_guide.md
  • test/ansible_conflict/slc_setup.yml
  • test/test_ansible.py
  • test/test_ci.py
  • test/test_install_dependencies.py
  • test/test_release_build.py
  • exasol/ds/sandbox/lib/config.py
  • exasol/ds/sandbox/lib/setup_ec2/run_install_dependencies.py

See also

Investigate reuse strategies for Docker image

Potential use cases

UC-1

  • notebook developer Nadine works on creating a new Jupyter notebook or updating an existing one
  • Nadine wants to use new libraries that are not available yet in the latest release on docker-hub
  • Nadine therefore wants to build a private Docker image from the branch she is currently working on

UC-2

  • As image creation currently (Nov 2023) takes around 7 minutes, Nadine wants to reuse the image in follow-up usage

UC-3

  • Nadine changed a file or dependency that requires re-creating the Docker image, taking the change into account

Investigate how to store the notebooks persistently in case of Docker container removal

The current implementation is

  1. Create and deliver the Docker Container with default notebook files in folder $BACKUP
  2. Mount a host directory to mount point $MOUNT inside the Docker Container
  3. Add an entry point to the Docker Container
  4. For each notebook file the entry point checks if the file is present in $MOUNT
  5. If not then the entry point copies the file from $BACKUP to $MOUNT
    • Entry point only copies missing files and directories (FS objects)
    • Copied FS objects are owned by root (of Docker Container)
    • The entry point relaxes permissions for copied FS objects so that the host user outside the Docker Container can still modify and delete these FS objects in the mounted directory (a sketch of this copy logic follows after the list)
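
A minimal sketch of this copy logic (paths and function name are illustrative, not the actual entrypoint.py):

import os
import shutil
from pathlib import Path

def copy_missing_defaults(backup: Path, mount: Path) -> None:
    """Copy FS objects missing in $MOUNT from $BACKUP and relax permissions."""
    for src in backup.rglob("*"):
        dst = mount / src.relative_to(backup)
        if dst.exists():
            continue  # never overwrite files the user has already modified
        if src.is_dir():
            dst.mkdir(parents=True, exist_ok=True)
        else:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
        # allow the host user outside the container to modify and delete the copy
        os.chmod(dst, 0o777)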

Directory setting in current implementation

Type of DSS image:

Folder     Docker                               AMI, VM images (vhd, vmdk)
$BACKUP    {{ user_home }}/notebook_defaults    {{ user_home }}/notebooks
$MOUNT     {{ user_home }}/notebooks            mounting is not supported

Additional task:

  • Try out VOLUME directive in Dockerfile, results: see comment below.

The current implementation uses a Python script as entry point because

  • it is more flexible,
  • it is easier to test, and
  • it offers higher-level structures and language features.

However, analysis showed that handling owners and permissions for FS objects in the mounted directories is difficult.
We therefore propose not to mount a host directory directly into the Docker container but to use Docker volumes for persistent storage. Ticket #78 therefore requests to add instructions to the DSS User Guide.
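
For the user guide, usage could look roughly like the following sketch based on the Docker SDK for Python (volume name and mount path are assumptions; the image tag is only an example):

import docker

client = docker.from_env()
# A named Docker volume keeps the notebooks even if the container is removed.
container = client.containers.run(
    "exasol/data-science-sandbox:0.1.0-dev-1",
    detach=True,
    volumes={"dss-notebooks": {"bind": "/home/jupyter/notebooks", "mode": "rw"}},
)
print(container.id)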

Remove apt cache to reduce image size

Background

We need to update the apt cache during the setup of the EC2 instance. Maybe we could decrease the size of the final VM files by cleaning up the apt cache at the end.

Acceptance Criteria

Run the following Ansible tasks at the end of the setup:

- ansible.builtin.apt:
    clean: yes
  become: yes

- ansible.builtin.file:
    path: /var/lib/apt/lists/
    state: absent

See also https://askubuntu.com/questions/1050800/how-do-i-remove-the-apt-package-index

Here is a measurement on our image:

$ docker run -it exasol/data-science-sandbox:0.1.0-dev-1 bash
root@9fed5334f0a2:/# du -sh  /var/lib/apt/lists/
46M	/var/lib/apt/lists/

Enable suppressing Ansible output

Currently DSS calls the external library function ansible_runner.run() in ansible_access.py.

Some output is already passed to the argument printer defined in ansible_runner.py.

However, Ansible prints some more output that can only be suppressed by adding the argument quiet=True to ansible_runner.run(); see the documentation of ansible_runner.

The current ticket requests to

  1. Add an argument to AnsibleAccess.run() enabling callers to control the value of the argument quiet passed to ansible_runner.run() (see the sketch below).
  2. Update callers like ansible_runner.py to pass argument quiet depending on the current log_level, e.g. quiet = (log_level >= logging.INFO).
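
A sketch of the requested change (the signature of AnsibleAccess.run() is an assumption; quiet is a documented parameter of ansible_runner.run()):

import logging
import ansible_runner

class AnsibleAccess:
    def run(self, private_data_dir: str, quiet: bool = False, **kwargs):
        # quiet=True suppresses the remaining console output of Ansible
        return ansible_runner.run(private_data_dir=private_data_dir, quiet=quiet, **kwargs)

# A caller could derive quiet from the current log level:
quiet = logging.getLogger().getEffectiveLevel() >= logging.INFO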

CLI: Refactor location of scripts and naming of CLI commands

Renaming, see Developer guide:

  • setup-ec2 → create-ec2-instance
  • create-vm → create-vm-image
    • TODO check if it is multiple images of different types or only AMI

Location

  • Currently all CLI commands are in one folder exasol/ds/sandbox/cli/commands
  • The current ticket requests to organize the CLI scripts into folders for Docker, EC2, AWS, ...

Additional tasks

  • Register scripts in poetry to enable execution without running python ...
  • This could be done together with #21

Move Jupyter notebooks to folder visible to ansible

Ticket #16 requests to Install Jupyter notebooks via ansible.
As a prerequisite the notebook files need to be made visible to ansible.

The current ticket therefore requests to move the notebook files to a folder visible to Ansible.

To avoid merge conflicts, this should only be done after all current pull requests involving changes to one of the notebook files have been merged, e.g. PR #32.

Decouple DSS release versions from SLC version

Currently release versions of DSS are based on the version of SLC (https://github.com/exasol/script-languages-release/).

The current ticket requests to decouple the release versions of DSS. In effect, image names (e.g. of AMIs) no longer need to contain the SLC version but only the version of the DSS.

Required changes

  • documentation: user-guide, developer-guide
  • build scripts / implementation
  • exasol/ds/sandbox/lib/release_build/run_release_build.py
  • test/test_ci.py
  • exasol/ds/sandbox/cli/options/id_options.py

Enhance setup_ec2 to return AnsibleFacts

  • Currently AnsibleAccess looks for entry "docker_container" in extra_vars
  • setup_ec2/run_setup_ec2_and_install_dependencies.py passes HostInfo to run_install_dependencies()
  • run_install_dependencies adds this to
  • AnsibleRunner already supports parameter host_infos but does not forward it
    to AnsibleAccess

Options

  • Either AnsibleAccess.run() should return the ansible_runner result r (which I think breaks the abstraction)
  • or AnsibleRunner should accept host_infos as an additional argument

If using host_infos, then dss_docker/create_image.py should also be updated to forward this info to ansible_runner.

AnsibleAccess should return a dict with

  • hostnames as keys and the
  • corresponding fact_cache as values

The caller can extract the respective fact_cache based on the known hostname (see the sketch below).
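
A sketch of the proposed return value, using ansible_runner's documented get_fact_cache() accessor (function name and parameters are illustrative):

def facts_by_host(runner, hostnames) -> dict:
    # Map each known hostname to the facts cached for it by ansible_runner.
    return {host: runner.get_fact_cache(host) for host in hostnames}

# facts = facts_by_host(r, ["docker_container"])
# docker_facts = facts["docker_container"]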

Docker container: Display usage instructions

Users of the DSS Docker Image should get clear instructions, including

  • how to connect to the Jupyter server: hostname and port?
  • the default password
  • a recommendation to change the default password

Inside the Docker container this can be done with a logger or a Python print.
If the user did not specify CLI option -d or --detach, then the output will be displayed on the console.
Otherwise the user needs to call docker logs.

As the Docker container will only know its own IP, our best option could be to write generic instructions and, as an example, use localhost and the default port of Jupyter.

A reminder to change the password is important in any case.
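
A hypothetical sketch of such instructions printed by the entry point (host name, port, and the password placeholder are examples only):

def print_usage(port: int = 8888, default_password: str = "<default password>") -> None:
    print(
        "Jupyter server started.\n"
        f"Open http://localhost:{port} in your browser "
        "(replace 'localhost' by the host the container runs on).\n"
        f"The default password is '{default_password}'.\n"
        "Please change the default password after the first login."
    )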

Create architecture design documents

Background

We need to get an overview of the design and architecture of the Data Science Sandbox, including:

  • User view
  • Image Building
  • Dependencies
  • Notebook Structures
  • SLC Activation
  • Exasol Wrappers

Update strings in images

After the renamings in #5, the images should be updated, too.

Folders:

  • doc/developer_guide/img/
  • doc/user_guide/img/

DSS Add entry point to start Jupyter server

Investigation

Backup path is defined in ansible/general_setup_tasks.yml as "{{user_home}}/notebook-defaults".

Add entry point

Displaying how users can connect to the Jupyter server has been moved to a separate ticket.

Recently I also adjusted the folder to which Ansible copies the notebook files initially.
See issue #51

Install Jupyter notebooks via ansible

After the notebook files have been moved to a different folder as requested by ticket #53, the current ticket requests to install the Jupyter notebooks via Ansible.

Tasks

  • Move files in exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/ to sub-folder slc:
    • slc_main_build_steps.svg
    • bash_runner.py
    • script-languages.ipynb

DSS Push Docker Image to a Docker Registry

Please see ITDE cli/options/push_options.py for pushing images to a registry like hub.docker.com

ITDE test utility docker_registry.py

  • Uses docker container registry
  • Allocates a free port
  • Starts the container as local registry for verifying the publication of docker images
  • In the end calls docker_image_push_base_task

Option force_push

  • Seems to be evaluated only in this method.
  • I don't see a check for whether an image exists, only whether the image has been built (recently?)

For integration tests please have a look at file test_api_push_test_container.py of the ITDE and note:

  • api.push_test_container accepts arguments username and password for source and target docker registries.
  • The test does not provide these arguments.
  • api.push_test_container will set the value None in its calls to set_docker_repository_config()
  • DockerPushImageBaseTask will use None for fields username and password in auth_config.

Password

  • Support providing a user name, and make Python ask for the password interactively (see the sketch below)
  • Support an environment variable to pass the secret in CI builds
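
A sketch of the requested password handling (the name of the environment variable is an assumption):

import getpass
import os

def resolve_registry_password() -> str:
    # CI builds pass the secret via an environment variable ...
    password = os.environ.get("DOCKER_REGISTRY_PASSWORD")
    if password is None:
        # ... otherwise ask the user interactively without echoing the input
        password = getpass.getpass("Docker registry password: ")
    return password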

Create basic notebooks with cloud-storage-extension

  • fetch the latest jar from the cloud-storage-extension GitHub releases (see the sketch after this list)
  • put the cloud-storage-extension jar into BucketFS
  • create import scripts
  • arrange sample data (Parquet) on S3 (a public bucket is needed for this)
  • import it into a table
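
A hedged sketch for fetching the latest jar via the GitHub releases API (the asset selection logic is an assumption):

import requests

def latest_jar_url() -> str:
    url = "https://api.github.com/repos/exasol/cloud-storage-extension/releases/latest"
    release = requests.get(url, timeout=30).json()
    jars = [asset["browser_download_url"]
            for asset in release["assets"] if asset["name"].endswith(".jar")]
    return jars[0]  # assumes the release contains exactly one relevant jar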

Change initial password of Jupyter notebook

Currently: "script-languages"
Expected: "dss"

Tasks

Content of ansible task:

 password: "{{ lookup('ansible.builtin.env', 'JUPYTER_LAB_PASSWORD', default='script-languages') }}"

Change default port of Jupyter server

Currently 8888, which is quite frequently used for various purposes.

Proposal: use a different port, probably > 20000.

Wikipedia lists some "ephemeral" ports, still containing potential conflicts.
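
For illustration, the port could be set in the Jupyter server configuration file (a sketch; the concrete value and whether DSS uses this mechanism are assumptions):

# jupyter_server_config.py (the variable c is provided by Jupyter's config loader)
c.ServerApp.port = 49152   # example value > 20000, following the proposal above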
