exasol / ai-lab
Development environment for data science developers
License: MIT License
```shell
git mv test/aws test/integration/aws
git mv test/ci test/codebuild
```
(requires updating script `aws-code-build/ci/buildspec.yaml`, too)

test/integration
On release
- release_droid_upload_github_release_assets.yml
- .exasol.ds.sandbox.main start-release-build
- lib/release_build/run_release_build.py
Currently, the Webhook is configured in GitHub so that GitHub sends a post request on specific events, e.g. push.
Apparently the event contains the name of the current branch e.g. refs/heads/main
and the commit message.
The CodeBuild in the AWS Stack is configured to evaluate the event and to start on commit message "[CodeBuild]".
Open questions:
This should be the header notebook of the set of tutorials. It should
Replace `from exasol.ds.sandbox.lib.ansible.ansible_run_context import AnsibleRunContext` by `import exasol.ds.sandbox.lib.ansible` and call `ansible.RunContext`. This could be done by creating a file `ansible/__init__.py` with content `import ...ansible.repository`, which enables usage `ansible.repository.default`.

Having a module `ansible_access` which just contains a class `AnsibleAccess` does not make much sense and only increases noise and verbosity imho (`ansible_access.AnsibleAccess` vs. `ansible.Access`). `ansible.Access` should be as descriptive.
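The proposed flattening can be sketched with a throw-away package; the package name `anslib` and file contents below are purely illustrative, not the project's actual layout:

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a disposable package "anslib" (hypothetical name) to demonstrate
# re-exporting AnsibleAccess as anslib.Access from __init__.py.
root = Path(tempfile.mkdtemp())
pkg = root / "anslib"
pkg.mkdir()
(pkg / "ansible_access.py").write_text("class AnsibleAccess:\n    pass\n")
(pkg / "__init__.py").write_text(textwrap.dedent("""\
    # Shorter public name: anslib.Access instead of
    # anslib.ansible_access.AnsibleAccess
    from .ansible_access import AnsibleAccess as Access
"""))

sys.path.insert(0, str(root))
anslib = importlib.import_module("anslib")
print(anslib.Access)  # the re-exported class
```

The same pattern applied to `exasol/ds/sandbox/lib/ansible/__init__.py` would give callers the shorter `ansible.Access` spelling.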
Ticket #69 described adding an entry point to the DSS Docker Container Images in order to copy notebook files and start the jupyter server.
The entrypoint was implemented as Python script `entrypoint.py`.
The current ticket requests to call `entrypoint.py` for other image types (AMI, vmdk, vhd), too.
Please note: This is not required for calling magic `%pip`, see #273.
Currently in case of a failure the ci-build does not show the original cause, e.g. outdated version of an ubuntu package pinned in the ansible scripts.
Example: https://github.com/exasol/data-science-sandbox/actions/runs/6877900023/job/18706405162
Additionally, in support of the investigations requested by #59, this ticket requests to report the duration of single ansible steps.
Or maybe change the approach completely, depending on discussions.
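Reporting step durations could be as simple as a timing context manager around each step; this is a hypothetical helper (names `timed_step` and `durations` are ours, not existing project code):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger("step-timer")

durations: dict[str, float] = {}

@contextmanager
def timed_step(name: str):
    """Hypothetical helper: record and log the duration of one setup step."""
    start = time.monotonic()
    try:
        yield
    finally:
        durations[name] = time.monotonic() - start
        LOG.info("step %r took %.2f s", name, durations[name])

with timed_step("apt update"):
    time.sleep(0.01)  # stand-in for the real ansible task
```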
Provide a function and UI to open the Secret store.
Provide a UI to enter configuration and put it into the Secret store.
`run_install_dependencies()` currently requires entry `slc_version` in `ConfigObject`.
The current ticket requests to remove the need for this entry.
Start using the Secret Store
Revisit formatting in text cells.
Currently, there are multiple ansible tasks calling apt update (`update_cache: true` or `update_cache: yes`).
This takes some time and is only required once in the beginning.
The current ticket requests to remove unnecessary calls to apt update.
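To locate the redundant calls, a small scan over the ansible task files could help; this helper (`find_apt_cache_updates`) is a hypothetical aid, not existing project tooling:

```python
import re
import tempfile
from pathlib import Path

UPDATE_CACHE = re.compile(r"update_cache:\s*(true|yes)\b")

def find_apt_cache_updates(root: Path) -> list[Path]:
    """Hypothetical helper: list ansible task files that trigger apt update,
    so redundant occurrences can be spotted and removed."""
    return sorted(p for p in root.rglob("*.yml") if UPDATE_CACHE.search(p.read_text()))

# Demo on a throw-away directory
root = Path(tempfile.mkdtemp())
(root / "a.yml").write_text("apt:\n  update_cache: true\n")
(root / "b.yml").write_text("apt:\n  name: curl\n")
print(find_apt_cache_updates(root))  # only a.yml matches
```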
Create a series of notebooks to show training of a customer model in a notebook. Some of these notebooks, e.g. uploading data, will be shared in other tutorials.
If we want to test the notebook later automatically, we need to implement the poll loop, instead of entrusting the polling to the user.
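The poll loop mentioned above could look like the following sketch; the function name, timeout, and interval defaults are assumptions, not project API:

```python
import time

def poll(check, timeout_s: float = 300.0, interval_s: float = 5.0):
    """Generic poll loop sketch (hypothetical helper): call check() until it
    returns a truthy result or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("polling timed out")
        time.sleep(interval_s)
```

An automated notebook test would call `poll` with a check that queries the training job status instead of leaving the user to re-run a cell.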
The current ticket requests to document how users can run the DSS as Docker Container while using ITDE and EXASLCT.
This includes mapping the port of the docker socket from host to inside the Docker Container.
Security best practices recommend not to run services in Docker with root. We should prepare a non-root user which is used to run Jupyter.
https://cheatsheetseries.owasp.org/cheatsheets/Docker_Security_Cheat_Sheet.html#rule-2-set-a-user
In total we will need two users besides root:
- a user with `sudo` permission, used for Ansible installation using sudo where required: `ansible` for the Docker Edition, `ubuntu` for AMI and VM images
- `jupyter` without `sudo` permission, used for running the docker container and Jupyterlab.

Use case / requirements
The current ticket requests to investigate how these requirements could be achieved.
Idea: Maybe Docker VOLUME can be used to achieve this functionality?
See https://docs.docker.com/engine/reference/builder/#volume
The current ticket requests to move the files once more.
At the moment the rules are not enforced. Also, a user is required to enter the password in every notebook. This could motivate them to use the simplest, e.g. one-letter, password.
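An enforced rule set could be as small as the following sketch; the concrete rules (8 characters, one letter, one digit) are assumptions for illustration, not the project's actual policy:

```python
import re

def password_ok(pw: str) -> bool:
    """Hypothetical rule set: at least 8 characters,
    containing at least one letter and one digit."""
    return (
        len(pw) >= 8
        and re.search(r"[A-Za-z]", pw) is not None
        and re.search(r"\d", pw) is not None
    )

print(password_ok("a"))            # a one-letter password is rejected
print(password_ok("s3cret-pass"))  # accepted
```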
DSS stores AMI images in a publicly accessible AWS S3 bucket. In case of automated downloads the Exasol company faces the risk of significant costs as AMI images are quite large (n GB).
To avoid this risk the S3 bucket is protected by a web application firewall (WAF) which asks the downloader to solve a captcha. This should prevent at least simple implementations of automated downloaders.
According to AWS documentation WAF with captcha is only supported in region `us-east-1` (N. Virginia), which should be covered already by the default setting for `waf_region` in file `exasol/ds/sandbox/lib/config.py`.
Additionally DSS does not require any access to the S3 bucket without WAF, but AWS CLI requires to specify at least one IP address to be excluded.
```shell
poetry install # if you modified the template
export AWS_DEFAULT_REGION=us-east-1
poetry run exasol/ds/sandbox/main.py setup-vm-bucket-waf --allowed-ip 127.0.0.1
poetry run exasol/ds/sandbox/main.py setup-vm-bucket --aws-profile ci4_mfa
```
The current ticket therefore asks to add a default value for option `--allowed-ip` of DSS command `setup-vm-bucket-waf`.

Current implementation:
```python
@click.option('--allowed-ip', type=str)
```

Proposal:
```python
@click.option('--allowed-ip', type=str, default="127.0.0.1", show_default=True)
```
In the past SLC was cloned by Ansible task `exasol/ds/sandbox/runtime/ansible/roles/script_languages/tasks/main.yml`.
In the future this should be done by the related SLC notebook in the DSS.
This will require changes to
- user_guide.md
- test/ansible_conflict/slc_setup.yml
- test/test_ansible.py
- test/test_ci.py
- test/test_install_dependencies.py
- test/test_release_build.py
- exasol/ds/sandbox/lib/config.py
- exasol/ds/sandbox/lib/setup_ec2/run_install_dependencies.py
See also
UC-1
UC-2
UC-3
The current implementation involves two locations, `$BACKUP` and `$MOUNT` inside the Docker Container, and copies `$BACKUP` to `$MOUNT`.
| Folder | DSS Docker image | AMI, VM images (vhd, vmdk) |
|---|---|---|
| `$BACKUP` | `{{ user_home }}/notebook_defaults` | `{{ user_home }}/notebooks` |
| `$MOUNT` | `{{ user_home }}/notebooks` | mounting is not supported |
Additional task: evaluate the `VOLUME` directive in the Dockerfile, results: see comment below.

The current implementation uses a Python script as entry point.
However, analysis showed that handling owners and permissions for FS objects in the mounted directories is difficult.
We therefore propose to not mount a host directory directly into the Docker container but to use docker volumes for persistent storage. Ticket #78 therefore requests to add instructions to the DSS User Guide.
Background
We need to update the apt cache during the setup of the EC-2 instance. Maybe we could decrease the size of the final VM files by cleaning up the apt cache at the end.
Acceptance Criteria
Run:
```yaml
- ansible.builtin.apt:
    clean: yes
  become: yes

- ansible.builtin.file:
    path: /var/lib/apt/lists/
    state: absent
```
at the end of the setup.
See also https://askubuntu.com/questions/1050800/how-do-i-remove-the-apt-package-index
Here a measurement on our image:
```shell
$ docker run -it exasol/data-science-sandbox:0.1.0-dev-1 bash
root@9fed5334f0a2:/# du -sh /var/lib/apt/lists/
46M	/var/lib/apt/lists/
```
Currently DSS in `ansible_access.py` calls external library `ansible_runner.run()`.
Some output is already passed to argument `printer` defined in `ansible_runner.py`.
However, Ansible prints some more output that can only be suppressed by adding argument `quiet=True` to `ansible_runner.run()`, see documentation of `ansible_runner`.
The current ticket requests to
- pass argument `quiet` to method `ansible_runner.run()`
- set `quiet` depending on the current `log_level`, e.g. `quiet = (log_level >= logging.INFO)`.
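The proposed rule can be captured in a one-line function; this is a sketch of the ticket's suggestion, not existing project code:

```python
import logging

def ansible_quiet(log_level: int) -> bool:
    """Sketch of the proposed rule: suppress extra Ansible output unless the
    configured log level is more verbose than INFO (i.e. DEBUG or lower)."""
    return log_level >= logging.INFO

print(ansible_quiet(logging.INFO))   # True: suppress output
print(ansible_quiet(logging.DEBUG))  # False: show full output
```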
Renaming, see Developer guide:
- `setup-ec2` → `create-ec2-instance`
- `create-vm` → `create-vm-image`

Location: `exasol/ds/sandbox/cli/commands`

Additional tasks
run python ...
Ticket #16 requests to install Jupyter notebooks via ansible.
As a prerequisite the notebook files need to be made visible to ansible.
The current ticket therefore requests to move the notebook files using `git mv` to keep change history, see https://git-scm.com/docs/git-mv. To avoid merge conflicts this should only be done after all current pull requests involving changes to one of the notebook files have been merged, e.g. PR #32.
Currently release versions of DSS are based on the version of SLC (https://github.com/exasol/script-languages-release/).
The current ticket requests to decouple release versions of DSS. In effect image names (e.g. of AMIs) no longer need to contain the SLC version but instead only the version of the DSS.
Required changes
- exasol/ds/sandbox/lib/release_build/run_release_build.py
- test/test_ci.py
- exasol/ds/sandbox/cli/options/id_options.py
- `AnsibleAccess` looks for entry "docker_container" in `extra_vars`.
- `setup_ec2/run_setup_ec2_and_install_dependencies.py` passes `HostInfo` to `run_install_dependencies()`.
- `run_install_dependencies` adds this to `AnsibleRunner`.
- `AnsibleRunner` already supports parameter `host_infos` but does not forward it to `AnsibleAccess`.

Options
- `AnsibleAccess.run()` should return the result `r` of `ansible_runner` (which I think breaks the abstraction).
- `AnsibleRunner` should accept `host_infos` as additional argument. If using `host_infos`, then `dss_docker/create_image.py` should also be updated to forward this info to `ansible_runner`.
- `AnsibleAccess` should return a dict with the fact cache per host. Caller can extract the resp. fact_cache based on the known hostname.
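The last option could look like the following sketch; the dict shape and the helper name `extract_fact_cache` are assumptions about how the return value might be structured:

```python
from typing import Any, Dict

def extract_fact_cache(fact_caches: Dict[str, Dict[str, Any]],
                       hostname: str) -> Dict[str, Any]:
    """Hypothetical caller-side helper: return the fact cache recorded for the
    given host, or an empty dict if the host is unknown."""
    return fact_caches.get(hostname, {})

# Example: a dict as AnsibleAccess.run() might return it
caches = {"localhost": {"ansible_distribution": "Ubuntu"}}
print(extract_fact_cache(caches, "localhost"))
```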
Users of the DSS Docker Image should get clear instructions, incl.
Inside the Docker container this can be done with a logger or Python `print`.
If the user did not specify CLI option `-d` or `--detach` then the output will be displayed on the console. Otherwise the user needs to call `docker logs`.
As the Docker container will only know its own IP, our best option could be to write generic instructions, and as example use `localhost` and the default port of Jupyter.
A reminder to change the password is important in any case.
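A minimal sketch of the startup message the entrypoint could print; port 8888 and the exact wording are assumptions for illustration:

```python
JUPYTER_PORT = 8888  # assumption: the current default port

def connection_instructions(port: int = JUPYTER_PORT) -> str:
    """Hypothetical startup message; exact wording is illustrative only."""
    return (
        f"Jupyter server started. Open http://localhost:{port} in your browser\n"
        "(replace localhost if the container's port is mapped differently).\n"
        "Please change the default password after the first login."
    )

print(connection_instructions())
```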
We need to get an overview of the design and architecture of the Data Science Sandbox, including:
After renamings in #5 the images should be updated, too.
Folders:
Add notebooks to showcase sagemaker-extension. The workflow can be partially shared with other ML tutorials.
Setup functionality developed in task #27 needs to be moved to the https://github.com/exasol/notebook-connector repo and used in notebooks.
Backup path is defined in `ansible/general_setup_tasks.yml` as `"{{user_home}}/notebook-defaults"`.
Displaying how users can connect to the Jupyter server has been moved to a separate ticket.
Python library ansible_runner
Lately I also adjusted the folder for Ansible to copy the notebook files initially.
See issue #51
Describe the folder for notebook files to be in `data-science-sandbox/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/`.
Potentially we can make SQL queries more concise and easier to read.
After the notebook files have been moved to a different folder as requested by ticket #53, the current ticket requests to move the following files from `exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/` to sub-folder `slc`:
- slc_main_build_steps.svg
- bash_runner.py
- script-languages.ipynb
Please see ITDE `cli/options/push_options.py` for pushing images to a registry like hub.docker.com
- ITDE test utility `docker_registry.py`
- Option `force_push`

For integration tests please have a look at file `test_api_push_test_container.py` of the ITDE and note:
- `api.push_test_container`
- In calls to `set_docker_repository_config()` the value `None` will be set for fields `username` and `password` in `auth_config`.
- Password
Add a set of notebooks to the Data Science Sandbox to demo the Transformer extension.
Currently: "script-languages"
Expected: "dss"

Content of ansible task:
```yaml
password: "{{ lookup('ansible.builtin.env', 'JUPYTER_LAB_PASSWORD', default='script-languages') }}"
```
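The ansible lookup above behaves like a plain environment-variable read with a fallback; the following sketch mirrors it in Python, assuming the expected new default "dss" from this ticket:

```python
import os

def jupyter_password() -> str:
    """Python equivalent of the ansible env lookup, with the expected
    new default 'dss' (assumption from this ticket)."""
    return os.environ.get("JUPYTER_LAB_PASSWORD", "dss")

os.environ.pop("JUPYTER_LAB_PASSWORD", None)
print(jupyter_password())  # falls back to the default
```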
Currently `8888`, which is quite frequently used for various purposes.
Proposal: use a different port, probably > 20000.
Wikipedia lists some "ephemeral" ports, still containing potential conflicts.