tensorflow / cloud

The TensorFlow Cloud repository provides APIs that make it easy to go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.

Home Page: https://github.com/tensorflow/cloud

License: Apache License 2.0

Languages: Python 69.52%, Jupyter Notebook 25.27%, C++ 3.93%, Starlark 1.28%
Topics: tensorflow, cloud, gcp, keras

cloud's Introduction

TensorFlow Cloud

The TensorFlow Cloud repository provides APIs that make it easy to go from debugging, training, and tuning your Keras and TensorFlow code in a local environment to distributed training and tuning on Google Cloud.

Introduction

TensorFlow Cloud run API for GCP training/tuning

Installation

Requirements

For detailed end-to-end setup instructions, please see the Setup instructions section.

Install latest release

pip install -U tensorflow-cloud

Install from source

git clone https://github.com/tensorflow/cloud.git
cd cloud
pip install src/python/.

High level overview

The TensorFlow Cloud package provides the run API for training your models on GCP. To start, let's walk through a simple workflow using this API.

  1. Let's begin with Keras model training code such as the following, saved as mnist_example.py.

    import tensorflow as tf
    
    (x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
    
    x_train = x_train.reshape((60000, 28 * 28))
    x_train = x_train.astype('float32') / 255
    
    model = tf.keras.Sequential([
      tf.keras.layers.Dense(512, activation='relu', input_shape=(28 * 28,)),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(10, activation='softmax')
    ])
    
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])
    
    model.fit(x_train, y_train, epochs=10, batch_size=128)
  2. After you have tested this model in your local environment for a few epochs, probably with a small dataset, you can train the model on Google Cloud by writing the following simple script, scale_mnist.py.

    import tensorflow_cloud as tfc
    tfc.run(entry_point='mnist_example.py')

    Running scale_mnist.py will automatically apply TensorFlow's OneDeviceStrategy and train your model at scale on Google Cloud Platform. Please see the usage guide section for detailed instructions and additional API parameters.

  3. You will see an output similar to the following on your console. This information can be used to track the training job status.

    user@desktop$ python scale_mnist.py
    Job submitted successfully.
    Your job ID is:  tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e
    Please access your job logs at the following URL:
    https://console.cloud.google.com/mlengine/jobs/tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e?project=prod-123

Setup instructions

End-to-end instructions to help set up your environment for TensorFlow Cloud. You can use one of the following notebooks to set up your project, or follow the instructions below.

Run in Colab | View on GitHub | Run in Kaggle
  1. Create a new local directory

    mkdir tensorflow_cloud
    cd tensorflow_cloud
  2. Make sure you have Python >= 3.6

    python -V
  3. Set up virtual environment

    virtualenv tfcloud --python=python3
    source tfcloud/bin/activate
  4. Set up your Google Cloud project

    Verify that the gcloud SDK is installed.

    which gcloud

    Set default gcloud project

    export PROJECT_ID=<your-project-id>
    gcloud config set project $PROJECT_ID
  5. Authenticate your GCP account

    Create a service account.

    export SA_NAME=<your-sa-name>
    gcloud iam service-accounts create $SA_NAME
    gcloud projects add-iam-policy-binding $PROJECT_ID \
        --member serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
        --role 'roles/editor'

    Create a key for your service account.

    gcloud iam service-accounts keys create ~/key.json --iam-account $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com

    Create the GOOGLE_APPLICATION_CREDENTIALS environment variable.

    export GOOGLE_APPLICATION_CREDENTIALS=~/key.json
  6. Create a Cloud Storage bucket. Using Google Cloud Build is the recommended method for building and publishing docker images, although we optionally allow for a local docker daemon process depending on your specific needs.

    BUCKET_NAME="your-bucket-name"
    REGION="us-central1"
    gcloud auth login
    gsutil mb -l $REGION gs://$BUCKET_NAME

    (optional, for a local docker setup)

    sudo dockerd

  7. Authenticate access to Google Container Registry.

    gcloud auth configure-docker
  8. Install nbconvert if you plan to use a notebook file entry_point as shown in usage guide #4.

    pip install nbconvert
  9. Install latest release of tensorflow-cloud

    pip install tensorflow-cloud

Usage guide

As described in the high level overview, the run API allows you to train your models at scale on GCP. The run API can be used in four different ways, defined by where you are running the API (terminal vs. IPython notebook) and by your entry_point parameter. entry_point is an optional Python script or notebook file path to the file that contains your TensorFlow Keras training code. This is the most important parameter in the API.

run(entry_point=None,
    requirements_txt=None,
    distribution_strategy='auto',
    docker_config='auto',
    chief_config='auto',
    worker_config='auto',
    worker_count=0,
    entry_point_args=None,
    stream_logs=False,
    job_labels=None,
    **kwargs)
  1. Using a python file as entry_point.

    If you have your tf.keras model in a python file (mnist_example.py), then you can write the following simple script (scale_mnist.py) to scale your model on GCP.

    import tensorflow_cloud as tfc
    tfc.run(entry_point='mnist_example.py')

    Please note that all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file. It's recommended to create a new directory to house each cloud project which includes necessary files and nothing else, to optimize image build times.

  2. Using a notebook file as entry_point.

    If you have your tf.keras model in a notebook file (mnist_example.ipynb), then you can write the following simple script (scale_mnist.py) to scale your model on GCP.

    import tensorflow_cloud as tfc
    tfc.run(entry_point='mnist_example.ipynb')

    Please note that all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file. As with the python script entry_point above, we recommend creating a new directory to house each cloud project, which includes the necessary files and nothing else, to optimize image build times.

  3. Using run within a python script that contains the tf.keras model.

    You can use the run API from within the python file that contains your tf.keras model (mnist_scale.py). In this use case, entry_point should be None. The run API can be called anywhere, and the entire file will be executed remotely. Calling the API at the end lets you first run the script locally for debugging purposes (possibly with fewer epochs and other flags).

    import tensorflow_datasets as tfds
    import tensorflow as tf
    import tensorflow_cloud as tfc
    
    tfc.run(
        entry_point=None,
        distribution_strategy='auto',
        requirements_txt='requirements.txt',
        chief_config=tfc.MachineConfig(
                cpu_cores=8,
                memory=30,
                accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
                accelerator_count=2),
        worker_count=0)
    
    datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
    mnist_train, mnist_test = datasets['train'], datasets['test']
    
    num_train_examples = info.splits['train'].num_examples
    num_test_examples = info.splits['test'].num_examples
    
    BUFFER_SIZE = 10000
    BATCH_SIZE = 64
    
    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label
    
    train_dataset = mnist_train.map(scale).cache()
    train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
    
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(
            28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])
    model.fit(train_dataset, epochs=12)

    Please note that all the files in the same directory tree as the python script will be packaged in the docker image created, along with the python file. It's recommended to create a new directory to house each cloud project which includes necessary files and nothing else, to optimize image build times.

  4. Using run within a notebook script that contains the tf.keras model.


    In this use case, entry_point should be None and docker_config.image_build_bucket must be specified, to ensure the build can be stored and published.
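    For example, a minimal sketch of such a notebook cell might look like the following. This assumes a tfc.DockerConfig object with an image_build_bucket field (implied by the docker_config parameter in the run signature above); the bucket name is a placeholder.

    import tensorflow_cloud as tfc

    # Sketch for the notebook use case: entry_point stays None so the entire
    # notebook is executed remotely; the bucket name below is a placeholder.
    tfc.run(
        entry_point=None,
        docker_config=tfc.DockerConfig(image_build_bucket='my-staging-bucket'))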

    Cluster and distribution strategy configuration

    By default, the run API takes care of wrapping your model code in a TensorFlow distribution strategy based on the cluster configuration you have provided.

    No distribution

    CPU chief config and no additional workers

    tfc.run(entry_point='mnist_example.py',
            chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'])

    OneDeviceStrategy

    1 GPU on chief (defaults to AcceleratorType.NVIDIA_TESLA_T4) and no additional workers.

    tfc.run(entry_point='mnist_example.py')

    MirroredStrategy

    Chief config with multiple GPUs (AcceleratorType.NVIDIA_TESLA_V100).

    tfc.run(entry_point='mnist_example.py',
            chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_4X'])

    MultiWorkerMirroredStrategy

    Chief config with 1 GPU and 2 workers each with 8 GPUs (AcceleratorType.NVIDIA_TESLA_V100).

    tfc.run(entry_point='mnist_example.py',
            chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_1X'],
            worker_count=2,
            worker_config=tfc.COMMON_MACHINE_CONFIGS['V100_8X'])

    TPUStrategy

    Chief config with 1 CPU and 1 worker with TPU.

    tfc.run(entry_point="mnist_example.py",
            chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"],
            worker_count=1,
            worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"])

    Please note that TPUStrategy with TensorFlow Cloud works only with TF version 2.1, as this is the latest version supported by AI Platform for cloud TPUs.

    Custom distribution strategy

    If you would like to take care of specifying a distribution strategy in your model code and do not want the run API to create one, then set distribution_strategy to None. This will be required, for example, when you are using strategy.experimental_distribute_dataset (a model-code sketch follows the snippet below).

    tfc.run(entry_point='mnist_example.py',
            distribution_strategy=None,
            worker_count=2)
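    For reference, a minimal sketch of the corresponding model code might look like this; the model and dataset below are placeholders. The strategy is created inside your script, which is exactly why run must not wrap the code in a second one.

    import tensorflow as tf

    # Sketch: with distribution_strategy=None, the strategy is created here,
    # in the training script itself, rather than by the run API.
    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(10, activation='softmax',
                                  input_shape=(28 * 28,))])
        model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

    # Manually distributing a dataset (for a custom training loop) is exactly
    # the case where the API must not create a strategy of its own.
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.zeros([64, 28 * 28]), tf.zeros([64], dtype=tf.int64))).batch(8)
    dist_dataset = strategy.experimental_distribute_dataset(dataset)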

What happens when you call run?

The API call will encompass the following:

  1. Making code entities such as a Keras script/notebook cloud- and distribution-ready.
  2. Converting this distribution entity into a docker container with the required dependencies.
  3. Deploying this container at scale and training using TensorFlow distribution strategies.
  4. Streaming logs and monitoring them on hosted TensorBoard, and managing checkpoint storage.

By default, we will use the local docker daemon for building and publishing docker images to Google Container Registry. Images are published to gcr.io/your-gcp-project-id. If you specify docker_config.image_build_bucket, then we will use Google Cloud Build to build and publish docker images.

We use Google AI Platform for deploying docker images on GCP.

Please note that, when the entry_point argument is specified, all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file.

Please see run API documentation for detailed information on the parameters and how you can modify the above processes to suit your needs.

End to end examples

cd src/python/tensorflow_cloud/core
python tests/examples/call_run_on_script_with_keras_fit.py

Running unit tests

pytest src/python/tensorflow_cloud/core/tests/unit/

Local vs remote training

Things to keep in mind when running your jobs remotely:

[Coming soon]

Debugging workflow

Here are some tips for fixing unexpected issues.

Operation disallowed within distribution strategy scope

Error like: Creating a generator within a strategy scope is disallowed, because there is ambiguity on how to replicate a generator (e.g. should it be copied so that each replica gets the same random numbers, or 'split' so that each replica gets different random numbers).

Solution: Passing distribution_strategy='auto' to the run API wraps all of your script in a TF distribution strategy based on the cluster configuration provided. You will see the above error, or something similar to it, if for some reason an operation is not allowed inside distribution strategy scope. To fix the error, pass None to the distribution_strategy param and create a strategy instance as part of your training code, as shown in the Custom distribution strategy example above.

Docker image build timeout

Error like: requests.exceptions.ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))

Solution: The directory being used as an entry point likely has too much data for the image to successfully build, and there may be extraneous data included in the build. Reformat your directory structure such that the folder which contains the entry point only includes files necessary for the current project.

Version not supported for TPU training

Error like: There was an error submitting the job.Field: tpu_tf_version Error: The specified runtime version '2.3' is not supported for TPU training. Please specify a different runtime version.

Solution: Please use TF version 2.1. See TPU Strategy in Cluster and distribution strategy configuration section.

TF nightly build.

Warning like: Docker parent image '2.4.0.dev20200720' does not exist. Using the latest TF nightly build.

Solution: If you do not provide the docker_config.parent_image param, then by default we use pre-built TF docker images as the parent image. If you do not have TF installed in the environment where run is called, then the TF docker image for the latest stable release will be used. Otherwise, the version of the docker image will match the locally installed TF version. However, pre-built TF docker images aren't available for TF nightlies except for the latest one. So, if your local TF is an older nightly version, we upgrade to the latest nightly automatically and raise this warning.

Mixing distribution strategy objects.

Error like: RuntimeError: Mixing different tf.distribute.Strategy objects.

Solution: Please provide distribution_strategy=None when you already have a distribution strategy defined in your model code. Specifying distribution_strategy='auto' will wrap your code in a TensorFlow distribution strategy, which causes the above error if a strategy object is already used in your code.

Coming up

  • Distributed Keras tuner support.

Contributing

We welcome community contributions; see CONTRIBUTING.md and, for style help, the Writing TensorFlow documentation guide.

License

Apache License 2.0

Privacy Notice

This application reports technical and operational details of your usage of Cloud Services in accordance with the Google privacy policy; for more information, please refer to https://policies.google.com/privacy. If you wish to opt out, you may do so by running tensorflow_cloud.utils.google_api_client.optout_metrics_reporting().

cloud's People

Contributors

adammichaelwood, bhack, chongyouquan, christianversloot, fyangf, g-luo, gogasca, haifeng-jin, jonah-kohn, juanuribe28, lamberta, lgeiger, lukewood, markdaoust, pavithrasv, rchen152, rosbo, samuelmarks, sinachavoshi, ucdmkt, xingyousong, yashk2810, yilei, yinghsienwu, yixingfu


cloud's Issues

Price callback

It would be nice to have a price callback to monitor job costs.

Distributed training for Keras-Tuner

After some experimenting, using keras-tuner with chief-worker distribution can only be done by setting KERASTUNER_ORACLE_IP to 0.0.0.0 on the chief, but to the actual chief IP (obtained from TF_CONFIG) on the worker replicas.
Distributed training is certainly useful when we support keras-tuner. Should we implement this special case here, or just create an example showing how to distribute keras-tuner clearly?

Alternatively, this could be solved from the keras-tuner side, but from what I see this is only a problem when putting KT on AI Platform using tensorflow-cloud; it is not an issue for KT distribution in general.
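A rough sketch of the workaround described above, assuming the standard TF_CONFIG layout on AI Platform (the 'chief' cluster key is an assumption; none of this is part of tensorflow-cloud):

import json
import os

# Sketch: bind the keras-tuner oracle to 0.0.0.0 on the chief, and point
# workers at the chief's address taken from TF_CONFIG.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task_type = tf_config.get('task', {}).get('type', 'chief')

if task_type == 'chief':
    os.environ['KERASTUNER_ORACLE_IP'] = '0.0.0.0'
else:
    chief_addr = tf_config['cluster']['chief'][0]  # 'host:port'
    os.environ['KERASTUNER_ORACLE_IP'] = chief_addr.split(':')[0]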

End to end example with prediction via REST API

Requesting a feature:

Could you develop the library further for deployment management, or is this already on the roadmap? In any case, could an example workflow be added for deployment, with REST API support via gcloud SDK management, as a best practice?

Add license check github action

This is a tracking task to add a license checker to GitHub Actions CI to ensure files without a license are not checked in accidentally. This should cover .cpp and .py files as well as notebook files. There are a few GH actions that may be a good match here.

Allow passing in Google Cloud Credentials (instead of application default credentials)

Kaggle Notebooks supports integrations with several Google Cloud services by authenticating the user via OAuth and providing a credentials object (for example https://www.kaggle.com/product-feedback/163416).

We'd like to provide support for tensorflow_cloud on Kaggle using user's existing auth via this mechanism. One way to do that would be allowing the library to have a credential object passed in on initialization (like other Google Cloud client libraries).

Another approach, which I've sort of hacked together below [1], is to take the credential object and write it to the filesystem as the GOOGLE_APPLICATION_CREDENTIALS file. This actually works and is implemented by the google_auth library, but it always returns a None project id (https://github.com/googleapis/google-auth-library-python/blob/9058f1fea680613d9717a62ee37dc294c11b9c8a/google/auth/_default.py#L126), so tensorflow_cloud throws an error at its if project_id is None: check. Would it be possible to support this approach, but have another mechanism to pass in the project id, in a constructor for example?

Is there any constraint that disallows using user credentials (and requires service accounts to be used)?

[1]

import json
import os
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
user_credential = user_secrets.get_gcloud_credential()
with open('/tmp/key.json', 'w', encoding='utf-8') as f:
    f.write(str(user_credential))
os.environ['GOOGLE_APPLICATION_CREDENTIALS']="/tmp/key.json"
!gcloud config set project vimota-project2

performance improvement on integration tests.

This is to decrease the cost and time of running integration tests. We can lower the number of epochs/steps to 1-2, and additionally decrease the dataset size used for integration tests.

tfc.run() should return job_id

Either tfc.run() should return the job_id, or there should be a method that allows retrieving the job id.

A follow-up ask (nice to have) would be to also return the job status (running, failed, succeeded) as an Enum, for automation; a hypothetical sketch follows.
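A hypothetical sketch of what the requested API could look like (neither the return value nor JobStatus exists in the current library):

import tensorflow_cloud as tfc

# Hypothetical: run() returns a job handle instead of None.
job = tfc.run(entry_point='mnist_example.py')
print(job.job_id)  # e.g. 'tf_cloud_train_519ec89c_...'
print(job.status)  # e.g. JobStatus.RUNNING / SUCCEEDED / FAILED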

containerize_test.py - Ambiguous import: "call" from package "mock"

In src/python/tensorflow_cloud/core/tests/unit/containerize_test.py

The import statement should be changed to import the module, not the attribute.

line 24: from mock import call, patch: Ambiguous import: "call" from package "mock": cannot determine whether "call" is an attribute on the package, or a module 
line 24: from mock import call, patch: Ambiguous import: "patch" from package "mock": cannot determine whether "patch" is an attribute on the package, or a module not provided by a direct dependency.
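A sketch of the suggested change, importing the module rather than its attributes:

# Instead of: from mock import call, patch
import mock
import os

# Attributes are then referenced through the module:
with mock.patch('os.remove') as mocked_remove:
    os.remove('/tmp/example')
    mocked_remove.assert_has_calls([mock.call('/tmp/example')])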

Improving cloud build timeout

Feedback from Yixing Fu:

Runtime Error: There was an error executing the cloud build job. Job status: WORKING.

This error description is not very helpful. Also, from the GCP console the build is actually successful. This error did not come up the second time I tried it. I am guessing it is caused by creating a docker image that took a long time and triggered some timeout?
Update: Got this again today. It seems builds taking > 5 minutes produce this problem, while builds under 3 minutes are fine.
Update: Now that I have looked a bit into the code, this error is clearly raised by the 10-trial, 30-seconds-each limit. Maybe there should be a check to see whether the job status is still "WORKING", raise a warning that the build time is long, and only kill it after a much longer time or when the job status is something else?
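A sketch of the suggested polling behavior (the names are illustrative, not the actual tensorflow-cloud internals):

import time

def wait_for_cloud_build(get_status, poll_interval=30, deadline=1800):
    """Polls a build, warning on long WORKING phases instead of failing early."""
    waited = 0
    while waited < deadline:
        status = get_status()
        if status not in ('QUEUED', 'WORKING'):
            return status  # e.g. SUCCESS, FAILURE, TIMEOUT, CANCELLED
        print('Build still in progress after %ds; large images can take a '
              'while.' % waited)
        time.sleep(poll_interval)
        waited += poll_interval
    raise RuntimeError('Cloud build did not finish within %ds' % deadline)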

containerize.py - Ambiguous import: APIClient

In /src/python/tensorflow_cloud/core/containerize.py, no direct deps found for imports:

line 30: from docker import APIClient: Ambiguous import: "APIClient" from package "docker": cannot determine whether "APIClient" is an attribute on the package, or a module.

cloud build fail when tf-nightly installed locally

In an environment with tf-nightly installed, trying to run tfc with cloud build fails.

The failure is caused by the dockerfile using the local TF version to determine the base image; in the case of tf-nightly, this gives something like 2.x.0-dev2020mmdd.

It would be good to check whether the corresponding image exists before starting the build, and to throw a relevant error message. Also, for the case of nightly builds, can we just use nightly for all versions labeled with 'dev'? A sketch of this handling follows.
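A sketch of the suggested handling (this mapping is an assumption, not current behavior):

# Sketch: map any locally installed dev/nightly version to the generic
# 'nightly' tag, since per-date nightly images are not published.
def parent_image_tag(local_tf_version):
    if 'dev' in local_tf_version:  # e.g. '2.4.0-dev20200720'
        return 'tensorflow/tensorflow:nightly-gpu'
    return 'tensorflow/tensorflow:%s-gpu' % local_tf_version

print(parent_image_tag('2.4.0-dev20200720'))  # tensorflow/tensorflow:nightly-gpu
print(parent_image_tag('2.3.0'))              # tensorflow/tensorflow:2.3.0-gpu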

Re-use images (only changing parameters)?

My current workflow is to use the gcloud CLI to submit jobs with a common base image, varying only the parameters (used in a CI setting). I like that in tensorflow cloud the machine types and accelerator types are expressible as code, but I don't want to rebuild and re-upload the image each time I train.

Is there any way to re-use the same image (but just supply a different 'trainer' file)? What I'd like to do is push the trainer file + hyperparameters (as YAML or JSON) and keep the image the same across runs (except when my dependencies change).

I see that you can specify a custom base image, but AFAICT a new image would still be built/pushed.

Setting stream_logs=False, does not seem to be working

sample

import tensorflow_cloud as tfc

# Automated MirroredStrategy: chief config with multiple GPUs
tfc.run(
    entry_point="../../tests/testdata/mnist_example_using_fit_no_reqs.py",
    distribution_strategy="auto",
    chief_config=tfc.MachineConfig(
        cpu_cores=8,
        memory=30,
        accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_P100,
        accelerator_count=2,
    ),
    worker_count=0,
    stream_logs=False,
    docker_image_bucket_name="<some_bucket>",
)

output

Job submitted successfully.
Your job ID is:  tf_cloud_train_..._67a20a7a68d1
Please access your job logs at the following URL:
https://console.cloud.google.com/mlengine/jobs/tf_cloud_train_..._67a20a7a68d1?project=...
Streaming job logs: 

Expected: job logs should not stream.

How to load a dataset from google cloud storage with tensorflow cloud?

Tensorflow cloud configuration:

  GCP_BUCKET = "stereo-train"

  tfc.run(
    requirements_txt="requirements.txt",
    chief_config=tfc.MachineConfig(
      cpu_cores=8,
      memory=30,
      accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
      accelerator_count=1,
    ),
    docker_image_bucket_name=GCP_BUCKET,
  )

And I have a bucket called gs://stereo-train that contains the dataset. The exact location of the dataset is:

gs://stereo-train/data_scene_flow/training/dat

However, when using this location like so:

tf.keras.preprocessing.image_dataset_from_directory("gs://stereo-train/data_scene_flow/training/dat", image_size=(375,1242),\
                                                         batch_size=6, shuffle=False, label_mode=None)

Behavior:

getting the error that "gs://stereo-train/data_scene_flow/training/dat" doesn't exist

Expected behavior:
tf.keras.preprocessing.image_dataset_from_directory should know that there's a GCS bucket associated with the account, and the dataset should be loaded.

validate.py import error 'tensorflow.python.framework.versions'

validate.py",

line 27, in : Can't find module 'tensorflow.python.framework.versions'. [import-error]

recommended method is to check the loaded module versions

import pkg_resources

version = "2.2.0"
for pkg in pkg_resources.working_set:
    if "tensorflow" == pkg.key:
        version = pkg.version

A test case fails when testing in an environment with TF 2.2

Test case test_request_dict_with_TPU_worker fails when TF 2.2 (or a later TF version) is installed. This is caused by tpuTfVersion in the request body being set to TF 2.2, while the test case requires it to be TF 2.1. Should we raise an error when tpuTfVersion is set to TF > 2.1, and change this test case?

Bucket name ending with '/' may error out.

From: @yixingfu

A path ending with '/' causes a problem. For example, setting the path as 'gs://[BUCKET_NAME]/saves' works fine, but 'gs://[BUCKET_NAME]/saves/' breaks down when trying to reload remotely. For example, when running "call_run_on_script_with_keras_save_and_load.py" in the integration test, the log stops updating at a certain point, but the job keeps running; I manually killed the job after a while. I think the extra '/' prevents it from correctly loading. Maybe an error should be raised? Or any trailing '/' removed when parsing the path?
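A one-line mitigation sketch when parsing the path:

# Sketch: strip any trailing '/' from the GCS path before use.
save_path = 'gs://[BUCKET_NAME]/saves/'.rstrip('/')  # -> 'gs://[BUCKET_NAME]/saves'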

Add magic phrase support to rerun tests

Currently, to re-run tests, you need to submit an empty commit:

git commit -m "retest" --allow-empty
git push <branch-name>

This is a request for a nice-to-have feature to enable re-running tests with a magic phrase such as "/retest" or "/rerun tests".

Authentication error when running the run command on Colab

When running the Colab example, I am able to run the code and successfully save the model inside the created bucket, but the run command keeps trying multiple GETs, returning:

"Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project."

The run command I am using is

f = open('requirements.txt', 'w')
f.write('tensorflow-datasets\n')
f.write('pandas')
f.close()

tfc.run(
    entry_point=None,
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    chief_config=tfc.MachineConfig(
        cpu_cores=8,
        memory=30,
        accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_P100,
        accelerator_count=2,
    ),
    docker_image_bucket_name=BUCKET_NAME,
    docker_base_image='tensorflow/tensorflow:2.2.0-gpu',
    worker_count=0
)

I had to add docker_base_image='tensorflow/tensorflow:2.2.0-gpu' since I was getting a message saying that there is no docker image for TF version 2.2.0.

log streaming does not terminate if a job is terminated while streaming logs

I started a job with log streaming enabled. While the job was running, I manually terminated it in AI Platform Training by clicking stop job in the UI. The log streaming kept running for more than 5 minutes after that, until I terminated it manually, stuck at the following:

  File ".../cloud/tensorflow_cloud/deploy.py", line 211, in _stream_logs
    output = process.stdout.readline()

I expected that when the job is terminated on AI Platform Training, the log streaming would also terminate/stop.
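A sketch of the expected behavior (illustrative names; the real loop lives in deploy.py's _stream_logs):

# Sketch: stop streaming once the job reaches a terminal state instead of
# blocking forever on process.stdout.readline(). A real implementation would
# also need a non-blocking read or a timeout.
TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'CANCELLED'}

def stream_logs(process, get_job_state):
    while get_job_state() not in TERMINAL_STATES:
        line = process.stdout.readline()
        if line:
            print(line.decode().rstrip())
    process.terminate()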

Make the file path in integration tests relative

Currently the integration tests assume they are run from the repository root; this change is to make their paths relative, so that they can be run regardless of where they are executed.

current format

tfc.run(
    entry_point="tests/testdata/mnist_example_using_ctl.py", ...

suggested path look up

# Path to the source code in test environment
TEST_DATA_PATH = os.path.join(
  os.path.dirname(os.path.abspath(__file__)), '../testdata/')

...
tfc.run(
    entry_point=os.path.join(TEST_DATA_PATH, 'mnist_example_using_fit_no_reqs.py'), ...

setup.py execution depends on current working directory

This is a small issue that impacts users if they build the package from any folder other than python. For example, if a user runs the build command as

python python/setup.py bdist_wheel

The build will complete with an empty package; the source will not include any of the modules.
To mitigate this issue, setup.py should set the current working directory to the setup.py location:

import os
# Note: the original snippet was missing the closing parenthesis here.
setup_file_directory = os.path.dirname(os.path.abspath(__file__))
os.chdir(setup_file_directory)

Allow caller to specify a custom job_id

This is particularly useful for automation and rapid iteration. It allows easy tracking of the execution result, without needing to keep track of generated IDs and map them to external keys. A hypothetical sketch follows.
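A hypothetical sketch of the requested parameter (job_id is not currently accepted by run):

import tensorflow_cloud as tfc

# Hypothetical: the caller supplies a stable, externally meaningful job id.
tfc.run(entry_point='mnist_example.py',
        job_id='ci_pipeline_run_1234')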

deploy_test.py - Ambiguous import: "call" from package "mock"

tensorflow_cloud/core/tests/unit/deploy_test.py

line 32: from mock import call, patch: Ambiguous import: "call" from package "mock": cannot determine whether "call" is an attribute on the package, or a module.
line 32: from mock import call, patch: Ambiguous import: "patch" from package "mock": cannot determine whether "patch" is an attribute on the package, or a module.

Instructions missing

Looks like you need to define:

  1. GCP Project
  2. GCP authentication
  3. Google Cloud SDK
  4. Docker client

I can submit a PR with the requirements.

Docker error while running setup

Hi, I have followed the "High level overview" steps in the readme to train models on GCP. I created both mnist_example.py and scale_mnist.py and followed all the setup instructions. However, when I run python scale_mnist.py, I encounter the following error:

docker.errors.DockerException: Error while fetching server API version: (2, 'CreateFile', 'El sistema no puede encontrar el archivo especificado.'), which translated into English says that it cannot find the specified file. I am using a Cloud Storage bucket in europe-west1.

My configuration:

  • Windows 10
  • Python 3.7.7
  • Tensorflow 2.3.0
  • Tensorflow Cloud 0.1.4

I have also tried on macOS Catalina with the same result.

Let me know if you need any further information. Thank you!

Bug tfc uses python2 in the cloud container

When I have TF version 2.1.0 installed locally with a python3 interpreter and do not provide the docker_base_image arg to tfc, tfc uses python2.7 in the cloud container:
File "/usr/local/lib/python2.7/dist-packages/kerastuner/engine/hypermodel.py
(log link)

Instruct users to keep images small by using dedicated project directories.

Calling run() on a directory which includes a virtual environment or other large files will either time out or take hours to build. Any virtual environment used for Cloud will contain tensorflow, so we should by default recommend that users create a dedicated subdirectory for each project, containing only the essentials.

Error while creating container

Environment:

 Mac, 
 Python 3.7, 
 gcloud 
     Google Cloud SDK 300.0.0
     bq 2.0.58
     core 2020.07.06
     gsutil 4.51
     kubectl 2020.05.01

Steps:

  • create the sample mnist.py, run it locally - OK
  • create a script to submit it to GCP
     import tensorflow_cloud as tfc
     tfc.run(entry_point='mnist_train.py')
    
  • run the script

Expected:

The script builds a container and runs the job on GCP.

Observed:

An exception while building container:

(tfcloud) :tf_cloud$ python run_on_gcp.py 
INFO:tensorflow_cloud.containerize:Building docker image: gcr.io/my-project/tf_cloud_train:c74be2d8_b6cc_4832_9522_b1c69ded7f95
INFO:tensorflow_cloud.containerize:{"stream":"Step 1/4 : FROM tensorflow/tensorflow:2.2.0-gpu"} {"stream":"\n"}
Traceback (most recent call last):
  File "/Users/dournov/tf_cloud/tfcloud/lib/python3.7/site-packages/tensorflow_cloud/containerize.py", line 287, in _get_logs
    line = json.loads(unicode_line)
  File "/Users/dournov/miniconda3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/Users/dournov/miniconda3/lib/python3.7/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 62)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_on_gcp.py", line 2, in <module>
    tfc.run(entry_point='mnist_train.py')
  File "/Users/dournov/tf_cloud/tfcloud/lib/python3.7/site-packages/tensorflow_cloud/run.py", line 188, in run
    docker_img_uri = container_builder.get_docker_image()
  File "/Users/dournov/tf_cloud/tfcloud/lib/python3.7/site-packages/tensorflow_cloud/containerize.py", line 235, in get_docker_image
    image_uri = self._build_docker_image()
  File "/Users/dournov/tf_cloud/tfcloud/lib/python3.7/site-packages/tensorflow_cloud/containerize.py", line 255, in _build_docker_image
    self._get_logs(bld_logs_generator, 'build')
  File "/Users/dournov/tf_cloud/tfcloud/lib/python3.7/site-packages/tensorflow_cloud/containerize.py", line 294, in _get_logs
    'There was an error decoding the Docker logs')
RuntimeError: There was an error decoding the Docker logs

Permission Error in Windows

TensorFlow Cloud version: 0.1.5
Python version : 3.7.6
OS: Windows 10 64-bit

When running from a notebook:

import tensorflow_cloud as tfc
tfc.run(entry_point='mnist.py',
        docker_image_bucket_name="a-test",
)

mnist.py file is as in example :

import tensorflow as tf

(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()

x_train = x_train.reshape((60000, 28 * 28))
x_train = x_train.astype('float32') / 255

model = tf.keras.Sequential([
  tf.keras.layers.Dense(512, activation='relu', input_shape=(28 * 28,)),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

model.fit(x_train,  y_train, epochs=10, batch_size=128)

Result:

INFO:tensorflow_cloud.core.containerize:Uploading files to GCS.

INFO:tensorflow_cloud.core.containerize:Building and publishing docker image using Google Cloud Build: gcr.io/dydra- 282320/tf_cloud_train:371ef57f_1db8_4c96_aae6_09e373926732


PermissionError                           Traceback (most recent call last)
<ipython-input> in <module>
      1 import tensorflow_cloud as tfc
      2 tfc.run(entry_point='mnist.py',
----> 3         docker_image_bucket_name="a-test",
      4 )

~\anaconda3\lib\site-packages\tensorflow_cloud\core\run.py in run(entry_point, requirements_txt, distribution_strategy, docker_base_image, chief_config, worker_config, worker_count, entry_point_args, stream_logs, docker_image_bucket_name, job_labels, **kwargs)
    224     # Delete all the temporary files we created.
    225     if preprocessed_entry_point is not None:
--> 226         os.remove(preprocessed_entry_point)
    227     for f in container_builder.get_generated_files():
    228         os.remove(f)

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\semih\AppData\Local\Temp\tmpkjul06z0.py'

Exclude tests from setup.py

Test files are currently included in the pypi package; this is to update the setup.py file to exclude tests.

find_packages("core", exclude=["tests"])

Alternatively we can move tests folder up to /python.

Bug while pulling 2.3.0rc0 docker image

If the user has tf 2.3.0rc0 installed locally, TF Cloud will try to pull the docker image tensorflow/tensorflow:2.3.0-rc0-gpu, which doesn't exist. The correct tag is tensorflow/tensorflow:2.3.0rc0-gpu, without the - between 2.3.0 and rc0.
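A sketch of the tag normalization that would fix this (where exactly it would live in the code is an assumption):

# Sketch: remove the '-' before the release-candidate suffix when deriving
# the docker tag from the local TF version string.
local_version = '2.3.0-rc0'                   # what version parsing produces
tag = local_version.replace('-rc', 'rc')      # '2.3.0rc0'
image = 'tensorflow/tensorflow:%s-gpu' % tag  # tensorflow/tensorflow:2.3.0rc0-gpu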

validate_test.py - Ambiguous import: "patch" from package "mock"

src/python/tensorflow_cloud/core/tests/unit/validate_test.py, no direct deps found for imports:
line 20: from mock import patch: Ambiguous import: "patch" from package "mock": cannot determine whether "patch" is an attribute on the package, or a module not provided by a direct dependency.
