The training-data-analyst from googlecloudplatform

Updated the CLA with correct details but still the checks are failing!

Originally posted by @lokeshsoni in #330

Serverless Machine Learning - Lab 7 : Feature Engineering v1.3: apache-beam not installed because of dill

I have some issue with the first cell of the following notebook:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/feateng/feateng.ipynb

%%bash
conda update -y -n base -c defaults conda
source activate py2env
pip uninstall -y google-cloud-dataflow
conda install -y pytz 
pip install apache-beam[gcp]

It seems the issue is the installation of apache-beam:

...
Requirement already satisfied: cachetools>=2.0.0 in /usr/local/envs/py2env/lib/python2.7/site-packages (from google-auth<2.0dev,>=0.4.0->google-api-core[grpc]<2.0.0dev,>=1.4.1->google-cloud-pubsub==0.39.0; extra == "gcp"->apache-beam[gcp]) (2.1.0)
Installing collected packages: dill, pyarrow, typing, pyvcf, fastavro, httplib2, docopt, hdfs, grpc-google-iam-v1, google-api-core, google-cloud-pubsub, monotonic, fasteners, google-apitools, google-cloud-bigquery, apache-beam
Found existing installation: dill 0.2.6
Skipping google-cloud-dataflow as it is not installed.
google-cloud-monitoring 0.28.0 has requirement google-api-core<0.2.0dev,>=0.1.1, but you'll have google-api-core 1.7.0 which is incompatible.
googledatastore 7.0.1 has requirement httplib2<0.10,>=0.9.1, but you'll have httplib2 0.11.3 which is incompatible.
Cannot uninstall 'dill'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

then if I check, I don't see the package installed:
conda list

the right env is activated:

py2env * /usr/local/envs/py2env

so when trying to in port the package it doesn't work (restarting the kernel doesn't help since the package was not installed):

ImportErrorTraceback (most recent call last)
<ipython-input-4-830e0319c5fc> in <module>()
      1 import tensorflow as tf
----> 2 import apache_beam as beam
      3 import shutil
      4 print(tf.__version__)


ImportError: No module named apache_beam

It seems that

!conda uninstall dill=0.2.6 -y

can drop dill and then the installation of apache-beam is working. My 1:30 min session is over. I will start again and see if this was a temporary glitch and if the solution above is working.

During the stop of tensorboard, the cell will stack trace if no running tensorboard instances

Here is a fix

Stop tensorboard instances

pids_df = TensorBoard.list()
if not pids_df.empty:
  for pid in pids_df['pid']:
    TensorBoard().stop(pid)
    print 'Stopped TensorBoard with pid {}'.format(pid)

No module named sklearn_crfsuite.estimator

I using sklearn_crfsuite estimator

crf = sklearn_crfsuite.CRF(
algorithm='lbfgs',
c1=0.1,
c2=0.1,
max_iterations=2,
all_possible_transitions=True
)

I'm saving the model as described below:

model = 'model.joblib'
joblib.dump(crf, model)

and when I try to deploy the model it reports this error:

ERROR: (gcloud.alpha.ml-engine.versions.create) Bad model detected with error: "Failed to load model: Could not load the model: /tmp/model/0001/model.joblib. No module named sklearn_crfsuite.estimator. (Error code: 0)"

deploy model:
gcloud alpha ml-engine versions create v1 --model teste --origin $ORI --python-version 2.7 --runtime-version 1.8 --framework scikit-learn

stacktrace during execution of deepdive/04_features/taxifare/feateng.ipynb

I'm not sure if this is related to new changes released. This code used to work.

python 2.
All cells cleared and restarted. Everything runs until this cell:

preprocess(50*100, 'DataflowRunner') 
#change first arg to None to preprocess full dataset

Result is this stack trace:

Launching Dataflow job preprocess-taxifeatures-180901-165026 ... hang on
/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)

CalledProcessErrorTraceback (most recent call last)
<ipython-input-10-b4775e416971> in <module>()
----> 1 preprocess(50*100, 'DataflowRunner')
      2 #change first arg to None to preprocess full dataset

<ipython-input-8-8419c1762ff8> in preprocess(EVERY_N, RUNNER)
     53     )
     54 
---> 55   p.run()

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/pipeline.pyc in run(self, test_runner_api)
    174       finally:
    175         shutil.rmtree(tmpdir)
--> 176     return self.runner.run(self)
    177 
    178   def __enter__(self):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.pyc in run(self, pipeline)
    250     # Create the job
    251     result = DataflowPipelineResult(
--> 252         self.dataflow_client.create_job(self.job), self)
    253 
    254     self._metrics = DataflowMetrics(self.dataflow_client, result, self.job)

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/utils/retry.pyc in wrapper(*args, **kwargs)
    166       while True:
    167         try:
--> 168           return fun(*args, **kwargs)
    169         except Exception as exn:  # pylint: disable=broad-except
    170           if not retry_filter(exn):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job(self, job)
    423   def create_job(self, job):
    424     """Creates job description. May stage and/or submit for remote execution."""
--> 425     self.create_job_description(job)
    426 
    427     # Stage and submit the job when necessary

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job_description(self, job)
    446     """Creates a job described by the workflow proto."""
    447     resources = dependency.stage_job_resources(
--> 448         job.options, file_copy=self._gcs_file_copy)
    449     job.proto.environment = Environment(
    450         packages=resources, options=job.options,

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in stage_job_resources(options, file_copy, build_setup_args, temp_dir, populate_requirements_cache)
    377       else:
    378         sdk_remote_location = setup_options.sdk_location
--> 379       _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
    380       resources.append(names.DATAFLOW_SDK_TARBALL_FILE)
    381     else:

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
    462   elif sdk_remote_location == 'pypi':
    463     logging.info('Staging the SDK tarball from PyPI to %s', staged_path)
--> 464     _dependency_file_copy(_download_pypi_sdk_package(temp_dir), staged_path)
    465   else:
    466     raise RuntimeError(

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in _download_pypi_sdk_package(temp_dir)
    525       '--no-binary', ':all:', '--no-deps']
    526   logging.info('Executing command: %s', cmd_args)
--> 527   processes.check_call(cmd_args)
    528   zip_expected = os.path.join(
    529       temp_dir, '%s-%s.zip' % (package_name, version))

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/utils/processes.pyc in check_call(*args, **kwargs)
     42   if force_shell:
     43     kwargs['shell'] = True
---> 44   return subprocess.check_call(*args, **kwargs)
     45 
     46 

/usr/local/envs/py2env/lib/python2.7/subprocess.pyc in check_call(*popenargs, **kwargs)
    188         if cmd is None:
    189             cmd = popenargs[0]
--> 190         raise CalledProcessError(retcode, cmd)
    191     return 0
    192 

CalledProcessError: Command '['/usr/local/envs/py2env/bin/python', '-m', 'pip', 'install', '--download', '/tmp/tmp6JRn77', 'google-cloud-dataflow==2.0.0', '--no-binary', ':all:', '--no-deps']' returned non-zero exit status 2

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py:113: DeprecationWarning: object() takes no parameters
  super(GcsIO, cls).__new__(cls, storage_client))

Production ML models

Hi,
When can we have contents for 6. Production ML models ?

Thanks

gcloud SDK, a better solution reproducible environmental variables?

I've been doing the

Machine Learning with TensorFlow on Google Cloud Platform

course on coursera. While working through labs in the course, I have noticed that the strategies for configuring the gcloud sdk are not very robust. Perhaps it is because they are intended to be run on GCP in datalab, but I like doing them on my computer or VMs: datalab itself has been showing non-responsive UI, which may be caused poor network latency, or my persistent use of firefox.

Anyhow, moving onward, there doesn't seem to be a place in the documentation with an advised way of automating the setup of a GCP config, and I have broke quite a few gcloud configurations by running scripts like the one here. It changes the project id, bucket and region in my currently open config. These configurations are proving quite tedious to keep an eye on.

I know terraform and other devops tools offer partial solutions, but this really feels like something that should be native. Does anyone have suggestions on how we could improve the scripts used for setting up GCP environmental variables to stop them from being set on-top of existing configs, but to use a temporary one that belongs exclusively to the script?

Perhaps it is possible to set all of these variables with the python api, and avoid changing any of the configs that are used for bash calls.

Update use of google.datalab.bigquery to google-cloud-bigquery

Update tutorials and course content that currently use Datalab's BigQuery module to use the official python client library. The google-cloud-* client libraries are now Google's recommended way of interacting with GCP.

Error in serving lab 2 with run_dataflow.sh

In the "Prod ML Systems Lab 2 : Serving ML Predictions in batch and real-time" lab, it says:

Step 2

Back in your Cloud Shell, modify the script run_dataflow.sh to get Project Id (using --project) from command line arguments, and then run as follows:

cd ~/training-data-analyst/courses/machine_learning/deepdive/06_structured/labs/serving
./run_dataflow.sh

However, I can already see this set here: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive/06_structured/labs/serving/run_dataflow.sh#L11

I then get this Java error running the script:

[WARNING]
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions)
        at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:233)
        at org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:162)
        at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:55)
        at org.apache.beam.sdk.Pipeline.create(Pipeline.java:150)
        at com.google.cloud.training.mlongcp.AddPrediction.main(AddPrediction.java:69)
        ... 6 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:222)
        ... 10 more
Caused by: java.lang.NoSuchMethodError: com.google.api.client.googleapis.services.json.AbstractGoogleJsonClient$Builder.setBatchPath(Ljava/lang/String;)Lcom/google/api/client/googleapis/services/AbstractG
oogleClient$Builder;
        at com.google.api.services.cloudresourcemanager.CloudResourceManager$Builder.setBatchPath(CloudResourceManager.java:5929)
        at com.google.api.services.cloudresourcemanager.CloudResourceManager$Builder.<init>(CloudResourceManager.java:5908)
        at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.newCloudResourceManagerClient(GcpOptions.java:370)
        at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:240)
        at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:228)
        at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper(ProxyInvocationHandler.java:592)
        at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault(ProxyInvocationHandler.java:533)
        at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:155)
        at com.sun.proxy.$Proxy37.getGcpTempLocation(Unknown Source)
        at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:240)
        ... 15 more
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.330 s
[INFO] Finished at: 2018-10-14T14:31:16+01:00
[INFO] Final Memory: 26M/62M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project pipeline: An exception occured while executing the Java class. null: InvocationTargetException: Faile
d to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions): com.google.api.client.googleapis.services.json.AbstractGoogleJsonClient$Build
er.setBatchPath(Ljava/lang/String;)Lcom/google/api/client/googleapis/services/AbstractGoogleClient$Builder; -> [Help 1]

Iot data is inserted to Bigquery table based on item name

Hi Sir,

Now the records are created not based on timestamp but on item name as below.
If you see the timestamp column it is not in order. Everytime a new record is created it is getting appended based on item name.

Please find the below output.

device	item	type	state	timestamp
fueb_38B1DB168ABB	dimmer	Dimmer	41	2018-07-24 12:27:31 UTC
fueb_38B1DB168ABB	dimmer	Dimmer	63	2018-07-24 12:24:50 UTC
fueb_38B1DB168ABB	dimmer	Dimmer	80	2018-07-24 12:27:04 UTC
fueb_38B1DB168ABB	light	Switch	ON	2018-07-24 12:24:43 UTC
fueb_38B1DB168ABB	light	Switch	ON	2018-07-24 12:26:03 UTC
fueb_38B1DB168ABB	light	Switch	OFF	2018-07-24 12:22:39 UTC
fueb_38B1DB168ABB	light	Switch	OFF	2018-07-24 12:25:47 UTC
fueb_38B1DB168ABB	color	Color	109100100	2018-07-24 12:27:56 UTC
fueb_38B1DB168ABB	color	Color	201100100	2018-07-24 12:24:57 UTC

Please find the dataflow program which I am using to push iot data to BQ table.
PubSubReader.java.zip

How to resolve this issue and make BQ records based on timestamp.

Below error while installing packages PYTHON

google-cloud-storage 1.13.2 has requirement google-cloud-core<0.30dev,>=0.29.0, but you'll have google-cloud-core 0.25.0 which is incompatible.
google-gax 0.15.16 has requirement future<0.17dev,>=0.16.0, but you'll have future 0.17.1 which is incompatible.
apache-beam 2.5.0 has requirement httplib2<0.10,>=0.8, but you'll have httplib2 0.12.0 which is incompatible.
google-cloud-logging 1.9.1 has requirement google-cloud-core<0.30dev,>=0.29.0, but you'll have google-cloud-core 0.25.0 which is incompatible.
google-cloud-spanner 1.7.1 has requirement google-cloud-core<0.30dev,>=0.29.0, but you'll have google-cloud-core 0.25.0 which is incompatible.

Installing collected packages: six
Found existing installation: six 1.10.0
Uninstalling six-1.10.0:
Successfully uninstalled six-1.10.0
Successfully installed six-1.10.0

While installing below packages if I am not wrong

$ cat install_packages.sh
#!/bin/bash
apt-get install python-pip
pip install google-cloud-dataflow oauth2client==3.0.0
pip install --force six==1.10 # downgrade as 1.11 breaks apitools
pip install -U pip

Error running TPU lab: prefetch() missing 1 required positional argument

Reminder that this might be enough to correct the lab problem.
#360

How to predict in feateng.ipynb from csv data

Used training-data-analyst/courses/machine_learning/feateng/feateng.ipynb
with kaggle nyc taxi fare dataset on colab. The issue I encountered is how to predict after trained model? Since I am not using GCP directly I read test data and call predict but predictions were all empty. Can you please let me know how to perform prediction?

CSV_COLUMNS = 'key,fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,dayofweek,hourofday'.split(',')
LABEL_COLUMN = 'fare_amount'

def pandas_test_input_fn(df):
return tf.estimator.inputs.pandas_input_fn(
x=df,
y=None,
batch_size=512,
num_epochs=1,
shuffle=False,
queue_capacity=1000
)
df_valid2 = pd.read_csv('mydata/valid.csv', header = None, names = CSV_COLUMNS)
predictions = estimator.predict(input_fn = pandas_test_input_fn(df_valid2))

It complains "ValueError: Feature euclidean is not in features dictionary." since that is coming from add_engineered. But confused how to process add_engineered when I feed the data thru pandas?
Seem like need to define tf.estimator.ModeKeys.EVAL but how, right?
Thanks

Simplify REST access to deployed ML service

Avoid the use of Discovery API and directly hit the ML end-point (since it is documented and won't change)

c9228cb

The resulting code is simpler and easier to understand

San Diego Traffic Example: What is the role of LaneInfo.Java? help wanted

Hi,
I am currently learning GCP, and I've been following some of the examples in codelabs. To be more specific, I've been studying the San Diego traffic example. I don't quite understand what's the role of the file "LaneInfo.Java". It seems that these files define the input variables as strings, and then Currentconditions.java and AverageSpeeds.java use those variable definitions? As part of my learning, I am trying to replicate the same process using the Chicago Traffic dataset, but I keep running into issues when running the averagespeeds.java & laneinfo.java. Any type of insight(s) would be helpful. I am still very new to GCP and java/apache beam in general.

Use train_and_evaluate() rather than learn_runner

For example:

update 08_image/mnistmodel/trainer/task.py to use model.train_and_evaluate() instead of using learn_runner

r'^' seems unnecessary in re.match

training-data-analyst/courses/data_analysis/lab2/python/grep.py

Line 21 in 10f6da3

if re.match( r'^' + re.escape(term), line):

training-data-analyst/courses/data_analysis/lab2/python/grepc.py

Line 20 in 10f6da3

if re.match( r'^' + re.escape(term), line):

Since re.match() checks for a match only at the beginning of the string - regardless of mode - I wonder whether we can get rid of r'^' in front of re.escape(term) as it will do the same, and IMHO, more readable.

Alternatively, I am also considering simply using line.startswith(term), which doesn't need re module at all, and, again IMHO, more Pythonic and faster.

Error while running Beam on Dataflow in feateng.ipynb

Hello - I'm getting an error when running the code in "Run Beam pipeline on Cloud Dataflow" section of the "feateng" notebook.

Command:
preprocess(50*100, 'DataflowRunner')

Stacktrace:

Launching Dataflow job preprocess-taxifeatures-181109-182408 ... hang on

ContextualVersionConflictTraceback (most recent call last)
<ipython-input-14-b4775e416971> in <module>()
----> 1 preprocess(50*100, 'DataflowRunner')
      2 #change first arg to None to preprocess full dataset

<ipython-input-8-0ab357cc98ce> in preprocess(EVERY_N, RUNNER)
     50           p | 'read_{}'.format(phase) >> beam.io.Read(beam.io.BigQuerySource(query=query))
     51             | 'tocsv_{}'.format(phase) >> beam.Map(to_csv)
---> 52             | 'write_{}'.format(phase) >> beam.io.Write(beam.io.WriteToText(outfile))
     53         )
     54   print("Done")

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/pipeline.pyc in __exit__(self, exc_type, exc_val, exc_tb)
    421   def __exit__(self, exc_type, exc_val, exc_tb):
    422     if not exc_type:
--> 423       self.run().wait_until_finish()
    424 
    425   def visit(self, visitor):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/pipeline.pyc in run(self, test_runner_api)
    401     if test_runner_api and self._verify_runner_api_compatible():
    402       return Pipeline.from_runner_api(
--> 403           self.to_runner_api(), self.runner, self._options).run(False)
    404 
    405     if self._options.view_as(TypeOptions).runtime_type_check:

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/pipeline.pyc in run(self, test_runner_api)
    414       finally:
    415         shutil.rmtree(tmpdir)
--> 416     return self.runner.run_pipeline(self)
    417 
    418   def __enter__(self):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.pyc in run_pipeline(self, pipeline)
    387     # raise an exception.
    388     result = DataflowPipelineResult(
--> 389         self.dataflow_client.create_job(self.job), self)
    390 
    391     # TODO(BEAM-4274): Circular import runners-metrics. Requires refactoring.

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/utils/retry.pyc in wrapper(*args, **kwargs)
    182       while True:
    183         try:
--> 184           return fun(*args, **kwargs)
    185         except Exception as exn:  # pylint: disable=broad-except
    186           if not retry_filter(exn):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job(self, job)
    488   def create_job(self, job):
    489     """Creates job description. May stage and/or submit for remote execution."""
--> 490     self.create_job_description(job)
    491 
    492     # Stage and submit the job when necessary

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job_description(self, job)
    517 
    518     # Stage other resources for the SDK harness
--> 519     resources = self._stage_resources(job.options)
    520 
    521     job.proto.environment = Environment(

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in _stage_resources(self, options)
    450         options,
    451         temp_dir=tempfile.mkdtemp(),
--> 452         staging_location=google_cloud_options.staging_location)
    453     return resources
    454 

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/portability/stager.pyc in stage_job_resources(self, options, build_setup_args, temp_dir, populate_requirements_cache, staging_location)
    221         resources.extend(
    222             self._stage_beam_sdk(sdk_remote_location, staging_location,
--> 223                                  temp_dir))
    224       elif setup_options.sdk_location == 'container':
    225         # Use the SDK that's built into the container, rather than re-staging

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/portability/stager.pyc in _stage_beam_sdk(self, sdk_remote_location, staging_location, temp_dir)
    464       """
    465     if sdk_remote_location == 'pypi':
--> 466       sdk_local_file = Stager._download_pypi_sdk_package(temp_dir)
    467       sdk_sources_staged_name = Stager.\
    468           _desired_sdk_filename_in_staging_location(sdk_local_file)

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/portability/stager.pyc in _download_pypi_sdk_package(temp_dir, fetch_binary, language_version_tag, language_implementation_tag, abi_tag, platform_tag)
    513     package_name = Stager.get_sdk_package_name()
    514     try:
--> 515       version = pkg_resources.get_distribution(package_name).version
    516     except pkg_resources.DistributionNotFound:
    517       raise RuntimeError('Please set --sdk_location command-line option '

/usr/local/envs/py2env/lib/python2.7/site-packages/pkg_resources/__init__.pyc in get_distribution(dist)
    469         dist = Requirement.parse(dist)
    470     if isinstance(dist, Requirement):
--> 471         dist = get_provider(dist)
    472     if not isinstance(dist, Distribution):
    473         raise TypeError("Expected string, Requirement, or Distribution", dist)

/usr/local/envs/py2env/lib/python2.7/site-packages/pkg_resources/__init__.pyc in get_provider(moduleOrReq)
    345     """Return an IResourceProvider for the named module or requirement"""
    346     if isinstance(moduleOrReq, Requirement):
--> 347         return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
    348     try:
    349         module = sys.modules[moduleOrReq]

/usr/local/envs/py2env/lib/python2.7/site-packages/pkg_resources/__init__.pyc in require(self, *requirements)
    889         included, even if they were already activated in this working set.
    890         """
--> 891         needed = self.resolve(parse_requirements(requirements))
    892 
    893         for dist in needed:

/usr/local/envs/py2env/lib/python2.7/site-packages/pkg_resources/__init__.pyc in resolve(self, requirements, env, installer, replace_conflicting, extras)
    780                 # Oops, the "best" so far conflicts with a dependency
    781                 dependent_req = required_by[req]
--> 782                 raise VersionConflict(dist, req).with_context(dependent_req)
    783 
    784             # push the new requirements onto the stack

ContextualVersionConflict: (pytz 2016.7 (/usr/local/envs/py2env/lib/python2.7/site-packages), Requirement.parse('pytz<=2018.4,>=2018.3'), set(['apache-beam']))

Dual imports of different bigquery libraries as bq can create confusion

In courses/machine_learning/deepdive/02_generalization/create_datasets.ipynb there are 2 imports. One at the top

import google.datalab.bigquery as bq

and second at the last cell

import datalab.bigquery as bq

The problem with this is that the first one uses standard sql by default and second one uses legacy sql by default. If someone runs through the whole notebook and then tries to run earlier queries then it won't work and fail with errors related to enabling standard sql.

Error while running python transform.py

While running python transform.py in SSH, getting the below error:

Traceback (most recent call last):
File "transform.py", line 11, in
import urllib.request, urllib.error, urllib.parse
ImportError: No module named request

Please help.

Isn't this supposed to be a TODO (also the next line)?

training-data-analyst/courses/machine_learning/deepdive/06_structured/serving/application/main.py

Line 32 in 86692d3

credentials = GoogleCredentials.get_application_default()

Hey @lakshmanok
the course is nice but there is very little chance to actually actively learn in most labs as usually one just clicks through pre written code. In this case this example is supposed to be be a TODO according to the lab sheet on QuickLabs...

--delete--

Provide a Feature-Engineering example with complex operation

Hi Team,

For the feature engineering function section can you please provide an example with a complex calculation.

Example: We want to generate a new feature as division of two columns only if a third column has value = "Y" else set value for that row as -1.

In pandas it is easily possible with .apply() function but in Tensorflow pipeline how should this be done ?

I tried using tf.where and tf.cond but it doesn't seem to work fine in pipelines for me.

Small Typo on Sklearn + Cloud ML Tutorial

On the Jupyter Notebook (https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/sklearn/babyweight_skl.ipynb), there's a small typo.
In the 3rd coding cell under the section "Packaging up as a Python package", and for the "install_requires" argument, it says 'cloudml-hypertune,'. The comma should go outside of the string and not within the string.
The file generated by the %writefile magic function (babyweight/setup.py) of that coding cell is actually correct, so my guess is that the Jupyter Notebook cell must have been changed after the the file was written.

Thanks Lak for the great tutorials! On my team, we definitely following them carefully and closely.

TypeError: init() takes exactly 2 arguments (3 given) / Dataflow

I am running (https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/feateng/tftransform.ipynb)

based on the updates discussed in the previous issue:

#313

But I'm getting the following error:

Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 642, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 130, in execute
test_shuffle_sink=self._test_shuffle_sink)
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 104, in create_operation
is_streaming=False)
File "apache_beam/runners/worker/operations.py", line 636, in apache_beam.runners.worker.operations.create_operation
op = create_pgbk_op(name_context, spec, counter_factory, state_sampler)
File "apache_beam/runners/worker/operations.py", line 482, in apache_beam.runners.worker.operations.create_pgbk_op
return PGBKCVOperation(step_name, spec, counter_factory, state_sampler)
File "apache_beam/runners/worker/operations.py", line 538, in apache_beam.runners.worker.operations.PGBKCVOperation.init
fn, args, kwargs = pickler.loads(self.spec.combine_fn)[:3]
File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 246, in loads
return dill.loads(s)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 316, in loads
return load(file, ignore)
File "/usr/local/lib/python2.7/dist-packages/dill/_dill.py", line 304, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatchkey
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
TypeError: init() takes exactly 2 arguments (3 given)

apache-airflow==1.9.0
apache-beam==2.8.0
tensorflow==1.9.0
tensorflow-metadata==0.9.0
tensorflow-transform==0.9.0

I also noticed the SDK changed from 2.7 (10/22) to 2.8 (10/26)

response: { message: 'publisher is not defined', internalCode: undefined } }

I'm following codelab instructions to publish a topic in pubsub as below, but an error is returning:
response: { message: 'publisher is not defined', internalCode: undefined } }

at next (/home/google2145703_student/training-data-analyst/courses/developingapps/nodejs/pubsub-languageapi-spanner/start/node_modules/express/lib/router/index.js:275:10

Code:
// Handler for feedback POSTed from the client app
router.post('/feedback/:quiz', (req, res, next) => {
const feedback = req.body;
// TODO: Publish the message into Cloud Pub/Sub
publisher.publishFeedback(feedback).then(() => {
// TODO: Move the statement that returns a message to
// the client app here
res.json('Feedback received');

// END TODO

// TODO: Add a catch
}).catch(err => {
// TODO: There was an error, invoke the next middleware
next(err);

// END TODO

});

// END TODO
});

Small typo in compose_gcf_trigger lab

In the jupyter notebook courses/machine_learning/deepdive/10_recommend/composer_gcf_trigger/composertriggered.ipynb
there is a typo/inconsistency in the name of the airflow-variable gcp_completion_bucket.
In the section Complete the DAG file it is called gcs_completion_bucket (note the s at third position).
However, in the section Setting Airflow variables it is called gcp_completion_bucket. (I guess this name is correct since it conforms with the names of the other variables.)

The same applies to the the jupyter notebook in the labs folder courses/machine_learning/deepdive/10_recommend/labs/composer_gcf_trigger/composertriggered.ipynb

datalab/cloudshell doesn't exist

The coursera video is referencing a folder that doesn't exist

train_and_evaluate does not seem to work for me

Hi,

I was running this code on GCP and when I get to this line of code, it ended up in an error

train_and_evaluate('babyweight_trained')

I checked every single line of code in the training section and it seems like the error is from the line I mentioned.

InvalidArgumentErrorTraceback (most recent call last)
in ()
26
27 shutil.rmtree('babyweight_trained', ignore_errors=True) # start fresh each time
---> 28 train_and_evaluate('babyweight_trained')

in train_and_evaluate(output_dir)
23 steps=None,
24 exporters=exporter)
---> 25 tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
26
27 shutil.rmtree('babyweight_trained', ignore_errors=True) # start fresh each time

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/estimator/training.pyc in train_and_evaluate(estimator, train_spec, eval_spec)
437 '(with task id 0). Given task id {}'.format(config.task_id))
438
--> 439 executor.run()
440
441

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/estimator/training.pyc in run(self)
516 config.task_type != run_config_lib.TaskType.EVALUATOR):
517 logging.info('Running training and evaluation locally (non-distributed).')
--> 518 self.run_local()
519 return
520

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/estimator/training.pyc in run_local(self)
648 input_fn=self._train_spec.input_fn,
649 max_steps=self._train_spec.max_steps,
--> 650 hooks=train_hooks)
651
652 # Final export signal: For any eval result with global_step >= train

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.pyc in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
361
362 saving_listeners = _check_listeners_type(saving_listeners)
--> 363 loss = self._train_model(input_fn, hooks, saving_listeners)
364 logging.info('Loss for final step: %s.', loss)
365 return self

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.pyc in _train_model(self, input_fn, hooks, saving_listeners)
841 return self._train_model_distributed(input_fn, hooks, saving_listeners)
842 else:
--> 843 return self._train_model_default(input_fn, hooks, saving_listeners)
844
845 def _train_model_default(self, input_fn, hooks, saving_listeners):

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.pyc in _train_model_default(self, input_fn, hooks, saving_listeners)
857 return self._train_with_estimator_spec(estimator_spec, worker_hooks,
858 hooks, global_step_tensor,
--> 859 saving_listeners)
860
861 def _train_model_distributed(self, input_fn, hooks, saving_listeners):

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.pyc in _train_with_estimator_spec(self, estimator_spec, worker_hooks, hooks, global_step_tensor, saving_listeners)
1057 loss = None
1058 while not mon_sess.should_stop():
-> 1059 _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
1060 return loss
1061

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.pyc in exit(self, exception_type, exception_value, traceback)
677 if exception_type in [errors.OutOfRangeError, StopIteration]:
678 exception_type = None
--> 679 self._close_internal(exception_type)
680 # exit should return True to suppress an exception.
681 return exception_type is None

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.pyc in _close_internal(self, exception_type)
714 if self._sess is None:
715 raise RuntimeError('Session is already closed.')
--> 716 self._sess.close()
717 finally:
718 self._sess = None

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.pyc in close(self)
962 if self._sess:
963 try:
--> 964 self._sess.close()
965 except _PREEMPTION_ERRORS:
966 pass

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.pyc in close(self)
1106 self._coord.join(
1107 stop_grace_period_secs=self._stop_grace_period_secs,
-> 1108 ignore_live_threads=True)
1109 finally:
1110 try:

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/training/coordinator.pyc in join(self, threads, stop_grace_period_secs, ignore_live_threads)
387 self._registered_threads = set()
388 if self._exc_info_to_raise:
--> 389 six.reraise(*self._exc_info_to_raise)
390 elif stragglers:
391 if ignore_live_threads:

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.pyc in _run(self, sess, enqueue_op, coord)
250 break
251 try:
--> 252 enqueue_callable()
253 except self._queue_closed_exception_types: # pylint: disable=catching-non-exception
254 # This exception indicates that a queue was closed.

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _single_operation_run()
1242
1243 def _single_operation_run():
-> 1244 self._call_tf_sessionrun(None, {}, [], target_list, None)
1245
1246 return _single_operation_run

/usr/local/envs/py2env/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1407 return tf_session.TF_SessionRun_wrapper(
1408 self._session, options, feed_dict, fetch_list, target_list,
-> 1409 run_metadata)
1410 else:
1411 with errors.raise_exception_on_not_ok_status() as status:

InvalidArgumentError: assertion failed: [string_input_producer requires a non-null input tensor]
[[Node: input_producer/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](input_producer/Greater, input_producer/Assert/Assert/data_0)]]