License: MIT License

building-machine-learning-pipelines's Introduction

Building Machine Learning Pipelines

Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson

Update

  • The example code has been updated to work with TFX 1.4.0, TensorFlow 2.6.1, and Apache Beam 2.33.0. A GCP Vertex example (training and serving) was added.

Set up the demo project

Download the initial dataset. From the root of this repository, execute

python3 utils/download_dataset.py

After this script runs, you should have a data folder containing the file consumer_complaints_with_narrative.csv.

The dataset

The data we use in this example project can be downloaded with the script above. It comes from a public dataset of customer complaints collected by the US Consumer Financial Protection Bureau. If you would like to reproduce our edited dataset, carry out the following steps; a pandas sketch follows the list:

  • Download the dataset from https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data
  • Rename the columns to ["product", "sub_product", "issue", "sub_issue", "consumer_complaint_narrative", "company", "state", "zip_code", "company_response", "timely_response", "consumer_disputed"]
  • Filter the dataset to remove rows with missing data in the consumer_complaint_narrative column
  • In the consumer_disputed column, map Yes to 1 and No to 0
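
A minimal pandas sketch of those steps, assuming the downloaded export has been reduced to the eleven columns above, in that order (file names are placeholders):

import pandas as pd

df = pd.read_csv("complaints_download.csv")  # raw export from consumerfinance.gov
df.columns = [
    "product", "sub_product", "issue", "sub_issue",
    "consumer_complaint_narrative", "company", "state", "zip_code",
    "company_response", "timely_response", "consumer_disputed",
]
df = df.dropna(subset=["consumer_complaint_narrative"])  # drop rows without a narrative
df["consumer_disputed"] = df["consumer_disputed"].map({"Yes": 1, "No": 0})
df.to_csv("consumer_complaints_with_narrative.csv", index=False)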

Pre-pipeline experiment

Before building our TFX pipeline, we experimented with different feature engineering approaches and model architectures. The notebooks in the pre-experiment-pipeline folder preserve these experiments; we then refactored the code into the interactive pipeline below.

Interactive pipeline

The interactive-pipeline folder contains a full interactive TFX pipeline for the consumer complaint data.

Full pipelines with Apache Beam, Apache Airflow, Kubeflow Pipelines, GCP

The pipelines folder contains complete pipelines for the various orchestrators. See Chapters 11 and 12 for full details.

Chapters

The following subfolders contain stand-alone code for individual chapters.

Model analysis

Chapter 7. Stand-alone code for TFMA, Fairness Indicators, What-If Tool. Note that these notebooks will not work in JupyterLab.

Advanced TFX

Chapter 10. Notebook outlining the implementation of custom TFX components from scratch and by inheriting existing functionality. Presented at the Apache Beam Summit 2020.

Data privacy

Chapter 14. Code for training a differentially private version of the demo project. Note that the TF-Privacy module only supports TF 1.x as of June 2020.

Version notes

The code was written and tested for TFX version 0.22.

  • As of 11/23/21, the examples have been updated to support TFX 1.4.0, TensorFlow 2.6.1, and Apache Beam 2.33.0. A GCP Vertex example (training and serving) was added.

  • As of 9/22/20, the interactive pipeline runs on TFX version 0.24.0rc1. Due to minor TFX bugs, the pipelines currently don't work on releases 0.23 and 0.24-rc0. GitHub issues have been filed with the TFX team specifically for the book pipelines (Issue 2500). We will update the repository once the issue is resolved.

  • As of 9/14/20, TFX supports Python 3.8 only with versions >0.24.0rc0.

building-machine-learning-pipelines's People

Contributors

biogeek, buildingmlpipelines, catherinenelson1, hanneshapke, snehankekre, tcmetzger

building-machine-learning-pipelines's Issues

tfx.utils.dsl_utils isn't supported anymore and raises errors.

Hi there,

I tried to follow along, and at the very start, where we ingest a local CSV file, we have to
from tfx.utils.dsl_utils import external_input and use it to pass the external data_dir to CsvExampleGen. However, this no longer works, and it would be nice to update it. Currently, TFX suggests passing the data_dir directly as a string. However, even though this does not raise an error, it fails to ingest the .csv file. As to how to do it properly, I've posted a question on the brand-new TensorFlow forum: https://discuss.tensorflow.org/t/tfx-csvexamplegen-does-not-work-with-simply-example-help/1589

Hope this can be resolved :)
Best,
Timo

Interactive Pipeline crashes at different stages

tensorflow==2.6.2
tfx==1.3.3

Environment: Google Colab


Bug 1

When importing external_input the following error occurs:

ModuleNotFoundError: No module named 'tfx.utils.dsl_utils'

Fix

The ExampleGen component can now accept the path to the data directory as a string:

CsvExampleGen(input_base='path/to/csv_data/')

Remove the line from tfx.utils.dsl_utils import external_input from the cell.


Bug 2

When creating the CsvExampleGen component, the following error occurs:

TypeError: __init__() got an unexpected keyword argument 'input'

Fix

The CsvExampleGen component's parameter input has been renamed to input_base:

CsvExampleGen(input_base='path/to/csv_data/')


Bug 3

When running the Transform component using the following line:

context.run(transform)

it throws the following error:

OperatorNotAllowedInGraphError: using a tf.Tensor as a Python bool is not allowed: AutoGraph is disabled in this function. Try decorating it directly with @tf.function.

Fix

We need to decorate the convert_zip_code() function with @tf.function in module.py:

@tf.function
def convert_zip_code(zipcode: str) -> tf.float32:
    pass  # existing implementation unchanged

Bug 4

After the fix above, the transform component throws another error:

TypeError: bucketize() got an unexpected keyword argument 'always_return_num_quantiles'

Fix

The always_return_num_quantiles argument of tft.bucketize was deprecated in version 0.26 of tensorflow-transform.
Remove or comment out this argument from the tft.bucketize() call inside preprocessing_fn() in module.py.
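
A sketch of the adjusted call (the feature tensor and bucket count are placeholders for whatever module.py actually uses):

# always_return_num_quantiles removed; deprecated in tensorflow-transform 0.26
zip_code_buckets = tft.bucketize(converted_zip_code, num_buckets=10)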


Bug 5

After the fix above, the transform component throws yet another error:

TypeError: '>' not supported between instances of 'NoneType' and 'int'

This error occurs because a tensor returns the shape None, and Python cannot compare NoneType with int, float, or str.
I have tried to figure out where and why the tensor's shape is None, but it's over my head and I can't figure it out.

GCP Serving - Features required not used in model

After deploying the model to GCP, I found that predictions required the features 'company' and 'timely_response', although these are not used in the model. The other features below were also required, but 'state' and 'zip_code' were not.

exampledict = {
    'product': 'Bank account or service',
    'sub_product': 'Savings account',
    'timely_response': 'No',
    'company_response': 'Closed with monetary relief',
    'issue': 'Cash advance',
    'company': 'test',
    'consumer_complaint_narrative': 'happy with the service',
}
featuredict = {}

for key, value in exampledict.items():
    featuredict[key] = tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode('utf-8')]))

example = tf.train.Example(
    features=tf.train.Features(feature=featuredict)
)

input_data_json = {
    "signature_name": "serving_default",
    "instances": [
        {
            "examples": {"b64": base64.b64encode(example.SerializeToString()).decode('utf-8')}
        }
    ]
}

request = ml_resource.predict(name=model_path, body=input_data_json)
response = request.execute()
if "error" in response:
    raise RuntimeError(response["error"])
for pred in response["predictions"]:
    print(pred)

[0.123692594]

Why is the InfraValidator component not covered?

Hi!
I want to congratulate you on the book. It is a nice read and well explained.
I am designing an ML platform for the company I work for, and I am using TFX.
I just wanted to ask: why is the InfraValidator component not covered? Is there a reason behind that?

Thanks!

Chapter 2 example error

In attempting to execute the code at the end of Chapter 2, I get the following error:

WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server.
WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
Connecting anonymously.

I know it's in reference to attempting to pull kinglear.txt from Google Cloud Storage. Any tips on how to resolve this? BTW, here is the source code I copied out of the book:

import re
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

input_file = "gs://dataflow-samples/shakespeare/kinglear.txt"
output_file = "~/coding/machine-learning/output.txt"

pipeline_options = PipelineOptions()

with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromText(input_file)
    counts = (
        lines
        | 'Split' >> beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum)
    )
    def format_result(word_count):
        (word, count) = word_count
        return "{}: {}".format(word, count)
    
    output = counts | 'Format' >> beam.Map(format_result)

    output | WriteToText(output_file)

Chapter 7 - TFMA Evaluator AUC Metric Case Mismatch

The interactive pipeline sets a threshold for 'AUC', but the metric produced is 'auc', resulting in tfma.load_validation_result messages of "Metric not found." overall and for all product slices.

Correcting to:

thresholds={
    'auc':

produces the correct validation failures for several products (overall passes, as it is above the 0.65 threshold), as below.

However evaluator.outputs['blessing'].get()[0].uri is NOT_BLESSED:

metric_validations_per_slice {
  slice_key {
    single_slice_keys {
      column: "product"
      bytes_value: "Consumer Loan"
    }
  }
  failures {
    metric_key {
      name: "auc"
    }
    metric_threshold {
      value_threshold {
        lower_bound {
          value: 0.65
        }
      }
    }
    metric_value {
      double_value {
        value: 0.6262196898460388
      }
    }
  }
}
metric_validations_per_slice {
  slice_key {
    single_slice_keys {
      column: "product"
      bytes_value: "Mortgage"
    }
  }
  failures {
    metric_key {
      name: "auc"
    }
    metric_threshold {
      value_threshold {
        lower_bound {
          value: 0.65
        }
      }
    }
    metric_value {
      double_value {
        value: 0.618944525718689
      }
    }
  }
}
metric_validations_per_slice {
  slice_key {
    single_slice_keys {
      column: "product"
      bytes_value: "Payday loan"
    }
  }
  failures {
    metric_key {
      name: "auc"
    }
    metric_threshold {
      value_threshold {
        lower_bound {
          value: 0.65
        }
      }
    }
    metric_value {
      double_value {
        value: 0.6383864879608154
      }
    }
  }
}
metric_validations_per_slice {
  slice_key {
    single_slice_keys {
      column: "product"
      bytes_value: "Student loan"
    }
  }
  failures {
    metric_key {
      name: "auc"
    }
    metric_threshold {
      value_threshold {
        lower_bound {
          value: 0.65
        }
      }
    }
    metric_value {
      double_value {
        value: 0.6052306294441223
      }
    }
  }
}
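
For reference, a minimal sketch of a metrics spec with the lowercase threshold key (the surrounding EvalConfig, model specs, and slicing specs are assumed from the book's interactive pipeline):

import tensorflow_model_analysis as tfma

metrics_specs = [
    tfma.MetricsSpec(
        metrics=[tfma.MetricConfig(class_name='AUC')],
        # The threshold key must match the metric name TFMA emits: 'auc', not 'AUC'.
        thresholds={
            'auc': tfma.MetricThreshold(
                value_threshold=tfma.GenericValueThreshold(lower_bound={'value': 0.65})
            )
        },
    )
]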

Do we need to create the schema from the train dataset or the entire dataset to compare with the eval and test datasets?

The way to extend BaseExampleGenExecutor has changed in version 0.23

In your example code for writing a custom component by extending BaseExampleGenExecutor (see the Custom_TFX_Components notebook), your ImageToExample function should no longer accept an input_dict explicitly.

This changed in version 0.23: the input base can now simply be found at exec_properties['input_base']. The code as-is will result in an error in more recent versions. Hope this helps keep this valuable resource (thanks for that, btw!) up to date.
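
A hedged sketch of the updated pattern for TFX >=0.23 (the helper image_path_to_tf_example and the conversion details are hypothetical stand-ins for the notebook's code; exact signatures may differ between releases):

import os
import apache_beam as beam
import tensorflow as tf

@beam.ptransform_fn
def ImageToExample(pipeline, exec_properties, split_pattern):
    # The input location now arrives via exec_properties instead of an input_dict.
    input_base_uri = exec_properties['input_base']
    image_pattern = os.path.join(input_base_uri, split_pattern)
    return (
        pipeline
        | 'ListImages' >> beam.Create(tf.io.gfile.glob(image_pattern))
        | 'ConvertToExample' >> beam.Map(image_path_to_tf_example)  # hypothetical helper
    )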

Please help with understanding of convert_zip_code

Hi! I'm struggling to understand why the function convert_zip_code works.

First of all, its input argument will have type tf.Tensor when it is called from preprocessing_fn. Consequently, zipcode == '' will always be false, as a tensor does not equal an empty string, and eager mode is not supported in TFX. So I expect this function to crash when casting an empty string to a number. What am I missing here? Thanks!

BigQueryExampleGen failing due to lack of --project

  • macOS Catalina 10.15.5
  • Python 3.7, TFX 0.21.4
  • Trying to use BigQueryExampleGen, I get the error message:
    RuntimeError: Missing executing project information. Please use the --project command line option to specify it."

There does not appear to be a way to inject the project into BigQueryExampleGen.
The exact same query (used in the same notebook) works fine when passed in as part of:
%%bigquery retail --project jwdeeplearn

Not sure if this is an error in the book or just an issue with BigQueryExampleGen (or perhaps BQEG is not passing project info along to Apache Beam?). A sketch of a possible workaround follows.
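
A possible workaround (a sketch; the project ID and bucket are placeholders): pass the GCP project to the underlying Beam pipeline via beam_pipeline_args, which newer TFX versions accept on the InteractiveContext (or on the pipeline definition):

from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext(
    beam_pipeline_args=[
        '--project=my-gcp-project',
        '--temp_location=gs://my-bucket/tmp',
    ]
)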

Problem while serving with a Docker container

The current Trainer script in TFX saves the exported model with an additional directory level above the saved_model.pb file: the model version followed by the format ("serving"), which creates a path like:

model_path/
  • version_X/
    • serving/
      • saved_model.pb

However, the Docker container expects saved_model.pb to sit directly under the model version directory:

model_path/
  • version_X/
    • saved_model.pb

This mismatch causes issues when loading the model in the Docker environment.
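
A possible workaround (a sketch; paths and the model name are placeholders): mount the deeper serving directory as a numeric version directory so that TensorFlow Serving finds saved_model.pb where it expects it:

docker run -p 8501:8501 \
    -v /path/to/model_path/version_X/serving:/models/my_model/1 \
    -e MODEL_NAME=my_model \
    tensorflow/serving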

Issues installing requirements.txt on Ubuntu 22.04

Operating system: Ubuntu 22.04

Trying to install:

tensorboard_plugin_fairness_indicators==0.35.0
tensorflow_hub==0.12.0
tensorflow_privacy==0.7.3
tensorflow==2.6.1
tfx==1.4.0
witwidget==1.8.1

Running environments I am trying to use (via pyenv):

3.5.8
3.6.8
3.8.2
3.9.2
3.9.8
3.10.0 (set by /home/ivan/.pyenv/version)

  • Any details about your local setup that might be helpful in troubleshooting.

(venv) ivan@ivan-Z590I-AORUS-ULTRA ~/ProjectPrometheus/ComputerScience/NeuralNetworksProjects/ci-cd-pipelines/foundations/building-machine-learning-pipelines/requirements (main)$ pip install -r requirements.txt 
ERROR: Could not find a version that satisfies the requirement tensorboard_plugin_fairness_indicators==0.35.0 (from versions: 0.0.1, 0.0.2, 0.0.3, 0.0.4, 0.0.5, 0.0.6, 0.23.0, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0, 0.29.0, 0.30.0)
ERROR: No matching distribution found for tensorboard_plugin_fairness_indicators==0.35.0
WARNING: You are using pip version 21.2.3; however, version 22.3.1 is available.
You should consider upgrading via the '/home/ivan/ProjectPrometheus/ComputerScience/NeuralNetworksProjects/ci-cd-pipelines/foundations/building-machine-learning-pipelines/venv/bin/python3 -m pip install --upgrade pip' command
  • Detailed steps to reproduce the bug.

Try to install the requirements listed above, under any of the Python environments listed above, and run:

cd requirements
pip install -r requirements.txt
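
A possible fix (a sketch, assuming the failure is that these pinned wheels are only published for Python 3.7-3.9, so resolution fails under 3.10 and the older interpreters):

pyenv local 3.9.8
python -m venv venv && source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt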

Interactive Pipeline. Trainer component. UnimplementedError: Cast string to float is not supported

Hi,

Thanks for the book, great material.
I can't get the whole pipeline working, though.
When I run interactive_pipeline.ipynb, the Trainer component cell gives me the following error:

UnimplementedError:  Cast string to float is not supported
	 [[node Cast (defined at /home/jovyan/work/building-machine-learning-pipelines/interactive-pipeline/../components/module.py:285) ]] [Op:__inference_train_function_13208]

Function call stack:
train_function

Please note that I couldn't make it through the Transform component until I changed the type of the field "zip_code" from INT (inferred by the SchemaGen component) to BYTES. I don't know if that contributes to the error mentioned above.
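
A hedged sketch of that schema override using TFDV (file paths are placeholders; how the updated schema is wired back into the pipeline depends on your setup):

import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

schema = tfdv.load_schema_text('path/to/schema.pbtxt')
tfdv.get_feature(schema, 'zip_code').type = schema_pb2.FeatureType.BYTES
tfdv.write_schema_text(schema, 'path/to/updated_schema.pbtxt')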

Kubeflow pipeline example in Chapter 12 is not working

  • Your operating system name and version, as well as version numbers of the following packages: tensorflow, tfx.
    Tensorflow: 2.3.1
    tfx: 0.22 (and 0.24)
    kubeflow: 1.0.2 (kfp:1.0.0)
    OS: Ubuntu 18.04
    Notebook: 6.0.3 (lab)

  • Any details about your local setup that might be helpful in troubleshooting.
    To run the Kubeflow pipeline example, I copied the datasets downloaded using this repo
    into the PV/data directory and module.py into the PV/components directory.

  • Detailed steps to reproduce the bug.
    The following errors occurred at the statisticsgen step of Argo:

2020-10-09 13:45:14.925525: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib
2020-10-09 13:45:14.925569: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:absl:Running driver for StatisticsGen
INFO:absl:MetadataStore with gRPC connection initialized
INFO:absl:Adding KFP pod name consumer-complaint-pipeline-kubeflow-hzrjh-887385430 to execution
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ml_metadata/metadata_store/metadata_store.py", line 171, in _call_method
    response.CopyFrom(grpc_method(request))
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/dist-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.ALREADY_EXISTS
	details = "Type already exists with different properties."
	debug_error_string = "{"created":"@1602251118.007945649","description":"Error received from peer ipv4:10.106.131.168:8080","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"Type already exists with different properties.","grpc_status":6}"
>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 360, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 353, in main
    execution_info = launcher.launch()
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 197, in launch
    self._exec_properties)
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 166, in _run_driver
    component_info=self._component_info)
  File "/tfx-src/tfx/components/base/base_driver.py", line 330, in pre_execution
    contexts=contexts)
  File "/tfx-src/tfx/orchestration/metadata.py", line 599, in update_execution
    registered_artifacts_ids=registered_output_artifact_ids))
  File "/tfx-src/tfx/orchestration/metadata.py", line 538, in _artifact_and_event_pairs
    a.set_mlmd_artifact_type(self._prepare_artifact_type(a.artifact_type))
  File "/tfx-src/tfx/orchestration/metadata.py", line 185, in _prepare_artifact_type
    artifact_type=artifact_type, can_add_fields=True)
  File "/usr/local/lib/python3.7/dist-packages/ml_metadata/metadata_store/metadata_store.py", line 282, in put_artifact_type
    self._call('PutArtifactType', request, response)
  File "/usr/local/lib/python3.7/dist-packages/ml_metadata/metadata_store/metadata_store.py", line 146, in _call
    return self._call_method(method_name, request, response)
  File "/usr/local/lib/python3.7/dist-packages/ml_metadata/metadata_store/metadata_store.py", line 176, in _call_method
    raise _make_exception(e.details(), e.code().value[0])  # pytype: disable=attribute-error
ml_metadata.errors.AlreadyExistsError: Type already exists with different properties.

Is it possible to get a TF Serving tutorial for the existing pipeline and the same data?

Kubeflow pipeline example in Chapter 12 is not working

  • Your operating system name and version, as well as version numbers of the following packages: tensorflow, tfx.
    Tensorflow: 2.3.1
    tfx: 0.22
    kubeflow: 1.0.2 (kfp:1.0.0)
    OS: Ubuntu 18.04
    Notebook: 6.0.3 (lab)

  • Any details about your local setup that might be helpful in troubleshooting.
    To run the Kubeflow pipeline example, I copied the datasets downloaded using this repo
    into the PV/data directory and module.py into the PV/components directory,
    and created an output directory.

  • Detailed steps to reproduce the bug.
    The following errors occurred at the csvexamplegen step of Argo:

WARNING:absl:Could not find matching artifact class for type 'Examples' (proto: 'name: "Examples"\nproperties {\n  key: "span"\n  value: INT\n}\nproperties {\n  key: "split_names"\n  value: STRING\n}\nproperties {\n  key: "version"\n  value: INT\n}\n'); generating an ephemeral artifact class on-the-fly. If this is not intended, please make sure that the artifact class for this type can be imported within your container or environment where a component is executed to consume this type.
INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with gRPC connection initialized
INFO:absl:Adding KFP pod name consumer-complaint-pipeline-kubeflow-cd8hd-1778682510 to execution
INFO:absl:Running executor for CsvExampleGen
INFO:absl:Attempting to infer TFX Python dependency for beam
INFO:absl:Copying all content from install dir /tfx-src/tfx to temp dir /tmp/tmpysdjeff4/build/tfx
INFO:absl:Generating a temp setup file at /tmp/tmpysdjeff4/build/tfx/setup.py
INFO:absl:Creating temporary sdist package, logs available at /tmp/tmpysdjeff4/build/tfx/setup.log
INFO:absl:Added --extra_package=/tmp/tmpysdjeff4/build/tfx/dist/tfx_ephemeral-0.22.0.tar.gz to beam args
INFO:absl:Generating examples.
INFO:absl:Using 10 process(es) for Beam pipeline execution.
Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 360, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 353, in main
    execution_info = launcher.launch()
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 205, in launch
    execution_decision.exec_properties)
  File "/tfx-src/tfx/orchestration/launcher/in_process_component_launcher.py", line 67, in _run_executor
    executor.Do(input_dict, output_dict, exec_properties)
  File "/tfx-src/tfx/components/example_gen/base_example_gen_executor.py", line 234, in Do
    exec_properties)
  File "/tfx-src/tfx/components/example_gen/base_example_gen_executor.py", line 193, in GenerateExamplesByBeam
    | 'SplitData' >> beam.Partition(_PartitionFn, len(buckets), buckets))
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 998, in __ror__
    return self.transform.__ror__(pvalueish, self.label)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 562, in __ror__
    result = p.apply(self, pvalueish, label)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 612, in apply
    return self.apply(transform, pvalueish)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 655, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 198, in apply
    return m(transform, input, options)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 228, in apply_PTransform
    return transform.expand(input)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 923, in expand
    return self._fn(pcoll, *args, **kwargs)
  File "/tfx-src/tfx/components/example_gen/base_example_gen_executor.py", line 86, in _InputToSerializedExample
    | 'SerializeDeterministically' >>
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/pvalue.py", line 140, in __or__
    return self.pipeline.apply(ptransform, self)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 602, in apply
    transform.transform, pvalueish, label or transform.label)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 612, in apply
    return self.apply(transform, pvalueish)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/pipeline.py", line 655, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 198, in apply
    return m(transform, input, options)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 228, in apply_PTransform
    return transform.expand(input)
  File "/opt/venv/lib/python3.6/site-packages/apache_beam/transforms/ptransform.py", line 923, in expand
    return self._fn(pcoll, *args, **kwargs)
  File "/tfx-src/tfx/components/example_gen/csv_example_gen/executor.py", line 118, in _CsvToExample
    input_base_uri = artifact_utils.get_single_uri(input_dict[INPUT_KEY])
KeyError: 'input'

Transform code snippet for Computer Vision problem set in the book not working (or I couldn't make it work)

The book provides code snippets for the computer vision problem set, but they do not seem to work for the Transform step. I mean specifically the following code:

def process_image(raw_image):
    raw_image = tf.reshape(raw_image, [-1])
    img_rgb = tf.image.decode_jpeg(raw_image, channels=3)
    img_gray = tf.image.rgb_to_grayscale(img_rgb)
    img = tf.image.convert_image_dtype(img_gray, tf.float32)
    resized_img = tf.image.resize_with_pad(
        img,
        target_height=300,
        target_width=300,
    )
    img_grayscale = tf.image.rgb_to_grayscale(resized_img)
    return tf.reshape(img_grayscale, [-1, 300, 300, 1])

I am using it as follows in the preprocessing_fn:

def preprocessing_fn(inputs):
    image_raw = inputs['image_raw']
    label = inputs['label']
    label_integerized = tft.compute_and_apply_vocabulary(label)
    img_preprocessed = process_image(image_raw)  ## used here
    return {
      'img_preprocessed': img_preprocessed,
      'label_integerized': label_integerized,
    }

This is being called in the Transform step of the pipeline:

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath("module.py"),
)

context.run(transform)

The TFRecordDataset is a two-feature dataset, one feature containing the raw (JPEG) image and the other containing the label as a string (also stored as bytes). It was generated using pretty much the same code shown earlier in the book in the Data Ingestion chapter.

When I run the above, I get the following traceback:

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
~/projects/datadrivers/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in _create_c_op(graph, node_def, inputs, control_inputs, op_def)
   1811   try:
-> 1812     c_op = pywrap_tf_session.TF_FinishOperation(op_desc)
   1813   except errors.InvalidArgumentError as e:

InvalidArgumentError: Shape must be rank 0 but is rank 1 for '{{node DecodeJpeg}} = DecodeJpeg[acceptable_fraction=1, channels=3, dct_method="", fancy_upscaling=true, ratio=1, try_recover_truncated=false](Reshape)' with input shapes: [?].

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-38-795928f0e78f> in <module>
      5 )
      6 
----> 7 context.run(transform)

~/projects/datadrivers/venv/lib/python3.6/site-packages/tfx/orchestration/experimental/interactive/interactive_context.py in run_if_ipython(*args, **kwargs)
     65       # __IPYTHON__ variable is set by IPython, see
     66       # https://ipython.org/ipython-doc/rel-0.10.2/html/interactive/reference.html#embedding-ipython.
---> 67       return fn(*args, **kwargs)
     68     else:
     69       absl.logging.warning(

~/projects/datadrivers/venv/lib/python3.6/site-packages/tfx/orchestration/experimental/interactive/interactive_context.py in run(self, component, enable_cache, beam_pipeline_args)
    175         telemetry_utils.LABEL_TFX_RUNNER: runner_label,
    176     }):
--> 177       execution_id = launcher.launch().execution_id
    178 
    179     return execution_result.ExecutionResult(

~/projects/datadrivers/venv/lib/python3.6/site-packages/tfx/orchestration/launcher/base_component_launcher.py in launch(self)
    203                          execution_decision.input_dict,
    204                          execution_decision.output_dict,
--> 205                          execution_decision.exec_properties)
    206 
    207     absl.logging.info('Running publisher for %s',

~/projects/datadrivers/venv/lib/python3.6/site-packages/tfx/orchestration/launcher/in_process_component_launcher.py in _run_executor(self, execution_id, input_dict, output_dict, exec_properties)
     65         executor_context)  # type: ignore
     66 
---> 67     executor.Do(input_dict, output_dict, exec_properties)

~/projects/datadrivers/venv/lib/python3.6/site-packages/tfx/components/transform/executor.py in Do(self, input_dict, output_dict, exec_properties)
    388       label_outputs[labels.CACHE_OUTPUT_PATH_LABEL] = cache_output
    389     status_file = 'status_file'  # Unused
--> 390     self.Transform(label_inputs, label_outputs, status_file)
    391     absl.logging.debug('Cleaning up temp path %s on executor success',
    392                        temp_path)

~/projects/datadrivers/venv/lib/python3.6/site-packages/tfx/components/transform/executor.py in Transform(***failed resolving arguments***)
    886     # order to fail faster if it fails.
    887     analyze_input_columns = tft.get_analyze_input_columns(
--> 888         preprocessing_fn, typespecs)
    889 
    890     if not compute_statistics and not materialize_output_paths:

~/projects/datadrivers/venv/lib/python3.6/site-packages/tensorflow_transform/inspect_preprocessing_fn.py in get_analyze_input_columns(preprocessing_fn, specs)
     56     input_signature = impl_helper.batched_placeholders_from_specs(
     57         specs)
---> 58     _ = preprocessing_fn(input_signature.copy())
     59 
     60     tensor_sinks = graph.get_collection(analyzer_nodes.TENSOR_REPLACEMENTS)

~/projects/datadrivers/module.py in preprocessing_fn(inputs)
     21     label = inputs['label']
     22     label_integerized = tft.compute_and_apply_vocabulary(label)
---> 23     img_preprocessed = process_image(image_raw)
     24     return {
     25       'img_preprocessed': img_preprocessed,

~/projects/datadrivers/module.py in process_image(raw_image)
      5 def process_image(raw_image):
      6     raw_image = tf.reshape(raw_image, [-1])
----> 7     img_rgb = tf.io.decode_jpeg(raw_image, channels=3)
      8     img_gray = tf.image.rgb_to_grayscale(img_rgb)
      9     img = tf.image.convert_image_dtype(img_gray, tf.float32)

~/projects/datadrivers/venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_image_ops.py in decode_jpeg(contents, channels, ratio, fancy_upscaling, try_recover_truncated, acceptable_fraction, dct_method, name)
   1101                       try_recover_truncated=try_recover_truncated,
   1102                       acceptable_fraction=acceptable_fraction,
-> 1103                       dct_method=dct_method, name=name)
   1104   _result = _outputs[:]
   1105   if _execute.must_record_gradient():

~/projects/datadrivers/venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py in _apply_op_helper(op_type_name, name, **keywords)
    742       op = g._create_op_internal(op_type_name, inputs, dtypes=None,
    743                                  name=scope, input_types=input_types,
--> 744                                  attrs=attr_protos, op_def=op_def)
    745 
    746     # `outputs` is returned as a separate return value so that the output

~/projects/datadrivers/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in _create_op_internal(self, op_type, inputs, dtypes, input_types, name, attrs, op_def, compute_device)
   3483           input_types=input_types,
   3484           original_op=self._default_original_op,
-> 3485           op_def=op_def)
   3486       self._create_op_helper(ret, compute_device=compute_device)
   3487     return ret

~/projects/datadrivers/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in __init__(self, node_def, g, inputs, output_types, control_inputs, input_types, original_op, op_def)
   1973         op_def = self._graph._get_op_def(node_def.op)
   1974       self._c_op = _create_c_op(self._graph, node_def, inputs,
-> 1975                                 control_input_ops, op_def)
   1976       name = compat.as_str(node_def.name)
   1977     # pylint: enable=protected-access

~/projects/datadrivers/venv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in _create_c_op(graph, node_def, inputs, control_inputs, op_def)
   1813   except errors.InvalidArgumentError as e:
   1814     # Convert to ValueError for backwards compatibility.
-> 1815     raise ValueError(str(e))
   1816 
   1817   return c_op

ValueError: Shape must be rank 0 but is rank 1 for '{{node DecodeJpeg}} = DecodeJpeg[acceptable_fraction=1, channels=3, dct_method="", fancy_upscaling=true, ratio=1, try_recover_truncated=false](Reshape)' with input shapes: [?].
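
A hedged sketch of a possible fix: tf.io.decode_jpeg expects a scalar string, but inside preprocessing_fn the input arrives as a batch, so mapping the per-image decode over the batch instead of decoding the reshaped tensor directly should resolve the rank error (image size and dtype follow the snippet above; this is a guess, not the book's solution):

def process_image(raw_image):
    raw_image = tf.reshape(raw_image, [-1])  # flatten to a 1-D batch of JPEG strings

    def decode_and_resize(img_bytes):
        img = tf.io.decode_jpeg(img_bytes, channels=3)   # decode one scalar string
        img = tf.image.rgb_to_grayscale(img)
        img = tf.image.convert_image_dtype(img, tf.float32)
        return tf.image.resize_with_pad(img, target_height=300, target_width=300)

    # Map the per-image decode over the batch dimension.
    return tf.map_fn(decode_and_resize, raw_image, dtype=tf.float32)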

context.run(statistics_gen) interactive pipeline throwing error

Docker container running the image tensorflow/tensorflow:2.2.1-gpu-py3-jupyter
tensorflow==2.2.1
tfx==0.22.0

Running the interactive pipeline notebook throws the following error when trying to run StatisticsGen:

TypeCheckError: Type hint violation for 'ToTopKTuples': requires FrozenSet[FeaturePath] but got Set[Any] for bytes_features
Full type hint:
IOTypeHints[inputs=((Tuple[Union[NoneType, bytes, str], RecordBatch], FrozenSet[FeaturePath], FrozenSet[FeaturePath], Union[NoneType, str]), {}), outputs=((Tuple[Tuple[Union[NoneType, bytes, str], Tuple[Union[bytes, str], ...], Any], Union[Tuple[int, Union[float, int]], int]],), {})]
strip_iterable()

based on:
IOTypeHints[inputs=((Tuple[Union[NoneType, bytes, str], RecordBatch], FrozenSet[FeaturePath], FrozenSet[FeaturePath], Union[NoneType, str]), {}), outputs=((Iterable[Tuple[Tuple[Union[NoneType, bytes, str], Tuple[Union[bytes, str], ...], Any], Union[Tuple[int, Union[float, int]], int]]],), {})]
from_callable(_to_topk_tuples)
signature: (sliced_record_batch:Tuple[Union[str, bytes, NoneType], pyarrow.lib.RecordBatch], bytes_features:FrozenSet[tensorflow_data_validation.types.FeaturePath], categorical_features:FrozenSet[tensorflow_data_validation.types.FeaturePath], weight_feature:Union[str, NoneType]) -> Iterable[Tuple[Tuple[Union[str, bytes, NoneType], Tuple[Union[bytes, str], ...], Any], Union[int, Tuple[int, Union[int, float]]]]]
File "/usr/local/lib/python3.6/dist-packages/tensorflow_data_validation/statistics/generators/top_k_uniques_stats_generator.py", line 202

Can't run Kubeflow sample code

I've tried a few different versions of tfx/tensorflow/kfp/python and consistently get the following error:

root@b8ad67b54428:~/building-machine-learning-pipelines/pipelines/kubeflow_pipelines# export PYTHONPATH=~/building-machine-learning-pipelines/pipelines
root@b8ad67b54428:~/building-machine-learning-pipelines/pipelines/kubeflow_pipelines# python pipeline_kubeflow.py 
2021-02-21 21:27:38.699832: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "pipeline_kubeflow.py", line 10, in <module>
    from tfx.orchestration import pipeline
  File "/usr/local/lib/python3.6/dist-packages/tfx/orchestration/pipeline.py", line 28, in <module>
    from tfx.dsl.components.base import base_node
  File "/usr/local/lib/python3.6/dist-packages/tfx/dsl/components/base/base_node.py", line 28, in <module>
    from tfx.dsl.components.base import base_executor
  File "/usr/local/lib/python3.6/dist-packages/tfx/dsl/components/base/base_executor.py", line 40, in <module>
    beam_Pipeline = beam.Pipeline
AttributeError: module 'apache_beam' has no attribute 'Pipeline'

Data Validation - GCP Cloud DataFlow - No module named IPython

I get the following error when trying to generate statistics using Dataflow:

  File "/usr/local/lib/python3.7/site-packages/tensorflow_data_validation/utils/display_util.py", line 39, in <module>
ImportError: To use visualization features, make sure ipython is installed, or install TFDV using "pip install tensorflow-data-validation[visualization]": No module named 'IPython'

pip list shows tensorflow==2.3.0, tensorflow-data-validation==0.23.0, and ipython==7.17.0, as per https://pypi.org/project/tensorflow-data-validation/

I'm using: tensorflow_data_validation-0.23.0-cp37-cp37m-manylinux2010_x86_64.whl

It works fine with the DirectRunner.
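
A possible fix (a sketch; bucket paths and the project are placeholders): the local packages are not installed on the Dataflow workers, so ship the missing dependency via the Beam pipeline options, e.g. with a dataflow_requirements.txt containing ipython==7.17.0:

from apache_beam.options.pipeline_options import PipelineOptions
import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_csv(
    data_location='gs://my-bucket/data/*.csv',
    output_path='gs://my-bucket/stats/stats.tfrecord',
    pipeline_options=PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-gcp-project',
        '--temp_location=gs://my-bucket/tmp',
        '--requirements_file=dataflow_requirements.txt',  # installed on each worker
    ]),
)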

Cannot get artifacts during the train-eval-test split

Hi,
Referring to Chapter 3, I split the data into train/eval/test. In the component outputs I can see the split names, but when I try to print the artifacts, I can't see the three artifacts.

Please refer to the screenshot attached. I am running the notebook on Colab.

Unable to install requirements

I am currently facing issues installing the given requirements of the project.

I am on macOS 11.1 and have tried Python 3.6.12, 3.7.9, and 3.8.7.

pip tries to solve the dependency tree but isn't able to fulfill all requirements, taking multiple hours to try all combinations.
I also tried pinning tensorflow==2.2.1 and leaving all other packages open, but the version dependencies still cannot be resolved.

There are various error messages, and I did not want to paste them all here. Maybe you can guide me to one working Python version; I can then try again and paste the error messages.

Thank you!

Chapter 3 Data Ingestion

In the book data splitting was mentioned briefly, e.g.

base_dir = os.getcwd()
output = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(splits=[ 
        example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=6),
        example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=2),
        example_gen_pb2.SplitConfig.Split(name='test', hash_buckets=2)
    ]))

examples = dsl_utils.external_input(os.path.join(base_dir, 'data'))
example_gen = CsvExampleGen(input=examples, output_config=output) 

context.run(example_gen)

But how would one handle unbalanced datasets when generating samples for train/eval/test? I'm not able to find any examples or documentation for this example_gen_pb2.SplitConfig.Split class.

Issue installing requirements

Hi!

I don't know if you could help me.

I get this conflict error when installing requirements:

The conflict is caused by:
tensorflow-privacy 0.7.3 depends on attrs>=21.2.0
tfx 1.4.0 depends on attrs<21 and >=19.3.0

Why don't you create a Docker image? It would be much easier.

Thanks a lot for your work.

FileNotFoundError after executing convert_data_to_tfrecords.py

Bug

Incorrect file name provided to original_data_file on line 30 of convert_data_to_tfrecords.py leads to a FileNotFoundError.

System details

  • OS name and version: Ubuntu 18.04.3 LTS | Linux 4.19.104+
  • Package versions: tensorflow 2.2.0 | tfx 0.22.0
  • Local setup: Google Colab

Steps to reproduce

git clone https://github.com/Building-ML-Pipelines/building-machine-learning-pipelines.git
cd building-machine-learning-pipelines/chapters/data_ingestion/
python3 convert_data_to_tfrecords.py

2020-07-15 09:21:35.509359: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "/content/building-machine-learning-pipelines/chapters/data_ingestion/convert_data_to_tfrecords.py", line 34, in <module>
    with open(original_data_file) as csv_file:
FileNotFoundError: [Errno 2] No such file or directory: '../../data/consumer-complaints.csv'

Cause

Incorrect file name provided to original_data_file on line 30 of convert_data_to_tfrecords.py. The dataset is stored as consumer_complaints_with_narrative.csv upon download.

Fix

Replace line 30 with the following:
original_data_file = "../../data/consumer_complaints_with_narrative.csv"

Expected output

python3 convert_data_to_tfrecords.py

2020-07-15 09:31:39.794495: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
66799it [00:10, 6585.19it/s]

GOOGLE_APPLICATION_CREDENTIALS

I'm trying to execute basic_pipeline.py with the example code from Chapter 2.

However, I get the following error:
WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information. Connecting anonymously.

Can anyone help me to set up the credentials correctly?

Thank you
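
A likely resolution (a sketch; the key path is a placeholder): point Beam's GCP auth at a service-account key before running the pipeline. For the public sample bucket the warning is also safe to ignore, since Beam falls back to anonymous access:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
python basic_pipeline.py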

Examples passed to the Evaluator component

Shouldn't the examples passed to the Evaluator below come from the Transform component, since that component updates the evaluation split with the necessary preprocessing steps applied during transformation? Using examples from ExampleGen will lead to an input mismatch, as the evaluation split there wouldn't have the transformation/preprocessing steps applied.
So instead of:

evaluator = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    baseline_model=model_resolver.outputs['model'],
    eval_config=eval_config)
context.run(evaluator)

should it be:

from tfx.components import Evaluator

evaluator = Evaluator(
    examples=transform.outputs['transformed_examples'],
    model=trainer.outputs['model'],
    baseline_model=model_resolver.outputs['model'],
    eval_config=eval_config)
context.run(evaluator)

There may be another reason why you used ExampleGen's examples; please let me know if so.

Data Ingestion: String to Float

The downloaded dataset contains non-numeric zip codes ending in "XX", causing the conversion to fail, for example:

  File "convert_data_to_tfrecords.py", line 47, in <module>
    "zip_code": _int64_feature(int(float(row["zip_code"]))),
ValueError: could not convert string to float: '113XX'

Replacing "XX" with "00" allows the conversion to proceed.
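
A minimal sketch of that workaround applied to line 47 of convert_data_to_tfrecords.py (the surrounding dict literal is assumed from the script):

# Replace masked digits ("X") with "0" before the numeric conversion.
"zip_code": _int64_feature(int(float(row["zip_code"].replace("X", "0")))),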

context.run(statistics_gen) errors in interactive pipeline sample

In a freshly created GCP AI Notebook (Linux), with tfx==0.22 and tf==2.2.1, when I get to the cell that executes StatisticsGen (ExampleGen ran fine), I get the following error, which seems to be coming from TensorFlow Data Validation or perhaps Apache Beam. My tfdv version is 0.22.2:

TypeCheckError: Type hint violation for 'ToTopKTuples': requires FrozenSet[FeaturePath] but got Set[Any] for bytes_features
Full type hint:
IOTypeHints[inputs=((Tuple[Union[NoneType, bytes, str], RecordBatch], FrozenSet[FeaturePath], FrozenSet[FeaturePath], Union[NoneType, str]), {}), outputs=((Tuple[Tuple[Union[NoneType, bytes, str], Tuple[Union[bytes, str], ...], Any], Union[Tuple[int, Union[float, int]], int]],), {})]
strip_iterable()

based on:
IOTypeHints[inputs=((Tuple[Union[NoneType, bytes, str], RecordBatch], FrozenSet[FeaturePath], FrozenSet[FeaturePath], Union[NoneType, str]), {}), outputs=((Iterable[Tuple[Tuple[Union[NoneType, bytes, str], Tuple[Union[bytes, str], ...], Any], Union[Tuple[int, Union[float, int]], int]]],), {})]
from_callable(_to_topk_tuples)
signature: (sliced_record_batch: Tuple[Union[str, bytes, NoneType], pyarrow.lib.RecordBatch], bytes_features: FrozenSet[tensorflow_data_validation.types.FeaturePath], categorical_features: FrozenSet[tensorflow_data_validation.types.FeaturePath], weight_feature: Union[str, NoneType]) -> Iterable[Tuple[Tuple[Union[str, bytes, NoneType], Tuple[Union[bytes, str], ...], Any], Union[int, Tuple[int, Union[int, float]]]]]
File "/opt/conda/lib/python3.7/site-packages/tensorflow_data_validation/statistics/generators/top_k_uniques_stats_generator.py", line 202

I cloned the repo as of today, so I don't think there's anything stale about the environment.
Thanks,
j.

Modification for `pre-experiment-pipeline/experiment_6Mar.ipynb`

Problem

In pre-experiment-pipeline/experiment_6Mar.ipynb

  1. Incorrect file path for the dataframe
  2. Incompatible pandas arguments passed to df['zip_code'].str.replace(...)

System details

OS name and version: Ubuntu 18.04
Package versions: tensorflow 2.2.0 | tfx 0.22.0
Local setup: virtualenv + make develop

Fix

  1. Replace the original file path ../data/6Mar/consumer_complaints_with_narrative.csv with ../data/consumer_complaints_with_narrative.csv in cell 6.
  2. Remove regex=True from df['zip_code'] = df['zip_code'].str.replace('X', '0') in cells 11 and 12, to be consistent with pandas==0.22.0.

Also, in order to pass make test, I modified the name of /requirements/test_requirements.txt, which #38 addressed.

interactive_pipeline.ipynb error when running transform step.

Using the interactive_pipeline notebook on a Mac (Catalina), all steps prior to Transform ran successfully:

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=transform_file)
context.run(transform)

This raises:

    zipcode = tf.strings.regex_replace(zipcode, r"X{0,5}", "0")
TypeError: Input 'input' of 'StaticRegexReplace' Op has type int64 that does not match expected type of string.

versions:
Tensorflow Version: 2.2.0
TFX Version: 0.21.4
TFDV Version: 0.21.5
TFMA Version: 0.21.6

Thanks,
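
A hedged guess at a workaround: SchemaGen appears to have inferred zip_code as int64, while the transform code expects a string, so casting before the regex replace (inside preprocessing_fn in module.py) may resolve it:

zipcode = tf.strings.as_string(zipcode)
zipcode = tf.strings.regex_replace(zipcode, r"X{0,5}", "0")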

Chapter 5 Data preprocessing

Operating system: windows 10
tensorflow v2.2.0
tfx v0.22.0

Simply running the interactive_pipeline.ipynb Jupyter notebook in the repo, the code fails at the data transform section when trying to run preprocessing_fn in module.py, producing the following error:

RuntimeError: FileNotFoundError: [Errno 2] No such file or directory: 'C:\blmp\tfx\Transform\transform_graph\13\.temp_path\tftransform_tmp\beam-temp-vocab_compute_and_apply_vocabulary_vocabulary-47fb6d70f05611eab03c7ce9d3b592e5\007c4e2d-3739-412a-ae5a-808d1283096e.vocab_compute_and_apply_vocabulary_vocabulary' [while running 'Analyze/VocabularyOrderAndWrite[compute_and_apply_vocabulary/vocabulary]/WriteToFile/Write/WriteImpl/WriteBundles']

The file 007c4e2d-3739-412a-ae5a-808d1283096e.vocab_compute_and_apply_vocabulary_vocabulary doesn't get generated, which is probably why it fails.

GCP AI Pipeline with Dataflow fails with TypeError

Error:

File "apache_beam/coders/coder_impl.py", line 165, in apache_beam.coders.coder_impl.CoderImpl.estimate_size
File "apache_beam/coders/coder_impl.py", line 488, in apache_beam.coders.coder_impl.BytesCoderImpl.encode_to_stream
TypeError: Expected bytes, got list [while running 'InputToSerializedExample/InputSourceToExample/ParseCSVLine']

INFO:apache_beam.runners.dataflow.dataflow_runner:2020-09-27T07:25:49.571Z: JOB_MESSAGE_BASIC: Finished operation InputToSerializedExample/InputSourceToExample/ReadFromText/Read+InputToSerializedExample/InputSourceToExample/ParseCSVLine+InputToSerializedExample/InputSourceToExample/InferColumnTypes/KeyWithVoid+InputToSerializedExample/InputSourceToExample/InferColumnTypes/CombinePerKey/GroupByKey+InputToSerializedExample/InputSourceToExample/InferColumnTypes/CombinePerKey/Combine/Partial+InputToSerializedExample/InputSourceToExample/InferColumnTypes/CombinePerKey/GroupByKey/Reify+InputToSerializedExample/InputSourceToExample/InferColumnTypes/CombinePerKey/GroupByKey/Write
INFO:apache_beam.runners.dataflow.dataflow_runner:2020-09-27T07:25:49.643Z: JOB_MESSAGE_DEBUG: Executing failure step failure72

It succeeds with Dataflow removed (commenting out the Beam args):

# beam_pipeline_args=beam_pipeline_args,

ValueError: Usecols do not match columns, columns expected but not found: ['company_response_to_consumer', 'zipcode', 'consumer_disputed?']

Bug

Set up of the demo project fails and throws a ValueError when following the instructions.

System details

  • OS name and version: Ubuntu 18.04.3 LTS | Linux 4.19.104+
  • Package versions: tensorflow 2.2.0 | tfx 0.22.0
  • Local setup: Google Colab

Steps to reproduce

!pip install tfx
!git clone https://github.com/Building-ML-Pipelines/building-machine-learning-pipelines.git
!cd building-machine-learning-pipelines/;python3 utils/download_dataset.py

INFO:root:Started
INFO:root:Data folder created.
INFO:urllib3.poolmanager:Redirecting http://bit.ly/building-ml-pipelines-dataset -> https://drive.google.com/uc?export=download&id=1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
INFO:urllib3.poolmanager:Redirecting https://drive.google.com/uc?export=download&id=1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF -> https://doc-0o-8s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/s9hu87rhvef8qlae21p9rreoda7auml3/1594723575000/06616860426990197454/*/1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF?e=download
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
INFO:root:Download completed.
Traceback (most recent call last):
  File "utils/download_dataset.py", line 131, in <module>
    update_csv()
  File "utils/download_dataset.py", line 101, in update_csv
    df = pd.read_csv(LOCAL_FILE_NAME, usecols=feature_cols)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1937, in __init__
    _validate_usecols_names(usecols, self.orig_names)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1233, in _validate_usecols_names
    "Usecols do not match columns, "
ValueError: Usecols do not match columns, columns expected but not found: ['company_response_to_consumer', 'zipcode', 'consumer_disputed?']

Cause

In line 101 of utils/download_dataset.py, usecols looks for the columns defined in feature_cols within the consumer_complaints_with_narrative.csv dataset. It does not find the 'company_response_to_consumer', 'zipcode', and 'consumer_disputed?' columns and throws a ValueError: a simple case of column name mismatch. The dataset actually contains the following column names:

df.columns
Index(['product', 'sub_product', 'issue', 'sub_issue',
       'consumer_complaint_narrative', 'company', 'state', 'zip_code',
       'company_response', 'timely_response', 'consumer_disputed'],
      dtype='object')

Fix

Update the column names in feature_cols and remove lines 103-110 in utils/download_dataset.py, i.e. lines 88 through 110 can be replaced by the following:

feature_cols = [
    "product",
    "sub_product",
    "issue",
    "sub_issue",
    "state",
    "zip_code",
    "company",
    "company_response",
    "timely_response",
    "consumer_disputed",
    "consumer_complaint_narrative",
]
df = pd.read_csv(LOCAL_FILE_NAME, usecols=feature_cols)

Expected output

!cd building-machine-learning-pipelines/;python3 utils/download_dataset.py

INFO:root:Started
INFO:root:Data folder already existed.
INFO:urllib3.poolmanager:Redirecting http://bit.ly/building-ml-pipelines-dataset -> https://drive.google.com/uc?export=download&id=1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
INFO:urllib3.poolmanager:Redirecting https://drive.google.com/uc?export=download&id=1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF -> https://doc-0o-8s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/10faglfko9lihkhoq7mugfqlen9c30lu/1594725450000/06616860426990197454/*/1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF?e=download
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
INFO:root:Download completed.
INFO:root:CSV header updated and rewritten to data/tmp_consumer_complaints_with_narrative.csv
INFO:root:Finished
