googlecloudplatform / ml-design-patterns
Source code accompanying the O'Reilly book: Machine Learning Design Patterns
License: Apache License 2.0
Hello, I am studying this book. Thank you for writing such a well-structured textbook.
I have a question about the Bridged Schema pattern in section 23. How should I determine the amount of old data to add to the training data?
In this repository, it is stated that adding 60,000 old examples is best. However, in the line graph of number of examples versus R² value, 60,000 sits near the bottom of the R² range. Since a higher R² means better predictions, it looks as though prediction accuracy actually decreases as more old data is added.
From this graph, I would conclude that prediction accuracy is higher when training on the new data alone.
I would be glad if someone could explain why adding 60,000 old examples was judged to be best.
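For what it's worth, one way to settle this empirically is to treat the number of bridged old examples as a hyperparameter: sweep several candidate counts, evaluate each model on the same held-out set, and keep the count with the best R². A minimal sketch of that sweep, where `train_and_eval_r2` is a hypothetical stand-in for training on the new data plus `n_old` bridged examples (the numbers below are illustrative only, not from the notebook):

```python
def train_and_eval_r2(n_old):
    """Hypothetical helper: train on all new data plus `n_old` bridged old
    examples, then return R^2 on a fixed held-out evaluation set."""
    # Stand-in numbers for illustration only.
    return {0: 0.72, 20_000: 0.75, 40_000: 0.74, 60_000: 0.73}[n_old]

candidates = [0, 20_000, 40_000, 60_000]
scores = {n: train_and_eval_r2(n) for n in candidates}

# Higher R^2 means better predictions, so keep the count with the largest score.
best_n = max(scores, key=scores.get)
```

Under this procedure, if adding old data only lowers the held-out R², the sweep would select a smaller count (possibly zero), which is exactly the question the graph raises.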
Dear authors,
the evaluate component of the pipeline fails because the pyarrow module is missing.
Solved by changing the packages_to_install request in the pipeline definition:
@dsl.pipeline(
    name='Cascade pipeline on SF bikeshare',
    description='Cascade pipeline on SF bikeshare'
)
def cascade_pipeline(
    project_id = PROJECT_ID
):
    ddlop = comp.func_to_container_op(run_bigquery_ddl, packages_to_install=['google-cloud-bigquery'])
    c1 = train_classification_model(ddlop, PROJECT_ID)
    c1_model_name = c1.outputs['created_table']
    c2a_input = create_training_data(ddlop, PROJECT_ID, c1_model_name, 'Typical')
    c2b_input = create_training_data(ddlop, PROJECT_ID, c1_model_name, 'Long')
    c3a_model = train_distance_model(ddlop, PROJECT_ID, c2a_input.outputs['created_table'], 'Typical')
    c3b_model = train_distance_model(ddlop, PROJECT_ID, c2b_input.outputs['created_table'], 'Long')
    # installing the bqstorage/pandas extras pulls pyarrow in as a dependency
    evalop = comp.func_to_container_op(evaluate, packages_to_install=['google-cloud-bigquery[bqstorage,pandas]', 'pandas'])
    error = evalop(PROJECT_ID, c1_model_name, c3a_model.outputs['created_table'], c3b_model.outputs['created_table'])
    print(error.output)
Best Regards
Jerome
mixed_image_tabular_model = Model(inputs=[image_tabular_input, tiled_input], outputs=merged_image_output)
NameError: name 'tiled_input' is not defined
Should be 'image_input'?
getting access denied trying to copy gs://ml-design-patterns/auto-mpg.csv - chapter 7 explainability.ipynb
jim
DeepExplainer fails with TF 2.3, OK with 2.2.
The order of the publications is inconsistent between the original CLASSES definition and the source name function:
CLASSES = {
'github': 0,
'nytimes': 1,
'techcrunch': 2
}
labels = tf.constant(['github', 'techcrunch', 'nytimes'], dtype=tf.string)
Suggest:
labels = tf.constant(['github', 'nytimes', 'techcrunch'], dtype=tf.string)
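To keep the two definitions from drifting apart again, the label list could also be derived from CLASSES rather than typed out a second time; a minimal sketch (pure Python, with the tf.constant call from the notebook left as a comment):

```python
CLASSES = {
    'github': 0,
    'nytimes': 1,
    'techcrunch': 2
}

# Sort the publication names by their class index so position i holds class i.
label_list = [name for name, idx in sorted(CLASSES.items(), key=lambda kv: kv[1])]

# labels = tf.constant(label_list, dtype=tf.string)  # as in the notebook
```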
Typo in last cell of notebook:
gcloud ai-platform predict --model 'flighs_regression' --version
'v1' --json-instances 'input.json'
Should be flights_regression.
Also ensure the installed version of xgboost used for training matches the AI Platform runtime version (https://cloud.google.com/ai-platform/training/docs/runtime-version-list), otherwise null predictions are produced.
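A small guard against this mismatch is to compare the locally installed xgboost version with the version bundled by the target runtime before deploying. The helper below is a hypothetical sketch (the bundled version itself has to be looked up on the linked runtime version list):

```python
def versions_compatible(local_version, runtime_version):
    """Compare the major.minor components of two version strings like '1.4.2'."""
    return local_version.split('.')[:2] == runtime_version.split('.')[:2]

# Example use before deploying (runtime's bundled version taken from the docs):
# import xgboost
# assert versions_compatible(xgboost.__version__, '1.4.0')
```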
Dear authors,
it seems to me that Section 2 of the continuous evaluation notebook needs to be updated. Indeed, the continuous evaluation mode is no longer as straightforward and requires more setup information to be used flawlessly.
Thanks
Best Regards
Jerome
Had to copy data directory with csv files from https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/quests/serverlessml/data to run taxifare_fc.ipynb.
Model trained on Tensorflow 2.3 is incompatible with serving from GCP AI Platform, Tensorflow 2.2.0. Prediction fails with
input segment_ids[0] expected type int32 != int64. Training on TF2.2.0 predicts ok.
shap.DeepExplainer call fails with error "Can't convert non-rectangular Python sequence to Tensor".
tensorflow version 2.1.1
shap version 0.37.0
seems similar to: shap/shap#850
thanks,
jim
NameError: name 'image_batch' is not defined
feature_batch = mobilenet(image_batch)
Suggest:
train_image, train_label = next(train_data_gen)
feature_batch = mobilenet(train_image)
Phase 2 (identifying instrument sounds) needs the '/audio_train_spectro' directory repopulated with images, as the files were moved into audio_spectros/not_instrument/ or audio_spectros/instrument/ in the previous example.
Similarly suggest:
feature_batch = vgg_model(image_instrument_train)
On GCP AI Notebook install xgboost
!pip install xgboost
Colab Auth as per issue #5
Needed to manually create natality dataset in US region
Replaced df['NEAREST_CENTROIDS_DISTANCE'].iloc[0]
with
average_pred['NEAREST_CENTROIDS_DISTANCE'].iloc[0]
When running chapter 1 on a GCP Notebook, Google Colab is not installed. The notebook executes successfully without running this cell.
Installing the package (google-colab) flags some incompatibilities and then produces:
import error: cannot import name 'ordereddict' from 'pandas.compat'
causing the remainder of the notebook to fail.
Hi authors team,
just to inform you that the checkpointing callback is not passed to the fit() method.
I guess it is a typo, as the callback is used in the code snippet in the book.
Best regards
Jerome
Dear authors
BQML without L2, RMSE = 4.828183
BQML with L2, RMSE = 4.844853
arghhhhhhh ;-)
Best regards
Jerome
Skaffold must be installed to use tfx cli as per https://www.tensorflow.org/tfx/tutorials/tfx/cloud-ai-platform-pipelines
tfx pipeline create command must be run from my_pipeline dir not workflow_pipeline as in README.md
Had to run 'gcloud ai-platform models create txtcls' prior to creation of new model version
Dear authors,
The text
'''Creating a Feature Cross with BQML
Next, we'll create a feature cross of the features is_male and mother_race. To create a feature cross we apply ML.FEATURE_CROSS to a STRUCT of the features is_male and mother_race cast as a string. The STRUCT clause creates an ordered pair of the two features. The TRANSFORM clause is used for engineering features of our model. This allows us to specify all preprocessing during model creation and apply those preprocessing steps during prediction and evaluation. The rest of the features within the TRANSFORM clause remain unchanged.'''
has to be changed, as the two features crossed are 'is_male' and 'plurality', not 'mother_race'.
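For reference, the corrected crossed pair inside the TRANSFORM clause would look roughly like this (a sketch, assuming STRING casts as described in the text; the alias name is illustrative, not taken from the notebook):

```sql
ML.FEATURE_CROSS(
    STRUCT(
        CAST(is_male AS STRING) AS is_male,
        CAST(plurality AS STRING) AS plurality
    )
) AS gender_plurality_cross
```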
Thanks
Best regards
Jerome
With tfx==0.24 had to substitute:
from tfx.extensions.google_cloud_big_query.example_gen.component import BigQueryExampleGen
Also transform step fails with kernel restart, see: tensorflow/tfx#2598
Hi everyone,
I have no access to the bucket to get the dataset.
AccessDeniedException: 403 [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket.
Best Regards
Jerome
GCP AI notebook does not have kfp installed.
!pip install kfp has no incompatibilities.
Pipeline run completes with output: 569.155..
Dear authors,
The last cell of the notebook could be confusing: when using pre-trained embedding vectors, there is no need to define the EMBED_DIM parameter. Fortunately, the build_hub_model() method does not use it :)
Best regards
Jerome
I am working through the following notebook: https://github.com/GoogleCloudPlatform/ml-design-patterns/blob/master/02_data_representation/weather_search/wx_embeddings.ipynb. I am running a GCP AI Notebook VM with JupyterLab.
When I get to the following line of code: %run -m wxsearch.hrrr_to_tfrecord -- --startdate 20190915 --enddate 20190916 --outdir gs://{BUCKET}/wxsearch/data/2019 --project {PROJECT}, my Dataflow batch job indicates that it runs fine to completion (first image below). However, the batch job produces a zero-byte TensorFlow record file (second image below). The zero elements per second in create_tfr seems concerning to me, although I don't know if this is a problem.
Any thoughts as to what may be happening? The only modifications I made were to the bucket and project variables where I wrote my own bucket and project values into the command.
It would be helpful to indicate the likely training times for a typical machine specification.
Using a GCP N1-standard notebook, I found the DNN and Linear training times to be significantly longer than stated in the draft book.
I created using code from: https://datalab.office.datisan.com.au/notebooks/training-data-analyst/blogs/textclassification/txtcls.ipynb
as:
query="""
SELECT source, REGEXP_REPLACE(title, '[^a-zA-Z0-9 $.-]', ' ') AS title FROM
(SELECT
ARRAY_REVERSE(SPLIT(REGEXP_EXTRACT(url, '.://(.[^/]+)/'), '.'))[OFFSET(1)] AS source,
title
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
REGEXP_CONTAINS(REGEXP_EXTRACT(url, '.://(.[^/]+)/'), '.com$')
AND LENGTH(title) > 10
)
WHERE (source = 'github' OR source = 'nytimes' OR source = 'techcrunch')
"""
from google.cloud import bigquery
client = bigquery.Client()
df = client.query(query).to_dataframe()
df.to_csv('titles_full.csv', header=False, index=False, encoding='utf-8', sep=',')
I had to swap the column order:
COLUMNS = ['source', 'title']
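Since the CSV was written with header=False, the column names have to be supplied at read time in the order the query produced them (source first, then title); a minimal sketch using the csv module on a stand-in sample (the sample rows are illustrative, not from the real dataset):

```python
import csv
import io

COLUMNS = ['source', 'title']

# Stand-in for the first lines of titles_full.csv (source column first).
sample = "github,Show HN: a new tool\nnytimes,Some headline\n"

rows = [dict(zip(COLUMNS, row)) for row in csv.reader(io.StringIO(sample))]
```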
callbacks=[EarlyStopping(), TensorBoard(model_dir)],
without it, the loss was only minimised after 20 epochs.
"some stuff here about setting up Eval jobs"
@munnm, in the 05_resilience/continuous_eval.ipynb notebook, please set some value for the patience argument (preferably > 1) in the EarlyStopping() callback. Otherwise, it would only run for a single epoch. Users not familiar with the EarlyStopping callback API of Keras might not understand what's wrong at first glance.
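What patience buys can be sketched in plain Python: under the simplified model below, stopping triggers only once the monitored loss has failed to improve for more than `patience` consecutive epochs, so `patience=0` stops at the first non-improving epoch. This is a simplified illustration, not the Keras implementation:

```python
def stopped_epoch(losses, patience=0, min_delta=0.0):
    """Return the 0-based epoch at which early stopping would trigger,
    or None if training runs through all epochs."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            wait = 0              # improvement resets the counter
        else:
            wait += 1             # another epoch without improvement
            if wait > patience:   # allowance exceeded: stop here
                return epoch
    return None
```

With patience=0 a single noisy epoch ends training, while a larger patience lets the loss recover before stopping.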