- You need to upload training/evaluation data to Cloud Storage.
- The ML Engine may not support the versions of the packages you need.
To run the example in the ch13 directory of the TensorFlow For Dummies downloadable code, you'll need to upload the mnist_test.tfrecords and mnist_train.tfrecords files to a Cloud Storage bucket. For example, if your project's ID is $(PROJECT_ID), you can create a bucket named $(PROJECT_ID)_mnist in the central United States with the following command:
gsutil mb -c regional -l us-central1 gs://$(PROJECT_ID)_mnist

After you create the bucket, you can upload the two MNIST files to the bucket with the following command:

gsutil cp mnist_test.tfrecords mnist_train.tfrecords gs://$(PROJECT_ID)_mnist

After the command executes, it's a good idea to check that Cloud Storage created objects for the two files. You can verify this by running the command gsutil ls gs://$(PROJECT_ID)_mnist.
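Rather than typing your project ID by hand, one option is to read it from your active gcloud configuration. A minimal sketch, assuming gcloud is already authenticated and configured with a default project:

# Read the active project ID from the gcloud configuration
PROJECT_ID=$(gcloud config get-value project)

# Create the bucket and upload the two TFRecord files
gsutil mb -c regional -l us-central1 gs://${PROJECT_ID}_mnist
gsutil cp mnist_test.tfrecords mnist_train.tfrecords gs://${PROJECT_ID}_mnist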
Running a remote training job
After you upload your training/evaluation data, you can launch a training job with the following command:

gcloud ml-engine jobs submit training $(JOB_ID)

$(JOB_ID) provides a unique identifier for the training job. After you launch the job, you can use this ID to check on the job's status.

In addition to identifying the job, you need to tell the ML Engine where to find your package and your input data. You also need to tell the engine where it should store output files. You can provide this information by following the command with flags, and this table lists each of them.
Flags for Cloud Training Jobs

| Flag | Description |
| --- | --- |
| --module-name=MODULE_NAME | Identifies the module to execute |
| --package-path=PACKAGE_PATH | Path to the Python package containing the module to execute |
| --job-dir=JOB_DIR | Path to store output files |
| --staging-bucket=STAGING_BUCKET | Bucket to hold the package during operation |
| --region=REGION | The region of the machine learning job |
| --runtime-version=RUNTIME_VERSION | The version of the ML Engine for the job |
| --stream-logs | Block until the job completes and stream the logs |
| --scale-tier=SCALE_TIER | The job's operating environment |
| --config=CONFIG | Path to a job configuration file |
The --module-name, --package-path, and --job-dir flags serve the same purposes as the similarly named flags for local training jobs. The --staging-bucket flag identifies the bucket that holds the deployed package. The --region flag accepts one of the regions that the ML Engine supports, such as us-central1.
By default, deployed applications run on the latest stable version of the ML Engine. You can configure this by setting the --runtime-version flag. You can get the list of versions at cloud.google.com/ml-engine/docs/runtime-version-list.
It's a good idea to set the --stream-logs flag because it forces the command to block until the job completes. As the job runs, the console prints messages from the remote log. Aborting the command (Ctrl-C) doesn't affect the remote job.
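Putting these flags together, here's a sketch of a complete submission. The job ID (mnist_train_001), the module path (trainer.task under a local trainer/ directory), and the runtime version are illustrative assumptions, not values taken from the example code:

# All names below are illustrative placeholders
gcloud ml-engine jobs submit training mnist_train_001 \
    --module-name=trainer.task \
    --package-path=trainer/ \
    --job-dir=gs://$(PROJECT_ID)_mnist/output \
    --staging-bucket=gs://$(PROJECT_ID)_mnist \
    --region=us-central1 \
    --runtime-version=1.8 \
    --scale-tier=basic \
    --stream-logs

Because --stream-logs is set, the command prints the remote log to the console until the job finishes.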
By default, applications uploaded to the ML Engine can run only on a single CPU. You can configure the execution environment by setting the --scale-tier flag to one of the values listed in this table.
Scale Tier Values

| Value | Description |
| --- | --- |
| basic | A single worker on a CPU |
| basic-gpu | A single worker with a GPU |
| basic-tpu | A single worker instance with a Cloud TPU |
| standard-1 | Many workers and a few parameter servers |
| premium-1 | A large number of workers and many parameter servers |
| custom | Define a cluster |
If you set --scale-tier to basic-gpu, you can execute your code on an Nvidia Tesla K80 GPU, which has 4,992 CUDA cores and 24 GB of GDDR5 memory. If you set --scale-tier to basic-tpu, you can execute your code on one or more of Google's Tensor Processing Units (TPUs). At the time of this writing, Google restricts TPU access to developers in its Cloud TPU program.
If you set --scale-tier to standard-1 or premium-1, you can run your job on a cluster of processors. If you set --scale-tier to custom, you can configure the cluster yourself by setting the --config flag to the name of a configuration file, as in the sketch below.
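To give a sense of what such a file looks like, here's a minimal sketch of a custom-tier configuration, which describes the cluster in a trainingInput block. The machine types and worker counts below are illustrative assumptions rather than recommended settings:

# config.yaml -- illustrative custom cluster definition
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m    # machine type for the master
  workerType: standard           # machine type for each worker
  parameterServerType: standard  # machine type for each parameter server
  workerCount: 4                 # placeholder count
  parameterServerCount: 2        # placeholder count

You would then launch the job with --scale-tier=custom and --config=config.yaml.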
Running a remote prediction job
If you upload a SavedModel to a Cloud Storage bucket, you can launch a prediction job with the following command:

gcloud ml-engine jobs submit prediction $(JOB_ID)

This command accepts flags that specify where the prediction job should read its input and write its output. This table lists each of these flags.
Flags for Cloud Prediction Jobs

| Flag | Description |
| --- | --- |
| --model-dir=MODEL_DIR | Path of the bucket containing the saved model |
| --model=MODEL | Name of the model to use for prediction |
| --input-paths=INPUT_PATH,[INPUT_PATH,…] | Path to the input data to use for prediction |
| --data-format=DATA_FORMAT | Format of the input data |
| --output-path=OUTPUT_PATH | Path to store the prediction results |
| --region=REGION | The region of the machine learning job |
| --batch-size=BATCH_SIZE | Number of records per batch |
| --max-worker-count=MAX_WORKER_COUNT | The maximum number of workers to employ for parallel processing |
| --runtime-version=RUNTIME_VERSION | The version of the ML Engine for the job |
| --version=VERSION | Version of the model to be used |
When you launch a remote prediction job, you must identify the model's name with --model or the bucket containing the model files with --model-dir. You also need to identify the location of the input files with --input-paths.
The ML Engine accepts prediction input data in one of three formats. You can identify the format of your data by setting --data-format to one of the following values:

- text: Text files with one line per instance
- tf-record: TFRecord files
- tf-record-gzip: GZIP-compressed TFRecord files
You should also set --output-path, which tells the ML Engine which Cloud Storage bucket should contain the prediction results.
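As a sketch that follows the placeholder conventions above (the model name mnist_classifier is an assumption, and the input path reuses the bucket created earlier), a complete prediction submission might look like this:

# All names below are illustrative placeholders
gcloud ml-engine jobs submit prediction mnist_predict_001 \
    --model=mnist_classifier \
    --input-paths=gs://$(PROJECT_ID)_mnist/mnist_test.tfrecords \
    --data-format=tf-record \
    --output-path=gs://$(PROJECT_ID)_mnist/predictions \
    --region=us-central1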
Viewing a job's status
After you launch a job, you can view the job's status in two ways. First, you can use gcloud commands, such as the following:

- gcloud ml-engine jobs list: Lists the jobs associated with the default project along with their statuses and creation times
- gcloud ml-engine jobs describe $(JOB_ID) --summarize: Provides detailed information about a specific job in human-readable format
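If you didn't set --stream-logs when you submitted the job, you can still attach to a running job's log output afterward. A minimal example, assuming the placeholder job ID used earlier:

# Stream a running job's log messages to the console
gcloud ml-engine jobs stream-logs mnist_train_001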
Second, you can check on jobs through the web-based console. If you click the ML Engine→Jobs option, the page lists all the jobs associated with the project. If you click a job's name, a new page provides detailed information about the job's execution, including its status and any log messages.