Use distributed training

Brief introduction

TensorFlow itself is just a library; to run a distributed TensorFlow application, you must launch Python scripts on multiple nodes that together form a distributed computing cluster.

Xiaomi Cloud-ML supports standard distributed TensorFlow applications: users only need to write and submit the corresponding Python scripts, and usage is much the same as for the standalone version.

Code specifications

Since a distributed TensorFlow application starts multiple nodes, each node must know its own role, which is usually passed in via command-line parameters. However, the names and number of user-defined command-line parameters vary from script to script, so Cloud-ML instead passes the cluster and node information in through the DISTRIBUTED_CONFIG or TF_CONFIG environment variables. (Cloud-ML originally supported only distributed TensorFlow and used the TF_CONFIG environment variable; TF_CONFIG is retained for compatibility, and the variables were later unified as DISTRIBUTED_CONFIG.)

For example, in the case of 1 master, 1 PS, and 1 worker, the passed-in parameters would be:

DISTRIBUTED_CONFIG='{"cluster": {"master": ["127.0.0.1:3000"], "ps": ["127.0.0.1:3001"], "worker": ["127.0.0.1:3002"]}, "task": {"index": 0, "type": "ps"}}'
TF_CONFIG='{"cluster": {"master": ["127.0.0.1:3000"], "ps": ["127.0.0.1:3001"], "worker": ["127.0.0.1:3002"]}, "task": {"index": 0, "type": "ps"}}'

The user can then read these environment variables directly in the Python code to obtain the cluster spec, the task type, and the task index.

import json
import os

if os.environ.get('DISTRIBUTED_CONFIG', ""):
  env = json.loads(os.environ.get('DISTRIBUTED_CONFIG', '{}'))
  task_data = env.get('task', None)
  cluster_spec = env["cluster"]   # maps job name to a list of host:port addresses
  task_type = task_data["type"]   # this node's role, e.g. "ps" or "worker"
  task_index = task_data["index"] # this node's index within its role
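As a runnable illustration using only the standard library, the snippet below sets DISTRIBUTED_CONFIG in the shape shown above (the addresses are examples, not real Cloud-ML values) and parses it the same way:

```python
import json
import os

# Example value in the shape Cloud-ML injects (addresses are made up).
os.environ["DISTRIBUTED_CONFIG"] = json.dumps({
    "cluster": {"ps": ["127.0.0.1:3001"], "worker": ["127.0.0.1:3002"]},
    "task": {"index": 0, "type": "worker"},
})

env = json.loads(os.environ.get("DISTRIBUTED_CONFIG", "{}"))
cluster_spec = env["cluster"]      # job name -> list of host:port addresses
task_type = env["task"]["type"]    # this node's role
task_index = env["task"]["index"]  # this node's index within its role
```

These three values are exactly what a typical distributed TensorFlow script would otherwise receive through command-line flags.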

Code examples

We have also implemented a standard distributed TensorFlow sample application. The code is available at https://github.com/XiaoMi/cloud-ml-sdk/blob/master/cloud_ml_samples/tensorflow/linear_regression/trainer/task.py.

Run locally

To launch the distributed TensorFlow application locally, taking the sample code as an example, first open two terminals and then execute the following commands, one per terminal.

CUDA_VISIBLE_DEVICES='' DISTRIBUTED_CONFIG='{"cluster": {"ps": ["127.0.0.1:3001"], "worker": ["127.0.0.1:3002"]}, "task": {"index": 0, "type": "ps"}}' python -m trainer.task 

CUDA_VISIBLE_DEVICES='' DISTRIBUTED_CONFIG='{"cluster": {"ps": ["127.0.0.1:3001"], "worker": ["127.0.0.1:3002"]}, "task": {"index": 0, "type": "worker"}}' python -m trainer.task
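The two commands above differ only in the task entry of DISTRIBUTED_CONFIG. A small helper (hypothetical, not part of the Cloud-ML SDK) can generate the per-node environment value from a single cluster spec:

```python
import json

def make_distributed_configs(cluster):
    """For each node in the cluster spec, build the DISTRIBUTED_CONFIG
    JSON string to export before launching that node."""
    configs = {}
    for task_type, hosts in cluster.items():
        for task_index, _ in enumerate(hosts):
            configs[(task_type, task_index)] = json.dumps({
                "cluster": cluster,
                "task": {"index": task_index, "type": task_type},
            })
    return configs

cluster = {"ps": ["127.0.0.1:3001"], "worker": ["127.0.0.1:3002"]}
configs = make_distributed_configs(cluster)
# configs[("ps", 0)] and configs[("worker", 0)] correspond to the
# DISTRIBUTED_CONFIG values used in the two commands above.
```

This is also roughly what Cloud-ML does on your behalf when you submit a distributed job: every node sees the same cluster section but its own task section.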

Use Xiaomi Cloud-ML

If Xiaomi Cloud-ML is used, you just need to package the Python code, upload it, and add the -D flag when submitting. Then enter each task type's name, count, resources, and so on, as prompted:

cloudml jobs submit -n distributed -m trainer.task -u fds://cloud-ml/linear/trainer-1.0.tar.gz -c 0.3 -M 300M -D

After the distributed training job is submitted, you can see multiple tasks being launched via the command line, and you can view the logs of a specific worker to confirm that the distributed training completes normally.

cloudml jobs list

cloudml jobs logs distributed-worker-0

cloudml jobs logs distributed-ps-0

Parameters introduction

  • -D indicates distributed mode. Enter the distribution-related information as prompted. Universal (framework-agnostic) distribution is supported.

Legacy distributed parameters:

  • -p indicates the number of PS in the cluster. Only the TensorFlow Cloud-ML framework is currently supported.
  • -w indicates the number of workers in the cluster. Only the TensorFlow Cloud-ML framework is currently supported.