The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training on Kubernetes, in either distributed or non-distributed mode. Please check out here for an introduction to DGL and the DGL distributed training philosophy.
- Kubernetes >= 1.16
You can deploy the operator with default settings by running the following commands:
git clone https://github.com/Qihoo360/dgl-operator
cd dgl-operator
kubectl create -f deploy/v1alpha1/dgl-operator.yaml
You can check whether the DGL Job custom resource is installed via:
kubectl get crd
The output should include dgljobs.qihoo.net, like the following:
NAME                AGE
...
dgljobs.qihoo.net   1m
...
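The check above can also be scripted, for example in a setup script that should stop when the operator is not installed. The following is a minimal sketch, not part of the DGL Operator: `has_crd` is a hypothetical helper name, and it assumes the CRD name appears in the first column of the `kubectl get crd` output, as in the listing above.

```shell
# Hypothetical helper (not provided by the DGL Operator).
# Reads `kubectl get crd` output on stdin and succeeds only if the
# first column of some row is exactly the CRD name given as $1.
has_crd() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}
```

It could be used as `kubectl get crd | has_crd dgljobs.qihoo.net && echo "DGLJob CRD is installed"`. Comparing the whole first column avoids false matches on CRDs whose names merely contain the same text.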
Creating a DGL Job
You can create a DGL job by defining a DGLJob config file. See the GraphSAGE.yaml or GraphSAGE_dist.yaml example config files for launching a single-node or multi-node GraphSAGE training job, respectively. You may change the config file based on your requirements.
# standalone GraphSAGE
cat examples/v1alpha1/GraphSAGE.yaml
# or a distributed version
cat examples/v1alpha1/GraphSAGE_dist.yaml
Deploy the DGLJob resource to start training:
# standalone GraphSAGE
kubectl create -f examples/v1alpha1/GraphSAGE.yaml
# or a distributed version
kubectl create -f examples/v1alpha1/GraphSAGE_dist.yaml
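Once the DGLJob resource has been created, you may want to confirm that it exists and see which pods are running. The following is a minimal sketch under a few assumptions: the resource plural `dgljobs` is inferred from the CRD name dgljobs.qihoo.net, the job is assumed to live in the current namespace, and both the `KUBECTL` variable and the `show_dgl_resources` function are hypothetical names used for illustration only.

```shell
# Hypothetical helper (not provided by the DGL Operator).
# KUBECTL can be overridden, e.g. to point at a different binary;
# it defaults to the kubectl found on PATH.
KUBECTL="${KUBECTL:-kubectl}"

show_dgl_resources() {
  # List DGLJob resources; "dgljobs" is the plural taken from the
  # CRD name dgljobs.qihoo.net.
  "$KUBECTL" get dgljobs
  # List pods in the current namespace, which is where the pods of
  # the job are assumed to appear.
  "$KUBECTL" get pods
}
```

Calling `show_dgl_resources` after `kubectl create` should list the new job. The exact columns and pod names depend on the operator version, so they are not shown here.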
Please check out these previous works that helped inspire the creation of DGL Operator