Service Deployment

The Dynamic-Graph-Service can be deployed on a Kubernetes cluster using the Helm package manager.

Prerequisites

  • Kubernetes 1.19+

  • Helm 3.2.0+

  • PV provisioner support in the underlying infrastructure

Deploy Kafka Queue Service

The Dynamic-Graph-Service uses Kafka to store streaming graph updates and sampled results. Before deploying a DGS service, you must first deploy a Kafka cluster and create the following Kafka topics:

  • dl2spl: receives graph updates from your data-source loaders; consumed by sampling workers.

  • spl2srv: receives sampled results from sampling workers; consumed by serving workers.
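
With the standard Kafka CLI, creating the two topics could look like the sketch below. The broker address is a placeholder, and the topic names, partition counts, and replication factor must match your cluster and the chart's Kafka parameters; the commands are built as strings so you can review them before running them against your cluster.

```shell
# Sketch of the kafka-topics.sh invocations for the two required topics.
# BROKER and the replication factor are placeholders; align the topic names
# and partition counts with the chart's Kafka parameters.
BROKER="my-kafka:9092"
DL2SPL_CREATE="kafka-topics.sh --create --bootstrap-server $BROKER \
  --topic record-batches --partitions 4 --replication-factor 2"
SPL2SRV_CREATE="kafka-topics.sh --create --bootstrap-server $BROKER \
  --topic sample-batches --partitions 4 --replication-factor 2"
echo "$DL2SPL_CREATE"    # review, then run against your cluster
echo "$SPL2SRV_CREATE"
```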

Installing the Chart

Get repo info:

helm repo add dgs https://graphlearn.oss-cn-hangzhou.aliyuncs.com/charts/dgs/
helm repo update

Install the chart with release name my-release:

helm install my-release dgs/dgs \
    --set-file graphSchema=/path/to/schema/json/file \
    --set kafka.dl2spl.brokers="{your_kafka_brokers_of_dl2spl}" \
    --set kafka.dl2spl.topic="your_kafka_topic_of_dl2spl" \
    --set kafka.dl2spl.partitions=your_kafka_partitions_of_dl2spl \
    --set kafka.spl2srv.brokers="{your_kafka_brokers_of_spl2srv}" \
    --set kafka.spl2srv.topic="your_kafka_topic_of_spl2srv" \
    --set kafka.spl2srv.partitions=your_kafka_partitions_of_spl2srv

The graph schema must be specified as a JSON string or file via the graphSchema parameter. You can follow the example schema file to write your customized graph schema.

The connection info for the dl2spl and spl2srv topics of your pre-deployed Kafka cluster must be configured when you install the chart; refer to the Kafka service parameters section.
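
Instead of repeating `--set` flags, the same Kafka settings can be collected in a values file; the broker addresses below are placeholders:

```yaml
# my-values.yaml -- Kafka overrides (broker addresses are placeholders)
kafka:
  dl2spl:
    brokers: ["my-kafka-0:9092", "my-kafka-1:9092"]
    topic: "record-batches"
    partitions: 4
  spl2srv:
    brokers: ["my-kafka-0:9092", "my-kafka-1:9092"]
    topic: "sample-batches"
    partitions: 4
```

Then install with helm install my-release dgs/dgs -f my-values.yaml --set-file graphSchema=/path/to/schema/json/file.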

These commands deploy Dynamic-Graph-Service on the Kubernetes cluster in the default configuration. The Parameters section lists the parameters that can be configured during installation.

Tip: List all releases using helm list

Uninstalling the Chart

To uninstall/delete the my-release deployment:

helm delete my-release

The command removes all the Kubernetes components associated with the chart and deletes the release.

Parameters

Global parameters

| Name | Description | Value |
| ---- | ----------- | ----- |
| `kubeVersion` | Override Kubernetes version | `""` |
| `nameOverride` | String to partially override `common.names.fullname` | `""` |
| `fullnameOverride` | String to fully override `common.names.fullname` | `""` |
| `clusterDomain` | Default Kubernetes cluster domain | `cluster.local` |
| `commonLabels` | Labels to add to all deployed objects | `{}` |
| `commonAnnotations` | Annotations to add to all deployed objects | `{}` |
| `graphSchema` | The JSON string of the graph schema; must be set during installation | `""` |
| `configPath` | The service ConfigMap mount path | `"/dgs_conf"` |
| `glog.toConsole` | Whether program logs are written to standard error as well as to files | `false` |

Kafka service parameters

| Name | Description | Value |
| ---- | ----------- | ----- |
| `kafka.dl2spl.brokers` | Kafka brokers for processed graph updates from the dataloader to sampling workers | `["localhost:9092"]` |
| `kafka.dl2spl.topic` | Kafka topic of (dataloader -> sampling workers) | `"record-batches"` |
| `kafka.dl2spl.partitions` | Topic partition number of (dataloader -> sampling workers) | `4` |
| `kafka.spl2srv.brokers` | Kafka brokers for sampled updates from sampling workers to serving workers | `["localhost:9092"]` |
| `kafka.spl2srv.topic` | Kafka topic of (sampling workers -> serving workers) | `"sample-batches"` |
| `kafka.spl2srv.partitions` | Topic partition number of (sampling workers -> serving workers) | `4` |

Image parameters

| Name | Description | Value |
| ---- | ----------- | ----- |
| `image.registry` | Core service image registry | `"graphlearn"` |
| `image.repository` | Core service image repository | `"dgs-core"` |
| `image.tag` | Core service image tag (immutable tags are recommended) | `"1.0.0"` |
| `image.pullPolicy` | Core service image pull policy | `IfNotPresent` |
| `image.pullSecrets` | Specify docker-registry secret names as an array | `[]` |

FrontEnd parameters

| Name | Description | Value |
| ---- | ----------- | ----- |
| `frontend.ingressHostName` | The host name of the external ingress | `"dynamic-graph-service.info"` |
| `frontend.limitConnections` | The number of concurrent connections allowed from a single IP address | `10` |

Common Pod parameters

All worker types share the same pod-assignment parameters, where `${worker-type}` is one of `coordinator`, `sampling`, or `serving`.

| Name | Description | Value |
| ---- | ----------- | ----- |
| `${worker-type}.updateStrategy.type` | Pod deployment strategy type | `"RollingUpdate"` |
| `${worker-type}.updateStrategy.rollingUpdate` | Pod deployment rolling update configuration parameters | `{}` |
| `${worker-type}.podLabels` | Extra labels for worker pods | `{}` |
| `${worker-type}.podAnnotations` | Extra annotations for worker pods | `{}` |
| `${worker-type}.podAffinityPreset` | Pod affinity preset. Ignored if `${worker-type}.affinity` is set. Allowed values: `soft` or `hard` | `""` |
| `${worker-type}.podAntiAffinityPreset` | Pod anti-affinity preset. Ignored if `${worker-type}.affinity` is set. Allowed values: `soft` or `hard` | `"soft"` |
| `${worker-type}.nodeAffinityPreset.type` | Node affinity preset type. Ignored if `${worker-type}.affinity` is set. Allowed values: `soft` or `hard` | `""` |
| `${worker-type}.nodeAffinityPreset.key` | Node label key to match. Ignored if `${worker-type}.affinity` is set | `""` |
| `${worker-type}.nodeAffinityPreset.values` | Node label values to match. Ignored if `${worker-type}.affinity` is set | `[]` |
| `${worker-type}.affinity` | Affinity for pod assignment | `{}` |
| `${worker-type}.nodeSelector` | Node labels for pod assignment | `{}` |
| `${worker-type}.tolerations` | Tolerations for pod assignment | `[]` |
| `${worker-type}.resources.limits` | The resource limits for the container | `{}` |
| `${worker-type}.resources.requests` | The requested resources for the container | `{}` |
| `${worker-type}.persistence.enabled` | Enable worker checkpoint persistence using a PVC | `false` |
| `${worker-type}.persistence.storageClass` | PVC storage class for the checkpoint data volume | `""` |
| `${worker-type}.persistence.accessModes` | Persistent Volume access modes | `["ReadWriteOnce"]` |
| `${worker-type}.persistence.size` | PVC storage request for the checkpoint data volume | `20Gi` |
| `${worker-type}.persistence.annotations` | Annotations for the PVC | `{}` |
| `${worker-type}.persistence.selector` | Selector to match an existing Persistent Volume for the checkpoint data PVC | `{}` |
| `${worker-type}.persistence.mountPath` | Mount path of the checkpoint data volume | `"/${worker-type}_checkpoints"` |
| `${worker-type}.livenessProbe.enabled` | Enable livenessProbe on `${worker-type}` containers | `true` |
| `${worker-type}.livenessProbe.initialDelaySeconds` | Initial delay seconds for livenessProbe | `10` |
| `${worker-type}.livenessProbe.periodSeconds` | Period seconds for livenessProbe | `10` |
| `${worker-type}.livenessProbe.timeoutSeconds` | Timeout seconds for livenessProbe | `1` |
| `${worker-type}.livenessProbe.failureThreshold` | Failure threshold for livenessProbe | `3` |
| `${worker-type}.livenessProbe.successThreshold` | Success threshold for livenessProbe | `1` |
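
For example, checkpoint persistence and a hard anti-affinity preset could be enabled for sampling workers with a values fragment like the following; the storage class name and size are placeholders:

```yaml
# Values sketch: persist sampling-worker checkpoints and spread the pods
# across nodes (storage class name and size are placeholders).
sampling:
  podAntiAffinityPreset: "hard"
  persistence:
    enabled: true
    storageClass: "my-storage-class"
    size: 50Gi
```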

Coordinator parameters

| Name | Description | Value |
| ---- | ----------- | ----- |
| `coordinator.readinessProbe.enabled` | Enable readinessProbe on coordinator containers | `true` |
| `coordinator.readinessProbe.initialDelaySeconds` | Initial delay seconds for readinessProbe | `5` |
| `coordinator.readinessProbe.periodSeconds` | Period seconds for readinessProbe | `10` |
| `coordinator.readinessProbe.timeoutSeconds` | Timeout seconds for readinessProbe | `1` |
| `coordinator.readinessProbe.failureThreshold` | Failure threshold for readinessProbe | `6` |
| `coordinator.readinessProbe.successThreshold` | Success threshold for readinessProbe | `1` |
| `coordinator.rpcService.port` | Coordinator headless RPC service port for internal connections | `50051` |
| `coordinator.rpcService.clusterIP` | Static clusterIP (or `None`) for the Coordinator headless RPC service | `""` |
| `coordinator.rpcService.sessionAffinity` | Control whether internal RPC requests go to the same pod or round-robin | `None` |
| `coordinator.rpcService.annotations` | Additional custom annotations for the Coordinator headless RPC service | `{}` |
| `coordinator.httpService.port` | Coordinator HTTP service port for external admin requests | `8080` |
| `coordinator.httpService.sessionAffinity` | Control whether external HTTP requests go to the same pod or round-robin | `None` |
| `coordinator.httpService.externalTrafficPolicy` | Coordinator HTTP service external traffic policy | `Cluster` |
| `coordinator.httpService.annotations` | Additional custom annotations for the Coordinator HTTP service | `{}` |
| `coordinator.workdir` | Local ephemeral storage mount path for the Coordinator working directory | `"/coordinator_workdir"` |
| `coordinator.connectTimeoutSeconds` | The max timeout in seconds for other workers to connect to the Coordinator | `60` |
| `coordinator.heartbeatIntervalSeconds` | The heartbeat interval in seconds at which other workers report statistics to the Coordinator | `10` |

Sampling parameters

| Name | Description | Value |
| ---- | ----------- | ----- |
| `sampling.workerNum` | Number of Sampling workers | `2` |
| `sampling.workdir` | Local ephemeral storage mount path for the Sampling working directory | `"/sampling_workdir"` |
| `sampling.actorLocalShardNum` | Local computing shard number for each Sampling worker pod | `4` |
| `sampling.dataPartitionNum` | The total partition number of data across all Sampling workers | `8` |
| `sampling.rocksdbEnv.highPriorityThreads` | The thread number of high-priority rocksdb background tasks | `2` |
| `sampling.rocksdbEnv.lowPriorityThreads` | The thread number of low-priority rocksdb background tasks | `2` |
| `sampling.sampleStore.memtableRep` | The rocksdb memtable structure type of the sample store | `"hashskiplist"` |
| `sampling.sampleStore.hashBucketCount` | The hash bucket count of the sample store memtable | `1048576` |
| `sampling.sampleStore.skipListLookahead` | The look-ahead factor of the sample store memtable | `0` |
| `sampling.sampleStore.blockCacheCapacity` | The capacity (bytes) of the sample store block cache | `67108864` |
| `sampling.sampleStore.ttlHours` | The TTL in hours for sampling data in the sample store | `1200` |
| `sampling.subscriptionTable.memtableRep` | The rocksdb memtable structure type of the Sampling subscription table | `"hashskiplist"` |
| `sampling.subscriptionTable.hashBucketCount` | The hash bucket count of the subscription table memtable | `1048576` |
| `sampling.subscriptionTable.skipListLookahead` | The look-ahead factor of the subscription table memtable | `0` |
| `sampling.subscriptionTable.blockCacheCapacity` | The capacity (bytes) of the subscription table block cache | `67108864` |
| `sampling.subscriptionTable.ttlHours` | The TTL in hours for sampling rules in the subscription table | `1200` |
| `sampling.recordPolling.threadNum` | The thread number for consuming graph updates from Kafka queues | `2` |
| `sampling.recordPolling.retryIntervalMs` | The retry interval (ms) when no record has been polled | `100` |
| `sampling.recordPolling.processConcurrency` | The max processing concurrency for polled records | `100` |
| `sampling.samplePublishing.producerPoolSize` | The max number of Kafka producers for sampling results | `2` |
| `sampling.samplePublishing.maxProduceRetryTimes` | The maximum number of retries for producing a Kafka message | `3` |
| `sampling.samplePublishing.callbackPollIntervalMs` | The interval (ms) for polling async producing callbacks | `100` |
| `sampling.logging.dataLogPeriod` | How many graph update batches are processed between two logs | `10` |
| `sampling.logging.ruleLogPeriod` | How many sampling rules are processed between two logs | `10` |

Serving parameters

| Name | Description | Value |
| ---- | ----------- | ----- |
| `serving.workerNum` | Number of Serving workers; each Serving worker is an independent pod | `2` |
| `serving.readinessProbe.enabled` | Enable readinessProbe on Serving worker containers | `true` |
| `serving.readinessProbe.initialDelaySeconds` | Initial delay seconds for readinessProbe | `30` |
| `serving.readinessProbe.periodSeconds` | Period seconds for readinessProbe | `10` |
| `serving.readinessProbe.timeoutSeconds` | Timeout seconds for readinessProbe | `1` |
| `serving.readinessProbe.failureThreshold` | Failure threshold for readinessProbe | `6` |
| `serving.readinessProbe.successThreshold` | Success threshold for readinessProbe | `1` |
| `serving.httpService.port` | The external port of the Serving HTTP service for inference queries | `10000` |
| `serving.httpService.sessionAffinity` | Control whether HTTP requests go to the same pod or round-robin | `None` |
| `serving.httpService.externalTrafficPolicy` | The external traffic policy of the Serving HTTP service | `Cluster` |
| `serving.httpService.annotations` | Additional custom annotations for the Serving HTTP service | `{}` |
| `serving.workdir` | Local ephemeral storage mount path for the Serving working directory | `"/serving_workdir"` |
| `serving.actorLocalShardNum` | Local computing shard number for each Serving worker pod | `4` |
| `serving.dataPartitionNum` | The partition number of data for each Serving worker | `4` |
| `serving.rocksdbEnv.highPriorityThreads` | The thread number of high-priority rocksdb background tasks | `2` |
| `serving.rocksdbEnv.lowPriorityThreads` | The thread number of low-priority rocksdb background tasks | `2` |
| `serving.sampleStore.inMemoryMode` | Whether to open the sample store rocksdb in in-memory mode | `false` |
| `serving.sampleStore.memtableRep` | The rocksdb memtable structure type of the sample store | `"hashskiplist"` |
| `serving.sampleStore.hashBucketCount` | The hash bucket count of the sample store memtable | `1048576` |
| `serving.sampleStore.skipListLookahead` | The look-ahead factor of the sample store memtable | `0` |
| `serving.sampleStore.blockCacheCapacity` | The capacity (bytes) of the sample store block cache | `67108864` |
| `serving.sampleStore.ttlHours` | The TTL in hours for serving data in the sample store | `1200` |
| `serving.recordPolling.threadNum` | The thread number for consuming sample updates from Kafka queues | `2` |
| `serving.recordPolling.retryIntervalMs` | The retry interval (ms) when no record has been polled | `100` |
| `serving.recordPolling.processConcurrency` | The max processing concurrency for polled records | `100` |
| `serving.logging.dataLogPeriod` | How many sample update batches are processed between two logs | `10` |
| `serving.logging.requestLogPeriod` | The interval, in incoming inference query requests, for logging serving statistics | `1` |
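
As an illustration, both worker pools could be scaled with a values override like the one below; the counts are illustrative, not recommendations:

```yaml
# scale-values.yaml -- sketch of worker-pool scaling (counts are illustrative)
sampling:
  workerNum: 4
serving:
  workerNum: 4
```

Apply it to a running release with helm upgrade my-release dgs/dgs --reuse-values -f scale-values.yaml.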

Other Parameters

| Name | Description | Value |
| ---- | ----------- | ----- |
| `serviceAccount.create` | Enable creation of a ServiceAccount for pods | `true` |
| `serviceAccount.name` | The name of the ServiceAccount to use. If not set and `create` is `true`, a name is generated | `""` |
| `serviceAccount.automountServiceAccountToken` | Allow auto-mounting of the ServiceAccountToken on the created ServiceAccount | `true` |
| `serviceAccount.annotations` | Additional custom annotations for the ServiceAccount | `{}` |
| `rbac.create` | Whether to create & use RBAC resources | `false` |