# Service Deployment

The Dynamic-Graph-Service can be deployed on a [Kubernetes](https://kubernetes.io) cluster using the [Helm](https://helm.sh) package manager.

## Prerequisites

- Kubernetes 1.19+
- Helm 3.2.0+
- PV provisioner support in the underlying infrastructure

## Deploy Kafka Queue Service

The Dynamic-Graph-Service uses the [Kafka](https://kafka.apache.org/) queue service to store streaming graph updates and sampled results.
Before deploying a DGS service, you must first deploy a Kafka cluster and create the following Kafka queues:
- `dl2spl`: receives graph updates from your data-source loaders; consumed by sampling workers.
- `spl2srv`: receives sampled results from sampling workers; consumed by serving workers.

## Installing the Chart

Get repo info:

```shell
helm repo add dgs https://graphlearn.oss-cn-hangzhou.aliyuncs.com/charts/dgs/
helm repo update
```

Install the chart with release name `my-release`:

```shell
helm install my-release dgs/dgs \
    --set-file graphSchema=/path/to/schema/json/file \
    --set kafka.dl2spl.brokers=[your kafka broker list of dl2spl] \
    --set kafka.dl2spl.topic="your_kafka_topic_of_dl2spl" \
    --set kafka.dl2spl.partitions=your_kafka_partitions_of_dl2spl \
    --set kafka.spl2srv.brokers=[your kafka broker list of spl2srv] \
    --set kafka.spl2srv.topic="your_kafka_topic_of_spl2srv" \
    --set kafka.spl2srv.partitions=your_kafka_partitions_of_spl2srv
```

The graph schema must be specified as a JSON string or file through the `graphSchema` parameter.
You can follow the [example](https://github.com/alibaba/graph-learn/blob/master/dynamic_graph_service/conf/u2i/schema.u2i.json) schema file to write your own graph schema.

The `dl2spl` and `spl2srv` settings of your pre-deployed Kafka cluster should be configured when you install the chart; refer to [Kafka service parameters](#kafka-service-parameters).

These commands deploy Dynamic-Graph-Service on the Kubernetes cluster in the default configuration.
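
The Kafka settings passed with `--set` in the install command can also be kept in a values file, which is easier to review when the broker list is long. The following is a minimal sketch, not part of the chart itself: the file name `dgs-kafka-values.yaml` and the broker address `kafka-0.example.internal:9092` are placeholders chosen for illustration, while the topic names and partition counts are the chart defaults.

```shell
# Write a Helm values file holding the Kafka settings.
# The broker address below is a placeholder; replace it with your own brokers.
cat > dgs-kafka-values.yaml <<'EOF'
kafka:
  dl2spl:
    brokers: ["kafka-0.example.internal:9092"]
    topic: "record-batches"
    partitions: 4
  spl2srv:
    brokers: ["kafka-0.example.internal:9092"]
    topic: "sample-batches"
    partitions: 4
EOF

# Show the file so it can be checked before installation.
cat dgs-kafka-values.yaml
```

The file can then be supplied at install time with Helm's `-f` option, for example `helm install my-release dgs/dgs --set-file graphSchema=/path/to/schema/json/file -f dgs-kafka-values.yaml`.
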
The [Parameters](#parameters) section lists the parameters that can be configured during installation.

> **Tip**: List all releases using `helm list`

## Uninstalling the Chart

To uninstall/delete the `my-release` deployment:

```shell
helm delete my-release
```

The command removes all the Kubernetes components associated with the chart and deletes the release.

## Parameters

### Global parameters

| Name | Description | Value |
| --- | --- | --- |
| `kubeVersion` | Override Kubernetes version | `""` |
| `nameOverride` | String to partially override common.names.fullname | `""` |
| `fullnameOverride` | String to fully override common.names.fullname | `""` |
| `clusterDomain` | Default Kubernetes cluster domain | `cluster.local` |
| `commonLabels` | Labels to add to all deployed objects | `{}` |
| `commonAnnotations` | Annotations to add to all deployed objects | `{}` |
| `graphSchema` | The JSON string of the graph schema, **must** be set during installation | `""` |
| `configPath` | The service configmap mount path | `"/dgs_conf"` |
| `glog.toConsole` | Specify whether program logs are written to standard error as well as to files | `false` |

### Kafka service parameters

| Name | Description | Value |
| --- | --- | --- |
| `kafka.dl2spl.brokers` | Kafka brokers of processed graph updates from dataloader to sampling workers | `["localhost:9092"]` |
| `kafka.dl2spl.topic` | Kafka topic of (dataloader -> sampling workers) | `"record-batches"` |
| `kafka.dl2spl.partitions` | Topic partition number of (dataloader -> sampling workers) | `4` |
| `kafka.spl2srv.brokers` | Kafka brokers of sampled updates from sampling workers to serving workers | `["localhost:9092"]` |
| `kafka.spl2srv.topic` | Kafka topic of (sampling workers -> serving workers) | `"sample-batches"` |
| `kafka.spl2srv.partitions` | Topic partition number of (sampling workers -> serving workers) | `4` |

### Image parameters

| Name | Description | Value |
| --- | --- | --- |
| `image.registry` | Core service image registry | `"graphlearn"` |
| `image.repository` | Core service image repository | `"dgs-core"` |
| `image.tag` | Core service image tag (immutable tags are recommended) | `"1.0.0"` |
| `image.pullPolicy` | Core service image pull policy | `IfNotPresent` |
| `image.pullSecrets` | Specify docker-registry secret names as an array | `[]` |

### FrontEnd parameters

| Name | Description | Value |
| --- | --- | --- |
| `frontend.ingressHostName` | The host name of external ingress | `"dynamic-graph-service.info"` |
| `frontend.limitConnections` | The number of concurrent connections allowed from a single IP address | `10` |

### Common Pod parameters

All workers share the same common pod assignment parameters, where `${worker-type}` is one of `coordinator`, `sampling`, or `serving`.

| Name | Description | Value |
| --- | --- | --- |
| `${worker-type}.updateStrategy.type` | Pod deployment strategy type | `"RollingUpdate"` |
| `${worker-type}.updateStrategy.rollingUpdate` | Pod deployment rolling update configuration parameters | `{}` |
| `${worker-type}.podLabels` | Extra labels for worker pod | `{}` |
| `${worker-type}.podAnnotations` | Extra annotations for worker pod | `{}` |
| `${worker-type}.podAffinityPreset` | Pod affinity preset. Ignored if `${worker-type}.affinity` is set. Allowed values: `soft` or `hard` | `""` |
| `${worker-type}.podAntiAffinityPreset` | Pod anti-affinity preset. Ignored if `${worker-type}.affinity` is set. Allowed values: `soft` or `hard` | `"soft"` |
| `${worker-type}.nodeAffinityPreset.type` | Node affinity preset type. Ignored if `${worker-type}.affinity` is set. Allowed values: `soft` or `hard` | `""` |
| `${worker-type}.nodeAffinityPreset.key` | Node label key to match. Ignored if `${worker-type}.affinity` is set. | `""` |
| `${worker-type}.nodeAffinityPreset.values` | Node label values to match. Ignored if `${worker-type}.affinity` is set. | `[]` |
| `${worker-type}.affinity` | Affinity for pod assignment | `{}` |
| `${worker-type}.nodeSelector` | Node labels for pod assignment | `{}` |
| `${worker-type}.tolerations` | Toleration for pod assignment | `[]` |
| `${worker-type}.resources.limits` | The resources limits for the container | `{}` |
| `${worker-type}.resources.requests` | The requested resources for the container | `{}` |
| `${worker-type}.persistence.enabled` | Enable worker checkpoints persistence using PVC | `false` |
| `${worker-type}.persistence.storageClass` | PVC Storage Class for checkpoint data volume | `""` |
| `${worker-type}.persistence.accessModes` | Persistent Volume Access Modes | `["ReadWriteOnce"]` |
| `${worker-type}.persistence.size` | PVC Storage Request for checkpoint data volume | `20Gi` |
| `${worker-type}.persistence.annotations` | Annotations for the PVC | `{}` |
| `${worker-type}.persistence.selector` | Selector to match an existing Persistent Volume for checkpoint data PVC. | `{}` |
| `${worker-type}.persistence.mountPath` | Mount path of the checkpoint data volume | `"/${worker-type}_checkpoints"` |
| `${worker-type}.livenessProbe.enabled` | Enable livenessProbe on ${worker-type} containers | `true` |
| `${worker-type}.livenessProbe.initialDelaySeconds` | Initial delay seconds for livenessProbe | `10` |
| `${worker-type}.livenessProbe.periodSeconds` | Period seconds for livenessProbe | `10` |
| `${worker-type}.livenessProbe.timeoutSeconds` | Timeout seconds for livenessProbe | `1` |
| `${worker-type}.livenessProbe.failureThreshold` | Failure threshold for livenessProbe | `3` |
| `${worker-type}.livenessProbe.successThreshold` | Success threshold for livenessProbe | `1` |

### Coordinator parameters

| Name | Description | Value |
| --- | --- | --- |
| `coordinator.readinessProbe.enabled` | Enable readinessProbe on coordinator containers | `true` |
| `coordinator.readinessProbe.initialDelaySeconds` | Initial delay seconds for readinessProbe | `5` |
| `coordinator.readinessProbe.periodSeconds` | Period seconds for readinessProbe | `10` |
| `coordinator.readinessProbe.timeoutSeconds` | Timeout seconds for readinessProbe | `1` |
| `coordinator.readinessProbe.failureThreshold` | Failure threshold for readinessProbe | `6` |
| `coordinator.readinessProbe.successThreshold` | Success threshold for readinessProbe | `1` |
| `coordinator.rpcService.port` | Coordinator headless rpc service port for internal connections | `50051` |
| `coordinator.rpcService.clusterIP` | Static clusterIP or None for Coordinator headless rpc service | `""` |
| `coordinator.rpcService.sessionAffinity` | Control where internal rpc requests go, to the same pod or round-robin | `None` |
| `coordinator.rpcService.annotations` | Additional custom annotations for Coordinator headless rpc service | `{}` |
| `coordinator.httpService.port` | Coordinator http service port for external admin requests | `8080` |
| `coordinator.httpService.sessionAffinity` | Control where external http requests go, to the same pod or round-robin | `None` |
| `coordinator.httpService.externalTrafficPolicy` | Coordinator http service external traffic policy | `Cluster` |
| `coordinator.httpService.annotations` | Additional custom annotations for Coordinator http service | `{}` |
| `coordinator.workdir` | Local ephemeral storage mount path for Coordinator working directory | `"/coordinator_workdir"` |
| `coordinator.connectTimeoutSeconds` | The max timeout seconds when other workers connect to Coordinator | `60` |
| `coordinator.heartbeatIntervalSeconds` | The heartbeat interval in seconds when other workers report statistics to coordinator | `10` |

### Sampling parameters

| Name | Description | Value |
| --- | --- | --- |
| `sampling.workerNum` | Number of Sampling workers | `2` |
| `sampling.workdir` | Local ephemeral storage mount path for Sampling working directory | `"/sampling_workdir"` |
| `sampling.actorLocalShardNum` | Local computing shard number for each Sampling Worker pod | `4` |
| `sampling.dataPartitionNum` | The total partition number of data across all Sampling Workers | `8` |
| `sampling.rocksdbEnv.highPriorityThreads` | The thread number of high-priority rocksdb background tasks | `2` |
| `sampling.rocksdbEnv.lowPriorityThreads` | The thread number of low-priority rocksdb background tasks | `2` |
| `sampling.sampleStore.memtableRep` | The rocksdb memtable structure type of sample store | `"hashskiplist"` |
| `sampling.sampleStore.hashBucketCount` | The hash bucket count of sample store memtable | `1048576` |
| `sampling.sampleStore.skipListLookahead` | The look-ahead factor of sample store memtable | `0` |
| `sampling.sampleStore.blockCacheCapacity` | The capacity (bytes) of sample store block cache | `67108864` |
| `sampling.sampleStore.ttlHours` | The TTL hours for sampling data in sample store | `1200` |
| `sampling.subscriptionTable.memtableRep` | The rocksdb memtable structure type of Sampling subscription table | `"hashskiplist"` |
| `sampling.subscriptionTable.hashBucketCount` | The hash bucket count of subscription table memtable | `1048576` |
| `sampling.subscriptionTable.skipListLookahead` | The look-ahead factor of subscription table memtable | `0` |
| `sampling.subscriptionTable.blockCacheCapacity` | The capacity (bytes) of subscription table block cache | `67108864` |
| `sampling.subscriptionTable.ttlHours` | The TTL hours for sampling rules in subscription table | `1200` |
| `sampling.recordPolling.threadNum` | The thread number for graph update consuming from kafka queues | `2` |
| `sampling.recordPolling.retryIntervalMs` | The retry interval (ms) when no record has been polled | `100` |
| `sampling.recordPolling.processConcurrency` | The max processing concurrency for polled records | `100` |
| `sampling.samplePublishing.producerPoolSize` | The max number of kafka producers for sampling results | `2` |
| `sampling.samplePublishing.maxProduceRetryTimes` | The maximum retry times of producing a kafka message | `3` |
| `sampling.samplePublishing.callbackPollIntervalMs` | The interval (ms) for polling async producing callbacks | `100` |
| `sampling.logging.dataLogPeriod` | Specify how many graph update batches should be processed between two logs | `10` |
| `sampling.logging.ruleLogPeriod` | Specify how many sampling rules should be processed between two logs | `10` |

### Serving parameters

| Name | Description | Value |
| --- | --- | --- |
| `serving.workerNum` | Number of Serving workers, each Serving Worker is an independent pod | `2` |
| `serving.readinessProbe.enabled` | Enable readinessProbe on Serving worker containers | `true` |
| `serving.readinessProbe.initialDelaySeconds` | Initial delay seconds for readinessProbe | `30` |
| `serving.readinessProbe.periodSeconds` | Period seconds for readinessProbe | `10` |
| `serving.readinessProbe.timeoutSeconds` | Timeout seconds for readinessProbe | `1` |
| `serving.readinessProbe.failureThreshold` | Failure threshold for readinessProbe | `6` |
| `serving.readinessProbe.successThreshold` | Success threshold for readinessProbe | `1` |
| `serving.httpService.port` | The external port of Serving http service for inference queries | `10000` |
| `serving.httpService.sessionAffinity` | Control where http requests go, to the same pod or round-robin | `None` |
| `serving.httpService.externalTrafficPolicy` | The external traffic policy of Serving http service | `Cluster` |
| `serving.httpService.annotations` | Additional custom annotations of Serving http service | `{}` |
| `serving.workdir` | Local ephemeral storage mount path for Serving working directory | `"/serving_workdir"` |
| `serving.actorLocalShardNum` | Local computing shard number for each Serving Worker pod | `4` |
| `serving.dataPartitionNum` | The partition number of data for each Serving Worker | `4` |
| `serving.rocksdbEnv.highPriorityThreads` | The thread number of high-priority rocksdb background tasks | `2` |
| `serving.rocksdbEnv.lowPriorityThreads` | The thread number of low-priority rocksdb background tasks | `2` |
| `serving.sampleStore.inMemoryMode` | Specify whether to open rocksdb in-memory mode of sample store | `false` |
| `serving.sampleStore.memtableRep` | The rocksdb memtable structure type of sample store | `"hashskiplist"` |
| `serving.sampleStore.hashBucketCount` | The hash bucket count of sample store memtable | `1048576` |
| `serving.sampleStore.skipListLookahead` | The look-ahead factor of sample store memtable | `0` |
| `serving.sampleStore.blockCacheCapacity` | The capacity (bytes) of sample store block cache | `67108864` |
| `serving.sampleStore.ttlHours` | The TTL hours for serving data in sample store | `1200` |
| `serving.recordPolling.threadNum` | The thread number for sample update consuming from kafka queues | `2` |
| `serving.recordPolling.retryIntervalMs` | The retry interval (ms) when no record has been polled | `100` |
| `serving.recordPolling.processConcurrency` | The max processing concurrency for polled records | `100` |
| `serving.logging.dataLogPeriod` | Specify how many sample update batches should be processed between two logs | `10` |
| `serving.logging.requestLogPeriod` | Interval of incoming inference query requests for logging serving statistics | `1` |

### Other Parameters

| Name | Description | Value |
| --- | --- | --- |
| `serviceAccount.create` | Enable creation of ServiceAccount for pods | `true` |
| `serviceAccount.name` | The name of the service account to use. If not set and `create` is `true`, a name is generated | `""` |
| `serviceAccount.automountServiceAccountToken` | Allows auto mount of ServiceAccountToken on the serviceAccount created | `true` |
| `serviceAccount.annotations` | Additional custom annotations for the ServiceAccount | `{}` |
| `rbac.create` | Whether to create & use RBAC resources or not | `false` |
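
The `${worker-type}` prefix used in the Common Pod parameters table stands for `coordinator`, `sampling`, or `serving`, so one setting has to be repeated once per worker type. The sketch below shows one way to generate those flags in a loop; it assumes the release is named `my-release` as in the install example, and it only prints the resulting `helm upgrade` command for review rather than running it. Whether a persistence change can be applied to an already running release depends on the chart, so treat this as an illustration of the naming scheme rather than a recommended procedure.

```shell
# Build one pair of --set flags per worker type.
set_flags=""
for worker in coordinator sampling serving; do
  set_flags="$set_flags --set ${worker}.persistence.enabled=true"
  set_flags="$set_flags --set ${worker}.persistence.size=20Gi"
done

# Print the assembled command instead of executing it, so it can be reviewed first.
cmd="helm upgrade my-release dgs/dgs --reuse-values$set_flags"
echo "$cmd"
```
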