Graph traversal
Introduction
Graph traversal, in GNN, has a different semantics than classical graph computation. The training model of mainstream deep learning algorithms iterates by batch. To meet this requirement, the data has to be accessible by batch, and we call this data access pattern traversal. In GNN algorithms, the data source is the graph, and the training samples usually consist of the vertices and edges of the graph. Graph traversal refers to providing the algorithm with the ability to access vertices, edges or subgraphs by batch.
Currently GL supports batch traversal of vertices and edges. This random traversal can be either putback-free or putback. In a no-replay traversal, gl.OutOfRangeError
is triggered every time an epoch ends. The data source being traversed is partitioned, i.e. the current worker (in the case of distributed TF) only traverses the data on the Server corresponding to it.
Vertex traversal
Usage
There are 3 sources of data for vertices: all vertices of uniqueness, source vertices of all edges, and destination vertices of all edges. Vertex traversal relies on the NodeSampler
operator. The node_sampler()
interface of the Graph object returns a NodeSampler
object, which in turn calls the get()
interface to return data in Nodes
format.
def node_sampler(type, batch_size=64, strategy="by_order", node_from=gl.NODE):
"""
Args:
type(string): vertex type when node_from is gl.NODE, otherwise it is edge type;
batch_size(int): the number of vertices to be traversed each time
strategy(string): optional values are "by_order" and "random", which means ordered traversal and random traversal. When use "by_order", if the bottom is less than batch_size, the actual number will be returned, if the actual number is 0, gl.OutOfRangeError will be triggered.
node_from: data source, optional values are gl;
Return:
NodeSampler object
"""
def NodeSampler.get():
"""
Return:
Nodes object, if not bottomed out, expects the shape of ids to be [batch_size]
"""
Get specific values such as id, weight, attribute, etc. from Nodes
object, refer to API. In GSL, vertex traversal reference g.V()
.
Example
“user” vertex table:
id | attributes |
---|---|
10001 | 0:0.1:0 |
10002 | 1:0.2:3 |
10003 | 3:0.3:4 |
“buy” edge table:
src_id | dst_id | attributes |
---|---|---|
10001 | 1 | 0.1 |
10001 | 2 | 0.2 |
10001 | 3 | 0.4 |
10002 | 1 | 0.1 |
# Exmaple1: Randomly sample vertices.
sampler1 = g.node_sampler("user", batch_size=3, strategy="random")
for i in range(5):
nodes = sampler1.get()
print(nodes.ids) # shape=(3, )
print(nodes.int_attrs) # shape=(3, 2), with 2 int attributes
print(nodes.float_attrs) # shape=(3, 1), with 1 float attribute
# Exmaple2: iterate over the user vertices in the graph
sampler2 = g.node_sampler("user", batch_size=3, strategy="by_order")
while True:
try:
nodes = sampler1.get()
print(nodes.ids) # except for the last batch, the shape is (3, ), the shape of the last batch is the number of remaining ids
print(nodes.int_attrs)
print(nodes.float_attrs)
except gl.OutOfRangError:
break
# Exmaple3: Iterate over the source vertices of the buy edges of the graph, i.e. the user vertices, for the unique
sampler2 = g.node_sampler("user", batch_size=3, strategy="by_order", node_from=gl.EDGE_SRC)
while True:
try:
nodes = sampler1.get()
print(nodes.ids) # shape=(2, ), because buy side table src_id only 2 unique values, dissatisfaction batch_size 3, so this loop is only carried out once
print(nodes.int_attrs)
print(nodes.float_attrs)
except gl.OutOfRangError:
break
Edge traversal
Usage
Edge traversal relies on the EdgeSampler
operator, and the edge_sampler()
interface of the Graph object returns an EdgeSampler
object, which in turn calls the get()
interface to return data in Edges
format.
def edge_sampler(edge_type, batch_size=64, strategy="by_order"):
"""
Args:
edge_type(string): edge type
batch_size(int): number of edges per traversal
strategy(string): optional values are "by_order" and "random", which means ordered traversal and random traversal. When use "by_order", if the bottom is less than batch_size, the actual number will be returned, if the actual number is 0, gl.OutOfRangeError will be triggered.
Return:
EdgeSampler object
"""
def EdgeSampler.get():
"""
Return:
Edges object, if not bottomed, expects src_ids to have a shape of [batch_size]
"""
Get specific values such as id, weight, attribute, etc. from Edges
object, refer to API. In GSL, the edge traversal reference is g.E()
.
Example
src_id | dst_id | weight | attributes |
---|---|---|---|
20001 | 30001 | 0.1 | 0.10,0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19 |
20001 | 30003 | 0.2 | 0.20,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29 |
20003 | 30001 | 0.3 | 0.30,0.31,0.32,0.33,0.34,0.35,0.36,0.37,0.38,0.39 |
20004 | 30002 | 0.4 | 0.40,0.41,0.42,0.43,0.44,0.45,0.46,0.47,0.48,0.49 |
sampler = g.edge_sampler("buy", batch_size=3, strategy="random")
for i in range(5):
edges = sampler.get()
print(edges.src_ids)
print(edges.src_ids)
print(edges.weights)
print(edges.float_attrs)