CS246: Mining Massive Datasets. Jure Leskovec, Stanford University

Transcription

CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. Mina Ghashami, Amazon. http://cs246.stanford.edu

Images, text, and speech: the modern deep learning toolbox is designed for simple sequences and grids.

But networks are far more complex!
- Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
- No fixed node ordering or reference point
- Often dynamic, with multimodal features

A naïve approach: take the adjacency matrix and the feature matrix, [A, X].
- Concatenate the adjacency matrix and the node features
- Feed them into a deep (fully connected) neural network
Issues with this idea:
- O(|V|) parameters (a huge number of parameters)
- Not applicable to graphs of different sizes (no inductive learning possible)
- Sensitive to node ordering
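A minimal sketch of this naïve idea (not from the slides; the toy sizes and names such as `num_nodes` and `W1` are ours). Note that the input length, and hence the first weight matrix, is tied to a fixed number of nodes and a fixed node ordering, which is exactly the problem listed above:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, feat_dim, hidden_dim = 5, 3, 8                     # hypothetical toy sizes

A = (rng.random((num_nodes, num_nodes)) < 0.4).astype(float)  # toy adjacency matrix
X = rng.standard_normal((num_nodes, feat_dim))                # toy node features

# Naive approach: concatenate each node's adjacency row with its features
inputs = np.concatenate([A, X], axis=1)                       # shape: (|V|, |V| + d)

# Fully connected layer: the parameter count grows with |V|, and the output
# changes if the nodes are re-ordered.
W1 = rng.standard_normal((num_nodes + feat_dim, hidden_dim))
hidden = np.maximum(inputs @ W1, 0)                           # ReLU
print(hidden.shape)                                           # (5, 8)
```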

Graph-structured data: what if our data looks like an arbitrary graph instead of a grid?
- There is no fixed notion of locality or sliding window on the graph
- The graph is permutation invariant
(Credit: Stanford CS224W)

Convolutional neural networks (on grids): a single CNN layer with a 3x3 filter (animation by Vincent Dumoulin).
Idea: transform the information at the neighbors and combine it:
- Transform "messages" $h_i$ from neighbors: $W_i h_i$
- Add them up: $\sum_i W_i h_i$
(Credit: Stanford CS224W)

[Kipf and Welling, ICLR 2017]
Idea: a node's neighborhood defines a computation graph.
1. Determine the node's computation graph
2. Propagate and transform information
Learn how to propagate information across the graph to compute node features.
(Credit: Stanford CS224W)

Key idea: generate node embeddings based on local network neighborhoods. (Figure: the computation graph of target node A, unrolled from the input graph.) (Credit: Stanford CS224W)

Intuition: nodes aggregate information from their neighbors using neural networks. (Figure: a neural network sits at each aggregation step of the computation graph.) (Credit: Stanford CS224W)

Intuition: the network neighborhood defines a computation graph. Every node defines its own computation graph based on its neighborhood! (Credit: Stanford CS224W)

The model can be of arbitrary depth:
- Nodes have embeddings at each layer
- The layer-0 embedding of node $u$ is its input feature $x_u$
- The layer-$k$ embedding gets information from nodes that are $k$ hops away
(Figure: layer-0, layer-1, and layer-2 of the computation graph for target node A.)

Neighborhood aggregation: the key distinction between approaches is how they aggregate information across the layers. What is in the box, i.e., what operator sits at each node of the computation graph? (Credit: Stanford CS224W)

Basic approach: average the information from the neighbors and apply a neural network: (1) average the messages from the neighbors, (2) apply a neural network. (Credit: Stanford CS224W)

Assume we have a graph $G$:
- $V$ is the vertex set
- $A$ is the adjacency matrix (assume binary)
- $X \in \mathbb{R}^{|V| \times d}$ is a matrix of node features
- $v$: a node in $V$; $N(v)$: the set of neighbors of $v$
Node features:
- Social networks: user profile, user image
- Biological networks: gene expression profiles, gene functional information
- When the graph dataset has no node features:
  - Indicator vectors (one-hot encoding of a node)
  - Vector of constant 1: $[1, 1, \dots, 1]$

Basic approach: average neighbor messages and apply a neural network.
- Initial 0-th layer embeddings equal the node features: $h_v^{(0)} = x_v$
- $h_v^{(l+1)} = \sigma\Big( W_l \sum_{u \in N(v)} \frac{h_u^{(l)}}{|N(v)|} + B_l h_v^{(l)} \Big), \quad l \in \{0, \dots, L-1\}$
  where $h_v^{(l)}$ is the embedding of $v$ at layer $l$, the sum is the average of the neighbors' previous-layer embeddings, $\sigma$ is a non-linearity (e.g., ReLU), and $L$ is the total number of layers.
- $z_v = h_v^{(L)}$: the embedding after $L$ layers of neighborhood aggregation.
(Credit: Stanford CS224W)
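A minimal per-node sketch of this update rule, assuming a small numpy graph (the function and variable names such as `mean_agg_layer` are ours, not from the slides):

```python
import numpy as np

def mean_agg_layer(H, A, W, B):
    """One layer of h_v <- ReLU(W * mean_{u in N(v)} h_u + B * h_v)."""
    H_new = np.zeros((H.shape[0], W.shape[0]))
    for v in range(H.shape[0]):
        nbrs = np.nonzero(A[v])[0]                              # N(v)
        nbr_mean = H[nbrs].mean(axis=0) if len(nbrs) else np.zeros(H.shape[1])
        H_new[v] = np.maximum(W @ nbr_mean + B @ H[v], 0)       # sigma = ReLU
    return H_new

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)    # toy graph
X = rng.standard_normal((3, 4))                                 # layer-0 embeddings = features
W0, B0 = rng.standard_normal((6, 4)), rng.standard_normal((6, 4))
H1 = mean_agg_layer(X, A, W0, B0)                               # layer-1 embeddings, shape (3, 6)
```

Stacking $L$ such layers (with layer-specific $W_l$, $B_l$) produces the final embeddings $z_v$.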

How do we train the model to generate embeddings $z_v$? We need to define a loss function on the embeddings. (Credit: Stanford CS224W)

$h_v^{(0)} = x_v$, and
$h_v^{(l+1)} = \sigma\Big( W_l \sum_{u \in N(v)} \frac{h_u^{(l)}}{|N(v)|} + B_l h_v^{(l)} \Big), \quad l \in \{0, \dots, L-1\}$
$z_v = h_v^{(L)}$ is the final node embedding.
The trainable weight matrices (i.e., what we learn):
- $h_v^{(l)}$: the hidden representation of node $v$ at layer $l$
- $W_l$: weight matrix for neighborhood aggregation
- $B_l$: weight matrix for transforming the hidden vector of the node itself
We can feed these embeddings into any loss function and run SGD to train the weight parameters.
(Credit: Stanford CS224W)

The node embedding $z_v$ is a function of the input graph.
Supervised setting: we want to minimize the loss $\mathcal{L}$:
$\min_{\Theta} \mathcal{L}(y, f(z_v))$
- $y$: node label
- $\mathcal{L}$ could be the L2 loss if $y$ is a real number, or cross entropy if $y$ is categorical
(Credit: Stanford CS224W)

Directly train the model for a supervised task (e.g., node classification): is a drug safe or toxic? E.g., a drug-drug interaction network.

Directly train the model for a supervised task (e.g., node classification) using the cross entropy loss:
$\mathcal{L} = -\sum_{v \in V} \Big[ y_v \log\big(\sigma(z_v^\top \theta)\big) + (1 - y_v) \log\big(1 - \sigma(z_v^\top \theta)\big) \Big]$
where $z_v$ is the encoder output (node embedding), $\theta$ are the classification weights, and $y_v$ is the node class label (e.g., safe or toxic drug).
(Credit: Stanford CS224W)
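A sketch of this loss for binary node labels, assuming numpy embeddings `Z` from a GNN encoder and classification weights `theta` (both hypothetical names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def node_classification_loss(Z, theta, y):
    """Binary cross entropy over nodes: -sum y*log(p) + (1-y)*log(1-p), with p = sigmoid(z^T theta)."""
    p = sigmoid(Z @ theta)
    eps = 1e-12                                   # avoid log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 6))                   # node embeddings from the GNN
theta = rng.standard_normal(6)                    # classification weights
y = np.array([1.0, 0.0, 1.0])                     # node labels (e.g., toxic = 1, safe = 0)
print(node_classification_loss(Z, theta, y))
```

In practice both the GNN weights and `theta` would be updated jointly by SGD on this loss.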

CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. http://cs246.stanford.edu

J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020.
A GNN layer = Message + Aggregation.
- Different instantiations under this perspective: GCN, GraphSAGE, GAT, ...
(Figure: GNN Layer 1 with steps (1) Message and (2) Aggregation.)

J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020.
(3) Layer connectivity: connect GNN layers into a GNN.
- Stack layers sequentially
- Ways of adding skip connections
(Figure: GNN Layer 1 and GNN Layer 2 connected via layer connectivity.)

J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020.
(4) Graph augmentation. Idea: the raw input graph need not equal the computational graph.
- Graph feature augmentation
- Graph structure augmentation

J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020.
(5) Learning objective: how do we train a GNN?
- Supervised / unsupervised objectives
- Node / edge / graph level objectives

J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020.
The GNN design space, summarized: (1) Message, (2) Aggregation, (3) Layer connectivity, (4) Graph augmentation, (5) Learning objective.

CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. http://cs246.stanford.edu

J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020.
A single GNN layer = (1) Message + (2) Aggregation. Different instantiations (GCN, GraphSAGE, GAT, ...) arise under this perspective.

Idea of a GNN layer:
- Compress a set of vectors into a single vector
- Two-step process: (1) Message, (2) Aggregation
The $l$-th GNN layer takes as input the node embeddings $\mathbf{h}_v^{(l-1)}$ and $\mathbf{h}_u^{(l-1)}, u \in N(v)$ (from the node itself and its neighboring nodes) and outputs the node embedding $\mathbf{h}_v^{(l)}$.

(1) Message computation
- Message function: $\mathbf{m}_u^{(l)} = \text{MSG}^{(l)}\big(\mathbf{h}_u^{(l-1)}\big)$
- Intuition: each node creates a message, which will be sent to other nodes later
- Example: a linear layer $\mathbf{m}_u^{(l)} = \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}$, i.e., multiply the node features by a weight matrix $\mathbf{W}^{(l)}$

(2) Aggregation
- Intuition: node $v$ aggregates the messages from its neighbors:
  $\mathbf{h}_v^{(l)} = \text{AGG}^{(l)}\big(\{\mathbf{m}_u^{(l)}, u \in N(v)\}\big)$
- Example: Sum(·), Mean(·), or Max(·) aggregator, e.g. $\mathbf{h}_v^{(l)} = \text{Sum}(\{\mathbf{m}_u^{(l)}, u \in N(v)\})$

Issue: information from node $v$ itself could get lost, since the computation of $\mathbf{h}_v^{(l)}$ does not directly depend on $\mathbf{h}_v^{(l-1)}$.
Solution: include $\mathbf{h}_v^{(l-1)}$ when computing $\mathbf{h}_v^{(l)}$.
- (1) Message: also compute a message from node $v$ itself. Usually a different message computation is used: $\mathbf{m}_u^{(l)} = \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}$ for neighbors and $\mathbf{m}_v^{(l)} = \mathbf{B}^{(l)} \mathbf{h}_v^{(l-1)}$ for the node itself.
- (2) Aggregation: after aggregating from the neighbors, aggregate the message from node $v$ itself, via concatenation or summation:
  $\mathbf{h}_v^{(l)} = \text{CONCAT}\big(\text{AGG}(\{\mathbf{m}_u^{(l)}, u \in N(v)\}), \mathbf{m}_v^{(l)}\big)$
  (first aggregate from the neighbors, then aggregate the node's own message)

Putting things together:
- (1) Message: each node computes a message, $\mathbf{m}_u^{(l)} = \text{MSG}^{(l)}\big(\mathbf{h}_u^{(l-1)}\big), \; u \in \{N(v) \cup v\}$
- (2) Aggregation: aggregate messages from the neighbors and the node itself, $\mathbf{h}_v^{(l)} = \text{AGG}^{(l)}\big(\{\mathbf{m}_u^{(l)}, u \in N(v)\}, \mathbf{m}_v^{(l)}\big)$
- Nonlinearity (activation): adds expressiveness; often written as $\sigma(\cdot)$, e.g. ReLU(·) or Sigmoid(·); it can be added to the message or to the aggregation
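Putting the two steps into code: a hedged sketch of one generic layer with a linear message for neighbors ($\mathbf{W}$), a separate self message ($\mathbf{B}$), sum aggregation, concatenation with the self message, and a ReLU nonlinearity. Function and variable names are ours, not from the slides:

```python
import numpy as np

def message(H, W):
    """(1) Message: m_u = W h_u for every node u (computed in batch)."""
    return H @ W.T

def message_agg_layer(H, A, W, B):
    """(2) Aggregation: sum neighbor messages, concat with the self message, apply ReLU."""
    M_nbr, M_self = message(H, W), message(H, B)
    H_next = []
    for v in range(H.shape[0]):
        nbrs = np.nonzero(A[v])[0]
        agg = M_nbr[nbrs].sum(axis=0) if len(nbrs) else np.zeros(W.shape[0])
        H_next.append(np.maximum(np.concatenate([agg, M_self[v]]), 0))
    return np.array(H_next)
```

Because of the concatenation, the output dimension is twice the message dimension; using summation instead would keep it fixed.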

T. Kipf, M. Welling. Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017.
(1) Graph Convolutional Networks (GCN):
$\mathbf{h}_v^{(l)} = \sigma\Big( \mathbf{W}^{(l)} \sum_{u \in N(v)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} \Big)$
How do we write this as Message + Aggregation? Move $\mathbf{W}^{(l)}$ inside the sum:
$\mathbf{h}_v^{(l)} = \sigma\Big( \sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|} \Big)$, where each summand is a message and the sum plus $\sigma$ is the aggregation.

(1) GCN as Message + Aggregation:
- Message: each neighbor computes $\mathbf{m}_u^{(l)} = \frac{1}{|N(v)|} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}$, normalized by node degree (the GCN paper uses a slightly different normalization)
- Aggregation: sum the messages from the neighbors, then apply the activation: $\mathbf{h}_v^{(l)} = \sigma\big(\text{Sum}(\{\mathbf{m}_u^{(l)}, u \in N(v)\})\big)$
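This update can also be written in matrix form; here is a vectorized sketch using mean normalization by the receiving node's degree, as on this slide (the GCN paper itself uses the symmetric normalization $D^{-1/2}AD^{-1/2}$; names are ours):

```python
import numpy as np

def gcn_layer(H, A, W):
    """h_v <- ReLU( W * sum_{u in N(v)} h_u / |N(v)| ), computed for all nodes at once."""
    deg = A.sum(axis=1, keepdims=True)          # |N(v)| for each node
    A_norm = A / np.maximum(deg, 1)             # row-normalized adjacency, D^{-1} A
    return np.maximum(A_norm @ H @ W.T, 0)      # aggregate neighbors, transform, ReLU
```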

Hamilton et al. Inductive Representation Learning on Large Graphs, NeurIPS 2017.
(2) GraphSAGE:
$\mathbf{h}_v^{(l)} = \sigma\Big( \mathbf{W}^{(l)} \cdot \text{CONCAT}\big( \mathbf{h}_v^{(l-1)}, \text{AGG}(\{\mathbf{h}_u^{(l-1)}, u \in N(v)\}) \big) \Big)$
How do we write this as Message + Aggregation?
- The message is computed within AGG(·)
- Two-stage aggregation:
  - Stage 1: aggregate from the node's neighbors: $\mathbf{h}_{N(v)}^{(l)} = \text{AGG}\big(\{\mathbf{h}_u^{(l-1)}, u \in N(v)\}\big)$
  - Stage 2: further aggregate over the node itself: $\mathbf{h}_v^{(l)} = \sigma\big( \mathbf{W}^{(l)} \cdot \text{CONCAT}(\mathbf{h}_v^{(l-1)}, \mathbf{h}_{N(v)}^{(l)}) \big)$

GraphSAGE neighbor aggregation variants:
- Mean: take a weighted average of the neighbors: $\text{AGG} = \sum_{u \in N(v)} \frac{\mathbf{h}_u^{(l-1)}}{|N(v)|}$ (the normalization is the message computation, the sum is the aggregation)
- Pool: transform the neighbor vectors and apply a symmetric vector function Mean(·) or Max(·): $\text{AGG} = \text{Mean}\big(\{\text{MLP}(\mathbf{h}_u^{(l-1)}), u \in N(v)\}\big)$ (the MLP is the message computation, Mean is the aggregation)

$\ell_2$ normalization:
- Optional: apply $\ell_2$ normalization to $\mathbf{h}_v^{(l)}$ at every layer: $\mathbf{h}_v^{(l)} \leftarrow \frac{\mathbf{h}_v^{(l)}}{\|\mathbf{h}_v^{(l)}\|_2} \;\; \forall v \in V$, where $\|\mathbf{u}\|_2 = \sqrt{\sum_i u_i^2}$ ($\ell_2$-norm)
- Without $\ell_2$ normalization, the embedding vectors have different scales ($\ell_2$-norms)
- In some cases (not always), normalizing the embeddings improves performance
- After $\ell_2$ normalization, all vectors have the same $\ell_2$-norm
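A sketch of a GraphSAGE-style layer using the mean aggregator, concatenation with the node's own embedding, and the optional ℓ2 normalization from this slide. The simple mean aggregator and all names are our choices; the paper also proposes pool and LSTM aggregators:

```python
import numpy as np

def graphsage_layer(H, A, W, l2_normalize=True):
    """h_v <- ReLU( W * CONCAT(h_v, mean_{u in N(v)} h_u) ), optionally l2-normalized."""
    out = []
    for v in range(H.shape[0]):
        nbrs = np.nonzero(A[v])[0]
        agg = H[nbrs].mean(axis=0) if len(nbrs) else np.zeros(H.shape[1])  # stage 1
        h = np.maximum(W @ np.concatenate([H[v], agg]), 0)                 # stage 2
        if l2_normalize:
            h = h / max(np.linalg.norm(h), 1e-12)   # every embedding gets unit l2-norm
        out.append(h)
    return np.array(out)
```

Here `W` has shape (d_out, 2*d_in) because of the concatenation of the self embedding with the aggregated neighbors.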

(3) Graph Attention Networks (GAT):
$\mathbf{h}_v^{(l)} = \sigma\Big( \sum_{u \in N(v)} \alpha_{vu} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)} \Big)$, where the $\alpha_{vu}$ are attention weights.
In GCN / GraphSAGE:
- $\alpha_{vu} = \frac{1}{|N(v)|}$ is the weighting factor (importance) of node $u$'s message to node $v$
- $\alpha_{vu}$ is defined explicitly from structural properties of the graph (node degree)
- All neighbors $u \in N(v)$ are equally important to node $v$

[Velickovic et al., ICLR 2018; Vaswani et al., NIPS 2017]
Can we do better than simple neighborhood aggregation? Can we let the weighting factors $\alpha_{vu}$ be learned?
Goal: specify arbitrary importance to the different neighbors of each node in the graph.
Idea: compute the embedding $\mathbf{h}_v^{(l)}$ of each node following an attention strategy:
- Nodes attend over their neighborhood's messages
- This implicitly specifies different weights to different nodes in a neighborhood

Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$:
- (1) Let $a$ compute attention coefficients $e_{vu}$ across pairs of nodes $u, v$ based on their messages:
  $e_{vu} = a\big(\mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_v^{(l-1)}\big)$
- $e_{vu}$ indicates the importance of $u$'s message to node $v$
(Figure example: the coefficient $e_{AB}$ computed from $\mathbf{h}_A^{(l-1)}$ and $\mathbf{h}_B^{(l-1)}$.)

- (2) Normalize $e_{vu}$ into the final attention weight $\alpha_{vu}$ using the softmax function, so that $\sum_{u \in N(v)} \alpha_{vu} = 1$:
  $\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)} \exp(e_{vk})}$
- (3) Compute the weighted sum based on the final attention weights $\alpha_{vu}$:
  $\mathbf{h}_v^{(l)} = \sigma\Big( \sum_{u \in N(v)} \alpha_{vu} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)} \Big)$
(Figure example: weighted sum using $\alpha_{AB}, \alpha_{AC}, \alpha_{AD}$: $\mathbf{h}_A^{(l)} = \sigma(\alpha_{AB} \mathbf{W}^{(l)} \mathbf{h}_B^{(l-1)} + \alpha_{AC} \mathbf{W}^{(l)} \mathbf{h}_C^{(l-1)} + \alpha_{AD} \mathbf{W}^{(l)} \mathbf{h}_D^{(l-1)})$.)

What is the form of the attention mechanism $a$?
- The approach is agnostic to the choice of $a$
- E.g., use a simple single-layer neural network; $a$ has trainable parameters (the weights of the Linear layer):
  $e_{vu} = a\big(\mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_v^{(l-1)}\big) = \text{Linear}\big(\text{Concat}(\mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_v^{(l-1)})\big)$
- The parameters of $a$ are trained jointly: learn them together with the weight matrices (i.e., the other parameters of the neural net, $\mathbf{W}^{(l)}$) in an end-to-end fashion
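A single-head sketch of this attention layer: a linear attention mechanism `a` scores each pair of transformed embeddings, softmax normalizes the scores over each node's neighborhood, and the layer outputs the attention-weighted sum. The LeakyReLU inside the score follows the GAT paper rather than these slides, so treat it (and all names) as an assumption:

```python
import numpy as np

def gat_layer(H, A, W, a):
    """H: (n, d_in) embeddings, W: (d_out, d_in), a: (2*d_out,) attention parameters."""
    n = H.shape[0]
    Wh = H @ W.T                                   # transformed embeddings W h_u
    H_next = np.zeros((n, W.shape[0]))
    for v in range(n):
        nbrs = np.nonzero(A[v])[0]                 # N(v)
        if len(nbrs) == 0:
            continue
        # e_vu = Linear(Concat(W h_u, W h_v)), with a LeakyReLU as in the GAT paper
        e = np.array([a @ np.concatenate([Wh[u], Wh[v]]) for u in nbrs])
        e = np.where(e > 0, e, 0.2 * e)
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()      # softmax over neighbors
        H_next[v] = np.maximum((alpha[:, None] * Wh[nbrs]).sum(axis=0), 0)
    return H_next
```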

Multi-head attention: stabilizes the learning process of the attention mechanism.
- Create multiple attention scores (each replica with a different set of parameters):
  $\mathbf{h}_v^{(l)}[1] = \sigma\big(\sum_{u \in N(v)} \alpha_{vu}^{1} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}\big)$
  $\mathbf{h}_v^{(l)}[2] = \sigma\big(\sum_{u \in N(v)} \alpha_{vu}^{2} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}\big)$
  $\mathbf{h}_v^{(l)}[3] = \sigma\big(\sum_{u \in N(v)} \alpha_{vu}^{3} \mathbf{W}^{(l)} \mathbf{h}_u^{(l-1)}\big)$
- The outputs are aggregated, by concatenation or summation: $\mathbf{h}_v^{(l)} = \text{AGG}\big(\mathbf{h}_v^{(l)}[1], \mathbf{h}_v^{(l)}[2], \mathbf{h}_v^{(l)}[3]\big)$
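A small sketch of multi-head attention on top of the single-head `gat_layer` above (assumed to be in scope): run several heads with independent parameters and combine their outputs; the 3-head setup is just an example.

```python
import numpy as np

def multi_head_gat(H, A, heads, combine="concat"):
    """heads: list of (W, a) parameter pairs, one per attention head."""
    outs = [gat_layer(H, A, W, a) for W, a in heads]   # reuse the single-head sketch above
    return np.concatenate(outs, axis=1) if combine == "concat" else np.mean(outs, axis=0)

rng = np.random.default_rng(0)
d_in, d_out = 4, 6
heads = [(rng.standard_normal((d_out, d_in)), rng.standard_normal(2 * d_out)) for _ in range(3)]
# Usage (with H, A as before): Z = multi_head_gat(H, A, heads)
```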

Key benefit: allows (implicitly) specifying different importance values ($\alpha_{vu}$) for different neighbors.
- Computationally efficient: the attention coefficients can be computed in parallel across all edges of the graph, and the aggregation can be parallelized across all nodes
- Storage efficient: sparse matrix operations require no more than $O(V + E)$ entries to be stored; the number of parameters is fixed, irrespective of graph size
- Localized: only attends over local network neighborhoods
- Inductive capability: it is a shared edge-wise mechanism and does not depend on the global graph structure

Activation functions (applied to the $i$-th dimension of the embedding $\mathbf{x}$):
- Rectified linear unit (ReLU): $\text{ReLU}(\mathbf{x}_i) = \max(\mathbf{x}_i, 0)$. Most commonly used.
- Sigmoid: $\sigma(\mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{x}_i}}$. Used only when you want to restrict the range of your embeddings.
- Parametric ReLU: $\text{PReLU}(\mathbf{x}_i) = \max(\mathbf{x}_i, 0) + a_i \min(\mathbf{x}_i, 0)$, where $a_i$ is a trainable parameter. Empirically performs better than ReLU.
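A quick sketch of the three activations applied elementwise to an embedding; the PReLU slope $a_i$ is a trainable parameter in practice but is just a constant here for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)                          # max(x_i, 0); most commonly used

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                  # squashes each dimension into (0, 1)

def prelu(x, a=0.25):
    return np.maximum(x, 0) + a * np.minimum(x, 0)   # a would be learned in practice

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), sigmoid(x), prelu(x))
```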

CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. http://cs224w.stanford.edu

J. You, R. Ying, J. Leskovec. Design Space of Graph Neural Networks, NeurIPS 2020.
(4) Graph manipulation. Idea: the raw input graph need not equal the computational graph.
- Graph feature augmentation
- Graph structure manipulation

Our assumption so far has been: raw input graph = computational graph.
Reasons for breaking this assumption:
- Feature level: the input graph lacks features → feature augmentation
- Structure level:
  - The graph is too sparse → inefficient message passing
  - The graph is too dense → message passing is too costly
  - The graph is too large → cannot fit the computational graph into a GPU
- It is simply unlikely that the input graph happens to be the optimal computation graph for producing embeddings

Graph feature manipulation:
- The input graph lacks features → feature augmentation
Graph structure manipulation:
- The graph is too sparse → add virtual nodes / edges
- The graph is too dense → sample neighbors when doing message passing
- The graph is too large → sample subgraphs to compute embeddings (covered later in the lecture: scaling up GNNs)

Why do we need feature augmentation?
(1) The input graph does not have node features (common when we only have the adjacency matrix).
Standard approaches:
a) Assign constant values to nodes (e.g., every node gets the feature 1)

b) Assign unique IDs to nodes, converted into one-hot vectors. Example: with 6 IDs in total, the one-hot vector for the node with ID 5 is [0, 0, 0, 0, 1, 0].

Feature augmentation: constant vs. one-hot node features.
- Expressive power: constant is medium (all nodes are identical, but the GNN can still learn from the graph structure); one-hot is high (each node has a unique ID, so node-specific information can be stored)
- Inductive learning (generalizing to unseen nodes): constant is high (simply assign the constant feature to new nodes, then apply the GNN); one-hot is low (new nodes introduce new IDs, and the GNN does not know how to embed unseen IDs)
- Computational cost: constant is low (only a 1-dimensional feature); one-hot is high ($O(|V|)$-dimensional features, cannot be applied to large graphs)
- Use cases: constant suits any graph and inductive settings (generalizing to new nodes); one-hot suits small graphs and transductive settings (no new nodes)
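A sketch of both options for a featureless graph with n nodes (numpy only; in practice these arrays become the input feature matrix X):

```python
import numpy as np

n = 6                                   # number of nodes in the toy graph

X_constant = np.ones((n, 1))            # option (a): constant feature, 1-dimensional
X_one_hot = np.eye(n)                   # option (b): one-hot node IDs, O(|V|)-dimensional

print(X_one_hot[4])                     # node with ID 5 -> [0, 0, 0, 0, 1, 0]
```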

Why do we need feature augmentation?
(2) Certain features can help GNN learning. Other commonly used augmented features:
- Node degree
- PageRank
- Clustering coefficient
- Any useful graph statistic can be used!

Motivation: augment sparse graphs.
(1) Add virtual edges
- Common approach: connect 2-hop neighbors via virtual edges
- Intuition: instead of using the adjacency matrix $A$ for GNN computation, use $A + A^2$
- Use case: bipartite graphs, e.g., authors connected to the papers they authored; 2-hop virtual edges create an author-author collaboration graph
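A sketch of the 2-hop virtual-edge augmentation: use $A + A^2$ (binarized, with the diagonal removed) in place of $A$; in an author-paper bipartite graph this adds author-author collaboration edges. The function name is ours:

```python
import numpy as np

def add_two_hop_edges(A):
    """Augmented adjacency: original edges plus virtual edges between 2-hop neighbors."""
    A2 = (A @ A > 0).astype(float)        # 2-hop reachability
    A_aug = ((A + A2) > 0).astype(float)  # A + A^2, binarized
    np.fill_diagonal(A_aug, 0)            # drop self-loops introduced by A^2
    return A_aug
```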

(2) Add virtual nodes
- The virtual node connects to all the nodes in the graph
- Suppose two nodes in a sparse graph have a shortest-path distance of 10; after adding the virtual node, all pairs of nodes have a distance of 2 (Node A → Virtual node → Node B)
- Benefit: greatly improves message passing in sparse graphs
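A sketch of adding one virtual node that links to every existing node, so any two nodes are at most two hops apart through it (names are ours):

```python
import numpy as np

def add_virtual_node(A):
    """Return an (n+1) x (n+1) adjacency where the last node links to all others."""
    n = A.shape[0]
    A_aug = np.zeros((n + 1, n + 1))
    A_aug[:n, :n] = A
    A_aug[n, :n] = 1.0                    # virtual node -> every real node
    A_aug[:n, n] = 1.0                    # every real node -> virtual node
    return A_aug
```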

Hamilton et al. Inductive Representation Learning on Large Graphs, NeurIPS 2017.
Previously: all the nodes are used for message passing.
New idea: (randomly) sample a node's neighborhood for message passing.

For example, we can randomly choose 2 neighbors to pass messages: only nodes $B$ and $D$ will pass messages to $A$.

The next time we compute the embeddings, we can sample different neighbors: only nodes $C$ and $D$ will pass messages to $A$.

Ying et al. Graph Convolutional Neural Networks for Web-Scale Recommender Systems, KDD 2018.
In expectation, we get embeddings similar to the case where all the neighbors are used.
- Benefit: greatly reduces the computational cost
- And in practice it works great!
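A sketch of random neighbor sampling for message passing: at each pass, keep at most k neighbors per node and aggregate only over the sample; re-drawing the sample each time gives the in-expectation behaviour described above. The value of `k` and the function name are our choices:

```python
import numpy as np

def sample_neighbors(A, k, rng):
    """For each node, keep at most k randomly chosen neighbors for this round of message passing."""
    sampled = {}
    for v in range(A.shape[0]):
        nbrs = np.nonzero(A[v])[0]
        if len(nbrs) > k:
            nbrs = rng.choice(nbrs, size=k, replace=False)
        sampled[v] = nbrs
    return sampled

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 1], [1, 0, 1, 0]], dtype=float)
print(sample_neighbors(A, 2, rng))        # different calls can sample different neighbors
```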

Recap: a general perspective on GNNs.
- GNN layer: transformation + aggregation; classic GNN layers: GCN, GraphSAGE, GAT
- Layer connectivity: deciding the number of layers; skip connections
- Graph manipulation: feature augmentation; structure manipulation

CS246: Mining Massive Datasets. Jure Leskovec & Mina Ghashami, Stanford University.