Two types of patch embeddings were considered in the paper:
- Linear (`PatchExtractor`): Typical embedding type for vision transformers, consisting of a pixel shuffle and linear layer (often implemented with a conv2d using stride=k).
- Convolutional (`ViGStem`): A series of Conv-BN-Act layers that down-sample the input, as used by ViG and ViHGNN, where a GeLU activation is used.
Both approaches down-sample the input by a factor of p for patches of size p x p, and require that p be a power of 2. However, we found that the convolutional approach performs significantly better and results in richer embeddings (as measured by PCA). The tradeoff is an increased parameter count and FLOPs for the embedding layer. A middle ground can be achieved by reducing the channel up-sampling factor in the convolutional embeddings via `div_hdim > 1`, though this results in some performance degradation.
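For illustration, here is a minimal sketch of both embedding styles. The function names, channel schedule, and the exact role of `div_hdim` are assumptions; the repo's `PatchExtractor` and `ViGStem` may differ in detail:

```python
import torch.nn as nn

def linear_patch_embed(in_ch: int, dim: int, p: int) -> nn.Module:
    # Pixel shuffle + linear layer, implemented as a single conv2d with stride=p.
    return nn.Conv2d(in_ch, dim, kernel_size=p, stride=p)

def conv_stem_embed(in_ch: int, dim: int, p: int, div_hdim: int = 1) -> nn.Module:
    # A stack of stride-2 Conv-BN-GeLU stages; requires p to be a power of 2.
    # div_hdim > 1 shrinks intermediate channel widths to save parameters/FLOPs
    # (assumed interpretation of the flag).
    stages = p.bit_length() - 1  # log2(p) down-sampling stages
    layers, ch = [], in_ch
    for i in range(stages):
        out_ch = dim if i == stages - 1 else max(dim // (2 ** (stages - 1 - i) * div_hdim), in_ch)
        layers += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                   nn.BatchNorm2d(out_ch),
                   nn.GELU()]
        ch = out_ch
    return nn.Sequential(*layers)
```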
The blocks consist of four main sub-modules:
- FFN (`Split_FFN_GEGLU`): The feed-forward network, which takes in the feature (`xf`) and adjacency feature (`xa`) tensors and mixes them via a GeGLU activation. The weight matrices are split per path and combined with a sum to avoid the extra data movement required for concatenation.
- Vertex Self-Attention (`MHHESA`): This layer provides the `vertex <-> vertex` communication via masked self-attention. The mask is dynamically generated from the hard adjacency matrix `adjthr`, and thus the layer must use an attention mechanism that supports arbitrary masks.
- Edge Aggregation Attention (`MHEAgg`): This layer performs the `vertex -> edge` communication via modulated cross-attention. The modulation utilizes the soft adjacency matrix `adj` and the bias restriction `ae_mask_p` (to prevent image vertices from communicating with virtual edges).
- Edge Distribution Attention (`MHEDist`): This layer performs the `edge -> vertex` communication via modulated cross-attention. The mechanism is identical to the edge aggregation attention; however, we also include a null-key-value embedding as proposed by eDiff-I. This allows vertices to "ignore" hyperedges if they so choose. The aggregation operation omits the null-kv, since an edge attending only to it would imply the edge is empty, counter to the "non-trivial" graph objective.
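To make the pattern concrete, here is a minimal single-head sketch of the edge-distribution attention. It assumes the modulation acts as an additive logit bias; the names and signature are illustrative, not the exact `MHEDist` API:

```python
import torch
import torch.nn.functional as F

def edge_dist_attention(xv, xe, adj, wq, wk, wv, null_kv, scale):
    # edge -> vertex: vertices (queries) attend over hyperedges (keys/values).
    # xv: (B, Nv, D) vertex features; xe: (B, Ne, D) hyperedge features
    # adj: (B, Nv, Ne) soft adjacency used to modulate the attention logits
    # null_kv: (2, 1, D) learned null key/value that lets a vertex "ignore" edges
    b = xv.shape[0]
    q = xv @ wq
    k = torch.cat([xe @ wk, null_kv[0].expand(b, 1, -1)], dim=1)
    v = torch.cat([xe @ wv, null_kv[1].expand(b, 1, -1)], dim=1)
    logits = (q @ k.transpose(-2, -1)) * scale   # (B, Nv, Ne + 1)
    mod = F.pad(adj, (0, 1), value=0.0)          # neutral weight of 0 for the null-kv
    attn = torch.softmax(logits + mod, dim=-1)
    return attn @ v
```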
- HgVT Blocks: All blocks use pre-layer `RMSNorm` with affine scaling, following current transformer trends.
- Attention: All attention layers use Q-K normalization with affine-less `LayerNorm` operations, following current transformer trends.
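For reference, Q-K normalization with affine-less `LayerNorm` amounts to the following generic pattern (a sketch, not the repo's exact code):

```python
import torch.nn.functional as F

def qk_norm_attention(q, k, v, scale):
    # Normalize queries and keys per-head with affine-less LayerNorm
    # (no weight/bias) before the dot product to stabilize the logits.
    q = F.layer_norm(q, q.shape[-1:])
    k = F.layer_norm(k, k.shape[-1:])
    attn = ((q @ k.transpose(-2, -1)) * scale).softmax(dim=-1)
    return attn @ v
```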
The following issues and fixes were observed in follow-up work and are not included in the codebase:
- Vertex Self-Attention: There is no guarantee that every vertex will have a corresponding hyperedge, which can result in a divide-by-zero in the softmax. Luckily, using a large negative bias (e.g., `-1e9`) prevents training instability in FP32, but it will cause issues in lower precision. Further investigation suggests that explicitly adding the identity to the mask fixes this issue, in addition to lowering the bias to `-1e4` when training in FP16; a sketch of this fix is shown after the list.
- Modulated Cross-Attention: The included implementation applies a symmetric modulation based on the soft-adjacency matrix and assigns a neutral weight of `0` to the null-kv. We have found some evidence to suggest that asymmetric negative weighting, and a neutral weight of `1` for the null-kv, may be beneficial.
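A sketch of the suggested self-attention fix, assuming the mask is a boolean (B, Nv, Nv) tensor derived from `adjthr` (the helper name is hypothetical):

```python
import torch

def build_sa_bias(mask: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # mask: (B, Nv, Nv) boolean vertex<->vertex mask derived from `adjthr`.
    # OR-ing in the identity guarantees every row keeps at least one valid
    # entry, avoiding NaNs from a fully masked softmax row.
    eye = torch.eye(mask.shape[-1], dtype=torch.bool, device=mask.device)
    mask = mask | eye
    # -1e9 is safe in FP32 but saturates in FP16; drop to -1e4 there.
    neg = -1e4 if dtype == torch.float16 else -1e9
    return (~mask).to(dtype) * neg
```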
The current implementation makes heavy use of the `einops.rearrange` and `torch.einsum` operations for clarity. However, profiling revealed that these operations incur high overhead, and they should be replaced with `permute`/`reshape` and `bmm`, respectively.
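For example, the two common patterns can be rewritten as follows (generic illustration with arbitrary shapes):

```python
import torch
from einops import rearrange

x = torch.randn(2, 8, 16, 64)  # (batch, heads, tokens, dim)

# einops.rearrange -> permute + reshape
y1 = rearrange(x, 'b h n d -> b n (h d)')
y2 = x.permute(0, 2, 1, 3).reshape(2, 16, 8 * 64)
assert torch.equal(y1, y2)

# torch.einsum -> bmm (flatten the batch/head dims first)
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
a1 = torch.einsum('bhnd,bhmd->bhnm', q, k)
a2 = torch.bmm(q.reshape(-1, 16, 64),
               k.reshape(-1, 16, 64).transpose(1, 2)).reshape(2, 8, 16, 16)
assert torch.allclose(a1, a2, atol=1e-5)
```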
The feature pooling mechanism uses a factory method to instantiate one of the pooling mechanisms:
- Expert (`ExpertCombiner`): Applies the edge-expert pooling operation on the primary hyperedges.
- Mean (`MeanCombiner`): Applies mean pooling over the primary hyperedges.
- Max (`MaxCombiner`): Applies max pooling over the primary hyperedges; this did not perform well.
- Image (`ImageCombiner`): Applies mean pooling over the image vertices (the standard mechanism for vision models).
- Dual (`DualCombiner`): Applies pooling to both the image vertices (same as Image) and the primary hyperedges (either Mean or Expert). This includes pre-normalization so that the features are balanced when concatenated into a final feature vector for the classifier.
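A minimal sketch of the factory pattern with two of the simpler combiners re-implemented for illustration (the real classes and constructor signatures may differ):

```python
import torch
import torch.nn as nn

class MeanCombiner(nn.Module):
    # Mean pooling over the primary hyperedges (illustrative re-implementation).
    def forward(self, xe: torch.Tensor) -> torch.Tensor:
        # xe: (B, Ne_primary, D) primary hyperedge features -> (B, D)
        return xe.mean(dim=1)

class MaxCombiner(nn.Module):
    # Max pooling over the primary hyperedges.
    def forward(self, xe: torch.Tensor) -> torch.Tensor:
        return xe.max(dim=1).values

def build_combiner(kind: str) -> nn.Module:
    # Minimal factory-method sketch; the repo registers all five combiners.
    combiners = {"mean": MeanCombiner, "max": MaxCombiner}
    return combiners[kind]()
```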
The hypergraph model only supports isotropic towers due to the complexity of down-sampling virtual features. However, the model supports several variations (some discussed in the paper appendix):
- Joined Adjacency Features (`x=xadj`): This case combines both semantic and adjacency features into a single tensor, simplifying the overall architecture. The adjacency mask is then computed from either a shared (`shared_adjproj=True`) or per-block (`shared_adjproj=False`) linear projection. Interestingly, we found that a shared projection produced the best results, though still inferior to splitting the features (`x!=xadj`).
- Repeating Layer (`repeat_n`): This option artificially increases the network depth by repeating the final layer `n` times, and is used in the Mini model (as discussed in the paper appendix). Notably, the regularization losses are independently applied to the repeated layers; this mainly serves as a method to decrease the overall parameter count.
- Simplifying the Final Block (`skip_last`): This option removes unused layers from the final block, which would otherwise not receive useful gradients (for example, the vertex FFN when only edge pooling is used, and the final adjacency feature outputs from the FFNs). We found that this helps stabilize training; otherwise, certain gradients are left floating.
- Disabling Vertex Self-Attention (`use_nodesa`): This option removes vertex self-attention completely, saving FLOPs, parameters, and memory in exchange for a reduction in prediction accuracy.
- Binding QKV (`bind_qkv`): This option allows the QKV matrices to be tied together in the attention operations, reducing the overall parameter count. This would align the model more closely with typical GNNs, but was not explored in the paper.
- Interpolated PEs (`pe_use_interpolation`): This option enables interpolating the position embeddings for higher resolutions using the mechanism from DINOv2. While not used for classification, it was used in the segmentation experiments in the paper's appendix. A sketch of this pattern is shown below.
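A minimal sketch of the DINOv2-style bicubic interpolation of patch position embeddings. The function name is hypothetical, and the virtual vertex/edge embeddings are assumed to be handled separately:

```python
import torch
import torch.nn.functional as F

def interpolate_pe(pe: torch.Tensor, old_hw: tuple, new_hw: tuple) -> torch.Tensor:
    # pe: (1, N, D) image-patch position embeddings laid out on an
    # old_hw[0] x old_hw[1] grid; virtual embeddings are not interpolated here.
    d = pe.shape[-1]
    pe = pe.reshape(1, old_hw[0], old_hw[1], d).permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=new_hw, mode="bicubic",
                       align_corners=False, antialias=True)
    return pe.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], d)
```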
As discussed in the paper, HgVT utilizes a combination of population and diversity regularization. These methods are applied to all layers, with diversity receiving an emphasis on the input (given the learned virtual vertex and hyperedge embeddings). We accomplish this with a 2:1 ratio, where the output of the input embeddings is scaled by `total_layers` and the final sum is then normalized by `2*total_layers`.
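A minimal sketch of this weighting scheme, assuming a hypothetical helper where `layer_losses` collects the per-layer regularization terms, with index 0 holding the input-embedding term:

```python
def combine_reg_losses(layer_losses, total_layers):
    # layer_losses[0]: regularization on the input-embedding output;
    # layer_losses[1:]: one term per block (total_layers of them).
    # Scaling the input term by total_layers and normalizing by
    # 2 * total_layers yields the 2:1 input emphasis described above.
    total = layer_losses[0] * total_layers + sum(layer_losses[1:])
    return total / (2 * total_layers)
```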
The model provides a handle to enable/disable saving of the intermediate activations via `model.enable_save(enable: bool)`. If enabled, the semantic and adjacency features for both vertices and edges will be added to the `model._x` and `model._ei` lists, along with `model._pooled_edge` for the pre-classifier head embedding and the expert routing information (`model._khot` and `model._kscores`). Notably, the adjacency features are shifted by one layer, since the feature output of layer i is used to create the mask in layer i+1.
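A typical usage pattern, assuming `images` is a batch of input tensors (the attribute names come from the description above):

```python
model.enable_save(True)                       # begin recording intermediates
logits = model(images)                        # images: (B, 3, H, W) batch (hypothetical)
feats, adj_feats = model._x, model._ei        # per-layer vertex/edge feature lists
pooled = model._pooled_edge                   # pre-classifier head embedding
khot, kscores = model._khot, model._kscores   # expert routing information
model.enable_save(False)                      # stop recording
```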
