HYLA Attention

HYLAAttention is a softmax-free attention mechanism. Instead of standard softmax(QK^T) attention, HYLA uses a linear attention formulation that avoids materializing the O(N^2) attention matrix.


Usage

from grilly import nn

hyla = nn.HYLAAttention(embed_dim=512, num_heads=8)
output = hyla(query, key, value)

How It Works

Standard attention:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
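For reference, standard attention can be sketched in a few lines of NumPy (a minimal illustration, not the library's implementation); note the explicit N x N score matrix:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    """Softmax attention: materializes the (N, N) score matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (N, N) -- the O(N^2) bottleneck
    return softmax(scores, axis=-1) @ V  # (N, d_v)
```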

HYLA replaces the softmax with a kernel function that allows computing the attention output in O(N) time:

HYLA(Q, K, V) = phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1)

where phi is a feature map applied row-wise to Q and K. The key insight: by computing phi(K)^T V first (a d x d matrix whose size is independent of N), we avoid the N x N attention matrix entirely.
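The identity above can be checked numerically. Below is a minimal sketch using phi(x) = elu(x) + 1, a positive feature map common in linear-attention work; the feature map HYLAAttention actually uses may differ. By associativity, the O(N) path and the explicit O(N^2) path give the same result:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map (assumed here for illustration).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(N) path: compute phi(K)^T V first, a (d, d_v) matrix."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V              # (d, d_v): no N x N matrix materialized
    Z = Qp @ Kp.sum(axis=0)    # per-row normalizer phi(Q) phi(K)^T 1
    return (Qp @ KV) / Z[:, None]

def quadratic_reference(Q, K, V):
    """Same result via the explicit (N, N) matrix, for checking."""
    A = phi(Q) @ phi(K).T
    return (A @ V) / A.sum(axis=1, keepdims=True)
```

Both functions compute the same output; only the order of matrix products differs, which is exactly what turns O(N^2) into O(N).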

Benefits:

  • O(N) time and memory complexity (linear in sequence length)
  • No attention matrix materialized
  • Well-suited for long sequences where standard attention is prohibitive

FNetMixing

FNetMixing replaces attention entirely with Fourier transforms:

fnet = nn.FNetMixing(embed_dim=512)
output = fnet(x)  # x: (batch, seq_len, embed_dim)

It applies an FFT along the sequence dimension for token mixing and an FFT along the feature dimension for channel mixing. The mixing step has no learnable parameters.
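The mixing step can be sketched in NumPy as two FFTs followed by taking the real part, as in the FNet recipe (an illustration of the idea, not the fft.glsl shader implementation):

```python
import numpy as np

def fnet_mixing(x):
    """x: (batch, seq_len, embed_dim). Parameter-free token/channel mixing."""
    x = np.fft.fft(x, axis=-1)  # FFT along features (channel mixing)
    x = np.fft.fft(x, axis=1)   # FFT along sequence (token mixing)
    return x.real               # keep the real part; shape is unchanged
```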

Backed by: fft.glsl and ifft.glsl compute shaders.


When to Use What

Mechanism            Complexity                    Best For
Flash Attention 2/3  O(N) memory, O(N^2) compute   Standard transformers, moderate sequence lengths
HYLA                 O(N) time and memory          Very long sequences, linear-time models
FNetMixing           O(N log N)                    Speed-critical, no learnable attention needed
Standard MHA         O(N^2)                        When you need attention weights