Architecture
Grilly has three layers: Python modules, a C++ bridge, and Vulkan compute shaders.
Layer Stack
┌─────────────────────────────────────────────────────────┐
│ Python API                                              │
│ nn.Module, nn.Linear, F.relu, AdamW, Variable           │
│ grilly.nn / grilly.functional / grilly.optim            │
├─────────────────────────────────────────────────────────┤
│ C++ Bridge (grilly_core)                                │
│ pybind11 bindings, VMA persistent mapping               │
│ BufferPool bucketed allocation, CommandBatch            │
│ backend/_bridge.py dispatches to C++ or falls back      │
├─────────────────────────────────────────────────────────┤
│ Vulkan Compute Shaders                                  │
│ 194 GLSL shaders compiled to SPIR-V                     │
│ Dispatched via VkComputePipeline                        │
│ AMD / NVIDIA / Intel -- no CUDA                         │
└─────────────────────────────────────────────────────────┘
Package Layout
The repo root is the grilly package. pyproject.toml uses tool.setuptools.package-dir to map subpackages:
grilly/
├── backend/ # Vulkan GPU dispatch
│ ├── _bridge.py # Routes ops through grilly_core C++
│ ├── core.py # Vulkan instance/device init, buffer alloc
│ ├── compute.py # VulkanCompute: composes all op modules
│ ├── pipelines.py # Pipeline/descriptor-set creation + LRU cache
│ ├── autograd_core.py # GradientTape, ComputationNode, backward ops
│ ├── jit.py # Trace-based JIT compilation
│ ├── amp.py # Automatic mixed precision
│ └── shader_registry.py # Architecture-specific shader selection
├── cpp/ # C++ pybind11 extension
│ ├── src/ # Core: device, buffer_pool, pipeline_cache, autograd
│ │ ├── ops/ # Op implementations: linear, conv, attention, etc.
│ │ └── vulkan/ # Vulkan abstraction layer
│ └── python/ # pybind11 binding files (11 focused files)
├── nn/ # PyTorch-like nn.Module subclasses
│ ├── module.py # Base Module (parameters, train/eval, state_dict)
│ ├── linear.py # Linear
│ ├── conv.py # Conv1d, Conv2d
│ ├── attention.py # MultiheadAttention, FlashAttention2, HYLA, SympFormer
│ ├── autograd.py # Variable, reverse-mode autodiff
│ ├── snn_*.py # SNN framework (neurons, containers, surrogate grads)
│ └── lora.py # LoRA fine-tuning
├── functional/ # Stateless F.* API
├── optim/ # Optimizers + LR schedulers
├── utils/ # DataLoader, VulkanTensor, HuggingFaceBridge
├── shaders/ # 194 GLSL compute shaders + SPIR-V in spv/
└── experimental/ # VSA, MoE, temporal reasoning, cognitive
Data Flow
Without C++ Backend (Pure Python)
numpy array --> backend/core.py --> struct.pack --> vkMapMemory
--> ctypes.memmove --> Vulkan dispatch --> fence wait
--> vkMapMemory (read back) --> numpy array
Each operation maps/unmaps GPU memory. This CPU-GPU ping-pong is the bottleneck.
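The per-op round trip can be sketched in plain Python. This is an illustrative model of the serialize/copy/read-back steps in the diagram, not grilly's actual internals; a ctypes buffer stands in for memory mapped by vkMapMemory, and a plain list stands in for the numpy data at either end:

```python
import ctypes
import struct

# Flattened float data, as it would come from a numpy array.
values = [1.0, 2.0, 3.0]

# struct.pack step: serialize floats to raw bytes.
data = struct.pack(f"{len(values)}f", *values)

# Stand-in for the region returned by vkMapMemory.
region = ctypes.create_string_buffer(len(data))

# ctypes.memmove step: copy the packed bytes into the mapped region.
ctypes.memmove(region, data, len(data))

# Read-back step after the fence wait: unpack the region into floats again.
out = list(struct.unpack(f"{len(values)}f", region.raw))
```

Doing this pack/copy/unpack sequence once per operation is the overhead the C++ backend removes.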
With C++ Backend (grilly_core)
numpy array --> _bridge.py --> grilly_core.linear()
--> VMA persistent mapping (single memcpy, no map/unmap)
--> BufferPool bucketed allocation (reuse buffers)
--> Vulkan dispatch --> result stays GPU-resident
--> next op reads from same GPU buffer
--> final result: single download to numpy
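The bucketed-reuse idea behind BufferPool can be modeled in a few lines. The real pool manages Vulkan buffers through VMA; this sketch (names illustrative) shows only the policy of rounding requests up to a bucket size so freed buffers can be reused for differently sized requests:

```python
class BucketedPool:
    """Toy model of bucketed buffer reuse: round up, reuse on release."""

    def __init__(self):
        self.free = {}  # bucket size -> list of reusable buffers

    @staticmethod
    def bucket(size: int) -> int:
        """Round a request up to the next power of two."""
        b = 1
        while b < size:
            b *= 2
        return b

    def acquire(self, size: int) -> bytearray:
        b = self.bucket(size)
        pool = self.free.get(b, [])
        # Reuse a freed buffer of the same bucket, else allocate a new one.
        return pool.pop() if pool else bytearray(b)

    def release(self, buf: bytearray) -> None:
        self.free.setdefault(len(buf), []).append(buf)

pool = BucketedPool()
a = pool.acquire(1000)   # rounds up to a 1024-byte bucket
pool.release(a)
b = pool.acquire(900)    # different size, same bucket: the buffer is reused
```

Because 1000 and 900 both round up to 1024, the second acquire returns the buffer the first one released instead of allocating again.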
The bridge uses a try/fallback pattern. If the C++ backend is available, ops go through grilly_core. Otherwise, they fall back to the pure Python Vulkan path:
result = _bridge.linear(x, weight, bias)
if result is not None:
    return result
# Fall back to legacy backend
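On the bridge side, the same pattern can be sketched as follows. The import guard and the bridge_linear name are illustrative; grilly_core.linear is the extension entry point named in the data-flow diagram above:

```python
# Try the C++ extension once at import time.
try:
    import grilly_core  # pybind11 extension
except ImportError:
    grilly_core = None

def bridge_linear(x, weight, bias=None):
    """Return the C++ result, or None so the caller falls back."""
    if grilly_core is None:
        return None  # signals: use the pure Python Vulkan path
    return grilly_core.linear(x, weight, bias)
```

Returning None (rather than raising) keeps the fallback decision a cheap branch in the caller.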
Shader Dispatch
Each neural network operation is implemented as a GLSL compute shader compiled to SPIR-V bytecode. shader_registry.py selects architecture-specific shader variants (BERT, GPT, T5), falling back to a generic version when no variant exists.
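The select-with-fallback behavior can be sketched as a keyed lookup. The table entries and file names here are illustrative (only two variants are registered, so a T5 request demonstrates the generic fallback), not the actual registry contents:

```python
# (op, architecture) -> SPIR-V file; None keys the generic fallback.
SHADERS = {
    ("attention", "bert"): "attention_bert.spv",
    ("attention", "gpt"):  "attention_gpt.spv",
    ("attention", None):   "attention_generic.spv",
}

def select_shader(op, arch):
    """Prefer the architecture-specific variant, else the generic shader."""
    return SHADERS.get((op, arch), SHADERS[(op, None)])

select_shader("attention", "bert")  # architecture-specific variant
select_shader("attention", "t5")    # no variant registered here: generic fallback
```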
Shaders are loaded at device initialization from shaders/spv/. The pipeline cache (pipelines.py) creates and LRU-caches VkComputePipeline objects for each shader + specialization constant combination.
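The caching policy can be sketched with an OrderedDict keyed by shader and specialization constants. This models the LRU behavior only; the real cache in pipelines.py holds VkComputePipeline handles, and the class and parameter names here are illustrative:

```python
from collections import OrderedDict

class PipelineCache:
    """LRU cache keyed by (shader, specialization constants)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.cache = OrderedDict()  # key -> pipeline handle

    def get(self, shader, spec_consts, create):
        key = (shader, spec_consts)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        # Cache miss: pay the expensive pipeline-creation cost once.
        pipeline = create(shader, spec_consts)
        self.cache[key] = pipeline
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return pipeline
```

Repeated dispatches of the same shader with the same specialization constants then reuse one pipeline object instead of recreating it.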
See Compute Shaders for details on the shader system.
Entry Points
The main entry point is grilly.Compute() (alias for VulkanCompute):
import grilly
backend = grilly.Compute()
print(backend.device_name) # "AMD Radeon RX 6750 XT"
VulkanCompute composes all operation modules into a single object with namespaced ops:
backend.snn.lif_step() -- SNN neuron operations
backend.fnn.linear() -- feedforward operations
backend.attention.flash_attention2() -- attention mechanisms
backend.learning.stdp_update() -- learning rules
Most users interact through nn.Module subclasses or grilly.functional rather than calling VulkanCompute directly.