Hello! My name is Matt Suiche. I am the founder of OnDB Inc., a data infrastructure startup for the agentic economy. I recently discussed cyberwar in the age of AI, Iran’s cyber capabilities, and how AI is reshaping hacking on Bloomberg’s Odd Lots and the National Security Lab podcast.
Previously, I co-founded CloudVolumes (acquired by VMware in 2014) and Comae Technologies (acquired by Magnet Forensics in 2022), where I later served as Head of Detection Engineering. I also founded the cybersecurity community project OPCDE.
My path into technology started with reverse engineering as a teenager and has since spanned memory forensics, operating systems, virtualization, blockchain, and now AI infrastructure.
Latest
Introduction
This document analyzes the implementation of AMD GPU support in Triton’s Gluon framework, examining architecture-specific optimizations, performance characteristics, and implementation details relative to NVIDIA GPU support.
For background on Gluon and its motivation as a lower-level alternative to Triton, see my previous post: “Gluon: When Triton Isn’t Low-Level Enough”.
Background: GPU Programming Architecture Landscape
The GPU programming ecosystem has diverged into distinct architectural approaches from NVIDIA and AMD, creating implementation challenges for cross-platform frameworks.
Introduction
Byte Pair Encoding (BPE) tokenization is used by most modern language models, but efficient training implementations are scarce. OpenAI’s tiktoken handles inference well but does not support training, while HuggingFace’s tokenizers supports training at the cost of considerable complexity and overhead. RustBPE is a Rust implementation that provides training with better performance.
RustBPE was developed by Andrej Karpathy as part of the nanochat project. This analysis covers the RustBPE implementation, including its architecture, performance characteristics, and Python integration.
For those interested in understanding BPE implementation from first principles, Sebastian Raschka provides an excellent deep-dive into implementing BPE from scratch in his blog post, and this is also covered in his book “Build a Large Language Model (From Scratch)”. His work offers invaluable insights into the algorithmic foundations that underpin implementations like RustBPE.
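To make the training side concrete, here is a minimal Python sketch of a single BPE merge step (an illustration of the algorithm, not RustBPE’s actual code): count adjacent token-id pairs, pick the most frequent, and replace every occurrence with a freshly allocated id.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most frequent one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"abababc")           # raw bytes: [97, 98, 97, 98, 97, 98, 99]
pair = most_frequent_pair(ids)   # (97, 98), i.e. "ab"
ids = merge(ids, pair, 256)      # [256, 256, 256, 99]
```

A full trainer simply repeats this loop until the target vocabulary size is reached; the performance work in implementations like RustBPE lies in avoiding the full recount of pairs on every iteration.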
Background
I recently encountered the GPU MODE TriMul challenge while exploring GPU optimization. Coming from a systems engineering background with no prior PyTorch or Triton experience, I saw the challenge as an opportunity to learn GPU performance engineering through a practical problem.
The Triangle Multiplicative Update (TriMul) is a core operation in AlphaFold2 and AlphaFold3—the protein structure prediction systems that earned the 2024 Nobel Prize in Chemistry. The operation’s O(n³) complexity creates severe performance bottlenecks in production, forcing AlphaFold3 to use batch size 1 during training despite having under 1B parameters. This makes the optimization problem both practically relevant and technically challenging.
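To see where the O(n³) cost comes from, here is a minimal NumPy sketch of the core contraction in the “outgoing” variant of the update; it omits the linear projections, sigmoid gating, and layer normalization of the full AlphaFold module, so treat it as an illustration of the shape of the computation, not a reference implementation.

```python
import numpy as np

def trimul_outgoing_core(a, b):
    """Core of the outgoing triangle multiplicative update:
    out[i, j, c] = sum_k a[i, k, c] * b[j, k, c].
    The sum over k for every (i, j) pair is what makes it O(n^3) in the
    sequence length n, for each of the c channels."""
    return np.einsum('ikc,jkc->ijc', a, b)

n, c = 8, 4  # toy sizes; real pair representations are far larger
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n, c))
b = rng.standard_normal((n, n, c))
out = trimul_outgoing_core(a, b)
assert out.shape == (n, n, c)
```

In the actual module, `a` and `b` are gated linear projections of the pair representation, which is why the memory traffic around this contraction, and not just the FLOPs, dominates the optimization problem.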