Hello! My name is Matt Suiche. I am the founder of OnDB Inc., a data infrastructure startup for the agentic economy. I recently discussed cyberwar in the age of AI, Iran’s cyber capabilities, and how AI is reshaping hacking on Bloomberg’s Odd Lots and the National Security Lab podcast.
Previously, I co-founded CloudVolumes (acquired by VMware in 2014) and Comae Technologies (acquired by Magnet Forensics in 2022), where I later served as Head of Detection Engineering. I also founded the cybersecurity community project OPCDE.
My path into technology started with reverse engineering as a teenager and has since spanned memory forensics, operating systems, virtualization, blockchain, and now AI infrastructure.
Latest
Eight months ago I published Building Agents for Small Language Models, a set of hard-won notes from shipping agents on 270M–32B parameter models. At the time, running useful local models meant embracing constraints: small context windows, CPU-only fallbacks, broken UTF-8 streams, and reasoning that fell apart past two steps.
I stand by that post. But the ground has shifted fast. What was a set of careful workarounds in August 2025 is starting to look like the default architecture for a large class of workloads.
On March 7, 2026, I joined Tracy Alloway and Joe Weisenthal on Bloomberg’s Odd Lots podcast for the second time. The first was in March 2022, during the Russia-Ukraine war, where we discussed what cyberwar actually looks like. Four years later, the same thesis holds – but the stakes have changed dramatically.
Listen: Apple Podcasts | Spotify
This time: the Iran-Israel war, the first kinetic attack on cloud infrastructure, Anthropic’s standoff with the Pentagon, AI coding agents, and why I started a new company called OnDB.
The internet was built with a missing piece. In 1994, when the HTTP specification reserved status code 402 for “Payment Required,” the architects knew money would eventually flow as freely as data. Three decades later, that vision is finally materializing—not because humans demanded it, but because AI agents need it.
The 402 Awakening 🔗
HTTP 402 sat dormant for decades, a placeholder for a future nobody could quite figure out. Credit cards required human intervention.
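As a quick illustration (mine, not from the post): the code has been sitting in standard libraries all along, still marked as reserved. Python's `http` module carries it today:

```python
from http import HTTPStatus

# 402 has shipped in the HTTP status registry for decades,
# still waiting for a standard payment flow to use it.
status = HTTPStatus.PAYMENT_REQUIRED
print(status.value, status.phrase)  # 402 Payment Required
```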
Porting a CUDA Fast Fourier Transform (FFT) implementation to Mojo for the LeetGPU Fast Fourier Transform challenge presented an unexpected hurdle: achieving bit-exact precision matching between CUDA’s sinf()/cosf() functions and their Mojo equivalents. This required PTX assembly analysis, cross-platform testing, and ultimately upgrading to Float64 precision for deterministic results.
Challenge Constraints 🔗
- N range: $1 \leq N \leq 262,144$ (power-of-2 FFT sizes)
- Data type: all values are 32-bit floating point numbers
- Accuracy requirements: absolute error $\leq 10^{-3}$, relative error $\leq 10^{-3}$
- Array format: input and output arrays have length $2N$ (interleaved real/imaginary)

Initial Problem: Accuracy Mismatch 🔗
The initial Mojo FFT implementation failed correctness tests with a maximum absolute difference of 0.
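To make the precision stakes concrete, here is a minimal Python sketch (illustrative only, not the post’s Mojo code) of the rounding error on a single float32 twiddle factor. Each factor alone is far inside the $10^{-3}$ tolerance; the trouble is that platform-dependent sinf()/cosf() differences accumulate across FFT stages, which is what pushed the port to Float64:

```python
import cmath
import struct

def to_f32(x: float) -> float:
    # Round a Python double to the nearest IEEE-754 float32 and back.
    return struct.unpack("f", struct.pack("f", x))[0]

N = 262_144
k = 12_345
# Twiddle factor e^(-2*pi*i*k/N) computed in double precision...
tw64 = cmath.exp(-2j * cmath.pi * k / N)
# ...versus the same factor with both components rounded to float32.
tw32 = complex(to_f32(tw64.real), to_f32(tw64.imag))
err = abs(tw64 - tw32)
print(err)  # tiny (well below 1e-6), but it compounds stage by stage
```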
Introduction 🔗
This document analyzes AMD GPU support implementation in Triton’s Gluon framework, examining architecture-specific optimizations, performance characteristics, and implementation details relative to NVIDIA GPU support.
For background on Gluon and its motivation as a lower-level alternative to Triton, see my previous post: “Gluon: When Triton Isn’t Low-Level Enough”.
Background: GPU Programming Architecture Landscape 🔗
The GPU programming ecosystem has evolved along distinct architectural lines at NVIDIA and AMD, creating implementation challenges for cross-platform frameworks.
Introduction 🔗
Byte Pair Encoding (BPE) tokenization underpins modern language models, but efficient training implementations are scarce. OpenAI’s tiktoken handles inference well, while HuggingFace’s tokenizers supports training but carries significant complexity and overhead. RustBPE is a Rust implementation that provides training capabilities with better performance.
RustBPE was developed by Andrej Karpathy as part of the nanochat project. This analysis covers the RustBPE implementation, including its architecture, performance characteristics, and Python integration.
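For readers new to BPE, the training loop that RustBPE accelerates boils down to repeatedly merging the most frequent adjacent token pair. A toy Python sketch (illustrative only, not RustBPE’s actual code):

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count adjacent token-id pairs and return the most common one.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new token id.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aaabdaaabac")
pair = most_frequent_pair(ids)  # (97, 97), i.e. the byte pair "aa"
ids = merge(ids, pair, 256)     # first learned merge becomes token 256
print(ids)
```

One such merge per vocabulary slot, over gigabytes of text, is why a fast Rust core pays off.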
Background 🔗
I recently encountered the GPU MODE TriMul challenge while exploring GPU optimization. Coming from a systems engineering background without prior PyTorch or Triton experience, this challenge provided an opportunity to learn GPU performance engineering through a practical problem.
The Triangle Multiplicative Update (TriMul) is a core operation in AlphaFold2 and AlphaFold3—the protein structure prediction systems that earned the 2024 Nobel Prize in Chemistry. The operation’s O(n³) complexity creates severe performance bottlenecks in production, forcing AlphaFold3 to use batch size 1 during training despite having under 1B parameters.
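Stripped of gating and normalization, the contraction at the heart of the “outgoing” triangle update runs per channel as a simplified sketch (mine, not the challenge kernel) shows; the sum over k for every (i, j) pair is where the O(n³) cost comes from:

```python
# Simplified per-channel core of the triangle update ("outgoing" variant,
# gating and layer norm omitted): z[i][j] = sum_k a[i][k] * b[j][k].
def trimul_core(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(trimul_core(a, b))  # [[17.0, 23.0], [39.0, 53.0]]
```

Three nested loops over sequence length n make the cubic scaling explicit, which is exactly what the challenge kernels try to tame.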
GPU production constraints are creating infrastructure bottlenecks, and multi-GPU programming, particularly vendor-agnostic implementations, has become essential. In their GPU Mode presentation, AMD Research engineers Muhammad Awad, Muhammad Osama, and Brandon Potter introduced Iris, a Python library that enables fine-grained multi-GPU programming in Triton. As with my previous Gluon blog post, this post captures my understanding and interpretation of their work, serving as both technical documentation and a personal reference for this emerging multi-GPU programming paradigm.
My Journey from PyTorch to Gluon 🔗After spending the last month diving into PyTorch, learning Triton, understanding CUDA, and even peeking at PTX/SASS assembly, I’ve come to a surprising realization: I’ve yet to meet anyone who’s actually writing raw CUDA code in production anymore. Everyone I’ve talked to – from ML engineers at startups to researchers at big tech companies – seems to have converged on Triton as their go-to solution for custom GPU kernels.
Another day, another zero-day. This time it’s CVE-2025-21043, a critical vulnerability in Android’s DNG image parser that’s been actively exploited in the wild. What makes this one particularly interesting is how it leverages an obscure feature of the DNG format—opcode lists—to achieve remote code execution.
Following our previous analysis of CVE-2025-43300 and the ELEGANTBOUNCER detection framework, let’s dive into how this vulnerability works and why it matters.
The Discovery 🔗
In September 2025, Samsung pushed a critical security update.