Hello! My name is Matt Suiche. I am an independent researcher, advisor, and investor. I previously served as the Head of Detection Engineering at Magnet Forensics, an organization passionately dedicated to justice and protecting the innocent, a mission I pursued even more intensely after its 2022 acquisition of my cybersecurity start-up, Comae Technologies.
My lifelong fascination with learning and understanding complex systems first led me to cybersecurity. My teenage years were spent immersed in reverse engineering, which ignited a profound curiosity about technology that continues to this day. I've since explored various fields including operating systems architecture, programming languages, virtualization, modern web application development, and generative art. I've also delved into domains such as privacy, surveillance, forensics, blockchain, and community development.
Porting a CUDA Fast Fourier Transform (FFT) implementation to Mojo for the LeetGPU Fast Fourier Transform challenge surfaced an unexpected problem: achieving bit-exact agreement between CUDA's sinf()/cosf() functions and their Mojo equivalents. Resolving it required PTX assembly analysis, cross-platform testing, and ultimately an upgrade to Float64 precision for deterministic results.
Challenge Constraints

- N range: $1 \leq N \leq 262,144$ (power-of-2 FFT sizes)
- Data type: all values are 32-bit floating point numbers
- Accuracy requirements: absolute error $\leq 10^{-3}$, relative error $\leq 10^{-3}$
- Array format: input and output arrays have length $2N$ (interleaved real/imaginary)

Initial Problem: Accuracy Mismatch

The initial Mojo FFT implementation failed correctness tests with a maximum absolute difference of 0.
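To make the precision issue concrete, here is a minimal NumPy sketch, entirely my own illustration rather than the original Mojo or CUDA code: it evaluates the standard radix-2 twiddle factors $e^{-2\pi i k / N}$ in Float32 and compares them against a Float64 reference at the largest challenge size.

```python
import numpy as np

N = 262_144  # largest FFT size in the challenge
k = np.arange(N)

# Twiddle-factor angles -2*pi*k/N in Float64 (reference) and Float32.
angles64 = -2.0 * np.pi * k / N
angles32 = angles64.astype(np.float32)

# Evaluate sin/cos in each precision; NumPy preserves the input dtype.
tw64 = np.cos(angles64) + 1j * np.sin(angles64)  # complex128
tw32 = np.cos(angles32) + 1j * np.sin(angles32)  # complex64

err = np.abs(tw32.astype(np.complex128) - tw64)
print(f"max per-factor twiddle error: {err.max():.2e}")
# A single evaluation sits far below the 1e-3 tolerance, but different
# Float32 sin/cos implementations (CUDA's sinf vs. Mojo's) round
# differently, and those discrepancies compound across the
# log2(N) = 18 butterfly stages of the transform.
```

The per-factor error is tiny; the trouble is that two non-identical Float32 implementations diverge from each other, and that divergence accumulates stage by stage, which is what motivates the move to Float64.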
Introduction

This document analyzes the implementation of AMD GPU support in Triton's Gluon framework, examining architecture-specific optimizations, performance characteristics, and implementation details relative to NVIDIA GPU support.
For background on Gluon and its motivation as a lower-level alternative to Triton, see my previous post: “Gluon: When Triton Isn’t Low-Level Enough”.
Background: GPU Programming Architecture Landscape

The GPU programming ecosystem has evolved along distinct architectural lines at NVIDIA and AMD, creating implementation challenges for cross-platform frameworks.
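As a concrete reference point, here is a minimal vector-add kernel in plain Triton (my own illustrative example, not code from the Gluon tree): the same tile-level source must be lowered to PTX on NVIDIA hardware and to AMDGCN on AMD hardware, which is exactly where the architectural divergence bites.

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # One program instance handles one BLOCK_SIZE-wide tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # The same tile-level code is compiled to PTX on NVIDIA GPUs
    # and to AMDGCN on AMD GPUs by Triton's backend plugins.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

Gluon operates below this abstraction, which is why vendor differences such as warp/wavefront width and shared-memory layouts surface directly in its implementation.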
Introduction

Byte Pair Encoding (BPE) tokenization is standard in modern language models, but efficient training implementations remain scarce. OpenAI's tiktoken handles inference well but does not train tokenizers, while HuggingFace's tokenizers supports training at the cost of considerable complexity and overhead. RustBPE is a Rust implementation that provides training with better performance.
RustBPE was developed by Andrej Karpathy as part of the nanochat project. This analysis covers the RustBPE implementation, including its architecture, performance characteristics, and Python integration.
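To make the training problem concrete, here is a deliberately naive pure-Python sketch of the core BPE merge loop. This is a textbook reference for the algorithm, not RustBPE's code; the function and variable names are mine.

```python
from collections import Counter

def train_bpe(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from a {word: frequency} map.

    Each word starts as a tuple of single-character symbols; every
    iteration merges the most frequent adjacent pair corpus-wide.
    """
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

# Example: learn two merges on a tiny corpus.
print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2))
```

Recounting every pair on every iteration is quadratic-flavored work; an optimized trainer instead updates pair counts incrementally as merges are applied, which is the kind of tight inner loop where a Rust implementation pays off.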