Tutorials
Unifying Attention and Diffusion with Kan Extension Transformers: Structured Deep Learning with Diagrammatic Backpropagation
Sridhar Mahadevan
Modern foundation models are powerful, but their representations, training dynamics, and agentic workflows remain difficult to audit, compose, and trust. This tutorial presents a categorical and geometric framework for trustworthy foundation-model systems. The major scientific components of the tutorial include:
- **Diagrammatic Backpropagation** (DB), which generalizes deep learning to include curvature loss functions over categorical diagrams
- **Infinitesimal Causality** (IC), which generalizes the chain rule in calculus to functors in tangent categories
- **Kan Extension Transformers** (KET), which define a structured computation substrate, unifying attention and diffusion, and providing a universal machine learning framework for mapping finite experience into infinite futures
- **Universal Decision Learning** (UDL), which is a rigorous categorical framework for building foundries, or building blocks of foundation models
- **Lie-algebra-based neural adapters** (ALLORA), which shows how to compose LoRA adapters by detecting non-commutativity using Lie brackets
- **Agentic skill optimization using Lie Algebroids** (LASKO), which formalizes optimization over tangent Markdown categories
- **Odyssey**: a demonstration system for automatic foundry construction.
The tutorial is designed as a conceptual 2.5-hour overview. Technical details are deferred to associated arXiv papers and the *Categories for AGI* book. Participants will leave with a solid understanding of a powerful categorical and geometric design language for foundation-model systems that learn locally, transfer cautiously, expose obstructions, and glue global conclusions only when the evidence permits.
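To make the non-commutativity idea behind the adapter component concrete, here is a minimal sketch. It is not the ALLORA method itself, whose details are in the associated papers. It only assumes that two LoRA updates act on the same square weight matrix, and it measures how far they are from commuting with the matrix commutator (the Lie bracket on square matrices). The function names and the normalized score are hypothetical choices made for illustration.

```python
import numpy as np

def lora_update(B, A):
    """Low-rank weight update Delta W = B @ A (B is d x r, A is r x d)."""
    return B @ A

def lie_bracket(X, Y):
    """Matrix commutator [X, Y] = XY - YX."""
    return X @ Y - Y @ X

def noncommutativity_score(X, Y, eps=1e-12):
    """Frobenius norm of [X, Y], normalized by the norms of X and Y.

    Hypothetical diagnostic: near 0 means the two updates (almost) commute.
    """
    return float(np.linalg.norm(lie_bracket(X, Y))
                 / (np.linalg.norm(X) * np.linalg.norm(Y) + eps))

rng = np.random.default_rng(0)
d, r = 8, 2  # illustrative sizes, not taken from the tutorial

# Two independent random rank-2 adapters on the same d x d weight.
X = lora_update(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
Y = lora_update(rng.normal(size=(d, r)), rng.normal(size=(r, d)))

score_generic = noncommutativity_score(X, Y)          # generic adapters
score_commuting = noncommutativity_score(X, 2.0 * X)  # a scaled copy commutes
```

A score near zero suggests the order in which the two updates are applied matters little. A larger score flags an order-dependent interaction that a composition scheme would need to handle.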
Diffusion and Flow-Matching: From Memorization to Generalization & Beyond
Mathurin Massias ⋅ Quentin Bertrand
Unlearning Data at Scale
Vinith Suriyakumar ⋅ Gautam Kamath ⋅ Ashia Wilson
Probabilistic Numerics — Computation is Machine Learning
Philipp Hennig ⋅ Marvin Pförtner ⋅ Tim Weiland
Machine learning is the process of estimating latent representations or variables from *finite data*. If the data is insufficient, this inference process leaves a finite *estimation error*. Probabilistic (Bayesian) machine learning attempts to capture this empirical uncertainty in a probability distribution.
But what actually happens inside of a Learning Machine, the computational side of ML, is invariably the solution of a *numerical problem*: *Optimisation* for deep learning, solving *differential equations* for diffusion, flow matching, and scientific simulation, or even just (large-scale, approximate) numerical *linear algebra*. These numerical tasks have no analytic solution in reach. The computational resources are insufficient, and so the computation leaves a finite *computational error*. **Probabilistic numerical methods attempt to capture this computational uncertainty in a probability distribution.**
By matching the mathematical modelling language of the empirical and the computational side of machine learning in this way, probabilistic numerical methods open new opportunities for computational savings, and new functionality in the ML stack: Computational and data uncertainty can be controlled in relation to each other, and information from data can flow "backwards" through a computation to solve inverse problems. A growing research community within ML is developing this toolchain, typically by building on established, highly efficient, classic numerical methods.
The tutorial is split into three parts. We will start with a simple worked example to establish key concepts and patterns. A second part will generalise these insights into a design pattern across a large class of numerical tasks. Finally, a hands-on code demo will demonstrate how probabilistic numerical methods work in practice.
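As a small illustration of the idea (not the tutorial's own worked example), the sketch below applies Bayesian quadrature to an integral over [0, 1]. It places a zero-mean Gaussian process prior with an RBF kernel on the integrand, conditions on a few function evaluations, and returns a Gaussian belief over the integral: a mean estimate plus a variance that expresses the remaining computational uncertainty. The function names, the length scale, and the test integrand are assumptions chosen for illustration.

```python
import math
import numpy as np

def rbf(a, b, ell):
    """RBF kernel matrix k(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 ell^2))."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def bayesian_quadrature(f, nodes, ell=0.3, jitter=1e-10):
    """Gaussian belief over the integral of f on [0, 1] under a GP(0, RBF) prior."""
    x = np.asarray(nodes, dtype=float)
    y = f(x)
    K = rbf(x, x, ell) + jitter * np.eye(len(x))  # jitter for numerical stability

    c = ell * math.sqrt(math.pi / 2.0)
    s = math.sqrt(2.0) * ell
    # Kernel mean z_i: closed-form integral of k(t, x_i) for t in [0, 1].
    z = np.array([c * (math.erf((1.0 - xi) / s) + math.erf(xi / s)) for xi in x])
    # Prior variance of the integral: double integral of k over [0, 1]^2.
    prior_var = 2.0 * (c * math.erf(1.0 / s)
                       - ell ** 2 * (1.0 - math.exp(-0.5 / ell ** 2)))

    w = np.linalg.solve(K, z)                 # quadrature weights K^{-1} z
    mean = float(w @ y)                       # posterior mean of the integral
    var = max(prior_var - float(w @ z), 0.0)  # posterior variance, clipped against round-off
    return mean, var, prior_var

f = lambda t: np.sin(3.0 * t)
truth = (1.0 - math.cos(3.0)) / 3.0  # exact integral of sin(3t) on [0, 1]
mean, var, prior_var = bayesian_quadrature(f, np.linspace(0.0, 1.0, 8))
```

The posterior variance depends only on the node locations and the kernel, not on the observed values. This is what allows computational uncertainty to be budgeted before any function evaluation is made.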
Proving Theorems with Lean and Machine Learning
Rémy Degenne ⋅ Wenda Li
AI agents can now write mathematics, including proofs of theorems relevant to Machine Learning, but we can’t trust them yet. Subtle errors might be hidden deep in the reasoning steps, and checking the proofs manually takes a lot of time and expertise.
The Lean theorem prover provides a way to write formal, machine-checkable proofs, giving us high confidence in their correctness. AI systems have managed to reach gold medal level at the International Mathematical Olympiad while producing Lean-checked proofs. Could we get them to write research-level, verified mathematics?
In this tutorial, we introduce Lean and its mathematical library Mathlib, and show how they can be used to write trusted proofs, in particular machine learning theory proofs. We then show how machine learning can help with theorem proving, and present recent advances in AI-assisted formalization.
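For readers who have not seen Lean, the following is a minimal sketch of a machine-checked proof. It is not taken from the tutorial materials and assumes a Lean 4 project with Mathlib available. The statement is the elementary inequality 2ab ≤ a² + b², a step that appears in many optimization and learning-theory arguments.

```lean
import Mathlib

-- 2ab ≤ a² + b² for real numbers, from the nonnegativity of (a - b)².
example (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  nlinarith [sq_nonneg (a - b)]
```

If the file compiles, the Lean kernel has verified the proof. The reader only needs to trust the statement, not the reasoning steps.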
Adaptive Reasoning in LLMs: From Post-Training to Test-Time Learning (partially remote)
Akhil Arora ⋅ Nouha Dziri
Calibration: From Predictions to Decisions, Collaboration, and Alignment
Aaron Roth ⋅ Natalie Collina ⋅ Ira Globus-Harris
Evaluating and Training LLMs for Math Copilots and Theorem Proving
Simon Frieder ⋅ Philip Vonderlind
Is numerical optimization theory irrelevant to machine learning practice in 2026?
Mark Schmidt
We are seeing more numerical optimization theory papers published than ever before. These papers often make unrealistic assumptions or propose algorithms that never get adopted. So is all this optimization theory largely useless?
In this tutorial I show how some surprisingly simple optimization ideas can explain a wide variety of the implementation choices we make when training modern deep learning models. Some of these ideas might have let us skip some generations of grad-student descent, or have led to state-of-the-art tricks in modern architectures. On the other hand, I will highlight how some important practical ideas are not explained by optimization theory and where we can go from here.
Here is a list of keywords to get you (and your LLM sidekick) interested in attending: Adam and [*]A[*]d[*]a[*]m[*], Muon and its friends/enemies, critical-ish batch size, the RMSnorm and skip connection love affair, dead ReLUs and living SwiGLU, Schedule-Free and WSD and muP and max\_grad\_norm = 1.0, variance reduction and shuffle=True, and maybe edge-of-stability/catapults/feature-learning. I may also tell you why your second-order stochastic optimization method did not work.
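As one concrete reading of the `max_grad_norm = 1.0` keyword, here is a minimal sketch of global-norm gradient clipping, the operation that setting commonly controls in training scripts. The function name and the NumPy formulation are assumptions made for illustration. The tutorial's own treatment may differ.

```python
import numpy as np

def clip_grad_norm(grads, max_grad_norm=1.0, eps=1e-6):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_grad_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = float(np.sqrt(sum(float(np.sum(g * g)) for g in grads)))
    # Scale is 1 when the norm is already within the threshold.
    scale = min(1.0, max_grad_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Two parameter groups whose combined gradient has norm 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_grad_norm(grads)
norm_after = float(np.sqrt(sum(float(np.sum(g * g)) for g in clipped)))

# A small gradient is left unchanged.
small, _ = clip_grad_norm([np.array([0.1])])
```

All parameter groups share a single scale factor, so clipping preserves the direction of the full gradient and changes only its length.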
New Techniques for Sequence Prediction: Spectral Filtering and Preconditioning
Elad Hazan ⋅ Annie Marsden