instant Bayesian optimization! (kind of)
Bayesian optimization (BO) is one of the pillars of modern machine learning and scientific discovery. It’s a standard tool for finding the best hyperparameters for a model, the ideal material composition, or the most effective drug compound. The textbook picture of BO is an elegant and simple loop: fit a probabilistic surrogate model (usually a Gaussian Process aka GP) to your observations, then optimize a so-called acquisition function to decide where to sample next, rinse and repeat.
While BO can be very fast nowadays, with solid implementations such as BoTorch, the classic loop can become intricate and sluggish once you move beyond the most basic or “vanilla” settings. There is a whole zoo of options to choose from: many different Gaussian Process kernels and an ever-growing list of acquisition functions (e.g., Expected Improvement, Upper Confidence Bound, Entropy Search, and many more). Moreover, something that seems like it should be simple in a method that has “Bayesian” in the name – for example, including an educated guess (a prior) about the location or value of the optimum – is not at all straightforward to incorporate into the standard GP framework.
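For concreteness, here is roughly what the textbook loop looks like with BoTorch. This is a minimal sketch with a toy 1D objective and illustrative settings, nothing specific to what comes later in this post:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

# Toy 1D objective; BoTorch maximizes, so negate if you are minimizing.
def objective(x):
    return torch.sin(5.0 * x) * (1.0 - x) + 0.1 * x

bounds = torch.tensor([[0.0], [1.0]])   # search box
train_x = torch.rand(5, 1)              # a few initial random evaluations
train_y = objective(train_x)

for _ in range(15):
    # 1. Fit the GP surrogate to the observations so far.
    gp = SingleTaskGP(train_x, train_y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))

    # 2. Optimize the acquisition function to pick the next query point.
    acq = ExpectedImprovement(gp, best_f=train_y.max())
    x_next, _ = optimize_acqf(acq, bounds=bounds, q=1,
                              num_restarts=10, raw_samples=64)

    # 3. Evaluate the objective there; rinse and repeat.
    train_x = torch.cat([train_x, x_next])
    train_y = torch.cat([train_y, objective(x_next)])

print("best observed value:", train_y.max().item())
```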
But what if, instead of all this, we could just… predict the optimum?
The core idea I want to discuss in this blog post is this: if we are smart about it, we can reframe the entire task of optimization as a straightforward prediction problem.
Think about how humans develop expertise: after seeing enough problems and their solutions, an expert stops reasoning everything out from scratch and starts recognizing the right answer almost immediately.
We can do the same with machine learning. If we can generate a virtually infinite dataset of problems with known solutions, we can (pre)train a large model – like a transformer, the same architecture that powers modern Large Language Models (LLMs) – to learn the mapping from problem to solution. This is the essence of amortized inference or meta-learning. For a new problem, the model doesn’t need to reason from first principles; it makes a fast, amortized prediction using its learned “intuition”.
The main bottleneck for this approach is similar to the problem faced by modern LLMs: finding the training data. Where do we get a limitless dataset of functions with known optima?
While there are well-known techniques to generate functions (for example, using our old friends, the GPs), if we are required to optimize them to know their optimum, it looks like we are back to square one. The functions we want to train on are exactly those difficult, pesky functions where finding the optimum is hard in the first place. Generating such (function, optimum) pairs would be extremely expensive.
But it turns out you can do better than this, if you’re willing to get your hands dirty with a bit of generative modeling.
In our ACE paper, we needed to create a massive dataset of functions to train our model. The challenge was ensuring each function was unique, complex, and – most importantly – had a single, known global optimum $(\mathbf{x}_{\text{opt}}, y_{\text{opt}})$ which we could give our network as a target or label for training. Here is the recipe we came up with, which you can think of in four steps.
First, we decide what kind of function we want to generate. Is it very smooth and slowly varying? Is it highly oscillatory? We define this “character” by sampling a kernel for a Gaussian Process (GP), such as an RBF or Matérn kernel, along with its hyperparameters (like length scales). This gives us a prior over a certain “style” of functions.
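In code, Step 1 can be as simple as rolling a few dice over kernel families and their hyperparameters. A minimal NumPy sketch (the specific kernels and ranges below are illustrative choices, not the paper’s exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X1, X2, ls):
    # Squared-exponential (RBF) kernel: very smooth samples.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def matern52(X1, X2, ls):
    # Matérn 5/2 kernel: rougher, less smooth samples.
    d = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1))
    a = np.sqrt(5.0) * d / ls
    return (1.0 + a + a ** 2 / 3.0) * np.exp(-a)

def sample_kernel():
    # Step 1: sample a function "style": kernel family + hyperparameters.
    base = [rbf, matern52][rng.integers(2)]
    lengthscale = np.exp(rng.uniform(np.log(0.05), np.log(0.5)))  # wiggliness
    outputscale = np.exp(rng.uniform(np.log(0.5), np.log(2.0)))   # vertical scale
    return lambda X1, X2: outputscale * base(X1, X2, lengthscale)

kernel = sample_kernel()  # a covariance function k(X1, X2) defining this style
```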
Next, we choose a location for the global optimum, $\mathbf{x}_{\text{opt}}$, usually by sampling it uniformly within a box. Then comes an interesting trick. We don’t just pick any value $y_{\text{opt}}$. To make it realistic, we sample it from the minimum-value distribution for the specific GP family we chose in Step 1. This ensures that the optimum’s value is statistically plausible for that function style. With a small probability, we bump the minimum to be even lower, to make our method robust to “unexpectedly low” minima.
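The minimum-value distribution of a GP has no simple closed form, so here is one hedged way to realize Step 2 in practice: approximate it empirically from the minima of many prior draws on a grid. All specific numbers below (grid size, number of draws, the probability and size of the extra dip) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(X1, X2, ls=0.2):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

dim = 1
grid = rng.uniform(0.0, 1.0, size=(256, dim))     # points covering the search box
K = rbf(grid, grid) + 1e-6 * np.eye(len(grid))    # prior covariance (+ jitter)
L = np.linalg.cholesky(K)

# Approximate the minimum-value distribution empirically:
# draw many prior functions on the grid and record each one's minimum.
prior_minima = np.array([(L @ rng.standard_normal(len(grid))).min()
                         for _ in range(2000)])

# Step 2: sample the optimum location uniformly in the box...
x_opt = rng.uniform(0.0, 1.0, size=dim)
# ...and its value from the (empirical) minimum-value distribution.
y_opt = rng.choice(prior_minima)
# Occasionally push the minimum even lower, for robustness to "unexpectedly low"
# optima (the 10% probability and the size of the dip are illustrative values).
if rng.random() < 0.1:
    y_opt -= np.abs(rng.standard_normal())
```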
Then, we generate a function from the GP prior defined in Step 1, conditioning it to pass through the optimum location and value $(\mathbf{x}_{\text{opt}}, y_{\text{opt}})$ established in Step 2. This is done by treating the optimum as a known data point.
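Concretely, assuming a zero-mean GP prior and treating $(\mathbf{x}_{\text{opt}}, y_{\text{opt}})$ as a single noiseless observation, the conditioned process has the standard GP posterior mean and covariance

\[\mu(\mathbf{x}) = \frac{k(\mathbf{x}, \mathbf{x}_{\text{opt}})}{k(\mathbf{x}_{\text{opt}}, \mathbf{x}_{\text{opt}})}\, y_{\text{opt}}, \qquad k^{\prime}(\mathbf{x}, \mathbf{x}^{\prime}) = k(\mathbf{x}, \mathbf{x}^{\prime}) - \frac{k(\mathbf{x}, \mathbf{x}_{\text{opt}})\, k(\mathbf{x}_{\text{opt}}, \mathbf{x}^{\prime})}{k(\mathbf{x}_{\text{opt}}, \mathbf{x}_{\text{opt}})},\]

and the generated function is simply a sample from this conditioned GP.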
However, simply forcing the function to go through this point is not enough. The GP is a flexible, random process; a sample from it might wiggle around and create an even lower minimum somewhere else by chance. To train our model, we need to be certain that $(\mathbf{x}_{\text{opt}}, y_{\text{opt}})$ is the true global optimum.
To guarantee this, we apply a transformation. As detailed in our paper’s appendix, we modify the function by adding a convex envelope. We transform all function values $y_i$ like this:
\[y_{i}^{\prime} = y_{\text{opt}} + |y_{i} - y_{\text{opt}}| + \frac{1}{5}\|\mathbf{x}_{\text{opt}} - \mathbf{x}_{i}\|^{2}.\]

Let’s break down what this does. The term $y_{\text{opt}} + |y_{i} - y_{\text{opt}}|$ is key. If a function value $y_i$ is already above our chosen optimum $y_{\text{opt}}$, it remains unchanged. However, if $y_i$ happens to be below the optimum, this term reflects it upwards, placing it above $y_{\text{opt}}$. This ensures that no point in the function has a value lower than our chosen minimum. Then, we add the quadratic “bowl” term that has its lowest point exactly at $\mathbf{x}_{\text{opt}}$.
This is a simple but effective way to ensure the ground truth for our generative process is, in fact, true. Without it, we would be feeding our network noisy labels, where the provided “optimum” isn’t always the real one.
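In code, the transformation is just a couple of lines of NumPy. A sketch, with variable names of my own choosing, applied to the locations $X$ and values $y$ sampled from the conditioned GP:

```python
import numpy as np

def enforce_global_minimum(X, y, x_opt, y_opt):
    # X: (n, d) input locations; y: (n,) values sampled from the conditioned GP.
    # Reflect any value that dips below y_opt back above it...
    y_reflected = y_opt + np.abs(y - y_opt)
    # ...then add the quadratic "bowl" (the 1/5 factor from the formula above),
    # which is zero at x_opt and grows with the distance from it.
    bowl = 0.2 * np.sum((X - x_opt) ** 2, axis=-1)
    return y_reflected + bowl
```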
With the function’s shape secured, we simply sample the data points (the (x, y) pairs) that we’ll use for training. We also add a random vertical offset to the whole function. This prevents the model from cheating by learning, for example, that optima are always near $y=0$.
By repeating this recipe millions of times, we can build a massive, diverse dataset of (function, optimum) pairs. The hard work is done. Now, we just need to learn from it.
Once you have this dataset, the rest is fairly standard machine learning. We feed our model, ACE, a “context set” consisting of a few observed (x, y) pairs from a function. The model’s task is to predict the latent variables we care about: $\mathbf{x}_{\text{opt}}$ and $y_{\text{opt}}$. Here the term latent is taken from the language of probabilistic modeling, and simply means “unknown”, as opposed to the observed function values.
Because ACE is a transformer, it uses the attention mechanism to see the relationships between the context points, and we set it up to output a full predictive distribution for the optimum, not just a single point estimate. This means we get uncertainty estimates for free, which is crucial for any Bayesian approach.
In addition to predicting the latent variables, ACE can also predict data, i.e., function values $y^\star$ at any target point $\mathbf{x}^\star$, following the recipe of similar models such as the Transformer Neural Process (TNP).
So we have a model that, given a few observations, can predict a probability distribution over the optimum’s location and value. How do we use this to power the classic Bayesian optimization loop?
At each step, we need to decide which point $\mathbf{x}_{\text{next}}$ to evaluate. This choice is guided by an acquisition function. One of the most intuitive acquisition strategies is Thompson sampling, which suggests that we should sample our next point from our current belief about where the optimum is. For us, this would mean sampling from $p(\mathbf{x}_{\text{opt}}|\mathcal{D})$, which we can easily do with ACE.
But there’s a subtle trap here. If we just sample from our posterior over the optimum’s location, we risk getting stuck. The model’s posterior will naturally concentrate around the best point seen so far – which is a totally sensible belief to hold. However, sampling from it might lead us to repeatedly query points in the same “good” region without ever truly exploring for a great one. The goal is to find a better point, not just to confirm where we think the current optimum is.
This is where having predictive distributions over both the optimum’s location and value becomes relevant. With ACE, we can use an enhanced version of Thompson sampling that explicitly encourages exploration (see our paper for details). First, we sample a candidate optimum value $y_{\text{opt}}$ from its predictive distribution, conditioned to be lower than the best (lowest) value observed so far; then, we sample the next query point from the distribution over $\mathbf{x}_{\text{opt}}$, conditioned on the data and on that hypothesized improved value.
This two-step process elegantly balances exploitation (by conditioning on data) and exploration (by forcing the model to seek improvement). It’s a simple, probabilistic way to drive the search towards new and better regions of the space, as shown in the example below.
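To make the two steps concrete, here is a sketch of a single acquisition step. The `model` interface (`sample_y_opt`, `sample_x_opt`) is hypothetical shorthand for “a model that can sample from its predictive distributions over the optimum’s value and location, possibly with extra conditioning”; it is not ACE’s actual API:

```python
def thompson_step(model, context, y_best):
    # One acquisition step of improvement-seeking Thompson sampling (a sketch).
    #   model   : amortized model with the hypothetical sampling methods below
    #   context : the (x, y) pairs observed so far
    #   y_best  : lowest function value observed so far (we are minimizing)

    # Step 1 (seek improvement): sample a hypothetical optimum value from
    # p(y_opt | data), truncated to be lower than the best value seen so far.
    y_target = model.sample_y_opt(context, below=y_best)

    # Step 2: sample a query location from the distribution over x_opt,
    # conditioned on the data *and* on that hypothesized improved value.
    x_next = model.sample_x_opt(context, given_y_opt=y_target)
    return x_next
```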
While this enhanced Thompson sampling is powerful and simple, the story doesn’t end here. Since ACE gives us access to these explicit predictive distributions, implementing more sophisticated, information-theoretic acquisition functions (like Max-value Entropy Search, or MES) becomes much more direct, since the quantities they need are exactly the predictive distributions over $y_{\text{opt}}$ and $\mathbf{x}_{\text{opt}}$ that the model already provides.
Predicting the optimum from a few data points is powerful, but what if you’re not starting from complete ignorance? Often, you have some domain knowledge. For example, if you are tuning the hyperparameters of a neural network, you might have a strong hunch that the optimal learning rate is more likely to be in the range $[0.0001, 0.01]$ than around $1.0$. This kind of information is called a prior in Bayesian terms.
Incorporating priors into the standard Bayesian optimization loop is surprisingly tricky. While the Bayesian framework is all about updating beliefs, shoehorning prior knowledge about the optimum’s location or value into a standard Gaussian Process model is not straightforward, and either requires heuristics or complex, custom solutions (see, for example, the discussion and references in our paper).
This is another area where an amortized approach shines. Because we control the training data generation, we can teach ACE not only to predict the optimum but also how to listen to and use a prior. During its training, we don’t just show ACE functions; we also provide it with various “hunches” (priors of different shapes and strengths) about where the optimum might be for those functions, or for its value. By seeing millions of examples, ACE learns to combine the information from the observed data points with the hint provided by the prior.
At runtime, the user can provide a prior distribution over the optimum’s location, $p(\mathbf{x}_{\text{opt}})$, or value $p(y_{\text{opt}})$, as a simple histogram across each dimension. ACE then seamlessly integrates this information to produce a more informed (and more constrained) prediction for the optimum. This allows for even faster convergence, as the model doesn’t waste time exploring regions that the user already knows are unpromising. Instead of being a complex add-on, incorporating prior knowledge becomes another natural part of the prediction process.
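For instance, the learning-rate hunch from earlier could be encoded as a per-dimension histogram roughly like this (the bin count, range, and the Gaussian shape of the hunch are all illustrative assumptions; the exact prior format is described in the paper):

```python
import numpy as np

# Search over log10(learning rate) in [-5, 0], i.e. learning rates from 1e-5 to 1.
bin_edges = np.linspace(-5.0, 0.0, 101)               # 100 bins over this dimension
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

# Hunch: the optimum is probably in [1e-4, 1e-2], i.e. log10(lr) around -3.
weights = np.exp(-0.5 * ((bin_centers + 3.0) / 0.7) ** 2)   # a soft Gaussian bump
prior_histogram = weights / weights.sum()                    # normalized histogram

# `prior_histogram` is now a discrete belief over this dimension that an amortized
# model trained with priors (like ACE) can condition on, alongside the observed data.
```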
The main takeaway is that by being clever about data generation, we can transform traditionally complex inference and reasoning problems into large-scale prediction tasks. This approach unifies seemingly disparate fields. In the ACE paper, we show that the exact same architecture can be used for Bayesian optimization, simulation-based inference (predicting simulator parameters from data), and even image completion and classification (predicting class labels or missing pixels).
Everything – well, almost everything – boils down to conditioning on data and possibly task-relevant latents (or prior information), and predicting data or other task-relevant latent variables, where what counts as a “latent variable” depends on the task. For example, in BO, as we saw in this blog post, the latents of interest are the location $\mathbf{x}_{\text{opt}}$ and value $y_{\text{opt}}$ of the global optimum.
This is not to say that traditional methods are obsolete. They provide the theoretical foundation and are indispensable when you can’t generate realistic training data. But as our simulators get better and our generative models more powerful, the paradigm of “just predicting” the answer is becoming an increasingly powerful and practical alternative – see, for example, this recent position paper.
Direct prediction is only one part of the story. As we hinted at earlier, a key component of intelligence – both human and artificial – isn’t just pattern recognition, but also planning or search (the “thinking” part of modern LLMs and large reasoning models). This module actively decides what to do next to gain the most information. The acquisition strategies we covered are a form of planning that is not amortized. To amortize that part as well, we have been working on a more powerful and general framework that tightly integrates amortized inference with amortized active data acquisition. This new system is called the Amortized Active Learning and Inference Engine (ALINE).
The Amortized Conditioning Engine (ACE) is a new, general-purpose framework for multiple kinds of prediction tasks. On the paper website you can find links to all relevant material including code, and we are actively working on extending the framework in manifold ways. If you are interested in this line of research, please get in touch!