# Bayesian Machine Scientist

## Example

Let's generate a simple data set with two features $$x_1, x_2 \in [0, 1]$$ and a target $$y$$. We will use the following generative model: $$y = 2 x_1 + e^{5 x_2}$$

import numpy as np

x_1 = np.linspace(0, 1, num=10)
x_2 = np.linspace(0, 1, num=10)
X = np.array(np.meshgrid(x_1, x_2)).T.reshape(-1,2)

y = 2 * X[:,0] + np.exp(5 * X[:,1])
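As a quick sanity check on the grid construction, the sketch below rebuilds `X` the same way and inspects its layout. It uses only NumPy, and the variable names simply mirror the ones above.

```python
import numpy as np

x_1 = np.linspace(0, 1, num=10)
x_2 = np.linspace(0, 1, num=10)

# meshgrid returns two (10, 10) arrays; stacking, transposing, and reshaping
# them yields one row per (x_1, x_2) combination.
X = np.array(np.meshgrid(x_1, x_2)).T.reshape(-1, 2)

print(X.shape)                       # (100, 2)
print(X.min(axis=0), X.max(axis=0))  # per-column range of the features
print(X[:3])                         # column 0 holds x_1, column 1 holds x_2
```

With this construction, the first column (`x_1`) varies slowest and the second column (`x_2`) cycles through its ten values within each block of rows.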


Now let us choose a prior over the primitives. In this case, we will use the priors determined by Guimerà et al. (2020).

prior = "Guimera2020"


## Set up the BMS Regressor

We will use the BMS Regressor to predict the outcomes. There are a number of parameters that determine how the search over equations is performed. The most important ones are listed below:

• epochs: The number of epochs to run BMS. This corresponds to the total number of equation mutations: in each epoch, one MCMC step is performed for each parallel-tempered equation, followed by one tree swap between a pair of parallel-tempered equations.
• prior_par: A dictionary of priors for each operation. The keys correspond to operations and the respective values correspond to prior probabilities of those operations. The model comes with a default.
• ts: A list of temperature values. The machine scientist creates an equation tree for each of these values. Higher-temperature trees are harder to fit, and thus they help prevent overfitting of the model.
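To make the roles of `epochs` and `ts` more concrete, here is a minimal sketch of generic parallel tempering on a toy one-dimensional energy function. It is not the BMS implementation: the functions `energy`, `mcmc_step`, `tree_swap`, and `run`, as well as the scalar state standing in for an equation tree, are all assumptions made for illustration.

```python
import math
import random


def energy(state):
    # Toy double-well energy. In BMS the analogous quantity would be a score
    # of how well an equation describes the data (assumption for illustration).
    return (state**2 - 1.0) ** 2


def mcmc_step(state, temperature, rng):
    # Metropolis update: propose a small move and accept it with
    # probability min(1, exp(-dE / T)).
    proposal = state + rng.gauss(0.0, 0.5)
    delta = energy(proposal) - energy(state)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        return proposal
    return state


def tree_swap(states, temperatures, rng):
    # Pick a random pair of neighbouring chains and swap their states with
    # the standard parallel-tempering acceptance probability.
    i = rng.randrange(len(states) - 1)
    beta_i, beta_j = 1.0 / temperatures[i], 1.0 / temperatures[i + 1]
    log_accept = (beta_i - beta_j) * (energy(states[i]) - energy(states[i + 1]))
    if log_accept >= 0 or rng.random() < math.exp(log_accept):
        states[i], states[i + 1] = states[i + 1], states[i]


def run(epochs, temperatures, seed=0):
    rng = random.Random(seed)
    states = [2.0 for _ in temperatures]  # one chain per temperature
    for _ in range(epochs):
        # One epoch: one MCMC step per chain, then one swap attempt.
        states = [mcmc_step(s, t, rng) for s, t in zip(states, temperatures)]
        tree_swap(states, temperatures, rng)
    return states


temperatures = [1.0] + [1.04**k for k in range(1, 20)]
final_states = run(epochs=200, temperatures=temperatures)
print(len(final_states))  # one state per temperature
```

In this toy version, hotter chains accept uphill moves more readily, and the swap step lets a state found at high temperature migrate down to the coldest chain.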

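Let's set up the BMS regressor with an illustrative set of temperatures and otherwise default parameters. The name chosen above (`"Guimera2020"`) is a string, whereas `prior_par` expects a dictionary, so the argument is omitted here and the regressor falls back on its default priors. To use custom priors, pass a dictionary that maps each operation to its prior probability.

from autora.skl.bms import BMSRegressor

# Geometric temperature ladder: 1.0, 1.04, 1.04**2, ..., 1.04**19
temperatures = [1.0] + [1.04**k for k in range(1, 20)]

bms_estimator = BMSRegressor(
    epochs=1500,      # number of epochs to run
    ts=temperatures,  # one equation tree per temperature
)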


Now we have everything to fit and verify the model.

bms_estimator.fit(X, y)            # run the equation search on the data
y_pred = bms_estimator.predict(X)  # predictions of the fitted model
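One simple way to verify the fit is to compare the predictions against the known targets. The sketch below defines two small helper functions, `rmse` and `r_squared` (hypothetical names, not part of AutoRA), and demonstrates them on stand-in values. In practice, the output of `bms_estimator.predict(X)` would take the place of `y_pred`.

```python
import numpy as np


def rmse(y_true, y_pred):
    # Flatten both inputs so that column-vector predictions compare correctly.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Stand-in values for illustration only (not produced by BMS).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])

print(rmse(y_true, y_pred))       # 0.5
print(r_squared(y_true, y_pred))  # 0.8
```

Because the data here were generated without noise, a well-fitted model would be expected to give an RMSE close to zero on the training grid.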


## Troubleshooting

If the fitted model is unsatisfactory, we can troubleshoot it by adjusting a few parameters:

• Increasing the number of epochs. The original paper recommends 1500-3000 epochs for reliable fitting. The default is set to 1500.
• Using custom priors that are more relevant to the data. The default priors are over equations nonspecific to any particular scientific domain.
• Increasing the range of temperature values to escape local minima.
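• Reducing the differences between neighboring temperatures, which makes swaps between parallel-tempered equations more likely to be accepted and so helps the search escape local minima.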
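The last two adjustments both concern the temperature ladder passed as `ts`. As a rough sketch, the hypothetical helper `geometric_ladder` below builds ladders of the same form as the one used earlier, so that the range (highest temperature) and the spacing (ratio between neighbours) can be varied independently. The swap-acceptance formula in `swap_acceptance` is the standard one for parallel tempering and is assumed here rather than taken from the BMS source; the energy values are made up for illustration.

```python
import math


def geometric_ladder(ratio, n):
    """Return n temperatures 1, ratio, ratio**2, ..."""
    return [ratio**k for k in range(n)]


def swap_acceptance(t_low, t_high, e_low, e_high):
    """Standard parallel-tempering probability of swapping two chains."""
    log_p = (1.0 / t_low - 1.0 / t_high) * (e_low - e_high)
    return 1.0 if log_p >= 0 else math.exp(log_p)


baseline = geometric_ladder(1.04, 20)  # same values as the ladder used earlier
wider = geometric_ladder(1.04, 40)     # larger range: hotter top chain
denser = geometric_ladder(1.02, 40)    # smaller gaps, similar range to baseline

for name, ladder in [("baseline", baseline), ("wider", wider), ("denser", denser)]:
    # Swap probability between the two coldest chains when the colder chain
    # currently holds the lower-energy state (illustrative energies).
    p = swap_acceptance(ladder[0], ladder[1], e_low=10.0, e_high=60.0)
    print(f"{name}: max T = {ladder[-1]:.2f}, neighbour swap p = {p:.3f}")
```

Any of these lists could be passed as `ts`. Note that adding temperatures also adds equation trees, so each epoch does more work.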