AutoRA (Automated Research Assistant) is an open-source framework designed to automate various stages of empirical research, including model discovery, experimental design, and data collection.
This notebook is the fourth of four notebooks within the basic tutorials of autora
. We suggest that you go through these notebooks in order as each builds upon the last. However, each notebook is self-contained and so there is no need to run the content of the last notebook for your current notebook.
These notebooks provide a comprehensive introduction to the capabilities of autora
. It demonstrates the fundamental components of autora
, and how they can be combined to facilitate automated (closed-loop) empirical research through synthetic experiments.
How to use this notebook You can progress through the notebook section by section or directly navigate to specific sections. If you choose the latter, it is recommended to execute all cells in the notebook initially, allowing you to easily rerun the cells in each section later without issues.
Tutorial Setup¶
We will here import some standard python packages, set seeds for replicability, and define a plotting function.
#### Installation ####
!pip install -q "autora[theorist-bms]"
!pip install -q "autora[experiment-runner-synthetic-abstract-equation]"
#### Import modules ####
from typing import Optional
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sympy as sp
import torch
from autora.variable import Variable, ValueType, VariableCollection
from autora.state import StandardState, on_state, estimator_on_state
from autora.experimentalist.random import random_pool
from autora.theorist.bms import BMSRegressor
from autora.experiment_runner.synthetic.abstract.equation import equation_experiment
#### Set seeds ####
np.random.seed(42)
torch.manual_seed(42)
#### Define plot function ####
def plot_from_state(s: StandardState, expr: str):
"""
Plots the data, the ground truth model, and the current predicted model
"""
#Determine labels and variables
print(s.models[-1])
model_label = f"Model: {s.models[-1]}" if hasattr(s.models[-1],'repr') else "Model"
experiment_data = s.experiment_data.sort_values(by=["x"])
ground_x = np.linspace(s.variables.independent_variables[0].value_range[0],s.variables.independent_variables[0].value_range[1],100)
#Determine predicted ground truth
equation = sp.simplify(expr)
ground_predicted_y = [equation.evalf(subs={'x':x}) for x in ground_x]
model_predicted_y = s.models[-1].predict(ground_x.reshape(-1, 1))
#Plot the data and models
f = plt.figure(figsize=(4,3))
plt.plot(experiment_data["x"], experiment_data["y"], 'o', label = None)
plt.plot(ground_x, model_predicted_y, alpha=.8, label=model_label)
plt.plot(ground_x, ground_predicted_y, alpha=.8, label=f'Ground Truth: {expr}')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
WARNING: typer 0.12.3 does not provide the extra 'all' WARNING: typer 0.12.3 does not provide the extra 'all'
Customizing Automated Empirical Research Components¶
autora
is a flexible framework in which users can integrate their own experimentalists, experiment runners, and theorists in an automated empirical research workflow. This section illustrates the integration of custom autora
components. For more information on how to contribute your own modules to the autora
ecosystem, please refer to the Contributor Documentation.
To illustrate the use of custom experimentalists, experiment runners, and theorists, we consider a simple workflow:
- Generate 10 seed experimental conditions using
random_pool
- Iterate through the following steps
- Identify 3 new experimental conditions using an
experimentalist
- Collect observations using the
experiment_runner
- Identify a model relating conditions to observations using a
theorist
- Identify 3 new experimental conditions using an
Once this workflow is setup, we will replace each component with a custom function.
#### Define metadata ####
iv = Variable(name="x", value_range=(0, 2 * np.pi), allowed_values=np.linspace(0, 2 * np.pi, 30))
dv = Variable(name="y", type=ValueType.REAL)
variables = VariableCollection(independent_variables=[iv],dependent_variables=[dv])
#### Define condition pool ####
conditions = random_pool(variables, num_samples=10, random_state=0)
#### Define state ####
s = StandardState(
variables = variables,
conditions = conditions,
experiment_data = pd.DataFrame(columns=["x","y"])
)
#### Define experimentalist and wrap with state functionality ####
experimentalist = on_state(random_pool, output=["conditions"])
#### Define experiment runner and wrap with state functionality ####
sin_experiment = equation_experiment(sp.simplify('sin(x)'), variables.independent_variables, variables.dependent_variables[0])
sin_runner = sin_experiment.run
experiment_runner = on_state(sin_runner, output=["experiment_data"])
#### Define theorist and wrap with state functionality ####
theorist = estimator_on_state(BMSRegressor(epochs=100))
We should quickly test to make sure everything works as expected.
print('\033[1mPrevious State:\033[0m')
print(s)
for cycle in range(2):
s = experimentalist(s, num_samples=10, random_state=42+cycle)
s = experiment_runner(s, added_noise=0.5, random_state=42+cycle)
s = theorist(s)
plot_from_state(s, 'sin(x)')
print('\n\033[1mUpdated State:\033[0m')
print(s)
INFO:autora.theorist.bms.regressor:BMS fitting started
Previous State:
StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(0, 6.283185307179586), allowed_values=array([0. , 0.21666156, 0.43332312, 0.64998469, 0.86664625,
1.08330781, 1.29996937, 1.51663094, 1.7332925 , 1.94995406,
2.16661562, 2.38327719, 2.59993875, 2.81660031, 3.03326187,
3.24992343, 3.466585 , 3.68324656, 3.89990812, 4.11656968,
4.33323125, 4.54989281, 4.76655437, 4.98321593, 5.1998775 ,
5.41653906, 5.63320062, 5.84986218, 6.06652374, 6.28318531]), units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions= x
0 5.416539
1 4.116570
2 3.249923
3 1.733292
4 1.949954
5 0.216662
6 0.433323
7 0.000000
8 1.083308
9 5.199877, experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])
100%|██████████| 100/100 [00:04<00:00, 20.17it/s] INFO:autora.theorist.bms.regressor:BMS fitting finished
sin(x)
INFO:autora.theorist.bms.regressor:BMS fitting started 100%|██████████| 100/100 [00:05<00:00, 19.45it/s] INFO:autora.theorist.bms.regressor:BMS fitting finished
sin(x)
Updated State:
StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(0, 6.283185307179586), allowed_values=array([0. , 0.21666156, 0.43332312, 0.64998469, 0.86664625,
1.08330781, 1.29996937, 1.51663094, 1.7332925 , 1.94995406,
2.16661562, 2.38327719, 2.59993875, 2.81660031, 3.03326187,
3.24992343, 3.466585 , 3.68324656, 3.89990812, 4.11656968,
4.33323125, 4.54989281, 4.76655437, 4.98321593, 5.1998775 ,
5.41653906, 5.63320062, 5.84986218, 6.06652374, 6.28318531]), units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions= x
0 3.249923
1 4.116570
2 2.599939
3 0.216662
4 3.683247
5 0.000000
6 1.733292
7 5.416539
8 2.816600
9 3.683247, experiment_data= x y
0 0.433323 0.572248
1 4.983216 -1.483542
2 4.116570 -0.452463
3 2.816600 0.789584
4 2.599939 -0.459964
5 5.416539 -1.413252
6 0.433323 0.483809
7 4.333231 -1.087098
8 1.299969 0.955149
9 0.433323 -0.006633
10 3.249923 0.013996
11 4.116570 -0.488600
12 2.599939 0.222789
13 0.216662 -0.239366
14 3.683247 -1.511473
15 0.000000 0.485811
16 1.733292 0.995155
17 5.416539 -0.659296
18 2.816600 -0.072496
19 3.683247 0.097695, models=[sin(x), sin(x)])
Custom Experimentalists¶
Experimentalists must be implemented as functions. For instance, an experimentalist sampler function expects a pool of experimental conditions and returns a modified set of experimental conditions.
Requirements for working with the state:
- The function has a
variables
argument that accepts theVariableCollection
type - The function has a
conditions
argument that accepts apandas.DataFrame
- The function returns a
pandas.DataFrame
The custom uniform_sampler
below will select conditions that are the least represented in the data.
Note that when building custom experimentalists, we can either wrap the function with on_state(output=['conditions'])
as we did in tutorial III, or else we can use the @on_state(output=['conditions'])
decorator.
#==================================================================#
# Option 1 - Wrapping our Component #
#==================================================================#
def uniform_sample(variables: VariableCollection, conditions: pd.DataFrame, num_samples: int = 1, random_state: Optional [int] = None):
"""
An experimentalist that selects the least represented datapoints
"""
#Set rng seed
rng = np.random.default_rng(random_state)
#Retrieve the possible values
allowed_values = variables.independent_variables[0].allowed_values
#Determine the representation of each value
conditions_count = np.array([conditions["x"].isin([value]).sum(axis=0) for value in allowed_values])
#Sort to determine the least represented values
conditions_sort = conditions_count.argsort()
conditions_count = conditions_count[conditions_sort]
values_count = allowed_values[conditions_sort]
#Sample from values with the smallest frequency
x = values_count[conditions_count<=conditions_count[num_samples-1]]
x = rng.choice(x,num_samples)
return pd.DataFrame({"x": x})
custom_experimentalist = on_state(uniform_sample, output=["conditions"])
#==================================================================#
# Option 2 - Using a Decorator #
#==================================================================#
@on_state(output=["conditions"])
def custom_experimentalist(variables: VariableCollection, conditions: pd.DataFrame, num_samples: int = 1, random_state: Optional [int] = None):
"""
An experimentalist that selects the least represented datapoints
"""
#Set rng seed
rng = np.random.default_rng(random_state)
#Retrieve the possible values
allowed_values = variables.independent_variables[0].allowed_values
#Determine the representation of each value
conditions_count = np.array([conditions["x"].isin([value]).sum(axis=0) for value in allowed_values])
#Sort to determine the least represented values
conditions_sort = conditions_count.argsort()
conditions_count = conditions_count[conditions_sort]
values_count = allowed_values[conditions_sort]
#Sample from values with the smallest frequency
x = values_count[conditions_count<=conditions_count[num_samples-1]]
x = rng.choice(x,num_samples)
return pd.DataFrame({"x": x})
Now, we will re-run our initial workflow while incorporating our custom experimentalist.
#### First, let's reinitialize the state object to get a clean state ####
iv = Variable(name="x", value_range=(0, 2 * np.pi), allowed_values=np.linspace(0, 2 * np.pi, 30))
dv = Variable(name="y", type=ValueType.REAL)
variables = VariableCollection(independent_variables=[iv],dependent_variables=[dv])
conditions = random_pool(variables, num_samples=10, random_state=0)
s = StandardState(variables = variables, conditions = conditions, experiment_data = pd.DataFrame(columns=["x","y"]))
#Report previous state
print('\033[1mPrevious State:\033[0m')
print(s)
#Cycle
for cycle in range(5):
s = custom_experimentalist(s, num_samples = 10, random_state=42+cycle) #Our custom experimentalist
s = experiment_runner(s, added_noise=0.5, random_state=42+cycle)
s = theorist(s)
plot_from_state(s,'sin(x)')
#Report updated state
print('\n\033[1mUpdated State:\033[0m')
print(s)
INFO:autora.theorist.bms.regressor:BMS fitting started
Previous State:
StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(0, 6.283185307179586), allowed_values=array([0. , 0.21666156, 0.43332312, 0.64998469, 0.86664625,
1.08330781, 1.29996937, 1.51663094, 1.7332925 , 1.94995406,
2.16661562, 2.38327719, 2.59993875, 2.81660031, 3.03326187,
3.24992343, 3.466585 , 3.68324656, 3.89990812, 4.11656968,
4.33323125, 4.54989281, 4.76655437, 4.98321593, 5.1998775 ,
5.41653906, 5.63320062, 5.84986218, 6.06652374, 6.28318531]), units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions= x
0 5.416539
1 4.116570
2 3.249923
3 1.733292
4 1.949954
5 0.216662
6 0.433323
7 0.000000
8 1.083308
9 5.199877, experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])
100%|██████████| 100/100 [00:04<00:00, 23.27it/s] INFO:autora.theorist.bms.regressor:BMS fitting finished
-0.21
INFO:autora.theorist.bms.regressor:BMS fitting started 100%|██████████| 100/100 [00:05<00:00, 19.52it/s] INFO:autora.theorist.bms.regressor:BMS fitting finished
cos((-1.49 + x))
INFO:autora.theorist.bms.regressor:BMS fitting started 100%|██████████| 100/100 [00:04<00:00, 20.38it/s] INFO:autora.theorist.bms.regressor:BMS fitting finished
sin(x)
INFO:autora.theorist.bms.regressor:BMS fitting started 100%|██████████| 100/100 [00:05<00:00, 18.74it/s] INFO:autora.theorist.bms.regressor:BMS fitting finished
sin(x)
INFO:autora.theorist.bms.regressor:BMS fitting started 100%|██████████| 100/100 [00:04<00:00, 20.01it/s] INFO:autora.theorist.bms.regressor:BMS fitting finished
sin(x)
Updated State:
StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(0, 6.283185307179586), allowed_values=array([0. , 0.21666156, 0.43332312, 0.64998469, 0.86664625,
1.08330781, 1.29996937, 1.51663094, 1.7332925 , 1.94995406,
2.16661562, 2.38327719, 2.59993875, 2.81660031, 3.03326187,
3.24992343, 3.466585 , 3.68324656, 3.89990812, 4.11656968,
4.33323125, 4.54989281, 4.76655437, 4.98321593, 5.1998775 ,
5.41653906, 5.63320062, 5.84986218, 6.06652374, 6.28318531]), units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions= x
0 4.983216
1 5.849862
2 3.466585
3 4.116570
4 1.733292
5 3.249923
6 3.249923
7 5.199877
8 3.683247
9 4.333231, experiment_data= x y
0 5.849862 -0.267531
1 1.516631 0.478541
2 2.383277 1.062925
3 3.683247 -0.045271
4 3.683247 -1.491071
5 2.599939 -0.135536
6 5.849862 -0.355969
7 2.383277 0.529578
8 4.766554 -1.006934
9 5.849862 -0.846411
10 2.816600 0.441416
11 1.949954 1.268066
12 3.466585 -0.612066
13 5.633201 -1.059511
14 3.033262 -0.887800
15 0.000000 0.485811
16 4.333231 -0.920648
17 0.649985 0.708040
18 6.066524 -0.606768
19 2.166616 1.440938
20 1.299969 1.686531
21 3.683247 -0.464401
22 5.849862 -0.256513
23 4.983216 -0.395319
24 1.299969 1.375672
25 2.383277 0.977178
26 3.899908 -0.877285
27 4.549893 -1.496263
28 5.416539 -0.574078
29 4.766554 -1.254513
30 1.949954 0.700044
31 0.216662 -0.096459
32 0.866646 0.829807
33 0.649985 1.027126
34 0.649985 0.530925
35 6.283185 0.131242
36 2.166616 1.093523
37 1.516631 1.343437
38 0.649985 0.159225
39 3.033262 0.342529
40 4.983216 -1.215008
41 5.849862 0.189056
42 3.466585 -0.454861
43 4.116570 -0.462597
44 1.733292 0.402174
45 3.249923 -0.822852
46 3.249923 -0.119755
47 5.199877 -1.107408
48 3.683247 -0.471516
49 4.333231 -0.666329, models=[sin(x), sin(x), sin(x), sin(x), sin(x)])
Custom Experiment Runner¶
Experiment runners must be implemented as functions.
Requirements for working with the state:
- The function has a
conditions
argument that accepts apandas.DataFrame
- The function returns a
pandas.DataFrame
The custom quadratic_experiment
below will apply a quadratic transform (x + x**2
) to the conditions.
Note that when building custom experiment runners, we can either wrap the function with on_state(output=['experiment_data'])
as we did in tutorial III, or else we can use the @on_state(output=['experiment_data'])
decorator.
#==================================================================#
# Option 1 - Wrapping our Component #
#==================================================================#
def quadratic_experiment(conditions: pd.DataFrame, added_noise: int = 0.01, random_state: Optional[int] = None):
#Set rng seed
rng = np.random.default_rng(random_state)
#Extract conditions
x = conditions["x"]
#Compute data
y = (x + x**2) + rng.normal(0, added_noise, size=x.shape)
#Assign to dataframe
observations = conditions.assign(y = y)
return observations
custom_experiment_runner = on_state(quadratic_experiment, output=["experiment_data"])
#==================================================================#
# Option 2 - Using a Decorator #
#==================================================================#
@on_state(output=["experiment_data"])
def quadratic_experiment(conditions: pd.DataFrame, added_noise: int = 0.01, random_state: Optional[int] = None):
#Set rng seed
rng = np.random.default_rng(random_state)
#Extract conditions
x = conditions["x"]
#Compute data
y = (x + x**2) + rng.normal(0, added_noise, size=x.shape)
#Assign to dataframe
observations = conditions.assign(y = y)
return observations
Now, we will re-run our initial workflow while incorporating our custom experiment runner.
#### First, let's reinitialize the state object to get a clean state ####
iv = Variable(name="x", value_range=(0, 2 * np.pi), allowed_values=np.linspace(0, 2 * np.pi, 30))
dv = Variable(name="y", type=ValueType.REAL)
variables = VariableCollection(independent_variables=[iv],dependent_variables=[dv])
conditions = random_pool(variables, num_samples=10, random_state=0)
s = StandardState(variables = variables, conditions = conditions, experiment_data = pd.DataFrame(columns=["x","y"]))
#Report previous state
print('\033[1mPrevious State:\033[0m')
print(s)
#Cycle
for cycle in range(5):
s = experimentalist(s, num_samples = 10, random_state=42+cycle)
s = custom_experiment_runner(s, added_noise=0.5, random_state=42+cycle)
s = theorist(s)
plot_from_state(s, 'x + x**2')
#Report updated state
print('\n\033[1mUpdated State:\033[0m')
print(s)
INFO:autora.theorist.bms.regressor:BMS fitting started
Previous State:
StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(0, 6.283185307179586), allowed_values=array([0. , 0.21666156, 0.43332312, 0.64998469, 0.86664625,
1.08330781, 1.29996937, 1.51663094, 1.7332925 , 1.94995406,
2.16661562, 2.38327719, 2.59993875, 2.81660031, 3.03326187,
3.24992343, 3.466585 , 3.68324656, 3.89990812, 4.11656968,
4.33323125, 4.54989281, 4.76655437, 4.98321593, 5.1998775 ,
5.41653906, 5.63320062, 5.84986218, 6.06652374, 6.28318531]), units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions= x
0 5.416539
1 4.116570
2 3.249923
3 1.733292
4 1.949954
5 0.216662
6 0.433323
7 0.000000
8 1.083308
9 5.199877, experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])
100%|██████████| 100/100 [00:06<00:00, 15.44it/s] INFO:autora.theorist.bms.regressor:BMS fitting finished
(x + ((x * x) * 0.99))