Linear And Cyclical Workflows Using Functions And States¶
Using the functions and objects in autora, we can build flexible pipelines and cycles.
Experiment Runner And Theorist¶
We define a two-part autora pipeline consisting of an experiment runner and a theorist (we always use the seed conditions).
The key point is that both the experiment runner and the theorist are functions which:
- operate on the State, and
- return a modified object of the same type, State.
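The contract described above can be illustrated without autora at all. The following is a minimal sketch, assuming a hypothetical ToyState dataclass (it is not autora's State) and two toy functions; each takes a state and returns a new state of the same type, which is what makes them chainable.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyState:
    """Hypothetical stand-in for a state object; not part of autora."""
    conditions: tuple = ()
    experiment_data: tuple = ()
    models: tuple = ()


def toy_experiment_runner(state: ToyState) -> ToyState:
    # "Observe" y = 2 * x for each condition and append the new rows.
    new_rows = tuple((x, 2 * x) for x in state.conditions)
    return replace(state, experiment_data=state.experiment_data + new_rows)


def toy_theorist(state: ToyState) -> ToyState:
    # "Fit" a slope as the mean of y / x over rows with non-zero x.
    ratios = [y / x for x, y in state.experiment_data if x != 0]
    slope = sum(ratios) / len(ratios)
    return replace(state, models=state.models + (slope,))


initial = ToyState(conditions=(1, 2, 3))
result = toy_theorist(toy_experiment_runner(initial))
```

Because the dataclass is frozen and each function builds a new object with replace, the initial state is left untouched.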
Defining The State¶
We use the standard State object bundled with autora: StandardState
import numpy as np
import pandas as pd
from autora.variable import VariableCollection, Variable
from autora.state import StandardState
s = StandardState(
    variables=VariableCollection(
        independent_variables=[Variable("x", value_range=(-15, 15))],
        dependent_variables=[Variable("y")],
    ),
    conditions=pd.DataFrame({"x": np.linspace(-15, 15, 101)}),
    experiment_data=pd.DataFrame(columns=["x", "y"]),
)
s
StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-15, 15), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions= x 0 -15.0 1 -14.7 2 -14.4 3 -14.1 4 -13.8 .. ... 96 13.8 97 14.1 98 14.4 99 14.7 100 15.0 [101 rows x 1 columns], experiment_data=Empty DataFrame Columns: [x, y] Index: [], models=[])
Given this state, we define a two-part autora pipeline consisting of an experiment runner and a theorist. We'll just reuse the initial seed conditions in this example.
First we define and test the experiment runner.
The key point is that both the experiment runner and the theorist are functions which operate on the State.
We use the wrapper function on_state, which wraps the experiment_runner so that it operates on the fields of the State rather than on the conditions and experiment_data directly.
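To make the idea of such a wrapper concrete, here is a rough sketch in plain Python. The names ToyState and toy_on_state are hypothetical and this is not autora's implementation: it only shows one way a wrapper could look up a function's parameters among the fields of a state, call the function, and store the return value in a named field of a new state.

```python
import inspect
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ToyState:
    """Hypothetical state with two fields; not part of autora."""
    conditions: tuple = ()
    experiment_data: tuple = ()


def toy_on_state(function, output):
    """Hypothetical wrapper, written for illustration only."""
    parameter_names = inspect.signature(function).parameters

    def wrapped(state, **kwargs):
        # Pass each state field whose name matches a parameter of `function`.
        from_state = {
            f.name: getattr(state, f.name)
            for f in fields(state)
            if f.name in parameter_names
        }
        result = function(**{**from_state, **kwargs})
        # Store the return value in the named output field of a new state.
        return replace(state, **{output: result})

    return wrapped


def double(conditions, offset=0):
    return tuple(2 * x + offset for x in conditions)


runner = toy_on_state(double, output="experiment_data")
new_state = runner(ToyState(conditions=(1, 2)), offset=1)
```

This sketch simply overwrites the output field. The outputs later on this page show experiment_data growing from cycle to cycle, so the real wrapper evidently extends that field rather than replacing it.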
Defining The Experiment Runner¶
For this example, we'll use a polynomial of degree 3 as our "ground truth" function. We're also using pandas DataFrames and Series as our data interchange format.
from autora.state import on_state
def ground_truth(x: pd.Series, c=(432, -144, -3, 1)):
    return c[0] + c[1] * x + c[2] * x**2 + c[3] * x**3

def experiment_runner(conditions, std=100., random_state=None):
    """Coefs from https://www.maa.org/sites/default/files/0025570x28304.di021116.02p0130a.pdf"""
    rng = np.random.default_rng(random_state)
    x = conditions["x"]
    noise = rng.normal(0, std, len(x))
    y = (ground_truth(x) + noise)
    experiment_data = conditions.assign(y = y)
    return experiment_data
experiment_runner = on_state(experiment_runner, output=['experiment_data'])
When we run the experiment runner, we can see the updated state object which is returned – it has new experimental data.
experiment_runner(s, std=1).experiment_data
c:\Users\cwill\GitHub\virtualEnvs\autoraEnv\lib\site-packages\autora\state.py:417: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation. return pd.concat((a, b), ignore_index=True)
| | x | y |
|---|---|---|
| 0 | -15.0 | -1459.429483 |
| 1 | -14.7 | -1274.322129 |
| 2 | -14.4 | -1101.452213 |
| 3 | -14.1 | -937.643485 |
| 4 | -13.8 | -779.855970 |
| ... | ... | ... |
| 96 | 13.8 | 502.684242 |
| 97 | 14.1 | 610.424989 |
| 98 | 14.4 | 722.685211 |
| 99 | 14.7 | 844.061598 |
| 100 | 15.0 | 971.162262 |

101 rows × 2 columns
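As a quick sanity check on the table above, the noise-free ground truth can be evaluated by hand at the two endpoints of the range. The sketch below reuses the same coefficients on plain numbers instead of a pandas Series; with std=1 the observed values should lie within a few units of these.

```python
def ground_truth(x, c=(432, -144, -3, 1)):
    # Same cubic as in the experiment runner, applied to a plain number.
    return c[0] + c[1] * x + c[2] * x**2 + c[3] * x**3


left = ground_truth(-15)   # 432 + 2160 - 675 - 3375
right = ground_truth(15)   # 432 - 2160 - 675 + 3375
print(left, right)  # -1458 972
```

The observed values at x = -15.0 and x = 15.0 in the table differ from -1458 and 972 by roughly 1.4 and 0.8 respectively, which is consistent with unit-variance noise.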
Defining The Theorist¶
Now we define a theorist, which fits a linear regression on polynomial features of x up to degree 5. We define a regressor and a method to return its feature names and coefficients, and then the theorist to handle it. Here, we use a different wrapper, estimator_on_state, that wraps the regressor and returns a function with the same functionality, but operating on State fields. In this case, we want to use the State field experiment_data and extend the State field models.
from sklearn.linear_model import LinearRegression
from autora.state import estimator_on_state
from sklearn.pipeline import make_pipeline as make_theorist_pipeline
from sklearn.preprocessing import PolynomialFeatures
# Completely standard scikit-learn pipeline regressor
regressor = make_theorist_pipeline(PolynomialFeatures(degree=5), LinearRegression())
theorist = estimator_on_state(regressor)
def get_equation(r):
    t = r.named_steps['polynomialfeatures'].get_feature_names_out()
    c = r.named_steps['linearregression'].coef_
    return pd.DataFrame({"t": t, "coefficient": c.reshape(t.shape)})
Directly Chaining State Based Functions¶
Now we run the theorist on the result of the experiment runner (by chaining the two functions).
t = theorist(experiment_runner(s, random_state=1))
The fitted coefficients are:
get_equation(t.models[-1])
| | t | coefficient |
|---|---|---|
| 0 | 1 | 0.000000 |
| 1 | x | -145.723526 |
| 2 | x^2 | -2.909293 |
| 3 | x^3 | 1.048788 |
| 4 | x^4 | -0.000242 |
| 5 | x^5 | -0.000252 |
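The fitted model above is a linear combination of the columns 1, x, ..., x^5. The sketch below evaluates such a model by hand in plain Python; polynomial_features and predict are hypothetical helper names, not scikit-learn functions. It uses the ground-truth coefficients rather than the fitted ones, to show that the degree-5 feature set can represent the cubic exactly when the x^4 and x^5 coefficients are zero. The constant is passed separately as an intercept, which is presumably why the coefficient on the column 1 is reported as 0 in the table: LinearRegression fits its own intercept by default.

```python
def polynomial_features(x, degree=5):
    """Return [1, x, x**2, ..., x**degree] for a single input value."""
    return [x**k for k in range(degree + 1)]


def predict(x, coefficients, intercept=0.0):
    """Evaluate intercept + sum(coefficient_k * x**k)."""
    features = polynomial_features(x, degree=len(coefficients) - 1)
    return intercept + sum(c * f for c, f in zip(coefficients, features))


# Ground-truth cubic 432 - 144x - 3x^2 + x^3, written in the degree-5 basis.
coefficients = [0, -144, -3, 1, 0, 0]
value = predict(-15, coefficients, intercept=432)
```

Read this way, the fitted coefficients on x, x^2 and x^3 are close to -144, -3 and 1, and the two higher-order terms are close to zero.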
Creating A Pipeline With State Based Functions¶
Now we can define the simplest pipeline which runs the experiment runner and theorist in sequence and returns the updated state:
def pipeline(state: StandardState, random_state=None) -> StandardState:
    s_ = state
    t_ = experiment_runner(s_, random_state=random_state)
    u_ = theorist(t_)
    return u_
Running this pipeline is the same as running the individual steps – just pass the state object.
u = pipeline(s, random_state=1)
get_equation(u.models[-1])
| | t | coefficient |
|---|---|---|
| 0 | 1 | 0.000000 |
| 1 | x | -145.723526 |
| 2 | x^2 | -2.909293 |
| 3 | x^3 | 1.048788 |
| 4 | x^4 | -0.000242 |
| 5 | x^5 | -0.000252 |
Since the pipeline function operates on the State itself and returns a State, we can chain these pipelines in the same fashion as we chain the theorist and experiment runner:
u_ = pipeline(pipeline(s, random_state=1), random_state=2)
get_equation(u_.models[-1])
| | t | coefficient |
|---|---|---|
| 0 | 1 | 0.000000 |
| 1 | x | -145.738569 |
| 2 | x^2 | -2.898667 |
| 3 | x^3 | 1.042038 |
| 4 | x^4 | -0.000893 |
| 5 | x^5 | -0.000218 |
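Nesting calls by hand becomes unwieldy beyond two or three steps. Since each call takes a state and returns a state, the same chaining can be written as a fold over a list of seeds. The sketch below uses a hypothetical toy_pipeline on a plain list so that it runs without autora; the pattern, not the toy function, is the point.

```python
from functools import reduce


def toy_pipeline(state, random_state):
    # Hypothetical step: return a new list recording which seed was used.
    return state + [random_state]


seeds = [1, 2, 3]

# Equivalent to toy_pipeline(toy_pipeline(toy_pipeline([], 1), 2), 3).
final = reduce(toy_pipeline, seeds, [])
```

With the real pipeline, the same shape would be a reduce over seeds whose step function forwards each seed as the random_state keyword argument; that adaptation is an assumption and is not shown in the original notebook.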
To show what's happening, we'll show the data, best fit model and ground truth:
from matplotlib import pyplot as plt
def show_best_fit(state):
    state.experiment_data.plot.scatter("x", "y", s=1, alpha=0.5, c="gray")
    observed_x = state.experiment_data[["x"]].sort_values(by="x")
    observed_x = pd.DataFrame({"x": np.linspace(observed_x["x"].min(), observed_x["x"].max(), 101)})
    plt.plot(observed_x, state.models[-1].predict(observed_x), label="best fit")
    allowed_x = pd.Series(np.linspace(*state.variables.independent_variables[0].value_range, 101), name="x")
    plt.plot(allowed_x, ground_truth(allowed_x), label="ground truth")
    plt.legend()

def show_coefficients(state):
    return get_equation(state.models[-1])
show_best_fit(u_)
show_coefficients(u_)
| | t | coefficient |
|---|---|---|
| 0 | 1 | 0.000000 |
| 1 | x | -145.738569 |
| 2 | x^2 | -2.898667 |
| 3 | x^3 | 1.042038 |
| 4 | x^4 | -0.000893 |
| 5 | x^5 | -0.000218 |
We can use this pipeline to make a trivial cycle, where we keep on gathering data until we reach 1000 datapoints. Any condition defined on the state object could be used here, though.
v = s
while len(v.experiment_data) < 1_000:  # any condition on the state can be used here.
    v = pipeline(v)
show_best_fit(v)
Creating Generators With State Based Functions¶
We can redefine the pipeline as a generator, which can be operated on using iteration tools:
def cycle(state: StandardState) -> StandardState:
    s_ = state
    while True:
        s_ = experiment_runner(s_)
        s_ = theorist(s_)
        yield s_

cycle_generator = cycle(s)
for i in range(1000):
    t = next(cycle_generator)
show_best_fit(t)
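The "iteration tools" mentioned above include the standard library's itertools. As a minimal sketch, assuming a hypothetical toy_cycle generator that counts upward instead of running experiments, islice can take a fixed number of states or jump straight to the n-th one without writing an explicit loop.

```python
from itertools import islice


def toy_cycle(state):
    # Hypothetical infinite cycle: each step returns an updated "state".
    while True:
        state = state + 1
        yield state


# Collect the first three states produced by the cycle.
first_three = list(islice(toy_cycle(0), 3))

# Skip nine states and take the tenth.
tenth = next(islice(toy_cycle(0), 9, None))
```

Because the generator is infinite, some bound such as islice or an explicit break is always needed; calling list on it directly would never finish.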
You can also define a cycle (or a sequence of steps) which yields the intermediate results.
v0 = s
def cycle(state: StandardState) -> StandardState:
    s_ = state
    while True:
        print("#-- running experiment_runner --#\n")
        s_ = experiment_runner(s_)
        yield s_
        print("#-- running theorist --#\n")
        s_ = theorist(s_)
        yield s_

cycle_generator = cycle(v0)
s
StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-15, 15), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions= x 0 -15.0 1 -14.7 2 -14.4 3 -14.1 4 -13.8 .. ... 96 13.8 97 14.1 98 14.4 99 14.7 100 15.0 [101 rows x 1 columns], experiment_data=Empty DataFrame Columns: [x, y] Index: [], models=[])
At the outset, we have no model and an empty experiment_data dataframe.
print(f"{v0.models=}, \n{v0.experiment_data=}")
v0.models=[], v0.experiment_data=Empty DataFrame Columns: [x, y] Index: []
In the first next, we only run the "experiment_runner":
v1 = next(cycle_generator)
print(f"{v1.models=}, \n{v1.experiment_data=}")
#-- running experiment_runner --# v1.models=[], v1.experiment_data= x y 0 -15.0 -1504.798665 1 -14.7 -1447.778278 2 -14.4 -1079.358506 3 -14.1 -1075.973379 4 -13.8 -601.183784 .. ... ... 96 13.8 610.172788 97 14.1 566.573162 98 14.4 595.721089 99 14.7 788.030909 100 15.0 1009.839502 [101 rows x 2 columns]
In the next step, we run the theorist on that data, but we don't add any new data:
v2 = next(cycle_generator)
print(f"{v2.models=}, \n{v2.experiment_data.shape=}")
#-- running theorist --# v2.models=[Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=5)), ('linearregression', LinearRegression())])], v2.experiment_data.shape=(101, 2)
In the next step, we run the experiment runner again and gather more observations:
v3 = next(cycle_generator)
print(f"{v3.models=}, \n{v3.experiment_data.shape=}")
#-- running theorist --# v3.models=[Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=5)), ('linearregression', LinearRegression())]), Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=5)), ('linearregression', LinearRegression())])], v3.experiment_data.shape=(202, 2)
Adding The Experimentalist¶
Modifying the code to use a custom experimentalist is simple. We define an experimentalist which adds some observations each cycle:
from autora.experimentalist.random import random_pool
experimentalist = on_state(random_pool, output=["conditions"])
experimentalist(s)
StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-15, 15), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions= x 0 13.318426 1 3.322472 2 -10.317879 3 6.496320 4 -2.501831, experiment_data=Empty DataFrame Columns: [x, y] Index: [], models=[])
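For intuition about what a random-pool experimentalist does, here is a rough standard-library sketch. The function toy_random_pool is hypothetical and is not autora's random_pool; it assumes uniform sampling from the variable's value range and defaults to five samples only because the output above shows five conditions.

```python
import random


def toy_random_pool(value_range, num_samples=5, random_state=None):
    """Hypothetical sampler: draw uniform values from a closed range."""
    rng = random.Random(random_state)
    low, high = value_range
    return [rng.uniform(low, high) for _ in range(num_samples)]


conditions = []
for i in range(5):
    # A different seed per cycle, so each cycle proposes new conditions.
    conditions += toy_random_pool((-15, 15), num_samples=10, random_state=42 + i)
```

The conditions are accumulated here only to count them; fixing the seed per cycle keeps the whole run reproducible while still sampling different points each time.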
u0 = s
for i in range(5):
    u0 = experimentalist(u0, num_samples=10, random_state=42+i)
    u0 = experiment_runner(u0, random_state=43+i)
    u0 = theorist(u0)
    show_best_fit(u0)
    plt.title(f"{i=}, {len(u0.experiment_data)=}")