Linear And Cyclical Workflows Using Functions And States

Using the functions and objects in autora, we can build flexible pipelines and cycles.

Experiment Runner And Theorist

We define a two-part autora pipeline consisting of an experiment runner and a theorist. There is no experimentalist step yet: we always reuse the seed conditions.

The key part here is that both experiment runner and theorist are functions which:

operate on the State, and
return a modified object of the same type State.
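
The "state in, state out" contract can be sketched with a small immutable dataclass. This is only an illustration: ToyState and toy_runner are hypothetical names, not part of autora, and the "experiment" here just doubles each condition.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyState:
    """Hypothetical stand-in for a state object; not autora's StandardState."""
    conditions: Tuple[float, ...] = ()
    experiment_data: Tuple[Tuple[float, float], ...] = ()


def toy_runner(state: ToyState) -> ToyState:
    """Take a state and return a new state of the same type with extra data."""
    new_rows = tuple((x, 2.0 * x) for x in state.conditions)
    # replace() builds a new object, so the input state is left untouched
    return replace(state, experiment_data=state.experiment_data + new_rows)


s0 = ToyState(conditions=(1.0, 2.0))
s1 = toy_runner(s0)
s2 = toy_runner(s1)  # same input and output type, so calls can be chained
```

Because the input and output types match, any number of such functions can be applied one after another.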

Defining The State

We use the standard State object bundled with autora: StandardState

In [ ]:

import numpy as np
import pandas as pd
from autora.variable import VariableCollection, Variable
from autora.state import StandardState

s = StandardState(
    variables=VariableCollection(independent_variables=[Variable("x", value_range=(-15,15))],
                                 dependent_variables=[Variable("y")]),
    conditions=pd.DataFrame({"x": np.linspace(-15,15,101)}),
    experiment_data = pd.DataFrame(columns=["x","y"]),
)

In [ ]:

s

Out[ ]:

StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-15, 15), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions=        x
0   -15.0
1   -14.7
2   -14.4
3   -14.1
4   -13.8
..    ...
96   13.8
97   14.1
98   14.4
99   14.7
100  15.0

[101 rows x 1 columns], experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])

Given this state, we define a two part autora pipeline consisting of an experiment runner and a theorist. We'll just reuse the initial seed conditions in this example.

First we define and test the experiment runner.

The key part here is that both the experiment runner and the theorist are functions which operate on the State. We use the wrapper function on_state, which wraps the experiment_runner so that it takes its conditions argument from the State and adds its return value to the experiment_data field, rather than operating on conditions and experiment_data directly.
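
To give a feel for what such a wrapper does, here is a minimal sketch written from scratch. It is not autora's implementation: toy_on_state, ToyState and run are hypothetical names, and the "append the result to one output field" behaviour is a simplifying assumption.

```python
import inspect
from dataclasses import dataclass, fields, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyState:
    """Hypothetical state with two fields; not autora's StandardState."""
    conditions: Tuple[float, ...] = ()
    experiment_data: Tuple[Tuple[float, float], ...] = ()


def toy_on_state(function: Callable, output: str) -> Callable:
    """Wrap `function` so it reads its arguments from a state by name and
    appends its result to the `output` field of a new state."""
    parameters = inspect.signature(function).parameters
    field_names = {f.name for f in fields(ToyState)}

    def wrapped(state: ToyState, **kwargs) -> ToyState:
        # Arguments whose names match state fields are filled in from the state
        from_state = {
            name: getattr(state, name)
            for name in parameters
            if name in field_names and name not in kwargs
        }
        result = function(**from_state, **kwargs)
        return replace(state, **{output: getattr(state, output) + tuple(result)})

    return wrapped


def run(conditions, scale=1.0):
    """Plain function that knows nothing about states."""
    return [(x, scale * x) for x in conditions]


run_on_state = toy_on_state(run, output="experiment_data")
out = run_on_state(ToyState(conditions=(1.0, 2.0)), scale=3.0)
```

The wrapped function accepts extra keyword arguments (here scale) in the same way that the wrapped experiment_runner accepts std and random_state.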

Defining The Experiment Runner

For this example, we'll use a polynomial of degree 3 as our "ground truth" function. We're also using pandas DataFrames and Series as our data interchange format.

In [ ]:

from autora.state import on_state

def ground_truth(x: pd.Series, c=(432, -144, -3, 1)):
    return c[0] + c[1] * x + c[2] * x**2 + c[3] * x**3

def experiment_runner(conditions, std=100., random_state=None):
    """Coefs from https://www.maa.org/sites/default/files/0025570x28304.di021116.02p0130a.pdf"""
    rng = np.random.default_rng(random_state)
    x = conditions["x"]
    noise = rng.normal(0, std, len(x))
    y = (ground_truth(x) + noise)
    experiment_data = conditions.assign(y = y)
    return experiment_data

experiment_runner = on_state(experiment_runner, output=['experiment_data'])

When we run the experiment runner, we can see the updated state object which is returned – it has new experimental data.

In [ ]:

experiment_runner(s, std=1).experiment_data

c:\Users\cwill\GitHub\virtualEnvs\autoraEnv\lib\site-packages\autora\state.py:417: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  return pd.concat((a, b), ignore_index=True)

Out[ ]:

	x	y
0	-15.0	-1459.429483
1	-14.7	-1274.322129
2	-14.4	-1101.452213
3	-14.1	-937.643485
4	-13.8	-779.855970
...	...	...
96	13.8	502.684242
97	14.1	610.424989
98	14.4	722.685211
99	14.7	844.061598
100	15.0	971.162262

101 rows × 2 columns
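
As a quick sanity check, the noise-free cubic can be evaluated by hand at the two ends of the range. The sketch below reuses the same coefficients on plain numbers instead of a pandas Series.

```python
def ground_truth(x, c=(432, -144, -3, 1)):
    """Same cubic as in the notebook, evaluated on a plain number."""
    return c[0] + c[1] * x + c[2] * x**2 + c[3] * x**3


low = ground_truth(-15)   # 432 + 2160 - 675 - 3375
high = ground_truth(15)   # 432 - 2160 - 675 + 3375
print(low, high)
```

The values are -1458 and 972, so the observations of about -1459.43 and 971.16 in the table above differ from the ground truth by an amount consistent with noise of std=1.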

Defining The Theorist

Now we define a theorist, which fits a linear regression on polynomial features of x up to degree 5. We define a regressor and a helper function that returns its feature names and coefficients, and then wrap the regressor as the theorist. Here, we use a different wrapper, estimator_on_state, which wraps the regressor and returns a function with the same functionality, but operating on State fields. In this case, we want to use the State field experiment_data and extend the State field models.

In [ ]:

from sklearn.linear_model import LinearRegression
from autora.state import estimator_on_state
from sklearn.pipeline import make_pipeline as make_theorist_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Completely standard scikit-learn pipeline regressor
regressor = make_theorist_pipeline(PolynomialFeatures(degree=5), LinearRegression())
theorist = estimator_on_state(regressor)

def get_equation(r):
    t = r.named_steps['polynomialfeatures'].get_feature_names_out()
    c = r.named_steps['linearregression'].coef_
    return pd.DataFrame({"t": t, "coefficient": c.reshape(t.shape)})
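
To make the degree-5 feature expansion concrete, the following sketch reproduces it in plain Python for a single variable. The helper names polynomial_terms and term_names are hypothetical; scikit-learn's PolynomialFeatures also handles several input variables and their interactions, which this sketch ignores.

```python
from typing import List


def polynomial_terms(x: float, degree: int) -> List[float]:
    """Return [1, x, x^2, ..., x^degree] for a single input value."""
    return [x**k for k in range(degree + 1)]


def term_names(degree: int, variable: str = "x") -> List[str]:
    """Return labels for the terms, in the same order as polynomial_terms."""
    names = []
    for k in range(degree + 1):
        if k == 0:
            names.append("1")
        elif k == 1:
            names.append(variable)
        else:
            names.append(f"{variable}^{k}")
    return names


terms = polynomial_terms(2.0, degree=5)
names = term_names(degree=5)
```

The linear regression then fits one coefficient per term, which is why the coefficient tables in this section have six rows.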

Directly Chaining State Based Functions

Now we run the theorist on the result of the experiment runner (by chaining the two functions).

In [ ]:

t = theorist(experiment_runner(s, random_state=1))

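The fitted coefficients are shown below. The coefficient of the constant term 1 is zero because LinearRegression fits the intercept separately and stores it in intercept_, which get_equation does not report: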

In [ ]:

get_equation(t.models[-1])

Out[ ]:

	t	coefficient
0	1	0.000000
1	x	-145.723526
2	x^2	-2.909293
3	x^3	1.048788
4	x^4	-0.000242
5	x^5	-0.000252

Creating A Pipeline With State Based Functions

Now we can define the simplest pipeline which runs the experiment runner and theorist in sequence and returns the updated state:

In [ ]:

def pipeline(state: StandardState, random_state=None) -> StandardState:
    s_ = state
    t_ = experiment_runner(s_, random_state=random_state)
    u_ = theorist(t_)
    return u_

Running this pipeline is the same as running the individual steps – just pass the state object.

In [ ]:

u = pipeline(s, random_state=1)
get_equation(u.models[-1])

Out[ ]:

	t	coefficient
0	1	0.000000
1	x	-145.723526
2	x^2	-2.909293
3	x^3	1.048788
4	x^4	-0.000242
5	x^5	-0.000252

Since the pipeline function operates on the State itself and returns a State, we can chain these pipelines in the same fashion as we chain the theorist and experiment runner:

In [ ]:

u_ = pipeline(pipeline(s, random_state=1), random_state=2)
get_equation(u_.models[-1])

Out[ ]:

	t	coefficient
0	1	0.000000
1	x	-145.738569
2	x^2	-2.898667
3	x^3	1.042038
4	x^4	-0.000893
5	x^5	-0.000218
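
Nested calls become hard to read as the chain grows. One possible alternative, sketched here with toy functions on a plain dictionary rather than on autora objects, is to compose the steps with functools.reduce. The names compose, collect and fit are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, List

ToyState = Dict[str, List[float]]  # hypothetical toy state, not StandardState


def collect(state: ToyState) -> ToyState:
    """Toy experiment runner: append one observation."""
    return {**state, "data": state["data"] + [float(len(state["data"]))]}


def fit(state: ToyState) -> ToyState:
    """Toy theorist: the 'model' is the mean of the data so far."""
    mean = sum(state["data"]) / len(state["data"])
    return {**state, "models": state["models"] + [mean]}


def compose(*steps: Callable[[ToyState], ToyState]) -> Callable[[ToyState], ToyState]:
    """Return a function that applies the steps from left to right."""
    def composed(state: ToyState) -> ToyState:
        return reduce(lambda acc, step: step(acc), steps, state)
    return composed


toy_pipeline = compose(collect, fit)
two_rounds = compose(toy_pipeline, toy_pipeline)

initial = {"data": [], "models": []}
result = two_rounds(initial)
```

Since a composed pipeline is again a function from state to state, it can itself be passed to compose, as two_rounds shows.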

To show what's happening, we'll show the data, best fit model and ground truth:

In [ ]:

from matplotlib import pyplot as plt

def show_best_fit(state):
    state.experiment_data.plot.scatter("x", "y", s=1, alpha=0.5, c="gray")

    observed_x = state.experiment_data[["x"]].sort_values(by="x")
    observed_x = pd.DataFrame({"x": np.linspace(observed_x["x"].min(), observed_x["x"].max(), 101)})

    plt.plot(observed_x, state.models[-1].predict(observed_x), label="best fit")
    
    allowed_x = pd.Series(np.linspace(*state.variables.independent_variables[0].value_range, 101), name="x")
    plt.plot(allowed_x, ground_truth(allowed_x), label="ground truth")
    
    plt.legend()

def show_coefficients(state):
    return get_equation(state.models[-1])

show_best_fit(u)
show_coefficients(u)

Out[ ]:

	t	coefficient
0	1	0.000000
1	x	-145.738569
2	x^2	-2.898667
3	x^3	1.042038
4	x^4	-0.000893
5	x^5	-0.000218

[Figure: scatter plot of the experiment data, with the best-fit curve and the ground-truth curve]

We can use this pipeline to make a trivial cycle, where we keep on gathering data until we reach 1000 datapoints. Any condition defined on the state object could be used here, though.

In [ ]:

v = s
while len(v.experiment_data) < 1_000:  # any condition on the state can be used here.
    v = pipeline(v)
show_best_fit(v)
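
The same loop can be written as a small reusable helper that takes the stopping condition as a function of the state. This is a sketch with toy data: run_until, add_round and the max_steps safety limit are hypothetical and not part of autora.

```python
from typing import Callable, Tuple

ToyData = Tuple[float, ...]  # hypothetical toy state: just the observations


def run_until(
    state: ToyData,
    step: Callable[[ToyData], ToyData],
    done: Callable[[ToyData], bool],
    max_steps: int = 1000,
) -> ToyData:
    """Apply `step` repeatedly until `done(state)` is true."""
    steps = 0
    while not done(state):
        if steps >= max_steps:
            raise RuntimeError("stopping condition not reached")
        state = step(state)
        steps += 1
    return state


def add_round(state: ToyData) -> ToyData:
    """Toy pipeline: add 101 observations per round, as the seed conditions do."""
    return state + tuple(0.0 for _ in range(101))


final = run_until((), add_round, done=lambda data: len(data) >= 1000)
```

With 101 conditions per round, the condition is first met after 10 rounds, so the loop ends with 1,010 data points rather than exactly 1,000.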

Creating Generators With State Based Functions

We can redefine the pipeline as a generator, which can be operated on using iteration tools:

In [ ]:

from typing import Iterator

def cycle(state: StandardState) -> Iterator[StandardState]:
    s_ = state
    while True:
        s_ = experiment_runner(s_)
        s_ = theorist(s_)
        yield s_

cycle_generator = cycle(s)

for i in range(1000):
    t = next(cycle_generator)
show_best_fit(t)
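
Because the cycle is an ordinary Python generator, the standard itertools functions apply to it. The sketch below uses a toy generator (toy_cycle is a hypothetical name, and the state is just a growing tuple) to show itertools.islice taking a fixed number of steps.

```python
from itertools import islice
from typing import Iterator, Tuple


def toy_cycle(state: Tuple[int, ...]) -> Iterator[Tuple[int, ...]]:
    """Endless generator that grows the state by one item per step."""
    while True:
        state = state + (len(state),)
        yield state


# Collect the first three states
first_three = list(islice(toy_cycle(()), 3))

# Keep only the last of five states, discarding the intermediate ones
*_, last = islice(toy_cycle(()), 5)
```

The second form is equivalent to calling next five times in a loop and keeping the final result.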

You can also define a cycle (or a sequence of steps) which yields the intermediate results.

In [ ]:

from typing import Iterator

v0 = s
def cycle(state: StandardState) -> Iterator[StandardState]:
    s_ = state
    while True:
        print("#-- running experiment_runner --#\n")
        s_ = experiment_runner(s_)
        yield s_
        print("#-- running theorist --#\n")
        s_ = theorist(s_)
        yield s_

cycle_generator = cycle(v0)

In [ ]:

s

Out[ ]:

StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-15, 15), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions=        x
0   -15.0
1   -14.7
2   -14.4
3   -14.1
4   -13.8
..    ...
96   13.8
97   14.1
98   14.4
99   14.7
100  15.0

[101 rows x 1 columns], experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])

At the outset, we have no model and an empty experiment_data dataframe.

In [ ]:

print(f"{v0.models=}, \n{v0.experiment_data=}")

v0.models=[], 
v0.experiment_data=Empty DataFrame
Columns: [x, y]
Index: []

On the first call to next, only the experiment_runner runs:

In [ ]:

v1 = next(cycle_generator)
print(f"{v1.models=}, \n{v1.experiment_data=}")

#-- running experiment_runner --#

v1.models=[], 
v1.experiment_data=        x            y
0   -15.0 -1504.798665
1   -14.7 -1447.778278
2   -14.4 -1079.358506
3   -14.1 -1075.973379
4   -13.8  -601.183784
..    ...          ...
96   13.8   610.172788
97   14.1   566.573162
98   14.4   595.721089
99   14.7   788.030909
100  15.0  1009.839502

[101 rows x 2 columns]

In the next step, we run the theorist on that data, but we don't add any new data:

In [ ]:

v2 = next(cycle_generator)
print(f"{v2.models=}, \n{v2.experiment_data.shape=}")

#-- running theorist --#

v2.models=[Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=5)),
                ('linearregression', LinearRegression())])], 
v2.experiment_data.shape=(101, 2)

In the next step, we run the experiment runner again and gather more observations:

In [ ]:

v3 = next(cycle_generator)
print(f"{v3.models=}, \n{v3.experiment_data.shape=}")

#-- running experiment_runner --#

v3.models=[Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=5)),
                ('linearregression', LinearRegression())])], 
v3.experiment_data.shape=(202, 2)
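
The order in which a two-yield generator hands back its states can be traced with a toy version that only counts observations and models. This is a sketch under the assumption that next is called exactly once per step; toy_cycle and the dictionary keys are hypothetical.

```python
from typing import Dict, Iterator


def toy_cycle(state: Dict[str, int]) -> Iterator[Dict[str, int]]:
    """Yield after each phase, mirroring a cycle with two yield statements."""
    while True:
        # "experiment runner" phase: add 101 observations, models untouched
        state = {**state, "n_data": state["n_data"] + 101}
        yield state
        # "theorist" phase: add one model, data untouched
        state = {**state, "n_models": state["n_models"] + 1}
        yield state


generator = toy_cycle({"n_data": 0, "n_models": 0})
trace = [next(generator) for _ in range(4)]
summary = [(step["n_data"], step["n_models"]) for step in trace]
```

Odd-numbered calls add data and even-numbered calls add a model, so after the third call there are 202 observations and still one model.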

Adding The Experimentalist

Modifying the code to use a custom experimentalist is simple. We define an experimentalist which adds some observations each cycle:

In [ ]:

from autora.experimentalist.random import random_pool
experimentalist = on_state(random_pool, output=["conditions"])
experimentalist(s)

Out[ ]:

StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-15, 15), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions=           x
0  13.318426
1   3.322472
2 -10.317879
3   6.496320
4  -2.501831, experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])
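
Conceptually, a random-pool experimentalist draws new conditions uniformly from the allowed range of each variable. The following stdlib sketch shows the idea for a single variable. It is not autora's random_pool: toy_random_pool is a hypothetical name, and the default of five samples is an assumption that merely mirrors the five rows in the output above.

```python
import random
from typing import List, Optional, Tuple


def toy_random_pool(
    value_range: Tuple[float, float],
    num_samples: int = 5,
    random_state: Optional[int] = None,
) -> List[float]:
    """Draw `num_samples` values uniformly from `value_range`."""
    rng = random.Random(random_state)  # seeded generator for reproducibility
    low, high = value_range
    return [rng.uniform(low, high) for _ in range(num_samples)]


sample_a = toy_random_pool((-15, 15), num_samples=10, random_state=42)
sample_b = toy_random_pool((-15, 15), num_samples=10, random_state=42)
```

Passing the same random_state twice yields the same conditions, which is why varying the seed per cycle produces fresh samples each time.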

In [ ]:

u0 = s
for i in range(5):
    u0 = experimentalist(u0, num_samples=10, random_state=42+i)
    u0 = experiment_runner(u0, random_state=43+i)
    u0 = theorist(u0)
    show_best_fit(u0)
    plt.title(f"{i=}, {len(u0.experiment_data)=}")
