
autora.experimentalist.falsification

falsification_score_sample(conditions, model, reference_conditions, reference_observations, metadata=None, num_samples=None, training_epochs=1000, training_lr=0.001, plot=False)

A Sampler that generates samples of experimental conditions with the objective of maximizing the (approximated) loss of a model relating experimental conditions to observations. The samples are generated by first training a neural network to approximate the loss of a model for all patterns in the training data. Once trained, the network is provided with the candidate samples of experimental conditions and then selects those with the highest loss.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| conditions | Union[DataFrame, ndarray] | The candidate samples of experimental conditions to be evaluated. | required |
| model | | Scikit-learn model; can be either a classification or regression model. | required |
| reference_conditions | Union[DataFrame, ndarray] | Experimental conditions that the model was trained on. | required |
| reference_observations | Union[DataFrame, ndarray] | Observations that the model was trained to predict. | required |
| metadata | Optional[VariableCollection] | Meta-data about the dependent and independent variables specifying the experimental conditions. | None |
| num_samples | Optional[int] | Number of samples to return. | None |
| training_epochs | int | Number of epochs to train the popper network for approximating the error of the model. | 1000 |
| training_lr | float | Learning rate for training the popper network. | 0.001 |
| plot | bool | Print out the prediction of the popper network as well as its training loss. | False |

Returns:

| Name | Description |
| ---- | ----------- |
| new_conditions | Samples of experimental conditions with the highest loss |
| scores | Normalized falsification scores for the samples |

Source code in temp_dir/falsification/src/autora/experimentalist/falsification/__init__.py, lines 257-323:

def falsification_score_sample(
    conditions: Union[pd.DataFrame, np.ndarray],
    model,
    reference_conditions: Union[pd.DataFrame, np.ndarray],
    reference_observations: Union[pd.DataFrame, np.ndarray],
    metadata: Optional[VariableCollection] = None,
    num_samples: Optional[int] = None,
    training_epochs: int = 1000,
    training_lr: float = 1e-3,
    plot: bool = False,
):
    """
    A Sampler that generates samples of experimental conditions with the objective of maximizing the
    (approximated) loss of a model relating experimental conditions to observations. The samples are generated by first
    training a neural network to approximate the loss of a model for all patterns in the training data.
    Once trained, the network is provided with the candidate samples of experimental conditions and then selects
    those with the highest loss.

    Args:
        conditions: The candidate samples of experimental conditions to be evaluated.
        model: Scikit-learn model, could be either a classification or regression model
        reference_conditions: Experimental conditions that the model was trained on
        reference_observations: Observations that the model was trained to predict
        metadata: Meta-data about the dependent and independent variables specifying the experimental conditions
        num_samples: Number of samples to return
        training_epochs: Number of epochs to train the popper network for approximating the
            error of the model
        training_lr: Learning rate for training the popper network
        plot: Print out the prediction of the popper network as well as its training loss

    Returns:
        new_conditions: Samples of experimental conditions with the highest loss
        scores: Normalized falsification scores for the samples

    """

    if isinstance(conditions, Iterable) and not isinstance(conditions, pd.DataFrame):
        conditions = np.array(list(conditions))

    condition_pool_copy = conditions.copy()
    conditions = np.array(conditions)
    reference_conditions = np.array(reference_conditions)
    reference_observations = np.array(reference_observations)

    if len(reference_conditions.shape) == 1:
        reference_conditions = reference_conditions.reshape(-1, 1)

    predicted_observations = model.predict(reference_conditions)

    new_conditions, new_scores = falsification_score_sample_from_predictions(
        conditions,
        predicted_observations,
        reference_conditions,
        reference_observations,
        metadata,
        num_samples,
        training_epochs,
        training_lr,
        plot,
    )

    if isinstance(condition_pool_copy, pd.DataFrame):
        sorted_conditions = pd.DataFrame(new_conditions, columns=condition_pool_copy.columns)
    else:
        sorted_conditions = pd.DataFrame(new_conditions)

    sorted_conditions["score"] = new_scores

    return sorted_conditions
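
A minimal usage sketch, assuming the package is installed and using illustrative data: a linear model is deliberately fit to quadratic ground truth, so the approximated loss (and therefore the falsification score) peaks near the edges of the domain. All data and variable names are illustrative; metadata is left at its default of None here, but a VariableCollection can be passed where variable metadata is needed.

import numpy as np
from sklearn.linear_model import LinearRegression
from autora.experimentalist.falsification import falsification_score_sample

# Illustrative reference data: quadratic ground truth, linear model,
# so the model's error grows toward the edges of the domain
reference_conditions = np.linspace(-3, 3, 30).reshape(-1, 1)
reference_observations = reference_conditions ** 2
model = LinearRegression().fit(reference_conditions, reference_observations)

# Candidate conditions to be scored by the trained popper network
candidate_conditions = np.linspace(-3, 3, 101).reshape(-1, 1)
scored = falsification_score_sample(
    candidate_conditions,
    model,
    reference_conditions,
    reference_observations,
    num_samples=5,
)
print(scored)  # DataFrame with the top-5 conditions and a "score" column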

falsification_score_sample_from_predictions(conditions, predicted_observations, reference_conditions, reference_observations, metadata=None, num_samples=None, training_epochs=1000, training_lr=0.001, plot=False)

A Sampler that generates samples of experimental conditions with the objective of maximizing the (approximated) loss of a model relating experimental conditions to observations. The samples are generated by first training a neural network to approximate the loss of a model for all patterns in the training data. Once trained, the network is provided with the candidate samples of experimental conditions and then selects those with the highest loss.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| conditions | Union[DataFrame, ndarray] | The candidate samples of experimental conditions to be evaluated. | required |
| predicted_observations | Union[DataFrame, ndarray] | Predictions obtained from the model for the set of reference experimental conditions. | required |
| reference_conditions | Union[DataFrame, ndarray] | Experimental conditions that the model was trained on. | required |
| reference_observations | ndarray | Observations that the model was trained to predict. | required |
| metadata | Optional[VariableCollection] | Meta-data about the dependent and independent variables specifying the experimental conditions. | None |
| num_samples | Optional[int] | Number of samples to return. | None |
| training_epochs | int | Number of epochs to train the popper network for approximating the error of the model. | 1000 |
| training_lr | float | Learning rate for training the popper network. | 0.001 |
| plot | bool | Print out the prediction of the popper network as well as its training loss. | False |

Returns:

| Name | Description |
| ---- | ----------- |
| new_conditions | Samples of experimental conditions with the highest loss |
| scores | Normalized falsification scores for the samples |

Source code in temp_dir/falsification/src/autora/experimentalist/falsification/__init__.py, lines 326-408:

def falsification_score_sample_from_predictions(
    conditions: Union[pd.DataFrame, np.ndarray],
    predicted_observations: Union[pd.DataFrame, np.ndarray],
    reference_conditions: Union[pd.DataFrame, np.ndarray],
    reference_observations: np.ndarray,
    metadata: Optional[VariableCollection] = None,
    num_samples: Optional[int] = None,
    training_epochs: int = 1000,
    training_lr: float = 1e-3,
    plot: bool = False,
):
    """
    A Sampler that generates samples of experimental conditions with the objective of maximizing the
    (approximated) loss of a model relating experimental conditions to observations. The samples are generated by first
    training a neural network to approximate the loss of a model for all patterns in the training data.
    Once trained, the network is provided with the candidate samples of experimental conditions and then selects
    those with the highest loss.

    Args:
        conditions: The candidate samples of experimental conditions to be evaluated.
        predicted_observations: Prediction obtained from the model for the set of reference experimental conditions
        reference_conditions: Experimental conditions that the model was trained on
        reference_observations: Observations that the model was trained to predict
        metadata: Meta-data about the dependent and independent variables specifying the experimental conditions
        num_samples: Number of samples to return
        training_epochs: Number of epochs to train the popper network for approximating the
            error of the model
        training_lr: Learning rate for training the popper network
        plot: Print out the prediction of the popper network as well as its training loss

    Returns:
        new_conditions: Samples of experimental conditions with the highest loss
        scores: Normalized falsification scores for the samples

    """

    conditions = np.array(conditions)
    if len(conditions.shape) == 1:
        conditions = conditions.reshape(-1, 1)

    reference_conditions = np.array(reference_conditions)
    if len(reference_conditions.shape) == 1:
        reference_conditions = reference_conditions.reshape(-1, 1)

    reference_observations = np.array(reference_observations)
    if len(reference_observations.shape) == 1:
        reference_observations = reference_observations.reshape(-1, 1)

    if num_samples is None:
        num_samples = conditions.shape[0]

    if metadata is not None:
        if metadata.dependent_variables[0].type == ValueType.CLASS:
            # find all unique values in reference_observations
            num_classes = len(np.unique(reference_observations))
            reference_observations = class_to_onehot(reference_observations, n_classes=num_classes)

    # create list of IV limits
    iv_limit_list = get_iv_limits(reference_conditions, metadata)

    popper_net, model_loss = train_popper_net(predicted_observations,
                                              reference_conditions,
                                              reference_observations,
                                              metadata,
                                              iv_limit_list,
                                              training_epochs,
                                              training_lr,
                                              plot)

    # now that the popper network is trained we can assign losses to all data points to be evaluated
    popper_input = Variable(torch.from_numpy(conditions)).float()
    Y = popper_net(popper_input).detach().numpy().flatten()
    scaler = StandardScaler()
    score = scaler.fit_transform(Y.reshape(-1, 1)).flatten()

    # order rows in Y from highest to lowest
    sorted_conditions = conditions[np.argsort(score)[::-1]]
    sorted_score = score[np.argsort(score)[::-1]]

    return sorted_conditions[0:num_samples], sorted_score[0:num_samples]
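
The same scoring can be driven from precomputed predictions. A minimal sketch under the same illustrative setup as above; note that this variant returns a tuple of sorted conditions and scores rather than a DataFrame.

import numpy as np
from sklearn.linear_model import LinearRegression
from autora.experimentalist.falsification import (
    falsification_score_sample_from_predictions,
)

reference_conditions = np.linspace(-3, 3, 30).reshape(-1, 1)
reference_observations = reference_conditions ** 2
model = LinearRegression().fit(reference_conditions, reference_observations)

# Precompute the model's predictions once and reuse them for scoring
predicted_observations = model.predict(reference_conditions)

candidates = np.linspace(-3, 3, 101).reshape(-1, 1)
new_conditions, scores = falsification_score_sample_from_predictions(
    candidates,
    predicted_observations,
    reference_conditions,
    reference_observations,
    num_samples=5,
)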

pool(model, reference_conditions, reference_observations, metadata, num_samples=100, training_epochs=1000, optimization_epochs=1000, training_lr=0.001, optimization_lr=0.001, limit_offset=0, limit_repulsion=0, plot=False)

A pooler that generates samples for independent variables with the objective of maximizing the (approximated) loss of the model. The samples are generated by first training a neural network to approximate the loss of a model for all patterns in the training data. Once trained, the network is then inverted to generate samples that maximize the approximated loss of the model.

Note: If the pooler returns samples that are close to the boundaries of the variable space, then it is advisable to increase the limit_repulsion parameter (e.g., to 0.000001).

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| model | | Scikit-learn model; can be either a classification or regression model. | required |
| reference_conditions | Union[DataFrame, ndarray] | Data that the model was trained on. | required |
| reference_observations | Union[DataFrame, ndarray] | Labels that the model was trained on. | required |
| metadata | VariableCollection | Meta-data about the dependent and independent variables. | required |
| num_samples | int | Number of samples to return. | 100 |
| training_epochs | int | Number of epochs to train the popper network for approximating the error of the model. | 1000 |
| optimization_epochs | int | Number of epochs to optimize the samples based on the trained popper network. | 1000 |
| training_lr | float | Learning rate for training the popper network. | 0.001 |
| optimization_lr | float | Learning rate for optimizing the samples. | 0.001 |
| limit_offset | float | An offset to prevent the samples from being too close to the value boundaries. | 0 |
| limit_repulsion | float | A repulsion term to prevent the samples from being too close to the allowed value boundaries. | 0 |
| plot | bool | Print out the prediction of the popper network as well as its training loss. | False |

Returns: Sampled pool of experimental conditions (an iterator over condition rows)

Source code in temp_dir/falsification/src/autora/experimentalist/falsification/__init__.py, lines 14-174:

def pool(
    model,
    reference_conditions: Union[pd.DataFrame, np.ndarray],
    reference_observations: Union[pd.DataFrame, np.ndarray],
    metadata: VariableCollection,
    num_samples: int = 100,
    training_epochs: int = 1000,
    optimization_epochs: int = 1000,
    training_lr: float = 1e-3,
    optimization_lr: float = 1e-3,
    limit_offset: float = 0,  # 10**-10,
    limit_repulsion: float = 0,
    plot: bool = False,
):
    """
    A pooler that generates samples for independent variables with the objective of maximizing the
    (approximated) loss of the model. The samples are generated by first training a neural network
    to approximate the loss of a model for all patterns in the training data.
    Once trained, the network is then inverted to generate samples that maximize the approximated
    loss of the model.

    Note: If the pooler returns samples that are close to the boundaries of the variable space,
    then it is advisable to increase the limit_repulsion parameter (e.g., to 0.000001).

    Args:
        model: Scikit-learn model, could be either a classification or regression model
        reference_conditions: data that the model was trained on
        reference_observations: labels that the model was trained on
        metadata: Meta-data about the dependent and independent variables
        num_samples: number of samples to return
        training_epochs: number of epochs to train the popper network for approximating the
            error of the model
        optimization_epochs: number of epochs to optimize the samples based on the trained
            popper network
        training_lr: learning rate for training the popper network
        optimization_lr: learning rate for optimizing the samples
        limit_offset: a limit offset to prevent the samples from being too close to the value
            boundaries
        limit_repulsion: a limit repulsion to prevent the samples from being too close to the
            allowed value boundaries
        plot: print out the prediction of the popper network as well as its training loss

    Returns: Sampled pool of experimental conditions (an iterator over condition rows)

    """

    # format input

    if isinstance(reference_conditions, pd.DataFrame):
        reference_conditions = align_dataframe_to_ivs(reference_conditions, metadata.independent_variables)

    reference_conditions_np = np.array(reference_conditions)
    if len(reference_conditions_np.shape) == 1:
        reference_conditions_np = reference_conditions_np.reshape(-1, 1)

    x = np.empty([num_samples, reference_conditions_np.shape[1]])

    reference_observations = np.array(reference_observations)
    if len(reference_observations.shape) == 1:
        reference_observations = reference_observations.reshape(-1, 1)

    if metadata.dependent_variables[0].type == ValueType.CLASS:
        # find all unique values in reference_observations
        num_classes = len(np.unique(reference_observations))
        reference_observations = class_to_onehot(reference_observations, n_classes=num_classes)

    reference_conditions_tensor = torch.from_numpy(reference_conditions_np).float()

    iv_limit_list = get_iv_limits(reference_conditions_np, metadata)

    popper_net, model_loss = train_popper_net_with_model(model,
                                              reference_conditions_np,
                                              reference_observations,
                                              metadata,
                                              iv_limit_list,
                                              training_epochs,
                                              training_lr,
                                              plot)

    # now that the popper network is trained we can sample new data points
    # to sample data points we need to provide the popper network with an initial
    # condition we will sample those initial conditions proportional to the loss of the current
    # model

    # feed average model losses through softmax
    # model_loss_avg= torch.from_numpy(np.mean(model_loss.detach().numpy(), axis=1)).float()
    softmax_func = torch.nn.Softmax(dim=0)
    probabilities = softmax_func(model_loss)
    # sample data point in proportion to model loss
    transform_category = torch.distributions.categorical.Categorical(probabilities)

    popper_net.freeze_weights()

    for condition in range(num_samples):

        index = transform_category.sample()
        input_sample = torch.flatten(reference_conditions_tensor[index, :])
        popper_input = Variable(input_sample, requires_grad=True)

        # invert the popper network to determine optimal experiment conditions
        for optimization_epoch in range(optimization_epochs):
            # feedforward pass on popper network
            popper_prediction = popper_net(popper_input)
            # compute gradient that maximizes output of popper network
            # (i.e. predicted loss of original model)
            popper_loss_optim = -popper_prediction
            popper_loss_optim.backward()

            with torch.no_grad():

                # first add repulsion from variable limits
                for idx in range(len(input_sample)):
                    iv_value = popper_input[idx]
                    iv_limits = iv_limit_list[idx]
                    dist_to_min = np.abs(iv_value - np.min(iv_limits))
                    dist_to_max = np.abs(iv_value - np.max(iv_limits))
                    # deal with boundary case where distance is 0 or very small
                    dist_to_min = np.max([dist_to_min, 0.00000001])
                    dist_to_max = np.max([dist_to_max, 0.00000001])
                    repulsion_from_min = limit_repulsion / (dist_to_min**2)
                    repulsion_from_max = limit_repulsion / (dist_to_max**2)
                    iv_value_repulsed = (
                        iv_value + repulsion_from_min - repulsion_from_max
                    )
                    popper_input[idx] = iv_value_repulsed

                # now add gradient for theory loss maximization
                delta = -optimization_lr * popper_input.grad
                popper_input += delta

                # finally, clip input variable from its limits
                for idx in range(len(input_sample)):
                    iv_raw_value = input_sample[idx]
                    iv_limits = iv_limit_list[idx]
                    iv_clipped_value = np.min(
                        [iv_raw_value, np.max(iv_limits) - limit_offset]
                    )
                    iv_clipped_value = np.max(
                        [
                            iv_clipped_value,
                            np.min(iv_limits) + limit_offset,
                        ]
                    )
                    popper_input[idx] = iv_clipped_value
                popper_input.grad.zero_()

        # add condition to new experiment sequence
        for idx in range(len(input_sample)):
            iv_limits = iv_limit_list[idx]
            # read the optimized value for this variable (input_sample shares
            # storage with popper_input, so it reflects the optimized values);
            # without this, iv_raw_value would be stale from the loop above
            iv_raw_value = input_sample[idx]

            # first clip value
            iv_clipped_value = np.min([iv_raw_value, np.max(iv_limits) - limit_offset])
            iv_clipped_value = np.max(
                [iv_clipped_value, np.min(iv_limits) + limit_offset]
            )
            # make sure to convert variable to original scale
            iv_clipped_scaled_value = iv_clipped_value

            x[condition, idx] = iv_clipped_scaled_value

    return iter(x)
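
A minimal usage sketch for the pooler, under the same illustrative setup as the earlier examples. A VariableCollection is required here; the example assumes the Variable and VariableCollection classes from autora.variable, as typically used with AutoRA experimentalists. The small limit_repulsion follows the note above about keeping samples away from the variable boundaries.

import numpy as np
from sklearn.linear_model import LinearRegression
from autora.variable import Variable, VariableCollection
from autora.experimentalist.falsification import pool

# Illustrative variable metadata (assumed autora.variable API)
metadata = VariableCollection(
    independent_variables=[Variable(name="x", value_range=(-3, 3))],
    dependent_variables=[Variable(name="y")],
)

reference_conditions = np.linspace(-3, 3, 30).reshape(-1, 1)
reference_observations = reference_conditions ** 2
model = LinearRegression().fit(reference_conditions, reference_observations)

new_pool = pool(
    model,
    reference_conditions,
    reference_observations,
    metadata,
    num_samples=10,
    limit_repulsion=1e-6,  # per the note above: repel samples from the bounds
)
new_conditions = np.array(list(new_pool))  # pool returns an iterator over rows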

sample(conditions, model, reference_conditions, reference_observations, metadata, num_samples=None, training_epochs=1000, training_lr=0.001, plot=False)

A Sampler that generates samples of experimental conditions with the objective of maximizing the (approximated) loss of a model relating experimental conditions to observations. The samples are generated by first training a neural network to approximate the loss of a model for all patterns in the training data. Once trained, the network is provided with the candidate samples of experimental conditions and then selects those with the highest loss.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| conditions | Union[DataFrame, ndarray] | The candidate samples of experimental conditions to be evaluated. | required |
| model | | Scikit-learn model; can be either a classification or regression model. | required |
| reference_conditions | Union[DataFrame, ndarray] | Experimental conditions that the model was trained on. | required |
| reference_observations | Union[DataFrame, ndarray] | Observations that the model was trained to predict. | required |
| metadata | VariableCollection | Meta-data about the dependent and independent variables specifying the experimental conditions. | required |
| num_samples | Optional[int] | Number of samples to return. | None |
| training_epochs | int | Number of epochs to train the popper network for approximating the error of the model. | 1000 |
| training_lr | float | Learning rate for training the popper network. | 0.001 |
| plot | bool | Print out the prediction of the popper network as well as its training loss. | False |

Returns: Samples with the highest loss

Source code in temp_dir/falsification/src/autora/experimentalist/falsification/__init__.py, lines 176-254:

def sample(
    conditions: Union[pd.DataFrame, np.ndarray],
    model,
    reference_conditions: Union[pd.DataFrame, np.ndarray],
    reference_observations: Union[pd.DataFrame, np.ndarray],
    metadata: VariableCollection,
    num_samples: Optional[int] = None,
    training_epochs: int = 1000,
    training_lr: float = 1e-3,
    plot: bool = False,
):
    """
    A Sampler that generates samples of experimental conditions with the objective of maximizing the
    (approximated) loss of a model relating experimental conditions to observations. The samples are generated by first
    training a neural network to approximate the loss of a model for all patterns in the training data.
    Once trained, the network is provided with the candidate samples of experimental conditions and then selects
    those with the highest loss.

    Args:
        conditions: The candidate samples of experimental conditions to be evaluated.
        model: Scikit-learn model, could be either a classification or regression model
        reference_conditions: Experimental conditions that the model was trained on
        reference_observations: Observations that the model was trained to predict
        metadata: Meta-data about the dependent and independent variables specifying the experimental conditions
        num_samples: Number of samples to return
        training_epochs: Number of epochs to train the popper network for approximating the
            error of the model
        training_lr: Learning rate for training the popper network
        plot: Print out the prediction of the popper network as well as its training loss

    Returns: Samples with the highest loss

    """

    # format input

    if isinstance(conditions, Iterable) and not isinstance(conditions, pd.DataFrame):
        conditions = np.array(list(conditions))

    condition_pool_copy = conditions.copy()
    conditions = np.array(conditions)
    reference_observations = np.array(reference_observations)
    reference_conditions = np.array(reference_conditions)
    if len(reference_conditions.shape) == 1:
        reference_conditions = reference_conditions.reshape(-1, 1)

    # get target pattern for popper net
    model_predict = getattr(model, "predict_proba", None)
    if not callable(model_predict):
        model_predict = getattr(model, "predict", None)

    if model_predict is None or not callable(model_predict):
        raise Exception("Model must have `predict` or `predict_proba` method.")

    predicted_observations = model_predict(reference_conditions)
    if not isinstance(predicted_observations, np.ndarray):
        try:
            predicted_observations = np.array(predicted_observations)
        except Exception:
            raise Exception("Model prediction must be convertible to numpy array.")
    if predicted_observations.ndim == 1:
        predicted_observations = predicted_observations.reshape(-1, 1)

    new_conditions, scores = falsification_score_sample_from_predictions(
        conditions,
        predicted_observations,
        reference_conditions,
        reference_observations,
        metadata,
        num_samples,
        training_epochs,
        training_lr,
        plot,
    )

    if isinstance(condition_pool_copy, pd.DataFrame):
        new_conditions = pd.DataFrame(new_conditions, columns=condition_pool_copy.columns)

    return new_conditions
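
Finally, a matching sketch for sample, which returns only the selected conditions (for a DataFrame condition pool, the column names are preserved). The setup and names are illustrative, as in the examples above.

import numpy as np
from sklearn.linear_model import LinearRegression
from autora.variable import Variable, VariableCollection
from autora.experimentalist.falsification import sample

# Illustrative variable metadata (assumed autora.variable API)
metadata = VariableCollection(
    independent_variables=[Variable(name="x", value_range=(-3, 3))],
    dependent_variables=[Variable(name="y")],
)

reference_conditions = np.linspace(-3, 3, 30).reshape(-1, 1)
reference_observations = reference_conditions ** 2
model = LinearRegression().fit(reference_conditions, reference_observations)

candidates = np.linspace(-3, 3, 101).reshape(-1, 1)
new_conditions = sample(
    candidates,
    model,
    reference_conditions,
    reference_observations,
    metadata,
    num_samples=5,
)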