Bandit Task with Target Value¶
Bandit tasks are used to study human reinforcement learning behavior. In this example, we demonstrate how to use SweetBean in combination with LLMs to find experimental sequences on which participants score above random chance. In other words, we show how natural-language experiments with synthetic participants can inform the design of web-based experiments with humans.
Timeline¶
Our goal is to counterbalance the reward values of the two bandits. Each bandit can either yield a reward or no reward under the following conditions:
- If Bandit 1 yields a reward, Bandit 2 does not, and vice versa.
- Each bandit yields a reward in 50% of the trials, ensuring balance.
We design a total of 50 trials. Theoretically, a participant could achieve a maximum score of 50 points if they perfectly predict the bandits. However, with random choices, the expected score is 25 points.
For this experiment, we aim to generate trial sequences where a simulated participant achieves at least 70% of the points. This allows us to investigate performance under conditions that exceed random chance but are not perfect.
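To make these numbers concrete, here is a quick back-of-the-envelope check (not part of the experiment code itself):
n_trials = 50
chance_score = n_trials / 2       # expected score with random choices: 25 points
target_score = 0.7 * n_trials     # 70% threshold: 35 points
print(chance_score, target_score)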
We begin by implementing a function that generates random reward sequences for the two bandits:
import random

def get_random_timeline(n=50):
    # counterbalanced rewards: each bandit is rewarded in exactly half the trials
    rewards = [0] * (n // 2) + [1] * (n // 2)
    random.shuffle(rewards)
    timeline = [
        {'bandit_1': {'color': 'orange', 'value': r},
         'bandit_2': {'color': 'blue', 'value': 1 - r}}
        for r in rewards
    ]
    return timeline

print(get_random_timeline(10))
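As an optional sanity check, we can verify that a generated timeline is counterbalanced: each trial rewards exactly one bandit, and each bandit is rewarded in half of the trials. A minimal check, using only the structure produced above:
check_timeline = get_random_timeline(10)
# exactly one bandit is rewarded per trial
assert all(t['bandit_1']['value'] + t['bandit_2']['value'] == 1 for t in check_timeline)
# bandit 1 is rewarded in exactly half of the trials
assert sum(t['bandit_1']['value'] for t in check_timeline) == len(check_timeline) // 2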
Experiment¶
We create a function that returns a SweetBean two-armed bandit experiment.
Install SweetBean:
%%capture
!pip install sweetbean
Define the function:
from sweetbean import Experiment, Block
from sweetbean.variable import (
    TimelineVariable, SharedVariable, DataVariable,
    FunctionVariable, SideEffect
)
from sweetbean.stimulus import Bandit, Text


def get_experiment(timeline):
    bandit_1 = TimelineVariable("bandit_1")
    bandit_2 = TimelineVariable("bandit_2")

    score = SharedVariable("score", 0)
    value = DataVariable("value", 0)

    # here, we set an identifier to make it easier to filter the correct
    # trials from the data
    bandit_identifier = DataVariable("is_bandit_task", 0)

    # add the value of the chosen bandit to the running score
    update_score = FunctionVariable(
        "update_score", lambda sc, val: sc + val, [score, value]
    )
    update_score_side_effect = SideEffect(score, update_score)

    # mark this trial as a bandit trial (any truthy value works for the filter below)
    add_identifier = SideEffect(bandit_identifier, True)

    bandit_task = Bandit(
        bandits=[bandit_1, bandit_2],
        side_effects=[update_score_side_effect, add_identifier],
    )
    show_score = Text(duration=1000, text=score)

    block = Block([bandit_task, show_score], timeline=timeline)
    experiment = Experiment([block])
    return experiment
Let's test the experiment as an HTML file:
timeline = get_random_timeline(10)
experiment = get_experiment(timeline)
experiment.to_html('bandit.html')
LLM participant¶
After confirming that the HTML file behaves as expected by running it, we can create a synthetic participant using the Centaur model.
Installing the dependencies:
%%capture
!pip install unsloth "xformers==0.0.28.post2"
Creating a generate function:
from unsloth import FastLanguageModel
import transformers

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="marcelbinz/Llama-3.1-Centaur-8B-adapter",
    max_seq_length=32768,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    pad_token_id=0,
    do_sample=True,
    temperature=1.0,
    max_new_tokens=1,
)


def generate(prompt):
    return pipe(prompt)[0]["generated_text"][len(prompt):]
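As a quick smoke test, we can call the function on a short, hypothetical prompt (the pipeline samples a single token, so the exact output will vary between runs):
# hypothetical prompt just to check that the pipeline responds; output will vary
print(generate("You see an orange and a blue slot machine. You choose the"))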
First, let's simulate a single experiment:
timeline = get_random_timeline(10)
experiment = get_experiment(timeline)
data, _ = experiment.run_on_language(get_input=generate)
... and look at the data:
print(data)
We can filter for the trials marked with "is_bandit_task" and extract the obtained values:
data_values = [d['value'] for d in data if 'is_bandit_task' in d and d['is_bandit_task']]
print(data_values)
print(sum(data_values)/len(timeline))
Let's define a function for the full simulation:
def simulation(n):
    timeline = get_random_timeline(n)
    experiment = get_experiment(timeline)
    data, _ = experiment.run_on_language(get_input=generate)
    data_values = [d['value'] for d in data if 'is_bandit_task' in d and d['is_bandit_task']]
    return sum(data_values) / n, timeline
Now, we can create a loop that runs simulations until the 70% threshold is reached and stores the timeline of the corresponding reward sequence. To speed things up here, we only simulate 20 trials. (In a real application, instead of drawing random sequences, one would vary the sequences more systematically, for example by applying drifts to the reward probabilities; a sketch of this idea follows further below.)
import json

value_percentage = 0
while value_percentage < 0.7:
    value_percentage, timeline = simulation(20)
    print()
    print(value_percentage)
    print(timeline)

with open('timeline.json', 'w') as f:
    json.dump(timeline, f)
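Since the selected timeline is stored as JSON, it can be reloaded in a later session, for example when setting up the web-based experiment for human participants:
# reload the stored reward sequence from disk
with open('timeline.json') as f:
    timeline = json.load(f)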
Let's rerun the experiment on the same timeline to check whether the LLM just got lucky or whether a similar average value can be achieved again:
experiment = get_experiment(timeline)
data, _ = experiment.run_on_language(get_input=generate)
data_values = [d['value'] for d in data if 'is_bandit_task' in d and d['is_bandit_task']]
print(sum(data_values)/len(timeline))
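As mentioned above, a real application would vary the reward sequences more systematically instead of drawing them at random. Below is a minimal sketch of one such approach, a reward probability that drifts linearly across trials; get_drifting_timeline is a hypothetical helper, not part of SweetBean, and it no longer guarantees an exact 50/50 split:
import random

def get_drifting_timeline(n=50, p_start=0.8, p_end=0.2):
    # reward bandit 1 with a probability that drifts linearly from p_start to p_end
    timeline = []
    for i in range(n):
        p = p_start + (p_end - p_start) * i / max(n - 1, 1)
        r = 1 if random.random() < p else 0
        timeline.append({
            'bandit_1': {'color': 'orange', 'value': r},
            'bandit_2': {'color': 'blue', 'value': 1 - r},
        })
    return timeline
Such a timeline can be passed to get_experiment in exactly the same way as the random one.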
Conclusion¶
SweetBean can be used to pilot experiments. Afterward, one could manually set up the experiment with the same timeline and run it on human participants, or use AutoRA to conveniently automate hosting the experiment and collecting the data online. One could even run experiments in a closed loop to iteratively improve the design with a mixture of simulated and human data.