Bandit Task with Model Confidence¶

Bandit tasks are used to study human reinforcement learning behavior. Here, we will implement a two-armed bandit task. We then run the same task on a language model specifically trained on tasks like these (centaur) and compare the results. To demonstrate how we can add additional data-points to the experiment, we will assess the certainty of the model's predictions. This can be used, for example, to explore which experimental designs are informative. (see, for example, AutoRA Uncertainty Experimentalist

Two-Armed Bandit Task¶

Installation¶

In [ ]:

Copied!

%%capture
!pip install sweetbean
%%capture
!pip install sweetbean

Imports¶

In [ ]:

Copied!





from sweetbean import Block, Experiment
from sweetbean.stimulus import Bandit, Text
from sweetbean.variable import (
    DataVariable,
    FunctionVariable,
    SharedVariable,
    SideEffect,
    TimelineVariable,
)
from sweetbean import Block, Experiment
from sweetbean.stimulus import Bandit, Text
from sweetbean.variable import (
    DataVariable,
    FunctionVariable,
    SharedVariable,
    SideEffect,
    TimelineVariable,
)

Timeline¶

Here, we slowly change the values of bandit_1 10 to 0 and for bandit_2 in reverse order from 0 to 10.

In [ ]:

Copied!





timeline = []
for i in range(11):
    timeline.append(
        {
            "bandit_1": {"color": "orange", "value": 10 - i},
            "bandit_2": {"color": "blue", "value": i},
        }
    )
timeline = []
for i in range(11):
    timeline.append(
        {
            "bandit_1": {"color": "orange", "value": 10 - i},
            "bandit_2": {"color": "blue", "value": i},
        }
    )

Implementation¶

We also keep track of the score with a shared variable to present it between the bandit tasks.

In [ ]:

Copied!





bandit_1 = TimelineVariable("bandit_1")
bandit_2 = TimelineVariable("bandit_2")

score = SharedVariable("score", 0)
value = DataVariable("value", 0)

update_score = FunctionVariable(
    "update_score", lambda sc, val: sc + val, [score, value]
)

update_score_side_effect = SideEffect(score, update_score)

bandit_task = Bandit(
    bandits=[bandit_1, bandit_2],
    side_effects=[update_score_side_effect],
)

score_text = FunctionVariable("score_text", lambda sc: f"Score: {sc}", [score])

show_score = Text(duration=2000, text=score_text)

trial_sequence = Block([bandit_task, show_score], timeline=timeline)
experiment = Experiment([trial_sequence])
bandit_1 = TimelineVariable("bandit_1")
bandit_2 = TimelineVariable("bandit_2")

score = SharedVariable("score", 0)
value = DataVariable("value", 0)

update_score = FunctionVariable(
    "update_score", lambda sc, val: sc + val, [score, value]
)

update_score_side_effect = SideEffect(score, update_score)

bandit_task = Bandit(
    bandits=[bandit_1, bandit_2],
    side_effects=[update_score_side_effect],
)

score_text = FunctionVariable("score_text", lambda sc: f"Score: {sc}", [score])

show_score = Text(duration=2000, text=score_text)

trial_sequence = Block([bandit_task, show_score], timeline=timeline)
experiment = Experiment([trial_sequence])

Instead of running the experiment manually, we can also use a large language model. In this case, we use centaur. This model has been trained on similar tasks as the two-armed bandit task. We can use the model to predict the next response and then run the experiment on the model. We can also assess the models certainty in its predictions. If we want to use additional data, our generate_function should return a dictionary. The key "response" is mandatory and should contain the response. There can be as many additional keys in the dictionary as needed. In this case, we will add the key "certainty" to the dictionary. This key will contain the certainty of the model in its prediction.

First, we need to install unsloth

In [ ]:

Copied!

!pip install unsloth "xformers==0.0.28.post2"
!pip install unsloth "xformers==0.0.28.post2"

Then, we load the model:

In [ ]:

Copied!





from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
  model_name = "marcelbinz/Llama-3.1-Centaur-8B-adapter",
  max_seq_length = 32768,
  dtype = None,
  load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

# our generate function will return a dict with the response and the certainty
def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")


    # Generate logits and tokens
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1,          # Generate only one new token
            do_sample=True,
            temperature=1.0,
            return_dict_in_generate=True,
            output_scores=True,        # Enable outputting scores (logits)
        )

# Get generated tokens (including the input prompt and the new token)
    generated_tokens = outputs.sequences  # Shape: [batch_size, sequence_length]
    
    # Extract the generated token ID (the last token in the sequence)
    generated_token_id = generated_tokens[0, -1]  # Assuming batch_size = 1
    
    # Convert logits to probabilities
    scores = outputs.scores  # List of logits for each generation step
    # Since max_new_tokens=1, outputs.scores will have length 1
    logits = scores[0]       # Shape: [batch_size, vocab_size]
    probabilities = torch.softmax(logits, dim=-1)  # Convert logits to probabilities
    
    # Get the probability of the generated token
    token_probability = probabilities[0, generated_token_id].item()  # probabilities[batch_idx, token_id]
    
    # Decode the generated text (including the input prompt and the new token)
    generated_text = tokenizer.decode(generated_tokens[0][-1], skip_special_tokens=True)

    return {"response": generated_text, "certainty": token_probability}
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
  model_name = "marcelbinz/Llama-3.1-Centaur-8B-adapter",
  max_seq_length = 32768,
  dtype = None,
  load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

# our generate function will return a dict with the response and the certainty
def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")


    # Generate logits and tokens
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1,          # Generate only one new token
            do_sample=True,
            temperature=1.0,
            return_dict_in_generate=True,
            output_scores=True,        # Enable outputting scores (logits)
        )

# Get generated tokens (including the input prompt and the new token)
    generated_tokens = outputs.sequences  # Shape: [batch_size, sequence_length]
    
    # Extract the generated token ID (the last token in the sequence)
    generated_token_id = generated_tokens[0, -1]  # Assuming batch_size = 1
    
    # Convert logits to probabilities
    scores = outputs.scores  # List of logits for each generation step
    # Since max_new_tokens=1, outputs.scores will have length 1
    logits = scores[0]       # Shape: [batch_size, vocab_size]
    probabilities = torch.softmax(logits, dim=-1)  # Convert logits to probabilities
    
    # Get the probability of the generated token
    token_probability = probabilities[0, generated_token_id].item()  # probabilities[batch_idx, token_id]
    
    # Decode the generated text (including the input prompt and the new token)
    generated_text = tokenizer.decode(generated_tokens[0][-1], skip_special_tokens=True)

    return {"response": generated_text, "certainty": token_probability}

In [ ]:

Copied!

data = experiment.run_on_language(generate)
data = experiment.run_on_language(generate)

Results¶

We can now look at the results: The responses, the values of the chosen bandits, and the certainty of the model in its predictions.

In [ ]:

Copied!





responses = [d["response"] for d in data]
values = [d["value"] for d in data]
certainties = [d["certainty"] for d in data]

for i, (response, value, certainty) in enumerate(zip(responses, values, certainties)):
    print(f"Response {i}: {response} (Value: {value}, Certainty: {certainty})")
responses = [d["response"] for d in data]
values = [d["value"] for d in data]
certainties = [d["certainty"] for d in data]

for i, (response, value, certainty) in enumerate(zip(responses, values, certainties)):
    print(f"Response {i}: {response} (Value: {value}, Certainty: {certainty})")

Conclusion¶

This notebook demonstrates how to run a simple bandit task via a language model and assess its certainty. The results can, for example, can be used to explore which experimental designs are informative.

SweetBean is also integrated in AutoRa, a platform for running the same experiments automatically via prolific. This allows for automatic data collection and analysis while using large language models either for prototyping, in finding good experimental design or for automatic fine-tuning.