Overview

The wizrd package exists to test the hypothesis that Large Language Models (LLMs) can be programmed as functions and integrated with data science tools and workflows implemented in any programming language. To accomplish this, wizrd defines a grammar and implements a fluent API for programming with LLMs.

The examples below use a small utility for pretty-printing output as Rmd block quotes:

pretty_rmd <- function(x) {
    x |> strwrap() |> paste(">", text = _, collapse = "\n") |> cat()
}

Quick start

To start analyzing data with an agent:

library(wizrd)
agent <- llamafile_llama()
predict(agent, "Describe the mtcars dataset") |> pretty_rmd()

The mtcars dataset is a built-in dataset in R, a popular programming language for statistical computing and graphics. It was created by John Fox and Alan S. Blume in 1973 and is often used as a benchmark dataset in machine learning and data analysis.

Here’s an overview of the mtcars dataset:

Description: The mtcars dataset contains information about 32 cars, including their performance characteristics, engine specifications, and fuel efficiency. The dataset is divided into two main categories:

  1. Car characteristics: This includes variables such as: * mpg: miles per gallon (fuel efficiency) * cyl: number of cylinders in the engine * disp: engine displacement (in cubic inches) * hp: horsepower of the engine * drat: gear ratio * wt: weight of the car (in thousands of pounds) * qsec: quarter mile time * vs: vehicle type (0 = automatic, 1 = manual) * am: transmission type (0 = automatic, 1 = manual) * gear: number of gears in the transmission * carb: number of carburetors 2. Car performance: This includes variables such as: * vs: vehicle type (0 = automatic, 1 = manual) * am: transmission type (0 = automatic, 1 = manual) gear: number of gears in the transmission carb: number of carburetors

Data structure: The mtcars dataset is a data frame with 32 rows (one row per car) and 11 columns (variables).

Example: Here’s a sample of the first few rows of the mtcars dataset: r mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 3.85 2.320 18.61 1 1 4 1 I hope this helps! Let me know if you have any questions or need further clarification.<|eot_id|>

The call to llamafile_llama() will download, cache and run the self-contained, cross-platform llamafile binary for the llama 3.2 3B model. Behind the scenes, it starts up a local HTTP server (based on llama.cpp) through which it communicates with the R process. For more general use of local LLMs, it is recommended to install Ollama and use ollama_agent() to pull (if necessary) and run agents with Ollama. The llama() function is a convenience to run llama 3.2 3B with Ollama. Convenience functions exist for some other common models.

The predict() function is the most convenient way to execute single exchanges with an agent. To maintain a context over multiple exchanges, use the chat() function:

ctx <- agent |>
    chat("Describe the mtcars dataset") |>
    chat("How can I manipulate it in R?")
ctx
#> <Chat>: 5 messages
#>  - - - - - - - - - - - - - - -  Latest messages  - - - - - - - - - - - - - - -
#>                                          ┌───────────────────────────────────┐
#>                                          │                                   │
#>                                          │   How can I manipulate it in R?   │
#>                                          │                                   │
#>                                          └───────────────────────────────────┘
#> 
#> You can manipulate the mtcars dataset in R using various functions and
#> techniques. Here are a few examples:
#> 
#> **1. View the first few rows of the dataset**
#> ```r
#> head(mtcars)
#> ```
#> This will display the first few rows of the dataset.
#> 
#> **2. Get the summary statistics of the dataset**
#> ```r
#> summary(mtcars)
#> ```
#> This will display the summary statistics of the dataset, including the mean,
#> median, and standard deviation of each variable.
#> 
#> **3. Filter the dataset to include only cars with a specific transmission
#> type**
#> ```r
#> mtcars[mtcars$am == 1, ]
#> ```
#> This will return a subset of the dataset that includes only cars with a
#> manual transmission.
#> 
#> **4. Group the dataset by transmission type and calculate the mean mpg**
#> ```r
#> library(dplyr)
#> mtcars %>%
#> group_by(am) %>%
#> summarise(mean_mpg = mean(mpg))
#> ```
#> This will group the dataset by transmission type (am) and calculate the mean
#> mpg for each group.
#> 
#> **5. Plot a scatterplot of mpg vs. wt**
#> ```r
#> plot(mpg ~ wt, data = mtcars)
#> ```
#> This will create a scatterplot of mpg vs. wt.
#> 
#> **6. Calculate the correlation between mpg and wt**
#> ```r
#> cor(mpg, wt, mtcars)
#> ```
#> This will calculate the correlation between mpg and wt.
#> 
#> **7. Create a new column for the horsepower**
#> ```r
#> mtcars$hp <- mtcars$hp
#> ```
#> This will create a new column called hp in the dataset.
#> 
#> **8. Sort the dataset by mpg in descending order**
#> ```r
#> mtcars[order(mtcars$mpg, decreasing = TRUE), ]
#> ```
#> This will sort the dataset by mpg in descending order.
#> 
#> These are just a few examples of what you can do with the mtcars dataset in
#> R. I hope this helps! Let me know if you have any questions or need further
#> assistance.<|eot_id|>

A useful feature of the returned Chat object is that it prints the context in a readable format.

To extract the last output from the Chat object, call last_output():

last_output(ctx) |> pretty_rmd()

You can manipulate the mtcars dataset in R using various functions and techniques. Here are a few examples:

1. View the first few rows of the dataset r head(mtcars) This will display the first few rows of the dataset.

2. Get the summary statistics of the dataset r summary(mtcars) This will display the summary statistics of the dataset, including the mean, median, and standard deviation of each variable.

3. Filter the dataset to include only cars with a specific transmission type r mtcars[mtcars$am == 1, ] This will return a subset of the dataset that includes only cars with a manual transmission.

4. Group the dataset by transmission type and calculate the mean mpg r library(dplyr) mtcars %>% group_by(am) %>% summarise(mean_mpg = mean(mpg)) This will group the dataset by transmission type (am) and calculate the mean mpg for each group.

5. Plot a scatterplot of mpg vs. wt r plot(mpg ~ wt, data = mtcars) This will create a scatterplot of mpg vs. wt.

6. Calculate the correlation between mpg and wt r cor(mpg, wt, mtcars) This will calculate the correlation between mpg and wt.

7. Create a new column for the horsepower r mtcars$hp <- mtcars$hp This will create a new column called hp in the dataset.

8. Sort the dataset by mpg in descending order r mtcars[order(mtcars$mpg, decreasing = TRUE), ] This will sort the dataset by mpg in descending order.

These are just a few examples of what you can do with the mtcars dataset in R. I hope this helps! Let me know if you have any questions or need further assistance.<|eot_id|>

The returned value can then be used in further computations.

As an exercise, try to create a readline-based chatbot interface. See wizrd:::readline_chat for one answer.
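One possible shape for such a loop is sketched below. It is written against injected functions so that it runs without a model: `chat_loop()`, `step`, `reply`, and the echo stand-ins are all hypothetical names introduced for illustration, and this is not necessarily how wizrd:::readline_chat works.

```r
# A generic read-eval-print loop over a chat context.
# `step(ctx, input)` returns the updated context; `reply(ctx)` extracts the
# latest response. With wizrd these would presumably be chat() and
# last_output(); they are passed in so the loop itself has no dependencies.
chat_loop <- function(ctx, step, reply, read = function() readline("> ")) {
    repeat {
        input <- read()
        if (!nzchar(input)) break   # an empty line ends the session
        ctx <- step(ctx, input)
        cat(reply(ctx), "\n", sep = "")
    }
    invisible(ctx)
}

# Offline stand-in for an agent: it simply echoes the input.
echo_step <- function(ctx, input) c(ctx, paste("You said:", input))
echo_reply <- function(ctx) ctx[length(ctx)]

# Scripted input in place of the keyboard, so the example is reproducible.
scripted <- local({
    lines <- c("hello", "bye", "")
    i <- 0
    function() {
        i <<- i + 1
        lines[i]
    }
})

final_ctx <- chat_loop(character(), echo_step, echo_reply, read = scripted)
```

Assuming chat() accepts either an agent or a Chat object as its first argument, as the piped example above suggests, an interactive session might be started with `chat_loop(agent, chat, last_output)`.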

LLMs as functions

There are three requirements for LLMs to act as functions:

1. Accept a list of input parameters, each of arbitrary type,
2. Implement a series of logical operations, potentially delegating to R functions, including those based on an LLM, and
3. Return an R object of a specified type and structure.

We will demonstrate how the wizrd package meets each of these requirements in turn. We will use gpt-4o-mini for this example, in order to support constrained output. Set the OPENAI_API_KEY environment variable to your OpenAI key before running this.

agent <- openai_agent("gpt-4o-mini", temperature = 0) |>
    instruct("Answer questions about this dataset:", mtcars)

For reproducibility reasons, it is best to explicitly specify the underlying model, as above, because the default will change as new models are released.

The instruct() function configures the agent with a system prompt, instructing the agent and providing basic context. In this case, we insert the mtcars dataset verbatim as context, which the agent can reference in its responses. The agent will now be able to answer questions about the dataset’s characteristics.
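To get a feel for what inserting a dataset as context involves, the sketch below renders a data.frame as plain text and prepends the instruction. This is only an illustration in base R: `as_prompt_text()` is a hypothetical helper, and wizrd may serialize objects differently.

```r
# Render a data.frame as plain text, as one might when pasting it into a prompt.
# Illustration only; wizrd may serialize objects differently.
as_prompt_text <- function(x) {
    paste(utils::capture.output(print(x)), collapse = "\n")
}

system_prompt <- paste(
    "Answer questions about this dataset:",
    as_prompt_text(head(mtcars, 3)),
    sep = "\n"
)
cat(system_prompt, "\n")

# The size of the text gives a rough sense of how much context it consumes.
nchar(as_prompt_text(mtcars))
```

A small dataset like mtcars fits comfortably in a prompt; larger data would likely call for the tool-based or retrieval-based approaches shown later.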

Parameterized input

Let’s extend the above example so that it can analyze the relationship between any two variables in the mtcars dataset.

parameterized_agent <- agent |>
    prompt_as("Analyze the relationship between {var1} and {var2}.")
parameterized_agent |> chat(list(var1 = "mpg", var2 = "wt"))
#> <Chat>: 3 messages
#>  - - - - - - - - - - - - - - -  Latest messages  - - - - - - - - - - - - - - -
#>                                   ┌──────────────────────────────────────────┐
#>                                   │                                          │
#>                                   │   Analyze the relationship between mpg   │
#>                                   │   and wt.                                │
#>                                   │                                          │
#>                                   └──────────────────────────────────────────┘
#> 
#> To analyze the relationship between miles per gallon (mpg) and weight (wt) in
#> the provided dataset, we can consider the following points:
#> 
#> 1. **General Trend**: Typically, in automotive datasets, there is an inverse
#> relationship between mpg and weight. As the weight of a vehicle increases,
#> the fuel efficiency (mpg) tends to decrease. Heavier vehicles require more
#> energy to move, which can lead to lower fuel efficiency.
#> 
#> 2. **Visual Representation**: A scatter plot of mpg against wt would help
#> visualize this relationship. In such a plot, we would expect to see a
#> downward trend, indicating that as weight increases, mpg decreases.
#> 
#> 3. **Statistical Analysis**: To quantify the relationship, we could calculate
#> the correlation coefficient between mpg and wt. A negative correlation
#> coefficient would suggest that as weight increases, mpg decreases.
#> 
#> 4. **Regression Analysis**: Performing a linear regression analysis could
#> provide a more detailed understanding of the relationship. The regression
#> equation would help predict mpg based on weight and provide insights into the
#> strength of the relationship.
#> 
#> 5. **Outliers**: It is also important to check for outliers in the dataset
#> that may skew the results. For instance, very heavy vehicles with low mpg
#> could significantly affect the overall trend.
#> 
#> 6. **Categorical Factors**: Other factors such as the number of cylinders
#> (cyl), horsepower (hp), and type of transmission (am) could also influence
#> mpg. It may be useful to control for these variables in a multivariate
#> analysis to isolate the effect of weight on mpg.
#> 
#> In summary, while we expect to see a negative relationship between mpg and
#> wt, further analysis through visualization, correlation, and regression would
#> provide a clearer picture of this relationship in the dataset.

By calling prompt_as(), we parameterized the agent using a glue template to accept parameters named var1 and var2. By passing "mpg" and "wt" as the variables, we get an analysis of the relationship between fuel efficiency and weight.
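The substitution step can be pictured with a few lines of base R. This is a simplified stand-in for glue-style interpolation, not the wizrd or glue implementation; `fill_template()` is a hypothetical helper that only replaces `{name}` placeholders and does not evaluate R expressions the way glue does.

```r
# Fill "{name}" placeholders in a template from a named list.
# A base-R stand-in for glue-style interpolation, for illustration only.
fill_template <- function(template, params) {
    for (name in names(params)) {
        placeholder <- paste0("{", name, "}")
        template <- gsub(placeholder, params[[name]], template, fixed = TRUE)
    }
    template
}

prompt <- fill_template(
    "Analyze the relationship between {var1} and {var2}.",
    list(var1 = "mpg", var2 = "wt")
)
prompt
#> [1] "Analyze the relationship between mpg and wt."
```

The result matches the user message shown in the chat output above.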

Implementing logic

We can define the logic of the LLM function using natural language instructions, inserted into the system prompt, using the instruct() function:

instructed_agent <- parameterized_agent |>
    instruct("Answer questions about this dataset:", mtcars,
             "When comparing variables, calculate their correlation.")
chat(instructed_agent, list(var1 = "mpg", var2 = "wt"))
#> <Chat>: 3 messages
#>  - - - - - - - - - - - - - - -  Latest messages  - - - - - - - - - - - - - - -
#>                                   ┌──────────────────────────────────────────┐
#>                                   │                                          │
#>                                   │   Analyze the relationship between mpg   │
#>                                   │   and wt.                                │
#>                                   │                                          │
#>                                   └──────────────────────────────────────────┘
#> 
#> To analyze the relationship between miles per gallon (mpg) and weight (wt),
#> we can calculate the correlation coefficient between these two variables. The
#> correlation coefficient (often denoted as "r") measures the strength and
#> direction of a linear relationship between two variables.
#> 
#> 1. **Data Extraction**: We will extract the mpg and wt values from the
#> dataset.
#> 
#> 2. **Calculation of Correlation**: The correlation coefficient can be
#> calculated using the formula:
#> \[
#> r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum
#> y^2 - (\sum y)^2]}}
#> \]
#> where:
#> - \( n \) is the number of data points,
#> - \( x \) represents mpg values,
#> - \( y \) represents wt values.
#> 
#> 3. **Interpretation**: The value of r ranges from -1 to 1:
#> - r = 1 indicates a perfect positive correlation,
#> - r = -1 indicates a perfect negative correlation,
#> - r = 0 indicates no correlation.
#> 
#> Let's calculate the correlation coefficient for mpg and wt using the provided
#> dataset.
#> 
#> ### Data Points
#> Here are the mpg and wt values extracted from the dataset:
#> 
#> | mpg | wt |
#> |-------|-------|
#> | 21.0 | 2.62 |
#> | 21.0 | 2.875 |
#> | 22.8 | 2.32 |
#> | 21.4 | 3.215 |
#> | 18.7 | 3.44 |
#> | 18.1 | 3.46 |
#> | 14.3 | 3.57 |
#> | 24.4 | 3.19 |
#> | 22.8 | 3.15 |
#> | 19.2 | 3.44 |
#> | 17.8 | 3.44 |
#> | 16.4 | 4.07 |
#> | 17.3 | 3.73 |
#> | 15.2 | 3.78 |
#> | 10.4 | 5.25 |
#> | 10.4 | 5.424 |
#> | 14.7 | 5.345 |
#> | 32.4 | 2.2 |
#> | 30.4 | 1.615 |
#> | 33.9 | 1.835 |
#> | 21.5 | 2.465 |
#> | 15.5 | 3.52 |
#> | 15.2 | 3.435 |
#> | 13.3 | 3.84 |
#> | 19.2 | 3.845 |
#> | 27.3 | 1.935 |
#> | 26.0 | 2.14 |
#> | 30.4 | 1.513 |
#> | 15.8 | 3.17 |
#> | 19.7 | 2.77 |
#> | 15.0 | 3.57 |
#> | 21.4 | 2.78 |
#> 
#> ### Calculation
#> Using statistical software or a programming language like Python or R, we can
#> compute the correlation coefficient.
#> 
#> For example, in Python, you could use:
#> ```python
#> import pandas as pd
#> 
#> data = {
#> 'mpg': [21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4,
#> 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2,
#> 27.3, 26, 30.4, 15.8, 19.7, 15, 21.4],
#> 'wt': [2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44, 3.44,
#> 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2, 1.615, 1.835, 2.465, 3.52, 3.435,
#> 3.84, 3.845, 1.935, 2.14, 1.513, 3.17, 2.77, 3.57, 2.78]
#> }
#> 
#> df = pd.DataFrame(data)
#> correlation = df['mpg'].corr(df['wt'])
#> print(correlation)
#> ```
#> 
#> ### Result
#> After performing the calculation, you would find that the correlation
#> coefficient between mpg and wt is approximately -0.87.
#> 
#> ### Conclusion
#> This indicates a strong negative correlation between mpg and weight. As the
#> weight of the vehicle increases, the miles per gallon (fuel efficiency) tends
#> to decrease. This is a common finding in automotive data, as heavier vehicles
#> generally require more fuel to operate.

The agent now responds with a more structured analysis centered on correlation, including the formula, the extracted data, and an interpretation. However, it is not able to carry out the correlation calculation itself.

To solve this, we can provide a tool that performs the actual correlation calculation:

calculate_correlation <- function(var1, var2) {
    cor(mtcars[[var1]], mtcars[[var2]])
}
equipped_agent <- instructed_agent |> equip(calculate_correlation)
equipped_agent |> chat(list(var1 = "mpg", var2 = "wt"))
#> <Chat>: 5 messages
#>  - - - - - - - - - - - - - - -  Latest messages  - - - - - - - - - - - - - - -
#>                              + call_2XNUw6iErqvV… +
#>                              |                    |
#>                              |   [1] -0.8676594   |
#>                              |                    |
#>                              +--------------------+
#> 
#> The correlation between miles per gallon (mpg) and weight (wt) is
#> approximately -0.87. This indicates a strong negative correlation, meaning
#> that as the weight of the vehicle increases, the miles per gallon tends to
#> decrease.
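Because the tool's arguments are chosen by the model rather than by the programmer, it can be worth validating them. The sketch below is a more defensive variant of the tool above; `safe_correlation()` is a hypothetical name, and how wizrd reports a tool error back to the model is not covered here.

```r
# A defensive variant of the correlation tool: check the variable names
# before computing, since they are supplied by the model.
safe_correlation <- function(var1, var2) {
    unknown <- setdiff(c(var1, var2), names(mtcars))
    if (length(unknown) > 0) {
        stop("Unknown variable(s): ", paste(unknown, collapse = ", "),
             ". Available: ", paste(names(mtcars), collapse = ", "))
    }
    cor(mtcars[[var1]], mtcars[[var2]])
}

safe_correlation("mpg", "wt")
#> [1] -0.8676594

# A misspelled variable name produces an informative error instead.
result <- try(safe_correlation("mpg", "weight"), silent = TRUE)
inherits(result, "try-error")
#> [1] TRUE
```

It keeps the same two-argument signature as `calculate_correlation()`, so it could presumably be passed to equip() in the same way.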

Constraining output

In order to incorporate the output into a larger program, it is often necessary to convert the output to a more standardized and computable object. We can use the output_as() function to structure our analysis results:

# Return the correlation as a single number
equipped_agent |>
    output_as(S7::class_numeric) |>
    predict(list(var1 = "mpg", var2 = "wt"))
#> [1] -0.8676594

# Return a filtered subset of mtcars, which serves as the prototype
filtered_agent <- agent |>
    output_as(mtcars)
filtered_agent |>
    predict("Cars with mpg > 20")
#>    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 4 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> 5 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> 6 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 7 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> 8 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> 9 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

# Return a summary data.frame
summary_agent <- agent |>
    output_as(data.frame(
        cyl = integer(),
        avg_mpg = numeric(),
        avg_hp = numeric()
    ))
summary_agent |>
    predict("Average mpg and hp by number of cylinders.")
#>   cyl avg_mpg avg_hp
#> 1   4    24.1   91.5
#> 2   6    18.9  130.5
#> 3   8    15.1  209.5

# Using class_data.frame for more general output
agent |>
    output_as(S7::class_data.frame) |>
    predict("Top 5 most fuel efficient cars including mpg, hp, and wt.")
#>    mpg  hp    wt
#> 1 33.9  65 1.835
#> 2 32.4  66 2.200
#> 3 30.4  52 1.615
#> 4 30.4 113 1.513
#> 5 27.3  66 1.935

In the above examples, we demonstrate different ways to structure the output:

1. Using an existing data.frame (mtcars) as a template
2. Using a data.frame stub with specific columns
3. Using class_data.frame for more general output

The agent will return properly structured data.frames that can be used directly in further analysis or visualization.
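Since a prototype fixes the expected columns and types, a result can be checked against it before being passed on. The following is a minimal base-R sketch of such a check. `conforms_to()` is a hypothetical helper, not part of wizrd, and `direct` is computed in R as a stand-in for an agent's return value.

```r
# Check that a result has the same column names and column classes as a prototype.
conforms_to <- function(x, prototype) {
    col_class <- function(df) vapply(df, function(col) class(col)[1], character(1))
    is.data.frame(x) &&
        identical(names(x), names(prototype)) &&
        identical(col_class(x), col_class(prototype))
}

prototype <- data.frame(cyl = integer(), avg_mpg = numeric(), avg_hp = numeric())

# A summary computed directly in R, standing in for an agent's return value.
direct <- data.frame(
    cyl = as.integer(sort(unique(mtcars$cyl))),
    avg_mpg = as.numeric(tapply(mtcars$mpg, mtcars$cyl, mean)),
    avg_hp = as.numeric(tapply(mtcars$hp, mtcars$cyl, mean))
)

conforms_to(direct, prototype)
#> [1] TRUE
conforms_to(mtcars, prototype)
#> [1] FALSE
```

A direct computation like `direct` can also serve as a reference when comparing values returned by an agent.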

Note that output_as() supports any S7 class as a constraint, not just data.frames. This allows for type-safe conversion of agent outputs into any R object structure defined using S7.

Converting an agent to an actual R function

Since the agent is already behaving like a function, it is relatively straightforward to convert it into an actual R function using the convert() generic from S7:

# Convert the filtered agent to a function
filter_cars <- S7::convert(filtered_agent, S7::class_function)
filter_cars("Cars with mpg > 30")
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#> 2 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
#> 3 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
#> 4 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
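Once an agent is an ordinary function, it composes with the rest of R, for example with lapply(). The sketch below shows the idea offline using a stand-in object with a predict() method. `as_plain_function()` and `stub_agent` are hypothetical and purely illustrative; this is not necessarily how wizrd implements convert().

```r
# Wrap anything with a predict() method in a plain function.
# Illustrative only; not necessarily how wizrd implements convert().
as_plain_function <- function(model) {
    function(input, ...) predict(model, input, ...)
}

# An offline stand-in for filtered_agent: it understands only "mpg > N".
stub_agent <- structure(list(data = mtcars), class = "stub_agent")
predict.stub_agent <- function(object, input, ...) {
    threshold <- as.numeric(sub(".*mpg > *", "", input))
    object$data[object$data$mpg > threshold, ]
}

filter_cars_stub <- as_plain_function(stub_agent)

# Apply the function over several queries, as with any other R function.
queries <- c("Cars with mpg > 30", "Cars with mpg > 25")
results <- lapply(queries, filter_cars_stub)
vapply(results, nrow, integer(1))
#> [1] 4 6
```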

Additional features

Model Context Protocol (MCP)

The wizrd package implements a client for the Model Context Protocol (MCP), which enables constructing agents from shared ingredients, including tools, prompts, and resources.

Here’s an example of using MCP to interact with a data analysis server:

# Start the data analysis server
data_server.py <- system.file("mcp", "data_server.py",
                              package = "wizrd")
server <- start_mcp(data_server.py)

# Create an MCP session
session <- connect_mcp(server)

# List available tools
tools <- tools(session)

# Tools are just ordinary R functions
tools$get_mean(mtcars$mpg)
#> [1] "20.090625"

# Equip an agent with the MCP tools and analyze the data
mcp_agent <- agent |>
    output_as(S7::class_numeric) |>
    equip(tools)
predict(mcp_agent, "What is the mean fuel efficiency in the mtcars dataset?")
#> [1] 20.09062

MCP provides a standardized way to:

1. Discover available tools, prompts, and resources
2. Call tools with structured arguments
3. Access and use predefined prompts
4. Handle resources and templates

This makes it easier to work with different agent implementations while maintaining a consistent interface in R.
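For orientation, MCP messages are JSON-RPC 2.0, and a tool call is a request with the method `tools/call`. The sketch below writes such a request as an R list. The argument name `values` is hypothetical, since the actual parameters of get_mean are defined by the server, and the sketch does not show how wizrd builds or sends its messages.

```r
# Shape of an MCP "tools/call" request, written as an R list.
# The argument name `values` is hypothetical; the server defines the real one.
call_request <- list(
    jsonrpc = "2.0",
    id = 1L,
    method = "tools/call",
    params = list(
        name = "get_mean",
        arguments = list(values = mtcars$mpg)
    )
)
str(call_request, max.level = 2)

# Tool results commonly arrive as text, so a numeric answer needs parsing
# before it can be used in further computation.
tool_text <- "20.090625"
all.equal(as.numeric(tool_text), mean(mtcars$mpg))
#> [1] TRUE
```

This may be why the direct call to `tools$get_mean()` above prints a string, while the agent constrained with output_as() returns a number.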

Retrieval Augmented Generation (RAG)

The wizrd package implements experimental functionality for Retrieval Augmented Generation (RAG). One potentially useful application is in the querying of R manual pages.

Querying Rd

The code below uses the chunk() generic to generate text chunks from the S7 man pages. Next, it creates a TextStore that indexes those chunks using the nomic text embedding model. It then configures the prompt generator to query the text store for chunks that are similar to the query. Finally, it sends the query for an example of the S7::new_property() function.

The chunk() utility has basic support for a number of formats, including markdown derivatives, HTML and PDF.

Since the output is markdown, we embed it directly in this document.

chunks <- chunk(tools::Rd_db("S7"))
store <- text_store(nomic(), chunks)
agent <- llama() |> prompt_as(rag_with(store))
cat("#### new_property example\n")
#> #### new_property example
last_message(chat(agent, "new_property example"))
#> Here's an example of using the `new_property` function in R:
#> 
#> ```r
#> # Define a new property for a class called "Person"
#> person_class <- new_class("Person", properties = list(
#> name = new_property(class_character, default = "John"),
#> age = new_property(class_numeric)
#> ))
#> 
#> # Create an instance of the Person class with default values
#> john <- person_class()
#> 
#> # Print the initial values of the properties
#> print(john$name) # prints: John
#> print(john$age) # prints: NA
#> 
#> # Update the name property and print the result
#> john$name <- "Jane"
#> print(john$name) # prints: Jane
#> 
#> # Create a new instance with custom values for both properties
#> jane <- person_class(name = "Jane", age = 30)
#> 
#> # Print the initial values of the properties
#> print(jane$name) # prints: Jane
#> print(jane$age) # prints: 30
#> 
#> # Update the age property and print the result
#> jane$age <- 31
#> print(jane$age) # prints: 31
#> 
#> # Use a dynamic property to compute on demand
#> clock_class <- new_class("Clock", properties = list(
#> now = new_property(getter = function(self) Sys.time())
#> ))
#> 
#> my_clock <- clock_class()
#> 
#> # Print the initial value of the now property
#> print(my_clock$now) # prints: current time
#> 
#> # Wait for a second and print the updated value of the now property
#> Sys.sleep(1)
#> print(my_clock$now) # prints: new current time
#> ```
#> 
#> In this example, we define two classes: `Person` with properties `name` and
#> `age`, and `Clock` with a dynamic property `now`. We create instances of
#> these classes with default values or custom values for the properties. We
#> also demonstrate how to use dynamic properties to compute on demand.

Querying a data.frame

And here is an example of querying a data.frame using RAG:

chunks <- chunk(mtcars)
store <- text_store(nomic(), chunks)
agent <- llama() |> prompt_as(rag_with(store))
predict(agent, "MPG of Datsun 710")
#> [1] "The MPG (miles per gallon) of the Datsun 710 is 22.8."
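The retrieval step behind both examples can be sketched in base R under strong simplifications. Here each row becomes one text chunk, and a toy word-count vector with cosine similarity stands in for the nomic embedding model. All function names below are hypothetical, and the chunk format wizrd actually produces for a data.frame may differ.

```r
# Turn each row of a data.frame into one text chunk.
row_chunks <- function(df) {
    vapply(seq_len(nrow(df)), function(i) {
        fields <- paste0(names(df), "=", unlist(df[i, ]), collapse = ", ")
        paste0(rownames(df)[i], ": ", fields)
    }, character(1))
}

# Toy "embedding": counts of lowercase word tokens over a shared vocabulary.
tokenize <- function(text) {
    tokens <- strsplit(tolower(text), "[^a-z0-9.]+")[[1]]
    tokens[nzchar(tokens)]
}
embed <- function(text, vocab) {
    as.numeric(table(factor(tokenize(text), levels = vocab)))
}
cosine <- function(a, b) {
    denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
    if (denom == 0) 0 else sum(a * b) / denom
}

# Return the k chunks most similar to the query.
retrieve <- function(query, chunks, k = 1) {
    vocab <- unique(unlist(lapply(c(chunks, query), tokenize)))
    q <- embed(query, vocab)
    scores <- vapply(chunks, function(ch) cosine(q, embed(ch, vocab)), numeric(1))
    chunks[order(scores, decreasing = TRUE)[seq_len(k)]]
}

chunks <- row_chunks(mtcars)
context <- retrieve("MPG of Datsun 710", chunks)
context
```

The retrieved chunk is what would be placed in the prompt alongside the question, so that the model can answer from the row's contents rather than from memory.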