This vignette demonstrates how to use the wizrd agent interface with the BioMart MCP server to perform natural language queries against Ensembl and other BioMart resources. Unlike traditional biomaRt workflows, you do not need to know the exact dataset, attribute, or filter names—just ask your question in plain English and get a data.frame result.
The BioMart MCP server uses the pybiomart Python package, which relies on an older XML-based approach to BioMart. This limits its functionality compared to the biomaRt R package, which uses direct API access. Some queries that work with biomaRt may not be possible with the current MCP server.
We begin by locating the BioMart MCP server script bundled with the wizrd package and launching it, along with its Python dependencies, using uvx via start_mcp().
library(wizrd)
biomart_mcp_path <- system.file("mcp", "biomart-mcp.py", package = "wizrd")
server <- start_mcp(biomart_mcp_path)
Connect to the BioMart MCP server:
session <- connect_mcp(server)
Alternatively, you can connect to a hosted BioMart MCP server using:
session <- connect_mcp("https://server.smithery.ai/@jzinno/biomart-mcp/mcp")
This option requires a free account on smithery.ai and uses interactive authorization, so this vignette will use the local server.
We create an agent, equip it with the BioMart MCP tools, and instruct it to answer user queries using BioMart. All outputs are returned as data.frames for easy downstream analysis.
biomart_tools <- tools(session)
agent <- openai_agent("o4-mini") |>
    equip(biomart_tools) |>
    instruct(
        "You are an expert in Ensembl gene annotation.",
        "Answer user queries using BioMart,",
        "using only strings for filter values, with lists comma-separated.",
        "Return results as a data.frame."
    ) |>
    output_as(S7::class_data.frame)
In biomaRt, you must specify the dataset, attributes, and filters. With wizrd, just ask:
suppressPackageStartupMessages(library(GenomicRanges))
coords <- agent |>
    output_as(as.data.frame(GRanges())) |>
    predict(
        "What are the chromosomal coordinates for TP53 and BRCA2?"
    )
coords |> makeGRangesFromDataFrame(keep.extra.columns = TRUE)
#> GRanges object with 2 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] 17 7661779-7687546 -
#> [2] 13 32315086-32400268 +
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
The above relies on the GenomicRanges package to return a useful R data structure for further computation; every query can also return a simple data.frame, as demonstrated in later examples.
The biomaRt equivalent requires the user to learn the biomaRt interface, look up the relevant dataset, attribute, and filter identifiers, and do extra work to convert the output to a GRanges.
library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
getBM(
    attributes = c(
        "external_gene_name", "chromosome_name", "start_position",
        "end_position", "strand"
    ),
    filters = "external_gene_name",
    values = c("TP53", "BRCA2"),
    mart = mart
) |>
    transform(strand = ifelse(strand > 0, "+", "-")) |>
    makeGRangesFromDataFrame(
        seqnames.field = "chromosome_name",
        start.field = "start_position",
        end.field = "end_position",
        keep.extra.columns = TRUE
    )
Retrieve all EntrezGene IDs and gene symbols for genes with MAP kinase activity (GO:0004707):
go_genes <- agent |>
    predict(
        "List all EntrezGene IDs and gene symbols with MAP kinase activity"
    )
go_genes
#> entrezgene_id hgnc_symbol
#> 1 5596 MAPK4
#> 2 5594 MAPK1
#> 3 6300 MAPK12
#> 4 5600 MAPK11
#> 5 5595 MAPK3
#> 6 225689 MAPK15
#> 7 5598 MAPK7
#> 8 5602 MAPK10
#> 9 5603 MAPK13
#> 10 5609 MAP2K7
#> 11 5601 MAPK9
#> 12 5599 MAPK8
#> 13 5597 MAPK6
#> 14 6885 MAP3K7
#> 15 51701 NLK
#> 16 1432 MAPK14
Retrieve all HUGO gene symbols of genes located on chromosome 20 and associated with a specific GO term:
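As a sketch, this query can again be phrased directly to the agent; the GO term GO:0016055 (Wnt signaling pathway) and the object name chr20_genes are illustrative choices, not part of a worked example:
## GO:0016055 (Wnt signaling pathway) is an illustrative choice of GO term
chr20_genes <- agent |>
    predict(
        "List the HGNC symbols of genes on chromosome 20 annotated with GO term GO:0016055"
    )
chr20_genes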
With wizrd and the BioMart MCP server, you can perform complex biological queries using natural language, without needing to know the technical details of BioMart datasets, attributes, or filters. All results are returned as data.frames, making downstream analysis in R seamless and efficient.