This vignette demonstrates how to use the wizrd agent interface with the BioMart MCP server to perform natural language queries against Ensembl and other BioMart resources. Unlike traditional biomaRt workflows, you do not need to know the exact dataset, attribute, or filter names—just ask your question in plain English and get a data.frame result.
The BioMart MCP server uses the pybiomart Python package, which relies on an older XML-based approach to BioMart. This limits its functionality compared to the biomaRt R package, which uses direct API access. Some queries that work with biomaRt may not be possible with the current MCP server.
We begin by locating the BioMart MCP server script bundled with the wizrd package and launching it, along with its Python dependencies, using uvx via start_mcp().
library(wizrd)
biomart_mcp_path <- system.file("mcp", "biomart-mcp.py", package = "wizrd")
server <- start_mcp(biomart_mcp_path)
Connect to the BioMart MCP server:
session <- connect_mcp(server)
Alternatively, you can connect to a hosted BioMart MCP server using:
session <- connect_mcp("https://server.smithery.ai/@jzinno/biomart-mcp/mcp")
This option requires a free account on smithery.ai and uses interactive authorization, so this vignette will use the local server.
We create an agent, equip it with the BioMart MCP tools, and instruct it to answer user queries using BioMart. All outputs are returned as data.frames for easy downstream analysis.
biomart_tools <- tools(session)
agent <- openai_agent("o4-mini") |>
    equip(biomart_tools) |>
    instruct(
        "You are an expert in Ensembl gene annotation.",
        "Answer user queries using BioMart,",
        "using only strings for filter values, with lists comma-separated.",
        "Return results as a data.frame."
    ) |>
    output_as(S7::class_data.frame)
In biomaRt, you must specify the dataset, attributes, and filters. With wizrd, just ask:
suppressPackageStartupMessages(library(GenomicRanges))
coords <- agent |>
    output_as(as.data.frame(GRanges())) |>
    predict(
        "What are the chromosomal coordinates for TP53 and BRCA2?"
    )
coords |> makeGRangesFromDataFrame(keep.extra.columns = TRUE)
#> GRanges object with 2 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] 17 7661779-7687546 -
#> [2] 13 32315086-32400268 +
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
The above relies on the GenomicRanges package to return a useful R data structure for further computation; every query can also return a simple data.frame, as demonstrated in later examples.
The biomaRt equivalent requires the user to learn the biomaRt interface, look up the relevant dataset, attribute, and filter identifiers, and do extra work to convert the output to a GRanges.
library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
getBM(
    attributes = c(
        "external_gene_name", "chromosome_name", "start_position",
        "end_position", "strand"
    ),
    filters = "external_gene_name",
    values = c("TP53", "BRCA2"),
    mart = mart
) |>
    transform(strand = ifelse(strand > 0, "+", "-")) |>
    makeGRangesFromDataFrame(
        seqnames.field = "chromosome_name",
        start.field = "start_position",
        end.field = "end_position",
        keep.extra.columns = TRUE
    )
Retrieve all EntrezGene IDs and gene symbols for genes with MAP kinase activity (GO:0004707):
go_genes <- agent |>
    predict(
        "List all EntrezGene IDs and gene symbols with MAP kinase activity"
    )
go_genes
#> entrezgene_id hgnc_symbol
#> 1 5596 MAPK4
#> 2 5594 MAPK1
#> 3 6300 MAPK12
#> 4 5600 MAPK11
#> 5 5595 MAPK3
#> 6 225689 MAPK15
#> 7 5598 MAPK7
#> 8 5602 MAPK10
#> 9 5603 MAPK13
#> 10 5609 MAP2K7
#> 11 5601 MAPK9
#> 12 5599 MAPK8
#> 13 5597 MAPK6
#> 14 6885 MAP3K7
#> 15 51701 NLK
#> 16 1432 MAPK14
Retrieve all HUGO gene symbols of genes located on chromosome 20 and associated with a specific GO term:
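As a sketch, this query can again be phrased directly to the agent; the GO term GO:0016055 (Wnt signaling pathway) and the object name chr20_genes are illustrative choices, not part of a worked example:
## GO:0016055 (Wnt signaling pathway) is an illustrative choice of GO term
chr20_genes <- agent |>
    predict(
        "List the HGNC symbols of genes on chromosome 20 annotated with GO term GO:0016055"
    )
chr20_genes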
With wizrd and the BioMart MCP server, you can perform complex biological queries using natural language, without needing to know the technical details of BioMart datasets, attributes, or filters. All results are returned as data.frames, making downstream analysis in R seamless and efficient.