chunk.Rd
Chunks text for the purpose of generating embeddings and query approaches like RAG. Supported formats include: R character vectors, plain text files, Rd files, markdown files (including Rmd and Quarto), HTML files and PDF files.
chunk(x, by, ...)
default_chunking(x, ...)
A container or source of text to be chunked
A Chunking object representing the chunking strategy and its
parameters, or a named list of such objects. By default, the return
value of default_chunking(x)
is used. If NULL
, no
chunking is performed (the text is returned as a single chunk, if
possible).
Arguments passed to methods
chunk
returns a data.frame with a “text” column
containing the text chunks, along with a variable set of metadata
columns, depending on the type of text and chunking strategy.
default_chunking
returns the default Chunking object to use for
the given object. For a character vector or list (assumedly of
filenames), it returns a list of default file extension to Chunking
mappings (this registry will be exposed to extensions in the future).
Here are the different types of input and how they are handled by default:
A vector of strings is chunked directly based on by
. By
default, by
is a list, in which case the names of the
vector are used to look up the appropriate Chunking object based
on its name in the list. If there are no names, sentence-aligned
word chunking is used.
If a file, reads it as a single string and chunks it according to
by
. If by
is a list, the element whose name matches
the file extension is used (see below for the built-in
handlers). If there is no match, sentence-aligned chunking is
used.
RMarkdown and Quarto files are first broken into
sections. Optionally, each section can be further chunked based on
the @section_chunking
property. Typically these files are
identified by their file extension as described above. It is also
possible to pass an object from the parsermd package.
HTML files are converted to markdown using pandoc and then chunked as described above.
Blocks of text are extracted using pdftools and
potentially further chunked according @section_chunking
.
vignette()
The return value of vignette
, a packageIQR object,
is a convenient way to chunk all of the vignettes in a package.
Rd_db()
The return value of Rd_db
, an Rd object, is a
convenient way to chunk all of the man pages of a package.
The common default sentence-aligned word chunking breaks a body of
text into chunks with boundaries aligned to the starts/ends of
sentences. It uses a default word limit (@token_limit
) of 512
and maximum word overlap (max_overlap
) of 64.
For more fine-grained control, the easiest approach currently is to
call default_chunking
on the input, take the desired Chunking
object, modify its properties, and pass it via the by
argument. In the future, more of the API will be exposed.
text_store
for storing and querying the chunks
The text chunking capabilities wizrd are currently experimental and quite primitive. In particular, the token counting is just the word count, which will not be consistent with LLM tokenization; we hope to address this in the future. It mostly depends on whether anyone else finds text chunking useful.