pub.layers.segmentation

A segmentation record that binds one or more tokenizations to an expression. Each tokenization can cover the whole expression or a specific sub-expression (e.g., a sentence). Multiple segmentations can coexist for the same expression, enabling alternative tokenization strategies.

Structural hierarchy (sections, sentences, paragraphs, turns) is expressed via expression records with parentRef and appropriate kind values. The segmentation record provides the token-level decomposition only.

Types

segmentation

NSID: pub.layers.segmentation.segmentation
Type: Record

A segmentation of an expression into tokenizations.

| Field | Type | Description |
| --- | --- | --- |
| expression | at-uri | Reference to the expression this segmentation applies to. |
| tokenizations | array | The tokenizations in this segmentation. Each can optionally scope to a sub-expression via expressionRef. Array of ref: pub.layers.segmentation.defs#tokenization |
| metadata | ref | Ref: pub.layers.defs#annotationMetadata |
| knowledgeRefs | array | Knowledge graph references (e.g., tokenizer algorithm, sentence splitting model). Array of ref: pub.layers.defs#knowledgeRef |
| features | ref | Open-ended features (e.g., tokenizer version, parameters, language model used). Ref: pub.layers.defs#featureMap |
| createdAt | datetime | Record creation timestamp. |
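
As a rough illustration of the record shape in the table above, the following sketch assembles a segmentation record as a plain Python dict. The helper name make_segmentation, the example AT-URI, and the choice to leave out metadata, knowledgeRefs, and features are assumptions made for this example; the lexicon does not say here which fields are required. The $type key follows the general AT Protocol convention for records.

```python
from datetime import datetime, timezone


def make_segmentation(expression_uri, tokenizations, created_at=None):
    """Build a dict shaped like a pub.layers.segmentation.segmentation record.

    Hypothetical helper: only the fields named in the table are used, and
    the optional-looking ones (metadata, knowledgeRefs, features) are omitted.
    """
    if not tokenizations:
        # The description says a segmentation binds one or more tokenizations.
        raise ValueError("a segmentation needs at least one tokenization")
    when = created_at or datetime.now(timezone.utc)
    # Render the timestamp in UTC with millisecond precision and a "Z" suffix.
    stamp = when.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "$type": "pub.layers.segmentation.segmentation",
        "expression": expression_uri,
        "tokenizations": list(tokenizations),
        "createdAt": stamp.replace("+00:00", "Z"),
    }


# Example AT-URI; the DID and collection name are placeholders.
record = make_segmentation(
    "at://did:plc:example/pub.layers.expression.expression/abc",
    [{"kind": "whitespace", "tokens": []}],
    created_at=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
```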

tokenization

NSID: pub.layers.segmentation.defs#tokenization
Type: Object

An ordered sequence of tokens for an expression or sub-expression. Multiple tokenizations can coexist for the same expression (e.g., whitespace vs. BPE vs. morphological), enabling interlinear glossing, alternative segmentation strategies, or multi-granularity analysis. Use pub.layers.alignment.alignment to map between tokenizations.

| Field | Type | Description |
| --- | --- | --- |
| uuid | ref | Ref: pub.layers.defs#uuid |
| kindUri | at-uri | AT-URI of the tokenization kind definition node. Community-expandable via knowledge graph. |
| kind | string | Tokenization kind slug (fallback when kindUri is unavailable). Known values: whitespace, penn-treebank, bpe, sentencepiece, character, morphological, custom |
| expressionRef | at-uri | Reference to the specific sub-expression this tokenization covers (e.g., a sentence-level expression). If absent, the tokenization covers the entire expression referenced by the segmentation record. |
| tokens | array | The ordered token sequence. Array of ref: pub.layers.segmentation.defs#token |
| metadata | ref | Ref: pub.layers.defs#annotationMetadata |
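
The scoping rule for expressionRef can be sketched in a few lines: a tokenization covers its own expressionRef when one is present, and otherwise the expression named by the enclosing segmentation record. The function names below are hypothetical, and records are assumed to be plain dicts keyed by the field names in the tables.

```python
def covered_expression(segmentation, tokenization):
    """Return the AT-URI of the expression a tokenization covers.

    Falls back to the segmentation's own expression when the
    tokenization carries no expressionRef.
    """
    return tokenization.get("expressionRef") or segmentation["expression"]


def tokenizations_by_scope(segmentation):
    """Group a record's tokenizations by the expression each one covers."""
    scopes = {}
    for tokenization in segmentation["tokenizations"]:
        scope = covered_expression(segmentation, tokenization)
        scopes.setdefault(scope, []).append(tokenization)
    return scopes
```

Grouping by scope makes it straightforward to find, for example, every alternative tokenization (whitespace, bpe, morphological) of one sentence-level expression.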

token

NSID: pub.layers.segmentation.defs#token
Type: Object

A single token within a tokenization.

| Field | Type | Description |
| --- | --- | --- |
| tokenIndex | integer | Position of this token in the tokenization (0-based). |
| text | string | The surface form of the token. |
| textSpan | ref | UTF-8 byte offsets into the expression text. Ref: pub.layers.defs#span |
| temporalSpan | ref | Temporal span for audio/video-grounded tokens. Ref: pub.layers.defs#temporalSpan |
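
Because textSpan uses UTF-8 byte offsets rather than character indices, offsets diverge from Python string positions as soon as the text contains non-ASCII characters. The sketch below shows one way a whitespace tokenizer might fill in tokenIndex, text, and textSpan. The span keys byteStart and byteEnd, and the end-exclusive convention, are assumptions for illustration; the actual shape is defined by pub.layers.defs#span, which is not reproduced in this section.

```python
import re


def whitespace_tokens(text):
    """Split text on whitespace and attach UTF-8 byte offsets to each token.

    The span keys "byteStart" and "byteEnd" (end-exclusive) are assumed
    names, not taken from the pub.layers.defs#span definition.
    """
    tokens = []
    for index, match in enumerate(re.finditer(r"\S+", text)):
        # Encode the prefix to convert a character index into a byte offset.
        start = len(text[: match.start()].encode("utf-8"))
        end = start + len(match.group().encode("utf-8"))
        tokens.append(
            {
                "tokenIndex": index,  # 0-based position in the tokenization
                "text": match.group(),
                "textSpan": {"byteStart": start, "byteEnd": end},
            }
        )
    return tokens


tokens = whitespace_tokens("café au lait")
```

In this example "é" occupies two bytes, so the second token starts at byte 6 even though it starts at character index 5.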

XRPC Queries

getSegmentation

NSID: pub.layers.segmentation.getSegmentation

Retrieve a single segmentation record by AT-URI.

| Parameter | Type | Description |
| --- | --- | --- |
| uri | at-uri (required) | The AT-URI of the segmentation record. |

Output: The segmentation record object.
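
XRPC queries in the AT Protocol are HTTP GET requests to /xrpc/ followed by the method NSID, with parameters in the query string. A minimal sketch of building the request URL for getSegmentation follows; the service host and the record URI are placeholders, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

GET_SEGMENTATION_NSID = "pub.layers.segmentation.getSegmentation"


def get_segmentation_url(service, uri):
    """Build the GET URL for a getSegmentation query.

    `service` is the base URL of whichever host serves the lexicon
    (a placeholder here); `uri` is the AT-URI of the segmentation record.
    """
    query = urlencode({"uri": uri})  # percent-encodes ":" and "/" in the AT-URI
    return f"{service.rstrip('/')}/xrpc/{GET_SEGMENTATION_NSID}?{query}"


url = get_segmentation_url(
    "https://layers.example",  # hypothetical host
    "at://did:plc:example/pub.layers.segmentation.segmentation/3k",
)
```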

listSegmentations

NSID: pub.layers.segmentation.listSegmentations

List segmentation records for a given expression.

| Parameter | Type | Description |
| --- | --- | --- |
| expression | at-uri (required) | The expression to list segmentations for. |
| limit | integer | Maximum number of records to return (1-100, default 50). |
| cursor | string | Pagination cursor from the previous response. |

Output: { records: segmentation[], cursor?: string }
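
Cursor pagination for listSegmentations could be driven by a loop like the one below: request a page, collect its records, and pass the returned cursor back until the response no longer includes one. This is a sketch under assumptions: fetch_page is a hypothetical callable that performs the actual XRPC request and returns the decoded output object, which keeps the loop independent of any particular HTTP client.

```python
def list_all_segmentations(fetch_page, expression, limit=50):
    """Collect every segmentation record for an expression across pages.

    fetch_page(params) is a caller-supplied function (hypothetical) that
    sends the listSegmentations query and returns a dict with "records"
    and, when more pages remain, "cursor".
    """
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    records = []
    cursor = None
    while True:
        params = {"expression": expression, "limit": limit}
        if cursor:
            params["cursor"] = cursor  # continue from the previous response
        page = fetch_page(params)
        records.extend(page["records"])
        cursor = page.get("cursor")
        if not cursor:
            # No cursor in the output means there are no further pages.
            return records
```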