
Multimodal Annotation

Layers supports annotation across text, audio, video, images, and paged documents through a single schema. The key mechanism is the polymorphic anchor type: the same annotation record works across modalities by switching the anchor kind. This guide explains how anchoring, expressions, and media records work together for multimodal annotation.

The Polymorphic Anchor

Every annotation attaches to source data through an anchor. The anchor's kind field determines the modality:

Anchor Kind            Modality           Value
textSpan               Text               {start, end} byte/character offsets
tokenRef               Text               Single token identifier
tokenRefSequence       Text               Ordered sequence of token references
temporalSpan           Audio/Video        {start, end} time in milliseconds
spatioTemporalAnchor   Video              Keyframe-based bounding boxes over time
pageAnchor             Paged documents    {page, x, y, width, height}
externalTarget         Web/External       URL or resource identifier

The same pub.layers.annotation#annotation record type is used regardless of modality. A POS tag on a text token and a label on a video region differ only in their anchor kind.
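A consumer can therefore handle every modality with one dispatch on the anchor kind. The sketch below is illustrative, not part of the lexicon; the record shapes mirror the table above:

```python
# Hypothetical consumer-side dispatch on the polymorphic anchor.
# Only the anchor's "kind" changes between modalities.

def describe_anchor(annotation: dict) -> str:
    """Return a human-readable summary based on the anchor kind."""
    anchor = annotation["anchor"]
    kind = anchor["kind"]
    if kind == "textSpan":
        span = anchor["textSpan"]
        return f"text offsets {span['start']}-{span['end']}"
    if kind == "temporalSpan":
        span = anchor["temporalSpan"]
        return f"time {span['start']}ms-{span['end']}ms"
    if kind == "pageAnchor":
        pa = anchor["pageAnchor"]
        return f"page {pa['page']} at ({pa['x']}, {pa['y']})"
    return f"unhandled anchor kind: {kind}"

pos_tag = {"anchor": {"kind": "textSpan", "textSpan": {"start": 0, "end": 5}}}
video_label = {"anchor": {"kind": "temporalSpan", "temporalSpan": {"start": 0, "end": 3200}}}
print(describe_anchor(pos_tag))      # text offsets 0-5
print(describe_anchor(video_label))  # time 0ms-3200ms
```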

Expressions Across Modalities

Expressions are recursive containers for linguistic units. The kind field indicates the modality:

Text: document → section → paragraph → sentence → word → morpheme

Audio: recording → turn → utterance → word

Video: video → turn → utterance → word

Multimodal: multimodal → any combination of text, audio, and video sub-expressions

Each expression can reference its parent via parentRef and specify how it attaches to the parent through anchor. A word expression within a recording uses a temporalSpan anchor to indicate its time range.

Media Records

Media records (pub.layers.media) store technical metadata about source files. An expression references its media via the mediaRef field.

Expression (kind="recording", text="Hello world")
├── mediaRef → Media (kind="audio", sampleRate=16000, codec="flac")
├── Word (text="Hello", anchor={temporalSpan: {start: 0, end: 500}})
└── Word (text="world", anchor={temporalSpan: {start: 520, end: 1100}})

Media records carry modality-specific metadata through composable info objects:

  • audioInfo: sample rate, channels, bit depth, codec, speaker count
  • videoInfo: resolution, frame rate, codec, aspect ratio, color space
  • documentInfo: DPI, page count, script system, writing direction, OCR engine

A video media record can carry both videoInfo and audioInfo since video files typically contain an audio track.

Annotating Text

Text annotation uses textSpan or tokenRef anchors. Character offsets reference the expression's text field.

{
  "kind": "span",
  "subkind": "ner",
  "annotations": [
    {
      "anchor": {
        "kind": "textSpan",
        "textSpan": { "start": 0, "end": 5 }
      },
      "label": "PERSON",
      "text": "Alice"
    }
  ]
}
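Resolving such an anchor is a slice of the expression's text field. A minimal sketch, assuming end-exclusive offsets:

```python
expression_text = "Alice went home."

annotation = {
    "anchor": {"kind": "textSpan", "textSpan": {"start": 0, "end": 5}},
    "label": "PERSON",
}

# Slice the expression text with the anchor's offsets
# (end-exclusive is assumed here).
span = annotation["anchor"]["textSpan"]
surface = expression_text[span["start"]:span["end"]]
print(surface)  # Alice
```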

For token-aligned annotations, use tokenIndex referencing a segmentation record:

{
  "kind": "token-tag",
  "subkind": "pos",
  "annotations": [
    { "tokenIndex": 0, "label": "NNP" },
    { "tokenIndex": 1, "label": "VBD" }
  ]
}
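Each tokenIndex is resolved against the token list in the referenced segmentation record. A sketch, with an illustrative segmentation shape (the real record fields may differ):

```python
# Hypothetical segmentation record: token offsets into the expression text.
segmentation = {"tokens": [{"start": 0, "end": 5}, {"start": 6, "end": 10}]}
expression_text = "Alice went"

pos_layer = {
    "kind": "token-tag",
    "subkind": "pos",
    "annotations": [
        {"tokenIndex": 0, "label": "NNP"},
        {"tokenIndex": 1, "label": "VBD"},
    ],
}

# Pair each tag with the surface form of the token it indexes.
tagged = []
for ann in pos_layer["annotations"]:
    tok = segmentation["tokens"][ann["tokenIndex"]]
    tagged.append((expression_text[tok["start"]:tok["end"]], ann["label"]))
print(tagged)  # [('Alice', 'NNP'), ('went', 'VBD')]
```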

Annotating Audio

Audio annotation uses temporalSpan anchors with millisecond offsets. The expression's mediaRef points to an audio media record.

{
  "kind": "tier",
  "subkind": "speaker",
  "annotations": [
    {
      "anchor": {
        "kind": "temporalSpan",
        "temporalSpan": { "start": 0, "end": 3200 }
      },
      "label": "SPK01",
      "text": "I went to the store yesterday"
    }
  ]
}

Multiple annotation layers (speaker turns, transcription, POS tags, prosody) can all reference the same temporal spans, building up layers of analysis.
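Because all layers share the same millisecond time axis, cross-layer queries reduce to interval arithmetic. For example, finding which word-level spans fall inside a speaker turn (a sketch with made-up spans):

```python
def overlaps(a: dict, b: dict) -> bool:
    """True if two temporalSpan values (millisecond intervals) overlap."""
    return a["start"] < b["end"] and b["start"] < a["end"]

turn = {"start": 0, "end": 3200}
word_spans = [
    {"start": 0, "end": 400},
    {"start": 450, "end": 900},
    {"start": 3300, "end": 3600},  # belongs to the next turn
]

inside = [w for w in word_spans if overlaps(turn, w)]
print(len(inside))  # 2
```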

For forced alignment between audio and text, use pub.layers.alignment with kind="audio-to-text".

Annotating Video

Video annotation combines temporal and spatial dimensions. Two anchor types apply:

Temporal only (temporalSpan): For annotations that span a time range without spatial specificity, such as scene labels, speaker turns, and temporal events.

Spatiotemporal (spatioTemporalAnchor): For tracking objects through video frames. Defined by keyframes, each with a timestamp and bounding box:

{
  "kind": "span",
  "subkind": "entity-mention",
  "annotations": [
    {
      "anchor": {
        "kind": "spatioTemporalAnchor",
        "spatioTemporalAnchor": {
          "keyframes": [
            { "timeMs": 0, "bbox": { "x": 100, "y": 50, "width": 200, "height": 300 } },
            { "timeMs": 1000, "bbox": { "x": 120, "y": 55, "width": 195, "height": 295 } },
            { "timeMs": 2000, "bbox": { "x": 150, "y": 60, "width": 190, "height": 290 } }
          ],
          "interpolation": "linear"
        }
      },
      "label": "person",
      "text": "Speaker A"
    }
  ]
}

Frames between keyframes are computed via interpolation (linear, step, or cubic).
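Linear interpolation can be sketched as follows (field names follow the example above; step and cubic are analogous):

```python
def lerp(a: float, b: float, t: float) -> float:
    return a + (b - a) * t

def bbox_at(keyframes: list, time_ms: int) -> dict:
    """Linearly interpolate the bounding box at time_ms.
    Assumes keyframes are sorted by timeMs and time_ms is in range."""
    for prev, nxt in zip(keyframes, keyframes[1:]):
        if prev["timeMs"] <= time_ms <= nxt["timeMs"]:
            t = (time_ms - prev["timeMs"]) / (nxt["timeMs"] - prev["timeMs"])
            return {k: lerp(prev["bbox"][k], nxt["bbox"][k], t)
                    for k in ("x", "y", "width", "height")}
    raise ValueError("time_ms outside keyframe range")

keyframes = [
    {"timeMs": 0, "bbox": {"x": 100, "y": 50, "width": 200, "height": 300}},
    {"timeMs": 1000, "bbox": {"x": 120, "y": 55, "width": 195, "height": 295}},
]
print(bbox_at(keyframes, 500))  # halfway: x=110.0, y=52.5, width=197.5, height=297.5
```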

For semantic spatial annotation (e.g., "this scene takes place in Tokyo"), use the spatial field on annotations with a spatialExpression. See the Spatial Representation guide for details.

Annotating Images

Image annotation uses bounding boxes in pixel coordinates via the spatial field:

{
  "kind": "span",
  "subkind": "entity-mention",
  "annotations": [
    {
      "anchor": {
        "kind": "textSpan",
        "textSpan": { "start": 0, "end": 0 }
      },
      "label": "cat",
      "spatial": {
        "type": "region",
        "value": {
          "bbox": { "x": 50, "y": 30, "width": 200, "height": 150 },
          "crs": "pixel"
        }
      }
    }
  ]
}

For non-rectangular regions, use spatialEntity.geometry with polygon coordinates. Layers supports WKT, GeoJSON, SVG path, and COCO polygon formats via the geometryFormat field. See the Spatial Representation guide for format details.
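As an illustration of the relationship between these formats, a rectangular bbox can itself be written as a COCO-style polygon, which is a flat [x1, y1, x2, y2, ...] list of corner coordinates (a minimal conversion sketch):

```python
def bbox_to_coco_polygon(bbox: dict) -> list:
    """Corners of the bbox, clockwise from top-left, as a flat
    COCO-style coordinate list [x1, y1, x2, y2, ...]."""
    x, y, w, h = bbox["x"], bbox["y"], bbox["width"], bbox["height"]
    return [x, y, x + w, y, x + w, y + h, x, y + h]

print(bbox_to_coco_polygon({"x": 50, "y": 30, "width": 200, "height": 150}))
# [50, 30, 250, 30, 250, 180, 50, 180]
```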

Annotating Paged Documents

Paged documents (PDFs, scanned manuscripts) use pageAnchor:

{
  "kind": "span",
  "subkind": "ner",
  "annotations": [
    {
      "anchor": {
        "kind": "pageAnchor",
        "pageAnchor": {
          "page": 3,
          "x": 100,
          "y": 200,
          "width": 150,
          "height": 20
        }
      },
      "label": "PERSON",
      "text": "Marie Curie"
    }
  ]
}

The media record for a paged document carries documentInfo with DPI, page count, script system, and OCR engine metadata.
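When the same page exists at several scan resolutions, the DPI makes anchors comparable across them. A sketch, assuming the pageAnchor coordinates are scan pixels (the lexicon may define this differently):

```python
def page_anchor_to_inches(anchor: dict, dpi: int) -> dict:
    """Convert pageAnchor pixel coordinates to inches using the scan DPI.
    The page number is kept as-is."""
    return {k: anchor[k] / dpi if k != "page" else anchor[k]
            for k in ("page", "x", "y", "width", "height")}

anchor = {"page": 3, "x": 100, "y": 200, "width": 150, "height": 20}
physical = page_anchor_to_inches(anchor, dpi=300)
print(physical["width"])  # 0.5 (inches)
```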

Annotating Web Content

Web content uses externalTarget anchors combined with W3C selectors:

{
  "anchor": {
    "kind": "externalTarget",
    "sourceUri": "at://did:plc:.../pub.layers.expression/...",
    "selector": {
      "type": "TextQuoteSelector",
      "exact": "linguistic annotation",
      "prefix": "the field of ",
      "suffix": " has grown"
    }
  }
}

Layers supports three W3C selector types: TextQuoteSelector, TextPositionSelector, and FragmentSelector. These enable compatibility with W3C Web Annotation clients.
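Resolving a TextQuoteSelector amounts to finding the exact string in the target document, with prefix and suffix disambiguating repeated matches. A minimal sketch:

```python
def resolve_text_quote(doc: str, selector: dict):
    """Return (start, end) offsets of the quoted text, or None if absent.
    Prefix/suffix narrow the match when "exact" occurs more than once."""
    exact = selector["exact"]
    prefix = selector.get("prefix", "")
    suffix = selector.get("suffix", "")
    i = doc.find(prefix + exact + suffix)
    if i == -1:
        return None
    start = i + len(prefix)
    return (start, start + len(exact))

doc = "the field of linguistic annotation has grown rapidly"
selector = {
    "type": "TextQuoteSelector",
    "exact": "linguistic annotation",
    "prefix": "the field of ",
    "suffix": " has grown",
}
print(resolve_text_quote(doc, selector))  # (13, 34)
```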

Combining Modalities

A multimodal expression can nest sub-expressions of different modalities:

Expression (kind="multimodal")
├── Expression (kind="video", mediaRef → video.mp4)
│ └── annotations on temporalSpan and spatioTemporalAnchor
├── Expression (kind="transcript", text="...")
│ └── annotations on textSpan and tokenRef
└── Alignment (kind="audio-to-text")
└── links video temporal spans to transcript tokens

The pub.layers.alignment lexicon connects annotations across modalities. An audio-to-text alignment links temporal spans in the audio/video to token ranges in the transcript.
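Consumed this way, an alignment acts as a lookup from media time to transcript tokens. A sketch with an illustrative record shape (the real lexicon fields may differ):

```python
# Illustrative audio-to-text alignment: each pair links a temporal span
# (milliseconds in the media) to a token index range in the transcript.
alignment = {
    "kind": "audio-to-text",
    "pairs": [
        {"temporalSpan": {"start": 0, "end": 500}, "tokenRange": {"start": 0, "end": 1}},
        {"temporalSpan": {"start": 520, "end": 1100}, "tokenRange": {"start": 1, "end": 2}},
    ],
}
tokens = ["Hello", "world"]

def tokens_at(time_ms: int) -> list:
    """Transcript tokens whose aligned span covers time_ms."""
    out = []
    for pair in alignment["pairs"]:
        span = pair["temporalSpan"]
        if span["start"] <= time_ms < span["end"]:
            r = pair["tokenRange"]
            out.extend(tokens[r["start"]:r["end"]])
    return out

print(tokens_at(600))  # ['world']
```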

Semantic Time and Space

Beyond anchoring (where in the media), annotations can carry semantic temporal and spatial information (what time or place the content refers to).

These are independent of the anchor. An annotation anchored at 3:45 in a recording (media time) might carry a temporal expression referring to "next Tuesday" (semantic time).

See Also