pub.layers.media

Media source records for audio, video, image, and document data associated with expressions. Modality-specific metadata is factored into composable object types (audioInfo, videoInfo, documentInfo) so that multimodal media can carry all relevant technical metadata. Domain-specific metadata (recording conditions, speaker demographics, consent, quality assessment) is handled through the open featureMap with documented key conventions.

Types

audioInfo

Type: Object

Composable audio metadata. Attach to any media record representing audio content — standalone audio files, audio tracks in video, etc.

| Field | Type | Description |
| --- | --- | --- |
| sampleRate | integer | Audio sample rate in Hz (e.g., 8000, 16000, 22050, 44100, 48000). |
| channels | integer | Number of audio channels. |
| bitDepth | integer | Audio bit depth (e.g., 16, 24, 32). |
| codec | string | Audio codec identifier (e.g., 'pcm_s16le', 'aac', 'opus', 'flac'). |
| bitRate | integer | Audio bitrate in bits per second. |
| bitRateMode | string | Bitrate mode. Known values: cbr (constant), vbr (variable). |
| numberOfSamples | integer | Total number of audio samples. Enables sample-accurate alignment (Praat, ELAN, forced alignment tools). |
| speakerCount | integer | Number of distinct speakers (for spoken language data). |
| transcriptRef | at-uri | AT-URI of a pub.layers.expression containing the transcript. |
| segmentationRef | at-uri | AT-URI of a pub.layers.segmentation record structuring the transcript. |
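
As a sketch of why numberOfSamples enables sample-accurate alignment, the helpers below (hypothetical, not part of any Layers library) convert between milliseconds and sample indices using integer arithmetic only:

```python
# Illustrative helpers over an audioInfo-shaped dict. The field names
# follow the audioInfo object above; the example values are invented.

def duration_ms(audio_info: dict) -> int:
    """Exact duration in ms from sample count and rate (floor division)."""
    return audio_info["numberOfSamples"] * 1000 // audio_info["sampleRate"]

def sample_at_ms(audio_info: dict, ms: int) -> int:
    """Sample index at a given millisecond offset, for alignment tools."""
    return ms * audio_info["sampleRate"] // 1000

info = {"sampleRate": 16000, "channels": 1, "bitDepth": 16,
        "codec": "pcm_s16le", "numberOfSamples": 160000}
print(duration_ms(info))        # 10 s of 16 kHz audio -> 10000
print(sample_at_ms(info, 250))  # 250 ms -> sample 4000
```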

videoInfo

Type: Object

Composable video metadata. Attach to any media record representing video content.

| Field | Type | Description |
| --- | --- | --- |
| width | integer | Width in pixels. |
| height | integer | Height in pixels. |
| frameRate | integer | Frame rate scaled by 100 (e.g., 2997 = 29.97 fps). Avoids floats. |
| codec | string | Video codec identifier (e.g., 'h264', 'h265', 'vp9', 'av1', 'prores'). |
| aspectRatio | string | Display aspect ratio (e.g., '16:9', '4:3', '1:1'). |
| colorSpace | string | Color space. Known values: rgb, yuv420, yuv422, yuv444, grayscale. |
| bitRate | integer | Video bitrate in bits per second. |
| scanType | string | Scan type. Known values: progressive, interlaced. Affects frame extraction for annotation. |
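
The scaled-by-100 frameRate convention can be sketched as a pair of illustrative helpers (the function names are not part of the schema):

```python
# Round-trip between a float fps value and the scaled integer stored
# in videoInfo.frameRate, per the scaled-by-100 convention above.

def encode_frame_rate(fps: float) -> int:
    """Store fps as an integer scaled by 100 (29.97 -> 2997)."""
    return round(fps * 100)

def decode_frame_rate(scaled: int) -> float:
    """Recover fps from the scaled integer (2997 -> 29.97)."""
    return scaled / 100

print(encode_frame_rate(29.97))  # 2997
print(decode_frame_rate(2500))   # 25.0
```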

documentInfo

Type: Object

Composable document/image metadata. Attach to any media record representing scanned documents, manuscripts, printed text, or other page-based media for OCR/HTR annotation workflows.

| Field | Type | Description |
| --- | --- | --- |
| dpi | integer | Scanning resolution in dots per inch (300+ recommended for OCR). |
| colorMode | string | Scan color mode. Known values: color, grayscale, bitonal. |
| pageCount | integer | Number of pages in the document. |
| scriptSystem | string | Writing system (ISO 15924 codes: 'Latn', 'Arab', 'Deva', 'Hans', 'Hant', 'Cyrl', 'Grek', etc.). |
| writingDirection | string | Primary text direction. Known values: ltr, rtl, ttb, btt. |
| ocrEngine | string | OCR/HTR engine identifier (e.g., 'tesseract-5.3', 'transkribus', 'abbyy', 'google-vision'). |

main

Type: Record

A media source record (audio, video, image, or document) that can be referenced by expressions and annotations. Modality-specific metadata lives in composable audioInfo/videoInfo/documentInfo objects.

| Field | Type | Description |
| --- | --- | --- |
| kindUri | at-uri | AT-URI of the media kind definition node. Community-expandable via knowledge graph. |
| kind | string | Media kind slug (fallback). Known values: audio, video, image, document. |
| title | string | Media title. |
| description | string | Description of the media. |
| blob | blob | The media blob. |
| externalUri | uri | URI for externally hosted media. |
| mimeType | string | MIME type of the media. |
| durationMs | integer | Duration in milliseconds (for audio/video). |
| fileSizeBytes | integer | File size in bytes. |
| parentMediaRef | at-uri | AT-URI of the parent media record this excerpt/clip was extracted from. For provenance tracking of media segments. |
| startOffsetMs | integer | Offset in milliseconds where this excerpt starts within the parent media. Used with parentMediaRef. |
| audio | ref | Audio-specific metadata. Ref: #audioInfo |
| video | ref | Video-specific metadata. Ref: #videoInfo |
| document | ref | Document-specific metadata. Ref: #documentInfo |
| language | string | BCP-47 language tag. |
| knowledgeRefs | array | Knowledge graph references. Array of ref: pub.layers.defs#knowledgeRef |
| metadata | ref | Provenance: who created/uploaded this media record. Ref: pub.layers.defs#annotationMetadata |
| features | ref | Open-ended features (see Feature Key Conventions below). Ref: pub.layers.defs#featureMap |
| createdAt | datetime | Record creation timestamp. |
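
A hypothetical record with these fields might look like the following plain dict. All values here are invented for illustration; the $type discriminator is assumed to follow the usual AT Protocol record convention, and blob handling is out of scope:

```python
# Sketch of a pub.layers.media record for a field recording, shown as
# a plain Python dict rather than through any particular SDK.

record = {
    "$type": "pub.layers.media",   # assumed AT Protocol discriminator
    "kind": "audio",
    "title": "Elicitation session 12",
    "mimeType": "audio/flac",
    "durationMs": 1_845_000,
    "fileSizeBytes": 58_204_113,
    "language": "gsw",             # BCP-47 language tag
    "audio": {                     # composable audioInfo object
        "sampleRate": 48000,
        "channels": 2,
        "bitDepth": 24,
        "codec": "flac",
    },
    "features": {                  # open featureMap; all values are strings
        "recording.environment": "field",
        "speaker.SPK01.role": "narrator",
    },
    "createdAt": "2024-05-01T12:00:00Z",
}
```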

Feature Key Conventions

The features field on media records is a featureMap — an open key-value store for domain-specific metadata that does not warrant dedicated schema fields. All feature values are strings (per the feature type definition); consumers parse typed values based on key semantics. The keys below are conventions, not requirements. Applications should use these keys when applicable to enable cross-corpus interoperability.

Recording & Equipment

| Key | Description |
| --- | --- |
| recording.date | ISO 8601 date of the recording session. |
| recording.location | Place name or address where the recording was made. |
| recording.coordinates | GPS coordinates (latitude, longitude). |
| recording.environment | Recording environment: studio, field, lab, classroom, telephone, broadcast, home, outdoor. |
| recording.microphone | Microphone model (e.g., 'Sennheiser HMD 414', 'DPA 4006'). |
| recording.microphoneType | Microphone type: condenser, dynamic, electret, lavalier, headset, array, contact. |
| recording.microphonePlacement | Microphone placement: close-talk, far-field, head-mounted, lapel, tabletop. |
| recording.equipment | Recording device or interface model. |
| recording.software | Recording software used. |
| recording.noiseLevel | Ambient noise characterization. |
| recording.roomAcoustics | Room acoustics description (RT60, treatment, dimensions). |

Speaker/Participant Metadata

Speaker metadata uses the pattern speaker.{id}.* where {id} is a speaker identifier (e.g., speaker.SPK01.age). For single-speaker recordings, use speaker.0.*.

| Key | Description |
| --- | --- |
| speaker.{id}.age | Age or age range at time of recording. |
| speaker.{id}.gender | Gender of the speaker. |
| speaker.{id}.L1 | Native language (BCP-47 tag). |
| speaker.{id}.L2 | Second language(s), comma-separated BCP-47 tags. |
| speaker.{id}.dialect | Regional dialect or variety. |
| speaker.{id}.education | Education level. |
| speaker.{id}.role | Role in the recording: interviewer, interviewee, narrator, subject, caller, callee, target-child, mother, father, examiner. |
| speaker.{id}.channelAssignment | Which audio channel this speaker is on (e.g., '0', '1', 'left', 'right'). |
| speaker.{id}.voiceCharacteristics | Pitch range, speaking rate, voice quality notes. |
| speaker.{id}.ethnicity | Ethnic or racial background (following corpus conventions). |
| speaker.{id}.birthDate | ISO 8601 date of birth. |
| speaker.{id}.handedness | Dominant hand (for sign language): left, right, ambidextrous. |
| speaker.{id}.hearingStatus | Hearing status (for sign language): deaf, hard-of-hearing, hearing, coda. |
| speaker.{id}.ageOfAcquisition | Age at which sign language was acquired. |
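
The speaker.{id}.* pattern above can be unpacked into per-speaker dicts with a small sketch like the following (the helper is hypothetical):

```python
# Group speaker.{id}.* feature keys by speaker id, leaving all other
# featureMap keys untouched.

from collections import defaultdict

def speakers_from_features(features: dict) -> dict:
    out = defaultdict(dict)
    for key, value in features.items():
        parts = key.split(".", 2)          # 'speaker', id, field
        if len(parts) == 3 and parts[0] == "speaker":
            _, spk_id, field = parts
            out[spk_id][field] = value
    return dict(out)

features = {
    "speaker.SPK01.age": "34",
    "speaker.SPK01.L1": "yue",
    "speaker.SPK02.role": "interviewer",
    "recording.environment": "studio",     # ignored: not a speaker key
}
print(speakers_from_features(features))
# {'SPK01': {'age': '34', 'L1': 'yue'}, 'SPK02': {'role': 'interviewer'}}
```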

Audio Quality Assessment

| Key | Description |
| --- | --- |
| quality.snrDb | Signal-to-noise ratio in decibels (string-encoded integer, e.g., '42'). |
| quality.pesq | PESQ score (string-encoded integer scaled by 100, e.g., '350' = 3.50). |
| quality.polqa | POLQA score (string-encoded integer scaled by 100). |
| quality.stoi | Short-Time Objective Intelligibility (string-encoded integer 0-10000, e.g., '9500' = 0.95). |
| quality.clippingDetected | Whether audio clipping was detected: true or false. |
| quality.silenceRatio | Proportion of recording that is silence (string-encoded integer 0-10000, e.g., '1500' = 15%). |
| quality.rating | Subjective quality rating: poor, fair, good, excellent. |
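
Decoding the scaled scores above might look like this sketch; the scale factors mirror the table, but the helper itself is illustrative:

```python
# Decode string-encoded, integer-scaled quality features back into
# floats. Scale factors follow the Audio Quality Assessment table.

QUALITY_SCALE = {
    "quality.snrDb": 1,            # plain integer dB
    "quality.pesq": 100,           # '350'  -> 3.50
    "quality.polqa": 100,
    "quality.stoi": 10000,         # '9500' -> 0.95
    "quality.silenceRatio": 10000, # '1500' -> 0.15
}

def decode_quality(features: dict) -> dict:
    return {k: int(v) / QUALITY_SCALE[k]
            for k, v in features.items() if k in QUALITY_SCALE}

print(decode_quality({"quality.pesq": "350", "quality.stoi": "9500"}))
# {'quality.pesq': 3.5, 'quality.stoi': 0.95}
```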

Multi-Stream Synchronization

| Key | Description |
| --- | --- |
| sync.timeOriginMs | Time offset in ms for aligning this media to a master clock, cf. ELAN TIME_ORIGIN (string-encoded integer). |
| sync.clockDriftPpm | Clock drift in parts per million relative to master (string-encoded integer). |
| sync.syncMethod | Synchronization method: timecode, clap, genlock, ntp, ptp, software, audio-sync. |
| sync.masterMediaRef | AT-URI of the master media record in a multi-stream setup. |
| sync.precision | Temporal precision of synchronization (e.g., 'under 1ms', 'under 15ms'). |

Consent & Ethics

| Key | Description |
| --- | --- |
| consent.type | Consent type: informed, community, blanket, oral, written. |
| consent.scope | Permitted uses: research, education, public, commercial, archive-only. |
| consent.anonymizationLevel | Anonymization applied: none, pseudonymized, face-blurred, voice-altered, fully-anonymized. |
| consent.restrictions | Free-text access restrictions or conditions. |
| consent.irb | IRB/ethics committee approval identifier. |
| consent.culturalProtocol | Cultural sensitivity notes (CARE principles, indigenous data sovereignty). |
| consent.license | License identifier (e.g., 'CC-BY-4.0', 'CC-BY-NC-SA-4.0'). |

Format Conversion Provenance

| Key | Description |
| --- | --- |
| conversion.sourceFormat | Original file format before conversion. |
| conversion.sourceCodec | Original codec before conversion. |
| conversion.sourceBitRate | Original bitrate before conversion. |
| conversion.tool | Conversion tool used (e.g., 'ffmpeg 6.1', 'sox 14.4'). |
| conversion.date | ISO 8601 date of conversion. |
| conversion.lossless | Whether the conversion was lossless: true or false. |
| conversion.generations | Number of compression generations/re-encodings (string-encoded integer). |

Sign Language Video

| Key | Description |
| --- | --- |
| signing.cameraAngle | Camera angle relative to signer: frontal, side, overhead, three-quarter. |
| signing.cameraCount | Number of cameras in the recording setup (string-encoded integer). |
| signing.cameraPosition | Camera position description (e.g., 'frontal at chest height, 2m distance'). |
| signing.signerPosition | Where the signer is positioned relative to the camera. |
| signing.signingSpace | Approximate dimensions of the captured signing space. |
| signing.backgroundType | Background description: solid-black, blue-screen, green-screen, natural. |
| signing.glossConvention | Glossing convention used (e.g., 'hamburg-notation', 'id-glosses'). |
| signing.interactionType | Interaction type: monologue, dialogue, group, elicitation. |

Fieldwork & Language Documentation

| Key | Description |
| --- | --- |
| fieldwork.elicitationType | Elicitation method: narrative, conversation, wordlist, paradigm, picture-task, retelling, interview. |
| fieldwork.archiveId | Archive identifier (PARADISEC, ELAR, AILLA, etc.). |
| fieldwork.archiveCollection | Collection within the archive. |
| fieldwork.endangermentLevel | Language endangerment: safe, vulnerable, endangered, severely-endangered, critically-endangered. |
| fieldwork.communityName | Speaker community name. |
| fieldwork.genre | Discourse genre: narrative, dialogue, procedural, oratory, singing, formulaic, ludic. |

Clinical Speech

| Key | Description |
| --- | --- |
| clinical.diagnosis | Clinical diagnosis relevant to speech (e.g., 'aphasia', 'dysarthria', 'stuttering', 'ASD'). |
| clinical.severity | Severity level of the condition. |
| clinical.taskType | Clinical task: reading, spontaneous, repetition, picture-naming, sentence-completion, diadochokinesis. |
| clinical.assessmentTool | Standardized assessment used (e.g., 'WAB-R', 'BNT', 'ADOS-2'). |
| clinical.treatmentPhase | Treatment phase: pre-treatment, during-treatment, post-treatment, follow-up. |

Multimodal Sensor References

| Key | Description |
| --- | --- |
| mocap.fileRef | URI or AT-URI of associated motion capture data. |
| mocap.format | Motion capture format: bvh, c3d, fbx, trc. |
| mocap.frameRate | Motion capture sampling rate in Hz (string-encoded integer). |
| mocap.system | Motion capture system name (e.g., 'OptiTrack', 'Vicon', 'Xsens'). |
| eyetracking.fileRef | URI or AT-URI of associated eye-tracking data. |
| eyetracking.sampleRate | Eye-tracking sampling rate in Hz (string-encoded integer). |
| eyetracking.device | Eye-tracking hardware (e.g., 'Tobii Pro Spectrum', 'EyeLink 1000'). |
| depth.sensorType | Depth sensor type: structured-light, time-of-flight, stereo. |
| depth.resolution | Depth stream resolution (e.g., '640x480'). |

Accessibility

| Key | Description |
| --- | --- |
| accessibility.hasCaptions | Whether captions/subtitles are available: true or false. |
| accessibility.captionFormat | Caption format: webvtt, srt, ttml, cea-608, cea-708. |
| accessibility.captionLanguage | BCP-47 tag of caption language. |
| accessibility.hasAudioDescription | Whether an audio description track is present: true or false. |
| accessibility.hasSignLanguageInterpretation | Whether sign language interpretation is present: true or false. |
| accessibility.signLanguageType | Sign language used for interpretation (BCP-47 sign language subtag). |
| accessibility.hazards | Accessibility hazards: flashing, motion-simulation, sound, none. |

What Does NOT Belong on Media Records

Several categories of metadata are better placed on other Layers record types:

  • Segmentation (VAD, IPUs, breath groups, turn boundaries) → pub.layers.annotation layers on the expression, with subkind values like vad, ipu, breath-group, turn-boundary, diarization
  • Derived acoustic measurements (pitch tracks, formant tracks, spectrograms, intensity contours) → pub.layers.annotation layers with appropriate subkind (e.g., pitch, formant, intensity, spectrogram)
  • Analysis parameters (Praat settings, window size, step size, frequency range) → annotationMetadata.features on the annotation layer that contains the derived measurements
  • Corpus-level statistics (total hours, speaker count, language distribution) → pub.layers.corpus features
  • Temporal alignment (millisecond/frame/sample alignment of annotations to media) → handled by pub.layers.defs#temporalSpan and pub.layers.defs#anchor