Concrete (HLTCOE)

Overview

Concrete is a stand-off annotation data model originally defined via Apache Thrift. It provides a hierarchical document model (Communication → Section → Sentence → Tokenization → Token) with UUID-based cross-referencing and typed annotation layers (EntityMention, SituationMention, TokenTagging, Parse, DependencyParse). Concrete is the primary structural inspiration for Layers's document model.

Type-by-Type Mapping

Document Model

| Concrete Type | Layers Equivalent | Notes |
| --- | --- | --- |
| Communication | pub.layers.expression (record) | Direct mapping. Concrete's Communication maps to a top-level Expression with kind="document". Layers adds sourceUrl, sourceRef, eprintRef, knowledgeRefs for ATProto ecosystem integration, and parentRef/anchor for recursive nesting. Concrete's id field maps to the record's rkey; uuid maps to the AT-URI. |
| CommunicationMetadata | pub.layers.defs#annotationMetadata + pub.layers.defs#featureMap | Layers separates metadata (tool, timestamp, confidence, persona) from open features. |

Hierarchical Structure

| Concrete Type | Layers Equivalent | Notes |
| --- | --- | --- |
| Section | pub.layers.expression with kind="section" (or paragraph, chapter, turn, etc.) | Concrete sections map to nested Expressions with parentRef pointing to the document Expression and anchor specifying character offsets. Layers adds kindUri for community-expandable section types and temporalSpan for audio/video sections. |
| Sentence | pub.layers.expression with kind="sentence" | Nested Expression with parentRef pointing to its section. Layers adds temporalSpan. |
| Tokenization | pub.layers.segmentation | Tokenization is represented in the segmentation record, which defines how a parent Expression is decomposed into child Expressions. Layers supports multiple tokenizations per sentence and community-expandable tokenization strategies via kindUri. |
| Token | pub.layers.expression with kind="word" | Concrete's Token has tokenIndex, text, and TextSpan; Layers adds temporalSpan for audio-grounded tokens. Tokens are word-level Expressions nested within their sentence. |
| TextSpan | pub.layers.defs#span | Concrete uses start/ending (exclusive); Layers uses the same convention. |
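
The nesting and offset conventions in this table can be sketched in Python. This is a minimal illustration, not the Layers schema: the `make_expression` helper, the `text` field, the layout of `anchor`, and the example AT-URIs are all assumptions, and `TextSpan` is a simplified stand-in for the Concrete type.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    """Stand-in for Concrete's TextSpan: start is inclusive, ending is exclusive."""
    start: int
    ending: int


def make_expression(kind, text, span=None, parent_ref=None):
    """Build a dict shaped like a pub.layers.expression record (layout assumed)."""
    record = {"$type": "pub.layers.expression", "kind": kind, "text": text}
    if span is not None:
        # Assumed anchor layout: character offsets into the document text.
        record["anchor"] = {"start": span.start, "ending": span.ending}
    if parent_ref is not None:
        record["parentRef"] = parent_ref
    return record


doc_text = "Dogs bark. Cats purr."
# Hypothetical AT-URIs; real ones exist only once records are written to a PDS.
doc_uri = "at://did:example:alice/pub.layers.expression/doc1"
section_uri = "at://did:example:alice/pub.layers.expression/sec1"

document = make_expression("document", doc_text)
section = make_expression(
    "section", doc_text, TextSpan(0, len(doc_text)), parent_ref=doc_uri
)

# Half-open offsets slice the text directly: [11, 21) covers "Cats purr."
sentence_span = TextSpan(11, 21)
sentence = make_expression(
    "sentence",
    doc_text[sentence_span.start:sentence_span.ending],
    sentence_span,
    parent_ref=section_uri,
)
print(sentence["text"])
```

Because both models treat the end offset as exclusive, a span can be used as a Python slice without adjustment.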

Segmentation Binding

| Concrete Type | Layers Equivalent | Notes |
| --- | --- | --- |
| Communication.sectionList | pub.layers.segmentation (record) | In Concrete, the section list is embedded in the Communication. Layers separates segmentation into its own record, allowing multiple segmentations per expression and enabling segmentation to be contributed by different users in a decentralized context. |

Token-Level Annotations

| Concrete Type | Layers Equivalent | Notes |
| --- | --- | --- |
| TokenTagging | pub.layers.annotation#annotationLayer with kind="token-tag" | Concrete's TokenTagging is a flat list of TaggedToken objects. Layers uses annotationLayer with kind="token-tag" and discriminates by subkind (pos, ner, lemma, morph, etc.). The TaggedToken.tag field maps to annotation.label. |
| TaggedToken | pub.layers.annotation#annotation | TaggedToken.tokenIndex → annotation.tokenIndex; TaggedToken.tag → annotation.label. |
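
As a rough sketch of this mapping, the function below turns a list of (tokenIndex, tag) pairs into a dict shaped like an annotationLayer. The function name and the plain-dict representation are assumptions made for illustration; only kind, subkind, annotations, tokenIndex, and label come from the table.

```python
def token_tagging_to_layer(tagged_tokens, subkind):
    """Convert (tokenIndex, tag) pairs into a token-tag annotationLayer dict."""
    return {
        "kind": "token-tag",
        "subkind": subkind,  # e.g. "pos", "ner", "lemma", "morph"
        "annotations": [
            {"tokenIndex": token_index, "label": tag}
            for token_index, tag in tagged_tokens
        ],
    }


# A part-of-speech tagging for a three-token sentence.
pos_tagging = [(0, "NNS"), (1, "VBP"), (2, ".")]
pos_layer = token_tagging_to_layer(pos_tagging, subkind="pos")
print(pos_layer["annotations"][1])
```

Each Concrete TokenTagging would yield one such layer, so a document with both POS and NER taggings would produce two layers that differ only in subkind and labels.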

Entity Annotations

| Concrete Type | Layers Equivalent | Notes |
| --- | --- | --- |
| EntityMentionSet | pub.layers.annotation#annotationLayer with kind="span", subkind="entity-mention" | Direct mapping. Concrete's EntityMentionSet.mentionList → annotationLayer.annotations. |
| EntityMention | pub.layers.annotation#annotation | EntityMention.tokens → annotation.anchor.tokenRefSequence; EntityMention.entityType → annotation.label; EntityMention.phraseType → annotation.features.phraseType. |
| Entity | pub.layers.annotation#clusterSet with kind="coreference" | Concrete's Entity groups EntityMention objects into coreference chains. Layers uses clusterSet with kind="coreference" and cluster.memberIds pointing to annotation UUIDs. The Entity.canonicalName maps to cluster.canonicalLabel. |
| EntitySet | pub.layers.annotation#clusterSet | One clusterSet per entity resolution output. |
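
The Entity-to-cluster mapping might look like the following sketch. The input dicts are simplified stand-ins for Concrete Entity objects (using its mentionIdList and canonicalName fields), and the `clusters` key on the clusterSet is an assumed name; memberIds and canonicalLabel are taken from the table.

```python
def entity_set_to_cluster_set(entities):
    """Convert Concrete-style Entity dicts into one coreference clusterSet dict."""
    clusters = []
    for entity in entities:
        # Mention UUIDs carry over unchanged as cluster members.
        cluster = {"memberIds": list(entity["mentionIdList"])}
        if entity.get("canonicalName"):
            cluster["canonicalLabel"] = entity["canonicalName"]
        clusters.append(cluster)
    # "clusters" is an assumed field name for the list of clusters.
    return {"kind": "coreference", "clusters": clusters}


entity_set = [
    {"mentionIdList": ["em-1", "em-4"], "canonicalName": "Ada Lovelace"},
    {"mentionIdList": ["em-2"]},
]
cluster_set = entity_set_to_cluster_set(entity_set)
print(cluster_set["clusters"][0]["canonicalLabel"])
```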

Situation Annotations

| Concrete Type | Layers Equivalent | Notes |
| --- | --- | --- |
| SituationMentionSet | pub.layers.annotation#annotationLayer with kind="span", subkind="situation-mention" or subkind="frame" | Concrete's SituationMention is used for situations, frames, and states. Layers discriminates these by subkind. |
| SituationMention | pub.layers.annotation#annotation | SituationMention.tokens → annotation.anchor; SituationMention.situationKind → annotation.label. |
| MentionArgument | pub.layers.annotation#argumentRef | MentionArgument.role → argumentRef.role; MentionArgument.entityMentionId → argumentRef.annotationId (same-layer) or argumentRef.layerRef + argumentRef.objectId (cross-layer). |
| Situation | pub.layers.annotation#clusterSet with kind="situation-coreference" | Concrete's Situation groups SituationMention objects. Maps to Layers clusterSet. |
| SituationSet | pub.layers.annotation#clusterSet | One clusterSet per situation resolution output. |
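
The same-layer versus cross-layer distinction for MentionArgument can be illustrated with a small helper. This is a sketch under assumptions: the helper name, its parameters, and the example layer URI are hypothetical, and the decision rule (check whether the target id belongs to the current layer) is one plausible way to choose between the two forms.

```python
def to_argument_ref(role, target_id, own_layer_ids, target_layer_ref=None):
    """Build an argumentRef dict for one Concrete MentionArgument."""
    if target_id in own_layer_ids:
        # Same-layer target: the annotation id alone identifies it.
        return {"role": role, "annotationId": target_id}
    if target_layer_ref is None:
        raise ValueError("cross-layer argument needs the target layer's reference")
    # Cross-layer target: name the layer and the object within it.
    return {"role": role, "layerRef": target_layer_ref, "objectId": target_id}


situation_layer_ids = {"sm-1", "sm-2"}
# Hypothetical AT-URI of the entity-mention layer.
entity_layer_uri = "at://did:example:alice/pub.layers.annotation/entities"

agent = to_argument_ref("agent", "em-7", situation_layer_ids, entity_layer_uri)
cause = to_argument_ref("cause", "sm-2", situation_layer_ids)
print(agent)
print(cause)
```

Since entity mentions usually live in their own layer, most arguments converted from Concrete would take the cross-layer form.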

Syntactic Annotations

| Concrete Type | Layers Equivalent | Notes |
| --- | --- | --- |
| Parse | pub.layers.annotation#annotationLayer with kind="tree", subkind="constituency" | Concrete's Parse is a tree of Constituent objects. Layers represents each constituent as an annotation with parentId/childIds/tokenIndex. |
| Constituent | pub.layers.annotation#annotation | Constituent.tag → annotation.label; Constituent.childList → annotation.childIds; Constituent.headChildIndex is carried in features. |
| DependencyParse | pub.layers.annotation#annotationLayer with kind="graph", subkind="dependency" | Direct mapping. Each Dependency becomes an annotation. |
| Dependency | pub.layers.annotation#annotation | Dependency.dep → annotation.tokenIndex; Dependency.gov → annotation.headIndex; Dependency.edgeType → annotation.label. |
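
For the constituency case, a converter has to derive parentId by inverting each constituent's child list. The sketch below assumes constituents arrive as plain dicts with id, tag, childList, and an optional headChildIndex; the `id` key on the resulting annotation and the `headChildIndex` feature name are assumptions, and leaf tokenIndex values are left out for brevity.

```python
def parse_to_layer(constituents):
    """Convert Concrete-style Constituent dicts into a constituency layer dict."""
    # Invert childList so that every non-root constituent knows its parent.
    parent_of = {
        child_id: c["id"] for c in constituents for child_id in c["childList"]
    }
    annotations = []
    for c in constituents:
        ann = {"id": c["id"], "label": c["tag"], "childIds": list(c["childList"])}
        if c["id"] in parent_of:
            ann["parentId"] = parent_of[c["id"]]
        if c.get("headChildIndex") is not None:
            # Assumed feature name for the head-child marker.
            ann["features"] = {"headChildIndex": c["headChildIndex"]}
        annotations.append(ann)
    return {"kind": "tree", "subkind": "constituency", "annotations": annotations}


constituents = [
    {"id": 0, "tag": "S", "childList": [1, 2], "headChildIndex": 1},
    {"id": 1, "tag": "NP", "childList": []},
    {"id": 2, "tag": "VP", "childList": []},
]
tree_layer = parse_to_layer(constituents)
print(tree_layer["annotations"][2])
```

The root is the only annotation without a parentId, which gives a simple consistency check after conversion.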

Metadata and Provenance

| Concrete Type | Layers Equivalent | Notes |
| --- | --- | --- |
| AnnotationMetadata | pub.layers.defs#annotationMetadata | Direct mapping. tool → tool; timestamp → timestamp; confidence → confidence. Layers adds personaRef for annotator persona, digest for content hashing, and dependencies for provenance chains. |
| TheoryDependencies | pub.layers.defs#annotationMetadata.dependencies | Concrete's TheoryDependencies tracks which upstream analyses an annotation depends on. Layers uses the dependencies array on annotationMetadata, containing objectRef references to upstream records. |
| kBest | pub.layers.annotation#annotationLayer.rank + alternativesRef | Concrete supports k-best lists for parse trees and other analyses. Layers models this with rank (1 = best) and alternativesRef (points to the top-ranked layer) on annotationLayer. Each alternative is a separate layer record. |
| CommunicationTagging | pub.layers.annotation#annotationLayer with kind="document-tag" | Concrete's document-level tagging maps to an annotation layer with kind="document-tag" on the expression. |
| LanguageIdentification | pub.layers.expression.languages + pub.layers.annotation#annotationLayer with subkind="language-id" | Concrete's document-level language ID maps to expression.language (primary) and expression.languages (additional). Per-span language identification uses an annotation layer with subkind="language-id". |
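
The k-best row can be made concrete with a short sketch. It assumes the alternatives are already ordered best-first and that each layer's AT-URI is known in advance; the function name and the URIs are hypothetical. Leaving alternativesRef off the top-ranked layer is a choice made here, not something the mapping specifies.

```python
def kbest_to_layers(parses, layer_uris):
    """Emit one constituency layer dict per k-best alternative, best first."""
    top_uri = layer_uris[0]
    layers = []
    for position, annotations in enumerate(parses):
        layer = {
            "kind": "tree",
            "subkind": "constituency",
            "rank": position + 1,  # rank 1 is the best analysis
            "annotations": annotations,
        }
        if position > 0:
            # Lower-ranked alternatives point back to the top-ranked layer.
            layer["alternativesRef"] = top_uri
        layers.append(layer)
    return layers


# Hypothetical layer URIs and placeholder annotation lists.
uris = [
    "at://did:example:alice/pub.layers.annotation/parse-1",
    "at://did:example:alice/pub.layers.annotation/parse-2",
]
kbest_layers = kbest_to_layers([[{"label": "S"}], [{"label": "FRAG"}]], uris)
print([layer["rank"] for layer in kbest_layers])
```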

Cross-Document Features

| Concrete Type | Layers Equivalent | Notes |
| --- | --- | --- |
| CommunicationSet | pub.layers.annotation#clusterSet with expressionRefs + corpusRef | Concrete's cross-document entity and event clustering uses CommunicationSet to define the document scope. Layers uses clusterSet with optional expression (single-document) or expressionRefs/corpusRef (cross-document). |

Features Not in Concrete (Layers Extensions)

Layers extends Concrete's model in several dimensions that Concrete does not address:

  • Recursive expressions: Documents, paragraphs, sentences, words, and morphemes are all expressions with recursive nesting via parentRef — Concrete has a fixed hierarchy
  • Multimodal anchoring: temporalSpan, spatioTemporalAnchor, pageAnchor, boundingBox — Concrete is primarily text-oriented
  • W3C selectors: textQuoteSelector, textPositionSelector, fragmentSelector — for web annotation interoperability
  • Knowledge graph integration: knowledgeRef, pub.layers.graph (generic typed property graph) — Concrete has no built-in knowledge base references
  • Alignment records: pub.layers.alignment — Concrete has no parallel text or interlinear glossing support
  • Ontology definitions: pub.layers.ontology — Concrete relies on external tagset definitions
  • Judgment experiments: pub.layers.judgment — Concrete has no annotation experiment framework
  • Community-expandable enums: URI+slug dual-field pattern — Concrete uses fixed enum types
  • Decentralized ownership: ATProto records live in user PDSes — Concrete assumes centralized storage

Conversion Notes

A Concrete Communication can be converted to Layers records as follows:

  1. Create a pub.layers.expression record with kind="document" from the Communication's text, id, and metadata
  2. Create pub.layers.expression records for each Section (kind="section"), Sentence (kind="sentence"), and Token (kind="word") with parentRef chains
  3. Create a pub.layers.segmentation record defining the ordered decomposition
  4. For each TokenTagging, create an annotationLayer with kind="token-tag" and appropriate subkind
  5. For each EntityMentionSet, create an annotationLayer with kind="span", subkind="entity-mention"
  6. For each EntitySet, create a clusterSet with kind="coreference"
  7. For each SituationMentionSet, create an annotationLayer with kind="span" and appropriate subkind
  8. For each Parse, create an annotationLayer with kind="tree", subkind="constituency"
  9. For each DependencyParse, create an annotationLayer with kind="graph", subkind="dependency"

All UUID references are preserved. Character offsets (TextSpan) transfer directly.
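
Steps 1 to 3 above can be sketched as a single walk over the Communication. Everything here is illustrative: the input is a plain-dict stand-in for a Concrete Communication (with tokens flattened into a `tokenList` on each sentence), the `id` and `text` fields and the shape of the segmentation record are assumptions, and parentRef holds the parent's UUID where a real converter would use the AT-URI of the written record.

```python
def convert_communication(comm):
    """Return (expressions, segmentation) for a dict-shaped Communication."""
    text = comm["text"]
    # Step 1: the document-level expression.
    expressions = [{"kind": "document", "id": comm["uuid"], "text": text}]
    ordered_children = []

    def add(kind, node, parent_uuid):
        span = node["textSpan"]
        expressions.append({
            "kind": kind,
            "id": node["uuid"],            # Concrete UUID preserved
            "parentRef": parent_uuid,      # simplification: UUID, not AT-URI
            "anchor": dict(span),          # offsets transfer unchanged
            "text": text[span["start"]:span["ending"]],
        })
        ordered_children.append(node["uuid"])

    # Step 2: nested expressions with parentRef chains.
    for section in comm["sectionList"]:
        add("section", section, comm["uuid"])
        for sentence in section["sentenceList"]:
            add("sentence", sentence, section["uuid"])
            for token in sentence["tokenList"]:
                add("word", token, sentence["uuid"])

    # Step 3: a segmentation record listing the decomposition in order
    # (hypothetical shape).
    segmentation = {"expression": comm["uuid"], "children": ordered_children}
    return expressions, segmentation


comm = {
    "uuid": "u-comm", "text": "Dogs bark.",
    "sectionList": [{
        "uuid": "u-sec", "textSpan": {"start": 0, "ending": 10},
        "sentenceList": [{
            "uuid": "u-sent", "textSpan": {"start": 0, "ending": 10},
            "tokenList": [
                {"uuid": "u-t0", "textSpan": {"start": 0, "ending": 4}},
                {"uuid": "u-t1", "textSpan": {"start": 5, "ending": 9}},
                {"uuid": "u-t2", "textSpan": {"start": 9, "ending": 10}},
            ],
        }],
    }],
}
expressions, segmentation = convert_communication(comm)
print([e["kind"] for e in expressions])
```

Steps 4 to 9 would then add one annotation layer or clusterSet per Concrete annotation set, each referring to these expressions by the preserved identifiers.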