Concrete

Overview

Concrete is a stand-off annotation data model originally defined via Apache Thrift. It provides a hierarchical document model (Communication → Section → Sentence → Tokenization → Token) with UUID-based cross-referencing and typed annotation layers (EntityMention, SituationMention, TokenTagging, Parse, DependencyParse). Concrete is the primary structural inspiration for Layers's document model.

Type-by-Type Mapping

Document Model

Concrete Type | Layers Equivalent | Notes
Communication | pub.layers.expression.expression (record) | Direct mapping. Concrete's Communication maps to a top-level Expression with kind="document". Layers adds sourceUrl, sourceRef, eprintRef, knowledgeRefs for ATProto ecosystem integration, and parentRef/anchor for recursive nesting. Concrete's id field maps to the record's rkey; uuid maps to the AT-URI.
CommunicationMetadata | pub.layers.defs#annotationMetadata + pub.layers.defs#featureMap | Layers separates metadata (tool, timestamp, confidence, persona) from open features.

Hierarchical Structure

Concrete Type | Layers Equivalent | Notes
Section | pub.layers.expression.expression with kind="section" (or paragraph, chapter, turn, etc.) | Concrete sections map to nested Expressions with parentRef pointing to the document Expression and anchor specifying UTF-8 byte offsets. Layers adds kindUri for community-expandable section types and temporalSpan for audio/video sections.
Sentence | pub.layers.expression.expression with kind="sentence" | Nested Expression with parentRef pointing to its section. Layers adds temporalSpan.
Tokenization | pub.layers.segmentation.segmentation | Tokenization is represented in the segmentation record, which contains a list of tokens decomposing an expression's text. Each tokenization has an optional expressionRef that scopes it to a specific sub-expression (e.g., a sentence-level expression). Layers supports multiple tokenizations per expression and community-expandable tokenization strategies via kindUri.
Token | pub.layers.expression.expression with kind="word" | Concrete's Token has tokenIndex, text, and TextSpan; Layers adds temporalSpan for audio-grounded tokens. Tokens are word-level Expressions nested within their sentence.
TextSpan | pub.layers.defs#span | Concrete uses start/ending character offsets (exclusive end); Layers uses byteStart/byteEnd (UTF-8 byte offsets, exclusive end). The import pipeline converts character offsets to byte offsets at import time.
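
The offset conversion described in the TextSpan row can be sketched as follows. This is a minimal illustration, not the actual import pipeline: it assumes Concrete's start/ending values count Unicode code points (so they line up with Python string indices), and the function name char_span_to_byte_span is hypothetical.

```python
def char_span_to_byte_span(text: str, start: int, ending: int) -> dict:
    """Convert a character span (exclusive end) to a UTF-8 byte span.

    Assumption: `start` and `ending` are code-point offsets into `text`.
    """
    if not 0 <= start <= ending <= len(text):
        raise ValueError(f"span [{start}, {ending}) is outside the text")
    # Bytes before the span give the start; bytes inside it give the length.
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = byte_start + len(text[start:ending].encode("utf-8"))
    return {"byteStart": byte_start, "byteEnd": byte_end}
```

For ASCII-only text the two kinds of offset coincide; they diverge as soon as a multi-byte character appears before or inside the span, which is why the conversion has to run against the full document text.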

Segmentation and Structural Binding

Concrete Type | Layers Equivalent | Notes
Communication.sectionList | pub.layers.expression.expression records with parentRef | In Concrete, the section list is embedded in the Communication. In Layers, structural hierarchy (sections, sentences) is expressed via expression records with parentRef pointing to their parent expression and appropriate kind values (section, sentence, etc.). This allows structural decomposition to be contributed by different users in a decentralized context. The segmentation record (pub.layers.segmentation.segmentation) is reserved for tokenization only.

Token-Level Annotations

Concrete Type | Layers Equivalent | Notes
TokenTagging | pub.layers.annotation.annotationLayer with kind="token-tag" | Concrete's TokenTagging is a flat list of TaggedToken objects. Layers uses annotationLayer with kind="token-tag" and discriminates by subkind (pos, ner, lemma, morph, etc.). The TaggedToken.tag field maps to annotation.label.
TaggedToken | pub.layers.annotation.defs#annotation | TaggedToken.tokenIndex → annotation.tokenIndex; TaggedToken.tag → annotation.label.
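
As an illustration of the token-tag mapping, the sketch below turns a TokenTagging into an annotationLayer-shaped dictionary. The input is a plain dict assumed to mirror Concrete's field names (taggingType, taggedTokenList); the lookup table from taggingType to subkind and the function name are assumptions made for the example, not part of either schema.

```python
# Hypothetical lookup from Concrete taggingType strings to Layers subkinds.
SUBKIND_BY_TAGGING_TYPE = {"POS": "pos", "NER": "ner", "LEMMA": "lemma"}


def token_tagging_to_layer(tagging: dict) -> dict:
    """Build an annotationLayer-like dict with kind="token-tag"."""
    tagging_type = tagging.get("taggingType") or ""
    # Unknown types fall back to their lowercased name (an assumption).
    subkind = SUBKIND_BY_TAGGING_TYPE.get(tagging_type.upper(), tagging_type.lower())
    annotations = [
        # TaggedToken.tokenIndex -> annotation.tokenIndex; tag -> label.
        {"tokenIndex": tagged["tokenIndex"], "label": tagged["tag"]}
        for tagged in tagging.get("taggedTokenList", [])
    ]
    return {"kind": "token-tag", "subkind": subkind, "annotations": annotations}
```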

Entity Annotations

Concrete Type | Layers Equivalent | Notes
EntityMentionSet | pub.layers.annotation.annotationLayer with kind="span", subkind="entity-mention" | Direct mapping. Concrete's EntityMentionSet.mentionList → annotationLayer.annotations.
EntityMention | pub.layers.annotation.defs#annotation | EntityMention.tokens → annotation.anchor.tokenRefSequence; EntityMention.entityType → annotation.label; EntityMention.phraseType → annotation.features.phraseType.
Entity | pub.layers.annotation.clusterSet with kind="coreference" | Concrete's Entity groups EntityMention objects into coreference chains. Layers uses clusterSet with kind="coreference" and cluster.memberIds pointing to annotation UUIDs. The Entity.canonicalName maps to cluster.canonicalLabel.
EntitySet | pub.layers.annotation.clusterSet | One clusterSet per entity resolution output.
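
The Entity and EntitySet rows can be sketched as a small conversion function. The input is a plain dict assumed to mirror Concrete's entityList, mentionIdList, and canonicalName fields; the name of the list that holds the clusters ("clusters") is a guess for illustration, since only cluster.memberIds and cluster.canonicalLabel are named above.

```python
def entity_set_to_cluster_set(entity_set: dict) -> dict:
    """Build a clusterSet-like dict with kind="coreference"."""
    clusters = []
    for entity in entity_set.get("entityList", []):
        # Entity.mentionIdList -> cluster.memberIds (mention UUIDs kept as-is).
        cluster = {"memberIds": list(entity.get("mentionIdList", []))}
        # Entity.canonicalName -> cluster.canonicalLabel, when present.
        if entity.get("canonicalName"):
            cluster["canonicalLabel"] = entity["canonicalName"]
        clusters.append(cluster)
    # "clusters" is an assumed field name for the list of clusters.
    return {"kind": "coreference", "clusters": clusters}
```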

Situation Annotations

Concrete Type | Layers Equivalent | Notes
SituationMentionSet | pub.layers.annotation.annotationLayer with kind="span", subkind="situation-mention" or subkind="frame" | Concrete's SituationMention is used for situations, frames, and states. Layers discriminates these by subkind.
SituationMention | pub.layers.annotation.defs#annotation | SituationMention.tokens → annotation.anchor; SituationMention.situationKind → annotation.label.
MentionArgument | pub.layers.annotation.defs#argumentRef | MentionArgument.role → argumentRef.role; MentionArgument.entityMentionId → argumentRef.annotationId (same-layer) or argumentRef.layerRef + argumentRef.objectId (cross-layer).
Situation | pub.layers.annotation.clusterSet with kind="situation-coreference" | Concrete's Situation groups SituationMention objects. Maps to Layers clusterSet.
SituationSet | pub.layers.annotation.clusterSet | One clusterSet per situation resolution output.
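
The same-layer versus cross-layer choice in the MentionArgument row might look like the sketch below. It assumes the converter already knows which layer each entity mention was written to; the mention_layers lookup, the layer reference strings, and the function name are all hypothetical.

```python
def to_argument_ref(argument: dict, current_layer_ref: str,
                    mention_layers: dict) -> dict:
    """Map a MentionArgument-like dict to an argumentRef-like dict.

    `mention_layers` maps an entity mention id to the reference of the
    layer that holds it (a hypothetical bookkeeping structure).
    """
    mention_id = argument["entityMentionId"]
    target_layer = mention_layers[mention_id]
    ref = {"role": argument["role"]}
    if target_layer == current_layer_ref:
        # Same layer: a bare annotation id is enough.
        ref["annotationId"] = mention_id
    else:
        # Cross-layer: name the other layer and the object inside it.
        ref["layerRef"] = target_layer
        ref["objectId"] = mention_id
    return ref
```

In a typical Concrete import the situation mentions and entity mentions come from different mention sets, so the cross-layer branch is likely to be the common case.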

Syntactic Annotations

Concrete Type | Layers Equivalent | Notes
Parse | pub.layers.annotation.annotationLayer with kind="tree", subkind="constituency" | Concrete's Parse is a tree of Constituent objects. Layers represents each constituent as an annotation with parentId/childIds/tokenIndex.
Constituent | pub.layers.annotation.defs#annotation | Constituent.tag → annotation.label; Constituent.childList → annotation.childIds; Constituent.headChildIndex is represented via features.
DependencyParse | pub.layers.annotation.annotationLayer with kind="graph", subkind="dependency" | Direct mapping. Each Dependency becomes an annotation.
Dependency | pub.layers.annotation.defs#annotation | Dependency.dep → annotation.tokenIndex; Dependency.gov → annotation.headIndex; Dependency.edgeType → annotation.label.
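
A minimal sketch of the Dependency mapping, using plain dicts with Concrete's dep, gov, and edgeType field names. Treating a missing or negative gov as the root and omitting headIndex in that case is an assumption of this example, as is the function name.

```python
def dependency_parse_to_layer(dependencies: list) -> dict:
    """Build an annotationLayer-like dict with kind="graph", subkind="dependency"."""
    annotations = []
    for dependency in dependencies:
        # Dependency.dep -> annotation.tokenIndex
        annotation = {"tokenIndex": dependency["dep"]}
        gov = dependency.get("gov")
        # Assumption: a missing or negative governor marks the root token.
        if gov is not None and gov >= 0:
            annotation["headIndex"] = gov  # Dependency.gov -> headIndex
        if dependency.get("edgeType") is not None:
            annotation["label"] = dependency["edgeType"]  # edgeType -> label
        annotations.append(annotation)
    return {"kind": "graph", "subkind": "dependency", "annotations": annotations}
```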

Metadata and Provenance

Concrete Type | Layers Equivalent | Notes
AnnotationMetadata | pub.layers.defs#annotationMetadata | Direct mapping. tool → tool; timestamp → timestamp; confidence → confidence. Layers adds personaRef for annotator persona, digest for content hashing, and dependencies for provenance chains.
TheoryDependencies | pub.layers.defs#annotationMetadata.dependencies | Concrete's TheoryDependencies tracks which upstream analyses an annotation depends on. Layers uses the dependencies array on annotationMetadata, containing objectRef references to upstream records.
kBest | pub.layers.annotation.annotationLayer.rank + alternativesRef | Concrete supports k-best lists for parse trees and other analyses. Layers models this with rank (1 = best) and alternativesRef (points to the top-ranked layer) on annotationLayer. Each alternative is a separate layer record.
CommunicationTagging | pub.layers.annotation.annotationLayer with kind="document-tag" | Concrete's document-level tagging maps to an annotation layer with kind="document-tag" on the expression.
LanguageIdentification | pub.layers.expression.languages + pub.layers.annotation.annotationLayer with subkind="language-id" | Concrete's document-level language ID maps to expression.language (primary) and expression.languages (additional). Per-span language identification uses an annotation layer with subkind="language-id".
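
The kBest row can be illustrated with a small ranking helper. The sketch assumes each candidate layer already has a record URI and a numeric score (both hypothetical inputs), and it assumes that only the lower-ranked alternatives carry alternativesRef; whether the top-ranked layer also references itself is not specified above.

```python
def rank_alternatives(candidates: list) -> list:
    """Assign rank (1 = best) and alternativesRef to k-best candidate layers.

    Each candidate is a dict with hypothetical "uri" and "score" keys;
    a higher score is treated as better.
    """
    ordered = sorted(candidates, key=lambda layer: layer["score"], reverse=True)
    if not ordered:
        return []
    best_uri = ordered[0]["uri"]
    ranked = []
    for rank, layer in enumerate(ordered, start=1):
        record = {"uri": layer["uri"], "rank": rank}
        if rank > 1:
            # Alternatives point back at the top-ranked layer.
            record["alternativesRef"] = best_uri
        ranked.append(record)
    return ranked
```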

Cross-Document Features

Concrete Type | Layers Equivalent | Notes
CommunicationSet | pub.layers.annotation.clusterSet with expressionRefs + corpusRef | Concrete's cross-document entity and event clustering uses CommunicationSet to define the document scope. Layers uses clusterSet with optional expression (single-document) or expressionRefs/corpusRef (cross-document).

Features Not in Concrete (Layers Extensions)

Layers extends Concrete's model in several dimensions that Concrete does not address:

  • Recursive expressions: Documents, paragraphs, sentences, words, and morphemes are all expressions with recursive nesting via parentRef. Concrete has a fixed hierarchy.
  • Multimodal anchoring: temporalSpan, spatioTemporalAnchor, pageAnchor, boundingBox. Concrete is primarily text-oriented.
  • W3C selectors: textQuoteSelector, textPositionSelector, fragmentSelector for web annotation interoperability.
  • Knowledge graph integration: knowledgeRef, pub.layers.graph (generic typed property graph). Concrete has no built-in knowledge base references.
  • Alignment records: pub.layers.alignment.alignment. Concrete has no parallel text or interlinear glossing support.
  • Ontology definitions: pub.layers.ontology.ontology. Concrete relies on external tagset definitions.
  • Judgment experiments: pub.layers.judgment. Concrete has no annotation experiment framework.
  • Community-expandable enums: URI+slug dual-field pattern. Concrete uses fixed enum types.
  • Decentralized ownership: ATProto records live in user PDSes. Concrete assumes centralized storage.

Conversion Notes

A Concrete Communication can be converted to Layers records as follows:

  1. Create a pub.layers.expression.expression record with kind="document" from the Communication's text, id, and metadata
  2. Create pub.layers.expression.expression records for each Section (kind="section"), Sentence (kind="sentence"), and Token (kind="word") with parentRef chains
  3. Create a pub.layers.segmentation.segmentation record for each tokenization, with an optional expressionRef scoping it to a specific sub-expression (e.g., a sentence)
  4. For each TokenTagging, create an annotationLayer with kind="token-tag" and appropriate subkind
  5. For each EntityMentionSet, create an annotationLayer with kind="span", subkind="entity-mention"
  6. For each EntitySet, create a clusterSet with kind="coreference"
  7. For each SituationMentionSet, create an annotationLayer with kind="span" and appropriate subkind
  8. For each Parse, create an annotationLayer with kind="tree", subkind="constituency"
  9. For each DependencyParse, create an annotationLayer with kind="graph", subkind="dependency"
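
Steps 1 and 2 above can be sketched as a walk over the Communication that emits one expression record per structural unit. The input here is a simplified, hypothetical dict (sectionList, sentenceList, tokenList), not the real Thrift objects, and the "ref" values are placeholders standing in for record keys or AT-URIs.

```python
from itertools import count


def communication_to_expressions(comm: dict) -> list:
    """Flatten a Communication-like dict into expression-like records."""
    ids = count(1)
    records = []

    def add(kind, parent_ref, **fields):
        ref = f"expr-{next(ids)}"  # placeholder, not a real rkey or AT-URI
        record = {"ref": ref, "kind": kind, **fields}
        if parent_ref is not None:
            record["parentRef"] = parent_ref
        records.append(record)
        return ref

    # Step 1: the document-level expression.
    doc_ref = add("document", None, text=comm["text"])
    # Step 2: sections, sentences, and tokens chained via parentRef.
    for section in comm.get("sectionList", []):
        section_ref = add("section", doc_ref)
        for sentence in section.get("sentenceList", []):
            sentence_ref = add("sentence", section_ref)
            for token in sentence.get("tokenList", []):
                add("word", sentence_ref, text=token["text"])
    return records
```

A fuller converter would also attach an anchor with UTF-8 byte offsets to each nested record and carry the Concrete UUIDs across; both are left out here to keep the parentRef chain easy to follow.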

All UUID references are preserved. The import pipeline converts Concrete's character offsets (TextSpan) to UTF-8 byte offsets at import time.