
pub.layers.corpus

Corpus records. A corpus is a named, versioned collection of expressions with shared metadata, annotation guidelines, and ontologies.

Types

main

Type: Record

A corpus: a curated collection of expressions.

| Field | Type | Description |
| --- | --- | --- |
| name | string | Corpus name. |
| description | string | Detailed description of the corpus. |
| version | string | Version string for the corpus release. |
| language | string | Primary BCP-47 language tag. |
| languages | array | All languages represented. Array of strings. |
| domainUri | at-uri | AT-URI of the domain definition node. Community-expandable via knowledge graph. |
| domain | string | Domain slug (fallback when domainUri unavailable). Known values: news, biomedical, legal, social-media, dialogue, literary, scientific, web, spoken, custom. |
| license | string | License identifier (e.g., 'CC-BY-4.0', 'LDC-User-Agreement'). |
| ontologyRefs | array | Ontologies used in this corpus. Array of at-uri. |
| eprintRefs | array | Eprint links for this corpus. Array of at-uri. |
| expressionCount | integer | Number of expressions in the corpus. |
| features | ref | Ref: pub.layers.defs#featureMap |
| createdAt | datetime | Record creation timestamp. |
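
As a rough illustration of how the main fields fit together, the sketch below builds a dictionary shaped like a corpus record. It is not an official client: the helper names (`make_corpus_record`, `is_known_domain`), the sample values, and the `$type` value are assumptions made for this example, and the page does not say which fields are required. The domain slugs are treated as an open set, so an unlisted slug is reported rather than rejected.

```python
from datetime import datetime, timezone

# Domain slugs listed as known values for the `domain` field.
# Assumption: the set is open, so unknown slugs are flagged, not rejected.
KNOWN_DOMAINS = {
    "news", "biomedical", "legal", "social-media", "dialogue",
    "literary", "scientific", "web", "spoken", "custom",
}


def make_corpus_record(name, version, language, domain, expression_count,
                       languages=None, license_id=None):
    """Build a dict shaped like a corpus record (hypothetical helper)."""
    record = {
        # Assumption: the record type is named by the lexicon ID.
        "$type": "pub.layers.corpus",
        "name": name,
        "version": version,
        "language": language,
        # Assumption: default to the primary language when no list is given.
        "languages": languages or [language],
        "domain": domain,
        "expressionCount": expression_count,
        "createdAt": datetime.now(timezone.utc).isoformat(),
    }
    if license_id is not None:
        record["license"] = license_id
    return record


def is_known_domain(record):
    """Return True when the record's domain slug is one of the listed values."""
    return record.get("domain") in KNOWN_DOMAINS


# Sample values below are invented for illustration.
corpus = make_corpus_record(
    name="Example News Corpus",
    version="1.0.0",
    language="en",
    domain="news",
    expression_count=1200,
    license_id="CC-BY-4.0",
)
```

A record with a slug outside the list (for example a community-defined domain) would still be built; `is_known_domain` only tells the caller that `domainUri` may be the better place to look.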

membership

Type: Record

A record indicating that an expression belongs to a corpus, with optional split assignment.

| Field | Type | Description |
| --- | --- | --- |
| corpusRef | at-uri | AT-URI of the corpus. |
| expressionRef | at-uri | AT-URI of the expression. |
| splitUri | at-uri | AT-URI of the split definition node. Community-expandable via knowledge graph. |
| split | string | Split slug (fallback when splitUri unavailable). Known values: train, dev, test, unlabeled. |
| ordinal | integer | Ordering index within the corpus. |
| metadata | ref | Provenance: who assigned this expression to this corpus, when, with what tool. Ref: pub.layers.defs#annotationMetadata |
| features | ref | Open-ended features for this membership (e.g., source file, import batch, quality flags). Ref: pub.layers.defs#featureMap |
| createdAt | datetime | Record creation timestamp. |
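
To show how membership records might be consumed, the following sketch groups expressions by split and orders them by ordinal. Everything beyond the field names is an assumption for illustration: the AT-URIs use placeholder identifiers, the helper names are hypothetical, and treating a missing split as `unlabeled` or a missing ordinal as 0 is a choice made here, not something the lexicon states.

```python
from collections import defaultdict

# Placeholder AT-URI; the DID and record key are invented for this example.
CORPUS_URI = "at://did:example:alice/pub.layers.corpus/corpus1"


def make_membership(expression_ref, split, ordinal, corpus_ref=CORPUS_URI):
    """Build a dict shaped like a membership record (hypothetical helper)."""
    return {
        "corpusRef": corpus_ref,
        "expressionRef": expression_ref,
        "split": split,
        "ordinal": ordinal,
    }


def expressions_by_split(memberships):
    """Group expression URIs by split slug, ordered by ordinal within the corpus."""
    grouped = defaultdict(list)
    # Assumption: a missing ordinal sorts as 0, a missing split counts as "unlabeled".
    for m in sorted(memberships, key=lambda m: m.get("ordinal", 0)):
        grouped[m.get("split", "unlabeled")].append(m["expressionRef"])
    return dict(grouped)


# Placeholder expression URIs; the collection name is hypothetical.
E1 = "at://did:example:alice/example.expression/e1"
E2 = "at://did:example:alice/example.expression/e2"
E3 = "at://did:example:alice/example.expression/e3"

memberships = [
    make_membership(E3, "test", 2),
    make_membership(E1, "train", 0),
    make_membership(E2, "train", 1),
]

splits = expressions_by_split(memberships)
```

Because each membership is its own record, an expression can be assigned to a corpus, or moved between splits, without touching the corpus record or the expression itself.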