CoNLL

Overview

CoNLL formats are tab-separated, column-based annotation formats widely used in NLP shared tasks. CoNLL-U, the Universal Dependencies format, is the most common variant. Each token occupies one line, with one annotation per column, and blank lines separate sentences. The formats are flat, single-file representations well suited to machine learning pipelines.

CoNLL-U (Universal Dependencies)

Column Mapping

| CoNLL-U Column | Layers Equivalent | Notes |
| --- | --- | --- |
| ID | pub.layers.expression.expression token tokenIndex | Token position. CoNLL-U uses 1-based; Layers uses 0-based. |
| FORM | pub.layers.expression.expression token text | Surface form. |
| LEMMA | annotationLayer(kind="token-tag", subkind="lemma") → annotation.value | Lemmatization layer. |
| UPOS | annotationLayer(kind="token-tag", subkind="pos") → annotation.label | Universal POS tag. |
| XPOS | annotationLayer(kind="token-tag", subkind="xpos") → annotation.label | Language-specific POS tag. |
| FEATS | annotationLayer(kind="token-tag", subkind="morph") → annotation.features | Morphological features (e.g., Case=Nom\|Number=Sing). Each feature key-value pair maps to a feature entry. |
| HEAD | annotationLayer(kind="graph", subkind="dependency") → annotation.headIndex | Governor token index. CoNLL-U's 0 (root) maps to headIndex absent or a sentinel. |
| DEPREL | Same dependency layer → annotation.label | Dependency relation label. |
| DEPS | annotationLayer(kind="graph", subkind="enhanced-dependency") | Enhanced dependencies (multiple heads). Each head:deprel pair creates a separate annotation in the enhanced layer. |
| MISC | pub.layers.defs#featureMap on the token or annotation | Catch-all for SpaceAfter=No, Translit=..., etc. |
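
The column mapping above can be illustrated with a minimal Python sketch that parses one CoNLL-U word line. The function names `parse_token_line` and `parse_key_values` are hypothetical, and so are the result keys other than `tokenIndex`, `text`, and `headIndex`; the output is a plain dictionary, not an actual Layers record.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_key_values(field: str) -> dict:
    """Parse a FEATS- or MISC-style field such as 'Case=Nom|Number=Sing'."""
    if field == "_":
        return {}
    pairs = {}
    for item in field.split("|"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def parse_token_line(line: str) -> dict:
    """Map one CoNLL-U word line onto illustrative Layers-style fields.

    Handles plain integer IDs only; multi-word ranges (1-2) and
    empty nodes (2.1) need separate treatment.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(COLUMNS):
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    row = dict(zip(COLUMNS, cols))

    def value_or_none(name):
        # '_' marks an unspecified value (a literal '_' token is not handled here).
        return None if row[name] == "_" else row[name]

    head = None if row["HEAD"] == "_" else int(row["HEAD"])
    deps = [] if row["DEPS"] == "_" else [
        tuple(pair.split(":", 1)) for pair in row["DEPS"].split("|")
    ]
    return {
        "tokenIndex": int(row["ID"]) - 1,        # 1-based -> 0-based
        "text": row["FORM"],
        "lemma": value_or_none("LEMMA"),
        "upos": value_or_none("UPOS"),
        "xpos": value_or_none("XPOS"),
        "features": parse_key_values(row["FEATS"]),
        # Root (HEAD = 0) is represented here by an absent (None) headIndex.
        "headIndex": None if head in (None, 0) else head - 1,
        "deprel": value_or_none("DEPREL"),
        "enhanced": deps,                        # (head, deprel) pairs, head kept as text
        "misc": parse_key_values(row["MISC"]),
    }
```

Enhanced heads are kept as strings in this sketch because they may refer to empty nodes such as 2.1, which have no integer index.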

Special Token Types

| CoNLL-U Feature | Layers Equivalent | Notes |
| --- | --- | --- |
| Multi-word tokens (e.g., 1-2 del) | pub.layers.defs#tokenRefSequence | Multi-word token ranges are represented as a tokenRefSequence with tokenIndexes covering the component tokens. The surface form and span of the multi-word token are stored in features. |
| Empty nodes (e.g., 2.1) | pub.layers.annotation.defs#annotation with features | Empty nodes in enhanced UD are represented as annotations (not tokens) in the enhanced dependency layer, with features indicating they are empty/null nodes. Their position is tracked via decimal indices stored in features. |
| Sentence boundaries | pub.layers.expression.expression (kind: sentence) with parentRef | CoNLL-U blank lines between sentences map to sentence-level expression records with parentRef pointing to the document expression. Tokenization for each sentence is a pub.layers.segmentation.segmentation record with expressionRef pointing to that sentence expression. |
| # text = ... comment | pub.layers.expression.expression sentence features or pub.layers.expression.text | Sentence-level metadata from comments. |
| # sent_id = ... comment | pub.layers.expression.expression sentence uuid | Sentence identifier. |
| # newpar / # newdoc | pub.layers.expression.expression (kind: section) boundaries | Paragraph and document boundaries. |
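
As a rough illustration of how a converter might tell these special cases apart, the sketch below classifies an ID field and splits comment lines into key and value. The names `classify_id` and `parse_comment` and the returned tuples are assumptions made for this example, not part of Layers.

```python
import re

WORD_ID = re.compile(r"^[0-9]+$")            # e.g. 3
RANGE_ID = re.compile(r"^([0-9]+)-([0-9]+)$")  # e.g. 1-2 (multi-word token)
EMPTY_ID = re.compile(r"^[0-9]+\.[0-9]+$")   # e.g. 2.1 (empty node)


def classify_id(id_field: str):
    """Return (kind, payload) for a CoNLL-U ID field."""
    if WORD_ID.match(id_field):
        return ("word", int(id_field) - 1)            # 0-based tokenIndex
    match = RANGE_ID.match(id_field)
    if match:
        first, last = int(match.group(1)), int(match.group(2))
        # 0-based tokenIndexes of the component tokens
        return ("multiword", list(range(first - 1, last)))
    if EMPTY_ID.match(id_field):
        return ("empty", id_field)                    # decimal index kept as text
    raise ValueError(f"unrecognized ID field: {id_field!r}")


def parse_comment(line: str):
    """Split '# key = value' into (key, value); bare flags give (key, None)."""
    body = line.lstrip("#").strip()
    if "=" in body:
        key, _, value = body.partition("=")
        return key.strip(), value.strip()
    return body, None
```

For example, `classify_id("1-2")` gives the component indexes `[0, 1]` that a tokenRefSequence would cover, and `parse_comment("# newpar")` gives `("newpar", None)`.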

CoNLL-2003 (NER)

| CoNLL-2003 Column | Layers Equivalent | Notes |
| --- | --- | --- |
| Word | token.text | Surface form. |
| POS tag | annotationLayer(kind="token-tag", subkind="pos") | POS tag. |
| Chunk tag | annotationLayer(kind="token-tag", subkind="chunk") | IOB chunk tag. |
| NER tag | annotationLayer(kind="token-tag", subkind="ner") | IOB NER tag. Can also be converted to kind="span", subkind="entity-mention" with token spans. |
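
The conversion from IOB tags to token spans mentioned in the last row can be sketched as follows. The function name `iob_to_spans` and the `(start, end, label)` tuple shape are illustrative assumptions; the sketch accepts both IOB1-style input (an entity may begin with `I-`) and IOB2-style input (every entity begins with `B-`).

```python
def iob_to_spans(tags):
    """Convert per-token IOB tags to (start, end_exclusive, label) spans.

    Token indexes are 0-based. A span is opened by B-X, or by I-X when no
    span of type X is currently open.
    """
    spans = []
    start = label = None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and label == kind
        if not continues:
            if label is not None:
                spans.append((start, i, label))   # close the open span
            if prefix in ("B", "I"):
                start, label = i, kind            # open a new span
            else:
                start = label = None              # 'O' or anything else
    if label is not None:
        spans.append((start, len(tags), label))   # span running to sentence end
    return spans
```

Each resulting tuple could then back one annotation in a kind="span", subkind="entity-mention" layer.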

CoNLL-2005 / CoNLL-2009 (SRL)

| CoNLL-200x Column | Layers Equivalent | Notes |
| --- | --- | --- |
| Predicate | annotationLayer(kind="span", subkind="predicate") | Predicate identification. |
| Predicate sense | annotationLayer(kind="span", subkind="frame", formalism="PropBank") → annotation.label | PropBank sense (roleset ID). |
| Argument columns | annotation.arguments[] with argumentRef.role | Each argument column (ARG0, ARG1, ARGM-TMP, etc.) becomes an argumentRef on the frame annotation. |
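
A minimal sketch of the argument-column mapping, assuming the CoNLL-2009 layout (PRED in the 14th column, followed by one APRED column per predicate in order of appearance). CoNLL-2005 uses a bracketed span notation that this sketch does not cover. The function name `frames_from_conll2009` and the dictionary keys other than `label`, `arguments`, and `role` are hypothetical.

```python
PRED_COL = 13      # 0-based index of the PRED column in CoNLL-2009
APRED_START = 14   # first APRED column; one column per predicate


def frames_from_conll2009(rows):
    """Build one frame per predicate from the split rows of a single sentence.

    rows: list of column lists, one per token, in sentence order.
    """
    predicates = [
        (i, row[PRED_COL]) for i, row in enumerate(rows) if row[PRED_COL] != "_"
    ]
    frames = []
    for n, (token_index, sense) in enumerate(predicates):
        column = APRED_START + n   # the n-th predicate owns the n-th APRED column
        arguments = [
            {"role": row[column], "tokenIndex": i}
            for i, row in enumerate(rows)
            if row[column] != "_"
        ]
        frames.append({
            "label": sense,                       # roleset ID, e.g. chase.01
            "predicateTokenIndex": token_index,   # 0-based
            "arguments": arguments,
        })
    return frames
```

In CoNLL-2009 each argument is a single head token, so the sketch records one token index per argument instead of a span.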

CoNLL-2012 (OntoNotes Coreference)

| CoNLL-2012 Feature | Layers Equivalent | Notes |
| --- | --- | --- |
| Coreference column | pub.layers.annotation.clusterSet with kind="coreference" | Parenthetical coreference notation (e.g., (12) for a single-token mention, or (12 and 12) opening and closing a multi-token mention) maps to cluster membership. Each cluster ID becomes a cluster with memberIds pointing to span annotations. |
| Speaker column | annotationLayer(kind="token-tag", subkind="speaker") | Speaker diarization. |
| Named entity spans | annotationLayer(kind="span", subkind="entity-mention") | Entity mention spans. |
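
To make the coreference mapping concrete, here is a small sketch that reads the coreference column of one sentence and groups mention spans by cluster ID. The name `coref_clusters` and the return shape are assumptions for illustration; the sketch expects well-formed input and raises `IndexError` on a closing bracket with no matching opener.

```python
from collections import defaultdict


def coref_clusters(column):
    """Group mention spans by cluster ID.

    column: one coreference cell per token, e.g. '(12', '12)', '(12)', '-'.
    Several mentions in one cell are separated by '|'.
    Returns {cluster_id: [(start, end_inclusive), ...]} with 0-based indexes.
    """
    open_starts = defaultdict(list)   # cluster ID -> stack of open mention starts
    clusters = defaultdict(list)
    for i, cell in enumerate(column):
        if cell in ("-", "_"):
            continue
        for part in cell.split("|"):
            cluster_id = int(part.strip("()"))
            if part.startswith("("):
                open_starts[cluster_id].append(i)
            if part.endswith(")"):
                start = open_starts[cluster_id].pop()
                clusters[cluster_id].append((start, i))
    return dict(clusters)
```

Each `(start, end)` pair would correspond to one span annotation, and each key to one cluster whose memberIds point at those annotations.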

Conversion Notes

CoNLL formats are flat, token-per-line representations. Converting to Layers requires:

  1. Create a document-level pub.layers.expression.expression with kind="document"
  2. For each sentence (blank-line delimited), create a pub.layers.expression.expression with kind="sentence" and parentRef pointing to the document expression
  3. For each sentence, create a pub.layers.segmentation.segmentation record with the tokenization (kind="whitespace" or appropriate strategy) and expressionRef pointing to the sentence expression
  4. For each annotation column, create a separate annotationLayer with appropriate kind/subkind
  5. IOB/BILOU tags can remain as token-tags or be converted to span annotations. Layers supports both.
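
The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reference converter: records are plain dictionaries, the `id` values, the `tokens` and `annotations` keys, and the function names are hypothetical, one layer is created per sentence and column, and multi-word ranges and empty nodes are skipped.

```python
def split_sentences(text):
    """Yield lists of non-blank lines, one list per blank-line-delimited sentence."""
    block = []
    for line in text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def conllu_to_records(text, layer_columns=None):
    # subkind -> 0-based CoNLL-U column index; UPOS and XPOS by default.
    layer_columns = layer_columns or {"pos": 3, "xpos": 4}
    document = {"id": "doc", "kind": "document"}                  # step 1
    sentences, segmentations, layers = [], [], []
    for n, block in enumerate(split_sentences(text)):
        rows = [line.split("\t") for line in block if not line.startswith("#")]
        # Keep plain word lines only; ranges (1-2) and empty nodes (2.1) are skipped.
        rows = [r for r in rows if r[0].isdigit()]
        sent_id = f"sent-{n}"
        sentences.append(                                         # step 2
            {"id": sent_id, "kind": "sentence", "parentRef": "doc"}
        )
        segmentations.append({                                    # step 3
            "expressionRef": sent_id,
            "kind": "whitespace",
            "tokens": [{"tokenIndex": i, "text": r[1]} for i, r in enumerate(rows)],
        })
        for subkind, col in layer_columns.items():                # step 4
            layers.append({
                "expressionRef": sent_id,
                "kind": "token-tag",
                "subkind": subkind,
                "annotations": [
                    {"tokenIndex": i, "label": r[col]}
                    for i, r in enumerate(rows) if r[col] != "_"
                ],
            })
    return {"document": document, "sentences": sentences,
            "segmentations": segmentations, "layers": layers}
```

Passing a different `layer_columns` mapping, for example `{"ner": 3}` for a four-column CoNLL-2003 file, selects which columns become token-tag layers.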

The reverse conversion (Layers → CoNLL) selects the appropriate annotation layers and serializes them column-by-column.
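
A minimal sketch of that serialization for CoNLL-U, assuming the selected layers have already been flattened into per-column lookups keyed by 0-based token index. The function name `to_conllu` and the input shapes are hypothetical; columns with no selected layer are written as `_`.

```python
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def to_conllu(tokens, columns):
    """Serialize one sentence to CoNLL-U text.

    tokens:  [{"tokenIndex": 0, "text": "Dogs"}, ...]
    columns: {"UPOS": {0: "NOUN"}, "HEAD": {0: 1, 1: None}, ...}
             HEAD values are 0-based head indexes; None marks the root.
    """
    lines = []
    for token in sorted(tokens, key=lambda t: t["tokenIndex"]):
        i = token["tokenIndex"]
        row = {name: "_" for name in COLUMNS}
        row["ID"] = str(i + 1)                 # 0-based -> 1-based
        row["FORM"] = token["text"]
        for name, values in columns.items():
            if i not in values:
                continue
            if name == "HEAD":
                head = values[i]
                row["HEAD"] = "0" if head is None else str(head + 1)
            else:
                row[name] = values[i]
        lines.append("\t".join(row[name] for name in COLUMNS))
    # A blank line terminates the sentence.
    return "\n".join(lines) + "\n\n"
```

The index shift is the inverse of the one applied on import: token and head indexes go back to 1-based values, and an absent head becomes 0.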