CoNLL Formats

Overview

CoNLL formats are column-based annotation formats widely used in NLP shared tasks. Columns are tab-separated in CoNLL-U and whitespace-separated in some older variants such as CoNLL-2003. CoNLL-U (Universal Dependencies) is the most common variant today. Each line represents one token, with one annotation per column. CoNLL formats are flat, single-file representations optimized for machine learning pipelines.

CoNLL-U (Universal Dependencies)

Column Mapping

| CoNLL-U Column | Layers Equivalent | Notes |
| --- | --- | --- |
| ID | pub.layers.expression token tokenIndex | Token position. CoNLL-U uses 1-based; Layers uses 0-based. |
| FORM | pub.layers.expression token text | Surface form. |
| LEMMA | annotationLayer(kind="token-tag", subkind="lemma") → annotation.value | Lemmatization layer. |
| UPOS | annotationLayer(kind="token-tag", subkind="pos") → annotation.label | Universal POS tag. |
| XPOS | annotationLayer(kind="token-tag", subkind="xpos") → annotation.label | Language-specific POS tag. |
| FEATS | annotationLayer(kind="token-tag", subkind="morph") → annotation.features | Morphological features (e.g., Case=Nom\|Number=Sing). Each feature key-value pair maps to a feature entry. |
| HEAD | annotationLayer(kind="graph", subkind="dependency") → annotation.headIndex | Governor token index. CoNLL-U's 0 (root) maps to headIndex absent or a sentinel. |
| DEPREL | Same dependency layer → annotation.label | Dependency relation label. |
| DEPS | annotationLayer(kind="graph", subkind="enhanced-dependency") | Enhanced dependencies (multiple heads). Each head:deprel pair creates a separate annotation in the enhanced layer. |
| MISC | pub.layers.defs#featureMap on the token or annotation | Catch-all for SpaceAfter=No, Translit=..., etc. |
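
The column mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the Layers API: the output is a plain dictionary whose keys (tokenIndex, headIndex, features, and so on) are hypothetical names borrowed from the table, and the function only handles ordinary word lines with integer IDs.

```python
def parse_feats(raw: str) -> dict[str, str]:
    """Split 'Case=Nom|Number=Sing' into {'Case': 'Nom', 'Number': 'Sing'}."""
    if raw == "_":
        return {}
    features = {}
    for item in raw.split("|"):
        key, _, value = item.partition("=")
        features[key] = value
    return features


def parse_deps(raw: str) -> list[dict]:
    """Split DEPS ('3:nsubj|5:conj') into one entry per head:deprel pair."""
    if raw == "_":
        return []
    deps = []
    for item in raw.split("|"):
        # Partition on the first colon only: relations such as
        # 'nsubj:pass' contain colons themselves.
        head, _, label = item.partition(":")
        deps.append({"head": head, "label": label})  # head kept as the raw CoNLL-U ID
    return deps


def blank_to_none(value: str):
    """CoNLL-U writes '_' for an unspecified value."""
    return None if value == "_" else value


def parse_word_line(line: str) -> dict:
    """Map one 10-column CoNLL-U word line to a dict (hypothetical key names)."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    tok_id, form, lemma, upos, xpos, feats, head, deprel, deps, misc = cols
    return {
        "tokenIndex": int(tok_id) - 1,  # 1-based -> 0-based
        "text": form,
        "lemma": blank_to_none(lemma),
        "upos": blank_to_none(upos),
        "xpos": blank_to_none(xpos),
        "features": parse_feats(feats),
        # Assumption: the root (HEAD = 0) is represented by an absent head.
        "headIndex": None if head in ("0", "_") else int(head) - 1,
        "deprel": blank_to_none(deprel),
        "enhanced": parse_deps(deps),
        "misc": parse_feats(misc),
    }
```

Treating every "_" as missing is a simplification: a token whose surface form or lemma is a literal underscore would need special handling.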

Special Token Types

| CoNLL-U Feature | Layers Equivalent | Notes |
| --- | --- | --- |
| Multi-word tokens (e.g., 1-2 del) | pub.layers.defs#tokenRefSequence | Multi-word token ranges are represented as a tokenRefSequence with tokenIndexes covering the component tokens. The surface form and span of the multi-word token are stored in features. |
| Empty nodes (e.g., 2.1) | pub.layers.annotation#annotation with features | Empty nodes in enhanced UD are represented as annotations (not tokens) in the enhanced dependency layer, with features indicating they are empty/null nodes. Their position is tracked via decimal indices stored in features. |
| Sentence boundaries | pub.layers.expression (kind: sentence) + pub.layers.segmentation | CoNLL-U blank lines between sentences map to sentence boundaries in the segmentation record. |
| # text = ... comment | pub.layers.expression sentence features or pub.layers.expression.text | Sentence-level metadata from comments. |
| # sent_id = ... comment | pub.layers.expression sentence uuid | Sentence identifier. |
| # newpar / # newdoc | pub.layers.expression (kind: section) boundaries | Paragraph and document boundaries. |
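
Before any of these mappings can be applied, a converter has to tell the three kinds of ID apart and read the comment lines. The sketch below shows one way to do that; the returned dictionaries and the emptyNodeId feature name are illustrative assumptions, not part of the Layers schema.

```python
import re

RANGE_ID = re.compile(r"^(\d+)-(\d+)$")   # multi-word token, e.g. 1-2
EMPTY_ID = re.compile(r"^(\d+)\.(\d+)$")  # empty node, e.g. 2.1


def classify_id(tok_id: str) -> dict:
    """Classify a CoNLL-U ID as a word, a multi-word range, or an empty node."""
    match = RANGE_ID.match(tok_id)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        # 1-based inclusive range -> 0-based indexes of the component tokens
        return {"type": "multiword", "tokenIndexes": list(range(start - 1, end))}
    if EMPTY_ID.match(tok_id):
        # The decimal index is kept verbatim as a feature (hypothetical name).
        return {"type": "empty", "features": {"emptyNodeId": tok_id}}
    return {"type": "word", "tokenIndex": int(tok_id) - 1}


def parse_comment(line: str) -> tuple:
    """Return (key, value) for '# key = value', or (key, None) for flags like '# newpar'."""
    body = line.lstrip("#").strip()
    key, sep, value = body.partition("=")
    return (key.strip(), value.strip() if sep else None)
```

Splitting on the first "=" only keeps sentence text that itself contains an equals sign intact.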

CoNLL-2003 (NER)

| CoNLL-2003 Column | Layers Equivalent | Notes |
| --- | --- | --- |
| Word | token.text | Surface form. |
| POS tag | annotationLayer(kind="token-tag", subkind="pos") | POS tag. |
| Chunk tag | annotationLayer(kind="token-tag", subkind="chunk") | IOB chunk tag. |
| NER tag | annotationLayer(kind="token-tag", subkind="ner") | IOB NER tag. Can also be converted to kind="span", subkind="entity-mention" with token spans. |
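
The optional conversion from IOB tags to entity-mention spans might look like the following sketch. It accepts both conventions, where an entity opens with B- and where it opens with I-. The span dictionaries (label, tokenIndexes) use hypothetical field names.

```python
def iob_to_spans(tags: list[str]) -> list[dict]:
    """Convert per-token IOB tags into spans of 0-based token indexes."""
    spans = []
    current = None  # the span currently being extended, if any
    for index, tag in enumerate(tags):
        if tag == "O":
            current = None
            continue
        prefix, _, label = tag.partition("-")
        # Open a new span on B-, after an O, or when the entity type changes.
        if prefix == "B" or current is None or current["label"] != label:
            current = {"label": label, "tokenIndexes": [index]}
            spans.append(current)
        else:
            current["tokenIndexes"].append(index)
    return spans
```

The same function works for the chunk column, since it uses the same tagging scheme.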

CoNLL-2005 / CoNLL-2009 (SRL)

| CoNLL-200x Column | Layers Equivalent | Notes |
| --- | --- | --- |
| Predicate | annotationLayer(kind="span", subkind="predicate") | Predicate identification. |
| Predicate sense | annotationLayer(kind="span", subkind="frame", formalism="PropBank") → annotation.label | PropBank sense (roleset ID). |
| Argument columns | annotation.arguments[] with argumentRef.role | Each argument column (ARG0, ARG1, ARGM-TMP, etc.) becomes an argumentRef on the frame annotation. |
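
As a rough illustration for the dependency-based CoNLL-2009 layout only (CoNLL-2005 uses bracketed argument spans and needs different handling), the sketch below pairs each predicate with its argument column. It relies on the CoNLL-2009 convention that the n-th predicate in a sentence owns the n-th APRED column. The frame dictionaries are hypothetical stand-ins for frame annotations.

```python
PRED = 13         # 0-based index of the PRED column in CoNLL-2009
FIRST_APRED = 14  # APRED1 follows PRED; one APRED column per predicate


def frames_from_conll2009(rows: list[list[str]]) -> list[dict]:
    """Build one frame per predicate from the already-split rows of a sentence."""
    predicate_rows = [i for i, cols in enumerate(rows) if cols[PRED] != "_"]
    frames = []
    for n, pred_index in enumerate(predicate_rows):
        column = FIRST_APRED + n  # the argument column owned by this predicate
        arguments = [
            {"role": cols[column], "tokenIndex": i}
            for i, cols in enumerate(rows)
            if cols[column] != "_"
        ]
        frames.append({
            "label": rows[pred_index][PRED],  # roleset ID, e.g. sleep.01
            "tokenIndex": pred_index,
            "arguments": arguments,
        })
    return frames
```

Each argument here points at a single head token; expanding it to a full span would require the dependency tree.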

CoNLL-2012 (OntoNotes Coreference)

| CoNLL-2012 Feature | Layers Equivalent | Notes |
| --- | --- | --- |
| Coreference column | pub.layers.annotation#clusterSet with kind="coreference" | Parenthetical coreference notation (e.g., "(12)" for a single-token mention, "(12" to open a mention, "12)" to close it) maps to cluster membership. Each cluster ID becomes a cluster with memberIds pointing to span annotations. |
| Speaker column | annotationLayer(kind="token-tag", subkind="speaker") | Speaker diarization. |
| Named entity spans | annotationLayer(kind="span", subkind="entity-mention") | Entity mention spans. |
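
Reading the coreference column amounts to matching opening and closing brackets per cluster ID. The following sketch assumes one cell per token, with "-" for tokens outside any mention and "|" separating several markers in one cell. It returns plain (start, end) token index pairs per cluster rather than real clusterSet records.

```python
from collections import defaultdict


def parse_coref_column(cells: list[str]) -> dict[int, list[tuple[int, int]]]:
    """Map each cluster ID to its mentions as inclusive 0-based (start, end) pairs."""
    clusters = defaultdict(list)
    open_starts = defaultdict(list)  # cluster ID -> stack of open mention starts
    for index, cell in enumerate(cells):
        if cell in ("-", "_"):
            continue
        for part in cell.split("|"):
            cluster_id = int(part.strip("()"))
            if part.startswith("("):
                open_starts[cluster_id].append(index)
            if part.endswith(")"):
                # Raises IndexError on unbalanced input, which is left unhandled here.
                start = open_starts[cluster_id].pop()
                clusters[cluster_id].append((start, index))
    return dict(clusters)
```

A stack per cluster ID is used so that a mention nested inside another mention of the same cluster still closes correctly.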

Conversion Notes

CoNLL formats are flat, token-per-line representations. Converting to Layers requires:

  1. Parse the file into tokens, creating a tokenization with kind="whitespace" or another appropriate strategy
  2. Group tokens into sentences (blank-line delimited), creating sentence objects
  3. Wrap in a segmentation record bound to an expression
  4. For each annotation column, create a separate annotationLayer with appropriate kind/subkind
  5. Keep IOB/BILOU tags as token-tags or convert them to span annotations; Layers supports both
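
Steps 1 and 2 can be sketched as a single pass over the file. The example assumes CoNLL-U conventions (tab-separated columns, comment lines starting with "#"); older formats that use spaces or treat "#" as an ordinary token would need a different splitter. The sentence dictionaries are illustrative only.

```python
def split_sentences(text: str) -> list[dict]:
    """Group lines into sentences, using blank lines as delimiters."""
    sentences = []
    comments, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if it has any tokens.
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            tokens.append(line.split("\t"))
    if tokens:  # the file may not end with a blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences
```

Each token is left as a list of raw column strings, so the per-column layers in step 4 can be built from the same structure.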

The reverse conversion (Layers → CoNLL) selects the appropriate annotation layers and serializes them column-by-column.
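
For the reverse direction, a minimal sketch of serializing one token back to a CoNLL-U line is shown below. The input dictionary and its keys are hypothetical, DEPS and MISC are left unspecified, and an absent headIndex is assumed to mean the root.

```python
def format_feats(features: dict[str, str]) -> str:
    """Join features as 'Key=Value' pairs, or '_' when there are none."""
    if not features:
        return "_"
    # Sorted by feature name, ignoring case.
    ordered = sorted(features.items(), key=lambda pair: pair[0].lower())
    return "|".join(f"{key}={value}" for key, value in ordered)


def to_conllu_line(token: dict) -> str:
    """Serialize one token dict (hypothetical keys) as a 10-column CoNLL-U line."""
    head = token.get("headIndex")
    columns = [
        str(token["tokenIndex"] + 1),            # 0-based -> 1-based ID
        token["text"],
        token.get("lemma") or "_",
        token.get("upos") or "_",
        token.get("xpos") or "_",
        format_feats(token.get("features", {})),
        "0" if head is None else str(head + 1),  # assumption: no head means root
        token.get("deprel") or "_",
        "_",                                     # DEPS not serialized in this sketch
        "_",                                     # MISC not serialized in this sketch
    ]
    return "\t".join(columns)
```

A full exporter would also emit comment lines and a blank line after each sentence.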