
Query and Discovery Patterns

The appview answers two categories of questions: retrieval (get a specific record by its AT-URI) and discovery (find records matching some criteria). Retrieval hits PostgreSQL directly. Discovery queries fan out across PostgreSQL, Elasticsearch, and Neo4j depending on the query shape.

Service Layer

Query logic is encapsulated in service classes in src/services/, matching Chive's pattern:

| Service | File | Responsibility |
| --- | --- | --- |
| SearchService | src/services/search/search-service.ts | Full-text search, faceted filtering via Elasticsearch |
| RankingService | src/services/search/ranking-service.ts | Result scoring by confidence, recency, persona reputation |
| AutocompleteService | src/services/search/autocomplete-service.ts | Expression text, ontology names, label value completion |
| QueryCache | src/services/search/query-cache.ts | Redis-backed TTL cache for ES query results |
| DiscoveryService | src/services/discovery/discovery-service.ts | Recommendations: "similar annotations", "related corpora" |

All service methods return Result<T, LayersError> and are injected via tsyringe.
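
As a rough illustration of the Result<T, LayersError> convention, the sketch below shows one way such a return type could look. The source does not define Result or LayersError, so the field names (ok, value, error, code, message) and the getRecord helper are assumptions made for this example only; tsyringe registration is omitted to keep the block self-contained.

```typescript
// Hypothetical shapes: the source names these types but does not define them.
interface LayersError {
  code: string;
  message: string;
}

type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Stand-in for a service method. A real service would be resolved through
// tsyringe and would query a backend rather than an in-memory map.
function getRecord(
  store: Map<string, unknown>,
  uri: string,
): Result<unknown, LayersError> {
  if (!uri.startsWith("at://")) {
    return err({ code: "INVALID_URI", message: `not an AT-URI: ${uri}` });
  }
  const record = store.get(uri);
  if (record === undefined) {
    return err({ code: "NOT_FOUND", message: `no record at ${uri}` });
  }
  return ok(record);
}
```

Callers branch on the ok flag instead of catching exceptions, which keeps failure modes visible in the method signature.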

Discovery Use Cases

| Use Case | Primary Backend | Query Shape |
| --- | --- | --- |
| Find all annotations on a given expression | PG | WHERE expression_ref = $1 on annotation_layers |
| Find all expressions in a given corpus | PG + Neo4j | corpus_memberships join or Neo4j MEMBER_OF traversal |
| Find all annotation layers using a given ontology | PG | WHERE ontology_ref = $1 on annotation_layers |
| Find all entities grounded to a Wikidata QID | Neo4j | KNOWLEDGE_REF edge traversal from a knowledge node |
| Find all annotations in Universal Dependencies formalism | ES | Faceted filter on formalism = "universal-dependencies" |
| Find all experiments measuring acceptability | ES | Faceted filter on measureType = "acceptability" |
| Find all corpora in a given language | ES | Keyword filter on language |
| Find all data linked to a given eprint | PG + Neo4j | cross_references WHERE target_uri = $eprint or LINKS_EPRINT traversal |
| Find all annotations by a specific persona | PG | WHERE persona_ref = $1 on annotation_layers |
| Find the graph neighborhood of a node | Neo4j | Cypher variable-length path query |
| Find all changes to a given record | PG + ES | changelogs WHERE subject_uri = $1 or ES filter on subject |
| Find recent changes across a collection type | ES | Faceted filter on subjectCollection, sorted by createdAt |

Query Implementation Patterns

Single-Record Retrieval

Every get* XRPC endpoint resolves to a PostgreSQL primary key lookup:

SELECT record FROM expressions WHERE uri = $1;

Expected latency: < 5ms for indexed lookups.

Paginated Collection Listing

Every list* XRPC endpoint paginates with a cursor over a user's records:

SELECT uri, record
FROM expressions
WHERE did = $1
  AND uri > $2 -- cursor: last URI of the previous page
ORDER BY uri ASC
LIMIT $3;
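
The cursor handling around this query could be sketched as follows. The helper names, the MAX_PAGE_SIZE cap, and the use of an empty string as the first-page cursor are assumptions for illustration; the source does not say how the appview encodes cursors or bounds page sizes.

```typescript
interface Row {
  uri: string;
  record: unknown;
}

interface ListQuery {
  text: string;
  values: [string, string, number];
}

// Assumed cap; the source does not state a maximum page size.
const MAX_PAGE_SIZE = 100;

function buildListQuery(
  did: string,
  cursor: string | undefined,
  limit: number,
): ListQuery {
  // Fall back to 1 for NaN or 0, then clamp into [1, MAX_PAGE_SIZE].
  const requested = Math.trunc(limit) || 1;
  const pageSize = Math.min(Math.max(1, requested), MAX_PAGE_SIZE);
  return {
    text:
      "SELECT uri, record FROM expressions " +
      "WHERE did = $1 AND uri > $2 ORDER BY uri ASC LIMIT $3",
    // An empty string sorts before every non-empty URI, so it selects the first page.
    values: [did, cursor ?? "", pageSize],
  };
}

// A full page suggests more rows may follow; a short page ends the listing.
function nextCursor(rows: Row[], pageSize: number): string | undefined {
  const last = rows[rows.length - 1];
  return last !== undefined && rows.length === pageSize ? last.uri : undefined;
}
```

Because the cursor is the last URI seen rather than an offset, each page is an index range scan and stays cheap no matter how deep the client paginates.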

Full-Text Search

Elasticsearch powers the /api/v1/search endpoint:

{
  "query": {
    "bool": {
      "must": [
        { "multi_match": {
          "query": "syntactic ambiguity",
          "fields": ["text^3", "text.stemmed"]
        }}
      ],
      "filter": [
        { "term": { "lang": "en" } }
      ]
    }
  }
}

The text field uses a custom layers_text analyzer with ICU tokenization and Unicode normalization. The text.stemmed sub-field applies language-specific stemming.
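
A request body like the one above could be assembled from user input with a small builder. This is a sketch only: buildSearchQuery and its types are hypothetical names, and the field boosts are copied from the example rather than taken from the actual SearchService.

```typescript
type TermFilter = { term: Record<string, string> };

interface SearchBody {
  query: {
    bool: {
      must: Array<{ multi_match: { query: string; fields: string[] } }>;
      filter: TermFilter[];
    };
  };
}

function buildSearchQuery(text: string, lang?: string): SearchBody {
  // The language restriction goes in filter context: it does not affect
  // scoring, and Elasticsearch can cache filter clauses.
  const filter: TermFilter[] = lang ? [{ term: { lang } }] : [];
  return {
    query: {
      bool: {
        must: [
          {
            multi_match: {
              query: text,
              // Exact-text matches are weighted 3x over stemmed matches.
              fields: ["text^3", "text.stemmed"],
            },
          },
        ],
        filter,
      },
    },
  };
}
```

Calling buildSearchQuery("syntactic ambiguity", "en") reproduces the structure shown above; omitting the language yields an empty filter list.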

Faceted Search

The three-dimensional annotation search (kind, subkind, formalism) combines ES term filters with a terms aggregation:

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "kind": "span" } },
        { "term": { "subkind": "ner" } },
        { "term": { "formalism": "ontonotes" } }
      ]
    }
  },
  "aggs": {
    "annotations": {
      "nested": { "path": "annotations" },
      "aggs": {
        "by_label": {
          "terms": { "field": "annotations.label", "size": 50 }
        }
      }
    }
  }
}

This returns matching annotation layers and a label distribution histogram in a single request.
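
On the client side, the label distribution could be read out of the by_label buckets roughly as follows. Elasticsearch terms aggregations report each bucket as a key and a doc_count; the toHistogram helper and the sample data in the usage note are illustrative assumptions, not actual output.

```typescript
// Shape of one bucket in an Elasticsearch terms aggregation response.
interface TermsBucket {
  key: string;
  doc_count: number;
}

// Collapse the bucket list into a label -> count map.
function toHistogram(buckets: TermsBucket[]): Map<string, number> {
  const histogram = new Map<string, number>();
  for (const bucket of buckets) {
    histogram.set(bucket.key, (histogram.get(bucket.key) ?? 0) + bucket.doc_count);
  }
  return histogram;
}

// Share of all counted annotations that carry the given label, in [0, 1].
function labelShare(histogram: Map<string, number>, label: string): number {
  let total = 0;
  histogram.forEach((count) => {
    total += count;
  });
  return total === 0 ? 0 : (histogram.get(label) ?? 0) / total;
}
```

For example, buckets of PERSON with 30 documents and ORG with 10 give labelShare(histogram, "PERSON") of 0.75.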

Graph Traversal

Neo4j handles multi-hop queries that would require expensive recursive CTEs in PostgreSQL:

// Find annotations grounded to a Wikidata entity, with their layers and expressions
MATCH (kb:KnowledgeNode {externalId: "Q76"})
      <-[:KNOWLEDGE_REF]-(ann:Annotation)
      -[:PART_OF]->(layer:AnnotationLayer)
      -[:ANNOTATES]->(expr:Expression)
RETURN layer.uri, expr.uri, ann.label
LIMIT 100

Cross-Reference Traversal

Forward References ("What does this record point to?")

SELECT to_uri, ref_type
FROM cross_references
WHERE from_uri = $1;

Reverse References ("What points to this record?")

SELECT from_uri, ref_type
FROM cross_references
WHERE to_uri = $1;

Transitive Closure ("All descendants of this expression")

Expression hierarchy traversal uses Neo4j's variable-length path syntax:

MATCH path = (root:Expression {uri: $1})-[:PARENT_OF*1..]->(desc:Expression)
RETURN desc.uri, length(path) AS depth
ORDER BY depth

This is faster than PostgreSQL recursive CTEs for deep hierarchies (documents with hundreds of nested paragraphs, sentences, and words).
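
To make the result shape concrete, here is an in-memory equivalent of the traversal: a breadth-first walk that returns every descendant with its depth. It assumes the hierarchy is a tree (each expression has at most one parent), and the children map stands in for PARENT_OF edges; none of these names come from the source.

```typescript
interface Descendant {
  uri: string;
  depth: number;
}

// children maps a parent URI to the URIs of its direct children (PARENT_OF edges).
function descendants(children: Map<string, string[]>, root: string): Descendant[] {
  const out: Descendant[] = [];
  const seen = new Set<string>([root]);
  let frontier: string[] = [root];
  // Each pass of the loop expands one more hop, so depth matches path length.
  for (let depth = 1; frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const parent of frontier) {
      for (const child of children.get(parent) ?? []) {
        if (seen.has(child)) continue; // guard against malformed cyclic data
        seen.add(child);
        out.push({ uri: child, depth });
        next.push(child);
      }
    }
    frontier = next;
  }
  return out;
}
```

Results come out ordered by depth, matching the ORDER BY in the Cypher query.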

Annotation-Specific Queries

By Kind/Subkind/Formalism

All three fields are keyword-indexed in Elasticsearch, enabling combinatorial filtering:

| Query | ES Filter |
| --- | --- |
| All POS layers | kind = "token-tag" AND subkind = "pos" |
| All NER layers in OntoNotes | subkind = "ner" AND formalism = "ontonotes" |
| All dependency parses | kind = "relation" AND subkind = "dependency" |
| All UD layers | formalism = "universal-dependencies" |
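
Since any subset of the three facets may be supplied, the filter list can be built by emitting one term clause per facet that is present. The following is a minimal sketch; Facets and buildFacetFilters are hypothetical names, not part of the documented service API.

```typescript
interface Facets {
  kind?: string;
  subkind?: string;
  formalism?: string;
}

type TermFilter = { term: Record<string, string> };

// Emit one term filter per supplied facet. Clauses in a bool.filter array
// are ANDed together by Elasticsearch.
function buildFacetFilters(facets: Facets): TermFilter[] {
  const fields = ["kind", "subkind", "formalism"] as const;
  const filters: TermFilter[] = [];
  for (const field of fields) {
    const value = facets[field];
    if (value !== undefined) {
      filters.push({ term: { [field]: value } });
    }
  }
  return filters;
}
```

For instance, the "All NER layers in OntoNotes" row corresponds to buildFacetFilters({ subkind: "ner", formalism: "ontonotes" }), which yields two term clauses to place inside bool.filter.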

By Label/Value

Individual annotation labels within layers are indexed as nested objects in ES:

{
  "query": {
    "nested": {
      "path": "annotations",
      "query": {
        "term": { "annotations.label": "PERSON" }
      }
    }
  }
}

By Confidence Threshold

{
  "query": {
    "nested": {
      "path": "annotations",
      "query": {
        "range": { "annotations.confidence": { "gte": 800 } }
      }
    }
  }
}

By Anchor Type

{
  "query": {
    "nested": {
      "path": "annotations",
      "query": {
        "term": { "annotations.anchor_type": "temporalSpan" }
      }
    }
  }
}

This finds annotations anchored to temporal regions (audio/video), as opposed to text spans or token references.
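
The label, confidence, and anchor-type conditions can also be combined. In Elasticsearch, clauses placed inside the same nested query must all match the same nested object, whereas separate nested queries may each be satisfied by a different annotation in the layer. The sketch below builds the single-nested form; the function and type names are assumptions for illustration.

```typescript
type NestedClause =
  | { term: Record<string, string> }
  | { range: Record<string, { gte: number }> };

// Wrap all clauses in one nested query so that every condition has to hold
// for the same annotation object, not merely somewhere in the same layer.
function buildNestedAnnotationQuery(clauses: NestedClause[]) {
  return {
    query: {
      nested: {
        path: "annotations",
        query: { bool: { filter: clauses } },
      },
    },
  };
}

// Example: annotations labeled PERSON with confidence of at least 800.
const personHighConfidence = buildNestedAnnotationQuery([
  { term: { "annotations.label": "PERSON" } },
  { range: { "annotations.confidence": { gte: 800 } } },
]);
```

Passing a single clause gives a query equivalent to the individual examples above, with the clause wrapped in a one-element bool filter.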

Graph Queries

Neighborhood Expansion

MATCH (n {uri: $1})-[r]-(neighbor)
RETURN type(r) AS edgeType, r.edgeType AS semanticType,
       neighbor.uri AS neighborUri, labels(neighbor) AS nodeLabels
LIMIT 50

Typed Traversal

Follow only edges of a specific type (e.g., only denotes edges):

MATCH (n {uri: $1})-[r:GRAPH_EDGE {edgeType: "denotes"}]->(target)
RETURN target.uri, target.name

Shortest Path

MATCH path = shortestPath(
  (a {uri: $1})-[*..10]-(b {uri: $2})
)
RETURN [n IN nodes(path) | n.uri] AS nodeUris,
       [r IN relationships(path) | type(r)] AS edgeTypes

Aggregation Queries

Label Distribution per Corpus

SELECT a.label, COUNT(*) AS count
FROM annotations a
JOIN annotation_layers al ON a.layer_uri = al.uri
JOIN cross_references cr ON cr.from_uri = al.expression_ref
JOIN corpus_memberships cm ON cm.expression_ref = cr.to_uri
WHERE cm.corpus_ref = $1
GROUP BY a.label
ORDER BY count DESC;

Annotation Coverage per Expression

SELECT al.kind, al.subkind, COUNT(*) AS layer_count
FROM annotation_layers al
WHERE al.expression_ref = $1
GROUP BY al.kind, al.subkind
ORDER BY layer_count DESC;

Caching Strategy

Redis caches frequently accessed data to reduce database load:

| Cache Key Pattern | TTL | Content |
| --- | --- | --- |
| record:{uri} | 5 min | Full record JSONB |
| refs:{uri} | 5 min | Cross-reference list for a record |
| search:{hash} | 1 min | ES search result page |
| corpus_stats:{uri} | 15 min | Materialized corpus statistics |

Cache invalidation: when a record is updated or deleted via the firehose, its cache key and related cache keys are evicted immediately.
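
The key patterns and TTLs above could be captured in a few helpers, sketched here without a Redis client so the block stays self-contained. Two parts are assumptions: the source does not say which keys count as "related" (this sketch treats refs:{uri} as the only one), and it does not specify the hash behind search:{hash} (an FNV-1a hash over sorted parameters stands in for it).

```typescript
// TTLs in seconds, taken from the cache key table.
const TTL_SECONDS = {
  record: 300,
  refs: 300,
  search: 60,
  corpus_stats: 900,
} as const;

type CacheKind = keyof typeof TTL_SECONDS;

const cacheKey = (kind: CacheKind, id: string): string => `${kind}:${id}`;

// Assumption: the related keys for a record are limited to its refs entry.
function keysToEvict(uri: string): string[] {
  return [cacheKey("record", uri), cacheKey("refs", uri)];
}

// Stand-in hash: 32-bit FNV-1a over the parameters in sorted key order, so
// the same search produces the same key regardless of parameter order.
function searchHash(params: Record<string, string | number>): string {
  const canonical = JSON.stringify(
    Object.keys(params).sort().map((key) => [key, params[key]]),
  );
  let hash = 0x811c9dc5;
  for (let i = 0; i < canonical.length; i++) {
    hash ^= canonical.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}
```

A firehose handler would pass keysToEvict(uri) to the cache's delete call, while search pages are never evicted explicitly and simply expire after their one-minute TTL.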

Future Considerations

  • Semantic search: ES dense_vector fields could enable vector-based semantic search over annotation label embeddings, complementing keyword-based faceting with similarity-based retrieval.
  • Learning-to-rank: A RelevanceLogger (analogous to Chive's) could collect click-through data on search results to train a learning-to-rank model for improved result ordering.

See Also