
Problem. Independent multilabel classifiers for Gene Ontology (GO) often violate the ancestor rule: if a child term is predicted “on,” all its ancestors must also be on. Curators then fix outputs by hand, and metrics inflate/deflate unpredictably.

This note shows fast, reproducible ways to enforce hierarchy consistency at inference time, plus alternatives and measurement guidance.


Background: what “consistency” means in GO

  • GO is a DAG with relations like is_a, part_of (and others, e.g., regulates).
  • In practice, for classification we usually propagate annotations along is_a and optionally selected transitive relations.
  • Consistency constraint: for any protein (i) and term (t), if (Y_{i,t}=1) then for every ancestor (a\in \mathrm{Anc}(t)), (Y_{i,a}=1).

You must state which relations you close under (e.g., is_a only, or is_a + part_of) and ensure your train/eval labels were also ancestor-closed.
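To make this concrete, here is a minimal sketch of computing ancestor sets, assuming you have already extracted a parents mapping restricted to the relations you close under; the toy term IDs and the ancestors helper are illustrative, not a particular library's API.

# Toy parents map under the chosen relations (is_a here); IDs are made up.
parents = {
    "GO:A": [],                 # root
    "GO:B": ["GO:A"],           # B is_a A
    "GO:C": ["GO:B"],           # C is_a B
}

def ancestors(term, parents):
    """All ancestors of `term` (excluding itself) under the closed relations."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(parents.get(a, []))
    return seen

assert ancestors("GO:C", parents) == {"GO:A", "GO:B"}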


Fast closure algorithm (set-based, post-threshold)

Ensures every predicted child has all ancestors after thresholding.

import numpy as np

def close_under_ancestors(scores, go_dag, thr):
    """
    scores: np.ndarray [N, C] class probabilities (0..1)
    go_dag: object exposing .ancestors(term) and .topo_order()
    thr:    float or per-class array; threshold(s) in [0,1]
    returns: np.uint8 matrix [N, C] with ancestor-closed multilabels
    """
    Y = (scores >= thr).astype(np.uint8)

    # parents appear before children in topo order
    order = go_dag.topo_order()           # e.g., list of term indices
    for t in reversed(order):             # visit children before parents
        for a in go_dag.ancestors(t):     # ancestors(t) excludes t itself
            # If child t is on, force ancestor a on
            Y[:, a] = np.maximum(Y[:, a], Y[:, t])
    return Y

When to use: simple, transparent, and fast for most setups; great as a default.
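A tiny usage sketch on a three-term chain; the GoDag class below is a hypothetical stand-in for whatever object you use to expose .ancestors() and .topo_order().

import numpy as np

class GoDag:
    def __init__(self, anc):             # anc: dict term index -> set of ancestor indices
        self._anc = anc
    def ancestors(self, t):
        return self._anc[t]
    def topo_order(self):
        # ancestor count is a valid topological key (an ancestor has strictly fewer ancestors)
        return sorted(self._anc, key=lambda t: len(self._anc[t]))

dag = GoDag({0: set(), 1: {0}, 2: {0, 1}})           # 0 = root, 2 = leaf
scores = np.array([[0.2, 0.3, 0.9]])                 # only the leaf clears thr = 0.5
print(close_under_ancestors(scores, dag, thr=0.5))   # -> [[1 1 1]]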


Vectorized form (fast for large C)

Precompute a binary ancestor matrix (A \in \{0,1\}^{C\times C}) where (A_{a,t}=1) if (a\in\mathrm{Anc}(t)).

# A_anc: [C, C] with 1 if row=ancestor, col=child
# Y:     [N, C] initial post-threshold labels
# Y_closed = Y OR (Y @ A_anc^T)
Y_closed = (Y | (Y.dot(A_anc.T) > 0)).astype(np.uint8)

This does the same “turn on all ancestors” in a single matrix multiply (Boolean semiring approximated with dot+>0). Build A_anc once from your DAG (careful with memory if C is large; use scipy.sparse).
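One way to assemble A_anc, as a sketch that reuses the hypothetical ancestors(term, parents) helper from the Background section and stores the result as a scipy.sparse CSR matrix:

import numpy as np
from scipy.sparse import csr_matrix

def build_ancestor_matrix(terms, parents):
    """A_anc[a, t] = 1 if term a is an ancestor of term t (CSR, shape [C, C])."""
    idx = {t: i for i, t in enumerate(terms)}
    rows, cols = [], []
    for t in terms:
        for a in ancestors(t, parents):
            rows.append(idx[a])   # ancestor index -> row
            cols.append(idx[t])   # descendant index -> column
    data = np.ones(len(rows), dtype=np.uint8)
    return csr_matrix((data, (rows, cols)), shape=(len(terms), len(terms)))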


Probability-space consistency (pre-threshold)

If you keep probabilities consistent before thresholding, you can apply class-specific thresholds later without re-violating parents.

Simple projection: for each edge (a\to t), enforce (p(t)\le p(a)):

import numpy as np

def monotone_project_probs(P, edges, iters=1):
    """
    P: np.ndarray [N, C] class probabilities
    edges: list of (parent, child) index pairs consistent with topo order
    """
    P = P.copy()
    for _ in range(iters):  # a few passes often suffice
        for a, t in edges:  # parent, child
            # clamp child to parent if it violates monotonicity
            P[:, t] = np.minimum(P[:, t], P[:, a])
    return P

  • Pros: thresholds won’t re-break the hierarchy.
  • Cons: order-dependent; a few passes help. For a stricter, order-invariant solution, see “Energy-based / projected inference” under Alternatives below.
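A quick usage check on a three-term chain (toy numbers); with edges listed in topological order of the child, a single pass already restores monotonicity:

import numpy as np

P = np.array([[0.4, 0.7, 0.9]])           # root, mid, leaf: children exceed parents
edges = [(0, 1), (1, 2)]                  # (parent, child), topo-ordered
print(monotone_project_probs(P, edges))   # -> [[0.4 0.4 0.4]]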

Alternatives (when you need more than post-hoc closure)

  1. Hierarchical loss (train-time): add a penalty or constraint that encourages (p(child)\le p(parent)).

    • Example: ( \lambda\sum_{(a,t)} \max(0, p_t - p_a)^2 ); a short sketch follows this list.
    • Pros: model learns constraints; fewer violations.
    • Cons: hyper-parameters; possible under-confidence for children.
  2. Lowest-ancestor prediction (LA): predict only the minimal set (no predicted term has a predicted child). Expand to ancestors for the final set.

    • Pros: small final label sets; curator-friendly.
    • Cons: can hurt recall for mid/upper-level terms if thresholds aren’t tuned well.
  3. Energy-based / projected inference: solve (\min_p \|p - \hat p\|^2) s.t. (p(child)\le p(parent)) over the DAG (isotonic regression on a DAG).

    • Pros: principled projection to the feasible set; order-invariant.
    • Cons: more complex (requires a DAG isotonic solver).
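For the hierarchical loss in item 1, here is a minimal PyTorch sketch of the penalty term; the edge tensor layout and the weight lam are assumptions, and the penalty would be added to whatever base loss (e.g., BCE) you train with.

import torch

def hierarchy_penalty(P, edges, lam=0.1):
    """Soft penalty encouraging p(child) <= p(parent).

    P:     [B, C] predicted probabilities
    edges: LongTensor [E, 2] of (parent, child) index pairs
    lam:   penalty weight (illustrative default)
    """
    parents, children = edges[:, 0], edges[:, 1]
    violation = torch.clamp(P[:, children] - P[:, parents], min=0.0)
    return lam * (violation ** 2).mean()

# usage: loss = bce_loss + hierarchy_penalty(P, edges)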

Where to apply closure: before vs after threshold

  • Pre-threshold (probabilities): monotone projection -> then per-class thresholds.

    • Advantage: any thresholding respects hierarchy; better calibration downstream.
    • Caveat: may shrink child probabilities; report calibration metrics.
  • Post-threshold (labels): fastest, easiest; takes your Y and enforces ancestor closure.

    • Advantage: robust, minimal surgery.
    • Caveat: metrics like precision/recall can jump because parents get turned on; report both pre- and post-closure metrics in ablations.

Recommendation: keep the post-threshold closure as your mainline (transparent), and optionally provide pre-threshold monotone projection for calibration-sensitive users.


Practical details you should document

  • Ontology scope: BP/MF/CC trained separately or jointly?
  • Relations closed: is_a only (common), or is_a + part_of (state it).
  • Obsoletes & alt IDs: remove obsoletes; map alt IDs to primaries before building DAG.
  • Cross-namespace edges: if present, specify handling.
  • Thresholds: global vs per-class; how tuned (Fmax on validation)?
  • Evidence filtering: whether IEA is allowed; keep train/eval policies consistent.
  • Time-based split: to avoid future leakage.
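As a small illustration of the obsolete/alt-ID bullet, a sketch of normalizing annotation term IDs before building the DAG; alt_to_primary and obsolete_ids are assumed to come from whichever ontology release you use (both names are illustrative).

def normalize_terms(term_ids, alt_to_primary, obsolete_ids):
    """Map alt IDs to primary IDs and drop obsolete terms."""
    out = []
    for t in term_ids:
        t = alt_to_primary.get(t, t)   # alt ID -> primary ID (identity if already primary)
        if t not in obsolete_ids:      # drop obsolete terms
            out.append(t)
    return out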

Complexity & data structures

  • Set-based closure: (O(N \times \mathrm{nnz}(A))) per batch as written (one O(N) update per (term, ancestor) pair); propagating only along parent edges in reverse topological order reduces this to (O(N \times E)), where (E) is the number of DAG edges. The vectorized form is also (O(N \times \mathrm{nnz}(A))).
  • Ancestor matrix build: one-time DAG transitive closure; use sparse matrices for (C\sim 5-10k).
  • Memory: a dense (C\times C) can be large; prefer scipy.sparse.csr_matrix.

Sparse example (post-threshold vectorized):

import numpy as np
from scipy.sparse import csr_matrix

# Build sparse ancestor matrix A_anc [C, C] once:
# rows = ancestor indices, cols = child indices, data = ones (placeholders here)
A_anc = csr_matrix((data, (rows, cols)), shape=(C, C))

def close_labels_sparse(Y):
    # Y: [N, C] uint8
    # Y_closed = Y OR (Y @ A_anc^T)  (Boolean-OR semantics)
    # wrap Y in csr_matrix: calling ndarray.dot with a sparse operand misbehaves
    propagated = (csr_matrix(Y).dot(A_anc.T).toarray() > 0).astype(np.uint8)
    return np.maximum(Y, propagated)

Torch variant (batched GPU)

import torch

# P: [B, C] probs; edges: LongTensor [E, 2] of (parent, child) pairs
def monotone_project_probs_torch(P, edges, iters=1):
    P = P.clone()
    parents, children = edges[:, 0], edges[:, 1]
    idx = children.unsqueeze(0).expand(P.shape[0], -1)       # [B, E] child columns
    for _ in range(iters):                                    # deep violation chains may need a few iters
        cand = torch.minimum(P[:, children], P[:, parents])   # per-edge min(child, parent)
        # scatter-min into the child columns; a plain indexed assignment is
        # nondeterministic when a child has several parents (duplicate column indices)
        P.scatter_reduce_(1, idx, cand, reduce="amin", include_self=True)
    return P

# Post-threshold closure using a sparse COO ancestor matrix A_anc: [C, C]
def close_labels_torch(Y, A_anc):
    # Y: [B, C] uint8 or bool; A_anc[a, t] = 1 if a is an ancestor of t
    # turn ancestor a on when any descendant t is on: propagated = (A_anc @ Y^T)^T > 0
    propagated = torch.sparse.mm(A_anc, Y.T.float()).T > 0
    return (Y.bool() | propagated).to(torch.uint8)

Measurement

Always report both the raw (possibly inconsistent) and the consistent results in ablations.

  • Violations (%): fraction of predicted labels with at least one ancestor off (raw predictions); should be ~0 after closure. A small sketch follows this list.
  • ΔFmax: change in Fmax per ontology after closure (often improves due to fewer false negatives on parents).
  • Calibration (ECE): pre- vs post-monotone projection if you use probability-space constraints.
  • Curator effort: time saved (qualitative) or % predictions already curator-consistent.
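A sketch of the violation metric, assuming a dense ancestor indicator A_anc as defined earlier (for large C you would use the sparse form instead):

import numpy as np

def violation_rate(Y, A_anc):
    """Fraction of 'on' labels with at least one ancestor 'off'.

    Y:     [N, C] uint8 post-threshold predictions
    A_anc: [C, C] 0/1 dense array, A_anc[a, t] = 1 if a is an ancestor of t
    """
    Y = Y.astype(np.int64)
    A = A_anc.astype(np.int64)
    n_anc = A.sum(axis=0)                    # number of ancestors per term, [C]
    on_anc = Y @ A                           # 'on' ancestors per (sample, term), [N, C]
    violated = (Y == 1) & (on_anc < n_anc)   # term on, but some ancestor off
    return violated.sum() / max(int(Y.sum()), 1)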

Ablation table template

| Method                         | Viol% | Fmax_BP  | Fmax_MF | Fmax_CC | micro-auPRC |
|--------------------------------|-------|----------|---------|---------|-------------|
| PLM + per-class thr (raw)      | 12.4  | 0.41     | 0.52    | 0.49    | 0.36        |
| + Post-thr closure (set)       | 0.0   | 0.43     | 0.54    | 0.50    | 0.37        |
| + Pre-thr monotone projection  | 0.0   | 0.44     | 0.55    | 0.51    | 0.38        |

When closure can hurt and how to mitigate

  • Over-activation of very general terms (high up the DAG) can reduce precision. Mitigation: use per-class thresholds (higher for general terms), or apply closure after minimal-set selection (lowest-ancestor strategy).

  • Noisy children forcing noisy parents on. Mitigation: apply probability-space projection first, which tends to dampen child scores that exceed parents.


Recommendations (practical defaults)

  1. Make training labels ancestor-closed.
  2. Use per-class thresholds tuned for Fmax on validation.
  3. At inference, apply post-threshold closure (vectorized) as your default.
  4. For better calibration, add one pass of pre-threshold monotone projection.
  5. Document: relations used, closure stage, and ablation results.

