
Problem. Independent multilabel classifiers for Gene Ontology (GO) often violate the ancestor rule: if a child term is predicted “on,” all its ancestors must also be on. Curators then fix outputs by hand, and metrics inflate/deflate unpredictably.

This note shows fast, reproducible ways to enforce hierarchy consistency at inference time, plus alternatives and measurement guidance.


Background: what “consistency” means in GO

  • GO is a DAG with relations like is_a, part_of (and others, e.g., regulates).
  • In practice, for classification we usually propagate annotations along is_a and optionally selected transitive relations.
  • Consistency constraint: for any protein (i) and term (t), if (Y_{i,t}=1) then for every ancestor (a\in \mathrm{Anc}(t)), (Y_{i,a}=1).

You must state which relations you close under (e.g., is_a only, or is_a + part_of) and ensure your train/eval labels were also ancestor-closed.
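
For concreteness, here is a minimal sketch of building those ancestor sets from a parent map, assuming you have already parsed the OBO file into a dict of direct parents restricted to your chosen relations; the `parents` dict and its three example terms are purely illustrative.

from functools import lru_cache

# parents: dict mapping each GO term to its direct parents, restricted to the
# relations you close under (is_a, optionally part_of). Illustrative entries only;
# build this from your own OBO parser.
parents = {
    "GO:0003674": set(),              # molecular_function (root)
    "GO:0003824": {"GO:0003674"},     # catalytic activity is_a molecular_function
    "GO:0016787": {"GO:0003824"},     # hydrolase activity is_a catalytic activity
}

def ancestor_sets(parents):
    """Transitive closure: term -> set of all ancestors (the term itself excluded)."""
    @lru_cache(maxsize=None)
    def anc(term):
        out = set()
        for p in parents[term]:
            out.add(p)
            out |= anc(p)
        return frozenset(out)
    return {t: set(anc(t)) for t in parents}

anc = ancestor_sets(parents)
# anc["GO:0016787"] == {"GO:0003824", "GO:0003674"}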


Fast closure algorithm (set-based, post-threshold)

Ensures every predicted child has all ancestors after thresholding.

import numpy as np

def close_under_ancestors(scores, go_dag, thr):
    """
    scores: np.ndarray [N, C] class probabilities (0..1)
    go_dag: object exposing .ancestors(term) and .topo_order()
    thr:    float or per-class array; threshold(s) in [0,1]
    returns: np.uint8 matrix [N, C] with ancestor-closed multilabels
    """
    Y = (scores >= thr).astype(np.uint8)

    # topo_order() lists parents before children; visiting in reverse
    # (children first) also works with direct parents only, and with full
    # ancestor sets the visiting order does not matter at all.
    order = go_dag.topo_order()           # e.g., list of term indices
    for t in reversed(order):             # visit children before parents
        for a in go_dag.ancestors(t):     # ancestors(t) excludes t itself
            # If child t is on, force ancestor a on
            Y[:, a] = np.maximum(Y[:, a], Y[:, t])
    return Y

When to use: simple, transparent, and fast for most setups; great as a default.
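
As a quick sanity check, a toy usage example on a hypothetical three-term chain; the indices and the _ToyDAG helper are made up for illustration.

import numpy as np

class _ToyDAG:
    # three-term chain: 0 is the root, 1 is_a 0, 2 is_a 1 (indices are illustrative)
    _anc = {0: set(), 1: {0}, 2: {0, 1}}
    def ancestors(self, t): return self._anc[t]
    def topo_order(self): return [0, 1, 2]

scores = np.array([[0.2, 0.3, 0.9]])                      # only the leaf clears 0.5
print(close_under_ancestors(scores, _ToyDAG(), thr=0.5))  # -> [[1 1 1]]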


Vectorized form (fast for large C)

Precompute a binary ancestor matrix (A\in\{0,1\}^{C\times C}) where (A_{a,t}=1) if (a\in\mathrm{Anc}(t)).

# A_anc: [C, C] with 1 if row=ancestor, col=child
# Y:     [N, C] initial post-threshold labels
# Y_closed = Y OR (Y @ A_anc^T)
# cast to int before the matmul so descendant counts cannot overflow uint8
Y_closed = (Y.astype(bool) | (Y.astype(np.int32) @ A_anc.T > 0)).astype(np.uint8)

This does the same “turn on all ancestors” in a single matrix multiply (Boolean semiring approximated with dot+>0). Build A_anc once from your DAG (careful with memory if C is large; use scipy.sparse).
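
A sketch of that one-time build, assuming the ancestor-set dict from the Background section and a `term_index` mapping from GO IDs to column indices (both names are illustrative):

import numpy as np
from scipy.sparse import csr_matrix

def build_ancestor_matrix(anc, term_index):
    """
    anc:        dict term -> set of ancestor terms (transitive closure)
    term_index: dict term -> column index in the score matrix (0..C-1)
    returns:    csr_matrix [C, C] with A[a, t] = 1 if a is an ancestor of t
    """
    C = len(term_index)
    rows, cols = [], []
    for term, ancestors in anc.items():
        t = term_index[term]
        for a in ancestors:
            rows.append(term_index[a])   # row = ancestor index
            cols.append(t)               # col = child index
    data = np.ones(len(rows), dtype=np.uint8)
    return csr_matrix((data, (rows, cols)), shape=(C, C))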


Probability-space consistency (pre-threshold)

If you keep probabilities consistent before thresholding, you can apply class-specific thresholds later without re-violating parents.

Simple projection: for each edge (a\to t), enforce (p(t)\le p(a)):

import numpy as np

def monotone_project_probs(P, edges, iters=1):
    """
    P:     np.ndarray [N, C] class probabilities
    edges: list of (parent, child) index pairs, ordered so parents'
           incoming edges are processed before their children's (topo order)
    iters: number of sweeps over the edge list
    """
    P = P.copy()
    for _ in range(iters):  # a few passes often suffice
        for a, t in edges:  # a = parent, t = child
            # clamp child to parent if it violates monotonicity
            P[:, t] = np.minimum(P[:, t], P[:, a])
    return P
  • Pros: thresholds won’t re-break the hierarchy.
  • Cons: order-dependent; a few passes help. For a stricter, order-invariant solution, see “Energy-based / projected inference” under Alternatives.

Alternatives (when you need more than post-hoc closure)

  1. Hierarchical loss (train-time): add a penalty or constraint that encourages (p(child)\le p(parent)).

    • Example penalty: ( \lambda\sum_{(a,t)} \max(0, p_t - p_a)^2 ) over parent-child edges (a, t); a PyTorch sketch follows this list.
    • Pros: model learns constraints; fewer violations.
    • Cons: hyper-parameters; possible under-confidence for children.
  2. Lowest-ancestor prediction (LA): predict only the minimal set (no predicted term has a predicted child). Expand to ancestors for the final set.

    • Pros: small final label sets; curator-friendly.
    • Cons: can hurt recall for mid/upper-level terms if thresholds aren’t tuned well.
  3. Energy-based / projected inference: solve (\min_p \|p - \hat p\|^2) s.t. (p(child)\le p(parent)) over the DAG (isotonic regression on a DAG).
    • Pros: principled projection to the feasible set; order-invariant.
    • Cons: more complex (requires a DAG isotonic solver).
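
For item 1, a minimal PyTorch sketch of the hinge-squared penalty; `probs` and `edges` follow the shapes used in the Torch variant later in this note, and the weight `lam` is a hyper-parameter you would tune.

import torch

def hierarchy_penalty(probs, edges, lam=0.1):
    """
    probs: [B, C] sigmoid outputs
    edges: LongTensor [E, 2] of (parent, child) index pairs
    lam:   penalty weight (hyper-parameter, illustrative default)
    Adds lam * sum over edges of max(0, p_child - p_parent)^2.
    """
    parents, children = edges[:, 0], edges[:, 1]
    viol = torch.clamp(probs[:, children] - probs[:, parents], min=0.0)
    return lam * (viol ** 2).sum(dim=1).mean()

# total_loss = bce_loss + hierarchy_penalty(probs, edges)   # illustrative usage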

Where to apply closure: before vs after threshold

  • Pre-threshold (probabilities): monotone projection -> then per-class thresholds.

    • Advantage: any thresholding respects hierarchy; better calibration downstream.
    • Caveat: may shrink child probabilities; report calibration metrics.
  • Post-threshold (labels): fastest, easiest; takes your Y and enforces ancestor closure.

    • Advantage: robust, minimal surgery.
    • Caveat: metrics like precision/recall can jump because parents get turned on; report both pre- and post-closure metrics in ablations.

Recommendation: keep the post-threshold closure as your mainline (transparent), and optionally provide pre-threshold monotone projection for calibration-sensitive users.
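
Putting that recommendation in one place, a small sketch that wires the pieces above together; `monotone_project_probs` and `close_under_ancestors` are the functions defined earlier, with `edges`, `go_dag`, and `thr` as described there.

def predict_consistent(scores, go_dag, edges, thr, project_probs=False):
    """Optional probability-space projection, then threshold + ancestor closure."""
    P = monotone_project_probs(scores, edges) if project_probs else scores
    return close_under_ancestors(P, go_dag, thr)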


Practical details you should document

  • Ontology scope: BP/MF/CC trained separately or jointly?
  • Relations closed: is_a only (common), or is_a + part_of (state it).
  • Obsoletes & alt IDs: remove obsoletes; map alt IDs to primaries before building the DAG (a small helper is sketched after this list).
  • Cross-namespace edges: if present, specify handling.
  • Thresholds: global vs per-class; how tuned (Fmax on validation)?
  • Evidence filtering: whether IEA is allowed; keep train/eval policies consistent.
  • Time-based split: to avoid future leakage.
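
For the obsoletes/alt-ID point above, a small illustrative helper; `alt_to_primary` and `obsolete` are assumed to come from your OBO parser.

def canonicalize_terms(labels, alt_to_primary, obsolete):
    """
    labels:         iterable of GO IDs annotated to one protein
    alt_to_primary: dict alt/secondary GO ID -> primary GO ID (from OBO alt_id lines)
    obsolete:       set of obsolete GO IDs
    """
    out = set()
    for t in labels:
        t = alt_to_primary.get(t, t)   # map secondary IDs to the primary ID
        if t not in obsolete:          # drop obsolete terms entirely
            out.add(t)
    return out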

Complexity & data structures

  • Set-based closure: (O(N \times \mathrm{nnz}(A))) per batch when iterating full ancestor sets (as in the code above), or (O(N \times E)) if you propagate along direct parent edges only, where (E) = number of DAG edges; the vectorized form is a single multiply with (\mathrm{nnz}(A)) nonzeros.
  • Ancestor matrix build: one-time DAG transitive closure; use sparse matrices for (C\sim 5-10k).
  • Memory: a dense (C\times C) can be large; prefer scipy.sparse.csr_matrix.

Sparse example (post-threshold vectorized):

import numpy as np
from scipy.sparse import csr_matrix

# Build the sparse ancestor matrix A_anc [C, C] once:
# rows = ancestor indices, cols = child indices, data = ones
data = np.ones(len(rows), dtype=np.uint8)
A_anc = csr_matrix((data, (rows, cols)), shape=(C, C))

def close_labels_sparse(Y):
    # Y: [N, C] uint8
    # Y_closed = Y OR (A_anc @ Y^T)^T   (Boolean-OR semantics)
    # call .dot on the sparse matrix: np.ndarray.dot does not understand
    # scipy.sparse operands; cast to int so counts cannot overflow uint8
    propagated = (A_anc.dot(Y.T.astype(np.int32)) > 0).astype(np.uint8).T
    return np.maximum(Y, propagated)

Torch variant (batched GPU)

import torch

# P: [B, C] probs; edges: LongTensor [E, 2] (parent, child) topo-sorted
def monotone_project_probs_torch(P, edges, iters=1):
    P = P.clone()
    parents, children = edges[:, 0], edges[:, 1]
    idx = children.unsqueeze(0)                  # [1, E], broadcast over the batch
    for _ in range(iters):
        # candidate bound per edge: a child may not exceed that parent
        cand = torch.minimum(P[:, children], P[:, parents])
        # scatter-min over duplicate child columns: GO children often have several
        # parents, and plain advanced-index assignment would keep only the last write
        P.scatter_reduce_(1, idx.expand_as(cand), cand, reduce="amin")
    return P

# Post-threshold closure using a sparse COO ancestor matrix A_anc: [C, C]
# (rows = ancestors, cols = children, as above)
def close_labels_torch(Y, A_anc):
    # Y: [B, C] uint8 or bool; A_anc: torch.sparse_coo_tensor
    # propagated[b, a] = True if any predicted term in row b has ancestor a
    propagated = torch.sparse.mm(A_anc, Y.float().T).T > 0
    return (Y.bool() | propagated).to(torch.uint8)

Measurement

Always report both the raw (possibly inconsistent) and the consistent results in ablations.

  • Violations (%): fraction of predicted child labels with at least one ancestor off (raw); exactly 0 after closure, by construction (a small checker is sketched after this list).
  • ΔFmax: change in Fmax per ontology after closure (often improves due to fewer false negatives on parents).
  • Calibration (ECE): pre- vs post-monotone projection if you use probability-space constraints.
  • Curator effort: time saved (qualitative) or % predictions already curator-consistent.
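
A sketch of the violation-rate check from the first bullet, assuming the dense `A_anc` ancestor matrix defined earlier (cast to int to avoid uint8 overflow in the counts):

import numpy as np

def violation_rate(Y, A_anc):
    """
    Y:     [N, C] uint8 predicted labels
    A_anc: [C, C] dense binary ancestor matrix (A[a, t] = 1 if a is an ancestor of t)
    Returns the fraction of predicted labels Y[n, t] = 1 with at least one ancestor off.
    """
    n_anc = A_anc.sum(axis=0)                              # [C] ancestors per term
    anc_on = Y.astype(np.int32) @ A_anc.astype(np.int32)   # [N, C] ancestors of t that are on
    violating = (Y == 1) & (anc_on < n_anc)                # predicted t with a missing ancestor
    n_pred = int(Y.sum())
    return float(violating.sum()) / max(n_pred, 1)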

Ablation table template

| Method                         | Viol% | Fmax_BP  | Fmax_MF | Fmax_CC | micro-auPRC |
|--------------------------------|-------|----------|---------|---------|-------------|
| PLM + per-class thr (raw)      | 12.4  | 0.41     | 0.52    | 0.49    | 0.36        |
| + Post-thr closure (set)       | 0.0   | 0.43     | 0.54    | 0.50    | 0.37        |
| + Pre-thr monotone projection  | 0.0   | 0.44     | 0.55    | 0.51    | 0.38        |

When closure can hurt and how to mitigate

  • Over-activation of very general terms (high up the DAG) can reduce precision. Mitigation: use per-class thresholds (higher for general terms), or apply closure after minimal-set selection (lowest-ancestor strategy; a sketch follows these bullets).

  • Noisy children forcing noisy parents on. Mitigation: apply probability-space projection first, which tends to dampen child scores that exceed parents.
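
One way to realize the minimal-set mitigation, as a sketch over the dense `A_anc` matrix: drop every predicted term that has a predicted descendant, then re-expand with the closure step when you need the full ancestor-closed set.

import numpy as np

def minimal_set(Y, A_anc):
    """
    Keep only the most specific predicted terms: drop any predicted term that
    has a predicted descendant.
    Y:     [N, C] uint8 labels (raw or ancestor-closed)
    A_anc: [C, C] dense binary ancestor matrix (A[a, t] = 1 if a is an ancestor of t)
    """
    # desc_on[n, a] = number of predicted descendants of term a in row n
    desc_on = Y.astype(np.int32) @ A_anc.T.astype(np.int32)
    return (Y.astype(bool) & (desc_on == 0)).astype(np.uint8)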


Recommendations (practical defaults)

  1. Make training labels ancestor-closed.
  2. Use per-class thresholds tuned for Fmax on validation.
  3. At inference, apply post-threshold closure (vectorized) as your default.
  4. For better calibration, add one pass of pre-threshold monotone projection.
  5. Document: relations used, closure stage, and ablation results.

