
Problem. Independent multilabel classifiers for Gene Ontology (GO) often violate the ancestor rule: if a child term is predicted “on,” all its ancestors must also be on. Curators then fix outputs by hand, and metrics inflate/deflate unpredictably.

This note shows fast, reproducible ways to enforce hierarchy consistency at inference time, plus alternatives and measurement guidance.


Background: what “consistency” means in GO

  • GO is a DAG with relations like is_a, part_of (and others, e.g., regulates).
  • In practice, for classification we usually propagate annotations along is_a and optionally selected transitive relations.
  • Consistency constraint: for any protein (i) and term (t), if (Y_{i,t}=1) then for every ancestor (a\in \mathrm{Anc}(t)), (Y_{i,a}=1).

You must state which relations you close under (e.g., is_a only, or is_a + part_of) and ensure your train/eval labels were also ancestor-closed.
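To make this concrete, here is a minimal sketch of computing ancestor sets, assuming you have already extracted a parents mapping restricted to the relations you close under; the toy term IDs and the ancestors helper are illustrative, not a particular library's API.

# Toy parents map under the chosen relations (is_a here); IDs are made up.
parents = {
    "GO:A": [],                 # root
    "GO:B": ["GO:A"],           # B is_a A
    "GO:C": ["GO:B"],           # C is_a B
}

def ancestors(term, parents):
    """All ancestors of `term` (excluding itself) under the closed relations."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(parents.get(a, []))
    return seen

assert ancestors("GO:C", parents) == {"GO:A", "GO:B"}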


Fast closure algorithm (set-based, post-threshold)

Ensures every predicted child has all ancestors after thresholding.

import numpy as np

def close_under_ancestors(scores, go_dag, thr):
    """
    scores: np.ndarray [N, C] class probabilities (0..1)
    go_dag: object exposing .ancestors(term) and .topo_order()
    thr:    float or per-class array; threshold(s) in [0,1]
    returns: np.uint8 matrix [N, C] with ancestor-closed multilabels
    """
    Y = (scores >= thr).astype(np.uint8)

    # parents appear before children in topo order
    order = go_dag.topo_order()           # e.g., list of term indices
    for t in reversed(order):             # visit children before parents
        for a in go_dag.ancestors(t):     # ancestors(t) excludes t itself
            # If child t is on, force ancestor a on
            Y[:, a] = np.maximum(Y[:, a], Y[:, t])
    return Y

When to use: simple, transparent, and fast for most setups; great as a default.
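A tiny usage sketch on a three-term chain; the GoDag class below is a hypothetical stand-in for whatever object you use to expose .ancestors() and .topo_order().

import numpy as np

class GoDag:
    def __init__(self, anc):             # anc: dict term index -> set of ancestor indices
        self._anc = anc
    def ancestors(self, t):
        return self._anc[t]
    def topo_order(self):
        # ancestor count is a valid topological key (an ancestor has strictly fewer ancestors)
        return sorted(self._anc, key=lambda t: len(self._anc[t]))

dag = GoDag({0: set(), 1: {0}, 2: {0, 1}})           # 0 = root, 2 = leaf
scores = np.array([[0.2, 0.3, 0.9]])                 # only the leaf clears thr = 0.5
print(close_under_ancestors(scores, dag, thr=0.5))   # -> [[1 1 1]]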


Vectorized form (fast for large C)

Precompute a binary ancestor matrix (A \in \{0,1\}^{C\times C}) where (A_{a,t}=1) if (a\in\mathrm{Anc}(t)).

# A_anc: [C, C] with 1 if row=ancestor, col=child
# Y:     [N, C] initial post-threshold labels
# Y_closed = Y OR (Y @ A_anc^T)
Y_closed = (Y | (Y.dot(A_anc.T) > 0)).astype(np.uint8)

This does the same “turn on all ancestors” in a single matrix multiply (Boolean semiring approximated with dot+>0). Build A_anc once from your DAG (careful with memory if C is large; use scipy.sparse).
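One way to assemble A_anc, as a sketch that reuses the hypothetical ancestors(term, parents) helper from the Background section and stores the result as a scipy.sparse CSR matrix:

import numpy as np
from scipy.sparse import csr_matrix

def build_ancestor_matrix(terms, parents):
    """A_anc[a, t] = 1 if term a is an ancestor of term t (CSR, shape [C, C])."""
    idx = {t: i for i, t in enumerate(terms)}
    rows, cols = [], []
    for t in terms:
        for a in ancestors(t, parents):
            rows.append(idx[a])   # ancestor index -> row
            cols.append(idx[t])   # descendant index -> column
    data = np.ones(len(rows), dtype=np.uint8)
    return csr_matrix((data, (rows, cols)), shape=(len(terms), len(terms)))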


Probability-space consistency (pre-threshold)

If you keep probabilities consistent before thresholding, you can apply class-specific thresholds later without re-violating parents.

Simple projection: for each edge (a\to t), enforce (p(t)\le p(a)):

import numpy as np

def monotone_project_probs(P, edges, iters=1):
    """
    P: np.ndarray [N, C] class probabilities
    edges: list of (parent, child) index pairs consistent with topo order
    """
    P = P.copy()
    for _ in range(iters):  # a few passes often suffice
        for a, t in edges:  # parent, child
            # clamp child to parent if it violates monotonicity
            P[:, t] = np.minimum(P[:, t], P[:, a])
    return P

  • Pros: thresholds won’t re-break the hierarchy.
  • Cons: order-dependent; a few passes help. For a stricter, order-invariant solution, see “Energy-based / projected inference” under Alternatives below.
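A quick usage check on a three-term chain (toy numbers); with edges listed in topological order of the child, a single pass already restores monotonicity:

import numpy as np

P = np.array([[0.4, 0.7, 0.9]])           # root, mid, leaf: children exceed parents
edges = [(0, 1), (1, 2)]                  # (parent, child), topo-ordered
print(monotone_project_probs(P, edges))   # -> [[0.4 0.4 0.4]]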

Alternatives (when you need more than post-hoc closure)

  1. Hierarchical loss (train-time): add a penalty or constraint that encourages (p(child)\le p(parent)).

    • Example: ( \lambda\sum_{(a,t)} \max(0, p_t - p_a)^2 ); a short sketch follows this list.
    • Pros: model learns constraints; fewer violations.
    • Cons: hyper-parameters; possible under-confidence for children.
  2. Lowest-ancestor prediction (LA): predict only the minimal set (no predicted term has a predicted child). Expand to ancestors for the final set.

    • Pros: small final label sets; curator-friendly.
    • Cons: can hurt recall for mid/upper-level terms if thresholds aren’t tuned well.
  3. Energy-based / projected inference: solve (\min_p \|p - \hat p\|^2) s.t. (p(child)\le p(parent)) over the DAG (isotonic regression on a DAG).

    • Pros: principled projection to the feasible set; order-invariant.
    • Cons: more complex (requires a DAG isotonic solver).
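For the hierarchical loss in item 1, here is a minimal PyTorch sketch of the penalty term; the edge tensor layout and the weight lam are assumptions, and the penalty would be added to whatever base loss (e.g., BCE) you train with.

import torch

def hierarchy_penalty(P, edges, lam=0.1):
    """Soft penalty encouraging p(child) <= p(parent).

    P:     [B, C] predicted probabilities
    edges: LongTensor [E, 2] of (parent, child) index pairs
    lam:   penalty weight (illustrative default)
    """
    parents, children = edges[:, 0], edges[:, 1]
    violation = torch.clamp(P[:, children] - P[:, parents], min=0.0)
    return lam * (violation ** 2).mean()

# usage: loss = bce_loss + hierarchy_penalty(P, edges)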

Where to apply closure: before vs after threshold

  • Pre-threshold (probabilities): monotone projection -> then per-class thresholds.

    • Advantage: any thresholding respects hierarchy; better calibration downstream.
    • Caveat: may shrink child probabilities; report calibration metrics.
  • Post-threshold (labels): fastest, easiest; takes your Y and enforces ancestor closure.

    • Advantage: robust, minimal surgery.
    • Caveat: metrics like precision/recall can jump because parents get turned on; report both pre- and post-closure metrics in ablations.

Recommendation: keep the post-threshold closure as your mainline (transparent), and optionally provide pre-threshold monotone projection for calibration-sensitive users.


Practical details you should document

  • Ontology scope: BP/MF/CC trained separately or jointly?
  • Relations closed: is_a only (common), or is_a + part_of (state it).
  • Obsoletes & alt IDs: remove obsoletes; map alt IDs to primaries before building DAG.
  • Cross-namespace edges: if present, specify handling.
  • Thresholds: global vs per-class; how tuned (Fmax on validation)?
  • Evidence filtering: whether IEA is allowed; keep train/eval policies consistent.
  • Time-based split: to avoid future leakage.
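As a small illustration of the obsolete/alt-ID bullet, a sketch of normalizing annotation term IDs before building the DAG; alt_to_primary and obsolete_ids are assumed to come from whichever ontology release you use (both names are illustrative).

def normalize_terms(term_ids, alt_to_primary, obsolete_ids):
    """Map alt IDs to primary IDs and drop obsolete terms."""
    out = []
    for t in term_ids:
        t = alt_to_primary.get(t, t)   # alt ID -> primary ID (identity if already primary)
        if t not in obsolete_ids:      # drop obsolete terms
            out.append(t)
    return out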

Complexity & data structures

  • Set-based closure: (O(N \times \mathrm{nnz}(A))) per batch as written (one O(N) update per (term, ancestor) pair); propagating only along parent edges in reverse topological order reduces this to (O(N \times E)), where (E) is the number of DAG edges. The vectorized form is also (O(N \times \mathrm{nnz}(A))).
  • Ancestor matrix build: one-time DAG transitive closure; use sparse matrices for (C\sim 5-10k).
  • Memory: a dense (C\times C) can be large; prefer scipy.sparse.csr_matrix.

Sparse example (post-threshold vectorized):

import numpy as np
from scipy.sparse import csr_matrix

# Build sparse ancestor matrix A_anc [C, C] once:
# rows = ancestor indices, cols = child indices, data = ones (placeholders here)
A_anc = csr_matrix((data, (rows, cols)), shape=(C, C))

def close_labels_sparse(Y):
    # Y: [N, C] uint8
    # Y_closed = Y OR (Y @ A_anc^T)  (Boolean-OR semantics)
    # wrap Y in csr_matrix: calling ndarray.dot with a sparse operand misbehaves
    propagated = (csr_matrix(Y).dot(A_anc.T).toarray() > 0).astype(np.uint8)
    return np.maximum(Y, propagated)

Torch variant (batched GPU)

import torch

# P: [B, C] probs; edges: LongTensor [E, 2] of (parent, child) pairs
def monotone_project_probs_torch(P, edges, iters=1):
    P = P.clone()
    parents, children = edges[:, 0], edges[:, 1]
    idx = children.unsqueeze(0).expand(P.shape[0], -1)       # [B, E] child columns
    for _ in range(iters):                                    # deep violation chains may need a few iters
        cand = torch.minimum(P[:, children], P[:, parents])   # per-edge min(child, parent)
        # scatter-min into the child columns; a plain indexed assignment is
        # nondeterministic when a child has several parents (duplicate column indices)
        P.scatter_reduce_(1, idx, cand, reduce="amin", include_self=True)
    return P

# Post-threshold closure using a sparse COO ancestor matrix A_anc: [C, C]
def close_labels_torch(Y, A_anc):
    # Y: [B, C] uint8 or bool; A_anc[a, t] = 1 if a is an ancestor of t
    # turn ancestor a on when any descendant t is on: propagated = (A_anc @ Y^T)^T > 0
    propagated = torch.sparse.mm(A_anc, Y.T.float()).T > 0
    return (Y.bool() | propagated).to(torch.uint8)

Measurement

Always report both the raw (possibly inconsistent) and the consistent results in ablations.

  • Violations (%): fraction of predicted labels with at least one ancestor off (raw predictions); should be ~0 after closure. A small sketch follows this list.
  • ΔFmax: change in Fmax per ontology after closure (often improves due to fewer false negatives on parents).
  • Calibration (ECE): pre- vs post-monotone projection if you use probability-space constraints.
  • Curator effort: time saved (qualitative) or % predictions already curator-consistent.
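A sketch of the violation metric, assuming a dense ancestor indicator A_anc as defined earlier (for large C you would use the sparse form instead):

import numpy as np

def violation_rate(Y, A_anc):
    """Fraction of 'on' labels with at least one ancestor 'off'.

    Y:     [N, C] uint8 post-threshold predictions
    A_anc: [C, C] 0/1 dense array, A_anc[a, t] = 1 if a is an ancestor of t
    """
    Y = Y.astype(np.int64)
    A = A_anc.astype(np.int64)
    n_anc = A.sum(axis=0)                    # number of ancestors per term, [C]
    on_anc = Y @ A                           # 'on' ancestors per (sample, term), [N, C]
    violated = (Y == 1) & (on_anc < n_anc)   # term on, but some ancestor off
    return violated.sum() / max(int(Y.sum()), 1)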

Ablation table template

| Method                         | Viol% | Fmax_BP  | Fmax_MF | Fmax_CC | micro-auPRC |
|--------------------------------|-------|----------|---------|---------|-------------|
| PLM + per-class thr (raw)      | 12.4  | 0.41     | 0.52    | 0.49    | 0.36        |
| + Post-thr closure (set)       | 0.0   | 0.43     | 0.54    | 0.50    | 0.37        |
| + Pre-thr monotone projection  | 0.0   | 0.44     | 0.55    | 0.51    | 0.38        |

When closure can hurt and how to mitigate

  • Over-activation of very general terms (high up the DAG) can reduce precision. Mitigation: use per-class thresholds (higher for general terms), or apply closure after minimal-set selection (lowest-ancestor strategy).

  • Noisy children forcing noisy parents on. Mitigation: apply probability-space projection first, which tends to dampen child scores that exceed parents.


Recommendations (practical defaults)

  1. Make training labels ancestor-closed.
  2. Use per-class thresholds tuned for Fmax on validation.
  3. At inference, apply post-threshold closure (vectorized) as your default.
  4. For better calibration, add one pass of pre-threshold monotone projection.
  5. Document: relations used, closure stage, and ablation results.

