Takeaway: frozen protein language model (PLM) embeddings + a linear classifier = a strong, fast baseline for Gene Ontology (GO) term prediction.

Workflow

1) Embed proteins with the frozen PLM (batchable; a GPU helps but is not required) -- see the embedding sketch just below.
2) Train one-vs-rest logistic regression with balanced class weights.
3) Calibrate probabilities; close predictions under GO ancestors for hierarchy consistency.
4) Tune per-class thresholds for Fmax.
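
A minimal sketch of step 1, assuming the ESM-2 checkpoint facebook/esm2_t33_650M_UR50D via Hugging Face transformers; the checkpoint name, max length, and mean pooling are illustrative choices, and any frozen PLM works the same way.

import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; swap in whichever frozen PLM you use.
name = "facebook/esm2_t33_650M_UR50D"
tok = AutoTokenizer.from_pretrained(name)
plm = AutoModel.from_pretrained(name).eval()

@torch.no_grad()
def embed(seqs, batch_size=8):
    chunks = []
    for i in range(0, len(seqs), batch_size):
        batch = tok(seqs[i:i + batch_size], return_tensors="pt",
                    padding=True, truncation=True, max_length=1024)
        h = plm(**batch).last_hidden_state              # [B, L, D] residue embeddings
        mask = batch["attention_mask"].unsqueeze(-1)    # [B, L, 1]
        chunks.append((h * mask).sum(1) / mask.sum(1))  # mean-pool over residues
    return torch.cat(chunks).numpy()                    # [N, D] protein embeddings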

Sketch

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Load precomputed embeddings and multilabel GO annotations
X_tr, Y_tr = ...  # [N_tr, D], [N_tr, C]
X_val, Y_val = ...

# One balanced binary logistic regression per GO term
clf = OneVsRestClassifier(LogisticRegression(max_iter=4000, class_weight="balanced"))
clf.fit(X_tr, Y_tr)

P = clf.predict_proba(X_val)                   # [N_val, C] per-term probabilities
P = close_under_ancestors(P, go_dag)           # hierarchy consistency (sketched below)
th = tune_thresholds(P, Y_val, metric="Fmax")  # per-class cutoffs (sketched below)
Y_hat = (P >= th).astype(int)                  # th broadcasts across rows
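
close_under_ancestors is left abstract above; one possible implementation, assuming go_dag is a dict mapping each GO id to its set of direct parents, with keys in the same order as the columns of P (an assumed format):

import numpy as np

def close_under_ancestors(P, go_dag):
    # go_dag: {GO id: set of direct parent ids}, keys aligned with P's columns (assumed)
    terms = list(go_dag)
    col = {t: j for j, t in enumerate(terms)}
    P = P.copy()
    changed = True
    while changed:  # iterate to a fixed point; the GO DAG is shallow, so few passes suffice
        changed = False
        for t in terms:
            for parent in go_dag[t]:
                if parent in col:
                    j, k = col[t], col[parent]
                    upd = np.maximum(P[:, k], P[:, j])
                    if (upd > P[:, k]).any():
                        P[:, k] = upd
                        changed = True
    return P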

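And a matching per-class threshold sweep. This version maximizes per-class F1 on the validation set; the metric argument is kept only to mirror the call above.

import numpy as np

def tune_thresholds(P, Y, metric="Fmax", grid=np.linspace(0.05, 0.95, 19)):
    # For each class, pick the cutoff on the grid with the best validation F1.
    th = np.full(P.shape[1], 0.5)
    for j in range(P.shape[1]):
        best = -1.0
        for t in grid:
            pred = P[:, j] >= t
            tp = np.logical_and(pred, Y[:, j] == 1).sum()
            fp = np.logical_and(pred, Y[:, j] == 0).sum()
            fn = np.logical_and(~pred, Y[:, j] == 1).sum()
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)
            if f1 > best:
                best, th[j] = f1, t
    return th  # [C]; broadcasts against P in (P >= th)
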
Hyper-parameters

  • Dimensionality: if D > 1024, try PCA down to 512 for speed (sketch below).
  • Regularization: start at C=1.0, grid-search 0.1-10, and select by validation Fmax (sweep below).
  • Class imbalance: class_weight="balanced"; consider focal loss if you switch to an MLP head.
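
The PCA reduction from the first bullet, fit on the training embeddings only so no validation statistics leak into the projection:

from sklearn.decomposition import PCA

pca = PCA(n_components=512)
X_tr = pca.fit_transform(X_tr)   # fit on train only
X_val = pca.transform(X_val)     # reuse the same projection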

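For the regularization grid, a sketch that selects C by validation Fmax. The fmax helper here is an illustrative CAFA-style implementation (sweep one global threshold; average precision over proteins with at least one prediction, recall over all proteins), not a reference scorer.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def fmax(P, Y, grid=np.linspace(0.01, 0.99, 99)):
    best = 0.0
    for t in grid:
        pred = P >= t
        tp = np.logical_and(pred, Y == 1).sum(1)
        npred = pred.sum(1)
        has = npred > 0
        if not has.any():
            continue
        prec = (tp[has] / npred[has]).mean()         # over proteins with predictions
        rec = (tp / np.maximum(Y.sum(1), 1)).mean()  # over all proteins
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best

best_C, best_f = 1.0, -1.0
for C in (0.1, 0.3, 1.0, 3.0, 10.0):
    m = OneVsRestClassifier(
        LogisticRegression(C=C, max_iter=4000, class_weight="balanced"))
    m.fit(X_tr, Y_tr)
    f = fmax(m.predict_proba(X_val), Y_val)
    if f > best_f:
        best_C, best_f = C, f
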
Diagnostics

  • Per-class PR curves to spot long-tail collapse (sketch below).
  • Reliability plot + ECE; recalibrate (e.g., Platt or isotonic) if ECE > 0.05 (sketch below).
  • Ancestor coverage %: the share of predicted terms whose GO ancestors are also predicted; should be 100% after closure.
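
Per-class PR curves via scikit-learn; plotting the lowest-support classes is where long-tail collapse usually shows up first. rare_classes is a hypothetical index list of low-support GO terms.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

for j in rare_classes:  # hypothetical: indices of low-support GO terms
    prec, rec, _ = precision_recall_curve(Y_val[:, j], P[:, j])
    plt.plot(rec, prec, label=f"class {j}")
plt.xlabel("recall"); plt.ylabel("precision"); plt.legend(); plt.show()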

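A simple ECE over the flattened (protein, term) scores, as one way to decide when the 0.05 calibration rule fires:

import numpy as np

def ece(p, y, n_bins=10):
    # Expected calibration error: |mean confidence - empirical frequency| per bin,
    # weighted by bin mass, over flattened (protein, term) pairs.
    p, y = p.ravel(), y.ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (p >= lo) & (p <= hi) if hi == 1.0 else (p >= lo) & (p < hi)
        if m.any():
            total += m.mean() * abs(p[m].mean() - y[m].mean())
    return total  # e.g., ece(P, Y_val)
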
When to go beyond linear

  • Plateaued Fmax plus clearly network-driven biology -> add PPI (protein-protein interaction) smoothing or a GNN.
  • Heterogeneous evidence (sequence + structure + text) -> late fusion of per-source logits (sketch below).
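
Late fusion can be as simple as a convex combination of calibrated per-source probabilities. The three P_* arrays and the weights are hypothetical; tune the weights on validation Fmax.

# P_seq, P_struct, P_text: calibrated [N, C] probabilities from separate models (hypothetical)
w = (0.5, 0.3, 0.2)  # hypothetical weights; tune on validation Fmax
P_fused = w[0] * P_seq + w[1] * P_struct + w[2] * P_text
P_fused = close_under_ancestors(P_fused, go_dag)  # re-close in case a source was not closed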

Tip: keep the PLM frozen at first. Fine-tune only with careful regularization and a time-split evaluation (train on older annotations, evaluate on newer ones) to avoid leakage.
