
Takeaway: Frozen PLM embeddings + linear classifier = strong, fast baseline for GO prediction.

Workflow

1) Embed proteins (batchable; GPU helpful but not required) — see the embedding sketch below.
2) Train one-vs-rest Logistic Regression (balanced).
3) Calibrate; close predictions under GO ancestors.
4) Threshold per class for Fmax.
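The post doesn't pin down a PLM; as one common choice, here is a minimal mean-pooled embedding pass with ESM-2 via Hugging Face transformers (the checkpoint size and mean pooling are assumptions, not the post's prescription):

import torch
from transformers import AutoModel, AutoTokenizer

name = "facebook/esm2_t12_35M_UR50D"            # small ESM-2 checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

seqs = ["MKTAYIAKQRQISFVK", "MSILVTRPSPAG"]     # toy amino-acid sequences
with torch.no_grad():
    batch = tok(seqs, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state   # [B, L, D]
    mask = batch["attention_mask"].unsqueeze(-1)
    X = ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # mean-pooled [B, D]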

Sketch

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Load embeddings and multilabel GO annotations
X_tr, Y_tr = ...  # [N_tr, D] embeddings, [N_tr, C] binary label matrix
X_val, Y_val = ...

# One binary logistic regression per GO term; "balanced" upweights rare terms
clf = OneVsRestClassifier(LogisticRegression(max_iter=4000, class_weight="balanced"))
clf.fit(X_tr, Y_tr)

P = clf.predict_proba(X_val)                # [N_val, C] per-term probabilities
P = close_under_ancestors(P, go_dag)        # hierarchy consistency (sketched below)
th = tune_thresholds(P, Y_val, metric="Fmax")   # per-class thresholds
Y_hat = (P >= th).astype(int)
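The two helpers above are left undefined; here are minimal sketches under assumed conventions: go_dag maps each term's column index to its parents' column indices, and "threshold per class" is read as maximizing each term's validation F1 (a simple stand-in for per-class Fmax tuning).

import numpy as np

def close_under_ancestors(P, go_dag):
    # Propagate max scores upward until every parent's score >= its children's
    P = P.copy()
    changed = True
    while changed:
        changed = False
        for child, parents in go_dag.items():
            for parent in parents:
                upd = np.maximum(P[:, parent], P[:, child])
                if not np.array_equal(upd, P[:, parent]):
                    P[:, parent] = upd
                    changed = True
    return P

def tune_thresholds(P, Y, metric="Fmax", grid=np.linspace(0.05, 0.95, 19)):
    # metric kept for signature parity; this sketch maximizes per-class F1
    th = np.full(P.shape[1], 0.5)
    for c in range(P.shape[1]):
        best = -1.0
        for t in grid:
            pred = P[:, c] >= t
            tp = np.sum(pred & (Y[:, c] == 1))
            fp = np.sum(pred & (Y[:, c] == 0))
            fn = np.sum(~pred & (Y[:, c] == 1))
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)
            if f1 > best:
                best, th[c] = f1, t
    return th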

Hyper-parameters

  • Dimensionality: if D > 1024, try PCA down to 512 for speed (see the sketch after this list).
  • Regularization: start at C=1.0 (grid 0.1-10); pick the value that maximizes validation Fmax.
  • Class imbalance: class_weight="balanced"; consider focal loss if you move to an MLP head.
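A minimal sketch of both knobs with scikit-learn; fmax_score is a hypothetical helper standing in for your validation-Fmax computation:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Optional compression of high-dimensional embeddings (only if D > 1024)
pca = PCA(n_components=512).fit(X_tr)
X_tr_p, X_val_p = pca.transform(X_tr), pca.transform(X_val)

# Small grid over C, scored by validation Fmax; keep the best model
best_clf, best_f = None, -1.0
for C in (0.1, 0.3, 1.0, 3.0, 10.0):
    clf = OneVsRestClassifier(
        LogisticRegression(C=C, max_iter=4000, class_weight="balanced"))
    clf.fit(X_tr_p, Y_tr)
    f = fmax_score(Y_val, clf.predict_proba(X_val_p))  # hypothetical helper
    if f > best_f:
        best_clf, best_f = clf, f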

Diagnostics

  • Per-class PR curves (spot long-tail collapse).
  • Reliability plot + ECE; recalibrate if ECE > 0.05 (ECE sketch after this list).
  • Ancestor coverage % (should be 100% after closure).
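A minimal ECE sketch for that check, pooling all (protein, term) scores into one set of binary predictions (an assumption; per-term ECE is another option):

import numpy as np

def expected_calibration_error(P, Y, n_bins=10):
    # Per bin: |mean confidence - empirical positive rate|, weighted by bin mass
    p, y = P.ravel(), Y.ravel().astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        hi_ok = p <= edges[i + 1] if i == n_bins - 1 else p < edges[i + 1]
        mask = (p >= edges[i]) & hi_ok
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

# If ECE > 0.05, recalibrate, e.g. with sklearn's CalibratedClassifierCV
# (per-class isotonic or sigmoid scaling).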

When to go beyond linear

  • Plateaued Fmax and clear network-driven biology -> add PPI smoothing or a GNN.
  • Heterogeneous evidence (sequence + structure + text) -> late-fusion of logits.
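Late fusion of logits can start as a convex combination with the mixing weight tuned on validation Fmax; logits_seq, logits_struct, and fmax_score below are placeholder names:

import numpy as np

# logits_seq, logits_struct: [N, C] pre-sigmoid scores from two evidence channels
best_w, best_f = 0.5, -1.0
for w in np.linspace(0.0, 1.0, 11):
    fused = w * logits_seq + (1.0 - w) * logits_struct
    P_fused = 1.0 / (1.0 + np.exp(-fused))     # back to probabilities
    f = fmax_score(Y_val, P_fused)             # hypothetical helper, as above
    if f > best_f:
        best_w, best_f = w, f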

Tip: Keep the PLM frozen at first. Fine-tune only with careful regularization and time-split evaluation.
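One way to read "time-split eval": hold out proteins whose first annotation arrived after a cut-off date, so the model is scored on genuinely new annotations. A minimal sketch; the dates array and cut-off are assumed inputs:

import numpy as np

# dates: [N] first-annotation date per protein (assumed available)
cutoff = np.datetime64("2023-01-01")            # illustrative cut-off
train = dates < cutoff
X_tr, Y_tr = X[train], Y[train]
X_te, Y_te = X[~train], Y[~train]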
