Takeaway: frozen protein language model (PLM) embeddings + a linear classifier = a strong, fast baseline for Gene Ontology (GO) term prediction.

Workflow

1) Embed proteins with the frozen PLM (batchable; a GPU helps but is not required) -- see the embedding sketch just below.
2) Train one-vs-rest logistic regression with balanced class weights.
3) Calibrate probabilities; close predictions under GO ancestors for hierarchy consistency.
4) Tune per-class thresholds for Fmax.
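
A minimal sketch of step 1, assuming the ESM-2 checkpoint facebook/esm2_t33_650M_UR50D via Hugging Face transformers; the checkpoint name, max length, and mean pooling are illustrative choices, and any frozen PLM works the same way.

import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; swap in whichever frozen PLM you use.
name = "facebook/esm2_t33_650M_UR50D"
tok = AutoTokenizer.from_pretrained(name)
plm = AutoModel.from_pretrained(name).eval()

@torch.no_grad()
def embed(seqs, batch_size=8):
    chunks = []
    for i in range(0, len(seqs), batch_size):
        batch = tok(seqs[i:i + batch_size], return_tensors="pt",
                    padding=True, truncation=True, max_length=1024)
        h = plm(**batch).last_hidden_state              # [B, L, D] residue embeddings
        mask = batch["attention_mask"].unsqueeze(-1)    # [B, L, 1]
        chunks.append((h * mask).sum(1) / mask.sum(1))  # mean-pool over residues
    return torch.cat(chunks).numpy()                    # [N, D] protein embeddings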

Sketch

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Load precomputed embeddings and multilabel GO annotations
X_tr, Y_tr = ...  # [N_tr, D], [N_tr, C]
X_val, Y_val = ...

# One balanced binary logistic regression per GO term
clf = OneVsRestClassifier(LogisticRegression(max_iter=4000, class_weight="balanced"))
clf.fit(X_tr, Y_tr)

P = clf.predict_proba(X_val)                   # [N_val, C] per-term probabilities
P = close_under_ancestors(P, go_dag)           # hierarchy consistency (sketched below)
th = tune_thresholds(P, Y_val, metric="Fmax")  # per-class cutoffs (sketched below)
Y_hat = (P >= th).astype(int)                  # th broadcasts across rows
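
close_under_ancestors is left abstract above; one possible implementation, assuming go_dag is a dict mapping each GO id to its set of direct parents, with keys in the same order as the columns of P (an assumed format):

import numpy as np

def close_under_ancestors(P, go_dag):
    # go_dag: {GO id: set of direct parent ids}, keys aligned with P's columns (assumed)
    terms = list(go_dag)
    col = {t: j for j, t in enumerate(terms)}
    P = P.copy()
    changed = True
    while changed:  # iterate to a fixed point; the GO DAG is shallow, so few passes suffice
        changed = False
        for t in terms:
            for parent in go_dag[t]:
                if parent in col:
                    j, k = col[t], col[parent]
                    upd = np.maximum(P[:, k], P[:, j])
                    if (upd > P[:, k]).any():
                        P[:, k] = upd
                        changed = True
    return P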

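And a matching per-class threshold sweep. This version maximizes per-class F1 on the validation set; the metric argument is kept only to mirror the call above.

import numpy as np

def tune_thresholds(P, Y, metric="Fmax", grid=np.linspace(0.05, 0.95, 19)):
    # For each class, pick the cutoff on the grid with the best validation F1.
    th = np.full(P.shape[1], 0.5)
    for j in range(P.shape[1]):
        best = -1.0
        for t in grid:
            pred = P[:, j] >= t
            tp = np.logical_and(pred, Y[:, j] == 1).sum()
            fp = np.logical_and(pred, Y[:, j] == 0).sum()
            fn = np.logical_and(~pred, Y[:, j] == 1).sum()
            f1 = 2 * tp / max(2 * tp + fp + fn, 1)
            if f1 > best:
                best, th[j] = f1, t
    return th  # [C]; broadcasts against P in (P >= th)
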
Hyper-parameters

  • Dimensionality: if D > 1024, try PCA down to 512 for speed (sketch below).
  • Regularization: start at C=1.0, grid-search 0.1-10, and select by validation Fmax (sweep below).
  • Class imbalance: class_weight="balanced"; consider focal loss if you switch to an MLP head.
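
The PCA reduction from the first bullet, fit on the training embeddings only so no validation statistics leak into the projection:

from sklearn.decomposition import PCA

pca = PCA(n_components=512)
X_tr = pca.fit_transform(X_tr)   # fit on train only
X_val = pca.transform(X_val)     # reuse the same projection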

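For the regularization grid, a sketch that selects C by validation Fmax. The fmax helper here is an illustrative CAFA-style implementation (sweep one global threshold; average precision over proteins with at least one prediction, recall over all proteins), not a reference scorer.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def fmax(P, Y, grid=np.linspace(0.01, 0.99, 99)):
    best = 0.0
    for t in grid:
        pred = P >= t
        tp = np.logical_and(pred, Y == 1).sum(1)
        npred = pred.sum(1)
        has = npred > 0
        if not has.any():
            continue
        prec = (tp[has] / npred[has]).mean()         # over proteins with predictions
        rec = (tp / np.maximum(Y.sum(1), 1)).mean()  # over all proteins
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best

best_C, best_f = 1.0, -1.0
for C in (0.1, 0.3, 1.0, 3.0, 10.0):
    m = OneVsRestClassifier(
        LogisticRegression(C=C, max_iter=4000, class_weight="balanced"))
    m.fit(X_tr, Y_tr)
    f = fmax(m.predict_proba(X_val), Y_val)
    if f > best_f:
        best_C, best_f = C, f
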
Diagnostics

  • Per-class PR curves to spot long-tail collapse (sketch below).
  • Reliability plot + ECE; recalibrate (e.g., Platt or isotonic) if ECE > 0.05 (sketch below).
  • Ancestor coverage %: the share of predicted terms whose GO ancestors are also predicted; should be 100% after closure.
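
Per-class PR curves via scikit-learn; plotting the lowest-support classes is where long-tail collapse usually shows up first. rare_classes is a hypothetical index list of low-support GO terms.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

for j in rare_classes:  # hypothetical: indices of low-support GO terms
    prec, rec, _ = precision_recall_curve(Y_val[:, j], P[:, j])
    plt.plot(rec, prec, label=f"class {j}")
plt.xlabel("recall"); plt.ylabel("precision"); plt.legend(); plt.show()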

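A simple ECE over the flattened (protein, term) scores, as one way to decide when the 0.05 calibration rule fires:

import numpy as np

def ece(p, y, n_bins=10):
    # Expected calibration error: |mean confidence - empirical frequency| per bin,
    # weighted by bin mass, over flattened (protein, term) pairs.
    p, y = p.ravel(), y.ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (p >= lo) & (p <= hi) if hi == 1.0 else (p >= lo) & (p < hi)
        if m.any():
            total += m.mean() * abs(p[m].mean() - y[m].mean())
    return total  # e.g., ece(P, Y_val)
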
When to go beyond linear

  • Plateaued Fmax plus clearly network-driven biology -> add PPI (protein-protein interaction) smoothing or a GNN.
  • Heterogeneous evidence (sequence + structure + text) -> late fusion of per-source logits (sketch below).
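
Late fusion can be as simple as a convex combination of calibrated per-source probabilities. The three P_* arrays and the weights are hypothetical; tune the weights on validation Fmax.

# P_seq, P_struct, P_text: calibrated [N, C] probabilities from separate models (hypothetical)
w = (0.5, 0.3, 0.2)  # hypothetical weights; tune on validation Fmax
P_fused = w[0] * P_seq + w[1] * P_struct + w[2] * P_text
P_fused = close_under_ancestors(P_fused, go_dag)  # re-close in case a source was not closed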

Tip: keep the PLM frozen at first. Fine-tune only with careful regularization and a time-split evaluation (train on older annotations, evaluate on newer ones) to avoid leakage.
