Can foundation models (scGPT, Geneformer, scBERT) pre-trained on single-cell immune atlases predict individual patient drug responses from pre-treatment transcriptomic profiles — and do they outperform conventional genomic or expression-based biomarkers?
BioSkepsis

Single-cell foundation models (scFMs) such as scGPT, Geneformer, and scFoundation can predict individual patient drug responses and sensitivity from pre-treatment transcriptomic profiles (Direct, High; PMID: 41044630). While these models demonstrate superior performance over some traditional baselines in drug sensitivity tasks, their superiority is inconsistent across different types of prediction (e.g., transcriptomic perturbation vs. clinical sensitivity) and often depends on the specific biological context or the use of external biological knowledge (Direct, High; PMID: 40269681, PMID: 41044630).

Performance in Drug Sensitivity and Response Prediction

Large-scale pre-training on single-cell atlases allows foundation models to capture universal biological knowledge, which can be adapted for drug sensitivity tasks (Direct, High; PMID: 41044630).

  • Sensitivity Benchmarks: In zero-shot and fine-tuning benchmarks, scFoundation and scGPT emerged as top performers for drug sensitivity prediction, achieving AUROCs of 0.755 and 0.737, respectively (Direct, High; PMID: 41044630).
  • Comparison of scFMs: LangCell (AUROC = 0.667) and Geneformer (AUROC = 0.601) showed lower performance in clinical sensitivity tasks compared to scFoundation and scGPT in the same benchmark (Direct, High; PMID: 41044630).
  • Predicting Clinical Outcomes: The ENLIGHT–DeepPT framework, which uses deep learning to impute gene expression from H&E slides and then applies genetic interaction networks, successfully predicted patient response to targeted and immune therapies across five clinical trials with an aggregate odds ratio (OR) of 2.28 (Direct, High; PMID: 38961276).
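
To make the AUROC figures above concrete, the following is a minimal sketch of how an AUROC can be computed from predicted sensitivity scores and binary response labels. The labels and scores are hypothetical toy values, not data from the cited benchmark, and the function name `auroc` is introduced here only for illustration.

```python
def auroc(labels, scores):
    """Rank-based AUROC: the probability that a randomly chosen responder
    (label 1) receives a higher score than a randomly chosen non-responder
    (label 0). Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = drug-sensitive, 0 = resistant.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(round(auroc(labels, scores), 3))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values such as 0.755 or 0.601 sit between those two extremes.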

Comparisons to Conventional Genomic and Expression Biomarkers

The evidence regarding whether scFMs outperform traditional methods is mixed and depends on the evaluation metric and task type.

  • Superiority in Sensitivity: Benchmarking on clinical scenarios found that all evaluated scFMs (scFoundation, scGPT, LangCell, Geneformer, scCello, UCE) outperformed baseline models trained on raw expression counts when predicting drug sensitivity (Direct, High; PMID: 41044630).
  • Inferiority in Perturbation Prediction: In tasks predicting post-perturbation RNA-seq vectors (the actual expression change following treatment), scGPT and scFoundation generally underperformed compared to a simple "Train Mean" baseline or standard machine learning models (e.g., Random Forest) that incorporated biological prior knowledge such as Gene Ontology (GO) terms (Direct, High; PMID: 40269681).
  • Traditional ML Advantages: Random Forest models using GO features or scELMo (language model-based gene embeddings) outperformed scFMs by a large margin in predicting responses to unseen perturbations in datasets like Adamson or Replogle (Direct, High; PMID: 40269681).
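
The "Train Mean" baseline and a Pearson correlation in differential-expression space can be sketched as follows. This is a simplified illustration under stated assumptions: the expression vectors are hypothetical, and `pearson_delta` is an illustrative name for correlating predicted and observed changes relative to a control profile, not the cited benchmark's exact implementation.

```python
import math


def mean_profile(profiles):
    """Gene-wise mean over a list of expression vectors."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def pearson(x, y):
    """Plain Pearson correlation; returns NaN if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def pearson_delta(pred, truth, control):
    """Correlate predicted and observed changes relative to control."""
    d_pred = [p - c for p, c in zip(pred, control)]
    d_true = [t - c for t, c in zip(truth, control)]
    return pearson(d_pred, d_true)


# Hypothetical 4-gene example.
control = [1.0, 2.0, 3.0, 4.0]
train_perturbed = [[1.5, 2.0, 2.5, 4.0], [1.3, 2.2, 2.7, 4.1]]
held_out_truth = [1.4, 2.1, 2.4, 4.0]

# "Train Mean": predict every unseen perturbation as the training average.
train_mean = mean_profile(train_perturbed)
score = pearson_delta(train_mean, held_out_truth, control)
print(round(score, 3))
```

Because the baseline ignores which gene was perturbed, a high score here signals that perturbation effects in the toy data resemble one another, not that the predictor has learned anything perturbation-specific.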

Clinical Target Identification and Virtual Screening

Foundation models have moved beyond simple sensitivity prediction to identifying novel therapeutic targets through in silico perturbations.

  • Geneformer Case Study: Geneformer predicted candidate therapeutic targets for cardiomyopathy (e.g., GSN, PLN); CRISPR-mediated inhibition of these targets significantly improved cardiomyocyte contraction in human iPSC-based disease models (Direct, High; PMID: 37258680).
  • C2S-Scale Virtual Screen: Using a multimodal Large Language Model (LLM) framework, researchers identified silmitasertib (a CK2 inhibitor) as a context-specific amplifier of MHC-I antigen presentation. This was validated experimentally in primary tumor fragments and neuroendocrine Merkel cell models, where it enhanced immune visibility in the presence of interferon signaling (Direct, High; PMID: 41279114).
  • DREEP Utility: The DREEP tool, which utilizes enrichment analysis of single-cell profiles against bulk-derived drug sensitivity signatures, outperformed other single-cell drug prediction tools like scDRUG and scDEAL in detecting sensitivity differences among individual cells within the same tumor (Direct, High; PMID: 38041118).

Factors Influencing Model Selection

The choice between a foundation model and a simpler alternative is guided by task complexity and available data (Direct, High; PMID: 41044630).

  • Roughness Index (ROGI): Model performance is strongly correlated with the smoothness of the cell-property landscape (ROGI). scFMs often provide a smoother landscape that reduces the difficulty of training task-specific models compared to raw noisy counts (Direct, High; PMID: 41044630).
  • Design Trade-offs: scFoundation's strength in drug sensitivity is attributed to read-depth aware pre-training and absolute value projection, making it sensitive to tumor microenvironment heterogeneity (Direct, High; PMID: 41044630).
  • Data Limitations: Benchmarking is often limited by low inter-sample variance in common datasets (e.g., Perturb-seq), which can make simple mean-based predictors appear more effective than complex foundation models (Derived, Medium; PMID: 40269681, PMID: 41044630).
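
The point about low inter-sample variance can be illustrated with a small sketch. It assumes hypothetical three-gene profiles and uses mean squared error rather than any metric from the cited benchmarks; the helper names are introduced here for illustration only.

```python
def mean_profile(profiles):
    """Gene-wise mean over a list of expression vectors."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def mse(pred, truth):
    """Mean squared error between two expression vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def leave_one_out_mean_error(profiles):
    """Average MSE when each profile is predicted by the mean of the others."""
    errors = []
    for i, held_out in enumerate(profiles):
        others = profiles[:i] + profiles[i + 1:]
        errors.append(mse(mean_profile(others), held_out))
    return sum(errors) / len(errors)


def make_dataset(scale):
    """Hypothetical profiles: a shared baseline plus scaled deviations."""
    base = [1.0, 2.0, 3.0]
    deviations = [[0.1, -0.1, 0.0], [-0.1, 0.0, 0.1], [0.0, 0.1, -0.1]]
    return [[b + scale * d for b, d in zip(base, dev)] for dev in deviations]


err_low = leave_one_out_mean_error(make_dataset(1.0))    # low variance
err_high = leave_one_out_mean_error(make_dataset(10.0))  # high variance
print(err_low, err_high)
```

In this toy setup the error of the mean predictor grows with the square of the spread between samples, which is one way a homogeneous dataset can make a trivial baseline look competitive.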

Overall, while foundation models offer powerful out-of-the-box capabilities for identifying rare cell populations and context-specific drug targets, they do not universally replace traditional biomarkers or simpler machine learning models, which remain superior for certain transcriptomic prediction tasks (Derived, High; PMID: 41044630, PMID: 40269681).


Unverified Citations

The following sources failed to support their assigned claims after 3 verification rounds designed to ensure only high-confidence, relevant references are retained:

  • PMID:38614094 — Immune Checkpoint Biomarkers: For anti-PD-1 therapy, the T cell InteractPrint score (leveraging weighted int...
    Failed: entities — The claim attributes 'T cell InteractPrint' to foundation models (scFMs), but the paper develops this score using traditional UCell signature scoring and receptor-ligand pairing analysis, not scFMs.
    Possible alternatives (unverified): PMID:39592577 (60% topic match); PMID:34062119 (52% topic match)
Generate a mechanistic links table
BioSkepsis
| Molecular Factor | Link Type | Target | Effect | Context / Mechanism | Reference |
|---|---|---|---|---|---|
| Silmitasertib (CK2 inhibitor) | synergy | IFN-beta signaling | amplification | Silmitasertib conditionally amplifies MHC-I antigen presentation in the presence of interferon, as predicted by a dual-context virtual screen using C2S-Scale. | PMID: 41279114 |
| GSN | inhibition | Cardiomyocyte contraction | impairment | In silico deletion of GSN predicted improvement in contractile stress for dilated cardiomyopathy models, which was validated through CRISPR-mediated knockout. | PMID: 37258680 |
| Nutlin | inhibition | MDM2 | TP53 stabilization | DREEP accurately predicted that TP53 wild-type cells exhibit sensitivity to Nutlin-induced cell cycle arrest while missense mutants are resistant. | PMID: 38041118 |
| WNT6 | activation | FZD10/LRP6 complex | canonical signaling | Epidermal-derived Wnt6 acts as a dominant source for paracrine signaling to dermal trajectory cells during embryonic skin specification. | PMID: 33597522 |
| Afatinib | inhibition | EGFR/HER2 | growth inhibition | Retrospective lineage tracing identified distinct tolerant subpopulations in MDA-MB-468 cells that could be targeted using IGF1R or HDAC inhibitors. | PMID: 38041118 |
| OCT4, SOX2, KLF4, MYC | reprogramming | Fibroblast state | iPSC state shift | In silico activation by moving these gene ranks to the front of cell encodings shifted the gene embedding landscape towards an iPSC state in Geneformer. | PMID: 37258680 |
| CSF1 | activation | CSF1R | microglial proliferation | Spatial transcriptomics (CosMx) revealed that CSF1-CSF1R signaling strength increases as microglia approach amyloid plaques, driving DAM development. | PMID: 38819990 |
| GAD1/GAD2 and SLC6A11 | regulation | GABBR2 | hyperpolarization | Increased astrocytic GABA-related signaling to neurons in microglia-dense plaque niches contributes to an imbalance in neuronal synaptic signaling. | PMID: 38819990 |
| Bortezomib | inhibition | 26S proteasome | transcriptome rewiring | DREEP captured the shift from a heterogeneous sensitive population to a homogeneous resistant population in MCF7 cells over 96 hours of exposure. | PMID: 38041118 |
| CXCL13 | attraction | CXCR5+ B cells | TLS formation | Prenatal LTo-like cells express CXCL13 to facilitate immune cell recruitment and organized lymphoid structure development in the human skin and gut. | PMID: 40993240 |
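
The in silico deletion and activation entries in the table can be sketched on a simplified rank-based cell encoding. This is an illustration under assumptions: the real Geneformer pipeline normalizes expression before ranking and uses its own tokenizer, neither of which is reproduced here; the function names and expression values are hypothetical.

```python
def rank_encode(expression):
    """Order expressed genes by descending expression (ties broken by name).
    Simplified stand-in for a rank-value encoding of one cell."""
    expressed = {g: v for g, v in expression.items() if v > 0}
    return sorted(expressed, key=lambda g: (-expressed[g], g))


def in_silico_delete(encoding, gene):
    """Remove a gene from the encoding, as in a simulated knockout."""
    return [g for g in encoding if g != gene]


def in_silico_activate(encoding, genes):
    """Move the given genes to the front of the encoding, adding them
    if absent, as in a simulated overexpression."""
    front = list(genes)
    return front + [g for g in encoding if g not in front]


# Hypothetical expression values for one cell.
cell = {"COL1A1": 9.0, "ACTB": 7.0, "GSN": 5.0, "MYC": 1.0, "SOX2": 0.0}
encoding = rank_encode(cell)
print(in_silico_delete(encoding, "GSN"))
print(in_silico_activate(encoding, ["OCT4", "SOX2", "KLF4", "MYC"]))
```

In practice the perturbed encoding would be passed through the pre-trained model and the shift in cell or gene embeddings compared against the unperturbed encoding; that step depends on the model and is omitted here.
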
Generate a cohesive Research Landscape Synthesis
BioSkepsis

This synthesis evaluates the evolution and current state of single-cell foundation models (scFMs) and their application in predicting drug responses, drawing upon the provided research corpus of 34 articles.

1. Phases of Evidence Evolution

The trajectory of single-cell research has transitioned from descriptive atlasing to predictive modeling and therapeutic intervention.

  • Early Phase (2018–2020): This phase focused on establishing the fundamental frameworks for deep generative modeling and large-scale data integration. Key developments include scVI for probabilistic representation (Tier 1, High; PMID: 30504886) and Harmony for multi-dataset batch correction (Tier 1, High; PMID: 31740819). These tools addressed technical noise and library size bias, setting the stage for more complex downstream tasks.
  • Stable Phase (2021–2023): Research matured into highly resolved spatial and multimodal integration. CellChat advanced the inference of cell-cell communications by incorporating multimeric ligand-receptor complexes (Tier 1, High; PMID: 33597522). Simultaneously, the introduction of transformer-based models like Geneformer (Tier 1, High; PMID: 37258680) demonstrated that pre-training on millions of cells could encode fundamental network dynamics, enabling in silico deletions to predict disease genes and therapeutic targets.
  • Emerging Phase (2024–2025): Current efforts prioritize rigorous benchmarking and the integration of multimodal Large Language Models (LLMs). Recent studies like scFM-Bench provide critical introspection into zero-shot embeddings, highlighting that scFMs excel at biological conservation but do not consistently surpass simpler machine learning (ML) models in every scenario (Tier 1, High; PMID: 41044630). Innovation has shifted toward LLMs like C2S-Scale, which can process over 50 million cells to perform complex biological reasoning and high-throughput virtual screening (Tier 2, High; PMID: 41279114).

2. Network Structure and Relationships

The evidence landscape is characterized by high integration but notable heterogeneity in model performance.

  • Hubs and Bridges: Models like scGPT and scFoundation act as hubs in the predictive network, frequently cited for their performance in drug sensitivity benchmarks. GenePT serves as a bridge between natural language processing and single-cell omics, demonstrating that LLM-derived gene descriptions can outperform traditional expression-only models in perturbation prediction (Direct, High; PMID: 40269681).
  • Evidence Maturity: Metrics such as the Roughness Index (ROGI) imply that scFMs offer more mature, "smoother" representations than raw data, which significantly reduces the difficulty of training task-specific modules for clinical outcomes (Direct, High; PMID: 41044630).

3. Mechanisms → Therapies → Outcomes

The transition from molecular insight to clinical outcomes is increasingly mediated by in silico modeling.

  • Mechanistic Insights: High-resolution spatial transcriptomics (e.g., CosMx) have identified precise mechanisms, such as increased GABAergic signaling from astrocytes to neurons in microglia-dense amyloid niches, contributing to synaptic imbalance (Direct, High; PMID: 38819990).
  • Pharmacological Mechanisms: Tools like DREEP map these mechanisms to drug sensitivity by correlating single-cell profiles with bulk-derived signatures. DREEP successfully identified IGF1R and HDAC inhibitors as effective therapies to deplete afatinib-tolerant subpopulations in breast cancer (Direct, High; PMID: 38041118).
  • Clinical Outcomes: The ENLIGHT–DeepPT framework demonstrates the potential for direct clinical translation. By imputing transcriptomics from H&E slides, it predicted patient response to targeted therapies across five trials with a significant aggregate odds ratio of 2.28 (Direct, High; PMID: 38961276). Similarly, Geneformer identified the targets GSN and PLN, whose inhibition improved cardiomyocyte contraction by 30-50% in functional validation assays (Direct, High; PMID: 37258680).
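
For readers less familiar with the odds ratio reported above, here is a minimal sketch of how an odds ratio and an approximate 95% confidence interval are derived from a single 2x2 table of predicted versus observed response. The counts are hypothetical and the sketch does not reproduce how the cited study aggregated results across trials.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table:
    a = predicted responders who responded
    b = predicted responders who did not respond
    c = predicted non-responders who responded
    d = predicted non-responders who did not respond"""
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive for this simple estimate")
    return (a * d) / (b * c)


def wald_ci(a, b, c, d, z=1.96):
    """Approximate confidence interval on the log-odds scale (Wald method)."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts, not taken from any trial.
a, b, c, d = 20, 10, 10, 20
low, high = wald_ci(a, b, c, d)
print(odds_ratio(a, b, c, d), round(low, 2), round(high, 2))
```

An odds ratio above 1 means patients flagged as likely responders had higher odds of responding than those flagged as unlikely responders; an interval that excludes 1 is the usual informal check for significance.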

4. Biases and Reliability

The corpus reveals critical biases that affect translational readiness.

  • Replication Patterns: A major bias is the low inter-sample variance in common benchmark datasets like Perturb-seq. Models that simply predict the "Train Mean" often appear more accurate than sophisticated scFMs because the datasets lack sufficient biological diversity to distinguish them (Direct, High; PMID: 40269681).
  • Recency Effects: There is a strong recency effect favoring LLM-based approaches (e.g., GPT-3.5 or Gemma-2), which show superior performance in generating meaningful gene descriptions compared to older database-dependent methods (Direct, High; PMID: 41279114).
  • Translational Readiness: While scFMs are "plug-and-play" modules for biological discovery, their reliability is hampered by data leakage in older benchmarks. The introduction of independent datasets like AIDA v2 (released after model training) is essential for unbiased validation (Direct, High; PMID: 41044630). Currently, simpler models like scVI remain more robust for dataset-specific integration, while scFMs are superior for zero-shot generalization and novel target discovery (Direct, High; PMID: 41044630).

Significance Assessment

This landscape matters because it marks a shift from identifying "what" is in a cell to predicting "how" it will respond to therapy. The convergence of single-cell immune atlases with multimodal foundation models offers a scalable path for drug repurposing and precision medicine, though the reliance on high computational resources and noisy benchmark datasets remains a significant risk.


Unverified Citations

The following sources failed to support their assigned claims after 3 verification rounds designed to ensure only high-confidence, relevant references are retained:

  • PMID:41044630 — 755, respectively)
    Failed: conclusion — The claim text provided (755, respectively) is a fragment that does not assert a checkable biological or methodological fact.
    Possible alternatives (unverified): PMID:34062119 (40% topic match); PMID:38570491 (40% topic match)
  • PMID:38844628 — scELMo and GenePT serve as bridges between natural language processing and single-cell omics, demonstrating that...
    Failed: entities,conclusion — This paper (index 7) describes scFoundation, not scELMo or GenePT. While it mentions GenePT in the introduction, it does not demonstrate its performance in perturbation prediction.
Generate an empirically testable hypothesis
BioSkepsis

Hypothesis 1

The integration of multi-subunit stoichiometric constraints and protein-protein interaction (PPI) topology into the self-attention mechanisms of single-cell foundation models (scFMs) will resolve the 'latent landscape roughness' currently hindering zero-shot perturbation prediction, enabling scFMs to surpass traditional machine learning models in predicting the transcriptomic effects of therapies targeting heteromeric signaling complexes.

Mechanistic rationale

  • Current single-cell foundation models like scGPT and scFoundation frequently underperform relative to a simple baseline that uses the mean of training samples when predicting transcriptomic responses to unseen genetic perturbations. (Direct, High; PMID: 40269681)
  • This performance gap is partially attributed to the 'roughness' of the cell-property landscape in the latent space, where model success is strongly correlated with landscape smoothness (measured by the Roughness Index). (Derived, Medium; PMID: 41044630)
  • A substantial proportion (approximately 48%) of cell-cell communication interactions involve heteromeric molecular complexes, where signaling probability depends on the coordinate expression of multiple subunits rather than individual genes. (Direct, High; PMID: 33597522)
  • Existing scFM pre-training objectives, such as masked gene modeling, treat genes as independent tokens, failing to capture the physical stoichiometry and coordinate regulation inherent in multi-subunit receptor-ligand pairs. (Derived, Medium; PMID: 40269681)
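
The dependence on coordinated subunit expression can be sketched by summarizing a multi-subunit complex with a geometric mean, a convention used by complex-aware communication tools such as CellChat. The expression values below are hypothetical, and `complex_expression` is an illustrative helper, not a function from any library.

```python
import math


def complex_expression(subunit_values):
    """Geometric mean of subunit expression. If any subunit is absent
    (value <= 0), the complex is treated as not expressed."""
    if not subunit_values or any(v <= 0 for v in subunit_values):
        return 0.0
    return math.exp(sum(math.log(v) for v in subunit_values) / len(subunit_values))


# Hypothetical expression of a two-subunit receptor (e.g. TGFBR1/TGFBR2).
print(complex_expression([4.0, 1.0]))  # lower than the arithmetic mean of 2.5
print(complex_expression([4.0, 0.0]))  # a missing subunit abolishes the complex
```

Unlike an arithmetic mean, this summary cannot be rescued by one highly expressed subunit, which is the behavior a per-gene token objective does not encode explicitly.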

Predictions

  • scFMs modified with stoichiometric-aware loss functions are expected to exhibit an increase in Pearson Delta correlation scores compared to standard MGM-trained models when tested on Perturb-seq datasets targeting heteromeric receptors.
  • The Roughness Index (ROGI) of the latent space for perturbed cell populations is predicted to decrease in the stoichiometric-aware model, indicating a more modellable and biologically coherent representation.
  • Virtual screens conducted with stoichiometric-aware scFMs should show higher concordance with experimental drug sensitivity validation for multi-target therapies compared to single-gene expression-based models.

Study design

We will utilize the scGPT or scFoundation architecture and modify the loss function to include a 'stoichiometric consistency' term derived from CellChatDB. The model will be fine-tuned on the Replogle (K562 and RPE1) and Norman Perturb-seq datasets. Performance will be measured using Pearson Delta correlation in the differential expression space, specifically focusing on perturbations involving the 48% of heteromeric interactions curated in CellChatDB. The 'Train Mean' and Random Forest with GO features will serve as primary benchmarks. (Derived, Medium; PMID: 40269681, PMID: 33597522, PMID: 38844628)
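
One possible shape for the proposed 'stoichiometric consistency' term is sketched below. Everything in it is an assumption made for illustration: the source does not define the term, the 1:1 subunit ratio implied by the penalty is a simplification, and the function names, the `weight` parameter, and the expression values are hypothetical. A real implementation would be written against the training framework of the chosen model.

```python
import math


def mse(pred, truth):
    """Mean squared error over the genes present in `truth`."""
    genes = sorted(truth)
    return sum((pred[g] - truth[g]) ** 2 for g in genes) / len(genes)


def stoichiometric_penalty(pred, complexes, eps=1e-6):
    """Hypothetical penalty: mean within-complex variance of log predicted
    expression. Zero when all subunits of each complex are predicted at the
    same level (assumes non-negative predictions and a 1:1 ratio)."""
    if not complexes:
        return 0.0
    total = 0.0
    for subunits in complexes:
        logs = [math.log(pred[g] + eps) for g in subunits]
        mean_log = sum(logs) / len(logs)
        total += sum((v - mean_log) ** 2 for v in logs) / len(logs)
    return total / len(complexes)


def total_loss(pred, truth, complexes, weight=0.1):
    """Reconstruction error plus the weighted consistency term."""
    return mse(pred, truth) + weight * stoichiometric_penalty(pred, complexes)


# Hypothetical post-perturbation expression for three genes.
truth = {"TGFBR1": 2.0, "TGFBR2": 2.0, "ACTB": 5.0}
balanced = dict(truth)
imbalanced = {"TGFBR1": 4.0, "TGFBR2": 1.0, "ACTB": 5.0}
complexes = [("TGFBR1", "TGFBR2")]  # would come from a curated complex database

print(total_loss(balanced, truth, complexes))
print(total_loss(imbalanced, truth, complexes))
```

The control arm of the study would correspond to `weight=0`, which reduces the objective to the reconstruction error alone.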

Confounders & controls

  • Low inter-sample variance in common datasets like Adamson or Replogle RPE1 can artificially inflate the performance of mean-based predictors; high-heterogeneity datasets like Replogle K562 must be used for robust evaluation. (Derived, Low; PMID: 40269681)
  • Gene set selection during fine-tuning (e.g., HVG vs. all genes) must be controlled, as scFMs are sensitive to pre-processing decisions. (Direct, High; PMID: 41044630)

Risks/limitations

  • Current scFMs require substantial computational resources (e.g., H100/A100 GPUs), which may limit the scalability of iterative stoichiometric loss calculations during pre-training. (Direct, High; PMID: 41279114, PMID: 41044630)
  • Stoichiometric constraints assume that mRNA levels are valid proxies for protein-level complex formation, which may be confounded by post-translational regulation or molecular diffusion within the tissue. (Indirect, Low; PMID: 33597522, PMID: 35761361)

Falsification criteria

  • The hypothesis is falsified if scFMs with stoichiometric constraints fail to show improved Pearson Delta scores over the 'Train Mean' baseline or if the ROGI score does not decrease in the modified latent space.
  • If the performance improvement is identical for both single-subunit and multi-subunit receptor perturbations, then stoichiometry is not the mechanistically relevant factor for manifold smoothing.

Unverified Citations

The following sources failed to support their assigned claims after 3 verification rounds designed to ensure only high-confidence, relevant references are retained:

  • PMID: 33597522 — Existing scFM pre-training objectives, such as masked gene modeling, treat genes as independent tokens, failing to captu...
    Failed: conclusion — The paper discusses the limitation of communication tools (e.g. SingleCellSignalR) using single genes, but it does not discuss scFM pre-training objectives like masked gene modeling.
  • PMID: 41044630 — Enforcing these stoichiometric priors within the latent space should smoothen the perturbation-response manifold, allowi...
    Failed: entities,conclusion — The paper demonstrates that smoother landscapes correlate with better performance, but it does not evaluate 'stoichiometric priors' or Geneformer's in silico deletion accuracy.
    Possible alternatives (unverified): PMID:38844628 (89% topic match)
  • PMID: 37258680 — Enforcing these stoichiometric priors within the latent space should smoothen the perturbation-response manifold, allowi...
    Failed: mechanism,conclusion — The paper describes Geneformer's in silico deletion performance but does not mention stoichiometric priors or manifold smoothing.
    Possible alternatives (unverified): PMID:38844628 (89% topic match)
  • PMID: 40269681 — scFMs modified with stoichiometric-aware loss functions will exhibit a statistically significant increase in Pearson Del...
    Failed: conclusion — The paper uses Perturb-seq datasets and Pearson Delta, but it does not test 'stoichiometric-aware loss functions' or these specific heteromeric receptors.
  • PMID: 33597522 — scFMs modified with stoichiometric-aware loss functions will exhibit a statistically significant increase in Pearson Del...
    Failed: mechanism,conclusion — The paper mentions heteromeric receptors like TGFB but does not discuss scFM training, loss functions, or Pearson Delta scores.
  • PMID: 41044630 — The Roughness Index (ROGI) of the latent space for perturbed cell populations will decrease in the stoichiometric-aware ...
    Failed: conclusion — The paper establishes ROGI as a metric for modellability but does not test or mention a 'stoichiometric-aware model'.
  • PMID: 41279114 — Virtual screens conducted with stoichiometric-aware scFMs will show higher concordance with experimental drug sensitivit...
    Failed: conclusion — The paper describes virtual screening with LLMs but does not evaluate 'stoichiometric-aware' scFMs or compare them to DREEP validation scores.
  • PMID: 38041118 — Virtual screens conducted with stoichiometric-aware scFMs will show higher concordance with experimental drug sensitivit...
    Failed: entities,conclusion — The paper describes DREEP but does not involve scFMs or stoichiometric awareness.
  • PMID: 40269681 — The control case will be a standard scFM (scGPT/scFoundation) trained with the same data volume but without the stoichio...
    Failed: conclusion — The paper uses these models as baselines, but the claim's specific 'stoichiometric loss constraint' and its experimental setup are not described in this paper.
    Possible alternatives (unverified): PMID:38844628 (89% topic match)
  • PMID: 38409223 — The control case will be a standard scFM (scGPT/scFoundation) trained with the same data volume but without the stoichio...
    Failed: conclusion — The paper defines these as foundation models but does not describe the specific stoichiometric loss constraint experiment.
    Possible alternatives (unverified): PMID:38844628 (89% topic match)
  • PMID: 40269681 — The hypothesis is falsified if scFMs with stoichiometric constraints fail to show improved Pearson Delta scores over the...
    Failed: conclusion — The paper provides the baseline evidence but does not mention stoichiometric constraints or the specific hypothesis/falsification condition.
  • PMID: 41044630 — The hypothesis is falsified if scFMs with stoichiometric constraints fail to show improved Pearson Delta scores over the...
    Failed: conclusion — The paper supports the correlation between performance and ROGI but does not mention the stoichiometric constraint hypothesis.
  • PMID: 33597522 — If the performance improvement is identical for both single-subunit and multi-subunit receptor perturbations, then stoic...
    Failed: conclusion — The paper distinguishes between single and multi-subunit complexes but does not discuss manifold smoothing or performance improvements in scFMs.