How many compounds that failed Phase II/III trials in the last 15 years targeted mechanisms that have since been genetically validated by new GWAS or rare variant data — and can AI-driven literature m

Question

Accepted Answer

The provided articles do not report a global tally of compounds that failed Phase II/III trials in the last 15 years and whose target mechanisms have since been genetically validated. However, the evidence confirms that drug targets with human genetic support are approximately twice as likely to result in approved therapies, and several AI-driven frameworks have been developed to systematically identify resurrection or repositioning candidates using literature and genomic data (Direct, High; PMID: 31830040, PMID: 35608290).

## Genetic Validation of Failed Trial Mechanisms
While a cumulative count of "resurrected" failed candidates is not reported (NR), the literature provides specific instances and large-scale re-evaluations of the impact of genetic evidence on clinical progression:

*   **Impact on Success Rates:** An analysis of 21,934 gene target-indication pairs confirms that targets with Mendelian (OMIM) or GWAS-supported evidence have significantly higher probabilities of progressing from Phase II to Phase III and achieving final approval (Direct, High; PMID: 31830040).
*   **Genetic Insights into Failure:** Mendelian randomization (MR) has been used to provide genetic explanations for high-profile failures. For example, the mixed lineage kinase inhibitor **CEP-1347** failed a Phase III trial for Parkinson's disease; subsequent genetic analysis suggested the target mechanism might actually increase disease risk, explaining the lack of efficacy (Direct, High; PMID: 34930919).
*   **Target Selection Tiers:** Approximately 22% of the human protein-coding genome (4,479 genes) is considered "druggable." Loci identified by GWAS have already "rediscovered" at least 74 licensed drug targets through concordant associations with disease indications or mechanism-based side effects, suggesting many more failed or discordant targets remain for systematic exploration (Direct, Medium; PMID: 28356508).
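
Because no cumulative count is reported, the tally the question asks for would have to be built by cross-referencing failed trials against later genetic findings. The following is a minimal sketch of that join, not a method taken from the cited papers. All record types, field names, and example rows (`compound_A`, `GENE1`, `disease_X`, and so on) are hypothetical placeholders. It assumes exact string matching between a trial's indication and a genetic trait, and it does not check the direction of the genetic effect, which the CEP-1347 example shows can matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedTrial:
    compound: str
    target_gene: str
    indication: str
    failure_year: int


@dataclass(frozen=True)
class GeneticEvidence:
    gene: str
    trait: str
    source: str  # e.g. "GWAS" or "rare variant"
    year: int


def later_validated(trials, evidence):
    """Map each failed compound to the sources of genetic evidence that
    support its (target gene, indication) pair and were published AFTER
    the trial failed. Compounds with no such evidence are omitted."""
    by_pair = {}
    for ev in evidence:
        by_pair.setdefault((ev.gene, ev.trait), []).append(ev)

    hits = {}
    for trial in trials:
        support = [
            ev
            for ev in by_pair.get((trial.target_gene, trial.indication), [])
            if ev.year > trial.failure_year
        ]
        if support:
            hits[trial.compound] = sorted({ev.source for ev in support})
    return hits


# Hypothetical placeholder data, for illustration only.
trials = [
    FailedTrial("compound_A", "GENE1", "disease_X", 2012),
    FailedTrial("compound_B", "GENE2", "disease_Y", 2015),
    FailedTrial("compound_C", "GENE3", "disease_Z", 2018),
]
evidence = [
    GeneticEvidence("GENE1", "disease_X", "GWAS", 2019),
    GeneticEvidence("GENE2", "disease_Y", "GWAS", 2010),  # predates the failure
    GeneticEvidence("GENE3", "disease_W", "rare variant", 2021),  # different trait
]

candidates = later_validated(trials, evidence)
print(len(candidates), candidates)
```

In practice the hard part is upstream of this join: normalizing compound, gene, and indication names across trial registries and genetic resources so that the pairs can be compared at all.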

## AI-Driven Systematic Identification of Candidates
The provided evidence describes several advanced AI and natural language processing (NLP) frameworks designed to systematically mine literature and genomic data to prioritize drug candidates:

*   **LEADS (Search, Screening, and Data Extraction):** This foundation large language model decomposes literature mining into six subtasks, including search query generation and study eligibility assessment. It has been validated to accelerate citation screening and data extraction by 20.8% and 26.9%, respectively, while maintaining expert-level accuracy (Direct, High; PMID: 40993125).
*   **GNNHap (Multi-modal Graph-based Pipeline):** This graph neural network-based pipeline analyzes 29 million published papers to assess gene-phenotype relationships. It integrates protein-protein interaction (PPI) networks and protein sequence features to identify causal genetic factors, successfully identifying novel factors for diabetes and obesity (Direct, High; PMID: 35608290).
*   **SemaTyP (Knowledge Graph Inference):** This method utilizes "SemKG," a biomedical knowledge graph constructed from PubMed abstracts using SemRep. It exploits semantic types of paths to discover drug therapies and can provide the specific mechanism of action for candidate drugs (Direct, High; PMID: 29843590).
*   **SMR and GSMR Frameworks:** Summary-data-based MR (SMR) and generalized SMR (GSMR) integrate GWAS with expression QTL (eQTL) and protein QTL (pQTL) data to prioritize targets. These methods have identified genes like **RNASET2** for autoimmune thyroiditis and **C4BPA** for NSCLC as prioritized targets with high translational potential (Direct, High; PMID: 41704487, PMID: 41053817).
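
To make the SMR step more concrete, here is a minimal sketch of a single-instrument SMR calculation. It assumes the commonly described form of the test: the effect of gene expression on the trait is estimated as the ratio of the variant-trait effect to the variant-expression effect, with an approximate one-degree-of-freedom chi-square statistic built from the two z-scores. The function name and the input numbers are illustrative assumptions, not values from the cited studies. The sketch omits the HEIDI heterogeneity test and the multi-instrument extension used by GSMR.

```python
import math


def smr_wald_ratio(b_zx, se_zx, b_zy, se_zy):
    """Single-variant SMR estimate (hypothetical helper).

    b_zx, se_zx: effect of the variant on expression (eQTL/pQTL) and its SE.
    b_zy, se_zy: effect of the variant on the trait (GWAS) and its SE.
    Returns (b_xy, t_smr, p_value).
    """
    if b_zx == 0 or se_zx <= 0 or se_zy <= 0:
        raise ValueError("need a non-zero QTL effect and positive standard errors")

    z_zx = b_zx / se_zx
    z_zy = b_zy / se_zy

    # Ratio estimate of the expression -> trait effect.
    b_xy = b_zy / b_zx

    # Approximate chi-square statistic with 1 degree of freedom.
    t_smr = (z_zx**2 * z_zy**2) / (z_zx**2 + z_zy**2)

    # Upper tail of a chi-square(1) distribution: P(X > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(t_smr / 2.0))
    return b_xy, t_smr, p_value


# Made-up summary statistics, for illustration only.
b_xy, t_smr, p_value = smr_wald_ratio(b_zx=0.5, se_zx=0.05, b_zy=0.1, se_zy=0.02)
print(b_xy, t_smr, p_value)
```

The sign of `b_xy` is what connects this to resurrection decisions: a failed inhibitor is only a plausible candidate if genetically lower expression or activity of its target is associated with lower disease risk.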

## Challenges in Systematic Resurrection
Despite the availability of AI tools, the identification of resurrection candidates is hindered by several factors noted in the provided articles:
*   **Data Quality and Complexity:** Systematic reviews often face obstacles such as poor metadata indexing in PubMed and data extraction errors in published studies (Direct, Medium; PMID: 40993125).
*   **Context-Dependent Effects:** Genetic associations for highly dissimilar traits are negative predictors of drug approval, suggesting that target-disease links must be precisely matched to avoid failure due to unrelated side effects (Direct, High; PMID: 31830040).
*   **Translational Gaps:** In silico models for drug-induced liver injury (DILI) have achieved a 90.9% success rate in predicting hepatotoxicity in failed candidates, identifying safety risks that often contribute to late-stage trial termination (Direct, Medium; PMID: 38922065).

In summary, while the total number of failed compounds with subsequent genetic validation is not globally quantified in the provided texts, the evidence establishes that targets with genetic support are approximately twice as likely to succeed (Derived, High; PMID: 31830040, PMID: 28356508). AI frameworks like LEADS, GNNHap, and SemaTyP can systematically mine millions of publications and genomic datasets to identify these high-probability resurrection candidates (Derived, Medium; PMID: 40993125, PMID: 35608290, PMID: 29843590).
