🧬 DTI Models: Drug–Target Interaction Prediction

A collection of Drug–Target Interaction (DTI) models for activity prediction and potency estimation.

Current Status: 🚧 Active Development / Research Prototype


Overview

This repository contains two Drug–Target Interaction models developed for computational drug discovery research.

Available Models

1. DTI-LLM (LoRA Adapter)

A LoRA fine‑tuned LLaMA‑3 model designed for:

  • Activity Classification
  • Potency Regression (pXC50 Prediction)

This is a dual‑task model capable of both predicting whether a drug–target pair is active and estimating interaction potency.

2. DTI-BioMedBERT (Classification Checkpoint)

A BioMedBERT‑based checkpoint trained specifically for:

  • Binary Activity Classification

This model focuses exclusively on determining whether a compound is likely to be biologically active against a target protein.


Research Goal

The primary objective of this project is to explore how modern AI models can be adapted for computational drug discovery tasks.

The long-term goals include:

  • Improving virtual screening workflows
  • Assisting early-stage lead prioritization
  • Exploring LLM-based molecular reasoning
  • Investigating structured biomedical prediction
  • Building lightweight domain-specific AI systems deployable on consumer hardware

This repository represents an ongoing research effort rather than a finished production system.


Model Variants

DTI-LLM (LoRA Adapter)

| Component | Value |
|---|---|
| Base Model | unsloth/llama-3-8b-bnb-4bit |
| Fine‑Tuning Method | LoRA + Checkpoint |
| Training Hardware | NVIDIA T4 16GB |
| Framework | Unsloth |

Tasks

  • Classification – Predict whether a drug is likely to be biologically active against a target protein.
  • Regression – Estimate the interaction potency (pXC50) of the drug–target pair.

DTI-BioMedBERT (Classification Checkpoint)

| Component | Value |
|---|---|
| Base Model | microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext |
| Architecture | BioMedBERT |
| Task | Binary Classification |
| Output | Active / Inactive |
| Framework | Transformers |

Task

Predict whether a drug–target pair is biologically active.

Unlike DTI-LLM, this checkpoint does not perform potency regression and is intended solely for activity prediction.


Input Format

The models expect information about:

  • Drug molecule (SMILES)
  • Protein target (UniProt ID)
  • Optional assay metadata

Example:

Drug:
SMILES: NC1=NC(=S)C2=C(N1)N=CN2

Target:
UniProt ID: Q13043
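
As a minimal sketch, the inputs listed above could be validated before being passed to either model. The `DTIQuery` class and its field names are hypothetical and not part of this repository; the accession pattern is the one published in the UniProt help pages.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Accession pattern as published in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


@dataclass(frozen=True)
class DTIQuery:
    """Hypothetical container for one drug-target pair (not part of this repository)."""

    smiles: str
    uniprot_id: str
    assay_metadata: Optional[dict] = None  # optional assay metadata, free-form

    def __post_init__(self) -> None:
        # Only basic checks: a non-empty SMILES and a well-formed accession.
        if not self.smiles.strip():
            raise ValueError("SMILES string must not be empty")
        if not UNIPROT_ACCESSION.fullmatch(self.uniprot_id):
            raise ValueError(f"Not a valid UniProt accession: {self.uniprot_id!r}")


query = DTIQuery(smiles="NC1=NC(=S)C2=C(N1)N=CN2", uniprot_id="Q13043")
print(query)
```

This only checks the shape of the identifiers. It does not verify that the SMILES string is chemically valid or that the accession exists in UniProt.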

Output Formats

DTI-LLM

{
  "is_active": true,
  "pxc50": 6.2,
  "confidence": "high",
  "reasoning": "Structural similarity suggests moderate binding affinity."
}

| Field | Description |
|---|---|
| is_active | Binary activity prediction |
| pxc50 | Predicted potency value |
| confidence | Model confidence estimate |
| reasoning | Generated explanation |
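
Because DTI-LLM produces its output as generated text, downstream code may want to extract and type-check the JSON object before using it. The sketch below assumes the generated text contains one JSON object with the four fields described above; the function name `parse_dti_llm_output` is hypothetical.

```python
import json
from typing import Any

# Expected fields and their JSON types, taken from the example output above.
REQUIRED_FIELDS = {
    "is_active": bool,
    "pxc50": (int, float),
    "confidence": str,
    "reasoning": str,
}


def parse_dti_llm_output(generated_text: str) -> dict[str, Any]:
    """Extract and type-check the first JSON object found in generated text."""
    start = generated_text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and ignores any trailing text.
    record, _ = json.JSONDecoder().raw_decode(generated_text[start:])
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if not isinstance(record[name], expected_type):
            raise ValueError(f"unexpected type for field: {name}")
    return record


example = (
    'Prediction: {"is_active": true, "pxc50": 6.2, "confidence": "high", '
    '"reasoning": "Structural similarity suggests moderate binding affinity."}'
)
print(parse_dti_llm_output(example)["pxc50"])
```

Rejecting malformed output early keeps a truncated or free-form generation from being silently treated as a prediction.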

DTI-BioMedBERT

{
  "is_active": true,
  "confidence": 0.91
}

| Field | Description |
|---|---|
| is_active | Binary activity prediction |
| confidence | Predicted confidence score |

Performance

DTI-LLM

Classification Task (Activity Prediction)

| Metric | Score |
|---|---|
| Accuracy | 0.946 |
| Precision | 1.000 |
| Recall | 0.512 |
| F1 Score | 0.658 |
| ROC-AUC | 0.765 |
| PR-AUC | 0.610 |

Interpretation

The model currently exhibits very high precision (1.000 on the evaluation set).

When the model predicts that a compound is active, it is rarely incorrect. This behavior makes it useful for reducing false positives during early-stage virtual screening.

However, recall is only 0.512, meaning that roughly half of the genuinely active compounds are not identified.

Current development efforts are focused on improving recall while maintaining strong precision.
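
For reference, the F1 score for a single class is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded values reported for both models. For DTI-BioMedBERT the result matches the reported 0.576. For DTI-LLM it gives about 0.677, somewhat higher than the reported 0.658, which may reflect a different averaging scheme; the cause is not documented here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall as reported in the performance tables.
print(round(f1_score(1.000, 0.512), 3))  # DTI-LLM
print(round(f1_score(0.560, 0.593), 3))  # DTI-BioMedBERT
```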


Regression Task (Potency Prediction)

| Metric | Score |
|---|---|
| RMSE | 1.099 |
| MAE | 0.723 |
| R² | -0.235 |
| Pearson r | 0.404 |
| Spearman ρ | 0.578 |

Interpretation

The regression component remains experimental.

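While the model demonstrates moderate ranking capability (Spearman ρ = 0.578), absolute potency prediction is currently unreliable: the negative R² (-0.235) means its predictions fit the measured values worse than always predicting the mean pXC50.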

The model can often distinguish stronger interactions from weaker ones, but exact pXC50 values should not be interpreted as experimentally accurate measurements.
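
A positive rank correlation combined with a negative R² can look contradictory. The toy example below, using made-up values that are not model outputs, shows how predictions that preserve the correct order but are shifted and compressed yield a perfect Spearman correlation and a negative R² at the same time.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    centre = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - centre) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def spearman_rho(x, y):
    """Spearman correlation for tie-free data: Pearson correlation of the ranks."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy values chosen for illustration only; they are not model outputs.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [7.0, 7.2, 7.4, 7.6]  # correct order, but shifted and compressed

print(r_squared(measured, predicted))     # negative: worse than predicting the mean
print(spearman_rho(measured, predicted))  # perfect rank agreement
```

This is why the predicted pXC50 values are better read as a ranking signal than as calibrated potencies.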

For the current release:

✅ Suitable for relative ranking

⚠️ Not suitable for precise potency estimation

Future work will focus heavily on improving regression performance through larger datasets, improved loss functions, and multi-task optimization.


DTI-BioMedBERT

Classification Task (Activity Prediction)

| Metric | Score |
|---|---|
| Accuracy | 0.925 |
| Precision | 0.560 |
| Recall | 0.593 |
| F1 Score | 0.576 |
| ROC-AUC | 0.903 |

Interpretation

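The DTI-BioMedBERT checkpoint achieves a ROC-AUC of 0.903, indicating effective discrimination between active and inactive drug–target pairs across decision thresholds. At the reported operating point, precision is 0.560 and recall is 0.593, so a substantial share of predicted actives are still false positives.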

Compared with DTI-LLM, it provides a more balanced precision–recall tradeoff and is optimized specifically for activity prediction.

Recommended use cases include:

✅ Binary DTI classification

✅ Large-scale virtual screening

✅ Activity prediction benchmarks

✅ Fast inference workflows


Choosing a Model

| Use Case | Recommended Model |
|---|---|
| Activity Prediction Only | DTI-BioMedBERT |
| Activity + Potency Prediction | DTI-LLM |
| Fast Screening | DTI-BioMedBERT |
| Potency Ranking | DTI-LLM |
| LLM-Based Biomedical Research | DTI-LLM |
| Highest ROC-AUC Classification | DTI-BioMedBERT |

Current Development Status

These models are actively being developed.

Planned improvements include:

  • Larger and more diverse training datasets
  • Additional target protein coverage
  • Improved regression accuracy
  • Better calibration of confidence scores
  • Multi-stage fine-tuning strategies
  • Retrieval-augmented biomedical context
  • Expanded benchmark evaluation

Performance metrics and model behavior may change significantly between releases.


Example Usage

Installation

pip install unsloth transformers accelerate bitsandbytes peft

Loading DTI-LLM

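from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the 4-bit quantized base model (needs bitsandbytes and a CUDA GPU).
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/llama-3-8b-bnb-4bit"
)

# Attach the LoRA adapter weights on top of the base model.
# If the adapter files sit in the dti_llm/ folder listed under
# Repository Contents, passing subfolder="dti_llm" may be required.
model = PeftModel.from_pretrained(
    base_model,
    "Cyanex/BioGPT-X"
)

# Load the tokenizer from the same repository.
tokenizer = AutoTokenizer.from_pretrained(
    "Cyanex/BioGPT-X"
)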

CLI Inference for the LoRA Adapter (Recommended)

The repository includes a ready-to-use inference script for generating Drug–Target Interaction predictions.

Example:

python inference.py \
  --model_path ./lora_adapter \
  --smiles "CCO" \
  --uniprot "P04637" \
  --target_name "p53" \
  --mechanism "binding" \
  --technology "IC50 assay"

Supported Arguments

| Argument | Description |
|---|---|
| --model_path | Path to the LoRA adapter |
| --smiles | Drug SMILES string |
| --uniprot | UniProt protein identifier |
| --target_name | Optional target name |
| --mechanism | Optional assay mechanism |
| --technology | Optional assay technology |

The CLI script is the recommended way to run inference and reproduce the results reported in this repository.
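
When screening many pairs, it can be convenient to drive the CLI from Python. The sketch below only assembles the argument list from the flags documented above; the helper name `build_inference_command` is hypothetical, and the way `inference.py` formats its output is not assumed.

```python
import sys
from typing import Optional


def build_inference_command(
    model_path: str,
    smiles: str,
    uniprot: str,
    target_name: Optional[str] = None,
    mechanism: Optional[str] = None,
    technology: Optional[str] = None,
    script: str = "inference.py",
) -> list[str]:
    """Build the argv list for the CLI; optional flags are omitted when None."""
    command = [
        sys.executable, script,
        "--model_path", model_path,
        "--smiles", smiles,
        "--uniprot", uniprot,
    ]
    optional = {
        "--target_name": target_name,
        "--mechanism": mechanism,
        "--technology": technology,
    }
    for flag, value in optional.items():
        if value is not None:
            command += [flag, value]
    return command


command = build_inference_command("./lora_adapter", "CCO", "P04637", target_name="p53")
print(command)
# To run it: subprocess.run(command, check=True, capture_output=True, text=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with SMILES strings, which often contain characters such as parentheses and `=`.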

Loading DTI-BioMedBERT

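from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned BioMedBERT activity classifier.
# If the checkpoint sits in the dti_biomedbert/ folder listed under
# Repository Contents, passing subfolder="dti_biomedbert" may be required.
model = AutoModelForSequenceClassification.from_pretrained(
    "Cyanex/BioGPT-X"
)

# Load the matching tokenizer from the same repository.
tokenizer = AutoTokenizer.from_pretrained(
    "Cyanex/BioGPT-X"
)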
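
Once the classifier returns logits, they need to be converted into the `is_active` / `confidence` format shown under Output Formats. The sketch below assumes two output classes with index 1 meaning "active", and assumes that `confidence` is the softmax probability of the predicted class; neither assumption is confirmed by this repository, and the function name is hypothetical. With a loaded model, the logits would typically come from something like `model(**inputs).logits[0].tolist()`.

```python
import math


def logits_to_prediction(logits, active_index=1):
    """Convert two-class logits into the {is_active, confidence} format."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probabilities = [value / total for value in exps]
    predicted = probabilities.index(max(probabilities))
    return {
        # active_index=1 is an assumption about the label order.
        "is_active": predicted == active_index,
        "confidence": round(probabilities[predicted], 2),
    }


# Made-up logits for illustration only.
print(logits_to_prediction([-1.2, 1.1]))
```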

Repository Contents

dti_llm/
├── adapter_config.json
├── adapter_model.safetensors
├── tokenizer.json
├── tokenizer_config.json
└── training_config.json

dti_biomedbert/
├── config.json
├── model.safetensors
├── tokenizer.json
└── tokenizer_config.json
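
After downloading the repository, a quick check that the files listed above are present can catch incomplete downloads before loading. This is a minimal sketch; the helper name `missing_files` is hypothetical, and the file list is copied from the layout above.

```python
from pathlib import Path

# File layout as listed under Repository Contents.
EXPECTED_FILES = {
    "dti_llm": [
        "adapter_config.json",
        "adapter_model.safetensors",
        "tokenizer.json",
        "tokenizer_config.json",
        "training_config.json",
    ],
    "dti_biomedbert": [
        "config.json",
        "model.safetensors",
        "tokenizer.json",
        "tokenizer_config.json",
    ],
}


def missing_files(root: str) -> list[str]:
    """Return the expected relative paths that are absent under root."""
    base = Path(root)
    return [
        f"{folder}/{name}"
        for folder, names in EXPECTED_FILES.items()
        for name in names
        if not (base / folder / name).is_file()
    ]


print(missing_files("."))  # an empty list means every expected file was found
```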

Limitations

Regression Performance

Potency prediction remains the weakest component of the DTI-LLM system and should be considered experimental.

Dataset Bias

Training data originates from public biological assays and may not represent all protein families, assay conditions, or chemical spaces.

Hallucinated Reasoning

Generated explanations from DTI-LLM are model-generated text and should not be interpreted as mechanistic biological evidence.

Not for Clinical Use

These models are intended solely for research, education, and experimentation.

Predictions must never be used for:

  • Clinical decision making
  • Medical diagnosis
  • Drug prescription
  • Regulatory submissions

All predictions require experimental validation.


Intended Use

Appropriate uses include:

  • Academic research
  • Educational projects
  • Drug discovery experimentation
  • Virtual screening exploration
  • Biomedical AI benchmarking
  • Model fine-tuning demonstrations

Acknowledgements

Special thanks to:

  • Meta for LLaMA-3
  • Unsloth for efficient fine-tuning tools
  • Microsoft Research for BioMedBERT
  • The creators of the eve-bio/drug-target-activity dataset
  • The open-source biomedical AI community

License

Research Only.

Commercial use may be subject to the license terms of the underlying LLaMA-3 and BioMedBERT models.


Disclaimer

DTI-LLM and DTI-BioMedBERT are experimental research projects under active development.

All predictions are computational estimates and should not be considered biological evidence.

Experimental validation is required before any practical use.
