Token Classification
Transformers
TensorBoard
Safetensors
English
electra
biology
chemistry
medical
cancer
carcinogenesis
biomedical
ner
oncology
Eval Results (legacy)
How to use jimnoneill/CarD-T with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("token-classification", model="jimnoneill/CarD-T")

# Load model directly
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("jimnoneill/CarD-T")
model = AutoModelForTokenClassification.from_pretrained("jimnoneill/CarD-T")
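A token-classification model assigns one label per token, so downstream code usually has to merge token labels into entity spans. The sketch below shows one way to do that in plain Python. It assumes a BIO tagging scheme (`B-`/`I-` prefixes plus `O`), which this page does not confirm for CarD-T; the function name `group_bio_entities`, the sample sentence, and the sample tags are all hypothetical.

```python
from typing import List, Optional, Tuple


def group_bio_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (entity_text, entity_type) spans.

    Assumption: tags look like "B-<type>", "I-<type>", or "O".
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the span that is currently open, if any.
        if current_tokens and current_type is not None:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A "B-" tag, or an "I-" tag that does not continue the open span,
            # starts a new entity.
            flush()
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # Continuation of the open span.
            current_tokens.append(token)
        else:
            # "O" (outside any entity) closes the open span.
            flush()
            current_tokens, current_type = [], None

    flush()
    return entities


# Hypothetical example input; the tags are hand-written, not model output.
tokens = ["Benzene", "exposure", "is", "linked", "to", "acute", "myeloid", "leukemia", "."]
tags = ["B-carcinogen", "O", "O", "O", "O", "B-cancertype", "I-cancertype", "I-cancertype", "O"]
entities = group_bio_entities(tokens, tags)
print(entities)
```

In practice the Transformers pipeline can do comparable grouping itself through its `aggregation_strategy` argument; the manual version is only meant to make the merging step explicit.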
Update README.md
README.md
CHANGED
@@ -43,7 +43,7 @@ model-index:
 CarD-T (Carcinogen Detection via Transformers) is a novel text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. This model is designed to address the challenges faced by current systems in managing the burgeoning biomedical literature related to carcinogen identification and classification.
 
 ## Model Details
-* **Architecture**: Based on Bio-ELECTRA, a 335 million parameter language model
+* **Architecture**: Based on Bio-ELECTRA, a 335 million parameter language model (sultan/BioM-ELECTRA-Large-SQuAD2)
 * **Training Data**: [CarD-T-NER dataset](https://huggingface.co/datasets/jimnoneill/CarD-T-NER) containing 19,975 annotated examples from PubMed abstracts (2000-2024)
 * Training set: 11,985 examples
 * Test set: 7,990 examples
@@ -255,7 +255,7 @@ training_args = TrainingArguments(
 learning_rate=2e-5,
 per_device_train_batch_size=16,
 per_device_eval_batch_size=16,
-num_train_epochs=
+num_train_epochs=5,
 weight_decay=0.01,
 evaluation_strategy="epoch",
 save_strategy="epoch",
@@ -265,19 +265,6 @@ training_args = TrainingArguments(
 )
 ```
 
-## Evaluation Metrics
-
-Detailed performance metrics on the test set (7,990 examples):
-
-| Entity Type | Precision | Recall | F1-Score | Support |
-|-------------|-----------|---------|----------|---------|
-| carcinogen | 0.912 | 0.878 | 0.895 | 2,341 |
-| negative | 0.867 | 0.823 | 0.844 | 987 |
-| cancertype | 0.889 | 0.856 | 0.872 | 3,124 |
-| antineoplastic | 0.908 | 0.871 | 0.889 | 1,456 |
-| **Overall** | **0.894** | **0.857** | **0.875** | **7,908** |
-
-## Citation
 
 If you use this model in your research, please cite:
 
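To put the `num_train_epochs=5` change in context, the arithmetic below estimates how many optimizer steps the listed settings imply for the 11,985-example training set. It is a rough sketch, not the author's training script: the single-device setup and the absence of gradient accumulation are assumptions not stated on this page, and the helper name `optimizer_steps` is hypothetical.

```python
import math

# Values taken from the README diff.
TRAIN_EXAMPLES = 11_985
PER_DEVICE_TRAIN_BATCH_SIZE = 16
NUM_TRAIN_EPOCHS = 5

# Assumptions (not stated in the README): one device, no gradient accumulation.
NUM_DEVICES = 1
GRADIENT_ACCUMULATION_STEPS = 1


def optimizer_steps(examples: int, per_device_batch: int, devices: int,
                    accumulation: int, epochs: int) -> tuple:
    """Return (steps_per_epoch, total_steps), keeping the last partial batch."""
    effective_batch = per_device_batch * devices * accumulation
    steps_per_epoch = math.ceil(examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = optimizer_steps(
    TRAIN_EXAMPLES,
    PER_DEVICE_TRAIN_BATCH_SIZE,
    NUM_DEVICES,
    GRADIENT_ACCUMULATION_STEPS,
    NUM_TRAIN_EPOCHS,
)
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

With `evaluation_strategy="epoch"` and `save_strategy="epoch"`, these settings would also imply one evaluation pass and one checkpoint per epoch, five of each over the run. The step count a real run reports can differ if the hardware setup or data loading differs from the assumptions above.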