---
library_name: transformers
tags:
- Code
- Vulnerability
- Detection
datasets:
- DetectVul/devign
language:
- en
base_model:
- microsoft/codebert-base
license: mit
metrics:
- accuracy
- precision
- f1
- recall
---

## CodeBERT for Code Vulnerability Detection

## Model Summary
This model is a fine-tuned version of **microsoft/codebert-base** for detecting vulnerabilities in source code. It was trained on the **DetectVul/devign** dataset. Given a code snippet, it predicts one of two labels: **safe (0)** or **vulnerable (1)**.

## Model Details

- **Developed by:** Mukit Mahdin
- **Finetuned from:** `microsoft/codebert-base`
- **Language(s):** English (for code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset Used:** `DetectVul/devign`
- **Architecture:** Transformer-based sequence classification

## Uses

### Direct Use
This model can support **static code analysis**, security audits, and automated vulnerability detection in software repositories. It is intended for:
- **Developers**: to check their code for potential security flaws.
- **Security Teams**: to scan repositories for potentially vulnerable code.
- **Researchers**: to study AI-powered vulnerability detection.

### Downstream Use
This model can be integrated into **IDE plugins**, **CI/CD pipelines**, or **security scanners** to flag potentially vulnerable code as it is written or committed.

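
As a rough illustration of the CI/CD case, the sketch below walks a directory, runs a classifier on each C/C++ source file, and returns a non-zero exit code if anything is flagged. The names `find_sources`, `scan`, `exit_code`, and the suffix list are hypothetical and not part of this model or of any library. The `predict` argument is an assumed callable that maps source text to `0` or `1`; it could wrap the tokenizer and model calls from this card's inference snippet.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

# Hypothetical set of file extensions to scan; the card only mentions C/C++.
SOURCE_SUFFIXES = {".c", ".h", ".cpp", ".hpp"}


def find_sources(root: Path, suffixes: Iterable[str] = SOURCE_SUFFIXES) -> List[Path]:
    """Return all files under `root` whose suffix is in `suffixes`, sorted."""
    wanted = set(suffixes)
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix in wanted)


def scan(root: Path, predict: Callable[[str], int]) -> List[Tuple[Path, int]]:
    """Run `predict` on each source file and return (path, label) pairs.

    `predict` is any callable mapping source text to 0 (safe) or 1 (vulnerable).
    """
    results = []
    for path in find_sources(root):
        code = path.read_text(encoding="utf-8", errors="replace")
        results.append((path, predict(code)))
    return results


def exit_code(results: List[Tuple[Path, int]]) -> int:
    """Return 1 (fail the CI job) if any file was flagged, otherwise 0."""
    return 1 if any(label == 1 for _, label in results) else 0
```

This sketch passes whole files to `predict` for brevity. Because the inference snippet in this card truncates input at 512 tokens, a real integration would likely need to split files into individual functions first.
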
### Out-of-Scope Use
- The model is **not meant to replace human security experts**.
- It may not generalize well to **languages other than C/C++**.
- False positives and false negatives may occur due to dataset limitations.

## Bias, Risks, and Limitations
- **False Positives & False Negatives:** The model may flag safe code as vulnerable or miss actual vulnerabilities.
- **Limited to C/C++:** The model was trained on a dataset composed primarily of **C and C++ code**, so it may not perform well on other languages.
- **Dataset Bias:** The training data may not cover all vulnerability types.

### Recommendations
Users should **not rely solely on the model** for security assessments. Use it alongside **manual code review and static analysis tools**.

## How to Get Started with the Model
Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the base CodeBERT tokenizer and the fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("mahdin70/codebert-devign-code-vulnerability-detector")

# Sample code snippet
code_snippet = '''
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
'''

# Tokenize the input (anything beyond 512 tokens is truncated)
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=-1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Safe Code")
```

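The snippet above labels a sample by taking the argmax over the two classes. Since the reported recall is low relative to precision, one option is to threshold the vulnerable-class probability instead, trading some precision for recall. The following is a minimal sketch in plain Python; `vulnerable_probability`, `classify`, and any threshold other than 0.5 are illustrative assumptions, not settings published with this model.

```python
import math
from typing import Sequence


def vulnerable_probability(logits: Sequence[float]) -> float:
    """Softmax probability of class 1 (vulnerable) from two raw logits."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [safe, vulnerable]")
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def classify(logits: Sequence[float], threshold: float = 0.5) -> int:
    """Return 1 (vulnerable) if P(vulnerable) >= threshold, else 0 (safe)."""
    return 1 if vulnerable_probability(logits) >= threshold else 0
```

With the tensors from the inference snippet, this could be called as `classify(outputs.logits[0].tolist(), threshold=0.3)`, where `0.3` is an arbitrary example value. At the default threshold of 0.5 the result matches the argmax decision except for exact ties. A suitable threshold would need to be chosen on held-out data.
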
## Training Details

### Training Data
- **Dataset:** `DetectVul/devign`
- **Classes:** `0 (Safe)`, `1 (Vulnerable)`
- **Size:** 21,800 code snippets

### Training Procedure
- **Optimizer:** AdamW
- **Loss Function:** CrossEntropyLoss
- **Batch Size:** 8
- **Learning Rate:** 2e-05
- **Epochs:** 3
- **Hardware Used:** 2x T4 GPU

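
To give a sense of the training length these settings imply, the sketch below computes the number of optimizer steps. It rests on several assumptions the card does not confirm: that all 21,800 snippets were used for training, that there was no gradient accumulation, and that the final partial batch was kept. The card also does not say whether the batch size of 8 is the total or per GPU, so both readings are shown. The function name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, epochs: int, num_devices: int = 1) -> int:
    """Total optimizer steps when each device processes `batch_size` examples per step."""
    effective_batch = batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# Values from the lists above: 21,800 snippets, batch size 8, 3 epochs.
global_batch_steps = optimizer_steps(21800, 8, 3)                     # batch size 8 in total
per_device_batch_steps = optimizer_steps(21800, 8, 3, num_devices=2)  # batch size 8 on each of 2 GPUs
print(global_batch_steps, per_device_batch_steps)  # 8175 4089
```
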
### Metrics
| Metric | Score |
|------------|-------------|
| **Train Loss** | 0.5898 |
| **Evaluation Loss** | 0.6153 |
| **Accuracy** | 64.09% |
| **F1 Score** | 46.42% |
| **Precision** | 73.78% |
| **Recall** | 33.86% |

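
The F1 score in the table is the harmonic mean of the reported precision and recall, which the short check below reproduces. It also illustrates what the recall figure means in practice: the model catches roughly one in three vulnerable snippets. The helper name `f1_score` and the 1,000-snippet example are illustrative only, and the card does not state which data split these metrics were computed on.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.7378, 0.3386  # values from the table above
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.4642

# Illustration: out of 1,000 truly vulnerable snippets, a recall of 33.86%
# means roughly 339 are flagged and roughly 661 are missed.
flagged = round(1000 * recall)
missed = 1000 - flagged
```
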
## Environmental Impact

| Factor | Value |
|-----------|----------|
| **GPU Used** | T4 GPU |
| **Training Time** | ~1 hour |