	---
	library_name: transformers
	tags:
	- Code
	- Vulnerability
	- Detection
	datasets:
	- DetectVul/devign
	language:
	- en
	base_model:
	- microsoft/codebert-base
	license: mit
	metrics:
	- accuracy
	- precision
	- f1
	- recall
	---

	## CodeBERT for Code Vulnerability Detection

	## Model Summary
	This model is a fine-tuned version of `microsoft/codebert-base` for detecting vulnerabilities in code, trained on the `DetectVul/devign` dataset. It takes a code snippet as input and classifies it as either safe (0) or vulnerable (1).

	## Model Details

	- Developed by: Mukit Mahdin
	- Finetuned from: `microsoft/codebert-base`
	- Language(s): English (for code comments & metadata), C/C++
	- License: MIT
	- Task: Code vulnerability detection
	- Dataset Used: `DetectVul/devign`
	- Architecture: Transformer-based sequence classification

	## Uses

	### Direct Use
	This model can be used for static code analysis, security audits, and automatic vulnerability detection in software repositories. It is useful for:
	- Developers: To analyze their code for potential security flaws.
	- Security Teams: To scan repositories for potentially vulnerable code.
	- Researchers: To study AI-based vulnerability detection.

	### Downstream Use
	This model can be integrated into IDE plugins, CI/CD pipelines, or security scanners to provide real-time vulnerability detection.
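
As a rough illustration of the CI/CD case, the sketch below walks a source tree, passes each C/C++ file to a prediction function, and returns a non-zero exit code if anything is flagged. The names `find_sources`, `scan`, and `main`, the extension list, and the `predict` callable are hypothetical and not part of this model's API. `predict` is assumed to wrap the model and return 0 (safe) or 1 (vulnerable). Passing whole files is a simplification; splitting files into individual functions before classification would likely be closer to the training data.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical list of extensions to scan; adjust to the repository.
C_EXTENSIONS = {".c", ".h", ".cc", ".cpp", ".hpp"}


def find_sources(root: Path) -> Iterable[Path]:
    """Yield C/C++ files under root, in sorted order."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in C_EXTENSIONS:
            yield path


def scan(root: Path, predict: Callable[[str], int]) -> List[Path]:
    """Return the files that predict() labels as vulnerable (1)."""
    flagged = []
    for path in find_sources(root):
        code = path.read_text(encoding="utf-8", errors="replace")
        if predict(code) == 1:
            flagged.append(path)
    return flagged


def main(root: str, predict: Callable[[str], int]) -> int:
    """Print flagged files and return an exit code suitable for CI."""
    flagged = scan(Path(root), predict)
    for path in flagged:
        print(f"possible vulnerability: {path}")
    return 1 if flagged else 0
```

In a pipeline, `predict` would tokenize the text and run the classifier, and the job would call `sys.exit(main(".", predict))`. Given the false-positive and false-negative rates of the model, treating the result as a warning rather than a hard gate may be the safer default.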

	### Out-of-Scope Use
	- The model is not meant to replace human security experts.
	- It may not generalize well to languages other than C/C++.
	- False positives/negatives may occur due to dataset limitations.

	## Bias, Risks, and Limitations
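	- False Positives & False Negatives: The model may flag safe code as vulnerable or miss actual vulnerabilities. The reported recall of 33.86% means it missed roughly two thirds of the vulnerable samples in the reported evaluation.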
	- Limited to C/C++: The model was trained on a dataset primarily composed of C and C++ code. It may not perform well on other languages.
	- Dataset Bias: The training data may not cover all possible vulnerabilities.

	### Recommendations
	Users should not rely solely on the model for security assessments. Instead, it should be used alongside manual code review and static analysis tools.

	## How to Get Started with the Model
	Use the code below to load the model and run inference on a sample code snippet:

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load the fine-tuned model
	tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
	model = AutoModelForSequenceClassification.from_pretrained("mahdin70/codebert-devign-code-vulnerability-detector")

	# Sample code snippet
	code_snippet = '''
	void process(char *input) {
	    char buffer[50];
	    strcpy(buffer, input); // Potential buffer overflow
	}
	'''

	# Tokenize the input
	inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

	# Run inference
	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
	    predicted_label = torch.argmax(predictions, dim=1).item()

	# Output the result
	print("Vulnerable Code" if predicted_label == 1 else "Safe Code")
	```
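
The example above reduces the output to a hard label via `argmax`. Since the card reports higher precision than recall, it can be useful to look at the vulnerable-class probability directly and pick a decision threshold. The sketch below does this with the standard library only; the helper names and the example logits are hypothetical, and the logit order `[safe, vulnerable]` is assumed from the label mapping stated in this card.

```python
import math


def vulnerable_probability(logits):
    """Softmax over two logits ordered [safe, vulnerable]; returns P(vulnerable)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def classify(logits, threshold=0.5):
    """Label a snippet as vulnerable when P(vulnerable) >= threshold."""
    p = vulnerable_probability(logits)
    label = "Vulnerable Code" if p >= threshold else "Safe Code"
    return label, p


# Hypothetical logits, e.g. taken from outputs.logits[0].tolist()
label, p = classify([0.8, -0.3], threshold=0.3)
print(f"{label} (P(vulnerable) = {p:.2f})")
```

Lowering the threshold below 0.5 will usually flag more snippets, trading precision for recall. A suitable value should be chosen on held-out data rather than assumed.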

	## Training Details

	### Training Data
	- Dataset: `DetectVul/devign`
	- Classes: `0 (Safe)`, `1 (Vulnerable)`
	- Size: 21,800 code snippets

	### Training Procedure
	- Optimizer: AdamW
	- Loss Function: CrossEntropyLoss
	- Batch Size: 8
	- Learning Rate: 2e-05
	- Epochs: 3
	- Hardware Used: 2x T4 GPU
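
To give a feel for the scale of this run, the sketch below estimates the number of optimizer steps implied by the settings above. It assumes all 21,800 snippets were used for training and that no gradient accumulation was applied. The card does not say whether the batch size of 8 is global or per GPU, so both readings are shown. The function name is hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs, num_devices=1):
    """Estimate optimizer steps, treating batch_size as per-device."""
    effective_batch = batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# Reading 1: batch size 8 is the global batch size.
print(total_optimizer_steps(21800, 8, 3))                 # 8175

# Reading 2: batch size 8 per GPU on 2 GPUs (effective batch size 16).
print(total_optimizer_steps(21800, 8, 3, num_devices=2))  # 4089
```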

	### Metrics
	| Metric | Score |
	|-----------------|--------|
	| Train Loss | 0.5898 |
	| Evaluation Loss | 0.6153 |
	| Accuracy | 64.09% |
	| F1 Score | 46.42% |
	| Precision | 73.78% |
	| Recall | 33.86% |
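
The F1 score is the harmonic mean of precision and recall, so it can be checked against the precision and recall values reported above. The short sketch below does that calculation; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.7378, 0.3386)
print(f"F1 = {f1 * 100:.2f}%")  # F1 = 46.42%
```

The low F1 relative to precision comes from the low recall: the model is fairly reliable when it flags code, but it flags only about a third of the vulnerable samples.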

	## Environmental Impact

	| Factor | Value |
	|---------------|---------|
	| GPU Used | T4 GPU |
	| Training Time | ~1 hour |