# CodeT5-small Java Optimization Model

A fine-tuned Salesforce/codet5-small model for Java code optimization tasks.

## Overview

This repository contains a fine-tuned CodeT5-small model trained specifically for Java code optimization. The model takes verbose or inefficient Java code as input and generates an optimized version.

## Model Information

- **Base Model**: Salesforce/codet5-small
- **Training Dataset**: [nlpctx/java_optimisation](https://huggingface.co/datasets/nlpctx/java_optimisation)
- **Framework**: HuggingFace Transformers with Seq2SeqTrainer
- **Training Setup**: Dual-GPU DataParallel (Kaggle T4×2)
- **Dataset Size**: ~6K training / 680 validation Java optimization pairs
- **Optimization Focus**: Java code refactoring and performance improvements

## Files

- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `model.safetensors` - Model weights (safetensors format)
- `merges.txt` - BPE merges file
- `special_tokens_map.json` - Special tokens mapping
- `tokenizer_config.json` - Tokenizer configuration
- `vocab.json` - Vocabulary file

## Usage

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration
import torch

# Load model and tokenizer. CodeT5 uses a RoBERTa-style BPE tokenizer
# (see vocab.json and merges.txt above), so AutoTokenizer resolves the
# correct tokenizer class from tokenizer_config.json.
model = T5ForConditionalGeneration.from_pretrained("model_directory")
tokenizer = AutoTokenizer.from_pretrained("model_directory")

# Prepare input Java code
java_code = "your Java code here"
input_ids = tokenizer(java_code, return_tensors="pt").input_ids

# Generate optimized code
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        max_length=512,
        num_beams=4,
        early_stopping=True
    )

optimized_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(optimized_code)
```

## Example Optimizations

The model has been trained to recognize and optimize common Java patterns:

- **Switch Expressions**: Converting verbose switch statements to switch expressions
- **Collection Operations**: Replacing manual iterator removal with `removeIf()`
- **String Handling**: Optimizing string concatenation with `StringBuilder`
- **Loop Optimizations**: Improving iterative constructs
- **And more...**
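
As an illustration of the `removeIf()` pattern above, here is a hand-written sketch of the kind of verbose-to-optimized transformation the model is trained on. This is an example of the pattern itself, not actual model output; the class and method names are invented for this README.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class OptimizationPatterns {

    // Verbose input form: element removal via an explicit Iterator loop.
    static List<String> removeShortVerbose(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        Iterator<String> it = copy.iterator();
        while (it.hasNext()) {
            if (it.next().length() < 3) {
                it.remove();
            }
        }
        return copy;
    }

    // Optimized target form: Collection.removeIf with a predicate.
    static List<String> removeShortOptimized(List<String> items) {
        List<String> copy = new ArrayList<>(items);
        copy.removeIf(s -> s.length() < 3);
        return copy;
    }

    public static void main(String[] args) {
        List<String> input = List.of("ab", "java", "x", "code");
        // Both forms produce the same result: [java, code]
        System.out.println(removeShortVerbose(input));
        System.out.println(removeShortOptimized(input));
    }
}
```

Both methods are behaviorally equivalent; the optimized form is shorter, avoids explicit iterator state, and expresses intent directly.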

## Training Details

The model was fine-tuned using:
- **Base Model**: Salesforce/codet5-small
- **Dataset**: nlpctx/java_optimisation from Hugging Face
- **Training Framework**: Seq2SeqTrainer with DataParallel
- **Hardware**: Kaggle T4×2 (dual GPU)
- **Approach**: Standard supervised fine-tuning on Java optimization pairs

## License

This model is provided for educational and demonstration purposes.

## Acknowledgements

- Model based on Salesforce/codet5-small
- Training data from nlpctx/java_optimisation dataset
- Built with HuggingFace Transformers