PHP-Code-Large is a large-scale corpus comprising more than 12 million lines of PHP source code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.
By providing a high-volume, language-specific corpus, PHP-Code-Large supports systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.
PHP-Code-Large fills the need for a dedicated, PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.
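As an illustration of the kind of preprocessing a corpus like this typically undergoes before pretraining or analysis, the sketch below applies two simple heuristics to a candidate file: checking for a PHP opening tag and counting non-comment lines of code. This is a minimal, assumed example, not part of the dataset's actual pipeline; the function names and the line-counting heuristic are our own, and a real pipeline would use a proper PHP parser.

```python
def looks_like_php(source: str) -> bool:
    """Heuristic: a PHP file normally opens with a `<?php` (or short `<?=`) tag."""
    return source.lstrip().startswith(("<?php", "<?="))

def count_code_lines(source: str) -> int:
    """Rough lines-of-code metric: skip blank lines and lines that are
    only `//`, `#`, or `/* ... */` comments. Not a full PHP parser."""
    count = 0
    in_block_comment = False
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if in_block_comment:
            if "*/" in stripped:
                in_block_comment = False
            continue
        if stripped.startswith(("//", "#")):
            continue
        if stripped.startswith("/*"):
            if "*/" not in stripped:
                in_block_comment = True
            continue
        count += 1
    return count

example = """<?php
// A simple example
function greet(string $name): string {
    return "Hello, $name";
}
"""
print(looks_like_php(example))    # True
print(count_code_lines(example))  # 4
```

Filters of this kind are commonly used to discard non-PHP or comment-dominated files and to compute aggregate statistics such as the total line count reported above.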
Ethos: In our team at UT Austin, we train students to become full-stack researchers—and increasingly, designers of the systems that do research. Our students learn to carry projects end-to-end: from idea generation and theory to data creation, analysis, and iterative refinement across diverse subfields. Using modern AI (including agentic workflows) and scalable computation, students build reproducible pipelines that can ingest and update planetary-scale data—like satellite imagery and other high-dimensional sources. But the goal isn’t tool use for its own sake: students learn to set the objectives, constraints, and evaluation standards that guide these systems through large spaces of hypotheses, while grounding results in causal inference and careful measurement. The outcome is scholarship that can rigorously test policy counterfactuals and translate evidence into durable, responsible improvements in societal well-being.
We welcome students at every stage to engage with projects—from motivated high-schoolers to undergraduates, graduate students, and those from highly non-traditional backgrounds.