PHP-Code-Large is a large-scale corpus comprising more than 12 million lines of PHP source code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.
By providing a high-volume, language-specific corpus, PHP-Code-Large supports systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.
PHP-Code-Large fills the need for a dedicated, PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.
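As an illustration of the kind of preprocessing a corpus like this typically undergoes before pretraining or analysis, the sketch below applies two simple heuristics to a candidate file: checking for a PHP opening tag and counting non-comment lines of code. This is a minimal, assumed example, not part of the dataset's actual pipeline; the function names and the line-counting heuristic are our own, and a real pipeline would use a proper PHP parser.

```python
def looks_like_php(source: str) -> bool:
    """Heuristic: a PHP file normally opens with a `<?php` (or short `<?=`) tag."""
    return source.lstrip().startswith(("<?php", "<?="))

def count_code_lines(source: str) -> int:
    """Rough lines-of-code metric: skip blank lines and lines that are
    only `//`, `#`, or `/* ... */` comments. Not a full PHP parser."""
    count = 0
    in_block_comment = False
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if in_block_comment:
            if "*/" in stripped:
                in_block_comment = False
            continue
        if stripped.startswith(("//", "#")):
            continue
        if stripped.startswith("/*"):
            if "*/" not in stripped:
                in_block_comment = True
            continue
        count += 1
    return count

example = """<?php
// A simple example
function greet(string $name): string {
    return "Hello, $name";
}
"""
print(looks_like_php(example))    # True
print(count_code_lines(example))  # 4
```

Filters of this kind are commonly used to discard non-PHP or comment-dominated files and to compute aggregate statistics such as the total line count reported above.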
Ethos: In our team at UT Austin, we train students to become full-stack researchers—and increasingly, designers of the systems that do research. Our students learn to carry projects end-to-end: from idea generation and theory to data creation, analysis, and iterative refinement across diverse subfields. Using modern AI (including agentic workflows) and scalable computation, students build reproducible pipelines that can ingest and update planetary-scale data—like satellite imagery and other high-dimensional sources. But the goal isn’t tool use for its own sake: students learn to set the objectives, constraints, and evaluation standards that guide these systems through large spaces of hypotheses, while grounding results in causal inference and careful measurement. The outcome is scholarship that can rigorously test policy counterfactuals and translate evidence into durable, responsible improvements in societal well-being.
We welcome students at every stage to engage with projects—from motivated high-schoolers to undergraduates, graduate students, and those from highly non-traditional backgrounds.