Spaces:
Sleeping
title: Inventory Optimization Environment
emoji: π¦
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
- openenv
base_path: /web
Retail Inventory Optimization Environment
An OpenEnv reinforcement learning environment that simulates day-by-day retail inventory management across 5 product categories. An AI agent must balance purchasing, pricing, shipping, and liquidation decisions to maximize profit over a 30-day episode.
Why Inventory Management?
Retail inventory optimization is a real-world task performed daily by store managers, warehouse operators, and supply chain planners. The agent faces the same challenges as a human manager: uncertain demand, perishable goods, shipping delays, seasonal events, and limited cash flow. Poor decisions lead to stockouts (lost sales), waste (expired goods), or cash tied up in unsold inventory.
Environment Description
You manage a retail store selling 5 products with different characteristics:
| Product | Sell Price | Cost Price | Profit Margin | Shelf Life |
|---|---|---|---|---|
| Electronics | $150 | $100 | $50 | No expiry |
| Clothing | $40 | $25 | $15 | No expiry |
| Groceries | $10 | $5 | $5 | 5 days |
| Furniture | $200 | $130 | $70 | No expiry |
| Toys | $25 | $12 | $13 | No expiry |
Each day the agent receives the current store state (cash, inventory with batch expiry, pending deliveries, upcoming events) and must decide:
- What to buy and how much of each product
- How to ship β slow (cheap but unreliable), medium, or fast (expensive but guaranteed)
- What to liquidate β dispose of expiring or excess stock
- How to price β set per-product price multipliers that affect demand via elasticity
Customer demand is generated each day based on base ranges, weekend boosts (1.2x on days 5-6), and seasonal event multipliers (up to 3x during Black Friday, Christmas, etc.). The agent cannot see future demand β only yesterday's demand as feedback.
The episode runs for 30 days. The goal is to maximize total profit.
Environment Design Highlights
Batch-Tracked Inventory with FIFO
Inventory is tracked per batch with individual expiry dates. Groceries expire after 5 days. Selling and liquidation follow FIFO (First In, First Out) β oldest batches are consumed first, mimicking real warehouse operations.
{"groceries": [[20, 3], [15, 5], [10, 1]]}
Three batches: 20 units (3 days left), 15 units (5 days left), 10 units (1 day left β liquidate or lose them).
Dynamic Pricing with Price Elasticity
The agent can set per-product price multipliers (0.5x to 1.5x) each day. Demand responds to pricing via realistic elasticity values β groceries are inelastic (people buy regardless), while clothing and toys are highly elastic (price-sensitive customers).
| Product | Elasticity | Effect of 1.3x price |
|---|---|---|
| Electronics | 1.2 | Demand drops ~24% |
| Clothing | 1.5 | Demand drops ~38% |
| Groceries | 0.4 | Demand drops only ~11% |
| Furniture | 0.8 | Demand drops ~22% |
| Toys | 1.3 | Demand drops ~33% |
Delivery Jitter
Shipping isn't perfectly reliable. Slow delivery has +/-2 day variance, medium has +/-1 day. Only fast delivery (at 5x the cost) is guaranteed next-day. The agent must account for uncertainty when planning restocks before events.
Seasonal Events with Demand Spikes
Five events are spread across the 30-day episode. Each event triggers a 2-day demand multiplier β Black Friday triples electronics demand, Christmas triples toys, etc. A "new competitor" event actually reduces demand. The agent sees countdowns and must stock up in advance.
Decomposed Per-Step Reward
The reward function provides granular feedback every step, not just end-of-episode:
| Signal | Formula | Purpose |
|---|---|---|
| Successful sales | +sold * sell_price * 0.001 |
Reward revenue proportional to product value |
| Missed sales | -missed * sell_price * 0.001 |
Penalize stockouts, weighted by product value |
| Expired groceries | -0.05 * expired_count |
Penalize waste from overbuying perishables |
| Failed purchases | -0.5 per rejected order |
Penalize ordering beyond cash budget |
| Liquidation loss | -disposed_value * 0.001 |
Penalize disposal proportional to cost |
Conversation History for LLM Agents
The inference script maintains a rolling 7-day conversation history. The LLM sees its past observations and decisions, enabling it to spot demand trends, learn from mistakes, and adjust strategy across the episode.
Action Space
class InventoryAction(Action):
buy_quantities: Dict[str, int] = {}
delivery_method: Literal["slow", "medium", "fast"] = "slow"
liquidate: Dict[str, int] = {}
price_multipliers: Dict[str, float] = {}
| Field | Description |
|---|---|
buy_quantities |
Products and amounts to order. Empty {} to skip buying. |
delivery_method |
"slow" ($2/unit, 3-7 days), "medium" ($5/unit, 2-4 days), "fast" ($10/unit, 1 day guaranteed) |
liquidate |
Products and amounts to dispose of (no revenue). Use for expiring groceries or freeing warehouse space. |
price_multipliers |
Per-product selling price multiplier (0.5-1.5). Affects demand via elasticity. Default 1.0 if omitted. |
Observation Space
class InventoryObservation(Observation):
current_day: int
total_cash: float
day_profit: float
total_profit: float
demand_today: Dict[str, int] # yesterday's demand (feedback)
updated_inventory: Dict[str, List] # [[qty, days_left], ...] per batch
remaining_capacity: Dict[str, int] # warehouse space left per product
updated_events: Dict[str, int] # event countdowns (negative = active/ended)
updated_deliveries: List[Dict] # in-transit shipments
Tasks (Easy / Medium / Hard)
Easy β "Steady State"
- Low starting stock, low steady demand, no events
- Starting cash: $1,000 | Full warehouse capacity
- Agent needs to restock regularly but demand is predictable
- No events, no demand spikes β pure supply chain management
Medium β "Seasonal Rush"
- Default stock/cash, all 5 events spread across 30 days
- Events: Black Friday (day 6), Christmas (day 12), Back to School (day 18), Summer Clearance (day 24), New Competitor (day 28)
- Agent must anticipate demand spikes and restock before events hit
Hard β "Chaos Mode"
- Half starting cash ($500), low stock, events packed close together (days 4, 8, 12, 16, 20)
- Higher base demand, smaller warehouse capacity
- Agent must balance tight budget, overlapping event spikes, perishable goods, and limited storage
Grading (0.0 - 1.0)
Each task is scored by comparing agent profit against two deterministic baselines:
- Floor: Passive agent that never buys (sells initial stock until depleted)
- Ceiling: Theoretical max profit assuming perfect demand knowledge and cheapest shipping
score = clamp((agent_profit - floor) / (ceiling - floor), 0.0, 1.0)
Both baselines are deterministic (seeded RNG) and computed fresh each run to ensure reproducibility.
Setup
# Install dependencies
pip install openenv-core[core] fastapi uvicorn pydantic openai numpy python-dotenv
# Run grader baselines
python -c "from server.grader import compute_baselines; [print(f'{t}: floor={f:.2f}, ceiling={c:.2f}') for t in ['easy','medium','hard'] for f,c in [compute_baselines(t)]]"
# Start server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000
# Test endpoints
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset
Running Inference
# Using HuggingFace Router
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen3-32B"
export HF_TOKEN="your-token"
python inference.py
# Using OpenAI
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export API_KEY="sk-your-key"
python inference.py
Docker
docker build -t inventory-env .
docker run -p 8000:8000 inventory-env
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
/health |
GET | Health check β returns 200 if server is running |
/reset |
POST | Reset environment, returns initial observation |
/step |
POST | Submit an action (JSON body), returns next observation with reward |
/state |
GET | Get current episode state (day, cash, inventory) |
/tasks |
GET | List all 3 tasks with full config (stock, capacity, demand ranges, events) |
/grader |
POST | Score an episode given task name and agent profit |
/baseline |
GET | Run LLM inference on a task and return the score |
Example Queries
# List all tasks with full schemas
curl http://localhost:8000/tasks
# Grade a specific profit
curl -X POST "http://localhost:8000/grader?task_name=easy&agent_profit=5000"
# β {"task_name":"easy","agent_profit":5000.0,"floor":2200.0,"ceiling":10011.0,"score":0.358}
# Run baseline inference (requires API keys in container env)
curl "http://localhost:8000/baseline"
curl "http://localhost:8000/baseline?task_name=hard"
# β {"task_name":"easy","score":0.822}
Step Execution Order
Each step() call processes in this order:
- Tick event countdowns (into negatives to track active duration)
- Remove expired groceries (shelf life = 0)
- Receive arriving deliveries (add to inventory with fresh shelf life)
- Process purchase orders (deduct cash, schedule deliveries with jitter)
- Generate demand (base + weekend boost + event multipliers + price elasticity)
- Sell products FIFO (oldest batches first, track missed sales)
- Liquidate requested stock FIFO (no revenue)
- Compute profit, reward, update state, return observation
Project Structure
βββ models.py # InventoryAction, InventoryObservation, InventoryState (Pydantic)
βββ client.py # EnvClient for remote WebSocket connections
βββ inference.py # LLM inference script with conversation history (runs all 3 tasks)
βββ openenv.yaml # OpenEnv spec manifest
βββ pyproject.toml # Python dependencies
βββ Dockerfile # Multi-stage container build from openenv-base
βββ server/
β βββ app.py # FastAPI server (create_app + uvicorn entry point)
β βββ inventory_env.py # Environment (reset, step, state, demand generation)
β βββ constants.py # All configs: prices, stock, events, tasks, elasticity
β βββ grader.py # Floor/ceiling baselines and 0.0-1.0 scoring
βββ scripts/
βββ validate-submission.sh # Pre-submission validator