---
title: Inventory Optimization Environment
emoji: 📦
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
tags:
  - openenv
base_path: /web
---

# Retail Inventory Optimization Environment

An OpenEnv reinforcement learning environment that simulates day-by-day retail inventory management across 5 product categories. An AI agent must balance purchasing, pricing, shipping, and liquidation decisions to maximize profit over a 30-day episode.

## Why Inventory Management?

Retail inventory optimization is a real-world task performed daily by store managers, warehouse operators, and supply chain planners. The agent faces the same challenges as a human manager: uncertain demand, perishable goods, shipping delays, seasonal events, and limited cash flow. Poor decisions lead to stockouts (lost sales), waste (expired goods), or cash tied up in unsold inventory.

## Environment Description

You manage a retail store selling 5 products with different characteristics:

| Product | Sell Price | Cost Price | Profit Margin | Shelf Life |
|---------|-----------|------------|---------------|------------|
| Electronics | $150 | $100 | $50 | No expiry |
| Clothing | $40 | $25 | $15 | No expiry |
| Groceries | $10 | $5 | $5 | 5 days |
| Furniture | $200 | $130 | $70 | No expiry |
| Toys | $25 | $12 | $13 | No expiry |

Each day the agent receives the current store state (cash, inventory with batch expiry, pending deliveries, upcoming events) and must decide:
- **What to buy** and how much of each product
- **How to ship** — slow (cheap but unreliable), medium, or fast (expensive but guaranteed)
- **What to liquidate** — dispose of expiring or excess stock
- **How to price** — set per-product price multipliers that affect demand via elasticity

Customer demand is generated each day based on base ranges, weekend boosts (1.2x on days 5-6), and seasonal event multipliers (up to 3x during Black Friday, Christmas, etc.). The agent cannot see future demand — only yesterday's demand as feedback.

The episode runs for 30 days. The goal is to maximize total profit.

## Environment Design Highlights

### Batch-Tracked Inventory with FIFO
Inventory is tracked per batch with individual expiry dates. Groceries expire after 5 days. Selling and liquidation follow FIFO (First In, First Out) — oldest batches are consumed first, mimicking real warehouse operations.

```json
{"groceries": [[20, 3], [15, 5], [10, 1]]}
```
Three batches: 20 units (3 days left), 15 units (5 days left), 10 units (1 day left — liquidate or lose them).

### Dynamic Pricing with Price Elasticity
The agent can set per-product price multipliers (0.5x to 1.5x) each day. Demand responds to pricing via realistic elasticity values — groceries are inelastic (people buy regardless), while clothing and toys are highly elastic (price-sensitive customers).

| Product | Elasticity | Effect of 1.3x price |
|---------|-----------|----------------------|
| Electronics | 1.2 | Demand drops ~24% |
| Clothing | 1.5 | Demand drops ~38% |
| Groceries | 0.4 | Demand drops only ~11% |
| Furniture | 0.8 | Demand drops ~22% |
| Toys | 1.3 | Demand drops ~33% |

### Delivery Jitter
Shipping isn't perfectly reliable. Slow delivery has +/-2 day variance, medium has +/-1 day. Only fast delivery (at 5x the cost) is guaranteed next-day. The agent must account for uncertainty when planning restocks before events.

### Seasonal Events with Demand Spikes
Five events are spread across the 30-day episode. Each event triggers a 2-day demand multiplier — Black Friday triples electronics demand, Christmas triples toys, etc. A "new competitor" event actually reduces demand. The agent sees countdowns and must stock up in advance.

### Decomposed Per-Step Reward
The reward function provides granular feedback every step, not just end-of-episode:

| Signal | Formula | Purpose |
|--------|---------|---------|
| Successful sales | `+sold * sell_price * 0.001` | Reward revenue proportional to product value |
| Missed sales | `-missed * sell_price * 0.001` | Penalize stockouts, weighted by product value |
| Expired groceries | `-0.05 * expired_count` | Penalize waste from overbuying perishables |
| Failed purchases | `-0.5 per rejected order` | Penalize ordering beyond cash budget |
| Liquidation loss | `-disposed_value * 0.001` | Penalize disposal proportional to cost |

### Conversation History for LLM Agents
The inference script maintains a rolling 7-day conversation history. The LLM sees its past observations and decisions, enabling it to spot demand trends, learn from mistakes, and adjust strategy across the episode.

## Action Space

```python
class InventoryAction(Action):
    buy_quantities: Dict[str, int] = {}
    delivery_method: Literal["slow", "medium", "fast"] = "slow"
    liquidate: Dict[str, int] = {}
    price_multipliers: Dict[str, float] = {}
```

| Field | Description |
|-------|-------------|
| `buy_quantities` | Products and amounts to order. Empty `{}` to skip buying. |
| `delivery_method` | `"slow"` ($2/unit, 3-7 days), `"medium"` ($5/unit, 2-4 days), `"fast"` ($10/unit, 1 day guaranteed) |
| `liquidate` | Products and amounts to dispose of (no revenue). Use for expiring groceries or freeing warehouse space. |
| `price_multipliers` | Per-product selling price multiplier (0.5-1.5). Affects demand via elasticity. Default 1.0 if omitted. |

## Observation Space

```python
class InventoryObservation(Observation):
    current_day: int
    total_cash: float
    day_profit: float
    total_profit: float
    demand_today: Dict[str, int]           # yesterday's demand (feedback)
    updated_inventory: Dict[str, List]     # [[qty, days_left], ...] per batch
    remaining_capacity: Dict[str, int]     # warehouse space left per product
    updated_events: Dict[str, int]         # event countdowns (negative = active/ended)
    updated_deliveries: List[Dict]         # in-transit shipments
```

## Tasks (Easy / Medium / Hard)

### Easy — "Steady State"
- Low starting stock, low steady demand, no events
- Starting cash: $1,000 | Full warehouse capacity
- Agent needs to restock regularly but demand is predictable
- No events, no demand spikes — pure supply chain management

### Medium — "Seasonal Rush"
- Default stock/cash, all 5 events spread across 30 days
- Events: Black Friday (day 6), Christmas (day 12), Back to School (day 18), Summer Clearance (day 24), New Competitor (day 28)
- Agent must anticipate demand spikes and restock before events hit

### Hard — "Chaos Mode"
- Half starting cash ($500), low stock, events packed close together (days 4, 8, 12, 16, 20)
- Higher base demand, smaller warehouse capacity
- Agent must balance tight budget, overlapping event spikes, perishable goods, and limited storage

## Grading (0.0 - 1.0)

Each task is scored by comparing agent profit against two deterministic baselines:
- **Floor**: Passive agent that never buys (sells initial stock until depleted)
- **Ceiling**: Theoretical max profit assuming perfect demand knowledge and cheapest shipping

```
score = clamp((agent_profit - floor) / (ceiling - floor), 0.0, 1.0)
```

Both baselines are deterministic (seeded RNG) and computed fresh each run to ensure reproducibility.

## Setup

```bash
# Install dependencies
pip install openenv-core[core] fastapi uvicorn pydantic openai numpy python-dotenv

# Run grader baselines
python -c "from server.grader import compute_baselines; [print(f'{t}: floor={f:.2f}, ceiling={c:.2f}') for t in ['easy','medium','hard'] for f,c in [compute_baselines(t)]]"

# Start server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Test endpoints
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset
```

## Running Inference

```bash
# Using HuggingFace Router
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen3-32B"
export HF_TOKEN="your-token"
python inference.py

# Using OpenAI
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export API_KEY="sk-your-key"
python inference.py
```

## Docker

```bash
docker build -t inventory-env .
docker run -p 8000:8000 inventory-env
```

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check — returns 200 if server is running |
| `/reset` | POST | Reset environment, returns initial observation |
| `/step` | POST | Submit an action (JSON body), returns next observation with reward |
| `/state` | GET | Get current episode state (day, cash, inventory) |
| `/tasks` | GET | List all 3 tasks with full config (stock, capacity, demand ranges, events) |
| `/grader` | POST | Score an episode given task name and agent profit |
| `/baseline` | GET | Run LLM inference on a task and return the score |

### Example Queries

```bash
# List all tasks with full schemas
curl http://localhost:8000/tasks

# Grade a specific profit
curl -X POST "http://localhost:8000/grader?task_name=easy&agent_profit=5000"
# → {"task_name":"easy","agent_profit":5000.0,"floor":2200.0,"ceiling":10011.0,"score":0.358}

# Run baseline inference (requires API keys in container env)
curl "http://localhost:8000/baseline"
curl "http://localhost:8000/baseline?task_name=hard"
# → {"task_name":"easy","score":0.822}
```

## Step Execution Order

Each `step()` call processes in this order:
1. Tick event countdowns (into negatives to track active duration)
2. Remove expired groceries (shelf life = 0)
3. Receive arriving deliveries (add to inventory with fresh shelf life)
4. Process purchase orders (deduct cash, schedule deliveries with jitter)
5. Generate demand (base + weekend boost + event multipliers + price elasticity)
6. Sell products FIFO (oldest batches first, track missed sales)
7. Liquidate requested stock FIFO (no revenue)
8. Compute profit, reward, update state, return observation

## Project Structure

```
├── models.py              # InventoryAction, InventoryObservation, InventoryState (Pydantic)
├── client.py              # EnvClient for remote WebSocket connections
├── inference.py           # LLM inference script with conversation history (runs all 3 tasks)
├── openenv.yaml           # OpenEnv spec manifest
├── pyproject.toml         # Python dependencies
├── Dockerfile             # Multi-stage container build from openenv-base
├── server/
│   ├── app.py             # FastAPI server (create_app + uvicorn entry point)
│   ├── inventory_env.py   # Environment (reset, step, state, demand generation)
│   ├── constants.py       # All configs: prices, stock, events, tasks, elasticity
│   └── grader.py          # Floor/ceiling baselines and 0.0-1.0 scoring
└── scripts/
    └── validate-submission.sh  # Pre-submission validator
```