We're building a dataset to evaluate AI coding agents: challenging tasks with rigorous evaluation criteria, set in realistic simulated environments. The work involves creating virtual companies, assembling and calibrating tasks, designing isolated environments, and writing tests that accept correct solutions and reject incorrect ones. The goal is tasks that genuinely challenge the best AI models.
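To give a flavor of the test-writing work, here is a minimal, hypothetical sketch in pytest style. The task (`slugify`) and all names are illustrative assumptions, not from an actual project; in a real task the agent's submission would be imported in place of the reference solution shown here.

```python
import re

def slugify(title: str) -> str:
    # Reference solution, shown only to illustrate what the tests check;
    # a real grading harness would import the agent's implementation.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_accepts_correct_behavior():
    # Any correct implementation should pass these.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  Multiple   Spaces  ") == "multiple-spaces"

def test_rejects_common_mistakes():
    # A naive implementation that only replaces spaces would fail here:
    # punctuation must be collapsed, not kept or doubled.
    assert slugify("C++ & Rust") == "c-rust"
    assert "--" not in slugify("a -- b")
```

The point of the second test is the core skill the posting describes: tests must reject plausible-but-wrong solutions, not just accept the reference one.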
Requirements
- Degree in Computer Science, Software Engineering, or related fields
- 5+ years in software development, primarily Python (FastAPI, pytest, async/await, subprocess, file operations)
- Background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and robust back-end systems
- Experience writing tests (functional, integration — not just running them)
- Experience with Docker containers and familiarity with infrastructure tools (Postgres, Kafka, Redis)
- CI/CD understanding (GitHub Actions as a user: triggers, labels, reading results)
- English proficiency: B2
Benefits
- Opportunity to work on AI evaluation projects
- Flexibility to choose projects and work at your own pace
- Potential earnings of up to the equivalent of $12 per hour
