We're building a dataset to evaluate AI coding agents by creating challenging tasks and evaluation criteria within realistic simulated environments. The work involves building virtual companies, designing and calibrating tasks set in isolated environments, writing tests, and iterating with AI agents. Contributors will work on part-time, non-permanent projects focused on testing and evaluating AI systems.
Requirements
- Degree in Computer Science, Software Engineering, or a related field
- 5+ years of software development experience, primarily in Python (FastAPI, pytest, async/await, subprocess, file operations)
- Background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and robust back-end systems
- Experience writing functional and integration tests, not just running them
- Experience with Docker containers and familiarity with infrastructure tools (Postgres, Kafka, Redis)
- Understanding of CI/CD (GitHub Actions as a user: triggers, labels, reading results)
- English proficiency: B2
Benefits
- Up to $50 per hour equivalent
- Flexible work schedule
