We are looking for an AI Evaluation Engineer specializing in data analysis to design benchmark tasks that simulate real-world analytical workflows.
Responsibilities
- Design and develop multi-agent benchmark tasks focused on complex data analysis workflows
- Create or curate realistic datasets (CSV, JSON, logs, reports, financial or operational data)
- Implement evaluation pipelines using Python and SQL
- Create reproducible environments using Docker
- Analyze task performance and iterate on tasks to improve clarity, difficulty calibration, and scoring accuracy
Engagement details
- Contractor assignment
- Contract duration: 4+ weeks
