We are looking for an AI Evaluation Engineer specializing in data analysis to design benchmark tasks that simulate real-world analytical workflows. Responsibilities include designing and developing multi-agent benchmark tasks, creating realistic datasets, and implementing evaluation pipelines in Python and SQL.
Requirements
- 5+ years of experience in data analysis or analytics-heavy roles
- Strong proficiency in Python (pandas, NumPy) and SQL
- Experience working with real-world, messy datasets (CSV, JSON, logs, reports)
- Ability to design analytical problems with clear, verifiable answers
- Solid understanding of statistics (distributions, correlations, outliers)
- Familiarity with AI benchmarks or evaluation environments (e.g., SWE-bench)
- Hands-on experience with Docker (Dockerfiles, image builds, debugging)
