Open to opportunities

Livia Maia

@livmaia

AI evaluation specialist with QA testing and language analysis background, designing frameworks to improve linguistic accuracy and LLM performance.

United States

What I'm looking for

I’m looking to lead AI evaluation and human-judgment systems end-to-end—building rubrics, annotation workflows, and multilingual quality standards—while partnering with Engineering, Product, and Data Science to ship reliable, decision-ready model outputs.

I’m an AI evaluation specialist with 5+ years of experience building and scaling human judgment systems for AI/ML at production scale. With a background in QA testing—including mobile application testing—and language analysis, I design evaluation frameworks that translate complex model behavior into clear, actionable insights. My work focuses on linguistic quality evaluation, including rubric and rating scale development, defect taxonomy, labeling guidelines, and QA systems that ensure consistency and reliability.

I’ve contributed to evaluation strategy across 10+ AI/ML initiatives, including LLM output assessment, LLM-as-a-judge, and human-in-the-loop workflows. I specialize in failure mode analysis, adversarial testing, and edge case evaluation, partnering with engineering and product teams to drive model improvements. I bring multilingual expertise in Brazilian Portuguese, English, and Spanish, and have led or supported large-scale data collection efforts with 6,000+ participants, implementing quality monitoring processes to ensure data integrity and detect drift over time.

Experience

Work history, roles, and key accomplishments

Current

AI/ML Evaluation Specialist

Google

Jan 2023 - Present (3 years 6 months)

Led end-to-end evaluation strategy across 10+ AI/ML initiatives, defining rubrics, rating scales, and defect taxonomies to produce consistent, decision-ready outputs at production scale. Orchestrated evaluation and QA workflows across 10+ projects, supported 6,000+ contributors through onboarding and calibration, and performed deep error analysis to drive engineering and product iteration prioriti

LLM Evaluation, Human Judgment & Annotation, Defect Taxonomy & Adjudication, Evaluation Pipelines, English, Spanish, Annotator Onboarding & Calibration

Creative Writing Graduate Instructor

San Francisco State University

Jan 2019 - Jan 2022 (3 years)

Designed evaluation rubrics for open-ended, subjective work and delivered evidence-based, structured feedback to support consistent assessment standards. Built an interdisciplinary curriculum with faculty to establish shared quality expectations across courses.

Curriculum Development, Instructional Design

Localization QA Tester

Urban Apps

Jan 2014 - Present (12 years 6 months)

Tested multilingual mobile apps (EN/ES/PT) for translation quality, UX compliance, and cultural fit, documenting issues and recommending process changes to reduce turnaround time by 15%. Applied locale-specific quality standards to improve consistency across languages.

Localization Testing, Translation Quality Evaluation, Cultural Appropriateness Review, Bug Reporting, I18n Evaluation, Process Improvement