What You'll Do
- Build large-scale speech and audio data pipelines using tools like Google Cloud Platform and Apache Beam
- Work on machine learning projects powering new generative AI experiences and helping to build state-of-the-art text-to-speech models
- Learn and contribute to the team's best practices and techniques for building data pipelines for large-scale generative models, including cleaning, filtering, classifying and labelling
- Collaborate with other engineers, researchers, product managers and stakeholders, taking on learning and leadership opportunities that arise
- Deliver scalable, testable, maintainable, and high-quality code
- Share knowledge and promote standard methodologies, making your team the best version of itself through mentorship and constructive accountability
Who You Are
- You have data engineering experience and you know how to work with high-volume, heterogeneous data, preferably with distributed systems such as Hadoop, Bigtable, Cassandra, GCP, AWS
- You have experience building clean, high-quality datasets for training large-scale machine learning models; a focus on audio data is preferred
- You have experience with one or more higher-level Python- or Java-based data processing frameworks such as Beam, Dataflow, Crunch, Scalding, Storm, or Spark
- You have strong Python programming abilities. You might have worked with Docker as well as Luigi, Airflow, or similar tools
- You care about quality and you know what it means to ship high-quality code
- You have experience managing data retention policies
- You care about agile software processes, data-driven development, reliability, and responsible experimentation
- You understand the value of collaboration and partnership within teams
- You have experience in developing datasets tailored for training high-performance machine learning models.
- Familiarity with generative models or audio-based machine learning applications is highly desirable.
- You are proficient in cleaning, filtering, and evaluating datasets, using pre-trained and in-house machine learning models as well as human evaluation techniques to ensure quality.
Where You'll Be
- We offer you the flexibility to work where you work best! For this role, you can be based anywhere in the UK where we have a work location.
- This team operates within the GMT time zone for collaboration.