What we’re doing isn’t easy, but nothing worth doing ever is.
We envision a future powered by robots that work seamlessly with human teams. We build artificial intelligence that enables service robots to collaborate with people and adapt to dynamic human environments. Join our mission-driven, venture-backed team as we build out current and future generations of humanoid robots.
The MLOps Engineer works with engineering teams, IT, and Security to address unique business challenges through comprehensive solutions that account for system uptime, reliability, and maintainability. You will instrument and monitor the full breadth of our platform stack (hosts, applications, and performance). In this role you will work closely with our engineering and information security teams to enhance the automated system provisioning and deployment subsystems within codified infrastructure. You will work with developers to create more robust and scalable services independent of cloud implementations. You will help isolate, trap, and respond to inevitable system failures, and develop strategies for continuous monitoring and analysis to reduce both downtime and required manual intervention. You will participate in an on-call rotation to maintain platform SLAs.
Key Responsibilities
- Analyze our current operational toolset for shortcomings and product improvements; provide and implement recommendations.
- Create, configure, and maintain cloud-based infrastructure and services for the rapid development and monitoring of complex robotics and analytics applications.
- Build tools to automate monitoring and management of robot fleets.
- Build tools to automate and improve MLOps tooling and workflows.
- Build tools to automate and improve data workflows for ML training and simulation.
- Triage issues as they arise, both on robots and in deployed software.
- Automate common operations to allow Diligent’s robotic fleet to scale exponentially.
- Be an active member of the software engineering team, helping to improve the organization’s SDLC process and minimize the time from code-complete to production.
- Mentor engineers in SRE best practices and modern software engineering.
- Occasional off-hours, on-call work required.
Qualifications
- Bachelor’s degree in Computer Science or a related field, or equivalent experience.
- 5+ years of combined experience in MLOps, DevOps, Software Engineering, or related technical roles.
- Deep experience with modern cloud infrastructure (AWS, Azure, GCP), especially managed ML/AI services.
- Experience with modern datastores at small to medium scale (Firestore, Redshift, Postgres, Mongo) and with distributed queues and message brokers such as Kafka and Mosquitto.
- Experience automating system provisioning and configuration with Infrastructure as Code (Terraform, Ansible, etc.).
- Experience managing hosting environments, including database administration and scaling applications to handle changes in load.
- Experience soliciting system requirements, then designing and implementing new platform components that leverage infrastructure or SaaS services.
- Experience working with distributed, fault-tolerant systems.
- Experience converting monolithic applications to microservices, and with service discovery technology.
- Solid Linux skills and proficiency in at least one high-level language (e.g., Python).
- Experience working within an agile development lifecycle.