We are seeking a Senior DevOps Engineer to join our team and contribute to the development of cutting-edge payment solutions. In this remote, full-time position, you will design, implement, and manage a robust Kafka-based messaging infrastructure and collaborate closely with the company's founders to deliver high-quality, scalable software.
Responsibilities and Requirements
- Automating deployment, management, and operations of complex distributed systems with Apache Kafka
- Implementing tracing and performance observability in high-scale distributed microservice architectures
- Designing and managing scalable, high-throughput, and low-latency Kafka clusters for real-time data streaming between services
- Building and maintaining infrastructure as code (IaC) for Kafka and related services using Terraform, Ansible, or similar tools
- Monitoring and optimizing Kafka performance, ensuring message reliability and minimal downtime in a high-availability payment environment
- Setting up and maintaining centralized observability systems for logs, metrics, and traces across all services using Prometheus, Grafana, or Datadog
- Designing and maintaining CI/CD pipelines for infrastructure and microservices using tools such as GitHub Actions and Jenkins
- Managing containerized workloads using Docker and Kubernetes, ensuring scalability and automated rollouts/rollbacks in production
- Collaborating with backend engineers, SREs, and platform teams to implement Kafka producers/consumers that integrate cleanly with payment processing flows
- Establishing security, access control, and encryption protocols for Kafka to meet regulatory and compliance standards (e.g., PCI DSS)
- Leading Kafka upgrades, partition strategy design, and rebalancing without disrupting critical microservices
- Implementing observability tooling for Kafka (e.g., Confluent Control Center, Prometheus/Grafana, or Datadog integrations)
- Developing disaster recovery and failover strategies for Kafka-related components in production
- Participating in incident response processes for Kafka-related outages
- Strong communication skills in both English and Spanish
Benefits
- Generous Paid Time Off
- 401(k) Matching
- Retirement Plan
- Four Day Work Week
