About the Role
We are seeking an experienced Site Reliability Engineer (SRE) to join our team. In this role, you will help ensure the reliability, performance, and scalability of our mission-critical systems and applications. You will work closely with software engineers, DevOps engineers, and other cross-functional teams to design, build, and maintain the infrastructure that supports our business objectives.
Key Responsibilities
- Infrastructure Engineering:
  - Design, build, and maintain a scalable and resilient infrastructure for our applications, leveraging cloud platforms (e.g., AWS, Azure, GCP), container orchestration platforms (e.g., Kubernetes, Docker Swarm), and serverless technologies.
  - Automate infrastructure provisioning and management using tools like Terraform, Ansible, or Puppet, ensuring consistency and efficiency.
- Operational Excellence:
  - Develop, implement, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and improve system performance and reliability.
  - Proactively identify and mitigate potential risks and bottlenecks within our infrastructure.
  - Respond swiftly and effectively to incidents and outages, conducting thorough post-mortem analyses to identify root causes and implement preventative measures.
- Collaboration & Communication:
  - Foster strong working relationships with software engineering teams to optimize application performance, reliability, and security.
  - Clearly communicate technical concepts and solutions to both technical and non-technical audiences.
  - Actively participate in knowledge sharing and mentoring within the team.
- Continuous Improvement:
  - Stay abreast of the latest advancements in cloud computing, containerization, and other relevant technologies.
  - Continuously evaluate and refine our operational processes and tooling to enhance efficiency and effectiveness.
  - Participate in on-call rotations to ensure 24/7 system availability and support.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- 3+ years of hands-on experience as an SRE, DevOps Engineer, or Systems Administrator.
- Proven expertise in cloud computing platforms (AWS, Azure, or GCP) and containerization technologies (Docker, Kubernetes).
- Demonstrated hands-on experience with infrastructure-as-code and configuration management tools (Terraform, Ansible, Puppet).
- Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, firewalls).
- Proficiency in at least one of Python, Bash, or Go for automation and system administration tasks.
- Experience configuring and operating monitoring and alerting systems (e.g., Prometheus, Grafana, Datadog).
- Excellent analytical, problem-solving, and troubleshooting skills.
- Strong communication and collaboration skills with a focus on teamwork and knowledge sharing.
Bonus Points
- Experience with security best practices and tools.
- Experience with serverless computing platforms (AWS Lambda, Google Cloud Functions).
- Experience with stream processing technologies (Kafka, Kinesis).
- Contributions to open-source projects.
To Apply
Please submit your resume and a cover letter that highlights your relevant experience and passion for building reliable systems to ekta@digitalxnode.com.