About the Role

We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) to join our team. In this role, you will help ensure the reliability, performance, and scalability of our mission-critical systems and applications. You will work closely with software engineers, DevOps engineers, and other cross-functional teams to design, build, and maintain robust, efficient infrastructure that supports our business objectives.

Key Responsibilities

Infrastructure Engineering:
- Design, build, and maintain a scalable and resilient infrastructure for our applications, leveraging cloud platforms (e.g., AWS, Azure, GCP), container orchestration platforms (e.g., Kubernetes, Docker Swarm), and serverless technologies.
- Automate infrastructure provisioning and management using tools like Terraform, Ansible, or Puppet, ensuring consistency and efficiency.

Operational Excellence:
- Develop, implement, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure and improve system performance and reliability.
- Proactively identify and mitigate potential risks and bottlenecks within our infrastructure.
- Respond swiftly and effectively to incidents and outages, conducting thorough post-mortem analyses to identify root causes and implement preventative measures.

Collaboration & Communication:
- Foster strong working relationships with software engineering teams to optimize application performance, reliability, and security.
- Clearly communicate technical concepts and solutions to both technical and non-technical audiences.
- Actively participate in knowledge sharing and mentoring within the team.

Continuous Improvement:
- Stay abreast of the latest advancements in cloud computing, containerization, and other relevant technologies.
- Continuously evaluate and refine our operational processes and tooling to enhance efficiency and effectiveness.
- Participate in on-call rotations to ensure 24/7 system availability and support.

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
3+ years of hands-on experience as an SRE, DevOps Engineer, or Systems Administrator.
Proven expertise in cloud computing platforms (AWS, Azure, or GCP) and containerization technologies (Docker, Kubernetes).
Demonstrated experience with infrastructure-as-code and configuration management tools (Terraform, Ansible, Puppet) and their practical application.
Strong understanding of networking fundamentals (TCP/IP, DNS, load balancing, firewalls).
Proficiency in scripting and programming languages (Python, Bash, Go) for automation and system administration tasks.
Experience with monitoring and alerting systems (e.g., Prometheus, Grafana, Datadog) and their effective configuration.
Excellent analytical, problem-solving, and troubleshooting skills.
Strong communication and collaboration skills with a focus on teamwork and knowledge sharing.

Bonus Points

Experience with security best practices and tools.
Experience with serverless computing platforms (AWS Lambda, Google Cloud Functions).
Experience with stream processing technologies (Kafka, Kinesis).
Contributions to open-source projects.

To Apply

Please send your resume and a cover letter highlighting your relevant experience and your interest in building reliable systems to ekta@digitalxnode.com.
