We are looking for a Site Reliability Engineer (SRE).
As a Microsoft Cloud Tier 1 Managed Service Provider and System Integrator, DEX empowers digital natives, startups, and enterprises to achieve success through cutting-edge cloud, data, and AI solutions. Our strong partnerships with Microsoft and other industry leaders enable us to design and deliver innovative, scalable, and secure cloud architectures that drive business transformation.
We’re looking for a people-oriented team player with a passion for technology and customer success to join our growing team as a Site Reliability Engineer (SRE).
Job Description
In this role, you will ensure the availability, performance, and reliability of our customers’ cloud and hybrid environments. You will manage incidents, service requests, monitoring systems, and preventive maintenance activities, while collaborating closely with customers to ensure seamless service delivery and satisfaction.
This is a hands-on operational role focused on maintaining healthy systems, improving observability, and driving proactive reliability improvements.
Key Responsibilities
- Customer Service & Operations
  - Serve as a trusted technical contact for customer incidents, service requests, and operational needs.
  - Deliver clear, timely communication and maintain a high level of professionalism and ownership throughout the support lifecycle.
  - Collaborate with customers and internal teams to identify recurring issues and drive reliability and performance improvements.
- Incident, Problem & Change Management
  - Monitor system health and respond to alerts, incidents, and degradations to minimize customer impact.
  - Conduct root cause analysis (RCA) and implement preventive actions.
  - Support planned maintenance, updates, and configuration changes across environments.
- Cloud & DevOps Operations
  - Manage and maintain Microsoft Azure environments, including virtual machines, networking, storage, and governance.
  - Administer Microsoft Entra ID (formerly Azure Active Directory), hybrid identity, and role-based access control (RBAC).
  - Deploy and manage infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible.
  - Build and maintain CI/CD pipelines and operational workflows using Azure DevOps or GitHub Actions.
  - Ensure security, compliance, and scalability of cloud-based systems.
  - Support and manage Microsoft 365 solutions and related SaaS services.
- IT Infrastructure
  - Manage Windows and Linux systems, including performance tuning, patching, and updates.
  - Support network and security configurations (firewalls, VPNs, load balancers).
  - Contribute to hardware and software lifecycle management, ensuring reliable and consistent infrastructure operations.
- Observability & Reliability
  - Enhance system visibility with Azure Monitor, Log Analytics, Application Insights, and Microsoft Sentinel.
  - Configure monitoring, alerting, and automated remediation for proactive issue detection.
  - Define and track reliability metrics (SLAs, SLOs, SLIs) to ensure service quality.
  - Conduct regular environment health checks, performance assessments, and optimization reviews.