Site Reliability Engineer

Ref: CA_EN_6_919740_1406818

Posted on 13 September 2021
Location: Toronto, Ontario
Contract Type: Direct Hire

Full Time Permanent opportunity

Location: Anywhere in Canada (ability to support EST hours).

System Reliability Engineer

Position Description

Our client is one of the largest independent information technology services firms, and still growing!

They are currently expanding their System Reliability Engineering team, which helps one of their key clients deploy, manage, troubleshoot, and enhance its developer tooling platform, serving over 2000 developers.

As a System Reliability Engineer, you will be responsible for designing, implementing, and supporting a variety of developer productivity tools, including Ansible Tower, GitLab, Artifactory and SonarQube.  The technology stack used to manage the platform includes Ansible, Terraform, Python, Prometheus, Splunk, ELK and PagerDuty.

You will build automation solutions to provision and validate infrastructure and help debug and resolve problems.  You will help to improve operational performance by focusing on user experience, effectively assessing and managing risk, and minimizing the impact of failures.

Responsibilities

  • Keeping all components of the developer productivity platform up and running
  • Working closely with internal partners and platform users to ensure that all services meet security, SLA, and performance requirements
  • Writing, updating, and using documentation, including runbooks and playbooks
  • Automating infrastructure deployment, testing, application failover, failure mitigation, user self-service functions, and more
  • Debugging complex problems across the entire stack
  • Developing, designing, and deploying new features and capabilities

Key Skills and Attributes

  • 6 years of experience with software engineering, software development, or system operations
  • Experience working with Linux, including writing shell scripts, and an understanding of Linux internals and performance tuning
  • Strong understanding of networking principles
  • Experience debugging large-scale, complex systems in production
  • Experience in building, implementing, and supporting highly available production systems
  • Experience automating infrastructure and deployments using Terraform, Ansible, and Python or equivalent technologies
  • Understanding of DevOps engineering, CI/CD, and software deployment
  • Experience deploying and managing developer tooling such as Artifactory, GitLab, SonarQube, and Ansible Tower
  • Experience with various monitoring and observability tools
  • Experience deploying and managing workloads on one of the major public cloud platforms or on private clouds such as OpenStack
  • Experience deploying and managing workloads on one of the major container management platforms like Kubernetes, OpenShift, PCF or Rancher
  • A curiosity about how complex socio-technical systems operate and what happens during failure

NOTE: It’s not expected that any single candidate will have experience across all of these areas. Our client is looking for someone who is strong in a few areas and has the interest and desire to learn the others.

