Site Reliability Engineer Manager

Job Code
LSAC0585
Type
Regular Full-Time
Category
Technology & Infrastructure
Location
US-PA-Newtown
Remote work arrangements will be considered for this position
Yes

Overview

LSAC is a not-for-profit organization whose mission is to advance law and justice by supporting the learning journey from prelaw through practice.

 

Pay rate: $145,000 to $160,000, depending on experience

 

The Site Reliability Engineer Manager leads a team of technology professionals who oversee the reliability and performance of critical systems. The role establishes best practices, drives automation, manages incident response, collaborates with development teams, and fosters a culture of continuous improvement, all aimed at maintaining high service levels and minimizing system downtime. The SRE Manager also collaborates with business and product owners to prioritize operational requirements by defining service-level indicators (SLIs) and service-level objectives (SLOs) that monitor and optimize the customer's journey and experience, and that aid in the design and operation of scalable, resilient systems using software engineering principles. Because of the nature of the work, the Site Reliability Engineer Manager will work directly on reliability initiatives that span multiple Customer and IT teams.


To be successful, you must be able to:

  • Lead a team of IT professionals in ensuring the reliability, scalability, and performance of software products.
  • Take a holistic view of system health, including the health of transactions internal to the systems.
  • Improve Site Reliability Engineering practices.
  • Work with infrastructure and operations to manage software and applications.
  • Improve reliability and quality, and reduce the impact of any disruptions, for our suite of software solutions.
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
  • Provide primary operational support and engineering for multiple large-scale distributed software applications.

Responsibilities

Essential Job Functions


Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions of this position. The individual employed in this position will be required to:

 

  • Lead the SRE team in the use of automation tools to monitor software reliability, critical applications, and services.
  • Manage and mentor a team of SREs and DevOps engineers dedicated to the SRE process.
  • Improve Performance: Analyze metrics to tune performance and find faults.
  • Collaborate with teams: Work with development teams, product supervisors, and other stakeholders to prioritize tasks and meet reliability requirements.
  • Design systems: Participate in system design consulting, platform management, and capacity planning.
  • Set service-level objectives: Balance feature development speed and reliability with service-level objectives.
  • Gather and analyze metrics from operating systems as well as applications to assist in dependency mapping, performance tuning and fault finding.
  • Collaborate closely with product owners and teams, architects, IT service management, software developers, security and network engineers, as well as other subject matter experts and roles.
  • Create sustainable systems and services through automation and uplifts.
  • Create and improve application release automation and orchestration, as well as automated environment provisioning and configuration using infrastructure-as-code practices.
  • Provide people management functions such as tracking goals, identifying training opportunities, performance coaching, work assignment and timekeeping.
  • Coordinate efforts starting with initiative launch to product delivery.
  • Conduct blameless root cause analysis, focusing on finding the root cause of the problem and instituting change to prevent recurrence of the same or similar issues in the future.

 

Competencies

 

  • Strong grasp of service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
  • Hands-on experience with CI/CD tools (e.g., Azure Pipelines) and best practices for automated deployments.
  • Programming proficiency (structured and object-oriented) in one or more high-level languages, such as Python, Java, C/C++/C#, and JavaScript, and with frameworks such as Next.js.
  • Understanding of DevOps practices.
  • Deep understanding of distributed systems, microservices, and cloud-native architectures.
  • Ability to integrate testing, security, and monitoring into the continuous delivery process.
  • Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
  • Outstanding communication, presentation, and leadership skills.
  • Strong sense of accountability for both individual and team objectives.
  • Forward-thinking mindset that contributes to a culture of continuous improvement and creativity.
  • Sharp analytical and problem-solving skills.
  • Excellent time management, prioritization, attention to detail, and organizational skills.

Qualifications

Education and Experience

 

  • Bachelor’s degree in Computer Science, Software Design, Software Engineering, Computer Programming, or a related IT field.
  • 3+ years of related experience working in a mature Site Reliability Engineering framework and previous success in the SRE role.
  • 2 years’ experience leading a team of IT professionals.
  • Experience leading incident response, troubleshooting complex production issues, and conducting effective postmortems.
  • Experience with Azure, React and NoSQL highly preferred.
  • Experience with distributed storage technologies, as well as dynamic resource management frameworks.

Additional Information

 

Supervisory Responsibilities

This role has people management responsibilities.

 

Position Type

LSAC's standard business hours are Monday-Friday, 8:30 a.m. - 4:45 p.m. ET. As an exempt employee, however, the employee will be expected to work the hours necessary to satisfactorily complete their assignments in a responsible and professional manner. This position is required to work weekends during test administration and as business needs demand.

 

Work Environment

This job operates in a remote environment. This role routinely uses standard office equipment such as computers, phones, photocopiers, filing cabinets and fax machines.

 

Travel Requirements

Minimal to no travel is expected for this position.

 

Physical Demands

The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. While performing the duties of this job, the employee is regularly required to write, hear, speak, and present materials.

 

Special Conditions or Requirements

The ability to work weekends is required.

 

Additional Information:

Please note that this job description may not contain a comprehensive listing of activities, duties or responsibilities that are required of the employee for this job. Job responsibilities may change at any time with or without notice.

 

Except as otherwise provided by law, all terms of employment are on an at-will basis and can change at any time.
