Site Reliability Engineer (SRE) - 100% Remote

The Judge Group

Atlanta Georgia

United States

Information Technology
(No Timezone Provided)

Location: REMOTE
Description: Our client is currently seeking a Site Reliability Engineer (SRE) - Remote

Job Description:


Responsibilities:

  • Lead the newly established SRE team supporting Discover.com and Discover's mobile application. Discover.com gets 3.5 million and Discover Mobile gets 2.5 million logins daily!
  • Champion a culture of learning, continuous improvement, and blameless retrospection within your team.
  • Mentor and grow your junior engineers, and empower and unblock your senior ones.
  • Partner with our Talent Acquisition team as we recruit, interview and hire the best engineering talent to join Discover's growing SRE practice.
  • Partner with Product teams and Solution Architects to help design solutions that achieve the required reliability outcomes for their services.
  • Be a leader in the SRE community of practice and evolve the SRE practice for the entire organization.
  • Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, and production readiness reviews to run platforms.
  • Partner with security engineers to develop plans and automation that respond aggressively and safely to new risks and vulnerabilities.

Necessary experience:

• Well versed in the entire software development lifecycle, DevOps, and SRE practices

• Expertise and operational experience at scale - designing and operating highly available, scalable and fault-tolerant systems using container platforms

• Experience with operational monitoring tools (AppDynamics, NewRelic, Instana, CatchPoint) with a mindset towards predictive analysis

• Experience with Splunk or ELK Stack, Grafana, DataDog, or Sysdig

• Working knowledge of the automation tools such as Ansible, Terraform, or Chef

• Experience with Pivotal Cloud Foundry (PCF), OpenShift (OCP), Amazon Web Services (AWS), and Google Cloud Platform (GCP)

  • Good understanding of networking, including L2 and L3 concepts, firewalls, load balancing, routing, and switching
  • A working knowledge of Linux-based systems and virtual machine (VM) technology
  • Strong scripting skills, including the ability to write scripts from scratch using Python and/or Bash
  • Basic knowledge and understanding of Security (CIA Model and PCI compliance) is a plus
  • Experience with Continuous Integration and Continuous Delivery models including Blue/Green and Canary release models is a plus

Minimum Qualifications:

• You have 5+ years of SRE experience in a highly customer-focused environment.

• You have 3+ years of experience successfully managing a team of engineers on large-scale projects that included technical deep-dives and production troubleshooting in distributed systems, programming, configuration management, networking, storage, and operating systems.

• You possess strong leadership skills and the ability to motivate teams.

• You bring a strong perspective and a collaborative approach that drives change and motivates engineers to develop simple solutions to complex operational or reliability challenges.

• You have experience formulating a team's technical strategy and roadmap, and you've collaborated and partnered effectively with several other teams.

• You are capable of leading a discussion with upper management, and are able to tailor the level of technical detail to suit your audience.

B.S. in Computer Science or equivalent experience

Contact:
This job and many more are available through The Judge Group. Find us on the web at
