Site Reliability Engineer

This is a contract position, partially remote, in West Houston. The SRE will support the reliability of Digital IT/OT critical applications. This transformative role involves automating IT infrastructure tasks and driving SRE best practices, tools, and processes. The ideal candidate should exhibit a growth mindset and proactively monitor and respond to incidents for optimal user experience.

* Maintain survivability and reliability of IT/OT critical resources.
* Write and build CI/CD pipelines and build/release processes for IT/OT workflow applications.
* Mentor the IT/OT DevOps team in best practices for CI/CD deployments using ADO and Git.
* Perform periodic load and scalability testing to establish baselines, detect drift, and inform capacity planning.
* Conduct weekly operational state reviews covering performance trends, anomalies, errors, and other availability events with SREs, product owners, and development teams.
* Participate in quarterly business and operational reviews aligning on roadmaps, development velocity, efficiency, growth trends, etc.
* Plan and execute periodic Disaster Recovery exercises including both tabletop and simulated failures (fault injection).

The candidate must have senior-level experience deploying and supporting applications on OpenShift/Kubernetes container platforms.
The successful candidate will possess a strong developer background as well as the interpersonal skills needed to communicate design requirements and objectives.
Candidates should be self-motivated and collaborative IT professionals with a strong background in software development, systems administration and IT automation.

Required Qualifications

* Candidates must have a bachelor’s degree and 8 years of IT experience.
* Senior-level experience with OCP and Kubernetes.
* Familiarity with continuous integration/deployment processes and tools such as IDEs (Eclipse), source code management (Git/Stash), ADO Pipelines, Maven, Nexus artifacts, etc.
* Strong understanding of SRE practices: incident response, change/release management, capacity planning, infrastructure automation, elastic environments, chaos engineering and blameless postmortems.
* Expertise in application performance monitoring, observability, and proactive alert correlation, including monitoring containers and failure-based alerting.
* Scripting experience in languages such as Python and Bash.
* Experience deploying applications on OCP in both public and private clouds.
* Excellent written and oral communication skills.
* Demonstrated ability to communicate technical issues to a nontechnical audience.
* Demonstrated ability to communicate on a technical level to a technical audience.
* Strong interpersonal skills, adaptable and able to learn quickly.
* Requires limited supervision and has excellent time management skills.
* Self-motivated self-starter.
* Ability to work and interact with others in a structured/team environment.

Experience with at least one technology in each of the tech stack categories below:

* Monitoring and Logging Tools: AppDynamics, Splunk, ELK Stack, DataDog, Prometheus, AWS CloudWatch/X-Ray, Grafana
* Programming: C# .NET, PowerShell, Python, YAML
* Containers: Docker, Helm Chart
* OS: Linux – RHEL, Ubuntu, CentOS
* Code Repos: Azure Repos, GitHub
* Infrastructure as code: Terraform, Ansible
* Automation Tools: Jenkins, Chef, Puppet
* Agile: JIRA, SAFe
