This is a direct-hire opportunity with a global Fortune 500 oil and gas (O&G) company.
Responsible for keeping DELFI cloud solutions and services reliable and available at a level appropriate to users’ needs. Site Reliability Engineering is a discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. Site Reliability Engineers are also responsible for engaging in and improving the whole lifecycle of services, from inception and design through deployment, operation and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Maintain and improve services once they are live by measuring and monitoring availability, latency and overall system health.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Gauge the effectiveness and efficiency of existing systems and infrastructure; implement strategies for improving or further leveraging these systems within a geoscience workflow.
- Use the service management systems effectively, ensuring that best practices and lessons learned are made available to the wider technical community.
- Engage in incident response and blameless postmortems.
Previous Experience and Competencies:
- Bachelor’s degree in an IT-related discipline.
- 10+ years of professional experience.
- Interest in designing, analyzing and troubleshooting large-scale distributed systems.
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- Ability to debug and optimize code and automate routine tasks.