The Hadoop Site Reliability Engineer will be responsible for building and enhancing the tooling needed to deploy and operate Hadoop clusters at scale.
Responsible for monitoring, troubleshooting, automating and continuously developing tools to improve the availability and resiliency of the data ecosystem.
We are seeking an experienced Site Reliability Engineer with expertise in managing the reliability of large Hadoop clusters.
Experience in building, managing, and tuning the performance of Hadoop platforms.
Excellent Shell and Python programming skills for automating repetitive DevOps tasks.
Tune alerting and set up observability to proactively identify issues and performance problems.
Deploy and scale Hadoop infrastructure, handle capacity planning, cluster monitoring, and troubleshooting, and drive operational enhancements.
DevOps + Deployment + Production Ops
Experience with Production Ops processes in a Hadoop environment on the application side, and the ability to work with multiple self-service teams.
Expert in reliability engineering, incident management, observability, and monitoring.
Subject matter expert in the above areas.
We are not looking for an infrastructure-focused person, but for someone who works at the Hadoop platform level.
The ideal candidate will act as a bridge between L1, L2, and L3 support for the largest instance in the bank (the 50 TB+ SDP, Strategic Data Platform), communicate technically, and demonstrate leadership in resolving production ops problems.
Hands-on experience with, and a strong understanding of, Hadoop architecture.
Experience with Hadoop ecosystem components (HDFS, YARN, MapReduce), cluster management tools such as Ambari or Cloudera Manager, and related technologies.
Proficiency in scripting, Linux system administration, networking, and troubleshooting.
• Location: 4835 Lyndon B Johnson Fwy, Suite 540, Dallas, TX 75244, US