Site Reliability Engineering vs. IT Operations
Let's start with definitions of IT Operations and Site Reliability Engineering. Both terms come up many times in this white paper. I would rather call it an idea paper, since it collects ideas along with some explanation and brainstorming. From here on I will refer to IT Operations as ITOps and Site Reliability Engineering as SRE.
Site Reliability Engineering: Supporting production systems in a more automated way, by coding repeated operational tasks, while ensuring the service levels of the application and the user experience.
IT Operations: Ensuring service levels by reacting immediately to incidents in production environments and keeping the application available to end users.
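Both definitions rest on the idea of a service level. As a rough illustration of what checking one can look like, here is a minimal Python sketch that compares measured availability with a target. The function names and the 99.9% default target are assumptions made for this example, not values taken from any particular tool or agreement.

```python
def availability(total_minutes: float, downtime_minutes: float) -> float:
    """Return the fraction of the period during which the service was up."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes


def meets_service_level(total_minutes: float,
                        downtime_minutes: float,
                        target: float = 0.999) -> bool:
    """Check measured availability against a target (assumed 99.9% by default)."""
    return availability(total_minutes, downtime_minutes) >= target


if __name__ == "__main__":
    minutes_in_30_days = 30 * 24 * 60  # 43,200 minutes
    print(meets_service_level(minutes_in_30_days, downtime_minutes=30))
    print(meets_service_level(minutes_in_30_days, downtime_minutes=60))
```

With a 99.9% target, a 30-day period leaves room for about 43 minutes of downtime, so 30 minutes of downtime is within the target and 60 minutes is not.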
A traditional IT operations team ensures environment stability and availability for end users by reacting to issues. The size of the team scales in proportion to the size of the environments it supports, and it is directly related to the stability of the code released to production. If bad code reaches production, there will of course be more incidents, and the operations team ends up busy and out of capacity. The result is an unstable environment. When failures occur, it takes longer to restore service because everyone is already busy with other issues, even though service levels are set for each and every component.
An SRE team is made up of people with skills in development and automation as well as systems administration: essentially, system administrators with development skills. The advantage of such a team is that it can write programs and develop software that reacts to an incident and resolves it automatically. Repeated or expected incidents are self-healed, and the team concentrates on the remaining manual work and on ways to automate it. The size of an SRE team therefore depends on the number of different applications or products it supports. Suppose a release has a code issue that causes a number of environment issues. Because the team has automation and self-healing capabilities, the load on the team is lower. It can concentrate on the new issues that need manual attention, resolve them manually to restore the environment as soon as possible, and then follow up with automation so that the same issue is handled automatically if it occurs again.
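The self-healing loop described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the incident types, the remediation functions, and the `RUNBOOK` registry are all hypothetical names, and the remediations are stand-ins that merely report success rather than touching a real system.

```python
from typing import Callable, Dict, List

# A remediation takes an incident and returns True if it resolved it.
Remediation = Callable[[dict], bool]


def restart_service(incident: dict) -> bool:
    """Hypothetical stand-in: a real version would restart the failed service."""
    return True


def clear_disk_space(incident: dict) -> bool:
    """Hypothetical stand-in: a real version would remove old logs or temp files."""
    return True


# Hypothetical registry mapping known incident types to automated fixes.
RUNBOOK: Dict[str, Remediation] = {
    "service_down": restart_service,
    "disk_full": clear_disk_space,
}


def handle_incident(incident: dict,
                    runbook: Dict[str, Remediation],
                    manual_queue: List[dict]) -> str:
    """Try the automated fix for a known incident; otherwise queue it for a person."""
    remediation = runbook.get(incident["type"])
    if remediation is not None:
        try:
            if remediation(incident):
                return "auto_resolved"
        except Exception:
            # A failed automation falls through to manual handling.
            pass
    manual_queue.append(incident)
    return "needs_manual_attention"


if __name__ == "__main__":
    queue: List[dict] = []
    print(handle_incident({"type": "disk_full"}, RUNBOOK, queue))
    print(handle_incident({"type": "memory_leak"}, RUNBOOK, queue))
    print(len(queue))
```

In this sketch, the follow-up step from the paragraph above corresponds to adding a new entry to `RUNBOOK` once an incident has been fixed by hand, so that the next occurrence of that incident type is resolved without reaching the manual queue.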
SRE and ITOps teams work on the same things: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. What differs is how they work on them.