From an operations perspective, Amazon, Facebook, and Google all have efforts focused on changing their data center and application delivery strategies. They are moving away from an older style of enterprise capacity planning, which scaled vertically at the application level, toward a web-scale approach to capacity planning and optimization that delivers a flexible infrastructure across all applications. Both Facebook and Google have created standards around server configurations, builds, deployments, and decommissions, and Facebook has custom-built its data center, squeezing cost out of server builds and power delivery. Amazon focuses on capacity planning and has recently moved all Amazon online store applications off of physical servers onto an EC2 infrastructure, putting all their apps into the cloud, even if it is their own cloud.
Another area where web operations is adding value is continuous deployment and configuration management. Common configuration management and deployment tools identified at Velocity were CFEngine, Puppet, Chef, and the cast-project. There was a lot of talk about devops, which is basically introducing an operations strategy into the development process while at the same time introducing a development strategy into the operations process. Examples of this would be using a code repository for all application builds and deployments and, from a development perspective, introducing developers to the monitoring and performance stats before they release their code. I agree with and fully support this strategy; however, I do feel there is too much hype surrounding the “devops” term, and I prefer “web operations” as a replacement.
There was not a lot of discussion around gathering metrics; however, a number of folks did talk about monitoring, looking at your data, and understanding your logs. John Allspaw gave an interesting talk on “Advanced PostMortem” in which he spoke of the Time to Detect (TTD) an incident, the Time to Recover (TTR) from an incident, and the overall impact time, which is TTR - TTD. John made the point that not all outages have the same severity; each organization should define its own severity levels and track TTD, TTR, and impact time along with the severity level.
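As a rough sketch of how those metrics could be tracked per incident (the incident record, timestamps, and field names here are made up for illustration, not from John's talk):

```python
from datetime import datetime

# Hypothetical incident record; all values are assumptions for illustration.
incident = {
    "start":    datetime(2010, 6, 24, 14, 0),   # when the outage actually began
    "detected": datetime(2010, 6, 24, 14, 12),  # when monitoring or on-call noticed it
    "resolved": datetime(2010, 6, 24, 14, 47),  # when service was restored
    "severity": 2,                              # org-defined severity level
}

ttd = incident["detected"] - incident["start"]  # Time to Detect
ttr = incident["resolved"] - incident["start"]  # Time to Recover
impact = ttr - ttd                              # impact time, per the TTR - TTD definition

print(f"Sev{incident['severity']}: TTD={ttd}, TTR={ttr}, impact={impact}")
```

Tracking these three numbers alongside the severity level for every incident is what makes the cross-incident comparisons possible.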
I was also looking for some help with deployments, and Chef and the new cast-project look interesting. Still, I am going to hold off on a configuration management system at work and focus on building out our continuous deployment process. We have started this process in development using Bamboo; however, our organization needs to commit to this strategy and then scale out the current Bamboo infrastructure to accommodate building all of our apps. There was more talk about application servers in the cloud than about application servers on premise, so my application server question did not get answered. A couple of vendors offered sophisticated Java monitoring products, but both came with a steep price tag. The one Java monitoring product that I thought was interesting was dynaTrace, which I will probably investigate.
A challenge that I have, and that I am sure many other enterprise-level organizations share, is that our infrastructures are carved out and deployed on an application-by-application basis. This means we have to pay for and build redundancy into every new application, whereas market leaders like Google, Amazon, and Facebook have redundant infrastructures and simply add applications onto them. The market leaders have had a lot of success with this infrastructure strategy; it is surprising that more large enterprises have not started down the same path.
I thought this was a valuable conference, and next year I am going to recommend that we have two or more folks attend Velocity. More information on speaker slides and video can be found here.