One of the first questions that I like to ask any customer I work with is “how many VMs do you have?” This gives me a good understanding of how large their environment is. Do they have fewer than 1,000 VMs? 1,000 to 3,000 VMs? Or do they have more than 3,000 VMs? I also ask them how many people are supporting these VMs. Some of my largest customers have many thousands of VMs.
A decade or two ago, companies had a smaller number of servers and more people to manage them. The ratio of servers to admins was small. Admins had enough time to make sure they were running their expensive hardware efficiently. People knew what ran on which server. Today, that is not the case, and every time I look at resource utilization at a large shop, my jaw drops. Hardware is a lot less expensive than an IT staff’s fully burdened cost. I see customers with thousands of servers running at 20 to 30 percent utilization. Customers do everything possible to be prepared for bursts in their workloads. Most environments are heavily overconfigured, and hardware vendors are very happy. The most sophisticated customers have a good understanding of the peak-to-average ratios for their critical applications and servers.
Over the last few weeks, I had the opportunity to work on capacity planning with a customer that has more than 20,000 servers. This is a big environment. They actually have a team of dedicated Capacity Planners, and they know their stuff. I really enjoyed the interactions. They have their VMware, Hyper-V and . . . environments. Nice, very nice.
In this article, I will start by covering a couple of very important issues that Capacity Planners have to deal with.
Tagging the servers
Let’s start by making some assumptions here. In each IT shop, the teams that support the applications are very well aware of which applications are the most important. In larger organizations, this list gets to be a long one. One of my customers, which has about 6,000 VMs, wanted to get its arms around 80 applications. In this case there are 80 Business Critical applications. I always suggest that people work on the top three, five, or 10 first before expanding to the larger number. Whether you have 800, 6,000 or 20,000 servers, you still need to focus on the most important ones first.
Another question that I usually ask is whether customers have good naming standards. Everyone has a different expectation of what a good naming standard is. I like to see location, application name, type of server, and prod or non-prod in the name, plus tagging. Tagging does what naming conventions can’t do. If I want to work on all my production JVMs in Los Angeles, I can have a tag for that. If I want all the servers that support my stock trading applications in NYC, I can have a tag for that. I can have a tag for my top 10 to 20 business critical applications and all the servers supporting them. This becomes more important when these servers are VMs on different clusters, ESX hosts, and vCenters. To make things worse, VMs are fluid and get moved around, so tagging becomes very important. Some products allow customers to build services and add the servers that are needed to each service group, so you can monitor the components that support your application, set SLAs for it, and ultimately use these services and groupings when you are performing capacity planning.
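To make the tagging idea concrete, here is a minimal sketch in Python. It is not taken from any particular product: the server names, the "key:value" tag format, and the helper functions are all made up for illustration. It only shows how a tag index lets you pull a group of servers no matter where they currently run.

```python
from collections import defaultdict

# Hypothetical inventory: server names and tags are invented for illustration.
SERVERS = {
    "lax-trade-jvm-p01": {"location:LA", "type:jvm", "env:prod", "app:trading"},
    "lax-trade-jvm-t01": {"location:LA", "type:jvm", "env:nonprod", "app:trading"},
    "nyc-trade-db-p01": {"location:NYC", "type:db", "env:prod", "app:trading"},
    "nyc-hr-web-p01": {"location:NYC", "type:web", "env:prod", "app:hr"},
}


def build_tag_index(servers):
    """Map each tag to the set of server names carrying it."""
    index = defaultdict(set)
    for name, tags in servers.items():
        for tag in tags:
            index[tag].add(name)
    return index


def find_servers(index, *tags):
    """Return the sorted names of servers that carry every requested tag."""
    if not tags:
        return []
    # index.get() avoids creating empty entries for unknown tags.
    matches = set.intersection(*(index.get(tag, set()) for tag in tags))
    return sorted(matches)


TAG_INDEX = build_tag_index(SERVERS)

# All production JVMs in Los Angeles.
prod_jvms_in_la = find_servers(TAG_INDEX, "location:LA", "type:jvm", "env:prod")

# All servers supporting the trading application in NYC.
trading_in_nyc = find_servers(TAG_INDEX, "location:NYC", "app:trading")
```

Because the lookup depends only on tags, the result stays the same when a VM moves to another host or cluster. The name can still encode location and role, but the tag is what the report or the service group is built on.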
In this case, I’d like to review what I had to do several years ago for our CIO. He and his finance buddy were doing the Capacity Planning for the organization. This was a small organization with about 200 people in IT. I had to generate about 30 reports for them at the end of each month. If there were any anomalies, I had to generate more reports to explain why. Some of the reports were daily reports for that month, some were weekly for the last three months, and some were monthly for the last 12 to 36 months. Without summaries, the reports would run for a very long time, because they had to process “millions of records.”
A few months ago, I had the opportunity to work with a telecom service provider with thousands of servers and a very aggressive growth rate. They were very impressed that I could run a report and show everything they needed for the servers involved in a matter of seconds. I could group the servers by the customers using them, add or remove fields such as Max CPU Utilization, and change the format. The key is to have a repository that I can control, so that I don’t have to process millions and millions of records for each report. I select my retention policy for each kind of record: 3 to 5 minute short-term intervals, hourly, daily, weekly, and monthly. Each summary record has MIN, MAX, and average values.
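The summary records described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the sample data is made up, the `summarize` function and its `bucket_format` parameter are hypothetical, and a real repository would do this inside its own data store rather than in memory.

```python
from collections import defaultdict
from datetime import datetime


def summarize(samples, bucket_format="%Y-%m-%d %H:00"):
    """Roll raw (timestamp, value) samples up into one record per bucket.

    Each summary record keeps MIN, MAX, average, and the sample count, so a
    report can read a handful of rows instead of every raw sample.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp.strftime(bucket_format)].append(value)
    return {
        bucket: {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "count": len(values),
        }
        for bucket, values in sorted(buckets.items())
    }


# Made-up CPU utilization samples (percent) at 5-minute intervals.
raw = [
    (datetime(2024, 1, 1, 9, 0), 20.0),
    (datetime(2024, 1, 1, 9, 5), 30.0),
    (datetime(2024, 1, 1, 9, 10), 70.0),
    (datetime(2024, 1, 1, 10, 0), 40.0),
    (datetime(2024, 1, 1, 10, 5), 60.0),
]

hourly = summarize(raw)                            # one record per hour
daily = summarize(raw, bucket_format="%Y-%m-%d")   # one record per day
```

A monthly report over 36 months then reads 36 summary rows per server instead of millions of raw samples. Keeping the count in each record matters if you later build daily averages from hourly records, because the hourly averages have to be weighted by how many samples each one covers.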
Please provide your feedback, as I would like to hear your perspectives. I always learn from my customers and peers.
More to follow soon. . .