I’ll never forget a conversation I had with a customer several years ago. I was asked to join a conference call to discuss a new SharePoint 2013 farm deployment. My involvement was limited to designing and deploying a SQL Server 2012 Availability Group for all of their environments – a different vendor was engaged to deal with the SharePoint implementation.
If you’ve been invited to join a conference call as a technical expert, you know what I mean when I say “it’s one of those calls.” My ears hurt as we went past the 2-hour mark, and I kept thinking to myself, “How much longer are we going to be on this call?”
It wasn’t fun to be on that call. But I was very thankful to be involved in that project.
I learned a lot from that conference call: lessons that can help SQL Server DBAs avoid common Windows Server Failover Clustering (WSFC) mistakes that can cost them their jobs. While these lessons pertain to implementing WSFC for SQL Server workloads, they are applicable to any life situation you can think of.
- Relying on outdated experience and information. If you’ve deployed a SQL Server failover clustered instance in the past, you are probably aware of the Hardware Compatibility List (HCL) on the Windows Server Catalog website. That’s because you used to need the exact same hardware with the exact same software/firmware/updates/patches/etc. on all of the nodes in the WSFC in order for it to be considered supported by Microsoft. Oh, and you needed to buy very expensive SAN storage to run SQL Server on a WSFC. Well, that’s exactly what that specific customer did. Based on what their resident SQL Server specialist told them, they bought a new Dell Compellent SC8000 storage array specifically for the SharePoint databases. For an hour and a half on the phone, I tried to explain that SQL Server Availability Groups do not require shared storage. Unfortunately, the storage had been purchased two months before we had that call. Outdated information becomes even more risky when the solution is already in place and the availability of a mission-critical database is at stake.
- Working in isolation. As the project went along, I provided step-by-step documentation on how to install and configure a WSFC for their SQL Server Availability Group. Because this was a large organization, the teams were siloed and assigned specific tasks; there was the network team, the systems team, the database team, the application team, and so on. One of their systems engineers responsible for building the WSFC could not get past the Create Cluster Wizard. So he gave me a call. Realizing what the problem was, I told him to get the Active Directory (AD) team involved to get the issue resolved. It turned out that his AD account did not have permissions to create the cluster name object (CNO) in AD (I cover this concept in this blog post, and it is still the most popular blog post to date). But what’s really interesting is that he already had the idea that the issue was related to AD. He just didn’t bother getting the AD team involved before reaching out to me. Keep in mind, as a consultant, I’m still an outsider.
- Not thoroughly testing the solution. Need I say more? This is especially true with tight deadlines. When the availability of mission-critical databases is at stake, make sure you perform at least the minimum set of tests required to confirm that availability goals are met. Power failure? Checked. Network failure? Checked. Server blue screening? Checked. Prepare a checklist of things that you need to test to validate whether or not your solution meets the availability goals. I say this because, after the project went live, our operations team got blamed when their primary SQL Server Availability Group replica was taken offline even though nothing was wrong with it. I referred them to point #1 in this list. They didn’t like it when I said, “it’s by design.” I’ll save the story for another blog post.
- Becoming too focused on the technology. As technology professionals, we get so attached to our work that we fail to step back and look at the bigger picture. We forget that solutions are useless if there are no business problems to solve. While I’m a big fan of WSFC for SQL Server, I don’t instantly recommend it as a solution without understanding what the real requirement is. There’s a reason why I always start any high availability and disaster recovery discussion with The Alphabet Soup for HA/DR. When the customer told me they intended to deploy SQL Server Availability Groups in all of their environments, I suggested a different architecture for their production environment: a combination of a SQL Server failover clustered instance (FCI) for production and an asynchronous Availability Group for DR. When they asked me why I preferred that design over the original one, mentioning the cost difference between SQL Server licenses and the SAN storage was more than enough to convince them. Plus, they got to use one of their Dell Compellent SC8000 storage arrays.
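The test checklist mentioned in the third point does not need to be fancy; even a short script can keep track of what has and has not been validated. Here is a minimal Python sketch of that idea. The three scenarios come from the list above, but the expected outcomes, the `FailureTest` class, and the helper function names are hypothetical placeholders, so replace them with your own availability goals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureTest:
    scenario: str                  # failure being simulated
    expected: str                  # hypothetical outcome the availability goal requires
    passed: Optional[bool] = None  # None means the test has not been run yet


def build_checklist() -> List[FailureTest]:
    """Return the failure scenarios named in the post, all initially untested."""
    return [
        FailureTest("Power failure", "Databases come online on another node"),
        FailureTest("Network failure", "Cluster keeps quorum and clients reconnect"),
        FailureTest("Server blue screen", "Failover completes within the agreed time"),
    ]


def outstanding(tests: List[FailureTest]) -> List[str]:
    """List the scenarios that are untested or that failed."""
    return [t.scenario for t in tests if t.passed is not True]


def ready_for_go_live(tests: List[FailureTest]) -> bool:
    """Only report ready when every scenario has been run and has passed."""
    return len(outstanding(tests)) == 0


if __name__ == "__main__":
    checklist = build_checklist()
    checklist[0].passed = True  # record a result after actually running the test
    print("Still to validate:", outstanding(checklist))
    print("Ready for go-live:", ready_for_go_live(checklist))
```

Treating an untested scenario the same as a failed one is a deliberate choice here: a tight deadline should not turn "we never got around to it" into a pass.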
Feeling helpless and confused when dealing with Windows Server Failover Clustering (WSFC) for your SQL Server databases?
You’re not alone. I’ve heard the same thing from thousands of SQL Server administrators throughout my entire career. These are just a few of them.
“How do I properly size the server, storage, network and all the AD settings which we do not have any control over?”
“I don’t quite understand how the Windows portion of the cluster operates and interacts with what SQL controls.”
“I’m unfamiliar with multi-site clustering.”
“Our servers are setup and configured by our parent company, so we don’t really get much experience with setting up Failover Clusters.”
If you feel the same way, then this course is for you. It’s a simple and easy-to-understand way for you to learn and master how Windows Server Failover Clusters can keep your SQL Server databases highly available. Be confident in designing, building and managing SQL Server databases running on Windows Server Failover Clusters.
But don’t take my word for it. Here’s what my students have to say about the course.
“The techniques presented were very valuable, and used them the following week when I was paged on an issue.”
“Thanks again for giving me confidence and teaching all this stuff about failover clusters.”
“I’m so gladdddddd that I took this course!!”
“Now I got better knowledge to setup the Windows FC ENVIRONMENT (DC) for SQL Server FCI and AlwaysON.”
Hi Edwin, Great post and advice. I recently rolled out an HADR SQL solution at my shop and it’s working like a champ because we followed a very similar set of rules. Glad to see them so well written and presented. One thing I think you may have overlooked, or just didn’t dig in to is the depth of expertise required to support some of the more advanced solutions. An AG in 2 datacenters will have a significantly different skills requirement than a simple DB mirroring setup. It is important to factor in the supportability of the solution in the given environment.
Thank you for reading, sir.
Agreed, the skills and the depth of expertise required to support some of the more advanced solutions are different from those required for the simple ones. And that spells the difference between a successful and a failed implementation and operational support.
When I started teaching and delivering presentations on Availability Groups back in 2011, I made the claim that I could get in trouble with Microsoft marketing for saying that Availability Groups is really nothing new. That’s because the technology is based on Windows Server Failover Clustering and Database Mirroring, technologies that existed even in SQL Server 2008. But instead of learning just one of them, you now need to learn both.
This is why I decided to create this online course: to enable SQL Server DBAs to be more confident in building SQL Server HA/DR solutions that rely on Windows Server Failover Clustering, be it failover clustered instances or Availability Groups. I also leveraged my experience as a former data center engineer to highlight the need for acquiring the different skills needed to support a complex HA/DR solution.