Jun
18
2010

R2 Print Cluster? Get this Hotfix! KB 976571

Hi Cluster Fans,

Many of you use Failover Clustering to provide high-availability to your print servers to ensure that the print spooler resource stays up and running.  We have recently released a Rollup Hotfix specific to print clustering which contains several fixes to improve the stability of the overall print system and fix issues with migration of print servers using PrintBRM.  Microsoft recommends that you install this Hotfix on all of your 2008 R2 print cluster nodes.

If you are running a Windows Server 2008 R2 print cluster, GET THIS HOTFIX! http://support.microsoft.com/kb/976571

How will print cluster issues be triaged by Microsoft?

If you run into any issues and need to call Microsoft’s support line, they will follow this triage process:

1) Does your complete solution pass the ‘Validate a Configuration’ tests?

· No: Fix the errors which are reported to bring your cluster to a supported configuration and try to reproduce the problem again.

· Yes: Proceed to Step 2

· More information about Validate: http://technet.microsoft.com/en-us/library/cc772055.aspx

2) Does your print cluster have the rollup Hotfix, KB 976571?

· No: Install this Hotfix on all of your cluster nodes and try to reproduce the problem again.

· Yes: Microsoft will triage the specific issue you are reporting. This may include recommending Driver Isolation, removing unnecessary 3rd party print components such as language monitors and print processors, or updating required 3rd party print components, as is recommended for the HP Universal Print Driver when the properties page takes a long time to load: http://blogs.technet.com/yongrhee/archive/2009/09/14/windows-2008-r2-cluster-the-print-queue-propertes-of-a-hp-printer-may-take-a-long-time-to-open.aspx

· More information about installing Hotfixes: http://blogs.msdn.com/clustering/archive/2009/06/12/9731520.aspx (follow this process even though it refers to Service Packs)

 

What is included in this Hotfix?


 
Jun
18
2010

PowerShell for Failover Clustering: Understanding Error Codes

Hi,

 

This blog is about how to handle error codes returned by PowerShell Failover Clustering CMDlets. For an introduction to this topic you can take a look at http://blogs.msdn.com/clustering/archive/2009/05/23/9636665.aspx.  Cluster CMDlets can fail for various reasons, such as passing non-existent entities as parameters (for example, trying to delete a group using a name that is not present in the cluster); mismatching types (passing a string when a cluster object is required); or passing a parameter that violates some cluster business logic. This blog focuses on how to get error codes from the third type of failure.

 

 

Imagine that you are trying to add a cluster node using the Add-ClusterNode CMDlet:

 

Add-ClusterNode -Name "node1"

 

If “node1” is not part of any cluster, this operation will succeed; otherwise it will fail and you will get the error message:

Add-ClusterNode : The computer 'node1.domain' is joined to a cluster.
At line:1 char:16
+ Add-ClusterNode <<<<  -Name "node1"
    + CategoryInfo          : NotSpecified: (:) [Add-ClusterNode], ClusterCMDletException
    + FullyQualifiedErrorId : Add-ClusterNode,Microsoft.FailoverClusters.PowerShell.AddClusterNodeCommand

 

Let’s run:

 

$error[0].Exception | get-Member

 

This will give us the type of the exception contained in the latest error record.  In this case it will be Microsoft.FailoverClusters.PowerShell.ClusterCMDletException.  When this kind of exception occurs we have a non-terminating error, which means that the CMDlet itself finished its execution properly, but the operation did not complete successfully.  Here the node did not get added to the cluster, since the passed name breaks some cluster business logic.

 

 

When writing PowerShell scripts, one often needs to know not only that the CMDlet finished, but also whether its results are as expected.  In order to verify this, an extra step is needed to get the error code embedded in the returned exception.

 

You can collect the error information with:

 

$error[0].Exception.ErrorCode

 

$error[0] is the latest error record; its Exception property (a Microsoft.FailoverClusters.PowerShell.ClusterCMDletException) contains the error code.  For the given example this will return:

 

$error[0].Exception.ErrorCode

-2147024809

 

If you want to have the error code in a more readable format you can run:

$error[0].Exception.ErrorCode -band 0xffff

87

 

So we see this is error code 87, the Win32 error ERROR_INVALID_PARAMETER (“The parameter is incorrect”).
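To see why the mask works, it can help to pull the returned value apart. The following is a minimal sketch in Python, used here only to illustrate the arithmetic; the helper name `decode_hresult` is hypothetical and not part of any cluster or PowerShell API. It assumes the standard HRESULT layout, with the severity flag in bit 31, the facility extracted as the HRESULT_FACILITY macro does (shift right 16, mask 0x1FFF), and the code in the low 16 bits.

```python
def decode_hresult(hr: int) -> dict:
    """Split a signed 32-bit HRESULT into its parts (hypothetical helper)."""
    unsigned = hr & 0xFFFFFFFF  # reinterpret the signed value as unsigned
    return {
        "hex": f"0x{unsigned:08X}",
        "failed": bool(unsigned >> 31),         # severity bit: 1 means failure
        "facility": (unsigned >> 16) & 0x1FFF,  # 7 is FACILITY_WIN32
        "code": unsigned & 0xFFFF,              # same result as -band 0xffff
    }


parts = decode_hresult(-2147024809)
print(parts)
# {'hex': '0x80070057', 'failed': True, 'facility': 7, 'code': 87}
```

The low 16 bits can be read as a Win32 error code only when the facility is 7 (FACILITY_WIN32), so a script may want to check the facility before interpreting the masked value.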

 

 

PowerShell can also be used with managed code.  The error codes can be captured similarly to the console approach described earlier.  Here is an example of code that does that:

 

// Requires: using System; using System.Reflection; using System.Management.Automation;
System.Management.Automation.PowerShell _powershell = PowerShell.Create();
….
ErrorRecord errRec = _powershell.Streams.Error[0];
PropertyInfo errCodeInfo = null;
int retErrorCode = 0;
try {
    // If the exception is a ClusterCMDletException object, it
    // should have a property holding the error code
    errCodeInfo = errRec.Exception.GetType().GetProperty("ErrorCode");
} catch (Exception) {
    // Handle other types of exceptions differently
}
if (errCodeInfo != null) {
    object propValue = errCodeInfo.GetValue(errRec.Exception, null);
    retErrorCode = Convert.ToInt32(propValue); // that's an HRESULT value
    retErrorCode = retErrorCode & 0xFFFF;      // the low 16 bits give the integer error code
}

 

 

 

I hope this information is useful for all of you writing PowerShell scripts or C# code.

 

 

Thanks,

Emanoel Xavier

Software Development Engineer in Test

Clustering & High-Availability

Microsoft

 

 
Jun
18
2010

Introduction to the Cluster Quorum Model

Failover clusters ensure that workloads such as File Server, Exchange, SQL, and Virtual Machines are highly available.  These resources are considered highly available as long as the nodes that host them are up; however, the cluster generally requires more than half of the nodes to be running, which is known as having “quorum”.

Quorum is designed to prevent “split-brain” scenarios, which can happen when there is a partition in the network and subsets of nodes cannot communicate with each other.  This can cause both subsets of nodes to try to own the workload and write to the same disk, which can lead to numerous problems.  The Windows Server Failover Clustering quorum model prevents this by allowing only the group that holds a majority of the votes to keep running, so only one of these groups will stay online.

 

If the cluster loses quorum, all the nodes of the cluster will go offline and any hosted resources will go offline.  Choosing the right quorum mode while deploying your cluster will help ensure that the cluster stays up for as long as possible while sustaining node, disk or network failures.

 

Clustering supports four quorum modes. They are: Node Majority, Node and Disk Majority, Node and File Share Majority, and No Majority: Disk Only (Legacy).

 

Terminology

Here is some relevant terminology:

·         Disk Witness Resource – A clustered disk that is accessible to all nodes can contribute towards the cluster’s quorum.  This disk resides in the cluster group.  Besides providing a vote for the quorum, this resource serves two other critical functions.

o   Stores a constantly-updated version of the cluster database.  This allows the cluster to maintain its state and configuration independently of individual node failures, which ensures that nodes will always have the most up-to-date copy of the database.

o   Enforces cluster unity, preventing the “split-brain” scenario described earlier.

·         File Share Witness (FSW) Resource – A file share accessible by all nodes of the cluster can contribute to the cluster’s quorum.  Besides providing a vote for the quorum, it also helps prevent the “split-brain” scenario.  However, the file share witness doesn’t contain the cluster database.

·         Vote – The quorum calculation is based on votes.  Cluster nodes, disk witness resources and file share witness resources may have a vote based on the quorum configuration.  The table in the next section shows the relationship between quorum mode and votes.

 

 

Quorum Model Description

The following table describes the different quorum modes available since Windows Server 2008.  Here v denotes the total number of votes, and v/2 is rounded down.

Quorum Mode                      | Components that have a vote                            | Votes needed for quorum
Node Majority                    | Nodes (1 per node)                                     | v/2 + 1
Node and Disk Majority           | Nodes (1 per node) and Disk Witness Resource (1)       | v/2 + 1
Node and File Share Majority     | Nodes (1 per node) and File Share Witness Resource (1) | v/2 + 1
No Majority: Disk Only (Legacy)  | Disk Witness Resource (1)                              | v

 

It is recommended to have an odd number of total votes in the cluster since quorum requires more than half of the votes to be online.  If I have a 4-node cluster, and only give each node a vote for 4 total votes, I need 3 nodes to stay running to maintain quorum with more than half of the votes.  This means I can only sustain a single node failure.  However, by assigning a disk or FSW a 5th vote, I still need 3 votes to maintain quorum, but I can now sustain two node failures instead of one.  So by adding this extra vote using a disk or file share, instead of requiring the purchase of an additional node, Failover Clustering can offer higher availability at a much lower cost.
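The 4-vote versus 5-vote comparison can be checked with a few lines of arithmetic. This is a minimal sketch in Python, chosen only to illustrate the calculation; the function names `votes_needed` and `failures_tolerated` are hypothetical and do not correspond to any cluster API.

```python
def votes_needed(total_votes: int) -> int:
    """Quorum needs more than half of the votes: floor(v / 2) + 1."""
    return total_votes // 2 + 1


def failures_tolerated(total_votes: int) -> int:
    """Number of votes that can be lost while quorum is still held."""
    return total_votes - votes_needed(total_votes)


for total in (4, 5):
    print(total, votes_needed(total), failures_tolerated(total))
# 4 3 1
# 5 3 2
```

Both configurations need 3 votes online, but the fifth vote raises the number of tolerated failures from one to two.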

 

 

Default Configuration

When the cluster is first created, the most appropriate quorum mode is automatically assigned, which is based on the number of nodes and available cluster storage.  This can always be changed, as described in the next section.  The cluster will attempt to configure quorum so that there is always an odd number of votes.  If there is an odd number of nodes, the cluster will select Node Majority as the quorum type to keep the odd number of votes.  If there is an even number of nodes and there are disks in Available Storage, the cluster will select Node and Disk Majority, giving a disk a single vote, so that there is an odd number of total votes.  If there is an even number of nodes, but no Available Storage, the cluster will select Node Majority and issue a warning message.  The cluster will never select Node and File Share Majority since it requires additional configuration, and it will never select No Majority: Disk Only as this is not recommended because it is a single point of failure.
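The selection rules described above can be sketched as a small decision function. This is only an illustration in Python of the logic as stated in this post, not the cluster's actual implementation; the function name `default_quorum_mode` and the wording of the warning string are hypothetical.

```python
from typing import Optional, Tuple


def default_quorum_mode(node_count: int, has_available_storage: bool) -> Tuple[str, Optional[str]]:
    """Return (mode, warning) following the rules described in the text."""
    if node_count % 2 == 1:
        # An odd number of nodes already gives an odd number of votes.
        return ("Node Majority", None)
    if has_available_storage:
        # A disk witness adds one vote, making the total odd.
        return ("Node and Disk Majority", None)
    # Even number of nodes and no disk to use as a witness.
    return ("Node Majority", "even number of votes (hypothetical warning text)")


print(default_quorum_mode(3, True))
print(default_quorum_mode(4, True))
print(default_quorum_mode(4, False))
```

Node and File Share Majority and No Majority: Disk Only never appear as return values, matching the statement that the cluster does not select them automatically.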

 

 

Configuring Quorum

In Failover Cluster Manager, the quorum configuration can be changed through the Configure Cluster Quorum Wizard.  This page can be reached by right clicking on the cluster name, selecting “More Actions…” and then “Configure Cluster Quorum Settings…”

 

[Screenshot: the “Configure Cluster Quorum Settings…” action in Failover Cluster Manager]

 

 

Once “Configure Cluster Quorum Settings…” is selected, the Configure Cluster Quorum Wizard appears.  It recommends the best configuration for you based on the number of nodes and Available Storage and informs you about the number of failures you can sustain.  In the snapshot below, Node Majority quorum is recommended because it’s a 3-node cluster, which will be explained later.

 

[Screenshot: the Configure Cluster Quorum Wizard recommending Node Majority for a 3-node cluster]

 

 

In PowerShell, the Set-ClusterQuorum CMDlet is used to configure the quorum mode. The specific command to set each individual quorum configuration will be explained in the next blog post in the quorum series.  To view the TechNet topic on quorum, visit: http://technet.microsoft.com/en-us/library/cc770620(WS.10).aspx.

 

Thanks,
Peter Huang
Software Development Engineer in Test II
Clustering and High-Availability
Microsoft

 

 
Jun
18
2010

Introduction to the Cluster Quorum Model (Part 2)

Hi,

 

A few weeks ago we gave an overview of the cluster quorum model: http://blogs.msdn.com/b/clustering/archive/2010/05/14/10012930.aspx

 

This week we will provide some more details about when and how the different types are used.

 

 

Node Majority  &  Node and Disk Majority

 

The two most recommended quorum modes are Node Majority and Node and Disk Majority.  For a cluster with an odd number of nodes, Node Majority is always recommended.  For a cluster with an even number of nodes and Available Storage, Node and Disk Majority is the recommended type.

 

For example, if a cluster has 2 nodes, adding a disk witness resource would create a Node and Disk Majority cluster, and the cluster would have 3 votes (2 nodes + disk witness resource).  The cluster would maintain quorum if either 1 node is down or the disk witness has failed.  If a cluster has 2 nodes and there is no disk witness resource (Node Majority cluster), the number of votes is 2 and the minimum number of quorum voters that must remain online is also 2. This means that if either of the nodes goes down, the whole cluster would come down.  So the minimum configuration that can survive a failure is 3 nodes without a disk witness resource, or 2 nodes with a disk witness resource (3 total votes either way).

 

For a cluster with an odd number of nodes, a disk witness resource would not increase the survivability of the cluster. For example, if a disk witness resource is added to a 3-node cluster to create a Node and Disk Majority cluster, the number of votes is now 4. The quorum requirement is 4/2+1 = 3.  This means that at least 3 votes must be up all the time.  If the disk witness resource is down, the loss of any of the 3 nodes would bring the cluster below quorum and bring down the cluster.  If the 3-node cluster is a Node Majority cluster, which means it has no disk witness resource, the cluster can survive when 1 of the nodes goes down. Therefore, the quorum mode should be chosen to maximize the number of node failures the cluster can survive.
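To compare the even and odd cases side by side, the following minimal Python sketch tabulates how many node failures a cluster can absorb with and without a witness. It is an illustration only: the function name `node_failures_tolerated` is hypothetical, and the "with witness" figure assumes the witness itself stays online.

```python
def node_failures_tolerated(nodes: int, witness: bool) -> int:
    """Votes that can be lost before the cluster drops below quorum."""
    total = nodes + (1 if witness else 0)
    needed = total // 2 + 1  # more than half of the votes
    return total - needed


for nodes in range(2, 6):
    without = node_failures_tolerated(nodes, witness=False)
    with_w = node_failures_tolerated(nodes, witness=True)
    print(f"{nodes} nodes: {without} without witness, {with_w} with witness (gain {with_w - without})")
# 2 nodes: 0 without witness, 1 with witness (gain 1)
# 3 nodes: 1 without witness, 1 with witness (gain 0)
# 4 nodes: 1 without witness, 2 with witness (gain 1)
# 5 nodes: 2 without witness, 2 with witness (gain 0)
```

The gain is 1 for even node counts and 0 for odd node counts, which is the reason a witness is recommended only when the number of nodes is even.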

 

Fortunately, Failover Cluster Manager intelligently picks the best model for you based on the number of nodes and disk availability when creating the cluster.  It will use Node and Disk Majority Quorum when creating a cluster with an even number of nodes, provided that a cluster disk resource is available. Similarly, Failover Cluster Manager will select Node Majority Quorum when creating a cluster with an odd number of nodes.

 

For Node and Disk Majority Quorum, the disk chosen automatically by Failover Cluster Manager during cluster creation might not be the one you wish to use as the disk witness.  To change the disk witness, right-click on the cluster name and select More Actions…, then Configure Cluster Quorum Settings… to get to the “Configure Cluster Quorum Wizard”.  Walk through the wizard, select the “Node and Disk Majority Quorum” button and select or change the disk witness.  The original disk witness will automatically be placed back in Available Storage.

 

The PowerShell commands to set the quorum are listed below and documented online: http://technet.microsoft.com/en-us/library/ee461013.aspx

 

For Node Majority:

                Set-ClusterQuorum -NodeMajority

For Node and Disk Majority:

                Set-ClusterQuorum -NodeAndDiskMajority "Cluster Disk 5"

 

 

Node and File Share Majority

 

Node and File Share Majority Quorum mode uses a file share location instead of a disk witness as the additional vote.  So a Node and File Share Majority cluster only makes sense with an even number of nodes. The explanation for why Node and File Share Majority should be used only with an even number of nodes is exactly the same as for a Node and Disk Majority cluster. Please consult the previous section on Node and Disk Majority Quorum.

 

This quorum configuration option is often seen in a multi-site / multi-subnet cluster with an even number of nodes at two sites, and this additional vote creates an odd number of total votes.  If you have 2 nodes at Site 1, 2 nodes at Site 2 and the File Share Witness at a third site, you can sustain the failure of any site and still maintain quorum with at least 3 votes.
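The multi-site claim can be verified by removing each site's votes in turn. This is a minimal Python sketch for illustration; the function name `survives_each_site_loss` and the site labels are hypothetical. The second layout, with the witness hosted in Site 1, is not from the post and is included only as a contrast.

```python
def survives_each_site_loss(votes_by_site: dict) -> dict:
    """For each site, report whether quorum is kept if that whole site fails."""
    total = sum(votes_by_site.values())
    needed = total // 2 + 1  # more than half of all votes
    return {site: total - votes >= needed for site, votes in votes_by_site.items()}


# Layout from the text: 2 nodes per site, file share witness at a third site.
third_site = {"Site 1": 2, "Site 2": 2, "Site 3 (witness)": 1}
# Contrast (assumption for illustration): witness hosted alongside the Site 1 nodes.
same_site = {"Site 1 (2 nodes + witness)": 3, "Site 2": 2}

print(survives_each_site_loss(third_site))
print(survives_each_site_loss(same_site))
```

With the witness at a third site every entry is True, since any single site failure leaves at least 3 of the 5 votes. With the witness in Site 1, losing Site 1 leaves only 2 votes and quorum is lost.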

 

Node and File Share Majority Quorum is typically chosen when no shared disks are available or the disks are needed for other tasks.  To set up a file share witness, create a shared folder on a remote machine accessible from all the nodes in the cluster.  The share must be in the same forest and should not be on a node in the same cluster (as losing that node would result in the loss of 2 votes).  The shared folder must grant full access to the cluster’s computer account, the Cluster Name Object (CNO).  From the “Configure Cluster Quorum Wizard”, select the “Node and File Share Quorum” button and enter the file share path information:

 

\\<machine name>\<file share path and folder>

 

The PowerShell command for Node and File Share Majority is:

                Set-ClusterQuorum -NodeAndFileShareMajority \\A19C1I4X64C1\fsw

 

 

No Majority: Disk Only

 

Prior to Windows Server 2003, the only available quorum type was Disk Only.  In this setup, the Disk Witness has the single authoritative vote, and if the disk witness resource fails, the entire cluster would go down.  Because of this single point of failure, No Majority: Disk Only is not recommended.  The primary scenario for using No Majority: Disk Only is when dealing with unstable software or networks that constantly cause networks or nodes to fail while disk access remains stable.  In this scenario, selecting No Majority: Disk Only Quorum ensures that as long as a node is up, the cluster is up and thus the resource is running.  Sometimes this mode is also used temporarily during an upgrade when the number of nodes in the cluster gradually increases, so as to avoid constantly changing the quorum model.

 

The PowerShell command for No Majority: Disk Only Quorum is:

                Set-ClusterQuorum -DiskOnly "Cluster Disk 5"

 

The PowerShell CMDlets are documented online at: http://technet.microsoft.com/en-us/library/ee461013.aspx

 

Thanks,
Peter Huang
Software Development Engineer in Test II
Clustering and High-Availability
Microsoft

 
Jun
18
2010

Clustering & High-Availability at TechEd 2010

Hi,

 

The Clustering team is coming to TechEd New Orleans from June 7-10! 

 

Sign up now at: http://northamerica.msteched.com/

 

 

In addition to many technical presentations, the team will be manning the Cluster Booth (Orange - WSV - 7) throughout the week, so please stop by to ask your questions and share your feedback.

Sessions from the Cluster Team

 

WSV313 | Failover Clustering Deployment Success

Session Type: Breakout Session

Wednesday, June 9  |  1:30 PM - 2:45 PM  |  Rm 295

Track: Windows Server

Speaker: Symon Perriman

Level: 300 - Advanced

Learn all about setup and deployment of Windows Server 2008 R2 Failover Clusters. This session gives an overview of the interaction of Failover Clustering with a variety of deployment strategies and considerations, upgrades, migration, sysprep, automating deployment, AD considerations, validation, patching, scripting, Hyper-V Clustering, and SKU selection.

 

WSV314 | Failover Clustering Pro Troubleshooting with Windows Server 2008 R2

Session Type: Breakout Session

Tuesday, June 8  |  3:15 PM - 4:30 PM  |  Auditorium C

Track: Windows Server

Speaker: Steven Ekren

Level: 300 - Advanced

Failover Clustering with Windows Server 2008 R2 has seen the fastest adoption rate of any clustering release, but do you know all the tips and tricks to make it even better? Do you know what to do when your mission critical solution breaks? Do you know how to quickly triage issues to achieve the highest possible levels of availability? This session addresses some common issues for 2008 and 2008 R2 Failover Clusters, and offers strategies to identify, troubleshoot and solve these problems, leveraging the various tools of Validate, new eventing channels, and generating and collecting the verbose debug Cluster.log for analyzing the cluster.

 

VIR06-INT | Failover Clustering with Hyper-V Unleashed with Windows Server 2008 R2

Session Type: Interactive Session

Tuesday, June 8  |  9:45 AM - 11:00 AM  |  Rm 350

Track: Virtualization

Speaker: Steven Ekren, Symon Perriman

Level: 300 - Advanced

This is a freeform session that demos, whiteboards, and discusses what’s on YOUR MIND! Share the pain you have using Failover Clustering with Hyper-V today, ask difficult questions or let us know why some customers avoid clustering even when there is a need to implement a high availability solution. Get the answers you need directly from Microsoft’s Clustering and High-Availability product team.

 

VIR303 | Disaster Recovery by Stretching Hyper-V Clusters across Sites

Session Type: Breakout Session

Wednesday, June 9  |  5:00 PM - 6:15 PM  |  Rm 276

Track: Virtualization

Speaker(s): Symon Perriman

Level: 300 - Advanced

As servers are consolidated into VMs, the availability of those VMs is becoming increasingly important, along with providing disaster tolerance and business continuance. This session covers considerations of multi-site clustering, where virtual machines can be configured on a Hyper-V Failover Cluster that spans across sites. This session also covers conducting live migrations across datacenters and considerations of Cluster Shared Volumes with replication software.

 

WSV01-HOL | Failover Clustering in Windows Server 2008 R2

Session Type: Hands-on Lab

Track: Windows Server

Level: 300 - Advanced

This consistently top-rated lab introduces you to Failover Clustering and some of its features and improvements in Windows Server 2008 R2. This lab is for both those new to clustering, and experienced users, as it covers creating and validating a cluster, deploying resources and Windows PowerShell, and the new management utility.

 

Breakout Sessions

 

ARC308 | High Availability: A Contrarian View

Session Type: Breakout Session

Tuesday, June 8  |  3:15 PM - 4:30 PM  |  Rm 279

Track: Architecture

Speaker(s): Udi Dahan

Level: 300 - Advanced

Many developers are aware of the importance of high availability, critically analyzing any single points of failure in the infrastructure. Those same developers rarely give a second thought to the periods of time when a system is being upgraded. Even if all the servers are running, most systems cannot function in-between versions. Yet with the increased pace of business, users are demanding ever more frequent releases. The maintenance programmers and system administrators are left holding the bag, long after the architecture that sealed their fate was formulated. Join Udi for some different perspectives on high availability -- architecture and methodology for the real world.

 

DAT207 | SQL Server High Availability: Overview, Considerations, and Solution Guidance

Session Type: Breakout Session

Track: Database Platform

Speaker(s): Vineet Rao

Level: 200 - Intermediate

Mission-critical applications require careful planning to achieve maximum uptime. Various factors like availability SLAs for RPO (Recovery Point Objective) and RTO (Recovery Time Objective), log generation rates, latency, storage environment, data size, application dependencies, virtualization, etc. contribute towards building the availability strategy. As an architect or DBA it is important to develop the right HADR strategy and corresponding solution which meets the availability requirement and at the same time provide the cost benefit for your organization. Come to this session to learn about how you can develop the right HADR solution for your environment. In this session, we discuss different solution alternatives and explain the key metrics to use for making the decision as well as discuss best practices.

 

DAT303 | Architecting and Using Microsoft SQL Server Availability Technologies in a Virtualized World

Session Type: Breakout Session

Track: Database Platform

Speaker(s): Damir Bersinic, Joel Shalaby

Level: 300 - Advanced

The addition of virtualization support has changed the way architects and DBAs need to look at availability in the SQL Server world. How does Hyper-V Live Migration or VMware's vMotion work in conjunction with existing SQL Server availability technologies like log shipping, database mirroring, or replication? As you architect your SQL Server infrastructure virtually, what do you need to know to ensure that you help provide the best solution using available technologies? Through demonstration using a scenario-based approach, learn when and how to use virtualization availability features and Windows Failover Clustering in combination with SQL Server technologies and the impact of combining them. See how Live Migration does not cover you in all cases and how database mirroring and other SQL Server features can enhance your virtual infrastructure to ensure databases continue to operate when things go wrong. We also evaluate using a mixture of virtual and physical environments for SQL Server and the impact on availability when combining them. Come to this session and get answers to questions your managers or customers will be asking you, if they are not already.

 

DAT305 | See the Largest Mission Critical Deployment of Microsoft SQL Server around the World

Session Type: Breakout Session

Track: Database Platform

Speaker(s): Kevin Cox

Level: 300 - Advanced

How do they scale; do they scale up or scale out? How are High Availability and Disaster Recovery architected? Are there any common techniques that bring the largest gain? This session is intended for people interested in learning how we have solved some very hard problems in the database tier. Take the lessons we have learned back to your business to improve your solutions.

 

DAT401 | High Availability and Disaster Recovery: Best Practices for Customer Deployments

Session Type: Breakout Session

Thursday, June 10  |  8:00 AM - 9:15 AM  |  Rm 272

Track: Database Platform

Speaker(s): Prem Mehra, Sanjay Mishra

Level: 400 - Expert

Microsoft SQL Server provides various High Availability and Disaster Recovery technologies. This session involves discussions on how various customers use these technologies to achieve their HA and DR goals. Focus is on deployment architectures, lessons learned and best practices from real-life customer deployment scenarios.

 

DAT407 | Windows Server 2008 R2 and Microsoft SQL Server 2008: Failover Clustering Implementations

Session Type: Breakout Session

Wednesday, June 9  |  3:15 PM - 4:30 PM  |  Rm 261

Track: Database Platform

Speaker(s): Allan Hirt

Level: 400 - Expert

Forget a good portion of everything you know about deploying failover clustering if you're used to Windows Server 2003 and SQL Server 2000 or 2005. With the introduction of Windows Server 2008 and with further improvements in R2, Windows failover clustering is a different experience than in the past. Come to see how the changes and improvements to Windows affect clustered SQL Server deployments. There are quite a few things that can trip you up in the areas of setup, upgrade, and patching processes (among other things) with SQL Server 2008 with its changes. This session bridges the gap of what you may have known to what you now need to know to have successful deployments of SQL Server 2008 failover clustering on Windows Server 2008 R2.

 

UNC304 | Microsoft Exchange Server 2010: High Availability Deep Dive

Session Type: Breakout Session

Wednesday, June 9  |  9:45 AM - 11:00 AM  |  Rm 391

Track: Unified Communications

Speaker(s): Scott Schnoll

Level: 300 - Advanced

This session starts with a brief overview of the mailbox resiliency features in Exchange 2010 SP1, and then dives deep into planning and design considerations, advanced operational tasks and troubleshooting of database availability groups. We’ll cover database availability group sizing and design, database copy count and distribution, server capacity planning and storage and network design.

 

UNC305 | Microsoft Exchange Server 2010 High Availability Design Considerations

Session Type: Breakout Session

Tuesday, June 8  |  8:00 AM - 9:15 AM  |  Rm 265

Track: Unified Communications

Speaker(s): Ross Smith IV

Level: 300 - Advanced

Go beyond the basics of Exchange High Availability and gain a solid understanding of what you need to know to design a highly available and site resilient Exchange 2010 solution. In this session we cover Exchange 2010 Mailbox DAG design, including DAG sizing, database copy count/distribution, server capacity planning, storage design, and network design.

 

 

Interactive & BOF Sessions

 

UNC01-INT | Real-World Database Availability Group (DAG) Design

Session Type: Interactive Session

Tuesday, June 8  |  1:30 PM - 2:45 PM  |  Rm 351

Track: Unified Communications

Speaker(s): Kumar Venkateswar, Scott Schnoll

Level: 300 - Advanced

Microsoft provides a great deal of content covering the theory around designing highly available and site resilient solutions, but there is very little discussion of how to apply that to “non-standard” customer scenarios. Not all customers have 50% of their users in one datacenter and 50% in a second datacenter, or one “hot datacenter” for all user access and one “warm datacenter” for site resilient scenarios. In this session, we apply the theory of DAG design and utilize the tools provided by Microsoft to design some real-world site resilient solutions.

 

VIR02-INT | Hyper-V Live Migration over Distance: A Multi-Datacenter Approach

Session Type: Interactive Session

Monday, June 7  |  1:00 PM - 2:15 PM  |  Rm 351

Track: Virtualization

Speaker(s): James Schwartz, Ryan Sokolowski

Level: 300 - Advanced

Join us to explore a proven architectural approach to migrating VMs between geographically separated datacenters. While there has been a lot of buzz in the industry around the VMware/Cisco/EMC “VCE Coalition”, Microsoft and partners deliver compelling and differentiated solutions. This architecture leverages Hyper-V and multi-site Windows Server 2008 R2 clustering from Microsoft, “Best-in-Class” storage replication technology from Hitachi Data Systems and a real-world data and Fibre Channel infrastructure from Brocade. Also featured, Hitachi Storage Cluster (HSC) for Hyper-V allows simplified management of your VM operations, without requiring manual intervention on your storage arrays. Not a marketing presentation, this is a highly interactive and technical session, featuring demos and walkthroughs, diagrams and explanations of the architecture and more. You'll leave this session knowing everything you need to deploy this solution when you get home (and then get promoted, or at least justify your next trip to Tech•Ed). All session attendees also receive a FREE copy of the joint Microsoft / Hitachi Data Systems / Brocade Whitepaper! Do not miss this session - this is sure to be one of the highlights of the show!

 

BOF34-IT | Microsoft Exchange Server High Availability and Disaster Recovery: Are You Prepared?

Session Type: Birds-of-a-Feather

Thursday, June 10  |  1:30 PM - 2:45 PM  | 

Track: Unified Communications

Speaker(s): Vladimir Meloski

Microsoft Exchange Server is a business-critical system for many organizations. Are you prepared to respond to the challenges of managing its reliability and availability? What are your plans and experiences regarding high availability technologies such as Network Load Balancing and Database Availability Groups? What are your backup procedures and how about the domain controllers and communication devices availability? And at the end, have you tested your solutions and strategies at all? In this BOF session share your experiences, discuss, and learn from others about high availability and disaster recovery in Microsoft Exchange Server.

 

 

Hands on Labs

 

DAT01-HOL | Create a Two-Node Windows Server 2008 R2 Failover Cluster

Session Type: Hands-on Lab

Track: Database Platform

Level: 300 - Advanced

During this lab, create a two-node failover cluster using Windows Server 2008 R2, Enterprise Edition.

 

DAT02-HOL | Create a Windows Server 2008 R2 MSDTC Cluster

Session Type: Hands-on Lab

Track: Database Platform

Level: 300 - Advanced

During this lab, create a highly available Microsoft Distributed Transaction Coordinator (MSDTC) service using both the High Availability Wizard and advanced procedures in the Failover Cluster Manager snap-in.

 

DAT09-HOL | Installing a Microsoft SQL Server 2008 + SP1 Clustered Instance

Session Type: Hands-on Lab

Track: Database Platform

Level: 300 - Advanced

During this lab, create a highly available SQL Server 2008 + SP1 clustered instance on a Windows Server 2008 R2 Failover Cluster using the slipstreaming process.

 

DAT12-HOL | Maintaining a Microsoft SQL Server 2008 Failover Cluster

Session Type: Hands-on Lab

Track: Database Platform

Level: 300 - Advanced

During this lab, perform typical maintenance tasks for a SQL Server 2008 failover cluster.

 

UNC02-HOL | Microsoft Exchange Server 2010 High Availability and Storage Scenarios

Session Type: Hands-on Lab

Track: Unified Communications

Level: 300 - Advanced

After completing this lab you will be able to design and implement a DAG solution that will include a simplified Mailbox with High Availability and Disaster Recovery on a unified platform, that is server and storage group independent, can replicate data for High Availability and Disaster Recovery, and has Data Center resiliency. Other topics addressed are server availability, full redundancy, and no single point of failure.

 

VIR06-HOL | Implementing High Availability and Live Migration with Windows Server 2008 R2 Hyper-V

Session Type: Hands-on Lab

Track: Virtualization

Level: 200 - Intermediate

 

 

 

We will see you soon!

Symon Perriman

Program Manager II
Clustering & High-Availability

Microsoft

 
Jun
18
2010

Creating a Cluster Resource DLL (Part 3)

Hi,

In this series of blog posts we will help you to design, develop and debug the Resource DLL you are developing to give your application high-availability with Windows Server 2008 & 2008 R2 Failover Clustering.

 

We recommend you start with the other blog posts in the series:

·         Part 1: http://blogs.msdn.com/clustering/archive/2010/03/11/9976620.aspx

·         Part 2: http://blogs.msdn.com/clustering/archive/2010/03/30/9987135.aspx

 

In our last post we showed the following RHS Resource state machine and described how to interpret the diagram.  We will be referring to this same state machine in this blog post.

 

[Figure: RHS Resource state machine diagram]

 

In this post, the final post in the series, we will walk through some scenarios using the above state chart diagram. 

 

Scenario 1

First we will assume we have a resource that does not have dependencies, and this resource is offline. The user tells the cluster to bring this resource online.  RCM will call to RHS, and RHS will call the Online entry point.  From now on, the resource is in the Onlining state, and specifically it is in the “Online in Progress” sub-state.  By default, the resource is allowed to remain in this state for up to 5 minutes.  The diagram below shows the workflow if the resource comes online successfully.

 

[Figure: workflow when the resource comes online successfully]

 

Please note that in this sample scenario the Resource DLL handles the online in a worker thread, which is not always required.  You can find details on how to implement pending operations using a worker thread here http://msdn.microsoft.com/en-us/library/aa370471(VS.85).aspx.

If you are confident that online will complete within 5 minutes, you can take all actions required to bring the application online within the Online call.  Returning ERROR_SUCCESS from the Online call will move the RHS state machine directly to the Online state.  Be careful when deciding whether to pend the Online or Offline call: if Online does not complete within 5 minutes, this causes an “Online Failure” and increases the downtime of your application.  We suggest always pending Online and Offline unless the work is trivial and can complete in under 300 milliseconds.
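To make the "pend and continue on a worker thread" idea concrete, here is a minimal sketch in Python rather than in the C Resource API. Everything in it is a stand-in we made up for illustration: `SketchResource`, the `set_resource_status` callback and the `start_application` callable are hypothetical, and only the numeric value of the pending return code (997, ERROR_IO_PENDING in the Win32 headers) is taken from Windows.

```python
import threading

ERROR_SUCCESS = 0
ERROR_IO_PENDING = 997  # Win32 value of the "pending" return code


class SketchResource:
    """Toy stand-in for a Resource DLL; this is not the real cluster API."""

    def __init__(self, set_resource_status, start_application):
        # Hypothetical callback that plays the role of reporting status to RHS.
        self._set_status = set_resource_status
        # Hypothetical callable that performs the slow application start-up.
        self._start_application = start_application
        self._worker = None

    def online(self):
        # Return quickly; the slow work continues on a worker thread.
        self._worker = threading.Thread(target=self._online_worker)
        self._worker.start()
        return ERROR_IO_PENDING

    def _online_worker(self):
        # Report the final outcome once the start-up attempt has finished.
        try:
            self._start_application()
            self._set_status("Online")
        except Exception:
            self._set_status("Failed")

    def wait(self):
        # Helper for the sketch only: block until the worker has finished.
        self._worker.join()
```

The point of the design is that the Online call itself never blocks on the application, so the 5 minute limit on the call is not at risk; only the worker thread has to keep reporting progress.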

 

Scenario 2

Now let’s review how the scenario might change if the Online call to the Resource DLL completes in a different way.

·         If the Online call completes with success, the resource is moved to the Online state

·         If the Online call does not return within 5 minutes, RHS will declare that the resource is “deadlocked” and will move the resource to the “Failed/Terminating” state

·         If the Online call completes with a failure, the resource is moved to the “Failed/Terminating” state

·         If the Online call returns ERROR_IO_PENDING, the resource is moved to the “Online Pending” state.  In that state RHS assumes that the Resource DLL wants more time to complete the online and that it has spawned a worker thread that continues the online.  The resource can stay in that state for up to 3 minutes without talking to RHS.  After 3 minutes of silence, RHS will declare that the resource has deadlocked and will move the resource to the “Failed/Terminating” state.  If during these 3 minutes the resource calls SetResourceStatus, it will be given another 3 minutes to continue the online.

·         If the resource is in the “Online Pending” state and SetResourceStatus has indicated that the call has completed with ERROR_SUCCESS, the resource is moved to the Online state

·         If the resource is in the “Online Pending” state and SetResourceStatus has indicated that the call has completed with an error, the resource is moved to the “Failed/Terminating” state

·         In the “Failed/Terminating” state the resource waits for a Terminate call from the RCM
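The outcomes listed above can be written down as a small transition table. The sketch below is only our reading of those bullets, expressed in Python; the event names such as `returned_pending` or `deadlock_timeout` are labels we invented for this example and do not exist in RHS.

```python
from enum import Enum


class State(Enum):
    OFFLINE = "Offline"
    ONLINE_IN_PROGRESS = "Online in Progress"
    ONLINE_PENDING = "Online Pending"
    ONLINE = "Online"
    FAILED_TERMINATING = "Failed/Terminating"


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.OFFLINE, "online_requested"): State.ONLINE_IN_PROGRESS,
    (State.ONLINE_IN_PROGRESS, "returned_success"): State.ONLINE,
    (State.ONLINE_IN_PROGRESS, "returned_failure"): State.FAILED_TERMINATING,
    (State.ONLINE_IN_PROGRESS, "returned_pending"): State.ONLINE_PENDING,
    (State.ONLINE_IN_PROGRESS, "deadlock_timeout"): State.FAILED_TERMINATING,
    (State.ONLINE_PENDING, "status_still_pending"): State.ONLINE_PENDING,
    (State.ONLINE_PENDING, "status_success"): State.ONLINE,
    (State.ONLINE_PENDING, "status_failure"): State.FAILED_TERMINATING,
    (State.ONLINE_PENDING, "pending_timeout"): State.FAILED_TERMINATING,
}


def step(state, event):
    """Return the next state, or raise if the bullets describe no such move."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value!r} on {event!r}")


def run(events, state=State.OFFLINE):
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["online_requested", "returned_pending", "status_success"])` ends in `State.ONLINE`, while replacing the last event with `"pending_timeout"` ends in `State.FAILED_TERMINATING`.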

 

The Offline call is handled very similarly to the Online call, since it can also be “pending”.  Generally all the statements we made above for the Online (& Online Pending) call apply to the Offline (& Offline Pending) call.
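The pending-state timing (3 minutes of silence, renewed by each SetResourceStatus call) is the same idea for both Online Pending and Offline Pending, and can be modelled as a simple watchdog. This is a sketch under our own assumptions: `PendingWatchdog` and its method names are hypothetical, time is passed in as plain seconds so nothing actually sleeps, and we do not claim anything about how RHS behaves at exactly the 3 minute boundary.

```python
class PendingWatchdog:
    """Toy model of the pending timeout; not how RHS is actually implemented."""

    def __init__(self, timeout_s=180.0):
        self.timeout_s = timeout_s  # 3 minutes, per the description above
        self.last_report = 0.0

    def start(self, now):
        # The resource has just entered a pending state.
        self.last_report = now

    def on_set_resource_status(self, now):
        # Each status report restarts the window.
        self.last_report = now

    def is_deadlocked(self, now):
        # Strictly more than timeout_s of silence counts as a deadlock here;
        # the exact boundary behaviour is an assumption of this sketch.
        return now - self.last_report > self.timeout_s
```

A worker thread that reports status at 170 seconds is therefore still considered healthy at 340 seconds, but would be declared deadlocked if it then stayed silent past 350 seconds.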

While a resource is in the “Online in Progress” or “Offline in Progress” state, RHS will not send resource controls to the resource, but as soon as the resource moves to the “Online Pending” or “Offline Pending” state it will start receiving resource controls.

Please note that when Online and Offline are taking too long they are handled by the RHS differently from how other calls are handled.  For Online and Offline calls, RHS notifies RCM.  RCM is expected to issue a Terminate call.  For the other calls with a timeout, RHS still notifies RCM, but after that RHS just terminates itself after creating a crash dump file (for more information about that see the following blog post http://blogs.msdn.com/clustering/archive/2009/06/27/9806160.aspx).  RCM will observe that RHS is gone and will take a recovery action.

Scenario 3

The next scenario refers to the diagram below:

 

[Figure: workflow when an Offline call times out and RCM issues a Terminate call]

 

This image shows the case when the user is trying to take a resource offline, and offline is taking too long due to a call which takes longer than 3 minutes.  As soon as RHS has detected the timeout, it changes the resource state to “Failed/Terminating” and reports that to RCM.  In response, RCM issues a Terminate call back to RHS.  RHS calls the Resource DLL Terminate entry point.  The Resource DLL takes an action to speed up completion of the Offline call.  For instance, if the Resource DLL uses a socket to communicate with the application, then Terminate might try to close the socket.  After that, Terminate waits for the Offline worker thread to complete and then takes some action to offline the application.  Please note that the Terminate call is also subject to the 5-minute timeout.  So if the Offline worker thread does not complete in time, or if Terminate itself gets stuck, then RHS will take recovery actions.  This time RHS will terminate itself.

As soon as the resource moves to the Online state, RHS starts health monitoring the resource. The resource can choose one of two health monitoring modes:

·         One option is that the resource provides a handle.  This could be any handle that can be used in a call to WaitForSingleObject, such as a handle to an event or to a process.  As long as the object corresponding to the handle is not signaled, the resource is healthy.  If the object becomes signaled, then RHS will notify RCM about that and will wait for a Terminate call.

·         If no handle is provided, RHS will call the LooksAlive and IsAlive Resource DLL entry points to verify resource health.  It is assumed that LooksAlive calls are much lighter than IsAlive calls and are called more frequently.  By default, LooksAlive is called every 5 seconds, and IsAlive is called every minute.  If LooksAlive indicates failure, then RHS immediately calls IsAlive.  If IsAlive confirms the failure, then RCM is notified about that, and eventually it will send a Terminate call.  RHS will stop scheduling LooksAlive and IsAlive calls as soon as the resource fails or moves out of the Online state due to an Offline or a Terminate.
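The LooksAlive/IsAlive mode described above can be sketched as a polling loop. The Python model below is ours, not RHS code: `poll_health` is a hypothetical name, time is simulated in whole seconds, and we assume that on a second where both checks are due only IsAlive runs, and that a failed scheduled IsAlive is treated as a confirmed failure.

```python
def poll_health(looks_alive, is_alive, ticks,
                looks_alive_every=5, is_alive_every=60):
    """Simulate seconds 1..ticks of health polling.

    looks_alive and is_alive are callables taking the current second and
    returning True while the resource is healthy. Returns the second at
    which a failure is confirmed, or None if the resource stayed healthy.
    """
    for t in range(1, ticks + 1):
        if t % is_alive_every == 0:
            # Scheduled thorough check.
            if not is_alive(t):
                return t
        elif t % looks_alive_every == 0:
            # Cheap check; a failure is confirmed by an immediate IsAlive.
            if not looks_alive(t) and not is_alive(t):
                return t
    return None
```

With this model a LooksAlive false alarm costs one extra IsAlive call but does not fail the resource, which is why LooksAlive can afford to be a cheap and slightly pessimistic check.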

 

The Terminate call can run concurrently with any other call. Terminate is expected to perform the following tasks:

·         If there is an ongoing Online or Offline call then Terminate is responsible for canceling it or for doing something that will expedite completion of the call

·         After that, Terminate should wait for the ongoing call to complete

·         Finally it is responsible for bringing the resource to Offline state
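The three Terminate tasks above map naturally onto a cancel flag plus a join on the worker thread. The sketch below shows that shape in Python; `TerminableResource` is a hypothetical class, the `threading.Event` stands in for whatever mechanism a real Resource DLL would use to interrupt its worker (for example closing a socket), and the 30 second wait simply imitates a slow start-up.

```python
import threading


class TerminableResource:
    """Toy model of a resource whose pending Online can be terminated."""

    def __init__(self):
        self._cancel = threading.Event()
        self._worker = None
        self.state = "Offline"

    def online(self):
        self._cancel.clear()
        self.state = "Online Pending"
        self._worker = threading.Thread(target=self._online_worker)
        self._worker.start()

    def _online_worker(self):
        # Stand-in for slow start-up: blocks until cancelled or 30 s pass.
        cancelled = self._cancel.wait(timeout=30)
        if not cancelled:
            self.state = "Online"

    def terminate(self):
        # 1. Expedite completion of the ongoing call.
        self._cancel.set()
        # 2. Wait for the ongoing call to complete.
        if self._worker is not None:
            self._worker.join()
            self._worker = None
        # 3. Bring the resource to the Offline state.
        self.state = "Offline"
```

Calling `terminate()` while the worker is still waiting returns almost immediately, because setting the event wakes the worker before the join. A Terminate that joined without first cancelling could block for the full start-up time and run into its own timeout.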

 

Let’s review the concurrency rules:

·         Open is called before any other calls to the resource

·         At the moment Close is called, no other calls should be going on, and no calls to that resource can happen after the Close

·         Terminate can be called concurrently with any other call except Open or Close

·         ResourceControl, IsAlive, LooksAlive, Offline (in the “Offline in Progress” sub-state), and Online (in the “Online in Progress” sub-state) are serialized with each other

·         IsAlive and LooksAlive can be sent to the resource only while the resource is Online, and it has not indicated a failure, unless pre-empted by Terminate

·         As soon as Online or Offline moves to a pending state, the resource controls can be executed concurrently with the worker thread that conducts online or offline

 

I hope this series of blog posts will be helpful when you design your own Resource DLL.

 

Thanks,
Vladimir Petter
Senior Software Development Engineer
Clustering & High-Availability
Microsoft
