
Trials and Tribulations of SQL Transactional Replication

I've dealt with SQL replication for decades, and from a basic configuration and troubleshooting perspective, not a lot has changed. The feature has been extended a bit over the years to accommodate newer SQL Server capabilities (In-Memory OLTP, Azure, etc.), but the fundamentals are the same.

Many refer to replication as the Swiss Army Knife of SQL Server, and I can understand why, but with this “extreme flexibility” come “extreme shortcomings”. This post will delve into some of the issues you should be aware of.

Many have blogged about SQL replication, but Kendra Little (b|t) stands out – her posts are clear and concise, and you will always learn something from her.

But despite having read all of her posts about replication, and those of many others, when I started my current role almost 3 years ago, I was woefully ignorant of many facets of the replication feature.

This post is my effort to save you some of the pain and head scratching that I’ve endured!

Gotcha #1: you want to go parallel for initial sync

Much of the pain of SQL replication occurs when tables/articles are large. Setting up subscribers to receive terabytes of data requires changing some very specific things in very specific places, in order to have a chance of going parallel.

Unfortunately, the out-of-the-box defaults for replication place you squarely in the single-threaded camp. The Snapshot Agent can be configured with a setting named maxbcpthreads that can make snapshot generation much faster by using parallel threads to generate the snapshot.

The catch?

It's ignored unless the publication's sync_method is native. You might have looked at the docs and concluded that you're already getting native snapshots because you're using the default sync_method of concurrent, but with concurrent your snapshot will still be generated by a single thread. To get parallel snapshot generation, you must have a sync_method of native and add the maxbcpthreads switch to the Snapshot Agent job step. You have been warned!
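To make that concrete, here's a minimal sketch of both pieces, assuming a publication named MyPub and a Snapshot Agent job named MyPub_Snapshot (both hypothetical names); for an existing publication you'd use sp_changepublication to switch the sync_method, which invalidates the current snapshot.

-- On the publisher, in the published database: create the publication with a native sync_method.
EXEC sp_addpublication
     @publication = N'MyPub'        -- hypothetical publication name
    ,@sync_method = N'native'       -- required for maxbcpthreads to be honored
    ,@repl_freq   = N'continuous'
    ,@status      = N'active';

-- On the distributor: append -MaxBcpThreads to the Snapshot Agent job step ("Run agent.").
-- The existing @command text for the step must be preserved, with the switch appended to it.
EXEC msdb.dbo.sp_update_jobstep
     @job_name = N'MyPub_Snapshot'  -- hypothetical (renamed) Snapshot Agent job
    ,@step_id  = 2                  -- the "Run agent." step
    ,@command  = N'<existing snapshot agent command> -MaxBcpThreads 8';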

Gotcha #2: Agent job naming sucks

Deploying replication generates tons of SQL Agent jobs, especially if you use push subscriptions and have the distribution database on the publisher. Whether or not you have long publication names, when it comes time to look for a specific job, it can be difficult to discern which ones generate snapshots, which handle distribution, and so on. That's why I always rename my replication jobs according to function/category, appending "_Snapshot" or "_Distribution" to the relevant jobs using a script.

Renaming the jobs might make it impossible to bring up the proper job when drilling down from other places in the UI, such as Replication Monitor, but I don't use the monitor for that – it's a sacrifice I'm willing to make.
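Here's a rough sketch of the kind of renaming script I mean, driven by the built-in REPL-* job categories (the suffixes are just my convention, nothing replication requires):

DECLARE @job sysname, @newname sysname;

DECLARE job_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT j.name
          ,j.name + CASE c.name
                        WHEN N'REPL-Snapshot'     THEN N'_Snapshot'
                        WHEN N'REPL-Distribution' THEN N'_Distribution'
                        WHEN N'REPL-LogReader'    THEN N'_LogReader'
                    END
    FROM msdb.dbo.sysjobs AS j
    INNER JOIN msdb.dbo.syscategories AS c ON c.category_id = j.category_id
    WHERE c.name IN (N'REPL-Snapshot', N'REPL-Distribution', N'REPL-LogReader')
      AND j.name NOT LIKE N'%$_Snapshot' ESCAPE '$'
      AND j.name NOT LIKE N'%$_Distribution' ESCAPE '$'
      AND j.name NOT LIKE N'%$_LogReader' ESCAPE '$';

OPEN job_cur;
FETCH NEXT FROM job_cur INTO @job, @newname;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- rename the job by appending a suffix that describes its function
    EXEC msdb.dbo.sp_update_job @job_name = @job, @new_name = @newname;
    FETCH NEXT FROM job_cur INTO @job, @newname;
END
CLOSE job_cur;
DEALLOCATE job_cur;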

Gotcha #3: Password rotation can break replication

Most companies have a password rotation policy for domain users, as well as domain service accounts. If you use a domain account for anything having to do with replication agent security, obviously the SQL Server environment will have no knowledge of this password rotation. So while using domain accounts is sometimes unavoidable when using replication, the runbook you use for password rotation should include this often-overlooked area.

Gotcha #4: Replication cleanup issues

The default batch size for replication cleanup is 5,000 records. If you are in charge of a very busy system, the replication cleanup process will never catch up, and your distribution database will continue to grow. Depending on where you have the distribution database located, this could affect your production databases, because the drive can run out of space.

The cleanup job runs frequently by default (every 10 minutes), so if you change the batch size to something larger, the longer-running deletes might cause blocking in your system.

A different approach – if you have the storage to handle it – is to schedule the cleanup job so that it doesn’t run at all during the week, and use a large batch size when it runs on the weekend (assuming you have a maintenance window to actually do this type of work on weekends).
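For reference, the cleanup work is performed by the "Distribution clean up: distribution" job, which calls sp_MSdistribution_cleanup. Here's a sketch of attaching a weekend-only schedule to that job (the schedule name and start time are hypothetical); the existing every-10-minutes schedule would still need to be detached or disabled.

-- What the default cleanup job executes (in the distribution database):
-- EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 72;

EXEC msdb.dbo.sp_add_schedule
     @schedule_name          = N'Distribution cleanup - Sunday 1am'  -- hypothetical name
    ,@freq_type              = 8        -- weekly
    ,@freq_interval          = 1        -- Sunday
    ,@freq_recurrence_factor = 1        -- every week
    ,@active_start_time      = 010000;  -- 01:00:00

EXEC msdb.dbo.sp_attach_schedule
     @job_name      = N'Distribution clean up: distribution'
    ,@schedule_name = N'Distribution cleanup - Sunday 1am';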

Gotcha #5: Not likely to get compression for your snapshots

The rules around getting compression for snapshots are baffling and mysterious. There is some “2GB per file” rule that I've never managed to satisfy, and so I've never been successful at creating compressed snapshots. Mind you, generating compressed snapshots is not likely to be any faster than uncompressed snapshots – in fact it's likely to be slower. But if you have not-so-great bandwidth, it would be great to not send all that data across the wire.

The documentation for snapshot compression is located here (search for “Compressed snapshots”).

Gotcha #6: Initializing from a backup will probably not float your boat

Before you comment below and tell me that you've read the paper by Robert Davis and Kenneth Fischer about initializing from a filegroup backup: most databases in the wild have tons of user tables on the PRIMARY filegroup, and so are not really candidates for what's described in that paper.

The main issue is that you can't really recover just a slice (or multiple slices) of your database to initialize a subscriber. You have to restore PRIMARY plus the filegroups that you're interested in, and that's too large 99% of the time. Alternatively, you can restore your entire database backup and then delete what you don't want, but then you'll have to disable/remove foreign keys, etc.

Other than for demos and/or really small databases, I don’t see how this path will work.

Gotcha #7: No easy/good way to filter horizontally if > 1 table is involved

See my thread with the late, great Robert Davis here.

Gotcha #8: Log Reader woes

The log reader will sometimes stop running, and if that happens, you won't receive any warning/error unless you're polling for it. A better approach is to add retry attempts to the Log Reader Agent job step. That works, but you'll have to remember to re-add the retries for each log reader when you build a new environment.
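A minimal sketch of adding retries to a Log Reader Agent job step (the job name is hypothetical; by default log reader jobs are named after the publisher and published database):

EXEC msdb.dbo.sp_update_jobstep
     @job_name       = N'MYPUBLISHER-MyDb-1'  -- hypothetical Log Reader Agent job name
    ,@step_id        = 2                      -- the "Run agent." step
    ,@retry_attempts = 10
    ,@retry_interval = 1;                     -- minutes between retries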

Gotcha #9: DR sucks with SQL replication

How will you handle setting up replication in a DR environment? One possibility is to automagically script the entire PRD replication environment on some type of schedule, and then modify the server names in those scripts for deployment to DR. But no matter how you handle deploying replication to DR, you will still have to initialize the new DR subscribers, and as described in Gotcha #1, that can be painful.

Gotcha #10: Rows deleted at the subscriber will break distribution

For transactional replication, I can’t think of a good reason that writes should be allowed on subscription databases. But if they are allowed, eventually someone will delete a row that appears in the distribution database, and when the change to that row is attempted at the subscriber, it will of course fail. And all of the rows behind it in the distribution database will also fail to be distributed.

You can work around this – even on a temporary basis, just to get past it – by choosing a different profile for the Distribution Agent. There's a canned profile for exactly this type of error (“Continue on data consistency errors”). It will allow your distribution database to stop growing, but unless you want to leave that profile in place permanently, you'll eventually have to reinitialize at least the offending article in the relevant publication, and switch back to the default/former profile.

Gotcha #11: Nothing sucks worse than schema changes to replicated tables

Why? Because they force you to do the following:

  • remove the table from replication
  • make the schema change at the publisher
  • add the table back to the subscription
  • reinitialize the subscriber(s) (that’s right, you’re back to Gotcha #1 if your tables are large)

If done properly, the snapshot you generate will only include the table you removed from the subscription and added back. But if that table is large, it can still be painful. Also, it's easy to make a mistake when removing a table/article from a subscription.
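For illustration, here's a sketch of the remove/change/re-add cycle for a push subscription, assuming a publication named MyPub, an article dbo.MyBigTable, and a subscriber MYSUBSCRIBER with destination database MyDb (all hypothetical); these commands run in the published database:

-- 1. remove the article from the subscription, then from the publication
EXEC sp_dropsubscription
     @publication = N'MyPub', @article = N'MyBigTable', @subscriber = N'MYSUBSCRIBER';
EXEC sp_droparticle
     @publication = N'MyPub', @article = N'MyBigTable', @force_invalidate_snapshot = 1;

-- 2. make the schema change at the publisher
-- ALTER TABLE dbo.MyBigTable ... ;

-- 3. add the article back and re-create the subscription for it
EXEC sp_addarticle
     @publication = N'MyPub', @article = N'MyBigTable'
    ,@source_owner = N'dbo', @source_object = N'MyBigTable';
EXEC sp_addsubscription
     @publication = N'MyPub', @article = N'MyBigTable'
    ,@subscriber = N'MYSUBSCRIBER', @destination_db = N'MyDb', @subscription_type = N'push';

-- 4. generate a snapshot, which should now contain only the re-added article
EXEC sp_startpublication_snapshot @publication = N'MyPub';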

Of course, using AGs removes this hassle, because schema changes will simply flow through to secondary replicas via the log blocks that are sent to them (your send and redo queues will explode for large tables…). But hey, if you could be running AGs, you’d probably never have deployed replication in the first place.

Gotcha #12: CI/CD woes

Thought your CI/CD pipeline meant “push button” deployments? Think again, because CI/CD has no idea that you have tables that are replicated. This causes DBAs to get involved with deployments due to Gotcha #11. 

Gotcha #13: Monitoring for replication errors/latency

There’s nothing out-of-the-box that I’m aware of that properly handles monitoring the facets of transactional replication that most shops would be interested in.

For example – it might be fine during the day to have 10 million rows pending, but at night when ETL jobs are executing, that threshold might need to be increased to 20 million.

But what if the number of pending rows was within the allowable threshold but did not decrease after 20 minutes? Or after 40 minutes? Every shop will likely want to tweak this kind of monitoring, but you’ll probably have to roll your own and execute it across all environments where replication is deployed, and take appropriate action.

Sometimes, everything can look fine – no latency – but replication is “down”, because the log reader is down. My point is there are many things that can contribute to a replication failure.

Kendra has an excellent post on monitoring replication here.

Be forewarned that any script that literally counts the number of undistributed commands can be very slow in doing so. Imagine an environment where several publications have millions of rows pending in the distribution database – it would probably take longer to count them all than you have as a threshold for monitoring latency.

That's why when I had to solve this problem, I cycled through each publication, stored the output of sp_replmonitorpendingcmds in a table, and then alerted if required.
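Here's a stripped-down sketch of that approach, assuming the distribution database has the default name; the permanent logging table and the thresholds are things you'd flesh out for your own environment:

USE distribution;

CREATE TABLE #pending (pendingcmdcount INT, estimatedprocesstime INT);

DECLARE @publisher sysname, @publisher_db sysname, @publication sysname,
        @subscriber sysname, @subscriber_db sysname, @subscription_type INT;

DECLARE sub_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT pubsrv.name, s.publisher_db, p.publication, subsrv.name, s.subscriber_db, s.subscription_type
    FROM dbo.MSsubscriptions AS s
    INNER JOIN dbo.MSpublications  AS p      ON p.publication_id = s.publication_id
    INNER JOIN master.sys.servers  AS pubsrv ON pubsrv.server_id = s.publisher_id
    INNER JOIN master.sys.servers  AS subsrv ON subsrv.server_id = s.subscriber_id
    WHERE s.subscriber_id >= 0;   -- exclude virtual subscriptions

OPEN sub_cur;
FETCH NEXT FROM sub_cur INTO @publisher, @publisher_db, @publication, @subscriber, @subscriber_db, @subscription_type;
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT #pending (pendingcmdcount, estimatedprocesstime)
    EXEC dbo.sp_replmonitorpendingcmds
         @publisher = @publisher, @publisher_db = @publisher_db, @publication = @publication,
         @subscriber = @subscriber, @subscriber_db = @subscriber_db, @subscription_type = @subscription_type;

    -- persist the results to a permanent table with a timestamp, and alert if the count
    -- exceeds the threshold you've defined for this time of day (not shown here)
    FETCH NEXT FROM sub_cur INTO @publisher, @publisher_db, @publication, @subscriber, @subscriber_db, @subscription_type;
END
CLOSE sub_cur;
DEALLOCATE sub_cur;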

Gotcha #14: Immediate_sync is misunderstood

Most DBAs don't really understand what this setting is used for, or that the out-of-the-box settings for it can cause you to have to… reinitialize.

If a subscription is set to sync immediately (immediate_sync = 1, which is the default), the data is kept in the distribution database for a specific period of time (72 hours by default). This allows new subscribers to be synced “immediately”, without having to create a new snapshot. But if a subscriber goes offline for too long, you will have to…..reinitialize.

In my current role, I set immediate_sync to 0 for all subscriptions. I’ve only got 2 subscribers, and I’m ok with having to generate a new snapshot for a new subscriber (which never happens). 

But I don’t want SQL server telling me when I have to reinitialize.
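Changing it is a one-liner per publication – well, two, because allow_anonymous depends on immediate_sync and has to be turned off first. A sketch, with a hypothetical publication name:

EXEC sp_changepublication
     @publication = N'MyPub'
    ,@property    = N'allow_anonymous'
    ,@value       = N'false';

EXEC sp_changepublication
     @publication = N'MyPub'
    ,@property    = N'immediate_sync'
    ,@value       = N'false';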

Mohammed Moinudheen wrote an excellent post here about retention periods for replication, and another excellent post here about immediate_sync.

Gotcha #15: Replication takes precedence over AGs, by default

If you replicate a subset of data from a database that belongs to an asynchronous AG, you should know that the default behavior is that the Log Reader won’t harvest transactions to send to subscribers that have not yet been sent to the AG secondary replicas.

That means your production database transaction log can’t be cleared of these transactions, and will continue to grow, potentially affecting your production environment. You’ll need Trace Flag 1448 to solve this, and luckily it takes effect immediately.
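Enabling it is straightforward – turn it on globally right away, and add -T1448 as a startup parameter so it survives a restart:

-- takes effect immediately, for all sessions
DBCC TRACEON (1448, -1);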

Andy Mallon wrote an excellent post on TF 1448 here.

TF 1448 only applies to async replicas. If you’re running synchronous replicas, the default behavior still applies – subscribers don’t receive new data until the sync replicas have been hardened.

More “gotchas”

Undocumented “gotchas” of transactional replication by Robert Davis can be found here.

I’m sure there are tons more gotchas for transactional replication, but these are the most glaring I’ve tripped across.

In-Memory OLTP: The moving target that is RTO

Anyone who has worked with or studied In-Memory OLTP should know that upon database restart (which occurs for a variety of reasons), data for durable memory-optimized databases must be streamed from checkpoint file pairs that reside on disk (CFPs), into memory.

During recovery, data files are read in parallel, using as many threads as there are logical CPUs. If all CFPs reside on a single volume, the speed of recovery is limited by how much IO is available from just that single volume. That’s why RTO is affected by not only how many containers exist for a database, but how many volumes those containers are spread across.

That means that other than keeping the amount of data stored in durable memory-optimized tables to a minimum, the only option you have for reducing the time it takes to stream data into memory is to spread your containers across multiple volumes.

Let's say you've got 5 containers, and your RTO is 15 minutes. You've tested recovery with your current data set, and the actual time required to restart your database is 5 minutes. Everything is ok – you've got 10 minutes to spare.

Over time, the amount of memory consumed by your durable memory-optimized tables – and the required storage footprint for CFPs – continue to grow. But you’re ok with all of that, because hey – memory is cheap, and so is storage, so you’ll just make sure that you have enough of both to continue doing business.

Now fast forward to some point in the future, and the size of your durable data set has quadrupled. You can’t get application owners to agree to migrate data from memory-optimized tables to disk, because of who-knows-what, or perhaps it’s because of some of the reasons mentioned here.

And now, due to the continued growth of your durable memory-optimized data, you can’t make your RTO. How will you solve this problem?

If you only create additional containers, the In-Memory OLTP engine will be able to use them for writing new data. But that won’t get you where you want to be, because the existing CFPs don’t get redistributed among all existing containers – the new containers get their own CFPs.

The moment you add the new containers, 100% of your existing data still resides in the 5 original containers. Even though you theoretically have enough containers to make your RTO (the original 5 containers plus 5 new ones), it’s still blown, because the pre-existing CFPs have not been redistributed across all containers (10 containers total).

The solution is to:

  • Create new containers on the pre-existing volumes
  • Create new containers on the new volumes
  • Drop the old containers

Dropping the old containers will cause the existing CFPs to be redistributed among all new containers (10).

Now we have a way to redistribute pre-existing CFPs across new containers/volumes so that you can make your RTO. And in case you were wondering, Microsoft says that redistribution of CFPs can be performed while your database is online (obviously you should test this entire process in a non-production environment to see if it works as promised).

How much durable data can you store in a SQL Server database? For SQL 2016, you are limited only by what the operating system can handle. With Windows Server 2012R2, the maximum memory is 4TB, but Windows Server 2016 supports 24TB of memory. That could mean you got a whole lotta CFP redistribution goin’ on…..

UPDATE [2017-05-03]

The Microsoft documentation seems to directly contradict what I’ve written in this blog post, but I can assure you that what I wrote is correct.

Here’s a script to prove it:

USE [master]
GO
 
DROP DATABASE IF EXISTS [InMemContainerTest]
 
CREATE DATABASE [InMemContainerTest] ON PRIMARY 
(
    NAME = N'InMemContainerTest'
   ,FILENAME = N'S:\SQLDATA\InMemContainerTest.mdf' 
   ,SIZE = 100MB 
   ,MAXSIZE = UNLIMITED
   ,FILEGROWTH = 100MB
)
LOG ON 
(
    NAME = N'InMemContainerTest_log'
   ,FILENAME = N'S:\SQLDATA\InMemContainerTest.ldf' 
   ,SIZE = 100MB
   ,MAXSIZE = 2048GB
   ,FILEGROWTH = 100MB
)
GO
 
ALTER DATABASE [InMemContainerTest]
ADD FILEGROUP InMemContainerTestFG
CONTAINS MEMORY_OPTIMIZED_DATA;
GO
ALTER DATABASE [InMemContainerTest]
ADD FILE
(
    NAME = 'InMemContainerTestFile'
   ,FILENAME = 'S:\SQLDATA\InMemContainerTestFile'
)
TO FILEGROUP InMemContainerTestFG;
GO
ALTER DATABASE [InMemContainerTest] SET COMPATIBILITY_LEVEL = 130;
GO
 
USE [InMemContainerTest]
GO
DROP TABLE IF EXISTS [dbo].[InMemT1]
GO
CREATE TABLE [dbo].[InMemT1]
(
    [OrderId]      [INT]           IDENTITY(1, 1) NOT NULL
   ,[StoreID]      [INT]           NOT NULL
   ,[CustomerID]   [INT]           NOT NULL
   ,[OrderDate]    [DATETIME]      NOT NULL
   ,[DeliveryDate] [DATETIME]      NULL
   ,[Amount]       [FLOAT]         NULL
   ,[Discount]     [FLOAT]         NULL
   ,[DiscountCode] [VARCHAR](25)   NULL
   ,INDEX [IndOrders_OrderID] NONCLUSTERED ([OrderId] ASC)
   ,INDEX [IndOrders_StoreID] NONCLUSTERED ([StoreID] ASC)
   ,INDEX [IndOrders_CustomerID] NONCLUSTERED ([CustomerID] ASC)
   ,CONSTRAINT [PK_InMemT1_OrderID] PRIMARY KEY NONCLUSTERED ([OrderId] ASC)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO
 
/*
	add data
*/
SET NOCOUNT ON
GO
INSERT dbo.InMemT1
SELECT 1 AS [StoreID]     
      ,2 AS [CustomerID]  
      ,GETDATE() AS [OrderDate]   
      ,GETDATE() AS [DeliveryDate]
      ,1.11 AS [Amount]      
      ,0.10 AS [Discount]    
      ,'Holiday1' AS [DiscountCode]
GO 200000
 
 
/*
	verify CFP folder
	dir S:\SQLData\InMemContainerTestFile\$HKv2
*/
 
 
/*
	add two new containers
*/
 
ALTER DATABASE [InMemContainerTest]
ADD FILE
(
    NAME = 'InMemContainerNew1'
   ,FILENAME = 'S:\SQLDATA\InMemContainerNew1'
)
TO FILEGROUP InMemContainerTestFG;
GO
ALTER DATABASE [InMemContainerTest]
ADD FILE
(
    NAME = 'InMemContainerNew2'
   ,FILENAME = 'S:\SQLDATA\InMemContainerNew2'
)
TO FILEGROUP InMemContainerTestFG;
GO
 
/*
	verify contents of NEW CFP folders
	dir S:\SQLData\InMemContainerNew1\$HKv2
	dir S:\SQLData\InMemContainerNew2\$HKv2
 
	on my server, these folders are empty
*/
 
ALTER DATABASE [InMemContainerTest] REMOVE FILE InMemContainerTestFile;
 
/*
	verify contents of NEW CFP folders
	dir S:\SQLData\InMemContainerNew1\$HKv2
	dir S:\SQLData\InMemContainerNew2\$HKv2
 
	InMemContainerTestFile has been deleted, and the CFPs have been 
	redistributed to InMemContainerNew1 and InMemContainerNew2
*/

In-Memory OLTP data/delta file corruption: “Now trappable”

4 April 2017

(This post used to be called In-Memory OLTP data/delta file corruption: “The Untrappable”, but as you’ll see in a moment, things have changed)

There’s a lot of confusion out there about SQL Server’s In-Memory OLTP feature.

You can run CHECKDB on a database that contains memory-optimized tables, but all memory-optimized tables will be ignored, and running CHECKTABLE against memory-optimized tables will fail with an “unsupported” error. As a result, some shops might not consider deploying this feature.

While it's not possible to recover from data/delta file corruption, you can still detect it. As I blogged a while ago in this post, a checksum is calculated for every block written to data/delta files, and those checksums are recalculated any time a block is read – which occurs during restore, backup, and any other operation that reads data/delta files.

As Brent Ozar blogged in this post, you can execute a copy-only backup of the memory-optimized filegroup to DISK = ‘nul’ to force recalculation of all checksums, which are then compared to the values stored with the blocks. If there are no mismatches between the newly calculated and stored checksum values, your durable memory-optimized data is corruption free.
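A minimal sketch of that verification backup, assuming a database named YourDb with a memory-optimized filegroup named YourDb_InMemFG (both hypothetical):

BACKUP DATABASE [YourDb]
    FILEGROUP = N'YourDb_InMemFG'   -- the memory-optimized filegroup
TO DISK = N'nul'
WITH COPY_ONLY;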

Let's say you execute that copy-only backup of your memory-optimized filegroup each night – if there is corruption, what conditions can trigger an alert, so that you can be notified?

As of SQL 2016/SP1, Microsoft has fixed the various code paths that access data/delta files, so that they produce trappable errors upon checksum failure:

  • Regular processing, i.e. file merge: Severity 21, Error 41355
  • Backup/Restore: Severity 16, Error 41316
  • Database recovery: Severity 21, Error 3414
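Since these errors are now trappable, you can wire up SQL Agent alerts for them. A sketch, with a hypothetical operator name; note that message-based alerts only fire for errors written to the Windows event log, so the sev16 error 41316 may need to be altered to always be logged:

-- may be required so that the sev16 backup/restore error reaches the event log
EXEC sp_altermessage 41316, 'WITH_LOG', 'true';

EXEC msdb.dbo.sp_add_alert
     @name       = N'In-Memory OLTP checksum failure (41355)'
    ,@message_id = 41355
    ,@severity   = 0;
EXEC msdb.dbo.sp_add_notification
     @alert_name = N'In-Memory OLTP checksum failure (41355)'
    ,@operator_name = N'DBA Team'    -- hypothetical operator
    ,@notification_method = 1;       -- email

EXEC msdb.dbo.sp_add_alert
     @name       = N'In-Memory OLTP checksum failure (41316)'
    ,@message_id = 41316
    ,@severity   = 0;
EXEC msdb.dbo.sp_add_notification
     @alert_name = N'In-Memory OLTP checksum failure (41316)'
    ,@operator_name = N'DBA Team'
    ,@notification_method = 1;

-- a standard severity 21 alert will catch error 3414 raised during database recovery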

(The sordid details of my attempts to reproduce data/delta file corruption before SQL 2016/SP1 are listed below).

17 August 2016

There’s a lot of confusion out there about SQL Server’s In-Memory OLTP feature.

If there is corruption, what mechanism can you use to be alerted?

I had been in touch with Microsoft about this type of corruption, and they stated that it would be logged in the SQL errorlog as Severity 21. Of course you can create an alert on Severity 21 errors, but I wanted to find a way to determine that it’s specifically related to data/delta file corruption.

How would you go about reproducing this type of corruption for durable memory-optimized tables?

About a year ago I sent an email to corruption guru Paul Randal, asking if he had experimented with this type of corruption for durable memory-optimized data, and at least at that point he had not. So I set out to roll my own corruption repro, and so far the results are not what I expected.

I created a single durable memory-optimized table, and added one row. Then I ran CHECKPOINT to close the data file, and attempted to open each of the data files with a hex editor. If I tried to open one of the files that had been written to, I received a “file in use” error, so I set the database OFFLINE, and overwrote some of the data in the formerly “in use” file with 00.

Next, I attempted to ONLINE the database, and received the following error:

Msg 41316, Level 16, State 0, Line 51
Restore operation failed for database ‘HKCorruption’ with internal error code ‘0x8800000e’.
Msg 5181, Level 16, State 5, Line 52
Could not restart database “HKCorruption”. Reverting to the previous status.
Msg 5069, Level 16, State 1, Line 52
ALTER DATABASE statement failed.

I checked the SQL errorlog, and there was no Severity 21 listed. That’s where it stands for now – unfortunately I’ve not been able to reproduce pseudo storage corruption that affects data/delta files.

I’ve contacted Microsoft, informing them of the details of my testing, and I’ll update this post if/when I hear back from them.

Update 23 August 2016

Today I heard back from Microsoft. Turns out I had actually been able to reproduce corruption in the data/delta files. Look carefully at the errors from the SQL errorlog that I posted above. See that ‘0x8800000e’ ? It’s the only indication that there was a checksum failure. To be clear, this is what does and does not happen when there is a checksum failure found in data/delta files:

1. a value of 0x8800000e is written to the SQL errorlog
2. no severity is written to the SQL errorlog
3. no standardized error ID is written to the SQL errorlog
4. no text indicating corruption is written to the SQL errorlog

There are many problems with this situation, the first one being that there is no way to trap the corruption error with an alert. If there was a Severity associated with the error, we could create an alert, and receive some type of notification when the corruption occurs.

It's bad enough that CHECKDB/CHECKTABLE ignores memory-optimized tables. If you force checksums to be calculated by backing up the memory-optimized filegroup to disk = ‘nul’ in order to determine that there are no checksum errors, you will have to scan the SQL errorlog for ‘0x8800000e‘ after every memory-optimized filegroup backup.

This would seem to be a somewhat radical departure from standard ways to be informed of corruption (and other SQL Server errors in general).

Who could blame potential adopters of In-Memory OLTP for thinking that it's not ready for prime time (and in this regard, it definitely isn't)? What could be more important than knowing your data is corruption free, and being alerted immediately if corruption occurs?

The present state of corruption detection and notification will do little to change the minds of those hesitant to adopt In-Memory OLTP. If Microsoft wants people to consider using In-Memory OLTP, they need to fix this issue immediately.

I have created this connect item about the issues described in this post (thanks for upvoting!)

Update 24 August 2016

Microsoft followed up with me again today, and said the following:

  • If the checkpointing process detects a checksum failure during regular processing, for example during a file merge, a sev21, error 41355 is written to the SQL error log
  • If there is a checksum failure during backup or restore, a sev16 error is written to the SQL error log, which is the same as what SQL Server does for checksum failures in mdf/ndf or ldf files
  • The team is looking at the DB startup code path to raise a sev21 error

That’s much better news than what I thought (what was originally explained to me).

Hopefully Microsoft will be able to fix the DB startup code path issue as part of a CU or SP1 (which in recent history would take about a year from the RTM release date).

Catch a Cluster by its Tail

I’ve been fascinated with SQL Server clustering for at least 15 years. It has matured considerably since the “Wolfpack” days back in 2000, when I sat next to the resident clustering guru at the contracting client I had at that time. He explained the basics to me, and I’m sure I had that “deer in the headlights” look. As a DBA, I had absolutely no interest in storage, networking, DNS, or Active Directory. I simply wanted to expand my SQL DBA skills in a vacuum. Besides, the initial MS implementation of clustering was not at all robust.

But as the years passed, I could see that the world of clustering/high availability was catching on, so I decided to learn more about it, and I let go of my irrational lack of desire to learn things not directly connected to SQL Server. I set up clusters in my lab dozens of times, and came to see them as a sort of gigantic puzzle – one with many inputs and variables, and one that could be difficult to troubleshoot. Eventually Microsoft released SQL 2012, which included Availability Groups, whose foundation is Windows Server Failover Clustering. I knew my understanding of clustering needed improvement, and so I signed up for an in-person class. There were only five other students in the class, so we each received a lot of attention from the instructor, who was definitely super-knowledgeable. In some ways, there is nothing like being in the same room with a technologist who has that type of experience, and the ability to ask questions and hear the questions that others ask is invaluable.

However, the costs for this class were not insignificant. The course fee was $2,395, hotel was $840, and I missed 4 days of work, for which I was not paid (I’m a contractor/consultant). I considered it an investment in my career, and didn’t give it a second thought. After the training, and following up with the materials that were given in class, my understanding and skills were improved. But four days wasn’t enough for me, and so I began to seek another way of taking my clustering skills to the next level, desiring to have a much deeper understanding of both Windows Server Failover Clustering (WSFC) and SQL Failover Cluster Instances (FCI).

“Timing is everything”, as they say, and I was thrilled to discover that SQL Server MCM and Data Platform MVP Edwin Sarmiento (b | t) had just completed the Herculean effort of creating an online course of study entitled  “Windows Server Failover Clustering for the Smart SQL Server DBA”. I reviewed the course outline, and saw many things that I already knew well, but also many that I needed to improve in my skill set. I liked that you could purchase only the modules that you needed.

Here’s the course outline:

  • Introduction to Microsoft® High Availability Technologies
  • Windows Server Failover Clustering (WSFC) Fundamentals
  • Planning and Installing a Windows Server Failover Cluster (WSFC)
  • Deep Dive on Windows Server Failover Cluster Quorum
  • Windows Server Failover Cluster (WSFC) Configuration
  • Planning and Installing SQL Server Failover Clustered Instance
  • Configuring SQL Server Failover Clustered Instances
  • Managing SQL Server Failover Clustered Instances

The course is described as “advanced” and “deep-dive”, and that’s definitely true, but it starts at the very beginning, and makes no assumptions about the skill level of the viewer with regard to WSFC or FCIs.

When it comes to learning, it’s often said that “repetition is good”. That’s one of the benefits that online training has versus in-person training – you can review it over and over again, and really let it sink in.

You can purchase individual modules or the entire course, and the pricing is extremely reasonable. The course can be viewed at a time and place of your choosing, and you can view modules an unlimited number of times. 

“Windows Server Failover Clustering for the Smart SQL Server DBA” truly expanded my mind about Windows Server Failover Clustering and FCIs, and Edwin always responded to the dozens of questions I had. His course is a fantastic resource, and I highly recommend it to anyone seeking to up their game in the vast and complex world of clustering.

The course is located here: https://learnsqlserverhadr.com

SQL Server log shipping within the AWS Cloud

Much of what you see in the blogosphere pertaining to log shipping and AWS references an on-premises server as part of the topology. I searched far and wide for any information about how to setup log shipping between AWS VMs, but found very little. However, I have a client that does business solely within AWS, and needed a solution for HA/DR that did not include on-premises servers.

Due to network latency issues and disaster recovery requirements (the log shipping secondary server must reside in a separate AWS region), it was decided to have the Primary server push transaction logs to S3, and the Secondary server pull from S3. On the Primary, log shipping would occur as usual, backing up to a local share, with a separate SQL Agent job responsible for copying the transaction log backups to S3. Amazon provides the necessary Powershell functionality in the AWS Tools for Windows Powershell, which can be downloaded here. One could argue that Amazon RDS might solve some of the HA/DR issues this client faced, but it was deemed too restrictive.


S3 quirks

When files are written to S3, the date and time the file was last modified is not retained, so when the Secondary server polls S3 for files to copy, it cannot rely on the date/time from S3. It is also not possible to set the LastModified value on S3 files. Instead, a list of S3 file names must be generated and compared to the files that reside on the Secondary; if an S3 file does not exist locally, it must be copied.

Credentials – AWS Authentication

AWS supports different methods of authentication:

  1. IAM roles (details here)
  2. profiles (details here)

From an administrative perspective, I don't have and don't want access to the client's AWS administrative console. Additionally, I needed a solution that I could easily test and modify without involving the client. For this reason, I chose an authentication solution based on AWS profiles that are stored within the Windows environment, for a specific Windows account (in case you're wondering, the profiles are encrypted).

Windows setup

  • create a Windows user named SQLAgentCmdProxy
  • create a password for the SQLAgentCmdProxy account (you will need this later)

The SQLAgentCmdProxy Windows account will be used as a proxy for SQL Agent job steps, which will execute Powershell scripts. (NOTE: if you change the drive letters and/or folder names, you will need to update the scripts in this post)

from a cmd prompt, execute the following:

Powershell setup

(The scripts in this blog post should be run on the Secondary log shipping server, but with very little effort, they can be modified to run on the Primary and push transaction log backups to S3.)

The following scripts assume you already have an S3 bucket that contains one or more transaction log files that you want to copy to the Secondary server (they must have the extension “trn”, otherwise you will need to change -Match “trn” in the script below). Change the bucket name to match your bucket, and if required, also change the name of the region. Depending on the security configuration for your server, you may also need to execute “Set-ExecutionPolicy RemoteSigned” in a Powershell prompt as a Windows Administrator, prior to executing any Powershell scripts.

After installing AWS Tools for Windows Powershell, create a new Powershell script with the following commands

Be sure to fill in your AccessKey and SecretKey values in the script above, then save the script as C:\Powershell\Setup.ps1. When this script is executed, it will establish an AWS environment based on the proxy for the SQL Agent job step.

The next step is to create a new Powershell script with the following commands:

Again you should substitute your bucket and region names in the script above. Note that after the files are copied to the Secondary, the LastModifiedTime is updated based on the file name (log shipping uses the UTC format when naming transaction log backups). Save the Powershell script as C:\powershell\CopyS3TRNToLocal.ps1

SQL Server setup

  • create a login for the SQLAgentCmdProxy Windows account (for our purposes, we will make this account a member of the sysadmin role, but you should not do that in your production environment)
  • create a credential named TlogCopyFromS3Credential, mapped to SQLAgentCmdProxy (you will need the password for SQLAgentCmdProxy in order to accomplish this)
  • create a SQL Agent job
  • create a job step, Type: Operating System (CmdExec), Runas: TlogCopyFromS3Credential

Script for the above steps
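A sketch of what such a script might look like (the job name and password placeholder are illustrative; as noted above, sysadmin is only for demo purposes):

USE [master];
CREATE LOGIN [<DomainName>\SQLAgentCmdProxy] FROM WINDOWS;
ALTER SERVER ROLE [sysadmin] ADD MEMBER [<DomainName>\SQLAgentCmdProxy];   -- demo only

-- credential that stores the Windows account and its password
CREATE CREDENTIAL TlogCopyFromS3Credential
    WITH IDENTITY = N'<DomainName>\SQLAgentCmdProxy', SECRET = N'<password>';

-- SQL Agent proxy built on that credential, granted to the CmdExec subsystem
EXEC msdb.dbo.sp_add_proxy
     @proxy_name = N'TlogCopyFromS3Credential', @credential_name = N'TlogCopyFromS3Credential';
EXEC msdb.dbo.sp_grant_proxy_to_subsystem
     @proxy_name = N'TlogCopyFromS3Credential', @subsystem_name = N'CmdExec';

-- job with a single CmdExec step that runs the Powershell script under the proxy
EXEC msdb.dbo.sp_add_job @job_name = N'LogShipping - copy trn from S3';   -- hypothetical name
EXEC msdb.dbo.sp_add_jobstep
     @job_name   = N'LogShipping - copy trn from S3'
    ,@step_name  = N'Run Powershell script'
    ,@subsystem  = N'CmdExec'
    ,@command    = N'powershell.exe -ExecutionPolicy RemoteSigned -File C:\Powershell\Setup.ps1'
    ,@proxy_name = N'TlogCopyFromS3Credential';
EXEC msdb.dbo.sp_add_jobserver @job_name = N'LogShipping - copy trn from S3';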

  • Change references to <DomainName> to be your domain or local server name, and save the script
  • Execute the job
  • Open the job and navigate to the job step. In the Command window, change the name of the Powershell script from Setup.ps1 to CopyS3TRNToLocal.ps1
  • Execute the job
  • Verify the contents of the C:\Backups\logs folder – you should now see the file(s) from your S3 bucket

Troubleshooting credentials

If you see errors for the job that resemble this:

InitializeDefaultsCmdletGet-S3Object : No credentials specified or obtained from persisted/shell defaults.

then recheck the AccessKey and SecretKey values that you set in the Setup.ps1 script. If you find errors in either of those keys, you'll need to rerun the Setup.ps1 file (change the name of the file to be executed in the SQL Agent job, and re-run the job). If you don't find any errors in the AccessKey or SecretKey values, you might have luck with creating the AWS profile for the proxy account manually (my results with this approach have been mixed). Since profiles are specific to a Windows user, we can use runas /user:SQLAgentCmdProxy powershell_ise.exe to launch the Powershell ISE, and then execute the code from Setup.ps1.

You can verify that the Powershell environment uses the SQL proxy account by temporarily adding $env:USERNAME to the script.

S3 Maintenance

When you setup log shipping on the Primary or Secondary, you can specify the retention period, but S3 file maintenance needs to be a bit more hands on. The following script handles purging local and S3 files with the extension “trn” that are more than 30 days old, based on UTC file name.

Save the script, and create a SQL Agent job to execute it. You’ll also have to reference the proxy account as in the prior SQL Agent job.

Don’t forget

If you use log shipping between AWS VMs as outlined in this post, you will need to disable/delete the SQL Agent copy jobs on the Primary and Secondary servers.

Disaster Recovery

All log shipping described here occurs within the AWS cloud. An alternative would be to ship transaction logs to a separate storage service (that does not use S3), or a completely separate cloud. At the time of this writing, this blog post by David Bermingham clearly describes many of the issues and resources associated with HA/DR in AWS.

“Hope is not a strategy”

HA/DR strategies require careful planning and thorough testing. In order to save money, some AWS users may be tempted to create a Secondary instance with small memory and CPU requirements, hoping to be able to resize the Secondary when failover is required. For patching, the “resize it when we need it” approach might work, but for Disaster Recovery it can be fatal. Be forewarned that Amazon does not guarantee the ability to start an instance of a specific size, in a specific availability zone/region, unless the instance is reserved. If the us-east region has just gone down, everyone with Disaster Recovery instances in other AWS regions will attempt to launch them. As a result, it is likely that some of those who are desperately trying to resize and then launch their unreserved Disaster Recovery instances in the new region will receive the dreaded “InsufficientInstanceCapacity” error message from AWS. Even in my limited testing for this blog post, I encountered this error after resizing a t1.micro instance to r2.xlarge and attempting to start the instance (the error persisted for at least 30 minutes, but the web is full of stories of people waiting multiple hours). You could try to launch a different size EC2 instance, but there is no guarantee you will have success (more details on InstanceCapacity can be found here).

The bottom line is that if you run a DR instance that is not reserved, at the precise moment you require more capacity it may be unavailable. That’s not the type of hassle you want when you’re in the middle of recovering from a disaster.

I am indebted to Mike Fal (b) for reviewing this post.