I’ve dealt with SQL replication for decades, and in a sense, not a lot has changed. I mean this from a basic configuration and troubleshooting perspective, though it has in some ways been extended a bit through the years, for new SQL Server features (like In-Memory OLTP, Azure, etc.).
Many refer to replication as the the Swiss Army Knife of SQL Server, and I can understand why, but with this “extreme flexibility” comes “extreme shortcomings”, and this post will delve into some of the issues you should be aware of.
Many have blogged about SQL replication, but Kendra Little (b|t) stands out – her posts are clear and concise, and you will always learn something from her.
But despite having read all of her posts about replication, and those of many others, when I started my current role almost 3 years ago, I was woefully ignorant of many facets of the replication feature.
This post is my effort to save you some of the pain and head scratching that I’ve endured!
Gotcha #1: you want to go parallel for initial sync
Much of the pain of SQL replication occurs when tables/articles are large. Setting up subscribers to receive terabytes of data requires changing some very specific things in very specific places, in order to have a chance of going parallel.
Unfortunately, the out-of-the-box defaults for replication place you squarely in the single-threaded camp. The Snapshot Agent can be configured with a setting named maxbcpthreads that can make snapshot generation much faster by using parallel threads to generate the snapshot.
The catch?
It’s ignored unless the sync_method for the publication is native. Now you might have looked at the docs and determined that in fact you’re already getting native snapshots, because you are using the default sync_method of concurrent, but your snapshot will still be generated by a single thread with concurrent. In order to get parallel snapshot generation, you must have a sync_method of native and use the maxbcpthreads switch on the Snapshot Agent job step. You have been warned!
Gotcha #2: Agent job naming sucks
Deploying replication generates tons of SQL Agent jobs, especially if you use push subscriptions, and you have the distribution database on the publisher. If you have long publication names (or even if you don’t), when it comes time to look for a specific job, it can be difficult to discern which ones are for generating snapshots, which are for distribution, etc. That’s why I always rename my replication jobs according to function/category. I append “_Snapshot” or “_Distribution” to the relevant jobs, using a script.
Doing this might make it impossible from other places in the UI to bring up the proper job when drilling down in Replication Monitor, but I don’t use the monitor for that – it’s a sacrifice I’m willing to make.
Gotcha #3: Password rotation can break replication
Most companies have a password rotation policy for domain users, as well as domain service accounts. If you use a domain account for anything having to do with replication agent security, obviously the SQL Server environment will have no knowledge of this password rotation. So while using domain accounts is sometimes unavoidable when using replication, the runbook you use for password rotation should include this often-overlooked area.
Gotcha #4: Replication cleanup issues
The default batch size for replication cleanup is 5,000 records. If you are in charge of a very busy system, the replication cleanup process will never catch up, and your distribution database will continue to grow. Depending on where you have the distribution database located, this could affect your production databases, because the drive can run out of space.
The cleanup job runs often by default, and so if you change the batch size to something larger, it might cause blocking in your system.
A different approach – if you have the storage to handle it – is to schedule the cleanup job so that it doesn’t run at all during the week, and use a large batch size when it runs on the weekend (assuming you have a maintenance window to actually do this type of work on weekends).
Gotcha #5: Not likely to get compression for your snapshots
The rules around getting compression for snapshots are baffling and mysterious. There is some “2GB per file” rule that I can never seem to find a way to have properly observed, and so I’ve never been successful at creating compressed snapshots. Mind you, generating compressed snapshots is not likely to be any faster than uncompressed snapshots, and in fact is likely to be slower. But if you have not-so-great bandwidth, then it would be great to not send all that data across the wire.
The documentation for snapshot compression is located here (search for “Compressed snapshots”).
Gotcha #6: Initializing from a backup will probably not float your boat
Now before you comment below and tell me that you’ve read the paper written by Robert Davis and Kenneth Fischer about initializing from a filegroup backup, most databases in the wild have tons of user tables on the PRIMARY filegroup, and so are not really candidates for what’s described in that paper.
The main issue is that you can’t really recover just a slice (or multiple slices) of your database to initialize a subscriber. You have to install PRIMARY plus the filegroups that you’re interested in, and that’s too large 99% of the time. Or you restore your entire database backup, then delete what you don’t want. But you’ll have to disable/remove Foreign Keys, etc.
Other than for demos and/or really small databases, I don’t see how this path will work.
Gotcha #7: No easy/good way to filter horizontally if > 1 table is involved
See my thread with the late, great Robert Davis here.
Gotcha #8: Log Reader woes
The log reader will sometimes stop running, and if that happens, you won’t receive any warning/error, unless you’re polling for that. Perhaps a better way is to add Retry attempts to the job step. That will work, but you’ll have to remember to re-add that retry for each log reader when you build a new environment.
Gotcha #9: DR sucks with SQL replication
How will you handle setting up replication in a DR environment? One possibility is to automagically script the entire PRD replication environment on some type of schedule, and then modify the server names in those scripts for deployment to DR. But no matter how you handle deploying replication to DR, you will still have to initialize the new DR subscribers, and as described in Gotcha #1, that can be painful.
Gotcha #10: Rows deleted at the subscriber will break distribution
For transactional replication, I can’t think of a good reason that writes should be allowed on subscription databases. But if they are allowed, eventually someone will delete a row that appears in the distribution database, and when the change to that row is attempted at the subscriber, it will of course fail. And all of the rows behind it in the distribution database will also fail to be distributed.
You can work around this – even on a temporary basis, just to get past it – by choosing a different profile for the Distribution Agent. I believe they even have some that are canned for getting past this specific error. It will allow your distribution database to stop growing, but unless you want to leave this profile in place permanently, you’ll eventually have to reinitialize at least the offending article in the relevant publication, and switch back to the default/former profile.
Gotcha #11: Nothing sucks worse than schema changes to replicated tables
Why? Because they force you to do the following:
- remove the table from replication
- make the schema change at the publisher
- add the table back to the subscription
- reinitialize the subscriber(s) (that’s right, you’re back to Gotcha #1 if your tables are large)
If done properly, the snapshot you generate will only include the table you removed from the subscription and added back. But if that table is large, it can still be painful. Also, It’s easy to make a mistake when removing a table/article from a subscription.
Of course, using AGs removes this hassle, because schema changes will simply flow through to secondary replicas via the log blocks that are sent to them (your send and redo queues will explode for large tables…). But hey, if you could be running AGs, you’d probably never have deployed replication in the first place.
Gotcha #12: CI/CD woes
Thought your CI/CD pipeline meant “push button” deployments? Think again, because CI/CD has no idea that you have tables that are replicated. This causes DBAs to get involved with deployments due to Gotcha #11.
Gotcha #13: Monitoring for replication errors/latency
There’s nothing out-of-the-box that I’m aware of that properly handles monitoring the facets of transactional replication that most shops would be interested in.
For example – it might be fine during the day to have 10 million rows pending, but at night when ETL jobs are executing, that threshold might need to be increased to 20 million.
But what if the number of pending rows was within the allowable threshold but did not decrease after 20 minutes? Or after 40 minutes? Every shop will likely want to tweak this kind of monitoring, but you’ll probably have to roll your own and execute it across all environments where replication is deployed, and take appropriate action.
Sometimes, everything can look fine – no latency – but replication is “down”, because the log reader is down. My point is there are many things that can contribute to a replication failure.
Kendra has an excellent post on monitoring replication here.
Be forewarned that any script that literally counts the number of undistributed commands can be very slow in doing so. Imagine an environment where several publications have millions of rows pending in the distribution database – it would probably take longer to count them all than you have as a threshold for monitoring latency.
That’s why when I had to solve this problem, I cycled through each publication, and stored the out put of calling sp_replmonitorpendingcmds to a table, and then alerted if required.
Gotcha #14: Immediate_sync is misunderstood
Most DBAs don’t really understand what this setting is used for, and that out-of-the box settings for it can cause you to have to….reinitialize.
If a subscription is set to sync immediately (immediate_sync = 1, which is the default), the data is kept in the distribution database for a specific period of time (72 hours by default). This allows new subscribers to be synced “immediately”, without having to create a new snapshot. But if a subscriber goes offline for too long, you will have to…..reinitialize.
In my current role, I set immediate_sync to 0 for all subscriptions. I’ve only got 2 subscribers, and I’m ok with having to generate a new snapshot for a new subscriber (which never happens).
But I don’t want SQL server telling me when I have to reinitialize.
Mohammed Moinudheen wrote an excellent post here about retention periods for replication.
Mohammed Moinudheen also wrote an excellent post here about immediate_sync.
Gotcha #15: Replication takes precedence over AGs, by default
If you replicate a subset of data from a database that belongs to an asynchronous AG, you should know that the default behavior is that the Log Reader won’t harvest transactions to send to subscribers that have not yet been sent to the AG secondary replicas.
That means your production database transaction log can’t be cleared of these transactions, and will continue to grow, potentially affecting your production environment. You’ll need Trace Flag 1448 to solve this, and luckily it takes effect immediately.
Andy Mallon wrote an excellent post on TF 1448 here.
TF 1448 only applies to async replicas. If you’re running synchronous replicas, the default behavior still applies – subscribers don’t receive new data until the sync replicas have been hardened.
More “gotchas”
Undocumented “gotchas” of transactional replication by Robert Davis can be found here.
I’m sure there are tons more gotchas for transactional replication, but these are the most glaring I’ve tripped across.
Pingback: Learning Experiences from Transactional Replication – Curated SQL
This article highlights several of the gotchas encountered with transactional replication, but I would like to quibble with a couple of points:
1 – Not all schema changes require you to drop the article and then add it back. The gotcha here is actually more nuanced. On large tables, you do NOT want to add columns that are NOT NULL with a default value, as this will clog up replication mightily, but replication will replicate a lot of DDL commands without issue. Just read up on how replication is going to handle it before you begin.
2 – As a sub-point to item #1, you don’t have to re-initialize your whole database after dropping a single article and re-adding it. There are settings you can adjust so that just the single article will be re-initialized which can save a tremendous amount of time.
3 – Part of the gotcha with Log Reader Agent is that the job is set to effectively retry infinitely, so the job never fails and you never get notified. There are a couple ways around this: You can set up a SQL Agent alert for a failure of that agent (be sure to set a reasonable delay for the time between notifications, unless you want your phone or e-mail to become unusable if something bad happens) or you can adjust the number of retries in the job to be more reasonable (like 10 retries, once a minute), set a notification for failure, and set the job to start every 20 minutes so that it has some recoverability.
Thanks for a great article!
Hi Joe – thanks for stopping by —
You are correct that not all schema changes will require removing an article from a publication, and I’ll update the post to reflect that.
For your second point, I don’t believe I referenced having to recyn all publications (db level) if only changing a single article, not clear on why you responded about that. Perhaps you meant to reference the publication, and if so, you are correct.