In-Memory OLTP data/delta file corruption: “Now trappable”

4 April 2017

(This post used to be called In-Memory OLTP data/delta file corruption: “The Untrappable”, but as you’ll see in a moment, things have changed)

There’s a lot of confusion out there about SQL Server’s In-Memory OLTP feature.

You can run CHECKDB on a database that contains memory-optimized tables, but all memory-optimized tables will be ignored. And running CHECKTABLE against memory-optimized tables will fail with an “unsupported” error. As a result, some shops might not consider deploying this feature. While it’s not possible to recover from data/delta file corruption, you can still detect corruption. As I blogged a while ago in this post, a checksum is calculated for every block written to data/delta files, and those checksums are recalculated any time the block is read. That occurs during restore, backup, and any other operation that reads data/delta files. As Brent Ozar blogged in this post, you can execute a copy-only backup of the memory-optimized filegroup to DISK = ‘nul’, to force recalculation of all checksums, which will in turn compare them to the values stored with the blocks. If there are no mismatches between the newly calculated and stored checksum values, your durable memory-optimized data is corruption free.

Let’s say you execute that copy-only backup for your memory-optimized filegroup each night – if there is corruption, which conditions can trigger alerts, so that you can be notified?

As of SQL 2016/SP1, Microsoft has fixed the various code paths that access data/delta files, so that they produce trappable errors upon checksum failure:

  • Regular processing, i.e. file merge: Severity 21, Error 41355
  • Backup/Restore: Severity 16, Error 41316
  • Database recovery: Severity 21, Error 3414

(The sordid details of my attempts to reproduce data/delta file corruption before SQL 2016/SP1 are listed below).

17 August 2016

There’s a lot of confusion out there about SQL Server’s In-Memory OLTP feature.

If there is corruption, what mechanism can you use to be alerted?

I had been in touch with Microsoft about this type of corruption, and they stated that it would be logged in the SQL errorlog as Severity 21. Of course you can create an alert on Severity 21 errors, but I wanted to find a way to determine that it’s specifically related to data/delta file corruption.

How would you go about reproducing this type of corruption for durable memory-optimized tables?

About a year ago I sent an email to corruption guru Paul Randal, asking if he had experimented with this type of corruption for durable memory-optimized data, and at least at that point he had not. So I set out to roll my own corruption repro, and so far the results are not what I expected.

I created a single durable memory-optimized table, and added one row. Then I ran CHECKPOINT to close the data file, and used a hex editor, attempting to open each of the data files. If I tried to open one of the files that had been written to, I received a “file in use” error, so I set the database OFFLINE, and overwrote some of the data in the formerly “in use” file with 00.

Next, I attempted to ONLINE the database, and received the following error:

Msg 41316, Level 16, State 0, Line 51
Restore operation failed for database ‘HKCorruption’ with internal error code ‘0x8800000e’.
Msg 5181, Level 16, State 5, Line 52
Could not restart database “HKCorruption”. Reverting to the previous status.
Msg 5069, Level 16, State 1, Line 52
ALTER DATABASE statement failed.

I checked the SQL errorlog, and there was no Severity 21 listed. That’s where it stands for now – unfortunately I’ve not been able to reproduce pseudo storage corruption that affects data/delta files.

I’ve contacted Microsoft, informing them of the details of my testing, and I’ll update this post if/when I hear back from them.

Update 23 August 2016

Today I heard back from Microsoft. Turns out I had actually been able to reproduce corruption in the data/delta files. Look carefully at the errors from the SQL errorlog that I posted above. See that ‘0x8800000e’ ? It’s the only indication that there was a checksum failure. To be clear, this is what does and does not happen when there is a checksum failure found in data/delta files:

1. a value of 0x8800000e is written to the SQL errorlog
2. no severity is written to the SQL errorlog
3. no standardized error ID is written to the SQL errorlog
4. no text indicating corruption is written to the SQL errorlog

There are many problems with this situation, the first one being that there is no way to trap the corruption error with an alert. If there was a Severity associated with the error, we could create an alert, and receive some type of notification when the corruption occurs.

It’s bad enough that CHECKDB/CHECKTABLE ignores memory-optimized tables. If we force checksums to be calculated by backing up the memory-optimized filegroup to disk = ‘nul’, in order to determine that there are no checksum errors, you will have to scan the SQL errlog for ‘0x8800000e‘ after every memory-optimized filegroup backup.

This would seem to be a somewhat radical departure from standard ways to be informed of corruption (and other SQL Server errors in general).

Who could blame potential adopters of In-Memory OLTP for thinking that it’s not ready for prime time (and in this regard it’s definitely not). What could be more important than knowing your data is corruption free, and being alerted immediately if corruption occurs?

The present state of corruption detection and notification will do little to change the minds of those hesitant to adopt In-Memory OLTP. If Microsoft wants people to consider using In-Memory OLTP, they need to fix this issue immediately.

I have created this connect item about the issues described in this post (thanks for upvoting!)

Update 24 August 2016

Microsoft followed up with me again today, and said the following:

  • If the checkpointing process detects a checksum failure during regular processing, for example during a file merge, a sev21, error 41355 is written to SQL the error log
  • If there is a checksum failure during backup or restore, a sev16 error is written to the SQL error log, which is the same as what SQL Server does for checksum failures in mdf/ndf or ldf files
  • The team is looking at the DB startup code path to raise a sev21 error

That’s much better news than what I thought (what was originally explained to me).

Hopefully Microsoft will be able to fix the DB startup code path issue as part of a CU or SP1 (which in recent history would take about a year from the RTM release date).

Leave a Reply

Your email address will not be published. Required fields are marked *