Daily IT Matters, this is the place where I post my daily findings on IT.

Thursday, September 25, 2008

Exchange Server Error -1018: How Microsoft IT Recovers Damaged Exchange Databases

I found this paper on Microsoft's showcase site. I hope it is of help to you.

It sure helped me!

 

 

Technical White Paper

Published: August 1, 2005

Executive Summary

Error –1018 (JET_errReadVerifyFailure) is a familiar—and dreaded—error in Microsoft® Exchange Server. It indicates that an Exchange database file has been damaged by a failure or problem in the underlying file system or hardware.

This paper explains the conditions that result in error –1018. It also covers the detection mechanisms that Exchange uses to discover and recover from damage to its database files.

The Microsoft Information Technology group (Microsoft IT) runs one of the most extensive Exchange Server organizations in the world. Exchange administrators at Microsoft have investigated and recovered from dozens of –1018 error problems. This paper shows you how Microsoft IT monitors for this error, what happens after database file damage has been discovered, and how Microsoft recovers databases affected by the problem.

Note: For security reasons, the sample names of forests, domains, internal resources, organizations, and internally developed security file names used in this paper do not represent real resource names used within Microsoft and are for illustration purposes only.

Readers of this paper are assumed to be familiar with the basics of Exchange administration and database architecture. This paper describes Microsoft IT's experience and recommendations for dealing effectively with error –1018. It is not intended to serve as a procedural guide. Each enterprise environment has unique circumstances; therefore, each organization should adapt the material to its specific needs.

While the focus here is on Exchange Server 2003, nearly all the material covered applies to any version of Exchange. Exchange Server 2003 implements important new functionality for recovering from –1018 errors. This is discussed in "ECC Page Correction in Exchange Server 2003 SP1" later in this document.

Introduction

No computer data storage mechanism is perfect. Disks and tapes go bad. Glitches in hardware or bugs in firmware can cause data to be corrupted. The most basic strategy for dealing with this reality is redundancy: disks are mirrored or replicated; data is backed up to remote locations so that when—not if—primary storage is compromised, data can be recovered from another copy.

Loss of data is not the only risk when data becomes corrupted. If corruption is undetected, bad decisions may be made based on the data. Stories are occasionally reported in the press about a decimal point that is removed by random corruption of a database record, and someone becomes a temporary millionaire as a result. Corruption of a database can cause even more subtle or difficult errors. In Exchange, acting on a piece of corrupted metadata could cause mail destined for one user to be sent to another, or could cause all mail in a database to be lost.

Exchange databases therefore implement functionality to detect such damage. Even more important than detecting random corruption is not acting on it. After Exchange detects damage to its databases, the damaged area is treated as if it were completely unreadable. Thus, the database cannot be further harmed by relying on the data.

The error code –1018 is reported when Exchange detects random corruption of its data by a problem in the underlying platform. Although data corruption is a serious problem, it is rare for a –1018 error detected during database run time to cause the database to stop or to seriously malfunction. This is because the majority of pages in an Exchange database have user message data written on them. The loss of a single random page in the database is most likely to result in lost messages. One user or group of users may be affected, but there is no impact to the overall structural and logical integrity of the database. After a –1018 problem has been detected, Exchange will keep running as long as the lost data is not critical to the integrity of the database as a whole.

A –1018 error may be reported repeatedly for the same location in the database. This can happen if a user tries repeatedly to access a particular damaged message. Each time the access will fail, and each time a new error will be logged.

Because the immediate loss of data associated with error –1018 may be minimal, you may be tempted to ignore the error. That would be a dangerous mistake. A –1018 error must be investigated thoroughly and promptly. Error –1018 indicates the possibility of other imminent failures in the platform.

Understanding Error –1018

Error code –1018 (JET_errReadVerifyFailure) means one of two conditions has been detected when reading a page in the database:

  • The logical page number recorded on the page does not correspond to the physical location of the page inside the database file.

  • The checksum recorded on the page does not match the checksum Exchange expects to find on the page.

Statistically, a –1018 error is much more likely to be related to a wrong checksum than to a wrong page number.

To understand why these conditions indicate file-level damage to the database, you need to know a little more about how Exchange database files are organized.

Page Ordering

Each Exchange Server 2003 database consists of two matched files: the .edb file and the .stm file. These files must be copied, moved, or backed up together and must remain synchronized with each other.

Inside the database files, data is organized in sequential 4-kilobyte (KB) (4,096 byte) pages. Several pages can be collected together to form logical structures called balanced trees (B+-Trees). Several of these trees are linked together to form database tables. There may be thousands of tables in a database, depending on how many mailboxes or folders it hosts.

Each page is owned by a single B+-Tree, and each B+-Tree is owned by a single table. Error –1018 reports damage at the level of individual pages. Because database tables are made up of pages, the error also implies problems at the higher logical levels of the database.

At the beginning of each database file are two header pages. The header pages record important information about the database. You can view the information on the header pages with the Exchange Server Database Utilities tool Eseutil.

After the header pages, every other page in a database file is either a data page or an empty page waiting for data. Each data page is numbered, in sequential order, starting at 1. Because of the two header pages at the beginning of the file, the third physical page is the first logical data page in the database. (You can consider the two header pages to be logical pages –1 and 0.)

Note: Each database file as a whole has a header, and each page in a database also has its own header. It can be confusing to distinguish between the two.

The database header is at the beginning of the database file and it records information about the database as a whole. A page header is the first 40 bytes of each and every page, and it records important information only about that particular page. Just as Eseutil can display database header information, it can also display page header fields.

In an Exchange database, you can easily calculate which logical page you are on for any physical byte offset into the database file. Logical page –1, which is the first copy of the database header, starts at offset 0. Logical page 0, a second copy of the database header, starts at offset 4,096. Logical page 1, the first data page in the database, starts at offset 8,192. Logical page 2 starts at offset 12,288, and so on.

Each –1018 error is for a single page in the database, and it can be useful in advanced troubleshooting to be able to locate the exact page where the error occurred.

As general formulas:

  • (Logical page number + 1) × 4,096 = byte offset

  • (byte offset ÷ 4,096) – 1 = logical page number

These examples may be useful:

Suppose you need to know the exact byte offset for logical page 101 in a database. Using the first formula, (101 + 1) × 4,096 = 417,792, logical page 101 starts exactly 417,792 bytes into the file.

Now, suppose you need to know what page is at byte offset 4,104,192. Using the second formula, (4,104,192 ÷ 4,096) – 1 = 1,001, logical page 1,001 starts at 4,104,192 bytes into the file.

In most cases, a Windows Application Log event reporting error –1018 will list the location of the bad page as a byte offset. Therefore, the second formula is likely to be the most frequently used. In any case, the two formulas allow you to translate back and forth between logical pages and byte offsets as needed.
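The two formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are hypothetical and are not part of any Exchange tool.

```python
PAGE_SIZE = 4096  # bytes per database page, as described above


def page_to_offset(logical_page: int) -> int:
    """Return the byte offset at which a logical page starts."""
    return (logical_page + 1) * PAGE_SIZE


def offset_to_page(byte_offset: int) -> int:
    """Return the logical page that starts at a page-aligned byte offset."""
    if byte_offset % PAGE_SIZE != 0:
        raise ValueError("offset is not on a page boundary")
    return byte_offset // PAGE_SIZE - 1


print(page_to_offset(101))      # 417792
print(offset_to_page(4104192))  # 1001
```

Offsets reported in the application log are expected to fall on a page boundary, so the sketch rejects any offset that does not.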

The logical page number is actually recorded on each page in the database. (In Exchange Server 2003 with Service Pack 1 (SP1), the method for doing this has changed. For more details, see "ECC Page Correction in Exchange Server 2003 SP1" later in this document.) When Exchange reads a page, it checks whether the logical page number matches the byte offset. If it does not match, a –1018 error results, and the page is treated as unreadable.

The correspondence between physical and logical pages is important because it allows Exchange to detect whether its pages have been stored in correct order in the database files. If the physical location does not match the logical page number, the page was written to the wrong place in the file system. Even if the data on the page is correct, if the page is in the wrong place, Exchange will detect the problem and not use the page.

Page Checksum

Along with the logical page number, each page in the database also stores a calculated checksum for its data. The checksum is at the beginning of the page and is derived by running an algorithm against the data on the page. This algorithm returns a 4-byte checksum number. If something on a page changes, the checksum on the page will no longer match the data on the page. (In Exchange Server 2003 SP1, the checksum algorithm has become more complicated than this, as you will learn in the next section.)

Every time Exchange reads a page in the database, it runs the checksum algorithm again and makes sure the result is the same as the checksum already on the page. If it is not, something has changed on the page. A –1018 error is logged, and the page is treated as unreadable.
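The two read-time checks described so far (page number and checksum) can be sketched as follows. This is a simplified model, not the Exchange implementation: CRC-32 stands in for the real checksum algorithm, which this paper does not describe, and the little-endian field encoding and all function names are assumptions. The field positions (checksum in the first four bytes, page number in the next four) follow the pre-SP1 layout described under "Event 476" later in this document.

```python
import struct
import zlib

PAGE_SIZE = 4096


def stand_in_checksum(body: bytes) -> int:
    # Placeholder only: CRC-32 stands in for Exchange's real algorithm.
    return zlib.crc32(body) & 0xFFFFFFFF


def build_page(logical_page: int, payload: bytes) -> bytes:
    """Build a 4,096-byte page: checksum, page number, then payload."""
    if len(payload) > PAGE_SIZE - 8:
        raise ValueError("payload does not fit on one page")
    body = struct.pack("<I", logical_page) + payload.ljust(PAGE_SIZE - 8, b"\0")
    return struct.pack("<I", stand_in_checksum(body)) + body


def verify_page(page: bytes, byte_offset: int) -> str:
    """Return 'ok', 'checksum mismatch', or 'page number mismatch'."""
    stored_checksum, stored_page = struct.unpack_from("<II", page, 0)
    if stand_in_checksum(page[4:]) != stored_checksum:
        return "checksum mismatch"
    if stored_page != byte_offset // PAGE_SIZE - 1:
        return "page number mismatch"
    return "ok"


page = build_page(517, b"message data")
print(verify_page(page, 2121728))    # ok: page 517 belongs at this offset
print(verify_page(page, 4096 * 10))  # page number mismatch: wrong location

damaged = bytearray(page)
damaged[100] ^= 0x04                 # flip one bit in the payload
print(verify_page(bytes(damaged), 2121728))  # checksum mismatch
```

In this model either failure would be surfaced as error –1018 and the page would be treated as unreadable.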

ECC Page Correction in Exchange Server 2003 SP1

Exchange Server 2003 SP1 includes an important new recovery mechanism for some –1018 related damage. This mechanism is an Error Correction Code (ECC) checksum that is placed on each page. This checksum is in addition to the checksum present in previous versions of Exchange.

Each Exchange page now has two checksums, one right after the other, at the beginning of each page. The first checksum (the data integrity checksum) determines whether the page has been damaged; the second checksum (the ECC checksum) can be used to automatically correct some kinds of random corruption. Before Exchange Server 2003 SP1, Exchange could reliably detect damage, but could not do anything about it.

By surveying many –1018 cases, Microsoft discovered that approximately 40 percent of –1018 errors are caused by a bit flip. A bit flip occurs when a single bit on a page has the wrong value—a bit that should be a 1 flips to 0, or vice versa. This is a common error with computer disks and memory.

The ECC checksum can correct a bit flip. This means that approximately 40 percent of –1018 errors are self-correcting if you are using Exchange Server 2003 SP1 or later.

Note: ECC checksums that can correct multiple bit flips are possible, but not practical to implement. Single-bit error correction has minimal performance overhead, but it would be costly in terms of performance to detect and correct multiple bit errors. As a statistical matter, the distribution of page errors tends to cluster in two extremes: single bit errors and massive damage to the page.
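To make the idea of pairing an integrity checksum with a correction code concrete, here is a minimal sketch of one well-known way to locate a single flipped bit: store the XOR of the positions of all set bits. This is an illustration only. It is not the ECC algorithm used by Exchange, CRC-32 is a stand-in for the integrity checksum, and every name in it is hypothetical.

```python
import zlib
from typing import Tuple


def position_xor(data: bytes) -> int:
    """XOR together the 1-based positions of every set bit in data."""
    acc = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                acc ^= byte_index * 8 + bit + 1
    return acc


def protect(data: bytes) -> Tuple[int, int]:
    """Return (integrity checksum, correction code) for data."""
    return zlib.crc32(data), position_xor(data)


def read_with_correction(data: bytes, integrity: int, ecc: int) -> Tuple[bytes, str]:
    """Return the (possibly repaired) data and a status string."""
    if zlib.crc32(data) == integrity:
        return data, "ok"
    # If exactly one bit flipped, the two codes differ by that bit's position.
    position = position_xor(data) ^ ecc
    if 1 <= position <= len(data) * 8:
        repaired = bytearray(data)
        repaired[(position - 1) // 8] ^= 1 << ((position - 1) % 8)
        if zlib.crc32(bytes(repaired)) == integrity:
            return bytes(repaired), f"corrected bit {position - 1}"
    return data, "uncorrectable"


data = bytes(range(256)) * 16            # one 4,096-byte page of sample data
integrity, ecc = protect(data)

flipped = bytearray(data)
flipped[18] ^= 0x01                      # flip bit 144 of the page
print(read_with_correction(bytes(flipped), integrity, ecc)[1])  # corrected bit 144
```

The integrity checksum decides whether the page is damaged and confirms whether a repair worked; the second code only proposes which bit to flip. When more than one bit is wrong, the proposed repair fails the integrity check and the page is reported as uncorrectable.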

If a –1018 error is corrected by the ECC mechanism, it does not mean you can safely ignore the error. ECC correction does not change the fact that the underlying platform did not reliably store or retrieve data. ECC makes recovery from error –1018 automatic (approximately 40 percent of the time), but does not change anything else about the way you should respond to a –1018 error.

The format of Exchange database page headers had to be changed to accommodate the ECC checksum. The field in each page header that used to carry the logical page number now carries the page number mixed with the ECC checksum. This means that Exchange Server 2003 SP1 databases are not backward compatible, even with the Exchange Server 2003 original release. The same applies to database tools, such as Eseutil. With older versions of the tools, the ECC databases appear to be massively corrupt, because the ECC checksum is not considered.

For more information about ECC page correction, refer to the Microsoft Knowledge Base article "New error correcting code is included in Exchange Server 2003 SP1" [http://support.microsoft.com/kb/867626].

Backup and Error –1018

A –1018 error may be encountered at any time while the database is running. However, this is not how the majority of –1018 problems are actually discovered. Instead, they are more often found during backup.

A –1018 error is reported only when a page is read, and not all pages in the database are likely to be read frequently. For example, messages in a user's Deleted Items folder may not be accessed for long periods. A –1018 error in such a location could go undetected for a long time. To detect –1018 problems quickly, you must read all the pages in the databases. Online backup is a natural opportunity for checking the entire database for –1018 damage, because to back up the whole database you have to read the whole database.

Exchange Online Streaming API Backups

Exchange has always supported an online streaming backup application programming interface (API) that allows Exchange databases to be backed up while they are running. Many third-party vendors have created Exchange-aware backup modules or agents that use this API. Backup, the backup program that comes with Microsoft Windows Server™ 2003 or Windows® 2000 Server, supports the Exchange streaming backup API. If you install Exchange Server or Exchange administrator programs on a computer, Backup is automatically enabled for Exchange-aware online backups.

If a –1018 page is encountered during online backup, the backup will be stopped. Exchange will not allow you to complete an online backup of a database with –1018 damage. This is to ensure that your backup can never have a –1018 problem in it. This is important because it means you can recover from a –1018 problem by restoring from your last backup and bringing the database up-to-date with the subsequent transaction log files. After you do this, you will have a database that is up-to-date, with no data loss, and with no –1018 pages.

Playing transaction logs will never introduce a –1018 error into a database. However, playing transaction logs may uncover an already existing –1018 error. To apply transaction log data, Exchange must read each destination page in the database. If a destination page is damaged, transaction log replay will fail. Exchange cannot replace a page with what is in the transaction log because transaction log updates may be done for only parts of a page.

If you restore from an online backup and encounter a –1018 error during transaction log replay, the most likely reason is that corruption was introduced into the database by hardware instability during or after restoration. To test this, restore the same backup to known good hardware. For more information, see "Can Exchange Cause a –1018 Error?" later in this document.

Restoring from an online backup and replaying subsequent transaction logs is the standard strategy for recovering from –1018 errors. Other strategies for special circumstances are outlined in "Recovering from a –1018 Error" later in this document.

Backup Retries and Transient –1018 Errors

Not all –1018 errors are permanent. A –1018 error may be reported because of a failure in memory or in a subsystem other than the disk. The database page on the disk is good, but the system does not read the disk reliably. To handle such cases, and to give the backup a better chance to succeed even on failing hardware, Exchange has functionality to retry –1018 errors encountered during backup.

If a –1018 error is reported when a page is backed up, Exchange will wait a second or two, and then try again to read the page. This will happen up to 16 times before Exchange gives up, fails the read of the page, and then fails the backup.

If Exchange eventually reads the page successfully, the copy of the page on the disk is good, but there is a serious problem elsewhere in the system. Even if Exchange is not successful in reading the page, it does not prove conclusively that the page is bad. Depending on how hardware caching has been implemented, all 16 read attempts may come from the same cache rather than directly from the disk. Exchange waits between each read attempt and tries to read again directly from the disk to increase the likelihood that the read will not be satisfied from cache.
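The retry behavior described above can be sketched as a simple loop. This is a hedged illustration, not Exchange code: `read_page` and `is_valid` are hypothetical callables supplied by the caller, and the one-second delay is an assumption based on the "second or two" mentioned above.

```python
import time
from typing import Callable, Optional

MAX_ATTEMPTS = 16  # total read attempts before the backup is failed


def read_page_with_retries(
    read_page: Callable[[], bytes],
    is_valid: Callable[[bytes], bool],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Try up to MAX_ATTEMPTS reads; return the page, or None if all fail."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        page = read_page()
        if is_valid(page):
            return page
        if attempt < MAX_ATTEMPTS:
            sleep(delay_seconds)  # give a transient fault time to clear
    return None
```

A return value of `None` corresponds to the case where the page read, and therefore the backup, fails. As noted above, a successful retry still points to a serious problem elsewhere in the system.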

Exchange Volume Shadow Copy Service API Online Backups

If you are running Exchange Server 2003 on Windows Server 2003, you have the additional online backup option of performing Volume Shadow Copy service-based online backups of Exchange. The Volume Shadow Copy service online backup API is a new method that is similar in its capabilities to the streaming backup API, but that can allow for faster restoration times independent of the database file size. How fast Volume Shadow Copy service backup is compared to streaming backup depends on a number of factors, the most important of which is whether the Volume Shadow Copy provider is software-based or hardware-based. Both software-based and hardware-based providers can make snapshot and clone copies of files even when the files are locked open and in use. However, if you use a software provider, the process is no faster than when making an ordinary file copy. To make the snapshot or clone process almost instantaneous, even for very large files, you must use a hardware provider.

Backup for Windows 2003 includes a software-based generic Volume Shadow Copy service provider, but does not support Exchange-aware Volume Shadow Copy service backups. If you are using any version of Backup for Windows as your Exchange backup application, you must perform streaming API online backups.

An Exchange-aware Volume Shadow Copy service backup must complete in less than 20 seconds. This is because Exchange suspends changes to the database files during the backup. If the snapshot or clone does not complete within 20 seconds, the backup fails. Thus, a hardware provider is required because the backup must complete so quickly.

Exchange has no opportunity to read database pages during a Volume Shadow Copy service backup. Therefore, the database cannot be checked for –1018 problems during backup. If you use a Volume Shadow Copy service-based Exchange backup solution, the vendor must verify the integrity of the backup in a separate operation soon after the backup has finished.

For more information about Volume Shadow Copy service backup and Exchange, see the Microsoft Knowledge Base article "Exchange Server 2003 data backup and Volume Shadow Copy services" [http://support.microsoft.com/kb/822896].

Application Log Event IDs

When a –1018 error occurs, you will not see a –1018 event in the application log. Instead, there are several different events that will report the –1018 as part of their Description fields. Which event is logged depends on the circumstances under which the –1018 problem was detected.

This listing of events associated with error –1018 is not comprehensive, but it does include the core events for which you should monitor.

For all versions of Exchange, Microsoft Operations Manager (MOM) monitors for events 474, 475, and 476 from the event source Extensible Storage Engine (ESE). If you are running Exchange Server 2003 SP1, you should also ensure that event 399 is monitored.

Event 474

For versions of Exchange prior to Exchange Server 2003 SP1, event 474 is logged when any checksum discrepancy is detected. For Exchange Server 2003 SP1, this event is logged only when multiple bit errors exist on a page. If a single bit error is detected, event 399 (discussed later in this document) is logged instead.

Here is an example of a typical event 474:

Event Type: Error
Event Source: ESE
Event ID: 474

Description: Information Store (3500) First Storage Group: The database page read from the file "C:\mdbdata\priv1.edb" at offset 2121728 (0x0000000000206000) for 4096 (0x00001000) bytes failed verification due to a page checksum mismatch. The expected checksum was 1848886333 (0x6e33c43d) and the actual checksum was 1848886845 (0x6e33c63d). The read operation will fail with error –1018 (0xfffffc06). If this condition persists then please restore the database from a previous backup. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.

The Description field of this event provides information that can be useful for advanced troubleshooting and analysis. You should always preserve this information after a –1018 error has been reported. Providing this information to hardware vendors or to Microsoft Product Support Services may be helpful when troubleshooting multiple –1018 errors.

The Description field shows which database has been damaged and where the damage occurred. For translating a byte offset to a logical page number, recall the formula described in "Page Ordering" earlier in this document. Using that formula, you know that the page damaged in this error is logical page 517 because (2121728 ÷ 4096) – 1 = 517. Direct analysis of the page may show patterns that will help a hardware vendor determine the problem that caused the damage.

The description also lists the checksum that is written on the page as the expected checksum: 6e33c43d. The actual checksum is the checksum that Exchange calculates again as it reads the page: 6e33c63d.
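When many events must be preserved and compared, it can help to pull these fields out of the Description text programmatically. The following sketch assumes the wording of the sample event 474 shown above; the regular expression and function name are hypothetical, and other Exchange versions may word the description differently.

```python
import re

PAGE_SIZE = 4096

# Pattern written against the sample event 474 text in this paper only.
CHECKSUM_EVENT = re.compile(
    r'file "(?P<file>[^"]+)" at offset (?P<offset>\d+) .*?'
    r"expected checksum was \d+ \(0x(?P<expected>[0-9a-fA-F]+)\) and "
    r"the actual checksum was \d+ \(0x(?P<actual>[0-9a-fA-F]+)\)",
    re.DOTALL,
)


def parse_checksum_event(description: str) -> dict:
    """Extract file, offset, logical page, and checksums from a description."""
    match = CHECKSUM_EVENT.search(description)
    if match is None:
        raise ValueError("description does not look like a checksum mismatch event")
    offset = int(match["offset"])
    return {
        "file": match["file"],
        "offset": offset,
        "logical_page": offset // PAGE_SIZE - 1,
        "expected": int(match["expected"], 16),
        "actual": int(match["actual"], 16),
    }


SAMPLE = (
    r'The database page read from the file "C:\mdbdata\priv1.edb" at offset '
    r"2121728 (0x0000000000206000) for 4096 (0x00001000) bytes failed "
    r"verification due to a page checksum mismatch. The expected checksum was "
    r"1848886333 (0x6e33c43d) and the actual checksum was 1848886845 (0x6e33c63d)."
)

info = parse_checksum_event(SAMPLE)
print(info["file"], info["logical_page"])  # C:\mdbdata\priv1.edb 517
```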

Why does it help to know what the checksum values are? Patterns in the checksum differences may assist in advanced troubleshooting. For an example of this, see "Appendix A: Case Studies" later in this document.

In addition, you can tell whether a particular –1018 error is the result of a single bit error (bit flip) by comparing the expected and actual checksums. To do this, translate the checksums to their binary numbering equivalents. If the checksums are identical except for a single bit, the error on the page was caused by a bit flip.

The checksums listed in the preceding example can be translated to their binary equivalents using Calc.exe in its scientific mode:

0x6e33c43d = 1101110001100111100010000111101

0x6e33c63d = 1101110001100111100011000111101

Single bit difference: the tenth bit from the right (hexadecimal 0x200) is 0 in the first value and 1 in the second.

In the preceding example, if this error had occurred on an Exchange Server 2003 SP1 database, the error would have been automatically corrected.
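The same comparison can be done without Calc.exe by taking the exclusive OR of the two checksums and counting the bits that remain set. A minimal sketch follows; the function names are hypothetical.

```python
def flipped_bits(expected: int, actual: int) -> list:
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def is_single_bit_flip(expected: int, actual: int) -> bool:
    """True when the two checksums differ in exactly one bit."""
    return len(flipped_bits(expected, actual)) == 1


print(flipped_bits(0x6E33C43D, 0x6E33C63D))        # [9]
print(is_single_bit_flip(0x6E33C43D, 0x6E33C63D))  # True
```

Bit position 9, counting from zero at the least significant bit, corresponds to the value 0x200, which is the difference between the two checksums in the example.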

In Exchange Server 2003 SP1, the checksum reported in the Description field of event 474 shows the page integrity checksum and the ECC checksum together. For example:

Description: Information Store (3000) SG1018: The database page read from the file "D:\exchsrvr\SG1018\priv1.edb" at offset 2371584 (0x0000000000243000) for 4096 (0x00001000) bytes failed verification due to a page checksum mismatch. The expected checksum was 2484937984258 (0x0000024291d88902) and the actual checksum was 62488400759392765 (0x00de00de91d889fd). The read operation will fail with error –1018 (0xfffffc06). If this condition persists then please restore the database from a previous backup. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.

Notice that the checksum listed is 16 hexadecimal characters, and in the previous example, the checksum is eight hexadecimal characters. In the new checksum format, the first eight characters are the ECC checksum, and the last eight characters are the page integrity checksum.
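Splitting the 16-character value into its two halves is a matter of shifting and masking. The sketch below applies this to the values from the preceding event; the function name is hypothetical.

```python
def split_sp1_checksum(value: int) -> tuple:
    """Split a 64-bit SP1 checksum into (ECC checksum, page integrity checksum)."""
    ecc = (value >> 32) & 0xFFFFFFFF   # first eight hexadecimal characters
    integrity = value & 0xFFFFFFFF     # last eight hexadecimal characters
    return ecc, integrity


for label, value in (("expected", 0x0000024291D88902), ("actual", 0x00DE00DE91D889FD)):
    ecc, integrity = split_sp1_checksum(value)
    print(f"{label}: ECC=0x{ecc:08x} integrity=0x{integrity:08x}")
# expected: ECC=0x00000242 integrity=0x91d88902
# actual: ECC=0x00de00de integrity=0x91d889fd
```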

Event 475

Event 475 indicates a –1018 problem caused by a wrong page number. It is no longer used in Exchange Server 2003. Instead, bad checksums and wrong page numbers are reported together under Event 474. The following is an example of event 475:

Event Type: Error

Event Source: ESE

Event ID: 475

Description: Information Store (1448) The database page read from the file "C:\MDBDATA\priv1.edb" at offset 1257906176 (0x000000004afa2000) for 4096 (0x00001000) bytes failed verification due to a page number mismatch. The expected page number was 307105 (0x0004afa1) and the actual page number was 307041 (0x0004afe1). The read operation will fail with error –1018 (0xfffffc06). If this condition persists then please restore the database from a previous backup.

Event 475 can be misleading. It may not mean the page is in the wrong location in the database. It only indicates that the page number field is wrong. Only if the checksum on the page is also valid can you conclude that the page is in the wrong location. Advanced analysis of the actual page is required to determine whether the field is corrupted or the page is in the wrong place. In the majority of cases, the page field is corrupted.

Notice that in the preceding example, the difference in the page number fields is a single bit, indicating that this page is probably in the right place, but was damaged by a bit flip.

Event 476

Event 476 indicates error –1019 (JET_errPageNotInitialized). This error will occur if a page in the database is expected to be in use, but the page number is zero.

In releases of Exchange prior to Exchange 2003 Service Pack 1, the first four bytes of each page store the checksum, and the next four bytes store the page number. If the page number field is all zeroes, then the page is considered uninitialized. To make room for the ECC checksum in Exchange 2003 Service Pack 1, the page number field has been converted to the ECC checksum field. The page number is now calculated as part of the checksum data, and a page is now considered to be uninitialized if both the original checksum and ECC checksum fields are zeroed.

Event Type: Error

Event Source: ESE

Event ID: 476

Description: Information Store (3500) First Storage Group: The database page read from the file "C:\mdbdata\priv1.edb" at offset 2121728 (0x0000000000206000) for 4096 (0x00001000) bytes failed verification because it contains no page data. The read operation will fail with error –1019 (0xfffffc05). If this condition persists then please restore the database from a previous backup. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.

In most cases, error –1019 is just a special case of error –1018. However, it could also be that a logical problem in the database has caused a table to show that an empty page is in use. Because you cannot distinguish between these two cases without advanced logical analysis of the entire database, error –1019 is reported instead of error –1018.

Error –1019 is rare, and full discussion of analysis and troubleshooting this error is outside the scope of this paper.
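The uninitialized-page test described in this section can be sketched as a check on the first eight bytes of a page. This models only the rule as stated here (page number field zeroed before SP1; both checksum fields zeroed in SP1); the function name and the `sp1_format` flag are hypothetical.

```python
ZERO_FIELD = b"\0" * 4


def is_uninitialized(page: bytes, sp1_format: bool) -> bool:
    """Apply the uninitialized-page rule to the first eight bytes of a page."""
    checksum_field = page[0:4]
    second_field = page[4:8]  # page number before SP1, ECC checksum in SP1
    if sp1_format:
        return checksum_field == ZERO_FIELD and second_field == ZERO_FIELD
    return second_field == ZERO_FIELD


blank_page = bytes(4096)
print(is_uninitialized(blank_page, sp1_format=True))   # True
print(is_uninitialized(blank_page, sp1_format=False))  # True
```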

Event 399

Event 399 is a new event that was added in Exchange Server 2003 SP1. It is a Warning event, and not an Error event. It indicates that a single bit corruption has been detected and corrected in the database.

Event Type: Warning

Event Source: ESE

Event ID: 399

Description: Information Store (3000) First Storage Group: The database page read from the file "C:\mdbdata\priv1.edb" at offset 4980736 (0x00000000004c0000) for 4096 (0x00001000) bytes failed verification. Bit 144 was corrupted and has been corrected. This problem is likely due to faulty hardware and may continue. Transient failures such as these can be a precursor to a catastrophic failure in the storage subsystem containing this file. Please contact your hardware vendor for further assistance diagnosing the problem.

Although Event 399 is a warning rather than an error, it should be monitored for and treated as seriously as any uncorrectable –1018 error. All –1018 errors indicate platform instability of one degree or another and may indicate additional errors will occur in the future.

Event 217

Event 217 indicates backup failure because of a –1018 error.

Event Type: Error

Event Source: ESE

Event ID: 217

Description: Information Store (1224) First Storage Group: Error (–1018) during backup of a database (file C:\mdbdata\priv1.edb). The database will be unable to restore.

Immediately before this error occurs, you will typically find a series of 16 event 474 errors in the application log, all for the same page. During backup, Exchange will retry a page read 16 times, waiting a second or two between each attempt. This is done in case the error is transient, so that a backup has a better chance to succeed.

Retries are not done for normal run-time read errors, but only during backup. Performing retries during normal operation could stall the database, if a frequently accessed page is involved.

Event 221

Event 221 indicates backup success. It is generated for each database file individually when it is backed up.

Event Type: Information

Event Source: ESE

Event ID: 221

Description: Information Store (1224) First Storage Group: Ending the backup of the file C:\mdbdata\priv1.edb.

----------

Event Type: Information

Event Source: ESE

Event ID: 221

Description: Information Store (1224) First Storage Group: Ending the backup of the file D:\mdbdata\priv1.stm.

If you are using third-party backup applications, there may be additional backup events that you should monitor in addition to those listed here.
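For monitoring purposes, the ESE events described in this section can be summarized as a small lookup table. The sketch below is one possible way to encode them; the table layout, the `triage` function, and its return values are hypothetical and would need to be adapted to whatever monitoring tool is in use.

```python
# Event IDs and meanings as described in this paper (event source: ESE).
ESE_EVENTS = {
    399: ("Warning", "single bit corruption detected and corrected (SP1 and later)"),
    474: ("Error", "page checksum mismatch"),
    475: ("Error", "page number mismatch (not used in Exchange Server 2003)"),
    476: ("Error", "page contains no page data"),
    217: ("Error", "backup of a database failed"),
    221: ("Information", "backup of a database file ended"),
}

# Event 399 is only a warning, but it is treated as seriously as the errors.
NEEDS_INVESTIGATION = {399, 474, 475, 476, 217}


def triage(event_id: int) -> str:
    """Return 'investigate', 'ok', or 'unknown' for an ESE event ID."""
    if event_id in NEEDS_INVESTIGATION:
        return "investigate"
    if event_id in ESE_EVENTS:
        return "ok"
    return "unknown"


print(triage(399))  # investigate
print(triage(221))  # ok
```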

Root Causes

At the simplest level, there are only three root causes for –1018 errors:

  • The underlying platform for your Exchange database has failed to reliably write Exchange data to storage.

  • The underlying platform for your Exchange database has failed to reliably read Exchange data from storage.

  • The underlying platform for your Exchange database has failed to reliably preserve Exchange data while in storage.

This level of analysis defines the scope of the issue. At a practical level, you want to know:

  • Will this happen again?

  • What should I do to recover from the error?

How Microsoft IT assesses the likelihood of additional errors is described in "Server Assessment and Root Cause Analysis" later in this document. Recovery strategies are also described later in this document. This section summarizes the most common root causes for error –1018:

  • Failing disk drives. Along with simple drive failures, it is not uncommon for Microsoft Product Support Services to handle cases where rebuilding a redundant array of independent disks (RAID) drive set after a drive failure is not successful.

  • Hard failures. Sudden interruption of power to the server or the disk subsystem may result in corruption or loss of recently changed data. Enterprise class server and storage systems should be able to handle sudden loss of power without corruption of data. Microsoft has tested Exchange and Exchange servers by unplugging a test server thousands of times in succession, with no corruption of Exchange data afterward.

    Exchange is an application that is well suited to uncovering problems from such testing because of its transaction log replay behavior and its checksum function. Damage to Exchange files often becomes evident during post-failure transaction log replay and recovery, or through verifying checksums on the database files after a test pass.

    For more information about input/output (I/O) atomicity, and its importance for data integrity after a hard failure, refer to "Best Practices" later in this document.

  • Cluster failovers. As an application is transitioned from one cluster node to another, disk I/O may be lost or not properly queued during the transition. Even though individual components may be robust and well designed, they may not work well together as a cluster system. This is one reason that Microsoft has a qualification program for cluster systems that is separate from the qualification program for stand-alone components. The cluster system qualification program tests all critical components together rather than separately.

  • Resets and other events in the disk subsystem. Companies are increasingly implementing Storage Area Network (SAN) and other centralized storage technologies, in which multiple servers access a shared storage frame. Not only is correct configuration and isolation of disk resources essential in these environments, but you must also manage redundant I/O paths and an increasing number of filters and services that are involved in disk I/O. The increasing complexity of the I/O chain necessarily introduces additional points of failure and exposes poor product integration.

  • Hardware or firmware bugs. Standard diagnostic test runs are seldom successful in diagnosing these problems. (If the standard diagnostic run could catch this particular problem, would it not already have been caught?) Understanding these issues frequently requires correlating data from multiple servers and using specialized diagnostic suites and stress test harnesses.

This is not a comprehensive list of all causes of error –1018, but it does outline the problem categories that account for the majority of these errors.

Can Exchange Cause a –1018 Error?

Can Exchange be the root cause of a –1018 error? Exchange might be responsible for creating a –1018 condition if it did one or both of the following:

  • Constructed the wrong checksum for a page.

  • Constructed a page correctly, but instructed the operating system to write the page in the wrong location.

The Exchange mechanisms for generating checksums and writing pages back to the database files are based on simple algorithms that have been stable since the first Exchange release. Even the addition of the ECC checksum in Exchange Server 2003 SP1 did not fundamentally alter the page integrity checksum mechanism. The ECC checksum is an additional checksum placed next to the original corruption detection checksum. The integrity of Exchange database pages is still verified through the original checksum.

Note: If you use versions of Esefile or Eseutil from versions of Exchange prior to Exchange Server 2003 SP1 to verify checksums in an Exchange Server 2003 SP1 or later database, nearly every page of the database will be reported as damaged. The page format was altered in Exchange Server 2003 SP1 and previous tools cannot read the page correctly. You must use post-Exchange Server 2003 SP1 tools to verify ECC checksum pages.

A logical error in the page integrity checksum mechanism would likely result in reports of massive and immediate corruption of the database, rather than in infrequent and seemingly random page errors.

This does not mean that there have never been any problems in Exchange that have resulted in logical data corruption. However, these problems cause different errors and not a –1018 error. Error –1018 is deliberately scoped to detect logically random corruptions caused by the underlying platform.

There are a few cases where false positive or false negative –1018 reports have been caused by a problem in Exchange. In these cases, the checksum mechanism worked correctly, but there was a problem in a code path for reporting an error. This either caused a –1018 error to be reported when there was no problem, or caused a genuine error to go unreported. Examination of the affected databases quickly leads to resolution of such issues.

Transaction log file replay is another Exchange capability that allows Microsoft to effectively diagnose –1018 errors that may be the fault of Exchange. Recall from the previous section that online backups are not allowed to complete if –1018 problems exist in the database. In addition, after restoration of a backup, transaction log replay re-creates every change that happened subsequent to the backup. This allows Exchange development to start from a known good copy of the database and trace every change to it.

As an Exchange administrator, the following two symptoms indicate that Exchange should be looked at more closely as the possible cause of a –1018 error:

  • After restoration of an online backup, and before transaction log file replay begins, there is a –1018 error in the restored database files. This could indicate that checksum verification failed to work correctly during backup. It is also possible that the backup media has gone bad, or that data was damaged while being restored because of failing hardware. The next test is more conclusive.

  • After checksum verification of restored databases, a –1018 error is present after successful transaction log replay has completed. This could indicate that a logical problem resulted in generation of an incorrect checksum. Reproducing this problem consistently on different hardware will rule out the possibility that failing hardware further damaged the files during the restoration and replay process.

Conversely, if restoring from the backup and rolling forward the log files eliminate a –1018 error, this is a strong indication that damage to the database was caused by an external problem.

In summary, error –1018 is scoped to report only two specific types of data corruption:

  • A logical page number recorded on a page is nonzero and does not match the physical location of the page in the database.

  • The checksum recorded on a page does not match the actual data recorded on the page.

Exchange thus both detects corruption of the data on a page and guards against the possibility that a page in the database has been written in the wrong place.
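
The two checks summarized above can be illustrated with a minimal Python sketch. The page layout used here (a 4-byte checksum followed by a 4-byte logical page number at the start of a 4,096-byte page) and the use of CRC-32 are assumptions chosen only for illustration; they are not the actual Exchange on-disk page format or checksum algorithm.

```python
import struct
import zlib

PAGE_SIZE = 4096                 # illustrative page size
HEADER = struct.Struct("<II")    # hypothetical header: checksum, logical page number


def build_page(page_number: int, payload: bytes) -> bytes:
    """Build a page whose header records its checksum and logical page number."""
    body_size = PAGE_SIZE - HEADER.size
    body = payload[:body_size].ljust(body_size, b"\0")
    number_field = struct.pack("<I", page_number)
    # The checksum covers everything on the page except the checksum field itself.
    checksum = zlib.crc32(number_field + body)
    return struct.pack("<I", checksum) + number_field + body


def verify_page(page: bytes, physical_page_number: int) -> str:
    """Return 'ok', 'checksum mismatch', or 'wrong location' for one page."""
    stored_checksum, logical_number = HEADER.unpack_from(page)
    if zlib.crc32(page[4:]) != stored_checksum:
        return "checksum mismatch"
    # Only a nonzero recorded page number is compared with the physical location.
    if logical_number != 0 and logical_number != physical_page_number:
        return "wrong location"
    return "ok"
```

In this sketch, a page built with `build_page(7, b"mailbox data")` verifies as "ok" when read from position 7, as "wrong location" when read from position 8, and as "checksum mismatch" if any bit of it is changed after it is built.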

How Microsoft IT Responds to Error –1018

Microsoft IT uses Microsoft Operations Manager (MOM) 2005 to monitor the health and performance of Microsoft Exchange servers. MOM sends alerts to operator consoles for critical errors, including error –1018.

MOM provides enterprise-class operations management to improve the efficiency of IT operations. You can learn more about MOM at the Microsoft Windows Server System Web site [ http://www.microsoft.com/mom/default.mspx ] .

At Microsoft, automatic e-mail notifications are sent to a select group of hardware analysts whenever a –1018 occurs. Thus, all –1018 errors are investigated by an experienced group of people who track the errors over time and across all servers. As you will see later in this document, this approach is an important part of the methodology at Microsoft for handling –1018 errors.

Monitoring Backup Success

Every organization, regardless of size, should monitor Exchange servers for error –1018. The most basic way to accomplish this, if your organization does not use a monitoring application such as MOM, is to verify the success of each Exchange online backup. Even if you do use MOM, you should still monitor backup success separately.

If Exchange online backups are failing unnoticed, you are at risk on at least these counts:

  • A common reason for backup failure is that the database has been damaged. Thus, the Exchange platform may be at risk of sudden failure.

  • You do not have a recent known good backup of critical Exchange data. While an older backup can be rolled forward with additional transaction logs for zero loss restoration, the older the backup, the less likely this will be successful, for a number of operational reasons. For example, an older backup tape may be inadvertently recycled. In addition, if the platform issues on the Exchange server result in loss of the transaction logs, rolling forward will be impossible.

  • After successful completion of an online backup, excess transaction logs are automatically removed from the disk. With backups not completing, transaction log files will remain on the disk, and you are at risk of eventually running out of disk space on the transaction log drive. This will force dismount of all databases in the affected storage group. (If a transaction log drive becomes full, do not simply delete all the log files. Instead, refer to the Microsoft Knowledge Base article "How to Tell Which Transaction Log Files Can Be Safely Removed" [ http://support.microsoft.com/kb/240145 ].)

Verifying backup success is arguably the single most important monitoring task for an Exchange administrator.

As a best practice, Microsoft IT not only sets notifications and alerts for backup errors and failures, but also for backup successes. A daily report for each database is generated and reviewed by management. This review ensures that there is positive confirmation that each database has actually been backed up recently, and that there is immediate attention to each backup failure.
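
A daily report of this kind can be sketched in a few lines of Python. This is only an illustration of the idea of positive confirmation, not the report that Microsoft IT generates. The database names, the 24-hour threshold, and the way the last successful backup time is gathered for each database are all assumptions.

```python
from datetime import datetime, timedelta


def backup_report(last_success, now, max_age=timedelta(hours=24)):
    """Return {database: status}, where status is 'ok', 'stale', or 'never'.

    last_success maps each database name to the completion time of its most
    recent successful online backup, or to None if none is on record.
    """
    report = {}
    for database, finished in sorted(last_success.items()):
        if finished is None:
            report[database] = "never"      # no positive confirmation at all
        elif now - finished > max_age:
            report[database] = "stale"      # last success is too old
        else:
            report[database] = "ok"
    return report


if __name__ == "__main__":
    now = datetime(2005, 8, 1, 9, 0)
    status = backup_report(
        {
            "SG1-Mailbox1": datetime(2005, 8, 1, 2, 30),    # hypothetical names
            "SG1-Mailbox2": datetime(2005, 7, 29, 2, 30),
            "SG2-PublicFolders": None,
        },
        now,
    )
    for database, state in status.items():
        print(database, state)
```

The point of the design is that every database appears in the report every day. A database that is missing a recent success is flagged even if no failure event was ever logged for it.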

Securing Data after a –1018 Error

The most common way that a –1018 error comes to the attention of Microsoft IT analysts is through a backup failure. While a –1018 error may occur during normal database operation, normal run-time –1018 errors are less frequent than errors during backup.

Note: Exchange databases perform several self-maintenance tasks on a regular schedule (which can be set by the administrator). One of these tasks, called online defragmentation, consolidates and moves pages within the database for better efficiency. Thus, error –1018 may be reported more frequently during the online maintenance window than during normal run time.

This is the general process that occurs at Microsoft after a –1018 error:

  • MOM alerts are generated and e-mail notification is sent to Exchange analysts.

  • Analysts verify that recent good backups exist for all databases on the server.

    It is important that backups are good for all databases, and not just the database affected by the –1018, because the error indicates that the entire server is at risk.

  • All transaction log files on the server are copied to a remote location, in case there is a failure of the transaction log drive. As the investigation proceeds, new log files are periodically copied to a safe location or backed up incrementally.

    You can copy log files to a safe location by doing an incremental or differential online backup. In Exchange backup terminology, an incremental or differential backup is one that backs up only transaction log files and not database files. An incremental backup removes logs from the disk after copying them to the backup media. A differential backup leaves logs on the disk after copying them to the backup media.

After existing Exchange data has been verified to be recoverable and safe, it is time to begin assessing the server and performing root cause analysis.

Server Assessment and Root Cause Analysis

There are two levels at which you must gauge the seriousness of a –1018 error:

  • The immediate impact of the error on the functioning of the database.

  • The likelihood of additional and escalating failures.

These two factors are independent of each other. Ignoring a –1018 error because the damaged page is not an important one is a mistake. The next page destroyed may be critical and may result in a sudden catastrophic failure of the database.

There are two common analysis and recovery scenarios for a –1018 condition:

  • There is only a single error, and little or no immediate impact on the overall functioning of the server. You have time to do careful diagnosis, and plan and schedule a low-impact recovery strategy. However, root cause analysis is likely to be difficult because the server is not showing obvious signs of failure beyond the presence of the error.

  • There are multiple damaged pages or the error occurs in conjunction with other significant failures on the server. You are in an emergency recovery situation.

In the majority of emergency recovery situations, root cause analysis is simple because there is a strong likelihood that the –1018 was caused by a catastrophic or obvious hardware failure. Even in an emergency situation, you should take the time to preserve basic information about the error that is needed for statistical trending across servers. For more information, see "Appendix B: –1018 Record Keeping" later in this document.

Even before root cause analysis, your first priority should be to make sure that existing data has already been backed up and that current transaction log files are protected. Then you can begin analysis with bookending.

Bookending

The point at which a page was actually damaged and the point at which a –1018 was reported may be far apart in time. This is because a –1018 error will only be reported when a page is read by the database. Bookending is the process of bracketing the range of time in which the damage must have occurred.

The beginning bookend is the time of the last good online backup of the database (marked by event 221). Because the entire database is checked for –1018 problems during backup, you know that the damage occurred after that backup. The other bookend is the time at which the –1018 error was actually reported in the application log. Frequently, this will be a backup failure error (event 217). The event that caused the –1018 error must have occurred between these two points in time.

After you have established your bookends, the next task is to look for what else happened on the server during this time that may be responsible for the –1018 error:

  • Was there a hard server or disk failure?

  • Was the server restarted (event 6008 in the system log)?

  • Were Exchange services stopped and restarted?

  • Have there been any storage-related errors? This includes memory, disk, multipath, and controller errors. Not only should you search the Windows system log, but you should also be aware of other logging mechanisms that may be used by the disk system. Many external storage controllers do not log events to the Windows system or application logs, and, by default, the controller may not be set up to log errors. You must ensure that error logging is enabled and that you can locate and interpret the logs.

  • Did Chkdsk run against any of the volumes holding Exchange data?

  • If this is a clustered server, were there any failovers or other cluster state transitions?

  • Have any hardware changes been made, or has firmware or software been upgraded on the server?

Any unusual event that occurred between the bookend times must be considered suspect. If there are no unusual events that can account for damage to the database files, you must consider the possibility that there is an undetected problem with the reliability of the underlying platform.
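
The bookending procedure can be sketched in Python as follows. The `Event` record is a hypothetical structure for entries already exported from the event logs. The sketch matches on event ID 221 alone, whereas a real log search would also need to match the event source and the specific database.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    time: datetime
    log: str          # "application" or "system"
    event_id: int
    text: str


def bookends(events, error_time):
    """Return (start, end): the last event 221 before the error, and the error time."""
    backups = [e.time for e in events
               if e.log == "application" and e.event_id == 221 and e.time < error_time]
    start = max(backups) if backups else None   # None: no good backup on record
    return start, error_time


def suspects(events, start, end):
    """List every event that falls strictly between the bookends, oldest first."""
    return sorted(e for e in events
                  if (start is None or e.time > start) and e.time < end)


if __name__ == "__main__":
    log = [
        Event(datetime(2005, 8, 1, 2), "application", 221, "backup completed"),
        Event(datetime(2005, 8, 2, 2), "application", 221, "backup completed"),
        Event(datetime(2005, 8, 2, 14), "system", 6008, "unexpected shutdown"),
        Event(datetime(2005, 8, 3, 2), "application", 217, "backup failed"),
    ]
    start, end = bookends(log, error_time=datetime(2005, 8, 3, 2))
    for event in suspects(log, start, end):
        print(event.time, event.log, event.event_id, event.text)
```

With this sample data, the window runs from the second successful backup to the failed backup, and the unexpected shutdown (event 6008) is the only event inside it. Logs kept by external storage controllers would have to be merged into the same list by hand.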

It is also possible that the error is due to a transient condition external to the hardware. A variety of environmental factors can corrupt computer data or cause transient malfunctions. Vibration, heat, power fluctuations, and even cosmic rays are known to cause glitches or even permanent damage. Hard drive manufacturers are well aware that normally functioning drives are not 100 percent reliable in their ability to read, write, and store data, and design their systems to detect errors and rewrite corrupted data.

Keeping in mind that no computer storage system is 100 percent reliable, how can you decide whether a –1018 is indicative of an underlying problem that you should address, or is just a random occurrence that you should accept?

A Microsoft Senior Storage Technologist, who has extensive experience in root cause analysis of disk failures and Exchange –1018 errors, suggests this principle: For 100 Exchange servers running on similar hardware, you should experience no more than a single –1018 error in a year. The phrase running on similar hardware is important in understanding the proper application of this principle.

Standardizing on a single hardware platform for Exchange is useful in root cause analysis of –1018 errors. In the absence of an obvious root cause, the next step of investigation is to look for patterns of errors across similar servers.

A single –1018 error on a single page may be a random event. Only after another –1018 error occurs on the same or a similar server do you have enough information to begin looking for a trend or common cause. If a –1018 error occurs on two servers that have nothing in common, you have two errors that have nothing in common rather than two data points that may reveal a pattern.

As a general rule, if you average fewer than one –1018 error per year across 100 servers of the same type, it is unlikely that root cause analysis will reveal an actionable problem.
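
The guideline amounts to a simple rate calculation, sketched below in Python. The function names and sample numbers are illustrative. The calculation is meaningful only when it is applied to a group of servers running on similar hardware.

```python
def errors_per_100_server_years(error_count, server_count, years):
    """Normalize a count of -1018 errors to a rate per 100 servers per year."""
    return 100.0 * error_count / (server_count * years)


def exceeds_guideline(error_count, server_count, years, threshold=1.0):
    """True if the observed rate is above one error per 100 servers per year."""
    return errors_per_100_server_years(error_count, server_count, years) > threshold


if __name__ == "__main__":
    # 3 errors on 150 similar servers in one year: rate 2.0, above the guideline.
    print(errors_per_100_server_years(3, 150, 1), exceeds_guideline(3, 150, 1))
    # 1 error on 250 similar servers over two years: rate 0.2, within the guideline.
    print(errors_per_100_server_years(1, 250, 2), exceeds_guideline(1, 250, 2))
```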

You should still record data for each –1018 error that occurs on a server. Until a second error has occurred, you cannot know whether a particular error falls below the threshold of this principle.

If a –1018 error is caused by a subtle hardware problem, providing data from multiple errors can be critical. With only a single error to consider, it is likely to be difficult for Microsoft or a hardware vendor to identify a root cause beyond what you can identify on your own. For two actual –1018 root cause investigations, and examples of how difficult and subtle some issues can be to analyze, see "Appendix A: Case Studies" later in this document.

Detailed information about every –1018 error that happens at Microsoft is logged into a spreadsheet as described in "Appendix B: –1018 Record Keeping" later in this document.

Verifying the Extent of Damage

Error –1018 applies to problems on individual pages in the database, and not to the database as a whole. When a –1018 error is reported, you cannot assume that the reported page is the only one damaged. Because a backup will stop at the moment the first –1018 is encountered, you cannot even rely on errors reported during the backup to show you the full extent of the damage.

You need to know how many pages are damaged in the database as part of deciding on a recovery strategy. If multiple pages are damaged, multiple errors have likely occurred, and the platform should be considered in imminent danger of complete failure.

In the majority of –1018 cases investigated by Microsoft IT, there is only a single damaged page in the database. In this circumstance, absent other indications of an underlying problem, Microsoft IT will leave the server in service and wait to implement recovery until a convenient off-peak time. The assumption is that this is a random error, unless a second error in the near future or similar issues on other servers indicate a trend.

Note: Remember that an error –1018 prevents an online backup from completing. Delaying recovery of the database will require you to recover with an increasingly out-of-date backup. This situation will definitely result in longer downtime during recovery, because of additional transaction log files that must be replayed. In Exchange Server 2003 SP1, the typical performance of log file replay is better than 1,000 log files per hour with that performance remaining consistent, regardless of the number of log files that must be replayed. In prior versions of Exchange, transaction log file replay can be more than five times slower, with the average speed of replay tending to diminish as more logs are replayed.
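
The cost of delaying recovery can be estimated with simple arithmetic, sketched here in Python. The default rate of 1,000 log files per hour is the Exchange Server 2003 SP1 figure quoted above. The number of log files generated per day is a hypothetical input that you would measure on your own server.

```python
def replay_hours(days_since_backup, logs_per_day, logs_per_hour=1000):
    """Estimate the hours of log replay needed to roll a restored backup forward."""
    return days_since_backup * logs_per_day / logs_per_hour


if __name__ == "__main__":
    # Assumed workload: 2,000 log files per day, last good backup 3 days old.
    print(replay_hours(3, 2000))        # 6.0 hours at 1,000 logs per hour
    print(replay_hours(3, 2000, 200))   # 30.0 hours at one-fifth of that rate
```

Because replay in Exchange Server 2003 SP1 is described as better than 1,000 log files per hour, the first figure is a rough upper bound. For earlier versions, where replay slows as more logs are replayed, the second figure is closer to a lower bound.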

Comprehensively testing an entire database for –1018 pages requires taking the database offline and running Eseutil in checksum mode.

If you bring a database down after a –1018 error has occurred, there is some chance that it will not start again. If other, unknown pages have also been damaged, one of them could be critical to the startup of the database. Statistically, this is a low probability risk, and Microsoft IT does not hesitate to dismount databases that have displayed run-time –1018 errors.

Eseutil is installed in the \exchsrvr\bin folder of every Exchange server. When run in checksum mode (with the /K switch), Eseutil rapidly scans every page in the database and verifies whether the page is in the right place and whether its checksum is correct. Eseutil runs as rapidly as possible without regard to other applications running on the server. Running Eseutil /K on a database drive shared with other databases is likely to adversely affect the performance of other running databases. Therefore, you should schedule testing of a database for off-peak hours whenever possible.

Note: If you decide to copy Exchange databases to different hardware to safeguard them, be sure that you copy them rather than move them. The problems on the current platform may not be in the disk system, but may cause corruption to occur during the move process. If you move the files, you get no second chance if this corruption happens.

At Microsoft, Eseutil checksum verification is done by running multiple copies of Eseutil simultaneously against the database. One instance of Eseutil /K is started against the database, and after a minute, another instance is started against the same database. The reason for doing this is that in a mirror set, one side of the mirror may have a bad page, but the other side may not.

Running two copies of Eseutil slightly out of sync with each other makes it much more likely that both sides of a mirror will be read. It is not often that one side of a mirror is good and one side is bad, but it does happen, and a thorough test requires testing both sides of the mirror. At Microsoft, this Eseutil regimen is also run five times in succession, to further increase the confidence level in the results.
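
The staggered regimen described above could be scripted along these lines. This Python sketch is an assumption about how such a run might be automated, not the procedure Microsoft IT actually uses. It assumes that the database is already dismounted, that eseutil can be found on the PATH, and that the database path shown is hypothetical.

```python
import subprocess
import time


def staggered_checksum_passes(database, passes=5, stagger_seconds=60,
                              launch=subprocess.Popen, sleep=time.sleep):
    """Run `eseutil /K` twice per pass, starting the second copy after a delay.

    Returns a list of (first_exit_code, second_exit_code) tuples, one per pass.
    The launch and sleep parameters exist so the function can be dry-run.
    """
    command = ["eseutil", "/K", database]
    results = []
    for _ in range(passes):
        first = launch(command)
        sleep(stagger_seconds)          # let the first copy get ahead
        second = launch(command)
        results.append((first.wait(), second.wait()))
    return results


if __name__ == "__main__":
    # Hypothetical database path; run only against a dismounted database.
    for codes in staggered_checksum_passes(r"D:\mdbdata\priv1.edb"):
        print(codes)
```

The sketch returns the exit codes unchanged. Reading the damaged-page counts from the Eseutil output of each run, and comparing them between runs, is left to the operator.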

Note: Multiple runs of Eseutil /K are unnecessary if databases are stored on a RAID-5 stripe set, where data is striped with parity across multiple disks. This is because there is only one copy of a particular page in the set, with redundancy being achieved by the ability to rebuild the contents of a lost drive from the parity. Also, note that as a general rule, RAID 1 (Mirroring) or RAID 1+0 (Mirroring+Striping) drive sets are recommended for heavily loaded database drives for performance reasons.

Recovering from a –1018 Error

Microsoft IT undertakes two fundamental tasks to recover from a –1018 error:

  • Correct the root problem that caused the error.

  • Recover Exchange data damaged by the error.

These tasks are not completely independent of each other. What is discovered about the root cause may influence the data recovery strategy.

For example, if there are overt signs that server hardware is in imminent danger of complete failure, the data recovery strategy may require immediate data migration to a different server. If the server appears to be otherwise stable, data recovery may consist merely of restoring from backup, to remove the bad page from the database.

Server Recovery

At Microsoft, a single –1018 error puts a server on a watch list. It does not trigger replacement or retirement of the hardware unless there has been positive identification of the component that caused the error. If additional –1018 errors occur on the same server in the near future, regardless of whether the root cause has been specifically determined, the server is treated as untrustworthy. It is taken out of production and extensive testing is done.

It may seem obvious that after any –1018 error occurs, you should immediately take the server down and run a complete suite of manufacturer diagnostics. Yet this is not something that Microsoft IT does as a matter of course. The reason is that standard diagnostic tests are seldom successful in uncovering the root cause of a –1018 error. This is because:

  • The corruption may be an anomaly. Power fluctuations and interference, temporary rises in heat or humidity, and even cosmic rays can corrupt computer data. Unless these conditions are repeated at the time the test is run, the test will show nothing.

  • If a –1018 error occurs only once and is not accompanied by any other visible errors or issues, it is probable that the server is currently functioning normally. The condition that caused the problem may occur infrequently or require a particular confluence of circumstances that cannot be replicated by a general diagnostic tool.

  • Hardware frequently fails in an intermittent rather than steady or progressive pattern.

  • The problem may be the result of a subtle hardware or firmware bug rather than due to a progressively failing component. In this case, ordinary manufacturer diagnostics may be incapable of uncovering the issue. If these diagnostics could detect the issue, it would have already been uncovered in a previous diagnostic run.

  • The problem may be a Heisenbug. The term Heisenbug refers to a problem that cannot be reproduced because the diagnostic tools used to observe the system change the system enough that the problem no longer occurs. For example, a tool that monitors the contents of RAM may slow down processing enough that timing tolerances are no longer exceeded, and the problem disappears.

  • The diagnostic tool may not be able to simulate a load against the server that is sufficiently complex. There is a misconception that –1018 errors are more likely to appear when you place a system under a heavy I/O load. The experience at Microsoft is that the complexity of the load is more relevant to exposing a data corruption issue than is the overall level of load. Complexity can be in the type of access (the I/O size combined with direction), as well as in the actual data content patterns. Certain complex patterns can show noise or crosstalk problems that will not be exposed by simpler patterns. One of the strengths of the Finisar Medusa Labs Test Tool Suite is its ability to generate such patterns.

Manufacturer diagnostics are typically run only after the server has already been taken out of production. This happens after a pattern of –1018 errors has established that an underlying problem exists, but the root cause has not yet been discovered. Along with these diagnostics, Microsoft IT also tries to reproduce data corruption problems by using tools that stress the disk subsystem.

The Jetstress (Jetstress.exe) and Exchange Server Load Simulator (LoadSim) tools can be used to realistically simulate the I/O load demands of an actual Exchange server. The primary function of these tools is for capacity planning and validation, but they are also useful for testing hardware capabilities.

Jetstress creates several Exchange databases and then exercises the databases with realistic Exchange database I/O requests. This approach allows determining whether the I/O bandwidth of the disk system is sufficient for its intended use.

LoadSim simulates Messaging application programming interface (MAPI) client (Microsoft Office Outlook 2003) activity against an Exchange server and is useful for judging the overall performance of the server and network. LoadSim requires additional client workstation computers to present high levels of client load to the server.

While neither tool is intended as a disk diagnostic tool, both can be used to create large amounts of realistic Exchange disk I/O. For this purpose, most people prefer Jetstress because it is simpler to set up and tune. Both Jetstress and LoadSim come with extensive documentation and setup guidance and are available free for download from Microsoft. You can download Jetstress [ http://www.microsoft.com/downloads/details.aspx?FamilyId=94B9810B-670E-433A-B5EF-B47054595E9C&displaylang=en ] from the Microsoft Download Center. You can download LoadSim from the Microsoft Windows Server System Web site [ http://www.microsoft.com/exchange/downloads/2000/loadsim.mspx ] .

Microsoft IT also uses the Medusa Labs Test Tools Suite from Finisar for advanced stress testing of disk systems. The Finisar tools can generate complex and specific I/O patterns, and are designed for testing the reliability of enterprise-class systems and storage. While Jetstress and LoadSim are capable of generating realistic Exchange server loads, the Finisar tools generate more complex and demanding I/O patterns that can uncover subtle data and signal integrity issues.

For detailed information about the Medusa Labs Test Tools Suite, see the Finisar Web site [ http://www.finisar.com/nt/Medusalabs.php ] .

Use of Jetstress, LoadSim, or the Medusa tools requires that the server be taken out of production service. Each of these tools, used in a stress test configuration, makes the server unusable for other purposes while the tests are running.

The Eseutil checksum function is also sometimes useful in reproducing unreliability in the disk system. Eseutil scans through a database as quickly as possible, reading each page and calculating the checksum that should be on it. It will use all the disk I/O bandwidth available. This puts significant I/O load on the server, although not a particularly complex load. If successive Eseutil runs report different damage to pages, this indicates unreliability in the disk system. This is a simple test to uncover relatively obvious problems. A disk system that fails this test should not be relied on to host Exchange data in production. However, the Eseutil checksum function is unlikely to reveal subtle problems in the system.

Another test that is frequently done is to copy a large checksum-verified file (such as an Exchange database) from one disk to another. If the file copy fails with errors, or the copied file is not identical to the source, this is a strong indication of serious disk-related problems on the server.
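
A copy test of this kind can be sketched in Python by comparing a hash of the source file with a hash of the copy. The choice of SHA-256 and the file paths are assumptions made for illustration. Any reliable byte-for-byte comparison would serve the same purpose.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large database files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file and report whether the copy is identical to the source."""
    shutil.copyfile(source, destination)
    return sha256_of(source) == sha256_of(destination)


if __name__ == "__main__":
    # Hypothetical paths: a checksum-verified database copied to another disk.
    identical = copy_and_verify(r"D:\mdbdata\priv1.edb", r"E:\copytest\priv1.edb")
    print("copy is identical" if identical else "copy differs from source")
```

A mismatch is strong evidence of a problem. A match is weaker evidence, because the copy may be read back from the operating system file cache rather than from the disk itself.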

As a final note about server recovery, you should verify that the Exchange server and disk subsystem are running with the latest firmware and drivers recommended by the manufacturers. If they are not, it is possible that upgrading will resolve the underlying problem.

Microsoft works closely with manufacturers when –1018 patterns are correlated with particular components or configurations, and hardware manufacturers are continually improving and upgrading their systems. In rare cases, you may discover that –1018 errors begin occurring soon after application of a new driver or upgrade. This is another case where a standardized hardware platform can make troubleshooting and recognizing patterns easier.

Data Recovery

The first—if somewhat obvious—question to answer when deciding on a data recovery strategy is this: Is the database still running?

If the database is running, you know that the error has not damaged parts of the database critical to its continuing operation. While some user data may have already been lost, it is likely that the scope of the loss is limited.

The next question is: Do you believe the server is still reliable enough to remain in production?

At Microsoft, if a single –1018 occurs on a server but there is no other indication of system instability, the server is deemed healthy enough to remain in production indefinitely. This conclusion is subject to the appearance of additional errors.

Before deciding on a data recovery strategy, you must assess the urgency with which the strategy must be executed. Along with the current state of the database, what you have learned already from the root cause analysis will factor heavily into this assessment. The following questions must be considered:

  • Has more than one error occurred? If multiple errors have occurred, or additional errors are occurring during your troubleshooting, you should consider it highly likely that the entire platform may suddenly fail.

  • Is more than one database involved?

  • Is the platform obviously unstable? For example, suppose that you find during root cause analysis that you cannot copy large files to the affected disk without errors during the copy. It becomes much more urgent at this point to move the databases to a different platform immediately.

  • Is there a recent backup of the affected data? If you have not been monitoring backup success, backups may have been failing for days or weeks because the database was already damaged. You are at even greater risk if there is a sudden failure of the server.

If you do not have a good, recent online backup, you must make it a high priority to shut down the databases and copy the database files from the server to a safe location. If you do not have a recent online backup, and if you do not make an offline backup, you run the risk that subsequent damage to the database will make it irreparable and result in catastrophic data loss.

While it is true that the database is already damaged, it can be repaired with Eseutil, as long as the damage does not become too extensive. More detail about repairing the database is provided later in this document.

Microsoft IT chooses from several standard strategies to recover a database after a –1018 error occurs. The next sections outline the advantages and disadvantages of each strategy, along with the preconditions required to use the strategy.

Restore from Backup

Restoring from a known good backup and rolling the database forward is the only strategy guaranteed to result in zero data loss regardless of how many database pages have been damaged. This strategy requires the availability and integrity of all transaction logs from the time of backup to the present.

The reason that this strategy results in zero data loss is that after Exchange detects non-transient –1018 damage on a page, the page is never again used or updated. One of two conditions applies: either the backup copy of the database already carries the most current version of the page, or one of the transaction logs after the point of backup carries the last change made to the page before it was damaged. Thus, restoring and rolling forward expunges the bad page with no data loss.
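The reasoning above can be illustrated with a deliberately simplified model. This is a sketch only: it treats each transaction log record as a whole-page value, which is an assumption made for illustration and not a description of how the Exchange database engine actually stores log records. The page numbers and contents are hypothetical.

```python
def restore_and_roll_forward(backup_pages, transaction_logs):
    """Replay logged page updates on top of a restored backup copy."""
    pages = dict(backup_pages)              # start from the backup copy
    for page_number, contents in transaction_logs:
        pages[page_number] = contents       # later records overwrite earlier ones
    return pages


# Hypothetical data: page 7 was last changed after the backup was taken,
# page 3 has not changed since the backup.
backup = {3: "inbox v1", 7: "calendar v1"}
logs_since_backup = [(7, "calendar v2"), (9, "drafts v1")]

# The damaged production database is never read. Every page comes either
# from the backup or from a log record written before the damage occurred.
recovered = restore_and_roll_forward(backup, logs_since_backup)
print(recovered)
```

In this model, it does not matter which page was damaged in production: the recovered copy is built only from the backup and the logs, which is why the strategy depends on every log since the backup being intact.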

Note: Before restoring from a backup, you should always make a copy of the current database. Even if the database is damaged, it may be repairable. If you restore from a backup, the current database will be overwritten at the beginning of the restoration process. If restoration fails, and you have a copy of the damaged database, you can then fall back on repairing the database as your recovery strategy.

Restoration from a backup is the method used the majority of the time by Microsoft IT to recover from a –1018 error. Each Exchange database at Microsoft is sized so that it can be restored in about an hour.

Restoration is also much faster than other recovery strategies. Assuming that the server is deemed stable enough, restoration is scheduled for an off-peak time, and results in minimal disruption for end users. For more information about how Microsoft backs up Exchange, refer to the IT Showcase paper Backup Process Used with Clustered Exchange Server 2003 Servers at Microsoft [ http://www.microsoft.com/technet/itsolutions/msit/operations/exchbkup.mspx ] .

Migrate to a New Database

Exchange System Manager provides the Move Mailbox facility for moving all mailbox data from one database or server to another. This can be done while the database is online, and even while users are logged on to their mailboxes. However, most Exchange administrators prefer to schedule a general outage when moving mailboxes so that individual users do not experience a short disconnect when each mailbox is moved.

In Exchange Server 2003, mailbox moves can be scheduled and batched. In conjunction with Microsoft Office Outlook's Exchange Cached Mode, the interruption in service when each mailbox is moved often goes unnoticed by end users, who can continue to work from a cached copy of the mailbox.

For public folder databases, each folder can be migrated to a different server by replication. If additional replicas of all folders already exist on other servers, you can migrate all data by removing all replicas from the problem database. This will trigger a final synchronization of folders from this database to the other replicas.

After Exchange System Manager shows that replication has finished for all folders in a public folder database, you may delete the original database files. When you mount a database again after deleting its files, a new, empty database is generated. You can then replicate folders from other public folder servers back to this new database, if desired.

Migrating data to a different database leaves behind any –1018 or –1019 problems because bad pages will not be used during the move or replication operations. Unlike a restore and roll forward strategy, however, migrating data will not recover the information that was on the bad page. Whatever was stored on that page is left behind along with the damage.

A particular message, folder, or mailbox may fail to move, and you may notice a simultaneous –1018 error in the application log. This can allow you to identify the error and the data affected by it. In Exchange Server 2003, new move mailbox logging can report details about each message that fails to move, or can skip mailboxes that show errors during a mass move operation. For more details about configuring, batching, and logging mailbox move operations, refer to Exchange Server 2003 online Help.

Sometimes, a single bad page can affect multiple users. This is because of single instance storage. In an Exchange database, if a copy of a message is sent to multiple users, only one copy of the message is stored, and all users share a link to it.

Sometimes, the data migration will complete with no errors, even though you know there are –1018 problems in the database. This will happen if the bad page is in a database structure such as a secondary index. Such structures are not moved, but are rebuilt after data is migrated. If the Move Mailbox or replication operations complete with no errors, this indicates the bad page was in a section of the database that could be reconstructed, or in a structure such as a secondary index that could be discarded. In these cases, migrating from the database does result in a zero data loss recovery.

Moving or replicating all the data in a 50-gigabyte (GB) database can take a day or two. Therefore, if you choose a migration strategy, you must believe that the server is stable enough to remain in service long enough to complete the operation.

Repair the Database

The Eseutil and Information Store Integrity Checker (Isinteg.exe) tools are installed on every Exchange server and administrative workstation. These tools can be used to delete bad pages from the database and restore logical consistency to the remaining data.

Repairing a database typically results in some loss of data. Because Exchange treats a bad page as completely unreadable, nothing that was on the page will be salvaged by a repair. In some cases, repair may be possible with zero data loss, if the bad page is in a structure that can be discarded or reconstructed. The majority of pages in an Exchange database contain user data. Therefore, the chance that a repair will result in zero data loss is low.

Repair is a multiple stage procedure:

  1. Make a copy of the database files in a safe, stable location.

  2. Run Eseutil in repair mode (/P command-line switch). This removes bad pages and restores logical consistency to individual database tables.

  3. Run Eseutil in defragmentation mode (/D command-line switch). This rebuilds secondary indexes and space trees in the database.

  4. Run Isinteg in fix mode (-Fix command-line switch). This restores logical consistency to the database at the application level. For example, if several messages were lost during repair, Isinteg will adjust folder item counts to reflect this, and will remove missing message header lines from folders.
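The four stages above can be laid out as an ordered plan. The following is a minimal sketch that only assembles the command lines for review and runs nothing. The /p, /d, and -fix switches come from the steps above; the other Isinteg arguments (-s and -test alltests), the plain copy command, and all paths and names are assumptions added for illustration, so check the tool documentation for your Exchange version before relying on them.

```python
def repair_plan(database_path, safe_copy_dir, server_name):
    """Return the repair stages in order as (description, command line) pairs.

    Nothing is executed; the commands are only assembled for review.
    """
    return [
        ("Copy the database files to a safe, stable location",
         ["copy", database_path, safe_copy_dir]),
        ("Repair: remove bad pages, restore table-level consistency",
         ["eseutil", "/p", database_path]),
        ("Defragment: rebuild secondary indexes and space trees",
         ["eseutil", "/d", database_path]),
        ("Fix application-level consistency",
         # -s and -test alltests are assumed arguments, not taken from this paper
         ["isinteg", "-s", server_name, "-fix", "-test", "alltests"]),
    ]


# Hypothetical paths and server name.
plan = repair_plan(r"D:\exchsrvr\mdbdata\priv1.edb", r"E:\safe_copy", "EXCH01")
for number, (description, command) in enumerate(plan, start=1):
    print(f"{number}. {description}: {' '.join(command)}")
```

Keeping the copy step first in the plan reflects the point that a repair should never be attempted on the only copy of a damaged database.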

Typically, repairing a database takes much longer than restoring it from a backup and rolling it forward. The amount of time required varies depending on the nature of the damage and the performance of the system. As an estimate, the repair process often takes about one hour per 10 GB of data. However, it is not uncommon for it to be several times faster or slower than this estimate.

Repair also requires additional disk space for its operations. You must have space equivalent to the size of the database files. If this space is not available on the same drive, you can specify temporary files on other drives or servers, but doing so will dramatically reduce the speed of repair.
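As a rough planning aid, the two rules of thumb above (about one hour per 10 GB, and free space equal to the size of the database files) can be turned into a small calculation. This is a sketch under those stated estimates only; as noted above, actual repair times can be several times faster or slower.

```python
def estimate_repair(database_size_gb, gb_per_hour=10.0):
    """Estimate repair duration and extra disk space from the rules of thumb."""
    return {
        "estimated_hours": database_size_gb / gb_per_hour,
        "extra_disk_space_gb": database_size_gb,  # space equal to the database size
    }


# A 50-GB database: roughly 5 hours and 50 GB of additional space.
print(estimate_repair(50))
```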

Because repair is slow and usually results in some data loss, it should be used as a recovery strategy only when you cannot restore from a backup and roll the database forward.

There may be cases where you have a good backup, but are unable to roll the database forward. You can then combine the restoration and repair strategies to recover the maximal amount of data. This option is explored in more detail in the next section.

The database repair tools have been refined and improved continually since the first version of Exchange was released, and they are typically effective in restoring full logical consistency to a database. Despite the effectiveness of repair, Microsoft IT considers repair an emergency strategy to be used only if restoration is impossible. Because Microsoft IT is stringent about Exchange backup procedures, repair is almost never used except as part of the hybrid strategy described in the next section.

After repairing a database, Microsoft IT policy is to migrate all folders or mailboxes to a new database rather than to run a repaired database indefinitely in production.

Restore, Repair, and Merge Data

There is a hybrid recovery strategy that can be used if you are unable to roll forward with a restored database because a disaster has destroyed necessary transaction log files.

In this scenario, an older, but good, copy of the database is restored from a backup. Because the transaction logs needed for zero loss recovery are unavailable, the restored database is missing all changes since the backup was taken.

However, the damaged database likely contains the majority of this missing data. The goal is to merge the contents of the damaged database with the restored database, thus recovering with minimal data loss.

To do this, the damaged database is moved to an alternate location where it can be repaired while the restored database is running and servicing users. In Exchange Server 2003, you can use the recovery storage group feature to do the restoration and repair on the same server. In previous versions of Exchange, it was necessary to copy the database to a different server to repair it and merge data.

Bulk merge of data between mailbox databases can be accomplished in two ways:

  • Run the Mailbox Merge Wizard (ExMerge). You can download ExMerge from the Microsoft Download Center [ http://www.microsoft.com/downloads/details.aspx?FamilyID=429163EC-DCDF-47DC-96DA-1C12D67327D5&displaylang=en ] . ExMerge will copy mailbox contents between databases, suppressing copying of duplicate messages, and allowing you to filter the data merge based on timestamps, folders, and other criteria. ExMerge is a powerful and sophisticated tool for extracting and importing mailbox data.

  • Use the Recovery Storage Group Wizard in Exchange System Manager in Exchange Server 2003 SP1. The Recovery Storage Group Wizard merges mailbox contents from a database mounted in the recovery storage group to a mounted copy of the original database. Like ExMerge, the Recovery Storage Group Wizard suppresses duplicates, but it does not provide other filtering choices. For the majority of data salvage operations, duplicate suppression is all that is required. In most cases, the Recovery Storage Group Wizard provides core ExMerge functionality, but is simpler to use.
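The choice among the backup-driven strategies described so far can be summarized as a small decision function. This is a simplified sketch, not Microsoft IT's actual procedure: the function and parameter names are hypothetical, and it leaves out migration, which depends on a separate judgment about whether the server is stable enough to stay in service during the move.

```python
def choose_recovery_strategy(has_good_backup, has_all_logs_since_backup):
    """Pick a strategy based on what is available for recovery."""
    if has_good_backup and has_all_logs_since_backup:
        # The only strategy guaranteed to result in zero data loss.
        return "restore from backup and roll forward"
    if has_good_backup:
        # Logs are missing: restore the older copy, then salvage newer data
        # from a repaired copy of the damaged database.
        return "restore, repair, and merge data"
    # No usable backup: repair is the remaining option.
    return "repair the database"


print(choose_recovery_strategy(True, True))
print(choose_recovery_strategy(True, False))
print(choose_recovery_strategy(False, False))
```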

Alternate Server Restoration

Exchange allows restoration of a backup created on one server to a different server. In this scenario, you create a storage group and database on the destination server, and restore the backup to it. You can also copy log files from one server to another to roll the database forward.

This strategy may be necessary if the original server is deteriorating rapidly, and you must find an alternate location quickly to host the database. You can restore either an online backup or offline copies of the databases to the alternate server.

After the database has been restored, you must redirect Active Directory® directory service accounts to the mailboxes now homed on the new server. This can be done in either of the following ways:

  • In Exchange Server 2003, use the Remove Exchange Attributes task for all users with mailboxes in the database, followed by using the Mailbox Recovery Center to automatically reconnect all users to the mailboxes on the new server.

  • Use a script for the Active Directory attribute changes to redirect Active Directory accounts to the new server.

This is an advanced strategy. You may want to consult with Microsoft Product Support Services if it becomes necessary to use it, and you have not successfully accomplished it in the past. This strategy may also require the editing or re-creation of client Outlook profiles.

Best Practices

Microsoft IT manages approximately 95 Exchange mailbox servers that host 100,000 mailboxes worldwide. In the last year, there have been six occurrences of error –1018 across all these servers, with the errors limited to two servers.

One server had four errors and another had two errors. In the first case, the root cause was traced to a specific hardware failure. The second server is still under investigation because the two errors occurred very close together in time, but have not occurred since.

Microsoft IT has seen a general trend of decreasing numbers of –1018 errors year over year. This corresponds with the experience of many Exchange administrators who see fewer –1018 errors in Exchange today than in years past. Administrators often assume that the decrease in these errors must be due to improvements in Exchange. However, the credit really belongs to hardware vendors who are continually increasing the reliability and scalability of their products. Microsoft's primary contribution has been to point out problems that the vendors have then solved.

Along with using reliable enterprise-class hardware for your Exchange system, there are several best practices used by Microsoft IT that you can implement to reduce even further the likelihood of encountering data file corruption.

Hardware Configuration and Maintenance

Follow these best practices:

  • Disable hardware write back caching on disk drives that contain Exchange data, or ensure you have a reliable controller that can maintain its cache state if power is interrupted.

    It is important to distinguish here between caching on a disk drive and caching on a disk controller. You should always disable write back caching on a disk drive that hosts Exchange data, but you may enable it on the disk controller if the controller can preserve the cache contents through a power outage.

    When Exchange writes to its transaction logs and database files, it orders the operating system to flush those writes to disk immediately. Nearly all modern disk controllers report to the operating system that writes have been flushed to a disk before they actually have. This means that disks and controllers must ensure that writes have succeeded in case there is a power outage. There is nothing an application can do to reliably override disk system behavior and actually force writes to be secured to a disk.

  • Change cache batteries in disk controllers, uninterruptible power supplies (UPSs), and other power interruption recovery systems as manufacturers recommend. A failed battery is a common reason for data corruption after a power failure.

  • Test systems before putting them in production. Microsoft IT uses Jetstress for burn-in testing of new Exchange systems. The Medusa Labs Test Tool Suite from Finisar is normally used in Microsoft IT only for advanced forensic analysis after less sophisticated tools have not been able to reproduce a problem.

  • Test the actual drive rebuild and hot swap capabilities of your disk system for both performance and data integrity reasons. It is possible that the performance of a system will be so greatly impacted during a drive rebuild operation that it becomes unusable. There have also been cases where the drive rebuild functionality has become unstable when disks have remained under heavy load during a drive rebuild operation.

  • Power down server and disk systems in the order and by the methods recommended by manufacturers. You should know the expected shutdown times for your systems, and at which points a hard shutdown is safe or risky. Many server systems take much longer to shut down than consumer computer systems. The experience of Microsoft Product Support Services is that impatience during shutdown is an all too common cause of data corruption.

  • Standardize the hardware platform used for Exchange. Not only does this improve general server manageability, but it also makes troubleshooting and analysis of errors across servers easier.

  • Stay current on upgrades for servers, disk controllers, switches and other firmware, and software that manage disks and disk I/O.

  • Verify with your vendor that the disk controllers used with Exchange support atomic I/O, and find out the atomicity value.

    To support atomic I/O is to support writing all of the data that an application requests in a single I/O or to write none of it. For example, if an application sends a 64-KB write to a disk, and a hard failure occurs during the write, the result should be that none of the write is preserved on a disk. Atomicity is all or nothing.

    Without atomic I/O, you are vulnerable to torn pages where a chunk of disk may be composed of a mixture of old and new data. In the 64 KB example, it may be that the first 32 KB is new data and the last 32 KB is old data. In Exchange, a torn 4-KB write to the database will certainly result in a –1018 error.

    The atomicity value refers to the largest single write that the controller guarantees to write on an all or nothing basis. For example, this might be 128 KB: for any I/O request less than 128 KB, the write will happen atomically, or, in effect all at once with no possibility of a partial write. However, for write requests greater than 128 KB, there may be no such guarantee.

    Exchange issues database write commands in 4 KB or smaller chunks. Therefore, on a drive hosting only Exchange databases, a write atomicity of 4 KB is required.
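The effect of a torn write on a checksummed page can be demonstrated with a short simulation. This is a sketch under stated assumptions: CRC-32 from the Python standard library stands in for the page checksum and is not the checksum that Exchange uses, and the page contents are made up. The 4-KB page size and the 128-KB atomicity value are the figures used in the discussion above.

```python
import zlib

PAGE_SIZE = 4096  # 4-KB database page, as described in the text


def page_checksum(page: bytes) -> int:
    # CRC-32 is a stand-in here; it is NOT the Exchange page checksum.
    return zlib.crc32(page)


def torn_write(old_page: bytes, new_page: bytes, bytes_written: int) -> bytes:
    """Return what ends up on disk if a write stops after bytes_written bytes."""
    return new_page[:bytes_written] + old_page[bytes_written:]


def is_atomic(write_size: int, atomicity_value: int) -> bool:
    """A write is covered by the guarantee only if it fits the atomicity value."""
    return write_size <= atomicity_value


old = bytes([0x11]) * PAGE_SIZE        # hypothetical old page contents
new = bytes([0x22]) * PAGE_SIZE        # hypothetical new page contents
torn = torn_write(old, new, PAGE_SIZE // 2)

# The checksum recorded for the new page no longer matches what is on disk.
print(page_checksum(torn) == page_checksum(new))   # False
print(is_atomic(PAGE_SIZE, 128 * 1024))             # True
```

The mixture of old and new data is exactly what a checksum comparison exposes when the page is next read, which is how a torn write surfaces as a –1018 error.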

Operations

Follow these best practices:

  • Place Exchange databases and transaction log files in separate disk groups. As a rule, Exchange log files should never be placed on the same physical drives as Exchange database files. There are two important reasons for this:

      • Fault tolerance. If the disks hosting Exchange database files also hold the transaction logs, loss of these disks will result in loss of both database and transaction log files. This will make rolling the database forward from a backup impossible.

      • Performance. The disk I/O characteristics for an Exchange database are a high amount of random 4-KB reads and writes, typically with twice as many reads as writes. For Exchange transaction log files, I/O is sequential and consists only of writes. As a rule, mixing sequential and random I/O streams to the same disk results in significant performance degradation.

  • Track all Exchange data corruption issues across all Exchange servers. This provides you data for trend analysis and troubleshooting of subtle platform flaws. For more information, see "Appendix B: –1018 Record Keeping " later in this document.

  • Preserve Windows event logs. It is all too common for event logs generated during the bookend period to be cleared or automatically overwritten. (For details, see "Bookending" earlier in this document.) The event logs are important for root cause analysis. If you are running Exchange in a cluster, ensure that event log replication is configured, or that you gather and preserve the event logs from every node in the cluster, whether actively running Exchange or not.

Conclusion

For most organizations, huge amounts of important data are managed in Microsoft Exchange database files. Current server class computer hardware is very reliable but it is not perfect. Because Exchange data files compose many gigabytes or even terabytes of storage, it is inevitable that the database files will occasionally be damaged by storage failures.

While no administrator welcomes the appearance of a –1018 error, the error prevents data corruption from going undetected, and often provides you with an early warning before problems become serious enough that a catastrophic failure occurs.

Every –1018 error should be logged (as described in Appendix B). Moreover, every –1018 requires some kind of recovery strategy to restore data integrity (as described above in "Recovering from a –1018 Error"). However, not every –1018 error indicates failing or defective hardware.

At Microsoft, a rate of one error –1018 per 100 Exchange servers per year is considered normal and to be expected. This "1 in 100" acceptable error rate is based on Microsoft's experience with the limits of hardware reliability.

Microsoft IT will replace hardware or undertake a root cause investigation if any of the following conditions exist:

  • The –1018 error is associated with other errors or symptoms that indicate failures or defects in the system.

  • More than one –1018 error has occurred on the same system.

  • –1018 errors begin occurring above the "1 in 100" threshold on multiple systems of the same type.
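The "1 in 100" threshold and the three conditions above can be expressed as a short calculation. This is a minimal sketch with hypothetical function and parameter names; it simplifies the third condition to a single rate figure for systems of the same type, and it is not a description of any Microsoft IT tool.

```python
def errors_per_100_servers(error_count, server_count):
    """Annual –1018 errors normalized to a fleet of 100 servers."""
    return error_count * 100.0 / server_count


def needs_investigation(errors_on_system, other_symptoms,
                        rate_for_system_type, threshold=1.0):
    """Apply the three conditions listed above to one system."""
    return (other_symptoms
            or errors_on_system > 1
            or rate_for_system_type > threshold)


# Figures cited earlier in this paper: six errors in a year across about 95 servers.
print(round(errors_per_100_servers(6, 95), 1))   # 6.3
print(needs_investigation(errors_on_system=1, other_symptoms=False,
                          rate_for_system_type=1.0))   # False
```

A fleet-wide rate is only a starting point. In the figures cited earlier, all six errors fell on two servers, and each of those servers had more than one error, so the second condition alone would flag both of them.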

While there may be nothing you can do about the fact that –1018 errors occur, you can reduce the incidence of errors. If you are experiencing –1018 errors at a rate greater than one or two a year per 100 Exchange servers, the root cause analysis advice and practices outlined in this paper can be of practical benefit to you. Even if you are not experiencing excessive rates of this problem, we hope that the recovery methods suggested in this paper will help you recover more quickly and effectively.

For More Information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information through the World Wide Web, go to:

http://www.microsoft.com [ http://www.microsoft.com/ ]

http://www.microsoft.com/itshowcase [ http://www.microsoft.com/itshowcase ]

http://www.microsoft.com/technet/itshowcase [ http://www.microsoft.com/technet/itsolutions/msit/ ]

For any questions, comments, or suggestions on this document, or to obtain additional information about How Microsoft Does IT, please send e-mail to:

showcase@microsoft.com

Appendix A: Case Studies

This section outlines two case studies of actual –1018 investigations, conducted jointly by Microsoft, third-party vendors, and Exchange customers. For privacy reasons, the names of the customers and vendors are omitted, and identifying details may have been changed.

These investigations are not typical of what is required to identify the root cause for the majority of –1018 errors. Rather, they illustrate the more subtle and difficult cases that are sometimes encountered. In both cases, trending –1018 errors across a common platform was critical to the investigation.

Case Study 1

An Exchange customer with nearly 100 Exchange servers in production was experiencing occasional but recurring –1018 errors on a minority of the servers. All servers used for Exchange were from the same manufacturer, with two different models used depending on the role and load of the server. Errors occurred, seemingly at random, in both server models.

Ordinary diagnostics showed nothing wrong with any of the servers. If a –1018 error occurred on a server, another error might not occur for several months. Microsoft personnel recommended taking some of the servers out of production and running extended Jetstress tests. These tests also revealed nothing. Although the servers were all similar to each other, only a minority of the servers (about 20 percent) ever experienced –1018 problems. Still, this was far above a reasonable threshold for random errors, and so the server platform was considered suspect.

Microsoft personnel recommended tracking each –1018 error that happened across all servers in a single spreadsheet. (For details, see "Appendix B: –1018 Record Keeping " later in this document.) This technique would allow confirmation of subjective impressions and allow better analysis of subtle patterns that might have been overlooked.

Over time, 17 errors were logged in the spreadsheet and a pattern did emerge. For most of the –1018 errors, the twenty-eighth bit of the checksum was wrong. If it was not the twenty-eighth bit, it was the twenty-third or the thirty-second bit.

One of the characteristics of an Exchange checksum is that if an error introduced on a page is a single bit error (a bit flip), the checksum on the page will also differ from the checksum that should be on the page by only a single bit.

For example, suppose a –1018 error is reported with these characteristics:

  • Expected checksum (that is actually on the page): 39196aa6

  • Actual checksum (calculated as the page is read): 38196aa6

Checksums are stored in little endian format on an Exchange page. The actual checksum on the page is therefore derived by reversing the order of the four bytes that make up the eight-digit checksum. For example, for the pair of checksums analyzed in Table 1:

  • The number 51 79 f5 33 becomes 33 f5 79 51.

  • The number 41 79 f5 33 becomes 33 f5 79 41.

To determine whether two checksums match each other except for a single bit, you must convert them to binary and then use the XOR logical operator. An XOR operation compares each bit of one checksum to the corresponding bit of the other. If the bits are the same (both 0 or both 1), the XOR result is 0. If the bits are different, the XOR result is 1. Therefore, a single bit difference between two numbers will produce an XOR result with exactly one 1 in it. If more than a single bit was changed on a page, the XOR result will contain more than one 1. An illustration of this is shown in Table 1.

Checksums           Hexadecimal   Binary
Expected checksum   51 79 f5 33   00110011 11110101 01111001 01010001
Actual checksum     41 79 f5 33   00110011 11110101 01111001 01000001
XOR result                        00000000 00000000 00000000 00010000

Table 1. Checksum XOR Analysis
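The comparison in Table 1 is easy to automate when many errors have to be checked. The following is a minimal sketch using the checksum values from Table 1. The function names are hypothetical, and bit positions are counted from 0 at the least significant bit, which is an assumption and may not match the bit numbering used elsewhere in this paper.

```python
def reverse_bytes(checksum_hex):
    """Reverse the four bytes of an eight-digit hexadecimal checksum."""
    return bytes.fromhex(checksum_hex)[::-1].hex()


def differing_bits(expected_hex, actual_hex):
    """Return the positions (0 = least significant) of bits that differ."""
    diff = int(expected_hex, 16) ^ int(actual_hex, 16)
    return [bit for bit in range(32) if (diff >> bit) & 1]


def is_single_bit_flip(expected_hex, actual_hex):
    """True if the two checksums differ in exactly one bit."""
    return len(differing_bits(expected_hex, actual_hex)) == 1


expected = reverse_bytes("5179f533")   # "33f57951"
actual = reverse_bytes("4179f533")     # "33f57941"

xor_result = int(expected, 16) ^ int(actual, 16)
print(format(xor_result, "032b"))      # 00000000000000000000000000010000
print(differing_bits(expected, actual))       # [4]
print(is_single_bit_flip(expected, actual))   # True
```

Running the same check over every logged error makes it straightforward to see whether the same bit position keeps recurring.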

Patterns in –1018 corruptions are often a valuable clue for hardware vendors in identifying an elusive problem. Along with logging the checksum discrepancies, it is also useful to dump the actual damaged page for direct analysis. (For details, see "Appendix B: –1018 Record Keeping " later in this document.)

A server was finally discovered where the problem happened more than once within a short time frame. Jetstress tests were able to consistently create new –1018 errors, almost always manifesting as a change in the twenty-eighth bit of the checksum. The server was shipped to Microsoft for analysis. The errors could not be reproduced despite weeks of stress testing and diagnostics performed by both Microsoft and the manufacturer.

In the meantime, the customer noticed that –1018 errors had begun to occur on Active Directory domain controllers as well as on Exchange servers. The Active Directory database is based on the same engine as the Exchange database, and it also detects and reports –1018 errors.

It was noticed that the errors seemed to occur on the Active Directory servers after restarting the servers remotely with a hardware reset card. Investigators at Microsoft tried restarting the test server in the same way and were eventually successful in reproducing the problem.

At this point, it might seem that the reset card was the most likely suspect. However, the error did not occur every time after a restart with the card. Most of the time, there was no issue. Long Jetstress runs could be done sometimes with no errors, and then suddenly all Jetstress runs would fail serially.

Eventually, it became apparent that the problem could be reproduced almost every seventh restart with the card. The card itself was not at fault. What mattered was that the card performed a complete cold restart of the server, simulating a power reset.

After every seventh cold restart, the server would become unstable. This state would last through warm restarts until the next cold restart, at which time the server would be stable again until after another six cold restarts.

Both server models in production in the customer's organization used the exact same server component with the same part number. However, only 20 percent of the components were manufactured with this problem, which made it much harder to narrow the cause down to the faulty component.

Case Study 2

A major Exchange customer with 250 Exchange servers was plagued with frequent –1018 errors on multiple servers and multiple SAN disk systems. Rarely did a week go by without a full-scale –1018 recovery.

There had been significant data loss multiple times after –1018 errors had occurred. In one case, there was no backup monitoring being done. The most recent Exchange backup had actually been overwritten, with no subsequent backups succeeding. After a month, there was a catastrophic failure on the server, and the database was not salvageable. All user mail was lost. In another case, the first –1018 error corrupted several hundred thousand pages in the database, and the transaction log drives were also affected. Backups had also been neglected on this server as the problem worsened. The most recent backup was several weeks old, and thus all mail since then was lost.

Microsoft Product Support Services had been called multiple times over the last several months and had been mostly successful in recovering data after each problem. However, each of these cases involved individual server operators and Product Support Services engineers, working in isolation on recovery, but not focusing on root cause analysis across all servers.

The data loss cases got the attention of both the Microsoft account team and the Exchange customer's executive management. As Microsoft began correlating cases and asking for more information about the prevalence of the issue, it became clear quickly that the rate of –1018 occurrences was far above the standard threshold.

Information about past issues was mostly unavailable or incomplete. However, Microsoft created a spreadsheet to track each new problem. The spreadsheet started to fill quickly, and patterns began to emerge. The problem was that there was no single pattern, but multiple patterns.

In several cases, the lowest two bytes of the checksum were changed. This seemed promising, but then came several errors where bits 29 and 30 were wrong, with nothing else in common. Then there was an outbreak of errors where there were large-scale checksum differences with no discernible pattern in the checksums or the damaged pages. On some servers, there were multiple bad pages. There were frequent transient –1018 errors, and checksum verification of a full database would frequently reveal different errors on successive runs.

The investigation and resolution lasted almost a year. As time went on, it became clear that some servers and disk frames were much more problematic than others, and that this was not just a general problem with all the Exchange servers across the organization. During that year, the following problems were discovered to be root causes of –1018 errors:

  • Server operators were hard cycling servers with disk controllers that had no I/O atomicity guarantees.

  • SANs without logical unit number (LUN) masking allowed multiple servers to control a single disk simultaneously, and thus corrupt it.

  • Badly out-of-date firmware revisions were in use, including versions known to cause data corruption.

  • Cluster systems had not passed Windows Hardware Quality Labs (WHQL) certification. These clusters had disk controllers that were unable to handle in-flight disk I/Os during cluster state transitions.

  • Antivirus applications were not configured correctly to exclude Exchange data files. This was causing sudden quarantine, deletion, or alteration of Exchange files and processes. Generic file scanning antivirus programs should never be used on Exchange databases. Many vendors have effective Exchange-aware scanners that implement the Microsoft Exchange antivirus APIs.

  • A vendor hardware bug accounted for a minority of the errors.

  • Aging and progressively failing hardware, which had exceeded its lifecycle, caused obvious problems.

Correcting the –1018 root causes was an arduous but ultimately worthwhile process. It required not only changes to hardware and configurations, but also operational improvements. The organization dramatically reduced the incidence of –1018 errors and, by implementing effective monitoring and recovery procedures, greatly decreased the impact of each error on end users.

This case study contrasts sharply with Case Study 1. In Case Study 1, a mysterious and subtle hardware bug was the single root cause for all the failures. However, for most Exchange administrators, the key to reducing and controlling –1018 errors will be implementing ordinary operational improvements. Most of the time, the patterns revealed by keeping track of –1018 errors across your organization will point to obvious errors and problems that should be defended against. Case Study 1, while perhaps more interesting, was atypical, while Case Study 2 is representative of the process that several Exchange organizations have gone through to control and reduce –1018 errors.

Appendix B: –1018 Record Keeping

For the majority of –1018 errors, the root cause will be indicated by another correlated error or failure. For errors where the cause is not so obvious, tracking –1018 errors across time and across servers is critical for identifying the root cause.

Even for errors where the root cause is easily determined, there is still value in consistently tracking –1018 errors. You can learn how the errors affect your organization, and where operational and other improvements could reduce the impact of the errors.

You may want to track errors in a database, in a spreadsheet, or using a simple text file. At Microsoft, Microsoft Office Excel 2003 spreadsheets are used. The following list of fields can be adapted to your needs and your willingness to track detailed information.

Essentials

These files should always be saved for each –1018 error:

  • Application and system logs for the bookend period, from the time of the last good backup to the time when the –1018 error was reported.

  • Page dumps.

Eseutil Page Dump

This Eseutil facility will show you the contents of important header fields on the page. This command requires the logical page number. You can calculate the logical page number from the error description as described in "Page Ordering" earlier in this document.

If, for example, logical page 578 is damaged in the database file Priv1.edb, you can dump the page to the file 578.txt with this command:

Eseutil.exe /M priv1.edb /P578 > 578.txt

Note that there is no space between the /P switch and the page number.

The output of this command might look similar to this:

Microsoft(R) Exchange Server Database Utilities

Version 6.5

Copyright (C) Microsoft Corporation. All Rights Reserved.

Initiating FILE DUMP mode...

Database: priv1.edb

Page: 578

checksum <0x03300000, 8>: 2484937984258 (0x0000024291d88902)

expected checksum = 0x0000024291d88902

****** checksum mismatch ******

actual checksum = 0x00de00de91d889fd

new checksum format

expected ECC checksum = 0x00000242

actual ECC checksum = 0x00de00de

expected XOR checksum = 0x91d88902

actual XOR checksum = 0x91d889fd

checksum error is NOT correctable

dbtimeDirtied <0x03300008, 8>: 12701 (0x000000000000319d)

pgnoPrev <0x03300010, 4>: 577 (0x00000241)

pgnoNext <0x03300014, 4>: 579 (0x00000243)

objidFDP <0x03300018, 4>: 114 (0x00000072)

cbFree <0x0330001C, 2>: 6 (0x0006)

cbUncommittedFree <0x0330001E, 2>: 0 (0x0000)

ibMicFree <0x03300020, 2>: 4038 (0x0fc6)

itagMicFree <0x03300022, 2>: 3 (0x0003)

fFlags <0x03300024, 4>: 10370 (0x00002882)

Leaf page

Primary page

Long Value page

New record format

New checksum format

TAG 0 cb:0x0000 ib:0x0000 offset:0x0028-0x0027 flags:0x0000

TAG 1 cb:0x000e ib:0x0000 offset:0x0028-0x0035 flags:0x0001 (v)

TAG 2 cb:0x0fb8 ib:0x000e offset:0x0036-0x0fed flags:0x0001 (v)

If you do not see a checksum mismatch in the dump, that does not necessarily mean that the –1018 error is transient. It is possible that a mistake was made in calculating the logical page number. It is a good idea to double-check your arithmetic, and to dump the preceding and next pages as well if you do not find a –1018 error on the dumped page. Running Eseutil /K against the entire database will also provide an additional check.
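As a cross-check on the page arithmetic, the following Python sketch converts a byte offset from the error description into a logical page number and builds the Eseutil dump commands for that page and its two neighbors. It assumes 4 KB database pages and the conversion (byte offset ÷ 4,096) – 1, which allows for the header pages at the start of the file; confirm both against "Page Ordering" earlier in this document before relying on the result. The function names are hypothetical and are not part of any Exchange tool.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages, as in Exchange Server 2003


def logical_page_number(byte_offset: int, page_size: int = PAGE_SIZE) -> int:
    """Convert a database file byte offset to a logical page number.

    Assumed formula: (byte offset / page size) - 1.
    """
    if byte_offset % page_size != 0:
        # Offsets in -1018 events should fall on a page boundary.
        raise ValueError("byte offset is not a multiple of the page size")
    page = byte_offset // page_size - 1
    if page < 1:
        raise ValueError("offset falls inside the database header")
    return page


def dump_commands(database: str, page: int) -> list[str]:
    """Build Eseutil page dump commands for the previous, current, and next page."""
    return [
        f"Eseutil.exe /M {database} /P{n} > {n}.txt"
        for n in (page - 1, page, page + 1)
        if n >= 1
    ]


if __name__ == "__main__":
    page = logical_page_number(2371584)  # 579 * 4096
    for command in dump_commands("priv1.edb", page):
        print(command)
```

Dumping the neighboring pages as well makes an off-by-one mistake in the arithmetic easy to spot, because the damaged page will then show up in one of the adjacent dumps.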

Required Error Information

For each –1018 occurrence, you should always log the following:

  • Application log –1018 event information:

      • Date and time

      • Server name

      • Event ID

      • Event description

  • If a cluster, cluster node where the error occurred

  • Server make and model

  • Storage type:

      • Direct access storage device (DASD)

      • Fiber Channel Storage Area Network (SAN)

      • Internet small computer system interface (iSCSI) SAN

      • Network-attached storage

  • Storage make and model:

      • Disk controller

      • Multiple path configuration

  • Permanent location or share for event, log, and dump files
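If you prefer a simple text file to a spreadsheet, the required fields above could be kept in a CSV log. The following Python sketch is one possible layout, not the format Microsoft IT uses; the class name, field names, and sample values (including the server name, event ID, and share path) are hypothetical placeholders.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Error1018Record:
    """One row per -1018 occurrence; fields mirror the required list above."""
    date_time: str
    server_name: str
    event_id: int
    event_description: str
    cluster_node: str            # empty if the server is not clustered
    server_make_model: str
    storage_type: str            # DASD, Fibre Channel SAN, iSCSI SAN, or NAS
    storage_make_model: str
    disk_controller: str
    multipath_configuration: str
    files_location: str          # share holding event logs and page dumps


def append_record(stream, record: Error1018Record, write_header: bool = False) -> None:
    """Append one record to an open text stream in CSV form."""
    names = [f.name for f in fields(Error1018Record)]
    writer = csv.DictWriter(stream, fieldnames=names)
    if write_header:
        writer.writeheader()
    writer.writerow(asdict(record))


# Hypothetical sample values, for illustration only.
sample = Error1018Record(
    date_time="2005-08-01 02:15",
    server_name="EXCH01",
    event_id=474,
    event_description="Page checksum mismatch, error -1018",
    cluster_node="",
    server_make_model="(make and model)",
    storage_type="Fibre Channel SAN",
    storage_make_model="(make and model)",
    disk_controller="(controller)",
    multipath_configuration="(configuration)",
    files_location=r"\\fileserver\1018\EXCH01",
)

if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for a file opened in append mode
    append_record(buffer, sample, write_header=True)
    print(buffer.getvalue())
```

Keeping every occurrence in one file with a fixed set of columns makes it easier to sort by server, storage frame, or controller when you look for patterns across the organization.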

Additional Information

For each –1018 occurrence, you can also note the following:

  • Bookend period anomalies:

      • Restart

      • Cluster transition

      • Disk error

      • Memory error

      • Other

  • File offset

  • Logical page number (calculated from byte offset)

  • Actual checksum (calculated at run time)

  • Expected checksum (read from page)

  • Binary actual checksum

  • Binary expected checksum

  • Checksum XOR result

  • How discovered (run time, mount failure, or backup failure)

  • Server unavailable or available

  • Last good backup time

  • Error confirmed by (for example, Eseutil /m /p or Eseutil /k)

  • Permanent or transient error

  • Location of files (Eseutil and Esefile page dumps, raw page dumps, MPSReports)

  • Server hardware

  • Server BIOS

  • Controller

  • Controller firmware revision

  • Storage

  • Impact (databases affected)

  • Recovery downtime

  • Recovery strategy

  • Root cause

  • Comments

  • Entry by

XOR Calculation Sample for Excel

Appendix A described how to compare checksums to look for patterns. The Microsoft Office Excel formulas below can be used to automate this comparison. You must install the Analysis Toolpak for Excel for the necessary functions to be available. The Toolpak can be installed from the Tools, Add-Ins menu in Excel.

Converting a Hexadecimal Checksum to Binary

Copy this formula into an Excel cell. This formula assumes that the hexadecimal checksum is in cell A1. If the hexadecimal checksum is in a different cell, change each reference to A1 in the formula to represent the actual cell. Ignore line breaks in the formula—it is intended to be a single line in Excel:

=CONCATENATE(HEX2BIN(MID(A1,7,2),8)," ",HEX2BIN(MID(A1,5,2),8)," ",HEX2BIN(MID(A1,3,2),8)," ",HEX2BIN(MID(A1,1,2),8))

This formula also reverses the order of the bytes in the checksum to conform to the Intel little-endian storage format.

Using XOR with Two Binary Checksums

This formula assumes that the binary checksums are in cells B1 and B2. If the checksums are in other cells, replace each occurrence of B1 or B2 as appropriate. Ignore line breaks in the formula—it is intended to be a single line in Excel:

=CONCATENATE((HEX2BIN(BIN2HEX(VALUE(SUBSTITUTE(MID(B1,1,8)+MID(B2,1,8),2,0)),8),8))," ",(HEX2BIN(BIN2HEX(VALUE(SUBSTITUTE(MID(B1,10,8)+MID(B2,10,8),2,0)),8),8))," ",(HEX2BIN(BIN2HEX(VALUE(SUBSTITUTE(MID(B1,19,8)+MID(B2,19,8),2,0)),8),8))," ",(HEX2BIN(BIN2HEX(VALUE(SUBSTITUTE(MID(B1,28,8)+MID(B2,28,8),2,0)),8),8)))

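If Excel is not available, the same comparison can be scripted. The following Python sketch mirrors the two Excel formulas above: it converts an 8-digit hexadecimal checksum to four space-separated binary bytes in reversed byte order, and then applies XOR to two such strings. The function names are hypothetical, and the sample values are the expected and actual XOR checksums from the page dump shown earlier in this appendix.

```python
def hex_to_binary(checksum_hex: str) -> str:
    """Convert an 8-digit hex checksum to binary, reversing the byte order.

    Mirrors the first Excel formula: the last byte of the hex string is
    written first.
    """
    digits = checksum_hex.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 8:
        raise ValueError("expected an 8-digit (32-bit) hexadecimal checksum")
    raw = bytes.fromhex(digits)
    return " ".join(f"{byte:08b}" for byte in reversed(raw))


def xor_binary(first: str, second: str) -> str:
    """XOR two binary checksum strings produced by hex_to_binary."""
    pairs = zip(first.split(), second.split())
    return " ".join(f"{int(a, 2) ^ int(b, 2):08b}" for a, b in pairs)


if __name__ == "__main__":
    expected = hex_to_binary("0x91d88902")
    actual = hex_to_binary("0x91d889fd")
    difference = xor_binary(expected, actual)
    print("expected:", expected)
    print("actual:  ", actual)
    print("xor:     ", difference)
    print("bits that differ:", difference.count("1"))
```

Each 1 in the XOR result marks a bit that differs between the two checksums, so recurring positions across many errors stand out when the results are listed one under another.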

Situation

Error –1018 signals that an Exchange database file has been damaged by a hardware or file system problem. Exchange reports this error to provide early warning of possible server failure and data loss.

Solution

This paper shows you how Microsoft IT responds to this error and recovers affected Exchange data. It also covers the methods and tools used to find root causes and resolve the underlying problems responsible for the error.

Benefits

  • Improve your monitoring of Exchange data integrity.
  • Increase your ability to determine seriousness and urgency of –1018 errors.
  • Learn specific recovery strategies and how to decide when to implement them.
  • Improve your operational effectiveness in handling hardware and data integrity problems.

Products & Technologies

  • Microsoft Exchange Server 2003
  • Microsoft Windows Server 2003
  • Exchange Jetstress and LoadSim I/O and capacity modeling tools
  • Microsoft Office Excel 2003
  • Medusa Labs Test Tool Suite by Finisar
  • Exchange Eseutil and Isinteg repair and integrity verification tools
