More on logging and recovery
in distributed DBs, etc.
Transaction Processing, Note 6, Sp98, V1.0
6/1/98
COM1317, Transaction Processing, Spring 1998, Professor R. P. Futrelle
This is the sixth in a series of notes that summarize important concepts in transaction processing. The first part draws on material from the books mentioned in Note 4.
Distributed DBs -- Multiple managers can co-exist in a single machine and be coordinated by two-phase commit protocols. But the widespread use of truly distributed systems, with distinct sites connected over networks, and the challenges they present for ACID operation, make them important and interesting. The network topology of such systems can be important for reliability. In the star topology of Figure 6.1a, a single broken link is enough to partition the net into two non-communicating components (one node and all the rest). The topologies in 6.1b and 6.1c give increasing connectivity, so that no single broken link can partition the network.
Figure 6.1. Greater connectivity in distributed database systems helps to protect against failure. (a) A single link will separate (partition) this network. (b) At least two links have to be broken to separate this one. (c) Redundant links also enhance reliability.
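As a small illustration of this connectivity point, the Python sketch below removes each link of a topology in turn and checks whether the remaining network stays connected. The node names and the ring example are invented for illustration; they are not the actual graphs of Figure 6.1.

    def connected(nodes, links):
        # Flood-fill search from an arbitrary node; True if every node was reached.
        seen = {next(iter(nodes))}
        frontier = list(seen)
        while frontier:
            n = frontier.pop()
            for a, b in links:
                for x, y in ((a, b), (b, a)):
                    if x == n and y not in seen:
                        seen.add(y)
                        frontier.append(y)
        return seen == set(nodes)

    def critical_links(nodes, links):
        # The links whose individual removal partitions the network.
        return [l for l in links if not connected(nodes, [k for k in links if k != l])]

    star_nodes = {"hub", "n1", "n2", "n3"}
    star_links = [("hub", "n1"), ("hub", "n2"), ("hub", "n3")]
    ring_nodes = {"n1", "n2", "n3", "n4"}
    ring_links = [("n1", "n2"), ("n2", "n3"), ("n3", "n4"), ("n4", "n1")]

    print(critical_links(star_nodes, star_links))  # every star link is critical
    print(critical_links(ring_nodes, ring_links))  # [] -- no single break partitions a ring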
The nodes in a distributed system distribute the computational resources. Any node that supports transactions is usually assumed to have its own DB and log (or multiple DBs and logs). The DB at a node may contain no information that overlaps with the DB at another node; the two nodes can still be involved in a single transaction, e.g., an account transfer. But separate nodes may also contain some data that is replicated from other nodes. In the extreme case one node may contain a complete replica of the DB of another. This is particularly useful for non-stop systems that must run continuously in the face of failure. The two nodes can both execute exactly the same transactions, or be kept in sync by other strategies, so that if one fails, the other can continue. Control of processing can be transferred quickly to the still-functioning system, a hot swap. The failed system can then be restored, after which it must catch up, so that its DB is brought up to match the current state of the system that continued to process transactions while recovery was proceeding.
The replication of data across nodes is often only partial. The full description and company-wide inventories of goods may be fully replicated on a pair of central machines, with subsidiary nodes containing only records of local inventories. When relational tables are broken up in this way, they may be vertically or horizontally fragmented. In vertical fragmentation, only certain fields are replicated. This may lead to a loss of uniqueness of records, which is critical for relational DBs, so often a primary key is created for each fragmented record and included in it. Horizontal fragments are full records (rows) from tables, but just those records of interest to a site.
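A tiny, hypothetical illustration of the two kinds of fragmentation, in Python (the table and field names below are invented, not taken from any particular system):

    # Hypothetical inventory records; the field names are invented for illustration.
    inventory = [
        {"part_id": 101, "description": "bolt", "qty_boston": 40, "qty_seattle": 15},
        {"part_id": 102, "description": "nut",  "qty_boston": 0,  "qty_seattle": 75},
    ]

    # Vertical fragment for the Seattle site: only certain fields are kept, and the
    # primary key (part_id) is included in every fragment so that records remain unique.
    seattle_vertical = [{"part_id": r["part_id"], "qty_seattle": r["qty_seattle"]}
                        for r in inventory]

    # Horizontal fragment: full rows, but only those of interest to the site
    # (here, parts actually stocked in Seattle).
    seattle_horizontal = [r for r in inventory if r["qty_seattle"] > 0]

    print(seattle_vertical)
    print(seattle_horizontal)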
Two-phase commit --
The basic two-phase commit protocol was explained in Note 5. This section will focus more on the details of recovery. Note that the entire process starts because an application starts a transaction that ultimately will require related transactions at other sites. When the application decides to commit, the coordinator must initiate the two-phase commit protocol. (I have not seen any discussion in the literature about what a participant might do if it is executing a transaction and never receives a Request-to-prepare message. Presumably it can abort on time-out, but it still has to remain active in order to respond to the coordinator message.)
Figure 6.2. This shows the general organization of the two-phase commit protocol, including some of the many points at which the system could fail. This example shows the coordinator on the left and two participants on the right. A failure could be due to a coordinator crash, a network outage or long delay (in the center), or a participant crash. In addition, various log records are either written lazily to the log buffer or forced to disk, at various points in time, by both the coordinator and the participants, as explained in the text.
The coordinator may or may not log a Start-two-phase-commit record when it begins (see below). A participant always logs a Prepared record when it has finished all the actions in its transaction, forced them to disk in its log, and is ready to send a Prepared message. If a participant decides to abort a transaction before replying to the Request-to-prepare message, it logs an Abort record, rolls back its transaction, and sends a NO message to indicate what it has done. When the first NO message arrives at the coordinator, it logs an Abort record and notifies all participants that have not sent it a NO message (or have sent no message at all) that they are to abort. If instead all the participants reply Prepared, the coordinator logs a Commit record and sends Commit to all participants. When a participant receives a Commit message, it logs a Commit record (its data-update log records have already been forced to disk). It is the logging of the Commit or Abort record by the coordinator that settles the status of the distributed transaction.
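A compressed Python sketch of the message and log sequence just described. It is only an illustration under simplifying assumptions: the participants are modeled as vote functions, the "logs" are plain lists standing in for forced log writes, and the coordinator collects all votes before deciding rather than aborting on the first NO.

    def two_phase_commit(tid, participants, coordinator_log, participant_logs):
        # Phase 1: ask every participant to prepare and collect the votes.
        votes = {}
        for name, vote in participants.items():
            v = vote(tid)
            # The participant forces a Prepared (or Abort) record before replying.
            participant_logs[name].append(("PREPARED" if v == "PREPARED" else "ABORT", tid))
            votes[name] = v

        # Phase 2: the coordinator's forced Commit or Abort record settles the outcome.
        decision = "COMMIT" if all(v == "PREPARED" for v in votes.values()) else "ABORT"
        coordinator_log.append((decision, tid))

        # Notify every participant that voted PREPARED; each logs the decision in turn.
        for name, v in votes.items():
            if v == "PREPARED":
                participant_logs[name].append((decision, tid))
        return decision

    coord_log = []
    part_logs = {"P1": [], "P2": []}
    parts = {"P1": lambda tid: "PREPARED", "P2": lambda tid: "NO"}
    print(two_phase_commit("T1", parts, coord_log, part_logs))  # ABORT
    print(coord_log, part_logs)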
Some failure examples --
a. If the coordinator fails at a, then when it recovers, it may or may not find a Start-two-phase-commit record in its log. If its policy is not to write one, and no record of the transaction is found at all, then it will correctly presume that the transaction aborted. This is called the presumed abort protocol.
b. If the network connection to P1 fails at b, then P1 never receives the message and will never reply. But the coordinator will be expecting a reply and will eventually time out and abort the transaction. If P1 later reconnects and queries the coordinator, the coordinator will find an Abort record in its log.
c. If P2 fails at c and it has logged its Prepared record, then when it recovers it can reply again to the coordinator, which will then continue trying to commit the transaction, or, if it has already aborted, will be able to tell this from its log and tell P2 to abort.
d. If the coordinator has logged the Start-two-phase-commit record, it must include a list of all participants, and it must do this before sending any messages, so it will know which participants to contact if it fails at d.
e. If P1 fails after replying and before time e, then it can recover as in c.
f. If the message doesn't make it back to the coordinator, P2 can consult its log when it is reconnected and the coordinator requests its status.
g. If the coordinator fails at g after receiving some of the messages, then it can use its Start-two-phase-commit record to recontact the participants and finish the transaction.
h. If the coordinator fails at h it will already have logged either a Commit or Abort record and the fate of the transaction is settled. All it needs to do on recovery is to inform the participants of its decision. This case, and the related case of a disconnect at i, is the most difficult, because once a participant has returned a Prepared message, it may not commit or abort until told to do so by the coordinator; it is blocked. While it is blocked, it must continue to hold all its locks. The failure holding it in the blocked state may be of long duration, which can cause many problems. As we said in Note 5, the participant can poll other participants to see if any of them have received either a commit or an abort message. (And if it finds a participant that sent a NO message, it can abort as well.)
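Most of these cases come down to what a recovering coordinator (or an inquiring participant) can conclude from the coordinator's log. Here is a minimal Python sketch of that decision under the presumed abort policy of case a; the record names follow the text, but the function itself is only illustrative.

    # What should a recovering coordinator report about transaction tid?
    def coordinator_status(coordinator_log, tid):
        records = [kind for kind, t in coordinator_log if t == tid]
        if "COMMIT" in records:
            return "COMMIT"      # settled by the forced Commit record
        if "ABORT" in records:
            return "ABORT"
        if "START-2PC" in records:
            return "IN-DOUBT"    # recontact the participants listed in the Start record (cases d, g)
        return "ABORT"           # presumed abort: no record at all, so presume it aborted

    log = [("START-2PC", "T2"), ("COMMIT", "T1")]
    print(coordinator_status(log, "T1"))  # COMMIT
    print(coordinator_status(log, "T2"))  # IN-DOUBT
    print(coordinator_status(log, "T3"))  # ABORT (presumed)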
Serializability -- One might think that serializability in a distributed system is extraordinarily complex. But in fact, it is automatic! The following remark is excerpted from the Lewis, A. Bernstein, and M. Kifer book (see Note 4):
If the concurrency controls at each site independently use a strict two-phase locking protocol and the system uses a two-phase commit protocol, then every global schedule is serializable (in the order in which their coordinators have committed them).
Object-oriented DBs, creating the log -- Interestingly, it's not necessary to know much about OODBs in order to understand the recovery issues involved. Each persistent object, e.g., an instance of a C++ class, a Lisp CLOS class, or a Java class, has a unique persistent identifier referring to its image in the DB, a dbpid. A log file is just a persistent object store similar in form to an OODB, but it contains a simple sequence of log records. All objects in the log file are identified by log persistent identifiers, logpids. Each log record will typically contain a header as shown below, followed by the actual object data for the undo and redo records.
It is simplest (and often correct) to think of the pids as byte locations in the corresponding files.
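The header layout itself is not reproduced in this text, so the following Python sketch is only one plausible reading of it, based on the fields the notes mention: the dbpid of the object and the logpids of its undo and redo images. The transaction-id field and all the names are assumptions.

    from dataclasses import dataclass

    @dataclass
    class LogRecordHeader:
        tid: int          # transaction id (assumed field)
        dbpid: int        # persistent id of the object in the DB (a byte location)
        undo_logpid: int  # logpid of the before-image written to the log
        redo_logpid: int  # logpid of the after-image written to the log

    # The header is followed in the log file by the actual undo and redo object
    # images, located at undo_logpid and redo_logpid.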
When a write of an object to the DB is done, such as a class instance or vector, we will assume a basic recursive strategy: If the object in a slot or vector element is a simple one, an immediate such as a number or character, something that does not have full object status, then it is simply written to the DB. If it is a reference (a pointer) to a more complex object, then, if that object is already in the DB, it is not copied to the DB again. If it has never been stored, it will be written out, and so forth, recursively. Our discussion below refers to a single write of one object in this recursive sequence. If the application updates an object, object2, that is referred to in another object, object1, and a copy of object1 already exists in the DB, it is up to the application to write out the updated object2 -- this will not happen automatically when object1 is written, just because object2 is referenced there. The TP system checks a table maintained in memory that maps between objects in memory and objects in the DB; that is how it knows whether or not it should write an object to the DB. It should be clear that to write any updated object that already exists in the DB, the application must use some kind of forced write, since only the application knows that the object has changed and requires updating in the DB.
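A Python sketch of this recursive store strategy. The written_map table plays the role of the in-memory map between objects and their dbpids; objects are modeled as lists, immediates as Python numbers and strings, and the details are invented for illustration.

    def store(obj, db, written_map):
        # Immediates (no full object status) are written directly into the image.
        if isinstance(obj, (int, float, str)):
            return obj
        # A reference to an object already in the DB is not copied again.
        if id(obj) in written_map:
            return ("dbpid", written_map[id(obj)])
        # A never-stored object is written out, recursively.
        dbpid = len(written_map)            # assign the next persistent id
        written_map[id(obj)] = dbpid
        db[dbpid] = [store(x, db, written_map) for x in obj]
        return ("dbpid", dbpid)

    db, written = {}, {}
    shared = ["shared", 1]
    store([shared, shared, 2], db, written)
    print(db)   # the shared sublist is stored once and referenced twice by dbpid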
When a write is attempted to the DB, logging is done before any changes are made to the DB page cache. The log record is created. The current DB version of the object is found in the page cache, or read in from the DB to the cache if its page is not in memory. If the object was never written to the DB before, that is noted by the logger (there is no before-image). The critical step now occurs: the object in the DB cache is written to a fresh location in the log, receiving a logpid in the process, which is inserted as the undo logpid in the log header. It is important to note that any object references in this logged undo record are dbpids. Now the write to the DB by the transaction is made, except that it is intercepted and directed to the log buffer as the redo object. The logpid for the redo record is written to the log header. After this, the write is repeated to the DB cache page(s). These updates can accumulate in the DB cache and the log buffer as long as the log is forced before any of the updated DB cache pages are flushed (the standard WAL = write-ahead log rule). Depending on the language, the size of any object is known either at compile time or dynamically, at run time, so the system always knows how many bytes to write.
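A Python sketch of the sequence of steps for a single logged write, in the order given above. The page cache and log buffer are simplified to a dictionary and a list, and the points where the real system would force the log or flush cache pages are only marked in comments.

    def logged_write(tid, dbpid, new_image, db_cache, log_buffer):
        # 1. Find the current DB version of the object in the cache (a real system
        #    would read its page in from the DB if it were not in memory).
        before = db_cache.get(dbpid)                 # None means "never written before"
        # 2. Write that before-image to a fresh location in the log: the undo record,
        #    whose logpid goes into the header.  Any references in it are dbpids.
        undo_logpid = len(log_buffer)
        log_buffer.append(("UNDO", tid, dbpid, before))
        # 3. The transaction's write is intercepted and appended as the redo record.
        redo_logpid = len(log_buffer)
        log_buffer.append(("REDO", tid, dbpid, new_image))
        # 4. The header ties the undo and redo logpids to the object's dbpid.
        log_buffer.append(("HDR", tid, dbpid, undo_logpid, redo_logpid))
        # 5. Only now is the DB cache page updated.  WAL rule: the log buffer must be
        #    forced to disk before this cache page is ever flushed.
        db_cache[dbpid] = new_image

    cache, log = {}, []
    logged_write("T1", 7, {"balance": 120}, cache, log)
    logged_write("T1", 7, {"balance": 95}, cache, log)
    print(log)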
Object-oriented DBs, recovery -- Let us look first at rollback. In an ordinary relational DB rollback, the focus is entirely on updating the values in the DB, so that they reflect the earlier values. To do this, the undo records in the log are applied to the DB.
In the OO case, there is also an object in memory that is supposed to be a mirror of the DB object. Just rolling back the value of the object in the DB, which is straightforward, is only half the story. If a transaction is rolled back in the DB, so that in effect, no changes were ever made to the DB, it is still the case that the corresponding in-memory objects may reflect the updated values. The in-memory objects can be simply and automatically rolled back once the undos have been done to the DB (cache) by doing load operations for each one. A load operation for an object is the inverse of the store operation we have been discussing up to this point. A load simply sets the in-memory value to the value in the DB (cache). This would require no intervention by the application programmer. Since ACID transactions are supported, rolling back an in-memory value will not corrupt other transactions that refer to it, because isolation and serializability are maintained. In this scheme, if the application alters an in-memory object after its last store to the DB, the application must not assume that the in-memory value will be maintained, because an abort may occur. In another approach, if the transaction requests a rollback it would have to manage the rollback of the in-memory values itself. As in all databases, in-memory values in OODB systems will be lost in a crash, so that the only true record of the system state is the one recorded durably on disk.
In crash recovery, all memory state is lost, so redos are simply applied to the DB itself by traversing the log in the forward direction.
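A sketch of both operations, using log records shaped like those in the write sketch above. The load step that refreshes each in-memory object from the DB cache is represented by a callback, since the real mechanism depends on the language binding; a full system would also consult commit records before redoing, a detail omitted here.

    # Rollback of one transaction: apply its undo records backwards, then "load"
    # each affected object so the in-memory copy again mirrors the DB (cache).
    def rollback(tid, db_cache, log_buffer, load):
        for kind, t, dbpid, image in reversed([r for r in log_buffer if r[0] == "UNDO"]):
            if t == tid:
                if image is None:
                    db_cache.pop(dbpid, None)   # the object had never been in the DB
                else:
                    db_cache[dbpid] = image
                load(dbpid)                     # refresh the in-memory object from the DB

    # Crash recovery: all memory state is lost, so the redo records are simply
    # applied to the DB itself by traversing the log in the forward direction.
    def redo_pass(db, log_on_disk):
        for record in log_on_disk:
            if record[0] == "REDO":
                _, tid, dbpid, image = record
                db[dbpid] = image

    db_cache = {7: {"balance": 95}}
    log = [("UNDO", "T1", 7, {"balance": 120}),
           ("REDO", "T1", 7, {"balance": 95}),
           ("HDR", "T1", 7, 0, 1)]
    rollback("T1", db_cache, log, load=lambda dbpid: None)  # load stubbed out here
    print(db_cache)  # {7: {'balance': 120}}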
In dynamic languages such as Lisp, where introspection is available to analyze types at runtime, it is possible to do certain types of optimization. For example, it is not efficient to produce full log records for tiny objects such as single list cells (cons cells in Lisp). So if the system is storing a list, it could be done with a single custom log record. There is also the question of which list items may have been updated. Similarly, if a store is done of a very large vector in which only a few items have been changed, a more efficient image could be constructed of just the altered items, rather than a copy of the large vector, most of whose items are unchanged. Some OODB systems support specialized accessors so that it is possible to store an item in an array in the persistent DB without dealing with the entire array. Again, specialized log records could be introduced for this.
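A sketch of the kind of specialized log record suggested here, for a store of a large vector in which only a few elements have changed; the record format is invented for illustration and assumes the vector's length is unchanged.

    # Instead of logging a full copy of a large vector, log only the changed slots.
    def partial_vector_record(tid, dbpid, old_vec, new_vec):
        changed = [(i, old_vec[i], new_vec[i])
                   for i in range(len(new_vec)) if old_vec[i] != new_vec[i]]
        # Each entry carries (index, before, after), so the same record serves
        # for both undo and redo of just the altered items.
        return ("PARTIAL-VECTOR", tid, dbpid, changed)

    old = [0] * 1000
    new = list(old)
    new[3], new[977] = 42, 7
    print(partial_vector_record("T1", 12, old, new))
    # ('PARTIAL-VECTOR', 'T1', 12, [(3, 0, 42), (977, 0, 7)])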