
Linux Journaling Filesystems: A Detailed Introduction

2012-06-22 15:38
Original article:

Linux: The Journaling Block Device

Submitted by Kedar Sovani
on June 20, 2006 - 11:40pm

Linux kernel

Atomicity is the property of an operation either succeeding or failing completely. Disks assure atomicity at the sector level. This means that a write to a sector either goes through completely or not at all. But when an operation spans multiple sectors of the disk, a higher-level mechanism is needed. This mechanism should ensure that modifications to the entire set of sectors are handled atomically. Failure to do so leads to inconsistencies. This document talks about the implementation of the Journaling Block Device in Linux.

Let's look at how these inconsistencies could be introduced to a filesystem. Say we have an application that creates a file. The filesystem internally has to decrease the number of free inodes by one, initialize the inode on the disk and add an entry to the parent directory for the newly created file. But what happens if the machine crashes after only the first operation is executed? In this circumstance, an inconsistency has been introduced in the filesystem. The number of free inodes has decreased, but no initialization of the inode has been performed on the disk.
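
The crash described above can be sketched with a toy model. This is only an illustration under stated assumptions: the dictionary layout, `create_file`, `is_consistent` and `CrashError` are hypothetical names, not part of any real filesystem.

```python
class CrashError(Exception):
    """Stands in for a machine crash in this toy model."""


def create_file(fs, name, crash_after=None):
    """Apply the three metadata updates in order, optionally 'crashing' after a step."""
    steps = [
        lambda: fs.update(free_inodes=fs["free_inodes"] - 1),  # 1. decrease free inode count
        lambda: fs["inodes"].append(name),                     # 2. initialize the inode
        lambda: fs["directory"].append(name),                  # 3. add the directory entry
    ]
    for number, step in enumerate(steps, start=1):
        step()
        if crash_after == number:
            raise CrashError(f"crashed after step {number}")


def is_consistent(fs, total_inodes):
    """The used-inode count must match the initialized inodes and directory entries."""
    used = total_inodes - fs["free_inodes"]
    return used == len(fs["inodes"]) == len(fs["directory"])


fs = {"free_inodes": 8, "inodes": [], "directory": []}
try:
    create_file(fs, "a.txt", crash_after=1)
except CrashError:
    pass
consistent_after_crash = is_consistent(fs, total_inodes=8)
```

After the simulated crash, one inode is counted as used although no inode was initialized, so the check fails.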

The only way to detect these inconsistencies is by scanning the entire filesystem. This task is called fsck, filesystem consistency check. In large installations, the consistency check requires a significant amount of time (many hours) to check and fix inconsistencies. As you might have guessed, such downtime is not desirable. A better approach to solve this problem is to avoid introducing inconsistencies in the first place, and this could be accomplished by providing atomicity to operations. Journaling is one such way to provide atomicity to operations.

Simply stated, using journaling is like using a scratch pad. You perform operations on the scratch pad, and once you are satisfied that the operations are correct, you reflect them in a fairer copy.

In the case of filesystems, all the metadata and data are stored on the block device for the filesystem. Journaling filesystems use a journal or the log area as the scratch pad. A journal may be a part of the same block device or it may be a separate device in itself. A journaling filesystem first records all the operations it has performed in the journal. Once the set of operations that is part of one single atomic operation has completed and been recorded in the journal, only then is it written to the actual block device. Henceforth, the term disk is used to indicate the actual block device, whereas the term journal is used for the log area.

Journal Recovery Scenarios

The example operation from above requires that three blocks be modified: the inode count block, the block containing the on-disk inode and the block holding the directory where the entry is to be added. All of these blocks first are written to the journal. After that, a special block, called the commit record, is written to the journal. The commit record is used to indicate that all the blocks belonging to a single atomic operation are written to the journal.

Given journaling behavior, then, here is how a journaling filesystem reacts in the following three basic scenarios:

The machine crashes after only the first block is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with no commit record at the end. This indicates that it may not be a completed operation. Hence, no modifications are done to the disk, preserving the consistency.

The machine crashes after the commit record is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with the commit record at the end. The commit record indicates that this is a completed operation and could be written to the disk. All the blocks belonging to this operation are written at their actual locations on the disk, replaying the journal.

The machine crashes after all the three blocks are flushed to the journal but the commit record is not yet flushed to the journal. Even in this case, because of the absence of the commit record, no modifications are done to the disk. The scenario thus is reduced to the scenario described in the first case.

Likewise, any other crash scenario could be reduced to one of the scenarios listed above.

Thus, journaling guarantees consistency for the filesystem. The time required for looking up the journal and replaying the journal is minimal as compared to that taken by the filesystem consistency check.
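
The three scenarios can be replayed in a minimal sketch. It assumes a journal modeled as a plain list of records and a disk modeled as a dictionary; `replay`, `COMMIT` and the block names are hypothetical and chosen only for illustration.

```python
COMMIT = "COMMIT"  # stands in for the commit record


def replay(journal, disk):
    """Copy journaled blocks to disk, but only for operations that end in a commit record."""
    pending = {}
    for record in journal:
        if record == COMMIT:
            disk.update(pending)  # the operation is complete: apply it
            pending = {}
        else:
            block, contents = record
            pending[block] = contents
    return disk  # blocks still pending had no commit record and are ignored


updates = [("inode_count", 7), ("inode_12", "initialized"), ("dir_block", ["a.txt"])]
old_disk = {"inode_count": 8, "inode_12": None, "dir_block": []}

crash_after_first_block = replay(updates[:1], dict(old_disk))
crash_before_commit = replay(updates, dict(old_disk))
crash_after_commit = replay(updates + [COMMIT], dict(old_disk))
```

Only the run that finds a commit record changes the disk; the other two leave it in its old, consistent state.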

Journaling Block Device

The Linux Journaling Block Device (JBD) provides this scratch pad for providing atomicity in operations. Thus, a filesystem controlling a block device can make use of JBD on the same or on another block device in order to maintain consistency. The JBD is a modular implementation that exposes a set of APIs for the use of such applications. The following sections describe the concepts and implementation of the Linux JBD as is present in the Linux 2.6 kernel.

Before we move on to the implementation details of the JBD, an understanding of some of the objects that JBD uses is required. A journal is a log that internally manages updates for a single block device. As mentioned above, the updates first are stored in the journal and then are reflected to their real locations on the disk. The area belonging to the journal is managed like a circular-linked list. That is, the journal reuses its area when the journal is full.
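
The circular reuse of the journal area can be pictured with a small ring buffer. This is a sketch under assumptions, not the JBD data structure: `CircularLog`, `append` and `release` are hypothetical names.

```python
class CircularLog:
    """Fixed-size log area whose slots are reused once they are released."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0  # next slot to write
        self.tail = 0  # oldest slot still needed
        self.used = 0

    def append(self, block):
        if self.used == self.size:
            raise RuntimeError("journal full: space must be reclaimed first")
        self.slots[self.head] = block
        self.head = (self.head + 1) % self.size  # wrap around at the end
        self.used += 1

    def release(self, count):
        """Give back the oldest `count` slots so they can be overwritten."""
        count = min(count, self.used)
        self.tail = (self.tail + count) % self.size
        self.used -= count


log = CircularLog(4)
for block in "ABCD":
    log.append(block)
log.release(2)   # the two oldest blocks are no longer needed
log.append("E")  # wraps around and reuses the first slot
```

Here the fifth block lands in the slot that originally held the first one, which is the reuse the paragraph above describes.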

A handle represents a single atomic update. The entire set of changes/writes that should be performed atomically are carried out with reference to a single handle.

It may not be an efficient approach to flush each atomic update (handle) to the journal, however. To achieve better performance, the JBD bunches a set of handles together into a transaction and flushes this transaction to the journal. The JBD ensures that the transaction is atomic in nature. Hence, the handles, which are the subcomponents of the transaction, also are guaranteed to be atomic.

The most important property of a transaction is its state. When a transaction is being committed, it follows the lifecycle of states listed below.

Running: the transaction currently is live and can accept new handles. In a system only one transaction can be in the running state.

Locked: the transaction does not accept any new handles but existing handles are not complete. Once all the existing handles are completed, the transaction goes to the next state.

Flush: all the handles in a transaction are complete. The transaction is writing itself to the journal.

Commit: the entire transaction log has been written to the journal. The transaction is writing a commit block indicating that the transaction log in the journal is complete.

Finished: the transaction is written completely to the journal. It has to remain there until the blocks are updated to the actual locations on the disk.
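
The lifecycle above, together with the rule that handles join only a running transaction, can be modeled as a small state machine. This is a simplified sketch: the `State` and `Transaction` classes and their methods are hypothetical and do not mirror the kernel's structures.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    LOCKED = "locked"
    FLUSH = "flush"
    COMMIT = "commit"
    FINISHED = "finished"


ORDER = list(State)  # members in definition order


class Transaction:
    def __init__(self):
        self.state = State.RUNNING
        self.open_handles = 0

    def start_handle(self):
        if self.state is not State.RUNNING:
            raise RuntimeError("only a running transaction accepts new handles")
        self.open_handles += 1

    def stop_handle(self):
        self.open_handles -= 1

    def advance(self):
        """Move to the next state; a locked transaction waits for its handles."""
        if self.state is State.FINISHED:
            raise RuntimeError("transaction already finished")
        if self.state is State.LOCKED and self.open_handles:
            raise RuntimeError("existing handles must complete first")
        self.state = ORDER[ORDER.index(self.state) + 1]


t = Transaction()
t.start_handle()
t.advance()  # RUNNING -> LOCKED

try:
    t.start_handle()
    new_handle_rejected = False
except RuntimeError:
    new_handle_rejected = True

try:
    t.advance()
    blocked_by_open_handle = False
except RuntimeError:
    blocked_by_open_handle = True

t.stop_handle()
for _ in range(3):
    t.advance()  # LOCKED -> FLUSH -> COMMIT -> FINISHED
```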

Transaction Committing and CheckPointing

A running transaction is written to the journal area after a certain period. Thus, a transaction can be either in-memory (running) or on-disk. Flushing a transaction to the journal and marking that particular transaction as finished is a process called transaction commit.

The journal has a limited area under its control, and it needs to reuse this area. Committed transactions that have all their blocks written to the disk no longer need to be kept in the journal. Checkpointing, then, is the process of flushing the finished transactions to the disk and reclaiming the corresponding space in the journal. It is discussed in more detail later in this article.

Implementation Briefs

The JBD layer performs journaling of the metadata, during which the data simply is written to the disk without being journaled. But this does not stop applications from journaling the data, as it could be presented to the JBD as metadata itself. This document takes the Linux kernel version 2.6.0 as a reference.





Commit

[journal_commit_transaction(journal object)]

A kjournald thread is associated with every journaled device. The kjournald thread ensures that the running transaction is committed after a specific interval. The transaction commit code is divided into eight different phases, described below. Figure 1 shows a logical layout of a journal.

Phase 0: moves the transaction from running state (T_RUNNING) to locked state (T_LOCKED), meaning the transaction no longer can issue new handles. The transaction waits until all the existing handles have completed. A transaction always has a set of buffers reserved when it is initiated. Some of these buffers may be unused and are unfiled in this phase. The transaction now is ready to be committed with no outstanding handles.

Phase 1: the transaction enters into the flush state (T_FLUSH). The transaction is marked as a currently committing transaction for the journal. This phase also marks that no running transaction exists for the journal; therefore, new requests for handles initiate a new transaction.

Phase 2: the actual buffers of the transaction are flushed to the disk. Data buffers go first. There are no complications here, as data buffers are not saved in the log area. Instead, they are flushed directly to their actual positions on the disk. This phase ends when the I/O completion notifications for all such buffers are received.

Phase 3: all the data buffers are written to a disk but their metadata still is in the volatile memory. Metadata flushing is not as straightforward as data buffer flushing, because metadata needs to be written to the log area and the actual positions on the disk need to be remembered. This phase starts with flushing these metadata buffers, for which a journal descriptor block is acquired. The journal descriptor block stores the mapping of each metadata buffer in the journal to its actual location on the disk in the form of tags. After this, metadata buffers are flushed to the journal. Once the journal descriptor is full of tags or all metadata buffers are flushed to the journal, the journal descriptor also is flushed to the journal. Now we have all the metadata buffers in the journal, and their actual positions on the disk are remembered. This data, being persistent, can be used for recovery if failure occurs.

Phase 4 and Phase 5: both phase 4 and phase 5 wait on I/O completion notifications of metadata buffers and journal descriptor blocks, respectively. The buffers are unfiled from in-memory lists once I/O completion is received.

Phase 6: all the data and metadata are on safe storage, data at its actual locations and metadata in the journal. Now the transaction needs to be marked as committed so that it can be known that all the updates are safe in the journal. For this reason, a journal descriptor block again is allocated. A tag is written stating that the transaction has committed successfully, and the block is synchronously written to its position in the journal. After this, the transaction is moved to the committed state, T_COMMIT.

Phase 7: occurs when a number of transactions are present in the journal, without yet being flushed to the disk. Some of the metadata buffers in this transaction already may be a part of some previous transaction. These need not be kept in the older transactions as we have their latest copy in the current committed transaction. Such buffers are removed from older transactions.

Phase 8: the transaction is marked as being in the finished state, T_FINISHED. The journal structure is updated to reflect this particular transaction as the latest committed transaction. It also is added to the list of transactions to be checkpointed.
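
The descriptor blocks and tags from phase 3 can be illustrated with a toy layout. This sketch assumes that each descriptor block is followed in the log by the metadata blocks it describes and that a tag is just the block's on-disk block number; `log_metadata`, `read_mapping` and `tags_per_descriptor` are hypothetical names, and the real on-disk format is more involved.

```python
def log_metadata(buffers, tags_per_descriptor):
    """Lay out (disk_block_number, contents) buffers as journal blocks.

    A new descriptor block is started whenever the previous one is full of tags.
    """
    journal = []
    for start in range(0, len(buffers), tags_per_descriptor):
        group = buffers[start:start + tags_per_descriptor]
        journal.append(("descriptor", [blocknr for blocknr, _ in group]))
        journal.extend(("metadata", contents) for _, contents in group)
    return journal


def read_mapping(journal):
    """Use the tags to recover where each logged metadata block belongs on disk."""
    mapping = {}
    i = 0
    while i < len(journal):
        _, tags = journal[i]  # journal[i] is a descriptor block
        for offset, blocknr in enumerate(tags, start=1):
            mapping[blocknr] = journal[i + offset][1]
        i += len(tags) + 1
    return mapping


buffers = [(100, "bitmap"), (205, "inode"), (731, "dir")]
journal = log_metadata(buffers, tags_per_descriptor=2)
mapping = read_mapping(journal)
```

With room for two tags per descriptor, three metadata buffers need two descriptor blocks, and the tags alone are enough to map every logged block back to its disk location.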

Checkpointing

Checkpointing is initiated when the journal is being flushed to the disk (think of unmount) or when a new handle is started. A new handle can fall short of the guaranteed number of buffers, so it may be necessary to carry out a checkpointing process in order to free some space in the journal.

The checkpointing process flushes the metadata buffers of a transaction not yet written to its actual location on the disk. The transaction then is removed from the journal. The journal can have multiple checkpointing transactions, and each checkpointing transaction can have multiple buffers. The process considers each committing transaction, and for each transaction, it finds the metadata buffers that need to be flushed to the disk. All these buffers are flushed in one batch. Once all the transactions are checkpointed, their log is removed from the journal.
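
A minimal sketch of checkpointing, assuming each finished transaction is represented as a pair of the journal blocks it occupies and a dictionary of its metadata buffers. The `checkpoint` function and its parameters are hypothetical.

```python
def checkpoint(transactions, disk, journal_free):
    """Flush finished transactions (oldest first) to disk and reclaim their log space.

    transactions: list of (journal_blocks_used, {disk_block: contents}) pairs.
    Returns the number of free journal blocks after checkpointing.
    """
    for blocks_used, buffers in transactions:
        disk.update(buffers)         # write metadata to its actual location
        journal_free += blocks_used  # this part of the log can now be reused
    transactions.clear()             # nothing is left to checkpoint
    return journal_free


disk = {}
finished = [(3, {1: "a", 2: "b"}), (2, {2: "c"})]
free_blocks = checkpoint(finished, disk, journal_free=1)
```

Because the transactions are applied oldest first, the disk ends up with the newest copy of block 2, and the five journal blocks they occupied become free again.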





Recovery

[journal_recover(journal object)]

When the system comes up after a crash and it can see that the log entries are not null, it indicates that the last unmount was not successful or never occurred. At this point, you need to attempt a recovery. Figure 2 depicts a sample physical layout of a journal. The recovery takes place in three phases.

PASS_SCAN: the end of the log is found.

PASS_REVOKE: a list of revoked blocks is prepared from the log.

PASS_REPLAY: unrevoked blocks are rewritten (replayed) in order to guarantee the consistency of the disk.

For recovery, the available information is provided in terms of the journal. But the exact state of the journal is unknown, as we do not know the point at which the system crashed. Hence, the last transaction could be in the checkpointing or committing state. A running transaction cannot be found, as it was only in the memory.

For committing transactions, we have to forget the updates made, as all of the updates may not be in place. So in the PASS_SCAN phase, the last log entry in the log is found. From here, the recovery process knows which transactions need to be replayed.

Every transaction can have a set of revoked blocks. This is important to know in order to prevent older journal records from being replayed on top of newer data using the same block. In PASS_REVOKE, a hash table of all these revoked blocks is prepared. This table is used every time we need to find out whether a particular block should get written to a disk through a replay.

In the last phase, all the blocks that need to be replayed are considered. Each block is tested for its presence in the revoked blocks' hash table. If the block is not in there, it is safe to write the block to its actual location on the disk. If the block is there, only the newest version of the block is written to the disk. Notice that we have not changed anything in the on-disk journal. Hence, even if the system crashes again while the recovery is in progress, no harm is done.

The same journal is present for the recovery next time, and no non-idempotent operation is performed during the process of recovery.
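
The three passes can be sketched as follows. This is a simplified model under assumptions, not the kernel's `journal_recover`: each transaction is a dictionary with hypothetical keys `seq`, `blocks`, `revoked` and `committed`, and a revoke recorded in one transaction is assumed to suppress copies of that block logged in that transaction or any earlier one.

```python
def recover(journal, disk):
    """Replay an oldest-first list of logged transactions onto the disk."""
    # PASS_SCAN: find the end of the log, i.e. stop at the first uncommitted transaction.
    valid = []
    for transaction in journal:
        if not transaction["committed"]:
            break
        valid.append(transaction)

    # PASS_REVOKE: remember the newest transaction that revoked each block.
    revoked = {}
    for transaction in valid:
        for blocknr in transaction["revoked"]:
            revoked[blocknr] = transaction["seq"]

    # PASS_REPLAY: write each logged block unless a revoke at least as new covers it.
    for transaction in valid:
        for blocknr, contents in transaction["blocks"].items():
            if blocknr in revoked and transaction["seq"] <= revoked[blocknr]:
                continue
            disk[blocknr] = contents
    return disk  # the journal itself is never modified


journal = [
    {"seq": 1, "blocks": {10: "old dir", 11: "inode v1"}, "revoked": set(), "committed": True},
    {"seq": 2, "blocks": {11: "inode v2"}, "revoked": {10}, "committed": True},
    {"seq": 3, "blocks": {12: "half written"}, "revoked": set(), "committed": False},
]
disk = {10: "file data", 11: "stale", 12: "untouched"}

once = recover(journal, dict(disk))
twice = recover(journal, dict(once))  # a second recovery after another crash
```

Running the recovery a second time on its own result changes nothing, which illustrates why a crash during recovery does no harm in this model.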

Amey Inamdar (www.geocities.com/amey_inamdar) is a kernel developer working at Kernel Corporation. His interest areas include filesystems and distributed systems.

Kedar Sovani (www.geocities.com/kedarsovani) works for Kernel Corporation as a kernel developer. His areas of interest include filesystems and storage technologies.

Copyright (c) 2004-2006 Kedar Sovani and Amey Inamdar