
Replication


Because each replica in Swift functions independently, and clients generally require only a simple majority of nodes responding to consider an operation successful, transient failures like network partitions can quickly cause replicas to diverge. These differences are eventually reconciled by asynchronous, peer-to-peer replicator processes. The replicator processes traverse their local filesystems, concurrently performing operations in a manner that balances load across physical disks.
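As a rough illustration of the "simple majority" idea, a write to three replicas can be acknowledged as soon as most nodes respond, leaving the remaining replica to be reconciled later by the replicator. The following is a minimal sketch of that quorum rule, not Swift's actual implementation:

# Minimal sketch of majority-quorum acknowledgement; illustrative only,
# not Swift's actual code.
def quorum_size(replica_count):
    # A simple majority of replicas, e.g. 2 of 3, 3 of 5.
    return replica_count // 2 + 1

def write_succeeded(responses, replica_count):
    # responses: one boolean per storage node contacted.
    return sum(1 for ok in responses if ok) >= quorum_size(replica_count)

# With 3 replicas, two successful node responses are enough, so the third
# replica may temporarily diverge until the replicator reconciles it.
print(write_succeeded([True, True, False], 3))  # True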
Replication uses a push model, with records and files generally only being copied from local to remote replicas. This is important because data on the node may not belong there (as in the case of handoffs and ring changes), and a replicator can't know what data exists elsewhere in the cluster that it should pull in. It's the duty of any node that contains data to ensure that data gets to where it belongs. Replica placement is handled by the ring.
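The ring is what tells a node where a partition's replicas belong. The sketch below illustrates how a node might decide that a locally held partition is only a handoff copy and must be pushed to its primary nodes; get_part_nodes() is modeled on Swift's ring interface, but the surrounding loop and names are hypothetical:

# Hedged sketch: deciding whether locally held partitions belong on this
# node. The ring object mimics Swift's ring; this is not the real replicator.
def partitions_to_push(ring, local_device_id, local_partitions):
    to_push = []
    for part in local_partitions:
        primary_ids = {node['id'] for node in ring.get_part_nodes(part)}
        if local_device_id not in primary_ids:
            # This node is only a handoff for the partition (e.g. after a
            # ring change), so its data must be pushed to the primaries.
            to_push.append(part)
    return to_push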
Every deleted record or file in the system is marked by a tombstone, so that deletions can be replicated alongside creations. The replication process cleans up tombstones after a time period known as the consistency window. The consistency window encompasses replication duration and how long a transient failure can remove a node from the cluster. Tombstone cleanup must be tied to replication to reach replica convergence; otherwise a tombstone could be removed from some replicas while others have not yet seen the deletion.
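For objects, a deletion leaves a zero-byte .ts tombstone file named after the deletion timestamp. A hedged sketch of the cleanup decision, using an assumed reclaim_age of one week, might look like this:

import os
import time

# Hedged sketch of tombstone reclamation: a .ts tombstone is only removed
# once it is older than the consistency window (assumed here to be a week).
RECLAIM_AGE = 7 * 24 * 3600

def reclaimable(tombstone_path, now=None):
    now = now if now is not None else time.time()
    # e.g. "1440000000.00000.ts" -> 1440000000.0
    timestamp = float(os.path.basename(tombstone_path).rsplit('.ts', 1)[0])
    return now - timestamp > RECLAIM_AGE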
If a replicator detects that a remote drive has failed, the replicator uses the get_more_nodes interface for the ring to choose an alternate node with which to synchronize. The replicator can maintain desired levels of replication in the face of disk failures, though some replicas may not be in an immediately usable location. Note that the replicator doesn't maintain desired levels of replication when other failures, such as entire node failures, occur because most failures are transient.
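The ring's get_more_nodes(part) call yields handoff nodes for a partition. The fallback loop below is an illustrative sketch of substituting handoffs for failed primaries; is_usable() is a hypothetical health check, not part of Swift:

# Illustrative sketch of falling back to handoff nodes when a primary's
# drive has failed. get_part_nodes()/get_more_nodes() mirror the ring
# interface; the selection logic itself is hypothetical.
def nodes_to_sync_with(ring, part, is_usable, replica_count=3):
    targets = []
    handoffs = ring.get_more_nodes(part)  # iterator of alternate nodes
    for node in ring.get_part_nodes(part):
        if is_usable(node):
            targets.append(node)
        else:
            # Substitute the next handoff node for the failed primary.
            alternate = next(handoffs, None)
            if alternate is not None:
                targets.append(alternate)
    return targets[:replica_count]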

Replication is an area of active development, and likely rife with potential improvements to speed and correctness.
There are two major classes of replicator: the db replicator, which replicates accounts and containers, and the object replicator, which replicates object data.

DB Replication

The first step performed by db replication is a low-cost hash comparison to determine whether two replicas already match. Under normal operation, this check is able to verify very quickly that most databases in the system are already synchronized. If the hashes differ, the replicator brings the databases in sync by sharing records added since the last sync point.
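A hedged sketch of that flow: compare a cheap content hash first, and only when it differs push the records added since the last sync point. The local_db/remote_db wrappers and their methods are illustrative names, not Swift's actual API:

# Illustrative sketch of the cheap-check-then-sync flow of db replication.
def replicate_db(local_db, remote_db):
    if local_db.content_hash() == remote_db.content_hash():
        return  # Already in sync; nothing more to do.
    # Hashes differ: push only the records added since the last sync point
    # recorded for this peer, then advance the sync point.
    sync_point = local_db.get_sync_point(remote_db.id)
    new_records = local_db.records_since(sync_point)
    remote_db.merge_records(new_records)
    local_db.set_sync_point(remote_db.id, local_db.max_record_id())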
This sync point is a high water mark noting the last record at which two databases were known to be in sync, and is stored in each database as a tuple of the remote database id and record id. Database ids are unique amongst all replicas of the database, and record ids are monotonically increasing integers. After all new records have been pushed to the remote database, the entire sync table of the local database is pushed, so the remote database can guarantee that it is in sync with everything with which the local database has previously synchronized.
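The sync table can be pictured as a small map from remote database id to the highest record id known to have been pushed there; a hypothetical sketch:

# Hypothetical picture of a sync table and how it limits what gets pushed.
sync_table = {
    'db-uuid-remote-1': 1042,   # remote database id -> last synced record id
    'db-uuid-remote-2': 1040,
}

def records_to_push(records, remote_db_id, sync_table):
    # Record ids increase monotonically, so everything above the stored
    # high-water mark still needs to be sent to that peer.
    last_synced = sync_table.get(remote_db_id, -1)
    return [r for r in records if r['id'] > last_synced]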
If a replica is found to be missing entirely, the whole local database file is transmitted to the peer using rsync(1) and vested with a new unique id.
In practice, DB replication can process hundreds of databases per concurrency setting per second (up to the number of available CPUs or disks) and is bound by the number of DB transactions that must be performed.

Object Replication

The initial implementation of object replication simply performed an rsync to push data from a local partition to all remote servers it was expected to exist on. While this performed adequately at small scale, replication times skyrocketed once directory structures could no longer be held in RAM. We now use a modification of this scheme in which a hash of the contents for each suffix directory is saved to a per-partition hashes file. The hash for a suffix directory is invalidated when the contents of that suffix directory are modified.
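Conceptually, the per-partition hashes file maps each suffix directory to an MD5 of its directory listing. A hedged sketch of recomputing one suffix hash, mirroring the idea described above rather than Swift's actual diskfile code, might be:

import hashlib
import os

# Hedged sketch: hash a suffix directory by hashing the file names found
# under its object hash directories.
def hash_suffix_dir(suffix_path):
    md5 = hashlib.md5()
    for hash_dir in sorted(os.listdir(suffix_path)):
        hash_dir_path = os.path.join(suffix_path, hash_dir)
        if not os.path.isdir(hash_dir_path):
            continue
        for filename in sorted(os.listdir(hash_dir_path)):
            md5.update(filename.encode('utf-8'))
    return md5.hexdigest()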
The object replication process reads in these hash files, calculating any invalidated hashes. It then transmits the hashes to each remote server that should hold the partition, and only suffix directories with differing hashes on the remote server are rsynced. After pushing files to the remote server, the replication process notifies it to recalculate hashes for the rsynced suffix directories.
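A hedged sketch of the comparison step: given the local and remote suffix-hash dictionaries for one partition (like the hashes.pkl examples shown later in this article), only the suffixes whose hashes differ, or are missing remotely, need to be rsynced:

# Illustrative sketch of choosing which suffix directories to rsync.
def suffixes_to_sync(local_hashes, remote_hashes):
    return sorted(
        suffix for suffix, digest in local_hashes.items()
        if remote_hashes.get(suffix) != digest
    )

local = {'a43': '72018c5f...', 'b23': '12348c5f...'}
remote = {'a43': '72018c5f...', 'b23': 'deadbeef...'}
print(suffixes_to_sync(local, remote))  # ['b23']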
Performance of object replication is generally bound by the number of uncached directories it has to traverse, usually as a result of invalidated suffix directory hashes. Using write volume and partition counts from our running systems, it was designed so that around 2% of the hash space on a normal node will be invalidated per day, which has experimentally given us acceptable replication speeds.
Work continues with a new ssync method where rsync is not used at all and instead all-Swift code is used to transfer the objects. At first, this ssync will just strive to emulate the rsync behavior. Once deemed stable, it will open the way for future improvements in replication since we'll be able to easily add code in the replication path instead of trying to alter the rsync code base and distributing such modifications.
One of the first improvements planned is an "index.db" that will replace the hashes.pkl. This will allow quicker updates to that data as well as more streamlined queries. Quite likely we'll implement a better scheme than the one hashes.pkl currently uses (hash-trees, that sort of thing).
Another improvement planned all along the way is separating the local disk structure from the protocol path structure. This separation will allow ring resizing at some point, or at least ring-doubling.
Note that for objects being stored with an Erasure Code policy, the replicator daemon is not involved. Instead, the reconstructor is used by Erasure Code policies and is analogous to the replicator for Replication type policies. See Erasure Code Support for complete information on both Erasure Code support as well as the reconstructor.

Hashes.pkl

The hashes.pkl file is a key element for both replication and reconstruction (for Erasure Coding). Both daemons use this file to determine if any kind of action is required between nodes that are participating in the durability scheme. The file itself is a pickled dictionary with slightly different formats depending on whether the policy is Replication or Erasure Code. In either case, however, the same basic information is provided between the nodes. The file contains a dictionary whose keys are suffix directory names and whose values are the MD5 hashes of the directory listings for those suffixes. In this manner, the daemon can quickly identify differences between local and remote suffix directories on a per-partition basis, as the scope of any one hashes.pkl file is a partition directory.
For Erasure Code policies, a little more information is required. An object's hash directory may contain multiple fragments of a single object in the event that the node is acting as a handoff or perhaps if a rebalance is underway. Each fragment of an object is stored with a fragment index, so the hashes.pkl for an Erasure Code partition will still be a dictionary keyed on the suffix directory name; however, the value is another dictionary keyed on the fragment index, with the MD5 hash for each fragment as the value. Some files within an object hash directory don't require a fragment index, so None is used to represent those. Below are examples of what these dictionaries might look like.
Replication hashes.pkl:

{'a43': '72018c5fbfae934e1f56069ad4425627',
 'b23': '12348c5fbfae934e1f56069ad4421234'}

Erasure Code hashes.pkl:

{'a43': {None: '72018c5fbfae934e1f56069ad4425627',
         2: 'b6dd6db937cb8748f50a5b6e4bc3b808'},
 'b23': {None: '12348c5fbfae934e1f56069ad4421234',
         1: '45676db937cb8748f50a5b6e4bc34567'}}
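
Because the file is just a pickled dictionary, it can be inspected directly. The sketch below assumes the conventional /srv/node/<device>/objects/<partition>/ layout; adjust the path for your deployment:

import pickle

# Hedged sketch: load a partition's hashes.pkl and print its suffix hashes.
with open('/srv/node/sdb1/objects/1024/hashes.pkl', 'rb') as f:
    suffix_hashes = pickle.load(f)

for suffix, value in sorted(suffix_hashes.items()):
    # For a Replication policy the value is an MD5 string; for Erasure Code
    # it is a dict keyed by fragment index (or None).
    print(suffix, value)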

Dedicated replication network

Swift has support for using a dedicated network for replication traffic. For more information see Overview of dedicated replication network.