Soft RAID5 Reading Notes, Part 5: Synchronization
2016-07-21 20:45
resync: the process of computing the parity disk's contents from the member disks, i.e. the synchronization needed when the array is initialized;
recovery: the process of reconstructing a failed member disk's data from the remaining disks and the parity disk.
The SET_DISK_FAULTY case in md_ioctl() calls set_disk_faulty(). Its parameters are mddev, a pointer to the MD device descriptor, and dev, the device number of the member disk to fail. As the function name suggests, it marks that member disk as faulty:
static int set_disk_faulty(mddev_t *mddev, dev_t dev)
{
	mdk_rdev_t *rdev;

	if (mddev->pers == NULL)
		return -ENODEV;

	rdev = find_rdev(mddev, dev);	/* look up the member disk with device number dev in the MD device's disk list */
	if (!rdev)
		return -ENODEV;

	md_error(mddev, rdev);
	return 0;
}

void md_error(mddev_t *mddev, mdk_rdev_t *rdev)
{
	if (!mddev) {
		MD_BUG();
		return;
	}

	if (!rdev || test_bit(Faulty, &rdev->flags))	/* already marked faulty: nothing to do */
		return;

	if (mddev->external)	/* externally managed metadata: set Blocked so the disk takes no writes */
		set_bit(Blocked, &rdev->flags);	/* until the bit is cleared, i.e. the disk is blocked */
	if (!mddev->pers)	/* no personality attached, just return */
		return;
	if (!mddev->pers->error_handler)	/* the personality has no error handler, just return */
		return;
	mddev->pers->error_handler(mddev, rdev);	/* call the personality's error handler */
	if (mddev->degraded)
		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
	set_bit(StateChanged, &rdev->flags);
	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
	md_wakeup_thread(mddev->thread);	/* wake the management thread; for raid5 this is raid5d */
	md_new_event_inintr(mddev);
}
md_error()'s parameters are mddev, a pointer to the MD device descriptor, and rdev, a pointer to the descriptor of the faulty member disk. The function first calls the personality's error handler, then sets the faulty disk's StateChanged flag along with the MD_RECOVERY_INTR and MD_RECOVERY_NEEDED bits in ->recovery, and finally wakes the management thread for further handling. For raid5, pers->error_handler is instantiated as error(), and the management thread is raid5d. Let's look at each in turn.
static void error(mddev_t *mddev, mdk_rdev_t *rdev)
{
	char b[BDEVNAME_SIZE];
	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;

	pr_debug("raid5: error called\n");

	if (!test_bit(Faulty, &rdev->flags)) {	/* the member disk is not yet marked Faulty */
		set_bit(MD_CHANGE_DEVS, &mddev->flags);	/* record that some disk in the array changed state */
		if (test_and_clear_bit(In_sync, &rdev->flags)) {	/* if the disk was In_sync (in step with the rest of the array), clear that */
			unsigned long flags;
			spin_lock_irqsave(&conf->device_lock, flags);
			mddev->degraded++;	/* one more degraded disk */
			spin_unlock_irqrestore(&conf->device_lock, flags);
			/*
			 * if recovery was running, make sure it aborts.
			 */
			set_bit(MD_RECOVERY_INTR, &mddev->recovery);	/* MD_RECOVERY_INTR in ->recovery interrupts any recovery in progress */
		}
		set_bit(Faulty, &rdev->flags);	/* mark the member disk Faulty */
		printk(KERN_ALERT
		       "raid5: Disk failure on %s, disabling device.\n"
		       "raid5: Operation continuing on %d devices.\n",
		       bdevname(rdev->bdev,b),
		       conf->raid_disks - mddev->degraded);
	}
}
error() takes the same parameters: mddev, a pointer to the MD device descriptor, and rdev, a pointer to the faulty member disk's descriptor. It mostly just sets flags: it marks the member disk faulty, records that the array's disk state has changed, and promptly interrupts any recovery already in progress. The raid5d thread then calls md_check_recovery(); let's look at that function:
/*
 * This routine is regularly called by all per-raid-array threads to
 * deal with generic issues like resync and super-block update.
 * Raid personalities that don't have a thread (linear/raid0) do not
 * need this as they never do any recovery or update the superblock.
 *
 * It does not do any resync itself, but rather "forks" off other threads
 * to do that as needed.
 * When it is determined that resync is needed, we set MD_RECOVERY_RUNNING in
 * "->recovery" and create a thread at ->sync_thread.
 * When the thread finishes it sets MD_RECOVERY_DONE
 * and wakeups up this thread which will reap the thread and finish up.
 * This thread also removes any faulty devices (with nr_pending == 0).
 *
 * The overall approach is:
 *  1/ if the superblock needs updating, update it.
 *  2/ If a recovery thread is running, don't do anything else.
 *  3/ If recovery has finished, clean up, possibly marking spares active.
 *  4/ If there are any faulty devices, remove them.
 *  5/ If array is degraded, try to add spares devices
 *  6/ If array has spares or is not in-sync, start a resync thread.
 */
void md_check_recovery(mddev_t *mddev)
{
	mdk_rdev_t *rdev;

	if (mddev->bitmap)
		bitmap_daemon_work(mddev->bitmap);	/* probably flushes dirty bitmap pages out to disk */

	if (mddev->ro)
		return;

	if (signal_pending(current)) {
		if (mddev->pers->sync_request && !mddev->external) {
			printk(KERN_INFO "md: %s in immediate safe mode\n",
			       mdname(mddev));
			mddev->safemode = 2;
		}
		flush_signals(current);
	}

	if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
		return;
	if ( ! (
		(mddev->flags && !mddev->external) ||
		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
		(mddev->external == 0 && mddev->safemode == 1) ||
		(mddev->safemode == 2 && ! atomic_read(&mddev->writes_pending)
		 && !mddev->in_sync && mddev->recovery_cp == MaxSector)
		))
		return;

	if (mddev_trylock(mddev)) {
		int spares = 0;

		if (mddev->ro) {
			/* Only thing we do on a ro array is remove
			 * failed devices.
			 */
			remove_and_add_spares(mddev);
			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
			goto unlock;
		}

		if (!mddev->external) {
			int did_change = 0;
			spin_lock_irq(&mddev->write_lock);
			if (mddev->safemode &&
			    !atomic_read(&mddev->writes_pending) &&
			    !mddev->in_sync &&
			    mddev->recovery_cp == MaxSector) {
				mddev->in_sync = 1;
				did_change = 1;
				if (mddev->persistent)
					set_bit(MD_CHANGE_CLEAN, &mddev->flags);
			}
			if (mddev->safemode == 1)
				mddev->safemode = 0;
			spin_unlock_irq(&mddev->write_lock);
			if (did_change)
				sysfs_notify_dirent(mddev->sysfs_state);
		}

		if (mddev->flags)
			md_update_sb(mddev, 0);

		list_for_each_entry(rdev, &mddev->disks, same_set)
			if (test_and_clear_bit(StateChanged, &rdev->flags))
				sysfs_notify_dirent(rdev->sysfs_state);

		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
		    !test_bit(MD_RECOVERY_DONE, &mddev->recovery)) {
			/* resync/recovery still happening */
			clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
			goto unlock;
		}
		if (mddev->sync_thread) {
			/* resync has finished, collect result */
			md_unregister_thread(mddev->sync_thread);
			mddev->sync_thread = NULL;
			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery) &&
			    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) {
				/* success...*/
				/* activate any spares */
				if (mddev->pers->spare_active(mddev))
					sysfs_notify(&mddev->kobj, NULL,
						     "degraded");
			}
			if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
			    mddev->pers->finish_reshape)
				mddev->pers->finish_reshape(mddev);
			md_update_sb(mddev, 1);

			/* if array is no-longer degraded, then any saved_raid_disk
			 * information must be scrapped
			 */
			if (!mddev->degraded)
				list_for_each_entry(rdev, &mddev->disks, same_set)
					rdev->saved_raid_disk = -1;

			mddev->recovery = 0;
			/* flag recovery needed just to double check */
			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
			sysfs_notify_dirent(mddev->sysfs_action);
			md_new_event(mddev);
			goto unlock;
		}
		/* Set RUNNING before clearing NEEDED to avoid
		 * any transients in the value of "sync_action".
		 */
		set_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
		clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
		/* Clear some bits that don't mean anything, but
		 * might be left set
		 */
		clear_bit(MD_RECOVERY_INTR, &mddev->recovery);
		clear_bit(MD_RECOVERY_DONE, &mddev->recovery);

		if (test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
			goto unlock;
		/* no recovery is running.
		 * remove any failed drives, then
		 * add spares if possible.
		 * Spare are also removed and re-added, to allow
		 * the personality to fail the re-add.
		 */

		if (mddev->reshape_position != MaxSector) {
			if (mddev->pers->check_reshape == NULL ||
			    mddev->pers->check_reshape(mddev) != 0)
				/* Cannot proceed */
				goto unlock;
			set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
		} else if ((spares = remove_and_add_spares(mddev))) {
			clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
			clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
			clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
			set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
		} else if (mddev->recovery_cp < MaxSector) {
			set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
		} else if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
			/* nothing to be done ... */
			goto unlock;

		if (mddev->pers->sync_request) {
			if (spares && mddev->bitmap && !mddev->bitmap->file) {
				/* We are adding a device or devices to an array
				 * which has the bitmap stored on all devices.
				 * So make sure all bitmap pages get written
				 */
				bitmap_write_all(mddev->bitmap);
			}
			mddev->sync_thread = md_register_thread(md_do_sync,
								mddev,
								"resync");	/* create the resync thread to do the actual sync */
			if (!mddev->sync_thread) {
				printk(KERN_ERR "%s: could not start resync"
					" thread...\n",
					mdname(mddev));
				/* leave the spares where they are, it shouldn't hurt */
				mddev->recovery = 0;
			} else
				md_wakeup_thread(mddev->sync_thread);
			sysfs_notify_dirent(mddev->sysfs_action);
			md_new_event(mddev);
		}
 unlock:
		if (!mddev->sync_thread) {
			clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
			if (test_and_clear_bit(MD_RECOVERY_RECOVER,
					       &mddev->recovery))
				if (mddev->sysfs_action)
					sysfs_notify_dirent(mddev->sysfs_action);
		}
		mddev_unlock(mddev);
	}
}
md_check_recovery() creates the sync thread according to the array's state; the actual synchronization happens in that thread. Now let's look at the sync thread:
void md_do_sync(mddev_t *mddev)
{
	mddev_t *mddev2;
	unsigned int currspeed = 0, window;
	sector_t max_sectors, j, io_sectors;
	unsigned long mark[SYNC_MARKS];
	sector_t mark_cnt[SYNC_MARKS];
	int last_mark, m;
	struct list_head *tmp;
	sector_t last_check;
	int skipped = 0;
	mdk_rdev_t *rdev;
	char *desc;

	/* just incase thread restarts... */
	if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
		return;
	if (mddev->ro) /* never try to sync a read-only array */
		return;

	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
		if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
			desc = "data-check";
		else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
			desc = "requested-resync";
		else
			desc = "resync";
	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
		desc = "reshape";
	else
		desc = "recovery";

	/* we overload curr_resync somewhat here.
	 * 0 == not engaged in resync at all
	 * 2 == checking that there is no conflict with another sync
	 * 1 == like 2, but have yielded to allow conflicting resync to
	 *		commense
	 * other == active in resync - this many blocks
	 *
	 * Before starting a resync we must have set curr_resync to
	 * 2, and then checked that every "conflicting" array has curr_resync
	 * less than ours. When we find one that is the same or higher
	 * we wait on resync_wait. To avoid deadlock, we reduce curr_resync
	 * to 1 if we choose to yield (based arbitrarily on address of mddev structure).
	 * This will mean we have to start checking from the beginning again.
	 */

	do {
		mddev->curr_resync = 2;

	try_again:
		if (kthread_should_stop()) {
			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
			goto skip;
		}
		for_each_mddev(mddev2, tmp) {
			if (mddev2 == mddev)
				continue;
			if (!mddev->parallel_resync
			&&  mddev2->curr_resync
			&&  match_mddev_units(mddev, mddev2)) {
				DEFINE_WAIT(wq);
				if (mddev < mddev2 && mddev->curr_resync == 2) {
					/* arbitrarily yield */
					mddev->curr_resync = 1;
					wake_up(&resync_wait);
				}
				if (mddev > mddev2 && mddev->curr_resync == 1)
					/* no need to wait here, we can wait the next
					 * time 'round when curr_resync == 2
					 */
					continue;
				/* We need to wait 'interruptible' so as not to
				 * contribute to the load average, and not to
				 * be caught by 'softlockup'
				 */
				prepare_to_wait(&resync_wait, &wq, TASK_INTERRUPTIBLE);
				if (!kthread_should_stop() &&
				    mddev2->curr_resync >= mddev->curr_resync) {
					/* this array has made less progress than the conflicting one: put this thread to sleep */
					printk(KERN_INFO "md: delaying %s of %s"
					       " until %s has finished (they"
					       " share one or more physical units)\n",
					       desc, mdname(mddev), mdname(mddev2));
					mddev_put(mddev2);
					if (signal_pending(current))
						flush_signals(current);
					schedule();
					finish_wait(&resync_wait, &wq);
					goto try_again;
				}
				finish_wait(&resync_wait, &wq);
			}
		}
	} while (mddev->curr_resync < 2);

	/* set the start sector j and end sector max_sectors for the resync, reshape or recovery */
	j = 0;
	if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
		/* resync follows the size requested by the personality,
		 * which defaults to physical size, but can be virtual size
		 */
		max_sectors = mddev->resync_max_sectors;	/* highest sector number to resync */
		mddev->resync_mismatches = 0;
		/* we don't use the checkpoint if there's a bitmap */
		if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
			j = mddev->resync_min;
		else if (!mddev->bitmap)
			j = mddev->recovery_cp;
	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
		max_sectors = mddev->dev_sectors;	/* reshape runs to the physical size of the devices */
	else {
		/* recovery follows the physical size of devices */
		max_sectors = mddev->dev_sectors;	/* recovery runs to the physical size of the devices */
		j = MaxSector;
		list_for_each_entry(rdev, &mddev->disks, same_set)
			/* walk all member disks and use the smallest recovery_offset as the start sector */
			if (rdev->raid_disk >= 0 &&
			    !test_bit(Faulty, &rdev->flags) &&
			    !test_bit(In_sync, &rdev->flags) &&
			    rdev->recovery_offset < j)
				j = rdev->recovery_offset;
	}

	printk(KERN_INFO "md: %s of RAID array %s\n", desc, mdname(mddev));
	printk(KERN_INFO "md: minimum _guaranteed_ speed:"
		" %d KB/sec/disk.\n", speed_min(mddev));
	printk(KERN_INFO "md: using maximum available idle IO bandwidth "
	       "(but not more than %d KB/sec) for %s.\n",
	       speed_max(mddev), desc);

	is_mddev_idle(mddev, 1); /* this initializes IO event counters */

	io_sectors = 0;
	for (m = 0; m < SYNC_MARKS; m++) {
		mark[m] = jiffies;
		mark_cnt[m] = io_sectors;
	}
	last_mark = 0;
	mddev->resync_mark = mark[last_mark];
	mddev->resync_mark_cnt = mark_cnt[last_mark];

	/*
	 * Tune reconstruction:
	 */
	window = 32*(PAGE_SIZE/512);
	printk(KERN_INFO "md: using %dk window, over a total of %llu blocks.\n",
		window/2, (unsigned long long) max_sectors/2);

	atomic_set(&mddev->recovery_active, 0);
	last_check = 0;

	if (j>2) {
		printk(KERN_INFO
		       "md: resuming %s of %s from checkpoint.\n",
		       desc, mdname(mddev));
		mddev->curr_resync = j;
	}

	while (j < max_sectors) {
		sector_t sectors;

		skipped = 0;

		if (!test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) &&
		    ((mddev->curr_resync > mddev->curr_resync_completed &&
		      (mddev->curr_resync - mddev->curr_resync_completed)
		      > (max_sectors >> 4)) ||
		     (j - mddev->curr_resync_completed)*2
		     >= mddev->resync_max - mddev->curr_resync_completed
			    )) {
			/* time to update curr_resync_completed */
			blk_unplug(mddev->queue);
			wait_event(mddev->recovery_wait,
				   atomic_read(&mddev->recovery_active) == 0);
			mddev->curr_resync_completed = mddev->curr_resync;
			set_bit(MD_CHANGE_CLEAN, &mddev->flags);
			sysfs_notify(&mddev->kobj, NULL, "sync_completed");
		}

		while (j >= mddev->resync_max && !kthread_should_stop()) {
			/* As this condition is controlled by user-space,
			 * we can block indefinitely, so use '_interruptible'
			 * to avoid triggering warnings.
			 */
			flush_signals(current); /* just in case */
			wait_event_interruptible(mddev->recovery_wait,
						 mddev->resync_max > j
						 || kthread_should_stop());
		}

		if (kthread_should_stop())
			goto interrupted;

		sectors = mddev->pers->sync_request(mddev, j, &skipped,
						  currspeed < speed_min(mddev));
		if (sectors == 0) {
			set_bit(MD_RECOVERY_INTR, &mddev->recovery);
			goto out;
		}

		if (!skipped) { /* actual IO requested */
			io_sectors += sectors;
			atomic_add(sectors, &mddev->recovery_active);
		}

		j += sectors;
		if (j>1)
			mddev->curr_resync = j;
		mddev->curr_mark_cnt = io_sectors;
		if (last_check == 0)
			/* this is the earliers that rebuilt will be
			 * visible in /proc/mdstat */
			md_new_event(mddev);

		if (last_check + window > io_sectors || j == max_sectors)
			continue;

		last_check = io_sectors;

		if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
			break;

	repeat:
		if (time_after_eq(jiffies, mark[last_mark] + SYNC_MARK_STEP )) {
			/* step marks */
			int next = (last_mark+1) % SYNC_MARKS;

			mddev->resync_mark = mark[next];
			mddev->resync_mark_cnt = mark_cnt[next];
			mark[next] = jiffies;
			mark_cnt[next] = io_sectors - atomic_read(&mddev->recovery_active);
			last_mark = next;
		}

		if (kthread_should_stop())
			goto interrupted;

		/*
		 * this loop exits only if either when we are slower than
		 * the 'hard' speed limit, or the system was IO-idle for
		 * a jiffy.
		 * the system might be non-idle CPU-wise, but we only care
		 * about not overloading the IO subsystem. (things like an
		 * e2fsck being done on the RAID array should execute fast)
		 */
		blk_unplug(mddev->queue);
		cond_resched();

		currspeed = ((unsigned long)(io_sectors-mddev->resync_mark_cnt))/2
			/((jiffies-mddev->resync_mark)/HZ +1) +1;

		if (currspeed > speed_min(mddev)) {
			if ((currspeed > speed_max(mddev)) ||
					!is_mddev_idle(mddev, 0)) {
				msleep(500);
				goto repeat;
			}
		}
	}
	printk(KERN_INFO "md: %s: %s done.\n", mdname(mddev), desc);
	/*
	 * this also signals 'finished resyncing' to md_stop
	 */
 out:
	blk_unplug(mddev->queue);

	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));

	/* tell personality that we are finished */
	mddev->pers->sync_request(mddev, max_sectors, &skipped, 1);

	if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
	    mddev->curr_resync > 2) {
		if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
			if (test_bit(MD_RECOVERY_INTR, &mddev->recovery)) {
				if (mddev->curr_resync >= mddev->recovery_cp) {
					printk(KERN_INFO
					       "md: checkpointing %s of %s.\n",
					       desc, mdname(mddev));
					mddev->recovery_cp = mddev->curr_resync;
				}
			} else
				mddev->recovery_cp = MaxSector;
		} else {
			if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
				mddev->curr_resync = MaxSector;
			list_for_each_entry(rdev, &mddev->disks, same_set)
				if (rdev->raid_disk >= 0 &&
				    !test_bit(Faulty, &rdev->flags) &&
				    !test_bit(In_sync, &rdev->flags) &&
				    rdev->recovery_offset < mddev->curr_resync)
					rdev->recovery_offset = mddev->curr_resync;
		}
	}
	set_bit(MD_CHANGE_DEVS, &mddev->flags);

 skip:
	mddev->curr_resync = 0;
	mddev->curr_resync_completed = 0;
	if (!test_bit(MD_RECOVERY_INTR, &mddev->recovery))
		/* We completed so max setting can be forgotten. */
		mddev->resync_max = MaxSector;
	sysfs_notify(&mddev->kobj, NULL, "sync_completed");
	wake_up(&resync_wait);
	set_bit(MD_RECOVERY_DONE, &mddev->recovery);
	md_wakeup_thread(mddev->thread);
	return;

 interrupted:
	/*
	 * got a signal, exit.
	 */
	printk(KERN_INFO
	       "md: md_do_sync() got signal ... exiting\n");
	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
	goto out;
}
The sync thread sets the start sector j and the end sector max_sectors for the resync, reshape or recovery, then loops calling pers->sync_request(). Next, sync_request():
/* FIXME go_faster isn't used */
static inline sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
				    int *skipped, int go_faster)
{
	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
	struct stripe_head *sh;
	sector_t max_sector = mddev->dev_sectors;
	int sync_blocks;
	int still_degraded = 0;
	int i;

	if (sector_nr >= max_sector) {	/* the start sector is past the end of the device */
		/* just being told to finish up .. nothing much to do */
		unplug_slaves(mddev);	/* unplug the member disks */

		if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {	/* for a reshape, end_reshape() finishes it */
			end_reshape(conf);
			return 0;
		}

		if (mddev->curr_resync < max_sector) /* aborted */	/* curr_resync below the device size means the sync was aborted */
			bitmap_end_sync(mddev->bitmap, mddev->curr_resync,
					&sync_blocks, 1);
		else /* completed sync */
			conf->fullsync = 0;
		bitmap_close_sync(mddev->bitmap);

		return 0;
	}

	/* Allow raid5_quiesce to complete */
	wait_event(conf->wait_for_overlap, conf->quiesce != 2);

	if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))	/* a reshape is handled by reshape_request() */
		return reshape_request(mddev, sector_nr, skipped);

	/* No need to check resync_max as we never do more than one
	 * stripe, and as resync_max will always be on a chunk boundary,
	 * if the check in md_do_sync didn't fire, there is no chance
	 * of overstepping resync_max here
	 */

	/* if there is too many failed drives and we are trying
	 * to resync, then assert that we are finished, because there is
	 * nothing we can do.
	 */
	if (mddev->degraded >= conf->max_degraded &&
	    test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
		/* too many failed disks: the sync cannot proceed, so set *skipped = 1
		 * and return the number of sectors that were not synced */
		sector_t rv = mddev->dev_sectors - sector_nr;
		*skipped = 1;
		return rv;
	}
	if (!bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, 1) &&
	    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery) &&
	    !conf->fullsync && sync_blocks >= STRIPE_SECTORS) {
		/* we can skip this block, and probably more */
		sync_blocks /= STRIPE_SECTORS;
		*skipped = 1;
		return sync_blocks * STRIPE_SECTORS; /* keep things rounded to whole stripes */
	}

	bitmap_cond_end_sync(mddev->bitmap, sector_nr);

	sh = get_active_stripe(conf, sector_nr, 0, 1, 0);	/* try to get a stripe head without blocking; on failure retry blocking */
	if (sh == NULL) {
		sh = get_active_stripe(conf, sector_nr, 0, 0, 0);
		/* make sure we don't swamp the stripe cache if someone else
		 * is trying to get access
		 */
		schedule_timeout_uninterruptible(1);
	}
	/* Need to check if array will still be degraded after recovery/resync
	 * We don't need to check the 'failed' flag as when that gets set,
	 * recovery aborts.
	 */
	for (i = 0; i < conf->raid_disks; i++)
		if (conf->disks[i].rdev == NULL)
			still_degraded = 1;

	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);

	spin_lock(&sh->lock);
	set_bit(STRIPE_SYNCING, &sh->state);	/* mark the stripe as being synchronized */
	clear_bit(STRIPE_INSYNC, &sh->state);	/* and clear its in-sync state */
	spin_unlock(&sh->lock);

	handle_stripe(sh);
	release_stripe(sh);

	return STRIPE_SECTORS;
}
This function first obtains a stripe head (sh), sets its "syncing" flag, clears its "in sync" flag, and then calls handle_stripe(), which in turn calls handle_stripe5(). The main steps in handle_stripe5() are:
1. First, the stripe state s.syncing is set, indicating the sh is being synchronized;
2. handle_stripe_fill5() is called; it invokes fetch_block5() on each dev in the sh to detect whether any data needs reading. During a sync every dev is marked R5_Wantread and R5_LOCKED, meaning all of the stripe's dev data must be read;
3. ops_run_io() is called; based on R5_Wantread it sets the read-completion callback raid5_end_read_request and issues the reads to the member disks via generic_make_request(). When a read completes, raid5_end_read_request() marks that member disk's dev R5_UPTODATE (up to date), clears R5_LOCKED, and calls release_stripe() to start the next round;
4. In the second round, the devs are now all R5_UPTODATE, so handle_parity_checks5() is called. It acts according to sh->check_state, whose initial value during sync is idle: it sets sh->check_state to run, sets STRIPE_OP_CHECK in s->ops_request, clears R5_UPTODATE on the parity dev only, and decrements s->uptodate, declaring the parity disk's data invalid and in need of recomputation;
5. raid_run_ops() is called; on the STRIPE_OP_CHECK flag in s->ops_request it invokes ops_run_check_p(), which takes the parity disk's page as the destination and the data disks' pages as sources and XORs them all together via async_xor_val(). The result lands in sh->ops.zero_sum_result: zero means the parity is correct, nonzero means it is wrong. The completion callback ops_complete_check() then sets sh->check_state to check_result and calls release_stripe() for the next round;
6. In the third round handle_parity_checks5() handles the check_result state: if sh->ops.zero_sum_result == 0 the parity data is correct and the stripe is marked STRIPE_INSYNC; otherwise sh->check_state is set to compute_run, the sh is marked STRIPE_COMPUTE_RUN, STRIPE_OP_COMPUTE_BLK is set in s->ops_request, and the parity dev is marked R5_Wantcompute, meaning its data must be recomputed by XOR-ing the other member disks;
7. __raid_run_ops() then dispatches on STRIPE_OP_COMPUTE_BLK into ops_run_compute5(), which performs the XOR; its completion callback ops_complete_compute() sets sh->check_state to compute_result and calls release_stripe() for the next round;
8. In the fourth round handle_parity_checks5() handles the compute_result state: it sets R5_Wantwrite on the parity dev so the recomputed parity is written out. From here on it is the same as a normal write, so we won't repeat it.
This completes the analysis of the raid5 synchronization path.