
Workflow Mining: A Survey of Issues and Approaches (5) - Getting to the Main Topic

2007-04-26 09:18


3. Workflow mining

The goal of workflow mining is to extract information about processes from transaction logs. Instead of starting with a workflow design, we start by gathering information about the workflow processes as they take place. We assume that it is possible to record events such that (i) each event refers to a task (i.e., a well-defined step in the workflow), (ii) each event refers to a case (i.e., a workflow instance), and (iii) events are totally ordered. Any information system using transactional systems such as ERP, CRM, or workflow management systems will offer this information in some form. Note that we do not assume the presence of a workflow management system. The only assumption we make is that it is possible to collect workflow logs with event data. These workflow logs are used to construct a process specification which adequately models the behavior registered. The term process mining refers to methods for distilling a structured process description from a set of real executions. Because these methods focus on so-called case-driven processes that are supported by contemporary workflow management systems, we also use the term workflow mining.
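As a minimal sketch of these three assumptions, the Python fragment below models a workflow log as a totally ordered sequence of events, each carrying a case identifier and a task name. The class and field names are illustrative only and are not taken from any particular system.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    case_id: str  # (ii) each event refers to a case (a workflow instance)
    task: str     # (i) each event refers to a task (a well-defined step)
    seq: int      # (iii) position in the totally ordered log

# A workflow log is then nothing more than a sequence of such events.
log: List[Event] = [
    Event("case 1", "A", 0),
    Event("case 2", "A", 1),
    Event("case 1", "B", 2),
    Event("case 1", "C", 3),
    Event("case 2", "E", 4),
    Event("case 1", "D", 5),
    Event("case 2", "D", 6),
]
```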
Table 1 shows a fragment of a workflow log generated by the Staffware system. In Staffware, events are grouped on a case-by-case basis. The first column refers to the task (description), the second to the type of event, the third to the user generating the event (if any), and the last column shows a time stamp. The corresponding Staffware model is shown in Fig. 2. Case 10 shown in Table 1 follows the scenario where first task Register is executed, followed by Send questionnaire, Receive questionnaire, and Evaluate. Based on the evaluation, the decision is made to directly archive (task Archive) the case without further processing. For Case 9 further processing is needed, while Case 8 involves a timeout and the repeated execution of some tasks. Someone familiar with Staffware will be able to decide that the three cases indeed follow a scenario possible in the Staffware model shown in Fig. 2. However, three cases are not sufficient to automatically derive the model of Fig. 2. Note that there are more Staffware models enabling the three scenarios shown in Table 1. The challenge of workflow mining is to derive “good” workflow models with as little information as possible.
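To make the case-by-case grouping concrete, the sketch below folds a totally ordered log into one trace per case. The log rows and the helper name are hypothetical, loosely following the task names of Table 1.

```python
from collections import defaultdict

# Each recorded line reduced to (case, task), in the order it was logged.
staffware_like_log = [
    ("case 10", "Register"),
    ("case 10", "Send questionnaire"),
    ("case 9",  "Register"),
    ("case 10", "Receive questionnaire"),
    ("case 10", "Evaluate"),
    ("case 10", "Archive"),
]

def group_by_case(log):
    """Group a totally ordered event log into one trace per case."""
    traces = defaultdict(list)
    for case, task in log:
        traces[case].append(task)
    return dict(traces)

print(group_by_case(staffware_like_log))
# {'case 10': ['Register', 'Send questionnaire', 'Receive questionnaire',
#              'Evaluate', 'Archive'], 'case 9': ['Register']}
```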
To illustrate the principle of process mining in more detail, we consider the workflow log shown in Table 2. This log abstracts from the time, date, and event type, and limits the information to the order in which tasks are being executed. The log shown in Table 2 contains information about five cases (i.e., workflow instances). The log shows that for four cases (1, 2, 3, and 4) the tasks A, B, C, and D have been executed. For the fifth case only three tasks are executed: tasks A, E, and D. Each case starts with the execution of A and ends with the execution of D. If task B is executed, then also task C is executed. However, for some cases task C is executed before task B. Based on the information shown in Table 2 and by making some assumptions about the completeness of the log (i.e., assuming that the cases are representative and a sufficiently large subset of possible behaviors is observed), we can deduce for example the process model shown in Fig. 3. The model is represented in terms of a Petri net [50]. The Petri net starts with task A and finishes with task D. These tasks are represented by transitions. After executing A there is a choice between either executing B and C in parallel or just executing task E. To execute B and C in parallel, two non-observable tasks (AND-split and AND-join) have been added. These tasks have been added for routing purposes only and are not present in the workflow log. Note that for this example we assume that two tasks are in parallel if they appear in any order. By distinguishing between start events and complete events for tasks it is possible to explicitly detect parallelism (cf. Section 4).
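To make the step from Table 2 to Fig. 3 more tangible, the sketch below derives the "directly follows" pairs from the five traces and classifies task pairs as causal or parallel under the assumption stated above (two tasks that appear in either order are considered parallel). The trace encoding and helper names are ours, purely for illustration.

```python
from itertools import product

# Traces as in Table 2: four cases A,B,C,D (B and C in either order), one A,E,D.
traces = [list("ABCD"), list("ACBD"), list("ABCD"), list("ACBD"), list("AED")]

def directly_follows(traces):
    """All pairs (x, y) such that y directly follows x in at least one trace."""
    pairs = set()
    for trace in traces:
        pairs.update(zip(trace, trace[1:]))
    return pairs

def classify(traces):
    """Split task pairs into causal (x -> y) and parallel (x || y) relations."""
    follows = directly_follows(traces)
    tasks = {t for trace in traces for t in trace}
    causal = {(x, y) for x, y in product(tasks, tasks)
              if (x, y) in follows and (y, x) not in follows}
    parallel = {(x, y) for x, y in product(tasks, tasks)
                if (x, y) in follows and (y, x) in follows}
    return causal, parallel

causal, parallel = classify(traces)
print(sorted(causal))    # [('A', 'B'), ('A', 'C'), ('A', 'E'), ('B', 'D'), ('C', 'D'), ('E', 'D')]
print(sorted(parallel))  # [('B', 'C'), ('C', 'B')] -- B and C occur in either order
```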
Table 2 contains the minimal information we assume to be present. In many applications, the workflow log contains a time stamp for each event and this information can be used to extract additional causality information. In addition, a typical log also contains information about the type of event, e.g., a start event (a person selecting a task from a worklist), a complete event (the completion of a task), a withdraw event (a scheduled task is removed), etc. Moreover, we are also interested in the relation between attributes of the case and the actual route taken by a particular case. For example, when handling traffic violations: Is the make of a car relevant for the routing of the corresponding traffic violation? (For example, people driving a Ferrari always pay their fines in time.)
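The following sketch shows how start and complete events, when they are logged, make parallelism explicit: two tasks of the same case whose execution intervals overlap in time must have been active concurrently. Timestamps, task names, and helper functions are invented for the example.

```python
from datetime import datetime

# Start/complete events for one case; names and timestamps are illustrative.
events = [
    ("case 1", "B", "start",    datetime(2002, 6, 19, 12, 0)),
    ("case 1", "C", "start",    datetime(2002, 6, 19, 12, 5)),
    ("case 1", "B", "complete", datetime(2002, 6, 19, 12, 20)),
    ("case 1", "C", "complete", datetime(2002, 6, 19, 12, 30)),
]

def intervals(events, case):
    """Map every task of the given case to its (start, complete) interval."""
    iv = {}
    for c, task, etype, ts in events:
        if c != case:
            continue
        start, end = iv.get(task, (None, None))
        iv[task] = (ts, end) if etype == "start" else (start, ts)
    return iv

def overlaps(iv, x, y):
    """Two tasks whose execution intervals overlap were active in parallel."""
    (sx, ex), (sy, ey) = iv[x], iv[y]
    return sx < ey and sy < ex

iv = intervals(events, "case 1")
print(overlaps(iv, "B", "C"))  # True: B and C were running at the same time
```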
For this simple example (i.e., Table 2), it is quite easy to construct a process model that is able to regenerate the workflow log (e.g., Fig. 3). For more realistic situations there are however a number of complicating factors:

For larger workflow models mining is much more difficult. For example, if the model exhibits alternative and parallel routing, then the workflow log will typically not contain all possible combinations. Consider 10 tasks which can be executed in parallel. The total number of interleavings is 10!=3628800. It is not realistic that each interleaving is present in the log. Moreover, certain paths through the process model may have a low probability and therefore remain undetected.
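A back-of-the-envelope sketch of this argument follows; the number of logged cases is a made-up figure.

```python
from math import factorial

parallel_tasks = 10
print(factorial(parallel_tasks))        # 3628800 possible interleavings

logged_cases = 5000                     # hypothetical size of the workflow log
coverage = logged_cases / factorial(parallel_tasks)
print(f"at most {coverage:.2%} of the interleavings can occur in the log")
```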

Workflow logs will typically contain noise, i.e., parts of the log can be incorrect, incomplete, or refer to exceptions. Events can be logged incorrectly because of human or technical errors. Events can be missing in the log if some of the tasks are manual or handled by another system/organizational unit. Events can also refer to rare or undesired events. Consider for example the workflow in a hospital. If due to time pressure the order of two events (e.g., make X-ray and remove drain) is reversed, this does not imply that this would be part of the regular medical protocol and should be supported by the hospital’s workflow system. Also two causally unrelated events (e.g., take blood sample and death of patient) may happen next to each other without implying a causal relation (i.e., taking a sample did not result in the death of the patient; it was sheer coincidence). Clearly, exceptions which are recorded only once should not automatically become part of the regular workflow.
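One common way to keep such one-off exceptions out of a mined model is to discard trace variants whose relative frequency stays below a threshold. The sketch below only illustrates that idea; the variant counts and the threshold are invented.

```python
from collections import Counter

# Trace variants with observed frequencies (the numbers are made up).
variants = Counter({
    ("A", "B", "C", "D"): 480,
    ("A", "C", "B", "D"): 455,
    ("A", "E", "D"):       64,
    ("A", "C", "D"):        1,   # logged once, e.g. a recording error or exception
})

def filter_noise(variants, min_support=0.01):
    """Keep only variants whose relative frequency reaches the threshold."""
    total = sum(variants.values())
    return {v: n for v, n in variants.items() if n / total >= min_support}

print(filter_noise(variants))  # the one-off variant ('A', 'C', 'D') is dropped
```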

Table 2 only shows the order of events without giving information about the type of event, the time of the event, and attributes of the event (i.e., data about the case and/or task). Clearly, it is a challenge to exploit such additional information.

Sections 5–9 will present different approaches to some of these problems.
To conclude this section, we point out legal issues relevant when mining (timed) workflow logs. Clearly, workflow logs can be used to systematically measure the performance of employees. The legislation with respect to issues such as privacy and protection of personal data differs from country to country. For example, Dutch companies are bound by the Personal Data Protection Act (Wet Bescherming Persoonsgegeven) which is based on a directive from the European Union. The practical implications of this for the Dutch situation are described in [14,31,51]. Workflow logs are not restricted by these laws as long as the information in the log cannot be traced back to individuals. If information in the log can be traced back to a specific employee, it is important that the employee is aware of the fact that her/his activities are logged and the fact that this logging is used to monitor her/his performance. Note that in a timed workflow log we can abstract from information about the workers executing tasks and still mine the process. Therefore, it is possible to avoid collecting information on the productivity of individual workers and legislation such as the Personal Data Protection Act does not apply. Nevertheless, the logs of most workflow systems contain information about individual workers, and therefore, this issue should be considered carefully.
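As a sketch of this last point, a log can be stripped of (or pseudonymized in) its user column before mining, so that the control-flow perspective is preserved without recording who performed each task. The column layout, function name, and the salted-hash option are assumptions for illustration; note that pseudonymized data may still count as personal data under such legislation.

```python
import hashlib

# Raw log rows: (case, task, user, timestamp); the user field is personal data.
rows = [("case 10", "Register", "jsmith", "2002-06-19 12:58")]

def anonymize(rows, keep_user=False, salt="site-secret"):
    """Drop the user column, or replace it with a salted hash if hand-over
    information is still needed (a pseudonym may still be personal data)."""
    cleaned = []
    for case, task, user, ts in rows:
        user = hashlib.sha256((salt + user).encode()).hexdigest()[:8] if keep_user else None
        cleaned.append((case, task, user, ts))
    return cleaned

print(anonymize(rows))  # [('case 10', 'Register', None, '2002-06-19 12:58')]
```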