OpenStack Ceilometer: Alarm
2015-09-21 17:54
Overview
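In Ceilometer, an alarm is evaluated against monitored samples to decide whether it should be triggered or cleared. The overall structure is simple: alarms are built on monitoring data. Once the monitored data for an alarm has been retrieved, it is analyzed, and the alarm state is derived and set from the result.
Code structure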
ceilometer/alarm
├── evaluator
│   ├── combination.py
│   ├── gnocchi.py
│   ├── __init__.py
│   ├── threshold.py
│   └── utils.py
├── __init__.py
├── notifier
│   ├── __init__.py
│   ├── log.py
│   ├── rest.py
│   ├── test.py
│   └── trust.py
├── partition
│   ├── coordination.py
│   └── __init__.py
├── rpc.py
├── service.py
└── storage
    ├── base.py
    ├── impl_db2.py
    ├── impl_hbase.py
    ├── impl_log.py
    ├── impl_mongodb.py
    ├── impl_sqlalchemy.py
    ├── __init__.py
    ├── models.py
    └── pymongo_base.py
Alarm evaluator
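Ceilometer currently provides three evaluators:
- threshold: evaluates whether the maximum, minimum, or average value of a configured meter has crossed the configured threshold.
- combination: evaluates by combining multiple alarms, i.e. whether several meters have crossed their thresholds.
- gnocchi: an evaluator implementation backed by the OpenStack Gnocchi project.
The rest of this section walks through the evaluation process, using the threshold evaluator as the example.
Step 1: retrieve the maximum, minimum, and average values for the object the alarm is set on. The method is: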
ceilometer/alarm/evaluator/threshold.py
def _statistics(self, alarm, query):
    """Retrieve statistics over the current window."""
    LOG.debug(_('stats query %s') % query)
    try:
        return self._client.statistics.list(
            meter_name=alarm.rule['meter_name'], q=query,
            period=alarm.rule['period'])
    except Exception:
        LOG.exception(_('alarm stats retrieval failed'))
        return []
This is in fact an API call to the statistics API that Ceilometer itself provides. It retrieves the statistics of the specified meter (meter_name) for the specified object over a configured period. The returned values include:
the maximum, the minimum, and the average
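To make "statistics per period" concrete, here is a minimal sketch that groups raw samples into fixed-width periods and computes min, max, and avg for each. It is only an illustration: the real aggregation happens on the Ceilometer side, and the `Statistics` tuple and `bucket_statistics` function are hypothetical names that do not appear in Ceilometer.

```python
from collections import namedtuple

# Hypothetical stand-in for one entry returned by the statistics API.
Statistics = namedtuple("Statistics", "period_start min max avg count")


def bucket_statistics(samples, period):
    """Group (timestamp, value) samples into fixed periods and summarize.

    samples: iterable of (unix_timestamp, value) pairs
    period:  bucket width in seconds
    """
    buckets = {}
    for ts, value in samples:
        # Start of the period this sample falls into.
        start = int(ts // period) * period
        buckets.setdefault(start, []).append(value)

    return [
        Statistics(start, min(vals), max(vals), sum(vals) / len(vals), len(vals))
        for start, vals in sorted(buckets.items())
    ]


# Four samples, 30 seconds apart, summarized over 60-second periods.
samples = [(0, 10.0), (30, 20.0), (60, 40.0), (90, 60.0)]
stats = bucket_statistics(samples, period=60)
for s in stats:
    print(s)
```

Each element of `stats` plays the role of one datapoint in the steps that follow.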
Step 2: check whether enough statistics were retrieved:
ceilometer/alarm/evaluator/threshold.py
def _sufficient(self, alarm, statistics):
    """Check for the sufficiency of the data for evaluation.

    Ensure there is sufficient data for evaluation, transitioning to
    unknown otherwise.
    """
    sufficient = len(statistics) >= alarm.rule['evaluation_periods']
    if not sufficient and alarm.state != evaluator.UNKNOWN:
        LOG.warn(_LW('Expecting %(expected)d datapoints but only get '
                     '%(actual)d') % {
            'expected': alarm.rule['evaluation_periods'],
            'actual': len(statistics)})
        # Reason is not same as log message because we want to keep
        # consistent since thirdparty software may depend on old format.
        reason = _('%d datapoints are unknown') % alarm.rule[
            'evaluation_periods']
        last = None if not statistics else (
            getattr(statistics[-1], alarm.rule['statistic']))
        reason_data = self._reason_data('unknown',
                                        alarm.rule['evaluation_periods'],
                                        last)
        self._refresh(alarm, evaluator.UNKNOWN, reason, reason_data)
    return sufficient
That is, the number of datapoints actually returned for the window is compared with the number expected (evaluation_periods). If there are too few and the alarm is not already in the unknown state, it transitions to unknown.
Step 3: apply the configured comparison rule to decide whether the alarm is currently triggered or cleared:
ceilometer/alarm/evaluator/threshold.py
def _compare(stat):
    op = COMPARATORS[alarm.rule['comparison_operator']]
    value = getattr(stat, alarm.rule['statistic'])
    limit = alarm.rule['threshold']
    LOG.debug(_('comparing value %(value)s against threshold'
                ' %(limit)s') %
              {'value': value, 'limit': limit})
    return op(value, limit)
Step 4: process the comparison results and transition the alarm state
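The comparison step can be tried out in isolation with the sketch below. It assumes that `COMPARATORS` maps operator names such as `lt` and `gt` to functions from Python's `operator` module; the `Stat` tuple and `compare_all` helper are hypothetical names used only for illustration.

```python
import operator
from collections import namedtuple

# Assumed mapping from rule operator names to comparison functions.
COMPARATORS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "ge": operator.ge,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
}

# Hypothetical stand-in for one statistics datapoint.
Stat = namedtuple("Stat", "min max avg")


def compare_all(rule, statistics):
    """Return one boolean per datapoint: True if the threshold is crossed."""
    op = COMPARATORS[rule["comparison_operator"]]
    limit = rule["threshold"]
    # Pick the configured statistic (min, max or avg) from each datapoint.
    return [op(getattr(stat, rule["statistic"]), limit) for stat in statistics]


rule = {"comparison_operator": "lt", "statistic": "min", "threshold": 30}
stats = [Stat(min=25, max=80, avg=50), Stat(min=35, max=90, avg=60)]
compared = compare_all(rule, stats)
print(compared)
```

With the rule "min lt 30", the first datapoint crosses the threshold and the second does not, so the result is mixed; how such a mixed result is handled is decided in the state transition step.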
ceilometer/alarm/evaluator/threshold.py
def _transition(self, alarm, statistics, compared):
    """Transition alarm state if necessary.

    The transition rules are currently hardcoded as:

    - transitioning from a known state requires an unequivocal
      set of datapoints

    - transitioning from unknown is on the basis of the most
      recent datapoint if equivocal

    Ultimately this will be policy-driven.
    """
    distilled = all(compared)
    unequivocal = distilled or not any(compared)
    unknown = alarm.state == evaluator.UNKNOWN
    continuous = alarm.repeat_actions

    if unequivocal:
        state = evaluator.ALARM if distilled else evaluator.OK
        reason, reason_data = self._reason(alarm, statistics,
                                           distilled, state)
        if alarm.state != state or continuous:
            self._refresh(alarm, state, reason, reason_data)
    elif unknown or continuous:
        trending_state = evaluator.ALARM if compared[-1] else evaluator.OK
        state = trending_state if unknown else alarm.state
        reason, reason_data = self._reason(alarm, statistics,
                                           distilled, state)
        self._refresh(alarm, state, reason, reason_data)
When the alarm state changes, the _refresh method is called. It ultimately stores the evaluation result in the database as alarm history and raises the alarm through the alarm-notifier service.
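The transition rules are easier to follow as a pure function. The sketch below mirrors the branching of `_transition` but returns the state that would be passed to `_refresh`, or `None` when no refresh would happen. The state strings are assumed to match Ceilometer's evaluator constants, and `next_state` is a hypothetical helper, not part of Ceilometer.

```python
# State names assumed to match Ceilometer's evaluator constants.
UNKNOWN = "insufficient data"
OK = "ok"
ALARM = "alarm"


def next_state(current, compared, repeat_actions=False):
    """Return the state to refresh the alarm to, or None if nothing happens.

    current:        the alarm's present state
    compared:       one boolean per datapoint (True = threshold crossed)
    repeat_actions: refresh even when the state stays the same
    """
    distilled = all(compared)
    unequivocal = distilled or not any(compared)

    if unequivocal:
        # All datapoints agree: alarm if all crossed, ok if none crossed.
        state = ALARM if distilled else OK
        if current != state or repeat_actions:
            return state
        return None

    # Mixed datapoints: only leave UNKNOWN, based on the latest datapoint.
    if current == UNKNOWN:
        return ALARM if compared[-1] else OK
    if repeat_actions:
        return current
    return None


print(next_state(OK, [True, True]))         # all crossed
print(next_state(OK, [True, False]))        # mixed, known state
print(next_state(UNKNOWN, [True, False]))   # mixed, unknown state
```

The design point worth noticing is that a known state is never changed by mixed datapoints; only an unambiguous window can move an alarm between ok and alarm.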
Alarm notifier
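The notifier currently provides two implementations for raising alarms promptly, log and rest. The rest implementation is analyzed here. In essence it is an HTTP POST callback: it calls an API endpoint of some other service to deliver the alarm. The implementation is: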
ceilometer/alarm/notifier/rest.py
@staticmethod
def notify(action, alarm_id, alarm_name, severity, previous,
           current, reason, reason_data, headers=None):
    headers = headers or {}
    if not headers.get('x-openstack-request-id'):
        headers['x-openstack-request-id'] = context.generate_request_id()

    LOG.info(_(
        "Notifying alarm %(alarm_name)s %(alarm_id)s with severity"
        " %(severity)s from %(previous)s to %(current)s with action "
        "%(action)s because %(reason)s. request-id: %(request_id)s ") %
        ({'alarm_name': alarm_name, 'alarm_id': alarm_id,
          'severity': severity, 'previous': previous,
          'current': current, 'action': action, 'reason': reason,
          'request_id': headers['x-openstack-request-id']}))
    body = {'alarm_name': alarm_name, 'alarm_id': alarm_id,
            'severity': severity, 'previous': previous,
            'current': current, 'reason': reason,
            'reason_data': reason_data}
    headers['content-type'] = 'application/json'
    kwargs = {'data': jsonutils.dumps(body),
              'headers': headers}

    if action.scheme == 'https':
        default_verify = int(cfg.CONF.alarm.rest_notifier_ssl_verify)
        options = urlparse.parse_qs(action.query)
        verify = bool(int(options.get('ceilometer-alarm-ssl-verify',
                                      [default_verify])[-1]))
        kwargs['verify'] = verify

        cert = cfg.CONF.alarm.rest_notifier_certificate_file
        key = cfg.CONF.alarm.rest_notifier_certificate_key
        if cert:
            kwargs['cert'] = (cert, key) if key else cert

    # FIXME(rhonjo): Retries are automatically done by urllib3 in requests
    # library. However, there's no interval between retries in urllib3
    # implementation. It will be better to put some interval between
    # retries (future work).
    max_retries = cfg.CONF.alarm.rest_notifier_max_retries
    session = requests.Session()
    session.mount(action.geturl(),
                  requests.adapters.HTTPAdapter(max_retries=max_retries))
    eventlet.spawn_n(session.post, action.geturl(), **kwargs)
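One detail in `notify` that is easy to miss is that the callback URL itself can override SSL verification through the `ceilometer-alarm-ssl-verify` query option. The sketch below reproduces just the request preparation in Python 3 with the standard library, without sending anything. `build_post_kwargs`, the example URLs, and the sample body values are hypothetical; the config lookup is replaced by a plain `default_verify` argument.

```python
import json
from urllib.parse import parse_qs, urlsplit


def build_post_kwargs(action_url, body, default_verify=True):
    """Prepare the keyword arguments for the HTTP POST callback."""
    action = urlsplit(action_url)
    kwargs = {
        "data": json.dumps(body),
        "headers": {"content-type": "application/json"},
    }
    if action.scheme == "https":
        options = parse_qs(action.query)
        # The last occurrence of the option wins; fall back to the default.
        raw = options.get("ceilometer-alarm-ssl-verify",
                          [int(default_verify)])[-1]
        kwargs["verify"] = bool(int(raw))
    return kwargs


# Illustrative body with the same keys that notify() sends.
body = {
    "alarm_name": "alarm-test",
    "alarm_id": "hypothetical-id",
    "severity": "low",
    "previous": "ok",
    "current": "alarm",
    "reason": "illustrative reason text",
    "reason_data": {},
}

secure = build_post_kwargs("https://example.com/hook", body)
insecure = build_post_kwargs(
    "https://example.com/hook?ceilometer-alarm-ssl-verify=0", body)
plain = build_post_kwargs("http://example.com/hook", body)
print(secure["verify"], insecure["verify"], "verify" in plain)
```

A receiving service only has to accept a JSON POST and read these seven keys from the body.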
Setting alarms
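Taking a threshold alarm as the example, an alarm is created with a call like the following:
curl -i -X 'POST' 'http://<ceilometer_API_host_ip>:8777/v2/alarms' -H 'Content-Type: application/json' -H 'X-Auth-Token: 625d336a15b04a21bd2a588bd6a79572' -d '{"alarm_actions": [], "ok_actions": [], "name": "alarm-test", "type": "threshold", "enabled": true, "threshold_rule": {"meter_name": "cpu_util", "statistic": "min", "threshold": 30, "comparison_operator": "lt", "evaluation_periods": 1, "period": 60, "query": []}}'
Parameters
- alarm_actions: list, the URLs the notifier calls back when the alarm is triggered
- ok_actions: list, the URLs the notifier calls back when the alarm is cleared
- name: string, the alarm name
- threshold_rule: the alarm rule, with the following fields:
  - meter_name: the meter (sample) being monitored
  - query: the query that selects the objects the alarm applies to
  - period: the length of each statistics period
  - evaluation_periods: the number of periods (datapoints) to evaluate
  - statistic: the statistic to use: min (minimum), max (maximum), or avg (average)
  - comparison_operator: the operator used to compare the statistic with the threshold (lt in the example above)
  - threshold: the threshold value
- type: string, the alarm type
- enabled: bool, whether the alarm is enabled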
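Building the request body in code avoids quoting mistakes in a long curl command. The sketch below assembles the same fields described in the parameter list and serializes them with `json.dumps`. `build_threshold_alarm` is a hypothetical helper, and the check on `statistic` only covers the three values named in this article; the API may accept others.

```python
import json

# Only the statistics named in this article; an assumption, not an API limit.
ALLOWED_STATISTICS = {"min", "max", "avg"}


def build_threshold_alarm(name, meter_name, threshold,
                          comparison_operator="lt", statistic="avg",
                          period=60, evaluation_periods=1, query=None,
                          alarm_actions=None, ok_actions=None, enabled=True):
    """Return a dict shaped like the alarm creation request body."""
    if statistic not in ALLOWED_STATISTICS:
        raise ValueError("unsupported statistic: %s" % statistic)
    return {
        "name": name,
        "type": "threshold",
        "enabled": enabled,
        "alarm_actions": alarm_actions or [],
        "ok_actions": ok_actions or [],
        "threshold_rule": {
            "meter_name": meter_name,
            "statistic": statistic,
            "threshold": threshold,
            "comparison_operator": comparison_operator,
            "evaluation_periods": evaluation_periods,
            "period": period,
            "query": query or [],
        },
    }


payload = build_threshold_alarm("alarm-test", "cpu_util", 30, statistic="min")
body = json.dumps(payload)
print(body)
```

The resulting string can be passed as the `-d` argument of the curl call, or as the request body of any HTTP client.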
Query alarms & Alarm History
Alarms are queried as follows:
curl -i -X 'GET' 'http://<ceilometer_API_host_ip>:8777/v2/alarms' -H 'X-Auth-Token: 625d336a15b04a21bd2a588bd6a79572'
Alarm history is queried as follows:
curl -i -X 'POST' 'http://200.21.9.11:8777/v2/alarms/history' -H 'Content-Type: application/json' -H 'X-Auth-Token: fb3bb73481664a61a90235b1adf6f567'
You can also query the state-change history of a specific alarm, or query historical alarms by time.