OpenStack Ceilometer: Alarm

2015-09-21 17:54 · 543 views

Overview

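In ceilometer, an alarm is evaluated against monitored samples, and the evaluation decides whether the alarm is triggered or cleared.

Alarms are built on monitoring data: the evaluator retrieves the data for the monitored meter, analyzes it, and then derives and sets the alarm state.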

Code layout

ceilometer/alarm

├── evaluator
│   ├── combination.py
│   ├── gnocchi.py
│   ├── __init__.py
│   ├── threshold.py
│   └── utils.py
├── __init__.py
├── notifier
│   ├── __init__.py
│   ├── log.py
│   ├── rest.py
│   ├── test.py
│   └── trust.py
├── partition
│   ├── coordination.py
│   └── __init__.py
├── rpc.py
├── service.py
└── storage
    ├── base.py
    ├── impl_db2.py
    ├── impl_hbase.py
    ├── impl_log.py
    ├── impl_mongodb.py
    ├── impl_sqlalchemy.py
    ├── __init__.py
    ├── models.py
    └── pymongo_base.py


Alarm evaluator

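ceilometer currently provides three evaluators:

- threshold: evaluates whether the maximum, minimum, or average of a given meter crosses the configured threshold.

- combination: evaluates by combining the states of several alarms, each of which has its own threshold rule.

- gnocchi: an evaluator backed by the OpenStack Gnocchi project.

The rest of this section walks through the evaluation process, using the threshold evaluator as the example.

Step 1: retrieve the maximum, minimum, and average for the object the alarm is set on. The call looks like this: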

ceilometer/alarm/evaluator/threshold.py

def _statistics(self, alarm, query):
    """Retrieve statistics over the current window."""
    LOG.debug(_('stats query %s') % query)
    try:
        return self._client.statistics.list(
            meter_name=alarm.rule['meter_name'], q=query,
            period=alarm.rule['period'])
    except Exception:
        LOG.exception(_('alarm stats retrieval failed'))
        return []


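This is in fact an API call: it goes through ceilometer's own statistics API and returns the statistics of the specified meter (meter_name) for the specified object over a configured time window. The statistics include the maximum, minimum, and average.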
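To make the idea of per-window statistics concrete, here is a minimal sketch that groups raw samples into fixed windows and summarizes each one. It is not ceilometer's implementation; the function name `period_statistics` and the `(timestamp, value)` sample shape are assumptions made for illustration.

```python
from collections import defaultdict


def period_statistics(samples, period):
    """Summarize (unix_timestamp, value) samples per window of `period` seconds.

    Hypothetical helper: it only mimics the kind of result a statistics
    query returns (min, max, avg per period).
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align every sample to the start of the window it falls into.
        buckets[int(ts // period) * period].append(value)

    stats = []
    for start in sorted(buckets):
        values = buckets[start]
        stats.append({
            "period_start": start,
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "count": len(values),
        })
    return stats


samples = [(0, 10.0), (30, 20.0), (60, 40.0), (90, 60.0)]
stats = period_statistics(samples, period=60)
for entry in stats:
    print(entry)
```

With a 60-second period, the four samples above fall into two windows, so the evaluator would see two datapoints.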

Step 2: check whether enough statistics were retrieved:

ceilometer/alarm/evaluator/threshold.py

def _sufficient(self, alarm, statistics):
    """Check for the sufficiency of the data for evaluation.

    Ensure there is sufficient data for evaluation, transitioning to
    unknown otherwise.
    """
    sufficient = len(statistics) >= alarm.rule['evaluation_periods']
    if not sufficient and alarm.state != evaluator.UNKNOWN:
        LOG.warn(_LW('Expecting %(expected)d datapoints but only get '
                     '%(actual)d') % {
            'expected': alarm.rule['evaluation_periods'],
            'actual': len(statistics)})
        # Reason is not same as log message because we want to keep
        # consistent since thirdparty software may depend on old format.
        reason = _('%d datapoints are unknown') % alarm.rule[
            'evaluation_periods']
        last = None if not statistics else (
            getattr(statistics[-1], alarm.rule['statistic']))
        reason_data = self._reason_data('unknown',
                                        alarm.rule['evaluation_periods'],
                                        last)
        self._refresh(alarm, evaluator.UNKNOWN, reason, reason_data)
    return sufficient


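In other words, the number of datapoints actually returned for the window is compared with the number expected for it (evaluation_periods). If there are too few, the alarm moves to the unknown state.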
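The check can be reduced to a few lines once the logging and reason bookkeeping are stripped away. The following is a simplified sketch, not the ceilometer code: `check_sufficient` is a hypothetical name, and the state strings are assumed to match the evaluator's constants.

```python
# State names assumed for this sketch.
OK, ALARM, UNKNOWN = "ok", "alarm", "insufficient data"


def check_sufficient(current_state, datapoints, evaluation_periods):
    """Return (sufficient, new_state) for one evaluation cycle."""
    sufficient = len(datapoints) >= evaluation_periods
    if not sufficient and current_state != UNKNOWN:
        # Too few datapoints: the alarm falls back to the unknown state.
        return sufficient, UNKNOWN
    return sufficient, current_state


print(check_sufficient(OK, [12.0, 15.5], evaluation_periods=3))
print(check_sufficient(ALARM, [12.0, 15.5, 9.1], evaluation_periods=3))
```

Only the first call changes the state, because two datapoints cannot cover three evaluation periods.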

Step 3: apply the configured comparison rule to decide whether the alarm should currently be triggered or cleared:

ceilometer/alarm/evaluator/threshold.py

def _compare(stat):
    op = COMPARATORS[alarm.rule['comparison_operator']]
    value = getattr(stat, alarm.rule['statistic'])
    limit = alarm.rule['threshold']
    LOG.debug(_('comparing value %(value)s against threshold'
                ' %(limit)s') %
              {'value': value, 'limit': limit})
    return op(value, limit)
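
The excerpt does not show how COMPARATORS is defined. A plausible shape, assumed here rather than quoted from the source, is a mapping from operator names such as "lt" to functions from Python's operator module. The helper `compare_all` is hypothetical and works on plain numbers instead of statistics objects.

```python
import operator

# Assumed mapping from rule operator names to comparison functions.
COMPARATORS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "ge": operator.ge,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
}


def compare_all(values, comparison_operator, threshold):
    """Compare each datapoint against the threshold, oldest first."""
    op = COMPARATORS[comparison_operator]
    return [op(value, threshold) for value in values]


compared = compare_all([25.0, 31.0, 12.5], "lt", 30)
print(compared)
```

The resulting list of booleans, one per datapoint, is what the next step uses to decide on a state transition.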


Step 4: act on the comparison results

ceilometer/alarm/evaluator/threshold.py

def _transition(self, alarm, statistics, compared):
    """Transition alarm state if necessary.

    The transition rules are currently hardcoded as:
    - transitioning from a known state requires an unequivocal
      set of datapoints
    - transitioning from unknown is on the basis of the most
      recent datapoint if equivocal
    Ultimately this will be policy-driven.
    """
    distilled = all(compared)
    unequivocal = distilled or not any(compared)
    unknown = alarm.state == evaluator.UNKNOWN
    continuous = alarm.repeat_actions

    if unequivocal:
        state = evaluator.ALARM if distilled else evaluator.OK
        reason, reason_data = self._reason(alarm, statistics,
                                           distilled, state)
        if alarm.state != state or continuous:
            self._refresh(alarm, state, reason, reason_data)
    elif unknown or continuous:
        trending_state = evaluator.ALARM if compared[-1] else evaluator.OK
        state = trending_state if unknown else alarm.state
        reason, reason_data = self._reason(alarm, statistics,
                                           distilled, state)
        self._refresh(alarm, state, reason, reason_data)


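When the alarm state changes, the _refresh method is called. It ultimately stores the evaluation result in the database as alarm history and raises the alarm through the alarm-notifier service.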
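
The transition rules are easier to follow as a pure function over the list of comparison results. This sketch mirrors the branching of _transition shown above but returns the state that would be passed to _refresh, or None when no refresh would happen. The function name and state strings are assumptions for illustration.

```python
# State names assumed for this sketch.
OK, ALARM, UNKNOWN = "ok", "alarm", "insufficient data"


def transition(current_state, compared, repeat_actions=False):
    """Return the state to refresh to, or None if nothing would be refreshed."""
    distilled = all(compared)                     # every datapoint breaches
    unequivocal = distilled or not any(compared)  # all breach, or none does
    unknown = current_state == UNKNOWN

    if unequivocal:
        state = ALARM if distilled else OK
        if current_state != state or repeat_actions:
            return state
        return None
    if unknown or repeat_actions:
        # Mixed results: only an unknown alarm follows the latest datapoint.
        trending_state = ALARM if compared[-1] else OK
        return trending_state if unknown else current_state
    return None


print(transition(OK, [True, True]))        # all datapoints breach
print(transition(OK, [True, False]))       # mixed results from a known state
print(transition(UNKNOWN, [False, True]))  # mixed results from unknown
```

A known state only changes when every datapoint agrees, which keeps a single outlier from flapping the alarm.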

Alarm notifier

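The notifier currently offers two implementations for raising alarms promptly, log and rest. The rest implementation is analyzed here.

The rest notifier is essentially an HTTP POST callback: it calls an API endpoint of another service to deliver the alarm. The implementation is as follows: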

ceilometer/alarm/notifier/rest.py

@staticmethod
def notify(action, alarm_id, alarm_name, severity, previous,
           current, reason, reason_data, headers=None):
    headers = headers or {}
    if not headers.get('x-openstack-request-id'):
        headers['x-openstack-request-id'] = context.generate_request_id()

    LOG.info(_(
        "Notifying alarm %(alarm_name)s %(alarm_id)s with severity"
        " %(severity)s from %(previous)s to %(current)s with action "
        "%(action)s because %(reason)s. request-id: %(request_id)s ") %
        ({'alarm_name': alarm_name, 'alarm_id': alarm_id,
          'severity': severity, 'previous': previous,
          'current': current, 'action': action, 'reason': reason,
          'request_id': headers['x-openstack-request-id']}))
    body = {'alarm_name': alarm_name, 'alarm_id': alarm_id,
            'severity': severity, 'previous': previous,
            'current': current, 'reason': reason,
            'reason_data': reason_data}
    headers['content-type'] = 'application/json'
    kwargs = {'data': jsonutils.dumps(body),
              'headers': headers}

    if action.scheme == 'https':
        default_verify = int(cfg.CONF.alarm.rest_notifier_ssl_verify)
        options = urlparse.parse_qs(action.query)
        verify = bool(int(options.get('ceilometer-alarm-ssl-verify',
                                      [default_verify])[-1]))
        kwargs['verify'] = verify

        cert = cfg.CONF.alarm.rest_notifier_certificate_file
        key = cfg.CONF.alarm.rest_notifier_certificate_key
        if cert:
            kwargs['cert'] = (cert, key) if key else cert

    # FIXME(rhonjo): Retries are automatically done by urllib3 in requests
    # library. However, there's no interval between retries in urllib3
    # implementation. It will be better to put some interval between
    # retries (future work).
    max_retries = cfg.CONF.alarm.rest_notifier_max_retries
    session = requests.Session()
    session.mount(action.geturl(),
                  requests.adapters.HTTPAdapter(max_retries=max_retries))
    eventlet.spawn_n(session.post, action.geturl(), **kwargs)
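
On the receiving side, the callback service only has to accept a JSON POST carrying the fields built in `body` above. The following is a minimal receiver sketch using the Python standard library. The names `summarize_notification`, `AlarmHandler`, and `serve`, as well as port 8080, are assumptions for illustration and not part of ceilometer.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Field names taken from the body built by the rest notifier.
EXPECTED_FIELDS = ("alarm_name", "alarm_id", "severity", "previous",
                   "current", "reason", "reason_data")


def summarize_notification(raw_body):
    """Decode the posted JSON body and build a one-line summary."""
    body = json.loads(raw_body)
    missing = [field for field in EXPECTED_FIELDS if field not in body]
    if missing:
        raise ValueError("missing fields: %s" % ", ".join(missing))
    return ("%(alarm_name)s (%(alarm_id)s): "
            "%(previous)s -> %(current)s, %(reason)s" % body)


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            print(summarize_notification(self.rfile.read(length)))
            self.send_response(200)
        except ValueError:
            # Covers both malformed JSON and missing fields.
            self.send_response(400)
        self.end_headers()


def serve(port=8080):
    """Run the receiver until interrupted (hypothetical entry point)."""
    HTTPServer(("", port), AlarmHandler).serve_forever()
```

Calling `serve()` and listing the receiver's URL in an alarm's alarm_actions would then print one line per state change.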


Setting alarms

Taking a threshold alarm as the example, the call looks like this:

curl -i -X 'POST' 'http://<ceilometer_API_host_ip>:8777/v2/alarms' -H 'Content-Type: application/json' -H 'X-Auth-Token: 625d336a15b04a21bd2a588bd6a79572' -d '{"name": "alarm-test", "type": "threshold", "enabled": true, "alarm_actions": [], "ok_actions": [], "threshold_rule": {"meter_name": "cpu_util", "statistic": "min", "threshold": 30, "comparison_operator": "lt", "evaluation_periods": 1, "period": 60, "query": []}}'


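Parameters

- alarm_actions: list, URLs the notifier calls back when the alarm is triggered

- ok_actions: list, URLs the notifier calls back when the alarm is cleared

- name: string, the alarm name

- threshold_rule: the alarm rule, with the following fields:
  - meter_name: the meter (sample) being monitored
  - query: the query that selects the objects the alarm applies to
  - period: length of each statistics window
  - evaluation_periods: number of periods (datapoints) required for an evaluation
  - statistic: which statistic to use: min (minimum), max (maximum), or avg (average)
  - comparison_operator: how the statistic is compared with the threshold, for example lt
  - threshold: the threshold value

- type: string, the alarm type

- enabled: bool, whether the alarm is enabled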
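
The same request can be assembled in Python, which avoids hand-writing the JSON. This is a sketch under assumptions: the helper names, the host `ceilometer.example`, and the `<token>` placeholder are illustrative, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_threshold_alarm(name, meter_name, threshold,
                          comparison_operator="lt", statistic="min",
                          period=60, evaluation_periods=1):
    """Build the alarm definition as a plain dict (hypothetical helper)."""
    return {
        "name": name,
        "type": "threshold",
        "enabled": True,
        "alarm_actions": [],
        "ok_actions": [],
        "threshold_rule": {
            "meter_name": meter_name,
            "statistic": statistic,
            "threshold": threshold,
            "comparison_operator": comparison_operator,
            "evaluation_periods": evaluation_periods,
            "period": period,
            "query": [],
        },
    }


def build_request(api_url, token, alarm):
    """Wrap the alarm in a POST request against the alarms endpoint."""
    return urllib.request.Request(
        api_url.rstrip("/") + "/v2/alarms",
        data=json.dumps(alarm).encode("utf-8"),
        headers={"X-Auth-Token": token,
                 "Content-Type": "application/json"},
        method="POST",
    )


alarm = build_threshold_alarm("alarm-test", "cpu_util", 30)
request = build_request("http://ceilometer.example:8777", "<token>", alarm)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would submit it to a reachable API endpoint with a valid token.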

Query alarms & Alarm History

To query alarms:

curl -i -X 'GET' 'http://<ceilometer_API_host_ip>:8777/v2/alarms' -H 'X-Auth-Token: 625d336a15b04a21bd2a588bd6a79572'


To query alarm history:

curl -i -X 'GET' 'http://200.21.9.11:8777/v2/alarms/<alarm_id>/history' -H 'Content-Type: application/json' -H 'X-Auth-Token: fb3bb73481664a61a90235b1adf6f567'


You can query the state-change history of a specific alarm, and you can also filter the alarm history by time.
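
A time filter would be passed as query parameters on the history request. The sketch below builds such a URL, assuming the per-alarm history path `/v2/alarms/<alarm_id>/history` and the `q.field` / `q.op` / `q.value` query convention of the v2 API. The helper name `history_url` and the sample host are hypothetical.

```python
from urllib.parse import urlencode


def history_url(api_url, alarm_id, since=None):
    """Build the history URL, optionally limited to changes at or after `since`."""
    url = "%s/v2/alarms/%s/history" % (api_url.rstrip("/"), alarm_id)
    if since is None:
        return url
    # Assumed filter: history entries whose timestamp is >= `since`.
    query = [("q.field", "timestamp"), ("q.op", "ge"), ("q.value", since)]
    return url + "?" + urlencode(query)


print(history_url("http://ceilometer.example:8777", "abc"))
print(history_url("http://ceilometer.example:8777", "abc",
                  since="2015-09-21T00:00:00"))
```

The resulting URL can be used with curl and the same X-Auth-Token header as the queries above.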