
Why does a Nova compute node report negative free disk space?

2015-05-19 09:44
Note: this article is based on the Kilo release.


While operating OpenStack, I ran into a situation where a compute node reported a negative amount of available disk space. Let's walk through the code to find out why.

In the nova-compute service running on each compute node, a periodic task, update_available_resource, is responsible for collecting and reporting resource statistics:
@periodic_task.periodic_task
def update_available_resource(self, context):
    """See driver.get_available_resource()

    Periodic process that keeps that the compute host's understanding of
    resource availability and usage in sync with the underlying hypervisor.

    :param context: security context
    """


Inside this function, the ResourceTracker interface is called to obtain the available resources:
rt = self._get_resource_tracker(nodename)
rt.update_available_resource(context)


The ResourceTracker, in turn, calls into the libvirt driver to gather the actual resource statistics:
def update_available_resource(self, context):
    """Override in-memory calculations of compute node resource usage based
    on data audited from the hypervisor layer.

    Add in resource claims in progress to account for operations that have
    declared a need for resources, but not necessarily retrieved them from
    the hypervisor layer yet.
    """
    LOG.info(_LI("Auditing locally available compute resources for "
                 "node %(node)s"),
             {'node': self.nodename})
    resources = self.driver.get_available_resource(self.nodename)


The function that gathers the resource statistics is defined in virt/libvirt/driver.py:
def get_available_resource(self, nodename):
    """Retrieve resource information.

    This method is called when nova-compute launches, and
    as part of a periodic task that records the results in the DB.

    :param nodename: will be put in PCI device
    :returns: dictionary containing resource info
    """

    disk_info_dict = self._get_local_gb_info()
    data = {}

    # NOTE(dprince): calling capabilities before getVersion works around
    # an initialization issue with some versions of Libvirt (1.0.5.5).
    # See: https://bugzilla.redhat.com/show_bug.cgi?id=1000116
    # See: https://bugs.launchpad.net/nova/+bug/1215593

    # Temporary convert supported_instances into a string, while keeping
    # the RPC version as JSON. Can be changed when RPC broadcast is removed
    data["supported_instances"] = jsonutils.dumps(
        self._get_instance_capabilities())

    data["vcpus"] = self._get_vcpu_total()
    data["memory_mb"] = self._get_memory_mb_total()
    data["local_gb"] = disk_info_dict['total']
    data["vcpus_used"] = self._get_vcpu_used()
    data["memory_mb_used"] = self._get_memory_mb_used()
    data["local_gb_used"] = disk_info_dict['used']
    data["hypervisor_type"] = self._host.get_driver_type()
    data["hypervisor_version"] = self._host.get_version()
    data["hypervisor_hostname"] = self._host.get_hostname()
    # TODO(berrange): why do we bother converting the
    # libvirt capabilities XML into a special JSON format ?
    # The data format is different across all the drivers
    # so we could just return the raw capabilities XML
    # which 'compare_cpu' could use directly
    #
    # That said, arch_filter.py now seems to rely on
    # the libvirt drivers format which suggests this
    # data format needs to be standardized across drivers
    data["cpu_info"] = jsonutils.dumps(self._get_cpu_info())

    disk_free_gb = disk_info_dict['free']
    disk_over_committed = self._get_disk_over_committed_size_total()
    available_least = disk_free_gb * units.Gi - disk_over_committed
    data['disk_available_least'] = available_least / units.Gi

    data['pci_passthrough_devices'] = \
        self._get_pci_passthrough_devices()

    numa_topology = self._get_host_numa_topology()
    if numa_topology:
        data['numa_topology'] = numa_topology._to_json()
    else:
        data['numa_topology'] = None

    return data

Let's focus on the disk-related parts. First, this static method of the libvirt driver is called to obtain the three values total/free/used, in gigabytes:

@staticmethod
def _get_local_gb_info():
    """Get local storage info of the compute node in GB.

    :returns: A dict containing:
        :total: How big the overall usable filesystem is (in gigabytes)
        :free: How much space is free (in gigabytes)
        :used: How much space is used (in gigabytes)
    """

    if CONF.libvirt.images_type == 'lvm':
        info = libvirt_utils.get_volume_group_info(
            CONF.libvirt.images_volume_group)
    else:
        info = libvirt_utils.get_fs_info(CONF.instances_path)

    for (k, v) in info.iteritems():
        info[k] = v / units.Gi  # Note: every value is converted to GB here!

    return info
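Note that `units.Gi` is 1024³ and, on Python 2 integers, `/` floors, so fractional gigabytes are simply dropped. A minimal sketch of the same conversion (Python 3, using `//` for the same truncating division; the byte counts are made-up values in the ballpark of the host shown later):

```python
Gi = 1024 ** 3  # same constant as oslo.utils units.Gi

def to_gb(info_bytes):
    """Convert a {total, free, used} dict from bytes to whole GiB,
    truncating any fractional part, as the Nova helper does."""
    return {k: v // Gi for k, v in info_bytes.items()}

# Hypothetical statvfs results, in bytes
raw = {'total': 273000000000, 'free': 208742842368, 'used': 50264170496}
print(to_gb(raw))  # {'total': 254, 'free': 194, 'used': 46}
```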

As _get_local_gb_info shows, when instances are stored on a plain filesystem rather than LVM, the following function gathers the numbers:
def get_fs_info(path):
    """Get free/used/total space info for a filesystem

    :param path: Any dirent on the filesystem
    :returns: A dict containing:

        :free: How much space is free (in bytes)
        :used: How much space is used (in bytes)
        :total: How big the filesystem is (in bytes)
    """
    hddinfo = os.statvfs(path)
    total = hddinfo.f_frsize * hddinfo.f_blocks
    free = hddinfo.f_frsize * hddinfo.f_bavail
    used = hddinfo.f_frsize * (hddinfo.f_blocks - hddinfo.f_bfree)
    return {'total': total,
            'free': free,
            'used': used}

The figures get_fs_info collects essentially match what the df command shows:
[root@host123 ~]# python
Python 2.7.5 (default, Feb 11 2014, 07:46:25)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> hddinfo = os.statvfs("/var/lib/nova")
>>> total = hddinfo.f_frsize * hddinfo.f_blocks
>>> free = hddinfo.f_frsize * hddinfo.f_bavail
>>> used = hddinfo.f_frsize * (hddinfo.f_blocks - hddinfo.f_bfree)
>>>
>>> print total/1024/1024/1024
254
>>> print free/1024/1024/1024
194
>>> print used/1024/1024/1024
46

[root@host123 ~]# df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg_sys-lv_root    20G  3.6G   16G  20% /
devtmpfs                      11G     0   11G   0% /dev
tmpfs                         12G     0   12G   0% /dev/shm
tmpfs                         12G   83M   12G   1% /run
tmpfs                         12G     0   12G   0% /sys/fs/cgroup
/dev/sda1                    380M   96M  260M  27% /boot
/dev/mapper/vg_nova-lv_nova  255G   47G  195G  20% /var/lib/nova
get_available_resource uses the total and used values directly, but note that free is not reported as-is; instead it is turned into disk_available_least:

disk_free_gb = disk_info_dict['free']
disk_over_committed = self._get_disk_over_committed_size_total()
available_least = disk_free_gb * units.Gi - disk_over_committed
data['disk_available_least'] = available_least / units.Gi


As you can see, it takes the disk_free_gb value the operating system reports and subtracts disk_over_committed from it.
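This subtraction is exactly where a negative report can come from. A small numeric sketch (the numbers are hypothetical, not taken from the host above): with 194 GB actually free but, say, 210 GB of potential qcow2 growth across all instances, disk_available_least drops below zero:

```python
Gi = 1024 ** 3  # units.Gi

def disk_available_least(free_gb, over_committed_bytes):
    """Replicates the calculation in get_available_resource(): free space
    minus the total potential growth of all sparse instance disks."""
    return (free_gb * Gi - over_committed_bytes) // Gi

# 194 GB free, but instances could still grow by a combined 210 GB
print(disk_available_least(194, 210 * Gi))  # -16
```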

Now let's see how _get_disk_over_committed_size_total, another method of the libvirt driver, computes that value:

    def _get_disk_over_committed_size_total(self):
        """Return total over committed disk size for all instances."""
        # Disk size that all instance uses : virtual_size - disk_size
        disk_over_committed_size = 0
        for dom in self._host.list_instance_domains():
            try:
                xml = dom.XMLDesc(0)
                disk_infos = jsonutils.loads(
                        self._get_instance_disk_info(dom.name(), xml))
                for info in disk_infos:
                    disk_over_committed_size += int(
                        info['over_committed_disk_size'])
            except ... (error handling omitted here)
            # NOTE(gtt116): give other tasks a chance.
            greenthread.sleep(0)
        return disk_over_committed_size
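Stripped of the libvirt/XML plumbing, the loop just sums one number per disk of each running domain. A minimal sketch of that accumulation, with hand-made disk-info dicts standing in for _get_instance_disk_info() output:

```python
Gi = 1024 ** 3

# Each instance reports a list of disk-info dicts; only the
# over_committed_disk_size field matters for this total.
instances_disk_infos = [
    [{'over_committed_disk_size': 15 * Gi}],    # one qcow2 disk
    [{'over_committed_disk_size': 0},           # a raw disk (no over-commit)
     {'over_committed_disk_size': 8 * Gi}],     # plus a qcow2 disk
]

disk_over_committed_size = 0
for disk_infos in instances_disk_infos:
    for info in disk_infos:
        disk_over_committed_size += int(info['over_committed_disk_size'])

print(disk_over_committed_size // Gi)  # 23
```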


It fetches over_committed_disk_size from each instance and sums them all up. That implies some instances are already over-committing disk; so where does the over-commit come from?

For each instance, over_committed_disk_size is obtained by the following function:

def _get_instance_disk_info(self, instance_name, xml,
                            block_device_info=None):
    block_device_mapping = driver.block_device_info_get_mapping(
        block_device_info)

    volume_devices = set()
    for vol in block_device_mapping:
        disk_dev = vol['mount_device'].rpartition("/")[2]
        volume_devices.add(disk_dev)

    disk_info = []
    doc = etree.fromstring(xml)
    disk_nodes = doc.findall('.//devices/disk')
    path_nodes = doc.findall('.//devices/disk/source')
    driver_nodes = doc.findall('.//devices/disk/driver')
    target_nodes = doc.findall('.//devices/disk/target')

    for cnt, path_node in enumerate(path_nodes):
        disk_type = disk_nodes[cnt].get('type')
        path = path_node.get('file') or path_node.get('dev')
        target = target_nodes[cnt].attrib['dev']

        if not path:
            LOG.debug('skipping disk for %s as it does not have a path',
                      instance_name)
            continue

        if disk_type not in ['file', 'block']:
            LOG.debug('skipping disk because it looks like a volume', path)
            continue

        if target in volume_devices:
            LOG.debug('skipping disk %(path)s (%(target)s) as it is a '
                      'volume', {'path': path, 'target': target})
            continue

        # get the real disk size or
        # raise a localized error if image is unavailable
        if disk_type == 'file':
            dk_size = int(os.path.getsize(path))
        elif disk_type == 'block':
            dk_size = lvm.get_volume_size(path)

        disk_type = driver_nodes[cnt].get('type')
        if disk_type == "qcow2":
            backing_file = libvirt_utils.get_disk_backing_file(path)
            virt_size = disk.get_disk_size(path)
            over_commit_size = int(virt_size) - dk_size
        else:
            backing_file = ""
            virt_size = dk_size
            over_commit_size = 0

        disk_info.append({'type': disk_type,
                          'path': path,
                          'virt_disk_size': virt_size,
                          'backing_file': backing_file,
                          'disk_size': dk_size,
                          'over_committed_disk_size': over_commit_size})
    return jsonutils.dumps(disk_info)

For example, for a qcow2 image the over-commit size equals virt_size minus dk_size:

[root@host123 ~]# ll -h /var/lib/nova/instances/109291c0-0bf0-412c-9e87-6ab01e16bc06/disk
-rw-r--r-- 1 root root 5.0G Feb 25 11:41 /var/lib/nova/instances/109291c0-0bf0-412c-9e87-6ab01e16bc06/disk

The actual size of the image file, dk_size, is 5.0G. Now inspect the qcow2 details with the qemu-img command:

[root@host123 ~]# qemu-img info /var/lib/nova/instances/109291c0-0bf0-412c-9e87-6ab01e16bc06/disk
image: /var/lib/nova/instances/109291c0-0bf0-412c-9e87-6ab01e16bc06/disk
file format: qcow2
virtual size: 20G (21474836480 bytes)
disk size: 4.9G
cluster_size: 65536
backing file: /var/lib/nova/instances/_base/afd631de55a9b7026775a4a1ada098a9ae6888c7
Format specific information:
    compat: 0.10

The virtual size minus the disk size here is the over_commit_size.
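Plugging in the numbers above: the virtual size is 21474836480 bytes (20G) and the file on disk is about 5.0G, so this single instance contributes roughly 15G of over-commit. A quick check of the arithmetic (dk_size is taken as exactly 5 GiB here for round numbers):

```python
Gi = 1024 ** 3

virt_size = 21474836480   # qemu-img "virtual size" (20G)
dk_size = 5 * Gi          # os.path.getsize() of the image file, ~5.0G

# Same formula as in _get_instance_disk_info() for qcow2 images
over_commit_size = int(virt_size) - dk_size
print(over_commit_size // Gi)  # 15
```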

As the code shows, only qcow2 images get this over-commit treatment; for all other files, over_commit_size is 0.

You may recall that the Nova scheduler's DiskFilter uses disk_allocation_ratio to over-subscribe disk, but that is a different concept from the over-commit here. disk_allocation_ratio is over-allocation as seen from the control node, invisible to the compute node. The over-commit discussed here is what the compute node itself derives from the sparse qcow2 format: the free space it reports already deducts the growth that would occur if every qcow2 image expanded to its full virtual size. That is why the reported free space can be smaller than what you see with your own eyes.

If an administrator boots an instance with a compute node explicitly specified, the scheduling flow is bypassed and the instance is forced onto that node, consuming space that was already earmarked for over-commit. The node's reported disk resources can then go negative, and as the instances' actual disk usage keeps growing, the node may eventually run out of disk space entirely.