Micro, Macro, and Precise: Estimating Database Performance from Multiple Angles (Do Your Own Product Selection and Budgeting)
2017-09-25 14:20
Tags
PostgreSQL , PPAS , Greenplum , HybridDB for PostgreSQL , performance , estimation , pgbench , statistics , explain algorithm , statistics import and export
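Background
When you put together a budget, one unavoidable step is estimating how much hardware you need. Usually the business side is asked for figures such as user counts, PV, and UV. But an estimate built this way rests purely on experience; it is crude and inaccurate.
So how do you work out how much hardware, or what hardware specification, you need to support your future workload?
For PostgreSQL, I will introduce three estimation methods:
1. Micro estimation (relatively accurate)
2. Macro estimation (helps with product selection, helps little with sizing, somewhat crude)
3. Precise estimation (the most accurate, but it requires knowing the business very well and correctly anticipating future bottlenecks)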
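I. Micro estimation
When we talk to the database through SQL, how does the database execute a statement? It first parses the SQL, then generates candidate execution paths, chooses the best path, and executes it. The critical step is choosing the best path. PostgreSQL uses a cost-based optimizer (CBO), so it chooses by cost.
That raises the question of how cost is computed. The optimizer combines the scan method with the table statistics to estimate how many data blocks and how many rows must be scanned, then feeds those estimates into the cost formula for that scan method.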
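To make that calculation concrete, here is a minimal sketch in C of how a sequential scan might be costed from a page count and a row count. The three constants are PostgreSQL's default values for seq_page_cost, cpu_tuple_cost, and cpu_operator_cost. The function name sketch_seqscan_total_cost, and the simplification that every filter condition costs exactly one operator evaluation per row, are assumptions made for illustration; this is not the actual cost_seqscan code.

```c
#include <assert.h>

/* PostgreSQL's default planner cost constants. */
#define SEQ_PAGE_COST     1.0     /* seq_page_cost: one sequentially read page */
#define CPU_TUPLE_COST    0.01    /* cpu_tuple_cost: processing one row */
#define CPU_OPERATOR_COST 0.0025  /* cpu_operator_cost: one operator or function call */

/*
 * Hypothetical helper (not a PostgreSQL function): estimated total cost of
 * sequentially scanning a table of `pages` blocks and `tuples` rows while
 * evaluating `quals_per_tuple` simple operators against every row.
 * `pages` and `tuples` stand in for the values the planner takes from
 * the table statistics.
 */
double sketch_seqscan_total_cost(double pages, double tuples, int quals_per_tuple)
{
    assert(pages >= 0 && tuples >= 0 && quals_per_tuple >= 0);

    /* I/O part: every page is read once, sequentially. */
    double io_cost = SEQ_PAGE_COST * pages;

    /* CPU part: per-row overhead plus the filter operators. */
    double cpu_per_tuple = CPU_TUPLE_COST + CPU_OPERATOR_COST * quals_per_tuple;

    return io_cost + cpu_per_tuple * tuples;
}
```

Under these assumptions, a table of 100 pages and 10,000 rows filtered by one operator comes out at 100 × 1.0 + 10000 × (0.01 + 0.0025) = 225 cost units.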
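What costs does a query have
1. Cost consists of I/O cost and CPU cost.
2. I/O cost consists of:
sequential I/O cost and random I/O cost.
3. CPU cost consists of:
the cost of fetching tuples or items from indexes, TOAST indexes, heap tables, and TOAST tables;
the cost of operators and functions processing rows;
the cost of processing JOINs, and so on.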
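The breakdown above can be sketched as two small C functions, one for the I/O side and one for the CPU side. The constants are PostgreSQL's default planner cost settings. The function names and the way the inputs are grouped (page counts, tuple counts, operator evaluations) are assumptions for illustration; the real planner derives these quantities per node type and does not expose helpers like these.

```c
#include <assert.h>

/* PostgreSQL's default planner cost constants. */
static const double seq_page_cost        = 1.0;    /* sequential page read */
static const double random_page_cost     = 4.0;    /* random page read */
static const double cpu_tuple_cost       = 0.01;   /* one heap row */
static const double cpu_index_tuple_cost = 0.005;  /* one index entry */
static const double cpu_operator_cost    = 0.0025; /* one operator/function call */

/* Hypothetical helper: I/O cost split into sequential and random page reads. */
double sketch_io_cost(double seq_pages, double random_pages)
{
    assert(seq_pages >= 0 && random_pages >= 0);
    return seq_page_cost * seq_pages + random_page_cost * random_pages;
}

/*
 * Hypothetical helper: CPU cost of fetching heap rows and index entries
 * plus evaluating operators. TOAST tables and TOAST indexes are lumped in
 * with heap rows and index entries here, which is a simplification.
 */
double sketch_cpu_cost(double heap_tuples, double index_tuples, double operator_evals)
{
    assert(heap_tuples >= 0 && index_tuples >= 0 && operator_evals >= 0);
    return cpu_tuple_cost * heap_tuples
         + cpu_index_tuple_cost * index_tuples
         + cpu_operator_cost * operator_evals;
}
```

With the defaults, one random page read is charged four times as much as one sequential page read, which is why the split between the two kinds of I/O matters when you estimate a workload.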
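How a query executes and passes cost along
Once the execution plan has been generated, the query executes by walking the execution tree. The tree is made up of nodes, and execution moves from one node to the next, like a relay race.
What gets passed between nodes?
A Path data structure, which for each node mainly holds (rows, startup_cost, total_cost):
rows: the number of rows in this node that satisfy the conditions and are output to the next node.
startup_cost: the cost for this node to produce the first qualifying row.
total_cost: the cost for this node to produce all qualifying rows.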
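As a rough illustration of why both startup_cost and total_cost are carried along, the following C sketch models a parent node that stops after a fixed number of rows, the way a LIMIT does. The struct SketchPath and both functions are hypothetical names, and the linear interpolation between startup and total cost is a simplified version of what the planner applies for LIMIT, not its actual code.

```c
#include <assert.h>

/* Hypothetical stand-in for the three fields described above. */
typedef struct SketchPath
{
    double rows;          /* rows handed to the next node */
    double startup_cost;  /* cost to produce the first row */
    double total_cost;    /* cost to produce all rows */
} SketchPath;

/* Convenience constructor for the sketch. */
SketchPath sketch_make_path(double rows, double startup_cost, double total_cost)
{
    SketchPath p;

    assert(rows >= 0 && startup_cost >= 0 && total_cost >= startup_cost);
    p.rows = rows;
    p.startup_cost = startup_cost;
    p.total_cost = total_cost;
    return p;
}

/*
 * Parent node that reads only the first `limit` rows of its child.
 * The startup cost is paid in full; the run cost (total - startup) is
 * charged in proportion to the fraction of rows actually fetched.
 */
SketchPath sketch_limit_over(SketchPath child, double limit)
{
    SketchPath out = child;

    assert(limit >= 0);
    if (child.rows > 0 && limit < child.rows)
    {
        double fraction = limit / child.rows;

        out.rows = limit;
        out.total_cost = child.startup_cost
            + (child.total_cost - child.startup_cost) * fraction;
    }
    return out;
}
```

Take two children that each return 1,000 rows: one with (startup_cost 0, total_cost 200) and one with (startup_cost 150, total_cost 160). Read in full, the second is cheaper. Under a limit of 10 rows the sketch gives about 2 for the first and about 150.1 for the second, so the child with the low startup cost wins.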
What kinds of execution nodes are there
There are many kinds of execution nodes. You can see them in the cost-calculation code: src/backend/optimizer/path/costsize.c
/*
 * cost_seqscan
 *    Determines and returns the cost of scanning a relation sequentially.
 *
 * 'baserel' is the relation to be scanned
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
 */
cost_seqscan(Path *path, PlannerInfo *root,
             RelOptInfo *baserel, ParamPathInfo *param_info)

/*
 * cost_samplescan
 *    Determines and returns the cost of scanning a relation using sampling.
 *
 * 'baserel' is the relation to be scanned
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
 */
cost_samplescan(Path *path, PlannerInfo *root,
                RelOptInfo *baserel, ParamPathInfo *param_info)

/*
 * cost_gather
 *    Determines and returns the cost of gather path.
 *
 * 'rel' is the relation to be operated upon
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
 * 'rows' may be used to point to a row estimate; if non-NULL, it overrides
 * both 'rel' and 'param_info'.  This is useful when the path doesn't exactly
 * correspond to any particular RelOptInfo.
 */
cost_gather(GatherPath *path, PlannerInfo *root,
            RelOptInfo *rel, ParamPathInfo *param_info,
            double *rows)

/*
 * cost_gather_merge
 *    Determines and returns the cost of gather merge path.
 *
 * GatherMerge merges several pre-sorted input streams, using a heap that at
 * any given instant holds the next tuple from each stream. If there are N
 * streams, we need about N*log2(N) tuple comparisons to construct the heap at
 * startup, and then for each output tuple, about log2(N) comparisons to
 * replace the top heap entry with the next tuple from the same stream.
 */
cost_gather_merge(GatherMergePath *path, PlannerInfo *root,
                  RelOptInfo *rel, ParamPathInfo *param_info,
                  Cost input_startup_cost, Cost input_total_cost,
                  double *rows)

/*
 * cost_index
 *    Determines and returns the cost of scanning a relation using an index.
 *
 * 'path' describes the indexscan under consideration, and is complete
 *      except for the fields to be set by this routine
 * 'loop_count' is the number of repetitions of the indexscan to factor into
 *      estimates of caching behavior
 *
 * In addition to rows, startup_cost and total_cost, cost_index() sets the
 * path's indextotalcost and indexselectivity fields.  These values will be
 * needed if the IndexPath is used in a BitmapIndexScan.
 *
 * NOTE: path->indexquals must contain only clauses usable as index
 * restrictions.  Any additional quals evaluated as qpquals may reduce the
 * number of returned tuples, but they won't reduce the number of tuples
 * we have to fetch from the table, so they don't reduce the scan cost.
 */
cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
           bool partial_path)

/*
 * cost_bitmap_heap_scan
 *    Determines and returns the cost of scanning a relation using a bitmap
 *    index-then-heap plan.
 *
 * 'baserel' is the relation to be scanned
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
 * 'bitmapqual' is a tree of IndexPaths, BitmapAndPaths, and BitmapOrPaths
 * 'loop_count' is the number of repetitions of the indexscan to factor into
 *      estimates of caching behavior
 *
 * Note: the component IndexPaths in bitmapqual should have been costed
 * using the same loop_count.
 */
cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
                      ParamPathInfo *param_info,
                      Path *bitmapqual, double loop_count)

/*
 * cost_bitmap_tree_node
 *    Extract cost and selectivity from a bitmap tree node (index/and/or)
 */
cost_bitmap_tree_node(Path *path, Cost *cost, Selectivity *selec)

/*
 * cost_bitmap_and_node
 *    Estimate the cost of a BitmapAnd node
 *
 * Note that this considers only the costs of index scanning and bitmap
 * creation, not the eventual heap access.  In that sense the object isn't
 * truly a Path, but it has enough path-like properties (costs in particular)
 * to warrant treating it as one.  We don't bother to set the path rows field,
 * however.
 */
cost_bitmap_and_node(BitmapAndPath *path, PlannerInfo *root)

/*
 * cost_bitmap_or_node
 *    Estimate the cost of a BitmapOr node
 *
 * See comments for cost_bitmap_and_node.
 */
cost_bitmap_or_node(BitmapOrPath *path, PlannerInfo *root)

/*
 * cost_tidscan
 *    Determines and returns the cost of scanning a relation using TIDs.
 *
 * 'baserel' is the relation to be scanned
 * 'tidquals' is the list of TID-checkable quals
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
 */
cost_tidscan(Path *path, PlannerInfo *root,
             RelOptInfo *baserel, List *tidquals, ParamPathInfo *param_info)

/*
 * cost_subqueryscan
 *    Determines and returns the cost of scanning a subquery RTE.
 *
 * 'baserel' is the relation to be scanned
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
 */
cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
                  RelOptInfo *baserel, ParamPathInfo *param_info)

/*
 * cost_functionscan
 *    Determines and returns the cost of scanning a function RTE.
 *
 * 'baserel' is the relation to be scanned
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
 */
cost_functionscan(Path *path, PlannerInfo *root,
                  RelOptInfo *baserel, ParamPathInfo *param_info)

/*
 * cost_tablefuncscan
 *    Determines and returns the cost of scanning a table function.
 *
 * 'baserel' is the relation to be scanned
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
 */
cost_tablefuncscan(Path *path, PlannerInfo *root,
                   RelOptInfo *baserel, ParamPathInfo *param_info)

/*
 * cost_valuesscan
 *    Determines and returns the cost of scanning a VALUES RTE.
 *
 * 'baserel' is the relation to be scanned
 * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
 */
cost_valuesscan(Path *path, PlannerInfo *root,
                RelOptInfo *baserel, ParamPathInfo *param_info)

/*
 * cost_ctescan
 *    Determines and returns the cost of scanning a CTE RTE.
 *
 * Note: this is used for both self-reference and regular CTEs; the
 * possible cost differences are below the threshold of what we could
 * estimate accurately anyway.  Note that the costs of evaluating the
 * referenced CTE query are added into the final plan as initplan costs,
 * and should NOT be counted here.
 */
cost_ctescan(Path *path, PlannerInfo *root,
             RelOptInfo *baserel, ParamPathInfo *param_info)

cost_namedtuplestorescan(Path *path, PlannerInfo *root,
                         RelOptInfo *baserel, ParamPathInfo *param_info)

/*
 * cost_recursive_union
 *    Determines and returns the cost of performing a recursive union,
 *    and also the estimated output size.
 *
 * We are given Paths for the nonrecursive and recursive terms.
 */
cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)

/*
 * cost_sort
 *    Determines and returns the cost of sorting a relation, including
 *    the cost of reading the input data.
 *
 * If the total volume of data to sort is less than sort_mem, we will do
 * an in-memory sort, which requires no I/O and about t*log2(t) tuple
 * comparisons for t tuples.
 *
 * If the total volume exceeds sort_mem, we switch to a tape-style merge
 * algorithm.  There will still be about t*log2(t) tuple comparisons in
 * total, but we will also need to write and read each tuple once per
 * merge pass.  We expect about ceil(logM(r)) merge passes where r is the
 * number of initial runs formed and M is the merge order used by tuplesort.c.
 * Since the average initial run should be about sort_mem, we have
 *      disk traffic = 2 * relsize * ceil(logM(p / sort_mem))
 *      cpu = comparison_cost * t * log2(t)
 *
 * If the sort is bounded (i.e., only the first k result tuples are needed)
 * and k tuples can fit into sort_mem, we use a heap method that keeps only
 * k tuples in the heap; this will require about t*log2(k) tuple comparisons.
 *
 * The disk traffic is assumed to be 3/4ths sequential and 1/4th random
 * accesses (XXX can't we refine that guess?)
 *
 * By default, we charge two operator evals per tuple comparison, which should
 * be in the right ballpark in most cases.  The caller can tweak this by
 * specifying nonzero comparison_cost; typically that's used for any extra
 * work that has to be done to prepare the inputs to the comparison operators.
 *
 * 'pathkeys' is a list of sort keys
 * 'input_cost' is the total cost for reading the input data
 * 'tuples' is the number of tuples in the relation
 * 'width' is the average tuple width in bytes
 * 'comparison_cost' is the extra cost per comparison, if any
 * 'sort_mem' is the number of kilobytes of work memory allowed for the sort
 * 'limit_tuples' is the bound on the number of output tuples; -1 if no bound
 *
 * NOTE: some callers currently pass NIL for pathkeys because they
 * can't conveniently supply the sort keys.  Since this routine doesn't
 * currently do anything with pathkeys anyway, that doesn't matter...
 * but if it ever does, it should react gracefully to lack of key data.
 * (Actually, the thing we'd most likely be interested in is just the number
 * of sort keys, which all callers *could* supply.)
 */
cost_sort(Path *path, PlannerInfo *root,
          List *pathkeys, Cost input_cost, double tuples, int width,
          Cost comparison_cost, int sort_mem,
          double limit_tuples)

/*
 * cost_append
 *    Determines and returns the cost of an Append node.
 *
 * We charge nothing extra for the Append itself, which perhaps is too
 * optimistic, but since it doesn't do any selection or projection, it is a
 * pretty cheap node.
 */
cost_append(Path *path, List *subpaths, int num_nonpartial_subpaths)

/*
 * cost_merge_append
 *    Determines and returns the cost of a MergeAppend node.
 *
 * MergeAppend merges several pre-sorted input streams, using a heap that
 * at any given instant holds the next tuple from each stream.  If there
 * are N streams, we need about N*log2(N) tuple comparisons to construct
 * the heap at startup, and then for each output tuple, about log2(N)
 * comparisons to replace the top entry.
 *
 * (The effective value of N will drop once some of the input streams are
 * exhausted, but it seems unlikely to be worth trying to account for that.)
 *
 * The heap is never spilled to disk, since we assume N is not very large.
 * So this is much simpler than cost_sort.
 *
 * As in cost_sort, we charge two operator evals per tuple comparison.
 *
 * 'pathkeys' is a list of sort keys
 * 'n_streams' is the number of input streams
 * 'input_startup_cost' is the sum of the input streams' startup costs
 * 'input_total_cost' is the sum of the input streams' total costs
 * 'tuples' is the number of tuples in all the streams
 */
cost_merge_append(Path *path, PlannerInfo *root,
                  List *pathkeys, int n_streams,
                  Cost input_startup_cost, Cost input_total_cost,
                  double tuples)

/*
 * cost_material
 *    Determines and returns the cost of materializing a relation, including
 *    the cost of reading the input data.
 *
 * If the total volume of data to materialize exceeds work_mem, we will need
 * to write it to disk, so the cost is much higher in that case.
 *
 * Note that here we are estimating the costs for the first scan of the
 * relation, so the materialization is all overhead --- any savings will
 * occur only on rescan, which is estimated in cost_rescan.
 */
cost_material(Path *path,
              Cost input_startup_cost, Cost input_total_cost,
              double tuples, int width)

/*
 * cost_agg
 *    Determines and returns the cost of performing an Agg plan node,
 *    including the cost of its input.
 *
 * aggcosts can be NULL when there are no actual aggregate functions (i.e.,
 * we are using a hashed Agg node just to do grouping).
 *
 * Note: when aggstrategy == AGG_SORTED, caller must ensure that input costs
 * are for appropriately-sorted input.
 */
cost_agg(Path *path, PlannerInfo *root,
         AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
         int numGroupCols, double numGroups,
         Cost input_startup_cost, Cost input_total_cost,
         double input_tuples)

/*
 * cost_windowagg
 *    Determines and returns the cost of performing a WindowAgg plan node,
 *    including the cost of its input.
 *
 * Input is assumed already properly sorted.
 */
cost_windowagg(Path *path, PlannerInfo *root,
               List *windowFuncs, int numPartCols, int numOrderCols,
               Cost input_startup_cost, Cost input_total_cost,
               double input_tuples)

/*
 * cost_group
 *    Determines and returns the cost of performing a Group plan node,
 *    including the cost of its input.
 *
 * Note: caller must ensure that input costs are for appropriately-sorted
 * input.
 */
cost_group(Path *path, PlannerInfo *root,
           int numGroupCols, double numGroups,
           Cost input_startup_cost, Cost input_total_cost,
           double input_tuples)

/*
 * cost_subplan
 *    Figure the costs for a SubPlan (or initplan).
 *
 * Note: we could dig the subplan's Plan out of the root list, but in practice
 * all callers have it handy already, so we make them pass it.
 */
cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan)

/*
 * cost_rescan
 *    Given a finished Path, estimate the costs of rescanning it after
 *    having done so the first time.  For some Path types a rescan is
 *    cheaper than an original scan (if no parameters change), and this
 *    function embodies knowledge about that.  The default is to return
 *    the same costs stored in the Path.  (Note that the cost estimates
 *    actually stored in Paths are always for first scans.)
 *
 * This function is not currently intended to model effects such as rescans
 * being cheaper due to disk block caching; what we are concerned with is
 * plan types wherein the executor caches results explicitly, or doesn't
 * redo startup calculations, etc.
 */
cost_rescan(PlannerInfo *root, Path *path,
            Cost *rescan_startup_cost,  /* output parameters */
            Cost *rescan_total_cost)

/*
 * cost_qual_eval
 *    Estimate the CPU costs of evaluating a WHERE clause.
 *    The input can be either an implicitly-ANDed list of boolean
 *    expressions, or a list of RestrictInfo nodes.  (The latter is
 *    preferred since it allows caching of the results.)
 *    The result includes both a one-time (startup) component,
 *    and a per-evaluation component.
 */
cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root)

/*
 * cost_qual_eval_node
 *    As above, for a single RestrictInfo or expression.
 */
cost_qual_eval_node(QualCost *cost, Node *qual, PlannerInfo *root)

cost_qual_eval_walker(Node *node, cost_qual_eval_context *context)
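As one worked instance of the listing, the cost_append comment says the Append node itself is charged nothing extra. A minimal C sketch of what that implies for the (rows, startup_cost, total_cost) triple follows. SketchSubpath and sketch_append are hypothetical names, and the rule that the startup cost is taken from the first child is my reading of the non-parallel case, stated here as an assumption rather than a copy of the PostgreSQL code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a child path's estimate. */
typedef struct SketchSubpath
{
    double rows;
    double startup_cost;
    double total_cost;
} SketchSubpath;

/*
 * Combine `n` children the way a plain (non-parallel) Append would,
 * charging nothing for the Append node itself:
 *   rows         = sum of the children's rows
 *   startup_cost = startup cost of the first child (the first row comes
 *                  from it)
 *   total_cost   = sum of the children's total costs
 */
SketchSubpath sketch_append(const SketchSubpath *subpaths, int n)
{
    SketchSubpath out = {0.0, 0.0, 0.0};
    int i;

    assert(n >= 0);
    assert(n == 0 || subpaths != NULL);

    for (i = 0; i < n; i++)
    {
        if (i == 0)
            out.startup_cost = subpaths[i].startup_cost;
        out.rows += subpaths[i].rows;
        out.total_cost += subpaths[i].total_cost;
    }
    return out;
}
```

For two children (100 rows, startup 0, total 10) and (50 rows, startup 5, total 20), the sketch yields 150 rows, startup cost 0, and total cost 30. The second child's startup cost is not lost; it is already included in that child's total cost.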