
Reading notes on "Big Data: Principles and Best Practices of Scalable Realtime Data Systems"

2015-10-22 20:23
Chapter 1 A New Paradigm for Big Data

1.1 How this Book is structured

focuses on the principles behind Big Data systems; each topic is covered in a theory chapter followed by an illustration chapter

1.2 Scaling with a traditional database

starting point: a single relational database handling all writes

problem: timeout errors when inserting into the database

solution: buffer updates in a queue and have a worker apply them to the database in batches

problem: writes keep growing, and the workload is still too heavy for the database

solution: horizontal partitioning or sharding spreads the write load across multiple machines
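As a rough sketch of what sharding looks like in application code (the host names and shard count here are hypothetical), each key is hashed to pick the machine that receives its writes:

    import java.util.List;

    // Picks a shard for each key so writes are spread across machines.
    // Resharding means changing the shard list and migrating data, which is
    // exactly the error-prone step the next problem points at.
    public class ShardRouter {
        private final List<String> shardHosts;

        public ShardRouter(List<String> shardHosts) {
            this.shardHosts = shardHosts;
        }

        public String shardFor(String key) {
            // Mask the sign bit instead of Math.abs to avoid the Integer.MIN_VALUE edge case.
            int bucket = (key.hashCode() & Integer.MAX_VALUE) % shardHosts.size();
            return shardHosts.get(bucket);
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter(List.of("db0", "db1", "db2", "db3"));
            System.out.println(router.shardFor("user:alice"));  // every write for this key goes to one shard
        }
    }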

problem: you keep having to reshard the database into more shards to keep up with the write load, and resharding is easy to get wrong

solution: Big Data?

Make your data immutable. With traditional databases you'd be wary of using immutable data because of how fast such a dataset would grow, but because Big Data techniques can scale to so much data, you have the ability to design systems in different ways.
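A minimal sketch of the immutable style, assuming a hypothetical pageview dataset: each event is appended as a new record and never modified, and the current count is derived from the raw records rather than stored as mutable state.

    import java.util.ArrayList;
    import java.util.List;

    // Mutable style: UPDATE pageviews SET count = count + 1 WHERE url = ?
    // Immutable style: append one record per pageview and derive counts later.
    public class ImmutablePageviews {
        // A fact is never changed after it is written.
        record Pageview(String url, long timestampMillis) {}

        private final List<Pageview> masterDataset = new ArrayList<>();

        public void recordPageview(String url, long timestampMillis) {
            masterDataset.add(new Pageview(url, timestampMillis));  // append-only
        }

        public long countFor(String url) {
            // The count is recomputed from the raw facts, not kept as mutable state.
            return masterDataset.stream().filter(p -> p.url().equals(url)).count();
        }

        public static void main(String[] args) {
            ImmutablePageviews views = new ImmutablePageviews();
            views.recordPageview("/home", System.currentTimeMillis());
            views.recordPageview("/home", System.currentTimeMillis());
            System.out.println(views.countFor("/home"));  // 2
        }
    }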

1.3 NoSQL is not a panacea

NoSQL tools are not a panacea, but used in conjunction with one another you can produce scalable systems for arbitrary data problems with human-fault tolerance and a minimum of complexity

1.4 First principles

A data system answers questions based on information that was acquired in the past, up to the present

definition of data system: 

query = function(all data)    [how about writing new data?]
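One way to read that definition, as a hedged sketch with a hypothetical pageview dataset: a query is a pure function applied to the entire dataset.

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class QueryAsFunction {
        record Pageview(String userId, String url, long timestamp) {}

        // query = function(all data): unique visitors to a URL, computed from every fact ever recorded.
        static long uniqueVisitors(List<Pageview> allData, String url) {
            Set<String> users = allData.stream()
                    .filter(p -> p.url().equals(url))
                    .map(Pageview::userId)
                    .collect(Collectors.toSet());
            return users.size();
        }

        public static void main(String[] args) {
            List<Pageview> allData = List.of(
                    new Pageview("alice", "/home", 1L),
                    new Pageview("bob",   "/home", 2L),
                    new Pageview("alice", "/home", 3L));
            System.out.println(uniqueVisitors(allData, "/home"));  // 2
        }
    }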

1.5 Desired properties of a Big Data system

robustness and fault tolerance

low latency reads and updates

scalability

generalization

extensibility

ad hoc queries

minimal maintenance

debuggability

1.6 The Problems of incremental architectures

Traditional architecture: use read/write databases and maintain the state in those databases incrementally as new data is seen

Complexity: 

operational complexity

achieving eventual consistency

lack of human-fault tolerance (this argument feels a bit shaky to me)

1.7 Lambda Architecture

Batch Layer

responsibility: 1. stores master dataset; 2. computes arbitrary views

formula: batch view = function(all data)

implementation: Hadoop, MapReduce, HDFS

Serving Layer

responsibility: 1. random access to batch views; 2. updated by batch layer

formula: NONE

implementation: Thrift, Protocol Buffers, Avro

Speed Layer

responsibility: 1. compensate for high latency of updates to serving layer; 2. Fast, incremental algorithm; 3. Batch layer eventually overrides speed layer

formula: realtime view = function(realtime view, new data)

note: can be thought of like the batch layer, except it only looks at recent data rather than all the data at once

implementation: Cassandra, HBase, MongoDB, Voldemort, Riak, CouchDB

Messaging / Queueing Systems: Kafka

Realtime Computation System: Storm
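A hedged sketch of realtime view = function(realtime view, new data), using an in-memory map as a stand-in for a realtime view that would normally live in a store such as Cassandra or HBase:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Speed layer sketch: the realtime view is updated incrementally as each new
    // piece of data arrives, instead of being recomputed from the master dataset.
    public class RealtimePageviewView {
        private final Map<String, Long> countsSinceLastBatch = new ConcurrentHashMap<>();

        // realtime view = function(realtime view, new data)
        public void onPageview(String url) {
            countsSinceLastBatch.merge(url, 1L, Long::sum);
        }

        public long countFor(String url) {
            return countsSinceLastBatch.getOrDefault(url, 0L);
        }

        // Once the batch layer has absorbed this data into the batch views, the
        // corresponding realtime state can be discarded ("batch layer eventually overrides speed layer").
        public void clear() {
            countsSinceLastBatch.clear();
        }

        public static void main(String[] args) {
            RealtimePageviewView view = new RealtimePageviewView();
            view.onPageview("/home");
            view.onPageview("/home");
            System.out.println(view.countFor("/home"));  // 2
        }
    }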

Summary

batch view = function(all data)

realtime view = function(realtime view, new data)

query = function(batch view, realtime view)
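Putting the three formulas together as a minimal sketch (the pageview counts are hypothetical):

    public class LambdaQuery {
        // query = function(batch view, realtime view):
        // the batch view covers everything up to the last batch run,
        // the realtime view covers only the data that arrived since then.
        static long totalPageviews(long batchViewCount, long realtimeViewCount) {
            return batchViewCount + realtimeViewCount;
        }

        public static void main(String[] args) {
            long batch = 10_000;   // batch view = function(all data), recomputed periodically
            long recent = 42;      // realtime view = function(realtime view, new data)
            System.out.println(totalPageviews(batch, recent));  // 10042
        }
    }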

Complexity isolation: incremental computation, the most complex part of the system, is isolated in the speed layer, whose results are only temporary and are eventually replaced by the batch layer

Part 1 Batch Layer

Chapter 2 Data Model for Big Data

2.1 The properties of data

Information: general collection of knowledge relevant to your Big Data system. It's synonymous with the colloquial usage of the word data

Data: the information that can't be derived from anything else. Data serves as the axioms from which everything else derives

Queries: questions you ask of your data

Views: information that has been derived from your data. They are built to assist with answering specific types of queries

Key Properties of Data

1. rawness (storing raw data is hugely valuable because you rarely know in advance all the questions you want answered)

2. immutability (Human-fault tolerance / simplicity )

3. perpetuity

2.2 The fact-based model for representing data

In the fact-based model, you deconstruct your data into fundamental units called facts.
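A hedged sketch of what facts might look like for a hypothetical user dataset; each fact captures a single piece of information together with the time it became true.

    import java.time.Instant;

    // Each fact records one indivisible piece of information about an entity,
    // together with the moment it became true. Changing a user's location is a
    // new fact, not an update of an old one.
    public class Facts {
        record LocationFact(long userId, String location, Instant timestamp) {}
        record NameFact(long userId, String name, Instant timestamp) {}

        public static void main(String[] args) {
            LocationFact moved = new LocationFact(42L, "Tokyo", Instant.parse("2015-10-22T12:00:00Z"));
            NameFact named = new NameFact(42L, "Alice", Instant.parse("2015-10-01T09:30:00Z"));
            System.out.println(moved + " " + named);
        }
    }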

Fact Properties

1. atomic

2. timestamped

Benefits of the fact-based model

1. Is queryable at any time in its history

2. Tolerates human errors (fix mistakes by deleting the erroneous facts)

3. Handles partial information

4. Has the advantages of both normalized and denormalized forms (In lambda architecture, the master dataset is fully normalized)

2.3 Graph schemas

graph schemas: capture the structure of a dataset stored using the fact-based model.

Nodes: entities

Edges: relationships between nodes

Properties: information about entities

The need for an enforceable schema: it defines the structure of facts.

Implement an enforceable schema using a serialization framework. A serialization framework provides a language-neutral way to define nodes, edges, and properties
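The book defines such a schema in a serialization framework so it stays language-neutral; purely as an in-language illustration (all type names here are made up), the same node/edge/property structure might be sketched in Java like this:

    import java.time.Instant;

    // Rough sketch of a graph schema: nodes identify entities, edges relate two
    // nodes, and properties attach information to a node. A serialization
    // framework (Thrift, Protocol Buffers, Avro) would define these once and
    // enforce the structure across languages.
    public class GraphSchema {
        // Nodes: entities
        record PersonNode(long userId) {}
        record PageNode(String url) {}

        // Edges: relationships between nodes
        record PageviewEdge(PersonNode person, PageNode page, Instant timestamp) {}

        // Properties: information about entities
        record LocationProperty(PersonNode person, String location, Instant timestamp) {}

        public static void main(String[] args) {
            PersonNode alice = new PersonNode(42L);
            PageNode home = new PageNode("http://example.com/");
            System.out.println(new PageviewEdge(alice, home, Instant.now()));
            System.out.println(new LocationProperty(alice, "Tokyo", Instant.now()));
        }
    }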

Chapter 3 Data Model for Big Data: Illustration

Thrift: cannot do validation such as requiring a value to be non-negative
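Because the schema itself can't express a constraint like non-negativity, one option is to enforce it in the code that creates facts; a small sketch with a hypothetical fact type:

    // Thrift can enforce structure (fields and types) but not value constraints,
    // so checks like non-negativity have to live in the code that builds facts.
    public class ValidatedFacts {
        record AgeFact(long userId, int age) {
            AgeFact {
                if (age < 0) {
                    throw new IllegalArgumentException("age must be non-negative: " + age);
                }
            }
        }

        public static void main(String[] args) {
            AgeFact ok = new AgeFact(42L, 30);
            System.out.println(ok);
            // new AgeFact(42L, -1) would throw before an invalid fact reaches the master dataset.
        }
    }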

Chapter 4 Data Storage on The Batch Layer

topics:

Storage requirements for the master dataset

Distributed filesystems

Improving efficiency with vertical partitioning

4.1 Storage requirements for the master dataset

Write
1. Efficient appends of new data: The only write operation is to add new pieces of data, so it must be easy and efficient to append a new set of data objects to the master dataset
2. Scalable storage: The batch layer stores the complete dataset -- potentially terabytes or petabytes of data. It must therefore be easy to scale the storage as your dataset grows

Read
1. Support for parallel processing: Constructing the batch views requires computing functions on the entire master dataset. The batch storage must consequently support parallel processing to handle large amounts of data in a scalable manner (no need for random access)

Both reads and writes
1. Tunable storage and processing costs: Storage costs money. You may choose to compress your data to help minimize your expenses, but decompressing your data during computation can affect performance. The batch layer should give you the flexibility to decide how to store and compress your data to suit your specific needs
2. Enforceable immutability: It's critical that you're able to enforce the immutable property of your master dataset. Of course, computers by their very nature are mutable, so there will always be a way to mutate the data you store. The best you can do is put checks in place to disallow mutating operations. These checks should prevent bugs or other random errors from trampling over existing data
Chapter 5 Data Storage on The Batch Layer: Illustration

HDFS

Pail

Chapter 6 Batch Layer

Recomputation vs Incremental
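A hedged sketch of the contrast, for a hypothetical pageview count: the recomputation algorithm rebuilds the view from the entire master dataset, while the incremental algorithm patches the existing view using only new data.

    import java.util.List;

    public class RecomputationVsIncremental {
        record Pageview(String url, long timestamp) {}

        // Recomputation: throw the old view away and recompute from all data.
        // A bug here is fixed for good by fixing the code and rerunning the computation.
        static long recomputeCount(List<Pageview> masterDataset, String url) {
            return masterDataset.stream().filter(p -> p.url().equals(url)).count();
        }

        // Incremental: update the existing view using only the new data.
        // A bug here silently corrupts the view, and the corruption persists.
        static long incrementCount(long existingCount, List<Pageview> newData, String url) {
            return existingCount + newData.stream().filter(p -> p.url().equals(url)).count();
        }

        public static void main(String[] args) {
            List<Pageview> all = List.of(new Pageview("/home", 1L), new Pageview("/home", 2L));
            List<Pageview> fresh = List.of(new Pageview("/home", 3L));
            System.out.println(recomputeCount(all, "/home"));        // 2
            System.out.println(incrementCount(2L, fresh, "/home"));  // 3
        }
    }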

6.6 Low-level nature of MapReduce

Although MapReduce is a great primitive for batch computation -- giving you a generic, scalable, and fault-tolerant way to compute functions over large datasets -- it doesn't lend itself to particularly elegant code. MapReduce programs written manually tend to be long, unwieldy, and difficult to understand. (MapReduce is fairly low level; in some scenarios the code becomes complex and hard to follow -- see the word-count sketch after the list below.)

1. multistep computations are unnatural

2. joins are very complicated to implement manually

3. logical and physical execution tightly coupled
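To get a feel for how low-level this is, here is the classic Hadoop word count (adapted from the standard example; input and output paths come from the command line). Counting words is conceptually a one-line aggregation, yet it takes a mapper, a reducer, and job wiring -- and multistep computations or joins need several such jobs chained together by hand:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // The standard Hadoop word count: one mapper, one reducer, plus job wiring,
    // just to count occurrences of each word in a dataset.
    public class WordCount {

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);          // emit (word, 1) for every token
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();                   // add up all the 1s for this word
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }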

6.7 Pipe diagrams: a higher-level way of thinking about batch computation

Chapter 7 Batch Layer: Illustration

JCascalog as a practical implementation of pipe diagrams

inputs and outputs are defined via an abstraction called a tap

7.2 Common Pitfalls of data-processing tools

1. custom languages

2. poorly composable abstractions

Chapter 8 An Example Batch Layer: Architecture and Algorithm

Chapter 9 An Example Batch Layer: Implementation

Part 2 Serving Layer

Chapter 10 Serving Layer

Part 3 Speed Layer

speed layer -> [synchronously / asynchronously]

asynchronously -> queues and streaming

two paradigms of stream processing -> [one-at-a-time / micro-batched]
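A minimal sketch of asynchronous, one-at-a-time processing with a queue and a worker (java.util.concurrent stands in here for a real broker such as Kafka; micro-batched processing would instead drain the queue in small batches):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class OneAtATimeWorker {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> queue = new LinkedBlockingQueue<>();

            // The worker consumes events one at a time and updates realtime views as it goes.
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        String event = queue.take();               // blocks until an event arrives
                        System.out.println("processing " + event); // update the realtime view here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.setDaemon(true);
            worker.start();

            // Producers enqueue events and return immediately: updates are asynchronous.
            queue.offer("pageview:/home");
            queue.offer("pageview:/about");
            Thread.sleep(100);  // give the worker a moment before the JVM exits
        }
    }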

Chapter 12 Realtime Views

speed layer is based on incremental computation instead of batch computation

facets: storing the realtime views and processing the incoming data stream so as to update those views.