
Big Data Course Study Notes (1)

2014-10-15 20:38
1.Do a word count over a given set of web pages in parallel using MapReduce

                                 MapReduce Framework
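The word count task above can be sketched in a few lines of Python. This is only a single-machine illustration of the map, shuffle, and reduce steps, not a real MapReduce cluster job; the function names (`map_page`, `shuffle`, `reduce_counts`, `word_count`) and the use of a thread pool to stand in for parallel workers are assumptions made for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_page(page_text):
    """Map step: emit a (word, 1) pair for every word on one page."""
    return [(word.lower(), 1) for word in page_text.split()]


def shuffle(mapped_pairs):
    """Shuffle step: group all emitted counts by word."""
    groups = defaultdict(list)
    for pairs in mapped_pairs:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(pages, workers=4):
    """Run the map step in parallel (one task per page), then shuffle and reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_page, pages))
    groups = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in groups.items())


if __name__ == "__main__":
    pages = ["big data is big", "data is everywhere"]
    print(word_count(pages))
```

Each page is mapped independently, which is what makes the map step easy to run in parallel; only the shuffle needs to see the output of every mapper.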

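2.Structured data and semi-structured data

Structured data: information represented in tabular form.

Unstructured data: data whose form is relatively free.

Semi-structured data: in practice, almost no data is truly unstructured.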
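A small sketch may make the three categories more concrete. The sample records and the helper `field_names` are invented for illustration; JSON is used here only as one common example of a semi-structured format.

```python
import json

# Structured: every row has the same fixed columns, as in a table.
columns = ("id", "name", "age")
rows = [(1, "Alice", 34), (2, "Bob", 27)]

# Semi-structured: records describe themselves, and fields may differ per record.
records = [
    json.loads('{"id": 1, "name": "Alice", "tags": ["vip"]}'),
    json.loads('{"id": 2, "name": "Bob", "address": {"city": "Paris"}}'),
]

# Unstructured: free text with no declared fields.
text = "Alice called on Monday to ask about her order."


def field_names(records):
    """Collect every field name that appears in at least one record."""
    return sorted({key for record in records for key in record})


if __name__ == "__main__":
    print(field_names(records))
```

Note that even the free text still has some structure (words, a sentence, a name), which is the point of the remark that almost no data is truly unstructured.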

3.IR

Ultimate focus of IR: satisfying the user's information need (the emphasis is on retrieval of information, not data). Examples of user information needs: printer reviews, book prices and availability, words in which all vowels appear, and so on. Meeting the need means predicting which documents are relevant and then linearly ranking them.
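As a rough illustration of "predict relevance, then rank linearly", the sketch below scores each document by how often the query terms occur in it and returns the matching documents in descending score order. The scoring rule is a deliberately naive assumption chosen for this example; it is not the method taught in the course, and real IR systems use far more refined relevance models.

```python
def score(query, document):
    """Naive relevance score: total occurrences of the query terms in the document."""
    terms = query.lower().split()
    words = document.lower().split()
    return sum(words.count(term) for term in terms)


def rank(query, documents):
    """Return documents with a non-zero score, best first (a linear ranking)."""
    scored = [(score(query, doc), doc) for doc in documents]
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for s, doc in ordered if s > 0]


if __name__ == "__main__":
    docs = [
        "book prices and availability",
        "printer reviews and printer prices",
        "weather today",
    ]
    for doc in rank("printer prices", docs):
        print(doc)
```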

4.DIKW Hierarchy

D is Data: symbolic units. E.g.: records of customers or bytes from sensors.

I is Information: data with an interpretation (who? what? when? where?). E.g.: customer information grouped by age.

K is Knowledge: information organized with theoretical concepts or abstract ideas (how?). E.g.: how many customers are cutting their spending during an economic crisis?

W is Wisdom: understanding of fundamental principles, and human judgement.
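The step from data to information can be illustrated with the age-grouping example. In this sketch the raw (name, age) records are the data, and the grouped result is the information; the function name `group_by_age`, the bracket width, and the sample customers are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_age(customers, width=10):
    """Turn raw (name, age) records into age brackets such as '20-29'."""
    groups = defaultdict(list)
    for name, age in customers:
        low = (age // width) * width
        groups[f"{low}-{low + width - 1}"].append(name)
    return dict(groups)


if __name__ == "__main__":
    customers = [("Ann", 23), ("Bob", 37), ("Cy", 29)]
    print(group_by_age(customers))
```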

 

Thinking at scale

1.Problem: we can process data very quickly, but we can read/write it only very slowly.

     Sharing is slow, so we should distribute the data.

     Sharing is tricky: exchanging data requires synchronization (deadlock becomes a problem), only finite bandwidth is available (distributed systems can "drown themselves", and failovers can cause cascading failures), and temporal dependencies are complicated.
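The deadlock risk mentioned above shows up even on a single machine as soon as two threads need the same pair of locks. A minimal sketch, assuming a hypothetical `Account` class and `transfer` function: acquiring the locks in one fixed global order (here by `account_id`) is one common way to avoid the deadlock, though it is not the only one.

```python
import threading


class Account:
    """Hypothetical shared resource guarded by its own lock."""

    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move money between accounts, always locking the lower id first."""
    if src is dst:
        return  # nothing to do, and locking the same lock twice would block forever
    first, second = sorted((src, dst), key=lambda account: account.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


if __name__ == "__main__":
    a, b = Account(1, 100), Account(2, 100)
    t1 = threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(1000)])
    t2 = threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(1000)])
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print(a.balance, b.balance)
```

If each thread instead locked its own source account first, the two threads could each hold one lock while waiting for the other, which is exactly the deadlock the notes warn about.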

2.Reliability demands

Support partial failure: the total system must support a graceful decline in application performance rather than a full halt.

Data recoverability: if components fail, their workload must be picked up by still-functioning units.

Individual recoverability: nodes that fail and restart must be able to rejoin the group activity without a full group restart.

Consistency: concurrent operations or partial internal failures should not cause externally visible nondeterminism.

Scalability: adding increased load to a system should not cause outright failure, but a graceful decline. Increasing resources should support a proportional increase in load capacity.
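The data recoverability demand above (the workload of a failed component is picked up by the units that still function) can be sketched as a simple reassignment step. The function `reassign`, the node names, and the round-robin policy are all assumptions for illustration; real frameworks use their own schedulers and failure detectors.

```python
def reassign(assignments, failed_node):
    """Return new assignments with the failed node's tasks spread over the survivors."""
    survivors = sorted(node for node in assignments if node != failed_node)
    if not survivors:
        raise RuntimeError("no functioning nodes left")
    # Copy the survivors' task lists so the input mapping is left unchanged.
    new_assignments = {node: list(assignments[node]) for node in survivors}
    # Hand out the orphaned tasks round-robin.
    for i, task in enumerate(assignments.get(failed_node, [])):
        new_assignments[survivors[i % len(survivors)]].append(task)
    return new_assignments


if __name__ == "__main__":
    cluster = {"node-a": [1], "node-b": [2, 3], "node-c": [4]}
    print(reassign(cluster, "node-b"))
```

The system keeps running with fewer nodes and a higher load per node, which matches the "graceful decline rather than a full halt" requirement.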