Big Data Course Study Notes (1)
2014-10-15 20:38
1. Do a word count over a given set of web pages in parallel using MapReduce
MapReduce Framework
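The word count task can be sketched on a single machine to show the shape of the map, shuffle, and reduce steps. This is only an illustration under assumptions: the function names (map_page, shuffle, reduce_counts, word_count) and the sample pages are made up for the example, and a thread pool stands in for the worker machines that a real MapReduce framework would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import re


def map_page(page_text):
    """Map step: emit a (word, 1) pair for every word on one page."""
    words = re.findall(r"[a-z0-9']+", page_text.lower())
    return [(word, 1) for word in words]


def shuffle(mapped_pages):
    """Shuffle step: group all emitted counts by word."""
    grouped = defaultdict(list)
    for pairs in mapped_pages:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(pages, workers=4):
    """Run the map step over the pages in a thread pool, then shuffle and reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped_pages = list(pool.map(map_page, pages))
    return reduce_counts(shuffle(mapped_pages))


# Hypothetical page contents, used only for illustration.
pages = ["big data is big", "data moves slowly", "Big ideas"]
counts = word_count(pages)
print(counts["big"], counts["data"])
```

Each page is mapped independently of the others, which is what lets the map step be spread across many workers; only the shuffle needs to see the output of all pages.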
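2. Structured and semi-structured data
Structured data: information represented in tabular form.
Unstructured data: relatively free in form.
Semi-structured data: in practice, almost no data is completely without structure.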
3. IR (information retrieval)
Ultimate focus of IR: satisfying the user's information need (the emphasis is on retrieval of information, not data). Examples of user information needs: printer reviews, book prices and availability, words in which all vowels appear, and so on. The task is predicting which documents are relevant and then linearly ranking them.
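To make "predicting relevance and linearly ranking" concrete, here is a minimal sketch. The scoring rule (counting occurrences of query terms) is an assumption chosen for simplicity, not the method from the course, and the document collection is hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, document):
    """Count how often the distinct query terms occur in the document."""
    term_counts = Counter(tokenize(document))
    return sum(term_counts[term] for term in set(tokenize(query)))


def rank(query, documents):
    """Return (doc_id, score) pairs sorted from most to least relevant."""
    scored = [(doc_id, score(query, text)) for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical document collection, used only for illustration.
documents = {
    "d1": "Laser printer reviews and printer prices",
    "d2": "Book prices and availability",
    "d3": "Reviews of new books",
}
ranking = rank("printer reviews", documents)
print(ranking)
```

The result is a single ordered list, which is what "linear ranking" refers to: every document gets one position, with ties broken here by document id.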
4. DIKW Hierarchy
D is Data: symbolic units. E.g.: customer records or bytes from sensors.
I is Information: data with an interpretation (who? what? when? where?). E.g.: user information grouped by age.
K is Knowledge: information organized with theoretical concepts or abstract ideas (how?). E.g.: how many users are cutting their spending during an economic crisis?
W is Wisdom: understanding of fundamental principles, and human judgement.
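The step from Data to Information in the hierarchy above can be illustrated with the "grouped by age" example. The records, field names, and the ten-year band width are all assumptions made up for this sketch.

```python
from collections import defaultdict

# Hypothetical customer records (the Data level): raw symbolic units.
records = [
    {"name": "A", "age": 23},
    {"name": "B", "age": 37},
    {"name": "C", "age": 29},
    {"name": "D", "age": 41},
]


def age_band(age, width=10):
    """Label an age with its band, e.g. 23 falls in '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def group_by_age(records):
    """Interpret the records by answering 'who falls in which age band?'."""
    groups = defaultdict(list)
    for record in records:
        groups[age_band(record["age"])].append(record["name"])
    return dict(groups)


# The Information level: the same data, now organized around a question.
information = group_by_age(records)
print(information)
```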
Thinking at scale
1. Problem: we can process data very quickly, but we can read and write it only very slowly.
Sharing is slow, so we should distribute the data.
Sharing is tricky: exchanging data requires synchronization (deadlock becomes a problem), only finite bandwidth is available (distributed systems can "drown themselves", and failovers can cause cascading failure), and temporal dependencies are complicated.
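A rough calculation shows why distributing the data helps with slow reads. The figures below (1 TB of data, 100 MB/s per disk) are assumed round numbers for illustration, not values from the course, and the model ignores coordination and network costs.

```python
def read_seconds(total_bytes, bytes_per_second_per_disk, disks=1):
    """Time to read the data when it is split evenly across `disks` disks."""
    return total_bytes / (bytes_per_second_per_disk * disks)


TB = 10**12
MB = 10**6

# Assumed, illustrative figures (not from the notes).
one_disk = read_seconds(1 * TB, 100 * MB)            # 10000.0 seconds, close to 3 hours
hundred_disks = read_seconds(1 * TB, 100 * MB, 100)  # 100.0 seconds
print(one_disk, hundred_disks)
```

Under these assumptions, reading in parallel from 100 disks cuts the time by a factor of 100, which is the motivation for spreading data across machines rather than sharing one store.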
2. Reliability demands
Support partial failure: the total system must support a graceful decline in application performance rather than a full halt.
Data recoverability: if components fail, their workload must be picked up by still-functioning units.
Individual recoverability: nodes that fail and restart must be able to rejoin the group activity without a full group restart.
Consistency: concurrent operations or partial internal failures should not cause externally visible nondeterminism.
Scalability: adding increased load to a system should not cause outright failure, but a graceful decline; increasing resources should support a proportional increase in load capacity.
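The requirement that a failed component's workload be picked up by still-functioning units can be sketched as a task reassignment. The reassign function, the node names, and the "least loaded first" policy are assumptions for illustration; real frameworks use their own scheduling rules.

```python
def reassign(assignments, failed_node):
    """Move the failed node's tasks onto the remaining nodes, least loaded first."""
    survivors = {
        node: list(tasks)
        for node, tasks in assignments.items()
        if node != failed_node
    }
    if not survivors:
        raise RuntimeError("no functioning nodes left")
    for task in assignments.get(failed_node, []):
        # Pick the node with the fewest tasks; break ties by node name.
        target = min(survivors, key=lambda node: (len(survivors[node]), node))
        survivors[target].append(task)
    return survivors


# Hypothetical cluster state: node name mapped to its assigned tasks.
assignments = {"n1": ["t1", "t2"], "n2": ["t3"], "n3": ["t4", "t5"]}
after_failure = reassign(assignments, "n3")
print(after_failure)
```

The surviving nodes each carry more work after the failure, so the system slows down rather than halting, which matches the "graceful decline" described above.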