Use External Storage to Process Big Data (1)
2013-03-05 21:57
Problem:
We have said that big data is data that cannot fit in main memory (often called RAM, for Random Access Memory) all at once. How would you handle this situation?
Solution:
We can use a divide-and-conquer algorithm: split the big problem into small problems, solve each small problem with the same method, and finally merge the results. In this case a different kind of storage is necessary. Disk files generally have a much larger capacity than main memory, but we should be clear that external storage is much slower than main memory. This speed difference means that different techniques must be used to handle it efficiently.
Here we suppose our big data (holding many records) is in a file. We can divide the file into blocks. Data is stored on disk in chunks called blocks, pages, or allocation units; the disk drive always reads or writes a minimum of one block of data at a time, and the Block Size property specifies the number of bytes per block. Here a block can be as large as your main memory can afford. We can then read the block we want into main memory. But the problem is: how can you find that block quickly?
![](http://img.my.csdn.net/uploads/201303/05/1362490737_5751.jpg)
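As a minimal sketch of the block-wise reading described above (the file name `records.dat` and the 4096-byte block size are illustrative assumptions, not fixed choices):

```python
BLOCK_SIZE = 4096  # bytes per block (the "Block Size" property); illustrative value

def read_block(path, block_number, block_size=BLOCK_SIZE):
    """Read exactly one block from disk, mirroring how the drive
    transfers a minimum of one block of data at a time."""
    with open(path, "rb") as f:
        f.seek(block_number * block_size)  # jump to the block boundary
        return f.read(block_size)          # one block access

# Usage: block = read_block("records.dat", 2) reads the third block.
```

Each call costs one disk access, which is why the rest of this article focuses on finding the right block number with as few such calls as possible.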
Problem: How can you find the block quickly?
Solution:
We must keep in mind that the time to access a block is much larger than the time for any internal processing on data already in main memory, so the overriding consideration in devising an external storage strategy is minimizing the number of block accesses. The usual techniques for this problem are hashing, indexing, and B-trees.
1 Hashing and External Storage
The central feature of external hashing is a hash table containing block numbers, which refer to blocks in external storage. The hash table is sometimes called an index (in the sense of a book's index). It can be stored in main memory or, if it is too large, stored externally on disk, with only part of it being read into main memory at a time.
1) First, all records with keys that hash to the same value are located in the same block.
2) Second, to find a record with a particular key, the search algorithm hashes the key, uses the hash value as an index into the hash table, gets the block number at that index, and reads that block.
![](http://img.my.csdn.net/uploads/201303/05/1362492180_5911.jpg)
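The two steps above can be sketched as follows; the table size and the modular hash function are illustrative choices (a real system would pick both with care, as noted below), and `read_block` stands for whatever routine fetches one block from disk:

```python
TABLE_SIZE = 16  # hypothetical hash-table size

def hash_key(key):
    # Simple modular hash; chosen only for illustration.
    return sum(key.encode()) % TABLE_SIZE

def find_record(key, index, read_block):
    """Hash the key, look up the block number in the in-memory index,
    then read that one block and scan it in memory."""
    block_number = index[hash_key(key)]
    block = read_block(block_number)           # the single expensive disk access
    return [r for r in block if r[0] == key]   # in-memory scan is comparatively cheap
```

The point of the structure is that the lookup touches exactly one block on disk, no matter how many records the file holds.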
To implement this scheme, we must choose the hash function and the size of the hash table with some care so that a limited number of keys hash to the same value.
For example:
We can put all the block files in one directory and use the hash values as the block file names, so you can find a block file by its name. For instance, if your search key's hash value is 2, you open the file 2.txt and read it into main memory, because all keys with the same hash value are in the same block.
![](http://img.my.csdn.net/uploads/201303/05/1362493200_7671.jpg)
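A minimal sketch of this file-per-hash-value layout (the directory, the tab-separated record format, and the table size of 16 are all illustrative assumptions):

```python
import os

def block_path(directory, key, table_size=16):
    h = sum(key.encode()) % table_size        # same hash for store and lookup
    return os.path.join(directory, f"{h}.txt")

def store(directory, key, value):
    with open(block_path(directory, key), "a") as f:  # append to the block file
        f.write(f"{key}\t{value}\n")

def lookup(directory, key):
    path = block_path(directory, key)
    if not os.path.exists(path):
        return None
    with open(path) as f:                     # one block-file read
        for line in f:
            k, v = line.rstrip("\n").split("\t")
            if k == key:
                return v
    return None
```

Finding the block is then just opening the file whose name is the hash value, exactly as in the 2.txt example above.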
You may be confused by 11.txt in the figure above. 11.txt is the overflow block file for 1.txt, used once 1.txt is full. This is the separate chaining method of handling full blocks; of course, you can use other methods to find the overflow blocks. In separate chaining, special overflow blocks are made available; when a primary block is found to be full, the new record is placed in an overflow block.
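A hedged sketch of that overflow chain: each primary block file holds at most a fixed number of records, and once it is full new records go to an overflow file. The naming rule here (prefixing a "1", so 1.txt overflows into 11.txt) is inferred from the figure, and the capacity of 4 records is a hypothetical value:

```python
import os

MAX_RECORDS = 4  # hypothetical block capacity

def chain(directory, hash_value):
    """Yield the primary block file, then its overflow files in order."""
    name = str(hash_value)
    while True:
        yield os.path.join(directory, f"{name}.txt")
        name = "1" + name  # 1.txt -> 11.txt -> 111.txt, per the figure

def count_records(path):
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return sum(1 for _ in f)

def insert(directory, hash_value, record):
    for path in chain(directory, hash_value):
        if count_records(path) < MAX_RECORDS:  # room in this block
            with open(path, "a") as f:
                f.write(record + "\n")
            return path                        # which block file took the record
        # block is full: fall through to the next overflow file
```

A lookup would walk the same chain, stopping at the first block that contains the key or at the first missing file.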