This document is of interest primarily for its discussion of the HDF team's motivation for implementing raw data caching. At a more abstract level, the discussion of the principles of data chunking is also of interest, but a more recent discussion of that topic can be found in Dataset Chunking Issues. The performance study described here predates the current chunking implementation in the HDF5 Library, so the particular performance data is no longer apropos.      -- Frank Baker, Editor, September 2008

Testing the chunked layout of HDF5

These are the results of a study of the chunked layout policy in HDF5. A 1000 by 1000 block of integers was written repeatedly to a file dataset, extending the dataset with each write, to create, in the end, a 5000 by 5000 array of 4-byte integers for a total data storage size of 100 million bytes.
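
For reference, such a write loop can be sketched as follows using the current HDF5 1.8+ C API (the study itself used the 1998-era library). The file name chunk_test.h5, the dataset name "array", the 500 by 500 chunk size, and the row-major block order are illustrative assumptions, not details taken from the original test program.

    #include "hdf5.h"

    #define BLOCK 1000                  /* output block extent in elements     */
    #define NBLK  5                     /* 5 x 5 blocks -> final 5000 x 5000   */

    int main(void)
    {
        static int buf[BLOCK][BLOCK];   /* one 1000 x 1000 block of integers   */
        hsize_t chunk[2]   = {500, 500};            /* chunk size under test   */
        hsize_t dims[2]    = {BLOCK, BLOCK};        /* initial dataset extent  */
        hsize_t maxdims[2] = {H5S_UNLIMITED, H5S_UNLIMITED};
        hsize_t cur[2]     = {BLOCK, BLOCK};        /* extent written so far   */
        hsize_t count[2]   = {BLOCK, BLOCK};
        hsize_t start[2];
        hid_t   file, space, memspace, dcpl, dset, fspace;
        int     i, j;

        file  = H5Fcreate("chunk_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        space = H5Screate_simple(2, dims, maxdims);
        dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);               /* chunked storage layout  */
        dset  = H5Dcreate2(file, "array", H5T_NATIVE_INT, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
        memspace = H5Screate_simple(2, count, NULL);

        for (i = 0; i < NBLK; i++) {
            for (j = 0; j < NBLK; j++) {
                /* Extend the dataset so it covers this block, then write the */
                /* block into the corresponding 1000 x 1000 hyperslab.        */
                if ((hsize_t)(i + 1) * BLOCK > cur[0]) cur[0] = (hsize_t)(i + 1) * BLOCK;
                if ((hsize_t)(j + 1) * BLOCK > cur[1]) cur[1] = (hsize_t)(j + 1) * BLOCK;
                H5Dset_extent(dset, cur);

                start[0] = (hsize_t)i * BLOCK;
                start[1] = (hsize_t)j * BLOCK;
                fspace = H5Dget_space(dset);        /* refresh after extension */
                H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
                H5Dwrite(dset, H5T_NATIVE_INT, memspace, fspace, H5P_DEFAULT, buf);
                H5Sclose(fspace);
            }
        }

        H5Sclose(memspace);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }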

Fig 1: Write-order of Output Blocks

After the array was written, it was read back in blocks of 500 by 500 elements in row-major order (that is, the top-left quadrant of output block one, then the top-right quadrant of output block one, then the top-left quadrant of output block two, and so on).
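
The corresponding read-back loop can be sketched as shown below; again the API is the current one and the file and dataset names are the same illustrative ones as in the write sketch, not those of the original test program.

    #include "hdf5.h"

    #define RBLK 500                    /* read block extent in elements       */
    #define N    5000                   /* full array extent in elements       */

    int main(void)
    {
        static int rbuf[RBLK][RBLK];    /* one 500 x 500 input block           */
        hsize_t    start[2], count[2] = {RBLK, RBLK};
        hid_t      file, dset, memspace, fspace;
        int        i, j;

        file = H5Fopen("chunk_test.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        dset = H5Dopen2(file, "array", H5P_DEFAULT);
        memspace = H5Screate_simple(2, count, NULL);
        fspace   = H5Dget_space(dset);

        /* Row-major over 500 x 500 blocks: the top-left quadrant of output   */
        /* block one, then its top-right quadrant, and so on across the array.*/
        for (i = 0; i < N / RBLK; i++) {
            for (j = 0; j < N / RBLK; j++) {
                start[0] = (hsize_t)i * RBLK;
                start[1] = (hsize_t)j * RBLK;
                H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
                H5Dread(dset, H5T_NATIVE_INT, memspace, fspace, H5P_DEFAULT, rbuf);
            }
        }

        H5Sclose(fspace);
        H5Sclose(memspace);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }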

I tried to answer two questions: how does the total file overhead vary with the chunk size, and how does the I/O access pattern vary with the chunk size?

I started with chunk sizes that were multiples of the read block size, that is, k*(500, 500).

Table 1: Total File Overhead

Chunk Size (elements)    Meta Data Overhead (ppm)    Raw Data Overhead (ppm)
500 by 500                              85.84                        0.00
1000 by 1000                            23.08                        0.00
5000 by 1000                            23.08                        0.00
250 by 250                             253.30                        0.00
499 by 499                              85.84                   205164.84

Fig 2: Chunk size is 500x500

The first half of Figure 2 shows output to the file and the second half shows input. Each dot represents a file-level I/O request, and the lines that connect the dots are only for visual clarity; the size of each request is not indicated in the graph. The output block size is four times the chunk size, which results in four file-level write requests per block, for a total of 100 requests. Since file space for the chunks was allocated in output order, and the input block size is one quarter of the output block size, the input half shows a staircase effect. Each input request results in one file-level read request. The downward spike at about the 60-millionth byte is probably the result of a cache miss for the B-tree, and the downward spike at the end is probably a cache flush or file boot block update.


Fig 3: Chunk size is 1000x1000

In this test I increased the chunk size to match the output block size, and one can see from the first half of the graph that 25 file-level write requests were issued, one for each output block. The read half of the test shows that four times as much data was read as was written. This results from the fact that HDF5 must read the entire chunk for any request that falls within that chunk, which is done because (1) if the data is compressed, the entire chunk must be decompressed, and (2) the library assumes that the chunk size was chosen to optimize disk performance.
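
Reason (1) can be made concrete with a short sketch. Compression was not part of this study, so the following is an illustration only: it shows how a deflate filter is attached to a chunked dataset creation property list in the current C API. Once a filter is present, the chunk is the unit of compression, so any partial read or write necessarily passes through whole chunks.

    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[2]  = {5000, 5000};
        hsize_t chunk[2] = {1000, 1000};    /* chunk equals the output block   */

        hid_t file  = H5Fcreate("compressed.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

        H5Pset_chunk(dcpl, 2, chunk);       /* filters require chunked layout  */
        H5Pset_deflate(dcpl, 6);            /* gzip level 6; each chunk is     */
                                            /* compressed as a unit, so any    */
                                            /* partial I/O touches whole chunks*/

        hid_t dset = H5Dcreate2(file, "array", H5T_NATIVE_INT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }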


Fig 4: Chunk size is 5000x1000

Increasing the chunk size further results in even worse performance, since both the read and write halves of the test re-read and re-write vast amounts of data. This shows that one should take care that chunk sizes are not much larger than the typical partial I/O request.


Fig 5: Chunk size is 250x250

If the chunk size is decreased, then the amount of data transferred between the disk and the library is optimal when no caching is performed, but the amount of meta data required to describe the chunk locations increases to 253.30 parts per million. One can also see that the final downward spike contains more file-level write requests, as the meta data is flushed to disk just before the file is closed.


Fig 6: Chunk size is 499x499

This test shows the result of choosing a chunk size which is close to the I/O block size. Because the total size of the array isn't a multiple of the chunk size, the library allocates an extra zone of chunks around the top and right edges of the array which are only partially filled. This results in 20,516,484 extra bytes of storage, a roughly 20% increase in the total raw data storage size, although the amount of meta data overhead is the same as for the 500 by 500 test. In addition, the mismatch causes entire chunks to be read in order to update a few elements along the edge of a chunk, which results in a 3.6-fold increase in the amount of data transferred.
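
The extra-storage figure follows directly from the chunk geometry: the 5000-element extent needs ceil(5000/499) = 11 chunks per dimension, so 11 * 499 = 5489 elements are allocated in each dimension. The small stand-alone C program below (not part of the original study) reproduces the arithmetic and the raw data overhead entry in Table 1.

    #include <stdio.h>

    int main(void)
    {
        const long n         = 5000;    /* array extent per dimension (elements) */
        const long chunk     = 499;     /* chunk extent per dimension (elements) */
        const long elem_size = 4;       /* bytes per integer element             */

        /* Chunks needed per dimension, rounding up for the partial edge chunks. */
        long chunks_per_dim = (n + chunk - 1) / chunk;           /* = 11          */

        long alloc_per_dim  = chunks_per_dim * chunk;            /* = 5489        */
        long alloc_bytes    = alloc_per_dim * alloc_per_dim * elem_size;
                                                                 /* = 120,516,484 */
        long data_bytes     = n * n * elem_size;                 /* = 100,000,000 */
        long overhead       = alloc_bytes - data_bytes;          /* =  20,516,484 */

        printf("raw data overhead: %ld bytes (%.2f ppm)\n",
               overhead, 1.0e6 * overhead / data_bytes);
        return 0;
    }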


THG Help Desk: 'help' at hdfgroup.org
Last update of technical content: 30 Jan 1998
Last modified: 10 September 2008