Avoiding the Disk Bottleneck in the Data Domain Deduplication File System (FAST '08)


by combining the container abstraction with a segment cache as discussed next.

For segments that cannot be resolved by the Summary Vector and LPC, we resort to looking up the segment in the segment index. We have two goals for this retrieval:

Making this retrieval a relatively rare occurrence.

Ensuring that whenever the retrieval is made, it benefits segment filtering of future segments in the locale.

4.4 Accelerated Segment Filtering

We have combined all three techniques above in the segment filtering phase of our implementation.

For each incoming segment to be written, the algorithm does the following:

1. Check the segment cache. If the segment is in the cache, it is a duplicate.

2. If it is not in the segment cache, check the Summary Vector. If it is not in the Summary Vector, the segment is new; write it into the current container.

3. If it is in the Summary Vector, look up the segment index for its container ID. If it is in the index, the incoming segment is a duplicate; insert the metadata section of that container into the segment cache, first removing the metadata section of the least recently used container if the cache is full. If it is not in the segment index, the segment is new; write it into the current container.
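The steps above can be sketched as a single filtering function. This is an illustrative sketch, not the paper's implementation: the Summary Vector is modeled as a plain in-memory set rather than a Bloom filter, the segment cache as a dict from fingerprint to container ID, and the names (`segment_index`, `container_metadata`, `current_container`) are hypothetical stand-ins.

```python
def filter_segment(fp, segment_cache, summary_vector, segment_index,
                   current_container, container_metadata):
    """Return True if the segment with fingerprint fp is a duplicate."""
    # Step 1: segment cache hit -- duplicate, no disk I/O needed.
    if fp in segment_cache:
        return True
    # Step 2: Summary Vector miss -- the segment is definitely new.
    if fp not in summary_vector:
        current_container.append(fp)
        summary_vector.add(fp)
        return False
    # Step 3: possible duplicate -- consult the on-disk segment index.
    cid = segment_index.get(fp)
    if cid is None:
        # Summary Vector false positive: the segment is actually new.
        current_container.append(fp)
        return False
    # Step 4: duplicate -- prefetch the whole container metadata
    # section into the segment cache (Locality Preserved Caching).
    for g in container_metadata[cid]:
        segment_cache[g] = cid
    return True
```

Note that the cache eviction of step 3 is omitted here for brevity; a container-granularity LRU policy would handle it.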

LPC implements a segment cache to cache likely base segment descriptors for future duplicate segments. The segment cache maps a segment fingerprint to its corresponding container ID. Our main idea is to maintain the segment cache by groups of fingerprints. On a miss, LPC will fetch the entire metadata section in a container, insert all fingerprints in the metadata section into the cache, and remove all fingerprints of an old metadata section from the cache together. This method will preserve the locality of fingerprints of a container in the cache.

The operations for the segment cache are:

Init(): Initialize the segment cache.

Insert(container): Iterate through all segment descriptors in the container metadata section, and insert each descriptor and container ID into the segment cache.

Remove(container): Iterate through all segment descriptors in the container metadata section, and remove each descriptor and container ID from the segment cache.

Lookup(fingerprint): Find the corresponding container ID for the specified fingerprint.

Descriptors of all segments in a container are added or removed from the segment cache at once. Segment caching is typically triggered by a duplicate segment that misses in the segment cache, and requires a lookup in the segment index. As a side effect of finding the corresponding container ID in the segment index, we prefetch all segment descriptors in this container to the segment cache. We call this Locality Preserved Caching. The intuition is that base segments in this container are likely to be checked against for future duplicate segments, based on segment duplicate locality. Our results on real world data have validated this intuition overwhelmingly.

We have implemented the segment cache using a hash table. When the segment cache is full, containers that are ineffective in accelerating segment filtering are leading candidates for replacement from the segment cache. A reasonable cache replacement policy is Least-Recently-Used (LRU) on cached containers.
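A minimal sketch of such a segment cache follows: a hash table from segment fingerprint to container ID, maintained in groups of fingerprints (one group per container), with LRU replacement over whole containers. The class name, the fixed `max_containers` capacity, and the method signatures are illustrative assumptions, not the paper's actual code.

```python
from collections import OrderedDict

class SegmentCache:
    """Fingerprint -> container ID map, inserted and evicted at
    whole-container granularity, with LRU replacement over containers."""

    def __init__(self, max_containers):
        self.max_containers = max_containers
        self.table = {}               # fingerprint -> container ID
        self.lru = OrderedDict()      # container ID -> its fingerprints

    def insert(self, cid, fingerprints):
        # If the cache is full, remove the metadata section of the
        # least recently used container first.
        while len(self.lru) >= self.max_containers:
            old_cid, _ = next(iter(self.lru.items()))
            self.remove(old_cid)
        self.lru[cid] = list(fingerprints)
        for fp in fingerprints:
            self.table[fp] = cid

    def remove(self, cid):
        # Remove all fingerprints of the container's group together.
        for fp in self.lru.pop(cid, []):
            self.table.pop(fp, None)

    def lookup(self, fp):
        cid = self.table.get(fp)
        if cid is not None:
            self.lru.move_to_end(cid)  # refresh container recency on a hit
        return cid
```

Grouping insertions and evictions by container is what preserves the duplicate locality of a container's fingerprints in the cache.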

We aim to keep the segment index lookup to a minimum in segment filtering.

5 Experimental Results

We would like to answer the following questions:

How well does the deduplication storage system work with real world datasets?

How effective are the three techniques in terms of reducing disk I/O operations?

What throughput can a deduplication storage system using these techniques achieve?

For the first question, we will report our results with real world data from two customer data centers. For the next two questions, we conducted experiments with several internal datasets. Our experiments use a Data Domain DD580 deduplication storage system as an NFS v3 server [PJSS*94]. This deduplication system features two-socket dual-core CPUs running at 3 GHz, a total of 8 GB of system memory, two gigabit NICs, and a 15-drive disk subsystem running software RAID 6 with one spare drive. We use one and four backup client computers running the NFS v3 client to send data.

5.1 Results with Real World Data

The system described in this paper has been used at over 1,000 data centers. The following paragraphs report the deduplication results from two data centers, generated from the auto-support mechanism of the system.


FAST '08: 6th USENIX Conference on File and Storage Technologies (USENIX Association)
