Friday, November 07, 2008

Deduplication from a backup perspective ..

Data deduplication identifies duplicate data, removing redundancies and reducing the overall amount of data transferred and stored. Byte-level deduplication provides a more granular inspection of data than block-level approaches, ensuring greater accuracy, but it often requires more knowledge of the backup stream to do its job.

Block-level approaches

Block-level data deduplication segments data streams into blocks, inspecting the blocks to determine if each has been encountered before (typically by generating a digital signature or unique identifier via a hash algorithm for each block). If the block is unique, it is written to disk and its unique identifier is stored in an index; otherwise, only a pointer to the original, unique block is stored. By replacing repeated blocks with much smaller pointers rather than storing the block again, disk storage space is saved.
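To make that concrete, here is a minimal Python sketch of the block-level idea, assuming a fixed 4 KB block size and SHA-256 as the fingerprint; the names (`dedupe_stream`, `index`, `store`) are hypothetical and not taken from any vendor's implementation, and real products may use variable-length blocks.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed block size for this sketch

def dedupe_stream(data, index, store):
    """Split a data stream into blocks, writing only blocks not seen before.

    index: dict mapping block fingerprint -> block id
    store: list of unique blocks kept on "disk"
    Returns the list of block ids (pointers) that reconstructs the stream.
    """
    pointers = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fingerprint = hashlib.sha256(block).hexdigest()  # digital signature for the block
        if fingerprint not in index:            # unique block: write it and index it
            index[fingerprint] = len(store)
            store.append(block)
        pointers.append(index[fingerprint])     # duplicate: keep only a small pointer
    return pointers

# Example: two backups that share most of their data
index, store = {}, []
backup1 = dedupe_stream(b"A" * 8192 + b"B" * 4096, index, store)
backup2 = dedupe_stream(b"A" * 8192 + b"C" * 4096, index, store)
print(len(store))  # 3 unique blocks stored instead of 6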

Byte-level data deduplication

Analyzing data streams at the byte level is another approach to deduplication. By performing a byte-by-byte comparison of new data streams against previously stored ones, a higher level of accuracy can be delivered. Deduplication products that use this method share a common assumption: the incoming backup data stream has most likely been seen before, so it is reviewed against similar data received in the past to find the matches.
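As a rough illustration of the byte-by-byte idea (a sketch only, not how any particular product stores its results), the hypothetical `byte_level_delta` helper below compares a new stream against a previously stored one at matching offsets and keeps only the ranges that differ.

```python
def byte_level_delta(previous, current):
    """Compare the new stream against a previously stored one byte by byte,
    recording only the ranges that actually changed."""
    delta = []            # list of (offset, changed_bytes) covering only new data
    i = 0
    limit = min(len(previous), len(current))
    while i < limit:
        if previous[i] == current[i]:
            i += 1
            continue
        start = i
        while i < limit and previous[i] != current[i]:
            i += 1
        delta.append((start, current[start:i]))
    if len(current) > len(previous):             # bytes appended since the last backup
        delta.append((len(previous), current[len(previous):]))
    return delta

old = b"backup stream version one"
new = b"backup stream version two!"
print(byte_level_delta(old, new))
# -> only the changed tail is kept; the matching prefix is never stored twice
```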

Products leveraging a byte-level approach are typically "content aware," which means the vendor has done some reverse engineering of the backup application's data stream to understand how to retrieve information such as the file name, file type, date/time stamp, etc. This method reduces the amount of computation required to determine unique versus duplicate data.
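The record layout in the next sketch is entirely invented for illustration; real products reverse-engineer whatever format the actual backup application emits. It only shows why knowing the file name and timestamp cuts down the comparison work: new data is checked against earlier versions of the same file rather than against every byte ever stored.

```python
import struct

def parse_records(stream):
    """Yield (file_name, timestamp, payload) tuples from a toy backup stream:
    4-byte name length, name, 8-byte timestamp, 4-byte payload length, payload."""
    offset = 0
    while offset < len(stream):
        (name_len,) = struct.unpack_from(">I", stream, offset); offset += 4
        name = stream[offset:offset + name_len].decode(); offset += name_len
        (timestamp,) = struct.unpack_from(">Q", stream, offset); offset += 8
        (size,) = struct.unpack_from(">I", stream, offset); offset += 4
        payload = stream[offset:offset + size]; offset += size
        yield name, timestamp, payload

def candidates_for(record, previous_backup):
    """Content awareness: compare only against earlier versions of the same file."""
    name, _, _ = record
    return [r for r in previous_backup if r[0] == name]

def make_record(name, timestamp, payload):
    encoded = name.encode()
    return (struct.pack(">I", len(encoded)) + encoded +
            struct.pack(">Q", timestamp) +
            struct.pack(">I", len(payload)) + payload)

stream = make_record("notes.txt", 1225000000, b"hello") + make_record("app.log", 1225000001, b"entry")
previous = list(parse_records(stream))
print(candidates_for(("notes.txt", 1225003600, b"hello world"), previous))
```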

Backup jobs therefore complete at full disk speed, but a reserve of disk cache is required to perform the deduplication process after the data lands on disk. It's also likely that deduplication is limited to the backup stream from a single backup set rather than applied "globally" across backup sets.

Once the deduplication process is complete, the solution reclaims disk space by deleting the duplicate data. Before space reclamation is performed, an integrity check can be performed to ensure that the deduplicated data matches the original data objects. The last full backup can also be maintained so recovery is not dependent on reconstituting deduplicated data, enabling rapid recovery.
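A minimal sketch of that verify-then-reclaim step, assuming a SHA-256 digest of the original object was recorded at backup time (the `verify_before_reclaim` name and the toy data are mine, not from any product):

```python
import hashlib

def verify_before_reclaim(original_digest, pointers, store):
    """Reconstitute the deduplicated object and confirm it matches the original
    before the redundant on-disk copy is deleted."""
    rebuilt = b"".join(store[p] for p in pointers)
    return hashlib.sha256(rebuilt).hexdigest() == original_digest

# Toy data: one unique 4 KB block referenced twice reconstructs an 8 KB object.
store = [b"A" * 4096]
pointers = [0, 0]
original_digest = hashlib.sha256(b"A" * 8192).hexdigest()
if verify_before_reclaim(original_digest, pointers, store):
    print("integrity check passed; safe to delete the duplicate copy")
```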

The better one ...

Well, the choice is yours depending on what exactly you want to do. Things like the trade-off between deduplication accuracy and performance, cache ratio, etc. should come to the fore while making your decision ..