Emerging Trends in File Archival
Date: Saturday, August 01, 2009
These days the volume of data stored in enterprises is growing at a phenomenal rate – different estimates put the growth rate anywhere from 30 to 50 percent. One aspect of this increase that hasn’t received much attention is ‘unstructured’ data (i.e. data stored in files or messages), which is growing much faster than data stored in databases.
Managing the growth of such unstructured data is a difficult task. There is no single control point or single application that touches this data. It can be scattered across several machines, disks, and directories. File sizes can vary enormously, the data may belong to different users, and usage patterns can also differ drastically.
How does one manage such data? On the one hand, we want to keep this data accessible all the time, with acceptable access speed. On the other, we would like to control the cost of storing and securing it. An archival solution attempts to meet both these objectives.
There are different storage media available to us, and each has some unique advantages. Disks are fast and allow random access to data. Tapes are cheap, can be transported or stored in safe locations, and consume no power when not in use. We also have DVDs, which can be used for write-once storage – the medium itself guarantees that data cannot be overwritten.
But then disks consume a lot of power, and their cost per megabyte is higher than that of other media. Tapes do not allow random access, and DVDs are slower than disks and costlier than tape. So no single storage medium is ideal; a storage strategy needs to exploit the advantages of each medium while avoiding its disadvantages. But how do we do this?
File archival is, at its core, the practice of identifying less-used data and keeping it on low-cost storage. This idea is not new: it was used in some early Hierarchical Storage Management (HSM) solutions. HSM acquired a bad name because of excessive hype, and because the solutions did not meet expectations. But the basic premise behind the approach is sound. What if software could detect less-used files and move them to low-cost storage automatically? End users should not be aware of this movement – they should feel that their data is accessible all the time, on demand. If a solution achieves these basic objectives, any storage manager would love to have it. File archival solutions were born from this premise. Today, any self-respecting storage manager uses such solutions in the data center. Often, the solution is so transparent that end users are not even aware of it!
The Basic Approach
There are many ways to implement such a solution, but the most common is a file system filter driver running on the host server. This filter driver monitors the access time and modification time of each file. When a file meets certain user-defined age and/or size criteria, it is marked to be moved to secondary storage media (such as tape, DVD, or low-cost disk). When a file is moved to secondary storage, a small pointer, or stub file, is placed in its original location. The stub file records the location of the file's data on the secondary storage. It also acquires the name and other attributes of the original file, so the user believes the data is still available for use.
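The archive step described above can be sketched in a few lines. This is a simplified user-space illustration, not a filter driver: the 90-day policy, the secondary storage path, and the JSON stub format are all assumptions made for the example.

```python
import json
import os
import shutil
import time

AGE_LIMIT_SECONDS = 90 * 24 * 3600   # hypothetical policy: 90 days unaccessed
SECONDARY_ROOT = "/mnt/secondary"    # hypothetical low-cost storage mount

def archive_if_cold(path):
    """Move a cold file to secondary storage and leave a stub behind."""
    st = os.stat(path)
    if time.time() - st.st_atime < AGE_LIMIT_SECONDS:
        return False                              # still "hot" -- leave it alone
    dest = os.path.join(SECONDARY_ROOT, path.lstrip("/"))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(path, dest)                       # data now lives on secondary
    stub = {"archived_to": dest, "size": st.st_size,
            "mtime": st.st_mtime}                 # enough info to recall later
    with open(path, "w") as f:                    # stub takes the original name
        json.dump(stub, f)
    return True
```

A real filter driver would also copy over ownership and permissions so the stub is indistinguishable from the original file in a directory listing.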
The filter driver performs a second duty: monitoring access to the stub files. When a stub file is opened for reading or writing, the driver moves the data back from the secondary storage and gives it to the user. The user may experience some delay in accessing old data, but the data remains accessible at all times. A directory listing or similar metadata access is satisfied by the filter without migrating data back from secondary storage; in this way, the archival software keeps data on secondary storage as long as possible. There are several possible variations on this basic approach, but they all serve the same purpose – keeping less-accessed data on low-cost storage while keeping it accessible at all times.
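The recall path can be sketched the same way. This again is a user-space illustration: it assumes a hypothetical JSON stub that records where the data was moved, and it copies (rather than moves) the data back, leaving the secondary copy intact.

```python
import json
import shutil

def recall(path):
    """Replace a stub file with the real data from secondary storage."""
    with open(path) as f:
        stub = json.load(f)                 # stub holds the secondary location
    shutil.copy(stub["archived_to"], path)  # bring the data back to primary;
    return path                             # the secondary copy is kept
```

In a real solution the driver intercepts the open() call itself, so the application that triggered the recall simply blocks until the data is back.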
Apart from lowering the cost of storage, archival has other advantages. Archived data can be kept secure by making a backup copy for disaster recovery. Providing a disaster recovery solution for every server in an enterprise would be prohibitively expensive, but it can easily be provided for a few archival storage servers. Similarly, a large tape library or DVD library may be too expensive for a single department, but it can be attached to a shared archival server.
Archival software is one of the fastest-growing fields in storage software, and many vendors provide such solutions. Beyond basic file movement to and from low-cost storage devices, these solutions provide many advanced features. Let us look at where these solutions are today, and what is coming in the near future.
State of the Art
Retention: Retention of data is one of the most important features of any archival solution. File retention is required to meet various regulatory requirements. In the medical field, patients’ records need to be retained for several years. Similarly, all financial institutions (like banks, hedge funds, and credit card companies) need to maintain a record of all their transactions. Many corporations need to maintain the electronic mail of their senior officers to handle litigation. The storage administrators of such companies need to have a solution which can demonstrate to any regulatory authority that their data has not been tampered with. Once written, such data should not be deleted even by its owner. Archival software should be able to guarantee such retention not just on the secondary storage, but also on the primary storage.
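The retention guarantee amounts to a check that runs before any delete or overwrite is allowed. A minimal sketch, assuming each archived file carries a hypothetical retain_until timestamp in its metadata:

```python
import time

class RetentionError(Exception):
    """Raised when an operation would violate a retention policy."""
    pass

def check_delete_allowed(retain_until, now=None):
    """Refuse deletion while a file is still under retention."""
    now = time.time() if now is None else now
    if now < retain_until:
        raise RetentionError("file is under retention until %s"
                             % time.ctime(retain_until))
```

The point of the guarantee is that this check is enforced by the storage layer itself, so not even the file's owner (or an administrator) can bypass it.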
Backup integration: Archival software has many similarities with backup, but it is not a backup solution. It doesn’t maintain different versions of a file. So storage administrators need a backup solution for their data as well. Archival software should work seamlessly with backup software. A backup cycle should not unnecessarily cause data to migrate from secondary to primary storage. At the same time, an archival action should not make a backup unrecoverable. If both solutions are provided by the same vendor, the user’s task is simplified. Otherwise, the user may need to specifically test their interoperability.
Virus scan: A virus scan typically involves periodically reading all files on a system to check that they are free of viruses. But when files have already been scanned and moved to low-cost storage, there is no point in scanning them again on the primary. Archival software needs to work with the virus-scanning software to prevent such unnecessary migrations.
Application integration: Applications like email or design software use a lot of files, and they need an archival solution customized for their needs. An ideal approach is to implement basic file archival as a software module that can be included in any application-specific archival solution. Software for managing patient records could be another example, where the application can direct specific policies for different categories of patients.
Index and search: With the growth of unstructured data comes the requirement of indexing it and searching it for certain keywords or patterns. This is a growing field, and desktop search is one area where archival will play a key role. When files are archived, the index and search operation should ideally happen on the secondary storage. Archival software can generate the index during idle times, or at the time of migrating the data. It may keep index entries on disk storage for fast access. A lot of innovation is expected in this field in the coming days.
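Generating the index at migration time can be pictured as building a simple inverted index while the file's data is being read anyway. The whitespace tokenization below is a deliberate oversimplification for illustration:

```python
import re
from collections import defaultdict

index = defaultdict(set)   # word -> set of archived file paths

def index_file(path, text):
    """Add a file's words to the inverted index at archive time."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        index[word].add(path)

def search(word):
    """Return archived files containing the word, without recalling data."""
    return sorted(index[word.lower()])
```

Because the index lives on fast disk storage, a search can answer "which archived files mention this keyword?" without touching tape or DVD at all.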
Deduplication: Deduplication, or avoiding multiple copies of data, is a hot field these days. Different vendors have implemented it in different ways. Some do this on primary storage, while some others do this at the time of taking backup. When data belonging to different users is being archived, it makes sense to deduplicate it so that even less storage is used on the back-end. Thus, deduplication and archival solutions should work together seamlessly. Should we first archive, and then deduplicate the data? Or should it be the other way around? Can we introduce these solutions independently, or should they be deployed simultaneously? Will one solution trample upon the other? People are trying hard to find the answers to these problems. We can expect to see a lot of changes in this area in the near future.
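Content-addressed storage is one common way to deduplicate on the archive back-end: each file (or chunk) is stored under a hash of its contents, so identical data from different users occupies space only once. A toy sketch of the idea:

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical blobs are stored once."""
    def __init__(self):
        self.blobs = {}     # sha256 hex digest -> data

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)   # a second copy costs nothing
        return digest                         # a stub would record this digest

    def get(self, digest):
        return self.blobs[digest]
```

Real products chunk files and deduplicate at sub-file granularity, but the ordering questions raised above remain: whichever solution runs second must understand the stubs or digests left behind by the first.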
To summarize, one can say with certainty that file archival is an essential solution in any large enterprise. It allows us to keep a lid on the cost of storing our ever-growing unstructured data sets. Along with backup, deduplication, indexing, and search solutions, file archival forms a cornerstone of a successful storage management strategy.
Anil Degwekar, Consultant Software Engineer, EMC Corporation