siliconindia

Emerging Trends in File Archival

Author: Anil Degwekar, Consultant Software Engineer, EMC Corporation

Introduction

The volume of data stored in enterprises is growing at a phenomenal rate; different estimates put the growth rate anywhere between 30 and 50 percent. One aspect of this increase that has not received much attention is that ‘unstructured’ data (i.e. data stored in files or messages) is growing much faster than data stored in databases.

Managing the growth of such unstructured data is a difficult task. No single control point or application touches all of this data. It can be scattered across several machines, disks, and directories. File sizes can vary enormously, the data may belong to different users, and usage patterns can also vary drastically.

How does one manage such data? On the one hand, we want to keep this data accessible all the time, with an acceptable access speed. On the other, we would like to control the cost of storing and securing it. An archival solution attempts to meet both of these objectives.

There are different storage media available to us for storing data, and each of them has some unique advantages. Disks are fast and allow random access to data. Tapes are cheap, and can be transported or kept in safe locations. Tapes do not consume any power when they are not being used. We also have DVDs, which can be used for write-once storage: the medium itself guarantees that data cannot be overwritten.

But disks consume a lot of power, and their cost per megabit is higher than that of other media. Tapes do not allow random access, and DVDs are slower than disks and costlier than tape. So no single storage medium is ideal, and a storage strategy needs to exploit the advantages of each medium while avoiding its disadvantages. But how can this be done?

File Archival

File archival essentially means identifying less-used data and keeping it on low-cost storage systems. The idea is not new; it was used in some early Hierarchical Storage Management (HSM) solutions. HSM acquired a bad name because of excessive hype, and also because the solutions did not meet expectations. But the basic premise behind that approach is not wrong. What if software could detect less-used files and move them to low-cost storage automatically? End users should not be aware of this movement; they should feel that their data is accessible all the time, on demand. If a solution achieves these basic objectives, any storage manager would love to have it. File archival solutions were born from this premise. Today, any self-respecting storage manager would use such solutions in the data center. Often the solution is so transparent that the end user is not even aware of it!

The Basic Approach

There are many ways to implement such a solution, but the most common method is to create a file system filter driver running on the host server. This filter driver monitors the access time and modification time of each file. When a file meets certain user-defined age and/or size criteria, it is marked to be moved to secondary storage media (such as tape, DVD, or low-cost disk). When a file is moved to secondary storage, a small pointer, or stub file, is placed in its original location. The stub file records the location of the file data on the secondary storage. It takes on the name and other attributes of the original file, so the user sees the data as still available.
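
The selection and stub steps can be sketched in a few lines of Python. This is only a user-space illustration, not a filter driver. The stub format (a marker followed by a JSON record), the names STUB_MAGIC, should_archive and archive_file, and the choice to require both the age and the size criterion are all assumptions made for the example.

```python
import json
import os
import shutil
import time

STUB_MAGIC = b"ARCHIVE-STUB-V1:"  # hypothetical marker identifying a stub file


def should_archive(path, min_age_days, min_size_bytes, now=None):
    """Return True when the file is old enough and large enough to archive."""
    st = os.stat(path)
    now = time.time() if now is None else now
    last_used = max(st.st_atime, st.st_mtime)  # most recent access or modification
    age_days = (now - last_used) / 86400
    return age_days >= min_age_days and st.st_size >= min_size_bytes


def archive_file(path, secondary_dir):
    """Move the file data to secondary storage and leave a stub in its place."""
    st = os.stat(path)
    # Simplification: files are stored flat by base name, so name clashes are not handled.
    target = os.path.join(secondary_dir, os.path.basename(path))
    shutil.move(path, target)
    record = {"location": target, "size": st.st_size, "mtime": st.st_mtime}
    with open(path, "wb") as f:
        f.write(STUB_MAGIC + json.dumps(record).encode("utf-8"))
    os.utime(path, (st.st_atime, st.st_mtime))  # stub keeps the original timestamps
    return record
```

A policy job would call should_archive on each file under a directory and pass the matches to archive_file. A real product would hide the stub contents from the user and report the original size, which a plain file like this cannot do.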

The filter driver has a second duty: monitoring access to the stub files. When a stub file is opened for reading or writing, the driver moves the data back from the secondary storage and gives it to the user. The user may therefore experience some delay in accessing old data, but the data remains accessible all the time. A directory listing or other such metadata access is answered by the filter itself, without migrating data back from the secondary storage. Thus the archival software tries to keep data on secondary storage as much as possible. There are several possible variations of this basic approach, but they all achieve the same basic purpose: keeping less accessed data on low-cost storage, and yet keeping it accessible all the time.
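
The difference between a metadata request and a data request can be illustrated as follows. This is a minimal sketch under stated assumptions: the stub is taken to be a small file holding a hypothetical marker (STUB_MAGIC) followed by a JSON record with location, size and mtime fields, and the helper names read_stub, stat_archived and open_archived are invented for the example. A real filter driver would intercept these calls inside the operating system.

```python
import json
import os
import shutil

STUB_MAGIC = b"ARCHIVE-STUB-V1:"  # hypothetical marker identifying a stub file


def read_stub(path):
    """Return the stub record if the file is a stub, otherwise None."""
    with open(path, "rb") as f:
        if f.read(len(STUB_MAGIC)) != STUB_MAGIC:
            return None
        return json.loads(f.read().decode("utf-8"))


def stat_archived(path):
    """Metadata request: answered from the stub, so no data is recalled."""
    stub = read_stub(path)
    if stub is None:
        st = os.stat(path)
        return {"size": st.st_size, "mtime": st.st_mtime, "archived": False}
    return {"size": stub["size"], "mtime": stub["mtime"], "archived": True}


def open_archived(path, mode="rb"):
    """Data request: recall the data from secondary storage, then open the file."""
    stub = read_stub(path)
    if stub is not None:
        shutil.copy2(stub["location"], path)  # overwrite the stub with the real data
        os.remove(stub["location"])           # the secondary copy is no longer needed
    return open(path, mode)
```

Calling stat_archived on a stub returns the original size and modification time while the data stays on secondary storage. Only open_archived pays the cost of the recall.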

Apart from lowering the cost of storage, archival has other advantages. Archived data can be kept secure by making a backup copy for disaster recovery. The cost of providing a disaster recovery solution for all the servers in an enterprise would be prohibitive, but it can easily be provided for a few archival storage servers. Similarly, a large tape library or DVD library may be too expensive for a single department, but it can be attached to the archival server.

Archival software is one of the fastest growing fields in storage software. There are many vendors who provide such solutions. Apart from basic file movement to and from low cost storage devices, these solutions provide many advanced features. Now we will have a look at where these solutions are today, and what is coming in the near future.

State of the Art

Retention: Retention of data is one of the most important features of any archival solution. File retention is required to meet various regulatory requirements. In the medical field, patients’ records need to be retained for several years. Similarly, all financial institutions (like banks, hedge funds, and credit card companies) need to maintain a record of all their transactions. Many corporations need to maintain the electronic mail of their senior officers to handle litigation. The storage administrators of such companies need to have a solution which can demonstrate to any regulatory authority that their data has not been tampered with. Once written, such data should not be deleted even by its owner. Archival software should be able to guarantee such retention not just on the secondary storage, but also on the primary storage.
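
The two retention duties described above, refusing deletion and demonstrating that data has not been altered, can be sketched as simple bookkeeping. The class and method names below are hypothetical, and the registry lives only in memory. An actual product would enforce retention in the storage layer itself, which this sketch does not attempt.

```python
import hashlib
import time


class RetentionViolation(Exception):
    """Raised when a delete is attempted before the retention period ends."""


class RetentionRegistry:
    """Tracks a retention deadline and a content digest for each archived file."""

    def __init__(self):
        self._records = {}  # name -> (retain_until_epoch, sha256 hex digest)

    def register(self, name, data, retain_seconds, now=None):
        now = time.time() if now is None else now
        digest = hashlib.sha256(data).hexdigest()
        self._records[name] = (now + retain_seconds, digest)

    def can_delete(self, name, now=None):
        now = time.time() if now is None else now
        record = self._records.get(name)
        return record is None or now >= record[0]

    def delete(self, name, now=None):
        # Applies to every caller, including the owner of the file.
        if not self.can_delete(name, now):
            raise RetentionViolation(f"{name} is still under retention")
        self._records.pop(name, None)

    def is_untampered(self, name, data):
        """Compare the current content with the digest taken at archive time."""
        return hashlib.sha256(data).hexdigest() == self._records[name][1]
```

Recording the digest at archive time is one way to show later that the content is unchanged. The digest proves nothing if the registry itself can be edited, so it would need the same protection as the data.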

Backup integration: Archival software has many similarities with backup, but it is not a backup solution. It doesn’t maintain different versions of a file. So storage administrators need a backup solution for their data as well. Archival software should work seamlessly with backup software. A backup cycle should not unnecessarily cause data to migrate from secondary to primary storage. At the same time, an archival action should not make a backup unrecoverable. If both solutions are provided by the same vendor, the user’s task is simplified. Otherwise, the user may need to specifically test their interoperability.

Virus scan: A virus scan typically involves reading all files on a system periodically to check that they do not contain a virus. But when files have already been scanned and moved to low-cost storage, there is no point in checking them again on the primary storage, and reading them would pull the data back from secondary storage. Archival software needs to work with the virus scan software to prevent such unnecessary migrations.
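
One simple form of this cooperation is for the scanner or backup job to recognize stubs and leave them alone. The sketch below assumes that a stub is a small file starting with a hypothetical marker (STUB_MAGIC); the helper names are also invented for the example. Real products expose the archived state through file attributes or a vendor API instead.

```python
import os

STUB_MAGIC = b"ARCHIVE-STUB-V1:"  # hypothetical marker identifying a stub file


def is_stub(path):
    """Check the first bytes only; in this sketch that never triggers a recall."""
    with open(path, "rb") as f:
        return f.read(len(STUB_MAGIC)) == STUB_MAGIC


def files_to_scan(root):
    """Yield only the files whose data still lives on primary storage."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not is_stub(path):
                yield path
```

A scanner that iterates over files_to_scan(root) instead of every file in the tree never opens archived data, so it causes no migration back to primary storage.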

Application integration: Applications like email or design software use a lot of files, and they need an archival solution customized for their needs. An ideal approach is to implement basic file archival as a software module, which can be included in any application-specific archival solution. Software for managing patient records is another example, where the application can apply specific policies to different categories of patients.

Upcoming Trends

Index and search: With the growth of unstructured data comes the need to index it and search it for certain keywords or patterns. This is a growing field, and desktop search is one area where archival will play a key role. When files are archived, the index and search operation should ideally happen on the secondary storage. Archival software can generate the index during idle times, or at the time of migrating the data. It may keep index entries on disk storage for fast access. A lot of innovation is expected in this field in the near future.
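
Indexing at migration time can be illustrated with a small inverted index. This is a sketch, not a description of any vendor's design: the ArchiveIndex class is hypothetical, it handles plain text only, and it keeps the index in memory where a product would keep it on fast disk.

```python
import re
from collections import defaultdict


class ArchiveIndex:
    """Keyword index built while files are being moved to secondary storage."""

    def __init__(self):
        self._postings = defaultdict(set)  # word -> names of files containing it

    def add(self, name, text):
        """Index one file; meant to be called at the moment the file is archived."""
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            self._postings[word].add(name)

    def search(self, *keywords):
        """Return the files that contain every keyword, without recalling any data."""
        if not keywords:
            return []
        sets = [self._postings.get(k.lower(), set()) for k in keywords]
        return sorted(set.intersection(*sets))
```

Because the index is consulted instead of the files, a keyword search touches neither the stubs nor the secondary storage. Only the files the user then opens need to be recalled.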

Deduplication: Deduplication, or avoiding multiple copies of data, is a hot field these days. Different vendors have implemented it in different ways. Some do it on primary storage, while others do it at backup time. When data belonging to different users is being archived, it makes sense to deduplicate it so that even less storage is used on the back-end. Thus, deduplication and archival solutions should work together seamlessly. Should we first archive, and then deduplicate the data? Or should it be the other way around? Can we introduce these solutions independently, or should they be deployed simultaneously? Will one solution trample upon the other? People are trying hard to find the answers to these questions. We can expect to see a lot of changes in this area in the near future.
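
To make the idea concrete, here is a minimal sketch of deduplication on the archive back-end, where identical files from different users are stored once. It works on whole files and uses a SHA-256 digest as the content key. Both choices are assumptions made for the illustration, as is the DedupStore class itself; vendors may instead deduplicate at block or segment level.

```python
import hashlib


class DedupStore:
    """Archive back-end that keeps one copy of each distinct file content."""

    def __init__(self):
        self._blobs = {}  # digest -> content, stored once
        self._files = {}  # archived file name -> digest

    def put(self, name, data):
        """Archive a file; content that is already stored is not stored again."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)
        self._files[name] = digest
        return digest

    def get(self, name):
        """Recall the content of an archived file."""
        return self._blobs[self._files[name]]

    def stored_bytes(self):
        """Bytes actually consumed on the back-end."""
        return sum(len(blob) for blob in self._blobs.values())
```

If two users archive the same 4 MB presentation, stored_bytes grows by 4 MB only once, while both file names remain recallable through get.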

Conclusion

To summarize, one can say with certainty that file archival is an essential solution in any large enterprise. It allows us to keep a lid on the cost of storing our ever-growing unstructured data sets. Along with backup, deduplication, indexing and search solutions, file archival forms a cornerstone of a successful storage management strategy.
