An Archiving Problem: Can You Read Your Digital Data after 100 Years?
Date: Wednesday, November 05, 2008
In the last twenty years, IT has taken over the lives of common citizenry in most countries. Today, most of our banking transactions are through ATMs, most of our travel and hotel booking is done online, purchases are through Amazon or eBay, movie rentals done online, yada, yada. The concept of e-governance has taken off and even land records in remote corners of Andhra Pradesh have now been digitized. This has given rise to a unique and grave issue.
Consider this. Around two years ago, we had to sell off an ancestral property owned by my grandfather (born 1897). The local authorities insisted that we produce the original birth certificate for completing the formalities. After a struggle (quite a hard one), the local tahsildar could get us the birth certificate that was ultimately accepted by the authorities, to the great relief of us all.
Contrast this with a 'document' that you created 20 years ago and stored away. It would probably have been stored on an 8" diskette (outmoded now and not to be seen on today's computers) in a Double Sided Double Density format (not readable now), using an MS-DOS 6.0 operating system file format (not understandable now), and using WordStar 2.0 (not traceable now). If this is what has happened in just 20 years, imagine the likely complexity of the problem another 80 years from now. And considering e-governance, where all 'legal' records need to be retrievable after considerable durations of time, the problem is humongous.
This paper attempts to articulate the different issues around this and provides some starting points to the solutions.
The 100-year archive problem has two dimensions: data corruption and readability. Data corruption corresponds to the problem of having moths eat up all your papers. The physical media on which the information is stored gets corrupted by natural causes such as weather or demagnetization. We do not intend to address this issue here.
The issue of readability is, however, a big one that needs to be solved. The inherent issue with readability comes back to 'evolution'. The 'technologies' and 'languages' in the IT industry are evolving at a rapid pace, and hence maintaining 'backward compatibility' is a challenging task. The right analogy for the complexity of reading a stored archive 100 years later is reading a tablet inscription from Mohenjo Daro times. It would take almost the same amount of time (and skill) to read an archive 100 years from now as it takes to read the eleventh-century Chola inscription (figure 1).
The complexity is at 3 levels.
Media – The physical media would be unsupported, deprecated, or extinct. This would make reading the contents virtually impossible.
Data – The contents of the media would be stored in the native operating system or file system format. Even if you were able to read the contents from the media, the OS would not be available to decipher the byte stream.
Application – Each application stores the relevant information wrapped in 'idiosyncratic' metadata. With the application going out of production, even if you had the 'specific' file, it would be impossible to decode it without the application running.
Contrast this with the birth certificate example: the media remained the same (paper), the file format remained the same (natural human language), and the application remained the same (both the syntax and semantics of the local language had not changed in the 100 years). Thus the information, once located, could be deciphered very quickly. The key phrase in the above sentence is 'once located'. Digitization (and the addition of metadata) has helped us index, search, and hence easily locate information. But the IT industry, being a fledgling industry, has not yet successfully bargained for longevity (remember Y2K?).
This paper addresses these issues and possible solutions at each of the above levels. As you can imagine, this is a greenfield area, and the Storage Networking Industry Association (SNIA) has taken the lead in trying to come up with standards as well as best practices for long-term archiving.
Let us look at the challenges faced in each layer, the possible technical solutions to them, and the best practices that an organization should implement.
2.1 Media Layer
The data stored on physical media becomes unreadable because the media itself becomes obsolete. The challenges posed by changes in physical media can be characterized as follows:
The physical media becomes extinct and unsupported – for example, you will be unable to find drives for the magnetic tapes that were once the most popular medium for long-term storage.
The interconnects from the physical media to the compute power become extinct – for example, in the early days Winchester disk drives had a 9-pin connector to the bus.
Interconnect protocols become extinct.
Any of the above three could stop the data from being read from the media.
2.1.1 Best Practice Recommendations
1. Store all data on networked storage media. The ability to read the information then relies on a smaller number of data sources, which reduces the complexity of the problem. Managing this smaller set of devices with 'data migration' tools becomes easier, as does protecting or backing up the information.
2. The Information and Technology Audit (ITA), which is typically a yearly affair, should have a section on Inventory Management (see below). The ITA should bring out the 'media under threat' and push the vendor to have a data migration strategy (see below).
2.1.2 Technical Solutions
1. A technology inventory – All networked storage media can be uniquely identified using either a WWN (FC) or a MAC address. This number is typically assigned by the manufacturer. A standards body (SNIA) can insist that each member company submit to it an IP-protected technology library containing the specifications, source code, or tools needed to read the bit stream of these storage media. SNIA can then offer this as a paid service to end users and maintain the integrity and readability of the library. This technology inventory overcomes protocol obsolescence, while the 'networked storage' paradigm overcomes interconnect obsolescence. SNIA can also provide an accreditation program offering a comprehensive validation of the media; such accreditation would have a significant impact on vendor choice for enterprises.
2. Enterprise Hardware Inventory Management would be an important step in ensuring that the information is in the hands of the CIO. This Inventory Management solution would probably detail the hardware inventory, the amount of information stored, etc.
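As a rough illustration of the inventory and audit ideas above, the following Python sketch keeps one record per networked storage device, keyed by its WWN or MAC address, and flags the 'media under threat'. The field names, the `support_ends` year, and the five-year warning horizon are assumptions made for this example; they are not taken from any SNIA specification or product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageDevice:
    """One entry in a hypothetical enterprise hardware inventory."""
    device_id: str      # WWN (FC) or MAC address assigned by the manufacturer
    media_type: str     # e.g. "tape", "disk"
    interconnect: str   # e.g. "FC", "Ethernet"
    bytes_stored: int   # amount of information held on the device
    support_ends: int   # assumed field: year the vendor stops supporting the media


def media_under_threat(inventory, audit_year, horizon_years=5):
    """Return devices whose vendor support ends within the warning horizon."""
    cutoff = audit_year + horizon_years
    return [d for d in inventory if d.support_ends <= cutoff]


# Illustrative data only; the identifiers and sizes are made up.
inventory = [
    StorageDevice("50:06:01:60:3b:a0:11:22", "tape", "FC", 4 * 10**12, 2010),
    StorageDevice("00:1a:2b:3c:4d:5e", "disk", "Ethernet", 9 * 10**12, 2020),
]

for device in media_under_threat(inventory, audit_year=2008):
    print(device.device_id, device.media_type, "needs a migration plan")
```

A yearly audit could run such a check and hand the resulting list to the vendor as the starting point for a data migration strategy.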
2.2 Data Layer
The complexity in the data layer arises because, even if the bit stream from the media layer is readable, the file-based metadata may have become obsolete. This includes the file system metadata, which is used to work out which file is stored in which blocks, and so on. In essence, if you look at the media layer as 'physical files', then the data layer is like the table of contents. It needs to be represented in a uniform way, which would also help in searching for information better and faster.
2.2.1 Best Practice Recommendations
1. Do not use any kind of file-based storage solution. All networked storage should be block-based. This practice ensures that migrating the host systems that use the storage also migrates the existing data, which would otherwise become unusable.
2. Current regulations require backup archives to be stored for 7 years, with financial agencies in the forefront. Since any archiving solution gives two generations of backward compatibility, this is not an issue. However, for data that requires a longer duration of archiving, it is better to store the archived information in an independent bit-stream format.
2.2.2 Technical Solutions
1. Self Healing Systems – Each software vendor (Microsoft, Red Hat, etc.) would provide, along with the system installation, the ability to seamlessly migrate the data into the latest file system format.
2. The current system of automatically pushing out virus definition updates would be extended to push archive formats seamlessly into the enterprise archiving solution. This new policy push would automatically trigger the migration of archive formats.
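One way to picture such a policy push is as a chain of converters, each upgrading an archive from one format version to the next. The Python sketch below is only an illustration under that assumption: the version numbers, the `CONVERTERS` table, and the record layout are invented for the example, and no real vendor update mechanism is implied.

```python
def v1_to_v2(record):
    """Hypothetical upgrade: version 2 adds an explicit character encoding."""
    return {**record, "version": 2, "encoding": "utf-8"}


def v2_to_v3(record):
    """Hypothetical upgrade: version 3 renames 'body' to 'content'."""
    upgraded = {k: v for k, v in record.items() if k != "body"}
    upgraded.update(version=3, content=record["body"])
    return upgraded


# Assumed registry: maps a format version to the converter for the next one.
CONVERTERS = {1: v1_to_v2, 2: v2_to_v3}


def migrate(record, target_version):
    """Apply converters one step at a time until the target version is reached."""
    while record["version"] < target_version:
        step = CONVERTERS.get(record["version"])
        if step is None:
            raise ValueError(f"no converter from version {record['version']}")
        record = step(record)
    return record


def on_policy_push(archive, latest_version):
    """Simulate a pushed policy: bring every archived record up to date."""
    return [migrate(record, latest_version) for record in archive]


# Illustrative archive holding records in two older format versions.
archive = [
    {"version": 1, "body": "land record 42"},
    {"version": 2, "body": "birth certificate", "encoding": "utf-8"},
]
print(on_policy_push(archive, latest_version=3))
```

Because each converter handles a single version step, an archive that has skipped several generations can still be brought forward, provided every intermediate converter has been retained.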
2.3 Application Layer
If we extend the original analogy of the birth certificate, the media layer is like the paper, the data layer is like the filing system, and the application layer is the information laid out in the certificate. The complexity arises because natural language evolves phenomenally slowly compared to application data formats. But the information is relevant and needs to be retrievable for 100 years or more. And each application stores information in a very proprietary format, which gives rise to what is increasingly looking like an insurmountable problem.
2.3.1 Best Practice Recommendations
1. Segregate the information based on its longevity. Today's solutions segregate based on either access or regulatory requirements, but not on the basis of the information's life. Enterprises should evolve a policy infrastructure that places primacy on longevity and provides different storage mechanisms accordingly. I am sure this has been the practice from age-old times, and the inscriptions we unearth today are the ones that were intended to be preserved for long periods of time.
2. Store the long-living information in 'Text Format' only. The likelihood of this format surviving is very high because of its minimalist approach to specific metadata. It may not look pleasing to the eye in terms of features and formatting, but 100 years from now you will not care whether the legal document is in Bold, Italics, or Regular, or whether it uses a Times New Roman, Courier, or Tahoma font. All that you are interested in is the contents and not the format. Did anyone ever care about the paragraph spacing on the Raja Raja Chola inscriptions found in Thanjavur?
2.3.2 Technical Solutions
1. The end goal is to have the information stored in such a way that it is independent of the application. One of the approaches being put forward by SNIA lays out a Self-Describing, Self-Contained Data Format (SD-SCDF). This format intends to encapsulate the object information in a wrapper that lays out how the information should be decoded. It is expected that significant progress will be made towards standardization in the next few quarters.
2. A simpler solution (though not a complete one) would be to have XML wrappers around the object information, wherein the XML would describe the object decoding information. The question is: would XML survive 100 years?
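To make the XML wrapper idea concrete, here is a minimal Python sketch in which the stored object travels together with a description of how to decode it. The element names (`archivedObject`, `decoding`, `payload`) and the choice of base64 for the payload are assumptions made for illustration; they do not come from the SNIA format or any other standard.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_object(payload: bytes, media_type: str, description: str) -> str:
    """Wrap raw bytes in XML that states how the bytes should be decoded."""
    root = ET.Element("archivedObject")  # hypothetical element names throughout
    decoding = ET.SubElement(root, "decoding")
    ET.SubElement(decoding, "mediaType").text = media_type
    ET.SubElement(decoding, "transferEncoding").text = "base64"
    ET.SubElement(decoding, "description").text = description
    ET.SubElement(root, "payload").text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap_object(document: str):
    """Recover the payload and its declared media type from the wrapper."""
    root = ET.fromstring(document)
    media_type = root.findtext("decoding/mediaType")
    payload = base64.b64decode(root.findtext("payload"))
    return media_type, payload


wrapped = wrap_object(
    "Born 1897, recorded by the tahsildar.".encode("utf-8"),
    media_type="text/plain; charset=utf-8",
    description="Plain text; decode the payload bytes as UTF-8.",
)
media_type, payload = unwrap_object(wrapped)
print(media_type)
print(payload.decode("utf-8"))
```

The human-readable `description` element is the important part of the design: a future reader who no longer has the original application can still learn from the wrapper itself how to interpret the payload.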
3. Conclusion
We have highlighted the complexities involved in retaining and decoding digital information 100 years after its creation. The Media, Data, and Application layers pose significant and as yet unsolved challenges, and the different standards bodies are coming together to address this problem. So, the next time someone says that your land records are available in digital format, you know what questions to ask.
The author is VP, Mindtree