Keeping Our Bits About Us
Stephen Manes, 02.27.06, Forbes
When it comes to preserving your digital heritage, backup is only the beginning. Is that photo album you stored on DVD going to be readable in 20 years?
You just used your high-powered digital camera to take
wonderful pictures of your kids romping at the petting zoo. But unless you're
both careful and lucky, your grandkids may never get to see them. "People
are accumulating digital photos and music and tax returns and personal
correspondence," says William LeFurgy, digital initiatives project manager
at the U.S. Library of Congress. "Eventually the disk's going to fail. If
you haven't backed it up, it's gone." Multiply that problem by a billion
or so and you begin to understand the challenge of preserving information
"born digital"--anything and everything that began its life as
electronic ones and zeroes.
A 2003 University of California study estimated that new information in electronic form amounted to about 17.7 exabytes per annum--17.7 billion gigabytes. The number has only grown since. Nowadays information--whether document, photo, architectural rendering, high-def video or aircraft design--starts out as bits. And preserving those bits for posterity isn't as easy as sticking a sheet of paper in a drawer.
"It's not at all uncommon to have people looking at printed records that are 200 years old," says Clifford Lynch, executive director of the Coalition for Networked Information. "In the traditional world many things would survive for an awfully long time just through benign neglect. For digital things, they will survive only if people plan and think systematically about their survival on a continuing basis." The issues surrounding our digital heritage have become so complex that intense academic, institutional and corporate efforts are under way to develop means of preserving digitally born data and having a chance of understanding it decades and centuries from now.
The Library of Congress is midway through its ten-year, $100 million National Digital Information Infrastructure & Preservation Program, designed to develop digital preservation strategies. Last September the U.S. National Archives awarded Lockheed Martin a $308 million contract to develop ways of preserving diverse electronic government records. In the past year and a half Iron Mountain, a 55-year-old company specializing in the storage of physical records, bought digital-archiving specialists Connected and LiveVault.
At minimum, future digital archeologists will need ways to extract information from existing and future storage media. Since devices eventually become unavailable or unworkable--when was the last time you saw a Commodore 64 floppy disk drive?--organizations bent on preservation move info from older systems to newer ones regularly. Bill Gates' company Corbis, for example, stores 73 terabytes--73,000 gigabytes--of images on hard drives it upgrades on a three-year schedule.
Just storing bits may be the easy part. Making them usable can require the hardware and software that created them. The multimedia BBC Domesday project, a record of life in Britain, cost (at historic exchange rates) $4.2 million to develop in 1986; restoring it only 15 years later required reconstructing an obsolete computer and laser-disk player, reverse-engineering software data structures and writing a new program. The Washington State Archives maintains a legacy library of old hardware and software and is beginning to collect some oft-overlooked missing links: manuals and how-to books.
Sometimes software can stand in for hardware. Software emulators let thousands of old arcade and console games be played on today's PCs--though few of those games have been licensed legally. But seemingly simple chores like translating file formats can be tricky; for example, no competing word processor renders every last element of Microsoft Word files with absolute fidelity.
The new game is to preserve and extract electronic records "free from dependence on any specific hardware or software," in the words of the National Archives. Given the diversity of what's born digital today, this is a daunting task. Kenneth Thibodeau, director of the National Archives' Electronic Records Archives program, points to Navy ships with a life span of 50 years: "All the records to keep the ships operational are digital," including computer-assisted manufacturing data designed to interface with a particular tool. As the ship gets older, "How do they know that the data can be used to replace a system if it gets damaged?"
One key to minimizing the importance of original hardware and software is metadata--additional data that describes the digital information and explains how to handle it. As Thibodeau puts it, theoretically you "wrap the records in enough information that you could figure out what you've got and what you need to do with it."
Until that exalted state comes to pass, simpler metadata can help users search and retrieve born-digital content. Information like the date-and-time stamp attached to data files, the lens and shutter info embedded in digital photo files and the correspondents included in e-mail add descriptive information without forcing users to take extra action. The content in text files amounts to internal metadata ripe for automatic indexing.
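The kind of automatic indexing described above can be sketched in a few lines of Python. This is only an illustration, not a description of any archive's actual system: the function name `build_index` is hypothetical, and it assumes a folder of UTF-8 plain-text files whose file-system "last modified" time is an acceptable stand-in for the date-and-time stamp.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path


def build_index(folder):
    """Index every .txt file under `folder` (hypothetical helper).

    Returns two dictionaries:
      word_index: lowercase word -> set of relative file names containing it
      modified:   relative file name -> last-modified time (UTC)
    """
    root = Path(folder)
    word_index = defaultdict(set)
    modified = {}
    for path in sorted(root.rglob("*.txt")):
        name = str(path.relative_to(root))
        # The time stamp is metadata the file system already keeps for us.
        modified[name] = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        # The text itself is the "internal metadata": every word is a search key.
        text = path.read_text(encoding="utf-8", errors="replace")
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            word_index[word].add(name)
    return word_index, modified
```

Looking up `word_index["zoo"]` then returns the names of the files that mention the word, with no tagging effort from the person who wrote them. Photos and sound files offer no such shortcut, which is why they need metadata supplied some other way.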
Sound and image files put more demands on humans to create metadata that makes them useful and searchable--as anybody who's received hilarious automated results from Google's image search can attest. Metadata standards do exist: Digital photojournalists, for example, often use a standard called IPTC for captions, locations and credits. Closed captions provide a form of internal metadata for TV shows. Communal metadata, like the "tagging" from users of sites like Flickr or del.icio.us, helps categorize Web pages and snapshots for retrieval.
Nonetheless, much of today's information--like, say, the Web--refuses to sit still for its portrait. If you rely on the representations of a governmental or corporate Web site, how do you later prove what was there? The Internet Archive stores snapshots of the publicly available Web, but Brewster Kahle, its director and cofounder, points out that it's like a camera with a shutter that takes two months to get the picture. Plenty of change can happen in the interim.
That changeability leads to another archival problem: authenticity. New U.S. Securities & Exchange Commission regulations require that if securities dealers maintain mandated transaction records in electronic form, they must be serialized, time stamped and stored on a nonrewritable, nonerasable medium in more than one location. Stringent but less specific rules apply for health records and those dealing with Sarbanes-Oxley compliance--and in lawsuits that compel discovery of electronic information. Compounding the headache for the health care industry is another preservation challenge: digital privacy.
One of the traditional roles of archivists--deciding what to throw away--is becoming unnecessary in many situations. Lynch observes that today's cheap storage media make it relatively easy to store virtually everything you create directly.
But sensors of all sorts--scientific, agricultural, mechanical, and even the ones in a digital camcorder--can rapidly create massive collections of digital data. Scientific endeavors like Microsoft cofounder Paul Allen's Brain Atlas project to map the activity of genes in brain cells can create a terabyte--a thousand gigabytes--of data every day. Given the costs of maintaining massive collections, determining what to keep and what to discard will remain an issue.
Automation is likely to help simplify some aspects of digital preservation. For example, BBN Technologies' PodZinger Web site uses speech-to-text software to index podcasts and let you search their content. Software that can analyze images may one day let them be catalogued and retrieved with minimal human intervention. And digital advantages, like the ease of storing multiple copies of documents at separate locations, make preservation a key way to dodge the consequences of regional disasters like Hurricane Katrina.
The Internet Archive's Kahle sees easy access to data as digital preservation's ultimate rationale. "Access," he says, "drives preservation." As the issues get sorted out, the real achievement of digital preservation may turn out to be in collaboration with the World Wide Web--opening up heretofore hidden realms of information to the genealogists, historians, scientists, authors, musicians and videographers of today and tomorrow.
Stephen Manes, 02.14.06, 4:04 PM ET
What's the best strategy for making sure what's here today won't be gone tomorrow? Consider the 5.25-inch floppy disk. A friend involved in a lawsuit frantically called in search of a compatible drive. The used one he eventually found could read only the disks' directories. Pricey data recovery services may have better luck.
The old-style floppy is a classic reminder of the need to regularly move bits from older platforms to newer ones. It's a nuisance, but any reasonable archiving strategy has to allow for it.
But which media should you use now? The usable life of recordable CDs and DVDs is subject to debate. A U.S. National Institute of Standards & Technology study suggests that some disks can hold data for "several tens of years." Adam Jansen, digital archivist of the Washington State Archives, disagrees: "The truth of the matter is we don't know. I've had CDs after a year start to suffer from CD rot; I've had CDs I created back in 1994 that are fine." NIST is currently planning a more detailed study.
There are no storage-life data as yet for new high-capacity Blu-Ray disks and HD-DVDs arriving on the market, but they typically use a hard surface coat designed to resist scratches that can obliterate data. Some premium-priced CDs and DVDs have similar coatings. The internal technology generally recognized as lasting the longest uses gold and is available from a company called MAM-A. Otherwise, stick with brands you've heard of; cheap no-name disks are not the archival media of choice.
Use fresh write-once disks, not RW models, and handle them with care. A NIST document recommends storing them upright in cases away from sources of light, heat and humidity and avoiding adhesive labels and marking pens that use anything other than water-based inks. Whatever you use, be sure it contains what you intended to put there. Employ CD- and DVD-burning software's "verify" option, which compares what was to have been written with what made it to the disk, and don't forget to label the case.
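If your burning software lacks a "verify" option, or you are copying files to some other medium, a similar check can be sketched in Python. This is a minimal sketch, not how any particular burning program implements its verify step: it assumes that comparing SHA-256 checksums of each original file and its copy is an acceptable substitute, and the names `file_digest` and `verify_copy` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(source_dir, copy_dir):
    """Compare every file under source_dir with its counterpart in copy_dir.

    Returns a sorted list of relative paths that are missing from the copy
    or whose contents differ. An empty list means the copy matches.
    """
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        copied_file = copy_dir / relative
        if not copied_file.is_file():
            problems.append(str(relative))  # never made it to the disk
        elif file_digest(source_file) != file_digest(copied_file):
            problems.append(str(relative))  # written, but not what was intended
    return problems
```

Calling `verify_copy` with the original folder and the folder on the new disk returns an empty list when every file arrived intact. Running the same check again in later years is one way to catch a disk that has started to deteriorate while the original is still around.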
USB flash drives and memory cards have a 10-to-20-year data retention spec. And though some hard drives may claim a 114-year Mean Time Between Failure specification, the "service life" and warranty are typically three to five years.
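The gap between a 114-year MTBF and a three-to-five-year service life is easier to see with a little arithmetic. The sketch below rests on two assumptions that are not from the drive makers quoted here: that the 114-year figure corresponds to a rating of 1,000,000 hours, and that failures occur at a constant rate (the standard exponential model). The function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365.25  # 8,766 hours of continuous operation


def mtbf_years(mtbf_hours):
    """Convert an MTBF rating in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_rate(mtbf_hours):
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


assumed_rating = 1_000_000  # hours; an assumption, not a figure from the article
print(round(mtbf_years(assumed_rating)))              # 114
print(f"{annual_failure_rate(assumed_rating):.2%}")   # 0.87%
```

Under those assumptions, MTBF is a statement about a large population of drives within their service life: roughly one drive in 115 fails in any given year. It says nothing about how long a single drive will keep working once it is past the service life the maker is willing to warrant.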
Generic formats like TIFF, JPEG and plain text are likely to survive longer than proprietary ones like RAW or Microsoft Word. If you've got old files created by outdated programs, now is the time to use translation software to convert them to a more modern, more generic format. But translation may not maintain all the relationships among the data; check the new files to see if you can accept the compromises. And store software carefully, particularly custom software--though it may not work properly as newer machines and operating systems arrive.
Try to keep data in multiple places to evade fire or flood. To avoid schlepping disks around, consider an online service like Connected DataProtector, which encrypts your data and sends it over the Internet to redundant sites that are not in your neighborhood. There's one more form of data storage that's worth a look: paper. It won't capture every last original bit of a digital photograph, it can fade or mildew, and it doesn't lend itself to electronic searching. But a document in a drawer is likely to be readable long after DVDs become laughable reminders of a primitive digital past.