Data, Data, Everywhere
Multiterabyte databases are getting downright common. But with more real-time data, complex queries, and a growing number of sources, managing them is anything but routine.
By Charles Babcock
There's no Moore's law to sum up the growth curve of databases. But here's a rule of thumb: The amount of data stored by businesses nearly doubles every 12 to 18 months. And the very biggest--those at or near the 100-terabyte mark--probably triple every three years.
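Those rules of thumb can be put in arithmetic terms. A quick sketch (the doubling periods are the article's estimates, not measurements):

```python
def projected_size_tb(initial_tb, months, doubling_period_months):
    """Project a database's size, assuming it doubles every `doubling_period_months`."""
    return initial_tb * 2 ** (months / doubling_period_months)

# Doubling every 12 to 18 months, a 1-terabyte database
# reaches 4 to 8 terabytes within three years.
print(projected_size_tb(1, 36, 18))  # 4.0
print(projected_size_tb(1, 36, 12))  # 8.0
```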
But databases aren't just getting bigger. They're also becoming more real time. Wal-Mart Stores Inc. refreshes sales data hourly, adding a billion rows of data a day, allowing more complex searches. EBay Inc. lets insiders search auction data over short time periods to get deeper insight into what affects customer behavior. Data also is coming from increasingly complex sources: Radio-frequency identification readers now feed data to Wal-Mart, and Nielsen Media Research, in collecting info on TV-viewing habits, is getting data from TiVos along with the standard living-room set.
Businesses don't run the biggest databases in the world. That honor is reserved for the Stanford Linear Accelerator Center, NASA's Ames Research Center, and other government groups such as the National Security Agency, which run databases in the petabyte (1,000-terabyte) range. But because businesses run fast-response systems that need to quickly get data in and answers out, they're solving some of the most interesting problems in data management.
Businesses are dealing with the complexities of engineering databases that combine historical and real-time data from multiple sources. Designing and building the hundreds, even thousands, of tables that make up multiterabyte databases and the queries used to extract useful knowledge can test the technical and management skills of any database administrator. But the advantages of big databases are obvious: Most of the largest are data warehouses for analytical tasks where more, and more-detailed, data means better insights. With real-time or near-real-time data, the value of those insights increases exponentially. "We know how many 2.4-ounce tubes of toothpaste sold yesterday, and what was sold with them," says Dan Phillips, Wal-Mart's VP of information systems.
Business As Usual At Wal-Mart
"Our database grows because we capture data on every item, for every customer, for every store, every day," Phillips says. Wal-Mart deletes data after two years and doesn't track individual customer purchases, he says.
By refreshing its data warehouse hourly--updating 1 billion or more rows of data a day--Wal-Mart has turned the warehouse into an operational system for managing daily store operations. Store managers used to query the database at the end of the day to see what was selling at their location. Now they can check hourly and see what's happening at stores throughout a region that might be experiencing an unusual event such as a snowstorm or hurricane.
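A billion rows a day works out to a sustained load of more than 11,000 rows per second, around the clock. A back-of-envelope check:

```python
ROWS_PER_DAY = 1_000_000_000  # "1 billion rows of data or more are updated every day"
SECONDS_PER_DAY = 24 * 60 * 60

rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY
print(f"{rows_per_second:,.0f} rows/second sustained")  # ~11,574 rows/second
```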
Phillips tells the story of how IT staff at Wal-Mart's Bentonville, Ark., headquarters tapped into the data warehouse the morning after Thanksgiving three years ago and noticed that East Coast sales of a computer-monitor holiday special were far below expectations. Marketing staff contacted stores and learned the computers and monitors weren't being displayed together, so potential buyers couldn't see what they were getting for the posted price. Calls went out to Wal-Mart stores across the country to rearrange the displays. "By 9:30 a.m. Central, the pace of sales could be seen picking up in our data," Phillips recalls.
Blurring The Lines
The dividing line between operational and historical data isn't as firmly drawn as just a few years ago, says Bill O'Connell, chief technology officer of IBM's data-warehouse and business-intelligence business. "You're seeing a blurring of the lines between operational and strategic systems," he says. But that means the two must be carefully engineered to work together, which complicates the life of the database administrator even more.
EBay learned a big-database lesson or two as it rapidly grew into the world's largest online auction house. "We started in 1999 and 2000 with one monolithic Oracle database," says David Pride, VP of information management and delivery. "Since then, we've done a series of splits that let us scale out horizontally" into several hundred databases totaling 100 terabytes of data.
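EBay doesn't describe the mechanics of those splits, but hash-based sharding is one common way to scale a monolithic database out horizontally: each record is routed to one of many smaller databases by hashing a key. A minimal sketch; the shard count and key format are illustrative, not eBay's:

```python
import hashlib

NUM_SHARDS = 8  # illustrative; the real split count isn't given


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a record to a shard by hashing its key (e.g., an item ID).

    MD5 is used here because Python's built-in hash() is salted
    per process, which would break stable routing across restarts.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards


# The same key always routes to the same shard, so lookups
# touch only one of the smaller databases.
assert shard_for("item-12345") == shard_for("item-12345")
```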
EBay's database of customer information and products fields 750,000 queries a day, with the traffic some days peaking at 1 million, Pride says. The company's developers build business services around the massive data system, such as the Marketplaces Research tool that lets sellers research customer activity with respect to a particular item, including customers' behavior on the site, how they searched for items, and what captured their attention.
A major recent refinement lets users search a particular time period instead of only the most recent site activity. Demand for that feature came not from outside sellers but from eBay product managers and analysts who wanted to see how to build demand for specific items and manage auctions. But the tool increases query traffic and complexity. "We accommodate rather than eliminate such demands," Pride says, and eBay now sees "a lot of business value coming out of its use."
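Time-period search of activity data amounts to filtering records by a timestamp window. A minimal sketch, with an invented record layout standing in for eBay's:

```python
from datetime import datetime


def in_window(events, start, end):
    """Keep only events whose timestamp falls within [start, end)."""
    return [e for e in events if start <= e["ts"] < end]


# Hypothetical activity records, not eBay's actual schema.
events = [
    {"ts": datetime(2006, 1, 27, 9, 15), "action": "search"},
    {"ts": datetime(2006, 1, 27, 21, 40), "action": "bid"},
]

morning = in_window(
    events, datetime(2006, 1, 27, 8, 0), datetime(2006, 1, 27, 12, 0)
)
print(len(morning))  # 1
```

In a real warehouse the window would be pushed into the query itself (and served by an index on the timestamp column) rather than filtered in application code, which is part of why such features add query complexity.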
Database administrators in the 100-terabyte club say three things keep a database that size from getting out of control: solid practices for managing stored data, properly indexed tables for query processing, and good extraction, transformation, and loading techniques that ensure users can interpret the data correctly. The basic rules of database management are the same for big and small databases, but the economics aren't: the computer hardware, software, and storage needed to manage 100 terabytes of data can run into the millions of dollars.
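Those practices can be shown in miniature. The sketch below uses SQLite, with an invented table layout: rows are loaded in bulk and an index on the columns queries filter by keeps lookups from scanning the whole table as it grows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store_id INTEGER, item TEXT, qty INTEGER, day TEXT)"
)
# Index the columns that queries filter on; at warehouse scale,
# an unindexed scan over billions of rows is unworkable.
conn.execute("CREATE INDEX idx_sales_store_day ON sales (store_id, day)")

# "Load" step: invented sample rows, bulk-inserted.
rows = [
    (1, "toothpaste-2.4oz", 3, "2006-01-27"),
    (2, "toothpaste-2.4oz", 5, "2006-01-27"),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

total = conn.execute(
    "SELECT SUM(qty) FROM sales WHERE day = ?", ("2006-01-27",)
).fetchone()[0]
print(total)  # 8
```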
Companies are "spending big sums to get to the next level of detail. If there wasn't business value, then this trend would stop," says Richard Winter, president of Winter Corp., a consulting firm specializing in assembling and managing large database systems. Winter Corp. periodically surveys business, academia, government, and other groups to identify the largest databases. The survey is voluntary--eBay and Wal-Mart aren't on the list--but it provides some measure of how big is "big" and an indication of database growth rates and trends.
In Winter Corp.'s most recent survey, conducted in mid-2005, the Yahoo Search Marketing database came out on top as the largest commercial database, with 100.4 terabytes of data running on an Oracle database and Unix-based Fujitsu-Siemens server. Second place went to AT&T Labs Research, which was running a 93.9-terabyte data warehouse using its proprietary Daytona database software running on a Unix-based Hewlett-Packard server. That system has since exceeded 100 terabytes, says David Browne, AT&T executive director of enterprise data warehousing.
Coping With The Data Deluge
Nielsen database administrator Tim Geary manages not one massive data stream but multiple streams into his data warehouse, collecting data from meters inside 12,000 households. Families often have satellite services or set-top boxes, such as TiVo, that let them record a program and view it later. "Some people are watching the 6 o'clock news at 8 o'clock. A lot of other people record a program but never watch it. We have to watch the playback data," Geary says.
In the future, Geary says, Nielsen plans to collect TV-watching data from viewers who are outside the home in gyms or bars or carrying video segments with them on iPods.
That complicates his job running a data warehouse that grows 20 Gbytes daily, the equivalent of 40,000 books. The Nielsen data warehouse, running on Sybase IQ software on a Sun Microsystems server, has doubled in size every year for the last three years and now totals 20 terabytes of compressed data. Uncompressed, the data warehouse would be 80 to 100 terabytes, Geary says, making it eligible for membership in the 100-terabyte club.
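Geary's figures imply a compression ratio of roughly 4-to-1 to 5-to-1. A quick check:

```python
compressed_tb = 20
uncompressed_range_tb = (80, 100)  # Geary's uncompressed estimate

ratios = tuple(u / compressed_tb for u in uncompressed_range_tb)
print(ratios)  # (4.0, 5.0) -- i.e., 4x to 5x compression
```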
Are there any limits? Wal-Mart's Phillips isn't betting that the retailer's rate of data accumulation will level off anytime soon. Last year he added another stream of data, coming from the RFID tags Wal-Mart now requires its top suppliers to use on all shipments. He anticipates that the next generation of tags will periodically measure the temperature of chilled or frozen goods on their way to market. The readings will go into the data warehouse as proof that produce and frozen foods were kept at an acceptable temperature or should be disposed of because they weren't. He'll have no problem adding such data to the data warehouse, he says. Its business value will justify the expense.
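The temperature-proof idea Phillips describes reduces to a simple check over each shipment's readings. A sketch; the threshold is an assumption for illustration, not Wal-Mart's actual limit:

```python
SAFE_MAX_C = 4.0  # assumed ceiling for chilled goods, in degrees Celsius


def shipment_ok(readings):
    """True if every periodic temperature reading stayed at or below the limit."""
    return all(r <= SAFE_MAX_C for r in readings)


# A shipment that stayed cold passes; one warm reading flags it for disposal.
assert shipment_ok([2.1, 3.0, 3.8])
assert not shipment_ok([2.1, 5.2, 3.8])
```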
"We update over a billion records per day," he says. "The data grows because the company grows."
This story was updated Jan. 30.