50 Years of Queries
Donald Chamberlin
DOI: https://doi.org/10.1145/3649887
IF: 22.7
2024-08-02
Communications of the ACM
Abstract: E.F. Codd's "A Relational Model of Data for Large Shared Data Banks" [10] is one of the most influential papers in all of computer science. In it, Codd defined concepts that are still in widespread use today, more than five decades later, and laid the theoretical foundation of the relational database industry. When Codd's paper appeared in Communications of the ACM in June 1970, I was a student member of ACM, but I didn't receive the issue right away. I was driving cross-country from Stanford University to take a summer job at IBM's T.J. Watson Research Center in Yorktown Heights, New York. Before long, my summer job turned into a permanent IBM job, and I joined a group that was looking into the future of data management. My first task was to get up to speed on the current state of the art.
Key Insights:
- The relational data model, proposed by E.F. Codd in 1970, is the most widely used format for business data. Its practical feasibility was demonstrated in the 1970s by experimental prototypes at IBM Research and the University of California. The 1980s saw a proliferation of relational database products.
- SEQUEL (later shortened to SQL) was designed in 1974 as a language for untrained users, but it has been used mainly by professional programmers. Acceptance of SQL was aided by its adoption as an ANSI Standard and by the availability of high-quality open-source implementations.
- Today, SQL remains the most widely used query language. Current requirements for massive scalability have led to new "NoSQL" system designs that relax some of the constraints of relational systems.
Data has been stored in digital form for a long time. Herman Hollerith invented punched cards to process the 1890 U.S. census. Punched cards had a successful 65-year product life until they were largely replaced by magnetic tapes in the 1950s. In the mid-20th century, data was typically stored on a magnetic tape and dedicated to a specific application.
A tape might, for example, be used by an inventory-control application. Periodically, maybe once a week, the inventory-control job would read the tape sequentially, applying updates as it went along and producing a new, updated inventory tape. (As a college student in 1964, I had a summer job as a computer operator, running jobs like this.)
The advent of magnetic disks, introduced with the IBM RAMAC in 1956 [16], had a radical impact on how data was stored and processed. It was no longer necessary for applications to process data sequentially, since data items stored on disks could be accessed directly in any order. This gave rise to a new wave of innovation in how data should be organized on disk.
In the 1960s, a team of IBM engineers working on a NASA contract developed a disk-based information storage and retrieval system for use in the Apollo moon landing program. This system, named Information Management System (IMS), was made generally available to IBM customers in 1969. IMS organized data on disk in the form of hierarchies of "parent" and "child" records.
At about the same time, a General Electric employee named Charles Bachman, known to his friends as Charlie, was designing a system, called Integrated Data Store (IDS), for storing and retrieving data. Like IMS, IDS stored data on disk in the form of records and connections between records. Users retrieved information by explicitly referencing these connections, following paths from one record to another. Unlike IMS, however, IDS did not constrain the records to be connected in a hierarchical pattern but allowed records to be connected in networks of arbitrary complexity.
As he worked on the design of IDS, Bachman had an important insight.
If data was to be stored on disk and accessed in arbitrary order, there would no longer be a need for it to be dedicated to a single application. A new abstraction layer could be added above the operating system, managing shared data for multiple applications. This new abstraction layer, called a "database management system," could eliminate redundancy and make data consistent across applications. It could provide control over access to data by different categories of users. The database management system could provide services such as backup and recovery in the event of hardware or software failures. It could also provide transaction semantics to keep multiple concurrent users from interfering with each other. For his work in developing the concept of an integrated data -Abstract Truncated-
computer science, theory & methods, software engineering, hardware & architecture