Expert Committee Meeting | Data Documentation Initiative

October 12-13, 2003
Ann Arbor, Michigan

Present:
Tom Piazza, Chair (University of California - Berkeley, Computer-Assisted Survey Methods Program); Arofan Gregory, XML Consultant (AEON Consulting); Atle Alvheim (Norwegian Social Science Data Services [NSD]); Bill Bradley (Health Canada); Pat Doyle (U.S. Census Bureau, Demographic Surveys Division); Andrew Dzhigo (Princeton University Library); Ilona Einowski (University of California - Berkeley, UCDATA Archive); Fred Gey (University of California - Berkeley, UCDATA Archive); Pascal Heus (World Bank); James Jacobs (University of California, San Diego); Peter Joftis (University of Michigan, ICPSR); Ryan Johnson (Washington State University); Julie Linden (Yale University, Social Science Libraries & Information Services); Margaret Low (California Digital Library, University of California); Marc G. Maynard (University of Connecticut, Roper Center); Meinhard Moschner (Zentralarchiv, Cologne); Ron Nakao (Stanford University); Pilar Rey del Castillo (Centro De Investigaciones Sociologicas); Jostein Ryssevik (Nesstar Ltd); Janet M. Eisenhauer Smith (University of Wisconsin - Madison); Jeanne Spicer (Pennsylvania State University, Social Science Research Institute); Kevin Schurer, for Ken Miller (UK Data Archive); Wendy L. Thomas (University of Minnesota, Minnesota Population Center); Mary Vardigan (ICPSR); Oliver Watteler (Zentralarchiv, Cologne)

Also attending as observers:
Cavan Capps (U.S. Census Bureau); Mark Diggory (Harvard-MIT Data Center); Ann Green, Steering Committee (Yale University); Bjorn Henrichsen, Steering Committee (Norwegian Social Science Data Service); Dag Kiberg (Norwegian Social Science Data Service); Richard Rockwell, Steering Committee (Roper Center); Ed Thomson (Health Canada); Myron Gutmann, Steering Committee (ICPSR); Sanda Ionescu (ICPSR); I-Lin Kuo (ICPSR); Matthew Richardson (ICPSR)

Introductions and Procedural Issues

Interim Chair Tom Piazza opened the meeting and welcomed participants to the first meeting of the DDI Alliance Expert Committee. Steering Committee members Myron Gutmann, Bjorn Henrichsen, and Richard Rockwell offered additional welcoming comments and emphasized the importance of the work of the Expert Committee. Participants introduced themselves, described their interest in and use of the DDI, and discussed their expectations for the Alliance.

Tom briefly discussed the fact that the Expert Committee is a large group and will be most effective if it can form working groups to focus on specific tasks. Also, according to the Bylaws, the group must elect (1) a chair to head the Committee and attend meetings of the Steering Committee, and (2) a second representative to the Steering Committee meetings. Tom indicated that elections would take place later during the meeting.

A discussion of how the Committee should communicate took place. Virtual office software was raised as a possibility, as were listservs and bulletin boards, like the ezboard communications software that the group has been testing.

Data Model

After a brief presentation on the history of the DDI effort and a summing up of where the effort stands, there was discussion of a data model for the DDI. It is generally agreed that the XML Document Type Definition (DTD) for the DDI has limitations: it is not as modular and easily extensible as it should be and it has not been thoroughly reviewed for internal logic. The Committee needs to develop a data model, most likely in Universal Markup Language (UML), to reflect the underlying design and structure of the specification. Once the data model is in place, the DDI can be expressed as XML Schema, RDF, a DTD, or possibly other formats.

The Health Canada/Nesstar partnership has already done some work on a data model for the DAIS/nesstar software that harmonizes the DDI with ISO 11179. In the ISO standard, the notion of "concept" is central; ISO 11179 is also strong on administration and informs a number of metadata repositories on the Web.

There are clear differences between the DDI and ISO 11179, but they can be harmonized and can enrich one another. As developers of the DDI specification, we need to determine how we can learn from the ISO 11179 approach and how we can create interoperability with the ISO standard so that data that are ISO 11179-compliant can migrate into DDI.

Currently, the ISO 11179 standard is just a UML model, and there is no other representation. It basically lifts the data element, or variable, out of the study context that archives are familiar with. There is strength in this approach: it may help us to solve the problems inherent in documenting series and longitudinal data. We may decide that the new DDI data model should also focus on the variable and should in effect break the link between the variable and the survey. We need to recognize, though, that there are serious research issues in comparing variables across studies; one needs to know with certainty that variables are comparable in terms of sampling frame and other methodological issues.

In thinking about the aggregate extension recently added to the DDI specification in Version 2.0, it is clear that the logic of the "nCubes" is working but that the extension is positioned wrongly in the larger DDI structure. These sorts of issues can be remedied through the construction of an accurate and well thought out data model. The future of the DDI may involve a UML model and Schemas for separate modules. The Committee was encouraged to describe the process that we want to document first and then to construct a model based on the process.

We cannot abandon the DTD since a lot of markup has been done according to Versions 1 and 2. We probably need to proceed on parallel tracks, moving the DTD along from Version 2.0 to 2.x at the same time that we begin to create a modular Version 3 based on the new data model.

SDMX (Statistical Data and Metadata Exchange)

Arofan Gregory, a consultant with expertise in XML, apprised the group of another initiative he is involved in called SDMX. This is a project to develop an interchange format for time series data and metadata. In the SDMX model the data transfer format is separate from application-specific information, like OLAP cubes. The SDMX model is mainly about data; the metadata it carries is mostly about how the cubes were constructed. Thus, there seems a natural complementarity between the two efforts, and we should look for the points of intersection. Others on the Expert Committee also view this as a natural partnership.

XML Schema Version

We need to be thinking of a master plan with goals and a strict timetable to inform our work. The MetaDater and MADIERA projects in Europe are designed to be compatible with the DDI model, so we need to interface with those efforts. We need to think about moving the DTD to a Schema to take advantage of the modularity in Schemas, the capability for local extension, and the flexibility of namespaces. If we stay with the rigid DTD, any extension will break the standard, which is not the case in Schemas. With Schemas it is possible to control the amount of extensibility permitted. The METS Schema also has potential for the DDI project. METS is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, and some members of the Committee have used the METS Schema and inserted a DDI namespace. Namespaces lets the user control ownership and versioning.

ICPSR and Harvard-MIT Data Center have been working on a Schema version of the DDI that incorporates all of the documentation found in the Tag Library as well as the DTD comments.

Working Groups

It was decided to set up two major working groups, with subgroups:

Structural Reform Working Group
Substantive Content Working Group, broken out into Group 1: Aggregate Data, Geography & Time; Group 2: Comparative Data/Families of Datasets; Group 3: Complex Files; Group 4: Instrument Documentation
Usability and Outreach Working Group

The Structural Reform Working Group will take on the task of "schematizing" Version 2.0 of the DTD. Substantive proposals can be fed to the Structural Reform group, but we first need to set up some style guidelines for the architecture of the proposals.

We also need to bring the DDI into XML.org and OASIS. This can be a task for the Usability and Outreach group.

There is some funding from the Alliance for meetings of the working groups, either face-to-face or telephone conferences. It should be noted that the XML specification was developed without any face-to-face meetings. ICPSR will explore the cost of telephone conferencing. Another useful tool for communications is the WIKI, which ICPSR also has experience with. MhonArc, a threaded email list, is another tool that the DDI committee used previously.

Elections

The Expert Committee Chair and a second representative to the Steering Committee will have voting privileges in the Steering Committee, which has budgetary and oversight responsibilities for the Alliance. It was decided that the two elected representatives should have staggered terms, with the Chair elected for a two-year term and the second representative, who could also function as a Vice-Chair, for a one-year term, with the following term lasting two years. Nothing precludes nomination of the same person after the first term is completed.

Tom Piazza was elected as the Chair of the Committee and Hans Jorgen Marker of the Danish Data Archive (not present) was elected as the second representative to the Steering Committee.

Working Groups Structure

The following Working Groups were established:

Structural Reform Working Group:

Cavan Capps - Census Bureau
Mark Diggory - Harvard-MIT Data Center
Andrew Dzhigo - Princeton Library
Arofan Gregory - AEON Consulting
Pascal Heus - World Bank
I-Lin Kuo - ICPSR
Pilar Rey del Castillo - CIS, Spain
Jostein Ryssevik - Nesstar Ltd (Chair)
Wendy Thomas - Minnesota Population Center (Vice Chair)

Substantive Content Working Group:

Atle Alvheim - Norwegian Social Science Data Service
Pat Doyle - U.S. Census Bureau
Ilona Einowski - University of California, Berkeley - UCDATA (Chair)
Janet Eisenhauer - University of Wisconsin
Fred Gey - University of California, Berkeley - UCDATA
Peter Granda - ICPSR
Ryan Johnson - Washington State University
Julie Linden - Yale University
Margaret Low - California Digital Library
Mark Maynard - Roper Center
Meinhard Moschner - Zentralarchiv
Tom Piazza - University of California, Berkeley - CSM
Wendy Thomas - University of Minnesota
Ed Thomson - Health Canada
Oliver Watteler - Zentralarchiv

Subgroup 1: Aggregate Data, Geography & Time

Atle Alvheim
Ilona Einowski
Fred Gey
Peter Granda
Julie Linden
Margaret Low (Chair)
Wendy Thomas
Ed Thomson

Subgroup 2: Comparative Data/Families of Datasets

Atle Alvheim
Ryan Johnson
Mark Maynard
Meinhard Moschner
Wendy Thomas
Oliver Watteler (Chair)

Subgroup 3: Complex Files

Pat Doyle (Chair)
Janet Eisenhauer - University of Wisconsin
Peter Joftis
Tom Piazza

Subgroup 4: Instrument Documentation

Pat Doyle (Chair)
Tom Piazza

Usability and Outreach Working Group

Bill Bradley - Health Canada
Pat Doyle - Census Bureau
Janet Eisenhauer - University of Wisconsin
Pascal Heus - World Bank
Sanda Ionescu - ICPSR
Jim Jacobs - University of California, San Diego
Ryan Johnson - Washington State University
Matthew Richardson - ICPSR (Chair)
Jeanne Spicer - Pennsylvania State University
Mary Vardigan - ICPSR

Each group needs to determine how to communicate and meet and the frequency of their contacts.

Deliverables

It was decided that there would be two short-term deliverables:

Version 2.0 Schema released and fully documented by the end of November
Six months later, another version: an interim alpha version that would incorporate information from the substantive content groups and would be leaning toward the modular Version 3

In addition, the Structural Reform group will recommend formats for substantive proposals by the end of November.

The issue of whether the DDI should deal with access conditions was raised. Four or five years ago, it was decided not to include much detail on access conditions in the specification because the assumption was that this was an application issue. We should probably revisit this issue again. XML security and access systems have been developed that we might be able to link to or adopt.

Next Meeting

We have tentatively reserved Saturday, May 29, 2004 -- the day after the IASSIST meeting in Madison Wisconsin -- for the next Expert Committee meeting. This is Memorial Day weekend, however, and it is not certain how many can attend. If Working Groups want to get together at IASSIST, they should feel free to do so and should contact ICPSR to get meetings set up. More information will be forthcoming on the next meeting of the full Committee.