October 24-26, 2004
Ann Arbor, MI
Present:
Mark Diggory, Pascal Heus, Arofan Gregory, Sanda Ionescu, Peter Joftis, I-Lin Kuo, Ken Miller, Jostein Ryssevik, Wendy Thomas, Mary Vardigan, Achim Wackerow, Oliver Watteler
Sunday, October 24
Chair: Jostein Ryssevik
Jostein opened the meeting by thanking everyone for attending. The SRG working group has accomplished a great deal by telephone, but having a face-to-face meeting is important and appropriate at this time in the development of the data model. This will be an internal working session that will permit Arofan to go away and do the first draft of the technical implementation.
Basic outline of the three days:
Sunday:
- Review the metadata models that we might learn from and that overlap the DDI; discover which might have approaches that we can benefit from.
- Review comment log disposition and formal procedures for comments.
Monday:
Focus shifts to the DDI model itself - the model as it exists now and how we will attempt to change it.
Tuesday:
Decision Day. Based on the conclusions of the previous two days, we will focus on the areas where it is most important to start formalizing the model.
Wendy has created an SRG timetable, to which Jostein has added a red line indicating where we are now. We are delayed on some issues. This meeting will allow us to catch up.
We have created a new conceptual model that steps away from the archival model and moves us in the direction of a lifecycle model, focusing on other parts of the complete lifecycle of a dataset from conception to archiving and use. We need to see how parts of the existing DDI fit into this model. Wendy has created preliminary sketches based on this underlying lifecycle. We need to focus on areas in which the existing model doesn't function well, determine where there is a need to change logic, and try to remove inconsistencies in the model. We are laying the foundation for a modular DDI. Arofan wants to have a first version available for review by the end of December. Then we can start involving other working groups.
Arofan described a process model for reviews. There are a couple of different ways of doing reviews and making progress. Sourceforge is good for development, but it is not an acceptable way to do public reviews. We will have a schema and will use sourceforge to track its development. We will also run a formal disposition log. We will freeze a schema version in sourceforge, with line numbers in a PDF, so that we are all talking about the same thing. Using a comment log ensures that once a comment has been received and disposed of, it cannot be raised again. We need to make it obvious what people need to download to review and how to make comments. The disposition column of the comment log will be filled in by the SRG or some subset of the SRG. Disposition classes will be defined.
There will be a person or an editorial team that compiles the log. Everybody uses a template for submitting comments, and then the log is turned into a master PDF that is version controlled. We need to encourage the submission of solutions for complaints - that is, people must suggest a solution to the problem they are describing. If there are two conflicting comments, we can go to the two suggesters for further discussion.
The comments log is in Word format right now. Arofan will be the editor, and Mary Vardigan will collect comments and update the Web site. The SRG group disposes of comments, but we can perhaps rotate responsibility within the group.
Some maintenance issues are now coming through sourceforge. Bug fixes are to be disposed of quickly. Other proposals for changes to 2.0 will be voted on. All SRG members need to sign up for sourceforge. The process is that Tom, as Expert Committee Chair, will review initial submissions and then migrate them in sourceforge to Technical Review status. Sourceforge permits us to attach comments to an issue. The trackers were built to be public, but we are using them only for the SRG. We will have a separate mechanism outside of sourceforge for voting.
Other metadata models were discussed. There were 16 separate standards considered (see corresponding table). During the discussion, some items were moved to a "parking lot" for later deliberation. These items were later prioritized A, B, or C, depending on level of urgency.
Parking Lot:
- Fixity (checksums, encryption, etc.) - C
- Processing history of digital object - C
- Identifiers (URI format-bound). Regarding ids, need unique ids across set of instances to combine. - B+
- Atomic microdata; what does hierarchy look like for our data? What is DDI "item"? Hierarchy re ISADG - DONE
- Linking (METS, etc.). xlink is a barrier to adopting; METS not something we could copy - B
- Embeddable metadata; metadata embedded in data objects? Data embedded in metadata; real-time updating by tools - C
- Multiple doctypes approach (comparative data, etc.) - B
- Data typings - A
- Agencies and roles - B
- Reusable MD "core" but not problematic - A
- Multi-lingual (I-Lin's proposal module) - B
- External codes - DONE
- "Tabulation" module? Relationship to cube module? - C
- Administration" modules (or as part of archive module?) - B
- Add module to describe sampling? - A
- DDI Lite? Profiles for communities? - C
- Standard formal expression of computations - C
- Comparative data and how it fits in design - B
- Physical storage (look at CWM models) - A
- Standard data format at some point? - C
- How to deal with documentation? Dublin Core? Controlled vocabulary in terms of type - A
- Lists. Mechanism for reuse. - C
- Maintaining controlled vocabulary lists. Other standards allow you to incorporate other schemes and validate. UBL built mechanism for referencing external code lists. - A
- External classifications in XML. Neuchatel group? CLASET? (CLASET is in the public domain, and SDMX is working on a subset of this large standard). EU-funded project for statistical offices. - C
- Namespace design rules - A
- Versioning - A
- Migrating issues from V 2.0 to V 3.0 - B+
Monday, October 25, 2004
Chair: Arofan Gregory
Pascal presented information on a new initiative at the World Bank that integrates DDI. This involves a Development Data Platform (DDP), which is a next-generation data management information system for macroeconomic databases plus survey data. A new International Household Survey Network (IHSN) is being established, and the DDI is being adopted as the metadata standard for exchange.
For DDP microdata, there is a survey repository catalog that uses DDI section 2, with 1300 surveys and 400 datasets. Survey analysis is performed through Nesstar based on about 30 surveys.
In terms of availability, the initiative will go up on the project Intranet this week. It will be available to selected external partners by the end of 2004 and to the general public by mid-2005.
IHSN comes out of the Millennium Development Goals, and is intended to track major indicators. The recent Marrakech Action Plan for Statistics laid out six goals, one of which was the IHSN. The IHSN will collect better data, make more data available, and improve comparability. The project leaders first met in June 2004, then in October in Geneva. Potential IHSN participants include international banks, UN agencies, WHO, UNICEF, etc. There will be 50-100 participating countries.
Work is being done now on the development of survey data dissemination tools and guidelines. There will be a Data Dissemination Toolkit, with simple tools for producers to archive and disseminate metadata based on DDI. The goal is to make documentation available in PDF and data in the Nesstar NSDStat file format, with a free NSDStat reader. The main components of the Toolkit will be a metadata editor (an enhanced version of the Nesstar Publisher that will allow hierarchical files to be documented together); a CD-ROM builder that will write the output of the Publisher to CD-ROM; and a set of guidelines on metadata creation.
The project will further enhance the Nesstar Publisher by adding tools for quality control, subsetting, recoding, and anonymization. There will be custom branding and licensing agreements for the Publisher and CD Builder (CDs have been successful in these countries and agencies). The schedule calls for beta testing in early 2005 and release in April 2005. This means that 50-100 countries will produce DDI markup. The WHO will use it for World Fertility Surveys.
The data repository will be searchable by country, year, type of survey, and topic. Search results will include an overview of the survey(s) (DDI study description, Section 2.0), the survey instrument, and links to data, related reports, and other publications. The survey metadata will be retrievable in several different formats, including DDI and HTML. It will be possible to analyze data from a single survey, or across multiple surveys for which subsets of data have previously been harmonized.
For the DDI community, the main significance of this project will be the broad adoption of the DDI standard by several large international organizations, as well as a significant number of data collectors/producers, as noted above. The project will also help bridge the present gap between data users and producers through their links with the data disseminators. The new visibility of these data will also be an incentive for producers to strive to improve data quality.
The SRG began discussing the projected data model.
The goal of the modular structure is to be able to encapsulate, swap, and add information. Modules can be updated individually, without affecting the whole structure.
The wrapper for all modules should be message and protocol neutral. It should provide functional information about the content, and should list all contents, providing a structural map of what pieces are included and how they are related.
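As a purely illustrative sketch of such a wrapper (the element names here are invented for discussion, not proposed tags), the structural map simply lists the modules an instance carries and how they relate:

<DDIInstance agency="example.org" id="study-001">
  <StructuralMap>
    <!-- one entry per module included in this instance -->
    <Module id="m1" role="study-description" href="#studyDesc"/>
    <Module id="m2" role="physical-storage" href="#dataFiles" relatesTo="m1"/>
  </StructuralMap>
  <!-- module content, or references to external modules, would follow here -->
</DDIInstance>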
How do we treat multilingual instances? Will there be a module for each language? One solution for covering such instances would be to identify which elements can contain translated text, and enable a language string to define language used as well as other characteristics, like:
<Title> <text original="true" xml:lang="en">...</text> <text xml:lang="fr">...</text> </Title>
It is generally agreed that having individual modules for translations is not a good idea. However, the solution mentioned above would permit building individual files, if they appeared necessary.
As far as the basic design of the model is concerned, a choice may need to be made between greater simplicity, which implies tighter control, and greater flexibility, which involves greater complexity. It is generally agreed that we will strive for a model that is as simple as possible, but not simpler.
The possibility of having a "fonds"-type structure, involving hierarchical levels in addition to modularity, was discussed. The "fonds" model, currently used to describe collections of records, involves the following levels - fonds, subfonds, series, subseries, file, item. What would we include at each level? Are they all necessary?
One advantage of such a structure is that information inserted at one level would apply to everything below, and need not be repeated. One disadvantage is that a fixed-level structure appears too rigid for our purposes. Recursive generic grouping also reduces simplicity.
In connection with this hierarchical levels structure, the question also arises as to what we would describe at the lowest level. A fundamental difference between microdata and aggregate data is that the lowest level of description in microdata is the variable, while in aggregate data it is a single cell in a cube. There will therefore be two different descriptive structures at the lowest level, depending on the type of data.
For the overall model, a simpler hierarchical structure could be applied, in which there need not be too many levels, but the same principle will apply, according to which upper-level modules point to the ones below, and not the other way round. DDI instances (or "products") will be situated at a lower level, while additional information, related materials, and administrative-type documentation will be situated above, and will apply to all the lower-level modules they point to. At the same time, information at the lower level overrides information from above, when the two are different (this principle would allow versioning by module, without necessarily affecting every other module).
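A hypothetical sketch of the inheritance/override principle (all element names invented for illustration): metadata stated at a higher grouping level applies to the modules below it unless a lower-level module restates it.

<Group id="g1">
  <Producer>Agency X</Producer>          <!-- applies to every member below -->
  <StudyUnit id="s1"/>                   <!-- inherits Producer = Agency X -->
  <StudyUnit id="s2">
    <Producer>Agency Y</Producer>        <!-- overrides the group-level value -->
  </StudyUnit>
</Group>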
How are the lower-level modules -- the DDI instances themselves -- to be documented?
It was noted that variable and dimension are basically the same thing - representations of concepts. But they differ in defining values. Continuous variables are measures, while discrete variables are dimensions.
What terms are we using and what do they describe?
We noted some definitions:
Survey: Designed to gather information from a population of interest in a specific geography at a specific time.
Dataset: Data structure resulting from one or several surveys. It may have one or more physical storage representations. Datasets are at the core of DDI.
Study: One or more surveys or cubes.
Microdata: Not all microdata are survey data. Microdata result from a survey or another process.
Measure: Measure is metadata that applies at the cell level and not to the structure of a cube. We can have different cubes that only differ in terms of measures. What is the relationship between measure and variable? Several input variables can create a single measure. The unit of measure in SDMX is an attribute at the series level. Measure can be a variable or a function of a variable. Measure is different from the aggregation mechanism.
We want to provide for variables that can be both continuous and discrete at the same time - both behaviors. For a continuous age variable, for example, one may want to create age groups using categories, so we could add one or more recoding structures that software could use (see the sketch below).
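A hypothetical sketch of such a dual-behavior variable (names and structure invented for illustration): the variable itself is continuous, and an attached recoding structure lets software derive a discrete grouping on demand.

<Variable id="AGE" type="numeric">
  <Label>Age in years at last birthday</Label>
  <!-- optional recoding structure software can apply to get discrete behavior -->
  <Recode id="AGEGRP">
    <Category code="1" from="0" to="17">Under 18</Category>
    <Category code="2" from="18" to="64">18 to 64</Category>
    <Category code="3" from="65">65 and over</Category>
  </Recode>
</Variable>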
We want to distinguish between:
- Documenting the instrument
- Documenting the relationship between the instrument and the dataset
Questions: They may be simple or multiple response; they may also have flowchecks, or elements that control the flow of questions. What is missing right now in the DDI is the concept of the flow through the questionnaire, which is hard to capture. We should look at IQML or TADEQ or other survey production packages, which may help with this.
Variables may be of several types, including raw, recode, imputation flag, weight, id, classification, computed, derived, key variables. A question goes directly to a raw variable, but the path is more complicated for a recode.
"Many to manyness" depends on variable type.
In multilingual surveys, do we want to link to different variables? It depends on how they are set up. Links should point from the variable to the question, not the other direction (as sketched below).
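A hypothetical sketch of that linking direction (element names invented for illustration): the variable carries the reference, and a single question can hold its text in several languages.

<Question id="Q17">
  <QuestionText xml:lang="en">How old are you?</QuestionText>
  <QuestionText xml:lang="fr">Quel âge avez-vous?</QuestionText>
</Question>
<Variable id="V23">
  <QuestionReference idref="Q17"/>   <!-- the variable points to the question, never the reverse -->
</Variable>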
Where do we put descriptions of variables that are not based on questions? There should be something for the data-generating database queries that is equivalent to questionnaires for survey data; this would be process-derived data.
Questions and variables reference concepts. What do we point to as sources for secondary analysis?
The questions side of the development of DDI 3.0 belongs to the Instrument Documentation Working Group, so we need to follow up with them on these issues.
Tuesday, October 26
Chair: Wendy Thomas
Chris Nelson will send us the DTD for IQML, which we looked at only briefly. We need to recruit new members for the Instrument Documentation group.
It is important to revisit the issue of physical storage in the model because people are not happy with the physical storage information in the current DDI. We should have a flexible interface to physical stores in various formats. Do we have to take into account all the different formats? A URI is not sufficient. What we have now is a one-model-fits-all approach. The warehouse model is one way to model physical stores, and it could be extended to incorporate proprietary packages. We could have a generic layer in DDI, and the user could describe the specific data source in a plug-in module. We will "parking lot" this issue.
Record, relational, multidimensional, and XML are the data types in the Common Warehouse Metamodel. The traditional data in the DDI world is record data. Let's pick up on what the OMG has already done with the CWM.
If one has metadata in DDI, one is always at the mercy of proprietary formats. There may be an advantage to having our own format, but we don't want to reinvent a format and structure. Strategically, we should concentrate on metadata right now, which is our strength. We should solidify that and then there may be impetus to move to a common data format, perhaps for a Version 4.0. Triple-S has succeeded because it includes a way of encoding the data and has become an exchange format.
We now need to take Wendy's modules, drill down, and rearrange them according to the group's new conceptual model. In terms of modeling software, Wendy used Poseidon, and that is the default suggestion. I-Lin has used it to do basic things. Mark has been using a module from Eclipse and will send around links to this. We can start with Poseidon and look into using Eclipse to do class diagrams. People should use whatever tool they feel most comfortable with.
The namespace for DDI is now icpsr.umich.edu. This is a 3.0 issue. There are rules about how one publishes URIs for namespaces.
The group went back to examine the Parking Lot issues in greater detail:
- Typing in schema. PCDATA is flexible but encourages misuse. How do we get information for each type on a field-by-field basis? We can start with what is in 2.0 already and then figure out what we are adding in 3.0. Wendy will go through the current elements, determine those that require valid values, and recommend a set for comment. The goal is to disallow abuse. We should type as strongly as we can get away with; relaxing changes are easier because they are non-invalidating, so we should err on the side of stronger typing. We also need clearer and more explicit assumptions about what needs to be human- vs. machine-readable. Date is a good example: the archives need dates to be both human-readable and machine-readable. As of 3.0, do we want a DTD equivalent to the schema? There is a transition issue; should we ask this of a broader group? When deciding between elements and attributes, there are implications depending on whether we are building a DTD or a schema. We should move ahead with good schema design and not limit ourselves to what a DTD can express. That said, we may want to turn out a low-effort DTD. The canonical version of the DDI, however, will be a schema. It is possible that the "Lite" versions we envision may be DTD-based. In effect we will be reversing what we do now with 2.0, where we have a canonical DTD with a corresponding schema. (A schema sketch illustrating strong typing follows this list.)
- Reusable metadata core. Citation elements are a good example of this concept: they can be pulled from the core into specific fields. This is a good design for DDI. It is more than reusable types; it is namespace-based.
- Add a sampling module? This could contain stratification and PSU variables. Pascal will supply an elements list. Some of this will allow applications to take sampling into account. The current DTD has the descriptive part of sampling, but there is a technical part that isn't in the specification now. We may be able to compute error estimates with well-specified sampling fields that provide information to program against. This may be part of the methodology module and not a separate module in itself.
- Physical modules. Arofan will get a copy of the Common Warehouse Metamodel (CWM) and start from there.
- How to deal with documentation? What does documentation mean in this context? In 2.0, there are fields for external related material. Should we type this in terms of kinds of material? At what level do we include this? As you move through the lifecycle, there are certain documents that are relevant. It makes sense to have a generic referencing mechanism but to apply local semantics based on the module. We should create a type that is Dublin Core information with a controlled vocabulary if necessary and present a URI for this resource. This would be an external reference type, URI and Dublin Core. These two are declared as existing in the core namespace module. In the archive module, we would declare the element in the instance Related Publications. "A:RelPub" contents will be of the type declared or reference an ID. We would lay out in the schema the recognized enumeration and allow other types to be used also - that is, an enumeration plus string. This would permit local extensions that wouldn't be machine-actionable. Wendy (with Mary and Ken) will analyze the types of other materials we know about and get back to the group with a list.
- Maintaining controlled vocabularies (how extensive, how to manage them, etc.). UBL uses external codes of two types: (1) codelists, or lists of enumerated values like currency codes, country codes, etc.; and (2) keywords. The idea is to point to another scheme, point to the agency that maintains it, and give it a URL. This allows binding a schema description of values and allows people to add to the document. Right now the URI we provide can point at any type of file; it can be a PDF that you cannot validate against. We should design a standard format for the major cases. Validation is needed at authoring time. A concept is different from a keyword: it is fixed and not meant to have synonyms. Ncubes should have concepts, which they do. In Version 2.0 we also have controlled vocabularies hardwired into the DTD. We should distinguish between content vocabularies and qualifiers - lists of semantic distinctions that the DDI produces for its own purposes. The latter should be hardwired into the schema; the value of these codes informs the semantics for an action. For Source = archive/producer, we need to come up with a better list. Content should not be hardwired in. For keywords, we should use the approach in DDI 2.0, add a little more structure for the path, and add the ability to give concepts IDs. Enumerations should be treated differently based on whether they are structural qualifiers or content: content should be treated like keywords, while structural qualifiers should be enumerated in the schema. We will decide others on a case-by-case basis.
- Guide: There is a need to provide information about case/record identification, but this will actually be addressed in the complex files area. Processing instructions, which are annotations proprietary to the system processing the instance, should be included.
- Namespace design. This is related to visibility rules. One technique for modularizing schemas is to have them import other namespaces; each module could be its own namespace. This creates a natural packaging mechanism and is good for versioning: we can version namespaces. Once we publish a namespace, it never changes, and each version gets its own namespace. We should keep this to a limited set of namespaces that are functionally aligned. A version is one of the things that characterizes the namespace. (The schema sketch after this list shows one module in a versioned namespace.)
- Versioning of instances. We need to think about real things vs. virtual things vs. transient things (generated on the fly). Do we give an ID to the instance or a session ID? There is another kind of situation in which new data are generated every day on an ongoing basis. Is the DDI instance a snapshot of the metadata at a point in the lifecycle? What are we versioning, and what are the reasons we might want to version instances? There are (1) studies; (2) metadata (permanent and physical storage); and (3) XML instances (not versioned; these are transient). We also need to think in terms of products. Each product has only one maintenance agency; having a registry in the European scenario, where there are many archive products based on the same data, is a good thing. Language equivalents should be viewed as multiple products.
- Migration issues. We want to create a path to move existing instances into 3.0 using XSLT. How data typing in 3.0 affects this will depend on how the 2.0 instances were created. We need to think about creating a migration stylesheet that we can then adjust for people depending on their particular usage (a sketch of such a stylesheet follows below).
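To make the typing, namespace, and "enumeration plus string" points above concrete, here is a minimal, purely illustrative XML Schema fragment; the namespace URI, element names, and enumeration values are invented for this sketch and are not proposals.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/ddi/archive/3.0"
           elementFormDefault="qualified">
  <!-- a strongly typed date: machine-actionable, still renderable for humans -->
  <xs:element name="DepositDate" type="xs:date"/>
  <!-- enumeration plus string: known values are declared, local extensions remain possible -->
  <xs:element name="MaterialType">
    <xs:simpleType>
      <xs:union>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="codebook"/>
            <xs:enumeration value="questionnaire"/>
          </xs:restriction>
        </xs:simpleType>
        <xs:simpleType>
          <xs:restriction base="xs:string"/>
        </xs:simpleType>
      </xs:union>
    </xs:simpleType>
  </xs:element>
</xs:schema>

Note the version carried in the target namespace URI: publishing a new version of the module would mean publishing a new namespace.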
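For the migration item above, a minimal sketch of what a 2.0-to-3.0 stylesheet could start from, assuming nothing about the eventual 3.0 element names: an identity transform plus, as one example, a template rewriting the erroneous xml-lang attribute as xml:lang. A real migration sheet would add templates mapping 2.0 elements to their 3.0 equivalents.

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- identity transform: copy everything not handled by a more specific template -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- example fix-up: replace the old xml-lang attribute with xml:lang -->
  <xsl:template match="@xml-lang">
    <xsl:attribute name="xml:lang">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>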
Three changes to 2.0 are in the pipeline right now: (1) Bug fix of xml-lang to xml:lang; (2) nested categories proposal (reinstate nested categories because category groups do not work for every situation); (3) introducing use of XHTML 1.1 for formatting blocks of text (rather than using TEI tags).
What kind of input do we need from the subgroups to proceed? After we put out the initial 3.0 draft of the model within the next two months, the working groups can begin to help with that. Once we have this first draft, we will send it out for informal review and get groups' input. This will help them develop their substantive proposals. We can pull pieces out and provide them in context for the groups.
The schedule will be: Informal review in January; more formal internal review in February; possible meeting again in March; then another major revision of the schema to have something in good shape by the time of IASSIST in May. The DDI meeting in Edinburgh is on Sunday and Monday, May 22 and 23, 2005. If it is possible to have an interim face-to-face meeting between now and May, perhaps we could do that in late February or early March in Europe.
Mark, I-Lin, Wendy, and Arofan will now begin work on modeling. The deadline for other assignments is two weeks from now or November 8.
DDI-SRG Consideration of Other Metadata Standards
October 24, 2004
Dublin Core
Coverage: Digital/physical objects
Domain: Library
Purpose: Bibliographic reference; resource discovery tool
Representation: XML, RDF, etc.
Main Characteristics: There is no authority control; this is just the basic elements you need to have, very generic; do we want to borrow their terms? topic and coverage very broad level; not exact subset of MARC
Versioning: Simple, qualified
Modularity: No
Extensibility: Yes, Dublin Core plus qualified Dublin Core
Overlap: Yes, but at a general level; DDI more detailed
Learn/Borrow?: We could do Citation type = Dublin Core or MARC; we need more detail in determining how various documents are linked; we can provide a downwalk not a crosswalk; optional but redundant; less of a problem this way maintaining synchronicity; how will applications deal with this? we provide guidelines; controlled vocabularies around types; mappings between types; gross discovery with Dublin Core; base format of OAI is Dublin Core; we could have an embedded block that is Dublin Core; generate Dublin Core from a DDI instance? Dublin Core is in the wrapper; top level gets harvested; once we figure out modularization of DDI, we will know where to pull Dublin Core in; we have to use their definitions if we use their names
MARC
Coverage: Catalog records
Domain: Library
Purpose: Bibliographic reference; resource description tool
Representation: Sparse XML
Main Characteristics: Standard catalog records at a high level of description
Versioning:
Modularity:
Extensibility: No
Overlap: More general resource discovery
Learn/Borrow?: Authority control
OAIS
Coverage: Digital objects
Domain: Archives
Purpose: Best practice for running a digital archive for a designated community; developed by the Consultative Committee for Space Data Systems
Representation: N/A
Main Characteristics: SIP (Submission Information Package); AIP (Archival Information Package); DIP (Dissemination Information Package); a data object interpreted using its representation information yields an information object (IO); in our world an IO is produced by applying DDI
Versioning: No
Modularity: Yes
Extensibility:
Overlap:
Learn/Borrow?: Use for the archival portion of the DDI lifecycle; provenance, context, reference, and fixity are important preservation metadata to carry in the system; we don't have fixity currently but need to add this; capture the processing history of a collection
OAI
Coverage: Digital instances
Domain: Library and archives
Purpose: Protocol for metadata harvesting (OAI-PMH); to promote interoperability; harvesting content only; messaging protocol
Representation: XML
Main Characteristics: Only six verbs: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, ListSets; unique id that must correspond to a URI and identifies an item in a repository; a record is an item expressed in a single format; all formats have the same id; do we want a unique id not dependent on format? URI resolves to something format-bound, which is not typical; other metadata would be in the same record; reserved list of values for the formats? no
Versioning: Yes
Modularity: No
Extensibility: Yes
Overlap:
Learn/Borrow?: Need DDI in schema format for OAI; means using URI formats as identifiers; we must support OAI
Fonds / ISAD(G) General International Standard Archival Description
Coverage: Collections of records
Domain: Archives
Purpose: Reference/description purpose: to describe the whole of the records, regardless of form or medium, created, accumulated, and used by a particular person, family, etc.; a collection of somebody's papers, etc.; how traditional archives approach material
Representation: Own
Main Characteristics: 6 elements: identity; context; content and structure; conditions of access and use; allied materials; notes; five possible levels
Versioning: No
Modularity: Yes
Extensibility: No
Overlap:
Learn/Borrow?: This may be how we structure relationships in the conceptual model; describes collections of materials; archives are concerned with relationships, unlike libraries, which treat objects separately; perspective of the lifecycle of an archive; snapshot at the current time; if you change something, the fonds will change; driven by boxes of stuff; a living collection; add to the fonds each time a new file comes in; hierarchies; not detailed enough at the lower levels; we would have to extend it
EAD Encoded Archival Description
Coverage: Collections of records
Domain: Libraries and archives
Purpose: Same as above
Representation: XML
Main Characteristics: Library of Congress standard; ways of describing collections; interesting for building tools across archives
Versioning:
Modularity:
Extensibility:
Overlap:
Learn/Borrow?: Public forms of documentation for sale; authority control
METS Metadata Encoding and Transmission Standard
Coverage: Digital objects
Domain: Libraries and archives
Purpose: METS represents a set of digital objects, like a directory structure
Representation: XML
Main Characteristics: When rendering a METS instance, you can rely on that structure to navigate through content; insufficient for describing the richness between nodes; has a section for describing behaviors, like Web services that can act on it; METS is like a fonds but without fixed levels; tools like Fedora and DSpace are going METS; higher-up wrapper; insert more complex DDI content into it
Versioning:
Modularity: Yes
Extensibility:
Overlap:
Learn/Borrow?: Structural map; look at linking; wrapper; behaviors; modularity
XMP
Coverage: Digital metadata
Domain:
Purpose: Adobe's Extensible Metadata Platform; platform to attach (embed) metadata records to digital objects
Representation: RDF/XML/XMP
Main Characteristics: Picture created by a digital camera; the XMP travels with the photo; versioning added; open source; public domain; starts with a data model based on the RDF information model; storage and packaging model; set of schemas defining metadata properties; internal properties only read by software; external properties by humans
Versioning:
Modularity: Yes
Extensibility: Yes; XMP basic schema (basic descriptive information extending Dublin Core); rights schema, etc.
Overlap: Only when it comes to general descriptive Dublin Core-like information
Learn/Borrow?: Embedded, traveling metadata; modularity based on namespaces; extensibility; building the model on a simple information model like RDF; embedding DDI with SAS/SPSS would be equivalent; we are far from this right now
FGDC
Coverage: Geographic space
Domain: Geography
Purpose: Describe points, lines, polygons, circles, arcs
Representation:
Main Characteristics:
Versioning:
Modularity:
Extensibility:
Overlap:
Learn/Borrow?: Geographers search by coordinates, so we added a bounding box to the DDI; we also included a bounding polygon with a minimum of four points; our definitions differ; when geographers say smallest piece, they mean that they can create all levels in between; broad geographic element and smallest piece of geography are not sufficient; no controlled vocabulary; in redoing DDI we need to be conscious that a geographer expects to identify a geographic entity; how they understand definitions may be different; our situation now is that we have element names from FGDC and definitions from Dublin Core; should DDI be able to incorporate another standard? harmonize DDI tags with FGDC? concept of geographic time missing; get another group to take a hard look at this; proper requirements analysis needed; we don't want to move into GIS territory; there are two problems we want to solve in the DDI: (1) ability to search by geography for statistics (locate statistical data in space) and (2) put data onto a map and link to maps; link to external standards; get a working group on this; deliverables from the group: uses of information; pieces of information needed; relationships to other information in DDI; what are the geographic models out there? FGDC compliant with ISO; Open GIS and GML are simpler; what we have now is human-readable
TEI
Coverage: Text markup
Domain: Humanities
Purpose: Text markup and formatting
Representation: XML
Main Characteristics:
Versioning: Yes
Modularity:
Extensibility:
Overlap: Formatting elements used now for large text areas
Learn/Borrow?: The DDI should dispose of the TEI, but white-space formatting is not adequate; XHTML has a similar structure and seems the best option; restrict the allowed set to structural elements that require formatting; use a generic class attribute to link to a stylesheet; modular; subset the XHTML namespace; distinguish at the containing level
Triple-S
Coverage: Survey data
Domain: Market research
Purpose: Interchange standard for market research data
Representation: Pre-XML
Main Characteristics: Minimalistic DDI with a major focus on the question/variable level, although there is study-level information like title, provider, producer, etc.; doesn't support multi-file studies, aggregate data, or hierarchical dimensions; supports multiple-response questions better than DDI; language differentiation is simpler but better than the DDI's; 60 software packages have Triple-S support; interchange standard for market research; includes the data as well, as a fixed-format ASCII file
Versioning: No
Modularity: No
Extensibility: No
Overlap: 90% of elements map directly
Learn/Borrow?: Simplicity; language differentiation mechanism; made a distinction between elements that were human-readable, what would appear in different languages and what wouldn't; how can we leverage the vendor support they have? provide a crosswalk; third-party software integrates Triple-S with SPSS MR; simple core with optional modules
SDMX
Coverage: Aggregate data, almost exclusively time series
Domain: International statistical agencies
Purpose: Data exchange format with the metadata needed
Representation: XML
Main Characteristics: SDMX drew the limit of metadata at how much you need to understand the data; Version 2 will have pure metadata reporting; a level above to represent that dimensions are equal; insists on clean cubes; describes the cube and will include hierarchical code lists in the next version; has three schemas for time series, one for non-time series; exchange rates by country and by time; contains concepts; has a standard way of expressing time (ISO 8601); a concept has to be formal, with a definition; frequency and periodicity; technical specs; semantic harmonization separate; ISO 11179-style definitions for concepts; expressed using key families; the assumption of a clean cube is a big assumption; hierarchical dimensions not in yet; typically, one can link multiple cubes; embed into another data cube; when confronted with sparsely populated cubes, produces logical values; uses namespaces for modularity and extensions; mapping that lets you see how comparable this is in a registry; similar to comparability approach; published lots of different cubes not really related; would like the ability to say this time dimension is repeated across cubes; descriptive richness not tied to all cubes; document trends without the data being there; if we identify trends without the data accumulated, it is up to us to go down to the 11179 level
Versioning: Yes
Modularity: Yes
Extensibility: Yes; uses namespaces to accomplish this; early-bound namespace
Overlap: Time series data cubes
Learn/Borrow?: Encourage the aggregate group to look at this; possible to have another level of abstraction to describe a family of variables and put it into a DDI package; when documenting each instance, instead of repeating, just point to the generic description; good approach; in general could be used to describe common information; generic dimension; the way a data warehouse treats cubes; make another instance as a collection of collections; a base document and then a document that talks about more than one instance; standard ways of expressing time periods; FAME has a good code list; time series database; there is an ISO standard that deals with numbering weeks, which is important when dealing with periodicity; Venetian-blind version of the schema; URIs for identifiers
ISO 11179
Coverage: Concepts, data elements
Domain: Statistical agencies/data producers
Purpose: Provide a registry of items for use in survey questionnaires
Representation: XML
Main Characteristics: Data dictionary model; if you are doing a data vocabulary, do it like this; naming is object-oriented (object-property-representation hierarchy), but nobody uses that and the naming rules are not worth the pain involved; the guidelines for defining things are useful; managing concepts and data elements
Versioning: Five parts; the latest, unreleased version is doing away with the part numbering
Modularity: Yes
Extensibility:
Overlap: Concepts, data elements
Learn/Borrow?: Guidelines for defining things; CMR (Census) and StatCan use this; conceptual modeling and retrofitting it to work with metadata; becoming important in the data world at large; the CMR model is complex; being compliant is good politically, but what type of compliance? the core concept is appealing; different value sets attached to the same concept are key to solving comparability issues; ISO markup means you can discover comparable items; alignment is easy without using their model; we could say we are compliant now; word our definitions accordingly for Version 3; stay with these two levels of harmonization
MetaDater
Coverage: Survey data
Domain: European social science data archives, comparative data
Purpose: Involve PIs in documenting their data and store all metadata in a single place
Representation: UML use cases
Main Characteristics: Two software packages that interact and an underlying database: MD-PRO (provider) and MD-COLL (collector); a tool for documenting while doing the survey, then a tool for archives; needs a conceptual model, which is being developed; an overall model including all possible information on a survey, from the customer relationship level, user level, whom we give data to, variable level, etc.; ends up with an entity-relationship model; still working on this; underlying structure like the OAIS model
Versioning:
Modularity: Yes
Extensibility:
Overlap: Lifecycle approach; survey done, analyzed, publication, archive, dissemination; data producers are not good at documenting; getting producers to fill in documentation information up front, not ex post as happens now in archives; comparative data
Learn/Borrow?: Can they share their draft conceptual model? We will try to get a copy to inform our efforts; our data models need to align with each other
CWM Common Warehouse Metamodel
Coverage: Digital metadata
Domain: Business
Purpose: Major effort by the Object Management Group (OMG) to enable easy interchange of warehouse and business intelligence metadata between warehouse tools, platforms, and metadata repositories
Representation: UML
Main Characteristics: Built up using OMG tools: Meta Object Facility (MOF) and XMI (XML Metadata Interchange); the Object Model, a subset of UML itself, is at the bottom; cubes, XML, record data, and relational data in the object model; transformation on top of that, along with OLAP, queries and extracts, data mining, information visualization, and business nomenclature
Versioning:
Modularity: Highly modular; several layers of submodules; plug and play as you build
Extensibility:
Overlap: Yes; describing part of their package describes some of our data
Learn/Borrow?: This was cutting-edge when CORBA was in operation; it was developed in 1997; supported by Oracle, SAS, IBM, etc.
IQML
Coverage: Electronic questionnaires, Web surveys
Domain: European project
Purpose: Project to develop a language combining survey responses and questions; self-configuring; describes survey questionnaires, especially Web-based surveys
Representation: XML
Main Characteristics: Flexibility
Versioning:
Modularity:
Extensibility:
Overlap:
Learn/Borrow?: Project no longer in operation; need to talk to others about this