Packaging and Distributing Data Collections for the Web

I had the pleasure of taking part in the Open Data on the Web Workshop, which was held in London in April 2013. I submitted a position paper, Packaging and Distributing Data Collections for the Web (PDF), to the workshop. I also participated in the panel Tabular Data Formats and Packages, chaired by Jeni Tennison. On this page, I have collected the position paper (reformatted in HTML) as well as the notes I prepared for the panel.

For more information about the workshop, please consult the workshop agenda, as well as the list of papers that were submitted to the workshop. Don't miss the excellent workshop report from Phil Archer at the W3C.

Packaging and Distributing Data Collections for the Web

Tyng-Ruey Chuang 1,2,*

1 Institute of Information Science, and 
2 Research Center for Information Technology Innovation
Academia Sinica
Nangang 115, Taipei, Taiwan

March 2013

What makes data open and free for all to reuse? We take the position that for data to be free on the web, it shall be packaged and distributed like free software. To make data collections (i.e., datasets) easy to use and useful to many, we need to consider issues related to their usage both on and off the web. These issues necessarily involve software support and tool development. Tools made for, and lessons learned from, free software development shall hence apply when data collections are to be made free.

In addition, many a data collection is a content collection whose elements can be datasets, documents, programs, and other kinds of works. The documents may describe the datasets, for example, and the programs may validate and visualize them. A collection may also contain creative works based on and derived from datasets and other elements (from the current collection or from other collections). The constituent elements of a data collection can be in textual forms. To release, revise, and redistribute data collections is to work on them as if they were software packages.

We take the view that for a data collection to be open, it shall be free to download, adapt, mix with others, and rehost for other services. Being available and accessible on the Web is not by itself sufficient. A data collection must be easily ported to other computer systems, either on or off the Web, for it to be called open. Starting from these considerations, we list below some of the main issues and offer our viewpoints.

Identifiers for, and references to, elements in a data collection shall be relative and local. Using absolute URLs (i.e., "permanent links") as IDs does not help data portability.
Packaging and depositing tools and practices for source code management can be used to help release data collections. It shall not matter where a data collection comes from as long as it can be properly authenticated.
Validation and revision tools shall be used. Datasets shall be accompanied by programs that validate their structure and integrity. Source code for programs that revise, validate, and package data collections shall be made available. Everyone must be free to modify and redistribute these utility programs.
Co-publication of datasets and documents shall be considered at the time the data is being generated. While the datasets are machine-processable, the documents aim for human readers. The two shall cross-reference each other (e.g., embedding semantic data in HTML documents with RDFa, as well as providing triple endpoints to access dataset documents). Tools shall be developed and used to extract datasets from documents, as well as to generate natural language text from datasets.
Independent services built from the same data collection are encouraged. These include meta services such as data catalogs and repositories. It shall be made easy to fork data collections and run new services based on their derivatives.
Rights and norms could be barriers to wide dissemination of some data collections but they need not be so. One way to get flexibility for maximal reuse is to treat all elements in a data collection homogeneously, for example, by declaring them to be in the public domain.
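
The first three issues above (relative identifiers, packaging, and validation) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a tool used for the collection discussed here: the function names and the manifest file name MANIFEST.json are hypothetical. Each element is identified by its path relative to the collection root, so the collection can be moved or rehosted without rewriting any identifier, and a SHA-256 digest per element lets anyone check integrity after copying. Note that digests alone establish integrity, not authenticity; authenticating a release would additionally require signing the manifest.

```python
import hashlib
from pathlib import Path

# Hypothetical name for the manifest file that travels with the collection.
MANIFEST_NAME = "MANIFEST.json"


def build_manifest(root: Path) -> dict:
    """Map each element's path, relative to the collection root, to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != MANIFEST_NAME:
            # Relative, POSIX-style identifiers stay valid wherever the collection is rehosted.
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return the relative paths whose content is missing or no longer matches its digest."""
    problems = []
    for rel, digest in manifest.items():
        target = root / rel
        if not target.is_file():
            problems.append(rel)
        elif hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            problems.append(rel)
    return problems
```

A maintainer would call build_manifest on the collection directory before a release and store the result alongside the data; anyone who receives a copy or a fork can then call verify_manifest against their own directory, wherever it lives.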

We shall illustrate these issues by the use of an exemplary public domain image collection on the Web.

* The views expressed here are those of the author, and are not necessarily those of Academia Sinica. Academia Sinica is a member of W3C.

Panel: Tabular Data Formats and Packages

(Note: The panelists were asked, in an e-mail message from the chair the day before the panel, for "a two-minute introduction from each of you about who you are and why you're interested in the topic." I prepared and read the following at the panel.)

I am Tyng-Ruey Chuang. I am a researcher from Academia Sinica in Taipei, which is a basic research institute supported by the Taiwanese government. My main research areas are programming languages, as well as standards, software, and systems for the Web, in particular in the domains of geospatial informatics and XML document transformation.

I am involved in Taiwan's National Digital Archive Program, so I have a good idea of the current practices in digitizing and preserving cultural collections. I am also the project lead of Creative Commons Taiwan, so I am familiar with copyright issues and public licenses, more or less.

Because of the research collaborations I have with others, we often deal with heterogeneous data collections. By data collections, I mean a mixture of content collections that include datasets, documents, programs, and other media files. My concern about data collections is mainly about how better to share them, and how to ensure they will remain available and usable in the long term.

In my view, the many barriers to data sharing are often cultural, legal, or institutional. The Web and its related technologies have been a big help in making data available to people. But making data available on the Web is not by itself sufficient to ensure that it will be free and open for all to reuse. Just putting data on the Web does not ensure its long-term availability either. My view is that, when thinking about issues related to open data, we can take lessons from the tools and practices used in free software development.

Allowing people to make many copies, and giving them the freedom to remix these copies with others and to distribute their modifications further, seems to me to be the key step in ensuring openness and long-term availability. How to package and distribute data collections like free software packages is therefore a topic that deeply interests me. (Let me say now that I agree with much of what Rufus just said at the beginning of this session, except about the schema part.) So I am very glad to have this opportunity to share my views in this workshop, and to learn from you.

Questions from the Panel Chair

To help structure the discussions at the panel, the chair also posed the following questions to the panelists in the e-mail message:

a. why would anyone want to publish data in any format other than JSON?
b. are there any broad types of data where we don't have good standard formats for publication?
c. how close should published formats be to those data owners create and store?
d. how important is it that metadata travels with data? what are the methods for doing that?
e. what are the important characteristics of data formats to integrate them into a web of data?

I prepared the following notes:

a. on wanting to publish data in formats other than JSON: Metadata or schema may be an issue. Are there mature tools to validate JSON datasets with respect to their schema? After some modifications to my JSON datasets, I want to validate them too. It is my impression that JSON data does not come with schema. There is a JSON Schema specification under development, so things are also moving quickly here.
b. on data types without standard formats for publication: This question comes to me when we are dealing with legacy web sites. How do we archive a web site so that it can be deployed again in other contexts? Such sites are useful now, but their future is at risk unless we know how to package a website's components and dependencies, and to publish them in ways that make it easy to rehost the site on other systems.
c. on published data format vs. native data formats: I think it is better if we all publish in standard data formats.
d. on whether metadata shall travel with data, and how: Very important. Data shall always travel with metadata, in particular the data schema and data validation tools. This ensures that when the data is modified, it can be checked to confirm that the resulting data is still in good shape and useful to others.
e. on characteristics of data formats for the web: It is important that there is an open process for the specification and revision of the data format. Free software for data validation, editing, and conversion shall come with the format. As for conversion to other formats, the targets should include HTML and RDF for the metadata part at the very least.
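
To make notes (a) and (d) concrete, here is a minimal Python sketch of a dataset whose schema travels with it, so that the data can be revalidated after every modification. Everything in it is an assumption for illustration: the package layout ("schema" and "rows"), the field names, the sample rows, and the hand-rolled checker are hypothetical, and the checker does not implement the JSON Schema specification mentioned above.

```python
import json

# Hypothetical package: the schema is carried in the same JSON document as the rows.
PACKAGE = json.loads("""
{
  "schema": {"fields": [{"name": "id", "type": "integer"},
                        {"name": "title", "type": "string"}]},
  "rows": [{"id": 1, "title": "first item"},
           {"id": 2, "title": "second item"}]
}
""")

# Mapping from the (hypothetical) schema type names to Python types.
PYTHON_TYPES = {"integer": int, "string": str, "number": (int, float)}


def validate(package: dict) -> list:
    """Return a list of problems; an empty list means every row fits the schema."""
    fields = {f["name"]: PYTHON_TYPES[f["type"]] for f in package["schema"]["fields"]}
    problems = []
    for i, row in enumerate(package["rows"]):
        for name in sorted(fields.keys() - row.keys()):
            problems.append(f"row {i}: missing field '{name}'")
        for name in sorted(row.keys() - fields.keys()):
            problems.append(f"row {i}: unexpected field '{name}'")
        for name, expected in fields.items():
            if name not in row:
                continue
            value = row[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                problems.append(f"row {i}: field '{name}' is not of the declared type")
    return problems
```

After editing the rows, calling validate(PACKAGE) again reports whether the modified data still conforms. Because the schema and the checker are distributed with the data, anyone who forks the collection can run the same check.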

(Note: The actual discussions at the panel were rather spontaneous and lively, and I can no longer recall them in detail.)