This document defines a simple data publishing format (Tabular Data Package) for publishing and sharing tabular-style data.

Version1.0-beta-2
Last Updated16 March 2014
CreatedMay 7 2012

Language

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Summary

This document defines a simple data publishing format (Tabular Data Package) for publishing and sharing tabular-style data.

The format’s focus is on simplicity and ease of use over the web – that is, usage online with access and transmission over HTTP. In addition the format is focused on data that can be presented in a tabular structure and in making it easy to produce (and consume) this format from spreadsheets and relational databases.

The key features of this format are the following:

  • CSV (comma separated variables) for data
  • Single JSON file (datapackage.json) to describe the dataset including a schema for data files
  • Reuse wherever possible of existing work including other Data Protocols specifications

As suggested by the name, Tabular Data Package extends and specializes the Data Package spec for the specific case where the data is tabular.

If you have comments or suggestions please file them in the issue tracker at: https://github.com/dataprotocols/dataprotocols/issues.

Changelog

  • 1.0-beta-2: renamed from Simple Data Format to Tabular Data Package

Table of Contents

Quick Start

A Tabular Data Package package contains:

  • Data files in CSV
  • (Minimal) dataset information in JSON (including a schema for the CSV)

Here’s an example of a minimal simple data format dataset:

2 files

data.csv
datapackage.json


data.csv
--------

var1,var2,var3
A,1,2
B,3,4

datapackage.json
----------------

{
  "name": "my-dataset",
  # here we list the data files in this dataset
  "resources": [
    {
      "path": "data.csv",
      "schema": {
        "fields": [
          {
            "name": "var1",
            "type": "string"
          },
          {
            "name": "var2",
            "type": "integer"
          },
          {
            "name": "var3",
            "type": "number"
          }
        ]
      }
    }
  ]
}

Specification

This specification builds on the Data Packages specification. It defines a profile for publishing data packages that places some additional constraints on metadata and data formats.

A valid Tabular Data Package package MUST be a valid Data Package as defined in that specification. This means that it MUST:

  • Contain a package descriptor (datapackage.json)
  • Provide at least the minimum required package metadata as described in the Data Package specification
  • Include a description of each data file in the package in the resources array of the package

In addition to those basic rules, a valid Tabular Data Package package MUST conform to the following additional requirements:

  • It MUST contain at least one data file
  • All data files MUST be in CSV format, see below for additional rules on CSV file naming and structure
  • Every resource MUST have a schema following the JSON Table Schema specification

CSV Files

As a starting point, CSV files included in a Tabular Data Package package must conform to the RFC for CSV (4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files). In addition:

  • File names MUST end with .csv
  • Files MUST be encoded as UTF-8
  • Files MUST have a single header row. This row MUST be the first row in the file.

    • Terminology: each column in the CSV file is termed a field and its name is the string in that column in the header row.
    • The name MUST be unique amongst fields and MUST contain at least one character
    • There are no further restrictions on the form of the name but it is RECOMMENDED that it contain only alphanumeric characters together with “ .-_”
  • Rows in the file MUST NOT contain more fields than are in the header row (though they may contain less)
  • Each file MUST have an entry in the resources array in the datapackage.json file
  • The resource metadata MUST include a schema attribute whose value MUST conform to the JSON Table Schema
  • All fields in the CSV files MUST be described in the schema

CSV files generated by different applications often vary in their syntax, e.g. use of quoting characters, delimiters, etc. To encourage conformance, CSV files in a Tabular Data Package SHOULD

  • Use “,” as field delimiters
  • Use “\r\n” or “\n” as line terminators

If a CSV file does not follow these rules then its specific CSV dialect MUST be documented. The resource hash for the resource in the datapackage.json descriptor MUST:

Applications processing the CSV file SHOULD use the dialect of the CSV file to guide parsing.

Key Design Features and Principles

The format’s focus is on simplicity and web usage – that is, usage online with access and transmission over HTTP. In addition the format is focused on data that can be presented in a tabular structure and in making it easy to produce (and consume) this format from spreadsheets and relational databases.

The key features of this format are the following:

  • CSV (comma separated variables) as the base data format
  • JSON as the base format for schema definition
  • Reuse wherever possible of existing work including other Data Protocols specifications

Why CSV

  1. CSV is very simple - it is perhaps the most simple data format
  2. CSV is tabular-oriented. Most data structures are either tabular or can be transformed to a tabular structure by some form of normalization
  3. It is open and the “standard” is well-known
  4. It is widely supported - practically every spreadsheet program, relational database and programming language in existence can handle CSV in some form or other
  5. It is text-based and therefore amenable to manipulation and access from a wide range of standard tools (including revision control systems such as git, mercurial and subversion)
  6. It is line-oriented which means it can be incrementally processed – you do not need to read an entire file to extract a single row. For similar reasons it means that the format supports streaming.

More informally:

CSV is the data Kalashnikov: not pretty, but many wars have been fought with it and kids can use it. [@pudo (Friedrich Lindenberg)]

CSV is the ultimate simple, standard data format - streamable, text-based, no need for proprietary tools etc [@rufuspollock (Rufus Pollock)]

Why JSON

  • JSON is simple
  • JSON supports rich structure including nesting and basic types
  • JSON is very widely used and supported (all major programming languages can handle JSON)
  • JSON is web-native (every browser can access and manipulate JSON)
  • JSON is readable as simple text making it amenable to management and processing using simple text tools

Open Issues

  • Foreign keys between files
  • Primary keys and uniqueness

Alternatives Discussion

What alternatives are there for the data portion of the format?

  • Use line oriented JSON (?)

    • Greater flexibility and complexity on types and objects
    • No support from spreadsheets
  • Use SQLite

    • (+) compact, full DB in one
    • (-) lack of support (e.g. from spreadsheets), no streaming, not accessible with text tools etc

NOTE: this specification owes a great deal to the excellent Dataset Publishing Language (DSPL) put forward by Google. The main difference is in using JSON instead of XML for the schema and re-using as far as possible the JSON-LD schema language (based on linked-data) rather than inventing a new type and schema structure.

Alternative CSV format

This describes a possible alternative CSV-based format for serializing the metadata. It has the advantage that you could then put a single dataset in one spreadsheet (using multiple sheets).

In this format each key in the JSON file becomes a column. In addition there is one initial column named @key@. Values for this column are then attribute names. To illustrate (spacing for illustrative purposes only):

@key@, year, series, ...
@type, ...,  xsd:string
simpletype, date, string
label, ....
...