A Data Package (or DataPackage) is a coherent collection of data and possibly other assets in a single ‘package’. It provides the basis for convenient delivery, installation and management of datasets.
| Author | Rufus Pollock (Open Knowledge Foundation Labs), Matthew Brett (NiPY), Martin Keegan (Open Knowledge Foundation Labs) |
| Version | 1.0-beta.10 |
| Last Updated | 12 April 2014 |
| Created | 12 November 2007 |
Language
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
NOTE: This is a draft specification and still under development. If you have comments or suggestions please file them in the issue tracker. If you have explicit changes please fork the git repo and submit a pull request.
Changelog
- 1.0-beta.10: license introduced and licenses updated as per this issue
- 1.0-beta.8: last_modified and modified removed following this issue
- 1.0-beta.7: dependencies renamed to dataDependencies following this issue
- 1.0-beta.5 -> 1.0-beta.6: Moved resources from MUST to MAY
Specification
A data package consists of:
- Data package metadata that describes the structure and contents of the package
- Optionally, additional resources, including data files, that make up the package
A valid data package MUST provide a data package “descriptor” file named
datapackage.json.
This file should be placed in the top-level directory (relative to any other resources provided as part of the data package).
The data package descriptor is used to provide metadata about the data package and to describe its contents. The descriptor should follow the structure described in the rest of this document.
A data package will normally include other resources (e.g. data files) but the Data Package specification does NOT impose any requirements on their form or structure.
The data included in the package may be provided as:
- Files bundled into the package itself
- Remote resources, referenced by URL
- “Inline” data (see below) which is included directly in the
datapackage.json file
Illustrative Structure
A minimal data package on disk would be a directory containing a single file:
datapackage.json # (required) metadata and schemas for this data package
A package that contains no actual data is of doubtful use. A slightly less minimal version would be:
datapackage.json
# a data file (CSV in this case)
data.csv
Additional files such as a README, scripts (for processing or analyzing the data) and other material may be provided. By convention scripts go in a scripts directory and thus, a more elaborate data package could look like this:
datapackage.json # (required) metadata and schemas for this data package
README.md # (optional) README in markdown format
# data files may go either in data subdirectory or in main directory
mydata.csv
data/otherdata.csv
# the directory for code scripts - again these can go in the base directory
scripts/my-preparation-script.py
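As a rough illustration of the layouts above, the following Python sketch writes a slightly-less-minimal package (a descriptor plus one CSV file) into a temporary directory and reads the descriptor back. The helper names and the package name "example-package" are hypothetical and not part of this specification.

```python
import json
import tempfile
from pathlib import Path


def load_descriptor(package_dir):
    """Read datapackage.json from the top-level directory of a package."""
    descriptor_path = Path(package_dir) / "datapackage.json"
    with descriptor_path.open(encoding="utf-8") as f:
        return json.load(f)


def write_example_package(package_dir):
    """Create a tiny package: a descriptor plus one CSV data file."""
    package_dir = Path(package_dir)
    (package_dir / "data.csv").write_text("A,B,C\n1,2,3\n", encoding="utf-8")
    descriptor = {
        "name": "example-package",  # hypothetical package name
        "resources": [{"path": "data.csv"}],
    }
    (package_dir / "datapackage.json").write_text(
        json.dumps(descriptor, indent=2), encoding="utf-8"
    )


# Build the package in a throwaway directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    write_example_package(tmp)
    loaded = load_descriptor(tmp)
    file_names = sorted(p.name for p in Path(tmp).iterdir())
```

Here the descriptor sits in the top-level directory and the resource path is relative to it, which matches the layout described in this section.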
Several exemplar data packages can be found in the datasets organization on GitHub.
Descriptor (datapackage.json)
datapackage.json is the central file in a Data Package. It provides:
- General metadata such as the name of the package, its license, its publisher etc
- A list of the data resources that make up this data package (plus, possibly, additional schema information about these data resources in a structured form)
The Package descriptor MUST be a valid JSON file. (JSON is defined in RFC 4627.)
It MAY contain any number of attributes. All attributes at the first level not
otherwise specified here are considered metadata attributes.
A valid descriptor MUST contain a name attribute. This field, and additional
metadata attributes, are described in the “Required Fields” section below.
A valid descriptor MAY contain a resources attribute.
Here is an illustrative example of a datapackage JSON file:
{
# general "metadata" like title, sources etc
"name" : "a-unique-human-readable-and-url-usable-identifier",
"title" : "A nice title",
"licenses" : [...],
"sources" : [...],
# list of the data resources in this data package
"resources": [
{
... resource info described below ...
}
],
# optional
... additional information ...
}
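The basic descriptor rules (valid JSON, a name attribute, an optional resources attribute) could be checked with something like the Python sketch below. The function name check_descriptor and the wording of the messages are hypothetical. The sketch also assumes that the descriptor's top level is a JSON object, which the specification implies but does not state outright.

```python
import json


def check_descriptor(text):
    """Return a list of problems found in a descriptor given as JSON text."""
    try:
        descriptor = json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return ["not valid JSON: %s" % exc]
    # Assumption: the top level is a JSON object holding the attributes.
    if not isinstance(descriptor, dict):
        return ["descriptor is not a JSON object"]
    problems = []
    if "name" not in descriptor:
        problems.append("missing required attribute: name")
    # resources is optional (MAY), but when present it is an array.
    if "resources" in descriptor and not isinstance(descriptor["resources"], list):
        problems.append("resources is not an array")
    return problems
```

Note that the illustrative descriptor above uses "#" comments and "..." placeholders, so it would not pass a strict JSON parser as written.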
Metadata
Required Fields
A valid package MUST include the following fields:
- name (required) - short url-usable (and preferably human-readable) name of the package. This MUST be lower-case and contain only alphanumeric characters along with “.”, “_” or “-” characters. It will function as a unique identifier and therefore SHOULD be unique in relation to any registry in which this package will be deposited (and preferably globally unique).
  The name SHOULD be invariant, meaning that it SHOULD NOT change when a data package is updated, unless the new package version should be considered a distinct package, e.g. due to significant changes in structure or interpretation. Version distinction SHOULD be left to the version field. As a corollary, the name also SHOULD NOT include an indication of time range covered.
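The character rule for name can be expressed as a regular expression. The Python sketch below is one possible reading: it assumes "alphanumeric" means ASCII letters and digits, and the function name is hypothetical.

```python
import re

# Assumption: "alphanumeric" is read as ASCII a-z and 0-9 only.
NAME_PATTERN = re.compile(r"[a-z0-9._-]+")


def is_valid_package_name(name):
    """True if name is lower-case and uses only a-z, 0-9, '.', '_' or '-'."""
    return isinstance(name, str) and NAME_PATTERN.fullmatch(name) is not None
```

Uniqueness within a registry and invariance across versions cannot be checked from the string alone and are left out of this sketch.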
In addition to the above fields, the following fields SHOULD be included in every package descriptor:
- resources - a JSON array of hashes that describe the contents of the package. The structure of the resource hash is described in the “Resource Information” section.
- license (or licenses) - a field specifying the license (or licenses) under which the package is provided. You MAY specify either a license field or a licenses field but NOT both.
  This property is not legally binding and does not necessarily mean your package is licensed under the terms you define in this property.
  - license MUST be a string and its value SHOULD be an Open Definition license ID (preferably one that is Open Definition approved).
    { "license" : "PDDL-1.0" }
  - licenses MUST be an array. Each entry MUST be a hash with a type and a url property linking to the actual text. The type SHOULD be an Open Definition license ID if an ID exists for the license and otherwise may be the general license name or identifier. Here is an example:
    "licenses": [{ "type": "PDDL-1.0", "url": "http://opendatacommons.org/licenses/pddl/" }]
- datapackage_version - the version of the data package specification this datapackage.json conforms to. It should follow the Semantic Versioning requirements (http://semver.org/). The current version of this specification is given at the top of this document.
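The structural rules for license and licenses can be checked mechanically. The following Python sketch is only an illustration: the function name and message texts are hypothetical, and it does not check whether a value is actually a known Open Definition license ID.

```python
def check_license_fields(descriptor):
    """Return a list of problems with the license / licenses fields."""
    problems = []
    has_license = "license" in descriptor
    has_licenses = "licenses" in descriptor
    if has_license and has_licenses:
        problems.append("license and licenses must not both be present")
    if has_license and not isinstance(descriptor["license"], str):
        problems.append("license must be a string")
    if has_licenses:
        entries = descriptor["licenses"]
        if not isinstance(entries, list):
            problems.append("licenses must be an array")
        else:
            for i, entry in enumerate(entries):
                # Each entry is a hash with both a type and a url property.
                if not isinstance(entry, dict) or "type" not in entry or "url" not in entry:
                    problems.append("licenses[%d] must be a hash with type and url" % i)
    return problems
```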
Recommended Fields
Additionally, a package descriptor MAY include the following keys and values:
- title - a title or one sentence description for this package
- description - a description of the package. The first paragraph (up to the first double line break) should be usable as summary information for the package
- homepage - URL string for the data package's web site
- version - a version string identifying the version of the package. It should conform to the Semantic Versioning requirements (http://semver.org/).
- sources - an array of source hashes. Each source hash may have name, web and email fields. Example:
  "sources": [{ "name": "World Bank and OECD", "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD" }]
- keywords - an Array of string keywords to assist users searching for the package in catalogs.
- image - a link to an image to use for this data package
Optional Fields
- maintainers - Array of maintainers of the package. Each maintainer is a hash which must have a “name” property and may optionally provide “email” and “web” properties.
- contributors - an Array of hashes each containing the details of a contributor. Each hash must contain a ‘name’ property and MAY contain an email and web property. By convention, the first contributor is the original author of the package. Example:
  "contributors": [{ "name": "Joe Bloggs", "email": "[email protected]", "web": "http://www.bloggs.com" }]
- publisher - like contributors
- base - a base URI used to resolve resources that specify relative paths in the event that the actual data files specified by those resource paths are not located in the same directory in which the descriptor file (datapackage.json) resides.
- dataDependencies - Hash of prerequisite data packages on which this package depends in order to install. Follows same format as CommonJS Packages spec v1.1.
  Each dependency defines the lowest compatible MAJOR[.MINOR[.PATCH]] dependency versions (only one per MAJOR version) with which the package has been tested and is assured to work. The version may be a simple version string (see the version property for acceptable forms), or it may be a hash group of dependencies which define a set of options, any one of which satisfies the dependency. The ordering of the group is significant and earlier entries have higher priority. Example:
  "dataDependencies": {
    "country-codes": "",
    "unemployment": "2.1",
    "geo-boundaries": {
      "acmecorp-geo-boundaries": ["1.0", "2.0"],
      "othercorp-geo-boundaries": "0.9.8"
    }
  }
NOTE: A Data Package author MAY add any number of additional fields beyond those
listed in the specification here. For example, if you were storing
time series data and wanted to list the temporal coverage of the data in the
Data Package, you could add a field temporal (cf. Dublin Core):
"temporal": {
"name": "19th Century",
"start": "1800-01-01",
"end": "1899-12-31"
}
This flexibility enables specific communities to extend Data Packages as appropriate for the data they manage. As an example, the Tabular Data Package specification extends Data Package to the case where all the data is tabular and stored in CSV.
Resource Information
Packaged data resources are described in the resources property of the package descriptor.
This property is an array of values. Each value describes a single resource and
MUST be a JSON hash.
Required Fields
Resource information MUST contain (at least) one of the following attributes which specify the location of the associated data file (either online, ‘relative’ (local), or ‘inline’):
- url: url of this data resource
- path: unix-style (‘/’) relative path to the resource. Path MUST be a relative path, that is relative to the directory in which the descriptor file (datapackage.json) listing this file resides, or relative to the URI specified by the optional base property (if it is defined).
- data: (inline) a field containing the data directly inline in the datapackage.json file. Further details below.
NOTE: the use of a url allows a data package to reference data not necessarily
contained locally in the Data Package. Of course, the path attribute may still
be used for Data Packages located online (in this case it determines the
relative URL) in combination with the optional base property if it is defined.
NOTE: When more than one of url, path or data are specified consumers need to
determine which option to use (or in which order to try them). The
recommendation is to utilize the following order: data, path, url. A consumer
should also stop processing once one of these options yields data.
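The recommended lookup order (data, path, url) could be implemented along the lines of the Python sketch below. The function name is hypothetical. Joining base and path with a single "/" is an assumption, since the specification does not spell out how base is combined with a relative path.

```python
import posixpath


def candidate_locations(resource, descriptor_dir, base=None):
    """List (kind, value) pairs in the recommended order: data, path, url."""
    candidates = []
    if "data" in resource:
        candidates.append(("data", resource["data"]))
    if "path" in resource:
        path = resource["path"]
        if path.startswith("/"):
            raise ValueError("path must be relative: %r" % path)
        # Assumption: base (when given) replaces the descriptor directory
        # as the root, and the two parts are joined with a single "/".
        root = base if base is not None else descriptor_dir
        candidates.append(("path", posixpath.join(root, path)))
    if "url" in resource:
        candidates.append(("url", resource["url"]))
    if not candidates:
        raise ValueError("resource needs at least one of url, path or data")
    return candidates
```

A consumer would then try each candidate in turn and stop at the first one that yields data.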
There are NO other required fields. However, there are a variety of common fields that can be used which we detail below.
Recommended Fields
A resource SHOULD contain the following fields:
- name: a resource SHOULD contain a name attribute. The name is a simple name or identifier to be used for this resource.
  - If present, the name MUST be unique amongst all resources in this data package.
  - The name SHOULD be usable in a url path and SHOULD therefore consist only of alphanumeric characters plus “.”, “-” and “_”.
  - It would be usual for the name to correspond to the file name (minus the extension) of the data file the resource describes.
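A check for the resource name rules might look like the Python sketch below. The function name and messages are hypothetical. It assumes that names are strings and that "alphanumeric" means ASCII letters (either case) and digits.

```python
import re
from collections import Counter

# Assumption: ASCII letters of either case, digits, '.', '-' and '_'.
RESOURCE_NAME_PATTERN = re.compile(r"[A-Za-z0-9._-]+")


def check_resource_names(resources):
    """Report duplicate names (a MUST) and names that are not url-usable (a SHOULD)."""
    problems = []
    # Resources without a name are skipped, since name is only recommended.
    names = [r["name"] for r in resources if "name" in r]
    for name, count in Counter(names).items():
        if count > 1:
            problems.append("duplicate resource name: %s" % name)
    for name in names:
        if not RESOURCE_NAME_PATTERN.fullmatch(name):
            problems.append("name is not url-usable: %s" % name)
    return problems
```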
Optional Fields
A resource MAY contain any number of additional fields. Common fields include:
- format: ‘csv’, ‘xls’, ‘json’ etc. Would be expected to be the standard file extension for this type of resource.
- mediatype: the mediatype/mimetype of the resource e.g. ‘text/csv’, ‘application/vnd.ms-excel’
- encoding: specify the character encoding of a resource data file. The values should be one of the “Preferred MIME Names” for a character encoding registered with IANA. If no value for this key is specified then the default is UTF-8.
- bytes: size of the file in bytes
- hash: the md5 hash for this resource
- schema: a schema for the resource - see below for more on this in the case of tabular data.
- sources: as for data package metadata.
- licenses: as for data package metadata. If not specified the resource inherits from the data package.
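The bytes, hash and encoding fields can be derived from, or applied to, a resource's raw content. The Python sketch below uses hypothetical helper names. It assumes the md5 hash is written as a hexadecimal string, which the specification does not state, and that the encoding value is a name Python's codec registry recognizes.

```python
import hashlib


def describe_bytes(raw):
    """Compute the bytes and hash values for a resource's raw content."""
    # Assumption: the md5 hash is stored as a lower-case hex digest.
    return {"bytes": len(raw), "hash": hashlib.md5(raw).hexdigest()}


def decode_resource(raw, resource):
    """Decode raw bytes using the resource's encoding, defaulting to UTF-8."""
    return raw.decode(resource.get("encoding", "utf-8"))
```

A consumer could compare the output of describe_bytes with the values in the resource hash to detect a truncated or altered download.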
Inline Data
Resource data rather than being stored in external files can be shipped
‘inline’ on a Resource using the data attribute.
The value of the data attribute can be any type of data. However, restrictions of JSON require that the value be a string so for binary data you will need to encode (e.g. to Base64). Information on the type and encoding of the value of the data attribute SHOULD be provided by the format (or mediatype) attribute and the encoding attribute.
Specifically: the value of the data attribute MUST be:
- EITHER: a JSON array or hash - the data is then assumed to be JSON data and SHOULD be processed as such
- OR: a JSON string - in this case the format or mediatype attributes MUST be provided.
Thus, a consumer of a resource hash MAY assume, if no format or mediatype attribute is provided, that the data is JSON and attempt to process it as such.
Example 1 - inline JSON:
{
...
"resources": [
{
"format": "json",
# some json data e.g.
"data": [
{ "a": 1, "b": 2 },
{ .... }
]
}
]
}
Example 2 - inline CSV:
{
...
"resources": [
{
"format": "csv",
"data": "A,B,C\n1,2,3\n4,5,6"
}
]
}
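The rules for inline data could be handled by a consumer roughly as in the Python sketch below. The function name is hypothetical, and only CSV strings are parsed here. Any other string format is returned unchanged for the caller to decode.

```python
import csv
import io


def read_inline_data(resource):
    """Interpret the data attribute of a resource hash."""
    data = resource["data"]
    # A JSON array or hash is treated as JSON data and used directly.
    if isinstance(data, (list, dict)):
        return data
    if isinstance(data, str):
        fmt = resource.get("format")
        mediatype = resource.get("mediatype")
        if fmt is None and mediatype is None:
            raise ValueError("string data requires a format or mediatype attribute")
        if fmt == "csv" or mediatype == "text/csv":
            return list(csv.reader(io.StringIO(data)))
        # Other formats (for example Base64-encoded binary) are left as-is.
        return data
    raise ValueError("data must be a JSON array, hash or string")
```

Applied to the resource in Example 2, this returns three rows of three string values each.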
Tabular Data
For tabular data the resource information MAY contain schema information in an
attribute named schema. If schema is provided its value MUST conform to
the JSON Table Schema.
Here is an example for a CSV file:
{
# one of url or path should be present
url:
path:
dialect: # as per CSV Dialect specification
schema: # as per JSON Table Schema
}
The Tabular Data Package provides a specification focused on tabular data. It builds on this data package specification (Tabular Data Package datasets are Data Packages) and provides additional specific requirements for the format and structure of data files and the resource information in the datapackage.json.
Background
Aims
- Simple
- Extensible
- Human editable (for metadata)
- Machine usable (easily parsable and editable)
- Based on existing standard formats
- Not linked to a particular language or system
How It Fits into the Ecosystem
- Minimal wrapping to provide for machine automated sharing and obtaining of data
- Data Packages can be registered into and found in indexes (local or remote)
- Tools (based on code libraries) integrate with these indexes (and storage) to download and upload material
Appendix
The specification is heavily inspired by various software packaging formats. Read the Appendix.