Guide to Wrangling Metatab Packages
Warning
This section hasn’t been updated recently, and is probably only of historically suggestive value.
Setting the Name
For any non-trivial use, the Root.Name term is critical; most Metatab programs require it to be set. It can be set directly, but it is much more useful to allow metapack to set it, by aggregating other terms. The other terms that metapack will combine to create a name are:
Dataset. The base name of the dataset.
Origin. A part of a domain name (like ‘usgs.gov’ or ‘census.gov’) for the source of the data.
Version. An integer version number.
Space. The name of the region that the data covers.
Time. A year, year range, or other time interval for the temporal coverage of the data.
Grain. The name of what each row is about, such as a ‘school’, a ‘county’ or a ‘person’.
The Space, Time and Grain terms are usually used only to distinguish this package from other packages. If there is only one package for a particular Dataset value, these three terms are rarely used.
Setting the Dataset term triggers rebuilding the Name term; if Dataset is not set, metapack will not update the Name term. You can run metapack -u to force regenerating the name.
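As a rough illustration of the aggregation described above, the sketch below joins the non-empty name terms with hyphens. The function compose_name, the order of the terms, and the hyphen separator are all assumptions made for this example; the rules metapack actually applies may differ.

```python
# Hypothetical sketch: the term order and the '-' separator are assumptions,
# not metapack's documented behavior.
NAME_TERMS = ["Origin", "Dataset", "Time", "Space", "Grain", "Version"]


def compose_name(terms):
    """Join the non-empty name terms into a single package name."""
    if not terms.get("Dataset"):
        # Mirrors the rule above: without Dataset, the name is not rebuilt.
        return None
    parts = [str(terms[t]) for t in NAME_TERMS if terms.get(t)]
    return "-".join(parts)


example = {"Origin": "census.gov", "Dataset": "population", "Version": 1}
name = compose_name(example)
print(name)
```

Space, Time and Grain are simply skipped when they are unset, which matches their role as optional disambiguators.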
Adding Properties to Sections
Root.Section terms introduce Sections, which both group terms and set the headings for term properties. In the Section row, all of the values in the 3rd and later columns set the property name for child property terms. For instance, the default Schema section is:
A        B       C         D        E
Section  Schema  DataType  AltName  Description
The B column is the section name, and the C, D, and E columns cause the parser to interpret values in those columns as child values of the term on the row, with a term name given by the header in the Section line. So, for a row that starts with a Table.Column term, the value in the C column is the value for a Column.DataType property.
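The header mapping can be pictured with a short sketch. This is not the Metatab parser; child_properties and the sample row values are hypothetical, and the rows are represented as plain lists of cells (columns A through E).

```python
# Hypothetical sketch of the header mapping; this is not the Metatab parser.
section_row = ["Section", "Schema", "DataType", "AltName", "Description"]
term_row = ["Table.Column", "Date & Time", "text", "date_time", "When the reading was taken"]


def child_properties(section_row, term_row):
    """Pair each value in column C and later with its Section header."""
    # 'Table.Column' -> 'Column': child properties hang off the term itself.
    term = term_row[0].split(".")[-1]
    headers = section_row[2:]
    values = term_row[2:]
    # Empty cells set no property, so they are skipped.
    return {f"{term}.{h}": v for h, v in zip(headers, values) if v != ""}


props = child_properties(section_row, term_row)
print(props)
```

Re-ordering the headers in the Section row changes only which column feeds which property; the property names themselves stay the same.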
You can re-order these header values, and can create new ones, but in some cases, the metapack program will expect some properties to exist. For instance, every Table.Column term must have a Column.DataType term.
Groups and Tags
When creating entries in a data repository like CKAN or Data.World, the metakan and metaworld programs may categorize the dataset entry with groups and tags. Metatab treats these term values as simple strings, so refer to the data repository documentation for specifics about how groups and tags are used.
For tags, set a value for the Root.Tag term, and for groups, use the Root.Group term.
Schemas
Schemas are the Root.Table terms in the Schema section of the Metatab document, along with their Table.Column children. The value of the Root.Table term is the name of the schema, and this value can be referenced from the Root.DataSet entries in the Resources section either by using it as the Dataset.Name for the entry, or by setting it as the Dataset.Schema. Using Dataset.Name is the default case, but this method of linking allows only one resource per schema. If multiple resources should share the same schema, link them with the Dataset.Schema property.
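The two ways of linking can be summarized in a small sketch. It uses plain dictionaries rather than the Metatab API, and schema_name and the resource names are hypothetical.

```python
# Hypothetical sketch using plain dicts; not the Metatab API.
def schema_name(resource):
    """Return the schema a resource links to: Schema if set, else Name."""
    return resource.get("Schema") or resource["Name"]


resources = [
    {"Name": "income"},                           # links by name (the default)
    {"Name": "income_2015", "Schema": "income"},  # shares the 'income' schema
    {"Name": "income_2016", "Schema": "income"},
]
linked = [schema_name(r) for r in resources]
print(linked)
```

Only the first resource can rely on its name, since names must differ between resources; the other two need an explicit Schema value to reach the same table.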
Column Names
The value of a Table.Column term is the primary name of a column, most often the column header from the original resource.
The Column.AltName term sets an alternate name for the column, which will be used whenever the resource is copied into a new package. The alternate name is set when the primary name is not a well-formed column name. For instance, if the header value from the original resource is ‘Date & Time’, the Table.Column value will be ‘Date & Time’, but Column.AltName will also be set and will be ‘date_time’.
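One way to derive a well-formed name from a header is sketched below. The function alt_name and its rule (collapse runs of non-alphanumeric characters to an underscore, then lowercase) are assumptions chosen to reproduce the ‘Date & Time’ example; the rule Metatab itself applies may differ.

```python
import re


def alt_name(header):
    """Collapse runs of non-alphanumeric characters into '_' and lowercase.

    Hypothetical rule, chosen only to match the example in the text.
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", header).strip("_").lower()


result = alt_name("Date & Time")
print(result)
```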
When a resource is copied, such as when building a package with metatab or metasync, the data file will have the header value from Column.AltName when it exists and from Table.Column when it doesn’t. The header values will be moved into the new package’s schema as the Table.Column values. Because all of the Column.AltName values will have been “made official” when packaging, the AltName column is removed from the schema after packaging.
Because the header can come from either Column.AltName or Table.Column values, you only need to set Column.AltName when the Table.Column value is an ill-formed header.
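The header selection rule reduces to a simple fallback, sketched here with dictionary-based columns. The function packaged_header and the "Name" key, which stands in for the Table.Column value, are hypothetical and not part of the Metatab API.

```python
# Hypothetical sketch with dict-based columns; not the Metatab API.
def packaged_header(column):
    """Use AltName when it is set, otherwise the primary column name."""
    return column.get("AltName") or column["Name"]


columns = [
    {"Name": "Date & Time", "AltName": "date_time"},
    {"Name": "county"},  # already well formed, so no AltName is needed
]
headers = [packaged_header(c) for c in columns]
print(headers)
```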
DataTypes
Every Table.Column term must have a Column.DataType to be useful. The values for these terms are free-form, but most processing programs will expect them to be one of:
integer
number
text
These are the same values as are used in Tabular Data Packages. The value of number is a general real or floating point number.
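A processing program might use these values to convert raw strings from a data file, as in the sketch below. The CASTERS mapping and cast_value are hypothetical, and falling back to str for unrecognized values is an assumption that follows from the values being free-form.

```python
# Hypothetical mapping from the three common DataType values to Python types.
CASTERS = {"integer": int, "number": float, "text": str}


def cast_value(value, datatype):
    """Convert a raw string with the caster for its DataType.

    Unknown datatypes fall back to str, since the values are free-form.
    """
    return CASTERS.get(datatype, str)(value)


row = [cast_value("42", "integer"), cast_value("3.5", "number"), cast_value("Ada", "text")]
print(row)
```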
Testing Packages
When you are working on a package where the metadata.csv file is stored on GitHub or a similar version control system, you are working on a “source” Metatab file, since the Metatab file will directly reference data files. To test that the file is what you want, you should occasionally build a filesystem package from this file, using metatab -F -f. The -F option will force the new package to be built, although if you want to be completely sure, you can delete the _packages directory in the current directory.
The first tests should be done by building the package, then inspecting the data files to see that they have the columns that you expect. Then open the index.html file to ensure that all of the documentation you want has been generated.
When the package looks correct from direct inspection, you can open it in Jupyter Notebook to check the documentation.
Start Jupyter Notebook in the current directory, with the source metadata.csv file. Then enter this in a cell:
import metatab
doc = metatab.open_package('./metadata.csv')
doc
You should get a pretty HTML version of the package documentation. Alternatively, you can dump the docs for the package and the data dictionaries for all of the resources with:
import metatab
from IPython.display import display_html
doc = metatab.open_package('./metadata.csv')
display_html(doc)
for r in doc.resources():
    display_html(r)
The previous code displays the documentation generated from the source Metatab document. You may also want to view the documentation generated from the filesystem package you built with metapack -F -f. In that case, open the package document with:
doc = metatab.open_package('./_packages/<package_name>/')
The result should be the same documentation, but with different URLs.