Getting Started¶
Install¶
Install the Metapack package from PiPy with:
$ pip install metapack
For development, you’ll probably want the development package, with sub-mdules for related repos:
$ git clone --recursive https://github.com/Metatab/metapack-dev.git
$ cd metapack-dev
$ bin/init-develop.sh
Creating Packages with Metapack¶
Metapack data packages consists of metadata and data, linked together in an Excel file, Zip File, or as files in a directory. These package files are created by the mp build program, taking a source package as input. A Metapack source package is very similar to a output package: the primary difference is that a source package references datasets with URLs to remote resources. Building a package loads those resources into the load file. More generally, a source package decribes how to run a data processing pipeline, and the output package has just the outputs of these data processing steps.
So, what we’re going to do is create a directory-based source package, then build the soruce package to create an Excel File, a Zip File and another directory package.
Creating a new package¶
To create a new package, use the mp new program (new: Creating New Packages).
$ mp new -o metatab.org -d tutorial -L -E -T "Quickstart Example Package"
This command will create a directory named metatab.org-tutorial
,
which will contain a metadata.csv
file, the Metatab-formated metadata
file for the package.
The origin and dataset options are required. These options, along with time, space, grain, variant, and revision are used to build the name of the data package, which is also used in the name of the directory for the package. The origin should usually be a second level internet domain, such as ‘metatab.org’.
The mp new -E
option will generate example data, and the
mp new -L
option will create a pylib
directory that hold
some python code for generating rows.
If you need to change the name of the package later, you can edit the
identifiying terms in the metadata file. After setting the Dataset
,
Origin
, Version
, Time
or Space
and saving the file, , run
metapack -u
to update Name
:
$ cd metatab.org-tutorial
$ mp update -n
Changed Name
Name is: metatab.org-tutorial-2018-1
Otherwise, you will usually still want to edit the file to set the Title and Description terms.
Adding Data References¶
Since this is a data package, it is important to have references to data. The
package we are creating here is a filesystem package, and will usually
reference the URLs to data on the web. Later, we will generate other packages,
such as ZIP or Excel files, and the data will be downloaded and included
directly in the package. We define the paths or URLs to data files with the
Datafile
term in the Resources
section.
For the Datafile
term, you can add entries directly, but it is easier to
use the mp url program to add them. mp url program will
inspect the file for you, finding internal files in ZIP files and creating the
correct URLs for Excel files.
If you have made changes to the metadata.csv
file, save it, then run:
$ mp url -a http://public.source.civicknowledge.com/example.com/sources/test_data.zip
The test_data.zip
file is a test file with many types of tabular datafiles
within it. The mp url command will download it, open it, find all of
the metadata files int it, and add URLs to the metatab. If any of the files in
the zip file are Excel format, it will also create URLs for each of the tabs.
This file is large and may take awhile. If you need a smaller file, try: http://public.source.civicknowledge.com/example.com/sources/renter_cost.csv
Now reload the file. The Resource section should have 9 Datafile
entries,
all of them with fragments. The fragments will be URL encoded, so are a bit
hard to read. %2F is a ‘/’ and %3B is a ‘;’. The mp url program will
also add a name, and try to figure out on which row the data starts and which
lines are for headers.
Note that the unicode-latin1
and unicode-utf8
files do not have values
for HeaderLines and Startline. This is because the row intuiting process failed
to categorize the lines, because all of them are mostly strings. In these
cases, download the file and examine it. For these two files, you can enter ‘0’
for HeaderLines
and ‘1’ for StartLine
, or leave those values empty
and Metatab will use 0 and 1
If you enter the Datafile
terms manually, you should enter the URL for the
datafile, ( in the cell below “Resources” ) and the Name
value. If the URL
to the resource is a zip file or an Excel file, you can use a URL fragment to
indicate the inner filename. For Excel files, the fragment is either the name
of the tab in the file, or the number of the tab. ( The first number is 0 ). If
the resource is a zip file that holds an Excel file, the fragment can have both
the internal file name and the tab number, separated by a semicolon ‘;’ For
instance:
http://public.source.civicknowledge.com/example.com/sources/test_data.zip#simple-example.csv
http://example.com/test_data.zip#renter_cost_excel07.xlsx;B2
If you don’t specify a tab name for an Excel file, the first will be used.
There are also URL forms for Google spreadsheet, S3 files and Socrata.
To test manually added URLs, use the rowgen
program, which will download
and cache the URL resource, then try to interpret it as a CSV or Excel file.
$ rowgen http://public.source.civicknowledge.com/example.com/sources/test_data.zip#renter_cost_excel07.xlsx
------------------------ ------ ---------- ---------------- ---------------- -----------------
Renter Costs
This is a header comment
renter owner
id gvid cost_gt_30 cost_gt_30_cv cost_gt_30_pct cost_gt_30_pct_cv
1.0 0O0P01 1447.0 13.6176070904818 42.2481751824818 8.27214070699712
2.0 0O0P03 5581.0 6.23593207100335 49.280353200883 4.9333693053569
3.0 0O0P05 525.0 17.6481586482953 45.2196382428941 13.2887199930555
4.0 0O0P07 352.0 28.0619645779719 47.4393530997305 17.3833286873892
Or just download the file and look at it. In this case, for both
unicode-latin1 and unicode-utf8 you can see that the headers are on line 0
and the data starts on line 1 so enter those values into the metadata.csv
file. Setting the StartLine
and HeaderLines
values is critical for
properly generating schemas.
The URLs used in the resources, and the generators that produce row data from the data specified by the URLs are implemented in the rowgenerators module . Refer to the rowgenerators documentation for more details about the URL structure.
Adding Row Generators¶
If you’ve examined the metadata.csv
file in the example package,
you’ll have noticed that one of the Datafile
terms is not a normal url:
Section: Resources
Datafile: python:pylib#row_generator
This reference is for a function, written in Python, that will be called to
yield row data. The pylib
part of the URL is the module name, in this
case it is the module in the packages pylib
subdirectory, and
row_generator
is the function name.
See Generating Rows for more details about row generating functions and programs.
Building Packages¶
To build data packages from a source package, use the mp build program.
$ mp build # From within the soruce package.
If the current workking directory is not inside the soruce package, you can also reference it explictly, such as with our exmaple package:
$ mp build metatab.org-tutorial
Before the build starts, Metapack will ensure that all of the Datafile
terms have associated schemas, and try to autogenerate any that do not. You can
also trigger this process manually with mp update -s
. You will want
to run the schema update manually if you want to add column descriptions to the
autogenerated schema, or otherwise alter the schema.
By default, mp build will generate a Filesystem package, which is a
directory like the source package, but with all of the referenced datasets
localized to a data
directory, and with some additional generated
files. The build packages will be located inside the source package in the
_packages
directory. Building the example package will result in the
built package at _packages/metatab.org-tutorial-1
. This package
contains:
├── README.md
├── data
│ ├── random-names.csv
│ ├── random_names.csv
│ ├── renter_cost-2.csv
│ ├── renter_cost.csv
│ ├── renter_cost_excel07.csv
│ ├── renter_cost_excel97.csv
│ ├── row_generator.csv
│ ├── simple-example-altnames.csv
│ ├── simple-example.csv
│ ├── unicode-latin1.csv
│ └── unicode-utf8.csv
├── datapackage.json
├── docs
├── index.html
└── metadata.csv
The generated files include:
datapackage.json
. A Frictionless Data Package version of the metadataindex.html
. A data package overview and file list.data
. A directory holding CSV versions of all of the resources.metadata.csv
. An updates Metatab file with references to the local data sets and the date and time the package was created.
You can also generate other package formats, including CSV, Excel and Zip. The
Zip file format is the same as the Filesystem directory, but is zipped. The
Excel format has only the metadata and data files ( no index.html
or
other documentation ) but is a convenient single file. The CSV file just
references the file locations of the Filesystem package, and is primarily used
when the filesystem package is stored on the web.
To build all of the other file packages:
$ mp build -cez # -f is optional; the FS package is always built.
If you change the metadata and try to bulid again, mp buld will see
that the package already exists and will not build it. You can force it to
rebuild with the mp build -F
option, but if you’ve updated the
metadata or the data, rather than made an error, you should increment the
version number in the Root.Version term and build again.
Referencing Metatab Files¶
Now that some packages are built, it is a good time to mention how Metapack programs refer to packages. Nearly all of the programs take an optional metatabfile argument. This argument can be:
Empty. It will default to
metadata.csv
in the current directoryA path to a directory, which will be assumed to be a filesystem package with a
metadata.csv
file inside it.A path to a file, which will be guessed, by the extension, to be a ZIP, Excel or CSV package.
For instance, from the directory containing the example source package, all of the following commands will return the fully-versioned package name, “metatab.org-tutorial-1”
$ mp info metatab.org-tutorial/
$ mp info metatab.org-tutorial/metadata.csv
$ mp info metatab.org-tutorial/_packages/metatab.org-tutorial-1
$ mp info metatab.org-tutorial/_packages/metatab.org-tutorial-1.csv
$ mp info metatab.org-tutorial/_packages/metatab.org-tutorial-1.xlsx
$ mp info metatab.org-tutorial/_packages/metatab.org-tutorial-1.zip
As we will see in the next section ( and as you saw when adding URLs to the package ) a package URL can also have a fragment, which is a string that starts with ‘#’, appended to the URL. These are used to identify a resource within the package.
Examining Packages¶
There are a few programs you can use to examine packages and view their resources. The most important is mp run. The mp run command will run resources, generating the tabular data in a variety of formats. This is valuable when you are creating a new soruce package, or when you want to view the contents of a built package.
For instance, when you are working on a source package, mp run lets you see the tabuar data to test configurations. With no arguments, the program will list out the resources in the package.
$ cd metatab.org-tutorial
$ mp run
Type Name Url
-------- ----------------------- ---------------------------------------------------------------------
Resource random_names h.../random-names.csv
Resource row_generator python:pylib#row_generator
Resource random-names ...random-names.csv&encoding=ascii
Resource renter_cost ...renter_cost.csv&encoding=ascii
Resource simple-example-altnames ...simple-example-altnames.csv&encoding=ascii
Resource simple-example ...simple-example.csv&encoding=ascii
Resource unicode-latin1 ...unicode-latin1.csv&encoding=latin1
Resource unicode-utf8 ...unicode-utf8.csv&encoding=utf8
Resource renter_cost_excel07 ...renter_cost_excel07.xlsx;Sheet1&encoding=ascii
Resource renter_cost_excel97 ...renter_cost_excel97.xls;Sheet1&encoding=ascii
Resource renter_cost-2 ...renter_cost.tsv&encoding=ascii
To run one of thes resources, you add it to the URL of the package as a fragment, appending a ‘#’ and then the resorurce name. If the package is the local directory, the URL is empty, but the shell will interpret the ‘3’ as a comment, so you’ll need to escape it. So, to show the random names in the current source package:
$ mp run \#random_names
To show the same resource in one of the buld packages:
$ mp run _packages/metatab.org-tutorial-1.zip#random_names
Having the CSV dumped to the terminal isn’t very informative for large files,
so there are some options that are better suited for development. The
mp run -T
will produce a pretty table of the first 20 rows:
$ mp run -T \#random_names
┌──────────────────┬───────────────┐
│ name │ size │
├──────────────────┼───────────────┤
│ Gabriel Rowland │ 54.9378140631 │
├──────────────────┼───────────────┤
│ Jerry Gay │ 50.3511258436 │
├──────────────────┼───────────────┤
│ Tucker Good │ 48.6469162116 │
├──────────────────┼───────────────┤
│ Noah Fowlers │ 49.0099728493 │
...
This view is useful for viewing the rows, but it will truncate columns to the width of the terminal, so if you want to review all of the columns, you can “pivot” the table, transposing rows into columns.
$ mp run -T -p \#renter_cost_excel07
┌─────────────────────────┬──────────────────┬──────────────────┐
│ Column Name │ Row 1 │ Row 2 │
├─────────────────────────┼──────────────────┼──────────────────┤
│ id │ 1 │ 2 │
├─────────────────────────┼──────────────────┼──────────────────┤
│ gvid │ 0O0P01 │ 0O0P03 │
├─────────────────────────┼──────────────────┼──────────────────┤
│ renter_cost_gt_30 │ 1447 │ 5581 │
├─────────────────────────┼──────────────────┼──────────────────┤
│ renter_cost_gt_30_cv │ 13.6176070904818 │ 6.23593207100335 │
├─────────────────────────┼──────────────────┼──────────────────┤
│ owner_cost_gt_30_pct │ 42.2481751824818 │ 49.280353200883 │
├─────────────────────────┼──────────────────┼──────────────────┤
│ owner_cost_gt_30_pct_cv │ 8.27214070699712 │ 4.9333693053569 │
└─────────────────────────┴──────────────────┴──────────────────┘
This view will show as many rows ( which are now columns ) as the terminal
width can handle, so you may want to restrict the width of the columns with the
mp run -R
option.
Another useful option for analysis is the sample option mp run -S
,
which will run the resource and collect the most common values for a single
column:
$ mp run \#random_names -S name
Value Count
--------------- -------
Gabriel Rowland 1
Jerry Gay 1
Tucker Good 1
Noah Fowlers 1
Chase Mcmillan 1
Brody Grimes 1
Dylan Ferguson 1
Hashim Franco 1
Hakeem Bond 1
Fulton Jordan 1
The mp info command has some use ful options for examining packages. In particular, mp info -n
displays the name of the package, and mp info -s
displays the schema of a resource:
$ mp info -s \#random_names
Name AltName DataType Description
------ --------- ---------- -------------
Name name string
Size size number
Using a Package¶
At this point, the built packages are functionally complete, and you can check
that the packages are usable. Well work with the metatab.org-tutorial-1.zip
package in the _package
subdirectory of the source package. First, list the resources with :
$ mp info -r metatab.org-tutorial-1.zip
Type Name Url
-------- ----------------------- --------------------------------
Resource random_names data/random_names.csv
Resource row_generator data/row_generator.csv
Resource random-names data/random-names.csv
Resource renter_cost data/renter_cost.csv
Resource simple-example-altnames data/simple-example-altnames.csv
Resource simple-example data/simple-example.csv
Resource unicode-latin1 data/unicode-latin1.csv
Resource unicode-utf8 data/unicode-utf8.csv
Resource renter_cost_excel07 data/renter_cost_excel07.csv
Resource renter_cost_excel97 data/renter_cost_excel97.csv
Resource renter_cost-2 data/renter_cost-2.csv
You can dump one of the resources as a CSV by running the same command with the resource name as a fragment to the name of the metatab file:
$ mp run metatab.org-tutorial-1.zip#simple-example > /tmp/simple-example.csv
You can also read the resources from a Python program, with an easy way to convert a resource to a Pandas DataFrame.
import metapack
doc = metapack.open_package('metatab.org-tutorial-1.zip')
print(type(doc))
for r in doc.resources():
print(r.name, r.url)
r = doc.resource('renter_cost')
# Dump the row
for row in r:
print(row)
# Or, turn it into a pandas dataframe
# ( After installing pandas )
df = doc.resource('renter_cost').dataframe()
print(df.head())