Darwin Core Archive Publishing in Specify 7
Specify 7 is capable of exporting collections data in the Darwin Core Archive format. This capability expands on the existing Specify 6 functionality by supporting extensions to the core format.
Publishing of DwCAs in Specify 7 is managed via several App Resource records in XML format. There are three relevant types of resource: the Basic DwCA definition resource, the Dataset metadata resource, and the Export feed resource.
User Documentation
App resources
App resources can be accessed by admin users through the Resources
option in the User Tools dialog that is opened by clicking on the
user name, or by navigating to http://SITE/specify/appresources/
on a Specify 7
site.
The App Resources are organized hierarchically starting with Global
and Discipline level resources and descending through Collection,
User Types, and User levels. The various levels may be expanded or
collapsed by clicking on the headings. At each level, individual
resources can be opened by clicking on the name of the resource, and
new resources can be added by clicking on New Resource and entering
a resource name.
Basic DwCA definition resource
The fundamental resource for producing a DwCA consists of an XML
definition that is modeled on the DwCA Metafile (meta.xml
) as
described in
the
Darwin Core Text Guide,
hereafter DCTG.
New resources for DwCA definitions should be created under a particular
collection level within the App Resource hierarchy corresponding to
the collection whose data they will be used to export.
The basic structure is a single <archive>
element containing exactly
one <core>
stanza and optional <extension>
stanzas, each of which must
indicate the rowType
that specifies the class of data represented by
that stanza. This corresponds exactly to the rowType
that will
appear in the generated meta.xml
as described in the DCTG.
Each <core>
and <extension>
element may contain multiple free standing
<field>
elements with term
and value
attributes. The term
is the
URI for the term represented by the field, and value
is a value for
that term which is the same for all rows in the data set represented
by the stanza. Aside from these constant values, the data for each row
in the data set will be produced from one or more queries defined in
the <queries>
stanza within each <core>
and <extension>
stanza.
Example Specify 7 DwCA definition
<?xml version="1.0" encoding="utf-8"?>
<archive>
<!-- The core of this DwCA will consist of Occurrence records. -->
<core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
<!-- Free standing constant fields -->
<field term="http://rs.tdwg.org/dwc/terms/institutionCode"
value="KU"/>
<field term="http://rs.tdwg.org/dwc/terms/institutionID"
value="http://grbio.org/cool/iakn-125z"/>
<field term="http://purl.org/dc/terms/rights"
value="http://creativecommons.org/licenses/by/4.0/deed.en_US"/>
<field term="http://purl.org/dc/terms/accessRights"
value="http://biodiversity.ku.edu/research/university-kansas-biodiversity-institute-data-publication-and-use-norms"/>
<field term="http://rs.tdwg.org/dwc/terms/collectionCode"
value="KUI"/>
<field term="http://rs.tdwg.org/dwc/terms/basisOfRecord"
value="PreservedSpecimen"/>
<field term="http://rs.tdwg.org/dwc/terms/datasetName"
value="University of Kansas Biodiversity Institute Fish Voucher Collection"/>
<!-- The remaining field are produced by the following query: -->
<queries>
<query contextTableId="1" name="occurrence.txt">
<id term="http://rs.tdwg.org/dwc/terms/occurrenceID" isNot="false" isRelFld="false" oper="11" stringId="1.collectionobject.guid" value=""/>
<field term="http://purl.org/dc/terms/accessRights" isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.termsOfUse" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/basisOfRecord" isNot="false" isRelFld="false" oper="11" stringId="1,23.collection.collectionType" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/catalogNumber" isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.catalogNumber" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/class" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Class" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/collectionCode" isNot="false" isRelFld="false" oper="11" stringId="1,23.collection.code" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/continent" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.Continent" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/country" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.Country" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/county" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.County" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/datasetName" isNot="false" isRelFld="false" oper="11" stringId="1,23.collection.description" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/decimalLatitude" isNot="false" isRelFld="false" oper="1" stringId="1,10,2.locality.latitude1" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/decimalLongitude" isNot="false" isRelFld="false" oper="1" stringId="1,10,2.locality.longitude1" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/eventDate" isNot="false" isRelFld="false" oper="1" stringId="1,10.collectingevent.startDate" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/family" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Family" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/fieldNumber" isNot="false" isRelFld="false" oper="11" stringId="1.collectionobject.fieldNumber" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/genus" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Genus" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/geodeticDatum" isNot="false" isRelFld="false" oper="11" stringId="1,10,2.locality.datum" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/georeferencedDate" isNot="false" isRelFld="false" oper="1" stringId="1,10,2,123-geoCoordDetails.geocoorddetail.geoRefDetDate" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/higherGeography" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.fullName" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Subspecies" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/institutionCode" isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.code" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/institutionID" isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.altName" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/kingdom" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Kingdom" value=""/>
<field term="http://purl.org/dc/terms/license" isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.copyright" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/locality" isNot="false" isRelFld="false" oper="11" stringId="1,10,2.locality.localityName" value=""/>
<field term="http://purl.org/dc/terms/modified" isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.timestampModified" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/order" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Order" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/phylum" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Phylum" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/preparations" isNot="false" isRelFld="true" oper="11" stringId="1,63-preparations.preparation.preparations" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/recordedBy" isNot="false" isRelFld="true" oper="11" stringId="1,10,30-collectors.collector.collectors" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/scientificName" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.fullName" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/specificEpithet" isNot="false" isRelFld="false" oper="11" stringId="1,9-determinations,4-preferredTaxon.taxon.Species" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/stateProvince" isNot="false" isRelFld="false" oper="11" stringId="1,10,2,3.geography.State" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/typeStatus" isNot="false" isRelFld="false" oper="1" stringId="1,9-determinations.determination.typeStatusName" value=""/>
<field isNot="false" isRelFld="false" oper="13" stringId="1,9-determinations.determination.isCurrent" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/sex" isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.text2" value=""/>
<field term="http://rs.tdwg.org/dwc/terms/lifeStage" isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.text3" value=""/>
</query>
</queries>
</core>
<!-- The DwCA will include a Multimedia extension. -->
<extension rowType="http://rs.tdwg.org/ac/terms/Multimedia">
<!-- Constant fields. -->
<field term="http://purl.org/dc/terms/rights" value="http://creativecommons.org/licenses/by/4.0/deed.en_US"/>
<field term="http://purl.org/dc/terms/accessRights" value="http://biodiversity.ku.edu/research/university-kansas-biodiversity-institute-data-publication-and-use-norms"/>
<!-- In this case there are two queries to produces the rows. -->
<queries>
<!-- Rows from the collection object attachments records. -->
<query contextTableId="111" name="collectionobjectattachment.csv">
<id isNot="false" isRelFld="false" oper="1" stringId="111,1.collectionobject.guid" value=""/>
<field term="http://rs.tdwg.org/ac/terms/accessURI" isNot="false" isRelFld="true" oper="1" stringId="111,41.attachment.attachment" value="" formatName="AttachmentTest"/>
<field term="http://purl.org/dc/terms/identifier" isNot="false" isRelFld="false" oper="1" stringId="111,41.attachment.guid" value=""/>
<field term="http://purl.org/dc/terms/format" isNot="false" isRelFld="false" oper="1" stringId="111,41.attachment.mimeType" value=""/>
</query>
<!-- Rows from the research work attachment records. -->
<query contextTableId="1" name="researchworkattachment.csv">
<id isNot="false" isRelFld="false" oper="1" stringId="1.collectionobject.guid" value=""/>
<field term="http://rs.tdwg.org/ac/terms/accessURI" isNot="false" isRelFld="true" oper="1" stringId="1,29-collectionobjectcitations,69,143-referenceworkattachments,41.attachment.attachment" value="" formatName="AttachmentTest"/>
<field term="http://purl.org/dc/terms/identifier" isNot="true" isRelFld="false" oper="12" stringId="1,29-collectionobjectcitations,69,143-referenceworkattachments,41.attachment.guid" value=""/>
<field term="http://purl.org/dc/terms/format" isNot="false" isRelFld="false" oper="1" stringId="1,29-collectionobjectcitations,69,143-referenceworkattachments,41.attachment.mimeType" value=""/>
</query>
</queries>
</extension>
</archive>
Query stanzas
The query stanzas within the DwCA definition are meant to be generated
with the assistance of the Specify 7 query builder. After a Specify
query has been defined, tested, and saved using the query builder, it
can be used as the basis for a query stanza. There is an export
button in the edit dialog that appears when the pencil icon next to
query name is clicked. The export button will open a dialog
containing XML that can be copy-pasted into an editor for inclusion in
a DwCA definition.
After the XML export of the query has been copied, the Specify query
can be deleted, if so desired.
Example query export
<?xml version="1.0" ?>
<query contextTableId="1" name="DwCKUI">
<field isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.code" value=""/>
<field isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.altName" value=""/>
<field isNot="false" isRelFld="false" oper="11" stringId="1,23,26,96,94.institution.copyright" value=""/>
<!-- . -->
<!-- . -->
<!-- . -->
<!-- . -->
</query>
The exported XML contains enough information to generate the rows for
the DwCA file, but needs to be edited to include the term
attribute
for each field which should be a URI. Fields which do not specify a
term will not be included in the output but maybe required for
filtering within the query.
It is also necessary for one of the fields to be designated the core
id field by changing the element tag from field
to id
. This field
will become the <id>
field if included in the <core>
stanza or the
<coreId>
field if included in an <extension>
stanza.
See the DCTG for more information about the <id>
and <coreId>
fields
and term
attributes.
Finally, the name
attribute should be adjusted to an appropriate
filename for the corresponding CSV file that will be included in the
DwCA file. That is, it should be unique within a given DwCA definition
and not contain special characters.
After these modifications, a query element and its contents can be
copied into the <queries>
element of a DwCA definition resource.
Overriding the formatter or aggregator
When a query field requires formatting or aggregating rows from a
table Specify uses definitions from the DataObjFormatters
appresource at the corresponding discipline level. By default the
data export will use the same formatters and aggregators as the form
system as configured through the schema config tool. For increased
flexibility this mechanism can be overridden by adding a formatName
attribute to <field>
elements in the query stanza. The value of the
attribute should be the name of one of the formatters or aggregators
defined in DataObjFormatters for the relavent table.
For instance, the following would apply the formatter named
CreateAttachmentURL
to the rows from the attachment table:
<query contextTableId="111" name="collectionobjectattachment.csv">
<!-- . -->
<field term="http://rs.tdwg.org/ac/terms/accessURI" isNot="false"
isRelFld="true" oper="1"
stringId="111,41.attachment.attachment" value=""
formatName="CreateAttachmentURL"/>
<!-- . -->
</query>
Testing a DwCA definition resource
After a DwCA definition resource has been created, it can be tested by
instructing Specify to produce a DwCA file based on it.
The user should be logged into the collection corresponding to the
DwCA definition resource. Clicking on the user name followed by Make
DwCA in User Tools opens a dialog that accepts a DwCA definition
and a Metadata resource. Enter the name of the DwCA definition
resource in the DwCA definition field, and click start. The
Metadata resource can be left blank. Specify will begin running the
queries and generating the DwCA file. A notification with a link to
download the archive will be produced when the process completes.
Dataset metadata resource
Specify 7 can include a metadata file in generated DwCA files
describing the whole dataset
in
Ecological Metadata Language.
Like the DwCA definition resource, the metadata resource should be
created at the collection level of the resource hierachy.
When generating a DwCA
as described above, the name of
a metadata resource can be provided. The contents of the resource will
be included in the generated DwCA unchanged except for pubDate
which
will be updated to reflect the date the archive was generated. Inside
the archive the resource will be given the filename eml.xml
.
Export feed resource
One final resource, the export feed resource, allows DwCA generation to
be automated and advertised through an RSS feed. While a single
Specify 7 instance (or database) may utilize multiple DwCA definition
and metadata resources, it may only use one export feed resource that
will define all of the published exports. The export feed resource
must be created at the Global Resources level of the hierarchy and
must be named ExportFeed
.
The content of the export feed resource should be XML with a root
element <channel>
comprising one or more <item>
elements. The
channel element corresponds to the eponymous RSS element and may also
include <title>
, <description>
, and <language>
elements to be
passed through to the generated RSS unchanged. Each <item>
resource
corresponds to an individual DwCA that may be published by the system.
Export feed item
Like the <channel>
element, the <item>
element accepts several
child elements that will be included in the RSS unchanged. These are
<title>
, <description>
, <id>
, and <guid>
.
Additionally, the <item>
element requires the following attributes
that control the generation of the DwCA the item is to represent:
-
collectionId
- This is the database id of the collection whose
data the DwCA will contain. The id of a collection can be found by
logging into the collection in a browser then visiting the URL
http://SITE/context/collection/
and inspecting thecurrent
value in the
resulting JSON. -
userId
- This is the database id of the Specify user under whose
account the DwCA query should be executed. This can, in principle,
affect things like which formatters are used and should generally
be the same user as created the query. User ids can be found in the
URL shown in the browser when editing a user record. -
notifyUserId
- This attribute is optional and indicates that
notifications generated during export should go to the indicated
user rather than the user specified byuserId
. -
definition
- The name of
the DwCA definition resource to
be used to generate the DwCA. This resource must be associated with
the collection indicated bycollectionId
. -
metadata
- The name of
the dataset metadata resource to be
included in the DwCA. This resource must be associated with
the collection indicated bycollectionId
. -
days
- When the RSS feed is updated via
the command line tool, the DwCA represented
by this item will only be updated if it is more than the stated
number of days old. -
filename
- The filename to be given to the generated DwCA. This
filename will be used in URLs, so it is better if it doesn’t
contain any awkward characters. -
publish
- If this attribute is not included and set totrue
,
the corresponding DwCA will not be included in the RSS
feed. Nevertheless, it will be generated or updated when
the feed is updated, and the
file will continue to be available for download to those knowing
the URL.
Example Specify 7 export feed resource
<channel>
<title>KUBI ichthyology RSS Feed</title>
<description>RSS feed for KUBI Ichthyology Voucher and Tissue collections</description>
<item collectionId="4" userId="2" notifyUserId="2" definition="DwCA_voucher" metadata="DwCA_voucher_metadata" days="7" filename="kui-dwca.zip" publish="true">
<title>KU Fish</title>
<id>8f79c802-a58c-447f-99aa-1d6a0790825a</id>
</item>
<item collectionId="32768" userId="2" notifyUserId="2" definition="DwCA_tissue" metadata="DwCA_tissue_metadata" days="7" filename="kuit-dwca.zip" publish="true">
<title>KU Fish Tissue</title>
<id>56caf05f-1364-4f24-85f6-0c82520c2792</id>
</item>
</channel>
Export feed endpoint
When a valid ExportFeed
resource is present at the Global
Resources app resource level, the http://SITE/export/rss/
URL will become
active and return the generated RSS feed.
Updating the data publishing feed
When an RSS publishing feed has been defined
its contents can be updated in two separate ways.
-
An admin user may select Update Feed Now from the User Tools
dialog. This will immediately update all items in the feed
irrespective of anydays
attribute. Notifications will also go to
the activating user rather than the user specified by the
notifyUserId
attribute. -
The following command may be invoked on the Specify 7 server from
the Specify 7 installation directory:python manage.py update_feed [--force]
The
--force
option will cause all items in the feed to be updated
regardless of anydays
attributes. Otherwise, only DwCA files
older than the given number of days will be updated by this
command.This command is intended to facilate a work-flow utilizing cron
or other task scheduler to maintain an up-to-date data publishing
feed.Automating RSS Feed Updates in Specify 7 with Cron
[!note]
If the installation is utilizing a python virtualenv, it
must be activated prior to issuing the command.