Data Quality Toolkit

Digital Data 2024 - Data Quality Toolkit

Catalog Numbers and Other Identifiers

Duplicate Catalog Numbers

Specify does not allow duplicate catalog numbers. This means when doing data entry it is impossible to create a duplicate.

Dates

Date Hasn’t Happened Yet

Any date field can be queried on to find dates that have not yet happened. Using the relative date function, a query can be set to look for dates that occur any number of days, months, or years in the future. A simple query showing all records that show a date greater than 0 days in the future will return all records with dates that have not happened yet.

To sort all dates in descending order, with the most recent or future dates appearing first, you could also build the following query:

Date is Suspiciously Old

To sort all dates in ascending order, with the earliest dates appearing first, you could also build the following query:

Identified Date Earlier than Collected Date

Requires two queries.

Year, Month, and Day Values Do Not Match Date

Specify combines these fields into a single field.

Geography

Coordinates are Zero

When building a query on the Locality table, you can select the latitude and longitude fields from the list of fields.

In Custom Field 1, select Decimal Latitude from the first dropdown menu, select EQUALS from the second dropdown menu, and enter 0 into the blank field.

Either individually edit erroneous records by clicking the link in the Symbiota ID column (far left), or batch edit all entries using the Batch Editing Tool.

Repeat steps 2 and 3 for the Decimal Longitude field. Alternatively, you can search for records with 0 for both latitude and longitude by adding another custom search term. To do so, click the pencil icon to the right of Custom Field 1 and adjust the fields of Custom Field 2 accordingly.

Coordinates Do Not Fall Within Named Geographic Unit

Queries can be built for a particular geography and mapped to find outliers that fall beyond the chosen political boundary.

Georeference Metadata with no Associated Georeference

A query can be built to return all localities that do not have geography coordinate detail records which capture the georeferencing work done for a particular locality.

Improperly Negated Latitudes/Longitudes

Invalid Coordinates

When importing data in bulk, Specify will validate and prevent the importation of invalid coordinate sets.

Lower Geography Values are Provided, but No Higher Geography

Specify’s trees (geography, taxonomy, chronostratigraphy, lithostratigraphy, and storage) are built with data validation in mind.

Ranks can be set to be “required”, where a node must exist at a given rank (such as Country, State, County) before a lower node can be created. For example, if “State” were required as a rank in the tree, you could not enter “Douglas County” under the United States unless there was a node named “Kansas” present at the “State” rank.

If for any reason it is not possible to enforce these requirements due to the current state of the data or for any other reason, there are a couple of ways to find if there are missing nodes at intermediate ranks.

You can use the tree viewer to see if there are children without the desired parents:

A query can also be built to return nodes in the tree that have missing intermediate ranks systematically to make sure ranks are not skipped.

Minimum and Maximum Elevation Values Mismatched

First, a query can be built in Specify to return all Localities that include the maximum and minimum elevation fields:

Using a spreadsheet software, the two columns can be compared to each other using a simple formula to identify any issues with the data:

After this formula is set up, the rows can be filtered to show only cases where this comparison is “TRUE”:

Including the GUIDs in the query

Mismatched Country and CountryCode Values

To compare all country codes to associated country in database:

To view geography (country) codes for a specific country:

To ensure country codes for a specific country all contain the correct value:

Conversely, for the same purpose:

Mismatched Geographic Terms

Ex. name of a city or county has rank of state or province

Solution: lowest rank not empty, scan results to ensure ranks are accurate

  • Feature request potential: interface with logic to detect same spelling of nodes with differing ranks

Missing Geodetic Datum

https://dwc.tdwg.org/terms/#dwc:geodeticDatum

Missing Latitudes/Longitudes

Misspelled Geographic Unit Names

Data entry: qcbx for geo to select existing

Query with “like”

  • % represents zero or more characters. For example, d%t will find any word that begins with d and ends with t (dot, dust, dealt, etc.). Tes% will find any word that starts with Tes.

  • _ represents a single character. It differs from ‘%’ in that is only ever a single character. c_t will only find words that start with c, contain one of any other character, and end in t (cat, cot). sp____y will find any word that starts with sp, has 4 of any other character, and end in y (specify).

  • The wildcards can be used in conjunction. For example, to find all words that must be at least 3 characters in length: ___%

  • To find all words that start with C and must have at least two following characters, and end with any single character followed by el (such as Caramel): C__%_el

Taxonomy

Misspelled or Invalid Taxonomic Names

When importing data, users can configure matching behavior

Ignore when blank:

This means that blank values are considered “null”. This means no value in the cell indicates to the WorkBench that it can be ignored for matching purposes when looking for existing records

Ignore always:

If the “Author” value differs slightly from an existing Taxon record in your database but matches other mapped fields like “Name,” it won’t stop it from matching existing records during the WorkBench upload. This helps avoid unnecessary duplication!

When importing new taxonomy data or collection object records, it’s advisable to skip non-essential matching fields with potentially slightly different non-essential data (think Taxon authors or common names) to prevent duplication.

Never ignore:

Field value is never ignored and even if blank requires that the matching record also has a blank field

Allow null:

Specifies whether blank cells are errors or not during validation

Default value

Unknown Higher Taxonomy

Specify Network can retrieve higher taxonomic information for each record individually.

Other Issues

Incorrect Character Encodings

You can select the character encoding and delimiter when creating a new data set from an imported file.