Digital Data 2024 - Data Quality Toolkit
Catalog Numbers and Other Identifiers
Duplicate Catalog Numbers
Specify does not allow duplicate catalog numbers. This means when doing data entry it is impossible to create a duplicate.
Dates
Date Hasn’t Happened Yet
Any date field can be queried on to find dates that have not yet happened. Using the relative date function, a query can be set to look for dates that occur any number of days, months, or years in the future. A simple query showing all records that show a date greater than 0 days in the future will return all records with dates that have not happened yet.
To sort all dates in descending order, with the most recent or future dates appearing first, you could also build the following query:
Date is Suspiciously Old
To sort all dates in ascending order, with the earliest dates appearing first, you could also build the following query:
Identified Date Earlier than Collected Date
Requires two queries.
Year, Month, and Day Values Do Not Match Date
Specify combines these fields into a single field.
Geography
Coordinates are Zero
When building a query on the Locality table, you can select the latitude and longitude fields from the list of fields.
In Custom Field 1, select Decimal Latitude from the first dropdown menu, select EQUALS from the second dropdown menu, and enter 0 into the blank field.
Either individually edit erroneous records by clicking the link in the Symbiota ID column (far left), or batch edit all entries using the Batch Editing Tool.
Repeat steps 2 and 3 for the Decimal Longitude field. Alternatively, you can search for records with 0 for both latitude and longitude by adding another custom search term. To do so, click the pencil icon to the right of Custom Field 1 and adjust the fields of Custom Field 2 accordingly.
Coordinates Do Not Fall Within Named Geographic Unit
Queries can be built for a particular geography and mapped to find outliers that fall beyond the chosen political boundary.
Georeference Metadata with no Associated Georeference
A query can be built to return all localities that do not have geography coordinate detail records which capture the georeferencing work done for a particular locality.
Improperly Negated Latitudes/Longitudes
Invalid Coordinates
When importing data in bulk, Specify will validate and prevent the importation of invalid coordinate sets.
Lower Geography Values are Provided, but No Higher Geography
Specify’s trees (geography, taxonomy, chronostratigraphy, lithostratigraphy, and storage) are built with data validation in mind.
Ranks can be set to be “required”, where a node must exist at a given rank (such as Country, State, County) before a lower node can be created. For example, if “State” were required as a rank in the tree, you could not enter “Douglas County” under the United States unless there was a node named “Kansas” present at the “State” rank.
If for any reason it is not possible to enforce these requirements due to the current state of the data or for any other reason, there are a couple of ways to find if there are missing nodes at intermediate ranks.
You can use the tree viewer to see if there are children without the desired parents:
A query can also be built to return nodes in the tree that have missing intermediate ranks systematically to make sure ranks are not skipped.
Minimum and Maximum Elevation Values Mismatched
First, a query can be built in Specify to return all Localities that include the maximum and minimum elevation fields:
Using a spreadsheet software, the two columns can be compared to each other using a simple formula to identify any issues with the data:
After this formula is set up, the rows can be filtered to show only cases where this comparison is “TRUE”:
Including the GUIDs in the query
Mismatched Country and CountryCode Values
To compare all country codes to associated country in database:
To view geography (country) codes for a specific country:
To ensure country codes for a specific country all contain the correct value:
Conversely, for the same purpose:
Mismatched Geographic Terms
Ex. name of a city or county has rank of state or province
Solution: lowest rank not empty, scan results to ensure ranks are accurate
- Feature request potential: interface with logic to detect same spelling of nodes with differing ranks
Missing Geodetic Datum
https://dwc.tdwg.org/terms/#dwc:geodeticDatum
Missing Latitudes/Longitudes
Misspelled Geographic Unit Names
Data entry: qcbx for geo to select existing
Query with “like”
-
% represents zero or more characters. For example, d%t will find any word that begins with d and ends with t (dot, dust, dealt, etc.). Tes% will find any word that starts with Tes.
-
_ represents a single character. It differs from ‘%’ in that is only ever a single character. c_t will only find words that start with c, contain one of any other character, and end in t (cat, cot). sp____y will find any word that starts with sp, has 4 of any other character, and end in y (specify).
-
The wildcards can be used in conjunction. For example, to find all words that must be at least 3 characters in length: ___%
-
To find all words that start with C and must have at least two following characters, and end with any single character followed by el (such as Caramel): C__%_el
Taxonomy
Misspelled or Invalid Taxonomic Names
When importing data, users can configure matching behavior
Ignore when blank:
This means that blank values are considered “null”. This means no value in the cell indicates to the WorkBench that it can be ignored for matching purposes when looking for existing records
Ignore always:
If the “Author” value differs slightly from an existing Taxon record in your database but matches other mapped fields like “Name,” it won’t stop it from matching existing records during the WorkBench upload. This helps avoid unnecessary duplication!
When importing new taxonomy data or collection object records, it’s advisable to skip non-essential matching fields with potentially slightly different non-essential data (think Taxon authors or common names) to prevent duplication.
Never ignore:
Field value is never ignored and even if blank requires that the matching record also has a blank field
Allow null:
Specifies whether blank cells are errors or not during validation
Default value
Unknown Higher Taxonomy
Specify Network can retrieve higher taxonomic information for each record individually.
Other Issues
Incorrect Character Encodings
You can select the character encoding and delimiter when creating a new data set from an imported file.