Workbench won't match with known species

While doing mass imports, we have just discovered something very odd with one particular collection (Vascular plants): the Workbench refuses to match the species names to existing ones in the tree. As a result, it creates multiple duplicate species that we would subsequently need to clean up again. It has no such issues with the genus or higher ranks.

For other collections (Entomology), where we use import files that are structured and generated in the exact same way, we have no such issues.

Is there any way to explain this strange behaviour? Could there be some obscure setting in the collection or discipline that might cause this?

Using version 7.9.6.2

NHMD_Herba_20241113_14_55_JMJ_processed_imported.tsv (248.1 KB)

Hi @fedoras,

Thanks for sharing the spreadsheet! Could you also share the export mapping? We want to check if there are any additional fields mapped at a lower level (such as author, attribute, citation, etc.) that might lead Specify to treat them as unique.

Thank you!

Hi @Grant

Here are some of the upload plans used:

NHMD_Herba_export_plan.json (8.8 KB)

NHMD_Herba_20241113_14_55_JMJ_processed_imported.json (6.7 KB)

NHMD_Herba_20250207_15_03_CB_processed_imported.json (7.0 KB)

Thanks in advance

Hi @fedoras,

Thanks for the files! We’re investigating this now and we’ll get back to you once we identify the issue.

Hi @fedoras!

Thank you for your patience while this Issue was being looked into!
I do believe I have found the cause of the duplicate Species records.

The Problem

Ultimately, the problem lies in the seemingly innocuous mapping of Taxon -> Rank -> Is Hybrid and in how Specify handles default field values when searching for existing Tree records.

Essentially, when isHybrid is explicitly mapped with the default matching behavior, the search for an existing record will always include the isHybrid field and its value from the row. If the value of isHybrid for the row is blank (i.e., NULL), Specify will include an isHybrid=NULL condition in the search for existing records.

The problem with this is that isHybrid is marked as not nullable at the database level and is given a default value of False by the datamodel. In other words, if you don’t specify an isHybrid value for a Tree record, Specify will always default to isHybrid=False.

So for each Species in your Data Set, when there is no data in the isHybrid cell, Specify searches for an existing Species with isHybrid=NULL. That search can never find a match, because every existing record has isHybrid set to False or True, so a new Species is created with isHybrid=False. Thus, each Species is being duplicated.
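To make the mismatch concrete, here is a small simulation using Python's built-in sqlite3 module. It is only a sketch of the behaviour described above, not Specify's actual matching code: the `taxon` table, the `find_or_create_species` helper, and the species name are all hypothetical and chosen purely for illustration.

```python
import sqlite3


def find_or_create_species(conn, name, is_hybrid):
    """Hypothetical match-or-create step: returns (record id, was_created)."""
    if is_hybrid is None:
        # A blank cell is searched for as NULL, as described in the post.
        row = conn.execute(
            "SELECT id FROM taxon WHERE name = ? AND isHybrid IS NULL ORDER BY id",
            (name,),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT id FROM taxon WHERE name = ? AND isHybrid = ? ORDER BY id",
            (name, int(is_hybrid)),
        ).fetchone()
    if row is not None:
        return row[0], False
    if is_hybrid is None:
        # No value supplied, so the column default (False) is stored.
        cur = conn.execute("INSERT INTO taxon (name) VALUES (?)", (name,))
    else:
        cur = conn.execute(
            "INSERT INTO taxon (name, isHybrid) VALUES (?, ?)",
            (name, int(is_hybrid)),
        )
    return cur.lastrowid, True


conn = sqlite3.connect(":memory:")
# Assumed schema: isHybrid is NOT NULL with a default of False (0).
conn.execute(
    "CREATE TABLE taxon ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "isHybrid INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO taxon (name) VALUES ('Bellis perennis')")

# Blank isHybrid cell: the NULL search finds nothing, so a duplicate is created.
blank_result = find_or_create_species(conn, "Bellis perennis", None)
count_after_blank = conn.execute(
    "SELECT COUNT(*) FROM taxon WHERE name = 'Bellis perennis'"
).fetchone()[0]

# Explicit False: the search matches the original record and nothing is created.
explicit_result = find_or_create_species(conn, "Bellis perennis", False)
```

In this sketch the blank lookup creates a second row for the same name, while the lookup with an explicit False reuses the original record.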

Below is a video which recreates and (mostly minimally) demonstrates the problem:

To track the problem, I have opened up a GitHub Issue which is slightly more technical and comprehensive in its explanation of the problem and proposes some proper solutions:

Feel free to provide any of your own feedback either here or on the GitHub Issue!

Solutions

Thankfully, there are some fairly easy workarounds until a proper solution is implemented.
You can do any of the following to resolve the problem:

  • Or provide a value in the cell of the isHybrid column for each Species you are uploading
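If you prefer to fill in the blank cells before importing rather than by hand, something like the following could work. This is a rough sketch under assumptions: the column header `isHybrid`, the fill value `False`, and the `fill_blank_hybrid` helper are hypothetical, so adjust them to the header in your own file and to whatever boolean text your Workbench accepts.

```python
import csv
import io


def fill_blank_hybrid(tsv_text, column="isHybrid", default="False"):
    """Return the TSV text with blank cells in `column` replaced by `default`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found in header")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Treat empty or whitespace-only cells as blank.
        if not (row.get(column) or "").strip():
            row[column] = default
        writer.writerow(row)
    return out.getvalue()


# Hypothetical two-row sample; a real import file has many more columns.
sample = "Genus\tSpecies\tisHybrid\nBellis\tperennis\t\nMentha\tpiperita\tTrue\n"
filled = fill_blank_hybrid(sample)
```

Rows that already carry a value are left untouched; only empty cells receive the default.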

Marvellous! Thank you for your assistance @jason_m !

That’s some very impressive sleuthing @jason_m!
