Workbench match behaviour lacking

When using Workbench in Specify6, I recall it was possible to set the match behaviour to, say, always “pick first” as to take the first entry of a range of possible matches. In Specify7, I no longer see that option leading to a lot of extra manual labour, because now you need to manually disambiguate numerous entries.

Did I miss something?

2 Likes

Hi @fedoras,

At this time, there is no way to set matching behavior to pick the first entry in a list of matches despite its usefulness.


We have created branches of Specify 7 in the past to arbitrarily match to a returned agent, which you can see below:

This branch modifies the Workbench uploader to choose an arbitrary agent when multiple agents are matched. The modification is just hardwired in with no control flag or anything. This was for experimental purposes.

You could create a branch of Specify 7 with this change made for the desired table you want to match on if this is the behavior you need. This is experimental, so we do not recommend using this in production as it applies globally.


To properly implement your request, I have created a new issue:

Thank you for your report!

1 Like

Hi, do you still think about implementing this “pick first” possibility in the matching behaviour? In my problematic case, there are numerous of localities which are real duplicates so I wouldn’t mind if the system would pick any of them so I can proceed with the import.

Unfortunately, the workbench asks me to disambiguate every case (more than 300) which is painful. What’s you suggestion? Should I first kill every duplicate before my import? Thanks!

1 Like

Hi @alexis.beck,

We have not made any movement on this request since it was last discussed here. Since this was last discussed, we have added Record Merging support for the ‘Locality’ table. Have you tried to use this tool to do this de-duplication before uploading your WorkBench data set?

Hi Grant,

The record merging tool is very nice if you have few cases. In my case, I have hundreds, if not thousands of duplicates.

This was due to the fact that we copied/pasted a field to another (eg. text 2 to text 3) between two imports. As text 2 was hidden, the second import created already existing localities which caused the duplicates.

I understand I have to create a little script to kill the duplicates prior my import. Thanks for your guidance, I still think that would be nice to have the option proposed by Fedor as in SP6. Best

Hi again, given the list of fields considered to identify the duplicates in the Locality table, here is my query to list them all. The collection I was working on did not use any joined table such as locality details so please include that as well if it applies to your case.

SELECT GROUP_CONCAT(localityid ORDER BY localityid) AS duplicate_ids, Datum, ElevationAccuracy, ElevationMethod, GML, Lat1Text, Lat2Text, LatLongAccuracy, LatLongMethod, LatLongType, Latitude1, Latitude2, LocalityName, Long1Text, Long2Text, Longitude1, Longitude2, MaxElevation, MinElevation, NamedPlace, OriginalElevationUnit, OriginalLatLongUnit, RelationToNamedPlace, Remarks, SGRStatus, ShortName, SrcLatLongUnit, Text1, Text2, Text3, Text4, Text5, UniqueIdentifier, VerbatimElevation, VerbatimLatitude, VerbatimLongitude, Visibility, YesNo1, YesNo2, YesNo3, YesNo4, YesNo5, DisciplineID, GeographyID, PaleoContextID, VisibilitySetByID
FROM locality
GROUP BY Datum, ElevationAccuracy, ElevationMethod, GML, Lat1Text, Lat2Text, LatLongAccuracy, LatLongMethod, LatLongType, Latitude1, Latitude2, LocalityName, Long1Text, Long2Text, Longitude1, Longitude2, MaxElevation, MinElevation, NamedPlace, OriginalElevationUnit, OriginalLatLongUnit, RelationToNamedPlace, Remarks, SGRStatus, ShortName, SrcLatLongUnit, Text1, Text2, Text3, Text4, Text5, UniqueIdentifier, VerbatimElevation, VerbatimLatitude, VerbatimLongitude, Visibility, YesNo1, YesNo2, YesNo3, YesNo4, YesNo5, DisciplineID, GeographyID, PaleoContextID, VisibilitySetByID
HAVING COUNT(*) > 1;

Then, based on this group_concat (given the list of fields considered to identify the duplicates), I’ve created a procedure that (1) reassigns all related collecting events to the first locality in each group of duplicates, and (2) deletes the remaining localities that are no longer referenced.

DROP PROCEDURE IF EXISTS merge_locality_duplicates;

DELIMITER $$

CREATE PROCEDURE merge_locality_duplicates()
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE id_list TEXT;
    DECLARE first_id INT;

    -- Cursor to loop through duplicate locality groups
    DECLARE cur CURSOR FOR
        SELECT GROUP_CONCAT(localityid ORDER BY localityid) AS duplicate_ids
        FROM locality
        GROUP BY Datum, ElevationAccuracy, ElevationMethod, GML, Lat1Text, Lat2Text,
                 LatLongAccuracy, LatLongMethod, LatLongType, Latitude1, Latitude2,
                 LocalityName, Long1Text, Long2Text, Longitude1, Longitude2,
                 MaxElevation, MinElevation, NamedPlace, OriginalElevationUnit,
                 OriginalLatLongUnit, RelationToNamedPlace, Remarks, SGRStatus,
                 ShortName, SrcLatLongUnit, Text1, Text2, Text3, Text4, Text5,
                 UniqueIdentifier, VerbatimElevation, VerbatimLatitude,
                 VerbatimLongitude, Visibility, YesNo1, YesNo2, YesNo3, YesNo4,
                 YesNo5, DisciplineID, GeographyID, PaleoContextID, VisibilitySetByID
        HAVING COUNT(*) > 1;

    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

    -- Step 1: Update collectingevent.localityid to point to the first ID in each group
    OPEN cur;

    read_loop: LOOP
        FETCH cur INTO id_list;
        IF done THEN
            LEAVE read_loop;
        END IF;

        SET first_id = SUBSTRING_INDEX(id_list, ',', 1);

        -- Replace all localityids in collectingevent with the first_id of the group
        UPDATE collectingevent
        SET localityid = first_id
        WHERE FIND_IN_SET(localityid, id_list) > 0
          AND localityid != first_id;
      
        -- Step 2: Delete all duplicate locality IDs except the first one
        DELETE FROM locality
        WHERE FIND_IN_SET(localityid, id_list) > 0
        	 AND localityid != first_id;

    END LOOP;

    CLOSE cur;

END$$

DELIMITER ;

CALL merge_locality_duplicates();
1 Like