Give Feedback: Record Merging in Specify 7

Agent Merging Webinar

Introducing Record Merging

We are developing a new Record Merging feature that will be arriving in the coming months for all Specify 7 users! We are seeking feedback from the community for development!

With the introduction of the Record Merging feature, Specify will make it easier to manage records efficiently.

For the initial release, we plan to support only the Agent table. We plan to support Locality next, and eventually all tables in the database.

Record Merging allows users to combine two or more records into one. This feature is especially useful when dealing with multiple records that represent the same person or locality.

For instance, when a user has uploaded a large amount of data that created duplicate agents of the same person, they can now merge the records into one and delete any duplicates, thus keeping their database clean and organized.

This feature is also helpful when users have to combine records from different sources that may contain similar information.

Using Record Merging, users can quickly combine several records into a single record. The merged record contains all desired data from the original records, so users can keep the important information from each one, and they can also modify and add information in the newly merged record.
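To make the idea concrete, here is a minimal Python sketch of what a merge does conceptually: combine field values into one record, then repoint references from the duplicates to the record that is kept. This is an illustration only, not Specify's implementation; the `Record` class, `merge_records`, `repoint_references`, and the `agent_id` key are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a database row (not a Specify class)."""
    id: int
    fields: dict


def merge_records(target, duplicates, overrides=None):
    """Start from the target's values, fill empty fields from the
    duplicates, then apply any manual edits made during the merge."""
    merged = dict(target.fields)
    for dup in duplicates:
        for name, value in dup.fields.items():
            if merged.get(name) in (None, ""):
                merged[name] = value
    merged.update(overrides or {})
    return merged


def repoint_references(references, duplicate_ids, target_id):
    """Return references with every link to a duplicate redirected to the
    target. 'agent_id' is an assumed key name for illustration."""
    return [
        {**ref, "agent_id": target_id} if ref["agent_id"] in duplicate_ids else ref
        for ref in references
    ]
```

In this sketch the user's manual edits (`overrides`) always win, which mirrors the ability to modify data during the merge; how Specify actually resolves values may differ.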

The Record Merging feature can be accessed from the query builder within Specify. After selecting the records to be merged, users can click the Merge Records button, which opens the “Merge Records” dialog.

Examples

A simple scenario: two agents that are intended to be one person are referenced by a variety of different records. Currently there is no easy way to point all these references to the correct agent, but with the new Merge Records interface it is as simple as selecting the duplicates and merging them in the following dialog:

A more complicated example where many fields in the Agent table differ from each other:

The Basics

In the following video, you can see the basic functionality of the new merging interface:

  • You can click on the < buttons to the left of each field’s value to place it in the merged record column.

  • You can preview the records you are about to merge as well as preview how the newly merged record will be displayed.

  • All of the items in the leftmost column can be edited as if you were using the forms in the database. These edits override the merged data, so the user merging the records can modify the data during the process.

  • All references to each record are displayed in the “References to this record” row in the merging interface, allowing the user to click on any of the items and display that record.

  • The table's formatted display of each record is shown at the top of its column, previewing how the record will appear wherever records are shown as formatted.

  • You can uncheck the “Show conflicting fields only” checkbox to display all fields in that table.

  • The Merge Records dialog can, in theory, handle any number of selected records at once.

Once you are satisfied, you can click Merge to commit the changes.
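The “Show conflicting fields only” behaviour listed above can be pictured with a short Python sketch: a field counts as conflicting when the selected records do not all share the same value for it. This is an assumption about the logic for illustration, not Specify's code; `conflicting_fields` is a hypothetical helper and records are modelled as plain dictionaries.

```python
def conflicting_fields(records):
    """Return field names whose values differ between the given records.

    Each record is a dict of field name -> value. A field missing from a
    record is treated as None for the comparison.
    """
    # Collect field names in first-seen order so the output is stable.
    names = []
    for record in records:
        for name in record:
            if name not in names:
                names.append(name)

    first = records[0]
    return [
        name
        for name in names
        if any(record.get(name) != first.get(name) for record in records[1:])
    ]
```

With two agents that share a last name but differ in first name and GUID, the sketch would report only the first name and the GUID as conflicting.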

Feedback

We would love to hear feedback from the Specify Collections Consortium about our agent merging tool. Please share your thoughts, questions, concerns, or requests with us as we continue to develop this incredible tool.

A number of thoughts here in no particular order:

  1. It would be nice to incorporate some fuzzy logic that would detect possible duplicates for merging rather than relying on the user to “find” these. After that, some kind of batch merging could be implemented that would step users through the various duplicates and allow them to do this in batch mode one after another.
  2. In the first screenshot there are three duplicates detected but only two previews offered in the bottom part of the window. I would expect to see three and then have a preview merged column that would be the result of the merged record created from those three options. Not sure why this is not the case?
  3. What would happen if a large number of possible duplicate records were selected? Would it scroll off the screen?
  4. Would it be possible to select related information like addresses associated with the agent to merge?
  5. The “References to this record” section is not very intuitive. I think it might be better to list the base table that the agent is being related through - these would most commonly be Collection Object (as cataloger), Preparation (as Prepared by), Collecting Event (as collector), Interactions (as loaner, gifter, borrower, preparer, etc.) and others that I am sure I am not recalling right now but you get the point. Just having agent listed doesn’t really help me.
  6. The GUID is always going to be different so probably of no use to show it.

That’s it for now but I am sure I will come across more when I start playing with it. Great job though. This will be immensely useful - especially when it is available for other tables.
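As a rough illustration of the fuzzy duplicate detection suggested in point 1, the sketch below scores every pair of agent names with Python's standard-library `difflib.SequenceMatcher` and keeps the pairs above a similarity threshold. It is not how Specify or any existing tool does this; the function name `candidate_duplicates`, the `{id: name}` input shape, and the 0.85 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_duplicates(agents, threshold=0.85):
    """Return (id_a, id_b, score) for agent name pairs that look alike.

    `agents` maps an agent id to a display name. Comparing every pair is
    quadratic, so this only suits small tables or pre-filtered groups.
    """
    pairs = []
    for (id_a, name_a), (id_b, name_b) in combinations(agents.items(), 2):
        score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    # Most similar pairs first, so a reviewer can step through them in order.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

A list like this could feed the batch workflow described above, with each suggested pair opened in the merge dialog for a human decision rather than merged automatically.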

oooooooooooooooooo, can’t wait to try it out! I have 3-4 known agent ‘merges’ just waiting to happen!

This will be such a useful feature - we have a lot of agents that we need to deduplicate following migration. It would be great to see it available for Collection Objects and Preparations in the future too.

A few thoughts:

Would access to this be managed through the user role/permission system?

How are the fields displayed for comparison/selection determined? There are certain fields that we would want to include to help ensure effective disambiguation of agents. e.g. birth/death dates, fields from agentIdentifier, group member etc.

Is there a way to determine which record to ‘keep’ and merge other data into?

I’d +1 the suggestion of some fuzzy logic to help identify the records to be merged.
Note, we actually already have an implementation of this, developed as a custom app for the RBGE Specify, which does both record merging and fuzzy matching (across all agent fields) to suggest candidate merges. Let us know if you’d be interested in talking through our experiences of this. I’m sure we could set up a call with the developer if useful?
Thanks

Hi @Ben,

That would be great! It would be incredible to talk through the development process and implementation. Can we set up a call with the developer sometime next week?

Thank you!

  1. Yes, there would be a separate permission for that (in addition to the existing permissions to edit and delete from the Agent table and the permission to run a query)
  2. By default it only shows the fields where the values don’t match. But, there is a checkbox “Show conflicting fields only” that you can uncheck to display all fields that have values. You can also click on the “Preview Merged” or “Preview 1” buttons to preview records in a regular form.
  3. Not visible on the screenshot (we added it a bit later), but there is a button to set a record as a base.

Looking forward to hearing the details in tomorrow’s webinar.

Are there any thoughts on developing a tool for massive merging? For agents, of course? One often finds one’s database loaded with agents that are obvious duplicates, by the thousands. Resolving agent replication or name variants one at a time is extremely welcome when in doubt: having the entire record info to match variants or replicates with confidence is most useful.

But it is the volume of replicates that swamps most collections.

Any sense of when this feature will be available?

Hi @gholman,

We are in the home stretch and one of our back-end developers has been hard at work creating unit tests (#3474) to ensure that the feature works exactly as intended. We are planning on releasing this feature in the next month, but we are preparing to perform comprehensive testing that may change this target.

I’ve been watching and reading about Record Merging in Specify 7 with interest. It’s fascinating to me that Agents are the first to be rolled into these features. The effort, UI and logic are commendable.

I have a particular slant with respect to these activities that leans to the outcome of post-merge actions in Specify and then how Agent strings are shared with the wider community via dwc:recordedBy or dwc:identifiedBy. I believe HeatherC has raised comparable observations such as the ordering of Agent strings: Collector order in groups and multiple collectors. As all may know, there is a remarkable push to support community curation and linking to shared concepts using URIs/GUIDs via concepts like digital extended specimens. For those that embark on round-tripping of community-led data curation or enhancements, this puts pressure on collection management systems. What’s emerging from these externally-facing activities is a need to accurately store and share elements of provenance such as near-verbatim, in-context Agent strings on terms like dwc:recordedBy or dwc:identifiedBy as opposed to computed values accomplished through local disambiguation. We all stand to benefit in various ways when terms are populated with stable, representative, and evidence-based values (e.g. what is plainly visible on labels) that other institutions are likely to also share. I have more thoughts on this here: https://github.com/tdwg/dwc/issues/450. As a brief summary, there is value in retaining in-context representations of Agent text strings, not the least of which is to afford a local, post-mortem view of evidence to illustrate why a merge was executed. At the risk of being axiomatic, people have many names and many people have the same name. However, there is in fact a benefit to embracing some of this chaos. It may be more inclusive and sensitive to cultural differences to do so.

While I do believe this merge of Agents is a critical feature, I wish to raise caution. What, for example, happens to the suppressed, non-canonical URIs/GUIDs for suppressed Agents post-merge? Are these records internally deleted with links rebuilt or are they retained in the back-end with a redirect? Is undo too intractable a feature to implement if indeed Agents are deleted post-merge? See `agent-merging`: "Undo" button for agent merge · Issue #2906 · specify/specify7 · GitHub.

If there’s one recommendation I can make it’s to consider a verbatim field for collectors and determiners to be used in context in a non-relational way and for these text string values to be used in dwc:recordedBy and dwc:identifiedBy despite the fact that the definition of these terms would lead us to populate them with computed values. We have dwc:recordedByID and dwc:identifiedByID for the purposes of communicating shared identity.