
Finding Similar Documents Without a Full Text Index

by | Jan 27, 2015 | Content Management, Solutions | 0 comments

Is there a way to quickly find similar documents in a Documentum repository? Yes. One approach is to use Lucene's MoreLikeThis API. This call to the Lucene full-text search engine extracts what it judges to be the most salient words in a document, then runs a full-text search for other documents whose content matches those words. But what if there were a simpler, lighter-weight approach?

In my 2014 EMC Knowledge Sharing whitepaper, Finding Similar Documents Without Using A Full Text Index, I detail an approach for identifying similar documents in a Documentum repository using a 64-bit hash value. This value, called the Similarity Index (SI), is produced by a hashing function named SimHash[1] and is attached to an object as metadata through an Aspect. You can then query for content whose SI is close to that of a given document. For example, a DQL query like this one finds content that is at least 80% similar to a selected document:

select similar_obj_id from dbo.si_view where r_object_id ='[r_object_id]' and similarity >= 0.80

Here, [r_object_id] is the object ID of a known object.

Queries like this can discover content at varying degrees of similarity. The example above returns every document that is at least 80% similar to the selected document. To narrow the results, raise the threshold, for example to 0.90.
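To make the idea of a Similarity Index concrete, here is a minimal Python sketch of a SimHash-style fingerprint and a similarity score. It is not the implementation from the whitepaper. The word tokenizer, the use of truncated MD5 as the per-token hash, the equal weighting of tokens, and the definition of similarity as the fraction of matching bits are all assumptions made for illustration.

```python
import hashlib
import re

BITS = 64  # width of the fingerprint, matching the 64-bit SI described above


def simhash(text: str) -> int:
    """Return a 64-bit SimHash-style fingerprint of `text` (illustrative only)."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Assumption: the first 8 bytes of MD5 serve as a 64-bit token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    fingerprint = 0
    for i in range(BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a: int, b: int) -> float:
    """Fraction of the 64 bit positions on which two fingerprints agree."""
    differing_bits = bin(a ^ b).count("1")  # Hamming distance
    return 1.0 - differing_bits / BITS
```

Under these assumptions, two identical texts score 1.0, and the score drops as more bits of the two fingerprints differ. A threshold such as 0.80 then corresponds to allowing at most 12 differing bits out of 64.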

The whitepaper covers the implementation details. The most interesting elements of the solution are the SimHash function itself and the relationship between the Aspect, which stores and evaluates the SI, and the registered database view that makes searching possible.
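As a rough mental model of what such a view exposes, the following Python sketch pairs every object with every other object and scores each pair. It is a hypothetical in-memory stand-in, not the actual view definition from the whitepaper. The function names, the object IDs in the usage note, and the matching-bit similarity formula are assumptions, and only the row layout (object ID, similar object ID, similarity) follows the query shown earlier.

```python
from itertools import permutations

BITS = 64  # width of the Similarity Index


def bit_similarity(a: int, b: int) -> float:
    """Assumed scoring rule: fraction of the 64 bits on which two SI values agree."""
    return 1.0 - bin(a ^ b).count("1") / BITS


def si_view_rows(si_by_object: dict):
    """Yield (r_object_id, similar_obj_id, similarity) for every ordered pair."""
    for (id_a, si_a), (id_b, si_b) in permutations(si_by_object.items(), 2):
        yield (id_a, id_b, bit_similarity(si_a, si_b))


def find_similar(si_by_object: dict, object_id: str, threshold: float = 0.80):
    """Mimic the query: rows for one object at or above a similarity threshold."""
    return [
        similar_id
        for source_id, similar_id, score in si_view_rows(si_by_object)
        if source_id == object_id and score >= threshold
    ]
```

For example, calling find_similar({"doc-a": 0, "doc-b": 1, "doc-c": 2**32 - 1}, "doc-a") returns only "doc-b", because "doc-b" differs from "doc-a" in one bit while "doc-c" differs in 32. Comparing every pair is quadratic in the number of objects, so this is only meant to illustrate the shape of the data, not how a real repository should compute it.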

If you are intrigued, I encourage you to download the whitepaper.

[1] Moses Charikar, "Similarity Estimation Techniques from Rounding Algorithms," https://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
