Finding Similar Documents Without a Full Text Index

Is there a way to quickly find similar documents in a Documentum repository? Yes, there is. One approach is to use Lucene's MoreLikeThis API. It extracts what it believes to be the most salient words from a document and runs a full-text search for documents whose content matches those words. But what if there were a simpler, lighter-weight approach?

In my 2014 EMC Knowledge Sharing whitepaper, Finding Similar Documents Without Using A Full Text Index, I detail an approach for identifying similar documents in a Documentum repository by using a 64-bit hash value. This value, called the Similarity Index (SI), is produced by a hashing function named SimHash[1] and is attached to an object as metadata through an Aspect. The SI can then be queried to find content similar to a given document. For example, you could execute a DQL query like this to discover content that is at least 80% similar to a selected document:

select similar_obj_id from dbo.si_view where r_object_id = '[r_object_id]' and similarity >= 0.80

Here, [r_object_id] is the object ID of a known object.

Queries like this let you discover content at whatever degree of similarity you choose. In this example, the query returns every document that is at least 80% similar to the selected document. For closer matches, raise the threshold to 0.90.
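To make the idea concrete, here is a minimal Python sketch of a SimHash-style 64-bit fingerprint and a similarity score. It is not the whitepaper's implementation: the function names (tokenize, token_hash, simhash, similarity), the word-level tokenization, the use of MD5 to hash each token, and the definition of similarity as the fraction of matching bits are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # width of the fingerprint, matching the 64-bit SI described above


def tokenize(text):
    """Split text into lowercase alphanumeric words (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_hash(token):
    """Map a token to a 64-bit integer using the first 8 bytes of its MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text):
    """Compute a 64-bit SimHash-style fingerprint, weighting tokens by frequency."""
    totals = [0] * BITS
    for token, weight in Counter(tokenize(text)).items():
        h = token_hash(token)
        for bit in range(BITS):
            # Each token votes +weight where its hash bit is 1, -weight where it is 0.
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit in range(BITS):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def similarity(a, b):
    """Fraction of bit positions on which two fingerprints agree (assumed metric)."""
    distance = bin(a ^ b).count("1")  # Hamming distance
    return 1.0 - distance / BITS
```

Under this assumed definition, a threshold of 0.80 admits fingerprints that differ in at most 12 of 64 bits, and a threshold of 0.90 admits at most 6 differing bits.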

The details for implementing this solution are discussed in the whitepaper. The most interesting elements of the solution are the SimHash function itself, and the relationship between the Aspect, which stores and evaluates the SI, and a registered database view that makes searching possible.

If you are intrigued, I encourage you to download the whitepaper.

[1] Moses Charikar, "Similarity Estimation Techniques from Rounding Algorithms," STOC 2002. https://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf

Think Alfresco from Documentum perspective – Take 1

When you work in software for a while, you become numb to the refrain that “technologies have come and gone…” Occasionally, though, some become commodities and others become trendsetters. We have seen this with products such as Apache, Tomcat, Lucene, and Drupal, which have stabilized and matured over the past several years with the help of growing open source development. Wait! Did I just mention “open source” while about to talk about enterprise content management?

Documentum Performance Enhancers – They’re not just for athletes anymore

Good day, all. For all who are celebrating Memorial Day, we wish you the best & pass along our respects to our servicemen and women of past & present.

Today’s topic will cover some simple advice on how to extract more performance out of your Content Server by simply changing the way you ask it questions in your code. Documentum offers plenty of advice on how to trim & tune your Content Servers, but the simplest advice that goes the longest way is: Use DQL.
