A Language Translation Service for Documentum

Providing content in users’ native languages is becoming the expected norm. It is not uncommon for banks, credit card companies, insurers, government agencies, and manufacturers to provide licensing agreements, rules and regulations, warranties (e.g., John Deere), and forms (e.g., IRS) in multiple languages – both online and in print.

These documents are usually available at an organization’s brick-and-mortar offices or in its online library, where users can download them. Producing and managing documents in multiple languages can be expensive, tedious, time-consuming, and prone to staleness.

For example: if the “source” document changes, how do those changes ripple through to the rest of the translations? If one of the translations needs to be “tweaked,” how do you keep its version in sync with the source document? With a content management system like Documentum and an integration with a language translation service like Lingotek, the process of producing and managing multi-language content can be simplified.

Lingotek is a translation services company whose translation management system lends itself well to integration with content management systems like Documentum. Lingotek provides a completely online environment for submission, management, and exchange of documents for translation.

They offer numerous out-of-the-box workflows to accommodate many translation needs, including machine translation only, machine translation plus editor, and translation with multiple reviewers. Once a document has been submitted to Lingotek for translation, your translators can access and process the document in Lingotek’s cloud-based Translation Management System (TMS). When translations are completed, they are flagged as such and returned to the originator, in their original format and style.

The following video demonstrates a solution for creating and maintaining multiple translations of content in a Documentum repository. The solution uses the inherent content management capabilities of the Documentum Content Server to manage content, versions, and relationships among documents and leverages the Content Server’s infrastructure (specifically, Service-based Objects, asynchronous jobs, and external database tables) to integrate with Lingotek for the production of translations.
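
As a rough illustration of the Service-based Object approach, here is a minimal sketch of what such an SBO interface could look like in DFC. The interface and method names (ITranslationService, submitForTranslation) are hypothetical illustrations, not the actual solution’s code or the Lingotek API:

import com.documentum.fc.client.IDfService;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.common.DfException;

// Hypothetical SBO interface for a translation integration like the one described.
// An implementation would extend com.documentum.fc.client.DfService and handle
// the actual exchange with the translation provider.
public interface ITranslationService extends IDfService {

    // Submit a repository object's content for translation into the target locale.
    void submitForTranslation(IDfSession session, String objectId, String targetLocale)
            throws DfException;

    // Poll the provider and import any completed translations back into the repository.
    void importCompletedTranslations(IDfSession session) throws DfException;
}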

Lingotek offers a comprehensive RESTful API that integrates easily with Documentum to provide a seamless solution for the production and management of multilingual content.

See what you think.

Is this an integration that could benefit your company, department, or organization? Leave us a comment; we’d be happy to discuss it with you.

Documentum Storage Decision Points – SAN versus NAS

When it comes to in-house IT services, every company has different practices.  Unless our company’s core product is cutting-edge hardware or software, we tend to stick to the basics.  Businesses usually follow the pack, and their system architects color within the lines.

For several years there has been a running discussion about SAN versus NAS storage.  What is the best solution?  The simple non-answer is “It depends.”  We need to consider several things while designing the storage architecture to support a solution, or while working with the customer’s storage engineer to design it.

The decision points are common among all software solutions.  However, the software products themselves may move you toward a particular hardware and storage solution.  Documentum is no different than Alfresco, and it is no different than FileNet and others regarding the questions to ask.

There may be different storage requirements for products within a vendor’s suite, and the solution “stack” may dictate the storage solution for all the products deployed in a system.

Regarding Documentum, is there a “best solution”?  We will address concerns, provide questions to ask, and present a couple of recommendations.

As a friend of mine once said, “Indecision is the key to flexibility,” but eventually you have to order hardware.

NAS versus SAN

We will not go deeply into SAN and NAS storage definitions.  However, here is a very simplified view of NAS and SAN configurations:

[Figure: basic NAS storage configuration]

[Figure: basic SAN storage configuration]

NAS provides both storage and a file server.  The NAS unit has its own operating system.  The application host, in our case Documentum, communicates with the NAS over the network using file-based protocols such as NFS and CIFS to read and write the content located on its file system.  To the host operating system, NAS appears as a file server that provides drive and share mapping.  An example is EMC Celerra.

SAN storage is also connected to the network, through a SAN switch with connections to the client hosts.  Blocks of storage appear the same as local storage disks; the SAN appears as a disk to the operating system, and management utilities such as Veritas make it accessible.  SAN protocols include Fibre Channel, iSCSI, and ATA over Ethernet (AoE).  Examples are EMC Symmetrix and CLARiiON.

Storage protocols can significantly affect price and performance.  For example, an iSCSI SAN may cost several times as much as an AoE solution.  So why pick iSCSI?  Well, iSCSI is much faster and generally more reliable than AoE.  Fibre Channel is faster still; however, iSCSI uses less expensive Ethernet switches and cables, whereas Fibre Channel requires more specialized components.  What do you need to meet your requirements – operating distance, support staff skills, available budget?

In years past, the arguments about SAN versus NAS were more dichotomous than today.  The cost, features, and performance are different, but with hybrid configurations we can get efficiency and performance.  Differences still exist if you choose one versus the other.

Generalizations are dangerous because there are exceptions; protocols change rapidly, blur the lines, and turn today’s best recommendation into tomorrow’s dog.  So, anticipating comments to the contrary, here ya go: consider SAN to be faster, and consider NAS lower cost and easier to maintain.

What about CAS (Content Addressable Storage)?  CAS is used mainly for archiving content, especially large amounts of content. The storage unit contains a CPU.  The storage address for each content file is computed based on an algorithm using the actual characters in the file.   CAS offers security features and supports retention policies.  We’ll discuss CAS at another time.  However, there are NAS and even SAN solutions that can be used in concert with Documentum products for archiving and retention policy compliance.

What Works Well in Documentum

Let’s think about what Documentum is and does, what your business requirements are, and why you might choose SAN over NAS and vice versa.

The rule of thumb is that file I/O is the critical path for any single processing thread.  We still need to consider all kinds of performance and latency issues – for example, remote users accessing centralized repositories, network bandwidth, file transfer protocols, peer-to-peer protocol layers, application design, and host resources.  But reading and writing data to and from disk, with the necessary transfer of packets across the wires, is the single most time-critical process.  Any single component out of whack will degrade performance; if all else is well, then file I/O performance will impact speed directly.

Content Server

What does Documentum do?  For a moment, let’s forget the EMC IIG product suites and stacks.  The basic, over-simplified answer is that Documentum manages content files in various formats and collects and stores data about each one.  That means it must, first, transmit data over the wires to and from a relational database and, second, transmit files over the wires to and from storage.

Generally speaking, Documentum can use either SAN or NAS storage for content files.  NAS may be the best choice if you require sharing by multiple Content Servers.

Full Text Index

One of the features of Documentum is full text indexing.  Documentum specifically recommended against NAS with their old Verity and FAST full text indexing integrations.  The reason was that the Content Server is already handling the basic file and data communications, and NAS put additional load on the entire process, negatively impacting throughput as content was copied and the index created.  Even now, full text indexing runs on a separate, dedicated host.  With both FAST and xPlore, you can search on metadata in the full text index as well as for specific text.  The underlying data is XML.

Even with great improvements in NAS performance, we recommend using SAN storage with full text indexing.

xCP

Documentum xCP is a suite of products and a platform for designing and building process-driven applications.  There is considerable file activity between the product components and the database.  We would recommend SAN.

Database

Database performance drives Documentum performance.  Query design and tuning in custom applications, as well as managing indexes, query plans, and statistics with the out-of-the-box product, are mandatory for good performance.  SAN versus NAS is an important conversation here, too; for what it is worth, one of our largest clients uses SAN with Oracle.

Oracle, EMC, and others continue improving storage designs.  For example, Oracle now recommends its Direct NFS (dNFS) client with release 11g.  Oracle has integrated an NFS client into the database itself, so it accesses NFS storage directly rather than through the host operating system.

Decision Criteria

What are some questions to ask, and things to consider?

Cost

The objective is to get maximum performance at minimum total cost.  Here are a few things to consider besides unit cost:

  • Existing contracts with the storage vendor
  • Licensing fees
  • Internal versus vendor support
  • Discounts on storage and software bundles (Hmm. EMC has storage solutions as well as Documentum.)
  • External versus internal cloud storage solutions
  • Tiered storage based on retention policy – lower-cost, slower storage for archiving?
  • Do you want to add CAS to the mix, or apply the storage you know to your Documentum archiving solution?

Performance

Do you really need sub-second response times from your content management system?   Performance is in the eye of the beholder, usually the end user.  Consider design recommendations unrelated to hardware:

  • Separate your CMS from content presentation.
  • Separate, physically or by process, your content authoring from your publishing channels.
  • Pre-publish content that is to be consumed.
  • Archive “old” content to reduce query and response times.
  • Use Documentum custom types with different default storage locations.
  • Use distributed stores to bring content physically closer to the consumer.

Summary

“It depends” is the operative phrase when deciding what kind of storage you want to purchase for your Documentum system, or any other application.  Different Documentum products have unique storage considerations.   When designing your system, consider costs other than the direct storage price and build efficiencies into the architecture from the ground up.

SAN versus NAS is still a valuable discussion to have in spite of rapid improvements in technology.  They continue to converge.  Hybrid systems offer performance and cost savings.  Be careful of Documentum product requirements, but also use Documentum features to take advantage of storage technology and savings.

Query Results Truncated in Documentum

Recently, some colleagues and I were discussing whether the Content Server truncated result sets for large queries. They insisted that it did, and that the largest result set Documentum would return was 1000 rows, or 350 rows from any single source (the default values for dfc.search.max_results and dfc.search.max_results_per_source in the dfc.properties file). “Ridiculous!” I exclaimed. I had run queries that returned thousands of rows and could prove it. So, I set out on this little research project.

To prove my point, I decided to run a query that returned a known result set from a variety of clients, while changing the settings of dfc.search.max_results and dfc.search.max_results_per_source. To set these properties, I added the following lines to the dfc.properties file on both the Content Server and the DA web application server. I set these properties artificially low to make the results obvious.

dfc.search.max_results = 100
dfc.search.max_results_per_source = 10

The query I ran was select r_object_id from dm_folder. In my repository, this query returned 743 rows (from iDQL, which I used as my baseline). I also ran this query from the RepoInt utility, the DA DQL Editor, and the DA Advanced Search page. If there were any truth to the claims of my colleagues, I should see a result set no larger than 100 rows when the properties were in effect. See the table below for the results.

Client           No Config Changes   Content Server Only   App Server Only
iDQL32           743                 743                   743
RepoInt          743                 743                   743
DA DQL Editor    743                 743                   743
DA Adv. Search   350                 350                   10

Interestingly, the Advanced Search did truncate the result set, but not as I expected. It truncated the result set to 350 when these properties were not explicitly set, leading me to believe there was some sort of default in play. It also truncated the result set to 10, not 100, when the properties were set. What’s going on here?

After reading up a bit on the dfc.search.max_results and dfc.search.max_results_per_source properties, I concluded that these configuration settings only affect ECIS/FS2 searches and not “regular” client searches (i.e., iDQL, RepoInt, DQL Editor, etc.). However, since Webtop (and DA) are configured to use ECIS/FS2 when they are installed, the Advanced Search does respect the dfc.search.max_results and dfc.search.max_results_per_source properties when they are set. Here’s how it works:

The dfc.search.max_results property dictates how large the final result set can be. The default value is 1,000. In my testing, this should have capped results at 100 rows. However, this property caps the entire result set and is further constrained by the dfc.search.max_results_per_source property.

The dfc.search.max_results_per_source property dictates the maximum number of results that can be returned from a single source. The default value is 350. Since my testing only involved one repository, the maximum number of results returned was 10. If I had searched across two repositories, the final result set would have contained 20 rows (max). Following this logic, if I had searched across 20 repositories, the result would have been 100 (the maximum size allowed by the dfc.search.max_results property), not 200 as expected.
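
In other words, the effective cap works out to the smaller of the two limits. Here is a trivial Java sketch of that arithmetic (the class and method names are mine, just for illustration):

class SearchCaps {
    // Each source contributes at most maxResultsPerSource rows, and the merged
    // result set is then capped at maxResults.
    static int effectiveCap(int sources, int maxResults, int maxResultsPerSource) {
        return Math.min(maxResults, sources * maxResultsPerSource);
    }
}

// With the test settings above (max_results = 100, max_results_per_source = 10):
//   effectiveCap(1, 100, 10)  -> 10
//   effectiveCap(2, 100, 10)  -> 20
//   effectiveCap(20, 100, 10) -> 100  (capped by dfc.search.max_results)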

My advice: if you are only searching one repository, set the dfc.search.max_results and dfc.search.max_results_per_source properties equal to each other to ensure your Advanced Searches return maximum result sets. What the actual values of these properties should be to produce maximum performance and efficiency is up to you to determine.
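
For example, for a single-repository environment keeping the default overall cap, the dfc.properties entries might look like this:

dfc.search.max_results = 1000
dfc.search.max_results_per_source = 1000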

So, my colleagues and I were both right, we just needed to specify how we were running our queries.

This blog was originally posted at msroth.wordpress.com on July 5, 2010.

Finding an Object’s Content File in Documentum

You probably know that Documentum (in its default state) stores content on the file system and retains a pointer to the content in its database. Likely, you have navigated the file store on the Content Server and discovered directories like ../data/docbase/content_storage_01/00000123/80/00/23/. How in the world does this directory structure relate back to a particular object?

Documentum uses several objects to hold persistence information about content; we will use five of them to determine where the content for an object with r_object_id = '0900000180023d07' resides: dmr_content, dm_format, dm_filestore, dm_docbase_config, and dm_location. The following query will get us all the information we need to assemble the path to the object’s content.

select data_ticket, dos_extension, file_system_path, r_docbase_id
from dmr_content c, dm_format f, dm_filestore fs, dm_location l, dm_docbase_config dc
where any c.parent_id = '0900000180023d07'
and f.r_object_id = c.format
and fs.r_object_id = c.storage_id
and l.object_name = fs.root

Result:

  • data_ticket = -2147474649
  • dos_extension = txt
  • file_system_path = C:/Documentum/data/docbase/content_storage_01
  • r_docbase_id = 123

The trick to determining the path to the content is in decoding the data_ticket's two’s complement decimal value. Convert the data_ticket to a two’s complement hexadecimal number by first adding 2^32 to the number and then converting it to hex. You can use a scientific calculator to do this or grab some Java code off the net.

  • -2147474649 + 2^32 = (-2147474649 + 4294967296) = 2147492647
  • converting 2147492647 to hex = 80002327

Now, split the hex value of the data_ticket at every two characters, append it to file_system_path and the docbase_id (padded to 8 digits), and add the dos_extension. Voilà! You have the complete path to the content file.

C:/Documentum/data/docbase/content_storage_01/00000123/80/00/23/27.txt
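
If you would rather script the conversion than reach for a calculator, here is a minimal Java sketch of the steps above, using the values from this example (the class name ContentPath is mine; per this post, the docbase id is simply zero-padded to 8 digits):

public class ContentPath {
    public static void main(String[] args) {
        // Values returned by the DQL query above.
        int dataTicket = -2147474649;
        String fileSystemPath = "C:/Documentum/data/docbase/content_storage_01";
        int docbaseId = 123;
        String dosExtension = "txt";

        // Interpret the signed data_ticket as an unsigned 32-bit value
        // (the "add 2^32" step) and render it as 8 hex digits.
        String hex = String.format("%08x", dataTicket & 0xFFFFFFFFL); // "80002327"

        // Pad the docbase id to 8 digits, then split the hex value into pairs:
        // the first three pairs are directories, the last pair is the file name.
        StringBuilder path = new StringBuilder(fileSystemPath)
                .append('/').append(String.format("%08d", docbaseId));
        for (int i = 0; i < 6; i += 2) {
            path.append('/').append(hex, i, i + 2);
        }
        path.append('/').append(hex, 6, 8).append('.').append(dosExtension);

        // Prints: C:/Documentum/data/docbase/content_storage_01/00000123/80/00/23/27.txt
        System.out.println(path);
    }
}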

I think this is a really clever way to manage the creation and assignment of directories and filenames, don’t you? In addition, this scheme guarantees that there are never more than 256 files in a single directory, which keeps directory lookups efficient.

You can do it in reverse also. Say you have a file with this path: /80/00/20/23.txt. What is its r_object_id?

  • converting 80002023 to decimal = 2147491875
  • subtract 2^32: 2147491875 – 4294967296 = -2147475421
  • select r_object_id, object_name from dm_sysobject s, dmr_content c where any c.parent_id = s.r_object_id and c.data_ticket = -2147475421.0

Note: You must append “.0” to the data_ticket value to force DQL to process the variable as a floating point number, otherwise you get an integer overflow error.
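
Again, a calculator works fine, but for completeness here is the same two’s complement step in Java for the reverse example (a sketch using this post’s values; the class name is mine):

public class ReverseTicket {
    public static void main(String[] args) {
        // Rebuild the data_ticket from the path components "80/00/20/23".
        long unsigned = Long.parseLong("80002023", 16); // 2147491875
        int dataTicket = (int) unsigned;                // wraps to -2147475421 (i.e., minus 2^32)
        System.out.println(dataTicket);
    }
}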

Of course, you can always use the GET_FILE administrative method to find an object’s content file path. Just remember that the content ID it asks for is the r_object_id of the dmr_content object.

This blog was originally posted at msroth.wordpress.com on Sept 9, 2011.

Part IV – Opening a File in a Documentum Repository through the Adobe FrameMaker Integration

Now that the Adobe FrameMaker application has been successfully connected to the Documentum Content Server repository, it is time to open a project file.  This can be accomplished in several ways: browsing the repository through the tree view pane, clicking the “File” and then “Open” menu items, or clicking the “Open…” link on the Adobe FrameMaker welcome screen.  The latter two methods produce the same results.

Opening a File through FrameMaker Documentum Integration
