What, No Full Text Search Already?
My project ArkCase is a Spring application that integrates with Alfresco (and other ECM platforms) via CMIS, the Content Management Interoperability Services standard. ArkCase stores metadata in a database, and content files in the ECM platform. Our customers so far have not needed integrated full text search; plain old database queries have sufficed. We know full text search will have to be addressed eventually, so why not now, while ArkCase has been getting some love? Plus, high quality search engines such as SOLR are free, documented in excellent books, and could provide more analytic services than just plain old search.
What do we want from SOLR Search integration?
- We want both quick search and advanced search capabilities. Quick search should be fast and search only metadata (case number, task assignee, …). Its purpose is to let users find an object quickly based on the object ID or the assignee. Advanced search should still be fast, but should also cover content files and more fields. Its purpose is to let users explore all the business objects in the application.
- Search results should be integrated with data access control. Only results the user is authorized to see should appear in the search results. This means two users with different access rights could see different results, even when searching for the same terms.
- The object types to be indexed, and the specific fields to be indexed for each object type, should be configurable at runtime. Each ArkCase installation may track different object types, and different customers may want to index different data. So at runtime the administrator should be able to enable and disable different object types, and control which fields are indexed.
- Results from ArkCase metadata and results from the content files (stored in the ECM platform) should be combined in a seamless fashion. We don’t want to extend the ECM full-text search engine to index the ArkCase metadata, and we don’t want the ArkCase metadata full text index to duplicate the ECM engine’s data (we don’t want to re-index all the content files already indexed by the ECM). So we will have two indexes: the ArkCase metadata index, and the ECM content file index. But the user should never be conscious of this; the ArkCase search user interface and search results should maintain the illusion of a single coherent full text search index.
Both Quick Search and Advanced Search
To enable both quick search and advanced search modes, I created two separate SOLR collections. The quick search collection includes only the metadata fields to be searched via the Quick Search user interface. The full collection includes all indexed metadata. Clearly these two indexes are somewhat redundant since the full collection almost certainly includes everything indexed in the quick search collection. As soon as we have a performance test environment I’ll try to measure whether maintaining the smaller quick search collection really makes sense. If the quick search collection is not materially faster than the equivalent search on the full index, then we can stop maintaining the quick search collection.
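As a rough sketch of how application code might route a search to one collection or the other, the class below picks a collection per search mode and builds a URL for SOLR's standard select handler. The collection names (arkcase_quick, arkcase_full), the base URL, and the SearchRouter class itself are assumptions made up for illustration; they are not ArkCase's actual names.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Picks the SOLR collection for a search mode and builds the select URL. */
final class SearchRouter {

    /** The two search modes; the collection names are hypothetical. */
    enum Mode {
        QUICK("arkcase_quick"),
        ADVANCED("arkcase_full");

        final String collection;

        Mode(String collection) {
            this.collection = collection;
        }
    }

    // Assumed base URL, for example "http://localhost:8983/solr".
    private final String solrBaseUrl;

    SearchRouter(String solrBaseUrl) {
        this.solrBaseUrl = solrBaseUrl;
    }

    /** Builds a select URL against the collection that matches the mode. */
    String selectUrl(Mode mode, String query, int start, int rows) {
        // URL-encode the user's query text; spaces become '+'.
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return solrBaseUrl + "/" + mode.collection + "/select?q=" + q
                + "&start=" + start + "&rows=" + rows;
    }
}
```

Because the only difference between the two modes is the collection name, retiring the quick search collection later would mean pointing both modes at the same collection.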
Integration with Data Access Control
Data access control is a touchy issue since the full text search queries must still be fast, the pagination must continue to work, and the hit counts must still be accurate. These goals are difficult to reach if application code applies data access control to the search results after they leave the search engine. So I plan to encode the access control lists into the search engine itself, so the access control becomes just another part of the search query. Search Technologies has a fine series of articles about this “early binding” architecture: https://www.searchtechnologies.com/search-engine-security.html.
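To make the "early binding" idea concrete, here is a minimal sketch that turns a user's ID and group memberships into a SOLR filter query. It assumes each indexed document carries a multi-valued field listing the principals allowed to see it; the field name allow_acl and the AclFilter class are hypothetical, and the real ACL model may well need deny rules or other refinements.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

/** Builds a SOLR filter query that restricts hits to what a user may see. */
final class AclFilter {

    // Hypothetical multi-valued field holding the principals allowed to read a document.
    static final String ALLOW_FIELD = "allow_acl";

    /** Returns a clause such as allow_acl:("jsmith" OR "investigators"). */
    static String filterQuery(String userId, Collection<String> groups) {
        List<String> principals = new ArrayList<>();
        principals.add(userId);
        principals.addAll(groups);
        return principals.stream()
                .map(AclFilter::quote)
                .collect(Collectors.joining(" OR ", ALLOW_FIELD + ":(", ")"));
    }

    /** Wraps a term in double quotes, escaping backslashes and embedded quotes. */
    static String quote(String term) {
        return "\"" + term.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

The resulting string would be sent as an fq parameter alongside the user's query. Since the filter is applied inside the search engine, the hit count and pagination already reflect only the documents the user is allowed to see.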
Configurable at Runtime
ArkCase has a basic pattern for runtime-configurable options. We encode the options into a Spring XML configuration file, which we load at runtime by monitoring a Spring load folder. This allows us to support as many search configurations as we need: one Spring full-text-search config file for each business object type. At some future time we will add an administrator control panel with a user interface for reading and writing such configuration files. Each of these Spring XML files configures one business object type to be indexed. For business objects stored in ArkCase tables, the configuration includes the JPA entity name, the entity properties to be indexed, the corresponding SOLR field names, and how often the database is polled for new records. For Activiti workflow objects, the configuration includes the Activiti object type (tasks or business processes), and the properties to be indexed.
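The settings listed above for a database-backed object could be held in a plain bean like the one sketched here, which Spring would populate from the XML file. The class name, property names, and SOLR field names are all invented for illustration; only the kinds of settings (entity name, property-to-field mapping, polling interval, enabled flag) come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-object-type index settings, as a Spring bean might hold them. */
final class IndexedObjectConfig {
    final String entityName;                       // JPA entity name
    final Map<String, String> propertyToSolrField; // entity property -> SOLR field name
    final long pollIntervalMillis;                 // how often to poll for new records
    final boolean enabled;                         // the administrator can switch a type off

    IndexedObjectConfig(String entityName, Map<String, String> propertyToSolrField,
                        long pollIntervalMillis, boolean enabled) {
        this.entityName = entityName;
        this.propertyToSolrField = new LinkedHashMap<>(propertyToSolrField);
        this.pollIntervalMillis = pollIntervalMillis;
        this.enabled = enabled;
    }

    /** Copies only the configured, non-null properties into a SOLR-style field map. */
    Map<String, Object> toSolrFields(Map<String, Object> entityProperties) {
        Map<String, Object> doc = new LinkedHashMap<>();
        if (!enabled) {
            return doc; // disabled object types contribute nothing to the index
        }
        for (Map.Entry<String, String> mapping : propertyToSolrField.entrySet()) {
            Object value = entityProperties.get(mapping.getKey());
            if (value != null) {
                doc.put(mapping.getValue(), value);
            }
        }
        return doc;
    }
}
```

Properties that are not listed in the mapping never reach the index, so changing the config file is enough to change what gets indexed.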
Seamless Integration of Database, Activiti, and ECM Data Sources
The user should not realize the indexed data is from multiple repositories.
Integrating database and Activiti data sources is easy: we just feed data from both sources into the same SOLR collection.
The ECM already indexes its content files. We don’t want to duplicate the ECM index, and we especially don’t want to dig beneath the vendor’s documented search interfaces.
So in our application code, we need to make two queries: one to the ArkCase SOLR index (which indexes the database and the Activiti data), and another to the ECM index. Then we need to merge the two result sets. As we encounter challenges with this double query and result set merging, I may write more blog articles!
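One simple way the merge step might work is sketched here, assuming each query returns an ordered list of business object IDs. Relevance scores from two different engines are generally not comparable, so this sketch interleaves the two lists round-robin instead of sorting by score, and drops duplicates when both indexes return the same object. The ResultMerger class and the ID format are assumptions; this is one possible strategy, not a settled design.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Merges hits from the ArkCase SOLR index and the ECM index into one list. */
final class ResultMerger {

    /** Alternates between the two ranked lists, keeping the first occurrence of each ID. */
    static List<String> interleave(List<String> metadataHits, List<String> contentHits) {
        Set<String> merged = new LinkedHashSet<>(); // preserves order, ignores duplicates
        int longest = Math.max(metadataHits.size(), contentHits.size());
        for (int i = 0; i < longest; i++) {
            if (i < metadataHits.size()) {
                merged.add(metadataHits.get(i));
            }
            if (i < contentHits.size()) {
                merged.add(contentHits.get(i));
            }
        }
        return new ArrayList<>(merged);
    }

    /** Returns one page of the merged list; start is assumed to be non-negative. */
    static List<String> page(List<String> merged, int start, int rows) {
        int from = Math.min(start, merged.size());
        int to = Math.min(from + rows, merged.size());
        return new ArrayList<>(merged.subList(from, to));
    }
}
```

Paging over a merged list like this means fetching enough hits from both engines to fill the requested page, which is one of the costs of keeping two indexes.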
SOLR is very easy to work with. I may use it for more than straightforward full text search. For example, the navigation panels with the lists of cases, lists of tasks, lists of complaints, and so on include only data in the SOLR quick search collection. So in theory we should be able to query SOLR to populate those lists, instead of calling JPA queries. Again, once we have a performance test environment I can tell whether SOLR queries or JPA queries are faster in general.