Blog

The Sunny Side of Tika

by | Feb 13, 2014 | Java, Software Development | 5 comments

One of the great joys of development is discovering a tool that makes your task very simple.  Apache Tika is one of those tools, and before I even begin to talk about Tika, I have to tip my hat to its developers.  Thank you very much for making my job simpler.

So, what is Tika?

Tika detects the type of almost any file and extracts its content and metadata so your programs can consume them.  Tika is commonly used for eDiscovery, taxonomy generation, content capture, and indexing for content management systems. Basically, any time you want your application to understand all there is to know about a file or URI, look to Tika to convert it for you.

Tika was initially designed as part of the Apache Lucene project, where it was used to extract text from files for full-text index searches.  Tika has been around for several years now, but it remains a very active project thanks to the task it was designed for and how well it was written.

The great thing about Tika is that you don’t have to know the file type for Tika to parse it.  Tika determines the file type for you based on the file’s header bytes or its extension. It then uses its built-in parsers to read the file.
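As a quick sketch of the detection step on its own, the Tika facade exposes a detect() method that returns a MIME type string. The file names below are just placeholders:

```java
import org.apache.tika.Tika;

public class DetectExample {
    public static void main(String[] args) {
        Tika tika = new Tika();

        // With only a name to go on, Tika falls back to extension-based
        // detection; given real bytes it sniffs the header instead.
        System.out.println(tika.detect("report.pdf"));   // application/pdf
        System.out.println(tika.detect("notes.html"));   // text/html
    }
}
```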

How easy is Tika?

Below is the section of my code that uses Tika version 1.4.

public String parseToString(File givenFile) throws IOException, TikaException {
	Tika myTika = new Tika();
	return myTika.parseToString(givenFile);
}

Yep, that is parsed content in two simple lines of code.  Now, mind you, I was only interested in getting the content of the file as text, but that simplicity is what I am grateful for.  Tika, in all of its facets, is simple to use while remaining very versatile. (Note: when testing this code, make certain the file actually has readable text.  A TIFF or an MP3 file generally does not contain text to be parsed, so you won’t see anything.)
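If you want to try this yourself, the facade and its parsers ship in the tika-parsers artifact. These Maven coordinates match the mvnrepository link at the end of the post:

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.4</version>
</dependency>
```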

Other ways of using Tika

Okay, so maybe you want a little more information from a file, like the extra metadata found in its header.  Completing this task takes a few extra steps, but they are well documented. Below is an example of fetching metadata from a file.  In this case, I’m generating a JSONObject of name-value pairs so the application can consume the metadata later.

public JSONObject fetchMetaData(File givenFile) throws IOException {

	JSONObject jsonMetaData = new JSONObject();
	Metadata metadata = new Metadata();
	Tika myTika = new Tika();

	// parse() returns a Reader and fills in the metadata as the reader
	// is consumed, so drain it before reading the metadata.  Closing
	// the reader also closes the underlying stream.
	try (Reader reader = myTika.parse(new FileInputStream(givenFile), metadata)) {
		char[] buffer = new char[8192];
		while (reader.read(buffer) != -1) {
			// discard the content; we only want the metadata
		}
	}

	for (String metaKey : metadata.names()) {
		jsonMetaData.put(metaKey, metadata.get(metaKey));
	}

	return jsonMetaData;
}

Now maybe you want to fetch both metadata and content.  The example below uses the Tika classes to fetch both at one time, so you only make one pass over the file or input stream:

public String getEverything(File givenFile) throws IOException, SAXException, TikaException {

	JSONObject jsonMetaData = new JSONObject();

	// passing -1 removes BodyContentHandler's default 100,000 character limit
	ContentHandler handler = new BodyContentHandler(-1);
	Metadata metadata = new Metadata();
	Parser parser = new AutoDetectParser();
	ParseContext context = new ParseContext();

	try (InputStream instream = new FileInputStream(givenFile)) {
		parser.parse(instream, handler, metadata, context);
	}

	for (String metaKey : metadata.names()) {
		jsonMetaData.put(metaKey, metadata.get(metaKey));
	}

	// metadata as JSON, then the plain-text body
	return jsonMetaData.toJSONString() + "\n\n" + handler.toString();
}


More Tika Documentation and Examples

https://tika.apache.org/index.html

https://tika.apache.org/1.4/gettingstarted.html

https://www.openlogic.com/wazi/bid/314389/Content-mining-with-Apache-Tika

https://mvnrepository.com/artifact/org.apache.tika/tika-parsers/1.4


Comments

  1. Scott Roth

    Very interesting, Lee. I’d love to hear the use case around what you are doing. Tika might come in very handy in the future. Thanks for bringing it to my attention.

    Reply
    • Lee Grayson

      Hey Scott, in my case I was looking through Microsoft Office and HTML files for references to a particular URL being replaced. The documents were in Documentum, however Full Text indexing was not being used. So, I created an application that looped through a DQL result set, downloaded each file, and parsed each file with TIKA to find the documents with the URL I was looking for. I am also looking at using Tika as a parser within Armedia’s Ligero application much like it is being used in Lucene. That way Ligero can have a way to sort content as it comes through a watched folder, or even complete metadata based on the file’s header instead of relying upon an external data source.

      Reply
  2. Colin Stephenson

    Scott,

    One of the use cases I bugged Lee to help me with is to run a Tika app against a folder of documents and determine if they are

    1. Good
    2. Password protected
    3. Corrupted

    We are using this to assist with exporting content during a migration.

    Reply
