Answer: There is a self-perpetuating industry of well-paid “experts” that exists to support it. They have no desire to risk their livelihoods on something new, and they actively oppose change to the status quo.
Case #1: Metadata and Taxonomies in Enterprise Content Management (ECM)
It’s a fine concept. If you add metadata describing and classifying a document when you store it, the document will be easier to find later. The idea has been around for over 20 years. It was created solely because, without manually tagged documents, a search returned thousands of possibilities that you had to sift through by hand to find the right ones. That process was time-consuming and counterproductive. The technology worked, but the process did not: few users ever bothered to tag documents correctly.
ABC_contract.doc [metadata: legal, contract, procurement, supply chain, Fiscal 2013, Partner/Vendor NAICS Code, partner/vendor code, product/material class]
Taxonomies were the layer above metadata. On top of a process that does not work, they add the ability to navigate a hierarchy that gives formal structure to the metadata. In the example above, the NAICS code is derived from a taxonomy of industries (the NAICS taxonomy).
What does the user really want to do with their ECM system? Do they really want metadata? No.
They want to say “find me the contracts we issued for electric motors over 1/2 HP over the last two years from North American suppliers”
20 years ago it was impossible to answer that question without manually adding metadata to your source documents and using taxonomies. Today that is no longer true, but we still do it and, worse yet, most organizations plan to continue doing it for the foreseeable future.
Today’s technology does not require the user to manually add structured metadata to a document to get high-quality search results. That can be achieved through automated document enrichment and advanced search capabilities.
Using today’s technologies, let’s file that same contract without metadata and then search for it.
Document Enrichment
Static document enrichment processes documents and adds properties about the document to an index database. In the case above, the process would do the following:
– Automatic promotion of properties based on words and word patterns found
  - It will determine that it is a contract.
  - It will determine who the parties to the contract are by name (ABC Company).
  - It will determine the relationship of the parties (Supplier, Vendor).
  - It will determine the addresses of the parties by look-up or document extract (123 Main Street, Chicago, Illinois).
  - It will determine dates (Effective date: 3/21/2013; Term: 2 years; Holdback date: 7/30/2013).
  - Using an ontological domain model, it will determine the business domain (Plant Operations, Material Handling).
  - It will identify people and look up their roles (Robert Smith, CFO).
  - It will extract and classify terms or codes found in the domain (Pump).
  - It will also determine what the document is not (which is equally, if not more, important in search).
– Semantic enrichment
  - It will find semantic references and promote discrete property values based on domain rules (example: First Quarter = January 1, 2013 through March 31, 2013).
– Correlated enrichment
  - It will find product codes and product names, and add both internal and external URL references to related information (e.g. the internet specification page for a product).
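The first two kinds of enrichment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product implements it: the rule table `RULES`, the cue list `TYPE_CUES`, the helper names `enrich` and `quarter_range`, and the sample text are all hypothetical, and a real engine would use far richer patterns and look-ups.

```python
import re
from datetime import date

# Hypothetical promotion rules: property name -> pattern with one capture group.
RULES = {
    "supplier": re.compile(r"Supplier[:\s]+([A-Z][\w ]*?Company)"),
    "effective_date": re.compile(r"Effective Date[:\s]+(\d{1,2}/\d{1,2}/\d{4})", re.I),
    "term": re.compile(r"\bTerm[:\s]+(\d+\s+years?)", re.I),
    "holdback_date": re.compile(r"Holdback date[:\s]+(\d{1,2}/\d{1,2}/\d{4})", re.I),
}

# Hypothetical cue words that suggest a document type.
TYPE_CUES = {"contract": ("agreement", "hereinafter")}

# Semantic rule table: quarter name -> ((start month, day), (end month, day)).
QUARTERS = {
    "first": ((1, 1), (3, 31)),
    "second": ((4, 1), (6, 30)),
    "third": ((7, 1), (9, 30)),
    "fourth": ((10, 1), (12, 31)),
}


def quarter_range(name, year):
    """Expand a phrase such as 'First Quarter' into discrete start and end dates."""
    (start_month, start_day), (end_month, end_day) = QUARTERS[name.lower()]
    return date(year, start_month, start_day), date(year, end_month, end_day)


def enrich(text):
    """Return the properties that would be promoted to the index for this text."""
    props = {}
    lowered = text.lower()
    for doc_type, cues in TYPE_CUES.items():
        if any(cue in lowered for cue in cues):
            props["document_type"] = doc_type
    for name, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            props[name] = match.group(1).strip()
    return props


# Illustrative text built from the example values in this post.
SAMPLE = (
    "This Agreement is made with Supplier: ABC Company, "
    "123 Main Street, Chicago, Illinois. "
    "Effective Date 3/21/2013 Term: 2 years Holdback date: 7/30/2013"
)

properties = enrich(SAMPLE)
first_quarter = quarter_range("First", 2013)
```

The point of the sketch is that the user files the document untouched; every property in `properties` is derived from the text itself.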
Now, let’s search…
Search
So let’s all agree up front that with today’s search technology your document will always be found (i.e. full index/full-text search). The only question, therefore, is the quality of the search: will the document be first on the returned list, or will it be number 500,000, found four days later after working through the list?
“find me the contracts we issued for electric motors over 1/2 HP over the last two years from North American suppliers”
In the above request there are three distinct search concepts:
- Using auto-promoted managed properties: discrete terms located in the document (contract, motor, supplier). These are matched through promoted managed properties.
- Using semantic interpretation/transformation: domain semantic concepts.
  - Supplier
  - North American actually equals any address where the State/Province = …
  - Over the last two years equals an effective date from today minus 730 days through today
  - Motors: electric, industrial, material handling
- Using dynamic data enrichment.
  - Extending document data (internet/intranet search by product code/name in contract, including HP, horsepower)
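As a rough sketch of the semantic interpretation step, the phrases in the request can be turned into discrete filters over promoted properties. Everything named here is an assumption made for illustration: the record layout in `INDEX`, the function names, and especially `SAMPLE_REGION`, which is a two-entry sample supplied by the caller and not the full list of North American states and provinces.

```python
from datetime import date, timedelta


def last_two_years(today):
    """'Over the last two years' -> (today minus 730 days, today)."""
    return today - timedelta(days=730), today


def matches(doc, today, region_codes):
    """True if an indexed record satisfies the interpreted request."""
    start, end = last_two_years(today)
    return (
        doc["document_type"] == "contract"
        and doc["relationship"] == "supplier"
        and doc["state_province"] in region_codes
        and start <= doc["effective_date"] <= end
    )


# Hypothetical index records holding promoted properties.
INDEX = [
    {"name": "ABC_contract.doc", "document_type": "contract",
     "relationship": "supplier", "state_province": "IL",
     "effective_date": date(2013, 3, 21)},
    {"name": "old_contract.doc", "document_type": "contract",
     "relationship": "supplier", "state_province": "IL",
     "effective_date": date(2009, 6, 1)},
    {"name": "memo.doc", "document_type": "memo",
     "relationship": "supplier", "state_province": "IL",
     "effective_date": date(2013, 5, 2)},
]

# Illustrative sample only; a real rule would carry the complete list.
SAMPLE_REGION = {"IL", "ON"}

hits = [d["name"] for d in INDEX if matches(d, date(2013, 12, 1), SAMPLE_REGION)]
```

The user never types a date range or a list of states; the domain rules expand the plain-language phrases into those filters.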
Today, if you use enterprise search technology like Microsoft FAST, it will scan your documents, find textual patterns, and promote document properties as if they were source metadata. In the example above, the document would be “tagged” (virtually) with contract, motor, supplier, supplier name, supplier address, key dates, etc. FAST also provides for data enrichment: it can find terms such as product codes, product names, and process names, and add “best bet” references to the properties from your intranet or extranet (including structured database searches such as SAP). It is effectively correlating this document into your domain. These two capabilities are out-of-the-box functions that require little configuration and, on their own, provide extremely high-quality search results without manual metadata.
To step up search one more notch we can add to the enrichment process the ontological domain model and semantic transformation.
The ontological model provides context for words and phrases in your business domain. For example, without an ontological model the system would not know whether “windows” were pieces of glass you look through or an operating system for a computer. By adding the ontological model to the enrichment and query (e.g. Semaphore for FAST), search quality again jumps forward, from very high to extremely high.
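To make the “windows” example concrete, here is a toy sketch of context-based disambiguation. It is not how Semaphore or FAST work internally; the miniature `DOMAIN_MODEL`, its cue words, and the `disambiguate` function are assumptions invented for illustration, and the scoring is a simple count of overlapping context words.

```python
import re

# Hypothetical miniature domain model: term -> sense -> context cue words.
DOMAIN_MODEL = {
    "windows": {
        "building material": {"glass", "frame", "pane", "glazing"},
        "operating system": {"microsoft", "install", "software", "computer"},
    }
}


def disambiguate(term, text):
    """Pick the sense whose cue words overlap most with the surrounding text."""
    senses = DOMAIN_MODEL.get(term.lower())
    if not senses:
        return None
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(cues & words) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # No overlapping cue words means the context gives no evidence either way.
    return best if scores[best] > 0 else None


glass_sense = disambiguate("windows", "Replace the glass pane in the lobby windows")
software_sense = disambiguate("Windows", "Install Windows on the new computer")
```

A real ontology carries relationships and hierarchy rather than flat word lists, but the effect on search is the same: the term is indexed and queried with its sense attached.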
Last is the introduction of semantic transformation technologies. Take a look at my post Semantic Search for an overview.
For the majority of commercial applications, today’s technologies for document enhancement and search completely negate the requirement for manually entered metadata on stored documents. It’s time to move on from the horse and buggy.
I agree with your point from a document management perspective, where the main purpose is effective collaboration. However, when I think from a records management perspective, where the main purpose is legal compliance and there is little to no room for mistakes in classification of information, I feel quite hesitant about a machine-driven approach. I certainly see the value of improving findability of information for e-Discovery purposes, but when it comes to disposition of critical records by relying on machine-generated metadata, I am not so confident. I think there is still room for metadata and manual classification of information in order to keep records management a reliable practice.
What’s your view and experience on applying automated content enrichment processes for records management purposes? Do you find a machine-driven approach reliable enough to satisfy the specific needs of records management for accurate classification?
I have seen only a handful of customers with overly complex record classification schemes where automated classification would be problematic. Most classification processes are quite well defined and clear about the content in one category versus another. I expect the manual errors made in record classification would still greatly outnumber those of properly implemented automated classification, especially in organizations where record classification is done infrequently by a broad base of regular users.