Organizing the Unknown – Applying Taxonomy to Discovery through Sophia Search

We all know that taxonomy-based solutions are integral to creating business opportunities and solving business problems in our worlds of information. Yet there are some areas that taxonomy cannot model, and so it seemingly can be of no help whatsoever to the unlucky owners of certain business “problems”. Take, for instance, organizing what we don’t know. That seems straightforward and uncontroversial. Surely we cannot organize what we don’t know?

And yet … undiscovered and so-far-unknown relevance holds value for its first discoverers. There are whole areas of business problems and opportunities hidden in discovery for litigation or in ultra-early-stage science and research, to call out just two. So, logically, if topics and their unknown connections could be discovered on the emergent edge, in massive heterogeneous corpora, or in joined sets of documents, then taxonomy-based organization could be brought to bear, with all the attendant benefits we know so well.

I had lunch recently with Jeff Bierach, VP of Sales for Sophia Search, an early-stage start-up with buzz and promising solutions. Sophia Search (the “Search” part of the name is really a misnomer) does automatic clustering of documents into integral clusters that are relatively free of “noise” (bad cluster inclusion), are cleanly distinct from one another, and make relevant sense. And it seems to do this very well indeed.

Now, automatic clustering of documents into a taxonomy (or “clusteronomy”) has a long and often less-than-glorious track record in turning information chaos into easy-to-apply business opportunities. We all know the formal reasons for this: clusters are not unitary “concepts”, automatically producing a “good” label for a cluster is impossible, there is no pervasive subsumption relationship between child and parent clusters, clusters share concepts, clusters “hide” concepts … and so on. Of course, from the point of view of an automatic clustering/categorization vendor, taxonomies built by human intelligence (and blood, sweat and tears) also have their downsides. Such is life …

And yet – Sophia Search does clustering very well. First, let’s share the experimental proof, and then let’s start a conversation on the implications. Sophia Search took the New York Times corpus of content – 1.8 million items – and clustered it. They discovered/created 19,000 clusters and, from those, 418 high-level Sophia tags. They then compared these 418 high-level tags to the hand-applied tags already on the NYT documents (the NYT corpus is a tagged corpus, as are most large newspaper corpora).

There are some powerful findings. For example, 64% of the documents with the Sophia tag “Market” were already hand-tagged by the NYT with “Corporations”. That is indeed a reassuring congruence. Similarly, 92% of the documents with the Sophia tag “Music” were already hand-tagged by the NYT with their tag “Music”. That all starts to look like a very nice starting place for those business problems whose first (big) bite is a set of documents, and thus topics, with important but as-yet-unknown connections.
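For readers who want to run the same kind of sanity check on their own corpus, here is a minimal sketch of how such an overlap figure could be computed. It is not Sophia Search's method or code. The function name `tag_overlap`, the field names `cluster_tags` and `hand_tags`, and the handful of sample documents are all hypothetical and made up for illustration; they are not NYT data.

```python
def tag_overlap(docs, cluster_tag, hand_tag):
    """Share of documents carrying cluster_tag that also carry hand_tag.

    Each document is assumed to be a dict with two sets of tags:
    "cluster_tags" (from automatic clustering) and "hand_tags" (from editors).
    """
    clustered = [d for d in docs if cluster_tag in d["cluster_tags"]]
    if not clustered:
        return 0.0  # no documents carry the cluster tag, so nothing to compare
    both = sum(1 for d in clustered if hand_tag in d["hand_tags"])
    return both / len(clustered)


# Hypothetical toy corpus, for illustration only.
docs = [
    {"id": 1, "cluster_tags": {"Music"}, "hand_tags": {"Music"}},
    {"id": 2, "cluster_tags": {"Music"}, "hand_tags": {"Music", "Theater"}},
    {"id": 3, "cluster_tags": {"Music"}, "hand_tags": {"Dancing"}},
    {"id": 4, "cluster_tags": {"Market"}, "hand_tags": {"Corporations"}},
    {"id": 5, "cluster_tags": {"Market"}, "hand_tags": {"Banks and Banking"}},
]

if __name__ == "__main__":
    for cluster_tag, hand_tag in [("Music", "Music"), ("Market", "Corporations")]:
        share = tag_overlap(docs, cluster_tag, hand_tag)
        print(f"{cluster_tag} -> {hand_tag}: {share:.0%}")
```

Note that the measure is one-directional: it asks how many automatically tagged documents the human taggers agree with, not how many hand-tagged documents the clustering managed to find.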

For the well-named, well-conceptualized organization of automatic clusters into a robust taxonomy, current methods and tools will work nicely. Discovery – once it is discovered – can be mapped, and mapped very well, with all kinds of attendant business benefits. But if you are still at the stage before successful discovery, then this kind of hybrid knowledge organization model – melding automatic clustering that really “works” with best-practice taxonomy development that always works – is well worth considering. Sophia Search appears to offer the missing link in this kind of hybrid model for those who have business issues or opportunities around “The Information Edge”: discovery (of what? who knows, but you know “it” is there), unknown connections, emerging connections, etc.

