Don’t try this at home

Build or Buy? For domain-specific natural language processing, you should avoid building it yourself.

If you’re trying to decide whether to build or to buy your new content tagging system, we implore you to buy! (And, of course, we think you should buy one in particular…)

The build versus buy debate has been around for a long time and, although the modern consensus generally favours buying, it would be reductive not to acknowledge some important nuances.

While not even the best-funded financial institution is going to build its own version of Bloomberg, it is unlikely that a third-party vendor could deliver a highly customised algorithmic trading system as well as the in-house Quants who use it every day. (Though, even on that, certain vendors would surely disagree.)

So, what is the difference? Where should the line be drawn?

Costs and basic productivity metrics are obviously key inputs here, but decisions are not always made on that basis, especially when a bank thinks there might be a competitive advantage in doing something its own way, or where there might be considerable internal talent available to implement a solution.

First Consideration: Competitive Advantage or Disadvantage?

Turning back to our earlier example and considering a trading system based on proprietary algorithms, it is easy to accept that this might be better built in-house: it is a genuine point of difference for the bank that builds it. Perhaps the algorithms are so good that it becomes a tangible competitive advantage.

For a new version of Bloomberg, however, it is impossible to contemplate an in-house solution: the competitive landscape means that it is a disadvantage NOT to have the third-party solution, rather than an advantage to have your own. It is a system that relies on, in addition to many other things, access to vast amounts of data from a market-place much bigger than any one bank’s product and client coverage.

So how does this apply to a taxonomy-based financial tagging system?

Like a new Bloomberg, building your own would clearly put you at a competitive disadvantage.

Imagine a portfolio manager client who is excited to have signed up to receive a thematic selection of research reports from their five favourite banks. If four of the five all use the same gold standard thematic taxonomy, the client knows they can rely on receiving content about “5G” without being deluged with every single Telecom Sector and Single Stock report.

If the fifth bank has designed its own thematic taxonomy, however, it might not be able to expose “5G” as its own theme. Perhaps it has decided that “Future of Communication” or “Mobile Technology” is the more important theme and has buried 5G in one of those instead. This would mean that the same client would then have to accept either receiving no content on 5G or having that 5G content heavily diluted with other “Future of Communication” topics. In that scenario, you can imagine that the client stops relying on the fifth provider for that subject matter.

To be clear, though, the issue is not whether “5G”, “Mobile Technology”, or “Future of Communication” is the correct name or grouping for the thematic content. That is clearly subjective and likely to change over time and in different conditions. The real issue is that the publisher should always aim to provide its content to its clients according to the clients’ wishes.

The third-party taxonomy handles this easily because it is constantly updated to reflect the latest best practice across the entire market. It can also be mapped seamlessly onto any taxonomies a bank has already created, and adapted as they change over time. Furthermore, it is likely to be much larger. The bank could therefore keep its “Future of Communication” tag as a parent category for general purposes while still serving the client specific “5G” content as a child category.
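As a loose illustration of that parent/child idea, here is a minimal sketch of a hierarchical theme structure. The tag names and data model below are hypothetical, not the Limeglass ontology; the point is simply that a granular taxonomy lets a bank keep its broad in-house tag while still serving clients at the level they actually ask for.

```python
# Illustrative sketch only: a toy parent/child taxonomy, not the Limeglass data model.
from dataclasses import dataclass, field


@dataclass
class Theme:
    name: str
    children: list["Theme"] = field(default_factory=list)

    def matches(self, requested: str) -> bool:
        """True if this theme or any of its descendants carries the requested tag."""
        if self.name == requested:
            return True
        return any(child.matches(requested) for child in self.children)


# A bank's broad in-house tag, mapped onto more granular child tags (hypothetical names).
future_of_comms = Theme(
    "Future of Communication",
    children=[Theme("5G"), Theme("Satellite Broadband"), Theme("Mobile Technology")],
)

# The bank keeps its parent tag for general purposes...
print(future_of_comms.matches("Future of Communication"))  # True
# ...but can still serve a client who only wants "5G" content.
print(future_of_comms.matches("5G"))  # True
```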

In other words, ‘an ontology isn’t just for Christmas’. It is easy to underestimate the maintenance burden. Maintaining an ontology and taxonomy requires a team of domain experts to update the breadth, depth, relationships and synonyms to accommodate a constantly evolving financial language and landscape.

Some examples are: new companies, sector classifications, asset classes (e.g. Digital Assets) and technologies (DLT, NFT, automation, ChatGPT); geopolitical changes (heads of state / central banks / elections); financial risk events (‘credit crunch’, or Covid-19 with all the variants to Omicron and beyond); and emergent themes (ESG, or ‘Lower-for-longer’).

There is a belief that some of these issues can be avoided by pulling in feeds of reference data and combining them. However, that simply shifts the problem to the equally challenging tasks of interminable data cleaning, finding and filling gaps, and linking disparate datasets that all evolve at different times and in different ways.

So far, therefore, we have covered the cost advantage of buying an expert tagging solution (particularly the ongoing maintenance and development costs). We have also explained why it is a competitive disadvantage not to use the gold standard. But a final point remains:

Second Consideration: Technology

The pure technical hurdles to creating a best-in-class NLP system are often significantly underestimated, especially by talented internal teams. Overcoming these hurdles requires a level of specific expertise and singular focus which is often hard to achieve and sustain in a bank. The list of these hurdles is long but there are some worth highlighting.

Firstly, working with any sort of unstructured heterogeneous data is hard (and unglamorous), and there are very few off-the-shelf products to get you started. Limeglass built its very own PDF converter, for example, because no existing technology could be adapted into something fit for purpose. Mainstream NLP technologies struggle with this. They tend to require a homogeneous plain-text input and ignore the formatting and visual positioning of elements on a page. In contrast, the Limeglass Rich-NLP technology was written from the ground up to deal with, and even embrace, the complexities of this content. With this technology, formatting and positioning become an asset.
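To make the contrast with plain-text pipelines concrete, here is a purely hypothetical sketch (not the Limeglass data model) of a document representation that keeps layout and formatting alongside the text, so those signals can be used rather than discarded:

```python
# Hypothetical sketch: a plain-text pipeline would keep only `text` and lose the rest.
from dataclasses import dataclass


@dataclass
class DocumentSpan:
    text: str
    page: int
    bbox: tuple[float, float, float, float]  # position on the page (x0, y0, x1, y1)
    font_size: float
    is_bold: bool


def looks_like_heading(span: DocumentSpan, body_font_size: float) -> bool:
    """Layout signals (size, weight) help separate headings from body text."""
    return span.is_bold or span.font_size > 1.2 * body_font_size


title = DocumentSpan("5G: The Next Telecom Cycle", page=1,
                     bbox=(72.0, 700.0, 400.0, 724.0), font_size=18.0, is_bold=True)
print(looks_like_heading(title, body_font_size=10.0))  # True
```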

Secondly, ambiguity in human language is the scourge of all NLP. Some of the famous open-source systems make a decent attempt at resolving it, and the level of accuracy they achieve is good enough for various applications of NLP such as sentiment analysis and chatbots, especially in everyday language. But a tagging system for a product as important, specialised, and complicated as Investment Research clearly needs a high level of accuracy. Being able to tell whether the same few letters are a stock ticker or an accounting acronym requires combining a context-aware algorithm with a massive lexicon of financial terms and phrases.
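As a toy illustration of the kind of context-aware disambiguation being described (the lexicons, scoring, and example below are entirely hypothetical and far simpler than any production system):

```python
# Deliberately simplified sketch of context-aware disambiguation; not the Limeglass approach.
# The same letters can mean different things depending on the surrounding words.
TICKER_CONTEXT = {"ticker", "shares", "listed", "nasdaq", "nyse", "exchange"}
ACCOUNTING_CONTEXT = {"earnings", "margin", "ratio", "guidance", "reported"}


def classify_token(token: str, window: list[str]) -> str:
    """Classify an ambiguous token using the words that surround it."""
    context = {w.lower() for w in window}
    ticker_hits = len(context & TICKER_CONTEXT)
    accounting_hits = len(context & ACCOUNTING_CONTEXT)
    if ticker_hits > accounting_hits:
        return f"{token}: stock ticker"
    if accounting_hits > ticker_hits:
        return f"{token}: accounting acronym"
    return f"{token}: ambiguous"


# Hypothetical sentences around the same three letters, "EPS".
print(classify_token("EPS", ["consensus", "earnings", "and", "reported", "margin"]))
print(classify_token("EPS", ["the", "ticker", "EPS", "is", "listed", "on", "NASDAQ"]))
```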

Limeglass gives you an out-of-the-box solution for all this. It can provide consistent tagging of unstructured, heterogeneous documents based on a managed financial ontology (growing at 1,500 tags per month). And this can even map to, or supplement, your own taxonomy if you have already built one.

Rather than trying to build your own system as a competitive advantage, use Limeglass as the platform on top of which you can develop your true competitive edge.

So what do we say in conclusion? Don’t try this at home!

Oliver Hunt
Head of Client Solutions