Making Search Work for Investment Research

The Challenges, Pitfalls and How to Overcome Them

Simon Gregory  |  CTO & Co-Founder

Series: Making Search Work for Investment Research
Part 1 – Time, Evolution & Chronology
Part 2 – Understanding Information Loss
Part 2a – Garbage In, Garbage Out

In this series of articles, I will be discussing some of the common challenges and pitfalls to avoid when producing an optimal search experience for your research customers.

You may want to read this if you are considering updating or investing in your research search engine…

Part 2 – Understanding Information Loss

In this second article in the series, I will be looking at search and research discovery in terms of the concept of ‘information’. Analytical reports are multi-thematic, richly formatted documents using a specialised markets language and terminology, so accurately modelling information and context is hugely important.

In the first article I discussed how financial research is a dynamic content stream, and that you need to handle time, evolution, and chronology appropriately. When highlighting the significance of considering time as a first-class citizen I referred to the parallels with Einstein’s work in four dimensions. Incidentally, Einstein also had something to say about information:
‘Intelligence is not the ability to store information, but to know where to find it’.

In researching this article, I came across another quote which nicely highlights the value of information, this time from Benjamin Disraeli:
‘As a general rule, the most successful man in life is the man who has the best information’.

It is not hard to see how these philosophies parallel our objective of a successful investment research discovery strategy. The aim is to provide market participants with access to the best information, and search is one of the main tools at our disposal for allowing them to find it.

Information is the core currency in the search process, and therefore, to get the best results, it is extremely important to understand what is happening to this information at all times.

What is Information?

There are lots of different definitions of ‘information’ but for this discussion I quite like ‘the communication or reception of knowledge or intelligence’, as it sits well with the domain of investment research.

In this article I will be discussing the importance of the flow of information through your research discovery system. Research contains valuable information, but by its nature it is also noisy. We will consider everything from the perspective that the aim is to separate the important information from the noise, and to preserve it as much as possible.

It may be surprising to hear that the effects of information loss and noise are present in various guises in nearly all areas of the system, even with the latest technologies. They are often overlooked, yet they are usually the single biggest barrier to achieving high quality results, whatever the chosen technology.

We will be considering how information degrades right from the moment it is published by the author, to being ingested, normalised, tokenised, and indexed. We will then look further into the implications for traditional search engines such as full-text search, right up to the newer technologies such as vector search, large language models (LLMs) and retrieval augmented generation (RAG). I will also detail some of the ways we have approached these problems at Limeglass, and how we may be able to help you maximise your investment research information in all areas of your research discovery systems.

These are some of the main reasons why search engines struggle to cope with investment research. We see a lot of teams committing time and investment to fine-tuning the particular levers made available by their solution, or to introducing new technologies to improve results. Ultimately though, everything is being held back by information loss and noise, so the full potential is never realised. If you think this could be you, you might want to read on.

What is the Information Flow of Investment Research?

As background to this article, let us consider the typical lifecycle of a research report.

An experienced investment research analyst will (hopefully!) have spent a lot of effort gathering, preparing and analysing information for each report. Everything will have been thought through, and the information will be written and presented in a very deliberate way.

The final report content is of high value to both its intended audience and the credibility of the publisher. As such, a report is often also carefully reviewed before dissemination. At the point of publishing, this document is usually converted into a distribution format such as HTML and/or PDF.

Finally, it is passed to the content management and distribution systems with some accompanying ‘metadata’ information. The content management system will do some further processing of the published report so it can be indexed for search, while the distribution system will use the document metadata to decide where and how the document should be passed on (such as emails or feeds to various other systems).

Part 2a – Garbage In, Garbage Out

While the functionality and capabilities of content management and distribution systems are extremely important in an effective research discovery strategy, they are not the only key pieces in this puzzle.

It is ultimately the quality and availability of the information they are working from that is the primary limiting factor in these systems. In other words, you could have the most sophisticated search engine or email subscription system, but if the key source information is missing, scrambled or corrupted then the results will only reflect this.

I hear many people using the phrase ‘garbage in, garbage out’ and it is absolutely applicable here. This is obviously not a comment on the quality of the research itself; in fact, as I have already alluded to, a research report contains highly valuable information. However, how the report is processed on its way to being published has a huge impact on how accessible this valuable information will ultimately be.

There are many factors at play here. The causes take many forms, they are present in all parts of the system, and they exist even in the emergent, cutting-edge approaches in use today. In fact, while the new Generative AI (GenAI) / LLM / vector embedding models help in some areas, they also bring their own class of problems that affect and disrupt the chain of information flow.

The Theory of Information

There is a particularly relevant branch of science called ‘information theory’, which is ‘the mathematical study of the quantification, storage, and communication of information’. I will spare you from the in-depth scientific and mathematical elements of this, but at its core is an important concept called ‘entropy’ – essentially a means of measuring and understanding information loss.
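
For readers who do want a taste of the maths, Shannon’s entropy is simply the expected ‘surprise’ of a source’s messages: the more predictable the source, the less information each message carries. A minimal, purely illustrative sketch in Python:

    import math

    def shannon_entropy(probabilities):
        """Entropy in bits of a discrete distribution: H = -sum(p * log2(p))."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    # A source where every message is equally likely carries the most information per message.
    print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits
    # A heavily skewed, more predictable source carries less information per message.
    print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.24 bits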

Understanding information preservation and information loss is an extremely important aspect for information management systems such as natural language processing and search.

A computer’s native way of storing and understanding information is using numbers. Computers have huge difficulty understanding native human language, hence all the continued research efforts to advance natural language processing (NLP) technology.

NLP is essentially about translating human language into a numerical representation that computers can work with. There are many different approaches and continued rapid advances in this subfield of artificial intelligence (AI). However, it is very important to understand that this human-computer translation will always involve some form of information loss at most parts of the process.
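
To make that translation concrete, here is a deliberately naive sketch: a toy tokeniser that turns a sentence into the integers a machine actually stores. The whitespace splitting and on-the-fly vocabulary are invented for illustration – real systems use far more sophisticated tokenisation – but the principle, and the loss, is the same.

    # A toy illustration of turning human language into numbers. Real NLP pipelines
    # use subword tokenisers and learned vocabularies, but the principle is the same:
    # text in, integer IDs out.
    sentence = "Inflation is expected to rise"

    vocabulary = {}     # word -> integer ID, built as we go
    token_ids = []
    for word in sentence.lower().split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)
        token_ids.append(vocabulary[word])

    print(token_ids)    # [0, 1, 2, 3, 4]
    # Note what is already gone: capitalisation, and anything the computer cannot
    # recover from the IDs alone (context, emphasis, layout).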

So, there is a direct conflict here: content whose information value and integrity are paramount is being put through a system that will fundamentally lose some of that information in various ways.

Think of it like a river running all the way to a reservoir. At the start of its journey the water falls from the clouds in its purest form, and along its journey to the reservoir some will be absorbed into the soil, re-evaporate back into the atmosphere or divert into another watercourse and never make it to the reservoir (i.e. data loss). Meanwhile toxins, rubbish and other elements can pollute the water on its journey (i.e. noise / data corruption). At the reservoir, not all the water has survived the journey, and depending on how well you have mitigated these factors, is what remains safe to drink, and is there enough of it?

So, is it OK to Lose Information?

It is obviously far from ideal, but information loss is a fundamental property of the process. Entropy is a scientific fact of any such system, so we have to accept that it is unavoidable. However, it is not as if we have no control over the process, so you may want to ask the question a different way – if I’m going to lose information, can I somehow optimise the process and preserve the information I would really like to keep? The answer is often yes, but you do need to look at each instance case by case.

How Information is Lost – Publishing to HTML or PDF

A prime example of information loss is in the most ubiquitous publication format, the research PDF report. Despite the valiant efforts of research publishers to migrate to more fluid and composable formats, the research PDF is still the dominant format for published research. The reason is partly historic and partly presentational.

Historically, research reports would be printed out and read on the desk or on the train home. As I often mention in these articles, research reports are unique in many ways: they are highly technical and valuable but also carry strong branding. Research is also used as publisher marketing, so being able to consistently reproduce reports and preserve the presentational layout is important. PDFs excel at this: they are designed to preserve the document as if it were printed. PDFs are also relatively ‘tamper proof’ – it is hard to seamlessly change them, and they can be signed or encrypted – so they are an attractive distribution format for compliance purposes.

If you have ever tried to copy information from a PDF, then you may already understand that they are a great example of how information can get lost or transformed into a far less accessible state. PDF is a made-for-print format: essentially it contains just the very low-level instructions to print characters, graphics and images, for faithfully reproducing the look of an article on a screen or printer. It is not a format primarily designed to maintain the information in a way that a computer can use to understand its meaning. There is no concept of words, lines or sentences; inside, it is just an obscure set of printing instructions. Technically the information is still there, as a human can easily understand the visual reproduction; however, it is many, many times less accessible to a computer than in its original format. Even a simple copy and paste will often give you a jumbled mess of text that you have to manually unscramble back into something resembling its original order.
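
If you want to see this for yourself, a few lines with an off-the-shelf extraction library are enough. The sketch below uses pdfminer.six purely as an example, and the file name is hypothetical; other PDF libraries behave similarly.

    # Minimal sketch using pdfminer.six (pip install pdfminer.six); the file name is hypothetical.
    from pdfminer.high_level import extract_text

    text = extract_text("research_report.pdf")
    print(text[:500])
    # What comes back is a flat string: no headings, no paragraphs, no tables, and the
    # reading order may not match the visual layout (multi-column pages and sidebars
    # are common culprits).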

I cannot overstate how difficult it is to reliably obtain the source articles from a PDF document without significant data loss, noise, distortion or corruption. While HTML is a much better format, in that there is a lot less information loss, it is not perfect. There are some standards for creating semantic / accessible documents, but these are not uniformly implemented and vary widely between publishers. Authoring tools give authors the freedom to style and compose documents in different ways, and while the results may look good on screen (depending on the browser), it is not always easy for a computer to reliably interpret the intent from the HTML itself.

In both formats, there is also a lot of noise to contend with, such as headers, footers, disclaimers, adverts, promotions and other side content.

Clearly, if you have the choice, using HTML over PDF documents is preferable in nearly every case. This is not always possible, and even if you have made the switch, you are likely to have a healthy PDF document archive that you would love to index.

As I mentioned, humans can still read this information, so technically the information is not actually lost… yet. What has happened, though, is that it has become many times more difficult for a computer to retrieve reliably, and a lot of noise has been introduced. Because of this, the information will actually be lost in the next step – content extraction and normalisation.

How Information is Lost – Extracting & Normalising Content

The first stages of traditional and emergent NLP approaches are to pass the documents through a series of ‘content extraction’ and ‘normalisation’ processes.

The content extraction filter is responsible for extracting the content from the source document. It almost always outputs plain text – so this step immediately discards all formatting and document structure.

One of the distinct properties of investment research is how formatting, style and complex visual layouts are used to convey information to a user. These visual ‘contextual cues’ are integral to how humans naturally process the information. Title hierarchies, inline titles, lists, bullets and other important information conveyed by the presentation and layout are lost.

In research reports this contextual information is very important. For example, a report may have subsection headings for each country, and then go on to mention elections, inflation or GDP in the paragraphs. The author and the reader understand that these terms all relate to a specific country, as they can see the titles. The plaintext processor does not understand titles, and that information was never passed on to it anyway, so it is seeing the content very differently.
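
A small sketch makes the loss visible. The HTML fragment below is invented for illustration, and the extraction uses BeautifulSoup’s get_text(), which is roughly what many pipelines do at this stage:

    # Illustrative only: a simplified report fragment and a typical plaintext extraction.
    from bs4 import BeautifulSoup

    html = """
    <h2>Japan</h2>
    <p>Inflation is expected to rise ahead of the election.</p>
    <h2>Germany</h2>
    <p>GDP growth remains subdued.</p>
    """

    plain = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    print(plain)
    # "Japan Inflation is expected to rise ahead of the election. Germany GDP growth remains subdued."
    # The headings survive as words, but nothing marks them as headings, so the link between
    # 'Japan' and the inflation sentence is no longer explicit to the machine.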

This plaintext simplification is done because it dramatically simplifies the problem for the other NLP tasks, which are nearly always designed to work with simple plaintext inputs, but it comes at the cost of losing important contextual information. As this happens right at the start, all further processes are impacted by this single step.

Without this contextual information, another informational problem arises: ‘noise’. In addition to the main investment research content, reports usually also contain elements such as marketing, disclaimers, tables, figures, headers, footers, etc. Without the visual cues, these are all much more difficult to reliably detect, isolate or extract. They are all sources of noise in the document, and when processed, they interfere with the information contained in the main article content itself.

This is a classic example of how information is lost, or noise is introduced into a system, due to transformations made to either optimise or normalise the information to make it easier to process. The general assumption is that enough information is preserved to produce an effective result. However, the mileage will vary widely depending on the nature of the content. Investment research is particularly sensitive to this issue.

At Limeglass we identified the areas where important information was normally being lost, and built our solution with great care to preserve the kinds of information that have the most importance and value, so as to get the best results.

Extracting High-Quality Content Information from PDF & HTML

There are many tools that can work with PDF documents; unfortunately, we are yet to discover anything that addresses this problem in a way that can reliably get access to all the contextual information. Most of the tools will output plaintext, but the results are often jumbled. HTML resolves many of these issues, but still contains noise and suffers a loss of context when converted to plaintext.

For many NLP processes a change to the content order doesn’t have too much of an impact on their results, as they are either indexing individual words or small chunks of content. But this also nicely highlights the limitations of these technologies, as they will never give good results in cases where structure, order and context do matter. With investment research, the content order clearly does matter.
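
As a hedged illustration of why this matters to chunk-based approaches, here is the kind of naive fixed-size chunking that many pipelines apply before indexing or embedding. The text and chunk size are invented for the example:

    # Naive fixed-size chunking, of the kind applied (in more refined forms) before
    # indexing or embedding. The text and chunk size are invented for illustration.
    def chunk(text, size=80):
        return [text[i:i + size] for i in range(0, len(text), size)]

    text = ("Japan. Inflation is expected to rise ahead of the election. "
            "Germany. GDP growth remains subdued and the outlook is uncertain.")

    for i, piece in enumerate(chunk(text)):
        print(i, repr(piece))
    # Chunk boundaries fall wherever the character count dictates, so a sentence can be
    # split mid-thought and a statement separated from the heading that gave it meaning.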

Preserving Contextual Information in a Document Graph

We spent a lot of our initial design work considering where the information is in investment research and what should ideally be preserved during processing. This analysis actually sent us in a different direction to traditional and emergent NLP technologies, and we took the bold decision to preserve much of the contextual information that is traditionally discarded.

In our ‘Rich-NLP’ world, research reports are not flat plaintext documents, but detailed graphs that model the structure and formatting of each report as well as the text. These graphs include the intricate relationships between sentences, paragraphs, titles and other constructs. It is clearly a challenge to take this novel graph-based approach, but our understanding of financial research meant that the compromises involved in taking the classic NLP approaches just would not have got us to where we wanted to be.
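
To give a flavour of what preserving this structure can mean in practice, here is a purely hypothetical sketch of a document graph. It is not Limeglass’s actual model – just an illustration of the idea that every sentence can keep an explicit link back to the headings and paragraphs that give it context:

    # A purely hypothetical sketch of a document graph; not Limeglass's actual model.
    # The idea: every sentence keeps explicit links to the structure that gives it context.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        kind: str                      # e.g. "document", "section", "paragraph", "sentence"
        text: str = ""
        children: list = field(default_factory=list)
        parent: "Node | None" = None

        def add(self, child: "Node") -> "Node":
            child.parent = self
            self.children.append(child)
            return child

        def heading_path(self) -> list:
            """Walk up the graph to recover the headings above this node."""
            path, node = [], self.parent
            while node is not None:
                if node.kind == "section":
                    path.append(node.text)
                node = node.parent
            return list(reversed(path))

    report = Node("document", "FX Weekly")
    japan = report.add(Node("section", "Japan"))
    para = japan.add(Node("paragraph"))
    sent = para.add(Node("sentence", "Inflation is expected to rise ahead of the election."))

    print(sent.heading_path())   # ['Japan'] - the context a flat plaintext pipeline throws away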

Extracting Clean Content

At Limeglass, the difficulty and importance of identifying both the correct flow and the formatting of source content from complex documents was something we were well aware of when we began our development journey.

We have invested a lot of development time, research and IP into solving the hundreds of problems involved in this process, and have built dedicated tooling to inspect PDF and HTML documents and reconstruct and model the content in the way the author intended. This is a large topic on its own and deserves its own article series.

We have come a long way on this journey. While you are never guaranteed perfect accuracy across the board in this process, we can show that it is possible to greatly improve on the simple text extraction methods that most systems use as the preprocessing step to consuming and ingesting these documents.

Our content extraction filter is the first step to ensure we can provide high quality, clean, well formatted inputs for our Rich-NLP processing engine, but it could also be the first step to improving the quality of inputs to your own technology stack.

Normalisation – What Next?

Being able to reliably obtain clean, consistently normalised content is an important starting point, as problems get compounded as the information passes further down the chain.

The next part of the normalisation step differs between approaches and models. The following chapters will go into how the different search technologies handle content, and what happens to the information in each case.

< Previous | Part 1 –  Time, Evolution & Chronology

Next > | Coming Soon

The next chapters will be made available soon – if you would like to register to be notified of updates, put ‘Search Article’ in our contact form with your details.