How to structure content to make it better suited for LLMs

We often get asked whether there are prescribed ways to write and structure content to make it better suited for LLMs. In this article, we explain how content is used to feed the LLM, and why the answer is essentially "anything works".

How Markprompt ingests data

Step 1: Getting the raw content

Whether it's a public website, a GitHub repository or a set of text files uploaded via our API, Markprompt starts by converting all of it to plain text, and if possible, to Markdown. For instance:

For an HTML page on a public website, we use Turndown to convert it to Markdown.
For Markdoc pages (e.g. in a GitHub repo), we use a combination of the Markdoc AST transform and the HTML renderer, followed by Turndown.
Similarly for MDX, we use a combination of tools from Unified to transform the content into Markdown (stripping away e.g. JSX elements).
Plain text files are preserved as is.

The source code can be found on our GitHub repo.

Step 2: Chunking into sections

The next step is to break the content into smaller chunks so that they can fit into GPT-4's context window (which is limited at 8k tokens, which is approximately 2000 words).

For that, we split a page into sections delimited by headings, or if this is not enough / if there are no headings, we break the text into smaller paragraphs to get below the 8k token threshold.

Step 3: Generating embeddings

The last step is to generate so-called "embeddings" from sections. Embeddings provide a way to capture similarity between two pieces of information, and are represented as numerical vectors. If two pieces of text have similar meaning, their corresponding vectors will be "close". We use the text-embedding-ada-002 embeddings model from OpenAI.

How Markprompt generates answers

Now that the content has been indexed, we have all the info needed to generate answers to users' questions.

Finding relevant sections

Say a user asks the question

What is the process of registering a device, and how do I hook it up to the network afterwards?

The first step is to find sections in the source content that contains relevant information to produce an answer. There might be a few of them. In the same way as we did with sections, we generate an embedding for the question. Then, we find all the closest embeddings that we created in the previous step, to obtain a list of closely related sections.

By default, we extract at most 10 sections, and we cutoff all sections that are too far away in meaning (similarity threshold below 0.5). This can all be configured in the API call using the sectionsMatchCount and sectionsMatchThreshold parameters.

If the source content consists of a lot of small paragraphs, all containing relevant information, you may want to increase the sectionsMatchCount, to ensure all the sections are taken into account.

Providing instructions

Now that we have found sections with relevant information, the last step is to provide instructions via a so-called "system prompt". The default system prompt looks like this:

1You are kind AI who loves to help people!\n

This prompt has all the instructions necessary to come up with an answer. You can [customize the prompt] even further(https://markprompt.com/docs#system-prompts), for instance to adjust tone and style.

Conclusion

As long as all the relevant sections end up in the final prompt, Markprompt will be able to come up with a response. In the end, whether your content is structured in small sections or as long chunks of dense text or unstructured paragraphs, it doesn't matter. Eventually, language models will be able to include entire corpora into a single prompt, removing the necessity for chunking (Anthropic recently announced 100k context windows). Bottom line is, as long as the information is present somewhere in your content, it will be able to surface in response to a user's query.