We often get asked whether there are prescribed ways to write and structure content to make it better suited for LLMs. In this article, we explain how content is used to feed the LLM, and why the answer is essentially "anything works".
#How Markprompt ingests data
#Step 1: Getting the raw content
Whether it's a public website, a GitHub repository or a set of text files uploaded via our API, Markprompt starts by converting all of it to plain text, and if possible, to Markdown. For instance:
- For an HTML page on a public website, we use Turndown to convert it to Markdown.
- For Markdoc pages (e.g. in a GitHub repo), we use a combination of the Markdoc AST transform and the HTML renderer, followed by Turndown.
- Similarly for MDX, we use a combination of tools from Unified to transform the content into Markdown (stripping away e.g. JSX elements).
- Plain text files are preserved as is.
The source code can be found on our GitHub repo.
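In production, Turndown and the other tools listed above do the heavy lifting. As a rough illustration of what HTML-to-Markdown conversion involves, here is a minimal, hypothetical converter for a handful of tags (it ignores nesting, escaping, and attributes, which real libraries handle):

```typescript
// Minimal sketch of HTML -> Markdown for a few common tags.
// A real converter (e.g. Turndown) handles nesting, escaping, links, etc.
function htmlToMarkdown(html: string): string {
  return html
    .replace(/<h1>(.*?)<\/h1>/g, '# $1\n')
    .replace(/<h2>(.*?)<\/h2>/g, '## $1\n')
    .replace(/<strong>(.*?)<\/strong>/g, '**$1**')
    .replace(/<em>(.*?)<\/em>/g, '*$1*')
    .replace(/<li>(.*?)<\/li>/g, '- $1\n')
    // Paragraph and list wrappers become blank lines.
    .replace(/<\/?(p|ul|ol)>/g, '\n')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}
```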
#Step 2: Chunking into sections
The next step is to break the content into smaller chunks, so that they fit into GPT-4's context window, which is limited to 8K tokens (approximately 6,000 words).
To do so, we split a page into sections delimited by headings. If that is not sufficient, or if the page has no headings, we break the text into smaller paragraphs until each chunk is below the token threshold.
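The heading-based pass can be sketched as follows. This is an illustration of the idea, not the production code, and the chars-per-token estimate is a common rule of thumb rather than a real tokenizer:

```typescript
// Split Markdown into sections at headings, mirroring the first pass
// described above. Each heading starts a new section.
function splitByHeadings(markdown: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];
  for (const line of markdown.split('\n')) {
    // An ATX heading (#, ##, ...) starts a new section.
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      sections.push(current.join('\n').trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) sections.push(current.join('\n').trim());
  return sections.filter((s) => s.length > 0);
}

// Rough token estimate: ~4 characters per token for English text.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);
```

Sections whose `approxTokens` count is still too large would then be split further at paragraph boundaries.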
#Step 3: Generating embeddings
The last step is to generate so-called "embeddings" from the sections. Embeddings provide a way to capture similarity between two pieces of information, and are represented as numerical vectors. If two pieces of text have similar meaning, their corresponding vectors will be "close". We use the text-embedding-ada-002 embeddings model from OpenAI.
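"Closeness" between two vectors is commonly measured with cosine similarity. A small sketch (the actual embeddings come from the OpenAI API and have 1,536 dimensions; the two-dimensional vectors below are just for illustration):

```typescript
// Cosine similarity between two embedding vectors:
// 1 means same direction (very similar meaning), 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```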
#How Markprompt generates answers
Now that the content has been indexed, we have all the info needed to generate answers to users' questions.
#Finding relevant sections
Say a user asks the question
What is the process of registering a device, and how do I hook it up to the network afterwards?
The first step is to find the sections in the source content that contain information relevant to producing an answer; there might be a few of them. In the same way as we did with sections, we generate an embedding for the question. Then we look up the closest embeddings among those created in the previous step, to obtain a list of closely related sections.
By default, we extract at most 10 sections, and we cut off all sections that are too far away in meaning (similarity below a threshold of 0.5). This can all be configured in the API call, for instance via the sectionsMatchCount parameter. If the source content consists of many small paragraphs that all contain relevant information, you may want to increase sectionsMatchCount to ensure all the sections are taken into account.
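Putting the two defaults together, the selection step can be sketched like this. The `matchCount` default corresponds to sectionsMatchCount above; the types and function name are illustrative, not the actual API:

```typescript
interface Section {
  content: string;
  similarity: number; // cosine similarity to the question embedding
}

// Keep at most `matchCount` sections whose similarity clears the
// threshold, best matches first. Defaults mirror the ones described above.
function selectSections(
  sections: Section[],
  matchCount = 10,
  threshold = 0.5,
): Section[] {
  return sections
    .filter((s) => s.similarity >= threshold)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, matchCount);
}
```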
Now that we have found sections with relevant information, the last step is to provide instructions via a so-called "system prompt". The default system prompt looks like this:
You are a kind AI who loves to help people!
This prompt has all the instructions necessary to come up with an answer. You can [customize the prompt](https://markprompt.com/docs#system-prompts) even further, for instance to adjust tone and style.
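Conceptually, the final prompt is the system instructions followed by the retrieved sections and the user's question. A simplified sketch, assuming a plain-text template (the exact template Markprompt uses may differ):

```typescript
// Assemble the final prompt sent to the model: system instructions,
// the retrieved context sections, then the user's question.
function buildPrompt(
  systemPrompt: string,
  sections: string[],
  question: string,
): string {
  return [
    systemPrompt,
    'Context sections:',
    sections.join('\n---\n'),
    `Question: ${question}`,
  ].join('\n\n');
}
```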
As long as all the relevant sections end up in the final prompt, Markprompt will be able to come up with a response. In the end, it doesn't matter whether your content is structured as small sections, long chunks of dense text, or unstructured paragraphs. Eventually, language models will be able to fit entire corpora into a single prompt, removing the need for chunking altogether (Anthropic recently announced 100K-token context windows). The bottom line is that as long as the information is present somewhere in your content, it can surface in response to a user's query.