Frequently Asked Questions
This is unlikely. Markprompt mitigates this in several ways. For starters, the underlying prompt instructs the model to only say things it has a high level of confidence in, based on the content that you have provided during training. The default system prompt that Markprompt uses is the following:
You are kind AI who loves to help people!
Note that you can replace this default with a custom system prompt, which can change this behavior, for example if it omits the instruction to exclusively use the information provided by the context.
We also use a low temperature (0.1), which leaves the model little room for improvisation. At worst, you get an "I don't know" response, which is often an indication that the content was not clear in the first place. Making the content more explicit, or providing more information, usually fixes this.
The main cause for this is that the section containing the information is too long. Markprompt has a token cutoff imposed by the model (the "context window"). If you encounter such a situation, please consider using a model with a larger context window, such as GPT-4. You can read more about the context window for each model in the OpenAI models documentation.
Your project keys can be found by navigating to your project and opening the Settings tab.
Syncing large repositories (>100 MB) is not yet supported. In this case, we recommend using file uploads or the train API.
If the OpenAI API is unresponsive after a short delay, we use a fallback LLM provider (currently, Anthropic Claude claude-3-5-sonnet-20240620).
It does not matter much. The language models we use, such as GPT-4, are able to surface information regardless of how it is structured. We explain this in more detail in our blog post.
If your internal file structure is not the same as your public-facing website structure, you should use the getHref link transformation functions in your configuration, as described in the Link mapping section.
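As a rough illustration of the kind of mapping such a function performs, the sketch below turns an internal file path into a public URL path. The name `filePathToHref`, the `/content` folder prefix, and the assumption that the input is a plain file path string are all hypothetical; the actual argument passed to getHref is the one described in the Link mapping section.

```typescript
// Hypothetical mapping from an internal file path to a public URL path.
// Assumes source files live under "/content" and that pages are served
// without file extensions, with "index" files mapped to their folder.
function filePathToHref(filePath: string): string {
  const prefix = "/content/";
  // Remove the internal folder prefix, keeping the leading slash.
  const withoutPrefix =
    filePath.indexOf(prefix) === 0 ? filePath.slice(prefix.length - 1) : filePath;
  // Drop the file extension, e.g. ".md" or ".mdx".
  const withoutExtension = withoutPrefix.replace(/\.[^./]+$/, "");
  // Map ".../index" to its parent folder.
  const withoutIndex = withoutExtension.replace(/\/index$/, "");
  return withoutIndex === "" ? "/" : withoutIndex;
}
```

With these assumptions, `/content/docs/getting-started/introduction.mdx` maps to `/docs/getting-started/introduction`.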
Ideally, links in your content should be fully specified, and match the ones that appear on your published site where the prompt is hosted. For instance, if a public page has the path /docs/getting-started/introduction, links pointing to that page should reference exactly this path. Relative links and #hash links should ideally include the base path. So on a page with path /articles/starter, a link of the form #step-1 should be changed to /articles/starter#step-1. This is because the content containing these links will be passed on to the language model, and may come up in different ways (for instance, as part of a longer sentence containing links to different pages). If for some reason you are not able to provide links in this way, there are ways around it involving prompt engineering, and you should be able to make it work. We explain this in depth in our prompt engineering custom business logic article.
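One way to apply this rule before uploading content is a small preprocessing step. The sketch below is an assumption about how you might do it, not something Markprompt provides: the helper name `toFullPath` is hypothetical, and it only handles simple hash and relative links.

```typescript
// Hypothetical preprocessing helper: rewrites a link found on a page so
// that it contains the full public path of its target.
function toFullPath(href: string, pagePath: string): string {
  // Leave absolute paths and URLs with a scheme (https:, mailto:) untouched.
  if (href.charAt(0) === "/" || /^[a-z][a-z0-9+.-]*:/i.test(href)) {
    return href;
  }
  // Hash links point at a heading on the current page.
  if (href.charAt(0) === "#") {
    return pagePath + href;
  }
  // Relative links resolve against the folder containing the page.
  const segments = pagePath.split("/").slice(0, -1);
  for (const part of href.split("/")) {
    if (part === "..") {
      // Never go above the site root.
      if (segments.length > 1) segments.pop();
    } else if (part !== "." && part !== "") {
      segments.push(part);
    }
  }
  return segments.join("/");
}
```

For example, `toFullPath("#step-1", "/articles/starter")` returns `/articles/starter#step-1`.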
We split up each file into sections (i.e. content parts separated by headings, like <h2>s and <h3>s). We consider these to be reasonable "small units" of information.
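To make the idea of heading-delimited sections concrete, here is a minimal sketch that splits a Markdown string at its heading lines. It is an illustration only and not Markprompt's actual splitter; the `Section` type and the `splitIntoSections` name are hypothetical, and edge cases such as `#` characters inside code fences are ignored.

```typescript
interface Section {
  heading: string | null; // null for content that precedes the first heading
  content: string;
}

// Split Markdown into sections at each heading line ("#" through "######").
function splitIntoSections(markdown: string): Section[] {
  const sections: Section[] = [];
  let heading: string | null = null;
  let lines: string[] = [];

  // Store the section collected so far, skipping an empty leading section.
  const flush = () => {
    const content = lines.join("\n").trim();
    if (heading !== null || content !== "") {
      sections.push({ heading, content });
    }
  };

  for (const line of markdown.split("\n")) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = (match[1] ?? "").trim();
      lines = [];
    } else {
      lines.push(line);
    }
  }
  flush();
  return sections;
}
```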
We pass only sections, and there can be multiple of these (either from the same file, or from different files). We don't pass the entire file by default, as it may contain irrelevant extra information, and including it could prevent sections from other relevant articles from being included due to the size limit on the context window. If the rest of the file turns out to be relevant for the response, these sections would be part of the context anyway.
By default, we cap at 10 sections, or 1500 tokens (approx. 400 words), whichever is smaller, to fit the context window. You can change the number of sections to include using the sectionsMatchCount query parameter (cf. create chat completion).
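The two limits can be pictured as a simple selection loop. The sketch below is an assumption about how such a cap might work, not Markprompt's implementation: it estimates tokens at roughly 4 characters each, walks the sections in ranked order, and stops at the first one that would exceed either limit. The names `RankedSection`, `estimateTokenCount`, and `selectSections` are hypothetical.

```typescript
interface RankedSection {
  id: string;
  content: string;
}

// Rough estimate: about 4 characters per token for English text.
const estimateTokenCount = (text: string): number => Math.ceil(text.length / 4);

// Keep sections in ranked order until either limit would be exceeded.
function selectSections(
  sections: RankedSection[],
  maxSections = 10,
  maxTokens = 1500
): RankedSection[] {
  const selected: RankedSection[] = [];
  let usedTokens = 0;
  for (const section of sections) {
    const cost = estimateTokenCount(section.content);
    if (selected.length >= maxSections || usedTokens + cost > maxTokens) {
      break;
    }
    selected.push(section);
    usedTokens += cost;
  }
  return selected;
}
```

In this sketch, short sections run into the section count first, while long sections run into the token budget first.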
We include the file path alongside the section content. Here is how the context looks when injected into the system message:
---
Section id: /path/to/file1

Section content...
---
Section id: /path/to/file2

Section content...
---
...
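For illustration, a context string in this layout could be assembled as in the sketch below. The `ContextSection` type and the `buildContext` name are hypothetical, and the exact whitespace Markprompt emits may differ.

```typescript
interface ContextSection {
  path: string;
  content: string;
}

// Render each section as a "---" delimiter, its file path, a blank line,
// and its content, then join the sections with newlines.
function buildContext(sections: ContextSection[]): string {
  return sections
    .map((section) => `---\nSection id: ${section.path}\n\n${section.content}`)
    .join("\n");
}
```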
In addition, the chat response includes the metadata of all the files associated with the sections used in the context.
When available, we use the section heading to make it clearer where the source came from. When a user clicks the reference, not only are they brought to the relevant page, but to the actual section that contained the info used to produce the response. You can fully customize this behavior using the references.getLabel callback (cf. options).
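The fallback logic described here (prefer the section heading, otherwise fall back to something file-level) can be sketched as follows. The `ReferenceLike` shape and its field names are hypothetical stand-ins; the real reference object passed to references.getLabel is the one documented under options.

```typescript
// Hypothetical reference shape, used only for this illustration.
interface ReferenceLike {
  filePath: string;
  fileTitle?: string;
  sectionHeading?: string;
}

// Prefer the section heading, then the file title, then the raw path.
function labelFor(reference: ReferenceLike): string {
  return reference.sectionHeading || reference.fileTitle || reference.filePath;
}
```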
Make sure that your Azure account exposes the email attribute. To do this, head over to the Azure Portal, click "Azure Active Directory" → "Users", select the user, "Edit properties", and ensure that a valid email is set under the "Email" attribute.
Source: OpenAI
Tokens can be thought of as pieces of words. Before the API processes the prompts, the input is broken down into tokens. These tokens are not cut up exactly where the words start or end - tokens can include trailing spaces and even sub-words. Here are some helpful rules of thumb for understanding tokens in terms of lengths:
- 1 token ≈ 4 chars in English
- 1 token ≈ ¾ words
- 100 tokens ≈ 75 words
Or
- 1-2 sentences ≈ 30 tokens
- 1 paragraph ≈ 100 tokens
- 1,500 words ≈ 2048 tokens
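These rules of thumb translate directly into quick estimators. The sketch below is only an approximation for English text and is no substitute for a real tokenizer; the function names are hypothetical.

```typescript
// Estimate tokens from character count: 1 token ≈ 4 characters.
function estimateTokensFromChars(text: string): number {
  return Math.ceil(text.length / 4);
}

// Estimate tokens from word count: 1 token ≈ ¾ of a word.
function estimateTokensFromWords(text: string): number {
  const wordCount = text
    .trim()
    .split(/\s+/)
    .filter((word) => word.length > 0).length;
  return Math.ceil(wordCount / 0.75);
}
```

The two estimates will usually differ slightly, which is expected given how rough both rules are.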
To get additional context on how tokens stack up, consider this:
- Wayne Gretzky’s quote "You miss 100% of the shots you don't take" contains 11 tokens.
- OpenAI’s charter contains 476 tokens.
- The transcript of the US Declaration of Independence contains 1,695 tokens.
How words are split into tokens is also language-dependent. For example ‘Cómo estás’ (‘How are you’ in Spanish) contains 5 tokens (for 10 chars). The higher token-to-char ratio can make it more expensive to implement the API for languages other than English.