Hybrid streaming payloads

With the rise of LLM-native products such as OpenAI's ChatGPT, we are increasingly consuming API endpoints through ReadableStreams. Streaming responses are great, as users can start seeing content immediately, without having to wait for the entire response to be available. For chat completions, this is particularly important, as responses can be lengthy and take time to fully generate. Being able to start reading a response while it is being generated is great UX.

Similar to OpenAI, Markprompt offers a completions API to generate ChatGPT-like responses based on your content. These completions can be returned either as a plain response body (when stream is set to false), or as a ReadableStream, typically used in a chat-like interface.
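For a sense of the API shape, a call might look like this (a sketch: apart from stream, the request fields shown are placeholders rather than Markprompt's actual parameters):

ts
const res = await fetch('https://api.markprompt.com/v1/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  // `prompt` is a placeholder field, for illustration only.
  body: JSON.stringify({ prompt: 'How do I get started?', stream: false }),
});

// With stream set to false, the full completion is available at once.
const completion = await res.text();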

Alongside the completion, Markprompt provides extra metadata, such as pointers to the content sources used to generate the response. This is useful for presenting a list of references that lets the user navigate to the original sources of content. This data should be available immediately, without requiring the full response to be generated first. For instance, a user might start reading a response, then decide to navigate to one of the references without waiting for the rest. Therefore, we want to send this metadata as a single "payload", before the streaming response starts.

Now, when serving a ReadableStream, the response is chunked into small parts. This makes a lot of sense for unstructured data such as sentences: "Yes,", "you", "can", "use", "Markprompt", ... However, the metadata payload is a JSON object, and splitting it into smaller parts breaks its structure: we would need all the chunks to be available before we could recover and display the object properly.

At present, the standard Response object does not offer an obvious way to pass a JSON payload alongside a streaming response. Some protocols, such as NDJSON, attempt to alleviate this at the data layer, but NDJSON is not a good fit for our case, since it would require us to also frame the streamed text this way, adding unnecessary extra data to each chunk.
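For illustration, an NDJSON-framed stream might look like the following (the field names here are hypothetical), with every chunk paying the cost of its JSON wrapper:

{ "metadata": { "data": [ ... ] } }
{ "chunk": "Yes," }
{ "chunk": " you" }
{ "chunk": " can" }
...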

Passing data in the stream

One solution, which Markprompt originally used, is a stream separator: we first send the metadata as a serialized JSON payload, then a "separator" that marks the end of the payload and the start of the generated completion. The streaming response has the following shape:

{ data: [ reference1, reference2, ...] }___START_RESPONSE_STREAM___Yes, you can use Markprompt...

On the client side, we receive the streamable response in chunks, and wait for the ___START_RESPONSE_STREAM___ separator to come in. Once detected, we take the string preceding it, and parse it into a JSON object. Everything after the separator is treated as the usual streaming response. Not ideal, but it works!
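A minimal client-side sketch of this approach (assuming the separator constant matches the one used by the server):

ts
const SEPARATOR = '___START_RESPONSE_STREAM___';

const res = await fetch(
  'https://api.markprompt.com/v1/completions',
  { /* ... */ }
);

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let metadata: unknown;

while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  if (metadata === undefined) {
    const index = buffer.indexOf(SEPARATOR);
    if (index === -1) continue; // separator not fully received yet
    // Everything before the separator is the JSON payload...
    metadata = JSON.parse(buffer.slice(0, index));
    // ...and everything after it is the start of the completion.
    buffer = buffer.slice(index + SEPARATOR.length);
  }
  // buffer now accumulates the streamed completion text.
}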

Passing data in the response headers

Another approach, which we have ultimately opted for, is to use response headers to pass the JSON payload alongside the stream. Response headers are served immediately, and do not depend on a readable stream being initiated. The code is much simpler, as it cleanly separates the JSON payload from the streaming response, and does not require any separator or similar trick.

Our streaming backend is a Next.js application running on Vercel's edge runtime. Building the response looks as follows:

ts
import type { NextRequest } from 'next/server';

export const config = {
  runtime: 'edge',
};

export default async function handler(req: NextRequest) {
  const stream = new ReadableStream(/* ... */);

  const metadata = { data: /* ... */ };
  const headers = new Headers();
  headers.append('x-markprompt-data', JSON.stringify(metadata));

  return new Response(stream, { headers });
}

By default, browsers do not expose custom response headers to scripts for cross-origin requests, which is typically how users access the completions endpoint. In order to expose the x-markprompt-data header, we need to list it in the Access-Control-Expose-Headers response header. This can be done in next.config.js:

js
const nextConfig = {
  async headers() {
    return [
      {
        source: '/(.*)',
        headers: [
          {
            key: 'Access-Control-Expose-Headers',
            value: 'x-markprompt-data',
          },
        ],
      },
    ];
  },
};

module.exports = nextConfig;

The backend is now ready to serve a hybrid of streaming and JSON data. The client-side code is much more streamlined, and closer to the mental model of our data transport:

ts
const res = await fetch(
  'https://api.markprompt.com/v1/completions',
  { /* ... */ }
);

// JSON payload
const metadata = JSON.parse(res.headers.get('x-markprompt-data'));

// Stream
const reader = res.body.getReader();
const decoder = new TextDecoder();
let done = false;
while (!done) {
  const { value, done: doneReading } = await reader.read();
  done = doneReading;
  const chunk = decoder.decode(value);
  // ...
}

Handling non-UTF-8 data

One limitation of response headers is that their values are effectively restricted to ASCII characters. Our metadata may include characters outside of that range, for instance when serving content in non-Latin languages. We considered encoding the payload as a Base64 string, but this typically requires a Buffer object, which is not currently available on the edge runtime. Instead, we use a TextEncoder, which produces a Uint8Array that we can serialize to a string. We also considered LZ-compression (with an edge runtime-compatible library such as lz-string), which would reduce the header size, but ultimately opted for the plain Uint8Array approach so as not to impose any extra dependencies on the client. Our server function looks as follows:

ts
import type { NextRequest } from 'next/server';

export default async function handler(req: NextRequest) {
  // ...
  const stream = new ReadableStream(/* ... */);

  const metadata = { data: /* ... */ };
  // Encode the JSON payload as UTF-8 bytes, then serialize the resulting
  // Uint8Array to a comma-separated string of bytes, which is header-safe.
  const encoder = new TextEncoder();
  const encodedPayload = encoder.encode(JSON.stringify(metadata)).toString();
  const headers = new Headers();
  headers.append('x-markprompt-data', encodedPayload);

  return new Response(stream, { headers });
}

The client-side code just needs a minor adjustment:

ts
const res = await fetch(
  'https://api.markprompt.com/v1/completions',
  { /* ... */ }
);

// JSON payload
const encodedPayload = res.headers.get('x-markprompt-data');
const headerArray = new Uint8Array(encodedPayload.split(',').map(Number));
const decoder = new TextDecoder();
const decodedValue = decoder.decode(headerArray);
const payload = JSON.parse(decodedValue);
// ...
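As a quick sanity check, the encoding round-trip can be verified in isolation (a standalone sketch, independent of the API):

ts
// Sample metadata containing non-ASCII characters.
const original = { data: ['référence', '参照ドキュメント'] };

// Server side: JSON → UTF-8 bytes → comma-separated byte string.
const headerValue = new TextEncoder()
  .encode(JSON.stringify(original))
  .toString();

// Client side: comma-separated byte string → Uint8Array → JSON.
const roundTripped = JSON.parse(
  new TextDecoder().decode(
    new Uint8Array(headerValue.split(',').map(Number))
  )
);

console.log(JSON.stringify(roundTripped) === JSON.stringify(original)); // true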

Conclusion

Our current solution works well. Looking ahead, we hope the standard Response object will evolve to natively support hybrid streaming and non-streaming payloads.