Improving UX with Stream Responses in OpenAI Assistants API

Information

To reach a broader audience, this article has been translated from Japanese.
You can find the original version here.

The Assistants API from OpenAI is convenient: it offers thread-based conversation context, Function calling, Retrieval, and more. However, for interactive use you previously had to poll until the assistant (and the underlying GPT model) had finished generating the entire response. This lengthened the perceived wait time for users, which was not ideal for UX.
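For contrast, the polling approach can be sketched roughly as follows. This is only an illustration: `pollUntilDone` and `fetchStatus` are hypothetical names, `fetchStatus` stands in for whatever call retrieves the run from the API, and the status list is deliberately simplified.

```typescript
// Simplified, hypothetical set of run statuses; the real API defines more states.
type RunStatus = 'queued' | 'in_progress' | 'completed' | 'failed';

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `fetchStatus` (a stand-in for retrieving the run from the API)
// repeatedly until the run is no longer queued or in progress.
async function pollUntilDone(
  fetchStatus: () => Promise<RunStatus>,
  intervalMs = 500,
  maxAttempts = 120
): Promise<RunStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetchStatus();
    if (status !== 'queued' && status !== 'in_progress') return status;
    await sleep(intervalMs); // The user sees nothing while we wait here
  }
  throw new Error('Run did not finish within the allotted attempts');
}
```

Nothing can be shown to the user until the loop returns, which is where the long perceived wait comes from.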

To address this, OpenAI made the following announcement last month (2024-03-14):

Streaming is now available in the Assistants API! You can build real-time experiences with tools like Code Interpreter, retrieval, and function calling. https://t.co/B0Vytm6zyE pic.twitter.com/9QWQnQRH9x
— OpenAI Developers (@OpenAIDevs) March 13, 2024

In other words, streaming responses, which were already supported in the Chat Completions API, are now also supported in the Assistants API.

I tried it out, and this article introduces how to use it.

Preliminary Setup

Here, we will create a terminal-style conversation script using Node.js (TypeScript).

Install the following in any NPM project (TypeScript-related settings are omitted as they are not the main topic).

npm install openai @inquirer/prompts

The OpenAI library used here is version 4.33.0, the latest at the time of writing.
@inquirer/prompts is a library for building interactive prompts in a CLI.

Building the Overall Framework

First, we build the overall framework of the script.
This part is a simplified version of the code in the following introductory article on the Assistants API:

Trying OpenAI's Assistants API (Beta)

import OpenAI from 'openai';
import { input } from '@inquirer/prompts';

const openai = new OpenAI();
const assistant = await openai.beta.assistants.create({
  name: 'フリーザ様',
  instructions: 'You act as Frieza from Dragon Ball. Speak in Japanese',
  model: 'gpt-4-turbo'
});

const thread = await openai.beta.threads.create();

try {
  while (true) {
    const req = await input({ message: '>' }); // Get user prompt
    if (req === 'q') break; // Exit with `q`
    await openai.beta.threads.messages.create(
      thread.id,
      {
        role: 'user',
        content: req
      }
    );

    // Write code to execute the thread and return results to the user
    
    console.log();
  }
} finally {
  await Promise.all([
    openai.beta.threads.del(thread.id), 
    openai.beta.assistants.del(assistant.id)
  ]);
}

The script first creates an assistant and a thread, which holds the conversation history in the Assistants API, and then continues the dialogue with the assistant until the user enters q. Finally, it deletes the created thread and assistant^[1].

For terms such as assistant and thread, please refer to the aforementioned article or the following official document:

OpenAI Doc - How Assistants work - Objects

Using Stream Response

Now, let's write the thread execution code that we left as a placeholder comment earlier.
To receive the response as a stream, write the following:

const stream = await openai.beta.threads.runs.create(thread.id, {
  assistant_id: assistant.id,
  stream: true // Enable stream response
});
for await (const event of stream) {
  if (event.event === 'thread.message.delta') {
    const chunk = event.data.delta.content?.[0];
    if (chunk && chunk.type === 'text') {
      process.stdout.write(chunk.text?.value ?? '');
    }
  }
}

Unlike the non-streaming call, we specify stream: true when running the thread.
With this option, the call returns a Stream instead of the usual execution result (a Run instance).
The Stream implements AsyncIterable, so you can iterate over the events with for await until the run finishes.
The events that can arrive are as follows:

export type AssistantStreamEvent =
  | AssistantStreamEvent.ThreadCreated
  | AssistantStreamEvent.ThreadRunCreated
  | AssistantStreamEvent.ThreadRunQueued
  | AssistantStreamEvent.ThreadRunInProgress
  | AssistantStreamEvent.ThreadRunRequiresAction
  | AssistantStreamEvent.ThreadRunCompleted
  | AssistantStreamEvent.ThreadRunFailed
  | AssistantStreamEvent.ThreadRunCancelling
  | AssistantStreamEvent.ThreadRunCancelled
  | AssistantStreamEvent.ThreadRunExpired
  | AssistantStreamEvent.ThreadRunStepCreated
  | AssistantStreamEvent.ThreadRunStepInProgress
  | AssistantStreamEvent.ThreadRunStepDelta
  | AssistantStreamEvent.ThreadRunStepCompleted
  | AssistantStreamEvent.ThreadRunStepFailed
  | AssistantStreamEvent.ThreadRunStepCancelled
  | AssistantStreamEvent.ThreadRunStepExpired
  | AssistantStreamEvent.ThreadMessageCreated
  | AssistantStreamEvent.ThreadMessageInProgress
  | AssistantStreamEvent.ThreadMessageDelta
  | AssistantStreamEvent.ThreadMessageCompleted
  | AssistantStreamEvent.ThreadMessageIncomplete
  | AssistantStreamEvent.ErrorEvent;

As you can see, there are many event types.
The most important one here is AssistantStreamEvent.ThreadMessageDelta, which carries the newly generated portion (the delta) of the message.

In the code above, we pick out this event and write the message delta to standard output.
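If you also want to react to other events, such as a failed run, the loop can branch on event.event. The following is a minimal sketch under stated assumptions: `StreamEvent` is a simplified local stand-in for a few AssistantStreamEvent variants (the real types carry many more fields), and `textOf`, `consume`, and `fakeStream` are hypothetical helpers that let the handler be tried without calling the API. Unlike the loop above, which reads only the first content part, this version joins every text part in a delta.

```typescript
// Simplified local stand-ins for a few AssistantStreamEvent variants.
type TextPart = { type: 'text'; text?: { value?: string } };
type ImagePart = { type: 'image_file' };
type StreamEvent =
  | { event: 'thread.message.delta'; data: { delta: { content?: Array<TextPart | ImagePart> } } }
  | { event: 'thread.run.completed'; data: { id: string } }
  | { event: 'thread.run.failed'; data: { id: string; last_error: { message: string } | null } };

// Returns the text carried by a message delta, or '' for any other event.
function textOf(event: StreamEvent): string {
  if (event.event !== 'thread.message.delta') return '';
  return (event.data.delta.content ?? [])
    .map((part) => (part.type === 'text' ? part.text?.value ?? '' : ''))
    .join('');
}

// Consumes the stream, forwarding text to `write` and surfacing run failures.
// Resolves with the full text once the stream ends.
async function consume(
  stream: AsyncIterable<StreamEvent>,
  write: (text: string) => void
): Promise<string> {
  let full = '';
  for await (const event of stream) {
    if (event.event === 'thread.run.failed') {
      throw new Error(event.data.last_error?.message ?? 'run failed');
    }
    const text = textOf(event);
    if (text) {
      write(text);
      full += text;
    }
  }
  return full;
}

// A fake stream for exercising the handler offline.
async function* fakeStream(): AsyncGenerator<StreamEvent> {
  yield { event: 'thread.message.delta', data: { delta: { content: [{ type: 'text', text: { value: 'Hel' } }] } } };
  yield { event: 'thread.message.delta', data: { delta: { content: [{ type: 'text', text: { value: 'lo' } }] } } };
  yield { event: 'thread.run.completed', data: { id: 'run_example' } };
}
```

In the script, `write` would be something like `(text) => process.stdout.write(text)`, and the stream returned by the run call would take the place of `fakeStream()`, provided its events are compatible with the simplified type.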

Using Stream-specific API

The OpenAI library also includes an API specialized for streaming responses.
With this approach you do not iterate over the stream yourself; instead, you register listeners for the events you are interested in.

const stream = openai.beta.threads.runs
  .stream(thread.id, { assistant_id: assistant.id })
  .on('textDelta', (delta, snapshot) => process.stdout.write(delta.value ?? ''));
await stream.finalRun();

This approach is more readable, so it is generally the better choice.
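To make the difference between the two styles concrete, here is a minimal sketch of how a listener-style interface can be layered on top of an async iterable. `ListenerStream`, `onText`, and `done` are hypothetical names invented for this illustration; this is not how the OpenAI library is implemented. One simplification to note: the sketch only starts reading when `done()` is called.

```typescript
type TextListener = (delta: string, snapshot: string) => void;

// Hypothetical helper, not part of the openai package.
class ListenerStream {
  private listeners: TextListener[] = [];
  private source: AsyncIterable<string>;

  constructor(source: AsyncIterable<string>) {
    this.source = source;
  }

  // Registers a callback and returns `this` so that calls can be chained.
  onText(listener: TextListener): this {
    this.listeners.push(listener);
    return this;
  }

  // Drains the source, notifying every listener with the new delta and the
  // text accumulated so far, then resolves with the full text.
  async done(): Promise<string> {
    let snapshot = '';
    for await (const delta of this.source) {
      snapshot += delta;
      for (const listener of this.listeners) listener(delta, snapshot);
    }
    return snapshot;
  }
}

// A fake source of text deltas for trying the helper offline.
async function* fakeDeltas(): AsyncGenerator<string> {
  yield 'Hel';
  yield 'lo';
}
```

Usage mirrors the chained call above: `await new ListenerStream(fakeDeltas()).onText((delta) => console.log(delta)).done();`. The caller only declares what to do with each delta, and the iteration is hidden inside the helper.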

Below is a video of this script in action.

You can see that the message is output incrementally, rather than only after the whole response has been generated.

Summary

Streaming responses are now easy to use in the Assistants API.
I expect them to be useful wherever real-time interaction with users is required.

[1] Assistants persist until they are deleted, so if you forget to delete one, remove it from the OpenAI API management console.