Multiple streaming responses with Azure OpenAI and Semantic Kernel
Working with Azure multiple streaming responses in Semantic Kernel
Yash Worlikar  Mon Feb 12 2024  4 min read

In the previous blog we discussed the basics of getting started with Azure OpenAI and Semantic Kernel. We also created a simple request with chat history and execution settings for our AI service and printed the response.
If you have tried it out yourself, you would have noticed that the response is printed all at once, unlike popular AI-based chatbots like ChatGPT where the response is streamed word by word. So what’s going on here?
Streaming responses with Semantic Kernel
Responses from LLMs are relatively slow, and their latency can vary widely based on the size of the response. This is mainly due to the way these models work.
Every word (token) is generated sequentially in a loop, so longer responses take a noticeably longer time. Since every token depends on the ones generated before it, we cannot parallelize this process.
Waiting for the entire response before showing anything creates a poor user experience. To avoid this, rather than waiting for the whole response, we can stream it to the end user as it gets generated.
Semantic Kernel supports both streaming and non-streaming responses. To get the API response all at once we can use the GetChatMessageContentAsync method, while for streaming responses we can use the GetStreamingChatMessageContentsAsync method.
This is equivalent to setting stream=True in the OpenAI API request. When enabled, the response is sent back incrementally in chunks via an event stream.
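For comparison, a minimal non-streaming call looks roughly like this (a sketch, assuming the chatCompletionService and chatHistory set up in the previous post):

// Non-streaming: the call returns only after the full completion is available
var reply = await chatCompletionService.GetChatMessageContentAsync(chatHistory);
Console.WriteLine(reply.Content);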
For streaming, we can iterate over the result using an await foreach loop. The code below keeps appending to fullMessage until the response stream ends.
string fullMessage = string.Empty;
await foreach (var result in chatCompletionService.GetStreamingChatMessageContentsAsync(chatHistory))
{
    if (result.Content?.Length > 0)
    {
        // Append the new chunk and print it as it arrives
        fullMessage += result.Content;
        Console.Write(result.Content);
    }
}
The above code works well for a single-result scenario, but to return multiple responses in a single request using the ResultsPerPrompt execution setting, we must modify it. Now, instead of a simple string, we update a dictionary with each response stream as we receive it.
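As a rough sketch (assuming the OpenAIPromptExecutionSettings type from the Microsoft.SemanticKernel.Connectors.OpenAI namespace), the execution settings could be configured like this:

// Request multiple choices per prompt; a higher temperature encourages variety
OpenAIPromptExecutionSettings executionSettings = new()
{
    ResultsPerPrompt = 3,
    Temperature = 1.0
};

These settings are then passed along with the chat history in the streaming call below.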
Dictionary<int, string> ResultDict = new Dictionary<int, string>();
await foreach (var result in chatCompletionService.GetStreamingChatMessageContentsAsync(chatHistory, executionSettings))
{
    // Each chunk carries a ChoiceIndex identifying which response it belongs to
    if (!ResultDict.ContainsKey(result.ChoiceIndex))
    {
        ResultDict[result.ChoiceIndex] = result.Content ?? string.Empty;
    }
    else
    {
        ResultDict[result.ChoiceIndex] += result.Content;
    }
}
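Once the stream completes, each entry in the dictionary holds one full response. For example, a simple way to print them (plain C#, nothing Semantic Kernel specific) is:

// Print every completed choice after the stream has ended
foreach (var entry in ResultDict)
{
    Console.WriteLine($"--- Choice {entry.Key} ---");
    Console.WriteLine(entry.Value);
}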
Streaming multiple responses using multiple simultaneous calls
While multiple responses work great, they aren’t perfect and have some inherent limitations. Models like gpt-4-vision don’t support multiple responses and are limited to only one response per call.
Another issue is that even when we set the temperature to a higher value, the responses generated through ResultsPerPrompt are sometimes too similar to one another. Whenever an OpenAI API call is made, a random seed is set for that call, so separate calls tend to give more distinct responses than the multiple responses generated within a single call.
Currently, it’s not possible to assign a different seed to each response within a call, so those responses sometimes end up being similar.
To get more varied responses, rather than relying on the ResultsPerPrompt parameter, we can instead make multiple simultaneous API calls to generate multiple responses.
ConcurrentDictionary<int, string> ResultDict = new ConcurrentDictionary<int, string>();
List<Task> processingTasks = new List<Task>();
int numberOfCalls = 2;

for (int i = 0; i < numberOfCalls; i++)
{
    // Capture the current loop index so each task writes to its own key
    int currentK = i;

    async Task func()
    {
        await foreach (var result in chatCompletionService.GetStreamingChatMessageContentsAsync(chatHistory))
        {
            // Add the first chunk, or append subsequent chunks to the existing content
            ResultDict.AddOrUpdate(
                currentK,
                result.Content ?? string.Empty,
                (_, existingContent) => existingContent + result.Content);
        }
    }

    processingTasks.Add(func());
}

// Wait for all processing tasks to complete
await Task.WhenAll(processingTasks);
Here we are calling the streaming API simultaneously while updating each response as we receive it.
Do note that this increases the overall input token consumption. While the extra cost may not be noticeable for cheaper models like GPT-3.5, it can quickly add up for more expensive models like GPT-4, and each additional call also consumes more of the API rate-limit quota.
Wrapping up
In this blog post, we’ve explored how to implement streaming responses with Semantic Kernel for Azure OpenAI. By utilizing streaming responses, we can improve the user experience by delivering output incrementally as it is generated, similar to popular chatbots.
By leveraging streaming responses and optimizing multiple response generation, developers can create more engaging and efficient AI-powered applications using Semantic Kernel with Azure OpenAI.