
Vision in Semantic Kernel

Using gpt-4 with vision in Semantic Kernel

Yash Worlikar Mon Feb 19 2024 3 min read

Introduction

Rather than limiting AI models to being good at natural language alone, there has been a shift towards building multimodal capabilities directly into them. These include the ability to understand not just natural language but images, audio, and video too.

Working with gpt-4 with vision and Semantic Kernel

GPT-4 with vision, or gpt-4-vision, refers to the ability of the gpt-4 model to understand images and answer questions about them. So we are no longer limited to text inputs when utilizing the natural language capabilities of the gpt-4 model.

Using the vision models in Semantic Kernel is much the same as using any other OpenAI or Azure OpenAI model, but with a few differences:

  • The OpenAI API for the gpt-4-vision-preview model currently only supports one response per request, so ResultsPerPrompt must always be set to 1 (see the sketch after this list)
  • Currently, function calling isn’t supported
  • We can use the ImageContent class to send images to the AI model in the chat conversation. Note - currently only image URLs are supported in V1; base64 image support will be available in future releases
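
If we want to make these constraints explicit, they can be expressed through the connector’s execution settings. The following is only a minimal sketch, assuming the Microsoft.SemanticKernel.Connectors.OpenAI package is referenced and that the MaxTokens value is an arbitrary choice:

// Execution settings for the vision model
using Microsoft.SemanticKernel.Connectors.OpenAI;

var visionSettings = new OpenAIPromptExecutionSettings
{
    ResultsPerPrompt = 1, // the vision model only supports one response per request
    MaxTokens = 300       // hypothetical cap on the length of the reply
};

These settings can later be passed to GetChatMessageContentAsync alongside the chat history.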

When sending an image to OpenAI, the size of the image determines the total token cost associated with it. OpenAI provides three options for the detail setting of each image, depending on our use case:

  • low
  • high
  • auto

    Note - The detail parameter for the image is set to auto by default and currently cannot be changed, so link to URLs with appropriately sized images to prevent overconsumption of tokens.

Now let’s use this model through the Semantic Kernel SDK.

// Required namespaces for the Kernel and chat completion services
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

//Create a kernel with gpt-4 vision model
var kernel = Kernel.CreateBuilder()
            .AddOpenAIChatCompletion("gpt-4-vision-preview", "<OpenAI:ApiKey>")
            .Build();

// Get the chat service
var chatCompletionService = kernel.GetRequiredService<IChatCompletionService>();

string systemPrompt = "You are a friendly assistant that helps describe images.";
string userInput = "What is this image?";

// Select an image URL to send
string imageUri = "https://upload.wikimedia.org/wikipedia/commons/6/62/Panthera_tigris_sumatran_subspecies.jpg";

// Add system message
var chatHistory = new ChatHistory(systemPrompt);

chatHistory.AddUserMessage(new ChatMessageContentItemCollection
{
    new TextContent(userInput),
    new ImageContent(new Uri(imageUri))
});

var reply = await chatCompletionService.GetChatMessageContentAsync(chatHistory);

// Print the model's description of the image
Console.WriteLine(reply);

We aren’t just limited to one image per call and can add multiple images if required.
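
For example, a follow-up user message can carry several ImageContent items alongside the text. Here is a minimal sketch that reuses the chatCompletionService and chatHistory from above; the second image URL is purely a hypothetical placeholder:

// Hypothetical second image URL, used only for illustration
string secondImageUri = "https://upload.wikimedia.org/wikipedia/commons/3/3a/Cat03.jpg";

// Add a user message containing text plus two images
chatHistory.AddUserMessage(new ChatMessageContentItemCollection
{
    new TextContent("How do these two animals differ?"),
    new ImageContent(new Uri(imageUri)),
    new ImageContent(new Uri(secondImageUri))
});

var comparison = await chatCompletionService.GetChatMessageContentAsync(chatHistory);
Console.WriteLine(comparison);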

Limitations of GPT-4 with vision

While the model shows excellent capabilities across various use cases, it still has many limitations. Being a generalized model, it fails at tasks that require precise or consistent outputs.

Things like rotation, image filters, and image shapes may also skew the outputs of the model. The model may also hallucinate, producing incorrect or misleading outputs.

Conclusion

The current multimodal capabilities of AI models signify a paradigm shift in AI technology. As this technology continues to mature, it will undoubtedly revolutionize the way we build our applications, offering new possibilities and use cases.

So rather than limiting ourselves to what is currently available, we should build with the current technologies with the future in mind and be ready to adapt or change if necessary. GPT-4 with vision is just a start, and as we move forward, we can expect to see even more advanced and integrated AI systems that will change our daily lives forever.
