Using ChatGPT Vision API with LangChain in JavaScript

Today's article aims to provide a simple example of how we can use the ChatGPT Vision API to read and extract information from images.

We will use the JavaScript version of LangChain to pass the information from a picture to an LLM and retrieve the objects from the image.

Let's roll up our sleeves and get to work!

Wrapping the Image Info for ChatGPT Vision API

Our example directory will contain the following files:

/app_folder
    .env
    index.js
    food.jpg
    package.json

We need a way to pass the image data from the food.jpg file to the LLM.

Given that we are working with a local image, we will package the info in a base64 string:

const convertImageToBase64 = filePath => {
  const imageData = fs.readFileSync(filePath)
  return imageData.toString('base64')
}

const base64String = convertImageToBase64('food.jpg')

Later, we will pass this info as part of our conversation with the LLM.
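
The chat API expects images embedded as data URLs. As a minimal sketch of what we will build in the full example below (the `image/jpeg` MIME type is an assumption that matches our food.jpg file):

```javascript
// Sketch: how the base64 string gets embedded for the model.
// The MIME type (image/jpeg) must match the actual format of the file.
const toDataUrl = base64String => `data:image/jpeg;base64,${base64String}`

console.log(toDataUrl('abc123'))
// → data:image/jpeg;base64,abc123
```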

Setting Up the Model

GPT-4o is multimodal, meaning the model can reason across audio, vision, and text in real time.

Therefore, it can also work with the base64String info we extracted in the previous step.

We will set up a ChatOpenAI gpt-4o model with LangChain, ask the model to analyze the image, and parse the final answer using a custom list output parser:

import { ChatOpenAI } from "@langchain/openai"
import { CustomListOutputParser } from "@langchain/core/output_parsers"
import { ChatPromptTemplate } from "@langchain/core/prompts"
import { HumanMessage, SystemMessage } from "@langchain/core/messages"
import fs from 'fs'
import * as dotenv from "dotenv"

const convertImageToBase64 = filePath => {
  const imageData = fs.readFileSync(filePath)
  return imageData.toString('base64')
}

const base64String = convertImageToBase64('food.jpg')
dotenv.config()

const model = new ChatOpenAI({
  modelName: "gpt-4o",
  maxTokens: 1024
})

let prompt = ChatPromptTemplate.fromMessages([
  new SystemMessage({
    content: "You are a useful bot that is especially good at OCR from images"
  }),
  new HumanMessage({
    content: [
      {
        type: "text",
        text: "Identify all items in this image which are food-related and provide a list of what you see. Just say the name of the food."
      },
      {
        type: "image_url",
        image_url: {
          url: "data:image/jpeg;base64," + base64String
        }
      }
    ]
  })
])

let chain = prompt
  .pipe(model)
  .pipe(new CustomListOutputParser({ separator: `\n` }))

let response = await chain.invoke({})

console.log(response)

Running the above code will return the following:

[
  '- Rice',
  '- Avocado',
  '- Seaweed',
  '- Wasabi',
  '- Soy sauce',
  '- Salmon',
  '- Pickled ginger'
]
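
Notice that each item still carries the leading "- " from the model's bullet list, since CustomListOutputParser only splits on the separator. If you want bare food names, a small clean-up pass (my addition, not part of the original example) does the trick:

```javascript
// Strip the leading "- " bullet marker from each parsed list item.
const cleanItems = items => items.map(item => item.replace(/^-\s*/, ''))

console.log(cleanItems(['- Rice', '- Avocado', '- Seaweed']))
// → [ 'Rice', 'Avocado', 'Seaweed' ]
```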

Note the maxTokens: 1024 limit. This parameter specifies the maximum number of tokens the model can generate in its response. It helps control the length of the response and the use of resources.

Some time ago, I created an example of how to use TensorflowJs and COCO-SSD to detect objects from an image. Keep in mind that COCO-SSD could identify only a few classes of objects; ChatGPT Vision is much better, easier, and more reliable.

As a fun exercise, you can take the ingredients returned by the Vision API and ask the LLM to suggest a dish based on these ingredients. Or even better, you can ask the model to return an image of how the dish will look (in this case, sushi).
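
For the first part of that exercise, here is a sketch of how you might build the follow-up prompt; the `ingredients` array stands in for the Vision response above, and the wording of the prompt is my own assumption:

```javascript
// Sketch: turn the ingredient list from the Vision step into a
// follow-up prompt for the LLM. `ingredients` stands in for the
// cleaned-up array returned by the chain above.
const ingredients = ['Rice', 'Avocado', 'Seaweed', 'Salmon']

const followUpPrompt =
  `Suggest a dish I could make with these ingredients: ${ingredients.join(', ')}. ` +
  'Reply with the dish name and short preparation steps.'

console.log(followUpPrompt)
// You would then pass this string to the same model, e.g.:
// const suggestion = await model.invoke(followUpPrompt)
```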

You can find the full code of this example on my GitHub. Happy coding!

šŸ“– Build a full trivia game app with LangChain

Learn by doing with this FREE ebook! This 35-page guide walks you through every step of building your first fully functional AI-powered app using JavaScript and LangChain.js

