Streaming OCR with Google’s Vision API and OpenCV

Objectives & Prerequisites:

By the end of this article you will know how to:

  • Apply OCR (Optical Character Recognition) with Google’s Vision API.
  • Run the API on a live video stream from your webcam.

Before beginning, you will need:

  • Basic coding experience in Python.  
  • Some high-level understanding of Computer Vision techniques.

Difficulty: Beginner/Moderate


So what is Google’s Computer Vision API and why should I use it?

Google has released an API that can extract information from images very accurately.  There are many features within this API, but the one I want to focus on today is text extraction from images.  This product is so powerful that it can read image text in different fonts, languages, and even orientations (sideways, upside down).  Because of this, it is better than any open source software that I have tried.


What are some of the applications?

This can be incredibly useful to:

  1. Pull out meaningful text from scanned documents instead of transcribing by hand.
  2. Extract information from a stack of business cards instead of manually inputting data into a database.
  3. Pull useful information from a billboard.

Basically, many tedious tasks can be automated with this API, and it can be applied to a wide array of contexts.  An added advantage is that the cost of using it is relatively low.  Behind this product are also years of research and testing from Google, so why reinvent the wheel?

You can try it out here:

There are many ways to use this API, but what I am going to do today is show you how to combine the Google Vision API with live video streaming through OpenCV, a fantastic Python package for image processing.  The code will access your webcam, allowing you to wave different objects with text in front of it, such as a candy bar wrapper, a receipt, or even words on a t-shirt.  Your command line terminal will then show the text that appears in the images, frame by frame.


Before We Get Going

  1. You will need to perform some pre-setup installations:
    • “Pip install” the following packages:
      • opencv-python – imported as cv2 in Python code
      • google-cloud-vision
      • Pillow – imported as PIL
  2. You will also need to sign up for the Google Cloud Platform:
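Before moving on, it can help to confirm that the three packages from step 1 are importable. The following is a minimal sketch, not part of the original demo; the helper name missing_packages and the REQUIRED mapping are made up for illustration.

```python
import importlib

# Import name -> pip package name, taken from the install list above.
REQUIRED = {
    "cv2": "opencv-python",
    "google.cloud.vision": "google-cloud-vision",
    "PIL": "Pillow",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be imported."""
    missing = []
    for module_name, pip_name in required.items():
        try:
            importlib.import_module(module_name)
        except ImportError:
            missing.append(pip_name)
    return missing


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All three packages are importable.")
```

If the script reports a missing package, install it and run the check again before continuing.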


Steps for OCR Demo

1. Vision API Setup:

2. Vision API Code: This code is based on the source code from the Vision API guide, which I modified a bit. 

Vision API Code Displayed for OCR

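# export GOOGLE_APPLICATION_CREDENTIALS=yourcredentials.json
import io
import cv2
from PIL import Image
# Imports the Google Cloud client library
from google.cloud import vision
# Note: google-cloud-vision releases before 2.0 exposed Image as
# google.cloud.vision.types.Image; newer releases use vision.Image.
# Instantiates a client
client = vision.ImageAnnotatorClient()
def detect_text(path):
    """Detects text in the file."""
    # Read the raw image bytes from disk
    with io.open(path, 'rb') as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    # Send the image to the Vision API for text detection
    response = client.text_detection(image=image)
    texts = response.text_annotations
    # Join every detected piece of text into one string
    string = ''
    for text in texts:
        string+=' ' + text.description
    return string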
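One detail of the response is worth knowing: the first entry of text_annotations normally holds the complete detected text, and the remaining entries repeat it one word at a time. Concatenating every entry, as detect_text() does, therefore prints the text roughly twice. A minimal sketch of an alternative, assuming you only want the full block; the helper name full_text and the stand-in response object are made up for illustration so the snippet runs without calling the API.

```python
from types import SimpleNamespace


def full_text(response):
    """Return the whole detected text from a text_detection response."""
    error = getattr(response, "error", None)
    if error is not None and error.message:
        # The API reports request problems in response.error.message.
        raise RuntimeError(error.message)
    annotations = response.text_annotations
    # The first annotation holds the complete text; the rest are single words.
    return annotations[0].description if annotations else ""


# Stand-in response object so the helper can be tried without the API.
fake = SimpleNamespace(
    error=SimpleNamespace(message=""),
    text_annotations=[
        SimpleNamespace(description="HELLO WORLD"),
        SimpleNamespace(description="HELLO"),
        SimpleNamespace(description="WORLD"),
    ],
)
print(full_text(fake))  # HELLO WORLD
```

In the demo you could return full_text(response) from detect_text() instead of building the string in a loop.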

3. OpenCV Code: This code is based on the OpenCV Stream code.

OpenCV Code Displayed for OCR

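cap = cv2.VideoCapture(0)
while True:
    # Capture frame-by-frame
    ret, frame = cap.read()
    file = 'live.png'
    cv2.imwrite(file, frame)
    # print OCR text
    print(detect_text(file))
    # Display the resulting frame
    cv2.imshow('frame', frame)
    # Press q in the video window to stop
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
# When everything done, release the capture
cap.release()
cv2.destroyAllWindows()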

  • The important thing to note is that cv2.VideoCapture() is called with 0, which selects the default webcam.  The argument can also be the path to a video file or the URL of a live video feed.
  • Now you can save the entire portion of the code in a .py file and give it any name you want.
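To make the choice of video source explicit, here is a minimal sketch, assuming you pass the source as a command-line style string. The helper names parse_source and open_capture, and the file name clip.mp4, are hypothetical.

```python
def parse_source(arg):
    """Turn a string into an argument for cv2.VideoCapture.

    Digits become a camera index (0 is usually the built-in webcam);
    anything else is passed through as a file path or stream URL.
    """
    return int(arg) if arg.isdigit() else arg


def open_capture(arg):
    """Open the source and fail early if OpenCV cannot read from it."""
    import cv2  # imported here so parse_source works without OpenCV

    cap = cv2.VideoCapture(parse_source(arg))
    if not cap.isOpened():
        raise RuntimeError("Could not open video source: " + arg)
    return cap


print(parse_source("0"))         # 0
print(parse_source("clip.mp4"))  # clip.mp4
```

Checking isOpened() right after creating the capture gives a clear error when the camera is busy or the path is wrong, instead of a failure later in the loop.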

4. The Code in Action: Now open a terminal in the directory where you saved your code and API key, and run the following command first.

export GOOGLE_APPLICATION_CREDENTIALS=yourAPIkey.json

  • Then run your Python (.py) file on the command line immediately afterward.
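If the script fails as soon as it creates the client, the environment variable is a likely cause. A minimal sketch of a check you could run first, assuming the variable should point at a JSON key file on disk; the helper name credentials_problem is made up for illustration.

```python
import os

ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def credentials_problem(environ=os.environ):
    """Return a message describing what is wrong, or None if all looks fine."""
    path = environ.get(ENV_VAR)
    if not path:
        return ENV_VAR + " is not set in this shell."
    if not os.path.isfile(path):
        return ENV_VAR + " points to a missing file: " + path
    return None


if __name__ == "__main__":
    problem = credentials_problem()
    print(problem or "Credentials file found.")
```

Note that export only affects the current terminal session, so the check and your script need to run in the same terminal where you set the variable.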


Final Result

GIF of Results From Camera and Terminal

As you can see, the detect_text() function from earlier is called inside the while loop.  This lets the Vision API check for text in each image frame while the webcam is running.  On the left is a screenshot of a backpack with our company logo on it; on the right is the output, frame by frame.  The words in the results are not consistent because I was waving the backpack around within the frame.
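Because every frame triggers its own request to the Vision API, you may want to run OCR less often than the camera delivers frames. The following is a minimal sketch of one way to do that, not part of the original demo; the class name Throttle and the one-second interval are assumptions.

```python
import time


class Throttle:
    """Allow an action at most once every `interval` seconds."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last = None  # time of the last allowed call

    def ready(self):
        """Return True if enough time has passed since the last allowed call."""
        now = self.clock()
        if self.last is None or now - self.last >= self.interval:
            self.last = now
            return True
        return False


# Usage inside the capture loop (detect_text and file come from the demo code):
#     ocr_gate = Throttle(interval=1.0)
#     ...
#     if ocr_gate.ready():
#         print(detect_text(file))
```

The video window still updates on every frame; only the OCR call is skipped until the interval has passed.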

I hope you were able to learn how to use the Google Vision API in a novel way with OpenCV.

At SpringML, we can not only apply powerful Google APIs for your business but also create custom deep learning models such as Using Object Detection to Find Potholes in Tensorflow and deploy them on Google Cloud Platform.  If you want to reach out to us for our domain expertise in Machine Learning, you can reach us at
