How to use the Google Vision API

Happy or sad? Cat or person? Use the Google Vision API to detect details about images

Recently, I covered how computers can see, hear, feel, smell, and taste. One of the ways your code can “see” is with the Google Vision API, which connects your code to Google’s image recognition capabilities. You can think of it as a kind of REST interface to the smarts behind images.google.com, but it does much more than show you similar images.

Google Vision can detect whether you’re a cat or a human, as well as the parts of your face. It tries to detect whether an image is posed or contains something that wouldn’t be okay for Google SafeSearch. It even tries to detect whether you’re happy or sad.

Setting up the Google Vision API

To use the Google Vision API, you have to sign up for a Google Cloud Platform account. It is free to try, but you will need a credit card to sign up. From there, select a project (My First Project is selected automatically if you have just signed up). Then get yourself an API key from the left-hand menu.

[Screenshot: Google Vision API, screen 1 (IDG)]

Here, I’m using a simple API key that I can use with the command-line tool curl (if you prefer, you can use any other tool that can call REST APIs):

[Screenshot: Google Vision API, screen 2 (IDG)]

Save the key it generates to a text file or buffer somewhere (I’ll refer to it as YOUR_KEY from now on) and enable the Cloud Vision API on your project (find it in the console’s API Library and click Enable the API):

[Screenshot: Google Vision API, screen 3 (IDG)]

Select your project from the next screen:

[Screenshot: Google Vision API, screen 4 (IDG)]
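
If you prefer the command line, you can do the enablement step with the gcloud CLI instead (a sketch, assuming you have the Cloud SDK installed and are authenticated; YOUR_PROJECT_ID is a placeholder):

# Enable the Cloud Vision API for a project (project ID is a placeholder).
gcloud services enable vision.googleapis.com --project=YOUR_PROJECT_ID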

Now you’re ready to go! Stick this text in a file called google_vision.json:

{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "https://upload.wikimedia.org/wikipedia/commons/9/9b/Gustav_chocolate.jpg"
        }
      },
      "features": [
        { "type": "TYPE_UNSPECIFIED", "maxResults": 50 },
        { "type": "LANDMARK_DETECTION", "maxResults": 50 },
        { "type": "FACE_DETECTION", "maxResults": 50 },
        { "type": "LOGO_DETECTION", "maxResults": 50 },
        { "type": "LABEL_DETECTION", "maxResults": 50 },
        { "type": "TEXT_DETECTION", "maxResults": 50 },
        { "type": "SAFE_SEARCH_DETECTION", "maxResults": 50 },
        { "type": "IMAGE_PROPERTIES", "maxResults": 50 },
        { "type": "CROP_HINTS", "maxResults": 50 },
        { "type": "WEB_DETECTION", "maxResults": 50 }
      ]
    }
  ]
}

This JSON request tells the Google Vision API which image to parse and which of its detection features to enable. I simply asked for most of them, capped at 50 results each.
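
If you only care about one feature, the request can be much smaller. For example, a labels-only version of the same file would look like this (a trimmed version of the file above, with a single entry in the features array):

{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "https://upload.wikimedia.org/wikipedia/commons/9/9b/Gustav_chocolate.jpg"
        }
      },
      "features": [
        { "type": "LABEL_DETECTION", "maxResults": 50 }
      ]
    }
  ]
}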

Now use curl:

curl -v -s -H "Content-Type: application/json" \
  "https://vision.googleapis.com/v1/images:annotate?key=YOUR_KEY" \
  --data-binary @google_vision.json > results

Looking at the Google Vision API response

You should see something like this:

* Connected to vision.googleapis.com (74.125.196.95) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.googleapis.com
* Server certificate: Google Internet Authority G3
* Server certificate: GlobalSign
> POST /v1/images:annotate?key=YOUR_KEY HTTP/1.1
> Host: vision.googleapis.com
> User-Agent: curl/7.43.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 2252
> Expect: 100-continue
>
* Done waiting for 100-continue
} [2252 bytes data]
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Vary: X-Origin
< Vary: Referer
< Date: Tue, 24 Apr 2018 18:26:10 GMT
< Server: ESF
< Cache-Control: private
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< X-Content-Type-Options: nosniff
< Alt-Svc: hq=":443"; ma=2592000; quic=51303433; quic=51303432; quic=51303431; quic=51303339; quic=51303335,quic=":443"; ma=2592000; v="43,42,41,39,35"
< Accept-Ranges: none
< Vary: Origin,Accept-Encoding
< Transfer-Encoding: chunked
< 
{ [905 bytes data]
* Connection #0 to host vision.googleapis.com left intact

If you look in the results file, you’ll see this:

{
  "responses": [
    {
      "labelAnnotations": [
        {
          "mid": "/m/01yrx",
          "description": "cat",
          "score": 0.99524164,
          "topicality": 0.99524164
        },
        {
          "mid": "/m/035qhg",
          "description": "fauna",
          "score": 0.93651986,
          "topicality": 0.93651986
        },
        {
          "mid": "/m/04rky",
          "description": "mammal",
          "score": 0.92701304,
          "topicality": 0.92701304
        },
        {
          "mid": "/m/07k6w8",
          "description": "small to medium sized cats",
          "score": 0.92587274,
          "topicality": 0.92587274
        },
        {
          "mid": "/m/0307l",
          "description": "cat like mammal",
          "score": 0.9215815,
          "topicality": 0.9215815
        },
        {
          "mid": "/m/09686",
          "description": "vertebrate",
          "score": 0.90370363,
          "topicality": 0.90370363
        },
        {
          "mid": "/m/01l7qd",
          "description": "whiskers",
          "score": 0.86890864,
          "topicality": 0.86890864
…

Google knows you have supplied it a cat picture. It even found the whiskers!
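
If you have jq installed, a quick one-liner pulls just the labels and their scores out of the saved response (a sketch; jq isn’t required for anything else in this article):

# Print each label's description and confidence score.
jq '.responses[0].labelAnnotations[] | {description, score}' results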

Now, I’ll try a larger mammal. Replace the URL in the request with the URL of my Twitter profile picture and run it again. It’s a picture of me getting smooched by an elephant on my 2014 trip to Thailand.

The results will include the locations of my facial features:

…
"landmarks": [
  {
    "type": "LEFT_EYE",
    "position": {
      "x": 114.420876,
      "y": 252.82072,
      "z": -0.00017215312
    }
  },
  {
    "type": "RIGHT_EYE",
    "position": {
      "x": 193.82027,
      "y": 259.787,
      "z": -4.495486
    }
  },
  {
    "type": "LEFT_OF_LEFT_EYEBROW",
    "position": {
      "x": 95.38249,
      "y": 234.60289,
      "z": 11.487803
    }
  },
…
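
A similar jq one-liner pulls a single landmark’s coordinates out of the first face annotation (again a sketch; the field names match the response above):

# Find the LEFT_EYE landmark of the first detected face.
jq '.responses[0].faceAnnotations[0].landmarks[] | select(.type == "LEFT_EYE") | .position' results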

Google isn’t as good at judging emotion as it is at locating facial features:

"rollAngle": 5.7688847,
"panAngle": -3.3820703,
"joyLikelihood": "UNLIKELY",
"sorrowLikelihood": "VERY_UNLIKELY",
"angerLikelihood": "UNLIKELY",
"surpriseLikelihood": "VERY_UNLIKELY",
"underExposedLikelihood": "VERY_UNLIKELY",
"blurredLikelihood": "VERY_UNLIKELY",
"headwearLikelihood": "VERY_UNLIKELY"

I was definitely surprised, because I was not expecting the kiss (I was just aiming for a selfie with the elephant). The picture may show a bit of joy combined with “yuck,” because elephant-snout kisses are messy and a bit slimy.

Google Vision also noticed some other things about the picture and me:

{
  "mid": "/m/0jyfg",
  "description": "glasses",
  "score": 0.7390568,
  "topicality": 0.7390568
},
{
  "mid": "/m/08g_yr",
  "description": "temple",
  "score": 0.7100323,
  "topicality": 0.7100323
},
{
  "mid": "/m/05mqq3",
  "description": "snout",
  "score": 0.65698373,
  "topicality": 0.65698373
},
{
  "mid": "/m/07j7r",
  "description": "tree",
  "score": 0.6460454,
  "topicality": 0.6460454
},
{
  "mid": "/m/019nj4",
  "description": "smile",
  "score": 0.60378826,
  "topicality": 0.60378826
},
{
  "mid": "/m/01j3sz",
  "description": "laughter",
  "score": 0.51390797,
  "topicality": 0.51390797
}
]
…

Google recognized the elephant snout! It also noticed that I’m smiling and laughing. Note that the lower scores indicate lower confidence, but it’s good that the Google Vision API noticed them.

…
"safeSearchAnnotation": {
"adult": "VERY_UNLIKELY",
"spoof": "POSSIBLE",
"medical": "VERY_UNLIKELY",
"violence": "UNLIKELY",
"racy": "UNLIKELY"
  }
…

Google doesn’t believe that this is more than a platonic kiss and realizes that I’m not being harmed by the elephant.

Aside from this, you’ll find things like matching images and similar images in the response. You’ll also find topic associations. For example, I once tweeted about a “Xennials” article, and now I’m associated with the topic!

How is the Google Vision API useful?

Whether you’re working in security or retail, being able to figure out what something is from an image can be fundamentally helpful. Whether you’re trying to figure out the breed of a cat, who a customer is, or whether Google thinks a columnist is influential in a topic, the Google Vision API can help. Note that Google’s terms only allow this API to be used in personal computing applications. Still, whether you’re adorning data in a search application or checking whether user-submitted content is racy, Google Vision might be just what you need.

While I used the version of the API that takes public URIs, you can also post raw binary image data or point to a Google Cloud Storage file location using slight variations on the request.
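
For example, here is one way to send a local file as inline base64 content instead of a public URI (a sketch: cat.jpg is a hypothetical local file, and base64’s flags vary by platform):

# Build a request that embeds the image bytes directly (a "content" field
# replaces the "source"/"imageUri" block). GNU base64 needs -w0 to disable
# line wrapping; on macOS, use plain "base64 < cat.jpg" instead.
cat > google_vision_binary.json <<EOF
{
  "requests": [{
    "image": { "content": "$(base64 -w0 cat.jpg)" },
    "features": [{ "type": "LABEL_DETECTION", "maxResults": 10 }]
  }]
}
EOF

curl -s -H "Content-Type: application/json" \
  "https://vision.googleapis.com/v1/images:annotate?key=YOUR_KEY" \
  --data-binary @google_vision_binary.json > results

For a file that already lives in Google Cloud Storage, you keep the source block instead and point it at the object, for example with "gcsImageUri": "gs://your-bucket/cat.jpg" in place of "imageUri".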

Author’s note: Thanks to my colleague at Lucidworks, Roy Kiesler, whose research contributed to this article.

Copyright © 2018 IDG Communications, Inc.