Facebook just became a lot more accessible to the 39 million-plus people who are blind and 246 million-plus people with severe visual impairments. Facebook today introduced automatic alternative text, or automatic alt text, an artificial intelligence (AI) application that generates a verbal description of images on the site.
Now people using screen readers on iOS devices will hear a list of items a photo may contain. For example, automatic alt text will now tell a Facebook user that an image “may contain three people, smiling, outdoors.” Before automatic alt text, Facebook users would only hear the name of the person who shared the photo.
Just last week, interestingly, Twitter also made its service more accessible to the visually impaired: it now lets users add descriptions, or “alternative text,” to images so that people using screen readers and braille displays can tell what an image is about.
So, how did Facebook build automatic alt text? The site’s object recognition technology is based on a neural network with billions of parameters, trained on millions of examples of visual objects. The company’s software developers and engineers worked with the Facebook Accessibility team to turn that technology into a feature for screen reader users. Here’s more from Facebook:
While Facebook’s visual recognition technology described above can be used to recognize a wide range of objects and scenes (both referred to as “concepts” in the rest of this post), for this first launch we carefully selected a set of about 100 concepts based on their prominence in photos as well as the accuracy of the visual recognition engine. We also chose concepts that had very specific meanings, and we avoided concepts open to interpretation. The current list of concepts covers a wide range of things that can appear in photos, such as people’s appearance (e.g., baby, eyeglasses, beard, smiling, jewelry), nature (outdoor, mountain, snow, sky), transportation (car, boat, airplane, bicycle), sports (tennis, swimming, stadium, baseball), and food (ice cream, pizza, dessert, coffee). These concepts provide different sets of information about the image, including people (e.g., people count, smiling, child, baby), objects (car, building, tree, cloud, food), settings (inside restaurant, outdoor, nature), and other image properties (text, selfie, close-up).
We make sure that our object detection algorithm can detect any of these concepts with a minimum precision of 0.8 (some are as high as 0.99). Even with such a high quality bar, we can still retrieve at least one concept for more than 50 percent of photos on Facebook. Over time our goal is to keep increasing the vocabulary of automatic alt text to provide even richer descriptions.
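To make that thresholding concrete, here is a minimal Python sketch of the idea; the concept names, cutoff values, and function names are made up for illustration and are not Facebook's actual code:

```python
# A minimal sketch (not Facebook's actual code) of per-concept confidence
# thresholding: each concept gets its own cutoff, tuned so that anything
# reported clears a minimum precision target, and everything else is dropped.

# Hypothetical per-concept cutoffs; real values would be tuned per concept.
CONFIDENCE_CUTOFFS = {
    "people": 0.90,
    "smiling": 0.85,
    "outdoor": 0.80,
    "pizza": 0.95,
}

def filter_concepts(raw_scores):
    """Keep only the concepts whose detection score clears their cutoff."""
    return {
        concept: score
        for concept, score in raw_scores.items()
        if score >= CONFIDENCE_CUTOFFS.get(concept, 1.1)  # unknown concepts never pass
    }

# Example scores a recognition model might emit for one photo.
scores = {"people": 0.97, "smiling": 0.88, "outdoor": 0.91, "pizza": 0.30}
print(filter_concepts(scores))
# {'people': 0.97, 'smiling': 0.88, 'outdoor': 0.91}
```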
Construction of sentence
After detecting the major objects in a photo, we need to organize them in a way that feels natural to people. We experimented with different approaches, such as ordering the concepts by their confidence, showing the concepts with a confidence level (such as 50 percent or 75 percent) attached to them, and so on. After many surveys and in-lab user experience studies, and after using this feature ourselves, we decided to group all the concepts into three categories – people, objects, and scenes – and then present information in this order. For each photo, we first report the number of people (approximated by the number of faces) in the photos, and whether they are smiling or not; we then list all the objects we detect, ordered by the detection algorithm’s confidence; scenes, such as settings and properties of the entire image (e.g., indoor, outdoor, selfie, meme), will be presented at the end. In addition, since we cannot guarantee that the description we deliver is 100 percent accurate (given that it’s neither created nor reviewed by a human), we start our sentence with the phrase “Image may contain” to convey uncertainty. As a result, we will construct a sentence like “Image may contain: two people, smiling, sunglasses, sky, tree, outdoor.”
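Here is a rough Python sketch of the assembly logic described in the quote above; the tag groupings, scores, and helper names are illustrative assumptions rather than Facebook's implementation:

```python
# A rough sketch of the sentence assembly described above: people first
# (a face count plus "smiling"), then objects ordered by the detector's
# confidence, then scene-level tags, all prefixed with "Image may contain".
# The tag sets, scores, and helper names are illustrative, not Facebook's code.

OBJECT_TAGS = {"sunglasses", "sky", "tree", "car"}
SCENE_TAGS = {"outdoor", "indoor", "selfie", "meme"}

def count_phrase(face_count):
    """Spell out small face counts the way the examples in the post do."""
    words = {1: "one person", 2: "two people", 3: "three people"}
    return words.get(face_count, f"{face_count} people")

def build_alt_text(face_count, smiling, concepts):
    """concepts: dict of tag -> confidence for the already-filtered detections."""
    parts = []

    # 1. People: approximate the count from detected faces, note smiling.
    if face_count:
        parts.append(count_phrase(face_count))
        if smiling:
            parts.append("smiling")

    # 2. Objects, highest-confidence first.
    objects = [t for t in concepts if t in OBJECT_TAGS]
    parts.extend(sorted(objects, key=lambda t: concepts[t], reverse=True))

    # 3. Scenes and whole-image properties come last.
    parts.extend(t for t in concepts if t in SCENE_TAGS)

    # "may contain" signals that the description is machine-generated guesswork.
    return "Image may contain: " + ", ".join(parts)

detections = {"sunglasses": 0.95, "sky": 0.90, "tree": 0.85, "outdoor": 0.80}
print(build_alt_text(2, True, detections))
# Image may contain: two people, smiling, sunglasses, sky, tree, outdoor
```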
Facebook says it took about 10 months to get automatic alt text to its current stage. The biggest challenge was “balancing people’s desire for more information about the images with the quality and social intelligence of such information. Interpretation of visual content can be very subjective and context-dependent. For instance, though people mostly care about who is in the photo and what they are doing, sometimes the background of the photo is what makes it interesting or significant.”
As of now, Facebook’s automatic alt text is available only on iOS screen readers that are set to English. However, Facebook said it plans to make automatic alt text compatible with other languages in the near future.
Let’s hope this goes better than Microsoft’s AI-powered chatbot, Tay.ai, which was shut down after just 16 hours, once it started tweeting racial slurs, defending white supremacist propaganda, and supporting genocide. Tay was designed to engage in playful conversations with 18- to 24-year-olds. It could tell jokes, play games, send pictures, and tell you your horoscope. Tay was even supposed to become more personalized to individual users over time. But within hours of going live, Twitter users exploited Tay’s flaws and forced Microsoft to shut it down.