Aaron Preece
In the last year, significant improvements in AI vision technology have emerged, most notably OpenAI's GPT-4 and its integration with the Be My Eyes app. Previously a novelty in most cases, image recognition has now advanced to the point where it can be incredibly valuable in multiple real-world situations. It seems inevitable that technologies like these will be adapted into screen readers and other accessibility technologies to aid people who are blind or have low vision.
For this piece, I wanted to highlight a groundbreaking feature that takes a significant step toward using machine learning to increase accessibility: Apple's Screen Recognition feature, integrated with VoiceOver. This feature scans the UI currently displayed on your phone or tablet, automatically recognizes UI elements, and makes them available to VoiceOver, even if they were completely invisible to the screen reader before. Notably, Screen Recognition runs entirely on the user's device and does not require cloud access.
Getting Started with Screen Recognition
To use Screen Recognition, you will need an iOS device no older than an iPhone XS. A full list of supported devices can be found on Apple's official website. Using Screen Recognition is fairly simple. First, go to Settings > Accessibility > VoiceOver > VoiceOver Recognition, and turn Screen Recognition on. Most likely, your device will need to download the machine learning model used for recognizing content.
Once Screen Recognition is active, there is one other aspect to consider to ensure you can use it to its full extent. Oftentimes, apps that are entirely inaccessible register with VoiceOver as direct touch apps. Direct touch is a feature that allows a user to interact directly with an app, bypassing VoiceOver's usual commands. In theory, this allows an app to be navigated with its own built-in gestures and inputs while still outputting information to VoiceOver. In practice, however, whenever an app automatically turns on direct touch, VoiceOver gestures stop working within it, making it impossible to use with Screen Recognition.
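For readers curious about the mechanics, the sketch below shows how an app typically ends up registering as a direct touch app: in UIKit, a view can set the allowsDirectInteraction accessibility trait, which asks VoiceOver to pass touches straight through to the app. The CanvasView class and its label here are hypothetical illustrations, not taken from any particular app.

import UIKit

// A hypothetical custom view, such as a drawing canvas, that an app
// exposes for direct interaction.
class CanvasView: UIView {
    override init(frame: CGRect) {
        super.init(frame: frame)
        isAccessibilityElement = true
        accessibilityLabel = "Drawing canvas"
        // The allowsDirectInteraction trait asks VoiceOver to send touches
        // directly to the app instead of intercepting them; this is what
        // makes an app register as a direct touch app.
        accessibilityTraits = .allowsDirectInteraction
    }

    required init?(coder: NSCoder) {
        fatalError("init(coder:) has not been implemented")
    }
}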
You can manually turn off direct touch for specific apps by going to Settings > Accessibility > VoiceOver > Rotor and then selecting Direct Touch Apps. Additionally, you can disable direct touch on the fly by performing a two-finger quadruple tap to bring up VoiceOver Quick Settings, where you will find an option to turn off direct touch altogether. Note that it may turn on again later and need to be deactivated again. Also, you will need to turn off direct touch before entering an app that registers as a direct touch app; once you're in the app, you will not be able to perform this command.
Using Screen Recognition Effectively
Once you have downloaded and activated Screen Recognition, and ensured that the app you wish to use will not automatically enable direct touch, you are ready to begin. I personally recommend adding Screen Recognition to your Rotor settings so it will be easy to turn on and off for specific apps on the fly. Go to Settings > Accessibility > VoiceOver > Rotor > Rotor Items and select Screen Recognition to ensure you always have access.
Screen Recognition is most useful in apps that provide no accessibility at all, but it can be used in any app, even stock apps such as Safari. This is where having Screen Recognition in your Rotor comes in handy: it doesn't simply recognize inaccessible items but re-identifies every item in the currently visible UI. This means that if you open an app that is already partially accessible, Screen Recognition will essentially re-identify all the elements on the screen, which might introduce accessibility errors where there were none before.
Remember, Screen Recognition can recognize UI elements such as buttons and sliders, as well as all the text displayed on the screen. However, it may not necessarily know what the image on a button means. For example, if Screen Recognition sees a button with a gear icon for settings in an app, it recognizes that it is a button but might not recognize that it is a settings button, unless the word "settings" appears nearby to signal that the label is associated with the button. More often, you will need to use "explore by touch" to find text labels and their associated buttons. Not ideal, but still quite usable.
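This is also the kind of gap app developers can close at the source: a single accessibility label makes Screen Recognition's guesswork unnecessary for that control. The snippet below is a minimal sketch, assuming a hypothetical icon-only settings button built with UIKit.

import UIKit

// A hypothetical settings button that displays only a gear icon.
let settingsButton = UIButton(type: .system)
settingsButton.setImage(UIImage(systemName: "gearshape"), for: .normal)

// Without an explicit label, VoiceOver has nothing meaningful to announce,
// and Screen Recognition can only guess at the button's purpose from the
// icon. One line fixes this.
settingsButton.accessibilityLabel = "Settings"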
When using Screen Recognition, be aware that elements that are distinct might be grouped into one large element or otherwise recognized incorrectly. I found it incredibly useful to rapidly turn VoiceOver off and back on to force Screen Recognition to re-identify the screen, which often corrects any strange recognition errors. Additionally, if a screen is scrollable, just scrolling the screen slightly might cause Screen Recognition to identify elements differently.
The Bottom Line
As smartphones and tablets continue to increase in power, it's exciting to consider where technologies like Screen Recognition will be in the future. Currently, Screen Recognition works well for apps that use conventional layouts or rely heavily on text labels, but it may still be impractical for many others. As device processing power and storage capacity increase, it's reasonable to expect the recognition model to become more powerful as well.
At the time of publication, OpenAI had just released GPT-4o, its latest large language model with image processing capabilities. GPT-4o can recognize streamed video content in real time, a major leap compared to the static image recognition available just a year earlier. This industry moves incredibly fast, and I have high hopes for the future of AI-based accessibility tools.