We’ve been working recently to bring our website into compliance with ADA requirements, and one of our big issues is 300+ hours of video without closed captioning or transcripts. While we’re putting process and policy in place to ensure new videos have them, we’ve been trying to figure out the best way to deal with transcribing all the old material. Enter IBM’s Watson.
I stumbled on the Watson APIs while investigating options for transcribing our video. The full set of services is listed on the Watson Services page on the Watson Developer Cloud site. The specific API that piqued my interest was the Speech to Text API. I reached out to IBM, had some really good conversations about how the Watson AI could help us with this project (at a substantial cost savings over human-based transcription services), and saw Watson in action, very impressively transcribing some video in real time with excellent accuracy. This looked like a great opportunity to try out IBM’s AI tools on something that wouldn’t expose any significant bias. All we wanted to do was take someone talking and turn it into text. I really thought this could be a great use of AI. I was wrong.
The IBM folks gave me a test site where I could try a few videos of ours to see how reliable it was. So I did. The first video I tried had some music in the background, and Watson has a hard time when there is background sound. Fair enough. I’ve done enough audio work to understand how that could be a problem. Once I figured that out, I tried some others with no background noise. Just a couple people talking. Here’s one of the videos I tried:
And here’s a bit of the transcription I got back:
What is the mariners family spirit to you.. Well the merit of suddenly spirit to mean means %HESITATION.. The sense of coming to go there.. …I think within the Marist brothers.. You know we kind of look at it brothers and priests coming together but let me look at it from the bigger perspective of the parents family.. Brother spree sisters.. Les collaborators those who join us in our mission coming together …for common purpose the calm in Michigan so in our apostolic ministries so here are some amount in our schools how do we come together as professed religious lake committed people lay collaborators in our ministry to push forward this common goal that we have and how do %HESITATION.. You know where charm about how do we pass on our marriages terrorism in such a way that.. Students and faculty …see themselves played an active role in this Maraniss stating that we have here on campus..
There are some spots where this isn’t bad, but this is one of the good ones. I tried some others with much worse results. I tried some videos from a few TED Talks and got much better results. After some further conversations, I learned two things about Watson. First, it’s mostly been taught legal, medical, and tech language. Given that, any of the religion-specific phrases caused Watson some trouble. Second, Watson has been taught mostly from Midwestern US speakers (what people often describe incorrectly as “neutral”). So not only were many of the Hawaiian words and phrases garbled (ʻOhana, or family spirit, for instance), Watson basically couldn’t understand any of the Polynesian or Asian accents very common here in Hawaiʻi. Given that accuracy was our first (and almost only) priority, we abandoned the project. I appreciate the time IBM took to work with us, and how upfront they were about the issues once we pointed them out. This most definitely isn’t meant to be a slam on the technology or the people working on it. The reality, though, is that Watson as a transcriptionist is basically useless in all but a very narrow set of circumstances because Watson has not been taught any cultural awareness.
And that, I think, is the valuable lesson I learned. Even when something “simple” like converting words from spoken to written is involved, the bias of the AI must always be understood and acknowledged. There is never, ever any such thing as neutral.