Live blogging Dr. David Ferrucci’s address to the IBM Watson University Symposium at Harvard Business School and MIT Sloan School of Management. Ferrucci was director of the IBM Watson project.
Additional coverage is on the Smarter Planet Blog.
Ferrucci tells a story about a quote from his daughter: “Interesting things are boring.” That was her frame of reference: interesting things involve a lot of complexity, and that complexity makes them boring to people who aren’t explicitly interested in them.
We’re trying to move from the age of moving bits to the age of understanding their meaning. Meaning is ultimately subjective and we are the subjects. We see a calculator with dice on top and we infer it has something to do with calculating odds. We bring a huge wealth of background knowledge to interpreting what an image means.
Don’t expect AI systems to originate meaning, but expect them to infer it. We have to get AI systems to make the right assignments of meaning; we can’t expect them to originate meaning.
You can look at context from a chronology or magnitude or causation perspective. If you want computers to detect human meanings, it takes advanced technology.
Jeopardy was an interesting challenge. The end goal was to get smarter at processing the language humans use and detecting the meanings they intend. Watson and Jeopardy were milestones in that direction.
You couldn’t play Jeopardy by having thousands of FAQs and linking to their answers. Example: “The first person mentioned by name in ‘The Man in the Iron Mask’ is this hero of a previous book by the same author.” The answer is D’Artagnan. If you’re going to buzz in on that question, you’d better have an accurate estimate of the probability that your answer is correct. You have to look at all the evidence supporting that answer and analyze it so that the probability estimate gets better and better.
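To make that concrete: the idea is to turn many independent evidence scores for a candidate answer into a single calibrated probability. Here is a minimal sketch in Python, assuming a simple logistic combination with invented scores and weights; Watson’s actual confidence model was far larger and its weights were learned from thousands of prior clues.

```python
import math

def answer_confidence(evidence_scores, weights, bias=0.0):
    """Combine per-evidence scores into a single probability via a
    logistic function. Scores, weights, and bias are illustrative."""
    z = bias + sum(w * s for w, s in zip(weights, evidence_scores))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical evidence scores for the candidate "D'Artagnan":
# e.g., passage support, answer-type match ("hero"), author match (Dumas).
scores = [0.9, 0.7, 0.8]
weights = [2.0, 1.5, 1.0]   # illustrative weights; Watson learned these

p = answer_confidence(scores, weights, bias=-2.0)
print(f"confidence: {p:.2f}")   # buzz in only if this clears a threshold
```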
Example: “This actor, Audrey’s husband from 1954 to 1968, directed her as Rima the bird girl in ‘Green Mansions.’” You need to work out what “directed” means: to guide, to lead, to direct a film? What is “Green Mansions”? Watson would produce multiple syntactic assignments because there was no guarantee that any single parse of the clue was correct. You need to evaluate all of these assignments in parallel to see which are most likely.
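A toy illustration of keeping competing readings alive rather than committing to one parse up front; the interpretations and the scores attached to them are invented:

```python
# Toy sketch: keep several readings of "directed" alive and rank them,
# instead of committing to a single parse of the clue.
interpretations = [
    ("directed = guided/led a person", 0.35),
    ("directed = directed a film",     0.80),
    ("directed = aimed/pointed",       0.10),
]

best = max(interpretations, key=lambda pair: pair[1])
print("most likely reading:", best[0])
```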
In a random sample of 20,000 past Jeopardy questions we found 2,500 distinct types. The most frequent occurred less than 3% of the time. The types didn’t help us map questions to answers, but they did help us identify the different ways questions can be parsed and understood.
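That long-tail finding is the kind of thing a simple frequency count over labeled clue types reveals. A sketch with invented labels standing in for the real 20,000-question sample:

```python
from collections import Counter

# Invented stand-in for the labeled sample; each entry is a clue-type label.
clue_types = ["anagram", "definition", "fill-in-the-blank",
              "definition", "etymology", "anagram", "definition"]

counts = Counter(clue_types)
total = len(clue_types)
for qtype, n in counts.most_common():
    print(f"{qtype}: {n / total:.1%} of clues")
```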
Plausible inference varies by context. Example: In the category Lincoln Blogs: “Treasury Secy. Chase just submitted this for the third time. Guess what, pal? This time I’m accepting it.” The answer is “resignation.” A sixth-grade class came up with a different answer: “Friend request.”
“Vessels sink.” “Sink” can mean to become submerged, but you can also sink a cue ball.
We calculated that the winning player on Jeopardy gets to answer about 50% of the clues, and answers those correctly about 85% of the time. Ken Jennings was able to answer 62% of the board, which was phenomenal.
Fundamental bets we made:
- Large hand-crafted models are too limited.
- Intelligence from the capability of many: We had to come up with many hypotheses and algorithms to attack the problem, and then combine them and balance how each was applied.
- Massive parallelism was a key enabler: We had to pursue many competing, independent hypotheses over large amounts of data.
The notion of intelligence from the capability of many enabled us to get a lot of different components to interoperate with each other. We had to build many different ways to combine them.
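The “capability of many” bet boils down to running many independent scorers over each candidate answer and weighting how much to trust each one. A minimal sketch, with hypothetical scorer functions and hand-picked weights; in the real system there were hundreds of components and the weights were learned from prior games.

```python
# Minimal sketch of "intelligence from the capability of many":
# several independent scorers each judge a candidate answer, and a
# weighted combination decides. Scorers and weights are hypothetical.

def passage_support(candidate, clue):
    # Placeholder: would search text for passages linking clue and candidate.
    return 0.8

def type_coercion(candidate, clue):
    # Placeholder: would check the candidate matches the expected answer type.
    return 0.6

def popularity_prior(candidate, clue):
    # Placeholder: would use how often the candidate appears as an answer.
    return 0.4

SCORERS = [passage_support, type_coercion, popularity_prior]
WEIGHTS = [0.5, 0.3, 0.2]   # fixed here; in Watson these were learned

def combined_score(candidate, clue):
    return sum(w * s(candidate, clue) for w, s in zip(WEIGHTS, SCORERS))

print(combined_score("D'Artagnan", "The first person mentioned by name..."))
```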
Watson Architecture
Watson generates many possible alternatives, or candidate answers. For each of those hypotheses it gathers a large amount of evidence: it may take 100 interpretations and gather 100 facts for each, or 10,000 pieces of evidence in all. We then run them through filters to calculate a probability for each answer, and if the best one is above a certain threshold, say 50%, Watson buzzes in.
Watson also has to factor in the state of the competition. If it’s way ahead, it’s less likely to buzz in on a lower-probability answer; if it’s way behind, it gets more aggressive.
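Putting those pieces together: generate candidates, gather and score evidence, then decide whether to buzz, with the threshold shifting based on the game situation. A toy sketch under those assumptions; the candidate generation, scores, and threshold values are all invented, and the real DeepQA pipeline ran thousands of these steps in parallel.

```python
# Toy sketch of the loop described above. Candidate generation and
# scoring are stubbed out; all numbers are illustrative.

def generate_candidates(clue):
    # Placeholder: Watson pulled candidates from many sources
    # (search results, title lists, databases); we return a fixed set.
    return ["D'Artagnan", "Aramis", "Porthos"]

def score_candidate(candidate, clue):
    # Placeholder: would combine hundreds of evidence scores into a
    # probability that the candidate is correct.
    return {"D'Artagnan": 0.82, "Aramis": 0.08, "Porthos": 0.05}[candidate]

def buzz_threshold(my_score, leader_score):
    # Hypothetical game-state adjustment: play it safe with a big lead,
    # gamble more when far behind.
    if my_score > leader_score + 5000:
        return 0.65
    if my_score < leader_score - 5000:
        return 0.35
    return 0.50

def decide(clue, my_score, leader_score):
    ranked = sorted(((score_candidate(c, clue), c)
                     for c in generate_candidates(clue)), reverse=True)
    confidence, best = ranked[0]
    return (best if confidence >= buzz_threshold(my_score, leader_score)
            else None)   # None = don't buzz

print(decide("The first person mentioned by name in 'The Man in the Iron Mask'...",
             my_score=12000, leader_score=9000))
```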
The goal for every researcher was: can they drive end-to-end system performance? We changed the cultural incentives. We all went into one room, with everyone focused on natural language processing, information retrieval, knowledge representation and more. We produced more than 8,000 documents.
Some early answers:
- “Decades before Lincoln, Daniel Webster spoke of government ‘made for,’ ‘made by’ & ‘answerable to’ them.” Watson: “No One.”
- “Give a Brit a tinkle when you get into town and you’ve done this.” Watson: “Urinate.”
This system is complex in ways that no one imagined. We did end-to-end integration almost every two weeks, then dug into the data to determine which factors were contributing to right and wrong answers.
We got to 87% to 90% precision, which was good enough to compete. We went into the final game with Ken Jennings. Our analytics told us we had about a 74% chance of winning.
It took about two hours to answer one question on a single 2.6 GHz CPU. We used about 15TB of RAM and 2,000 parallel cores.
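Those numbers explain why massive parallelism mattered: assuming close to linear scaling, spreading a two-hour single-core computation across roughly 2,000 cores brings it into the few-second range needed for live play. A back-of-envelope check:

```python
# Back-of-envelope estimate from the figures quoted above,
# assuming perfect (linear) parallel scaling.
single_core_seconds = 2 * 60 * 60   # ~2 hours per clue on one 2.6 GHz core
cores = 2000
ideal_seconds = single_core_seconds / cores
print(f"~{ideal_seconds:.1f} s per clue under ideal scaling")
```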
I fought hard to get an answer panel on TV so people could see Watson computing possibilities. This got people thinking about what was going on behind the scenes.
An answer doesn’t really matter if you can’t back it up with evidence that a human can understand. When we apply this to health care or finance, you need to provide the evidence of why you think this is an important answer, but you have to do it in a way that people can understand. The computer isn’t giving answers but providing evidence.
Applications in health care: You’re finding hypotheses that weren’t clearly evident and gathering the evidence that supports them. You can imagine information coming from patient data, symptoms, office visits, and background databases, computing evidence profiles, providing those profiles to the medical teams and helping them see how the evidence supports each hypothesis.
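One way to picture such an evidence profile is as a structure that ties a hypothesis (say, a candidate diagnosis) to the pieces of evidence that support or weigh against it. This is a hypothetical sketch, not IBM’s actual data model, and the aggregation rule is deliberately crude.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    source: str        # e.g., "patient record", "symptom report", "journal article"
    statement: str
    supports: bool     # True if it supports the hypothesis, False if it weighs against
    strength: float    # 0..1, how strongly it counts

@dataclass
class Hypothesis:
    label: str                     # e.g., a candidate diagnosis
    evidence: list = field(default_factory=list)

    def confidence(self):
        # Crude illustrative aggregation: net supporting strength, clamped to [0, 1].
        signed = [e.strength if e.supports else -e.strength for e in self.evidence]
        return max(0.0, min(1.0, 0.5 + 0.5 * sum(signed) / max(len(signed), 1)))

# Hypothetical usage: two pieces of evidence for one candidate diagnosis.
h = Hypothesis("Lyme disease")
h.evidence.append(Evidence("symptom report", "erythema migrans rash", True, 0.9))
h.evidence.append(Evidence("patient record", "no recent tick exposure noted", False, 0.3))
print(h.label, round(h.confidence(), 2))
```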
What do people intend when they use words? “Geesh, she was only 10 when she took home an Oscar in 1974. She’s 40 now.” Watson’s confidence was low. It knew that Tatum O’Neal won an Oscar at age 10, but it didn’t understand “take home.” So we asked people to answer that question so that Watson could understand what the term meant.
It’s not enough to assign semantics to a sentence. You want to interact with people who can help you understand meaning.
Question: How long until Watson can program itself?
As Watson trains on prior questions and answers, it’s balancing those inputs and learning how to weight the possibilities. Different problems require different balancing. I can’t give you a short answer, because it depends on what you mean by programming yourself. The short answer is I don’t know.
Question: It sounds like you ran out of gas at some point in comparing Watson’s performance to humans. Is there some limit?
It does slow down. At some point, there are things that are incredibly hard for a computer to do. Can we push that line upward? I think we can, but for Jeopardy there was only so far we needed to push it. The game is also evolving. We learned that data prior to 2003 was easier for the computer to understand. After we learned that, we focused only on data post-2003.
Question: What areas is this technology most appropriate for?
We’ve only focused on health care so far. It’s harder for the machine to understand different domains. In health care, you don’t have as much training data and the input is not just a small graph. It’s a huge graph of different factors and relationships. In health care we had to tackle the multi-dimensional input problem. Because of the limited ontology in that domain, it’s somewhat easier to tackle health care than some other areas.