Until recently, computers were hopeless at producing sentences that made much sense. But the field of natural language processing (NLP) has taken major strides, and machines can now generate convincing passages at the touch of a button.
This progress has been driven by deep-learning techniques that extract statistical patterns in word use and argument structure from huge amounts of text. But a new paper from the Allen Institute for Artificial Intelligence (AI2) draws attention to something that is still missing: machines don't really understand what they write (or read).
This is a fundamental challenge in the grand pursuit of generalizable AI, but it also matters to consumers outside the academic world. Chatbots and voice assistants built on state-of-the-art natural language models, for example, have become the interface for many financial institutions, healthcare providers, and government agencies. Without a genuine understanding of language, these systems are more likely to fail, delaying access to important services.
The researchers built on the Winograd Schema Challenge, a test created in 2011 to evaluate the common-sense reasoning of NLP systems. The challenge uses a set of 273 problems with pairs of sentences that are identical except for one word. That word, known as a trigger, flips the referent of each sentence's pronoun, as shown in the example below:
- The trophy does not fit in the brown suitcase because it is too large.
- The trophy does not fit in the brown suitcase because it is too small.
To succeed, an NLP system must work out which of the two candidates the pronoun refers to. Here, it must select “trophy” for the first sentence and “suitcase” for the second to solve the problem correctly.
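To make the task concrete, here is a minimal sketch of how such a schema pair might be represented and scored. The field names and scoring interface are illustrative, not taken from the actual challenge code:

```python
# Illustrative representation of a Winograd schema "twin" pair and a
# simple accuracy metric over a model's referent choices.
from dataclasses import dataclass

@dataclass
class WinogradPair:
    sentence: str    # sentence containing the ambiguous pronoun
    pronoun: str     # the pronoun to resolve, e.g. "it"
    options: tuple   # the two candidate referents
    answer: str      # the correct referent

pair_large = WinogradPair(
    sentence="The trophy does not fit in the brown suitcase because it is too large.",
    pronoun="it",
    options=("trophy", "suitcase"),
    answer="trophy",
)
pair_small = WinogradPair(
    sentence="The trophy does not fit in the brown suitcase because it is too small.",
    pronoun="it",
    options=("trophy", "suitcase"),
    answer="suitcase",
)

def accuracy(predictions, pairs):
    """Fraction of pairs where the model picked the correct referent."""
    correct = sum(pred == p.answer for pred, p in zip(predictions, pairs))
    return correct / len(pairs)

# A system that always picks the first candidate gets exactly one of the
# two twin sentences right -- the trigger word flips the answer.
preds = [p.options[0] for p in (pair_large, pair_small)]
print(accuracy(preds, [pair_large, pair_small]))  # 0.5
```

The twin structure is the point of the design: any shallow heuristic that ignores the trigger word scores no better than chance across a pair.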
The test was originally designed on the assumption that such problems cannot be answered without a deeper grasp of semantics. State-of-the-art deep-learning models can now achieve around 90% accuracy, so it would seem that NLP has come close to its goal. But in their paper, which will receive the Outstanding Paper Award at next month's AAAI conference, the researchers challenge the effectiveness of the benchmark, and thus the degree of progress the field has actually made.
They created a considerably larger dataset, called WinoGrande, with 44,000 problems of the same kind. To do this, they designed a crowdsourcing scheme to quickly create and validate new sentence pairs. (Part of the reason the original Winograd dataset is so small is that it was crafted by hand by experts.) Workers on Amazon Mechanical Turk created new sentences around required words selected through a randomization procedure. Each pair of sentences was then given to three additional workers and kept only if it met three criteria: at least two workers picked the correct answers, all three judged the options unambiguous, and the pronoun's referent could not be derived through simple word associations.
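The three acceptance criteria can be sketched as a simple filter. The data shapes below (lists of per-validator booleans) are my own illustration, not AI2's actual crowdsourcing pipeline:

```python
# Hedged sketch of the three-way validation step described above.
def keep_pair(votes, unambiguous_flags, solvable_by_word_association):
    """Apply the three acceptance criteria to one candidate sentence pair.

    votes: list of 3 booleans -- did each validator pick the right answer?
    unambiguous_flags: list of 3 booleans -- did each validator judge the
        two options unambiguous?
    solvable_by_word_association: True if the pronoun's referent can be
        guessed from simple word co-occurrence alone.
    """
    return (
        sum(votes) >= 2                       # at least two correct answers
        and all(unambiguous_flags)            # all three found it unambiguous
        and not solvable_by_word_association  # no word-association shortcut
    )

print(keep_pair([True, True, False], [True, True, True], False))   # True
print(keep_pair([True, False, False], [True, True, True], False))  # False
```

Note that all three conditions must hold; failing any single one discards the candidate pair.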
As a final step, the researchers also ran the dataset through an algorithm to remove as many “artifacts” as possible – unintended patterns or correlations in the data that could help a language model arrive at the right answers for the wrong reasons. This reduced the chance that a model could learn to game the dataset.
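The general idea behind such filtering is to repeatedly train weak models on random splits of the data and discard instances they get right too often, since those are the ones likely to carry artifacts. The pure-Python sketch below is my own simplification of that idea: the toy nearest-centroid "weak model", the thresholds, and the data are illustrative, not the actual implementation, which works on learned sentence embeddings:

```python
# Simplified adversarial-filtering sketch: instances that cheap models
# classify correctly in most held-out appearances are flagged as likely
# artifact-carriers and removed.
import random

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(x, c0, c1):
    """Nearest-centroid prediction: label 0 or 1."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 0 if d0 <= d1 else 1

def adversarial_filter(instances, labels, rounds=50, cutoff=0.8, seed=0):
    """Keep only instances that weak models do NOT reliably get right."""
    rng = random.Random(seed)
    hits = [0] * len(instances)
    seen = [0] * len(instances)
    idx = list(range(len(instances)))
    for _ in range(rounds):
        rng.shuffle(idx)
        train, held = idx[: len(idx) // 2], idx[len(idx) // 2:]
        cls0 = [instances[i] for i in train if labels[i] == 0]
        cls1 = [instances[i] for i in train if labels[i] == 1]
        if not cls0 or not cls1:
            continue  # degenerate split: skip this round
        c0, c1 = centroid(cls0), centroid(cls1)
        for i in held:
            seen[i] += 1
            if classify(instances[i], c0, c1) == labels[i]:
                hits[i] += 1
    return [i for i in range(len(instances))
            if seen[i] == 0 or hits[i] / seen[i] <= cutoff]

# In this perfectly separable toy set, every instance is an easy win
# for the weak model, so the filter removes all of them.
data = [[0.0]] * 4 + [[1.0]] * 4
kept = adversarial_filter(data, [0, 0, 0, 0, 1, 1, 1, 1])
print(kept)  # []
```

In the real setting, most instances survive; only the suspiciously easy ones are pruned, which is what makes the remaining benchmark harder to game.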
When they tested state-of-the-art models on these new problems, performance dropped to between 59.4% and 79.1%. Humans, by contrast, still achieved 94% accuracy. This means that a high score on the original Winograd test is likely inflated. “It's just a dataset-specific achievement, not a general-task achievement,” says Yejin Choi, an associate professor at the University of Washington and a senior research manager at AI2, who led the research.
Choi hopes the dataset will serve as a new benchmark. But she also hopes it will inspire more researchers to look beyond deep learning. For her, the results underscore that truly common-sense NLP systems must incorporate other techniques, such as structured knowledge models. Her earlier work in this direction has been promising. “We have to find another game plan somehow,” she says.
The paper has received some criticism, however. Ernest Davis, one of the researchers behind the original Winograd challenge, says that many of the example sentences cited in the paper are seriously flawed, with confusing grammar. “They don't match the way people who speak English actually use pronouns,” he wrote in an email.
But Choi notes that truly robust models shouldn't need perfect grammar to understand a sentence. People who speak English as a second language sometimes mix up their grammar but still convey their meaning.
“People can easily understand what our questions are about and choose the right answer,” she says, referring to the 94% human accuracy. “If people can do that, my view is that machines should be able to do that, too.”