B. GRAU, O. FERRET, M. HURAULT-PLANTET,
C. JACQUEMIN, L. MONCEAUX, I. ROBBA AND A. VILNAT
COPING WITH ALTERNATE FORMULATIONS
OF QUESTIONS AND ANSWERS
Abstract: This chapter presents the QALC system, which has participated in the four TREC QA
evaluations. We focus on the problem of linguistic variation, which must be handled in order to relate
questions and answers. We first present variation at the term level, which consists in retrieving question
terms in document sentences even when morphological, syntactic or semantic variations alter them. We then
turn to variation at the sentence level, which we handle as different partial reformulations of
questions. Questions are associated with extraction patterns based on the syntactic type of the question and
the object being queried. We describe the whole system in order to situate how QALC deals with
variation, and we report different evaluations.
1. INTRODUCTION
The huge quantity of available electronic information creates a growing need for
tools that are precise and selective. Such tools have to
provide answers to requests quickly, without requiring users to explore large amounts
of text or to reformulate their requests. From this viewpoint, finding
an answer consists not only in finding relevant documents but also in extracting
relevant parts from them if the question is a factual one, or in summarizing them if the
request is thematic. This leads us to express the QA problem in terms of an
information retrieval problem that can be solved using natural language processing
(NLP) approaches.
Pure NLP solutions were studied at the end of the seventies to answer questions
as in QUALM, the well-known system of Lehnert (1977). This system analyzed
small stories about specific topics (traveling by bus, going to the restaurant, etc.),
transformed them into a conceptual representation and answered questions by
choosing a strategy depending on the kind of information sought. It did so by
reasoning over the conceptual representation, making use of general
knowledge. In a restricted domain, some recent work such as Extrans (Berri, Mollá
Alliod & Hess, 1998) also made use of an NLP approach: its purpose was to analyze
the Unix manual, to represent it in a logical form and to make inferences to answer
T. Strzalkowski and S. Harabagiu, (eds.), Advances in Open Domain Question Answering, 189-226
© 2008 Springer.
questions. Nevertheless, Extrans backs off to a weaker mode that exploits
keywords when the NLP resolution fails.
The intensive use of semantic and pragmatic knowledge prevents the application
of these approaches to open domain questions. As a matter of fact, the resolution
strategy has to be adapted to work in such an environment, relaxing the constraints
on the conceptual representation. If sentence representations are closer to the surface
form, they involve less knowledge and they can be built automatically on a larger
scale. Thus, while the kind of information required remains the same,
one can view the search for the answer not as an inference problem, but as a reformulation
problem: according to what is asked, find one of the different linguistic expressions
of the answer in all candidate sentences. The answer phrasing can be considered as
an affirmative reformulation of the question, partially or totally, which entails the
definition of models that match sentences containing the answer. The kind of model
and the matching criteria differ greatly from one approach to another.
Strategies range from finding certain words of the questions in the sentence and
selecting a noun phrase of the expected type – a minimal strategy applied by all
the Question Answering (QA) systems in TREC – to building a structured
representation that makes explicit the relations between the words of the question
and which is compared to a similar representation of the sentences (Harabagiu,
Pasca & Maiorano, 2000; Hovy, Hermjakob & Lin, 2001b). Since a complete
parse of sentences remains an unsolved problem, our position lies halfway between
these two extremes. It consists
in a partial reformulation of the question, centered on the question focus and
expressed by syntactic constraints.
While the expected answer type is rather precise when a question asks for a
named entity (for example, the question When is Bastille Day? requires a date as
answer and the question What is the name of the managing director of Apricot
Computer? requires a person name), it remains general for others, such as
questions asking for a definition, as in What is a nematode?, or for a cause. In the
former case, the answer type is such that its recognition in sentences can rely on
patterns that are independent of the question terms. Thus, finding an answer
mainly requires recognizing an instance of the expected named entity. In
the latter case, however, the answer cannot be specified by itself and must be described by a
pattern that involves relationships with some question terms. This leads us to talk
about linguistic patterns of answers. Nevertheless, whatever criteria are applied, they
all require the modeling of linguistic variation at some level.
At the term level, sentences that answer What is the average salary of a
professional baseball player? will certainly contain an expression about salary,
which might be the average pay, and an expression about baseball player, which
might be baseball professional. The first formulation involves a semantic variation
by using a synonym, while the second example relies on a syntactic variation of a
noun phrase.
At the sentence level, when looking for a definition, as requested in What is
epilepsy?, the answer might be expressed with epilepsy is a seizure disorder or a
person has a seizure disorder or epilepsy …, corresponding to several formulations
of the same information involving syntactic variations. These answer formulations
can be described by the following patterns: “epilepsy is NP” and “NP or epilepsy”
where NP stands for a noun phrase that contains the answer. The general principle
of our QA system consists of determining the type of information sought in
order to know which patterns best describe an affirmative reformulation. These
patterns allow the system to find the answer in a selected sentence.
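As an illustration only, the two definition patterns quoted above can be sketched with regular expressions. This is a minimal sketch and not the QALC implementation: the names NP, definition_patterns and extract_definition are hypothetical, and the noun phrase is approximated by an optional determiner followed by one to three words, whereas a real system would obtain noun phrase boundaries from a tagger or a parser.

```python
import re

# Assumption: a crude stand-in for a noun phrase (optional determiner plus
# one to three alphabetic words). Real NP boundaries would come from a parser.
NP = r"\b(?:(?:a|an|the)\s+)?[a-z]+(?:\s+[a-z]+){0,2}"


def definition_patterns(focus):
    """Build the two illustrative patterns '<focus> is NP' and 'NP or <focus>'."""
    f = re.escape(focus)
    return [
        re.compile(rf"\b{f}\s+is\s+(?P<np>{NP})", re.IGNORECASE),
        re.compile(rf"(?P<np>{NP})\s+or\s+{f}\b", re.IGNORECASE),
    ]


def extract_definition(focus, sentence):
    """Return the text matched by NP for the first pattern that applies, else None."""
    for pattern in definition_patterns(focus):
        match = pattern.search(sentence)
        if match:
            return match.group("np")
    return None
```

Applied to the two example sentences for What is epilepsy?, both patterns isolate the same candidate noun phrase, which is the point of treating them as alternate formulations of one answer.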
Before detailing our approach, we examine related work on linguistic variation
in section 2 in order to provide a context. Section 3 then gives
a general description of our system, QALC, in order to give a complete view of
our solution and to situate within our architecture the role of the different modules
described in the following sections. The recognition of term variants, performed by
Fastr (Jacquemin, 2001) in our system, helps both the process that selects relevant
passages and the question-sentence pairing process. It is presented in section 4.
Our criteria for choosing the answering strategy depend on which information is
deduced when analyzing the question. This information comprises one or several of the following
features: a) a named entity type that characterizes the answer; b) the syntactic form
of the question; c) the question focus, which is a noun that is generally present in the
answer formulation; d) the associated answer patterns.
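The four features can be pictured as a small record attached to each question. The sketch below is an assumption made for illustration: the class QuestionAnalysis, its field names, the function answering_strategy, the strategy labels and the keyword fallback are all hypothetical and do not come from QALC. It only mirrors the distinction drawn earlier between questions whose answer is a named entity and questions whose answer must be described by patterns.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QuestionAnalysis:
    """Hypothetical container for the features a) to d) listed above."""
    text: str
    named_entity_type: Optional[str] = None  # a) e.g. "DATE" or "PERSON"
    syntactic_form: Optional[str] = None     # b) e.g. "What is NP?"
    focus: Optional[str] = None              # c) noun expected in the answer
    answer_patterns: List[str] = field(default_factory=list)  # d)


def answering_strategy(q):
    """Pick an illustrative strategy label from the available features."""
    if q.named_entity_type is not None:
        return "named-entity"   # look for an instance of the expected type
    if q.answer_patterns:
        return "pattern"        # match reformulation patterns around the focus
    return "keyword"            # assumed fallback: rely on question words only


# Hand-filled examples based on questions quoted in the text.
bastille = QuestionAnalysis(text="When is Bastille Day?", named_entity_type="DATE")
epilepsy = QuestionAnalysis(
    text="What is epilepsy?",
    syntactic_form="What is NP?",
    focus="epilepsy",
    answer_patterns=["epilepsy is NP", "NP or epilepsy"],
)
```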
Our question analysis module makes use of a syntactic parser. We will discuss in
section 5 why we use such a parser and how it is integrated in our system. We will
then discuss how we make use of the different question features. First, section 6
details the recognition in sentences of a noun phrase similar to the question focus, and its
impact on the sentence selection process relative to the other selection criteria.
Finally, in section 7, we present how question categories lead
us to associate with each question a set of reformulation patterns.
2. LINGUISTIC VARIATION IN RELATED WORK
2.1 Paraphrase at the Term Level
Paraphrase is the natural human capacity to use different wordings for expressing
the same conceptual content. Many text processing applications have to deal with
paraphrase for covering alternate formulations with a similar semantic content.
Generating paraphrases is useful in Natural Language Generation (Robin, 1994;
Barzilay and McKeown, 2001) because it offers the possibility to use different
formulations depending on the context. In Information Retrieval and, especially, in
QA applications, it is necessary to cope with paraphrase at various levels of the
process. Paraphrase should be accounted for at the indexing level in order to conflate
indexes that correspond to similar concepts. Index conflation is taken into
consideration by Fastr, which performs term variant recognition (Jacquemin, 2001).
At the querying and pairing level, it is also necessary to recognize variant phrasings
in order to associate different formulations of the same information need with their
corresponding indexes (Lin and Pantel, 2001).
Once the need for recognizing paraphrases is established, there are several
possibilities for processing text documents and conflating paraphrase text chunks.
Early attempts at variant conflation, such as (Sparck Jones and Tait, 1984), use
semantically-rich approaches. In these approaches, it is assumed that (1) a full parse
tree can be produced for any sentence in the document and (2) the semantic and
morphological links required for the detection of any paraphrase exist in a database.
Even though there have been important developments in the design of large scale
parsers and in the enrichment of thesauri and term banks, it is unrealistic to assume
that the two preceding requirements can be satisfied in large-scale information
access applications such as QA.
Recent developments in large-scale paraphrase recognition do not require full
in-depth analyses. Instead, they rely on shallow parsers such as Minipar (Berwick,
1991), or a combination of part-of-speech patterns and lexical features (Barzilay and
McKeown, 2001; Jacquemin, 2001). Although exhaustiveness in paraphrase patterns
and associated morphological and semantic links is unrealistic, recent approaches to
paraphrase recognition combine machine learning techniques, reuse of manually built
semantic or morphological databases, and distributional similarities. Barzilay
and McKeown (2001) extract corpus-based paraphrases from multiple
translations of the same text through learning algorithms inspired by machine
translation techniques. This technique improves upon classical machine translation
by providing associations between single and multiple word expressions. With the
same purpose in mind, the classical algorithms for extracting semantic classes
through distributional similarities were improved by Lin and Pantel (2001) by using
similarities between shallow parse trees. The resulting output contains paraphrases at
the lexical or phrase level that are missing from manually generated variations. In
(Jacquemin, 2001), paraphrase pattern discovery relies on progressive corpus-based
tuning. This approach separates the different levels of variant
construction (structural and syntactic, morphological, and semantic). Structural
similarities are first extracted through corpus-based tuning. In a second step, the
combination of structure and semantic information can be refined by associating
specific structures with specific lexical classes based on shallow semantic features
(Fabre and Jacquemin, 2000).
In QALC, the QA system developed at LIMSI, variation is accounted for at the
indexing level and at the question analysis level. At the indexing level, variant
indexes are conflated through term variant recognition. Term variation involves
structural, morphological, and semantic transformations of single-word or multi-word
terms. The semantic links are extracted from WordNet (Fellbaum, 1998)
synonymy relations. The morphological links for inflectional morphology result from
lemmatization performed by the TreeTagger (Schmid, 1999). As for derivational
morphology, two words are morphologically related if they share the same
derivational root in the CELEX database (CELEX, 1998). Both morphological and
semantic links are combined in the structural associations obtained through
corpus-based tuning. Term paraphrase recognition is used for dynamic, query-based
document ranking at the output of the search engine. Documents that contain variants
of the query terms are paired with the corresponding queries. As a result, linguistic
variation is explicitly addressed through the exploitation of word paradigms,
in contrast to other approaches such as the one taken in COPSY (Schwarz, 1988), where
an approximate matching technique between the query and the documents implicitly
takes it into account.
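The combination of inflectional, derivational and semantic links can be illustrated with a toy matcher. This is a sketch under strong assumptions and not Fastr, whose variant recognition relies on corpus-tuned structural patterns: here the structural constraint is replaced by a simple token window, and the dictionaries LEMMAS, SYNONYMS and ROOTS are tiny hand-made stand-ins for the output of a lemmatizer, for WordNet synonymy and for CELEX derivational roots. All names are hypothetical.

```python
import re

# Hand-made stand-ins (assumptions) for the resources named in the text.
LEMMAS = {"salaries": "salary", "players": "player"}  # inflectional links
SYNONYMS = {"salary": {"pay", "wage"}}                # semantic links
ROOTS = {"player": "play", "playing": "play"}         # derivational links


def lemma(word):
    word = word.lower()
    return LEMMAS.get(word, word)


def related(a, b):
    """Return the kind of link between two words, or None if there is none."""
    a, b = lemma(a), lemma(b)
    if a == b:
        return "identical"
    if b in SYNONYMS.get(a, ()) or a in SYNONYMS.get(b, ()):
        return "semantic"
    root_a, root_b = ROOTS.get(a), ROOTS.get(b)
    if root_a is not None and root_a == root_b:
        return "morphological"
    return None


def find_variant(term, sentence, window=4):
    """Find a window of tokens in which every term word has a related token."""
    term_words = term.split()
    tokens = re.findall(r"[A-Za-z]+", sentence)
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        links, used = [], set()
        for word in term_words:
            for pos, token in enumerate(span):
                link = related(word, token)
                if link and pos not in used:
                    used.add(pos)
                    links.append((word, token, link))
                    break
            else:
                break  # this term word has no counterpart in the window
        else:
            return links  # every term word was paired with a distinct token
    return None
```

For instance, the term average salary is paired with the average pay through one identical word and one synonym, which is the kind of explicit link that an approximate string matching technique would not expose.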