Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation
Big data is having a disruptive impact across the sciences. Human annotation of semantic interpretation tasks is a critical part of big data semantics, but it is based on an antiquated ideal of a single correct truth that needs to be similarly disrupted. We expose seven myths about human annotation, most of which derive from that antiquated ideal of truth, and dispel these myths with examples from our research. We propose a new theory of truth, crowd truth, that is based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations.

I amar prestar aen … "The world is changed." In the past decade the amount of data and the scale of computation available have increased by a previously inconceivable amount. Computer science, and AI along with it, has moved solidly out of the realm of thought problems and into an empirical science. However, many of the methods we use predate this fundamental shift, including the ideal of truth. Our central purpose is to revisit this ideal in computer science and AI, expose it as a fallacy, and begin to form a new theory of truth that is more appropriate for big data semantics. We base this new theory on the claim that, outside mathematics, truth is entirely relative and is most closely related to agreement and consensus. Our theories arise from experimental data that has been published previously (Aroyo and Welty 2013a, Soberon et al. 2013, Inel et al. 2013, Dumitrache et al. 2013), and we use throughout the article examples from our natural language
Table 1. Example Sentences and Relation Definitions.

No.    Sentence
ex1    [GADOLINIUM AGENTS] used for patients with severe renal failure show signs of [NEPHROGENIC SYSTEMIC FIBROSIS].
ex2    He was the first physician to identify the relationship between [HEMOPHILIA] and [HEMOPHILIC ARTHROPATHY].
ex3    [Antibiotics] are the first line treatment for indications of [TYPHUS].
ex4    With [Antibiotics] in short supply, DDT was used during World War II to control the insect vectors of [TYPHUS].
ex5    [Monica Lewinsky] came here to get away from the chaos in [the nation's capital].
ex6    [Osama bin Laden] used money from his own construction company to support the [Muhajadeen] in Afghanistan against Soviet forces.
def1   MANIFESTATION links disorders to the observations that are closely associated with them; for example, abdominal distension is a manifestation of liver failure.
def2   CONTRAINDICATES refers to a condition that indicates that a drug or treatment SHOULD NOT BE USED; for example, patients with obesity should avoid using danazol.
def3   ASSOCIATED WITH refers to signs, symptoms, or findings that often appear together; for example, patients who smoke often have yellow teeth.
hold, like analyzing a segment of music for its mood (Lee and Hu 2012). In our research we have found countless counterexamples to the one truth myth. Consider example ex1 in table 1. Annotators were asked what UMLS relation was expressed in the sentence between the highlighted terms, and they disagreed, some choosing the side-effect relation, others choosing cause. Looking closely at the sentence, either interpretation looks reasonable; in fact, one could argue that in general the cause relation subsumes the side-effect relation, and as a result this isn't disagreement at all. However, the definitions of the relations are that cause is a strict sufficient causality and side-effect represents the possibility of a condition arising from a drug. We might rule in favor of one or the other relation being appropriate here, but in actuality most experts are unable to make the distinction in reading the sentence, and it seems quite reasonable to suppose that the semantics of the relations, while they may be ontological, are not linguistic: they are difficult, or at the very least uncommon, to express in language. The fact of the matter seems to be that experts and nonexperts alike have varying degrees of difficulty understanding why the side-effect relation and the cause relation are different, but they are uniformly unable to tell when a sentence expresses one or the other. This clearly indicates that the "correct" interpretation of sentence ex1 is a matter of opinion; there is not one true interpretation.

Disagreement Is Bad

When empirically grounded AI work began, it was noticed that if you give the same exact annotation task to two different people, they will not always generate the same ground truth. Rather than accepting this as a natural property of semantic interpretation, disagreement has been considered a measure of poor quality in the annotation task, either because the task is poorly defined or because the annotators lack sufficient training. However, in our research we found that disagreement is not noise but signal, and at the level of individual examples it can indicate that the sentence or the relation at hand is ambiguous or vague, or that the worker is not doing a good job. Sentence ex4 in table 1 provides a good illustration of ambiguity in a sentence, where we found annotators disagreed on what relation was expressed between the highlighted terms. In a very deep reading of the sentence, one may conclude that Antibiotics treat Typhus, because why else would their shortage cause you to eliminate the carriers of the disease? However, in a shallower reading the sentence does not clearly express any relation between the two arguments. In example ex3, the sentence is quite precise and clear about the relationship, and we see this at the level of annotator disagreement: it is high for sentence ex4 and nonexistent for sentence ex3. This corresponds well with what we consider to be the suitability of each sentence for lexical-based relation extraction. Disagreement is giving us information.
[Figure 1. Accumulated worker annotations per relation with 10, 20, and 30 workers per sentence. X-axis: the relation labels TB, T, MT, PB, P, MP, BD, D, MD, CO, C, MC, LO, HM, DF, SS, plus OTH and NO; y-axis: number of annotations (0-140).]
Detailed Guidelines Help

The perceived problem of low annotator agreement is typically addressed through the development of detailed guidelines for annotators that help them consistently handle the kinds of cases that have been observed, through practice, to generate disagreement. We have found that increasingly precise annotation guidelines do eliminate disagreement but do not increase quality, perfuming the agreement scores by forcing human annotators to make choices they may not actually think are valid, and removing the potential signal on individual examples that are vague or ambiguous. For example, the ACE 2002 RDC guidelines V2.3 say that "geographic relations are assumed to be static," and claim that sentence ex5 expresses the located relation between the two arguments, even though one clear reading of the sentence is that Monica Lewinsky is not in the capital. A further problem with overly specifying the guidelines is that it often leads to crisp definitions of relations that make sense from an ontological perspective (that is, the relations exist in the world) but are never expressed in language. Consider the definitions in table 1: the manifestation relation from UMLS has a very precise definition, but we were unable to find examples of it in medical texts. When we turn to crowdsourcing as a potential source of cheaper and more scalable human annotated data, we are faced with the reality that microtask workers won't read long, complex annotation guidelines, and we are forced to keep instructions simple. This turns out to drastically reduce the design period for annotation tasks, which can easily drag on for months (the ACE 2002 guidelines took more than a year). Simplifying guidelines allows annotators to make choices they are more comfortable with, drastically reduces development and training time, and allows for disagreement to be used as a signal.

One Is Enough

Due to the time and cost required to generate human annotated data, standard practice is for the vast majority, often more than 90 percent, of annotated examples to be seen by a single annotator, with a small number left to overlap among all the annotators so that agreement can be measured. We see many examples where just one perspective isn't enough; in some cases there are five or six popular interpretations of a sentence, and they can't be captured by one person.
In several experiments on relation annotation for examples like those shown in table 1, we saw that between 15 and 20 workers per sentence yielded the same relative disagreement spaces as any higher amount up to 50. Figure 1 shows the results of one such experiment on 30 sentences for a different set of 16 medical relations, in which we ran the same sentences through the annotation process with 10, 20, and 30 workers annotating each sentence. What the specific relations are doesn't matter; the graph shows the accumulated results for each relation across all the sentences, and we see that the relative distribution of the workers' annotations looks the same for 20 and 30 workers per sentence, but different at 10. In further experiments we found 15 workers per sentence to be the lowest point where the relative distribution stabilizes; it is very likely this number depends on other factors in the domain, but we have not investigated it deeply. We can learn several things from figure 1, for example that since the most popular choice across the sentences is OTHER, the set of relations given to the workers was not well suited to this set of sentences. We could not learn this from only one annotation per sentence, nor could we learn individual properties of sentences such as ambiguity (discussed above).
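As a rough illustration of this kind of stability check (a sketch of the general idea only, not the procedure used in the experiments above; the input format is an assumption), one can subsample the crowd at different sizes and compare the accumulated relation distributions:

    import random
    from collections import Counter

    def relation_distribution(annotations, k, seed=0):
        """Accumulate relation counts over all sentences using a random
        subsample of k workers per sentence. `annotations` maps a sentence id
        to a list of (worker_id, relation) pairs -- a hypothetical format."""
        rng = random.Random(seed)
        counts = Counter()
        for sentence_id, votes in annotations.items():
            workers = sorted({w for w, _ in votes})
            chosen = set(rng.sample(workers, min(k, len(workers))))
            counts.update(rel for w, rel in votes if w in chosen)
        total = sum(counts.values())
        return {rel: c / total for rel, c in counts.items()} if total else {}

    def distribution_similarity(d1, d2):
        """Cosine similarity between two relation distributions."""
        keys = set(d1) | set(d2)
        dot = sum(d1.get(key, 0.0) * d2.get(key, 0.0) for key in keys)
        n1 = sum(v * v for v in d1.values()) ** 0.5
        n2 = sum(v * v for v in d2.values()) ** 0.5
        return dot / (n1 * n2) if n1 and n2 else 0.0

    # If the distribution at, say, 15 workers is already very close to the
    # distribution at the full crowd size, adding more workers is unlikely
    # to change the relative disagreement space:
    # distribution_similarity(relation_distribution(votes_by_sentence, 15),
    #                         relation_distribution(votes_by_sentence, 30))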
Experts Are Better

Conventional wisdom is that if you want medical texts annotated for medical relations, you need medical experts. In our work, experts did not show significantly better quality annotations than nonexperts. In fact, with 30 microworkers per sentence for the UMLS relation extraction task, we found that 91 percent of the expert annotations were covered by the crowd annotations; the expert annotators reach agreement on only 30 percent of the sentences, and the most popular vote of the crowd covers 95 percent of this expert annotation agreement. Table 2 further shows the relative accuracy of the crowd compared to medical subject matter experts (SMEs). In our analysis, mistakes by the crowd were not surprising, but experts were far more likely than nonexperts to see relations where none were expressed in a sentence, when they knew the relation to be true. In sentence ex2 (table 1), medical experts annotated a causes relation between the two arguments, because they knew it to be true. Crowd annotators did not indicate a relation, and in general tended to read the sentences more literally, which made them better suited to provide examples to a machine.

[Table 2. Subject Matter Expert Versus Crowd Accuracy on UMLS Relation Annotation Task.]

We have also found that multiple perspectives on data, beyond what experts may believe is salient or correct, can be useful. The Waisda? video tagging game study (Gligorov et al. 2011) shows that only 14 percent of tags searched for by lay users could be found in the professional vocabulary used for video annotation (GTAA), indicating that there is a huge gap between the expert and lay users' views on what is important. Similarly, the steve.museum project (Leason 2009) studied the link between a crowdsourced folksonomy of user tags and the professionally created museum documentation. Again, in this separate study only 14 percent of lay user tags were found in the expert-curated collection documentation.
[Figure 2. Example sentence disagreement vectors for 20 sentences (IDs 225527731-225527750); each row records how many workers selected each relation for that sentence. For instance, sentence 735 has all 13 of its workers on a single relation, while the workers on sentence 736 are spread over five different relations.]
All Examples Are Created Equal

In typical human annotation tasks, annotators are asked to say whether some simple binary property holds for each example, like whether sentence ex2 (table 1) expresses the cause relation. They are not given a chance to say that the property may partially hold, or holds but is not clearly expressed. Individual humans are particularly bad at choosing uniformly from scales of choices (like high, medium, low), but by recording disagreement on each example we find that poor quality examples tend to generate high disagreement. In sentence ex4, we find a mix between treats and prevents in the crowd annotations, indicating that the sentence may express either of them but not clearly; as described above, this is an obvious example of a sentence with a high level of vagueness. In sentence ex3, all annotators indicate the treats relation with no disagreement. The disagreement allows us to weight the latter sentence higher than the former, giving us the ability to both train and evaluate a machine in a more flexible way.
Once Done, Forever Valid

Perspectives change over time, which means that training data created years ago might contain examples that are not valid, or only partially valid, at a later point in time. Take for example sentence ex6, and a task in which annotators are asked to identify mentions of terrorists. In the 1990s, [Osama bin Laden] would have been labeled as "hero," and after 2001 he would have been labeled as "terrorist." Considering the time, both labels would be valid, and they introduce two roles for the same entity. We are only just beginning to investigate this particular myth, but our approach includes continuous collection of training data over time, allowing the adaptation of gold standards to changing times. We can imagine cases such as the popularity of music and other clearly more subjective properties of examples that would be expected to change, but even cases that may seem more objective could benefit from continuous collection of annotations as, for example, relative levels of education shift.
Crowd Truth

Crowd truth is the embodiment of a new theory of truth that rejects the fallacy of a single truth for semantic interpretation, based on the intuition that human interpretation is subjective and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations. Crowd truth has allowed us to identify and dispel the myths of human annotation and to paint a more accurate picture of human performance on semantic interpretation for machines to attain.
The key element of crowd truth is that multiple workers are presented with the same object of interpretation, which allows us to collect and analyze multiple perspectives and interpretations. To facilitate this, we represent the result of each worker's annotations on a single sentence as a vector in which each possible interpretation is a dimension of the vector space. In the case of relation extraction, the crowd truth vector has n + 2 dimensions, where n is the number of relations, plus options for NONE and OTHER (allowing a worker to indicate that a sentence does not express a relation at all, or does not express any of the given relations). In these vectors, a 1 is given for each relation the worker thought was being expressed (workers can indicate multiple relations), and we use them to form sentence disagreement vectors for each sentence by summing all the worker vectors for the sentence. An example set of disagreement vectors is shown in figure 2. We use these vectors to compute metrics on the workers (for low quality and spam), on the sentences (for clarity and what relations may be expressed), and on the relations (for clarity and similarity), as follows.
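A minimal sketch of this representation (the relation names here are only illustrative placeholders):

    import numpy as np

    RELATIONS = ["treats", "prevents", "causes", "side_effect"]  # hypothetical relation set
    DIMENSIONS = RELATIONS + ["NONE", "OTHER"]                   # n + 2 dimensions

    def worker_vector(selected):
        """Binary vector with a 1 for every option the worker selected."""
        return np.array([1.0 if d in selected else 0.0 for d in DIMENSIONS])

    def sentence_vector(worker_vectors):
        """Sentence disagreement vector: the sum of all worker vectors."""
        return np.sum(worker_vectors, axis=0)

    # Three (hypothetical) workers annotating the same sentence.
    workers = [worker_vector({"treats"}),
               worker_vector({"treats", "causes"}),
               worker_vector({"NONE"})]
    s = sentence_vector(workers)   # array([2., 0., 1., 0., 1., 0.])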
Worker measures include the worker-sentence disagreement score, which is the average of all the cosines between each worker's sentence vector and the full sentence vector (minus that worker), and the worker-worker disagreement score, which is calculated by constructing a pairwise confusion matrix between workers and taking the average agreement for each worker. The first metric gives us a measure of how much a worker disagrees with the crowd on a sentence basis, and the second gives us an indication as to whether there are consistently like-minded workers. While we encourage disagreement, if a worker tends to disagree with the crowd consistently, and does not generally agree with any other workers, that worker will be labeled low quality. Before computing worker measures, the sentences with the lowest clarity scores (see below) are removed from the disagreement calculations, to ensure that workers are not unfairly penalized if they happened to work on a bad batch of sentences. Our experiments show that these worker metrics are more than 90 percent effective in identifying low quality workers (Soberon et al. 2013).
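A sketch of the two worker measures as described above (the leave-one-out cosine follows the text directly; the pairwise measure below approximates the worker confusion matrix with average pairwise cosines, which is our reading rather than the published definition):

    import numpy as np

    def cosine(u, v):
        nu, nv = np.linalg.norm(u), np.linalg.norm(v)
        return float(u @ v / (nu * nv)) if nu and nv else 0.0

    def worker_sentence_agreement(vecs_by_sentence, worker):
        """Average cosine between this worker's vector on each sentence and
        the sentence vector built from all the other workers; consistently
        low values flag workers who disagree with the crowd."""
        scores = []
        for vecs in vecs_by_sentence:            # one dict {worker: vector} per sentence
            if worker not in vecs or len(vecs) < 2:
                continue
            rest = sum(v for w, v in vecs.items() if w != worker)
            scores.append(cosine(vecs[worker], rest))
        return sum(scores) / len(scores) if scores else 0.0

    def worker_worker_agreement(vecs_by_sentence, w1, w2):
        """Average cosine between two workers on the sentences they share;
        one cell of a pairwise worker agreement matrix."""
        scores = [cosine(vecs[w1], vecs[w2])
                  for vecs in vecs_by_sentence if w1 in vecs and w2 in vecs]
        return sum(scores) / len(scores) if scores else 0.0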
Sentence scores include sentence clarity and the core crowd truth metric for relation extraction, the sentence-relation score (SRS). The SRS is measured for each relation on each sentence as the cosine of the unit vector for the relation with the sentence vector. The relation score is used for training and evaluation of the relation extraction system; it is viewed as the probability that the sentence expresses the relation. This is a fundamental shift from the traditional approach, in which sentences are simply labelled as expressing the relation or not, and it presents new challenges for the evaluation metric and especially for training. In our experiments we have seen that the sentence-relation score is highly correlated with clearly expressing a relation. Sentence clarity is defined for each sentence as the max relation score for that sentence. If all the workers selected the same relation for a sentence, the max relation score will be 1, indicating a clear sentence. In figure 2, sentence 735 has a clarity score of 1, whereas sentence 736 has a clarity score of 0.61, indicating a confusing or ambiguous sentence. Sentence clarity is used to weight sentences in training and evaluation of the relation extraction system: since annotators have a hard time classifying such sentences, the machine should not be penalized as much for getting them wrong in evaluation, nor should it treat them as strong training exemplars.
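These two sentence scores follow directly from the definitions (a sketch; the example row is sentence 735 from figure 2, which has all 13 of its workers on one relation):

    import numpy as np

    def sentence_relation_score(sentence_vec, relation_index):
        """Cosine of the unit vector for one relation with the sentence vector,
        which reduces to that relation's count divided by the vector norm."""
        norm = np.linalg.norm(sentence_vec)
        return float(sentence_vec[relation_index] / norm) if norm else 0.0

    def sentence_clarity(sentence_vec):
        """Max sentence-relation score over all dimensions of the vector."""
        return max(sentence_relation_score(sentence_vec, i)
                   for i in range(len(sentence_vec)))

    s735 = np.array([0, 0, 0, 0, 0, 13, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
    print(sentence_clarity(s735))   # 1.0 -- a perfectly clear sentence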
Relation scores include relation similarity, relation ambiguity, and relation clarity. Similarity is a pairwise conditional probability that if relation Ri is annotated in a sentence, relation Rj is as well. Information about relation similarity is used in training and evaluation, as it roughly indicates how confusable the linguistic expressions of two relations are. This would indicate, for example, that relation colearning (Carlson et al. 2009) would not work for similar relations. Ambiguity is defined for each relation as the max relation similarity for the relation; if a relation is clear, then it will have a low score. Since techniques like relation colearning have proven effective, it may be useful to exclude ambiguous relations from the set. Clarity is defined for each relation as the max sentence-relation score for the relation over all sentences. If a relation has a high clarity score, it means that it is at least possible to express the relation clearly. We find in our experiments that many relations that exist in structured sources are very difficult to express clearly in language and are not frequently present in textual sources. Unclear relations may indicate unattainable learning tasks.
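A sketch of the three relation-level scores, assuming the sentence disagreement vectors are stacked as rows of a NumPy matrix with one column per relation (the similarity estimate here works at the sentence level, which is one plausible reading of the definition):

    import numpy as np

    def relation_similarity(sentence_vectors, i, j):
        """Estimate of P(Rj annotated | Ri annotated): the fraction of sentences
        with at least one vote for relation i that also have a vote for j."""
        on_i = sentence_vectors[sentence_vectors[:, i] > 0]
        return float(np.sum(on_i[:, j] > 0) / len(on_i)) if len(on_i) else 0.0

    def relation_ambiguity(sentence_vectors, i):
        """Max similarity of relation i to any other relation."""
        others = [j for j in range(sentence_vectors.shape[1]) if j != i]
        return max(relation_similarity(sentence_vectors, i, j) for j in others)

    def relation_clarity(sentence_vectors, i):
        """Max sentence-relation score for relation i over all sentences."""
        best = 0.0
        for vec in sentence_vectors:
            norm = np.linalg.norm(vec)
            if norm:
                best = max(best, float(vec[i] / norm))
        return best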
The three kinds of scores hold up well in our experiments for building a medical relation extraction gold standard (Aroyo and Welty 2013b). We believe the idea of crowd truth generalizes to other big data semantics tasks quite easily. The three kinds of measures we introduce correspond directly to the three corners of the triangle of reference (see figure 3) between a sign, something the sign refers to, and the interpreter of the sign (Ogden and Richards 1923). The interpreter perceives the sign (a word, a sound, an image, a sentence, and so on) and through some cognitive process attempts to find the referent of that sign (an object, an idea, a class of things, and so on). This process of interpretation is what we generally mean when we talk about semantics.

[Figure 3. The triangle of reference, with corners Sign, Referent, and Interpreter.]
In crowd truth for relation extraction, sentences are the signs, workers are the interpreters, and the referents are provided by the semantics of the domain; in our examples, the set of relations are the possible referents. Adapting crowd truth to a new problem involves substituting the objects to be interpreted for the sentences, and identifying the possible semantic space of referents. Once the semantic space is identified, it is mapped into a vector space and the same measures can be applied. For example, if the interpretation task were to identify the predominant colors in an image, the vector space could be the range of possible (relevant) colors.
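For that (hypothetical) color task, the adaptation amounts to little more than swapping in a new set of dimensions; the worker vectors, the summed disagreement vectors, and the clarity measure are reused unchanged, as in this sketch:

    import numpy as np

    COLORS = ["red", "orange", "yellow", "green", "blue",
              "purple", "black", "white", "other"]   # illustrative referent space

    def image_vector(worker_selections):
        """Sum of binary worker vectors over the color dimensions."""
        vec = np.zeros(len(COLORS))
        for selected in worker_selections:
            for color in selected:
                vec[COLORS.index(color)] += 1
        return vec

    def clarity(vec):
        """Max cosine of a unit color vector with the image vector."""
        norm = np.linalg.norm(vec)
        return float(vec.max() / norm) if norm else 0.0

    votes = [{"blue"}, {"blue", "green"}, {"blue"}, {"green"}]  # four hypothetical workers
    print(clarity(image_vector(votes)))   # about 0.83: mostly blue, but not unanimous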
Most of the work in adapting crowd truth to a new problem lies in determining a useful vector space for representing the disagreement. It is important for the dimensionality to be relatively low, so that there is reasonable opportunity for workers to agree as well