Mutual information of words and pictures

Kobus Barnard
Department of Computer Science
University of Arizona
Tucson, Arizona
Email: [email protected]

Keiji Yanai
Department of Computer Science
The University of Electro-Communications
1-5-1 Chofugaoka, Chofu-shi
Tokyo, 182-8585 JAPAN
Email: [email protected]
Abstract— We quantify the mutual information between words and images or their components in the context of a recently developed model for their joint probability distribution. We compare the results with estimates of human level performance, exploiting a methodology for evaluating localized image semantics. We also report results of using information theoretic measures to determine whether or not a word is “visual”. In particular, we examine the entropy of image regions likely to be associated with a candidate visual word. We propose using such an approach to prune words that do not link to given features. This can reduce the difficulties of linking words and images in large scale data sets.

I. INTRODUCTION

Intuitively there is much mutual information between images and associated text. For example, given an image, we are not overly surprised by relevant keywords. In this work we quantify the mutual information suggested by this scenario, using a recently developed model for the joint probability of words and images and their components [1], [2], [3]. We consider both the mutual information between entire images and words, and image regions and words. We compare the results with estimates of human level performance, exploiting recent methodology for evaluating localized image semantics [4]. This gives an alternative characterization of these models, different from the word prediction performance measures used so far [1], [2], [3].

In a different application, we consider that the entropy of image regions associated with a word can be indicative of how “visual” that word is. Thus we can apply information theoretic measures to determine whether a word is “visual” [5]. This is important because the automated processing of large image data sets involves potentially very large vocabularies, but many words associated with images are not very useful for visual representation. This suggests a large scale data mining exercise to determine which words are likely to be useful for automatically annotating images based on visual properties.

II. ESTIMATING THE MUTUAL INFORMATION OF WORDS AND IMAGES OR THEIR COMPONENTS

We compute the mutual information of random variables for words, W, and blobs, B, by the standard formula [6]:

  I(W;B) = H(W) - H(W|B)    (1)

where H(X) is the entropy of the random variable X. We interpret this informally as the reduction in entropy of the words, once we see the image or image component. Mutual information is symmetric, and we equally have

  I(W;B) = H(B) - H(B|W)    (2)

To quantify the mutual information of words and pictures (Sec. IV) we apply form (1), and for the application to finding “visual” words (Sec. V) we use form (2) — in fact, since we only need to rank the words, we use only H(B|W).

We compute the required probabilities based on models for their joint probability described below (Sec. III-A). These models are quite limited in effectiveness, reflecting that the current state of the art has a long way to go. Hence one motivation for this work is to compare the mutual information computed from such models with similar quantities based on human level recognition.

An important distinction is the mutual information between words and images, taken as a whole, and words and image regions. Most words associated with images refer to specific parts within the image. Further, we assume that systems that automate image understanding must embody image compositionality. However, the models of the genre outlined below are typically trained on data where the nature of the composition is hidden by correspondence ambiguity. For example, the training set used for the first set of experiments consists of images with roughly five keywords, but we are ignorant of which image parts go with which keywords. We posit that even if the goal is simply image annotation — suitable for indexing applications — the reduction of uncertainty in correspondence is a key issue for generalization. For example, an algorithm that confuses horses and grass will do fine as long as horses and grass always co-occur as they might in a training set. Thus in this work we set out to measure mutual information on both image annotation and region labeling.

A. Ground truth semantic entropy

For the ground truth word distributions for entire images we remain consistent with previous work and assume that the keywords provide a reasonable empirical estimate [3]. This ignores issues of completeness of the keyword set relative to the vocabulary, and relations among the words. For example, in a tiger image, should the word “cat” be treated differently than the word “tiger”?

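Concretely, given per-image word distributions, whether model posteriors or empirical keyword distributions, the quantities in (1) can be estimated along the lines of the following minimal sketch. The variable names are hypothetical and this is not the authors' code; it marginalizes P(W) by averaging over the image set, as in the protocol of Sec. IV.

# Minimal sketch (hypothetical names): estimating H(W), H(W|B), and I(W;B)
# from per-image word distributions, each over the same vocabulary.
import numpy as np

def entropy_bits(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log2(p + eps))

def mutual_information(per_image_word_dists):
    dists = np.asarray(per_image_word_dists, dtype=float)
    p_w = dists.mean(axis=0)                    # marginal P(W), averaged over images
    h_w = entropy_bits(p_w)                     # H(W)
    h_w_given_b = np.mean([entropy_bits(p) for p in dists])   # H(W|B)
    return h_w, h_w_given_b, h_w - h_w_given_b  # I(W;B) = H(W) - H(W|B)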
In the case of image regions, an additional complexity is that, due to imprecise segmentation, each region will generally cover some subset of the image area relevant to several semantic entities. We have addressed some of these issues in recent work on the evaluation of localized image semantics [4]. That work provides a method to compute, for a given segmentation, a distribution of weights over the words that quantifies the reward for assigning that word to that region. The method uses WordNet [7] to establish a protocol for scoring related words. For example, “tiger” is rewarded more than “cat”, with the proportion set so that blind guessing of either one will give the same expected value of the overall score.

For the experiments in this paper, we assume that these weights are proportional to a good ground truth probability distribution. Further, the sum of these scores gives a weight encoding the proportion of the image semantics attributed to that region. We use this weighting to compute averages over regions to mitigate somewhat the impact of the particular segmentation algorithm. The results using straight averages are substantively similar.

III. MODELING THE JOINT PROBABILITY OF WORDS AND IMAGE REGIONS

Recent work suggests that relatively simple approaches can usefully model the joint probability distribution of image region features and associated words [1], [2], [3], [8], [9], [10]. Using regions or other localized features makes sense because image semantics are largely dependent on compositional elements within them such as objects and backgrounds. These models are trained using large data sets of images with associated text. Critically, the correspondence between particular words and particular visual elements is not required, as large quantities of such data are not readily available and are expensive to create.

The general idea, shared by many variants of the approach, is that images are generated from latent factors (concepts) which contribute both visual entities and words. The fact that visual entities and words come from the same source is what enables the model to link them. Because we train the models without knowing the correspondence, we need an assumption of how multiple draws from the pool of factors lead to the observed data. The model detailed below assumes that multiple draws are first made to produce image entities, and then the same group of factors is sampled to produce the image words. Note that this implements the key assumption that image semantics is compositional, and thus each image typically needs to be described by multiple visual entities. Without compositionality, we would need to model all possible combinations of entities. For example, we would have to model tigers on grass, tigers in water, tigers on sand, and so on. Clearly, one tiger model should be reused when possible.

In what follows, we use feature vectors associated with image regions obtained using normalized cuts [11]. For each image region we compute a feature vector representing color, texture, size, position, shape [12], and color context [13]. We refer to a region, together with its feature vector, as a blob.

A. An exemplar multi-modal translation model

We model the joint probability of a particular blob, b, and a word, w, as

  P(b, w) = \sum_c P(w|c) P(b|c) P(c)    (3)

where c indexes over concepts, P(c) is the concept prior, P(w|c) is a frequency table, and P(b|c) is a Gaussian distribution over features. We further assume a diagonal covariance matrix (independent features) because fitting a full covariance is generally too difficult for a large number of features. This independence assumption is less troublesome because we only require conditional independence, given the concept. Intuitively, each concept generates some image regions according to the particular Gaussian distribution for that concept. Similarly, it generates one or more words for the image according to a learned table of probabilities.

To go from the blob oriented expression (3) to one for an entire image, we assume that the observed blobs, B, yield a posterior probability, P(c|B), which is proportional to the sum of P(c|b). Words are then generated conditioned on the blobs from:

  P(w|B) \propto \sum_c P(w|c) P(c|B)    (4)

where by assumption

  P(c|B) \propto \sum_{b \in B} P(c|b)    (5)

and Bayes rule is used to compute P(c|b) \propto P(b|c) P(c).

Some manipulation [14] shows that this is equivalent to assuming that the word posterior for the image is proportional to the sum of the word posteriors for the regions:

  P(w|B) \propto \sum_{b \in B} P(w|b)    (6)

We limit the sum over blobs to the largest N blobs (in this work N is sixteen). While training, we also normalize the contributions of blobs and words to mitigate the effects of differing numbers of blobs and words in the various training images. The probability of the observed data, W \cup B, given the model, is thus:

  P(W \cup B) = P(B) P(W|B)    (7)

where

  P(B) = \prod_{b \in B} [ \sum_c P(b|c) P(c) ]^{N_B / n_B}    (8)

and

  P(W|B) = \prod_{w \in W} [ \sum_c P(w|c) P(c|B) ]^{N_W / n_W}    (9)

Here N_B (similarly N_W) is the maximum number of blobs (words) for any training set image, n_B (similarly n_W) is the number of blobs (words) for the particular image, and P(c|B) is computed from (5).

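As a concrete illustration of (3)-(6), the following minimal sketch computes a region word posterior by applying Bayes rule within each concept, and combines regions into an image word posterior. The array names and shapes are hypothetical; this is a sketch under the stated assumptions, not the authors' implementation.

# Minimal sketch (hypothetical names): word posteriors per (3)-(6).
# means, variances: (n_concepts, n_features); p_c: (n_concepts,)
# p_w_given_c: (n_concepts, vocab_size); blob: (n_features,)
import numpy as np

def region_word_posterior(blob, p_c, means, variances, p_w_given_c):
    # Diagonal-covariance Gaussian log likelihood of the blob under each concept.
    log_lik = -0.5 * np.sum((blob - means) ** 2 / variances
                            + np.log(2 * np.pi * variances), axis=1)
    p_c_given_b = np.exp(log_lik - log_lik.max()) * p_c   # Bayes rule, up to a constant
    p_c_given_b /= p_c_given_b.sum()
    return p_w_given_c.T @ p_c_given_b                    # sum over c of P(w|c) P(c|b)

def image_word_posterior(blobs, p_c, means, variances, p_w_given_c):
    # P(w|B) proportional to the sum of the region posteriors, as in (6).
    p_w = sum(region_word_posterior(b, p_c, means, variances, p_w_given_c)
              for b in blobs)
    return p_w / p_w.sum()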
Since we do not know which concept is responsible for which observed blobs and words in the training data, determining the maximum likelihood values for the model parameters (P(w|c), P(b|c), and P(c)) is not tractable. We thus estimate values for the parameters using expectation maximization (EM) [15], treating the hidden factors (concepts) responsible for the blobs and words as missing data.

The model generalizes well because it learns about image components. These components can occur in different configurations and still be recognized. For example, it is possible to learn about “sky” regions in images of tigers, and then predict “sky” in giraffe images. Of course, predicting the word giraffe requires having giraffes in the training set.

IV. EXPERIMENTS

We trained the above model on a set of 26,078 Corel images. The vocabulary size was 509 words. The number of mixture components was 2000. We report results for the 1014 images for which we have ground truth region labels. These images were held out from training. Naturally, where the results reflect model fit, the training data results were a little better, but not substantively.

TABLE I
THE MUTUAL INFORMATION BETWEEN ENTIRE IMAGES AND OUR VOCABULARY WORDS, COMPUTED BASED ON THE MODEL DESCRIBED IN THE TEXT (SEC. III-A).

  H(W)    H(W|B)    I(W;B)
  7.32    6.69      0.63

TABLE II
THE MUTUAL INFORMATION BETWEEN ENTIRE IMAGES AND OUR VOCABULARY WORDS, COMPUTED USING THE IMAGE KEYWORDS.

  H(W)    H(W|B)    I(W;B)
  5.65    2.42      3.23

TABLE III
THE MUTUAL INFORMATION BETWEEN IMAGE REGIONS AND OUR VOCABULARY WORDS, COMPUTED FROM THE MODEL.

  H(W)    H(W|B)    I(W;B)
  7.01    4.37      2.64

TABLE IV
THE MUTUAL INFORMATION BETWEEN IMAGE REGIONS AND WORDS, COMPUTED FROM THE MODEL, BUT GIVEN THE IMAGE WORDS.

  H(W)    H(W|B)    I(W;B)
  5.00    0.60      4.40

TABLE V
THE MUTUAL INFORMATION BETWEEN IMAGE REGIONS AND WORDS, COMPUTED FROM THE “GROUND TRUTH” DISTRIBUTION OF THE WORDS FOR THAT REGION.

  H(W)    H(W|B)    I(W;B)
  6.63    1.53      5.10

For the first experiment (Table I), we estimated the quantities H(W) and H(W|B), averaging over the image set to estimate the marginal P(W). This gives similar results to simply using the empirical word distribution, but we prefer marginalizing in the same context as the computation of H(W|B) to reduce biases in the mutual information estimate. With this protocol, we found relatively little mutual information (0.63).

In the second experiment (Table II), we forced the word posterior for each image to have mass only on the observed keywords. This gives ground truth quantities that are comparable with those in the previous experiment. Not surprisingly, the conditional entropy (2.42) reflects the number of keywords that we have for each image (typically in the range of 3 to 5). The mutual information here was 3.23.

Clearly there is a large difference between our model and the “oracle”. To further compare the two processes, we computed the average KL divergence between the word posterior distributions and the observed image word distributions, finding it to be 4.27. As a comparison, the average KL divergence between the overall word empirical distribution and the observed image word distributions is 5.50. This is consistent with results reported elsewhere [3] — our models consistently perform somewhat better than chance, but we have a long way to go.

In the third experiment (Table III), we computed quantities similar to those in the first, but now entropy was computed using probability distributions conditioned on only one blob. Interestingly, we found that our model supported substantively more mutual information (2.64) between regions and words than between images and words. Recall that the model explicitly represents the joint probability of words and regions, and that we used a heuristic for producing image word posteriors from region word posteriors. Image word posteriors are necessary both for training with correspondence ambiguity, as well as for image annotation. These findings suggest that we may be able to improve the heuristic.

In a fourth experiment (Table IV), we constrained the region word posterior to have mass only for words that were associated with the image. The remaining uncertainty is a combination of correspondence ambiguity, and mismatches between keywords and what is depicted in regions. In this case the mutual information was very high (4.40). More striking was the low value of the conditional entropy (0.60).

In our final experiment (Table V) we computed the mutual information using the region ground truth (5.10), and here the conditional entropy was 1.53. A critical observation is that this number includes uncertainty due to segmentation errors, which are very prevalent, as segmentation along semantic lines is very difficult. The substantially lower conditional entropy in the fourth experiment suggests to us that our model is perhaps losing too much information, and perhaps its power should be increased.

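The average KL divergence comparison above (4.27 for the model posteriors versus 5.50 for the overall empirical distribution) can be computed along the lines of the following sketch. The names are hypothetical, and the direction of the divergence is our assumption, since it is not specified above.

# Minimal sketch (hypothetical names): average KL divergence between observed
# image word distributions and predicted word posteriors.
import numpy as np

def kl_divergence_bits(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / (q[mask] + eps)))

def average_kl(observed_dists, predicted_dists):
    # Assumed direction: D(observed || predicted), averaged over the image set.
    return float(np.mean([kl_divergence_bits(p, q)
                          for p, q in zip(observed_dists, predicted_dists)]))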
V. FINDING VISUAL WORDS

We have further applied information theoretic measures to quantify the “visualness” of words. In particular, we have proposed using the entropy of image regions likely associated with a given word as a measure of “visualness” [5]. We would like to determine “visualness” on a large scale to support internet scale linking of pictures and words. Given the extensive vocabulary that this implies, it makes sense to investigate which words are good candidates for success. Thus we see the first immediate application of this work as a tool for pruning large vocabularies to exclude the many words that are not visual, relative to our features.

We begin by using Google Image Search to find a large number of images that have a fair chance of being relevant to a given word. Having selected the images, we face a familiar problem. Even if a word is relevant to an image in general, it likely correlates with the features of only a small part of the image. We expect the bulk of any image to be irrelevant to the word. Hence to estimate whether a word correlates with image features, we need to estimate which parts of the image are relevant. Not surprisingly, this requires an iterative algorithm which alternates between determining an appropriate characterization for the word, and determining which regions are relevant.

To implement this we prepare a large Gaussian mixture model for the regions of a large number of images. A concept is characterized as a probability distribution over the mixture components. We iteratively estimate that distribution and whether or not each image region is relevant to the concept. After sufficient iterations we compute the entropy of the distribution. If that distribution has low entropy, then we designate the word as visual. Otherwise, the process suggests that it is hard to distinguish the regions linked to the word from a random selection of regions. In that case we consider that word not sufficiently visual, and prune it from the words that we try to link to image features. Details are available elsewhere [5].

A. Experiments

We experimented with the 150 most common adjectives used for indexing images in the Hemera Photo-Object collection. We used each of these adjectives as the search term for Google Image search. We used the first 250 web images returned.

Figure 1 shows “yellow” images after one iteration. In the figure, the regions with high probability P(“yellow” | r) are labeled as “yellow”, while the regions with high probability P(“non-yellow” | r) are labeled as “non-yellow”. Figure 2 shows “yellow” images after five iterations. This indicates the iterative region selection worked well in the case of “yellow”.

Fig. 1. “Yellow” regions after one iteration. At this stage many of the images do not have much yellow in them, and there are many labeling errors. For example, the flower in the top right image is green-blue, as is the region in the third image in the top row. The region marked yellow in the second image of the second row is white, whereas the two smaller, unlabeled regions to either side are in fact yellow.

Fig. 2. “Yellow” regions after five iterations. These images all have significant yellow regions, and they are generally correctly labeled. The entropy of the yellow regions, as modeled by a Gaussian mixture over features, is relatively low compared with background or random regions. Hence the system picks out “yellow” as a visual word.

Table VI shows the top 15 adjectives and their image entropy. In this case, the entropy of “dark” is the lowest, so in this sense “dark” is the most “visual” adjective among the 150 adjectives under the conditions we set in this experiment. Figure 4 shows some of the “dark” images. Most of the regions labeled with “dark” are uniform black ones.

Interestingly, the method identifies many words which, at first glance, do not appear to be truly visual. A good example in our results is “professional”, which is ranked relatively high. The connection is through the sampling bias for “professional sports”, which yields low entropy because of a limited number of textures and backgrounds (e.g. fields and courts) that go with those images. It depends on the application as to whether such words are a liability.

Table VII lists the 15 adjectives with the lowest entropy rankings among the 150 tested. In the case of “religious” (Figure 3), which is ranked 145th, the region-adjective linking did not work well, and the entropy is thus relatively large. This reflects the fact that the image features of the regions included in “religious” images have no prominent tendency. Thus we can say that “religious” has no or only a few visual properties.

Fig. 3. “Religious” regions in images from the web gathered by using the word “religious”. There is little obvious pattern of difference between the two kinds of regions, consistent with the notion that our low level features are not likely to be able to represent the meaning of “religious”. There is little difference in the entropy between the regions deemed “religious” and those deemed “non-religious” — both are large. Thus the method denotes “religious” as a non-visual word given the features.

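Before turning to the rankings in Tables VI and VII, the kind of alternation described in Section V can be sketched as below. The sketch assumes that region responsibilities under a global Gaussian mixture are already available; the names are hypothetical, and the actual procedure, which treats region relevance probabilistically, is detailed in [5]. Words whose selected regions look like a random sample of regions end up with a component distribution close to the background, and hence high entropy.

# Minimal sketch (hypothetical names, not the procedure of [5]): alternate
# between characterizing a word as a distribution over mixture components and
# selecting the regions the word seems to describe, then score the word by
# the entropy of that distribution.
import numpy as np

def entropy_bits(p, eps=1e-12):
    p = p / p.sum()
    return -np.sum(p * np.log2(p + eps))

def word_region_entropy(resp, background, n_iter=5, keep=0.5):
    # resp: (n_regions, n_components) responsibilities of candidate regions
    # background: (n_components,) component distribution of random regions
    relevant = np.ones(len(resp), dtype=bool)          # start with all regions
    for _ in range(n_iter):
        word_dist = resp[relevant].mean(axis=0)        # P(component | word)
        # Score regions by how much better the word distribution explains them
        # than the background distribution does.
        scores = resp @ np.log((word_dist + 1e-12) / (background + 1e-12))
        relevant = scores >= np.quantile(scores, 1.0 - keep)
    return entropy_bits(word_dist)                     # low entropy -> "visual" word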
TABLE VI
WORDS WITH THE TOP 15 ENTROPY RANKINGS.

  rank  adjective     entropy
  1     dark          0.0118
  2     senior        0.0166
  3     beautiful     0.0178
  4     visual        0.0222
  5     rusted        0.0254
  6     musical       0.0321
  7     purple        0.0412
  8     black         0.0443
  9     ancient       0.0593
  10    cute          0.0607
  11    shiny         0.0643
  12    scary         0.0653
  13    professional  0.0785
  14    stationary    0.1201
  15    electric      0.1411

Fig. 4. “Dark” regions in images from the web gathered by using the word “dark”. “Dark” regions are identified as being dark, which means that they have little variance in color or texture on an absolute scale. Hence, taken as a group, their entropy, as measured in the context of a Gaussian mixture model over features, is relatively low. Thus the method denotes “dark” as a visual word.

TABLE VII
WORDS WITH THE BOTTOM 15 ENTROPY RANKINGS.

  rank  adjective  entropy
  136   medical    2.5246
  137   assorted   2.5279
  138   large      2.5488
  139   playful    2.5541
  140   acoustic   2.5627
  141   elderly    2.5677
  142   angry      2.5942
  143   sexy       2.6015
  144   open       2.6122
  145   religious  2.7242
  146   dry        2.8531
  147   male       2.8835
  148   patriotic  3.0840
  149   vintage    3.1296
  150   mature     3.2265

VI. CONCLUSION

We have applied standard information theory methods to provide some insight into the task of building systems which automatically link words to images and words to image regions. In particular, information theoretic measures appear to be quite useful for thinking about the relation between image annotation and region labeling. The former seems to be equivalent to the latter with added correspondence ambiguity, but we do not have a clear theory on how these two processes should relate in the context of algorithm building. Complications include segmentation errors and vocabulary issues. The work presented in this paper suggests that useful quantification of the components of uncertainty can be achieved through information theory.

We have further used information theoretic measures to quantify the “visualness” of words. This yields a simple method to prune large vocabularies of words that are not visual, given our features. In the domain of linking words and pictures, such non-visual words increase the computational burden, and complicate already difficult model fitting and selection. Thus a method to automatically remove them makes sense.

REFERENCES

[1] K. Barnard and D. Forsyth, “Learning the semantics of words and pictures,” in International Conference on Computer Vision, 2001, pp. II:408–415. [Online]. Available: http://kobus.ca/research/publications/ICCV-01/index.html
[2] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, “Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary,” in The Seventh European Conference on Computer Vision, 2002, pp. IV:97–112. [Online]. Available: http://kobus.ca/research/publications/ECCV-02-1/ECCV-02-1.pdf
[3] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan, “Matching words and pictures,” Journal of Machine Learning Research, vol. 3, pp. 1107–1135, 2003. [Online]. Available: http://kobus.ca/research/publications/JMLR/JMLR-03.ps.gz
[4] K. Barnard, Q. Fan, R. Swaminathan, A. Hoogs, R. Collins, P. Rondot, and J. Kaufhold, “Evaluation of localized semantics: data, methodology, and experiments,” U. Arizona, Computing Science, TR-05-08, 2005. [Online]. Available: http://kobus.ca/research/publications/IJCV-06/TR-05-08.pdf
[5] K. Yanai and K. Barnard, “Image region entropy: A measure of ‘visualness’ of web images associated with one concept,” in ACM Multimedia, 2005. [Online]. Available: http://kobus.ca/research/publications/ACM-MM-05/Yanai-Barnard-ACM-MM-05.pdf
[6] T. Cover and J. Thomas, Elements of Information Theory. John Wiley and Sons Inc., 1991.
[7] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller, “Introduction to WordNet: an online lexical database,” International Journal of Lexicography, vol. 3, no. 4, pp. 235–244, 1990.
[8] P. Carbonetto, N. de Freitas, and K. Barnard, “A statistical model for general contextual object recognition,” in European Conference on Computer Vision, 2004, pp. I:350–362. [Online]. Available: http://kobus.ca/research/publications/ECCV-04/index.html
[9] J. Jeon, V. Lavrenko, and R. Manmatha, “Automatic image annotation and retrieval using cross-media relevance models,” in SIGIR, 2003, pp. 119–126.
[10] S. Feng, R. Manmatha, and V. Lavrenko, “Multiple Bernoulli relevance models for image and video annotation,” in Proceedings of CVPR’04, vol. 2, 2004, pp. 1002–1009.
[11] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 9, pp. 888–905, 2000.
[12] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan, “Matching words and pictures,” Journal of Machine Learning Research, vol. 3, pp. 1107–1135, 2003.
[13] K. Barnard, P. Duygulu, K. G. Raghavendra, P. Gabbur, and D. Forsyth, “The effects of segmentation and feature choice in a translation model of object recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2003, pp. II:675–682. [Online]. Available: http://kobus.ca/research/publications/CVPR-03/CVPR-03.pdf
[14] K. Barnard, P. Duygulu, and D. Forsyth, “Exploiting text and image feature co-occurrence statistics in large datasets,” in Trends and Advances in Content-Based Image and Video Retrieval, R. Veltkamp, Ed. Springer, to appear. [Online]. Available: http://kobus.ca/research/publications/Dagstuhl/dagstuhl.pdf
[15] A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.