A Stochastic Grammar of Images. Zhu, S. and Mumford, D. Foundations and Trends\textregistered in Computer Graphics and Vision, 2(4):259–362, Now Publishers, Inc., 2006.
author = {Zhu, Song-Chun and Mumford, David},
title = {{A Stochastic Grammar of Images}},
journal = {Foundations and Trends{\textregistered} in Computer Graphics and Vision},
year = {2006},
volume = {2},
number = {4},
pages = {259--362},
annote = {This is the reprint version from <http://www.stat.ucla.edu/~sczhu/papers/Reprint_Grammar.pdf>

It's fancier than Alan Yuille's models, as in "Complexity of Representation and Inference in Compositional Models with Part Sharing" or "Semantic part segmentation using compositional model combining shape and appearance".

Section 1.

pp. 267 What's the minimax principle?

pp. 269 In grammar terminology, And-nodes correspond to production rules, and Or-nodes to the symbols that admit alternative rewrites.

pp. 270 Parse graph is a particular instance of an AND-OR graph, specifying all the OR-nodes. 

Horizontal connections exist at some levels. When we collapse them all the way down to the leaf level, we get configuration graphs. Personally, I think 1(e) is incomplete: for example, there should be a link between 2 and 9 as well. But whatever; check the end of Def 5.1. Here I think it's somewhat sloppy.

Fig 1.3 Multiple parents for one node? I think NOT.

Below 1.3
1. And-Or graph: an AND-OR structure, with (pairwise) horizontal links between AND components.
2. parse graph: an instance of the AND-OR structure in which all ORs have been specified.
3. configuration: for each primitive, find all its ancestors with links, and create a link between two primitives if they have ancestors that are linked. Basically, reflect the links on top down onto the primitives.
4. visual vocabulary: the set of all possible instances of a primitive. Note that primitives can be pretty high-level in concept, due to the scale problem. Not sure what anchor points and open bonds mean here.
5. language: everything above.
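The interplay between items 1 and 2 can be sketched in code. Everything below (node names, structure) is invented for illustration, not taken from the paper:

```python
import random

# Toy AND-OR graph: OR nodes offer alternatives, AND nodes decompose into
# parts. All node names are hypothetical.
AND_OR = {
    "object":    ("OR",  ["variant_1", "variant_2"]),
    "variant_1": ("AND", ["part_a", "part_b"]),
    "variant_2": ("AND", ["part_a", "part_c"]),
    "part_a":    ("OR",  ["primitive_1", "primitive_2"]),
}

def sample_parse(node):
    """Turn the AND-OR graph into a parse graph by resolving every OR."""
    if node not in AND_OR:                      # a leaf primitive
        return node
    kind, children = AND_OR[node]
    if kind == "OR":                            # a parse graph fixes each OR
        return {node: sample_parse(random.choice(children))}
    return {node: [sample_parse(c) for c in children]}   # AND expands fully

parse = sample_parse("object")
```

Horizontal links (item 3) would then be reflected down onto whichever primitives this parse selects.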

Section 1.3.2 two arguments against purely unsupervised training. I think they still apply today (2017).

End of chapter 1: here they talk about the advantages of AND-OR. One is efficient training. But I don't know how that is consistent with the fact that And-Or graph models are difficult to train. Maybe "efficient training" here means training after the whole hierarchy has been modeled?

Section 2.2: linking traditional grammar definitions and AND-OR graph.

AND nodes are rule names, and OR nodes are the left-hand-side symbols of rules [so they must be non-terminals].

Section 2.3 ambiguity and reusable parts in AND-OR graph grammar for vision. In the end they assume that there's no occlusion and no sharing, so each object can be parsed as a tree structure perfectly.

Section 2.4 at the end, it discusses the potential problem of assigning probabilities. The example here (A -> AA | a) is given in reference [10], which seems to show that using MLE, we won't get a problematic estimate.
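The problem can be checked numerically. For the grammar A -> AA (prob. p) | a (prob. 1-p), the probability q that a derivation terminates satisfies q = (1-p) + p*q^2, whose smallest root is min(1, (1-p)/p); for p > 1/2 the grammar is improper (finite parses carry less than all the probability mass). A quick fixed-point iteration (my own sketch, not from the paper):

```python
def termination_prob(p, iters=5000):
    """Smallest solution of q = (1-p) + p*q**2, reached by iterating from 0.
    q is the total probability that A -> AA (p) | a (1-p) derives a finite
    string."""
    q = 0.0
    for _ in range(iters):
        q = (1 - p) + p * q * q
    return q

print(termination_prob(0.3))   # close to 1.0: proper grammar
print(termination_prob(0.8))   # close to 0.25 = (1-0.8)/0.8: improper
```

The point of reference [10], as I read it, is that MLE fitted to observed (finite) parse trees stays on the proper side.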

Section 2.5 adds some context effects during modeling that affect the probability. That basically means horizontal connections, or other connections not representable through a tree.

"Address variable" should be a specific concept in language modeling. For example, in Figure 2.8, the address variable for "what" should point to "is" as well as "said"; logically, "what is" and "said what" form a "next" relation.

What it's essentially saying is that when applying bigram frequencies, we should not compute h(what, I), but h(said, what) and h(what, is), and these need to be determined through some inference.
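A toy version of that point (the counts here are made up): resolving the address variable changes which bigram counts the model consults.

```python
# Hypothetical bigram counts h(a, b) from some corpus -- invented numbers.
h = {("said", "what"): 3, ("what", "is"): 40, ("what", "I"): 1}

# Surface order of the (moved) sentence pairs "what" with "I"; but with the
# address variable for "what" resolved to "said" and "is", the model scores
# h(said, what) + h(what, is) instead of h(what, I).
surface_score  = h.get(("what", "I"), 0)
resolved_score = h.get(("said", "what"), 0) + h.get(("what", "is"), 0)
```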

Section 2.6 some special issues for vision grammar.
1. elimination of recursive rules. I don't know how this is to be done precisely.
2. multi-scale. Simple: just add some termination rules.
3. the grammar is more stochastic, and it's better thought of as a mixture of a strict grammar and an MRF, given the stochasticity of images.

Section 2.7 previous work on image grammar. ignore, since they are probably even less popular than AND-OR graph.

Section 3 Visual vocabulary ($\Delta$). Here it's talking about how to model small image patches.

Overall structure.

1. First we have image primitives,
2. then we have graphlets in 3.3, through clustering primitives.
3. later they talk about primitives for modeling objects; see Fig. 3.7. Mathematically, they are the same as image primitives, and they are composed of primitives and graphlets (Fig. 3.8 caption).

Def 3.1 is not that difficult.

First, $\phi_i(x,y; \alpha_i)$ just means that $\alpha_i$ is the parameter specifying one image patch, producing a function from $(x,y)$ (pixel location) to pixel values. A concrete example is the Gabor filter, where $\alpha_i$ represents the 6 or 7 parameters of the filter, and we get a function mapping $(x,y)$ to the intensity value of the Gabor patch. Notice that in this case, $(x,y)$ has finite support, only concerning a small patch.
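As a concrete instance of $\phi_i(x,y; \alpha_i)$, here is a standard Gabor patch; the packing of the parameters into `alpha` is my own choice, not the paper's:

```python
import numpy as np

def gabor_patch(x, y, alpha):
    """phi(x, y; alpha): a Gabor primitive. alpha = (x0, y0, theta, sigma,
    wavelength, phase) -- center, orientation, envelope width, carrier
    wavelength, carrier phase."""
    x0, y0, theta, sigma, wavelength, phase = alpha
    xr =  (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * xr / wavelength + phase)

# Finite support: evaluate only on an 11x11 patch around the center.
xs, ys = np.meshgrid(np.arange(11), np.arange(11))
patch = gabor_patch(xs, ys, alpha=(5.0, 5.0, 0.0, 2.0, 4.0, 0.0))
```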

$\beta$'s definition is confusing. I think that, in some instantiation of the visual vocabulary, each $\beta$ should eventually refer to some other image word in the vocabulary.

Eq. (3.2) and (3.3). A bigger image is explained by smaller patches. Notice that some parts are not explainable (Eq. (3.2)).
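A minimal sketch of that decomposition (shapes and supports invented): the lattice splits into the union of the chosen primitives' supports plus an unexplained remainder, which is left to a different (texture) model.

```python
import numpy as np

H, W = 8, 8
# Hypothetical supports Lambda_i of two selected primitives (row/col slices).
supports = [(slice(0, 4), slice(0, 4)), (slice(2, 6), slice(3, 8))]

explained = np.zeros((H, W), dtype=bool)
for rows, cols in supports:
    explained[rows, cols] = True        # union of the Lambda_i

unexplained = ~explained                # the part the patches can't explain
```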

Section 4 horizontal relationship among parts (of potentially different levels) ($\mathcal{R}$).

According to 4.1, nodes of all levels, primitives, graphlets, parts, etc., can form relation.

Def 4.1. $\gamma$ defines the type of relationship (see Fig. 4.1), or some more parameterized description, such as those in Eq. (4.4), and $\rho$ defines the score of this relationship given the patches.

Fig. 4.1 is really fancy. No idea how to train these.

Notice that using Def 4.1, we only have pairwise relationships. For example, in Fig. 4.1 the radial example would be most natural with 5 parts, but here it's decomposed into pairwise (2-part) relations.

Section 4.2. In what's presented here, all example configurations have nodes at the same level (primitives, parts, or objects).

Section 4.3 Fig 4.3 shows the complexity of this approach. I wonder how to discriminate different configurations; the learned probabilistic model would be too complicated. This approach would become intractable easily...

Section 5 A parse graph is an instantiation of an AND-OR graph. Each AND-OR graph defines a probability distribution over parse graphs.

Eq. (5.4) each AND node produces a set of children nodes, as well as the relationships between those children.
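In code, a production of this kind might look like the following sketch (symbols and relation names are invented, not from the paper):

```python
# Eq. (5.4)-style production: an AND node rewrites into children plus the
# relations that must hold among those children (indices into `children`).
productions = {
    "window": {
        "children":  ["frame", "pane", "pane"],
        "relations": [("inside", 1, 0), ("inside", 2, 0), ("adjacent", 1, 2)],
    },
}

def expand(and_node):
    """Apply the production for `and_node`: children and their relations."""
    rule = productions[and_node]
    return rule["children"], rule["relations"]

kids, rels = expand("window")
```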

> A production should also associate the open bonds of A with open bonds in C.

I think it means that there's a "containing" relationship between A and C, and they should be connected together, one end on A, one end on C.

Eq. (5.6) should refer to some relationship at the same level of the graph.

> Depending on the type of relation, there may be special rules for producing relations at a lower level from higher level relations in the collapsing process.

This might be inheriting relation, as shown in Fig 1.3(d), (e).

So there are two types of configurations at each level: uncollapsed (Eq. (5.4)) and collapsed (Eq. (5.7)).

pp. 321 three (or two?) features of And-Or graphs w.r.t. And-Or trees. I think item 3 (or-node sharing) is not dealt with, or is dealt with via some hack in practice. See Fig. 1.3 and its description, above Def. 6.1. In practice, shared parts are duplicated, I guess.

pp. 322 "Both nodes 8 and 10 have a common ancestor node C." should be "both nodes 6 and 8".

Def 6.1.

Here the definition of configuration seems to be the configuration of the leaf nodes. Notice that, given a graph, we don't allow an infinite amount of recursion, so we have a finite number of configurations in terms of sets of variables. That's why in Eq. (6.7) you have N.
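For a non-recursive toy graph this is easy to see by enumeration; the N of Eq. (6.7) is just the number of leaf configurations (the graph below is invented for illustration):

```python
from itertools import product

GRAPH = {
    "A": ("AND", ["B", "C"]),
    "B": ("OR",  ["b1", "b2"]),
    "C": ("OR",  ["c1", "c2", "c3"]),
}

def configurations(node):
    """All leaf-level configurations derivable from `node` (finite because
    the graph has no recursion)."""
    if node not in GRAPH:
        return [(node,)]
    kind, children = GRAPH[node]
    if kind == "OR":        # union over the alternatives
        return [cfg for c in children for cfg in configurations(c)]
    # AND: cross product of the children's configurations
    combos = product(*(configurations(c) for c in children))
    return [sum(combo, ()) for combo in combos]

confs = configurations("A")    # N = 2 * 3 = 6 configurations
```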

Eq. (6.13) notice that singleton terms are also defined for intermediate AND nodes, but they don't appear in Eq. (6.13), so maybe there's a typo. Based on Section 7, I believe the intermediate-node terms are not there.

The constraints between parts and their parent are, I guess, encoded in the pairwise terms.

Section 7, Section 8

Takeaway message: they are difficult, time-consuming, and not practical.

Section 7. The algorithm for learning the relationship set is not practical to me, as it assumes a large set of candidate relationships beforehand.

Section 8. Based on (8.1), you can see that the likelihood term doesn't care about the unsketchable part. The notation $\Delta_{sk}$ simply means that this term depends only on the sketchable part of the image, not that there's some conditional probability involved for $\Delta_{sk}$.

publisher = {Now Publishers, Inc.},
keywords = {classics, hierarchical},
doi = {10.1561/0600000018},
language = {English},
read = {Yes},
rating = {5},
date-added = {2017-02-24T15:00:03GMT},
date-modified = {2017-02-24T21:18:35GMT},
abstract = {A Stochastic Grammar of Images},
url = {http://www.nowpublishers.com/article/Details/CGV-018},
uri = {\url{papers3://publication/doi/10.1561/0600000018}}