Today is the last day of ECCV 2010, so the best papers have already been announced. Although I was not there, a lot of the papers are available via cvpapers, so I have eventually run into some of them.
Best papers
The full list of the conference awards is available here. The best paper is "Graph Cut based Inference with Co-occurrence Statistics" by Lubor Ladicky, Chris Russell, Pushmeet Kohli, and Philip Torr. The second best is "Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics" by Abhinav Gupta, Alyosha Efros, and Martial Hebert. Two thoughts before I start reviewing them. First, the papers come from the two institutions which are (arguably) considered the leading ones in the vision community these days: Microsoft Research and the Carnegie Mellon Robotics Institute. Second, both papers are about semantic segmentation (although the latter couples it with implicit geometry reconstruction); Vidit Jain has already noted the acceptance bias in favour of recognition papers.
Okay, the papers now. Ladicky et al. address the problem of global terms in energy minimization for semantic segmentation. Specifically, their global term deals only with the occurrence of object classes and is invariant to how many connected components (i.e. objects) or individual pixels represent each class. Therefore, one cow in an image gives the same contribution to the global term as two cows, or even as a single accidental pixel of a cow. The global term penalizes a large number of different categories in a single image (an MDL prior), which is helpful when we are given a large set of possible class labels, and also penalizes the co-occurrence of classes that are unlikely to appear together, like cows and sheep. The statistics are collected from the training set and define whether the co-occurrence of a given pair of classes should be encouraged or penalized. Although the idea of incorporating co-occurrence into the energy function is not new [Torralba et al., 2003; Rabinovich et al., 2007], the authors claim that their method is the first one which simultaneously satisfies four conditions: global energy minimization (a single global term rather than a multi-stage heuristic process), invariance to the structure of classes (see above), efficiency (the global term should not make the model an order of magnitude larger) and parsimony (the MDL prior, see above).
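Schematically (the notation here is mine, reconstructed from memory, so take it with a grain of salt), the energy is the usual unary-plus-pairwise CRF with one extra cost that depends only on the set of labels present in the image:

```latex
E(\mathbf{x}) \;=\; \sum_{i} \psi_i(x_i)
\;+\; \sum_{(i,j)\in\mathcal{E}} \psi_{ij}(x_i, x_j)
\;+\; C\bigl(L(\mathbf{x})\bigr),
\qquad
L(\mathbf{x}) \;=\; \{\, l \;:\; \exists\, i \text{ with } x_i = l \,\}
```

Here $\mathcal{E}$ denotes the neighbouring pixel pairs; both the MDL prior and the co-occurrence statistics live entirely inside $C$, which only ever sees the set of labels $L(\mathbf{x})$.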
How do the authors minimize the energy? They restrict the global term to be a function of the set of classes present in the image which is monotonic with respect to set inclusion (more classes, more penalty). Then they introduce auxiliary nodes into the αβ-swap or α-expansion moves, so that the optimized move energy remains submodular. It is really similar to how they applied graph-cut based techniques to minimize energies with higher-order cliques [Kohli, Ladicky and Torr, 2009]. So when you face some non-local terms in your energy, you can try something similar.
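To make the structure of the objective concrete, here is a toy evaluation of such an energy. This is not the authors' graph construction, just a sketch of the quantity the expansion/swap moves would be optimizing; the Potts pairwise term, the per-class constant and the pair-penalty matrix (which I assume would come from training co-occurrence frequencies, e.g. as negative log-probabilities) are my own illustrative choices.

```python
# Toy evaluation of an energy with a global label-set term, in the spirit of
# Ladicky et al. -- NOT their graph construction, just the objective itself.
import numpy as np

def cooccurrence_cost(labels_present, pair_penalty, class_penalty=1.0):
    """C(L): monotone cost of the set of labels present in the image.

    pair_penalty[a, b] >= 0 is assumed to be derived from training statistics,
    e.g. -log of the smoothed co-occurrence frequency of classes a and b.
    """
    labels = sorted(labels_present)
    cost = class_penalty * len(labels)          # MDL-style prior: fewer classes is cheaper
    for i, a in enumerate(labels):              # penalize unlikely class pairs
        for b in labels[i + 1:]:
            cost += pair_penalty[a, b]
    return cost

def energy(labeling, unary, pair_penalty, smoothness=1.0):
    """E(x) = sum_i psi_i(x_i) + sum_{ij} psi_ij(x_i, x_j) + C(L(x)).

    labeling: (H, W) int array of class indices; unary: (H, W, K) costs.
    The pairwise term here is a plain Potts model on the 4-connected grid.
    """
    h, w = labeling.shape
    e = unary[np.arange(h)[:, None], np.arange(w)[None, :], labeling].sum()
    e += smoothness * (labeling[:, 1:] != labeling[:, :-1]).sum()   # horizontal edges
    e += smoothness * (labeling[1:, :] != labeling[:-1, :]).sum()   # vertical edges
    e += cooccurrence_cost(set(np.unique(labeling)), pair_penalty)
    return e
```

Note how the last term does not care whether a class covers half the image or a single pixel, which is exactly the invariance (and the weakness) discussed above.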
What are the shortcomings of the method? It would be great to penalize objects in addition to classes. First, local interactions are taken into account, as well as global ones, but what about the medium level? The shape of an object, its size, colour consistency etc. are great cues. Second, at the global level only inter-class co-occurrences play a role, but what about intra-class ones? It is impossible to have two suns in a photo, but it is quite likely to meet several pedestrians walking along a street. Something similar has actually been done by Desai et al. [2009] for object detection.
The second paper is by Gupta et al., who revisit the romantic period of computer vision, when scenes composed of perfect geometrical shapes were reconstructed successfully. They address the problem of 3D reconstruction from a single image, like auto pop-up [Hoiem, Efros and Hebert, 2005]. They compare the result of auto pop-up with Potemkin villages: "there is nothing behind the pretty façade" (I believe this comparison is the contribution of the second author). Instead of surfaces, they fit boxes to the image, which allows them to put a wider range of constraints on the 3D structure, including:
- static equilibrium: it seems that the only property they check here is that the centroid projects inside the supporting region (see the sketch after this list);
- sufficient supporting force: they estimate density (light for vegetation, medium for humans, heavy for buildings) and argue that it is unlikely that a building is built on top of a tree;
- volume constraint: boxes cannot intersect;
- depth ordering: projecting the result back onto the image plane should be consistent with what we see in the image.
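Under my reading of the equilibrium constraint, the test boils down to something as simple as this: drop the centroid vertically and check that it lands inside the footprint of whatever supports the block. The function name and the convex-footprint assumption are mine, not the authors'.

```python
# A minimal sketch of the static-equilibrium test described above.
import numpy as np

def centroid_over_support(centroid_xyz, support_polygon_xy):
    """Return True if the vertical projection of the centroid lies inside the
    (convex, counter-clockwise) support polygon on the ground plane."""
    p = np.asarray(centroid_xyz[:2], dtype=float)         # drop the height coordinate
    poly = np.asarray(support_polygon_xy, dtype=float)
    # For a convex CCW polygon the point must be to the left of every edge.
    for a, b in zip(poly, np.roll(poly, -1, axis=0)):
        edge, rel = b - a, p - a
        if edge[0] * rel[1] - edge[1] * rel[0] < 0:        # negative cross product => outside
            return False
    return True

# Example: a 2x1 box footprint; a centroid hanging over the edge is rejected.
footprint = [(0, 0), (2, 0), (2, 1), (0, 1)]
print(centroid_over_support((1.0, 0.5, 3.0), footprint))   # True  -> stable
print(centroid_over_support((2.5, 0.5, 3.0), footprint))   # False -> topples
```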
This is a great paper that exploits Newtonian mechanics as well as human intuition; however, there are still some heuristics (like the density of a human) which could probably be generalized away. It seems that this approach has big potential, so it might become the seminal paper of a new direction. Combining recognition with geometry reconstruction is quite trendy now, and this method is ideologically simple but effective. There are a lot of examples of how the algorithm works on the project page.
Funny papers
There are a couple of ECCV papers with fancy titles. The first one is "Being John Malkovich" by Ira Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, and Steve Seitz from the University of Washington GRAIL. If you've seen the movie, you can guess what the article is about. Given a video of someone pulling faces, the algorithm transforms it into a video of John Malkovich making similar faces. "Ever wanted to be someone else? Now you can." In contrast to the movie, in the paper it is not necessarily John Malkovich who plays himself: it could be George Bush, Cameron Diaz, George Clooney or indeed any person for whom you can find a sufficiently large video or photo database! You can see a video of real-time puppetry on the project page, although there are obvious lags and the result is still far from perfect.
Another fancy title is "Building Rome on a Cloudless Day". There are 11 (eleven) authors contributing to the paper, including Marc Pollefeys. This summer I spent a cloudless day in Rome, and, to be honest, it was not that pleasant. So, why is the paper called this way then? The title refers to another paper, "Building Rome in a Day" from ICCV 2009 by the guys from Washington again, which itself refers to the proverb "Rome was not built in a day." In that paper the authors built a dense 3D model of some Rome sights from a set of Flickr photos tagged "Rome" or "Roma". Returning to the monument of collective intelligence from ECCV 2010: the authors did the same, but without cloud computing, which is why the day is cloudless now. S.P.Q.R.
I cannot avoid mentioning the following papers here, although they are not from ECCV. Probably the most popular CVPR 2010 paper is "Food Recognition Using Statistics of Pairwise Local Features" by Shulin Yang, Mei Chen, Dean Pomerleau, and Rahul Sukthankar. The first page of the paper contains a motivational picture with a hamburger, and it looks pretty funny. They insist that the stuff from McDonald's is very different from that from Burger King, and that it is really important to recognize them in order to keep track of the calories. Well, the authors don't look overweight, so the method should work.
The last paper in this section is "Paper Gestalt" by the imaginary Carven von Bearnensquash, published in the Secret Proceedings of Computer Vision and Pattern Recognition (CVPR), 2010. The authors (presumably from UCSD) make fun of the way we usually write computer vision papers, assuming that certain features might convince a reviewer to accept or reject a paper: mathematical formulas that create an illusion of author qualification (even if they are irrelevant), ROC curves, etc. It also derides attempts to apply black-box machine-learning techniques without an appropriate analysis of the possible features. Now I am trying to subscribe to the Journal of Machine Learning Gossip.
Colleagues
There was only one paper from our lab at the conference: "Geometric Image Parsing in Man-Made Environments" by Olga Barinova and Elena Tretiak (in co-authorship with Victor Lempitsky and Pushmeet Kohli). A scheme similar to the image parsing framework [Tu et al., 2005] is used, i.e. top-down analysis is performed. They detect parallel lines (like the edges of buildings and windows), their vanishing points and the zenith jointly, using a witty graphical model. The approach is claimed to be robust to clutter in the edge map.
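For readers less familiar with this kind of geometric parsing, here is a toy version of one ingredient: estimating a single vanishing point from a group of line segments assumed to be parallel in 3D. This is just the classical least-squares (smallest-singular-vector) estimate, not the joint graphical model of the paper; the segment format is my assumption.

```python
# Least-squares vanishing point from line segments (classical building block,
# not the joint model of Barinova et al.).
import numpy as np

def vanishing_point(segments):
    """segments: iterable of ((x1, y1), (x2, y2)) image line segments.
    Returns the vanishing point as a homogeneous 3-vector."""
    lines = []
    for (x1, y1), (x2, y2) in segments:
        l = np.cross([x1, y1, 1.0], [x2, y2, 1.0])   # homogeneous line through the endpoints
        lines.append(l / np.linalg.norm(l[:2]))      # normalize so residuals are comparable
    L = np.stack(lines)
    # v minimizes ||L v|| subject to ||v|| = 1: the right singular vector
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(L)
    return vt[-1]

# Segments of two lines meeting at (5, 5): the estimate recovers that point.
v = vanishing_point([((0, 0), (1, 1)), ((0, 10), (1, 9))])
print(v[:2] / v[2])   # ~ [5. 5.]
```

The whole point of the paper, as I understand it, is that such per-group estimates, the grouping of edges into parallel families, and the zenith are not computed in separate stages like this, but inferred jointly.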
Indeed, this paper would not have been possible without me. =) It was me who convinced Lena to join the lab two years ago (actually, it was more like convincing her not to apply to another lab). So, the lab will remember me at least as a decent talent scout...