An Object-based Bayesian Framework for Top-down Visual Attention. Borji, A., Sihite, D. N., & Itti, L. In Proc. Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12), Toronto, Canada, pages 1529-1535, Aug, 2012.
We introduce a new task-independent framework to model top-down overt visual attention based on graphical models for probabilistic inference and reasoning. We describe a Dynamic Bayesian Network (DBN) that infers probability distributions over attended objects and spatial locations directly from observed data. Probabilistic inference in our model is performed over object-related functions which are fed from manual annotations of objects in video scenes or by state-of-the-art object detection models. Evaluating over approx. 3 hours (approx. 315,000 eye fixations and 12,600 saccades) of observers playing 3 video games (time-scheduling, driving, and flight combat), we show that our approach is significantly more predictive of eye fixations compared to: 1) simpler classifier-based models also developed here that map a signature of a scene (multi-modal information from gist, bottom-up saliency, physical actions, and events) to eye positions, 2) 14 state-of-the-art bottom-up saliency models, and 3) brute-force algorithms such as mean eye position. Our results show that the proposed model is more effective in employing and reasoning over spatio-temporal visual data.
@inproceedings{ Borji_etal12aaai,
  author = {A. Borji and D. N. Sihite and L. Itti},
  title = {An Object-based Bayesian Framework for Top-down Visual Attention},
  booktitle = {Proc. Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12), Toronto, Canada},
  abstract = {We introduce a new task-independent framework to model top-down overt visual attention based on graphical
                  models for probabilistic inference and reasoning.  We describe a Dynamic Bayesian Network (DBN) that
                  infers probability distributions over attended objects and spatial locations directly from observed
                  data.  Probabilistic inference in our model is performed over object-related functions which are fed
                  from manual annotations of objects in video scenes or by state-of-the-art object detection
                  models. Evaluating over approx. 3 hours (approx. 315,000 eye fixations and 12,600 saccades) of observers
                  playing 3 video games (time-scheduling, driving, and flight combat), we show that our approach is
                  significantly more predictive of eye fixations compared to: 1) simpler classifier-based models also
                  developed here that map a signature of a scene (multi-modal information from gist, bottom-up saliency,
                  physical actions, and events) to eye positions, 2) 14 state-of-the-art bottom-up saliency models, and
                  3) brute-force algorithms such as mean eye position. Our results show that the proposed model is
                  more effective in employing and reasoning over spatio-temporal visual data.},
  pages = {1529--1535},
  month = {Aug},
  year = {2012},
  review = {full/conf},
  type = {bu;td;mod;cv},
  if = {2012 acceptance rate: 26.0%},
  file = {http://ilab.usc.edu/publications/doc/Borji_etal12aaai.pdf}
}
