Movie Facts and Fibs (MF2): A Benchmark for Long Movie Understanding. Zaranis, E., Farinhas, A., Santos, S., Canaverde, B., Ramos, M. M., Surikuchi, A. K, Viveiros, A., Liao, B., Bueno-Benito, E., Sivakumaran, N., Vasylenko, P., Yu, S., Sannigrahi, S., Mohammed, W., Peters, B., Villegas, D. S., Stengel-Eskin, E., Attanasio, G., Yoon, J., Frank, S., Suglia, A., Zerva, C., Elliott, D., Dimiccoli, M., Bansal, M., Lanz, O., Bernardi, R., Fernández, R., Pezzelle, S., Niculae, V., & Martins, A. F. T. In Workshop on Multimodal Intelligence: Next Token Prediction & Beyond (ICLR 2026), 2026.
Holistic understanding of long-form video remains a challenge for vision-language models (VLMs). Unfortunately, current benchmarks cannot easily capture this limitation, since they mostly focus on "needle-in-a-haystack" details, rewarding context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we address this gap by introducing MF2, a new benchmark to evaluate how well models are able to comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long), requiring integration of both visual and language modalities. MF2 includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs: one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance. Despite the relative ease of the task for humans, who can effectively retain and reason over critical narrative information, current VLMs lack this ability and thus struggle.
@inproceedings{zaranis-etal-2026-mf2,
title={Movie Facts and Fibs (MF2): A Benchmark for Long Movie Understanding},
author={Emmanouil Zaranis and Ant\'onio Farinhas and Saul Santos and Beatriz Canaverde and Miguel Moura Ramos and Aditya K Surikuchi and Andr\'e Viveiros and Baohao Liao and Elena Bueno-Benito and Nithin Sivakumaran and Pavlo Vasylenko and Shoubin Yu and Sonal Sannigrahi and Wafaa Mohammed and Ben Peters and Danae S\'anchez Villegas and Elias Stengel-Eskin and Giuseppe Attanasio and Jaehong Yoon and Stella Frank and Alessandro Suglia and Chrysoula Zerva and Desmond Elliott and Mariella Dimiccoli and Mohit Bansal and Oswald Lanz and Raffaella Bernardi and Raquel Fern\'andez and Sandro Pezzelle and Vlad Niculae and Andr\'e F. T. Martins},
year={2026},
booktitle = {Workshop on Multimodal Intelligence: Next Token Prediction \& Beyond (ICLR 2026)},
url={https://openreview.net/pdf?id=i58ycqyHiV},
url_github = {https://github.com/deep-spin/MF2},
url_data = {https://huggingface.co/datasets/sardinelab/MF2},
abstract = {Holistic understanding of long-form video remains a challenge for vision-language models (VLMs). Unfortunately, current benchmarks cannot easily capture this limitation, since they mostly focus on ``needle-in-a-haystack'' details, rewarding context-insensitive retrieval over deep comprehension. Others rely on large-scale, semi-automatically generated questions (often produced by language models themselves) that are easier for models to answer but fail to reflect genuine understanding. In this paper, we address this gap by introducing MF2, a new benchmark to evaluate how well models are able to comprehend, consolidate, and recall key narrative information from full-length movies (50-170 minutes long), requiring integration of both visual and language modalities. MF2 includes over 50 full-length, open-licensed movies, each paired with manually constructed sets of claim pairs---one true (fact) and one plausible but false (fib), totalling over 850 pairs. These claims target core narrative elements such as character motivations and emotions, causal chains, and event order, and refer to memorable moments that humans can recall without rewatching the movie. Our experiments demonstrate that both open-weight and closed state-of-the-art models fall well short of human performance. Despite the relative ease of the task for humans, who can effectively retain and reason over critical narrative information, current VLMs lack this ability and thus struggle.}
}