Perhaps PTLMs Should Go to School – A Task to Assess Open Book and Closed Book QA

Perhaps PTLMs Should Go to School – A Task to Assess Open Book and Closed Book QA. Ciosici, M., Cecil, J., Lee, D., Hedges, A., Freedman, M., & Weischedel, R. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6104–6111, Online and Punta Cana, Dominican Republic, November, 2021. Association for Computational Linguistics.


Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5, fine-tuned with BoolQ, achieves the same performance, suggesting that the textbook's content is not pre-represented in the PTLM. Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have "understood" the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).
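
The evaluation setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the token-overlap retriever is a simple stand-in for whatever retrieval the paper uses, the function names (`tokenize`, `retrieve_paragraph`, `accuracy`) are hypothetical, and the paragraphs and labels are toy data invented here rather than drawn from the benchmark. It shows why a balanced true/false set puts a constant guesser at 50% accuracy, and what "retrieve a paragraph" might look like in the open-book setting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve_paragraph(statement, paragraphs):
    """Return the paragraph that shares the most tokens with the statement.

    Plain token overlap is an illustrative assumption, not the paper's retriever.
    """
    query = Counter(tokenize(statement))

    def overlap(paragraph):
        # Multiset intersection counts shared tokens, respecting repeats.
        return sum((query & Counter(tokenize(paragraph))).values())

    return max(paragraphs, key=overlap)


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold true/false labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)


# Toy data, invented for illustration only (not taken from the benchmark).
paragraphs = [
    "The Senate has one hundred members, two from each state.",
    "The House of Representatives has 435 voting members.",
]
labels = [True, False, True, False]  # balanced: half true, half false

# A constant "always true" guesser on a balanced set.
always_true = [True] * len(labels)
print(accuracy(always_true, labels))

# Open-book step: pick the paragraph a model would read before answering.
print(retrieve_paragraph("The Senate has two members from each state.", paragraphs))
```

In an actual open-book run, the retrieved paragraph and the statement would be passed together to the model, and its true/false outputs scored with the same accuracy measure.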

@inproceedings{ciosici-etal-2021-perhaps,
    title = "Perhaps {PTLM}s Should Go to School {--} A Task to Assess Open Book and Closed Book {QA}",
    author = "Ciosici, Manuel  and
      Cecil, Joe  and
      Lee, Dong-Ho  and
      Hedges, Alex  and
      Freedman, Marjorie  and
      Weischedel, Ralph",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.493",
    doi = "10.18653/v1/2021.emnlp-main.493",
    pages = "6104--6111",
    abstract = "Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be {\textasciitilde}50{\%}. T5, fine-tuned with BoolQ achieves the same performance, suggesting that the textbook{'}s content is not pre-represented in the PTLM. Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5{'}s pre-training), yields at best minor improvement (56{\%}), suggesting that the PTLM may not have {``}understood{''} the textbook (or perhaps misunderstood the questions). Performance is better ({\textasciitilde}60{\%}) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).",
}

