Better Synthetic Data by Retrieving and Transforming Existing Datasets. Gandhi, S., Gala, R., Viswanathan, V., Wu, T., & Neubig, G. April 2024.
Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
@inproceedings{gandhi_better_2024,
	title = {Better {Synthetic} {Data} by {Retrieving} and {Transforming} {Existing} {Datasets}},
	url = {https://www.semanticscholar.org/paper/Better-Synthetic-Data-by-Retrieving-and-Existing-Gandhi-Gala/00d4fea24baae6ac9a77ca2b0744f466b268e780},
	abstract = {Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, \textit{DataTune}, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49\% and improves over existing methods that use synthetic or retrieved training data by 34\%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.},
	urldate = {2024-04-26},
	author = {Gandhi, Saumya and Gala, Ritu and Viswanathan, Vijay and Wu, Tongshuang and Neubig, Graham},
	month = apr,
	year = {2024},
}
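
The abstract describes a retrieve-then-transform pipeline: locate an existing public dataset that is close to the target task, then use a large language model to rewrite each example into the target task's input/output format. The snippet below is a minimal illustrative sketch of that idea only; it is not the authors' DataTune implementation (which lives in the prompt2model repository linked above), and the dataset name, prompt wording, and model choice are assumptions made for the example.

# Illustrative sketch of dataset transformation: retrieve an existing public
# dataset, then prompt an LLM to repurpose each example for a target task.
# NOT the authors' DataTune code; dataset, prompt, and model are assumptions.
from datasets import load_dataset  # pip install datasets
from openai import OpenAI          # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK_DESCRIPTION = (
    "Given a movie review, answer: does the reviewer recommend the movie? "
    "Respond with 'yes' or 'no'."  # hypothetical target task
)

def transform_example(source_text: str) -> dict:
    """Ask an LLM to rewrite one retrieved example into the target task's format."""
    prompt = (
        f"Target task: {TASK_DESCRIPTION}\n\n"
        f"Source example from an existing dataset:\n{source_text}\n\n"
        "Rewrite this example so it fits the target task. "
        "Return exactly two lines: 'input: ...' and 'output: ...'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    lines = [l for l in response.choices[0].message.content.splitlines() if l.strip()]
    return {
        "input": lines[0].removeprefix("input:").strip(),
        "output": lines[-1].removeprefix("output:").strip(),
    }

# Retrieve a (hypothetically relevant) public dataset and transform a few rows.
source = load_dataset("imdb", split="train[:5]")
synthetic = [transform_example(row["text"]) for row in source]
for example in synthetic:
    print(example)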
