Better Synthetic Data by Retrieving and Transforming Existing Datasets. Gandhi, S., Gala, R., Viswanathan, V., Wu, T., & Neubig, G. April 2024.
Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
@inproceedings{gandhi_better_2024,
	title = {Better {Synthetic} {Data} by {Retrieving} and {Transforming} {Existing} {Datasets}},
	url = {https://www.semanticscholar.org/paper/Better-Synthetic-Data-by-Retrieving-and-Existing-Gandhi-Gala/00d4fea24baae6ac9a77ca2b0744f466b268e780},
	abstract = {Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, \textit{DataTune}, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49\% and improves over existing methods that use synthetic or retrieved training data by 34\%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.},
	urldate = {2024-04-26},
	author = {Gandhi, Saumya and Gala, Ritu and Viswanathan, Vijay and Wu, Tongshuang and Neubig, Graham},
	month = apr,
	year = {2024},
}
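
The abstract describes a retrieve-then-transform pipeline: locate an existing public dataset that is close to the target task, then use a large language model to rewrite each example into the target task's input/output format. The snippet below is a minimal illustrative sketch of that idea only; it is not the authors' DataTune implementation (which lives in the prompt2model repository linked above), and the dataset name, prompt wording, and model choice are assumptions made for the example.

# Illustrative sketch of dataset transformation: retrieve an existing public
# dataset, then prompt an LLM to repurpose each example for a target task.
# NOT the authors' DataTune code; dataset, prompt, and model are assumptions.
from datasets import load_dataset  # pip install datasets
from openai import OpenAI          # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK_DESCRIPTION = (
    "Given a movie review, answer: does the reviewer recommend the movie? "
    "Respond with 'yes' or 'no'."  # hypothetical target task
)

def transform_example(source_text: str) -> dict:
    """Ask an LLM to rewrite one retrieved example into the target task's format."""
    prompt = (
        f"Target task: {TASK_DESCRIPTION}\n\n"
        f"Source example from an existing dataset:\n{source_text}\n\n"
        "Rewrite this example so it fits the target task. "
        "Return exactly two lines: 'input: ...' and 'output: ...'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    lines = [l for l in response.choices[0].message.content.splitlines() if l.strip()]
    return {
        "input": lines[0].removeprefix("input:").strip(),
        "output": lines[-1].removeprefix("output:").strip(),
    }

# Retrieve a (hypothetically relevant) public dataset and transform a few rows.
source = load_dataset("imdb", split="train[:5]")
synthetic = [transform_example(row["text"]) for row in source]
for example in synthetic:
    print(example)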
