From Pixels to Prose: A Large Dataset of Dense Image Captions

From Pixels to Prose: A Large Dataset of Dense Image Captions. Singla, V., Yue, K., Paul, S., Shirkavand, R., Jayawardhana, M., Ganjdanesh, A., Huang, H., Bhatele, A., Somepalli, G., & Goldstein, T. June, 2024. arXiv:2406.10328 [cs]

Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose

@misc{singla_pixels_2024,
	title = {From {Pixels} to {Prose}: {A} {Large} {Dataset} of {Dense} {Image} {Captions}},
	shorttitle = {From {Pixels} to {Prose}},
	url = {http://arxiv.org/abs/2406.10328},
	doi = {10.48550/arXiv.2406.10328},
	abstract = {Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose},
	urldate = {2025-04-23},
	publisher = {arXiv},
	author = {Singla, Vasu and Yue, Kaiyu and Paul, Sukriti and Shirkavand, Reza and Jayawardhana, Mayuka and Ganjdanesh, Alireza and Huang, Heng and Bhatele, Abhinav and Somepalli, Gowthami and Goldstein, Tom},
	month = jun,
	year = {2024},
	note = {arXiv:2406.10328 [cs]},
	keywords = {Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, year-2-pubs, year-2-pubs-direct},
}
