Science of music-based citizen science: How seeing influences hearing. Bedoya, D., Lascabettes, P., Fyfe, L., & Chew, E. PLOS ONE, 20(9):e0325019, September 2025. Publisher: Public Library of Science
Citizen science engages volunteers to contribute data to scientific projects, often through visual annotation tasks. Hearing-based activities are rare and less well understood. Having high-quality annotations of performed music structures is essential for reliable algorithmic analysis of recorded music, with applications ranging from music information retrieval to music therapy. Music annotations typically begin with an aural input combined with a variety of visual representations, but the impact of the visual and aural inputs on the annotations is not known. Here, we present a study where participants annotate music segmentation boundaries of variable strengths given only visuals (audio waveform or piano roll), only audio, or both visuals and audio simultaneously. Participants were presented with the set of 33 contrasting theme and variations extracted from a through-recorded performance of Beethoven’s 32 Variations in C minor, WoO 80, under differing audiovisual conditions. Their segmentation boundaries were visualized using boundary credence profiles and compared using the unbalanced optimal transport distance, which tracks boundary weights and penalizes boundary removal, and the results were compared to the F-measure. Compared to annotations derived from audio/visual (cross-modal) input (considered the gold standard for our study), boundary annotations derived from visual (unimodal) input were closer than those derived from audio (unimodal) input. The presence of visuals led to larger peaks in boundary credence profiles, marking clearer global segmentations, while audio helped resolve discrepancies and capture subtle segmentation cues. We conclude that audio and visual inputs can be used as cognitive scaffolding to enhance results in large-scale citizen science annotation of music media and to support data analysis and interpretation. In summary, visuals provide cues for big structures, but complex structural nuances are better discerned by ear.
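
As context for the evaluation described above, the sketch below illustrates one common way segmentation boundary annotations are compared with an F-measure: an estimated boundary counts as a hit if it falls within a tolerance window of a not-yet-matched reference boundary. This is a minimal Python sketch under assumed conventions; the 0.5 s tolerance, the function name, and the greedy matching are illustrative choices only, not the authors' implementation, which additionally weighs boundary strengths via the unbalanced optimal transport distance.

# Minimal sketch of a boundary F-measure with a tolerance window.
# Illustrative assumptions: 0.5 s tolerance, greedy nearest matching.
# Not the paper's method, which also uses unbalanced optimal transport.

def boundary_f_measure(reference, estimated, tolerance=0.5):
    """Precision, recall, and F-measure for boundary times in seconds.

    An estimated boundary is a hit if it lies within `tolerance` seconds
    of a reference boundary that has not already been matched.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    matched = set()   # indices of reference boundaries already used
    hits = 0
    for e in est:
        # Candidate reference boundaries within the tolerance window.
        candidates = [(abs(e - r), i) for i, r in enumerate(ref)
                      if i not in matched and abs(e - r) <= tolerance]
        if candidates:
            _, i = min(candidates)   # greedily take the closest one
            matched.add(i)
            hits += 1
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)
    return precision, recall, f

# Hypothetical example: boundary times (seconds) from two conditions.
audio_visual = [12.0, 25.0, 37.5, 50.0]   # treated as the reference here
audio_only = [12.1, 24.8, 50.3]
print(boundary_f_measure(audio_visual, audio_only))   # ~ (1.0, 0.75, 0.857)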
@article{bedoya_science_2025,
	title = {Science of music-based citizen science: {How} seeing influences hearing},
	volume = {20},
	issn = {1932-6203},
	shorttitle = {Science of music-based citizen science},
	url = {https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0325019},
	doi = {10.1371/journal.pone.0325019},
	abstract = {Citizen science engages volunteers to contribute data to scientific projects, often through visual annotation tasks. Hearing-based activities are rare and less well understood. Having high-quality annotations of performed music structures is essential for reliable algorithmic analysis of recorded music, with applications ranging from music information retrieval to music therapy. Music annotations typically begin with an aural input combined with a variety of visual representations, but the impact of the visual and aural inputs on the annotations is not known. Here, we present a study where participants annotate music segmentation boundaries of variable strengths given only visuals (audio waveform or piano roll), only audio, or both visuals and audio simultaneously. Participants were presented with the set of 33 contrasting theme and variations extracted from a through-recorded performance of Beethoven’s 32 Variations in C minor, WoO 80, under differing audiovisual conditions. Their segmentation boundaries were visualized using boundary credence profiles and compared using the unbalanced optimal transport distance, which tracks boundary weights and penalizes boundary removal, and the results were compared to the F-measure. Compared to annotations derived from audio/visual (cross-modal) input (considered the gold standard for our study), boundary annotations derived from visual (unimodal) input were closer than those derived from audio (unimodal) input. The presence of visuals led to larger peaks in boundary credence profiles, marking clearer global segmentations, while audio helped resolve discrepancies and capture subtle segmentation cues. We conclude that audio and visual inputs can be used as cognitive scaffolding to enhance results in large-scale citizen science annotation of music media and to support data analysis and interpretation. In summary, visuals provide cues for big structures, but complex structural nuances are better discerned by ear.},
	language = {en},
	number = {9},
	urldate = {2025-09-11},
	journal = {PLOS ONE},
	author = {Bedoya, Daniel and Lascabettes, Paul and Fyfe, Lawrence and Chew, Elaine},
	month = sep,
	year = {2025},
	note = {Publisher: Public Library of Science},
	pages = {e0325019},
}
