In *International Conference on Learning Representations (ICLR)*, 2023. (spotlight presentation)

Traditionally, data valuation is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis, when the choice of the learning algorithm is still undetermined. Another side effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between the training and the validation set. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the Wasserstein distance. Importantly, these values can be directly obtained from the output of off-the-shelf optimization solvers once the Wasserstein distance is computed. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over the state-of-the-art performance while being orders of magnitude faster.
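The idea that per-point values fall out of the solver output can be illustrated with a small sketch. It is not the authors' implementation: it assumes a plain squared Euclidean cost on features (the class-wise label term from the abstract is omitted), solves the discrete optimal transport problem as a linear program with SciPy's HiGHS backend, and reads the dual variables of the training-side constraints as sensitivities. The function name `ot_dual_values` and the leave-one-out centering of the duals are assumptions made for illustration; the centering is used here only to remove the additive constant by which optimal transport duals are ambiguous.

```python
import numpy as np
from scipy.optimize import linprog


def ot_dual_values(x_train, x_val):
    """Solve discrete OT as an LP; return (transport cost, centered training duals).

    Hypothetical helper: uniform weights, squared Euclidean feature cost only.
    """
    n, m = len(x_train), len(x_val)
    # Pairwise squared Euclidean cost between training and validation points.
    cost = ((x_train[:, None, :] - x_val[None, :, :]) ** 2).sum(axis=2)

    # Marginal constraints on the flattened (row-major) transport plan:
    # each training point ships mass 1/n, each validation point receives 1/m.
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        a_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])

    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")

    # Dual variables of the training-side constraints: the sensitivity of the
    # optimal cost to the mass placed on each training point.
    duals = res.eqlin.marginals[:n]
    # Assumed centering: subtract the mean dual of the other points, which
    # cancels the arbitrary additive shift of the dual solution.
    centered = duals - (duals.sum() - duals) / (n - 1)
    return res.fun, centered
```

Under this sign convention a larger value means that putting more weight on the point increases the distance to the validation set, so a high score flags a point as potentially low quality. The dense constraint matrix is only practical for small sets; a dedicated optimal transport solver would be the natural choice at scale.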

@inproceedings{2023_4C_LAVA,
  title={LAVA: Data Valuation without Pre-Specified Learning Algorithms},
  author={Hoang Anh Just* and Feiyang Kang* and Tianhao Wang and Yi Zeng and Myeongseob Ko and Ming Jin and Ruoxi Jia},
  booktitle={International Conference on Learning Representations (ICLR)},
  note={(spotlight presentation)},
  year={2023},
  url_openreview={https://openreview.net/forum?id=JJuP86nBl4q},
  keywords={Machine Learning},
  abstract={Traditionally, data valuation is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. However, this dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data needs to be valued before the actual analysis, when the choice of the learning algorithm is still undetermined. Another side effect of the dependence is that to assess the value of individual points, one needs to re-run the learning algorithm with and without a point, which incurs a large computation burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between the training and the validation set. We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions. (2) We develop a novel method to value individual data based on the sensitivity analysis of the Wasserstein distance. Importantly, these values can be directly obtained from the output of off-the-shelf optimization solvers once the Wasserstein distance is computed. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over the state-of-the-art performance while being orders of magnitude faster.}
}
