Detection of Anomalies in Large Datasets Using an Active Learning Scheme Based on Dirichlet Distributions. Pichara, K., Soto, A., & Araneda, A. In Advances in Artificial Intelligence, Iberamia-08, LNCS 5290, pages 163-172, 2008.
Today, the detection of anomalous records is a highly valuable application in the analysis of current huge datasets. In this paper we propose a new algorithm that, with the help of a human expert, efficiently explores a dataset with the goal of detecting relevant anomalous records. Under this scheme the computer selectively asks the expert for data labeling, looking for relevant semantic feedback in order to improve its knowledge about what characterizes a relevant anomaly. Our rationale is that while computers can process huge amounts of low level data, an expert has high level semantic knowledge to efficiently lead the search. We build upon our previous work based on Bayesian networks that provides an initial set of potential anomalies. In this paper, we augment this approach with an active learning scheme based on the clustering properties of Dirichlet distributions. We test the performance of our algorithm using synthetic and real datasets. Our results indicate that, under noisy data and anomalies presenting regular patterns, our approach significantly reduces the rate of false positives, while decreasing the time to reach the relevant anomalies.
