Learning from Data Streams

Author(s): João Gama (University of Porto, Portugal) and Pedro Pereira Rodrigues (University of Porto, Portugal)
Copyright: 2009
Pages: 5
Source title: Encyclopedia of Data Warehousing and Mining, Second Edition
Source Author(s)/Editor(s): John Wang (Montclair State University, USA)
DOI: 10.4018/978-1-60566-010-3.ch176

Abstract

In the last two decades, machine learning research and practice have focused on batch learning, usually with small datasets. In batch learning, the whole training set is available to the algorithm, which outputs a decision model after processing the data, in most cases multiple times. The rationale behind this practice is that examples are generated at random according to some stationary probability distribution. Also, most learners use a greedy, hill-climbing search in the space of models.

What distinguishes current datasets from earlier ones is the continuous flow of data and the automatic data feeds. We no longer just have people entering information into a computer. Instead, we have computers entering data into each other. Nowadays there are applications in which the data is best modelled not as persistent tables but as transient data streams. In some applications it is not feasible to load the arriving data into a traditional database management system (DBMS), and traditional DBMSs are not designed to directly support the continuous queries required by these applications (Babcock et al., 2002). These sources of data are called data streams.

There is a fundamental difference between learning from small datasets and learning from large datasets. As pointed out by some researchers (Brain & Webb, 2002), current learning algorithms emphasize variance reduction. However, learning from large datasets may be more effective with algorithms that place greater emphasis on bias management. Algorithms that process data streams deliver approximate solutions, providing a fast answer while using little memory. They relax the requirement of an exact answer to that of an approximate answer within a small error range with high probability. In general, as the error range decreases, the computational resources required go up. In some applications, mostly database oriented, an approximate answer should be within an admissible error margin. Some results on tail inequalities provided by statistics are useful to accomplish this goal.
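The trade-off between error range and resources can be illustrated with one well-known tail inequality, the Hoeffding bound. The abstract does not say which inequalities the chapter relies on, so the choice of Hoeffding here is an assumption, and the names `hoeffding_epsilon`, `samples_needed`, and `RunningMean` are hypothetical helpers introduced only for this sketch. It assumes independent observations that all fall in an interval of known width.

```python
import math
import random


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Two-sided Hoeffding bound (assumed choice of tail inequality).

    For n independent observations confined to an interval of width
    value_range, the sample mean lies within the returned epsilon of the
    true mean with probability at least 1 - delta.
    """
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def samples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which hoeffding_epsilon(...) <= epsilon."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


class RunningMean:
    """Mean of a stream kept in constant memory (one counter, one float)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def update(self, x: float) -> None:
        # Incremental update: each example is seen once and then discarded.
        self.n += 1
        self.mean += (x - self.mean) / self.n


if __name__ == "__main__":
    # Halving the admissible error roughly quadruples the examples required.
    print(samples_needed(1.0, 0.05, 0.10))  # 185
    print(samples_needed(1.0, 0.05, 0.05))  # 738

    # Simulated stream of values in [0, 1]; stop once the bound is tight enough.
    rng = random.Random(0)
    estimate = RunningMean()
    while estimate.n == 0 or hoeffding_epsilon(1.0, 0.05, estimate.n) > 0.05:
        estimate.update(rng.random())
    print(estimate.n, round(estimate.mean, 3))
```

The required sample size grows with 1/epsilon^2, which is one concrete form of the statement that a smaller error range costs more resources. The estimator itself stores only a count and a mean, regardless of how many examples have passed.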
