Contextual and Spatio-temporal Data Cleaning
It is becoming increasingly difficult for organizations to reap value from their data due to poor data quality. This is motivated by the observation that real data is rarely error free, containing incomplete, inconsistent, and stale values. This leads to inaccurate, and out-of-date data analysis downstream. Addressing data inconsistency requires not only reconciling differing syntactic references to an entity, but it is often necessary to include domain expertise to correctly interpret the data. For example, understanding that a reference to ‘jaguar’ may be interpreted as an animal or as a vehicle. Secondly, having up-to-date (or current) data is important for timely data analysis. Cleaning stale values goes beyond just relying on timestamps, especially when timestamps may be missing, inaccurate or incomplete.
In this talk, I will present our work towards achieving consistent and up-to-date data. First, I will discuss contextual data cleaning that uses a new class of data integrity constraints that tightly integrate domain semantics from an ontology. Second, we argue that data currency is a relative notion based on individual spatio-temporal update patterns, and these patterns can be learned and predicted. I will present our framework to achieve these two objectives, and provide a brief overview of recent extensions with applications to knowledge fusion.