Solving the GIGO Problem: Using Schemas to Enhance the Value of Data
Data science is a powerful tool for delivering better products and experiences to customers. Good data analytics lets you identify problems in real time, and fix them fast. But you can’t reap that benefit without good data, and ensuring that you have good data is one of the great challenges for data scientists. We’ve found that schemas can provide powerful tools to address that challenge.
At Comcast, we use data analytics to improve our products and deliver better, faster, more reliable experiences to our customers. In order to do that, we collect large amounts of non-personal telemetry data about network utilization, latency, throughput and any technology issues that may be impacting performance.
We use that data to identify usage “hot spots”, ensure consistent video quality, and measure how services like DVR and Video on Demand, among many others, are performing.
Data scientists have a colorful shorthand for the issues around ensuring good data inputs: “garbage in, garbage out”, or “GIGO”. The GIGO problem is just what it sounds like. You can’t deliver powerful data analytics results unless you have good data to start with.
Our analytic process involves integrating data from several different internal sources. Early on, there were few naming standards and the documentation was spotty. The old rule of thumb applied: 70% of the analysts’ time went into gathering, understanding, cleaning and integrating the data, while only 30% went into doing the actual analyses and simulations.
The solution: data governance
My colleagues and I in Comcast Technology and Product are developing a new internal system that serves as a single point of ingest for real-time data, along with a set of data storage solutions designed to avoid these data integration problems. We do this through data governance: describing, validating and documenting data consistently, so that data discovery and integration become easy and analysts start from good source material.
Our data governance solution is based on pairing all data in our system with a schema that describes it syntactically (what its form is) and semantically (what it means). We use Apache Avro, an open source schema representation language and data serialization format. There are many pluses to using Avro’s serialization, including excellent compression rates, but we’ll focus here on Avro schemas.
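To make the syntactic/semantic split concrete, here is a minimal sketch of what an Avro record schema can look like, written as a Python dictionary and serialized to the JSON form that Avro schemas use. The record name `LatencySample`, its namespace, and all of its fields are hypothetical, chosen only for illustration. The `type` entries carry the syntax and the `doc` entries carry the semantics.

```python
import json

# Hypothetical schema for illustration only; every name here is an assumption.
LATENCY_SAMPLE_SCHEMA = {
    "type": "record",
    "name": "LatencySample",
    "namespace": "example.telemetry",
    "doc": "One round-trip latency measurement reported by a device.",
    "fields": [
        {"name": "deviceId", "type": "string",
         "doc": "Opaque identifier of the reporting device."},
        {"name": "timestampMs", "type": "long",
         "doc": "Measurement time in milliseconds since the Unix epoch (UTC)."},
        {"name": "latencyMs", "type": "double",
         "doc": "Round-trip latency in milliseconds."},
    ],
}


def describe(schema):
    """Return one 'name (type): meaning' line per field of a record schema."""
    return [f"{f['name']} ({f['type']}): {f['doc']}" for f in schema["fields"]]


# Avro schemas are plain JSON, so the dictionary round-trips through json.
avsc_text = json.dumps(LATENCY_SAMPLE_SCHEMA, indent=2)

if __name__ == "__main__":
    print("\n".join(describe(json.loads(avsc_text))))
```

Because the documentation lives inside the schema itself, anyone who can read the schema can also read what each field means, without hunting for a separate data dictionary.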
Avro schemas enforce the types and structures of data, and also document the meaning of each attribute. A library of core subschemas enables reuse of standard naming conventions and formats for commonly referenced data such as device, network interface, error, and tracing messages as they move across the network. When core subschemas are used and the data producer refers to “deviceId”, the semantics of that field are well known and documented.
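The idea of a shared core subschema plus type enforcement can be sketched as follows. This is an illustration under stated assumptions, not the actual subschema library: the `Device`, `ErrorEvent` and `ThroughputEvent` records and their fields are hypothetical, and `validate` is a deliberately simplified stand-in for what an Avro library does (it handles only a few primitive types and nested records, with no unions, arrays or defaults).

```python
# Simplified stand-in for an Avro library's validation. Schema names and
# fields are hypothetical and exist only to illustrate subschema reuse.
PRIMITIVES = {"string": str, "long": int, "double": float, "boolean": bool}

# Hypothetical core subschema, defined once and embedded wherever needed.
DEVICE = {
    "type": "record",
    "name": "Device",
    "fields": [
        {"name": "deviceId", "type": "string",
         "doc": "Opaque identifier of the device."},
        {"name": "model", "type": "string",
         "doc": "Hardware model name."},
    ],
}


def with_device(name, extra_fields):
    """Build an event schema that embeds the shared Device subschema."""
    device_field = {"name": "device", "type": DEVICE,
                    "doc": "Device that produced the event."}
    return {"type": "record", "name": name,
            "fields": [device_field] + extra_fields}


ERROR_EVENT = with_device("ErrorEvent", [
    {"name": "errorCode", "type": "long", "doc": "Numeric error code."}])
THROUGHPUT_EVENT = with_device("ThroughputEvent", [
    {"name": "bitsPerSecond", "type": "double", "doc": "Measured throughput."}])


def validate(schema, datum):
    """Return a list of problems; an empty list means the datum conforms."""
    if isinstance(schema, str):  # primitive type name
        expected = PRIMITIVES[schema]
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_stray_bool = expected is not bool and isinstance(datum, bool)
        if isinstance(datum, expected) and not is_stray_bool:
            return []
        return [f"expected {schema}, got {type(datum).__name__}"]
    if not isinstance(datum, dict):  # record schema needs a mapping
        return [f"expected record {schema['name']}, got {type(datum).__name__}"]
    problems = []
    for field in schema["fields"]:
        if field["name"] not in datum:
            problems.append(f"missing field {field['name']}")
        else:
            nested = validate(field["type"], datum[field["name"]])
            problems += [f"{field['name']}: {p}" for p in nested]
    return problems


if __name__ == "__main__":
    bad = {"device": {"deviceId": 123, "model": "x"}, "bitsPerSecond": 1.5}
    print(validate(THROUGHPUT_EVENT, bad))
```

Both event schemas embed the same `Device` definition, so `deviceId` has one type and one documented meaning no matter which producer emits it.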
Avro schemas follow the data from the upstream to the downstream end of our internal data infrastructure: from initial ingest via Apache Kafka, through intermediate in-flight processing and enrichment, until the data finally comes to rest in one or more data stores (big data lake, key-value store, time series database, etc.). This lets us understand and integrate data at any point in its journey.
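One way a schema can travel with its data is for every message to carry a schema identifier that any stage can resolve back to the full schema. The sketch below illustrates that pattern only; the text above does not specify the mechanism actually used. The in-memory `REGISTRY`, the envelope layout, the `enrich` step and the two schemas are all hypothetical, and `schema_id` is a simple hash of sorted JSON, not Avro's own Parsing Canonical Form fingerprint.

```python
import hashlib
import json

REGISTRY = {}  # hypothetical in-memory registry: schema id -> schema


def schema_id(schema):
    """Derive a stable id from the schema's JSON (assumption: sorted keys)."""
    text = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def wrap(schema, payload):
    """Register the schema and pair the payload with its schema id."""
    sid = schema_id(schema)
    REGISTRY[sid] = schema
    return {"schemaId": sid, "payload": payload}


def field_docs(message):
    """At any stage, look up what each field of a message means."""
    schema = REGISTRY[message["schemaId"]]
    return {f["name"]: f["doc"] for f in schema["fields"]}


# Hypothetical schemas: the enriched form adds one documented field.
RAW = {"type": "record", "name": "LatencyRaw", "fields": [
    {"name": "deviceId", "type": "string", "doc": "Reporting device."},
    {"name": "latencyMs", "type": "double", "doc": "Round-trip latency, ms."},
]}
ENRICHED = {"type": "record", "name": "LatencyEnriched", "fields": RAW["fields"] + [
    {"name": "region", "type": "string", "doc": "Region added in flight."},
]}


def enrich(message, region):
    """In-flight step: emit a new message under the enriched schema."""
    payload = dict(message["payload"], region=region)
    return wrap(ENRICHED, payload)


if __name__ == "__main__":
    ingested = wrap(RAW, {"deviceId": "abc", "latencyMs": 12.5})
    stored = enrich(ingested, "northeast")
    print(field_docs(stored))
```

The point of the pattern is that a consumer at the storage end can interpret a message exactly as well as the producer at the ingest end, because both resolve the same schema.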
We are currently exploring open source tools to add data discovery and data lineage to our data governance platform. With those in place, we will be able to trace the movement of a dataset, or even a single data message, from ingest all the way to storage. We will be able to answer questions like: “Where in the ecosystem can I find data about X?”; “In which data stores can I find data that came in via Kafka topic Y?”; “Where did the data I’m looking at in this database come from (i.e., who produced it), and how has it changed since it was first ingested?”
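The lineage questions above amount to walking a graph of "data moved from A to B" records. The following sketch shows the shape of that idea only; it does not represent any particular lineage tool. The `LineageEvent` record, the location naming scheme (`producer:`, `kafka:`, `store:`) and the sample events are all hypothetical, and the traversal assumes the lineage graph has no cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEvent:
    dataset: str      # logical dataset name
    source: str       # where the data came from
    destination: str  # where the data went
    operation: str    # what happened on the way


# Hypothetical sample lineage, for illustration only.
EVENTS = [
    LineageEvent("latency", "producer:gateway-team", "kafka:latency-raw", "ingest"),
    LineageEvent("latency", "kafka:latency-raw", "kafka:latency-enriched", "enrich"),
    LineageEvent("latency", "kafka:latency-enriched", "store:data-lake", "persist"),
    LineageEvent("latency", "kafka:latency-enriched", "store:timeseries-db", "persist"),
    LineageEvent("errors", "producer:video-team", "kafka:errors-raw", "ingest"),
    LineageEvent("errors", "kafka:errors-raw", "store:data-lake", "persist"),
]


def downstream_stores(events, start):
    """Which data stores hold data that passed through `start`?"""
    seen, frontier, stores = {start}, [start], set()
    while frontier:
        node = frontier.pop()
        for event in events:
            if event.source == node and event.destination not in seen:
                seen.add(event.destination)
                frontier.append(event.destination)
                if event.destination.startswith("store:"):
                    stores.add(event.destination)
    return stores


def upstream_path(events, location, dataset):
    """Where did this dataset at `location` come from, oldest step first?"""
    path, current = [], location
    while True:
        step = next((e for e in events
                     if e.destination == current and e.dataset == dataset), None)
        if step is None:
            break
        path.append(step)
        current = step.source
    return list(reversed(path))


if __name__ == "__main__":
    print(sorted(downstream_stores(EVENTS, "kafka:latency-raw")))
    for step in upstream_path(EVENTS, "store:timeseries-db", "latency"):
        print(step.source, "->", step.destination, f"({step.operation})")
```

`downstream_stores` answers the "which stores hold data from topic Y" question, and `upstream_path` answers "who produced this, and what happened to it since".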
Data analytics involving the integration of several disparate datasets is doomed to failure due to the GIGO rule, unless the datasets are well understood, validated and governed. We are using Apache Avro schemas as the backbone of our data governance solution, to solve the GIGO problem. Thus, we enable analysts to use the data to produce the best possible understanding of network utilization, errors, and metrics, and ultimately improve the quality of service for Comcast customers.