A New Age for Data Quality

Once, most data quality issues stemmed from human error and inadequate business processes. While these still exist, new data sources, such as sensor data and third-party data from social media, openData and the "wisdom of the crowd", introduce new sources of potential error. And yet, the old ways of storing "data" in log books, engineering journals, paper notes and filing cabinets are still widely practiced. At the same time, data quality is more important than ever, as organizations rely more on predictive algorithms, machine learning, deep learning, artificial intelligence and cognitive computing. The basics of data quality have remained the same, but the means by which we can assure data quality are changing.

Data Quality Basics

Fundamentally, data quality is about trust: trust that the decisions made from the data are good decisions, based upon trustworthy data. To achieve this trust, data must be:

  1. correct
  2. valid
  3. accurate
  4. timely
  5. complete
  6. consistent
  7. singular (no duplications that affect counts, aggregates, etc.)
  8. unique
  9. [have] referential integrity
  10. [apply] domain integrity (data rules)
  11. [enforce] business rules

Now, these principles must be applied to all the new sources and uses of data, often as part of streaming or real-time decision support, automated decisions, or autonomous systems.
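
To make a few of these principles concrete, here is a minimal sketch of completeness, uniqueness, validity and domain-integrity checks over a hypothetical customer table using pandas; the column names and rules are illustrative assumptions, not a prescription.

```python
# A minimal sketch of basic data quality checks using pandas.
# The table, column names and rules below are illustrative assumptions.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email":       ["a@example.com", None, "b@example.com", "not-an-email"],
    "country":     ["US", "DE", "DE", "XX"],
})

report = {}

# Completeness: how many required values are missing?
report["missing_email"] = int(customers["email"].isna().sum())

# Uniqueness / singularity: duplicate keys distort counts and aggregates.
report["duplicate_ids"] = int(customers["customer_id"].duplicated().sum())

# Domain integrity: values must come from an agreed domain (data rule).
valid_countries = {"US", "DE", "FR", "MX"}
report["invalid_country"] = int((~customers["country"].isin(valid_countries)).sum())

# Validity: a simple format rule for email addresses.
is_valid_email = customers["email"].fillna("").str.contains(
    r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)
report["invalid_email"] = int((~is_valid_email).sum())

print(report)
```

In a streaming or real-time setting the same rules apply; they simply run continuously against each arriving batch or event rather than once per nightly load.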

Moreover, the data rules and the business rules must reflect reality, including evolving cultural norms and regulatory requirements. For example, in many areas of the world, gender is no longer based simply on biology at birth, but includes gender identification that may be more than just male or female, and may change over time as an individual's self-awareness changes. As another example, regulations in some areas of the world are imposing stricter restrictions around individual privacy, such as the General Data Protection Regulation (GDPR) in the EU with full application coming in May of 2018.

Data Verification

Third-party data verification tools have been around for decades; they are often purchased and installed on-premises, including their own databases of information. Today, data verification may be done through such tools, or through openData and openGov databases; modern data preparation tools may even recommend freely available data sources, such as demographic data, to enhance and verify the data that your organization has collected or generated. Other data, such as social media data, is also available to enhance your understanding of the customers, markets, culture, regulations and politics that might influence your decisions. Current third-party data is most often accessed through Application Programming Interfaces (APIs) that may be HTTP or RESTful, or might be proprietary. Use, or rather, misuse of these APIs has the potential to degrade, rather than enhance, your decision support process. Another issue is that you may not know how third-party data is governed according to the basics of data quality. Again, modern data preparation and API management tools can help with these issues, as can open architectures and specifications.
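
As a hedged sketch of API-based verification, the example below posts an address record to a hypothetical RESTful verification service and keeps both the original and the verified values. The endpoint, parameters and response fields are placeholders; the real contract comes from your vendor's API documentation.

```python
# A hedged sketch of verifying addresses against a third-party RESTful service.
# The endpoint, payload and response fields are hypothetical placeholders.
import requests

VERIFY_URL = "https://api.example-verifier.com/v1/address/verify"  # hypothetical
API_KEY = "YOUR_API_KEY"

def verify_address(record: dict) -> dict:
    """Send one address record for verification and return the annotated result."""
    resp = requests.post(
        VERIFY_URL,
        json={"street": record["street"],
              "city": record["city"],
              "postal_code": record["postal_code"]},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()
    # Keep both the original and the verified values so lineage is preserved.
    return {**record,
            "verified": result.get("match", False),
            "standardized": result.get("standardized_address")}
```

Keeping the original values alongside the verified ones is one way to guard against the misuse the paragraph above warns of: a questionable enrichment can always be traced back and reversed.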

Data from sensors and from sensor-actuator feedback loops isn't new. What is new is data from connected sensors, actuators, feedback loops, and all kinds of things: from pills to diagnostic machines, from wearables to cars, from parking sensors to a city's complete transportation system, some of which may be available through openGov initiatives. Many of the organizations now using such IoT data have never worked with data like this before.

Now that we have taken a very brief look at data quality and these new opportunities, let's turn to the new tools we have for taking advantage of them.

Data Stewardship through AI

In the spirit of drinking one’s own champagne, many of the new uses of data – the output of data science – are being applied to data management itself. As software has consumed the world, machine learning is eating software; deep learning and artificial intelligence are rapidly becoming the top of this food chain. Once, a dozen or so source systems made for a good-sized data warehouse, with nightly ETL updates. Now, organizations are streaming hundreds of sources into data lakes. The people, processes and technologies for data quality can only keep up when augmented by advanced analytic algorithms. Machine learning uses metadata to continuously update business catalogues as artificial intelligence augments the data stewards. Metadata is changing as well, to provide semantic layers within data management tools, and to better understand the data sets coming from the IoT, social media, or open data initiatives.
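
As a minimal sketch of this kind of augmentation, assume a steward has already tagged a handful of columns; a small scikit-learn model can then propose glossary tags for new columns arriving from a data lake, with the steward confirming or correcting the suggestions. The training columns, labels and model choice are illustrative assumptions.

```python
# A minimal sketch of "augmented stewardship": suggest business-glossary tags
# for incoming columns from their names, using a small scikit-learn model.
# The labeled examples and tags below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Columns a data steward has already labeled (the metadata the model learns from).
labeled_columns = ["cust_email", "customer_phone", "order_total", "invoice_amount",
                   "birth_date", "ship_date", "patient_id", "account_id"]
labels = ["contact", "contact", "financial", "financial",
          "date", "date", "identifier", "identifier"]

model = make_pipeline(CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      MultinomialNB())
model.fit(labeled_columns, labels)

# New columns arriving from a data lake source: the model proposes tags,
# and the human steward confirms or corrects them before the catalogue is updated.
for col in ["contact_email", "txn_amount", "admission_date"]:
    print(col, "->", model.predict([col])[0])
```

The point is not the model, which is deliberately tiny, but the workflow: the machine proposes, the steward disposes, and every confirmation becomes new training metadata.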

The first players to apply these techniques to data management and analytics became our first "DataGrok" companies: companies whose products help humans grok data and how that data can be used. Since then, the first companies to earn the DataGrok designation, Paxata and Ayasdi, have been joined by many others adding machine learning, deep learning and even artificial narrow intelligence (ANI) to provide recommendations and guardrails to data scientists, data stewards, business analysts, and any individual using organizational data to make decisions.

Data Quality Relations

Data management, developed through the execution of enterprise architecture, policies, practices and procedures, encompasses the interaction among data quality, data governance, and data integrity. Regulatory and process compliance are dependent upon all three. Ownership of each data set, data element and even datum is critical to assuring data quality and data integrity, and is the first step in providing data governance. Business metadata, technical metadata and object metadata come together through business, technical and operational ownership of the data to build data stewardship and data custodian policies. The architectural frameworks used for Enterprise, IoT and Data architectures result in specifications for each critical data element that provide an overarching view across all business, technical and operational functions.

Data governance interacts with architectural activities in an agile, continuous improvement process that allows standards and specifications to reflect changing organizational needs. The processes and people can assure that data specifications are applicable to the needs of each organizational unit while assuring that data standards are uniformly applied across the organization. The size and culture of an organization determine the formality and structure of data governance, which may include a governing council, sponsorship at various organizational levels, executive sponsorship (at a minimum), data ownership, data stewardship, data custodianship, change control and monitoring. But even with all this, the goal of data governance must be to provide appropriate access to data, not to restrict the use of data…from any source.

IT Must Adapt

Information Technology has often been seen as a bottleneck. Many times in our consulting work, we have found ourselves in the position of arbiter between IT and the business. Self-service BI, Analytics and Data Preparation mean IT must become an enabler of data usage, providing trustworthy data without restricting the users. The productionalizing of data science again means that IT must be an enabler of data usage, including of the machine learning and other advanced analytics models that data science teams produce. As data science and data management & analytics tools come together, the need for IT to guide the use of data and tools without limiting that use becomes paramount. At the same time, privacy and security must be retained within data governance. Patient data must only be available to the patient and to those healthcare professionals and caregivers who require access to it. Personally Identifiable Information (PII) must be controlled. Regulatory requirements, such as GDPR and PCI DSS, must be adhered to.
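
As one hedged illustration of controlling PII, the sketch below pseudonymizes sensitive fields before a data set is handed to general analysts. The field list, salt handling and record layout are assumptions for illustration only; a real deployment would follow the organization's governance policy and keep the secret in a managed vault.

```python
# A hedged sketch of pseudonymizing PII before data reaches general analysts.
# Field names and the salt handling are illustrative assumptions.
import hashlib
import hmac

SECRET_SALT = b"rotate-me-and-store-in-a-vault"   # assumption: retrieved securely
PII_FIELDS = {"name", "email", "ssn"}             # assumption: defined by governance

def pseudonymize(record: dict) -> dict:
    masked = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            digest = hmac.new(SECRET_SALT, str(value).encode(), hashlib.sha256)
            masked[field] = digest.hexdigest()[:16]   # stable token, not readable
        else:
            masked[field] = value
    return masked

patient = {"patient_id": 42, "name": "Jane Doe", "email": "jane@example.com",
           "ssn": "123-45-6789", "ward": "B2"}
print(pseudonymize(patient))
```

Because the same input always yields the same token, analysts can still join and count records without ever seeing the underlying identifiers.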

There is also a need for two-way traceability from the datum to its end use in reports and analytics, training sets or scoring, and from the end use back to the source system, including the lineage of all transformations along the way. This lineage of source and use enables both regulatory compliance and collaboration. Such a transparent history also helps build trust in the data, and in what other users and IT data management professionals have done to the data.
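
To make the idea of two-way lineage concrete, here is a minimal sketch that records a lineage entry for each transformation step; the record structure, table names and steps are illustrative assumptions, and production tools persist this in a metadata repository rather than in memory.

```python
# A minimal sketch of capturing lineage as data moves from source to use.
# The record structure and the example steps are assumptions for illustration.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageStep:
    source: str
    target: str
    transformation: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

lineage: list[LineageStep] = []

def record_step(source: str, target: str, transformation: str) -> None:
    lineage.append(LineageStep(source, target, transformation))

# Example: an ETL-style flow annotated step by step.
record_step("crm.accounts", "staging.accounts", "extract nightly full load")
record_step("staging.accounts", "warehouse.dim_customer", "standardize country codes")
record_step("warehouse.dim_customer", "report.churn_training_set", "join with usage facts")

# Trace forward (datum -> end use) or backward (report -> sources).
for step in lineage:
    print(f"{step.source} -> {step.target}: {step.transformation}")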

IT and OT Must Work Together

As connected products mature through the 5Cs of our developing IoT maturity model (connection, communication, collaboration, context and cognition) towards cognition, information technology and operations technology, business systems and engineering systems, must share data under a unified architecture. Much of the promise of the IoT can only be achieved through IT and OT working together. Consumer and marketing information merged with supply chain and production quality information to build predictive models that allow just-in-time inventory control and agile, custom product delivery is only one example of changing consumer expectations, whether that consumer is another business, a government or an individual. Industries in every market, such as the energy sector, consumer packaged goods and pharmaceutical manufacturing, have reaped the benefits of IT and OT working together, of SCADA/Historian data being integrated with Cloud marketing and sales data or ERP data. But for this partnership between IT and OT to work, each must trust the data of the other, and that only happens through data governance and data quality efforts.

Metadata and Master Data Management in DQ

Metadata and Master Data Management (MDM) are fundamental in ensuring data quality, and key to using trustworthy data throughout a modern data ecosystem from the most modern data sources and analytic requirements at the Edge to the most enduring legacy systems at the Core; from the droplets in the Fog to the globally distributed multi-Cloud and hybrid architectures. Metadata and MDM have been part of the solution all along, but now must be applied in new ways, both at the core and at the Edge, and distributed through multiple Cloud, hybrid architectures, on-premises, and out into the furthest reaches of the Fog, as all these resources elastically scale up and down at need.
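
To illustrate the MDM idea in miniature, the sketch below matches customer records from two hypothetical systems and survives a single golden record. The matching rule (a normalized email address) and the survivorship rule (first non-null value wins) are deliberately simplistic assumptions; commercial MDM platforms use probabilistic and fuzzy matching and far richer survivorship logic.

```python
# A minimal sketch of the MDM idea: match customer records from two systems
# and survive a single "golden" record. All records and rules are illustrative.
def normalize_email(email: str) -> str:
    return email.strip().lower()

crm = [{"id": "C-1", "name": "Acme Corp.", "email": "Sales@ACME.com ", "phone": None}]
erp = [{"id": "E-9", "name": "ACME Corporation", "email": "sales@acme.com", "phone": "+1-555-0100"}]

golden = {}
for record in crm + erp:
    key = normalize_email(record["email"])
    entry = golden.setdefault(key, {"source_ids": []})
    entry["source_ids"].append(record["id"])
    # Survivorship: prefer the first non-null value seen for each attribute.
    for attr in ("name", "email", "phone"):
        if entry.get(attr) in (None, "") and record[attr]:
            entry[attr] = record[attr]

print(golden)
```

The same matching-and-survivorship pattern applies whether the records live in the Core, in the Cloud, or are being reconciled at the Edge; only the scale and the tooling change.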

Sensor Data Makes for Interesting DQ

Some of us have been dealing with sensors, sensor-actuator feedback loops and the concepts of large, complex systems for all of our careers, but for many, the fundamentals of connected hardware will be new. Sensor data can be messy. Two sensors from the same manufacturer will be slightly different in the data sets they produce, even though they both meet specification; two sensors from different manufacturers will certainly differ in center point, range, precision and accuracy, and in how the data are packaged. Sensors drift over time, and will need calibration against public standards. Sensors age, and may be replaced, and both of these conditions affect all the previous points.
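
A hedged sketch of what this means in practice: each sensor carries its own calibration, so readings must be corrected per device before they are comparable. The gain and offset values below are made up, standing in for a calibration run against a reference standard.

```python
# A hedged sketch of per-sensor calibration: two sensors that both "meet spec"
# still need their own gain/offset correction before readings are comparable.
# The coefficients are made-up values standing in for a real calibration run.
CALIBRATION = {
    # sensor_id: (gain, offset) derived from calibration against a standard
    "temp-A-001": (1.002, -0.15),
    "temp-B-077": (0.987, +0.42),
}

def calibrate(sensor_id: str, raw_value: float) -> float:
    gain, offset = CALIBRATION[sensor_id]
    return gain * raw_value + offset

readings = [("temp-A-001", 21.30), ("temp-B-077", 21.95)]
for sensor_id, raw in readings:
    corrected = calibrate(sensor_id, raw)
    print(f"{sensor_id}: raw={raw:.2f} corrected={corrected:.2f}")

# Periodic recalibration re-derives gain and offset, which is how drift and
# sensor ageing are handled; replacing a sensor means new coefficients.
```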

Data Architecture and DQ

Having worked in systems engineering for aerospace, I go to Deming's definition of quality as conformance to specifications well suited to the customer; for data, those specifications come from the architecture.

Architecture abstracts out the organizational needs as a series of views representing the perspectives of the people, processes and technologies affected by and effected through a solution, system or ecosystem. A standalone quality solutions architecture is not a good idea, as quality must be pervasive throughout an architecture. However, adding quality as a view within an architecture assures that data quality, data governance and compliance are properly represented within the architecture. (Though outside the scope of this post, I would also consider adding security as a separate view.) There are many architectural frameworks, and even controversy about their effectiveness; TOGAF, MIKE2, 4+1 and BOST are among the better known. Architectural frameworks focus on enterprise, data and solution (application) architectures, with a recent interest in Internet of Things (IoT) architecture. Adherence to a framework or method matters less than that the process by which an architecture is created fits the culture and needs of the organization.

Standards


For reference purposes, here is a list of data quality standards and methods that you might find useful:

  • ISO9001 Quality Management Family of Standards
  • ISO 8000 Data Quality Family of Standards
  • EFQM Quality Management Framework and Excellence Model
  • TOGAF The Open Group Architecture Framework for Data Architecture
  • BOST [PDF] An Introduction to the BOST Framework and Reference Models by Informatica
  • MIKE2 The Open Source Standard for Information Management
  • 4+1 Views [PDF] Architectural Blueprint by Philippe Kruchten
  • TDWI Data Improvement Documents

Informatica is First in Customer Loyalty, Again, and Continues to Innovate

We began using Informatica in its very early days. By 1998, we were using it for an ambitious enterprise data warehouse project spanning three divisions of a Fortune 100 company, taking in transactional and operational data from over 40 operating companies. The days are long gone when we would have implemented complex data architectures and data flows using Informatica PowerCenter and PowerMart in hub-and-spoke arrangements. But the need to provide powerful data management for analytics around business processes has only grown, as sales, services and customer touch-points have grown. We now generate data every minute of the day, awake or asleep. We tweet, email, and post to social media, personal blogs, and photography and video sharing sites. The things that make the things we use, and all the things around us, have embedded computers, are sensor-enabled, and generate even more data. Because of this, the focus of data management has changed from simply extracting data from common source systems, transforming it so it conformed to internal standards, and loading it into that mystical single source of truth [the ETL of old]. Today, our focus is on discovering and exploring data relevant to our organizational and individual needs, no matter the source. And yet, all this data must be vetted; data quality and data governance are more important than ever. While the idea of a single source of truth is passé, trust in our data is not. Whether we are trying to improve our personal fitness, determine the impact of the latest marketing campaign, or bring the perpetrators of genocide to justice, we expect consistency in the answers to the questions we ask of all these sources of data.

Informatica has been amazingly innovative in expanding its capabilities for data management. Informatica solutions and products keep up with where the industry is going. Informatica was one of the first data management companies to realize the importance of the Internet of Things (IoT), and its development of the Intelligent Data Platform is seen as a hallmark in handling all these new sources of data. Its attention to metadata and master data management has also kept pace with, and even outpaced, the industry. Informatica can still be deployed on-premises in one’s own data center, in private or hybrid clouds, or on public Cloud platforms. Real-time data management and continuous event processing are also part of Informatica’s suite of products. All of this innovation has been rewarded again today: for the 11th year in a row, Informatica has been named #1 in customer loyalty for data integration, earning top marks in the annual Data Integration Customer Satisfaction Survey conducted by the independent research firm Kantar TNS.

To show that Informatica is not resting on its laurels, they have also announced today new and enhanced products and services:

  • Cloud Support Offerings
  • Business Critical Success Plan for On-Premises Deployments
  • New Big Data Support Accelerator

You can read more about the Customer Loyalty award and the Informatica announcements in their press release.

The Evolution of Data Management for IoT

In the upcoming webinar for SnapLogic, we will be looking at the Internet of Things from the perspective of data.

  • What data can be expected
  • How IoT data builds upon the evolution of data management and analytics for big data
  • Why IoT data differs from data from other sources
  • Who can make the most use of IoT data, and who can be impacted most by it
  • Where IoT data needs to be processed
  • When IoT data has an impact

Specifically, we will look at how the recent evolution of data management in response to big data is in some ways ideally suited to IoT data, and how it is still evolving to handle the unique characteristics of IoT data and metadata.

The business drivers range from new sources of data that can help organizations better understand, serve and retain customers, to consolidation in many industries, which brings the need to bring together data from disparate and duplicate information and operational systems after mergers and acquisitions. One of the more pervasive developments has been the movement of data acquisition, storage, processing, management and analytics to the Cloud.

Beyond these corporate motives, governments and non-government organizations (NGOs) are using data for good to bring about better quality of life for millions or billions of individuals. Clean water, prosecuting genocide, fighting human trafficking, reducing hunger, and opening up new means of commerce are only a few examples. Some look at the future and see a utopian paradise, others a dystopian wasteland. The IoT with evolving data management and analytics are unlikely to bring about either extreme, but I do think that the future will be better for billions as a result.

The basic question that we’ll ask in this webinar is “What is the Internet of Things?” From simple connectivity to the resulting cognitive patterns that will be exhibited by these connected things, we will explore what it means to be a thing on the Internet of Things, how the IoT is currently evolving, and how to bring value from the IoT. It is also important to recognize that the IoT is already here: many organizations are reaping the benefits of IoT data management and sensor analytics. The webinar will show ways in which your organization can join the IoT or mature its IoT capabilities.

Big data was often described by three parameters overwhelming the old ways of integrating and storing data: volume, velocity and variety. Really, we are looking at deftly interweaving the volumetric flow of data in timely ways that flexibly provide for privacy, security, convenience, transparency, governance and compliance. Nowhere is this evolution better expressed than in data management for the Internet of Things (IoT).

We will cover some of the more interesting and useful aspects of preparing for IoT data and sensor analytics. Though the term was coined by Kevin Ashton in 1999, the IoT is still considered to be in the early stages of adoption and relevance. While the latest trends in data management and analytics apply to IoT data and sensor analytics, there are specific needs, such as time-series data, location data, and metadata specific to the IoT, that legacy ETL (extract, transform and load) and DBMS (database management systems) tools simply don’t handle well. In addition to these characteristics of IoT data, we will explore other aspects that make IoT data so interesting.
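
As a minimal sketch of one such IoT-specific need, the example below aligns irregular, per-sensor time-series readings into fixed windows with pandas; the sensors, timestamps and measurements are synthetic.

```python
# A minimal sketch of one IoT-specific need: aligning irregular, per-sensor
# time-series readings into fixed windows for analysis. The data is synthetic.
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-06-01 10:00:03", "2017-06-01 10:00:41",
        "2017-06-01 10:01:07", "2017-06-01 10:02:55",
    ]),
    "sensor_id": ["pump-1", "pump-1", "pump-2", "pump-1"],
    "vibration": [0.31, 0.35, 0.52, 0.97],
})

# Resample each sensor's readings into 1-minute windows; the mean per window
# smooths jitter, while the max flags spikes worth routing to an alerting stream.
windows = (events.set_index("timestamp")
                 .groupby("sensor_id")["vibration"]
                 .resample("1min")
                 .agg(["mean", "max"]))
print(windows)
```

Windowed, per-device aggregation of this kind is exactly what row-at-a-time ETL and general-purpose relational storage were never designed around.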

The IoT isn’t yet living up to its hype, which would require many solution spaces coming together as ecosystems. Instead, the IoT is growing within each vertical separately, creating new data silos. This is exemplified by the 30-plus standards bodies addressing IoT data communication, transport and packaging. Metadata and API management can help. Metadata also addresses the nuances of IoT data, such as the factors arising from replacing a sensor, allowing continuity of the data set and an understanding of the differences before and after the change.
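
A minimal sketch of the sensor-replacement point: channel-level metadata that records which physical device was installed when, so a data set keeps its continuity across a swap. The record layout is an assumption for illustration, not any particular standard.

```python
# A minimal sketch of channel metadata that survives a sensor replacement, so
# downstream users can interpret the before/after difference. The layout is an
# illustrative assumption; IoT standards bodies define much richer models.
channel_metadata = {
    "channel": "plant-3/line-2/bearing-temp",
    "unit": "degC",
    "devices": [
        {"serial": "TMP-88341", "installed": "2015-02-10", "removed": "2017-03-04",
         "calibration": {"gain": 1.004, "offset": -0.2}},
        {"serial": "TMP-91177", "installed": "2017-03-04", "removed": None,
         "calibration": {"gain": 0.998, "offset": +0.1}},
    ],
}

def device_for(channel: dict, date: str) -> dict:
    """Return the device record active on a given date (ISO yyyy-mm-dd)."""
    for device in channel["devices"]:
        if device["installed"] <= date and (device["removed"] is None or date < device["removed"]):
            return device
    raise LookupError(f"no device active on {date}")

print(device_for(channel_metadata, "2016-07-01")["serial"])  # TMP-88341
print(device_for(channel_metadata, "2017-06-15")["serial"])  # TMP-91177
```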

Information Technology (IT) and Operational Technology (OT) are coming together in the IoT. This means interfacing legacy systems on both sides of the house, such as enterprise resource planning (ERP) and customer relationship management (CRM) systems with supervisory control and data acquisition (SCADA) systems, and relational database management systems (RDBMS) with Historian DBMSs. It also means deriving context from the Edge of the IoT for use in central IT and OT systems, and bringing context from those central systems for use in streaming analytics at the Edge. Further, it means that machine learning (ML) is not just for deep analysis at the end of the data management and analytics (DMA) process; ML is now necessary for properly managing data at each step, from the sensor or actuator generating the data stream, to intermediate gateways, to central, massively scalable analytic platforms, on-premises and in the Cloud.

As we discuss all of this, participants in the webinar will come away with five specific recommendations on gaining advantage through the latest IoT data management technologies and business processes. For more on what we will be discussing, visit my post on the SnapLogic Blog. I hope that you’ll register and join the conversation on 2016 October 27 at 10:00 am PDT.

Kognitio on Hadoop Best Use Contest

Contest

During the week of 2016 September 26, at the O'Reilly Strata-Hadoop conference in New York City, Kognitio announced the start of their contest looking for the best use case or application of Kognitio on Hadoop. Kognitio is looking for innovative solutions that include Kognitio on Hadoop. Innovation is defined by Kognitio as:

Innovation could be a novel or interesting application or it could be something that is common place but is now being done at scale.

This covers a wide range of potential big data analytics use cases that might include data-for-good, government, academic or business applications. Contestants must write up their use case in a short paper, to be submitted to Kognitio no later than 2017 March 31. Applications will be judged by a named panel headed by a leading industry analyst, and the winner will be notified on 2017 June 01. Applicants can be individuals, groups or organizations. The winner may choose among the following three prizes:

  1. US$5,000.00
  2. A one year standard support contract
  3. A one year internship at Kognitio’s R&D facility in the UK – subject to the intern being eligible to work in the United Kingdom

Kognitio on Hadoop is free to download; registered entrants will receive notifications of patches and updates to the free software, as well as preferential support on the Kognitio forums.

Kognitio

As one of the first in-memory, massively parallel processing (MPP) analytics platforms, Kognitio brings over 25 years of experience to big data processing…always in-memory, MPP and on clusters. Today, the Kognitio Analytical Platform is delivered via appliances, software, and the cloud. Kognitio on Hadoop was announced at the 2016 Strata-Hadoop conference in London. This free-to-use version of the Kognitio Analytical Platform includes full YARN integration, allowing Hadoop users to pull vast amounts of data into memory for data management and analytics (DMA). As an in-memory MPP analytical platform, Kognitio is very scalable and can provide MPP execution of computational statistics and data science applications. MPP execution of SQL, MDX, R, Python and other languages for advanced analytics is handled through a bulk synchronous parallel (BSP) API. This provides extremely fast, high-concurrency access to the data. In addition to these languages, Kognitio has strong partnerships with business intelligence vendors such as Tableau, MicroStrategy and others. For Tableau, Kognitio has a first-class connector; one joint customer in the financial services market has 10,000 customers accessing nine petabytes (9PB) of data in Hadoop [five terabytes (5TB) in Kognitio]. As an example of the high concurrency available through Kognitio, this financial services customer routinely sees 1,500-2,000 queries per second from roughly 500 concurrent sessions. Note that this is one analytical subsystem; there are another 15 such uses of Kognitio, each for specific purposes, accessing that 9PB data lake.

Kognitio on Hadoop

Kognitio on Hadoop can be downloaded free of charge, with no data size limits or functional restrictions, and without registration. A range of paid support options is available as well. Kognitio on Hadoop is integrated with YARN and works on any existing Hadoop infrastructure, so no additional hardware is required solely for Kognitio. Kognitio on Hadoop accesses files, such as CSV files, stored in HDFS, just as one would normally store data in Hadoop. Intelligent parallelism in Kognitio 8.2 allows queries to be assigned to as few as one core, or to use all cores, allowing for extraordinarily high levels of concurrency; this apportionment is performed dynamically by Kognitio. In addition to the obvious advantages of such a mature product, being free to use means Kognitio on Hadoop can be deployed, tested, and brought into production far more easily, while many open source solutions are still trying to run in a lab. Kognitio on Hadoop was developed internally using Apache Hadoop, and is in production at customers on Apache Hadoop and on the distributions from Cloudera, Hortonworks and MapR.

Why is this important to SAE?

As the Internet of Things matures beyond simple connectivity and communication, in-memory MPP analytical platforms, such as Kognitio on Hadoop, will be required to allow context to be derived from intelligent sensor packages and Edge gateways, to the Cloud, and provide context to the Edge, Fog and sensors, in real-time. Kognitio on Hadoop conceivably allows true collaboration and contextualization among things and humans in sensor analytics ecosystems.

Innovation at CHRISTUS with Informatica

CHRISTUS Health is a Catholic healthcare organization that started with a call to the Mother Superior of the Monastery of the Order of the Incarnate Word and Blessed Sacrament in Lyons, France in 1866. Three sisters responded, beginning their mission to care for the people of Texas in the United States of America. Since then, CHRISTUS Health has been working in accordance with their ideals and values, bringing quality healthcare to the people of the United States, Mexico and Chile. Fifteen years ago, a journey began to bring together financial and clinical data measuring patient outcomes, as well as operational efficiency. In 2013 a Business Intelligence Division was established to take analytics to another level. And this year, at Informatica World 2016 #INFA16, CHRISTUS Health was awarded an Informatica Innovation Award.

Everything CHRISTUS Health does is about taking better care of patients. While it was imperative that they control growing volumes of data, they knew that if they did it right, there was an unprecedented opportunity to use data to increase the effectiveness and quality of services delivered. With end-to-end data management providing accurate, consistent views of data across the enterprise, the health system is confident they will improve the patient experience, enable clinical insight discovery and identify operational efficiencies. CHRISTUS Health implemented Informatica as a scalable enterprise data management architecture across its hundreds of healthcare facilities, powering a voracious demand for business and clinical analytics for value-based care delivery. Immediate outcomes include an annual $1M in savings across lines of business and a 30 percent reduction in data stewardship efforts, coupled with increased contract management across the Group Purchasing Organization.

In its sixteenth year, the Informatica Innovation Awards program acknowledges organizations that creatively change their processes through data using Informatica products.

Exceptional Informatica Customers Receive 2016 Innovation Awards for Excellence around Business Transformation
Annual Awards Program Honors Visionary Organizations that Unlock Data to Power Business
Eleven organizations will be recognized as Innovation Awards winners and finalists at a luncheon on Monday, May 23rd. We’ll be announcing those organizations in this press release. Now in its 16th year, the annual Informatica Innovation Awards program honors organizations that demonstrate vision, creativity and leadership in the use of Informatica solutions to transform business through data.

Through the efforts of Peggy O'Neill, Analyst Relations at Informatica Corporation, Clarise and I were able to speak with Mavis Girlinghouse, a 34-year veteran of CHRISTUS Health and currently their System Director of Business Intelligence. We had a good time discussing the challenges specific to the healthcare industry, such as dealing with HL7 data and pushing that data to vendors in a usable fashion, the challenges of the recent change to ICD-10 – which went rather smoothly at CHRISTUS Health – and the coming improvements to patient care and population health management through IoT data and sensor analytics.

The primary business case driving the work being honored this week was managing the supply chain across the organization, with a focus on contract management. Remember, CHRISTUS Health spans six states in the USA, and six more in Mexico and Chile. This isn't a case study, and we didn't have time to go into competitive vendor selection or return on investment. However, one immediate return was identifying contracts with terms that were impossible to meet, resulting in recurring late payment fees; renegotiating these contracts resulted in significant cost savings. Technically, CHRISTUS Health went from four MDM systems to one, and from numerous spreadsheets requiring weeks of manual labor to single-click answers. They aren't stopping at managing suppliers and supplies with MDM. Next on the list is the pharmacy system, followed by the implant tracking system. There are other challenges with which Informatica products help, such as dealing with the three electronic medical record (EMR) systems in use throughout CHRISTUS Health. Mavis is also planning for the deluge of data that will be associated with the change from episodic healthcare to population health management, where dynamic data mapping will ease the ETL (extract, transform, load) issues. CHRISTUS Health is also constantly planning for the ever-changing security landscape. Population health management, and improving patient-physician interaction and care management, will require bringing together EMR, pharmacy, third-party provider, IoT, social, and other external data sources to help CHRISTUS Health patients make better, informed lifestyle and prevention choices. Informatica brings the tools that allow new data coming in to be properly matched with like data, thus bringing proper context to care decisions.

You can also hear Mavis on The Cube, and hear some great advice and experiences on how to get started with big data, and how data truly can save lives.
