Neo4j, ML, AI and Causal Inference

There has been incredible growth with Neo4j (@Neo4j) since the first GraphConnect conference in 2013, in business adoption, in data science, in partnerships, in approach, in staff and in customers. During the month of August, we had the privilege of speaking with Amy Hodler (@amyhodler) and, separately, with Lance Walter (@lancewalter), both with Neo4j. Amy is the Analytics & AI Program Manager at Neo4j. Lance is CMO of Neo4j. There are many amazing things happening at Neo4j today, but this blog post will focus on the impact of connected data on machine learning (ML) and artificial intelligence (AI).

Let’s start with some basics.

Background on Graphs and Neo4j

Maps, charts, graphs…what’s the difference among these? At the most basic level, the underlying mathematics which describe maps, charts and graphs are different, and this difference allows for different technology to be created from these concepts. Colloquially though, these three words are often used interchangeably. I might say that the one thing statisticians and data scientists have in common is that we both start by graphing out a representative sample of a new data set. Of course, we really mean plotting the data…often on graph paper. The result might be a chart, an histogram perhaps. And charts are often used in data visualization after analysis. If you’ve read this blog over the years, you know that I love mindmaps, but geographical maps are great for certain types of data visualization, and concept maps are a great gateway to graph technologies.

Maps, charts, graphs…As you can see, there is a lot of overlap in these concepts. Both the words and the mathematics are important because how you might use these concepts determines how you can use data to better understand the world around you, and make better decisions. For our purposes, we are talking about graphs as used in graph databases, graph analytics and graph algorithms to model and elucidate the relationships among entities that are meaningful to a process or ecosystem.

Advances in technology and science have ushered in a digital revolution of an hyper-connected environment. This is true in social media and in the socialization of machines. These hyper-connected flows of information, ideas and energy appear in all industries, in government, in technology-for-good NGOs, and even in our personal lives.

In such ecosystems, organizations are presented with opportunities from this connected data. More data and more types of data are being collected at ever higher but varying rates. These continually evolving datasets are both related and dynamic. There is a need to be able to quickly and efficiently identify relationships and patterns in this vast array of data. This need to efficiently store and process highly-connected data has resulted in an increased interest in graph databases.

Most people are familiar with data stored in and accessed from relational models and databases. These systems utilize fairly fixed schemas and are not designed to handle the complex relationships present in today’s connected data sets. Relationships are not easily traversed. Primary and foreign keys are not enough—they are not designed to represent extensive, connected data. These systems require costly programming to find relationships in the data sets, and performance suffers because of the underlying design and storage implementations.

In a graph database, relationships have as much value as the data itself. A graph database transforms a complex interconnected dataset into meaningful and understandable relationships. The key concept here is the edge or relationship. The graph relates the data items in the database to a collection of nodes or vertices, with the edges representing the relationships between the nodes. The relationships allow data in the database to be linked together directly and retrieved easily.
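
To make that concrete, here is a minimal sketch using the official Neo4j Python driver, assuming a local Neo4j instance; the connection URI, credentials and the tiny Person/KNOWS model are placeholders for illustration, not a recommended setup.

```python
# A minimal sketch, assuming a local Neo4j instance; the URI, credentials and
# the Person/KNOWS model are illustrative placeholders, not a real deployment.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes and the relationship between them are created as first-class elements.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:KNOWS]->(b)",
        a="Amy", b="Lance",
    )
    # Traversing the relationship is a direct graph operation, not a key join.
    result = session.run(
        "MATCH (a:Person {name: $a})-[:KNOWS]->(friend) RETURN friend.name AS name",
        a="Amy",
    )
    for record in result:
        print(record["name"])

driver.close()
```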

Data Scientists Like Graphs

At the same time as we were catching up with Neo4j, our friend, Julian Hyde (@julianhyde), recommended that we read Judea Pearl’s The Book of Why: The New Science of Cause and Effect. Having now read The Book of Why as well as Graph Algorithms by Amy Hodler and Mark Needham, I was struck by the similarity of concepts and language between causality and graphs. Learning about causal inference and graph algorithms concurrently in this way truly makes it impossible not to think about one in terms of the other, and makes obvious the usefulness of one to the other. This is especially true for complex systems, machine learning (ML) and artificial intelligence (AI).

AI is used in a variety of contexts with many meanings. AI can be described as artificial narrow intelligence (AnI), general artificial intelligence (GAI), artificial super intelligence (AsI) and Augmented Intelligence (AugI) – though in that last one, it isn’t always clear what or who is being augmented. Of course, many call simple machine learning, even computational statistics at the level of linear regression, “artificial intelligence”. The combination of graph algorithms and causal inference seems to be key in bringing clarity to these various versions of AI, and in helping move towards true GAI or AsI.

Tradition is the passing on of knowledge. If you think of this in terms of graph databases and AI, perhaps we can see the true importance of graphs to AI…providing tradition, an history of knowledge upon which to build, to improve, to provide context, as machines augment humans and humans augment machines. This is where causal inference comes into play. Consider Judea Pearl’s ladder of causation (Observation, Intervention and Counterfactuals).

The Ladder of Causation, with representative organisms at each level (Book of Why, Chapter 1, Figure 2), by Judea Pearl: http://bayes.cs.ucla.edu/WHY/why-ch1.pdf

From the perspective of graphs, this leads to many possibilities, three of which we plan to pursue.

The first is relevant to general data science workflows, and results from a Twitter conversation with JD Long (@cmastication). Can graphs and causal inference be used to validate and improve the data science workflow, leading to optimizing production use of data science outputs?

The second is useful to projects that we are doing with Rebaie Analytics Group in providing ML prototypes for IoT pilots. Can causal inference and graphs be used to select and validate training sets, especially of third-party data, for ML and AnI?

The third is important to our research and synthesis of ethics frameworks. As this two-way augmented decision making happens, these decisions must be based upon an ethical framework of cultural, regulatory, economic, political and environmental factors that, perhaps, can be built into the graph through the causal inference concept of counterfactuals and graph algorithms that find and test those counterfactuals.

Directed acyclic graphs (DAGs) based on causal assumptions, as opposed to merely capturing associations in the data, make more sense to the human minds doing the modeling. The topology of such graphs is also more amenable to remodeling to capture new information or changes in the causal structure. As such, causal Bayesian networks expressed as DAGs naturally attract data scientists, especially when one of the goals is transparency and explainability in how the model arrived at its predictions or inferences. The concepts of graph local querying and graph global processing also lend themselves well to iteratively updating information across a network, allowing transactions and analytics to interact through causal inference, Bayesian predictions, machine learning scoring and all levels of artificial intelligence.
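
As a small illustration of why such topologies are pleasant to work with, here is a sketch of a toy causal DAG in Python using the networkx library; the variables and edges are invented, and a real project would use a dedicated causal-inference or Bayesian-network package on top of such a structure.

```python
# A toy causal DAG; the variables and edges are invented for illustration only.
import networkx as nx

causes = nx.DiGraph()
causes.add_edges_from([
    ("weather", "road_conditions"),
    ("road_conditions", "travel_time"),
    ("rush_hour", "travel_time"),
    ("travel_time", "late_delivery"),
])

# A causal model must be acyclic; the topology itself is checkable.
assert nx.is_directed_acyclic_graph(causes)

# Local question: what directly influences travel_time?
print(list(causes.predecessors("travel_time")))

# Global question: everything upstream of late_delivery (its causal ancestors).
print(nx.ancestors(causes, "late_delivery"))

# Remodeling to capture new knowledge is a local edit to the topology.
causes.add_edge("weather", "rush_hour")
assert nx.is_directed_acyclic_graph(causes)
```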

Contextualization is of paramount importance to every aspect of data management and analytics. Without context, causal inference can’t occur. Context is integral to any organization’s IoT maturity and AI Readiness. Graph technologies inherently bring context to data. Metadata and master data are also a primary means of bringing context to data, and graph technologies are increasingly used in metadata and master data management. Therefore, graph technologies are vital to all IoT and AI initiatives. Context through graphs brings even more value to data scientists.

Graphs Make Big Data Look Small

Another reason that data scientists like Neo4j is that purpose-built graph databases make big data look small. Neo4j is based on property graphs, and, like all graphs, the data model and the database treat entities (nodes, vertices) and relationships (edges, paths) as equal elements. Properties assigned to nodes and paths handle much of the separate overhead that other data management systems require. Also, while there are classes of graph algorithms that need a global look at the graph, traversing the entire topology, there are classes of graph solutions that can be considered local graph questions and can be handled in subgraphs. All of this combines to make big data very manageable…making big data look small.

So, unlike other database management systems that require connections between entities using special properties such as foreign keys or out-of-band processing, graph databases use simple abstractions of nodes and relationships when connecting structures. This has the added benefit of enabling users to build sophisticated models in less time.
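
A rough sketch of the local-versus-global distinction mentioned above, using networkx on an arbitrary random graph (the graph size and parameters are placeholders); a global algorithm such as PageRank must walk the whole topology, while a local neighborhood query touches only a small subgraph:

```python
# Local vs. global graph questions, sketched on an arbitrary random graph.
import networkx as nx

G = nx.erdos_renyi_graph(n=1000, p=0.01, seed=42)

# Global processing: PageRank must consider the entire topology.
ranks = nx.pagerank(G)
top_node = max(ranks, key=ranks.get)

# Local querying: a two-hop neighborhood (ego graph) touches only a subgraph,
# which is what keeps "big" graphs feeling small for many questions.
neighborhood = nx.ego_graph(G, top_node, radius=2)
print(f"whole graph: {G.number_of_nodes()} nodes; "
      f"2-hop neighborhood of {top_node}: {neighborhood.number_of_nodes()} nodes")
```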

More Interesting Neo4j Tidbits

As we said at the beginning, there have been many interesting happenings and amazing growth at Neo4j. Here is a list, and references.

  1. The best place to start your graph journey is with Graph Databases by Ian Robinson, Jim Webber and Emil Eifrem, ©2013, O’Reilly — download the 2nd edition as an ebook, free with signup
  2. The Neo4j community is incredibly rich in members, activity and projects
  3. There are over 450 Neo4j events including large conferences, online meetings, graph tours, meetups and training
  4. Neo4j has been adding new hires at every level to strengthen their product, services, business and market position.

    • Mike Asher – former Pivotal CFO, now CFO Neo4j
    • Ivan Zoratti – CTO MariaDB, Head of Field MySQL, now Neo4j DB PM
    • Jake Graham – Intel Saffron AI Product Manager, now AI PM
    • Matt Casters – Author & Architect of Kettle, data integration product
    • Denise Persson – CMO Snowflake joined Neo4j BoD
    • Lisa Hatheway – VP Demand Gen, formerly Vectra AI & Couchbase
    • Alicia Frame – Senior PM for Data Science, formerly of BenevolentAI, EPA and USDA
  5. The Innovation Lab at Neo4j is very popular at the many Graph Tour events, and possibly the best way to be convinced that graph technologies can solve your organization’s challenges; perhaps the best way to learn about this program is with this five minute interview with Alessandro Svensson, the Director of Neo4j Innovation Lab
  6. More on the Book of Why can be found on Dr. Judea Pearl’s publication page at UCLA.
  7. Graph Algorithms by Amy Hodler and Mark Needham, ©2019 published by O’Reilly is available as a free ebook with signup
  8. There has been much discussion over the past decade that data management and analytics must move beyond online transaction processing (OLTP) and online analytical processing (OLAP) to hybrid transaction-analytical processing (HTAP, as explained by Donald Feinberg and Merv Adrian at Gartner). Indeed, many of the M&A transactions in this space have been to embed analytics into ERP, eCommerce and financial systems. RDBMS vendors have even put forth the idea that their product can handle both transactions and analytics in the same instance of the database. Modeling such capabilities in entity-relationship diagrams and the corresponding physical models is extremely difficult, and the result often performs poorly. Graph databases, on the other hand, are very good at HTAP. In addition, it is difficult to envision a better platform to manage the ebb and flow of data, metadata and master data among the core, intermediate aggregation points and edge within and among sensor analytics ecosystems.
  9. Google Cloud announced partnerships for open source data management and analytics; Neo4j is among these partners, with the goal of bringing native property graphs into the Cloud.
  10. For the first time in 30 years, a new ISO database query language is firmly on its way to becoming a standard. A proposal to develop and maintain the graph query language (GQL) has been approved by the same international working group that maintains the SQL standard. This is a significant step in advancing graph technologies for full and extensible platforms.

Graphs Matter to Sensor Analytics Ecosystems

Our research focuses on synthesizing Sensor Analytics Ecosystems (SensAE), bringing value to the Internet of Things (IoT) through data management and analytics…data engineering and data science…within the realm of complex systems…without creating new silos…using an ethics philosophy to guide adoption assuring privacy, transparency, security and convenience by design. Graphs are essential to building Sensor Analytics Ecosystems platforms. Causal inference appears to be essential to IoT maturity, which depends upon AI Readiness.

  • The IoT, even within a siloed solution, is a complex system. By the end of that first GraphConnect in 2013, we were convinced that graphs are the best way to model data in a complex system.
  • As data, metadata, master data and information ebb and flow in the enterprise, we realized that graphs are the only technology that can break away from source-target data flows, and we have included graph technologies as part of our SensAE system architecture guidelines.
  • Every IoT pilot starts with connection, and adds communication. To mature, collaboration and contextualization must then be added. With advanced analytics, ML and AnI, the IoT project moves towards cognition. We will now extend our research to determine exactly how causal inference, expressed within a property graph database and using graph algorithms to select and direct ML and AnI, can enhance this maturation process within an ethics framework. Graphs will be a necessary part of this research.

Guavus Plus SQLstream means Broad and Deep for IoT Data Science

History

From the time that Damian Black, founder of SQLstream, and Dr. Anukool Lakhina, founder of Guavus, first met almost a decade ago, the synergies and complementary nature of their visions were apparent to both of them. At the time though, each chose their own path, with Guavus using open source solutions to become a leader in big data, real-time analytics, firmly focused on the telecommunications CSP (communications service provider) and operational efficiency market. Meanwhile, SQLstream built off of Eigenbase components to create one of the first true streaming analytics engines, while maintaining strict compliance with SQL standards; on the business side, finding a niche in the burgeoning IoT market, especially in Transportation, all while remaining an horizontal solution.

Guavus was acquired by Thales in 2017. The Thales Group, a large, international player in aerospace and defense, with a significant presence in transportation, expressed interest in SQLstream about four years ago. It was at this point that Damian and Anukool realized that the solutions Guavus and SQLstream had developed since their earlier discussions had become even more strongly complementary, with Guavus' deep domain expertise in telecommunications, machine learning and data science, and SQLstream as a pioneer and leader in streaming analytics with an horizontal platform. In addition, Guavus is following Thales' lead in broadening their domain expertise into the Industrial Internet of Things. SQLstream has had great success in the Transportation area, as well as in other sensor analytics ecosystems (SensAE). In addition, Guavus recognizes the need to process the vast amount of telecoms and IoT data closer to the source. In January of 2019, Guavus acquired SQLstream.

Integration

Although the merger is only a month old, the two companies are already working as one to bring the strengths of each together for greater customer success. Over the next six to 12 months, the two will be integrated into a single platform with the ability to scale up to mind-numbingly large data flows, and to scale down to very finely-tuned small aggregates where and as needed throughout the ecosystem. This will allow greater operational efficiency, as separating signal from noise close to the source allows processing the data immediately, providing value in a timely and cost-effective manner. Data rates are growing, per Damian, by 50% as edge sources increase in importance, but data storage and management costs are only decreasing by 12-14%. Only by pushing the algorithms – the machine learning models – into the streaming pipeline will organizations be able to actually draw value from this data. Guavus has some of the best data science expertise in the industry for their customers in Telecom. As this domain experience grows to include Transportation, and IIoT in general, companies growing in IoT maturity will be able to perform streaming analytics and machine learning augmented analytics on appropriately aggregated data throughout their ecosystems.
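
As a vendor-neutral sketch of the idea of pushing scoring into the streaming pipeline (not SQLstream's SQL-based pipelines or Guavus' Reflex platform), the toy Python generator below aggregates a tumbling window at the edge and forwards only small aggregates and alerts toward the core; the window size, threshold and stand-in model are invented placeholders.

```python
# A vendor-neutral sketch of edge-side windowed aggregation and model scoring;
# the window size, anomaly threshold, and the score() model are invented placeholders.
import random
from collections import deque
from statistics import mean

WINDOW = 60          # number of readings per aggregate (placeholder)
THRESHOLD = 0.8      # anomaly score above which we raise an alert (placeholder)

def score(window_avg):
    # Stand-in for a trained ML model pushed into the pipeline.
    return 1.0 if window_avg > 75.0 else 0.0

def edge_pipeline(readings):
    """Consume a stream of sensor readings, emit small aggregates plus alerts."""
    window = deque(maxlen=WINDOW)
    for value in readings:
        window.append(value)
        if len(window) == WINDOW:
            avg = mean(window)
            anomaly = score(avg)
            # Only the aggregate (and any alert) travels toward the core,
            # separating signal from noise close to the source.
            yield {"avg": avg, "anomaly": anomaly, "alert": anomaly > THRESHOLD}
            window.clear()

# Usage with a fake, generated stream:
stream = (random.gauss(70, 5) for _ in range(600))
for aggregate in edge_pipeline(stream):
    print(aggregate)
```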

"With our integrated solutions, CSPs to IIoT customers will be able to take advantage of something that’s radically different as we deliver AI-powered analytics from the network edge to the network core. With this solution, our customers can now analyze their operational, customer, and business data anywhere in the network in real time, without manual intervention, so they can make better decisions, provide smarter new services, and reduce their costs." — Guavus Press Release

This matches well with what we have seen, and what we present for SensAE architecture, that the ebb and flow of data throughout the ecosystem must allow for appropriate aggregation and analytics at each point within the ecosystem.

Future

At MWC19, there has been a lot of interest in these specific solutions, and also in building trust throughout the ecosystem, with security, and, as our research has shown, with the ability to select the desirable levels of privacy and transparency. Responding to these industry concerns is already in the Thales/Guavus/SQLstream roadmap.

"The SQLstream products have the ability to analyze, filter, and aggregate data at the network edge in real-time and forward the information to the network core where the Guavus’ Reflex® platform can apply AI-powered analytics, giving customers a widely distributed and scalable architecture with better price/performance and total cost of ownership." — Guavus Press Release

The next few months are going to be exciting with SQLstream, Guavus and Thales bringing together their expertise in streaming analytics, data management, telecommunications, transportation, machine learning, data science, industrial needs and system engineering.

Did You See What Ockam.io Just DID

The W3C Decentralized Identifiers (DID) is a working specification from the credentials community group. While not yet a standard, a working group has been proposed to develop the current specification into a full W3C standard. Amazingly, this proposed working group has support from over 60 companies (15 is the minimum required, and 25 is considered very healthy). Building trust into interoperability is important for the Internet of Things to mature, allowing the development of sensor analytics ecosystems (SensAE). The W3C DID is not focused on things alone, but on people as well as things, providing the opportunity for every user of the Internet to have public and private identities adhering to the guidelines of Privacy by Design. We often talk about the need within SensAE for privacy, transparency, security and convenience to be provided in a flexible fashion. The W3C DID is an excellent first step in allowing this flexibility to be controlled by each individual and thing on the Internet, indeed on each data stream in human-to-human, human-to-machine, machine-to-human and machine-to-machine exchange or transaction. But every specification and every architecture needs to be implemented to have value. Moreover, they need to be implemented in such a way that the complexity is abstracted out, both to increase adoption, and to reduce compromising errors. This is where the recently announced Ockam.io open source (under the Apache 2 license) software development kit and the Ockam TestNetwork come into play.

Currently, organizations entering IoT are either trying to build all of the pieces themselves, or searching for full stack IoT Platforms. Either of these approaches can limit interoperability. Think of a simple example, such as in the Smart Home area, where device vendors need to choose among such vendor-centric platforms as offered by Amazon, Apple, Google or Samsung, with no hope of easy interoperability. Such vendor lock-in limits adoption. This is also true of the industrial IoT Platform vendors. Manufacturers that might want two-way traceability from the mine to the assembly line to user environments to retirement of a product are stymied by the lack of interoperability and secure means to share data among all the players in their supply chain, vendor, customer and environmental ecosystem. Standards can be confusing and also cause lock-in. For example, there are two main standards bodies addressing Smart Grids, each with hundreds of standards and specifications that are not consistent from one body to the other, and do not allow for secure data exchange among all involved parties.

The W3C DID specification seeks to support interoperability and trust, with individual control of data privacy and transparency. The overall specification requires specific DID implementation specifications to ensure identity management while maintaining interoperability among all organizations adhering to the overall DID specification. This means that on-boarding IoT devices in a secure fashion with integration among all the devices in an organization's ecosystem can be done in seconds rather than months (or not at all). Even though, say, the OckamNetwork has never coordinated with Sovrin, or some large corporation's Active Directory, one can register a DID claim for a device in the Ockam TestNetwork and instantly have trusted interoperability and exchanges with users of Sovrin or any DID compliant organization. This means that an organization can move their IoT maturity immediately from simple connection, to trusted communication. Let's look at an example from the Ockam.io SDK.


With just a few simple lines of code, a developer can register their device in such a way that the device is uniquely identified from original manufacture through any installations, incorporation in a system, deployments and changes in location, to reuse or retirement. What this means within the OckamNetwork is that life events, and metadata from those life events, accumulate continuously to build trust throughout the life of the device. As always, metadata in one situation is useful data in another, such that the DID document that defines the identity also defines and leverages the subject’s metadata. The developer is free within the W3C DID model to define the metadata as needed. This allows key management through a decentralized blockchain network of the developer's choosing, without creating silos. This also allows the end-user of the device, or of the system that contains many devices, to trust the system, with reasonable confidence that the system will not be compromised in a botnet.
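
To give a feel for what such an identity looks like, without reproducing the Ockam.io SDK itself, here is a rough sketch of a DID document for a device, following the general shape of the W3C DID data model; the method name, identifier, key material, service endpoint and life-event metadata are fabricated placeholders.

```python
# A rough sketch of a DID document for a device, following the general shape of
# the W3C DID data model; the identifier, key material and service endpoint are
# fabricated placeholders, not output from the Ockam SDK.
device_did_document = {
    "@context": "https://www.w3.org/ns/did/v1",
    "id": "did:example:device-7f3a9c",
    "verificationMethod": [{
        "id": "did:example:device-7f3a9c#key-1",
        "type": "Ed25519VerificationKey2018",
        "controller": "did:example:device-7f3a9c",
        "publicKeyBase58": "<device public key>",           # placeholder
    }],
    "authentication": ["did:example:device-7f3a9c#key-1"],
    "service": [{
        "id": "did:example:device-7f3a9c#telemetry",
        "type": "SensorTelemetry",                          # hypothetical service type
        "serviceEndpoint": "https://example.org/telemetry", # placeholder
    }],
    # Illustrative life-event metadata, which the developer is free to define
    # as needed within the DID model.
    "metadata": {
        "manufactured": "2019-02-01",
        "installedAt": "pump-station-12",
    },
}
```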

To be successful, the W3C DID requires broad uptake. Major corporations are involved (beyond the 60 companies mentioned above). Visit Decentralized Identity Foundation to see the list.

Yes, the W3C DID and compliant DID networks such as the OckamNetwork use blockchain. Contrary to the common belief formed by the most visible uses of blockchain technology, currencies such as Bitcoin or Ethereum, where all the blocks or headers need to be tracked since genesis, the amount of data that needs to be exchanged to validate trust is not huge. This is due to CAP, or Brewer's Theorem, under which a distributed system can have at most two of Consistency, Availability and Partition-tolerance. The OckamNetwork is CP based. Because of this absolute consistency, with instant commit and 100% finality in the commit, one only needs to track the most recent block. Another interesting side effect of CP is that it allows for low-power and casual connectivity, two important features for IoT devices, which may need to conserve power and may need to connect only at need.

Another interesting feature of the W3C DID is that the issuer of a claim can make claims about any subject – any other entity in the network. This addresses a problem often seen in IoT, where a failed sensor is replaced with another sensor that, while in-spec, has slightly different parameters than the previous sensor. The important thing is that the data stream remains consistent, and that users of that data understand that the data are still about the same system or location, and that differences in the data are due to a change in sensor, not a change in the system. The W3C DID model of claims allows a graph model of sensor to location that ensures consistency of the data stream while ensuring trust in the source of the data, through a signer of the issuer of the claim that is proven by a two-thirds majority of provers in the chain. Thus, the state of the blockchain is modeled as a graph that consistently tracks the flow of data from all the different sensors and types of sensors, graphed to each location, to each subsystem, to each system, whether that system is a vehicle or a city.
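
A small modeling sketch of that idea (not Ockam's actual schema), using networkx to graph fabricated sensor-to-location claims so that the data stream stays keyed to the location even when a sensor is swapped:

```python
# An illustrative graph of claims linking sensors to a location; the DIDs and
# location name are fabricated, and this is a modeling sketch only.
import networkx as nx

claims = nx.MultiDiGraph()

# The original sensor is claimed to be attached to a location.
claims.add_edge("did:example:sensor-A", "location:pump-station-12",
                key="attached_to", valid_from="2018-05-01", valid_to="2019-03-01")

# The replacement sensor (in-spec, but slightly different parameters) gets a new
# claim against the same location node, so the data stream stays keyed to the
# location rather than to any one physical sensor.
claims.add_edge("did:example:sensor-B", "location:pump-station-12",
                key="attached_to", valid_from="2019-03-01",
                center_point_offset=0.4)   # recorded difference vs. the old sensor

# Consumers ask the location which sensors have fed its stream, and when.
for sensor, location, key, data in claims.in_edges(
        "location:pump-station-12", keys=True, data=True):
    print(sensor, key, data)
```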

The beauty of the Ockam.io SDK is that the developer using it does not need to know that there is a blockchain, and doesn't need to know how to implement cryptography; they can, but there is no requirement to do so. These functions are available to the developer while the complexity is abstracted away. With ten lines of code, one can join a blockchain and cryptographically sign the identity of the device, but the developer does not need to learn the complexities behind this to provide orders of magnitude better security for their IoT implementation. The whole SDK is built around interfaces, such that one can adapt to various hardware characteristics, for example RaspberryPi vs microcontroller. All of this is handled by the layers of abstraction in the system. The hundreds of validator nodes in the system that maintain consensus need to process every transaction, and to repeat every transaction. To maximize the throughput of the system, Ockam.io uses graph tools that are very lightweight. Thus, as the OckamNetwork matures, it will use open source graph libraries, without using the full capabilities of a full graph database management or analytics system, such as Neo4j. This will allow micronodes that don't have cryptographic capability to still leverage the blockchain by being represented by a more capable node. A low-power device with limited bandwidth only needs to wake up at need, transmit a few hundred bits of data and confirm that the data was written, either by staying awake for a few seconds or by confirming the next time it wakes up. Micronodes can take advantage of the Ockam Network by having a software guard extension (SGX) system represent the device on the chain. Another aspect is that, much like older SOA systems, descriptive metadata enhances interoperability and self-discovery of devices that need to interoperate in a trusted fashion.

Beyond the technical, an important part of any open source project is community. There is no successful open source project that does not have a successful community. Ockam.io is building a community through GitHub, Slack, open source contributions to other projects (such as a DID parser in GoLang) and IoT, security and blockchain meetups. There are also currently six pilots being implemented with early adopters of the Ockam.io SDK that are growing the community. The advisory board includes futurists and advisors that are both proselytizing and mentoring to make the community strong.

It is early days for the W3C DID, but with companies like Ockam.io building open source SDKs that implement the DID specification with specific implementations, the future for both DID and Ockam.io is bright, and will help overcome the silo'd development that we've seen limiting IoT success. Ockam.io is not focused on any market vertical, but is focused on use-case verticals. This is applicable to all the solution spaces that make up the IoT, from SmartHomes to SmartRegions, from Supply-chains to Manufacturers, from oil rigs to SmartGrids, and from Fitness Trackers to Personalized, Predictive, Preventative Healthcare.

Developers who wish to provide strong identity management quickly and conveniently should check out the OckamNetwork architecture, download the repo from GitHub and up your IoT trust game by orders of magnitude.

An AI Powered 4D Printed Facial Tissue Drone

Imagine that you are in a future, augmented city. The sensors around you, through machine learning scoring and artificial narrow intelligence, realize that you are about to sneeze…even before you do. In response, a nearby 4D printer makes a handkerchief that feels as though it is made of the softest cotton-linen blend, and indeed those materials are part of the weave, but only a part. A variety of nano-materials make up the rest, incorporating soft sensors, and various mechanical properties that allow the handkerchief to fly to you from the 4D printer. And indeed, this is 4D, as the material properties change from a flying bird shape with powerful wings, to a soft facial tissue, landing in your hand, just in time to capture your sneeze. Now whether the sneeze was caused by some errant dust – this is, after all, an augmented city with integrated agriculture and green spaces – or an allergen, the handkerchief's sensors now analyze the sputum and mucous that you sneezed into it, just as secondary assurance that you aren't about to spread cold, flu, or more serious viral or bacterial contamination around you. The handkerchief is fully reusable, recyclable and repurposable, to be sterilized and become a face mask for you, to protect you from the dust or allergens, or to protect others from your disease vector, or to become something else altogether.

Machine learning scoring at the sensor package level – that is being done today by companies such as Simularity.

Machine learning and deep learning being incorporated into software to help guide augmented human decisions and autonomous machine decisions – a variety of companies, such as the ones we wrote about in our Data Grok posts over the past few years.

Artificial narrow Intelligence – is appearing in everything from chatbots to surgical robots, and is being investigated by more companies than we can add to this post.

Soft sensors – are currently being researched mostly in the textile and fashion industries.

IoT Architecture that includes hardware, firmware and software from the sensor to the Fog and Edge, through multiple intermediate aggregation points into a distributed Core of on-premises and multi-Cloud infrastructures and services – not implemented anywhere that I know of, and our own development of this architecture is still nascent.

Completion of the 5Cs IoT Maturity Model that we helped to develop in 2014, and are still working on today – again, not that I know of.

Fully augmented smart cities – there are projects and megaprojects and conferences everywhere, but all silo'd and incomplete to date.

A sensor analytics ecosystem that would allow this to occur, with proper provisioning of privacy, transparency, security and convenience while building trust through two-way accountability – not yet, and perhaps never, but something that we are working toward.

And finally, the framework of an ethical core, along with the cultural, regulatory, economic, political and environmental factors [in draft, coming soon] to bring such a sensor analytics ecosystem and augmented city into existence, needs to be understood.

A New Age for Data Quality

Once, most data quality issues were from human errors and inadequate business processes. While these still exist, new data sources, such as sensor data and third-party data from social media, openData and "wisdom of the crowd" introduce new sources of potential error. And yet, the old ways of storing "data" in log books, engineering journals, paper notes and filing cabinets are still widely practiced. At the same time, data quality is more important than ever as organizations rely more on predictive algorithms, machine learning, deep learning, artificial intelligence and cognitive computing. The basics of data quality have remained the same, but the means by which we can assure data quality are changing.

Data Quality Basics

Fundamentally, data quality is about trust; that the decisions made from the data are good decisions, based upon trustworthy data. To achieve this trust, data must be:

  1. correct
  2. valid
  3. accurate
  4. timely
  5. complete
  6. consistent
  7. singular (no duplications that affect count, aggregates, etc)
  8. unique
  9. [have] referential integrity
  10. [apply] domain integrity (data rules)
  11. [enforce] business rules

Now, these principles must be applied to all the new sources and uses of data, often as part of streaming or real-time decision support, automated decisions, or autonomous systems.
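
As a small sketch of how a few of the checks above might be automated, here is a Python example using pandas; the orders/customers tables and the domain rule are invented for illustration.

```python
# A sketch of automating a few basic data quality checks with pandas;
# the orders/customers tables and the domain rule are invented examples.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "customer_id": [10, 11, 11, 99],
    "quantity": [5, -3, 2, 1],
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

issues = {
    # Singularity: duplicates that would distort counts and aggregates.
    "duplicate_orders": orders[orders.duplicated(subset="order_id", keep=False)],
    # Completeness: missing values.
    "missing_values": orders[orders.isna().any(axis=1)],
    # Referential integrity: orders that point to non-existent customers.
    "orphan_orders": orders[~orders["customer_id"].isin(customers["customer_id"])],
    # Domain integrity: a data rule, e.g. quantity must be positive.
    "bad_quantity": orders[orders["quantity"] <= 0],
}

for check, rows in issues.items():
    print(check, "->", len(rows), "row(s) flagged")
```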

Moreover, the data rules and the business rules must reflect reality, including evolving cultural norms and regulatory requirements. For example, in many areas of the world, gender is no longer based simply on biology at birth, but includes gender identification that may be more than just male or female, and may change over time as an individual's self-awareness changes. As another example, regulations in some areas of the world are imposing stricter restrictions around individual privacy, such as the General Data Protection Regulation (GDPR) in the EU with full application coming in May of 2018.

Data Verification

Third-party data verification tools have been around for decades; they are often purchased and installed on-premises, complete with their own databases of information. Today, data verification may be done through such tools, or through openData and openGov databases; modern data preparation tools may even recommend freely available data sources, such as demographic data, to enhance and verify the data that your organization has collected or generated. Other data, such as social media data, is also available to enhance your understanding of customers, markets, culture, regulations and politics that might influence your decisions. Current third-party data is most often accessed through Application Programming Interfaces (APIs) that may be HTTP or ReSTful, or might be proprietary. Use, or rather, misuse of these APIs has the potential to degrade, rather than enhance, your decision support process. Another issue is that you may not know how third-party data is governed according to the basics of data quality. Again, modern data preparation and API management tools can help with these issues, as can open architectures and specifications.
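
A sketch of what API-based verification might look like in practice, with a hypothetical REST endpoint and response fields standing in for a real third-party service:

```python
# A sketch of verifying a record against a third-party REST API;
# the endpoint, parameters and response fields are hypothetical placeholders.
import requests

def verify_postal_code(postal_code, country="US"):
    resp = requests.get(
        "https://api.example.com/v1/postal-codes",   # hypothetical service
        params={"code": postal_code, "country": country},
        timeout=5,
    )
    resp.raise_for_status()
    payload = resp.json()
    # Treat the third-party answer as evidence, not truth: record where it came
    # from so downstream users know how this field was verified.
    return {
        "verified": payload.get("valid", False),
        "source": "api.example.com",
    }

print(verify_postal_code("94044"))
```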

Data from sensors and from sensor-actuator feedback loops aren't new. Data from connected sensors, actuators, feedback loops, and all kinds of things, from pills to diagnostic machines, from wearables to cars, from parking sensors to a city's complete transportation system, some of which may be available through openGov initiatives, are new. Many of the organizations using such IoT data have never used such data before.

Now that we have taken a very brief look into data quality and new opportunities, let's go into the new tools we have to use these new data opportunities.

Data Stewardship through AI

In the spirit of drinking one’s own champagne, many of the new uses of data – the output of data science – are being applied to data management. As software has consumed the world, machine learning is eating software; deep learning and artificial intelligence are rapidly becoming the top of this food chain. Once, a dozen or so source systems made for a good size data warehouse, with nightly ETL updates. Now, organizations are streaming hundreds of sources into data lakes. The people, processes and technologies for data quality can only keep up through augmentation with advanced analytic algorithms. Machine learning uses metadata to continuously update business catalogues as artificial intelligence augments the data stewards. Metadata is changing as well, to provide semantic layers within data management tools, and to better understand the data sets coming from the IoT, social media, or open data initiatives.

The first players to apply these techniques to data management and analytics became our first "Data Grok" companies, providing tools that help humans grok data and how that data can be used. Since then, the first companies to earn the DataGrok designation, Paxata and Ayasdi, have been joined by many others adding machine learning, deep learning and even artificial narrow intelligence (ANI) to provide recommendations and guardrails to data scientists, data stewards, business analysts, and any individual using organizational data to make decisions.

Data Quality Relations

Data Management development through the execution of enterprise architecture, policies, practices and procedures encompasses the interaction among data quality, data governance, and data integrity. Regulatory and process compliance are dependent upon all three. Ownership of each data set, data element and even datum, is critical to assuring data quality and data integrity, and is the first step to providing data governance. Business metadata, technical metadata and object metadata come together through business, technical and operational ownership of the data to build data stewardship and data custodian policies. The architectural frameworks used for Enterprise, IoT and Data architectures result in specifications for each critical data element that provide an overarching view across all business, technical and operational functions.

Data governance interacts with architectural activities in an agile and continuous improvement process that allows standards and specifications to reflect changing organizational needs. The processes and people can assure that data specifications are applicable to the needs of each organizational unit while assuring that data standards are uniformly applied across the organization. The size and culture of an organization determines the formality and structure of data governance and may include a governing council, sponsorship at various organizational levels, executive sponsorship (at a minimum), data ownership, data stewardship, data custodianship, change control and monitoring. But even with all this, the goal of data governance must be to provide appropriate access to data, and not restrict the use of data…from any source.

IT Must Adapt

Information Technology has often been seen as a bottleneck. Many times in our consulting work, we have found ourselves in the position of arbiter between IT and the business. Self-service BI, Analytics and Data Preparation mean IT must become an enabler of data usage, providing trustworthy data without restricting the users. The productionalizing of data science again means that IT must be an enabler of data usage, including the machine learning and other advanced analytics models that data science teams produce. As data science and data management & analytics tools come together, the need for IT to guide the use of data and tools without limiting that use becomes paramount. At the same time, privacy and security must be retained within data governance. Patient data must only be available to the patient and those healthcare professionals and caregivers who require access to that data. Personally Identifiable Information (PII) must be controlled. Regulatory compliance, such as GDPR and PCI, must be adhered to.

There is also a need for two-way traceability from the datum to the end-use in reports and analytics, training sets or scoring, and from the end-use to the source system, including lineage of all transformations along the way. This lineage of source and use enables both regulatory compliance and collaboration. Such transparent history also helps build trust in the data, and in what other users and IT data management professionals have done to the data.

IT and OT must Work Together

As connected products mature through the 5Cs of our IoT maturity model (connection, communication, collaboration, contextualization and cognition), information technology and operations technology, business systems and engineering systems, must share data under a unified architecture. Much of the promise of the IoT can only be achieved through IT and OT working together. Consumer and marketing information being merged with supply chain and production quality information to build predictive models that allow just-in-time inventory control and agile, custom product delivery is only one example of changes to consumer expectation, whether that consumer is another business, a government or an individual. Industries from every market, such as the energy sector, consumer packaged goods and pharmaceutical manufacturing have reaped the benefits of IT and OT working together, of SCADA/Historians data being integrated with Cloud marketing and sales data or ERP data. But for this partnership between IT and OT to work, they each must trust the data of the other, and that only happens through data governance and data quality efforts.

Metadata and Master Data Management in DQ

Metadata and Master Data Management (MDM) are fundamental in ensuring data quality, and key to using trustworthy data throughout a modern data ecosystem from the most modern data sources and analytic requirements at the Edge to the most enduring legacy systems at the Core; from the droplets in the Fog to the globally distributed multi-Cloud and hybrid architectures. Metadata and MDM have been part of the solution all along, but now must be applied in new ways, both at the core and at the Edge, and distributed through multiple Cloud, hybrid architectures, on-premises, and out into the furthest reaches of the Fog, as all these resources elastically scale up and down at need.

Sensor Data Makes for Interesting DQ

Some of us have been dealing with sensors, sensor-actuator feedback loops and the concepts of the large, complex system for all of our careers, but for many, the fundamentals of connected hardware will be new. Sensor data can be messy. Two sensors from the same manufacturer will be slightly different in the data sets produced, even though they both meet specification; two sensors from different manufacturers will certainly be different in center point, range, precision and accuracy, and how the data are packaged. Sensors drift over time, and will need calibration against public standards. Sensors age, and may be replaced, and both of these conditions affect all the previous points.
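
As a small sketch of one of these issues, the Python example below fits gain and offset for a drifting sensor against a reference standard and uses the fit to correct new readings; the readings are fabricated, and real calibration procedures are considerably more involved.

```python
# A sketch of correcting gain and offset for a drifting sensor by fitting its raw
# readings against a reference standard; the readings below are fabricated.
import numpy as np

reference = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # calibrated standard
raw       = np.array([11.2, 21.9, 32.4, 43.1, 53.6])   # what the drifted sensor reports

# Fit raw = gain * reference + offset, then invert it to correct new readings.
gain, offset = np.polyfit(reference, raw, deg=1)

def correct(raw_value):
    return (raw_value - offset) / gain

new_readings = np.array([27.8, 36.5])
print(correct(new_readings))   # readings mapped back onto the reference scale
```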

Data architecture and DQ

Having worked in System Engineering for aerospace, I go to Deming's definition of Quality as conformance to specifications well suited to the customer, and, for data, specifications come from the architecture.

Architecture abstracts out the organizational needs as a series of views representing the perspectives of the people, processes and technologies affected by and effected through that solution, system or ecosystem. A standalone quality solutions architecture is not a good idea, as quality must be pervasive through an architecture. However, adding quality as a view within an architecture assures that data quality, data governance and compliance are properly represented within the architecture. {Though outside the scope of this post, I would also consider adding security as a separate view.} There are many architectural frameworks, and even controversy about their effectiveness; TOGAF, MIKE2, 4+1 and BOST are the main frameworks. Architectural frameworks focus on enterprise, data and solutions (application) architectures, with a recent interest in Internet of Things (IoT) architecture. Adherence to a framework or method is not as important as that the process by which an architecture is created meets the culture and needs of the organization.

Standards


For reference purposes, here is a list of data quality standards and methods that you might find useful:

  • ISO9001 Quality Management Family of Standards
  • ISO 8000 Data Quality Family of Standards
  • EFQM Quality Management Framework and Excellence Model
  • TOGAF The Open Group Architecture Framework for Data Architecture
  • BOST [PDF] An Introduction to the BOST Framework and Reference Models by Informatica
  • MIKE2 The Open Source Standard for Information Management
  • 4+1 Views [PDF] Architectural Blueprint by Philippe Kruchten
  • TDWI Data Improvement Documents