Questions and Commonality

In the introduction to our open source solutions (OSS) for decision support systems (DSS) study guide (SG), I gave a variety of examples of activities that might be considered using a DSS. I asked what common elements exist among these activities that might help us to define a modern platform for DSS, and whether we could build such a system using open source solutions.

In this post, let's examine the first of those questions and see if we can start answering it. In the next post, we will lay out a syllabus of sorts for this OSS DSS SG.

The first common element is that in all cases we have an individual doing the activity, neither a machine nor a committee.

Secondly, the individual has some resources at their disposal. Those resources include current and historical information, structured and unstructured data, communiqués and opinions, and some amount of personal experience, augmented by the experience of others.

Thirdly, though not explicit, there's the idea of digesting these resources and performing formal or informal analyses.

Fourthly, though again not explicit, the concept of trying to predict what might happen next, or as a result of the decision, is inherent in all of the examples.

Finally, there's collaboration involved. Few of us can make good decisions in a vacuum.

Of course, since the examples are fictional, and created by us, they represent our biases. If you had fingered our domain server back in 1993, or read our .project and .plan files from that time, you would have seen that we were interested in sharing information and analyses, while providing a framework for making decisions using such tools as email, gopher and electronic bulletin boards. So, if you identify any other commonalities, or think anything is missing, please join the discussion in the comments.

From these commonalities, can we begin to answer the first question we asked: "What does this term [DSS] really mean?" Let's try.

A DSS is a set of processes and technology that help an individual to make a better decision than they could without the DSS.

That's nice and vague; generic enough to be almost meaningless, but it provides some key points that will help us to bound the specifics as we go along. For example, if a process or technology doesn't help us to make a better decision, then it doesn't fit. If something allows us to make a better decision, but we can't define the process or identify the technology involved, it doesn't belong either (e.g. "my gut tells me so").

Let's create a list from all of the above.

  1. Individual Decision Maker
  2. Process
  3. Technology
  4. Structured Data
  5. Unstructured Data
  6. Historical Information
  7. Current Information
  8. Communication
  9. Opinion
  10. Collaboration
  11. Analysis
  12. Prediction
  13. Personal Experience
  14. Others' Experience

What do you think? Does a modern system to support decisions need to cover all of these elements and no others? Is this list complete and sufficient? The comments are open.

First DSS Study Guide

Someone sitting in their study, looking at their books, journals, piles of scholarly periodicals and files of correspondence with learned colleagues probably didn't think that they were looking at their decision support system, but they were.

Someone sitting on the plains, looking at the conditions around them, smoke signals from distant tribe members, records knotted into a string, probably didn't think that they were looking at their decision support system, but they were.

Someone at the nexus of a modern military command, control, communications, computing and intelligence system, probably didn't think that they were looking at their decision support system, but they were.

Someone pulling data from transactional systems, and dumping the results of reports & analyses from a BI tool into a spreadsheet to feed a dashboard for the executives of a huge corporation probably didn't think that they were looking at their decision support system, but they were.

The term "decision support system" has been in use for over 50 years, perhaps longer.

  • But what does this term really mean?
  • What do all of my examples have in common?
  • How can we build a reasonable decision support system from open source solutions?
  • What resources exist to help us learn?

I'm starting a series of posts, essentially a "study guide" to help answer these questions.

I'll be drawing from and pointing to the following books and online resources as we install, configure and use open source systems to create a technical platform for a decision support system.

  1. Bayesian Computation with R by Jim Albert, Springer Use R! series, ISBN: 0-387-92297-0, Purchase from Amazon, you can also purchase the Kindle ebook from Amazon
  2. R in a Nutshell by Joseph Adler, ISBN: 0-596-80170-X, Purchase from Amazon
  3. Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL, by Roland Bouman and Jos van Dongen, ISBN: 0-470-48432-2, Purchase from Amazon
  4. Pentaho Reporting 3.5 for Java Developers by Will Gorman, ISBN: 1-84719-319-6, Purchase from Amazon
  5. Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration by Matt Casters, Roland Bouman & Jos van Dongen, ISBN: 0-470-63517-7, due 2010 September, Pre-Order from Amazon
  6. Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten and Eibe Frank, Second Edition, Morgan-Kaufmann Series in Data Management Systems, ISBN: 0-12-088407-0 a.k.a. "The Weka Book", Purchase from Amazon, Pre-Order the Third Edition, you can also purchase the Kindle ebook from Amazon
  7. LucidDB online documentation
  8. Pertinent information from Eigenbase
  9. LucidDB mailing list archive on Nabble
  10. Anything I can find on PAT
  11. Pentaho Community Forums, Wiki, WebEx Events, and other community sources
  12. R Mailing Lists and Forums
  13. Various Books in PDF from The R Project
  14. Information Management and Open Source Solution Blogs from our side-column linkblogs

In this study guide series of posts:

  • I'll show how data warehousing (DW) and business intelligence (BI) can be extended to include all the elements held in common by my DSS examples.
  • We'll examine the open source solutions Pentaho, R, Rserve, rApache, LucidDB and possibly MapReduce & key-value stores, along with the related open source projects, communities and companies, in terms of how they can be used to create a DSS (a minimal Rserve sketch follows this list).
  • I would like to add a collaboration tool to the mix, as we do in our implementation projects, possibly MindTouch, a RESTful wiki platform.
  • We may add one non-open source package, SQLStream, that's built upon open source elements from Eigenbase. This will allow us to add a real-time component to our DSS.
  • I'll give my own experience in installing these packages and getting them to work together, with pointers to the resources listed above.
  • We'll explore sample and public data sets with the DSS environment we created, again with pointers to and help from the resources listed.

This series of posts is meant to be a study guide, not an online book written as a blog. The goal is to help us define a modern DSS and build it out of open source solutions, while using existing resources.

Please feel free to comment, especially if there is anything that you feel should be included beyond what I've outlined here.

Modeling and Predictives

Here's a personal perspective and a bit of a personal history regarding mathematical modeling and predictives.

The 1980s were an exciting time for mathematical modeling of complex systems. At the time, there were two basic types of modeling: deterministic and stochastic (probability or statistics models). Within stochastic modeling, traditional statistics vs. Bayesian statistics was a burgeoning battleground. Physical simulations (often based upon deterministic models) were giving way to computer simulations (often based upon stochastic models, especially Monte Carlo Simulations). Two theories were popularized during this time: catastrophe theory and chaos theory; ultimately though, both of these theories proved incapable of prediction - the hallmark of a good mathematical model. A different type of modeling technique, based upon relational algebra, was also moving from the theoretical work of Ted Codd, to the practical implementations at (the company now known as) Oracle: data modeling.

Mathematical models are attempts to understand the complex by making simplifying assumptions. They are always a balance between complexity and accuracy. One nice example of the evolution of a deterministic mathematical model can be found in the Ideal Gas Laws, moving from Boyle's Law through Charles's Law, Gay-Lussac's Law and Avogadro's Law, culminating in the Ideal Gas Law, which all of us saw in high school chemistry: PV=nRT.
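
As a quick worked illustration of that deterministic model (the numbers below are textbook values, not anything from the original derivations), the Ideal Gas Law is trivial to evaluate in R:

```r
# Ideal Gas Law: P * V = n * R * T, solved here for pressure.
# Units: volume in litres, temperature in kelvin, pressure in atmospheres.
R_const <- 0.082057   # ideal gas constant, L*atm / (mol*K)

ideal_gas_pressure <- function(n_mol, volume_l, temp_k) {
  n_mol * R_const * temp_k / volume_l
}

# One mole of an ideal gas in 22.4 L at 273.15 K is at roughly 1 atm.
ideal_gas_pressure(n_mol = 1, volume_l = 22.4, temp_k = 273.15)
```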

Mathematical models are used in pretty much all fields of endeavor: physical sciences, all types of engineering, behavioral studies, and business. In the 1970s, I used deterministic electrochemical models to understand and predict the behaviour of various chemical stoichiometries for fuel cells and photovoltaic cells. In the 1980s, I used Bayesian statistics, sometimes combined with Monte Carlo simulations, to predict the reliability and risk associated with complex aerospace, utility and other systems.

The most popular use of Bayesian statistics was to expand the a priori knowledge of a complex system with subjective opinions. Likely the most famous application of Bayesian Statistics, at the time I became involved with the branch, was the Rand Corporation's Delphi Method. There was actually a joke in the Aerospace Industry about the Delphi Method:

A team of Rand consultants went to Wernher von Braun to seek the expert opinion of the engineers working on a new rocket motor. The consultants explained their Delphi Method thusly. Prior to the first static test of the new rocket motor, they would ask, separately, each of the five engineers working on the new design their opinion of the rocket's reliability. Their opinions would form the Bayesian a priori distribution. After the test, they would reveal the results of the first survey and the test results, and ask the five engineers, collectively, their new opinion of the rocket's reliability. This would form the Bayesian a posteriori, from which the rocket's reliability would be predicted. Doctor von Braun said that he could save them some time. He gathered his team of rocket engineers, and asked them if they thought that the new rocket motor would fail. Each answered, as did Doctor von Braun, "no" in German. "There, you see, five nines reliability, as specified," declared the good Doctor to the Rand consultants, "No need for any further study on your part."

Yep, it's a side splitter. :))

I didn't like this method, and did things a bit differently. My method involved gathering all the data for similar test and production models, weighting each relevant engineering variable, creating the a priori, fitting it with Weibull analysis, designing the Bayesian mathematical conjugate, using a detailed post-mortem of the first and subsequent tests of the system being analyzed, and updating and learning as we went, to finally predict the reliability and risk for the system. (A simplified sketch of the conjugate-update idea follows the list below.) I first used this on the Star48 perigee kick motor, and went on to refine and use this method for:

  • a variety of apogee and perigee kick motors
  • several components of the Space Transportation System
  • the Extreme Ultraviolet Explorer
  • Gravity Probe-B
  • a halogen lamp pizza oven
  • a methodology for failure mode, effects and criticality analysis of the electrical grid
  • and many more systems
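
To make the flavor of that approach concrete, here is a deliberately simplified sketch in R of a conjugate Bayesian reliability update; it is not the actual Star48 analysis. It assumes the Weibull shape parameter is known and fixed, places a gamma prior on the Weibull scale (expressed as a rate), and updates that prior with observed failure times. The prior parameters, shape value and failure times are all invented for illustration.

```r
# Simplified conjugate Bayesian reliability update (illustrative only).
# Model: time-to-failure ~ Weibull with a KNOWN shape beta. Then t^beta is
# exponentially distributed with rate lambda, and a Gamma(a, b) prior on
# lambda is conjugate.

beta_shape <- 1.5                 # assumed, fixed Weibull shape
a0 <- 2;  b0 <- 50                # gamma prior on lambda, from "similar motors"
failures <- c(120, 85, 160)       # observed failure times (made-up data)

# Conjugate update: posterior is Gamma(a0 + n, b0 + sum(t^beta)).
a_post <- a0 + length(failures)
b_post <- b0 + sum(failures ^ beta_shape)

# Posterior predictive reliability at a mission time t_m:
#   R(t_m) = E[exp(-lambda * t_m^beta)] = (b_post / (b_post + t_m^beta))^a_post
mission_time <- 100
reliability  <- (b_post / (b_post + mission_time ^ beta_shape)) ^ a_post
reliability
```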

I started to call this method "objective Bayes", but that name was already taken by a branch of Bayesian statistics that uses a non-informative a priori. Several of my projects resulted in software programs, all in FORTRAN. The first was used as a justification for a 1 MB [no, not a mistake] "box" [memory] for the corporate mainframe. NASA had sent us detailed data on over 4,000 solid propellant rocket motors. Talk about "big data". ;) I had a lot of fun doing this into the 1990's.

The next paradigm shift, for me personally, was learning data modeling, and focusing on business processes rather than engineering systems. Spending time at Oracle, including working with Richard Barker and his computer-aided systems engineering methods, I felt right at home. Rather than Bayesian statistics, I would be using relational algebra and calculus for deterministic mathematical models of the data for the business processes being stored in a relational database management system. I very quickly got involved in very large databases, decision support systems, data warehousing and business intelligence.

I was surprised, and, after 17 years, continue to be surprised, how few data modelers agree with the statement in the preceding paragraph. I'm surprised how few data modelers go beyond entity-relationship diagrams; how few know or care about relational algebra and relational calculus. I'm amazed how few people realize that the arithmetic average computed in most "analytic" systems is a fairly useless measure of the underlying data, for most systems. I'm amazed that BI and analytic systems are still deterministic, and always go with simplicity over accuracy.
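
A toy illustration of that last point about the arithmetic average, using the kind of right-skewed distribution that turns up constantly in operational data (the numbers are simulated, not from any of the systems I worked on):

```r
# The arithmetic mean can be a poor summary of skewed data.
set.seed(42)
response_times <- rlnorm(10000, meanlog = 0, sdlog = 1.5)   # heavily right-skewed

mean(response_times)                         # dragged upward by the long tail
median(response_times)                       # much closer to a "typical" value
quantile(response_times, c(0.5, 0.9, 0.99))  # a fuller picture of the distribution
```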

But computer power continues to expand. Moore's Law still rules. We can do better now. Things that used to take powerful main frames or even supercomputers can be done on laptops now. We no longer need to settle for simplicity over accuracy.

More importantly, the R Statistical Language has matured. Literally thousands and thousands of mathematical, graphical and statistical packages have been added to the CRAN, Omegahat and BioConductor repositories. Even the New York Times has published pieces about R.

It's once again time to move from deterministic to stochastic models.

Over the next few weeks, I hope to post a series of "study guides" that will focus on setting up a web-based environment consolidating SQL- and MDX-based analytics, as expressed in the Pentaho and LucidDB open source projects, with R, and possibly SQLStream. Updated 20100314 to correct links (typos). Thanks to Doug Moran of Pentaho for catching this.
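
As a preview of the kind of glue that environment will need, here is a sketch of pulling a result set from LucidDB into R over JDBC with the RJDBC package. The driver class name, jar path, connection URL and table are all placeholders I've assumed for illustration; the LucidDB online documentation listed earlier is the authority on the real values.

```r
# Sketch: query LucidDB from R via JDBC and summarize the result.
# The driver class, jar location, URL and table below are placeholders --
# check the LucidDB documentation for the correct values for your install.
library(RJDBC)

drv  <- JDBC(driverClass = "org.luciddb.jdbc.LucidDbClientDriver",
             classPath   = "/opt/luciddb/plugin/LucidDbClient.jar")
conn <- dbConnect(drv, "jdbc:luciddb:http://localhost:8034")

sales <- dbGetQuery(conn, "SELECT * FROM warehouse.sales_fact")
summary(sales)

dbDisconnect(conn)
```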

There have been many articles as well on "Big Data". As I commented on Merv Adrian's blog post request for "Ideas for SF Big Data Summit":

One area of discussion, which may appear to be for the “newbies” but is actually a matter of some debate, would be the definition of “big data”.

It really isn’t about the amount of data (TB & PB & more) so much as it is about the volumetric flow and timeliness of the data streams.

It’s about how data management systems handle the various sources of data as well as the interweaving of those sources.

It means treating data management systems in the same way that we treat the Space Transportation System, as a very large, complex system.

-- Comment by Joseph A. di Paolantonio, February 1, 2010 at 4:09 pm

I believe this because there is a huge amount of data about to come down the pipe. I'm not talking about the Semantic Web or the piddly little petabytes of web log and click-through data. I'm talking about the instrumented world. Something that's been in the making for ten years and more: RFID, SmartDust, ZigBee, and more wired and wireless sensors, monitors and devices that will become a part of everything, everywhere.

Let me just cite two examples from something that is coming, is hyped, but is not yet standardized, even if solid attempts at definition are being made: the SmartGrid. First, consider the fact that utility companies are deploying smart meters to replace manually read mechanical meters at homes and businesses; this will result in thousands of data points per day, as opposed to one per month, PER METER. The second is EPRI's copper-riding robot, as explained in a recent Popular Science article. Think of the petabytes of data that these two examples will generate monthly. [Order the Smart Grid Dictionary: First Edition on Amazon]
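
Some rough back-of-the-envelope arithmetic on that first example; the reading interval, meter count and bytes per reading are my own assumptions, not figures from any utility:

```r
# Rough estimate of smart-meter data volume versus monthly manual reads.
# All inputs are assumptions for illustration only.
meters            <- 1e6        # one million meters in a service territory
readings_per_day  <- 24 * 60    # one reading per minute, per meter
bytes_per_reading <- 100        # timestamp, meter id, register values, overhead

readings_per_month <- meters * readings_per_day * 30
tb_per_month       <- readings_per_month * bytes_per_reading / 1e12

readings_per_month   # ~43 billion readings, versus 1 million manual reads
tb_per_month         # a few terabytes per month, raw, for one utility's meters
```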

The desire, the need, to analyze and make inferences from this data will be great. The need to actually predict from this data will be even greater, and will be a necessary element of the coming SmartGrid, and in making the instrumented world a better world for all of humanity.

Now if only we can avoid the likes of Skynet and Archangel.

An Open Source Children's Story

On Twitter today, Lance Walter asked me to go into the Ark Business with him, and Gareth Greenaway asked for entertainment. It must be a rainy Friday afternoon ;)

I'm not sure about Lance's offer, but I did tell Gareth the following story, from tweet-start to tweet-end. This isn't word for word as I tweeted. 'Tis a bit expanded, but the tale is the same.

Once upon a time there was a young penguin named Tux. Tux decided to set off on a journey through IT Land. Now IT Land is a dangerous place, full of hackers fighting crackers, and ruled by those in the Ivory Tower and the acolytes of the Megaliths.

Along the way, the adventurous Tux met the Dolphin, the Elephant and the Beekeeper. They made a pact on the Lucid glyph to become a Dynamo of IT, bringing power to the datasmiths of the Land.

They met many Titans from the Megaliths on their Quest. The Beekeeper used the open source bees to open the scrum along the way, blocking the hookers with their sharp claws.

Some of the Titans were helpful, some, not so much.

The Dolphin was empowered by the Sun. But the Sun was consumed by a powerful Oracle. The Elephant, too, gained a powerful ally, and they do Enterprise against the Oracle. The band of the Quest was broken, and Tux was sad.

The Era of Lucid thought ended, but the Dynamo yet powers the Lucid Glyph, and Tux can rely on the Dynamo and the Beekeeper to predict a future clear of the Oracle.

And thus this quest ends, but another soon begins, where Tux will meet new friends and new foes. Will Beastie and the dæmons be allies? Will the Paladin in the Red Hat be stalwart?

Perhaps we'll find out at OSCON, for Gareth suggested that an assemblage of geeks would enjoy this story, and we'll see if OSCON thinks our tales worthy of a keynote slot in 2010.

Do you recognize all the characters in this tale? Maybe the links will help.

What say you, OSCON? Would these tales make a worthy Keynote?

Pentaho Reporting Review

As promised in my post, "Pentaho Reporting 3.5 for Java Developers First Look", I've taken the time to thoroughly grok Pentaho Reporting 3.5 for Java Developers by Will Gorman [direct link to Packt Publishing][Buy the book from Amazon]. I've read the book, cover-to-cover, and gone through the [non-Java] exercises. As I said in my first look at this book, it contains nuggets of wisdom and practicalities drawn from deep insider knowledge. This book best serves its target audience, Java developers with a need to incorporate reporting into their applications, but it is also useful for report developers who wish to know more about Pentaho, and Pentaho users who wish to make their use of Pentaho easier and the resulting reporting experience richer.

The first three chapters provide a very good historical, technical and practical introduction to Pentaho Reporting and its relationship to the Pentaho BI Suite and the company Pentaho. These three chapters are also the ones that have clearly marked sections for Java-specific information and exercises. By the end of Chapter Three, you'll have installed Pentaho Report Designer and built several rich reports. If you're a Java developer, you'll have had the opportunity to incorporate these reports into either Tomcat J2EE web applications or Swing applications. You'll have been introduced to the rich reporting capabilities of Pentaho, accessing data sources, the underlying Java libraries, and the various output options that include PDF, Excel, CSV, RTF, XML and plain text.

Chapters 4 through 8 are all about the WYSIWYG Pentaho Report Designer, the pixel-level control that it gives you over the layout of your reports, and the many wonderful capabilities provided by Pentaho Reporting, from a wide range of chart types, to embedded numeric and text functions, to cross-tabs and sub-reports. Other than Chapter 5, these chapters are as useful for a business user creating their own reports as they are for a report developer. Chapter 5 is a very deep, very technical dive into incorporating various data sources. The two areas that really stand out are the charts (Chapter 6) and functions (Chapter 7).

A baker's dozen types of charts are covered, with an example for each type. Some of the more exotic are Waterfall, Bar-Line, Radar and Extended XY Series charts.

There are hundreds of parameters, functions and expressions that can be used in Pentaho Reports, and Will covers them all. The formula capability of Pentaho Reporting follows the OpenFormula standard, similar to the support for formulæ in Microsoft Excel, and the same as that followed by OpenOffice.org. One can provide computed text or numeric values within Pentaho reports, to a fairly complex extent, and Chapter 7 provides a great introduction to using this feature.

Chapters 9 through 11 are very much for the software developer, covering the development of interactive reports in Swing and HTML, the use of Pentaho's APIs, and extension of Pentaho Reporting's capabilities. It's all interesting stuff that really explains the technology of Pentaho Reporting, but there's little here that is of use to the business user or non-Java report developer.

The first part of Chapter 12, on the other hand, is of little use to the Java developer, as it shows how to take reports created in Pentaho Report Designer and publish them through the Pentaho BI-Server, including formats suitable to mobile devices, such as the iPhone. The latter part of Chapter 12 goes into the use of metadata, and is useful both for the report developer and the Java developer.

So, as I said in my first look, the majority of the book is useful even if you're not a Java developer who needs to incorporate sophisticated reports into your application. That being said, Will Gorman does an excellent job in explaining Pentaho Reporting, and making it very useful for business users, report designers, report developers and, his target audience, Java developers. I heartily recommend that you buy this book. [Amazon link]
