Classic relational databases have been the mainstay of information analysis for decades, and they aren't going anywhere any time soon. In fact, all the major database platforms have made significant inroads in recent years in helping users deal with the glut of data we need to manage and mine -- the so-called Big Data problem.
There is one particular area where these relational database systems all run into trouble regardless of scaling adaptations, and that is when you have highly interconnected data. If you are a SQL user, you can spot these cases when you need multiple table JOINs in a query to get the information you need. If you are not versed in SQL, think of them as the real-world operational questions that are a couple of steps beyond simple. For example, the question "how many customers do we have in the 20001 zip code that ordered within the last month?" is trivial for a relational database. You can picture it as extracting the purchase data into a simple spreadsheet, then filtering it by zip code and sorting it by date. But what about the equally plausible question "how many customers do we have in the 20001 zip code who ordered in the last month, contacted customer support, and ended up requesting a return?" Both are valid questions in the course of business, but the second likely requires JOINing (linking) the table that tracks orders, the one that tracks customer service communication, and the one that tracks return authorizations. All of these tables are likely to be large, and that is when relational database systems start to slow down.
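To make the JOIN chain concrete, here is a minimal sketch of the second question in SQL, run through Python's sqlite3. The table and column names (`customers`, `orders`, `support_contacts`, `returns`) are illustrative assumptions standing in for whatever a real system uses; the point is the chain of three JOINs, not the particular schema.

```python
import sqlite3

# Tiny in-memory schema; names and dates are purely illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, zip TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     order_date TEXT);
CREATE TABLE support_contacts (id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE returns (id INTEGER PRIMARY KEY, contact_id INTEGER);
INSERT INTO customers VALUES (1, '20001'), (2, '20001'), (3, '90210');
INSERT INTO orders VALUES (1, 1, '2024-05-10'), (2, 2, '2024-05-12'),
                          (3, 3, '2024-05-15');
INSERT INTO support_contacts VALUES (1, 1), (2, 3);
INSERT INTO returns VALUES (1, 1);
""")

# The "couple of steps beyond simple" question chains JOINs across
# orders, support contacts, and return authorizations.
rows = conn.execute("""
SELECT COUNT(DISTINCT c.id)
FROM customers c
JOIN orders o           ON o.customer_id = c.id
JOIN support_contacts s ON s.customer_id = c.id
JOIN returns r          ON r.contact_id = s.id
WHERE c.zip = '20001'
  AND o.order_date >= '2024-05-01'
""").fetchone()
print(rows[0])  # 1: only customer 1 matches all three conditions
```

On a toy dataset this is instant; the trouble described above appears when each of the joined tables holds millions of rows.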
As another example, imagine you want to build a recommendation engine so that your customer service representatives can suggest new products to people when they call in. In that case you want to find items of interest to one person based on the ordering behavior of others. The simplistic solution is to find a product that two people both ordered and then recommend to the first person the other products that the second person bought. This is illustrated below in Figure 1 as a 'Simple Recommendation Model', where Customer A purchased products 1, 2, and 3 and Customer B purchased products 2, 3, and 4. The simplistic analysis is that since they both purchased products 2 and 3, you could recommend product 1 to Customer B and product 4 to Customer A.
Figure 1: Simple Recommendation Model
While the above is useful, it is limited: you can only make recommendations for customers who have one or more purchases in common, and, more importantly, you have no context for evaluating the relative strength or importance of each recommendation. Products 2, 3, and 4 might all be sporting goods, while product 1 might be a fluffy pink robe that Customer A happened to buy for his mom. The simple purchase comparison gives you no quantitative insight into the appropriateness of the recommendation.
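The simple model amounts to set intersection and difference. Here is a minimal sketch in Python using the purchase histories from Figure 1 (the function name and data layout are my own, not from any particular system):

```python
# Purchase histories from Figure 1; product IDs are just integers.
purchases = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
}

def simple_recommend(target, other, purchases):
    """If the two customers share any purchase, recommend the other
    customer's products that the target has not already bought."""
    if purchases[target] & purchases[other]:
        return purchases[other] - purchases[target]
    return set()

print(simple_recommend("A", "B", purchases))  # {4}
print(simple_recommend("B", "A", purchases))  # {1}
```

Note that every recommendation comes back as a bare set -- there is no score attached, which is exactly the missing-context problem described above.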
Below is an alternative 'Deep Recommendation' model that addresses both shortcomings. We start again with Customers A and B with the same purchase histories, but we add the purchase behavior of additional customers (C, D, and E) to the analysis. You can see we would still recommend product 4 to Customer A, just as we did in the simple model. However, we can now also see that purchasers of product 4 have a very high likelihood of also purchasing product 5. So this recommendation model would suggest that Customer A look at both product 4 and product 5, based on the group purchase history of C, D, and E, even though Customer A has no purchase history in common with those other customers.
We can see this because we drew a graph of the data rather than just listing the transactions in a table or spreadsheet. The graph framework lets us evaluate these interconnected relationships in a different way. By exploiting nodal analysis, we can identify and prioritize product recommendations that are likely to generate a series of additional purchases rather than just an incremental one, and we get a quantitative assessment of the strength of each recommendation.
Figure 2: Deep Recommendation Model
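The deep model can be sketched as a two-hop walk over the purchase graph. The histories for C, D, and E below are illustrative assumptions consistent with the figure's description (all three bought product 4, and most also bought product 5); the path counts supply the quantitative strength discussed above.

```python
from collections import Counter

# Purchase graph from Figure 2. A's and B's histories match the simple
# model; C, D, and E are assumed histories chosen to match the text.
purchases = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {4, 5},
    "D": {4, 5},
    "E": {4, 5, 6},
}

def deep_recommend(target, purchases):
    """Two-hop nodal analysis: start from customers who share a purchase
    with the target (direct), then pull in customers who share a purchase
    with those customers' products (indirect). Each buyer of a candidate
    product adds one to its score, so the result is ranked, not a bare set."""
    owned = purchases[target]
    direct = {c for c, p in purchases.items() if c != target and p & owned}
    bridge_products = set().union(*(purchases[c] for c in direct)) - owned
    indirect = {c for c, p in purchases.items()
                if c != target and c not in direct and p & bridge_products}
    scores = Counter()
    for cust in direct | indirect:
        for prod in purchases[cust] - owned:
            scores[prod] += 1
    return scores

print(deep_recommend("A", purchases))
# Product 4 scores highest, product 5 second -- both get recommended
# to Customer A, matching the figure.
```

Even on this toy graph you can see why a relational engine struggles: each hop is another JOIN over the full transaction table, while a graph store just follows edges from the nodes already in hand.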
The challenge with the 'Deep Recommendation' model is that while it is easy to visualize, it is very challenging to execute using conventional SQL databases. It can be done, but as the number of customers, products, and transactions grows, the horsepower needed to crunch this kind of data becomes daunting. From a technical perspective, the answer can only be gleaned from several queries with programmatic steps in between, which leads to slower performance. The larger the database of people, products, and transactions, the slower the performance will be -- until the response is so slow that clients stop using the system and you lose customers. Granted, the loss of customers will improve the performance of the database, but that isn't an optimal solution.
Up until recently, the challenges of almost any data-related problem could be solved by bigger, faster database systems. What has changed is that in highly connected systems with hundreds of millions of records or more, even the most powerful relational databases are being taxed. This has caused a resurgent interest in techniques that are as old as or older than relational databases and that handle these interconnected questions much more easily. One of these is the graph database.
The Graph Database Option
Graph databases have been around as long as relational databases, or perhaps longer depending on your definition, but they fell out of focus with the rise of relational databases early in the computer industry. Let me emphasize that I do not advocate or posit that graph databases are better or should supplant relational databases, but I would strongly advise that for some of the computational problems businesses face today, the graph architecture be evaluated as a viable alternative. Could you build an accounting system on a graph database? Yes. Should you? No, unless you just feel like trying it. Conversely, if you have a highly interconnected information set, you might be better served by a graph database than by brute-forcing the same functionality in a conventional SQL system. If you would like to spend 11 minutes on a nice, casual overview of what you can do with a graph database, you can check this out.
To borrow language from the investment industry, I think one of the purest plays in the graph database space is Neo4j. Neo4j is a pure graph database that is open source and has matured to include all the stability, security, and reliability features that enterprise IT demands, while still preserving the elegant simplicity of graph databases.
What is most compelling to me about graph databases is that they are whiteboard-native. As you can see in the video above, or even in the figures in the recommendation examples, graph databases look like what you draw on a whiteboard. Having grown up a disciple of SQL and traditional databases, I had my share of almost religious arguments over database normalization, table structure, and even column naming. Graph databases set much of that aside and let you tackle some of today's most interesting problems through a new intellectual lens that I think will enliven many.