By Demian Hess, Director of Digital Asset Management & Publishing Systems, Avalon Consulting LLC
When you think of a database, you probably think of tables. This paradigm is so ingrained in our thinking that it is easy to forget that there were many competing models for databases in the early days of computing. It was not until Edgar F. Codd published a series of papers in 1969 and 1970 that tables began to gain traction as the best way to organize information in a database. Codd used the table model to create a relational calculus that separated the way information was queried from the way it was physically stored.
This led to the codification of Structured Query Language (SQL) and the eventual dominance of relational databases like Oracle, SQL Server, Postgres and MySQL. A similarly profound change in database design is occurring today. Rather than use tables, data scientists are increasingly turning toward graphs to organize information. If you have ever created a mind map as a brainstorming exercise, then you are familiar with the fundamental concepts of the graph data model: information is represented as a web of nodes connected by lines.
Graphs are ideal for representing interconnected data, whether social networks or the semantic relationships between abstract concepts.
Native graph databases like Neo4j and OrientDB exploit the efficiency of graph-based analysis and querying, while semantic databases (also known as “triple stores”) such as MarkLogic, Ontotext, and AllegroGraph use a graph-based information model to support drawing logical inferences from relationships.
Many large enterprises, including LinkedIn, Twitter, Facebook and Google, are now using graph databases and triple stores to manage information and perform analytics.
These types of databases are gaining adherents because they offer solutions to three problems that are not well served by traditional relational databases: changing business requirements, the importance of relationships, and the need to share data across different domains.
Changing business requirements
Relational databases require a fixed schema that defines what tables will be used and what columns will appear in those tables. This means that you need to know exactly what data you are going to collect when you are designing a relational database. For a system holding contact information, for example, you need to decide ahead of time whether people will be allowed to have just one or multiple phone numbers. If people will only have a single phone number, then you would design your database to have a single column in an existing table to hold that data point.
However, if people can have multiple numbers, then you need to create a separate table just for that purpose. Fixed data schemas create challenges for all businesses.
This is certainly true in M&E, where the continual emergence of new formats, distribution channels, and business models regularly imposes new data requirements. No database designer before 2006 ever thought to include tables for Twitter accounts and hashtags. Databases being designed today must be able to accommodate the data requirements for the next killer app that redefines the way businesses engage with their customers.
Graphs free enterprises from the constraints of fixed schemas because there are no tables to define. In a graph, new types of nodes and relationships can be added at any time as the need is discovered.
The addition of new types of data will, of course, require changes to application-layer code, particularly updates to queries and reports. Fortunately, application changes can usually be completed in days or weeks, whereas database changes and data migration projects can take months or even years.
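To make the idea concrete, here is a minimal sketch of a schema-free graph held in memory. It is not the API of any particular graph database; the `Graph` class, the node types, and the relationship names are all hypothetical and chosen only to mirror the phone number and Twitter examples above.

```python
from collections import defaultdict


class Graph:
    """A tiny in-memory graph, for illustration only (not a real database API)."""

    def __init__(self):
        # node id -> dictionary holding a "type" plus any other properties
        self.nodes = {}
        # (source id, relationship name) -> set of target ids
        self.edges = defaultdict(set)

    def add_node(self, node_id, node_type, **properties):
        self.nodes[node_id] = {"type": node_type, **properties}

    def add_edge(self, source, relationship, target):
        self.edges[(source, relationship)].add(target)

    def targets(self, source, relationship):
        """Return the ids reachable from `source` over `relationship`, sorted."""
        return sorted(self.edges.get((source, relationship), set()))


g = Graph()

# Initial requirement: a person with a single phone number.
g.add_node("alice", "Person", name="Alice")
g.add_node("phone1", "Phone", number="555-0100")
g.add_edge("alice", "HAS_PHONE", "phone1")

# Later requirement: a second phone number. No new table is needed,
# just another node and another edge.
g.add_node("phone2", "Phone", number="555-0101")
g.add_edge("alice", "HAS_PHONE", "phone2")

# Later still: a node type nobody planned for at design time.
g.add_node("tw1", "TwitterAccount", handle="@alice_example")
g.add_edge("alice", "HAS_ACCOUNT", "tw1")

print(g.targets("alice", "HAS_PHONE"))
print(g.targets("alice", "HAS_ACCOUNT"))
```

Nothing in the stored data had to be migrated when the second phone number or the Twitter account appeared. The code that reads the graph, however, does need to learn about the new `HAS_ACCOUNT` relationship, which is the application-layer cost described above.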
Importance of relationships
The emergence of social and professional networks has placed a premium on understanding the relationships between people and the things that interest them. Media companies have also discovered that complex relationships exist within their own metadata, particularly in regard to interconnections between people, titles, markets, contracts, and distribution channels.
The ability to track and act on these relationships is becoming a critical factor in how well media and entertainment companies can create new products, attract viewers, and arrange new deals. Unfortunately, relational databases do not make it easy to analyze relationships because the table-based model is primarily focused on rows of data and not the connections between them.
I discovered this firsthand when I worked as a developer for a publisher that was creating a product for government lobbyists.
The product needed to capture as much information as possible about current and former government staff in order to identify key players and ways to gain access to decision makers. Storing all this varied information in a relational database was extremely challenging since it did not fit into a fixed schema. It was even harder to query the data for meaningful connections.
Finding friends-of-friends and other indirect relationships required SQL that was extremely difficult to write and maintain. These types of queries also performed poorly as the size of the dataset increased from thousands to millions of records.
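For a sense of what such a query looks like, here is a small sketch using Python's built-in `sqlite3` module. The `person` and `knows` tables and the sample rows are hypothetical, not the schema from the project described above. Even this simplest case, a two-hop search, needs two self-joins and a subquery to exclude direct friends, and every additional hop adds another join.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE knows (person_id INTEGER NOT NULL, friend_id INTEGER NOT NULL);

    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cara'), (4, 'Dev');
    -- Ann knows Bob and Dev; Bob knows Cara and Dev.
    INSERT INTO knows VALUES (1, 2), (1, 4), (2, 3), (2, 4);
    """
)

# People exactly two hops away from :start, excluding :start and
# anyone :start already knows directly.
FRIENDS_OF_FRIENDS_SQL = """
SELECT DISTINCT p.name
FROM knows AS k1
JOIN knows AS k2 ON k2.person_id = k1.friend_id
JOIN person AS p ON p.id = k2.friend_id
WHERE k1.person_id = :start
  AND k2.friend_id <> :start
  AND k2.friend_id NOT IN (
      SELECT friend_id FROM knows WHERE person_id = :start
  )
ORDER BY p.name
"""

friends_of_friends = [
    row[0] for row in conn.execute(FRIENDS_OF_FRIENDS_SQL, {"start": 1})
]
print(friends_of_friends)
```

With this sample data, the only friend-of-a-friend for Ann is Cara, since Dev is already a direct friend.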
By contrast, graphs make it easy to query for relationships, since that is a fundamental part of the data model. Here at Avalon, I was part of a project that exemplifies how companies can use interconnected metadata. Avalon’s client, a large media company, had decades of video from popular shows and wanted to create a compelling user experience that went beyond a top ten list of most watched scenes.
The project team realized that users were interested in the relationships that existed within the videos and that arose from user interaction with the content. Which characters were parts of which scenes and who did they appear with? How often did a topic occur and what was the context?
What other scenes would users be interested in given their past viewing history and social network? We were able to answer these questions by utilizing a semantic triple store, which allowed us to query the relationships between videos, users, events and characters.
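The kind of question involved can be sketched with a handful of subject-predicate-object triples and a simple pattern matcher. This is only an illustration of the triple model, not the triple store or the data used on the project; the scene, character, and predicate names are hypothetical.

```python
# Each fact is a (subject, predicate, object) triple. All names are made up.
triples = {
    ("scene1", "features", "characterA"),
    ("scene1", "features", "characterB"),
    ("scene2", "features", "characterA"),
    ("scene2", "features", "characterC"),
    ("user1", "watched", "scene1"),
}


def match(s=None, p=None, o=None):
    """Return every triple that fits the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def appears_with(character):
    """Which other characters share at least one scene with `character`?"""
    scenes = {s for s, _, _ in match(p="features", o=character)}
    return sorted(
        {
            o
            for scene in scenes
            for _, _, o in match(s=scene, p="features")
            if o != character
        }
    )


print(appears_with("characterA"))
```

The question "who did they appear with?" becomes two pattern matches chained together: find the scenes featuring the character, then find everyone else featured in those scenes. A real triple store would express the same pattern in a query language such as SPARQL.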
Sharing information across domains
When I was in graduate school studying information systems, my professors referred to relational databases as “mini-worlds.” The implication was two-fold. First, each relational database was only an approximation of the real world. Second, each database was a world unto itself, with its own set of unique identifiers, column names, and table structures.
It is not possible to write an SQL query that arbitrarily combines data from multiple databases. If you need to analyze information from different systems, you must export the data, clean the datasets to conform to a new schema, and import them into a data warehouse.
Graphs help break down the information silos created by relational databases. Since information can be added to graphs easily, the cost of merging information from multiple sources is greatly reduced. Graph databases that follow Semantic Web standards (in other words, triple stores) can also be queried without consolidating all the data into a single system. With the Semantic Web, data can exist as a distributed “super graph” across multiple triple stores. In order to query this information, each database simply needs to be connected to the internet and support a semantic query language called SPARQL.
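The following sketch shows why shared identifiers make this possible. It does not use SPARQL or a network; the `TripleStore` class, the `federated_match` function, the `example.org` identifier, and the predicates are all hypothetical stand-ins for two independent systems that happen to describe the same title.

```python
class TripleStore:
    """A stand-in for one independent database (illustrative only)."""

    def __init__(self, triples):
        self._triples = set(triples)

    def match(self, s=None, p=None, o=None):
        """Return triples that fit the pattern; None acts as a wildcard."""
        return [
            t
            for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]


def federated_match(stores, s=None, p=None, o=None):
    """Send the same pattern to every store and combine the answers."""
    results = set()
    for store in stores:
        results.update(store.match(s, p, o))
    return sorted(results)


TITLE = "http://example.org/title/42"  # hypothetical shared identifier

# Two separate systems that were never merged into one schema.
catalog_db = TripleStore([(TITLE, "hasTitle", "Example Show")])
rights_db = TripleStore([(TITLE, "licensedIn", "UK")])

facts = federated_match([catalog_db, rights_db], s=TITLE)
for fact in facts:
    print(fact)
```

Because both stores refer to the title by the same identifier, their facts line up without an export, clean, and import cycle. In a Semantic Web deployment, the query engine plays the role of `federated_match` by sending the query to each remote endpoint.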
Enterprises have started to realize the value of these distributed information graphs. For example, Avalon Consulting recently worked with a major studio that extracted metadata from multiple relational databases, converted it to a graph-based information model, and exposed the information via SPARQL so that it could be queried across the enterprise.
Moving from grid to graph
Graph databases are not the best solution for all problems. Metadata that are highly transactional, such as for orders, invoices, and other point-in-time events, tend to fit inside tables and are best stored in relational databases.
Metadata that are highly varied and interconnected, however, are good candidates for a wide range of graph-oriented technologies. For example, if your company is interested in batch processing large volumes of interconnected metadata, then an open source, big data platform like Giraph or Pegasus might be the best choice. If your enterprise needs to draw inferences in real time from a distributed graph of metadata, then a semantic triple store such as MarkLogic, AllegroGraph, or Ontotext might be a good option.
If semantic web standards impose too much abstraction and learning overhead, then proprietary and open source graph databases like Neo4j or OrientDB deserve consideration.
Whichever graph database you choose, you will be in good company. Google’s trend analysis tool shows that the term “graph database” has grown in popularity every year since 2008 and has not yet reached its peak. Underpinning this interest is the fact that graph databases are providing real value. As the analysts at Gartner reported in September 2014, “graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations.”
If you have not previously considered using graph and semantic databases, it is definitely time to take a look.
Demian Hess has worked in digital publishing since 2000, having held positions at Reed Elsevier, SAGE Publications, Inc., and the National Library of Medicine’s PubMed Central. After studying American Civilization and Computer Science at Brown University, he went on to complete a Master’s in English at Oregon State University, as well as a Master’s in Information Systems at Drexel University.