Uncovering document data in the graph data age

Editorial Type: Technology Focus Date: 03-2018 Tags: Document, Search, Strategy, Recognition, Neo4j
Graph technology leader Emil Eifrem, CEO of Neo4j, explains how all organisations can learn from the document and data management approach behind the world's biggest data scoop

Last year's Paradise Papers revelations and their 2016 predecessor, the Panama Papers, dragged into the spotlight the complex web of the 1%'s financial dealings. And at the heart of the Papers: data - a lot of it.

The exposure of the activities of clients of offshore law firms qualifies as the world's largest financial scandal, and at 2.6 terabytes and 11.5 million documents the Panama Papers was far larger than anything Snowden or WikiLeaks ever managed. The Paradise Papers was not far behind, at 1.4TB of data and 13.4 million documents.

Reporting for both the Panama and Paradise Papers was a global operation, with media partners and journalists from around the world collaborating on the published stories. To meet the challenge, the group behind the scoop, the International Consortium of Investigative Journalists (the ICIJ), took an open-source approach with a graph database at its centre. OCR was used to scan and digitally capture images, buttressed by 40 temporary servers that allowed hundreds of documents to be processed in parallel. The ICIJ also used ETL (Extract, Transform, and Load) software to transform source data from SQL to graph format, as well as a visualisation front end and an open-source communications platform based on Oxwall.
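To make that ETL step concrete, here is a minimal sketch, in Python, of transforming relational (SQL-style) rows into a property-graph node and edge list. The table names, fields, and values are hypothetical illustrations, not the ICIJ's actual schema or tooling:

```python
# Hypothetical relational tables: people, companies, and a join table
# linking them - the kind of structure an ETL job might read from SQL.
officers = [
    {"id": 1, "name": "A. Officer"},
    {"id": 2, "name": "B. Officer"},
]
entities = [
    {"id": 10, "name": "Offshore Holdings Ltd"},
]
officer_entity = [
    {"officer_id": 1, "entity_id": 10, "role": "director"},
    {"officer_id": 2, "entity_id": 10, "role": "shareholder"},
]

def to_graph(officers, entities, links):
    """Emit node and edge lists in a generic property-graph shape."""
    nodes = (
        [{"id": f"officer-{o['id']}", "label": "Officer", "name": o["name"]}
         for o in officers]
        + [{"id": f"entity-{e['id']}", "label": "Entity", "name": e["name"]}
           for e in entities]
    )
    # Each join-table row becomes a typed relationship between two nodes.
    edges = [
        {"from": f"officer-{l['officer_id']}",
         "to": f"entity-{l['entity_id']}",
         "type": l["role"].upper()}
        for l in links
    ]
    return nodes, edges

nodes, edges = to_graph(officers, entities, officer_entity)
```

The point of the transformation is that relationships stop being implicit join rows and become first-class, typed connections that can be traversed directly.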

The ICIJ found the graph database to be very quick for modelling data and easy for everybody to understand. That matters: even the least tech-savvy reporter could still explore the documents, while more data-conversant reporters and programmers could use the graph query language to run more complex queries (e.g. 'Show me everybody within two degrees of separation of this person'). Thanks to the technology, reporters never had to trawl through the enormous mass of data themselves - and the ICIJ were able to allow 400 people from more than 100 media organisations to work together on this project.
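That 'degrees of separation' query maps naturally onto a graph traversal. As an illustrative sketch (not the ICIJ's code), the same idea can be expressed as a breadth-first search over an adjacency list; in Neo4j's Cypher the equivalent pattern would be a variable-length match along the lines of `MATCH (p {name: 'Person A'})-[*1..2]-(other)`. All names in the toy graph below are invented:

```python
from collections import deque

# Toy connection graph (undirected, stored as an adjacency list).
connections = {
    "Person A": ["Shell Co 1", "Person B"],
    "Shell Co 1": ["Person A", "Law Firm"],
    "Person B": ["Person A"],
    "Law Firm": ["Shell Co 1", "Person C"],
    "Person C": ["Law Firm"],
}

def within_degrees(graph, start, max_depth=2):
    """Return everyone reachable from `start` within `max_depth` hops,
    mapped to their degree of separation (breadth-first search)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    seen.pop(start)  # exclude the starting person themselves
    return seen

print(within_degrees(connections, "Person A"))
# -> {'Shell Co 1': 1, 'Person B': 1, 'Law Firm': 2}
```

Note that "Person C" is excluded: it sits three hops from "Person A", outside the two-degree radius the query asked for.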

As a result, the ICIJ achieved some of the highest-impact journalism ever witnessed. What has changed over the last decade to make investigations and large-scale collaboration like this possible? Simple - technology. That is why the group has embraced data-based techniques as core to what it does. Faced with millions of documents outlining complex offshore deals, the ICIJ needed a way to uncover hidden networks and to collaborate on a global basis.

It is also important to note that the ICIJ used graph software to detect connections that would otherwise probably never have become visible. Instead of breaking up data artificially, the way a relational database does, graphs use a notational formalism more closely aligned with the way humans natively think about information. Once that data model is coded in a scalable architecture, a graph database is matchless at mining connections in huge and complex datasets.

In any context where large data and document sets need to be managed, graphs are increasingly the tool of choice. And in the data age, super-large connected datasets are becoming more of a factor, e.g. real-time online recommendations in retail, as well as AI-powered shop bots, fraud detection in financial services, enterprise network management systems, and in research for the investigation of diseases - and even by government for security and welfare.

Without a doubt, graphs supported by a visualisation front end can deliver the sort of successful distributed document management model, accessible to a non-technical audience, that the ICIJ use case exemplifies. Graphs are an important tool for the global DM community in our super-connected information age. Are you taking note?
More info: www.neo4j.com