Today we added a plug-in called “FastImport” to version 2.1 of the GraphDB community edition. It is a bulk-import plug-in that takes a proprietary XML format as input and imports vertices and edges into a running GraphDB instance.
To use the new plug-in, you need to know that an import is a two-stage process:
schema setup
fast import
First you define which vertex and edge types will be imported in step 2. You normally do this in GraphQL by specifying one or more vertex types. For demonstration purposes we take a small social network with only one vertex type:
CREATE VERTEX TYPE User ATTRIBUTES (String Name, Int64 Age, Set<User> Friends)
With the schema set up, the only thing left is to call the import plug-in with another short GraphQL query:
IMPORT FROM 'file:\\100k_import.xml' FORMAT FastImport
This will, for example, take the 100,000-user dataset and import it into the current GraphDB instance. We have already done that for you, so here are the results comparing a GraphQL import with FastImport, on both the persistent and the in-memory version of GraphDB:
You can also download the datasets used in this small benchmark here:
DBpedia data is provided in several RDF triple files. Each line in each file is a “complete” piece of information consisting of subject, predicate and object, e.g.
mappingbased_properties_en.nt: (some line) <http://dbpedia.org/resource/12_Monkeys> <http://dbpedia.org/ontology/editing> <http://dbpedia.org/resource/Mick_Audsley> .
This stands for: “12 Monkeys” has an “editor” “Mick Audsley”.
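A minimal Python sketch of how such a line splits into its three parts. The function name parse_triple is ours, not taken from the project, and the regular expression only covers the simple case where all three parts are URIs; literal objects (quoted text with a language tag) would need extra handling.

```python
import re

# Matches "<subject> <predicate> <object> ." where all three parts are URIs.
# Assumption: literal objects such as "text"@en are out of scope for this sketch.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def parse_triple(line):
    """Return (subject, predicate, object), or None if the line does not match."""
    match = TRIPLE.match(line.strip())
    return match.groups() if match else None


line = ('<http://dbpedia.org/resource/12_Monkeys> '
        '<http://dbpedia.org/ontology/editing> '
        '<http://dbpedia.org/resource/Mick_Audsley> .')

subject, predicate, obj = parse_triple(line)
print(subject)
print(predicate)
print(obj)
```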
In other files there is additional information available, e.g. that
“12 Monkeys” is a film
“Mick Audsley” is a person
… probably more information about “12 Monkeys” and “Mick Audsley”
What we want to do in sones GraphDB is to create a VERTEX for the film “12 Monkeys”. This includes
type information – 12 Monkeys is a film
a set of properties – e.g. its budget
EDGES to related information, e.g. the editor Mick Audsley.
This gives a single point of information (the VERTEX “12 Monkeys”) that holds all properties and relations in a single instance. To import the VERTEX “12 Monkeys”, we had to write a parser that runs over all available triple files and collects all related information from the DBpedia data set.
At this point we had two options for implementing this parser. The first was to read all triple files in a dedicated order to ensure data validity (we need to know that “12 Monkeys” is a movie to be able to assign the predicate “editor” unambiguously). The second was to add an intermediate step that collects all data in a temporary file without validation and to do the import afterwards.
We decided on the intermediate step because it allows some synchronization while reading the triple files and avoids creating invalid data, since the exported data can be cross-checked easily.
This step is represented by the project “2_ParseAndConvertTripleDataFiles” in the solution GraphDBPedia, available at http://github.com/sones/sones-dbpedia. The parser reads only a subset of the offered data files, to show the functionality and focus on the added value.
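The core of such an intermediate step can be sketched in a few lines of Python: collect every triple under its subject, without validating anything yet. This is only an illustration of the idea, not the code of the actual project; the function name collect_by_subject and the in-memory dictionary (instead of a temporary file) are our simplifications.

```python
from collections import defaultdict


def collect_by_subject(triples):
    """Group (subject, predicate, object) tuples into one record per subject.

    No validation happens here; the records can be cross-checked afterwards.
    """
    records = defaultdict(list)
    for subject, predicate, obj in triples:
        records[subject].append((predicate, obj))
    return dict(records)


# Triples may arrive in any order and from different files.
triples = [
    ("http://dbpedia.org/resource/Apollo_8",
     "http://dbpedia.org/ontology/crewSize", "3"),
    ("http://dbpedia.org/resource/12_Monkeys",
     "http://dbpedia.org/ontology/editing",
     "http://dbpedia.org/resource/Mick_Audsley"),
    ("http://dbpedia.org/resource/Apollo_8",
     "http://dbpedia.org/ontology/booster",
     "http://dbpedia.org/resource/Saturn_V"),
]

for subject, properties in collect_by_subject(triples).items():
    print(subject)
    for predicate, obj in properties:
        print("   ", predicate, "=", obj)
```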
The result of the export for “Apollo 8” looks like this:
1 VertexID=-9223372036854775808
2 http://dbpedia.org/resource/Apollo_8=http://dbpedia.org/ontology/SpaceMission
3 LongAbstract_de=viel text
4 LongAbstract_en=a lot of text
5 http://dbpedia.org/ontology/commandModule_en=CM-103
6 http://dbpedia.org/ontology/missionDuration_en=529242.0
7 http://dbpedia.org/ontology/lunarOrbitTime_en=72613.0
8 http://dbpedia.org/ontology/crewSize_en=3
9 http://dbpedia.org/ontology/lunarModule_en=Ballast: Lunar Test Article B
10 http://dbpedia.org/ontology/serviceModule_en=SM-103
11 http://dbpedia.org/ontology/nextMission_en=http://dbpedia.org/resource/Apollo-9-patch.png
12 http://dbpedia.org/ontology/booster_en=http://dbpedia.org/resource/Saturn_V
13 http://dbpedia.org/ontology/previousMissions_en= http://dbpedia.org/resource/AP7lucky7.png
14 http://dbpedia.org/ontology/launchPad_en=http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39
15 ShortAbstract_en=some text
16 Name_en=http://dbpedia.org/resource/Apollo_8
17 http://dbpedia.org/ontology/SpaceMission/lunarOrbitTime_en=20.170277777777777
18 http://dbpedia.org/ontology/SpaceMission/missionDuration_en=6.125486111111111
Apart from one property, all data has been exported from the triple files. While importing the ontology information (line 2 in this example), we also created a VertexID that is unique within the corresponding VERTEXTYPE. This allows unique and fast linking during the later data import by referring to this ID.
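The VertexID in the example equals the smallest 64-bit signed integer. One way to produce IDs like that is a counter per vertex type that starts at this minimum. The following Python sketch shows the idea; the sequential numbering and the class name VertexIdAllocator are our assumptions, since the text only states that the ID is unique per VERTEXTYPE.

```python
import itertools

INT64_MIN = -2**63  # -9223372036854775808, the value seen in the export above


class VertexIdAllocator:
    """Hands out one stable ID per (vertex type, resource) pair.

    Assumption: IDs are counted upwards per vertex type, starting at INT64_MIN.
    """

    def __init__(self):
        self._counters = {}  # vertex type -> running counter
        self._ids = {}       # (vertex type, resource) -> assigned ID

    def id_for(self, vertex_type, resource):
        key = (vertex_type, resource)
        if key not in self._ids:
            if vertex_type not in self._counters:
                self._counters[vertex_type] = itertools.count(INT64_MIN)
            self._ids[key] = next(self._counters[vertex_type])
        return self._ids[key]


allocator = VertexIdAllocator()
print(allocator.id_for("SpaceMission", "http://dbpedia.org/resource/Apollo_8"))
```

Because the same resource always gets the same ID back, the linking step can refer to a target vertex by ID instead of by a property value.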
After this intermediate step, the real import can be done. sones GraphDB offers GraphQL (GQL) as a simple and intuitive language. Based on the data structure prepared above, an import with GQL takes two steps: first create all VERTICES including all their properties, then do the linking between the VERTICES.
For the example above, two statements would therefore be created:
INSERT INTO httpwwwdbpediaorgontologySpaceMission VALUES (VertexID=-9223372036854775808, LongAbstract_de='viel text', LongAbstract_en='a lot of text', Name_en='http://dbpedia.org/resource/Apollo_8', httpdbpediaorgontologycommandModule_en='CM-103', httpdbpediaorgontologymissionDuration_en=529242.0, httpdbpediaorgontologylunarOrbitTime_en=72613.0, httpdbpediaorgontologycrewSize_en=3, httpdbpediaorgontologylunarModule_en='Ballast: Lunar Test Article B', httpdbpediaorgontologyserviceModule_en='SM-103', ShortAbstract_en='some text', httpdbpediaorgontologySpaceMissionlunarOrbitTime_en=20.170277777777777, httpdbpediaorgontologySpaceMissionmissionDuration_en=6.125486111111111)
UPDATE httpwwwdbpediaorgontologySpaceMission SET (httpdbpediaorgontologynextMission_en=SETOF(Name_en='http://dbpedia.org/resource/Apollo-9'), httpdbpediaorgontologybooster_en=SETOF(Name_en='http://dbpedia.org/resource/Saturn_V'), httpdbpediaorgontologypreviousMission_en=SETOF(Name_en='http://dbpedia.org/resource/AP7'), httpdbpediaorgontologylaunchPad_en=SETOF(Name_en='http://dbpedia.org/resource/Kennedy_Space_Center_Launch_Complex_39')) WHERE VertexID=-9223372036854775808
The problem with this approach is that EDGES are set via a condition (here on Name_en) that may not be unique, or that refers to an attribute that is not set at all on the target VERTEX. One way to solve this is to look up the ID of the target vertex and do the linking via that ID.
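Generating the two statements from a collected record is mostly string assembly. The Python sketch below mirrors the layout of the statements above; the helper names are hypothetical, the keyword spacing is an assumption, and embedded quotes in values are not escaped, so treat it as an illustration rather than the project's generator.

```python
def gql_value(value):
    """Render numbers as-is and everything else as a quoted string.

    Assumption: no escaping of embedded quotes is attempted in this sketch.
    """
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value) + "'"


def insert_statement(vertex_type, properties):
    """Step 1: create the vertex with all of its plain properties."""
    assignments = ", ".join(
        f"{name}={gql_value(value)}" for name, value in properties.items())
    return f"INSERT INTO {vertex_type} VALUES ({assignments})"


def update_statement(vertex_type, vertex_id, edges):
    """Step 2: link the vertex to its targets, selected by their Name_en."""
    assignments = ", ".join(
        f"{name}=SETOF(Name_en={gql_value(target)})"
        for name, target in edges.items())
    return f"UPDATE {vertex_type} SET ({assignments}) WHERE VertexID={vertex_id}"


vertex_id = -9223372036854775808
print(insert_statement("SpaceMission", {
    "VertexID": vertex_id,
    "crewSize_en": 3,
    "serviceModule_en": "SM-103",
}))
print(update_statement("SpaceMission", vertex_id, {
    "booster_en": "http://dbpedia.org/resource/Saturn_V",
}))
```

Selecting the target by Name_en inherits exactly the uniqueness problem described above; switching the SETOF condition to the target's VertexID would be the variant that avoids it.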
sones GraphDB also offers another option for importing: XmlBulkImport. It is faster than GraphQL (because it uses the Graph filesystem interfaces) and it organizes the INSERTING and LINKING of data itself. Instead of generating GraphQL, a proprietary XML structure has to be created, and the import is done via a single IMPORTGQL statement.
A description of this format and its usage can be found at: http://developers.sones.de/wiki/doku.php?id=importexport:xmlbulkimport
This XmlBulkImport data file is created by the project “3_ParseAndConvertTripleDataFiles” in the solution GraphDBPedia, available at http://github.com/sones/sones-dbpedia.
The first step was to transfer the ontology, provided in Web Ontology Language (OWL) format, into GraphDB VERTEXTYPES and EDGES. For this we implemented a parser that reads the OWL file, converts it into a class model and can export that model as a GQL CREATEVERTEXTYPES statement.
The ontology currently contains 273 classes (DBpedia 3.6) and thousands of datatype properties and object properties. A short demonstration of its main structures can be found here:
represents the highest “PopulatedPlace” on an island.
The conversion creates a
VERTEXTYPE – one for each class,
having multiple PROPERTIES – from datatype properties
and multiple EDGES – from object-properties
Within the data schema there is a large number of mutual dependencies. The CREATEVERTEXTYPES statement resolves all of them and creates a valid data schema.
In addition to the ontology from the OWL file, we added some vertex types to fix problems we ran into and to extend the functionality a little:
First, the VERTEXTYPE Thing was not described in the ontology. It is the base class that all other VERTEXTYPES are based upon.
To reflect disambiguation, we created a VERTEXTYPE Instance with an EDGE to a SET of Thing. Where there is a disambiguation, an Instance refers to the corresponding nodes in the GraphDB.
Within the RDF files, labels are stored in dedicated triples. We added a dedicated VERTEXTYPE for them as well, to avoid a mix-up in case one label refers to multiple Instances.
Currently, GraphDB has some limitations regarding the characters allowed in the names of VERTEXTYPES, their ATTRIBUTES and EDGES. The OWL and RDF formats generally use URLs as identifiers, but GraphDB cannot work with colons, dots and slashes (both slash and backslash) in names. Our simple workaround was to keep the URL and remove all occurrences of these characters. This leads us from http://dbpedia.org/ontology/Island to httpdbpediaorgontologyIsland.
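In Python, the workaround amounts to a one-line filter. The function name to_graphdb_name is ours; the set of removed characters is the one listed above.

```python
# Characters GraphDB cannot handle in type, attribute and edge names.
FORBIDDEN = ":./\\"


def to_graphdb_name(url):
    """Drop colons, dots, slashes and backslashes; keep everything else."""
    return "".join(ch for ch in url if ch not in FORBIDDEN)


print(to_graphdb_name("http://dbpedia.org/ontology/Island"))
```

One thing to keep in mind with this kind of mapping: it is not reversible, and two different URLs can in principle collapse into the same name.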
Another challenge is the type mapping between OWL and GraphDB. GraphDB supports the simple C# data types, whereas the DBpedia OWL uses a list of 9 datatypes from XML Schema plus DBpedia area units, speed units, density units, time units, volume units, distance units and several others. This led us to a huge switch statement that does the mapping. All properties could be represented with C# data types without data loss.
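The shape of such a mapping can be sketched as a lookup table with a fallback. Everything specific in the Python sketch below is an assumption made for illustration: the selection of XML Schema types, the C# type chosen for each, the handling of all DBpedia unit types as Double, and the String fallback. The project's real switch may differ in every one of these points.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
DBPEDIA_UNITS = "http://dbpedia.org/datatype/"  # assumed namespace of unit types

# Assumed mapping from XML Schema datatypes to C# type names.
TYPE_MAP = {
    XSD + "string": "String",
    XSD + "integer": "Int64",
    XSD + "double": "Double",
    XSD + "float": "Double",
    XSD + "boolean": "Boolean",
    XSD + "date": "DateTime",
}


def map_datatype(range_uri):
    """Return the C# type name to use for an OWL datatype property range."""
    if range_uri in TYPE_MAP:
        return TYPE_MAP[range_uri]
    if range_uri.startswith(DBPEDIA_UNITS):
        return "Double"  # assumption: unit values are plain numbers
    return "String"      # assumption: unknown types are kept as text


print(map_datatype(XSD + "integer"))
print(map_datatype(DBPEDIA_UNITS + "squareKilometre"))
```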
Wikipedia is available in multiple languages. The DBpedia export is currently provided in 99 of them.
Some time later (during the next steps) we found out that the data sometimes differs a little between languages, since there are different authors. This is relevant for the data schema, because there are several options for handling it.
One option is to let the application logic of the data importer decide how to handle this. We decided instead to make the data schema language-specific and provide separate, language-specific attributes. This enlarges the data schema a little, but does not lead to any data loss. Additionally, some application logic can be implemented later on to check the data quality of each node.
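Making the schema language-specific boils down to multiplying every attribute by the list of language suffixes. A small Python sketch, with a function name of our own choosing and the suffix style (“_en”, “_de”) seen in the attribute names elsewhere in this series:

```python
def language_attributes(base_names, suffixes):
    """Create one attribute name per (attribute, language) combination."""
    return [name + suffix for name in base_names for suffix in suffixes]


print(language_attributes(["LongAbstract", "ShortAbstract"], ["_en", "_de"]))
```

With two attributes and two languages this yields four attribute names, which shows how the schema grows linearly with the number of languages while each language keeps its own values.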
The command-line tool “1_CreateGqlSchemaFromOntology”, part of the Visual Studio solution available on GitHub (https://github.com/sones/sones-dbpedia), creates the CREATEVERTEXTYPES statements described above, based on the ontology of DBpedia 3.6. Later versions have not been tested yet.
The command-line executable has to be started with 2 parameters:
.owl filename (either an absolute path or a file located in the executable’s directory)
result .gql file – name of the file that all queries will be written to
At runtime, the user is asked for all languages that should be reflected in the schema. Our suggestion is to use two-letter language codes like “_en” or “_de”. An empty string ends the prompt loop.
After execution, the resulting .gql file can easily be imported via an IMPORTGQL statement.
DBpedia is already stored in a machine-readable format (RDF). We started a proof of concept to show that GraphDB can meet these requirements too, and to find out the differences, advantages and disadvantages of the two concepts.
In RDF, the data model stands next to the data. Within sones GraphDB there is a close connection between each object (node) and its (vertex) type. For example, the node “Homer Simpson” knows that he is a “FictionalCharacter”.
Our expectation was that GraphDB requires less hard-disk space and also offers a better data store, since all information about an object is saved in a single node instead of several triple data files. Besides, any relationship between two objects (e.g. a person and their birthplace) is saved directly on that object. When a node is loaded, all its information is available from a single location.
During the project we discovered several problems that can be solved with this idea. The resulting data network lets customers find complex relationships between any nodes using graph algorithms. Disambiguation of words is possible using the schema information (e.g. Tuareg can be either nomads living in the Sahara or a vehicle built by a German car vendor).
We first came into contact with DBpedia back in May 2010. A prospect asked us whether or not GraphDB is the best way to reflect the data schema and import all data. After getting a first impression from the DBpedia website:
from www.dbpedia.org/About:
“DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopaedia itself.”
We decided: Yes, it is!
from www.dbpedia.org/Datasets:
DBpedia uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. We use the SPARQL query language to query this data. Please refer to the Developers Guide to Semantic Web Toolkits to find a development toolkit in your preferred programming language to process DBpedia data.
The DBpedia knowledge base currently describes more than 3.64 million things, out of which 1.83 million are classified in a consistent Ontology, including 416,000 persons, 526,000 places (including 360,000 populated places), 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations (including 40,000 companies and 38,000 educational institutions), 183,000 species and 5,400 diseases.
At that time we did not have much experience with the Semantic Web, so there was probably some work to do.
The following blog articles will describe our work and refer to the source-code available under www.github.com/sones/sones-dbpedia
In the course of our work on sones GraphDB 2.1 we refactored our index interfaces to make them more suitable for our needs. Furthermore, we wanted to make it easier for the community to implement custom index structures for their special needs. For the latter reason we set up a tutorial and published a sample implementation on GitHub.
The interfaces are explained in our developer wiki, where the tutorial can also be found. The source code for the tutorial is located on GitHub.
We are always thinking about new ways to integrate GraphDB into existing environments. One kind of environment our users work with are the various Enterprise Service Buses available today.
One big player in the ESB environment is the Mule Open Source ESB:
“Mule is a lightweight enterprise service bus (ESB) and integration framework. It can handle services and applications using disparate transport and messaging technologies. The platform is Java-based, but can broker interactions between other platforms such as .NET using web services or sockets.
The architecture is a scalable, highly-distributable object broker that can seamlessly handle interactions across legacy systems, in-house applications and almost all modern transports and protocols.”
In order to show how GraphDB integrates into such typical ESB environments, we created a small example.
The architecture of this example looks like this:
The idea is that an example Message-WebApp posts a message to the Mule ESB; the message then gets transformed and is finally consumed by a sones RESTful web service hosted by a GraphDB.
With version 2.0 of sones GraphDB we introduced an enhanced graph model we call Property Hypergraph.
In this Property Hypergraph model there are some standard edge types:
single-edge: an edge between two vertices
multi-edge: an edge splitting up into single-edges of the same edge type pointing towards the same vertex type.
hyper-edge: an edge to a subgraph made up by all possible types of vertices
To create vertex types, the GraphQL command “CREATEVERTEXTYPE” has been available to users since version 1.0 of sones GraphDB. In version 2.0 we introduced edge types, but those were only usable through the API of a GraphDB instance.
In version 2.1 we have now added full edge type management, available both through GraphDB’s new edge-type API and through the new GraphQL extensions for edge-type handling.
For example, let’s say you want to create a vertex type “User” and insert a bunch of users, and you want those users to be connected by a specific edge type which comes with its own attributes, something like this:
The GraphQL queries to create the above schema would be these:
CREATE EDGE TYPE UserLink ATTRIBUTES (Double weight, String priority, LIST<String> tags) COMMENT = 'This is my edge type named UserLink.'
CREATE VERTEX TYPE User ATTRIBUTES (String name, SET<User(UserLink)> friends)
INSERT INTO User VALUES (name = 'UserA')
INSERT INTO User VALUES (name = 'UserB', friends = SETOF(name = 'UserA' : (weight = 15.5, priority = 'high', tags = LISTOF('best friend', 'mate'))))
As you can see, it’s easy to create edge types and add attributes to the relationships these edges represent. It gets even better: you can also use the inheritance mechanisms you already know from vertex types, as well as undefined (schemaless) attributes.
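To make the resulting structure a bit more tangible, here is the same small graph modelled as plain Python objects. This is only a conceptual picture of “an edge that carries its own attributes”, not the GraphDB API; the class layout is our own.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """A vertex of type User with its outgoing 'friends' edges."""
    name: str
    friends: list = field(default_factory=list)


@dataclass
class UserLink:
    """An edge of type UserLink: a target vertex plus the edge's own attributes."""
    target: User
    weight: float
    priority: str
    tags: list = field(default_factory=list)


user_a = User("UserA")
user_b = User("UserB")
# Same content as the second INSERT statement: UserB -> UserA with attributes.
user_b.friends.append(UserLink(user_a, 15.5, "high", ["best friend", "mate"]))

for link in user_b.friends:
    print(user_b.name, "->", link.target.name, link.weight, link.priority, link.tags)
```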
If you want to dive deeper into edge types, you won’t have to wait until the release of GraphDB 2.1 at the end of this year. You can just grab the current source code on GitHub and get started.
Additional documentation and examples are available in our documentation wiki. Here are some places you will find more information:
Not long ago we showed off the new capabilities of GraphDB 2.1 (to be released at the end of 2011) regarding the visualization of data. Now we have extended those capabilities and added another output plug-in to the Community Edition. It’s called GraphVis.
You can download it now with our source-code package from our GitHub repository.
The best way to show off the new functionality is by literally showing it:
A good start for the documentation of the new visualization options is our ever growing wiki.
As you might have noticed from the check-ins on the source code repository, we are well on the way towards version 2.1 of sones GraphDB. Besides the many new features and fixes, we just merged a feature branch into the main repository which contains the first steps of a simple but already powerful visualization.
For some time now we have been looking into ways to visualize the data stored in GraphDB, and since today’s web browsers come with HTML5 features, the idea was born to integrate a visualization into the existing integrated WebShell and the future web administration tool.
The WebShell is an integrated module of GraphDB which allows the user to access sones GraphDB by just logging in using a web browser.
With the ability to run queries and use plug-ins to determine what the output looks like, the WebShell is a perfect place to enhance the user experience. Since there are several output plug-ins available with version 2.0 already (JSON, XML, Text, HTML,…) we thought it would be a great idea to have a simple visualization implemented just by adding a new output plug-in to GraphDB.
And that’s what we did in the first step: we added an additional output plug-in called “barchart”. This first new output plug-in uses the great D3.js library to draw nice charts, plots and graphs in HTML5, and the GraphQL ALIAS feature to map the x and y axes of those visualizations.
How multi-level mappings of the x and y axes are handled is still a bit tricky, and we are working on that. But if your data is on the same level you can already output nice graphs like this one:
You get this just by switching to the barchart format using the FORMAT WebShell command and then changing a query like this:
Since the current state is only the beginning, we will add more visualization options and features for the release of GraphDB version 2.1 at the end of this year.