This book is a work in progress. Feedback is very much encouraged and welcomed!

The title of this book could equally well be "A getting started guide for users of graph databases and the Gremlin query language featuring hints, tips and sample queries". It turns out that is a bit too long to fit on one line for a heading, but as a single sentence it describes the focus of this work pretty well.

I have resisted the urge to try and cover every single feature of TinkerPop one after the other in a reference manual fashion. Instead, what I have tried to do is capture the learning process that I myself have gone through using what I hope is a sensible flow from getting started to more advanced topics. To get the most from this book I recommend having the Gremlin console open with my sample data loaded as you follow along. I have not assumed that anyone reading this has any prior knowledge of Apache TinkerPop, the Gremlin query language or related tools. I will introduce everything you need to get started in Chapter 2.

I hope people find what follows useful. It definitely remains a work in progress and more will be added in the coming weeks and months as time permits. I am hopeful that what is presented so far is of some use to folks who, like me, are learning to use the Gremlin query and traversal language and related technologies.

A lot of additional material, including the book in many different formats such as PDF, HTML, ePub and MOBI as well as sample code and data, can be found at the project’s home on GitHub. You will find a summary of everything that is available in the "Introducing the book sources, sample programs and data" section.

1.1. How this book came to be

I forget exactly when, but over a year ago I started compiling a list of notes, hints and tips, initially for my own benefit, of things I had found poorly explained elsewhere while using graph databases and especially using Apache TinkerPop, JanusGraph and Gremlin. Over time that document grew (and continues to grow) and has effectively become a book. After some encouragement from colleagues I have decided to release it as a living book in an open source venue so that anyone who is interested can read it. It is aimed at programmers and anyone using the Gremlin query language to work with graphs. It includes lots of code examples, sample queries, discussion of best practices and lessons I learned the hard way.

I would like to say a very heartfelt thank you to all those who have encouraged me to keep going with this adventure! It has required quite a lot of work but also remains a lot of fun.

Kelvin R. Lawrence
First draft: October 5th, 2017
Current draft: April 18th, 2018

1.2. Providing feedback

Please let me know about any mistakes you find in this material and also please feel free to send me feedback of any sort. Suggested improvements are especially welcome. A good way to provide feedback is by opening an issue in the GitHub repository located at https://github.com/krlawrence/graph. You are currently reading revision 279-preview of the book.

I am grateful to those who have already taken the time to review the manuscript and open issues or pull requests.

1.3. Some words of thanks

I would like to thank my colleagues, Graham Wallis, Jason Plurad and Adam Holley for their help in refining and improving several of the queries contained in this book. Gremlin is definitely a bit of a team sport. We spent many fun hours discussing the best way to handle different types of queries and traversals!

I would also be remiss if I did not give a big shout out to all of the folks that spend a lot of time replying to questions and suggestions on the Gremlin Users Google Group. Special thanks should go to Daniel Kuppitz, Marko Rodriguez and Stephen Mallette, key members of the team that created and maintains Apache TinkerPop.

Lastly, I would like to thank everyone who has submitted feedback and ideas via e-mail as well as GitHub issues and pull requests. That is the best part about this being a living book: we can continue to improve and evolve it just as the technology it is about continues to evolve. Your help and support are very much appreciated.

1.4. What is this book about?

This book introduces the Apache TinkerPop 3 Gremlin graph query and traversal language via real examples featuring real world graph data. That data, along with sample code, example applications and many other items, is available for download from the GitHub project. The graph, air-routes.graphml, is a model of the world airline route network connecting 3,373 airports with 43,400 routes. The examples presented will work unmodified with the air-routes.graphml file loaded into the Gremlin console running with a TinkerGraph. How to set that environment up is covered in the "Downloading, installing and launching the console" section below.
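
To give a feel for what a GraphML file holds before loading one into the console, here is a minimal Python sketch that reads vertices and edges using only the standard library. The tiny document embedded in it is hand-written for illustration: its property keys (`code`, `dist`), ids and values are assumptions and are not copied from air-routes.graphml, which is far larger and carries more properties.

```python
import xml.etree.ElementTree as ET

# Hand-written GraphML fragment for illustration only. The keys, ids and
# values are hypothetical and are not taken from the real air-routes file.
SAMPLE_GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="code" for="node" attr.name="code" attr.type="string"/>
  <key id="dist" for="edge" attr.name="dist" attr.type="int"/>
  <graph id="G" edgedefault="directed">
    <node id="1"><data key="code">AUS</data></node>
    <node id="2"><data key="code">DFW</data></node>
    <node id="3"><data key="code">LHR</data></node>
    <edge id="10" source="1" target="2"><data key="dist">190</data></edge>
    <edge id="11" source="2" target="3"><data key="dist">4736</data></edge>
  </graph>
</graphml>
"""

# GraphML elements live in this XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def load_graphml(text):
    """Return (vertices, edges) parsed from a GraphML string.

    vertices maps a vertex id to a dict of its properties.
    edges is a list of (source id, target id, properties) tuples.
    Property values are kept as strings, exactly as they appear in the XML.
    """
    root = ET.fromstring(text)
    vertices = {}
    for node in root.iterfind("g:graph/g:node", NS):
        props = {d.get("key"): d.text for d in node.iterfind("g:data", NS)}
        vertices[node.get("id")] = props
    edges = []
    for edge in root.iterfind("g:graph/g:edge", NS):
        props = {d.get("key"): d.text for d in edge.iterfind("g:data", NS)}
        edges.append((edge.get("source"), edge.get("target"), props))
    return vertices, edges


vertices, edges = load_graphml(SAMPLE_GRAPHML)
print(len(vertices), "vertices and", len(edges), "edges")
for source, target, props in edges:
    print(vertices[source]["code"], "->", vertices[target]["code"], props)
```

Each `node` element becomes a vertex, each `edge` element becomes a directed edge, and the nested `data` elements supply the properties. That is the same vertex, edge and property split that the rest of the book works with.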

The examples in this book have now been updated and tested using Apache TinkerPop version 3.3 which introduced a few breaking changes. If you find any I missed please let me know! The examples have also been tested using the 3.3.1 release.

TinkerGraph is an in-memory graph, meaning nothing gets saved to disk automatically. It is shipped as part of the Apache TinkerPop 3 download. The goal of this tutorial is to allow someone with little to no prior knowledge to get up and going quickly using the Gremlin console and the air-routes graph. Later in the book I will discuss using additional technologies such as JanusGraph, Apache Cassandra, Gremlin Server and Elasticsearch to build scalable and persisted graph stores that can still be traversed using Gremlin queries. I will also discuss writing stand alone Java and Groovy applications as well as using the Gremlin Console. I even slipped a couple of Ruby examples in too!

In the first few sections of this book I have mainly focussed on showing the different types of query that you can issue using Gremlin. I have not tried to show all of the output that you will get back from entering these queries but have selectively shown examples of output. I go a lot deeper into things in chapters 4, 5 and 6.

How this book is organized
  • I start off by briefly doing a recap on why Graph databases are of interest to us and discuss some good use cases for graphs. I also provide pointers to the sample programs and other additional materials referenced by the book.

  • In Chapter two I introduce several of the components of Apache TinkerPop and also introduce the air-routes.graphml file, the graph on which the majority of the examples in this book are based.

  • In Chapter three things start to get a lot more interesting! I start discussing how to use the Gremlin graph traversal and query language to interrogate the air-routes graph. I begin by comparing how we could have built the air-routes graph using a more traditional relational database and then look at how SQL and Gremlin are both similar in some ways and very different in others. For the rest of the Chapter, I introduce several of the key Gremlin methods, or as they are often called, "steps". I mostly focus on reading the graph (not adding or deleting things) in this Chapter.

  • In Chapter four the focus moves beyond just reading the graph and I describe how to add vertices (nodes), edges and properties as well as how to delete and update them. I also discuss various best practices and start to explore some slightly more advanced topics.

  • In Chapter five I focus on using what has been covered in the prior Chapters to write queries that have a more real world feel. I present a lot more examples of the output from running queries in this Chapter. I also start to discuss topics such as analyzing distances, route distribution and writing geospatial queries.

  • In Chapter six I start to expand the focus to concepts beyond using the Gremlin Console and a TinkerGraph. I start by looking at how you can write stand alone Java and Groovy applications that can work with a graph. I then introduce JanusGraph and take a fairly detailed look at its capabilities such as support for transactions, schemas and indexes. I also explore various technology choices for back end persistent stores and indexes as well as introducing the Gremlin Server.

  • In Chapter seven I discuss some common graph serialization file formats and how to use them in the context of TinkerPop 3 enabled graphs.

  • I finish up by providing several links to useful web sites where you can find tools and documentation for many of the topics and technologies covered in this book.

1.5. Introducing the book sources, sample programs and data

All work related to this project is being done in the open at GitHub. A list of where to find the key components is provided below. The examples in this book make use of a sample graph called air-routes.graphml which contains a graph based on the world airline route network between over 3,370 airports. The sample graph data, quite a bit of sample code and some larger demo applications can all be found at the same GitHub location that hosts the book manuscript. You will also find releases of the book in various formats (HTML, PDF, DocBook/XML, MOBI and EPUB) at the same GitHub location. The sample programs include stand alone Java, Groovy and Ruby examples as well as many examples that can be run from the Gremlin Console. There are some differences between using Gremlin from a stand alone program and from the Gremlin Console. The sample programs demonstrate several of these differences. The sample applications area contains a full example HTML and JavaScript application that lets you explore the air-routes graph visually. The home page for the GitHub project includes a README.md file to help you navigate the site. Below are some links to various resources included with this book.

Where to find the book, samples and data
Project home
Book manuscript in Asciidoc format
Latest PDF and HTML snapshots
Official book releases in multiple formats
  • Official releases include Asciidoc, HTML, PDF, ePub, MOBI and DocBook versions as well as snapshots of all the samples and other materials in a single package. My goal is to have an official release about once a month, provided enough new material has been created to justify doing it. The ePub and MOBI versions are really intended to be read using e-reader devices and for that reason use a white background for all source code highlighting to make it easier to read on monochrome devices.

  • I recommend using the PDF version if possible as it has page numbering. If you prefer reading the book as if it were a web page then by all means use the HTML version. You will just not get any pagination or page numbers. The DocBook format can be read using tools such as Yelp on Linux systems but is primarily included so that people can use it to generate other formats that I do not already provide. There is currently an issue with the MOBI and ePub versions that causes links to have the wrong text. Other than that they should work, although you may need to change the font size you use on your device to make things easier to read.

  • https://github.com/krlawrence/graph/releases

Sample data (air-routes.graphml)
Sample code
Example applications
Change history

1.6. So what is a graph database and why should I care?

This book is mainly intended to be a tutorial in working with graph databases and related technology using the Gremlin query language. However, it is worth spending just a few moments to summarize why it is important to understand what a graph database is, what some good use cases for graphs are and why you should care in a world that is already full of all kinds of SQL and NoSQL databases. In this book we are going to be discussing directed property graphs. At the conceptual level these types of graphs are quite simple to understand. You have three basic building blocks: vertices (often referred to as nodes), edges and properties. Vertices represent "things" such as people or places. Edges represent connections between those vertices, and properties are information added to the vertices and edges as needed. The directed part of the name means that any edge has a direction. It goes out from one vertex and in to another. You will sometimes hear people use the word digraph as shorthand for directed graph. Consider the relationship "Kelvin knows Jack". This could be modeled as a vertex for each of the people and an edge for the relationship as follows.

Kelvin — knows → Jack

Note the arrow which implies the direction of the relationship. If we wanted to record the fact that Jack also admits to knowing Kelvin we would need to add a second edge from Jack to Kelvin. Properties could be added to each person to give more information about them. For example, my age might be a property on my vertex.

It turns out that Jack really likes cats. We might want to store that in our graph as well so we could create the relationship:

Jack — likes → Cats

Now that we have a bit more in our graph we could answer the question "who does Kelvin know that likes cats?"

Kelvin — knows → Jack — likes → Cats

This is a simple example but hopefully you can already see that we are modelling our data the way we think about it in the real world. Armed with this knowledge you now have all of the basic building blocks you need in order to start thinking about how you might model things you are familiar with as a graph.
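
The small graph above can be sketched in a few lines of plain Python to make the three building blocks concrete. This is only an illustration of the idea and not how TinkerPop stores a graph: the numeric ids, the `person` and `animal` labels and the helper names `out` and `known_by_who_like` are all made up for this example.

```python
from collections import namedtuple

# An edge goes out from one vertex and in to another, and has a label.
# It can also carry its own properties (left empty here).
Edge = namedtuple("Edge", ["out_v", "label", "in_v", "properties"])

# Vertices keyed by a made-up numeric id. Each one has a label and a
# dictionary of properties (simple key-value pairs).
vertices = {
    1: {"label": "person", "properties": {"name": "Kelvin"}},
    2: {"label": "person", "properties": {"name": "Jack"}},
    3: {"label": "animal", "properties": {"name": "Cats"}},
}

edges = [
    Edge(1, "knows", 2, {}),  # Kelvin - knows -> Jack
    Edge(2, "likes", 3, {}),  # Jack - likes -> Cats
]


def out(vertex_id, edge_label):
    """Follow outgoing edges with the given label and return the target ids."""
    return [e.in_v for e in edges if e.out_v == vertex_id and e.label == edge_label]


def known_by_who_like(start_id, thing_id):
    """Names of the vertices that start_id knows and that like thing_id."""
    return [
        vertices[v]["properties"]["name"]
        for v in out(start_id, "knows")
        if thing_id in out(v, "likes")
    ]


# Who does Kelvin know that likes cats?
print(known_by_who_like(1, 3))  # prints ['Jack']

# Edges are directed: nothing says that Jack knows Kelvin.
print(out(2, "knows"))  # prints []
```

The second query returns nothing because the only "knows" edge points from Kelvin to Jack. Recording that Jack also knows Kelvin would take a second edge in the other direction, exactly as described above.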

So getting back to the question "why should I care?", well, if something looks like a graph, then wouldn’t it be great if we could model it that way? Many things in our everyday lives center around things that can very nicely be represented in a graph. Things such as your social and business networks, the route you take to get to work, the phone network and airline route choices for trips you need to take are all great candidates. There are also many great business applications for graph databases and algorithms. These include recommendation systems, crime prevention and fraud detection to name but three.

The reverse is also true. If something does not feel like a graph then don’t try to force it to be. Your videos are probably doing quite nicely living in the object store where you currently have them. A sales ledger system built using a relational database is probably doing just fine where it is and likewise a document store is quite possibly just the right place to be storing your documents. So "use the right tool for the job" remains as valid a phrase here as elsewhere. Where graph databases come into their own is when the data you are storing is intrinsically linked by its very nature, the air routes network used as the basis for all of the examples in this book being a perfect example of such a situation.

Those of you that looked at graphs as part of a computer science course are correct if your reaction was "haven’t graphs been around for ages?" Indeed, Leonhard Euler is credited with demonstrating the first graph problem and inventing the whole concept of "Graph Theory" all the way back in 1736 when he investigated the now famous "Seven Bridges of Königsberg" problem.

If you want to read a bit more about graph theory and its present-day application, you can find a lot of good information online. Here’s a Wikipedia link to get you started: https://en.wikipedia.org/wiki/Graph_theory
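
For the curious, Euler's bridge puzzle makes a nice first graph to play with. The sketch below models the four land masses as vertices and the seven bridges as edges, then applies the well-known degree rule: a connected graph has a walk that uses every edge exactly once only if zero or two of its vertices have an odd number of edges. The vertex names are my own labels, and the function assumes the graph is connected rather than checking it.

```python
from collections import Counter

# The seven bridges as undirected edges between four land masses.
# Two bridges can join the same pair, so the list contains repeats.
BRIDGES = [
    ("island", "north"),
    ("island", "north"),
    ("island", "south"),
    ("island", "south"),
    ("island", "east"),
    ("north", "east"),
    ("south", "east"),
]


def degrees(edges):
    """Count how many edge ends touch each vertex."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def has_euler_walk(edges):
    """True if a connected graph has a walk that uses every edge exactly once.

    Connectivity is assumed, not checked, to keep the sketch short.
    """
    odd = [v for v, d in degrees(edges).items() if d % 2 == 1]
    return len(odd) in (0, 2)


print(dict(degrees(BRIDGES)))
print(has_euler_walk(BRIDGES))  # prints False
```

All four land masses touch an odd number of bridges (five for the island and three for each of the others), so no walk can cross every bridge exactly once.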

So, given Graph Theory is anything but a new idea, why is it that only recently we are seeing a massive growth in the building and deployment of graph database systems and applications? At least part of the answer is that computer hardware and software have reached the point where you can build large big data systems that scale well for a reasonable price. In fact, it is easier than ever to build large systems because you don’t have to buy the hardware that your system will run on when you use the cloud.

While you can certainly run a graph database on your laptop (I do just that every day), the reality is that in production, at scale, they are big data systems. Large graphs commonly have many billions of vertices and edges in them, taking up petabytes of data on disk. Graph algorithms can be both compute- and memory-intensive, and it is only fairly recently that deploying the necessary resources for such big data systems has made financial sense for more everyday uses in business, and not just in government or academia. Graph databases are becoming much more broadly adopted across the spectrum, from high-end scientific research to financial networks and beyond.

Another factor that has really helped start this graph database revolution is the availability of high-quality open source technology. There are a lot of great open source projects addressing everything from the databases you need to store the graph data, to the query languages used to traverse them, all the way up to visually displaying graphs as part of the user interface layer. In particular, it is so-called property graphs where we are seeing the broadest development and uptake. In a property graph, both vertices and edges can have properties (effectively, key-value pairs) associated with them. There are many styles of graph that you may end up building and there have been whole books written on these various design patterns, but the property graph technology we will be focused on in this book can support all of the most common usage patterns. If you hear phrases such as directed graph and undirected graph, or cyclic and acyclic graph, and many more as you work with graph databases, a quick online search will get you to a place where you can get familiar with that terminology. A deep discussion of these patterns is beyond the scope of this book, and it’s in no way essential to have a full background in graph theory to get productive quickly.

A third, and equally important, factor in the growth we are seeing in graph database adoption is the low barrier to entry for programmers. As you will see from the examples in this book, someone wanting to experiment with graph technology can download the Apache TinkerPop package and, as long as Java 8 is installed, be up and running with zero configuration (other than doing an unzip of the files) in as little as five minutes. Graph databases do not force you to define schemas or specify the layout of tables and columns before you can get going and start building a graph. Programmers also seem to find the graph style of programming quite intuitive as it closely models the way they think of the world.

Graph database technology should not be viewed as a "rip and replace" technology, but as very much complementary to other databases that you may already have deployed. One common use case is for the graph to be used as a form of smart index into other data stores. This is sometimes called having a polyglot data architecture.
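
As a rough illustration of the "smart index" idea, the sketch below keeps only ids and relationships in a toy graph and leaves the actual documents in a separate store, here just a Python dictionary standing in for a document database. Every name in it (the customer, order and document ids, the `placed` and `described_by` labels and the helper functions) is hypothetical.

```python
# Stand-in for an external document store, keyed by document id.
document_store = {
    "doc-1": "Invoice for order 1001",
    "doc-2": "Support ticket about a late delivery",
    "doc-3": "Invoice for order 1002",
}

# The graph holds ids and relationships only, not the documents themselves.
# Each edge is (from vertex, label, to vertex).
edges = [
    ("customer-a", "placed", "order-1001"),
    ("customer-a", "placed", "order-1002"),
    ("order-1001", "described_by", "doc-1"),
    ("order-1001", "described_by", "doc-2"),
    ("order-1002", "described_by", "doc-3"),
]


def out(vertex, label):
    """Ids reached by following outgoing edges with the given label."""
    return [target for (source, lbl, target) in edges
            if source == vertex and lbl == label]


def documents_for_customer(customer):
    """Walk the graph to find document ids, then fetch them from the store."""
    doc_ids = [doc
               for order in out(customer, "placed")
               for doc in out(order, "described_by")]
    return [document_store[doc_id] for doc_id in doc_ids]


for text in documents_for_customer("customer-a"):
    print(text)
```

The graph answers the "which documents are connected to this customer?" question, and the document store only has to serve simple lookups by id.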

1.7. A word about terminology

The words node and vertex are synonymous when discussing a graph. Throughout this book you will find both words used. However, as the Apache TinkerPop documentation almost exclusively uses the word vertex, as much as possible when discussing Gremlin queries and other concepts, I will endeavor to stick to the word vertex or the plural form vertices. As this book has evolved I realized my use of these terms had become inconsistent and in future updates I plan, with a few exceptions such as when discussing binary trees, to standardize on vertex rather than node to be consistent with the TinkerPop documentation.


Let’s take a look at what you will need to have installed and what tools you will need available to make best use of the examples contained in this tutorial. The key thing that you will need is the Apache TinkerPop project’s Gremlin Console download. In the sections below I will walk you through a discussion of what you need to download and how to set it up.

2.1. What is Apache TinkerPop?

Apache TinkerPop is a graph computing framework and top level project hosted by the Apache Software Foundation. The homepage for the project is located at this URL: http://tinkerpop.apache.org/

The project includes the following components:
  • Gremlin, a graph traversal (query) language

  • Gremlin Console

  • Gremlin Server

  • Programming interfaces

The programming interfaces allow providers of graph databases to build systems that are TinkerPop enabled and allow application programmers to write programs that talk to those systems.

Any such TinkerPop enabled graph database can be accessed using the Gremlin query language and corresponding API. We can also use the TinkerPop API to write client code in languages like Java that can talk to a TinkerPop enabled graph. For most of this book we will be working within the Gremlin console with a local graph. However, in Chapter 6 we will take a look at Gremlin Server and some other TinkerPop 3 enabled environments. Most of Apache TinkerPop has been developed using Java 8 but there are also bindings available for many other programming languages such as Groovy and Python. Parts of TinkerPop are themselves developed in Groovy, most notably the Gremlin Console. The nice thing about that is that we can use Groovy syntax along with Gremlin when entering queries into the Console or sending them via a REST API to a Gremlin Server. All of these topics are covered in detail in this book.

The queries used as examples in this book have been tested with Apache TinkerPop version 3.3 using the TinkerGraph graph and the Gremlin console as well as some other TinkerPop 3 enabled graph stores.

2.2. The Gremlin console

The Gremlin Console is a fairly standard REPL (Read Eval Print Loop) shell. It is based on the Groovy console and if you have used any of the other console environments such as those found with Scala, Python and Ruby you will feel right at home here. The Console offers a low overhead (you can set it up in seconds) and low barrier to entry way to start to play with graphs on your local computer. The console can actually work with graphs that are running locally or remotely but for the majority of this book we will keep things simple and focus on local graphs.

To follow along with this tutorial you will need to have installed the Gremlin console or have access to a TinkerPop 3/Gremlin enabled graph store such as TinkerGraph or JanusGraph.

Regardless of the environment you use, if you work with Apache TinkerPop enabled graphs, the Gremlin console should always be installed on your machine!

2.2.1. Downloading, installing and launching the console

You can download the Gremlin console from the official Apache TinkerPop website: http://tinkerpop.apache.org/

It only takes a few minutes to get the Gremlin Console installed and running. You just download the ZIP file and unzip it and you are all set. TinkerPop 3 also requires a recent version of Java 8 to be installed. I have done all of my testing using Java 8 version 1.8.0_131. The Gremlin Console will not work with versions prior to 1.8.0_45. If you do not have Java 8 installed it is easy to find and download off the Web. The download also includes all of the JAR files that are needed to write a stand alone Java or Groovy TinkerPop application but that is a topic for later!
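
If you want to check a Java 8 version string against the 1.8.0_45 minimum mentioned above, note that comparing the strings directly gives the wrong answer, because "1.8.0_131" sorts before "1.8.0_45" as text. The following Python sketch compares the numeric parts instead. It is a small helper of my own, not something shipped with TinkerPop, and it only understands the `1.8.0_NNN` style of version string (the kind that `java -version` prints for Java 8).

```python
import re

# Oldest Java 8 update that the text above says the console works with.
MINIMUM = "1.8.0_45"


def parse_java_version(text):
    """Turn a version string such as '1.8.0_131' into a tuple of integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:_(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing update number (for example '1.8.0') is treated as update 0.
    return tuple(int(part or 0) for part in match.groups())


def is_new_enough(version, minimum=MINIMUM):
    """Compare the numeric parts rather than the raw strings."""
    return parse_java_version(version) >= parse_java_version(minimum)


print(is_new_enough("1.8.0_131"))  # prints True
print(is_new_enough("1.8.0_31"))   # prints False
print("1.8.0_131" >= MINIMUM)      # prints False, the naive comparison is wrong
```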

When you start the Gremlin console you will be presented with a banner/logo and a prompt that will look something like this. Don’t worry about the plugin messages yet; we will talk about those a bit later.

$ ./gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin>

You can get a list of the available commands by typing :help. Note that all commands to the console itself are prefixed by a colon ":". This enables the console to distinguish them as special and different from actual Gremlin and Groovy commands.

gremlin> :help

For information about Groovy, visit:

Available commands:
  :help       (:h  ) Display this help message
  ?           (:?  ) Alias to: :help
  :exit       (:x  ) Exit the shell
  :quit       (:q  ) Alias to: :exit
  import      (:i  ) Import a class into the namespace
  :display    (:d  ) Display the current buffer
  :clear      (:c  ) Clear the buffer and reset the prompt counter
  :show       (:S  ) Show variables, classes or imports
  :inspect    (:n  ) Inspect a variable or the last result with the GUI object browser
  :purge      (:p  ) Purge variables, classes, imports or preferences
  :edit       (:e  ) Edit the current buffer
  :load       (:l  ) Load a file or URL into the buffer
  .           (:.  ) Alias to: :load
  :save       (:s  ) Save the current buffer to a file
  :record     (:r  ) Record the current session to a file
  :history    (:H  ) Display, manage and recall edit-line history
  :alias      (:a  ) Create an alias
  :register   (:rc ) Register a new command with the shell
  :doc        (:D  ) Open a browser window displaying the doc for the argument
  :set        (:=  ) Set (or list) preferences
  :uninstall  (:-  ) Uninstall a Maven library and its dependencies from the Gremlin Console
  :install    (:+  ) Install a Maven library and its dependencies into the Gremlin Console
  :plugin     (:pin) Manage plugins for the Console
  :remote     (:rem) Define a remote connection
  :submit     (:>  ) Send a Gremlin script to Gremlin Server

For help on a specific command type:
    :help command

Of all the commands listed above, :clear (:c for short) is an important one to remember. If the console starts acting strangely or you find yourself stuck with a prompt like "......1>", typing :clear will reset things nicely.
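
To picture why the colon prefix and :clear are useful, here is a toy model of a console input loop written in Python. It is emphatically not how the Gremlin Console is implemented. It is a sketch under simple assumptions: input starting with a colon is a console command, anything else is code, and code with unbalanced brackets is held in a buffer until it is completed or cleared. The `ToyShell` class and its helpers are invented for this illustration.

```python
def classify(line):
    """Split console input into ('command', name, args) or ('code', text, [])."""
    stripped = line.strip()
    if stripped.startswith(":"):
        parts = stripped[1:].split()
        name = parts[0] if parts else ""
        return ("command", name, parts[1:])
    return ("code", stripped, [])


def is_complete(text):
    """Toy check: a statement is complete once every bracket is closed.

    Brackets inside string literals are not treated specially here.
    """
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return True  # malformed input: pass it on rather than wait
    return not stack


class ToyShell:
    def __init__(self):
        self.buffer = []     # lines of a statement that is not finished yet
        self.submitted = []  # complete statements, in the order entered

    def prompt(self):
        """Show a continuation prompt while the buffer holds unfinished lines."""
        if self.buffer:
            return "......%d> " % len(self.buffer)
        return "gremlin> "

    def feed(self, line):
        kind, text, _args = classify(line)
        if kind == "command":
            if text == "clear":
                self.buffer.clear()  # throw away the unfinished statement
            return
        self.buffer.append(text)
        statement = "\n".join(self.buffer)
        if is_complete(statement):
            self.submitted.append(statement)
            self.buffer.clear()


shell = ToyShell()
shell.feed("g.V().has('code', 'AUS'")  # closing parenthesis is missing
print(repr(shell.prompt()))            # prints '......1> '
shell.feed(":clear")
print(repr(shell.prompt()))            # prints 'gremlin> '
```

Because :clear is recognized as a command rather than as more code, it still works while the shell is waiting for the rest of an unfinished statement.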

It is worth noting that as mentioned above, the Gremlin console is based on the Groovy console and as such you can enter valid Groovy code directly into the console. So as well as using it to experiment with Graphs and Gremlin you can use it as, for example, a desktop calculator should you so desire!

gremlin> 2+3
==>5

gremlin> a = 5
==>5
gremlin> println "The number is ${a}"
The number is 5

gremlin> for (a in 1..5) {print "${a} "};println()
1 2 3 4 5

The Gremlin Console does a very nice job of showing you only a tidy set of query results. If you are working with a graph system that supports TinkerPop 3 but not via the Gremlin console (an example of this would be talking to a Gremlin Server using the HTTP REST API), then what you will get back is a JSON document that you will need to write some code to parse. We will explore that topic much later in this book.
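
As a preview of what "write some code to parse" might involve, here is a minimal Python sketch that pulls the results out of a JSON response. The payload is hand-written and only approximates the general shape of a response, namely a `status` object alongside a `result` object holding the data. The exact layout depends on the server version and serializer in use, so treat every field name here as an assumption to check against a real response.

```python
import json

# Hand-written sample for illustration. The field names and layout are an
# assumption, not output captured from a real Gremlin Server.
SAMPLE_RESPONSE = """
{
  "requestId": "00000000-0000-0000-0000-000000000000",
  "status": {"message": "", "code": 200, "attributes": {}},
  "result": {"data": [3618], "meta": {}}
}
"""


def extract_results(body):
    """Return the list of results, or raise if the status code is not 200."""
    doc = json.loads(body)
    status = doc["status"]
    if status["code"] != 200:
        raise RuntimeError(
            f"request failed with status {status['code']}: {status['message']}"
        )
    return doc["result"]["data"]


print(extract_results(SAMPLE_RESPONSE))  # prints [3618]
```

The console does this unwrapping for you, which is why it can show just the value itself.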

If you want to see lots of examples of the output from running various queries you will find plenty in the "MISCELLANEOUS QUERIES AND THEIR RESULTS" section of this book where we have tried to go into more depth on various topics.

Mostly you will run the Gremlin console in its interactive mode. However, you can also pass the name of a file as a command line parameter, preceded by the -e flag, and Gremlin will execute the file and exit. For example, if you had a file called "mycode.groovy" you could execute it directly from your command line window or terminal window as follows:

$ gremlin -e mycode.groovy

If you wanted to have the console run your script and not exit afterwards, you can use the -i option instead of -e.

You can get help on all of the command line options for the Gremlin console by typing gremlin --help. You should get back some help text that looks like this:

$ gremlin --help

usage: gremlin.sh [options] [...]
  -C, --color                               Disable use of ANSI colors
  -D, --debug                               Enabled debug Console output
  -Q, --quiet                               Suppress superfluous Console
  -V, --verbose                             Enable verbose Console output
  -e, --execute=SCRIPT ARG1 ARG2 ...        Execute the specified script
                                            (SCRIPT ARG1 ARG2 ...) and
                                            close the console on
  -h, --help                                Display this help message
  -i, --interactive=SCRIPT ARG1 ARG2 ...    Execute the specified script
                                            and leave the console open on
  -l                                        Set the logging level of
                                            components that use standard
                                            logging output independent of
                                            the Console
  -v, --version                             Display the version

If you ever want to check which version of TinkerPop you have installed you can enter the following command from inside the Gremlin console.

// What version of Gremlin am I running?
gremlin> Gremlin.version()

One thing that is not at all obvious or apparent is that the Gremlin console quietly imports a large number of Java Classes and Enums on your behalf as it starts up. This makes writing queries within the console simpler. However, as we shall explore in the "Important Classes and Enums to be aware of" section later, once you start writing stand alone programs in Java or other languages, you need to actually know what the console did on your behalf. As a teaser for what comes later, try typing :show imports when using the Gremlin Console and see what it returns.

2.2.2. Saving output from the console to a file

Sometimes it is useful to save part or all of a console session to a file. You can turn recording to a file on and off using the :record command.

In the following example, we turn recording on using :record start mylog.txt which will force all commands entered and their output to be written to the file mylog.txt until the command :record stop is entered. The command g.V().count().next() just counts how many vertices (nodes) are in the graph. We will explain the Gremlin graph traversal and query language in detail starting in the next section.

gremlin> :record start mylog.txt
Recording session to: "mylog.txt"

gremlin> g.V().count().next()
==>3618

gremlin> :record stop
Recording stopped; session saved as: "mylog.txt" (157 bytes)

If we were to look at the mylog.txt file, this is what it now contains.

// OPENED: Tue Sep 12 10:43:40 CDT 2017
// RESULT: mylog.txt
g.V().count().next()
// RESULT: 3618
:record stop
// CLOSED: Tue Sep 12 10:43:50 CDT 2017
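
Because results in the recorded file are written as comment lines starting with "// RESULT:", a log like this is easy to process later. The Python sketch below separates the commands that were typed from the results that came back. The `parse_session_log` helper is my own invention, and it assumes the simple line-oriented layout shown in this section: comment lines begin with "//" and everything else is a command.

```python
# A session log in the style shown in this section.
LOG_TEXT = """\
// OPENED: Tue Sep 12 10:43:40 CDT 2017
// RESULT: mylog.txt
g.V().count().next()
// RESULT: 3618
:record stop
// CLOSED: Tue Sep 12 10:43:50 CDT 2017
"""

RESULT_PREFIX = "// RESULT:"


def parse_session_log(text):
    """Return (commands, results) found in a recorded console session."""
    commands, results = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(RESULT_PREFIX):
            results.append(line[len(RESULT_PREFIX):].strip())
        elif not line.startswith("//"):
            # Anything that is not a comment line is a command that was typed.
            commands.append(line)
    return commands, results


commands, results = parse_session_log(LOG_TEXT)
print(commands)  # prints ['g.V().count().next()', ':record stop']
print(results)   # prints ['mylog.txt', '3618']
```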

For the remainder of this book I am not going to show the gremlin> prompt or the ==> output identifier as part of each example, just to reduce clutter a bit. You can assume, however, that each command was entered and tested using the Gremlin console.

If you want to learn more about the console itself you can refer to the official TinkerPop documentation and, even better, have a play with the console and the built in help.

2.3. Introducing TinkerGraph

As well as the Gremlin Console, the TinkerPop 3 download includes an implementation of an in-memory graph store called TinkerGraph. This book was mostly developed using TinkerGraph but I also tested everything using JanusGraph. We will introduce JanusGraph later in the "Introducing JanusGraph" section. The nice thing about TinkerGraph is that for learning and testing things you can run everything you need on your laptop or desktop computer and be up and running very quickly. We will explain how to get started with the Gremlin Console and TinkerGraph a bit later in this book.

TinkerPop 3 defines a number of capabilities that a graph store should support. Some are optional, others are not. You can query any TinkerPop 3 enabled graph store to see which features are supported using a command such as graph.features() once you have established the graph object. We will look at how to do that soon. The following list shows the features supported by TinkerGraph. This is what you would get back should you call the features method provided by TinkerGraph. I have arranged the list in two columns to aid readability. Don’t worry if not all of these terms make sense right away - we’ll get there soon!

Output from graph.features()
> GraphFeatures                          > VertexPropertyFeatures
>-- ConcurrentAccess: false              >-- UserSuppliedIds: true
>-- ThreadedTransactions: false          >-- StringIds: true
>-- Persistence: true                    >-- RemoveProperty: true
>-- Computer: true                       >-- AddProperty: true
>-- Transactions: false                  >-- NumericIds: true
> VariableFeatures                       >-- CustomIds: false
>-- Variables: true                      >-- AnyIds: true
>-- LongValues: true                     >-- UuidIds: true
>-- SerializableValues: true             >-- Properties: true
>-- FloatArrayValues: true               >-- LongValues: true
>-- UniformListValues: true              >-- SerializableValues: true
>-- ByteArrayValues: true                >-- FloatArrayValues: true
>-- MapValues: true                      >-- UniformListValues: true
>-- BooleanArrayValues: true             >-- ByteArrayValues: true
>-- MixedListValues: true                >-- MapValues: true
>-- BooleanValues: true                  >-- BooleanArrayValues: true
>-- DoubleValues: true                   >-- MixedListValues: true
>-- IntegerArrayValues: true             >-- BooleanValues: true
>-- LongArrayValues: true                >-- DoubleValues: true
>-- StringArrayValues: true              >-- IntegerArrayValues: true
>-- StringValues: true                   >-- LongArrayValues: true
>-- DoubleArrayValues: true              >-- StringArrayValues: true
>-- FloatValues: true                    >-- StringValues: true
>-- IntegerValues: true                  >-- DoubleArrayValues: true
>-- ByteValues: true                     >-- FloatValues: true
> VertexFeatures                         >-- IntegerValues: true
>-- AddVertices: true                    >-- ByteValues: true
>-- DuplicateMultiProperties: true       > EdgePropertyFeatures
>-- MultiProperties: true                >-- Properties: true
>-- RemoveVertices: true                 >-- LongValues: true
>-- MetaProperties: true                 >-- SerializableValues: true
>-- UserSuppliedIds: true                >-- FloatArrayValues: true
>-- StringIds: true                      >-- UniformListValues: true
>-- RemoveProperty: true                 >-- ByteArrayValues: true
>-- AddProperty: true                    >-- MapValues: true
>-- NumericIds: true                     >-- BooleanArrayValues: true
>-- CustomIds: false                     >-- MixedListValues: true
>-- AnyIds: true                         >-- BooleanValues: true
>-- UuidIds: true                        >-- DoubleValues: true
> EdgeFeatures                           >-- IntegerArrayValues: true
>-- RemoveEdges: true                    >-- LongArrayValues: true
>-- AddEdges: true                       >-- StringArrayValues: true
>-- UserSuppliedIds: true                >-- StringValues: true
>-- StringIds: true                      >-- DoubleArrayValues: true
>-- RemoveProperty: true                 >-- FloatValues: true
>-- AddProperty: true                    >-- IntegerValues: true
>-- NumericIds: true                     >-- ByteValues: true
>-- CustomIds: false
>-- AnyIds: true
>-- UuidIds: true

TinkerGraph is really useful while learning to work with Gremlin and great for testing things out. One common use case where TinkerGraph can be very useful is to create a sub-graph of a large graph and work with it locally. TinkerGraph can even be used in production deployments if an all in memory graph fits the bill. Typically, TinkerGraph is used to explore static (unchanging) graphs but you can also use it from a programming language like Java and mutate its contents if you want to. However, TinkerGraph does not support some of the more advanced features you will find in implementations like JanusGraph such as transactions and external indexes. We will cover these topics as part of our discussion of JanusGraph in the Introducing JanusGraph section later on. One other thing worth noting in the list above is that UserSuppliedIds is set to true for vertex and edge ID values. This means that if you load a graph file, such as a GraphML format file, that specifies ID values for vertices and edges then TinkerGraph will honor those IDs and use them. As we shall see later this is not the case with most other graph systems.
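If you only care about one or two capabilities, you do not need to scan the whole listing. The sketch below assumes a TinkerGraph has been created in the Gremlin Console and assigned to graph (we do this shortly), and that the feature accessor methods of the TinkerPop 3 Graph.Features interface are available. The values in the comments are what the list above suggests for TinkerGraph; other graph stores may answer differently.

```groovy
// Assumption: running inside the Gremlin Console with TinkerGraph enabled
graph = TinkerGraph.open()

// Ask about individual features rather than printing them all
graph.features().vertex().supportsUserSuppliedIds()   // true, per the list above
graph.features().edge().supportsUserSuppliedIds()     // true, per the list above
graph.features().graph().supportsTransactions()       // false, per the list above
```

Checking a feature this way can be handy before loading a file that carries its own ID values into a graph store you have not used before.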

When running in the Gremlin Console, support for TinkerGraph should be on by default. If for any reason you find it to be off, you can enable it by issuing the following command.

:plugin use tinkerpop.tinkergraph

Once the TinkerGraph plugin is enabled you will need to close and re-load the Gremlin console. After doing that, you can create a new TinkerGraph instance from the console as follows.

graph = TinkerGraph.open()

In many cases you will want to pass parameters to the open method that give more information on how the graph is to be configured. We will explore those options later in the book. Before you can start to issue Gremlin queries against the graph you also need to establish a graph traversal source object by calling the new graph’s traversal method as follows.

g = graph.traversal()

Throughout the remainder of this book we will follow the convention that we always use the variable name graph for any variable that represents a graph instance and the variable name g for any variable that represents an instance of a graph traversal source object.

2.4. Introducing the air-routes graph

Along with this book I have provided what is, in big data terms, a very small, but nonetheless real-world, graph that is written in GraphML, a standard XML format for describing graphs that can be used to move graphs between applications. The graph, air-routes.graphml, is a fairly accurate model I built of the world airline route network.

The air-routes.graphml file can be downloaded from the sample-data folder located in the GitHub repository at the following URL: https://github.com/krlawrence/graph/tree/master/sample-data

Of course, in the real world, routes are added and deleted by airlines all the time so please don’t use this graph to plan your next vacation or business trip! However, as a learning tool I hope you will find it useful and easy to relate to. If you feel so inclined you can load the file into a text editor and examine how it is laid out. As you work with graphs you will want to become familiar with popular graph serialization formats. Two common ones are GraphML and GraphSON. The latter is a JSON format that is defined by Apache TinkerPop and heavily used in that environment. GraphML is very widely recognized by TinkerPop and many other tools as well such as Gephi, a popular open source tool for visualizing graph data. A lot of graph ingestion algorithms also still use comma separated values (CSV) format files.

We will briefly look at loading and saving graph data in Sections 2 and 4. We take a much deeper look at different ways to work with graph data stored in text format files including importing and exporting graph data in the "COMMON GRAPH SERIALIZATION FORMATS" section at the end of this book.

The air-routes graph contains several vertex types that are specified using labels. The most common ones are airport and country. There are also vertices for each of the seven continents (continent) and a single version vertex that I provided as a way to test which version of the graph you are using.

Routes between airports are modeled as edges. These edges carry the route label and include the distance between the two connected airport vertices as a property called dist. Connections between countries and airports are modelled using an edge with a contains label.

Each airport vertex has many properties associated with it giving various details about that airport including its IATA and ICAO codes, its description, the city it is in and its geographic location.

Specifically, each airport vertex has a unique ID, a label of airport and contains the following properties. The word in parentheses indicates the type of the property.

 type    (string) : Vertex type. Will be 'airport' for airport vertices
 code    (string) : The three letter IATA code like AUS or LHR
 icao    (string) : The four letter ICAO code or none. Example KAUS or EGLL
 desc    (string) : A text description of the airport
 region  (string) : The geographical region like US-TX or GB-ENG
 runways (int)    : The number of available runways
 longest (int)    : Length of the longest runway in feet
 elev    (int)    : Elevation in feet above sea level
 country (string) : Two letter ISO country code such as US, FR or DE.
 city    (string) : The name of the city the airport is in
 lat     (double) : Latitude of the airport
 lon     (double) : Longitude of the airport

We can use Gremlin once the air route graph is loaded to show us what properties an airport vertex has. As an example here is what the airport vertex with an ID of 3 looks like. We will explain the steps that make up the Gremlin query shortly.

// Query the properties of vertex 3

desc=[Austin Bergstrom International Airport]
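The query that produced this output is not shown in the listing above. A likely form is sketched below, assuming the air-routes graph is loaded, g is the traversal source and the IDs are numeric. The valueMap step returns each property key with its values and unfold prints one key per line, which matches the shape of the desc line above.

```groovy
// Show the properties of the vertex with an ID of 3, one key per line
g.V(3).valueMap().unfold()
```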

Even though the airport vertex label is airport I chose to also have a property called type that also contains the string airport. This was done to aid with indexing when working with other graph database systems and is explained in more detail later in this book.

You may have noticed that the values for each property are represented as lists (or arrays if you prefer), even though each list only contains one element. The reasons for this will be explored later in this book but the quick explanation is that this is because TinkerPop allows us to associate a list of values with any vertex property. We will explore ways that you can take advantage of this capability in the "Attaching multiple values (lists or sets) to a single property" section.

The full details of all the features contained in the air-routes graph can be learned by reading the comments at the start of the air-routes.graphml file or reading the README.txt file.

The graph currently contains a total of 3,612 vertices and 49,894 edges. Of these 3,367 vertices are airports, and 43,160 of the edges represent routes. While in big data terms this is really a tiny graph, it is plenty big enough for us to build up and experiment with some very interesting Gremlin queries.

Lastly, here are some statistics and facts about the air-routes graph. If you want to see a lot more statistics check the README.txt file that is included with the air-routes graph.

Air Routes Graph (v0.77, 2017-Oct-06) contains:
  3,374 airports
  43,400 routes
  237 countries (and dependent areas)
  7 continents
  3,619 total nodes
  50,148 total edges

Additional observations:
  Longest route is between DOH and AKL (9,025 miles)
  Shortest route is between WRY and PPW (2 miles)
  Average route distance is 1,164.747 miles.
  Longest runway is 18,045ft (BPX)
  Shortest runway is 1,300ft (SAB)
  Furthest North is LYR (latitude: 78.2461013793945)
  Furthest South is USH (latitude: -54.8433)
  Furthest East is SVU (longitude: 179.341003418)
  Furthest West is TVU (longitude: -179.876998901)
  Closest to the Equator is MDK (latitude: 0.0226000007242)
  Closest to the Greenwich meridian is LDE (longitude: -0.006438999902457)
  Highest elevation is DCY (14,472 feet)
  Lowest elevation is GUW (-72 feet)
  Maximum airport node degree (routes in and out) is 544 (FRA)
  Country with the most airports: United States (579)
  Continent with the most airports: North America (978)
  Average degree (airport nodes) is 25.726
  Average degree (all nodes) is 25.856

Here are the Top 15 airports sorted by overall number of routes (in and out). In graph terminology this is often called the degree of the vertex or just vertex degree.


  Rank    ID  Code  Total  Details
     1    52   FRA  (544)  out:272 in:272
     2    70   AMS  (541)  out:269 in:272
     3   161   IST  (540)  out:270 in:270
     4    51   CDG  (524)  out:262 in:262
     5    80   MUC  (474)  out:237 in:237
     6    64   PEK  (469)  out:234 in:235
     7    18   ORD  (464)  out:232 in:232
     8     1   ATL  (464)  out:232 in:232
     9    58   DXB  (458)  out:229 in:229
    10     8   DFW  (442)  out:221 in:221
    11   102   DME  (428)  out:214 in:214
    12    67   PVG  (402)  out:201 in:201
    13    50   LGW  (400)  out:200 in:200
    14    13   LAX  (390)  out:195 in:195
    15    74   MAD  (384)  out:192 in:192

Throughout this book you will find Gremlin queries that can be used to generate many of these statistics.

There is a sample script called graph-stats.groovy in the GitHub repository located in the sample-code folder that shows how to generate some statistics about the graph. The script can be found at the following URL: https://github.com/krlawrence/graph/tree/master/sample-code

2.5. TinkerPop 3 migration notes

There are still a large number of examples on the internet that show the TinkerPop 2 way of doing things. Quite a lot of things changed between TinkerPop 2 and TinkerPop 3. If you were an early adopter and are coming from a TinkerPop 2 environment to a TinkerPop 3 environment you may find some of the tips in this section helpful. As we will explain below, using the sugar plugin will make the migration from TinkerPop 2 easier but it is recommended to learn the full TinkerPop 3 Gremlin syntax and get used to using that as soon as possible. Using the full syntax will make your queries a lot more portable to other TinkerPop 3 enabled graph systems.

TinkerPop 3 requires a minimum of Java 8 v45. It will not run on earlier versions of Java 8 based on my testing.

2.5.1. Creating a TinkerGraph TP2 vs TP3

The way that you create a TinkerGraph changed between TinkerPop 2 and 3.

graph = new TinkerGraph()  // TinkerPop 2
graph = TinkerGraph.open() // TinkerPop 3

2.5.2. Loading a GraphML file TP2 vs TP3

If you have previous experience with TinkerPop 2 you may also have noticed that the way a graph is loaded has changed in TinkerPop 3.

graph.loadGraphML('air-routes.graphml') // TinkerPop 2
graph.io(graphml()).readGraph('air-routes.graphml') // TinkerPop 3

The Gremlin language itself changed quite a bit between TinkerPop 2 and TinkerPop 3. The remainder of this book only shows TinkerPop 3 examples.

2.5.3. A word about the TinkerPop.sugar plugin

The Gremlin console has a set of plugin modules that can be independently enabled or disabled. Depending upon your use case you may or may not need to manage plugins.

TinkerPop 2 supported by default some syntactic sugar that allowed shorthand forms of queries to be entered when using the Gremlin console. In TinkerPop 3 that support has been moved to a plugin and is off by default. It has to be enabled if you want to continue to use the same shortcuts that TinkerPop 2 allowed by default.

You can enable sugar support from the Gremlin console as follows:

:plugin use tinkerpop.sugar

The Gremlin Console remembers which plugins are enabled between restarts.

In the current revision of this book I have tried to remove any dependence on the TinkerPop.sugar plugin from the examples presented. By not using Sugar, queries shown in this book should port very easily to other TinkerPop 3 enabled graph platforms. A few of the queries may not work on versions of TinkerPop prior to 3.2 as TinkerPop continues to evolve and new features are being added fairly regularly.

The TinkerPop.sugar plugin allows some queries to be expressed in a more shorthand or lazy form, often leaving out references to values() and leaving out parentheses. For example:

// With Sugar enabled

// Without Sugar enabled
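The two queries for this comparison are missing from the listing above. A sketch of the contrast follows, assuming the air-routes graph is loaded and g is the traversal source. The first form only works once the sugar plugin has been enabled, and its exact shorthand is an assumption based on the description above (no parentheses after V and no values() call).

```groovy
// With Sugar enabled (assumes :plugin use tinkerpop.sugar has been issued)
g.V.hasLabel('airport').code

// Without Sugar enabled, using the full TinkerPop 3 syntax
g.V().hasLabel('airport').values('code')
```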

People migrating from TinkerPop 2 will find that the Sugar plugin helps get their existing queries running more easily, but as a general rule it is recommended to become familiar with the longhand way of writing queries as that will enable your queries to run as efficiently as possible on graph stores that support TinkerPop 3. Also, due to changes introduced with TinkerPop 3, using sugar will not be as performant as using the normal Gremlin syntax.

In earlier versions of this book many of the examples showed the sugar form. In the current revision I have tried to remove all use of that form. It’s possible that I may have missed a few and I will continue to check for, and fix, any that got missed. Please let me know if you find any that slipped through the net!

2.6. Loading the air-routes graph using the Gremlin console

Here is some code that loads the air-routes graph using the Gremlin Console. You can either put it into a file and use :load to load and run it, or enter each line into the console manually. These commands will set up the console environment, create a TinkerGraph graph and load the air-routes.graphml file into it. Some extra console features are also enabled.

There is a file called load-air-routes-graph.groovy, that contains the commands shown below, available in the /sample-data directory. https://github.com/krlawrence/graph/tree/master/sample-data

These commands create an in-memory TinkerGraph which will use LONG values for the vertex and edge IDs. TinkerPop 3 introduced the concept of a traversal so as part of loading a graph we also set up a graph traversal source called g which we will then refer to in our subsequent queries of the graph. The max-iteration option tells the Gremlin console the maximum number of lines of output that we ever want to see in return from a query. The default, if this is not specified, is 100.

You can use the max-iteration setting to control how much output the Gremlin Console displays.

If you are using a different graph environment and GraphML import is supported, you can still load the air-routes.graphml file by following the instructions specific to that system. Once loaded, the queries below should still work either unchanged or with minor modifications.

conf = new BaseConfiguration()
graph = TinkerGraph.open(conf)
:set max-iteration 1000

Setting the ID manager to use LONG values is important. If you do not do this, by default, when using TinkerGraph, ID values will have to be specified as strings such as "3" rather than just the numeral 3.
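The listing above does not show the ID manager settings or the load itself. A fuller sketch follows, assuming the TinkerGraph configuration keys named below and the graph.io form shown in the migration notes. The load-air-routes-graph.groovy script in the GitHub repository is the definitive version and may differ from this sketch.

```groovy
// Configure TinkerGraph to use LONG values for its IDs
conf = new BaseConfiguration()
conf.setProperty("gremlin.tinkergraph.vertexIdManager","LONG")
conf.setProperty("gremlin.tinkergraph.edgeIdManager","LONG")
conf.setProperty("gremlin.tinkergraph.vertexPropertyIdManager","LONG")
graph = TinkerGraph.open(conf)

// Load the GraphML file (use the full path if the file lives elsewhere)
graph.io(graphml()).readGraph('air-routes.graphml')

// Create the traversal source used by all later queries
g = graph.traversal()

// Allow up to 1000 lines of output per query
:set max-iteration 1000
```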

If you download the load-air-routes-graph.groovy file, once the console is up and running you can load that file by entering the command below. Doing this will save you a fair bit of time as each time you restart the console you can just reload your configuration file and the environment will be configured and the graph loaded and you can get straight to writing queries.

:load load-air-routes-graph.groovy

As a best practice you should use the full path to the location where the GraphML file resides if at all possible to make sure that the GraphML reading code can find it.

Once you have the Gremlin Console up and running and have the graph loaded, if you feel like it you can cut and paste queries from this book directly into the console to see them run.

Once the air-routes graph is loaded you can enter the following command and you will get back information about the graph. In the case of a TinkerGraph you will get back a useful message telling you how many vertices and edges the graph contains. Note that the contents of this message will vary from one graph system to another and should not be relied upon as a way to keep track of vertex and edge counts. We will look at some other ways of doing that later in the book.

// Tell me something about my graph
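The command itself is missing from the listing above. One way to get such a message, assuming graph is the variable holding the loaded TinkerGraph, is to ask the graph for its string form:

```groovy
// Ask the graph instance to describe itself
graph.toString()
```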

When using TinkerGraph, the message you get back will look something like this.

tinkergraph[vertices:3610 edges:49490]

2.7. Turning off some of the Gremlin console’s output

Sometimes, especially when assigning a result to a variable and you are not interested in seeing all the steps that Gremlin took to get there, the Gremlin console displays more output than is desirable. An easy way to prevent this is to just add an empty list ";[]" to the end of your query as follows.
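The example for this tip is not shown above. A minimal sketch follows, assuming the air-routes graph is loaded and g is the traversal source. The variable name a is just a placeholder chosen for illustration.

```groovy
// The trailing ;[] makes the console print an empty list
// instead of every vertex that was stored in the variable
a = g.V().has('code','AUS').out().toList();[]
```

The results are still held in the variable and can be inspected later, for example with a.size().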


2.8. A word about indexes and schemas

Some graph implementations have strict requirements on the use of an index. This means that a schema and an index must be in place before you can work with a graph and that you can only begin a traversal by referencing a property in the graph that is included in the index. While that is, for the most part, outside the scope of this book, it should be pointed out that some of the queries included in this material will not work on any graph system that requires all queries to be backed by an index. Such graph stores tend not to allow what are sometimes called full graph searches for cases where a particular item in a graph is not backed by an index. One example of this is vertex and edge labels which are typically not indexed but are sometimes very useful items to specify at the start of a query.

As most of the examples in this book are intended to work just fine with only a basic TinkerGraph, the subject of indexes is not covered in detail until Section 6 "MOVING BEYOND THE CONSOLE AND TINKERGRAPH". However, as TinkerGraph does have some indexing capability I have also included some discussion of it in the "Introducing TinkerGraph indexes" section. In Section 6, where I start to look at additional technologies such as JanusGraph, I have included a more in depth discussion of indexing as part of that coverage. You should always refer to the specific documentation for the graph system you are using to decide what you need to do about creating an index and schema for your graph.

We introduced TinkerGraph in the "Introducing TinkerGraph" section. I won’t be discussing the creation of an explicit schema again until Section 6. When working with TinkerGraph there is no need to define a schema ahead of time. The types of each property are derived at creation time. This is a really convenient feature and allows us to get productive and do some experimenting really quickly.

In production systems, especially those where the graphs are large, the task of creating and managing the parts of the index is often handed to an additional software component such as Apache Solr or Elasticsearch.

In general for any graph database, regardless of whether it is optional or not, use of an index should be considered a best practice. As I mentioned, even TinkerGraph has a way to create an index should you want to.


Now that you hopefully have the air-routes graph loaded it’s time to start writing some queries!

In this section we will begin to look at the Gremlin query language. We will start off with a quick look at how Gremlin and SQL differ and are yet in some ways similar, then we will look at some fairly basic queries and finally get into some more advanced concepts. Hopefully each set of examples will be easy to understand, as each one builds upon things previously discussed.

3.1. Introducing Gremlin

Gremlin is the name of the graph traversal and query language that TinkerPop provides for working with property graphs. Gremlin can be used with any graph store that is Apache TinkerPop enabled. Gremlin is a fairly imperative language but also has some more declarative constructs. Using Gremlin we can traverse a graph looking for values, patterns and relationships; we can add or delete vertices and edges; we can create sub-graphs and lots more.

3.1.1. A quick look at Gremlin and SQL

While it is not required to know SQL in order to be productive with Gremlin, if you do have some experience with SQL you will notice many of the same keywords and phrases being used in Gremlin. As a simple example the SQL and Gremlin examples below both show how we might count the number of airports there are in each country using firstly a relational database and secondly a property graph.

When working with a relational database, we might decide to store all of the airport data in a single table called airports. In a very simple case (the air routes graph actually stores a lot more data than this about each airport) we could set up our airports table so that it had entries for each airport as follows.

ID   CODE  ICAO  CITY             COUNTRY
---  ----  ----  ---------------  ----------
1    ATL   KATL  Atlanta          US
3    AUS   KAUS  Austin           US
8    DFW   KDFW  Dallas           US
47   YYZ   CYYZ  Toronto          CA
49   LHR   EGLL  London           UK
51   CDG   LFPG  Paris            FR
52   FRA   EDDF  Frankfurt        DE
55   SYD   YSSY  Sydney           AU

We could then use a SQL query to count the distribution of airports in each country as follows.

select country,count(country) from airports group by country;

We can do this in Gremlin using the air-routes graph with a query like the one below (we will explain what all of this means later on in the book).
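The Gremlin query is not shown above. A likely equivalent of the SQL statement is sketched below, assuming the air-routes graph is loaded and g is the traversal source. The groupCount step plays the role of count with group by, and the by modulator names the property to group on.

```groovy
// Count the airports in each country
g.V().hasLabel('airport').groupCount().by('country')
```

The result is a single map whose keys are country codes and whose values are the number of airports found for each one.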


You will discover that Gremlin provides its own flavor of several constructs that you will be familiar with if you have used SQL before, but again, prior knowledge of SQL is in no way required to learn Gremlin.

One thing you will not find when working with a graph using Gremlin is the concept of a SQL join. Graph databases by their very nature avoid the need to join things together (as things that need to be connected already are connected) and this is a core reason why, for many use cases, graph databases are a very good choice and can be more performant than relational databases.

Graph databases are usually a good choice for storing and modelling networks. The air-routes graph is an example of a network graph; a social network is of course another good example. Networks can be modelled using relational databases too but as you explore the network and ask questions like "who are my friends' friends?" in a social network or "where can I fly to from here with a maximum of two stops?" things rapidly get complicated and result in the need for multiple joins.

As an example, imagine adding a second table to our relational database called routes. It will contain three columns representing the source airport, the destination airport and the distance between them in miles (SRC,DEST and DIST). It would contain entries that looked like this (the real table would of course have thousands of rows but this gives a good idea of what the table would look like).

SRC  DEST  DIST
---  ----  ----
ATL  DFW   729
ATL  FRA   4600
AUS  DFW   190
AUS  LHR   4901
BOM  AGR   644
BOM  LHR   4479
CDG  DFW   4933
CDG  FRA   278
CDG  LHR   216
DFW  FRA   5127
DFW  LHR   4736
LHR  BOM   4479
LHR  FRA   406
YYZ  FRA   3938
YYZ  LHR   3544

If we wanted to write a SQL query to calculate the ways of travelling from Austin (AUS) to Agra (AGR) with two stops, we would end up writing a query that looked something like this:

select a1.code,r1.dest,r2.dest,r3.dest from airports a1
  join routes r1 on a1.code=r1.src
  join routes r2 on r1.dest=r2.src
  join routes r3 on r2.dest=r3.src
  where a1.code='AUS' and r3.dest='AGR';

Using our air-routes graph database the query can be expressed quite simply as follows:
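The query itself is missing above. A sketch of the three hop traversal follows, assuming the air-routes graph is loaded and g is the traversal source. Each out step follows outgoing edges to the next airport. The path step, modulated to show airport codes, is an assumption chosen to mirror the columns returned by the SQL query.

```groovy
// Routes from Austin to Agra with two stops (three hops)
g.V().has('code','AUS').out().out().out().has('code','AGR').path().by('code')
```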


Adding or removing hops is as simple as adding or removing one or more of the out() steps which is a lot simpler than having to add additional join clauses to our SQL query. This is a simple example, but as queries get more and more complicated in heavily connected data sets like networks, the SQL queries get harder and harder to write whereas, because Gremlin is designed for working with this type of data, expressing a traversal remains fairly straightforward.

We can go one step further with Gremlin and use repeat to express the concept of three times as follows.
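The repeat form of the query is not shown above. A sketch follows under the same assumptions as before (air-routes graph loaded, g as the traversal source, path output by airport code). The times modulator states how many times the repeated out step should run.

```groovy
// The same three hop search expressed with repeat and times
g.V().has('code','AUS').repeat(out()).times(3).has('code','AGR').path().by('code')
```

Changing the number of hops now only requires changing the value passed to times.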


Gremlin also has a repeat ... until construct that we will see used later in this book. When combined with the emit step, repeat provides a nice way of getting back any routes between a source and destination no matter how many hops it might take to get there.

Again, don’t worry if some of the Gremlin steps shown here are confusing, we will cover them all in detail a bit later. The key point to take away from this discussion of SQL and Gremlin is that for data that is very connected, graph databases provide a very good way to store that data and Gremlin provides a nice and fairly intuitive way to traverse that data efficiently.

One other point worthy of note is that every vertex and every edge in a graph has a unique ID. Unlike in the relational world where you may or may not decide to give a table an ID column this is not optional with graph databases. In some cases the ID can be a user provided ID but more commonly it will be generated by the graph system when a vertex or edge is first created. If you are familiar with SQL, you can think of the ID as a primary key of sorts if you want to. Every vertex and edge can be accessed using its ID. Just as with relational databases, graph databases can be indexed and any of the properties contained in a vertex or an edge can be added to the index and can be used to find things efficiently. In large graph deployments this greatly speeds up the process of finding things as you would expect. We look more closely at IDs in the "Working with IDs" section.

3.2. Some fairly basic Gremlin queries

A graph query is often referred to as a traversal as that is what we are in fact doing. We are traversing the graph from a starting point to an ending point. Traversals consist of one or more steps (essentially methods) that are chained together.

As we start to look at some simple traversals here are a few steps that you will see used a lot. Firstly, you will notice that almost all traversals start with either a g.V() or a g.E(). Sometimes there will be parameters specified along with those steps but we will get into that a little later. You may remember from when we looked at how to load the air-routes graph in Section 2 we used the following instruction to create a graph traversal source object for our loaded graph.

g = graph.traversal()

Once we have a graph traversal source object we can use it to start exploring the graph. The V step returns vertices and the E step returns edges. You can also use a V step in the middle of a traversal as well as at the start but we will examine those uses a little later. The V and E steps can also take parameters indicating which set of vertices or edges we are interested in. That usage is explained in the "Working with IDs" section.

If it helps with remembering you can think of g.V() as meaning "looking at all of the vertices in the graph" and g.E() as meaning "looking at all of the edges in the graph". We then add additional steps to narrow down our search criteria.

The other steps we need to introduce are the has and hasLabel steps. They can be used to test for a certain label or property having a certain value. We will introduce a lot of different Gremlin steps as we build up our Gremlin examples throughout this book, including many other forms of the has step, but these few are enough to get us started.

You can refer to the official Apache TinkerPop documentation for full details on all of the graph traversal steps that are used in this tutorial. With this tutorial I have not tried to teach every possible usage of every Gremlin step and method, rather, I have tried to provide a good and approachable foundation in writing many different types of Gremlin query using an interesting and real world graph.

The latest TinkerPop 3 documentation is always available at this URL: http://tinkerpop.apache.org/docs/current/reference/

Below are some simple queries against the air-routes graph to get us started. It is assumed that the air-routes graph has been loaded already per the instructions above. The query below will return any vertices (nodes) that have the airport label.

// Find vertices that are airports
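The query for this listing is not shown. Based on the description above, and assuming g is the traversal source for the loaded air-routes graph, it can be sketched as:

```groovy
// Return every vertex that carries the airport label
g.V().hasLabel('airport')
```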

This query will return the vertex that represents the Dallas Fort Worth (DFW) airport.

// Find the DFW vertex
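Again the query is not shown. A sketch that matches the description, assuming the code property holds the IATA code as described earlier, is:

```groovy
// Return the vertex whose code property is DFW
g.V().has('code','DFW')
```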

The next two queries combine the previous two into a single query. The first one just chains the queries together. The second shows a form of the has step that we have not looked at before that takes an additional label value as its first parameter.

// Combining those two previous queries (two ways that are equivalent)
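The two queries are missing from the listing. Going by the description above, they can be sketched as follows, again assuming g is the traversal source for the air-routes graph.

```groovy
// Chain the label test and the property test together
g.V().hasLabel('airport').has('code','DFW')

// Equivalent form: has takes the label as its first parameter
g.V().has('airport','code','DFW')
```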


Here is what we get back from the query. Notice that this is the Gremlin Console’s way of telling us we got back the Vertex with an ID of 8.

v[8]


So, what we actually got back from these queries was a TinkerPop Vertex data structure. Later in this book we will look at ways to store that value into a variable for additional processing. Remember that even though we are working with a Groovy environment while inside the Gremlin Console, everything we are working with here, at its core, is Java code. So we can use the getClass method from Java to introspect the object. Note the call to next which turns the result of the traversal into an object we can work with further.
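A minimal sketch of that introspection, assuming the same g traversal source and the DFW query from above:

```groovy
// next() ends the traversal and hands back the Vertex object itself,
// so ordinary Java methods such as getClass() can be called on it.
g.V().has('airport','code','DFW').next().getClass()
```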


class org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerVertex

The next step that we used above is one of a series of steps that the TinkerPop documentation describes as terminal steps. We will see more of these terminal steps in use throughout this book. As mentioned above, a terminal step essentially ends the graph traversal and returns a concrete object that you can work with further in your application. You will see next and other related steps used in this way when we start to look at using Gremlin from a standalone program a bit later on. We could even add a call to getMethods() at the end of the query above to get back a list of all the methods and their types supported by the TinkerVertex class.

3.2.1. Retrieving property values from a vertex

There are several different ways of working with vertex properties. We can add, delete and query properties for any vertex or edge in the graph. We will explore each of these topics in detail over the course of this book. Initially, let’s look at a couple of simple ways that we can look up the property values of a given vertex.

// What property values are stored in the DFW vertex?

Here is the output that the query returns. Note that when using the values step we just get back the values of the properties; we do not get back the associated keys. We will see how to do that later in the book.

Dallas/Fort Worth International Airport

The values step can take parameters that tell it to only return the values for the provided key names. The queries below return the values of some specific properties.

// Return just the city name property


// Return the 'runways' and 'icao' property values.
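Passing key names to values might look like the sketch below. It assumes g is the air-routes traversal source and uses the property names mentioned in the comments above.

```groovy
// Assumes the air-routes graph is loaded and g is its traversal source.
// Only the value of the 'city' property
g.V().has('airport','code','DFW').values('city')

// Only the values of the 'runways' and 'icao' properties
g.V().has('airport','code','DFW').values('runways','icao')
```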


3.2.2. Does a specific property exist on a given vertex or edge?

You can simply test to see if a property exists as well as testing for it containing a specific value. To do this we can just provide has with the name of the property we are interested in. This works equally well for both vertex and edge properties.

// Find all edges that have a 'dist' property

// Find all vertices that have a 'region' property

// Find all the vertices that do not have a 'region' property

// The above is shorthand for
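The four cases described by the comments above could be sketched as follows, assuming the g traversal source from earlier and running in the Gremlin Console, where not and has are available without a prefix.

```groovy
// Edges that have a 'dist' property
g.E().has('dist')

// Vertices that have a 'region' property
g.V().has('region')

// Vertices that do not have a 'region' property
g.V().hasNot('region')

// The longer form that hasNot is shorthand for
g.V().not(has('region'))
```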

3.2.3. Counting things

A common need when working with graphs is to be able to count how many of something there are in the graph. We will look in the next section at other ways to count groups of things but first of all let’s look at some examples of using the count step to count how many of various things there are in our air-routes graph. First of all let’s find out how many vertices in the graph represent airports.

// How many airports are there in the graph?
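One way to write that count, assuming g is the air-routes traversal source:

```groovy
// Keep only airport vertices, then count them.
g.V().hasLabel('airport').count()
```

The number returned depends on the version of the air-routes data that is loaded.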


Now, looking at edges that have a route label, let’s find out how many flight routes are stored in the graph. Note that the outE step looks at outgoing edges. In this case we could also have used the out step instead. The various ways that you can look at outgoing and incoming edges are discussed in the "Starting to walk the graph" section that is coming up soon.

// How many routes are there?


You could shorten the above a little as follows, but this would cause more edges to get looked at as we do not first filter out all vertices that are not airports.

// How many routes are there?


You could also do it this way, but generally starting by looking at all the edges in the graph is considered bad form as property graphs tend to have a lot more edges than vertices.

// How many routes are there?


We have not yet looked at the outE step used above. We will look at it very soon however in the "Starting to walk the graph" section.

3.2.4. Counting groups of things

Sometimes it is useful to count how many of each type (or group) of things there are in the graph. This can be done using the group and groupCount steps. While for a very large graph it is not recommended to run queries that look at all of the vertices or all of the edges in a graph, for smaller graphs this can be quite useful. For the air routes graph we could easily count the number of different vertex and edge types in the graph as follows.

// How many of each type of vertex are there?
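A sketch of such a query, assuming g is the air-routes traversal source. In the Gremlin Console the label token can be used without a prefix.

```groovy
// Build a map of vertex label -> number of vertices with that label.
g.V().groupCount().by(label)
```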

If we were to run the query we would get back a map where the keys are label names and the values are the counts for the occurrence of each label in the graph.


There are other ways we could write the query above that will yield the same result. One such example is shown below.

// How many of each type of vertex are there?


We can also run a similar query to find out the distribution of edge labels in the graph. An example of the type of result we would get back is also shown.

// How many of each type of edge are there?


As before we could rewrite the query as follows.

// How many of each type of edge are there?


By way of a side note, the examples above are shorthand ways of writing something like this example which also counts vertices by label.

// As above but using group()
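Spelled out with group, the same count might look like this. The first by modulator picks the key for each group and the second says what to compute for it. As before, g is assumed to be the air-routes traversal source.

```groovy
// Group vertices by label, then reduce each group to its size.
g.V().group().by(label).by(count())
```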


We can be more selective in how we specify the groups of things that we want to count. In the examples below we first count how many airports there are in each country. This will return a map of key:value pairs where the key is the country code and the value is the number of airports in that country. As the fourth and fifth examples show, we can use select to pick just a few values from the whole group that got counted. Of course if we only wanted a single value we could just count the airports connected to that country directly but the last two examples are intended to show that you can count a group of things and still selectively only look at part of that group.

// How many airports are there in each country?

// How many airports are there in each country? (look at country first)
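The two approaches could be sketched as below. The second one assumes that country vertices are connected to their airports by outgoing edges, as they are in the air-routes graph.

```groovy
// Start from the airports and group them by their 'country' property.
g.V().hasLabel('airport').groupCount().by('country')

// Start from the country vertices and count the vertices each one points to.
g.V().hasLabel('country').group().by('code').by(out().count())
```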

We can easily find out how many airports there are in each continent using group to build a map of continent codes and the number of airports in that continent. The output from running the query is shown below also.

// How many airports are there in each continent?


These queries show how select can be used to extract specific values from the map that we have created. Again you can see the results we get from running the query.

// How many airports are there in France (having first counted all countries)


// How many airports are there in France, Greece and Belgium respectively?
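Selecting from the counted map might be written as follows. This is a sketch that assumes g is the air-routes traversal source and that FR, GR and BE are the country codes used for France, Greece and Belgium.

```groovy
// Count airports for every country, then keep just the entry for France.
g.V().hasLabel('airport').groupCount().by('country').select('FR')

// The same count, keeping the entries for France, Greece and Belgium.
g.V().hasLabel('airport').groupCount().by('country').select('FR','GR','BE')
```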


The group and groupCount steps are very useful when you want to count groups of things or collect things into groups using a selection criterion. You will find a lot more examples of grouping and counting things in the section called "Counting more things".

3.3. Starting to walk the graph

So far we have mostly just explored queries that look at properties on a vertex or count how many things we can find of a certain type. Where the power of a graph really comes into play is when we start to walk or traverse the graph by looking at the connections (edges) between vertices. The term walking the graph is used to describe moving from one vertex to another vertex via an edge. Typically, when using the phrase walking a graph, the intent is to describe starting at a vertex, traversing one or more vertices and edges, and ending up at a different vertex or, sometimes, back where you started in the case of a circular walk. It is very easy to traverse a graph in this way using Gremlin. The journey we took while on our walk is often referred to as our path. There are also cases when all you want to do is return edges or some combination of vertices and edges as the result of a query and Gremlin allows this as well. We will explore a lot of ways to modify the way a graph is traversed in the upcoming sections.

The table below gives a brief summary of all the steps that can be used to walk or traverse a graph using Gremlin. You will find all of these steps used in various ways throughout the book. Think of a graph traversal as moving through the graph from one place to one or more other places. These steps tell Gremlin which places to move to next as it traverses a graph for you.

In order to better understand these steps it is worth defining some terminology. One vertex is considered to be adjacent to another vertex if there is an edge connecting them. A vertex and an edge are considered incident if they are connected to each other.

Table 1. Where to move next while traversing a graph

Step       Description
out *      Outgoing adjacent vertices.
in *       Incoming adjacent vertices.
both *     Both incoming and outgoing adjacent vertices.
outE *     Outgoing incident edges.
inE *      Incoming incident edges.
bothE *    Both outgoing and incoming incident edges.
outV       Outgoing vertex.
inV        Incoming vertex.
otherV     The vertex that was not the vertex we came from.

Note that the steps labelled with an * can optionally take the name of an edge label as a parameter. If omitted, all relevant edges will be traversed.

3.3.1. Some simple graph traversal examples

To get us started, in this section we will look at some simple graph traversal examples that use some of the steps that were just introduced. The out step is used to find the vertices connected to the current vertex by an outgoing edge and the outE step is used when you want to examine the outgoing edges from a given vertex. Conversely the in and inE steps can be used to look for incoming vertices and edges. The outE and inE steps are especially useful when you want to look at the properties of an edge as we shall see in the "Examining the edge between two vertices" section. There are several other steps that we can use when traversing a graph to move between vertices and edges. These include bothE, bothV and otherV. We will encounter those in the "Other ways to explore vertices and edges using both, bothE, bothV and otherV" section.

So let’s use a few examples to help better understand these graph traversal steps. The first query below does a few interesting things. Firstly we find the vertex representing the Austin airport (the airport with a property of code containing the value AUS). Having found that vertex we then go out from there. This will find all of the vertices connected to Austin by an outgoing edge. Having found those airports we then ask for the values of their code properties using the values step. Finally the fold step puts all of the results into a list for us. This just makes it easier for us to inspect the results in the console.

// Where can I fly to from Austin?
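The steps just described could be put together like this, assuming g is the air-routes traversal source:

```groovy
// Find AUS, follow its outgoing edges, read the 'code' of each
// vertex reached and fold all of the codes into a single list.
g.V().has('airport','code','AUS').out().values('code').fold()
```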

Here is what you might get back if you were to run this query in your console.


All edges in a graph have a label. However, one thing we did not do in the previous query was specify a label for the out step. If you do not specify a label you will get back any connected vertex regardless of its edge label. In this case it does not cause us a problem as airports only have one type of outgoing edge, labeled route. However, in many cases, in graphs you create or are working with, your vertices may be connected to other vertices by edges with differing labels so it is good practice to get into the habit of specifying edge labels as part of your Gremlin queries. So we could change our query just a bit by adding a label reference on the out step as follows.

// Where can I fly to from Austin?

Despite having just stated that consistently using edge labels in queries is a good idea, unless you truly do want to get back all edges or all connected vertices, I will break my own rule quite a bit in this book. The reason for this is purely to save space and make the queries I present shorter.

Here are a few more simple queries similar to the previous one. The first example can be used to answer the question "Where can I fly to from Austin, with one stop on the way?". Note that, as written, coming back to Austin will be included in the results as this query does not rule it out!

// Where can I fly to from Austin, with one stop on the way?

This query uses an in step to find all the routes that come into the London City Airport (LCY) and returns their IATA codes.

// What routes come in to LCY?

This query is perhaps a bit more interesting. It finds all the routes from London Heathrow airport in England that go to an airport in the United States and returns their IATA codes.

// Flights from London Heathrow (LHR) to airports in the USA
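A possible form of that query is sketched below. It assumes g is the air-routes traversal source and that airports in the United States carry a country property of US, as the sample AUS output later in this chapter suggests.

```groovy
// From LHR follow outgoing route edges, keep only the airports
// whose country is US and return their IATA codes.
g.V().has('code','LHR').out('route').has('country','US').values('code')
```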

3.3.2. What vertices and edges did I visit? - Introducing path

A Gremlin method (often called a step) that you will see used a lot in this book is path. After you have done some graph walking using a query you can use path to get a summary back of where you went. Here is a simple example of path being used. Throughout the book you will see numerous examples of path being used including in conjunction with by to specify how the path should be formatted. This particular query will return the vertices and outgoing edges starting at the LCY airport vertex. You can read this query like this: "Start at the LCY vertex, find all outgoing edges and also find all of the vertices that are on the other ends of those edges". The inV step gives us the vertex at the other end of the outgoing edge.

// This time, for each route, return both vertices and the edge that connects them.
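Written out, the query described above might look like this (a sketch assuming the air-routes traversal source g):

```groovy
// Start at LCY, step onto each outgoing edge, then onto the vertex at the
// far end of it. path() reports every element visited along the way.
g.V().has('airport','code','LCY').outE().inV().path()
```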

If you run that query as-is you will get back a series of results that look like this. This shows that there is a route from vertex 88 to vertex 77 via an edge with an ID of 13208.


While this is useful, we might want to return something more human readable such as the IATA codes for each airport and perhaps the distance property from the edge that tells us how far apart the airports are. We could add some by modulators to our query to do this. Take a look at the modified query and an example of the results that it will now return. The by modulators are processed in a round robin fashion. So even though there are three values we want to have formatted, we only need to specify two by modulators as both the first and third values are the same. If all three were different, say for example that the third value was a different property like a city name then we would have to provide an explicit by modulator for it. If this is not fully clear yet don’t panic. Both path and by are used a lot throughout this book.
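As a sketch of the round robin behavior, assuming g is the air-routes traversal source and that route edges carry a dist property, the short and long forms could be written as follows. Both should format each path the same way.

```groovy
// Two by modulators for three path elements: 'code' is reused for the third.
g.V().has('airport','code','LCY').outE().inV().path().by('code').by('dist')

// The same query with a by modulator given for every path element.
g.V().has('airport','code','LCY').outE().inV().path().by('code').by('dist').by('code')
```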


When you run this modified version of the query, you will receive a set of results that look like the following line.


Note that the example above is equivalent to this longer form of the same query, in which a by modulator is given for every element in the path. When fewer by modulators are specified than there are elements in the path, Gremlin simply loops back around to the first by modulator and so on.


Sometimes it is necessary to use a by modulator that has no parameter as shown below. This is because the item in the path is not an element containing multiple values but rather a single value, in this case, an integer.


The results show the codes for the airports we visited along with a number representing the number of runways the second airport has.


In Apache TinkerPop version 3.2.5 the ability to limit what is returned by the path step using from and to modulators was added. This enables us to not return the entire path of a traversal but instead to be more selective.

First of all, look at the example below. In this case I have just used the same path constructs used in the prior examples. The query returns the first 10 routes found starting at Austin (AUS) with one stop on the way.


As expected the results show each airport that was visited.


Given that every journey starts in Austin, we might not actually want the AUS airport code to be part of the returned results. We might just want to capture the places that we ended up visiting after leaving Austin. This can be achieved by labelling the parts of the traversal that we care about using as steps and then using from and to modulators to tell the path step what we are interested in. Take a look at the modified version of the query below.
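A sketch of that idea, assuming g is the air-routes traversal source and TinkerPop 3.2.5 or later. The labels a and b are arbitrary names chosen for this illustration.

```groovy
// Label the first and second stops, then ask path for only the
// part of each journey that runs from 'a' to 'b'.
g.V().has('airport','code','AUS').out().as('a').out().as('b').
      path().by('code').from('a').to('b').limit(10)
```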


This time AUS is not included in the path results.


Because after skipping the AUS part of the path we did in fact want the rest of the results we could have left off the to modulator and written the query as follows.


As you can see the results are the same as before.


Obviously there are a lot of ways that from and to can be used. By way of one final example, let’s create a version of the query with three out steps. Note that a bit later we will see how repeat can be used when the same steps need to be used repeatedly like this but that is not important to this specific example.


As expected we now have an additional stop added to each of the journeys.


Let’s now modify the query to limit which parts of the path are returned.


As you can see, only the parts of the journey that we selected have been returned.


We could also have written the query as shown below to only show the results of each path up to a certain point.


This time only the first three airports visited are included in each result.


By way of a side note, in cases like this where more than one of the results is identical, you may want to remove the duplicates. That is where the dedup step is useful. You will find coverage of dedup in the "Removing duplicates - introducing dedup" section. However, as a little taste test, let’s add a dedup step to the end of our previous query and see what happens.



As you can see all of the duplicate results have now been removed. Hopefully this gives you a good basic understanding of the path step. You will see it used a lot throughout the remainder of this book. However, there are a few things to be aware of when using path. Those concerns are explained in the "A warning that path can be memory and CPU intensive" section a bit later.

3.3.3. Does an edge exist between two vertices?

You can use the hasNext step to check if an edge exists between two vertices and get a Boolean (true or false) value back. The first query below will return true because there is an edge (a route) between AUS and DFW. The second query will return false because there is no route between AUS and SYD.
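The two checks might be written along these lines, assuming g is the air-routes traversal source:

```groovy
// Is there a route from AUS to DFW? The text above says this is true.
g.V().has('code','AUS').out('route').has('code','DFW').hasNext()

// Is there a route from AUS to SYD? The text above says this is false.
g.V().has('code','AUS').out('route').has('code','SYD').hasNext()
```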





3.3.4. Using as, select and project to refer to traversal steps

Sometimes it is useful to be able to remember a point of a traversal by giving it a name (label) and refer to it later on in the same query. This ability was more essential in TinkerPop 2 than it is in TinkerPop 3 but it still has many uses. The query below uses an as step to attach a label at two different parts of the traversal, each representing different vertices that were found. A select step is later used to refer back to them.
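As an illustration only, a query of this kind could look like the sketch below. The choice of DFW as the starting point, the US-CA region filter and the from and to label names are assumptions made for this example, and g is the air-routes traversal source.

```groovy
// Label the starting vertex and each destination, then refer
// back to both labels at the end of the traversal.
g.V().has('code','DFW').as('from').out().has('region','US-CA').as('to').
      select('from','to')
```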


This query, while a bit contrived, and in this case probably a poor substitute for using path, returns the following results.


In the example above only the vertices themselves were selected. We can also use a by modulator to specify which property to retrieve from the selected vertices.


This time the results contain the airport codes.


While the prior example was perhaps not ideal, it does show how as and select work. For completeness, here is the same query but using path. You will see both the select and path steps used a lot throughout this book.


Which would produce the following results. Notice that this time the results do not have labels associated with them but are otherwise the same.


While the path step is a lot more convenient, in some cases it can be very expensive in terms of memory and CPU usage so it is worth remembering these alternative techniques using as and select. That topic is discussed in more detail in the "A warning that path can be memory and CPU intensive" section.

You can also give a point of a traversal multiple names and refer to each later on in the traversal/query as shown below.


In the most recent releases of TinkerPop you can also use the new project step and achieve the same results that you can get from the combination of as and select steps. The example below shows the previous query, rewritten to use project instead of as and select.
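To compare the two styles side by side, here is a sketch that assumes g is the air-routes traversal source. The choice of the code and region properties and the outgoing edge count is purely illustrative.

```groovy
// Three names for the same point in the traversal, read back with select.
g.V().hasLabel('airport').limit(10).as('a','b','c').
      select('a','b','c').by('code').by('region').by(out().count())

// The same result using project, with no as steps needed.
g.V().hasLabel('airport').limit(10).
      project('a','b','c').by('code').by('region').by(out().count())
```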


This query, and the prior query, would return the following results.


In the prior example we gave our variables simple names like a and b. However, it is sometimes useful to give our traversal variables and named steps more meaningful names and it is perfectly OK to do that. Let’s rewrite the query to use some more descriptive variable names.


When we run the modified query, here is the output we get.


3.3.5. Using multiple as steps with the same label

It is actually possible using an as step to give more than one part of a traversal the same label (name). In the example below, the label 'a' is used twice but you will notice that when the label is selected only the last item added is returned.



There are some special keywords that can be used in conjunction with the select step in cases like this one. These keywords are first, last and all and their usage is shown below.







Here is another example of a query that labels two different parts of a traversal with the same 'a' label. As you can see from the results, only the second one is used because of the last keyword that is provided on the select step.



Here is the same query but using the first keyword this time as part of the select step.



Note that when the same name is used to label a step, the data structure created by Gremlin is essentially a List. As such, the by modulator cannot be used when the all keyword is used on the select step. To get the values of each element in the list we can use an unfold step as shown below.
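The three keywords could be tried out as sketched below, assuming g is the air-routes traversal source and the Gremlin Console, where first, last and all can be used without a prefix.

```groovy
// Only the item most recently labelled 'a' (the destination)
g.V().has('code','AUS').as('a').out().as('a').limit(10).
      select(last,'a').by('code').fold()

// Only the item first labelled 'a' (always AUS here)
g.V().has('code','AUS').as('a').out().as('a').limit(10).
      select(first,'a').by('code').fold()

// Everything labelled 'a' comes back as a list, so unfold it
// before asking for the property values.
g.V().has('code','AUS').as('a').out().as('a').limit(10).
      select(all,'a').unfold().values('code').fold()
```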



Keywords such as all, first and last are discussed further in the "Important Classes and Enums to be aware of" section later on in the book.

3.3.6. Examining the edge between two vertices

Sometimes, it is the edge between two vertices that we are interested in examining and not the vertices themselves. Typically this is because we want to look at one or more properties associated with that edge. By way of an example, let’s imagine we wanted to know how many miles the flight is between Miami (MIA) and Dallas Fort Worth (DFW). In our air routes graph, the distances between vertices are stored using a property called dist on any edge that has a route label. We can use the outE and inV steps to find the edge connecting Miami and Dallas. We can also use the select and as steps that we just learned about to help with this task. Take a look at the query below. This will find the outgoing route edge from MIA to DFW, store it in the traversal variable e and at the end of the query use select to return it as the result of the query.


If we were to run the query, we would get back something similar to this


So we found the route edge that connects the vertex with an ID of 16 (MIA) with the airport that has an ID of 8 (DFW). While interesting, this is not exactly what we set out to achieve. What we actually are interested in is the distance property of that edge so we can see how far it is from Miami to Dallas Fort Worth. We need to add one additional step to our query that will look at the dist property of the edge. Let’s modify our query to do that.
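With the extra step added, the query might read as follows. This is a sketch assuming g is the air-routes traversal source.

```groovy
// Remember the edge as 'e', check that it leads to DFW, then
// go back to the edge and read its 'dist' property.
g.V().has('code','MIA').outE().as('e').inV().has('code','DFW').
      select('e').values('dist')
```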


If we run the query again we get back what we were looking for. We can see that it is 1,120 miles from Miami to Dallas Fort Worth.


As a side note, we could have written the query using inE and outV and achieved the same result by looking at the edge from Dallas to Miami.



Throughout the remainder of the book you will find lots of examples that use steps such as outE, inE, outV and inV.

3.4. Limiting the amount of data returned

It is sometimes useful, especially when dealing with large graphs, to limit the amount of data that is returned from a query. As shown in the examples below, this can be done using the limit and tail steps. A little later in this book we also introduce the coin step that allows a pseudo random sample of the data to be returned.

// Only return the FIRST 20 results

// Only return the LAST 20 results
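The two cases could be sketched like this, assuming g is the air-routes traversal source and using the airport codes as the values returned:

```groovy
// The first 20 airport codes found
g.V().hasLabel('airport').values('code').limit(20)

// The last 20 airport codes found
g.V().hasLabel('airport').values('code').tail(20)
```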

Depending upon the implementation, it is probably more efficient to write the query like this, with limit coming before values, so that fewer airport vertices are passed along to the values step. It is also possible that an implementation would optimize both forms the same way.

// Only return the FIRST 20 results

Note that limit provides a shorthand alternative to range. The first of the two examples above could have been written as follows.

// Only return the FIRST 20 results

We can also limit a traversal by specifying a maximum amount of time that it is allowed to run for. The following query is restricted to a maximum limit of ten milliseconds. The query looks for routes from Austin (AUS) to London Heathrow (LHR). All the parts of this query are explained in detail later on in this book but I think what they do is fairly clear. The repeat step is explained in detail in the "Shortest paths (between airports) - introducing repeat" section.

// Limit the query to however much can be processed within 10 milliseconds
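A time limited query of this kind might be written as in the sketch below, assuming g is the air-routes traversal source. Placing timeLimit inside the repeat is one possible arrangement, and the results will vary from run to run and from machine to machine.

```groovy
// Keep moving out from AUS until LHR is reached, but stop
// exploring once roughly 10 milliseconds have elapsed.
g.V().has('airport','code','AUS').
      repeat(timeLimit(10).out()).until(has('code','LHR')).path().by('code')
```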

Here is what the query above returned when run on my laptop.


If we give the query another 10 milliseconds to run, so 20 in total, you can see that a few more routes were found.

// Limit the query to 20 milliseconds


3.4.1. Retrieving a range of vertices

Gremlin provides various ways to return a sequence of vertices. We have already seen the limit and range steps used in the previous section to return the first 20 elements of a query result. We can also use the range step to select a different range of vertices by giving a non zero starting offset and an ending offset. The range offsets are zero based, and while the official documentation states that the ranges are inclusive/inclusive it actually appears from my testing that they are inclusive/exclusive.

// Return the first two airport vertices found


The starting value given to a range step does not have to be 0. In the example below we ask for the results at offsets 3, 4 and 5 (that is, the fourth, fifth and sixth results found) by specifying a range of "(3,6)".

// Return the fourth, fifth and sixth airport vertices found (zero based)


Here is an example of how we can use the index -1 to mean "until the end of the list". This is similar to the convention used in many programming languages when working with arrays and lists.

// Return all the remaining vertices starting at the 3500th one
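The three range examples described in this section could be sketched as follows, assuming g is the air-routes traversal source. The inclusive start and exclusive end shown in the comments reflect the behavior reported above.

```groovy
// Offsets 0 and 1: the first two airport vertices found
g.V().hasLabel('airport').range(0,2)

// Offsets 3, 4 and 5: the fourth, fifth and sixth found
g.V().hasLabel('airport').range(3,6)

// Everything from offset 3500 to the end
g.V().hasLabel('airport').range(3500,-1)
```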

Here is another example that uses the range step, this time looking only at vertices with a label of country. Notice how this time we found vertices with much higher ID values.


There is no guarantee as to which airport vertices will be selected as this depends upon how they are stored by the back end graph. Using TinkerGraph the airports will most likely come back in the order they are put into the graph. This is not likely to be the case with other graph stores such as JanusGraph. So do not rely on any sort of expectation of order when using range to process sets of vertices.

In TinkerPop 3.3 a new skip step was introduced which can be used as an alternative to range in some cases. The skip step can be used whenever you would otherwise use range where the second parameter would be -1 meaning "all remaining".

The two examples below will produce the same results.
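One way to see the equivalence, assuming g is the air-routes traversal source, TinkerPop 3.3 or later, and using the airports in Texas purely as an illustrative set:

```groovy
// Skip the first five results and keep the rest
g.V().has('region','US-TX').skip(5).fold()

// The same thing expressed with range
g.V().has('region','US-TX').range(5,-1).fold()
```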



Here is the output you might get from running either query.


To prove that the skip and range steps used above worked, we can run the query again with skip removed and look at the results. You will notice that the first five vertices listed were not included as part of the results from the prior queries.



You can also use the local keyword to have skip work on an incoming collection within a traversal. The example below, while contrived, applies skip to the list generated by the fold step.
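A sketch of skip used with local scope, assuming g is the air-routes traversal source and the Gremlin Console, where local can be used without a prefix. The US-TX region and the value 3 are arbitrary choices for this illustration.

```groovy
// fold() produces a single list. skip(local,3) then drops the
// first three items from inside that list.
g.V().has('airport','region','US-TX').fold().skip(local,3)
```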



There are many other ways to specify a range of values using Gremlin. You will find several additional examples in the "Testing values and ranges of values" section.

3.4.2. Removing duplicates - introducing dedup

It is often desirable to remove duplicate values from query results. The dedup step allows us to do this. If you are already familiar with Groovy collections, the dedup step is similar to the unique method that Groovy provides. In the example below, the number of runways for every airport in England is queried. Note that in the returned results there are many duplicate values.



If we only wanted a set of unique values in the result we could rewrite the query to include a dedup step. This time the query results only include one of each value.



It is also possible to use a by modulator to specify how dedup should be applied. In the example below we only return one airport for each unique number of runways.
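The dedup variations discussed so far could be sketched like this, assuming g is the air-routes traversal source and that airports in England have a region of GB-ENG:

```groovy
// Runway counts with duplicates included
g.V().has('region','GB-ENG').values('runways').fold()

// One of each distinct runway count
g.V().has('region','GB-ENG').values('runways').dedup().fold()

// One airport for each distinct runway count
g.V().has('region','GB-ENG').dedup().by('runways').
      values('code','runways').fold()
```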



There is one more form of the dedup step. In this form, one or more strings representing labelled steps are provided as parameters. Take a look first of all at the query below. It finds vertex V(3) and labels it 'a'. It then finds vertex V(4) and labels it 'c'. Next it finds all the vertices connected to V(4) and labels those 'b'. Only the first 10 are retrieved. Lastly a select step is used to return the results. As expected vertices 3 and 4 are present in all of the results.



Taking the same query but adding a dedup step that references the 'a' and 'c' labels, removes all duplicate references that include those vertices from the results so this time even though a limit of 10 is used we only actually get one result back.



A bit later we will take a look at the concept of local scope when working with traversals. There are some examples of local scope being used in conjunction with dedup in the "Using local scope with collections" section.

It is also possible to use sets to achieve similar results as we shall see in some of the following sections such as the "Introducing toList, toSet, bulkSet and fill" section that is coming up soon.

3.5. Using valueMap to explore the properties of a vertex or edge

A call to valueMap will return all of the properties of a vertex or edge as a map of key:value pairs, basically what in Java terms is called a HashMap. You can also select which properties you want valueMap to return if you do not want them all. Each element in the map can be addressed using the name of the key. By default the ID and label are not included in the map unless a parameter of true is provided.

The query below will return the keys and values for all properties associated with the Austin airport vertex.

// Return all the properties and values the AUS vertex has
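In its simplest form, and assuming g is the air-routes traversal source, the query could be written as:

```groovy
// A map of every property key on the AUS vertex to its value(s)
g.V().has('code','AUS').valueMap()
```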

If you are using the Gremlin console, the output from running the previous command should look something like this.

[country:[US], code:[AUS], longest:[12248], city:[Austin], elev:[542], icao:[KAUS], lon:[-97.6698989868164], type:[airport], region:[US-TX], runways:[2], lat:[30.1944999694824], desc:[Austin Bergstrom International Airport]]
Notice how each key like country is followed by a value that is returned as an element of a list. This is because it is possible (for vertices but not for edges) to provide more than one property value for a given key by encoding them as a list.

Here are some more examples of how valueMap can be used. If a parameter of true is provided, then the results returned will include the ID and label of the element being examined.

// If you also want the ID and label, add a parameter of true

[country:[US],id:3,code:[AUS],longest:[12250],city:[Austin],lon:[-97.6698989868164],type:[airport],elev:[542],icao:[KAUS],region:[US-TX],runways:[2],label:airport,lat:[30.1944999694824],desc:[Austin Bergstrom International Airport]]

You can also mix use of true along with requesting the map for specific properties. The next example will just return the ID, label and region property.

// If you want the ID, label and a specific field like the region, you can do this

If you only need the keys and values for specific properties, it is recommended to pass the names of those properties as parameters to the valueMap step so that it does not return a lot more data than you need. Think of this as the difference, in the SQL world, between selecting just the columns you are interested in from a table and doing a SELECT *.

As shown above, you can specify which properties you want returned by supplying their names as parameters to the valueMap step. For completeness, it is worth noting that you can also use a select step to refine the results of a valueMap.

// You can 'select' specific fields from a value map

[code:[AUS],icao:[KAUS],desc:[Austin Bergstrom International Airport]]
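
The different forms of valueMap described above can be tried without the full air-routes data. The following is only a minimal sketch: it assumes the TinkerGraph classes are on the classpath (they are already imported in the Gremlin Console) and it builds a one-vertex toy graph whose property values are copied from the AUS output shown earlier. The variable names are just for illustration.

```groovy
import org.apache.tinkerpop.gremlin.structure.T
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

// Toy stand-in for the air-routes graph: a single airport vertex
graph = TinkerGraph.open()
graph.addVertex(T.label, 'airport', 'code', 'AUS', 'icao', 'KAUS', 'city', 'Austin')
g = graph.traversal()

// All properties; each value comes back wrapped in a list
everything = g.V().has('code', 'AUS').valueMap().next()

// Only the named properties
subset = g.V().has('code', 'AUS').valueMap('code', 'city').next()

// Build the full map first, then pick keys from it with select
picked = g.V().has('code', 'AUS').valueMap().select('code', 'icao').next()

// unfold turns the single map into one entry per key
entries = g.V().has('code', 'AUS').valueMap().unfold().toList()
entries.each { println it }
```

Passing the property names straight to valueMap avoids building the larger map in the first place, which is why it is usually the better choice when you know which keys you need.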

If you are reading the output of queries that use valueMap on the Gremlin console, it is sometimes easier to read the output if you add an unfold step to the end of the query as follows. The unfold step will unbundle a collection for us. You will see it used in many parts of this book.


desc=[Austin Bergstrom International Airport]

You can also use valueMap to inspect the properties associated with an edge. In this simple example, the edge with an ID of 5161 is examined. As you can see the edge represents a route and has a distance (dist) property with a value of 1357 miles.



3.6. Assigning query results to a variable

It is extremely useful to be able to assign the results of a query to a variable. The example below stores the results of the valueMap call shown above into a variable called aus.

// Store the properties for the AUS airport in the variable aus.
It is necessary to add a call to next to the end of the query in order for this to work. Forgetting to add the call to next is a very commonly made mistake by people getting used to the Gremlin query language. The call to next terminates the traversal part of the query and generates a concrete result that can be stored in a variable. There are other steps such as toList and toSet that also perform this traversal termination action. We will see those steps used later on.

Once you have some results in a variable you can refer to it as you would in any other programming language. For now let’s just use the Groovy println to display the results of the query that we stored in aus. We will take a deeper look at the use of variables with Gremlin, and at mixing Java and Groovy code with your queries, later in the book in the "Making Gremlin even Groovier" section.

// We can now refer to aus using key:value syntax
println "The AUS airport is located in " + aus['city'][0]

The AUS airport is located in Austin
Because properties are stored as lists of values, we still have to add the [0] when referencing a value even if there is only one value for the given key. Otherwise the whole list would be returned if we just used aus['city']. We will explore why property values are stored in this way in the "Attaching multiple values (lists or sets) to a single property" section.

As a side note, the next step can take a parameter value that tells it how much data to return. For example if you wanted the next three vertices from a query like the one below you can add a call to next(3) at the end of the query. Note that doing this turns the result into an ArrayList. Each element in the list will contain a vertex.



We can call the Java getClass method to verify the type of the values returned.


class java.util.ArrayList


class org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerVertex
When using the Gremlin Console, you can check to see what variables you have defined using the command :show variables.
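
To make the role of next a little more concrete, here is a small sketch that contrasts a traversal that has not been terminated with one that has. It assumes the TinkerGraph classes are on the classpath and uses a four-vertex toy graph rather than the air-routes data; the variable names pending and firstThree are made up for this illustration.

```groovy
import org.apache.tinkerpop.gremlin.process.traversal.Traversal
import org.apache.tinkerpop.gremlin.structure.T
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

// Toy graph with four airport vertices
graph = TinkerGraph.open()
[AUS: 'Austin', DFW: 'Dallas', SAT: 'San Antonio', ELP: 'El Paso'].each { code, city ->
    graph.addVertex(T.label, 'airport', 'code', code, 'city', city)
}
g = graph.traversal()

// Without next() the variable holds the traversal itself, not its result
pending = g.V().has('code', 'AUS').valueMap()
println pending instanceof Traversal

// With next() we get the map and can index into it
aus = g.V().has('code', 'AUS').valueMap().next()
println "The AUS airport is located in " + aus['city'][0]

// next(3) returns a list holding up to three results
firstThree = g.V().hasLabel('airport').next(3)
println firstThree.getClass()
```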

3.6.1. Introducing toList, toSet, bulkSet and fill

It is often useful to return the results of a query as a list or as a set. One way to do this is to use toList or toSet methods. Below you will find an example of each. The call to join is used just to make the results easier to read on a single line.

// Create a list of runway counts in Texas
listr = g.V().has('airport','region','US-TX').


Now let’s create a set and observe the different result we get back.

// Create a set of runway counts in Texas (no duplicates)
setr = g.V().has('airport','region','US-TX').


As a side note, in many cases we can use the dedup step to remove duplicates from a result. However, it is worth knowing that a set can be created as a result type as in some cases this can be very useful. The example below performs the same runways query using a dedup step. I added an order step so that it is easier to compare the results with the previous query.

// Create a list of runway counts in Texas (no duplicates)


Finally, let’s create the list again, but without the call to join, as that creates a single string result which is not what we want in this case.

listr = g.V().has('airport','region','US-TX').

The variable can now be used as you would expect.




TinkerPop also provides a third method called toBulkSet that can be used to create a collection at the end of a traversal. The difference between a bulkSet and a set is that a bulkSet is a so-called weighted set. A bulkSet stores every value but includes a count of how many times each value is present. Let’s look at a few examples. First of all we can check that the bulkSet does indeed contain all the values.

setb= g.V().has('airport','region','US-TX').values('runways').toBulkSet().join(',')

A bulkSet offers some additional methods that we can call. One of these is uniqueSize which will tell us how many unique values are present.

setb= g.V().has('airport','region','US-TX').values('runways').toBulkSet()

// How many unique values are in the set?

// How many total values are present?

The asBulk method returns a map of key/value pairs where the key is the number and the value is the number of times that number appears in the set.



There is another way to store the results of a query into a collection. This is achieved using the fill method. Unlike toList and the other methods that we just looked at, fill will store the results into a pre-existing variable. The query below defines a list called a and stores the results of the query into it. This will produce the same result as using toList.

a = []



We can define a variable that is a set and use fill to achieve the same result as using toSet.

s = [] as Set

println s

[2, 7, 5, 3, 4, 1]
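
The four ways of collecting results described in this section can be compared side by side on a small data set. This is a sketch only: it assumes the TinkerGraph classes are on the classpath and uses four made-up runway counts instead of the Texas airports from the air-routes graph.

```groovy
import org.apache.tinkerpop.gremlin.structure.T
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

// Toy graph: four airports, two of which have the same runway count
graph = TinkerGraph.open()
[AUS: 2, DFW: 7, SAT: 2, ELP: 3].each { code, runways ->
    graph.addVertex(T.label, 'airport', 'code', code, 'runways', runways)
}
g = graph.traversal()

listr = g.V().values('runways').toList()      // keeps duplicates
setr  = g.V().values('runways').toSet()       // duplicates removed
setb  = g.V().values('runways').toBulkSet()   // every value plus a count

println setb.uniqueSize()   // number of distinct values
println setb.size()         // total number of values, duplicates included
println setb.asBulk()       // map of value to number of occurrences

// fill stores the results into collections that already exist
a = []
g.V().values('runways').fill(a)
s = [] as Set
g.V().values('runways').fill(s)
```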

3.7. Working with IDs

Every vertex and every edge in a graph has a unique ID that can be used to reference them individually or as a group. Beware that the IDs you provide when loading a graph from a GraphML or GraphSON file will in many cases not be the IDs that the back-end graph store actually uses as it builds up your graph. TinkerGraph, for example, will preserve user provided IDs but most systems like JanusGraph or IBM-Graph generate their own IDs. The same is true when you add vertices and edges using a graph traversal or using the TinkerPop API. This is a long winded way of saying that you should not depend on the IDs in your GraphML or GraphSON file remaining unchanged once the data has been loaded into the graph store. When you add a new vertex or edge to your graph, the graph system will automatically generate a new, unique ID for it. If you need to figure out the ID for a vertex or an edge you can always get it from a query of the graph itself!

Don’t rely on the graph preserving the ID values you provide. Write code that can query the graph itself for ID values.

Especially when dealing with large graphs, because using IDs is typically very efficient, you will find that many of your queries will involve collecting one or more IDs and then passing those on to other queries or parts of the same query. In most if not all cases, the underlying graph system will have set up its data structures, whether on disk or in memory, to be very rapidly accessed by ID value.

Let’s demonstrate the use of ID values using a few simple examples. The query below finds the ID, which is 8, for the vertex that represents the DFW airport.

 // What is the ID of the "DFW" vertex?


Let’s reverse the query and find the code for the vertex with an ID of 8.

// Simple lookup by ID


We could also have written the above query as follows.

// which is the same as this

Here are some more examples that make use of the ID value.

// vertices with an ID between 1 and 5 (note this is inclusive/exclusive)

// Which is an alternate form of this

// Find routes from the vertex with an ID of 6 to any vertex with an ID less than 46

// Which is the same as

You can also pass a single ID or multiple IDs directly into the V() step. Take a look at the two examples below.

// What is the code property for the vertex with an ID of 3?


// As above but for all of the specified vertex IDs


You can also pass a list of ID values into the V step. We take a closer look at using variables in this way in the "Using a variable to feed a traversal" section.



Every property in the graph also has an ID as we shall explore in the "Properties have IDs too" section.
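
The vertex ID lookups discussed in this section can be sketched on a toy TinkerGraph. The example assumes the TinkerGraph classes are on the classpath and relies on the fact, mentioned above, that TinkerGraph preserves user provided IDs. The three vertices and their IDs are chosen purely for illustration.

```groovy
import static org.apache.tinkerpop.gremlin.process.traversal.P.between

import org.apache.tinkerpop.gremlin.structure.T
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

// Toy graph with user provided IDs, which TinkerGraph keeps as given
graph = TinkerGraph.open()
[3: 'AUS', 6: 'BWI', 8: 'DFW'].each { vid, code ->
    graph.addVertex(T.id, vid, T.label, 'airport', 'code', code)
}
g = graph.traversal()

dfwId   = g.V().has('code', 'DFW').id().next()     // property -> ID
byId    = g.V(8).values('code').next()             // ID -> property
byHasId = g.V().hasId(8).values('code').next()     // same lookup using hasId

// between includes the lower bound but not the upper one, so IDs 1 to 6
ranged = g.V().hasId(between(1, 7)).values('code').toList()

// Several IDs at once, either inline or from a list variable
several  = g.V(3, 6, 8).values('code').toList()
ids      = [3, 8]
fromList = g.V(ids).values('code').toList()
```

On a graph store that generates its own IDs the first query is the one to rely on: look the ID up from a property you control and then feed it to the later queries.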

3.8. Working with labels

It’s a good idea when designing a graph to give the vertices and edges meaningful labels. You can use these to help refine searches. For example in the air-routes graph, every airport vertex is labelled airport and every country vertex, not surprisingly, is labelled country. Similarly, edges that represent a flight route are labelled route. You can use labels in many ways. We already saw the hasLabel() step being used in the basic queries section to test for a particular label. Here are a few more examples.

// What label does the LBB vertex have?

// What airports are located in Australia?  Note that 'contains' is an
// edge label and 'country' is a vertex label.

// We could also write this query as follows

By using labels in this way we can effectively group vertices and edges into classes or types. Imagine if we wanted to build a graph containing different types of vehicles. We might decide to label all the vertices just vehicle but we could decide to use labels such as car, truck and bus. Ultimately the overall design of your graph’s data model will dictate the way you use labels but it is good to be aware of their value.

As useful as labels are, in larger graph deployments where indexing technology such as Solr or Elasticsearch is often used to speed up traversing the graph, vertex labels typically do not get indexed. Therefore, it is currently recommended that an actual vertex property that can be indexed is used when walking a graph rather than relying on the vertex label. This is especially important when working with large graphs where performance can become an issue.

Here are a few more examples of ways we can work with labels.

// You can explicitly reference a vertex label using the label() method

// Or using the label key word

// But you would perhaps use the  hasLabel() method in this case instead

// How many non airport vertices are there?

// Again, it might be more natural to actually write this query like this:

The same concepts apply equally well when looking at edge labels as shown below.

// The same basic concepts apply equally to edges



Of course we have already seen another common place where labels might get used, namely in the three parameter form of has, as in the example below. The first parameter is the label value. The next two parameters test the properties of all vertices that have the airport label for a code of "SYD".


3.9. Using the local step to make sure we get the result we intended

Sometimes it is important to be able to do calculations based on the current state of a traversal rather than waiting until near the end. A common place where this is necessary is when calculating the average value of a collection. In the next section we are going to look at a selection of numerical and statistical operations that Gremlin allows us to perform. However, for now let’s use the mean step to calculate the average of something and look at the effect the local step has on the calculation. The mean step works just like you would expect: it returns the mean, or average, value for a set of numbers.

If we wanted to calculate the average number of routes from an airport, the first query that we would write might look like the one below.



As you can see, the answer we got back, 43400.0, looks wrong, and indeed it is. That number is in fact the total number of outgoing routes in the entire graph. This is because, as written, the query counts all of the routes, adds them all up, but does not keep track of how many airports it visited. This means that calling the mean step is essentially the same as dividing the count by one.

So how do we fix this? The answer is to use the local step. What we really want to do is to create, in essence, a collection of values, where each value is the route count for just one airport. Having done that, we want to divide the sum of all of these numbers by the number of members, airports in this case, in the collection.

Take a look at the modified query below.

// Average number of outgoing routes from an airport.


The result this time is a much more believable answer. Notice how this time we placed the out('route').count() steps inside a local step. The query below, with the mean step removed, shows what is happening during the traversal as this query runs. I truncated the output to just show a few lines.



What this shows is that for the first ten airports the collection that we are building up contains one entry for each airport that represents the number of outgoing routes that airport has. Then, when we eventually apply the mean step it will calculate the average value of our entire collection and give us back the result that we were looking for.
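
The effect of local on mean is easy to verify by hand on a graph small enough to count. The sketch below assumes the TinkerGraph classes are on the classpath and uses three hypothetical airports (AAA, BBB and CCC) joined by three routes, so the per-airport counts are 2, 1 and 0. Outside the Gremlin Console the anonymous traversal needs the `__.` prefix used here.

```groovy
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__
import org.apache.tinkerpop.gremlin.structure.T
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

// Toy graph: AAA has two outgoing routes, BBB has one, CCC has none
graph = TinkerGraph.open()
aaa = graph.addVertex(T.label, 'airport', 'code', 'AAA')
bbb = graph.addVertex(T.label, 'airport', 'code', 'BBB')
ccc = graph.addVertex(T.label, 'airport', 'code', 'CCC')
aaa.addEdge('route', bbb)
aaa.addEdge('route', ccc)
bbb.addEdge('route', ccc)
g = graph.traversal()

// count runs once over everything, so mean only ever sees one number
naive = g.V().hasLabel('airport').out('route').count().mean().next()

// local makes count run once per airport, giving mean three numbers
perAirport = g.V().hasLabel('airport').local(__.out('route').count()).toList()
average    = g.V().hasLabel('airport').local(__.out('route').count()).mean().next()

println "naive=${naive} perAirport=${perAirport} average=${average}"
```

Here the first form reports 3, the total number of routes, while the second reports 1, which is (2 + 1 + 0) / 3.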

Let’s look at another example where we can use the local step to change the results of a query in a useful way. First of all, take a look at the query below and the results that it generates. The query first finds all the airports located in Scotland using the region code of GB-SCT. It then collects the airport codes and city names, ordered by airport code, into a single list.


Here are the results from running the query.

[ABZ,Aberdeen,BEB,Balivanich,BRR,Eoligarry,CAL,Campbeltown,DND,Dundee,EDI,Edinburgh,EOI,Eday,FIE,Fair Isle,FOA,Foula,GLA,Glasgow,ILY,Port Ellen,INV,Inverness,KOI,Orkney Islands,LSI,Lerwick,LWK,Lerwick,NDY,Sanday,NRL,North Ronaldsay,PIK,Glasgow,PPW,Papa Westray,PSV,Papa Stour Island,SOY,Stronsay,SYY,Stornoway,TRE,Balemartine,WIC,Wick,WRY,Westray]

However, it would be more convenient perhaps to have the results be returned as a list of lists where each small list contains the airport code and city name with all the small lists wrapped inside a big list. We can achieve this by wrapping the second half of the query inside of a local step as shown below.


Here are the results of running the modified query. I have arranged the results in two columns to aid readability.

[ABZ,Aberdeen]          [LSI,Lerwick]
[BEB,Balivanich]        [LWK,Lerwick]
[BRR,Eoligarry]         [NDY,Sanday]
[CAL,Campbeltown]       [NRL,North Ronaldsay]
[DND,Dundee]            [PIK,Glasgow]
[EDI,Edinburgh]         [PPW,Papa Westray]
[EOI,Eday]              [PSV,Papa Stour Island]
[FIE,Fair Isle]         [SOY,Stronsay]
[FOA,Foula]             [SYY,Stornoway]
[GLA,Glasgow]           [TRE,Balemartine]
[ILY,Port Ellen]        [WIC,Wick]
[INV,Inverness]         [WRY,Westray]
[KOI,Orkney Islands]

There are many other ways that local can be used. You will find examples of those throughout the book. You will see some that show how local can be used as a parameter to the order step when we dig deeper into route analysis in the "Distribution of routes in the graph (mode and mean)" section.

3.10. Basic statistical and numerical operations

The following queries demonstrate concepts such as calculating the amount of a particular item that is present in the graph, calculating the average (mean) of a set of values and calculating a maximum or minimum value. The table below summarizes the available steps.

Table 2. Basic statistical steps

count    Count how many of something exists.

sum      Sum (add up) a collection of values.

max      Find the maximum value in a collection of values.

min      Find the minimum value in a collection of values.

mean     Find the mean (average) value in a collection.

We will dig a bit deeper into some of these capabilities and explain in more detail in the "Distribution of routes in the graph (mode and mean)" section of the book. Some of these examples also take advantage of the local step that was introduced in the previous section. A way of calculating the standard deviation within a data set is presented later in the "Gremlin’s scientific calculator - introducing math" section.
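
Before looking at the air-routes queries, here is a minimal sketch of the same steps applied to numbers that are easy to check by hand. It assumes the TinkerGraph classes are on the classpath; the three airport codes and their runway counts (1, 2 and 4) are invented for the example.

```groovy
import org.apache.tinkerpop.gremlin.structure.T
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

// Toy graph: three airports with 1, 2 and 4 runways
graph = TinkerGraph.open()
[AAA: 1, BBB: 2, CCC: 4].each { code, runways ->
    graph.addVertex(T.label, 'airport', 'code', code, 'runways', runways)
}
g = graph.traversal()

howMany = g.V().hasLabel('airport').count().next()            // number of airports
total   = g.V().hasLabel('airport').values('runways').sum().next()
average = g.V().hasLabel('airport').values('runways').mean().next()
most    = g.V().hasLabel('airport').values('runways').max().next()
fewest  = g.V().hasLabel('airport').values('runways').min().next()

println "count=${howMany} sum=${total} mean=${average} max=${most} min=${fewest}"
```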

// How many routes are there from Austin?

// Sum of values - total runways of all airports

The mean step allows us to find the mean (average) value in a data set.

// Statistical mean (average) value - average number of runways per airport

// Average number of routes to and from an airport

The following queries find maximum and minimum values using the max and min steps.

//maximum value - longest runway

// What is the biggest number of outgoing routes any airport has?

//minimum value - shortest runway

3.11. Testing values and ranges of values

We have already seen some ways of testing whether a value is within a certain range. Gremlin provides a number of different steps that we can use to do range testing. The table below provides a summary of the available steps. We will see each of these in use throughout this book.

Table 3. Steps that test values or ranges of values

eq       Equal to

neq      Not equal to

gt       Greater than

gte      Greater than or equal to

lt       Less than

lte      Less than or equal to

inside   Inside a lower and upper bound, neither bound is included.

outside  Outside a lower and upper bound, neither bound is included.

between  Between two values inclusive/exclusive (upper bound is excluded)

within   Must match at least one of the values provided. Can be a range or a list

without  Must not match any of the values provided. Can be a range or a list

The following queries demonstrate these capabilities being used in different ways. First of all, here are some examples of some of the direct compare steps such as gt and gte being used. The fold step conveniently folds all of the results into a list for us.

// Airports with at least 5 runways

Here is the output that we might get from running the query.


The next three queries show examples of lt, eq and neq being used.

// Airports with less than 3 runways

// How many airports have 3 runways?

// How many airports have anything but just 1 runway?

Note that in some cases, such as when using a simple has step, the eq is not actually required. For example, the query used above could be written as follows instead.


You could also write this query using an is step. You will find the is step used a lot in this book but mostly in conjunction with where steps. To me the usage below does not feel as elegant as the has step alternative used above.

// How many airports have 3 runways?

Here are examples of inside and outside being used.

// Airports with greater than 3 but less than 6 runways.

// Airports with less than  3 or more than 6 runways.

Below are some examples showing within and without being used.

// Airports with at least 3 but not more than 6 runways

// Airports with 1,2 or 3 runways.

// Airports with less than 3 or more than 6 runways.

The between step lets us test the provided value for being greater than or equal to a lower bound but less than an upper bound. The query below will find any airport that has 5,6 or 7 runways. In other words, any airport that has at least 5 but less than 8 runways.

// Airports with at least 5 runways but less than 8

Here is the result of running the query.


As with many queries we may build, there are several ways to get the same answer. Each of the following queries will return the same result. To an extent which one you use comes down to personal preference although in some cases one form of a query may be better than another for reasons of performance.






The values do not have to be numbers. We could also compare strings, for example. The following query finds all airports located in the state of Texas in the United States but only returns their code if the name of the city the airport is located in is not Houston.


This next query can be used to find routes between Austin and Las Vegas. We use a within step to limit the results we get back to just routes that have a plane change in Dallas, San Antonio or Houston airports.


Here is what the query returns. Looks like we can change planes in Dallas or Houston but nothing goes via San Antonio.


Conversely, if we wanted to avoid certain airports we could use without instead. This query again finds routes from Austin to Las Vegas but avoids any routes that go via Phoenix (PHX) or Los Angeles (LAX).


Lastly, this query uses both within and without to limit the previous query to just airports within the United States or Canada. As Austin now has a direct flight to London in England, we probably don’t want to go that way if we are headed to Vegas!


The within and without steps can take a variety of input types. For example, each of these queries will yield the same results.

// Range of values (inclusive, inclusive)

// Explicit set of values

// List of values

You will find more examples of these types of queries in the next section.
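
The boundary behaviour of inside, outside, between, within and without is the part that is easiest to get wrong, so here is a small sketch that makes it visible. It assumes the TinkerGraph classes are on the classpath and uses eight hypothetical airports with 1 through 8 runways. The helper closure runways is made up for this example; it returns the sorted runway counts that satisfy a given test.

```groovy
import static org.apache.tinkerpop.gremlin.process.traversal.P.*

import org.apache.tinkerpop.gremlin.structure.T
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

// Toy graph: airports A1..A8 where airport An has n runways
graph = TinkerGraph.open()
(1..8).each { n ->
    graph.addVertex(T.label, 'airport', 'code', 'A' + n, 'runways', n)
}
g = graph.traversal()

// Hypothetical helper: which runway counts pass the given test?
runways = { p -> g.V().has('runways', p).values('runways').toList().sort() }

println runways(gte(5))           // at least 5
println runways(inside(3, 6))     // neither bound included
println runways(outside(3, 6))    // neither bound included
println runways(between(5, 8))    // lower bound included, upper excluded
println runways(within(3..6))     // a Groovy range includes both ends
println runways(within(3, 4, 5, 6))
println runways(within([3, 4, 5, 6]))
println runways(without(3..6))
```

The three within calls are the range, explicit values and list forms of the same test, so they should all give the same answer.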

3.11.1. Refining flight routes analysis using not, neq, within and without

As we saw in the previous section, it is often useful to be able to specifically include or exclude values from a query. We have already seen a few examples of within and without being used in the section above. The following examples show additional queries that use within and without as well as some examples that use the neq (not equal) and not steps to exclude certain airports from query results.

The following query finds routes from AUS to SYD with only one stop but ignores any routes that stop in DFW.


We could also have written the query using not.


The next query is similar to the one above but adds an and clause to also avoid LAX.


We could also have written the prior query this way, replacing and with without. This approach feels a lot cleaner and it is easy to add more airports to the without test as needed. We will look more at steps like and in the "Boolean operations" section that is coming up soon.

// Flights to Sydney avoiding DFW and LAX

Using without is especially useful when you want to exclude a list of items from a query. This next query finds routes from San Antonio to Salt Lake City with one stop but avoids any routes that pass through a list of specified airports.

// How can I get from SAT to SLC but avoiding DFW,LAX,PHX and JFK ?

In a similar way, within allows us to specifically give a list of things that we are interested in. The query below again looks at routes from SAT to SLC with one stop but this time only returns routes that stop in one of the designated airports.

// From SAT to SLC with a stop in any one of DFW,LAX,PHX or TUS

Here is what the query returns.


Here are two more examples that use without and within to find routes based on countries.

// Flights from Austin to countries outside (without) the US and Canada

Here is the output from running the query.

Mexico City

Here is a twist on the previous query that looks for destinations in Mexico or Canada that you can fly to non stop from Austin.

// Flights from Austin to airports in (within) Mexico or Canada

This is what we get back from our new query.

Mexico City

3.11.2. Using coin and sample to sample a dataset

In any sort of analysis work it is often useful to be able to take a sample, perhaps a pseudo random sample, of the data set contained within your graph. The coin step allows you to do just that. It simulates a biased coin toss. You give coin a value indicating how biased the toss should be. The value should be between 0 and 1, where 0 means there is no chance something will get picked (not that useful!), 1 means everything will get picked (also not that useful!) and 0.5 means there is an even 50/50 chance that an item will get selected.

The following query simply picks airports with a 50/50 coin toss and returns the airport code for the first 20 found.

// Pick 20 airports at random with an evenly biased coin (50% chance).

This next query is similar to the first but takes a subtly different approach. It will select a pseudo random sample of vertices from the graph and for each one picked return its code and its elevation. Note that a very small value of 0.05 (or 5% chance of success) is used for the coin bias parameter this time. This has the effect that only a small number of vertices are likely to get selected but there is a better chance they will come from all parts of the graph and avoids needing a limit step. Of course, there is no guarantee how many airports this query will pick!

// Select some vertices at random and return them with their elevation.

We can see how fairly the coin step is working by counting the number of vertices returned. The following query should always return a count representing approximately half of the airports in the graph.


If all you want is, say 20 randomly selected vertices, without worrying about setting the value of the coin yourself, you can use the sample step instead.
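
The difference between the two steps can be sketched on a toy graph: sample asks for a fixed number of results while coin decides element by element. The example assumes the TinkerGraph classes are on the classpath and uses ten hypothetical airports. Because the selection is random, the codes picked will differ from run to run.

```groovy
import org.apache.tinkerpop.gremlin.structure.T
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

// Toy graph: ten airports A1..A10
graph = TinkerGraph.open()
(1..10).each { n ->
    graph.addVertex(T.label, 'airport', 'code', 'A' + n)
}
g = graph.traversal()

// sample asks for a fixed number of randomly chosen results
picked = g.V().hasLabel('airport').sample(3).values('code').toList()
println picked

// coin tosses once per vertex, so the number that pass varies per run
tossed = g.V().hasLabel('airport').coin(0.5d).values('code').toList()
println tossed

// With a bias of 1 every vertex passes the toss
everything = g.V().hasLabel('airport').coin(1.0d).count().next()
println everything
```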


3.11.3. Using Math.random to more randomly select a single vertex

While the sample step allows you to select one or more vertices at random, in my testing, at least when using a TinkerGraph, it tends to favor vertices with lower index values. So for example, in a test I ran this query 1000 times.


What I found was that I always got back an ID of less than 200. This leads me to believe that the sample(1) call is doing something similar to this


Look at the code below. Even when I run that simple experiment many times, it always gives results similar to these.

(1..10).each { println g.V().hasLabel('airport').sample(1).id().next()}


Given the air routes graph has over 3,300 airport vertices I wanted to come up with a query that was more likely to pick any one airport from across all of the possible airports in the graph. By taking advantage of the Java Math class we can do something that does seem to pick one airport much more randomly. Take a look at the snippets of Groovy/Gremlin code below.

More examples of using variables to store values and other ways to use additional Groovy classes and methods with Gremlin are provided in the "Making Gremlin even Groovier" and "Using a variable to feed a traversal" sections.
// How many airports are there?
numAirports = g.V().hasLabel('airport').count().next()


// Pick a random airport ID
// floor gives 0..numAirports-1, adding 1 maps that onto the IDs 1..numAirports
x = (Math.floor(numAirports*Math.random()) as Integer) + 1


// Get the code for our randomly selected airport


This simple experiment shows that the numbers being generated using the Math.random approach appear to be a lot more evenly distributed across all of the possible airports.

(1..10).each { println((Math.floor(numAirports*Math.random()) as Integer) + 1) }


Note that this approach only works unmodified with the air-routes graph loaded into a TinkerGraph. This is because we know that the TinkerGraph implementation honors user provided IDs and that in the air routes graph, airport IDs begin at one and are sequential with no gaps. However, you could easily modify this approach to work with other graphs without relying on knowing that the index values are sequential. For example you could extract all of the IDs into a list and then select one randomly from that list.

It is likely that this apparent lack of randomness is more specific to TinkerGraph and the fact that it will respect user provided ID values whereas other graph systems will probably store vertices in a more random order to begin with. Indeed, when I ran these queries on JanusGraph the sample step did yield a better selection of airports from across the graph.

If the airport IDs were not all known to be in a sequential order one after the other, we could create a list of all the airport IDs and then select one at random by doing something like this if we wanted to use our Math.random technique.

airports = g.V().hasLabel('airport').id().toList()

numAirports = airports.size


// List indexes run from 0 to numAirports-1, so use floor rather than round
x = Math.floor(numAirports*Math.random()) as Integer






3.12. Sorting things - introducing order

You can use order to sort things in either ascending (the default) or descending order. Note that the sort does not have to be the last step of a query. It is perfectly OK to sort things in the middle of a query before moving on to a further step. We can see examples of that in the first two queries below. Note that the first query will return different results than the second one due to where the limit step is placed. I used fold at the end of the query to collect all of the results into a nice list. The fold step can also do more than this. It provides a way of doing the reduce part of map-reduce operations. We will see some other examples of its use elsewhere in this book such as in the "Using fold to do simple Map-Reduce computations" section.

// Sort the first 20 airports returned in ascending order


As above, but this time perform the limit step after the order step.

// Sort all of the airports in the graph by their code and then return the first 20


Here is a similar example to the previous two. We find all of the places you can fly to from Austin (AUS) and sort the results, as before using the airport’s IATA code, but this time we also include the ICAO code for each airport in the result set.


Here are the results from running the query.


By default a sort performed using order is performed in ascending order. If we wanted to sort in descending order instead we can specify decr as a parameter to order. We can also specify incr if we want to be clear that we intend an ascending order sort.

// Sort the first 20 airports returned in descending order


You can also sort things into a random order using shuffle. Take a look at the example below and the output it produces.



Below is an example where we combine the field we want to sort by, longest, and the direction we want the sort to take, decr, into a single by instruction.

// List the 10 airports with the longest runways in decreasing order.

Here is the output from running the query. To save space I have split the results into two columns.

[code:[BPX],longest:[18045]]    [code:[DOH],longest:[15912]]
[code:[RKZ],longest:[16404]]    [code:[GOQ],longest:[15748]]
[code:[ULY],longest:[16404]]    [code:[HRE],longest:[15502]]
[code:[UTN],longest:[16076]]    [code:[FIH],longest:[15420]]
[code:[DEN],longest:[16000]]    [code:[ZIA],longest:[15092]]

Lastly, let’s look at another way we could have coded the query we used earlier to find the longest runway in the graph. As you may recall, we used the following query. While the query does indeed find the longest runway in the graph, if we wanted to know which airport or airports had runways of that length we would have to run a second query to find them.


Now that we know how to sort things we could write a slightly more complex query that sorts all the airports by longest runway in descending order and returns the valueMap for the first of those. While this query could probably be written more efficiently and also improved to handle cases where more than one airport has the longest runway, it provides a nice example of using order to find an airport that we are interested in.


In the case of the air-routes graph there is only one airport with the longest runway. The runway at the Chinese city of Bangda is 18,045 feet long. The reason the runway is so long is due to the altitude of the airport which is located 14,219 feet above sea level. Aircraft need a lot more runway to operate safely at that altitude!

[country:[CN], code:[BPX], longest:[18045], city:[Bangda], elev:[14219], icao:[ZUBD], lon:[97.1082992553711], type:[airport], region:[CN-54], runways:[1], lat:[30.5536003112793], desc:[Qamdo Bangda Airport]]

3.12.1. Sorting by key or value

Sometimes, when the results of a query are a set of one or more key:value pairs, we need to sort by either the key or the value in either ascending or descending order. Gremlin offers us ways that we can control the sort in these cases. Examples of how this works are shown below.

In TinkerPop 3.3, changes to the syntax were made. The previous keywords valueDecr, valueIncr, keyDecr and keyIncr are now specified using the form by(keys,incr) or by(values,decr) etc.

The following example shows the difference between running a query with and without the use of order to sort using the keys of the map created by the group step.

// Query but do not order


Notice also how local is used as a parameter to order. This is required so that the ordering is done while the final list is being constructed. If you do not specify local then order will have no effect as it will be applied to the entire result which is treated as a single entity at that point.

// Query and order by airport code (the key)
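
A sketch of the unordered and ordered forms is shown below, assuming the air-routes graph. The limit of 10 is only there to keep the output small.

```groovy
// Group airport codes with their runway counts, no ordering applied
g.V().hasLabel('airport').limit(10).
      group().by('code').by('runways')

// Same grouping, but order the resulting map by its keys
g.V().hasLabel('airport').limit(10).
      group().by('code').by('runways').
      order(local).by(keys)

// Make the runway count the key and sort on it in descending order
g.V().hasLabel('airport').limit(10).
      group().by('runways').by('code').
      order(local).by(keys,decr)
```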


In this example we make the numbers of runways the key field and sort on it in descending order.



3.13. Boolean operations

Gremlin provides a set of logical operators such as and, or and not that can be used to form Boolean (true/false) type queries. In a lot of cases I find that or can be avoided by using within for example and that not can be sometimes avoided by using without but it is still good to know that these operators exist. The and operator can sometimes be avoided by chaining has steps together. That said there are always cases where having these boolean steps available is extremely useful.

// Simple example of doing a Boolean AND operation

// Simple example of doing a Boolean OR operation

You can also use and in an infix way as follows so long as you only want to and two traversals together.
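
The three forms discussed so far might be sketched as follows. The airport and ICAO codes are illustrative values that I am assuming exist in the air-routes graph.

```groovy
// Boolean AND: both conditions must hold
g.V().and(has('code','AUS'),has('icao','KAUS')).values('desc')

// Boolean OR: either condition may hold
g.V().or(has('code','AUS'),has('icao','KDFW')).values('desc')

// Infix form of and, joining exactly two traversals
g.V().has('code','AUS').and().has('icao','KAUS').values('desc')
```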


As you would probably expect, an or step can have more than two choices. This one below has four. Also, note that in practice, for this query, using within would be a better approach but this suffices as an example of a bigger or expression.


Using within the example above could be written like this so always keep in mind that using or may not always be the best approach to use for a given query. We will look more closely at the within and without steps in the following section.
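
As a rough sketch of the two styles side by side, assuming the air-routes graph. The four region codes are hypothetical picks that follow the region format used in the graph.

```groovy
// An or step with four choices
g.V().hasLabel('airport').
      or(has('region','US-TX'),has('region','US-LA'),
         has('region','US-AZ'),has('region','US-OK')).
      values('code')

// The same test expressed with a single within predicate
g.V().hasLabel('airport').
      has('region',within('US-TX','US-LA','US-AZ','US-OK')).
      values('code')
```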


This next example uses an and step to find airports in Texas with a runway at least 12,000 feet long.


As with the or step, using and is not always necessary. We could rewrite the previous query as follows.
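
Both styles might look like this, assuming the air-routes graph and that Texas airports carry a region value of US-TX.

```groovy
// Using an explicit and step
g.V().hasLabel('airport').
      and(has('region','US-TX'),has('longest',gte(12000))).
      values('code')

// Equivalent result from chaining has steps
g.V().hasLabel('airport').
      has('region','US-TX').has('longest',gte(12000)).
      values('code')
```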


Gremlin also provides a not step which works as you would expect. This query finds vertices that are not airports.


This previous query could also be written as follows.
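
One possible pair of queries is sketched below. The second form relies on the assumption that every vertex in the air-routes graph has a type property.

```groovy
// Count vertices that do not have the airport label
g.V().not(hasLabel('airport')).count()

// Alternative that tests the 'type' property instead
g.V().has('type',neq('airport')).count()
```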

Depending on the model of your graph and the query itself it may or may not make sense to use the boolean steps. Sometimes, as described above chaining has steps together may be more efficient or using a step like within or without may make more sense.

Boolean steps such as and can also be dot combined as the example below shows. This query finds all the airports that have less than 100 outbound routes but more than 94 and returns them grouped by airport code and route count. Notice how in this case the and step is added to the lt step using a dot rather than having the and be the containing step for the whole test. The results from running the query are shown below as well.



The query we just looked at could also be written as follows but in this case using the and step inline by dot combining it (as above) feels cleaner to me. As you can see we get the same result as before.
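
Here is a sketch of the two styles, assuming the air-routes graph. I have written the containing form as __.and to make it clear that it starts an anonymous traversal.

```groovy
// Inline form: and is dot combined with the predicates inside is()
g.V().hasLabel('airport').
      where(out().count().is(gt(94).and(lt(100)))).
      group().by('code').by(out().count())

// Containing form: and wraps two complete traversals
g.V().hasLabel('airport').
      where(__.and(out().count().is(gt(94)),
                   out().count().is(lt(100)))).
      group().by('code').by(out().count())
```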



As I have pointed out several times already, there are often many ways to write a query that will produce the same result. Here is an example of the previous two queries rewritten to use a between step instead of an and step. Remember that between is inclusive/exclusive, so we have to specify 100 as the upper bound and 95 as the lower bound.



Just for fun, here is the same query but rewritten to use an inside step.
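
The between and inside forms could be sketched like this, assuming the air-routes graph. Both keep airports with 95 through 99 outgoing routes, which is the same as more than 94 and less than 100.

```groovy
// between includes the lower bound and excludes the upper bound
g.V().hasLabel('airport').
      where(out().count().is(between(95,100))).
      group().by('code').by(out().count())

// inside excludes both bounds
g.V().hasLabel('airport').
      where(out().count().is(inside(94,100))).
      group().by('code').by(out().count())
```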



As a side note, if we wanted to reverse the grouping so that the airports were grouped by the counts rather than the codes we could do that as follows.



You can also add additional inline and steps to a query as shown below. Notice that this time AUH and NCE are not part of the result set as they have 97 routes which our new and test eliminates.



The where step was used a lot in the examples above. Hopefully the effect of using it was clear. Nonetheless I will explain in more detail how the where step works in the next section.

3.14. Using where to filter things out of a result

We have already seen the where step used in some of the prior examples. In this section we will take a slightly more focussed look at the where step. The where step is an example of a filter. It takes the current state of a traversal and only allows anything that matches the specified constraint to pass on to any following steps. Where can be used by itself, or as we shall see later, in conjunction with a by modulator or following a match or a choose step.

It is worth noting that some queries that use where can also be written using has instead.

Let’s start by looking at a simple example of how has and where can be used to achieve the same result.

// Find airports with more than five runways

// Find airports with more than five runways
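
A sketch of the two equivalent forms, assuming the air-routes graph. Returning the code property is just my choice for display.

```groovy
// Using has with a predicate
g.V().has('runways',gt(5)).values('code')

// Using where, with the predicate wrapped in an is step
g.V().where(values('runways').is(gt(5))).values('code')
```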

In examples like the one above, both queries will yield the exact same results but the has step feels simpler and cleaner for such cases. Notice how in the where step version the gt predicate has to be placed inside of an is step. The next example starts to show the real power of the where step. We can include a traversal inside of the where step that does some filtering for us. In this case, we first find all vertices that are airports and then use a where step to only keep airports that have more than 60 outgoing routes. Finally we count how many such airports we found.

// Airports with more than 60 unique routes from them
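
Such a query might be written as follows, assuming the air-routes graph.

```groovy
// Keep only airports with more than 60 outgoing routes, then count them
g.V().hasLabel('airport').
      where(out().count().is(gt(60))).
      count()
```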


In our next example, we want to find routes between airports that have an ID of less than 47 but only return routes that are longer than 4,000 miles. Again, notice how we are able to look at the incoming vertex ID values by placing a reference to inV at the start of the where expression. The where step is placed inside of an and step so that we can also examine the dist property of the edge. Finally we return the path as the result of our query.

// Routes longer than 4,000 miles between airports with an ID less than 47
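
One way the query could look is sketched below. It assumes the graph was loaded with numeric vertex IDs, so that the lt comparison is meaningful, and that route edges carry a dist property.

```groovy
g.V().hasId(lt(47)).outE().
      and(where(inV().hasId(lt(47))),
          has('dist',gt(4000))).
      inV().path().by('code').by('dist')
```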

Below is what we get back as a result of running the query.


Sometimes you will be looking for results that match the inverse condition of a where step. One way this can be achieved is to wrap the where step inside of a not step, as shown below. We start in Austin (AUS) and find all outgoing routes. We then look at the incoming contains edges of each destination airport and keep only those airports that are not contained by the country vertex with the United States country code of US.
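
A sketch of this pattern, assuming the air-routes graph. Note that in is a reserved word in Groovy, so it is prefixed with the double underscore when it starts an anonymous traversal.

```groovy
g.V().has('airport','code','AUS').out().
      not(where(__.in('contains').has('code','US'))).
      valueMap('code','city')
```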


As you can see, when we run the query, only destinations outside the United States are returned.

[code:[MEX],city:[Mexico City]]

This pattern comes in useful whenever you want to use a traversal inside of a where step and negate the results. Of course, there are other ways we could write this particular query but I wanted to show an example of this technique being used. For completeness, two simpler ways of writing this query that do not use where at all are shown below but there will be cases where combining not and where is your best option.

g.V().has('airport','code','AUS').
      out().has('country', neq('US')).valueMap('code','city')

g.V().has('airport','code','AUS').
      out().not(has('country', 'US')).valueMap('code','city')

It is also possible to use some special forms of the and and or steps when working with a where step. Take a look at the query below. This will match airports that you can fly to from Austin (AUS) so long as they have more than four runways and do not have exactly six runways.
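
As a sketch, assuming the air-routes graph, the and can be applied directly to the predicates inside the is step.

```groovy
// More than four runways, but not exactly six
g.V().has('airport','code','AUS').out().
      where(values('runways').is(gt(4).and(neq(6)))).
      valueMap('code','runways')
```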


Here is what the query returns. As you can see we only found airports that you can fly to from AUS that have 5,7 or 8 runways.


The same is true for the or step. We could rewrite our query to find airports we can fly to from Austin that have more than six or exactly four runways.


The above is a shorter form of the following query which demonstrates another way we could use the boolean operators within a where step. In this case only two traversals are allowed to be compared using the boolean operator (in this case an or step).
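
Both styles of or might be sketched as follows, assuming the air-routes graph.

```groovy
// Compact form: or applied to the predicates
g.V().has('airport','code','AUS').out().
      where(values('runways').is(gt(6).or(eq(4)))).
      valueMap('code','runways')

// Longer form: an infix or between exactly two traversals
g.V().has('airport','code','AUS').out().
      where(values('runways').is(gt(6)).or().values('runways').is(4)).
      valueMap('code','runways')
```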


Our last example in this section uses a where step to make sure we don’t end up back where we started when looking at airline routes with one stop. We find routes that start in Austin, with one intermediate stop, that then continue to another airport, but never end up back in Austin. A limit step is used to just return the first 10 results that match these criteria. Notice how the as step is used to label the AUS airport so that we can refer to it later in our where step. The effect of this query is that routes such as AUS→DFW→AUS will not be returned but AUS→DFW→LHR will be as it does not end up back in Austin.

// List 10 places you can fly to with one stop, starting at Austin but
// never ending up back in Austin
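
A sketch of how this might be written, assuming the air-routes graph. The label a is simply the name I chose for the starting vertex.

```groovy
g.V().has('airport','code','AUS').as('a').
      out().out().where(neq('a')).
      path().by('code').limit(10)
```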


As you work with Gremlin, you will find that the where step is one that you use a lot. You will see many more examples of where being used throughout the remainder of this book. In the next section we will look at some additional ways that where can be used.

3.14.1. Using where and by to filter results

A new capability was added in the TinkerPop 3.2.4 release that allows a where step to be followed with a by modulator. This makes writing certain types of queries a lot easier than it was before. Hopefully by now, this capability is supported in many TinkerPop enabled graph stores but it is always a good idea to verify the version of TinkerPop supported before starting to design queries.

The query below starts at the Austin airport and finds all the airports that you can fly to from there. A where by step is then used to filter the results to just those airports that have the same number of runways that Austin has. What is really nice about this is that we do not have to know ahead of time how many runways Austin itself has as that is handled for us by the query.

Combining the where and by steps allows you to write powerful queries in a nice and simple way.
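
A minimal sketch of the where and by combination, assuming the air-routes graph and a TinkerPop version of 3.2.4 or later.

```groovy
// Compare each destination with Austin using the runways property
g.V().has('airport','code','AUS').as('a').out().
      where(eq('a')).by('runways').
      valueMap('code','runways')
```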

If you were to run the query in the Gremlin Console, these are the results that you should see. Note that all the airports returned have two runways. This is the same number of runways that the Austin airport has.


The ability to combine where and by steps together allows us to avoid having to write the previous query in more complicated ways such as the one shown below.


There is also a two parameter form of the where step that you may have noticed used above. In this case the first parameter refers to a label defined earlier in the query. Take a look at the example below. We find the vertex for the Austin (AUS) airport and label it 'a'. We then look at all the airports you can fly to from there and label them 'b'. We then use a where step to compare 'a' and 'b'. Only airports with fewer runways than Austin should be returned.
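
The two parameter form could be sketched like this, assuming the air-routes graph. The test reads as "the runways value of a is greater than the runways value of b".

```groovy
g.V().has('airport','code','AUS').as('a').out().as('b').
      where('a',gt('b')).by('runways').
      valueMap('code','runways')
```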


Austin has two runways so only airports with one runway are returned by this query when run.


It is also possible to compare two different properties by adding a second by modulator. This is useful when vertex properties have different key names but may contain the same values. The query below is definitely contrived, and you could achieve the same thing in a simpler way, but it does demonstrate two by modulators being used. The country property of an airport vertex is compared with the code property of a country vertex. The query first finds any airport vertex with a city property containing the string London. Next any connecting country vertices (ones connected by contains edges) are found. The where test compares the country code value of the two vertices. Lastly a select is used to pick the results that we want to return.
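
As a sketch under the assumptions above, the query might look something like this. The labels a, r and b are names I have picked for illustration.

```groovy
g.V().has('airport','city','London').as('a','r').
      in('contains').as('b').
      where('a',eq('b')).by('country').by('code').
      select('a','r','b').by('code').by('region')
```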


When the query is run we get back all the airports with a city name of London along with their region code and country code.


As I mentioned the above query was used just as an example. In reality the following query that does not use any where steps would have sufficed in this case.



Let’s imagine that we want to write a query to find all airports that have the same region code as the French airport of Nice. Let’s assume for now that we do not know what the region code is so we cannot just write a simple query to find all airports in that region. So, instead we need to write a query that will first find the region code for Nice and then use that region code to find any other airports in the region. Finally we want to return the airport code along with the city name and the region code. One way of writing this query is to take advantage of the where…​by construct.

Take a look at the query below and the output it generates.
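
One way to express this with the where…​by construct is sketched here, assuming the air-routes graph and that NCE is the code for Nice. The labels r and a are illustrative names.

```groovy
g.V().has('code','NCE').values('region').as('r').
  V().hasLabel('airport').as('a').values('region').
      where(eq('r')).by().
      local(select('a').values('code','city','region').fold())
```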


[TLN,Toulon/Le Palyvestre,FR-U]

In the sample programs folder you will find a program called GraphRegion.java that shows how to perform the query shown above in a Java program.

There are several things about this query that are interesting. Firstly because we are comparing the results of two values steps we do not provide a parameter to the by step as we do not need to provide a property key. Secondly, we use a second V() step in the query to find all the airports that have the same region code as Nice. Note that this also means that Nice is included in the results. Lastly we wrap the part of the query that prepares the output in a form we want in a local step so that a separate list is created for each airport.

You could write this query other ways, perhaps using a match step but once you understand the pattern used above it is both fairly simple and quite powerful.

3.15. Using choose to write if…​then…​else type queries

The choose step allows the creation of queries that are a lot like the "if then else" constructs found in most programming languages. If we were programming in Java we might find ourselves writing something like the following.

if (longest > 12000) return code;
else return desc;

Gremlin offers us a way to do the same thing. The query below finds all the airports in Texas and then will return the value of the code property if an airport has a runway longer than 12,000 feet otherwise it will return the value of the desc.

// If an airport has a runway > 12,000 feet return its code else return its description
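
A sketch of the choose step used this way, assuming the air-routes graph and that Texas airports carry a region value of US-TX.

```groovy
g.V().has('region','US-TX').
      choose(values('longest').is(gt(12000)),
             values('code'),
             values('desc'))
```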

Here is another example that uses the same constructs.

// If an airport has a code of AUS or DFW report its region else report its country

3.15.1. Including a constant value - introducing constant

Sometimes it is very useful, as the example below demonstrates, to return constant rather than derived values as part of a query. This query will return the string "some" if an airport has fewer than four runways or "lots" if it has four or more.

// You can also return constants using the constant() step
The constant step can be used to return a constant value as part of a query.
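
Such a query might be sketched as follows, assuming the air-routes graph. Restricting the search to Texas is just my way of keeping the output short.

```groovy
// Fewer than four runways gives 'some', anything else gives 'lots'
g.V().has('region','US-TX').
      choose(values('runways').is(lt(4)),
             constant('some'),
             constant('lots'))
```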

Here is one more example that uses a sample step to pick 10 airports and then return either "lots" or "not so many" depending on whether the airport has more than 50 routes or not. Note also how as and select are used to combine both the derived and constant parts of the query result that we will ultimately return.

             constant('not so many')).as('b').
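
Pieced together from the description above, the complete query might look something like the following sketch. It assumes the air-routes graph and that flights are modelled as edges labelled route.

```groovy
g.V().hasLabel('airport').sample(10).as('a').
      choose(out('route').count().is(gt(50)),
             constant('lots'),
             constant('not so many')).as('b').
      select('a','b').by('code').by()
```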

Here is an example of what the output from running this query might look like.

[a:YYT,b:not so many]
[a:YEG,b:not so many]
[a:BLR,b:not so many]
[a:TSF,b:not so many]

We could go one step further if you don’t want the a: and b: keys returned as part of the result by adding a select(values) to the end of the query as follows.

             constant('not so many')).as('b').

Here is the output from the modified form of the query.

[YYT,not so many]
[YEG,not so many]
[BLR,not so many]
[TSF,not so many]

The constant step is not limited to use within choose steps. It can be used wherever needed. You will find many examples of its use throughout this book including in the "Using union to combine query results" section.

3.16. Using option to write case/switch type queries

When option is combined with choose the result is similar to the case or switch style construct found in most programming languages. In Java for example, we might code a switch statement as follows.

    // code is assumed to hold the airport code being tested
    switch (code) {
    case "DFW": System.out.println(desc); break;
    case "AUS": System.out.println(region); break;
    case "LAX": System.out.println(runways);
    }

We can write a Gremlin query that follows the same pattern as our Java switch statement. As in the Java example, I decided to lay the query out across multiple lines to aid readability and clarity.

// You can combine choose and option to generate a more "case statement" like query
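
A sketch that mirrors the Java switch statement, assuming the air-routes graph.

```groovy
g.V().hasLabel('airport').
      choose(values('code')).
        option('DFW',values('desc')).
        option('AUS',values('region')).
        option('LAX',values('runways'))
```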

The example below shows a choose followed by four options. Note the default case of none is used as the catchall. Notice how in this case, the values returned are constants.

// You can return constant values if you need to
                                      option(1,constant('just one')).
                                      option(2,constant('a couple')).
                                      option(none,constant('quite a few'))

3.17. Using match to do pattern matching

The match step was added in TinkerPop 3 and allows a more declarative style of pattern based query to be expressed using Gremlin. Match can be a bit hard to master but once you figure it out it can be a very powerful way of traversing a graph looking for specific patterns.

Below is an example that uses match to look for airline route patterns where there is a flight from one airport to another but no return flight back to the original airport. The first query looks for such patterns involving the JFK airport as the starting point. You can see the output from running the query below it. This is the correct answer as, currently, the British Airways Airbus A318 flight from London City (LCY) airport stops in Dublin (DUB) to take on more fuel on the way to JFK but does not need to stop on the way back because of the trailing wind.

// Find any cases of where you can fly from JFK non stop
// to a place you cannot get back from non stop. This query
// should return LCY, as the return flight stops in Dublin
// to refuel.
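
The pattern might be expressed as in the sketch below, assuming the air-routes graph. The labels s and d, for source and destination, are names chosen for illustration.

```groovy
g.V().has('code','JFK').
      match(__.as('s').out().as('d'),
            __.not(__.as('d').out().as('s'))).
      select('s','d').by('code')
```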



We can expand the query by leaving off the specific starting point of JFK and look for this pattern anywhere in the graph. This really starts to show how important and useful the match Gremlin step is. We don’t have any idea what we might find, but by using match, we are able to describe the pattern of behavior that we are looking for and Gremlin does the rest.

// Same as above but from any airport in the graph.

If you were to run the query you would find that there are in fact over 200 places in the graph where this situation applies. We can add a count to the end of the query to find out just how many there are.

// How many occurrences of the pattern in the graph are there?
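
A sketch of the counting form follows. I have restricted it to airport vertices and route edges on the assumption that other vertex and edge types in the graph should not take part in the pattern.

```groovy
g.V().hasLabel('airport').
      match(__.as('s').out('route').as('d'),
            __.not(__.as('d').out('route').as('s'))).
      count()
```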


The next query looks for routes that follow the pattern A→B→C but where there is no direct flight of the form A→C. In other words it looks for all routes between two airports with one intermediate stop where there is no direct flight alternative available. Note that the query also eliminates any routes that would end up back at the airport of origin. To achieve the requirement that we not end up back where we started, a where step is included to make sure we do not match any routes of the form A→B→A.
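
One way to write this pattern is sketched below, assuming the air-routes graph with flights modelled as route edges. The limit of 10 is only there to keep the output manageable.

```groovy
g.V().hasLabel('airport').
      match(__.as('a').out('route').as('b'),
            __.as('b').out('route').where(neq('a')).as('c'),
            __.not(__.as('a').out('route').as('c'))).
      select('a','b','c').by('code').limit(10)
```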


There are, of course a lot of places in the air-routes graph where this pattern can be found. Here are just a few examples of the results you might get from running the query.


Here is another example of using match along with a where. This is actually a different way of writing a query we saw earlier. This query starts out by looking at how many runways Austin has and then looks at every airport that you can fly to from Austin and then looks at how many runways those airports have. Only airports with the same number as Austin are returned. Using a match step for this task is overkill. However, it does show the basic constructs used by the match step and again illustrates using values calculated in one part of a query later on in that same query.


As I mentioned, the example above is not the best way to write this query and it can be done without using a match step at all and just using a where step as shown in the three examples below. Each one is simpler than its predecessor.

One way we could choose to write this query is using multiple select steps. This is also not a very efficient solution but does work.


A better way than either of the prior two combines a filter step with the where and select steps.


As mentioned in the "Using where and by to filter results" section, this query can be simplified further using a where step and a by modulator. This capability was introduced in the TinkerPop 3.2.4 release.


So, while there is often a simpler way to write a query that avoids using the match step, for some queries, especially in more complex cases, it provides a useful and powerful way to express in a more declarative way, a set of criteria that must be met. However, before I resort to using a match step I always think carefully about other ways that I could write the query that might be simpler. I do this because the syntax of the match step can be tricky to get right without a fair bit of trial and error in my experience.

3.18. Using union to combine query results

The union step works just as you would expect from its name. It allows us to combine parts of a query into a single result. Just as with the boolean and and or steps it is sometimes possible to find other ways to do the same thing without using union but it does offer some very useful capability.

Here is a simple example that uses a union step to produce a list containing a vertex and the number of outgoing routes from that vertex. Note that in the next section we will see that there are simpler ways to write this query while still using a union step. The main point to take away from this example is that you can use a union step to combine the results of multiple traversals. This example combines the results of two traversals but you can certainly combine more as needed. Note that the out step starts from the vertex that was found immediately before the union step which in this case is the DFW vertex. So in other words the output from the prior step is available to the steps within the union step just as with other Gremlin steps we have already looked at.
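
A sketch of such a query, assuming the air-routes graph. The label a is just a name chosen for the DFW vertex.

```groovy
// Combine the DFW vertex itself with its outgoing route count
g.V().has('airport','code','DFW').as('a').
      union(select('a'),out().count()).fold()
```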



Not that this is recommended, but the previous query could also be written as follows using two has steps both inside a single union step. This does however demonstrate that you can use a union step to combine the results of fairly arbitrary graph traversals.



As a side note, instead of using a union step and producing a list, we might decide to use a group step and produce a map. A map might be preferable if you want to access individual keys and values directly. It all depends, as always, on the results that best fit the problem you are solving.



3.18.1. Introducing the identity step

Gremlin has an identity step that we have not seen used so far in this book. The identity step simply returns the entity that was passed in to the current step of a traversal (in this case union) from the prior step. We can rewrite the query we used above to use an identity step. This simplifies the query as it removes the need to use the as and select steps. As shown below, using identity causes the vertex V[8] representing the DFW airport from the prior has step to be included in the result.
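
Using identity, the query might be sketched as follows, assuming the air-routes graph.

```groovy
g.V().has('airport','code','DFW').
      union(identity(),out().count()).fold()
```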



3.18.2. Using constant values as part of a union

We have already seen the constant step used in the "Including a constant value - introducing constant" section. As you might expect, you can also use constant steps within a union step as the two examples below show.



The identity step that was just introduced above could be used to add the V[3] vertex to the result. We are now combining three traversal steps together inside of the union step.



Finally, let’s change the query again to include a city name in the result. Note that the values step refers to the property of the vertex that was referenced immediately before the union step so it will return the city property of vertex V[3].
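
The progression described in this section could be sketched as follows. The two strings are placeholder values and the use of fold to gather the results is my assumption.

```groovy
// Two constants
g.V(3).union(constant('Hello'),constant('There')).fold()

// Add the incoming vertex using identity
g.V(3).union(constant('Hello'),constant('There'),identity()).fold()

// Add a property of the incoming vertex instead
g.V(3).union(constant('Hello'),constant('There'),values('city')).fold()
```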



3.18.3. More examples of the union step

The following query uses a sample step to select 10 airports at random from the graph. For each selected airport, a union step is then used to combine the id of the vertex with a few properties. Note that local scope is used so that the results of each union step are folded into a list.
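
A sketch of this query, assuming the air-routes graph. The choice of the code and city properties is based on the output shown in this section.

```groovy
g.V().hasLabel('airport').sample(10).
      local(union(id(),values('code','city')).fold())
```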


Here is the output I got back from running the query.

[83,RSW,Fort Myers]
[26,SAN,San Diego]
[136,MEX,Mexico City]
[44,SAF,Santa Fe]

If local scope had not been used, the result would have been a single list containing all of the results as shown below.



By way of another simple example, the following query returns flights that arrive in AUS from the UK or that leave AUS and arrive in Mexico.

// Flights to AUS from the UK or from AUS to Mexico
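
This might be written as in the sketch below. It assumes that airports carry a country property and that UK and MX are the values used for the United Kingdom and Mexico.

```groovy
g.V().has('airport','code','AUS').
      union(__.in('route').has('country','UK'),
            out('route').has('country','MX')).
      path().by('code')
```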

When we run that query, we get the following results showing that there are routes from LHR in the UK and to the three airports MEX, CUN and GDL in Mexico.


This query solves the problem "Find all routes that start in London, England and end up in Paris or Berlin with no stops". Because city names are used and not airport codes, all airports in the respective cities are considered.
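
A sketch of the union form is shown here. The region value GB-ENG is an assumption about how airports in England are tagged in the air-routes graph.

```groovy
g.V().has('airport','city','London').has('region','GB-ENG').
      union(out().has('city','Paris'),
            out().has('city','Berlin')).
      path().by('code')
```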


Here are the results from running the query. Note that routes from five different London airports were found.


As mentioned previously, sometimes, especially for fairly simple queries, there are alternatives to using union. Indeed, it is actually not necessary to use a union step to achieve the prior result. The re-written version of the query below will return the same results as the version that uses a union step. This time a simple has step featuring a within predicate is used instead.
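
The within based version could be sketched like this, again assuming that GB-ENG is the region value for airports in England.

```groovy
g.V().has('airport','city','London').has('region','GB-ENG').
      out().has('city',within('Paris','Berlin')).
      path().by('code')
```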


The previous two queries used a path step to essentially show each individual route. We can adjust the version of the query that uses a union step just a little bit and instead turn the result into a series of lists where the first item in the list is the origin airport and the remaining items in the list are the places you can fly to from there within our criteria. The double underscore "__" is used in the way that identity could have been used to refer to the incoming vertex. The union step is wrapped in a local step so that each union is individually folded. If the local step was omitted all of the results would be folded into a single list. In this case, the union step makes writing the query relatively easy and this is probably a good example of where the union step should be used. Namely, when you want to combine multiple traversal results.


Running the amended query shows us the results in perhaps a more useful form.


3.18.4. Using union to combine more complex traversal results

So far the examples we have looked at mostly show fairly simple traversals being used inside of a union step. This next query is a bit more interesting. We again start from any airport in London, but then we want routes that meet any of the criteria:

  • Go to Berlin and then to Lisbon

  • Go to Paris and then Barcelona

  • Go to Edinburgh and then Rome

We also want to return the distances in each case. Note that you can union together as many items as you need to. In this example we combine the results of three sets of traversals to get the desired results.

// Returns any paths found along with the distances between airport pairs.
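
One possible shape for this query is sketched below, assuming the air-routes graph with a dist property on each route edge. The edges are walked explicitly so that they appear in the path.

```groovy
g.V().has('airport','city','London').
      union(outE().inV().has('city','Berlin').
              outE().inV().has('city','Lisbon'),
            outE().inV().has('city','Paris').
              outE().inV().has('city','Barcelona'),
            outE().inV().has('city','Edinburgh').
              outE().inV().has('city','Rome')).
      path().by('code').by('dist')
```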


Here is what we get back when we run our query.


The next query finds the total distance of all routes from any airport in Madrid to any airport anywhere and also does the same calculation but minus any routes that end up in any Paris airport. We have not yet seen the filter step that is used below. It is one of the foundational Gremlin steps that many others such as where build upon. A filter step will only pass on to the next step in the query incoming elements that meet the criteria specified within the filter.
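
As a sketch, assuming the air-routes graph with route edges that have a dist property, the two sums might be computed in a single union.

```groovy
g.V().has('airport','city','Madrid').outE('route').
      union(values('dist').sum(),
            filter(inV().has('city',neq('Paris'))).
              values('dist').sum())
```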


Here is the output from running the query. As you can see the first number is slightly larger than the second as all routes involving Paris have been filtered out from the calculation.


It is worth noting that it is not required that every traversal inside of a union step returns a result. The returned results will include any of the traversals that did return something. The example below demonstrates this. Of course in practice you would not write this particular query this way. However, I think this example demonstrates a feature of the union step that it is important to understand.
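
A sketch of such a union follows. LHR and DFW are illustrative destinations that I am assuming can be reached from Austin in the air-routes graph.

```groovy
g.V().has('code','AUS').
      union(out().has('code','LHR'),
            out().has('code','SYD'),
            out().has('code','DFW')).
      values('code')
```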


If we run the query, you will see that SYD is not part of the results as there is no route between Austin and Sydney.


For completeness, this query would more likely be written as follows rather than using a union.



3.19. Using sideEffect to do things on the side

The sideEffect step allows you to do some additional processing as part of a query without changing what gets passed on to the next stage of the query. The example below finds the airport vertex V(3) and then uses a sideEffect to count the number of places that you can fly to from there and stores it in a traversal variable named a before counting how many places you can get to with one stop and storing that value in b. Note that there are other ways we could write this query but it demonstrates quite well how sideEffect works.
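
This might be sketched as follows, assuming the air-routes graph. The use of store to save the first count and of select to display both values is my assumption about how the query is put together.

```groovy
g.V(3).sideEffect(out().count().store('a')).
       out().out().count().as('b').
       select('a','b')
```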



Later in the book we will discuss lambda functions, sometimes called closures and how they can be used. The example below combines a closure with a sideEffect to print a message before displaying information about the vertex that was found. Again notice how the sideEffect step has no effect on what is seen by the subsequent steps. You can see the output generated below the query.

g.V().has('code','SFO').sideEffect{println "I'm working on it"}.values('desc')

I'm working on it
San Francisco International Airport

Later in the book we will look at other ways that side effects can be used to solve more interesting problems.

3.20. Using aggregate to create a temporary collection

At the time of writing this book, there were 59 places you could fly to directly (non stop) from Austin. We can verify this fact using the following query.



If we wanted to count how many places we could go to from Austin with one stop, we could use the following query. The dedup step is used as we only want to know how many unique places we can go to, not how many different ways of getting to all of those places there are.



There is however a problem with this query. The 871 places is going to include (some or possibly all of) the places we can also get to non stop from Austin. What we really want to find are all the places that you can only get to from Austin with one stop. So what we need is a way to remember all of those places and remove them from the 871 somehow. This is where aggregate is useful. Take a look at the modified query below.
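
The three queries discussed in this section could be sketched as follows, assuming the air-routes graph. The collection name nonstop is simply a label of my choosing.

```groovy
// Nonstop destinations from Austin
g.V().has('code','AUS').out().count()

// Unique destinations reachable with one stop
g.V().has('code','AUS').out().out().dedup().count()

// One stop destinations that are not also nonstop destinations
g.V().has('code','AUS').out().aggregate('nonstop').
      out().where(without('nonstop')).dedup().count()
```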



After the first out step, all of the vertices that were found are stored in a collection I chose to call nonstop. Then, after the second out we can add a where step that essentially says "only keep the vertices that are not part of the nonstop collection". We still do the dedup step as otherwise we will still end up counting a lot of the remaining airports more than once.

Notice that 812 is precisely 59 less than the 871 number returned by the previous query which shows the places you can get to non stop were all correctly removed from the second query. This also tells us there are no flights from Austin to places that you cannot get to from anywhere else!

We will take a more in depth look at the various types of collections that you can use as part of a Gremlin query in the "Collections revisited" section a bit later.

3.21. Using inject to insert values into a query

Sometimes you may want to add something additional to be returned along with the results of your query. This can be done using the inject step. To start off, here is a simple example showing how inject fundamentally works. We insert some numbers and ask Gremlin to give us the mean value.
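
A minimal sketch of such a query is shown below. The five numbers are arbitrary values picked for illustration, and g can be any traversal source as no graph data is touched.

```groovy
// Inject five numbers into the traversal and ask for their mean
avg = g.inject(1,2,3,4,5).mean().next()
```

The mean of those five values is 3.0.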



Of course just using inject so we can do a simple mathematical computation is of limited use. The next example shows how inject can be used as part of a query. The string ABIA, another acronym commonly used when referring to the Austin Bergstrom International Airport, is injected into the query.


If we were to run the query, here is what we would get back

Austin Bergstrom International Airport

3.22. Using coalesce to see which traversal returns a result

Sometimes you have a set of traversals and are uncertain which of them will return a result. The coalesce step evaluates the traversals that you specify in order, and the result of the first one that returns anything is the one that is returned to your query.

Look at the example below. Starting from the vertex with an ID of 3, it uses coalesce to first see if there are any outgoing edges with a label of fly. If there are any, the vertices connected to those edges will be returned. If there are none, the vertices at the other end of any incoming edges labelled contains will be returned.

// Return the first step inside coalesce that returns a vertex
g.V(3).coalesce(out('fly'),__.in('contains')).valueMap()

As there are not any edges labelled fly in the air-routes graph, the second traversal will be the one whose results are returned.

If we were to run the above query using the air-routes graph, this is what would be returned.

[code:[NA],type:[continent],desc:[North America]]
[code:[US],type:[country],desc:[United States]]

We can put more than two traversals inside of a coalesce step. In the following example there are now three. Because some contains edges do exist for this vertex, the route edges will not be looked at as the traversals are evaluated in left to right order.
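
A sketch of what a three traversal version might look like is shown below, assuming the air-routes graph is loaded with g as its traversal source. The `__.` prefix is needed because in is a reserved word in Groovy.

```groovy
// Traversals are tried left to right. There are no 'fly' edges, and the
// 'contains' traversal returns results, so out('route') is never used.
g.V(3).coalesce(out('fly'), __.in('contains'), out('route')).valueMap()
```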


As we can see the results returned are still the same.

[code:[NA],type:[continent],desc:[North America]]
[code:[US],type:[country],desc:[United States]]

3.22.1. Combining coalesce with a constant value

The coalesce step can also be very useful when combined with a constant value. In the example below if the airport is in Texas then its description is returned. If it is not in Texas, the string "Not in Texas" is returned instead.

g.V(1).coalesce(has('region','US-TX').values('desc'),constant("Not in Texas"))

Not in Texas

g.V(3).coalesce(has('region','US-TX').values('desc'),constant("Not in Texas"))

Austin Bergstrom International Airport

A bit later in the "Using coalesce to only add a vertex if it does not exist" section we will again use coalesce to check to see if a vertex already exists before we try to add it.

3.23. Returning one of two possible results - introducing optional

Sometimes it may be useful to return one of two results depending upon the outcome of an attempted traversal. The optional step will return either the results of the provided traversal if there is a result or the result of the prior step if there is no result.

In the example below, there is no direct route between Austin (AUS) and Sydney (SYD) so the Austin vertex is returned by the optional step.



However, there is a route between Austin and Dallas Fort Worth (DFW) so as the example below shows, this time the optional step returns the DFW vertex.



Note that the previous queries behave in the same way that the coalesce step would behave if used as shown below. In this case, an identity step is used to return the prior vertex if the provided traversal does not return a result.
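
The equivalence can be sketched as follows, assuming the air-routes graph is loaded with g as its traversal source. The variable names viaOptional and viaCoalesce are hypothetical and used only so the two results can be compared.

```groovy
// optional: if no direct route to SYD is found, the AUS vertex itself comes back
viaOptional = g.V().has('code','AUS').
                optional(out().has('code','SYD')).values('code').toList()

// coalesce: identity() is the fallback that returns the prior (AUS) vertex
viaCoalesce = g.V().has('code','AUS').
                coalesce(out().has('code','SYD'), identity()).values('code').toList()
```

Both lists should contain the same airport code.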





3.24. Other ways to explore vertices and edges using both, bothE, bothV and otherV

We have already looked at examples of how you can walk a graph and examine vertices and edges using steps such as out, in, outE and inE. In this section we introduce some additional ways to explore vertices and edges.

As a quick recap, we have already seen examples of queries like the one below that simply counts the number of outgoing edges from the vertex with an ID of 3.



Likewise, this query counts the number of incoming edges to that same vertex.



The following query introduces the bothE step. What this step does is return all of the edges connected to this vertex whether they are outgoing or incoming. As we can see the count of 120 lines up with the values we got from counting the number of outgoing and incoming edges. We might want to retrieve the edges, as a simple example, to examine a property on each of them.
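
The three counts discussed so far can be sketched as follows, assuming the air-routes graph is loaded with g as its traversal source. The variable names are mine and chosen only for illustration.

```groovy
outCount  = g.V(3).outE().count().next()   // outgoing edges only
inCount   = g.V(3).inE().count().next()    // incoming edges only
bothCount = g.V(3).bothE().count().next()  // edges in either direction
```

The value of bothCount should equal outCount plus inCount.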



If we wanted to return vertices instead of edges, we could use the both step. This will return all of the vertices connected to the vertex with an ID of 3 regardless of whether they are connected by an outgoing or an incoming edge.



This next query can be used to show us the 120 vertices that we just counted in the previous query. I sorted the results and used fold to build them into a list to make the results easier to read. Note how vertex 3 is not returned as part of the results. This is important because, as we shall see in a few examples' time, this is not always the case.



You probably also noticed that most of the vertices appear twice. This is because for most air routes there is an outgoing and an incoming edge. If we wanted to eliminate any duplicate results we can do that by adding a dedup step to our query.



We can do another count using our modified query to check we got the expected number of results back.



There is a similar set of things we can do when working with edges using the bothV and otherV steps. The bothV step returns the vertices at both ends of an edge and the otherV step returns the vertex at the other end of the edge. This is relative to how we are looking at the edge.

The query below starts with our same vertex with the ID of 3 and then looks at all the edges no matter whether they are incoming or outgoing and retrieves all of the vertices at each end of those edges using the bothV step. Notice that this time our count is 240. This is because for every one of the 120 edges, we asked for the vertex at each end so we ended up with 240 of them.



We can again add a dedup step to get rid of duplicate vertices as we did before and re-do the count but notice this time we get back 62 instead of the 61 we got before. So what is going on here?



Let’s run another query and take a look at all of the vertices that we got back this time.



Can you spot the difference? This time, vertex 3 (v[3]) is included in our results. This is because we started out by looking at all of the edges and then asked for all the vertices connected to those edges. Vertex 3 gets included as part of that computation. So beware of this subtle difference between using both and the bothE().bothV() pattern.

Let’s rewrite the queries we just used again but replace bothV with otherV. Notice that when we count the number of results we are back to 61 again.


So let’s again look at the returned vertices and see what the difference is.



As you can see, when we use otherV we do not get v[3] returned as we are only looking at the other vertices relative to where we started from, which was v[3].
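
To summarize the difference, here is a small sketch that compares the three patterns side by side. It assumes the air-routes graph is loaded with g as its traversal source, and the variable names are hypothetical.

```groovy
// Unique neighbors found by walking straight to the adjacent vertices
viaBoth   = g.V(3).both().dedup().count().next()

// Both ends of every edge, so the starting vertex v[3] is included
viaBothV  = g.V(3).bothE().bothV().dedup().count().next()

// Only the far end of every edge, so v[3] is not included
viaOtherV = g.V(3).bothE().otherV().dedup().count().next()
```

The bothV count should be exactly one higher than the other two, the extra vertex being v[3] itself.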

3.25. Shortest paths (between airports) - introducing repeat

Gremlin provides a repeat…​until looping construct similar to those found in many programming languages. This gives us a nice way to perform simple shortest path type queries. We can use a repeat…​until loop to look for paths between two airports without having to specify an explicit number of out steps to try.

While performing such computations, we may not want paths we have already travelled to be travelled again. We can ask for this behavior using the simplePath step. Doing so will speed up queries that do not need to travel the same paths through a graph multiple times. Without the simplePath step being used, the query we are about to look at could take a lot longer. The addition of a limit step is also important as without it this query will run for a LONG time looking for every possible path!

The query below looks for routes between Austin (AUS) and Agra (AGR). An important query for those Austinites wanting to visit the Taj Mahal!

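// What are some of the ways to travel from AUS to AGR?
g.V().has('code','AUS').repeat(out().simplePath()).until(has('code','AGR')).path().by('code').limit(10)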

Here are the results from running the query. Notice how, using the repeat…​until construct we did not have to specify how many steps to try.


You can also place the until before the repeat as shown below.

// Another shortest path example using until...repeat instead

Here are the results from running the query.


We can also specify an explicit number of out steps to try using a repeat…​times loop but this of course assumes that we know ahead of time how many stops we want to look for between airports. In the next section we will introduce the emit step that gives you more control over the behavior of what is returned from repeat loops.



The previous query is equivalent to this next one but doing it this way is less flexible in that we cannot as easily vary the number of out steps should we, for example, next want to try five hops instead of the current two.


As is often the case when working with Gremlin there is more than one way to achieve the same result. The loops step, that can be used to control how long a repeat loop runs, is essentially equivalent to the times step. Take a look at the two queries below. Both achieve the same result. The first uses loops while the second uses times. I prefer the readability offered by the use of times.
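
A minimal sketch of the two forms is shown below, assuming the air-routes graph is loaded with g as its traversal source. The choice of Austin and of two hops is only for illustration.

```groovy
// Keep looping until the loop counter reaches 2
usingLoops = g.V().has('code','AUS').repeat(out()).until(loops().is(2)).count().next()

// The same thing expressed using times
usingTimes = g.V().has('code','AUS').repeat(out()).times(2).count().next()
```

Both queries walk exactly two hops out from the starting vertex, so the two counts should match.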





In the next section, we will look at how emit can be used to adjust the behavior of a repeat…​times loop.

3.25.1. Using emit to return results during a repeat loop

Sometimes it is useful to be able to return the results of a traversal as it executes. The example below starts at the Santa Fe airport (SAF) and uses a repeat to keep going out from there. By placing an emit right after the repeat we will be able to see the paths that are taken by the traversal. If we did not put the emit here this query would run for a very long time as the repeat has no other ending condition!
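
A sketch of such a query is shown below, assuming the air-routes graph is loaded with g as its traversal source. The limit of 10 is my own choice to keep the output short.

```groovy
// Emit every traverser as the repeat runs and stop after ten results
paths = g.V().has('code','SAF').repeat(out()).emit().
          limit(10).path().by('code').toList()
```

Every path returned starts at SAF, and paths of different lengths can appear in the same result set.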



Another place where emit can be useful is when repeat and times are used together to find paths between vertices. Ordinarily, if you use a step such as times(3) then the query will only return results that are three hops out. However, if we use an emit we can also see results that take fewer hops. First of all, take a look at the query below that does not use an emit and the results that it generates.


The paths returned show a selection of ways to get to Miami from Austin with two stops but none of the results show fewer than two stops. Is this what we really wanted?


Now let’s change the query to use an emit. This time you can think of the query as saying "at most three hops" or in airline terms "at most two stops".
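
A sketch of the "at most three hops" form, assuming the air-routes graph is loaded with g as its traversal source, might look like this. The limit of 10 is an arbitrary value that I picked.

```groovy
// emit() lets traversers leave the loop after one, two or three hops.
// The has step then keeps only those currently at Miami.
paths = g.V().has('code','AUS').repeat(out()).emit().times(3).
          has('code','MIA').limit(10).path().by('code').toList()
```

Each path contains between two and four airport codes, that is, between zero and two stops.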


As you can see, by adding an emit we got back a quite different set of results. This is a really useful and powerful capability. Being able to express ideas such as "at most three" provides us a way to write very clean queries in cases like this.


The emit step can also take a parameter such as a has step to filter out intermediate results that we are not interested in. The query below will only show intermediate results as the repeat operates if they meet a given condition. In this case the condition is that the path must have passed through the Prague (PRG) airport’s vertex. A limit step is used to only show the first 10 results.


Here are the results from running the query.


Without the condition as part of the emit step we get different results as we are shown every path the graph traverser is taking.



So far, while interesting, many of the results shown look at first glance as if they could have been generated without using an emit. However, the query below is more interesting in that we use an until step to specify a target airport of Austin (AUS) that we are interested in getting to from Lerwick (LSI) in the Shetland Islands. We also specify, as part of the emit step, that we are interested in seeing any routes found that involve any airports in New York State regardless of whether or not they end up in Austin.


Here are the results the query generates. Notice how we got a mixture of New York airports as well as Austin as our final destinations.


The emit can also be placed before the repeat step. This will cause the result of the previous step in the query to be emitted before the results that follow. In the example below, we start in Austin and go out two hops using a repeat loop. The first ten airport codes of the places we found are returned. Notice how AUS is returned as the first value even though that is where we started from due to our use of emit.
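
A minimal sketch of this form is shown below, assuming the air-routes graph is loaded with g as its traversal source.

```groovy
// emit() placed before repeat() also emits the starting vertex itself
codes = g.V().has('code','AUS').emit().repeat(out()).times(2).
          limit(10).values('code').toList()
```

Because the starting traverser is emitted before it enters the loop, the first code in the list is AUS.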



In some cases, an emit placed after a repeat step has the same effect as an until step. Both queries below look for routes between Johannesburg (JNB) and Sydney (SYD).



When either query is run, the following results are returned.


You will see more examples of emit being used in the "Modelling an ordered binary tree as a graph" section a bit later.

3.25.2. Introducing cyclicPath

You can use cyclicPath to find paths through the graph that end up back where they started. The rather contrived example below finds routes that both start and end in Austin (AUS).

// From Austin and back again
g.V().has('code','AUS').out().out().cyclicPath().limit(10).path().by('code')

Here are the results from running the query.


You can also use cyclicPath as a termination condition for a repeat loop. The query below keeps following outbound edges until it ends up back where it started.
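
A sketch of this form is shown below, assuming the air-routes graph is loaded with g as its traversal source. The limit of 10 is my own choice.

```groovy
// Keep going out until the path taken contains a repeated vertex
paths = g.V().has('code','AUS').repeat(out()).until(cyclicPath()).
          limit(10).path().by('code').toList()
```

Each path returned contains at least one airport code that appears twice.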


Here are the results from running the modified query. As you can see the two queries generated the same results.


3.25.3. A warning that path can be memory and CPU intensive

The path step is incredibly useful and I find myself using it a lot. However, there are some downsides to the use of path, especially when searching entire graphs for non trivial results. Take a look at the query below. It returns the first 10 routes found that will get you from Papa Stour (PSV), a small airport in the Shetland Islands, to Austin (AUS). The simplePath step is used to make sure the same exact path is never looked at twice. This query runs quickly and returns some useful results.
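
A sketch of the kind of query being described, assuming the air-routes graph is loaded with g as its traversal source, might look like this.

```groovy
// First ten routes found from Papa Stour (PSV) to Austin (AUS)
paths = g.V().has('code','PSV').repeat(out().simplePath()).
          until(has('code','AUS')).limit(10).path().by('code').toList()
```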



However, were we to reverse the query as follows we can run into trouble. In fact, if you run this query on your laptop, after a few minutes of high CPU usage and increased fan noise, it is likely you will get an error that the query ran out of available memory.


The reason this happens is as follows. There are very few routes from PSV and not many more from the airports it is closely connected to. Therefore, if you start the query from PSV, you will fairly quickly find some paths that end up in AUS. However, if you start from AUS there are a lot of possible routes that Gremlin has to explore before it gets close to finding PSV. If it helps, think of a funnel where PSV is at the narrow end and AUS is at the other.

The path step uses a lot of memory and in some cases can cause issues.

The reason that so much memory is consumed is that the path step, even if simplePath is used to avoid travelling the same path twice, has to store up a very large number of routes before finding the ones we are actually interested in. So while the path step is incredibly useful, be aware that in cases like this one, it can get you into trouble if not used with care.

3.26. Calculating vertex degree

While working with graphs, the word degree is used when discussing the number of edges coming into a vertex, going out from a vertex or potentially both coming in and going out. It’s quite simple with Gremlin to calculate various measures of degree as the following examples demonstrate. The query below will calculate the number of outgoing routes for every airport in the graph. If you run this query you will get quite a lot of result data back as there are over 3,300 airports in the graph.

// Out degree (number of routes) from each vertex (airport)
g.V().hasLabel('airport').group().by('code').by(out().count())

The query below builds upon the prior one but just selects a few of the results.

// Outbound routes (degree) from LHR, JFK and DFW?

If we were to run the query the output should look like this.


The query below is a little more complex but can be used to find the 10 airports with the highest number of outgoing routes. Some of the concepts used such as local scope are covered in more detail a bit later on in the "Using local scope with collections" section.


Here are the results from running the query. As you can see, Amsterdam (AMS) has the highest number of outgoing routes. The topic of analyzing routes is revisited in detail in the "Airports with the most routes" section.


The next query will calculate the route degree based on all, incoming and outgoing, routes for ten airports. The query takes advantage of the project step that was introduced in TinkerPop 3.2.

// Calculate degree (in and out) for each vertex.

Here is the output that this query generates.


We could of course also write the same query using a group step as shown below.
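
The two approaches can be sketched side by side as follows, assuming the air-routes graph is loaded with g as its traversal source. The key names 'code' and 'degree' and the variable names are hypothetical choices of mine.

```groovy
// Route degree (in plus out) for ten airports using project
projected = g.V().hasLabel('airport').limit(10).
              project('code','degree').by('code').by(bothE('route').count()).toList()

// The same calculation using group, keyed by airport code
grouped = g.V().hasLabel('airport').limit(10).
            group().by('code').by(bothE('route').count()).next()
```

The project version returns one small map per airport whereas the group version returns a single map containing all ten airports.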


The results below were generated by running the query.


3.27. Gremlin’s scientific calculator - introducing math

As we have seen in some of the prior sections, there are some Gremlin steps such as sum, count and mean that can be used to perform some fairly basic mathematical operations. In Apache TinkerPop version 3.3.1 a new math step was introduced that allows us to perform scientific calculator style mathematical operations as part of a Gremlin graph traversal. As these operators build upon the Java Math class it is worth becoming familiar with that class if you are not already. Be aware that this functionality will only be available to you if the graph implementation that you are using supports Apache TinkerPop 3.3.1 or higher.

The table below provides a summary of the available operators sorted alphabetically.

Table 4. Scientific calculator operators


Operator   Description
+          Arithmetic plus.
-          Arithmetic minus.
*          Arithmetic multiply.
/          Arithmetic divide.
%          Arithmetic modulo (remainder).
^          Raise to the power (n^x).
abs        Absolute value.
acos       Arc (inverse) cosine in radians.
asin       Arc (inverse) sine in radians.
atan       Arc (inverse) tangent in radians.
cbrt       Cube root.
ceil       Returns the smallest (closest to negative infinity) double value that is greater than or equal to the argument and is equal to a mathematical integer.
cos        Cosine of angle given in radians.
cosh       Hyperbolic cosine.
exp        Returns Euler's number "e" raised to the given power (e^x).
floor      Returns the largest (closest to positive infinity) double value that is less than or equal to the argument and is equal to a mathematical integer.
log        Natural logarithm (base e).
log10      Logarithm (base 10).
log2       Logarithm (base 2).
signum     Returns the signum function of the argument; zero if the argument is zero, 1.0 if the argument is greater than zero, -1.0 if the argument is less than zero.
sin        Sine of angle given in radians.
sinh       Hyperbolic sine.
sqrt       Square root.
tan        Tangent of angle given in radians.
tanh       Hyperbolic tangent.

The math step behaves differently from other steps that we have looked at so far in as much as the entire expression is passed in as a single string. This means that you can use labels you have assigned as part of a traversal but you cannot use external variables or static constant references such as Math.PI inside the expression itself. There are ways to easily work around this however as we shall see below. I have not attempted to give an example of every single operator being used but the examples provided should provide all of the basic building blocks you will need to incorporate mathematical operators into your own Gremlin queries.

These features require that the graph database system you are using supports a TinkerPop version of 3.3.1 or higher.

3.27.1. Performing simple arithmetic

Let’s start by looking at a few basic examples. First of all the query below shows that we can perform mathematical operations on literal values as part of a traversal. The vertex we found at the start of the traversal is not used by the math step in this case.



If you want to use the result of the prior step of a traversal as part of a math step, the special symbol "_" (underscore) can be used as shown below. Note that the inject step provides us with a nice way to feed in values while experimenting with the math step.

g.inject(100).math('_ /2')


Now let’s look at how vertex properties can be used by a math step. To do this, we can also use named traversal steps as part of a math operation. The examples below start by checking how many runways the DFW and SFO airports have. Then a math step is used to show how we can add those values together as part of a traversal.

// How many runways does DFW have?
g.V().has('code','DFW').values('runways')

// How many runways does SFO have?
g.V().has('code','SFO').values('runways')


Now let’s use math to add the values together as part of a single traversal. Note that even though we are adding two integers together, the result comes back as a double precision value. Also, note that the named steps 'a' and 'b' are specified inside the single string that is passed to the math step. This is a key difference from all other steps where we refer to one or more traversal labels inside of a step. Lastly, notice that a by modulator is used to tell the math step which properties we want to add together.

// Use named steps to add some results together.
g.V().has('code','DFW').as('a').
      V().has('code','SFO').as('b').
      math('a + b').by('runways')


The examples below show division and modulo operators being used on the results of a count step.



g.V(3).out().count().math('_ / 2')


g.V(3).out().count().math('_ % 5')


Note that the underscore character allowed us to avoid having to write the previous queries using a pattern like the one used below where the count step is labelled as 'a'.

g.V(3).out().count().as('a').math('a / 2')


3.27.2. Using the trigonometric functions

The trigonometric operators work as you would expect. All angles need to be specified as radians and not degrees. You can do the conversion to radians yourself or use the Java Math.toRadians helper method if you prefer. The query below uses the math step to calculate the sine of 60 degrees and stores the result in a variable called "x".

// Calculate the sine of 60 degrees
x = g.inject(60*(Math.PI/180)).math('sin(_)').next()


We can use our variable "x" to calculate the arcsine.

// Calculate the arcsine
g.inject(x).math('asin(_)')


We can use the Gremlin console as a calculator to prove that we got the correct answer back.

// Prove this is the right answer
g.inject(x).math('asin(_)').next() * (180/Math.PI)


Note that just as when using the Java Math library you have to be aware of possible rounding errors. You would expect the calculation below to return 1.0 but it does not as the conversion of 45 degrees to radians is not precise enough for the Math library. Note that using the Java Math.toRadians method does not achieve the desired result either.

// Manual conversion
g.inject(45*(Math.PI/180)).math('tan(_)')


Same experiment but using the helper method.

// Library conversion
g.inject(Math.toRadians(45)).math('tan(_)')


This presents us with a chance to experiment with another of the operators that the math step provides. We can use the ceil operator to round our result up to the nearest integer.
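
A sketch of how ceil might be applied is shown below. It assumes that the value being rounded is the tangent of 45 degrees and that g is a traversal source backed by TinkerPop 3.3.1 or higher. The variable name rounded is mine.

```groovy
// The tangent of 45 degrees can come back fractionally below 1.0.
// Wrapping the calculation in ceil rounds the result up to 1.0.
rounded = g.inject(Math.toRadians(45)).math('ceil(tan(_))').next()
```

Keep in mind that ceil always rounds up, so it is only a suitable fix when the error is known to be on the low side.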



3.27.3. Calculating a standard deviation

We could use the new math step to implement a query that calculates the standard deviation for the number of runways each airport in the graph has. This allows us to see use of the sqrt and power (^) operators. I broke the solution into three queries rather than try to force it all into one. Even now the final query of the three is complicated enough I think! Notice how multiple math steps are used in the same query with the results from one being used as input to the next.

First of all let’s calculate the mean (or average) number of runways in the graph. Not surprisingly this number is close to 1.5 as the majority of the airports only have one or two runways.

// Average number of runways
mean = g.V().hasLabel('airport').values('runways').mean().next()


We also need to know how many airports there are in the graph so that we can calculate the variance as part of the standard deviation calculation.

// Total number of airports
count = g.V().hasLabel('airport').count().next()


Now we are ready to make use of the square root and power operators and calculate the standard deviation. As a reminder, the standard deviation is found by taking the square root of the variance in a data set. The variance itself is calculated by subtracting the mean from each airport's runway count, squaring each difference, summing those squared values and finally dividing that sum by the number of airports. Let’s write a query that can do all of that for us.

// Calculate the standard deviation
g.withSideEffect('m',mean).withSideEffect('c',count).
  V().hasLabel('airport').values('runways').
  math('(_ - m)^2').sum().math('_ / c').math('sqrt(_)')


We could use another query to check on the distribution of runways in the graph to see if we believe our standard deviation result.
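
One way to look at the distribution is sketched below, assuming the air-routes graph is loaded with g as its traversal source. The variable name dist is hypothetical.

```groovy
// Count how many airports have each number of runways
dist = g.V().hasLabel('airport').groupCount().by('runways').next()
```

The result is a map where each key is a runway count and each value is the number of airports that have that many runways.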



Looking at the distribution, where a large majority of the airports have either one or two runways, our result looks pretty reasonable. Clearly the few airports with six, seven or eight runways are the outliers in this sample and would fall well outside of the standard deviation from the mean that we calculated.

Just for fun, let’s use the same basic set of steps once again but this time to find the standard deviation for the number of outgoing routes in the graph.

As before we need to find the mean value for the data set. This time we need to find the average number of outgoing routes in the graph. The airport count remains the same of course.
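
A sketch of how the mean might be calculated is shown below, assuming the air-routes graph is loaded with g as its traversal source. The variable name mean is reused so that it can be passed to the standard deviation query as before.

```groovy
// Average number of outgoing routes per airport. The count is
// wrapped in local so that it is calculated for each airport in turn.
mean = g.V().hasLabel('airport').local(out('route').count()).mean().next()
```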



count = g.V().hasLabel('airport').count().next()


Now we are ready to again calculate the standard deviation for the data set representing all outgoing routes per airport.

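g.withSideEffect('m',mean).withSideEffect('c',count).V().hasLabel('airport').local(out('route').count()).
  math('(_ - m)^2').sum().math('_ / c').math('sqrt(_)')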


This time we got a much bigger number back as the result compared to when we looked at runways. This reflects the differing distribution of routes between major and more minor airports.

There are a lot of other operators that I have not provided examples for but hopefully this section gives you a feel for ways that the math step can be used to create interesting queries.

3.28. Including an index with results - introducing withIndex and indexed

If for any reason you wanted an index value included as part of the results from a query you can use the Groovy withIndex or indexed methods as shown below. The withIndex method adds the index value after each element whereas the indexed method places it before the element. You can provide a starting index value as a parameter. If no value is provided the default value for the first index will be zero.



Here is the same query but using 1 as the starting index.



Below is the query used again but this time with the indexed method being used to generate the index value.
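
As withIndex and indexed are Groovy methods rather than Gremlin steps, their behavior can be sketched using an ordinary list. The list below is a hypothetical stand-in for the results of a query that ends with toList().

```groovy
// Stand-in for something like g.V()...values('code').toList()
codes = ['AUS','DFW','SFO']

pairs     = codes.withIndex()    // each element followed by its index, from 0
pairsFrom = codes.withIndex(1)   // same, but the index starts at 1
byIndex   = codes.indexed()      // a map with the index as the key
```

The withIndex method produces a list of pairs while indexed produces a map, which is why the index appears on different sides of each value.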



3.29. More examples using concepts we have covered so far

The examples in this section build upon the topics that we have covered so far. The query below finds cities that can be flown to from any airport in the Hawaiian islands.

// Which cities can I fly to from any airport in the Hawaiian islands?
g.V().has('airport','region','US-HI').out().path().by('city')

If we run the query we would get back results similar to those below. Only a few of the full result set are shown.

[Honolulu,San Francisco]
[Honolulu,San Diego]
[Honolulu,Las Vegas]
[Lihue,San Francisco]

The query below looks for airports in Europe using the continent vertex with a code of EU as the starting point. The results are sorted in ascending order and folded into a list.

// Find all the airports that are in Europe (The graph stores continent information
// as "contains" edges connected from a "continent" vertex to each airport vertex.)
g.V().has('continent','code','EU').out('contains').values('code').order().fold()

Here is what we get back from running the query.


The next queries show two ways of finding airports with 6 or more runways.
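
A sketch of two possible approaches is shown below, assuming the air-routes graph is loaded with g as its traversal source. The variable names are hypothetical.

```groovy
// Using where with an embedded traversal
viaWhere = g.V().hasLabel('airport').where(values('runways').is(gte(6))).
             values('code').toList()

// Using has with a predicate, which is the more concise form
viaHas = g.V().has('airport','runways',gte(6)).values('code').toList()
```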



Next, let’s look at two ways of finding flights from airports in South America to Miami. The first query uses select which is what we would have had to do in the TinkerPop 2 days. The second query, which feels cleaner to me, uses the path and by step combination introduced in TinkerPop 3.



This query finds the edge that connects Austin (AUS) with Dallas Ft. Worth (DFW) and returns the dist property of that edge so we know how far that journey is.

// How far is it from DFW to AUS?
g.V().has('code','DFW').outE().as('e').inV().has('code','AUS').select('e').values('dist')


As an alternative approach, we could return a path that would include both airport codes and the distance. Notice how we need to use outE and inV rather than just out as we still need the edge to be part of our path so that we can get its dist property using a by step.
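
A sketch of the path based approach is shown below, assuming the air-routes graph is loaded with g as its traversal source. The by modulators are applied in round-robin order, so by('code') is used for both vertices and by('dist') for the edge between them.

```groovy
// Returns a path of the form [code, dist, code]
route = g.V().has('code','DFW').outE().inV().has('code','AUS').
          path().by('code').by('dist').next()
```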



If we wanted to find out if there are any ways to get from Brisbane to Austin with only one stop, this query will do nicely!

// Routes from BNE to AUS with only one stop
g.V().has('code','BNE').out().out().has('code','AUS').path().by('code')


This is another way of doing the same thing but, once again, to me using path feels more concise. The only advantage of this query is that if all you want is the name of any intermediate airports then that’s all you get!



A common thing you will find yourself doing when working with a graph is counting things. This next query looks for all the airports that have fewer than five outgoing routes and counts them. It does this by counting the number of airports that have fewer than five outgoing edges with a route label. There is a surprisingly high number of airports that offer this small number of destinations.

// Airports with less than 5 outgoing edges
g.V().hasLabel('airport').where(out('route').count().is(lt(5))).count()


In a similar vein this query finds the airports with more than 200 outgoing routes. The second query shows that where is a synonym for filter in many cases.
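
The two forms can be sketched as follows, assuming the air-routes graph is loaded with g as its traversal source. The variable names are hypothetical.

```groovy
// Airports with more than 200 outgoing routes, using where
viaWhere  = g.V().hasLabel('airport').where(out('route').count().is(gt(200))).
              values('code').toList()

// The same query using filter instead
viaFilter = g.V().hasLabel('airport').filter(out('route').count().is(gt(200))).
              values('code').toList()
```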



Here are two more queries that look for things that meet specific criteria. Both find routes where the distance is exactly 100 miles and return the source and destination airport codes. The first query uses as and select while the second one uses path and includes the distance in the result.

// List ten (or less) routes where the distance is exactly 100 miles
g.V().as('a').outE().has('dist',eq(100)).inV().as('b').select('a','b').by('code').limit(10)


Similar to the prior query but using path and displaying the distance as well as the airport codes.



This query looks for any airports that have an elevation above 10,000 feet. Two ways of achieving more or less the same result are shown. The first uses valueMap and the second uses a project step instead.

// Airports above 10,000ft sorted by ascending elevation
g.V().has('airport','elev', gt(10000)).
      order().by('elev',incr).valueMap('city','elev')

g.V().has('airport','elev', gt(10000)).order().by('elev',incr).
      project('city','elevation').by('city').by('elev')

If we ran the query that uses the project step here is what we should get back.

[city:La Paz / El Alto,elevation:13355]

The next query finds any routes between Austin and Sydney that only require one stop. The by step offers a clean way of doing this query by combining it with path.


The following three queries all achieve the same result. They find flights from any airport in Africa to any airport in the United States. These queries are interesting as the continent information is represented in the graph as edges connecting an airport vertex with a continent vertex. This is about as close as Gremlin gets to a SQL join statement!

The first query starts by looking at airports, while the second starts from the vertex that represents Africa. The third query uses where to show an alternate way of achieving the same result.




If we run the query we should get results that look like this. I have laid them out in two columns to save space.

[a:JNB, b:ATL]    [a:ACC, b:JFK]
[a:JNB, b:JFK]    [a:CMN, b:IAD]
[a:CAI, b:JFK]    [a:CMN, b:JFK]
[a:ADD, b:IAD]    [a:GCK, b:DFW]
[a:ADD, b:EWR]    [a:DKR, b:IAD]
[a:LOS, b:ATL]    [a:DKR, b:JFK]
[a:LOS, b:IAH]    [a:LFW, b:EWR]
[a:LOS, b:JFK]    [a:RAI, b:BOS]
[a:ACC, b:IAD]

The query below shows how to use the project step that was introduced in TinkerPop 3, along with order and select to produce a sorted table of airports you can fly to from Austin along with their runway counts. The limit step is used to only return the top ten results. You will find several examples elsewhere in this book that use variations of this collection of steps.


Here are the results we get from running the query.

[ap:ORD, rw:8]
[ap:DFW, rw:7]
[ap:BOS, rw:6]
[ap:DEN, rw:6]
[ap:DTW, rw:6]
[ap:YYZ, rw:5]
[ap:MDW, rw:5]
[ap:ATL, rw:5]
[ap:IAH, rw:5]
[ap:FRA, rw:4]


So far we have looked mostly at querying an existing graph. In the following sections we will look at many other topics that it is also important to be familiar with when working with Gremlin. These topics include mixing in some Groovy or Java code with your queries, as well as adding vertices (nodes), edges and properties to a graph and also deleting them. We will also look at how to create a sub-graph and how to save a graph to an XML or JSON file and a lot more. Let’s start off with a short discussion of query layout, reserved words and data modelling.

4.1. A word about layout and indentation

As you begin to write more complex Gremlin queries they can get quite lengthy. To make them easier for others to read, it is recommended to spread them over multiple lines and indent them in a way that makes sense. I am not going to propose an indentation standard, as I believe this should be left to personal preference. However, there are a few things I want to mention in passing. When working with the Gremlin console, if you want to spread a query over multiple lines then you will need to end each line with a backslash character or with a character such as a period or a comma that tells the Gremlin parser that there is more to come.

The following example shows the query we already looked at in the Boolean operations section of this book but this time edited so that it could be copied and pasted directly into the Gremlin console.

g.V().hasLabel('airport') \
     .has('region',within('US-TX','US-LA','US-AZ','US-OK'))   \
     .order().by('region',incr)   \
     .valueMap().select('code','region')

We can avoid the use of backslash characters if we lay the query out as follows. Each line ends with a period which tells the parser that there are more steps coming.
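A sketch of that layout is shown here, assuming the air-routes graph is loaded as g. The final select step is my assumption about which values the query returns.

```groovy
// Every line except the last ends with a period, so the console keeps reading
g.V().hasLabel('airport').
      has('region',within('US-TX','US-LA','US-AZ','US-OK')).
      order().by('region',incr).
      valueMap().select('code','region')
```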


If we do not give the parser one of these clues that there is more to come, the Gremlin console will try and execute each line without waiting for the next line.

Some people find it easier to read queries when each step or modulator is given its own line and indented appropriately. So we could lay out the query as shown below and it will still work just fine.
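For example, the same query could be laid out like this. As before, this is only a sketch that assumes the air-routes graph is loaded as g, and the indentation shown is just one of many reasonable choices.

```groovy
// One step or modulator per line; by is indented under the order it modifies
g.V().
  hasLabel('airport').
  has('region',within('US-TX','US-LA','US-AZ','US-OK')).
  order().
    by('region',incr).
  valueMap().
  select('code','region')
```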


Whether you decide to use the backslash as a continuation character or leave the period on the previous line is really a matter of personal preference. Just be sure to do one or the other if you want to use multiple line queries within the Gremlin console. There is no golden rule as to how many lines and how much indenting you should use when laying out your more complex queries. However, whatever you decide to do, it is worth remembering that others reading your work may find well laid out and appropriately indented steps easier to read and understand.

4.2. A warning about reserved word conflicts and collisions

Most of the time the issue I am about to describe will not be a problem. However, there are cases where names of Gremlin steps conflict with reserved words and method names in Groovy. Remember that Gremlin is coded in Groovy and Java. If you hit one of these cases, often the error message that you will get presented with does not make it at all clear that you have run into this particular issue. Let’s look at some examples. One step name in Gremlin that can sometimes run into this naming conflict is the in step. However, you do not have to worry about this in all cases. First take a look at the following query.


That query does not cause an error and correctly returns all of the vertices that are connected by an incoming edge to the AUS vertex. There is no conflict of names here because it is clear that the in() reference applies to the result of the has step. However, now take a look at this query.


In this case the in() is on its own and not dot connected to a previous step. The Gremlin runtime (which remember is written in Groovy) will try to interpret this and will throw an error because it thinks this is a reference to its own in method. To make this query work we have to adjust the syntax slightly as follows.


Notice that I added the "__." (underscore underscore period) in front of the in(). This is shorthand for "the thing we are currently looking at", so in this case, the result of the has step.
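The three cases discussed above can be summarized in a short sketch, assuming the air-routes graph is loaded as g. The failing form is left commented out so that the block can be pasted into the console as is.

```groovy
// Works: in() is dot connected to the has step
g.V().has('code','AUS').in()

// Fails in the console: a bare in() collides with the Groovy reserved word
// g.V().has('code','AUS').union(in(),out())

// Works: the "__." prefix makes it clear that this is the Gremlin in step
g.V().has('code','AUS').union(__.in(),out())
```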

There are currently not too many Groovy reserved words to worry about. The three that you have to watch out for are in, not and as which have special meanings in both Gremlin and Groovy. Remember though, you will only need to use the "__." notation when it is not clear what the reserved word, like in, applies to.

You will find an example of not being used with the "__." prefix in the "Modelling an ordered binary tree as a graph" section a bit later on.

4.3. Thinking about your data model

As important as it is to become good at writing effective Gremlin queries, it is equally important, if not more so, to put careful consideration into how you model your data as a graph. Ideally you want to arrange your graph so that it can efficiently support the most common queries that you foresee it needing to handle.

Consider this query description. "Find all flight routes that exist between airports anywhere in the continent of Africa and the United States". When putting the air-routes graph together I decided to model continents as their own vertices. So each of the seven continents has a vertex. Each vertex is connected to airports within that continent by an edge labeled "contains".

I could have chosen to just make the continent a property of each airport vertex but had I done that, to answer the question about "routes starting in Africa" I would have to look at every single airport vertex in the graph just to figure out which continent contained it. By giving each continent its own vertex I am able to greatly simplify the query we need to write.

Take a look at the query below. We first look just for vertices that are continents. We then only look at the Africa vertex and the connections it has (each will be to a different airport). By starting the query in this way, we have very efficiently avoided looking at a large number of the airports in the graph altogether. Finally we look at any routes from airports in Africa that end up in the United States. This turns out to yield a nice and simple query in no small part because our data model in the graph made it so easy to do.

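// Flights from any Airport in Africa to any airport in the United States
g.V().hasLabel('continent').has('code','AF').out().as('a').out().has('country','US').as('b').select('a','b').by('code')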

We could also have started our query by looking at each airport and looking to see if it is in Africa but that would involve looking at a lot more vertices. The point to be made here is that even if our data model is good we still need to always be thinking about the most efficient way to write our queries.

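// Gives same results but not as efficient
g.V().hasLabel('airport').as('a').in('contains').has('code','AF').select('a').out().has('country','US').as('b').select('a','b').by('code')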

Now for a fairly simple graph, like air-routes, this discussion of efficiency is perhaps not such a big deal, but as you start to work with large graphs, getting the data model right can be the difference between good and bad query response times. If the data model is bad you won’t always be able to work around that deficiency simply by writing clever queries!

4.3.1. Keeping information in two places within the same graph

Sometimes, to improve query efficiency, I find it is actually worth having the data available in more than one place within the same graph. An example of this in the air routes graph would be the way I decided to model countries. I have a unique vertex for each country but I also store the country code as a property of each airport vertex. In a small graph this perhaps is overkill but I did it to make a point. Look at the following two queries that return the same results - the cities in Portugal that have airports in the graph.
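The two queries might look like this. This is a sketch that assumes the air-routes graph is loaded as g, that country vertices have the label 'country' and a code property, and that they are linked to airports by 'contains' edges.

```groovy
// Start at the Portugal country vertex and walk out to its airports
g.V().hasLabel('country').has('code','PT').out('contains').values('city')

// Look at every airport vertex and test its country property
g.V().has('airport','country','PT').values('city')
```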



The first query finds the country vertex for Portugal and then finds all of the airports connected to it. The second query looks at all airport vertices and checks whether they have PT as the value of the country property.

In the first example it is likely that far fewer vertices will get looked at than in the second, even though a few edges will also get walked, as there are over 3,000 airport vertices but fewer than 300 country vertices. Also, in a production system with an index in place, finding the Portugal vertex should be very fast.

Conversely, if we were already looking at an airport vertex for some other reason and just wanted to see what country it is in, it is more convenient to just look at the country property of that vertex.

So there is no golden rule here but it is something to think about while designing your data model.

4.3.2. Using a graph as an index into other data sources

While on the topic of what to keep in the graph, something to resist being drawn into in many cases is the desire to keep absolutely everything in the graph. For example, in the air routes graph I do not keep every single detail about an airport (radio frequencies, runway names, weather information etc.) in the airport vertices. That information is available in other places and easy to find. In a production system you should consider carefully what needs to be in your graph and what more naturally belongs elsewhere. One thing I could do is add a URL as a property of each airport vertex that points to the airport’s home page or some other resource that has all of the information. In this way the graph becomes a high quality index into other data sources. This is a common and useful pattern when working with graphs. This model of having multiple data sources working together is sometimes referred to as Polyglot storage.

4.3.3. A few words about supernodes

When a vertex in a graph has a large number of edges and is disproportionately connected to many of the other vertices in the graph it is likely that many, if not all, graph traversals of any consequence will include that vertex. Such vertices (nodes) are often referred to as supernodes. In some cases the presence of supernodes may be unavoidable but with careful planning as you design your graph model you can reduce the likelihood that vertices become supernodes. The reason we worry about supernodes is that they can significantly impact the performance of graph traversals. This is because it is likely that any graph traversal that goes via such a vertex will have to look at most if not all of the edges connected to that vertex as part of a traversal.

The air-routes graph does not really have anything that could be classed as a supernode. The vertex with the most edges is the continent vertex for North America that has approximately 980 edges. The busiest airports are IST and AMS and they both have just over 530 total edges. So in the case of the air-routes graph we do not have to worry too much.
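If you want to check a graph of your own for heavily connected vertices, a query along the following lines is one way to do it. This is only a sketch. It assumes the air-routes graph is loaded as g and that every airport, country and continent vertex has a code property, and it may be expensive on a very large graph as it counts the edges of every vertex it visits.

```groovy
// Count incoming plus outgoing edges for each vertex and show the ten
// most connected ones.
g.V().hasLabel('airport','country','continent').
      project('code','degree').
        by('code').
        by(bothE().count()).
      order().by(select('degree'),decr).
      limit(10)
```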

If we were building a graph of a social network that included famous people we might have to worry. Consider some of the people on Twitter with millions of followers. Without taking some precautions, such a social network, modelled as a graph, could face issues.

As you design your graph model it is worth considering that some things are perhaps better modelled as a vertex property than as a vertex with lots of edges needing to connect to it. For example in the air routes graph there are country vertices and each airport is connected to one of the country vertices. In the air routes graph this is not a problem as even if all of the airports in the graph were in the same country that would still give us less than 3,500 edges connected to that vertex. However, imagine if we were building a graph containing a very large number of people. If we had several million people in the graph all living in the same country that would be a guaranteed way to get a supernode if we modelled that relationship by connecting every person vertex to a country vertex using a lives in edge. In such situations, it would be far more sensible to make the country where a person lives a property of their own vertex.

A detailed discussion of supernode mitigation is beyond the scope of this book but I encourage you to always be thinking about their possibility as you design your graph and also be thinking about how you can prevent them becoming a big issue for you.

4.4. Making Gremlin even Groovier

As we have already discussed, the Gremlin console builds upon the Groovy console, and Groovy itself is coded in Java. This means that all of the classes and methods that you would expect to have available while writing Groovy or Java programs are also available to you as you work with the Gremlin Console. You can intermix additional features from Groovy and Java classes along with the features provided by the TinkerPop 3 classes as needed. This capability makes Gremlin additionally powerful. You can also take advantage of these features when working with Gremlin Server and with other TinkerPop enabled graph services with the caveat that some features may be blocked if viewed as a potential security risk to the server or simply because they are not supported.

Every Gremlin query we have demonstrated so far is also, in reality, valid Groovy. We have already shown examples of storing values into variables and looping using Groovy constructs as part of a single or multi part Gremlin query.

In this section we are going to go one step further and actually define some methods, using Groovy syntax, that can be run while still inside the Gremlin Console. By way of a simple example, let’s define a method that will tell us how far apart two airports are and then invoke it.

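// A simple function to return the distance between two airports
def dist(g,from,to) {
  d=g.V().has('code',from).outE().as('a').inV().has('code',to).select('a').values('dist').next()
  return d }

// Can be called like this
dist(g,'AUS','MEX')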

This next example shows how to define a slightly longer method that prints out information about the degree of a vertex in a nice, human readable, form.

// Groovy function to display vertex degree
def degree(g,s) {
  v = g.V().has('code',s).next();
  o=g.V(v).out().count().next() ;
  i=g.V(v).in().count().next() ;
  println "Edges in  : " + i;
  println "Edges out : " + o;
  println "Total     : " +(i+o);
}

// Can be called like this
degree(g,'LHR')

Here is an example that shows how we can query the graph, get back a list of values and then use a for loop to display them. Notice this time how we initially store the results of the query into the variable x. The call to toList ensures that x will contain a list (array) of the returned values.

// Using a Groovy for() loop to iterate over a list returned by Gremlin
x=g.V().hasLabel('airport').limit(10).toList()
for (a in x) {println(a.values('code').next()+" "+a.values('icao').next()+" "+a.values('desc').next())}

// We can also do this just using a 'for' loop and not storing anything into a variable.
for (a in g.V().hasLabel('airport').limit(10).toList()) {println(a.values('code').next()+" "+a.values('icao').next())}

Sometimes (as you have seen above) it is necessary to make a call to next to get the result you expect returned to your variable.

number = g.V().hasLabel('airport').count().next()
println "The number of airports in the graph is " + number

Here is another example that makes a Gremlin query inside a for loop.

for (a in 1..10) print g.V().has(id,a).values('code').next()+" "

This example returns a hash of vertices, with vertex labels as the keys and the code property as the values. It then uses the label names to access the returned hash.
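A sketch of what this could look like is shown here, assuming the air-routes graph is loaded as g. The variable name a and the index 5 are arbitrary choices made for illustration.

```groovy
// Group the code property values by vertex label
a = g.V().group().by(label).by('code').next();[]

// How many country codes did we get back?
a['country'].size()

// Look at one of the country codes
a['country'][5]
```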


Here is another example. This time we define a method that takes as input a traversal object and the code for an airport. It then uses those parameters to run a simple Gremlin query to retrieve all of the places that you can fly to from that airport. It then uses a simple for loop to print the results in a table. Note the use of next as part of the println. This is needed in order to get the actual values that we are looking for. If we did not include the calls to next we would actually get back the iterator object itself and not the actual values.

// Given a traversal and an airport code print a list of all the places you can
// fly to from there including the IATA code and airport description.
def from(g,a) {
  places=g.V().has('code',a).out().toList()
  for (x in places) {println x.values('code').next()+" "+x.values('desc').next()}
}

// Call like this
from(g,'AUS')

This example creates a hash map of all the airports, using their IATA code as the key. We can then access the map using the IATA code to query information about those airports. Remember that the ;[] at the end of the query just stops the console from displaying unwanted output.

// Create a map (a) of all vertices with the code property as the key
a=g.V().group().by('code').by().next();[]

// Show the description stored in the JFK vertex
a['JFK'][0].values('desc')

Another useful way to work with variables is to establish the variable and then use the fill step to place the results of a query into it. The example below creates an empty list called german. The query then finds all the vertices for airports located in Germany and uses the fill step to place them into the variable.

german = []
g.V().has('airport','country','DE').fill(german)

We can then use our list as you would expect. Remember that as we are running inside the Gremlin console we do not have to explicitly iterate through the list as you would if you were writing a stand alone Groovy application.

// How many results did we get back?
german.size()

// Query some values from one of the airports in the list
german[1].values('desc')

// Feed an entry from our list back into a traversal
g.V(german[1]).values('city')



Towards the end of the book, in the "Working with TinkerGraph from a Groovy application" section, we will explore writing some stand alone Groovy code that can use the TinkerPop API and issue Gremlin queries while running outside of the Gremlin Console as a stand alone application.

4.4.1. Using a variable to feed a traversal

Sometimes it is very useful to store the result of a query in a variable and then, later on, use that variable to start a new traversal. You may have noticed we did that in the very last example of the prior section where we fed the german variable back into a traversal. By way of another simple example, the code below stores the result of the first query in the variable austin and then uses it to look for routes from Austin in the second query. Notice how we do this by passing the variable containing the Austin vertex into the V() step.
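A minimal sketch of the two queries, assuming the air-routes graph is loaded as g, is shown here.

```groovy
// Store the Austin vertex in a variable
austin = g.V().has('code','AUS').next()

// Use the saved vertex to start a new traversal
g.V(austin).out().values('code')
```

The call to next is what gives us the vertex itself rather than a traversal.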


You can take this technique one step further and pass an entire saved list of vertices to V(). In the next example we first generate a list of all airports that are in Scotland and then pass that entire list into V() to first of all count how many routes there are from those airports and then we start another query that looks for any route from those airports to airports in Germany.

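// Find all airports in Scotland
sco=g.V().has('airport','region','GB-SCT').toList()

// How many routes from these airports?
g.V(sco).out().count()

// How many of those routes end up in Germany?
g.V(sco).out().has('country','DE').count()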

In this example of using variables to drive traversals, we again create a list of airports. This time we find all the airports in Texas. We then use a Groovy each loop to iterate through the list. For each airport in the list we print the code of the starting airport and then the codes of every airport that you can fly to from there.

// Find all of the airports in Texas
texas=g.V().has('region','US-TX').toList()

// For each airport, print a list of all the airports that you can fly to from there.
texas.each {println it.values('code').next() + "===>" +
            g.V(it).out().values('code').toList()}

In this example, which is admittedly a bit contrived, we use a variable inside of a has step. We initially create a list containing all of the IATA codes for each airport in the graph. We then iterate through that list and calculate how many outgoing routes there are from each place and print out a string containing the airport IATA code and the count for that airport. Note that this could easily be done just using a Gremlin query with no additional Groovy code. The point of this example is more to show another example of mixing Gremlin, Groovy and variables. Knowing that you can do this kind of thing may come in useful as you start to write more complicated graph database applications that use Gremlin. You will see this type of query done using just Gremlin in the section called "Finding unwanted parallel edges" later in this book.

m=g.V().hasLabel('airport').values('code').toList()
for (a in m) println a + " : " + g.V().has('code',a).out().count().next()

Lastly, here is an example that uses an array of values to seed a query.

['AUS','RDU','MCO','LHR','DFW'].
     each {println g.V().has('code','JFK').outE().inV().has('code',it).path().by('code').by('dist').next()}

Here is the output from running the code.

[JFK, 1520, AUS]
[JFK, 427, RDU]
[JFK, 945, MCO]
[JFK, 3440, LHR]
[JFK, 1390, DFW]

4.5. Adding vertices, edges and properties

So far in this book we have largely focussed on loading a graph from a file and running queries against it. As you start to build your own graphs you will not always start with a graph saved as a text file in GraphML, CSV, GraphSON or some other format. You may start with an empty graph and incrementally add vertices and edges. Just as likely you may start with a graph like the air routes graph, read from a file, but want to add vertices, edges and properties to it over time. In this section we will explore various ways of doing just that.

Vertices and edges can be added directly to the graph using the graph object or as part of a graph traversal. We will look at both of these techniques over the course of the following pages.

4.5.1. Adding an airport (vertex) and a route (edge)

The following code uses the graph object that we created when we first loaded the air-routes graph to create a new airport vertex (node) and then adds a route (edge) from it to the existing DFW vertex. We can specify the label name (airport) and as many properties as we wish to while creating the vertex. In this case we just provide three. We can additionally add and delete vertex properties after a vertex has been created. While using the graph object in this way works, it is strongly recommended that the traversal source object g be used instead and that vertices and edges be added using a traversal. Examples of how to do that are coming up next.

// Add an imaginary airport with a code of 'XYZ' and connect it to DFW
xyz = graph.addVertex(label,'airport',
                      'code','XYZ',
                      'icao','KXYZ',
                      'desc','This is not a real airport')

// Find the DFW vertex
dfw = g.V().has('code','DFW').next()

// Create a route from our new airport to DFW
xyz.addEdge('route',dfw)

In many cases it is more convenient, and also recommended, to perform each of the previous operations using just the traversal object g. The following example does just that. We first create a new airport vertex for our imaginary airport and store its vertex in the variable xyz. We can then use that stored value when we create the edge also using a traversal. As with many parts of the Gremlin language, there is more than one way to achieve the same results.

// Add an imaginary airport with a code of 'XYZ' and connect it to DFW
xyz = g.addV('airport').property('code','XYZ').
                        property('desc','This is not a real airport').next()

Notice, in the code above, how each property step can be chained to the previous one when adding multiple properties. Whether you need to do it while creating a vertex or to add and edit properties on a vertex at a later date you can use the same property step.

It is strongly recommended that the traversal source object g be used when adding, updating or deleting vertices and edges. Using the graph object directly is not viewed as a TinkerPop best practice.

We can now add a route from DFW to XYZ. We are able to use our xyz variable to specify the destination of the new route using a to step.

// Add a route from DFW to XYZ
g.V().has('code','DFW').addE('route').to(xyz)

We could have written the previous line to use a second V() step if we had not previously saved anything in a variable. Note that while this use of a second V() step will work locally, if you are sending queries to a Gremlin Server (a topic we will discuss later in this book) this syntax is not supported and will not work.
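A sketch of that alternative form is shown here. It assumes the air-routes graph is loaded as g and that a vertex with the code 'XYZ' has already been created as described earlier in this section.

```groovy
// Find the destination with a second V() step instead of a saved variable
g.V().has('code','DFW').addE('route').to(V().has('code','XYZ'))
```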


We might also want to add a returning route from XYZ back to DFW. We can do this using the from step in a similar way as we used the to step above.

// Add the return route back to DFW
g.V().has('code','DFW').addE('route').from(xyz)

Another way that we could have chosen to create our edge is by naming things using as steps. Note also there is more than one way to define the direction of an edge. Two examples are shown below. One uses the addOutE step and the other uses the outE and to combination. Notice also how the V step was used below to essentially start a new traversal midway through the current one.
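The sketch below shows the naming technique using as steps. Note that I have used addE with to and from in both lines rather than addOutE, so this is a substitute for, and not a copy of, the two examples described above. It assumes the air-routes graph is loaded as g and that the XYZ vertex already exists.

```groovy
// Name the XYZ vertex 'a', find DFW with a mid-traversal V(), then add
// an edge from DFW to XYZ
g.V().has('code','XYZ').as('a').V().has('code','DFW').addE('route').to('a')

// The same idea with the direction reversed (XYZ to DFW)
g.V().has('code','XYZ').as('a').V().has('code','DFW').addE('route').from('a')
```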



You will see a bigger example that uses as to name steps in the "Quickly building a graph for testing" section that is coming up soon.

4.5.2. Using a traversal to determine a new label name

In TinkerPop 3.3.1 a new capability was added to the addV and addE steps. This new capability allows us to use a traversal to determine what the label used by a new vertex or edge should be. Take a look at the query below. We have seen this type of query used earlier in the book. It simply tells us what label the vertex representing the Austin (AUS) airport has.



What the new capability added in TinkerPop 3.3.1 allows us to do is include the traversal above inside of an addV step as shown below. The first string result returned by the provided traversal will be used as the label name.
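The sequence described here might look like the following sketch. It assumes the air-routes graph is loaded as g and that the graph is backed by TinkerPop 3.3.1 or later. The code 'XYZ' is just an illustrative value.

```groovy
// What label does the AUS vertex have?
g.V().has('code','AUS').label()

// Use that traversal to supply the label for a new vertex
g.addV(V().has('code','AUS').label()).property('code','XYZ')

// Check the label that was assigned to the new vertex
g.V().has('code','XYZ').valueMap(true)
```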



We can inspect the new vertex using valueMap to make sure that our label was correctly assigned.


These features require that the graph database system you are using supports a TinkerPop version of 3.3.1 or higher.

We can now do something similar to dynamically work out what the label should be for an edge between our new airport and Austin.
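A sketch of how that might be done follows. It assumes the air-routes graph is loaded as g, that an XYZ vertex exists, and that the graph is backed by TinkerPop 3.3.1 or later. Taking the label from the first outgoing edge of AUS is my own choice for the illustration.

```groovy
// Use the label of an existing AUS edge as the label for the new edge
g.addE(V().has('code','AUS').outE().limit(1).label()).
  from(V().has('code','XYZ')).
  to(V().has('code','AUS'))

// Check the label of the new edge
g.V().has('code','XYZ').outE().valueMap(true)
```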



Again, we can use a valueMap step to make sure our new edge label looks OK.



4.5.3. Using a traversal to seed a property with a list

You can use the results of a traversal to create or update properties. The example below creates a new property called places for the Austin airport vertex. The values of the property are the results of finding all of the places that you can travel to from that airport and folding their code values into a list.

// Add a list as a property value
g.V().has('code','AUS').property('places',out().values('code').fold())

We can use a valueMap step to make sure the property was created as we expected it to be. As you can see a new property called places has been created containing as its value a list of codes.



To gain access to these values from your code or Gremlin console queries, we can use the next step. A simple example is given below where values is used to retrieve the values of the places property and then we use size to see how many entries there are in the list.



Once we have access to the list of values we can access them using the normal Groovy array syntax. The example below returns the three values with an index between 2 and 4.
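The steps discussed in this section can be sketched as follows, assuming the air-routes graph is loaded as g. The first line creates the places property described above so that the block can be run on its own.

```groovy
// Create the places property from the codes of the reachable airports
g.V().has('code','AUS').property('places',out().values('code').fold())

// Inspect the new property
g.V().has('code','AUS').valueMap('places')

// How many entries are in the list?
g.V().has('code','AUS').values('places').next().size()

// The three values with an index between 2 and 4
g.V().has('code','AUS').values('places').next()[2..4]
```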



4.5.4. Using inject to specify new vertex ID values

If the graph database you are using supports user provided ID values, you can use an inject step as one way to specify what you want the ID value of a new vertex to be. For example consider the example below.



You can also specify more than one ID value if you want to create multiple vertices.
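A sketch of both cases is shown here. It assumes that g is a traversal source for a graph, such as TinkerGraph, that accepts user supplied ID values. The label 'test' and the ID values are purely illustrative.

```groovy
// Create a single vertex with an ID of 99999
g.inject(99999).as('a').addV('test').property(id,select('a'))

// Create three vertices, one for each injected ID
g.inject(1001,1002,1003).as('a').addV('test').property(id,select('a'))
```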



Note that if you try to create a vertex using an ID that already exists the operation will fail.


Vertex with id already exists: 99999

4.5.5. Quickly building a graph for testing

Sometimes for testing and for when you want to report a problem or ask for help on a mailing list it is handy to have a small stand alone graph that you can use. The code below will create a mini version of the air routes graph in the Gremlin Console. Note how all of the vertices and edges are created in a single query with each step joined together.
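A small sketch of this technique is shown here. The variable names mini and mg are hypothetical and were chosen so that the example does not overwrite the graph and g variables used for the air-routes graph. The airports and routes are just a sample and are not meant to be complete.

```groovy
// Create an empty in-memory graph and a traversal source for it
mini = TinkerGraph.open()
mg = mini.traversal()

// Create four airports and five routes in a single traversal
mg.addV('airport').property('code','AUS').as('aus').
   addV('airport').property('code','DFW').as('dfw').
   addV('airport').property('code','LAX').as('lax').
   addV('airport').property('code','JFK').as('jfk').
   addE('route').from('aus').to('dfw').
   addE('route').from('dfw').to('lax').
   addE('route').from('dfw').to('jfk').
   addE('route').from('lax').to('jfk').
   addE('route').from('jfk').to('aus').iterate()
```

Each as step names the vertex that was just created so that the later addE steps can refer to it using from and to.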

The form of addV that used to allow creation of a vertex and a property using something like g.addV(label,'airport','code','AUS') is now deprecated and should not be used.

4.5.6. Adding vertices and edges using a loop

Sometimes it is more efficient to define the details of the vertices or edges that you plan to add to the graph in an array and then add each vertex or edge using a simple for loop that iterates over it. The following example adds our imaginary airports directly to the graph using such a loop. Notice that we do not have to specify the ID that we want each vertex to have. The graph will assign a unique ID to each new vertex for us.

vertices = [["WYZ","KWYZ"],["XYZ","KXYZ"]]
for (a in vertices) {graph.addVertex(label,"airport","code",a[0],"icao",a[1])}

We could also have added the vertices using the traversal object g as follows. Notice the call to next(). Without this the vertex creation will not work as expected.

vertices = [["WYZ","KWYZ"],["XYZ","KXYZ"]]
for (a in vertices) {g.addV("airport").property("code",a[0]).property("icao",a[1]).next()}

This technique of creating vertices and/or edges using a for loop can also be useful when working with graphs remotely over HTTP connections. It is a very convenient way to combine a set of creation steps into a single REST API call.

If you prefer a more Groovy like syntax you can also do this.

vertices = [["WYZ","KWYZ"],["XYZ","KXYZ"]]
vertices.each {g.addV("airport").property("code",it[0]).property("icao",it[1]).next()}

4.5.7. Using coalesce to only add a vertex if it does not exist

In the "Combining coalesce with a constant value" section we looked at how coalesce could be used to return a constant value if the other entities that we were looking for did not exist. We can reuse that pattern to produce a traversal that will only add a vertex to the graph if that vertex has not already been created.

Let’s assume we wanted to add a new airport, with the code "XYZ" but we are not sure if the airport might have already been added.

We can check to see if the airport exists, using a basic has step.


If it does not exist yet, which in this case it does not, nothing will be returned. We could go one step further and change the query to return an empty list [] if the airport does not exist by adding a fold step to the query.



Now that we have a query that can return an empty list if a vertex does not exist we can take advantage of this in a coalesce step. The query below looks to see if the airport already exists and passes the result of that into a coalesce step. Remember, coalesce will return the result of the first traversal it looks at that returns a good result. We can make the first parameter passed to coalesce an unfold step. This way in the case where the airport does not exist, unfold will return nothing and so coalesce will attempt the second step. In this case our second step creates a vertex for the airport "XYZ".
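The three queries discussed in this section can be sketched as follows, assuming the air-routes graph is loaded as g.

```groovy
// Does the airport exist? Returns nothing if it does not
g.V().has('code','XYZ')

// As above but returns an empty list if the airport does not exist
g.V().has('code','XYZ').fold()

// Return the existing vertex if there is one, otherwise create it
g.V().has('code','XYZ').fold().
      coalesce(unfold(),
               addV('airport').property('code','XYZ'))
```

Running the last query a second time should return the vertex that the first run created rather than adding another one.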



As you can see the query above created a new vertex with an ID of 53865 as the XYZ airport did not already exist. However, if we run the same query again, notice that we get the same vertex back that we just created and not a new one. This is because this time, the coalesce step does find a result from the unfold step and so completes before attempting the addV step.



Using coalesce in this way provides us with a nice pattern for a commonly performed task of checking to see if something already exists before we create it.

4.6. Deleting vertices, edges and properties

So far in this book we have looked at several examples where we created new vertices, edges and properties but we have not yet looked at how we can delete them. Gremlin provides the drop step that we can use to remove things from a graph.

4.6.1. Deleting a vertex

In some of our earlier examples we created a fictitious airport vertex with a code of XYZ and added it to the air routes graph. If we now wanted to delete it we could use the following Gremlin code. Note that removing the vertex will also remove any edges we created connected to that vertex.

// Remove the XYZ vertex
g.V().has('code','XYZ').drop()

4.6.2. Deleting an edge

We can also use drop to remove specific edges. The following code will remove the flights, in both directions, between AUS and LHR.

// Remove the flight from AUS to LHR  (both directions).
g.V().has('code','AUS').bothE('route').where(otherV().has('code','LHR')).drop()

4.6.3. Deleting a property

Lastly, we can use drop to delete a specific property value from a specific vertex. Let’s start by querying the properties defined by the air-routes graph for the San Francisco airport.


[country:[US],code:[SFO],longest:[11870],city:[San Francisco],elev:[13],icao:[KSFO],lon:[-122.375],type:[airport],region:[US-CA],runways:[4],lat:[37.6189994812012],desc:[San Francisco International Airport]]

Let’s now drop the desc property and re-query the property values to prove that it has been deleted.
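A sketch of how these two steps might be written, again assuming the vertex is located using its code property, is shown here.

```groovy
// Select only the desc property of the SFO vertex and drop it
g.V().has('code','SFO').properties('desc').drop()

// Re-query the remaining properties to confirm that desc has gone
g.V().has('code','SFO').valueMap()
```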



[country:[US],code:[SFO],longest:[11870],city:[San Francisco],elev:[13],icao:[KSFO],lon:[-122.375],type:[airport],region:[US-CA],runways:[4],lat:[37.6189994812012]]

If we wanted to delete all of the properties currently associated with the SFO airport vertex we could do that as follows.
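A minimal sketch, under the same assumption that the vertex is found using its code property, simply omits the key name so that the properties step selects every property.

```groovy
// With no key names given, properties() selects every property on the vertex
g.V().has('code','SFO').properties().drop()
```

Bear in mind that this also removes the code property itself, so afterwards the vertex can no longer be found using has('code','SFO').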


4.6.4. Removing all the edges or vertices in the graph

This may not be something you want to do very often, but should you wish to remove every edge in the graph you could do it, using the traversal object, g, as follows. Note that for very large graphs this may not be the most efficient way of doing it depending upon how the graph store handles this request.

// Remove all the edges from the graph
g.E().drop()

You could also use the graph object to do this. The code below uses the graph object to retrieve all of the edges and then iterates over them, removing them one by one. Again, for very large graphs this may not be an ideal approach as it requires reading all of the edge definitions into memory. Note that here we call the remove method rather than use drop, as we are not using a graph traversal.

// Remove all the edges using the graph object
graph.edges().each { it.remove() }

You could also delete the whole graph, vertices and edges, by deleting all of the vertices!

// Delete the entire graph!
g.V().drop()

4.7. Property keys and values revisited

We have already looked, earlier in the book, at numerous queries that retrieve, create or manipulate in some way the value of a given property. There are still however a few things that we have not covered in any detail concerning properties. Most of the property values we have looked at so far have been simple types such as a String or an Integer. In this section we shall look more closely at properties and explain how they can in fact be used to store lists and sets of values. We will also introduce in this section the concept of a property ID.

4.7.1. The Property and VertexProperty interfaces

In a TinkerPop 3 enabled graph, all properties are implementations of the Property interface. Vertex properties implement the VertexProperty interface which itself extends the Property interface. These interfaces are documented as part of the Apache TinkerPop 3 JavaDoc. The interface defines the methods that you can use when working with a vertex property object in your code. One important thing to note about vertex properties is that they are immutable. You can create them but once created they cannot be updated.

We will look more closely at the Java interfaces that TinkerPop 3 defines in the "Working with TinkerGraph from a Java Application" section a bit later in this book.

The VertexProperty interface does not define any "setter" methods beyond the basic constructor itself. Your immediate reaction to this is likely to be "but I know you can change a property’s value using the property step". Indeed we have already discussed doing just that in this book. However, behind the scenes, what actually happens when you change a property is that a new property object is created and used to replace the prior one. We will examine this more in a minute but first let’s revisit a few of the basic concepts of properties.

In a property graph both vertices and edges can contain one or more properties. We have already seen a query like the one below that retrieves the values from each of the property keys associated with the DFW airport vertex.
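As a sketch of the kind of query being described, assuming the air-routes graph is loaded and the vertex is found using its code property, calling values with no key names returns one value for each property on the vertex, the airport description among them.

```groovy
// Return the value of every property on the DFW airport vertex
g.V().has('code','DFW').values()
```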


Dallas/Fort Worth International Airport

What we have not mentioned so far, however, is that the previous query is a shortened form of this one.
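The longer form might be written as in the sketch below, which first selects the property objects and then asks each one for its value.

```groovy
// properties() yields the property objects; value() extracts each one's value
g.V().has('code','DFW').properties().value()
```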


Dallas/Fort Worth International Airport

If we wanted to retrieve the VertexProperty (vp) objects for each of the properties associated with the DFW vertex we could do that too. In a lot of cases it will be sufficient just to use values or valueMap to access the values of one or more properties but there are some cases, as we shall see when we look at property IDs, where having access to the vertex property object itself is useful.
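A minimal sketch of such a query simply stops after the properties step, so that the vertex property objects themselves are returned rather than their values.

```groovy
// Return the VertexProperty objects, displayed by the console as vp[key->value]
g.V().has('code','DFW').properties()
```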


vp[desc->Dallas/Fort Worth In]

We have already seen how each property on a vertex or edge is represented as a key and value pair. If we wanted to retrieve a list of all of the property keys associated with a given vertex we could write a query like the one below that finds all of the property keys associated with the DFW vertex in the air-routes graph.
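One way this might be written, assuming as before that the vertex is located using its code property, is to follow the properties step with the key step.

```groovy
// List the key of every property on the DFW airport vertex
g.V().has('code','DFW').properties().key()
```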



We could likewise find the names, with duplicates removed, of any property keys associated with any outgoing edges from the DFW vertex using this query. Note that edge properties are implementations of Property and not VertexProperty.
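A sketch of the edge version follows. It moves to the outgoing edges first and uses dedup so that each key name is reported only once.

```groovy
// Collect the property keys found on DFW's outgoing edges, removing duplicates
g.V().has('code','DFW').outE().properties().key().dedup()
```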



Now that we know how to reference both the key and the value parts of any property, we can construct a query like the one below. It totals the longest runway lengths and the runway counts across the whole graph by grouping the properties by key and then summing the values found for each key.
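A minimal sketch of such a query is shown here. It assumes the property keys are named runways and longest, as they appear in the SFO results earlier, and that it is run in the Gremlin console, where the key token and the anonymous value step are available without a prefix.

```groovy
// Select the two properties of interest from every vertex that has them,
// group them by property key, and sum the values within each group
g.V().properties('runways','longest').
      group().
        by(key).
        by(value().sum())
```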


[longest:25497644, runways:4816]

4.7.2. The propertyMap traversal step