DIY facts and figures for your favorite sport using "Open Data" and Neo4j

Have you ever wondered what the source is for the facts, figures and charts that sports commentators and journalists use? Have you ever had more new questions than questions answered after reading an article?

In sports, for example, big newspapers and tv stations cover large events like the Champions League. However, if you are interested in smaller countries, less popular sports or just curious about other aspects/statistics e.g. Olympics 2012: the alternative medals table - the media do not usually serve you well. The situation is similar in other areas.

With the idea of Open Data gaining traction and the general trend to publish results online, it is becoming easier to access data sources. Software like Neo4j makes it easy to run your research and statistics.

In this blog post, I will give an example from sports, using tennis data. At the end you will have seen an example of getting data from the internet, creating a data model, publishing your analysis on GraphGist and the tools that can help with data quality. You might also gain some ideas on what you could analyze yourself.

Note that some of the commands below require a working Python installation and a shell.

ATP Men's Singles Top 100 and 2016 Finalists

2016 is a year with Olympic Games, but the only easily usable data, at the time of writing, is from the 2008 Olympics and earlier. Results for the 2016 Olympic Games in Rio are available in lots of detail as PDFs. To extract e.g. all medal winners from all events would be a fair amount of work. For the scope of this blog post, I chose to search for data that already came in tabular form.

I settled for tennis with the idea to compare countries, their players and how well they do. Due to size limitations of publishing the results on GraphGist, I had to restrict the scope of the data used to the ATP men's singles top 100 ranking and the finals of the 2016 tournaments.

Identifying Nodes And Relationships

The task is to identify the nodes, relationships, and attributes that make up the graph data model. One way to do this is by looking at the domain and noting down keywords. Nouns often are nodes and verbs can become relationships. At the same time, you can note down attributes for the nodes.

In the tennis example, important concepts are Player, Match, Tournament, Ranking, Country, Tournament Series, Court, and Surface. Verbs are win, lose, play, advance to next round. Depending on the kind of questions one wants to answer, the outcome will be a different graph model. E.g. Rank should probably be a node if there are lots of queries about ranking and how it changes over time. Similarly, if the focus is on the surface, e.g. clay vs. grass, then it might make sense to have it as a Node.

Below is the graph model identification in our case. Nodes in bold, Relationships in bold italic and Attributes in italics.

Tennis Players have an ATP ranking that is updated weekly. They are from a Country. In singles tennis two players compete against each other in a Match which is played at a Tournament. The player who wins two sets (best of 3) advances to the next round. The loser is out. The winner of the final round wins the tournament. Tournaments belong to different series. The Grand Slam series is the most prestigious one and also gives the most points for the ranking. Matches in a Grand Slam tournament are best of 5.

alt text

Getting the raw data

As it is possible to import CSV files into databases, I was looking for the source to be in CSV already or in a format that is simple to convert to CSV. For example a spreadsheet or a website with HTML tables.

Player data including their ranking can be retrieved from the ATP Site. They are available weekly. As I was only interested in a snapshot and not the week to week changes I just picked the current week.
Sports data uses the IOC country codes and not the ISO codes, and so I took Country data from this Wikipedia entry and Match data from Tennis-Data.co.uk.

Player and Country data are HTML tables; Match data is in a spreadsheet. All three have to be converted to CSV format.

Tools for converting to CSV

csvkit is a collection of command line tools to manipulate CSV files. After installation, you will be able to run the commands below.

.xls to .csv

in2csv converts various tabular data formats into CSV.
in2csv 2016.xls will convert the match data to CSV format.

HTML table to tab delimited (*.tsv)

For HTML tables there is a Firefox Add-On Table Tools 2 which allows to manipulate the data and then export it to a CSV file. Table Tools 2 offers keyboard shortcuts, and I used it to sort the table and to remove columns of no interest. Getting the data from the HTML table to a CSV file is done by selecting Copy->Table as Tab-Delimited Text and pasting it into a new file.

.tsv to .csv

While it is possible to import a tab separated file into Neo4j, it is simpler to work with comma separated files everywhere. The following command transforms a tab separated file into a comma separated file csvcut --tabs inputfile > outputfile.

Data quality

Data downloaded from the internet is not guaranteed to be of high quality and needs to be validated. The Neo4j docs have a section on data quality and links to various tools.
I used csvkit to both validate and manipulate the CSV files before importing them with Neo4j.

Trailing whitespace can lead to duplicate entities on import into Neo4j. As none of the above tools checks for trailing or leading whitespace, I will show how to deal with whitespace in the section on import.

Validation

Both CVS syntax errors and data plausibility checks can be run using tools from csvkit.

csvclean cleans a CSV file of common syntax errors. csvclean --dry-run shows if there are common syntax errors in the CSV file without doing any cleaning.

csvstat prints descriptive statistics for all columns in a CSV file. It can be used for data plausibility checks. csvstat players.csv has the following output when running on a file with the top 100 players.

  1. Ranking
        <type 'int'>
        Nulls: False
        Min: 1
        Max: 100
        Unique values: 100
  2. Country
        <type 'unicode'>
        Nulls: False
        Unique values: 35
        5 most frequent values:
                FRA:    12
                ESP:    11
                ARG:    7
                GER:    6
                USA:    6
        Max length: 3
  5. LastName
        <type 'unicode'>
        Nulls: False
        Unique values: 100
        Max length: 15
  6. Age
        <type 'int'>
        Nulls: False
        Min: 18
        Max: 37
        Mean: 28.15
        Median: 28.0
        Unique values: 20
        5 most frequent values:
                26:     12
                31:     12
                27:     9
                25:     8
                30:     8
Row count: 100

It is easy to see that the age range is from 18 to 37 and the median is 28 which sounds plausible. Ranking goes from 1 to 100 with 100 unique values. Last names have a max length of 15 characters. Country stats show that both France and Spain provide more than 10% of the Top 100 players which seems a lot, but still in the plausible range.

Data Preparation

Player data

The player data obtained from the ATP website needed some changes to make it suitable for import. To first get a look at the data, csvlook is used to render a CSV to the command line in a readable, fixed-width format.

Preview the player data:

➜ ✗ csvlook players.csv
|----------+---------+------------------------------+------|
|  Ranking | Country | Player                       | Age  |
|----------+---------+------------------------------+------|
|  1       | SRB     | Novak Djokovic               | 29   |
|  2       | GBR     | Andy Murray                  | 29   |
|  3       | SUI     | Stan Wawrinka                | 31   |
|  4       | SUI     | Roger Federer                | 35   |
...
|  17      | ESP     | Roberto Bautista Agut        | 28   |

The above has the following issues.

First name(s) and last name in the same column.
In the case of multiple first or last names, it is unclear which one is which.

The goal was to extract the player column, split the first and last names and create two columns. For players with two 3 or more name parts, I guessed which part was the first and last names.

csvcut works on columns and makes it possible to slice, delete and reorder them in a CSV file.

Extract name column to separate file

➜ ✗ csvcut -c 3 players.csv > names.csv
|--------------------------------|
|  Player                        |
|--------------------------------|
|  Novak Djokovic                |
|  Andy Murray                   |
|  Stan Wawrinka                 |
|  Roger Federer                 |

Split first and last names

Add FirstName, LastName headers manually. Add a comma to separate first name and last name. When there are multiple first or last names, check manually.

|------------------+-------------------------|
|  FirstName       | LastName                |
|------------------+-------------------------|
|  Novak           |  Djokovic               |
|  Andy            |  Murray                 |
|  Stan            |  Wawrinka               |
|  Roger           |  Federer                |

remove name column from players

➜ ✗ csvcut -c 1,2,4 players.csv > players_without_names.csv
|----------+---------+------|
|  Ranking | Country | Age  |
|----------+---------+------|
|  1       | SRB     | 29   |
|  2       | GBR     | 29   |
|  3       | SUI     | 31   |
|  4       | SUI     | 35   |

Join players without name and the newly separated name. Preview result

csvjoin is used to join related data.

➜ ✗ csvjoin players_without_names.csv names.csv | csvlook
|----------+---------+-----+-----------------+-------------------------|
|  Ranking | Country | Age | FirstName       | LastName                |
|----------+---------+-----+-----------------+-------------------------|
|  1       | SRB     | 29  | Novak           |  Djokovic               |
|  2       | GBR     | 29  | Andy            |  Murray                 |
|  3       | SUI     | 31  | Stan            |  Wawrinka               |
|  4       | SUI     | 35  | Roger           |  Federer                |

As preview is looking good, execute the join csvjoin players_without_names.csv names.csv > players_with_full_names.csv

Reorder Columns

First look at the column order, then reorder them.

➜ ✗ csvcut -n players_with_full_names.csv
1: Ranking
2: Country
3: Age
4: FirstName
5: LastName

➜ ✗ csvcut -c 1,2,4,5,3 players_with_full_names.csv > players_prepared.csv
|----------+---------+-----------------+------------------------+------|
|  Ranking | Country | FirstName       | LastName               | Age  |
|----------+---------+-----------------+------------------------+------|
|  1       | SRB     | Novak           |  Djokovic              | 29   |
|  2       | GBR     | Andy            |  Murray                | 29   |
|  3       | SUI     | Stan            |  Wawrinka              | 31   |
|  4       | SUI     | Roger           |  Federer               | 35   |
...
|  17      | ESP     | Roberto         |  Bautista Agut         | 28   |

Tournaments & Matches

2016.csv contains all matches of the year up until now. It also has all the information regarding tournaments but duplicated many times.

Chaining the csvgrp command with csvcut allows to filter the rows and keep only a subset of columns. In my case, filtering for only the rows where the value for Round is "The Final" and then only keeping the columns relevant to tournaments.

Extract Tournament Data

➜ ✗ csvgrep -c Round -m "The Final" 2016.csv | csvcut -c ATP,Location,Tournament,Series,Court,Surface > tournaments.csv
|------+------------------+---------------------------+--------------+---------+----------|
|  ATP | Location         | Tournament                | Series       | Court   | Surface  |
|------+------------------+---------------------------+--------------+---------+----------|
|  1   | Brisbane         | Brisbane International    | ATP250       | Outdoor | Hard     |
|  2   | Chennai          | Chennai Open              | ATP250       | Outdoor | Hard     |
|  3   | Doha             | Qatar Exxon Mobil Open    | ATP250       | Outdoor | Hard     |
|  4   | Auckland         | ASB Classic               | ATP250       | Outdoor | Hard     |
|  5   | Sydney           | Apia International        | ATP250       | Outdoor | Hard     |
|  6   | Melbourne        | Australian Open           | Grand Slam   | Outdoor | Hard     |

Extract the Finals

➜ ✗ csvgrep -c Round -m "The Final" 2016.csv | csvcut -c Tournament,Date,Round,BestOf,Winner,Loser > finals.csv
|----------------------------+------------+-----------+--------+---------------+-------------------|
|  Tournament                | Date       | Round     | BestOf | Winner        | Loser             |
|----------------------------+------------+-----------+--------+---------------+-------------------|
|  Brisbane International    | 2016-01-10 | The Final | 3      | Raonic M.     | Federer R.        |
|  Chennai Open              | 2016-01-10 | The Final | 3      | Wawrinka S.   | Coric B.          |
|  Qatar Exxon Mobil Open    | 2016-01-09 | The Final | 3      | Djokovic N.   | Nadal R.          |
|  ASB Classic               | 2016-01-16 | The Final | 3      | Bautista R.   | Sock J.           |
|  Apia International        | 2016-01-16 | The Final | 3      | Troicki V.    | Dimitrov G.       |
|  Australian Open           | 2016-01-31 | The Final | 5      | Djokovic N.   | Murray A.         |

All the commands from csvkit can be scripted and thus automated. Which is useful if some data is regularly analyzed.

Importing the CSV Files into Neo4j

Importing the CSV files into Neo4j and building up the graph is the next step. For this first download and install Neo4j. Neo4j has several means of importing data. For medium sized datasets, LOAD CSV is a straightforward and fast approach, and it allows to import CSV data via files, HTTPS, HTTP, and FTP. Check out the LOAD CSV documentation for all the details.

The LOAD CSV cypher statement is run by executing neo4j-shell or other methods. Using neo4j-shell means pointing it to a file with cypher queries.

In the simplest case, the cypher statement is

LOAD CSV WITH HEADERS FROM "file:///path/to/file/players.csv" AS row
WITH row LIMIT 4
RETURN row;

Which tries to read the first four lines of the CSV file and returns all 4 lines. Running LOAD CSV without doing an import is a simple way to check if Neo4j can parse the CSV file.

➜ ✗ $NEO4J_HOME/bin/neo4j-shell -v -file ./check_players.cypher
+-----------------------------------------------------------------------------------------------+
| row                                                                                           |
+-----------------------------------------------------------------------------------------------+
| {LastName -> "Djokovic", Country -> "SRB", FirstName -> "Novak", Age -> "29", Ranking -> "1"} |
| {LastName -> "Murray", Country -> "GBR", FirstName -> "Andy", Age -> "29", Ranking -> "2"}    |
| {LastName -> "Wawrinka", Country -> "SUI", FirstName -> "Stan", Age -> "31", Ranking -> "3"}  |
| {LastName -> "Federer", Country -> "SUI", FirstName -> "Roger", Age -> "35", Ranking -> "4"}  |
+-----------------------------------------------------------------------------------------------+
4 rows
129 ms

To actually get the data into Neo4j, Nodes and Relationships will have to be created when running the import.

LOAD CSV WITH HEADERS FROM "file:///path/to/file/players_with_countries.csv" AS row
WITH row LIMIT 125

MERGE (c:Country{name: trim(row.Nation), ioc_code: trim(row.Country)})
MERGE (p:Player {lastName: trim(row.LastName), firstName: trim(row.FirstName), ranking: toInt(trim(row.Ranking)), age: toInt(trim(row.Age))})
MERGE (p)-[:FROM_COUNTRY]->(c);

Running this will populate the database and will output the number of entities created.

➜ ✗ $NEO4J_HOME/bin/neo4j-shell -v -file ./import_players.cypher
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 163
Relationships created: 125
Properties set: 576
Labels added: 163
427 ms

Note that in the import statements above trim() is used to remove whitespace from the data. E.g. trim(row.Nation). trim() or some other script which removes whitespace is necessary as none of the CSV tools can warn about or fix whitespace.

I wanted to import the CSV files from inside my git repo structure. To make this work with a default Neo4j installation, comment out dbms.directories.import=import in <NEO4J_INSTALLATION_DIRECTOR>/conf/neo4j.conf.

Building The Graph Model

Building the graph model was an iterative process as I did not get it right from the beginning and so I had to run the import statements many times. To make it easy to import, I created a shell script that runs all imports and used a cleanup cypher query to completely empty the database. Don't use this if you want to keep some of your entries in the database.

// Deletion
MATCH (n)
DETACH DELETE n

Querying The Graph

After the data was imported the graph looked something like this.

alt text

Now it was time to write the queries and get some meaningful data out.

Below I showcase some queries, with the full queries and results at the Tennis 2016 graph gist.

Country queries

Some simple queries:

Number of countries with a Top 100 player

MATCH (c:Country)-[:FROM_COUNTRY]-(players:Player)
WHERE players.ranking <= 100
RETURN count(DISTINCT c.name) + " countries in Top 100"

which returns

35 countries in Top 100

It looks like it is possible for many countries to get into the top 100. Somewhat surprisingly there are 24 different nations in the top 50 and 9 different countries in the top 10.

No of countries in top X
 9 countries in top  10
16 countries in top  25
24 countries in top  50
35 countries in top 100

What sticks out is that higher up in the rankings, countries are distributed more evenly. It would be interesting to see if that is a stable situation over time and how it compares to other sports.

A quick check at table tennis and athletics 100m shows similar results compared to tennis.
7 out of 10 and 12 out of 25 for table tennis
6 out of 10 and 13 out of 25 for athletics.
There is a higher concentration of countries in the top 10 and a similar distribution of nations in the top 25.

With 35 countries in the top 100, will they all have roughly 3 players or will there be big differences in the number of players?

Number of players in Top 100 for each country

MATCH (c:Country)-[:FROM_COUNTRY]-(players)
WHERE players.ranking <= 100
WITH c, count(players) AS player_count
ORDER BY player_count DESC
RETURN c.name AS Country_Name, player_count AS Number_Of_Players
LIMIT 10

The result shows two countries which both have more than 10% of the Top 100 players, 7 countries with between 4 and 7 players. Followed by 5 countries each with 3 and 2 players respectively and 16 countries with a single player.

╒══════════════════════╤═════════════════╕
│Country_Name          │Number_Of_Players│
╞══════════════════════╪═════════════════╡
│France                │12               │
├──────────────────────┼─────────────────┤
│Spain                 │11               │
├──────────────────────┼─────────────────┤
│Argentina             │7                │
├──────────────────────┼─────────────────┤
│United States         │6                │
├──────────────────────┼─────────────────┤
│Germany               │6                │
├──────────────────────┼─────────────────┤
│Russia                │5                │
├──────────────────────┼─────────────────┤
│Croatia               │4                │
├──────────────────────┼─────────────────┤
│Australia             │4                │
├──────────────────────┼─────────────────┤
│Great Britain         │4                │
├──────────────────────┼─────────────────┤
│Italy                 │3                │
└──────────────────────┴─────────────────┘

Are there countries well represented in the top 50 but with only a few players in the top 51 to 100. Such a distribution could show potential difficulty with getting enough young talents to come. (This is a very simplified view. One would need to take into account ranking over time and age of the players.)

Countries with potential successor problems

MATCH(c:Country)-[:FROM_COUNTRY]-(players:Player)
WHERE players.ranking <= 100
WITH  c, count(players) AS player_count
WHERE player_count > 0

OPTIONAL MATCH (c)-[:FROM_COUNTRY]-(players:Player)
WHERE players.ranking <= 50
WITH  c, count(players) AS player_count_top50

OPTIONAL MATCH (c:Country)-[:FROM_COUNTRY]-(players:Player)
WHERE 50 < players.ranking  <= 100
WITH player_count_top50, c, count(players) AS player_count_bottom50
WHERE player_count_top50 -  player_count_bottom50 >= 2 AND player_count_bottom50 < 3

RETURN c.name, player_count_top50 AS Players_in_Top_50,  player_count_bottom50 AS Players_in_Top_51_to_100, player_count_top50 + player_count_bottom50 AS total
ORDER BY player_count_top50 DESC, player_count_bottom50 DESC

While 20% of the players in the top 50 are from Spain, there is only a single player from Spain in ranks 51 to 100. Switzerland has two players in the top 10, but no other player in the top 100.

╒═════════════╤═════════════════╤════════════════════════╤═════╕
│c.name       │Players_in_Top_50│Players_in_Top_51_to_100│total│
╞═════════════╪═════════════════╪════════════════════════╪═════╡
│Spain        │10               │1                       │11   │
├─────────────┼─────────────────┼────────────────────────┼─────┤
│United States│4                │2                       │6    │
├─────────────┼─────────────────┼────────────────────────┼─────┤
│Croatia      │3                │1                       │4    │
├─────────────┼─────────────────┼────────────────────────┼─────┤
│Switzerland  │2                │0                       │2    │
└─────────────┴─────────────────┴────────────────────────┴─────┘

There is also the opposite case of countries with a lot more players in the top 51 to 100. Argentina is sticking out as the country with 6 players in the top 51 to 100 but only one player in the top 50. Interestingly, Argentina is one of the Davis Cup finalists in 2016. The other one being Croatia with 3 players in the top 50. I am curious now if any relationship can be found between player distribution and Davis Cup success.

╒═════════════╤═════════════════╤════════════════════════╤═════╕
│c.name       │Players_in_Top_50│Players_in_Top_51_to_100│total│
╞═════════════╪═════════════════╪════════════════════════╪═════╡
│Germany      │2                │4                       │6    │
├─────────────┼─────────────────┼────────────────────────┼─────┤
│Argentina    │1                │6                       │7    │
├─────────────┼─────────────────┼────────────────────────┼─────┤
│Russia       │1                │4                       │5    │
├─────────────┼─────────────────┼────────────────────────┼─────┤
│Great Britain│1                │3                       │4    │
├─────────────┼─────────────────┼────────────────────────┼─────┤
│Ukraine      │0                │3                       │3    │
├─────────────┼─────────────────┼────────────────────────┼─────┤
│Brazil       │0                │2                       │2    │
└─────────────┴─────────────────┴────────────────────────┴─────┘

Prepare for GraphGist

The GraphGist Portal has a good introduction and how-to for the creation of a GraphGist. The purpose of the GraphGist portal is to provide an easy way to share educational graphs with only a restricted number of nodes, relationships. It provides a limited interface compared to a full Neo4j instance. Thus I had to make the following adjustments.

LOAD CSV is only possible in a very specific way by hosting the CSV files inside a Github gist. By contrast, the GraphGist file itself can also be hosted on Dropbox which allows syncing in the background without having to commit/push after every change. The solution was first to import everything into a Neo4j instance and then dump the DB as cypher statements.
Graph size. Initially, I was importing more data. E.g. Top 500 players, all matches, all countries in the world. Changed to Top 100 players, only finals and only the countries used by a tournament or top 100 player.
Displaying the name of the node. When displaying the result of a query as a graph, the name of the node is the attribute that comes first alphanumerically. E.g. I wanted to display a player's lastName but instead its age was shown as 'a' sorts before 'l'. As a workaround, I renamed the lastName attribute to a_lastName for the GraphGist to force it to be used as the display name. This is purely a workaround for GraphGist and should never be done in a production database.

Conclusion

These days is a good time to run your analysis on data that is available on the internet. There is a lot of data available, specific portals to provide access to it and also communities behind it. In Switzerland, there is the Swiss Open Data Portal, the Opendata.ch association and the open data hack days.

Data exists in all kinds of formats. Some of it will have to be scraped from websites or extracted from PDFs. For data that is already in tabular form, csvkit is the tool to work with.

Once the data model is created and the date imported into Neo4j, the queries can be run on it. GraphGist allows for sharing small datasets/analysis for educational purposes. For larger needs, there are also hosted neo4j instances by Graphene DB and Graph Story with paid plans. Graphene DB also offers a free plan with a limit of 1000 nodes and 10'000 relationships.

It was fun working through this example. I learned a lot about CVS processing on the command line, writing cypher queries which helped me towards my Neo4j certification and last but not least Switzerland needs at least one other player in the top 100 for the time when Roger Federer will step down from active tennis.

Want to learn more about Neo4j? Need some help to get your data right? At 42talents we offer Graph Data Modeling and Neo4j Fundamentals Trainings as well as consulting to support you getting answers for your graphy questions.