Running geospatial queries on AWS Athena

Running geospatial queries on AWS Athena

I have been a user of AWS Athena for some time but didn’t know until recently that Athena possesses some nifty built-in geospatial querying abilities. In this article, I’ll explain what those abilities are and show a working demo of using them. If you want to follow along with the coding examples shown you should have an AWS account that has read/write access to Athena and the other AWS services that this implies such as Glue and S3. 

The problem statement we are trying to solve is this. Given a large input data set that contains numeric geographical data points (e.g latitude/longitude) can we produce a subset of that data that corresponds to a specific geographic area contained within the overall data set? To illustrate this we are going to use a UK 5-year vehicle accident data set and try to pinpoint only those accidents that occurred in a particular UK region, in this case it will be the city of Edinburgh.

In case you don’t know what Athena is, it’s AWS’s serverless, interactive query service that makes it easy to analyse data in Amazon S3 using standard SQL. Athena is integrated with the AWS Glue Data Catalogue, allowing you to create a unified metadata repository across various services. In simple terms, Athena is like a read-only SQL database where the tables correspond to objects contained in your S3 buckets and folders. 

In order to facilitate geospatial queries, Athena uses what is called the Well-Known Text (WKT) format, which is a text markup language for representing vector geometry objects. The WKT format was originally defined by the Open Geospatial Consortium (OGC) and described in their Simple Feature Access. The current standard definition is in the ISO/IEC 13249-3:2016 standard. 

For the Athena use case, WKT is varchar (string) data type and can represent the following distinct geometric objects:- 

  • A point 
  • A line 
  • A polygon 
  • A multiline 
  • A multipolygon 

The actual data that describes these objects can be any convenient set of numeric coordinates. In our case, we assume it will be latitude & longitude pairings. Each coordinate pair should be separated by a space character and coordinate pairings should be separated by a comma. For example, this would be a valid representation of a square-shaped polygon:- 

POLYGON((30.0 10.0, 60.0 10.0, 30.0 40.0, 60.0 40.0))

Athena engine version 2.0 supports the following geospatial functions.

Constructor functions 

Constructor functions are used to obtain binary representations of a point, line, or polygon geometry data type. You can also use these functions to convert binary data to text and obtain binary values for geometry data that is expressed as WKT. 

Geospatial relationship functions 

These functions express relationships between two different geometries that you specify as input and return results of type boolean. The order in which you specify the pair of geometries matters: the first geometry value is called the left geometry, and the second geometry value is called the right geometry. They return TRUE if the relationship described by the function is satisfied, otherwise they return FALSE. 

Operation functions 

Operation functions are used to perform operations on geometry data type values. For example, you can obtain the boundaries of a single geometry data type; intersections between two geometry data types; difference between left and right geometries, where each is of the same geometry data type; or an exterior buffer or ring around a particular geometry data type. 

Accessor functions 

Accessor functions are useful to obtain values in types varchar, bigint, or double from different geometry data types, where geometry is any of the geometry data types supported in Athena. For example, you can obtain an area of a polygon geometry data type, maximum and minimum X and Y values for a specified geometry data type, obtain the length of a line, or receive the number of points in a specified geometry data type. 

Aggregation functions 

Bing tile functions 

Bing tile functions convert between geometries and tiles in the Microsoft Bing maps tile system. 

Ok, let’s get started. By way of example, we’ll be using Athena and geospatial queries to display a list of all vehicle road accidents in Edinburgh, Scotland between the years 2016 and 2020. We need two different data sets for this. One that lists all accidents in the UK and one WKT format file that describes the boundary of Edinburgh's border in terms of latitude & longitude pairs. The accident data we’re using can be found at:-

https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-accident-last-5-years.csv 

After downloading this data set, we’ll upload it to an S3 bucket in AWS, and then create an Athena table based on it. 

When our main data set is in S3 we’ll need a WKT file that describes the boundary of the city of Edinburgh. That’s not that easy to come by so there is a free tool that we can use to roughly trace around Edinburgh's border, and it will produce the latitude & longitude points for us which, with a bit of data wrangling, we can turn into a WKT file. 

We can then upload that file to a different s3 folder and create a little lookup table in Athena based on the data it contains. After that, it’s a relatively simple Athena geospatial query to get the final data we’re interested in. 

Upload the main data file to S3 and create an Athena table for it.

Probably the easiest way to do this is to use Glue to crawl our input data set once it’s on S3. This will create a glue catalogue table and hence an Athena table too. This isn’t a tutorial on Glue so I’ll just present the Athena table DDL to give you an idea of what yours should look like if you’re following along.

CREATE EXTERNAL TABLE `data`(
  `accident_index` bigint,  
  `accident_year` bigint,  
  `accident_reference` bigint,  
  `location_easting_osgr` bigint,  
  `location_northing_osgr` bigint,  
  `longitude` double,  
  `latitude` double,  
  `police_force` bigint,  
  `accident_severity` bigint,  
  `number_of_vehicles` bigint,  
  `number_of_casualties` bigint,  
  `date` string,  
  `day_of_week` bigint,  
  `time` string,  
  `local_authority_district` bigint,  
  `local_authority_ons_district` string,  
  `local_authority_highway` string,  
  `first_road_class` bigint,  
  `first_road_number` bigint,  
  `road_type` bigint,  
  `speed_limit` bigint,  
  `junction_detail` bigint,  
  `junction_control` bigint,  
  `second_road_class` bigint,  
  `second_road_number` bigint,  
  `pedestrian_crossing_human_control` bigint,  
  `pedestrian_crossing_physical_facilities` bigint,  
  `light_conditions` bigint,  
  `weather_conditions` bigint,  
  `road_surface_conditions` bigint,  
  `special_conditions_at_site` bigint,  
  `carriageway_hazards` bigint,  
  `urban_or_rural_area` bigint,  
  `did_police_officer_attend_scene_of_accident` bigint,  
  `trunk_road_flag` bigint,  
  `lsoa_of_accident_location` string) 
ROW FORMAT DELIMITED  
  FIELDS TERMINATED BY ','  
STORED AS INPUTFORMAT  
  'org.apache.hadoop.mapred.TextInputFormat'  
OUTPUTFORMAT  
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
LOCATION 
  's3://your_bucket/your_folder/'  

Most of the columns we don’t need, but the important thing is that each record contains a latitude and longitude that pinpoints the location where the accident occurred. Here's a snippet of the first few columns and records.


No alt text provided for this image


As you can see, our main data file doesn’t contain any geographic information apart from lat/longs (and eastings/northings) that could be used to pinpoint the accident location and this is where our geospatial queries will come in useful. 

Create a WKT text file and upload it to S3 

The next thing we need is a WKT file describing, using latitude/longitude pairings, the outline of Edinburgh’s border. As I said before this isn’t easy to obtain but I came across a useful website that can help with this. Click on the following URL:- 

http://apps.headwallphotonics.com/ 

You should be presented with a satellite image of part of the world map. You can zoom in and out of this just as you would something like, say, google maps. Change from satellite to regular map format by clicking on the appropriate link near the top left-hand corner of the screen. 

No alt text provided for this image

 



Now we need to zoom in on Edinburgh. For those whose geography is a bit ropey, Edinburgh is the capital city of Scotland, UK. Once you’ve done that, click on the New Polygon menu item near the top middle of the screen. From here you can simply move your mouse around and left-click on enough locations to encompass the border area of Edinburgh. Move individual points around by holding down the left mouse button as you click on them. You don’t have to be too accurate with this but when you’re done your screen should look something like this. 

No alt text provided for this image

Importantly, near the bottom left-hand side of the screen, you should see a list of latitude & longitude points making up the polygon you’ve just drawn. You should copy and paste this data into a text editor like so:- 

#Polygon_0
55.967170834718694, -3.351221312947499
55.93372286246732, -3.374567260213124
55.8929311007601, -3.3134558100178113
55.883304146932815, -3.2173254389240613
55.88445950748224, -3.1019689936115613
55.916026048046746, -3.0511572260334363
55.942952775440276, -3.0566503900959363
55.989069359173484, -3.153467406697499
55.99943801749178, -3.3010961908771863
55.98906935917351, -3.3137991327717176

Next, you need to massage that file a bit. Remove the first line and all commas. At the end of each lat/long pairing insert a comma and remove all newline characters so that the entire data is on just one line. Book-end the start and end of the line with double round brackets and insert the word POLYGON at the start of the line. Next, we need to give this polygon a name e.g EDINBURGH and separate this name from the rest of the line with a pipe character. So, your final file before uploading to S3 should look like this:-

EDINBURGH|POLYGON((55.967170834718694 -3.351221312947499,55.93372286246732 -3.374567260213124,etc ...,55.98906935917351 -3.3137991327717176)) 

Save this file with a .wkt extension and upload it to a suitable location in S3, different from where your main accident data set lives. Run the following command in the Athena query editor to create an Athena table based on the polygon data in S3.

create external table edinburgh_poly(
city_name string, 
geom string 
) 
row format delimited fields terminated by '|' 
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat' 
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
location 's3://your-bucket/your_folder’  


Running our geospatial Athena query 

Ok, so the hard part is done. We just need to run a geospatial query that will sub-select all accident data for the UK that happened in Edinburgh. For that, we will use 3 Athena geospatial functions. 

st_point  st_point takes two input parameters and returns a geospatial POINT type. The two parameters it will take will be our lat/long pairings from our main accident data set 

st_polygon st_polygon takes a WKT description of a polygon and returns a geospatial polygon. This will be the data contained in our edinburgh_poly table. 

st_contains Our final function accepts two parameters. The first is our POLYGON geospatial structure (returned from the st_polygon function). The second parameter is our POINT geospatial structure (returned from the st_point function). The st_contains function returns TRUE if the point is on or inside the boundary of the polygon.

So, we now have everything we need for our final query that returns all accidents from the UK data set that occurred within Edinburgh. Open the Athena query editor and type in the following:- 

select * from data t1, edinburgh_poly t2
where st_contains(st_polygon(t2.geom),st_point(t1.latitude,t1.longitude)) 
and t2.city_name = 'EDINBURGH' 
 

If you’ve managed to follow along ok so far you should come up with query output similar to this.


No alt text provided for this image

 

Let’s double-check some of those coordinates that were returned to make sure the query worked ok. To do that I saved the first 100 latitude/longitude pairs returned from the above query and saved them to a CSV file. I then opened MyMaps in Google and imported the CSV file to plot out the points. You can see the result of that in the image below showing all the points clustered around the city of Edinburgh so I'm satisfied the query was working OK.

No alt text provided for this image


Ok, that's all I have for now. If you found this article useful please like and share to help spread the knowledge around and check out the following links for more details. 

http://apps.headwallphotonics.com/ 

https://docs.aws.amazon.com/athena/latest/ug/querying-geospatial-data.html 

https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry 

https://mymaps.google.com/

Sandhya K Ravi

Author | VIT University | IIM Calcutta | Faculty - Placements & Trainings @VIT | Ex - HR | Softskills Trainer | Internationally certified softskills Trainer from ISIM, Mumbai / AICI Global | Career Development Centre

2y

Pls check your DM

Like
Reply
Cristián Andrés Vargas Acevedo

ES | Data Technical lead | Data Engineer en XBrein | Diplomado en Big Data & Data Science

2y

some clickbait for the buffer 🤪, but still good post

Like
Reply

To view or add a comment, sign in

Explore topics