Data cleaning: latitude and longitude

When cleaning a dataset you need to check everything.  I’m trying to make sense of the extreme storms archive provided by the Bureau of Meteorology.  Most of the records are fine but there are some errors in latitude and longitude (Figure 1).

Australian towns like Cobram, Rutherglen and Wodonga (the bottom rows in Figure 1) are all in the southern hemisphere so, by convention, their latitudes should be negative.  Similarly, the whole of Australia is well south of the equator, so latitudes of zero or -1 are not correct.  The numbers -1, 0 and 1 seem to be sentinel values that indicate missing data, so they need to be recoded.

Figure 1: Errors in latitude and longitude
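
The recoding described above might be sketched as follows.  This is only an illustration: the data frame, the column names lat and lon, and the approximate coordinates are made up for the example and are not taken from the archive.  Flipping the sign of every positive latitude also assumes that all records really are Australian.

```r
# Made-up example rows; column names 'lat' and 'lon' are assumptions
storms <- data.frame(
  town = c("Cobram", "Rutherglen", "Wodonga", "Unknown A", "Unknown B"),
  lat  = c(35.92, 36.05, 36.12, 0, -1),      # positive values have the wrong sign
  lon  = c(145.65, 146.46, 146.89, 0, 1)
)

# Treat -1, 0 and 1 as sentinel values meaning 'missing'
sentinels <- c(-1, 0, 1)
storms$lat[storms$lat %in% sentinels] <- NA
storms$lon[storms$lon %in% sentinels] <- NA

# Southern hemisphere latitudes are negative, so force the sign (NA stays NA)
storms$lat <- -abs(storms$lat)

storms
```

Recoding the sentinels before fixing the sign matters here, otherwise a sentinel of 1 would be turned into -1 rather than into a missing value.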

Wikipedia lists the extreme points of Australia.  Those for the mainland plus Tasmania are:

  • Northernmost point – Cape York Peninsula, Qld 10°41′ S
  • Southernmost point – South East Cape, Tas 43°39′ S
  • Westernmost point – Steep Point, WA 113°09′ E
  • Easternmost point – Cape Byron, NSW 153°38′ E
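
To compare these limits with coordinates stored as decimal degrees, the degrees and minutes need converting.  A minimal sketch, where dms_to_decimal is a hypothetical helper written for this example (south and west are treated as negative):

```r
# Hypothetical helper: degrees and minutes to signed decimal degrees
dms_to_decimal <- function(degrees, minutes, hemisphere) {
  sgn <- ifelse(hemisphere %in% c("S", "W"), -1, 1)
  sgn * (degrees + minutes / 60)
}

extremes <- c(
  north = dms_to_decimal(10, 41, "S"),
  south = dms_to_decimal(43, 39, "S"),
  west  = dms_to_decimal(113, 9, "E"),
  east  = dms_to_decimal(153, 38, "E")
)

round(extremes, 2)
# north -10.68, south -43.65, west 113.15, east 153.63
```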

Points outside this range may refer to islands, or be in error.  Best to check.
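
A range check along these lines could flag the records that need a closer look.  The bounding box is the extreme points above rounded to two decimal places; the function name in_bounds and the example rows (with approximate coordinates) are assumptions made for illustration.

```r
# Approximate bounding box for the mainland plus Tasmania, in decimal degrees
lat_range <- c(-43.65, -10.68)
lon_range <- c(113.15, 153.63)

# Hypothetical helper: TRUE when a point lies inside the box; missing values count as outside
in_bounds <- function(lat, lon) {
  !is.na(lat) & !is.na(lon) &
    lat >= lat_range[1] & lat <= lat_range[2] &
    lon >= lon_range[1] & lon <= lon_range[2]
}

# Made-up records: one valid, one with the wrong sign, one on an offshore island
check <- data.frame(
  place = c("Rutherglen", "Sign error", "Lord Howe Island"),
  lat   = c(-36.06, 36.06, -31.55),
  lon   = c(146.46, 146.46, 159.08)
)

check$needs_review <- !in_bounds(check$lat, check$lon)
check[check$needs_review, ]
```

The flag only says a record is worth checking: the island row is a legitimate location, while the sign error is a genuine mistake.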

Latitudes and longitudes can be looked up for a locality using the geocode function in the ggmap package in R.  We can also go from coordinates to locality using revgeocode.

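library(ggmap)
# Note: recent versions of ggmap need a Google API key, set with register_google()

geocode('Rutherglen, Australia')
# Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Rutherglen,%20Australia&sensor=false
#        lon       lat
# 1 146.4625 -36.05556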

# Which state is Rutherglen in?

x <- geocode('Rutherglen, Australia', output = 'all')
x$results[[1]]$address_components[[3]]$long_name
# [1] "Victoria"
x$results[[1]]$address_components[[3]]$short_name
# [1] "VIC"

# From coordinates to locality

revgeocode(c(146.4625, -36.05556))
# [1] "62 Main St, Rutherglen VIC 3685, Australia"

# Which state are these coordinates in?

x <- revgeocode(c(146.4625, -36.05556), output = 'all')

x$results[[5]]$address_components[[1]]$long_name
# [1] "Victoria"

x$results[[5]]$address_components[[1]]$short_name
# [1] "VIC"

# Alternatively
x <- geocode('Rutherglen, Australia', output = 'more')
as.character(x$administrative_area_level_1)
# Victoria
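
The fixed positions used above ([[3]] for one result, [[5]] and [[1]] for another) depend on how many components a particular response happens to contain.  A more robust approach might be to search the components by type.  The sketch below runs on a made-up, abbreviated list shaped like one result from output = 'all', so it needs no API call; component_by_type is a hypothetical helper, not part of ggmap.

```r
# Made-up, abbreviated mock of one geocoding result (real responses carry more fields)
result <- list(
  address_components = list(
    list(long_name = "Rutherglen", short_name = "Rutherglen",
         types = list("locality", "political")),
    list(long_name = "Victoria", short_name = "VIC",
         types = list("administrative_area_level_1", "political")),
    list(long_name = "Australia", short_name = "AU",
         types = list("country", "political"))
  )
)

# Hypothetical helper: return a field from the first component with the given type
component_by_type <- function(result, type, field = "short_name") {
  for (comp in result$address_components) {
    if (type %in% unlist(comp$types)) return(comp[[field]])
  }
  NA_character_   # no component of that type
}

component_by_type(result, "administrative_area_level_1")   # "VIC"
component_by_type(result, "country", field = "long_name")  # "Australia"
```

With a real response the same helper would be applied to one element of x$results, for example x$results[[1]].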

Some manual checking will still be required, but a lot can be automated, which will avoid many gross errors.
