Indexing GeoNames into Solr
This post walks through a quick and easy way to index GeoNames.org locations into Solr 5.2.1. It uses the Solr default configuration for the gettingstarted collection.
For more on Solr collections vs cores.
Getting Solr 5.2.1 up and going
Download and unzip Solr 5.2.1
$ ls solr*
solr-5.2.1.zip
$ unzip -q solr-5.2.1.zip
$ cd solr-5.2.1/Start Solr
$ bin/solr start -e cloud -nopromptYou should now be able to successfully navigate to http://127.0.0.1:8983/solr
Formatting GeoNames.org data for Solr
GeoNames provides several data download types available on their website. This post will focus on indexing allCountries.txt which includes all features from GeoNames. This file unzipped is ~1.2 GB which could be troublesome for some. Beginning users may want to start with a smaller dataset such as cities1000.txt which is a smaller subset of the GeoNames data.
Someone out there probably could do all of this in an awesome one liner. These steps are broken up for better understanding of whats going on. We first need to format the GeoNames data into something that is indexable into Solr.
Download and unzip allCountries.zip
Download available from GeoNames.
$ unzip -q allCountries.zipallCountries.txt comes in a tab-delimited text file in utf-8 encoding. The following fields are provided:
| Field | Description | 
|---|---|
| geonameid | integer id of record in geonames database | 
| name | name of geographical point (utf8) varchar(200) | 
| asciiname | name of geographical point in plain ascii characters, varchar(200) | 
| alternatenames | alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000) | 
| latitude | latitude in decimal degrees (wgs84) | 
| longitude | longitude in decimal degrees (wgs84) | 
| more … | we don’t need the rest of these | 
We won’t use most of these columns, so let’s get rid of the ones we don’t need.
Get rid of columns we don’t need
We only need the 1st, 2nd, 5th, and 6th columns.
$ cut  -f1-2,5-6 allCountries.txt > allCountries_red.txtAdd a header row
Add in a header row to the tsv text file. Note, whitespace delimiters (between id, title_t, lat, lng) should be tab literals.
$ sed '1s/^/id  title_t lat lng\
/g' allCountries_red.txt > allCountries_head.txtAdd a WKT column
This command requires the csvpys version of csvkit software. Running the command will create a new WKT point column loc_srpt using the existing lat and lng columns. *_srpt is a Spatial Recursive Prefix Tree Field Type dynamic Solr field shipped with the default gettingstarted Solr schema.
$ csvpys --tab -s loc_srpt "'POINT(' + ch['lng'] + ' ' + ch['lat'] + ')'" allCountries_head.txt > allCountries_wkt.txtOnly keep the columns we need
Get rid of the lat and lng columns
$ csvcut -c 1,2,5 allCountries_wkt.txt > allCountries_wkt_cut.txtConvert the tsv to json
$ csvjson -i 2 allCountries_wkt_cut.txt > allCountries.jsonIndex into Solr
If you are doing this using the full allCountries.txt file, this command can take a while (at least 5 minutes). This command will index over 10 million records into your Solr index. You can check the status of this command by seeing if the document counts in your Solr collection are increasing. You can see this by using the Solr admin interface.
$ curl 'http://localhost:8983/solr/gettingstarted/update?commit=true' --data-binary @allCountries.json -H 'Content-type:application/json'You should now have your GeoNames data indexed in Solr!
Checkout a Solr query.
// http://127.0.0.1:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=true
{
  "responseHeader":{
    "status":0,
    "QTime":39,
    "params":{
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":144573,"start":0,"maxScore":1.0,"docs":[
      {
        "title_t":["El Tarter"],
        "id":"3039154",
        "loc_srpt":"POINT(1.65362 42.57952)",
        "_version_":1504146876751937536},
      {
        "title_t":["Sant Julià de Lòria"],
        "id":"3039163",
        "loc_srpt":"POINT(1.49129 42.46372)",
        "_version_":1504146876821143552},
      {
        "title_t":["Pas de la Casa"],
        "id":"3039604",
        "loc_srpt":"POINT(1.73361 42.54277)",
        "_version_":1504146876823240704},
      ...
  }
}You can now do all sorts of fun spatial search things in Solr!