Johann Schmitz

In the last article I described how to deploy Apache Solr on Tomcat. Now we will import some real data so we can actually search for something and see the incredible speed of Lucene/Solr.

Solr ships with some example documents (example/exampledocs/ in the distribution archive), but the sample data amounts to only a few KB. We need more data. Much more.

We could use one of the Wikipedia dumps from http://dumps.wikimedia.org/backup-index.html, but that is a giant XML file and I don't want to write an XSLT script to transform it into Solr-compliant update files. Thankfully, the Archive Team provides some big data dumps from various sources. The GeoCities and MobileMe archives are a little too much data for getting started with a basic Solr installation, so I chose the Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape dump (404 MB). The dump contains about 5 million tweets from 5000 Twitter users along with their GPS positions (see the description PDF for a detailed explanation of the data). So download and extract twitter_cikm_2010.zip. In the meantime, we configure our Solr installation.

About documents, schema and fields

Solr (Lucene, to be precise) is document-based. This means that every piece of searchable information is wrapped in a document. Think of a document as the "thing" a user would search for: products in an online shop, auctions on an online auction platform, or articles in Wikipedia.

Each Solr instance ("core" in Solr terminology) contains exactly one schema (a formal description of the documents). If you want to search for different document types which share no common information, you have to set up another core (read: instance) of Solr and define a different schema.

Each field is a distinct attribute of a document. It has a type, which defines the analyzers and filters used for pre-processing.

Defining a schema

Our installation contains a pre-built schema with field types in /opt/solr/conf/schema.xml. Open the file in your favourite text editor and look around. At the top you see many <fieldType> nodes - these are the available data types. Each field type may have separate analyzers assigned for indexing and querying: tokenizers to split the content into smaller pieces (words or phrases), and filters to remove irrelevant parts (like stopwords or meta-characters) or to stem words into their root form.

Each document has a primary key, typically named id. You can change the field name of the primary key with the <uniqueKey /> node. This primary key is used to uniquely identify the document in the Solr index. We will see later how to use the primary key to update the index or to remove a document from it. Furthermore, each schema has a default search field. By default it's named text. You can change that name too, but for simplicity I recommend simply including a field named text.

If you have worked with databases before (especially relational databases), the schema may look wrong to you: all searchable attributes of the document are put into a single entity. Tip: simply ignore everything you know about normalization :)

The definition of the "default" schema starts around line 871: a simple list of <field /> nodes with a name attribute, a reference to one of the defined field types, and a few other properties:

  • required: like the name says
  • indexed: Set the value to true if you want to search for this field
  • stored: Set the value to true if you want this attribute in your search results

Just comment out the existing field definitions and add the following declarations:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="text" type="text_general" indexed="true" stored="true" required="true" />
<field name="uid" type="long" indexed="true" stored="true" required="true" />
<field name="date" type="date" indexed="true" stored="true" required="true" />
<field name="location" type="text_general" indexed="true" stored="true" required="true" />

This schema covers (more or less) all the data we have in the downloaded archive. Our primary key is the tweet id (the "primary key" of tweets on Twitter). Even though the tweet id is a numeric value, we declare it as a string, because Solr has some problems with numeric primary keys.

Importing the data into the Lucene index

Solr/Lucene supports multiple ways to import data into the index: you can simply POST update XML messages, import CSV files, or instruct Solr to read data from a URL, a UNC path and so on.

I took the simplest (and pre-configured) way to import the data: the update XML. Such a document looks like this:

<add>
    <doc>
        <field name="field1">value</field>
        <field name="field2">value<field>
        <field name="field3">value</field>
    </doc>
    ... more <doc /> nodes ...
</add>

I wrote a Python script to transform the tweet files into XML files. Run it with python main.py /path/to/test_set_users.txt /path/to/tweets/test_set_tweets.txt /tmp/foo/. CAUTION: The script loads all user data and tweets into memory - don't run it with less than 2 GB of free RAM! It's a quick'n'dirty, single-threaded script that is not written for speed, so it will take some time. It produces a set of .xml files containing 10,000 tweet documents each.
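
The script itself is nothing special. If you want to roll your own, the core of it is just building the update XML from the parsed tweets. Here is a minimal sketch of that part - this is not the original script, the parsing of the dump files is omitted, and the field names simply match the schema defined above:

import xml.etree.ElementTree as ET

def tweets_to_update_xml(tweets):
    # tweets: an iterable of dicts with the keys id, text, uid, date and location
    add = ET.Element("add")
    for tweet in tweets:
        doc = ET.SubElement(add, "doc")
        for name in ("id", "text", "uid", "date", "location"):
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(tweet[name])
    # ElementTree takes care of escaping characters like & and < in the tweet text
    return ET.tostring(add, encoding="unicode")

# Example, using one of the tweets that shows up in the search results below:
print(tweets_to_update_xml([{
    "id": "10186236530",
    "text": "@Janelliebeans PASTA !!!",
    "uid": 19899137,
    "date": "2010-03-08T14:21:49Z",
    "location": "43.764292,-79.732801",
}]))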

Now we can import the XML files into our index. Use the post.sh script as described in the installation post to send the data to the server: /path/to/example/exampledocs/post.sh /tmp/foo/*.xml (remember to change the URL variable to point to your server).
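
If you prefer not to use post.sh, a few lines of Python do the same job. This is only a sketch and assumes the solr-web context path from the previous article - adjust the update URL to match your deployment:

import glob
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8080/solr-web/update"  # adjust to your deployment

def post_xml(data):
    request = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=data,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()

# Send every generated update file to Solr ...
for path in sorted(glob.glob("/tmp/foo/*.xml")):
    with open(path, "rb") as f:
        post_xml(f.read())

# ... and commit, so the new documents become visible to searches.
post_xml(b"<commit/>")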

Searching in the index

Open your Solr admin interface at http://localhost:8080/solr-web/admin/ (screenshot: the Solr admin interface).

You can enter your search query (e.g. "Pasta") in the textarea. Queries without a field specification are executed against the default search field - "text" in our schema (see above). Executing a search opens the result XML:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">32</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="start">0</str>
    <str name="q">pasta</str>
    <str name="version">2.2</str>
    <str name="rows">10</str>
  </lst>
</lst>
<result name="response" numFound="1033" start="0">
  <doc>
    <date name="date">2010-03-08T14:21:49Z</date>
    <str name="id">10186236530</str>
    <str name="location">43.764292,-79.732801</str>
    <str name="text">@Janelliebeans PASTA !!!</str>
    <long name="uid">19899137</long>
  </doc>
  <doc>
    <date name="date">2009-05-22T11:01:34Z</date>
    <str name="id">1883735694</str>
    <str name="location">41.926556,-72.675233</str>
    <str name="text">@jamie_oliver Pasta!</str>
    <long name="uid">17540628</long>
  </doc>
  ...
</result>
</response>

The result XML contains a response header at the top. Its QTime field shows the time in milliseconds it took to execute the query. Searching 5,000,000 tweets in 32 ms is really fast!
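
By the way, you don't need the admin interface to run queries - it just builds a URL against the standard select handler. A sketch of the same query from Python, again assuming the solr-web context path:

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"q": "pasta", "start": 0, "rows": 10, "indent": "on"})
url = "http://localhost:8080/solr-web/select/?" + params

with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))  # prints the result XML shown above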

Updating the search index

To update a document in the search index, re-post the update XML with the same id value. To delete a document by id, POST a delete message to the server:

<?xml version="1.0" encoding="UTF-8"?>
<delete>
    <query>id:298253</query>
</delete>

You can change the <query /> node value to a more complex query to delete many documents without specifying their IDs.
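
For example, a query like uid:19899137 removes all tweets of a single user (the uid is taken from the search results above). A small sketch that sends such a delete message and the follow-up commit, using the same update URL assumption as before:

import urllib.request

def post_xml(data):
    request = urllib.request.Request(
        "http://localhost:8080/solr-web/update",  # adjust to your deployment
        data=data,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()

# Delete every tweet of the user with uid 19899137 ...
post_xml(b"<delete><query>uid:19899137</query></delete>")
# ... and commit, so the deletion becomes visible.
post_xml(b"<commit/>")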