MS RFC 52: One-pass query processing

Date:

2009/03/08

Authors:

Steve Lime

Contact:

sdlime at comcast.net

Last Edited:

2009/06/03

Status:

Adopted

Version:

MapServer 6.0

Overview

This RFC proposes change(s) to the current of query (by point, by box, by shape, etc…) processing in MapServer.

Presently MapServer supports a very flexible query mechanism that utilizes two passes through the data. This works by caching a list of feature IDs (pass one) and then a second pass through the features for presentation: templated output, drawing, or retrieval via MapScript. The obvious problem is the performance hit incurred from the second pass. The real pain is that the msLayerGetShape() function, as implemented, provides random access to the data which can be very expensive for certain drivers.

Technical Solution

There are a number of potential solutions:

1. One could cache the returned shapes in memory. While this wouldn’t result in a true single-pass, you wouldn’t have to go back to the original driver twice. However, it could lead to large memory consumption with even moderately-sized datasets. Multiple clients accessing services at the same time would only compound the problems. Some testing has confirmed this method to be no faster and probably even a bit slower than option 3 below.

2. Another solution would be fold much of the query pre- and post-processing code into the msLayerWhichShapes() and msLayerNextShapes() functions so that the access paradigm used in drawing layers could be used. Subsequent research has let us to conclude that a true single pass is not possible in some cases. For example, GML requires a result set envelope be written out before writing individual features. There’s no way to get that initial envelope without a pass through the features.

3. A final solution would be to change how the msLayerGetShape() function behaves. We prepose changing the behavior of that function to provide random access to a result set (as defined by msLayerWhichShapes()) rather than the entire data set. This removes most of the overhead currently incurred by referencing the results already returned by the data driver in the initial query. For implementation purposes we would retain the current msLayerGetShape() implementations to support easy random access and create a new function called msLayerResultsGetShape().

msLayerGetShape() would only be used via MapScript to preserve true random access functionality. Note that drawing uses msLayerNextShape() and does not rely on index values.

Under this last solution data drivers would need to do two things:

  • update the population of the shapeObj index property (long int) with a value that will allow msLayerResultsGetShape() to randomly access a result

  • creation of the driver-specific version of msLayerResultsGetShape() to retrieve a shape from the results created in msLayerWhichShapes()

  • the default implementation of msLayerResultsGetShape() would simply wrap msLayerGetShape() since most drivers would not have to implement the newer function (e.g. OGR or shapefiles)

The query functions would need to:

  • not close the layer when finished with a query (we assume that users will want to do something with the results)

  • allow msLayerWhichItems() to retrieve ALL items so that the retrieved shapes are presentation-ready (draw, template, or …)

The presentation functions:

  • refrain from calling msLayerOpen(), msLayerWhichItems(), msLayerWhichShapes() since that has already been done in the query functions

MapScript layerObj wrappers:

  • add resultsGetShape() method

This solution has been piloted in the single-pass sandbox with very promising results. In some cases queries run orders of magnitude faster. One positive side effect is that primary keys need not be used to retrieve features from the result set. It is the drivers responsibility to provide data to uniquely identify a row in the result set.

Backwards Compatibility Issues

This solution will most likely require changes to MapScript applications that process query results depending on types of data being processed. Those hitting shapefiles for example will still function “as is” since there that driver will still be using msLayerGetShape(). Any script hitting a RDBMS data source will have to swith to using resultsGetShape() instead of getFeature()/getShape().

One casualty would be the query save/read functions. Since the processing of a set of results would be specific to dataset result handle it won’t be possible to get back to a result once a layer is ultimately closed. A proposed solution to this problem is presented next.

Query File Support

Query files have provided a means of saving the results of a query operation for use in subsequent map production. The series of indexes gathered during a query are written to disk and read later to be used to access the data a feature at a time. With the proposed changes this simply won’t work with RDBMS data sources. It becomes necessary to instead recreate the result set but re-executing the query. Problem is, there’s no easy way to serialize query parameters.

I propose creating a new object, a queryObj, to store the various parameters associated with MapServer queries. It might look something like:

typedef struct {
   int type; /\* By rect, point, shape, attribute, etc... Types match the query functions. \*/
   int qlayer; /\* used by all functions \*/

   rectObj *rect; /\* used by msQueryByRect() \*/

   char *qitem; /* used by msQueryByAttribute() \*/
   char *qstring; /\* used by msQueryByAttribute() \*/

   ...and so on...
} queryObj;

A single queryObj would hang off a mapObj and the mapObj would be the sole parameter passed to the various query methods. MapServer C code, primarily the CGI and OGC interfaces would simply populate the appropriate queryObj members and call the correct query function.

MapScript would be unchanged. The wrappers for the various query functions need only use the user supplied parameters to populate the queryObj and then call the query function. The queryObj would be immutable.

By storing all the information in a single store it should be easily serialized to disk. When read, the reconsituted queryObj could then be used to re-execute the appropriate query. The msSaveQuery() and msLoadQuery() function signatures would remain “as is” although the internals would change.

Files Impacted

  • driver files: implement msXXXLayerResultsGetShape() function to shape fetching code if necessary

  • maptemplate.c: don’t open/close a layer when presenting results

  • mapgml.c: don’t open/close a layer when presenting results

  • mapdraw.c: don’t open/close a layer etc… IF if drawing a query map

  • maplayer.c: refactor msLayerWhichItems(), add msLayerResultsGetShape() to the layer plugin API

  • mapquery.c: re-work msSaveQuery() and msLoadQuery(), change query functions to take a lone mapObj as input, add msInitQuery() and msFreeQuery() functions

  • mapserv.c: populate map->query before calling query functions

  • mapwxs.c (various): populate map->query before calling query functions

  • mapfile.c: leverage msInitQuery() and msFreeQuery() functions

  • mapserver.h: define queryObj, add to mapObj

  • mapscript (various): update map/layer query methods to populate a queryObj

  • others? (mapcopy.c for one)

Although a number of files are impacted the changes in general are relatively simple.

Unknowns

To date only shapefiles, PostGIS and Oracle drivers have been tested with this new scheme, all with positive results. Even shapefiles showed a performance improvement simply due to incurring the overhead of opening files just once. It’s not clear how OGR, SDE and raster queries will be impacted. I hope the owners of those drivers can comment further.

MapServer has supported a rather obscure query method called queryByIndex() as basically a wrapper to msLayerGetShape(). This change may render that method obsolete but more checking need be done.

Voting History

+1 from SteveL, DanielM, HowardB, PerryN, TamasS