MS RFC 69: Support for clustering of features in point layers

Date:2011/02/19
Author:Tamas Szekeres
Contact:szekerest at gmail.com
Status:Adopted (Implemented on 2011-03-04)
Version:MapServer 6.0

Description: This RFC proposes an implementation for clustering multiple features from a layer to single (aggregated) features based on their relative positions.

1. Overview

In order to make the maps perspicuous at a given view, we may require to limit the number of the features rendered at neighbouring locations which would normally overlap each other. Currently there’s no such mechanism in MapServer which would prevent from the symbols to overlap based on their relative locations. In a feasible solution we should provide rendering the isolated symbols as is, but create new (clustered) features for those symbols that would overlap in a particular scale. According to the example at http://trac.osgeo.org/mapserver/attachment/ticket/3700/cluster.png the number of the features forming the clusters are displayed in the labels for each clustered features.

3. The proposed soution

This functionality will be implemented as a separate layer data provider (implemented in mapcluster.c) This provider will be used internally (being invoked in msLayerOpen). The clustering parameters can be specified for each layer type as follows:

LAYER
  TYPE POINT
  CONNECTIONTYPE OGR
  NAME cluster
  CLUSTER
     MAXDISTANCE 20  # in pixels
     REGION ellipse  # can be rectangle or ellipse
     GROUP (expression)  # we can define an expression to create separate groups for each value
     FILTER (expression) # we can define a logical expression to specify the grouping condition
  END
  ...
END

We can also use multiple classes to display the clustered shapes. The referred example above use the following layer definition:

LAYER
    CLUSTER
        MAXDISTANCE 50
        REGION "rectangle"
    END
    LABELITEM "Cluster:FeatureCount"
    CLASSITEM "zoomcode"

    CLASS
             TEMPLATE "query.html"
             STYLE
             SYMBOL "image4"
             END
             LABEL
                    ...
             END
             EXPRESSION "Cluster:Empty"
          END

      CLASS
             TEMPLATE "query.html"
             STYLE
             SYMBOL "image1"
             END
             LABEL
                    ...
             END
             EXPRESSION "5"
          END
          CLASS
             TEMPLATE "query.html"
             STYLE
             SYMBOL "image2"
             END
             LABEL
                    ...
             END
             EXPRESSION "4"
          END
          CLASS
             TEMPLATE "query.html"
             STYLE
             COLOR 0 0 255
             SYMBOL "image3"
             END
             LABEL
                    ...
             END
             EXPRESSION "3"
          END
          ...
  END

3.1 The concept of the implementation

In the proposed solution msLayerOpen will call the vtable method of the cluster layer provider instead of the original vtable method depending on the existence of the clustering options, something like:

if (layer->cluster.region)
  return msClusterLayerOpen(layer);

In msClusterLayerOpen the data provider will override the vtable functions so that the subsequent LayerWhichShapes/LayerNextShape/LayerClose (and some further) functions will be handled by this provider and not by the original data source. The clustering process itself will be handled in the LayerWhichShapes call. This is the only place where the features are retrieved from the original data source and then cached in the local clustering database (stored in layerinfo).

The clustering process itself will be implemented in the following way:

  1. For each feature we create a tentative cluster with some aggregate attributes (like the feature count, the average position and the variance) the features are added into a customized quadtree data structure which provides quick access when searching for the neighboring shapes
  2. For each feature we will retrieve all the neighbouring shapes (that has already been retrieved earlier) within the specified distance (CLUSTERMAXDISTANCE) and search shape (CLUSTERREGION) by using a quadtree search. We will also inspect the filter and group conditions in each relations, In the related clusters we update the feature count (n) average positions (avg) and the variance (var) for each intersecting clusters by using the following recursive formula:
n = n + 1
avg(n) = avg(n-1) * (n-1) / n + x(n) / n
var(n) = var(n-1) * (n-1) / n + pow2(x(n) - avg(n)) / (n-1)
  1. In a second turn we evaluate the tentative clusters based on their feature count and the offset of the average position related to the initial position and the variance. The best ranking clusters will be identified by minimizing the position offset and the variance. The individual features (having rank=0) will be retrieved first in this approach.
  2. The best ranking clusters will be added to the finalization list (in layerinfo) and the finalized clusters (and the related features) will be removed from the quadtree as well.
  3. Based on the finalized features we update the average position and the variance of the affected clusters which are still exist in the quadtree.
  4. Repeat from #4 until we have features in the quadtree.

In LayerNextShape the features are served from the finalization list which is preserved until the layer is open. In LayerClose the vtable of the layer will be restored to the original methods (by calling msInitializeVirtualTable)

3.2 Handling the feature attributes (items)

The clustered layer itself will provide the following aggregate attributes:

1) Cluster:FeatureCount - count of the features in the clustered shape 1) Cluster:Group - The group value of the cluster (to which the group expression is evaluated)

These attributes can be used to configure the labels of the features and can also be used in expressions. The clustered layer will also support to get further attributes from the original data source as referenced in the LAYER configuration. The ITEMS processing option can be used to specify additional attributes from the source layer according to the user preference.

If we retrieve the original attributes then the layer provider will provide only those values which are equal for each shapes in the cluster. The other values are set to “Cluster:Empty”. In the future we may probably extend this by implementing aggregate functions to define the attributes, like min([attributename]) or max([attributename])

3.3 Handling classes and styles

We can define the symbology and labelling of the clustered layers in the same way as any other layer by specifying the classes and styles. STYLEITEM AUTO is not considered to be supported at this phase.

3.4 Query processing

In the query operations the clustered features are retrieved as single shapes with the attribute set as specified in the ITEMS processing option. The clustered features are preserved until the layer is open, so the single pass query approach is provided by this driver.

4. Implementation Details

In order to implement this enhancement the following changes should be made in the MapServer codebase:

  1. Modify the lexer to contain the new keywords
  2. Create a new struct (clusterObj) to contain the clustering parameters and implement the handlers (initCluster, loadCluster, freeCluster, writeCluster) in mapfile.c
  3. Modify maplayer.c to invoke msClusterlayerOpen based on the existence of the clustering parameters, and make msLayerWhichItems aware of the group and filter expressions
  4. Implement mapcluster.c containing the code of the cluster layer data provider

4.1 Files affected

The following files will be modified/created by this RFC:

Makefile.vc
Makefile.in
maplayer.c (msLayerOpen and msLayerWhichItems)
mapfile.c, mapfile.h(for handling clusterObj)
mapcluster.c (the code of the new data provider)
cluster.i (SWIG interface file to expose clusterObj)
maplexer.l
mapserver.h

4.2 MapScript Issues

The new object (clusterObj) will be exposed to the mapscript interface.

4.3 Backwards Compatibilty Issues

This change provides a new functionality with no backwards compatibility issues being considered.

5. Bug ID

The ticket for RFC-69 (containing implementation code) can be found here.

Bug 3700

6. Voting history

+1 from SteveL, AssefaY, ThomasB and TamasS