5 Optimize gridding params

Why aggregate data to grids?

Discretization
Data in continuous space is hard to analyze directly, whereas a discretized region is easy to work with. Defining spatial analysis units (grids) discretizes the region.
Comparable
All grids have the same size, so their attributes can be compared against the same standard.
Controllable
In a grid-based framework, aggregation accuracy is controllable: smaller grids improve accuracy but increase the computational burden.
Efficient
With TransBigData, GPS data can be matched to grids at low computational cost, so the matching is fast.

In TransBigData, the gridding framework is determined by the gridding params. Each set of gridding params defines a gridding coordinate system. The params are as follows:

params=(lonStart,latStart,deltaLon,deltaLat,theta)
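To make the role of the five parameters concrete, here is a minimal sketch of how a point could be mapped to a cell index on a rotated rectangular grid. It is not TransBigData's implementation: the helper `point_to_cell` is hypothetical, and it assumes that `theta` is a counter-clockwise rotation in degrees around (lonStart, latStart) and that this point is the lower-left corner of cell (0, 0). The library's own convention may differ, so use its matching functions in practice.

```python
import math

def point_to_cell(lon, lat, params):
    """Map a point to (col, row) on a rotated rectangular grid (illustrative only)."""
    lon_start, lat_start, delta_lon, delta_lat, theta = params
    t = math.radians(theta)
    # Offset of the point from the grid origin.
    dx, dy = lon - lon_start, lat - lat_start
    # Express the offset in the grid's own (rotated) axes.
    u = dx * math.cos(t) + dy * math.sin(t)
    v = -dx * math.sin(t) + dy * math.cos(t)
    # Cell index = how many cell widths/heights fit before the point.
    return math.floor(u / delta_lon), math.floor(v / delta_lat)

point = (113.6126, 22.4100)
print(point_to_cell(*point, (113.6, 22.4, 0.005, 0.0045, 0.0)))   # (2, 2)
print(point_to_cell(*point, (113.6, 22.4, 0.005, 0.0045, 45.0)))  # (3, -1)
```

The same point falls into a different cell once the grid is rotated, which is why the origin and `theta` are worth optimizing alongside the cell size.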

Choosing appropriate gridding params is a basic step in any study, and it can have a great impact on the final analysis results.

The choice of grid depends on the data and on the purpose of the analysis.
Suppose we want to use the grid system to analyze vehicle trajectories. If a grid boundary coincides with a road centerline, vehicles traveling along that road section generate GPS points along the grid boundary. After matching the GPS points to grids, the resulting grid sequences can differ greatly even when the vehicles pass through the same road section. In other words, a good grid coordinate system is one in which trajectories following the same path produce similar grid sequences.
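The boundary effect described above can be reproduced with a toy example. The sketch below uses abstract planar units and a hypothetical helper `cell_sequence`; it assumes two vehicles drive along the same road at y = 0 with a small, opposite GPS offset.

```python
import math

def cell_sequence(points, origin_y, size=1.0):
    """Return the de-duplicated sequence of (col, row) cells visited by a trajectory."""
    seq = []
    for x, y in points:
        cell = (math.floor(x / size), math.floor((y - origin_y) / size))
        if not seq or seq[-1] != cell:
            seq.append(cell)
    return seq

# Two vehicles on the same road (y = 0); GPS deviation shifts them by +/-0.05.
xs = [0.5 + i for i in range(4)]
vehicle_a = [(x, 0.05) for x in xs]
vehicle_b = [(x, -0.05) for x in xs]

# Grid boundary lies on the road: the vehicles fall into different rows.
on_boundary = (cell_sequence(vehicle_a, 0.0), cell_sequence(vehicle_b, 0.0))
# Grid shifted by half a cell: the road runs through the cell centres.
shifted = (cell_sequence(vehicle_a, -0.5), cell_sequence(vehicle_b, -0.5))

print(on_boundary[0] == on_boundary[1])  # False
print(shifted[0] == shifted[1])          # True
```

Shifting the grid origin by half a cell is enough to make both trajectories produce the same grid sequence.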

A natural idea is to take urban road network data as input and optimize the grid parameters against the road network. However, for a gridding framework like TransBigData, this is not the best solution. The GPS data we want to analyze is not limited to vehicle trajectories, and it does not have to follow a given road network. Moreover, the spatial features of the road network are already embedded in the vehicle trajectories. Thus, the selection of gridding parameters should depend on the spatial attributes of the GPS data itself.

When analysing individual mobility data, the criteria for an optimal grid are different. Individuals usually spend more time, and therefore generate more data, at their activity points, so a better gridding should match the data of one activity point into the same grid. As a result, a small number of grids should hold most of the data.

Here, we offer three methods to optimize the gridding params: centerdist, gini and gridscount.

[1]:
import pandas as pd
import geopandas as gpd
import transbigdata as tbd
# Read taxi GPS data
tripdata = pd.read_csv(r'data/TaxiData-Sample.csv')
tripdata.columns = ['track_id','time','lon','lat','OpenStatus','Speed']

# Retain only the data inside the study area
area = gpd.read_file(r'data/gis/szarea1.json')
tripdata = tbd.clean_outofshape(tripdata,area,col=['lon','lat'])

# Generate initial gridding params (accuracy is the grid size in meters)
bounds = [113.6,22.4,114.8,22.9]
initialparams = tbd.area_to_params(bounds,accuracy = 500)

centerdist: Minimize the distance between grid center and GPS data

When a batch of closely spaced data points lies near the edge of a grid, GPS positioning deviation causes these points to be matched into different grids. One solution is to minimize the distance between the grid centers and the GPS data.
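As a rough illustration of this criterion, the sketch below computes the mean distance from each point to the center of the cell it falls in. It is an assumption-laden toy, not the library's objective function: `mean_center_distance` is a hypothetical helper, cells are square and unrotated, and distances are planar in abstract units, so the values are not comparable with the index printed by `grid_params_optimize`.

```python
import math

def mean_center_distance(points, origin, size):
    """Average Euclidean distance from each point to the centre of its square cell."""
    ox, oy = origin
    total = 0.0
    for x, y in points:
        col = math.floor((x - ox) / size)
        row = math.floor((y - oy) / size)
        # Centre of the cell that contains the point.
        cx = ox + (col + 0.5) * size
        cy = oy + (row + 0.5) * size
        total += math.hypot(x - cx, y - cy)
    return total / len(points)

# A tight cluster of points around (1.0, 1.0).
cluster = [(0.95, 0.95), (1.05, 0.95), (0.95, 1.05), (1.05, 1.05)]

# Origin (0, 0): the cluster straddles a cell corner and is split over four cells.
corner_case = mean_center_distance(cluster, origin=(0.0, 0.0), size=1.0)
# Origin (0.5, 0.5): the cluster sits at the centre of a single cell.
centered = mean_center_distance(cluster, origin=(0.5, 0.5), size=1.0)

print(corner_case > centered)  # True
```

A lower value means the data sits closer to cell centers, which is the direction an optimizer for this criterion would move the grid.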

[2]:
#Optimize gridding params
params_optimized = tbd.grid_params_optimize(tripdata,
                                            initialparams,
                                            col=['track_id','lon','lat'],
                                            optmethod='centerdist',
                                            sample=0, #not sampling
                                            printlog=True)

Optimized index centerdist: 160.41280636449184
Optimized gridding params: {'slon': 113.60144616975187, 'slat': 22.401543058590295, 'deltalon': 0.004872390756896538, 'deltalat': 0.004496605206422906, 'theta': 43.585298279322615, 'method': 'rect'}
../_images/gallery_Example_5-Optimize_grid_params_7_1.png

gini: Maximize the gini index

In economics, the Gini index is a measure of statistical dispersion intended to represent income or wealth inequality within a nation or social group. Here, we borrow the concept to describe how GPS data is distributed across the grids: a higher Gini index means the data is more concentrated in a small number of grids.
The Gini index is most helpful when analysing human mobility data.
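For intuition, the following sketch computes a Gini coefficient over per-grid point counts using the standard rank-based formula. The helper `gini` and the cell labels are hypothetical, and only occupied cells are counted here. How the library treats empty cells, or how it scales and signs the printed index, is not specified in this section.

```python
from collections import Counter

def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even, higher = more concentrated)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending counts.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical cell labels for 20 GPS points spread over the same four cells.
even_cells = ["a", "b", "c", "d"] * 5          # 5 points in each cell
skewed_cells = ["a"] * 17 + ["b", "c", "d"]    # most points in one cell

even = gini(list(Counter(even_cells).values()))
skewed = gini(list(Counter(skewed_cells).values()))
print(even < skewed)  # True
```

The skewed layout, where one cell holds most of the points, scores higher, which matches the goal of concentrating each activity point's data in few grids.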
[3]:
#Optimize gridding params
params_optimized = tbd.grid_params_optimize(tripdata,
                                            initialparams,
                                            col=['track_id','lon','lat'],
                                            optmethod='gini',
                                            sample=0, #not sampling
                                            printlog=True)

Optimized index gini: -0.11709661279249717
Optimized gridding params: {'slon': 113.60363252207824, 'slat': 22.40161914185426, 'deltalon': 0.004872390756896538, 'deltalat': 0.004496605206422906, 'theta': 47.730990684694575, 'method': 'rect'}
../_images/gallery_Example_5-Optimize_grid_params_10_1.png

gridscount: Minimize the average count of grids for individuals

Under this criterion, each individual should appear in as few grids as possible.
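A minimal sketch of this measure, assuming square unrotated cells in abstract planar units: count the distinct cells each individual appears in and average over individuals. The helper `mean_cells_per_individual` and the sample records are hypothetical and only illustrate the idea.

```python
import math
from collections import defaultdict

def mean_cells_per_individual(records, origin, size):
    """Average number of distinct square cells visited by each individual."""
    ox, oy = origin
    visited = defaultdict(set)
    for track_id, x, y in records:
        cell = (math.floor((x - ox) / size), math.floor((y - oy) / size))
        visited[track_id].add(cell)
    return sum(len(cells) for cells in visited.values()) / len(visited)

# Two individuals, each dwelling around a single activity point.
records = [
    ("u1", 0.95, 0.95), ("u1", 1.05, 1.05), ("u1", 1.05, 0.95),
    ("u2", 2.95, 1.05), ("u2", 3.05, 0.95),
]

print(mean_cells_per_individual(records, origin=(0.0, 0.0), size=1.0))  # 2.5
print(mean_cells_per_individual(records, origin=(0.5, 0.5), size=1.0))  # 1.0
```

With the shifted origin, each individual's points collapse into one cell, so the average drops from 2.5 to 1.0.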

[4]:
#Optimize gridding params
params_optimized = tbd.grid_params_optimize(tripdata,
                                            initialparams,
                                            col=['track_id','lon','lat'],
                                            optmethod='gridscount',
                                            sample=0, #not sampling
                                            printlog=True)

Optimized index gridscount: 9.0
Optimized gridding params: {'slon': 113.60372085909265, 'slat': 22.403002740815666, 'deltalon': 0.004872390756896538, 'deltalat': 0.004496605206422906, 'theta': 44.56000665402531, 'method': 'rect'}
../_images/gallery_Example_5-Optimize_grid_params_13_1.png

TransBigData also supports optimizing triangle and hexagon gridding parameters.

[5]:
initialparams['method'] = 'tri'
[6]:
#Optimize gridding params
params_optimized = tbd.grid_params_optimize(tripdata,
                                            initialparams,
                                            col=['track_id','lon','lat'],
                                            optmethod='centerdist',
                                            sample=0, #not sampling
                                            printlog=True)

Optimized index centerdist: 136.87564489047065
Optimized gridding params: {'slon': 113.60421146982776, 'slat': 22.402738210124514, 'deltalon': 0.004872390756896538, 'deltalat': 0.004496605206422906, 'theta': 31.61303640854649, 'method': 'tri'}
../_images/gallery_Example_5-Optimize_grid_params_16_1.png
[7]:
initialparams = tbd.area_to_params(bounds,accuracy = 500/(6**0.5))
initialparams['method'] = 'hexa'
[8]:
#Optimize gridding params
params_optimized = tbd.grid_params_optimize(tripdata,
                                            initialparams,
                                            col=['track_id','lon','lat'],
                                            optmethod='centerdist',
                                            sample=0, #not sampling
                                            printlog=True)

Optimized index centerdist: 135.60103782128888
Optimized gridding params: {'slon': 113.60043088516572, 'slat': 22.400303375881162, 'deltalon': 0.0019891451969749397, 'deltalat': 0.0018357313884130575, 'theta': 17.62535531106509, 'method': 'hexa'}
../_images/gallery_Example_5-Optimize_grid_params_18_1.png