3 Trajectory processing using TransBigData

Starting from version 0.5.0, the TransBigData library provides a comprehensive set of methods for processing vehicle trajectory data, including preprocessing and drift correction, identification of stops and trips, grid-based representation, visualization, and more. This article introduces how to use TransBigData to process trajectory data.

[1]:
import pandas as pd
import geopandas as gpd
import transbigdata as tbd
# ensure the tbd version is 0.5.0 or above
tbd.__version__
[1]:
'0.5.0'

Trajectory Quality

First, we read the data and inspect its basic information to check for missing values. A DataFrame's built-in methods make it easy to view this information, including the data types, the number of fields and rows, and whether any values are missing. The code is as follows:

[2]:
# Read the data
data = pd.read_csv('data/TaxiData-Sample.csv', header=None)
data.columns = ['id', 'time', 'lon', 'lat', 'OpenStatus', 'speed']
# Convert the time format
data['time'] = pd.to_datetime(data['time'])
data
[2]:
id time lon lat OpenStatus speed
0 34745 2023-05-29 20:27:43 113.806847 22.623249 1 27
1 34745 2023-05-29 20:24:07 113.809898 22.627399 0 0
2 34745 2023-05-29 20:24:27 113.809898 22.627399 0 0
3 34745 2023-05-29 20:22:07 113.811348 22.628067 0 0
4 34745 2023-05-29 20:10:06 113.819885 22.647800 0 54
... ... ... ... ... ... ...
544994 28265 2023-05-29 21:35:13 114.321503 22.709499 0 18
544995 28265 2023-05-29 09:08:02 114.322701 22.681700 0 0
544996 28265 2023-05-29 09:14:31 114.336700 22.690100 0 0
544997 28265 2023-05-29 21:19:12 114.352600 22.728399 0 0
544998 28265 2023-05-29 19:08:06 114.137703 22.621700 0 0

544999 rows × 6 columns

[3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 544999 entries, 0 to 544998
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   id          544999 non-null  int64
 1   time        544999 non-null  datetime64[ns]
 2   lon         544999 non-null  float64
 3   lat         544999 non-null  float64
 4   OpenStatus  544999 non-null  int64
 5   speed       544999 non-null  int64
dtypes: datetime64[ns](1), float64(2), int64(3)
memory usage: 24.9 MB

The output lists each field's data type, its number of non-null values, and the memory usage. If a field's non-null count is less than the total number of rows, that field contains missing values.

Next, we will use TransBigData to generate a data quality report and observe the number of vehicles in the data, the observation time period, and the sampling interval:

[4]:
# Generate data quality report
tbd.data_summary(data, col=['id', 'time'], show_sample_duration=True)
Amount of data
-----------------
Total number of data items:  544999
Total number of individuals:  180
Data volume of individuals(Mean):  3027.7722
Data volume of individuals(Upper quartile):  4056.25
Data volume of individuals(Median):  2600.5
Data volume of individuals(Lower quartile):  1595.75

Data time period
-----------------
Start time:  2023-05-29 00:00:00
End time:  2023-05-29 23:59:59

Sampling interval
-----------------
Mean:  27.995 s
Upper quartile:  30.0 s
Median:  20.0 s
Lower quartile:  15.0 s
[Figure: distribution of sampling intervals]

Redundancy Elimination

Redundancy elimination is an important step in cleaning trajectory data. It reduces the data volume and improves data processing efficiency without affecting the information contained in the data. In practical trajectory data processing, you may encounter the following two types of redundancy:

  1. Redundancy of Duplicate Data at the Same Moment:

In a trajectory dataset, there may be multiple trajectory data entries for the same vehicle at the same moment. This can happen when the sampling interval is too short compared to the precision of the time field in the dataset. For example, if the sampling interval is 1 second, but the time field in the dataset has a precision of 1 minute, it can result in multiple trajectory data entries for the same vehicle within the same minute. Not removing these redundant data can cause difficulties in subsequent processing. The method to eliminate this redundancy is straightforward: keep only one trajectory data entry for the same vehicle at the same moment.

  2. Redundancy of Vehicle Stops:

In vehicle trajectory data, the sampling interval is usually very short, such as a few seconds, so data is generated continuously whether the vehicle is moving or stationary. In practice, the focus is usually on the trajectory during trips rather than during stops. For each stop, we only need its start and end times; the data generated in between, at the same location, is redundant and can be removed to reduce the overall data size. For a sequence of n consecutive data entries (n >= 3) at the same location, we only need to keep the first and last entries. In code, it suffices to compare the vehicle ID and the latitude-longitude of each entry with those of the previous and next entries; if they are the same, the entry can be removed (see the sketch after this list).
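
To make both rules concrete, here is a minimal pandas sketch; drop_redundancy is a hypothetical helper written for illustration, not the library implementation (TransBigData's own function is introduced below):

import pandas as pd

def drop_redundancy(df, col=['id', 'time', 'lon', 'lat']):
    uid, t, lon, lat = col
    df = df.sort_values(by=[uid, t])
    # Rule 1: keep a single record per vehicle per timestamp
    df = df.drop_duplicates(subset=[uid, t], keep='first')
    # Rule 2: drop records whose vehicle ID and position are identical to
    # both the previous and the next record (middle points of a stop)
    same_prev = (df[[uid, lon, lat]] == df[[uid, lon, lat]].shift()).all(axis=1)
    same_next = (df[[uid, lon, lat]] == df[[uid, lon, lat]].shift(-1)).all(axis=1)
    return df[~(same_prev & same_next)]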

However, the redundancy elimination method mentioned above for vehicle stops does not consider the information carried by fields other than the vehicle ID, time, and latitude-longitude. For example, in the case of taxi vehicles, passengers may board the taxi during a stop, changing the status from “vacant” to “occupied.” In such cases, this information needs to be preserved.

TransBigData provides a function, tbd.traj_clean_redundant(), for eliminating redundancy in trajectory data. It handles both redundancy situations described above and can also detect redundancy in fields other than the vehicle ID, time, and latitude-longitude. The code is as follows:

[5]:
# data volume before Redundancy Elimination
len(data)
[5]:
544999
[6]:
# Data redundancy removal to reduce data size and improve computational efficiency in subsequent steps
data = tbd.traj_clean_redundant(
    data,
    col=['id', 'time', 'lon', 'lat', 'speed']  # Apart from vehicle ID, time, longitude, and latitude, consider whether the speed field has redundancy
)
len(data)

[6]:
421099

This snippet removes redundant records to reduce the data size and improve the efficiency of subsequent computation. The traj_clean_redundant function from the tbd module takes the data as input and the columns (col) to consider for redundancy removal: the vehicle ID, time, longitude, latitude, and optionally the speed field. The result is stored back in data, and the updated length is output.

Drift Cleaning

In vehicle trajectory data, the collected positions may deviate from the actual ones due to factors such as errors in the data collection devices, environmental interference, device malfunctions, unstable GPS signals, insufficient satellite coverage, and signal obstruction. This discrepancy between the actual and collected positions is known as data drift. In the data, drift manifests as large distances between trajectory points and the actual positions, often accompanied by sudden jumps. Such drift can distort subsequent spatial analysis and statistics, so the trajectory data must be cleaned to ensure accuracy and usability.

To clean vehicle trajectory data for drift, one approach is to remove trajectory data points outside the defined study area. The study area can be defined in two ways: either by specifying the bottom-left and top-right coordinates to determine the boundary range (bounds) or by using a geographic information file (geojson or shapefile) that represents the study area.
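
For the first case, a short example (assuming tbd.clean_outofbounds() is available in your TransBigData version; the bounds values are the study-area rectangle used later in this article):

# Bounding-box variant: keep only points inside the rectangle defined by
# the bottom-left and top-right corners
bounds = [113.75, 22.4, 114.62, 22.86]  # [lon_min, lat_min, lon_max, lat_max]
data_inbounds = tbd.clean_outofbounds(data, bounds, col=['lon', 'lat'])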

To remove drift data using a geojson or shapefile, you first need to read the geojson or shapefile as a GeoDataFrame type in GeoPandas. Then, you can use the intersects() method provided by GeoPandas to determine if the trajectory data is within the study area. However, this method requires performing spatial geometry matching for each trajectory data point, which can be time-consuming for large datasets. The tbd.clean_outofshape() method provided in the TransBigData package offers a more efficient approach. It first maps the trajectory data to the corresponding grid using built-in grid partitioning methods and then performs spatial matching based on the grid, significantly improving the cleaning efficiency.

Here is an example code snippet using TransBigData’s clean_outofshape() method for data drift cleaning:

[7]:
# Read the study area boundary
sz = gpd.read_file('data/sz.json')
sz.plot()
[7]:
<AxesSubplot:>
[Figure: boundary of the study area]
[8]:
# Data drift removal
# Removing data outside the study area
data = tbd.clean_outofshape(data, sz, col=['lon', 'lat'], accuracy=500)
len(data)

[8]:
419448

Cleaning drift within the study area requires assessing each point against the continuous changes of the trajectory around it. There are three common approaches:

  1. Velocity threshold method: if the speed between the current trajectory point and its previous and subsequent points exceeds a threshold, the point is considered drift.

  2. Distance threshold method: if the distance between the current trajectory point and its previous and subsequent points exceeds a threshold, the point is considered drift (a minimal sketch of this criterion follows the list).

  3. Angle threshold method: if the angle formed by the previous, current, and subsequent points is smaller than a threshold, the point is considered drift.
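
To make the distance criterion concrete, here is a minimal sketch for a single vehicle's time-sorted records; drift_by_distance is a hypothetical helper using the haversine formula, not the library implementation:

import numpy as np

def drift_by_distance(df, dislimit=4000):
    # Flag a point as drift when it lies more than `dislimit` meters from
    # BOTH its previous and its next neighbour
    lat, lon = np.radians(df['lat']), np.radians(df['lon'])
    dlat, dlon = lat.diff(), lon.diff()
    # Haversine distance to the previous point, in meters
    a = np.sin(dlat / 2)**2 + np.cos(lat) * np.cos(lat.shift()) * np.sin(dlon / 2)**2
    d_prev = 2 * 6371000 * np.arcsin(np.sqrt(a))
    d_next = d_prev.shift(-1)  # distance to the next point
    return (d_prev > dislimit) & (d_next > dislimit)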

In TransBigData, the tbd.traj_clean_drift() method is provided to clean trajectory data for multiple vehicles. This method integrates distance, velocity, and angle thresholds into a single function.

Here is an example code snippet using TransBigData’s traj_clean_drift() method:

[9]:

# Drift cleaning within the study area using speed, distance, and angle as criteria
data = tbd.traj_clean_drift(
    data,                              # Trajectory data; may contain multiple vehicles, distinguished by ID
    col=['id', 'time', 'lon', 'lat'],  # Column names of the trajectory data
    speedlimit=80,                     # Speed threshold in km/h; set to None to skip speed-based filtering
    dislimit=4000,                     # Distance threshold in meters; set to None to skip distance-based filtering
    anglelimit=30)                     # Angle threshold in degrees; set to None to skip angle-based filtering
len(data)
[9]:
405286

Extracting Stops and Trips

In long-term continuous observations of vehicle trajectory data, a common requirement is to extract the stops and trips from the trajectory data. The stops can be analyzed to determine the duration and location of the vehicle’s parking, while the trips can be further analyzed to extract information such as the origin and destination of each trip, the travel path, travel duration, travel distance, and travel speed. In this section, we will explain how to identify stops and trips from vehicle trajectories, extract the trajectory information for each trip, and generate trajectory lines.

In vehicle trajectory data, stops and trips are typically identified using a time threshold method. The approach is as follows: to smooth out small fluctuations in the recorded positions, a grid is first defined over the geographical space. If a vehicle remains within the same grid cell for longer than a set threshold (usually 30 minutes), it is considered a stop. The period between two stops is considered a trip, as shown in the figure below.

[Figure: identification of stops and trips]

After cleaning the trajectory data, we need to define a grid coordinate system and convert the trajectory data into grid-based representation to identify stops and trips.

[11]:
# Define the grid parameters
bounds = [113.75, 22.4, 114.62, 22.86]
params = tbd.area_to_params(bounds, accuracy=100)
params
[11]:
{'slon': 113.75,
 'slat': 22.4,
 'deltalon': 0.0009743362892898221,
 'deltalat': 0.0008993210412845813,
 'theta': 0,
 'method': 'rect',
 'gridsize': 100}
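
With these grid parameters, the time-threshold idea can be sketched as follows; stays_sketch is a hypothetical helper for illustration (tbd.GPS_to_grid() maps coordinates to grid cells), while the library's own implementation is introduced next:

def stays_sketch(df, params, activitytime=1800):
    df = df.sort_values(['id', 'time']).copy()
    # Map coordinates to grid cells so that small positional jitter
    # within a stop does not split it apart
    df['LONCOL'], df['LATCOL'] = tbd.GPS_to_grid(df['lon'], df['lat'], params)
    # Group consecutive records of the same vehicle in the same cell
    changed = (df[['id', 'LONCOL', 'LATCOL']]
               != df[['id', 'LONCOL', 'LATCOL']].shift()).any(axis=1)
    df['segment'] = changed.cumsum()
    seg = df.groupby('segment').agg(id=('id', 'first'),
                                    stime=('time', 'first'),
                                    etime=('time', 'last'))
    seg['duration'] = (seg['etime'] - seg['stime']).dt.total_seconds()
    # A dwell longer than the threshold (1800 s = 30 min) is a stop
    return seg[seg['duration'] >= activitytime]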

Identification of stops and trips is provided in the TransBigData package through the tbd.traj_stay_move() function. Here is the code:

[12]:
# Identify stay and move
stay, move = tbd.traj_stay_move(data, params, col=['id', 'time', 'lon', 'lat'], activitytime=1800)
len(stay), len(move)
[12]:
(545, 725)

In the code, data is the cleaned trajectory data, params holds the grid parameters, and activitytime is the stop threshold in seconds (1800 s = 30 minutes). The traj_stay_move() function returns two DataFrames, stay and move, containing the identified stops and trips, respectively, which can be used for further analysis.

Note: The tbd.traj_stay_move() function does not remove trips with a duration of 0. This is because some trajectory data may have long sampling intervals, which may not capture the travel process between two stops. As a result, the calculated duration for those trips would be 0.

[13]:
stay
[13]:
id stime LONCOL LATCOL etime lon lat duration stayid
23320 22396 2023-05-29 06:10:09 262 284 2023-05-29 07:56:32 114.005547 22.655800 6383.0 0
51327 22396 2023-05-29 12:51:23 50 429 2023-05-29 16:44:57 113.798630 22.786167 14014.0 1
54670 22413 2023-05-29 00:00:09 145 431 2023-05-29 01:22:41 113.891403 22.787300 4952.0 2
54722 22413 2023-05-29 01:25:13 145 431 2023-05-29 02:01:44 113.891701 22.787399 2191.0 3
54741 22413 2023-05-29 02:18:13 145 431 2023-05-29 02:48:46 113.891502 22.787300 1833.0 4
... ... ... ... ... ... ... ... ... ...
151140 36686 2023-05-29 01:04:37 323 172 2023-05-29 02:33:37 114.065193 22.554672 5340.0 540
154762 36686 2023-05-29 03:03:07 307 185 2023-05-29 03:51:07 114.048721 22.566692 2880.0 541
154924 36686 2023-05-29 04:21:07 307 185 2023-05-29 05:06:07 114.048676 22.566605 2700.0 542
307421 36805 2023-05-29 03:15:54 324 168 2023-05-29 04:14:40 114.065552 22.551100 3526.0 543
337431 36805 2023-05-29 04:23:40 325 132 2023-05-29 05:13:33 114.066521 22.519133 2993.0 544

545 rows × 9 columns

[14]:
move
[14]:
id SLONCOL SLATCOL stime slon slat etime elon elat ELONCOL ELATCOL duration moveid
0 22396 253 326 2023-05-29 00:00:29 113.996719 22.693333 2023-05-29 06:10:09 114.005547 22.655800 262.0 284.0 22180.0 0
23320 22396 262 284 2023-05-29 07:56:32 114.005547 22.655800 2023-05-29 12:51:23 113.798630 22.786167 50.0 429.0 17691.0 1
51327 22396 50 429 2023-05-29 16:44:57 113.798630 22.786167 2023-05-29 23:59:55 114.025253 22.654900 283.0 283.0 26098.0 2
54670 22413 145 431 2023-05-29 00:00:09 113.891403 22.787300 2023-05-29 00:00:09 113.891403 22.787300 145.0 431.0 0.0 3
54670 22413 145 431 2023-05-29 01:22:41 113.891403 22.787300 2023-05-29 01:25:13 113.891701 22.787399 145.0 431.0 152.0 4
... ... ... ... ... ... ... ... ... ... ... ... ... ...
154762 36686 307 185 2023-05-29 03:51:07 114.048721 22.566692 2023-05-29 04:21:07 114.048676 22.566605 307.0 185.0 1800.0 720
154924 36686 307 185 2023-05-29 05:06:07 114.048676 22.566605 2023-05-29 23:53:46 114.124580 22.571978 384.0 191.0 67659.0 721
135725 36805 328 147 2023-05-29 00:00:03 114.070030 22.531967 2023-05-29 03:15:54 114.065552 22.551100 324.0 168.0 11751.0 722
307421 36805 324 168 2023-05-29 04:14:40 114.065552 22.551100 2023-05-29 04:23:40 114.066521 22.519133 325.0 132.0 540.0 723
337431 36805 325 132 2023-05-29 05:13:33 114.066521 22.519133 2023-05-29 23:53:51 114.120354 22.544300 380.0 160.0 67218.0 724

725 rows × 13 columns

Based on the stops and trips information, you can slice the trajectory data to extract the trajectory points during stops and trips. Since you have already performed redundancy elimination earlier, the redundant trajectory points during stops have been removed. As a result, the trajectory data during stops will have significantly fewer data points compared to the trajectory data during trips.

Trajectory slicing

The preceding code extracted the stop and trip information from the data. However, the trip records contain only the time and coordinates of each trip's start and end points, not the trajectory in between. To further analyze vehicles' travel paths, we need to extract the trajectory data within each trip's time period, that is, slice the trajectory dataset based on the trip information. Each trip record already has a trip ID, a start time, and an end time; trajectory slicing extracts the trajectory points during the trip and labels each point with the trip ID.

The idea of trajectory slicing is illustrated in the diagram below. A flag column is created in the trajectory data to mark whether each row falls within a time period of interest. Each trip record is decomposed into a start-time record (flag of 1) and an end-time record (flag of -1), which are inserted into the trajectory data. The flag column is then cumulatively summed per vehicle to obtain the flag1 column. In the result, rows where flag1 equals 1 (within a trip) and flag equals 0 (original data, not the temporarily inserted records) are the desired trajectory points.

[Figure: trajectory slicing with the flag method]
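
Here is a minimal sketch of that flag method; slice_sketch is a hypothetical helper that simplifies tie handling at period boundaries, while the library's tbd.traj_slice() is used below:

import pandas as pd

def slice_sketch(traj, periods, idcol='moveid'):
    # Decompose each period into a start record (flag 1) and an end
    # record (flag -1) and insert them into the trajectory data
    starts = periods[['id', 'stime', idcol]].rename(columns={'stime': 'time'})
    starts['flag'] = 1
    ends = periods[['id', 'etime', idcol]].rename(columns={'etime': 'time'})
    ends['flag'] = -1
    traj = traj.copy()
    traj['flag'] = 0
    merged = pd.concat([starts, traj, ends]).sort_values(
        ['id', 'time'], kind='mergesort')  # stable sort keeps flag rows ordered on ties
    # Cumulative sum per vehicle: 1 inside a period, 0 outside
    merged['flag1'] = merged.groupby('id')['flag'].cumsum()
    # Carry the period ID forward so points inside a period get its label
    merged[idcol] = merged.groupby('id')[idcol].ffill()
    result = merged[(merged['flag'] == 0) & (merged['flag1'] == 1)]
    return result.drop(columns=['flag', 'flag1'])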

Trajectory slicing can be performed with tbd.traj_slice(); an example follows:

[18]:
# Extract trajectory points during parking
stay_points = tbd.traj_slice(data, stay, traj_col=['id', 'time'], slice_col=[
                             'id', 'stime', 'etime', 'stayid'])
stay_points
[18]:
id time lon lat OpenStatus speed stayid
23320 22396 2023-05-29 06:10:09 114.005547 22.655800 0.0 1.0 0.0
23321 22396 2023-05-29 06:20:42 114.005547 22.655800 0.0 0.0 0.0
23322 22396 2023-05-29 06:25:09 114.005417 22.655767 0.0 0.0 0.0
23323 22396 2023-05-29 06:25:17 114.005417 22.655767 0.0 0.0 0.0
23324 22396 2023-05-29 06:30:09 114.005402 22.655767 0.0 0.0 0.0
... ... ... ... ... ... ... ...
307425 36805 2023-05-29 04:14:40 114.066452 22.550966 0.0 9.0 543.0
337431 36805 2023-05-29 04:23:40 114.066521 22.519133 0.0 20.0 544.0
337432 36805 2023-05-29 04:23:55 114.066521 22.519133 0.0 0.0 544.0
337433 36805 2023-05-29 05:13:27 114.066521 22.519133 1.0 0.0 544.0
337434 36805 2023-05-29 05:13:33 114.067230 22.519751 1.0 3.0 544.0

8428 rows × 7 columns

[19]:
# Extract trajectory points during travel
move_points = tbd.traj_slice(data, move, traj_col=['id', 'time'], slice_col=[
                             'id', 'stime', 'etime', 'moveid'])
move_points
[19]:
id time lon lat OpenStatus speed moveid
0 22396 2023-05-29 00:00:29 113.996719 22.693333 1.0 20.0 0.0
27 22396 2023-05-29 00:01:01 113.995514 22.695032 1.0 34.0 0.0
28 22396 2023-05-29 00:01:09 113.995430 22.695766 1.0 41.0 0.0
29 22396 2023-05-29 00:01:41 113.995369 22.696484 1.0 0.0 0.0
30 22396 2023-05-29 00:02:21 113.995430 22.696650 1.0 17.0 0.0
... ... ... ... ... ... ... ...
132774 36805 2023-05-29 23:53:03 114.120354 22.544333 1.0 3.0 724.0
132775 36805 2023-05-29 23:53:09 114.120354 22.544300 1.0 2.0 724.0
132776 36805 2023-05-29 23:53:15 114.120354 22.544300 1.0 1.0 724.0
132777 36805 2023-05-29 23:53:21 114.120354 22.544300 1.0 0.0 724.0
132778 36805 2023-05-29 23:53:51 114.120354 22.544300 0.0 0.0 724.0

397892 rows × 7 columns

Trajectory densification and sparsification

To facilitate subsequent tasks such as matching travel paths to the road network, we densify or sparsify the trajectory points during travel. At this stage, we specify the ID column as the trip ID (moveid) column, so that the densification or sparsification operates on each trip separately.
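
Conceptually, densification resamples each trip to a fixed time interval and linearly interpolates the coordinates at the new timestamps. A minimal sketch for a single trip; densify_sketch is a hypothetical helper, while tbd.traj_densify() below is the library implementation:

import pandas as pd

def densify_sketch(points, timegap=15):
    # Index one trip's points by time and keep only the coordinates
    s = points.set_index('time')[['lon', 'lat']].sort_index()
    # Union of the original timestamps and a regular grid every `timegap` seconds
    grid = pd.date_range(s.index.min(), s.index.max(), freq=f'{timegap}s')
    # Insert the new timestamps and interpolate coordinates linearly in time
    return s.reindex(s.index.union(grid)).interpolate(method='time')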

[59]:
# Trajectory densification
move_points_densified = tbd.traj_densify(
    move_points, col=['moveid', 'time', 'lon', 'lat'], timegap=15)
move_points_densified
[59]:
id time lon lat OpenStatus speed moveid
0 22396.0 2023-05-29 00:00:29 113.996719 22.693333 1.0 20.0 0.0
2 NaN 2023-05-29 00:00:30 113.996681 22.693386 NaN NaN 0.0
3 NaN 2023-05-29 00:00:45 113.996116 22.694183 NaN NaN 0.0
4 NaN 2023-05-29 00:01:00 113.995552 22.694979 NaN NaN 0.0
1 22396.0 2023-05-29 00:01:01 113.995514 22.695032 1.0 34.0 0.0
... ... ... ... ... ... ... ...
397889 36805.0 2023-05-29 23:53:15 114.120354 22.544300 1.0 1.0 724.0
397890 36805.0 2023-05-29 23:53:21 114.120354 22.544300 1.0 0.0 724.0
4175974 NaN 2023-05-29 23:53:30 114.120354 22.544300 NaN NaN 724.0
4175975 NaN 2023-05-29 23:53:45 114.120354 22.544300 NaN NaN 724.0
397891 36805.0 2023-05-29 23:53:51 114.120354 22.544300 0.0 0.0 724.0

1211070 rows × 7 columns

[60]:
# Trajectory sparsification
move_points_sparsified = tbd.traj_sparsify(
    move_points, col=['moveid', 'time', 'lon', 'lat'],  timegap=120)
move_points_sparsified
[60]:
id time lon lat OpenStatus speed moveid
0 22396 2023-05-29 00:00:29 113.996719 22.693333 1.0 20.0 0.0
4 22396 2023-05-29 00:02:21 113.995430 22.696650 1.0 17.0 0.0
7 22396 2023-05-29 00:04:21 113.992348 22.696733 0.0 36.0 0.0
10 22396 2023-05-29 00:06:21 113.986366 22.691000 0.0 48.0 0.0
12 22396 2023-05-29 00:08:21 113.989586 22.681749 0.0 43.0 0.0
... ... ... ... ... ... ... ...
397822 36805 2023-05-29 23:45:27 114.091217 22.540768 0.0 11.0 724.0
397835 36805 2023-05-29 23:47:21 114.093002 22.543383 1.0 0.0 724.0
397855 36805 2023-05-29 23:49:21 114.105850 22.545250 1.0 58.0 724.0
397875 36805 2023-05-29 23:51:21 114.119514 22.547033 1.0 24.0 724.0
397890 36805 2023-05-29 23:53:21 114.120354 22.544300 1.0 0.0 724.0

90172 rows × 7 columns

[ ]:
# define a function to plot the trajectory
def plot_traj(traj):
    import folium
    # 1. Create a map with the center at the average coordinates of the trajectory
    m = folium.Map(location=[traj['lat'].mean(), traj['lon'].mean()],  # Map center
                zoom_start=14,  # Map zoom level
                tiles='cartodbpositron')  # Map style
    # 2. Add the trajectory
    folium.PolyLine(
        traj[['lat', 'lon']].values.tolist(),  # Trajectory coordinates
        color='red',  # Trajectory color
        weight=2.5,  # Trajectory width
        opacity=1).add_to(m)  # Trajectory opacity; add the line to the map
    # 3. Add trajectory points
    for i in range(len(traj)):
        folium.CircleMarker(
            location=[traj['lat'].iloc[i], traj['lon'].iloc[i]],  # Trajectory point coordinates
            radius=3,  # Trajectory point radius
            color='red',  # Trajectory point color
        ).add_to(m)  # Add the point to the map
    # 4. Add start and end markers
    folium.Marker([traj['lat'].iloc[0], traj['lon'].iloc[0]],  # Start point coordinates
                popup='Start',  # Start marker label
                icon=folium.Icon(color='green')).add_to(m)  # Start marker color
    folium.Marker([traj['lat'].iloc[-1], traj['lon'].iloc[-1]],
                popup='End',
                icon=folium.Icon(color='red')).add_to(m)
    # 5. Display the map, directly in Jupyter Notebook
    return m

[65]:
moveid = 51
# Original trajectory
traj = move_points[move_points['moveid']==moveid]
plot_traj(traj)
[65]:
(Interactive folium map output)
[66]:
# Densified trajectory
traj = move_points_densified[move_points_densified['moveid']==moveid]
plot_traj(traj)
[66]:
(Interactive folium map output)
[67]:
# Sparsified trajectory
traj = move_points_sparsified[move_points_sparsified['moveid']==moveid]
plot_traj(traj)
[67]:
(Interactive folium map output)

Trajectory smoothing

When processing vehicle trajectory data, the trajectory points represent “observations” of the actual “state” of the vehicle. Due to errors, the observed data may deviate from the actual state of the vehicle.

How can we obtain a more accurate estimation of the actual state of the vehicle? Consider the method mentioned in the previous section for detecting trajectory drift, which involves comparing the position of a trajectory point with the position of previous trajectory points to check for significant and unreasonable jumps. This approach is essentially based on predicting the possible future positions of the vehicle based on its previous trajectory. If the next recorded trajectory point deviates significantly from the expected position, it can be determined that the trajectory is abnormal.

This method shares similarities with the concept of Kalman filtering. Kalman filtering is a linear quadratic estimation algorithm used for state estimation in linear dynamic systems. It combines the previous state estimation (i.e., the predicted position of the current trajectory point) with the current observation data (recorded position of the current trajectory point) to obtain an optimal estimate of the current state.

The implementation of Kalman filtering involves predicting the current value using the previous optimal result and then correcting the current value using the observed value to obtain the optimal result. This method effectively reduces the impact of noise, allowing for a more accurate estimation of the actual state of the vehicle.
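
The predict/correct cycle can be illustrated with a minimal one-dimensional sketch; kalman_1d is a hypothetical helper assuming a random-walk state model, whereas tbd.traj_smooth() used below is the library's own implementation:

import numpy as np

def kalman_1d(z, process_noise_std=0.01, measurement_noise_std=1):
    q, r = process_noise_std**2, measurement_noise_std**2
    x, p = z[0], 1.0              # initial state estimate and its variance
    out = []
    for zi in z:
        p = p + q                 # predict: uncertainty grows by the process noise
        k = p / (p + r)           # Kalman gain: weight given to the new observation
        x = x + k * (zi - x)      # correct: blend prediction with observation
        p = (1 - k) * p           # update the variance of the estimate
        out.append(x)
    return np.array(out)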

[84]:
move_id = 51
traj = move_points[move_points['moveid'] == move_id].copy()
# Kalman-filter smoothing of the trajectory
traj_smoothed = tbd.traj_smooth(
    traj,
    col=['id', 'time', 'lon', 'lat'],
    proj=False,                  # whether to project coordinates before filtering
    process_noise_std=0.01,      # standard deviation of the process noise
    measurement_noise_std=1)     # standard deviation of the measurement noise
[85]:
# plot the trajectory
import folium
m = folium.Map(location=[traj['lat'].mean(), traj['lon'].mean()],
            zoom_start=14,
            tiles='cartodbpositron')
# original trajectory
folium.PolyLine(
    traj[['lat', 'lon']].values.tolist(),
    color='red',
    weight=2.5,
    opacity=1).add_to(m)
# smoothed trajectory
folium.PolyLine(
    traj_smoothed[['lat', 'lon']].values.tolist(),
    color='blue',
    weight=2.5,
    opacity=1).add_to(m)
m
[85]:
(Interactive folium map output)

The goal of the Kalman filter is to optimize the observed values by estimating the system state while considering the uncertainties of both the observation noise and the system dynamics. It has advantages in smoothing trajectory data by reducing noise effects and minimizing fluctuations within a small range. However, the Kalman filter cannot completely eliminate all noise or handle trajectory drift.

The Kalman filter is suitable for cases where the noise in the trajectory data is relatively stable, meaning the noise variance remains constant. It is particularly effective in smoothing small-scale fluctuations caused by measurement errors in trajectory data.

However, the effectiveness of the Kalman filter is limited when significant drift occurs in the trajectory. Drift points are treated as observations and have a significant impact on state estimation, which the Kalman filter cannot directly handle.

Furthermore, the Kalman filter requires the specification of covariance matrices for process and observation errors, and these parameter settings affect the smoothing effect. Improper covariance matrix settings can result in significant deviations in the smoothed trajectory data, especially when dealing with trajectory data that may deviate from the road network.

When processing trajectory data, a common approach is to first remove drift, then perform smoothing, and finally conduct road network matching (a sketch combining these steps follows the list). The rationale behind this approach is as follows:

  • The drift removal step eliminates obvious drift points in the data, which are large noise components. The presence of drift points can significantly interfere with subsequent processing steps. Removing or correcting drift points ensures the accuracy and reliability of subsequent processing.

  • After drift removal, the trajectory data may still contain some noise and fluctuation. Smoothing can be applied to reduce their impact, making the trajectory smoother and more continuous while preserving its overall trend.

  • Finally, the smoothed trajectory is more stable and better suited for road network matching. It reduces errors caused by noise and fluctuations, thereby improving the accuracy and reliability of road network matching.
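
Putting it together, the overall order with the TransBigData calls used in this article looks like the following sketch (the parameter values are simply the ones demonstrated above, and map matching itself requires an external tool):

data = tbd.traj_clean_redundant(data, col=['id', 'time', 'lon', 'lat', 'speed'])
data = tbd.clean_outofshape(data, sz, col=['lon', 'lat'], accuracy=500)
data = tbd.traj_clean_drift(data, col=['id', 'time', 'lon', 'lat'],
                            speedlimit=80, dislimit=4000, anglelimit=30)
smoothed = tbd.traj_smooth(data, col=['id', 'time', 'lon', 'lat'], proj=False,
                           process_noise_std=0.01, measurement_noise_std=1)
# ...then pass `smoothed` to the map-matching tool of your choice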