Data Preprocessing

`clean_outofbounds`(data, bounds[, col])	Exclude data outside the study area, given the longitude and latitude of its lower-left and upper-right corners
`clean_outofshape`(data, shape[, col, accuracy])	Exclude data outside the study area, given the GeoDataFrame of the study area
`id_reindex`(data, col[, new, timegap, ...])	Renumber the ID column of the data
`id_reindex_disgap`(data[, col, disgap, suffix])	Renumber the ID column of the data. If two adjacent records are farther apart than the distance threshold, the later record gets a new ID

transbigdata.clean_outofbounds(data, bounds, col=['Lng', 'Lat'])

Takes the longitude and latitude of the lower-left and upper-right corners of the study area and excludes data that fall outside it.

Parameters:

data (DataFrame) – Data
bounds (List) – Longitude and latitude of the lower-left and upper-right corners of the study area, in the order [lon1, lat1, lon2, lat2]
col (List) – Column names of longitude and latitude

Returns:

data1 – Data within the study area

Return type:

DataFrame
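
The bounding-box filter described above can be illustrated with plain pandas. This is a minimal sketch of the idea, not the library's implementation: the helper name clean_outofbounds_sketch and the sample coordinates are hypothetical, and treating points exactly on the boundary as outside is an assumption.

```python
import pandas as pd


def clean_outofbounds_sketch(data, bounds, col=("Lng", "Lat")):
    """Keep rows whose coordinates lie inside [lon1, lat1, lon2, lat2]."""
    lon1, lat1, lon2, lat2 = bounds
    lon_col, lat_col = col
    # Assumption: points exactly on the boundary are treated as outside.
    inside = (
        (data[lon_col] > lon1) & (data[lon_col] < lon2)
        & (data[lat_col] > lat1) & (data[lat_col] < lat2)
    )
    return data[inside]


# Hypothetical sample points; the last one lies east of the study area.
points = pd.DataFrame({
    "Lng": [113.9, 114.0, 114.3, 115.0],
    "Lat": [22.5, 22.6, 22.7, 22.6],
})
kept = clean_outofbounds_sketch(points, [113.8, 22.4, 114.5, 22.9])
print(kept)
```

With the library itself, the corresponding call would follow the signature shown above: transbigdata.clean_outofbounds(points, [113.8, 22.4, 114.5, 22.9], col=['Lng', 'Lat']).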

transbigdata.clean_outofshape(data, shape, col=['Lng', 'Lat'], accuracy=500)

Takes the GeoDataFrame of the study area and excludes data that fall outside it.

Parameters:

data (DataFrame) – Data
shape (GeoDataFrame) – The GeoDataFrame of the study area
col (List) – Column names of longitude and latitude
accuracy (number) – Size of the grid. The data are first mapped to grid cells and then cleaned cell by cell. The smaller the grid size, the higher the accuracy

Returns:

data1 – Data within the study area

Return type:

DataFrame
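
The "grid first, then clean" principle behind the accuracy parameter can be sketched without any GIS dependency. The code below is an illustration under stated assumptions, not the library's implementation: the study area is a plain list of polygon vertices rather than a GeoDataFrame, accuracy is assumed to be in metres and is converted to degrees with a rough spherical approximation, and a cell is kept or dropped according to whether its centre lies inside the polygon. All helper names are hypothetical.

```python
import math

import pandas as pd


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def clean_outofshape_sketch(data, polygon, col=("Lng", "Lat"), accuracy=500):
    lon_col, lat_col = col
    # Assumption: accuracy is a grid size in metres; convert it to degrees.
    lat0 = data[lat_col].mean()
    dlat = accuracy / 111320.0
    dlon = accuracy / (111320.0 * math.cos(math.radians(lat0)))
    # Step 1: map every point to a grid cell.
    cells = pd.DataFrame({
        "ix": (data[lon_col] // dlon).astype(int),
        "iy": (data[lat_col] // dlat).astype(int),
    })
    # Step 2: test each distinct cell once, using its centre.
    unique_cells = cells.drop_duplicates().copy()
    unique_cells["keep"] = [
        point_in_polygon((ix + 0.5) * dlon, (iy + 0.5) * dlat, polygon)
        for ix, iy in zip(unique_cells["ix"], unique_cells["iy"])
    ]
    # Step 3: propagate the per-cell decision back to the points.
    keep = cells.merge(unique_cells, on=["ix", "iy"], how="left")["keep"]
    return data[keep.to_numpy()]


# Hypothetical square study area and sample points.
square = [(114.0, 22.5), (114.2, 22.5), (114.2, 22.7), (114.0, 22.7)]
points = pd.DataFrame({
    "Lng": [114.1, 114.5, 114.1, 114.05],
    "Lat": [22.6, 22.6, 22.9, 22.55],
})
kept = clean_outofshape_sketch(points, square, accuracy=500)
print(kept)
```

Testing each distinct cell once instead of each point is what makes the approach cheap on large datasets. The trade-off is that points near the boundary are classified by their cell centre, which is why a smaller grid size gives higher accuracy.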

transbigdata.id_reindex(data, col, new=False, timegap=None, timecol=None, suffix='_new', sample=None)

Renumber the ID column of the data.

Parameters:

data (DataFrame) – Data
col (str) – Name of the ID column to be re-indexed
new (bool) – If False, every occurrence of the same original ID gets the same new index. If True, indices are assigned in table order, so an original ID that reappears later gets a different index
timegap (number) – Time threshold. If an individual does not appear for longer than timegap, it is numbered as a new individual. Takes effect only when timecol is also set
timecol (str) – Name of the time column. Takes effect only when timegap is also set
suffix (str) – Suffix of the new column. When set to False, the original column is replaced
sample (int, optional) – Downsample the data

Returns:

data1 – Renumbered data

Return type:

DataFrame
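
The three renumbering modes described by new and timegap can be illustrated with plain pandas. This is a sketch of the described behaviour, not the library's implementation: the helper name id_reindex_sketch is hypothetical, timegap is assumed to be in seconds, timecol is assumed to hold datetime values, and the sample parameter is left out.

```python
import pandas as pd


def id_reindex_sketch(data, col, new=False, timegap=None, timecol=None,
                      suffix="_new"):
    out = data.copy()
    if timegap is not None and timecol is not None:
        # A new individual starts when the ID changes or the gap is too long.
        out = out.sort_values([col, timecol])
        gap = out[timecol].diff().dt.total_seconds()  # assumption: seconds
        changed = (out[col] != out[col].shift()) | (gap > timegap)
        new_id = changed.cumsum() - 1
    elif new:
        # Table order: every change of ID starts a new index.
        changed = out[col] != out[col].shift()
        new_id = changed.cumsum() - 1
    else:
        # Same original ID always maps to the same new index.
        new_id = pd.factorize(out[col])[0]
    # suffix=False replaces the original column, as documented above.
    target = col + suffix if suffix else col
    out[target] = new_id
    return out


# Hypothetical records: "a" reappears after an 85-minute gap.
records = pd.DataFrame({
    "id": ["a", "a", "b", "a"],
    "time": pd.to_datetime([
        "2024-01-01 08:00:00", "2024-01-01 08:05:00",
        "2024-01-01 08:10:00", "2024-01-01 09:30:00",
    ]),
})
same_id = id_reindex_sketch(records, "id")
by_order = id_reindex_sketch(records, "id", new=True)
by_gap = id_reindex_sketch(records, "id", timegap=1800, timecol="time")
print(by_gap)
```

In this sample, the default mode gives both appearances of "a" the same index, new=True gives the last row its own index because "b" sits in between, and the timegap mode splits "a" because the gap exceeds 1800 seconds.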

transbigdata.id_reindex_disgap(data, col=['uid', 'lon', 'lat'], disgap=1000, suffix='_new')

Renumber the ID column of the data. If two adjacent records are farther apart than the distance threshold, the later record gets a new ID.

Parameters:

data (DataFrame) – Data
col (List) – Column names of the ID, longitude and latitude, in that order
disgap (number) – Distance threshold. If two adjacent records are farther apart than this distance, the later record gets a new ID
suffix (str) – Suffix of the new column. When set to False, the original column is replaced

Returns:

data1 – Renumbered data

Return type:

DataFrame
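
The distance-based renumbering can be sketched in the same way. The code below is an illustration, not the library's implementation: it assumes the records are already in the intended order, that disgap is in metres, and that distance is measured with the haversine formula on a sphere of radius 6371 km. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two sets of points."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * np.arcsin(np.sqrt(a))


def id_reindex_disgap_sketch(data, col=("uid", "lon", "lat"), disgap=1000,
                             suffix="_new"):
    uid, lon, lat = col
    out = data.copy()
    # Distance from the previous record (NaN for the first row).
    dist = haversine_m(out[lon].shift(), out[lat].shift(), out[lon], out[lat])
    # A new ID starts when the original ID changes or the jump is too long.
    changed = (out[uid] != out[uid].shift()) | (dist > disgap)
    target = uid + suffix if suffix else uid
    out[target] = changed.cumsum() - 1
    return out


# Hypothetical records: the third one jumps roughly 10 km east.
records = pd.DataFrame({
    "uid": [1, 1, 1, 2],
    "lon": [114.000, 114.001, 114.100, 114.100],
    "lat": [22.5, 22.5, 22.5, 22.5],
})
renumbered = id_reindex_disgap_sketch(records, disgap=1000)
print(renumbered)
```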