Data Preprocessing

clean_outofbounds(data, bounds[, col])

Given the longitude and latitude of the lower-left and upper-right corners of the study area, exclude data that fall outside the study area

clean_outofshape(data, shape[, col, accuracy])

Given the GeoDataFrame of the study area, exclude data that fall outside the study area

id_reindex(data, col[, new, timegap, ...])

Renumber the ID column of the data

id_reindex_disgap(data[, col, disgap, suffix])

Renumber the ID column of the data. If the distance between two adjacent records exceeds the threshold, the later record is assigned a new ID

transbigdata.clean_outofbounds(data, bounds, col=['Lng', 'Lat'])

Given the longitude and latitude of the lower-left and upper-right corners of the study area, exclude data that fall outside the study area

Parameters:
  • data (DataFrame) – Data

  • bounds (List) – Longitude and latitude of the lower-left and upper-right corners of the study area, in the order [lon1, lat1, lon2, lat2]

  • col (List) – Column names of longitude and latitude

Returns:

data1 – Data within the study area

Return type:

DataFrame
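
The bounding-box filter described above can be sketched in plain pandas. This is an illustrative approximation, not the library's implementation: the helper name filter_to_bounds and the sample coordinates are hypothetical, and treating points exactly on the boundary as inside is an assumption.

```python
import pandas as pd

def filter_to_bounds(data, bounds, col=('Lng', 'Lat')):
    """Keep rows whose coordinates fall inside [lon1, lat1, lon2, lat2].

    Hypothetical helper; boundary points are kept (an assumption).
    """
    lon1, lat1, lon2, lat2 = bounds
    lon_col, lat_col = col
    # Sort each pair so the filter still works if the corners are swapped.
    lon_min, lon_max = sorted((lon1, lon2))
    lat_min, lat_max = sorted((lat1, lat2))
    inside = (
        data[lon_col].between(lon_min, lon_max)
        & data[lat_col].between(lat_min, lat_max)
    )
    return data[inside].copy()

# Made-up sample data: the third point lies east of the study area.
data = pd.DataFrame({
    'Lng': [113.90, 114.05, 114.70],
    'Lat': [22.50, 22.55, 22.60],
})
bounds = [113.6, 22.4, 114.3, 22.9]  # [lon1, lat1, lon2, lat2]
data1 = filter_to_bounds(data, bounds)
```

With the library itself, the equivalent call would follow the signature documented above: transbigdata.clean_outofbounds(data, bounds, col=['Lng', 'Lat']).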

transbigdata.clean_outofshape(data, shape, col=['Lng', 'Lat'], accuracy=500)

Given the GeoDataFrame of the study area, exclude data that fall outside the study area

Parameters:
  • data (DataFrame) – Data

  • shape (GeoDataFrame) – The GeoDataFrame of the study area

  • col (List) – Column names of longitude and latitude

  • accuracy (number) – Grid size. The data are gridded first and then cleaned grid by grid, so a smaller grid size gives higher accuracy

Returns:

data1 – Data within the study area

Return type:

DataFrame
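
The "grid first, then clean" principle behind the accuracy parameter can be illustrated without any GIS dependency. The sketch below is not the library's code: it replaces the GeoDataFrame with a plain list of polygon vertices, expresses the cell size in coordinate units (the hypothetical cell argument), and uses a textbook ray-casting test. Each distinct grid cell is tested once at its centre, and every point inherits the result of its cell.

```python
import numpy as np
import pandas as pd

def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count the edges crossed by a horizontal ray going right from (x, y).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def filter_to_polygon(data, polygon, col=('Lng', 'Lat'), cell=0.1):
    """Hypothetical helper: keep points whose grid cell centre is inside polygon."""
    lon_col, lat_col = col
    # 1. Grid: assign every point an integer cell index.
    ix = np.floor(data[lon_col] / cell).astype(int)
    iy = np.floor(data[lat_col] / cell).astype(int)
    # 2. Test each distinct cell once, at its centre.
    cell_inside = {}
    for kx, ky in set(zip(ix, iy)):
        centre = ((kx + 0.5) * cell, (ky + 0.5) * cell)
        cell_inside[(kx, ky)] = point_in_polygon(*centre, polygon)
    # 3. Clean: every point inherits the result of its cell.
    keep = np.array([cell_inside[k] for k in zip(ix, iy)], dtype=bool)
    return data[keep].copy()

# Made-up study area (unit square) and points; the last point is outside.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
points = pd.DataFrame({'Lng': [0.55, 0.98, 1.52], 'Lat': [0.55, 0.52, 0.52]})
kept = filter_to_polygon(points, square, cell=0.1)
```

Because a point is classified by its cell centre, points near the boundary can be misclassified by up to roughly one cell, which is why a smaller grid size gives higher accuracy at the cost of testing more cells.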

transbigdata.id_reindex(data, col, new=False, timegap=None, timecol=None, suffix='_new', sample=None)

Renumber the ID column of the data

Parameters:
  • data (DataFrame) – Data

  • col (str) – Name of the ID column to be re-indexed

  • new (bool) – False: records with the same original ID receive the same new index; True: indexes follow the order of the table, so an original ID that appears again later receives a different index

  • timegap (number) – If an individual does not appear for longer than this time threshold, its later records are numbered as a new individual. Takes effect only when timecol is also set

  • timecol (str) – Column name of time. Takes effect only when timegap is also set

  • suffix (str) – Suffix of the new column. When set to False, the original column is replaced

  • sample (int (optional)) – Downsample the data

Returns:

data1 – Renumbered data

Return type:

DataFrame
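
The three renumbering modes described by new and timegap can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper reindex_ids is hypothetical, timecol is assumed to be a datetime column, timegap is assumed to be in seconds, and the suffix=False and sample options are left out.

```python
import pandas as pd

def reindex_ids(data, col, new=False, timegap=None, timecol=None, suffix='_new'):
    """Hypothetical helper mimicking the documented renumbering modes."""
    out = data.copy()
    if timegap is not None and timecol is not None:
        # Order each individual's records by time, then start a new index
        # whenever the ID changes or the gap exceeds timegap (assumed seconds).
        out = out.sort_values([col, timecol])
        gap = out.groupby(col)[timecol].diff().dt.total_seconds()
        starts = (out[col] != out[col].shift()) | (gap > timegap)
        out[col + suffix] = starts.cumsum() - 1
    elif new:
        # Table order: an ID that reappears later gets a fresh index.
        starts = out[col] != out[col].shift()
        out[col + suffix] = starts.cumsum() - 1
    else:
        # The same original ID always maps to the same new index.
        out[col + suffix] = pd.factorize(out[col])[0]
    return out

# Made-up records: 'a' reappears after 'b', about 1.5 hours later.
data = pd.DataFrame({
    'id': ['a', 'a', 'b', 'a'],
    'time': pd.to_datetime(['2024-01-01 08:00', '2024-01-01 08:01',
                            '2024-01-01 08:02', '2024-01-01 09:30']),
})
same = reindex_ids(data, 'id')                                 # id_new: 0, 0, 1, 0
fresh = reindex_ids(data, 'id', new=True)                      # id_new: 0, 0, 1, 2
split = reindex_ids(data, 'id', timegap=1800, timecol='time')  # rows sorted by id, time
```

In the timegap case the sketch returns the rows sorted by ID and time, so the last record of 'a' (after a gap longer than 1800 seconds) receives index 1 and 'b' receives index 2.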

transbigdata.id_reindex_disgap(data, col=['uid', 'lon', 'lat'], disgap=1000, suffix='_new')

Renumber the ID column of the data. If the distance between two adjacent records exceeds the threshold, the later record is assigned a new ID

Parameters:
  • data (DataFrame) – Data

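  • col (List) – Column names, in the order of the default ['uid', 'lon', 'lat']: the ID column to be re-indexed, followed by longitude and latitude

  • disgap (number) – Distance threshold. If the distance between two adjacent records exceeds it, the later record is assigned a new ID

  • suffix (str) – Suffix of the new column. When set to False, the original column is replaced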

Returns:

data1 – Renumbered data

Return type:

DataFrame
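
The distance-based split can be sketched in the same way. This is an illustration rather than the library's code: the helpers haversine_m and reindex_by_distance are hypothetical, disgap is assumed to be in meters, distances are great-circle (haversine) distances on a sphere of radius 6371 km, and the records of each individual are assumed to be contiguous and already in travel order.

```python
import numpy as np
import pandas as pd

def haversine_m(lon1, lat1, lon2, lat2, radius=6371000.0):
    """Great-circle distance in meters between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(a))

def reindex_by_distance(data, col=('uid', 'lon', 'lat'), disgap=1000, suffix='_new'):
    """Hypothetical helper: start a new ID when adjacent records are far apart."""
    uid, lon, lat = col
    out = data.copy()
    # Distance from each record to the record directly above it in the table.
    prev = out[[lon, lat]].shift()
    dist = haversine_m(prev[lon], prev[lat], out[lon], out[lat])
    # A new ID starts when the original ID changes or the jump exceeds disgap.
    starts = (out[uid] != out[uid].shift()) | (dist > disgap)
    out[uid + suffix] = starts.cumsum() - 1
    return out

# Made-up records: uid 1 moves about 100 m, then jumps about 10 km.
data = pd.DataFrame({
    'uid': [1, 1, 1, 2],
    'lon': [114.000, 114.001, 114.100, 114.100],
    'lat': [22.5, 22.5, 22.5, 22.5],
})
data1 = reindex_by_distance(data, disgap=1000)
```

In this sample the first two records of uid 1 share one new ID, the third record starts another because of the long jump, and uid 2 receives its own.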