spacec.preprocessing package

Module contents

spacec.preprocessing.filter_data(df, nuc_thres=1, size_thres=1, nuc_marker='DAPI', cell_size='area', region_column='region_num', color_by=None, palette='Paired', alpha=0.8, size=0.4, log_scale=False)

Filter data based on nuclear threshold and size threshold, and visualize the data before and after filtering.

Parameters:
  • df (pandas.DataFrame) – The DataFrame to be filtered.

  • nuc_thres (int, optional) – The threshold applied to the nuclear marker signal (nuc_marker), by default 1.

  • size_thres (int, optional) – The threshold applied to the cell size (cell_size), by default 1.

  • nuc_marker (str, optional) – The name of the nuclear marker column, by default “DAPI”.

  • cell_size (str, optional) – The name of the column holding the cell size, by default “area”.

  • region_column (str, optional) – The name of the column identifying the region, by default “region_num”.

  • color_by (str, optional) – The column to color by, by default None.

  • palette (str, optional) – The color palette, by default “Paired”.

  • alpha (float, optional) – The alpha for the scatter plot, by default 0.8.

  • size (float, optional) – The size for the scatter plot, by default 0.4.

  • log_scale (bool, optional) – Whether to use log scale for the scatter plot, by default False.

Returns:

df_nuc – The filtered DataFrame.

Return type:

pandas.DataFrame
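The kind of threshold filtering described above can be sketched with plain pandas. This is an illustration only, not the spacec implementation: the helper name threshold_filter, the strict ">" comparison, the requirement that both thresholds be passed, and the toy data are all assumptions, and the before/after plots are omitted.

```python
import pandas as pd


def threshold_filter(df, nuc_thres=1, size_thres=1, nuc_marker="DAPI", cell_size="area"):
    """Hypothetical helper: keep rows above both thresholds (assumed strict '>')."""
    keep = (df[nuc_marker] > nuc_thres) & (df[cell_size] > size_thres)
    return df.loc[keep].copy()


# Toy table: one row per segmented cell (values are made up for illustration).
cells = pd.DataFrame(
    {
        "DAPI": [0.5, 3.0, 10.0, 8.0],
        "area": [50.0, 0.5, 120.0, 80.0],
        "region_num": [0, 0, 1, 1],
    }
)

filtered = threshold_filter(cells, nuc_thres=1, size_thres=1)
print(len(cells), "cells before filtering,", len(filtered), "after")
```

In this toy table the first row fails the nuclear threshold and the second fails the size threshold, so only the last two rows remain.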

spacec.preprocessing.format(data, list_out, list_keep, method='zscore', ArcSin_cofactor=150)

Format the data by normalizing it with the specified method. Four methods are supported: “zscore”, “double_zscore”, “MinMax”, and “ArcSin”.

Parameters:
  • data (DataFrame) – The input data to be formatted.

  • list_out (list) – The list of columns to be dropped from the data.

  • list_keep (list) – The list of columns to be kept in the data.

  • method (str, optional) – The method to be used for normalizing the data. It can be “zscore”, “double_zscore”, “MinMax”, or “ArcSin”. By default, it is “zscore”.

  • ArcSin_cofactor (int, optional) – The cofactor to be used in the ArcSin transformation. By default, it is 150.

Returns:

The formatted data.

Return type:

DataFrame

Raises:

ValueError – If the specified method is not supported.
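To make the method names more concrete, here is a minimal pandas/NumPy sketch of three common normalizations. It is not the spacec code: the helper normalize_markers and the marker names are hypothetical, “ArcSin” is assumed to mean the inverse hyperbolic sine of value / cofactor, the z-score is assumed to use the population standard deviation, and “double_zscore” is left out because its definition is not given here.

```python
import numpy as np
import pandas as pd


def normalize_markers(df, marker_cols, method="zscore", cofactor=150):
    """Hypothetical helper illustrating three common column-wise normalizations."""
    x = df[marker_cols].astype(float)
    if method == "zscore":
        # Center each column and scale by its population standard deviation.
        return (x - x.mean()) / x.std(ddof=0)
    if method == "MinMax":
        # Rescale each column to the [0, 1] range.
        return (x - x.min()) / (x.max() - x.min())
    if method == "ArcSin":
        # Assumption: inverse hyperbolic sine after dividing by the cofactor.
        return np.arcsinh(x / cofactor)
    raise ValueError(f"Unsupported method: {method!r}")


# Toy marker intensities (made up for illustration).
markers = pd.DataFrame({"CD4": [0.0, 150.0, 300.0], "CD8": [10.0, 20.0, 30.0]})

z = normalize_markers(markers, ["CD4", "CD8"], method="zscore")
mm = normalize_markers(markers, ["CD4", "CD8"], method="MinMax")
asinh = normalize_markers(markers, ["CD4", "CD8"], method="ArcSin", cofactor=150)
print(mm)
```

Passing any other method name raises ValueError, mirroring the behavior documented above.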

spacec.preprocessing.read_segdf(segfile_list, seg_method, region_list=None, meta_list=None)

Read the data frame output from the segmentation functions.

Parameters:
  • segfile_list (list) – List of segmented csv files to be read.

  • seg_method (str) – The segmentation method used.

  • region_list (list, optional) – List of regions, by default None. If given, it must have the same length as segfile_list.

  • meta_list (list, optional) – List of metadata, by default None. If given, it must have the same length as segfile_list.

Returns:

df – The concatenated DataFrame from all the segmentation files.

Return type:

pandas.DataFrame

Raises:

SystemExit – If the length of region_list or meta_list does not match that of segfile_list.
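The read-and-concatenate pattern described above might look roughly like the following pandas sketch. It is an illustration under stated assumptions, not the spacec implementation: the helper concat_tables is hypothetical, it raises ValueError rather than SystemExit on a length mismatch, the output column name region_num is assumed, and in-memory CSV text stands in for real segmentation files.

```python
import io

import pandas as pd


def concat_tables(sources, region_list=None):
    """Hypothetical helper: read each CSV and stack the rows, tagging each with a region."""
    if region_list is not None and len(region_list) != len(sources):
        raise ValueError("region_list must have one entry per input file")
    frames = []
    for i, src in enumerate(sources):
        frame = pd.read_csv(src)
        if region_list is not None:
            # Assumed column name; every row from this file gets the same label.
            frame["region_num"] = region_list[i]
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# In-memory stand-ins for two segmentation output files (made-up values).
csv_a = io.StringIO("x,y,DAPI\n1,2,5.0\n3,4,6.0\n")
csv_b = io.StringIO("x,y,DAPI\n5,6,7.0\n")

combined = concat_tables([csv_a, csv_b], region_list=["reg001", "reg002"])
print(combined)
```

Checking the list lengths before reading any file keeps a mismatch from producing a partially labeled table.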

spacec.preprocessing.remove_noise(df, col_num, z_sum_thres, z_count_thres)

Removes noisy cells from the dataset based on the given thresholds.

Parameters:
  • df (DataFrame) – The input data from which noisy cells are to be removed.

  • col_num (int) – The column number up to which the operation is performed.

  • z_sum_thres (float) – The threshold for the sum of z-scores. Cells with a sum of z-scores greater than this threshold are considered noisy.

  • z_count_thres (int) – The threshold for the count of z-scores. Cells with a count of z-scores greater than this threshold are considered noisy.

Returns:

  • df_want (DataFrame) – The cleaned data with noisy cells removed.

  • cc (DataFrame) – The data of the noisy cells that were removed from the original data.
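One way to read the two thresholds is sketched below with plain pandas. This is not the spacec implementation and rests on several assumptions: the helper split_noisy_cells is hypothetical, z-scores are computed per column over the first col_num columns, a z-score is counted when it exceeds a made-up cutoff z_cut, and a cell is treated as noisy only when both the sum and the count exceed their thresholds.

```python
import pandas as pd


def split_noisy_cells(df, col_num, z_sum_thres, z_count_thres, z_cut=1.0):
    """Hypothetical helper: return (kept cells, removed cells)."""
    markers = df.iloc[:, :col_num]
    # Column-wise z-scores using the population standard deviation.
    z = (markers - markers.mean()) / markers.std(ddof=0)
    z_sum = z.sum(axis=1)
    # Number of markers per cell whose z-score is above the assumed cutoff.
    z_count = (z > z_cut).sum(axis=1)
    noisy = (z_sum > z_sum_thres) & (z_count > z_count_thres)
    return df.loc[~noisy].copy(), df.loc[noisy].copy()


# Toy data: the last cell is bright in every marker (values are made up).
cells = pd.DataFrame(
    {
        "m1": [1.0, 2.0, 1.0, 2.0, 20.0],
        "m2": [2.0, 1.0, 2.0, 1.0, 30.0],
        "m3": [1.0, 1.0, 2.0, 2.0, 25.0],
        "region": ["a", "a", "a", "b", "b"],
    }
)

clean, removed = split_noisy_cells(cells, col_num=3, z_sum_thres=4, z_count_thres=2)
print(len(clean), "cells kept,", len(removed), "removed")
```

Here only the last cell is high in all three markers at once, so it is the one returned in the second DataFrame.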