spacec.preprocessing package

Module contents

spacec.preprocessing.filter_data(df, nuc_thres=1, size_thres=1, nuc_marker='DAPI', cell_size='area', region_column='region_num', color_by=None, palette='Paired', alpha=0.8, size=0.4, log_scale=False)

Filter the data by a nuclear-marker threshold and a cell-size threshold, and visualize the data before and after filtering.

Parameters:

df (pandas.DataFrame) – The DataFrame to be filtered.
nuc_thres (int, optional) – The threshold applied to the nuclear marker column, by default 1.
size_thres (int, optional) – The threshold applied to the cell size column, by default 1.
nuc_marker (str, optional) – The name of the nuclear marker column, by default “DAPI”.
cell_size (str, optional) – The name of the cell size column, by default “area”.
region_column (str, optional) – The name of the region column, by default “region_num”.
color_by (str, optional) – The column to color by, by default None.
palette (str, optional) – The color palette, by default “Paired”.
alpha (float, optional) – The opacity of the points in the scatter plot, by default 0.8.
size (float, optional) – The size of the points in the scatter plot, by default 0.4.
log_scale (bool, optional) – Whether to use log scale for the scatter plot, by default False.

Returns:

df_nuc – The filtered DataFrame.

Return type:

pandas.DataFrame
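
The thresholding step can be pictured as a pair of column comparisons. The following is a minimal sketch in plain pandas, not the library's implementation: it assumes a cell is kept when its nuclear marker value and its size both exceed their thresholds. The helper name `threshold_filter` and the toy data are hypothetical, and the before/after plots that `filter_data` produces are omitted.

```python
import pandas as pd


def threshold_filter(df, nuc_thres=1, size_thres=1, nuc_marker="DAPI", cell_size="area"):
    """Hypothetical helper: keep rows above both thresholds (assumed semantics)."""
    keep = (df[nuc_marker] > nuc_thres) & (df[cell_size] > size_thres)
    return df[keep].copy()


# Toy table using the default column names from the signature above.
cells = pd.DataFrame(
    {
        "DAPI": [0.5, 3.0, 8.0, 12.0],
        "area": [40.0, 0.5, 55.0, 70.0],
        "region_num": [0, 0, 1, 1],
    }
)

filtered = threshold_filter(cells, nuc_thres=1, size_thres=1)
kept_fraction = len(filtered) / len(cells)
print(f"kept {len(filtered)} of {len(cells)} cells ({kept_fraction:.0%})")
```

Reporting the kept fraction alongside the filtered table is a cheap way to notice thresholds that discard far more cells than intended.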

spacec.preprocessing.format(data, list_out, list_keep, method='zscore', ArcSin_cofactor=150)

Normalize the data using the specified method. Four methods are supported: “zscore”, “double_zscore”, “MinMax”, and “ArcSin”.

Parameters:

data (DataFrame) – The input data to be formatted.
list_out (list) – The list of columns to be dropped from the data.
list_keep (list) – The list of columns to be kept in the data.
method (str, optional) – The method to be used for normalizing the data. It can be “zscore”, “double_zscore”, “MinMax”, or “ArcSin”. By default, it is “zscore”.
ArcSin_cofactor (int, optional) – The cofactor to be used in the ArcSin transformation. By default, it is 150.

Returns:

The formatted data.

Return type:

DataFrame

Raises:

ValueError – If the specified method is not supported.
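
The method names correspond to common normalization schemes. The sketch below shows one conventional reading of three of them on a toy marker table. It is an illustration under assumptions, not the code behind `format`: the per-column direction of the z-score and min-max scaling, and the arcsinh form `arcsinh(x / cofactor)`, are assumed, and `normalize_markers` is a hypothetical name. “double_zscore” is left out because this page does not describe how it is computed.

```python
import numpy as np
import pandas as pd


def normalize_markers(markers, method="zscore", cofactor=150):
    """Hypothetical helper illustrating common normalization schemes."""
    if method == "zscore":
        # Center each column at 0 and scale by its sample standard deviation.
        return (markers - markers.mean()) / markers.std()
    if method == "MinMax":
        # Rescale each column to the [0, 1] range.
        return (markers - markers.min()) / (markers.max() - markers.min())
    if method == "ArcSin":
        # Assumed form: inverse hyperbolic sine of value / cofactor.
        return np.arcsinh(markers / cofactor)
    raise ValueError(f"unsupported method: {method!r}")


# Toy marker intensities; the column names are placeholders.
markers = pd.DataFrame({"CD3": [10.0, 20.0, 30.0], "CD8": [0.0, 150.0, 300.0]})

z = normalize_markers(markers, "zscore")
mm = normalize_markers(markers, "MinMax")
asinh = normalize_markers(markers, "ArcSin", cofactor=150)
```

Raising `ValueError` for an unknown method name mirrors the behavior documented above, so a typo in `method` fails loudly instead of returning unnormalized data.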

spacec.preprocessing.read_segdf(segfile_list, seg_method, region_list=None, meta_list=None)

Read the data frame output from the segmentation functions.

Parameters:

segfile_list (list) – List of segmented csv files to be read.
seg_method (str) – The segmentation method used.
region_list (list, optional) – List of regions, by default None. If given, it must have the same length as segfile_list.
meta_list (list, optional) – List of metadata, by default None. If given, it must have the same length as segfile_list.

Returns:

df – The concatenated DataFrame from all the segmentation files.

Return type:

pandas.DataFrame

Raises:

SystemExit – If the length of region_list or meta_list does not match the length of segfile_list.
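
The read-and-concatenate pattern described above can be sketched with plain pandas. This is an assumption-laden illustration rather than the library's code: `read_and_concat` is a hypothetical helper, the `region_num` column name is borrowed from the `filter_data` default, the handling of `seg_method` and `meta_list` is omitted, and in-memory CSV text stands in for real segmentation files. The sketch raises `ValueError` on a length mismatch, whereas the documented function raises `SystemExit`.

```python
import io

import pandas as pd


def read_and_concat(segfile_list, region_list=None):
    """Hypothetical helper: read each CSV, tag it with a region label, concatenate."""
    if region_list is not None and len(region_list) != len(segfile_list):
        raise ValueError("region_list must have the same length as segfile_list")
    frames = []
    for i, segfile in enumerate(segfile_list):
        frame = pd.read_csv(segfile)
        # Fall back to the file's position when no region labels are given.
        frame["region_num"] = i if region_list is None else region_list[i]
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# In-memory stand-ins for two segmentation output files.
csv_a = io.StringIO("x,y,area\n1,2,30\n3,4,45\n")
csv_b = io.StringIO("x,y,area\n5,6,50\n")

combined = read_and_concat([csv_a, csv_b], region_list=["reg001", "reg002"])
print(combined)
```

Checking the list lengths before reading any file keeps a mismatch from surfacing only after a long load.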

spacec.preprocessing.remove_noise(df, col_num, z_sum_thres, z_count_thres)

Remove noisy cells from the dataset based on the given thresholds.

Parameters:

df (DataFrame) – The input data from which noisy cells are to be removed.
col_num (int) – The column number up to which the z-score sum and count are computed.
z_sum_thres (float) – The threshold for the sum of z-scores. Cells with a sum of z-scores greater than this threshold are considered noisy.
z_count_thres (int) – The threshold for the count of z-scores. Cells with a count of z-scores greater than this threshold are considered noisy.

Returns:

df_want (DataFrame) – The cleaned data with noisy cells removed.
cc (DataFrame) – The data of the noisy cells that were removed from the original data.
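
The two thresholds can be read as a per-cell sum and a per-cell count over the z-scored marker columns. The sketch below is one plausible reading, not the library's implementation. Its assumptions: the marker z-scores occupy the first `col_num` columns, the count is the number of markers with a positive z-score, and a cell is treated as noisy when either threshold is exceeded. `split_noisy_cells` and the toy data are hypothetical.

```python
import pandas as pd


def split_noisy_cells(df, col_num, z_sum_thres, z_count_thres):
    """Hypothetical helper: split cells into kept and noisy sets (assumed rules)."""
    z = df.iloc[:, :col_num]        # assumed: z-scores sit in the first col_num columns
    z_sum = z.sum(axis=1)           # per-cell sum of z-scores
    z_count = (z > 0).sum(axis=1)   # assumed: number of markers with a positive z-score
    noisy = (z_sum > z_sum_thres) | (z_count > z_count_thres)
    return df[~noisy].copy(), df[noisy].copy()


# Toy z-scored table; the marker names are placeholders.
cells = pd.DataFrame(
    {
        "CD3": [0.2, 2.5, -0.4],
        "CD8": [-0.1, 3.0, 0.3],
        "CD20": [-0.5, 2.8, -0.2],
        "region_num": [0, 0, 1],
    }
)

df_want, cc = split_noisy_cells(cells, col_num=3, z_sum_thres=5, z_count_thres=2)
print(f"kept {len(df_want)} cells, removed {len(cc)}")
```

Returning the removed cells alongside the kept ones, as the documented function does, makes it possible to inspect what was discarded before committing to the thresholds.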