spacec.preprocessing package

Module contents

spacec.preprocessing.compensate_cell_matrix(df, image_dict, masks, overwrite=True, device=None)

Compensate cell matrix by computing channel means and sums.

Parameters:

df (DataFrame) – The DataFrame to which the compensated means will be added.
image_dict (dict) – Dictionary containing images for each channel.
masks (ndarray) – 3D numpy array containing masks for each cell.
overwrite (bool, optional) – If True, overwrite existing columns in df. If False, add new columns to df. Default is True.
device (str|None) – Compute device used by compute_channel_means_sums_compensated. None (default) selects the device automatically; set to cuda, mps, or cpu to force one.

Returns:

The DataFrame with added compensated means.

Return type:

DataFrame

Notes

The function computes the channel means and sums for each cell, compensates them, and adds the compensated means to the DataFrame under column names taken from the keys of image_dict. If overwrite is True, existing columns with those names are overwritten; if False, new columns are added instead.
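To make "channel means and sums for each cell" concrete, the sketch below computes the plain (uncompensated) per-cell sum and mean for a single channel. It is an illustration only: the compensation step is not described here and is not reproduced. The helper name channel_sums_and_means is hypothetical, nested lists stand in for arrays, and the mask layout (one boolean mask per cell, stacked along the first axis) is an assumption about what "3D array" means.

```python
def channel_sums_and_means(image, masks):
    """Return (sums, means) with one value per cell for a single channel.

    image: 2D nested list of pixel intensities.
    masks: 3D nested list, one 2D mask per cell (assumed layout).
    """
    sums, means = [], []
    for mask in masks:
        # Collect the pixel values that fall inside this cell's mask.
        pixels = [
            value
            for image_row, mask_row in zip(image, mask)
            for value, inside in zip(image_row, mask_row)
            if inside
        ]
        total = sum(pixels)
        sums.append(total)
        # An empty mask yields a mean of 0.0 rather than a division error.
        means.append(total / len(pixels) if pixels else 0.0)
    return sums, means


image = [
    [1, 2, 3],
    [4, 5, 6],
]
masks = [
    [[1, 1, 0], [0, 0, 0]],  # cell 0 covers the pixels with values 1 and 2
    [[0, 0, 1], [0, 1, 1]],  # cell 1 covers the pixels with values 3, 5 and 6
]
sums, means = channel_sums_and_means(image, masks)
```

In the documented function, one such set of values per key of image_dict would become one column of df.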

spacec.preprocessing.filter_data(df, nuc_thres=1, size_thres=1, nuc_marker='DAPI', cell_size='area', region_column='region_num', color_by=None, palette='Paired', alpha=0.8, size=0.4, log_scale=False, plot=False)

Filter data based on nuclear threshold and size threshold, and visualize the data before and after filtering.

Parameters:

df (pandas.DataFrame) – The DataFrame to be filtered.
nuc_thres (int, optional) – Threshold applied to the nuclear marker signal, by default 1.
size_thres (int, optional) – Threshold applied to the cell size, by default 1.
nuc_marker (str, optional) – Name of the column holding the nuclear marker signal, by default “DAPI”.
cell_size (str, optional) – Name of the column holding the cell size, by default “area”.
region_column (str, optional) – Name of the column identifying the region, by default “region_num”.
color_by (str, optional) – The column to color by, by default None.
palette (str, optional) – The color palette, by default “Paired”.
alpha (float, optional) – The alpha for the scatter plot, by default 0.8.
size (float, optional) – The size for the scatter plot, by default 0.4.
log_scale (bool, optional) – Whether to use log scale for the scatter plot, by default False.
plot (bool, optional) – Whether to draw the before/after plots, by default False.

Returns:

df_nuc – The filtered DataFrame.

Return type:

pandas.DataFrame
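As a rough illustration of the two-threshold filter, the sketch below keeps only rows whose nuclear signal and cell size both exceed their thresholds. The helper filter_rows is hypothetical, a list of dicts stands in for the DataFrame so the example runs without dependencies, and the use of a strict "greater than" comparison is an assumption, since the comparison direction is not stated above.

```python
def filter_rows(rows, nuc_thres=1, size_thres=1, nuc_marker="DAPI", cell_size="area"):
    """Keep rows whose nuclear signal and size both exceed their thresholds."""
    return [
        row
        for row in rows
        # Assumption: strict '>' on both columns.
        if row[nuc_marker] > nuc_thres and row[cell_size] > size_thres
    ]


cells = [
    {"DAPI": 0.5, "area": 40, "region_num": 1},   # fails the nuclear threshold
    {"DAPI": 12.0, "area": 0.5, "region_num": 1}, # fails the size threshold
    {"DAPI": 8.0, "area": 55, "region_num": 2},   # passes both
]
kept = filter_rows(cells)
```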

spacec.preprocessing.format(data, list_out, list_keep, method='zscore', ArcSin_cofactor=150)

Format (normalize) the data using the specified method. Four methods are supported: “zscore”, “double_zscore”, “MinMax”, and “ArcSin”.

Parameters:

data (DataFrame) – The input data to be formatted.
list_out (list) – The list of columns to be dropped from the data.
list_keep (list) – The list of columns to be kept in the data.
method (str, optional) – The method to be used for normalizing the data. It can be “zscore”, “double_zscore”, “MinMax”, or “ArcSin”. By default, it is “zscore”.
ArcSin_cofactor (int, optional) – The cofactor to be used in the ArcSin transformation. By default, it is 150.

Returns:

The formatted data.

Return type:

DataFrame

Raises:

ValueError – If the specified method is not supported.
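The arithmetic behind three of the methods can be sketched on a single column of values. This is a dependency-free illustration, not the library's implementation: the helper normalize is hypothetical, “ArcSin” is assumed to mean the inverse hyperbolic sine of the value divided by the cofactor (the usual convention for cofactor-based transforms), the z-score uses the population standard deviation as an assumption, and “double_zscore” is omitted because its definition is not given here.

```python
import math
from statistics import mean, pstdev


def normalize(values, method="zscore", cofactor=150):
    """Normalize one column of numbers with the named method."""
    if method == "zscore":
        mu, sigma = mean(values), pstdev(values)
        # A constant column (sigma == 0) would divide by zero here.
        return [(v - mu) / sigma for v in values]
    if method == "MinMax":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    if method == "ArcSin":
        # Assumption: inverse hyperbolic sine after scaling by the cofactor.
        return [math.asinh(v / cofactor) for v in values]
    raise ValueError(f"Unsupported method: {method!r}")


intensities = [1.0, 2.0, 3.0]
z = normalize(intensities, "zscore")
scaled = normalize(intensities, "MinMax")
compressed = normalize([150.0], "ArcSin")
```

Unknown method names raise ValueError, mirroring the behaviour documented above.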

spacec.preprocessing.read_segdf(segfile_list, seg_method, region_list=None, meta_list=None)

Read the data frame output from the segmentation functions.

Parameters:

segfile_list (list) – List of segmented csv files to be read.
seg_method (str) – The segmentation method used.
region_list (list, optional) – List of regions, by default None. If given, its length must match that of segfile_list.
meta_list (list, optional) – List of metadata, by default None. If given, its length must match that of segfile_list.

Returns:

df – The concatenated DataFrame from all the segmentation files.

Return type:

pandas.DataFrame

Raises:

SystemExit – If the length of region_list or meta_list does not match with segfile_list.
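The length check and per-file concatenation can be sketched with the standard library alone. Everything named here is illustrative: read_tables is a hypothetical helper, in-memory text buffers stand in for CSV files, lists of dicts stand in for DataFrames, and the “region” column name is an assumption about how each file's region is recorded.

```python
import csv
import io


def read_tables(sources, region_list=None):
    """Read several CSV sources and concatenate their rows."""
    if region_list is not None and len(region_list) != len(sources):
        # Mirrors the documented behaviour of exiting on a length mismatch.
        raise SystemExit("region_list must have the same length as the file list")
    rows = []
    for index, source in enumerate(sources):
        for row in csv.DictReader(source):
            if region_list is not None:
                row["region"] = region_list[index]  # hypothetical column name
            rows.append(row)
    return rows


file_a = io.StringIO("x,y\n1,2\n")
file_b = io.StringIO("x,y\n3,4\n5,6\n")
rows = read_tables([file_a, file_b], region_list=["reg001", "reg002"])
```

Note that csv.DictReader yields every field as a string, whereas a DataFrame reader would infer numeric types.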

spacec.preprocessing.remove_noise(df, col_num, z_sum_thres, z_count_thres)

Removes noisy cells from the dataset based on the given thresholds.

Parameters:

df (DataFrame) – The input data from which noisy cells are to be removed.
col_num (int) – The column number up to which the operation is performed.
z_sum_thres (float) – The threshold for the sum of z-scores. Cells with a sum of z-scores greater than this threshold are considered noisy.
z_count_thres (int) – The threshold for the count of z-scores. Cells with a count of z-scores greater than this threshold are considered noisy.

Returns:

df_want (DataFrame) – The cleaned data with noisy cells removed.
cc (DataFrame) – The data of the noisy cells that were removed from the original data.
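One possible reading of the two thresholds is sketched below. Several details are assumptions, because the description above leaves them open: the count is taken as the number of positive z-scores in a row, a cell is treated as noisy only when both the sum and the count exceed their thresholds, and col_num is treated as an exclusive upper bound. The helper split_noisy is hypothetical and uses plain lists in place of a DataFrame.

```python
def split_noisy(rows, col_num, z_sum_thres, z_count_thres):
    """Split rows into (clean, noisy) using z-score sum and count thresholds."""
    clean, noisy = [], []
    for row in rows:
        markers = row[:col_num]  # assumption: the first col_num entries are z-scores
        z_sum = sum(markers)
        z_count = sum(1 for z in markers if z > 0)  # assumption: count positive z-scores
        # Assumption: both conditions must hold for a cell to be noisy.
        if z_sum > z_sum_thres and z_count > z_count_thres:
            noisy.append(row)
        else:
            clean.append(row)
    return clean, noisy


cells = [
    [0.1, -0.3, 0.2, "a"],   # low sum: kept
    [2.5, 3.0, 2.8, "b"],    # high on every marker: flagged as noisy
    [4.0, -1.0, -1.5, "c"],  # one strong marker only: kept
]
clean, noisy = split_noisy(cells, col_num=3, z_sum_thres=5, z_count_thres=2)
```

The two returned lists correspond to df_want and cc in the documented return value.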