---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
options(tibble.print_min = 5, tibble.print_max = 5)
```

# Spatial permutation and normalization of multiplexed immunofluorescence imaging data

<!-- badges: start -->
<!-- badges: end -->

## Overview
This R script identifies statistically significant spatial features, i.e. positive or negative cell-cell colocalizations, using colocation quotient (CLQ) analysis. Here we describe how to calculate the CLQ, create a null distribution of CLQ values, and normalize the data. The normalization accounts for the number of cells in each subpopulation: subpopulations with a low cell count are more likely to yield a broader distribution of CLQ values during the permutation analysis, because random label sampling has a substantial impact on the CLQ calculation when few cells are involved.

* `get_CLQ()` The colocation quotient (CLQ) quantifies how strongly one cell subpopulation colocates spatially with another among a set of nearest neighbors, defined here as 20. We calculated the CLQ for each pair of cell types identified with CELESTA (Zhang et al., 2022, Nature Methods) under naïve and treatment conditions using the following equation: $\mathrm{CLQ}_{b \to a} = \dfrac{C_{b \to a} / N_a}{N_b / (N - 1)}$, where $C_{b \to a}$ is the number of cells of type $b$ among the defined nearest neighbors of cells of type $a$, $N$ is the total number of cells, and $N_a$ and $N_b$ are the numbers of cells of type $a$ and type $b$, respectively.

* `KNN_neighbors()` Finds the N nearest neighboring cells.

* `find_cell_type_neighbors()` Finds the cell types of the neighboring cells.

* `CLQ_permutated_matrix_gen()` Assesses the significance of the CLQ values by randomly permuting the cell labels (cell types) 500 times while preserving the subpopulation proportions.

* `get_counts()` Counts the number of cells in each subpopulation. It generates a summary table with the cell type number, the corresponding name, and the cell count in the sample.

* `CLQ_matrix_gen()` Reads the CELESTA cell assignment file and generates the original CLQ matrix.

* `CLQ_permutated_matrix_gen_caller()` Retrieves the permutated matrix output for each sample.

* `get_counts_caller()` Retrieves the subpopulation counts output for each sample.

* `significance_matrix_gen()` Identifies statistically significant CLQ values. CLQ values falling outside or at the tails of the distribution generated by the permutation analysis are considered significant, whereas values within the distribution are deemed non-significant, as they can be reproduced after spatial randomization. Percentile values < 0.05 or > 0.95 are considered significant. The normalization achieved through the permutation analysis not only facilitates spatial feature comparisons but also enables the comparison of different conditions from the same or independent experiments. CLQs are normalized according to the following formula: (Observed CLQ − Mean CLQ) / (Max CLQ − Mean CLQ).

* `plot_gen()` Plots the distribution of all the permutation CLQ values for each cell pair. The blue bar is the normalized CLQ value and the red bar is the original CLQ value.

* `CLQ_normalization_by_sample()` This function requires (1) a named vector with the original CLQ values for one sample before normalization, where each element is named with the two cell types of the cell pair joined by "_"; (2) a cell count file with the number of cells of each cell type in that sample; (3) the number of nearest neighbors used in the CLQ calculation; (4) a cell count threshold for rare cell populations (default: 5); (5) the prior cell type signature matrix used as CELESTA input; and (6) the clipping parameters (default: 0.05; a warning message will suggest clipping more as needed). The original CLQ distribution is bell-shaped but skewed at the tail. The clipping parameter allows for better visualization when normalizing the data.
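
As a minimal sketch of the CLQ equation above (not the script's own `get_CLQ()` or `KNN_neighbors()` code), the example below finds nearest neighbors by brute force in base R and applies the formula as written. The helper names `knn_index()` and `clq()` and the toy coordinates are hypothetical. The script itself relies on spdep for neighborhood information and uses 20 neighbors rather than the single neighbor used here.

```r
# Hypothetical helper: indices of the k nearest neighbors of every cell,
# found by brute force from the pairwise Euclidean distances.
knn_index <- function(coords, k) {
  d <- as.matrix(dist(coords))
  diag(d) <- Inf  # a cell is not its own neighbor
  nn <- matrix(0L, nrow = nrow(d), ncol = k)
  for (i in seq_len(nrow(d))) {
    nn[i, ] <- order(d[i, ])[seq_len(k)]
  }
  nn
}

# Hypothetical helper: CLQ_{b -> a} = (C_{b -> a} / N_a) / (N_b / (N - 1)),
# applied exactly as the equation is written in this README.
clq <- function(cell_type, nn, a, b) {
  N   <- length(cell_type)
  N_a <- sum(cell_type == a)
  N_b <- sum(cell_type == b)
  # Count type-b cells among the neighbors of all type-a cells.
  neighbors_of_a <- as.vector(nn[cell_type == a, , drop = FALSE])
  C_ba <- sum(cell_type[neighbors_of_a] == b)
  (C_ba / N_a) / (N_b / (N - 1))
}

# Toy sample (invented): four cells on a line, one nearest neighbor each.
coords    <- cbind(x = c(0, 1, 2.5, 3), y = 0)
cell_type <- c("A", "B", "A", "B")
nn        <- knn_index(coords, k = 1)

clq_B_to_A <- clq(cell_type, nn, a = "A", b = "B")
print(clq_B_to_A)  # 1.5: both A cells have a B cell as their nearest neighbor
```

In the toy sample, C = 2, N_a = 2, N_b = 2 and N = 4, so the quotient is (2/2) / (2/3) = 1.5.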

## Dependency

- [spdep](https://cran.r-project.org/web/packages/spdep/index.html): for obtaining spatial neighborhood information
- [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html)

## Usage

```{r,results='hide',message=FALSE, eval = FALSE}
library(spdep)
library(ggplot2)

### Samples are first processed to generate the original and permutated CLQs
### for each cell-to-cell pair in a given sample.
### Input file name example: "TAFs1_cell_type_assignment.csv"

files <- Sys.glob("*cell_type_assignment.csv")

for (f in files) {
  print(f)
  filename_c <- f

  count_file <- get_counts(filename = filename_c)

  ### CLQ_permutated_matrix_gen() takes the iteration number, the file name, and
  ### the count_file generated in the previous step. It depends on
  ### get_CLQ(), KNN_neighbors(), and find_cell_type_neighbors().
  ### iternum is the number of iterations of the permutation analysis, set to 500.

  CLQ_permutated <- CLQ_permutated_matrix_gen(iternum = 500,
                                              filename = filename_c,
                                              df_c = count_file)
}

### Then, the significance of each CLQ is calculated from its percentile.
### The original and permutated CLQs are retrieved for each sample through
### CLQ_matrix_gen(), CLQ_permutated_matrix_gen_caller(), and get_counts_caller().

files <- Sys.glob("*cell_type_assignment.csv")

for (f in files) {
  print(f)
  filename_c <- f

  CLQ_matrix <- CLQ_matrix_gen(filename = filename_c)

  CLQ_permutated <- CLQ_permutated_matrix_gen_caller(filename = filename_c)

  count_file <- get_counts_caller(filename = filename_c)

  ### list_of_matrices holds the 500 CLQ sets, one per iteration.

  significance_matrices <- significance_matrix_gen(iternum = 500,
                                                   filename = filename_c,
                                                   list_of_matrices = CLQ_permutated,
                                                   CLQ_matrix_original = CLQ_matrix,
                                                   df_c = count_file)

  ### plot_gen() generates one plot per CLQ, i.e. per pair of cell type A to
  ### cell type B. It takes the permutated CLQ values from CLQ_permutated and
  ### the original CLQ values from CLQ_matrix.

  plot_gen(iternum = 500,
           filename = filename_c,
           list_of_matrices = CLQ_permutated,
           CLQ_matrix_original = CLQ_matrix)
}
```
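
The permutation and percentile steps run by the loops above can be sketched on toy data as follows. This is only an illustration under stated assumptions: `toy_stat()` is a hypothetical stand-in for the CLQ, the labels are invented, and the exact percentile rule used by `significance_matrix_gen` (for example, how ties are handled) is assumed rather than taken from the script.

```r
set.seed(42)

# Hypothetical sample: cells ordered along a line, with A and B cells
# alternating at the start and C cells filling the rest.
labels <- c(rep(c("A", "B"), times = 20), rep("C", times = 160))

# Hypothetical stand-in for the CLQ: number of A cells directly followed by a B cell.
toy_stat <- function(l) sum(head(l, -1) == "A" & tail(l, -1) == "B")

observed <- toy_stat(labels)

# Shuffling the labels changes their positions but not the subpopulation counts.
iternum <- 500
null_values <- vapply(seq_len(iternum),
                      function(i) toy_stat(sample(labels)),
                      numeric(1))

# Percentile of the observed value within the permutation (null) distribution;
# values below 0.05 or above 0.95 are flagged as significant.
percentile  <- mean(null_values <= observed)
significant <- percentile < 0.05 || percentile > 0.95

print(c(observed = observed, percentile = percentile))
print(significant)
```

Because `sample(labels)` only reorders the existing labels, every permutation keeps the same number of A, B and C cells, which is what preserving the subpopulation proportions means here.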

## Inputs
The spatial permutation analysis requires two inputs:<br/>
`1. CELESTA cell subpopulations`: <br/>
A dataframe with a single column named `cell_types` that lists all the user-defined CELESTA cell subpopulations. <br/>

See file example: “cell_types_celesta.csv”

`2. Segmented imaging data with CELESTA cell assignment`:<br/>
The `_cell_type_assignment.csv` output dataframe from the CELESTA algorithm, which is available at https://github.com/plevritis-lab/CELESTA.

See file example: “TAFs1_cell_type_assignment.csv”


## Outputs
Spatial permutation outputs:
1. After running the `get_counts()` function, the script will output a .csv file with the number of cells for each cell subpopulation. <br/>

See file example: “TAFs1_CellCounts.csv”

2. After running the `CLQ_matrix_gen` function, the script will output a .csv file with the original CLQ values for each cell pair.

See file example: “TAFs1_CLQ.csv”

3. After running the `CLQ_permutated_matrix_gen` function, the script will output a .csv file with the 500 CLQ values obtained by randomly permuting the cell labels (cell subpopulations) 500 times while preserving the proportions. These values are plotted by the `plot_gen` function.

See file example: “TAFs1_CLQ_Permutated.csv”

4. After running the `significance_matrix_gen` function, the script will output a .csv file with the sample name, the identity and count of each cell subpopulation, the original CLQ value, the percentile, and whether the value is deemed significant.

Note that an original CLQ value of zero may be caused by insufficient numbers of cells of the respective cell types. These values are filtered out in post-processing prior to colocatome generation.

See file example: “TAFs1_CLQ_data_full.csv” and .png images in the PA_figures_TAFs1 folder.

5. After running the `CLQ_normalization_by_sample` function, the script will output a .csv file with the normalized values.

See file example: “TAFs1_CLQ_Normalized_L0_R0.05”. L0 = left clipping parameter at 0 (no clipping) and R0.05 = right clipping parameter at 0.05.

Note that the folder contains only a subset of the distribution plots.
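
The normalization formula, (Observed CLQ − Mean CLQ) / (Max CLQ − Mean CLQ), can be illustrated with a few lines of R. In this sketch `normalize_clq()` is a hypothetical helper, the permuted values are invented, and the mean and maximum are assumed to be taken over the permutation distribution. The rare-population threshold and the clipping parameters of `CLQ_normalization_by_sample` are not modeled.

```r
# Hypothetical helper: scale an observed CLQ by the mean and maximum
# of its permutation distribution (assumed source of Mean CLQ and Max CLQ).
normalize_clq <- function(observed, permuted) {
  mean_clq <- mean(permuted)
  max_clq  <- max(permuted)
  (observed - mean_clq) / (max_clq - mean_clq)
}

# Invented permutation CLQs for one cell pair (mean 1, maximum 1.5).
permuted <- c(0.5, 0.75, 1, 1.25, 1.5)

print(normalize_clq(1,   permuted))  # 0: observed equals the permutation mean
print(normalize_clq(1.5, permuted))  # 1: observed equals the permutation maximum
print(normalize_clq(2,   permuted))  # 2: observed lies beyond the permutation maximum
print(normalize_clq(0.5, permuted))  # -1: observed lies below the permutation mean
```

Under these assumptions, a normalized value above 1 or well below 0 indicates an observed CLQ outside the range that random label shuffling produces for that cell pair.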

## Getting help
If you encounter a bug, please file an issue with a minimal reproducible example on [GitHub](https://github.com/plevritis-lab/Spatial_Permutation_and_Normalization/issues). For questions and other discussion, please use [community.rstudio.com](https://community.rstudio.com/).