Clustering and Cell Type Annotation Analyses on Bone Marrow Organoids dataset

In this case study, we perform clustering and cell type annotation on a dataset of bone marrow organoids (BMO) generated from human induced pluripotent stem cells.

You can import the study from this tutorial into your study list, with view permission, using the following link:

https://cytoanalyst.tinnguyen-lab.com/studies/import/hvjw7ifjDWqSHvkAG

Dataset

We collected the bone marrow organoids dataset for this analysis from the CZ CELLxGENE database. Below is the dataset information:

Dataset download link: here
Collection: Generation of complex human induced pluripotent stem cell-derived bone marrow organoids
Publication: Generation of complex bone marrow organoids from human induced pluripotent stem cells

To download the dataset, follow these steps:

Click the Download button to open the download dialog.
Under the DATA FORMAT, choose the .h5ad (AnnData v0.10) file format.
Click the Download button at the bottom to download the dataset.

Workflow

Create a New Study

To create a new study, navigate to the Study Management page and fill in the required fields in the Create Study form with the following inputs:

Name: Case Study - Clustering & Cell Type Annotation. A descriptive name for the case study.
Description: Performing Clustering and Cell Type Annotation on a Bone Marrow Organoids dataset. A brief description of the case study.

Click the Create Study button to generate the new study.

Once the study is created, you will be redirected to the Data page of the new study.

Note: To modify the study name or description, click the Studies button in the Study navigation bar located at the top of the page. For additional information about managing studies, please refer to the Study Management page.

Upload Data

Uploading and Processing the data

On the data management page, we will upload the downloaded dataset. Click the Click to upload button and select the downloaded dataset to upload.

File Type: AnnData (.h5ad). Specifies the file format of the dataset.
Assay: Default. Indicates the type of assay used in the dataset. In this case, it is the default assay.
Feature ID Column: feature_name. Identifies the column in the dataset that contains the feature IDs, which are the gene names in this case.
Keep Embeddings: True. Indicates whether to retain the precomputed embeddings in the dataset.
Embeddings: umap - pca. Specifies precomputed visualizations and embeddings to be kept.
Keep Metadata in h5ad File: True. Indicates whether to retain the metadata within the dataset file.
Extra Metadata File: Empty. Specifies an additional metadata file to be uploaded. In this case, no extra metadata file is provided.
Has Multiple Samples: False. Indicates whether the dataset contains multiple samples. In this case, there is only one sample in the dataset.

Click the Submit button to start the data processing.

A job will be created in the background to process the data. Once the job is complete, the right panel will display options for data filtering. Visit Study Logs to learn more about monitoring analysis jobs and system status.

Data Filtering

Once the data processing is complete, the filtering options will appear in the Data Filtering panel.

Note: The authors have already excluded barcodes with fewer than 400 detected genes, more than 40,000 counts in total, or mitochondrial genes exceeding 10% of the total gene counts. Refer to the publication for further details about the filtering process.

Therefore, no further filtering will be applied to the dataset.
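The thresholds quoted in the note can be read as a simple per-barcode check. The sketch below is illustrative only: the field names (`n_genes`, `total_counts`, `mito_counts`), the function `passes_qc`, and the example barcodes are hypothetical, and this is not the authors' or CytoAnalyst's implementation.

```python
def passes_qc(n_genes, total_counts, mito_counts,
              min_genes=400, max_counts=40_000, max_mito_fraction=0.10):
    """Return True if a barcode survives the three exclusion rules in the note."""
    if n_genes < min_genes:          # fewer than 400 detected genes
        return False
    if total_counts > max_counts:    # more than 40,000 counts in total
        return False
    mito_fraction = mito_counts / total_counts if total_counts else 0.0
    if mito_fraction > max_mito_fraction:  # mitochondrial share above 10%
        return False
    return True


# Hypothetical barcodes, for illustration only.
barcodes = {
    "cell_a": dict(n_genes=1200, total_counts=5000, mito_counts=200),
    "cell_b": dict(n_genes=350, total_counts=900, mito_counts=20),
    "cell_c": dict(n_genes=2500, total_counts=8000, mito_counts=1200),
}
kept = [name for name, qc in barcodes.items() if passes_qc(**qc)]
print(kept)
```

Because the downloaded dataset already satisfies these rules, running such a check on it would be expected to keep every barcode, which is why no additional filtering is applied here.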

Save data

Click the Save data button in the Data Filtering panel to open the dialog for saving data.

This dialog enables you to choose which samples and embeddings to save.

By default, the sample name is NewSample. To rename it, click the Edit button next to the NewSample and change it to Bone Marrow.

Finally, click the Save data button to save your selection. The newly saved data will then appear in the data table.

Navigate to the Analysis page.

Once the data is saved, click the Analysis button in the Study navigation bar at the top of the page to navigate to the Analysis page.

The Analysis page provides a comprehensive view of the data and analysis tools. The basic layout of the Analysis page is shown below:

Top Toolbar: Contains dropdown menus for selecting embedding, data normalization, plot type, blending mode, and color map.
Left Sidebar: Contains the label selection panel for selecting labels to visualize.
Bottom Drawer: Contains all analysis tools.

For more details about navigation and understanding the layout of the Analysis page, refer to Data Analysis.

Data Exploration

To capture the major cell populations in the dataset, we visualize the cell landscape to manually identify potential clusters.

Follow these steps to visualize the cell landscape:

Ensure that the value Bone Marrow of the Samples label in the left sidebar is selected.
Click the button next to the Samples label to display the cell landscape.

Visualizing the cell landscape reveals that the cells are distinctly separated into three clusters (marked as I, II, and III), which serve as the foundation for cell type annotation.

Clustering Analysis

Create Clustering Analysis

We perform clustering analysis on the dataset to define the boundaries of three distinct cell populations. Specifically, clustering is executed at various resolutions to examine cluster granularity and determine optimal boundaries. Multiple clustering analyses will be conducted with resolutions of 0.1, 0.2, 0.3, 0.4, and 0.5 to achieve this goal.

Follow these steps to perform the analysis:

Click the Clustering tab in the Bottom Drawer, then click the New Clustering button.
In the clustering analysis form, use the following settings:
- Embedding: pca – Specifies the embedding to use for clustering.
- Name: Louvain 0.1 – The name of the clustering analysis.
- Method: Louvain – The clustering method to use.
- Resolution: 0.1 – A parameter used in Louvain or Leiden algorithms to control clustering granularity.
  
  Higher values result in more clusters.
- Number of neighbors: 20 - The number of nearest neighbors to consider when constructing the graph for clustering.
- Distance Metric: Euclidean – The distance metric used to calculate the distance between cells.
- Number of Iterations: 10 - The maximum number of times the algorithm will run for each random start to optimize the modularity score by refining cluster assignments.
- Number of random starts: 10 - The number of times the clustering algorithm will be run with different random initializations; the final result is taken from the run that yields the best clustering solution.
Click the Create button to generate the new clustering analysis.
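To illustrate why higher resolution values result in more clusters, the sketch below computes a resolution-scaled modularity score on a toy graph. It assumes the commonly used form of modularity in which the expected-edge term is multiplied by the resolution; whether CytoAnalyst optimizes exactly this objective is an assumption, and the function `modularity` and the toy graph are hypothetical.

```python
def modularity(edges, communities, resolution=1.0):
    """Resolution-scaled modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs; communities: list of sets of nodes.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    score = 0.0
    for members in communities:
        # Edges with both endpoints inside the community.
        internal = sum(1 for u, v in edges if u in members and v in members)
        total_degree = sum(degree[node] for node in members)
        # Observed internal edge share minus the resolution-scaled expectation.
        score += internal / m - resolution * (total_degree / (2 * m)) ** 2
    return score


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
merged = [{0, 1, 2, 3, 4, 5}]
split = [{0, 1, 2}, {3, 4, 5}]

for resolution in (0.1, 0.5):
    q_merged = modularity(edges, merged, resolution)
    q_split = modularity(edges, split, resolution)
    best = "split" if q_split > q_merged else "merged"
    print(resolution, round(q_merged, 3), round(q_split, 3), best)
```

On this toy graph the single-community partition scores higher at resolution 0.1, while the two-community partition scores higher at 0.5. A larger resolution penalizes large communities more heavily, so finer partitions win.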

Clustering using Louvain with 0.1 resolution

Then, we repeat the same steps to create clustering analyses with resolutions of 0.2, 0.3, 0.4, and 0.5.
The settings for these new analyses are shown in the image below:

A: Clustering form with 0.2 resolution.
B: Clustering form with 0.3 resolution.
C: Clustering form with 0.4 resolution.
D: Clustering form with 0.5 resolution.

Clustering using Louvain with Resolutions of 0.2, 0.3, 0.4, and 0.5
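The five clustering forms differ only in the resolution and the name. As a compact summary, the sketch below generates the five parameter sets from one shared base. The dictionary keys and the function `clustering_runs` are hypothetical labels for the form fields, not a CytoAnalyst API.

```python
# Settings shared by all five analyses (values taken from the form above).
BASE_SETTINGS = {
    "embedding": "pca",
    "method": "Louvain",
    "n_neighbors": 20,
    "distance_metric": "Euclidean",
    "n_iterations": 10,
    "n_random_starts": 10,
}
RESOLUTIONS = [0.1, 0.2, 0.3, 0.4, 0.5]


def clustering_runs(base, resolutions):
    """Return one settings dict per resolution, named '<method> <resolution>'."""
    runs = []
    for resolution in resolutions:
        run = dict(base)  # copy so the shared base is never modified
        run["resolution"] = resolution
        run["name"] = f"{base['method']} {resolution}"
        runs.append(run)
    return runs


runs = clustering_runs(BASE_SETTINGS, RESOLUTIONS)
for run in runs:
    print(run["name"], run["resolution"])
```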

Visualize Clustering results

Once the clustering analysis is complete, the results will appear in the Clustering Table under the Existing Clustering tab.
To view the table:

Click the Clustering tab in the Bottom Drawer.
Switch to the Existing Clustering tab to access the results.

Click the icon next to Louvain 0.1 to view detailed analysis information, including the parameters used and the cell counts per cluster.

To visualize clustering results as plots, follow these steps:

In the top toolbar, ensure the settings are configured as follows:
- Visualization embedding: pca - Specifies which embedding to be used for visualization.
- Plot Type: Scatter - Specifies the type of chart to display.
- Plot blending mode: Separate - Specifies the blending mode for visualization.
Switch to the Observation panel in the left sidebar, then:
- Ensure the Samples label is selected.
- Under the Clusters category, click the button next to Louvain 0.1, Louvain 0.2, Louvain 0.3, Louvain 0.4, and Louvain 0.5 to visualize the clustering results.

At this stage, the plots may not be well-arranged. Therefore, we adjust the visualization settings to organize them correctly.

In the top toolbar, click the button to open the visualization settings panel, and configure the following settings:
- Number of rows: 2 – Specifies the number of rows in the grid layout.
- Sync zoom: Enable – When enabled, zooming in on one scatter plot automatically zooms in on all scatter plots in the grid.
- Show plot title: Enable – Displays the title of each plot in the grid. You can position the title to the left, center, or right.
In the visualization settings table, focus on the key customization settings as outlined below:
- Name: The name/title of the plot. You can edit the name by clicking the icon next to the current plot name, if desired.
- Blend Mode: The blending mode applied to the plot. Blend modes can be used to combine multiple plots into one visualization. See Blend Mode for details on using blending modes.
- Color: The color mapping used in the plot. Depending on the chart type, the color mapping can be of two types:
  - Value: Based on the expression values, such as the minimum and maximum expression values.
  - Group: Based on groupings, such as metadata, clusters, or annotations. For more details about color customization, refer to Visualization Settings.
- Action:
  - Click the Remove icon to remove the plot from the grid if needed.

Visualization Settings of Clustering Results

Click anywhere outside the visualization settings panel to close it. The plot details are as follows:

The first plot displays the UMAP visualization of the dataset.
The second through sixth plots show clustering results at resolutions 0.1, 0.2, 0.3, 0.4, and 0.5.

The Cell Landscape & Clustering results visualization

Regardless of the resolution settings, Louvain clustering successfully separates the cells into three major populations, consistent with the cell landscape visualization.

Louvain clustering with a resolution of 0.3 identifies 15 clusters:

Clusters 1, 3, and 15 correspond to population I.
Clusters 2, 5, 6, 7, 8, 9, 10, and 13 correspond to population II.
Clusters 4, 11, 12, and 14 correspond to population III.

Similarly, Louvain clustering with a resolution of 0.1 identifies 8 clusters:

Cluster 1 corresponds to population I.
Clusters 2, 4, 5, and 6 correspond to population II.
Clusters 3, 7, and 8 correspond to population III.

In summary, Louvain analysis at different resolutions consistently separates the cell landscape into three major cell populations. For subsequent analyses, we use the default resolution of 0.3 (optimal for granularity), though all other resolutions yield similar annotation results.

Cluster Markers Identification

Population boundaries from resolution 0.3:

Population I: Clusters 1, 3, 15
Population II: Clusters 2, 5, 6, 7, 8, 9, 10, 13
Population III: Clusters 4, 11, 12, 14

Visualize the clustering result from resolution 0.3

Create a new Gene Set Collection

Before performing differential expression analysis, create a new gene set collection to store filtered DE results. Follow these steps to initialize an empty collection:

Click the Genes Collection tab in the Bottom Drawer to open the gene set collection panel.
Click the New Collection button.
Select the Manual input option.
Enter Cluster Markers in the Name field to specify the collection identifier.
Click the Save button to create the new gene set collection.

New Form for The New Gene Set Collection

Once the gene set collection is created, you can manage it in the Gene Set Collections table under the Existing Collections tab. Follow these steps to view the table:

Click the Genes Collection tab in the Bottom Drawer
Switch to the Existing Collections tab.

For more details about managing the existing collections, refer to Gene Set Collection.

Perform Differential Expression analysis

We identify marker genes by comparing each population against all others, while also evaluating whether populations should be divided into smaller subgroups. Three comparisons are performed:

Population I: Clusters 1, 3, 15 vs. others
Population II: Clusters 2, 5, 6, 7, 8, 9, 10, 13 vs. others
Population III: Clusters 4, 11, 12, 14 vs. others
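Each comparison places one population in Group 1 and every remaining cluster in Group 2. The sketch below checks that the three populations cover clusters 1 to 15 exactly once and derives the "others" group for each comparison. The names `POPULATIONS`, `check_partition`, and `comparison_groups` are hypothetical helpers for illustration.

```python
ALL_CLUSTERS = set(range(1, 16))  # Louvain 0.3 identified clusters 1..15

POPULATIONS = {
    "I": {1, 3, 15},
    "II": {2, 5, 6, 7, 8, 9, 10, 13},
    "III": {4, 11, 12, 14},
}


def check_partition(populations, all_clusters):
    """Raise ValueError unless every cluster belongs to exactly one population."""
    seen = set()
    for name, clusters in populations.items():
        overlap = seen & clusters
        if overlap:
            raise ValueError(f"Population {name} reuses clusters {sorted(overlap)}")
        seen |= clusters
    if seen != all_clusters:
        raise ValueError(f"Unassigned clusters: {sorted(all_clusters - seen)}")


def comparison_groups(populations, all_clusters):
    """Map each population to (Group 1 clusters, Group 2 clusters)."""
    return {
        name: (sorted(clusters), sorted(all_clusters - clusters))
        for name, clusters in populations.items()
    }


check_partition(POPULATIONS, ALL_CLUSTERS)
groups = comparison_groups(POPULATIONS, ALL_CLUSTERS)
for name, (group1, group2) in groups.items():
    print(f"Population {name}: {group1} vs {group2}")
```

The derived Group 2 lists are the cluster values to enter in the Group 2 Cell Filters of each analysis form.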

Follow these steps to navigate to the new form for the differential expression analysis:

Click the Differential Expression tab on the Bottom Drawer to navigate to the analysis panel.
Click the New Differential Expression button to open the analysis creation form.
Select the Custom option.

Navigate to a New Form of Differential Expression Analysis

Clusters 1, 3, 15 vs others (Population I)

In the new form for the analysis, expand the following sections to configure the analysis:

Group 1 Cell Filters and Group 2 Cell Filters: Specify the cell filters for each group.
Method Configurations: Choose the method for the analysis and define its parameters.

We will create the first analysis for Clusters 1, 3, and 15 versus others. The settings are as follows:

Name: Clusters 1, 3, 15 vs others - The analysis identifier.
Group 1 Cell Filters:
- Sample: Bone Marrow - Includes only cells from the selected sample for this group.
- Clustering Filters: Add filters to include cells based on clustering values. Click the Add Clustering Filter button to add a new filtering condition.
  - Clustering result: Louvain 0.3 - Indicates which clustering result is used for the filter.
  - Clustering values: 1, 3, and 15 - Specifies the clusters to keep in the selected clustering result.
Group 2 Cell Filters:
- Sample: Bone Marrow - Includes only cells from the selected sample for this group.
- Clustering Filters: Add filters to include cells based on clustering values. Click the Add Clustering Filter button to add a new filtering condition.
  - Clustering result: Louvain 0.3 - Indicates which clustering result is used for the filter.
  - Clustering values: 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14 - Specifies the clusters to keep in the selected clustering result.
Method Configurations: Select the method and parameters for the differential expression analysis.
- Method: Wilcoxon - Indicates the method to be used for the analysis.
- Max Cells: 100000 - Specifies the maximum number of cells to use for the analysis.
- Min Percent: 0 - Indicates that only genes expressed in at least this percentage of cells will be included in the analysis. In this case, we include all genes.
- Log Fold Change: 0.0 - Indicates that only genes with a log fold change greater than this value will be included in the results. In this case, we include all genes.

DE Analysis Creation Form for Population I
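The Wilcoxon method selected above is a rank-based two-group test. As a minimal sketch of the idea, the code below computes the rank-sum (Mann-Whitney U) statistic for a single gene, with tied values sharing the mean of their ranks. It is not CytoAnalyst's implementation, it omits the p-value and multiple-testing adjustment, and the expression values are hypothetical.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def rank_sum_u(group1, group2):
    """U statistic for group1: its rank sum minus the smallest possible rank sum."""
    ranks = average_ranks(list(group1) + list(group2))
    n1 = len(group1)
    rank_sum_1 = sum(ranks[:n1])
    return rank_sum_1 - n1 * (n1 + 1) / 2


# Hypothetical expression of one gene in Group 1 and Group 2 cells.
group1_expression = [5.0, 6.0, 7.0]
group2_expression = [1.0, 2.0, 3.0]
u = rank_sum_u(group1_expression, group2_expression)
print(u, len(group1_expression) * len(group2_expression))
```

U ranges from 0 to n1 × n2. A value at the upper end, as in this example, means every Group 1 cell expresses the gene more highly than every Group 2 cell.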

After configuring the cell filters for each group and setting up the method parameters, you can preview the analysis before execution. The preview table displays the following details:

Name: The analysis name.
Group 1:
- Number of cells: The number of cells in Group 1.
- Samples: The selected sample for Group 1.
- Clustering result: The clustering result used for Group 1 (e.g., Louvain 0.3).
- Clustering values: The specific clusters included in Group 1. (e.g., 1, 3, and 15).
Group 2:
- Number of cells: The number of cells in Group 2.
- Samples: The selected sample for Group 2.
- Clustering result: The clustering result used for Group 2 (e.g., Louvain 0.3).
- Clustering values: The specific clusters included in Group 2. (e.g., 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14).
Total Cells: The combined number of cells from both groups included in the analysis.

Once reviewed, click the Submit button to run the analysis.

Clusters 2, 5, 6, 7, 8, 9, 10, 13 vs others (Population II)

We will now create the second analysis comparing clusters 2, 5, 6, 7, 8, 9, 10, and 13 against all other clusters using the same steps. Configure the analysis with the following settings:

Name: Clusters 2, 5, 6, 7, 8, 9, 10, 13 vs others.
Group 1 Cell Filters:
- Sample: Bone Marrow
- Clustering Filters:
  - Clustering result: Louvain 0.3
  - Clustering values: 2, 5, 6, 7, 8, 9, 10, 13
Group 2 Cell Filters:
- Sample: Bone Marrow
- Clustering Filters:
  - Clustering result: Louvain 0.3
  - Clustering values: 1, 3, 4, 11, 12, 14, 15
Method Configurations:
- Method: Wilcoxon
- Max Cells: 100000
- Min Percent: 0
- Log Fold Change: 0.0

DE Analysis Creation Form for Population II

Finally, click the Submit button to run the analysis.

Clusters 4, 11, 12, 14 vs. others (Population III)

We will now create the third analysis comparing clusters 4, 11, 12, and 14 against all other clusters using the same steps. Configure the analysis with the following settings:

Name: Clusters 4, 11, 12, 14 vs others
Group 1 Cell Filters:
- Sample: Bone Marrow
- Clustering Filters:
  - Clustering result: Louvain 0.3
  - Clustering values: 4, 11, 12, 14
Group 2 Cell Filters:
- Sample: Bone Marrow
- Clustering Filters:
  - Clustering result: Louvain 0.3
  - Clustering values: 1, 2, 3, 5, 6, 7, 8, 9, 10, 13, 15
Method Configurations:
- Method: Wilcoxon
- Max Cells: 100000
- Min Percent: 0
- Log Fold Change: 0.0

A New Form of Differential Expression for The Third Group

Finally, click the Submit button to create the analysis.

Manage The Differential Expression Results

When the differential expression (DE) analysis completes, the results will appear in the Differential Expression Table under the Existing Results tab.

The results table displays the following information for each analysis:

Name: Analysis identifier.
Method: Statistical method used (e.g., Wilcoxon).
Max Cells: Maximum cell count parameter.
Min Pct: Minimum expression percentage threshold.
Min Log2FC: Minimum log2 fold change threshold.
Group 1: Summary of Group 1.
Group 2: Summary of Group 2.
Action:
- View: Open detailed results for the analysis.
- Delete: Remove the analysis from the table.

Additionally, CytoAnalyst supports viewing the results of multiple DE analyses simultaneously:

Select one or more analyses using the checkboxes on the left side of the table.
Click the View Selected button at the top of the table to view the selected analyses results in the same window.

The Existing Results as A Table of The Differential Expression Analyses

For comprehensive guidance on managing DE results, see the Differential Expression Analysis page.

Identify Marker Genes

To extract marker genes, there are two primary approaches for filtering genes and adding them to a gene set collection:

Extract DE genes using individual DE results.
Alternative approach (recommended): Extract DE genes using multiple DE results simultaneously.

Extract DE genes using individual DE results

In the Action column of the Differential Expression Table, click the View button to open the DE results page for the first analysis (Clusters 1, 3, 15 vs others).

The differential expression results page consists of four main components:

Results table - Displays the differential expression results. Use the Show columns section to customize which columns are visible.
Add selected genes to Gene Set Collection - This panel allows you to add selected genes to a gene set collection with two options:
- Add to existing set: Add the selected genes to an existing set in the chosen collection.
- Create new set: Create a new set within the selected collection.
Analysis parameters - Displays the parameters used for the differential expression analysis.
Volcano plot - Shows the log fold change (x-axis) versus -log10(p-value) (y-axis). Significantly differentially expressed genes are highlighted.

The Differential Expression Results of The First Analysis

To identify marker genes for the first analysis (Clusters 1, 3, 15 vs others), filter the Results table using the following criteria:

Log fold change > 3
Adjusted p-value < 0.05
Average expression > 1
Pct1 - Pct2 > 0.5

Follow these steps to filter the table results:

In the Show columns section, enable the following columns for filtering:
- Adjusted p-value: The adjusted p-value of differential expression.
- Avg Log2 Fc: The average log2 fold change.
- Avg Expression in Group 1: The gene's average expression in Group 1.
- Pct1 - Pct2: The difference in expression percentage (Group 1 vs Group 2).
Apply these filters in the Results table:
- Adjusted P Value: Min: [empty] | Max: 0.05.
- Avg Log2 FC: Min: 3 | Max: [empty].
- Avg Expression in Group 1: Min: 1 | Max: [empty].
- Pct1 - Pct2: Min: 0.5 | Max: [empty].
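The four Min/Max filters amount to a conjunction of range checks on each row of the results table. The sketch below applies them to a few rows. The column keys, the helper names, and the gene rows are hypothetical, and treating Min and Max as inclusive bounds is an assumption about how the table filters behave.

```python
# (min, max) bounds per column; None means the bound is left empty.
CRITERIA = {
    "adjusted_p_value": (None, 0.05),
    "avg_log2_fc": (3, None),
    "avg_expression_group1": (1, None),
    "pct_diff": (0.5, None),  # Pct1 - Pct2
}


def within(value, bounds):
    """Return True if value lies inside the (min, max) bounds, inclusive."""
    low, high = bounds
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


def filter_markers(rows, criteria):
    """Return the features whose row satisfies every criterion."""
    return [
        row["feature"]
        for row in rows
        if all(within(row[column], bounds) for column, bounds in criteria.items())
    ]


# Hypothetical DE rows, for illustration only.
rows = [
    {"feature": "GENE_A", "adjusted_p_value": 1e-20, "avg_log2_fc": 4.2,
     "avg_expression_group1": 2.5, "pct_diff": 0.8},
    {"feature": "GENE_B", "adjusted_p_value": 1e-5, "avg_log2_fc": 1.1,
     "avg_expression_group1": 3.0, "pct_diff": 0.7},
    {"feature": "GENE_C", "adjusted_p_value": 0.2, "avg_log2_fc": 5.0,
     "avg_expression_group1": 1.5, "pct_diff": 0.6},
]
markers = filter_markers(rows, CRITERIA)
print(markers)
```

Here GENE_B fails the fold-change bound and GENE_C fails the adjusted p-value bound, so only GENE_A would be selected.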

Next, we will add the filtered genes to the Cluster Markers gene set collection, following these steps:

In the Results table, check the box next to the Feature column to select all filtered genes.
Click the Add selected genes to gene set button to expand the form.
Click the Create new set button to switch to the creation form.
Select Cluster Markers in the Collection field (where the new set will be added).
Enter Cluster 1 markers in the Name field as the gene set identifier.
Finally, click the Add button to create the set and include the genes.

Note: If the Cluster Markers collection does not exist, follow the instructions in Create a new Gene Set Collection before proceeding.

Filtering The DE Analysis Results Table of The First Analysis

Apply the same criteria and steps to create gene sets for other analyses:

A: Identify marker genes for the second analysis (Clusters 2, 5, 6, 7, 8, 9, 10, 13 vs others).
B: Identify marker genes for the third analysis (Clusters 4, 11, 12, 14 vs others).

Filtering The DE Analysis Results Table of The Second and The Third Analyses

Extract DE genes using multiple DE results

In the Differential Expression Table, follow these steps:

Select all three analyses simultaneously by checking the box in the table header (next to the # symbol).
Click the Extract DE genes button to open the Extracting DE Genes panel in a popup window.

On the Extract DE Genes panel, apply the same filtering criteria to all three analyses. Follow these steps to extract DE genes from the three analyses:

In the filtering section, set the following criteria:
- P Value Adjusted: Min: [empty] - Max: 0.05.
- Log2 FC: Min: 3 - Max: [empty].
- Expression Group 1: Min: 1 - Max: [empty].
- Difference in Percentage (pct1 - pct2): Min: 0.5 - Max: [empty].
Remove duplicates: Whether to eliminate duplicate markers from each gene set. In this case, disable this option to retain as many markers as possible for cell type identification.
Click the Add to existing collection button, then configure:
- Collection: Select Cluster Markers as the target collection
- Gene Set Name: Use Marker genes for {comparison} as the naming pattern. Note: {comparison} will be automatically replaced with the analysis names.
Click the Add to collection button to finalize and store the filtered gene sets.
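The naming pattern and the duplicate option can be sketched as follows. This is a hypothetical illustration: the function `build_gene_sets` and the gene names are invented, and interpreting "duplicates" as genes that pass the filters in more than one comparison is an assumption, not documented CytoAnalyst behavior.

```python
from collections import Counter


def build_gene_sets(filtered_by_analysis,
                    pattern="Marker genes for {comparison}",
                    remove_duplicates=False):
    """Name one gene set per analysis, optionally dropping shared genes."""
    # Count in how many analyses each gene passed the filters.
    occurrences = Counter(
        gene for genes in filtered_by_analysis.values() for gene in set(genes)
    )
    gene_sets = {}
    for analysis_name, genes in filtered_by_analysis.items():
        if remove_duplicates:
            genes = [gene for gene in genes if occurrences[gene] == 1]
        set_name = pattern.replace("{comparison}", analysis_name)
        gene_sets[set_name] = list(genes)
    return gene_sets


# Hypothetical filtered genes per analysis.
filtered = {
    "Clusters 1, 3, 15 vs others": ["GENE_A", "GENE_B"],
    "Clusters 4, 11, 12, 14 vs others": ["GENE_B", "GENE_C"],
}
kept_all = build_gene_sets(filtered)
deduplicated = build_gene_sets(filtered, remove_duplicates=True)
print(kept_all)
print(deduplicated)
```

Under this reading, leaving the option disabled keeps GENE_B in both sets, which matches the goal of retaining as many markers as possible.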

Visualize expression patterns

In this section, we will visualize the expression patterns of the marker genes in each group to verify that the selected genes are consistently and uniquely expressed in the target populations.

Follow these steps to visualize the expression patterns of the marker genes in each group:

Click the Features button to switch to the Features tab in the Left Sidebar.
Ensure the settings in the Top Toolbar are configured as follows:
- Visualization embedding: pca - Specifies the embedding used for visualization.
- Normalization method: LogNorm - Normalization applied to the data.
- Plot Type: Scatter - Type of chart displayed.
- Plot blending mode: Separate - Blending mode for visualization.
Under Gene set collections in the left sidebar, locate the collection named Cluster Markers.
Click the button next to the following labels:
- Markers for Clusters 1, 3, 15
- Markers for Clusters 2, 5, 6, 7, 8, 9, 10, 13
- Markers for Clusters 4, 11, 12, 14

By visualizing the expression patterns of the markers in each group, we observe that:

The marker genes for population I (Clusters 1, 3, 15) exhibit high expression within this population and have negligible expression in other populations. Therefore, we are confident that population I represents a single cell type.
Likewise, marker genes associated with population II (Clusters 2, 5, 6, 7, 8, 9, 10, 13) are predominantly expressed within this second population, showing minimal expression outside of it. Thus, it is highly likely that population II also consists of a single cell type.

Given that the marker genes for population III (Clusters 4, 11, 12, 14) are highly expressed primarily in cluster 4 and show minimal to no expression in clusters 11, 12, and 14, cluster 4 likely represents a unique cell type. This clear distinction highlights the importance of conducting differential expression (DE) analyses to further elucidate the specific characteristics of this population.

Exploring substructure within population III

To further understand the gene expression within population III, we will perform differential expression analyses in each of its clusters.

Perform Differential Expression Analyses for Clusters 4, 11, 12, and 14

CytoAnalyst allows you to efficiently run differential expression analyses for multiple clusters simultaneously. Follow these steps to achieve this:

Click the Differential Expression tab in the Bottom Drawer.
Click the New Differential Expression button to access the creation form.
Select the By Cluster option and configure the settings as follows:
- Name: Cluster {cluster} vs. others - This placeholder will be replaced with the cluster number. (e.g., Cluster 4 vs. others).
- Comparison mode: With others - Specifies that each selected cluster will be compared against all other clusters.
- Select Clustering Result: Louvain 0.3 - Indicates which clustering result to use for the analysis.
- Select Clusters: 4, 11, 12, and 14 - Specifies the clusters to include in the analysis. This setup will launch four separate analyses at once, one for each selected cluster.
- Ensure that you have selected the Bone Marrow sample in the Sample section on each group.
- Method: Wilcoxon - Indicates the statistical method to be used for the analysis.
- Max Cells: 100000 - Specifies the maximum number of cells to use for the analysis.
- Min Percent: 0 - Indicates that only genes expressed in at least this percentage of cells will be included in the analysis. In this case, we include all genes.
- Log Fold Change: 0.0 - Indicates that only genes with a log fold change greater than this value will be included in the results. In this case, we include all genes.
- Click the Submit button to create the analyses.
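The By Cluster option can be thought of as expanding one form into one analysis per selected cluster. The sketch below shows that expansion under the "With others" comparison mode. The function `expand_by_cluster` and the dictionary keys are hypothetical, not a CytoAnalyst API.

```python
def expand_by_cluster(name_pattern, selected_clusters, all_clusters):
    """Create one 'cluster vs. all other clusters' analysis per selected cluster."""
    analyses = []
    for cluster in selected_clusters:
        analyses.append({
            # The {cluster} placeholder is replaced with the cluster number.
            "name": name_pattern.replace("{cluster}", str(cluster)),
            "group1": [cluster],
            "group2": sorted(set(all_clusters) - {cluster}),
        })
    return analyses


analyses = expand_by_cluster(
    "Cluster {cluster} vs. others",
    selected_clusters=[4, 11, 12, 14],
    all_clusters=range(1, 16),  # Louvain 0.3 identified clusters 1..15
)
for analysis in analyses:
    print(analysis["name"], len(analysis["group2"]))
```

Note that each cluster is compared against all 14 remaining clusters, including the other members of population III, which differs from the earlier population-level comparisons.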

Identify Marker Genes for Clusters 4, 11, 12, and 14

We will identify marker genes for clusters 4, 11, 12, and 14 by filtering their respective differential expression (DE) results and adding the filtered genes to the Cluster Markers collection. Follow these steps:

Navigate to the Differential Expression Table by clicking the Existing Results tab in the Bottom Drawer.
Select the DE results for Cluster 4, Cluster 11, Cluster 12, and Cluster 14 by checking the corresponding boxes.
Open the Extract DE Genes panel by clicking the Extract DE genes button.

Opening Extracting DE Genes Panel for Additional DE Results

On the Extract DE Genes panel, apply the same criteria used in the previous analyses to filter genes and add them to the Cluster Markers collection.

Note: When you apply the same criteria to the DE result of Cluster 11 vs others, expect an error message; this result is excluded from the filtering process. Disregard the message and proceed with the filtering; the reason is explained after this step.

Extract DE Genes for Clusters 4, 12 & 14

To capture subtle yet biologically meaningful marker genes in Cluster 11 vs others, which were previously missed by our strict original thresholds (Log2 FC ≥ 3), we relax the Log2 FC cutoff to ≥ 2 while maintaining statistical significance (adjusted p-value ≤ 0.05).

We also remove the Difference in Percentage (pct1 - pct2) filter. This allows us to identify rare marker genes (for example, those expressed in only 10% of Cluster 11 cells, but absent elsewhere) that can still be critical for defining the unique identity of this cluster.

Based on these reasons, we will apply the following filtering criteria for the Cluster 11 vs others analysis:

Adjusted p-value: Min: [empty] - Max: 0.05
Log2 FC: Min: 2 - Max: [empty]
Average expression in Group 1: Min: 1 - Max: [empty]
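To make the effect of the relaxed thresholds concrete, the sketch below evaluates one gene against both the original and the relaxed criteria. The gene values, the key names, and the function `passes` are hypothetical; bounds are treated as inclusive, following the ≥ and ≤ notation used above.

```python
STRICT = {"adjusted_p_max": 0.05, "log2_fc_min": 3,
          "avg_expression_min": 1, "pct_diff_min": 0.5}
# Relaxed criteria for Cluster 11: lower Log2 FC cutoff, no pct1 - pct2 filter.
RELAXED = {"adjusted_p_max": 0.05, "log2_fc_min": 2,
           "avg_expression_min": 1, "pct_diff_min": None}


def passes(gene, criteria):
    """Return True if the gene satisfies every active threshold."""
    if gene["adjusted_p_value"] > criteria["adjusted_p_max"]:
        return False
    if gene["log2_fc"] < criteria["log2_fc_min"]:
        return False
    if gene["avg_expression_group1"] < criteria["avg_expression_min"]:
        return False
    pct_diff_min = criteria["pct_diff_min"]
    if pct_diff_min is not None and gene["pct1"] - gene["pct2"] < pct_diff_min:
        return False
    return True


# Hypothetical rare marker: expressed in 10% of Cluster 11 cells, absent elsewhere.
rare_marker = {"adjusted_p_value": 0.001, "log2_fc": 2.4,
               "avg_expression_group1": 1.3, "pct1": 0.10, "pct2": 0.0}
print(passes(rare_marker, STRICT), passes(rare_marker, RELAXED))
```

The gene is rejected by the original criteria on both fold change and percentage difference, but it is retained once the cutoff is lowered and the percentage filter is removed.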

Follow these steps depicted in the image below:

A: Open the filtering panel.
B: Apply the filtering criteria.

Visualize Marker Genes Expression Patterns for Clusters 4, 11, 12, and 14

Here, we will visualize the expression patterns of the marker genes in Cluster 4, Cluster 11, Cluster 12, and Cluster 14 to confirm that the selected genes are consistently and uniquely expressed in the target populations.

Follow these steps to visualize the expression patterns:

Click the Features button to switch to the Features tab in the Left Sidebar.
Ensure the settings in the Top Toolbar are configured as follows:
- Visualization embedding: pca - Specifies the embedding used for visualization.
- Normalization method: LogNorm - Specifies the normalization applied to the data.
- Plot Type: Scatter - Specifies the type of chart displayed.
- Plot blending mode: Separate - Specifies the blending mode used for visualization.
Under Gene set collections in the left sidebar, locate the created collection named Cluster Markers.
Click the button next to the following labels:
- Marker genes for Cluster 4 vs others
- Marker genes for Cluster 11 vs others
- Marker genes for Cluster 12 vs others
- Marker genes for Cluster 14 vs others

Visualize Cluster Markers for Clusters 4, 11, 12 & 14

To replicate the visualization above, configure the settings as follows:

Number of rows: 2. Determines the number of rows in the grid layout.
Sync zoom: Enable. Enabling this option will synchronize the zoom level across all scatter plots in the grid.
Show plot title: Enable. This option displays the title of each plot in the grid. You can position the title to the left, center, or right.
In the visualization settings table: Update the plot titles as shown in the image below.

Settings for Visualization Marker Gene Expression in Clusters 4, 11, 12, & 14

By observing the expression patterns of the markers in each cluster, we can confirm that:

Cluster 12: The marker genes in this cluster are highly expressed within this cluster and show negligible expression in other populations. Therefore, we are confident that Cluster 12 represents a unique cell type.
Cluster 14: Similarly, its marker genes are predominantly expressed within this cluster and show minimal expression elsewhere. Thus, we are confident that Cluster 14 also represents a unique cell type.
Clusters 4 and 11: The expression patterns of the marker genes in these two clusters are very similar, indicating that they likely represent the same cell type. Therefore, we will merge clusters 4 and 11 by performing differential expression analysis for Clusters 4 and 11 versus others.

Grouping Clusters 4 and 11

To merge Clusters 4 and 11 into a single cell type, we will perform a differential expression analysis comparing Clusters 4 and 11 against all other clusters.

Click the Differential Expression tab in the Bottom Drawer to navigate to the Differential Expression Table.
Click the New Differential Expression button to open the analysis creation form. Select the Custom option and follow the procedure outlined in the image below:
- A. Configure analysis settings: Set parameters for the DE comparison.
- B. View results: Open the DE result page after running the analysis.
- C. Save marker genes: Filter the result table and add significantly expressed genes to the Cluster Markers collection.

Cell Type Annotation

Cell Type Inference

Through visualization, clustering, and differential expression (DE) analysis, we have identified five distinct cell populations in the dataset.

Population I: Clusters 1, 3, 15
Population II: Clusters 2, 5, 6, 7, 8, 9, 10, 13
Population III: Clusters 4, 11
Population IV: Cluster 12
Population V: Cluster 14

Infer Cell Types in Populations I, II, III, and V

To infer the cell types of the identified populations, we will use CytoAnalyst's built-in cell type inference tool, which searches for candidate cell types in each population based on the marker genes identified in the previous steps:

Click the Genes Collection tab in the Bottom Drawer.
Click the Existing Collections button.
Click the icon next to the Cluster Markers collection to view its contents.

The Collections Table consists of three main components:

Collection: Allows you to update the collection's information.
Gene Sets: Enables you to add a new gene set to the collection.
Gene Set Table: Enables you to view and manage gene sets within the collection.

In the Gene Set Table, follow this step to use the inference tool:

In the Actions column for each gene set, click the Infer Cell Types button to initiate cell type inference.

Once the inference process completes, a popup titled Inferred Cell Types will appear, where you can preview the inferred cell types. The cell type assignment strategy works as follows:

Select the label that appears most frequently in the top predicted cell types.
If multiple labels have the same frequency, prioritize the cell type with the earliest position (more fine-grained) in the cell ontology hierarchy.
Finally, click the Append to gene set description button to save the inferred cell type information to the gene set's description.
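The assignment strategy above can be sketched in a few lines of Python. This is only an illustration of the two rules (most frequent label first, then a hierarchy-based tie-break), not CytoAnalyst's actual implementation. The function name `pick_label`, the `hierarchy_rank` mapping, and the placeholder labels are all assumptions introduced for this example; a smaller rank value stands in for the label that the tie-break rule prefers.

```python
from collections import Counter


def pick_label(predictions, hierarchy_rank):
    """Return the most frequent label among the top predictions.

    Ties are broken by the smallest value in `hierarchy_rank`, a
    hypothetical mapping from label to its position in the cell ontology
    hierarchy. Labels missing from the mapping sort last.
    """
    counts = Counter(predictions)
    best_count = max(counts.values())
    tied = [label for label, n in counts.items() if n == best_count]
    return min(tied, key=lambda label: hierarchy_rank.get(label, float("inf")))


# Placeholder labels only; these are not real inference results.
clear_winner = pick_label(["Type A", "Type B", "Type A"], {})
tie_broken = pick_label(["Type A", "Type B"], {"Type A": 2, "Type B": 1})
print(clear_winner, tie_broken)
```

In the first call one label wins on frequency alone; in the second call both labels appear once, so the assumed rank decides.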

After appending the inferred cell types to the gene set collection, you can find the inference details in the description of each marker gene set within the Gene Set Table.

Inferred Cell Type Results for Populations I, II, III, V

Inferred Cell Types for Population IV

For Population IV (Cluster 12), the built-in tool struggles to infer a cell type because the marker genes of this cluster are not well characterized. We therefore refine the marker set by filtering the Cluster 12 marker genes with additional criteria that retain only strongly and widely expressed genes.

Follow these steps as depicted in the image below:

A: Navigate to the Cluster 12 vs others analysis result page.
B: Filter the Result Table using the following criteria, then add the filtered genes to the Cluster Markers collection:
- Adjusted P Value: Min: [empty] | Max: 0.05
- Avg Log2 FC: Min: 3 | Max: [empty]
- Avg Expression in Group 1: Min: 1 | Max: [empty]
- Pct1 - Pct2: Min: 0.4 | Max: [empty]
C: Perform cell type inference for the filtered genes.
D: Update the gene set description with the inferred cell types.
E: View the inferred cell types in the Gene Set Table.
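For readers who export the result table, the filter in step B can be expressed as a simple predicate. The sketch below is a minimal illustration, assuming hypothetical field names (`adj_p_value`, `avg_log2_fc`, `avg_expr_group1`, `pct1`, `pct2`) and assuming the Min/Max bounds are inclusive; neither is confirmed by the CytoAnalyst interface. The rows are toy values with placeholder gene names, not results from this dataset.

```python
def passes_filters(gene):
    """Apply the four thresholds from step B to one result-table row."""
    return (
        gene["adj_p_value"] <= 0.05          # Adjusted P Value: Max 0.05
        and gene["avg_log2_fc"] >= 3         # Avg Log2 FC: Min 3
        and gene["avg_expr_group1"] >= 1     # Avg Expression in Group 1: Min 1
        and gene["pct1"] - gene["pct2"] >= 0.4  # Pct1 - Pct2: Min 0.4
    )


# Toy rows with placeholder gene names (assumed data for illustration).
rows = [
    {"gene": "GENE_A", "adj_p_value": 0.001, "avg_log2_fc": 4.2,
     "avg_expr_group1": 2.5, "pct1": 0.9, "pct2": 0.1},
    {"gene": "GENE_B", "adj_p_value": 0.2, "avg_log2_fc": 5.0,
     "avg_expr_group1": 3.0, "pct1": 0.9, "pct2": 0.1},
    {"gene": "GENE_C", "adj_p_value": 0.01, "avg_log2_fc": 3.5,
     "avg_expr_group1": 1.2, "pct1": 0.5, "pct2": 0.3},
]

selected = [row["gene"] for row in rows if passes_filters(row)]
print(selected)
```

Here GENE_B fails the adjusted p-value bound and GENE_C fails the Pct1 - Pct2 bound, so only GENE_A would be added to the collection.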

Summarize Cell Type Inference

Based on the cell type assignment strategy, we will assign the inferred cell types to the corresponding clusters as follows:

Population I: Clusters 1, 3, 15 → Mesenchymal cells (assigned because Mesenchymal cells is the most frequent prediction).
Population II: Clusters 2, 5, 6, 7, 8, 9, 10, 13 → Hematopoietic cells (top three predictions consistently align with this type).
Population III: Clusters 4, 11 → Endothelial cells (appears in 4/5 predictions).
Population IV: Cluster 12 → Mesenchymal cells (appears in 2/3 predictions).
Population V: Cluster 14 → Mesodermal cells (appears in 3/4 predictions).

At this point, we observe that Populations I and IV share the same inferred cell type (Mesenchymal cells). Therefore, to simplify annotation, we will merge Clusters 1, 3, 15, and 12 into a single cell type labeled Mesenchymal cells.

The final cell type assignments are as follows:

Clusters 1, 3, 15, 12 → Mesenchymal cells.
Clusters 2, 5, 6, 7, 8, 9, 10, 13 → Hematopoietic cells.
Clusters 4, 11 → Endothelial cells.
Cluster 14 → Mesodermal cells.
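The merge of populations that share a label can be written down as a small grouping step, which also makes it easy to check that every cluster ends up with exactly one label. The sketch below uses only the population, cluster, and label values stated above; the variable names are illustrative.

```python
# Populations and inferred labels as listed in the summary above.
populations = {
    "I": {"clusters": [1, 3, 15], "label": "Mesenchymal cells"},
    "II": {"clusters": [2, 5, 6, 7, 8, 9, 10, 13], "label": "Hematopoietic cells"},
    "III": {"clusters": [4, 11], "label": "Endothelial cells"},
    "IV": {"clusters": [12], "label": "Mesenchymal cells"},
    "V": {"clusters": [14], "label": "Mesodermal cells"},
}

# Group clusters by label so populations with the same label are merged.
final_assignment = {}
for population in populations.values():
    final_assignment.setdefault(population["label"], []).extend(population["clusters"])

# Sanity check: Clusters 1 through 15 are each assigned exactly once.
all_clusters = sorted(c for clusters in final_assignment.values() for c in clusters)
assert all_clusters == list(range(1, 16))

for label, clusters in final_assignment.items():
    print(label, clusters)
```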

Assign Inferred Cell Types to Each Cell Population

In this section, we will assign the inferred cell types to each cell population. Follow these steps to create an annotation and assign the inferred cell types to the corresponding clusters:

Click the Cell Annotation tab in the Bottom Drawer.
Click the New Annotation button to open the Cell Annotation New Form.
In the form, configure the following settings:
- Name: New Annotation. Specifies the annotation identifier.
- Default value: unassigned. Assigns this label to all cells initially or where data is missing.
- Copy default value from existing categories: disabled. Enable this option if you want to copy the default value from other metadata, clusters, or annotations. For this case study, we will not use this option.
Finally, click the Create button to create the new annotation.

Assign Inferred Cell Type to Clusters 1, 3, 15, and 12

Once the new annotation has been created, we will assign the label Mesenchymal cell to the corresponding clusters.

Click the Edit Annotation button in the Bottom Drawer.
Enable the Show clustering option to display cluster selection options.
Select the Louvain 0.3 clustering to populate the table with cluster labels.
In the Cluster column, select clusters 1, 3, 15, and 12 to retain only cells belonging to these clusters.
Click the checkbox next to the Sample column header to select all visible cells.
From the Select annotation dropdown, choose New Annotation.
In the New value field, type Mesenchymal cell.
Finally, click the Assign button to annotate the selected cells.

Cell Type Annotation for Clusters 1, 3, 15, & 12

Assign Inferred Cell Types to the Remaining Clusters

Repeat these steps to annotate the remaining cell populations:

Clusters 2, 5, 6, 7, 8, 9, 10, 13 → Hematopoietic cell
Clusters 4, 11 → Endothelial cell
Cluster 14 → Mesodermal cell
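Conceptually, these assignment steps amount to a lookup from each cell's cluster to a label, with the annotation's default value used for any cell that was never selected. The sketch below illustrates that idea outside the interface; the `annotate` function and the toy list of per-cell cluster ids are assumptions for the example, and cluster 99 is a made-up id included only to show the default.

```python
# Cluster-to-label mapping, using the label values typed in this tutorial.
CLUSTER_TO_CELL_TYPE = {}
for clusters, label in [
    ([1, 3, 15, 12], "Mesenchymal cell"),
    ([2, 5, 6, 7, 8, 9, 10, 13], "Hematopoietic cell"),
    ([4, 11], "Endothelial cell"),
    ([14], "Mesodermal cell"),
]:
    for cluster in clusters:
        CLUSTER_TO_CELL_TYPE[cluster] = label


def annotate(cell_clusters, mapping, default="unassigned"):
    """Return one label per cell; cells in unmapped clusters keep the default."""
    return [mapping.get(cluster, default) for cluster in cell_clusters]


# Toy per-cell cluster ids (assumed); 99 does not exist in the dataset.
labels = annotate([1, 12, 4, 14, 7, 99], CLUSTER_TO_CELL_TYPE)
print(labels)
```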

Visualize Cell Type Annotation Results

In this section, we will visualize the final annotated cell types and compare them with the cell types annotated in the original study.

Follow these steps to achieve this:

Step 1: Navigate to the Observations tab by clicking the Observations button in the Left Sidebar.
Step 2: Configure the Top Toolbar settings as follows:
- Visualization embedding: pca (Determines the embedding used for plotting).
- Normalization method: LogNorm (Applies logarithmic normalization to the data).
- Plot Type: Scatter (Specifies the type of chart displayed).
- Plot blending mode: Separate (Specifies the blending mode used for visualization).
Step 3: Under Annotation category, click the button next to the created annotation named New Annotation to visualize the annotated cell types.
Step 4: Under Categorical Metadata, click the button next to cell_type to visualize the cell types annotated in the original study.

The image above illustrates the final cell type annotation result for both the inferred cell types and the original study's annotation.

We observe that CytoAnalyst's results are highly consistent with the original cell type assignments.

A key difference is that the original study classified Cluster 12 as Epithelial cells, whereas CytoAnalyst assigned it to Mesenchymal cells.

Endothelial cells are known to undergo Endothelial-to-Mesenchymal Transition (EndMT), a process in which they lose Endothelial markers and acquire Mesenchymal traits.

Therefore, we hypothesize that Cluster 12 represents cells undergoing EndMT, explaining the discrepancy in annotations. Based on external evidence from flow cytometry data, we further hypothesize that the original study's authors may have distinguished between these two cell types using additional experimental validation.
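The visual comparison between the two annotations can also be summarized numerically with a contingency table. The sketch below is one possible way to do this and is not a CytoAnalyst feature: it counts how often each pair of (new label, original label) co-occurs and reports the fraction of cells that fall into the dominant original label of their new group. The function names and the toy labels are assumptions; no values from the actual dataset are used.

```python
from collections import Counter


def contingency(new_labels, original_labels):
    """Count cells for every (new label, original label) pair."""
    return Counter(zip(new_labels, original_labels))


def majority_agreement(new_labels, original_labels):
    """Fraction of cells whose original label is the most common one
    within their newly annotated group (a purity-style score)."""
    best_per_group = {}
    for (new, _orig), n in contingency(new_labels, original_labels).items():
        best_per_group[new] = max(best_per_group.get(new, 0), n)
    return sum(best_per_group.values()) / len(new_labels)


# Toy labels for six cells (placeholders, not dataset values).
new = ["A", "A", "A", "B", "B", "C"]
orig = ["a", "a", "b", "b", "b", "c"]

table = contingency(new, orig)
score = majority_agreement(new, orig)
print(dict(table), score)
```

A score close to 1 indicates that each newly annotated group maps almost entirely onto a single original label; off-diagonal counts in the table point to clusters, such as Cluster 12 here, where the two annotations disagree.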

Last modified: 24 September 2025