Marker analysis¶

In this notebook we will identify genes that are highly expressed in cells that feature a high mapping score. Furthermore, we will use the same method of differential expression test to ascertain the marker genes for each cluster in the reference SNN graph.

[1]:

import nabo

We start by loading the mapped reference graph. Here we also visualize the mapping score for target ME:

[2]:

hvg_graph = nabo.Graph()
hvg_graph.load_from_gml('../analysis_data/hvg_graph.gml')
nabo.GraphPlot(hvg_graph, vc_attr='cluster', label_attr='cluster',
               vs=hvg_graph.get_mapping_score(target='ME'))

../../../_images/analysis_mll_enl_notebooks_3_markers_4_0.png

[2]:

GraphPlot of 1414 nodes

Now we use set_de_groups to identify marker genes for cells with a high mapping score. The parameter min_score sets the cutoff for mapping score and cells with higher score than this are marked as ‘Test cells’. Using the optional parameter from_clusters, ‘Test cell’ selection can be restricted to certain clusters. The interesting aspect of set_de_groups is how it selects the ‘Control cells’ to compare with ‘Test cells’. ‘Control cells’ are selected based on their path distance from ‘Test cells’ on the graph. Each of the control cells will be no more than node_dist nodes away from ‘Test cells’. Please refer to the documentation for more information on other parameters that regulate ‘Control cell’ selection. At the end we visualize which cells were marked as test cells or control cells, respectively.

[3]:

hvg_graph.set_de_groups('ME', min_score=1, node_dist=2, from_clusters=['3'])
nabo.GraphPlot(hvg_graph, vc_attr='de_group', label_attr='de_group',
               label_attr_type='legend', cmap='coolwarm')

Test nodes: 70, Control nodes: 112

../../../_images/analysis_mll_enl_notebooks_3_markers_6_1.png

[3]:

GraphPlot of 1414 nodes

Since the graph object does not store the gene expression values, we use the Dataset class to load the data. This way we do not need to filter cells and genes again, or set size factors, because they were saved into the HDF5 file in the Preprocessing notebook.

[4]:

wt_data = nabo.Dataset('../analysis_data/WT.h5')
me_data = nabo.Dataset('../analysis_data/MLL_ENL.h5')

INFO: Cached filtered cells loaded
INFO: Cached filtered genes loaded
INFO: Cached cell size factors loaded
INFO: Cached filtered cells loaded
INFO: Cached filtered genes loaded
INFO: Cached cell size factors loaded

Nabo keeps the Dataset class independent of the Graph class, and hence the cell names saved in the Dataset object are not the same as those in the graph (because we appended reference names to them). So we save this info (not saved in HDF5) in a Dataset object so that it dynamically appends reference names to cells.

[5]:

wt_data.update_exp_suffix(hvg_graph.refName)

Now we are ready to perform marker identification. Test cells are saved in deTestCells attribute and control cells in deCtrlCells attribute. We then call the run_de_test method to run the differential expression test (Mann-Whitney U test) between the two groups. The test will return a table containing names and other information for the genes with a significantly high expression in test cells. Please refer to the documentation to check out additional parameters that regulate the behaviour of this step.

The columns of this table are:

gene: Gene name
exp_frac: Fraction of test cells expressing (having non-zero value) this gene
test_group: Name of test group. Can be ignored as we had only test group here.
versus_group: Name of group against which the comparison was made. Can be ignored here as only one control group is defined here.
rbc: Rank-biserial correlation
log_fc2: Log2 fold change
pval: p value
qval: Benjamini-Hochberg corrected p value

[6]:

wt_mapped_marker_table = nabo.run_de_test(
    wt_data, None,
    hvg_graph.deTestCells, [hvg_graph.deCtrlCells])
print (len(wt_mapped_marker_table), " DE genes found" )
wt_mapped_marker_table.head()

100%| 00:00

5  DE genes found

[6]:

	gene	exp_frac	test_group	versus_group	rbc	log2_fc	pval	qval
0	GM10138	0.400000	Test group	Ctrl group 0	0.230204	1.338608	0.001611	0.009666
1	NCS1	0.257143	Test group	Ctrl group 0	0.162449	1.470816	0.006546	0.014850
2	KCNG1	0.257143	Test group	Ctrl group 0	0.157551	1.424919	0.008057	0.014850
3	ALG11	0.285714	Test group	Ctrl group 0	0.157143	1.085207	0.011832	0.014850
4	TDRD3	0.300000	Test group	Ctrl group 0	0.160000	1.080647	0.012375	0.014850

Expression of a gene can also be visualized on the graph using by setting vc, i.e. vertex color to the expression value of a gene (this value can be obtained using wt_Data.exp.<gene>). The exp attribute allows gene autocomplete functionality (this can be activated by writing at least one character after exp. and then pressing the TAB key). The graph below shows expression of FLT3, with darker hue representing higher expression.

[7]:

nabo.GraphPlot(hvg_graph, vc=wt_data.exp.CTR9, cmap='magma_r', vc_percent_trim=1)

../../../_images/analysis_mll_enl_notebooks_3_markers_14_0.png

[7]:

GraphPlot of 1414 nodes

Expression distribution can also be visualized as boxplots using the plot_box_exp function:

[8]:

nabo.plot_box_exp(wt_data, 'CTR9',
                  [hvg_graph.deTestCells, hvg_graph.deCtrlCells],
                  ['Test', 'Control'])

../../../_images/analysis_mll_enl_notebooks_3_markers_16_0.png

The run_de_test method is quite flexible and can easily be used to compare the genes across datasets as well. As an example below, we test for upregulated genes in the WT cells marked as Test cells above, while comparing them to ME cells which were projected onto these WT cells. Graph has a handy method called get_mapped_cells which can be used to identify cells that project onto any set of reference cells.

[9]:

mapped_wt_vs_me_table = nabo.run_de_test(
    wt_data,
    me_data,
    hvg_graph.deTestCells,
    [hvg_graph.get_mapped_cells('ME', hvg_graph.deTestCells)])
print (len(mapped_wt_vs_me_table), " DE genes found" )
mapped_wt_vs_me_table

100%| 00:00

12  DE genes found

[9]:

	gene	exp_frac	test_group	versus_group	rbc	log2_fc	pval	qval
0	S100A9	1.000000	Test group	Ctrl group 0	1.000000	6.791093	1.018165e-26	1.323614e-25
1	S100A8	1.000000	Test group	Ctrl group 0	1.000000	4.865921	9.085780e-25	5.905757e-24
2	STFA1	0.800000	Test group	Ctrl group 0	0.797143	7.520187	2.966528e-20	1.285495e-19
3	BC100530	0.714286	Test group	Ctrl group 0	0.714286	inf	8.820606e-18	2.866697e-17
4	CHIL3	0.585714	Test group	Ctrl group 0	0.585714	inf	5.195727e-14	1.350889e-13
5	GM5483	0.585714	Test group	Ctrl group 0	0.578571	6.343146	1.601439e-13	3.469784e-13
6	NGP	0.471429	Test group	Ctrl group 0	0.471429	inf	5.004760e-11	9.294554e-11
7	GT(ROSA)26SOR	0.628571	Test group	Ctrl group 0	0.439592	2.050626	4.469200e-07	7.262450e-07
8	GPR183	0.314286	Test group	Ctrl group 0	0.304490	5.425057	7.354642e-07	1.062337e-06
9	CCR2	0.828571	Test group	Ctrl group 0	0.347755	1.016938	1.927641e-04	2.505934e-04
10	GM21887	0.342857	Test group	Ctrl group 0	0.199592	1.597854	4.340554e-03	5.129745e-03
11	CTSH	0.400000	Test group	Ctrl group 0	0.187755	1.657413	1.045882e-02	1.133039e-02

To look for upregulated genes the other way round, we simply switch the order of parameters as shown below.

[10]:

mapped_me_vs_wt_table = nabo.run_de_test(
    me_data,
    wt_data,
    hvg_graph.get_mapped_cells('ME', hvg_graph.deTestCells),
    [hvg_graph.deTestCells])
print (len(mapped_me_vs_wt_table), " DE genes found" )
mapped_me_vs_wt_table.head(n=10)

100%| 00:00

310  DE genes found

[10]:

	gene	exp_frac	test_group	versus_group	rbc	log2_fc	pval	qval
0	CAR2	0.985560	Test group	Ctrl group 0	0.982878	7.449399	7.105180e-43	2.209711e-40
1	CTLA2A	0.954272	Test group	Ctrl group 0	0.898814	4.413338	3.396180e-36	5.281060e-34
2	HOXA9	0.935018	Test group	Ctrl group 0	0.825890	2.857941	6.952826e-31	4.324658e-29
3	H2-D1	1.000000	Test group	Ctrl group 0	0.826784	1.405286	6.612265e-31	4.324658e-29
4	MEIS1	0.931408	Test group	Ctrl group 0	0.828657	2.973956	4.381979e-31	4.324658e-29
5	IKZF2	0.800241	Test group	Ctrl group 0	0.777549	5.359693	5.109443e-28	2.648395e-26
6	HOXA7	0.959085	Test group	Ctrl group 0	0.762902	2.030980	1.289349e-26	5.728393e-25
7	VAMP5	0.848375	Test group	Ctrl group 0	0.749527	3.351877	6.167713e-26	2.397699e-24
8	ADGRL4	0.740072	Test group	Ctrl group 0	0.734330	7.338323	1.538700e-25	5.317064e-24
9	CD34	0.960289	Test group	Ctrl group 0	0.737648	1.819609	5.253895e-25	1.633961e-23

Now let’s identify marker genes for each reference graph cluster, using the find_cluster_markers function. This returns two values. The first is a table containing a list of differentially expressed genes between each pair of clusters (in the same format as seen above). The same gene can occur multiple times as it will be tested for multiple pairs of clusters. The second returned value is a dictionary containing only names of marker genes as values and cluster number as keys. Here we call the find_cluster_markers function with the following parameters:

The cluster information from the Graph object (hvg_graph.clusters).
Dataset object (wt_data)
de_frequency (4). This value decides across how many cluster comparisons a gene should have a significant differential expression for it to be called a marker gene for a given cluster. For example, for FLT3 to be called a marker for cluster 2 it should come up at least 4 times as having a significantly high expression when compared to the rest of the clusters. A gene is allowed to be called a marker gene for more than one cluster.
exp_frac_thresh (0.5). For a gene to be tested for a given cluster it should be expressed in at least 50% of the cells.
log_fc_thresh (1). Gene should have \(\log_2 (foldchange) \ >= \ 1\) for mean values.

[11]:

marker_table, marker_genes = nabo.find_cluster_markers(
    hvg_graph.clusters, wt_data, de_frequency=4, exp_frac_thresh=0.5, log2_fc_thresh=1)

Finding markers for cluster 1: 100%| 00:00
Finding markers for cluster 10: 100%| 00:00
Finding markers for cluster 11: 100%| 00:00
Finding markers for cluster 12: 100%| 00:00
Finding markers for cluster 2: 100%| 00:00
Finding markers for cluster 3: 100%| 00:00
Finding markers for cluster 4: 100%| 00:00
Finding markers for cluster 5: 100%| 00:00
Finding markers for cluster 6: 100%| 00:00
Finding markers for cluster 7: 100%| 00:00
Finding markers for cluster 8: 100%| 00:00
Finding markers for cluster 9: 100%| 00:00

Let’s check out how many genes marker genes were found for each cluster.

[12]:

for cluster in marker_genes:
    print (cluster, len(marker_genes[cluster]))