close
close

Mapping interindividual dynamics of innate immune response at single-cell resolution

Mapping interindividual dynamics of innate immune response at single-cell resolution

Ethical compliance

This project was approved by the Wellcome Sanger Institute Animal Welfare and Ethical Review Body and complied with all relevant ethical regulations regarding animal research and human studies. Human cells were obtained from HipSci24, where they were collected from volunteers recruited from the National Institute for Health and Care Research (NIHR) Cambridge BioResource (written consent was given). Human skin profiling was performed in accordance with protocols approved by the Newcastle Research Ethics Committee (REC approval 08/H0906/95+5). Patients with a confirmed diagnosis of COVID-19 were recruited from Addenbrooke’s and Royal Papworth hospitals under ethical approval obtained from the East of England Cambridge Central Research Ethics Committee (NIHR BioResource, REC no. 17/EE/0025). Informed consent was obtained for all participants.

Dermal fibroblast cell culture and stimulation

Primary dermal fibroblast cells from HipSci were used ( The cells were derived from healthy individuals spanning a range of ages (from 30 to 79 and 57.2 on average) and both sexes (40 female and 28 male). Following a similar protocol used in our previous work27, cells were cultured in DMEM (high glucose, pyruvate, Life Technologies), with 10% FBS, GlutaMAX and 1% penicillin-streptomycin. In each experimental batch, we cultured in parallel cells from three different individuals. Cells were split the day before the experiment into separate wells and on the day of experiment were stimulated with either dsRNA (0.5 µg ml−1 high-molecular-weight rhodamine-labeled Poly(I:C) (Invivogen, tlrl-pic), transfected with 1 µl ml−1 lipofectamine 2000 (Thermo Fisher, 11668027), for 2 or 6 h) or 1,000 U ml−1 human recombinant IFN-β (11410-2, PBL), for 2 or 6 h, or left untreated. In this manner, for each individual, we obtained five separate conditions.

After the relevant period of time, cells were detached by trypsinization and resuspended in PBS. Samples from the three individuals with the same treatment were then mixed (for example, ‘unstimulated’ cells from the three donors would be pooled together). The primary aim of this mixing step was to reduce downstream experimental variability between the three donors, while simultaneously streamlining the collection stage. In this manner, we obtained plates for each of the five conditions, with each having a mixture of all three individuals.

Sorting and single-cell library preparation

Cells were sorted on a Becton Dickinson Influx into 96-well plates containing 2 µl per well of lysis buffer, as described in the Smart-Seq2 protocol43, or in our previous work27. Importantly, each 96-well plate contained cells from the same condition of all three individuals used for each experimental batch. Single cells were sorted individually (using FSC-W versus FSC-H), and apoptotic cells were excluded using DAPI. Rhodamine-positive cells were selected in the Poly(I:C) treatments. Cells from each three-plex cell pool were sorted across four plates. Reverse transcription and complementary DNA amplification were performed according to the Smart-Seq2 protocol (Picelli et al., 2014), and library preparation was performed using an Illumina Nextera kit. Samples were sequenced using paired-end 75-bp reads on an Illumina HiSeq 2500 machine. For library preparation, cells were loaded into 384-well plates. We note that cells in one of the four stimulated conditions were assigned in a 384-well plate in conjunction with cells in naïve condition (rows C, H and M).

Smart-Seq2 data preprocessing and quality control

All sequence data were aligned to human genome assembly GRCh38 using STAR (v.2.5.3a; and ENSEMBL human gene assembly 90 as the reference transcriptome. We performed adapter trimming of Tn5 transposon and PCR primer sequences using skewer (v.0.1.127; before alignment. Following alignment, we used featureCounts (v.1.5.3; to count fragments for each annotated gene. In total, we observed 58,394 cells, of which 22,188 cells passed the quality control criteria: the minimum number of sequenced fragments (>10,000 autosomal fragments), the minimum number of expressed genes (>500 autosomal genes), mitochondrial fragment percentage (<20%) and the library complexity (percentage of autosomal fragment counts for the top 100 highly expressed genes <30%). We also performed demuxlet44 (v.0.1.0; to identify the genetic origin of each cell as well as to remove doublets using the genotype data from HipSci.

Genotype data

We obtained the SNP genotype data from HipSci24 (Data availability). We also genotyped 112 COVID-19 PBMC samples using the Affymetrix Axiom UK Biobank array (Data availability). We converted the genome coordinates from hg19 to GRCh38 using CrossMap (v.0.5.2; We then performed the whole-genome imputation using Beagle (v.5.1; with the reference panel from the 1000 Genomes Project (Data availability).

See also  Former MI6 boss warns Britain has ‘lost its way’ without Cold War threat

Cell viability prediction

The cell viability was predicted by the web-based tool CEVIChE ( Because the tool is designed for bulk RNA-seq data, we aggregated gene expression levels for neighboring cells based on the UMAP in Fig. 2a. We constructed 30 × 30 equispaced grids and took geometric means of logCPM (log of counts per million) values within each grid.

GPLVM

The GASPACHO framework incorporated a GPLVM as a core model to estimate the latent variables and model parameters subsequently used in the spatial DE analysis and eQTL mapping. We assumed that the gene expression vector \({y}_{j}={(\,{y}_{{ij}}{;i}=1,\ldots ,N\,)}^{T}\) for the gene \(j\) across \(N\) cells is independently drawn from

$${y}_{j}\sim N({\alpha }_{j}+Z{\gamma }_{j},{{\sigma }_{j}}^{2}\varOmega )$$

$${\alpha }_{j}\sim N(0,{{{\sigma }_{j}}^{2}K}_{\theta }{K}_{B}{K}_{X})$$

$${\gamma }_{j}\sim N(\zeta ,{{\sigma }_{j}}^{2}\varDelta )$$

where \({\alpha }_{j}\) is a baseline GP governed by three different kernel matrices, periodic kernel matrix \(\,{K}_{\theta }\) for the cell cycle state (\(\theta\)) and two other squared exponential kernel matrices \({K}_{B}\) and \({K}_{X}\) for unknown batch effects (B) and the target cell state (X), respectively. Here, \(Z\) is a design matrix for the known covariates, such as donor and sequencing plates (Fig. 1b), and \({\gamma }_{j}\) is a random effect to adjust the known confounding effects whose mean and variance were defined by \(\zeta\) and the diagonal matrix \(\varDelta\) shared across all genes \(j=1,\ldots ,J\). The residual expression was determined by the gene-specific residual variance \({{\sigma }_{j}}^{2}\) and the cell-specific residual variance \(\varOmega ={\rm{diag}}({\omega }_{i}{;i}=1,\ldots ,N)\). The variance of the GP and random effect for gene j was properly scaled by the gene-specific residual variance \({{\sigma }_{j}}^{2}\).

The model parameters \(\{\varDelta ,\varOmega ,\varSigma ,\zeta \,\}\) and the latent variables \(\{\theta ,B,X\,\}\) were inferred by maximizing the marginal likelihood

$$L(\theta ,B,X,\varDelta ,\varOmega ,\varSigma ,\zeta )=\mathop{\prod }\limits_{j=1}^{J}\int p({y}_{j}{\rm{|}}{\alpha }_{j},{\gamma }_{j})p({\alpha }_{j})p({\gamma }_{j}){\mathrm{d}}{\alpha }_{j}{\mathrm{d}}{\gamma }_{j},$$

where \(\varSigma ={\rm{diag}}({{\sigma }_{j}}^{2}{;j}=1,\ldots ,J\,)\). We used the L-BFGS algorithm with the analytic gradient of the likelihood function with respect to the parameters and the latent variables. In reality, the kernel matrices are not tractable for large \(N\); we computed the Titsias bound using the sparse GP21 to approximate the above likelihood (see section 1.3 of the Supplementary Notes for more details).

GP mixture model for gene classification

We employed a GP mixture model to perform the DE analysis in the target cell-state space defined by \(X\) which was estimated by the GPLVM. Specifically, we introduced one extra GP βk for the kth differentially expressed gene group (k = 1,…,K) to which a gene j belongs:

$${y}_{j}\sim N({\alpha }_{j}+{{\delta }_{{jk}}\beta }_{k}+Z{\gamma }_{j},{{\sigma }_{j}}^{2}\varOmega )$$

$${\alpha }_{j}\sim N(0,{{{\sigma }_{j}}^{2}K}_{\theta }{K}_{B})$$

$${\beta }_{k}\sim N(0,{{{\sigma }_{j}}^{2}K}_{X})$$

$${\gamma }_{j}\sim N(\zeta ,{{\sigma }_{j}}^{2}\varDelta )$$

$${\delta }_{{jk}}\sim N(0,1)$$

Here, the effect size of the GP was properly scaled by a coefficient \({\delta }_{{jk}}\) to allow the GP to be both positively and negatively correlated with the gene expression. The model parameters \(\{\varDelta ,\varOmega ,\varSigma ,\zeta \,\}\) and the latent variables \(\{\theta ,B,X\,\}\) were replaced by the estimators of the GPLVM. Then, we maximized the likelihood of a finite mixture of GPs:

$$\begin{array}{l}L({{\pi }_{1},\ldots ,{\pi }_{K},\beta }_{1},\ldots ,{\beta }_{K})\\=\mathop{\prod }\limits_{j=1}^{J}\mathop{\sum }\limits_{k=1}^{K}{\displaystyle{\int}} {\pi }_{k}p(\,{y}_{j}{\rm{|}}{\alpha }_{j},{\beta }_{k},{\gamma }_{j},{\delta }_{{jk}})p({\alpha }_{j})p(\,{\beta }_{j})p({\gamma }_{j})p({\delta }_{{jk}}){\mathrm{d}}{\alpha }_{j}{{\mathrm{d}}\gamma }_{j}{\mathrm{d}}{\delta }_{{jk}}\end{array}$$

with respect to \({\pi }_{k}\) and \({\beta }_{k}\) for \(k=1,\ldots ,K\). Note that the number of total mixture components \(K\) is fixed in the current implementation and K = 3 was used in the fibroblast data. We used the sparse approximation to make the likelihood tractable (see section 1.4 of the Supplementary Notes for more details). Note that this model can be readily extended to classify dynamic eQTL effect sizes into finite spatial patterns (see section 1.4.1 of the Supplementary Notes).

For the pseudotime analysis, we computed the posterior mean E[βk|y1,…,yJ] for the kth component, which provided the underlying cellular states regarding the primary and secondary innate immune responses.

GP regression for association mapping

We employed a GP regression model to map eQTLs in the target cell-state space defined by X which was estimated by the GPLVM. Specifically, we introduced one extra GP βjl for the gene j multiplied by the lth genetic variant gl = (gl1,…,glN)T whose ith element gli is alternative allele dosages for the individual i as a gene–environment interaction:

$${y}_{j}\sim N({\alpha }_{j}+{\beta }_{{jl}}\odot {g}_{l}+Z{\gamma }_{j},{{\sigma }_{jl}}^{2}\varOmega )$$

$${\alpha }_{j}\sim N(0,{{{\sigma }_{jl}}^{2}K}_{\theta }{K}_{B}{K}_{X})$$

$${\beta }_{{jl}}\sim N(0,{{\delta }_{g}}^{2}{{\sigma }_{jl}}^{2}(1{1}^{T}+K_{X}))$$

$${\gamma }_{j}\sim N(\zeta ,{{\sigma }_{jl}}^{2}\varDelta )$$

Here the eQTL effect size was properly scaled by a coefficient \({\delta }_{g}\) to allow for controlling of the genetic contribution on the expression level. The model parameters \(\{\varDelta ,\varOmega ,\varSigma ,\zeta \,\}\) and the latent variables \(\{\theta ,B,X\,\}\) were replaced by the estimated values obtained by the GPLVM. The Bayes factor of genetic association can be obtained by:

See also  Wiltshire's Louis Harvey claims third consecutive podium at Snetterton

$${\mathrm{BF}}=\frac{\int p(\,{y}_{j}{\rm{|}}{\alpha }_{j},{\beta }_{{jl}},{\gamma }_{j})p({\alpha }_{j})p(\,{\beta }_{{jl}})p({\gamma }_{j}){\mathrm{d}}{\alpha }_{j}{\mathrm{d}}{\beta }_{{jl}}{\mathrm{d}}{\gamma }_{j}}{\int p(\,{y}_{j}{\rm{|}}{\alpha }_{j},{\beta }_{{jl}}=0,{\gamma }_{j})p({\alpha }_{j})p({\gamma }_{j}){\mathrm{d}}{\alpha }_{j}{\mathrm{d}}{\gamma }_{j}}$$

where we set \({\delta }_{g}=0.1\) (see section 1.5 of the Supplementary Notes for more details).

As is implemented in CellRegMap, our model can also be extended to take the context-specific donor (context-by-donor interaction) effect into account. Here, the gene expression model can be written as:

$${y}_{j}\sim N\left({\alpha }_{j}+{\beta }_{{jl}}\odot {g}_{l}+Z{\gamma }_{j}+\mathop{\sum }\limits_{i=1}^{{N}_{\mathrm{d}}}{f}_{{ij}}\odot {z}_{i},{{\sigma }_{j}}^{2}\varOmega\right )$$

$${f}_{{ij}}\sim N(0,{{{\delta }_{{\mathrm{dxc}}}}^{2}{{\sigma }_{j}}^{2}K}_{X}){\rm{;}}\;i=1,\ldots ,{N}_{d},$$

where \({f}_{{ij}}\) denotes an additional GP for the individual i, \({z}_{i}\) denotes the indicator vector to specify which cells belong to the individual i and \({N}_{\mathrm{d}}\) denotes the number of donors in the data. The additional variance parameter \({\delta }_{{\mathrm{dxc}}}\) for the context-by-donor interaction effect is estimated under the null model using all genes (see section 1.5.1 of the Supplementary Notes for more details). All the real data analyses using the fibroblast data in this manuscript were based on the Bayes factors with this context-specific donor effect.

The eQTL effect size was estimated using the posterior distribution \(p(\,{\beta }_{{jl}}|\,{y}_{j})\propto p(\,{y}_{j}|\,{\beta }_{{jl}})p(\,{\beta }_{{jl}})\) and the posterior mean \(E[{\beta }_{{jl}}|\,{y}_{j}]\) was computed for each variant l and used for the visualization on a UMAP (see section 1.5.2 of the Supplementary Notes for more details).

Hierarchical model for eQTL mapping and enrichment analysis

We tested genetic variants whose MAF is greater than 0.05 in a 1-Mb cis-regulatory window centered at each gene TSS. To control the FDR in a Bayesian framework, we used the hierarchical model45 to obtain the posterior probability that a gene is an eQTL as well as the posterior probability that a variant is an eQTL variant within the cis window. The model allows incorporating various genomic annotations in the gene-level and variant-level as demonstrated previously45. We used the ChIP–seq peak annotations obtained by Hagai et al.27 in conjunction with TSS proximity to estimate the contribution of epigenetic information to the eQTL variant discovery (see section 2.1 of the Supplementary Notes for more details). Note that we only consider genes expressed in at least 10% of the cells, resulting in a tested dataset of 10,748 genes. We did not introduce the gene-level prior probability to weight highly expressed genes for the eQTL discovery.

eQTL enrichment in differentially expressed genes and other annotations

The enrichment analysis was carried out based on the posterior probability Zj that the gene j is an eQTL obtained from the hierarchical model. We then computed a 2 × 2 table using a corresponding binary annotation \({X}_{j}\) (if the gene j belongs to some annotation, for example, a TATA-box, then \({X}_{j}=1\), and otherwise \({X}_{j}=0\)) or alternatively the posterior probability \({X}_{j}\in [\mathrm{0,1}]\) that the gene j is a differentially expressed gene (one of multiple differentially expressed gene categories defined above), such that

$${T}_{{kl}}=\mathop{\sum }\limits_{j=1}^{J}{(1-{X}_{j})}^{(1-k)}{(1-{Z}_{j})}^{(1-l)}{{X}_{j}}^{k}{{Z}_{j}}^{l}$$

for \(k,l=\mathrm{0,1}\). From the 2-by-2 table \(T\), we computed the log odds ratio \(r=\,\log ({T}_{00}{T}_{11}/({T}_{01}{T}_{10}))\) and its standard error \({\mathrm{Var}}(r)=(1/{T}_{00}+1/{T}_{01}+1/{T}_{10}+1/{T}_{11})\) to perform hypothesis testing. The confidence interval of the log odds ratio was given by \(r\pm 1.96\sqrt{{\mathrm{Var}}(r)}\). We also computed the P value from the Χ2 statistic \({\chi }^{2}={r}^{2}/{\rm{Var}}(r)\).

If the occurrence of eQTLs and an annotation \(X\) were confounded by a factor \(C\) (such as expression level for a gene), we split genes into 100 quantile bins according to the confounding factor \(C\) to compute the log odds ratio and its standard error for each bin as demonstrated above, and then we combined them using the inverse variance method to derive the meta statistic for an adjusted enrichment statistic.

eQTL sharing with GTEx tissues

We used the pairwise hierarchical model45,46 to jointly map eQTLs in two different cell types (a similar approach to the pairwise fGWAS model47). We employed the association Bayes factor at each variant for each gene to compute the regional Bayes factors in a cis region of 1 Mb centered at the TSS under the following five different hypotheses:

H0: a gene is not an eQTL in cell/tissue types 1 and 2.

H1: a gene is an eQTL in cell/tissue type 1, but not in cell/tissue type 2.

H2: a gene is an eQTL in cell/tissue type 2, but not in cell/tissue type 1.

H3: a gene is an eQTL in cell/tissue types 1 and 2 with two independent putative causal variants.

H4: a gene is an eQTL in cell/tissue types 1 and 2 with the shared putative causal variant.

Those regional Bayes factors were used in a hierarchical model to estimate prior probabilities that eQTLs are shared between two cell types (see section 2.2 of the Supplementary Notes for more details).

The hypothesis testing of eQTL enrichment in different DE categories, which are also colocalized with other eQTLs (such as GTEx fibroblasts), was performed by computing the pseudo counts:

$${T}_{{kl}}=\mathop{\sum }\limits_{j=1}^{J}{(1-{X}_{j})}^{(1-k)}{({{Z}_{j}}^{(1)}+{{Z}_{j}}^{(3)})}^{(1-l)}{{X}_{j}}^{k}{({Z_{j}}^{(4)})}^{l}$$

for \(k,l=\mathrm{0,1}\), where \({X}_{j}\) denotes the posterior probability that the gene j is differentially expressed, \({{Z}_{j}}^{(1)}\) denotes the probability that the gene j is an eQTL in our data and not in a GTEx tissue, \({{Z}_{j}}^{(3)}\) denotes the probability that the gene j is an eQTL in our data and a GTEx tissue and not sharing the putative causal variant, and \({{Z}_{j}}^{(4)}\) denotes the probability that the gene j is an eQTL both in our data and a GTEx tissue and colocalized. The odds ratio and its standard error were computed as described in the section ‘eQTL enrichment in differentially expressed genes and other annotations’.

See also  3 West Brom players at a crossroads

Annotating TATA and CpG genes

To look for TATA-box motifs in gene promoters, we used TATA-motifs from CIS-BP (Data availability). We used the CpG annotation (Data availability) from the UCSC Genome browser to search for genes whose promoters overlap with a CGI. In both cases, we used the region 100 bp upstream from the TSS as the promoter region and referred to these genes as TATA genes and CpG genes, respectively.

eQTL variant enrichment at TF motifs

The hierarchical model provided the posterior probability that each variant l in the cis-regulatory region for the gene j is the eQTL \({Z}_{{jl}}\), so that \({\sum }_{l=1}^{{L}_{j}}{Z}_{{jl}}=1\) where Lj is the number of variants in the cis window. We first selected the lead eQTL variant according to the posterior probability for each gene j. We then used the position weight matrices of TF motifs in CIS-BP (Data availability) to call motifs overlapping with lead eQTL variants as described elsewhere45.

To perform the hypothesis testing that a TF motif is significantly overlapping with eQTL variants, we set \({Z}_{j}=\mathop{\mathrm{max}}\limits_{l=1,\ldots ,{L}_{j}}\{{Z}_{jl}\}\) and Xj to be the binary variable whose value is Xj = 1 if the lead eQTL variant l is overlapping with a TF motif; otherwise, Xj = 0. We then computed the 2 × 2 table to perform the enrichment analysis as described in the section ‘eQTL enrichment in differentially expressed genes and other annotations’.

GWAS summary statistics

GWAS summary statistics were obtained from Open Targets which collected and harmonized summary statistics from the GWAS Catalog, FinnGen and UK Biobank (in total, 4,744 traits) (Data availability). We also downloaded summary statistics of four different COVID-19-related traits for all samples excluding 23andMe (‘Very severe respiratory confirmed covid vs. population’, ‘Hospitalized covid vs. not hospitalized covid’, ‘Hospitalized covid vs. population’ and ‘Covid vs. population’ in release 5) from the COVID-19 Host Genetics Initiative (Data availability). We selected 701 GWAS traits out of 4,748 traits with the criterion of five or more genome-wide significant loci, of which 112 were broadly immune-related, including autoimmune and chronic inflammatory diseases as well as infectious diseases.

Colocalization with GWAS traits

We used the same pairwise hierarchical model as in the section ‘eQTL sharing with GTEx tissues’ to perform the GWAS colocalization analysis, where the prior probabilities of the pairwise hierarchical model were fixed as \(\{{\varPi }_{1},\,{\varPi }_{2},\,{\varPsi }_{12}\}=\{0.2,\,0.05,\,0.01\}\), so that we can compare different studies with different statistical power to detect GWAS associations due to varying sample sizes. Here, \({\varPi }_{1}\) denotes the prior probability that a gene is an eQTL, \({\varPi }_{2}\) denotes that a genomic region for the corresponding gene (a 1-Mb window centered at TSS) is a significant GWAS locus and \({\varPsi }_{12}\) is a prior probability that the eQTL and the GWAS locus are colocalized. To fit the model, we converted the effect sizes and standard errors of each GWAS trait into Bayes factors using Wakefield’s approximation48. See section 2.3 of the Supplementary Notes for more details.

PBMC data analysis and eQTL mapping

We used human PBMC scRNA-seq data22 from 112 donors, including 84 COVID-19-positive individuals, profiled with the CITE-seq approach from 10x Genomics. We reduced the full GASPACHO approach to accommodate the PBMC single-cell data of over 700,000 cells in a reasonable time scale. The kernel functions used in the model were restricted to the linear kernel without the cyclic kernel for the cell cycle effect. The latent factors were estimated with the covariates of the number of genes expressed, the number of mapped reads, the sequencing center, sex, age, COVID-19 status, COVID-19 severity, patient ID and the first three genotype principal components. The latent factors were then used to define the two GPs

$${\alpha }_{j}\sim N(0,{{\sigma }_{j}}^{2}X{X}^{T})$$

$${\beta }_{{jl}}\sim N(0,{{\delta }_{g}}^{2}{{\sigma }_{j}}^{2}(1{1}^{T}+X{X}^{T}))$$

for the intercept and the eQTL effect size of variant l for gene j.

OAS1 locus analysis using COVID-19 nasal brushing samples

To fine-map the OAS1 locus in cells in vivo infected with COVID-19, we used human single-cell data of 33 nasal brushing samples from patients with COVID-19 from a recent work23, profiled using CITE-seq. We used the aligned bam files to quantify allele-specific expression at rs10774671 using RASQUAL allele-specific expression caller49. The genotypes were assigned by fitting a binomial distribution on the allele-specific expression with probability parameters p = {0.01, 0.5, 0.99} for reference homozygote, heterozygote and alternative homozygote, respectively. We inferred that there are three reference homozygotes, 18 heterozygotes and 12 alternative homozygotes at rs10774671 in this dataset.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

  • June 12, 2023