class: center, middle, inverse, title-slide

# Smail 2020
### Timing Liu
### Martin Group
### 2021-01-22

---

## Summary

Integration of rare large-effect expression variants improves polygenic risk prediction

- Integration: a new score that measures **genes with outlier rare variants** in each individual: IOGC
 
--

- Rare variants: MAF < 1%

- Expression: GTEx outliers

- Large effect: z-score filtering

- Improves: 
<img src = "assets/2021-01-19-21-09-11.png" align="center" style="border: none; box-shadow: none; height: 325px; text-align: center;">

---
## Expression: GTEx outliers

1. Individual *A* is an expression outlier for gene a in the RNA-seq data
2. Identify all rare variants in gene a carried by *A*
3. Remove variants that are also present in other individuals (B, C, D, ...)
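
A minimal Python sketch of the three steps above, on toy data. The individual labels, variant IDs and the `rare_variants_by_individual` mapping are hypothetical and only illustrate the filtering logic; they are not taken from the paper.

```python
# Toy data (hypothetical): rare variants (MAF < 1%) each individual carries in gene a.
rare_variants_by_individual = {
    "A": {"v1", "v2", "v3"},  # A is the expression outlier for gene a
    "B": {"v2"},
    "C": {"v3", "v4"},
    "D": set(),
}


def outlier_specific_variants(carriers, outlier):
    """Return the outlier's rare variants that no other individual carries."""
    others = set().union(*(v for ind, v in carriers.items() if ind != outlier))
    return carriers[outlier] - others


candidates = outlier_specific_variants(rare_variants_by_individual, "A")
print(sorted(candidates))  # ['v1']
```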

---
## Studying variants' effect on the PRS-predicted phenotype
![](assets/2021-01-18-10-56-38.png)

---
## Why look at outliers 
![](assets/2021-01-18-10-52-52.png)
- Control set (non-outliers): rare variants in non-outliers, matched on CADD score (+/- 5)
- Outlier and non-outlier variants are near each other
- But outlier variants have larger effect sizes

---
## Why look at outliers - larger effect size 
![](assets/2021-01-19-21-36-05.png)
- PRS variants have larger allele count but smaller effect size

---
## Why look at outliers - larger odds (ratio?)
![](assets/2021-01-18-13-59-11.png)

- 10,000 permutations
- randomly choose one outlier sample and one non-outlier sample (next to PRS site) to compare absolute effect size
---
## Why look at outliers - larger odds (ratio?)
Table 1

|abs. effect size|Person A (outlier)|Person B (non-outlier)|
|--|--|--|
|gene a|1.1|0.5|
|gene b|1.3|1|
|gene c|0.4|1|

Table 2

||A (outlier)|B (non-outlier)|
|--|--|--|
|genes with larger effect|2|1|

- odds = 2/1 = 2 -> repeat to get a distribution of odds for outlier vs. non-outlier
- then repeat to get a distribution of odds among non-outliers only
  
Why permutation test?
- Not enough variants (8272) so cannot plot raw distribution
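
A rough Python sketch of the comparison in Tables 1 and 2, using the Table 1 values. The function names, ignoring ties, and returning infinity when the denominator is zero are assumptions for illustration, not the authors' implementation.

```python
import random


def odds_first_larger(effects_a, effects_b):
    """Odds that the first sample has the larger absolute effect size.

    Counts genes where |a| > |b| and genes where |b| > |a| (ties are
    ignored, an assumption) and returns their ratio.
    """
    a_larger = sum(abs(a) > abs(b) for a, b in zip(effects_a, effects_b))
    b_larger = sum(abs(b) > abs(a) for a, b in zip(effects_a, effects_b))
    return a_larger / b_larger if b_larger else float("inf")


# Table 1: absolute effect sizes for genes a, b, c.
person_a = [1.1, 1.3, 0.4]  # outlier
person_b = [0.5, 1.0, 1.0]  # non-outlier
print(odds_first_larger(person_a, person_b))  # 2 / 1 = 2.0


def permutation_odds(group_1, group_2, n_perm, seed=0):
    """Repeat the comparison, drawing one random sample per group each time.

    group_1 and group_2 are lists of per-sample effect-size lists (one value
    per gene, same gene order). n_perm would be 10,000 on the slides.
    """
    rng = random.Random(seed)
    return [
        odds_first_larger(rng.choice(group_1), rng.choice(group_2))
        for _ in range(n_perm)
    ]
```

Calling `permutation_odds` with an outlier group and a non-outlier group would give the outlier distribution; calling it with two non-outlier groups would give the comparison distribution.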

---
## Why look at outliers - larger odds (ratio?)
Caveat:
- The authors said they built a contingency table and calculated **odds ratio** but Qinqin and I could not think of a way the contingency table can be built
- There was also no use of tests associated with contingency table: Chi-Square, Fisher's Exact, or Wald Test
- The p-value was obtained by Wilcoxon test (to test if the two distributions are the same)
- The author explained in later correspondence that the figure shows odds, not odds ratios, and that the manuscript will be revised

???
Across each permutation, the **absolute effect size** for a **randomly-chosen outlier sample** and **matched non-outlier sample** was obtained for each gene and summed in a **contingency matrix** to **quantify the number of genes** where the **outlier variant had an absolute effect size greater than the non-outlier variant** (blue shading). This process was repeated for randomly selected non-outlier variants only (gray).

Wilcoxon rank sum test. Subset to genes linked to PRS variants.

---
## Why look at outliers - larger variance

- Dispersion of mean effect sizes per gene for outlier (blue) and non-outlier variants (gray) across genes
- stratified by GTEx outlier Z-score. 
- P-values were obtained using an Ansari-Bradley test (`ansari.test` in R)
  - non-parametric test for the differences in spread, assuming the centres of the two populations are identical
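
To show what the test statistic measures, here is a small pure-Python sketch. It computes only the statistic (no p-value) and assumes there are no tied values; the toy effect sizes are hypothetical. The full test is what `ansari.test` performs.

```python
def ansari_bradley_statistic(x, y):
    """Ansari-Bradley statistic for sample x (assumes no tied values).

    Each value in the pooled, sorted sample gets the score
    min(rank, N + 1 - rank), so values at either extreme score low and
    central values score high. The statistic is the sum of scores for x:
    a smaller value means x is more spread out than y.
    """
    pooled = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    n = len(pooled)
    return sum(
        min(rank, n + 1 - rank)
        for rank, (_, label) in enumerate(pooled, start=1)
        if label == "x"
    )


# Hypothetical mean effect sizes per gene; both groups are centred near zero.
outlier = [-10, -5, 5, 10]
non_outlier = [-1, 0, 1, 2]
print(ansari_bradley_statistic(outlier, non_outlier))  # 6
print(ansari_bradley_statistic(non_outlier, outlier))  # 14
```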

???
dispersion of variability
non-parametric equivalent of a test for equality of variance

test for differences in spread, while assuming that the centres of the two populations are identical
- observations are independent and identically distributed
- the two samples must be independent of each other, with equal medians
---
## Outline

- **Rare variants**
- **Expression**
- *Large effect: z-score stratification*
- Integration
- Improves 
---
## Large effect: z-score stratification - larger dispersion
<img src = "assets/2021-01-18-14-31-38.png" align="center" style="border: none; box-shadow: none; height: 325px; text-align: center;">

- larger z-scores -> larger dispersion

---
## Large effect: z-score stratification - larger odds ratio
![](assets/2021-01-19-22-23-44.png)

---
## Outline

- **Rare variants**
- **Expression**
- **Large effect: z-score stratification**
- *Integration: a new score that measures **genes with outlier rare variants** in each individual: IOGC*
- Improves

---
## IOGC - independent outlier gene count

- an individual has two genes
- gene 1 has three variants (a1, a2, a3)
- effect sizes: a1 = -5, a2 = +10, a3 = +3
- sgn(a1, a2, a3) = (-1, 1, 1)
- keep each sign once per gene: s1 = [-1, 1]
- similarly, s2 = [0, 1]
- IOGC = 2 - 1 = 1: two +1 entries minus one -1 entry (i.e. sum of protective genes - sum of risk genes)
![](assets/2021-01-18-09-40-27.png)
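
The worked example above can be sketched in a few lines of Python. The function names are illustrative, and the gene 2 betas are hypothetical values chosen only so that s2 = [0, 1]; this is a reading of the slide's example, not the authors' code.

```python
def sign(beta):
    """Sign function: -1, 0 or +1."""
    return (beta > 0) - (beta < 0)


def iogc(betas_by_gene):
    """Sum the unique effect directions within each gene, then across genes."""
    return sum(
        sum({sign(beta) for beta in betas})  # set: each direction counts once per gene
        for betas in betas_by_gene.values()
    )


individual = {
    "gene 1": [-5, 10, 3],  # signs (-1, 1, 1) -> unique signs {-1, 1}
    "gene 2": [0, 4],       # hypothetical betas chosen so that s2 = {0, 1}
}
print(iogc(individual))  # 1
```

Collapsing to a set per gene is what prevents double counting when several variants in the same gene point in the same direction.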

???
- effect size against what? BMI value?

for each individual
- link variants to effect size direction in UKBB
- collapse to gene-level
  - prevent double-counting
- convert beta effect estimate per variant to integers using a sign function

where β is the UKB GWAS beta coefficient for variant k

---
## Communication with the author

### Q
-	why collapse by genes instead of variants?

### A
-	some variants are very close together and probably in LD (i.e. double counting) 
- (although this is difficult to estimate with rare variants at current sample sizes)
- genes can be linked to GTEx outlier expression

---
## Communication with the author 2.0

### Q
-	why not include effect size?

### A
-	effect size estimates are very noisy at current sample sizes in UK Biobank 
- (can look at the p-values and standard errors of the estimates to appreciate this point)
- however, perhaps these estimates are sufficiently powered for estimating effect direction (risk/protective) 
- as sample sizes increase, it would be a good idea to incorporate these estimates

---
## Outline

- **Rare variants**
- *Expression: outlier*
- *Large effect: Z-score*
- *Integration: IOGC*

--
![](assets/2021-01-19-22-55-02.png)

**Mean change in BMI per unit change in IOGC score** increases with
- the number of GTEx tissues in which the variants are identified (g)
- the outlier Z-score (h)

---
## Outline

- **Rare variants**
- **Expression: outlier**
- **Large effect: Z-score**
- **Integration: IOGC**
- *Improve*

---

## Improve: stand-alone predictor: across percentiles

- BMI increases with IOGC percentile
- Dashed line indicates cohort mean
---
## Improve: stand-alone predictor: two percentile tails 
![](assets/2021-01-19-23-12-00.png)

- Rate of obesity, severe obesity (b) and age of obesity and HT can be stratified by IOGC

---
## Improve: stratify PRS

---
## Improve: more evidence for predicting diabetes
![](assets/2021-01-19-23-18-41.png)

---
## Improve: other traits (~2400)
![](assets/2021-01-19-23-19-31.png)

- Shows that outlier variants can influence the odds ratio, but IOGC was not applied
- IOGC can potentially be helpful

---
## Communication with the author 3.0

### Q
- What about rare variants without GTEx filtering?
- I would have done this as my first step if I were studying whether rare variants improve PRS.

### A
-	The goal of the current paper is to link rare GTEx outlier-associated variants to effects on traits/disease
- Several other papers look at all rare variants using (for example) SKAT-O

---
## Conclusion
![](assets/4ugcxw.jpg)

---
## Acknowledgements
- Hilary for letting me present and recommending this paper 
- Qinqin for the late-night discussions 
- Craig Smail (first author) for the very helpful and prompt correspondence

---
background-image: url(https://media4.giphy.com/media/lD76yTC5zxZPG/giphy.gif)
class: center, top

# Thanks!