Project and Conquer!
 Bryan W Lewis 
 rstor.io 
 http://illposed.net
Point of view matters
Data Shadows
The purple and pink dots are at least this far apart.
The purple and pink dots are at least this far apart.
 The tcor algorithm: fast thresholded distances and correlations.
 1,000 × 20,522 matrix, find all pairs of columns with Pearson’s correlation coefficient ≥ 0.99.
 What subspaces to project into? 
 What kinds of projections?
 \[ \mathrm{span}(\beta, X\beta, X^2\beta, \ldots, X^{k-1}\beta), \]
 \[ \mathrm{span}(\beta, X^TX\beta, (X^TX)^2\beta, \ldots, (X^TX)^{k-1}\beta), \qquad \ldots \]
 Random β for data exploration, 
 measured β for supervised models
…focus on the information content (PCA, SVD, etc.)…
      
…or perhaps cluster similar features…
…or maybe bi-cluster similar rows and columns…
 https://CRAN.R-project.org/package=s4vd (Kaiser & Sill)
Sparse biclust algorithm by Lee, Shen, Huang, and Marron
 1,000 \(\times\) 10,000 example matrix timed on a cheap laptop
library(s4vd)
set.seed(1)
x = matrix(rnorm(1000 * 10000), nrow=1000)
i = seq(from=1, to=nrow(x), by=2)
j = seq(from=1, to=ncol(x), by=2)
x[i,j] = rnorm(length(i)*length(j), mean=2, sd=0.5)
cl = biclust(x, method=BCssvd, K=1)- Original routines      3,522s
- Reformulated routines     20s
…maybe the data represent a network…
…we might desire the most central vertices,
or perhaps the most communicable paths, etc….
https://github.com/bwlewis/rfinance-2017/topm.r
Compare to Padé approximant for 1000 × 1000 directed-network adjacency matrix
system.time(ex <- diag(expm(X) + expm(-X)) / 2)
# user    system elapsed
# 151.080 0.220  151.552
head(order(ex, decreasing=TRUE), 5)
#[1] 11 25 27 29 74system.time(top <- topm(X, type="centrality"))
# user  system elapsed
# 0.555 0.010  0.565
top$hubs
#[1] 11 25 27 29 74Cool algorithm by Cai, Candès, Shen, prototype: https://github.com/bwlewis/rfinance-2017/svt.r
 Input matrix M with missing values except in index set P.
 Input matrix M with missing values every except index set P. 
 
 \[ \min\,\,\mathrm{rank}(X) \,\,\mathrm{s. t.}\,\, X_P = M_P \]
 Input matrix M with missing values every except index set P. 
 
 \[ \min\,\,\mathrm{rank}(X) \,\,\mathrm{s. t.}\,\, X_P = M_P \]
 \[ \min\,\,||X||_* \,\,\mathrm{s. t.}\,\, X_P = M_P \]
…or, maybe we want to select interesting features (columns)… 
 
 Use generalized Krylov subspaces (Voss, Lanza, Morigi, Reichel, Sgallari, …) to solve
\[ \min \frac{1}{p}||X\beta - y||^p_p + \frac{\mu}{q}||\Phi(\beta)||^q_q \]
q = 1 (Lasso) and 0 < q < 1 (closer to subset selection!)
 
 https://bwlewis.github.io/rfinance-2017
 https://github.com/bwlewis/rfinance-2017
 Thanks, Diethelm!