HurdleDMR from R

HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multinomial Regression (HDMR), as described in:

Kelly, Bryan, Asaf Manela, and Alan Moreira (2018). Text Selection. Working paper.

It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of Taddy (2015).

This tutorial explains how to use this package from R via the JuliaCall package that is available on CRAN.

Setup

Install Julia

First, install Julia itself. The easiest way is to download it from https://julialang.org/downloads/. An alternative is to install JuliaPro from https://juliacomputing.com

Once installed, open julia in a terminal, press ] to activate the package manager, and add the following packages:

pkg> add RCall HurdleDMR GLM Lasso

The JuliaCall package for R

Now, back to R

In [1]:
install.packages("JuliaCall")
Installing package into ‘/home/root/R/x86_64-pc-linux-gnu-library/3.5’
(as ‘lib’ is unspecified)

Load the JuliaCall library and set up julia

In [2]:
library(JuliaCall)
j <- julia_setup()
Julia version 1.1.1 at location /home/root/julialang/julia-1.1.1/bin will be used.
Loading setup script for JuliaCall...
Finish loading setup script for JuliaCall.

We can now use j$xx to call julia, as in

In [3]:
j$eval("1+2")
3
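
Besides j$eval, the object returned by julia_setup exposes a few more entry points that we rely on below. A minimal sketch (the variables x and y are just toy names):

j$command("x = 1:3")    # run julia code for its side effects
j$assign("y", c(1, 2))  # copy an R object into a julia variable
j$eval("sum(x)")        # evaluate julia code and return the result to R
j$call("maximum", 4:6)  # call a julia function on R arguments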

Example data

For this example we will use data that ships with the fantastic textir package for R.

In [4]:
#install.packages("textir") 
library(textir)

data(we8there)

covars <- we8thereRatings[,'Overall',drop=FALSE]
counts <- we8thereCounts
Loading required package: distrom
Loading required package: Matrix
Loading required package: gamlr
Loading required package: parallel

Benchmark in R

To compare the two implementations, we first fit a DMR in R using textir (a wrapper for distrom).

Make a cluster of nprocs processors

In [5]:
nprocs <- as.integer(detectCores()-2)
cl <- makeCluster(nprocs,type=ifelse(.Platform$OS.type=="unix","FORK","PSOCK"))

Fit Distributed Multinomial Regression (DMR) in parallel

In [6]:
system.time(fits <- dmr(cl, covars, counts, verb=1))
fitting 6166 observations on 2640 categories, 1 covariates.
converting counts matrix to column list...
distributed run.
socket cluster with 18 nodes on host ‘localhost’
   user  system elapsed 
  0.303   0.048   4.306 

Good. Now stop the cluster to clean up.

In [7]:
stopCluster(cl)

Get AICc optimal coefficients

In [8]:
BR <- coef(fits)
dim(BR)
Warning message:
“'rBind' is deprecated.
 Since R version 3.2.0, base's rbind() should work fine with S4 objects”
  1. 2
  2. 2640

Get SR projection onto factors

In [9]:
zR <- srproj(BR,counts) 
dim(zR)
  1. 6166
  2. 2
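
For intuition, the SR projection is essentially the count-normalized score along the fitted loadings, with the total count m appended as a final column. A rough hand-rolled check, based on our reading of Taddy (2015) (textir may differ in details such as the handling of zero-count rows, so treat this as a sketch):

mtot  <- rowSums(counts)                      # total counts per document
zhand <- as.vector(counts %*% BR[2,]) / mtot  # row 2 of BR holds the Overall loadings
all.equal(zhand, zR[,1], check.attributes = FALSE)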

Multinomial inverse regression (MNIR)

The fitted object can be used to fit a forward regression that predicts a covariate from the low-dimensional SR projection of the counts.

In [10]:
X <- cbind(covars,zR)
colnames(X) <- c("Overall","zOverall","m")
fmR <- lm(Overall ~ zOverall + m, X)
summary(fmR)
Call:
lm(formula = Overall ~ zOverall + m, data = X)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5142 -0.5608  0.1370  0.6838  4.0842 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.402149   0.019292 176.348  < 2e-16 ***
zOverall    3.181332   0.041696  76.298  < 2e-16 ***
m           0.006737   0.001096   6.146 8.42e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9654 on 6163 degrees of freedom
Multiple R-squared:  0.4896,	Adjusted R-squared:  0.4894 
F-statistic:  2956 on 2 and 6163 DF,  p-value: < 2.2e-16

We can now predict Overall with new counts data

In [11]:
newX = as.data.frame(srproj(BR,counts[1:10,]))
colnames(newX) <-c("zOverall","m")
yhatdmrR <- predict(fmR, newX)
as.vector(yhatdmrR)
  1. 4.86419983555133
  2. 2.24484606535295
  3. 5.65307467024925
  4. 4.56844624548369
  5. 4.6684859314965
  6. 5.06290073040297
  7. 3.66278776218007
  8. 4.4746803535849
  9. 4.00369363965086
  10. 7.35463703886243

Same model but in Julia

Now let's try that in julia.

We need to pass the data to julia

In [12]:
j$command("import SparseArrays")
j$assign("covars",covars)
## there are probably more efficient ways to pass the sparse matrix, but JuliaCall doesn't do this automatically at the moment
j$assign("counts",as.matrix(counts))
j$command("counts = SparseArrays.sparse(counts)")
6166×2640 SparseArrays.SparseMatrixCSC{Float64,Int64} with 66459 stored entries:
  [11  ,    1]  =  1.0
  [20  ,    1]  =  1.0
  [43  ,    1]  =  1.0
  [63  ,    1]  =  1.0
  [80  ,    1]  =  1.0
  [87  ,    1]  =  1.0
  [88  ,    1]  =  1.0
  [97  ,    1]  =  1.0
  [141 ,    1]  =  1.0
  ⋮
  [1273, 2640]  =  1.0
  [1955, 2640]  =  1.0
  [2509, 2640]  =  1.0
  [2842, 2640]  =  1.0
  [3929, 2640]  =  1.0
  [4314, 2640]  =  1.0
  [4862, 2640]  =  1.0
  [5702, 2640]  =  1.0
  [6007, 2640]  =  1.0
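
As the comment above notes, densifying counts is wasteful for large problems. A possible workaround is to pass only the nonzero triplets; this is an untested sketch that assumes counts is a Matrix::dgCMatrix, whose summary() returns the (i, j, x) triplet representation:

trip <- summary(counts)            # data.frame with columns i, j, x
j$assign("I", as.integer(trip$i))  # row indices of the nonzero entries
j$assign("J", as.integer(trip$j))  # column indices
j$assign("V", as.numeric(trip$x))  # nonzero values
j$command(sprintf("counts = SparseArrays.sparse(I, J, V, %d, %d)", nrow(counts), ncol(counts)))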

Add parallel workers

In [13]:
j$command("using Distributed")
j$call("addprocs", nprocs)
  1. 2
  2. 3
  3. 4
  4. 5
  5. 6
  6. 7
  7. 8
  8. 9
  9. 10
  10. 11
  11. 12
  12. 13
  13. 14
  14. 15
  15. 16
  16. 17
  17. 18
  18. 19

Make our package available to all workers

In [14]:
j$command("import HurdleDMR; @everywhere using HurdleDMR")

Fit Distributed Multinomial Regression (DMR) in Julia

In [15]:
system.time(j$command("m = fit(DMR,@model(c ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
 22.542   0.236  44.993 

The above returns a lightweight wrapper that keeps essentially just the coefficients. To retain the entire regularization paths, use DMRPaths:

In [16]:
system.time(j$command("m = fit(DMRPaths,@model(c ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
  5.433   1.344  10.554 

Julia compiles each function on its first call, which makes one-off applications slower but repeated calls faster. To get a sense of the runtime without that fixed compilation cost, run the fit again:

In [17]:
system.time(j$command("m = fit(DMR,@model(c ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
  0.032   0.017   2.611 

On our machine, julia fits the DMR in about half the time R takes (compare the 'elapsed' entries above). The speed improvement is mostly due to memory being shared across the parallel workers.

Get AICc optimal coefficients

In [18]:
Bjulia <- j$eval("coef(m)")
dim(Bjulia)
  1. 2
  2. 2640

Get SR projection onto factors

In [19]:
zjulia <- j$eval("srproj(m,counts)")
dim(zjulia)
  1. 6166
  2. 2

Comparing zR to zjulia, we see that the estimates are roughly the same.

In [20]:
all.equal(zR, zjulia, check.attributes = FALSE)
'Mean relative difference: 0.08506275'

The differences are mostly due to differences in the default regularization paths.
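
If closer agreement matters, one can try harmonizing the path settings on both sides. The sketch below is hypothetical: the keyword names are assumptions we have not verified here, relying on dmr forwarding extra arguments to gamlr and on HurdleDMR forwarding keyword arguments to the underlying Lasso.jl path fits.

cl <- makeCluster(nprocs)                      # re-create the cluster stopped earlier
fits2 <- dmr(cl, covars, counts, nlambda=100)  # gamlr's path length (assumed forwarded)
stopCluster(cl)
j$command("m2 = fit(DMR, @model(c ~ 1 + Overall), covars, counts; nλ=100);")  # Lasso.jl's nλ (assumed)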

Multinomial inverse regression (MNIR)

The HurdleDMR package provides a general method, fit(CIR...), for fitting a Counts Inverse Regression (CIR); it fits both the backward and forward parts of the MNIR. For example:

In [21]:
j$command("using GLM")
system.time(j$command("mnir = fit(CIR{DMR,LinearModel},@model(c ~ 1 + Overall),covars,counts,:Overall);"))
   user  system elapsed 
  2.302   0.433   4.995 
In [22]:
j$eval("mnir.model.fwdm")
Julia Object of type LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}.
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
──────────────────────────────────────────────────────────────────────
      Estimate  Std. Error    t value  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────
x1  3.3504      0.0195938   170.993      <1e-99  3.31199     3.38882  
x2  3.15752     0.0413464    76.3675     <1e-99  3.07647     3.23857  
x3  0.00635687  0.00109462    5.80739    <1e-8   0.00421104  0.0085027
──────────────────────────────────────────────────────────────────────

The fitted model can be used for prediction as follows:

In [23]:
yhatdmrJ <- j$eval("predict(mnir,covars[1:10,:],counts[1:10,:])")
yhatdmrJ
  1. 4.86914303803309
  2. 2.49144101700483
  3. 5.87751149547541
  4. 4.59964634165681
  5. 4.65527195837653
  6. 4.95595640484482
  7. 3.70372254336368
  8. 4.39420942575794
  9. 3.9647081730177
  10. 7.26066717356227

Comparing the R and julia versions of the predicted values, they appear to be quite similar:

In [24]:
all.equal(yhatdmrR, yhatdmrJ, check.attributes = FALSE)
'Mean relative difference: 0.01893765'

Hurdle Distributed Multinomial Regression (HDMR)

Another advantage of the julia package is that it allows for text selection via HDMR. Here we specify the two parts of the model with two formulas:

In [25]:
system.time(j$command("m = fit(HDMR,@model(c ~ 1 + Overall, h ~ 1 + Overall),covars,counts);"))
   user  system elapsed 
  2.424   0.136  18.578 

A fitted HDMR involves two coefficient arrays: one for the positive-counts model (c ~ ...) and one for the hurdle-crossing (zeros) model (h ~ ...).

In [26]:
Cjulia <- j$eval("coef(m)")
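
JuliaCall should map the returned pair of coefficient arrays to an R list (an assumption; inspect str(Cjulia) on your setup), so each part can be examined separately:

cpos  <- Cjulia[[1]]  # coefficients of the positive-counts model (c ~ ...)
czero <- Cjulia[[2]]  # coefficients of the hurdle/zeros model (h ~ ...)
dim(cpos)
dim(czero)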

The projection onto factors now gives [zpos, zzero, m] instead of [z, m] as before.

In [27]:
Zjulia <- j$eval("srproj(m,counts,1,1)")
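
As a quick sanity check, the projection should now have three columns:

dim(Zjulia)  # expect 6166 x 3: [zpos, zzero, m]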

If we wish to run a CIR with HDMR instead of DMR, all we need to do is specify it in a very similar call to fit(CIR...):

In [28]:
system.time(j$command("cir = fit(CIR{HDMR,LinearModel},@model(c ~ 1 + Overall, h ~ 1 + Overall),covars,counts,:Overall);"))
   user  system elapsed 
  0.555   0.077   6.170 
In [29]:
j$eval("cir.model.fwdm")
Julia Object of type LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}.
LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
───────────────────────────────────────────────────────────────────────
      Estimate  Std. Error    t value  Pr(>|t|)    Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
x1   3.3231      0.0221108  150.293      <1e-99   3.27976     3.36645  
x2   0.0321774   0.0116384    2.76476    0.0057   0.00936205  0.0549928
x3   3.19298     0.0433034   73.7351     <1e-99   3.10809     3.27787  
x4  -0.0196436   0.0115597   -1.69932    0.0893  -0.0423045   0.0030174
x5   0.0291231   0.0128348    2.26907    0.0233   0.00396236  0.0542838
───────────────────────────────────────────────────────────────────────

The HDMR-based CIR model can be used to predict with new data

In [30]:
yhathdmr <- j$eval("predict(cir,covars[1:10,:],counts[1:10,:])")
yhathdmr
  1. 4.85312513033966
  2. 2.43974447171217
  3. 5.84408646734461
  4. 4.60401744136983
  5. 4.68318127773803
  6. 5.00909794714138
  7. 3.68596767649654
  8. 4.47843929232848
  9. 3.93817820313412
  10. 7.22910662898874

Comparing the three predicted values shows only minor differences in this toy example.

In [31]:
cbind(yhatdmrR,yhatdmrJ,yhathdmr)
     yhatdmrR yhatdmrJ yhathdmr
1    4.864200 4.869143 4.853125
2    2.244846 2.491441 2.439744
5    5.653075 5.877511 5.844086
11   4.568446 4.599646 4.604017
12   4.668486 4.655272 4.683181
13   5.062901 4.955956 5.009098
14   3.662788 3.703723 3.685968
15   4.474680 4.394209 4.478439
17   4.003694 3.964708 3.938178
18   7.354637 7.260667 7.229107

Kelly, Manela, and Moreira (2018) show, however, that the differences between DMR and HDMR can be substantial in some cases, especially when the counts data is highly sparse.

Please reference the paper for additional details and example applications.