Benjamin LeRoy

Doctoral Student
Statistics and Data Science
Carnegie Mellon University


Contact: bpleroy@stat.cmu.edu



About


I am a sixth-year PhD student in the Statistics and Data Science Department at Carnegie Mellon University. I am currently working under Professor Chad Schafer on projects involving uncertainty quantification for complex high-dimensional objects. I am interested in studying high-dimensional statistics and am a member of the Stat ML (Statistics and Machine Learning) group, the STAMPS (Statistical Methods for the Physical Sciences) group, the History of Statistics reading group, and the Data Science Group at Carnegie Mellon.

I was born and raised in Santa Rosa, California, a one-and-a-half-hour drive north of San Francisco on the coast. I transferred from Santa Rosa Junior College (with AAs in Math and Economics) to the University of California, Berkeley, where I obtained bachelor's degrees in Statistics and Mathematics. During my undergraduate program, I visited Carnegie Mellon to participate in the Statistics Department's Summer Undergraduate Research Experience (SURE).

Attention Undergrads: I have a potential undergraduate research project focused on time series and tropical cyclones in R. Please feel free to reach out to me if either of these topics sounds interesting (as of August 15th, 2020).

News / Upcoming Events


Research


Conformal Prediction for Simulation Models
We propose an approach for conformal-based prediction regions when one has a simulator and observes exchangeable pairs (X, Y). We use split conformal and nested conformal inference, together with tools from set estimation, to provide prediction regions even for complex outcome spaces equipped with only a distance measure and some notion of “small ball” structure.

Conformal Prediction for Simulation Models,
Benjamin LeRoy and Chad Schafer. ICML Workshop on “Distribution-free Uncertainty Quantification” July 2021. Paper: local version (pre-publication)
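For readers new to the area, the split conformal idea mentioned above can be sketched in a few lines. This is a minimal textbook illustration for a real-valued outcome with absolute-residual scores, not the method of the paper (which targets complex outcome spaces and uses a simulator). The function names, the least-squares line used as the point predictor, and the 50/50 data split are all assumptions made for illustration.

```python
import math
import random


def split_conformal_radius(cal_scores, alpha):
    """Conformal quantile of calibration scores (finite-sample corrected)."""
    n = len(cal_scores)
    # Rank of the order statistic that yields coverage of at least 1 - alpha.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def fit_line(xs, ys):
    """Ordinary least-squares line; a stand-in for any point predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def split_conformal_interval(xs, ys, x_new, alpha=0.1, seed=0):
    """Fit on one half of the data, calibrate on the other half."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, cal = idx[:half], idx[half:]

    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])

    # Conformal scores: absolute residuals on the held-out calibration half.
    scores = [abs(ys[i] - (slope * xs[i] + intercept)) for i in cal]
    radius = split_conformal_radius(scores, alpha)

    center = slope * x_new + intercept
    return center - radius, center + radius
```

Calling `split_conformal_interval(xs, ys, x_new, alpha=0.1)` returns an interval that covers a new exchangeable outcome with probability at least 90%, regardless of how well the line fits; a poor fit only makes the interval wider.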


Practical Local Conformal Inference
This is ongoing work on defining local partitions of the X space to use with local conformal inference, in order to get as close as possible to conditional conformal inference. We utilize recent work on model diagnostics to partition the X space, which allows the approach to be applied even when conditional density estimate (CDE) fits are poor (it is also applicable to high-dimensional X spaces).

MD-split+: Practical Local Conformal Inference in High Dimensions,
Benjamin LeRoy* and David Zhao* (*equal contribution). ICML Workshop on “Distribution-free Uncertainty Quantification” July 2021. ArXiv: 2107.03280
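As a rough illustration of local conformal inference, the sketch below calibrates a separate conformal quantile within each cell of a partition of the X space. The partition used here (`sign_group`, a split at zero) is a hypothetical placeholder; MD-split+ instead builds its partition from model diagnostics, which this sketch does not attempt to reproduce. The function names are also assumptions.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of conformal scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def local_conformal_radii(cal_x, cal_scores, assign_group, alpha):
    """Calibrate one conformal quantile per cell of a partition of X."""
    by_group = defaultdict(list)
    for x, score in zip(cal_x, cal_scores):
        by_group[assign_group(x)].append(score)
    return {g: conformal_quantile(s, alpha) for g, s in by_group.items()}


def sign_group(x):
    """Hypothetical two-cell partition of a one-dimensional X space."""
    return "negative" if x < 0 else "nonnegative"
```

A prediction region for a new point then uses the radius of the cell that point falls into, so regions can be tighter where the model fits well and wider where it fits poorly, while coverage holds within each cell.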


Tropical Cyclone Prediction Bands
Using track data for a little under 1000 storms from the National Oceanic and Atmospheric Administration (NOAA), we develop a fully data-driven statistical process for creating prediction bands around paths. In a parametric bootstrap framework, we first simulate potential curves from a noisy extension to a linear model and then leverage statistical depth and geometric structures to create different versions of prediction bands. This work is joint with Niccolò Dalmasso and Robin Dunn.

A Flexible Pipeline for Prediction of Tropical Cyclone Paths,
Niccolò Dalmasso*, Robin Dunn*, Benjamin LeRoy*, Chad Schafer (* equal contribution). ICML Workshop (RESEARCH Track) “Climate Change: How can AI Help?” June 2019. ArXiv: 1906.08832


View the work on GitHub, as well as the R package: TCpredictionbands.

Additional Research
A novel record linkage interface that incorporates group structure to rapidly collect richer labels, Kayla Frisoli, Benjamin LeRoy, Rebecca Nugent. In: 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Paper.

Immune cellular homeostasis in early life is determined by genetic variants of cellular production and turnover, Tania Dubovik, Elina Starosvetsky, Benjamin LeRoy, Rachelly Normand, Yasmin Admon, Ayelet Alpert, Yishai Ofran, Max G'Sell, Shai S. Shen-Orr, bioRxiv: 256073.

Software

EpiCompare (R Package)
The goal of EpiCompare is to provide the epidemiology community with easy-to-use tools that encourage comparing and assessing epidemics and epidemiological models in a "Time-Free" manner. This package gives the user the ability to compare epidemics and epidemiological model types (across both the "Agent"/"Aggregate" paradigms and the specific models). All tools attempt to adhere to tidyverse/ggplot2 style to enhance ease of use.

This package is in collaboration with Shannon Gallagher, Ph.D. at NIH's National Institute of Allergy and Infectious Diseases.

View the package website at skgallagher.github.io/EpiCompare, as well as the R package on GitHub.

ggDiagnose (R package)
A visualization package that provides functions for creating ggplot2-based visualizations that can replace the base plot function when diagnosing and visualizing R objects. Specifically, this package aims to provide visualization tools for model objects and other non-traditional (read: non-data.frame) objects.

This package came out of wanting to teach summer research students ggplot2 and finding that the visual diagnostic tools for R models were all in base plot.

View package on github (examples).

This package was developed tangentially to packages like ggfortify, and although I still think it provides a better framework for students and data scientists, ggfortify is the one to use.

Talks/Poster Presentations

Teaching


Instructor
  • Summer 2019: 36-350, Statistical Computing, Class Documents & Syllabus
    • an undergraduate course on core programming concepts using R, from data structures, functions, iteration, debugging, and abstraction to writing code that assists in statistical analysis (visualization, modeling, version control, etc.).
    • my main contributions: (1) introduced coding style and best coding practices throughout the course, (2) presented tidyverse-style coding (ggplot2, dplyr, tidyr, ...), (3) introduced high-level computing concepts like object-oriented programming and how to make packages in R, (4) provided high-level overviews of the "split-apply-combine" paradigm, parallel computing, and deep learning in R.
  • Summer 2017: 36-315, Statistical Graphics and Visualization, Syllabus
    • an undergraduate class on best visualization practices (primarily in ggplot2) and visualization theory.
    • my main contributions: (1) developed the course to be taught for the first time over the summer, including revamping the course's lectures and assignments, and (2) introduced more visual theory to the course.
Advising and Mentoring
  • Fall 2018: Data Science Initiative (DSI) Fellow
  • Summer 2018: Summer Undergraduate Research Experience (SURE) Graduate Advisor
    • Advised a team of undergraduates analyzing trends in human trafficking, github
Teaching Assistance & Course Development
  • Fall 2018: 46-926/927 MSCF's Statistics and Machine Learning I/II, Syllabus 926 / Syllabus 927
    • Head TA; Assisted in moving class from R to Python.
  • Fall 2017: 36-315, Statistical Graphics and Visualization, Syllabus
    • Head TA; Assisted in assignment and test development.
Teaching Assistant
  • Fall 2020: 36-617, Applied Linear Models, Syllabus
  • Spring 2020: 36-402, Advanced Methods for Data Analysis, Syllabus
  • Fall 2019: 46-668, Special Topics: Text Analysis, Syllabus
  • Spring 2019: 36-618, Experimental Design and Time Series, Syllabus
  • Fall 2018: 36-705, Intermediate Statistics, Syllabus
  • Spring 2016: 36-315, Statistical Graphics and Visualization, Syllabus
  • Fall 2016: 36-401, Modern Statistics

Code-Centric Teaching Resources:

Here is a link to some TA resources that I have helped develop and some potentially helpful resources for CMU Statistics students working on our servers.

Memberships


  • ASA: American Statistical Association
  • IEEE: Institute of Electrical and Electronics Engineers
    • IEEE-CIS: IEEE's Computational Intelligence Society

CV


CV: Download (updated 6 August 2021)