Edge Cluster Ordering and Related Graphics

Uncovering Latent Informative Images

This page looks at visualizations that characterize subject transitions between response categories over time while on cancer therapy, and at gene expression heat maps. We present a heuristic that builds on hierarchical approaches and demonstrate its utility in ordering rows (typically subjects) in state sequence graphics, and in ordering rows and columns (typically distinct samples and genes) in heat maps. Similar methods, more detail and additional R code are described in a book chapter by me and my colleagues: Shankar S. Srinivasan, Li Hua Yue, Rick Soong, Mia He, Sibabrata Banerjee and Stanley Kotey, "Some Methods for Longitudinal and Cross-Sectional Visualization with Further Applications in the Context of Heat Maps", Chapter 19 in K. E. Peace et al. (eds.), Biopharmaceutical Applied Statistics Symposium [BASS], ICSA Book Series in Statistics, © Springer Nature Singapore Pte Ltd. 2018, https://doi.org/10.1007/978-981-10-7820-0_19. BASS is a non-profit organization headed by Dr Karl E. Peace that supports students pursuing diplomas in statistics.

Let’s start with the figure that follows. The graphics were produced with the TraMineR R package (Gabadinho et al. 2016), which comes with data collected in McVicar and Anyadike-Danes (2002) on transitions between education and employment states in Northern Ireland from 1993 to 1999. The package has considerable functionality, including descriptive and inferential analysis of sequences. We were particularly interested in the plots it constructs by stacking horizontal subject strips. Each strip shows one subject's sequence of education and employment states over time, with each state mapped to a different color. The left panel below shows the unsorted raw data. The right panel orders the data to bring out longitudinal as well as cross-sectional patterns somewhat more clearly. In the right panel, we see subject similarities on the employment (green), higher education (orange) and school (blue) states that are not apparent in the raw data. We seek to improve on the right-hand graphic with our new edge clustering heuristic. The improved graphic, with subject transitions ordered by edge clustering, appears further down this page.

Some Iconic Images

Partly to pique the reader’s interest and largely to demonstrate the effectiveness of our edge clustering heuristic, we will use images of the First Ladies Meeting at the White House after the 2016 US Election, Picasso’s Portrait of Dora Maar (1937) and Van Gogh’s Starry Night Over the Rhône. We need ‘known’ images because what constitutes a ‘right’ image for our visualizations is unknown. Starting with a known parameter and a known variation around this parameter is a standard approach in statistics. Observations are generated from this known setting, and various analytical methods, blinded to the generating mechanism, are used to estimate the parameter. The analyst then brings back the knowns and compares the various estimates against them for bias and error. We likewise need to test the appropriateness of our ordering heuristics against a known, as it may be hard to argue the rightness of an image using criteria assessed solely in a constructed heat map. We describe this dual parameter/estimate approach in our context next, starting with the First Ladies picture.

The First Ladies

What follows is the now iconic photograph of our First Ladies, often admired for doing good work while somehow staying above that contentious, partisan circus around them. It is clearly an informative image. It contains data. You can extract from this image, for each x and y pixel co-ordinate, three numeric values for the intensity of the red, blue and green colors.

Further, this image can be converted to a greyscale image (see the left image below), which then has a single numeric intensity value at each pixel. This is like gene expression data, which has rows for samples (the row number corresponds to the y-pixel co-ordinate), columns for genes (the column number corresponds to the x-pixel co-ordinate) and a numeric gene expression value very much like the numeric greyscale intensity. We remove all ordering in the greyscale image by randomly permuting it to get the image in the right panel below. R code to read in an image, convert it to greyscale, extract the numerical data and remove the ordering information is in the attached document.
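The greyscale and scrambling steps can be sketched in a few lines. The sketch below is in plain Python, not the R code from the attached document. It rests on several assumptions: a tiny synthetic array stands in for the photograph, the greyscale weights are the common BT.601 luma weights, and both rows and columns are permuted. The names `to_grey` and `scramble` are illustrative only.

```python
import random

# Tiny synthetic RGB "image": rows x columns of (red, green, blue) tuples.
# A real photograph would be read from a file; that step is omitted here.
rgb = [[(10 * r + c, 5 * r, 20 * c) for c in range(4)] for r in range(3)]

def to_grey(pixel):
    """Weighted sum of the three channels (BT.601 luma weights, an assumed choice)."""
    red, green, blue = pixel
    return 0.299 * red + 0.587 * green + 0.114 * blue

# One intensity per pixel, analogous to one expression value per sample/gene cell.
grey = [[to_grey(p) for p in row] for row in rgb]

def scramble(matrix, seed=0):
    """Randomly permute rows and columns; return the result and both permutations."""
    rng = random.Random(seed)
    row_perm = list(range(len(matrix)))
    col_perm = list(range(len(matrix[0])))
    rng.shuffle(row_perm)
    rng.shuffle(col_perm)
    scrambled = [[matrix[r][c] for c in col_perm] for r in row_perm]
    return scrambled, row_perm, col_perm

scrambled, row_perm, col_perm = scramble(grey)
```

Keeping `row_perm` and `col_perm` plays the role of the "known": an ordering method sees only `scrambled`, and its output can later be compared against the original arrangement.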

We applied edge ordering, as well as a competing technique, to the randomly permuted data. The R code to achieve this is in our work referenced earlier. Edge clustering recovers the First Ladies image split into 6 segments (left), which is still a lot better than the image recovered by the competing method (right).
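To make the scramble-then-recover workflow concrete, here is a minimal Python sketch. It does not implement the edge clustering heuristic, which is described in the book chapter. As a simple stand-in it uses greedy nearest-neighbour chaining of rows, and the "known image" is a hypothetical integer gradient. The function names are illustrative.

```python
def row_distance(a, b):
    """Squared Euclidean distance between two rows of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_row_order(matrix):
    """Stand-in ordering: start at the row farthest from all others,
    then repeatedly append the unused row closest to the last one placed."""
    n = len(matrix)
    total = lambda i: sum(row_distance(matrix[i], matrix[j]) for j in range(n))
    start = max(range(n), key=total)
    order, remaining = [start], set(range(n)) - {start}
    while remaining:
        last = matrix[order[-1]]
        # Ties are broken by the smaller row index to keep the result deterministic.
        nearest = min(remaining, key=lambda i: (row_distance(last, matrix[i]), i))
        order.append(nearest)
        remaining.remove(nearest)
    return order

# Known "image": an integer gradient, so neighbouring rows are similar.
known = [[10 * r + c for c in range(5)] for r in range(6)]

# Scramble the rows with a fixed permutation (the "known" we later compare against).
perm = [3, 0, 5, 1, 4, 2]
scrambled_rows = [known[i] for i in perm]

# Recover an ordering blind to `perm`, then compare with the known image.
order = greedy_row_order(scrambled_rows)
recovered = [scrambled_rows[i] for i in order]
print(recovered == known)
```

On a real photograph a chain like this can come back reversed or broken into segments, which is why the recovered images are judged against the known original.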

The attached document also has details and R code for the setting where each observation is a vector: we use the same image but retain all three color intensities, so that each pixel carries a vector of three intensities. The recovered image improves somewhat, splitting into 4 segments instead of 6. Inspired by the discombobulated First Ladies, we look at the work by Picasso in the next section.

Other Images Used

We had better luck with Picasso’s Portrait of Dora Maar. The recovered image was identical to the original except that the wall section on the left moved to the right. Couldn’t fix what Picasso did to the poor lady though!

In Van Gogh’s Starry Night Over the Rhône, the recovered image captures the shimmer of the lights on the Rhône and those two people, like all of us, insignificant, in the corner, in awe of it all.

The Educational and Employment States Data Revisited

We tried the edge heuristic on the data with which we started the discussion. That graphic is to the left and is an improvement over the ordering using the multidimensional scaling (MDS) heuristic. We note that a set of methods by Sakai et al. (dendsort: modular leaf ordering methods for dendrogram representations in R. F1000Research 2014, 3:177) provides orderings very similar to those produced by edge clustering. Details on these are in the BASS meeting 2017 presentation slide deck, which is attached here, as well as in the book chapter referenced earlier.

Application to Transitions Between Oncological States

We also looked at changes in states on cancer therapy under two induction regimens in a clinical trial. The induction regimens were expected to be similar, with a difference emerging during maintenance therapy. Oncological states in the longitudinal plots that follow are Complete Response (CR, 1), Very Good Partial Response (VGPR, 2), Partial Response (PR, 3), Stable Disease (SD, 4), Progressive Disease (PD, 5) and Death (6). The associated numbers reflect the ordinal nature of the states. Despite the anticipated similarity in induction, some differences can be seen in the graphic. Response was deeper in the Treatment A panel, with more Complete Response and Very Good Partial Response than in Treatment B, where there was more Stable Disease. Many other features that are usually summarized separately in multiple data tables, such as time to response, duration of response, time to progressive disease and time to death, are brought out, to a degree, in this one display.
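The ordinal coding above, and the way a single state strip carries summaries such as time to response, can be sketched in Python. The subject sequences are hypothetical, invented for illustration. Treating "response" as PR or better is an assumption. The simple sort shown is not the edge clustering heuristic.

```python
# Ordinal codes as given in the text: lower numbers are deeper responses.
STATE_CODES = {"CR": 1, "VGPR": 2, "PR": 3, "SD": 4, "PD": 5, "Death": 6}

def best_response(sequence):
    """Deepest state reached, i.e. the one with the smallest ordinal code."""
    return min(sequence, key=STATE_CODES.__getitem__)

def time_to_response(sequence, threshold="PR"):
    """Index of the first visit at `threshold` or deeper, or None if never reached.
    Defining response as PR or better is an assumption of this sketch."""
    limit = STATE_CODES[threshold]
    for visit, state in enumerate(sequence):
        if STATE_CODES[state] <= limit:
            return visit
    return None

# Hypothetical subjects, one state per assessment visit (illustrative data only).
subjects = {
    "S1": ["SD", "PR", "VGPR", "CR", "CR", "PD"],
    "S2": ["SD", "SD", "SD", "PD", "Death", "Death"],
    "S3": ["PR", "VGPR", "VGPR", "VGPR", "CR", "CR"],
}

def sort_key(name):
    """Order strips by best response, then by how early a response was reached."""
    seq = subjects[name]
    ttr = time_to_response(seq)
    return (STATE_CODES[best_response(seq)], len(seq) if ttr is None else ttr)

ordered = sorted(subjects, key=sort_key)
print(ordered)  # ['S3', 'S1', 'S2']
```

Stacking the strips in `ordered` sequence puts the deepest and earliest responders together, a crude version of what a row-ordering heuristic aims for.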