Why Microsoft could be the key to prevent and cure Cancer

Technology Post

If you have not seen Richard Resnick’s “Welcome to the Genomic Revolution” on TED, you really should. It will inspire you and help you catch up with a medical revolution that will change the world and may just save your life. Allow me a few moments to provide some background, so that the topic of this blog post will not seem like quackery.

The human body is built from four key classes of molecule: nucleic acids, proteins, carbohydrates, and lipids. Nucleic acids carry the instructions for creating and growing cells, and they are where mutations happen. Yup, that’s the DNA and RNA stuff.

DNA at its core is not too complex. There are only 4 components: Adenine (A), Thymine (T), Guanine (G) and Cytosine (C). Only A with T and G with C bind nicely to each other in pairs, which is the main reason that figuring out DNA is actually simple.


The funky DNA helix chain is a sequence of our friends ATGC glued together. Each matched combo is a “base pair” (BP). So a sequence could look something like this in the two strands:

  • ATCGATTGAGCTCTAGCG
  • TAGCTAACTCGAGATCGC
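
The pairing rule is simple enough to write down in a few lines. Here is a minimal Python sketch (the names `PAIRS` and `complement` are my own, purely for illustration) that derives the second strand above from the first, position for position:

```python
# Base pairing rule from the text: A binds T, G binds C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the partner strand, one paired base per position."""
    return "".join(PAIRS[base] for base in strand)

top = "ATCGATTGAGCTCTAGCG"
bottom = complement(top)
print(bottom)  # TAGCTAACTCGAGATCGC
```

Applying `complement` to the second strand gives back the first, which is exactly why knowing one strand is enough to know both.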


Now, thanks to some amazing technology over the last few years and the completion of the Human Genome Project, the medical world is about to turn healthcare upside down, if it gets a little help from Microsoft.

Where the complexity of DNA comes in is not in the biology, since there are only 4 basic elements; it is in the math. The cure will occur in 5 phases:

  1. Data capture of DNA information on a broad enough scale to yield statistically relevant data points. This requires a high volume of sequenced DNA.
  2. Analysis that looks for patterns and mutations and can answer the question “what is involved in, and influences, the cancer?”
  3. Mapping the DNA and associated genes to the protein level, where we can understand the role and biological function of each protein in the process and which drugs or bio-processes can influence it.
  4. Drug and bio-process response modeling.
  5. Development of clinical remediation options.

There has been amazing progress in the technology to create a DNA sequence, step 1. The 30-second “Coles Notes” version of how this happens is this. They take the DNA from the cells and melt it, so that the double-stranded DNA unwinds and separates into single strands. The sample is split into 4 groups, where 4 different chemical reactions break the strands into fragments and attach a “marker” (either radioactive or chemical) tied to one of the four bases (ATGC). The fragments are then sieved by size through a gel using electrophoresis, which is fine enough to separate fragments that differ in length by a single nucleotide. The four groups are run side by side and viewed with X-ray, ultraviolet light, etc., showing which of ATGC sits at each position in the sequence. This process, which used to take weeks of effort for a small number of base pairs, can now be done at the rate of billions of BP per day in a single machine using the latest technology.


So the bottom line is that today you can have your entire genome sequenced in a day, all 3 billion base pairs of it, and likely store it on your phone.
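
As a rough sanity check on the phone claim, here is a back-of-the-envelope sketch in Python. It assumes the simplest possible encoding, 2 bits per base because there are only 4 letters, and ignores quality scores, metadata and compression, so treat it as an illustration rather than a real file format:

```python
BASE_PAIRS = 3_000_000_000  # roughly 3 billion, the round figure used in the text
BITS_PER_BASE = 2           # assumption: 4 possible letters fit in 2 bits

total_bits = BASE_PAIRS * BITS_PER_BASE
total_megabytes = total_bits / 8 / 1_000_000

print(total_bits)       # 6000000000
print(total_megabytes)  # 750.0
```

Under those assumptions one genome's letters come to about 750 MB, which fits on a phone.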

So now that we can produce vast quantities of sequencing data, what does it mean? We now need to process this data. (To answer the processing question we need to know a bit about genes.) A gene is a “locatable region of genomic sequence”, i.e. when you look at a DNA strand, there are places on that strand where a sequence of base pairs defines a certain attribute (perhaps your predisposition to a certain type of cancer).

The total complement of genes in a cell may be stored on one or more chromosomes. The good news for humans is that we only have 23 pairs of chromosomes: 22 numbered pairs plus the sex chromosomes, X and Y.

However, inside those chromosomes are a few base pairs, about 3 billion of them.

| Chromosome | Genes | Total base pairs |
|---|---|---|
| 1 | 4,220 | 247,199,719 |
| 2 | 1,491 | 242,751,149 |
| 3 | 1,550 | 199,446,827 |
| 4 | 446 | 191,263,063 |
| 5 | 609 | 180,837,866 |
| 6 | 2,281 | 170,896,993 |
| 7 | 2,135 | 158,821,424 |
| 8 | 1,106 | 146,274,826 |
| 9 | 1,920 | 140,442,298 |
| 10 | 1,793 | 135,374,737 |
| 11 | 379 | 134,452,384 |
| 12 | 1,430 | 132,289,534 |
| 13 | 924 | 114,127,980 |
| 14 | 1,347 | 106,360,585 |
| 15 | 921 | 100,338,915 |
| 16 | 909 | 88,822,254 |
| 17 | 1,672 | 78,654,742 |
| 18 | 519 | 76,117,153 |
| 19 | 1,555 | 63,806,651 |
| 20 | 1,008 | 62,435,965 |
| 21 | 578 | 46,944,323 |
| 22 | 1,092 | 49,528,953 |
| X (sex chromosome) | 1,846 | 154,913,754 |
| Y (sex chromosome) | 454 | 57,741,652 |
| Total | 32,185 | 3,079,843,747 |

In those 3 Billion pairs, we have some challenges.

  • The direction you “read” the sequence matters.
  • Groups of pairs have a sequence reading frame, which means that depending on where in the strand you start reading, you get a different amino acid. For example (borrowed from Wikipedia), the string GGGAAACCC can be read as:
    • GGG AAA CCC (starting at the first base)
    • G GGA AAC CC (starting at the second base)
    • GG GAA ACC C (starting at the third base)
    • Every sequence can be read in three reading frames, each of which will produce a different amino acid sequence.
  • We need to be able to recognize patterns in the strands and across strands. It would be great if the “Cancer” gene would just show up in the DNA labeled and in one place. However, whatever the causes or predispositions to cancer are, they are ALL OVER THE PLACE. (As an example, rearrangement of DNA between chromosomes 9 and 22 is associated with several types of leukemia.) It could be across multiple strands, multiple chromosomes and perhaps even outside the chromosome as well.
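
To make the reading frame point concrete, here is a small Python sketch (the function name `reading_frames` is just for illustration) that splits a strand into complete three-letter groups from each of the three possible starting positions, dropping any leftover bases at the ends:

```python
def reading_frames(seq: str) -> list[list[str]]:
    """Return the complete 3-letter groups for start offsets 0, 1 and 2."""
    frames = []
    for offset in range(3):
        # Step through the strand three bases at a time from this offset,
        # stopping before any incomplete group at the end.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames

for frame in reading_frames("GGGAAACCC"):
    print(frame)
# ['GGG', 'AAA', 'CCC']
# ['GGA', 'AAC']
# ['GAA', 'ACC']
```

Same nine letters, three different sets of triplets, so any pattern search has to either know the right frame or try all three.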

Hence I would state today that the magnitude of this “math” and data processing problem is best solved by Microsoft.

Let’s simply assume that for the next 2 years every person in British Columbia, Canada (population 4.5 million) who is diagnosed with cancer (22,000 each year) contributes their DNA to a study, but let’s just try to find the genetic patterns in the top four cancers. So out of the 44,000 DNA samples we have 27,500 covering:

  • 6,000 Breast Cancer cases
  • 6,000 Lung Cancer cases
  • 9,000 Colon Cancer cases
  • 6,500 Prostate Cancer cases
  • Total: 27,500 new cases of the most common cancers

So let’s do a complete sequence for every case (and perhaps family members as well), resulting in approximately 83 trillion base pairs, or perhaps 10 times that if we pull familial and asymptomatic cell DNA as well. Now go look for patterns in those 830 trillion. Bases are read three at a time, and each triplet codes for one of only 20 amino acids, so the number of units to search for patterns drops to about 277 trillion if you guess right on the start position, and stays near 830 trillion if you have to try all three. We can start at the macro layer and see if there are patterns in the 32,000 genes to simplify the problem set, but more than likely it will be combinations of patterns that will be the cause. Is it too complex? No. It is complex but it is finite.
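
The base pair totals are easy to reproduce. A quick Python sketch, assuming a round 3 billion base pairs per genome and the case counts listed above (the variable names are mine):

```python
cases = {"breast": 6_000, "lung": 6_000, "colon": 9_000, "prostate": 6_500}
BASE_PAIRS_PER_GENOME = 3_000_000_000  # round figure used in the text

total_cases = sum(cases.values())
patient_bp = total_cases * BASE_PAIRS_PER_GENOME  # one genome per case
extended_bp = patient_bp * 10                     # with familial and asymptomatic DNA

print(total_cases)         # 27500
print(patient_bp / 1e12)   # 82.5  (trillion base pairs)
print(extended_bp / 1e12)  # 825.0 (trillion base pairs)
```

The text rounds 82.5 trillion up to 83 before multiplying by ten, which is where the 830 trillion figure comes from.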

Curing and preventing cancer is not far-fetched at all. Here is why:

  • Proteins are involved in everything cells do, including cancer processes.
  • Cancer absolutely has a genetic component.
  • Drugs work by influencing the actions of proteins.
  • Drugs are metabolized by proteins.
  • The genetic components of Cancer CAN be mediated by drugs.

So what do we do then?

What is common in every step of the process is data. Masses of data. Some is raw, some has metadata and some is in the form of research papers published by thousands of researchers over the years. It is both structured and unstructured, and technology is required to turn that data into meaningful clinical information quickly.

The scale of the processing required to address this type of problem set is massive. While I have forgotten most of my university combinatorics classes, I do know that working through the combinations and permutations of an 830-trillion-piece dataset is not a job for calculator.exe on a PC. It will require massively parallel computing technology, tens of thousands of CPUs, massive databases, data mining, expert systems and perhaps pattern matching technology yet to be invented for this purpose. Microsoft is the one company globally with the massive cloud datacenters, High Performance Computing (massively parallel) and Parallel Data Warehouse technology, and research resources not only to make the next step in the cure possible, but to collaboratively connect the world’s cancer researchers to do it.

Microsoft Research has already created and publicly released tools like FaST-LMM. (Factored Spectrally Transformed Linear Mixed Models is a program for performing genome-wide association studies on massive data sets.) Microsoft has the research capability to develop the necessary tools not only to process and analyze research data but also to pull together the existing knowledge of prior research results and help researchers apply it in context.

For raw processing capability, Microsoft has demonstrated more than 1 petaflop (a petaflop is a measure of processing speed: a thousand trillion floating point operations per second) running a single problem set across more than 1,200 nodes on Windows HPC Server. This is among the best in the world and will continue to go up. It is the ability to scale to any data crunching requirement that will allow researchers to quickly investigate, model and test ideas.

The raw data generated from a single end-to-end DNA sequencing run is about 7 terabytes. Now multiply that by our sample group of 440,000 runs. That’s about 3 million terabytes of data. To provide fast responses to research or clinical queries, or to expedite data mining that looks for outliers, you need massively parallel database technology. Again, Microsoft has just brought this technology to market with SQL Server Parallel Data Warehouse.
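
For anyone who wants to check the storage math, here it is as a short Python sketch. The 7 terabytes per run and the 440,000 runs are the figures from the paragraph above; the conversion uses decimal units (1 exabyte = 1,000,000 terabytes):

```python
TB_PER_RUN = 7   # raw data per sequencing run, from the text
RUNS = 440_000   # sample group size, from the text

total_tb = TB_PER_RUN * RUNS
total_eb = total_tb / 1_000_000  # decimal units: 1 EB = 1,000,000 TB

print(total_tb)  # 3080000
print(total_eb)  # 3.08
```

So "3 million terabytes" is a little over 3 exabytes of raw data.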

Lastly, you need to run all this software somewhere. Believe it or not, building out a secure datacenter that could house this amount and level of technology is not just a function of enough investment. A datacenter to host this would be in the 200 megawatt power range. There are not too many places on the globe that have a spare 200 MW available on the grid at any cost. However, Microsoft knows where they all are and has already been busy thinking about the next monster datacenter builds. I know for certain of a high security location in Canada that would be ideal and ready for a secure datacenter build quickly.

Sure, lots of other companies have a portion of the solution, but Microsoft can actually bring all of the required components today, and it would also engage 90,000 Microsoft people in the fight. Nothing inspires or drives performance more than a big challenge to do something that will change the world (again).

With a little encouragement, Microsoft could be the next critical piece of the puzzle to help the thousands of researchers prevent and cure Cancer.


1 Response to Why Microsoft could be the key to prevent and cure Cancer

  1. Lorna Cooper says:

    Wow…. mind boggling and potentially miraculous! I hope I live long enough to see this work consistently and well. – LC
