How does DeepVariant construct RGB images from DNA sequences?

Question

DeepVariant is a pipeline to call genetic variants from DNA sequencing data.
A major step, before feeding the CNN, is to translate these DNA sequences into images. It's unclear why and how Google constructs the RGB images from the DNA data. Obviously, DNA is a string over an alphabet with the characters: {A, T, C, G}.
Even the source code of their unit tests does not make the mapping easy to follow.
In their figure from the paper, A is Red, C is Green, G is Blue, and T is Yellow (G+R), but it is still unclear how they construct the 3xNxN image.

EDIT, from Google's blog:

In this article we will show the six channels in a row, but in
DeepVariant they are encoded as six layers in the third dimension,
giving each tensor a shape of (100, 221, 6) corresponding to (height,
width, channels). The variant in question is always in the center of
each pileup image, here marked with a small line at the top.
Channels are shown in greyscale below in the following order:
Read base: different intensities represent A, C, G, and T.
Base quality: set by the sequencing machine. White is higher quality.
Mapping quality: set by the aligner. White is higher quality.
Strand of alignment: Black is forward; white is reverse.
Read supports variant: White means the read supports the given
alternate allele, grey means it does not.
Base differs from ref: White means the base is different from the
reference, dark grey means the base matches the reference.
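
As a rough illustration of the layout the blog describes, here is a minimal sketch of an empty (height, width, channels) pileup tensor built from nested lists. The channel labels and the `empty_pileup` helper are hypothetical names chosen for this example, not identifiers from DeepVariant.

```python
# Shape quoted from the blog: (height, width, channels).
HEIGHT, WIDTH, CHANNELS = 100, 221, 6

# Hypothetical labels for the six channels, in the order the blog lists them.
CHANNEL_NAMES = [
    "read_base",
    "base_quality",
    "mapping_quality",
    "strand",
    "supports_variant",
    "differs_from_ref",
]


def empty_pileup():
    """Return a HEIGHT x WIDTH x CHANNELS tensor of zeros as nested lists."""
    return [[[0] * CHANNELS for _ in range(WIDTH)] for _ in range(HEIGHT)]


tensor = empty_pileup()
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))

# With an odd width, the middle column is well defined; the blog says the
# candidate variant always sits there.
center_col = (WIDTH - 1) // 2
```

Each row would hold one read (or the reference), and each position in that row carries six values, one per channel.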

SmallChess · Accepted Answer

Actually, the paper makes it clear how they did it. You just have to read the supplementary materials more closely.

In their figure from the paper, A is Red, C is Green, G is Blue, and T
is Yellow (G+R), but it is still unclear how they construct the 3xNxN image.

In RGB, each channel is an NxN image, so three channels give you 3xNxN. The red channel encodes the nucleotide bases, the green channel encodes the quality scores, and the blue channel encodes the strand information.
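
To make the three-plane layout concrete, here is a minimal sketch that turns a couple of toy reads into separate red, green and blue planes. The base intensities are the ones quoted in this answer; the green and blue planes hold raw, unscaled values purely for illustration, and `to_planes` is a hypothetical helper, not DeepVariant code.

```python
# Base intensities as quoted in this answer.
BASE_TO_RED = {'A': 250, 'G': 180, 'T': 100, 'C': 30}


def to_planes(reads):
    """Build [red, green, blue] planes, one row per read.

    reads: list of (sequence, qualities, is_forward) with equal-length
    sequences. Scaling to 0-255 is deliberately omitted for green and blue.
    """
    red = [[BASE_TO_RED.get(b, 0) for b in seq] for seq, _, _ in reads]
    green = [list(quals) for _, quals, _ in reads]  # raw quality scores
    blue = [[1 if fwd else 0] * len(seq) for seq, _, fwd in reads]  # strand flag
    return [red, green, blue]


reads = [
    ("ACGT", [40, 30, 20, 10], True),
    ("ACTT", [40, 40, 40, 40], False),
]
planes = to_planes(reads)  # 3 planes x 2 reads x 4 positions
```

The toy image here is 2x4 rather than square, but the idea is the same: the first index selects the channel, the second the read, and the third the position.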

Obviously, DNA is a string over an alphabet with the characters: {A,
T, C, G}.

This is easy, and there are lots of ways. You could use a one-hot encoding, or do what DeepVariant did:
def get_base_color(base):
    base_to_color = {'A': 250, 'G': 180, 'T': 100, 'C': 30}
    return base_to_color.get(base, 0)
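
For comparison, the one-hot alternative mentioned above could look like the sketch below. This is not what DeepVariant does; the `one_hot` and `encode_sequence` names and the A, C, G, T ordering are choices made for this example.

```python
BASES = "ACGT"  # arbitrary but fixed ordering for the four channels


def one_hot(base):
    """Return a 4-element vector with a 1 in the slot for `base`.

    Unknown symbols such as 'N' map to all zeros.
    """
    return [1 if base == b else 0 for b in BASES]


def encode_sequence(seq):
    """Encode a DNA string as a list of one-hot vectors."""
    return [one_hot(b) for b in seq]


encoded = encode_sequence("GAN")
```

One-hot encoding needs four channels for the bases alone, whereas mapping each base to a single intensity keeps the bases in one channel and leaves the other two free for quality and strand.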

0x90 · Answer

Based on the supplementary material as mentioned in a comment by Devon Ryan:

The second phase of DeepVariant encodes the reference and read support
for each candidate variant into an RGB image. The pseudo-code for this
component is shown below; it contains all of the key operations to
build the image, leaving out for clarity error handling, code to deal
with edge cases like when variants occur close to the start or end of
the chromosome, and the implementation of non-essential and/or obvious
functions.

Here is the main function:
WIDTH = 221
HEIGHT = 100

def create_pileup_images(candidate_variants):
    for candidate in candidate_variants:
        for biallelic_variant in split_into_biallelics(candidate):
            # Center the window on the variant.
            start = biallelic_variant.start - (WIDTH - 1) // 2
            # The pseudo-code as copied reads "WIDTH - span_start", but
            # span_start is never defined; presumably the window covers
            # WIDTH bases beginning at start.
            end = start + WIDTH
            ref_bases = reference.get_bases(start, end)
            image = Image(WIDTH, HEIGHT)
            # The top rows hold the reference; reads are added below them.
            row_i = fill_reference_pixels(ref_bases, image)
            for read in reads.get_overlapping(start, end):
                if row_i < HEIGHT and is_usable_read(read):
                    add_read(image, read, row_i)
                    row_i += 1
            yield image

def fill_reference_pixels(ref, image):
    for row in range(5):
        for col in range(WIDTH):
            alpha = 0.4
            ref_base = ref[col]
            red = get_base_color(ref_base)
            green = get_quality_color(60)  # The reference is high quality
            blue = get_strand_color(True)  # The reference is on the positive strand
            image[row, col] = make_pixel(red, green, blue, alpha)
    return 5
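
The pseudo-code above leaves several helpers undefined, so here is a minimal runnable sketch of the reference-row step. The bodies of `get_quality_color`, `get_strand_color` and `make_pixel` are assumptions made for this example (a capped linear quality scale, two arbitrary strand intensities, and alpha as a plain multiplier), and a nested list stands in for the `Image` class.

```python
WIDTH = 221
HEIGHT = 100
REF_ROWS = 5  # number of rows the pseudo-code reserves for the reference


def get_base_color(base):
    # Base intensities as quoted in the accepted answer.
    return {'A': 250, 'G': 180, 'T': 100, 'C': 30}.get(base, 0)


def get_quality_color(quality, cap=40):
    # Assumption: linear scaling of the quality score, capped at `cap`.
    return int(254 * min(quality, cap) / cap)


def get_strand_color(on_positive_strand):
    # Assumption: any two clearly distinct intensities would do.
    return 70 if on_positive_strand else 240


def make_pixel(red, green, blue, alpha):
    # Assumption: alpha simply scales every channel.
    return (int(alpha * red), int(alpha * green), int(alpha * blue))


def fill_reference_pixels(ref, image):
    """Paint the first REF_ROWS rows with the reference bases."""
    for row in range(REF_ROWS):
        for col in range(WIDTH):
            image[row][col] = make_pixel(
                get_base_color(ref[col]),
                get_quality_color(60),   # the reference is high quality
                get_strand_color(True),  # the reference is on the positive strand
                alpha=0.4,
            )
    return REF_ROWS


# A nested list of (r, g, b) tuples stands in for the Image class.
image = [[(0, 0, 0)] * WIDTH for _ in range(HEIGHT)]
ref = "ACGT" * 56  # toy reference, at least WIDTH bases long
next_row = fill_reference_pixels(ref, image)
```

After the call, `next_row` is the first row available for reads, which is how `create_pileup_images` uses the return value.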
