Bioinformatics Asked by xamax on January 3, 2021
I’m not sponsored or anything, just interested in the challenge set by the makers of the Netflix series "Biohackers" to decipher their DNA code.
They encoded the video file of the first episode of "Biohackers" as DNA code and said that anyone who can decode it can watch it (without Netflix).
Here’s their page: https://biohackersnetflix.com with description and download for the DNA sequence file.
(I don’t know whether the page is only in German or whether you can translate it. If you have questions about it, ask me.)
The file is about 550 MB in size and contains 3,882,771 lines (not in FASTA format). Every line has a length of 147 characters including primers at both ends (Illumina?).
Here are the first 5 lines:
ACACGACGCTCTTCCGATCTCTCCCAGGGACAAAGGTTCTGCATTTGCAGCAAGACTCCTGTAGTGCTGCAGATTCTCTGGTTGGATAGTACGGCGTACATTTCTGTATTGTAGCACCATGGGGTAGATCGGAAGAGCACACGTCT
ACACGACGCTCTTCCGATCTTAAGGCTTCGTAACAGATATTCTATATCGTCACATTGGTCTGAAGGAAGTCGCCTATAATCGCTCCTCTGTTTTTTAAAACTGCTATGGACCCGCTGTTCGGTGGAGATCGGAAGAGCACACGTCT
ACACGACGCTCTTCCGATCTCATGGTATAAGTGTTAAGGGTAATAACCACCTACCCCCCTCATTGCTCGTTTTTCCTGGAACCTTAACATTCGCAATAGCTAGCTGTTTCCTAGTAGAACCAAGGAGATCGGAAGAGCACACGTCT
ACACGACGCTCTTCCGATCTAGGATGTAGTCACAGGTCATTGTCATTAACTCAACCGAGGACATAACACTAAGTCCCACTAGGCCTGGATTCTCTAACGCGGTCTCTCTATTGGGGGAAGGGGTGAGATCGGAAGAGCACACGTCT
ACACGACGCTCTTCCGATCTTCTGGTAAGGCGGGTTGATATCAGTCACCTCCCTTTGAGCTAAAATACGATGGCGATTTAGTGTGAAACTAATAATGCTTGTCATACCAGCAGTACCGGATCGGGAGATCGGAAGAGCACACGTCT
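One way to check that the flanking sequences really are constant primers is to compute the longest prefix and suffix shared by all reads. A minimal Python sketch, assuming the reads are already loaded into a list of strings (the helper name `shared_flanks` and the toy reads are made up for illustration):

```python
import os

def shared_flanks(reads):
    """Return (prefix, suffix) common to every read in the list."""
    prefix = os.path.commonprefix(reads)
    # Reverse each read so that the common suffix becomes a common prefix.
    suffix = os.path.commonprefix([r[::-1] for r in reads])[::-1]
    return prefix, suffix

# Toy reads (not from the real file): same flanks, different payloads.
reads = ["AACCGGTTTGCA", "AACCTTAATGCA", "AACCACGTTGCA"]
print(shared_flanks(reads))
```

Applied to the five lines shown above, this should report the 20-base prefix ACACGACGCTCTTCCGATCT and the 21-base suffix AGATCGGAAGAGCACACGTCT.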
I trimmed all the primers and, using Python, tried to decode the bases by mapping {A, C, G, T} to the two-bit values {00, 01, 10, 11} in every possible assignment (4! = 24 candidate decodings), which seemed like the obvious(?) decoding method.
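The brute-force attempt described here can be sketched roughly as follows. This is an illustration under assumptions, not the script actually used: the primer constants are read off the five sample lines, and the function names are hypothetical.

```python
from itertools import permutations

# Flanking sequences shared by the five sample lines in the question.
PREFIX = "ACACGACGCTCTTCCGATCT"
SUFFIX = "AGATCGGAAGAGCACACGTCT"

def trim(read):
    """Strip the flanking primers if both are present."""
    if read.startswith(PREFIX) and read.endswith(SUFFIX):
        return read[len(PREFIX):len(read) - len(SUFFIX)]
    return read

def all_mappings():
    """Yield the 24 ways to assign the values 0..3 to A, C, G, T."""
    for order in permutations("ACGT"):
        yield {base: value for value, base in enumerate(order)}

def decode(payload, mapping):
    """Pack bases into bytes, four bases (8 bits) per byte."""
    usable = len(payload) - len(payload) % 4  # drop an incomplete last byte
    out = bytearray()
    for i in range(0, usable, 4):
        byte = 0
        for base in payload[i:i + 4]:
            byte = (byte << 2) | mapping[base]
        out.append(byte)
    return bytes(out)

# Toy read: real primers around a made-up 8-base payload.
payload = trim(PREFIX + "ACGTTTTT" + SUFFIX)
candidates = [decode(payload, m) for m in all_mappings()]
print(len(candidates), candidates[0].hex())
```

A direct two-bit mapping like this only recovers the file if the bases store the raw bytes in order, with nothing else layered on top.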
Then I hoped to get 1 of these 24 files loaded into VLC media player or something to be played, but it didn’t work and every file seemed to be broken in the same way. I think I’m missing something here.
Can I assume that a text file containing only 0’s and 1’s should be playable in VLC if the DNA code is correctly decoded?
(If I am wrong here, please tell me or move the question.)
//Edit: I converted all 24 files to ASCII to see if there’s some kind of "video-like header". (Don’t all videos show some sort of description in their first lines when opened in a text editor?) But there’s just gibberish.
//Edit: I saw that every 84th sequence position has a "T", which is kind of weird.
So I tried to run my script again with these T’s removed, but still no solution.
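A fixed base such as this "T" can be found systematically by looking for columns that are identical in every read. A small Python sketch, assuming the reads are already trimmed and held in a list (function names and toy reads are hypothetical):

```python
def constant_columns(reads):
    """Return {0-based position: base} for columns identical in all reads."""
    first = reads[0]
    return {
        i: base
        for i, base in enumerate(first)
        if all(len(r) > i and r[i] == base for r in reads)
    }

def drop_columns(read, positions):
    """Remove the given 0-based positions from a read."""
    return "".join(b for i, b in enumerate(read) if i not in positions)

# Toy reads (not from the real file).
reads = ["ACTGA", "GGTCA", "CATTA"]
cols = constant_columns(reads)
print(cols)
print([drop_columns(r, cols) for r in reads])
```

With only a handful of reads a column can be constant by chance, so this check is only meaningful when run over a large part of the file.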
//Edit: I searched for "AVI", "264", "codec" and some other strings in every video file I created and hexdumped. Nothing found.
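Instead of searching for free-text strings, a candidate file can be tested against the byte signatures that common video containers place at fixed offsets. A hedged Python sketch covering only a few well-known formats (the function name is made up, and the list is far from complete):

```python
def sniff_container(data):
    """Guess a container format from the leading bytes; None if unknown."""
    # AVI: RIFF header with the form type "AVI " at offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"AVI ":
        return "avi"
    # MP4 / QuickTime family: "ftyp" box type at offset 4.
    if data[4:8] == b"ftyp":
        return "mp4"
    # Matroska / WebM: EBML header ID.
    if data[:4] == b"\x1a\x45\xdf\xa3":
        return "matroska/webm"
    # MPEG transport stream: sync byte 0x47 at the start of each 188-byte packet.
    if data[:1] == b"\x47" and data[188:189] == b"\x47":
        return "mpeg-ts"
    return None

print(sniff_container(b"\x00\x00\x00\x18ftypisom"))
print(sniff_container(b"gibberish"))
```

A negative result does not prove that the mapping is wrong: if the fragments are stored out of order or masked, the header will not sit at the start of a naive decode.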
For clarification: I translated the DNA into each of the 24 binary variants and then into their ASCII representation, following the answer with 19 upvotes at https://stackoverflow.com/questions/7290943/write-a-string-of-1s-and-0s-to-a-binary-file. The 104 bases (208 bits) left after removing the primers and the repetitive "T" come to a whole number of bytes (26), so I could be on the right track (even if it is not 32 bytes?).
De novo assembly didn’t work, and I found no obvious ORF "genes" that might encode a URL to the video or something similar, which was a neat idea considering that the video file would be only ~150MB. (See comments.)
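The distinction between a text file of '0' and '1' characters and a real binary file matters here: a media player reads raw bytes, while a text file stores each bit as a whole character (byte 0x30 or 0x31). A minimal Python sketch of the packing step, with a hypothetical helper name:

```python
def bits_to_bytes(bits):
    """Pack a string of '0'/'1' characters into raw bytes, MSB first."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

bits = "0100000101000010"
raw = bits_to_bytes(bits)
# The text form takes one byte per bit; the packed form is 8 times smaller.
print(len(bits.encode("ascii")), len(raw), raw)
```

Only the packed form, written with a binary-mode file handle such as `open(path, "wb")`, has any chance of being recognized as a video.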
I was quite curious when I saw that as well. I've spent more time than I'd care to admit trying the sort of things you have. I've gotten it decoded now, but I can't really claim any sort of victory, as the problem was pre-solved.
After struggling through some of the same experiments you did, I decided to take a closer look at their explainer video here: https://youtu.be/DMYgjOHgHxc
First I tried to decipher the blackboard sketch around 1:30, but it was quite vague. A closer investigation into the speaker, however, yielded some luck. Googling Dr. Reinhard Heckel brought me to his website here: http://www.reinhardheckel.com/ which shows his most recent publication – a paper on encoding digital data in DNA.
The encoding is relatively complex (something nearly impossible to stumble upon accidentally, I'd reckon) but fascinating. The fragments are indexed and protected by two layered error-correcting codes. Perhaps making our job harder (but serving the practical purpose of minimizing homopolymers and unwanted annealing) is the fact that the data is XORed with a pseudo-random sequence that scrambles it. The paper obviously has all of the details.
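To illustrate the XOR idea only: the sketch below masks data with a keystream from Python's seeded `random.Random`, which is an assumption made for demonstration and not the generator or parameters used in the paper or the repository. It shows two properties: XORing twice with the same keystream restores the input, and a highly repetitive input no longer looks repetitive after masking.

```python
import random

def xor_mask(data, seed):
    """XOR data with a keystream from a seeded PRNG (illustrative only)."""
    rng = random.Random(seed)
    return bytes(b ^ rng.getrandbits(8) for b in data)

def longest_run(data):
    """Length of the longest run of identical bytes."""
    best = run = 0
    prev = None
    for b in data:
        run = run + 1 if b == prev else 1
        best = max(best, run)
        prev = b
    return best

plain = bytes(32)                      # 32 zero bytes: one long repeat
masked = xor_mask(plain, seed=1)       # the seed value is arbitrary
restored = xor_mask(masked, seed=1)    # same seed, same keystream
print(longest_run(plain), longest_run(masked), restored == plain)
```

This also suggests why a plain two-bit mapping yields gibberish: without the exact keystream, the masked bytes are indistinguishable from noise.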
From my skim, however, I stumbled across this Github repo linked in the paper: https://github.com/reinhardh/dna_rs_coding
As it happens, the README was updated just this past week and now describes how to decode the episode from the file provided. If you have Docker, you just need to copy-paste some commands.
The product is indeed the full and final episode, coming in at a dainty 63.1MB. How did they get 40 minutes of video compressed that much? Well, in short, 720x360 at 24 fps. It looks a little awful and is in German only, but is certainly a cool little easter egg. If nothing else, I have a cool paper to read.
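For a sense of scale, the figures above imply a very low average bitrate. A quick back-of-the-envelope calculation in Python, assuming 1 MB = 10**6 bytes and treating the whole file (video, audio, and container overhead together) as one stream:

```python
size_bytes = 63.1e6          # 63.1 MB, assuming 1 MB = 10**6 bytes
duration_s = 40 * 60         # 40 minutes
width, height, fps = 720, 360, 24

bitrate = size_bytes * 8 / duration_s              # bits per second
bits_per_pixel = bitrate / (width * height * fps)  # averaged over all frames

print(f"{bitrate / 1000:.0f} kbit/s, {bits_per_pixel:.3f} bits per pixel")
```

That works out to roughly 210 kbit/s overall, which fits the remark that the picture looks a little awful.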
Correct answer by thelostlambda on January 3, 2021