TransWikia.com

Filtering pileup from site lists

Bioinformatics Asked by Eliran Turgeman on December 8, 2020

I want to write a script that filters a pileup file from a site lists file.
As an input I get a reference genome, pileup and site lists files.

Example of an output for this script:
Pileup File :

        NC_000001.10    11456   A   0   *   *
        NC_000001.10    11457   C   0   *   *
        NC_000001.10    11458   T   0   *   *
        NC_000002.11    11460   G   1   ,   E
        NC_000002.11    13914   G   1   ,   E
        NC_000003.11    10503990    C   1   ,   E

Reference File (After using grep by ‘>’):

        >NC_000001.10 Homo sapiens chromosome 1, GRCh37.p13 Primary Assembly
        >NT_113878.1 Homo sapiens chromosome 1 unlocalized genomic contig, GRCh37.p13 Primary Assembly
        >NT_167207.1 Homo sapiens chromosome 1 unlocalized genomic contig, GRCh37.p13 Primary Assembly
        >NC_000002.11 Homo sapiens chromosome 2, GRCh37.p13 Primary Assembly
        >NC_000003.11 Homo sapiens chromosome 3, GRCh37.p13 Primary Assembly

Lists File:

        chr1    11456   C1orf186    -   intronic    intronic    no  no  N   N   N
        chr1    11457   C1orf186    -   intronic    intronic    no  no  N   N   N
        chr2    13914   intergenic  -   intergenic  intergenic  no  no  N   N   N
        chr7    30504355    NOD1    -   intronic    intronic    no  no  N   N   N
        chr3    10503990    SSX2IP  -   Syn Gln->Gln    no  no  N   N   N
        chr1    13131320    MEF2A   +   intronic    intronic    no  no  N   N   N

Output File:

        NC_000001.10    11456   A   0   *   *
        NC_000001.10    11457   C   0   *   *
        NC_000002.11    13914   G   1   ,   E
        NC_000003.11    10503990    C   1   ,   E

So for every refseq in the pileup file I would convert it to the matching chromose in which is found in the reference file, and then overlap it with my lists file so I would get the lines with the same chromose and position.

The problem is that while there are many refseq coordinates that belong to chr1 for example, alignment to each of them will give a position, with respect to the specific coordinate. But on the list there is a full, single, chr1.

I need to be able to compare between:

  • a pileup file that was created based on a refseq reference genome
  • a file with editing sites that was downloaded from RADAR (we may give the file)
    Maybe there is a more appropriate reference genome file or that there is a way to convert the refseq positions according to the list

EDIT: If I were to convert my pileup file into a bed file (by repeating the position twice) could I use somehow bedtools to solve my problem?

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP