How to import large .bed, .gff, .vcf, .paf, .sam files into an SQL database?

Question

Are there best practices to load different bioinformatics file formats such as VCF, BED, GFF, and SAM to SQL databases? I am wondering how people out there do that efficiently.
All of these formats are tab-separated text files, so basically the following should work. I feel weird about it since most people I know don't use MySQL to work with these files.
LOAD DATA LOCAL INFILE 'bed.bed' INTO TABLE `bed-file` FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' IGNORE 1 ROWS (list of the columns) SET creation_date = STR_TO_DATE(@creation_date, '%m/%d/%y');

gringer · Answer

Answer from @liam-mcintyre converted from comment:
I don't use dask, as it (unfortunately) doesn't support enough pandas functionality. With pandas I do it with read_csv; if the file is big, I read it in chunks and send the chunks to separate threads. If you want to ask a specific question with example data etc., then I can show code.
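
As a rough illustration of the chunked read_csv idea (this is not the answerer's own code), here is a minimal sketch. It assumes a headerless, six-column BED file and uses an SQLite connection from the standard library in place of MySQL so that it runs without a server. The function name load_bed_in_chunks, the table name bed_file, and the column names are all made up for the example, and the chunks are written one after another rather than handed to separate threads.

```python
import sqlite3

import pandas as pd

# Hypothetical names for the first six BED fields; real files may have 3 to 12.
BED_COLUMNS = ["chrom", "chrom_start", "chrom_end", "name", "score", "strand"]


def load_bed_in_chunks(bed_path, conn, table="bed_file", chunksize=100_000):
    """Stream a tab-separated BED file into an SQL table, one chunk at a time.

    Returns the number of rows written.
    """
    reader = pd.read_csv(
        bed_path,
        sep="\t",
        header=None,            # assumes the file has no header row
        names=BED_COLUMNS,
        dtype={"chrom": str},   # keep chromosome names such as "1" as text
        chunksize=chunksize,    # iterate over DataFrames instead of loading all rows
    )
    total = 0
    for chunk in reader:
        # The table is created on the first chunk and appended to afterwards.
        chunk.to_sql(table, conn, if_exists="append", index=False)
        total += len(chunk)
    return total


if __name__ == "__main__":
    connection = sqlite3.connect("bed.sqlite")
    print(load_bed_in_chunks("bed.bed", connection))
    connection.close()
```

Only one chunk is held in memory at a time, so the file size is limited by disk rather than RAM. For MySQL the same loop should work with an SQLAlchemy engine passed in place of the SQLite connection, although for very large files the server-side LOAD DATA statement from the question is likely to be faster.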
