Bioinformatics Asked by Pawan Verma on June 24, 2021
AIM: Download "Unique Identifier List" for the following query from GEO DataSets.
Query: ("Expression profiling by high throughput sequencing"[DataSet Type] AND ("Homo sapiens"[Organism] OR "Mus musculus"[Organism] OR "rattus norvegicus"[Organism])) AND ("2020/01/01"[PDAT] : "3000"[PDAT])
That is, all RNA-seq studies deposited on GEO since January 1, 2020 for human, mouse or rat.
Problem: I need the GSE ID list for ~9k datasets, but when I try to download the list of IDs, a blank page loads and nothing happens. Clicking "Next Page" also gives an error.
I have been trying for the last 3-4 days but it doesn’t work.
Steps to generate file:
"Send To" -> "File" -> "Format" (Unique Identifier List) -> "Sort By" (Default Order) -> "Create File"
You can use Entrez Direct for this. The following returns the unique identifiers (UIDs), which are just bare integers.
$ geo_query='"Expression profiling by high throughput sequencing"[DataSet Type] AND ("Homo sapiens"[Organism] OR "Mus musculus"[Organism] OR "rattus norvegicus"[Organism]) AND ("2020/01/01"[PDAT] : "3000"[PDAT])'
$ esearch -db gds -query "$geo_query" | efetch -format uid > gds_results.txt
$ wc -l gds_results.txt
9981 gds_results.txt
$ head -n2 gds_results.txt
200134092
200120931
Instead, if you are looking for a way to get the GSE accessions, you can use the built-in xtract command to parse the XML returned by esummary as follows:
$ esearch -db gds -query "$geo_query" | esummary | xtract -pattern DocumentSummary -first Accession > gse_accs.txt
$ wc -l gse_accs.txt
9981 gse_accs.txt
$ head -n2 gse_accs.txt
GSE165829
GSE165824
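If only the UID file is at hand, the accessions can probably be derived offline as a cross-check on the esummary output. The sketch below rests on an assumption that is not stated in this thread: that a GEO series UID in the gds database is the GSE number zero-padded to eight digits and prefixed with "2" (so 200134092 would correspond to GSE134092). The function name uid_to_gse is made up for illustration. Treat the esummary route above as the authoritative one.

```shell
# Hypothetical helper: convert gds series UIDs to GSE accessions.
# ASSUMPTION: a series UID is "2" followed by the zero-padded GSE number.
# Lines that do not look like 9-digit UIDs starting with 2 are dropped.
uid_to_gse() {
  sed -n -E '/^2[0-9]{8}$/ s/^20*/GSE/p'
}

# Demo on the two UIDs shown earlier
printf '200134092\n200120931\n' | uid_to_gse

# Possible cross-check against the esummary output (sort both sides first,
# since the two commands need not return records in the same order):
#   uid_to_gse < gds_results.txt | sort > from_uids.txt
#   sort gse_accs.txt | diff - from_uids.txt
```

An empty diff would suggest the assumed UID convention holds for this result set; any difference means the mapping should not be relied on.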
Correct answer by vkkodali on June 24, 2021
I think I would just try to do this with GEOquery.
https://bioconductor.org/packages/release/bioc/html/GEOquery.html
Answered by k1sauce on June 24, 2021