How can I restrict the search space when querying GEO?

Question

I am using the following command to retrieve GEO DataSets entries for daf-2 in C. elegans whose experiment type is high-throughput sequencing:
esearch -db gds -query "daf-2" | efilter -query "expression profiling by high throughput sequencing [DataSet Type] AND Caenorhabditis elegans [organism]"

However, among the results, I noticed this entry:
23. Impaired Insulin-/IGF1-Signaling Extends Life Span by Promoting Mitochondrial L-Proline Catabolism to Induce a Transient ROS-Signal
(Submitter supplied) Transcriptome profiling of three models with impaired insulin/IGF1 signaling. 1. Deep sequencing of endogenous mRNA from Caenorhabditis elegans N2 var. Bristol (wildtype) and daf-2(e1370) mutant; 2. Deep sequencing  of endogenous mRNA from murine embryonic fibroblasts (MEF)  wildtype and irs1-/- knockout; 3. Deep sequencing of endogenous mRNA from murine embryoinic fibroblast (MEF) insr+/- -lox and insr+/- knockout  Jena Centre for Systems Biology of Ageing - JenAge (www.jenage.de)
Organism:   Mus musculus; Caenorhabditis elegans
Type:       Expression profiling by high throughput sequencing
Platforms: GPL11002 GPL13776 14 Samples
FTP download: GEO (CSV) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE36nnn/GSE36041/
SRA Run Selector: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA151765
Series      Accession: GSE36041 ID: 200036041

As you can see, the Organism field lists both Mus musculus and Caenorhabditis elegans. How can I restrict my search so that I only get entries that cover Caenorhabditis elegans exclusively? One naive way would be to add NOT Mus musculus to the query, but that would still let through entries that pair C. elegans with some other organism.
Another way I can think of is to write a script that does the extra filtering with regexes, but I was wondering whether there is a simpler solution using the E-utilities.

vkkodali · Accepted Answer

You can use the following query to first search GEO to get a list of GEO Series of interest, then find all linked SRA runs and perform a second filter step to keep only the SRA runs that satisfy another set of criteria as shown below.
# Search GEO, follow the links to SRA, then filter the linked SRA records by organism.
esearch -db gds \
  -query "daf-2[all fields] AND expression profiling by high throughput sequencing [DataSet Type] AND Caenorhabditis elegans [organism]" |
  elink -db gds -target sra |
  esearch -query "(#3) AND Caenorhabditis elegans [organism]" |
  efetch -format runinfo

Here, I am combining two independent queries by using the (#3) term in the second query. More information about this is in the "Combining Independent Queries" subsection of the "Searching and Filtering" section of the Entrez Direct documentation.
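
If the goal is instead to drop mixed-organism GEO Series themselves, as the question asks, one possible complement is a small post-filter on the document summaries. The following is only a sketch under stated assumptions: keep_single_organism is a hypothetical helper name (not part of Entrez Direct), and the usage comment assumes that the gds document summary exposes the organism list in a taxon element, which is worth checking against your own efetch -format docsum output first.

```shell
# keep_single_organism: hypothetical helper, not an Entrez Direct command.
# Reads tab-separated rows (accession <TAB> organism list) on stdin and
# prints only rows whose organism column is exactly the requested species.
# Rows such as "Mus musculus; Caenorhabditis elegans" are dropped because
# the comparison is an exact string match, not a substring match.
keep_single_organism() {
  awk -F '\t' -v org="${1:-Caenorhabditis elegans}" '$2 == org'
}

# Possible usage (assumes the gds docsum has Accession and taxon elements):
#   esearch -db gds -query "daf-2 AND Caenorhabditis elegans [organism]" |
#     efetch -format docsum |
#     xtract -pattern DocumentSummary -element Accession taxon |
#     keep_single_organism "Caenorhabditis elegans"
```

An exact match on the whole field avoids the problem with NOT Mus musculus: any entry that lists a second organism, whatever it is, fails the comparison.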
