Bioinformatics Asked on February 15, 2021
I have two pandas Dataframes, using python3.x:
import pandas as pd
dict1 = {0:['chr1','chr1','chr1','chr1','chr2'],
1:[1, 100, 150, 900, 1], 2:[100, 200, 500, 950, 100],
3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
4:[0, 0, 0, 0, 0], 5:['+','+','-','+','+']}
df1 = pd.DataFrame(dict1)
print(df1)
## 0 1 2 3 4 5
## 0 chr1 1 100 feature1 0 +
## 1 chr1 100 200 feature2 0 +
## 2 chr1 150 500 feature3 0 -
## 3 chr1 900 950 feature4 0 +
## 4 chr2 1 100 feature4 0 +
dict2 = {0:['chr1','chr1'], 1:[155, 800], 2:[200, 901],
3:['feature5', 'feature6'], 4:[0, 0], 5:['-','+']}
df2 = pd.DataFrame(dict2)
print(df2)
## 0 1 2 3 4 5
## 0 chr1 155 200 feature5 0 -
## 1 chr1 800 901 feature6 0 +
The columns to focus on in these dataframes are the first three columns: location, start, and end. Each start:end value represents a distance on location (e.g. chr1
, chr2
, chr3
).
I would like to output the intersection of df1
against df2
. Here is the correct output:
chr1 155 200 feature2 0 +
chr1 155 200 feature3 0 -
chr1 900 901 feature4 0 +
What is the most efficient (in terms of runtime and RAM) to find the intersections using pandas?
There are python wrappers around BEDTools: https://daler.github.io/pybedtools/intersections.html
I was hoping there was a pandas only solution…
Your setup:
import pandas as pd
dict1 = {0:['chr1','chr1','chr1','chr1','chr2'],
1:[1, 100, 150, 900, 1], 2:[100, 200, 500, 950, 100],
3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
4:[0, 0, 0, 0, 0], 5:['+','+','-','+','+']}
df1 = pd.DataFrame(dict1)
print(df1)
## 0 1 2 3 4 5
## 0 chr1 1 100 feature1 0 +
## 1 chr1 100 200 feature2 0 +
## 2 chr1 150 500 feature3 0 -
## 3 chr1 900 950 feature4 0 +
## 4 chr2 1 100 feature4 0 +
dict2 = {0:['chr1','chr1'], 1:[155, 800], 2:[200, 901],
3:['feature5', 'feature6'], 4:[0, 0], 5:['-','+']}
df2 = pd.DataFrame(dict2)
Now use pyranges (https://github.com/biocore-ntnu/pyranges):
# pip install pyranges
# conda install -c bioconda pyranges
import pyranges as pr
cols = "Chromosome Start End Name Score Strand".split()
# Add column names to the dfs so PyRanges knows which
# represent Chromosomes, Starts, Ends and optionally Strand
df1.columns = cols
df2.columns = cols
# create PyRanges-objects from the dfs
gr1, gr2 = pr.PyRanges(df1), pr.PyRanges(df2)
# intersect the two
gr = gr1.intersect(gr2)
# get the result as a ? df
df = gr.df
Results:
print(gr)
# +--------------+-----------+-----------+------------+-----------+--------------+
# | Chromosome | Start | End | Name | Score | Strand |
# | (category) | (int32) | (int32) | (object) | (int64) | (category) |
# |--------------+-----------+-----------+------------+-----------+--------------|
# | chr1 | 155 | 200 | feature2 | 0 | + |
# | chr1 | 900 | 901 | feature4 | 0 | + |
# | chr1 | 155 | 200 | feature3 | 0 | - |
# +--------------+-----------+-----------+------------+-----------+--------------+
# Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
print(df)
# Chromosome Start End Name Score Strand
# 0 chr1 155 200 feature2 0 +
# 1 chr1 900 901 feature4 0 +
# 2 chr1 155 200 feature3 0 -
See the docs for more info: https://biocore-ntnu.github.io/pyranges/
Correct answer by The Unfun Cat on February 15, 2021
I implemented pandas "Intervals" and ... it should be a few lines, clearly there are limitations. For non-overlapping data it is very cool however. It will work for overlapping data, BUT if the data you are using as the interval data is overlapping, it falls over. It could work if an independent (non-overlapping) interval was constructed.
Anyway, the point of the excercise was to keep it all in pandas, "start_stop" should not be needed. This can't be done via Intervals, you've got to drop out, and that will slow it down and harder to parse.
The query contig df3.iloc[1,]
, is still a bug (hence not printed), should be [x,] where x is the value_counts hit, but this is just a pilot.
The script would be fine if it the df was wrangled using groupby
rather than value_counts
because value_counts
trashes important data, which is what I'm trying to recover via df3.iloc[1,]
. I like pandas elegant solutions, but you know another day another dollar.
Note: I changed 200 to 199 otherwise there's too much overlap.
Overlap Output (not labeled very well)
start stop
index
(1.0, 100.0] 0 0
(100.0, 199.0] 1 0
(150.0, 500.0] 1 1
(900.0, 950.0] 0 1
import pandas as pd
def tableit (dict):
df = pd.DataFrame(dict)
table = pd.pivot_table(df, values=['start', 'stop'], columns=['chromosome'], index=['feature', 'misc', 'misc2'], fill_value = '-')
return table
def start_stop(point, table2, bin):
df3 = pd.cut(table2[point,'chr1'], bin[0]).value_counts()
df3 = df3.to_frame(point).reset_index()
for x in bin:
if not (x == bin[0]):
df3_tmp = pd.cut(table2[point,'chr1'], x).value_counts()
df3_tmp = df3_tmp.to_frame(point).reset_index()
df3 = pd.concat([df3, df3_tmp]
else:
continue
df3.set_index('index', inplace=True)
return df3, df3.iloc[1,] # [1,] is a bug and the query locus needs correctly reading from df3_tmp
def main():
dict1 = {'chromosome':['chr1','chr1','chr1','chr1','chr2'], 'start':[1, 100, 150, 900, 1], 'stop':[100, 199, 500, 950, 100], 'feature':['feature1', 'feature2', 'feature3', 'feature4', 'feature4'], 'misc':[0, 0, 0, 0, 0], 'misc2':['+','+','-','+','+']}
dict2 = {'chromosome':['chr1','chr1'], 'start':[155, 800], 'stop':[200, 901], 'feature':['feature5', 'feature6'], 'misc':[0, 0], 'misc2':['-','+']}
table1 = tableit(dict1)
table2 = tableit(dict2)
bin = [[i, j] for i, j in zip(table1['start', 'chr1'], table1['stop', 'chr1'])]
df_start, locus = start_stop('start', table2, bin)
# locus.name gets to the data
df_stop, _ = start_stop('stop', table2, bin)
df_merge = pd.merge(df_start, df_stop, on="index", how='outer')
print (df_merge)
if __name__ == "__main__":
main()
The following would have been better,
table_join = pd.concat([table1, table2]).fillna('-')
and then sort them.. perhaps including a new column to specify the original dataframe and screen for the overlap with table_join.iloc[i, column] - table_join.iloc[1-i, other_column].
The key step in all approaches is to convert from "long format" to "wide format" and thats the most important thing I've done in all code and makes pandas work.
Output Table_join
start stop
chromosome chr1 chr2 chr1 chr2
feature misc misc2
feature1 0 + 1.0 - 100.0 -
feature2 0 + 100.0 - 199.0 -
feature3 0 - 150.0 - 500.0 -
feature4 0 + 900.0 1 950.0 100
feature5 0 - 155.0 - 200.0 -
feature6 0 + 800.0 - 901.0 -
Answered by M__ on February 15, 2021
I don't think Pandas has this implemented functionality out-of-the-box. Even if it did, solutions not designed specifically for bioinformatics probably rarely handle intervals on different chromosomes correctly unless you split the intervals by chromosome first.
Pandas does handle intervals (see docs for the Interval and IntervalIndex classes), but I've never used these. I have used the excellent intervaltree package. Whatever approach and tools you decide to use, there's a good chance you'll have to write some code. Computing interval intersections with intervaltree
would look something like this.
>>> from collections import defaultdict
>>> from intervaltree import IntervalTree
>>> from pandas import DataFrame
>>>
>>> dict1 = {
... 0:['chr1','chr1','chr1','chr1','chr2'],
... 1:[1, 100, 150, 900, 1],
... 2:[100, 200, 500, 950, 100],
... 3:['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
... 4:[0, 0, 0, 0, 0],
... 5:['+','+','-','+','+']
... }
>>> dict2 = {
... 0:['chr1','chr1'],
... 1:[155, 800],
... 2:[200, 901],
... 3:['feature5', 'feature6'],
... 4:[0, 0],
... 5:['-','+']
... }
>>> df1 = DataFrame(dict1)
>>> df2 = DataFrame(dict2)
>>>
>>>
>>> # Construct an interval index from one of the dataframes,
>>> # one tree per chromosome
>>> index = defaultdict(IntervalTree)
>>> for n, row in df2.iterrows():
... index[row[0]].addi(row[1], row[2], row)
...
>>>
>>> # Query the index from the other dataframe to build a 3rd
>>> # dataframe representing the intersection
>>> dict3 = { i: list() for i in range(6) }
>>> for n, row in df1.iterrows():
... overlapping = index[row[0]].overlap(row[1], row[2])
... for itvl in overlapping:
... itsct_start = max(row[1], itvl.begin)
... itsct_end = min(row[2], itvl.end)
... dict3[0].append(row[0])
... dict3[1].append(itsct_start)
... dict3[2].append(itsct_end)
... dict3[3].append(row[3])
... dict3[4].append(row[4])
... dict3[5].append(row[5])
...
>>> df3 = DataFrame(dict3)
>>> df3
0 1 2 3 4 5
0 chr1 155 200 feature2 0 +
1 chr1 155 200 feature3 0 -
2 chr1 900 901 feature4 0 +
>>>
Answered by Daniel Standage on February 15, 2021
As mentioned by OP, another option is to use pybedtools, which in my opinion is pretty convenient for people already familiar with BedTools.
Let's even say df1's format is slightly different than df2:
import pandas as pd
dict1 = {0: ['chr1', 'chr1', 'chr1', 'chr1', 'chr2'], 1: [1, 100, 150, 900, 1],
2: [100, 200, 500, 950, 100],
3: ['feature1', 'feature2', 'feature3', 'feature4', 'feature4'],
4: [0, 0, 0, 0, 0], 5: ['+', '+', '-', '+', '+'],
6: ["remember", "the", "5th", "of", "november"]}
df1 = pd.DataFrame(dict1)
print(df1)
0 1 2 3 4 5 6
0 chr1 1 100 feature1 0 + remember
1 chr1 100 200 feature2 0 + the
2 chr1 150 500 feature3 0 - 5th
3 chr1 900 950 feature4 0 + of
4 chr2 1 100 feature4 0 + November
dict2 = {0: ['chr1', 'chr1'], 1: [155, 800], 2: [200, 901],
3: ['feature5', 'feature6'], 4: [0, 0], 5: ['-', '+']}
df2 = pd.DataFrame(dict2)
print(df2)
0 1 2 3 4 5
0 chr1 155 200 feature5 0 -
1 chr1 800 901 feature6 0 +
Using pybedtools we get:
from pybedtools import BedTool
bed1 = BedTool.from_dataframe(df1)
bed2 = BedTool.from_dataframe(df2)
intersected_bed = bed1.intersect(bed2)
print(intersected_bed)
chr1 155 200 feature2 0 + the
chr1 155 200 feature3 0 - 5th
chr1 900 901 feature4 0 + of
We can then keep using the intersected BedTool object or convert it back into a DataFrame:
# create a pandas.DataFrame, passing args and kwargs to pandas.read_table
intersected_df = intersected_bed.to_dataframe(header=None,
names=["chrom", "start", "end", "name",
"score", "strand", "whatever"])
print(intersected_df)
chrom start end name score strand whatever
0 chr1 155 200 feature2 0 + the
1 chr1 155 200 feature3 0 - 5th
2 chr1 900 901 feature4 0 + of
Answered by Kapara newbie on February 15, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP