Pandas - avoid iterrows() assembling a multi-index data frame from another time-series multi-index data frame

Question

For analyzing pairs of currencies, I have two dataframes. The first one contains information about the symbols constituting every unique pair (as in "pair metadata"):
    X           Y           handle
0   ETHEUR E3   XMRUSD E3   ETHEUR E3_XMRUSD E3
1   ETHEUR E3   TRXETH E3   ETHEUR E3_TRXETH E3
2   ETHEUR E3   ETHUSD E3   ETHEUR E3_ETHUSD E3
3   ETHEUR E3   BTCEUR E3   ETHEUR E3_BTCEUR E3
4   ETHEUR E3   ETHBTC E3   ETHEUR E3_ETHBTC E3
...     ...     ...     ...

The second one is a multi-index dataframe providing historic data, having the symbol as level 0 index and time as a level 1 index. This specific structure is desired (and given by some other library) as not every symbol has data for every point in time.
                                    close       high        low         open        volume
symbol      time                    
ETHEUR E3   2021-01-18 04:41:00     988.730000  988.730000  988.726611  988.726611  8.404723
            2021-01-18 04:42:00     988.730000  988.730000  988.726611  988.726611  0.000000
            2021-01-18 04:43:00     988.730000  988.730000  988.726611  988.726611  0.000000
            2021-01-18 04:44:00     990.620360  990.620360  989.640000  989.640000  0.158394
            2021-01-18 04:45:00     995.010000  995.010000  995.010000  995.010000  0.006665
...     ...     ...     ...     ...     ...     ...
ANTETH E3   2021-01-18 04:56:00     0.003745    0.003745    0.003745    0.003745    0.000000
            2021-01-18 04:57:00     0.003745    0.003745    0.003745    0.003745    0.000000
            2021-01-18 04:58:00     0.003745    0.003745    0.003745    0.003745    0.000000
            2021-01-18 04:59:00     0.003745    0.003745    0.003745    0.003745    0.000000
            2021-01-18 05:00:00     0.003745    0.003745    0.003745    0.003745    0.000000

My goal is to assemble a pair-wise history as follows, with 'X' and 'Y' containing the data from the previous dataframe for the first and second symbol of the pair:
                                                X                                                           Y
                                                close       high        low         open        volume      close       high        low         open        volume
handle                  time                                        
ETHEUR E3_XMRUSD E3     2021-01-18 04:41:00     988.730000  988.730000  988.726611  988.726611  8.404723    153.450000  153.450000  153.250000  153.250000  0.000000
                        2021-01-18 04:42:00     988.730000  988.730000  988.726611  988.726611  0.000000    153.450000  153.450000  153.250000  153.250000  0.000000
                        2021-01-18 04:43:00     988.730000  988.730000  988.726611  988.726611  0.000000    153.450000  153.450000  153.250000  153.250000  0.000000
                        2021-01-18 04:44:00     990.620360  990.620360  989.640000  989.640000  0.158394    153.450000  153.450000  153.250000  153.250000  0.000000
                        2021-01-18 04:45:00     995.010000  995.010000  995.010000  995.010000  0.006665    153.860000  153.860000  153.360000  153.360000  4.845876
...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
EOSETH E3_ANTETH E3     2021-01-18 04:56:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000
                        2021-01-18 04:57:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000
                        2021-01-18 04:58:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000
                        2021-01-18 04:59:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000
                        2021-01-18 05:00:00     0.002259    0.002259    0.002259    0.002259    0.000000    0.003745    0.003745    0.003745    0.003745    0.000000

This pair-wise history is used for further time-series style calculations between the two symbols.
Currently, I use the following code to generate the shown dataframe:
def assemble_pairs_from_history(history):
    # Build unique pair combinations from symbols
    pair_tuples = list(combinations(history.index.unique('symbol'), 2))
    pair_info = pd.DataFrame(pair_tuples, columns=['X', 'Y'])
    # Designate a handle for each combination
    pair_info['handle'] = pair_info['X'].str.cat(pair_info['Y'], sep="_")
    
    display(pair_info)
    display(history)
    
    # Iterate over all pair combinations, assembling a dict of historical data DFs for X and Y symbol of the pair
    pair_history = {}
    for i, pi in pair_info.iterrows():
        pair_history[pi.handle] = pd.concat([history.loc[pi.X], history.loc[pi.Y]], axis=1, keys=['X', 'Y'])
        # Clean the time series so that only rows having values for both symbols of the pair remain
        pair_history[pi.handle].dropna(inplace=True)

# Assemble into one big dataframe, index by handle -> time with X and Y top level columns
    pair_history = pd.concat(pair_history.values(), keys=pair_history.keys(), names=['handle'])

display(pair_history)

The issue with this code is that the performance is pretty bad, with the iterrows() and subsequent concat() part taking 10 seconds for a result dataframe size of 5000 rows.
I already tried various combinations of apply(), join() and merge() as well as building a multiindex manually, but was not able to achieve the desired output format.
How can the performance of this task be improved?

Oxbowerce · Answer

Not sure if you need the multi index for the columns, but would something like this be what you're looking for?
Given the following two input dataframes

X
Y
handle

ETHEUR E3
XMRUSD E3
ETHEUR E3_XMRUSD E3

ETHEUR E3
TRXETH E3
ETHEUR E3_TRXETH E3

ETHEUR E3
ETHUSD E3
ETHEUR E3_ETHUSD E3

ETHEUR E3
BTCEUR E3
ETHEUR E3_BTCEUR E3

ETHEUR E3
ETHBTC E3
ETHEUR E3_ETHBTC E3

symbol
time
close
high
low
open
volume

ETHEUR E3
2021-01-18 04:41:00
988.73
988.73
988.727
988.727
8.40472

ETHEUR E3
2021-01-18 04:42:00
988.73
988.73
988.727
988.727
0

ETHEUR E3
2021-01-18 04:43:00
988.73
988.73
988.727
988.727
0

ETHEUR E3
2021-01-18 04:44:00
990.62
990.62
989.64
989.64
0.158394

ETHEUR E3
2021-01-18 04:45:00
995.01
995.01
995.01
995.01
0.006665

ANTETH E3
2021-01-18 04:56:00
0.003745
0.003745
0.003745
0.003745
0

ANTETH E3
2021-01-18 04:57:00
0.003745
0.003745
0.003745
0.003745
0

ANTETH E3
2021-01-18 04:58:00
0.003745
0.003745
0.003745
0.003745
0

ANTETH E3
2021-01-18 04:59:00
0.003745
0.003745
0.003745
0.003745
0

ANTETH E3
2021-01-18 05:00:00
0.003745
0.003745
0.003745
0.003745
0

you can use pandas.merge to join the tables twice for each of the currency pairs.
combined = (
    df1
    # join data for first currency pair
    .merge(df2, how="left", left_on="X", right_on="symbol")
    # join data for second currency pair
    .merge(df2, how="left", left_on=["Y", "time"], right_on=["symbol", "time"], suffixes=("_X", "_Y"))
)

# get columns names for multi index
cols = [x.split("_")[::-1] if "_" in x else ["", x] for x in combined.columns]
# set new column names
combined.columns = pd.MultiIndex.from_arrays(list(zip(*cols)))
# select required columns
combined = combined[[x for x in combined.columns if x[1] in ["handle", "time", "close", "high", "low", "open", "volume"]]]

Which results in the following dataframe:

('', 'handle')
('', 'time')
('X', 'close')
('X', 'high')
('X', 'low')
('X', 'open')
('X', 'volume')
('Y', 'close')
('Y', 'high')
('Y', 'low')
('Y', 'open')
('Y', 'volume')

ETHEUR E3_XMRUSD E3
2021-01-18 04:41:00
988.73
988.73
988.727
988.727
8.40472
nan
nan
nan
nan
nan

ETHEUR E3_XMRUSD E3
2021-01-18 04:42:00
988.73
988.73
988.727
988.727
0
nan
nan
nan
nan
nan

ETHEUR E3_XMRUSD E3
2021-01-18 04:43:00
988.73
988.73
988.727
988.727
0
nan
nan
nan
nan
nan

ETHEUR E3_XMRUSD E3
2021-01-18 04:44:00
990.62
990.62
989.64
989.64
0.158394
nan
nan
nan
nan
nan

ETHEUR E3_XMRUSD E3
2021-01-18 04:45:00
995.01
995.01
995.01
995.01
0.006665
nan
nan
nan
nan
nan

Pandas - avoid iterrows() assembling a multi-index data frame from another time-series multi-index data frame

One Answer

Add your own answers!

Ask a Question

X	Y	handle
ETHEUR E3	XMRUSD E3	ETHEUR E3_XMRUSD E3
ETHEUR E3	TRXETH E3	ETHEUR E3_TRXETH E3
ETHEUR E3	ETHUSD E3	ETHEUR E3_ETHUSD E3
ETHEUR E3	BTCEUR E3	ETHEUR E3_BTCEUR E3
ETHEUR E3	ETHBTC E3	ETHEUR E3_ETHBTC E3

symbol	time	close	high	low	open	volume
ETHEUR E3	2021-01-18 04:41:00	988.73	988.73	988.727	988.727	8.40472
ETHEUR E3	2021-01-18 04:42:00	988.73	988.73	988.727	988.727	0
ETHEUR E3	2021-01-18 04:43:00	988.73	988.73	988.727	988.727	0
ETHEUR E3	2021-01-18 04:44:00	990.62	990.62	989.64	989.64	0.158394
ETHEUR E3	2021-01-18 04:45:00	995.01	995.01	995.01	995.01	0.006665
ANTETH E3	2021-01-18 04:56:00	0.003745	0.003745	0.003745	0.003745	0
ANTETH E3	2021-01-18 04:57:00	0.003745	0.003745	0.003745	0.003745	0
ANTETH E3	2021-01-18 04:58:00	0.003745	0.003745	0.003745	0.003745	0
ANTETH E3	2021-01-18 04:59:00	0.003745	0.003745	0.003745	0.003745	0
ANTETH E3	2021-01-18 05:00:00	0.003745	0.003745	0.003745	0.003745	0