Data Science Asked by Alexander Willer on July 4, 2021
For analyzing pairs of currencies, I have two dataframes. The first one contains information about the symbols constituting every unique pair (as in "pair metadata"):
X Y handle
0 ETHEUR E3 XMRUSD E3 ETHEUR E3_XMRUSD E3
1 ETHEUR E3 TRXETH E3 ETHEUR E3_TRXETH E3
2 ETHEUR E3 ETHUSD E3 ETHEUR E3_ETHUSD E3
3 ETHEUR E3 BTCEUR E3 ETHEUR E3_BTCEUR E3
4 ETHEUR E3 ETHBTC E3 ETHEUR E3_ETHBTC E3
... ... ... ...
The second one is a multi-index dataframe providing historic data, having the symbol as level 0 index and time as a level 1 index. This specific structure is desired (and given by some other library) as not every symbol has data for every point in time.
close high low open volume
symbol time
ETHEUR E3 2021-01-18 04:41:00 988.730000 988.730000 988.726611 988.726611 8.404723
2021-01-18 04:42:00 988.730000 988.730000 988.726611 988.726611 0.000000
2021-01-18 04:43:00 988.730000 988.730000 988.726611 988.726611 0.000000
2021-01-18 04:44:00 990.620360 990.620360 989.640000 989.640000 0.158394
2021-01-18 04:45:00 995.010000 995.010000 995.010000 995.010000 0.006665
... ... ... ... ... ... ...
ANTETH E3 2021-01-18 04:56:00 0.003745 0.003745 0.003745 0.003745 0.000000
2021-01-18 04:57:00 0.003745 0.003745 0.003745 0.003745 0.000000
2021-01-18 04:58:00 0.003745 0.003745 0.003745 0.003745 0.000000
2021-01-18 04:59:00 0.003745 0.003745 0.003745 0.003745 0.000000
2021-01-18 05:00:00 0.003745 0.003745 0.003745 0.003745 0.000000
My goal is to assemble a pair-wise history as follows, with ‘X’ and ‘Y’ containing the data from the previous dataframe for the first and second symbol of the pair:
X Y
close high low open volume close high low open volume
handle time
ETHEUR E3_XMRUSD E3 2021-01-18 04:41:00 988.730000 988.730000 988.726611 988.726611 8.404723 153.450000 153.450000 153.250000 153.250000 0.000000
2021-01-18 04:42:00 988.730000 988.730000 988.726611 988.726611 0.000000 153.450000 153.450000 153.250000 153.250000 0.000000
2021-01-18 04:43:00 988.730000 988.730000 988.726611 988.726611 0.000000 153.450000 153.450000 153.250000 153.250000 0.000000
2021-01-18 04:44:00 990.620360 990.620360 989.640000 989.640000 0.158394 153.450000 153.450000 153.250000 153.250000 0.000000
2021-01-18 04:45:00 995.010000 995.010000 995.010000 995.010000 0.006665 153.860000 153.860000 153.360000 153.360000 4.845876
... ... ... ... ... ... ... ... ... ... ... ...
EOSETH E3_ANTETH E3 2021-01-18 04:56:00 0.002259 0.002259 0.002259 0.002259 0.000000 0.003745 0.003745 0.003745 0.003745 0.000000
2021-01-18 04:57:00 0.002259 0.002259 0.002259 0.002259 0.000000 0.003745 0.003745 0.003745 0.003745 0.000000
2021-01-18 04:58:00 0.002259 0.002259 0.002259 0.002259 0.000000 0.003745 0.003745 0.003745 0.003745 0.000000
2021-01-18 04:59:00 0.002259 0.002259 0.002259 0.002259 0.000000 0.003745 0.003745 0.003745 0.003745 0.000000
2021-01-18 05:00:00 0.002259 0.002259 0.002259 0.002259 0.000000 0.003745 0.003745 0.003745 0.003745 0.000000
This pair-wise history is used for further time-series style calculations between the two symbols.
Currently, I use the following code to generate the shown dataframe:
def assemble_pairs_from_history(history):
# Build unique pair combinations from symbols
pair_tuples = list(combinations(history.index.unique('symbol'), 2))
pair_info = pd.DataFrame(pair_tuples, columns=['X', 'Y'])
# Designate a handle for each combination
pair_info['handle'] = pair_info['X'].str.cat(pair_info['Y'], sep="_")
display(pair_info)
display(history)
# Iterate over all pair combinations, assembling a dict of historical data DFs for X and Y symbol of the pair
pair_history = {}
for i, pi in pair_info.iterrows():
pair_history[pi.handle] = pd.concat([history.loc[pi.X], history.loc[pi.Y]], axis=1, keys=['X', 'Y'])
# Clean the time series so that only rows having values for both symbols of the pair remain
pair_history[pi.handle].dropna(inplace=True)
# Assemble into one big dataframe, index by handle -> time with X and Y top level columns
pair_history = pd.concat(pair_history.values(), keys=pair_history.keys(), names=['handle'])
display(pair_history)
The issue with this code is that the performance is pretty bad, with the iterrows() and subsequent concat() part taking 10 seconds for a result dataframe size of 5000 rows.
I already tried various combinations of apply(), join() and merge() as well as building a multiindex manually, but was not able to achieve the desired output format.
How can the performance of this task be improved?
Not sure if you need the multi index for the columns, but would something like this be what you're looking for?
Given the following two input dataframes
X | Y | handle |
---|---|---|
ETHEUR E3 | XMRUSD E3 | ETHEUR E3_XMRUSD E3 |
ETHEUR E3 | TRXETH E3 | ETHEUR E3_TRXETH E3 |
ETHEUR E3 | ETHUSD E3 | ETHEUR E3_ETHUSD E3 |
ETHEUR E3 | BTCEUR E3 | ETHEUR E3_BTCEUR E3 |
ETHEUR E3 | ETHBTC E3 | ETHEUR E3_ETHBTC E3 |
symbol | time | close | high | low | open | volume |
---|---|---|---|---|---|---|
ETHEUR E3 | 2021-01-18 04:41:00 | 988.73 | 988.73 | 988.727 | 988.727 | 8.40472 |
ETHEUR E3 | 2021-01-18 04:42:00 | 988.73 | 988.73 | 988.727 | 988.727 | 0 |
ETHEUR E3 | 2021-01-18 04:43:00 | 988.73 | 988.73 | 988.727 | 988.727 | 0 |
ETHEUR E3 | 2021-01-18 04:44:00 | 990.62 | 990.62 | 989.64 | 989.64 | 0.158394 |
ETHEUR E3 | 2021-01-18 04:45:00 | 995.01 | 995.01 | 995.01 | 995.01 | 0.006665 |
ANTETH E3 | 2021-01-18 04:56:00 | 0.003745 | 0.003745 | 0.003745 | 0.003745 | 0 |
ANTETH E3 | 2021-01-18 04:57:00 | 0.003745 | 0.003745 | 0.003745 | 0.003745 | 0 |
ANTETH E3 | 2021-01-18 04:58:00 | 0.003745 | 0.003745 | 0.003745 | 0.003745 | 0 |
ANTETH E3 | 2021-01-18 04:59:00 | 0.003745 | 0.003745 | 0.003745 | 0.003745 | 0 |
ANTETH E3 | 2021-01-18 05:00:00 | 0.003745 | 0.003745 | 0.003745 | 0.003745 | 0 |
you can use pandas.merge
to join the tables twice for each of the currency pairs.
combined = (
df1
# join data for first currency pair
.merge(df2, how="left", left_on="X", right_on="symbol")
# join data for second currency pair
.merge(df2, how="left", left_on=["Y", "time"], right_on=["symbol", "time"], suffixes=("_X", "_Y"))
)
# get columns names for multi index
cols = [x.split("_")[::-1] if "_" in x else ["", x] for x in combined.columns]
# set new column names
combined.columns = pd.MultiIndex.from_arrays(list(zip(*cols)))
# select required columns
combined = combined[[x for x in combined.columns if x[1] in ["handle", "time", "close", "high", "low", "open", "volume"]]]
Which results in the following dataframe:
('', 'handle') | ('', 'time') | ('X', 'close') | ('X', 'high') | ('X', 'low') | ('X', 'open') | ('X', 'volume') | ('Y', 'close') | ('Y', 'high') | ('Y', 'low') | ('Y', 'open') | ('Y', 'volume') |
---|---|---|---|---|---|---|---|---|---|---|---|
ETHEUR E3_XMRUSD E3 | 2021-01-18 04:41:00 | 988.73 | 988.73 | 988.727 | 988.727 | 8.40472 | nan | nan | nan | nan | nan |
ETHEUR E3_XMRUSD E3 | 2021-01-18 04:42:00 | 988.73 | 988.73 | 988.727 | 988.727 | 0 | nan | nan | nan | nan | nan |
ETHEUR E3_XMRUSD E3 | 2021-01-18 04:43:00 | 988.73 | 988.73 | 988.727 | 988.727 | 0 | nan | nan | nan | nan | nan |
ETHEUR E3_XMRUSD E3 | 2021-01-18 04:44:00 | 990.62 | 990.62 | 989.64 | 989.64 | 0.158394 | nan | nan | nan | nan | nan |
ETHEUR E3_XMRUSD E3 | 2021-01-18 04:45:00 | 995.01 | 995.01 | 995.01 | 995.01 | 0.006665 | nan | nan | nan | nan | nan |
Answered by Oxbowerce on July 4, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP