Data Science Asked by meow on July 2, 2021
I am building a multistep workflow.
The inputs are input.csv, parameters_global.yml, parameters_step1.yml, and parameters_step2.yml.
Step 1 takes input.csv, parameters_global.yml, and parameters_step1.yml. It processes the data and produces step1_results.csv.
The results are then checked, which gives step1_checked.csv.
Step 2 takes parameters_global.yml, parameters_step2.yml, and step1_checked.csv, processes everything, and produces some step2_results.csv.
Repeat ad infinitum. Are there best practices for how to organize the different steps (data and code) into folders for data input, output, and so on?
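To make the chain of files explicit, here is a minimal Python sketch that declares each step with the files it reads and writes, then walks the steps in order to see which inputs nobody provides. The file names come from the description above; the `Step` class, the step names, and `unresolved_inputs` are hypothetical helpers, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One workflow step: the files it reads and the files it writes."""
    name: str
    inputs: tuple
    outputs: tuple


# File names are taken from the question; the step names are assumptions.
STEPS = [
    Step("step1",
         ("input.csv", "parameters_global.yml", "parameters_step1.yml"),
         ("step1_results.csv",)),
    Step("step2",
         ("parameters_global.yml", "parameters_step2.yml", "step1_checked.csv"),
         ("step2_results.csv",)),
]

USER_FILES = ["input.csv", "parameters_global.yml",
              "parameters_step1.yml", "parameters_step2.yml"]


def unresolved_inputs(steps, user_provided):
    """Return {step name: [inputs that neither the user nor an earlier step provides]}."""
    available = set(user_provided)
    problems = {}
    for step in steps:
        missing = [f for f in step.inputs if f not in available]
        if missing:
            problems[step.name] = missing
        # Outputs of this step become available to all later steps.
        available.update(step.outputs)
    return problems


print(unresolved_inputs(STEPS, USER_FILES))
```

With the declarations above, only step2 is flagged, for step1_checked.csv, because no script produces that file. A check like this shows where a hand-off between steps happens outside the scripts.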
Current suggestion is:
input/01_step1name/, input/02_step2name/ for all user-provided extra input per step
output/01_step1name/, output/02_step2name/ for all generated output per step
scripts/01_step1name/01_firstscript.R, scripts/01_step1name/02_secondscript.R, scripts/01_step1name/03_thirdscript.py
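A small Python sketch of how this layout could be created consistently, so that the numbered prefix is generated in one place instead of typed by hand. The function name `step_dirs` and the two-digit zero padding are assumptions based on the folder names above.

```python
import tempfile
from pathlib import Path


def step_dirs(root, index, name):
    """Create and return the input/output/scripts folders for one step."""
    label = f"{index:02d}_{name}"  # e.g. 1, "step1name" -> "01_step1name"
    dirs = {kind: Path(root) / kind / label
            for kind in ("input", "output", "scripts")}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Demo in a throwaway directory so nothing is written into a real project.
with tempfile.TemporaryDirectory() as root:
    dirs = step_dirs(root, 1, "step1name")
    for kind, path in dirs.items():
        print(kind, path.relative_to(root))
```

Each script can then ask for `dirs["input"]` or `dirs["output"]` rather than building path strings itself, which keeps the naming scheme changeable later.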
There are also cases where a user might need to take an output and go to a measurement device, measure something based on that output, come back and add the results as a new input for the next step. I could "symbolize" the measurement by adding an extra step where there is nothing to execute, and manually put the measurements in the output folder when done. Alternatively, I could treat the measurement as user input and use the measurement results as input for the next step…
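Whichever of the two options is chosen, the measurement can be treated as a gate: the next step only runs once the expected files are present. A minimal Python sketch of such a check follows; the function name `manual_step_pending` and the file name measurements.csv are hypothetical.

```python
import tempfile
from pathlib import Path


def manual_step_pending(folder, expected_files):
    """Return the expected files that are not yet present in `folder`."""
    folder = Path(folder)
    return [name for name in expected_files if not (folder / name).is_file()]


# Demo: the folder stands in for wherever the measurements are delivered.
with tempfile.TemporaryDirectory() as tmp:
    print(manual_step_pending(tmp, ["measurements.csv"]))  # still waiting
    (Path(tmp) / "measurements.csv").write_text("sample,value\n")
    print(manual_step_pending(tmp, ["measurements.csv"]))  # nothing pending
```

The same check works for both designs; only the folder passed in changes (an output folder for a "symbolic" step, or the input folder of the following step).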
An extra add-on is that some but not all steps share a code basis, e.g.
scripts/02_processingstep2/01_run.R sources a functions.R
scripts/03_processingstep3/01_run.R sources the same functions.R
scripts/04_processingstep4/01_run.R might use a completely different set of R libraries/environment/functions
I don’t really have a good concept to handle this… perhaps an extra
library/mytoolset/ with functions.R
library/othertoolset/ with different codebase
I know about Nextflow, but I feel it doesn’t solve the problem of organization, more the problem of execution, which is not such a big deal for me.
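One way to keep track of which step uses which shared code is to record the mapping in one place. The Python sketch below assumes the library/ layout suggested above; the `TOOLSETS` table and both helper functions are hypothetical.

```python
from pathlib import Path

# Hypothetical mapping; step and toolset names follow the examples above.
TOOLSETS = {
    "02_processingstep2": "mytoolset",
    "03_processingstep3": "mytoolset",
    "04_processingstep4": "othertoolset",
}


def toolset_dir(root, step):
    """Return the library folder a step should load its shared code from."""
    return Path(root) / "library" / TOOLSETS[step]


def steps_sharing(toolset):
    """List the steps that depend on a toolset, e.g. before changing it."""
    return sorted(step for step, used in TOOLSETS.items() if used == toolset)


print(toolset_dir(".", "02_processingstep2"))
print(steps_sharing("mytoolset"))
```

A table like this documents the dependencies between steps and shared code without forcing every step onto the same environment.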