Scripting¶
Example¶
Let’s say we want to construct a joint likelihood of two data vectors data1 and data2, with their own models model1 and model2
and a common covariance matrix cov. Our parameter .yaml file would then be:
main:
  $modules: [like]
like:
  $module_name: template_lib.likelihood
  $module_class: JointGaussianLikelihood
  join: [like1, like2]
  $modules: [cov]
like1:
  $module_name: template_lib.likelihood
  $module_class: BaseLikelihood
  $modules: [data1, model1]
like2:
  $module_name: template_lib.likelihood
  $module_class: BaseLikelihood
  $modules: [data2, model2]
data1:
  $module_name: template_lib.data_vector
  # details about how to get the data vector #1
  y: [1.0,1.0,1.0,1.0,1.0]
model1:
  # details about model #1
  $module_name: template_lib.model
  $module_class: FlatModel
data2:
  $module_name: template_lib.data_vector
  # details about how to get the data vector #2
  y: [1.0,1.0,1.0]
model2:
  # details about model #2
  $module_name: template_lib.model
  $module_class: FlatModel
cov:
  # details about how to get the common covariance
  $module_name: template_lib.covariance
  yerr: [1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]
Here, module names (left-aligned) can be chosen freely, except main, which is the entry point of the pipeline.
The $modules field of modules inheriting from BasePipeline (called (sub)pipelines in the following)
lists the modules they run.
Note
All pypescript keywords start with the dollar sign $.
If the configuration file above is saved in ‘config_file.yaml’, and you have installed template_lib by running e.g.:
python setup.py develop
in the template_lib root directory, then the example above can be launched as:
pypescript config_file.yaml
One can achieve the same pipeline in Python with:
from pypescript import BaseModule, BasePipeline, ConfigBlock, SectionBlock
from template_lib.model import FlatModel
from template_lib.likelihood import BaseLikelihood, JointGaussianLikelihood
config_block = ConfigBlock('config_file.yaml')
data1 = BaseModule.from_filename(name='data1',options=SectionBlock(config_block,'data1'))
model1 = FlatModel(name='model1')
data2 = BaseModule.from_filename(name='data2',options=SectionBlock(config_block,'data2'))
model2 = BaseModule.from_filename(name='model2',options=SectionBlock(config_block,'model2'))
cov = BaseModule.from_filename(name='cov',options=SectionBlock(config_block,'cov'))
like1 = BaseLikelihood(name='like1',modules=[data1,model1])
like2 = BaseLikelihood(name='like2',modules=[data2,model2])
like = JointGaussianLikelihood(name='like',join=[like1,like2],modules=[cov])
pipeline = BasePipeline(modules=[like])
Here we used the configuration file above saved in ‘config_file.yaml’,
but options can be simple dictionaries (SectionBlock simply takes a slice of config_block for the given section).
You can also modify each module’s data_block at will.
There is really no more complexity than using Python classes successively, with pypescript holding all these classes.
So even if you do not like the pypescript framework, you can still use pypescript modules very easily in your own code.
We also provide a wrapper that allows lighter syntax:
from pypescript import mimport
model1 = mimport('template_lib.model',module_class='FlatModel',name='model1')
[Figure: diagram of the pipeline above, generated by plot_pipeline_graph()]
pypescript rules¶
The pypescript framework is agnostic about the actual operations performed by the modules it sets up, executes and cleans up. This is key to ensuring the base code does not need to be modified when adding a new module.
Similarly, modules are agnostic about the operations performed by other modules. This is key to ensuring modules do not need to be modified when adding new ones.
Hence, the pipeline integrity is ensured by the user script.
The main difficulty is to ensure that each module takes the input of a previous module at the relevant entry (section, name)
of data_block, the DataBlock instance passed to all modules (see Framework).
data_block¶
CosmoSIS implements a linear pipeline: all modules form a single chain.
Instead, we allow for a tree-like structure, which is explored depth-first, left to right, where nodes are (sub)pipelines (see below).
Both approaches would be fully equivalent if the data_block were a global variable for all modules (within all (sub)pipelines).
Instead, contrary to CosmoSIS, each (sub)pipeline creates (at initialisation only) a (shallow!) copy of the data_block to be passed to its modules
(with the exception of StreamPipeline).
Note
In the example above, [model2] does not know anything about [model1] products. If one wanted to add a common calculation beforehand
(e.g. linear power spectrum), it would be added at the head of the modules list of [main]
(not of [like] because of the peculiar structure of JointGaussianLikelihood - its modules being run after join).
Hence, any change made by these modules to data_block is local (effective within the (sub)pipeline only), which we think is the most commonly expected behaviour.
Therefore, a precomputation performed ahead of this (sub)pipeline and saved into data_block[section,name] will not be erased by the
modules of this (sub)pipeline, even if they write to the same entry of data_block.
This allows modules to update (for themselves) previous entries in data_block, and hence to keep the list of entries (section, name) in use short.
Most of the links between module input and output entries are then encoded in the pipeline structure itself.
We think it also makes the pipeline structure more readable.
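The copy semantics described above can be mimicked with plain dictionaries. The following is only a sketch, not pypescript code: a dict keyed by (section, name) stands in for data_block, and run_subpipeline, the two toy modules and the entry names ('model', 'pk') and ('misc', 'log') are hypothetical.

```python
# Sketch only: a plain dict keyed by (section, name) stands in for DataBlock.

def run_subpipeline(data_block, modules):
    """Run modules on a shallow copy of data_block, as a (sub)pipeline would."""
    local_block = dict(data_block)  # shallow copy: new dict, same value objects
    for module in modules:
        module(local_block)
    return local_block


def overwrite_pk(block):
    # Rebinding an entry only affects the local copy.
    block['model', 'pk'] = 'pk from model1'


def append_to_log(block):
    # Mutating a shared object in place is visible outside: the copy is shallow.
    block['misc', 'log'].append('model1 was here')


parent_block = {('model', 'pk'): 'linear pk', ('misc', 'log'): []}
local_block = run_subpipeline(parent_block, [overwrite_pk, append_to_log])

print(parent_block['model', 'pk'])  # linear pk
print(local_block['model', 'pk'])   # pk from model1
print(parent_block['misc', 'log'])  # ['model1 was here']
```

The last line illustrates why the copy is flagged as shallow: entries that are rebound stay local, whereas objects modified in place are shared with the parent.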
Yet, this may not be sufficient in some corner cases; we may e.g. want to save the result of a given operation (e.g. derived parameter)
performed at some position in the tree. This is made possible by using the keyword datablock_duplicate in any module section of the configuration file/dictionary:
$datablock_duplicate:
  section2.name2: section1.name1
will (shallow!) copy the element from data_block entry (section1, name1) to entry (section2, name2) at each step (setup, execute, cleanup).
There is a global (i.e. shared by all modules whatever their depth) section: ‘common’. So taking section2 = 'common' will make the element accessible anywhere in the pipeline.
One can also locally (i.e. in one module or subpipeline) map a data_block entry (section1, name1) to entry (section2, name2):
$datablock_mapping:
  section2.name2: section1.name1
Unlike datablock_duplicate, this works as a reference: any change in the value pointed to by entry (section1, name1) is visible through (section2, name2).
This can be useful when one wants to map module-specific parameters to their standard names within the relevant modules.
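The difference between a duplicate (copy) and a mapping (reference) can be illustrated with a toy container. This is a sketch under stated assumptions: MiniBlock and its duplicate and map methods are hypothetical and do not reproduce the DataBlock API, and the entry ('common', 'saved') is an arbitrary name.

```python
# Sketch only: MiniBlock is a hypothetical stand-in for DataBlock.

class MiniBlock:
    def __init__(self, data=None):
        self.data = dict(data or {})
        self.mapping = {}  # alias key -> target key

    def _resolve(self, key):
        # Follow aliases until reaching a real entry.
        while key in self.mapping:
            key = self.mapping[key]
        return key

    def __getitem__(self, key):
        return self.data[self._resolve(key)]

    def __setitem__(self, key, value):
        self.data[self._resolve(key)] = value

    def duplicate(self, new_key, key):
        # Shallow copy: the new entry holds the current value object.
        self.data[new_key] = self[key]

    def map(self, new_key, key):
        # Reference: new_key becomes an alias of key.
        self.mapping[new_key] = key


block = MiniBlock({('section1', 'name1'): 1.0})
block.duplicate(('common', 'saved'), ('section1', 'name1'))
block.map(('section2', 'name2'), ('section1', 'name1'))

block['section1', 'name1'] = 2.0  # later update of the original entry

print(block['common', 'saved'])    # 1.0
print(block['section2', 'name2'])  # 2.0
```

The sketch shows the state within a single step; since the duplication is described as happening at each step (setup, execute, cleanup), the copied entry would be refreshed at the next step.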
Finally, one can locally set data_block entries using e.g.:
$datablock_set:
  section2.name2: 42
To summarize:
- we allow for a tree-like structure, where data_block is (shallow) copied at each node;
- any change to data_block is local within a given (sub)pipeline;
- the section where changes are global (effective for the whole pipeline) is ‘common’;
- if necessary, any entry of data_block can be moved anywhere (including the ‘global’ sections) with the keyword datablock_duplicate in the configuration file/dictionary;
- config_block is always global.
Note
Our framework thus generalizes the CosmoSIS structure: one can always stick to a single linear chain of modules if that is more intuitive.
(sub)pipelines¶
There are several pre-defined pipelines in pypescript, but one can implement others in pypescript libraries.
These pre-defined pipelines are BasePipeline, StreamPipeline, MPIPipeline, BatchPipeline.
StreamPipeline, MPIPipeline and BatchPipeline all inherit from BasePipeline, which implements
the common behavior described below. Then we will explain the differences between StreamPipeline, MPIPipeline and BatchPipeline.
Generally, a BasePipeline-inherited pipeline will instantiate and call (either setup, execute or cleanup) several modules,
that will perform some operations (take some input from and add output to data_block).
Note
In the example above, if model1 and model2 had the same options, one could equivalently specify model options
once and have $modules: [data1, model] and $modules: [data2, model] in like1 and like2, respectively.
Behind the scenes, the model module will be instantiated independently in like1 and like2 (though with the same options).
Then, one can specify the operations to be performed at each step of the subpipeline, e.g.:
$modules: [data1, model1]
is equivalent to:
$setup: [data1, model1]
$execute: [data1, model1]
$cleanup: [data1, model1]
(the last line being optional, as all modules will be cleaned up eventually). This means that when the pipeline is set up,
it will successively call the setup of data1 and model1 (and likewise for execute and cleanup).
If we want e.g. model1 to be set up at the execute step, we would do:
$setup: [data1]
$execute: [data1, model1:setup, model1:execute]
which is actually equivalent to:
$setup: [data1]
$execute: [data1, model1]
because the pipeline will understand that model1 should be set up before being executed in the pipeline’s execute step.
One can have any variation upon this, e.g.:
$setup: [data1:execute, model1]
$execute: [model1]
will run data1 setup and execute and model1 setup in the pipeline’s setup step, then execute model1
in the pipeline’s execute step. Finally, data1 and model1 will be cleaned up in the pipeline’s cleanup step.
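The scheduling rules above can be summarized in a few lines of Python. This is a minimal sketch of the behaviour as described in this section, not the pypescript implementation; schedule is a hypothetical helper and it only covers a single pass through the setup, execute and cleanup steps.

```python
# Sketch only: schedule() is a hypothetical helper mimicking the rules above.

def schedule(setup, execute):
    """Return the module calls made in each pipeline step."""
    done_setup = set()
    seen = []  # modules in order of first appearance, for the cleanup step

    def expand(entries, default_step):
        calls = []
        for entry in entries:
            # 'model1:setup' -> ('model1', 'setup'); 'model1' -> default step
            name, _, step = entry.partition(':')
            step = step or default_step
            if name not in seen:
                seen.append(name)
            # A module is set up (once) before its first execute.
            if step == 'execute' and name not in done_setup:
                calls.append((name, 'setup'))
                done_setup.add(name)
            if step == 'setup':
                if name in done_setup:
                    continue
                done_setup.add(name)
            calls.append((name, step))
        return calls

    plan = {'setup': expand(setup, 'setup'),
            'execute': expand(execute, 'execute')}
    plan['cleanup'] = [(name, 'cleanup') for name in seen]
    return plan


plan = schedule(setup=['data1:execute', 'model1'], execute=['model1'])
for step, calls in plan.items():
    print(step, calls)
```

With these inputs, the setup step runs data1 setup, data1 execute and model1 setup, the execute step runs model1 execute, and both modules are cleaned up at the end, as in the last example above.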
Note
If your (sub)pipeline performs MCMC sampling, for example, then the step execute of this pipeline will naturally be called at each MCMC step.
But we can imagine that we loop on different data vectors instead. In this case, execute will be called for each of these vectors.
Let us now turn to the specifics of the pre-defined pipelines.
In StreamPipeline, data_block is directly passed on to different modules, without (shallow) copy. Hence:
main:
  $module_name: pypescript
  $module_class: BasePipeline
  $modules: [pipe1, pipe2]
pipe1:
  $module_name: pypescript
  $module_class: StreamPipeline
  $modules: [module1, module2]
is equivalent to:
main:
  $module_name: pypescript
  $module_class: BasePipeline
  $modules: [module1, module2, pipe2]
i.e. pipe2 will see the outputs of module1, module2 in data_block.
In MPIPipeline, whatever is performed in the execute step is repeated and distributed with MPI.
Of course, it does not make sense to repeat exactly the same task, so config_block and/or data_block should be updated for each repetition.
In this example:
main:
  $module_name: pypescript
  $module_class: BasePipeline
  $modules: [pipe1, pipe2]
pipe1:
  $module_name: pypescript
  $module_class: MPIPipeline
  $nprocs_per_task: 2
  $modules: [module1, module2]
  $configblock_iter:
    module1.xlim: [[0.02, 0.3], [0.02, 0.2], [0.02, 0.1]]
    module1.value: e'lambda i:i+1'
  $datablock_iter:
    data.input: [a, b, c]
  $datablock_key_iter:
    data.result: [result_0, result_1, result_2]
module1:
  ...
  xlim: [0, 0.5]
  value: 1
module1 and module2 will first be set up at main’s setup step.
Then module1 and module2’s execute will be run on three different batches of nprocs_per_task processes.
In the first batch, xlim and value options of module1 are set to [0.02, 0.3], 1 respectively.
The (data, input) data_block entry is set to 'a', and the object in data_block entry (data, result)
(e.g. as put by module1 in this batch) is broadcast to data_block entry (data, result_0) (which can be seen by e.g. pipe2).
In the second batch, xlim and value options of module1 are set to [0.02, 0.2], 2 respectively.
The (data, input) data_block entry is set to 'b', and the object in data_block entry (data, result) is broadcast to data_block entry (data, result_1).
In the third batch, xlim and value options of module1 are set to [0.02, 0.1], 3 respectively.
The (data, input) data_block entry is set to 'c', and the object in data_block entry (data, result) is broadcast to data_block entry (data, result_2).
In the end, data_block objects in (data, result_0), (data, result_1), (data, result_2) are available for further processing in e.g. pipe2.
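The per-task settings listed above can be derived mechanically from the three iteration keywords. The sketch below does so without any MPI; task_settings is a hypothetical helper and the 'config', 'data' and 'rename' labels are names chosen for this illustration only.

```python
# Sketch only: expands the iteration keywords into one settings dict per task.
# No MPI is involved; task_settings() is a hypothetical helper.

def task_settings(configblock_iter, datablock_iter, datablock_key_iter, ntasks):
    tasks = []
    for itask in range(ntasks):
        config = {}
        for key, values in configblock_iter.items():
            # Either one value per task, or a function of the task index.
            config[key] = values(itask) if callable(values) else values[itask]
        data = {key: values[itask] for key, values in datablock_iter.items()}
        rename = {key: values[itask] for key, values in datablock_key_iter.items()}
        tasks.append({'config': config, 'data': data, 'rename': rename})
    return tasks


tasks = task_settings(
    configblock_iter={
        'module1.xlim': [[0.02, 0.3], [0.02, 0.2], [0.02, 0.1]],
        'module1.value': lambda i: i + 1,
    },
    datablock_iter={'data.input': ['a', 'b', 'c']},
    datablock_key_iter={'data.result': ['result_0', 'result_1', 'result_2']},
    ntasks=3,
)

for task in tasks:
    print(task)
```

Each printed dictionary corresponds to one batch: the options applied to module1, the value written to (data, input), and the entry to which (data, result) is broadcast.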
BatchPipeline (which has not been thoroughly tested) follows the same spirit as MPIPipeline, but instead
of creating batches of MPI processes, it dumps the data_block to disk with an appropriate configuration file, and launches pypescript with these new inputs.
main:
  $module_name: pypescript
  $module_class: BasePipeline
  $modules: [pipe1, pipe2]
pipe1:
  $module_name: pypescript
  $module_class: BatchPipeline
  $nprocs_per_task: 2
  $modules: [module1, module2]
  $configblock_iter:
    module1.xlim: [[0.02, 0.3], [0.02, 0.2], [0.02, 0.1]]
    module1.value: e'lambda i:i+1'
  $datablock_iter:
    data.input: [a, b, c]
  $datablock_key_iter:
    data.result: [result_0, result_1, result_2]
  job_dir: jobs/
  job_template: job-template.sh
  job_options:
    time: 02.00.00
  job_submit: sbatch
module1:
  ...
  xlim: [0, 0.5]
  value: 1
Here, data_block and configuration files will be saved in job_dir ('jobs/').
Jobs will be submitted with command sbatch, with a script based on job_template ('job-template.sh') to be filled with options specified in job_options.
The resulting data_block of each job will be dumped to disk. The current job will then reload them, and make the data_block objects in
(data, result_0), (data, result_1), (data, result_2) available for further processing in e.g. pipe2.
Configuration file shortcuts¶
For rapid and convenient scripting, a number of configuration file shortcuts have been defined.
Replacements¶
One can refer to values defined in any part of the configuration file through the syntax ${section1.section2...}, e.g.:
answer:
  to: 42
the: 84
ultimate:
  question: ${answer.to}
  of: ${the}
  of2: ${of}
Here ${answer.to} will be replaced by 42 and ${the} by 84. By default, ${name} refers to the same section, hence ${of} will be replaced by 84.
Note that since only (section, name) fields are retained, the original top-level entry ‘the’ will be discarded in the rest of the pipeline.
One can also refer to another configuration file, using the syntax: ${path_to_other_file:answer.to}.
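To make the replacement rules concrete, here is a simplified resolver written in plain Python. It is a sketch, not the pypescript parser: resolve and lookup are hypothetical helpers, references to other files are not handled, and the lookup order for a bare ${name} (current section first, then top level) is an assumption inferred from the example above.

```python
import re

# Sketch only: resolve() is a hypothetical, simplified resolver for ${...}.
PATTERN = re.compile(r'\$\{([^}]+)\}')


def lookup(config, section, ref):
    """Return the raw value pointed to by 'section.name' or 'name'."""
    if '.' in ref:
        sec, name = ref.split('.', 1)
        return config[sec][name], sec
    # Assumption: a bare name is searched in the current section first,
    # then at the top level of the file.
    if ref in config[section]:
        return config[section][ref], section
    return config[ref], section


def resolve(config, section, value):
    """Recursively replace a value that is a single ${...} reference."""
    if isinstance(value, str):
        match = PATTERN.fullmatch(value)
        if match:
            target, target_section = lookup(config, section, match.group(1))
            return resolve(config, target_section, target)
    return value


config = {
    'answer': {'to': 42},
    'the': 84,
    'ultimate': {'question': '${answer.to}', 'of': '${the}', 'of2': '${of}'},
}
resolved = {name: resolve(config, 'ultimate', value)
            for name, value in config['ultimate'].items()}
print(resolved)  # {'question': 42, 'of': 84, 'of2': 84}
```

Note that of2 is resolved in two hops: ${of} points to ${the}, which in turn points to 84.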
Imports¶
answer:
  to: 42
the: 84
ultimate:
  ${answer}:
  to: 21
  of: ${the}
Here ultimate will be filled with the elements of answer (to: 42), then ultimate.to will be replaced by 21.
One can also import a section from another configuration file, using the syntax: ${path_to_other_file:section}:.
To import the other configuration file completely, no section is specified: ${path_to_other_file:}:.
Mapping (references)¶
config_block entries can be mapped to each other through the syntax $&{section.name}, e.g.:
answer:
  to: 42
ultimate:
  question: $&{answer.to}
Here the config_block entry (ultimate, question) will refer to (answer, to) (meaning any change to the latter in the process of the pipeline will affect the former as well).
Eval pattern¶
In some cases we may want to directly evaluate some Python code (e.g. a list comprehension).
The syntax is e'':
answer:
  to: 42
ultimate:
  question: e'[${answer.to} + i for i in range(10)]'
The entry (ultimate, question) will be filled with a list of size 10, containing the numbers from 42 to 51.
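The two stages involved (replacement of ${answer.to}, then evaluation) can be reproduced in plain Python. This is only a sketch of the idea and does not claim to follow how pypescript parses the e'' pattern; substitute is a hypothetical helper.

```python
import re

# Sketch only: substitute ${answer.to} by its value, then evaluate the expression.
config = {'answer': {'to': 42}}
entry = "e'[${answer.to} + i for i in range(10)]'"

expression = entry[2:-1]  # strip the leading e' and the trailing '


def substitute(match):
    # 'answer.to' -> repr of config['answer']['to']
    section, name = match.group(1).split('.')
    return repr(config[section][name])


expression = re.sub(r'\$\{([^}]+)\}', substitute, expression)
question = eval(expression)  # only evaluate configuration files you trust

print(expression)  # [42 + i for i in range(10)]
print(question)    # [42, 43, 44, 45, 46, 47, 48, 49, 50, 51]
```

Since range(10) stops at 9, the last element is 42 + 9 = 51.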
Format pattern¶
One may want to set variables defined anywhere in the configuration file (e.g. a directory path) into a string (e.g. a full file path). Here is the corresponding syntax:
plots_dir: 'plots'
ultimate:
  question: f'${plots_dir}/my_plot.png'
  answer: e'"{}/my_plot.png".format(${plots_dir})'
The entry (ultimate, question) will be filled with the string ‘plots/my_plot.png’.
The eval syntax produces the same output in (ultimate, answer) but is more verbose.
Repeats¶
One can generate on-the-fly configuration with the syntax “$(%)”:
main:
  $modules: [model$(1), model$(2)]
model$(%):
  $modules: [base$(%)]
base$(%):
  value: e'%$ + 1'
is equivalent to:
main:
  $modules: [model1, model2]
model1:
  $modules: [base1]
base1:
  value: 2
model2:
  $modules: [base2]
base2:
  value: 3
data_block operations¶
We also propose shortcuts for datablock_duplicate, datablock_mapping and datablock_set operations presented above.
These data_block operations use square brackets [] where the config_block shortcuts above use braces {}.
One can achieve the datablock_duplicate operation (shallow copy of data_block entry from (section1, name1) to (section2, name2)) through the syntax:
ultimate:
  $[section2.name2]: $[section1.name1]
One can achieve the datablock_mapping operation (data_block entry (section2, name2) referencing (section1, name1)) through the syntax:
ultimate:
  $[section2.name2]: $&[section1.name1]
Finally, the datablock_set operation (locally filling data_block entry (section, name)) can be achieved with:
answer:
  to: 42
ultimate:
  $[section.name]: 42
Here 42 can be replaced by any reference to the configuration file (e.g. ${answer.to}).
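The three shortcuts differ only in the form of the right-hand side, which the following sketch makes explicit. classify is a hypothetical helper written for this illustration, not the pypescript parser, and the regular expressions assume section and entry names made of word characters only.

```python
import re

# Sketch only: classify() is a hypothetical helper, not the pypescript parser.
KEY = re.compile(r'\$\[(\w+)\.(\w+)\]')
REFERENCE = re.compile(r'\$(&?)\[(\w+)\.(\w+)\]')


def classify(key, value):
    """Map one shortcut line to (operation, destination entry, source or value)."""
    match = KEY.fullmatch(key)
    if match is None:
        raise ValueError('not a data_block shortcut: {}'.format(key))
    destination = match.groups()
    if isinstance(value, str):
        ref = REFERENCE.fullmatch(value)
        if ref is not None:
            # '$&[...]' is a reference (mapping), '$[...]' a shallow copy.
            operation = 'datablock_mapping' if ref.group(1) else 'datablock_duplicate'
            return operation, destination, ref.groups()[1:]
    # Any other value is set as is.
    return 'datablock_set', destination, value


print(classify('$[section2.name2]', '$[section1.name1]'))
print(classify('$[section2.name2]', '$&[section1.name1]'))
print(classify('$[section.name]', 42))
```

The first call is identified as datablock_duplicate, the second as datablock_mapping and the third as datablock_set.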