# Task Config Guide

The task config specifies task-related configuration, guiding the behavior of each component according to the needs of your machine learning task.

Whenever we say "task config" we are talking about an instance of {py:class}`snapchat.research.gbml.gbml_config_pb2.GbmlConfig`.

This is a protobuf class whose definition can be found in `gbml_config.proto`:
:language: proto

Just like the resource config, the values to instantiate this proto class are usually provided as a `.yaml` file. Most components accept the task config as an argument `--task_config_uri`, i.e. a {py:class}`gigl.common.Uri` pointing to a `task_config.yaml` file.
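
For orientation, the yaml mirrors the top-level fields of `GbmlConfig`. A minimal sketch of the overall shape (fields abbreviated and illustrative; the sections below cover each one in turn):

```yaml
graphMetadata: ...     # node and edge types in the graph
taskMetadata: ...      # the learning task to perform
sharedConfig: ...      # parameters shared across components
datasetConfig: ...     # data reading and preprocessing
trainerConfig: ...     # trainer class and its arguments
inferencerConfig: ...  # inferencer class and its arguments
featureFlags: ...      # optional flags, e.g. should_run_glt_backend
```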

## Example

We will use the MAG240M task config to walk you through what a config may look like.

Full task config for reference:
:language: yaml

### GraphMetadata

We specify all the node and edge types in the graph. In this case we have one node type, `paper_or_author`, and one edge type, `(paper_or_author, references, paper_or_author)`.

Note: In this example we have converted the heterogeneous MAG240M dataset to a homogeneous one with just one node type and one edge type, on which we will do self-supervised learning.

:language: yaml
:start-after: GraphMetadata
:end-before: ========
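
As a rough illustration, the block for a homogeneous graph like this one declares a single node type and a single edge type. A minimal sketch, assuming the camelCase field names used in GiGL yaml configs:

```yaml
graphMetadata:
  nodeTypes:
    - paper_or_author
  edgeTypes:
    - srcNodeType: paper_or_author
      relation: references
      dstNodeType: paper_or_author
```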

### TaskMetadata

Now we specify what type of learning task we want to do. In this case we want to leverage Node Anchor Based Link Prediction to do self-supervised learning on the edge `(paper_or_author, references, paper_or_author)`. Thus, we are using the `NodeAnchorBasedLinkPredictionTaskMetadata` task.

:language: yaml
:start-after: TaskMetadata
:end-before: ========
An example of `NodeBasedTaskMetadata` can be found in `gigl/src/mocking/configs/e2e_supervised_node_classification_template_gbml_config.yaml`
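
A sketch of what this block may look like for our node anchor based link prediction example; the nesting is illustrative and assumes the supervision edge is declared with the same src/relation/dst fields as in `graphMetadata`:

```yaml
taskMetadata:
  nodeAnchorBasedLinkPredictionTaskMetadata:
    supervisionEdgeTypes:
      - srcNodeType: paper_or_author
        relation: references
        dstNodeType: paper_or_author
```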

### SharedConfig

Shared config holds parameters that are common and may be used across multiple components, e.g. Trainer, Inferencer, Subgraph Sampler, etc.

:language: yaml
:start-after: SharedConfig
:end-before: ========
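
As an illustration, shared config entries are simple key-value toggles and paths; the field names below are examples of the kind of flags that live here, not a verbatim excerpt from the MAG240M config:

```yaml
sharedConfig:
  isGraphDirected: false          # illustrative field names
  shouldSkipTraining: false
  shouldSkipModelEvaluation: false
```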

### DatasetConfig

We configure the dataset that we will be using. In this example we use `dataPreprocessorConfigClsPath` to read and preprocess the data. See the Preprocessor Guide.

For current in-memory subgraph sampling pipelines, the Data Preprocessor output is consumed directly by Trainer and Inferencer, which sample neighborhoods online when the in-memory sampling flag (`featureFlags.should_run_glt_backend`) is enabled.

If you are maintaining the older tabularized pipeline, the next stages are the legacy Subgraph Sampler and Split Generator.

:language: yaml
:start-after: DatasetConfig
:end-before: ========
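
A sketch of the dataset config block; the class path here is hypothetical and would point at your own data preprocessor config class (see the Preprocessor Guide):

```yaml
datasetConfig:
  dataPreprocessorConfig:
    # Hypothetical path; replace with your own preprocessor config class.
    dataPreprocessorConfigClsPath: my_project.preprocessor.MyDataPreprocessorConfig
    dataPreprocessorArgs:
      some_arg: "some_value"
```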

### TrainerConfig

The class specified by `trainerClsPath` will be initialized and all the arguments specified in `trainerArgs` will be directly passed as `**kwargs` to your trainer class. The only requirement is that the trainer class implement the protocol defined in {py:class}`gigl.src.training.v1.lib.base_trainer.BaseTrainer`.

Some sensible pre-configured trainer implementations can be found in {py:class}`gigl.src.common.modeling_task_specs`, although you are encouraged to implement your own.

:language: yaml
:start-after: TrainerConfig
:end-before: ========
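
For illustration, a trainer block may look like the following; `my_project.trainer.MyTrainer` is a hypothetical class, and `trainerArgs` values are forwarded verbatim:

```yaml
trainerConfig:
  # Hypothetical class; must implement the BaseTrainer protocol.
  trainerClsPath: my_project.trainer.MyTrainer
  trainerArgs:  # passed directly as **kwargs to MyTrainer
    num_epochs: "5"
    learning_rate: "0.001"
```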

### InferencerConfig

Similar to the Trainer, the class specified by `inferencerClsPath` will be initialized and all arguments specified in `inferencerArgs` will be directly passed as `**kwargs` to your inferencer class. The only requirement is that the inferencer class implement the protocol defined in {py:class}`gigl.src.inference.v1.lib.base_inferencer.BaseInferencer`.

:language: yaml
:start-after: InferencerConfig
:end-before: ========
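
A matching sketch for the inferencer block, again with a hypothetical class path:

```yaml
inferencerConfig:
  # Hypothetical class; must implement the BaseInferencer protocol.
  inferencerClsPath: my_project.inferencer.MyInferencer
  inferencerArgs:  # passed directly as **kwargs to MyInferencer
    batch_size: "512"
```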

## Custom resolvers

GiGL makes use of custom OmegaConf resolvers to expose macros that are resolved at runtime instead of being hardcoded. Our resolvers are defined in `omegaconf_resolvers.py`.

In tabularized GiGL, the Subgraph Sampler and Split Generator do not support custom resolvers. In most cases this should not be a problem, as the Config Populator resolves any needed macros into your frozen configs.

### Time resolvers

You can specify `now:` along with a `datetime.datetime`-compatible `strftime` format string. You can also append `datetime.timedelta`-compatible time offsets. Using these in any of the values in your yamls will resolve automatically when the config is loaded into GiGL.

Examples:

```yaml
name: "exp_${now:%Y%m%d_%H%M%S}"
start_time: "${now:%Y-%m-%d %H:%M:%S}"
log_file: "logs/run_${now:%H-%M-%S}.log"
timestamp: "${now:}"  # Uses default format "%Y%m%d_%H%M%S"
short_date: "${now:%m-%d}"

tomorrow: "${now:%Y-%m-%d, days+1}"
yesterday: "${now:%Y-%m-%d, days-1}"
tomorrow_plus_5_hours_30_min_15_sec: "${now:%Y-%m-%d %H:%M:%S,hours+5,days+1,minutes+30,seconds+15}"
next_week: "${now:%Y-%m-%d, weeks+1}"
multiple_args: "${now:%Y%m%d, days-15}:${now:%Y%m%d, days-1}"
```

Assuming the current datetime is 2023-12-15 14:30:22, this would resolve to something like:

```yaml
name: "exp_20231215_143022"
start_time: "2023-12-15 14:30:22"
log_file: "logs/run_14-30-22.log"
timestamp: "20231215_143022"
short_date: "12-15"

tomorrow: "2023-12-16"
yesterday: "2023-12-14"
tomorrow_plus_5_hours_30_min_15_sec: "2023-12-16 20:00:37"
next_week: "2023-12-22"
multiple_args: "20231130:20231214"
```

### Git Hash Resolver

This resolver returns the current git hash if one is available. It takes no arguments and returns the git hash as a string. Specifically, it returns the SHA produced when the following is run in the active working directory:

```sh
git rev-parse HEAD
```

If no git repository is found, or there is an error, it returns an empty string.

Examples:

```yaml
experiment:
    commit: "${git_hash:}"
    model_version: "model_${git_hash:}"
```

Assuming you are scheduling workflows from an active git repo with the current commit hash 9d42b423b65961692ffc650a0714a63a1b695b12, this would resolve to:

```yaml
experiment:
    commit: "9d42b423b65961692ffc650a0714a63a1b695b12"
    model_version: "model_9d42b423b65961692ffc650a0714a63a1b695b12"
```