Loading tabular data (`data_sources`)¶

We have chosen YAML syntax to define Calliope models as it is human-readable. However, when you have a large dataset, the YAML files can become large and ultimately not as readable as we would like. For instance, for parameters that vary in time we would have a list of 8760 values and timestamps to put in our YAML file!

Therefore, alongside your YAML model definition, you can load tabular data from CSV files (or from in-memory pandas.DataFrame objects) under the data_sources top-level key. As of Calliope v0.7.0, this tabular data can be of any kind. Prior to this, loading from file was limited to timeseries data.

The full syntax from loading tabular data can be found in the associated schema. In brief it is:

source: path to file or reference name for an in-memory object.
rows: the dimension(s) in your table defined per row.
columns: the dimension(s) in your table defined per column.
select: values within dimensions that you want to select from your tabular data, discarding the rest.
drop: dimensions to drop from your rows/columns, e.g., a "comment" row.
add_dimensions: dimensions to add to the table after loading it in, with the corresponding value(s) to assign to the dimension index.

When we refer to "dimensions", we mean the sets over which data is indexed in the model: nodes, techs, timesteps, carriers, costs. In addition, when loading from file, there is the required dimension parameters. This is a placeholder to point Calliope to the parameter name(s) that your tabular data is referring to. The values assigned as parameters will become the array names in your loaded model dataset (model.inputs).

YAML vs tabular definitions¶

In this section we will show some examples of loading data and provide the equivalent YAML definition that it would replace.

Loading timeseries data¶

TabularYAML

Data in file:

timesteps
2005-01-01 12:00:00	15
2005-01-01 13:00:00	5

YAML definition to load data:

data_sources:
  pv_capacity_factor_data:
    source: data_sources/pv_resource.csv
    rows: timesteps
    add_dimensions:
      techs: pv
      parameters: source_use_equals

techs:
  pv:
    source_use_equals:
      data: [15, 5, 20, 11, ...]
      index: ["2005-01-01 12:00:00", "2005-01-01 13:00:00", "2005-01-01 12:00:00", "2005-01-01 13:00:00", ....]
      dims: "timesteps"

Getting timestamp formats right

By default, Calliope expects time series data in a model to be indexed by ISO 8601 compatible time stamps in the format YYYY-MM-DD hh:mm:ss, e.g. 2005-01-01 00:00:00. This can be changed by setting config.build.time_format based on strftime directives, which defaults to "ISO8601".

If you work with your CSV files in Excel, keep a careful eye on the format of the timestamps. Excel will automatically update the format to match your operating system default, which is usually not the ISO8601 format.

Loading technology data from file¶

TabularYAML

Data in file:


tech1	base_tech	supply (1)
tech1	flow_cap_max	100
tech1	flow_out_eff	0.1
tech1	area_use_max	500
tech1	area_use_per_flow_cap	7
tech2	base_tech	demand
tech3	base_tech	storage
tech3	storage_cap_max	200
tech3	flow_cap_max	100

Unlike the previous example, we do not have a "header" row with column names in this file. We start directly with defining data. Our dimensions are only defined per row, not per column.

YAML definition to load data:

data_sources:
  tech_data:
    source: data_sources/tech_data.csv
    rows: [techs, parameters]

techs:
  tech1:
    base_tech: supply
    flow_cap_max: 100
    flow_out_eff: 0.1
    area_use_max: 500
    area_use_per_flow_cap: 7
  tech2:
    base_tech: demand
  tech3:
    base_tech: storage
    storage_cap_max: 200
    flow_cap_max: 100

Loading technology cost data from file¶

TabularYAML

Data in file:


tech1	cost_flow_cap	100
tech1	cost_area_use	50
tech1	cost_flow_out	0.2
tech1	cost_interest_rate	0.1
tech3	cost_flow_cap	20
tech3	cost_storage_cap	150
tech3	cost_interest_rate	0.1

YAML definition to load data:

data_sources:
  tech_data:
    source: data_sources/tech_data.csv
    rows: [techs, parameters]
    add_dimensions:
      costs: monetary

tech_groups:  # (1)!
  cost_setter:
    cost_interest_rate:
      data: 0.1
      index: monetary
      dims: costs
    cost_flow_cap:
      data: null
      index: monetary
      dims: costs
    cost_area_use:
      data: null
      index: monetary
      dims: costs
    cost_flow_out:
      data: null
      index: monetary
      dims: costs
    cost_storage_cap:
      data: null
      index: monetary
      dims: costs

techs:
  tech1:
    inherit: cost_setter
    cost_flow_cap.data: 100
    cost_area_use.data: 50
    cost_flow_out.data: 0.2
  tech3:
    cost_flow_cap.data: 20
    cost_storage_cap.data: 150

To limit repetition, we have defined technology groups for our costs.

Selecting dimension values and dropping dimensions¶

If you only want to use a subset of the tabular data you've defined, you can select it at load time. For instance:

Data in file:

		node1	node2	node3
tech1	parameter1	100	200	300
tech2	parameter1	0.1	0.3	0.5
tech3	parameter1	20	45	50

YAML definition to load only data from nodes 1 and 2:

data_sources:
  tech_data:
    source: data_sources/tech_data.csv
    rows: [techs, parameters]
    columns: nodes
    select:
      nodes: [node1, node2]

You may also want to store scenarios in your file. When you load in the data, you can select your scenario. You will also need to drop the dimension so that it doesn't appear in the final calliope model dataset:

		scenario1	scenario2
tech1	parameter1	100	200
tech2	parameter1	0.1	0.3
tech3	parameter1	20	45

YAML definition to load only data from scenario 1:

data_sources:
  tech_data:
    source: data_sources/tech_data.csv
    rows: [techs, parameters]
    columns: scenarios
    select:
      scenarios: scenario1
    drop: scenarios

You can then also tweak just one line of your data source YAML with an override to point to your other scenario:

override:
  switch_to_scenario2:
    data_sources.tech_data.select.scenarios: scenario2  # (1)!

We use the dot notation as a shorthand for abbreviate nested dictionaries.

Adding dimensions¶

We used the add_dimensions functionality in some examples earlier in this page. It's a useful mechanism to avoid repetition in the tabular data, and offers you the possibility to use the same data for different parts of your model definition.

For example, to define costs for the parameter cost_flow_cap:

Without add_dimensionsWith add_dimensions

			node1	node2	node3
tech1	monetary	cost_flow_cap	100	200	300
tech2	monetary	cost_flow_cap	0.1	0.3	0.5
tech3	monetary	cost_flow_cap	20	45	50

data_sources:
  tech_data:
    source: data_sources/tech_data.csv
    rows: [techs, costs, parameters]
    columns: nodes

	node1	node2	node3
tech1	100	200	300
tech2	0.1	0.3	0.5
tech3	20	45	50

data_sources:
  tech_data:
    source: data_sources/tech_data.csv
    rows: techs
    columns: nodes
    add_dimensions:
      costs: monetary
      parameters: cost_flow_cap

Or to define the same timeseries source data for two technologies at different nodes:

Without add_dimensionsWith add_dimensions

	node1 tech1 source_use_max	node2 tech2 source_use_max
2005-01-01 00:00	100	200
2005-01-01 01:00	200	200

data_sources:
  tech_data:
    source: data_sources/tech_data.csv
    rows: timesteps
    columns: [nodes, techs, parameters]


2005-01-01 00:00	100
2005-01-01 01:00	200

data_sources:
  tech_data_1:
    source: data_sources/tech_data.csv
    rows: timesteps
    add_dimensions:
      techs: tech1
      nodes: node1
      parameters: source_use_max
  tech_data_2:
    source: data_sources/tech_data.csv
    rows: timesteps
    add_dimensions:
      techs: tech2
      nodes: node2
      parameters: source_use_max

Loading CSV files vs `pandas` dataframes¶

To load from CSV, set the filepath in source to point to your file. This filepath can either be relative to your model.yaml file (as in the above examples) or an absolute path.

To load from a pandas.DataFrame, you can specify the data_source_dfs dictionary of objects when you initialise your model:

import calliope
import pandas as pd
df1 = pd.DataFrame(...)
df2 = pd.DataFrame(...)

model = calliope.Model(
    "path/to/model.yaml",
    data_source_dfs={"data_source_1": df1, "data_source_2": df2}
)

And then you point to those dictionary keys in the source for your data source:

data_sources:
  ds1:
    source: data_source_1
    ...
  ds2:
    source: data_source_2
    ...

Note

As with loading tabular data from CSV, you will need to specify rows, columns, etc. based on the shape of your dataframe. Rows correspond to your dataframe index levels and columns to your dataframe column levels.

You cannot specify pandas.Series objects. Ensure you convert them to dataframes (to_frame()) before adding them to your data source dictionary.

Important considerations¶

To get the expected results when loading tabular data, here are some things to note:

You must always have a parameters dimension. This could be defined in rows, columns, or add_dimensions.
The order of events for select, drop, add_dimensions is:
1. select from dimensions;
2. drop unwanted dimensions;
3. add_dimensions to add dimensions. This means you can technically select value "A" from dimensions nodes, then drop nodes, then add nodes back in with the value "B". This effectively replaces "A" with "B" on that dimension.
The order of tabular data loading is in the order you list the sources. If a new table has data which clashes with preceding data sources, it will override that data. This may have unexpected results if the files have different dimensions as the dimensions will be broadcast to match each other.
CSV files must have .csv in their filename (even if compressed, e.g., .csv.zip). If they don't, they won't be picked up by Calliope.
We automatically infer which technologies are available at which nodes according to any tabular data containing both the nodes and techs dimensions. However, we do not recommend you rely on tabular data entirely to define your model. Instead, at least list the techs available at each node in YAML. E.g.,
```
nodes:
  node1.techs: {tech1, tech2, tech3}
  node2.techs: {tech1, tech2}
data_sources:
  ...
```
We process dimension data after loading it in according to a limited set of heuristics:
1. we assume any dimension with the suffix steps (e.g., timesteps, monthsteps) is timeseries data, and attempt to convert the data type of the dimension values accordingly.
2. We will attempt to convert dimension data to numeric values. Therefore, dimensions with the data [1, 2], ["1", "2"], [1, "2"], and ["1.0", 2.0] will all be converted to having a numeric data type (integer or float). ["foo", "1"] and ["foo", 1] will not be converted, as not all dimension data entries are convertible to numeric data types.

Data you cannot load in tabular format¶

Some data is specific to the YAML definition or is computed by Calliope internally and therefore cannot be loaded by the user from tabular data. These are:

# Parameters for which loading from file via `data_sources` is prohibited
active: >-
  Technology/Node activation (`active`) can only be used in the
  YAML model definition.
definition_matrix: >-
  `definition_matrix` is a protected array.
  It will be generated internally based on the values you assign to
  the `carrier_in` and `carrier_out` parameters.
inherit: >-
  Technology/Node inheritance (`inherit`) can only be used in the
  YAML model definition.

Loading tabular data (data_sources)¶

YAML vs tabular definitions¶

Loading timeseries data¶

Loading technology data from file¶

Loading technology cost data from file¶

Selecting dimension values and dropping dimensions¶

Adding dimensions¶

Loading CSV files vs pandas dataframes¶

Important considerations¶

Data you cannot load in tabular format¶

Loading tabular data (`data_sources`)¶

Loading CSV files vs `pandas` dataframes¶