Time Series Regression
Our other examples have focused on classification, so let us switch to a different regime. One of the most powerful features of QCML is that the same model architectures can function well across many different problem regimes. Here we’ll apply it to time series forecasting.
The dataset that we are using for this example contains the responses of a gas multisensor device deployed in an Italian city. Hourly response averages are recorded along with gas concentration references from a certified analyzer. See the paper and dataset for more information.
In this example we will use QCML models to predict all observed features for a given horizon using a specified lookback window.
Uploading the Dataset
First we need to get our hands on the data and upload it to the qognitive servers. We use the UC Irvine Machine Learning Repository to download and access the data. After that, we need to convert our datetime features, scale the data, and add lagged features. Each measurement is taken hourly, so we will use a lookback window of 72 hours and a horizon of 24 hours. After dropping the Date and Time columns and adding three datetime-derived features (month, day of week, and hour), the dataset has 16 features, giving 16*(72+24) = 1536 features in total during training, each of which will correspond to an observable operator.
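As a quick sanity check on that arithmetic (an illustrative sketch; the feature count of 16 reflects the datetime processing described above):

# Column count after adding lagged features (illustrative arithmetic)
n_features = 16  # 13 sensor/weather columns + month, day_of_week, hour
lookback_window = 72
horizon = 24
print(n_features * (lookback_window + horizon))  # 1536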
We’ll be using some extra packages here, such as scikit-learn and ucimlrepo. You can install these with the following command:
(venv)$ pip install ucimlrepo scikit-learn
Let’s download the data and format it into a dataframe suitable for training and inference.
# std
import os
import pickle

# external
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
from ucimlrepo import fetch_ucirepo
def add_lagged_features(
    train_test_idx: np.ndarray,
    data_scaled: np.ndarray,
    features: list[str],
    lookback_window: int,
    horizon: int,
    column_names: list[str],
) -> pd.DataFrame:
    """Add lagged features to data.

    This function creates and returns a new array (data_raw) in which it has
    introduced F*(L+H-1) new columns, where F is the number of features, L
    is the size of the lookback window, and H is the size of the horizon. For
    each feature, it adds values from the original data_scaled array with a
    lag of 0 to L+H-1.

    Every feature uses the same lookback window and horizon.

    Parameters
    ----------
    train_test_idx : np.ndarray
        The indices in data_scaled that will be used to create the new array
        for either the train or test data. These must be in the range
        [lookback_window + horizon, data_scaled.shape[0]).
    data_scaled : np.ndarray
        The scaled data array to add the lagged features to.
    features : list[str]
        The list of feature names.
    lookback_window : int
        The size of the lookback window.
    horizon : int
        The size of the horizon.
    column_names : list[str]
        The names of the columns after adding lagged features.

    Returns
    -------
    pd.DataFrame
        The new dataframe containing the union of old features and new lagged
        features.
    """
    data_raw = np.zeros(
        (train_test_idx.shape[0], len(features) * (lookback_window + horizon))
    )
    for ti in range(train_test_idx.shape[0]):
        t = train_test_idx[ti]
        for i, f in enumerate(features):
            # Each feature gets a contiguous block of columns: lag 0 is the
            # value at the sampled time t, lag L+H-1 is the oldest value.
            col_start = i * (lookback_window + horizon)
            for j in range(lookback_window + horizon):
                data_raw[ti, col_start + j] = data_scaled[t - j, i]
    return pd.DataFrame(data_raw, columns=column_names)
def load_air_quality(
    n_train: int, n_test: int, lookback_window: int, horizon: int
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Load the air quality dataset from the UCI ML repository.

    See https://archive.ics.uci.edu/dataset/360/air+quality and the original paper
    https://www.semanticscholar.org/paper/a90a54a39ff934772df57771a0012981f355949d.

    Testing and training data are chosen from disjoint sets of data points.

    Parameters
    ----------
    n_train : int
        The number of data points to use for training.
    n_test : int
        The number of data points to use for testing.
    lookback_window : int
        Size of the lookback window.
    horizon : int
        Size of the horizon to predict.

    Returns
    -------
    tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]
        Training data, test data (missing the data we want to predict), and
        target data (labels for the test data).
    """
    data_path = os.path.join("/tmp", "air_quality.pkl")
    # Cache dataset
    if os.path.exists(data_path):
        print("Using cached data")
        with open(data_path, "rb") as f:
            air_quality = pickle.load(f)
    else:
        air_quality = fetch_ucirepo(id=360)
        with open(data_path, "wb") as f:
            pickle.dump(air_quality, f)

    # Data (as a pandas dataframe); copy so we can add columns safely
    X = air_quality.data.features.copy()
    X["datetime"] = pd.to_datetime(X["Date"] + " " + X["Time"])
    X["month"] = X["datetime"].dt.month
    X["day_of_week"] = X["datetime"].dt.dayofweek
    X["hour"] = X["datetime"].dt.hour
    X.drop(columns=["Date", "Time", "datetime"], inplace=True)

    # Features
    features = X.columns.tolist()
    forecast_features = [f"{f}_{t}" for f in features for t in range(horizon)]
    column_names = [
        f"{f}_{t}" for f in features for t in range(lookback_window + horizon)
    ]

    # Train, validation, and test boundaries in data
    # Train = [0, 60%], Validation = (60%, 80%], Test = (80%, 100%]
    n_data = X.shape[0]
    boundaries = [0, int(n_data * 0.6), int(n_data * 0.8), n_data]

    # Input checking
    if boundaries[1] - lookback_window - horizon - n_train < 0:
        raise ValueError(
            "Not enough training data points for lookback window and horizon"
        )
    if boundaries[3] - boundaries[2] - lookback_window - horizon - n_test < 0:
        raise ValueError("Not enough test data points for lookback window and horizon")

    # Scale the data (fitting the scaler on the training split only)
    scaler = StandardScaler().fit(X[boundaries[0] : boundaries[1]])
    df_scaled = pd.DataFrame(scaler.transform(X), index=X.index, columns=X.columns)

    # Select indices
    train_idx = np.random.choice(
        np.arange(lookback_window + horizon, boundaries[1]), n_train, replace=False
    )
    test_idx = np.random.choice(
        np.arange(lookback_window + horizon + boundaries[2], boundaries[3]),
        n_test,
        replace=False,
    )

    # Dataframes with lagged features
    df_train = add_lagged_features(
        train_test_idx=train_idx,
        data_scaled=df_scaled.values,
        features=df_scaled.columns,
        lookback_window=lookback_window,
        horizon=horizon,
        column_names=column_names,
    )
    df_test = add_lagged_features(
        train_test_idx=test_idx,
        data_scaled=df_scaled.values,
        features=df_scaled.columns,
        lookback_window=lookback_window,
        horizon=horizon,
        column_names=column_names,
    )
    df_target = pd.DataFrame(
        df_test[forecast_features].values,
        index=df_test.index,
        columns=forecast_features,
    ).reindex(sorted(forecast_features), axis=1)

    # Drop forecast features from test data
    df_test.drop(columns=forecast_features, inplace=True)
    return df_train, df_test, df_target
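To make the lag layout concrete, here’s a toy run of add_lagged_features (hypothetical numbers, not the real dataset): with two features, a lookback of 2, and a horizon of 1, each row holds the value at the sampled time t followed by the two preceding values, per feature.

# Toy illustration (hypothetical data): 10 hourly timestamps, 2 features
toy = np.arange(20, dtype=float).reshape(10, 2)
toy_cols = [f"{f}_{t}" for f in ["a", "b"] for t in range(2 + 1)]
df_toy = add_lagged_features(
    train_test_idx=np.array([5, 9]),
    data_scaled=toy,
    features=["a", "b"],
    lookback_window=2,
    horizon=1,
    column_names=toy_cols,
)
print(df_toy)  # row for t=5: a_0=toy[5,0], a_1=toy[4,0], a_2=toy[3,0], then b_0..b_2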
Let’s instantiate a client object and set the dataset to our time series dataframe. We’re only going to upload the df_train dataframe, as the test data is only used for evaluation.
from qcog_python_client import QcogClient
# Set the random seed for consistent selection of training and test data
np.random.seed(42)
df_train, df_test, df_test_labels = load_air_quality(1000, 200, 72, 24)
# Send the training data to the server
qcml = QcogClient.create(token=API_TOKEN)
qcml.data(df_train)
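Before training, it’s worth sanity-checking the shapes the loader returned (a quick check that assumes the 16 processed features described earlier):

# Expected shapes for n_train=1000, n_test=200, lookback=72, horizon=24
assert df_train.shape == (1000, 16 * (72 + 24))  # (1000, 1536)
assert df_test.shape == (200, 16 * 72)  # (200, 1152), forecast columns dropped
assert df_test_labels.shape == (200, 16 * 24)  # (200, 384)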
Parameterizing our Model
Let’s pick a Pauli model to run.
qcml = qcml.pauli(
    operators=df_train.columns.tolist(),
    qbits=4,
    pauli_weight=2,
)
Remember that the operators have to match the columns of the dataset we are going to train on.
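For intuition about these parameters: qbits sets the number of qubits and pauli_weight caps how many qubits each Pauli string acts on non-trivially. If the model’s operator basis is the full set of Pauli strings up to that weight (an assumption about the internals, not something confirmed by the client API here), the count grows as follows:

from math import comb

# Hypothetical count of non-identity Pauli strings with weight <= w on n qubits
# (assumes a full weight-limited Pauli basis, which is an assumption)
def n_pauli_strings(n_qubits: int, max_weight: int) -> int:
    return sum(comb(n_qubits, k) * 3**k for k in range(1, max_weight + 1))

print(n_pauli_strings(4, 2))  # 66 for qbits=4, pauli_weight=2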
Training the Model
Now we set some training-specific parameters and execute the training.
from qcog_python_client.schema.parameters import (
    GradOptimizationParameters,
    LOBPCGStateParameters,
)

qcml = qcml.train(
    batch_size=32,
    num_passes=10,
    weight_optimization=GradOptimizationParameters(
        learning_rate=1e-5,
        iterations=3,
    ),
    get_states_extra=LOBPCGStateParameters(
        iterations=15,
    ),
)
qcml.wait_for_training()
print(qcml.trained_model["guid"])
Here we use the gradient descent optimizer with a learning rate of 1e-5 and 3 iterations per batch. We also use the LOBPCG state method with 15 iterations. We are not using the analytic solver because we are not passing the entire dataset at once.
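As a rough sense of the work those settings imply (illustrative arithmetic only; it assumes batches are formed by simple chunking of the 1000 training rows, which we haven’t verified against the client’s internals):

# Back-of-the-envelope training work under the settings above
n_rows, batch_size, num_passes, grad_iters = 1000, 32, 10, 3
batches_per_pass = -(-n_rows // batch_size)  # ceil(1000 / 32) = 32
print(batches_per_pass * num_passes * grad_iters)  # roughly 960 gradient iterations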
Note
The training process may take a while to complete. Here we call wait_for_training, which blocks until training is complete.
Note
We print out the trained model guid so we can use it in a different interpreter session if needed.
Executing Inference
If you are running in the same session you can skip the next step. If you are in a fresh session, recreate the client with QcogClient.create as above, then load the model using the guid we printed out.
qcml = qcml.preloaded_model(MODEL_GUID)
With our trained model loaded into the client, we can now run inference on the dataset.
from qcog_python_client.schema.parameters import LOBPCGFastStateParameters

result_df = qcml.inference(
    data=df_test,
    parameters=LOBPCGFastStateParameters(
        iterations=25,
        tolerance=1e-4,
    ),
)

mse = mean_squared_error(df_test_labels, result_df)
mape = mean_absolute_percentage_error(df_test_labels, result_df)
print(f"MSE: {mse:.4f}")
print(f"MAPE: {mape:.4f}")
Results
Some example results for various qubit counts and Pauli weights are shown below. The mean squared error (MSE) and mean absolute percentage error (MAPE) are calculated for each case.
| Qubits | Pauli Weight | MSE   | MAPE  |
|--------|--------------|-------|-------|
| 2      | 1            | 1.098 | 7.770 |
| 4      | 2            | 0.983 | 4.912 |
| 6      | 2            | 0.903 | 6.17  |