# Using Data Pools in PLANQK Services

This guide explains how to use the `DataPool` feature to work with datasets and file collections within your PLANQK services.
## What is a Data Pool?
A Data Pool is a managed collection of files, similar to a directory or a folder, that can be attached to your service at runtime. It provides a simple and efficient way to access large datasets, pre-trained models, or any other file-based resources without having to include them directly in your service's deployment package.
When you use a Data Pool, the PLANQK platform mounts the specified file collection into your service's runtime environment. The `planqk-commons` library provides a convenient `DataPool` abstraction to interact with these mounted files.
## Data Pool Limits
Data Pools are designed to handle large datasets, but there are some limits to keep in mind:
- The maximum size of a single file in a Data Pool is 500 MB (a local size check is sketched below).
- The files are mounted from a bucket within the PLANQK platform, which means performance may vary based on the size and number of files.
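
If you want to verify the single-file limit before creating a Data Pool, a quick local check like the following can help. This is a standard-library sketch; the directory name `./my_dataset` is just an example:

```python
from pathlib import Path

# 500 MB single-file limit, per the Data Pool limits above
MAX_FILE_SIZE = 500 * 1024 * 1024


def oversized_files(directory: str) -> list[Path]:
    """Return all files under `directory` that exceed the single-file limit."""
    return [
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_size > MAX_FILE_SIZE
    ]


for file in oversized_files("./my_dataset"):
    print(f"Too large for a Data Pool: {file}")
```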
## How to Use the `DataPool` Class
To use a Data Pool in your service, you simply need to declare a parameter of type `DataPool` in your `run` method. The PLANQK runtime will automatically detect this and inject a `DataPool` object that corresponds to the mounted file collection.
### The `DataPool` Object

The `DataPool` object, found in `planqk.commons.datapool`, provides the following methods to interact with the files in the mounted directory:
- `list_files() -> Dict[str, str]`: Returns a dictionary of all files in the Data Pool, where the keys are the file names and the values are their absolute paths.
- `open(file_name: str, mode: str = "r")`: Opens a specific file within the Data Pool and returns a file handle, similar to Python's built-in `open()` function.
- `path`: A property that returns the absolute path to the mounted Data Pool directory.
- `name`: A property that returns the name of the Data Pool (which corresponds to the parameter name in your `run` method).
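
As an illustration, here is a minimal sketch of a `run` method that exercises these methods. The file name `config.json` is a hypothetical example, not something the platform provides:

```python
from planqk.commons.datapool import DataPool


def run(my_dataset: DataPool) -> dict:
    # list_files() maps file names to their absolute paths
    files = my_dataset.list_files()

    # open() works like Python's built-in open() for files inside the pool
    # (assumes a file named "config.json" exists in the Data Pool)
    with my_dataset.open("config.json") as f:
        first_line = f.readline()

    return {
        "pool_name": my_dataset.name,   # e.g. "my_dataset"
        "mount_path": my_dataset.path,  # absolute path of the mounted directory
        "file_count": len(files),
        "first_line": first_line,
    }
```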
## Tutorial: Building a Service with a Data Pool
Let's walk through an example of a service that reads data from a Data Pool.
### 1. Initialize a New Project
If you haven't already, create a new PLANQK service project. You can use the `planqk` CLI to set up a new service:
```bash
planqk init
cd [user_code]
uv venv
source .venv/bin/activate
uv sync
```
For the rest of this guide, we assume that you created your PLANQK service in a directory named `user_code`, with the main code in `user_code/src/`.
### 2. Update the `run` Method
In your `program.py`, define a `run` method that accepts a `DataPool` parameter. The name of the parameter (e.g., `my_dataset`) is important, as it will be used to identify the Data Pool in the API call.
```python
# user_code/src/program.py
from planqk.commons.datapool import DataPool
from pydantic import BaseModel


class InputData(BaseModel):
    file_to_read: str


def run(data: InputData, my_dataset: DataPool) -> str:
    """
    Reads the content of a specified file from a Data Pool.
    """
    try:
        # Use the open() method to read a file from the Data Pool
        with my_dataset.open(data.file_to_read) as f:
            content = f.read()
        return content
    except FileNotFoundError:
        return f"File '{data.file_to_read}' not found in the Data Pool."
```
In this example, the `run` method expects a Data Pool to be provided for the `my_dataset` parameter.
### 3. Configure the Data Pool in the API Call
When you execute this service via the PLANQK API, you need to specify which Data Pool to mount. This is done by providing a special JSON object in the request body. The key of this object must match the `DataPool` parameter name in your `run` method (`my_dataset` in our example).
The JSON object has two fields:

- `id`: The unique identifier (UUID) of the Data Pool you want to use.
- `ref`: A static value that must be `"DATAPOOL"`.
Here is an example of a request body for our service:
```json
{
  "data": {
    "file_to_read": "hello.txt"
  },
  "my_dataset": {
    "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "ref": "DATAPOOL"
  }
}
```
When the service is executed with this input, the PLANQK platform will:

- Identify that the `my_dataset` parameter is a Data Pool reference.
- Mount the Data Pool with the specified `id`.
- Instantiate a `DataPool` object pointing to the mounted directory.
- Inject this `DataPool` object into the `run` method as the `my_dataset` argument.
Your code can then use the `my_dataset` object to interact with the files in the mounted Data Pool.
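
For illustration, such a request could be sent with a short Python script like the one below. The endpoint URL and authentication header are placeholders, not the actual PLANQK API contract; consult the platform's API documentation for the correct values:

```python
import json
from urllib import request

# Placeholders: substitute your service's execution endpoint and credentials
ENDPOINT = "https://<your-service-execution-endpoint>"
HEADERS = {"Content-Type": "application/json", "X-Auth-Token": "<your-api-key>"}

body = {
    "data": {"file_to_read": "hello.txt"},
    "my_dataset": {
        "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
        "ref": "DATAPOOL",
    },
}

req = request.Request(
    ENDPOINT, data=json.dumps(body).encode("utf-8"), headers=HEADERS, method="POST"
)
with request.urlopen(req) as response:
    print(response.read().decode("utf-8"))
```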
### 4. Local Testing with Data Pools
When developing and testing your service locally, you don't have access to the PLANQK platform's Data Pool mounting system. However, you can easily simulate it by creating a local directory and passing it to your `run` method.
#### Steps for Local Testing
1. **Create a local directory for your Data Pool.** This directory should be placed inside the `user_code` directory, alongside the `src` and `input` folders. The name of this directory can be anything, but for this example, we'll name it `my_dataset` to match the parameter in the `run` method.
2. **Populate the directory with your test files.** Place any files you need for your test inside this directory (e.g., `user_code/my_dataset/hello.txt`).
3. **Update the `__main__.py` file.** Modify your main entrypoint to manually create a `DataPool` instance and pass it to the `run` function. You will create the `DataPool` object with a relative path to your local Data Pool directory.
4. **Run your service.** You can now run your service directly without setting any environment variables:
```bash
# Run your service's main entrypoint
cd user_code
python -m src
```
#### Example
Let's assume your project has the following structure:
```
user_code
├── my_dataset/
│   └── hello.txt
├── src/
│   ├── __main__.py
│   └── program.py
└── input/
    └── data.json
```
And `user_code/input/data.json` contains:
```json
{
  "file_to_read": "hello.txt"
}
```
Update your `user_code/src/__main__.py` to look like this:
```python
# user_code/src/__main__.py
import json
import os

from planqk.commons.constants import OUTPUT_DIRECTORY_ENV
from planqk.commons.datapool import DataPool
from planqk.commons.json import any_to_json
from planqk.commons.logging import init_logging

from .program import InputData, run

init_logging()

# This file is executed if you run `python -m src` from the project root.
# Use it to test your program locally: read the input data from the `input`
# directory and map it to the respective parameter of the `run()` function.

# Redirect PLANQK's output directory for local testing
directory = "./out"
os.makedirs(directory, exist_ok=True)
os.environ[OUTPUT_DIRECTORY_ENV] = directory

with open("./input/data.json") as file:
    data = InputData.model_validate(json.load(file))

result = run(data, my_dataset=DataPool("./my_dataset"))

print()
print(any_to_json(result))
```
The `__main__.py` script now manually creates the `DataPool` object and passes it to your `run` function, simulating the behavior of the PLANQK platform and allowing you to test your `run` method's logic with local files.
Now run the service:
```bash
python -m src
```
## OpenAPI Specification for Data Pools
When you generate an OpenAPI specification for a service that uses a `DataPool`, the `planqk openapi` library automatically creates the correct schema for the Data Pool parameter. Instead of showing the internal structure of the `DataPool` class, it generates a schema that reflects the expected API input format.
For the `my_dataset: DataPool` parameter, the generated OpenAPI schema will look like this:
```yaml
my_dataset:
  type: object
  properties:
    id:
      type: string
      description: UUID of the Data Pool to mount
    ref:
      type: string
      enum: [DATAPOOL]
      description: Reference type indicating this is a Data Pool
  required:
    - id
    - ref
  additionalProperties: false
```
This ensures that the API documentation accurately represents how to use the service and provides a clear contract for API clients.
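
On the client side, this contract maps naturally onto a small Pydantic model. The following sketch is purely illustrative; `DataPoolRef` is our own name, not part of `planqk-commons`:

```python
from typing import Literal

from pydantic import BaseModel


class DataPoolRef(BaseModel):
    """Mirrors the generated OpenAPI schema for a Data Pool parameter."""

    id: str                   # UUID of the Data Pool to mount
    ref: Literal["DATAPOOL"]  # static discriminator, per the schema above

    model_config = {"extra": "forbid"}  # additionalProperties: false


# Validate a request fragment before sending it
DataPoolRef.model_validate(
    {"id": "a1b2c3d4-e5f6-7890-1234-567890abcdef", "ref": "DATAPOOL"}
)
```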