Python-caterva documentation

Python-caterva is a Python wrapper of Caterva, an open source C library specially designed to deal with large multidimensional, chunked, compressed datasets.

Getting Started

New to python-caterva? Check out the getting started guides. They contain an introduction to python-caterva's main concepts and an installation tutorial.

API Reference

The reference guide contains a detailed description of the python-caterva API. The reference describes how the functions work and which parameters can be used.

Development

Saw a typo in the documentation? Want to improve existing functionality? The contributing guidelines will guide you through the process of improving python-caterva.

Release Notes

Want to see what’s new in the latest release? Check out the release notes to find out!

Getting Started

Installation

Pip

python -m pip install caterva

Source code

git clone --recurse-submodules https://github.com/Blosc/python-caterva
cd python-caterva
python -m pip install .

Tutorial

import caterva as cat

cat.__version__
'0.6.0'

Creating an array

c = cat.zeros((10000, 10000), itemsize=4, chunks=(1000, 1000), blocks=(100, 100))

c
<caterva.ndarray.NDArray at 0x7f0bc0552150>
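The `chunks` and `blocks` arguments define a two-level partition: the array is split into chunks, and each chunk into blocks. The sketch below is plain arithmetic on those shapes and does not call caterva; `partition_counts` is a hypothetical helper name, and a partial trailing chunk or block is assumed to count as a whole one.

```python
import math

def partition_counts(shape, chunks, blocks):
    """Return (number of chunks, blocks per chunk) for a two-level partition."""
    # Each dimension is split independently; a partial trailing piece still counts.
    chunks_per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    blocks_per_dim = [math.ceil(c / b) for c, b in zip(chunks, blocks)]
    return math.prod(chunks_per_dim), math.prod(blocks_per_dim)

shape, chunks, blocks, itemsize = (10000, 10000), (1000, 1000), (100, 100), 4
n_chunks, n_blocks = partition_counts(shape, chunks, blocks)
print(n_chunks, n_blocks)            # 100 100
print(math.prod(shape) * itemsize)   # 400000000 (uncompressed size in bytes)
```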

Reading and writing data

import struct
import numpy as np

dtype = np.int32

c[0, :] = np.arange(10000, dtype=dtype)
c[:, 0] = np.arange(10000, dtype=dtype)
c[0, 0]
<caterva.ndarray.NDArray at 0x7f0bb00bf050>
np.array(c[0, 0]).view(dtype)
array(0, dtype=int32)
np.array(c[0, -1]).view(dtype)
array(9999, dtype=int32)
np.array(c[0, :]).view(dtype)
array([   0,    1,    2, ..., 9997, 9998, 9999], dtype=int32)
np.array(c[:, 0]).view(dtype)
array([   0,    1,    2, ..., 9997, 9998, 9999], dtype=int32)
np.array(c[:]).view(dtype)
array([[   0,    1,    2, ..., 9997, 9998, 9999],
       [   1,    0,    0, ...,    0,    0,    0],
       [   2,    0,    0, ...,    0,    0,    0],
       ...,
       [9997,    0,    0, ...,    0,    0,    0],
       [9998,    0,    0, ...,    0,    0,    0],
       [9999,    0,    0, ...,    0,    0,    0]], dtype=int32)
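The array above was created with only an `itemsize`, so the stored bytes have no data type of their own; that is why every read is followed by `.view(dtype)`. As a rough illustration of the same idea without caterva or numpy, the sketch below decodes one plain bytes buffer in two different ways using the standard `struct` module. The little-endian layout is an assumption made for the example.

```python
import struct

itemsize = 4

# Four 4-byte items, written as little-endian signed integers.
raw = struct.pack("<4i", 0, 1, 2, 9999)
assert len(raw) == 4 * itemsize

# The same bytes decoded two ways: the buffer itself cannot say which is intended.
as_int32 = struct.unpack("<4i", raw)
as_float32 = struct.unpack("<4f", raw)

print(as_int32)        # (0, 1, 2, 9999)
print(as_float32[0])   # 0.0
```

Only the first decoding recovers the values that were written, so the caller has to remember the type, for instance in a metalayer.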

Persistent data

c1 = cat.full((1000, 1000), fill_value=b"pepe", chunks=(100, 100), blocks=(50, 50),
             urlpath="cat_tutorial.caterva")
c2 = cat.open("cat_tutorial.caterva")

c2.info
Type            NDArray (Blosc)
Itemsize        4
Shape           (1000, 1000)
Chunks          (100, 100)
Blocks          (50, 50)
Comp. codec     LZ4
Comp. level     5
Comp. filters   [SHUFFLE]
Comp. ratio     588.24
np.array(c2[0, 20:30]).view("S4")
array([b'pepe', b'pepe', b'pepe', b'pepe', b'pepe', b'pepe', b'pepe',
       b'pepe', b'pepe', b'pepe'], dtype='|S4')
import os
if os.path.exists("cat_tutorial.caterva"):
    cat.remove("cat_tutorial.caterva")

Compression params

b = np.arange(1000000).tobytes()

c1 = cat.from_buffer(b, shape=(1000, 1000), itemsize=8, chunks=(500, 10), blocks=(50, 10))

c1.info
Type            NDArray (Blosc)
Itemsize        8
Shape           (1000, 1000)
Chunks          (500, 10)
Blocks          (50, 10)
Comp. codec     LZ4
Comp. level     5
Comp. filters   [SHUFFLE]
Comp. ratio     6.64
c2 = c1.copy(chunks=(500, 10), blocks=(50, 10),
             codec=cat.Codec.ZSTD, clevel=9, filters=[cat.Filter.BITSHUFFLE])

c2.info
Type            NDArray (Blosc)
Itemsize        8
Shape           (1000, 1000)
Chunks          (500, 10)
Blocks          (50, 10)
Comp. codec     ZSTD
Comp. level     9
Comp. filters   [BITSHUFFLE]
Comp. ratio     20.83
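The compression ratio reported above is the uncompressed size divided by the compressed size, and filters such as SHUFFLE rearrange bytes so that the codec finds more redundancy. The sketch below illustrates the idea with the standard library only: it is not Blosc's implementation, `shuffle` and `unshuffle` are hypothetical helpers, and `zlib` stands in for the codec.

```python
import struct
import zlib

def shuffle(data: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every item, then byte 1, and so on."""
    return b"".join(data[i::itemsize] for i in range(itemsize))

def unshuffle(data: bytes, itemsize: int) -> bytes:
    """Invert shuffle(); len(data) must be a multiple of itemsize."""
    n_items = len(data) // itemsize
    out = bytearray(len(data))
    for i in range(itemsize):
        out[i::itemsize] = data[i * n_items:(i + 1) * n_items]
    return bytes(out)

def ratio(original: bytes, compressed: bytes) -> float:
    """Uncompressed size divided by compressed size."""
    return len(original) / len(compressed)

n = 100_000
raw = struct.pack(f"<{n}q", *range(n))   # consecutive 8-byte little-endian integers

plain = zlib.compress(raw, 5)
shuffled = zlib.compress(shuffle(raw, 8), 5)

print(f"no filter: {ratio(raw, plain):.2f}")
print(f"shuffled:  {ratio(raw, shuffled):.2f}")
```

For slowly varying integers, the high-order bytes are almost all zero, so grouping them together yields long runs that compress well.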

Metalayers

from msgpack import packb, unpackb
meta = {
    "dtype": packb("i8"),
    "coords": packb([5.14, 23.])
}
c = cat.zeros((1000, 1000), 5, chunks=(100, 100), blocks=(50, 50), meta=meta)
len(c.meta)
3
c.meta.keys()
['caterva', 'dtype', 'coords']
for key in c.meta:
    print(f"{key} -> {unpackb(c.meta[key])}")
caterva -> [0, 2, [1000, 1000], [100, 100], [50, 50]]
dtype -> i8
coords -> [5.14, 23.0]
c.meta["coords"] = packb([0., 23.])
for key in c.meta:
    print(f"{key} -> {unpackb(c.meta[key])}")
caterva -> [0, 2, [1000, 1000], [100, 100], [50, 50]]
dtype -> i8
coords -> [0.0, 23.0]
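Metalayer values are plain bytes, so any serializer can be used in place of msgpack. The roadmap lists variable-length metalayers as a future feature, which suggests (an assumption here, not something this tutorial states) that an updated value should keep the byte length of the original. A minimal sketch of a fixed-size encoding with the standard `struct` module follows; `COORDS_FORMAT`, `pack_coords` and `unpack_coords` are hypothetical names.

```python
import struct

COORDS_FORMAT = "<2d"  # hypothetical layout: two little-endian float64 values

def pack_coords(x: float, y: float) -> bytes:
    """Encode a coordinate pair into exactly 16 bytes."""
    return struct.pack(COORDS_FORMAT, x, y)

def unpack_coords(payload: bytes):
    """Decode a payload produced by pack_coords()."""
    return struct.unpack(COORDS_FORMAT, payload)

old = pack_coords(5.14, 23.0)
new = pack_coords(0.0, 23.0)

print(len(old), len(new))   # 16 16
print(unpack_coords(new))   # (0.0, 23.0)
```

With msgpack the packed length depends on the value for some types (small and large integers use different encodings), so a fixed layout like this one makes the length easy to reason about.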

Example of use

from PIL import Image
im = Image.open("../_static/blosc-logo_128.png")

im
[image: the Blosc logo]
meta = {"dtype": b"|u1"}

c = cat.asarray(np.array(im), chunks=(50, 50, 4), blocks=(10, 10, 4), meta=meta)

c.info
Type            NDArray (Blosc)
Itemsize        1
Shape           (70, 128, 4)
Chunks          (50, 50, 4)
Blocks          (10, 10, 4)
Comp. codec     LZ4
Comp. level     5
Comp. filters   [SHUFFLE]
Comp. ratio     2.68
im2 = c[15:55, 10:35]  # Letter B

Image.fromarray(np.array(im2).view(c.meta["dtype"]))
[image: cropped region of the logo showing the letter B]
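A slice such as `c[15:55, 10:35]` only overlaps part of the chunk grid. The sketch below works out which chunks that is from the shapes alone; it is arithmetic on the grid, not a caterva call, and `chunks_touched` is a hypothetical helper. How the library actually fetches data is internal to it.

```python
from itertools import product

def chunks_touched(slices, chunks):
    """Return the grid indices of every chunk that overlaps the given slices.

    `slices` holds (start, stop) pairs with start < stop, one per dimension.
    """
    per_dim = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(slices, chunks)
    ]
    return list(product(*per_dim))

# c[15:55, 10:35] on the (70, 128, 4) image; the last dimension is taken whole.
touched = chunks_touched([(15, 55), (10, 35), (0, 4)], chunks=(50, 50, 4))
print(touched)   # [(0, 0, 0), (1, 0, 0)]
```

Here two of the six chunks in the grid overlap the requested region.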

API Reference

Global variables

caterva.__version__

The version of the caterva package.

class caterva.Codec(value)

Available codecs.

BLOSCLZ = 0
LZ4 = 1
LZ4HC = 2
ZLIB = 4
ZSTD = 5
class caterva.Filter(value)

Available filters.

BITSHUFFLE = 2
DELTA = 3
NOFILTER = 0
SHUFFLE = 1
TRUNC_PREC = 4
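The `(value)` signature suggests that `Codec` and `Filter` behave like standard `enum.Enum` classes; that is an assumption. The sketch below uses a local stand-in that mirrors the codec values listed above, rather than importing caterva, to show lookup by name and by value.

```python
from enum import Enum

class Codec(Enum):
    """Local stand-in mirroring the values listed above (not imported from caterva)."""
    BLOSCLZ = 0
    LZ4 = 1
    LZ4HC = 2
    ZLIB = 4
    ZSTD = 5

print(Codec["ZSTD"])             # Codec.ZSTD
print(Codec(1).name)             # LZ4
print([c.name for c in Codec])   # ['BLOSCLZ', 'LZ4', 'LZ4HC', 'ZLIB', 'ZSTD']

# The value 3 is not assigned to any codec, so looking it up fails.
try:
    Codec(3)
except ValueError as exc:
    print(exc)
```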

Constructors

Basics

empty(shape, itemsize, **kwargs)

Create an empty array.

copy(array, **kwargs)

Create a copy of an array.

from_buffer(buffer, shape, itemsize, **kwargs)

Create an array out of a buffer.

open(urlpath)

Open a new container from urlpath.

asarray(ndarray, **kwargs)

Convert the input to an array.

Utils

remove(urlpath)

Remove a caterva file.

NDArray

The multidimensional data array class.

Attributes

itemsize

The itemsize of this container.

ndim

The number of dimensions of this container.

shape

The shape of this container.

chunks

The chunk shape of this container.

blocks

The block shape of this container.

meta

Methods

Slicing

__getitem__

Get a (multidimensional) slice as specified in key.

__setitem__

slice

Get a (multidimensional) slice as specified in key.

Metalayers

class caterva.meta.Meta(ndarray)

Class providing access to user meta on an NDArray. It is available via the .meta property of an array.

Methods

__getitem__

Return the item metalayer.

__setitem__

Update the key metalayer with value.

get

Return the value for key if key is in the dictionary, else default.

keys

Return the metalayer keys.

__iter__

Iterate over the metalayer keys.

__contains__

Check whether the key metalayer exists.

Development

Contributing to python-caterva

python-caterva is a community-maintained project. We want to make contributing to this project as easy and transparent as possible.

Asking for help

If you have a question about how to use python-caterva, please post your question on StackOverflow using the “caterva” tag.

Bug reports

We use GitHub issues to track public bugs. Please ensure your description is clear and includes enough detail to reproduce the issue. The ideal report should contain the following:

1. Summarize the problem: Include details about your goal, describe expected and actual results and include any error messages.

2. Describe what you’ve tried: Show what you’ve tried, tell us what you found and why it didn’t meet your needs.

3. Minimum reproducible example: Share the minimum amount of code needed to reproduce your issue. You can format the code nicely using markdown:

```python
import caterva as cat

...
```

4. Determine the environment: Indicate the python-caterva version and the operating system the code is running on.
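For the environment details in step 4, a short script along these lines can collect what is needed. It is a sketch: `environment_report` is a hypothetical helper name, and the only caterva attribute it relies on is `__version__`.

```python
import platform
import sys

def environment_report() -> str:
    """Collect the Python, OS and python-caterva versions for a bug report."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"platform: {platform.platform()}",
    ]
    try:
        import caterva as cat
        lines.append(f"python-caterva: {cat.__version__}")
    except ImportError:
        # Still produce a report if the package cannot be imported.
        lines.append("python-caterva: not installed")
    return "\n".join(lines)

print(environment_report())
```

The output can be pasted directly into the issue.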

Contributing to code

We actively welcome your code contributions. By contributing to python-caterva, you agree that your contributions will be licensed under the LICENSE file of the project.

Fork the repo

Make a fork of the python-caterva repository and clone it:

git clone https://github.com/<your-github-username>/python-caterva

Create your branch

Before you do any new work or submit a pull request, please open an issue on GitHub to report the bug or propose the feature you’d like to add.

Then create a new, separate branch for each piece of work you want to do.

Update docstrings

If you’ve changed APIs, update the affected docstrings using the doxygen format.

Run the test suite

If you have added code that needs to be tested, add the necessary tests and verify that all tests pass successfully.

Roadmap

This document lists the main goals for the upcoming python-caterva releases.

Features

  • Support for variable-length metalayers. This would give users a lot of flexibility to define their own metadata.

  • Resize array dimensions. This feature would allow Caterva to grow or shrink any dimension of an array.

Interoperability

  • Third-party integration. Caterva needs better integration with libraries like:

    • xarray (labeled arrays)

    • dask (computation)

    • napari (visualization)

Release notes

Changes from 0.5.3 to 0.6.0

  • Provide wheels on PyPI.

  • Update caterva submodule to 0.5.0.

Changes from 0.5.1 to 0.5.3

  • Fix dependencies installation issue.

Changes from 0.5.0 to 0.5.1

  • Update setup.py and add pyproject.toml.

Changes from 0.4.2 to 0.5.0

  • Big refactoring of the C core, improving the slicing performance.

  • Implement the __setitem__ method for arrays so that array values can be updated.

  • Use Blosc special-constructors to initialize the arrays.

  • Improve the buffer and array protocols.

  • Remove the data type support in order to simplify the library.

Changes from 0.4.1 to 0.4.2

  • Add files in MANIFEST.in.

Changes from 0.4.0 to 0.4.1

  • Fix invalid values for classifiers defined in setup.py.

Changes from 0.3.0 to 0.4.0

  • Compile the package using scikit-build.

  • Introduce a second level of multidimensional chunking.

  • Complete API renaming.

  • Support the buffer protocol and the numpy array protocol.

  • Generalize the slicing.

  • Make cat4py independent of numpy.

Changes from 0.2.3 to 0.3.0

  • Set the development status to alpha.

  • Add instructions about installing cat4py from pip.

  • getitem and setitem are now special methods in ext.Container.

  • Add NPArray, a new class for arrays built from numpy arrays.

  • Support for serializing/deserializing Containers to/from serialized frames (bytes).

  • The pshape is calculated automatically if it is None.

  • Add a .sframe attribute for the serialized frame.

  • Big refactor for more consistent inheritance among classes.

  • The from_numpy() function always returns an NPArray now.

Changes from 0.2.2 to 0.2.3

  • Rename MANINFEST.in to MANIFEST.in.

  • Fix the list of available cnames.

Changes from 0.2.1 to 0.2.2

  • Added a MANIFEST.in for including all C-Blosc2 and Caterva sources in package.

Changes from 0.1.1 to 0.2.1

  • Docstrings have been added. In addition, the documentation can be found at: https://cat4py.readthedocs.io.

  • Add a copy parameter to from_file().

  • complib has been renamed to cname for compatibility with blosc-powered packages.

  • An itemsize that is not a power of 2 is now allowed.