PAX Storage Format

SynxDB supports the PAX (Partition Attributes Across) storage format.

PAX is a database storage format that combines the benefits of row-based storage (NSM, N-ary Storage Model) and column-based storage (DSM, Decomposition Storage Model). It is designed to improve query performance, particularly in terms of cache efficiency. In OLAP scenarios, PAX offers batch write performance similar to row-based storage and read performance like column-based storage. PAX can adapt to both cloud environments with object storage models and traditional offline physical file-based storage methods.

Compared to traditional storage formats, PAX has the following features:

Data updates and deletions: PAX uses a mark-and-delete approach for data updates and deletions. This effectively manages changes in physical files without immediately rewriting the entire data file.
Concurrency control and read-write isolation: PAX uses Multi-Version Concurrency Control (MVCC) to achieve efficient concurrency control and read-write isolation. The control granularity reaches the level of individual data files, enhancing operation safety and efficiency.
Index support: PAX supports B-tree indexes, which help speed up query operations. This is particularly useful for improving data retrieval speed when dealing with large amounts of data.
Data encoding and compression: PAX offers multiple data encoding methods (such as run-length encoding and delta encoding) and compression options (such as zstd and zlib), with various compression levels. These features help reduce storage space requirements while optimizing read performance.
Statistics: Data files contain detailed statistics that are used for quick filtering and query optimization, reducing unnecessary data scanning and speeding up query processing.
Vectorized engine: PAX also supports a vectorized engine. It aims to further enhance data processing capabilities and query performance, especially in applications like data analysis and report generation.

Applicable scenarios

The hybrid storage capability of PAX makes it suitable for complex OLAP applications that need to handle large amounts of data writes and frequent queries. Whether you are looking for a high-performance data analysis solution in a cloud infrastructure or dealing with large datasets in a traditional data center environment, PAX can provide strong support.

Usage

Create a PAX table

To create a table in PAX format, you need to set the table access method to PAX. You can do this in one of the following ways:

Use the USING PAX clause explicitly when creating the table, for example:

CREATE TABLE t1(a int, b int, c text) USING PAX;
-- t1 is a PAX table and can be used like a normal heap table.

Set the default table access method to PAX and then create the table:

-- Set the default table access method. From now on, newly created tables will use PAX format.
SET default_table_access_method = pax;

-- Implicitly use the default access method, which is PAX.
CREATE TABLE t1(a int, b int, c text);

When creating a table, you can also specify minimum and maximum value information for certain columns to speed up queries:

-- Use WITH(minmax_columns='b,c') to specify that columns b and c
-- should record min and max statistics.
-- This helps optimize queries involving these two columns,
-- because the system can quickly determine which data blocks might contain matching data.
CREATE TABLE p2(a INT, b INT, c INT) USING pax WITH(minmax_columns='b,c');

INSERT INTO p2 SELECT i, i * 2, i * 3 FROM generate_series(1,10)i;

-- because column b has minmax statistics,
-- the system can quickly locate the data blocks that might contain the value, speeding up the query.
SELECT * FROM p2 WHERE b = 4;

-- Similarly, because of the minmax information on column b, the system can quickly determine that no data blocks can meet this condition
-- (if all generated values are positive), possibly returning no data immediately and avoiding unnecessary data scans.
SELECT * FROM p2 WHERE b < 0;

-- Modify the minmax statistics settings for table p2 to apply only to column b. For data inserted later,
-- only column b will maintain this statistical information, and it won't affect existing data or trigger any rewrites or adjustments.
ALTER TABLE p2 SET(minmax_columns='b');

View the format of an existing table

To check whether a table is in PAX format, you can use one of the following methods:

Use the psql command \d+:

gpadmin=# \d+ t1
                                            Table "public.t1"
Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
a      | integer |           |          |         | plain    |             |              |
b      | integer |           |          |         | plain    |             |              |
c      | text    |           |          |         | extended |             |              |
Distributed by: (a)
Access method: pax

Query the system catalog tables pg_class and pg_am:

SELECT relname, amname FROM pg_class, pg_am WHERE relam = pg_am.oid AND relname = 't1';

relname | amname
---------+--------
t1      | pax
(1 row)

Support for TOAST

If some columns in a PAX table contain large values, you can enable TOAST storage to store these large values in a separate TOAST file. This helps to make the data in the main data file more compact, allowing you to scan more tuples within the same data size.

By default, the TOAST storage is enabled for PAX tables. Unlike PostgreSQL, the TOAST storage supported by PAX does not rely on Page management, which allows PAX to store data larger than 2 MiB.

You can configure TOAST-related thresholds using the following parameters. For more details, refer to PAX-related system parameters.

pax_enable_toast
pax_min_size_of_compress_toast
pax_min_size_of_external_toast

Support for clustering

SynxDB has introduced z-ordering support for PAX tables, which greatly improves performance when managing and querying multi-dimensional data. Using the clustering feature, you can organize data in PAX tables using two different sorting strategies: index-based clustering and reloptions-based clustering. Note that the two strategies are mutually exclusive, and you can only choose one.

Clustering strategies

You can use the clustering feature of SynxDB to physically sort and optimize table data. The goal is to improve query performance by reorganizing the physical storage order of the data. In PAX tables, the clustering feature provides two different strategies for sorting data:

Index-based clustering. This strategy is based on B-tree indexes and works similarly to the native clustering method in PostgreSQL. It is suitable when an index has already been created for the table.

Example:
```
CREATE TABLE t2(c1 int, c2 int) USING PAX;
CREATE INDEX c1_idx ON t2(c1);
CLUSTER t2 USING c1_idx;
DROP TABLE t2;
```
reloptions-based clustering. PAX tables support clustering operations based on reloptions where cluster_columns are specified, and it provides the following sorting methods:
- Z-order clustering: This method encodes the values of multiple columns into a byte array and sorts by this byte array. It is ideal for cases where multiple columns are used as query conditions (with no fixed order for the columns in the query). This method greatly enhances performance for multi-dimensional data queries. However, it is not suitable for string columns with the same prefix. When cluster_type is not specified, PAX uses z-order sorting as the default clustering strategy:
```
CREATE TABLE t2(c1 int, c2 float, c3 text) USING PAX WITH (cluster_columns='c1,c2');
INSERT INTO t2 SELECT i, i, i::text FROM generate_series(1, 100000) i;
CLUSTER t2;
DROP TABLE t2;
```
  In this example, the data in table t2 is sorted using z-order by the c1 and c2 columns. This sorting method provides great performance improvement in multi-dimensional query scenarios.
- Lexical clustering: This method sorts by the values and order of the columns, and it is primarily used for sorting and optimizing queries on string columns with the same prefix. For example:
```
CREATE TABLE t2(c1 int, c2 float, c3 text) USING PAX WITH (cluster_columns='c1,c2', cluster_type='lexical');
INSERT INTO t2 SELECT i, i, i::text FROM generate_series(1, 100000) i;
CLUSTER t2;
DROP TABLE t2;
```
  In this example, the data in table t2 is sorted lexically by the c1 and c2 columns. This sorting method is well-suited for columns with the same prefix.

Conflicts between clustering types

Note that index-based clustering and cluster_columns-based clustering cannot be used at the same time. If a table has already been clustered based on an index (or cluster_columns), you cannot specify cluster_columns (or an index) for clustering.

An example of handling cluster conflicts:

-- Creates a table and uses index-based clustering
CREATE TABLE t2(c1 int, c2 float, c3 text) USING PAX;
CREATE INDEX t2_idx ON t2(c1);
CLUSTER t2 USING t2_idx;

-- Tries to use cluster columns (will fail)
ALTER TABLE t2 SET (cluster_columns='c1,c2', cluster_type='zorder');
-- Tries to use lexical clustering (will fail)
ALTER TABLE t2 SET (cluster_columns='c1,c2', cluster_type='lexical');

-- Drops the index and successfully use lexical clustering
DROP INDEX t2_idx;
ALTER TABLE t2 SET (cluster_columns='c1,c2', cluster_type='lexical');

In the above example, after table t2 is clustered using an index, attempts to set z-order or lexical clustering using cluster_columns will fail, until the index is dropped.

Bloom filter support

SynxDB introduces support for bloom filters, allowing you to generate and maintain bloom filter information at the column level. This feature helps quickly filter data blocks and improves query performance, especially when using IN conditions with multiple values, greatly reducing unnecessary data scans.

Sparse filtering

Sparse filtering is a data scanning optimization provided by the PAX storage format. It improves query performance by skipping file scans that do not meet conditions and reducing the amount of data block to be scanned within files.

To use sparse filtering, make sure that the following conditions are met:

The system parameter pax_enable_sparse_filter is set to ON (default value).
Statistics have been configured for the relevant columns (min/max or bloom filter).
The query contains filter conditions (WHERE clause) for these columns

Expression support examples

PAX supports basic conditional expressions. For example:

-- Creates test table
CREATE TABLE a(v1 int, v2 int, v3 int) USING PAX WITH(minmax_columns='v1, v2, v3');

-- Basic conditions
SELECT * FROM a WHERE v1 <= 3;                                    -- Full condition support. root is OpExpr.
SELECT * FROM a WHERE v1 != NULL;                                -- Full condition support. root is NULLTest.

-- Multiple conditions
SELECT * FROM a WHERE v1 <= 3 AND v2 >= 3;                       -- Full condition support. Tree structure has only one level.
SELECT * FROM a WHERE v1 <= 3 AND v2 >= 3 AND v3 >= 3;          -- Full condition support.
SELECT * FROM a WHERE v1 <= 3 AND v2 >= 3 AND v3 != NULL;       -- Full condition support.

PAX supports nested expression structures and certain operators. For example:

-- Creates test table.
CREATE TABLE a(v1 int, v2 int, v3 int) USING PAX WITH(minmax_columns='v1, v2, v3');

-- Nested expressions.
SELECT * FROM a WHERE (v1 <= 3 OR v2 > 3) AND v3 >= 10;         -- Full support for all conditions.
SELECT * FROM a WHERE (v1 <= 3 AND v2 > 3) OR v3 >= 10;         -- Supports complete nested expressions.

-- Operators.
SELECT * FROM a WHERE v1 + v2 <= 3;
SELECT * FROM a WHERE v1 + 10 <= 3;

PAX can handle complete expression trees, including nested AND/OR conditions. This means that all the above queries can be optimized, and SynxDB will use all available filter conditions to improve query efficiency.

Supported expression types

PAX sparse filtering supports the following expression types:

Arithmetic operators (OpExpr)
- Supports addition (+), subtraction (-), and multiplication (*)
- Division operations are not supported (due to difficulty in estimating negative number ranges)
- Examples: a + 1, 1 - a, a * b
Comparison operators (OpExpr)
- Supports <, <=, =, >=, >
- Examples: a < 1, a + 1 < 10, a = b
Logical operators (BoolExpr)
- Supports AND, OR, NOT
- Example: a < 10 AND b > 10
NULL value tests (NullTest)
- Supports IS NULL, IS NOT NULL
- Examples: a IS NULL, a IS NOT NULL
Type casting (FuncExpr)
- Only supports basic type casting
- Example: a::float8 < 1.1
- Custom functions are not supported, such as func(a) < 10
IN operator (ScalarArrayOpExpr)
- Supports IN expressions
- Example: a IN (1, 2, 3)

Partial condition support

When a query contains unsupported expressions (such as custom functions), PAX adopts a partial condition support strategy:

Identifies and extracts supported conditions.
Uses supported conditions for sparse filtering.

For example:

-- Creates custom function v2
CREATE OR REPLACE FUNCTION func(v2 double precision)
RETURNS double precision AS $$
BEGIN
  RETURN v2 * 2;
END;
$$ LANGUAGE plpgsql;

-- Uses custom function in query
SELECT * FROM a WHERE v1 < 3 AND func(v2) < 10;  -- PAX will use v1 < 3 for filtering

In this example, although func(v2) < 10 cannot be used for sparse filtering, SynxDB still uses v1 < 3 to optimize query performance. This approach ensures that partial performance optimization is achieved even in complex queries.

Note

The effectiveness of sparse filtering depends on data distribution and query conditions.
It is recommended to enable statistics on columns that are frequently used as filter conditions.
Certain conditions (such as custom functions) are ignored.

View data change records on PAX tables in WAL logs

Data operations related to PAX tables are recorded in the Write-Ahead Logging (WAL) system. This ensures reliable backups between primary and mirror nodes, which assists in failover processes.

You can use the pg_waldump tool to view the WAL logs for PAX tables. The logs are stored in the $COORDINATOR_DATA_DIRECTORY/pg_wal directory or the pg_wal directory within the Segment node’s data directory.

The PAX table WAL logs can be used for these purposes:

Data recovery: Helps restore data during system failures or data inconsistencies.
Failover support: Ensures data synchronization between primary and mirror nodes.
Debugging and analysis: Identifies and analyzes specific operations on PAX tables.

Example: operate PAX tables and view WAL logs

Create a PAX table.

CREATE TABLE a (
   v1 int,
   v2 int,
   v3 int
) USING PAX;

This creates a PAX table a with columns v1, v2, and v3.

Insert data into the table:
```
INSERT INTO a VALUES (1, 2, 3);
```

Locate the node where the PAX table resides:

-- Views the IDs and data directory locations of all nodes in the cluster
SELECT * FROM gp_segment_configuration;

-- Identifies the Segment ID of table a based on the results of the above query
-- This helps determine the segment and data directory location of table a
SELECT gp_segment_id FROM gp_distribution_policy WHERE localoid = 'a'::regclass;

Find the corresponding WAL log in the pg_wal directory of the identified node. For example:
```
ls $COORDINATOR_DATA_DIRECTORY/pg_wal
```
Locate the most recent WAL file to analyze the relevant operations. In this example, assume the WAL log file is named 00000001000000000000000A.
Query the relfilenode of the PAX table to associate the WAL log with the PAX table:
```
SELECT relfilenode FROM pg_class WHERE relname = 'a';
```
The relfilenode is the unique identifier of the table in the WAL log.
Use the pg_waldump tool to parse the WAL log and identify operations related to the PAX table:
```
pg_waldump -f $COORDINATOR_DATA_DIRECTORY/pg_wal/00000001000000000000000A
```
In the output, search for the previously obtained relfilenode to find operations associated with the PAX table.

Example output of pg_waldump:
```
rmgr: PAX       len (rec/tot):     68/   104, tx:    593, lsn: 0/016E1D98, prev 0/016E1D60, desc: INSERT off 3, blkref #0: rel 1663/16384/19780 blk 0
```
In this example:
- rmgr: PAX indicates the record type.
- rel 1663/16384/19780 corresponds to the relfilenode of the PAX table.
- INSERT indicates that an insert operation was recorded.

By following the steps above, you can track and analyze WAL logs related to PAX tables in SynxDB for debugging, recovery, or performance optimization purposes.

Limitations for PAX tables

For index support, the PAX storage format currently only supports B-tree (btree) indexes. Bugs might occur when using GiST, SP-GiST (gist/spgist) or Brin indexes. Supports for other index types are still experimental and might be unavailable.
Currently, if a field is too long, it will be stored in a TOAST file. This TOAST is different from PostgreSQL’s TOAST tables, and they only share the same name.
Unlike traditional heap tables, PAX format does not support TOAST fields. Currently, all column data is stored in the same data file.
The PAX format does not support data backup and restore using pg_dump or pg_restore. PAX tables are ignored during these operations.
PAX format does not support Write-Ahead Logging (WAL), so there is no data backup between the primary and mirror servers.

PAX-related SQL options

PAX supports SQL options to control its behavior. You can use these options in the WITH() clause, for example, WITH(minmax_columns='b,c', storage_format=porc).

Name	Type	Valid values	Description
`storage_format`	`string`	`porc` `porc_vec`	Controls the internal storage format. `porc` is the default value, which stores data in a regular format and does not preserve null values. `porc_vec` stores data in a vector format and always preserves null values for fixed-length fields, regardless of whether the column value is null or not.
`compresstype`	`string`	`none` `rle` `delta` `zstd` `zlib` `dict`	Specifies the compression method for column values. You can only choose one method. The default value is `none`. `none` means no compression is applied. `rle` uses run-length encoding to compress repeated consecutive data. `delta` is used for numeric columns, storing the difference between adjacent values. `zstd` uses the zstd algorithm, which offers high compression ratio and fast compression speed. `zlib` is a general-purpose compression algorithm, suitable for compressing general data. `dict` uses dictionary encoding to speed up the processing of many repeated strings. Currently, it is an experimental feature and is not recommended for production use.
`compresslevel`	`int`	Range `[0, 19]`	Indicates the compression level. The default value is `0`. A smaller value means faster compression, while a larger value provides higher compression. This option is only effective when used with `compresstype`.
`minmax_columns`	`string`	Valid column names in the table, separated by commas	Records `minmax` statistics for the defined columns to speed up data queries. After renaming a column, the statistics will no longer be recorded for that column. If you modify `minmax_columns` using `ALTER TABLE`, the change only applies to data files written afterward and does not affect existing data files.
`cluster_columns`	`string`	A comma-separated list of valid column names in the table.	Indicates that the PAX table stores the internal data in a clustered way. When using the `cluster` command, the data can be sorted by these columns, and the sorting method is controlled by the `cluster_type` parameter.
`bloomfilter_columns`	`string`	A comma-separated list of valid column names in the table.	Calculates bloom filters for the data in the specified columns to help data filtering.
`parallel_workers`	`int`	`[0,64]`	A PostgreSQL option that sets the number of parallel processes for concurrent scanning.
`cluster_type`	`string`	`lexical` and `zorder`	Specifies the cluster type when clustering is based on custom columns defined in `reloptions` rather than an index. Valid values are `lexical` and `zorder`. `lexical` sorts by the values and order of the columns, while `zorder` encodes the values of multiple columns into a byte array and sorts by that array.

The values of these options only affect newly inserted and updated data and do not change the existing data.

PAX-related system parameters

The following system parameters (GUC) are used to set the behavior of PAX tables in the current session. Use the command SET <parameter>=<value> to configure them, for example, SET pax_enable_sparse_filter=on.

Parameter name	Value type	Valid values	Description
`pax_enable_sparse_filter`	`bool`	`on` and `off`	Specifies whether to enable sparse filtering based on statistics. The default value is `on`, which means that sparse filtering is enabled by default.
`pax_enable_row_filter`	`bool`	`on` and `off`	Specifies whether to enable row filtering. The default value is `off`, which means that row filtering is disabled. It is not recommended to enable this parameter.
`pax_scan_reuse_buffer_size`	`int`	Range `[1048576, 33554432]` (1MiB to 32MiB)	The buffer block size used during scanning. The default value is `8388608` (8MiB).
`pax_max_tuples_per_group`	`int`	Range `[5, 524288]`	Specifies the maximum number of tuples allowed in each group. The default value is `131072`.
`pax_max_tuples_per_file`	`int`	Range `[131072, 8388608]`	Specifies the maximum number of tuples allowed in each data file. The maximum value is `8388608`. The default value is `1310720`.
`pax_max_size_per_file`	`int`	Range `[8388608, 335544320]` (8MiB to 320MiB)	The maximum physical size allowed for each data file. The default value is `67108864` (64MiB). The actual file size might be slightly larger than the set size. Very large or small values might negatively impact performance.
`pax_enable_toast`	`bool`	`on` and `off`. The default value is `on`.	Specifies whether to enable TOAST support.
`pax_min_size_of_compress_toast`	`int`	Range `[524288, 1073741824]` (512KiB to 1GiB). The default value is `524288` (512KiB).	Specifies the threshold for creating compressed TOAST tables. If the character length exceeds this threshold, SynxDB creates compressed TOAST tables for storage.
`pax_min_size_of_external_toast`	`int`	Range `[10485760, 2147483647]` (10MiB to 2GiB). The default value is `10485760` (10MiB).	Specifies the threshold for creating external TOAST tables. If the character length exceeds this threshold, SynxDB creates external TOAST tables for storage.
`pax_default_storage_format`	`string`	`porc` (the default value)	Controls the default storage format.
`pax_bloom_filter_work_memory_bytes`	`int`	Range `[1024, 2147483647]` (1KiB to 2GiB). The default value is `10240` (10KiB).	Controls the maximum memory allowed for bloom filter usage.

Best practices

Use partitioning options:
- It is recommended to use partitioning options when data needs to be imported based on an integer column and meets the following conditions:
  The data is evenly distributed across this column, with a wide range and no extreme concentration.
  
  This column is often used as a filter condition in queries or as a key for joins.
- Keep in mind that the PAX partition key is only effective during a single batch data import. It cannot be adjusted between multiple data writes. The partition key settings only apply to future inserts or updates, so after changing the partition key, newly imported data will follow the new partition key.
Use minmax statistics:
- For columns with a wide data range that are often used in query filters, setting minmax values for these columns can greatly speed up the query process.
- By using minmax statistics, if a column in a data file does not match the minmax values or null tests, the entire file can be skipped quickly, avoiding unnecessary data scans.
- Important note: The effectiveness of minmax depends on how data is inserted. If the data in a PAX table is inserted in batches (such as with batch insert or copy) and the data range within each batch is continuous, then minmax will be very effective. However, if the data is inserted randomly, the effectiveness of minmax filtering may be reduced.
PAX has been well-integrated with the vector execution engine. You can control whether to generate vector query plans by setting parameters in the query plan, which helps to improve query performance.

Vector plans take advantage of the columnar storage format of PAX tables, efficiently using memory and speeding up query execution. By processing data in batches, PAX tables can better support high-throughput queries. For more details, see Vectorization Query Computing.

For example:
```
SET vector.enable_vectorization = on;  -- Enables vector query plans.
SET vector.max_batch_size = 16384;  -- Sets the maximum number of rows per batch.
```
If a PAX table has multiple columns used in query filters, you can use with(cluster_columns='b,c,d', cluster_type='zorder'). After sorting the data with zorder encoding, any of the cluster_columns can benefit from filtering. In comparison, for single-column-based cluster sorting, filtering is only effective on the sorted key, with minimal impact on other columns. Clustering sorting influences minmax-based filtering but does not affect bloom filter-based filtering.
If your query uses filters like column1 in (12, 11, 13) or column1 = 'X', you might consider using a bloom filter (with(bloomfilter_columns='b,c')). In internal implementation, PAX calculates bloom filters for certain data. If the bloom filter can confirm that the data is not in the block, PAX can skip the current block. Note that the effectiveness of bloom filter depends on several factors:
- The amount of data within a group. If the data in a group is too small, it might reduce the efficiency of vector execution.
- The space used by the bloom filter.
SynxDB PAX storage supports vector thread-level scanning, but this is limited to parallel scanning of local files only. For specific settings, see Threaded execution on single node.