PAX Storage Format
SynxDB supports the PAX (Partition Attributes Across) storage format.
PAX is a database storage format that combines the benefits of row-based storage (NSM, N-ary Storage Model) and column-based storage (DSM, Decomposition Storage Model). It is designed to improve query performance, particularly in terms of cache efficiency. In OLAP scenarios, PAX offers batch write performance similar to row-based storage and read performance like column-based storage. PAX can adapt to both cloud environments with object storage models and traditional offline physical file-based storage methods.
Compared to traditional storage formats, PAX has the following features:
Data updates and deletions: PAX uses a mark-and-delete approach for data updates and deletions. This effectively manages changes in physical files without immediately rewriting the entire data file.
Concurrency control and read-write isolation: PAX uses Multi-Version Concurrency Control (MVCC) to achieve efficient concurrency control and read-write isolation. The control granularity reaches the level of individual data files, enhancing operation safety and efficiency.
Index support: PAX supports B-tree indexes, which help speed up query operations. This is particularly useful for improving data retrieval speed when dealing with large amounts of data.
Data encoding and compression: PAX offers multiple data encoding methods (such as run-length encoding and delta encoding) and compression options (such as zstd and zlib), with various compression levels. These features help reduce storage space requirements while optimizing read performance.
Statistics: Data files contain detailed statistics that are used for quick filtering and query optimization, reducing unnecessary data scanning and speeding up query processing.
Vectorized engine: PAX also supports a vectorized engine. It aims to further enhance data processing capabilities and query performance, especially in applications like data analysis and report generation.
Applicable scenarios
The hybrid storage capability of PAX makes it suitable for complex OLAP applications that need to handle large amounts of data writes and frequent queries. Whether you are looking for a high-performance data analysis solution in a cloud infrastructure or dealing with large datasets in a traditional data center environment, PAX can provide strong support.
Usage
Create a PAX table
To create a table in PAX format, you need to set the table access method to PAX. You can do this in one of the following ways:
Use the
USING PAX
clause explicitly when creating the table, for example:CREATE TABLE t1(a int, b int, c text) USING PAX; -- t1 is a PAX table and can be used like a normal heap table.
Set the default table access method to PAX and then create the table:
-- Set the default table access method. From now on, newly created tables will use PAX format. SET default_table_access_method = pax; -- Implicitly use the default access method, which is PAX. CREATE TABLE t1(a int, b int, c text);
When creating a table, you can also specify minimum and maximum value information for certain columns to speed up queries:
-- Use WITH(minmax_columns='b,c') to specify that columns b and c
-- should record min and max statistics.
-- This helps optimize queries involving these two columns,
-- because the system can quickly determine which data blocks might contain matching data.
CREATE TABLE p2(a INT, b INT, c INT) USING pax WITH(minmax_columns='b,c');
INSERT INTO p2 SELECT i, i * 2, i * 3 FROM generate_series(1,10)i;
-- because column b has minmax statistics,
-- the system can quickly locate the data blocks that might contain the value, speeding up the query.
SELECT * FROM p2 WHERE b = 4;
-- Similarly, because of the minmax information on column b, the system can quickly determine that no data blocks can meet this condition
-- (if all generated values are positive), possibly returning no data immediately and avoiding unnecessary data scans.
SELECT * FROM p2 WHERE b < 0;
-- Modify the minmax statistics settings for table p2 to apply only to column b. For data inserted later,
-- only column b will maintain this statistical information, and it won't affect existing data or trigger any rewrites or adjustments.
ALTER TABLE p2 SET(minmax_columns='b');
View the format of an existing table
To check whether a table is in PAX format, you can use one of the following methods:
Use the
psql
command\d+
:gpadmin=# \d+ t1 Table "public.t1" Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description --------+---------+-----------+----------+---------+----------+-------------+--------------+------------- a | integer | | | | plain | | | b | integer | | | | plain | | | c | text | | | | extended | | | Distributed by: (a) Access method: pax
Query the system catalog tables
pg_class
andpg_am
:SELECT relname, amname FROM pg_class, pg_am WHERE relam = pg_am.oid AND relname = 't1'; relname | amname ---------+-------- t1 | pax (1 row)
Support for TOAST
If some columns in a PAX table contain large values, you can enable TOAST storage to store these large values in a separate TOAST file. This helps to make the data in the main data file more compact, allowing you to scan more tuples within the same data size.
By default, the TOAST storage is enabled for PAX tables. Unlike PostgreSQL, the TOAST storage supported by PAX does not rely on Page management, which allows PAX to store data larger than 2 MiB.
You can configure TOAST-related thresholds using the following parameters. For more details, refer to PAX-related system parameters.
pax_enable_toast
pax_min_size_of_compress_toast
pax_min_size_of_external_toast
Support for clustering
SynxDB has introduced z-ordering support for PAX tables, which greatly improves performance when managing and querying multi-dimensional data. Using the clustering feature, you can organize data in PAX tables using two different sorting strategies: index-based clustering and reloptions
-based clustering. Note that the two strategies are mutually exclusive, and you can only choose one.
Clustering strategies
You can use the clustering feature of SynxDB to physically sort and optimize table data. The goal is to improve query performance by reorganizing the physical storage order of the data. In PAX tables, the clustering feature provides two different strategies for sorting data:
Index-based clustering. This strategy is based on B-tree indexes and works similarly to the native clustering method in PostgreSQL. It is suitable when an index has already been created for the table.
Example:
CREATE TABLE t2(c1 int, c2 int) USING PAX; CREATE INDEX c1_idx ON t2(c1); CLUSTER t2 USING c1_idx; DROP TABLE t2;
reloptions
-based clustering. PAX tables support clustering operations based onreloptions
wherecluster_columns
are specified, and it provides the following sorting methods:Z-order clustering: This method encodes the values of multiple columns into a byte array and sorts by this byte array. It is ideal for cases where multiple columns are used as query conditions (with no fixed order for the columns in the query). This method greatly enhances performance for multi-dimensional data queries. However, it is not suitable for string columns with the same prefix. When
cluster_type
is not specified, PAX uses z-order sorting as the default clustering strategy:CREATE TABLE t2(c1 int, c2 float, c3 text) USING PAX WITH (cluster_columns='c1,c2'); INSERT INTO t2 SELECT i, i, i::text FROM generate_series(1, 100000) i; CLUSTER t2; DROP TABLE t2;
In this example, the data in table
t2
is sorted using z-order by thec1
andc2
columns. This sorting method provides great performance improvement in multi-dimensional query scenarios.Lexical clustering: This method sorts by the values and order of the columns, and it is primarily used for sorting and optimizing queries on string columns with the same prefix. For example:
CREATE TABLE t2(c1 int, c2 float, c3 text) USING PAX WITH (cluster_columns='c1,c2', cluster_type='lexical'); INSERT INTO t2 SELECT i, i, i::text FROM generate_series(1, 100000) i; CLUSTER t2; DROP TABLE t2;
In this example, the data in table
t2
is sorted lexically by thec1
andc2
columns. This sorting method is well-suited for columns with the same prefix.
Conflicts between clustering types
Note that index-based clustering and cluster_columns
-based clustering cannot be used at the same time. If a table has already been clustered based on an index (or cluster_columns
), you cannot specify cluster_columns
(or an index) for clustering.
An example of handling cluster conflicts:
-- Creates a table and uses index-based clustering
CREATE TABLE t2(c1 int, c2 float, c3 text) USING PAX;
CREATE INDEX t2_idx ON t2(c1);
CLUSTER t2 USING t2_idx;
-- Tries to use cluster columns (will fail)
ALTER TABLE t2 SET (cluster_columns='c1,c2', cluster_type='zorder');
-- Tries to use lexical clustering (will fail)
ALTER TABLE t2 SET (cluster_columns='c1,c2', cluster_type='lexical');
-- Drops the index and successfully use lexical clustering
DROP INDEX t2_idx;
ALTER TABLE t2 SET (cluster_columns='c1,c2', cluster_type='lexical');
In the above example, after table t2
is clustered using an index, attempts to set z-order or lexical clustering using cluster_columns
will fail, until the index is dropped.
Bloom filter support
SynxDB introduces support for bloom filters, allowing you to generate and maintain bloom filter information at the column level. This feature helps quickly filter data blocks and improves query performance, especially when using IN
conditions with multiple values, greatly reducing unnecessary data scans.
Bloom filter options
You can specify the columns for which you want to record bloom filter information by setting the bloomfilter_columns
option. For example:
CREATE TABLE p1 (
a int,
b int,
c text
) USING PAX WITH (bloomfilter_columns='b,c,a');
In this example, bloom filters will be generated for columns b
, c
, and a
in the table p1
.
The size of the bloom filter is controlled by the following GUC (global user configuration) parameters:
pax_max_tuples_per_file
: Controls the maximum number of tuples stored in each data file.pax_bloom_filter_work_memory_bytes
: Controls the maximum memory allowed for the bloom filter.
For example:
-- Sets the maximum number of tuples per file.
SET pax_max_tuples_per_file = 131073;
-- Sets the maximum bloom filter memory.
SET pax_bloom_filter_work_memory_bytes = 1048576; -- 1MB
-- Creates a table and specifies bloom filter options for columns.
CREATE TABLE p1 (
a int,
b int,
c text
) USING PAX WITH (bloomfilter_columns='b,c,a');
The size of the generated bloom filter is calculated as follows:
ceil(min(pax_max_tuples_per_file * 2, pax_bloom_filter_work_memory_bytes))
ceil
is a rounding function, which ensures that the size of the bloom filter is always a power of 2
.
Sparse filtering
Sparse filtering is a data scanning optimization provided by the PAX storage format. It improves query performance by skipping file scans that do not meet conditions and reducing the amount of data block to be scanned within files.
To use sparse filtering, make sure that the following conditions are met:
The system parameter
pax_enable_sparse_filter
is set toON
(default value).Statistics have been configured for the relevant columns (
min
/max
or bloom filter).The query contains filter conditions (
WHERE
clause) for these columns
Expression support examples
PAX supports basic conditional expressions. For example:
-- Creates test table
CREATE TABLE a(v1 int, v2 int, v3 int) USING PAX WITH(minmax_columns='v1, v2, v3');
-- Basic conditions
SELECT * FROM a WHERE v1 <= 3; -- Full condition support. root is OpExpr.
SELECT * FROM a WHERE v1 != NULL; -- Full condition support. root is NULLTest.
-- Multiple conditions
SELECT * FROM a WHERE v1 <= 3 AND v2 >= 3; -- Full condition support. Tree structure has only one level.
SELECT * FROM a WHERE v1 <= 3 AND v2 >= 3 AND v3 >= 3; -- Full condition support.
SELECT * FROM a WHERE v1 <= 3 AND v2 >= 3 AND v3 != NULL; -- Full condition support.
PAX supports nested expression structures and certain operators. For example:
-- Creates test table.
CREATE TABLE a(v1 int, v2 int, v3 int) USING PAX WITH(minmax_columns='v1, v2, v3');
-- Nested expressions.
SELECT * FROM a WHERE (v1 <= 3 OR v2 > 3) AND v3 >= 10; -- Full support for all conditions.
SELECT * FROM a WHERE (v1 <= 3 AND v2 > 3) OR v3 >= 10; -- Supports complete nested expressions.
-- Operators.
SELECT * FROM a WHERE v1 + v2 <= 3;
SELECT * FROM a WHERE v1 + 10 <= 3;
PAX can handle complete expression trees, including nested AND
/OR
conditions. This means that all the above queries can be optimized, and SynxDB will use all available filter conditions to improve query efficiency.
Supported expression types
PAX sparse filtering supports the following expression types:
Arithmetic operators (OpExpr)
Supports addition (
+
), subtraction (-
), and multiplication (*
)Division operations are not supported (due to difficulty in estimating negative number ranges)
Examples:
a + 1
,1 - a
,a * b
Comparison operators (OpExpr)
Supports
<
,<=
,=
,>=
,>
Examples:
a < 1
,a + 1 < 10
,a = b
Logical operators (BoolExpr)
Supports
AND
,OR
,NOT
Example:
a < 10 AND b > 10
NULL value tests (NullTest)
Supports
IS NULL
,IS NOT NULL
Examples:
a IS NULL
,a IS NOT NULL
Type casting (FuncExpr)
Only supports basic type casting
Example:
a::float8 < 1.1
Custom functions are not supported, such as
func(a) < 10
IN
operator (ScalarArrayOpExpr)Supports
IN
expressionsExample:
a IN (1, 2, 3)
Partial condition support
When a query contains unsupported expressions (such as custom functions), PAX adopts a partial condition support strategy:
Identifies and extracts supported conditions.
Uses supported conditions for sparse filtering.
For example:
-- Creates custom function v2
CREATE OR REPLACE FUNCTION func(v2 double precision)
RETURNS double precision AS $$
BEGIN
RETURN v2 * 2;
END;
$$ LANGUAGE plpgsql;
-- Uses custom function in query
SELECT * FROM a WHERE v1 < 3 AND func(v2) < 10; -- PAX will use v1 < 3 for filtering
In this example, although func(v2) < 10
cannot be used for sparse filtering, SynxDB still uses v1 < 3
to optimize query performance. This approach ensures that partial performance optimization is achieved even in complex queries.
Note
The effectiveness of sparse filtering depends on data distribution and query conditions.
It is recommended to enable statistics on columns that are frequently used as filter conditions.
Certain conditions (such as custom functions) are ignored.
View data change records on PAX tables in WAL logs
Data operations related to PAX tables are recorded in the Write-Ahead Logging (WAL) system. This ensures reliable backups between primary and mirror nodes, which assists in failover processes.
You can use the pg_waldump
tool to view the WAL logs for PAX tables. The logs are stored in the $COORDINATOR_DATA_DIRECTORY/pg_wal
directory or the pg_wal
directory within the Segment node’s data directory.
The PAX table WAL logs can be used for these purposes:
Data recovery: Helps restore data during system failures or data inconsistencies.
Failover support: Ensures data synchronization between primary and mirror nodes.
Debugging and analysis: Identifies and analyzes specific operations on PAX tables.
Example: operate PAX tables and view WAL logs
Create a PAX table.
CREATE TABLE a ( v1 int, v2 int, v3 int ) USING PAX;
This creates a PAX table
a
with columnsv1
,v2
, andv3
.Insert data into the table:
INSERT INTO a VALUES (1, 2, 3);
Locate the node where the PAX table resides:
-- Views the IDs and data directory locations of all nodes in the cluster SELECT * FROM gp_segment_configuration; -- Identifies the Segment ID of table a based on the results of the above query -- This helps determine the segment and data directory location of table a SELECT gp_segment_id FROM gp_distribution_policy WHERE localoid = 'a'::regclass;
Find the corresponding WAL log in the
pg_wal
directory of the identified node. For example:ls $COORDINATOR_DATA_DIRECTORY/pg_wal
Locate the most recent WAL file to analyze the relevant operations. In this example, assume the WAL log file is named
00000001000000000000000A
.Query the
relfilenode
of the PAX table to associate the WAL log with the PAX table:SELECT relfilenode FROM pg_class WHERE relname = 'a';
The
relfilenode
is the unique identifier of the table in the WAL log.Use the
pg_waldump
tool to parse the WAL log and identify operations related to the PAX table:pg_waldump -f $COORDINATOR_DATA_DIRECTORY/pg_wal/00000001000000000000000A
In the output, search for the previously obtained
relfilenode
to find operations associated with the PAX table.Example output of
pg_waldump
:rmgr: PAX len (rec/tot): 68/ 104, tx: 593, lsn: 0/016E1D98, prev 0/016E1D60, desc: INSERT off 3, blkref #0: rel 1663/16384/19780 blk 0
In this example:
rmgr: PAX
indicates the record type.rel 1663/16384/19780
corresponds to therelfilenode
of the PAX table.INSERT
indicates that an insert operation was recorded.
By following the steps above, you can track and analyze WAL logs related to PAX tables in SynxDB for debugging, recovery, or performance optimization purposes.
Limitations for PAX tables
For index support, the PAX storage format currently only supports B-tree (
btree
) indexes. Bugs might occur when using GiST, SP-GiST (gist/spgist
) or Brin indexes. Supports for other index types are still experimental and might be unavailable.Currently, if a field is too long, it will be stored in a TOAST file. This TOAST is different from PostgreSQL’s TOAST tables, and they only share the same name.
Unlike traditional heap tables, PAX format does not support
TOAST
fields. Currently, all column data is stored in the same data file.The PAX format does not support data backup and restore using
pg_dump
orpg_restore
. PAX tables are ignored during these operations.PAX format does not support Write-Ahead Logging (WAL), so there is no data backup between the primary and mirror servers.
Best practices
Use partitioning options:
It is recommended to use partitioning options when data needs to be imported based on an integer column and meets the following conditions:
The data is evenly distributed across this column, with a wide range and no extreme concentration.
This column is often used as a filter condition in queries or as a key for joins.
Keep in mind that the PAX partition key is only effective during a single batch data import. It cannot be adjusted between multiple data writes. The partition key settings only apply to future inserts or updates, so after changing the partition key, newly imported data will follow the new partition key.
Use
minmax
statistics:For columns with a wide data range that are often used in query filters, setting
minmax
values for these columns can greatly speed up the query process.By using
minmax
statistics, if a column in a data file does not match theminmax
values or null tests, the entire file can be skipped quickly, avoiding unnecessary data scans.Important note: The effectiveness of
minmax
depends on how data is inserted. If the data in a PAX table is inserted in batches (such as withbatch insert
orcopy
) and the data range within each batch is continuous, thenminmax
will be very effective. However, if the data is inserted randomly, the effectiveness ofminmax
filtering may be reduced.
PAX has been well-integrated with the vector execution engine. You can control whether to generate vector query plans by setting parameters in the query plan, which helps to improve query performance.
Vector plans take advantage of the columnar storage format of PAX tables, efficiently using memory and speeding up query execution. By processing data in batches, PAX tables can better support high-throughput queries. For more details, see Vectorization Query Computing.
For example:
SET vector.enable_vectorization = on; -- Enables vector query plans. SET vector.max_batch_size = 16384; -- Sets the maximum number of rows per batch.
If a PAX table has multiple columns used in query filters, you can use
with(cluster_columns='b,c,d', cluster_type='zorder')
. After sorting the data withzorder
encoding, any of thecluster_columns
can benefit from filtering. In comparison, for single-column-based cluster sorting, filtering is only effective on the sorted key, with minimal impact on other columns. Clustering sorting influencesminmax
-based filtering but does not affectbloom filter
-based filtering.If your query uses filters like
column1 in (12, 11, 13)
orcolumn1 = 'X'
, you might consider using abloom filter (with(bloomfilter_columns='b,c'))
. In internal implementation, PAX calculates bloom filters for certain data. If the bloom filter can confirm that the data is not in the block, PAX can skip the current block. Note that the effectiveness of bloom filter depends on several factors:The amount of data within a
group
. If the data in agroup
is too small, it might reduce the efficiency of vector execution.The space used by the
bloom filter
.
SynxDB PAX storage supports vector thread-level scanning, but this is limited to parallel scanning of local files only. For specific settings, see Threaded execution on single node.