path: root/src/backend/utils/sort
Age  Commit message  Author
8 days  Switch buffile.c/h to use pgoff_t instead of off_t  (Michael Paquier)

off_t was previously used for offsets, which is 4 bytes on Windows, hence imposing a hard 2GB limit on file size in the backend code. This leads to some simplification in these files, removing some casts based on long, also 4 bytes on Windows. This commit removes one comment introduced in db3c4c3a2d98, not relevant anymore as pgoff_t is a safe 8-byte alternative on Windows.

This change is surprisingly non-invasive, as the callers of BufFileTell(), BufFileSeek() and BufFileTruncateFileSet() (worker.c, tuplestore.c, etc.) track offsets in local structures that, for the most part, just need to switch from off_t to pgoff_t.

The file still relies on a maximum file size of MAX_PHYSICAL_FILESIZE (1GB). This change allows the code to make this maximum larger in the future, or larger on demand.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aUStrqoOCDRFAq1M@paquier.xyz

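For illustration, a minimal caller-side sketch of the pattern described above, assuming the post-change BufFileTell()/BufFileSeek() signatures take pgoff_t; the struct and function names here are hypothetical, not actual PostgreSQL code.

    #include "postgres.h"
    #include "storage/buffile.h"

    /* Hypothetical caller-side state; only the offset type changes. */
    typedef struct MyReadPos
    {
        BufFile    *file;
        int         fileno;
        pgoff_t     offset;     /* was off_t, which is 4 bytes on Windows */
    } MyReadPos;

    static void
    my_save_pos(MyReadPos *pos)
    {
        BufFileTell(pos->file, &pos->fileno, &pos->offset);
    }

    static void
    my_restore_pos(MyReadPos *pos)
    {
        if (BufFileSeek(pos->file, pos->fileno, pos->offset, SEEK_SET) != 0)
            elog(ERROR, "could not seek in temporary file");
    }
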
2025-12-11  Fix some comments.  (Nathan Bossart)

Like commit 123661427b, these were discovered while reviewing Aleksander Alekseev's proposed changes to pgindent.

2025-12-10  Fix misleading comment in tuplesort.c  (David Rowley)

A comment in tuplesort.c was claiming that the code was defining INITIAL_MEMTUPSIZE so that it *does not* exceed ALLOCSET_SEPARATE_THRESHOLD, but the code actually ensures that we purposefully *do* exceed ALLOCSET_SEPARATE_THRESHOLD for the initial allocation of the tuples array, as per reasons detailed in the commentary of grow_memtuples().

Also, there's not much need to repeat the mention of ALLOCSET_SEPARATE_THRESHOLD in each location where INITIAL_MEMTUPSIZE is used, so remove those comments.

Author: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Discussion: https://postgr.es/m/tencent_6FA14F85D6B5B5291532D6789E07F4765C08%40qq.com

2025-12-10  Use palloc_object() and palloc_array() in backend code  (Michael Paquier)

The idea is to further encourage the use of these new routines across the tree, as they offer stronger type safety guarantees than palloc(). This batch of changes includes most of the trivial changes suggested by the author for src/backend/.

A total of 334 files are updated here. Among these files, 48 of them see their build output change slightly; this is caused by line number changes, as the new allocation calls are simpler, shaving around 100 lines of code in total.

Similar work has been done in 0c3c5c3b06a3 and 31d3847a37be.

Author: David Geier <geidav.pg@gmail.com>
Discussion: https://postgr.es/m/ad0748d4-3080-436e-b0bc-ac8f86a3466a@gmail.com

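A small sketch of the difference; "SortItem" is a hypothetical struct name. palloc_object()/palloc_array() carry the element type, so the compiler can catch size or type mismatches that a bare palloc() with hand-written sizeof arithmetic would not.

    #include "postgres.h"

    typedef struct SortItem     /* hypothetical */
    {
        Datum   datum;
        bool    isnull;
    } SortItem;

    static SortItem *
    make_items(int nitems)
    {
        /* old style: size arithmetic and cast done by hand */
        SortItem   *a = (SortItem *) palloc(nitems * sizeof(SortItem));
        /* new style: type-safe, no explicit sizeof or cast */
        SortItem   *b = palloc_array(SortItem, nitems);
        SortItem   *one = palloc_object(SortItem);      /* single object */

        pfree(a);
        pfree(one);
        return b;
    }
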
2025-11-26  Remove a few unused struct members.  (Nathan Bossart)

Oversights in commits ab9e0e718a, f3049a603a, and 247ce06b88.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/aScUuBSawPWogUxs%40ip-10-97-1-34.eu-west-3.compute.internal

2025-11-21  Remove useless casts to (void *)  (Peter Eisentraut)

Their presence causes (small) risks of hiding actual type mismatches or silently discarding qualifiers. Some were missed in 7f798aca1d5 and some are new ones along the same lines.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/aR8Yv%2BuATLKbJCgI%40ip-10-97-1-34.eu-west-3.compute.internal

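For illustration, the kind of cast being removed; a hedged example, not a specific hunk from the commit.

    #include <string.h>

    static void
    copy_block(char *dst, const char *src, size_t len)
    {
        /* Before: the casts added nothing and could hide a type mismatch */
        memcpy((void *) dst, (const void *) src, len);

        /* After: memcpy() already takes void pointers */
        memcpy(dst, src, len);
    }
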
2025-10-30  Mark ItemPointer arguments as const throughout  (Peter Eisentraut)

This is a follow-up to 991295f. I searched over src/ and made ItemPointer arguments const wherever possible.

Note: We cut out from the original patch the pieces that would have created incompatibilities in the index or table AM APIs. Those could be considered separately.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAEoWx2nBaypg16Z5ciHuKw66pk850RFWw9ACS2DqqJ_AkKeRsw%40mail.gmail.com

2025-10-03  Fix incorrect function reference in comment  (Richard Guo)

The comment incorrectly references the defunct function BufFileOpenShared(), which was replaced in commit dcac5e7ac. This patch updates the comment to refer to the current function BufFileOpenFileSet().

Author: Zhang Mingli <zmlpostgres@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/1cb48b4c-54ab-40cc-b355-0b3c2af6d3f7@Spark

2025-08-14  Avoid including tableam.h and xlogreader.h in nbtree.h  (Álvaro Herrera)

Doing that seems rather random and unnecessary. This commit removes those and fixes fallout, which is pretty minimal. We do need to add a forward declaration of struct TM_IndexDeleteOp (whose full definition appears in tableam.h) so that _bt_delitems_delete_check()'s declaration can use it.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/202508051109.lzk3lcuzsaxo@alvherre.pgsql

2025-08-13  Grab the low-hanging fruit from forcing sizeof(Datum) to 8.  (Tom Lane)

Remove conditionally-compiled code for smaller Datum widths, and simplify comments that describe cases no longer of interest. I also fixed up a few more places that were not using DatumGetIntXX where they should, and made some cosmetic adjustments such as using sizeof(int64) not sizeof(Datum) in places where that fit better with the surrounding code.

One thing I remembered while preparing this part is that SP-GiST stores pass-by-value prefix keys as Datums, so that the on-disk representation depends on sizeof(Datum). That's even more unfortunate than the existing commentary makes it out to be, because now there is a hazard that the change of sizeof(Datum) will break SP-GiST indexes on 32-bit machines. It appears that there are no existing SP-GiST opclasses that are actually affected; and if there are some that I didn't find, the number of installations that are using them on 32-bit machines is doubtless tiny. So I'm proceeding on the assumption that we can get away with this, but it's something to worry about.

(gininsert.c looks like it has a similar problem, but it's okay because the "tuples" it's constructing are just transient data within the tuplesort step. That's pretty poorly documented though, so I added some comments.)

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us

2025-08-08  Add missing Datum conversions  (Peter Eisentraut)

Add various missing conversions from and to Datum. The previous code mostly relied on implicit conversions or its own explicit casts instead of using the correct DatumGet*() or *GetDatum() functions.

We think these omissions are harmless. Some actual bugs that were discovered during this process have been committed separately (80c758a2e1d, fd2ab03fea2).

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/8246d7ff-f4b7-4363-913e-827dadfeb145%40eisentraut.org

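A hedged sketch of the style being enforced, using conversion macros that do exist in postgres.h; the variable names are illustrative.

    #include "postgres.h"

    static Datum
    add_one(Datum d)
    {
        int32   v = DatumGetInt32(d);       /* not: (int32) d */

        return Int32GetDatum(v + 1);        /* not: (Datum) (v + 1) */
    }

    static Datum
    wrap_pointer(void *p)
    {
        return PointerGetDatum(p);          /* not: (Datum) p */
    }
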
2025-04-06  Relax ordering-related hardcoded btree requirements in planning  (Peter Eisentraut)

There were several places in ordering-related planning where a requirement for btree was hardcoded but an amcanorder index could suffice. This fixes that. We just need to do the necessary mapping between strategy numbers and compare types and adjust some related APIs so that this works independent of btree strategy numbers. For instance, non-btree amcanorder indexes can now be used to support sorting and merge joins. Also, predtest.c works independent of btree strategy numbers now.

To avoid performance regressions, some details on btree and other built-in index types are still hardcoded as shortcuts, but other index types now have access to the same features by providing the required flags and callbacks.

Author: Mark Dilger <mark.dilger@enterprisedb.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

2025-04-01  Fix minor C type confusion  (Peter Eisentraut)

Returning false instead of NULL gets a compiler error under gcc-14 -std=gnu23, and it appears to have been unintentional. Fix for commit 8492feb98f6.

2025-03-29  Use PRI?64 instead of "ll?" in format strings (continued).  (Peter Eisentraut)

Continuation of work started in commit 15a79c73, after initial trial.

Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/b936d2fb-590d-49c3-a615-92c3a88c6c19%40eisentraut.org

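An illustrative one-liner of the change in style, assuming an int64 counter; the message text is made up.

    #include "postgres.h"
    #include <inttypes.h>

    static void
    report_tuples(int64 ntuples)
    {
        /* old style: elog(LOG, "wrote %lld tuples", (long long) ntuples); */
        elog(LOG, "wrote %" PRId64 " tuples", ntuples);
    }
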
2025-03-24  Add ExecCopySlotMinimalTupleExtra().  (Jeff Davis)

Allows an "extra" argument that allocates extra memory at the end of the MinimalTuple. This is important for callers that need to store additional data, but do not want to perform an additional allocation.

Suggested-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvppeqw2pNM-+ahBOJwq2QmC0hOAGsmCpC89QVmEoOvsdg@mail.gmail.com

2025-03-14  Simplify and generalize PrepareSortSupportFromIndexRel()  (Peter Eisentraut)

PrepareSortSupportFromIndexRel() was accepting btree strategy numbers purely for the purpose of comparing them later against btree strategies to determine if the sort direction was forward or reverse. Change that. Instead, pass a bool directly, to indicate the same without an unfortunate assumption that a strategy number refers specifically to a btree strategy. (This is similar in spirit to commits 0d2aa4d4937 and c594f1ad2ba.)

(This could arguably be simplified further by having the callers fill in ssup_reverse directly. But this way, it preserves consistency by having all PrepareSortSupport*() variants be responsible for filling in ssup_reverse.)

Moreover, remove the hardcoded check against BTREE_AM_OID, and check against amcanorder instead, which is the actual requirement.

Co-authored-by: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

2025-03-03  Allow parallel CREATE INDEX for GIN indexes  (Tomas Vondra)

Allow using parallel workers to build a GIN index, similarly to BTREE and BRIN. For large tables this may result in significant speedup when the build is CPU-bound.

The work is divided so that each worker builds index entries on a subset of the table, determined by the regular parallel scan used to read the data. Each worker uses a local tuplesort to sort and merge the entries for the same key. The TID lists do not overlap (for a given key), which means the merge sort simply concatenates the two lists. The merged entries are written into a shared tuplesort for the leader.

The leader needs to merge the sorted entries again, before writing them into the index. But this way a significant part of the work happens in the workers, and the leader is left with merging fewer large entries, which is more efficient.

Most of the parallelism infrastructure is a simplified copy of the code used by BTREE indexes, omitting the parts irrelevant for GIN indexes (e.g. uniqueness checks).

Original patch by me, with reviews and substantial improvements by Matthias van de Meent, certainly enough to make him a co-author.

Author: Tomas Vondra, Matthias van de Meent
Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke
Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com

2025-01-31  Get rid of our dependency on type "long" for memory size calculations.  (Tom Lane)

Consistently use "Size" (or size_t, or in some places int64 or double) as the type for variables holding memory allocation sizes.

In most places variables' data types were fine already, but we had an ancient habit of computing bytes from kilobytes-units GUCs with code like "work_mem * 1024L". That risks overflow on Win64 where they did not make "long" as wide as "size_t". We worked around that by restricting such GUCs' ranges, so you couldn't set work_mem et al higher than 2GB on Win64. This patch removes that restriction, after replacing such calculations with "work_mem * (Size) 1024" or variants of that.

It should be noted that this patch was constructed by searching outwards from the GUCs that have MAX_KILOBYTES as upper limit. So I can't positively guarantee there are no other places doing memory-size arithmetic in int or long variables. I do however feel pretty confident that increasing MAX_KILOBYTES on Win64 is safe now. Also, nothing in our code should be dealing in multiple-gigabyte allocations without authorization from a relevant GUC, so it seems pretty likely that this search caught everything that could be at risk of overflow.

Author: Vladlen Popolitov <v.popolitov@postgrespro.ru>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/1a01f0-66ec2d80-3b-68487680@27595217

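The arithmetic pattern in question, sketched; work_mem is the real GUC, the function is illustrative.

    #include "postgres.h"
    #include "miscadmin.h"      /* for work_mem */

    static Size
    sort_mem_bytes(void)
    {
        /*
         * Before: "work_mem * 1024L" could overflow on Win64, where long is
         * only 32 bits.  Widening to Size before multiplying avoids that.
         */
        return work_mem * (Size) 1024;
    }
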
2025-01-29  Fix grammatical typos around possessive "its"  (John Naylor)

Some places spelled it "it's", which is short for "it is". In passing, fix a couple other nearby grammatical errors.

Author: Jacob Brazeal <jacob.brazeal@gmail.com>
Discussion: https://postgr.es/m/CA+COZaAO8g1KJCV0T48=CkJMjAnnfTGLWOATz+2aCh40c2Nm+g@mail.gmail.com

2025-01-01  Update copyright for 2025  (Bruce Momjian)

Backpatch-through: 13

2024-11-28  Remove useless casts to (void *)  (Peter Eisentraut)

Many of them just seem to have been copied around for no real reason. Their presence causes (small) risks of hiding actual type mismatches or silently discarding qualifiers.

Discussion: https://www.postgresql.org/message-id/flat/461ea37c-8b58-43b4-9736-52884e862820@eisentraut.org

2024-09-12  Adjust tuplestore stats API  (David Rowley)

1eff8279d added an API to tuplestore.c to allow callers to obtain storage telemetry data. That API wasn't quite good enough for callers that perform tuplestore_clear(), as the telemetry functions only accounted for the current state of the tuplestore, not the maximums before tuplestore_clear() was called.

There's a pending patch that would like to add tuplestore telemetry output to EXPLAIN ANALYZE for WindowAgg. That node type uses tuplestore_clear() before moving to the next window partition, and we want to show the maximum space used, not the space used for the final partition.

Reviewed-by: Tatsuo Ishii, Ashutosh Bapat
Discussion: https://postgr.es/m/CAApHDvoY8cibGcicLV0fNh=9JVx9PANcWvhkdjBnDCc9Quqytg@mail.gmail.com

2024-08-14  Remove TRACE_SORT macro  (Peter Eisentraut)

The TRACE_SORT macro guarded the availability of the trace_sort GUC setting. But it has been enabled by default ever since it was introduced in PostgreSQL 8.1, and there have been no reports that someone wanted to disable it. So just remove the macro to simplify things.

(For the avoidance of doubt: The trace_sort GUC is still there. This only removes the rarely-used macro guarding it.)

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/be5f7162-7c1d-44e3-9a78-74dcaa6529f2%40eisentraut.org

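Roughly what the simplification looks like at a call site; the elog text is paraphrased, not quoted from tuplesort.c.

    #include "postgres.h"
    #include "utils/guc.h"      /* for trace_sort */

    static void
    trace_sort_start(int workMem)
    {
        /* Previously this test was wrapped in #ifdef TRACE_SORT ... #endif. */
        if (trace_sort)
            elog(LOG, "begin tuple sort: workMem = %d", workMem);
    }
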
2024-07-06  Adjust tuplestore.c not to allocate BufFiles in generation context  (David Rowley)

590b045c3 made it so tuplestore.c would store tuples inside a generation.c memory context. After fixing a bug report in 97651b013, it seems that it's probably best not to make BufFile-related allocations in that context. Let's keep it just for tuple data.

This adjusts the code to switch to the Tuplestorestate.context's parent, which is the MemoryContext that tuplestore_begin_common() was called in. It does not seem worth adding a new field in Tuplestorestate to store this when we can access it by looking at the Tuplestorestate's context's parent.

Discussion: https://postgr.es/m/CAApHDvqFt_CdJtSr+E9YLZb7jZAyRCy3hjQ+ktM+dcOFVq-xkg@mail.gmail.com

2024-07-05  Improve memory management and performance of tuplestore.c  (David Rowley)

Here we make tuplestore.c use a generation.c memory context rather than allocating tuples into the CurrentMemoryContext, which primarily is the ExecutorState or PortalHoldContext memory context. Not having a dedicated context can cause the CurrentMemoryContext context to become bloated when pfree'd chunks are not reused by future tuples. Using generation speeds up users of tuplestore.c, such as the Materialize, WindowAgg and CTE Scan executor nodes.

The main reason for the speedup is generation.c being more memory efficient than aset.c memory contexts. Specifically, generation does not round sizes up to the next power of 2 value. This both saves memory, allowing more tuples to fit in work_mem, and makes the memory usage more compact, fitting on fewer cachelines. One benchmark showed up to a 22% performance increase in a query containing a Materialize node. Much higher gains are possible if the memory reduction prevents tuplestore.c from spilling to disk. This is especially true for WindowAgg nodes, where improvements of several thousand times are possible if the memory reductions made here prevent tuplestore from spilling to disk.

Additionally, a generation.c memory context is much better suited for this job as it works well with FIFO palloc/pfree patterns, which is exactly how tuplestore.c uses it. Because of the way generation.c allocates memory, tuples consecutively stored in tuplestores are much more likely to be stored consecutively in memory. This allows the CPU's hardware prefetcher to work more efficiently, as it provides a more predictable pattern, allowing cachelines for the next tuple to be loaded from RAM in advance of them being needed by the executor.

Using a dedicated memory context for storing tuples also allows us to more efficiently clean up the memory used by the tuplestore, as we can reset or delete the context rather than looping over all stored tuples and pfree'ing them one by one.

Also, remove a badly placed USEMEM call in readtup_heap(). The tuple wasn't being allocated in the Tuplestorestate's context, so there was no need to adjust the memory consumed by the tuplestore there.

Author: David Rowley
Reviewed-by: Matthias van de Meent, Dmitry Dolgov
Discussion: https://postgr.es/m/CAApHDvp5Py9g4Rjq7_inL3-MCK1Co2CRt_YWFwTU2zfQix0p4A@mail.gmail.com

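A hedged sketch of creating and using such a context, assuming the GenerationContextCreate() variant that takes the standard three size parameters (so ALLOCSET_DEFAULT_SIZES can be passed); the context name and helper functions are illustrative, not the committed code.

    #include "postgres.h"
    #include "utils/memutils.h"

    static MemoryContext
    make_tuple_context(MemoryContext parent)
    {
        /* dedicated context for tuple data only */
        return GenerationContextCreate(parent,
                                       "tuplestore tuples",
                                       ALLOCSET_DEFAULT_SIZES);
    }

    static void *
    store_tuple(MemoryContext tupcxt, Size len)
    {
        /* FIFO palloc/pfree usage is where generation.c shines */
        return MemoryContextAlloc(tupcxt, len);
    }

Cleanup then becomes a single MemoryContextReset() or MemoryContextDelete() on the tuple context instead of freeing tuples one by one.
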
2024-07-05  Add memory/disk usage for Material nodes in EXPLAIN  (David Rowley)

Up until now, there was no ability to easily determine if a Material node caused the underlying tuplestore to spill to disk, or even see how much memory the tuplestore used if it didn't.

Here we add some new functions to tuplestore.c to query this information and add some additional output in EXPLAIN ANALYZE to display this information for the Material node.

There are a few other executor node types that use tuplestores, so we could also consider adding these details to the EXPLAIN ANALYZE for those nodes too. Let's consider those independently from this. Having the tuplestore.c infrastructure in to allow that is step 1.

Author: David Rowley
Reviewed-by: Matthias van de Meent, Dmitry Dolgov
Discussion: https://postgr.es/m/CAApHDvp5Py9g4Rjq7_inL3-MCK1Co2CRt_YWFwTU2zfQix0p4A@mail.gmail.com

2024-07-02  Add missing includes for some global variables  (Peter Eisentraut)

src/backend/libpq/pqcomm.c: "postmaster/postmaster.h" for Unix_socket_group, Unix_socket_permissions
src/backend/utils/init/globals.c: "postmaster/postmaster.h" for MyClientSocket
src/backend/utils/misc/guc_tables.c: "utils/rls.h" for row_security
src/backend/utils/sort/tuplesort.c: "utils/guc.h" for trace_sort

Nothing currently diagnoses missing includes for global variables, but this is being cleaned up, and these ones had an obvious header file available.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c95ce@eisentraut.org

2024-04-08  Use bump memory context for tuplesorts  (David Rowley)

29f6a959c added a bump allocator type for efficient compact allocations. Here we make use of this for non-bounded tuplesorts to store tuples. This is very space efficient when storing narrow tuples due to bump.c not having chunk headers. This means we can fit more tuples in work_mem before spilling to disk, or perform an in-memory sort touching fewer cachelines.

Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com

2024-03-25  Clarify comment for LogicalTapeSetBlocks().  (Jeff Davis)

Discussion: https://postgr.es/m/1229327.1711160246@sss.pgh.pa.us
Backpatch-through: 13

2024-03-13  Make the order of the header file includes consistent  (Peter Eisentraut)

Similar to commit 7e735035f20.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAMbWs4-WhpCFMbXCjtJ%2BFzmjfPrp7Hw1pk4p%2BZpU95Kh3ofZ1A%40mail.gmail.com

2024-03-04  Remove unused #include's from backend .c files  (Peter Eisentraut)

as determined by include-what-you-use (IWYU)

While IWYU also suggests to *add* a bunch of #include's (which is its main purpose), this patch does not do that. In some cases, a more specific #include replaces another less specific one.

Some manual adjustments of the automatic result:

- IWYU currently doesn't know about includes that provide global variable declarations (like -Wmissing-variable-declarations), so those includes are being kept manually.

- All includes for port(ability) headers are being kept for now, to play it safe.

- No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the patch from exploding in size.

Note that this patch touches just *.c files, so nothing declared in header files changes in hidden ways.

As a small example, in src/backend/access/transam/rmgr.c, some IWYU pragma annotations are added to handle a special case there.

Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org

2024-01-03  Update copyright for 2024  (Bruce Momjian)

Reported-by: Michael Paquier
Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz
Backpatch-through: 12

2024-01-02  Fix typos in comments and in one isolation test.  (Robert Haas)

Dagfinn Ilmari Mannsåker, reviewed by Shubham Khanna. Some subtractions by me.

Discussion: http://postgr.es/m/87le9fmi01.fsf@wibble.ilmari.org

2023-12-30  Minor cleanup of the BRIN parallel build code  (Tomas Vondra)

Commit b437571714 added support for parallel builds for BRIN indexes, using code similar to BTREE parallel builds, and also a new tuplesort variant. This commit simplifies the new code in two ways:

* The "spool" grouping the tuplesort and the heap/index is not necessary. The heap/index are available as separate arguments, causing confusion. So remove the spool, and use the tuplesort directly.

* The new tuplesort variant does not need the heap/index, as it sorts simply by the range block number, without accessing the tuple data. So simplify that too.

Initial report and patch by Ranier Vilela, further cleanup by me.

Author: Ranier Vilela
Discussion: https://postgr.es/m/CAEudQAqD7f2i4iyEaAz-5o-bf6zXVX-AkNUBm-YjUXEemaEh6A%40mail.gmail.com

2023-12-08  Allow parallel CREATE INDEX for BRIN indexes  (Tomas Vondra)

Allow using multiple worker processes to build a BRIN index, which until now was supported only for BTREE indexes. For large tables this often results in significant speedup when the build is CPU-bound.

The work is split in a simple way - each worker builds BRIN summaries on a subset of the table, determined by the regular parallel scan used to read the data, and feeds them into a shared tuplesort which sorts them by blkno (start of the range). The leader then reads this sorted stream of ranges, merges duplicates (which may happen if the parallel scan does not align with BRIN pages_per_range), and adds the resulting ranges into the index.

The number of duplicate results produced by workers (requiring merging in the leader process) should be fairly small, thanks to how parallel scans assign chunks to workers. The likelihood of duplicate results may increase for higher pages_per_range values, but then there are fewer page ranges in total. In any case, we expect the merging to be much cheaper than summarization, so this should be a win.

Most of the parallelism infrastructure is a simplified copy of the code used by BTREE indexes, omitting the parts irrelevant for BRIN indexes (e.g. uniqueness checks).

This also introduces a new index AM flag amcanbuildparallel, determining whether to attempt to start parallel workers for the index build.

Original patch by me, with reviews and substantial reworks by Matthias van de Meent, certainly enough to make him a co-author.

Author: Tomas Vondra, Matthias van de Meent
Reviewed-by: Matthias van de Meent
Discussion: https://postgr.es/m/c2ee7d69-ce17-43f2-d1a0-9811edbda6e6%40enterprisedb.com

2023-11-17  Change logtape/tuplestore code to use int64 for block numbers  (Michael Paquier)

The code previously relied on "long" as the type to track block numbers, which would be 4 bytes in all Windows builds or any 32-bit builds. This limited the code to be able to handle up to 16TB of data with the default block size of 8kB, like during a CLUSTER. This code now relies on a more portable int64, which should be more than enough for at least the next 20 years to come.

This issue was reported back in 2017, but nothing was done about it back then, so here we go now.

Reported-by: Peter Geoghegan
Reviewed-by: Heikki Linnakangas
Discussion: https://postgr.es/m/CAH2-WznCscXnWmnj=STC0aSa7QG+BRedDnZsP=Jo_R9GUZvUrg@mail.gmail.com

2023-10-26  Add trailing commas to enum definitions  (Peter Eisentraut)

Since C99, there can be a trailing comma after the last value in an enum definition. A lot of new code has been introducing this style on the fly. Some new patches are now taking an inconsistent approach to this. Some add the last comma on the fly if they add a new last value, some are trying to preserve the existing style in each place, some are even dropping the last comma if there was one. We could nudge this all in a consistent direction if we just add the trailing commas everywhere once.

I omitted a few places where there was a fixed "last" value that will always stay last. I also skipped the header files of libpq and ecpg, in case people want to use those with older compilers. There were also a small number of cases where the enum type wasn't used anywhere (but the enum values were), which ended up confusing pgindent a bit, so I left those alone.

Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org

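The style in question, shown on a hypothetical enum.

    /* Valid since C99: a trailing comma after the last enumerator. */
    typedef enum MyTapeState    /* hypothetical enum, for illustration */
    {
        MY_TAPE_WRITING,
        MY_TAPE_FROZEN,
        MY_TAPE_EXHAUSTED,      /* adding a new value later is a one-line diff */
    } MyTapeState;
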
2023-10-10  Add const to values and nulls arguments  (Peter Eisentraut)

This excludes any changes that would change the external AM APIs.

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/14c31f4a-0347-0805-dce8-93a9072c05a5%40eisentraut.org

2023-08-30  Avoid possible overflow with ltsGetFreeBlock() in logtape.c  (Michael Paquier)

nFreeBlocks, defined as a long, stores the number of free blocks in a logical tape. ltsGetFreeBlock() has been using an int to store the value of nFreeBlocks, which could lead to overflows on platforms where long and int are not the same size (in short, everything except Windows, where long is 4 bytes). The problematic intermediate variable is switched to be a long instead of an int.

Issue introduced by c02fdc9223015, so backpatch down to 13.

Author: Ranier Vilela
Reviewed-by: Peter Geoghegan, David Rowley
Discussion: https://postgr.es/m/CAEudQApLDWCBR_xmwNjGBrDo+f+S4E87x3s7-+hoaKqYdtC4JQ@mail.gmail.com
Backpatch-through: 13

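The shape of the fix, sketched with illustrative names; the real variable lives inside ltsGetFreeBlock() in logtape.c.

    static long
    pick_free_block(long nFreeBlocks)
    {
        /*
         * Before: "int heapsize = nFreeBlocks;" could truncate wherever int
         * is narrower than long (i.e. most 64-bit platforms except Windows).
         * Keeping the intermediate a long preserves the full value.
         */
        long    heapsize = nFreeBlocks;

        return heapsize - 1;
    }
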
2023-08-16  Split out tiebreaker comparisons from comparetup_* functions  (John Naylor)

Previously, if a specialized comparator found equal datum1 keys, the "comparetup" function would repeat the comparison on the datum before proceeding with the unabbreviated first key and/or additional sort keys. Move comparing additional sort keys into "tiebreak" functions so that specialized comparators can call these directly if needed, avoiding duplicate work.

Reviewed by David Rowley
Discussion: https://postgr.es/m/CAFBsxsGaVfUrjTghpf%3DkDBYY%3DjWx1PN-fuusVe7Vw5s0XqGdGw%40mail.gmail.com

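A hedged sketch of the resulting comparator shape; the function name and the tiebreak callback are illustrative, not the exact tuplesort.c symbols, and only ApplySortComparator() is a real sortsupport.h helper.

    #include "postgres.h"
    #include "utils/sortsupport.h"

    typedef int (*my_tiebreak_fn) (void *arg);

    static int
    my_specialized_compare(Datum adatum1, bool aisnull1,
                           Datum bdatum1, bool bisnull1,
                           SortSupport ssup,
                           my_tiebreak_fn tiebreak, void *arg)
    {
        int     compare = ApplySortComparator(adatum1, aisnull1,
                                              bdatum1, bisnull1, ssup);

        if (compare != 0)
            return compare;

        /*
         * Equal datum1 keys: go straight to the remaining keys without
         * repeating the datum1 comparison, which is the duplicate work the
         * commit removes.
         */
        return tiebreak(arg);
    }
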
2023-06-10  nbtree: Allocate new pages in separate function.  (Peter Geoghegan)

Split nbtree's _bt_getbuf function in two: code that read locks or write locks existing pages remains in _bt_getbuf, while code that deals with allocating new pages is moved to a new, dedicated function called _bt_allocbuf. This simplifies most _bt_getbuf callers, since it is no longer necessary for them to pass a heaprel argument. Many of the changes to nbtree from commit 61b313e4 can be reverted. This minimizes the divergence between HEAD/PostgreSQL 16 and earlier release branches.

_bt_allocbuf replaces the previous nbtree idiom of passing P_NEW to _bt_getbuf. There are only 3 affected call sites, all of which continue to pass a heaprel for recovery conflict purposes. Note that nbtree's use of P_NEW was superficial; nbtree never actually relied on the P_NEW code paths in bufmgr.c, so this change is strictly mechanical.

GiST already took the same approach; it has a dedicated function for allocating new pages called gistNewBuffer(). That factor allowed commit 61b313e4 to make much more targeted changes to GiST.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com

2023-05-19  Pre-beta mechanical code beautification.  (Tom Lane)

Run pgindent, pgperltidy, and reformat-dat-files.

This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code.

Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql

2023-05-19  Allocate hash join files in a separate memory context  (Tomas Vondra)

Should a hash join exceed memory limit, the hashtable is split up into multiple batches. The number of batches is doubled each time a given batch is determined not to fit in memory. Each batch file is allocated with a block-sized buffer for buffering tuples, and parallel hash join has additional sharedtuplestore accessor buffers.

In some pathological cases requiring a lot of batches, often with skewed data, bad stats, or very large datasets, users can run out-of-memory solely from the memory overhead of all the batch files' buffers.

Batch files were allocated in the ExecutorState memory context, making it very hard to identify when this batch explosion was the source of an OOM. This commit allocates the batch files in a dedicated memory context, making it easier to identify the cause of an OOM and work to avoid it.

Based on initial draft by Tomas Vondra, with significant reworks and improvements by Jehan-Guillaume de Rorthais.

Author: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Author: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20190421114618.z3mpgmimc3rmubi4@development
Discussion: https://postgr.es/m/20230504193006.1b5b9622%40karst#273020ff4061fc7a2fbb1ba96b281f17

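A hedged sketch of the pattern: create a dedicated child context and switch into it around the batch BufFile allocations. The context name and variable names are assumptions based on the description above, not necessarily the committed ones.

    #include "postgres.h"
    #include "utils/memutils.h"

    static MemoryContext
    make_batch_file_context(MemoryContext hashCxt)
    {
        /* child of the hash table's own context; name is assumed */
        return AllocSetContextCreate(hashCxt,
                                     "hash join batch files",
                                     ALLOCSET_DEFAULT_SIZES);
    }

    /*
     * Usage sketch:
     *     oldcxt = MemoryContextSwitchTo(fileCxt);
     *     ... create or extend batch BufFiles ...
     *     MemoryContextSwitchTo(oldcxt);
     *
     * An OOM from a batch explosion then shows up attributed to this
     * context in the memory-context dump, instead of blending into
     * ExecutorState.
     */
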
2023-04-19  Fix various typos and incorrect/outdated name references  (David Rowley)

Author: Alexander Lakhin
Discussion: https://postgr.es/m/699beab4-a6ca-92c9-f152-f559caf6dc25@gmail.com

2023-04-08  Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.  (Thomas Munro)

In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a later commit, we need the addresses of user space buffers to be well aligned. The exact requirements vary by OS and file system (typically sectors and/or memory pages). The address alignment size is set to 4096, which is enough for currently known systems: it matches modern sectors and common memory page size. There is no standard governing O_DIRECT's requirements, so we might eventually have to reconsider this with more information from the field or future systems.

Aligning I/O buffers on memory pages is also known to improve regular buffered I/O performance.

Three classes of I/O buffers for regular data pages are adjusted:

(1) Heap buffers are now allocated with the new palloc_aligned() or MemoryContextAllocAligned() functions introduced by commit 439f6175.

(2) Stack buffers now use a new struct PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler.

(3) The buffer pool is also aligned in shared memory.

WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGN_SIZE and thus for O_DIRECT WAL writes to fail to be well aligned, but that's a pre-existing condition and will be addressed by a later commit.

BufFiles are not yet addressed (there's no current plan to use O_DIRECT for those, but they could potentially get some incidental speedup even in plain buffered I/O operations through better alignment).

If we can't align stack objects suitably using the compiler extensions we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to 0. This avoids the need to consider systems that have O_DIRECT but can't align stack objects the way we want; such systems could in theory be supported with more work, but we don't currently know of any such machines, so it's easier to pretend there is no O_DIRECT support instead. That's an existing and tested class of system.

Add assertions that all buffers passed into smgrread(), smgrwrite() and smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack alignment tricks may be unavailable) or the block size has been set too small to allow arrays of buffers to be all aligned.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com

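A hedged sketch of the two allocation styles mentioned above, assuming palloc_aligned(size, alignto, flags) and the PGIOAlignedBlock union as described; the helper functions are illustrative.

    #include "postgres.h"
    #include "utils/memutils.h"

    static char *
    make_aligned_heap_buffer(void)
    {
        /* heap allocation aligned to PG_IO_ALIGN_SIZE (4096) */
        return (char *) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0);
    }

    static void
    zero_stack_buffer(void)
    {
        /* stack buffer whose type requests the same alignment */
        PGIOAlignedBlock block;

        memset(block.data, 0, BLCKSZ);
    }
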
2023-04-01  Pass down table relation into more index relation functions  (Andres Freund)

This is done in preparation for logical decoding on standby, which needs to include whether visibility affecting WAL records are about a (user) catalog table. Which is only known for the table, not the indexes.

It's also nice to be able to pass the heap relation to GlobalVisTestFor() in vacuumRedirectAndPlaceholder().

Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/21b700c3-eecf-2e05-a699-f8c78dd31ec7@gmail.com

2023-03-29  Simplify useless 0L constants  (Peter Eisentraut)

In ancient times, these belonged to arguments or fields that were actually of type long, but now they are not anymore, so this "L" decoration is just confusing. (Some other 0L and other "L" constants remain, where they are actually associated with a long type.)

2023-01-26  Fix rare sharedtuplestore.c corruption.  (Thomas Munro)

If the final chunk of an oversized tuple being written out to disk was exactly 32760 bytes, it would be corrupted due to a fencepost bug. Bug #17619. Back-patch to 11 where the code arrived.

While testing that (see test module in archives), I (tmunro) noticed that the per-participant page counter was not initialized to zero as it should have been. That wasn't a live bug when it was written, since DSM memory was originally always zeroed, but since 14, min_dynamic_shared_memory might be configured and it supplies non-zeroed memory, so that is also fixed here.

Author: Dmitry Astapov <dastapov@gmail.com>
Discussion: https://postgr.es/m/17619-0de62ceda812b8b5%40postgresql.org

2023-01-23  Fix and clarify function comment on LogicalTapeSetCreate.  (Heikki Linnakangas)

Commit c4649cce39 removed the "shared" and "ntapes" arguments, but the comment still talked about "shared". It also talked about "a shared file handle", which was technically correct because even before commit c4649cce39, the "shared file handle" referred to the "fileset" argument, not "shared". But it was very confusing. Improve the comment.

Also add a comment on what the "preallocate" argument does.

Backpatch to v15, just to make backpatching other patches easier in the future.

Reviewed-by: Peter Eisentraut
Discussion: https://www.postgresql.org/message-id/af989685-91d5-aad4-8f60-1d066b5ec309@enterprisedb.com

2023-01-16  Add BufFileRead variants with short read and EOF detection  (Peter Eisentraut)

Most callers of BufFileRead() want to check whether they read the full specified length. Checking this at every call site is very tedious. This patch provides additional variants BufFileReadExact() and BufFileReadMaybeEOF() that include the length checks.

I considered changing BufFileRead() itself, but this function is also used in extensions, and so changing the behavior like this would create a lot of problems there. The new names are analogous to the existing LogicalTapeReadExact().

Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/f3501945-c591-8cc3-5ef0-b72a2e0eaa9c@enterprisedb.com

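A hedged usage sketch, assuming the signatures follow the description above: BufFileReadExact(file, ptr, size) erroring out on a short read, and BufFileReadMaybeEOF(file, ptr, size, eofOK) returning 0 on a clean EOF. The record-header struct is hypothetical.

    #include "postgres.h"
    #include "storage/buffile.h"

    typedef struct MyRecordHeader   /* hypothetical on-disk record header */
    {
        uint32  len;
    } MyRecordHeader;

    /* Returns false on a clean EOF at a record boundary, true otherwise. */
    static bool
    read_next_header(BufFile *file, MyRecordHeader *hdr)
    {
        size_t  nread;

        /* EOF is allowed here, but a partial header still errors out */
        nread = BufFileReadMaybeEOF(file, hdr, sizeof(*hdr), true);
        if (nread == 0)
            return false;

        /* For payload bytes that must be present, a short read is an error:
         *     BufFileReadExact(file, payload_buf, hdr->len);
         */
        return true;
    }
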