author: Tom Lane <tgl@sss.pgh.pa.us> 2025-09-20 14:48:16 -0400
committer: Tom Lane <tgl@sss.pgh.pa.us> 2025-09-20 14:48:16 -0400
commit: 261f89a976bf3dbf25e43bab9983fdd28f20b49b
tree: 6a5b68ebca64ceb450422e2c84109bfff40a3f13 /contrib
parent: 1eccb93150707acfcc8f24556a15742a6313c8ac
Track the maximum possible frequency of non-MCE array elements.
The lossy-counting algorithm that ANALYZE uses to identify most-common array elements has a notion of cutoff frequency: elements with frequency greater than that are guaranteed to be collected, elements with smaller frequencies are not. In cases where we find fewer MCEs than the stats target would permit us to store, the cutoff frequency provides valuable additional information, to wit that there are no non-MCEs with frequency greater than that. What the selectivity estimation functions actually use the "minfreq" entry for is as a ceiling on the possible frequency of non-MCEs, so using the cutoff rather than the lowest stored MCE frequency provides a tighter bound and more accurate estimates.

Therefore, instead of redundantly storing the minimum observed MCE frequency, store the cutoff frequency when there are fewer tracked values than we want. (When there are more, then of course we cannot assert that no non-stored elements are above the cutoff frequency, since we're throwing away some that are; so we still use the minimum stored frequency in that case.)

Notably, this works even when none of the values are common enough to be called MCEs. In such cases we previously stored nothing in the STATISTIC_KIND_MCELEM pg_statistic slot, which resulted in the selectivity functions falling back to default estimates. So in that case we want to construct a STATISTIC_KIND_MCELEM entry that contains no "values" but does have "numbers", to wit the three extra numbers that the MCELEM entry type defines.

A small obstacle is that update_attstats() has traditionally stored a null, not an empty array, when passed zero "values" for a slot. That gives rise to an MCELEM entry that get_attstatsslot() will spit up on. The least risky solution seems to be to adjust update_attstats() so that it will emit a non-null (but possibly empty) array when the passed stavalues array pointer isn't NULL, rather than conditioning that on numvalues > 0. In other existing cases I don't believe that that changes anything. For consistency, handle the stanumbers array the same way.

In passing, improve the comments in routines that use STATISTIC_KIND_MCELEM data. Particularly, explain why we use minfreq / 2 not minfreq as the estimate for non-MCE values.

Thanks to Matt Long for the suggestion that we could apply this idea even when there are more than zero MCEs.

Reported-by: Mark Frost <FROSTMAR@uk.ibm.com>
Reported-by: Matt Long <matt@mattlong.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/PH3PPF1C905D6E6F24A5C1A1A1D8345B593E16FA@PH3PPF1C905D6E6.namprd15.prod.outlook.com
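The rule described above for choosing the stored "minfreq" value can be sketched roughly as follows. This is an illustration only, not the code from this commit: the function choose_minfreq and all of its parameters are hypothetical names, and treating "exactly as many tracked values as the target" as a case where nothing was discarded is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not PostgreSQL code): pick the value to store in
 * the "minfreq" position of an MCELEM entry.
 *
 * freqs       frequencies of the elements that will be stored
 * nstored     number of stored elements (may be zero)
 * ntracked    number of elements that survived the lossy-counting cutoff
 * target      number of elements the stats target permits us to store
 * cutoff_freq the algorithm's cutoff, expressed as a frequency
 */
double
choose_minfreq(const double *freqs, size_t nstored,
               size_t ntracked, size_t target, double cutoff_freq)
{
    size_t i;
    double lowest;

    assert(nstored <= ntracked);

    /*
     * Nothing was thrown away, so no unstored element can exceed the
     * cutoff.  This also covers the case of zero stored elements.
     * (Assumption: ntracked == target counts as "nothing thrown away".)
     */
    if (ntracked <= target)
        return cutoff_freq;

    /*
     * Some tracked elements were discarded, so the cutoff is not a valid
     * ceiling; fall back to the lowest stored frequency.
     */
    assert(nstored > 0);
    lowest = freqs[0];
    for (i = 1; i < nstored; i++)
    {
        if (freqs[i] < lowest)
            lowest = freqs[i];
    }
    return lowest;
}
```

With two stored elements at frequencies 0.30 and 0.12, a target of 10, and a cutoff of 0.02, this sketch returns 0.02, a tighter ceiling than 0.12. If five elements had been tracked against a target of 2, it would return 0.12 instead.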
Diffstat (limited to 'contrib')
-rw-r--r-- contrib/intarray/_int_selfuncs.c 11
1 files changed, 7 insertions, 4 deletions
diff --git a/contrib/intarray/_int_selfuncs.c b/contrib/intarray/_int_selfuncs.c
index 9bf64486242..ddffd69cb6e 100644
--- a/contrib/intarray/_int_selfuncs.c
+++ b/contrib/intarray/_int_selfuncs.c
@@ -210,8 +210,8 @@ _int_matchsel(PG_FUNCTION_ARGS)
*/
if (sslot.nnumbers == sslot.nvalues + 3)
{
- /* Grab the lowest frequency. */
- minfreq = sslot.numbers[sslot.nnumbers - (sslot.nnumbers - sslot.nvalues)];
+ /* Grab the minimal MCE frequency. */
+ minfreq = sslot.numbers[sslot.nvalues];
mcelems = sslot.values;
mcefreqs = sslot.numbers;
@@ -269,8 +269,11 @@ int_query_opr_selec(ITEM *item, Datum *mcelems, float4 *mcefreqs,
else
{
/*
- * The element is not in MCELEM. Punt, but assume that the
- * selectivity cannot be more than minfreq / 2.
+ * The element is not in MCELEM. Estimate its frequency as half
+ * that of the least-frequent MCE. (We know it cannot be more
+ * than minfreq, and it could be a great deal less. Half seems
+ * like a good compromise.) For probably-historical reasons,
+ * clamp to not more than DEFAULT_EQ_SEL.
*/
selec = Min(DEFAULT_EQ_SEL, minfreq / 2);
}
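The two hunks above rely on the MCELEM layout, in which the "numbers" array holds one frequency per stored value followed by three extra entries, with the first extra entry serving as minfreq. A minimal standalone sketch of that lookup and of the non-MCE estimate follows. The function names and SKETCH_DEFAULT_EQ_SEL are hypothetical stand-ins introduced here; the constant is set to 0.005 on the assumption that it matches PostgreSQL's DEFAULT_EQ_SEL, and the sample arrays are made-up data.

```c
#include <assert.h>

/* Stand-in for PostgreSQL's DEFAULT_EQ_SEL (assumed to be 0.005). */
#define SKETCH_DEFAULT_EQ_SEL 0.005f

/*
 * Hypothetical helper: fetch minfreq from an MCELEM-style numbers array.
 * The array holds nvalues per-element frequencies followed by three extra
 * entries, so minfreq sits at index nvalues.  This works unchanged when
 * nvalues is zero, i.e. for an entry with "numbers" but no "values".
 */
float
mcelem_minfreq(const float *numbers, int nnumbers, int nvalues)
{
    assert(nnumbers == nvalues + 3);
    return numbers[nvalues];
}

/*
 * Hypothetical helper: estimate the selectivity of an element that is not
 * among the stored MCEs as half of minfreq, clamped to the default
 * equality selectivity.
 */
float
non_mce_selec(float minfreq)
{
    float half = minfreq / 2;

    return (half < SKETCH_DEFAULT_EQ_SEL) ? half : SKETCH_DEFAULT_EQ_SEL;
}
```

For an array such as {0.5, 0.25, 0.25, 0.5, 0.0} with two values, mcelem_minfreq returns 0.25 and the estimate is clamped to 0.005. For an entry with no values and a minfreq of 1/128, the estimate is 1/256, which is below the clamp.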