Add a "row processor" API to libpq for better handling of large results.

Traditionally libpq has collected an entire query result before passing it back to the application. That provides a simple and transactional API, but it's pretty inefficient for large result sets. This patch allows the application to process each row on-the-fly instead of accumulating the rows into the PGresult. Error recovery becomes a bit more complex, but often that tradeoff is well worth making. Kyotaro Horiguchi, reviewed by Marko Kreen and Tom Lane
author: Tom Lane <tgl@sss.pgh.pa.us> 2012-04-04 18:27:56 -0400
committer: Tom Lane <tgl@sss.pgh.pa.us> 2012-04-04 18:27:56 -0400
commit: 92785dac2ee7026948962cd61c4cd84a2d052772 (patch)
tree: deb7a2c120978b9f3b85410317271a91b76ad66d /doc/src
parent: cb917e1544612c187c74fed1a990e26820514c8a (diff)
1 files changed, 268 insertions, 0 deletions
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 96064bbb0de..0ec501e5bda 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -5581,6 +5581,274 @@ defaultNoticeProcessor(void *arg, const char *message)
 
  </sect1>
 
+ <sect1 id="libpq-row-processor">
+  <title>Custom Row Processing</title>
+
+  <indexterm zone="libpq-row-processor">
+   <primary>PQrowProcessor</primary>
+  </indexterm>
+
+  <indexterm zone="libpq-row-processor">
+   <primary>row processor</primary>
+   <secondary>in libpq</secondary>
+  </indexterm>
+
+  <para>
+   Ordinarily, when receiving a query result from the server,
+   <application>libpq</> adds each row value to the current
+   <type>PGresult</type> until the entire result set is received; then
+   the <type>PGresult</type> is returned to the application as a unit.
+   This approach is simple to work with, but becomes inefficient for large
+   result sets.  To improve performance, an application can register a
+   custom <firstterm>row processor</> function that processes each row
+   as the data is received from the network.  The custom row processor could
+   process the data fully, or store it into some application-specific data
+   structure for later processing.
+  </para>
+
+  <caution>
+   <para>
+    The row processor function sees the rows before it is known whether the
+    query will succeed overall, since the server might return some rows before
+    encountering an error.  For proper transactional behavior, it must be
+    possible to discard or undo whatever the row processor has done, if the
+    query ultimately fails.
+   </para>
+  </caution>
+
+  <para>
+   When using a custom row processor, row data is not accumulated into the
+   <type>PGresult</type>, so the <type>PGresult</type> ultimately delivered to
+   the application will contain no rows (<function>PQntuples</> =
+   <literal>0</>).  However, it still has <function>PQresultStatus</> =
+   <literal>PGRES_TUPLES_OK</>, and it contains correct information about the
+   set of columns in the query result.  On the other hand, if the query fails
+   partway through, the returned <type>PGresult</type> has
+   <function>PQresultStatus</> = <literal>PGRES_FATAL_ERROR</>.  The
+   application must be prepared to undo any actions of the row processor
+   whenever it gets a <literal>PGRES_FATAL_ERROR</> result.
+  </para>
+
+  <para>
+   A custom row processor is registered for a particular connection by
+   calling <function>PQsetRowProcessor</function>, described below.
+   This row processor will be used for all subsequent query results on that
+   connection until changed again.  A row processor function must have a
+   signature matching
+
+<synopsis>
+typedef int (*PQrowProcessor) (PGresult *res, const PGdataValue *columns,
+                               const char **errmsgp, void *param);
+</synopsis>
+   where <type>PGdataValue</> is described by
+<synopsis>
+typedef struct pgDataValue
+{
+    int         len;            /* data length in bytes, or <0 if NULL */
+    const char *value;          /* data value, without zero-termination */
+} PGdataValue;
+</synopsis>
+  </para>
+
+  <para>
+   The <parameter>res</> parameter is the <literal>PGRES_TUPLES_OK</>
+   <type>PGresult</type> that will eventually be delivered to the calling
+   application (if no error intervenes).  It contains information about
+   the set of columns in the query result, but no row data.  In particular the
+   row processor must fetch <literal>PQnfields(res)</> to know the number of
+   data columns.
+  </para>
+
+  <para>
+   Immediately after <application>libpq</> has determined the result set's
+   column information, it will make a call to the row processor with
+   <parameter>columns</parameter> set to NULL, but the other parameters as
+   usual.  The row processor can use this call to initialize for a new result
+   set; if it has nothing to do, it can just return <literal>1</>.  In
+   subsequent calls, one per received row, <parameter>columns</parameter>
+   is non-NULL and points to an array of <type>PGdataValue</> structs, one per
+   data column.
+  </para>
+
+  <para>
+   <parameter>errmsgp</parameter> is an output parameter used only for error
+   reporting.  If the row processor needs to report an error, it can set
+   <literal>*</><parameter>errmsgp</parameter> to point to a suitable message
+   string (and then return <literal>-1</>).  As a special case, returning
+   <literal>-1</> without changing <literal>*</><parameter>errmsgp</parameter>
+   from its initial value of NULL is taken to mean <quote>out of memory</>.
+  </para>
+
+  <para>
+   The last parameter, <parameter>param</parameter>, is just a void pointer
+   passed through from <function>PQsetRowProcessor</function>.  This can be
+   used for communication between the row processor function and the
+   surrounding application.
+  </para>
+
+  <para>
+   In the <type>PGdataValue</> array passed to a row processor, data values
+   cannot be assumed to be zero-terminated, whether the data format is text
+   or binary.  A SQL NULL value is indicated by a negative length field.
+  </para>
+
+  <para>
+   The row processor <emphasis>must</> process the row data values
+   immediately, or else copy them into application-controlled storage.
+   The value pointers passed to the row processor point into
+   <application>libpq</>'s internal data input buffer, which will be
+   overwritten by the next packet fetch.
+  </para>
+
+  <para>
+   The row processor function must return either <literal>1</> or
+   <literal>-1</>.
+   <literal>1</> is the normal, successful result value; <application>libpq</>
+   will continue with receiving row values from the server and passing them to
+   the row processor.  <literal>-1</> indicates that the row processor has
+   encountered an error.  In that case,
+   <application>libpq</> will discard all remaining rows in the result set
+   and then return a <literal>PGRES_FATAL_ERROR</> <type>PGresult</type> to
+   the application (containing the specified error message, or <quote>out of
+   memory for query result</> if <literal>*</><parameter>errmsgp</parameter>
+   was left as NULL).
+  </para>
+
+  <para>
+   Another option for exiting a row processor is to throw an exception using
+   C's <function>longjmp()</> or C++'s <literal>throw</>.  If this is done,
+   processing of the incoming data can be resumed later by calling
+   <function>PQgetResult</>; the row processor will be invoked as normal for
+   any remaining rows in the current result.
+   As with any usage of <function>PQgetResult</>, the application
+   should continue calling <function>PQgetResult</> until it gets a NULL
+   result before issuing any new query.
+  </para>
+
+  <para>
+   In some cases, an exception may mean that the remainder of the
+   query result is not interesting.  In such cases the application can discard
+   the remaining rows with <function>PQskipResult</>, described below.
+   Another possible recovery option is to close the connection altogether with
+   <function>PQfinish</>.
+  </para>
+
+  <para>
+   <variablelist>
+    <varlistentry id="libpq-pqsetrowprocessor">
+     <term>
+      <function>PQsetRowProcessor</function>
+      <indexterm>
+       <primary>PQsetRowProcessor</primary>
+      </indexterm>
+     </term>
+
+     <listitem>
+      <para>
+       Sets a callback function to process each row.
+
+<synopsis>
+void PQsetRowProcessor(PGconn *conn, PQrowProcessor func, void *param);
+</synopsis>
+      </para>
+
+      <para>
+       The specified row processor function <parameter>func</> is installed as
+       the active row processor for the given connection <parameter>conn</>.
+       Also, <parameter>param</> is installed as the passthrough pointer to
+       pass to it.  Alternatively, if <parameter>func</> is NULL, the standard
+       row processor is reinstalled on the given connection (and
+       <parameter>param</> is ignored).
+      </para>
+
+      <para>
+       Although the row processor can be changed at any time in the life of a
+       connection, it's generally unwise to do so while a query is active.
+       In particular, when using asynchronous mode, be aware that both
+       <function>PQisBusy</> and <function>PQgetResult</> can call the current
+       row processor.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry id="libpq-pqgetrowprocessor">
+     <term>
+      <function>PQgetRowProcessor</function>
+      <indexterm>
+       <primary>PQgetRowProcessor</primary>
+      </indexterm>
+     </term>
+
+     <listitem>
+      <para>
+       Fetches the current row processor for the specified connection.
+
+<synopsis>
+PQrowProcessor PQgetRowProcessor(const PGconn *conn, void **param);
+</synopsis>
+      </para>
+
+      <para>
+       In addition to returning the row processor function pointer, the
+       current passthrough pointer will be returned at
+       <literal>*</><parameter>param</>, if <parameter>param</> is not NULL.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry id="libpq-pqskipresult">
+     <term>
+      <function>PQskipResult</function>
+      <indexterm>
+       <primary>PQskipResult</primary>
+      </indexterm>
+     </term>
+
+     <listitem>
+      <para>
+       Discard all the remaining rows in the incoming result set.
+
+<synopsis>
+PGresult *PQskipResult(PGconn *conn);
+</synopsis>
+      </para>
+
+      <para>
+       This is a simple convenience function to discard incoming data after a
+       row processor has failed or it's determined that the rest of the result
+       set is not interesting.  <function>PQskipResult</> is exactly
+       equivalent to <function>PQgetResult</> except that it transiently
+       installs a dummy row processor function that just discards data.
+       The returned <type>PGresult</> can be discarded without further ado
+       if it has status <literal>PGRES_TUPLES_OK</>; but other status values
+       should be handled normally.  (In particular,
+       <literal>PGRES_FATAL_ERROR</> indicates a server-reported error that
+       will still need to be dealt with.)
+       As when using <function>PQgetResult</>, one should usually repeat the
+       call until NULL is returned to ensure the connection has reached an
+       idle state.  Another possible usage is to call
+       <function>PQskipResult</> just once, and then resume using
+       <function>PQgetResult</> to process subsequent result sets normally.
+      </para>
+
+      <para>
+       Because <function>PQskipResult</> will wait for server input, it is not
+       very useful in asynchronous applications.  In particular you should not
+       code a loop of <function>PQisBusy</> and <function>PQskipResult</>,
+       because that will result in the installed row processor being called
+       within <function>PQisBusy</>.  To get the proper behavior in an
+       asynchronous application, you'll need to install a dummy row processor
+       (or set a flag to make your normal row processor do nothing) and leave
+       it that way until you have discarded all incoming data via your normal
+       <function>PQisBusy</> and <function>PQgetResult</> loop.
+      </para>
+     </listitem>
+    </varlistentry>
+   </variablelist>
+  </para>
+
+ </sect1>
+
  <sect1 id="libpq-events">
   <title>Event System</title>
author	Tom Lane <tgl@sss.pgh.pa.us>	2012-04-04 18:27:56 -0400
committer	Tom Lane <tgl@sss.pgh.pa.us>	2012-04-04 18:27:56 -0400
commit	92785dac2ee7026948962cd61c4cd84a2d052772 (patch)
tree	deb7a2c120978b9f3b85410317271a91b76ad66d /doc/src
parent	cb917e1544612c187c74fed1a990e26820514c8a (diff)