summaryrefslogtreecommitdiff
path: root/src/backend/commands/copyfromparse.c
AgeCommit message (Collapse)Author
2021-04-28Add heuristic incoming-message-size limits in the server.Tom Lane
We had a report of confusing server behavior caused by a client bug that sent junk to the server: the server thought the junk was a very long message length and waited patiently for data that would never come. We can reduce the risk of that by being less trusting about message lengths. For a long time, libpq has had a heuristic rule that it wouldn't believe large message size words, except for a small number of message types that are expected to be (potentially) long. This provides some defense against loss of message-boundary sync and other corrupted-data cases. The server does something similar, except that up to now it only limited the lengths of messages received during the connection authentication phase. Let's do the same as in libpq and put restrictions on the allowed length of all messages, while distinguishing between message types that are expected to be long and those that aren't. I used a limit of 10000 bytes for non-long messages. (libpq's corresponding limit is 30000 bytes, but given the asymmetry of the FE/BE protocol, there's no good reason why the numbers should be the same.) Experimentation suggests that this is at least a factor of 10, maybe a factor of 100, more than we really need; but plenty of daylight seems desirable to avoid false positives. In any case we can adjust the limit based on beta-test results. For long messages, set a limit of MaxAllocSize - 1, which is the most that we can absorb into the StringInfo buffer that the message is collected in. This just serves to make sure that a bogus message size is reported as such, rather than as a confusing gripe about not being able to enlarge a string buffer. While at it, make sure that non-mainline code paths (such as COPY FROM STDIN) are as paranoid as SocketBackend is, and validate the message type code before believing the message length. This provides an additional guard against getting stuck on corrupted input. Discussion: https://postgr.es/m/2003757.1619373089@sss.pgh.pa.us
2021-04-01Do COPY FROM encoding conversion/verification in larger chunks.Heikki Linnakangas
This gives a small performance gain, by reducing the number of calls to the conversion/verification function, and letting it work with larger inputs. Also, reorganizing the input pipeline makes it easier to parallelize the input parsing: after the input has been converted to the database encoding, the next stage of finding the newlines can be done in parallel, because there cannot be any newline chars "embedded" in multi-byte characters in the encodings that we support as server encodings. This changes behavior in one corner case: if client and server encodings are the same single-byte encoding (e.g. latin1), previously the input would not be checked for zero bytes ('\0'). Any fields containing zero bytes would be truncated at the zero. But if encoding conversion was needed, the conversion routine would throw an error on the zero. After this commit, the input is always checked for zeros. Reviewed-by: John Naylor Discussion: https://www.postgresql.org/message-id/e7861509-3960-538a-9025-b75a61188e01%40iki.fi
2021-03-04Remove server and libpq support for old FE/BE protocol version 2.Heikki Linnakangas
Protocol version 3 was introduced in PostgreSQL 7.4. There shouldn't be many clients or servers left out there without version 3 support. But as a courtesy, I kept just enough of the old protocol support that we can still send the "unsupported protocol version" error in v2 format, so that old clients can display the message properly. Likewise, libpq still understands v2 ErrorResponse messages when establishing a connection. The impetus to do this now is that I'm working on a patch to COPY FROM, to always prefetch some data. We cannot do that safely with the old protocol, because it requires parsing the input one byte at a time to detect the end-of-copy marker. Reviewed-by: Tom Lane, Alvaro Herrera, John Naylor Discussion: https://www.postgresql.org/message-id/9ec25819-0a8a-d51a-17dc-4150bb3cca3b%40iki.fi
2021-02-05Fix backslash-escaping multibyte chars in COPY FROM.Heikki Linnakangas
If a multi-byte character is escaped with a backslash in TEXT mode input, and the encoding is one of the client-only encodings where the bytes after the first one can have an ASCII byte "embedded" in the char, we didn't skip the character correctly. After a backslash, we only skipped the first byte of the next character, so if it was a multi-byte character, we would try to process its second byte as if it was a separate character. If it was one of the characters with special meaning, like '\n', '\r', or another '\\', that would cause trouble. One such exmple is the byte sequence '\x5ca45c2e666f6f' in Big5 encoding. That's supposed to be [backslash][two-byte character][.][f][o][o], but because the second byte of the two-byte character is 0x5c, we incorrectly treat it as another backslash. And because the next character is a dot, we parse it as end-of-copy marker, and throw an "end-of-copy marker corrupt" error. Backpatch to all supported versions. Reviewed-by: John Naylor, Kyotaro Horiguchi Discussion: https://www.postgresql.org/message-id/a897f84f-8dca-8798-3139-07da5bb38728%40iki.fi
2021-02-04Fix small error in COPY FROM progress reporting.Heikki Linnakangas
The # of bytes processed was accumulated slightly incorrectly. After loading more data to the input buffer, we added the number of bytes in the buffer to the sum. But in case of multi-byte characters or escapes, there can be a few unprocessed bytes left over from previous load in the buffer. Those bytes got counted twice.
2021-01-06Report progress of COPY commandsTomas Vondra
This commit introduces a view pg_stat_progress_copy, reporting progress of COPY commands. This allows rough estimates how far a running COPY progressed, with the caveat that the total number of bytes may not be available in some cases (e.g. when the input comes from the client). Author: Josef Šimánek Reviewed-by: Fujii Masao, Bharath Rupireddy, Vignesh C, Matthias van de Meent Discussion: https://postgr.es/m/CAFp7QwqMGEi4OyyaLEK9DR0+E+oK3UtA4bEjDVCa4bNkwUY2PQ@mail.gmail.com Discussion: https://postgr.es/m/CAFp7Qwr6_FmRM6pCO0x_a0mymOfX_Gg+FEKet4XaTGSW=LitKQ@mail.gmail.com
2021-01-02Update copyright for 2021Bruce Momjian
Backpatch-through: 9.5
2020-11-30Remove leftover comments, left behind by removal of WITH OIDS.Heikki Linnakangas
Author: Amit Langote Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGaRoF3XrhPW-Y7P%2BG7bKo84Z_h%3DkQHvMh-80%3Dav3wmOw%40mail.gmail.com
2020-11-24Put "inline" marker on declarations of inline functions.Tom Lane
I'm having a hard time telling whether the letter of the C standard requires this, but we do have a couple of buildfarm members that throw warnings when this is not done. Oversight in c532d15dd.
2020-11-23Split copy.c into four files.Heikki Linnakangas
Copy.c has grown really large. Split it into more manageable parts: - copy.c now contains only a few functions that are common to COPY FROM and COPY TO. - copyto.c contains code for COPY TO. - copyfrom.c contains code for initializing COPY FROM, and inserting the tuples to the correct table. - copyfromparse.c contains code for reading from the client/file/program, and parsing the input text/CSV/binary format into tuples. All of these parts are fairly complicated, and fairly independent of each other. There is a patch being discussed to implement parallel COPY FROM, which will add a lot of new code to the COPY FROM path, and another patch which would allow INSERTs to use the same multi-insert machinery as COPY FROM, both of which will require refactoring that code. With those two patches, there's going to be a lot of code churn in copy.c anyway, so now seems like a good time to do this refactoring. The CopyStateData struct is also split. All the formatting options, like FORMAT, QUOTE, ESCAPE, are put in a new CopyFormatOption struct, which is used by both COPY FROM and TO. Other state data are kept in separate CopyFromStateData and CopyToStateData structs. Reviewed-by: Soumyadeep Chakraborty, Erik Rijkers, Vignesh C, Andres Freund Discussion: https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi