At Affectv, we process a lot of data every day. We clean it up thoroughly, and once clean, we assume it stays clean. But we recently faced a puzzling situation where clean data wasn’t as clean as we thought. The participants:
- Apache Spark
- Amazon Web Services Redshift
- Apache Parquet as data interchange format (via AWS’s S3)
- myself, as a puzzled observer
We are processing some data in Spark, filtering for a column to have length less than or equal to 24, and dumping the output to S3. Something like:

```scala
val df = List("abcdef1234567890abcdef12", "ABCDEF1234567890ABCDEF12Z").toDF("column")
df.where(length($"column") <= 24).write.parquet("s3://somewhere")
```
In some follow-up process, we pick the data up from S3 into a table, where that column is a `VARCHAR(24)`:

```sql
CREATE TABLE something (
  column VARCHAR(24) encode zstd -- Always encode zstd, even if it's an example
);

COPY something
FROM 's3://somewhere'
credentials 'some_role_that_works'
PARQUET;
```
It worked without a hitch for months, until a few days ago we got the always (not) helpful
```
ERROR: String length exceeds DDL length
```
on load. What? It’s supposed to be less, by construction.
Okay, let’s investigate the data directly in Redshift, by creating a table with a `VARCHAR(300)` column and loading the same data.

```sql
CREATE TABLE tmp_investigate_something (
  column VARCHAR(300) encode zstd
);

SELECT LENGTH(column), count(*)
FROM tmp_investigate_something
GROUP BY 1
ORDER BY 1 DESC;
```
```
24         1
23         1
20         8
19 785567167
18  85980301
17   8597932
16    862163
15     86856
14      8668
13      1037
12       303
```
This looks like less than 24 to me. But wait, what is `LENGTH` doing in Redshift? Note that `LENGTH` is an alias of `LEN`, whose documentation says:

> The LEN function returns an integer indicating the number of characters in the input string. The LEN function returns the actual number of characters in multi-byte strings, not the number of bytes. For example, a VARCHAR(12) column is required to store three four-byte Chinese characters. The LEN function will return 3 for that same string. To get the length of a string in bytes, use the OCTET_LENGTH function.
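The character-versus-byte distinction the docs describe is easy to check in plain Scala (a quick sketch; the CJK string here uses three-byte characters in UTF-8, rather than the four-byte example from the docs, but the effect is the same):

```scala
// Three CJK characters: one character each, but three bytes each in UTF-8
val s = "日本語"

println(s.length)                   // 3 characters
println(s.getBytes("UTF-8").length) // 9 bytes
```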
To be fair, this is more detailed than the documentation for Postgres, which just mentions characters. What happens if we use `OCTET_LENGTH`?

```sql
SELECT LENGTH(column), OCTET_LENGTH(column), count(*)
FROM tmp_investigate_something
GROUP BY 1, 2
ORDER BY 1 DESC;
```
```
24 24         1
23 25         1
20 20         8
19 19 785567167
18 20         2
18 18  85980299
17 17   8597932
16 16    862163
15 15     86856
14 14      8668
13 13      1037
12 12       303
```
A telltale 25 appears as octet length. And, for confirmation, the documentation for the `VARCHAR` type confirms the length limit is counted in single-byte units. Okay, gotcha. We should change the filter in Spark. Note, though, that the documentation for `length` in Spark says:

> Computes the character length of a given string or number of bytes of a binary string. The length of character strings include the trailing spaces. The length of binary strings includes binary zeros.
I’d expect this length to be 25, then, on the Spark side. It’s clearly not:

```scala
> val a = "�'or (100-1)=99 limit 1"
> a.length
23
> a.getBytes.length
25
> val df = List(a).toDF("column")
> df.where(length($"column") <= 24).show
+--------------------+
|              column|
+--------------------+
|�'or (100-1)=99 l...|
+--------------------+
```
In Scala we could use `someString.getBytes.length` (this would return 25), but that looks a bit lousy to do in Spark.
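A slightly less lousy sketch, relying only on the binary branch of the docs quoted above: cast the column to binary before applying `length`, so Spark counts bytes instead of characters. The local SparkSession and sample values are just for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.length

val spark = SparkSession.builder().master("local[1]").appName("byte-length").getOrCreate()
import spark.implicits._

// 24 ASCII characters (24 bytes) vs 13 accented characters (26 bytes in UTF-8)
val df = List("abcdef1234567890abcdef12", "é" * 13).toDF("column")

// length() on a binary value counts bytes, which is what Redshift's VARCHAR(24) enforces
val clean = df.where(length($"column".cast("binary")) <= 24)
clean.show()
```

Depending on your Spark version, an `octet_length` function may also be available in SQL expressions (e.g. `expr("octet_length(column)")`), which would express the same intent more directly.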
Note that this error was impossible to catch with our extensive (unit) test suite, since it could only happen in a real environment. Even if we had had UTF-8 values in our tests, they would have passed validation, since Spark and Redshift were counting different things.
For our use case (we don’t want these UTF-8 ids anyway), a different form of validation (the column is alphanumeric with a couple of possible signs) is enough, and easier than the possible alternatives. Also, a quick investigation shows that `VARCHAR(24)` takes essentially the same storage space in Redshift as `VARCHAR(MAX)`, so there is no reason to restrict this (`CHAR(24)` would take marginally less, if most ids are around 24 characters).
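That validation can be sketched as a whitelist regex in Spark; the exact “couple of possible signs” below (hyphen and underscore) are an assumption, as is the sample data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("validate-ids").getOrCreate()
import spark.implicits._

val df = List("abcdef1234567890abcdef12", "über-id").toDF("column")

// ASCII alphanumerics plus '-' and '_', at most 24 characters: every allowed
// character is one byte in UTF-8, so <= 24 characters also means <= 24 bytes
val valid = df.where($"column".rlike("^[A-Za-z0-9_-]{1,24}$"))
valid.show()
```

Because the whitelist is pure ASCII, the character-versus-byte ambiguity disappears entirely.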
Note that using CSV as the interchange format would, in general, avoid this problem. This is because Redshift’s `COPY` statement has a `MAXERROR` option (which we could set at, say, 10), but this option cannot be used when loading Parquet.
In conclusion, don’t assume data is arriving clean, even if you are cleaning it yourself. An additional step of validation is sometimes needed.