Spark, Redshift and the revenge of UTF8
At Affectv, we process a lot of data every day. We clean it up thoroughly, and once clean, we assume it’s clean. But we recently faced a puzzling situation where clean data wasn’t as clean as we thought. The participants in it are
- Apache Spark
- Amazon Web Services Redshift
- Apache Parquet as data interchange format (via AWS’s S3)
- myself, as a puzzled observer
The situation
We are processing some data in Spark, filtering a column to have length less than or equal to 24 and dumping the output to S3. Something like,
// In a standalone job (outside spark-shell) these imports are needed
import org.apache.spark.sql.functions.length
import spark.implicits._

val df = List("abcdef1234567890abcdef12", "ABCDEF1234567890ABCDEF12Z")
  .toDF("column")
df.where(length($"column") <= 24).write.parquet("s3://somewhere")
In some follow-up process, we load the data from S3 into a table where that column is a VARCHAR(24). Like
CREATE TABLE something (
column VARCHAR(24) encode zstd
-- Always encode zstd, even if it's an example
);
COPY something FROM 's3://somewhere'
credentials 'some_role_that_works'
PARQUET;
It worked without a hitch for months, until a few days ago we got the always (not) helpful
ERROR: String length exceeds DDL length
on load. What? It’s supposed to be at most 24, by construction.
The investigation
Okay, let’s investigate the data directly in Redshift, by creating a table with, say, VARCHAR(300) and loading the same data.
CREATE TABLE tmp_investigate_something (
column VARCHAR(300) encode zstd
);
SELECT LENGTH(column), COUNT(*) FROM tmp_investigate_something GROUP BY 1 ORDER BY 1 DESC;
getting
24 1
23 1
20 8
19 785567167
18 85980301
17 8597932
16 862163
15 86856
14 8668
13 1037
12 303
This looks like at most 24 to me. But wait, what is LENGTH doing in Redshift? Note that LENGTH is an alias of LEN.
The LEN function returns an integer indicating the number of characters in the input string. The LEN function returns the actual number of characters in multi-byte strings, not the number of bytes. For example, a VARCHAR(12) column is required to store three four-byte Chinese characters. The LEN function will return 3 for that same string. To get the length of a string in bytes, use the OCTET_LENGTH function.
To be fair, this is more detailed than the documentation for Postgres, which just mentions characters. What happens if we use OCTET_LENGTH, then?
SELECT LENGTH(column), OCTET_LENGTH(column), COUNT(*) FROM tmp_investigate_something GROUP BY 1, 2 ORDER BY 1 DESC;
24 24 1
23 25 1
20 20 8
19 19 785567167
18 20 2
18 18 85980299
17 17 8597932
16 16 862163
15 15 86856
14 14 8668
13 13 1037
12 12 303
A telltale 25 appears as octet length. And, for confirmation, the documentation for the VARCHAR type confirms that the length limit is in single-byte characters. Okay, gotcha. We should change the filter in Spark. Note, though, that the documentation for length in Spark says
Computes the character length of a given string or number of bytes of a binary string. The length of character strings include the trailing spaces. The length of binary strings includes binary zeros.
I’d expect this length to be 25 on the Spark side, then. It’s clearly not.
> val a = "�'or (100-1)=99 limit 1"
> a.length
23
> a.getBytes.length
25
> val df = List(a).toDF("column")
> df.where(length($"column") <= 24).show
+--------------------+
| column|
+--------------------+
|�'or (100-1)=99 l...|
+--------------------+
¯\_(ツ)_/¯
In Scala we could use someString.getBytes.length (this would return 25), but that looks a bit lousy to do in Spark.
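If we did want to filter on bytes from Spark, a couple of options come to mind. This is only a sketch (it assumes a reasonably recent Spark, where the octet_length SQL function is available), not what we ended up doing:

import org.apache.spark.sql.functions.{col, expr, udf}

// Option 1: the octet_length SQL function, which counts bytes rather than characters
val byBuiltin = df.where(expr("octet_length(`column`) <= 24"))

// Option 2: a small UDF around getBytes, which works on older Spark versions too
val utf8Bytes = udf((s: String) => if (s == null) 0 else s.getBytes("UTF-8").length)
val byUdf = df.where(utf8Bytes(col("column")) <= 24)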
Note that this error was impossible to catch with our extensive (unit) test suite, since it could only happen in a real environment. Even if we had had UTF8 values in our tests, they would have passed validation, since Spark and Redshift were counting different things (characters versus bytes).
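Just for illustration (the value below is hypothetical, echoing the REPL session above), a character-based assertion of the kind our tests make would pass, while a byte-based one would have caught the problem:

// 23 characters, but 25 bytes once encoded as UTF-8
val suspicious = "\uFFFD'or (100-1)=99 limit 1"

// Character count: 23 <= 24, so a test written this way passes
assert(suspicious.length <= 24)

// Byte count: 25 > 24, so this assertion would have failed and flagged the issue
// assert(suspicious.getBytes("UTF-8").length <= 24)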
The solution
For our use case (we don’t want these UTF8 ids anyway), a different form of validation (the column is alphanumeric with a couple of allowed symbols) is enough, and easier than the possible alternatives. Also, a quick investigation shows that using VARCHAR(24) takes essentially the same storage space in Redshift as VARCHAR(MAX), so there is no reason to restrict this (CHAR(24) would take marginally less, if most ids are around 24 characters).
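As a sketch of what that validation can look like on the Spark side (the exact character class below, alphanumerics plus hyphen and underscore, is an assumption for illustration, not our precise rule):

import org.apache.spark.sql.functions.col

// Keep only ids made of ASCII alphanumerics plus a couple of symbols, at most
// 24 characters; being single-byte, they are also at most 24 bytes in Redshift
val validated = df.where(col("column").rlike("^[A-Za-z0-9_-]{1,24}$"))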
Note that using CSV as the interchange format would, in general, avoid this problem. This is because Redshift’s COPY statement has a MAXERROR option (which we could set at, say, 10), but that option cannot be used when loading Parquet.
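For reference, the Spark side of that alternative is just a different writer. A sketch, with the same placeholder bucket as above; the COPY would then specify CSV and MAXERROR instead of PARQUET:

import org.apache.spark.sql.functions.{col, length}

// Same filter as before, but written out as gzipped CSV instead of Parquet
df.where(length(col("column")) <= 24)
  .write
  .option("compression", "gzip")
  .csv("s3://somewhere")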
In conclusion, don’t assume data is arriving clean, even if you are cleaning it yourself. An additional step of validation is sometimes needed.