Spark, Redshift and the revenge of UTF8
At Affectv, we process a lot of data every day. We clean it up thoroughly, and once clean, we assume it’s clean. But we recently faced a puzzling situation where clean data wasn’t as clean as we thought. The participants in it are
- Apache Spark
- Amazon Web Services Redshift
- Apache Parquet as data interchange format (via AWS’s S3)
- myself, as a puzzled observer
The situation
We are processing some data in Spark, filtering a column to have length less than or equal to 24 and dumping the output to S3. Something like,
// In a standalone job (outside spark-shell) these imports are needed
import org.apache.spark.sql.functions.length
import spark.implicits._

val df = List("abcdef1234567890abcdef12", "ABCDEF1234567890ABCDEF12Z")
  .toDF("column")
df.where(length($"column") <= 24).write.parquet("s3://somewhere")
In some follow-up process, we load the data from S3 into a table where that column is a VARCHAR(24). Like
CREATE TABLE something (
column VARCHAR(24) encode zstd
-- Always encode zstd, even if it's an example
);
COPY something FROM 's3://somewhere'
credentials 'some_role_that_works'
PARQUET;
It worked without a hitch for months, until a few days ago we got the always (not) helpful
ERROR: String length exceeds DDL length
on load. What? It’s supposed to be at most 24, by construction.
The investigation
Okay, let’s investigate the data directly in Redshift, by creating a table with, say, VARCHAR(300) and loading the same data.
CREATE TABLE tmp_investigate_something (
column VARCHAR(300) encode zstd
);
SELECT LENGTH(column), COUNT(*) FROM tmp_investigate_something GROUP BY 1 ORDER BY 1 DESC;
getting
24 1
23 1
20 8
19 785567167
18 85980301
17 8597932
16 862163
15 86856
14 8668
13 1037
12 303
This looks like at most 24 to me. But wait, what is LENGTH doing in Redshift? Note that LENGTH is an alias of LEN.
The LEN function returns an integer indicating the number of characters in the input string. The LEN function returns the actual number of characters in multi-byte strings, not the number of bytes. For example, a VARCHAR(12) column is required to store three four-byte Chinese characters. The LEN function will return 3 for that same string. To get the length of a string in bytes, use the OCTET_LENGTH function.
To be fair, this is more detailed than the documentation for Postgres, which just mentions characters. What happens if we use OCTET_LENGTH, then?
SELECT LENGTH(column), OCTET_LENGTH(column), COUNT(*) FROM tmp_investigate_something GROUP BY 1, 2 ORDER BY 1 DESC;
24 24 1
23 25 1
20 20 8
19 19 785567167
18 20 2
18 18 85980299
17 17 8597932
16 16 862163
15 15 86856
14 14 8668
13 13 1037
12 12 303
A telltale 25 appears as octet length. And, for confirmation, the documentation for the VARCHAR type confirms that the length limit is in single-byte characters. Okay, gotcha. We should change the filter in Spark. Note, though, that the documentation for length in Spark says
Computes the character length of a given string or number of bytes of a binary string. The length of character strings include the trailing spaces. The length of binary strings includes binary zeros.
I’d expect this length to be 25 on the Spark side, then. It’s clearly not.
> val a = "�'or (100-1)=99 limit 1"
> a.length
23
> a.getBytes.length
25
> val df = List(a).toDF("column")
> df.where(length($"column") <= 24).show
+--------------------+
| column|
+--------------------+
|�'or (100-1)=99 l...|
+--------------------+
¯\_(ツ)_/¯
In Scala we could use someString.getBytes.length (this would return 25), but that looks a bit lousy to do in Spark.
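If we did want to filter on bytes from Spark, a couple of options come to mind. This is only a sketch (it assumes a reasonably recent Spark, where the octet_length SQL function is available), not what we ended up doing:

import org.apache.spark.sql.functions.{col, expr, udf}

// Option 1: the octet_length SQL function, which counts bytes rather than characters
val byBuiltin = df.where(expr("octet_length(`column`) <= 24"))

// Option 2: a small UDF around getBytes, which works on older Spark versions too
val utf8Bytes = udf((s: String) => if (s == null) 0 else s.getBytes("UTF-8").length)
val byUdf = df.where(utf8Bytes(col("column")) <= 24)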
Note that this error was impossible to catch with our extensive (unit) test suite, since it could only happen in a real environment. Even if we had had UTF8 values in our tests, they would have passed validation, since Spark and Redshift were counting different things (characters versus bytes).
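Just for illustration (the value below is hypothetical, echoing the REPL session above), a character-based assertion of the kind our tests make would pass, while a byte-based one would have caught the problem:

// 23 characters, but 25 bytes once encoded as UTF-8
val suspicious = "\uFFFD'or (100-1)=99 limit 1"

// Character count: 23 <= 24, so a test written this way passes
assert(suspicious.length <= 24)

// Byte count: 25 > 24, so this assertion would have failed and flagged the issue
// assert(suspicious.getBytes("UTF-8").length <= 24)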
The solution
For our use case (we don’t want these UTF8 ids anyway), a different form of validation (the column is alphanumeric with a couple of allowed symbols) is enough, and easier than the possible alternatives. Also, a quick investigation shows that using VARCHAR(24) takes essentially the same storage space in Redshift as VARCHAR(MAX), so there is no reason to restrict this (CHAR(24) would take marginally less, if most ids are around 24 characters).
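As a sketch of what that validation can look like on the Spark side (the exact character class below, alphanumerics plus hyphen and underscore, is an assumption for illustration, not our precise rule):

import org.apache.spark.sql.functions.col

// Keep only ids made of ASCII alphanumerics plus a couple of symbols, at most
// 24 characters; being single-byte, they are also at most 24 bytes in Redshift
val validated = df.where(col("column").rlike("^[A-Za-z0-9_-]{1,24}$"))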
Note that using CSV as the interchange format would, in general, avoid this problem. This is because Redshift’s COPY statement has a MAXERROR option (which we could set at, say, 10), but that option cannot be used when loading Parquet.
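For reference, the Spark side of that alternative is just a different writer. A sketch, with the same placeholder bucket as above; the COPY would then specify CSV and MAXERROR instead of PARQUET:

import org.apache.spark.sql.functions.{col, length}

// Same filter as before, but written out as gzipped CSV instead of Parquet
df.where(length(col("column")) <= 24)
  .write
  .option("compression", "gzip")
  .csv("s3://somewhere")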
In conclusion, don’t assume data is arriving clean, even if you are cleaning it yourself. An additional step of validation is sometimes needed.