Understanding MySQL binary and non-binary string data types

There are two main categories of string data types available and understanding their characteristics properly will allow you to choose the appropriate string data type for the column in accordance with the requirements of the application. Having reviewed different table structures designed by different people, I have come to the conclusion that binary and non-binary string data types are used without consideration of the consequences of choosing either one. The confusion stems from the fact that both non-binary and binary string data appear to store characters because they can be saved as quoted string.

Broadly speaking there are two major categories within which the different string types can be divided, and understanding these two categories is very important because each of this category is handled differently by MySQL. These two categories are described as follows:

Non-binary string data type

Non-binary string is actually a sequence of characters that have a character set and a collation. Now let’s understand both of these terms.

The character set as the name suggests determines the set of characters supported by that character set. The character set also determines whether the storage of each character requires a single byte or multiple bytes. Multi-byte characters may need a fixed number of bytes per character or a variable number of bytes per character.
For example,

  • ucs2 Unicode character set requires 2 bytes per character, and
  • utf8 Unicode character set requires 1 to 3 bytes per character.

Now what is collation? Collation actually determines the sorting order of the characters in the character set associated with the string and is hence used in string comparisons. Collation also determines whether the string can be case-sensitive or case-insensitive. Collation can be binary as well, in which case the characters are sorted based on their numeric byte value, which also means that a string with a binary collation will be case-sensitive.

Note also that character set and collation can be defined at the database level, the table level or the column level.

Executing the following,

SHOW CHARACTER SET;

will give you a list of available character sets and their default collations. Note that the collations end in either ci, cs, or bin, these stand for, case-insensitive, case-sensitive, and binary collations respectively.

The non-binary string types are

CHAR, VARCHAR, TEXT (and other variations of TEXT data types)

Binary string data type

The most important point to remember in case of binary string data type is that, binary strings don’t have character sets and collations. Binary strings are merely a sequence of byte values, and because of this, such strings seem to be case-sensitive, because uppercase and lowercase characters have different byte values. Also, there is no concept of multi-byte characters, hence, multi-byte characters when stored in binary string columns are treated as separate bytes, and any association between the bytes are lost.

The binary string types are

BINARY, VARBINARY, BLOB (and other variations of BLOB data types)

Which category of string data type should you use?

Now the big question is which string data type should be chosen? That really depends on the type of value that you would be storing in the column, if you are storing data that you want to represent as text, such as a blog post or product description, then you should choose non-binary string data type. But if you want to store data that are merely bytes such as data that represents image, then you should use a binary string data type.

Also note that, you should choose a multi-byte character set such as utf8 or ucs2 only when you want to store characters that cannot be represented as single-byte characters, otherwise you would be wasting space.

A CHAR(10) column would need 30 bytes for each value regardless of the actual value if utf8 character set is used, however, the same column would need 10 bytes for each value if a single-byte character set such as latin1 is used. Keeping these considerations in mind is very important.