Migrate your mindset too

With the increasing popularity of Postgres, I see more and more questions asking for help to migrate code from other database systems such as Oracle or SQL Server to Postgres.

What I see very often is that the migration simply tries to translate the syntax from one system to another. This isn’t limited to migrations to Postgres: migrations from Oracle to SQL Server, or from SQL Server to Oracle, very often fall into the same trap.

Simply translating e.g. a LOOP from T-SQL to PL/pgSQL might technically work, but is very often a mediocre solution because the features of the target database are not used to their full extent.

Some examples and topics I regularly stumble upon are:

Assuming that “database” means the same thing in every product

People coming from e.g. SQL Server are very often disappointed by the lack of cross-database queries in Postgres. While it would be nice to have easier cross-database queries in Postgres, insisting on using a database to separate different logical parts of the data ignores the fact that this is exactly what schemas are for in Postgres (older versions of SQL Server didn’t have schemas, so it seems many SQL Server users still use databases the way schemas should be used).

In MySQL a “database” and a “schema” are actually the same thing, so MySQL users often use “databases” in Postgres to create separate namespaces, ignoring the fact that this is exactly what schemas are for in the SQL standard and in Postgres.
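To illustrate (the schema and table names are made up for this sketch): in Postgres, separate namespaces live as schemas inside a single database, and queries can freely combine them:

create schema sales;
create schema billing;

create table sales.orders (id integer primary key);
create table billing.invoices (id integer primary key, order_id integer);

-- joins across "namespaces" work without any extra setup
select o.id, i.id as invoice_id
from sales.orders o
  join billing.invoices i on i.order_id = o.id;

This is exactly the separation that cross-database queries are typically used for in SQL Server or MySQL.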

Understanding the difference between the query language and the procedural language

Postgres, like most other database products (Oracle, DB2, Firebird, HSQLDB), has a clear distinction between the query language SQL and the procedural language used for stored functions or procedures. Users migrating from SQL Server very often don’t take the time to understand this distinction and are frustrated that apparently simple things like an IF statement don’t work as part of a SQL query.
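A typical example of this distinction: an IF statement only exists in the procedural language (here PL/pgSQL inside a DO block), while a plain SQL query has to express conditional logic with a CASE expression instead:

-- PL/pgSQL: only valid inside a function or DO block
do $$
begin
  if current_date > date '2020-01-01' then
    raise notice 'after 2020';
  end if;
end;
$$;

-- plain SQL: the equivalent conditional logic is a CASE expression
select case
         when current_date > date '2020-01-01' then 'after 2020'
         else 'before 2020'
       end as period;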

Sadly, “reading the manual” seems to be a forgotten skill in the times of Google, Stack Overflow and other online resources. I see many questions regarding failed attempts to migrate T-SQL code to PL/pgSQL (or PL/SQL) that would not have been necessary if the person had taken the time to go through the reference manual of the corresponding procedural language.

Using stored procedures to retrieve data

Especially in the SQL Server community it seems that stored procedures are the preferred way to encapsulate the logic needed to retrieve data. SQL Server’s T-SQL makes this very easy as SQL Server can dynamically determine the result sets returned by the procedure. However, this is not the case for Postgres (and to a certain extent for Oracle as well).

Maybe it’s because I was “raised” with Pascal (and Modula), where a procedure is something that does not return a result. A procedure returning a result is something odd for someone who is not used to the “SQL Server way” of doing things. Returning a result from a stored procedure also has the disadvantage that it’s not possible to use the result in the context of another query, e.g. a JOIN or sub-select, or to apply additional WHERE or ORDER BY clauses.

For Postgres (and basically for Oracle as well), the natural choice for encapsulating the logic to retrieve something is a function. Postgres makes creating so-called “set returning functions” very easy. As I have shown in the post about Performance of stored functions, the language used also makes a difference.

So instead of fighting with Postgres’ limited abilities to return result sets from a stored procedure (or a “void” function in versions prior to 11), using a set returning function is the natural choice.

Consider the following (simplified) T-SQL procedure:

create procedure get_data(@zip_code varchar(50))
as
begin
  declare @result table
    (
      customer_id integer,
      customer_name  varchar(20),
      zip_code varchar(5)
    );

  insert into @result
  select c.id, c.customer_name, c.zip_code
  from customers c
  where c.zip_code like @zip_code + '%';

  select *
  from @result;
end;

When trying to migrate this approach 1:1 to Postgres (or Oracle for that matter), things get complicated, and very often I hear people complaining that “Postgres (or Oracle) is too complicated”.

However with the right “Postgres mindset” (rather than the “SQL Server mindset”) the migration is as simple as:

create function get_data(p_zip_code varchar(50))
  returns table
    (
      customer_id integer,
      customer_name  varchar(20),
      zip_code varchar(5)
    )
as
$$
  select c.id, c.customer_name, c.zip_code
  from customers c
  where c.zip_code like p_zip_code || '%';
$$
language sql;

Using that function or procedure is different, however. Where I could write

get_data '1234';

in SQL Server, I now have to write:

select *
from get_data('1234');
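As a bonus, the set returning function behaves like a regular table, which a procedure’s result set cannot. A sketch (the orders table and its columns are made up for illustration):

select g.customer_name, o.order_date
from get_data('1234') g
  join orders o on o.customer_id = g.customer_id
where o.order_date >= date '2020-01-01'
order by g.customer_name;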

Ignoring the capabilities of the target system

Neither Oracle nor SQL Server supports native arrays in SQL (Oracle has collections, but they are mainly a PL/SQL feature).

I regularly stumble upon systems where stored functions return their results as comma separated strings. Very often those strings are then used inside dynamic SQL to build SQL IN expressions (I usually challenge the basic approach of passing lists of IDs around, rather than properly joining (partial) results - but that is a different topic).

Passing comma separated strings around is error prone and typically less efficient than using proper arrays in Postgres.

I have seen functions like this:

create function get_ids(@dept_id integer)
  returns varchar(max)
as
begin
  declare @result_list varchar(max);
  declare @id integer;
  declare dept_cursor CURSOR FOR
            SELECT employee_id
            FROM employees
            WHERE department_id = @dept_id;

  set @result_list = '';

  OPEN dept_cursor;
  FETCH NEXT FROM dept_cursor INTO @id;

  WHILE @@FETCH_STATUS = 0
  BEGIN
    IF @result_list = ''
      SET @result_list = cast(@id as varchar(20));
    ELSE
      SET @result_list = @result_list + ',' + cast(@id as varchar(20));
    FETCH NEXT FROM dept_cursor INTO @id;
  END;

  CLOSE dept_cursor;
  DEALLOCATE dept_cursor;

  return @result_list;
end;

The use of a cursor is very questionable to begin with, but I assume this was created long before SQL Server supported string_agg().

But very often developers with the “SQL Server mindset” aren’t aware that Postgres offers arrays and a handy aggregation function to collect values into an array. So they try to migrate the above function using a cursor and string concatenation as well.

However, using the “Postgres mindset”, the correct migrated function would be:

create function get_ids(p_dept_id integer)
   returns int[]
as
$$
  select array_agg(employee_id)
  from employees
  where department_id = p_dept_id;
$$
language sql;

Using an array typically results in about 30% less memory usage at run time compared to a comma separated string of numbers.
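The returned array can then be used directly inside another query, e.g. with the ANY operator or unnest() - no dynamic SQL needed (the projects table is made up for illustration):

select *
from projects
where assigned_employee_id = any(get_ids(42));

-- or turn the array back into rows:
select *
from unnest(get_ids(42)) as t(employee_id);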

Blindly copying workarounds for missing features

Oracle has no built in function to split a string based on a delimiter. Many people use a hack that uses Oracle’s CONNECT BY clause to do that, e.g.:

select regexp_substr('one,two,three','[^,]+', 1,level) as q 
from dual
connect by regexp_substr('one,two,three', '[^,]+', 1, level) is not null;

Attempting to translate that query to Postgres is futile, because Postgres has no CONNECT BY.

This can easily be achieved with Postgres’ built-in functions:

select *
from unnest(string_to_array('one,two,three', ',')) as t(q);
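If the goal is just rows from a delimited string, the intermediate array isn’t even needed; Postgres also offers regexp_split_to_table() (and, since version 14, string_to_table()):

select q
from regexp_split_to_table('one,two,three', ',') as t(q);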

Using the wrong data types

number vs. numeric

Postgres has a very rich set of data types supporting many standard and non-standard types that can make life a lot easier when used properly.

A common error I see when migrating from Oracle to Postgres is using numeric(22) as a replacement for Oracle’s number(22).

While this works, it’s not an efficient solution. Oracle’s number data type is a variable length data type (similar to varchar) where the storage size depends on the actual value: bigger numbers need more storage. Postgres’ numeric is a variable length type as well, but it also supports decimal digits, and it is a lot slower and uses more space than e.g. a bigint.

The correct data type in Postgres for a number(22) in Oracle is therefore bigint, not numeric.

bit vs. boolean

Postgres (unlike most other relational databases) fully supports the boolean data type. In SQL Server, bit is very often used for columns that act as “true/false” flags. Postgres does have a bit type as well, but using that for a “flag” makes queries more complicated than they should be.

A boolean column named e.g. is_active can be used like this:

select *
from some_table
where not is_active;

With a bit column (e.g. bit(1)) the above would have to be written as:

select *
from some_table
where is_active = B'0';

But more importantly: the boolean type uses less storage space. A single boolean column uses 1 byte, while a bit(1) column requires 6 bytes of storage, because the bit data type has an overhead of 5 to 8 bytes.
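Another advantage of a real boolean column is that it works directly anywhere a condition is expected, e.g. in a partial index (table and index names made up for this sketch):

create table some_table
(
  id integer primary key,
  is_active boolean not null default true
);

-- the boolean column itself is a valid predicate
create index idx_some_table_active on some_table (id) where is_active;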

UUID type

Both Postgres and SQL Server have specialized data types for storing UUIDs. Oracle however does not have such a data type, and to store a UUID in Oracle one must use the RAW data type if efficiency and storage size are important. Using a VARCHAR column avoids the cumbersome RAW type, but requires much more storage space in the table and any index.

But I have seen migrations that insisted on doing a 1:1 migration and created the UUID columns as bytea in Postgres (or something similar in SQL Server). This has many negative side effects: byte arrays are usually very ugly to look at or to specify literal values for, and more importantly, one loses the compact storage and efficient indexing of the native uuid type.
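With the native uuid type this is straightforward; since Postgres 13 gen_random_uuid() is built in (in older versions the pgcrypto extension provides it). The table is made up for illustration:

create table orders
(
  id uuid not null default gen_random_uuid() primary key,
  created_at timestamptz not null default now()
);

-- uuid literals can be written as plain strings
select *
from orders
where id = '0e37df36-f698-11e6-8dd4-cb9ced3df976';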

Assuming all DBMS are implemented the same

Knowledge about the internals of certain features cannot be transferred from one database system to another. Just because product A implements e.g. indexes (and features around them) in one way, you cannot assume that other products implement them the same way. So you also cannot apply optimizations that rely on the internal workings of one database to another.

Conclusion

When migrating to a new database product, it’s vital to get acquainted with the capabilities of the target system.

Blindly applying patterns and design choices that worked in the old system might lead to an implementation that doesn’t exploit the full potential of the new system. Very often the new database is blamed for being “slow” or “complicated” while that isn’t actually the case - it’s just that it works differently and requires a new mindset.