This post is the first in a series of posts discussing lessons learned from migrating a BI infrastructure from Oracle to Redshift. This particular post discusses the migration of Oracle data types to Redshift: obstacles, tips, and some important differences in function usage.

The source (Oracle) database stored most date-time related columns in the DATE data type. DATE is a 7-byte fixed-length data type which stores the date and time up to a precision of seconds. It is a commonly used data type in Oracle, and many Oracle date-time functions accept DATE as both input and output.

The AWS Schema Conversion Tool suggested migrating all Oracle DATE columns to TIMESTAMP (aka TIMESTAMP WITHOUT TIMEZONE). TIMESTAMP is an 8-byte data type which supports up to 6 digits of precision for fractional seconds. The disadvantage of TIMESTAMP is that the data is always assumed to be in UTC: the input is stored as-is, and any timezone in the input is ignored. This can create a lot of confusion and ambiguity, and for this reason it is highly recommended to use TIMESTAMPTZ (aka TIMESTAMP WITH TIMEZONE) instead.

Chosen Target: TIMESTAMPTZ

The AWS Schema Conversion Tool did not suggest TIMESTAMPTZ as our target, because Oracle DATE does not contain a timezone indication. However, since we are aware of the original timezone, we decided to use TIMESTAMPTZ and explicitly specify the input timezone. TIMESTAMPTZ is also an 8-byte data type which supports up to 6 digits of precision for fractional seconds.

It is important to clarify that TIMESTAMPTZ stores the data in UTC; it does not store any timezone details related to the original input value. When the input value contains a timezone indication, that indication is used to convert the value into UTC, and the value is stored as such (if no timezone indication is provided, the source data is assumed to be in UTC by default).

Setting a default TIMEZONE

By default, TIMESTAMPTZ assumes input and output data is inserted and retrieved in UTC. Because in our case all source databases and business users are expected to use Amsterdam as a consistent timezone (even when not explicitly mentioning it), we decided to give the Redshift users a default timezone:

ALTER USER <username> SET timezone TO 'Europe/Amsterdam';

Setting the ETL user(s) to a default timezone causes all input data to be processed in this default timezone (unless a timezone is explicitly specified). Setting the reporting user(s) to a default timezone ensures that all output data is displayed in the default timezone; in addition, any date-time filters are assumed to be in the default timezone. Thus, by modifying the users' timezone, we abstract away the fact that all data is stored in UTC, and we eliminate the need to explicitly specify the input/output timezone.

The source (Oracle) database stored most character-related fields as VARCHAR2 with a specified CHAR length. Using CHAR length instead of BYTE length makes it easy to specify the exact maximum number of characters, including special characters or languages where a single character can take up to 4 bytes.

The AWS Schema Conversion Tool suggested migrating all Oracle VARCHAR2 columns to VARCHAR (aka CHARACTER VARYING) with a BYTE length. It is important to note that it automatically suggests multiplying the VARCHAR2 CHAR length by 4 in order to produce a BYTE length which supports the same limit of characters: a column defined as VARCHAR2(10 CHAR) will be converted to VARCHAR(40).

Oracle uses variable-length storage for in-memory processing: if you have a VARCHAR2 column of 100 bytes of which only 20 bytes are filled, it will only occupy about 20 bytes in memory. Redshift, however, will occupy the number of bytes in the column definition rather than the actual usage. For this reason, it is highly recommended to limit the length of columns to reflect the actual length of the values stored.
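To illustrate the TIMESTAMPTZ conversion behavior described above, here is a minimal sketch; the table and column names are invented for illustration:

```sql
-- Illustrative sketch; table/column names are invented.
CREATE TABLE events (event_ts TIMESTAMPTZ);

-- The input carries an explicit offset (+02); Redshift converts it
-- to UTC before storing, and keeps no record of the original offset.
INSERT INTO events VALUES ('2023-06-01 12:00:00+02');

-- With the session timezone left at its default (UTC), an input with
-- no timezone indication is assumed to already be in UTC.
INSERT INTO events VALUES ('2023-06-01 12:00:00');

-- Still in a default (UTC) session, the two rows are displayed in UTC:
-- 10:00:00 and 12:00:00 respectively.
SELECT event_ts FROM events;
```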
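The per-user default timezone setup can be sketched as follows; the user names are placeholders for the actual ETL and reporting accounts:

```sql
-- Placeholder user names; substitute the real ETL/reporting accounts.
ALTER USER etl_user SET timezone TO 'Europe/Amsterdam';
ALTER USER reporting_user SET timezone TO 'Europe/Amsterdam';
```

In a session opened by one of these users, a value stored as 10:00 UTC is displayed in Amsterdam time, and timezone-less input literals are interpreted as Amsterdam time before being converted to UTC for storage, which is exactly the abstraction described above.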
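The CHAR-to-BYTE length conversion can be sketched with a hypothetical column:

```sql
-- Oracle (source): a limit of 10 characters, regardless of byte width.
-- CREATE TABLE customers (last_name VARCHAR2(10 CHAR));

-- Redshift (target), as suggested by the AWS Schema Conversion Tool:
-- CHAR length * 4 bytes, so that 10 four-byte characters still fit.
CREATE TABLE customers (last_name VARCHAR(40));
```

Given Redshift's in-memory behavior, it is worth reviewing such converted columns and tightening the byte length to match the values actually stored, rather than keeping the worst-case 4x limit everywhere.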