Using descriptors with UDFs


By Claudio Valderrama - Copyright www.cvalde.net

For most people who started using InterBase when Delphi was released by Borland and the engine came "as a bonus toy for practicing SQL", the known capabilities are circumscribed to the features the DSQL layer has to offer.

Under the hood, however, there's a procedural and neutral language called BLR that's the only language the core engine understands. Prior to InterBase 4, the main language was the proprietary GDML for preprocessed applications: GDML is converted into BLR by gpre before the host language's compiler (C, Ada, Pascal, etc.) sees the source files. In that era, DSQL was not a subsystem integrated into the engine but an external utility, just as ISQL, QLI and gpre are utilities. With InterBase 4, the story changed: Borland now owned InterBase, and the recognition that SQL had won against other proposed standards and proprietary languages led the team to integrate DSQL into the engine. However, for whatever reason (schedule, human resources, priorities, complexity, SQL being 99% a declarative language as opposed to GDML, etc.) Borland failed to surface several capabilities of the engine through DSQL... and unfortunately, UDFs were one of those overlooked areas.

The following declaration exists in ibase.h and enumerates the types of parameter passing mechanisms that are available for UDFs:

 typedef ENUM { FUN_value, FUN_reference, FUN_descriptor, FUN_blob_struct, FUN_scalar_array } FUN_T;

The following query recovers the equivalent values from the system tables. You can observe that the symbolic names are not exactly the same:

 SQL> select rdb$type || '=' || rdb$type_name 
 CON> from rdb$types
 CON> where rdb$field_name = 'RDB$MECHANISM';
 ======================================
 0=BY_VALUE
 1=BY_REFERENCE
 2=BY_VMS_DESCRIPTOR
 3=BY_ISC_DESCRIPTOR
 4=BY_SCALAR_ARRAY_DESCRIPTOR

Here's the explanation for each mechanism. Some mechanisms apply to almost any data type, whereas others are very specific:

  • By value: 0: GDML used it for parameter passing, but it was non-portable across different architectures. Deprecated. Only used, optionally, for the return argument; do not try to use it for input arguments. As a basic safety measure, the DSQL layer doesn't provide a way to declare input arguments by value, only the return argument.
  • By reference: 1: by pointer. Typical of 99% of UDFs. You get the equivalent of a pointer in programming languages, or the VAR declaration for procedure parameters in Pascal. You don't need to declare it, since it's the default for everything except blobs; besides, there's no way to declare it explicitly.
  • By VMS descriptor: 2: used when you write BY DESCRIPTOR in Firebird. It's the same old DSC structure used by the engine internally. I defined it as PARAMDSC in ibase.h because I didn't want potential issues with redefinitions. You can use it to get a blob, but I wouldn't, since you won't get the pointer to the special callback functions needed to work with a blob's segments or blob streams. It makes little sense to use this mechanism with blobs.
  • By ISC descriptor: 3: used to pass blobs. Since the way to return a blob is to make it an argument, it's used to return blobs, too. This is a special descriptor for blobs that uses the BLOB structure, which is different from the internal BLB structure used by the engine. I defined it as BLOBCALLBACK in ibase.h some time ago. This struct provides the external entry points to get and set blob segments, plus a seek function to operate on a rare flavor of blob named stream blob. You'll find mechanism 3 in any current UDF that handles blobs. It's the default for blob passing; you don't need to declare it and can't declare it explicitly.
  • By scalar array descriptor: 4: I don't know of any UDF that currently uses it. The engine fills a scalar array descriptor (SAD, which I didn't define in ibase.h) and populates it with values from the array column. Same as with the BLOB/BLB issue, SAD is a structure to convey the array to the UDF; the engine uses ADS (array descriptor) internally.

By value and by reference don't require further discussion. By ISC descriptor is discussed (although very briefly) in the engine's documentation, with the caveat that the full model BLOBCALLBACK provides wasn't shown: the seek function wasn't acknowledged, and the BLOB structure wasn't available in any public header even though it was meant to be public. For now, descriptors for arrays are left out of the discussion. Let's concentrate on VMS descriptors.

VMS descriptors, used throughout the engine for data and metadata, are probably one of the original constructions. Internally, the declaration is referred to as DSC. For Firebird, the unorthodox BY DESCRIPTOR keyword was chosen so the DSQL layer could declare this mechanism: blobs use by default a fake descriptor that exists only to interface with UDFs, and since that default needs no declaration and can't be declared, there's no clash. Furthermore, the fourth type may be implemented in a more descriptive way in the future, for example BY ARRAY DESCRIPTOR. Forcing developers to change system tables by brute force to match the desired declaration is asking for trouble, hence the BY DESCRIPTOR syntax. GDML was always able to declare this mechanism. Again in ibase.h, the following declaration is found:

 typedef struct paramdsc {
     unsigned char   dsc_dtype;
     signed char     dsc_scale;
     ISC_USHORT      dsc_length;
     short           dsc_sub_type;
     ISC_USHORT      dsc_flags;
     unsigned char   *dsc_address;
 } PARAMDSC; 
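As an illustration of how these fields fit together, here's a minimal standalone sketch that mirrors the declaration above and builds a descriptor for a smallint the way the engine would before calling a UDF. The helper name and the use of std::uint16_t for ISC_USHORT are assumptions for the example; real code should include ibase.h:

```cpp
#include <cstdint>

// Standalone mirror of the ibase.h declaration; real UDFs include ibase.h.
typedef struct paramdsc {
    unsigned char  dsc_dtype;     // run-time data type tag (dtype_short, ...)
    signed char    dsc_scale;     // negative scale for numeric/decimal
    std::uint16_t  dsc_length;    // raw length of the value in bytes
    short          dsc_sub_type;  // text type or blob sub type
    std::uint16_t  dsc_flags;     // DSC_null and friends
    unsigned char *dsc_address;   // pointer to the actual data
} PARAMDSC;

const unsigned char dtype_short = 8;  // from the engine's dtype list

// Hypothetical helper: wrap a 16-bit integer in a descriptor, roughly the
// way the engine presents a smallint argument to a UDF.
PARAMDSC make_smallint_desc(short *value)
{
    PARAMDSC d = {};
    d.dsc_dtype   = dtype_short;
    d.dsc_scale   = 0;
    d.dsc_length  = sizeof(short);
    d.dsc_address = reinterpret_cast<unsigned char *>(value);
    return d;
}
```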

In addition, the following special definition is added:

 /* Overload text typing information into the dsc_sub_type field.
 See intl.h for definitions of text types */ 
 #define dsc_ttype dsc_sub_type

According to Dave Schnepper, this overloading was done to avoid changing the DSC structure. It's so critical to the engine that changing it means an ODS upgrade. Furthermore, Dave envisioned that some day the structure should get a separate data member to measure the logical length of strings, for international character set support.

That overloading of dsc_sub_type into dsc_ttype is not valid for blobs, beware! Let's analyze each structure's member separately.


dsc_dtype

This is -with no doubt- the most important member in the structure. It's the data type of the value being carried. The following declarations apply:

 /* Note that dtype_null actually means that we do not yet know the
 dtype for this descriptor. A nice cleanup item would be to globally
 change it to dtype_unknown. --chrisj 1999-02-17 */
 #define dtype_null 0
 #define dtype_text 1
 #define dtype_cstring 2
 #define dtype_varying 3
 #define dtype_packed 6
 #define dtype_byte 7
 #define dtype_short 8
 #define dtype_long 9
 #define dtype_quad 10
 #define dtype_real 11
 #define dtype_double 12
 #define dtype_d_float 13
 #define dtype_sql_date 14
 #define dtype_sql_time 15
 #define dtype_timestamp 16
 #define dtype_blob 17
 #define dtype_array 18
 #define dtype_int64 19
 #define DTYPE_TYPE_MAX 20

Notice that the engine cannot do anything useful if the type is zero, as Chris Jewell's comment at the top of the list says. In code paths where the value must be converted or combined with other values, type zero will cause the engine to complain and stop the request; the same happens if the type is greater than or equal to DTYPE_TYPE_MAX. Bear in mind that these values are used at run-time and are completely different from the ones you get by querying the system tables:

 SQL> select rdb$type || '=' || rdb$type_name
 CON> from rdb$types
 CON> where rdb$field_name = 'RDB$FIELD_TYPE';
 ======================================
 14=TEXT
 7=SHORT
 8=LONG
 9=QUAD
 10=FLOAT
 27=DOUBLE
 35=TIMESTAMP
 37=VARYING
 261=BLOB
 40=CSTRING
 45=BLOB_ID
 12=DATE
 13=TIME
 16=INT64
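The run-time validity rule just described can be captured in a tiny helper; this is a sketch, with the constants taken from the dtype list above and an invented function name:

```cpp
// Run-time dtype bounds, as defined in the engine's internal headers.
const unsigned char dtype_null     = 0;   // "unknown", unusable
const unsigned char DTYPE_TYPE_MAX = 20;  // first invalid value

// A descriptor's dtype is usable only inside the open interval (0, 20).
bool is_known_dtype(unsigned char dsc_dtype)
{
    return dsc_dtype != dtype_null && dsc_dtype < DTYPE_TYPE_MAX;
}
```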

In fact, there's an internal table at the DSQL layer and a function in the core engine that translate between the two equivalent type "scales", just as you can measure temperature in Fahrenheit, Celsius or Kelvin. One important point here is that you can't declare a procedure parameter or a table field of type CSTRING, but you can declare a UDF argument of type CSTRING. This is nothing more than the type of string used by C, namely an array of CHAR with a null ASCII terminator, unlike Pascal's length-counted string, where the length is hidden at the beginning. While CSTRING is not used for storage, it can be used for UDF handling. The following table shows the equivalences:

 DSC                 RDB$TYPES      SQL
 dtype_null 0
 dtype_text 1        14=TEXT        char
 dtype_cstring 2     40=CSTRING     cstring (only for UDFs)
 dtype_varying 3     37=VARYING     varchar, char varying
 dtype_packed 6
 dtype_byte 7
 dtype_short 8       7=SHORT        smallint
 dtype_long 9        8=LONG         int, integer
 dtype_quad 10       9=QUAD
 dtype_real 11       10=FLOAT       float
 dtype_double 12     27=DOUBLE      double precision, long float
 dtype_d_float 13    27=DOUBLE      double precision, d_float (VMS)
 dtype_sql_date 14   12=DATE        date (dialect 3)
 dtype_sql_time 15   13=TIME        time (dialect 3)
 dtype_timestamp 16  35=TIMESTAMP   date (dialect 1), timestamp
 dtype_blob 17       261=BLOB       blob
 dtype_array 18
 dtype_int64 19      16=INT64       numeric(18, s)
                     45=BLOB_ID

There are some cases that aren't mapped to system tables or to SQL types. Packed and real types seem obsolete, byte may be the foundation for a future Boolean, blob_id appears unused (any field declared as blob really holds only a blob_id) and dtype_array is a special stream blob that handles arrays of other types. To have an easy way to test whether a field is of a string type (char, varchar, cstring), the following macro is defined internally in dsc.h and may be useful to copy into a UDF:

 #define dtype_any_text dtype_varying

and the typical test is

 if (descriptor->dsc_dtype <= dtype_any_text)

provided that the dsc_dtype is not zero, of course. This works because the three string types have the lowest consecutive values after zero. Remember that the UDF gets the values in the DSC column shown above, not the ones in the rdb$types system table.
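Putting the macro and the zero-dtype caveat together, a UDF could use a helper along these lines. This is a standalone sketch with the dtype constants restated; real code would take them from the engine's headers:

```cpp
// dtype values from the engine's internal list (see above). The three
// string types occupy the lowest consecutive values after zero.
const unsigned char dtype_null     = 0;
const unsigned char dtype_text     = 1;
const unsigned char dtype_cstring  = 2;
const unsigned char dtype_varying  = 3;
const unsigned char dtype_any_text = dtype_varying;  // as in dsc.h

// True when the descriptor carries one of the three string types.
// A zero dtype means "unknown", so it's excluded explicitly.
bool is_string_type(unsigned char dsc_dtype)
{
    return dsc_dtype != dtype_null && dsc_dtype <= dtype_any_text;
}
```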


dsc_scale

In the era when InterBase was born, around 1985, economy of resources was a must. Powerful workstations didn't have more than 2 MB or 4 MB of RAM, and hard disks were expensive, small and slow. Packing information densely was a strong goal, and dsc_scale is an example. It's unused for string types. However, integral types are used to hold the numeric and decimal SQL types, and here dsc_scale is the negative number of places reserved for the scale. A declaration like numeric(9, 3) means dsc_scale is -3 and precision is 9, so the room left for the integral part is easy to compute: precision plus scale. Notice that the precision is carried implicitly and the engine doesn't enforce it. The maximum precision depends on the underlying storage used by the engine. The following table shows the possibilities:

 numeric(p, s) or decimal(p, s)   Storage in dialect 1   Storage in dialect 3
 p<5, promoted to 4               smallint               smallint
 5<=p<10, promoted to 9           int                    int
 10<=p<=18, promoted to 18        double precision       int64

If you try MS SQL, for example, you will find that numeric(3,1) doesn't let you put more than 3-1=2 digits in the integral part and one digit in the decimal part. In InterBase, using the previous table, the maximum value allowed is determined by the underlying storage; for such a declaration that's smallint, whose raw value goes up to 32767, so numeric(3,1) will allow 4-1=3 digits in the integral part (and for some values 4 digits, provided the raw value doesn't go above 32767). Extra decimal places are ignored. You can also deduce from the table that dialect 1 uses inexact storage for precisions above 9, but dialect 3 solves that problem. This happens at declaration time: if the field was declared with precision 15 from dialect 1, it will always be double precision, even after backup and restore. It is not affected by changing the dialect of the database after the field creation. The only way to get true int64 storage is to set the database dialect to 3, connect with a dialect 3 client and define the field.
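Since the engine stores these columns as scaled integers, a UDF that receives the raw integer behind, say, a numeric(9, 3) can reconstruct the human-readable value as raw * 10^scale. A minimal sketch, with an invented function name:

```cpp
#include <cmath>

// Sketch: recover the displayed value of a numeric/decimal column.
// numeric(9, 3) arrives as a scaled integer with dsc_scale = -3, so the
// real value is raw * 10^scale (e.g. 123456 with scale -3 is 123.456).
double scaled_to_double(long long raw, signed char dsc_scale)
{
    return static_cast<double>(raw) * std::pow(10.0, dsc_scale);
}
```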

The second usage for dsc_scale is for blobs: since the other members of the structure are already filled with information, there's no place to put the character set of the blob. Since the scale is a signed char, it's enough, provided there are no charset identifiers above 127. Currently, the maximum charset id is 57.


dsc_length

This is the raw length of a data type and applies to all data types. For char string types, it's the length of the declared field; for varchar, it's the length of the declared field plus the size of the header that holds the effective length (currently an unsigned short, hence two bytes on 32-bit platforms); for cstring, it's not clear whether the null ASCII terminator is counted, but DSC_make_descriptor, a widely used function inside the engine, doesn't count it. For integral and double precision types, it's the length of the underlying storage, typically the result of the sizeof operator in C. A quad is a combination of a signed and an unsigned long integer, so it's equivalent in storage to int64; the same story for timestamp. Date-only needs a signed long and time-only needs an unsigned long. For blobs, the length is only the length of the blob-id field; this is a type of quad, so it always uses two long integers. The field holding the blob only carries the blob id; the zone for the blob inside a page holds the contents.

For the developer, only the length of string fields is important, since it's variable; the other fields match the sizes used by the compiler in the client application. One special case is the blob, which provides its length in the BLOB structure. There is official and extra information on the commonly used isc_descriptor mechanism for blob passing to UDFs; the structure includes the number of segments, the largest segment's size and the total effective blob length. It is documented as blobcallback in ibase.h, named after the callback functions it makes available to the UDF.

Unlike char and cstring, varchar fields are length-counted strings. In other words, the engine tracks the exact length of the string, no right padding with blanks up to the declared length like the char type. This is why the following helper structure has been declared in ibase.h:

 typedef struct paramvary {
     ISC_USHORT    vary_length;
     unsigned char     vary_string [1];
 } PARAMVARY;

Again, the clumsy name is there to avoid any potential clash with the original structure (named vary) if some change is made. The dsc_length member counts the header plus the effective length. Hence, the safest calculation for a UDF that receives a varchar is to take the minimum between dsc_length minus sizeof(vary_length) and the value of vary_length. In practical terms, dsc_length and vary_length are unsigned short integers, so the size of vary_length is 2. The maximum allowed length of char fields is 32767, which gives 32765 for varchar fields.


dsc_sub_type

For date/time fields, it's unused. For smallint, int and int64, it's zero if one of those types was used in the field definition, one for numeric and two for decimal. This is an improvement over InterBase 5, which made it impossible to distinguish between numeric(3,0) and smallint, since the latter is the underlying storage of the former. In InterBase 5, other cases could be distinguished by looking directly at the scale (which doesn't help when the scale is zero), but now a formal behavior has been implemented.

For string fields, the aforementioned overloading applies:

 #define dsc_ttype dsc_sub_type

It means that dsc_ttype is a synonym for dsc_sub_type and is used to keep the text type. This is a combination of the character set and the collation and is achieved with the following macros in intl.h:

 #define INTL_ASSIGN_DSC(dsc, cs, coll)      { (dsc)->dsc_sub_type = (SSHORT) ((coll) << 8 | (cs)); }
 #define INTL_GET_CHARSET(dsc) ((SCHAR)((dsc)->dsc_sub_type & 0x00FF))
 #define INTL_GET_COLLATE(dsc) ((SCHAR)((dsc)->dsc_sub_type >> 8))

Basically, those macros get and set the charset and collation for string fields. It becomes evident that dsc_sub_type, a signed short, uses its high order byte to hold the collation and its low order byte to hold the charset. If a UDF needs the specific charset (for MBCS operations, for example), it should use functionality that resembles those macros.
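For a UDF that can't include intl.h, those macros can be restated as plain functions, assuming the layout just described: collation in the high byte, charset in the low byte of dsc_sub_type. Note the original macros cast to SCHAR, while this sketch returns unsigned char for simplicity, and the function names are invented:

```cpp
// Pack a charset id and a collation id into a text-type value, the way
// INTL_ASSIGN_DSC fills dsc_sub_type (collation << 8 | charset).
short intl_assign(unsigned char charset, unsigned char collation)
{
    return static_cast<short>((collation << 8) | charset);
}

// Low byte: character set (cf. INTL_GET_CHARSET).
unsigned char intl_get_charset(short dsc_sub_type)
{
    return static_cast<unsigned char>(dsc_sub_type & 0x00FF);
}

// High byte: collation (cf. INTL_GET_COLLATE).
unsigned char intl_get_collate(short dsc_sub_type)
{
    return static_cast<unsigned char>((dsc_sub_type >> 8) & 0x00FF);
}
```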

For blobs, the sub type is really the blob's sub type. Again, the values are present in the system tables:

 SQL> select rdb$type || '=' || rdb$type_name
 CON> from rdb$types
 CON> where rdb$field_name='RDB$FIELD_SUB_TYPE';
 ======================================
 1=TEXT
 2=BLR
 3=ACL
 4=RANGES
 5=SUMMARY
 6=FORMAT
 7=TRANSACTION_DESCRIPTION
 8=EXTERNAL_FILE_DESCRIPTION

The default sub type for a blob is zero, which is a binary blob (no symbolic name). When declaring a blob field, either the number or the symbolic name (if available) can be used. Text is the sub type for memos or very large varchar-like fields. The rest of the named sub types are for internal use. Interestingly, sub type format is used to keep an array of DSC structures. Positive numbers are reserved for the system, but since dsc_sub_type is a signed short integer, any negative value can be used by an application. Remember, the engine doesn't enforce the sub type of a blob; it's a mere indication for the application. Some extra checks may be made internally (for example, considering the charset only for sub_type text), but the engine doesn't check that the format of the information kept in a blob corresponds to the declared sub type. The sub type is also used to select and execute the correct blob filter internally. Blob filters are a variety of UDFs that convert one blob sub type into another.

Blobs only have character sets, not collations. Hence, the charset of a blob is kept in dsc_scale instead, since a scale doesn't make sense for a blob. See the previous discussion about dsc_scale.


dsc_flags

Currently, the flags have limited use. These are the accepted values:

 /* Note: DSC_null is only reliably set for local variables (blr_variable) */
 #define DSC_null            1
 #define DSC_no_subtype      2 /* dsc has no sub type specified */
 #define DSC_nullable        4 /* not stored. instead, is derived
                             from metadata primarily to flag
                             SQLDA (in DSQL) */

The only flag that may be important to the UDF is DSC_null, an indication that there's no value but NULL in the parameter being examined. This is one of the advantages of using descriptors instead of pointers to values (although the reference mechanism could have achieved the same effect by passing null pointers to signal SQL nulls). These flags may come bitwise OR'ed from the engine, so use

 descriptor->dsc_flags & DSC_null

to test for null. Anyway, I've found that the engine is inconsistent in signaling nulls, at least when UDFs are invoked.

This may come from the fact that the engine has a null flag in the internal request itself, so it doesn't rely exclusively on the descriptor's flags. To make the UDF crash-proof, it needs to do the following check:

 if (descriptor->dsc_flags & DSC_null || !descriptor->dsc_length || !descriptor->dsc_address)
    assume the parameter is null

so flexibility comes at a cost sometimes. To return NULL, you only need to return a null pointer. If the UDF uses the RETURNS PARAMETER <n> syntax, you should activate the null flag with

 descriptor->dsc_flags |= DSC_null

because the engine doesn't pay any attention to the return value in such case; instead it examines the input argument that was marked as the return argument by its position.
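The defensive check above can be packaged as a helper. This is a standalone sketch with the structure and the DSC_null value restated locally; real code should use PARAMDSC and the flag definitions from ibase.h:

```cpp
#include <cstdint>

const std::uint16_t DSC_null = 1;  // value from the flag list above

// Minimal descriptor mirror for the sketch; a real UDF uses PARAMDSC.
struct paramdsc_sketch {
    unsigned char  dsc_dtype;
    signed char    dsc_scale;
    std::uint16_t  dsc_length;
    short          dsc_sub_type;
    std::uint16_t  dsc_flags;
    unsigned char *dsc_address;
};

// Treat the argument as SQL NULL if the flag is set or the descriptor
// carries no usable data, as the crash-proofing check recommends.
bool is_sql_null(const paramdsc_sketch *d)
{
    return (d->dsc_flags & DSC_null) != 0
        || d->dsc_length == 0
        || d->dsc_address == nullptr;
}
```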


dsc_address

Although it's declared as a pointer to unsigned char, it can convey any data type the engine knows about; depending on dsc_dtype, a hard type cast is required. It could have been declared as a void pointer (an untyped pointer in Pascal). I can speculate that one reason is that the void pointer wasn't an original construction in C, so probably not all compilers supported it. In other constructions, a pointer to integer was used (the blobcallback structure that appears in ibase.h is an example where void pointers replaced the integer pointers of the original declaration inside the engine). The advantage of a pointer to char is that it allows direct pointer arithmetic to access a location N bytes ahead without any extra cast. Looking at dsc_dtype, the pointer should be cast to the appropriate pointer-to-type and then dereferenced to get the value. The value should come already aligned.
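The cast-then-dereference step can be sketched as a switch over dsc_dtype. Only a few numeric types are shown here, and the function name is invented; a real UDF must handle every type it declares and reject the rest:

```cpp
#include <cstdint>

// dtype tags from the engine's list (see the dsc_dtype section).
const unsigned char dtype_short  = 8;
const unsigned char dtype_long   = 9;
const unsigned char dtype_double = 12;

// Sketch: dereference dsc_address according to dsc_dtype.
double fetch_numeric(unsigned char dsc_dtype, unsigned char *dsc_address)
{
    switch (dsc_dtype) {
    case dtype_short:
        return *reinterpret_cast<short *>(dsc_address);
    case dtype_long:
        return *reinterpret_cast<std::int32_t *>(dsc_address);
    case dtype_double:
        return *reinterpret_cast<double *>(dsc_address);
    default:
        return 0.0;  // unknown type: a real UDF should signal an error
    }
}
```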

In the explanation of dsc_length, it was said that the varchar type receives special treatment. This is because it holds the effective length of the string, without more trailing blanks than the ones the user inserted. Therefore, dsc_address is not the address of the data, but the address of the header. One solution is to cast dsc_address to the declaration already explained:

 typedef struct paramvary {
     ISC_USHORT    vary_length;
     unsigned char     vary_string [1];
 } PARAMVARY;

It was also said that the effective string length is

 paramvary* v = reinterpret_cast<paramvary*>(descriptor->dsc_address);
 ISC_USHORT len = v->vary_length;
 len = min(descriptor->dsc_length - sizeof(ISC_USHORT), len);
 unsigned char *str = v->vary_string;

At least this is how the engine guarantees internally that it won't read outside the valid bounds of the varchar field.

Remember that this string isn't null terminated, since the length is given separately. The same applies to char, which only requires dsc_length directly. The other solution is to work directly on the original field:

 ISC_USHORT len = *reinterpret_cast<ISC_USHORT*>(descriptor->dsc_address);
 len = min(descriptor->dsc_length - sizeof(ISC_USHORT), len);
 unsigned char *str = descriptor->dsc_address + sizeof(ISC_USHORT);

Both ways are equivalent; choose the one that looks more elegant to you. I think there cannot be problems with the short integer members of the DSC structure between platforms that place the most significant byte in different positions (the known little endian vs. big endian problem, whose names come from Gulliver's Travels). The reason is that the UDF is installed on the same machine as the engine, so they share the hardware architecture. There's no way to have a remote UDF installed on another machine, hence no hardware mismatch can happen. If you need the UDF for another platform where the engine runs, you must recompile it, and again both use the same platform.


Caveat

  • Unlike most internal declarations, where the uppercased name denotes a pointer to the structure of the same name in lowercase, DSC is the structure itself, so your UDF needs to declare DSC pointers for its arguments. The declaration in ibase.h mimics the internal one.
  • There's a hard-coded limit of 10 arguments per UDF. This limit includes the return argument, so in practice you have 9 input arguments. It should be enough for most users; several UDFs use only one or two input arguments.
  • There's only one return argument and it's mandatory. If you don't need it, declare it as smallint or int with the by-value mechanism and return, for example, zero and one to signal success or failure.
  • When the engine realizes that a parameter uses descriptors, it doesn't pay much attention to the formal parameter declaration (the data type) and passes whatever it has to the UDF. This is why FBUDF can handle different integral types with more than one declaration and only two functions. Had only one declaration existed, both numbers and strings would have been passed to the same UDF and it would be a mess, since returning a string from the UDF requires special code, same as with traditional UDFs.
  • The hack that allows returning a value by descriptor didn't exist in InterBase and is only a hack. It doesn't allow you to change the return type. If you get an int64 in a function expecting int and return it, even if you detect it properly in the function, it will overflow silently when the engine takes the address of the returned descriptor and applies the (fixed) declared return data type. There should be a method to allow the UDF to signal overflow. One workaround is to set dsc_dtype to zero so the engine complains, but if this is standardized as a way to signal overflow, the engine should complain with the proper message.
  • UDFs that divide by zero will cause Superserver to terminate. The reason is that Superserver decides to shut itself down when a UDF executes a failed operation, due to the risk for the data. Failed operations include division by zero, some interrupts, invalid memory access and other operating system specific exceptions. Why division by zero was considered so dangerous is something I don't know.

See also:
User-defined function (UDF)
DECLARE EXTERNAL FUNCTION
Passing NULL to UDFs in Firebird 2
How to write an internal UDF function
DECLARE EXTERNAL FUNCTION (incorporating a new UDF library)
Threaded Server and UDFs
External functions
External functions (UDFs)
