1. Approach

For small embedded applications, string processing may seem unnecessary. But it is part of the C and C++ standards, and its use cannot be excluded, so the compilers support the standard string processing approaches. Richer embedded applications do deal with strings, for example:

  • A human interface with simple commands and parameter handling. Commands and parameters are strings.

  • Messages, errors, warnings and process messages stored in plain text, possibly output via string-oriented interfaces such as serial communication (UART) or Ethernet protocols.

The standard C string functions are offered by the compilers, but it is not necessary to use them. Especially on small processors these functions often need too much effort. A printf with text substitutions is not simple in machine code. It should not be used if it is not necessary.

2. Definition of String capabilities in applstdef_emC.h

For the sources of src_emC some compiler switches are used which should be defined in applstdef_emC.h. These defines can also be used in the user’s sources to adapt them to different platforms:

  • #define DEF_NO_StringUSAGE: If this is set, the use of strings is generally prevented in the emC sources. This is intended for small or poor processors which do not need string information and which have too little memory to store (unused) strings. For example, the classJc struct is defined without the name of the type, only with a type identifier. For exception handling no textual information is stored (it needs space), only error numbers and line numbers. The StringJc type is nevertheless defined (unconditionally); the definition needs no space. It can be used, for example, for very short string information in the user’s sources. The header files emC/Base/String…​h are not necessary in such projects.

  • #define DEF_CharSeqJcCapabilities: If this is set, DEF_REFLECTION_FULL or DEF_ClassJc_Vtbl should be set as well. With this capability a StringJc can also be a CharSeqJc, which offers the routines charAt_CharSeqJc(…​), length_CharSeqJc() and subSequence_CharSeqJc(…​). These three routines are also defined in Java in java.lang.CharSequence. It is possible to get characters from any instance which implements this interface. CharSeqJc is defined for C usage without the virtual operations of C++, see Virtual Operations in C++, alternatives, chapter Solution: Vtbl referenced from reflection.

  • If neither definition is set, StringJc is used; which part of its functionality is used depends on the sources.

3. Non 0-terminated StringJc

The zero-terminated string is one of the basic ideas of the C language. Passing a string argument to a function is very simple: only the pointer is necessary. The end is found by the character '\0' while the string is processed. Most other languages use a pointer to the character array together with its length.

C has a known problem with string overflow. One of the reasons is: if the length is not known (the '\0' must first be searched for), then it cannot be used in calculations such as bounds checks.

For example, take the simple function strlen(text): if the text pointer refers to a faulty location (because of an error in software that is not yet finished), which may be filled with, for example, AA AA …​ up to the end of the memory, then this function searches for a long time. The search may wrap around at address 0, it may or may not find a terminating 0, and the whole system may crash because of a time overflow in an interrupt.

Hence: plain 0-terminated strings are inappropriate, unfortunate, or even dangerous.

But it is sensible to use 0-termination inside a given length. For example, there is a buffer for the name of a parameter inside a larger data struct:

char nameParam[32]; float valueParam;

The name can have up to 32 characters. But it may be shortened internally by a \0.
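
A minimal sketch of how such a field can be handled with standard C only. The struct name Param_sketch and both helper names are hypothetical and not part of emC. Bounding every access by sizeof(nameParam) works both for names shortened by a \0 and for names that fill all 32 characters:

```c
#include <string.h>

/* Hypothetical parameter struct, only for illustration. */
typedef struct Param_sketch_T {
  char nameParam[32];   /* 0-terminated only if shorter than 32 chars */
  float valueParam;
} Param_sketch;

/* Stores the name. strncpy fills the rest of the field with 0 and
 * writes no terminator if name has 32 or more characters. */
void setName_sketch(Param_sketch* thiz, char const* name) {
  strncpy(thiz->nameParam, name, sizeof(thiz->nameParam));
}

/* Returns 1 if the stored name equals key, compared within the field size. */
int hasName_sketch(Param_sketch const* thiz, char const* key) {
  return strncmp(thiz->nameParam, key, sizeof(thiz->nameParam)) == 0;
}
```

Note that only the first 32 characters of key take part in the comparison, which is the intended limit of the field.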

4. Supplementation of unsafe basic string functions

See, for example, the forum thread msdn…​strcpy-is-unsafe-consider-using-strcpys-instead or msdn …​ strncpy_s. Microsoft offers strncpy_s(…​) in its Visual Studio compiler instead of strncpy(…​), although strncpy(…​) is safe (in comparison to strcpy, which is really unsafe). There are many hits on the internet about this problem. The argument for strncpy_s(…​) is that strncpy(…​) does not guarantee a 0-terminated result string if the source is at least as long as the result buffer. But that is not a problem of this function (https://linux.die.net/man/3/strncpy), but a misunderstanding of its usage.

Because there is a lot of uncontrolled growth here, emC offers some functions of its own in emC/Base/StringBase.h and …​c:

4.1. strnlen_emC

The strlen(…​) routine of standard C is really unsafe and should never be used. The known strnlen(…​) is safe, see the description at https://man7.org/linux/man-pages/man3/strnlen.3.html. But this applies to gcc. Visual Studio warns about the use of strnlen(…​). Other platforms may offer other implementations. Because the routine is really simple and can be adapted for special embedded platforms, it is offered as an emC version too. If emC is in use, this function should be used.

The maximal expected length of a string should always be known. For example, it is the length of the buffer containing the string. Then the 0-termination is not necessary if the string fills the whole buffer. To be compatible with strlen(…​) in legacy sources, the length value can be set to INT_MAX. Then the behavior is the same. It is better to use an expected value derived from a known memory size or from a known maximal length of strings in the application.
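
The idea of a bounded length search fits in a few lines. The following is only a sketch: the signature of strnlen_emC is not shown in this chapter, so the name boundedLength_sketch and its argument list are assumptions for illustration.

```c
/* Hypothetical bounded length search, not the emC implementation.
 * Returns the number of characters before the first '\0',
 * but never more than maxNrofChars. */
int boundedLength_sketch(char const* text, int maxNrofChars) {
  int len = 0;
  while (len < maxNrofChars && text[len] != '\0') {
    ++len;
  }
  return len;
}
```

For a buffer that is completely filled, without a terminating 0, the result is simply the buffer size given as maxNrofChars.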

4.2. strncmp_emC

The strcmp(…​) routine of standard C is unsafe because the comparison is not bounded. The strncmp(…​), see https://man7.org/linux/man-pages/man3/strncmp.3.html, is safe. But this routine can be improved by returning the position of the first differing character. This is often necessary or (in debug situations) nice to know.

int strncmp_emC(char const* const text1, char const* const text2, int maxNrofChars);

It returns 0 if both texts are equal up to a found \0 or up to maxNrofChars. It returns <0 or >0 in the same manner as the original strncmp(…​), but the absolute value is the position of the first difference, counted from 1. If the first characters text1[0] and text2[0] differ, it returns 1 or -1. Comparing "abcde" and "abcdx" yields the absolute value 5 because the 5th character differs (with negative sign, because 'e' is less than 'x'). The original C routine strncmp(…​) guarantees only the sign of the result, not its value.
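
The described return value can be sketched as follows. This is not the emC source, only an illustration of the documented behavior under the name strncmp_sketch (hypothetical); characters are compared as unsigned char, as strncmp(…​) does.

```c
/* Sketch of a bounded compare which returns the position of the
 * first difference (counted from 1) with the sign of the comparison. */
int strncmp_sketch(char const* text1, char const* text2, int maxNrofChars) {
  for (int ix = 0; ix < maxNrofChars; ++ix) {
    unsigned char c1 = (unsigned char)text1[ix];
    unsigned char c2 = (unsigned char)text2[ix];
    if (c1 != c2) {
      return c1 < c2 ? -(ix + 1) : (ix + 1);  /* position with sign */
    }
    if (c1 == 0) {
      return 0;   /* both texts end here, they are equal */
    }
  }
  return 0;       /* equal within maxNrofChars */
}
```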

4.3. strpncpy_emC

The strcpy(…​) routine of standard C is unsafe and should never be used, because with faulty arguments the destination is overwritten in an uncontrolled way, which can destroy the whole application. A simple example shows this:

char srcBuffer[6];  //declared as not initialized
//... and forgotten to write or the expected routine was not called.
char dstBuffer[20];   //may be large enough ...?
strcpy(dstBuffer, srcBuffer); //seems to be safe in the eyes of the developer

This corrupts the whole memory if srcBuffer does not contain a '\0' character and is not followed by any 0 in memory. The cause is trivial and the result is catastrophic. Never use the plain strcpy(…​).

Another, similar example:

char srcBuffer[6] = {0};  //better, it is initialized.
strcpy(srcBuffer, "ident"); //correct. Safe. 0-terminated
srcBuffer[5] = nr;    //user expect "ident5" or such, the 0-termination is missed.
char dstBuffer[20];   //may be large enough ...?
strcpy(dstBuffer, srcBuffer); //seems to be safe in the eyes of the developer

This also corrupts memory, because of the missing 0-termination caused by a possible thinking error regarding srcBuffer. Never use the plain strcpy(…​).

The strncpy(…​) is safe, but not in the eyes of all developers, because it is sometimes complicated:

char dstBuffer[20];   //may be large enough ...? but not initialized
strncpy(dstBuffer, srcBuffer, sizeof(srcBuffer));

Following the problems of srcBuffer above, this limits the number of characters copied from srcBuffer to the given size of srcBuffer, which prevents the error above. But there are two further problems:

  • 1) The content of dstBuffer may not be 0-terminated.

  • 2) Should strncpy primarily be used to prevent overwriting too much memory in the destination, or to determine the number of characters taken from src? The intent of the length argument is not obvious to every user.
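
Both points can be handled with standard C if the length argument is consistently derived from the destination and the termination is written explicitly. A minimal sketch; the helper name copyTerminated_sketch is hypothetical:

```c
#include <string.h>

/* Copies at most zDst-1 characters and always terminates dst.
 * zDst is the size of the destination buffer and must be > 0. */
void copyTerminated_sketch(char* dst, size_t zDst, char const* src) {
  strncpy(dst, src, zDst - 1);  /* length argument describes the destination */
  dst[zDst - 1] = '\0';         /* strncpy does not terminate if src is too long */
}
```

This still requires a 0-terminated src (or a src at least zDst-1 characters long), which is the reason why a routine with separate lengths for source and destination is more general.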

Apparently because of effect 1), Microsoft’s Visual Studio offers its own solution (tested with Visual Studio 2015, on a version updated as of 2021-05-23):

char dst[40] = {0};  //initializes the whole content with 0
#ifdef __COMPILER_IS_MSVC__
dst[6] = 'Q';
int okMsc = strncpy_s(dst, "abcde", 5);
...

Given the known meaning of the arguments of strncpy, one might expect this routine to copy the given five characters. But it additionally sets a terminating \0 at the 6th position and overwrites the 'Q'; it writes 6 characters instead of 5 as the original strncpy(…​) does. In Microsoft’s thinking the 0-termination is more important than compatibility with other given standards. The thinking may be that the length argument is the length of the source. Then Microsoft’s behavior is understandable. But the usual problem is that overwriting the destination should be prevented. Hence the length argument should follow the situation at the destination. More exactly, both should be regarded: an exactly given number of characters for source strings which are not 0-terminated, and the size of the destination buffer. With respect to the destination length, the behavior of Microsoft’s solution is wrong.

The solution offered by emC is strpncpy_emC(…​):

strpncpy_emC covers all requirements for safely copying and concatenating strings with or without zero termination. See emC/Base/StringBase_emC.h:

int strpncpy_emC(char* dst, int posDst, int zDst, char const* src, int zSrc);

dst is the destination buffer, which is written starting from posDst. zDst is the length of the buffer which may be filled with characters; it determines the memory that is used and written. src is the source string, either with the length given in zSrc or up to a found '\0' character. In particular, if zSrc is -1 (or any value <0), src is used up to its 0-termination. But dst never overflows, because its extent is determined by zDst. The return value is the number of copied characters, not counting a possibly copied terminating 0. It can simply be added to posDst for appending. Last but not least, this operation does not spend unnecessary calculation time filling the destination with 0 up to its end. Usually the destination should have been filled by an earlier operation; otherwise one terminating '\0' character is enough.
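
The argument handling described above can be sketched as follows. This is not the emC source but an illustration under the hypothetical name strpncpy_sketch. It assumes that a terminating '\0' is written whenever there is still room in dst, and that an invalid posDst copies nothing; the real routine may differ in such details.

```c
/* Sketch: copies src to dst starting at posDst, never writing at or
 * beyond dst[zDst]. zSrc < 0 means: use src up to its '\0'.
 * Returns the number of copied characters without the terminator. */
int strpncpy_sketch(char* dst, int posDst, int zDst, char const* src, int zSrc) {
  if (posDst < 0 || posDst >= zDst) {
    return 0;                         /* no room at all */
  }
  int room = zDst - posDst;           /* characters which may be written */
  int n = 0;
  while (n < room && (zSrc < 0 || n < zSrc) && src[n] != '\0') {
    dst[posDst + n] = src[n];
    ++n;
  }
  if (n < room) {
    dst[posDst + n] = '\0';           /* terminate only if there is room */
  }
  return n;
}
```

Because the return value excludes the terminator, the next call can start exactly at the position of the '\0' written by the previous one.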

If the arguments do not fit together, in particular if a given src string cannot be copied as a whole to the given dst because the number of characters that can be copied (zDst - posDst) is too small, this can be seen either as an error (the given string cannot be copied) or as expected behavior (fewer characters are copied because of the size of the destination). If it were seen as an error, either an exception should be thrown or some simply testable output should be given. But it is defined as expected behavior. The resulting string can simply be checked while debugging, and the result is obvious. Hence no exception is thrown. It is possible to compare the return value with a known length of the source, or to test the source at the returned position, to detect a truncation:

char dstBuffer[40] = {0};
int zDst = sizeof(dstBuffer);  //it is 40
int nrofCharsCopied = strpncpy_emC(dstBuffer, 0, zDst, src, -1);
if(src[nrofCharsCopied] !=0) {
  ....

Here it is expected and tested that src is always copied up to its 0-termination.

The operation is safe; an undesired memory location is never overwritten. See the example for concatenation:

char dstBuffer[40] = {0};
int zDst = sizeof(dstBuffer);  //it is 40
int posDst = strpncpy_emC(dstBuffer, 0, zDst, "first string", -1);
if(posDst < zDst) { dstBuffer[posDst++] = nr; } //write a number as char between
posDst += strpncpy_emC(dstBuffer, posDst, zDst, "  ", 2); //typical append
//append a numeric value
posDst += toString_int32_emC(dstBuffer + posDst, zDst - posDst, nr2, -1);
posDst += strpncpy_emC(dstBuffer, posDst, zDst, " kg\n", -1); //typical append

This is pure C. Using C++ with overloaded operators is simpler to read, but it is based on the same routines. As you can see, the argument posDst makes concatenation easier. The called routine toString_int32_emC(…​) does not support this handling of posDst; instead a pointer addition and a subtraction are needed in the call, which is equivalent in execution but a little more complex in the source.

4.4. strnchr_emC, searchChar[Back]_emC

Remarks similar to those for strcmp etc. apply to strchr. This function is not safe, because it may crash if, by some mistake, the source string is not 0-terminated. The original routine in emC is:

int searchChar_emC ( char const* text, int zText, char cc);

The length of the text is bounded by zText. A 0-termination is regarded if the 0 is found, but it is not necessary. The return value is the position of the found character, or -1 if it is not found. This follows known and proven Java concepts.

char const* strnchr_emC ( char const* text, int cc, int maxNrofChars)

is also supported. The difference is only the return value. It is compatible with the known strnchr(…​), which is not part of standard C but is offered on some platforms.

searchCharBack_emC ( char const* const text, char cc, int fromIx);

is the corresponding solution for searching the last occurrence of cc in the given text.
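
The forward search with a bounded length and an index as return value can be sketched in a few lines. The name searchChar_sketch is hypothetical; the argument list follows the prototype of searchChar_emC shown above, the body is only an illustration and not the emC source.

```c
/* Sketch: returns the index of the first cc in text, or -1.
 * The search ends at zText characters or at a '\0', whichever is first. */
int searchChar_sketch(char const* text, int zText, char cc) {
  for (int ix = 0; ix < zText && text[ix] != '\0'; ++ix) {
    if (text[ix] == cc) {
      return ix;
    }
  }
  return -1;
}
```

Returning an index instead of a pointer has the advantage that the result can be used directly as a length or as a start position for a following operation.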

4.5. searchString[Back]_emC

This is the replacement for the strstr(…​) function, but with simpler usage following the Java approach in java.lang.String.indexOf(String).

int searchString_emC ( char const* text, int maxNrofChars, char const* search, int zs)

As with the other replacements, a 0-termination is not needed. search is terminated either by a '\0' or by the given length zs. The text in which the search takes place is likewise bounded either by a '\0' or by the given length maxNrofChars. The return value is either the found position or -1 if not found.

int searchStringBack_emC ( char const* text, int maxNrofChars, char const* search, int zs)

is the corresponding function for searching the last occurrence of the given string. It corresponds to the Java core function java.lang.String.lastIndexOf(String).
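
A sketch of the forward search with two bounded strings, under the hypothetical name searchString_sketch. It is not the emC source. It assumes that a negative zs means "search is 0-terminated" (in analogy to zSrc of strpncpy_emC) and that an empty search string is found at position 0, as in Java.

```c
#include <string.h>

/* Sketch: returns the index of the first occurrence of search in text, or -1. */
int searchString_sketch(char const* text, int maxNrofChars,
                        char const* search, int zs) {
  /* Determine the effective lengths, bounded by length argument or '\0'. */
  int zText = 0;
  while (zText < maxNrofChars && text[zText] != '\0') { ++zText; }
  int zSearch = 0;
  while ((zs < 0 || zSearch < zs) && search[zSearch] != '\0') { ++zSearch; }
  /* Try each start position where search still fits into text. */
  for (int pos = 0; pos + zSearch <= zText; ++pos) {
    if (memcmp(text + pos, search, (size_t)zSearch) == 0) {
      return pos;
    }
  }
  return -1;
}
```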

4.6. searchAnyChar[Back]_emC

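This is the replacement for the functions of the strspn(…​) family (the closest are strcspn(…​) and strpbrk(…​), which search for any character out of a set), but with simpler usage.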

int searchAnyChar_emC ( char const* text, int maxNrofChars, char const* chars, int zs)

As with the other replacements, a 0-termination is not needed. chars is terminated either by a '\0' or by the given length zs. The text in which the search takes place is likewise bounded either by a '\0' or by the given length maxNrofChars. The return value is either the found position or -1 if not found.

int searchAnyCharBack_emC ( char const* text, int maxNrofChars, char const* chars, int zs)

is the corresponding function for searching the last occurrence of any one of the chars in the given string.

4.7. Necessity of printf? toString_int32/float_emC instead

printf is at the core of C usage, see all the "Hello world" examples. But printf is a really complex operation. If it is used, it increases the memory footprint, which matters especially on poor embedded controllers with small capabilities.

What is really necessary? Sometimes numeric values have to be presented as strings on poor controllers too, for example to output values via UART communication. The whole capability of the C standard printf is often too much.

The simple routine

int toString_int32_emC(char* buffer, int zBuffer, int32 value
  , int radix, int minNrofCharsAndNegative);

offers a conversion from a numeric value to a string in decimal or hexadecimal presentation. It avoids division, because division may be an expensive operation in calculation time on some processors.

This solution fits into the approach for string concatenation. It avoids the need to include the library functions for printf, which are often more extensive.
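
One common way to convert to decimal without division is to subtract powers of ten repeatedly. The following sketch shows this technique only; it is not the emC source, and the name decimalNoDiv_sketch as well as the reduced argument list (no radix, no minimal width) are assumptions. It writes no terminating 0 and silently drops digits which do not fit into the buffer.

```c
#include <stdint.h>

/* Powers of ten for the subtraction method, largest first. */
static const uint32_t kPow10_sketch[10] = {
  1000000000u, 100000000u, 10000000u, 1000000u, 100000u,
  10000u, 1000u, 100u, 10u, 1u
};

/* Sketch: writes value in decimal to buffer, returns the number of chars. */
int decimalNoDiv_sketch(char* buffer, int zBuffer, int32_t value) {
  int pos = 0;
  uint32_t rest = (uint32_t)value;
  if (value < 0) {
    rest = 0u - rest;                 /* magnitude, also correct for INT32_MIN */
    if (pos < zBuffer) { buffer[pos++] = '-'; }
  }
  int started = 0;                    /* becomes 1 after the first digit */
  for (int ix = 0; ix < 10; ++ix) {
    char digit = '0';
    while (rest >= kPow10_sketch[ix]) {   /* at most 9 subtractions per digit */
      rest -= kPow10_sketch[ix];
      ++digit;
    }
    if (digit != '0' || started || ix == 9) {  /* skip leading zeros */
      started = 1;
      if (pos < zBuffer) { buffer[pos++] = digit; }
    }
  }
  return pos;
}
```

The return value is the number of written characters, so the routine can be used in a concatenation in the same way as the copy routine above.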

4.8. Scanning of numeric values: parse[Int|Float]Radix_emC

This is the counterpart to the scanf approaches, simpler than the original standard scanf(…​).

int parseIntRadix_emC ( const char* src, int zSrc, int radix, int* parsedChars
                      , char const* addChars);

It returns the number. This function is a jack of all trades with respect to the necessities, though it is short in code (200 bytes on a TI320 controller) and short in execution time. The arguments, in particular addChars, control its abilities:

  • parsedChars is the second output of the function: via the given integer pointer it delivers the number of characters parsed for the number. It is usually necessary in the parsing process, but it can be omitted if not needed by passing a null pointer.

  • If radix = 10 it parses decimal, with radix = 16 or = 8 hexadecimal or octal, or any other radix. A radix other than 10 or 16 may rarely be needed, but the radix is simply a number used for calculation and comparison.

  • The hexa digits are 'a'..'f' or 'A'..'F', and further letters up to 'z' for a radix up to 36 (if desired).

  • If addChars = null or empty, it parses only the digits.

  • A space as first char, addChars =" ", forces skipping over leading spaces and tabs; a "\n" in the first position forces skipping over all leading white space.

  • A minus, addChars ="-", in the first position, or in the second position as in addChars =" -" or ="\n-", accepts a '-' for a negative number. A plus in that position, addChars =" +", additionally accepts a '+' character (positive value), which is generally unnecessary but thereby possible.

  • A space or newline after the minus or plus, addChars ="- ", accepts spaces or tabs after the sign up to the start of the number; the same holds for addChars ="+ ". The combination addChars ="\n+ " accepts white space before the number and spaces or tabs between sign and number.

  • An 'x' after these designations or at the start, addChars ="x", accepts switching to hexa numbers on a given 0x designation in the number. This means addChars ="\n+ x" accepts white space before the sign, the sign, spaces and tabs after the sign, and then a 0x to detect a hexa number.

  • All other characters in addChars, as in " '| ,", are characters which may be placed between the digits. Especially the space to group digits in one number may be sensible for some applications. A space as separator must not be written in the first positions, see above, but in following positions:

    • addChars ="  " (two spaces are given) accepts leading spaces and tabs, and then spaces (not tabs) between the digits.

    • addChars ="1 " does not accept leading spaces (the space is not in the first position) but accepts spaces as separators inside the number. The "1" has no meaning; it cannot be a separator because it is a digit. It is a little helper so that the space is not in the first position.

    • addChars =" x " accepts for example " 0x 34 56 ca FE" as hexa number.

    • addChars =" x" accepts only " 0x3456caFE", not with spaces between the digits, but

    • addChars =" x'" accepts also " 0x3456'caFE" for better readability.

There may be some combinations which are not sensible but formally possible. The operation is not intended as a strict parser; it is intended to read numbers in a freer style, for example in a parameter list:

  param1 = + 456.45
  param2 = 0x100
  key = 234 356 357
  key2 = 0xcafe'affe

The numbers are parsed in any case after the '='; all these forms are accepted with addChars =" + x '". The white space form with "\n" is not necessary in most cases but possible if strict white space handling is desired. The floating point number is parsed with

float parseFloat_emC ( const char* src, int size, int* parsedChars);

which uses the parseIntRadix_emC(…​) for the integer and fractional part and for the exponent.
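
The idea of separator characters between the digits, as controlled by addChars, can be illustrated with a strongly reduced sketch. It is not the emC source: the name parseDigits_sketch and the argument separators are hypothetical, sign, leading white space and 0x handling are left out, and an overflow of the number is not checked.

```c
#include <string.h>

/* Sketch: parses digits of the given radix, skipping separator chars
 * between digits. *parsedChars gets the end position of the last digit. */
int parseDigits_sketch(char const* src, int zSrc, int radix,
                       int* parsedChars, char const* separators) {
  unsigned value = 0;
  int lastDigitEnd = 0;
  for (int ix = 0; ix < zSrc && src[ix] != '\0'; ++ix) {
    char c = src[ix];
    int digit = -1;
    if (c >= '0' && c <= '9')      { digit = c - '0'; }
    else if (c >= 'a' && c <= 'z') { digit = c - 'a' + 10; }
    else if (c >= 'A' && c <= 'Z') { digit = c - 'A' + 10; }
    if (digit >= 0 && digit < radix) {
      value = value * (unsigned)radix + (unsigned)digit;
      lastDigitEnd = ix + 1;
    } else if (lastDigitEnd > 0 && separators != NULL
               && strchr(separators, c) != NULL) {
      /* separator after at least one digit: skip it */
    } else {
      break;                        /* any other character ends the number */
    }
  }
  if (parsedChars != NULL) { *parsedChars = lastDigitEnd; }
  return (int)value;
}
```

Trailing separators are not counted in parsedChars, so the caller can continue parsing directly after the last digit.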

5. StringJc as unique string representation with capabilities

The struct StringJc in emC presents a string representation with its length.

One of the basic ideas in the development was: it should be possible to return it by value and also to pass it by value in calls:

StringJc myStringOperation ( StringJc inp ) {
  StringJc val = //build the String maybe from inp
  return val;
}

Older compilers optimized the case where the returned value could be passed in two CPU registers. Hence the struct StringJc was defined following this scheme:

//in applstdef_emC.h:
#define VALTYPE_AddrVal_emC int16       //for a small processor
//in emC/Base/types_def_common.h:

#ifndef VALTYPE_AddrVal_emC            //possible to define in applstdef_emC.h
  #define VALTYPE_AddrVal_emC int32    //the default
#endif

#define STRUCT_AddrVal_emC(NAME, TYPE) \
struct NAME##_T { TYPE* addr; VALTYPE_AddrVal_emC val; } NAME

/**The type AddrVal_emC handles with a address (pointer) for a 8 byte alignment. */
typedef STRUCT_AddrVal_emC(AddrVal_emC, Addr8_emC);

/**Defines a struct with a byte address and the length. */
typedef STRUCT_AddrVal_emC(int8ARRAY, int8);

typedef STRUCT_AddrVal_emC(int16ARRAY, int16);

typedef STRUCT_AddrVal_emC(int32ARRAY, int32);

typedef STRUCT_AddrVal_emC(int64ARRAY, int64);

typedef STRUCT_AddrVal_emC(floatARRAY, float);

typedef STRUCT_AddrVal_emC(doubleARRAY, double);

This makes it possible to have a data struct consisting of a pointer and a value that fits the bit width of the processor. With today’s experience, compl_adaption.h can be adapted so that this data struct is returned by value in registers.

StringJc is defined in the same way. But to support more capabilities the address is a union:

//definition of StringJc to use this type before including emC/StringJc
typedef struct StringJc_T {
  union CharSeqTypes_T {
    char const* str;
    struct StringBuilderJc_t* bu;
    struct ObjectJc_T const* obj;
    #ifdef __cplusplus
    class CharSequenceJcpp* csq;
    #endif
  } addr;
  VALTYPE_AddrVal_emC val;    //Note: Use same type as in STRUCT_AddrVal_emC
} StringJc;
#define DEFINED_StringJc_emC

It is the definition corresponding to the other structs with an address and a length value, but the type can be used:

  • As a simple C string, 0-terminated or not, as a const string, immutable as in Java.

  • As a reference to a buffer to prepare a string, see chapter StringBuilderJc: Buffer to prepare Strings. Then it is a mutable string.

  • As a reference to any Object which offers operations according to Vtbl_CharSeqJc, see #Vtbl_CharSeqJc.

  • As a reference to any C++ instance of type CharSequenceJcpp, which is an interface to the C++ CharSequence with virtual operations.

This offers some capabilities of string processing which are closer to the Java language than to the C/C++ standards. In particular, the simple form using only the str element of this union is very simple and also suitable for small, poor processors. It is the pointer to a string, not necessarily zero-terminated, and the length is given in val.
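
The simple case, a pointer plus a length passed and returned by value, can be illustrated with a reduced stand-in type. Str_sketch and substring_sketch are hypothetical names and not part of emC; the real StringJc additionally carries the union and the control bits in val.

```c
#include <stdint.h>

/* Simplified stand-in for the simple case of StringJc: pointer plus length. */
typedef struct Str_sketch_T {
  char const* addr;
  int16_t val;          /* here only the length, without control bits */
} Str_sketch;

/* Returns the part of inp from 'from' (inclusive) to 'to' (exclusive)
 * by value. No characters are copied and no terminating 0 is needed. */
Str_sketch substring_sketch(Str_sketch inp, int from, int to) {
  Str_sketch ret;
  if (from < 0) { from = 0; }
  if (to > inp.val) { to = inp.val; }
  if (to < from) { to = from; }
  ret.addr = inp.addr + from;
  ret.val = (int16_t)(to - from);
  return ret;
}
```

Because the length is part of the value, a sub string refers to the same characters as the original without any change to the buffer.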

The val of this struct contains not only the length of the string but also some control bits. For simple usage 16 bits are sufficient; for more capabilities 32 bits are necessary for this val value.

The basic definition to evaluate this val is

//in applstdef_emC.h or in compl_adaption.h only if necessary:
#define mLength_StringJc  0x00003fff

This is the standard definition, also established in emC/Base/StringBase_emC.h:

#ifndef mLength_StringJc
#define mLength_StringJc 0x00003fff
#endif

Depending on this value some other definitions are contained in emC/Base/StringBase_emC.h:

#ifndef DEF_CharSeqJcCapabilities
  #define mVtbl_CharSeqJc 0
  #define kIsCharSeqJc_CharSeqJc 0
  #define kMaxNrofChars_StringJc (mLength_StringJc -2)
  #define mIsCharSeqJcVtbl_CharSeqJc 0
#else
  #define mVtbl_CharSeqJc (mLength_StringJc >>2)
  #define kIsCharSeqJc_CharSeqJc (mLength_StringJc -2)
  #define kMaxNrofChars_StringJc ((mLength_StringJc & ~mVtbl_CharSeqJc)-1)
  #define mIsCharSeqJcVtbl_CharSeqJc (mLength_StringJc & ~mVtbl_CharSeqJc)
#endif
#define kIs_0_terminated_StringJc (mLength_StringJc)
#define kIsStringBuilder_CharSeqJc (mLength_StringJc -1)
#define mNonPersists__StringJc       (mLength_StringJc +1)
#define mThreadContext__StringJc     ((mNonPersists__StringJc)<<1)

These definitions, based on mLength_StringJc == 0x3fff, define the following values:

  • 0x8000: mThreadContext__StringJc: This bit indicates that the StringJc is located inside the ThreadContext. It is allocated thread-locally and should be deallocated after usage.

  • 0x4000: mNonPersists__StringJc: If this bit is set, the string may be changed. It is not an immutable string.

  • 0x3fff: kIs_0_terminated_StringJc: This value means that the length of the string is not known yet, but the string is zero-terminated. The length can be determined by searching for the '\0' character. This definition makes it easy to define a const StringJc from a literal with an initializer list:

    #define INIZ_z_StringJc(TEXT) { TEXT, kIs_0_terminated_StringJc}

  • 0x3ffe: kIsStringBuilder_CharSeqJc: A reference to a StringBuilderJc is used; the string is in the buffer of the StringBuilderJc. The length is determined by the StringBuilderJc data.

If DEF_CharSeqJcCapabilities is not set, then it is simpler:

  • 0x3ffd: kMaxNrofChars_StringJc: This is the maximal value for the length of a string. If val is in the range 0 .. kMaxNrofChars_StringJc, then it is an immutable char const* string with this given length, or with a shorter length if it contains a \0 within this length.

With this bit designation a StringJc reference can present all of the named kinds of strings. The simple case is always possible: the immutable, simple char const* string.
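
The evaluation of val for this simple case can be sketched as follows. The macros are copied from the definitions above (variant without DEF_CharSeqJcCapabilities); the helper lengthFromVal_sketch and its negative return codes are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* Values as defined above, without DEF_CharSeqJcCapabilities. */
#define mLength_StringJc 0x00003fff
#define kMaxNrofChars_StringJc (mLength_StringJc -2)
#define kIs_0_terminated_StringJc (mLength_StringJc)
#define kIsStringBuilder_CharSeqJc (mLength_StringJc -1)
#define mNonPersists__StringJc (mLength_StringJc +1)
#define mThreadContext__StringJc ((mNonPersists__StringJc)<<1)

/* Hypothetical helper: returns the length coded in val,
 * -1 if the length must be found by searching the '\0',
 * -2 if val designates a StringBuilderJc reference. */
int lengthFromVal_sketch(uint16_t val) {
  int field = val & mLength_StringJc;   /* strip the control bits */
  if (field == kIs_0_terminated_StringJc) { return -1; }
  if (field == kIsStringBuilder_CharSeqJc) { return -2; }
  return field;                          /* 0 .. kMaxNrofChars_StringJc */
}
```

For example, val == 0x4005 describes a non-persistent string of 5 characters: the bit 0x4000 is a control bit, the length field is 5.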

If DEF_CharSeqJcCapabilities is given, then the StringJc can refer to an ObjectJc instance which implements the CharSeq function pointer table. The ObjectJc* pointer part of the union is used. The ObjectJc instance should offer 3 operations to get the length, the character at any index, and a sub sequence, like the java.lang.CharSequence interface. But the StringJc can of course also be a simple immutable char const* or a StringBuilderJc instance.

  • 0x3ffd: kIsCharSeqJc_CharSeqJc: If val masked with mLength_StringJc has this value, then the reference refers to an ObjectJc instance which should contain a function pointer table to CharSeqJc routines. The function pointer table is gotten from the instance by calling getVtbl_ObjectJc(obj, sign_Vtbl_CharSeqJc). This searches the function pointer table inside the ClassJc type information.

  • 0x3000: mIsCharSeqJcVtbl_CharSeqJc: If all bits of this mask are set in val, then val & mLength_StringJc is in the range 0x3000..0x3ffc. The value val & mVtbl_CharSeqJc is the index to a function pointer table which is used to implement dynamic calls at runtime in C, or in C++ without using virtual. The advantage of such a StringJc is that the index is already built; otherwise int getPosInVtbl_ObjectJc(othiz, sign_Vtbl_CharSeqJc); has to be called, which needs a little effort. This variant should only be used for local values (held on the stack), which are safer than values anywhere in data memory. Otherwise there can be the same problems as with the virtual mechanism in C++: corrupted data can force a crash of program execution.

  • 0x0fff: mVtbl_CharSeqJc: This is the mask for the posInVtbl for a CharSeqJc Object. See examples and chapter…​TODO

  • 0x2fff: kMaxNrofChars_StringJc: If (val & mLength_StringJc) <= kMaxNrofChars_StringJc, then it is an immutable char const* string with this given length.
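
The values in this list follow from the definitions with mLength_StringJc == 0x3fff: 0x3fff >> 2 is 0x0fff, 0x3fff & ~0x0fff is 0x3000, and 0x3000 - 1 is 0x2fff. The following sketch repeats the macros of the DEF_CharSeqJcCapabilities variant and adds a hypothetical helper posInVtblFromVal_sketch, which is not part of emC, to show how the range 0x3000..0x3ffc is evaluated.

```c
#include <stdint.h>

/* Values as defined above, with DEF_CharSeqJcCapabilities set. */
#define mLength_StringJc 0x00003fff
#define mVtbl_CharSeqJc (mLength_StringJc >>2)
#define kIsCharSeqJc_CharSeqJc (mLength_StringJc -2)
#define kMaxNrofChars_StringJc ((mLength_StringJc & ~mVtbl_CharSeqJc)-1)
#define mIsCharSeqJcVtbl_CharSeqJc (mLength_StringJc & ~mVtbl_CharSeqJc)

/* Hypothetical helper: returns the index into the function pointer
 * table if val carries one (field in 0x3000..0x3ffc), else -1. */
int posInVtblFromVal_sketch(uint16_t val) {
  int field = val & mLength_StringJc;   /* strip the control bits */
  if (field >= mIsCharSeqJcVtbl_CharSeqJc && field < kIsCharSeqJc_CharSeqJc) {
    return field & mVtbl_CharSeqJc;     /* lower 12 bits are the index */
  }
  return -1;
}
```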

6. StringBuilderJc: Buffer to prepare Strings

7. Vtbl_CharSeqJc: CharSequence in C language