diff options
-rw-r--r-- | yxml.md | 428 | ||||
-rw-r--r-- | yxml.pod | 450 |
2 files changed, 428 insertions, 450 deletions
@@ -0,0 +1,428 @@ +% Yxml Manual + +# Introduction + +Yxml is a small non-validating and mostly conforming XML parser written in C. + +The latest version of yxml and this document can be found on +[https://dev.yorhel.nl/yxml](https://dev.yorhel.nl/yxml). + +# Compiling yxml + +Due to the small size of yxml, the recommended way to use it is to copy the +[yxml.c](https://g.blicky.net/yxml.git/plain/yxml.c) and +[yxml.h](https://g.blicky.net/yxml.git/plain/yxml.h) from the git repository +into your project directory, and compile and link yxml.c as part of your +program or library. + +The git repository also includes a Makefile. Running `make` without specifying +a target will compile a `.a` file for easy static linking. A test suite is +available under `make test`. + +# API documentation + +## Overview + +Yxml is designed to be very flexible and efficient, and thus offers a +relatively low-level stream-based API. The entire API consists of two typedefs +and three functions: + +```c +typedef enum { /* .. */ } yxml_ret_t; +typedef struct { /* .. */ } yxml_t; + +void yxml_init(yxml_t *x, void *buf, size_t bufsize); +yxml_ret_t yxml_parse(yxml_t *x, int ch); +yxml_ret_t yxml_eof(yxml_t *x); +``` + +The values of _yxml\_ret\_t_ and the public fields of _yxml\_t_ are explained +in detail below. Parsing a file using yxml involves three steps: + +1. Initialization, using `yxml_init()`. +2. Parsing. This is performed in a loop where `yxml_parse()` is called on each + character of the input file. +3. Finalization, using `yxml_eof()`. + +## Initialization + +```c +#define BUFSIZE 4096 +void *buf = malloc(BUFSIZE); +yxml_t x; +yxml_init(&x, buf, BUFSIZE); +``` + +The parsing state for an input document is remembered in the `yxml_t` +structure. This structure needs to be allocated and initialized before parsing +a new XML document. + +Allocating space for the `yxml_t` structure is the responsibility of the +application. Allocation can be done on the stack, but it is also possible to +embed the struct inside a larger object or to allocate space for the struct +separately. + +`yxml_init()` takes a pointer to an (uninitialized) `yxml_t` struct as first +argument and performs the necessary initialization. The two additional +arguments specify a pointer to a buffer and the size of this buffer. The given +buffer must be writable, but does not have to be initialized by the +application. + +The buffer is used internally by yxml to keep a stack of opened XML element +names, property names and PI targets. The size of the buffer determines both +the maximum depth in which XML elements can be nested and the maximum length of +element names, property names and PI targets. Each name consumes +`strlen(name)+1` bytes in the buffer, and the first byte of the buffer is +reserved for the `\0` byte. This means that in order to parse an XML document +with an element name of 100 bytes, a property name or PI target of 50 bytes and +a nesting depth of 10 levels, the buffer must be at least +`1+10*(100+1)+(50+1)=1062` bytes. Note that properties and PIs don't nest, so +the `max(PI_name, property_name)` only needs to be counted once. + +It is not currently possibly to dynamically grow the buffer while parsing, so +it is important to choose a buffer size that is large enough to handle all the +XML documents that you want to parse. Since element names, property names and +PI targets are typically much shorter than in the previous example, a buffer +size of 4 or 8 KiB will give enough headroom even for documents with deep +nesting. + +As a useful hack, it is possible to merge the memory for the `yxml_t` struct +and the stack buffer in a single allocation: + +```c +yxml_t *x = malloc(sizeof(yxml_t) + BUFSIZE); +yxml_init(x, x+1, BUFSIZE); +``` + +This way, the complete parsing state can be passed around with a single +pointer, and both the struct and the buffer can be freed with a single call to +`free(x)`. + +## Parsing + +```c +yxml_t *x; /* An initialized state */ +char *doc; /* The XML document as a zero-terminated string */ +for(; *doc; doc++) { + yxml_ret_t r = yxml_parse(x, *doc); + if(r < 0) + exit(1); /* Handle error */ + /* Handle any tokens we are interested in */ +} +``` + +The actual parsing of an XML document is facilitated by the `yxml_parse()` +function. It accepts a pointer to an initialized `yxml_t` struct as first +argument and a byte as second argument. The byte is passed as an `int`, and +values in the range of -128 to 255 (both inclusive) are accepted. This way you +can pass either `signed char` or `unsigned char` values, yxml will work fine +with both. To parse a complete document, `yxml_parse()` needs to be called for +each byte of the document in sequence, as done in the above example. + +For each byte, `yxml_parse()` will return either _YXML\_OK_ (0), a token (>0) +or an error (<0). _YXML\_OK_ is returned if the given byte has been +parsed/consumed correctly but that otherwise nothing worthy of note has +happened. The application should then continue processing and pass the next +byte of the document. + +### Public State Variables + +After each call to `yxml_parse()`, a number of interesting fields in the +`yxml_t` struct are updated. The fields documented here are part of the API, +and are considered as extra return values of `yxml_parse()`. All of these +fields should be considered read-only. + +`char *elem;` +: Name of the currently opened XML element. Points into the buffer given to + `yxml_init()`. Described in ["Elements"](#elements). + +`char *attr;` +: Name of the currently opened attribute. Points into the buffer given to + `yxml_init()`. Described in ["Attributes"](#attributes). + +`char *pi;` +: Target of the currently opened PI. Points into the buffer given to + `yxml_init()`. Described in ["Processing Instructions"](#processing-instructions). + +`char data[8];` +: Character data of element contents, attribute values or PI contents. Described + in ["Character Data"](#character-data). + +`uint32_t line;` +: Number of the line in the XML document that is currently being parsed. + +`uint64_t byte;` +: Byte offset into the current line the XML document. + +`uint64_t total;` +: Byte offset into the XML document. + +The values of the _elem_, _attr_, _pi_ and _data_ elements depend on the +parsing context, and only remain valid within that context. The exact contexts +in which these fields contain valid information is described in their +respective sections below. + +The _line_, _byte_ and _total_ fields are mainly useful for error reporting. +When `yxml_parse()` reports an error, these fields can be used to generate a +useful error message. For example: + +```c +printf("Parsing error at %s:%"PRIu32":%"PRIu64" byte offset %"PRIu64", + filename, x->line, x->byte, x->total); +``` + +### Error Handling + +Errors are not recoverable. No further calls to `yxml_parse()` or `yxml_eof()` +should be performed on the same `yxml_t` struct. Re-initializing the same +struct using `yxml_init()` to start parsing a new document is possible, +however. The following error values may be returned by `yxml_parse()`: + +YXML\_EREF +: Invalid character or entity reference. E.g. `&whatever;` or `&#ABC;`. + +YXML\_ECLOSE +: Close tag does not match open tag. E.g. `<Tag> .. </SomeOtherTag>`. + +YXML\_ESTACK +: Stack overflow. This happens when the buffer given to `yxml_init()` was not + large enough to parse this document. E.g. when elements are too deeply nested + or an element name, attribute name or PI target is too long. + +YXML\_ESYN +: Miscellaneous syntax error. + +## Handling Tokens + +The `yxml_parse()` function will return tokens as they are found. When loading +an XML document, it is important to know which tokens are returned in which +situation and how to handle them. + +The following graph shows the (simplified) state machine of the parser to +illustrate the order in which tokens are returned. The labels on the edge +indicate the tokens that are returned by `yxml_parse()`, with their `YXML_` +prefix removed. The special return value `YXML_OK` and error returns are not +displayed. + +![](https://dev.yorhel.nl/img/yxml-apistates.png) + +Tokens that the application is not interested in can be ignored safely. For +example, if you are not interested in handling processing instructions, then +the `YXML_PISTART`, `YXML_PICONTENT` and `YXML_PIEND` tokens can be handled +exactly as if they were an alias for `YXML_OK`. + +### Elements + +The `YXML_ELEMSTART` and `YXML_ELEMEND` tokens are returned when an XML +element is opened and closed, respectively. When `YXML_ELEMSTART` is returned, +the _elem_ struct field will hold the name of the element. This field will be +valid (i.e. keeps pointing to the name of the opened element) until the end of +the attribute list. That is, until any token other than those described in +["Attributes"](#attributes) is returned. Although the _elem_ pointer itself may be reused +and modified while parsing the contents of the element, the buffer that _elem_ +points to will remain valid up to and including the corresponding +`YXML_ELEMEND`. + +Yxml will verify that elements properly nest and that the name of each closing +tag properly matches that of the corresponding opening tag. The application may +safely assume that each `YXML_ELEMSTART` is properly matched with a +`YXML_ELEMEND`, or that otherwise an error is returned. Furthermore, only a +single root element is allowed. When the root element is closed, no further +`YXML_ELEMSTART` tokens will be returned. + +No distinction is made between self-closing tags and elements with empty +content. For example, both `<a/>` and `<a></a>` will result in the +`YXML_ELEMSTART` token (with `elem="a"`) followed by `YXML_ELEMEND`. + +Element contents are returned in the form of the `YXML_CONTENT` token and the +_data_ field. This is described in more detail in ["Character +Data"](#character-data). + +### Attributes + +Element attributes are passed using the `YXML_ATTRSTART`, `YXML_ATTRVAL` and +`YXML_ATTREND` tokens. The name of the attribute is available in the _attr_ +field, which is available when `YXML_ATTRSTART` is returned and valid up to +and including the next `YXML_ATTREND`. + +Yxml does not verify that attribute names are unique within a single element. +It is thus possible that the same attribute will appear twice, possibly with a +different value. The correct way to handle this situation is to stop parsing +the rest of the document and to report an error, but if the application is not +interested in all attributes, detecting duplicates in them may complicate the +code and possibly even introduce security vulnerabilities (e.g. algorithmic +complexity attacks in a hash table). As such, the best solution is to report an +error when you can easily detect a duplicate attribute, but ignore duplicates +that require more effort to be detected. + +The attribute value is returned with the `YXML_ATTRVAL` token and the _data_ +field. This is described in more detail in ["Character Data"](#character-data). + +### Processing Instructions + +Processing instructions are passed in similar fashion to attributes, and are +passed using `YXML_PISTART`, `YXML_PICONTENT` and `YXML_PIEND`. The target of +the PI is available in the _pi_ field after `YXML_PISTART` and remains valid up +to (but excluding) the next `YXML_PIEND` token. + +PI contents are returned as `YXML_PICONTENT` tokens and using the _data_ field, +described in more detail in ["Character Data"](#character-data). + +### Character Data + +Element contents (`YXML_CONTENT`), attribute values (`YXML_ATTRVAL`) and PI +contents (`YXML_PICONTENT`) are all passed to the application in small chunks +through the _data_ field. Each time that `yxml_parse()` returns one of these +tokens, the _data_ field will contain one or more bytes of the element +contents, attribute value or PI content. The string is zero-terminated, and its +value is only valid until the next call to `yxml_parse()`. + +Typically only a single byte is returned after each call, but multiple bytes +can be returned in the following special cases: + +- Character references outside of the ASCII character range. When a character + reference is encountered in element contents or in an attribute value, it is + automatically replaced with the referenced character. For example, the XML + string `/` is replaced with the single character "/". If the character + value is above 127, its value is encoded in UTF-8 and then returned as a + multi-byte string in the _data_ field. For example, the character reference + `ç` is returned as the C string "\\xc3\\xa9", which is the UTF-8 + encoding for the character "é". Character references are not expanded in PI + contents. +- The special character "\]" in CDATA sections. When the "\]" character is + encountered inside a CDATA section, yxml can't immediately return it to the + application because it does not know whether the character is part of the + CDATA ending or whether it is still part of its contents. So it remembers the + character for the next call to `yxml_parse()`, and if it then turns out that + the character was part of the CDATA contents, it returns both the "\]" + character and the following byte in the same _data_ string. Similarly, if two + "\]" characters appear in sequence as part of the CDATA content, then the two + characters are returned in a single _data_ string together with the byte that + follows. CDATA sections only appear in element contents, so this does not + happen in attribute values or PI contents. +- The special character "?" in PI contents. This is similar to the issue with + "\]" characters in CDATA sections. Yxml remembers a "?" character while + parsing a PI, and then returns it together with the byte following it if it + turned out to be part of the PI contents. + +Note that `yxml_parse()` operates on bytes rather than characters. If the +document is encoded in a multi-byte character encoding such as UTF-8, then each +Unicode character that occupies more than a single byte will be broken up and +its bytes processed individually. As a result, the bytes returned in the +_data_ field may not necessarily represent a single Unicode character. To +ensure that multi-byte characters are not broken up, the application can +concatenate multiple data tokens to a single buffer before attempting to do +further processing on the result. + +To make processing easier, an application may want to combine all the tokens +into a single buffer. This can be easily implemented as follows: + +```c +SomeString attrval; +while(..) { + yxml_ret_t r = yxml_parse(x, ch); + switch(r) { + case YXML_ATTRSTART: + somestring_initialize(attrval); + break; + case YXML_ATTRVAL: + somestring_append(attrval, x->data); + break; + case YXML_ATTREND: + /* Now we have a full attribute. Its name is in x->attr, and its value is + * in the string 'attrval'. */ + somestring_reset(attrval); + break; + } +} +``` + +The `SomeString` type and `somestring_` functions are stubs for any string +handling library of your choosing. When using Glib, for example, one could use +the [GString](https://developer.gnome.org/glib/stable/glib-Strings.html) +type and the `g_string_new()`, `g_string_append()` and `g_string_free()` +functions. For a more lighter-weight string library there is also +[kstring.h in klib](https://github.com/attractivechaos/klib), but the +functionality required in the above example can easily be implemented in a few +lines of pure C, too. + +When buffering data into an ever-growing string, as done in the previous +example, one should be careful to protect against memory exhaustion. This can +be done trivially by limiting the size of the total XML document or the maximum +length of the buffer. If you want to extract information from an XML document +that might not fit into memory, but you know that the information you care +about is limited in size and is only stored in specific attributes or elements, +you can choose to ignore data you don't care about. For example, if you only +want to extract the "Size" attribute and you know that its value is never +larger than 63 bytes, you can limit your code to read only that value and store +it into a small pre-allocated buffer: + +```c +char sizebuf[64], *sizecur = NULL, *tmp; +while(..) { + yxml_ret_t r = yxml_parse(x, ch); + switch(r) { + case YXML_ATTRSTART: + if(strcmp(x->attr, "Size") == 0) + sizecur = sizebuf; + break; + case YXML_ATTRVAL: + if(!sizecur) /* Are we in the "Size" attribute? */ + break; + /* Append x->data to sizecur while there is space */ + tmp = x->data; + while(*tmp && sizecur < sizebuf+sizeof(sizebuf)) + *(sizecur++) = *(tmp++); + if(sizecur == sizebuf+sizeof(sizebuf)) + exit(1); /* Too long attribute value, handle error */ + *sizecur = 0; + break; + case YXML_ATTREND: + if(sizecur) { + /* Now we have the value of the "Size" attribute in sizebuf */ + sizecur = NULL; + } + break; + } +} +``` + +## Finalization + +```c +yxml_t *x; /* An initialized state */ +yxml_ret_t r = yxml_eof(x); +if(r < 0) + exit(1); /* Handle error */ +else + /* No errors in the XML document */ +``` + +Because `yxml_parse()` does not know when the end of the XML document has been +reached, it is unable to detect certain errors in the document. This is why, +after successfully parsing a complete document with `yxml_parse()`, the +application should call `yxml_eof()` to perform some extra checks. + +`yxml_eof()` will return `YXML_OK` if the parsed XML document is well-formed, +`YXML_EEOF` otherwise. The following errors are not detected by +`yxml_parse()` but will result in an error on `yxml_eof()`: + +- The XML document did not contain a root element (e.g. an empty file). +- The XML root element has not been closed (e.g. "`<a> ..`"). +- The XML document ended in the middle of a comment or PI (e.g. + "`<a/><!-- ..`"). + +## Utility functions + +```c +size_t yxml_symlen(yxml_t *, const char *); +``` + +`yxml_symlen()` returns the length of the element name (`x->elem`), attribute +name (`x->attr`), or PI name (`x->pi`). When used correctly, it gives the same +result as `strlen()`, except without having to scan through the string. This +function should **ONLY** be used directly after the `YXML_ELEMSTART`, +`YXML_ATTRSTART` or `YXML_PISTART` (respectively) tokens have been returned by +`yxml_parse()`, calling this function at any other time may not give the +correct results. This function should **NOT** be used on strings other than +`x->elem`, `x->attr` or `x->pi`. diff --git a/yxml.pod b/yxml.pod deleted file mode 100644 index 9448a62..0000000 --- a/yxml.pod +++ /dev/null @@ -1,450 +0,0 @@ -=head1 Introduction - -Yxml is a small non-validating and mostly conforming XML parser written in C. - -The latest version of yxml and this document can be found on -L<http://dev.yorhel.nl/yxml>. - -=head1 Compiling yxml - -Due to the small size of yxml, the recommended way to use it is to copy the -L<yxml.c|http://g.blicky.net/yxml.git/plain/yxml.c> and -L<yxml.h|http://g.blicky.net/yxml.git/plain/yxml.h> from the git repository -into your project directory, and compile and link yxml.c as part of your -program or library. - -The git repository also includes a Makefile. Running C<make> without specifying -a target will compile a C<.a> file for easy static linking. A test suite is -available under C<make test>. - -=head1 API documentation - -=head2 Overview - -Yxml is designed to be very flexible and efficient, and thus offers a -relatively low-level stream-based API. The entire API consists of two typedefs -and three functions: - - typedef enum { /* .. */ } yxml_ret_t; - typedef struct { /* .. */ } yxml_t; - - void yxml_init(yxml_t *x, void *buf, size_t bufsize); - yxml_ret_t yxml_parse(yxml_t *x, int ch); - yxml_ret_t yxml_eof(yxml_t *x); - -The values of I<yxml_ret_t> and the public fields of I<yxml_t> are explained in -detail below. Parsing a file using yxml involves three steps: - -=over - -=item 1. Initialization, using C<yxml_init()>. - -=item 2. Parsing. This is performed in a loop where C<yxml_parse()> is called -on each character of the input file. - -=item 3. Finalization, using C<yxml_eof()>. - -=back - - -=head2 Initialization - - #define BUFSIZE 4096 - void *buf = malloc(BUFSIZE); - yxml_t x; - yxml_init(&x, buf, BUFSIZE); - -The parsing state for an input document is remembered in the C<yxml_t> -structure. This structure needs to be allocated and initialized before parsing -a new XML document. - -Allocating space for the C<yxml_t> structure is the responsibility of the -application. Allocation can be done on the stack, but it is also possible to -embed the struct inside a larger object or to allocate space for the struct -separately. - -C<yxml_init()> takes a pointer to an (uninitialized) C<yxml_t> struct as first -argument and performs the necessary initialization. The two additional -arguments specify a pointer to a buffer and the size of this buffer. The given -buffer must be writable, but does not have to be initialized by the -application. - -The buffer is used internally by yxml to keep a stack of opened XML element -names, property names and PI targets. The size of the buffer determines both -the maximum depth in which XML elements can be nested and the maximum length of -element names, property names and PI targets. Each name consumes -C<strlen(name)+1> bytes in the buffer, and the first byte of the buffer is -reserved for the C<\0> byte. This means that in order to parse an XML document -with an element name of 100 bytes, a property name or PI target of 50 bytes and -a nesting depth of 10 levels, the buffer must be at least -C<1+10*(100+1)+(50+1)=1062> bytes. Note that properties and PIs don't nest, so -the C<max(PI_name, property_name)> only needs to be counted once. - -It is not currently possibly to dynamically grow the buffer while parsing, so -it is important to choose a buffer size that is large enough to handle all the -XML documents that you want to parse. Since element names, property names and -PI targets are typically much shorter than in the previous example, a buffer -size of 4 or 8 KiB will give enough headroom even for documents with deep -nesting. - -As a useful hack, it is possible to merge the memory for the C<yxml_t> struct -and the stack buffer in a single allocation: - - yxml_t *x = malloc(sizeof(yxml_t) + BUFSIZE); - yxml_init(x, x+1, BUFSIZE); - -This way, the complete parsing state can be passed around with a single -pointer, and both the struct and the buffer can be freed with a single call to -C<free(x)>. - - -=head2 Parsing - - yxml_t *x; /* An initialized state */ - char *doc; /* The XML document as a zero-terminated string */ - for(; *doc; doc++) { - yxml_ret_t r = yxml_parse(x, *doc); - if(r < 0) - exit(1); /* Handle error */ - /* Handle any tokens we are interested in */ - } - -The actual parsing of an XML document is facilitated by the C<yxml_parse()> -function. It accepts a pointer to an initialized C<yxml_t> struct as first -argument and a byte as second argument. The byte is passed as an C<int>, and -values in the range of -128 to 255 (both inclusive) are accepted. This way you -can pass either C<signed char> or C<unsigned char> values, yxml will work fine -with both. To parse a complete document, C<yxml_parse()> needs to be called -for each byte of the document in sequence, as done in the above example. - -For each byte, C<yxml_parse()> will return either I<YXML_OK> (0), a token (>0) -or an error (<0). I<YXML_OK> is returned if the given byte has been -parsed/consumed correctly but that otherwise nothing worthy of note has -happened. The application should then continue processing and pass the next -byte of the document. - -=head3 Public State Variables - -After each call to C<yxml_parse()>, a number of interesting fields in the -C<yxml_t> struct are updated. The fields documented here are part of the API, -and are considered as extra return values of C<yxml_parse()>. All of these -fields should be considered read-only. - -=over - -=item C<char *elem;> - -Name of the currently opened XML element. Points into the buffer given to -C<yxml_init()>. Described in L</Elements>. - -=item C<char *attr;> - -Name of the currently opened attribute. Points into the buffer given to -C<yxml_init()>. Described in L</Attributes>. - -=item C<char *pi;> - -Target of the currently opened PI. Points into the buffer given to -C<yxml_init()>. Described in L</Processing Instructions>. - -=item C<char data[8];> - -Character data of element contents, attribute values or PI contents. Described -in L</Character Data>. - -=item C<uint32_t line;> - -Number of the line in the XML document that is currently being parsed. - -=item C<uint64_t byte;> - -Byte offset into the current line the XML document. - -=item C<uint64_t total;> - -Byte offset into the XML document. - -=back - -The values of the I<elem>, I<attr>, I<pi> and I<data> elements depend on the -parsing context, and only remain valid within that context. The exact contexts -in which these fields contain valid information is described in their -respective sections below. - -The I<line>, I<byte> and I<total> fields are mainly useful for error reporting. -When C<yxml_parse()> reports an error, these fields can be used to generate a -useful error message. For example: - - printf("Parsing error at %s:%"PRIu32":%"PRIu64" byte offset %"PRIu64", - filename, x->line, x->byte, x->total); - -=head3 Error Handling - -Errors are not recoverable. No further calls to C<yxml_parse()> or -C<yxml_eof()> should be performed on the same C<yxml_t> struct. Re-initializing -the same struct using C<yxml_init()> to start parsing a new document is -possible, however. The following error values may be returned by -C<yxml_parse()>: - -=over - -=item YXML_EREF - -Invalid character or entity reference. E.g. C<&whatever;> or C<&#ABC;>. - -=item YXML_ECLOSE - -Close tag does not match open tag. E.g. C<< <Tag> .. </SomeOtherTag> >>. - -=item YXML_ESTACK - -Stack overflow. This happens when the buffer given to C<yxml_init()> was not -large enough to parse this document. E.g. when elements are too deeply nested -or an element name, attribute name or PI target is too long. - -=item YXML_ESYN - -Miscellaneous syntax error. - -=back - - -=head2 Handling Tokens - -The C<yxml_parse()> function will return tokens as they are found. When loading -an XML document, it is important to know which tokens are returned in which -situation and how to handle them. - -The following graph shows the (simplified) state machine of the parser to -illustrate the order in which tokens are returned. The labels on the edge -indicate the tokens that are returned by C<yxml_parse()>, with their C<YXML_> -prefix removed. The special return value C<YXML_OK> and error returns are not -displayed. - -[html]<img src="/img/yxml-apistates.png" />É - -Tokens that the application is not interested in can be ignored safely. For -example, if you are not interested in handling processing instructions, then -the C<YXML_PISTART>, C<YXML_PICONTENT> and C<YXML_PIEND> tokens can be handled -exactly as if they were an alias for C<YXML_OK>. - -=head3 Elements - -The C<YXML_ELEMSTART> and C<YXML_ELEMEND> tokens are returned when an XML -element is opened and closed, respectively. When C<YXML_ELEMSTART> is returned, -the I<elem> struct field will hold the name of the element. This field will be -valid (i.e. keeps pointing to the name of the opened element) until the end of -the attribute list. That is, until any token other than those described in -L</Attributes> is returned. Although the I<elem> pointer itself may be reused -and modified while parsing the contents of the element, the buffer that I<elem> -points to will remain valid up to and including the corresponding -C<YXML_ELEMEND>. - -Yxml will verify that elements properly nest and that the name of each closing -tag properly matches that of the corresponding opening tag. The application may -safely assume that each C<YXML_ELEMSTART> is properly matched with a -C<YXML_ELEMEND>, or that otherwise an error is returned. Furthermore, only a -single root element is allowed. When the root element is closed, no further -C<YXML_ELEMSTART> tokens will be returned. - -No distinction is made between self-closing tags and elements with empty -content. For example, both C<< <a/> >> and C<< <a></a> >> will result in the -C<YXML_ELEMSTART> token (with C<elem="a">) followed by C<YXML_ELEMEND>. - -Element contents are returned in the form of the C<YXML_CONTENT> token and the -I<data> field. This is described in more detail in L</Character Data>. - -=head3 Attributes - -Element attributes are passed using the C<YXML_ATTRSTART>, C<YXML_ATTRVAL> and -C<YXML_ATTREND> tokens. The name of the attribute is available in the I<attr> -field, which is available when C<YXML_ATTRSTART> is returned and valid up to -and including the next C<YXML_ATTREND>. - -Yxml does not verify that attribute names are unique within a single element. -It is thus possible that the same attribute will appear twice, possibly with a -different value. The correct way to handle this situation is to stop parsing -the rest of the document and to report an error, but if the application is not -interested in all attributes, detecting duplicates in them may complicate the -code and possibly even introduce security vulnerabilities (e.g. algorithmic -complexity attacks in a hash table). As such, the best solution is to report an -error when you can easily detect a duplicate attribute, but ignore duplicates -that require more effort to be detected. - -The attribute value is returned with the C<YXML_ATTRVAL> token and the I<data> -field. This is described in more detail in L</Character Data>. - -=head3 Processing Instructions - -Processing instructions are passed in similar fashion to attributes, and are -passed using C<YXML_PISTART>, C<YXML_PICONTENT> and C<YXML_PIEND>. The target -of the PI is available in the I<pi> field after C<YXML_PISTART> and remains -valid up to (but excluding) the next C<YXML_PIEND> token. - -PI contents are returned as C<YXML_PICONTENT> tokens and using the I<data> -field, described in more detail in L</Character Data>. - -=head3 Character Data - -Element contents (C<YXML_CONTENT>), attribute values (C<YXML_ATTRVAL>) and PI -contents (C<YXML_PICONTENT>) are all passed to the application in small chunks -through the I<data> field. Each time that C<yxml_parse()> returns one of these -tokens, the I<data> field will contain one or more bytes of the element -contents, attribute value or PI content. The string is zero-terminated, and its -value is only valid until the next call to C<yxml_parse()>. - -Typically only a single byte is returned after each call, but multiple bytes -can be returned in the following special cases: - -=over - -=item * Character references outside of the ASCII character range. When a -character reference is encountered in element contents or in an attribute -value, it is automatically replaced with the referenced character. For example, -the XML string C</> is replaced with the single character "/". If the -character value is above 127, its value is encoded in UTF-8 and then returned -as a multi-byte string in the I<data> field. For example, the character -reference C<ç> is returned as the C string "\xc3\xa9", which is the UTF-8 -encoding for the character "é". Character references are not expanded in PI -contents. - -=item * The special character "]" in CDATA sections. When the "]" character is -encountered inside a CDATA section, yxml can't immediately return it to the -application because it does not know whether the character is part of the CDATA -ending or whether it is still part of its contents. So it remembers the -character for the next call to C<yxml_parse()>, and if it then turns out that -the character was part of the CDATA contents, it returns both the "]" character -and the following byte in the same I<data> string. Similarly, if two "]" -characters appear in sequence as part of the CDATA content, then the two -characters are returned in a single I<data> string together with the byte that -follows. CDATA sections only appear in element contents, so this does not -happen in attribute values or PI contents. - -=item * The special character "?" in PI contents. This is similar to the issue -with "]" characters in CDATA sections. Yxml remembers a "?" character while -parsing a PI, and then returns it together with the byte following it if it -turned out to be part of the PI contents. - -=back - -Note that C<yxml_parse()> operates on bytes rather than characters. If the -document is encoded in a multi-byte character encoding such as UTF-8, then each -Unicode character that occupies more than a single byte will be broken up and -its bytes processed individually. As a result, the bytes returned in the -I<data> field may not necessarily represent a single Unicode character. To -ensure that multi-byte characters are not broken up, the application can -concatenate multiple data tokens to a single buffer before attempting to do -further processing on the result. - -To make processing easier, an application may want to combine all the tokens -into a single buffer. This can be easily implemented as follows: - - SomeString attrval; - while(..) { - yxml_ret_t r = yxml_parse(x, ch); - switch(r) { - case YXML_ATTRSTART: - somestring_initialize(attrval); - break; - case YXML_ATTRVAL: - somestring_append(attrval, x->data); - break; - case YXML_ATTREND: - /* Now we have a full attribute. Its name is in x->attr, and its value is - * in the string 'attrval'. */ - somestring_reset(attrval); - break; - } - } - -The C<SomeString> type and C<somestring_> functions are stubs for any string -handling library of your choosing. When using Glib, for example, one could use -the L<GString|https://developer.gnome.org/glib/stable/glib-Strings.html> -type and the C<g_string_new()>, C<g_string_append()> and C<g_string_free()> -functions. For a more lighter-weight string library there is also -L<kstring.h in klib|https://github.com/attractivechaos/klib>, but the -functionality required in the above example can easily be implemented in a few -lines of pure C, too. - -When buffering data into an ever-growing string, as done in the previous -example, one should be careful to protect against memory exhaustion. This can -be done trivially by limiting the size of the total XML document or the maximum -length of the buffer. If you want to extract information from an XML document -that might not fit into memory, but you know that the information you care -about is limited in size and is only stored in specific attributes or elements, -you can choose to ignore data you don't care about. For example, if you only -want to extract the "Size" attribute and you know that its value is never -larger than 63 bytes, you can limit your code to read only that value and store -it into a small pre-allocated buffer: - - char sizebuf[64], *sizecur = NULL, *tmp; - while(..) { - yxml_ret_t r = yxml_parse(x, ch); - switch(r) { - case YXML_ATTRSTART: - if(strcmp(x->attr, "Size") == 0) - sizecur = sizebuf; - break; - case YXML_ATTRVAL: - if(!sizecur) /* Are we in the "Size" attribute? */ - break; - /* Append x->data to sizecur while there is space */ - tmp = x->data; - while(*tmp && sizecur < sizebuf+sizeof(sizebuf)) - *(sizecur++) = *(tmp++); - if(sizecur == sizebuf+sizeof(sizebuf)) - exit(1); /* Too long attribute value, handle error */ - *sizecur = 0; - break; - case YXML_ATTREND: - if(sizecur) { - /* Now we have the value of the "Size" attribute in sizebuf */ - sizecur = NULL; - } - break; - } - } - - -=head2 Finalization - - yxml_t *x; /* An initialized state */ - yxml_ret_t r = yxml_eof(x); - if(r < 0) - exit(1); /* Handle error */ - else - /* No errors in the XML document */ - -Because C<yxml_parse()> does not know when the end of the XML document has been -reached, it is unable to detect certain errors in the document. This is why, -after successfully parsing a complete document with C<yxml_parse()>, the -application should call C<yxml_eof()> to perform some extra checks. - -C<yxml_eof()> will return C<YXML_OK> if the parsed XML document is well-formed, -C<YXML_EEOF> otherwise. The following errors are not detected by -C<yxml_parse()> but will result in an error on C<yxml_eof()>: - -=over - -=item * The XML document did not contain a root element (e.g. an empty -file). - -=item * The XML root element has not been closed (e.g. "C<< <a> .. >>"). - -=item * The XML document ended in the middle of a comment or PI (e.g. -"C<< <a/><!-- .. >>"). - -=back - -=head2 Utility functions - - size_t yxml_symlen(yxml_t *, const char *); - -C<yxml_symlen()> returns the length of the element name (C<< x->elem >>), -attribute name (C<< x->attr >>), or PI name (C<< x->pi >>). When used -correctly, it gives the same result as C<strlen()>, except without having to -scan through the string. This function should B<ONLY> be used directly after -the C<YXML_ELEMSTART>, C<YXML_ATTRSTART> or C<YXML_PISTART> (respectively) -tokens have been returned by C<yxml_parse()>, calling this function at any -other time may not give the correct results. This function should B<NOT> be -used on strings other than C<< x->elem >>, C<< x->attr >> or C<< x->pi >>. |