Monday, October 29, 2012

Webvtt C Parser version 0.2, mostly

I've been working on implementing a C parser for the webvtt standard.
The parser can be found at https://github.com/dperit/webvtt/tree/cparser
and the standard can be found at http://dev.w3.org/html5/webvtt/

Because I'm writing exacting code for a web browser that will have to handle input from malicious attackers, I chose to write the code very deliberately. Each small step in the standard is implemented in its own function, and each function only requires that the offset + buffer pointer in the parser struct point to the start of the text it is concerned with. It should be easy to update the code when parts of the spec change, because only the function concerned with that part of the spec will need to be updated; the rest will continue functioning independently.

Paired with the above method for splitting up the problem are the parser struct and the parse function.
The parser struct stores the buffer location, offset, and length, as well as the "reached end of buffer" and "invalid webvtt" flags. It also stores a parser state, which uses values declared in the state enum to indicate which step the parser is currently on.
The parse function loops until the parser struct indicates that it has reached the end of the buffer or has found an invalid part of the buffer. Inside of the loop is a switch statement for the parser state which executes the function associated with each step of the standard and, if successful, advances the state by one step.
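
To make the struct and loop described above concrete, here is a minimal sketch. Every identifier in it (parser, parser_state, parse_signature, parse_line_end, parse) is made up for illustration and is not taken from the actual repository. It only covers two steps, it only accepts a bare LF after the signature, and it adds a STATE_DONE value so the loop has somewhere to stop, which is a convenience of the sketch rather than something described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; the real parser's identifiers may differ. */
typedef enum {
    STATE_SIGNATURE,  /* expect the "WEBVTT" file signature */
    STATE_LINE_END,   /* expect the line terminator after it */
    STATE_DONE        /* sketch-only: no more steps to run */
} parser_state;

typedef struct {
    const char *buffer;   /* start of the input */
    size_t offset;        /* current read position */
    size_t length;        /* number of valid bytes in buffer */
    int reached_end;      /* ran out of buffer partway through a step */
    int invalid;          /* input is not valid webvtt */
    parser_state state;   /* which step runs next */
} parser;

/* Each step returns 1 and advances the offset on success. On failure it
 * sets one of the flags and leaves the offset where it was. */
static int parse_signature(parser *p)
{
    static const char sig[] = "WEBVTT";
    size_t n = sizeof sig - 1;
    size_t avail = p->length - p->offset;
    size_t cmp = avail < n ? avail : n;

    if (memcmp(p->buffer + p->offset, sig, cmp) != 0) {
        p->invalid = 1;       /* mismatch in the bytes we do have */
        return 0;
    }
    if (avail < n) {
        p->reached_end = 1;   /* valid so far, but incomplete */
        return 0;
    }
    p->offset += n;
    return 1;
}

static int parse_line_end(parser *p)
{
    if (p->offset >= p->length) {
        p->reached_end = 1;
        return 0;
    }
    if (p->buffer[p->offset] != '\n') {  /* simplification: LF only */
        p->invalid = 1;
        return 0;
    }
    p->offset += 1;
    return 1;
}

/* Run steps until a flag is set or the sketch runs out of steps. */
void parse(parser *p)
{
    assert(p != NULL && p->buffer != NULL);
    while (!p->reached_end && !p->invalid && p->state != STATE_DONE) {
        int ok = 0;
        switch (p->state) {
        case STATE_SIGNATURE: ok = parse_signature(p); break;
        case STATE_LINE_END:  ok = parse_line_end(p);  break;
        case STATE_DONE:      break;
        }
        if (ok)
            p->state = (parser_state)(p->state + 1);  /* next step */
    }
}
```

Advancing the state by adding one only works because the enum values are declared in the same order as the steps run, so the enum doubles as the list of steps.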

The result of this is that the program is a series of small, self-contained steps, each responsible for advancing the offset and implementing a specific section of the spec. If something goes wrong, it should be easy to trace the problem back to the source.

Another challenge in implementing this spec was that it's possible to receive incomplete buffers, where the parser could run out of buffer partway through a function. Splitting the functions into small steps deals with that effectively with the aid of the following conventions:
  1. If the parser runs out of buffer in a function, return the offset to a point where, if the current function is run again, it will continue parsing properly. This could mean returning the offset to the start of a block for the cue times, or not changing it at all when proceeding through the blocks of "almost anything goes" text, which is intended for comments.
  2. If we hit something that indicates the thing being parsed is invalid, then return immediately without changing the offset. This means that the parser struct will always point to whatever the invalid part was, which is useful for debugging.
  3. Don't read from the buffer very far ahead of where the offset is pointing, or else it will be unclear what the parser broke on.
With the way things are now, if you pass the parser struct (with additional data in the buffer) and the first cue (if you have one), as well as optionally the last cue, back into the parse function, it will pick up right where it left off. That seems to solve the "you could run out of data at any time but the webvtt file might still be valid" problem nicely. And if you never come back to it, you'll still have any cues it retrieved originally.
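
As a rough illustration of conventions 1 and 2, here is a sketch of a single step that reads a short "mm:ss.ttt" timestamp. The names (parser, parse_timestamp, parser_resume) and the trimmed-down struct are hypothetical, and the sketch leaves out the cue arguments mentioned above. The point is only that the offset moves when the whole timestamp has been read, so running out of buffer or hitting bad input leaves it at the start of the timestamp.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, trimmed-down parser struct for this sketch only. */
typedef struct {
    const char *buffer;
    size_t offset;
    size_t length;
    int reached_end;
    int invalid;
} parser;

/* Parse "mm:ss.ttt" at p->offset and store the value in milliseconds.
 * Returns 1 on success. On failure the offset is left untouched. */
static int parse_timestamp(parser *p, unsigned long *out_ms)
{
    static const char pattern[] = "dd:dd.ddd";  /* d = any digit */
    size_t start = p->offset;                   /* rewind point */
    unsigned long fields[3] = { 0, 0, 0 };      /* minutes, seconds, ms */
    int field = 0;
    size_t i;

    for (i = 0; pattern[i] != '\0'; i++) {
        char c;
        if (start + i >= p->length) {
            p->reached_end = 1;  /* convention 1: retry restarts at start */
            return 0;
        }
        c = p->buffer[start + i];
        if (pattern[i] == 'd') {
            if (!isdigit((unsigned char)c)) {
                p->invalid = 1;  /* convention 2: offset stays put */
                return 0;
            }
            fields[field] = fields[field] * 10 + (unsigned long)(c - '0');
        } else {
            if (c != pattern[i]) {
                p->invalid = 1;
                return 0;
            }
            field++;
        }
    }
    *out_ms = (fields[0] * 60 + fields[1]) * 1000 + fields[2];
    p->offset = start + i;       /* only now commit the new offset */
    return 1;
}

/* Hypothetical helper: hand the parser a longer buffer that begins with
 * the same bytes as before, and clear the end-of-buffer flag. */
static void parser_resume(parser *p, const char *buffer, size_t length)
{
    assert(length >= p->length);
    p->buffer = buffer;
    p->length = length;
    p->reached_end = 0;
}
```

If the first call sees only "01:0", it sets the end-of-buffer flag and leaves the offset at 0. After parser_resume supplies "01:02.500", calling the same function again parses the whole timestamp from the start.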

I'm not confident of my ability to edit this post to a point where it makes sense right now if it doesn't already, so I'm going to go ahead and hit the publish button and hope for the best.
