Exercise: Protobuf Parsing
In this exercise, you will build a parser for the protobuf binary encoding. Don’t worry, it’s simpler than it seems! This illustrates a common parsing pattern, passing slices of data. The underlying data itself is never copied.
Fully parsing a protobuf message requires knowing the types of the fields, indexed by their field numbers. That is typically provided in a proto file. In this exercise, we’ll encode that information into match statements in functions that get called for each field.
We’ll use the following proto:
message PhoneNumber {optional string number = 1;optional string type = 2;}message Person {optional string name = 1;optional int32 id = 2;repeated PhoneNumber phones = 3;}
A proto message is encoded as a series of fields, one after the next. Each is implemented as a “tag” followed by the value. The tag contains a field number (e.g., 2 for the id field of a Person message) and a wire type defining how the payload should be determined from the byte stream.
Integers, including the tag, are represented with a variable-length encoding called VARINT. Luckily, parse_varint is defined for you below. The given code also defines callbacks to handle Person and PhoneNumber fields, and to parse a message into a series of calls to those callbacks.
What remains for you is to implement the parse_field function and the ProtoMessage trait for Person and PhoneNumber.
/// A wire type as seen on the wire.enum WireType {/// The Varint WireType indicates the value is a single VARINT.Varint,/// The I64 WireType indicates that the value is precisely 8 bytes in/// little-endian order containing a 64-bit signed integer or double type.//I64, -- not needed for this exercise/// The Len WireType indicates that the value is a length represented as a/// VARINT followed by exactly that number of bytes.Len,// The I32 WireType indicates that the value is precisely 4 bytes in// little-endian order containing a 32-bit signed integer or float type.//I32, -- not needed for this exercise}#[derive(Debug)]/// A field's value, typed based on the wire type.enum FieldValue<'a> {Varint(u64),//I64(i64), -- not needed for this exerciseLen(&'a [u8]),//I32(i32), -- not needed for this exercise}#[derive(Debug)]/// A field, containing the field number and its value.struct Field<'a> {field_num: u64,value: FieldValue<'a>,}trait ProtoMessage<'a>: Default {fn add_field(&mut self, field: Field<'a>);}impl From<u64> for WireType {fn from(value: u64) -> Self {match value {0 => WireType::Varint,//1 => WireType::I64, -- not needed for this exercise2 => WireType::Len,//5 => WireType::I32, -- not needed for this exercise_ => panic!("Invalid wire type: {value}"),}}}impl<'a> FieldValue<'a> {fn as_str(&self) -> &'a str {let FieldValue::Len(data) = self else {panic!("Expected string to be a `Len` field");};std::str::from_utf8(data).expect("Invalid string")}fn as_bytes(&self) -> &'a [u8] {let FieldValue::Len(data) = self else {panic!("Expected bytes to be a `Len` field");};data}fn as_u64(&self) -> u64 {let FieldValue::Varint(value) = self else {panic!("Expected `u64` to be a `Varint` field");};*value}}/// Parse a VARINT, returning the parsed value and the remaining bytes.fn parse_varint(data: &[u8]) -> (u64, &[u8]) {for i in 0..7 {let Some(b) = data.get(i) else {panic!("Not enough bytes for varint");};if b & 0x80 == 0 {// This is the last byte of the VARINT, so convert it to// a u64 and return it.let mut value = 0u64;for b in data[..=i].iter().rev() {value = (value << 7) | (b & 0x7f) as u64;}return (value, &data[i + 1..]);}}// More than 7 bytes is invalid.panic!("Too many bytes for varint");}/// Convert a tag into a field number and a WireType.fn unpack_tag(tag: u64) -> (u64, WireType) {let field_num = tag >> 3;let wire_type = WireType::from(tag & 0x7);(field_num, wire_type)}/// Parse a field, returning the remaining bytesfn parse_field(data: &[u8]) -> (Field, &[u8]) {let (tag, remainder) = parse_varint(data);let (field_num, wire_type) = unpack_tag(tag);let (fieldvalue, remainder) = match wire_type {_ => todo!("Based on the wire type, build a Field, consuming as many bytes as necessary.")};todo!("Return the field, and any un-consumed bytes.")}/// Parse a message in the given data, calling `T::add_field` for each field in/// the message.////// The entire input is consumed.fn parse_message<'a, T: ProtoMessage<'a>>(mut data: &'a [u8]) -> T {let mut result = T::default();while !data.is_empty() {let parsed = parse_field(data);result.add_field(parsed.0);data = parsed.1;}result}#[derive(Debug, Default)]struct PhoneNumber<'a> {number: &'a str,type_: &'a str,}#[derive(Debug, Default)]struct Person<'a> {name: &'a str,id: u64,phone: Vec<PhoneNumber<'a>>,}// TODO: Implement ProtoMessage for Person and PhoneNumber.fn main() {let person: Person = parse_message(&[0x0a, 0x07, 0x6d, 0x61, 0x78, 0x77, 0x65, 0x6c, 0x6c, 0x10, 0x2a, 0x1a,0x16, 0x0a, 0x0e, 0x2b, 0x31, 0x32, 0x30, 0x32, 0x2d, 0x35, 0x35, 0x35,0x2d, 0x31, 0x32, 0x31, 0x32, 0x12, 0x04, 0x68, 0x6f, 0x6d, 0x65, 0x1a,0x18, 0x0a, 0x0e, 0x2b, 0x31, 0x38, 0x30, 0x30, 0x2d, 0x38, 0x36, 0x37,0x2d, 0x35, 0x33, 0x30, 0x38, 0x12, 0x06, 0x6d, 0x6f, 0x62, 0x69, 0x6c,0x65,]);println!("{:#?}", person);}
This slide and its sub-slides should take about 30 minutes.
- In this exercise there are various cases where protobuf parsing might fail, e.g. if you try to parse an
i32when there are fewer than 4 bytes left in the data buffer. In normal Rust code we’d handle this with theResultenum, but for simplicity in this exercise we panic if any errors are encountered. On day 4 we’ll cover error handling in Rust in more detail.