What size data / position is this C code assigning to a value

Question

I am trying to port some C code to another language. Most of the code is working, except working out what this section does. The C code I've been handed is not in a compilable state so I can't do a run time analysis, but will fix that as a last resort.

Buffer is pointer to file's raw contents.

Based on reading the code my expectation was this:

if pos = 4132, then this is trying to read a 16-bit unsigned value from file position (pos << 8) + (pos + 1).

However when I make this calculation for pos = 4132, I get a file position of 1061925, which is way beyond the end of the file.

unsigned short id;

    struct file
    {
      char *buffer;
    }

id = (*(file_instance->buffer + pos)<<8) + *(file_instance->buffer+pos+1);

It takes the octect at `pos` and places the octec at `pos+1` next to it. If the octect at `pos` is `0xAA` and the one at `pos+1` `0xBB`, you get `0xAABB`. `|` would do the same thing as `+` in this context. — Petr Skocik, Mar 26 '17 at 21:35
Note it's `*(file_instance->buffer + pos)` which gets left-shifted, not just `pos`. — aschepler, Mar 26 '17 at 21:36
It might be clearer if the last statement was written with array subscripts: `id = (file_instance->buffer[pos]<<8) + file_instance->buffer[pos+1];` — dbush, Mar 26 '17 at 21:39
`(pos << 8) + (pos + 1)` is the same as `pos*257 + 1` - certainly not correct. Try `pos * sizeof(16-bit unsigned value)` or `pos *2`. I have no idea why OP is considering `<<8` is part of the solution. — chux - Reinstate Monica, Mar 26 '17 at 21:40
@PSkocik if `char` is signed then that case is undefined behaviour. My system actually gives `0xFFFFFD50` as result in that case, not `0xAABB`. — M.M, Mar 26 '17 at 21:51
@M.M Nice catch. I think the code meant to use unsigned chars, and this kind of number compositioning was the goal — Petr Skocik, Mar 26 '17 at 22:01

score 2 · Accepted Answer · edited May 23 '17 at 12:17

To make the code easier to read, you could transform it using the identities *(A+B) == A[B], and X << 1 == X * 2:

id = (file_instance->buffer[0] * 256) + file_instance->buffer[1];

If buffer were an unsigned char *, then this code would be a common idiom for reading a 16-bit integer out of memory, where the first octet is the most-significant one. For example if the memory were { 0x01, 0x02 } then you can verify that this equation produces the integer value 0x0102.

However you said buffer is a char *. In C, char may either be a signed or an unsigned type. If it is unsigned on the system you are working on, then everything is fine. Also, everything is fine if the data you are reading never has the MSB set of any byte.

But if char is signed then the code causes undefined behaviour due to left-shifting a negative number, when the value at file_instance->buffer[0] is a negative value (i.e. has the MSB set). Also there may be further unexpected behaviour when adding two negative numbers.

If you are unable to run the code then it may be difficult to work out what the existing behaviour was, since it would be at the mercy of various hardware and compiler optimization details.

If you can run code on the target system then you could try and see what happens with a code snippet like:

#include <stdio.h>

int main()
{
    char buf[2] = { 0xAA, 0xBB };     // put sample data here

    int r = buf[0] << 8 + buf[1];

    printf("%x\n", (unsigned)r);
}

What size data / position is this C code assigning to a value

1 Answers1