Computers store almost all data types using three basic methods: enumerations, adjacency, and pointers.
Enumerations #
Given a finite set of values, I can enumerate them – that is, pick a different bit pattern for each. Usually there’s some effort to mae the bit patterns meaningful, but not always.
For example, the 8-bit value 0x54
could mean any of the following:
Type | Value |
---|---|
Unsigned integer | 84 |
Signed integer | +84 |
Floating point | 12 |
ASCII | ‘T’ |
Bitvector sets | {2, 3, 5} |
Toy ISA | Flip all the bits of the value in register 1 |
Context is important when working with data.
Adjacency #
Virtually all computers since the 1970s have used byte-addressable memory – that is. each byte of memory has a unique address. What happens if you have a multi-byte value? You can store it in adjacent memory locations. The smallest byte address is referred to as the address of the entire value. For example, a four-byte value stored at address 0x1234
would be stored at 0x1234
, 0x1235
, 0x1236
, and 0x1237
.
If the value is made up of ordered parts, we put the first part in the smallest address and move up from there. If the value is not ordered, we can break it up in some pre-determined but arbitrary way. Most famously, there’s the endianness of multi-byte integers. Let’s look at an example using the 32-bit integer 0x12345678
:
Address | Little-endian | Big-endian |
---|---|---|
0x1234 | 0x78 | 0x12 |
0x1235 | 0x56 | 0x34 |
0x1236 | 0x34 | 0x56 |
0x1237 | 0x12 | 0x78 |
We apply these two rules (place parts adjacently in order, break big things up in a pre-arranged but arbitrary way) recursively.
A lot of code will add padding between elements to ensure that four-byte integers start at addresses that are multiples of four. This is because the processor can only read and write four bytes at a time. If the address is not a multiple of four, it takes more time to read the value.
Pointers #
Sometimes, it is more convenient to build a data structure using references to other data instead of copies of the data. Instead of storing the value itself, we store the address of the value. This is called a pointer.
Telling Them Apart #
It’s important to remember that all of this is ultimately bytes. There’s nothing inherently distinguishing a float
from an int
from an enum
from a char
from a struct
from a void*
. The only way to tell them apart is to know what the data is supposed to represent. How do we keep it all straight? That’s where typing – assigning a type, or data format, to a variable – comes in.
Static typing means that we explicitly state what type of data a variable holds. We inform the system that the variable, and any register or memory location it is stored in, will always hold a specific data type. The compiler will also verify that we don’t try to use a variable in a way that is incompatible with its type. Common statically-typed languages include C, C++, Java, and Rust.
Dynamic typing, on the other hand, means that we don’t explicitly state what type of value a variable data. Instead, every value is stored as exactly two adjacent pieces: an enumeration value noting what type of data it is, and a pointer to the actual data. The code doesn’t blindly assume it knows what type it found: instead, it always checks the enumeration and picks what to do baesd on what it found. Common dynamically-typed languages include Python, JavaScript, and Ruby.