MIT DCC, Lab 2

Creating a failure-tolerant key-value store server


This lab’s assignment was to create a simple key-value store that accepts Get, Put, and Append operations from multiple clients, ensuring that each request is performed only once despite network failures.

In future labs, this server will be replicated to emulate a redundant environment, where if one server crashes the others can still serve requests.

The requirements are pretty simple:

  1. When a client makes a Get request, fetch the current value
  2. When a client makes a Put request, replace the current value and return ""
  3. When a client makes an Append request, append to the current value and return the old value
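
To make those concrete, here’s the behavior from a client’s point of view (this uses the Clerk methods shown later in this post; the key and values are just made up):

ck.Put("color", "blue")           // "color" now maps to "blue" (a Put reply carries "")
value := ck.Get("color")          // value == "blue"
old := ck.Append("color", " sky") // old == "blue"; the stored value becomes "blue sky"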

Here’s the catch in a distributed environment: if there is a network failure when the server tries to tell a client that an operation completed, the client might resend a Put/Append request even though it was already performed. The server needs to detect these duplicates and respond as though the operation was performed only once.

This whole process is straightforward:

  1. Each client gets a random client ID.
  2. Each request gets an incrementing request ID.
  3. When the server receives a request, it first checks a cache to see whether that client ID is retrying its last request ID. If so, it returns the cached reply.
  4. If not (i.e., the request ID was incremented), it performs the operation and caches the result.

Here is my simple implementation, starting with the client:

type Clerk struct {
	server *labrpc.ClientEnd

	ID        int64 // random client ID, assigned once at creation
	OpCounter int   // incrementing request ID, bumped before each Put/Append
}

func (ck *Clerk) Get(key string) string {
	args := GetArgs{
		Key: key,
	}

	reply := GetReply{}

	// Keep retrying until the RPC gets through.
	var success bool

	for !success {
		success = ck.server.Call("KVServer.Get", &args, &reply)
	}

	return reply.Value
}

func (ck *Clerk) PutAppend(key string, value string, op string) string {
	ck.OpCounter += 1

	args := PutAppendArgs{
		Key:   key,
		Value: value,

		ClientID:   ck.ID,
		RequestNum: ck.OpCounter,
	}

	reply := PutAppendReply{}

	var success bool

	// Retry until the server replies; resending is safe because the server
	// dedups on the (ClientID, RequestNum) pair.
	for !success {
		switch op {
		case "Put":
			success = ck.server.Call("KVServer.Put", &args, &reply)
		case "Append":
			success = ck.server.Call("KVServer.Append", &args, &reply)
		default:
			log.Fatalf("Tried to call PutAppend with unknown op")
		}
	}

	return reply.Value
}

func (ck *Clerk) Put(key string, value string) {
	ck.PutAppend(key, value, "Put")
}

func (ck *Clerk) Append(key string, value string) string {
	return ck.PutAppend(key, value, "Append")
}

When the client is created, it assigns itself a random ID and uses that plus an incrementing counter to identify each request. When making RPC calls, it simply retries until it receives a successful response.
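
The constructor isn’t shown above, but it’s only a few lines. Roughly this (the nrand helper mirrors the one the lab skeleton ships with; any source of random 64-bit IDs would do, and this version needs crypto/rand and math/big):

func nrand() int64 {
	// Random 62-bit integer, so two clients colliding is astronomically unlikely.
	max := big.NewInt(int64(1) << 62)
	bigx, _ := rand.Int(rand.Reader, max)
	return bigx.Int64()
}

func MakeClerk(server *labrpc.ClientEnd) *Clerk {
	ck := new(Clerk)
	ck.server = server
	ck.ID = nrand()  // random client ID
	ck.OpCounter = 0 // the first Put/Append will use RequestNum 1
	return ck
}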

Now, the server. Since performance isn’t paramount here, it just uses one lock and one data map for simplicity. If this were real code, though, I would shard the data and the cache so that any given request only locks a small slice of the store.
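
A rough sketch of what that sharding might look like (hypothetical, not part of my solution; it hashes each key to one of a fixed number of shards using hash/fnv):

const nShards = 32

type shard struct {
	mu   sync.Mutex
	data map[string]string
}

type shardedStore struct {
	shards [nShards]shard
}

// shardFor hashes the key so requests for different keys usually
// contend on different locks.
func (s *shardedStore) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s.shards[h.Sum32()%nShards]
}

My actual server keeps everything behind a single lock: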

// cachedPutAppend holds the reply for a client's most recent Put/Append,
// so a retried request can be answered without re-applying it.
type cachedPutAppend struct {
	requestNum int
	value      string
}

type KVServer struct {
	mu sync.Mutex

	data map[string]string

	// Maps client IDs to the result of their last performed operation.
	// This construction assumes a client won't make a new request until
	// the previous one succeeds.
	cache map[int64]cachedPutAppend
}

// requestAlreadyProcessed reports whether this request is a retry of the
// client's most recently completed operation.
func (kv *KVServer) requestAlreadyProcessed(args *PutAppendArgs) bool {
	last, ok := kv.cache[args.ClientID]
	return ok && args.RequestNum == last.requestNum
}

func (kv *KVServer) cachePutAppend(request *PutAppendArgs, value string) {
	kv.cache[request.ClientID] = cachedPutAppend{
		requestNum: request.RequestNum,
		value:      value,
	}
}

func (kv *KVServer) Get(args *GetArgs, reply *GetReply) {
	kv.mu.Lock()
	defer kv.mu.Unlock()

	reply.Value = kv.data[args.Key]
}

func (kv *KVServer) Put(args *PutAppendArgs, reply *PutAppendReply) {
	kv.mu.Lock()
	defer kv.mu.Unlock()

	if kv.requestAlreadyProcessed(args) {
		reply.Value = ""
		return
	}

	kv.data[args.Key] = args.Value
	kv.cachePutAppend(args, "")
}

func (kv *KVServer) Append(args *PutAppendArgs, reply *PutAppendReply) {
	kv.mu.Lock()
	defer kv.mu.Unlock()

	if kv.requestAlreadyProcessed(args) {
		reply.Value = kv.cache[args.ClientID].value
		return
	}

	old := kv.data[args.Key]
	reply.Value = old
	kv.data[args.Key] = kv.data[args.Key] + args.Value
	kv.cachePutAppend(args, old)
}

The server’s only logic-intensive job is preventing duplicate operations through the cache.
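
To spell out what that buys us: if the network duplicates a request, the second delivery hits the cache and the state doesn’t change. Something like this contrived in-process exercise (not how the tests actually drive the server):

kv := &KVServer{
	data:  map[string]string{"k": "a"},
	cache: map[int64]cachedPutAppend{},
}

args := PutAppendArgs{Key: "k", Value: "b", ClientID: 42, RequestNum: 1}
var first, retry PutAppendReply

kv.Append(&args, &first) // appends "b": data["k"] == "ab", first.Value == "a"
kv.Append(&args, &retry) // duplicate: served from cache, retry.Value == "a",
                         // and data["k"] is still "ab", not "abb"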


This lab was a good warmup, and I’m now pretty eager to try implementing a replicated + sharded version of this with the broken-up key-value tables I wrote about earlier.

I did get to practice something important: deciding where to start. In both labs, you aren’t told whether to write the server or the client first, and thinking through the best starting point is valuable practice. For this lab, I started with the clients. When you aren’t sure how an API should be structured, it seems best to write the clients first so you know the requirements; then you can flesh out the server logic to meet them. Very straightforward.

Total time to complete this lab: About an hour.

Test: one client ...
  ... Passed -- t  4.7 nrpc 58266 ops 58266
Test: many clients ...
  ... Passed -- t  6.7 nrpc 194826 ops 194826
Test: unreliable net, many clients ...
  ... Passed -- t  3.3 nrpc  1138 ops  904
Test: concurrent append to same key, unreliable ...
  ... Passed -- t  0.2 nrpc    68 ops   52
Test: memory use get ...
  ... Passed -- t  0.3 nrpc     4 ops    0
Test: memory use put ...
  ... Passed -- t  0.1 nrpc     2 ops    0
Test: memory use append ...
  ... Passed -- t  0.2 nrpc     2 ops    0
Test: memory use many put clients ...
  ... Passed -- t  4.3 nrpc 100000 ops    0
Test: memory use many get client ...
  ... Passed -- t  5.3 nrpc 100001 ops    0
Test: memory use many appends ...
2025/09/08 11:17:42 m0 785280 m1 2797248
  ... Passed -- t  1.4 nrpc  1000 ops    0
PASS
ok      6.5840/kvsrv    27.665s

Hooray!