Sending/Receiving Messages
Recall RPC
Dictionary:
Protobuf:
Synchronous vs. Asynchronous Communication
Synchronous
Both parties have to participate at the same time
Examples: phone call, RPC call
Asynchronous
One party can send any time, the other can receive later
Examples: email, streaming
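The decoupling in the asynchronous case can be sketched with an in-memory queue (a stand-in for a real messaging system; the names here are illustrative):

```python
from queue import Queue

# A queue decouples sender and receiver: the producer can enqueue
# messages even when no consumer is actively reading.
mailbox = Queue()

# Sender: puts messages at its own pace.
mailbox.put("message 1")
mailbox.put("message 2")

# Receiver: drains the queue later, in the order sent.
received = []
while not mailbox.empty():
    received.append(mailbox.get())
```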
ETL (Extract Transform Load)
Have multiple OLTP databases
ETL Code
Problem: if we have X OLTP databases and Y derivative stores, how many ETL programs must we write? $X \cdot Y$ (Too Much!)
Solution: Unified Log (Centralize changes in a distributed logging service)
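The fan-out reduction can be checked with simple arithmetic (the counts below are illustrative):

```python
# Without a unified log: one ETL program per (source, destination) pair.
X = 3  # OLTP databases (example count)
Y = 4  # derivative stores (example count)
pairwise_etl = X * Y   # one program per pair

# With a unified log: each source writes once, each destination reads once.
unified_etl = X + Y
```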
Kafka Design
Topics (managed by servers called brokers)
pip install kafka-python

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(...)
admin.create_topics([NewTopic("sports", ...)])
Producers, Consumers
Producers Publish (pub/sub)
from kafka import KafkaProducer

producer3 = KafkaProducer(...)
producer3.send("sports", ...)
Consumers Subscribe (pub/sub)
from kafka import KafkaConsumer

consumer3 = KafkaConsumer(...)
consumer3.subscribe(["sports"])
Receiving Messages
poll()
loop
consumer3 = KafkaConsumer(...)
while True:
    batch = consumer3.poll(????)  # timeout in milliseconds
    for tp, messages in batch.items():  # keys are TopicPartition objects
        for msg in messages:
            ...
poll()
(ideally) returns some messages the consumer hasn't seen before, from any subscribed topic
poll()
leaves messages intact on brokers (for other consumers), unlike many prior streaming systems
What's in a Message?
key (optional): some bytes. The key is used for partitioning and is usually one of the entries in the value structure.
producer.send("topic", value=????, key=????)
value (required): some bytes. The value is usually some kind of structure with many values.
producer.send("topic", value=????)
Python dict => bytes
value = bytes(json.dumps(dic), "utf-8")
Protobuf => bytes:
msg = mymod_pb2.MyMessage(...)
value = msg.SerializeToString()
# actually bytes, not str
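The dict route can be exercised end to end with only the standard library (protobuf needs a generated `_pb2` module, so it is omitted here):

```python
import json

# Python dict -> bytes, suitable as a Kafka message value.
dic = {"sport": "hockey", "score": 3}
value = bytes(json.dumps(dic), "utf-8")
assert isinstance(value, bytes)

# A consumer reverses the process: bytes -> str -> dict.
decoded = json.loads(value.decode("utf-8"))
```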
Reason for Partitioning
Some topics might have too many messages for one machine (or set of machines with replicas) to keep up
Topics can be created with N partitions:
Changing Partitions
Selecting Partitions
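Kafka's default partitioner hashes the key (murmur2) and takes it modulo the partition count. The sketch below substitutes CRC32 for murmur2 to stay stdlib-only, so actual Kafka partition numbers will differ, but the key property is the same:

```python
import zlib

def select_partition(key: bytes, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's murmur2-based default partitioner:
    # hash the key, then take it modulo the number of partitions.
    return zlib.crc32(key) % num_partitions

# Messages with the same key always land in the same partition,
# which is what makes per-key ordering possible.
p1 = select_partition(b"user-42", 4)
p2 = select_partition(b"user-42", 4)
```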
Consumers: Read Offsets
Ordering Kafka Messages
Partially vs. Totally Ordered
Some things are totally ordered, like integers: for any x and y, either x <= y or y <= x.
Other things are partially ordered, like Git commits. Sometimes you can compare, sometimes you can't!
Kafka Messages are partially ordered. Messages are consumed from a partition in the order they were written to that partition (no guarantees across topics or across partitions).
If A and B share the same topic and key, and B was produced after A, then A and B land in the same partition, so any consumer reading that partition sees A before B.
Seek to an Offset
from kafka import TopicPartition

part = TopicPartition("clicks", 3)
offset = 6
consumer.seek(part, offset)
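What seek does can be illustrated with a toy in-memory log (a model of the behavior, not the kafka-python implementation):

```python
# A partition is an append-only list; a consumer's position is just an index.
partition_log = ["m0", "m1", "m2", "m3", "m4", "m5", "m6", "m7"]

class ToyConsumer:
    def __init__(self):
        self.position = 0  # next offset to read

    def seek(self, offset):
        # Jump the read position; older messages stay on the "broker".
        self.position = offset

    def poll(self, max_records=2):
        msgs = partition_log[self.position:self.position + max_records]
        self.position += len(msgs)
        return msgs

c = ToyConsumer()
c.seek(6)
first_batch = c.poll()
```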
Read pattern
Consumer Groups
c = KafkaConsumer("clicks", group_id="g1", ...)
different applications might operate independently
they should ALL get a chance to consume messages
need offsets for each topic/partition/consumer group combination
Partition Assignment: Manual
tp0 = TopicPartition("clicks", 0)
...
consumer2.assign([tp0, tp1])
consumer3.assign([tp2, tp3])
Partition Assignment: Automatic
while True:
    batch = consumer.poll(1000)
    for tp, msgs in batch.items():
        for msg in msgs:
            ...
consumer.close()  # leaving the group triggers a partition rebalance
Segment Files: Log Rollover and Deletion