1. Systems and Big Data Systems
- A system is software that manages resources (compute, memory, storage, network).
- Compute: computational resources execute code.
- Memory: memory holds data for active usage.
- Storage: storage holds long-term data.
- Network: network provides communication between computers.
- A big data system manages resources that are specialized (e.g., GPUs) and distributed (a cluster of machines).
- Scale Up: more resources are assigned to a single machine.
- Scale Out: more machines work on the same task.
2. Compute
- Core: the more cores we have, the more tasks we can run simultaneously.
- Code: three ways to get from high-level code to something the CPU can run
- Compiler: translates high-level code directly to machine code.
- Interpreter: the CPU runs an interpreter program that loops over the programmer’s code and executes it.
- Bytecode: a compiler creates bytecode from high-level code, and a Virtual Machine (VM) running on the CPU runs the bytecode.
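
The bytecode approach can be seen directly in CPython, which is one implementation that works this way. The sketch below is only illustrative: the source string and the variable names `price`, `quantity`, and `total` are made up for the example, and the exact instruction names differ between Python versions.

```python
import dis

# Hypothetical one-line program, used only for illustration.
source = "total = price * quantity"

# Step 1: the compiler turns high-level source text into a bytecode object.
code_obj = compile(source, "<example>", "exec")

# Step 2: list the bytecode instruction names (they vary by Python version).
opnames = [ins.opname for ins in dis.get_instructions(code_obj)]
print(opnames)

# Step 3: the Python virtual machine executes the bytecode in a namespace.
namespace = {"price": 3, "quantity": 4}
exec(code_obj, namespace)
print(namespace["total"])  # 12
```

The raw bytes the VM loops over are available as `code_obj.co_code`.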
- CPU vs. GPU
- CPUs (few cores that are fast, flexible, independent) are versatile and handle a wide variety of tasks well.
- GPUs (many cores that are individually slower, float-optimized, coordinated) are specialized for parallel processing and perform well in tasks like graphics rendering, scientific simulations, and machine learning, provided the work can be parallelized.
- Metric: FLOPS (floating-point operations per second)
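
To make the FLOPS metric concrete, here is a rough sketch that times a loop of floating-point multiplications and divides the operation count by the elapsed time. The function name `estimate_flops` is hypothetical, and the result mostly measures Python interpreter overhead, so it will be far below what the hardware can actually deliver.

```python
import time

def estimate_flops(num_ops: int = 1_000_000) -> float:
    """Very rough estimate: floating-point multiplies per second in pure Python."""
    x = 1.0000001
    acc = 1.0
    start = time.perf_counter()
    for _ in range(num_ops):
        acc = acc * x  # one floating-point multiply per iteration
    elapsed = time.perf_counter() - start
    return num_ops / elapsed

print(f"{estimate_flops():.0f} FLOPS (interpreter-bound estimate)")
```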
3. Memory
- Random Access Memory (RAM):
- Random: we can jump around and access data at different locations efficiently. In contrast, some devices that hold data are only efficient when accessed sequentially.
- Byte Addressable: each byte of data has its own address that the CPU can use to access it; extracting a single bit from a byte actually involves more steps than using the whole byte.
- Characteristics: small, volatile, fast.
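
A small sketch of byte addressing, using a `bytearray` as a stand-in for RAM (an assumption for illustration; the helper names `read_byte` and `read_bit` are made up). Reading a byte is a single indexed access, while reading one bit needs the byte access plus a shift and a mask.

```python
# Two bytes of pretend memory at "addresses" 0 and 1.
memory = bytearray([0b10110010, 0b00001111])

def read_byte(mem: bytearray, address: int) -> int:
    # One step: the address selects a whole byte.
    return mem[address]

def read_bit(mem: bytearray, address: int, bit_index: int) -> int:
    # Extra steps: fetch the byte, shift the wanted bit to position 0, mask it.
    # bit_index 0 is the least significant bit.
    return (mem[address] >> bit_index) & 1

print(read_byte(memory, 0))    # 178
print(read_bit(memory, 0, 1))  # 1
print(read_bit(memory, 0, 0))  # 0
```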
- Bits and Bytes:
- N bits can hold 2^N possible values.
- 1 Byte = 8 bits (so 1 MB/s = 8 Mbps)
- 1 KB = 1024 Bytes
- 1 MB = 1024 KB
- 1 GB = 1024 MB
- 1 TB = 1024 GB
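
The unit arithmetic above can be checked with a few lines of Python. The constant and function names are made up for this sketch, and it uses the 1024-based convention from these notes.

```python
# Sizes in bytes, using the 1024-based convention from the notes.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

def possible_values(n_bits: int) -> int:
    # N bits can represent 2^N distinct values.
    return 2 ** n_bits

def megabytes_per_sec_to_megabits_per_sec(mb_per_sec: float) -> float:
    # 1 byte = 8 bits, so a rate in bytes converts to bits by multiplying by 8.
    return mb_per_sec * 8

print(possible_values(8))                        # 256
print(GB)                                        # 1073741824
print(megabytes_per_sec_to_megabits_per_sec(1))  # 8
```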
4. Storage
- Block Devices
- Hard Disk Drives (HDDs): 0s/1s stored on a spinning magnetized platter; a moving head reads/writes data.
- Solid State Drives (SSDs): 0s/1s stored in charged cells; no moving parts (faster).
- Data is read/written in blocks of many bytes (for example, 0.5 KB), so reading 1 byte takes about the same time as reading 1 whole block.
- Characteristics: large, nonvolatile, slow.
- Metrics: capacity (bytes), throughput (bytes/sec), latency (ms)
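
Because a block device transfers whole blocks, the cost of a read depends on how many blocks it touches rather than how many bytes it asks for. The sketch below counts touched blocks, assuming the 0.5 KB (512-byte) block size from the example above; the function name `blocks_touched` is hypothetical.

```python
def blocks_touched(offset: int, length: int, block_size: int = 512) -> int:
    """Number of blocks the device must read to serve `length` bytes at `offset`."""
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    return last_block - first_block + 1

print(blocks_touched(0, 1))    # 1  (a single byte still costs one block)
print(blocks_touched(0, 512))  # 1  (a full block costs the same)
print(blocks_touched(511, 2))  # 2  (two bytes that straddle a block boundary)
```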
5. Network
- When scaling out, many nodes (computers) will be communicating via a network.
- Metrics: latency (sec, ms), bandwidth/throughput (bits/sec or bytes/sec).
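
Latency and bandwidth combine into a simple back-of-the-envelope estimate of how long a transfer takes. The model below (fixed latency plus size divided by bandwidth) is a simplifying assumption that ignores protocol overhead and congestion, and the function name and example numbers are made up.

```python
def transfer_time(size_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Estimated seconds to send a message: fixed latency plus time on the wire."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Hypothetical link: 1 ms latency, 125,000,000 bytes/sec (1 gigabit per second).
small = transfer_time(1, 0.001, 125_000_000)            # dominated by latency
large = transfer_time(125_000_000, 0.001, 125_000_000)  # dominated by bandwidth
print(small, large)
```

Small messages are dominated by latency and large ones by bandwidth, which is why both metrics matter when scaling out.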
6. Other Terms
- Deployment: running code somewhere.
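- Containers: a lightweight alternative to virtual machines; containers share the host operating system’s kernel instead of each running a full guest operating system.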