Spark 2 Workbook Answers Online

Add a short paragraph for each stage of your pipeline, explaining why you chose that API. The tips below show how to turn a bare code snippet into a full-credit answer:

| Tip | How to Apply |
|-----|--------------|
| **Show Spark’s lazy evaluation** | Mention that transformations build a DAG and actions trigger execution. |
| **Explain the physical plan** | Use `df.explain()` in a note to demonstrate understanding of shuffle, broadcast, etc. |
| **State assumptions** | “Assume the input file fits in HDFS and each line is a UTF‑8 string.” |
| **Edge‑case handling** | Talk about empty files, null values, or malformed CSV rows. |
| **Performance hints** | Suggest `repartition` before a heavy shuffle or using `broadcast` for small lookup tables. |
| **Testing** | Show a tiny local test (e.g., `sc.parallelize(["a b","b c"]).flatMap(...).collect()`). |
| **Clean code** | Use meaningful variable names, consistent indentation, and short comments. |

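As a concrete instance of the “Testing” and “lazy evaluation” tips, here is a tiny local test you can run in a `pyspark` shell (a sketch; the split lambda mirrors the word‑count stage in Section 7):

```python
# Transformations only build the DAG; collect() is the action that runs it.
test = sc.parallelize(["a b", "b c"]) \
         .flatMap(lambda line: line.split())  # lazy: nothing has executed yet
print(test.collect())                         # action: returns ['a', 'b', 'b', 'c']
```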
Before you submit, run through this quick checklist:

- [ ] All code compiles and runs on Spark 2.x (no 3.x‑only APIs).
- [ ] Comments are present for every non‑obvious line.
- [ ] You’ve referenced at least **one** Spark concept (lazy eval, shuffle, broadcast, etc.).
- [ ] Edge cases are discussed.
- [ ] The answer is written **in your own words** (no copy‑pasting from the internet).

## 6. Quick Reference Cheatsheet (Spark 2.4)

- `mapPartitions` – bulk HTTP calls: batch external requests per partition so that connection setup happens once per partition rather than once per record.
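Assuming the entry above refers to `mapPartitions` (the usual idiom for bulk calls), here is a minimal sketch; the endpoint and the `requests` dependency are illustrative assumptions, not part of the workbook:

```python
import requests  # assumed to be installed on the executors

def enrich_partition(ids):
    session = requests.Session()  # one HTTP session per partition, not per record
    for i in ids:
        # hypothetical endpoint, used only to illustrate the batching idea
        resp = session.get("https://api.example.com/lookup", params={"id": i})
        yield (i, resp.status_code)

enriched = sc.parallelize(range(100), 4).mapPartitions(enrich_partition).collect()
```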

## 7. Putting It All Together – A Mini‑Project Blueprint

The blueprint below is the classic word‑count pipeline, with each stage numbered so you can attach the short explanatory paragraph mentioned above:

```python
from pyspark import SparkContext

# 1️⃣ Create the context and load the input file
sc = SparkContext(appName="WordCount")
lines = sc.textFile("hdfs:///data/myfile.txt")

# 2️⃣ Split lines into words and clean them
words = lines.flatMap(lambda line: line.split()) \
             .map(lambda w: w.lower().strip('.,!?"\''))

# 3️⃣ Keep only unique words
distinct_words = words.distinct()
```
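The excerpt stops at the `distinct` step. A typical next stage, sketched here as an assumption about where the blueprint is heading, counts each word and writes the result out; note that nothing executes until the final action:

```python
# Hypothetical continuation: count occurrences of every word.
counts = words.map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)  # wide transformation: causes a shuffle

# Action: triggers execution of the whole DAG and writes the result to HDFS.
counts.saveAsTextFile("hdfs:///data/wordcount_output")
```

Preferring `reduceByKey` over `groupByKey` here keeps the shuffle small, because values are combined map‑side before any data moves across the network.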