This code is **work in progress** and will be updated until **March 16th**. Some elements (e.g., environment setup, dataset handling) may evolve for optimization and completeness.
## Overview
This project implements a scalable, multi-process data preprocessing utility for large-scale language model training, specifically for the multilingual C4 dataset. Each process handles a portion of the dataset, performs pre-processing, and writes processed output independently—designed for efficient execution on HPC clusters like **Noctua2**.
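The even-split sharding idea described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the helper name `shard_indices` and its rank/world-size parameters are assumptions for demonstration; the real shard assignment lives in the project's own modules.

```python
import math

def shard_indices(num_examples: int, rank: int, world_size: int) -> range:
    """Return the contiguous index range one process is responsible for.

    Hypothetical helper for illustration only: each of `world_size`
    processes gets an even, contiguous slice of the dataset, with the
    last process taking whatever remains.
    """
    per_rank = math.ceil(num_examples / world_size)
    start = rank * per_rank
    end = min(start + per_rank, num_examples)
    return range(start, end)

# Example: 10 examples split across 3 processes
print([list(shard_indices(10, r, 3)) for r in range(3)])
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each process can then load and pre-process only its own slice and write its output file independently, which is what makes the design embarrassingly parallel on a cluster.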
...
...
- **Modular Design:** `data_loader.py`, `shard_processor.py`, and `main.py` for clean separation of logic.
## TODOs
- [ ] Enhanced pre-processing and real tokenization.
- [ ] Explain the concept in more detail, add a model for illustration, and discuss advantages and disadvantages.
- [ ] Add sentence splitting support (optional).
- [ ] Support output in indexed/binary format.
- [ ] Add advanced error logging and resource monitoring.
- [ ] Validate with larger dataset samples.
- [ ] Finalize environment initialization and testing on cluster.
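One of the planned items above, sentence splitting, could be sketched with a naive punctuation-based splitter. This is an assumption about how the feature might look, not the project's design; a multilingual dataset like C4 would likely need a proper language-aware splitter instead of the regex below.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break on '.', '!' or '?' followed by
    whitespace. Illustrative only; fails on abbreviations, decimals,
    and scripts without these terminators."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello world. How are you? Fine!"))
# -> ['Hello world.', 'How are you?', 'Fine!']
```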