This code is **work in progress** and will be updated until **March 16th**. Some elements (e.g., environment setup, dataset handling) may evolve for optimization and completeness.
## Overview
This project implements a scalable, multi-process data preprocessing utility for large-scale language model training, specifically for the multilingual C4 dataset. Each process handles a portion of the dataset, performs pre-processing, and writes processed output independently—designed for efficient execution on HPC clusters like **Noctua2**.
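The even-split sharding idea described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the helper name `shard_indices` and its rank/world-size parameters are assumptions for demonstration; the real shard assignment lives in the project's own modules.

```python
import math

def shard_indices(num_examples: int, rank: int, world_size: int) -> range:
    """Return the contiguous index range one process is responsible for.

    Hypothetical helper for illustration only: each of `world_size`
    processes gets an even, contiguous slice of the dataset, with the
    last process taking whatever remains.
    """
    per_rank = math.ceil(num_examples / world_size)
    start = rank * per_rank
    end = min(start + per_rank, num_examples)
    return range(start, end)

# Example: 10 examples split across 3 processes
print([list(shard_indices(10, r, 3)) for r in range(3)])
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each process can then load and pre-process only its own slice and write its output file independently, which is what makes the design embarrassingly parallel on a cluster.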
...
...
- **Modular Design:** `data_loader.py`, `shard_processor.py`, and `main.py` for clean separation of logic.
## TODOs
- [ ] Enhanced pre-processing and real tokenization.
- [ ] Explain the concept in more detail, add a model for illustration, and discuss advantages and disadvantages.
- [ ] Add sentence splitting support (optional).
- [ ] Support output in indexed/binary format.
- [ ] Add advanced error logging and resource monitoring.
- [ ] Validate with larger dataset samples.
- [ ] Finalize environment initialization and testing on cluster.
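One of the planned items above, sentence splitting, could be sketched with a naive punctuation-based splitter. This is an assumption about how the feature might look, not the project's design; a multilingual dataset like C4 would likely need a proper language-aware splitter instead of the regex below.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break on '.', '!' or '?' followed by
    whitespace. Illustrative only; fails on abbreviations, decimals,
    and scripts without these terminators."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Hello world. How are you? Fine!"))
# -> ['Hello world.', 'How are you?', 'Fine!']
```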