Commit a1ba0603 authored by joeld

minor changes, added TODOs

parent 356bb018
# Scalable C4 Preprocessing Utility
**Disclaimer:**
This code is **work in progress** and will be updated until **March 16th**. Some elements (e.g., environment setup, dataset handling) may evolve for optimization and completeness.
## Overview
This project implements a scalable, multi-process data preprocessing utility for large-scale language model training, specifically for the multilingual C4 dataset. Each process handles a portion of the dataset, preprocesses it, and writes its processed output independently of the other processes. The design targets efficient execution on HPC clusters such as **Noctua2**.
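The per-process split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the function names `partition_bounds` and `process_partition`, the `part_XXXXX.txt` output naming, and the whitespace "tokenization" are all hypothetical placeholders (real tokenization is still listed as a TODO).

```python
import os
from typing import List, Tuple


def partition_bounds(total_items: int, total_procs: int, rank: int) -> Tuple[int, int]:
    """Return the half-open [start, end) range of items owned by `rank`.

    Hypothetical helper: splits items into contiguous, near-equal ranges.
    """
    if not 0 <= rank < total_procs:
        raise ValueError("rank must be in [0, total_procs)")
    base, extra = divmod(total_items, total_procs)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def process_partition(lines: List[str], total_procs: int, rank: int,
                      out_dir: str) -> Tuple[str, int, int]:
    """Preprocess this rank's slice and write it to its own output file.

    Returns (output path, number of lines written, number of tokens counted).
    """
    start, end = partition_bounds(len(lines), total_procs, rank)
    # Assumed naming scheme: one file per rank, so no two processes share a file.
    out_path = os.path.join(out_dir, f"part_{rank:05d}.txt")
    n_tokens = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for line in lines[start:end]:
            tokens = line.split()  # placeholder, not real tokenization
            n_tokens += len(tokens)
            f.write(" ".join(tokens) + "\n")
    return out_path, end - start, n_tokens
```

Because each rank derives its range from only `total_procs` and its own `rank`, and writes to its own file, the processes need no coordination at runtime. In a sketch like this, the per-rank line and token counts could be summed afterwards to produce totals.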
@@ -11,14 +14,11 @@ This project implements a scalable, multi-process data preprocessing utility for
- **Modular Design:** `data_loader.py`, `shard_processor.py`, and `main.py` for clean separation of logic.
## TODOs
- [ ] Enhanced pre-processing and real tokenization.
- [ ] Explain the concept better, add a model to illustrate it, and discuss advantages and disadvantages.
- [ ] Add sentence splitting support (optional).
- [ ] Support output in indexed/binary format.
- [ ] Add advanced error logging and resource monitoring.
- [ ] Validate with larger dataset samples.
- [ ] Finalize environment initialization and testing on cluster.
## Setup
......
#!/bin/bash
source .venv/bin/activate # Adjust if using virtualenv
source .venv/bin/activate #TODO:
pip install -r requirements.txt
TOTAL_PROCS = 50
results:
Avg Partition Time: 21.19s
Avg Processing Time: 75.15s
Avg Total Time: 96.34s
Total Lines: 4662836, Total Tokens: 2040046108, Avg Tokens per Line: 437.51
TOTAL_PROCS = 50
Avg Partition Time: 22.29s
Avg Processing Time: 77.96s
Avg Total Time: 100.25s
Total Lines: 4662836, Total Tokens: 2040046108, Avg Tokens per Line: 437.51
TOTAL_PROCS = 50
Avg Partition Time: 56.98s
Avg Processing Time: 102.40s
Avg Total Time: 159.38s
Total Lines: 4662836, Total Tokens: 2040046108, Avg Tokens per Line: 437.51
TOTAL_PROCS = 50
Avg Partition Time: 22.69s
Avg Processing Time: 74.88s
Avg Total Time: 97.57s
Total Lines: 4662836, Total Tokens: 2040046108, Avg Tokens per Line: 437.51
\ No newline at end of file