# Scalable C4 Preprocessing Utility
## Overview
This project implements a scalable, multi-process data preprocessing utility for large-scale language model training, specifically for the multilingual C4 dataset. Each process handles a portion of the dataset, performs preprocessing, and writes its output independently. The utility is designed for efficient execution on HPC clusters such as **Noctua2**.
## Approach
- **File Discovery:** Collects all relevant JSONL/JSON.GZ files using a master list.
- **Sharding:** Files are assigned evenly to processes by slicing the file list based on `PROC_RANK` and `TOTAL_PROCS` (see the sketch after this list).
- **Dynamic Processing:** Each process tokenizes text (`lower().strip().split()`) and writes to its own result file.
- **Performance Tracking:** Logs elapsed time, line count, and token count for evaluation.
- **Modular Design:** `data_loader.py`, `shard_processor.py`, and `main.py` provide a clean separation of logic.
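Below is a minimal sketch of the sharding and per-process loop. The glob pattern, output-file name, and JSON field access are illustrative assumptions; the actual logic lives in `data_loader.py`, `shard_processor.py`, and `main.py`.
```python
import glob
import gzip
import json
import os
import time

# Rank and world size come from the environment, as in the Approach section.
# Defaults allow a single-process dry run outside the cluster.
PROC_RANK = int(os.environ.get("PROC_RANK", 0))
TOTAL_PROCS = int(os.environ.get("TOTAL_PROCS", 1))

def tokenize(text: str) -> list[str]:
    """Whitespace tokenization as described above."""
    return text.lower().strip().split()

# File discovery: the glob pattern stands in for the project's master list.
all_files = sorted(glob.glob("data/c4/*.json*"))

# Sharding by slicing: process i takes every TOTAL_PROCS-th file.
my_files = all_files[PROC_RANK::TOTAL_PROCS]

start = time.time()
n_lines = n_tokens = 0
# Each process writes its own result file (the name is illustrative).
with open(f"processed_rank{PROC_RANK:05d}.txt", "w", encoding="utf-8") as out:
    for path in my_files:
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                # C4 shards are JSONL; each record carries a "text" field.
                tokens = tokenize(json.loads(line).get("text", ""))
                out.write(" ".join(tokens) + "\n")
                n_lines += 1
                n_tokens += len(tokens)

# Performance tracking: elapsed time, line count, and token count.
print(f"rank {PROC_RANK}: {n_lines} lines, {n_tokens} tokens "
      f"in {time.time() - start:.2f}s")
```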
⚠️ **Disclaimer:**
This code is **work in progress** and will be updated until **March 16th**. Some elements (e.g., environment setup, dataset handling) may evolve for optimization and completeness.
## TODOs
- [ ] Add sentence splitting support (optional).
- [ ] Support output in indexed/binary format.
- [ ] Add advanced error logging and resource monitoring.
- [ ] Validate with larger dataset samples.
- [ ] Finalize environment initialization and testing on cluster.
## Setup
1. Install dependencies:
```bash
pip install -r requirements.txt
```
2. Initialize the environment via `environment_initialization.sh`:
```bash
#!/bin/bash
# NOTE: This script initializes the environment for the project.
source .venv/bin/activate  # Adjust if using virtualenv
pip install -r requirements.txt
```
3. Launch preprocessing with the run script:
```bash
#!/bin/bash
export N_PROCS=20  # Number of total processes to spawn. This is just an example; your utility should be ready to scale up or down when this value changes.

# Load the environment
source environment_initialization.sh
bash prepare.sh

# Spawner calling your preprocessing script N_PROCS times across the computing cluster
super_duper_process_spawner -n $N_PROCS python main.py
echo "Finished preprocessing data."
```
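Because `super_duper_process_spawner` stands in for the cluster's real launcher, a local test can spawn the workers directly. A minimal sketch, assuming `main.py` reads `PROC_RANK`/`TOTAL_PROCS` from its environment (the four-process count is arbitrary):
```python
import os
import subprocess

TOTAL_PROCS = 4  # arbitrary count for a local test

# Launch one worker per rank, each with its own PROC_RANK in the environment.
workers = []
for rank in range(TOTAL_PROCS):
    env = dict(os.environ, PROC_RANK=str(rank), TOTAL_PROCS=str(TOTAL_PROCS))
    workers.append(subprocess.Popen(["python", "main.py"], env=env))

# Wait for all workers and report any non-zero exit codes.
for rank, proc in enumerate(workers):
    code = proc.wait()
    if code != 0:
        print(f"rank {rank} exited with code {code}")
```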
`requirements.txt`:
```
# TODO: list of all the packages required for this project
datasets
```