# Scalable C4 Preprocessing Utility
## Overview
This project implements a scalable, multi-process data preprocessing utility for large-scale language model training, specifically for the multilingual C4 dataset. Each process handles a portion of the dataset, performs preprocessing, and writes its output independently. The utility is designed for efficient execution on HPC clusters such as **Noctua2**.
## Approach
- **File Discovery:** Collects all relevant JSONL/JSON.GZ files using a master list.
- **Sharding:** Files are assigned evenly to processes by slicing the file list based on `PROC_RANK` and `TOTAL_PROCS` (see the sketch after this list).
- **Dynamic Processing:** Each process tokenizes text (`lower().strip().split()`) and writes to its own result file.
- **Performance Tracking:** Logs elapsed time, line count, and token count for evaluation.
- **Modular Design:** `data_loader.py`, `shard_processor.py`, and `main.py` provide a clean separation of logic.
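Below is a minimal sketch of the sharding and per-process loop. The glob pattern, output-file name, and JSON field access are illustrative assumptions; the actual logic lives in `data_loader.py`, `shard_processor.py`, and `main.py`.
```python
import glob
import gzip
import json
import os
import time

# Rank and world size come from the environment, as in the Approach section.
# Defaults allow a single-process dry run outside the cluster.
PROC_RANK = int(os.environ.get("PROC_RANK", 0))
TOTAL_PROCS = int(os.environ.get("TOTAL_PROCS", 1))

def tokenize(text: str) -> list[str]:
    """Whitespace tokenization as described above."""
    return text.lower().strip().split()

# File discovery: the glob pattern stands in for the project's master list.
all_files = sorted(glob.glob("data/c4/*.json*"))

# Sharding by slicing: process i takes every TOTAL_PROCS-th file.
my_files = all_files[PROC_RANK::TOTAL_PROCS]

start = time.time()
n_lines = n_tokens = 0
# Each process writes its own result file (the name is illustrative).
with open(f"processed_rank{PROC_RANK:05d}.txt", "w", encoding="utf-8") as out:
    for path in my_files:
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                # C4 shards are JSONL; each record carries a "text" field.
                tokens = tokenize(json.loads(line).get("text", ""))
                out.write(" ".join(tokens) + "\n")
                n_lines += 1
                n_tokens += len(tokens)

# Performance tracking: elapsed time, line count, and token count.
print(f"rank {PROC_RANK}: {n_lines} lines, {n_tokens} tokens "
      f"in {time.time() - start:.2f}s")
```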
⚠️ **Disclaimer:**
This code is **work in progress** and will be updated until **March 16th**. Some elements (e.g., environment setup, dataset handling) may evolve for optimization and completeness.
## TODOs
- [ ] Add sentence splitting support (optional).
- [ ] Support output in indexed/binary format.
- [ ] Add advanced error logging and resource monitoring.
- [ ] Validate with larger dataset samples.
- [ ] Finalize environment initialization and testing on cluster.
## Setup
1. Install dependencies:
```bash
pip install -r requirements.txt
```
2. Initialize the environment via `environment_initialization.sh`:
```bash
#!/bin/bash
# NOTE: This script initializes the environment for the project.
source .venv/bin/activate  # Adjust if using virtualenv
pip install -r requirements.txt
```
3. Launch preprocessing with the run script:
```bash
#!/bin/bash
export N_PROCS=20  # Number of total processes to spawn. This is just an example; your utility should be ready to scale up or down when this value changes.

# Load the environment
source environment_initialization.sh
bash prepare.sh

# Spawner calling your preprocessing script N_PROCS times across the computing cluster
super_duper_process_spawner -n $N_PROCS python main.py
echo "Finished preprocessing data."
```
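Because `super_duper_process_spawner` stands in for the cluster's real launcher, a local test can spawn the workers directly. A minimal sketch, assuming `main.py` reads `PROC_RANK`/`TOTAL_PROCS` from its environment (the four-process count is arbitrary):
```python
import os
import subprocess

TOTAL_PROCS = 4  # arbitrary count for a local test

# Launch one worker per rank, each with its own PROC_RANK in the environment.
workers = []
for rank in range(TOTAL_PROCS):
    env = dict(os.environ, PROC_RANK=str(rank), TOTAL_PROCS=str(TOTAL_PROCS))
    workers.append(subprocess.Popen(["python", "main.py"], env=env))

# Wait for all workers and report any non-zero exit codes.
for rank, proc in enumerate(workers):
    code = proc.wait()
    if code != 0:
        print(f"rank {rank} exited with code {code}")
```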
`requirements.txt`:
```
# TODO: list of all the packages required for this project
datasets
```