PyTorch Checkpoint Recovery: Fix Pickle Errors and Recover .pth Model Files
PyTorch checkpoint recovery addresses two distinct problems that ML practitioners face: deleted .pt or .pth files that need to be restored from disk, and existing checkpoint files that produce _pickle.UnpicklingError or EOFError on load. Both problems are solvable — but the approach is entirely different for each. This guide covers file recovery for deleted checkpoints and diagnostic steps for corrupted ones.
⚠️ Warning: If a checkpoint file was deleted from your training machine, stop all training runs and disk writes to that drive immediately. Checkpoint files are large and distinctive, so they are often recoverable, but continued disk activity will overwrite sectors and reduce recovery success. Do not launch a new training run to "replace" the checkpoint until you have attempted recovery.
Part 1. PyTorch Error Types and Solutions
The first step in checkpoint recovery is correctly diagnosing whether the file is missing or present-but-corrupt. These are different problems with different solutions.
Table 1: PyTorch Error Types and Solutions
| Error / Symptom | Cause | Solution | Recovery Needed? |
|---|---|---|---|
| FileNotFoundError: [Errno 2] No such file or directory | File deleted or path wrong | Check path; run file recovery if deleted | Yes (if file is deleted) |
| _pickle.UnpicklingError: invalid load key | File partially corrupt, truncated, or not a checkpoint at all (for example a Git LFS pointer or an HTML error page) | Attempt partial load; repair or re-train | No (file exists but damaged) |
| EOFError during torch.load() | File write interrupted (crash during save) | Check file size; likely truncated | No (try to repair) |
| RuntimeError: PytorchStreamReader failed | ZIP container is corrupt (torch.save writes ZIP archives by default since PyTorch 1.6, for both .pt and .pth) | Re-download or recover from backup | No (container issue) |
| ModuleNotFoundError on load | Module that defines a pickled class is not importable | Import the class before loading | No (code issue) |
| KeyError: 'model_state_dict' | Checkpoint saved with different key | Inspect checkpoint dict keys | No (code issue) |
| Drive not detected, files missing | Physical or logical drive failure | Ritridata recovery scan | Yes (drive-level issue) |
| Checkpoint file is 0 bytes | Write interrupted at start of save | Recovery from previous checkpoint | Yes (if prior version exists) |
| Checkpoint file is much smaller than expected | Write interrupted mid-save | File is truncated; see repair steps | No (partial data only) |
The critical distinction: if os.path.exists(checkpoint_path) returns True, the file is on disk and the problem is corruption or format, not deletion. File recovery tools only help when the file is not present on the file system.
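The missing-versus-corrupt decision can be sketched as a small triage helper. This is a minimal sketch, not a definitive tool: the function name `triage_checkpoint`, the `expected_bytes` parameter, and the returned labels are hypothetical, and the 10% cutoff is the rule of thumb this guide uses for truncated files.

```python
import os


def triage_checkpoint(path, expected_bytes=None):
    """Classify a checkpoint path before choosing a recovery strategy."""
    if not os.path.exists(path):
        # Nothing on the file system: check the path, then consider file recovery.
        return "missing"
    size = os.path.getsize(path)
    if size == 0:
        # The save was interrupted before any data reached the disk.
        return "empty"
    if expected_bytes is not None and size < 0.1 * expected_bytes:
        # Far smaller than expected: the write was probably cut short.
        return "truncated"
    # The file is on disk with a plausible size: diagnose corruption or format.
    return "present"
```

Only the "missing" result points toward a file recovery tool; every other result means the file is on disk and needs the diagnostic steps instead.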
Part 2. Checkpoint File Sizes and Recovery Characteristics
Understanding typical checkpoint sizes helps you plan recovery destination space and identify recovered files during a scan.
Table 2: Checkpoint File Sizes and Recovery by Model Type
| Model Type | Architecture | Typical Checkpoint Size | Recovery Notes |
|---|---|---|---|
| Small custom CNN | ResNet-18 scale | 50–200 MB | Very small — fast recovery |
| BERT base | 12-layer transformer | ~440 MB | Fast recovery, good success rate |
| GPT-2 small | 117M parameters | ~500 MB | Small, reliable recovery |
| ResNet-50 | Standard CNN | ~100 MB | Very fast recovery |
| ViT-B/16 | Vision transformer | ~340 MB | Reliable recovery |
| LLaMA 7B full (FP32) | Large language model | ~26 GB | Large — act fast, fragmentation risk |
| LLaMA 7B (BF16) | Large language model | ~13 GB | Large — recover promptly |
| SDXL (Stable Diffusion XL) | Diffusion model | ~6.5 GB | Medium-large — good recovery rate |
| Optimizer state (AdamW) | Paired with model checkpoint | About 2x model size (AdamW keeps two moment buffers per parameter) | Always recover in pairs with model |
| Full training state (model + optimizer + scheduler) | Complete checkpoint | About 3x model size with AdamW | Largest files; prioritize recovery |
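The weight sizes in the table follow roughly from parameter count times bytes per parameter. The sketch below shows that arithmetic; the helper names are hypothetical, and the 7-billion parameter count is a round number assumed for illustration rather than an exact model size.

```python
# Bytes used by one parameter in each common storage dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def estimate_weights_bytes(num_params, dtype="fp32"):
    """Approximate on-disk size of the model weights alone."""
    return num_params * BYTES_PER_PARAM[dtype]


def to_gib(num_bytes):
    """Convert bytes to binary gigabytes (GiB)."""
    return num_bytes / 2**30


fp32_size = estimate_weights_bytes(7_000_000_000, "fp32")  # 28,000,000,000 bytes
bf16_size = estimate_weights_bytes(7_000_000_000, "bf16")  # 14,000,000,000 bytes
print(f"{to_gib(fp32_size):.1f} GiB, {to_gib(bf16_size):.1f} GiB")  # 26.1 GiB, 13.0 GiB
```

An estimate like this gives you a size range to filter on when sorting scan results, and a lower bound for free space on the recovery destination.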
💡 Tip: Always save checkpoints with both the model state dict and the optimizer state. Using torch.save({'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict(), 'epoch': epoch}, path) ensures you can resume training from any checkpoint rather than just running inference.
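A save-and-resume round trip for that dict layout might look like the sketch below. The function names are hypothetical, the code assumes PyTorch 1.13 or later (for the `weights_only` argument), and it returns the next epoch to run as one possible convention.

```python
import torch


def save_training_state(path, model, optimizer, epoch):
    """Write model weights, optimizer state, and the finished epoch to one file."""
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "epoch": epoch,
        },
        path,
    )


def resume_training_state(path, model, optimizer):
    """Restore both state dicts in place and return the next epoch to run."""
    # Load on CPU with the restricted unpickler; load_state_dict copies the
    # values onto whatever device the model and optimizer already use.
    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1
```

Typical use is `start_epoch = resume_training_state('checkpoint.pth', model, optimizer)` after constructing the model and optimizer exactly as in the original run.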
Part 3. Fixing Corrupted PyTorch Checkpoints (File Exists but Will Not Load)
If the .pt or .pth file exists but produces errors on load, try these steps before concluding the file is unrecoverable.
Step 1: Check file size
A valid checkpoint should have a predictable size based on your model architecture. A 0-byte file or a file that is unexpectedly small (less than 10% of expected size) was truncated during write: the data was not saved before the process was interrupted.
Step 2: Try loading with map_location='cpu'
checkpoint = torch.load('checkpoint.pth', map_location='cpu')
Device mismatch errors can mimic corruption. Loading to CPU first isolates the problem.
Step 3: Try weights_only=True
checkpoint = torch.load('checkpoint.pth', weights_only=True)
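This restricts unpickling to tensors, plain Python containers, and primitive types instead of arbitrary objects (PyTorch 2.6 and later use this mode by default). If this works where a full load fails, the issue is with non-tensor objects in the checkpoint, not the tensors themselves. If it fails with an error about an unsupported global, the file is probably intact but contains custom Python objects.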
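Steps 2 and 3 can be combined into one diagnostic call. The following is a minimal sketch assuming PyTorch 1.13 or later; the helper name `try_load` and its `(checkpoint, error)` return convention are hypothetical, chosen so that a failure is reported rather than raised.

```python
import torch


def try_load(path):
    """Return (checkpoint, None) on success or (None, error_text) on failure."""
    try:
        # CPU mapping rules out device mismatch; weights_only avoids running
        # arbitrary pickled code from a file of uncertain origin.
        checkpoint = torch.load(path, map_location="cpu", weights_only=True)
        return checkpoint, None
    except Exception as exc:
        # Keep the exception type: it identifies which failure case applies.
        return None, f"{type(exc).__name__}: {exc}"
```

The exception name in the returned text maps directly onto the rows of Table 1, for example `EOFError` for an interrupted write.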
Step 4: Inspect raw pickle structure
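import pickle
with open('checkpoint.pth', 'rb') as f:
    data = pickle.load(f)  # only unpickle files you trust
If this succeeds with a dict, check which keys are present; a missing key points to a save-format mismatch rather than corruption. A failure here is not proof of corruption on its own: since PyTorch 1.6, torch.save writes a ZIP archive by default, and plain pickle.load() rejects even a healthy ZIP-format checkpoint with an invalid load key error. For those files, test the ZIP container itself (for example with Python's zipfile module) before concluding the file is damaged.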
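Checkpoints written by torch.save in PyTorch 1.6 and later are ZIP archives by default, so the container can be examined with the standard library alone. This sketch uses only `zipfile`; the function name and the returned dict layout are hypothetical, and the CRC check assumes the archive was written with checksums.

```python
import zipfile


def inspect_zip_checkpoint(path):
    """Report whether a file is a readable ZIP archive and list its members."""
    if not zipfile.is_zipfile(path):
        # Not a ZIP, or truncated so that the trailing directory record is gone.
        return {"is_zip": False, "bad_member": None, "members": []}
    with zipfile.ZipFile(path) as archive:
        return {
            "is_zip": True,
            # Name of the first member that fails its CRC check, or None.
            "bad_member": archive.testzip(),
            "members": archive.namelist(),
        }
```

A result with `is_zip` set to False on a file that should be a modern checkpoint suggests truncation, while a non-empty `bad_member` suggests damage inside an otherwise complete archive.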
💡 Tip: When saving checkpoints mid-training, use an atomic write pattern: save to a .tmp file first, then rename to the final filename. This prevents a crashed save from leaving a corrupted checkpoint in place of a valid one: torch.save(state, 'checkpoint.tmp'); os.replace('checkpoint.tmp', 'checkpoint.pth'). Prefer os.replace over os.rename here, because os.rename fails on Windows when the destination file already exists.
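The atomic write pattern can be wrapped in a small helper. This is a sketch under stated assumptions: `atomic_save` and its `save_fn` callback are hypothetical names, and the temporary file is assumed to live in the same directory as the final file so the rename stays on one file system.

```python
import os


def atomic_save(save_fn, final_path):
    """Run save_fn on a temporary path, then move the result into place."""
    tmp_path = final_path + ".tmp"
    try:
        save_fn(tmp_path)
        # Flush the temporary file to disk before it replaces the old one.
        with open(tmp_path, "r+b") as handle:
            os.fsync(handle.fileno())
        # os.replace overwrites an existing destination on every platform.
        os.replace(tmp_path, final_path)
    except BaseException:
        # A failed save leaves the previous checkpoint untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With PyTorch this would be called as `atomic_save(lambda p: torch.save(state, p), 'checkpoint.pth')`; passing the writer as a callback keeps the helper independent of any one library.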
Part 4. Recovering Deleted PyTorch Checkpoint Files with Ritridata
If the checkpoint file is genuinely missing from the file system — confirmed by os.path.exists() returning False — Ritridata can often recover it from the drive sectors where it previously existed.
Step 1: Stop all disk activity on the affected drive
Do not start a new training run. Do not install packages with pip. Do not save any new files to the same drive partition. Large checkpoint files require many sectors, so any new writes increase overwrite risk.
Step 2: Install Ritridata on a different drive
Install on your OS drive or an external drive, not the training drive containing the lost checkpoint.
Step 3: Launch Ritridata and select the training drive
Select the partition or raw disk where your checkpoint directory was located. Common locations:
- Linux training servers: /home/{user}/checkpoints/ or /mnt/data/
- Windows: C:\Users\{user}\checkpoints\ or a dedicated data drive
- Note: Ritridata supports Windows and Mac; for Linux servers, see alternative tools in the FAQ.
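Step 4: Run a deep scan and filter by size
Start a deep scan. When complete, sort recovered files by size and look for files in the expected size range for your model. Checkpoints written by torch.save since PyTorch 1.6 are ZIP archives, so a signature-based scan may list them as generic ZIP files rather than under their original .pt or .pth names; in that case file size is the most practical filter.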
Step 5: Recover to a separate drive
Save recovered checkpoints to a drive other than the source. Test by loading: torch.load('recovered_checkpoint.pth', map_location='cpu').
💡 Tip: Ritridata does not support recovery from network-attached storage systems or Linux training servers. For those environments, consider PhotoRec (open source) for basic binary file recovery, or contact a professional data recovery service for NAS-level failures.
Part 5. Checkpoint Backup Best Practices for ML Workflows
Losing a training checkpoint represents lost compute time, not just lost disk space. A minimal checkpoint strategy prevents the most common loss scenarios.
Save checkpoints at every epoch or every N steps depending on your training duration. Keep the last 3 checkpoints rather than just the latest: if the most recent save is corrupted, you can resume from the one before it and lose only a single save interval.
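A keep-the-last-three policy can be enforced with a short pruning step after each save. This is a minimal sketch: the function name `prune_checkpoints` and the filename pattern `checkpoint_epoch<N>.pth` are assumptions for illustration, so adjust the pattern to your own naming scheme.

```python
import os
import re


def prune_checkpoints(directory, keep=3, pattern=r"checkpoint_epoch(\d+)\.pth$"):
    """Delete all but the newest `keep` checkpoints; return the names kept."""
    regex = re.compile(pattern)
    found = []
    for name in os.listdir(directory):
        match = regex.match(name)
        if match:
            # Sort on the numeric epoch so that epoch 10 ranks after epoch 9.
            found.append((int(match.group(1)), name))
    found.sort()
    stale = found[:-keep] if keep > 0 else found
    for _, name in stale:
        os.remove(os.path.join(directory, name))
    return [name for _, name in found[len(stale):]]
```

Files that do not match the pattern are never touched, so logs and configuration files can share the checkpoint directory.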
Use cloud storage for checkpoint sync during long training runs. Services like AWS S3, Google Cloud Storage, and Azure Blob Storage can receive checkpoint uploads automatically via training callbacks in PyTorch Lightning, Hugging Face Trainer, or custom training loops.
Part 6. Ritridata for PyTorch Checkpoint Recovery
Ritridata is effective for recovering large binary files — which PyTorch checkpoints are. The deep scan engine detects pickle file signatures and can recover .pt and .pth files from drives where the file system has been damaged or the files have been deleted.
The free scan shows recovered file sizes and names so you can confirm the right checkpoint was found before committing to recovery. For ML practitioners who have lost weeks of training, that confirmation is worth the scan time.
Download Ritridata and start a free scan
FAQ
Q1: What is the difference between a .pt and .pth file in PyTorch?
Both are standard PyTorch save formats, and the extension is cosmetic. .pth is a common convention for checkpoints containing training state, while .pt is often used for exported models or tensors. Both are written by torch.save, which since PyTorch 1.6 produces a ZIP archive holding a pickled object structure plus the raw tensor data, and both are recoverable the same way.
Q2: My training server is Linux-based. Does Ritridata work on Linux?
Ritridata currently supports Windows and Mac. For Linux, consider TestDisk or PhotoRec (both free and open source) for basic binary file recovery. Enterprise ML setups on Linux may benefit from professional data recovery services.
Q3: Can I recover a checkpoint from a GPU cloud instance that was terminated?
Cloud instance storage is typically ephemeral: when an AWS EC2 or Google Cloud instance terminates, instance-local storage is wiped. Persistent storage (for example EBS volumes or Cloud Storage buckets) is generally retained, although an EBS volume configured to delete on termination is removed with the instance. Always save checkpoints to persistent attached storage, not instance local drives.
Q4: My checkpoint loaded but training loss jumped — is the checkpoint partially corrupt?
A checkpoint that loads without error but produces unexpected behavior may have had optimizer state corrupted while model state is intact. Try loading only model_state_dict from the checkpoint and reinitializing the optimizer from scratch.
Q5: How do I recover a checkpoint if I only have the exported ONNX file?
ONNX files contain frozen model weights but not optimizer state or training history. You can convert an ONNX file back to PyTorch using onnx2torch or similar tools, but you cannot resume training from an ONNX export alone.
Q6: Is it possible to recover a checkpoint that was overwritten by a newer save?
Once a file has been overwritten in place (same path, same file), the previous version is gone unless you have versioned backups or a snapshot file system. Tools like ZFS, Btrfs snapshots, or Time Machine on Mac can preserve file versions.
Q7: Can Ritridata recover checkpoints from an external SSD used for training?
Yes. Ritridata supports external SSDs on both Windows and Mac. Connect the external SSD before launching Ritridata and select it from the device list. Act quickly: on an SSD, TRIM can erase the blocks of a deleted file soon after deletion, whereas an HDD keeps the data until it is overwritten.
Q8: What happens if only part of the checkpoint is recovered?
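A partially recovered checkpoint will likely fail to load due to truncation or missing data. Checkpoints in the default ZIP format keep their index (the central directory) at the end of the file, so a file that was cut short usually cannot be opened by torch.load() at all. For a file written with plain pickle you can attempt pickle.load(), but a truncated pickle stream typically raises an error rather than returning a partial dict. Depending on how much was recovered, extracting the model weights may still be possible even if optimizer state is missing, but expect manual, format-aware work.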
