
PyTorch Checkpoint Gone or Broken? Fix Pickle Errors and Recover .pt Files

Ethan Carter | Last Updated: March 14, 2026

A corrupted .pth file or an accidental delete of a training checkpoint can cost days or weeks of GPU compute time.
This guide covers both recovery paths: restoring deleted .pt files from disk and fixing _pickle.UnpicklingError on files that exist but will not load.
Ritridata can recover deleted checkpoint files — and this guide explains how to repair corrupt ones too.

PyTorch Checkpoint Recovery: Fix Pickle Errors and Recover .pth Model Files

PyTorch checkpoint recovery addresses two distinct problems that ML practitioners face: deleted .pt or .pth files that need to be restored from disk, and existing checkpoint files that produce _pickle.UnpicklingError or EOFError on load. Both problems are solvable — but the approach is entirely different for each. This guide covers file recovery for deleted checkpoints and diagnostic steps for corrupted ones.

⚠️ Warning: If a checkpoint file was deleted from your training machine, stop all training runs and disk writes to that drive immediately. Checkpoint files are large and distinctive, which often makes them recoverable, but continued disk activity overwrites the sectors they occupied and reduces the chance of success. Do not launch a new training run to "replace" the checkpoint until you have attempted recovery.

Part 1. PyTorch Error Types and Solutions

The first step in checkpoint recovery is correctly diagnosing whether the file is missing or present-but-corrupt. These are different problems with different solutions.

Table 1: PyTorch Error Types and Solutions

| Error / Symptom | Cause | Solution | Recovery Needed? |
| --- | --- | --- | --- |
| FileNotFoundError: [Errno 2] No such file or directory | File deleted or path wrong | Check path; run file recovery if deleted | Yes, if the file was deleted |
| _pickle.UnpicklingError: invalid load key | File partially corrupt or truncated | Attempt partial load; repair or re-train | No, file exists but is damaged |
| EOFError during torch.load() | File write interrupted (crash during save) | Check file size; likely truncated | No, try to repair |
| RuntimeError: PytorchStreamReader failed | ZIP container is corrupt (torch.save writes a ZIP archive by default since PyTorch 1.6, for both .pt and .pth) | Re-download or restore from backup | No, container issue |
| ModuleNotFoundError on load | Model class not in scope | Import the class before loading | No, code issue |
| KeyError: 'model_state_dict' | Checkpoint saved with different key | Inspect checkpoint dict keys | No, code issue |
| Drive not detected, files missing | Physical or logical drive failure | Ritridata recovery scan | Yes, drive-level issue |
| Checkpoint file is 0 bytes | Write interrupted at start of save | Recover from a previous checkpoint | Yes, if a prior version exists |
| Checkpoint file is much smaller than expected | Write interrupted mid-save | File is truncated; see repair steps | No, partial data only |

The critical distinction: if os.path.exists(checkpoint_path) returns True, the file is on disk and the problem is corruption or format, not deletion. File recovery tools only help when the file is not present on the file system.
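The check above can be extended into a quick triage before reaching for any recovery tool. The following is a minimal sketch using only the standard library; the function name `diagnose_checkpoint` and its return labels are illustrative and not part of PyTorch. It assumes the checkpoint was written by PyTorch 1.6 or later, where torch.save produces a ZIP archive by default, so a healthy file normally passes `zipfile.is_zipfile`.

```python
import os
import zipfile


def diagnose_checkpoint(path):
    """Classify a checkpoint path (labels are illustrative, not a PyTorch API)."""
    if not os.path.exists(path):
        return "missing"          # deleted or wrong path: file recovery may help
    if os.path.getsize(path) == 0:
        return "empty"            # save was interrupted before any data was written
    if zipfile.is_zipfile(path):
        return "zip-ok"           # ZIP directory is readable; try torch.load next
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic == b"PK\x03\x04":
        return "zip-truncated"    # starts like a ZIP but the end of the archive is gone
    return "legacy-or-unknown"    # pre-1.6 format, or not a checkpoint at all
```

Calling `diagnose_checkpoint('checkpoint.pth')` returns one of the five labels. Only "missing" points toward file recovery; the others mean the file is on disk and the repair steps in Part 3 apply.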

Part 2. Checkpoint File Sizes and Recovery Characteristics

Understanding typical checkpoint sizes helps you plan recovery destination space and identify recovered files during a scan.

Table 2: Checkpoint File Sizes and Recovery by Model Type

| Model Type | Architecture | Typical Checkpoint Size | Recovery Notes |
| --- | --- | --- | --- |
| Small custom CNN | ResNet-18 scale | 50–200 MB | Very small, fast recovery |
| BERT base | 12-layer transformer | ~440 MB | Fast recovery, good success rate |
| GPT-2 small | 117M parameters | ~500 MB | Small, reliable recovery |
| ResNet-50 | Standard CNN | ~100 MB | Very fast recovery |
| ViT-B/16 | Vision transformer | ~340 MB | Reliable recovery |
| LLaMA 7B full (FP32) | Large language model | ~26 GB | Large: act fast, fragmentation risk |
| LLaMA 7B (BF16) | Large language model | ~13 GB | Large: recover promptly |
| SDXL (Stable Diffusion XL) | Diffusion model | ~6.5 GB | Medium-large, good recovery rate |
| Optimizer state (AdamW) | Paired with model checkpoint | About 2x model size at equal precision (two moment tensors per parameter) | Always recover together with the model |
| Full training state (model + optimizer + scheduler) | Complete checkpoint | About 3x model size with AdamW | Largest files, prioritize recovery |
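The sizes in the table follow from simple arithmetic: parameter count times bytes per parameter, plus whatever the optimizer stores. The sketch below is a rough estimator, not a PyTorch utility; the function names are hypothetical, metadata overhead is ignored, and `optimizer_slots` is an assumption you set yourself (Adam-style optimizers track two moment estimates per parameter, so 2 is a reasonable value for AdamW).

```python
def estimate_checkpoint_bytes(num_params, bytes_per_param=4, optimizer_slots=0):
    """Rough on-disk size of a checkpoint, ignoring metadata overhead.

    optimizer_slots is the number of extra per-parameter tensors the
    optimizer keeps (for example, 2 for Adam-style optimizers).
    """
    return num_params * bytes_per_param * (1 + optimizer_slots)


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


# BERT base: about 110 million parameters stored as FP32, weights only
print(round(to_gb(estimate_checkpoint_bytes(110_000_000)), 2))                     # 0.44
# Same model with two optimizer tensors per parameter saved alongside
print(round(to_gb(estimate_checkpoint_bytes(110_000_000, optimizer_slots=2)), 2))  # 1.32
```

Knowing the expected size to within a few percent makes it easier to spot the right file in a recovery scan and to recognize a truncated one.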

💡 Tip: Always save checkpoints with both the model state dict and the optimizer state. Using torch.save({'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict(), 'epoch': epoch}, path) ensures you can resume training from any checkpoint rather than just running inference.

Part 3. Fixing Corrupted PyTorch Checkpoints (File Exists but Will Not Load)

If the .pt or .pth file exists but produces errors on load, try these steps before concluding the file is unrecoverable.

Step 1: Check file size

A valid checkpoint should have a predictable size based on your model architecture. A 0-byte file, or one far smaller than expected (less than 10% of the expected size), was truncated during the write: the process was interrupted before the data reached disk.

Step 2: Try loading with map_location='cpu'

checkpoint = torch.load('checkpoint.pth', map_location='cpu')

Device mismatch errors can mimic corruption. Loading to CPU first isolates the problem.

Step 3: Try weights_only=True

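checkpoint = torch.load('checkpoint.pth', map_location='cpu', weights_only=True)

This restricts unpickling to tensors, primitive types, and plain containers, so arbitrary Python objects in the file are rejected rather than executed. It has been the default behavior since PyTorch 2.6. If the load succeeds, the tensors are intact and any earlier failure came from non-tensor objects in the checkpoint.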

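Step 4: Inspect the raw container structure

import zipfile
with zipfile.ZipFile('checkpoint.pth') as zf:
    print(zf.namelist())  # typically a data.pkl entry plus one entry per tensor storage
    print(zf.testzip())   # None means no entry failed its CRC check

Checkpoints written by PyTorch 1.6 or later are ZIP archives, so calling pickle.load() on the file directly fails even when the file is healthy, and it would also execute any code embedded in the pickle. If ZipFile raises BadZipFile, the end of the archive is missing and the file was most likely truncated. If it opens, the listing shows whether data.pkl (the pickled object structure) and the tensor storage entries are present.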

💡 Tip: When saving checkpoints mid-training, use an atomic write pattern: save to a .tmp file first, then rename it to the final filename. This prevents a crashed save from leaving a corrupted checkpoint in place of a valid one: torch.save(state, 'checkpoint.tmp'); os.replace('checkpoint.tmp', 'checkpoint.pth'). Unlike os.rename, os.replace also overwrites an existing destination on Windows.
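The atomic write pattern can be wrapped in a small helper. This is a minimal sketch under stated assumptions: `atomic_write` and its `write_fn` callback are hypothetical names, and `pickle.dump` stands in for `torch.save` so the example runs without PyTorch installed. With PyTorch, the callback would be something like `lambda f: torch.save(state, f)`, since torch.save accepts an open binary file.

```python
import os
import pickle
import tempfile


def atomic_write(path, write_fn):
    """Write via a temp file in the same directory, then swap it into place.

    write_fn is a hypothetical callback that receives an open binary file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(f)
            f.flush()
            os.fsync(f.fileno())      # push the bytes to disk before the swap
        os.replace(tmp_path, path)    # atomic when both paths share a file system
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)       # leave the previous checkpoint untouched
        raise


# Demonstration in a throwaway directory, with pickle standing in for torch.save
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "checkpoint.pth")
    atomic_write(target, lambda f: pickle.dump({"epoch": 3}, f))
    with open(target, "rb") as f:
        print(pickle.load(f))         # {'epoch': 3}
```

The temp file is created in the destination directory on purpose: a rename is only atomic within one file system, so writing it to a separate temp location could silently turn the swap into a copy.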

Part 4. Recovering Deleted PyTorch Checkpoint Files with Ritridata

If the checkpoint file is genuinely missing from the file system — confirmed by os.path.exists() returning False — Ritridata can often recover it from the drive sectors where it previously existed.

Step 1: Stop all disk activity on the affected drive

Do not start a new training run. Do not install packages with pip. Do not save any new files to the same drive partition. Large checkpoint files span many sectors, so any new write increases the risk of overwriting part of them.

Step 2: Install Ritridata on a different drive

Install on your OS drive or an external drive, not the training drive containing the lost checkpoint.

Step 3: Launch Ritridata and select the training drive

Select the partition or raw disk where your checkpoint directory was located. Common locations:

  • Linux training servers: /home/{user}/checkpoints/ or /mnt/data/
  • Windows: C:\Users\{user}\checkpoints\ or a dedicated data drive
  • Note: Ritridata supports Windows and Mac — for Linux servers, see alternative tools in the FAQ.

Step 4: Run a deep scan and filter by size

Start a deep scan. When it completes, sort the recovered files by size and look for files in the expected size range for your model. Checkpoints saved by PyTorch 1.6 or later are ZIP archives that begin with the bytes PK\x03\x04, so a signature-based scan may list them as ZIP files rather than under their original names.

Step 5: Recover to a separate drive

Save recovered checkpoints to a drive other than the source. Test by loading: torch.load('recovered_checkpoint.pth', map_location='cpu').
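Recovery scans often return many files with generic names, so the size filter from Step 4 can be scripted once the files are on the destination drive. The sketch below is illustrative only: `find_checkpoint_candidates` is a hypothetical helper, and it assumes ZIP-format checkpoints (the torch.save default since PyTorch 1.6). Other ZIP-based files of similar size will also match, so each candidate still needs the torch.load test from Step 5.

```python
import os
import zipfile


def find_checkpoint_candidates(recovery_dir, min_bytes, max_bytes):
    """Return (size, path) pairs for recovered files that look like checkpoints.

    A file qualifies if its size falls within [min_bytes, max_bytes] and it
    is a readable ZIP archive. Legacy-format checkpoints are not detected.
    """
    candidates = []
    for root, _dirs, files in os.walk(recovery_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if min_bytes <= size <= max_bytes and zipfile.is_zipfile(path):
                candidates.append((size, path))
    return sorted(candidates, reverse=True)   # largest first
```

For a BERT base checkpoint of roughly 440 MB, a call such as `find_checkpoint_candidates(recovered_dir, 400_000_000, 500_000_000)` narrows the list, where `recovered_dir` is wherever you saved the scan output.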

💡 Tip: Ritridata does not support recovery from network-attached storage (NAS) or from Linux training servers. For those environments, consider PhotoRec (open source) for basic binary file recovery, or contact a professional data recovery service for NAS-level failures.

Part 5. Checkpoint Backup Best Practices for ML Workflows

Losing a training checkpoint represents lost compute time, not just lost disk space. A minimal checkpoint strategy prevents the most common loss scenarios.

Save checkpoints at every epoch or every N steps depending on your training duration. Keep the last 3 checkpoints rather than just the latest: if the most recent save is corrupted, you can fall back to the one before it and lose only a single save interval.
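Keeping only the most recent few checkpoints is easy to automate. The following is a minimal sketch, assuming checkpoints live in one directory and share a filename pattern; `prune_checkpoints` and its defaults are hypothetical, and modification time is used as the ordering, which is an assumption that breaks if files are copied or touched after saving.

```python
from pathlib import Path


def prune_checkpoints(directory, keep=3, pattern="*.pth"):
    """Delete all but the `keep` most recently modified checkpoints.

    Returns the list of paths that were removed.
    """
    files = sorted(Path(directory).glob(pattern),
                   key=lambda p: p.stat().st_mtime,
                   reverse=True)            # newest first
    removed = []
    for old in files[keep:]:
        old.unlink()
        removed.append(old)
    return removed
```

Call it only after a new checkpoint has been saved and verified, so that a failed save never triggers deletion of an older, valid file.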

Use cloud storage for checkpoint sync during long training runs. Services like AWS S3, Google Cloud Storage, and Azure Blob Storage can receive checkpoint uploads automatically via training callbacks in PyTorch Lightning, Hugging Face Trainer, or custom training loops.

Part 6. Ritridata for PyTorch Checkpoint Recovery

Ritridata is effective for recovering large binary files — which PyTorch checkpoints are. The deep scan engine detects pickle file signatures and can recover .pt and .pth files from drives where the file system has been damaged or the files have been deleted.

The free scan shows recovered file sizes and names so you can confirm the right checkpoint was found before committing to recovery. For ML practitioners who have lost weeks of training, that confirmation is worth the scan time.

Download Ritridata and start a free scan

FAQ

Q1: What is the difference between a .pt and .pth file in PyTorch? Both are standard PyTorch save formats — the extension is cosmetic. .pth is a common convention for checkpoints containing training state, while .pt is often used for exported models or tensors. Both use Python pickle serialization and both are recoverable the same way.

Q2: My training server is Linux-based — does Ritridata work on Linux? Ritridata currently supports Windows and Mac. For Linux, consider TestDisk or PhotoRec (both free and open source) for basic binary file recovery. Enterprise ML setups on Linux may benefit from professional data recovery services.

Q3: Can I recover a checkpoint from a GPU cloud instance that was terminated? Usually not from the instance itself. Instance-local storage is ephemeral: when an AWS EC2 or Google Cloud instance terminates, that storage is wiped. Persistent storage such as EBS volumes or GCS buckets outlives the instance, although an EBS volume flagged to delete on termination is removed along with it. Always save checkpoints to persistent storage, not instance-local drives.

Q4: My checkpoint loaded but training loss jumped — is the checkpoint partially corrupt? A checkpoint that loads without error but produces unexpected behavior may have had optimizer state corrupted while model state is intact. Try loading only model_state_dict from the checkpoint and reinitializing the optimizer from scratch.

Q5: How do I recover a checkpoint if I only have the exported ONNX file? ONNX files contain frozen model weights but not optimizer state or training history. You can convert an ONNX file back to PyTorch using onnx2torch or similar tools, but you cannot resume training from an ONNX export alone.

Q6: Is it possible to recover a checkpoint that was overwritten by a newer save? Once a file has been overwritten in place (same path, same file), the previous version is gone unless you have versioned backups or a snapshot file system. Tools like ZFS, Btrfs snapshots, or Time Machine on Mac can preserve file versions.

Q7: Can Ritridata recover checkpoints from an external SSD used for training? Yes. Ritridata supports external SSDs on both Windows and Mac. Connect the external SSD before launching Ritridata and select it from the device list. Act quickly: on an SSD, TRIM can erase the blocks of a deleted file soon after deletion, which does not happen on an HDD.

Q8: What happens if only part of the checkpoint is recovered? A partially recovered checkpoint will likely fail to load due to truncation or missing data. For a ZIP-format checkpoint (PyTorch 1.6 or later), Python's zipfile module can show which entries survived, but a truncated archive usually loses the central directory stored at its end, so standard tools may refuse to open it. Depending on how much was recovered, you may be able to extract the model weights even if optimizer state is missing.
