PyTorch Checkpoint Recovery: Fix Pickle Errors and Recover .pth Model Files

PyTorch checkpoint recovery addresses two distinct problems that ML practitioners face: deleted .pt or .pth files that need to be restored from disk, and existing checkpoint files that produce _pickle.UnpicklingError or EOFError on load. Both problems are solvable — but the approach is entirely different for each. This guide covers file recovery for deleted checkpoints and diagnostic steps for corrupted ones.

⚠️ Warning: If a checkpoint file was deleted from your training machine, stop all training runs and disk writes to that drive immediately. Checkpoint files are large and distinctive, so they are often recoverable, but continued disk activity can overwrite their sectors and reduce the chance of success. Do not launch a new training run to "replace" the checkpoint until you have attempted recovery.

Part 1. PyTorch Error Types and Solutions

The first step in checkpoint recovery is correctly diagnosing whether the file is missing or present-but-corrupt. These are different problems with different solutions.

Table 1: PyTorch Error Types and Solutions

Error / Symptom	Cause	Solution	Recovery Needed?
`FileNotFoundError: [Errno 2] No such file or directory`	File deleted or path wrong	Check path; run file recovery if deleted	Yes — if file is deleted
`_pickle.UnpicklingError: invalid load key`	File partially corrupt or truncated	Attempt partial load; repair or re-train	No — file exists but damaged
`EOFError` during `torch.load()`	File write interrupted (crash during save)	Check file size; likely truncated	No — try to repair
`RuntimeError: PytorchStreamReader failed`	ZIP container is corrupt or truncated (ZIP is the default `torch.save` format since PyTorch 1.6, for both `.pt` and `.pth`)	Re-download or recover from backup	No — container issue
`ModuleNotFoundError` on load	Model class not in scope	Import the class before loading	No — code issue
`KeyError: 'model_state_dict'`	Checkpoint saved with different key	Inspect checkpoint dict keys	No — code issue
Drive not detected, files missing	Physical or logical drive failure	Ritridata recovery scan	Yes — drive-level issue
Checkpoint file is 0 bytes	Write interrupted at start of save	Recovery from previous checkpoint	Yes — if prior version exists
Checkpoint file is much smaller than expected	Write interrupted mid-save	File is truncated — see repair steps	No — partial data only

The critical distinction: if os.path.exists(checkpoint_path) returns True, the file is on disk and the problem is corruption or format, not deletion. File recovery tools only help when the file is not present on the file system.
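The distinction above can be sketched as a small triage helper. This is a minimal illustration using only the standard library; the function name and the returned labels are hypothetical, and it assumes the checkpoint was written by `torch.save` in its default ZIP-based format (PyTorch 1.6 or later).

```python
import os
import zipfile


def triage_checkpoint(path):
    """Classify a checkpoint path before choosing a recovery strategy."""
    if not os.path.exists(path):
        # Nothing on the file system: verify the path, then consider file recovery.
        return "missing"
    if os.path.getsize(path) == 0:
        # The save was interrupted before any bytes were written.
        return "empty"
    if zipfile.is_zipfile(path):
        # The ZIP end-of-central-directory record was found, so the container
        # looks complete; remaining errors are more likely code or format issues.
        return "zip-ok"
    # The file exists but is not a readable ZIP: truncated, corrupt, or legacy format.
    return "damaged-or-legacy"
```

Only the "missing" result points toward a file recovery tool; the other results call for the repair steps in Part 3.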

Part 2. Checkpoint File Sizes and Recovery Characteristics

Understanding typical checkpoint sizes helps you plan recovery destination space and identify recovered files during a scan.

Table 2: Checkpoint File Sizes and Recovery by Model Type

Model Type	Architecture	Typical Checkpoint Size	Recovery Notes
Small custom CNN	ResNet-18 scale	50–200 MB	Very small — fast recovery
BERT base	12-layer transformer	~440 MB	Fast recovery, good success rate
GPT-2 small	117M parameters	~500 MB	Small, reliable recovery
ResNet-50	Standard CNN	~100 MB	Very fast recovery
ViT-B/16	Vision transformer	~340 MB	Reliable recovery
LLaMA 7B full (FP32)	Large language model	~26 GB	Large — act fast, fragmentation risk
LLaMA 7B (BF16)	Large language model	~13 GB	Large — recover promptly
SDXL (Stable Diffusion XL)	Diffusion model	~6.5 GB	Medium-large — good recovery rate
Optimizer state (AdamW)	Paired with model checkpoint	~2x model size (two moment buffers per parameter)	Always recover in pairs with model
Full training state (model + optimizer + scheduler)	Complete checkpoint	~3x model size with AdamW	Largest files — prioritize recovery
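The sizes in the table follow from simple arithmetic: parameter count times bytes per element. The sketch below is a rough estimate only. The function name and dtype labels are hypothetical, it ignores container overhead and scheduler state, and it assumes AdamW keeps two moment buffers per parameter in the same dtype as the weights.

```python
# Bytes used by one element of each dtype.
BYTES_PER_ELEMENT = {"fp32": 4, "bf16": 2, "fp16": 2}


def estimate_checkpoint_bytes(num_params, dtype="fp32", with_adamw=False):
    """Rough on-disk size of a checkpoint, ignoring container overhead."""
    weights = num_params * BYTES_PER_ELEMENT[dtype]
    # Assumption: AdamW stores a first- and a second-moment tensor per
    # parameter, each the same size as the weights.
    optimizer = 2 * weights if with_adamw else 0
    return weights + optimizer
```

For example, `estimate_checkpoint_bytes(25_600_000)` gives the weight-only estimate for a model with about 25.6 million FP32 parameters, and passing `with_adamw=True` triples it.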

💡 Tip: Always save checkpoints with both the model state dict and the optimizer state. Using torch.save({'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict(), 'epoch': epoch}, path) ensures you can resume training from any checkpoint rather than just running inference.

Part 3. Fixing Corrupted PyTorch Checkpoints (File Exists but Will Not Load)

If the .pt or .pth file exists but produces errors on load, try these steps before concluding the file is unrecoverable.

Step 1: Check file size

A valid checkpoint has a predictable size: roughly the parameter count multiplied by the bytes per element, plus any optimizer state. A 0-byte file, or a file far smaller than expected (less than 10% of the expected size), was truncated during the write. The process was interrupted before the data reached disk.
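The size check can be automated along these lines. This is a hedged sketch: the function name and labels are hypothetical, the 10% cutoff comes from the step above, and the 95% "plausible" threshold is an assumption chosen for illustration.

```python
import os


def size_verdict(path, expected_bytes, truncated_below=0.10, plausible_from=0.95):
    """Compare a file's size with an estimate of what the checkpoint should be."""
    actual = os.path.getsize(path)
    if actual == 0:
        return "empty"
    ratio = actual / expected_bytes
    if ratio < truncated_below:
        # Far too small: the save almost certainly stopped early.
        return "truncated"
    if ratio < plausible_from:
        # Smaller than the estimate; may be truncated or a different dtype.
        return "suspicious"
    return "plausible"
```

A "plausible" verdict does not prove the file is intact; it only rules out the most obvious truncation before you move on to the load tests.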

Step 2: Try loading with map_location='cpu'

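import torch

checkpoint = torch.load('checkpoint.pth', map_location='cpu')

Device mismatch errors can mimic corruption. A checkpoint saved on a GPU can fail on a machine without CUDA with a message such as "Attempting to deserialize object on a CUDA device". Loading to CPU first isolates that problem from genuine file damage.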

Step 3: Try weights_only=True

checkpoint = torch.load('checkpoint.pth', weights_only=True)

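This restricts unpickling to tensors and a small allowlist of plain Python types, so arbitrary objects are rejected instead of executed. If this load succeeds where a full load failed, the issue is with non-tensor objects in the checkpoint, not the tensors themselves. Since PyTorch 2.6, weights_only=True is already the default, so on those versions the comparison to try is weights_only=False, and only for files you trust, because full unpickling can run arbitrary code.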

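Step 4: Inspect the raw container structure

import zipfile
with zipfile.ZipFile('checkpoint.pth') as zf:
    print(zf.namelist())  # expect entries such as <name>/data.pkl and <name>/data/0
    print(zf.testzip())   # None means every member passed its CRC check

Checkpoints written by torch.save since PyTorch 1.6 are ZIP archives, so calling pickle.load() directly on the file fails even when the file is healthy. If ZipFile raises BadZipFile, the container itself is damaged, most often because the save was cut off before the central directory was written. If the archive opens, testzip() returns the name of the first member that fails its CRC check; the other members may still be intact.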
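One further check distinguishes a truncated save from other damage. Checkpoints written by `torch.save` since PyTorch 1.6 are ZIP archives, and a ZIP writer emits its central directory last, so an interrupted save typically starts like a ZIP but lacks the end-of-central-directory record. The sketch below looks for both markers using only the standard library; the function names are hypothetical.

```python
import os

LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"  # first bytes of a non-empty ZIP archive
EOCD_SIGNATURE = b"PK\x05\x06"          # end-of-central-directory record
MAX_TAIL = 22 + 65535                   # fixed EOCD size plus the largest ZIP comment


def starts_like_zip(path):
    """Return True if the file begins with a ZIP local file header."""
    with open(path, "rb") as f:
        return f.read(4) == LOCAL_HEADER_SIGNATURE


def has_central_directory(path):
    """Return True if the tail of the file contains a ZIP EOCD record."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - MAX_TAIL))
        tail = f.read()
    return EOCD_SIGNATURE in tail
```

A file for which `starts_like_zip` is true and `has_central_directory` is false was most likely cut off mid-save, which matches the "failed finding central directory" variant of the `PytorchStreamReader` error.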

💡 Tip: When saving checkpoints mid-training, use an atomic write pattern: save to a .tmp file in the same directory, then rename it to the final filename. A crashed save then leaves the previous checkpoint untouched instead of replacing it with a partial file: torch.save(state, 'checkpoint.tmp'); os.replace('checkpoint.tmp', 'checkpoint.pth'). Unlike os.rename, os.replace also overwrites an existing destination on Windows.
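The atomic write pattern can be wrapped in a reusable helper. The sketch below is one possible shape, not a PyTorch API: `atomic_write` and its `write_fn` parameter are hypothetical names, and the writer is passed in as a callable so the helper does not depend on torch. Because `torch.save` accepts a file-like object, `lambda f: torch.save(state, f)` is one writer you could pass.

```python
import os
import tempfile


def atomic_write(final_path, write_fn):
    """Write through a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(f)            # e.g. lambda f: torch.save(state, f)
            f.flush()
            os.fsync(f.fileno())   # push the bytes to disk before renaming
        # Atomic when source and destination are on the same file system.
        os.replace(tmp_path, final_path)
    except BaseException:
        # A failed save removes the temp file and leaves final_path untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Creating the temp file in the destination directory matters: a rename across file systems is a copy, not an atomic swap.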

Part 4. Recovering Deleted PyTorch Checkpoint Files with Ritridata

If the checkpoint file is genuinely missing from the file system — confirmed by os.path.exists() returning False — Ritridata can often recover it from the drive sectors where it previously existed.

Step 1: Stop all disk activity on the affected drive

Do not start a new training run. Do not install packages with pip. Do not save any new files to the same drive partition. Large checkpoint files occupy many sectors, so any new write increases the risk of overwriting part of one.

Step 2: Install Ritridata on a different drive

Install on your OS drive or an external drive, not the training drive containing the lost checkpoint.

Step 3: Launch Ritridata and select the training drive

Select the partition or raw disk where your checkpoint directory was located. Common locations:

Linux training servers: /home/{user}/checkpoints/ or /mnt/data/
Windows: C:\Users\{user}\checkpoints\ or a dedicated data drive
Note: Ritridata supports Windows and Mac — for Linux servers, see alternative tools in the FAQ.

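Step 4: Run a deep scan and filter by size

Start a deep scan. When it completes, sort the recovered files by size and look for files in the expected size range for your model. Checkpoints written by PyTorch 1.6 or later are ZIP archives that begin with the bytes PK\x03\x04, so a signature-based scan may list them as generic ZIP files rather than as .pt or .pth files.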

Step 5: Recover to a separate drive

Save recovered checkpoints to a drive other than the source. Test by loading: torch.load('recovered_checkpoint.pth', map_location='cpu').
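Recovery tools often restore files under generic names, so it can help to screen the output folder before loading anything. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it relies on `torch.save` archives (PyTorch 1.6 or later) containing a member whose name ends in `data.pkl`. It uses only the standard library.

```python
import os
import zipfile


def find_checkpoint_candidates(root, min_bytes, max_bytes):
    """Return files under root in the size range that look like torch ZIP checkpoints."""
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not (min_bytes <= os.path.getsize(path) <= max_bytes):
                continue
            if not zipfile.is_zipfile(path):
                continue
            try:
                with zipfile.ZipFile(path) as zf:
                    # torch.save archives hold a pickled index at "<prefix>/data.pkl".
                    if any(n.endswith("data.pkl") for n in zf.namelist()):
                        matches.append(path)
            except zipfile.BadZipFile:
                # The central directory is unreadable; skip this file.
                continue
    return sorted(matches)
```

Run it against the recovery destination drive, never the source drive, and then test each candidate with `torch.load(..., map_location='cpu')`.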

💡 Tip: Ritridata does not support recovery from network-attached storage (NAS) or from Linux training servers. For those environments, consider PhotoRec (open source) for basic binary file recovery, or contact a professional data recovery service for NAS-level failures.

Part 5. Checkpoint Backup Best Practices for ML Workflows

Losing a training checkpoint represents lost compute time, not just lost disk space. A minimal checkpoint strategy prevents the most common loss scenarios.

Save checkpoints at every epoch or every N steps, depending on your training duration. Keep the last 3 checkpoints rather than just the latest: if the most recent save is corrupted, you can resume from the one before it with minimal loss.
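Keeping only the most recent checkpoints can be automated with a small pruning step after each save. The sketch below is illustrative: the function name and the `ckpt_step<N>.pth` naming scheme are hypothetical, so adjust the pattern to match your own filenames.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: ckpt_step100.pth, ckpt_step200.pth, ...
STEP_PATTERN = re.compile(r"^ckpt_step(\d+)\.pth$")


def prune_checkpoints(directory, keep=3):
    """Delete all but the `keep` highest-numbered checkpoints; return deleted names."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    found = []
    for path in Path(directory).iterdir():
        match = STEP_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort(key=lambda item: item[0])
    deleted = []
    for _step, path in found[:-keep]:
        path.unlink()
        deleted.append(path.name)
    return deleted
```

Call it only after the newest checkpoint has been written and verified, so a failed save never causes the last good file to be pruned.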

Use cloud storage for checkpoint sync during long training runs. Services like AWS S3, Google Cloud Storage, and Azure Blob Storage can receive checkpoint uploads automatically via training callbacks in PyTorch Lightning, Hugging Face Trainer, or custom training loops.

Part 6. Ritridata for PyTorch Checkpoint Recovery

Ritridata is effective for recovering large binary files — which PyTorch checkpoints are. The deep scan engine detects pickle file signatures and can recover .pt and .pth files from drives where the file system has been damaged or the files have been deleted.

The free scan shows recovered file sizes and names so you can confirm the right checkpoint was found before committing to recovery. For ML practitioners who have lost weeks of training, that confirmation is worth the scan time.

Download Ritridata and start a free scan

FAQ

Q1: What is the difference between a .pt and .pth file in PyTorch? Both are standard PyTorch save formats, and the extension is cosmetic. .pth is a common convention for checkpoints containing training state, while .pt is often used for exported models or tensors. Both are written by torch.save, which since PyTorch 1.6 produces a ZIP archive holding a pickled object index plus the raw tensor data, and both are recoverable the same way.

Q2: My training server is Linux-based — does Ritridata work on Linux? Ritridata currently supports Windows and Mac. For Linux, consider TestDisk or PhotoRec (both free and open source) for basic binary file recovery. Enterprise ML setups on Linux may benefit from professional data recovery services.

Q3: Can I recover a checkpoint from a GPU cloud instance that was terminated? Instance-local storage is typically ephemeral: when an AWS EC2 or Google Cloud instance terminates, its local disks are wiped. Persistent storage, such as EBS volumes that are not set to delete on termination and object storage buckets (S3, GCS), is retained. Always save checkpoints to persistent storage, not instance-local drives.

Q4: My checkpoint loaded but training loss jumped — is the checkpoint partially corrupt? A checkpoint that loads without error but produces unexpected behavior may have had optimizer state corrupted while model state is intact. Try loading only model_state_dict from the checkpoint and reinitializing the optimizer from scratch.

Q5: How do I recover a checkpoint if I only have the exported ONNX file? ONNX files contain frozen model weights but not optimizer state or training history. You can convert an ONNX file back to PyTorch using onnx2torch or similar tools, but you cannot resume training from an ONNX export alone.

Q6: Is it possible to recover a checkpoint that was overwritten by a newer save? Once a file has been overwritten in place (same path, same file), the previous version is gone unless you have versioned backups or a snapshot file system. Tools like ZFS, Btrfs snapshots, or Time Machine on Mac can preserve file versions.

Q7: Can Ritridata recover checkpoints from an external SSD used for training? Yes. Ritridata supports external SSDs on both Windows and Mac. Connect the external SSD before launching Ritridata and select it from the device list. Act quickly: on SSDs, TRIM can erase the blocks of deleted files soon after deletion, so the recovery window is often much shorter than on an HDD.

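Q8: What happens if only part of the checkpoint is recovered? A partially recovered checkpoint will likely fail to load due to truncation or missing data. Checkpoints written by PyTorch 1.6 or later are ZIP archives whose central directory sits at the end of the file, so a truncated copy usually cannot be opened by torch.load() or by standard ZIP tools. A ZIP repair utility (for example, zip -FF) can sometimes rebuild the directory from the surviving entries. Depending on how much was recovered, you may be able to extract the model weights even if the optimizer state is missing.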
