Copying a large volume of data, such as 500GiB spread across thousands of individual files, from an SD card to an external hard disk in Linux calls for strategies that minimize per-file overhead and maximize throughput. The key is to leverage parallel processing across multiple CPU cores and to choose tools that handle file operations efficiently.
Understanding the Bottlenecks
Before diving into tools, it’s helpful to understand potential bottlenecks:
- Per-File Overhead (CPU Bound): When dealing with thousands of small files, the operating system incurs overhead for each file operation (opening, reading metadata, writing metadata, closing). This can make the CPU a bottleneck even if the drives aren’t saturated. Parallel processing helps here.
- I/O Throughput (Drive Bound): The read speed of your SD card and the write speed of your external HDD (especially if it’s a mechanical drive vs. an SSD) will ultimately limit transfer rates for large files or when per-file overhead is minimized.
- Single-Threaded Operations: Standard `cp` or `mv` commands are typically single-threaded, processing one file at a time, which makes them inefficient at this scale. (A quick way to size up both bottleneck factors is sketched below.)
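Before picking a tool, it can help to get rough numbers for both factors. A minimal sketch, assuming the example mount points used throughout this article; `some_large_file` is a placeholder for any big file actually present on the card:

```bash
# How many files are involved? Per-file overhead scales with this count.
find /media/user/sdcard/source_folder/ -type f | wc -l

# Raw sequential read speed of the SD card, using one large file as a probe:
dd if=/media/user/sdcard/source_folder/some_large_file of=/dev/null bs=1M status=progress
```

If the total size divided by the file count is small (tens of KiB or less), per-file overhead will dominate, and the parallel and `tar`-pipe approaches below pay off the most.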
Here are several effective Linux tools and techniques to accomplish this task, focusing on speed and resource utilization:
1. `rsync` with GNU `parallel` (Recommended for Robustness & Parallelism)
`rsync` is a powerful and versatile tool for copying and synchronizing files. While a single `rsync` instance processes files sequentially, you can combine it with GNU `parallel` to run multiple `rsync` jobs concurrently, significantly speeding up the transfer of many files.
- How it Works: `find` lists the entries to copy, and GNU `parallel` takes this list and launches multiple `rsync` processes, each handling a subset of the files/directories simultaneously. This leverages multiple CPU cores for the per-file operations and can better saturate your drives' I/O capabilities.
- Key Advantages: Robust error handling, the ability to resume interrupted transfers (thanks to `rsync`), preservation of permissions and metadata, detailed progress.
- Installation: If `parallel` isn't installed: `sudo apt update && sudo apt install parallel` (Debian/Ubuntu) or `sudo dnf install parallel` (Fedora/RHEL).
Example Command (Copying contents of source into destination):
```bash
# Ensure the destination directory exists:
mkdir -p /media/user/externalhdd/backup_destination

find /media/user/sdcard/source_folder/ -mindepth 1 -maxdepth 1 -print0 | \
  parallel -0 -j$(nproc) --eta --joblog /tmp/rsync_parallel.log \
  rsync -aP {} /media/user/externalhdd/backup_destination/
```
- `/media/user/sdcard/source_folder/`: Your source directory on the SD card. The trailing slash means "contents of."
- `-mindepth 1 -maxdepth 1`: Lists only the entries directly inside the source directory (excluding the directory itself), so each top-level file or subdirectory becomes one `rsync` job. Without `-maxdepth 1`, every nested file would also be passed to `rsync` individually and would land flattened in the destination root. If your data sits in only a few top-level directories, parallelism will be limited; `fpsync` (Method 4) avoids this by partitioning the file list itself.
- `-print0`: Handles filenames with spaces or special characters safely.
- `parallel -0 -j$(nproc) --eta --joblog /tmp/rsync_parallel.log`:
  - `-0`: Expects null-terminated input from `find`.
  - `-j$(nproc)`: Runs as many jobs as you have CPU cores; you can set a specific number instead, e.g., `-j4`.
  - `--eta`: Shows an estimated time of completion.
  - `--joblog /tmp/rsync_parallel.log`: Logs the success/failure of each parallel job (see the retry sketch below).
- `rsync -aP {} /media/user/externalhdd/backup_destination/`:
  - `-a`: Archive mode (preserves permissions, timestamps, symbolic links, etc.).
  - `-P`: Combines `--progress` and `--partial` (for resumability).
  - `{}`: Placeholder for the file/directory passed in by `parallel`.
  - `/media/user/externalhdd/backup_destination/`: The destination. The trailing slash is important so `rsync` copies items into this directory.
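A practical benefit of the joblog: if some jobs fail mid-run (a flaky reader, a disconnected drive), GNU `parallel` can replay just the failed ones from the log. A sketch, reusing the log file from above:

```bash
# Rerun only the jobs recorded as failed in the joblog; rsync's --partial
# (part of -P) lets the retried transfers resume partially copied files.
parallel --retry-failed --joblog /tmp/rsync_parallel.log
```

`--retry-failed` reads the commands back from the joblog itself, so you don't need to regenerate the file list.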
Dry Run (Highly Recommended): Before running the actual copy, perform a dry run: change `rsync -aP` to `rsync -anP` (note the `n` for dry-run) in the command above, or add `--dry-run` to the `parallel` command to print the jobs without executing them.
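Concretely, the dry-run variant of the full command looks like this; it prints what would be transferred without writing anything to the destination:

```bash
find /media/user/sdcard/source_folder/ -mindepth 1 -maxdepth 1 -print0 | \
  parallel -0 -j$(nproc) --eta \
  rsync -anP {} /media/user/externalhdd/backup_destination/
```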
2. `tar` Pipelined (Efficient for Many Small Files)
This classic method archives the source files into a single stream on `stdout` and pipes that stream directly to a second `tar` process, which extracts it at the destination from `stdin`. This significantly reduces the overhead of individual filesystem operations, which is especially beneficial for mechanical drives and vast numbers of tiny files.
- How it Works: The sending `tar` reads all source files sequentially and writes them as one continuous data stream; the receiving `tar` reads that stream and recreates the files and directory structure.
- Key Advantages: Can be very fast with extreme numbers of small files because it minimizes disk head seeking.
- Considerations: Harder to resume after an interruption than `rsync`. Progress indication usually requires a tool like `pv` (Pipe Viewer).
Example Command:
```bash
# Ensure the destination directory exists:
mkdir -p /media/user/externalhdd/backup_destination

(cd /media/user/sdcard/source_folder/ && tar -cf - .) | pv | (cd /media/user/externalhdd/backup_destination/ && tar -xf -)
```
- `(cd /media/user/sdcard/source_folder/ && tar -cf - .)`: Changes to the source directory and creates (`c`) an archive of the current directory (`.`), writing it to standard output (`f -`). The subshell `(...)` ensures the `cd` doesn't affect your main shell's working directory.
- `pv`: (Optional; install with `sudo apt install pv`.) Pipe Viewer shows the progress of data flowing through the pipe.
- `(cd /media/user/externalhdd/backup_destination/ && tar -xf -)`: Changes to the destination directory in a subshell and extracts (`x`) the archive from standard input (`f -`).
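By default `pv` shows only throughput, because it cannot know the total size of the stream. If you pass the expected size with `-s`, it adds a percentage and ETA. A sketch using `du -sb` to estimate the payload (the `tar` stream also contains headers, so treat the percentage as approximate):

```bash
# Approximate number of bytes tar will stream:
TOTAL=$(du -sb /media/user/sdcard/source_folder/ | awk '{print $1}')

(cd /media/user/sdcard/source_folder/ && tar -cf - .) | \
  pv -s "$TOTAL" | \
  (cd /media/user/externalhdd/backup_destination/ && tar -xf -)
```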
3. `find` with `xargs` and `cp --parents` (Parallel Basic Copy)
This method uses `find` to locate files and `xargs` to run `cp` commands in parallel. The crucial `--parents` option for `cp` ensures the source directory structure is replicated at the destination.
- How it Works: `find` generates a list of files; `xargs` takes that list and runs multiple `cp` commands simultaneously; `cp --parents` recreates the necessary parent directories at the destination.
- Key Advantages: Uses standard `cp`; can be effective if `rsync`'s overhead is a concern for a simple one-off copy.
- Considerations: `cp` lacks `rsync`'s resumability and delta-transfer capabilities (the latter is irrelevant for an initial full copy anyway). As written, it also copies only regular files; see the note after the example below.
Example Command:
```bash
# Ensure the base destination directory exists:
mkdir -p /media/user/externalhdd/backup_destination

cd /media/user/sdcard/source_folder/ && \
find . -type f -print0 | \
  xargs -0 -P$(nproc) -I {} cp --parents -a {} /media/user/externalhdd/backup_destination/
```
- `cd /media/user/sdcard/source_folder/`: Changes to the source directory so the relative paths produced by `find` work with `cp --parents`.
- `find . -type f -print0`: Finds only regular files (`-type f`) in the current directory (`.`) and its subdirectories, null-terminated.
- `xargs -0 -P$(nproc) -I {}`:
  - `-0`: Null-terminated input.
  - `-P$(nproc)`: Runs up to as many parallel processes as you have CPU cores.
  - `-I {}`: Replaces `{}` with each input item, running one `cp` per file; this also makes paths containing spaces work correctly.
- `cp --parents -a {} /media/user/externalhdd/backup_destination/`:
  - `--parents`: Recreates the source directory structure under the destination.
  - `-a`: Archive mode (like `rsync -a`; equivalent to `-dR --preserve=all`).
  - `{}`: The file to copy.
  - `/media/user/externalhdd/backup_destination/`: The target directory under which the structure from `source_folder` is recreated.
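One caveat with this method: because `find . -type f` lists only regular files, empty directories are never created at the destination, and symbolic links are skipped as well. If empty directories matter, you can replicate the directory tree first; a sketch under the same path assumptions:

```bash
# Recreate the full directory tree (including empty directories) up front:
cd /media/user/sdcard/source_folder/ && \
find . -type d -print0 | \
  xargs -0 -I {} mkdir -p "/media/user/externalhdd/backup_destination/{}"
```

If full metadata fidelity is important, `rsync -a` (Methods 1 and 4) remains the better tool.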
4. `fpsync` (Specialized Parallel `rsync` Wrapper)
`fpsync` is a tool designed specifically to parallelize `rsync`. It uses `fpart` to partition the file list and then runs multiple `rsync` workers over the partitions.
- How it Works: Automates splitting the workload and managing the parallel `rsync` instances.
- Key Advantages: Tailored to this exact scenario; can be very efficient.
- Installation: May need to be installed via your package manager (e.g., `sudo apt install fpart`, as `fpsync` ships with `fpart`).
Example Command:
```bash
# Ensure the destination directory exists:
mkdir -p /media/user/externalhdd/backup_destination

fpsync -n $(nproc) -v \
  /media/user/sdcard/source_folder/ /media/user/externalhdd/backup_destination/
```
- `-n $(nproc)`: Number of parallel `rsync` workers (e.g., one per CPU core).
- `-v`: Verbose mode.
- `/media/user/sdcard/source_folder/`: Source directory.
- `/media/user/externalhdd/backup_destination/`: Destination directory.
- Note on batching: `fpsync` uses `fpart` underneath. If you want to control batching by the number of files per job (rather than just the total worker count), `fpsync` exposes this directly via its `-f` option; check `man fpsync` for details.
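For example, to combine a worker count with a per-job file cap (the `-f` flag described in the `fpsync` man page; verify it is available in your installed version):

```bash
# 8 parallel workers, each sync job handling at most 2000 files:
fpsync -n 8 -f 2000 -v \
  /media/user/sdcard/source_folder/ /media/user/externalhdd/backup_destination/
```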
5. `mc` (Midnight Commander - TUI Alternative)
For users who prefer a text-based user interface, Midnight Commander is a powerful console file manager. Its built-in copy operation (`F5`) is generally well-optimized and handles large numbers of files more gracefully than a typical desktop file manager.
- How it Works: Provides an interactive way to select source files/directories and copy them. While it doesn't offer the granular parallel control of the CLI combinations above, it's often faster than basic `cp` for large jobs.
- Installation: `sudo apt install mc` or `sudo dnf install mc`.
- Usage: Run `mc`, navigate the two panels to source and destination, select files (e.g., with the `Insert` key or `*`), then press `F5` to copy.
Additional Considerations for Maximizing Speed:
- Hardware: Ensure both the SD card reader and the external HDD are connected to the fastest available USB ports (USB 3.0+). An SSD external drive will be significantly faster than a mechanical HDD.
- Filesystem Mount Options: Mounting filesystems with `noatime` or `relatime` can reduce disk I/O by not updating file access times on every read: `sudo mount -o remount,noatime /media/user/sdcard` (if applicable and safe for your use case).
- I/O Scheduler: For mechanical drives, the I/O scheduler can matter. Modern kernels usually default to `bfq` or `mq-deadline`, which are generally good; a snippet for checking and switching the scheduler follows this list.
- System Load: Minimize other disk-intensive or CPU-intensive processes during the copy.
- Resource Monitoring: Use tools like `iotop` (disk I/O per process), `htop` (CPU/memory), `vmstat`, or `dstat` to identify bottlenecks during the transfer.
- GUI `rsync` Front-ends: If you prefer a GUI but want `rsync`'s power, tools like `grsync` provide a graphical interface to `rsync`.
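To inspect or change the scheduler for a specific drive, the kernel exposes it through sysfs. A sketch (replace `sdX` with your external drive's device name as shown by `lsblk`; the change lasts until reboot):

```bash
lsblk                                    # identify the device, e.g. sdb
cat /sys/block/sdX/queue/scheduler       # current scheduler shown in [brackets]
echo mq-deadline | sudo tee /sys/block/sdX/queue/scheduler
```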
Which Method to Choose?
- For general robustness, features, and good parallel performance: `rsync` with GNU `parallel` (Method 1) or `fpsync` (Method 4) are excellent choices.
- For potentially the highest speed with extreme numbers of very small files (especially to or from mechanical drives): the `tar` pipe (Method 2) can be very effective.
- For a simpler parallel `cp` approach: `find` with `xargs` and `cp --parents` (Method 3) is a solid option.
- For an interactive TUI approach: `mc` (Method 5) is user-friendly.
Always test with a smaller subset of your data and use dry-run options where available before committing to the full 500GiB transfer. This allows you to verify commands and estimate performance. Remember to replace placeholder paths with your actual SD card and external HDD mount points.
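After the transfer completes, it's worth verifying the copy. A quick sanity check compares file counts; a thorough (but slow, since it re-reads everything) check uses an `rsync` checksum dry run, where any printed lines indicate differing content:

```bash
# Fast sanity check: do both sides contain the same number of files?
find /media/user/sdcard/source_folder/ -type f | wc -l
find /media/user/externalhdd/backup_destination/ -type f | wc -l

# Thorough content check: -r recursive, -n dry run, -c compare checksums,
# -i itemize anything that differs.
rsync -rnci /media/user/sdcard/source_folder/ /media/user/externalhdd/backup_destination/
```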