Organize files by their extensions

Slug: files-by-type

#File Organizer Documentation

A Python utility for organizing files from source directories into categorized target folders.

#Project Overview

#Purpose

This utility solves the problem of disorganized file directories by automatically sorting files based on their types or extensions. It addresses common issues like:

  • Mixed file types in download or document folders
  • Difficulty finding specific file types
  • Managing large collections of files
  • Batch organizing files while preserving their content and metadata

#Core Functionality

  • Sort files by predefined categories (images, documents, etc.)
  • Sort files by their extensions
  • Copy or move files from source to destination
  • Handle duplicate filenames
  • Process hidden files and follow symbolic links if requested
  • Track progress during file operations

#Design Principles

  1. Configurability: Users can customize how files are categorized
  2. Reliability: Careful handling of edge cases like duplicates and long paths
  3. Transparency: Clear feedback on what’s happening during operation
  4. Simplicity: Straightforward command-line interface

#Installation

The utility requires only Python standard library modules and no external dependencies.

  1. Download the script:
    curl -O https://example.com/file_organizer.py # or wget https://example.com/file_organizer.py
  2. Make the script executable (Linux/macOS):
    chmod +x file_organizer.py

#Usage

#Basic Command Structure

python file_organizer.py --source <source_directory> --target <target_directory> [options]

#Command Line Options

| Option | Description |
| --- | --- |
| --source, -s | Source directory to organize files from (required) |
| --target, -t | Target directory to organize files into (required) |
| --organize-by | Organization method: ‘category’ or ‘extension’ (default: ‘category’) |
| --no-timestamp | Disable adding timestamps to duplicate filenames (an existing file is then overwritten) |
| --move | Move files instead of copying them |
| --config, -c | Path to a JSON configuration file for custom categories |
| --include-hidden | Include hidden files and directories |
| --follow-links | Follow symbolic links during directory traversal |
| --skip-existing | Skip existing files instead of timestamping |

#Configuration

#Default Categories

The utility uses these default file categories:

| Category | File Extensions |
| --- | --- |
| images | .jpg, .jpeg, .png, .gif, .bmp, .webp |
| documents | .pdf, .docx, .doc, .txt, .rtf, .odt, .xlsx, .xls, .csv, .pptx, .ppt |
| videos | .mp4, .avi, .mkv, .mov, .wmv, .flv |
| audio | .mp3, .wav, .flac, .aac, .ogg, .m4a |
| archives | .zip, .rar, .tar, .gz, .bz2, .7z |
| code | .py, .java, .c, .cpp, .h, .html, .css, .js, .xml, .json |
| apps | .exe, .msi, .apk, .dmg |
| other | Any file extension not listed above |
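The lookup behind this table can be sketched as a reverse index from extension to category. This is an illustrative alternative to the loop in categorize_file() from the script listing, not the script's own code; build_extension_index and lookup_category are hypothetical names, and only two categories are shown for brevity.

```python
import os

# Two of the default categories, copied from the table above.
CATEGORIES = {
    "images": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"],
    "archives": [".zip", ".rar", ".tar", ".gz", ".bz2", ".7z"],
    "other": [],
}


def build_extension_index(categories):
    """Map each extension to the first category that lists it (hypothetical helper)."""
    index = {}
    for category, extensions in categories.items():
        for ext in extensions:
            index.setdefault(ext.lower(), category)
    return index


def lookup_category(filename, index):
    """Return the category for a filename, falling back to 'other'."""
    _, ext = os.path.splitext(filename)
    return index.get(ext.lower(), "other")


index = build_extension_index(CATEGORIES)
print(lookup_category("Holiday.JPG", index))
print(lookup_category("notes", index))
```

Lowercasing the extension before the lookup means Holiday.JPG and holiday.jpg land in the same folder, which matches how the script treats extensions.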

#Custom Categories

You can define your own categories using a JSON configuration file:

{
  "category_name1": [".ext1", ".ext2"],
  "category_name2": [".ext3", ".ext4"],
  "other": []
}

Example custom config file:

{
  "work": [".doc", ".docx", ".pdf", ".xls", ".xlsx", ".ppt", ".pptx"],
  "photos": [".jpg", ".jpeg", ".png", ".gif", ".webp"],
  "code": [".py", ".js", ".html", ".css", ".java", ".c", ".cpp", ".h"],
  "media": [".mp3", ".mp4", ".avi", ".mkv", ".mov", ".flac", ".wav"],
  "compressed": [".zip", ".rar", ".tar", ".gz", ".7z"],
  "other": []
}
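The script listing accepts whatever the JSON file contains, so a malformed mapping (an extension without its leading dot, say) goes unreported and simply never matches. A minimal sketch of a pre-flight check follows; validate_categories is a hypothetical helper and not part of the script.

```python
import json


def validate_categories(config):
    """Return a list of problems found in a category mapping (hypothetical helper)."""
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]
    problems = []
    for name, extensions in config.items():
        if not isinstance(extensions, list):
            problems.append(f"{name}: value must be a list")
            continue
        for ext in extensions:
            if not isinstance(ext, str) or not ext.startswith("."):
                problems.append(f"{name}: {ext!r} should be a string starting with '.'")
            elif ext != ext.lower():
                # The script lowercases file extensions before comparing them.
                problems.append(f"{name}: {ext!r} should be lowercase")
    return problems


sample = json.loads('{"photos": [".jpg", "png"], "work": ".pdf", "other": []}')
for problem in validate_categories(sample):
    print(problem)
```

An empty list means the mapping has the shape the script expects. Running such a check before organizing avoids files quietly ending up in "other".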

#Examples

  1. Basic organization by category:
    python file_organizer.py --source ~/Downloads --target ~/Organized
  2. Organize by file extension:
    python file_organizer.py --source ~/Downloads --target ~/Organized --organize-by extension
  3. Move files instead of copying:
    python file_organizer.py --source ~/Downloads --target ~/Organized --move
  4. Skip duplicate files instead of timestamping:
    python file_organizer.py --source ~/Downloads --target ~/Organized --skip-existing
  5. Include hidden files and follow symbolic links:
    python file_organizer.py --source ~/Downloads --target ~/Organized --include-hidden --follow-links

#Troubleshooting

#Permission Errors

  • Ensure you have read permissions for the source directory
  • Ensure you have write permissions for the target directory
  • On Unix systems, run with sudo for system directories (use with caution)

#Long Paths on Windows

The utility automatically handles long paths on Windows: any absolute path longer than 260 characters is prefixed with \\?\. If you still encounter issues:

  • Use shorter directory names
  • Move files to a less deeply nested location before organizing
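The prefixing rule can be sketched as a pure function. It mirrors the check in handle_long_path() from the script listing, but takes the operating system name as a parameter so it can be exercised on any platform; add_long_path_prefix and its parameters are hypothetical names.

```python
import os
import platform

LONG_PATH_PREFIX = "\\\\?\\"  # the four characters \\?\


def add_long_path_prefix(path, system, limit=260):
    """Prefix an absolute path on Windows when it is longer than `limit` characters."""
    if system == "Windows" and len(path) > limit and not path.startswith(LONG_PATH_PREFIX):
        return LONG_PATH_PREFIX + path
    return path


# Typical call: resolve to an absolute path first, then apply the rule.
resolved = add_long_path_prefix(os.path.abspath("some_folder"), platform.system())
print(resolved)
```

The prefix only makes sense on absolute paths, which is why the path is resolved before the rule is applied.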

#Performance with Large Directories

  • For very large directories (thousands of files), the initial scan may take time
  • Consider organizing subdirectories separately if performance is an issue

#Duplicate Files

When a file with the same name exists in the target directory:

  1. Default behavior: Add timestamp to filename
  2. With --skip-existing: Skip the file
  3. With --no-timestamp: Overwrite existing file (use with caution)
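The three outcomes can be sketched as one small decision function. It follows the precedence used in the script listing (skip is checked before timestamping); resolve_duplicate is a hypothetical name, and the `now` parameter exists only to make the example deterministic.

```python
import os
from datetime import datetime


def resolve_duplicate(filename, exists, skip_existing=False,
                      timestamp_duplicates=True, now=None):
    """Return the name to write to, or None when the file should be skipped."""
    if not exists:
        return filename
    if skip_existing:
        return None
    if timestamp_duplicates:
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        base, ext = os.path.splitext(filename)
        return f"{base}_{stamp}{ext}"
    return filename  # same name: the existing file is overwritten


fixed = datetime(2023, 1, 1, 12, 30, 45)
print(resolve_duplicate("report.pdf", exists=True, now=fixed))
# prints report_20230101_123045.pdf
```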

#Development Notes

#Design Decisions

  1. File Operations (Copy vs. Move)
    • Copy is the default to prevent accidental data loss
    • Move functionality provided for efficiency when source files aren’t needed
  2. Categorization System
    • Default categories cover common file types
    • Custom categories supported via JSON for flexibility
    • Extension-based organization added for users who prefer that system
  3. Handling Duplicates
    • Timestamp approach preserves both old and new files
    • Skip option added for incremental organization tasks
  4. Error Handling
    • Individual file errors don’t halt the entire process
    • Errors are reported but the utility continues processing other files

#Error Handling Strategy

The utility employs an “attempt and continue” error handling strategy:

  • Each file operation is wrapped in a try/except block
  • Errors with individual files are reported but don’t stop the overall process
  • This ensures maximum files are processed even if some cause issues
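The strategy can be sketched in isolation as follows. Collecting the failures in a list is an addition made here for illustration (the script listing only prints them), and process_all and risky are hypothetical names.

```python
def process_all(items, operation):
    """Apply `operation` to every item, recording failures instead of stopping."""
    succeeded, failed = [], []
    for item in items:
        try:
            operation(item)
            succeeded.append(item)
        except Exception as exc:
            failed.append((item, exc))
            print(f"Error processing {item!r}: {exc}")
    return succeeded, failed


def risky(number):
    """Stand-in for a file operation that sometimes fails."""
    if number % 2:
        raise ValueError("odd input")


ok, bad = process_all([1, 2, 3, 4], risky)
print(f"{len(ok)} succeeded, {len(bad)} failed")
```

Returning the failures lets a caller print a summary at the end, which partly offsets the risk that individual errors scroll past unnoticed.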

#Security Considerations

  1. File Operations
    • Categorization uses only file names and extensions; file contents are copied byte for byte but never parsed or interpreted
    • No execution of files occurs during organization
  2. When Using Move Operations
    • Be aware that move operations permanently change your file system
    • Always verify the target directory before using –move

#Testing

Refer to TESTING.md for detailed testing procedures and scenarios.

#Testing the File Organizer

This document outlines procedures for testing the File Organizer utility to ensure it functions correctly.

#Test Environment Setup

Create a test directory structure with various file types:

# Create test directories
mkdir -p test_environment/source
mkdir -p test_environment/target

# Create test files
touch test_environment/source/document1.pdf
touch test_environment/source/document2.docx
touch test_environment/source/image1.jpg
touch test_environment/source/image2.png
touch test_environment/source/video1.mp4
touch test_environment/source/script.py
touch test_environment/source/archive.zip
touch test_environment/source/noextension
touch test_environment/source/.hidden_file

# Create subdirectory with more files
mkdir -p test_environment/source/subdir
touch test_environment/source/subdir/nested_doc.pdf
touch test_environment/source/subdir/nested_image.jpg

#Test Cases

#Test Case 1: Basic Category Organization

Purpose: Verify that files are correctly organized into category folders

Command:

python file_organizer.py --source test_environment/source --target test_environment/target

Expected Results:

  • Target directory should contain a folder for every default category: images, documents, videos, audio, archives, code, apps, other (audio and apps stay empty with this test set)
  • Files should be placed in their correct category folders, including files found in subdirectories:
    • documents: document1.pdf, document2.docx, nested_doc.pdf
    • images: image1.jpg, image2.png, nested_image.jpg
    • videos: video1.mp4
    • code: script.py
    • archives: archive.zip
    • other: noextension
  • Hidden files should be skipped (.hidden_file)
  • All files should be copied, not moved (source files should still exist)

Verification:

ls -la test_environment/target/*/
ls -la test_environment/source/
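For repeatable checks, the manual listing can be replaced by a small script that captures the target layout as a dictionary and compares it with the expected one. This is a sketch under the assumption that the target contains one level of folders; snapshot is a hypothetical helper.

```python
import os


def snapshot(target_directory):
    """Map each immediate subfolder of the target to its sorted file names."""
    layout = {}
    for entry in sorted(os.listdir(target_directory)):
        folder = os.path.join(target_directory, entry)
        if os.path.isdir(folder):
            layout[entry] = sorted(os.listdir(folder))
    return layout


if os.path.isdir("test_environment/target"):
    for folder, names in snapshot("test_environment/target").items():
        print(folder, names)
```

Comparing the returned dictionary against a hand-written expected dictionary turns each test case into a single equality check.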

#Test Case 2: Extension-based Organization

Purpose: Verify that files are correctly organized by their extensions

Command:

python file_organizer.py --source test_environment/source --target test_environment/target --organize-by extension

Expected Results:

  • Target directory should contain extension folders: pdf, docx, jpg, png, mp4, py, zip, no_extension
  • Files should be placed in their respective extension folders
  • Files without extension should be in the no_extension folder
  • All files should be copied, not moved
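The folder name for each file follows directly from os.path.splitext. The sketch below reproduces the rule used in the script listing; extension_folder is a hypothetical name for it.

```python
import os


def extension_folder(filename):
    """Return the folder name used in extension mode for a given filename."""
    _, ext = os.path.splitext(filename)
    return ext[1:].lower() if ext else "no_extension"


for name in ["image1.JPG", "archive.tar.gz", "noextension", ".hidden_file"]:
    print(name, "->", extension_folder(name))
```

Two cases are easy to miss: only the last suffix counts, so archive.tar.gz goes to gz, and a dotfile such as .hidden_file has no extension as far as os.path.splitext is concerned, so it goes to no_extension.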

Verification:

ls -la test_environment/target/*/

#Test Case 3: Move Operation

Purpose: Verify that files are moved instead of copied when using the --move flag

Command:

python file_organizer.py --source test_environment/source --target test_environment/target --move

Expected Results:

  • Files should be moved to their category folders in the target directory
  • Source directory should no longer contain the moved files
  • Subdirectories in source should remain, now empty (the script moves files only and never removes directories)

Verification:

ls -la test_environment/source/
ls -la test_environment/target/*/

#Test Case 4: Duplicate File Handling with Timestamps

Purpose: Verify that duplicate files are handled correctly with timestamps

Preparation:

# Create duplicate file in target
mkdir -p test_environment/target/documents
cp test_environment/source/document1.pdf test_environment/target/documents/

Command:

python file_organizer.py --source test_environment/source --target test_environment/target

Expected Results:

  • document1.pdf should be copied with a timestamp appended to its base name (e.g., document1_20230101_123045.pdf)
  • Original document1.pdf should remain unchanged in target directory
  • The script as listed prints no dedicated message when it adds a timestamp, so confirm the result from the directory listing

Verification:

ls -la test_environment/target/documents/

#Test Case 5: Skip Existing Files

Purpose: Verify that existing files are skipped with the --skip-existing flag

Command:

python file_organizer.py --source test_environment/source --target test_environment/target --skip-existing

Expected Results:

  • document1.pdf should be skipped (not copied again)
  • Console output should indicate that document1.pdf was skipped
  • Other files should be processed normally

Verification:

# Only one document1.pdf should exist in the target
ls -la test_environment/target/documents/

#Test Case 6: Include Hidden Files

Purpose: Verify that hidden files are processed when using the --include-hidden flag

Command:

python file_organizer.py --source test_environment/source --target test_environment/target --include-hidden

Expected Results:

  • .hidden_file should be processed and copied to the “other” category
  • The progress counter should count .hidden_file in its total (the script as listed prints no per-file message for copied files)

Verification:

ls -la test_environment/target/other/

#Test Case 7: Custom Categories

Purpose: Verify that custom category configurations work correctly

Preparation:

# Create a custom config file
cat > test_environment/custom_config.json << EOF
{
  "text_files": [".pdf", ".docx", ".txt"],
  "media": [".jpg", ".png", ".mp4"],
  "code_files": [".py", ".js", ".html"],
  "other": []
}
EOF

Command:

python file_organizer.py --source test_environment/source --target test_environment/target --config test_environment/custom_config.json

Expected Results:

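  • Files should be organized according to the custom categories, including files found in subdirectories:
    • text_files: document1.pdf, document2.docx, nested_doc.pdf
    • media: image1.jpg, image2.png, video1.mp4, nested_image.jpg
    • code_files: script.py
    • other: archive.zip, noextension
  • Target directory should contain only the custom category folders; the script as listed prints no message about which configuration is in use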

Verification:

ls -la test_environment/target/*/

#Test Result Interpretation

Each test case should result in files being organized according to the expected results. If any test fails:

  1. Check console output for error messages
  2. Verify file permissions in source and target directories
  3. Ensure test environment was set up correctly
  4. Check if target directories were created as expected
  5. Verify file contents to ensure they weren’t corrupted during copy/move
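Step 5 can be automated by comparing checksums of a source file and its copy. The sketch below uses hashlib from the standard library; sha256_of and same_content are hypothetical helper names. Files created with touch are empty and therefore all hash identically, so write some content into the test files before relying on this check.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True when both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, same_content("test_environment/source/document1.pdf", "test_environment/target/documents/document1.pdf") should return True after a copy run.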

#Cleanup

After testing, remove the test environment:

rm -rf test_environment

#Maintenance Guide for File Organizer

This document provides guidelines for maintaining and extending the File Organizer utility. It is designed for developers who may be unfamiliar with the original implementation but need to maintain or enhance the codebase.

#Project Structure

The File Organizer consists of:

  • file_organizer.py: Main script containing all functionality
  • Configuration files: JSON files that define custom category mappings

#Code Architecture

The utility follows a simple procedural design with these core components:

  1. Argument Parsing: Uses argparse to process command-line options
  2. Configuration Management: Loads and validates category definitions
  3. File Processing: Traverses directories and processes files
  4. File Operations: Handles copying, moving, and naming of files

#Key Functions

| Function | Purpose | Implementation Notes |
| --- | --- | --- |
| categorize_file() | Maps files to categories | Performs simple extension lookup |
| create_target_folders() | Prepares target directory structure | Pre-creates every category folder in ‘category’ mode; does nothing in ‘extension’ mode |
| handle_long_path() | Handles Windows path limitations | Windows-specific fix for paths longer than 260 characters |
| sort_files() | Main file processing logic | Holds the core logic; the most complex function |
| load_config_file() | Loads custom category definitions | Falls back to defaults on error |
| main() | Entry point and argument processing | Sets up and initiates the process |

#Causality Chain

Understanding why certain implementation choices were made:

  1. Why copy files by default?
    • To prevent accidental data loss
    • Move operation is available but requires explicit flag
  2. Why use timestamps for duplicates?
    • Preserves both original and new files
    • Maintains file history
    • Prevents unintentional overwrites
  3. Why separate extension handling?
    • Some users prefer organization by extension
    • Provides flexibility for different workflows
  4. Why include Windows long path handling?
    • Windows traditionally limits paths to 260 characters (MAX_PATH)
    • Without this, deeply nested files would fail to process

#Common Maintenance Tasks

#Adding New File Categories

To add new categories to the default configuration:

  1. Modify the DEFAULT_CATEGORIES_CONFIG dictionary:
    DEFAULT_CATEGORIES_CONFIG = {
        # Existing categories...
        "new_category": [".ext1", ".ext2", ".ext3"],
        "other": []  # Always keep this as the fallback
    }

#Adding New Command Line Options

To add a new command line option:

  1. Add the option to the argument parser in main():
    parser.add_argument('--new-option', action='store_true', help='Description of the new option')
  2. Extract the option value:
    new_option = args.new_option
  3. Pass the option to functions that need it:
    sort_files(..., new_option, ...)
  4. Update the function signatures and implementations to use the new option
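The steps above can be sketched end to end with a cut-down parser. The --dry-run flag is hypothetical and used only for illustration; build_parser is likewise not a function in the script.

```python
import argparse


def build_parser():
    """Build a reduced version of the script's parser plus one new flag."""
    parser = argparse.ArgumentParser(description="Organize files by category or extension.")
    parser.add_argument("--source", "-s", required=True, help="Source directory.")
    parser.add_argument("--target", "-t", required=True, help="Target directory.")
    # Hypothetical new option, added for illustration only.
    parser.add_argument("--dry-run", action="store_true",
                        help="Report planned actions without touching files.")
    return parser


args = build_parser().parse_args(["-s", "in", "-t", "out", "--dry-run"])
print(args.dry_run)  # the attribute name uses an underscore
```

Note the naming rule: argparse exposes an option written with hyphens (--dry-run) as an attribute with underscores (args.dry_run), while on the command line only the hyphenated spelling is accepted.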

#Error Handling

The current error handling strategy is:

  • Individual file errors are caught and reported
  • The process continues with the next file
  • Overall process doesn’t terminate on individual file errors

When adding new functionality, maintain this pattern:

try:
    pass  # Your operation here
except Exception as e:
    print(f"Error: {e}")
    # Continue with next item rather than raising

#Testing

When making changes, ensure you test:

  1. Basic functionality with default options
  2. Any specific options you’ve modified
  3. Edge cases like:
    • Empty directories
    • Files with unusual names or extremely long paths
    • Very large directories
    • Permission-restricted files

Follow the testing guide in TESTING.md to verify your changes.

#Performance Considerations

The utility was designed for moderate-sized directories. For very large directories (thousands of files), consider:

  1. Adding progress indicators for lengthy operations
  2. Implementing batch processing
  3. Adding resume capabilities for interrupted operations
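One low-cost improvement to progress reporting is to count the files in a first pass, so the total shown is fixed from the start rather than growing as directories are discovered (which is how the script listing behaves). A minimal sketch, assuming the same hidden-file rule as the script; count_files is a hypothetical helper.

```python
import os


def count_files(source_directory, include_hidden=False, follow_links=False):
    """Count the files a later processing pass would visit."""
    total = 0
    for _, dirs, files in os.walk(source_directory, followlinks=follow_links):
        if not include_hidden:
            # Prune hidden directories in place so os.walk does not descend into them.
            dirs[:] = [d for d in dirs if not d.startswith(".")]
            files = [f for f in files if not f.startswith(".")]
        total += len(files)
    return total


print(count_files("."))
```

The extra pass only reads directory listings, so it is usually cheap compared with copying the files themselves.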

#Security Considerations

When modifying the code, maintain these security principles:

  1. Never execute file contents
  2. Validate all user inputs, especially paths and configuration files
  3. Be careful with move operations that permanently alter file systems
  4. Maintain appropriate error handling to prevent information leakage

#Documentation Updates

When changing functionality, update these documentation components:

  1. Function docstrings in the code
  2. README.md for user-facing changes
  3. MAINTENANCE.md for developer-facing changes
  4. TESTING.md for new test cases

#Decision Record and Implementation Notes

#Key Design Decisions

#1. File Organization Approach

Decision: Implement two organization methods (category and extension)
Context: Different users have different preferences for file organization
Consequences: More flexible tool but more complex implementation and testing required

#2. Default Copy vs. Move

Decision: Make copy the default operation and move optional
Context: Moving files is destructive and could lead to data loss if not used carefully
Consequences: Safer operation but may require more disk space temporarily

#3. Duplicate File Handling

Decision: Implemented three strategies: timestamp, skip, or overwrite
Context: Users need different approaches based on their specific use cases
Consequences: More complexity but greater flexibility for different scenarios

#4. Error Handling Strategy

Decision: Catch and report individual file errors but continue processing
Context: A single problematic file shouldn’t prevent organizing all other files
Consequences: More robust operation but may mask underlying issues

#5. Custom Configuration System

Decision: Use JSON for category definitions
Context: Provides flexibility while using a standard format
Consequences: Requires error handling for invalid JSON but enables easy customization

#Implementation Notes

#Platform Compatibility

  • Windows long path handling was added specifically to address the 260-character path limit (MAX_PATH)
  • The utility uses path handling that works across Windows, macOS, and Linux
  • File metadata preservation is implemented using shutil.copy2() instead of regular copy
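The difference between the two copy functions is easy to observe on the modification time. The sketch below sets a known timestamp on a temporary file and copies it both ways; the file names are arbitrary.

```python
import os
import shutil
import tempfile

KNOWN_MTIME = 1_000_000_000  # an arbitrary timestamp in the past (September 2001)

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source.txt")
    with open(source, "w", encoding="utf-8") as handle:
        handle.write("sample")
    os.utime(source, (KNOWN_MTIME, KNOWN_MTIME))

    plain = shutil.copy(source, os.path.join(tmp, "plain.txt"))
    with_meta = shutil.copy2(source, os.path.join(tmp, "with_meta.txt"))

    plain_mtime = int(os.path.getmtime(plain))
    meta_mtime = int(os.path.getmtime(with_meta))

print("shutil.copy :", plain_mtime)   # time of the copy itself
print("shutil.copy2:", meta_mtime)    # the source's timestamp, preserved
```

shutil.copy2() attempts to preserve as much metadata as the platform allows; modification time is the part that matters most when files are later sorted by date.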

#Progress Reporting

  • Real-time progress updates were implemented to provide feedback during long operations
  • The counter system shows both files processed and total files for context

#Security Considerations

  • The tool only examines file metadata, not contents
  • No execution of files occurs during the organization process
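  • The script checks that the source is an existing directory and that the target is a directory (creating it if missing); a malformed config file falls back to the default categories, but the mapping inside a well-formed file is not validated further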

#Performance Optimization

  • Directory walking is optimized by filtering directories early when hidden files are excluded
  • Folders are created only as needed in extension mode to minimize filesystem operations
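Early filtering depends on a detail of os.walk: the directory list must be modified in place. The sketch below contrasts in-place slice assignment with rebinding the name, which leaves the walk unaffected; walked_dirs is a hypothetical helper.

```python
import os
import tempfile


def walked_dirs(top, prune_in_place):
    """Return the directories os.walk visits, relative to `top`."""
    visited = []
    for root, dirs, _ in os.walk(top):
        visited.append(os.path.relpath(root, top))
        if prune_in_place:
            dirs[:] = [d for d in dirs if not d.startswith(".")]  # os.walk sees this
        else:
            dirs = [d for d in dirs if not d.startswith(".")]  # rebinding: no effect
    return sorted(visited)


with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, ".cache", "deep"))
    os.makedirs(os.path.join(top, "docs"))
    pruned = walked_dirs(top, prune_in_place=True)
    unpruned = walked_dirs(top, prune_in_place=False)

print(pruned)
print(unpruned)
```

With in-place pruning the hidden .cache tree is never entered, so its contents cost nothing; with rebinding, every hidden directory is still traversed.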

#Maintenance Approach

  • Code is documented thoroughly for third-party maintenance
  • Functions have clear purposes and interfaces
  • Error handling is consistent across the codebase
  • Testing procedures cover both common and edge cases

#Third-Party Maintenance Guidelines

For developers maintaining this code:

  1. Understanding the Core Logic:
    • The main functionality is in the sort_files() function
    • File categorization happens in categorize_file()
    • Configuration loading is handled by load_config_file()
  2. Adding New Features:
    • Maintain the existing error handling pattern
    • Document all changes thoroughly
    • Update tests to cover new functionality
    • Consider backward compatibility
  3. Fixing Issues:
    • Check for edge cases with unusual filenames or paths
    • Verify platform-specific behavior (especially Windows long paths)
    • Test with large directories and various file types
  4. Refactoring Guidelines:
    • Maintain clear function purposes
    • Preserve the current error handling strategy
    • Ensure backward compatibility
    • Update documentation to reflect changes

#Python script code

#!/usr/bin/env python3
import os
import shutil
import argparse
import json
import platform
import traceback
from datetime import datetime

DEFAULT_CATEGORIES_CONFIG = {
    "images": [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"],
    "documents": [".pdf", ".docx", ".doc", ".txt", ".rtf", ".odt",
                  ".xlsx", ".xls", ".csv", ".pptx", ".ppt"],
    "videos": [".mp4", ".avi", ".mkv", ".mov", ".wmv", ".flv"],
    "audio": [".mp3", ".wav", ".flac", ".aac", ".ogg", ".m4a"],
    "archives": [".zip", ".rar", ".tar", ".gz", ".bz2", ".7z"],
    "code": [".py", ".java", ".c", ".cpp", ".h", ".html", ".css", ".js", ".xml", ".json"],
    "apps": [".exe", ".msi", ".apk", ".dmg"],
    "other": []
}


def handle_long_path(path):
    """Return an absolute path, prefixed with \\\\?\\ on Windows when longer than 260 characters."""
    path = os.path.abspath(path)
    if platform.system() == "Windows" and len(path) > 260 and not path.startswith("\\\\?\\"):
        path = "\\\\?\\" + path
    return path


def load_config_file(config_path):
    """Load categories from a JSON file, falling back to the defaults on any problem."""
    if config_path and os.path.isfile(config_path):
        try:
            with open(config_path, "r", encoding="utf-8") as f:
                return json.load(f)
        except (json.JSONDecodeError, OSError):
            print(f"Invalid config file '{config_path}'. Using default categories.")
    return DEFAULT_CATEGORIES_CONFIG


def categorize_file(filename, categories_config):
    """Return the first category that lists the file's (lowercased) extension."""
    _, ext = os.path.splitext(filename)
    ext = ext.lower()
    for category, extensions in categories_config.items():
        if ext in extensions:
            return category
    return "other"


def create_target_folders(base_dir, organize_by, categories_config):
    """Pre-create category folders; extension folders are created on demand."""
    if organize_by == "category":
        for category in categories_config:
            os.makedirs(os.path.join(base_dir, category), exist_ok=True)
    elif organize_by == "extension":
        pass
    else:
        raise ValueError("Invalid organize_by option.")


def sort_files(
    source_directory, target_directory, organize_by, timestamp_duplicates,
    move_files, categories_config, include_hidden, follow_links, skip_existing
):
    source_directory = handle_long_path(source_directory)
    target_directory = handle_long_path(target_directory)

    if not os.path.isdir(source_directory):
        print(f"Error: Source directory '{source_directory}' is invalid.")
        return

    if not os.path.exists(target_directory):
        os.makedirs(target_directory)
    elif not os.path.isdir(target_directory):
        print(f"Error: Target path '{target_directory}' is not a directory.")
        return

    # total_files grows as directories are discovered, so it is a running total.
    total_files = 0
    processed_files = 0

    for root, dirs, files in os.walk(source_directory, followlinks=follow_links):
        root = handle_long_path(root)
        if not include_hidden:
            # In-place assignment stops os.walk from descending into hidden directories.
            dirs[:] = [d for d in dirs if not d.startswith(".")]
            files = [f for f in files if not f.startswith(".")]
        total_files += len(files)

        for file in files:
            filepath = os.path.join(root, file)
            filepath = handle_long_path(filepath)

            # Skip Windows shortcuts and entries that vanished or are broken links.
            if file.lower().endswith(".lnk") or not os.path.exists(filepath):
                processed_files += 1
                print(f"Skipping .lnk or non-existent: {filepath}")
                print(f"Progress: {processed_files}/{total_files}", end="\r")
                continue

            if organize_by == "category":
                category = categorize_file(file, categories_config)
                target_folder = os.path.join(target_directory, category)
            elif organize_by == "extension":
                _, ext = os.path.splitext(file)
                ext_folder = ext[1:].lower() if ext else "no_extension"
                target_folder = os.path.join(target_directory, ext_folder)
            else:
                raise ValueError("Invalid organize_by option.")

            os.makedirs(target_folder, exist_ok=True)
            target_fullpath = os.path.join(target_folder, file)

            try:
                if os.path.exists(target_fullpath):
                    if skip_existing:
                        # The finally block below updates the counter and progress
                        # line, so the file is counted once, not twice.
                        print(f"Skipping existing: {target_fullpath}")
                        continue
                    if timestamp_duplicates:
                        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                        base, ext = os.path.splitext(file)
                        new_name = f"{base}_{stamp}{ext}"
                        target_fullpath = os.path.join(target_folder, new_name)

                if move_files:
                    shutil.move(filepath, target_fullpath)
                else:
                    shutil.copy2(filepath, target_fullpath)  # copy2 keeps metadata
            except Exception as e:
                # Report the failure and carry on with the next file.
                print(f"\nError processing '{filepath}': {e}")
                traceback.print_exc()
            finally:
                processed_files += 1
                print(f"Progress: {processed_files}/{total_files}", end="\r")

    print("\nFile organization completed.")


def main():
    parser = argparse.ArgumentParser(description="Organize files by category or extension.")
    parser.add_argument("--source", "-s", required=True, help="Source directory.")
    parser.add_argument("--target", "-t", required=True, help="Target directory.")
    parser.add_argument("--config", "-c", help="Path to JSON config file for categories.")
    parser.add_argument("--organize-by", choices=["category", "extension"],
                        default="category", help="Organize by category or extension.")
    parser.add_argument("--move", action="store_true", help="Move instead of copy.")
    parser.add_argument("--no-timestamp", action="store_true",
                        help="Disable timestamping duplicates.")
    parser.add_argument("--include-hidden", action="store_true", help="Include hidden files.")
    parser.add_argument("--follow-links", action="store_true", help="Follow symbolic links.")
    parser.add_argument("--skip-existing", action="store_true", help="Skip existing files.")
    args = parser.parse_args()

    categories = load_config_file(args.config)
    if args.organize_by == "category":
        create_target_folders(args.target, args.organize_by, categories)

    sort_files(
        source_directory=args.source,
        target_directory=args.target,
        organize_by=args.organize_by,
        timestamp_duplicates=not args.no_timestamp,
        move_files=args.move,
        categories_config=categories,
        include_hidden=args.include_hidden,
        follow_links=args.follow_links,
        skip_existing=args.skip_existing
    )


if __name__ == "__main__":
    main()
URL: https://ib.bsb.br/files-by-type