files to prompt (f2p)

1. INSTALLATION
- 1.1 Primary Requirements
- 1.2 Script File Reference
2. BASIC USAGE
3. FLAGS & OPTIONS
4. FILE & ENCODING LOGIC
5. IGNORING LOGIC
6. OUTPUT FORMATS
7. EXAMPLES
8. Jinja2 EXAMPLES
- Example Templates
9. FURTHER NOTES

#1. INSTALLATION

#1.1 Primary Requirements

click (command-line interface support)
Optional modules:
- rarfile for .rar archives
- py7zr for .7z archives
- pathspec for advanced .gitignore matching
- jinja2 for templated output

Install everything:

pip install click rarfile py7zr pathspec jinja2

Or install only what you need (for example, omit rarfile if you don’t require .rar support).
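
To see which of the optional modules are importable before running the script, a small check along these lines may help. This is a sketch and not part of f2p; the helper name available_optional_modules is hypothetical.

```python
from importlib.util import find_spec

# Optional modules listed above and the feature each one enables.
OPTIONAL_MODULES = {
    "rarfile": ".rar archives",
    "py7zr": ".7z archives",
    "pathspec": "advanced .gitignore matching",
    "jinja2": "templated output",
}

def available_optional_modules(names=OPTIONAL_MODULES):
    """Return {module_name: bool} telling whether each module can be imported."""
    return {name: find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, present in available_optional_modules().items():
        status = "installed" if present else "missing"
        print(f"{name:10s} {status:10s} ({OPTIONAL_MODULES[name]})")
```

A module reported as missing only disables the matching feature; the script itself still runs.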

#1.2 Script File Reference

The provided script is typically named f2p.py, so you would run:

python f2p.py [OPTIONS] [PATHS...]

#2. BASIC USAGE

Run the script as follows:

python f2p.py [OPTIONS] /some/path /another/path

The script recursively scans directories and extracts recognized archives:
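- .zip, .rar, .7z, .tar, .gz, .bz2 (.gz and .bz2 files are opened as tar archives, so a plain gzip or bzip2 file that is not a tarball fails to extract and is skipped with a warning)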
- Office Open XML / OpenDocument formats: .docx, .xlsx, .pptx, .odt, .ods, .odp

(Legacy formats such as .doc, .xls, .ppt are not treated as archives.)
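
The recognition step depends only on the final file extension. A minimal sketch of that decision, assuming the extension lists above; the function name classify and its return labels are illustrative, not part of the script:

```python
import os

# Extensions taken from the lists above.
ARCHIVE_EXTS = {".zip", ".rar", ".7z", ".tar", ".gz", ".bz2"}
OFFICE_EXTS = {".docx", ".xlsx", ".pptx", ".odt", ".ods", ".odp"}

def classify(path: str) -> str:
    """Return 'archive', 'office', or 'text' based only on the final extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in ARCHIVE_EXTS:
        return "archive"
    if ext in OFFICE_EXTS:
        return "office"
    return "text"
```

Because only the last suffix is inspected, backup.tar.gz is classified through its .gz suffix, and a legacy file such as old.doc is read as ordinary text.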

#3. FLAGS & OPTIONS

Flag/Option	Type / Default	Short	Description
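`--extension`	multiple=True / None	`-e`	Restricts directory scans to files with the given extensions (repeatable). Archives and Office docs are filtered the same way, so list their extensions too if they should be opened.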
`--include-hidden`	is_flag=True / False	n/a	Considers hidden/dot files and directories.
`--ignore-gitignore`	is_flag=True / False	n/a	Ignores `.gitignore` rules in directories.
`--ignore`	multiple=True / None	n/a	Excludes files matching glob patterns (e.g., `*.log`).
`--output`	file path / None	`-o`	Writes output to the specified file instead of stdout.
`--xml`	is_flag=True / False	n/a	Outputs content in an XML-like structure.
`--template-file`	file path / None	`-t`	Uses a Jinja2 template for custom formatting (requires jinja2).
`--max-depth`	int / 5	`-d`	Limits recursion depth for nested archives (default: 5).

#4. FILE & ENCODING LOGIC

Multiple Encoding Attempts:
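- Tries utf-8 first, then latin-1.
- latin-1 maps every byte to a character, so the fallback always decodes; binary files may therefore appear as garbled text. A file’s content is omitted (with a warning logged) only when the file cannot be opened or read.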
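
The two-step decoding can be sketched as follows. This is an illustration under the assumptions stated in the comments; the function name read_text_with_fallback and its return shape are hypothetical, not the script's API.

```python
from typing import Optional, Tuple

def read_text_with_fallback(
    path: str, encodings=("utf-8", "latin-1")
) -> Optional[Tuple[str, str]]:
    """Return (text, encoding_used), or None if no listed encoding decodes the file."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError:
            continue  # try the next encoding in the list
    # Only reachable when latin-1 is not in the list: latin-1 accepts any byte sequence.
    return None
```

Keeping latin-1 last makes it a catch-all; placing it first would hide every UTF-8 file behind mojibake.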
Archive & Office Document Extraction:
- .rar extraction requires rarfile, and .7z extraction requires py7zr.
- Office documents (OOXML/ODF) are processed as .zip archives.
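- Tar members are checked against the extraction directory before extraction to prevent path traversal; the other formats rely on the extracting library’s own handling of member paths.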
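
A path traversal check compares where each archive member would land with the extraction directory. The sketch below uses os.path.commonpath, which compares whole path components; the helper unsafe_members is hypothetical and only illustrates the idea.

```python
import os

def is_within_directory(directory: str, target: str) -> bool:
    """True if target is the directory itself or a path beneath it."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory

def unsafe_members(extract_dir: str, member_names):
    """Return the member names that would be written outside extract_dir."""
    return [
        name for name in member_names
        if not is_within_directory(extract_dir, os.path.join(extract_dir, name))
    ]
```

A character-based comparison such as os.path.commonprefix would accept /tmp/out_evil as being inside /tmp/out, which is why the component-wise function is used here.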

#5. IGNORING LOGIC

.gitignore / pathspec:
- When pathspec is installed, advanced .gitignore rules apply.
- Otherwise, a simpler fnmatch approach is used.
Hidden Files:
- Hidden items are skipped by default (unless --include-hidden is used).
Additional Patterns:
- Use --ignore to exclude files (e.g., --ignore="*.log" or --ignore="*_backup.*").
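
The hidden-file rule and the --ignore patterns both operate on bare file names. A minimal sketch of that filtering, assuming fnmatch-style globbing as described above; the function name filter_names is illustrative:

```python
from fnmatch import fnmatch

def filter_names(file_names, ignore_patterns=(), include_hidden=False):
    """Drop hidden names (unless included) and names matching any glob pattern."""
    kept = []
    for name in file_names:
        if not include_hidden and name.startswith("."):
            continue  # hidden item, skipped by default
        if any(fnmatch(name, pattern) for pattern in ignore_patterns):
            continue  # matches an --ignore style pattern
        kept.append(name)
    return kept
```

For example, filter_names([".env", "app.py", "debug.log"], ["*.log"]) keeps only app.py.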

#6. OUTPUT FORMATS

Plain Text (Default):
- Prints each file’s path, followed by a separator and the file content.
XML-like (--xml):
- Wraps the content within <section>...</section> elements.
Jinja2 Templates (-t/--template-file):
- Applies a provided .j2 template to format each file’s content.

#7. EXAMPLES

Restrict to .py & .md and ignore *.log:
python f2p.py -e .py -e .md --ignore="*.log" /path/to/process
Process hidden files, disable .gitignore, and output to a file:
python f2p.py --include-hidden --ignore-gitignore -o out.txt /some/path
Output as XML and limit recursion to 3 levels:
python f2p.py --xml --max-depth=3 /path/to/archives
Use a Jinja2 template:
python f2p.py --template-file=custom_template.j2 /path/to/files
Process multiple paths:
python f2p.py /first/path /second/path

#8. Jinja2 EXAMPLES

When invoking:

python f2p.py -t my_template.j2 [PATHS...]

the template receives:

{{ path }}: the file’s path
{{ content }}: the file’s text content
{{ index }}: a numeric counter for labeling
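
To try a template outside the script, the same three variables can be passed to Jinja2 directly. This is a sketch that assumes jinja2 is installed; the inline TEMPLATE string and the function render_files are made up for illustration, and a real run would load a .j2 file instead.

```python
from jinja2 import Environment

# A throwaway template string standing in for a .j2 file.
TEMPLATE = "File #{{ index }}: {{ path }}\n{{ content }}"

def render_files(files):
    """Render (path, content) pairs, passing a 1-based index alongside each."""
    template = Environment().from_string(TEMPLATE)
    return [
        template.render(path=path, content=content, index=index)
        for index, (path, content) in enumerate(files, start=1)
    ]
```

Rendering a handful of sample pairs this way is a quick check that a template compiles before it is passed to -t.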

#Example Templates

Minimal Example: Plain-Text Highlight
File: minimal_example.j2
File #{{ index }}: {{ path }}
----------------------------
{{ content }}
----------------------------

Rationale: Displays the file path and content with a simple separator.
Numbered Lines Use Case
File: numbered_lines.j2
File: {{ path }} (Index: {{ index }})
======================================
{% set lines = content.split('\n') %}
{% for line in lines %}
{{ loop.index }}: {{ line }}
{% endfor %}
======================================

Rationale: Enumerates each line, useful for line-by-line reference.
HTML Output for Browser Rendering
File: html_output.j2
<html>
<head>
<title>File {{ index }}</title>
</head>
<body>
<h2>File Path: {{ path | e }}</h2>
<p><strong>Index:</strong> {{ index }}</p>
<hr />
<pre>
{{ content | e }}
</pre>
</body>
</html>

Rationale: Formats the file content into a simple HTML page.
Markdown Code Snippet
File: markdown_snippet.j2
### File #{{ index }}: {{ path }}

{{ content }}

Rationale: Ideal for embedding code or text in a Markdown document.
Summarized Headings Template
File: summarized_headings.j2
["FILE #{{ index }}"] {{ path }}
---------------------------------
{% set first_lines = content.split('\n')[:3] %}
{% for line in first_lines %}
{{ line }}
{% endfor %}
---------------------------------
(... Content Truncated ...)

Rationale: Shows only the first few lines to save space.
JSON-Inspired Output Template
File: json_inspired.j2
{
  "index": {{ index }},
  "path": "{{ path | replace('\\', '\\\\') }}",
  "content_lines": [
    {% set lines = content.split('\n') %}
    {% for line in lines %}
    "{{ line | replace('"','\\"') }}"{{ "," if not loop.last else "" }}
    {% endfor %}
  ]
}

Rationale: Outputs the file data in a JSON-like structure.
Columnar Key-Value Template
File: columnar_kv.j2
=============== File #{{ index }} ===============
Path: {{ path }}
---------------
{% for key, val in { 'Characters': content|length, 'Lines': content.split('\n')|length, 'First Line': content.split('\n')[0] if content else '' }.items() %}
{{ key }}: {{ val }}
{% endfor %}
---------------
{{ content }}

Rationale: Displays statistics followed by the file content.
Interactive-Like Script Template
File: interactive_prompt.j2
=== File #{{ index }} ===
LOAD FILE: {{ path }}
RUN COMMANDS:
1) SomeProcess --file "{{ path }}"
2) AnotherProcess --analyze "{{ path }}"
3) (Optional) Check content below:
{{ content }}
=================

Rationale: Provides a stylized “script” output with follow-up commands.
Task-List / To-Do Style Template
File: task_list.j2
## File #{{ index }}: {{ path }}
- [ ] Review lines for errors
- [ ] Extract useful references
- [ ] Create summary
- [ ] Mark for final review
Content:
{{ content }}

Rationale: Produces a checklist along with the file content.
Blockquote Slicer Template
File: blockquote_slicer.j2
> **File #{{ index }}**: {{ path }}
{% for line in content.split('\n') %}
> {{ "%02d" | format(loop.index) }} {{ line }}
{% endfor %}

Rationale: Converts each line into a blockquote with a line number.
Content by Word Count Buckets
File: word_bucket.j2
## File #{{ index }}: {{ path }}
{% set words = content.split() %}
{% if words|length < 30 %}
(SHORT FILE)
{{ content }}
{% elif words|length < 100 %}
(MEDIUM FILE)
---BEGIN---
{{ content }}
---END-----
{% else %}
(LONG FILE - WORD COUNT: {{ words|length }})
[Preview: First 100 words]
{{ words[:100]|join(' ') }}
[... shortened ...]
{% endif %}

Rationale: Adjusts output based on the file’s length.
Script-Inlining Template (Code + Comments)
File: inline_script.j2
### SCRIPT SNIPPET (INDEX: {{ index }})
# File Path: {{ path }}
cat <<'EOF' > output_file_{{ index }}.txt
{{ content }}
EOF
# Explanation:
# Writes the file content into "output_file_{{ index }}.txt" using a here-document.

Rationale: Useful for recreating file content on another system.
Directory Tree Logging Template
File: directory_tree.j2
[FILE ENTRY #{{ index }}]
PATH: {{ path }}
DIR OR FILE:
{% if content == '' and '.' not in path.split('/')[-1] %}
(Possibly a directory or empty file)
{% else %}
(File with content)
{% endif %}
========= CONTENT START =========
{{ content }}
========= CONTENT END ===========

Rationale: Distinguishes between empty directories and files with content.
Quick Data Stats with Regex (Custom Filters)
File: quick_regex_stats.j2
{% set lines = content.split('\n') %}
{% set import_lines = lines | select("match", "^(import|from) ") | list %}
{% set todo_lines = lines | select("match", ".*TODO.*") | list %}
File #{{ index }}: {{ path }}
============================
TOTAL LINES: {{ lines|length }}
IMPORT STATEMENTS: {{ import_lines|length }}
TODO MARKERS: {{ todo_lines|length }}
-- EXCERPT (first 5 lines) --
{% for l in lines[:5] %}
{{ l }}
{% endfor %}
----------------------------

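Rationale: Analyzes text (e.g., counting “import” or “TODO” occurrences). The match test used here is not built into Jinja2, so it must be registered on the Environment; otherwise rendering fails and f2p falls back to plain-text output.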
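
One way to supply a match test is to register a small function on the Jinja2 Environment, since Jinja2 has no built-in test by that name. The sketch below assumes jinja2 is installed and uses re.match semantics; regex_match and render_stats are hypothetical names. Using it with f2p would mean adding the same env.tests assignment where the script creates its Environment.

```python
import re
from jinja2 import Environment

def regex_match(value: str, pattern: str) -> bool:
    """Jinja2 test: True when pattern matches at the start of value."""
    return re.match(pattern, value) is not None

env = Environment()
env.tests["match"] = regex_match  # makes select("match", ...) usable in templates

STATS_TEMPLATE = env.from_string(
    "{% set lines = content.split('\\n') %}"
    "{{ lines | select('match', '^(import|from) ') | list | length }} imports, "
    "{{ lines | select('match', '.*TODO.*') | list | length }} TODOs"
)

def render_stats(content: str) -> str:
    """Count import statements and TODO markers in content via the custom test."""
    return STATS_TEMPLATE.render(content=content)
```

For the four-line input "import os\nfrom sys import path\n# TODO: tidy\nx = 1", render_stats returns "2 imports, 1 TODOs".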

#9. FURTHER NOTES

Ensure you have installed all required modules (e.g., rarfile, py7zr) for handling specific archive types.
The script processes .docx, .pptx, .xlsx, .odt, .ods, and .odp as archives.
Legacy MS Office formats (such as .doc, .xls, .ppt) are not supported.
Adjust the --max-depth parameter when processing heavily nested archives.
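
The effect of a depth limit on nested archives can be illustrated with in-memory ZIP files. This is a sketch only: it counts one level per archive, which may differ from the exact depth accounting in f2p, and the names list_nested_zip and make_zip are hypothetical.

```python
import io
import zipfile

def list_nested_zip(data: bytes, max_depth: int, depth: int = 0, prefix: str = ""):
    """List member paths of a ZIP, descending into inner .zip members up to max_depth."""
    found = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in sorted(zf.namelist()):
            full = prefix + name
            if name.lower().endswith(".zip"):
                if depth + 1 > max_depth:
                    found.append(full + " (skipped: max depth reached)")
                else:
                    inner = zf.read(name)
                    found.extend(list_nested_zip(inner, max_depth, depth + 1, full + "!/"))
            else:
                found.append(full)
    return found

def make_zip(members: dict) -> bytes:
    """Build an in-memory ZIP from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()
```

With an archive nested two levels deep, a limit of 2 reaches the innermost file, while a limit of 1 lists the inner archive as skipped.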

#!/usr/bin/env python3

import os
import sys
import tempfile
import shutil
import zipfile
import tarfile
import click
import logging
from fnmatch import fnmatch
from typing import Callable, List, Optional, Tuple

# Attempt to import optional modules
try:
    import rarfile
except ImportError:
    rarfile = None

try:
    import py7zr
except ImportError:
    py7zr = None

try:
    import pathspec
except ImportError:
    pathspec = None

try:
    from jinja2 import Environment, FileSystemLoader
except ImportError:
    Environment = None
    FileSystemLoader = None

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    stream=sys.stderr,
)
logger = logging.getLogger(__name__)

def is_within_directory(directory: str, target: str) -> bool:
    """
    Checks if the target path is within the specified directory,
    helping to avoid path traversal vulnerabilities.
    """
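    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    # commonpath compares whole path components; commonprefix compares characters
    # and would wrongly accept "/tmp/out_evil" as being inside "/tmp/out".
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory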

def safe_extract(tar: tarfile.TarFile, path: str = ".", members=None) -> None:
    """
    Safely extract tar contents, preventing directory traversal.
    """
    for member in (members or tar.getmembers()):
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError("Detected path traversal attempt.")
    tar.extractall(path=path, members=members)

def handle_zip(file_path: str, extract_dir: str) -> bool:
    """
    Extracts ZIP archives, handling potential exceptions.
    """
    try:
        with zipfile.ZipFile(file_path, "r") as zf:
            zf.extractall(extract_dir)
        return True
    except zipfile.BadZipFile as e:
        logger.warning(f"Bad ZIP file {file_path}: {str(e)}")
        return False

def handle_rar(file_path: str, extract_dir: str) -> bool:
    """
    Extracts RAR archives if rarfile is available.
    """
    if not rarfile:
        logger.warning("RAR handling requires 'rarfile' to be installed.")
        return False
    try:
        with rarfile.RarFile(file_path, "r") as rf:
            rf.extractall(extract_dir)
        return True
    except rarfile.Error as e:
        logger.warning(f"RAR extraction failed: {str(e)}")
        return False

def handle_7z(file_path: str, extract_dir: str) -> bool:
    """
    Extracts 7z archives if py7zr is available.
    """
    if not py7zr:
        logger.warning("7z handling requires 'py7zr' to be installed.")
        return False
    try:
        with py7zr.SevenZipFile(file_path, "r") as sz:
            sz.extractall(extract_dir)
        return True
    except py7zr.exceptions.Bad7zFile as e:
        logger.warning(f"7z extraction failed: {str(e)}")
        return False

def handle_tar(file_path: str, extract_dir: str) -> bool:
    """
    Extracts TAR archives, using safe_extract to avoid path traversal.
    """
    try:
        with tarfile.open(file_path, "r:*") as tf:
            safe_extract(tf, extract_dir)
        return True
    except tarfile.TarError as e:
        logger.warning(f"TAR extraction failed: {str(e)}")
        return False

ARCHIVE_HANDLERS = {
    ".zip": handle_zip,
    ".rar": handle_rar,
    ".7z": handle_7z,
    ".tar": handle_tar,
    ".gz": handle_tar,
    ".bz2": handle_tar,
}

OFFICE_EXTENSIONS = [".docx", ".xlsx", ".pptx", ".odt", ".ods", ".odp"]

def read_gitignore(directory: str) -> List[str]:
    """
    Reads lines from .gitignore if present.
    """
    path = os.path.join(directory, ".gitignore")
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip() and not line.startswith("#")]
    return []

def should_ignore(path: str, rules: List[str]) -> bool:
    """
    Basic fnmatch-based ignoring for files or directories.
    """
    base = os.path.basename(path)
    if os.path.isdir(path):
        base += "/"
    return any(fnmatch(base, rule) for rule in rules)

class GitignoreHandler:
    """
    Handles ignoring of files or directories based on .gitignore (via pathspec or fallback).
    """
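    def __init__(self, directory: str):
        self.directory = directory
        self.pathspec_spec = None
        self.fallback_rules = []
        lines = read_gitignore(directory)
        if pathspec:
            self.pathspec_spec = pathspec.PathSpec.from_lines("gitwildmatch", lines)
        else:
            self.fallback_rules = lines

    def should_ignore(self, path_to_check: str) -> bool:
        if self.pathspec_spec:
            # .gitignore patterns are relative to the directory that holds the file
            rel_path = os.path.relpath(path_to_check, self.directory)
            if os.path.isdir(path_to_check):
                rel_path += "/"  # lets directory-only patterns such as "build/" match
            return self.pathspec_spec.match_file(rel_path)
        return should_ignore(path_to_check, self.fallback_rules)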

class OutputFormatter:
    """
    Outputs data in either plain text, XML-like format, or via Jinja2 templates if available.
    """
    def __init__(
        self,
        writer: Callable[[str], None],
        xml_mode: bool = False,
        template_file: Optional[str] = None
    ):
        self.writer = writer
        self.xml_mode = xml_mode
        self.xml_index = 1
        self.template_file = template_file
        self.jinja_env = None
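        if template_file and Environment and FileSystemLoader:
            # dirname() is "" for a bare filename; fall back to the current directory
            template_dir = os.path.dirname(template_file) or "."
            self.jinja_env = Environment(loader=FileSystemLoader(template_dir))
        elif template_file:
            logger.warning("Template output requires 'jinja2' to be installed; using the default output.")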

    def write(self, path: str, content: str) -> None:
        """
        Decides the approach (Jinja2/XML/plain text) for output.
        """
        if self.jinja_env and self.template_file:
            try:
                template_name = os.path.basename(self.template_file)
                template = self.jinja_env.get_template(template_name)
                rendered = template.render(path=path, content=content, index=self.xml_index)
                self.writer(rendered)
                self.xml_index += 1  # advance the counter exposed to templates as {{ index }}
            except Exception as e:
                logger.warning(f"Jinja2 rendering error: {e}")
                self._fallback_write(path, content)
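        elif self.xml_mode:
            # <section> and <source> follow the closing tags below; the index
            # attribute and the <content> wrapper element are assumed names.
            self.writer(f'<section index="{self.xml_index}">')
            self.writer(f'    <source>{path}</source>')
            self.writer('    <content>')
            for line in content.splitlines():
                self.writer(f'        {line}')
            self.writer('    </content>')
            self.writer('</section>')
            self.xml_index += 1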
        else:
            self._fallback_write(path, content)

    def _fallback_write(self, path: str, content: str):
        """
        Prints content with separators if not using templates or XML.
        """
        self.writer(path)
        self.writer("--")
        self.writer(content)
        self.writer("")
        self.writer("--")
        self.xml_index += 1

class FileProcessor:
    """
    Manages recursion through directories or archives and applies ignoring, formatting, etc.
    """
    def __init__(
        self,
        extensions: Tuple[str, ...],
        include_hidden: bool,
        ignore_gitignore: bool,
        ignore_patterns: Tuple[str, ...],
        formatter: OutputFormatter,
        max_depth: int = 5,
    ):
        self.extensions = [ext.lower() for ext in extensions]
        self.include_hidden = include_hidden
        self.ignore_gitignore = ignore_gitignore
        self.ignore_patterns = ignore_patterns
        self.formatter = formatter
        self.max_depth = max_depth

    def process_path(self, path: str, depth: int = 0, extra_gitignore_rules: Optional[List[str]] = None) -> None:
        if extra_gitignore_rules is None:
            extra_gitignore_rules = []
        if depth > self.max_depth:
            logger.warning(f"Max recursion depth ({self.max_depth}) reached at {path}.")
            return
        if os.path.isfile(path):
            self._handle_file(path, depth)
        elif os.path.isdir(path):
            if not self.ignore_gitignore:
                extra_gitignore_rules.extend(read_gitignore(path))
            self._handle_directory(path, depth, extra_gitignore_rules)

    def _handle_file(self, path: str, depth: int) -> None:
        ext = os.path.splitext(path)[1].lower()
        if ext in ARCHIVE_HANDLERS or ext in OFFICE_EXTENSIONS:
            self._extract_and_recurse(path, ext, depth)
        else:
            self._read_and_output(path)

    def _extract_and_recurse(self, path: str, ext: str, depth: int) -> None:
        handler_func = ARCHIVE_HANDLERS.get(ext)
        if ext in OFFICE_EXTENSIONS:
            # Office documents are ZIP-based archives
            handler_func = ARCHIVE_HANDLERS[".zip"]
        if not handler_func:
            logger.warning(f"No valid handler for extension: {ext}")
            return
        with tempfile.TemporaryDirectory() as tmpdir:
            success = handler_func(path, tmpdir)
            if success:
                self.process_path(tmpdir, depth + 1)
            else:
                logger.warning(f"Extraction failed for {path}")

    def _read_and_output(self, path: str) -> None:
        encodings_to_try = ["utf-8", "latin-1"]
        for encoding in encodings_to_try:
            try:
                with open(path, "r", encoding=encoding) as f:
                    content = f.read()
                self.formatter.write(path, content)
                return
            except UnicodeDecodeError:
                continue
            except Exception as e:
                logger.warning(f"File read error {path}: {e}")
                return
        logger.warning(f"Could not read file {path} with provided encodings.")

    def _handle_directory(self, directory: str, depth: int, extra_gitignore_rules: List[str]) -> None:
        gitignore_handler = None
        if not self.ignore_gitignore:
            gitignore_handler = GitignoreHandler(directory)
        for root, dirs, files in os.walk(directory):
            if not self.include_hidden:
                dirs[:] = [d for d in dirs if not d.startswith(".")]
                files = [f for f in files if not f.startswith(".")]

            if gitignore_handler:
                dirs[:] = [d for d in dirs if not gitignore_handler.should_ignore(os.path.join(root, d))]
                files = [f for f in files if not gitignore_handler.should_ignore(os.path.join(root, f))]

            dirs[:] = [d for d in dirs if not should_ignore(os.path.join(root, d), extra_gitignore_rules)]
            files = [f for f in files if not should_ignore(os.path.join(root, f), extra_gitignore_rules)]

            if self.ignore_patterns:
                files = [
                    f for f in files
                    if not any(fnmatch(f, pattern) for pattern in self.ignore_patterns)
                ]

            if self.extensions:
                files = [
                    f for f in files
                    if any(f.lower().endswith(ext) for ext in self.extensions)
                ]

            for file_name in sorted(files):
                # Walking a directory does not consume depth; only archive extraction does.
                self.process_path(os.path.join(root, file_name), depth, extra_gitignore_rules)

@click.command()
@click.argument("paths", nargs=-1, type=click.Path(exists=True))
@click.option("-e", "--extension", "extensions", multiple=True, help="Specify file extensions, e.g. .txt, .md.")
@click.option("--include-hidden", is_flag=True, default=False, help="Include hidden files and subdirectories.")
@click.option("--ignore-gitignore", is_flag=True, default=False, help="Do not apply .gitignore-based filtering.")
@click.option("--ignore", "ignore_patterns", multiple=True, help="Specify one or more glob patterns to exclude.")
@click.option("-o", "--output", "output_file", type=click.Path(writable=True), help="Output file path (stdout by default).")
@click.option("--xml", "output_xml", is_flag=True, default=False, help="Output in XML-like format.")
@click.option("-t", "--template-file", "template_file", type=click.Path(exists=True), help="Use a Jinja2 template for output.")
@click.option("-d", "--max-depth", "max_depth", default=5, help="Maximum recursion depth for nested archives.")
def cli(paths, extensions, include_hidden, ignore_gitignore, ignore_patterns, output_file, output_xml, template_file, max_depth):
    """
    f2p (files to prompt): recursively read files and archives and write their contents.

    1) Safe recursion-based file and archive handling.
    2) .gitignore-based ignoring (via pathspec when installed, fnmatch otherwise).
    3) Optional Jinja2-based templating for output formatting.
    """
    writer = click.echo
    file_handle = None

    if output_file:
        try:
            file_handle = open(output_file, "w", encoding="utf-8")
            writer = lambda msg: print(msg, file=file_handle)
        except IOError as e:
            logger.error(f"Could not open output file {output_file}: {e}")
            sys.exit(1)

    formatter = OutputFormatter(
        writer=writer,
        xml_mode=output_xml,
        template_file=template_file
    )

    if output_xml and not template_file:
        writer("<root>")

    processor = FileProcessor(
        extensions=extensions,
        include_hidden=include_hidden,
        ignore_gitignore=ignore_gitignore,
        ignore_patterns=ignore_patterns,
        formatter=formatter,
        max_depth=max_depth
    )

    for path in paths:
        processor.process_path(path)

    if output_xml and not template_file:
        writer("</root>")

    if file_handle:
        file_handle.close()

if __name__ == "__main__":
    cli()

URL: https://ib.bsb.br/f2p