- 1. INSTALLATION
- 2. BASIC USAGE
- 3. FLAGS & OPTIONS
- 4. FILE & ENCODING LOGIC
- 5. IGNORING LOGIC
- 6. OUTPUT FORMATS
- 7. EXAMPLES
- 8. Jinja2 EXAMPLES
- 9. FURTHER NOTES
#1. INSTALLATION
#1.1 Primary Requirements
- click (command-line interface support)
- Optional modules:
- rarfile for
.rar
archives - py7zr for
.7z
archives - pathspec for advanced
.gitignore
matching - jinja2 for templated output
- rarfile for
Install everything:
pip install click rarfile py7zr pathspec jinja2
Or install only what you need (for example, omit rarfile
if you don’t require .rar
support).
#1.2 Script File Reference
The provided script is typically named f2p.py
, so you would run:
python f2p.py [OPTIONS] [PATHS...]
#2. BASIC USAGE
Run the script as follows:
python f2p.py [OPTIONS] /some/path /another/path
- The script recursively scans directories and extracts recognized archives:
.zip
,.rar
,.7z
,.tar
,.gz
,.bz2
- Office Open XML / OpenDocument formats:
.docx
,.xlsx
,.pptx
,.odt
,.ods
,.odp
(Legacy formats such as .doc
, .xls
, .ppt
are not treated as archives.)
#3. FLAGS & OPTIONS
Flag/Option | Type / Default | Short | Description |
---|---|---|---|
--extension |
multiple=True / None | -e |
Restricts processing to specific extensions (archives & Office docs). |
--include-hidden |
is_flag=True / False | n/a | Considers hidden/dot files and directories. |
--ignore-gitignore |
is_flag=True / False | n/a | Ignores .gitignore rules in directories. |
--ignore |
multiple=True / None | n/a | Excludes files matching glob patterns (e.g., *.log ). |
--output |
file path / None | -o |
Writes output to the specified file instead of stdout. |
--xml |
is_flag=True / False | n/a | Outputs content in an XML-like structure. |
--template-file |
file path / None | -t |
Uses a Jinja2 template for custom formatting (requires jinja2). |
--max-depth |
int / 5 | -d |
Limits recursion depth for nested archives (default: 5). |
#4. FILE & ENCODING LOGIC
- Multiple Encoding Attempts:
- Tries
utf-8
first, thenlatin-1
. - If both fail, the file’s content is omitted (with a warning logged).
- Tries
- Archive & Office Document Extraction:
.rar
extraction requiresrarfile
, and.7z
extraction requirespy7zr
.- Office documents (OOXML/ODF) are processed as
.zip
archives. - Extraction is performed safely to prevent path traversal vulnerabilities.
#5. IGNORING LOGIC
- .gitignore / pathspec:
- When
pathspec
is installed, advanced.gitignore
rules apply. - Otherwise, a simpler fnmatch approach is used.
- When
- Hidden Files:
- Hidden items are skipped by default (unless
--include-hidden
is used).
- Hidden items are skipped by default (unless
- Additional Patterns:
- Use
--ignore
to exclude files (e.g.,--ignore="*.log"
or--ignore="*_backup.*"
).
- Use
#6. OUTPUT FORMATS
- Plain Text (Default):
- Prints each file’s path, followed by a separator and the file content.
- XML-like (
--xml
):- Wraps the content within
<section>...</section>
elements.
- Wraps the content within
- Jinja2 Templates (
-t
/--template-file
):- Applies a provided
.j2
template to format each file’s content.
- Applies a provided
#7. EXAMPLES
- Restrict to
.py
&.md
and ignore*.log
:python f2p.py -e .py -e .md --ignore="*.log" /path/to/process
- Process hidden files, disable
.gitignore
, and output to a file:python f2p.py --include-hidden --ignore-gitignore -o out.txt /some/path
- Output as XML and limit recursion to 3 levels:
python f2p.py --xml --max-depth=3 /path/to/archives
- Use a Jinja2 template:
python f2p.py --template-file=custom_template.j2 /path/to/files
- Process multiple paths:
python f2p.py /first/path /second/path
#8. Jinja2 EXAMPLES
When invoking:
python f2p.py -t my_template.j2 [PATHS...]
the template receives:
{{ path }}
: the file’s path{{ content }}
: the file’s text content{{ index }}
: a numeric counter for labeling
#Example Templates
- Minimal Example: Plain-Text Highlight
File: minimal_example.j2File #{{ index }}: {{ path }} ---------------------------- {{ content }} ----------------------------
Rationale: Displays the file path and content with a simple separator.
- Numbered Lines Use Case
File: numbered_lines.j2File: {{ path }} (Index: {{ index }}) ====================================== {% set lines = content.split('\n') %} {% for loop_index, line in lines | enumerate(start=1) %} {{ loop_index }}: {{ line }} {% endfor %} ======================================
Rationale: Enumerates each line, useful for line-by-line reference.
- HTML Output for Browser Rendering
File: html_output.j2<html> <head> <title>File {{ index }}</title> </head> <body> <h2>File Path: {{ path }}</h2> <p><strong>Index:</strong> {{ index }}</p> <hr /> <pre> {{ content }} </pre> </body> </html>
Rationale: Formats the file content into a simple HTML page.
- Markdown Code Snippet
File: markdown_snippet.j2### File #{{ index }}: {{ path }}
{{ content }}
Rationale: Ideal for embedding code or text in a Markdown document.
- Summarized Headings Template
File: summarized_headings.j2["FILE #{{ index }}"] {{ path }} --------------------------------- {% set first_lines = content.split('\n')[:3] %} {% for line in first_lines %} {{ line }} {% endfor %} --------------------------------- (... Content Truncated ...)
Rationale: Shows only the first few lines to save space.
- JSON-Inspired Output Template
File: json_inspired.j2{ "index": {{ index }}, "path": "{{ path | replace('\\', '\\\\') }}", "content_lines": [ {% set lines = content.split('\n') %} {% for line in lines %} "{{ line | replace('"','\\"') }}"{{ "," if not loop.last else "" }} {% endfor %} ] }
Rationale: Outputs the file data in a JSON-like structure.
- Columnar Key-Value Template
File: columnar_kv.j2=============== File #{{ index }} =============== Path: {{ path }} --------------- {% for key, val in { 'Characters': content|length, 'Lines': content.split('\n')|length, 'First Line': content.split('\n')[0] if content else '' }.items() %} {{ key }}: {{ val }} {% endfor %} --------------- {{ content }}
Rationale: Displays statistics followed by the file content.
- Interactive-Like Script Template
File: interactive_prompt.j2=== File #{{ index }} === LOAD FILE: {{ path }} RUN COMMANDS: 1) SomeProcess --file "{{ path }}" 2) AnotherProcess --analyze "{{ path }}" 3) (Optional) Check content below: {{ content }} =================
Rationale: Provides a stylized “script” output with follow-up commands.
- Task-List / To-Do Style Template
File: task_list.j2## File #{{ index }}: {{ path }} - [ ] Review lines for errors - [ ] Extract useful references - [ ] Create summary - [ ] Mark for final review Content: {{ content }}
Rationale: Produces a checklist along with the file content.
- Blockquote Slicer Template
File: blockquote_slicer.j2> **File #{{ index }}**: {{ path }} {% for i, line in content.split('\n') | enumerate %} > {{ "%02d" | format(i+1) }} {{ line }} {% endfor %}
Rationale: Converts each line into a blockquote with a line number.
- Content by Word Count Buckets
File: word_bucket.j2## File #{{ index }}: {{ path }} {% set words = content.split() %} {% if words|length < 30 %} (SHORT FILE) {{ content }} {% elif words|length < 100 %} (MEDIUM FILE) ---BEGIN--- {{ content }} ---END----- {% else %} (LONG FILE - WORD COUNT: {{ words|length }}) [Preview: First 100 words] {{ words[:100]|join(' ') }} [... shortened ...] {% endif %}
Rationale: Adjusts output based on the file’s length.
- Script-Inlining Template (Code + Comments)
File: inline_script.j2### SCRIPT SNIPPET (INDEX: {{ index }}) # File Path: {{ path }} cat <<'EOF' > output_file_{{ index }}.txt {{ content }} EOF # Explanation: # Writes the file content into "output_file_{{ index }}.txt" using a here-document.
Rationale: Useful for recreating file content on another system.
- Directory Tree Logging Template
File: directory_tree.j2[FILE ENTRY #{{ index }}] PATH: {{ path }} DIR OR FILE: {% if content == '' and '.' not in path.split('/')[-1] %} (Possibly a directory or empty file) {% else %} (File with content) {% endif %} ========= CONTENT START ========= {{ content }} ========= CONTENT END ===========
Rationale: Distinguishes between empty directories and files with content.
- Quick Data Stats with Regex (Custom Filters)
File: quick_regex_stats.j2{% set lines = content.split('\n') %} {% set import_lines = lines | select("match", "^(import|from) ") | list %} {% set todo_lines = lines | select("match", ".*TODO.*") | list %} File #{{ index }}: {{ path }} ============================ TOTAL LINES: {{ lines|length }} IMPORT STATEMENTS: {{ import_lines|length }} TODO MARKERS: {{ todo_lines|length }} -- EXCERPT (first 5 lines) -- {% for l in lines[:5] %} {{ l }} {% endfor %} ----------------------------
Rationale: Analyzes text (e.g., counting “import” or “TODO” occurrences).
#9. FURTHER NOTES
- Ensure you have installed all required modules (e.g.,
rarfile
,py7zr
) for handling specific archive types. - The script processes
.docx
,.pptx
,.xlsx
,.odt
,.ods
, and.odp
as archives. - Legacy MS Office formats (such as
.doc
,.xls
,.ppt
) are not supported. - Adjust the
--max-depth
parameter when processing heavily nested archives.
#!/usr/bin/env python3
import os
import sys
import tempfile
import shutil
import zipfile
import tarfile
import click
import logging
from fnmatch import fnmatch
from typing import Callable, List, Optional, Tuple
# Attempt to import optional modules
try:
import rarfile
except ImportError:
rarfile = None
try:
import py7zr
except ImportError:
py7zr = None
try:
import pathspec
except ImportError:
pathspec = None
try:
from jinja2 import Environment, FileSystemLoader
except ImportError:
Environment = None
FileSystemLoader = None
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s",
stream=sys.stderr,
)
logger = logging.getLogger(__name__)
def is_within_directory(directory: str, target: str) -> bool:
"""
Checks if the target path is within the specified directory,
helping to avoid path traversal vulnerabilities.
"""
abs_directory = os.path.abspath(directory)
abs_target = os.path.abspath(target)
return os.path.commonprefix([abs_directory, abs_target]) == abs_directory
def safe_extract(tar: tarfile.TarFile, path: str = ".", members=None) -> None:
"""
Safely extract tar contents, preventing directory traversal.
"""
for member in (members or tar.getmembers()):
member_path = os.path.join(path, member.name)
if not is_within_directory(path, member_path):
raise ValueError("Detected path traversal attempt.")
tar.extractall(path=path, members=members)
def handle_zip(file_path: str, extract_dir: str) -> bool:
"""
Extracts ZIP archives, handling potential exceptions.
"""
try:
with zipfile.ZipFile(file_path, "r") as zf:
zf.extractall(extract_dir)
return True
except zipfile.BadZipFile as e:
logger.warning(f"Bad ZIP file {file_path}: {str(e)}")
return False
def handle_rar(file_path: str, extract_dir: str) -> bool:
"""
Extracts RAR archives if rarfile is available.
"""
if not rarfile:
logger.warning("RAR handling requires 'rarfile' to be installed.")
return False
try:
with rarfile.RarFile(file_path, "r") as rf:
rf.extractall(extract_dir)
return True
except rarfile.Error as e:
logger.warning(f"RAR extraction failed: {str(e)}")
return False
def handle_7z(file_path: str, extract_dir: str) -> bool:
"""
Extracts 7z archives if py7zr is available.
"""
if not py7zr:
logger.warning("7z handling requires 'py7zr' to be installed.")
return False
try:
with py7zr.SevenZipFile(file_path, "r") as sz:
sz.extractall(extract_dir)
return True
except py7zr.exceptions.Bad7zFile as e:
logger.warning(f"7z extraction failed: {str(e)}")
return False
def handle_tar(file_path: str, extract_dir: str) -> bool:
"""
Extracts TAR archives, using safe_extract to avoid path traversal.
"""
try:
with tarfile.open(file_path, "r:*") as tf:
safe_extract(tf, extract_dir)
return True
except tarfile.TarError as e:
logger.warning(f"TAR extraction failed: {str(e)}")
return False
ARCHIVE_HANDLERS = {
".zip": handle_zip,
".rar": handle_rar,
".7z": handle_7z,
".tar": handle_tar,
".gz": handle_tar,
".bz2": handle_tar,
}
OFFICE_EXTENSIONS = [".docx", ".xlsx", ".pptx", ".odt", ".ods", ".odp"]
def read_gitignore(directory: str) -> List[str]:
"""
Reads lines from .gitignore if present.
"""
path = os.path.join(directory, ".gitignore")
if os.path.isfile(path):
with open(path, "r", encoding="utf-8") as f:
return [line.strip() for line in f if line.strip() and not line.startswith("#")]
return []
def should_ignore(path: str, rules: List[str]) -> bool:
"""
Basic fnmatch-based ignoring for files or directories.
"""
base = os.path.basename(path)
if os.path.isdir(path):
base += "/"
return any(fnmatch(base, rule) for rule in rules)
class GitignoreHandler:
"""
Handles ignoring of files or directories based on .gitignore (via pathspec or fallback).
"""
def __init__(self, directory: str):
self.pathspec_spec = None
self.fallback_rules = []
lines = read_gitignore(directory)
if pathspec:
self.pathspec_spec = pathspec.PathSpec.from_lines("gitwildmatch", lines)
else:
self.fallback_rules = lines
def should_ignore(self, path_to_check: str) -> bool:
if self.pathspec_spec:
return self.pathspec_spec.match_file(path_to_check)
return should_ignore(path_to_check, self.fallback_rules)
class OutputFormatter:
"""
Outputs data in either plain text, XML-like format, or via Jinja2 templates if available.
"""
def __init__(
self,
writer: Callable[[str], None],
xml_mode: bool = False,
template_file: Optional[str] = None
):
self.writer = writer
self.xml_mode = xml_mode
self.xml_index = 1
self.template_file = template_file
self.jinja_env = None
if template_file and Environment and FileSystemLoader:
template_dir = os.path.dirname(template_file)
self.jinja_env = Environment(loader=FileSystemLoader(template_dir))
def write(self, path: str, content: str) -> None:
"""
Decides the approach (Jinja2/XML/plain text) for output.
"""
if self.jinja_env and self.template_file:
try:
template_name = os.path.basename(self.template_file)
template = self.jinja_env.get_template(template_name)
rendered = template.render(path=path, content=content, index=self.xml_index)
self.writer(rendered)
except Exception as e:
logger.warning(f"Jinja2 rendering error: {e}")
self._fallback_write(path, content)
elif self.xml_mode:
self.writer(f'<## data-filename="xml_code-block xml" data-code="">')
self.writer(f' {path}</source>')
self.writer(' ')
for line in content.splitlines():
self.writer(f' {line}')
self.writer(' ')
self.writer('</section>')
self.xml_index += 1
else:
self._fallback_write(path, content)
def _fallback_write(self, path: str, content: str):
"""
Prints content with separators if not using templates or XML.
"""
self.writer(path)
self.writer("--")
self.writer(content)
self.writer("")
self.writer("--")
self.xml_index += 1
class FileProcessor:
"""
Manages recursion through directories or archives and applies ignoring, formatting, etc.
"""
def __init__(
self,
extensions: Tuple[str, ...],
include_hidden: bool,
ignore_gitignore: bool,
ignore_patterns: Tuple[str, ...],
formatter: OutputFormatter,
max_depth: int = 5,
):
self.extensions = [ext.lower() for ext in extensions]
self.include_hidden = include_hidden
self.ignore_gitignore = ignore_gitignore
self.ignore_patterns = ignore_patterns
self.formatter = formatter
self.max_depth = max_depth
def process_path(self, path: str, depth: int = 0, extra_gitignore_rules: List[str] = None) -> None:
if extra_gitignore_rules is None:
extra_gitignore_rules = []
if depth > self.max_depth:
logger.warning(f"Max recursion depth ({self.max_depth}) reached at {path}.")
return
if os.path.isfile(path):
self._handle_file(path, depth)
elif os.path.isdir(path):
if not self.ignore_gitignore:
extra_gitignore_rules.extend(read_gitignore(path))
self._handle_directory(path, depth, extra_gitignore_rules)
def _handle_file(self, path: str, depth: int) -> None:
ext = os.path.splitext(path)[1].lower()
if ext in ARCHIVE_HANDLERS or ext in OFFICE_EXTENSIONS:
self._extract_and_recurse(path, ext, depth)
else:
self._read_and_output(path)
def _extract_and_recurse(self, path: str, ext: str, depth: int) -> None:
handler_func = ARCHIVE_HANDLERS.get(ext)
if ext in OFFICE_EXTENSIONS:
# Office documents are ZIP-based archives
handler_func = ARCHIVE_HANDLERS[".zip"]
if not handler_func:
logger.warning(f"No valid handler for extension: {ext}")
return
with tempfile.TemporaryDirectory() as tmpdir:
success = handler_func(path, tmpdir)
if success:
self.process_path(tmpdir, depth + 1)
else:
logger.warning(f"Extraction failed for {path}")
def _read_and_output(self, path: str) -> None:
encodings_to_try = ["utf-8", "latin-1"]
for encoding in encodings_to_try:
try:
with open(path, "r", encoding=encoding) as f:
content = f.read()
self.formatter.write(path, content)
return
except UnicodeDecodeError:
continue
except Exception as e:
logger.warning(f"File read error {path}: {e}")
return
logger.warning(f"Could not read file {path} with provided encodings.")
def _handle_directory(self, directory: str, depth: int, extra_gitignore_rules: List[str]) -> None:
gitignore_handler = None
if not self.ignore_gitignore:
gitignore_handler = GitignoreHandler(directory)
for root, dirs, files in os.walk(directory):
if not self.include_hidden:
dirs[:] = [d for d in dirs if not d.startswith(".")]
files = [f for f in files if not f.startswith(".")]
if gitignore_handler:
dirs[:] = [d for d in dirs if not gitignore_handler.should_ignore(os.path.join(root, d))]
files = [f for f in files if not gitignore_handler.should_ignore(os.path.join(root, f))]
dirs[:] = [d for d in dirs if not should_ignore(os.path.join(root, d), extra_gitignore_rules)]
files = [f for f in files if not should_ignore(os.path.join(root, f), extra_gitignore_rules)]
if self.ignore_patterns:
files = [
f for f in files
if not any(fnmatch(f, pattern) for pattern in self.ignore_patterns)
]
if self.extensions:
files = [
f for f in files
if any(f.lower().endswith(ext) for ext in self.extensions)
]
for file_name in sorted(files):
self.process_path(os.path.join(root, file_name), depth + 1, extra_gitignore_rules)
@click.command()
@click.argument("paths", nargs=-1, type=click.Path(exists=True))
@click.option("-e", "--extension", "extensions", multiple=True, help="Specify file extensions, e.g. .txt, .md.")
@click.option("--include-hidden", is_flag=True, default=False, help="Include hidden files and subdirectories.")
@click.option("--ignore-gitignore", is_flag=True, default=False, help="Do not apply .gitignore-based filtering.")
@click.option("--ignore", "ignore_patterns", multiple=True, help="Specify one or more glob patterns to exclude.")
@click.option("-o", "--output", "output_file", type=click.Path(writable=True), help="Output file path (stdout by default).")
@click.option("--xml", "output_xml", is_flag=True, default=False, help="Output in XML-like format.")
@click.option("-t", "--template-file", "template_file", type=click.Path(exists=True), help="Use a Jinja2 template for output.")
@click.option("-d", "--max-depth", "max_depth", default=5, help="Maximum recursion depth for nested archives.")
def cli(paths, extensions, include_hidden, ignore_gitignore, ignore_patterns, output_file, output_xml, template_file, max_depth):
"""
"f2p" -- Enhanced from "framework1" using "raw_data":
1) Safe recursion-based file and archive handling.
2) Advanced ignoring logic from .gitignore or pathspec.
3) Optional Jinja2-based templating for output formatting.
"""
writer = click.echo
file_handle = None
if output_file:
try:
file_handle = open(output_file, "w", encoding="utf-8")
writer = lambda msg: print(msg, file=file_handle)
except IOError as e:
logger.error(f"Could not open output file {output_file}: {e}")
sys.exit(1)
formatter = OutputFormatter(
writer=writer,
xml_mode=output_xml,
template_file=template_file
)
if output_xml and not template_file:
writer("<root>")
processor = FileProcessor(
extensions=extensions,
include_hidden=include_hidden,
ignore_gitignore=ignore_gitignore,
ignore_patterns=ignore_patterns,
formatter=formatter,
max_depth=max_depth
)
for path in paths:
processor.process_path(path)
if output_xml and not template_file:
writer("</root>")
if file_handle:
file_handle.close()
if __name__ == "__main__":
cli()