Extract information from disorganized data with `ugrep` bash script

Slug: ugrep

  • bash script
  • bash script documentation
  • bash script cheatsheet


    #1. Command-line Options

    | Option | Description | Example |
    |--------|-------------|---------|
    | --locations-file <file> | Path to file listing directories to search (one per line). | --locations-file mydirs.txt |
    | --queries-file <file> | Path to file listing queries (one per line). | --queries-file myqueries.txt |
    | --config-file <file> | Path to shell script to source for overriding any defaults or advanced config. | --config-file myconf.sh |
    | -o <dir>, --output-dir <dir> | Override output root directory (default: created in working dir with a timestamp). | -o ./results001 |
    | -h, --help | Show help/usage and exit. | -h |

    #2. Configuration/Variable Overrides (via --config-file)

    Variables you can override in a config file:

    | Variable | What it controls | Default value | Example override in config file |
    |----------|------------------|---------------|---------------------------------|
    | MASTER_OUTPUT_ROOT | Main output directory for the run | ./batch_search_run_YYYYmmdd_HHMMSS | MASTER_OUTPUT_ROOT="/tmp/mybatch" |
    | MAX_CONCURRENT_JOBS | Number of parallel searches | nproc result, or 7 | MAX_CONCURRENT_JOBS=4 |
    | MAX_RETRIES | Number of retry attempts if a search fails | 1 (no actual retry) | MAX_RETRIES=3 |
    | RETRY_DELAY_SECONDS | Delay between retries (seconds) | 5 | RETRY_DELAY_SECONDS=10 |
    | UGREP_CMD | Command used for search | "ug+" | UGREP_CMD="ugrep" |
    | UGREP_OPTS_BASE | Base array of ugrep options (recursion, case, etc.) | (-r -i -l) | UGREP_OPTS_BASE=(-r -i -w -l) |
    | UGREP_OPTS_ARCHIVES | ugrep options specifically for archive searching | (-z) | UGREP_OPTS_ARCHIVES=() |
    | UGREP_OPTS_INDEX | ugrep options for index usage | () (none) | UGREP_OPTS_INDEX=(--index) |
    | SEARCH_LOCATIONS_DEFAULT | Default location(s) to search (array) | see script default | SEARCH_LOCATIONS_DEFAULT=("/docs/") |
    | SEARCH_QUERIES_DEFAULT | Default query pattern(s) (array) | see script default | SEARCH_QUERIES_DEFAULT=("quanti") |

    #3. File Format for --locations-file and --queries-file

    Each line can provide optional custom ugrep options for that directory/pattern using the syntax:

    /path/to/location1[ ::: optional_ugrep_options_for_location]   (locations file)
    search_query_text[ ::: optional_ugrep_options_for_query]       (queries file)

    Examples (mydirs.txt):

    /mnt/external_hdd/pdf/
    /data/texts/ ::: -r -z

    Examples (myqueries.txt):

    quantifi
    quantifi ::: -i       # Case-insensitive (redundant if -i is in the base options)
    'sociology.*' ::: -E  # Regex pattern; -E forces extended regex (ugrep's default)
    quant ::: -i -G       # POSIX basic regex: match 'quant' as substring/word
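The ` ::: ` separator can be peeled off with plain shell parameter expansion. The following is an illustrative sketch of how such a line might be parsed, not the script's actual internals; the variable names are hypothetical:

```shell
# Split a "target ::: opts" line into its two parts (illustrative only).
line='/data/texts/ ::: -r -z'
case "$line" in
  *" ::: "*)
    target="${line%% ::: *}"   # everything before the separator
    opts="${line#* ::: }"      # everything after the separator
    ;;
  *)
    target="$line"             # no separator: the whole line is the target
    opts=""
    ;;
esac
echo "$target"   # /data/texts/
echo "$opts"     # -r -z
```

The same two-branch split works for both locations and queries files, since both use the identical `target ::: opts` shape.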

    #4. How Search Patterns and “Half Word”/Partial Matches Work

    • The script simply passes the query string to ugrep.
    • For substring: Use a substring in the pattern (e.g., quantifi matches any file with that string).
    • For regex/fuzzy match: Use ugrep’s pattern syntax, wildcards, and regex options directly in the queries file.

    Common ways to match “half-words” or partials:

    | Intent | Queries file entry | Effect |
    |--------|--------------------|--------|
    | Substring search | quantifi | Matches 'quantify', 'quantification' |
    | Regex, suffixes | quantifi.* ::: -E | Matches 'quantification', 'quantifier' |
    | Prefix match | ^quantifi ::: -E | Begins with 'quantifi' |
    | Suffix match | cation$ ::: -E | Ends with 'cation' |
    | Any compound or prefix | quant | Matches 'quantum', 'quantity', 'quantify', etc. |
    | Multiple related roots | 'quant\|qualit' ::: -E | Matches 'quant…', 'qualit…' |
    | Fuzzy/approximate (if ugrep supports) | quantifi ::: --fuzzy=1 | See ugrep manual for fuzzy support* |

    *Note: ugrep (ug+) versions >= 4.0 support experimental fuzzy search (--fuzzy=n), but only if your system’s ugrep is built with fuzzy enabled.


    #5. Dummy Examples

    a. Querying for more matches via "half words" (file: queries.txt)

    quantifi
    desquant
    quant ::: -i            # case-insensitive, catch any word with 'quant'
    "quant.*" ::: -E        # regex: starts with quant...
    quantifi ::: --fuzzy=1  # fuzzy search, if supported by your ugrep build

    b. Using advanced per-directory options (file: locations.txt)

    /mnt/backup/pdf/
    /mnt/mediaarchive/ ::: -z  # Also search archives in this dir

    c. Running the script with all customizations

    ./_script.sh --locations-file locations.txt --queries-file queries.txt --config-file myconfig.sh -o ./outdir

    myconfig.sh (example content):

    MAX_CONCURRENT_JOBS=2
    MAX_RETRIES=2
    UGREP_OPTS_BASE=(-r -i -l)  # Base: recursive, case-insensitive, just list files
    UGREP_OPTS_ARCHIVES=(-z)    # Search archives
    UGREP_CMD="ugrep"           # Use full ugrep, not ug+

    #6. Output Files/Structure

    • Output directory (MASTER_OUTPUT_ROOT, default: batch_search_run_YYYYmmdd_HHMMSS in pwd)
      • search_results/loc_{LOC}_query_{QUERY}_{TASKID}/matches.txt: One per (location, query) task, listing the matching files.
      • search_logs/: Logs for each search task.
      • summary_report.txt: Run summary (matches found, failed tasks, stats)
      • master_orchestrator_log.txt: Master control log for entire batch
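The timestamped default output root can be reproduced with date; a sketch of the likely construction (the exact code inside the script may differ):

```shell
# Build the default output root, e.g. ./batch_search_run_20240131_142530
MASTER_OUTPUT_ROOT="./batch_search_run_$(date +%Y%m%d_%H%M%S)"
echo "$MASTER_OUTPUT_ROOT"
```

Because the name embeds the start time to the second, repeated runs land in distinct directories and never clobber earlier results.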

    #7. Notes

    • You may override any variable by specifying it in the config file and passing it with --config-file.
    • Per-location and per-query custom ugrep options are highly flexible; this is the primary way to control matching mode.
    • To perform “half-word” or maximize matches for similar words:
      • Enter the shortest useful stem (quant, jurim, etc.) as your pattern in queries file.
      • Add regex/fuzzy options as supported by your ugrep build.

    #8. Help Display

    ./_script.sh --help

    Prints all supported command-line options and notes about config file overrides.


    #I. Introduction: Taming Your Digital Research Archive with ugrep

    Researchers often accumulate vast collections of digital files, encompassing PDFs, text documents, Word files, and various other formats. This digital deluge, while a rich source of information, can quickly become disorganized, making the task of locating specific data points or themes for a research paper a significant challenge. The ugrep file pattern searcher emerges as a powerful ally in this context. It is an ultra-fast, user-friendly, and feature-rich tool designed to navigate and extract information from large, mixed-format file collections with remarkable efficiency.[1]

    ugrep distinguishes itself not merely as a replacement for standard grep utilities but as an enhanced toolkit tailored for complex search requirements. Its capabilities extend to searching within various document types (PDF, DOC, DOCX), compressed archives, and binary files, all while offering sophisticated pattern matching through Unicode-aware regular expressions, Boolean queries, and even fuzzy searching.[1] This inherent power makes it an invaluable asset for researchers aiming to systematically mine their digital archives, identify relevant materials, and extract precise information for their scholarly work. The tool’s design, which includes an interactive Text User Interface (TUI) and the ability to handle diverse file encodings, further underscores its utility in academic research, where data sources are often heterogeneous and search needs are nuanced.[1]

    This tutorial provides a comprehensive, step-by-step guide for novice users to harness the capabilities of ugrep, specifically focusing on its application in managing and extracting information from a large, disorganized collection of research files. Assuming ugrep is installed via Docker, this guide will walk through initial setup, core concepts, basic to advanced search techniques, and strategies for streamlining complex research workflows. By the end of this tutorial, users will be equipped to transform their potentially chaotic digital archives into well-interrogated sources of information for their research endeavors.

    #II. Setting Up ugrep with Docker

    For users who have ugrep installed via Docker, interacting with the tool involves prefixing ugrep commands with a Docker execution instruction. This isolates the ugrep environment while allowing it to access files from the host system through volume mounts.

    A. The Basic Docker exec Command Structure

    To run any ugrep command (e.g., ug, ugrep, ug+, ugrep-indexer), the general Docker command structure is:

    docker exec <container_id_or_name> <ugrep_command> [OPTIONS] PATTERN [FILE…]

    Where:

    • <container_id_or_name>: This is the ID or the name assigned to your running ugrep Docker container.
    • <ugrep_command>: This can be ug, ugrep, ug+, ugrep+, or ugrep-indexer.
    • [OPTIONS]: These are the various command-line options ugrep accepts (e.g., -r for recursive, -i for ignore case).
    • PATTERN: The search pattern (e.g., a keyword or regular expression).
    • [FILE…]: These are the paths to the files or directories you want to search, as they appear inside the Docker container.

    B. Accessing Your Research Files: Volume Mounting

    To enable ugrep running inside Docker to search your local research files, you must have mounted your local directory (containing the research files) as a volume when you initially ran the Docker container. For example, if your local research files are in /home/user/my_research_papers and you mounted this directory to /research_files inside the Docker container, then all ugrep commands targeting these files must use the path /research_files.

    Example: If your local research folder /path/to/your/research_files is mounted as /data inside the Docker container named ugrep_container, a command to search for “keyword” recursively within these files would be:

    docker exec ugrep_container ug -r "keyword" /data

    This Docker command prefix effectively acts as a gateway to the ugrep tool. While it adds a layer to the command invocation, it does not alter ugrep’s internal functionality. The core power and versatility of ugrep remain fully accessible, allowing researchers to manage disorganized, mixed-format file collections efficiently even within a containerized environment. For the remainder of this tutorial, ugrep commands will be presented without the docker exec <cid> prefix for brevity. Users should remember to add this prefix and use the appropriate paths as configured in their Docker setup.
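Since every invocation repeats the same docker exec prefix, it can be captured once in a shell variable or function. A minimal sketch, reusing the container name from the example above (adapt the name and mount point to your own setup):

```shell
# Reusable prefix for the container from the earlier example.
CONTAINER="ugrep_container"
PREFIX="docker exec $CONTAINER"

# Compose a full command string. In practice you would define a function,
# e.g. ugd() { docker exec "$CONTAINER" ug "$@"; }, and call ugd directly.
CMD="$PREFIX ug -r keyword /data"
echo "$CMD"   # docker exec ugrep_container ug -r keyword /data
```

A shell function is preferable to a string for real use, because it preserves quoting of arguments that contain spaces.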

    #III. Understanding ugrep Core Concepts

    Before diving into practical search examples, it’s essential to grasp some fundamental concepts of ugrep, including its primary commands, how patterns are specified, and how file arguments are handled.

    A. The ugrep Family of Commands

    ugrep provides a suite of commands, each tailored for slightly different use cases, particularly concerning configuration files and handling specialized document formats.[1]

    • ug: This command is designed for user-friendly, interactive use. A key feature of ug is that it automatically loads an optional .ugrep configuration file. It first looks for this file in the current working directory and then in the user’s home directory. This allows for persistent, preferred options without needing to specify them on every command invocation. The ug command also enables --pretty and --sort by default when output is to a terminal, enhancing readability.[1]
    • ugrep: This is the core command, intended for batch processing and scripting. Unlike ug, ugrep does not load any .ugrep configuration file by default and generally does not set default options like --pretty or --sort (though --color is enabled by default for terminals). This makes its behavior more predictable and suitable for scripts where user-specific configurations might interfere.[1]
    • ug+: This command extends ug. It includes all the functionalities of ug (including loading .ugrep configuration files) and adds the capability to search within PDF files, various document formats (like DOC, DOCX), e-books, and image metadata. This is achieved by utilizing pre-configured filter utilities.[1]
    • ugrep+: Similarly, this command extends ugrep. It provides the same document and metadata searching capabilities as ug+ but, like ugrep, does not load .ugrep configuration files, making it suitable for scripting tasks that require searching these richer file formats.[1]

    The choice between ug and ugrep (and their + counterparts) depends on whether interactive defaults and configuration files are desired (ug/ug+) or if a more pristine, scriptable environment is needed (ugrep/ugrep+). For searching a mixed collection of research files including PDFs and DOCX, ug+ will often be the most convenient starting point for interactive exploration due to its automatic filter application and user-friendly defaults.

    Table 1: Core ugrep Commands and Their Characteristics

    | Command | Configuration file (.ugrep) | Default pretty/sort | PDF/DOCX/etc. search | Primary use case |
    |---------|-----------------------------|---------------------|----------------------|------------------|
    | ug | Yes (loaded automatically) | Yes (for terminal) | No (by default) | Interactive, general use |
    | ugrep | No (not loaded by default) | No (color default) | No (by default) | Scripting, batch jobs |
    | ug+ | Yes (loaded automatically) | Yes (for terminal) | Yes (via filters) | Interactive, mixed formats |
    | ugrep+ | No (not loaded by default) | No (color default) | Yes (via filters) | Scripting, mixed formats |

    B. Search Patterns (PATTERN)

    The PATTERN is what ugrep searches for within files. It can be a simple keyword, a phrase, or a complex regular expression. By default, ugrep treats patterns as POSIX Extended Regular Expressions (EREs).[1] The documentation provides extensive details on regex syntax, including matching Unicode characters, newlines (\n or \R), and various character classes (\d for digit, \s for whitespace, etc.).[1]

    It is crucial to quote patterns containing spaces or special shell characters (like *, ?, (, )) to prevent the shell from interpreting them before ugrep sees them. Single quotes ('PATTERN') are generally safer on Linux/macOS, while double quotes ("PATTERN") are necessary on Windows Command Prompt.[1]
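The difference is easy to demonstrate without ugrep at all, since quoting is handled by the shell before any program runs. In this sketch, grep -E stands in for ug purely so it runs anywhere; both tools receive the pattern the same way:

```shell
# Quoted: the shell passes the alternation pattern through untouched.
printf 'cat\ndog\nbird\n' | grep -E 'cat|dog'
# prints:
# cat
# dog

# Unquoted, cat|dog would instead be parsed by the shell as a pipeline
# (the output of a command "cat" piped into a command named "dog"),
# and the pattern would never reach grep or ugrep at all.
```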

    C. File and Directory Arguments (FILE…)

    These arguments specify where ugrep should look for the pattern.

    • If FILE arguments are provided, ugrep searches those specific files or directories.
    • If a DIR is specified, ugrep searches files directly within that directory but does not recurse into subdirectories by default (it behaves like ls DIR).[1] Recursive searching requires options like -r, -R, or a depth specifier (e.g., -3).
    • If no FILE arguments are given and standard input is not a terminal (e.g., piped input), ugrep reads from standard input.[1]
    • If no FILE arguments are given and standard input is a terminal, ugrep defaults to a recursive search of the current working directory.[1]

    Understanding these core components is the first step towards effectively using ugrep to manage and query your research files.

    #IV. Basic Searching: Finding Your Way

    With the core concepts in mind, let’s explore basic search operations. These form the foundation for more complex queries.

    A. Searching for a Simple Keyword

    The most straightforward use of ugrep is to search for a literal string (a keyword or phrase) in one or more files.

    • In a single file: ug "your keyword" path/to/your/file.txt searches for "your keyword" within file.txt.
    • In multiple files: ug "your keyword" file1.txt report.pdf notes.docx searches for the keyword in all listed files. If using ug+ or ugrep+ (or ug/ugrep with appropriate --filter options), PDF and DOCX files are processed accordingly.
    • Recursive search when no files are specified: If you are in your main research directory and type ug "specific concept", the ug command recursively searches all files in the current directory and its subdirectories for "specific concept".[1]

    B. Recursive Searching in a Directory

    For disorganized collections spread across many subfolders, recursive searching is indispensable.

    • Using ug PATTERN DIR (non-recursive by default for specified directories): As mentioned, if you explicitly provide a directory path, ugrep searches files directly within that directory, not its subdirectories.[1] ug "keyword" /path/to/research_folder searches for "keyword" only in files immediately inside research_folder.
    • The -r option (recursive, follows symlinks on command line): To search a directory and its subdirectories, use the -r option. It follows symbolic links if they are specified on the command line but not otherwise during recursion.[1] ug -r "keyword" /path/to/research_folder
    • The -R option (recursive, follows all symlinks): The -R option also searches recursively but follows all symbolic links it encounters, both to files and to directories.[1] This can be useful but might lead to searching outside the intended scope or getting into symlink loops if not careful. ug -R "keyword" /path/to/research_folder
    • The -S option (recursive, follows symlinks to files only): When used with -r, -S makes ugrep follow symbolic links to files but not to directories.[1] ug -rS "keyword" /path/to/research_folder

    Differences between -r and -R: The primary difference lies in how they handle symbolic links during recursion [1]:

    • -r: Follows symbolic links only if they are explicitly listed as command-line arguments. When traversing directories found during recursion, it does not follow symbolic links to other directories or files.
    • -R: Follows all symbolic links encountered, whether to files or directories. This is more expansive.

    For most research file collections, -r is often a safer and more predictable choice to avoid unintentionally searching linked system directories or other unrelated areas.

    • Controlling recursion depth (--depth or -1, -2, etc.): You can limit how many levels deep ugrep searches using options like -1 (current directory only, no subdirectories), -2 (current directory and one level of subdirectories), or --depth=MAX / --depth=MIN,MAX.[1] ug -2 "keyword" /path/to/research_folder searches research_folder and its immediate children; ug -3 -g"foo*.txt" "keyword" /path/to/research_folder searches up to 3 levels deep for foo*.txt files.[1]

    These basic commands, especially recursive search, are the first line of attack for navigating a large and potentially disorganized set of research files.

    #V. Targeting Specific Research File Formats

    A significant challenge in research is dealing with mixed file formats. ugrep offers robust mechanisms to search within common research file types like PDF, TXT, DOC, and DOCX. This is achieved through the ug+/ugrep+ commands, the --filter option, or by specifying file types/extensions directly.[1]

    A. Searching PDFs, DOCs, DOCXs, and other Rich Formats

    Plain text files (.txt) are searched by ugrep natively. For formats like PDF, DOC, and DOCX, ugrep relies on external filter utilities to convert their content to searchable text.

    • Using ug+ or ugrep+: These commands are the simplest way to search rich document formats. They come pre-configured to use common filter utilities (if installed on the system or within the Docker container) for PDFs, DOC(X) files, e-books, and image metadata.[1] ug+ -r "critical analysis" /path/to/research_papers attempts to search for "critical analysis" in all files, including PDFs and DOCX files, within the specified path by invoking the appropriate filters.
    • Using the --filter option: For more control, or if ug+ doesn't pick up a specific filter, you can define filters explicitly using the --filter option. The syntax is --filter="ext1,ext2:command % [args]", where the extensions name the file types, command is the filter utility, and % is replaced by the file path. The output of the command is then searched by ugrep.[1]
      • PDF: Requires a utility like pdftotext. ug -r --filter="pdf:pdftotext % -" "main hypothesis" /path/to/pdfs (the - after pdftotext % directs its output to standard output for ugrep to read).[1]
      • DOC (legacy Word format): Often uses antiword. ug -r --filter="doc:antiword %" "historical data" /path/to/docs [1]
      • DOCX (modern Word format), ODT, EPUB, RTF: pandoc is a versatile tool. ug -r --filter="docx,odt:pandoc -t plain % -o -" "methodology section" /path/to/modern_docs (the -o - directs pandoc output to standard output).[1]
      • Multiple filters: You can specify multiple filters by separating them with commas within the same --filter option, or by using multiple --filter options. ug -r --filter="pdf:pdftotext % -,doc:antiword %,docx:pandoc -t plain % -o -" "conclusion" /path/to/all_docs [1]

    It’s important that the filter utilities (pdftotext, antiword, pandoc, etc.) are installed and accessible within the Docker container’s environment for these options to work.
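Before relying on filters, it is worth verifying that the converters actually exist on the PATH; a small check to run inside the container (for example via docker exec):

```shell
# Check that common filter utilities are available on PATH.
for tool in pdftotext antiword pandoc; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING (install it for the corresponding --filter to work)"
  fi
done
```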

    B. Filtering by File Type (-t)

    The -t TYPES option allows searching only files associated with predefined TYPES. ugrep maintains a list of types and their corresponding extensions and sometimes “magic bytes” (file signatures).[1]

    • ug -tlist: Displays all available file types.
    • For text files (.txt, .md, etc.): ug -r -ttext "research notes" /path/to/files [1]
    • For PDF files: ug -r -tpdf "statistical analysis" /path/to/files [1] Using Pdf (capitalized) also checks file signature magic bytes.[1]
    • For DOC/DOCX: The documentation does not list doc or docx as direct file types for -t. For these, ug+ or explicit --filter options are the primary methods for content searching.[1] However, to merely select files named *.doc (for example, just to list them) without filtering their content through a converter, use -O or -g.

    C. Filtering by File Extension (-O)

    The -O EXTENSIONS option is a shorthand to include files based on their extensions. It is equivalent to -g"*.ext1,*.ext2".[1]

    • ug -r -Opdf,txt,docx "keyword" /path/to/research_files selects files ending in .pdf, .txt, or .docx for searching. With plain ug, PDF and DOCX content is not converted, so combine -O with ug+ or --filter to search their text; with ug+, -Opdf,docx ensures only those file types are passed to their respective filters. [1]

    D. Filtering by Glob Patterns (-g)

    The -g GLOBS option provides powerful filename and path matching using gitignore-style glob patterns. This is highly useful for precisely targeting files in a disorganized collection.[1] Remember to quote glob patterns.

    • ug -r -g"*.pdf,*.txt,*.doc,*.docx" "specific_term" /path/to/research_files [1]
    • To search only in a papers_2023 subdirectory for PDFs: ug+ -r -g"papers_2023/*.pdf" "new findings" /path/to/archive
    • To exclude all files in drafts directories: ug+ -r -g"^drafts/" "final version" /path/to/projects

    Table 2: Key ugrep Options for File Type Filtering in Research

    | Option | How it works | Example for research files | Notes |
    |--------|--------------|----------------------------|-------|
    | ug+ / ugrep+ | Automatically uses filters for PDF, DOC(X), etc. | ug+ -r "literature review" /data/research_archive | Simplest for mixed formats; relies on installed filter utilities. |
    | --filter | Explicitly defines filter commands for specific extensions. | ug -r --filter="pdf:pdftotext % -" "theory" /data/pdfs | Provides fine-grained control over conversion. |
    | -t TYPE | Searches files matching predefined types (e.g., text, pdf, Pdf). | ug -r -ttext,Pdf "methodology" /data/articles | Pdf (capitalized) also checks magic bytes. DOC/DOCX content search needs ug+ or --filter. |
    | -O EXT | Shorthand to search files with specific extensions (e.g., pdf, txt, docx). | ug+ -r -Opdf,docx,txt "data analysis" /data/project_xyz | Convenient for common extensions. Combine with ug+ or --filter for PDF/DOCX content. |
    | -g GLOB | Uses gitignore-style globs to match file/directory names or paths. | ug+ -r -g"chapter_*.docx,summary.pdf" "key results" /data/thesis_files | Most flexible for complex naming schemes or directory structures. Quote globs. |

    By combining these options, a researcher can effectively navigate a disorganized collection, ensuring that ugrep only processes and searches the intended file formats and locations, making the information retrieval process more targeted and efficient. The ability to define custom filters or rely on ug+ for common research document types is a significant advantage when dealing with varied file formats.

    #VI. Constructing Powerful Search Patterns

    ugrep’s true power comes from its sophisticated pattern matching capabilities. Understanding how to construct effective patterns is key to extracting precise information.

    A. Default: Extended Regular Expressions (ERE)

    By default, ugrep interprets search patterns as POSIX Extended Regular Expressions (EREs). This is the same as using the -E option.[1] EREs offer a rich syntax for pattern matching:

    • .: Matches any single character (except newline, unless in dotall mode).
    • *: Matches the preceding item zero or more times.
    • +: Matches the preceding item one or more times.
    • ?: Matches the preceding item zero or one time.
    • {n}, {n,}, {n,m}: Specify exact, minimum, or range for repetitions.
    • |: Acts as an OR operator (e.g., cat|dog matches "cat" or "dog").
    • (…): Groups expressions.
    • […]: Defines a character set (e.g., [abc] matches ‘a’, ‘b’, or ‘c’; [0-9] matches any digit).
    • [^…]: Defines a negated character set (e.g., [^0-9] matches any non-digit).
    • ^: Anchors the match to the beginning of a line.
    • $: Anchors the match to the end of a line.
    • \n: Matches a newline character, allowing for multi-line patterns.[1]
    • \R: Matches any Unicode line break.[1]
    • Unicode properties: \p{Class} (e.g., \p{L} for any letter, \p{Nd} for decimal digit).[1]

    Example (ERE): Search for lines starting with "Chapter" followed by a number, then a colon. ug -r "^Chapter\s[0-9]+:" /path/to/manuscripts (here, \s matches a whitespace character and [0-9]+ matches one or more digits).

    The documentation provides a detailed list of ERE syntax elements and Unicode character classes.[1] For researchers, this means patterns can be crafted to find very specific textual structures, numerical data, or sequences spanning multiple lines.
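ERE patterns like the Chapter example can be exercised with any ERE engine. Here grep -E stands in for ug so the sketch runs anywhere; note that portable grep lacks the \s shorthand, so the POSIX class [[:space:]] is used instead (ugrep accepts \s directly):

```shell
# Match "Chapter <number>:" at the start of a line, ERE style.
printf 'Chapter 1: Intro\nAppendix A\nChapter 12: Methods\n' \
  | grep -E '^Chapter[[:space:]][0-9]+:'
# prints:
# Chapter 1: Intro
# Chapter 12: Methods
```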

    B. Perl-Compatible Regular Expressions (-P)

    For even more advanced regex capabilities, ugrep supports Perl-Compatible Regular Expressions (PCRE) via the -P option. PCRE includes features like:

    • Lookaheads: (?=…), (?!…)
    • Lookbehinds: (?<=…), (?<!…)
    • Named capture groups: (?<name>…)
    • Backreferences in patterns (though primarily used with --format or --replace for output).

    Example (PCRE): Find occurrences of "Dr. Smith" but only if not preceded by "Professor". ug -r -P "(?<!Professor\s)Dr\.\sSmith" /path/to/articles

    PCRE can be particularly useful for extracting structured data where context before or after the match is important for qualification, or when named captures simplify data extraction with --format. The documentation indicates that -P uses the PCRE2 library.[1]

    C. Fixed String (Literal) Search (-F)

    If you need to search for a string exactly as it is, without any characters being interpreted as regex metacharacters, use the -F (or --fixed-strings) option. This is like fgrep. ugrep will treat the pattern as a set of fixed strings separated by newlines (if multiple are given, e.g., from a file with -f).[1]

    Example (fixed string): Search for the literal string "Project*" (where * is part of the name, not a wildcard). ug -r -F "Project*" /path/to/project_files

    This is useful for searching code, configuration files, or specific phrases where special characters should be treated literally.

    D. Word Search (-w)

    The -w (or --word-regexp) option constrains the pattern to match only whole words. A “word” is typically a sequence of alphanumeric characters and underscores, bounded by non-word characters (like spaces, punctuation, or line boundaries).[1]

    Example (word search): Search for the word "cell" but not "cellular" or "excellent". ug -r -w "cell" /path/to/biology_notes

    This is extremely useful in research to avoid partial matches that can clutter results (e.g., searching for “gene” and not matching “general” or “generate”). ugrep defines word-like characters as Unicode letters, digits, and connector punctuations.[1]

    Table 3: Comparison of Key Pattern Matching Modes

    | Option | Mode name | Interpretation of data.* | Use case for research |
    |--------|-----------|--------------------------|-----------------------|
    | (none) | Extended regex (ERE, default) | Matches "data" followed by any character (except newline) zero or more times. | Flexible pattern matching, standard for many text-processing tasks. |
    | -P | Perl-compatible regex (PCRE) | Same as ERE, but enables advanced features like lookarounds. | Complex contextual searches, extracting structured data with named captures. |
    | -F | Fixed strings (literal) | Matches the literal string "data.*". | Searching for exact phrases or terms containing special characters that should be literal. |
    | -w | Word regex | Matches data.* as a regex, but the whole match must form a complete word or words. | Finding specific terms without matching superstrings (e.g., "analysis" not "analytical"). |

    When constructing patterns, especially complex regular expressions, it’s often beneficial to start simple and test incrementally. Quoting patterns appropriately is also vital to ensure the shell doesn’t interfere with the special characters intended for ugrep.

    #VII. Refining Searches: Context, Details, and Boolean Logic

    Once you can target files and construct basic patterns, the next step is to refine your searches to get more relevant results and extract the precise information needed for your research paper. This involves using Boolean queries to combine criteria and controlling how matches and their surrounding context are displayed.

    A. Boolean Queries: Combining Search Criteria

    ugrep offers powerful Boolean query capabilities, allowing you to combine multiple patterns using AND, OR, and NOT logic. This is invaluable for pinpointing documents or lines that meet complex criteria.[1]

    • Using -% (Line-Level Boolean) and -%% (File-Level Boolean): The -% option enables Boolean logic where conditions apply to individual lines. The -%% option (equivalent to --bool --files) applies the Boolean logic to entire files: a file matches if all conditions are met by patterns found anywhere within that file.[1]
      Syntax for -% and -%% patterns:
      • pattern1 pattern2: Implies AND (e.g., ‘methodology results’ finds lines/files with both).
      • pattern1|pattern2: Implies OR (e.g., 'qualitative|quantitative' finds lines/files with either).
      • -pattern: Implies NOT (e.g., experiment -control finds lines/files with “experiment” but not “control”).
      • "literal phrase": Matches the phrase exactly, ignoring regex interpretation within the quotes.
      • (group): Parentheses for grouping complex expressions.
      • Operators AND, OR, NOT can also be used explicitly if spaced correctly. NOT has the highest precedence, then OR, then AND (when operators are mixed with implicit ANDs via spaces, space-as-AND has lowest precedence).[1]

      Examples for Research:

      1. Find research papers (PDFs) that mention "machine learning" AND "healthcare" but NOT "review": ug+ -r -%% -Opdf --filter="pdf:pdftotext % -" "'machine learning' healthcare -review" /path/to/papers. This file-level search (-%%) helps identify relevant documents for a literature review.
      2. Find lines in your notes (.txt files) that contain "hypothesis" OR "assumption" AND also "validated": ug -r -% -Otxt "(hypothesis|assumption) validated" /path/to/notes. This line-level search (-%) helps find specific statements.
    • Using --and, --not, --andnot with -e: These options provide an alternative way to build Boolean queries, often used when patterns are specified with multiple -e flags.[1]
      • -e PAT1 --and -e PAT2: Matches if both PAT1 and PAT2 are found.
      • -e PAT1 --not -e PAT2: Matches if PAT1 is found OR PAT2 is NOT found. (For “PAT1 AND NOT PAT2”, use --andnot).
      • -e PAT1 --andnot -e PAT2: Matches if PAT1 is found AND PAT2 is NOT found.

    Example for Research: Find lines discussing “ethical considerations” (-e "ethical considerations") AND specifically related to “AI” (--and -e "AI") but NOT “children” (--andnot -e "children"): ug+ -r -% -Opdf,txt --filter="pdf:pdftotext % -" -e "ethical considerations" --and -e "AI" --andnot -e "children" /path/to/ethics_docs

    Table 4: Common Boolean Query Operators for -% and -%%

    Operator / Syntax Meaning Example for Research
    p1 p2 p1 AND p2 'climate change' impact (finds both terms)
    p1|p2 p1 OR p2 qualitative|quantitative (finds either term)
    -p1 NOT p1 model -simulation (finds “model” but not “simulation”)
    "literal phrase" Match the exact phrase "statistical significance"
    (p1|p2) p3 (p1 OR p2) AND p3 (survey|interview) adolescent (finds either method term together with “adolescent”)

    Boolean searches dramatically improve the precision of information retrieval from large and varied research datasets, allowing researchers to quickly sift through material to find the most relevant information based on multiple intersecting or excluding criteria.
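As a minimal sketch of how these Boolean queries compose in practice, the hypothetical helper below wraps a file-level (-%%) search; the helper name and /path/to/papers are illustrative, and the command is printed rather than executed when ug+ is not installed:

```bash
#!/bin/bash
# Hypothetical helper: run a file-level Boolean search over PDFs.
# Building the command as an array keeps the quoting of the Boolean
# pattern intact all the way to ugrep.
bool_find_files() {
  local pattern="$1"; shift
  local cmd=(ug+ -r -%% -Opdf --filter="pdf:pdftotext % -" "$pattern" "$@")
  if command -v ug+ >/dev/null 2>&1; then
    "${cmd[@]}"
  else
    # ug+ unavailable: show the exact command that would have run.
    printf '%s\n' "${cmd[*]}"
  fi
}

# "machine learning" AND healthcare AND NOT review, file-level:
bool_find_files "'machine learning' healthcare -review" /path/to/papers
```

Passing the pattern as a single quoted argument is the crucial detail: the Boolean operators must reach ugrep intact, not be split by the shell.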

    B. Displaying Match Context

    Understanding the context of a match is crucial. ugrep provides options to show lines before, after, or around your match.[1]

    • -A NUM or --after-context=NUM: Shows NUM lines of context after the matching line. ug -A3 “critical finding” report.txt
    • -B NUM or --before-context=NUM: Shows NUM lines of context before the matching line. ug -B2 “conclusion drawn” thesis.docx (use ug+ or --filter for docx)
    • -C NUM or --context=NUM: Shows NUM lines of context before AND after the matching line. This is often the most useful. ug -C2 “experimental setup” lab_notes.txt
    • -y or --any-line (or --passthru): Prints all lines, highlighting matches and showing non-matching lines as context (typically prefixed with a -).[1] ug -y “keyword” long_document.pdf (use ug+ or --filter for pdf)

    When combined with -o (only matching), context options like -oC20 will try to fit the match and 20 characters of context before/after on a single line, which is useful for very long lines.[1]

    C. Displaying Specific Match Details

    For precise referencing or data extraction, knowing the exact location of a match is important.[1]

    • -n or --line-number: Prepends each output line with its line number in the file. ug -n “definition” glossary.txt
    • -k or --column-number: Displays the starting column number of the match. Tab characters are expanded (default tab size 8, configurable with --tabs=NUM).[1] ug -nk “specific_variable_name” code.py
    • -b or --byte-offset: Shows the byte offset of the start of the matching line (or the match itself if -u is used). ug -b “unique_identifier” data_log.bin
    • -o or --only-matching: Prints only the exact matching part of the text, not the entire line. ug -o “ISBN\s[0-9X-]+” bibliography.txt (extracts just ISBNs)
    • -H or --with-filename: Always prints the filename for each match. This is default when searching multiple files.
    • -h or --no-filename: Never prints filenames. Default when searching a single file or stdin.

    Combining these options, for instance ug -nHk -C1 “keyword” file.txt, provides a rich output showing the filename, line number, column number, the match itself, and one line of surrounding context. This level of detail is extremely helpful when reviewing search results for a research paper, allowing for quick verification and accurate citation.
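Because ugrep shares -n, -H, and -o with GNU grep, the behavior of these detail flags can be sketched with plain grep on a throwaway file (the sample text below is invented for illustration):

```bash
#!/bin/bash
# Create a small sample file, then ask for filename, line number, and
# only the matching part -- the same flags ugrep accepts.
tmp=$(mktemp)
printf 'intro text\nISBN 978-0-13-468599-1 cited here\nclosing text\n' > "$tmp"

# Prints <file>:2:ISBN 978-0-13-468599-1
grep -nH -o 'ISBN [0-9X-]*' "$tmp"

rm -f "$tmp"
```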

    #VIII. Advanced Techniques for Research Data Extraction

    Beyond refining searches, ugrep offers advanced features that can transform it into a sophisticated data extraction tool, particularly useful for researchers needing to pull specific, structured information from their text-based datasets.

    A. Interactive Searching with the Text User Interface (-Q)

    For exploratory searching or when you’re unsure of the exact patterns, ugrep’s interactive Text User Interface (TUI) is a powerful feature. Activate it with the -Q option.[1]

    • Usage: ug -Q. To start with an initial pattern, use -e: ug -Q -e "initial term"
    • Features:
      • Live Search: Results update as you type your pattern.
      • Option Toggling: Use ALT-key combinations (e.g., ALT-L for -l to list files, ALT-N for -n to show line numbers) to dynamically change search options. On macOS, this might be OPTION-key. If ALT doesn’t work, CTRL-O followed by the key can be used.[1]
      • Navigation: Use Tab and Shift-Tab to navigate into directories or select files for searching, effectively changing the scope of your search on the fly.
      • File Viewing/Editing: Press CTRL-Y or F2 to open the currently highlighted file in a pager or editor (configurable with --view=COMMAND or defaults to PAGER/EDITOR environment variables).
      • Context Control: ALT-] increases context.
      • Help: F1 or CTRL-Z displays a help screen with active options.
      • Glob Editing: ALT-G opens an editor for file/directory glob patterns.
      • Split Screen: CTRL-T or F5 toggles a split-screen file viewer.
      • Bookmarks: CTRL-X (F3) sets a bookmark, CTRL-R (F4) restores it.
      • Output Selection: ENTER switches to selection mode, allowing you to choose specific lines to output when exiting the TUI.

    The TUI is excellent for iteratively refining search queries, exploring file contents, and quickly assessing the relevance of matches within a large, unfamiliar dataset. For a researcher, this can significantly speed up the initial phases of literature review or data exploration.

    B. Custom Output Formats for Data Extraction (--format, --csv, --json, --xml)

    This is where ugrep truly shines for research data extraction. You can precisely control the output format, making it easy to create structured data from your search results.[1]

    • Predefined Formats:
      • --csv: Outputs matches in Comma-Separated Values format. ug -r -Hnk --csv "keyword" /path/to/data > results.csv
      • --json: Outputs matches in JSON format. ug -r -n --json "pattern" /path/to/logs > logs.json
      • --xml: Outputs matches in XML format. ug -r -nk --xml "term" /path/to/articles > articles.xml. These formats are invaluable for feeding data into spreadsheets, databases, or analysis scripts (e.g., in Python or R).
    • Custom Formatting with --format=FORMAT_STRING: The FORMAT_STRING uses %-prefixed fields to specify what information to include and how. This offers immense flexibility.[1]
      Table 5: Useful %-fields for --format in Research Data Extraction
    Field Description Example Use Case for Data Extraction
    %f Pathname of the matching file. Tracking the source document for each extracted piece of data.
    %n Line number of the match. Pinpointing the exact location of information for citation or verification.
    %k Column number of the match. Further precision in locating data, especially in structured text or code.
    %b Byte offset of the match. Useful for binary data or when character-based line/column numbers are ambiguous.
    %O The entire matching line (raw string of bytes). Extracting full sentences or paragraphs containing a keyword.
    %o Only the matching part of the text (raw string of bytes). Extracting specific terms, codes, or values (e.g., “ISBN: XXXX”, extract just “XXXX”).
    %~ A newline character. Ensuring each formatted output record is on a new line.
    %1, %2… Regex group capture (requires -P). Extracting specific components from a complex pattern (e.g., author and year from “Author (Year)”).
    %[NAME]# Named regex group capture (requires -P and (?<NAME>…)). Similar to numbered captures but with more readable names for extracted components.
    %z Pathname in an archive (when searching with -z). Identifying the source file within a ZIP or TAR archive.
    %Z Edit distance cost (for fuzzy search with -Z). Quantifying the similarity of a fuzzy match, useful for filtering or ranking results.
    %$ Set a custom field separator (e.g., %[;]$ for semicolon-separated values). Creating custom delimited files if CSV’s comma is problematic.

    **Example: Extracting Author and Year from Bibliographic Entries**
    Suppose you have text files with lines like: “Smith, J. (2023). Title of work…”
    You can extract the author and year into a custom format:
    `ug -r -P -Otxt --format="File: %f, Line: %n, Author: %1, Year: %2%~" "([A-Za-z\s,.\-]+)\s*\((\d{4})\)" /path/to/bibliographies`
    Here, `-P` enables Perl regex. `([A-Za-z\s,.\-]+)` is capture group `%1` (author) and `(\d{4})` is capture group `%2` (year).

    The ability to generate structured output directly from text searches is a significant boon for researchers. It allows `ugrep` to serve as a powerful pre-processing tool, transforming raw textual data from diverse sources into a normalized, analyzable format. This can feed directly into citation management software, databases for meta-analysis, or quantitative analysis tools, streamlining the research workflow and reducing manual data entry errors. For instance, extracting all reported p-values or effect sizes matching a certain pattern across a corpus of papers can be automated, creating a dataset for statistical review. Similarly, compiling a list of all mentions of specific genes or proteins, along with their source document and line number, becomes a trivial task.
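The p-value example above can be prototyped even without ugrep on hand; the sed line below emulates what a capture-group extraction with -P and --format would yield (the sample sentence and temporary file are invented, and the commented ug command shows the ugrep equivalent):

```bash
#!/bin/bash
# Sample corpus line, invented for illustration.
tmp=$(mktemp)
printf 'The effect was reliable, p = 0.03, across cohorts.\nNo statistics on this line.\n' > "$tmp"

# ugrep equivalent (requires -P for the capture group), for reference:
#   ug -P --format='%f:%n: %1%~' 'p\s*=\s*(0\.\d+)' "$tmp"
# sed stand-in: print just the captured p-value.  Prints 0.03
sed -n 's/.*p = \(0\.[0-9][0-9]*\).*/\1/p' "$tmp"

rm -f "$tmp"
```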

    #IX. Streamlining Your ugrep Workflow

    For researchers who frequently perform similar types of searches or work with very large datasets, ugrep provides features to save time and improve performance: configuration files and indexing.

    A. Saving Time with Configuration Files (.ugrep and ug --save-config)

    Constantly retyping common search options can be tedious and error-prone. ugrep addresses this through configuration files.[1]

    • The .ugrep File:
      The ug command (distinct from ugrep) automatically looks for a file named .ugrep first in the current working directory, and if not found, then in your home directory. This file can store default options.
      The format is simple: one long-option-name=value per line (e.g., recursive=true or file-type=pdf,txt). Comments start with #.
    • Creating and Using Configuration Files:
      You can create/edit .ugrep manually, or use the ug --save-config command.
      ug --save-config [OPTIONS_TO_SAVE]
      This command saves the specified OPTIONS_TO_SAVE (and any currently active relevant options from a loaded config) into a new .ugrep file in the current working directory. If you execute this in your home directory, it creates a global default configuration for ug. If done in a specific project directory, it creates a project-specific configuration.
      Example for a Research Project:
      Suppose for a particular project, you always want to search recursively (-r), target PDF and DOCX files (using ug+’s implicit filters or explicit ones), and see 2 lines of context (-C2).
      1. Navigate to your project directory: cd /path/to/my_project_A
      2. Save these preferences:
        ug --save-config -r -Opdf,docx --filter="pdf:pdftotext % -" --filter="docx:pandoc -t plain % -o -" -C2
        (Note: ug+ implicitly handles filters, so if using ug+, the --filter parts might be redundant in the save command if you intend to always use ug+. If you save filters and use plain ug, it will apply them.)
      3. Now, whenever you are in /path/to/my_project_A and run ug “keyword”, these saved options will be automatically applied.

    This personalization of ugrep is a significant time-saver. It allows researchers to tailor the tool to their specific habits and the requirements of different research projects, reducing the cognitive overhead of remembering and typing numerous options for common search tasks. It effectively creates a customized search environment.
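For reference, a hand-written .ugrep for such a project might look like the fragment below (the option values are illustrative; each line is a long option name, as described above):

```
# .ugrep -- project defaults, loaded automatically by ug
# Comments start with '#'.
recursive=true
file-type=pdf,txt
context=2
```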

    B. Speeding Up Searches in Large Collections: Indexing

    For truly massive and relatively static collections of research files, especially if stored on slower media or not frequently accessed (a “cold” file system), ugrep’s indexing feature can offer a performance boost.[1]

    • ugrep-indexer: This command is used to create and manage indexes.
      ugrep-indexer [OPTIONS] [PATH]
      • Example: To index a large archive of research papers, including contents of zip/tar archives and ignoring binary files:
        ugrep-indexer -Iz -v /path/to/massive_research_archive
        (-I ignores binary files during indexing, -z indexes archives, -v is verbose).[1]
      • Indexes are stored as hidden files within the directory structure.
      • Re-indexing is incremental and faster than the initial indexing.
    • ug --index: This command tells ugrep to use the pre-built indexes for searching.
      ug --index PATTERN [PATH…]
      • Example: Searching the indexed archive:
        ug --index “rare specific term” /path/to/massive_research_archive
      • ugrep will first consult the index to quickly identify files that might contain the pattern, and then search only those candidate files. It will also search any new or modified files not yet covered by the index timestamp, ensuring results are always current.[1]
    • Important Limitations:
      The --index option is not compatible with certain other powerful ugrep options, notably -P (Perl regex), -Z (fuzzy search), -v (invert match), and crucially for mixed-format research, --filter.[1]
      This means that while indexing can speed up the process of finding which PDF or DOCX files might contain your search terms (if their text content was somewhat indexed, e.g., via -z during indexing for archives), the actual step of using pdftotext or pandoc via --filter on those candidate files will not be accelerated by the index for that specific content extraction phase. The main benefit for filtered files might be a faster initial selection of candidate files from the broader collection, especially if the collection is vast and on slow storage.

    Indexing is a strategic choice. For very large, stable datasets where search speed is paramount and the incompatible options are not always needed for initial discovery, it can be beneficial. However, for dynamic datasets or when advanced regex, fuzzy search, or filtering are central to every query, the overhead of indexing might not always provide a net benefit over ugrep’s already impressive default speed.
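Since --index quietly conflicts with several flags, a small guard in a wrapper script can catch the incompatible combinations listed above before a search is launched. The index_search helper below is a hypothetical sketch, not part of ugrep itself:

```bash
#!/bin/bash
# Refuse option combinations the documentation flags as incompatible
# with --index (-P, -Z, -v, --filter), then delegate to ug.
index_search() {
  local opt
  for opt in "$@"; do
    case "$opt" in
      -P|-Z|-v|--filter|--filter=*)
        echo "refusing: '$opt' is incompatible with --index" >&2
        return 2 ;;
    esac
  done
  ug --index "$@"
}

# Example (fails fast instead of producing confusing results):
#   index_search -P 'Cohen.s d' /path/to/archive
```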

    #X. Putting It All Together: A Sample Research Workflow Scenario

    To illustrate how these ugrep features can be combined in a practical research context, let’s consider a hypothetical scenario. A researcher is investigating the “impact of social media on adolescent mental health” and has a large, disorganized folder named /research_data containing PDFs, DOCX files, and TXT notes. All commands will assume the Docker prefix docker exec <cid> and that /research_data inside the container maps to the researcher’s local folder.

    Scenario: Literature review on “the impact of social media on adolescent mental health.”

    Step 1: Initial Broad Search for Relevant Documents (File-level Boolean)

    • Goal: Identify all documents that mention “social media” AND (“mental health” OR “well-being”) AND (“adolescent” OR “teenager”).
    • Command:
      docker exec <cid> ug+ -r -%% -Opdf,docx,txt --filter="pdf:pdftotext % -" --filter="docx:pandoc -t plain % -o -" "'social media' ('mental health'|'well-being') (adolescent|teenager)" /research_data > /research_data/relevant_papers_list.txt
    • Explanation:
      • ug+: Used because we’re searching PDFs and DOCX alongside TXT, and ug+ handles filters for these types.[1]
      • -r: Recursive search through /research_data.
      • -%%: File-level Boolean search. The document matches if all parts of the Boolean query are found anywhere within it.[1]
      • -Opdf,docx,txt: Restricts the search to files with these extensions.[1]
      • --filter="pdf:pdftotext % -" and --filter="docx:pandoc -t plain % -o -": Explicitly define filters for PDF and DOCX to text conversion.[1]
      • "'social media' ('mental health'|'well-being') (adolescent|teenager)": The Boolean query. Quotes ensure phrases are treated as units; | means OR and spacing means AND.
      • /research_data: The path inside the Docker container.
      • > /research_data/relevant_papers_list.txt: The list of matching file paths is saved for the next step. (Note the redirection is performed by the host shell, so this path must also exist on the host — here it is the local folder mounted at /research_data.)

    Step 2: Narrowing Down - Finding Specific Methodologies (File-level Boolean within results)

    • Goal: From the relevant_papers_list.txt, find papers that also discuss “longitudinal study” OR “survey data” but NOT “cross-sectional”.
    • Command:
      docker exec <cid> ug+ --from=/research_data/relevant_papers_list.txt -l -%% -Opdf,docx,txt --filter="pdf:pdftotext % -" --filter="docx:pandoc -t plain % -o -" "('longitudinal study'|'survey data') -'cross-sectional'" > /research_data/methodological_papers_list.txt
    • Explanation:
      • --from=/research_data/relevant_papers_list.txt: Tells ugrep to search only the files listed in this input file.[1]
      • -l: Lists only the names of files that match this new, more specific Boolean query.[1]
      • The rest of the options are similar to Step 1, applying a new file-level Boolean search.

    Step 3: Extracting Key Sentences with Context (Line-level search, context)

    • Goal: From the methodological_papers_list.txt, extract actual sentences mentioning “key finding” or “significant result”, along with some surrounding context.
    • Command:
      docker exec <cid> ug+ --from=/research_data/methodological_papers_list.txt -n -C2 -Opdf,docx,txt --filter="pdf:pdftotext % -" --filter="docx:pandoc -t plain % -o -" "('key finding'|'significant result')" > /research_data/extracted_findings_with_context.txt
    • Explanation:
      • -n: Include line numbers for easy reference.[1]
      • -C2: Provide 2 lines of context before and after each matching line.[1]
      • This is now a line-level search (default, or could use -%) to find the specific phrases.

    Step 4: Extracting Specific Data Points (Format, Regex Captures)

    • Goal: Suppose some papers in methodological_papers_list.txt report effect sizes like “Cohen’s d = 0.XX” or “r = .YY”. Extract these values along with the source file and line.
    • Command:
      docker exec <cid> ug+ --from=/research_data/methodological_papers_list.txt -P -o -Opdf,docx,txt --filter="pdf:pdftotext % -" --filter="docx:pandoc -t plain % -o -" --format="%f:%n: %1 = %2%~" "(Cohen's d|r)\s*=\s*([0-9.]*[0-9])" > /research_data/effect_sizes.csv
    • Explanation:
      • -P: Enable Perl-compatible regular expressions for capture groups.[1]
      • -o: Output only the matching part (though --format often makes this implicit for the fields used).
      • --format=”%f:%n: %1 = %2%~”: Custom format to output filename (%f), line number (%n), the type of statistic (%1 which captures “Cohen’s d” or “r”), and its value (%2 which captures the number).[1] %~ adds a newline.
      • (Cohen's d|r)\s*=\s*([0-9.]*[0-9]): The PCRE pattern.
        • (Cohen's d|r) is the first capture group (%1).
        • \s*=\s* matches the equals sign with optional surrounding spaces.
        • ([0-9.]*[0-9]) is the second capture group (%2), matching a numerical value that might contain a decimal and must end in a digit.
      • The output is directed to effect_sizes.csv, creating a structured dataset.

    This multi-stage workflow demonstrates how ugrep can be applied iteratively. It starts with broad discovery to narrow down a set of relevant documents and then proceeds to extract increasingly specific information, even transforming it into a structured format suitable for further analysis or direct inclusion in a research paper. This approach mirrors the natural progression of many research tasks, showcasing ugrep not just as a search tool, but as a versatile instrument for textual data management and extraction.
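The hand-off between stages — a file list produced by one search feeding the next — can be mimicked with standard tools when ugrep's --from is unavailable; the directory and file contents below are fabricated for the demonstration, and xargs -a is a GNU extension:

```bash
#!/bin/bash
# Stage 1 writes a candidate list; stage 2 searches only those files,
# which is what --from=list does in the ugrep workflow above.
dir=$(mktemp -d)
list=$(mktemp)
printf 'a key finding emerged\n' > "$dir/paper1.txt"
printf 'unrelated content\n'     > "$dir/paper2.txt"

# Stage 1 (broad): list files mentioning "finding".
grep -rl 'finding' "$dir" > "$list"

# Stage 2 (narrow): search only the candidates, GNU xargs reading the list.
xargs -a "$list" grep -l 'key finding'

rm -rf "$dir"; rm -f "$list"
```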

    #XI. Troubleshooting Common Issues & Getting More Help

    While ugrep is powerful, novices may encounter some common issues. Understanding these and knowing where to find help can smooth the learning curve.

    A. Common Pitfalls for Novices

    • Forgetting to Quote Patterns: Patterns containing spaces, *, ?, (, ), |, &, or other shell metacharacters must be quoted (e.g., 'my search pattern' or "another one"). Otherwise, the shell will interpret them, leading to errors or unexpected behavior.[1]
    • Using ugrep/ug for PDFs/DOCX without Filters: For searching content within PDF, DOC, DOCX files, either use the ug+ or ugrep+ commands (which attempt to use filters automatically) or explicitly specify the --filter option with the correct conversion utility (e.g., pdftotext, antiword, pandoc).[1] Simply running ug “keyword” mydoc.pdf will likely search the raw binary content, not the readable text.
    • Complex Regex Errors: Regular expressions can be tricky. If a complex regex isn’t working:
      • Start with a simpler version of the pattern and build it up.
      • Test parts of the regex in isolation.
      • For literal string searches, remember to use the -F option to avoid regex interpretation.
    • Docker Command Syntax Errors:
      • Ensure the docker exec <container_id_or_name> prefix is correct.
      • Verify that the file paths provided to ugrep are the paths inside the Docker container (as per your volume mounts), not the paths on your host machine.
    • Filter Utilities Not Available/Working: If ug+ or --filter commands fail for specific file types, the necessary filter utility (e.g., pdftotext, pandoc) might not be installed within the Docker container or on the system, or there might be an issue with the filter command itself. Check the installation of these tools.
    • Case Sensitivity: By default, ugrep searches are case-sensitive. If you’re not finding expected matches, try the -i (ignore case) or -j (smart case) option.[1]
    • Word Boundaries: If you search for “cat” and get “caterpillar,” use the -w (word regexp) option to match “cat” as a whole word.[1]
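The quoting pitfall is easy to see by counting the arguments the way the shell hands them to a command (a generic illustration, not specific to ugrep):

```bash
#!/bin/bash
# Unquoted: the shell splits the pattern into three separate arguments,
# so ugrep would treat the first word as the pattern and the rest as files.
set -- my search pattern
echo "unquoted: $# arguments"   # prints "unquoted: 3 arguments"

# Quoted: one argument reaches the tool, as intended.
set -- 'my search pattern'
echo "quoted: $# arguments"     # prints "quoted: 1 arguments"
```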

    B. Interpreting “No Matches Found”

    If ugrep reports no matches, consider these checks:

    1. Pattern Accuracy: Double-check your search pattern for typos or incorrect regex syntax. Is it too specific? Too broad?
    2. Case Sensitivity: As above, try -i or -j.
    3. Word Boundaries: Could -w help or hinder?
    4. File Paths: Are you pointing ugrep to the correct files or directories (especially within Docker)?
    5. Recursive Options: If files are in subdirectories, did you use -r or a similar recursive option?
    6. File Type/Extension Filters: Are your -t, -O, or -g options too restrictive, excluding the files you intend to search?
    7. PDF/DOCX Content: If searching these types, ensure your ug+ command is used or that --filter options are correct and the filter utilities are functional. Try converting a single problematic file manually with the filter utility outside of ugrep to see if it produces searchable text.
    8. Encoding: While ugrep handles UTF-8, UTF-16, and UTF-32 well, very old or unusually encoded files might cause issues. The --encoding option can be used for specific encodings if known.[1]
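Check #7 can be scripted; the check_filter helper below is a hypothetical sketch that first verifies the converter exists, then attempts a manual conversion exactly as the checklist suggests (sample.pdf is a placeholder):

```bash
#!/bin/bash
# Verify a filter utility is installed and can produce text from a file,
# before concluding that a search pattern is at fault.
check_filter() {
  local util="$1" file="$2"
  if ! command -v "$util" >/dev/null 2>&1; then
    echo "missing utility: $util"
    return 1
  fi
  # Convert to stdout; any readable text here means the filter side works.
  "$util" "$file" - | head -n 5
}

# Usage: check_filter pdftotext sample.pdf
```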

    C. Getting More Help from ugrep Documentation

    ugrep has excellent built-in help and extensive online documentation.

    • General Help:
      ug --help (or ugrep --help)
      This displays a comprehensive list of options.[1]
    • Specific Help Topics:
      ug --help WHAT
      Replace WHAT with a keyword for more targeted help. Highly useful topics for researchers include:
      • ug --help regex: Detailed information on regular expression syntax.[1]
      • ug --help globs: Explanation of glob pattern syntax for file matching.[1]
      • ug --help format: Details on all %-fields for custom output formatting.[1]
      • ug --help fuzzy: Information on fuzzy search options.[1]
      • ug --help count: Help on counting options like -c and -m.[1]
    • Man Page:
      If installed system-wide (not just Docker), the manual page provides exhaustive details:
      man ugrep.[1]
    • Official Website:
      For the most current documentation, examples, and news, refer to the official ugrep website: https://ugrep.com/.[1] The documentation snippet itself is dated Tue April 22, 2025, indicating it’s kept up-to-date.

    D. Final Encouragement

    ugrep is an exceptionally versatile and powerful tool. While its wide array of options might seem daunting to a novice initially, starting with the basics and gradually incorporating more advanced features relevant to your research needs will quickly demonstrate its value. The ability to precisely target diverse file types, construct nuanced search queries, and format output for further analysis can significantly enhance research productivity and help manage the often-overwhelming volume of digital information. With practice, ugrep can become an indispensable part of your research toolkit.

    #XII. Conclusion

    The ugrep utility offers a robust and highly efficient solution for researchers grappling with the common problem of managing and extracting information from large, disorganized collections of mixed-format files. Its ultra-fast search capabilities, coupled with extensive support for various file types including PDFs and DOCX through filtering mechanisms, make it a significant upgrade over traditional command-line search tools. For the novice user, particularly one operating within a Docker environment, ugrep provides a clear path from basic keyword searching to sophisticated data extraction workflows.

    Key strengths that directly address the researcher’s needs include its flexible pattern matching (from simple fixed strings to complex Perl-compatible regular expressions), powerful Boolean query syntax for combining multiple search criteria, and comprehensive options for displaying match context and specific details like line numbers and byte offsets. The interactive TUI (-Q) facilitates exploratory searching, which is invaluable during the initial phases of research. Furthermore, the ability to customize output formats (--format, --csv, --json, --xml) allows for the direct extraction of data into structured formats suitable for analysis, citation management, or integration into other research tools. This transforms ugrep from a mere search utility into a potent pre-processing engine for textual data.

    Features such as configuration files (.ugrep, ug --save-config) and file indexing (ugrep-indexer, ug --index) provide avenues for streamlining repetitive tasks and optimizing performance on very large, static datasets, respectively. While indexing has some limitations with dynamic filtering, its utility for cold storage systems can still be beneficial for initial file culling.

    #bash script

    #!/bin/bash # --- Script Configuration --- # Base directory for all outputs of this batch search run. # A timestamp is included to ensure uniqueness for multiple runs. MASTER_OUTPUT_ROOT_DEFAULT="./batch_search_run_$(date +%Y%m%d_%H%M%S)" MASTER_OUTPUT_ROOT="$MASTER_OUTPUT_ROOT_DEFAULT" # Can be overridden by --config-file or -o ALL_RESULTS_PARENT_DIR="${MASTER_OUTPUT_ROOT}/search_results" ALL_LOGS_PARENT_DIR="${MASTER_OUTPUT_ROOT}/search_logs" MASTER_LOG_FILE="${MASTER_OUTPUT_ROOT}/master_orchestrator_log.txt" SUMMARY_REPORT_FILE="${MASTER_OUTPUT_ROOT}/summary_report.txt" # OCR Configuration (if uncommented and used) # OCR_BASE_TEMP_DIR="/tmp/ocr_temp_area" # Base for unique OCR temporary directories # Concurrency Configuration MAX_CONCURRENT_JOBS_DEFAULT=$(nproc --all 2>/dev/null || echo 7) # Default to number of processors or 7 MAX_CONCURRENT_JOBS="$MAX_CONCURRENT_JOBS_DEFAULT" # Can be overridden # Retry Configuration MAX_RETRIES_DEFAULT=1 # Default to 1 attempt (0 actual retries). Set to >1 for retries. 
MAX_RETRIES="$MAX_RETRIES_DEFAULT" RETRY_DELAY_SECONDS_DEFAULT=5 RETRY_DELAY_SECONDS="$RETRY_DELAY_SECONDS_DEFAULT" # --- ugrep Configuration --- UGREP_CMD_DEFAULT="ug+" UGREP_CMD="$UGREP_CMD_DEFAULT" UGREP_OPTS_BASE_DEFAULT="-r -i -l" # Recursive, case-insensitive, list filenames UGREP_OPTS_BASE=() # Initialize as array for robust option handling UGREP_OPTS_ARCHIVES_DEFAULT="-z" # Search archives UGREP_OPTS_ARCHIVES=() UGREP_OPTS_INDEX_DEFAULT="" # Example: "--index" UGREP_OPTS_INDEX=() # --- Search Targets (Defaults, can be overridden by external files) --- declare -a SEARCH_LOCATIONS_DEFAULT=( "/mnt/external_hdd/pdf/" "/mnt/external_hdd/0000000000000000001/" # Add more default locations as needed ) declare -a SEARCH_QUERIES_DEFAULT=( "jurime" "sociolo" # Add more default queries as needed ) declare -a SEARCH_LOCATIONS=() declare -a SEARCH_QUERIES=() # External Configuration File Paths LOCATIONS_FILE_ARG="" QUERIES_FILE_ARG="" CONFIG_FILE_ARG="" # --- Utility Functions --- master_log() { echo "$(date '+%Y-%m-%d %H:%M:%S') [MASTER] - $1" | tee -a "$MASTER_LOG_FILE" } sanitize_for_path() { echo "$1" | sed 's#^/*##;s#/*$##;s#/#_#g' | tr ' ' '_' | sed 's/[^a-zA-Z0-9_-]/_/g' } # Initialize ugrep option arrays from default strings # This allows defaults to be simple strings but used as arrays internally # Users can override these arrays directly in a config file if they prefer array syntax read -r -a UGREP_OPTS_BASE <<< "$UGREP_OPTS_BASE_DEFAULT" read -r -a UGREP_OPTS_ARCHIVES <<< "$UGREP_OPTS_ARCHIVES_DEFAULT" if [ -n "$UGREP_OPTS_INDEX_DEFAULT" ]; then read -r -a UGREP_OPTS_INDEX <<< "$UGREP_OPTS_INDEX_DEFAULT" fi validate_commands() { local all_found=true # Use $UGREP_CMD which might have been updated by config local first_word_ugrep_cmd="${UGREP_CMD%% *}" for cmd_name in "$first_word_ugrep_cmd" "nproc" "date" "tee" "sed" "tr" "wc" "grep" "cut" "basename" "mkdir" "rm" "sleep" "kill"; do if ! 
command -v "$cmd_name" >/dev/null 2>&1; then master_log "ERROR: Required command '$cmd_name' not found in PATH." all_found=false fi done if [ "$all_found" = false ]; then master_log "FATAL: One or more required commands are missing. Please install them and ensure they are in your PATH. Exiting." exit 1 fi master_log "All required commands are available." } validate_search_locations() { local valid_locations=() local invalid_locations_count=0 if [ ${#SEARCH_LOCATIONS[@]} -eq 0 ]; then master_log "ERROR: No search locations defined. Please provide locations. Exiting." exit 1 fi master_log "Validating ${#SEARCH_LOCATIONS[@]} search locations..." for loc_with_opts in "${SEARCH_LOCATIONS[@]}"; do local loc="${loc_with_opts%%:::*}" # Extract path part for validation if [ -z "$loc" ]; then # Skip empty lines if they somehow get here master_log "WARNING: Encountered empty location string. Skipping." invalid_locations_count=$((invalid_locations_count + 1)) continue fi if [ ! -e "$loc" ]; then master_log "WARNING: Search location '$loc' does not exist. Skipping." invalid_locations_count=$((invalid_locations_count + 1)) elif [ ! -r "$loc" ]; then master_log "WARNING: Search location '$loc' is not readable. Skipping." invalid_locations_count=$((invalid_locations_count + 1)) elif [ ! -d "$loc" ]; then # Check if it's a directory master_log "WARNING: Search location '$loc' is not a directory. Skipping." invalid_locations_count=$((invalid_locations_count + 1)) else valid_locations+=("$loc_with_opts") fi done SEARCH_LOCATIONS=("${valid_locations[@]}") if [ ${#SEARCH_LOCATIONS[@]} -eq 0 ]; then master_log "FATAL: No valid and readable search locations remaining after validation. Exiting." exit 1 fi if [ "$invalid_locations_count" -gt 0 ]; then master_log "INFO: Skipped $invalid_locations_count invalid or unreadable locations." fi master_log "Proceeding with ${#SEARCH_LOCATIONS[@]} valid search locations." 
}

# Summary Reporting Variables
tasks_succeeded=0
tasks_failed=0
tasks_no_match=0
total_matching_files=0
declare -a failed_tasks_details=()

# Advanced Concurrency - PID tracking map
declare -A pid_task_info_map # PID to "Query: X, Location: Y, Log: Z, Results: R, TaskIdentifier: TID"

process_job_completion() {
    local pid="$1"
    local job_actual_exit_status="$2" # This is the exit status of perform_search_task
    local task_info_str="${pid_task_info_map["$pid"]}"
    if [ -z "$task_info_str" ]; then
        master_log "WARN: No task info found for completed PID $pid. Cannot process its completion for summary."
        return
    fi
    local task_log_file
    task_log_file=$(echo "$task_info_str" | grep -o 'Log: [^,]*' | cut -d' ' -f2)
    local results_file
    results_file=$(echo "$task_info_str" | grep -o 'Results: [^,]*' | cut -d' ' -f2)
    # local task_identifier_from_map=$(echo "$task_info_str" | grep -o 'TaskIdentifier: [^,]*' | cut -d' ' -f2)
    if [ "$job_actual_exit_status" -eq 0 ]; then
        # perform_search_task succeeded (meaning ugrep eventually succeeded)
        if [ -f "$results_file" ] && [ -s "$results_file" ]; then
            tasks_succeeded=$((tasks_succeeded + 1))
            local matches_in_task
            matches_in_task=$(wc -l < "$results_file")
            total_matching_files=$((total_matching_files + matches_in_task))
        else
            # ugrep succeeded but found no matches (or results file was removed)
            tasks_no_match=$((tasks_no_match + 1))
        fi
    else
        # perform_search_task failed (meaning ugrep ultimately failed after retries)
        tasks_failed=$((tasks_failed + 1))
        failed_tasks_details+=("Task for ${task_info_str} FAILED (script exit: $job_actual_exit_status). Check log: $task_log_file")
    fi
    unset pid_task_info_map["$pid"]
}

perform_search_task() {
    local search_param_full="$1"
    local query_param_full="$2"
    local task_unique_id="$3" # A unique ID for this specific task instance, e.g. "task_N"
    local search_dir_param="${search_param_full%%:::*}"
    local task_specific_loc_opts_str="${search_param_full#*:::}"
    if [ "$task_specific_loc_opts_str" = "$search_dir_param" ]; then # No ::: found or empty opts
        task_specific_loc_opts_str=""
    fi
    local query_param="${query_param_full%%:::*}"
    local task_specific_query_opts_str="${query_param_full#*:::}"
    if [ "$task_specific_query_opts_str" = "$query_param" ]; then # No ::: found or empty opts
        task_specific_query_opts_str=""
    fi
    local sane_loc_part
    sane_loc_part=$(sanitize_for_path "$(basename "$search_dir_param")")
    if [ -z "$sane_loc_part" ]; then sane_loc_part="root"; fi
    local sane_query_part
    sane_query_part=$(sanitize_for_path "$query_param")
    local task_identifier_fs="loc_${sane_loc_part}_query_${sane_query_part}_${task_unique_id}" # Filesystem-safe identifier
    local task_results_dir="${ALL_RESULTS_PARENT_DIR}/${task_identifier_fs}"
    local task_log_file="${ALL_LOGS_PARENT_DIR}/log_${task_identifier_fs}.txt"
    mkdir -p "$task_results_dir"
    # Local task_log function to avoid issues with export -f and global state
    _task_log_internal() {
        echo "$(date '+%Y-%m-%d %H:%M:%S') [TASK: ${task_identifier_fs}] - $1" >> "$task_log_file"
    }
    echo "--- Log for Query: '$query_param' in Location: '$search_dir_param' (ID: ${task_identifier_fs}) ---" > "$task_log_file"
    _task_log_internal "Raw location spec: '$search_param_full'"
    _task_log_internal "Raw query spec: '$query_param_full'"
    _task_log_internal "Parsed location-specific ugrep opts: '$task_specific_loc_opts_str'"
    _task_log_internal "Parsed query-specific ugrep opts: '$task_specific_query_opts_str'"
    _task_log_internal "Results will be in: $task_results_dir"
    _task_log_internal "Full log at: $task_log_file"
    local current_search_paths="$search_dir_param"
    local output_file_matches="$task_results_dir/matches.txt"
    local ugrep_cmd_to_run=()
    read -r -a ugrep_cmd_base_array <<< "$UGREP_CMD" # Handle if UGREP_CMD itself has args
    ugrep_cmd_to_run+=("${ugrep_cmd_base_array[@]}")
    ugrep_cmd_to_run+=("${UGREP_OPTS_BASE[@]}")
    ugrep_cmd_to_run+=("${UGREP_OPTS_ARCHIVES[@]}")
    ugrep_cmd_to_run+=("${UGREP_OPTS_INDEX[@]}")
    # Add task-specific options, parsed from strings
    local temp_opts_array=()
    if [ -n "$task_specific_loc_opts_str" ]; then
        read -r -a temp_opts_array <<< "$task_specific_loc_opts_str"
        ugrep_cmd_to_run+=("${temp_opts_array[@]}")
    fi
    if [ -n "$task_specific_query_opts_str" ]; then
        read -r -a temp_opts_array <<< "$task_specific_query_opts_str"
        ugrep_cmd_to_run+=("${temp_opts_array[@]}")
    fi
    ugrep_cmd_to_run+=("$query_param" "$current_search_paths")
    local retries_done=0
    local ugrep_final_exit_code=1 # Assume failure initially
    while [ "$retries_done" -le "$MAX_RETRIES" ]; do
        if [ "$retries_done" -gt 0 ]; then # This means it's a retry attempt
            _task_log_internal "Retrying ugrep (attempt $((retries_done + 1))/$((MAX_RETRIES + 1)))..."
            sleep "$RETRY_DELAY_SECONDS"
        else # First attempt
            _task_log_internal "Executing ugrep (attempt 1/$((MAX_RETRIES + 1))): ${ugrep_cmd_to_run[*]}"
        fi
        # Execute ugrep
        if "${ugrep_cmd_to_run[@]}" > "$output_file_matches" 2>> "$task_log_file"; then
            ugrep_final_exit_code=0 # Success
            if [ -s "$output_file_matches" ]; then
                _task_log_internal "SUCCESS (attempt $((retries_done + 1))): Matches found. Results are in $output_file_matches"
            else
                _task_log_internal "INFO (attempt $((retries_done + 1))): No matches found."
                rm -f "$output_file_matches"
            fi
            break # Successful attempt, exit retry loop
        else
            ugrep_final_exit_code=$?
            _task_log_internal "ERROR (attempt $((retries_done + 1))): ugrep command failed with exit code $ugrep_final_exit_code."
            # Placeholder for specific retryable error logic:
            # if [[ "$ugrep_final_exit_code" -eq KNOWN_TRANSIENT_ERROR_CODE ]]; then
            #     _task_log_internal "Transient error detected, will retry if attempts remain."
            # else
            #     _task_log_internal "Non-retryable error or unknown error, will not retry if this was the only attempt."
            #     if [ "$MAX_RETRIES" -eq 0 ]; then break; fi # No retries configured for this error type
            # fi
            if [ "$retries_done" -ge "$MAX_RETRIES" ]; then
                _task_log_internal "ERROR: All $((MAX_RETRIES + 1)) attempts failed for ugrep command."
                break # Max attempts reached
            fi
        fi
        retries_done=$((retries_done + 1))
    done
    _task_log_internal "Search task finished with final ugrep exit code: $ugrep_final_exit_code."
    return "$ugrep_final_exit_code"
}
export -f perform_search_task sanitize_for_path # Export functions needed by subshells

usage() {
    echo "Usage: $0 [options]"
    echo "Options:"
    echo "  --locations-file <file>  File containing search locations, one per line."
    echo "                           Format per line: /path/to/location[:::optional_ugrep_options_for_location]"
    echo "  --queries-file <file>    File containing search queries, one per line."
    echo "                           Format per line: search_query_text[:::optional_ugrep_options_for_query]"
    echo "  --config-file <file>     Bash script file to source for overriding default configurations."
    echo "                           (e.g., MAX_CONCURRENT_JOBS, UGREP_CMD, UGREP_OPTS_BASE array, etc.)"
    echo "  -o <dir>                 Master output root directory (default: $MASTER_OUTPUT_ROOT_DEFAULT)."
    echo "  -h, --help               Display this help message."
    exit 0
}

# Simple argument parsing loop
while [[ $# -gt 0 ]]; do
    key="$1"
    case $key in
        --locations-file) LOCATIONS_FILE_ARG="$2"; shift; shift ;;
        --queries-file) QUERIES_FILE_ARG="$2"; shift; shift ;;
        --config-file) CONFIG_FILE_ARG="$2"; shift; shift ;;
        -o|--output-dir) MASTER_OUTPUT_ROOT="$2"; shift; shift ;;
        -h|--help) usage ;;
        *) echo "Unknown option: $1"; usage ;;
    esac
done

# Re-derive paths if MASTER_OUTPUT_ROOT was changed by -o
ALL_RESULTS_PARENT_DIR="${MASTER_OUTPUT_ROOT}/search_results"
ALL_LOGS_PARENT_DIR="${MASTER_OUTPUT_ROOT}/search_logs"
MASTER_LOG_FILE="${MASTER_OUTPUT_ROOT}/master_orchestrator_log.txt"
SUMMARY_REPORT_FILE="${MASTER_OUTPUT_ROOT}/summary_report.txt"

# --- Main Orchestration ---
mkdir -p "$ALL_RESULTS_PARENT_DIR" || { echo "FATAL: Could not create results parent dir '$ALL_RESULTS_PARENT_DIR'. Exiting." >&2; exit 1; }
mkdir -p "$ALL_LOGS_PARENT_DIR" || { echo "FATAL: Could not create logs parent dir '$ALL_LOGS_PARENT_DIR'. Exiting." >&2; exit 1; }
echo "--- Master Orchestrator Log ---" > "$MASTER_LOG_FILE"
echo "Run ID: $(basename "$MASTER_OUTPUT_ROOT")" >> "$MASTER_LOG_FILE"
master_log "Script started."

if [ -n "$CONFIG_FILE_ARG" ]; then
    if [ -f "$CONFIG_FILE_ARG" ]; then
        master_log "Sourcing external configuration from: $CONFIG_FILE_ARG"
        # shellcheck source=/dev/null
        source "$CONFIG_FILE_ARG"
        # Re-initialize option arrays if string versions were overridden by config.
        # A config file can directly set UGREP_OPTS_BASE as an array.
        # If it sets UGREP_OPTS_BASE_DEFAULT (string) instead, re-parse:
        if declare -p UGREP_OPTS_BASE_DEFAULT &>/dev/null && ! declare -p UGREP_OPTS_BASE &>/dev/null; then
            read -r -a UGREP_OPTS_BASE <<< "$UGREP_OPTS_BASE_DEFAULT"
        fi
        if declare -p UGREP_OPTS_ARCHIVES_DEFAULT &>/dev/null && ! declare -p UGREP_OPTS_ARCHIVES &>/dev/null; then
            read -r -a UGREP_OPTS_ARCHIVES <<< "$UGREP_OPTS_ARCHIVES_DEFAULT"
        fi
        if declare -p UGREP_OPTS_INDEX_DEFAULT &>/dev/null && ! declare -p UGREP_OPTS_INDEX &>/dev/null; then
            if [ -n "$UGREP_OPTS_INDEX_DEFAULT" ]; then
                read -r -a UGREP_OPTS_INDEX <<< "$UGREP_OPTS_INDEX_DEFAULT"
            else
                UGREP_OPTS_INDEX=()
            fi
        fi
    else
        master_log "WARNING: Config file '$CONFIG_FILE_ARG' not found. Using default/script internal settings."
    fi
fi

master_log "Outputs will be stored under: $MASTER_OUTPUT_ROOT"
master_log "Maximum concurrent jobs: $MAX_CONCURRENT_JOBS"
master_log "Max retries per task: $MAX_RETRIES (actual attempts: $((MAX_RETRIES + 1))), Retry delay: $RETRY_DELAY_SECONDS seconds"
master_log "ugrep command: '$UGREP_CMD'"
master_log "Base ugrep options: '${UGREP_OPTS_BASE[*]}'"
master_log "Archive ugrep options: '${UGREP_OPTS_ARCHIVES[*]}'"
master_log "Index ugrep options: '${UGREP_OPTS_INDEX[*]}'"

if [ -n "$LOCATIONS_FILE_ARG" ]; then
    master_log "Loading search locations from: $LOCATIONS_FILE_ARG"
    if [ -f "$LOCATIONS_FILE_ARG" ]; then
        mapfile -t SEARCH_LOCATIONS < <(grep -v -e '^\s*#' -e '^\s*$' "$LOCATIONS_FILE_ARG")
    else
        master_log "ERROR: Locations file '$LOCATIONS_FILE_ARG' not found. Exiting."
        exit 1
    fi
else
    master_log "Using default search locations defined in script."
    SEARCH_LOCATIONS=("${SEARCH_LOCATIONS_DEFAULT[@]}")
fi

if [ -n "$QUERIES_FILE_ARG" ]; then
    master_log "Loading search queries from: $QUERIES_FILE_ARG"
    if [ -f "$QUERIES_FILE_ARG" ]; then
        mapfile -t SEARCH_QUERIES < <(grep -v -e '^\s*#' -e '^\s*$' "$QUERIES_FILE_ARG")
    else
        master_log "ERROR: Queries file '$QUERIES_FILE_ARG' not found. Exiting."
        exit 1
    fi
else
    master_log "Using default search queries defined in script."
    SEARCH_QUERIES=("${SEARCH_QUERIES_DEFAULT[@]}")
fi

validate_commands
validate_search_locations
master_log "Assumptions: Read access to search locations. Write access to output directory."

total_tasks_launched=0
current_active_jobs=0
master_log "Starting to launch search tasks..."
for location_spec in "${SEARCH_LOCATIONS[@]}"; do
    for query_spec in "${SEARCH_QUERIES[@]}"; do
        total_tasks_launched=$((total_tasks_launched + 1))
        task_unique_id_for_run="task${total_tasks_launched}" # Simple unique ID for this run

        # Manage concurrency
        while [ "$current_active_jobs" -ge "$MAX_CONCURRENT_JOBS" ]; do
            master_log "Reached max concurrent jobs ($current_active_jobs/$MAX_CONCURRENT_JOBS). Waiting for a slot..."
            # `wait -n -p VAR` (Bash 5.1+) reaps one finished job, stores its PID
            # in VAR, and returns that job's exit status. It must NOT be wrapped in
            # a command substitution: a $(...) subshell has no children to wait for.
            finished_pid=""
            wait -n -p finished_pid 2>/dev/null
            job_actual_exit_status=$?
            if [ -n "$finished_pid" ]; then
                process_job_completion "$finished_pid" "$job_actual_exit_status"
                current_active_jobs=$((current_active_jobs - 1))
            else
                # `wait -n -p` yielded no PID (interrupted, or Bash too old).
                # Fallback: poll all known PIDs for one that has exited.
                master_log "WARN: 'wait -n -p' did not yield a PID cleanly (status $job_actual_exit_status). Checking all active PIDs."
                found_finished_job=0
                for pid_to_check in "${!pid_task_info_map[@]}"; do
                    if ! kill -0 "$pid_to_check" 2>/dev/null; then # PID no longer running
                        wait "$pid_to_check" # Reap and collect its exit status
                        fallback_job_status=$?
                        process_job_completion "$pid_to_check" "$fallback_job_status"
                        current_active_jobs=$((current_active_jobs - 1))
                        found_finished_job=1
                        break # Processed one, re-evaluate
                    fi
                done
                if [ "$found_finished_job" -eq 0 ]; then
                    master_log "WARN: Fallback check found no finished jobs; 'wait -n' may have been interrupted. Retrying wait."
                    sleep 0.1 # Small delay before retrying wait
                fi
            fi
            if [ "$current_active_jobs" -lt 0 ]; then current_active_jobs=0; fi
        done

        # Prepare info for pid_task_info_map
        loc_path_part="${location_spec%%:::*}"
        q_path_part="${query_spec%%:::*}"
        sane_loc_part_map=$(sanitize_for_path "$(basename "$loc_path_part")")
        if [ -z "$sane_loc_part_map" ]; then sane_loc_part_map="root"; fi
        sane_query_part_map=$(sanitize_for_path "$q_path_part")
        task_identifier_fs_map="loc_${sane_loc_part_map}_query_${sane_query_part_map}_${task_unique_id_for_run}"
        task_log_file_map="${ALL_LOGS_PARENT_DIR}/log_${task_identifier_fs_map}.txt"
        task_results_file_map="${ALL_RESULTS_PARENT_DIR}/${task_identifier_fs_map}/matches.txt"

        master_log "Launching task $total_tasks_launched (ID: $task_unique_id_for_run -> FS: $task_identifier_fs_map): Query ['$query_spec'] in Location ['$location_spec']"
        perform_search_task "$location_spec" "$query_spec" "$task_unique_id_for_run" &
        current_pid=$!
        pid_task_info_map["$current_pid"]="Query: $q_path_part, Location: $loc_path_part, Log: $task_log_file_map, Results: $task_results_file_map, TaskIdentifierFS: $task_identifier_fs_map"
        current_active_jobs=$((current_active_jobs + 1))
    done
done

master_log "All $total_tasks_launched tasks have been launched."
master_log "Waiting for remaining $current_active_jobs active jobs to complete..."
while [ "$current_active_jobs" -gt 0 ]; do
    # Same waiting logic as in the launch loop: `wait -n -p` directly,
    # never inside a command substitution.
    finished_pid_final=""
    wait -n -p finished_pid_final 2>/dev/null
    job_actual_exit_status_final=$?
    if [ -n "$finished_pid_final" ]; then
        process_job_completion "$finished_pid_final" "$job_actual_exit_status_final"
        current_active_jobs=$((current_active_jobs - 1))
        master_log "A job (PID $finished_pid_final) finished. Remaining: $current_active_jobs"
    else
        master_log "WARN (final loop): 'wait -n -p' did not yield a PID. Checking all active PIDs."
        # Fallback logic as in the main launch loop
        found_finished_job_f=0
        for pid_to_check_f in "${!pid_task_info_map[@]}"; do
            if ! kill -0 "$pid_to_check_f" 2>/dev/null; then
                wait "$pid_to_check_f"
                f_stat=$?
                process_job_completion "$pid_to_check_f" "$f_stat"
                current_active_jobs=$((current_active_jobs - 1))
                found_finished_job_f=1
                break
            fi
        done
        if [ "$found_finished_job_f" -eq 0 ]; then sleep 0.1; fi
    fi
    if [ "$current_active_jobs" -lt 0 ]; then current_active_jobs=0; fi
done

# Final sanity check for any PIDs that might still be in the map (should be empty)
if [ ${#pid_task_info_map[@]} -gt 0 ]; then
    master_log "Performing final PID reaping for any stragglers (${!pid_task_info_map[*]})..."
    for pid_to_reap in "${!pid_task_info_map[@]}"; do
        master_log "Waiting for potentially missed PID: $pid_to_reap"
        wait "$pid_to_reap" # Returns immediately with the job's status if it already exited
        actual_job_exit_status=$?
        # Check if the PID is still in the map (it might already have been processed)
        if [[ -v pid_task_info_map["$pid_to_reap"] ]]; then
            process_job_completion "$pid_to_reap" "$actual_job_exit_status"
            master_log "Processed straggler PID $pid_to_reap, exit status $actual_job_exit_status"
        fi
    done
fi

master_log "All search tasks have completed."
master_log "Generating summary report to: $SUMMARY_REPORT_FILE"
{
    echo "--- Batch Search Run Summary ---"
    echo "Run ID: $(basename "$MASTER_OUTPUT_ROOT")"
    echo "Completion Time: $(date '+%Y-%m-%d %H:%M:%S')"
    echo ""
    echo "Total Tasks Launched: $total_tasks_launched"
    echo "  Tasks Succeeded (found matches): $tasks_succeeded"
    echo "  Tasks Succeeded (no matches found): $tasks_no_match"
    echo "  Tasks Failed: $tasks_failed"
    echo ""
    echo "Total Matching Files Found (across all successful tasks with matches): $total_matching_files"
    echo ""
    if [ ${#failed_tasks_details[@]} -gt 0 ]; then
        echo "Details of Failed Tasks (${#failed_tasks_details[@]}):"
        for detail in "${failed_tasks_details[@]}"; do
            echo "  - $detail"
        done
    else
        echo "No tasks failed."
    fi
    echo ""
    echo "Master Log: $MASTER_LOG_FILE"
    echo "All Results In: $ALL_RESULTS_PARENT_DIR"
    echo "All Logs In: $ALL_LOGS_PARENT_DIR"
} > "$SUMMARY_REPORT_FILE"
master_log "Summary report generated."
cat "$SUMMARY_REPORT_FILE" # Also print to stdout for convenience
master_log "Master script finished."
exit 0

    #bash script documentation

    #Batch Search Orchestration Script: Technical Documentation

    #I. Overview and Purpose

    The Batch Search Orchestration Script is a Bash utility designed to automate and manage large-scale search operations across a defined set of file system locations using the ugrep tool. Its primary purpose is to execute multiple search queries concurrently against numerous directories, including those containing various file types and archives. The script organizes search results and detailed logs systematically, facilitating efficient data retrieval and analysis from extensive file collections.

    Key capabilities include:

    • Batch Processing: Iterates through predefined lists of search locations and queries.
    • Concurrent Execution: Launches multiple search tasks in parallel to leverage multi-core processors and expedite the overall search process.
    • Configurable Search Parameters: Allows users to specify target directories, search terms, and ugrep command-line options through internal script variables.
    • Archive Searching: Supports searching within common archive formats (e.g., .zip, .gz, .tar) if ugrep is configured accordingly.
    • Organized Output: Creates a unique, timestamped root directory for each run, containing separate subdirectories for search results and logs. Individual tasks also have dedicated result files and log files with standardized naming conventions.
    • Comprehensive Logging: Maintains a master log for the overall orchestration process and detailed logs for each individual search task, capturing ugrep’s output and any errors encountered.
    • Extensibility (Experimental): Includes a commented-out module for OCR (Optical Character Recognition) pre-processing of PDF files, suggesting a potential for future expansion to search image-based documents.

    This script is particularly useful for scenarios requiring repetitive searches across large, static datasets, such as digital archives, code repositories, or document stores, where findings need to be systematically collected and logged.

    #II. Prerequisites and Dependencies

    To ensure the full functionality of the Batch Search Orchestration Script, the following software components must be installed and accessible in the system’s PATH, and appropriate file system permissions must be in place.

    A. Software Requirements:

    Software Minimum Version (Recommended) Purpose Notes
    Bash 5.1+ Script interpreter. Essential for script execution. Modern features such as wait -n -p (added in Bash 5.1) are used.
    ugrep (ug+) Latest stable Core search utility. The script is configured to use ug+ by default. Ensure it’s installed and accessible.
    nproc (GNU coreutils) Determines the number of available processing units. Used to set the default for MAX_CONCURRENT_JOBS. If unavailable, the script defaults to 7.
    date (GNU coreutils) Generates timestamps for output directories and log entries. Standard utility, typically available.
    tee (GNU coreutils) Redirects output to both standard output and log files. Used by master_log. Standard utility.
    sed (GNU sed) Stream editor for text manipulation. Used in sanitize_for_path. Standard utility.
    tr (GNU coreutils) Translates or deletes characters. Used in sanitize_for_path. Standard utility.
    pdftotext (poppler-utils) Extracts text from PDF files. Optional: Only required if the OCR functionality is uncommented and used.
    tesseract Latest stable OCR engine for converting images (including scanned PDFs) to text. Optional: Only required if the OCR functionality is uncommented and used. Requires language packs (e.g., English).

    B. File System Permissions:

    • Read Access: The user executing the script must have read permissions for all directories specified in the SEARCH_LOCATIONS array and all files and subdirectories within them. Lack of read access will result in ugrep errors for the affected tasks, which will be recorded in the respective task logs.
    • Write Access: The user must have write and execute permissions for the current working directory if MASTER_OUTPUT_ROOT is set to a relative path like ./batch_search_run_…. Specifically, the script needs to create the MASTER_OUTPUT_ROOT directory and its subdirectories (search_results, search_logs). Failure to create these initial directories will cause the script to exit with a fatal error.
    • Execute Access: The script file itself must have execute permissions (e.g., chmod +x script_name.sh).

    The script employs mkdir -p when creating output directories. This command creates parent directories as needed and does not error if the directory already exists, contributing to robust execution. However, the initial creation of ALL_RESULTS_PARENT_DIR and ALL_LOGS_PARENT_DIR includes an explicit check; if these cannot be created, the script terminates. This fail-fast behavior for critical setup steps prevents further execution with a flawed output structure.

    It is important to note that if search locations are on network mounts or external drives (as suggested by paths like /mnt/external_hdd/), the stability of these network connections and the availability of the mount points are crucial. The script does not perform explicit pre-checks for mount point accessibility before initiating search tasks; failures related to inaccessible paths during a ugrep operation will be logged at the task level.
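If such a pre-flight check is desired, a small helper could be bolted on before task launch. This sketch is not part of the script; it uses `mountpoint` from util-linux, and treats "not a mount point" only as a warning, since a search location may legitimately be a plain directory:

```shell
# Hypothetical pre-flight helper (not in the script): report whether each
# search location is itself a mount point. `mountpoint -q` exits 0 only
# when the path is the root of a mounted filesystem.
check_mounted_locations() {
    local loc
    for loc in "$@"; do
        if mountpoint -q "$loc"; then
            echo "mounted: $loc"
        else
            echo "not a mount point (may still be a plain directory): $loc"
        fi
    done
}

check_mounted_locations "/" "/mnt/external_hdd"
```

A call to such a helper could be placed just before the task-launch loop, with `"${SEARCH_LOCATIONS[@]%%:::*}"` as its arguments.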

    #III. Configuration Deep Dive

    The script’s behavior is primarily controlled by a set of variables defined in its “Script Configuration” section. Understanding and appropriately setting these variables is key to tailoring the script for specific search tasks and system environments.

    A. Environment Variables and Script Variables:

    The following table details the main configuration variables, their default values or initialization logic, scope, purpose, and customization impact.

    Variable Name Default Value / Example Initialization Scope Description/Purpose Customization Impact & Notes
    MASTER_OUTPUT_ROOT "./batch_search_run_$(date +%Y%m%d_%H%M%S)" Global Base directory for all outputs of a single script run. Timestamp ensures uniqueness. Change to specify a different parent location for all run data. Ensure write permissions.
    ALL_RESULTS_PARENT_DIR "${MASTER_OUTPUT_ROOT}/search_results" Global Parent directory for all task-specific search result subdirectories. Path is relative to MASTER_OUTPUT_ROOT. Modifying its name changes the results folder name.
    ALL_LOGS_PARENT_DIR "${MASTER_OUTPUT_ROOT}/search_logs" Global Parent directory for all task-specific log file subdirectories. Path is relative to MASTER_OUTPUT_ROOT. Modifying its name changes the logs folder name.
    MASTER_LOG_FILE "${MASTER_OUTPUT_ROOT}/master_orchestrator_log.txt" Global Path to the main log file for the script’s orchestration activities. Path is relative to MASTER_OUTPUT_ROOT. Modifying its name changes the master log filename.
    OCR_BASE_TEMP_DIR "/tmp/ocr_temp_area" (Commented) Global Base directory for temporary files generated during OCR processing. If OCR is enabled, ensure this location is writable and has sufficient space. Consider cleanup strategy.
    MAX_CONCURRENT_JOBS `${MAX_CONCURRENT_JOBS:-$(nproc --all || echo 7)}` Global Maximum number of perform_search_task instances to run concurrently. Critical for performance tuning. Adjust based on CPU cores, I/O capacity, and nature of searches (CPU vs I/O bound). The default tries to use all processors, falling back to 7 if nproc fails.
    UGREP_CMD "ug+" Global The ugrep command to execute. Can be ug if filters are globally configured. Change if ugrep is installed with a different name/path, or to use a specific version.
    UGREP_OPTS_BASE "-r -i -l" Global Basic options passed to ugrep for every search: recursive (-r), case-insensitive (-i), list filenames only (-l). Modify to change core search behavior (e.g., remove -i for case-sensitive, add -w for whole word, -n for line numbers). Note that -l means matches.txt will contain filenames, not matched lines.
    UGREP_OPTS_ARCHIVES "-z" Global ugrep options for searching within archives. Remove or modify if archive searching is not needed or requires different parameters.
    UGREP_OPTS_INDEX "--index" (Commented) Global ugrep option to utilize pre-built indexes for faster searches. Uncomment and use if you have ugrep indexes for the SEARCH_LOCATIONS. Can significantly improve performance.
    SEARCH_LOCATIONS declare -a SEARCH_LOCATIONS=( "/mnt/external_hdd/pdf/"… ) Global Bash array defining the directories to be searched. Modify this list to define the scope of your search operations. Add or remove paths as needed.
    SEARCH_QUERIES declare -a SEARCH_QUERIES=( "jurime"… ) Global Bash array defining the search terms (queries). Modify this list with the keywords or patterns you are searching for.

    The structured organization of output directories (a unique MASTER_OUTPUT_ROOT for each run, with distinct search_results and search_logs subdirectories), coupled with a master log and task-specific logs, reflects a design that prioritizes traceability and simplifies debugging. If an issue arises with a particular search (e.g., unexpected results for “sociolo” in “/mnt/external_hdd/pdf/”), the MASTER_LOG_FILE provides the overarching context of the run, while the specific task log (e.g., search_logs/log_loc_pdf_query_sociolo.txt) offers detailed diagnostics for that individual operation.

    Furthermore, the ability to customize UGREP_CMD and the various UGREP_OPTS_* variables provides considerable flexibility. Advanced users can adapt ugrep’s behavior to very specific needs—such as adding filters for file types (--include, --exclude), ignoring binary files (--ignore-binary), or altering output formats—without needing to modify the core execution logic of the script. This separation of configuration from operational code enhances the script’s maintainability and adaptability.
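As a concrete illustration of this separation, a file passed via --config-file could narrow every search to particular file types without touching the script body. The filter patterns below are illustrative, not part of the original defaults:

```shell
# Illustrative --config-file fragment: keep the script's base behavior
# (-r -i -l) but add ugrep file-type filters and skip VCS metadata.
UGREP_OPTS_BASE=(-r -i -l --include='*.txt' --include='*.md' --exclude-dir='.git')
```

Because the script treats UGREP_OPTS_BASE as an array, multi-word option values survive intact; a plain string would be word-split.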

    B. Customization Guide:

    • Modifying Search Scope (SEARCH_LOCATIONS, SEARCH_QUERIES):
      • To change the directories to be searched, edit the SEARCH_LOCATIONS array. Add new paths within the parentheses, ensuring each path is quoted and separated by a space. Example: SEARCH_LOCATIONS=(“/data/projectA” “/data/projectB” “/mnt/archive/docs”).
      • To change the search terms, edit the SEARCH_QUERIES array in a similar manner. Example: SEARCH_QUERIES=(“error_code_500” “confidential_report” “ProjectPhoenix”).
      • The script’s use of Bash arrays for these configurations makes it straightforward to expand or contract the search scope by simply editing these lists. This design choice lowers the barrier to customization, as users do not need to delve into the script’s looping mechanisms to alter what is searched or where.
      • When populating SEARCH_LOCATIONS, note that the script validates each entry at startup (validate_search_locations): paths that do not exist, are not readable, or are not directories are logged as warnings and skipped, and the run aborts only if no valid location remains. A skipped path therefore never reaches ugrep, but the warning is easy to miss in a long master log, so double-check your paths.
    • Adjusting Concurrency (MAX_CONCURRENT_JOBS):
      • The MAX_CONCURRENT_JOBS variable controls how many search tasks run in parallel. The default value is determined by ${MAX_CONCURRENT_JOBS:-$(nproc --all || echo 7)}, which uses the total number of available CPU cores, or falls back to 7 if nproc is unavailable or fails.
      • To increase performance: If the system has ample CPU resources and fast I/O, and searches are I/O-bound, increasing MAX_CONCURRENT_JOBS might improve overall speed.
      • To reduce system load: If the script consumes too many resources or if searches are CPU-bound (especially if OCR is enabled), decrease MAX_CONCURRENT_JOBS.
      • Example: MAX_CONCURRENT_JOBS=4 to limit to 4 concurrent jobs.
    • Fine-tuning ugrep Behavior (UGREP_OPTS_BASE, UGREP_OPTS_ARCHIVES, UGREP_OPTS_INDEX):
      • Case Sensitivity: To make searches case-sensitive, remove the -i option from UGREP_OPTS_BASE. Example: UGREP_OPTS_BASE=(-r -l).
      • Whole Word Search: To search for whole words only, add the -w option. Example: UGREP_OPTS_BASE=(-r -i -l -w).
      • Including Line Numbers: If you need line numbers in the output (note: this changes matches.txt content from filenames to matching lines), remove -l and add -n. Example: UGREP_OPTS_BASE=(-r -i -n). This also changes how results should be interpreted, since the summary counts lines in matches.txt.
      • Using ugrep Indexes: If you have pre-built ugrep indexes for your SEARCH_LOCATIONS, set UGREP_OPTS_INDEX=(--index); the script already appends this array to the ugrep command line in perform_search_task. This can dramatically speed up searches.
      • Disabling Archive Search: If you do not want to search within archives, set UGREP_OPTS_ARCHIVES=().
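Putting these knobs together, a complete override file for --config-file might look like this (the file name and every value here are illustrative):

```shell
# myconf.sh -- sample override file for --config-file.
# Variable names follow the documentation above; values are examples.
MAX_CONCURRENT_JOBS=4        # cap parallelism on a shared machine
MAX_RETRIES=2                # three attempts per task in total
RETRY_DELAY_SECONDS=10
UGREP_CMD="ugrep"            # plain ugrep instead of ug+
UGREP_OPTS_BASE=(-r -l)      # case-sensitive, filenames only
UGREP_OPTS_ARCHIVES=()       # skip archive contents
UGREP_OPTS_INDEX=(--index)   # use pre-built indexes
```

The script sources this file after parsing its arguments, so anything set here wins over the in-script defaults but can still be combined with --locations-file and --queries-file.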

    #IV. Execution and Operational Guide

    A. Running the Script:

    The script is executed as a standard Bash script from the command line.

    1. Ensure the script file (e.g., batch_search_script.sh) has execute permissions: chmod +x batch_search_script.sh.
    2. Navigate to the directory containing the script, or provide its full path.
    3. Run the script: ./batch_search_script.sh.
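Assuming the script is saved as batch_search_script.sh (a placeholder name), typical invocations look like:

```shell
# Run with the defaults defined inside the script:
./batch_search_script.sh

# Supply locations and queries from files and choose the output root:
./batch_search_script.sh \
    --locations-file mydirs.txt \
    --queries-file myqueries.txt \
    -o ./results001

# Override tuning variables (concurrency, retries, ugrep options)
# from a sourced config file:
./batch_search_script.sh --config-file myconf.sh
```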

    The script accepts a small set of command-line options (--locations-file, --queries-file, --config-file, -o/--output-dir, -h); everything else is configured through the variables within the script file or a sourced config file, as detailed in Section III. This design suits both interactive use and automated or scheduled execution (e.g., via cron): static defaults can live in the script or a config file, while per-run parameters are supplied as options. For highly dynamic scenarios, a wrapper script can generate the locations, queries, and config files before each invocation.

    B. Understanding Script Output:

    The script produces output in two primary forms: messages to the console (standard output) and a structured set of directories and files.

    • Console Output:
      During execution, the script prints messages from the master_log function. These messages are simultaneously written to the MASTER_LOG_FILE and displayed on the console, providing real-time feedback on:
      • Script initialization and run ID.
      • The base output directory being used.
      • The maximum number of concurrent jobs.
      • Announcements when search tasks are launched.
      • Notifications when the script is waiting for concurrent job slots to become available.
      • A summary message when all tasks are launched and when all tasks have completed.
    • Output Directory Structure:
      For each run, the script creates a unique master output directory. The naming convention and contents are as follows:
    (Paths below are shown relative to the script location when "." is used.)
    • ./batch_search_run_YYYYMMDD_HHMMSS/ (MASTER_OUTPUT_ROOT)
      Root directory for a single execution of the script; the timestamp ensures uniqueness. Created by the master script. Contains all logs and results for this specific run.
    • ${MASTER_OUTPUT_ROOT}/master_orchestrator_log.txt (MASTER_LOG_FILE)
      Main log file for the script’s overall orchestration, written by the master_log function. Contains timestamps, script start/end, a configuration summary, task launch details, concurrency management messages, and overall status.
    • ${MASTER_OUTPUT_ROOT}/search_results/ (ALL_RESULTS_PARENT_DIR)
      Parent directory for all search result subdirectories, created by the master script. Organizes results by individual search task.
    • ${ALL_RESULTS_PARENT_DIR}/loc_<SANE_LOC>_query_<SANE_QUERY>/
      Task-specific results directory, created by perform_search_task; <SANE_LOC> and <SANE_QUERY> are sanitized versions of the location’s basename and the query. Holds the results for a single location/query pair.
    • ${ALL_RESULTS_PARENT_DIR}/loc_<SANE_LOC>_query_<SANE_QUERY>/matches.txt
      List of files that matched the query for this task, written from ugrep’s output by perform_search_task (content depends on UGREP_OPTS_BASE; typically filenames due to -l). If ugrep runs successfully but finds no matches, the empty file is removed by default.
    • ${MASTER_OUTPUT_ROOT}/search_logs/ (ALL_LOGS_PARENT_DIR)
      Parent directory for all task-specific log files, created by the master script. Organizes logs by individual search task.
    • ${ALL_LOGS_PARENT_DIR}/log_loc_<SANE_LOC>_query_<SANE_QUERY>.txt
      Task-specific log file, written by perform_search_task via its task_log function. Detailed log for a single location/query pair: start time, the exact ugrep command executed, ugrep standard error output, success/failure status, and the path to the results.

    The `sanitize_for_path` function plays a critical role in generating the `<SANE_LOC>` and `<SANE_QUERY>` components. It ensures that directory and file names derived from search locations (specifically, their basenames) and queries are valid for the filesystem, even if the original inputs contain spaces, slashes, or other special characters. This systematic and predictable naming convention is essential for users navigating the results and for any automated post-processing scripts that might consume these outputs.
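A minimal re-implementation sketch of such a sanitizer, built from the sed/tr pipeline described later in Section V.B; illustrative, not the script verbatim:

```shell
# Hypothetical sketch of sanitize_for_path, following the documented pipeline:
# strip/replace slashes, then spaces, then any remaining unsafe characters.
sanitize_for_path() {
  printf '%s' "$1" \
    | sed 's#^/*##;s#/*$##;s#/#_#g' \
    | tr ' ' '_' \
    | sed 's/[^a-zA-Z0-9_-]/_/g'
}

sanitize_for_path "My Documents & Files"; echo   # My_Documents___Files
sanitize_for_path "term with/slash"; echo        # term_with_slash
```

Note that the output is filesystem-safe but not reversible: distinct inputs can map to the same sanitized name.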

    A notable behavior is the removal of empty `matches.txt` files. If `ugrep` completes successfully for a task but finds no matching files, the script will remove the (empty) `matches.txt` file for that task. This helps to quickly identify tasks that yielded no results. However, it means the absence of a `matches.txt` file is ambiguous without consulting the corresponding task log: it could mean “no matches found” or “an error occurred preventing result generation” (though in the latter case, the script currently might not remove it unless the removal line for errors is uncommented). The task log is the definitive source for determining the outcome of each search task.
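The matches.txt ambiguity can be resolved programmatically by consulting the task log, as the text above suggests. A sketch of such a post-processing pass; a tiny fake run directory is fabricated here purely for the demo (real runs use the batch_search_run_* layout described in this section):

```shell
# Classify each task's outcome by combining matches.txt presence with its log.
RUN=$(mktemp -d)   # stand-in for a real batch_search_run_* directory
mkdir -p "$RUN/search_results/loc_docs_query_alpha" \
         "$RUN/search_results/loc_docs_query_beta" \
         "$RUN/search_logs"
echo "/docs/a.txt" > "$RUN/search_results/loc_docs_query_alpha/matches.txt"
echo "ERROR: ugrep command failed with exit code 2." \
  > "$RUN/search_logs/log_loc_docs_query_beta.txt"

summary=""
for dir in "$RUN"/search_results/*/; do
  task=$(basename "$dir")
  log="$RUN/search_logs/log_${task}.txt"
  if [ -s "$dir/matches.txt" ]; then
    line="$task: matches found"
  elif grep -q "ERROR:" "$log" 2>/dev/null; then
    line="$task: task failed (see task log)"
  else
    line="$task: no matches found"
  fi
  echo "$line"
  summary="$summary$line"$'\n'
done
rm -rf "$RUN"
```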

    #V. Internal Architecture and Core Logic

    A. Main Orchestration Flow:

    The script’s main execution logic, found after the function definitions and configuration, orchestrates the entire batch search process:

    1. Initialization:
      • The script first creates the primary output directories: ALL_RESULTS_PARENT_DIR and ALL_LOGS_PARENT_DIR within the unique MASTER_OUTPUT_ROOT. If these critical directories cannot be created, the script issues a fatal error message to standard error and exits with status 1.
      • The MASTER_LOG_FILE is initialized with a header and initial status messages, including the run ID, output directory, and configured concurrency level. Assumptions like ugrep availability are also logged.
    2. Task Generation and Launch Loop:
      • The script enters a nested loop structure, iterating through each directory path in the SEARCH_LOCATIONS array and, for each location, iterating through each term in the SEARCH_QUERIES array. This generates a unique search task for every location/query pair.
      • Concurrency Management: Before launching a new task, the script checks if the number of currently active_jobs has reached the MAX_CONCURRENT_JOBS limit.
        • If the limit is reached, the master log records that it’s waiting for a slot. The script then executes wait -n, which pauses execution until any one of the backgrounded child processes (search tasks) terminates.
        • Upon a job finishing, active_jobs is decremented. The script’s handling of wait -n’s exit status is designed to be somewhat resilient, decrementing active_jobs even on certain non-zero exit codes from wait -n to prevent potential infinite loops, though this area is noted in comments as potentially needing more robust handling for specific OS behaviors of wait -n.
      • Task Launch: Once a concurrency slot is available, the master_log announces the launch of the new task, specifying the query and location. The perform_search_task function is then invoked with the current location, query, and a task counter (total_tasks) as parameters. This function call is executed in the background using the & operator, allowing the main script to continue launching other tasks.
      • The Process ID (PID) of the backgrounded task is stored in the pids_list array (though this array is not actively used later in the current script version for specific PID waiting), and active_jobs is incremented.
    3. Shutdown and Cleanup:
      • After all location/query pairs have been processed and their corresponding tasks launched, the master log indicates that all tasks are underway.
      • The script then enters a while loop that continues as long as active_jobs is greater than zero. Inside this loop, wait -n is used again to wait for any of the remaining background jobs to complete. active_jobs is decremented each time a job finishes.
      • A final wait command is issued after the loop to ensure all background processes have indeed terminated before the script exits. This acts as a catch-all, as wait -n behavior can sometimes have subtleties.
      • The script concludes with final messages in the master log, indicating completion and pointing to the output directories, then exits with status 0.

    The export -f perform_search_task sanitize_for_path command ensures that these functions are available in the subshell environments created when tasks are backgrounded with &. Subshells created by & inherit function definitions in Bash anyway, so export -f mainly makes this availability explicit and robust, particularly if the method of invocation were ever changed (e.g., to bash -c "function_name").
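The orchestration flow in steps 1-3 above can be condensed into a runnable sketch. Here perform_search_task is replaced by a trivial stub, and the variable names mirror the documentation rather than the script verbatim:

```shell
# Condensed sketch of the launch loop, concurrency gate, and drain phase.
SEARCH_LOCATIONS=("locA" "locB" "locC")
SEARCH_QUERIES=("q1" "q2")
MAX_CONCURRENT_JOBS=2
active_jobs=0
total_tasks=0

perform_search_task() {   # stub worker: $1=location $2=query $3=task id
  sleep 0.1
}
export -f perform_search_task

for loc in "${SEARCH_LOCATIONS[@]}"; do
  for query in "${SEARCH_QUERIES[@]}"; do
    if [ "$active_jobs" -ge "$MAX_CONCURRENT_JOBS" ]; then
      wait -n                             # block until any one task exits
      active_jobs=$((active_jobs - 1))
    fi
    total_tasks=$((total_tasks + 1))
    perform_search_task "$loc" "$query" "$total_tasks" &
    active_jobs=$((active_jobs + 1))
  done
done

while [ "$active_jobs" -gt 0 ]; do        # drain remaining background jobs
  wait -n
  active_jobs=$((active_jobs - 1))
done
wait                                      # final catch-all
echo "completed $total_tasks tasks"
```

Note that wait -n requires Bash 4.3 or newer; on older shells the drain phase would need per-PID wait calls instead.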

    B. Key Functions:

    The script utilizes several key Bash functions to modularize its operations:

    • master_log() — Centralized logging for the main script’s orchestration. Input: $1 (message string). Appends a timestamped message to MASTER_LOG_FILE and prints it to stdout via tee. Called from the main script body.
    • sanitize_for_path() — Sanitizes a string so it can be used safely as part of a file or directory name. Input: $1 (e.g., a path basename or a query). Prints the sanitized string to stdout, using sed and tr to remove or replace problematic characters. Called by perform_search_task().
    • perform_search_task() — Executes a single search task for a given location and query. Inputs: $1 search_dir_param (search directory), $2 query_param (search term), $3 task_id_suffix (unique task identifier, e.g., a counter). Creates task-specific output directories and log files, executes ugrep, writes results to matches.txt, and logs detailed task progress and errors. Launched in the background from the main script loop.
    • 1. master_log()
      This function is responsible for all logging originating from the main orchestration part of the script. It takes a single argument (the message to be logged) and prepends the current date and time plus a log tag. The tee -a "$MASTER_LOG_FILE" command is a crucial part of its implementation: it appends the log message to the MASTER_LOG_FILE while simultaneously printing it to standard output. This design provides both persistent logging for later review and real-time feedback for a user running the script interactively.
    • 2. sanitize_for_path()
      This utility function is critical for creating valid and predictable directory and file names from potentially problematic inputs like search queries or directory basenames (which might contain spaces, slashes, or other special characters). It processes its input string through a pipeline of sed and tr commands:
      1. sed 's#^/*##;s#/*$##;s#/#_#g': removes leading and trailing slashes, and replaces any internal slashes (/) with underscores (_).
      2. tr ' ' '_': translates spaces into underscores.
      3. sed 's/[^a-zA-Z0-9_-]/_/g': replaces any character that is not alphanumeric, an underscore, or a hyphen with an underscore.
      The resulting sanitized string is printed to standard output and captured by the caller (typically perform_search_task) for use in constructing paths. While this sanitization is robust for many common cases, different inputs can sanitize to the same output string (e.g., "file&name" and "file@name" both become "file_name"). Given the script’s usage context—combining sanitized location basenames and query terms—the risk of problematic collisions leading to data overwrites within a single run is low, because each task is generated from a unique location/query pair in the input arrays. The task_id_suffix parameter in perform_search_task, though not currently used to disambiguate output paths beyond location/query, offers a mechanism for stricter uniqueness in future modifications.
    • 3. perform_search_task()
      This is the core workhorse function, executed as a background process for each search query and location pair. Its responsibilities include:
      • Path Setup: It first sanitizes the basename of the search directory and the query term using sanitize_for_path() to form a unique task_identifier (e.g., loc_pdf_query_jurime). This identifier is then used to create task-specific subdirectories within ALL_RESULTS_PARENT_DIR and ALL_LOGS_PARENT_DIR.
      • Task-Specific Logging: It defines an internal helper function, task_log(), which writes timestamped messages, tagged with the task_identifier, to a dedicated log file for that task (e.g., log_loc_pdf_query_jurime.txt). The task log is initialized with details about the query, location, and where results will be stored.
      • (Commented) OCR Pre-processing: Contains a commented-out block for optional OCR processing of PDF files (detailed in Section V.C).
      • ugrep Execution: It constructs and executes the ugrep command using UGREP_CMD, UGREP_OPTS_BASE, UGREP_OPTS_ARCHIVES, the query parameter, and the search directory path.
        • Standard output from ugrep (which, with the default -l option, is the list of matching filenames) is redirected to an output_file_matches (e.g., matches.txt) within the task’s result directory.
        • Standard error from ugrep is appended (2>>) to the task’s log file. This is crucial for capturing ugrep-specific errors or warnings.
      • Result Handling and Logging:
        • If the ugrep command exits successfully (exit code 0):
          • It checks if the output_file_matches has a non-zero size (-s). If so, it logs success and the location of the results file.
          • If the file is empty (ugrep ran successfully but found no matches), it logs this information and, by default, removes the empty output_file_matches file.
        • If the ugrep command fails (non-zero exit code):
          • It captures the ugrep_exit_code and logs an error message including this code. The (potentially empty or partial) output_file_matches is typically kept unless the script is modified to remove it on failure.
      • Cleanup (Commented): Includes commented-out lines for removing temporary OCR directories if OCR were used. The task_id_suffix parameter, passed from the main loop (as total_tasks), is logged and used in the (commented) task_ocr_temp_dir naming. While it doesn’t currently contribute to the uniqueness of the primary result/log paths (which are already unique per location/query pair from the main loop), it could be leveraged for finer-grained differentiation if, for example, multiple identical query/location tasks were run with different parameters within a more complex setup.
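The path setup, redirection pattern, and result handling described above can be summarized in a trimmed, self-contained skeleton. This is a hypothetical sketch, not the script verbatim: grep stands in for ug+ so the demo runs anywhere, and a sample docs directory is fabricated for illustration.

```shell
# Skeleton of perform_search_task: sanitize names, run the search, route
# stdout to matches.txt and stderr to the task log, then handle results.
UGREP_CMD="grep"               # stand-in; the script defaults to "ug+"
UGREP_OPTS_BASE=(-r -i -l)
UGREP_OPTS_ARCHIVES=()         # grep has no archive mode; left empty here
BASE=$(mktemp -d)
ALL_RESULTS_PARENT_DIR="$BASE/search_results"
ALL_LOGS_PARENT_DIR="$BASE/search_logs"
mkdir -p "$ALL_RESULTS_PARENT_DIR" "$ALL_LOGS_PARENT_DIR"

sanitize_for_path() {
  printf '%s' "$1" | sed 's#^/*##;s#/*$##;s#/#_#g' | tr ' ' '_' \
    | sed 's/[^a-zA-Z0-9_-]/_/g'
}

perform_search_task() {
  local search_dir_param="$1" query_param="$2" task_id_suffix="$3"
  local sane_loc sane_query task_identifier
  sane_loc=$(sanitize_for_path "$(basename "$search_dir_param")")
  sane_query=$(sanitize_for_path "$query_param")
  task_identifier="loc_${sane_loc}_query_${sane_query}"
  local result_dir="$ALL_RESULTS_PARENT_DIR/$task_identifier"
  local task_log_file="$ALL_LOGS_PARENT_DIR/log_${task_identifier}.txt"
  mkdir -p "$result_dir"
  local output_file_matches="$result_dir/matches.txt"
  # stdout (matching filenames, via -l) -> matches.txt; stderr -> task log
  if "$UGREP_CMD" "${UGREP_OPTS_BASE[@]}" "${UGREP_OPTS_ARCHIVES[@]}" \
       "$query_param" "$search_dir_param" \
       > "$output_file_matches" 2>> "$task_log_file"; then
    [ -s "$output_file_matches" ] || rm "$output_file_matches"
  else
    echo "ERROR: ugrep command failed with exit code $?." >> "$task_log_file"
  fi
}

# Demo: one task against a fabricated docs directory.
mkdir -p "$BASE/docs"
echo "TestKeyword here" > "$BASE/docs/fileA.txt"
perform_search_task "$BASE/docs" "testkeyword" 1
cat "$ALL_RESULTS_PARENT_DIR/loc_docs_query_testkeyword/matches.txt"
```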

    C. (Commented) OCR Pre-processing Module

    Within the perform_search_task function, there is a significant, currently commented-out block of code intended to provide Optical Character Recognition (OCR) capabilities for PDF files. If activated, this module would attempt to extract text from image-based or scanned PDFs, making their content searchable by ugrep.

    • Intended Functionality:
      1. The module uses find to locate all PDF files (-iname “*.pdf”) within the current search_dir_param.
      2. For each PDF found, it employs a heuristic to decide if OCR is necessary: pdftotext is used to extract text. If the length of the extracted text (after removing whitespace) is below an arbitrary threshold (100 characters: if [ ${#text_content_check} -lt 100 ]), the script assumes the PDF might be scanned or contain minimal selectable text, and thus could benefit from OCR.
      3. If a PDF is flagged for OCR, tesseract is invoked to perform OCR, attempting to convert the PDF’s content to text. The output text would be stored in a file within a task-specific temporary OCR directory (task_ocr_temp_dir).
      4. The task_ocr_temp_dir (containing these OCR-generated text files) would then potentially be added to current_search_paths, so ugrep would search both the original files and the OCR-extracted text.
    • Current State and Implications:
      • Experimental/Placeholder: The OCR logic is explicitly commented out, indicating it is not an active feature. The comments themselves suggest the heuristic is “very basic and might misclassify PDFs” and that the overall OCR logic “may need significant enhancement.”
      • Performance Impact: OCR is a CPU-intensive process. If activated, especially in its current synchronous form (processing PDFs one by one within each perform_search_task), it could dramatically slow down tasks that involve many PDFs requiring OCR. This would also increase the overall CPU load on the system. The script’s current concurrency model, managed by MAX_CONCURRENT_JOBS, might need re-evaluation if OCR becomes a frequent and lengthy operation within tasks, as it could lead to a few OCR-heavy tasks monopolizing the available job slots.
      • Dependencies: Activating this module introduces dependencies on pdftotext (from poppler-utils) and tesseract-ocr (including its language data).
      • Resource Management: Temporary files are created in task_ocr_temp_dir. The script includes a commented rm -rf “$task_ocr_temp_dir” for cleanup, which would be essential to manage disk space.
    • Considerations for Activation/Enhancement:
      • Improved Heuristic: The character count heuristic is simplistic. A more reliable method for identifying scanned PDFs might involve analyzing PDF structure or using more advanced tools.
      • Error Handling: Robust error handling for pdftotext and tesseract would be needed (e.g., what if tesseract fails on a specific PDF?).
      • Asynchronous OCR Processing: For better performance, OCR could be decoupled from the main search tasks. For example, a preliminary phase could identify all PDFs needing OCR across all locations, and a separate pool of OCR worker processes could handle them concurrently, storing results in a cache that ugrep tasks can then access.
      • Logging: Enhanced logging for the OCR process itself (e.g., which files were OCRed, success/failure, time taken per file) would be beneficial for diagnostics. The current commented tesseract command does redirect its output to the task log, which is a good starting point.

    This module, even in its commented state, indicates an ambition for the script to handle complex document types. Its current implementation serves as a basic prototype.
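The pdftotext length heuristic described in step 2 above can be sketched as a small predicate. This is an illustrative reconstruction (the 100-character threshold is the script's; the tesseract step is left commented, as it is in the script itself):

```shell
# needs_ocr: assume a PDF is scanned/image-based when the selectable text it
# yields (whitespace stripped) is shorter than 100 characters.
needs_ocr() {
  local pdf="$1" text_content_check
  text_content_check=$(pdftotext "$pdf" - 2>/dev/null | tr -d '[:space:]')
  [ "${#text_content_check}" -lt 100 ]
}

# Outline of the commented-out loop (hypothetical variable names):
# for each PDF under "$search_dir_param":
#   if needs_ocr "$pdf"; then
#     # tesseract would OCR the PDF's pages into "$task_ocr_temp_dir" here,
#     # with its output appended to "$task_log_file"
#     :
#   fi
```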

    #VI. Testing and Validation

    Ensuring the reliability and correctness of the Batch Search Orchestration Script requires a systematic approach to testing. This involves creating a controlled environment and defining test cases to verify various aspects of its functionality.

    A. Approach to Testing:

    A comprehensive testing strategy should encompass:

    1. Controlled Test Environment:
      • Set up a dedicated directory structure containing a variety of sample files and archives.
      • Include plain text files with known content.
      • Include archive files (e.g., .zip, .gz, .7z) containing text files with known content to test archive searching (UGREP_OPTS_ARCHIVES=(-z)).
      • If testing the OCR functionality (once activated), include both text-based PDFs and image-based (scanned) PDFs.
      • Use file and directory names that include spaces and special characters to test the sanitize_for_path function.
    2. Test Case Categories:
      • Core Search Correctness: Verify that ugrep finds known terms in specified files and does not find terms that are absent. Check that matches.txt contains the correct filenames.
      • ugrep Option Functionality: Test different UGREP_OPTS_BASE configurations (e.g., case sensitivity by removing -i, whole word search with -w).
      • Archive Searching: Confirm that terms within supported archives are found when UGREP_OPTS_ARCHIVES is active.
      • Concurrency Behavior: Set MAX_CONCURRENT_JOBS to a low number (e.g., 2) and run a larger number of tasks. Monitor system processes (e.g., using ps or htop) to confirm that no more than the specified number of ugrep processes run simultaneously. Check MASTER_LOG_FILE for messages about waiting for job slots.
      • Output Structure and Naming: Verify that the MASTER_OUTPUT_ROOT directory and its subdirectories (search_results, search_logs) are created correctly. Check that task-specific directories and log files use the expected sanitized names.
      • Log Content Verification:
        • Inspect MASTER_LOG_FILE for accurate recording of script startup, task launching, concurrency management, and completion messages.
        • Inspect individual task logs for correct query/location information, the exact ugrep command executed, logged errors (if any, including ugrep stderr), and success/failure status.
      • Path Sanitization: Use SEARCH_LOCATIONS with basenames containing spaces or special characters (e.g., "/mnt/my data/path with &/") and SEARCH_QUERIES with similar characteristics (e.g., "term with/slash"). Verify that the generated output directory names (e.g., loc_path_with___query_term_with_slash) are valid and the script operates without error.
      • Error Handling:
        • Test with a path in SEARCH_LOCATIONS that does not exist or is unreadable. Verify that the relevant task log records an error and other tasks proceed.
        • Introduce deliberate errors into ugrep options to see how ugrep failures are logged.
        • Test permission issues by trying to write MASTER_OUTPUT_ROOT to a read-only location (should fail early) or read from a restricted SEARCH_LOCATIONS directory.
      • (If OCR enabled) OCR Functionality: Test with PDFs that are known to be scanned. Verify that OCR is attempted, text is extracted (check temporary OCR files if possible), and ugrep subsequently finds terms in this extracted text.
    3. Documentation Unit Tests (Conceptual):
      This refers to a process of verifying that the script’s actual behavior aligns with its documentation. For this script, it would involve:
      • Configuration Validation: For each key configuration variable documented (e.g., MAX_CONCURRENT_JOBS, UGREP_OPTS_BASE), design a test run that specifically relies on that variable’s documented effect. For example, to test the documentation for the default UGREP_OPTS_BASE=(-r -i -l) (recursive, case-insensitive, list filenames only), create a test where a term exists in a subdirectory and with mixed case, and verify that it is found and only the filename is listed.
      • Functional Verification: For major functions like perform_search_task, document its expected inputs, outputs, and side effects (like log entries). Then, design tests that trigger this function and assert that these documented outcomes occur. For example, document that a failed ugrep command results in an error message in the task log containing the exit code. A test would then force a ugrep failure and check the task log for this specific message format.
      • Feedback Loop: If a “documentation unit test” fails, it indicates a discrepancy: either the script’s behavior is incorrect, or the documentation is inaccurate/outdated. This process helps maintain synchronization between the code and its description.

    The script’s deterministic output structure (for the same input files, queries, and script version, the output files and most log content, excluding timestamps/PIDs, should be identical) is conducive to automated testing. Test suites can be built to run the script with predefined test data and then use tools like diff to compare the generated output (e.g., matches.txt files, key log messages) against “golden” reference files.
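A minimal sketch of the diff-based "golden file" comparison suggested above; the run output and reference file are fabricated here purely for illustration:

```shell
# Compare a fresh matches.txt against a stored reference copy. Sorting both
# sides makes the comparison insensitive to listing order.
T=$(mktemp -d)
printf '%s\n' "docs/fileA.txt" > "$T/matches.txt"         # simulated run output
printf '%s\n' "docs/fileA.txt" > "$T/golden_matches.txt"  # "golden" reference
if diff -q <(sort "$T/matches.txt") <(sort "$T/golden_matches.txt") >/dev/null; then
  result=PASS
else
  result=FAIL
fi
echo "golden comparison: $result"
rm -rf "$T"
```

In a real test suite, the golden files would be committed alongside the test data, and timestamps/PIDs would be filtered out of log files before comparison.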

    B. Example Test Cases (Illustrative):

    • Test Case 1: Basic Search & Case Insensitivity
      • Setup:
        • test_env/docs/fileA.txt containing "TestKeyword".
        • test_env/docs/fileB.txt containing "AnotherString".
      • Script Configuration:
        • SEARCH_LOCATIONS=("test_env/docs")
        • SEARCH_QUERIES=("testkeyword")
        • UGREP_OPTS_BASE=(-r -i -l) (default)
      • Action: Run the script.
      • Expected:
        • A matches.txt file for this task should exist, containing the path to fileA.txt.
        • The task log should indicate success.
        • fileB.txt should not be listed.
    • Test Case 2: Archive Search Functionality
      • Setup:
        • test_env/archives/data.zip which contains zipped_doc.txt with the content "ArchiveTerm".
      • Script Configuration:
        • SEARCH_LOCATIONS=("test_env/archives")
        • SEARCH_QUERIES=("archiveterm")
        • UGREP_OPTS_ARCHIVES=(-z) (default)
      • Action: Run the script.
      • Expected:
        • matches.txt for this task should list test_env/archives/data.zip (or a path pointing within it, depending on ugrep’s output format for archive contents with -l).
        • Task log indicates success.
    • Test Case 3: Concurrency Limit Verification
      • Setup:
        • At least 3 distinct directories in SEARCH_LOCATIONS (e.g., test_env/dir1, test_env/dir2, test_env/dir3), each large enough to make ugrep run for a noticeable time.
        • One query in SEARCH_QUERIES.
      • Script Configuration:
        • MAX_CONCURRENT_JOBS=2
      • Action: Run the script. While running, monitor active ugrep processes (e.g., watch "ps aux | grep '[u]grep'").
      • Expected:
        • No more than 2 ugrep processes should be seen running concurrently on behalf of the script.
        • The MASTER_LOG_FILE should contain entries like "...Reached max concurrent jobs (2). Waiting for a slot...".
    • Test Case 4: Path Sanitization Robustness
      • Setup:
        • A directory named test_env/My Documents & Files/.
      • Script Configuration:
        • SEARCH_LOCATIONS=("test_env/My Documents & Files/")
        • SEARCH_QUERIES=("A test query/with-slashes")
      • Action: Run the script.
      • Expected:
        • The script should complete without errors related to file/directory naming.
        • Output directories under search_results and search_logs should have sanitized names, e.g., loc_My_Documents___Files_query_A_test_query_with-slashes.

    #VII. Troubleshooting and Error Handling

    Effective troubleshooting relies on understanding the script’s logging mechanisms and common failure points.

    A. Interpreting Log Files:

    The script generates two types of log files for each run, both located within the MASTER_OUTPUT_ROOT directory:

    • Master Log File (master_orchestrator_log.txt):
      • Purpose: Provides a high-level overview of the entire script execution.
      • Contents:
        • Timestamp of script start and end.
        • The unique Run ID (basename of MASTER_OUTPUT_ROOT).
        • Key configuration parameters used (e.g., MAX_CONCURRENT_JOBS).
        • Announcements for each task launched (query and location).
        • Messages related to concurrency management (e.g., "Reached max concurrent jobs... Waiting...").
        • Summary status messages (e.g., "All tasks have been launched," "All search tasks have completed").
      • Usage: Consult this log first to understand the overall progress of the batch job, identify which tasks were initiated, and see any script-level orchestration issues.
    • Task-Specific Log Files (search_logs/log_loc_<SANE_LOC>_query_<SANE_QUERY>.txt):
      • Purpose: Provides detailed information about the execution of a single search task (one query in one location).
      • Contents:
        • Header indicating the query and location for the task.
        • Timestamp of task start and end.
        • The exact ugrep command that was executed for this task. This is invaluable for debugging ugrep specific issues.
        • Standard error output (stderr) from the ugrep command. Any errors or warnings generated by ugrep itself (e.g., "permission denied" for a specific file it tried to read, or "invalid option") will appear here.
        • Status messages: "SUCCESS: Matches found," "INFO: No matches found," or "ERROR: ugrep command failed with exit code X."
        • Path to the results file (matches.txt) if successful.
      • Usage: If the master log indicates a problem with a specific task, or if a task produces unexpected results (or no results), this is the primary log to inspect. The captured ugrep stderr is particularly critical for diagnosing issues related to the search operation itself.

    B. Common Issues and Resolutions:

    • Issue: Script fails immediately or reports “command not found” for ugrep, nproc, etc.
      • Cause: A required software dependency (see Section II.A) is not installed or not in the system’s PATH. The UGREP_CMD variable might be set to a command that doesn’t exist (e.g., ug+ if only ug is installed).
      • Resolution: Install the missing software. Verify that UGREP_CMD points to the correct ugrep executable. Ensure all dependencies are accessible from the environment where the script is run.
    • Issue: Permission denied errors appear in task logs (from ugrep stderr) or script fails to create output directories.
      • Cause:
        • The script lacks read access to one or more directories listed in SEARCH_LOCATIONS or files within them.
        • The script lacks write/execute permissions for the directory where MASTER_OUTPUT_ROOT is being created.
      • Resolution: Adjust file system permissions for the search target locations and the output directory. Ensure the user running the script has the necessary rights.
    • Issue: Task log shows “ERROR: ugrep command failed with exit code X.”
      • Cause: The ugrep command itself terminated with an error. The specific exit code X provides a clue.
      • Resolution:
        • Consult the ugrep documentation for the meaning of exit code X.
        • Examine the exact ugrep command logged in the task log for syntax errors, incorrect options, or issues with the query pattern.
        • Check the files/directories ugrep was attempting to search for corruption, special permissions, or other accessibility issues.
    • Issue: Script runs very slowly, or system becomes unresponsive.
      • Cause:
        • MAX_CONCURRENT_JOBS is set too high for the system’s CPU or I/O capacity.
        • Searching extremely large numbers of files, very large individual files, or many deep directory structures.
        • Searching many large archives can be I/O and CPU intensive.
        • If OCR is enabled, processing many PDFs, especially scanned ones, can be very CPU intensive and slow.
      • Resolution:
        • Reduce MAX_CONCURRENT_JOBS.
        • Refine SEARCH_LOCATIONS to be more targeted, or break down very large search scopes into smaller, separate script runs.
        • If OCR is the bottleneck: consider disabling it if not essential, or explore optimizing the OCR process (see Section V.C).
        • If searching archives is slow, consider if it’s always necessary or if UGREP_OPTS_ARCHIVES can be refined.
    • Issue: Output directories or files have unexpected names, or some seem to be missing.
      • Cause:
        • An error occurred before the file/directory was supposed to be created (check logs).
        • Extremely unusual characters in SEARCH_LOCATIONS basenames or SEARCH_QUERIES that might not be fully handled by sanitize_for_path (though its current implementation is fairly robust).
        • If matches.txt is missing for a task, it could mean no matches were found (and the empty file was removed), or an error occurred.
      • Resolution: Always check the relevant task log first. It will indicate whether the task succeeded, found no matches, or encountered an error. Verify the input strings in SEARCH_LOCATIONS and SEARCH_QUERIES.

    C. Understanding ugrep Exit Codes:

    The perform_search_task function captures and logs the exit code from ugrep if it fails. Common ugrep (and grep-like utilities) exit codes include:

    • 0: Success. For ugrep with options like -l (list files), this typically means the command ran successfully, and if any matching files were found, they are listed. If no files match, the command might still exit with 0, but the output file will be empty. The script distinguishes this:
      ```bash
      if "$UGREP_CMD" ... > "$output_file_matches" 2>> "$task_log_file"; then
          # ugrep exit code was 0
          if [ -s "$output_file_matches" ]; then
              task_log "SUCCESS: Matches found..."
          else
              task_log "INFO: No matches found."
              rm "$output_file_matches"   # empty file removed
          fi
      else
          # ugrep exit code was non-zero (an error)
          local ugrep_exit_code=$?
          task_log "ERROR: ugrep command failed with exit code $ugrep_exit_code."
      fi
      ```
      This logic correctly differentiates "successful run with matches," "successful run with no matches," and "ugrep command error."
    • 1: Conventionally, for grep, this means no matches were found. However, ugrep’s behavior, especially with various option combinations, should be consulted directly from its documentation. The script’s check for an empty output file ([ -s "$output_file_matches" ]) after a successful exit code (0) is a more reliable way to determine whether matches were actually written.
    • >1 (e.g., 2 or higher): Typically indicates an error during ugrep’s execution (e.g., syntax error in options, inaccessible file/directory, etc.). The specific error messages from ugrep (redirected to the task log) are essential for diagnosing these issues.
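The 0 / 1 / >1 convention above can be demonstrated with plain grep as a stand-in for ugrep (both follow the grep-family convention):

```shell
# Observe the three exit-status cases: match, no match, and error.
tmp=$(mktemp)
echo "hello" > "$tmp"
grep -q hello  "$tmp";                    match_rc=$?    # 0: match found
grep -q absent "$tmp";                    nomatch_rc=$?  # 1: no match
grep -q hello  /no/such/file 2>/dev/null; error_rc=$?    # 2: error (missing file)
echo "$match_rc $nomatch_rc $error_rc"
rm -f "$tmp"
```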

    #VIII. Maintainability and Extensibility

    The script exhibits several design choices that contribute to its maintainability and offer avenues for future extensions.

    A. Code Structure and Design Decisions:

    • Modularity: The use of functions (master_log, sanitize_for_path, perform_search_task) encapsulates distinct pieces of logic. This makes the script easier to understand, debug, and modify, as changes within one function are less likely to have unintended consequences elsewhere, provided the function’s interface (parameters and expected output/side-effects) is respected.
    • Centralized Configuration: Grouping configurable variables at the beginning of the script allows users to tailor the script’s behavior without needing to search through the entire codebase.
    • Non-Destructive Reruns: The generation of unique, timestamped master output directories (MASTER_OUTPUT_ROOT) for each run ensures that previous results and logs are not overwritten. This is crucial for auditing and comparing results across different runs or configurations.
    • Detailed Logging: The dual logging system (master log and task-specific logs) provides comprehensive traceability, which is invaluable for debugging operational issues and understanding the script’s behavior during complex batch jobs.
    • Concurrency Model: The script implements a straightforward concurrency model using background processes and wait -n. While effective for many scenarios, its potential limitations with wait -n on some systems or under certain signal conditions are noted in the script’s comments, indicating an awareness of areas for potential refinement.
    • Readability: The script is generally well-commented, explaining the purpose of various sections and complex commands. It avoids overly obscure Bashisms, making it relatively accessible to individuals with moderate Bash scripting proficiency.

    B. Guidelines for Modifying Search Parameters or ugrep Options:

    • Primary Method: Users should modify search parameters (SEARCH_LOCATIONS, SEARCH_QUERIES) and ugrep configurations (UGREP_CMD, UGREP_OPTS_BASE, UGREP_OPTS_ARCHIVES, UGREP_OPTS_INDEX) by editing their definitions in the configuration section at the top of the script.
    • Testing: After any modification, thoroughly test the script with a representative subset of data to ensure the changes produce the desired behavior and have not introduced regressions. Refer to Section VI for testing strategies.
    • Impact Awareness: Be mindful that changes to ugrep options can significantly alter the search behavior, performance, and the format of results in matches.txt. For example, removing -l (list filenames) from UGREP_OPTS_BASE would cause matches.txt to contain actual matching lines instead of just filenames, which could impact downstream processing of these files.
    • Option Separation: The separation of ugrep options into UGREP_OPTS_BASE, UGREP_OPTS_ARCHIVES, and UGREP_OPTS_INDEX is a good design choice, as it allows more targeted adjustments. For instance, archive searching can be toggled by altering UGREP_OPTS_ARCHIVES without affecting the core recursion or case-insensitivity settings in UGREP_OPTS_BASE.
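A sketch of how the three arrays compose into a single invocation (the -e/path arguments are placeholders; the script’s exact argument order may differ):

```shell
# The three option groups are kept separate so each can be tuned independently.
UGREP_CMD="ug+"
UGREP_OPTS_BASE=(-r -i -l)        # recursion, case-insensitivity, list filenames
UGREP_OPTS_ARCHIVES=(-z)          # set to () to skip searching inside archives
UGREP_OPTS_INDEX=()               # e.g. (--index) to enable index-based search

# Compose the full command; an empty array simply contributes no arguments.
cmd=("$UGREP_CMD" "${UGREP_OPTS_BASE[@]}" "${UGREP_OPTS_ARCHIVES[@]}" \
     "${UGREP_OPTS_INDEX[@]}" -e "quanti" -- "/docs/")
printf '%s\n' "${cmd[*]}"
```

Because the options live in arrays rather than flat strings, entries containing spaces are passed to ugrep as single arguments, and disabling a whole group is a one-line change.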

    C. Considerations for Activating/Enhancing OCR Functionality:

    If the decision is made to activate or further develop the OCR module:

    • Prerequisites: Ensure pdftotext (from the Poppler utilities) and tesseract (with the necessary language packs) are installed and accessible.
    • Uncomment Code: Carefully uncomment the relevant lines in the “OCR Configuration” section (e.g., OCR_BASE_TEMP_DIR initialization and mkdir) and within the perform_search_task function (the OCR processing loop, addition of task_ocr_temp_dir to current_search_paths, and cleanup of task_ocr_temp_dir).
    • Performance Profiling: Anticipate a significant performance impact, especially on tasks with many PDFs. Profile the script’s execution time and resource usage (CPU, memory, disk I/O) with OCR enabled. The current synchronous, per-task OCR processing model may become a bottleneck.
    • Heuristic Refinement: The existing heuristic (${#text_content_check} -lt 100) for deciding which PDFs to OCR is very basic. Investigate more reliable methods to identify scanned/image-based PDFs.
    • Error Handling & Logging: Implement more robust error handling for pdftotext and tesseract failures. Enhance logging to record which PDFs are attempted for OCR, success/failure status, and processing time per PDF. The current redirection of tesseract output to the task log (>> "$task_log_file" 2>&1) is a good foundation.
    • Resource Management: Ensure the OCR_BASE_TEMP_DIR is appropriate (e.g., sufficient disk space, correct permissions) and that the cleanup mechanism (rm -rf "$task_ocr_temp_dir") is reliable.
    • Alternative Architectures: For substantial OCR workloads, consider re-architecting the OCR component to run as a separate, asynchronous pipeline that pre-processes PDFs and caches their text content, rather than performing OCR synchronously within each search task. This could improve overall throughput and resource utilization.
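The length heuristic mentioned above can be isolated for easier refinement and testing; looks_scanned is a hypothetical helper, not a function from the script, where the check is inlined:

```shell
# Hypothetical helper isolating the length heuristic: a PDF whose extracted
# text is shorter than MIN_CHARS characters is treated as a scan candidate.
MIN_CHARS=100
looks_scanned() {
  [ "${#1}" -lt "$MIN_CHARS" ]
}

# Intended usage (pdftotext/tesseract must be installed):
#   text_content_check="$(pdftotext "$pdf" - 2>/dev/null)"
#   looks_scanned "$text_content_check" && echo "queue $pdf for OCR"
```

Keeping the decision in one function means a smarter test (e.g., counting embedded images or fonts) can later replace the character count without touching the OCR loop itself.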

    D. Suggestions for Future Enhancements:

    • Input Validation:
      • Add checks at the beginning of the script to validate paths in SEARCH_LOCATIONS (e.g., verify existence and readability). This could prevent tasks from failing midway due to simple configuration errors.
      • Validate that ugrep (and other crucial commands) are available.
    • Advanced Concurrency Management:
      • For OCR, explore a dedicated worker pool or queue system to manage CPU-intensive OCR tasks independently of I/O-bound ugrep tasks.
      • Implement more sophisticated PID tracking for background jobs instead of relying solely on active_jobs count and wait -n, which could offer more precise control and error reporting for individual job failures.
    • External Configuration:
      • Allow SEARCH_LOCATIONS and SEARCH_QUERIES to be specified via command-line arguments (e.g., pointing to files containing lists of locations/queries) or a dedicated configuration file, rather than requiring direct script modification. This would enhance usability for dynamic search definitions.
    • Summary Reporting:
      • Generate a summary report at the end of the script execution, possibly written to a file in MASTER_OUTPUT_ROOT. This report could include:
        • Total number of tasks processed.
        • Number of tasks that succeeded, failed, or found no matches.
        • A list of tasks that encountered errors, with pointers to their specific logs.
        • Total number of matching files found across all successful tasks.
      Such a report would provide a quick, high-level overview of the batch run’s outcome without requiring manual inspection of all logs.
    • Granular ugrep Control:
      • Allow different ugrep options per location or per query, if such granularity is needed (e.g., by defining options alongside entries in SEARCH_LOCATIONS or SEARCH_QUERIES).
    • Enhanced Error Recovery/Retries: For certain types of transient errors (e.g., temporary network issue for a remote search location), implement an optional retry mechanism for failed tasks.
    • Plugin Architecture for Pre-processors: Generalize the OCR pre-processing concept to allow other types of file pre-processors (e.g., for specific binary formats) to be more easily integrated.
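The input-validation suggestion above can be sketched as a pre-flight check; validate_inputs and REQUIRED_CMDS are hypothetical names, not part of the current script:

```shell
# Hypothetical pre-flight check: verify that required commands exist and that
# every search location is a readable directory before any tasks are launched.
validate_inputs() {
  local cmd loc bad=0
  for cmd in "${REQUIRED_CMDS[@]}"; do
    command -v "$cmd" >/dev/null 2>&1 || { echo "missing command: $cmd" >&2; bad=1; }
  done
  for loc in "${SEARCH_LOCATIONS[@]}"; do
    [ -d "$loc" ] && [ -r "$loc" ] || { echo "unreadable location: $loc" >&2; bad=1; }
  done
  return "$bad"
}

REQUIRED_CMDS=(grep)          # e.g. (ug+ pdftotext tesseract) in a real run
SEARCH_LOCATIONS=(/tmp)
validate_inputs && echo "inputs OK"
```

Calling such a function once at startup, and aborting on a nonzero return, turns configuration typos into an immediate, readable failure instead of a mid-run task error.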

    These enhancements could further improve the script’s robustness, flexibility, and user-friendliness, though they would also increase its complexity. The trade-off between simplicity and feature richness should be considered for each potential addition.

    URL: https://ib.bsb.br/ugrep
    Ref. https://github.com/Genivia/ugrep