[Experimental]

Usage

find_ref_to_df(input = NULL, no_layout = FALSE, clean_ref = TRUE)

parse_ref_to_df(input = NULL, clean_ref = TRUE)

Arguments

input

Vector of file paths to the documents to be analyzed (PDF for find_ref_to_df() and text for parse_ref_to_df()).

no_layout

Logical; if TRUE, the '--no-layout' option is used in find_ref_to_df(), which might be necessary for some PDFs (e.g., use this if your document uses a multi-column layout). Ignored in parse_ref_to_df(). Default is FALSE.

clean_ref

Logical; if TRUE, cleans the references using the clean_ref() function after conversion (applicable to both functions). Default is TRUE. See clean_ref() for details on what the function does.

Value

A tidy data frame with one row per reference, including metadata (author, title, etc.), unique identifiers for each reference and document, and the complete original reference.

Details

These functions convert references found in PDF documents or parsed from text files into tidy data frames. find_ref_to_df() uses the find_ref() function for PDFs, and parse_ref_to_df() uses the parse_ref() function for text files.

find_ref_to_df() analyzes PDF documents and extracts all references, converting them into a structured data frame. It requires the 'anystyle' Ruby gem and uses both the 'find' and 'parse' features (find_ref() and parse_ref() respectively) to gather detailed information about each reference.

parse_ref_to_df() works similarly but is designed for text documents. It parses structured references from text files and converts them into a data frame.

These functions create two identifiers, so that each reference can be located both within its document and across the entire set of documents:

  • id_doc: A unique identifier for each document based on its position in the input.

  • id_ref: A unique identifier for each reference within its document. It's a combination of id_doc and the reference's row number within the document, ensuring each reference across all documents has a unique ID.
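The identifier scheme described above can be sketched in base R. This is a minimal illustration, not the package's actual implementation: the helper add_ids(), the column name full_ref, the sample references, and the "_" separator in id_ref are all assumptions made for the example.

```r
# Hypothetical parsed output: one data frame per input document
# (the column name `full_ref` and the contents are made up for illustration).
refs_by_doc <- list(
  data.frame(full_ref = c("Reference A", "Reference B")),
  data.frame(full_ref = c("Reference C", "Reference D", "Reference E"))
)

# Hypothetical helper: tag each reference with its document's position in the
# input (id_doc) and with an identifier combining id_doc and the row number.
# The "_" separator is an assumption; the real format may differ.
add_ids <- function(refs, doc_index) {
  refs$id_doc <- doc_index
  refs$id_ref <- paste(doc_index, seq_len(nrow(refs)), sep = "_")
  refs
}

# Apply the helper to every document, then stack the results into one data frame.
all_refs <- do.call(rbind, Map(add_ids, refs_by_doc, seq_along(refs_by_doc)))
```

Because id_ref embeds id_doc, two references that share a row number in different documents still receive distinct identifiers.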

See also

find_ref(), parse_ref(), and clean_ref() for related functionality.

Examples

if (FALSE) {
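# For one or more PDF documents
pdf_references_df <- find_ref_to_df(input = c(
  "path/to/document1.pdf",
  "path/to/document2.pdf"
))

# For a PDF with a multi-column layout, try the '--no-layout' option
multicol_references_df <- find_ref_to_df(
  input = "path/to/document1.pdf",
  no_layout = TRUE
)

# For a text file
text_references_df <- parse_ref_to_df(input = "path/to/references.txt")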
}