Using Ranystyle: Extracting, Parsing and Cleaning Bibliographic References
Aurélien Goutsmedt
Source:vignettes/using_Ranystyle.Rmd
using_Ranystyle.Rmd
Introduction
Ranystyle is an R package designed to automate the extraction, parsing, and cleaning of bibliographic references from PDF and text documents. It relies on anystyle application.1 This vignette provides a comprehensive guide to using Ranystyle’s functionalities, including extracting references from PDFs, parsing references from text files, and cleaning and organizing reference data.
Initial Setup: installing anystyle
Before using the functionalities to extract or parse references, you need to ensure that the anystyle Ruby gem is installed on your system.
The install_anystyle()
function simplifies the
installation of the necessary ‘anystyle’ components. Ensure that the
Ruby environment is properly configured and that gem can be called from
the command line. See more details on Ruby’s installation here and on RubyGems
here.
# Install anystyle
install_anystyle()
Extracting References from PDF and text files
The find_ref()
function is used to extract references
from PDF documents. It utilizes the ‘anystyle’ Ruby gem
find
function to analyze the document and identify
reference sections from PDF. The parse_ref()
function,
coming from anystyle parse
function, is designed to parse
structured references from text files or from a vector of character. It
reads the text, applies parsing rules, and converts the references into
a structured format.
The package incorporates various academic PDFs with bibliographies, as well as the bibliographies of these PDFs extracted in a .txt format, in order to manipulate the different functions of the graph.
# Path to the documents of the package
pdf_path <- system.file("extdata", package = "Ranystyle")
files <- list.files(pdf_path)
files
#> [1] "example_doc_1.pdf" "example_doc_1.txt" "example_doc_2.pdf" "example_doc_2.txt" "example_doc_3.pdf" "example_doc_3.txt"
You can save the parsed references in various format:
bib
, json
, xml
or
ref
(a text document with one reference per line). By
setting path = ""
in find_ref()
or
parse_ref()
, you can directly extract the results in R,
rather than saving the results. Let’s see an example with a PDF.2
doc <- files[1]
print(doc)
#> [1] "example_doc_1.pdf"
extracted_refs <- find_ref(input = paste0(pdf_path, "/", doc),
path = "",
output_format = "bib",
no_layout = FALSE)
#> [1] "anystyle -f bib find C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_1.pdf "
# Print the extracted references
cat(stringr::str_trunc(extracted_refs, 523))
#> @article{arnold2023a, author = {Arnold, M.}, date = {2023}, title = {ECB must do more to tackle inflation “Monster”, says christine lagarde’}, journal = {Financial Times}}@article{barro1983a, author = {Barro, R.J. and Gordon, D.B.}, date = {1983}, title = {Rules, discretion and reputation in a model of monetary policy’}, volume = {12}, pages = {101–121}, url = {https://doi.org/10.1016/0304-3932(83)90051-X.}, doi = {10.1016/0304-3932(83)90051-X.}, journal = {Journal of Monetary Economics}, number = {1}}...
Or we can parse reference from a .txt.
doc <- files[2]
print(doc)
#> [1] "example_doc_1.txt"
extracted_refs <- parse_ref(input = paste0(pdf_path, "/", doc),
path = "",
output_format = "xml")
#> [1] "anystyle --no-overwrite -f xml parse C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_1.txt "
# Print the extracted references
cat(stringr::str_trunc(extracted_refs, 615))
#> <?xml version="1.0" encoding="UTF-8"?><dataset> <sequence> <author>Arnold, M.</author> <date>(2023).</date> <title>ECB must do more to tackle inflation “Monster”, says christine lagarde’.</title> <journal>Financial Times.</journal> </sequence> <sequence> <author>Barro, R. J., & Gordon, D. B.</author> <date>(1983).</date> <title>Rules, discretion and reputation in a model of monetary policy’.</title> <journal>Journal of Monetary Economics,</journal> <volume>12(1),</volume> <pages>101–121.</pages> <url>https://doi.org/10.1016/0304-3932(83)90051-X.</url> </sequence> ...
parse_ref()
can work directly on a vector of character,
with one reference per line.
# a set of false references
references <- c(
"Smith, J. (2020). Understanding Artificial Intelligence. AI Journal, 35(5), 123-145.",
"Johnson, L., and Brown, C. (2018). Advances in Machine Learning. New York: Academic Press.",
"Clark, E., & Wright, R. (2019). Robotics and Automation. London: Springer, 200-250.",
"Davis, M. 2021. Quantum Computing: A New Era. Quantum Computing, 5(3):300-320.",
"Adams, James, and Murphy, Finn. (2022). Exploring Virtual Reality. VR World, 10(7), 777-800."
)
extracted_refs <- parse_ref(input = references,
path = "",
output_format = "json")
#> [1] "anystyle --no-overwrite -f json parse ref_to_parse.txt "
# Print the extracted references
cat(stringr::str_trunc(extracted_refs, 640))
#> [ { "author": [ { "family": "Smith", "given": "J." } ], "date": [ "2020" ], "title": [ "Understanding Artificial Intelligence" ], "volume": [ "35" ], "pages": [ "123–145" ], "type": "article-journal", "container-title": [ "AI Journal" ], "issue": [ "5" ] }, { "author": [ { "family": "Johnson", "given": "L." }, { "family": "Brown", "given": "C." } ], "date": [ "2018" ], "title": [ "Advances in Machine Learning" ], "location": [ "New York" ], ...
From Reference to Data Frame
Depending on whether your start from PDF documents or from text
documents or vectors, you can use two functions:
find_ref_to_df()
and parse_ref_to_df()
. These
functions create a data frame (a tibble
more precisely)
gathering the various information of the references extracted.
doc <- files[1]
print(doc)
#> [1] "example_doc_1.pdf"
data_ref <- find_ref_to_df(input = paste0(pdf_path, "/", doc),
no_layout = FALSE)
#> [1] "anystyle -f json find C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_1.pdf "
#> [1] "anystyle --overwrite -f ref find C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_1.pdf ./"
print(data_ref)
#> # A tibble: 81 × 23
#> id_doc doc id_ref author title year `container-title` volume pages location publisher type date other_date other_title url issue doi
#> <int> <chr> <chr> <list> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 example_do… 1_1 <tibble> ECB … 2023 Financial Times <NA> <NA> <NA> <NA> arti… 2023 <NA> <NA> <NA> <NA> <NA>
#> 2 1 example_do… 1_2 <tibble> Rule… 1983 Journal of Monet… 12 101–… <NA> <NA> arti… 1983 <NA> <NA> http… 1 10.1…
#> 3 1 example_do… 1_3 <tibble> Idea… 2009 Journal of Europ… 16 701–… <NA> <NA> arti… 2009 <NA> <NA> <NA> 5 <NA>
#> 4 1 example_do… 1_4 <tibble> A st… 1999 Scottish Journal… 46 17–39 <NA> <NA> arti… 1999 <NA> <NA> <NA> 1 <NA>
#> 5 1 example_do… 1_5 <tibble> Tech… 2018 International Po… 12 328–… <NA> <NA> arti… 2018 <NA> <NA> <NA> 4 <NA>
#> 6 1 example_do… 1_6 <tibble> Late… 2003 Journal of Machi… 3 <NA> <NA> <NA> arti… 2003 <NA> <NA> <NA> <NA> <NA>
#> 7 1 example_do… 1_7 <tibble> Plan… 2022 Zeitschrift Für … 32 707–… <NA> <NA> arti… 2022 <NA> <NA> http… 3 10.1…
#> 8 1 example_do… 1_8 <tibble> Repu… 2010 <NA> <NA> <NA> <NA> Princeto… book 2010 <NA> <NA> <NA> <NA> <NA>
#> 9 1 example_do… 1_9 <tibble> The … 2016 At the Macroecon… <NA> <NA> <NA> <NA> pape… 2016 <NA> <NA> http… <NA> <NA>
#> 10 1 example_do… 1_10 <tibble> Effe… 2017 At the EUROFI Co… <NA> <NA> <NA> <NA> pape… 2017 <NA> <NA> http… <NA> <NA>
#> # ℹ 71 more rows
#> # ℹ 5 more variables: edition <chr>, genre <chr>, note <chr>, editor <chr>, full_ref <chr>
find_ref_to_df()
and parse_ref_to_df()
can
also take multiple PDF or text documents and create a data frame from
them.
pdfs <- files %>%
.[stringr::str_detect(., "\\.pdf$")]
data_ref <- find_ref_to_df(input = paste0(pdf_path, "/", pdfs),
clean_ref = FALSE,
no_layout = FALSE)
#> [1] "anystyle -f json find C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_1.pdf "
#> [1] "anystyle -f json find C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_2.pdf "
#> [1] "anystyle -f json find C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_3.pdf "
#> [1] "anystyle --overwrite -f ref find C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_1.pdf ./"
#> [1] "anystyle --overwrite -f ref find C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_2.pdf ./"
#> [1] "anystyle --overwrite -f ref find C:/Users/goutsmedt/Documents/MEGAsync/Research/R/Packages/Ranystyle/inst/extdata/example_doc_3.pdf ./"
data_ref
#> # A tibble: 266 × 21
#> id_doc doc id_ref author date title type `container-title` volume pages url issue doi publisher location edition genre note editor
#> <int> <chr> <chr> <list> <lis> <lis> <chr> <list> <list> <list> <list> <list> <chr> <chr> <chr> <chr> <chr> <chr> <list>
#> 1 1 example_doc… 1_1 <df> <chr> <chr> arti… <chr [1]> <NULL> <NULL> <NULL> <NULL> <NA> <NA> <NA> <NA> <NA> <NA> <NULL>
#> 2 1 example_doc… 1_2 <df> <chr> <chr> arti… <chr [1]> <chr> <chr> <chr> <chr> 10.1… <NA> <NA> <NA> <NA> <NA> <NULL>
#> 3 1 example_doc… 1_3 <df> <chr> <chr> arti… <chr [1]> <chr> <chr> <NULL> <chr> <NA> <NA> <NA> <NA> <NA> <NA> <NULL>
#> 4 1 example_doc… 1_4 <df> <chr> <chr> arti… <chr [1]> <chr> <chr> <NULL> <chr> <NA> <NA> <NA> <NA> <NA> <NA> <NULL>
#> 5 1 example_doc… 1_5 <df> <chr> <chr> arti… <chr [1]> <chr> <chr> <NULL> <chr> <NA> <NA> <NA> <NA> <NA> <NA> <NULL>
#> 6 1 example_doc… 1_6 <df> <chr> <chr> arti… <chr [1]> <chr> <NULL> <NULL> <NULL> <NA> <NA> <NA> <NA> <NA> <NA> <NULL>
#> 7 1 example_doc… 1_7 <df> <chr> <chr> arti… <chr [1]> <chr> <chr> <chr> <chr> 10.1… <NA> <NA> <NA> <NA> <NA> <NULL>
#> 8 1 example_doc… 1_8 <df> <chr> <chr> book <NULL> <NULL> <NULL> <NULL> <NULL> <NA> Princeto… <NA> <NA> <NA> <NA> <NULL>
#> 9 1 example_doc… 1_9 <df> <chr> <chr> pape… <chr [1]> <NULL> <NULL> <chr> <NULL> <NA> <NA> <NA> <NA> <NA> <NA> <NULL>
#> 10 1 example_doc… 1_10 <df> <chr> <chr> pape… <chr [1]> <NULL> <NULL> <chr> <NULL> <NA> <NA> <NA> <NA> <NA> <NA> <NULL>
#> # ℹ 256 more rows
#> # ℹ 2 more variables: `collection-title` <chr>, full_ref <chr>
As you may have seen, anystyle runs twice per document: the second
time serve to extract the references in a .ref
format, and
is used to keep the whole reference in a column of the table. This could
be useful for cleaning the data later.
data_ref %>%
dplyr::select(id_ref, full_ref)
#> # A tibble: 266 × 2
#> id_ref full_ref
#> <chr> <chr>
#> 1 1_1 Arnold, M. (2023). ECB must do more to tackle inflation “Monster”, says christine lagarde’. Financial Times.
#> 2 1_2 Barro, R. J., & Gordon, D. B. (1983). Rules, discretion and reputation in a model of monetary policy’. Journal of Monetary Economics, 12(1), …
#> 3 1_3 Béland, D. (2009). Ideas, institutions, and policy change’. Journal of European Public Policy, 16(5), 701–718.
#> 4 1_4 Berger, H., & Haan, J. (1999). A state within the state? An event study on the bundesbank (1948–1973)’. Scottish Journal of Political Economy…
#> 5 1_5 Best, J. (2018). Technocratic exceptionalism: Monetary policy and the fear of democracy’. International Political Sociology, 12(4), 328–345.
#> 6 1_6 Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation’. Journal of Machine Learning Research, 3.
#> 7 1_7 Braun, B., Carlo, D., & Diessner, S. (2022). Planning Laissez-Faire: Supranational Central Banking and Structural Reforms’. Zeitschrift Für P…
#> 8 1_8 Carpenter, D. P. (2010). Reputation and power: Organizational image and pharmaceutical regulation at the FDA. Princeton University Press.
#> 9 1_9 Constâncio, V. (2016). The challenge of low real interest rates for monetary policy’. At the Macroeconomics Symposium, Utrecht School of Econ…
#> 10 1_10 Constâncio, V. (2017). Effectiveness of monetary union and the capital markets union’. At the EUROFI Conference. https://www.bis.org//review/…
#> # ℹ 256 more rows
The data extracted are still a bit messy. Most columns are in list
format, with sometimes two or more element per reference (like two
dates, or two titles). The clean_ref()
function implements
different cleaning processes:
- It collapses non-essential columns (i.e. other than
date
andtitle
), combining multiple entries into a single text string. - It reorganizes
author
information, creating a full name from given, particle, and family components. Also add anauthor_order
column. The column will remain in a list format. - It separates and cleans dates and titles, organizing them into
primary and additional information. The first date and first title are
save in
date
andtitle
. The other dates and titles are saved inother_date
andother_title
. - It extracts the year from the date and places it in a separate column.
- It removes extraneous punctuation and trims whitespace from character columns.
- It relocates key columns to a more standardized order for easier analysis and readability.
data_ref <- clean_ref(data_ref)
data_ref
#> # A tibble: 266 × 24
#> id_doc doc id_ref author title year `container-title` volume pages location publisher type date other_date other_title url issue doi
#> <int> <chr> <chr> <list> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 example_do… 1_1 <tibble> ECB … 2023 Financial Times <NA> <NA> <NA> <NA> arti… 2023 <NA> <NA> <NA> <NA> <NA>
#> 2 1 example_do… 1_2 <tibble> Rule… 1983 Journal of Monet… 12 101–… <NA> <NA> arti… 1983 <NA> <NA> http… 1 10.1…
#> 3 1 example_do… 1_3 <tibble> Idea… 2009 Journal of Europ… 16 701–… <NA> <NA> arti… 2009 <NA> <NA> <NA> 5 <NA>
#> 4 1 example_do… 1_4 <tibble> A st… 1999 Scottish Journal… 46 17–39 <NA> <NA> arti… 1999 <NA> <NA> <NA> 1 <NA>
#> 5 1 example_do… 1_5 <tibble> Tech… 2018 International Po… 12 328–… <NA> <NA> arti… 2018 <NA> <NA> <NA> 4 <NA>
#> 6 1 example_do… 1_6 <tibble> Late… 2003 Journal of Machi… 3 <NA> <NA> <NA> arti… 2003 <NA> <NA> <NA> <NA> <NA>
#> 7 1 example_do… 1_7 <tibble> Plan… 2022 Zeitschrift Für … 32 707–… <NA> <NA> arti… 2022 <NA> <NA> http… 3 10.1…
#> 8 1 example_do… 1_8 <tibble> Repu… 2010 <NA> <NA> <NA> <NA> Princeto… book 2010 <NA> <NA> <NA> <NA> <NA>
#> 9 1 example_do… 1_9 <tibble> The … 2016 At the Macroecon… <NA> <NA> <NA> <NA> pape… 2016 <NA> <NA> http… <NA> <NA>
#> 10 1 example_do… 1_10 <tibble> Effe… 2017 At the EUROFI Co… <NA> <NA> <NA> <NA> pape… 2017 <NA> <NA> http… <NA> <NA>
#> # ℹ 256 more rows
#> # ℹ 6 more variables: edition <chr>, genre <chr>, note <chr>, editor <chr>, `collection-title` <chr>, full_ref <chr>
These cleaning steps are directly implemented in
find_ref_to_df()
and parse_ref_to_df()
: you
just need to set clean_ref
to TRUE.
From References to Word Document
The Ranystyle
package provides two powerful functions,
find_ref_to_word_bibliography()
and
parse_ref_to_word_bibliography()
, designed to streamline
the process of extracting bibliographic references from documents and
compiling them into a formatted Word document. Whether working with
dense academic papers in PDF format or plain text files, these functions
are your gateway to an organized and presentable bibliography. For
instance, it allows you to extract the bibliography of a PDF and
reproduce it in a Word document but in a different Citation
Style Language (CSL).
Extracting References from PDFs to Word
The find_ref_to_word_bibliography function is designed to extract references from PDF documents. It utilizes the find_ref() function under the hood to analyze the PDF and identify reference sections. Once the references are extracted, they are converted into a ‘.bib’ format and subsequently used to generate a Word document with a formatted bibliography.
# Path to the PDF document
pdf_path <- "path/to/your/document.pdf"
# Output Word file path (optional)
word_output_path <- "path/to/output/bibliography.docx"
# CSL file path for formatting (optional)
csl_path <- "path/to/your/style.csl"
# Extract references and generate the Word bibliography
find_ref_to_word_bibliography(input = pdf_path,
output_file = word_output_path,
csl_path = csl_path)
Parsing References from Text Files to Word
The parse_ref_to_word_bibliography function caters to those who prefer working with plain text files, have references in a text format. It also allows you to build a word document bibliography directly from references stored in a vector of strings
# List of references
references <- c(
"Smith, J. (2020). Understanding Artificial Intelligence. AI Journal, 35(5), 123-145.",
"Johnson, L., & Brown, C. (2018). Advances in Machine Learning. New York: Academic Press.",
"Clark, E., & Wright, R. (2019). Robotics and Automation. London: Springer, 200-250.",
"Davis, M. (2021). Quantum Computing: A New Era. Quantum Computing, 5(3), 300-320.",
"Adams, J., & Murphy, F. (2022). Exploring Virtual Reality. VR World, 10(7), 777-800."
)
# Output Word file path (optional)
word_output_path <- "path/to/output/bibliography.docx"
# CSL file path for formatting (optional)
csl_path <- "path/to/your/style.csl"
# Parse references and generate the Word bibliography
parse_ref_to_word_bibliography(input = references,
output_file = word_output_path,
csl_path = csl_path)