Maintainer: | Mauricio Vargas Sepulveda, Will Beasley |
Contact: | m.sepulveda at mail.utoronto.ca |
Version: | 2024-10-27 |
URL: | https://CRAN.R-project.org/view=WebTechnologies |
Source: | https://github.com/cran-task-views/WebTechnologies/ |
Contributions: | Suggestions and improvements for this task view are very welcome and can be made through issues or pull requests on GitHub or via e-mail to the maintainer address. For further details see the Contributing guide. |
Citation: | Mauricio Vargas Sepulveda, Will Beasley (2024). CRAN Task View: Web Technologies and Services. Version 2024-10-27. URL https://CRAN.R-project.org/view=WebTechnologies. |
Installation: | The packages from this task view can be installed automatically using the ctv package. For example, ctv::install.views("WebTechnologies", coreOnly = TRUE) installs all the core packages or ctv::update.views("WebTechnologies") installs all packages that are not yet installed and up-to-date. See the CRAN Task View Initiative for more details. |
This task view recommends packages and strategies for efficiently interacting with resources over the internet with R. This task view focuses on:
If you have suggestions for improving or growing this task view, please submit an issue or a pull request in the GitHub repository linked above. If you can’t contribute on GitHub, please e-mail the task view maintainer. If you have an issue with a package discussed below, please contact the package’s maintainer.
Thanks to all contributors to this task view, especially to Scott Chamberlain, Thomas Leeper, Patrick Mair, Karthik Ram, and Christopher Gandrud who maintained this task view up to 2021.
The bulk of R’s capabilities for interacting with web resources are supplied by CRAN packages that are layered on top of libcurl. A handful of packages provide the foundation for most modern approaches.
httr2 and its predecessor httr are user-facing clients for HTTP requests. They leverage the curl package for most operations. If you are developing a package that calls a web service, we recommend reading their vignettes.
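As a minimal sketch of the user-facing style described above, the following builds an httr2 request step by step and prints it without contacting any server. The host `example.org`, the `search` path, the query parameters, and the user agent string are placeholders chosen for illustration, not a real API.

```r
library(httr2)

# Hypothetical endpoint: example.org is a placeholder, not a real web service.
req <- request("https://example.org/api") |>
  req_url_path_append("search") |>
  req_url_query(q = "curl", page = 1) |>
  req_user_agent("my-package (https://example.org/my-package)") |>
  req_retry(max_tries = 3)

# Printing the request shows the method, URL, and options that would be sent.
print(req)

# Against a real service, the request would then be performed and parsed:
# resp <- req_perform(req)
# data <- resp_body_json(resp)
```

Nothing is sent until `req_perform()` is called, so a request can be inspected or unit-tested before it touches the network.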
crul is another package that leverages curl. It is an R6-based client that supports asynchronous HTTP requests, a pagination helper, HTTP mocking via webmockr, and request caching for unit tests via vcr. crul is intended to be called by other packages, instead of R users. Unlike httr2, crul’s current version does not support OAuth. Additional options may be passed to curl when instantiating crul’s R6 classes.
curl is the lower-level package that provides a close interface between R and the libcurl C library. It is not intended to be called directly by typical R users. curl may be useful for operations on web-based XML or with FTP (as crul and httr2 are focused primarily on HTTP).
utils and base are the base R packages that provide download.file(), url(), and related functions. These functions also use libcurl.
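The base functions can be tried without a network connection. This sketch assumes a temporary local file addressed through a `file://` URL as a stand-in for a remote resource; an `http://` or `https://` URL is passed in the same way. The helper `to_file_url()` is defined here for illustration only.

```r
# Helper (illustrative): turn a local path into a file:// URL.
to_file_url <- function(path) {
  path <- normalizePath(path, winslash = "/")
  if (!startsWith(path, "/")) path <- paste0("/", path)  # Windows drive letters
  paste0("file://", path)
}

# A small local file stands in for a remote resource.
src <- tempfile(fileext = ".txt")
writeLines(c("alpha", "beta", "gamma"), src)
src_url <- to_file_url(src)

# url() creates a connection that can be read like any other connection.
con <- url(src_url)
lines_read <- readLines(con)
close(con)

# download.file() copies the resource to a local path and returns 0 on success.
dest <- tempfile(fileext = ".txt")
status <- download.file(src_url, destfile = dest, quiet = TRUE)
```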
You may have code that performs web scraping, and it may be very efficient in terms of time or resource usage, but first we need to talk about whether it is legal and ethical for you to run it.
You can use the ‘polite’ package, which builds upon the principles of seeking permission, taking slowly and never asking twice. The package builds on awesome toolkits for defining and managing http sessions (‘httr’ and ‘rvest’), declaring the user agent string and investigating site policies (‘robots.txt’), and utilizing rate-limiting and response caching (‘ratelimitr’ and ‘memoise’).
The problem is not technical, but ethical and also legal. You can technically log into an art auction site and scrape the prices of all the paintings, but if you need an account and have to use ‘RSelenium’ to extract the information by automating clicks in the browser, you are subject to the Terms of Service (ToS).
Another problem is that some websites require specific connections. You can connect to a site from a university or government building and access content for free, but if you connect from home, you may find that you require a paid subscription to access the same content. If you scrape a site from a university, you might be breaking some laws if you are not careful about the goal and scope of the scraping.
In recent years, many functions have been updated to accommodate web pages that are protected with TLS/SSL. Consequently you can usually download a file if its url starts with “http” or “https”.
If the data file is not accessible via a simple url, you probably want to skip to the Online services section. It describes how to work with specific web services such as AWS, Google Documents, Twitter, REDCap, PubMed, and Wikipedia.
If the information is served by a database engine, please review the cloud services in the Online services section below, as well as the Databases with R CRAN Task View.
Many base and CRAN packages provide functions that accept a url and return a data.frame or list.
utils’s read.csv(), read.table(), and friends return a base::data.frame.
readr’s read_csv(), read_delim(), and friends return a tibble::tibble, which derives from base::data.frame.
data.table’s fread() returns a data.table::data.table, which derives from base::data.frame.
arrow’s read_csv_arrow() returns a tibble::tibble() or other Arrow structures.
If you need to process a different type of file, you can accomplish this in two steps. First download the file from a server to your local computer; second pass the path of the new local file to a function in a package like haven or foreign.
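A short sketch of the one-step approach: read.csv() is given a URL directly and returns a data.frame. To keep the example runnable offline, it assumes a temporary CSV file addressed through a `file://` URL; the column names and values are made up for illustration.

```r
# Write a small CSV that stands in for a file hosted on a server.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, name = c("ann", "bob", "cyd")),
  csv_path,
  row.names = FALSE
)

# Build a file:// URL; an https:// URL would be passed the same way.
csv_norm <- normalizePath(csv_path, winslash = "/")
if (!startsWith(csv_norm, "/")) csv_norm <- paste0("/", csv_norm)  # Windows
csv_url <- paste0("file://", csv_norm)

# read.csv() accepts the URL in place of a local path.
people <- read.csv(csv_url)
str(people)
```

For formats that have no URL-aware reader, the two-step route applies instead: download the file to a temporary path, then hand that path to the reader.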
Many base and CRAN packages provide functions that download files:
utils’s download.file().
curl’s curl_download(), curl_fetch_multi(), and friends.
httr2’s req_perform(path = <your_file_path>), or alternatively req_perform() piped to resp_body_string().
httr’s GET().
RCurl’s getURL().
The vast majority of web-based data is structured as plain text, HTML, XML, or JSON. Web service APIs increasingly rely on JSON, but XML is still prevalent in many applications. There are several packages for specifically working with these formats. These functions can be used to interact directly with insecure web pages or can be used to parse locally stored or in-memory web files. Colloquially, these activities are called web scraping.
XML: There are two foundational packages for working with XML: XML and xml2. Both support general XML (and HTML) parsing, including XPath queries. xml2 is less fully featured, but more user friendly with respect to memory management, classes (e.g., XML node vs. node set vs. document), and namespaces. Of the two, only XML supports de novo creation of XML nodes and documents.
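The XPath support mentioned above can be illustrated with xml2 on an in-memory document. This is a minimal sketch; the `catalog` and `book` element names are invented for the example.

```r
library(xml2)

# Parse an in-memory XML document (element names are illustrative).
doc <- read_xml(
  '<catalog>
     <book id="b1"><title>R Basics</title></book>
     <book id="b2"><title>Web Data</title></book>
   </catalog>'
)

# XPath queries return node sets; attributes and text are extracted from them.
books  <- xml_find_all(doc, "//book")
ids    <- xml_attr(books, "id")
titles <- xml_text(xml_find_all(doc, "//book/title"))

data.frame(id = ids, title = titles)
```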
Other XML tools include:
XML2R is a collection of convenient functions for coercing XML into data frames. An alternative to XML is selectr, which parses CSS3 Selectors and translates them to XPath 1.0 expressions. XML is often used for parsing xml and html, but selectr translates CSS selectors to XPath, so you can use CSS selectors instead of XPath.
XMLSchema provides facilities in R for reading XML schema documents and processing them to create definitions for R classes and functions for converting XML nodes to instances of those classes. It provides the framework for meta-computing with XML schema in R.
xslt is an extension for xml2 to transform XML documents by applying an xslt style-sheet. This may be useful for web scraping, as well as transforming XML markup into another human- or machine-readable format (e.g., HTML, JSON, plain text, etc.).
HTML: All of the tools that work with XML also work for HTML, though HTML is more likely to be malformed. So xml2::read_html() is a good first function to use for importing HTML. Other tools are designed specifically to work with HTML.
For capturing static content of web pages postlightmercury is a client for the web service ‘Mercury’ that turns web pages into structured and clean text.
rvest is another higher-level alternative which expresses common web scraping tasks with pipes (like Base R’s |> and magrittr’s %>%).
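A brief sketch of the piped style with rvest, using an in-memory HTML fragment so that no site is contacted. The list items and the `pkg` class are invented for the example; the native pipe requires R 4.1 or later.

```r
library(rvest)

# Build a small HTML document in memory (content is illustrative).
page <- minimal_html(
  '<ul>
     <li class="pkg">httr2</li>
     <li class="pkg">curl</li>
     <li>something else</li>
   </ul>'
)

# Select elements with a CSS selector, then extract their text.
pkgs <- page |>
  html_elements("li.pkg") |>
  html_text2()

pkgs
```

For a live page, `read_html()` with a URL would take the place of `minimal_html()`, subject to the legal and ethical considerations discussed earlier.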
boilerpipeR provides generic extraction of main text content from HTML files; removal of ads, sidebars and headers using the boilerpipe Java library.
PhantomJS (which was archived in 2018): webshot uses PhantomJS to provide screenshots of web pages without a browser. It can be useful for testing websites (such as Shiny applications). rdom (on GitHub at cpsievert/rdom) uses PhantomJS to access a webpage’s Document Object Model (DOM).
htmltools provides functions to create HTML elements.
RHTMLForms reads HTML documents and obtains a description of each of the forms it contains, along with the different elements and hidden fields. htm2txt uses regex to convert html documents to plain text by removing all html tags. Rcrawler does crawling and scraping of web pages.
HTML Utilities: These tools don’t extract content, but they can help you develop and debug.
JSON: There are several packages for reading and writing JSON: rjson, RJSONIO, and jsonlite. We recommend using jsonlite. Check out the paper describing jsonlite by Jeroen Ooms https://arxiv.org/abs/1403.2805. jqr provides bindings for the fast JSON library ‘jq’. jsonvalidate validates JSON against a schema using the “is-my-json-valid” JavaScript library; ajv does the same using the ‘ajv’ JavaScript library. ndjson supports the “ndjson” format.
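A minimal sketch of reading and writing JSON with jsonlite. The JSON text is made up for illustration; an array of objects with the same keys is simplified into a data.frame by default.

```r
library(jsonlite)

# An array of records, as commonly returned by web service APIs.
json_text <- '[
  {"package": "httr2", "core": true},
  {"package": "crul",  "core": false}
]'

# fromJSON() simplifies an array of same-shaped objects into a data.frame.
pkgs <- fromJSON(json_text)
str(pkgs)

# toJSON() serialises the data.frame back to JSON text.
json_again <- toJSON(pkgs, pretty = TRUE)
json_again
```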
RSS/Atom: feedeR can be used to parse RSS or Atom feeds. tidyRSS parses RSS, Atom XML/JSON and geoRSS into a tidy data.frame.
swagger can be used to automatically generate functions for working with a web service API that provides documentation in Swagger.io format.
Amazon Web Services (AWS):
lapply() for the Elastic Map Reduce (EMR) engine called emrlapply(). It uses Hadoop Streaming on Amazon’s EMR in order to get simple parallel computation.
Microsoft Azure: Azure and Microsoft 365 are Microsoft’s cloud computing services.
Google Cloud and Google Drive:
Dropbox: repmis’s source_Dropbox() function for downloading/caching plain-text data from non-public folders.
Other Cloud Storage: boxr is a lightweight, high-level interface for the box.com API.
Docker: analogsea is a general purpose client for the Digital Ocean v2 API. In addition, it includes functions to install various R tools including base R, RStudio server, and more. There’s an improving interface to interact with docker on your remote droplets via this package.
crunch provides an interface to the crunch.io storage and analytics platform. crunchy facilitates making Shiny apps on Crunch.
The cloudyr project aims to provide interfaces to popular Amazon, Azure and Google cloud services without the need for external system dependencies. Amazon Web Services is a popular, proprietary cloud service offering a suite of computing, storage, and infrastructure tools.
pins can be used to publish data, models, and other R objects across a range of backends, including AWS, Azure, Google Cloud Storage, and Posit Connect.
googlesheets) can access private or public ‘Google Sheets’ by title, key, or URL. Extract data or edit data. Create, delete, rename, copy, upload, or download spreadsheets and worksheets. gsheet can download Google Sheets using just the sharing link. Spreadsheets can be downloaded as a data frame, or as plain text to parse manually.
imgur_upload() to load images from literate programming documents.
This list describes online services. For a more complete treatment of the topic, please see the MachineLearning CRAN Task View.
This list describes online services. For a more complete treatment of the topic, please see the Analysis Spatial Data CRAN Task View.
Geolocation/Geocoding: Services that translate between addresses and longlats. rgeolocate (archived) offers several online and offline tools. rydn is an interface to the Yahoo Developers network geolocation APIs, and ipapi can be used to geolocate IPv4/6 addresses and/or domain names using the http://ip-api.com/ API. opencage provides access to the ‘OpenCage’ geocoding service. nominatimlite and nominatim connect to the OpenStreetMap Nominatim API for reverse geocoding. PostcodesioR provides post code lookup and geocoding for the United Kingdom. geosapi is an R client for the ‘GeoServer’ REST API, an open source implementation used widely for serving spatial data. geonapi provides an interface to the ‘GeoNetwork’ legacy API, an open source catalogue for managing geographic metadata. ows4R is a new R client for the ‘OGC’ standard Web-Services, such as Web Feature Service (WFS) for data and Catalogue Service (CSW) for metadata.
Mapping: Services that help create visual maps.
Routing: Services that calculate and optimize distances and routes.
Each of the following packages provides an interface to its associated service, unless noted otherwise.
The following packages interface with online services that facilitate web analytics.
The following packages interface with tools that facilitate web analytics.
webreadr, but webreadr focuses on reading log files, while WebAnalytics focuses on analysing them.
Reference/bibliography/citation management: rorcid connects to the ORCID.org API, which can identify scientific authors and their publications (e.g., by DOI). rdatacite connects to DataCite, which manages DOIs and metadata for scholarly datasets. scholar extracts citation data from Google Scholar. rscopus extracts citation data from Elsevier Scopus. Convenience functions are also provided for comparing multiple scholars and predicting future h-index values. mathpix converts an image of a formula (typeset or handwritten) via Mathpix webservice to produce the ‘LaTeX’ code. zen4R connects to Zenodo API, including management of depositions, attribution of DOIs and upload of files.
Literature: europepmc connects to the Europe PubMed Central service. pubmed.mineR is for text mining of PubMed Abstracts that supports fetching text and XML from PubMed. jstor retrieves metadata, ngrams and full-texts from Data for Research service by JSTOR. aRxiv connects to arXiv, a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics. roadoi connects to the Unpaywall API for finding free full-text versions of academic papers. rcrossref is an interface to Crossref’s API.
Many CRAN packages interact with services facilitating sports analysis. For a more complete treatment of the topic, please see the SportsAnalytics CRAN Task View.
Using packages in this Web Technologies task view can help you acquire data programmatically, which can facilitate Reproducible Research. Please see the ReproducibleResearch CRAN Task View for more tools and information:
“The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, understood, and verified.”
Push Notifications: RPushbullet provides an easy-to-use interface for the Pushbullet service which provides fast and efficient notifications between computers, phones and tablets. pushoverr can send push notifications to mobile devices (iOS and Android) and desktop using ‘Pushover’. notifyme can control Philips Hue lighting.
Automated Metadata Harvesting: oai and OAIHarvester harvest metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standard.
Wikipedia: WikipediR is a wrapper for the ‘MediaWiki’ API, aimed particularly at the ‘Wikimedia’ “production” wikis, such as ‘Wikipedia’. WikidataR can request data from Wikidata.org, the free knowledge base. WikidataQueryServiceR is a client for the Wikidata Query Service.
rerddap: A generic R client to interact with any ERDDAP instance, which is a special case of OPeNDAP (https://en.wikipedia.org/wiki/OPeNDAP), or Open-source Project for a Network Data Access Protocol. Allows user to swap out the base URL to use any ERDDAP instance.
duckduckr is an R interface to DuckDuckGo
websockets (retired from CRAN). servr provides a simple HTTP server to serve files under a given directory based on httpuv.
PluginR) to run R code from wiki pages, and use data from their own collected web databases (trackers). A demo: https://r.tiki.org/tiki-index.php .
webmock. webmockr only helps mock HTTP requests, and returns nothing when requests match expectations. It integrates with crul and httr. See Testing for mocking with returned responses.
application/x-www-form-urlencoded as well as multipart/form-data. mime guesses the MIME type for a file from its extension. rsdmx provides tools to read data and metadata documents exchanged through the Statistical Data and Metadata Exchange (SDMX) framework; it focuses on the SDMX XML standard format (SDMX-ML). robotstxt provides functions and classes for parsing robots.txt files and checking access permissions; spiderbar does the same. uaparserjs uses the JavaScript “ua-parser” library to parse User-Agent HTTP headers. rapiclient is a client for consuming APIs that follow the Open API format. restfulr models a RESTful service as if it were a nested R list.
The httr::parse_url() function can be used to extract portions of a URL. The RCurl::URLencode() and utils::URLencode() functions can be used to encode character strings for use in URLs. utils::URLdecode() decodes back to the original strings. urltools can also handle URL encoding, decoding, parsing, and parameter extraction.
For specialized situations, the following resources may be useful:
RCurl is another low-level client for libcurl. Of the two low-level curl clients, we recommend using curl. httpRequest is another low-level package for HTTP requests that implements the GET, POST and multipart POST verbs, but we do not recommend its use.
request provides a high-level package that is useful for developing other API client packages. httping provides simplified tools to ping and time HTTP requests, around httr calls. httpcache provides a mechanism for caching HTTP requests.
nanonext is an alternative low-level sockets implementation that can be used to perform HTTP and streaming WebSocket requests synchronously or asynchronously over its own concurrency framework. It uses the NNG/mbedTLS libraries as a backend.
For dynamically generated webpages (i.e., those requiring user interaction to display results), RSelenium can be used to automate those interactions and extract page contents. It provides a set of bindings for the Selenium 2.0 webdriver using the ‘JsonWireProtocol’. It can also aid in automated application testing, load testing, and web scraping. seleniumPipes provides a “pipe”-oriented interface to the same.
Authentication: Using web resources can require authentication, either via API keys, OAuth, username:password combination, or via other means. Additionally, web resources sometimes require that authentication be supplied in the header of an http call, which requires a little bit of extra work. API keys and username:password combos can be combined within a url for a call to a web resource, or can be specified via commands in RCurl or httr2. OAuth is the most complicated authentication process, and can be most easily done using httr2.
See the 6 demos within httr, three for OAuth 1.0 (LinkedIn, Twitter, Vimeo) and three for OAuth 2.0 (Facebook, GitHub, Google). ROAuth provides a separate R interface to OAuth. OAuth is easier to do in httr, so start there. googleAuthR provides an OAuth 2.0 setup specifically for Google web services, and AzureAuth provides similar functionality for Azure Active Directory.
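For the simpler cases described above (a token or a username:password pair sent in the request header), httr2 provides helpers that set the Authorization header. The sketch below only constructs the requests; the host `example.org`, the environment variable name `EXAMPLE_API_TOKEN`, and the credentials are placeholders, and a real service defines which scheme it expects.

```r
library(httr2)

# Read the secret from an environment variable rather than hard-coding it.
# The variable name and the fallback value are placeholders.
token <- Sys.getenv("EXAMPLE_API_TOKEN", unset = "dummy-token")

# Bearer token: adds an "Authorization: Bearer <token>" header.
req_bearer <- request("https://example.org/api") |>
  req_auth_bearer_token(token)

# HTTP basic authentication: username and password go in the header.
req_basic <- request("https://example.org/api") |>
  req_auth_basic("my-user", "my-password")

# Printing shows the request; httr2 redacts the Authorization value.
print(req_bearer)
```

Keeping credentials in headers and environment variables avoids embedding them in URLs, where they are more likely to end up in logs or scripts.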