16 February 2021


About Document AI

Google Document AI (DAI) is a server-based OCR engine that extracts text from PDF files. Released in November 2020, it is much more powerful than locally run libraries such as Tesseract. Short of corpus-specific, self-trained processors, DAI offers some of the best OCR capabilities currently available to the general public. At the time of writing, DAI is more expensive than Amazon’s Textract, but promises to support many more languages.

DAI is accessed through an API, but this API currently has no official R client library. This is where the daiR package comes in; it provides a light wrapper for DAI’s REST API, making it possible to submit documents to DAI from within R. In addition, daiR comes with pre- and postprocessing tools intended to make the whole text extraction process easier.

Google Document AI is closely connected with Google Storage, as the latter serves as a drop-off and pick-up point for files you want processed in DAI. An R workflow for DAI processing consists of three core steps:

  1. Upload your files to a Google Storage bucket. This can be done manually in the Google Cloud Console or programmatically with the package googleCloudStorageR.
  2. Using daiR, tell DAI to process the files in your bucket. DAI will return its output to your Storage bucket in the form of json files.
  3. Download the json files from your Storage bucket to your hard drive. Again, you can use either the Cloud Console or googleCloudStorageR.
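
In compressed form, and with made-up file and bucket names, the cycle looks like this (not run; the rest of this vignette walks through each step in detail):

# gcs_upload("myfile.pdf", name = "myfile.pdf")              # step 1: upload
# dai_process("myfile.pdf")                                  # step 2: process
# gcs_get_object("myfile.json", saveToDisk = "myfile.json")  # step 3: download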

Setup

A previous vignette covered how to set up a Google Cloud service account and interact with Google Storage. Here we pick up where that vignette left off and assume that the following things are in place:

  1. A Google Cloud Services (GCS) project linked to your billing account and with the Document AI API enabled.
  2. A service account with the role “Owner”.
  3. A json file with the service account key, the path to which is stored in an environment variable called GCS_AUTH_FILE.
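
If you have not yet set the GCS_AUTH_FILE variable, a minimal sketch is to append it to your user-level .Renviron and restart R (the key path below is just an example; adjust it to your own):

write('GCS_AUTH_FILE="D:/keys/my_google_service_account_key2.json"',
      file = file.path(Sys.getenv("HOME"), ".Renviron"),
      append = TRUE)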

If these things are in place, daiR will automatically authenticate you when you load the package:

library(daiR)
#> Welcome to daiR 0.0.0.9000, your gateway to Google Document AI v1beta2.

In the Environment pane, you should now also see two variables: google_token and project_id. daiR loads these automatically because you need them to send processing requests to DAI.

ls()
#> [1] "google_token" "project_id"

If you don’t see these variables, it’s most likely because daiR could not find the GCS_AUTH_FILE variable in your .Renviron file. Return to the vignette on “Setting up Google Storage” and make sure steps 6 and 7 are covered. Alternatively, if you prefer, you can obtain an access token by another method; just make sure to store it as google_token.
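
One such alternative is to build the token directly from the key file with gargle (I am assuming here that daiR will accept a gargle token; not run):

# google_token <- gargle::credentials_service_account(
#   path = "D:/keys/my_google_service_account_key2.json",
#   scopes = "https://www.googleapis.com/auth/cloud-platform"
# )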

Now there’s just one configuration step left: load the library googleCloudStorageR and set your default Storage bucket with gcs_global_bucket(). This is not strictly necessary, but it saves you from having to type the bucket name in all your subsequent commands to Google Storage and DAI.

library(googleCloudStorageR)
#> ✔ Setting scopes to https://www.googleapis.com/auth/devstorage.full_control and https://www.googleapis.com/auth/cloud-platform
#> ✔ Successfully auto-authenticated via D:/keys/my_google_service_account_key2.json
gcs_global_bucket("superbucket_2021") # my bucket name for this vignette
#> Set default bucket name to 'superbucket_2021'

Now we can start working with files.

File preparation

Presumably you have your own files ready, but for this vignette I will download two documents: a PDF from the CIA’s Freedom of Information Act Electronic Reading Room and a picture of an old text from the National Park Service website.

download.file("https://www.cia.gov/readingroom/docs/AGH%2C%20LASLO_0011.pdf", 
              destfile = "CIA.pdf", 
              mode = "wb")
download.file("https://www.nps.gov/articles/images/dec-of-sentiments-loc-copy.jpg", 
              destfile = "nps.jpeg", 
              mode = "wb")

If you open them, you will see that both are tough tests for any OCR engine.

The immediate problem is that one of them is an image file. DAI only takes PDFs, so we need to convert it. This can be done quickly with daiR::image_to_pdf(), which uses the magick package (R bindings to ImageMagick) to convert images from practically any format to PDF.

image_to_pdf("nps.jpeg", "nps.pdf")

Now we need to upload the PDFs to our Storage bucket where DAI can find them. We only have two files, but most of the time we’ll want to upload more, so I’ll iterate for illustration. To add a little complexity, we can put them in a bucket subfolder titled 'historical/'. Buckets can’t have real folders, but you can include slashes in filenames to create the illusion of folders.

library(purrr)
library(fs)
pdfs <- dir_ls(glob = "*.pdf")
map(pdfs, ~ gcs_upload(.x, name = paste0("historical/", .x)))

Let’s check that the files made it safely:

gcs_list_objects()

Processing

Now we’re ready to send them off to Document AI with daiR’s workhorse function, dai_process(). Its core parameter is files, which tells DAI which files to process. files can be either a single filepath or a vector or list of filepaths. Each path can be either a full URI – i.e., in the format 'gs://<YOUR_BUCKET>/<YOUR_FILEPATH>' – or just the filepath relative to the bucket root, i.e. '<YOUR_FILEPATH>'.
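
For instance, both of the following calls would point DAI to the same file in my bucket (not run):

# dai_process("gs://superbucket_2021/historical/CIA.pdf")
# dai_process("historical/CIA.pdf")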

You can also specify a dest_folder: the name of the bucket folder where you want the output. It defaults to the root of the bucket, but you can point it to a subfolder instead. If the folder does not already exist, it will be created.

dai_process() takes three other parameters, two of which default to objects that should already be in your environment, namely your project id and bucket name. The third is loc (location), which defaults to 'eu' but can be changed to 'us' if you prefer.
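
Putting it all together, a fully spelled-out call might look like the following (not run; the argument names follow the description above and may differ slightly in other versions of daiR):

# dai_process("historical/CIA.pdf",
#             dest_folder = "processed",
#             bucket = "superbucket_2021",
#             loc = "us")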

Since we have more than one file, we need to create a vector of filepaths. We want all the files in the bucket subfolder 'historical'. We exploit the fact that gcs_list_objects() returns a dataframe whose first column is a variable titled name, which we can search with grep().

content <- gcs_list_objects()
our_files <- grep("^historical/", content$name, value = TRUE)

Now, if we don’t mind the json output landing in the root of our bucket, we can process our files with a very simple call:

# dai_process(our_files)

But let’s imagine that our bucket is a bit crowded and we want the output in a separate folder called 'processed/'. We also want to store the immediate response from DAI in an object named response for troubleshooting. Then we would write:

response <- dai_process(our_files, "processed")

The “status: 200” is good news, as it means our HTTP request was accepted by the API. If there was something wrong with your token or your project_id parameter, you would have gotten a 403 (permission denied) or some other error code. You could then inspect the object response for details.
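
If dai_process() hands back an httr response object – an assumption on my part; check your version – you can dig into it like so (not run):

# httr::status_code(response)
# str(httr::content(response))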

But a 200 does not necessarily mean that the processing was successful, because the API has no way of knowing right away whether the filepaths you provided exist in your bucket. If there were errors in your filepaths – say, you forgot to add “historical/” at the beginning – your HTTP request would still get a 200, but your files would not actually process. They would turn up as empty files in the folder you provided. So if you see json files of around 70 bytes each in the destination folder, you know there was something wrong with your filenames.

The current version of dai_process() handles batch requests by submitting individual files at 10-second intervals. If you have many files, this can hold up your R session for some time, so consider running your script as a separate RStudio job. Document AI starts OCR processing the moment it receives the first document, so you should see activity in your bucket fairly soon even if you submitted a large batch. The OCR processing time depends on the length of the document; I haven’t seen official numbers, but in my experience it takes about 5-10 seconds per page.
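
If you go the background-job route, something like this should do it, where 'process_batch.R' is a hypothetical script containing your dai_process() call (requires the rstudioapi package; not run):

# rstudioapi::jobRunScript("process_batch.R", importEnv = TRUE)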

Our batch was small, so we can check our bucket right away:

contents <- gcs_list_objects()
contents
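
If you suspect failed filepaths – the near-empty files mentioned above – you can narrow the listing down to the destination folder and inspect the sizes (assuming your version of gcs_list_objects() returns name and size columns, as current ones do):

contents[grepl("^processed/", contents$name), c("name", "size")]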

In this case, everything seems to have worked fine, and we can download the json files. Note, however, that you cannot save the files under their current names, because they contain a forward slash ('processed/<FILE>'), and gcs_get_object() cannot create folders on your hard drive. You therefore need to transform the names – e.g. with gsub() – to change the forward slash into something else before gcs_get_object() saves them. We’ll use an underscore, making the script look as follows:

jsons <- grep("\\.json$", contents$name, value = TRUE)
map(jsons, ~ gcs_get_object(.x, saveToDisk = gsub("/", "_", .x), overwrite = TRUE))
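
As a sanity check, we can confirm that none of the downloaded files are of the near-empty, ca. 70-byte kind discussed earlier:

file_size(dir_ls(glob = "*.json"))  # from fs; tiny files signal bad filepaths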

Cleaning and checking

DAI adds a long string at the end of each filename that we can remove if we want. We could also have done this at download time, inside the saveToDisk parameter, but sometimes you may want to keep both the original output and a short-named copy.
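
For reference, the one-step variant might have looked like this (not run, since we already have the files):

# map(jsons, ~ gcs_get_object(
#   .x,
#   saveToDisk = sub("\\.pdf-output-page-1-to-1", "", gsub("/", "_", .x)),
#   overwrite = TRUE
# ))

Here we take the two-step route and rename the local copies: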

long_names <- dir_ls(glob = "*.json")
short_names <- sub("\\.pdf-output-page-1-to-1", "", long_names)
file_move(long_names, short_names)

We could also reconstruct the folder path from the bucket by creating a new folder titled 'historical/' and moving the files there, but that’s for another time. Now we want to inspect the goods.

We retrieve the text from the jsons with daiR::get_text(). First the CIA document:

text1 <- get_text("processed_CIA.json")
cat(text1)
#> SECRET
#> To:
#> Chief, Contact Division, 00
#> Prom
#> Jy
#> Subject: Laszlo AGH
#> Reference: Memorandum from Chief, Contact Division, dated 1 February,
#> subject Laszlo AGH; Memorandum from Chief, Contact Division, dated
#> 23 December 1949, subject, Report by Laszlo AGH.
#> 1. With reference to your request for assistance in evaluating
#> subject individual as a continuing source of foreign intelligence,
#> we cannot make any definite suggestion without a complete list of
#> the members of AGH's group.
#> On the basis of reference memorandum
#> of 23 December 1949, we assume that AGH is" Contact with the Fellow-
#> ship of Hungarian Combattants although it is not clear to us whether
#> that is "his group.
#> 1
#> "
#> :
#> 2. If AGH's sources are primarily the Hungarian Combattants,
#> we would not advise continued use of him as a source for foreign
#> intelligence. The loyalty of the Hungarian Combattants lies primarily
#> with the French and it can be assumed that all of their reports go
#> to the French. Furthermore this organization is the most widely
#> known Hungarian emigre group. Consequently it has been and is being.
#> tapped in Burope by CIC, repres ntatives of CIA etc.
#> J.
#> SEGRET
#> DECLASSIFIED AND RELEASED BY
#> CENTRAL INTELLIGENCE AGENCY
#> SOURCES METHODSEXEMPTION 3B2B
#> NAZI WAR CRIMES DISCLOSURE ACT
#> BATE 2006
#> of

And then the NPS document:

text2 <- get_text("processed_nps.json")
cat(text2)
#> ny
#> dries
#> die
#> and
#> romote
#> right to
#> hen
#> diely
#> FREDERICK DOUGLASS, AMY POST, CATHARINE
#> STEBBINS, and ELIZABETH C. STANTON, and was
#> unanimously adopted, as follows:
#> DECLARATION OF SENTIMENTS.
#> When, in the course of human events, it be-
#> comes necessary for one portion of the family of
#> man to assume among the people of the earth a
#> position different from that which they have hith-
#> erto occupied, but one to which the laws of nature
#> and of nature's God entitle them, a decent respect
#> to the opinions of mankind requires that they
#> should declare the causes that impel them to such
#> a course.
#> We hold these truths to be self-evident: that
#> all men and women are created equal; that they
#> are endowed by their Creator with certain inalien-
#> able rights; that among these are life, liberty,
#> and the pursuit of happiness; that to secure these
#> rights governments are instituted, deriving their
#> just powers from the consent of the governed.-
#> Whenever any form of Government becomes
#> destructive of these ends, it is the right of those
#> who suffer from it to refuse allegiance to it, and
#> to insist upon the institution of a new govern-
#> ment, laying its foundation on such principles,
#> and organizing its powers in such form as to them
#> shall seem most likely to effect their safety and
#> happiness. Prudence, indeed, will dictate that
#> governments long established should not be changed
#> for light and transient causes; and accordingly,
#> all experience hath shown that mankind are more
#> disposed to suffer, while evils are sufferable,
#> than to right themselves by abolishing the forms
#> to which they are accustomed. But when a long
#> train of abuses and usurpations, pursuing invaria-
#> bly the same object, evinces a design to reduce
#> them under absolute despotism, it is their duty to
#> throw off such government, and to provide new
#> guards for their future security. Such has been
#> 10
#> mering

There we are. Not perfect, but very good considering all the noise in the original documents.

DAI’s big Achilles heel, at least for the time being, is multicolumn text. In another vignette we will look at what to do when DAI has read the columns wrong and jumbled the words.