Extracting Large Datasets from the Google App Engine Datastore

The Google App Engine (GAE) is a very interesting service for quickly deploying web apps or building systems that use Google’s infrastructure. I’ve recently been working on a project that uses the GAE to receive large amounts of data from a mobile app that is part of a feasibility trial with 20 or so participants. The two key things I wanted to accomplish were (a) getting a back-end that receives/processes/stores data up and running as quickly as possible and (b) having a way to extract/download the data for offline analysis. The GAE was perfect for part (a), and the system was quickly set up: the data that rolled in was easily stored in the NDB datastore.
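
For context, the back-end itself is small: a handful of NDB models and request handlers. A minimal sketch of what that looks like is below; the model and handler names (SensorReading, UploadHandler, /upload) are made up for illustration and are not the actual project code.

    import json

    import webapp2
    from google.appengine.ext import ndb


    class SensorReading(ndb.Model):
        # Hypothetical model: one entity per data point uploaded by the mobile app.
        participant_id = ndb.StringProperty()
        payload = ndb.JsonProperty()
        created = ndb.DateTimeProperty(auto_now_add=True)


    class UploadHandler(webapp2.RequestHandler):
        def post(self):
            # Store whatever the app sends; validation/authentication omitted here.
            SensorReading(participant_id=self.request.get('participant'),
                          payload=json.loads(self.request.body)).put()
            self.response.set_status(204)


    app = webapp2.WSGIApplication([('/upload', UploadHandler)])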

Now the headache: getting data out of the system. What ways could this be done?

  • Don’t even try: using a front-end request handler. These are limited to 60 seconds, which is hardly enough time to read through/format/process a large datastore.
  • In the old days, it was possible to back up data from the developer console. This limits data extraction to admins only, which is not ideal. This post describes how it was done, and it worked for me… until the console was upgraded and the ‘Backup’ functionality was removed (does anyone know where it is?).
  • The bulkloader. This post describes how to use it, and I got it working. Simple and command-line based. Hours and hours later, though, the bulkloader was still trudging along, and it would eventually fail, which seemed to happen whenever my machine momentarily lost its WiFi connection. I didn’t see any way to modify how the bulkloader throttles data download, so I’d only suggest using this method if your data is very small.
  • Using the remote API shell and paging through the datastore entities. I got this working too, and it seemed to be faster than the bulkloader, but it would hit over-quota limits after some time (even though my project has billing enabled). Again, probably only suitable for small data, and admin access only (a rough sketch of the paging loop follows this list).
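
For what it’s worth, here is roughly the paging loop I used inside the remote API shell, reusing the hypothetical SensorReading model from the sketch above; the page size and what you do with each entity will obviously differ.

    # Run inside the remote API shell, e.g.: remote_api_shell.py -s my-app.appspot.com
    from models import SensorReading  # wherever your ndb model lives

    cursor = None
    more = True
    while more:
        # fetch_page returns (results, next_cursor, more); 500 entities per page.
        readings, cursor, more = SensorReading.query().fetch_page(
            500, start_cursor=cursor)
        for r in readings:
            print r.participant_id, r.created  # in practice, write to a file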

What I did in the end was inspired by the post that wanted me to start the backup from the admin dashboard:

  • I found this post that explains how to transfer data from the datastore to BigQuery. What it is actually doing is transferring data from the datastore to Cloud Storage, and then uploading it from Cloud Storage into BigQuery.
  • I wrote a map-reduce job that does the same as the first part of that post. The problem I encountered was that the example and all of the documentation point to using the “FileOutputWriter” in “mapreduce.output_writers”, which no longer exists (and therefore produces ImportErrors). By digging around the code, I found that I could replace it with “GoogleCloudStorageConsistentOutputWriter” (a sketch of the job follows this list).
  • I ran the map-reduce job by hitting its URL with a GET request. My code then redirects to the mapreduce status page. The whole thing took approximately 10 hours to complete.
  • I downloaded the data using gsutil (the -m flag enables multithreaded downloads): gsutil -m cp -R gs://my-bucket/my-dir .
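
Putting the second and third bullets together, the map-reduce job and the handler that kicks it off looked roughly like the sketch below. This is trimmed down and from memory rather than the exact code: the job name, bucket, shard count and entity kind are placeholders, and the mapper simply serializes each entity to a line of JSON.

    import json

    import webapp2
    from mapreduce import mapreduce_pipeline


    def datastore_to_json(entity):
        # Mapper: emit each datastore entity as one line of JSON in the output files.
        yield json.dumps(entity.to_dict(), default=str) + '\n'


    class StartExportHandler(webapp2.RequestHandler):
        def get(self):
            export = mapreduce_pipeline.MapperPipeline(
                'datastore-export',  # job name shown in the mapreduce status UI
                handler_spec=__name__ + '.datastore_to_json',
                input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
                output_writer_spec=('mapreduce.output_writers.'
                                    'GoogleCloudStorageConsistentOutputWriter'),
                params={
                    'input_reader': {'entity_kind': 'models.SensorReading'},
                    'output_writer': {'bucket_name': 'my-bucket'},
                },
                shards=16)
            export.start()
            # Redirect to the mapreduce status page to watch the job's progress.
            self.redirect('/mapreduce/pipeline/status?root=%s' % export.pipeline_id)


    app = webapp2.WSGIApplication([('/start-export', StartExportHandler)])

Once the job finishes, the output files end up under the bucket you configured, which is what the gsutil command above then pulls down.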