Background
Handling Hive/CSV interaction is a common reality of many analytical and data environments. The question on exporting data from Hive to CSV and other formats is frequently raised on online forums with answers frequently suggesting making use of sed
that combined with nifty regular expressions pipes Hive output into a flat CSV files as an exporting solution. Import of large amounts of data is best handled by suitable tools like Apache Flume. That is fine for simpler tables but may prove problematic for tables with a large amount of unstructured text. Frequently analysts and data scientists are faced with a challenge with storing data Hive on a irregular semi-regular basis. For instance, a job may produce new forecasting scenarios that we may want to make available through a Hive tables.
Relaying on Spark API may prove fiddly when our requirements for resulting Hive table are non-standard. We may be willing to have a detailed control over column types or dynamically store comments providing relevant table context. If our modelling/data creation efforts takes place outside Spark session kicking off a Spark connection only to store some data in Hive feels like unnecessary onerous approach. Generating lengthy HiveQL syntax may also appear unnecessary tedious if we deal with numerous tables of variable structures. Meta-programming paradigm offers a neat solution to that challenge. Using R we can quickly generate HiveQL code that we use to land the CSV data into Hive with minimal overhead and dependencies. The article provides an opinionated overview of how meta-programming could be employed to quickly generate detailed HiveQL that would enable us to generate syntax for inserting data into Hive table controlling for column types/names and automatically generating comments.
Trial Data
Before progressing with the examples and outlining difficulties associated with the export I’m going to create some sample data. I’m using the docker-hive Docker image provided by the Big Data Europe to run the example. The wakefield
package available for R provides access to a number of functions useful for synthetic data generation. In this example the goal is to generate a data set that will have columns of diverse types, such as free text, dates, numbers of various formats, etc.. The wakefield
packages comes with great examples and using them is generally a good idea. I have increased the size of the data to hundred thousand rows so partitioning it will be more realistic
|
|
I will add a few columns to create partitions so our trial data is more “realistic”.
|
|
The trial data looks good; we have generated a number of messy variables that will be “pleasure” to deal with during import/export operation.
|
|
Metaprogramming paradigm
Metaprogramming paradigm assumes using computer code to generate more computer code. In effect metaprogramming gives other programs ability to treat code like data. As stated by Levy^[L. S. Levy, “A metaprogramming method and its economic justification,” in IEEE Transactions on Software Engineering, vol. SE-12, no. 2, pp. 272-277, Feb. 1986, doi: 10.1109/TSE.1986.6312943.]:
Metaprogramming, defined as creating application programs by writing programs that produce programs, is presented as the basis of a method for reducing software costs and improving software quality.
The question is how this relates to our task of importing relatively messy “real-life” data into nice, partitioned Hive table that will be a pleasure to work with. Let’s consider the simplest way to import the data. We would use CREATE_TABLE
|
|
We would then point to the load statement, that would look more or less like that:
|
|
The first observation that emerges is that generating all of the relevant code will be particular onerous and verbose. In the following scenario we may be willing to:
- Comment on the columns we are souring reflecting original name, type and summarising any other transformations we have applied
- Table comments could reflect details on the source, such as the location of the file or any other characteristics that we can gather through
file.info
.
In context of importing wide tables this will prove particulary painful. If we are in position where we are at a risk of undertaking that type of exercise more frequently, reflecting on a process that would enable us to automate or part-automate the data load can prove beneficial.
Generating code
This is where R’s flexibility and data structures come into play. We can use R to generate our HiveQL code and come up with a generic function that could be easily deployed against data frames that we would like to make permanent in Hive. The most subsantial element in will be concerned with generating code responsible for provisioning column names. We will also have to generate code that will create initial opening code and final elements pertaining to providing input on table characteristics.
Importing data
Before uploading the data we devise a table structure that will be used to hold it. Following the displayed variable types this can be done in the following manner. From a development perspective, breaking functions into managable smaller pieces will make it easier to develop.
Generating column names
For this particular function we would like to obtain a string that would form the column part of the data generation query
|
|
This implies the following requirements:
- We want to come up with syntaticaly valid column names that can be used in HiveQL. As stated in the official documentation, the Hive is fairly pemissive in acceptable column names:
In Hive 0.13 and later, column names can contain any Unicode character (see HIVE-6013), however, dot (.) and colon (:) yield errors on querying, so they are disallowed in Hive 1.2.0 (see HIVE-10120). Any column name that is specified within backticks (`) is treated literally. Within a backtick string, use double backticks (``) to represent a backtick character. Backtick quotation also enables the use of reserved keywords for table and column identifiers.
We can assume a more rigours approach reflecting the prefence that, in all likelihood, majority of us would endorse:
- We don’t want for the columns to start with a number
- We want for the function to replace spaces with
_
- We would like to remove non-letter characters
- On that for currency characters we would like for the function to repplace those with a three letter currency code
- With expection of the three letter currency code, the column name should be in lowercase. Hive is case-insesitive but pedentically, I would prefer for the column to be named
price_GBP
instead ofprice_gbp
. As this is matter of preference we may make this operation optional - We would like for the column to remove common three letter words; for instance
price_item
is sufficiently clear and we don’t needprice_for_an_item
. Pedentically, we may be even temepted to reverse common words so the order reads more natural,item_price
reads better thanprice_item
(although, there is no loss in meaning.)
|
|
|
|
Conclusion
In the subsequent articles I will tackle remaining challenges such as automatically generating meaningful strings for the column comments and R to Hive type conversion. The function discussed above will be made available through a dedicated R package that I’m making available on GitHub at konradzdeb/HiveQLgenerator.