R-based metaprogramming strategies for handling Hive/CSV interaction (Part I, imports)

Background Handling Hive/CSV interaction is a common reality of many analytical and data environments. The question of exporting data from Hive to CSV and other formats is frequently raised on online forums, with answers often suggesting the use of sed which, combined with nifty regular expressions, pipes Hive output into flat CSV files as an exporting solution. Import of large amounts of data is best handled by suitable tools like Apache Flume. The sed-based approach is fine for simpler tables but may prove problematic for tables with a large amount of unstructured text. Frequently, analysts and data scientists are faced with the challenge of storing data in Hive on an irregular or semi-regular basis. For instance, a job may produce new forecasting scenarios that we may want to make available through Hive tables. ...

August 13, 2021 · 9 min · Konrad Zdeb

Why regex is not fuzzy matching

Recently, I came across an interesting discussion on StackOverflow^[SO discussion on: Fuzzy Join with Partial String Match in R] on approaches to fuzzy matching tables in R. A good answer, contributed by one of the most resilient and excellent contributors, to whom I owe a lot of thanks for help, suggested relying on regular expressions, combined with basic string removal and transformations like toupper, to deterministically match the tables. The solution solved the problem and was accepted. ...

June 29, 2021 · 7 min · Konrad Zdeb

On Sorting Arrays...or why it's good to read the actual assignment

Problem Solving challenges on Project Euler or HackerRank is a good pastime. For folks working in the wider analytical / data science field, places like Project Euler provide an excellent opportunity to work with academic programming concepts that do not frequently appear in real life. I was looking at a common problem: You are given an unordered array consisting of consecutive integers [1, 2, 3, …, n] without any duplicates. You are allowed to swap any two elements. Find the minimum number of swaps required to sort the array in ascending order. Example Perform the following steps: ...

May 23, 2021 · 4 min · Konrad Zdeb

Using R for File Manipulation

Challenge File manipulation is a frequent task, unavoidable in almost every IT business process. Traditionally, file manipulation tasks are accomplished within the confines of specific tools native to a given system. As such, one may consider writing and scheduling a shell script to undertake frequent file operations, or using purpose-built tools like logrotate to archive logs or Kafka to build streaming-data pipelines. R is usually thought of as a statistical programming language or as an environment for statistical analysis. The fact that R is a mature programming language able to successfully accomplish a wide array of traditional tasks is frequently ignored. What constitutes a programming language is a valid question. Wikipedia offers a somewhat wide definition: ...

March 29, 2021 · 6 min · Konrad Zdeb

Inserting Data into Partitioned Table

Rationale Maintaining partitioned Hive tables is a frequent practice in business. Properly structured tables are conducive to achieving robust performance by speeding up query execution (see Costa, Costa, and Santos 2019). Frequent use cases pertain to creating tables with a hierarchical partition structure. In the context of data that is refreshed daily, the frequently utilised partition structure reflects years, months and dates. Creating partitioned table In HiveQL we would create the table with the following structure using the syntax below. In order to keep the development tidy, I’m creating a separate database on Hive which I will use for the purpose of creating tables for this article. ...

February 26, 2021 · 8 min · Konrad Zdeb

Poor Man's Robust Shiny App Deployment (Part II)

Introduction This article draws on the past post concerned with the utilisation of golem for robust deployment of analytical and reporting solutions. For this article, we will assume that we are working with defined working requirements that utilise some of the Labour Market Statistics disseminated through the nomis portal. Change Plan What we have Reporting requirements Past scripts we used to create reports with accompanying instructions What we want Stronger business continuity - we want to be able to give someone access to this project and not be concerned with missing files, outdated or unavailable documentation and questions on how to produce updated reports. We want a self-contained entity that takes care of its technical requirements and user interaction^[A good parallel can be drawn between this approach and manuals available with life-saving equipment. Equipment delivers technical capacity and the manual ensures operational capacity. In the case of an inexperienced user, one is not useful without the other. We want to ensure that a user with the minimum required capacity can use the tools correctly.] ...

February 12, 2021 · 3 min · Konrad

Poor Man's Robust Shiny App Deployment

Not so uncommon problem… RStudio Connect and the more modest Shiny Proxy come to mind as the most obvious solutions for deploying Shiny applications in production. Application servers are ideal for deploying applications that are to be consumed on a regular basis by larger audiences. In addition to serving the application, managing dependencies and user access or logging user activity are common tasks we would expect a publishing platform to address. Frequently, however, deployment of a Shiny application is directed at smaller audiences and less frequent usage. In such a situation, availability, accessibility and user access management requirements will often be more modest. Commonly, in business, a modelling or analytical solution can be packaged in a Shiny application facilitating periodic re-runs of models with different parameters and updated data sets. Such solutions can be conveniently utilised to facilitate the development of monthly or quarterly reports. If the app is used once per month/quarter by a narrow user group, the case for deploying it on a server is weak. In that particular case we are mostly interested in ensuring that we can: ...

July 23, 2020 · 5 min · Konrad

Three-Way Operator in R

Is there merit in a three-way operator in R? Background The C++20 revision added the “spaceship operator”, which is defined as follows: (a <=> b) < 0 if lhs < rhs; (a <=> b) > 0 if lhs > rhs; (a <=> b) == 0 if lhs and rhs are equal/equivalent. R implementation The behaviour can be achieved in R in multiple ways. One straightforward approach would involve making use of the ifelse function ...

May 8, 2020 · 6 min · Konrad

Installing Hortonworks Sandbox on Mac with Docker

Background The post covers installation of Hortonworks Sandbox (HD) on Mac using Docker. In software development, a sandbox is a testing environment that can be used to isolate untested code changes from production code. Hortonworks Sandbox provides such an environment with the Hortonworks Data Platform installed. Hortonworks Data Platform is an open source framework facilitating distributed storage and processing of large volumes of data. Deploying a system for distributed processing within a single computer may seem like a counter-intuitive idea but it’s actually a very common practice. The most frequent use cases involve various learning / professional development activities where one may be interested in learning a new technology or simply exploring available interfaces. Another frequent use case pertains to various demos, where there may be a need to demonstrate product capabilities and accessing a proper production environment could be cumbersome. ...

February 23, 2019 · 2 min · Konrad

Interactively Loading Shiny Modules

TL;DR If you want to see the implemented solution, please refer to: GitHub repo. Context Shiny is a widely popular web application framework for R. In simple terms, it enables any R programmer to develop and deploy a web application. This application could be simple - an interactive document consisting of a few charts and tables - or a complex “behemoth” with multiple functionalities enabling end-users to run models, query external data, generate exportable reports and sophisticated visuals. ...

November 24, 2018 · 2 min · Konrad