Fix the Git completion with Oh-my-Zsh on Mac

In a previous post, I explained how I set up oh-my-zsh with the git plugin. I also use Homebrew to manage the packages installed on my Mac. After upgrading Git recently, I noticed that Git completion was no longer as powerful as before.

Continue reading

MongoDB and Apache Spark - Getting started tutorial

MongoDB and Apache Spark are two popular Big Data technologies.

In my previous post, I listed the capabilities of the MongoDB connector for Spark. In this tutorial, I will show you how to configure Spark to connect to MongoDB, load data, and write queries.

To demonstrate how to use Spark with MongoDB, I will use the zip code data set from the MongoDB tutorial on the aggregation pipeline. I have prepared a Maven project and a Docker Compose file to get you started quickly.

Continue reading

Introduction to the MongoDB connector for Apache Spark

MongoDB is one of the most popular NoSQL databases. It stores document-oriented data, and its built-in sharding and replication features provide horizontal scalability as well as high availability.

Apache Spark is another popular “Big Data” technology. Spark lowers the barrier to entry to distributed computing by offering an in-memory framework that is easier to use and faster than MapReduce. Apache Spark is intended to be used with any distributed storage, e.g. HDFS, Apache Cassandra with DataStax’s spark-cassandra-connector, and now MongoDB with the connector presented in this article.

By using Apache Spark as a data processing platform on top of a MongoDB database, you can benefit from all of the major Spark API features: the RDD model, the SQL (HiveQL) abstraction and the Machine Learning libraries.

In this article, I present the features of the connector and some use cases. An upcoming article will be a tutorial to demonstrate how to load data from MongoDB and run queries with Spark.

Continue reading

Add a Git hook to automatically verify a repository's email

I use Git a lot and I often have to switch between my personal repositories (i.e. GitHub) and my professional (Ippon) repositories on the same laptop. My default Git email is set to my personal address, and I have often forgotten to switch it to my professional address when creating or cloning a repository for my company. Like everything in Git, this can be automated to avoid mistakes.

In this post, I will show how I use a Git hook to check the email configured in any repository before every commit.
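
As a rough illustration of the idea (not the hook from the post itself), a pre-commit hook can be any executable, including a Python script. The sketch below assumes a hypothetical rule table that maps a fragment of the remote URL to the email domain expected for that remote. The domains, the table, and the function names are all invented for the example.

```python
#!/usr/bin/env python3
"""Sketch of a pre-commit hook that checks the repository's configured email."""
import subprocess
import sys

# Hypothetical rules: remote URL fragment -> email domain required for it.
EXPECTED_DOMAINS = {
    "github.com": "personal.example",
    "git.company.example": "company.example",
}


def git_output(*args):
    """Return the trimmed stdout of a git command, or "" if the command fails."""
    result = subprocess.run(["git", *args], capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else ""


def expected_domain(remote_url, rules=EXPECTED_DOMAINS):
    """Return the email domain required for this remote, or None if no rule applies."""
    for fragment, domain in rules.items():
        if fragment in remote_url:
            return domain
    return None


def email_matches(email, domain):
    """True when the email address belongs to the given domain (case-insensitive)."""
    return email.lower().endswith("@" + domain.lower())


def main():
    """Return 0 to let the commit proceed, 1 to block it."""
    email = git_output("config", "user.email")
    remote = git_output("config", "--get", "remote.origin.url")
    domain = expected_domain(remote)
    if domain is not None and not email_matches(email, domain):
        print(f"user.email is '{email}', expected an @{domain} address", file=sys.stderr)
        return 1
    return 0

# When installed as a hook, the file would end with: sys.exit(main())
```

Saved as .git/hooks/pre-commit and made executable, with the final `sys.exit(main())` call added, the script would run before every commit. Git aborts the commit when a pre-commit hook exits with a non-zero status.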

Continue reading

My development environment

A customized development environment can be a huge productivity boost in day-to-day work. In this post, I will share the tools and configurations I currently use.

Continue reading

git commit fixup

In this article, I will describe a git option to quickly fix a previous commit. This is useful when I want to fix a typo in an earlier commit after a few newer commits have been made. The goal is to keep a “clean” git history of consistent commits that add features, which makes code reviews easier.
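
Judging by the title, the option in question is presumably `git commit --fixup`, usually paired with `git rebase --interactive --autosquash`. The sketch below is only an assumption about that workflow, wrapped in Python helpers whose names are hypothetical. It builds the two git commands and runs them, accepting the generated rebase plan without opening an editor.

```python
import os
import subprocess


def fixup_commit_cmd(target):
    """argv that records the staged changes as a 'fixup!' commit aimed at target."""
    return ["git", "commit", f"--fixup={target}"]


def autosquash_cmd(target):
    """argv that rebases from the parent of target, folding fixup commits into it."""
    return ["git", "rebase", "--interactive", "--autosquash", f"{target}~1"]


def fix_previous_commit(target, repo="."):
    """Fold the currently staged changes into an earlier commit (not the root commit)."""
    subprocess.run(fixup_commit_cmd(target), cwd=repo, check=True)
    # GIT_SEQUENCE_EDITOR=true accepts the generated todo list as-is, so no editor opens.
    env = dict(os.environ, GIT_SEQUENCE_EDITOR="true")
    subprocess.run(autosquash_cmd(target), cwd=repo, check=True, env=env)
```

After staging the typo fix with `git add`, calling `fix_previous_commit("<commit>")` would rewrite the history so that the fix ends up inside the targeted commit. Since this rewrites commits, it is best kept to branches that have not been shared yet.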

Continue reading

Using Docker to simplify Cassandra development in JHipster

JHipster is an open source project that generates a fully working application in seconds. With minimal configuration, JHipster accelerates the start of new projects by integrating a frontend, a backend, security and a database.

Cassandra is one of the supported databases and JHipster generates all the configuration needed to access the cluster.

But it is often hard for developers to configure and maintain a local Cassandra cluster.

Moreover, there is no standard tooling to manage the schema migrations, like Liquibase or Flyway for SQL databases, making it difficult to synchronize the schema between every environment and a local configuration.

JHipster’s goal is to provide the simplest and most productive development environment out of the box, so a tool addressing these issues has been added in the latest version (3.4.0).

In this post, I’ll describe the design of the tool and the basic commands to use it.

Continue reading

A tour of Databricks Community Edition: a hosted Spark service

With the recent announcement of the Community Edition, it’s time to have a look at the Databricks Cloud solution. Databricks Cloud is a hosted Spark service from Databricks, the team behind Spark.

Continue reading

Testing strategy for Spark Streaming – Part 2 of 2

In a previous post, we saw why it’s important to test your Spark jobs and how to easily unit test a job’s logic, first by designing your code to be testable and then by writing unit tests.

In this post, we will look at applying the same pattern to another important part of the Spark engine: Spark Streaming.

Continue reading

Testing strategy for Apache Spark jobs – Part 1 of 2

Like any other application, Apache Spark jobs deserve good testing practices and coverage.

Indeed, the cost of running jobs with production data makes unit testing a must-do to get a fast feedback loop and discover errors earlier.

But because of its distributed nature and the RDD abstraction on top of the data, Spark requires special care for testing.

In this post, we’ll explore how to design your code for testing, how to set up a simple unit test for your job logic, and how the spark-testing-base library can help.

Continue reading