Mix Java and Scala in a Single IntelliJ Project Module

image created by author

So, you might wonder whether it is possible to combine Java and Scala in the same IntelliJ project.

The answer is yes. I have an Apache Spark project in IntelliJ that uses the standard Maven directory layout, with java and scala directories under src/main of the module directory.

Many posts online report the same strange issues when mixing the two languages in one project. You might be one of those people who feel uneasy approaching Scala because the problem has remained unresolved for years, with no articles explaining why.

In this post, I will show you how to achieve this…


Hands-on: Set up a Docker development environment on Windows 10

Image by Richard Sagredo. https://unsplash.com/photos/ZC2PWF4jTHc

If you are reading this article, you have probably already heard of Docker and decided to give it a try. You have probably also gone to the Docker website and are ready to choose a Docker engine for your development machine.


This is a tutorial to walk through the NLP model preparation pipeline: tokenization, sequence padding, word embeddings, and Embedding layer setups.

NLP model preparation steps, (created by the author)

Intro: why I wrote this post

Many state-of-the-art results in NLP problems are achieved with deep learning (DL), and you probably want to use deep learning to solve NLP problems as well. While there is a lot of material discussing how to choose and train the “best” neural network architecture, such as an RNN, selecting and configuring a suitable neural network is just one part of solving a practical NLP problem. The other important, but often underestimated, part is model preparation. NLP tasks usually require special data treatment in the model preparation stage. In other words, there are a lot of things to do…
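To make the first two preparation steps concrete, here is a minimal plain-Python sketch of tokenization and sequence padding. It is not the code from the article: the helper names (build_vocab, encode, pad), the "<pad>" and "<unk>" tokens, the whitespace tokenizer, and the two sample sentences are all assumptions made for illustration. Real projects usually rely on a library tokenizer instead.

```python
def build_vocab(texts):
    """Map every lowercase whitespace token to an integer id (hypothetical helper)."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids: padding and unknown words
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Turn a sentence into a list of token ids, using <unk> for unseen words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def pad(sequence, max_len, pad_id=0):
    """Pad at the end with pad_id, or truncate at the end, to exactly max_len ids."""
    return (sequence + [pad_id] * max_len)[:max_len]


texts = ["I love NLP", "Deep learning for NLP is fun"]  # toy corpus
vocab = build_vocab(texts)
padded = [pad(encode(text, vocab), max_len=5) for text in texts]
print(padded)  # [[2, 3, 4, 0, 0], [5, 6, 7, 4, 8]]
```

Every row now has the same length, which is what an embedding layer expects as input. The short sentence is filled with the padding id and the long one loses its last token.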


This article explains what a confusion matrix is and how to use it.

Photo by Emily Morter on Unsplash

Evaluation is an essential part of machine learning. The evaluation result tells us how well a particular machine learning algorithm performs. Evaluation also helps to explain why specific models behave the way they do and provides directions to improve performance.

In this series, we will focus on evaluating classification tasks, that is, identifying which classes certain instances belong to, given historical observations. I will explain the concepts behind several evaluation metrics and their application in Python. Here is the road map of this series:

  • Part I: confusion matrix.
  • Part II: accuracy, recall, precision, F1-score.
  • Part III: soft-metrics: ROC…
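As a quick taste of Part I, here is a minimal sketch of how a confusion matrix can be counted in plain Python. The function name, the "cat"/"dog" labels, and the toy predictions are made up for illustration and do not come from the article. Rows stand for the true class and columns for the predicted class, which is one common convention.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(true, pred)] for pred in labels] for true in labels]


# Toy labels, assumed for this example only.
y_true = ["cat", "cat", "dog", "dog", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "cat"]

matrix = confusion_matrix(y_true, y_pred, labels=["cat", "dog"])
print(matrix)  # [[2, 1], [1, 2]]
```

The diagonal holds the correct predictions (2 cats and 2 dogs), and the off-diagonal cells show which class was mistaken for which.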

lighter | faster | safer | flexible | fool-proof

Tutorials on using Pandas ‘category’ data type in Python

resource: Kung Fu Panda: The Paws of Destiny

Lately, I was working on a former Kaggle competition dataset, TalkingData Mobile User Demographics. The relevant data frames have a total size of 2GB. But when I tried to merge them into a single tabular data frame, BOOM! 2GB exploded to 18GB!! 😟

Luckily, after some simple tricks, I was able to shrink the dataset from 18GB to 5GB without losing any information or changing the data structure. The key was using the ‘category’ data type in Pandas.

Better than that, I realized the ‘category’ data type not only made the…
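Here is a minimal sketch of the memory effect on a toy column. It does not use the TalkingData files: the "brand" column and its four repeated values are assumptions made for illustration, and the exact byte counts depend on your Pandas version.

```python
import pandas as pd

# Hypothetical column: 100,000 rows that repeat only four distinct strings.
df = pd.DataFrame({"brand": ["alpha", "beta", "gamma", "delta"] * 25_000})

as_text = df["brand"]
as_category = df["brand"].astype("category")

# deep=True also counts the memory held by the string values themselves.
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"text column:     {text_bytes:,} bytes")
print(f"category column: {category_bytes:,} bytes")
```

The category version stores each distinct string once and keeps a small integer code per row, so the values are unchanged while the column takes far less memory when there are few distinct values.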


PLOTLY | GEOGRAPHIC | DATA VISUALIZATION | DATA SCIENCE | PYTHON

A step by step guide on exploratory data analysis and interactive dashboard presentation

Example of geographic data visualization using Plotly (in this post)

I have come across several projects that include geographic data, and I have been searching for the best tool for geo-visualization. As a Python user, I have tried Matplotlib’s Basemap, GeoPandas, Folium, and Bokeh, but none of them has given me a satisfying experience.

Finally, my search ended when I found Plotly, or to be more specific, Plotly’s Mapbox package. In this post, I will demonstrate a quick start for geographic data visualization using Plotly Mapbox and show why you should consider using it as well.

Why Plotly?

  • Visually appealing. You have to try hard to make it look ugly 😜.

Trying to use PCA but stuck at the processing stage? Check this out!

Siyun Wang (co-author of this article). A fluffy bird on a branch [water color] (2020). We used low-poly art to indicate the feeling of PCA, which is dimensionality reduction. But PCA can do more than that.

Intro of the series

In this series, we will explore the combination of data scaling and PCA. We would like to see how we can better prepare data for machine learning tasks whenever we come across a new dataset. The journey is composed of three parts.

  • Part I: Scalers and PCA
  • Part II: Meet outliers
  • Part III: Categorical data encoding

What we will do in this post

  1. Introduce/review the dataset to work on and the task
  2. Add synthetic outliers to the original dataset
  3. Perform scaling-transformation on the modified dataset
  4. Conduct PCA on the scaling-transformed dataset and evaluate the performance

What you will learn

  • Understand the importance of scalers and their close relationship with PCA
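To give a feel for why outliers matter to scalers, here is a minimal NumPy sketch. It is not the article's experiment: the random feature, the five injected outliers with value 50, and the two hand-written scalers (mean/standard deviation versus median/interquartile range) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=200)  # hypothetical feature

# Inject a few synthetic outliers far away from the bulk of the data.
with_outliers = clean.copy()
with_outliers[:5] = 50.0


def standard_scale(x):
    """Center on the mean and divide by the standard deviation."""
    return (x - x.mean()) / x.std()


def robust_scale(x):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)


# How spread out are the 195 normal points after each scaler?
inliers = slice(5, None)
standard_spread = standard_scale(with_outliers)[inliers].std()
robust_spread = robust_scale(with_outliers)[inliers].std()

print(f"inlier spread after standard scaling: {standard_spread:.3f}")
print(f"inlier spread after robust scaling:   {robust_spread:.3f}")
```

The five outliers inflate the standard deviation, so the mean-based scaler squeezes the normal points into a narrow band, while the median-based scaler leaves their spread largely intact. Since PCA looks for directions of large variance, this kind of distortion carries over into the components it finds.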

Trying to use PCA but stuck at the processing stage? Check this out!

Siyun Wang. A fluffy bird on a branch [water color] (2020). We used low-poly art to indicate the feeling of PCA, which is dimensionality reduction. But PCA can do more than that.

Intro of the series

In this series, we will explore the combination of data scaling and PCA. We would like to see how we can better prepare data for machine learning tasks whenever we come across a new dataset. The journey is composed of three parts.

  • Part I: Scalers and PCA
  • Part II: Meet outliers
  • Part III: Categorical data encoding

What we will do in this post

  1. Briefly review the background of scalers and PCA
  2. Introduce the dataset to work on and the task
  3. Perform scaling-transformation on the dataset
  4. Conduct PCA on the scaling-transformed dataset and evaluate the performance

What you will learn

  • Understand the importance of scalers and their close relationship with PCA
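The close relationship between scalers and PCA can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the article's code: the two-feature random dataset, the factor of 1000 between the feature scales, and the helper names are made up, and PCA is computed here directly from a singular value decomposition of the centered data.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of the total variance captured by each principal component."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def standardize(X):
    """Scale every feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


rng = np.random.default_rng(0)
# Hypothetical data: two independent features on very different scales.
X = np.column_stack([rng.normal(size=300), 1000.0 * rng.normal(size=300)])

raw_ratio = explained_variance_ratio(X)
scaled_ratio = explained_variance_ratio(standardize(X))

print("without scaling:", raw_ratio.round(3))
print("with scaling:   ", scaled_ratio.round(3))
```

Without scaling, the first component is almost entirely the large-scale feature, simply because its numbers are bigger. After standardization, the variance is shared much more evenly between the two components, which is why the choice of scaler changes what PCA finds.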

Kefei Mo

ML engineer / data storyteller / electrical engineer / digital illustrator / 3D modeler. I write a data science blog with my cousin.
