Benchmarking Golang and Java: Concurrency Performance Analysis for File Consumption

A Deep Dive into Concurrency: Golang vs Java — Unraveling Performance Differences in File Consumption

14 min read · Jun 3, 2023

Hello everyone. I recently had the opportunity to work on a performance issue while processing a file with Golang and Java. In the middle of that work a question came to mind: “If we implement the same task following concurrency best practices and run both in a similar environment, which one makes better use of resources?” So here we are :)

Difference between Java and Golang in terms of platform

You have probably read plenty of Java versus Golang comparisons already, so I will only give the brief summary we need to be on the same page. I could spend time on syntax differences and so on, but what matters here is:

JVM: Golang compiles ahead of time to a native binary and has no virtual machine. Java compiles to bytecode, which the JVM interprets at first and then optimizes at runtime with its JIT (just-in-time) compiler. If you want to learn a little bit more about how things work inside the JVM, I strongly recommend the talk The Java memory model explained by Rafael Winterhalter.
Threads: Golang implements lightweight threads called goroutines, which the Go runtime multiplexes onto a small number of OS threads. Classic Java threads map one-to-one to OS threads, which makes them comparatively heavy. If you want to learn more about it, I strongly recommend the talk Lightweight Threads by Ron Pressler.
Garbage Collector: The JVM offers several garbage collectors to choose from, whereas Golang ships with a single, deliberately simpler one: a concurrent mark-and-sweep collector that is neither generational nor compacting and exposes very few tuning options. It does most of its work while the program keeps running and stops the world only for short pauses. If you want to understand the GC types, I recommend you read Types of Garbage Collector in Java and Garbage Management: Java vs Go.

Explaining the Problem

The idea is to consume a CSV taking into consideration the following requirements:

The code should be prepared to process huge files (our example uses only a 19 MB file with ~100k lines).
We should read the file in chunks, which costs a little more CPU but keeps memory usage low.
We need to print all lines during the processing.
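To make the "read in chunks" requirement concrete, here is a minimal sketch, not the code from the repository. The names readChunks and chunkStrings are hypothetical. It reads fixed-size pieces and extends each one to the next newline so that no CSV line is split between two chunks.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readChunks reads r in pieces of about size bytes. Each piece is extended
// to the next newline so that a line never straddles two chunks.
func readChunks(r io.Reader, size int) ([][]byte, error) {
	br := bufio.NewReader(r)
	var chunks [][]byte
	for {
		buf := make([]byte, size)
		n, err := io.ReadFull(br, buf)
		buf = buf[:n]
		// A full buffer that stops mid-line: pull in the rest of that line.
		if n > 0 && err == nil && buf[n-1] != '\n' {
			rest, rerr := br.ReadBytes('\n')
			buf = append(buf, rest...) // rest is valid even when rerr is io.EOF
			if rerr != nil && rerr != io.EOF {
				return chunks, rerr
			}
		}
		if n > 0 {
			chunks = append(chunks, buf)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF || n == 0 {
			return chunks, nil // end of input (or size was 0)
		}
		if err != nil {
			return chunks, err
		}
	}
}

// chunkStrings is a convenience wrapper for in-memory input.
func chunkStrings(s string, size int) ([]string, error) {
	chunks, err := readChunks(strings.NewReader(s), size)
	out := make([]string, len(chunks))
	for i, c := range chunks {
		out[i] = string(c)
	}
	return out, err
}

func main() {
	chunks, err := chunkStrings("a,1\nb,2\nc,3\n", 5)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	for i, c := range chunks {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

With a 5-byte chunk size the first read stops in the middle of the second line, so the chunk is extended to the end of that line. Memory use is bounded by the chunk size plus one line, regardless of file size.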

Java Improvements with Native Compilation and GraalVM

Golang compiles statically to a native binary, so for a fair comparison we should take the JVM out of the picture. Fortunately, Java 9 introduced experimental Ahead-of-Time (AOT) compilation in JEP 295, which turns compiled Java classes into native code before the program starts instead of leaving that work to the JIT at runtime (the experimental tool was later removed from the JDK in version 17). GraalVM performs better for this purpose: during its AOT build it applies several very aggressive optimizations, such as static code analysis and dead-code removal, that is, it finds code that is never used and leaves it out. Dead-code removal plus compilation to native code yields a small executable with fast startup and low memory consumption. To understand more about this I recommend you take a look at the following articles:

Worker Pool Pattern

Okay, one important concept about threads: we can create many of them, but how many actually run in parallel depends on how many CPU cores we have, and cores are not infinite. So we need to decide how many threads will be available. Here comes the Worker Pool strategy: we set a fixed number of threads (workers) to handle a specific kind of task, and on top of that a worker pool manager distributes the data across them.
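As a small illustration of tying the pool size to the number of cores, here is a sketch under my own assumptions. The function runPool is hypothetical and is not part of the benchmark code. It starts a fixed number of goroutines and lets them pull job indices from a channel.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runPool starts a fixed number of workers, applies fn to every job and
// returns the results in the same order as the jobs.
func runPool(workers int, jobs []int, fn func(int) int) []int {
	if workers < 1 {
		workers = 1
	}
	results := make([]int, len(jobs))
	indexes := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is received by exactly one worker, so every
			// slot of results is written by a single goroutine.
			for i := range indexes {
				results[i] = fn(jobs[i])
			}
		}()
	}

	for i := range jobs {
		indexes <- i
	}
	close(indexes)
	wg.Wait()
	return results
}

func main() {
	workers := runtime.NumCPU() // one worker per logical CPU
	out := runPool(workers, []int{1, 2, 3, 4, 5}, func(n int) int { return n * 2 })
	fmt.Println(out) // [2 4 6 8 10]
}
```

One worker per logical CPU is a common starting point for CPU-bound work. For work that mostly waits on I/O a larger pool may pay off, so treat the number as something to measure rather than a rule.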

Implementation Design

The implementation is pretty simple:

We need to have a process that will take the data and put it in a queue.
After that, we need a Worker Pool manager that splits the tasks between the threads (workers), e.g. if each worker processes 10 lines at a time, the manager hands out the data in batches of 10 lines.
Finally, each worker should put the result in another queue of results.
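The three steps above can be sketched as follows. This is an illustration only: splitBatches and processAll are hypothetical names, and "processing" a line here just means counting its comma-separated fields.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// splitBatches cuts lines into consecutive batches of at most size lines.
func splitBatches(lines []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var batches [][]string
	for start := 0; start < len(lines); start += size {
		end := start + size
		if end > len(lines) {
			end = len(lines)
		}
		batches = append(batches, lines[start:end])
	}
	return batches
}

// processAll queues the batches, lets the workers consume them and sums
// what the workers put on the results queue.
func processAll(lines []string, workers, batchSize int) int {
	batches := splitBatches(lines, batchSize)
	jobs := make(chan []string, len(batches)) // step 1: the input queue
	results := make(chan int, len(batches))   // step 3: the results queue

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ { // step 2: a fixed set of workers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range jobs {
				fields := 0
				for _, line := range batch {
					fields += len(strings.Split(line, ","))
				}
				results <- fields
			}
		}()
	}

	for _, b := range batches {
		jobs <- b
	}
	close(jobs)
	wg.Wait()
	close(results)

	total := 0
	for n := range results {
		total += n
	}
	return total
}

func main() {
	lines := make([]string, 25)
	for i := range lines {
		lines[i] = fmt.Sprintf("%d,first,last", i)
	}
	fmt.Println(len(splitBatches(lines, 10)), processAll(lines, 3, 10)) // 3 75
}
```

Both queues are buffered with room for every batch, which is what allows the manager to enqueue everything and wait for the workers before it reads any result.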

Golang Implementation

Golang makes this pleasant thanks to a special data type called channel, which lets goroutines share data in a thread-safe way. Let's see a simple example, adapted from gobyexample:

package main

import (
 "fmt"
 "sync"
 "time"
)

func worker(id int, jobs <-chan int, results chan<- int, wg *sync.WaitGroup) {
 defer wg.Done()

 for j := range jobs {
  fmt.Println("worker", id, "started  job", j)
  time.Sleep(time.Second)
  fmt.Println("worker", id, "finished job", j)
  results <- j * 2
 }
}

func main() {
 const numJobs = 5
 jobs := make(chan int, numJobs)
 results := make(chan int, numJobs)

 var wg sync.WaitGroup

 for w := 1; w <= 3; w++ {
  wg.Add(1)
  go worker(w, jobs, results, &wg)
 }

 for j := 1; j <= numJobs; j++ {
  jobs <- j
 }
 close(jobs)

 wg.Wait() // Wait for all workers to finish

 for a := 1; a <= numJobs; a++ {
  <-results
 }
}

I changed the original example slightly to use sync.WaitGroup, which plays the role that CountDownLatch plays in Java. Calling wg.Wait() before draining results is safe here only because results is buffered with room for every job.
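When the number of results is not known up front, a common variation is to leave the results channel unbuffered and close it from a separate goroutine once the WaitGroup reaches zero. The sketch below shows that pattern. The function doubleAll is a hypothetical name and the sleep from the example above is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// doubleAll sends the numbers 1..numJobs to the workers and returns the
// sum of the doubled values. It assumes workers >= 1.
func doubleAll(numJobs, workers int) int {
	jobs := make(chan int)
	results := make(chan int) // unbuffered: workers block until main reads

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- j * 2
			}
		}()
	}

	// Producer: feed the jobs and signal the end by closing the channel.
	go func() {
		for j := 1; j <= numJobs; j++ {
			jobs <- j
		}
		close(jobs)
	}()

	// Closer: once every worker is done, no more results can arrive.
	go func() {
		wg.Wait()
		close(results)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(doubleAll(5, 3)) // 30
}
```

The main goroutine drains results while the workers are still running, so nothing depends on the channel capacity.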

Java Implementation

In Java, this strategy is called a Thread Pool, and the standard library already ships an implementation that helps us achieve the same result. Let's see the same example written in Java:


import java.util.concurrent.*;

public class WorkerPoolPatternExample{
    static class Worker implements Runnable {
        private final int id;
        private final BlockingQueue<Integer> jobs;
        private final BlockingQueue<Integer> results;
        private final CountDownLatch latch;

        public Worker(int id, BlockingQueue<Integer> jobs, BlockingQueue<Integer> results, CountDownLatch latch) {
            this.id = id;
            this.jobs = jobs;
            this.results = results;
            this.latch = latch;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    Integer job = jobs.take(); // Blocking until a job is available
                    if (job == -1) { // -1 is a special value indicating no more jobs
                        break;
                    }
                    System.out.println("Worker " + id + " started job " + job);
                    Thread.sleep(1000); // sleep for 1 second
                    System.out.println("Worker " + id + " finished job " + job);
                    results.put(job * 2);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the interrupt flag instead of swallowing it
            } finally {
                latch.countDown();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int numJobs = 5;
        BlockingQueue<Integer> jobs = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> results = new LinkedBlockingQueue<>();
        CountDownLatch latch = new CountDownLatch(3); // 3 worker threads

        ExecutorService executor = Executors.newFixedThreadPool(3);
        for (int w = 1; w <= 3; w++) {
            executor.execute(new Worker(w, jobs, results, latch));
        }

        for (int j = 1; j <= numJobs; j++) {
            jobs.put(j);
        }
        
        // After all jobs are enqueued, add special value to signal workers that no more jobs are coming
        for (int w = 1; w <= 3; w++) {
            jobs.put(-1);
        }

        executor.shutdown(); // It will stop accepting new tasks
        latch.await(); // Wait for all workers to finish

        for (int a = 1; a <= numJobs; a++) {
            Integer result = results.take();
            System.out.println("Result for job " + a + " is " + result);
        }
    }
}

Java has no channel type, so to achieve the same result we use LinkedBlockingQueue, a queue that is safe to use from multiple threads and so avoids race conditions when several workers consume from it. Because a queue cannot be closed the way a channel can, a sentinel value (-1) tells each worker that no more jobs are coming. The ExecutorService is our Worker Pool manager: it caps the number of threads (workers) available to handle the tasks.

Let's go to the real implementation

All the code is available in my GitHub repository.

In this implementation we will have a CSV with 100k lines with the following headers:

seq,namefirst,namelast,age,street,city,state,zip,dollar,pick,date

Golang implementation

Following the last examples let’s start with the Golang implementation:


func main() {

 const fileName = "./data_100000.csv"

 file, err := os.Open(fileName)

 if err != nil {
  log.Fatalf("unable to open the file: %v", err)
 }

 defer file.Close()
...
}

In the main function we start by opening the file with the os package.

func main() {
...
 errorLines := make(chan ErrorLine)
 go Process(file, errorLines)
...
}

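After that, we create a channel of errors that gets populated if something goes wrong during processing, and pass it as a parameter to the Process function. Because the channel is unbuffered, main has to keep receiving from it until Process closes it.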
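The receiving side is elided in the snippet above, so here is a minimal sketch of what draining such a channel can look like. The ErrorLine struct is my assumption based on the fields used later in ProcessChunk, and produce and collectErrors are hypothetical helpers, not functions from the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorLine mirrors the fields used in ProcessChunk (assumed definition).
type ErrorLine struct {
	Line  string
	Error error
}

// produce stands in for Process: it reports every empty line and closes
// the channel when it is done.
func produce(lines []string, errorLines chan<- ErrorLine) {
	defer close(errorLines)
	for _, l := range lines {
		if l == "" {
			errorLines <- ErrorLine{Line: l, Error: errors.New("empty line")}
		}
	}
}

// collectErrors receives until the producer closes the channel.
func collectErrors(errorLines <-chan ErrorLine) []ErrorLine {
	var out []ErrorLine
	for e := range errorLines {
		out = append(out, e)
	}
	return out
}

func main() {
	errorLines := make(chan ErrorLine)
	go produce([]string{"1,Ana", "", "3,Bob", ""}, errorLines)

	for _, e := range collectErrors(errorLines) {
		fmt.Printf("line %q: %v\n", e.Line, e.Error)
	}
}
```

The range loop ends only when the channel is closed, which is why the producer closes it after its last send. Without a receiver, the first send on an unbuffered channel would block the producer forever.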

func Process(f *os.File, errorLines chan<- ErrorLine) {

 linesPool := sync.Pool{
  New: func() interface{} {
   lines := make([]byte, LinesPerWorker)
   return lines
  },
 }

 stringPool := sync.Pool{
  New: func() interface{} {
   lines := ""
   return lines
  },
 }

 r := bufio.NewReader(f)

 var wg sync.WaitGroup

 for {
  buf := linesPool.Get().([]byte)

  n, err := r.Read(buf)
  buf = buf[:n]

  if n == 0 {
   if err != nil && err != io.EOF {
    log.Printf("Error reading file: %v", err)
   }
   break
  }

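  // Extend the chunk to the end of the current line. ReadBytes returns the
  // bytes it read even when it hits io.EOF, so a final line without a
  // trailing newline must still be appended.
  nextUntilNewline, err := r.ReadBytes('\n')

  if err != nil && err != io.EOF {
   log.Printf("Error reading file: %v", err)
  }
  buf = append(buf, nextUntilNewline...)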

  wg.Add(1)
  go func() {
   ProcessChunk(buf, &linesPool, &stringPool, errorLines)
   wg.Done()
  }()

 }
 wg.Wait()
 close(errorLines)
}

This function plays the role of our Worker Pool manager: it reads the file and hands the buffers to the workers. Two things differ from the earlier example. Instead of a fixed set of workers fed through a channel, it starts one goroutine per chunk, and the byte buffers come from a sync.Pool. sync.Pool is Golang's built-in object pool: it caches temporary objects for reuse, which avoids the allocation cost and GC pressure of creating them over and over. In other words, using it should make the program more efficient in terms of memory.
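For readers who have not used sync.Pool before, here is a minimal sketch that is independent of the benchmark code. The names bufPool, countLines and countLinesIn are hypothetical. It borrows a buffer, uses it to count newlines in a reader, and returns it to the pool.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
	"sync"
)

// bufPool hands out reusable 4 KiB buffers. Storing a pointer to the
// slice avoids an extra allocation each time the buffer is put back.
var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 4096)
		return &b
	},
}

// countLines counts '\n' bytes in r using a pooled read buffer.
func countLines(r io.Reader) (int, error) {
	bp := bufPool.Get().(*[]byte)
	defer bufPool.Put(bp) // hand the buffer back for the next caller
	buf := *bp

	total := 0
	for {
		n, err := r.Read(buf)
		total += bytes.Count(buf[:n], []byte{'\n'})
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// countLinesIn is a convenience wrapper for in-memory input.
func countLinesIn(s string) (int, error) {
	return countLines(strings.NewReader(s))
}

func main() {
	n, err := countLinesIn("a,1\nb,2\nc,3\n")
	fmt.Println(n, err) // 3 <nil>
}
```

A pool is only a cache: the runtime may drop pooled objects during garbage collection, so New must always be able to build a fresh one, and a buffer must not be touched after it has been put back.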

func ProcessChunk(chunk []byte, linesPool *sync.Pool, stringPool *sync.Pool, errorLines chan<- ErrorLine) {

 var wg2 sync.WaitGroup

 entries := stringPool.Get().(string)
 entries = string(chunk)

 linesPool.Put(chunk)

 entriesSlice := strings.Split(entries, "\n")

 stringPool.Put(entries)

 chunkSize := 300
 n := len(entriesSlice)
 noOfThread := n / chunkSize

 if n%chunkSize != 0 {
  noOfThread++
 }

 for i := 0; i < noOfThread; i++ {
  wg2.Add(1)
  go func(start int, end int) {
   defer wg2.Done()
   for i := start; i < end; i++ {
    text := entriesSlice[i]
    if len(text) == 0 {
     continue
    }
    entry := strings.Split(text, ",")

    // Check for required fields
    for fieldPos, required := range requiredFields {
     if required && (len(entry) <= fieldPos || entry[fieldPos] == "") {
      errorLines <- ErrorLine{
       Line:  text,
       Error: fmt.Errorf("missing required field at position %d", fieldPos),
      }
      break
     }
    }
   }
  }(i*chunkSize, int(math.Min(float64((i+1)*chunkSize), float64(len(entriesSlice)))))
 }

 wg2.Wait()
}

This method will receive:

chunk: A byte slice containing the data to be processed.
linesPool: A pool of byte slices that can be reused to avoid allocating a new buffer for each chunk. This is an optimization to reduce memory allocations.
stringPool: Intended as the same optimization for strings. Note that Go strings are immutable and string(chunk) always allocates a fresh copy, so this pool does not actually save an allocation. What the copy does give us is that the byte buffer can go back to linesPool right away.
errorLines: A channel to which any lines that cause errors during processing are sent.
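The required-fields check inside ProcessChunk relies on a requiredFields variable that is not shown in the article. The sketch below assumes it is a slice of booleans indexed by column position and pulls the check out into a hypothetical validateLine function.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredFields marks which column positions must be present and
// non-empty. The values here are made up for the example.
var requiredFields = []bool{true, true, false, true}

// validateLine returns an error for the first required field that is
// missing or empty, and nil when the line is complete.
func validateLine(text string) error {
	entry := strings.Split(text, ",")
	for pos, required := range requiredFields {
		if required && (len(entry) <= pos || entry[pos] == "") {
			return fmt.Errorf("missing required field at position %d", pos)
		}
	}
	return nil
}

func main() {
	for _, line := range []string{"1,Ana,,30", "2,,Silva,41", "3,Bob"} {
		fmt.Printf("%q -> %v\n", line, validateLine(line))
	}
}
```

Splitting on commas is enough for the sample data, but it does not understand quoted fields that contain a comma. The standard encoding/csv package handles those cases.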

The interesting math calculation in this function is related to splitting the work into smaller chunks that can be processed in parallel. Here’s the explanation:

The function divides the lines of the chunk into smaller pieces of chunkSize lines each and assigns each piece to a separate goroutine for processing.

n is the total number of lines in the chunk.
chunkSize is the number of lines that each goroutine should process.
noOfThread is the number of goroutines needed to process all lines. It is calculated as the total number of lines divided by the number of lines per goroutine (n / chunkSize).

If n is not exactly divisible by chunkSize then an additional goroutine is needed to process the remaining lines. This is what the if n%chunkSize != 0 { noOfThread++ } code is for.

Then, for each goroutine, it calculates the start and end line indices that this goroutine should process:

start is i * chunkSize.
end is the minimum of ((i + 1) * chunkSize) and n.

This ensures that each goroutine processes exactly chunkSize lines, except possibly the last goroutine which might process fewer lines if n is not exactly divisible by chunkSize.

In the code, these two values are passed as the arguments of the anonymous function. math.Min works on float64 values, which is why both operands are converted to float64 and the result is converted back to int.

Deep dive into the step-by-step of ProcessChunk

Let’s break down the concept using simple numbers:

Assume we have:

A data chunk that, when split by newlines (“\n”), results in n = 1000 lines of data.
Each goroutine will process chunkSize = 300 lines.

The first step is to calculate the number of goroutines needed to process all lines:

noOfThread := n / chunkSize

This gives us 1000 / 300 = 3.333. Since both operands are integers, Go performs integer division and truncates the result, which leaves us with 3 threads.

However, we have a total of 1000 lines, and only 3 threads that each process 300 lines. This only accounts for 900 lines, so we still have 100 lines unaccounted for. That's why we check if n modulo chunkSize is not 0, and if true, increment noOfThread by 1:

if n%chunkSize != 0 {
    noOfThread++
}

This results in 4 threads (or goroutines). Now we have enough threads to process all 1000 lines.

Then, for each goroutine, we calculate the start and end line indices. For example:

For i = 0 (the first thread), start = 0 * 300 = 0, end = min((0 + 1) * 300, 1000) = min(300, 1000) = 300.
For i = 1 (the second thread), start = 1 * 300 = 300, end = min((1 + 1) * 300, 1000) = min(600, 1000) = 600.
For i = 2 (the third thread), start = 2 * 300 = 600, end = min((2 + 1) * 300, 1000) = min(900, 1000) = 900.
For i = 3 (the fourth thread), start = 3 * 300 = 900, end = min((3 + 1) * 300, 1000) = min(1200, 1000) = 1000.

Thus, the 4 goroutines process lines 0-299, 300-599, 600-899, and 900-999 respectively: 300 lines each, except the last one, which processes 100. I hope the chunking is easier to understand now.
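The same arithmetic can be checked with a few lines of code. This is a sketch with hypothetical names (span, chunkBounds) that uses integer arithmetic only, so no float64 conversion is needed. It assumes chunkSize is greater than zero.

```go
package main

import "fmt"

// span is a half-open range [start, end) of line indices.
type span struct{ start, end int }

// chunkBounds splits n lines into pieces of at most chunkSize lines.
func chunkBounds(n, chunkSize int) []span {
	count := (n + chunkSize - 1) / chunkSize // integer ceiling of n / chunkSize
	spans := make([]span, 0, count)
	for i := 0; i < count; i++ {
		end := (i + 1) * chunkSize
		if end > n {
			end = n // the last piece may be shorter
		}
		spans = append(spans, span{i * chunkSize, end})
	}
	return spans
}

func main() {
	for i, s := range chunkBounds(1000, 300) {
		fmt.Printf("goroutine %d: lines %d-%d (%d lines)\n", i, s.start, s.end-1, s.end-s.start)
	}
	// goroutine 0: lines 0-299 (300 lines)
	// goroutine 1: lines 300-599 (300 lines)
	// goroutine 2: lines 600-899 (300 lines)
	// goroutine 3: lines 900-999 (100 lines)
}
```

The expression (n + chunkSize - 1) / chunkSize computes the rounded-up division in one step and replaces the separate modulo check.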

Java implementation

The Java implementation is a little bit simpler than the Golang one:

public static void main(String[] args) throws IOException, InterruptedException {
    Path csvFile = Paths.get("./data_100000.csv");
    ExecutorService executor = Executors.newFixedThreadPool(10);
}

We point to the file and define the thread pool.

public static void main(String[] args) throws IOException, InterruptedException {
...
    try (BufferedReader reader = Files.newBufferedReader(csvFile)) {
        String line;
        while ((line = reader.readLine()) != null) {
            final String currentLine = line; // final copy that the lambda can capture
            executor.execute(() -> processLine(currentLine));
        }
    }
...
}

We create the BufferedReader, and each line read from the file is submitted to the ExecutorService as a task to be processed.

private static void processLine(String line) {
    String[] fields = line.split(",");
    boolean hasError = false;
    for (int fieldPos : REQUIRED_FIELDS) {
        if (fields.length <= fieldPos || fields[fieldPos].isEmpty()) {
            hasError = true;
            break;
        }
    }
    if (hasError) {
        errorLines.add(line + ", error: missing required field(s)");
    }
}

The processLine function, which is executed by a worker thread, checks whether the line has all the required fields. If any required field is missing, the line is added to errorLines.

Creating the Native Image

To create a native image you have to install GraalVM.

I’ve created the nativeCompilation.sh script, which executes the following steps:

javac -d build src/main/java/com/github/thukabjj/javaworkerpool/WorkerPoolPattern.java

javac is the Java compiler command that transforms source code (.java files) into Java Bytecode (.class files):

-d build is an option that sets the destination directory for class files. The directory build is where the compiled .class files will be placed.
The final part is the path to the .java file that you want to compile. Here, it’s the WorkerPoolPattern.java file.

jar cfvm WorkerPoolPattern.jar META-INF/MANIFEST.MF -C build .

jar is the Java Archive command, used to pack class files and related resources into a .jar file.
cfvm is a combination of options:
c creates a new archive
f specifies the name of the .jar file that's being created
v generates verbose output to standard output
m includes manifest information from the specified manifest file (here, it's META-INF/MANIFEST.MF).
WorkerPoolPattern.jar is the name of the .jar file that's being created.
-C build . changes to the specified directory (build) and includes all the files in it (indicated by .) in the .jar file.

native-image -jar WorkerPoolPattern.jar

native-image is a utility that comes with GraalVM, a high-performance runtime that can also compile Java applications ahead of time into standalone executables, which makes it a good fit for microservices.
-jar WorkerPoolPattern.jar tells native-image to create a native image from the WorkerPoolPattern.jar file.
The resulting native image is a binary file that can be directly executed without the need for the JVM, leading to faster startup times. This is particularly beneficial for applications where startup time is critical, such as serverless functions or command-line applications.

Let’s analyze the numbers

To gain deeper insights into the performance of our Go and Java implementations in the worker pool pattern, we created the go_results.csv and java_results.csv files. These files contain the execution time, CPU usage, and memory usage data collected during 100 iterations of each implementation. To make sense of these numbers and visualize the trends, we developed the analyze_results.py script.

The motivation behind creating the Python files was twofold. First, we wanted to extract meaningful metrics from the raw data to gain a comprehensive understanding of the performance characteristics of the worker pool implementations. Second, we aimed to provide visual representations of these metrics to aid in the interpretation and comparison of the results.

To achieve these goals, we leveraged the power of popular Python libraries. The pandas library played a crucial role in efficiently handling the CSV data, performing data manipulations, and calculating metrics such as execution time and memory usage difference. Additionally, we utilized the matplotlib library to create visually appealing line plots that depict the execution time, CPU usage, and memory usage difference for both Go and Java implementations.

By developing these Python files, we were able to analyze the numbers effectively and gain valuable insights into the performance characteristics of the worker pool pattern in Go and Java. The combination of data manipulation and visualization libraries enabled us to extract, process, and present the results in a meaningful way, facilitating a comprehensive comparison between the two implementations.

Conclusion

Based on the analysis of the performance results obtained from running the Java and Go implementations of the Worker Pool pattern, several observations can be made.

First, in terms of execution time, both Java and Go demonstrate impressive efficiency. The average execution time for both implementations hovers around 0.13–0.15 seconds, indicating that they can process a substantial number of tasks in a short span of time. However, Go shows a slightly lower average execution time compared to Java, suggesting that it may have a slight performance advantage in this aspect.

Second, when considering CPU usage, both Java and Go report minimal CPU utilization throughout the execution of the worker pool tasks. The measured CPU usage stays at 0% for both implementations. For runs that last only about 0.13–0.15 seconds, a flat 0% more likely means that the sampling was too coarse to capture the work than that no CPU was used, so this metric does not separate the two implementations.

Lastly, the analysis of memory usage reveals interesting insights. Java shows relatively constant memory usage throughout the execution, with a memory usage difference ranging from approximately 25–35 MB. In contrast, Go exhibits a more variable memory usage, with a difference ranging from 100–200 MB. This difference can be attributed to the underlying memory management mechanisms and runtime environments of Java and Go.

Overall, both Java and Go demonstrate strong performance in the Worker Pool pattern implementation. While Go showcases a slightly better execution time, both implementations showcase efficient resource utilization with minimal CPU usage. The variation in memory usage suggests that developers should consider the specific memory requirements and trade-offs associated with each language when implementing the Worker Pool pattern in their projects.

In conclusion, the comparison of Java and Go in the Worker Pool pattern implementation highlights the strengths and nuances of each language. Developers should carefully evaluate the specific requirements and characteristics of their projects to determine which language is best suited for their use case. By leveraging the concurrency and parallelism features of Java and Go, developers can achieve efficient task processing and optimal resource utilization in their applications.