Binh (Ben) Nguyen

DevOps Engineer at IBM

Hello! I am Ben, a DevOps Engineer (architect level) at IBM. I have over twelve years of hands-on experience in designing, developing, testing, deploying, and managing scalable, secure, and reliable SaaS solutions for various clients and industries. I have a Master in Computer Science from the National University of Singapore, where I graduated with honors and received a full scholarship for ASEAN nationals.

I have strong competencies in DevOps/Cloud operations and backend development, using a wide range of languages, frameworks, platforms, and tools. I have contributed to projects for industry giants such as IBM (Hybrid Cloud Transformation Group), Accenture (Cloud First Group), and NEC. My expertise  also extends to high-performance cluster computing, acquired during my tenure at the Institute of High-Performance Computing (Agency for Science, Technology, and Research of Singapore).

I am open to exploring exciting projects and opportunities. Reach out to me on LinkedIn to initiate the conversation!


• Master in Computer Science (Honors, NUS GSA Scholar), National University of Singapore (2015 - 2017)
• Bachelor in Computer Engineering (First Class Honors), Vietnam National University HCMC (2007 - 2012)


Below are some typical projects (non-confidential) in which I am the key contributor.

Data Platform for Streamed Log Management

The team has built a platform to automate data collection process as well as to manage and process the collected data. I am responsible for maintaining and improving this platform, with a heavy focus on data management, cloud migration, and workflow automation.

As the platform consists of a mix of complicated software with different components and tiers interact with each other, I had to consider various kinds of trade-off between storage, computation, and network. I leveraged ELK stack (Elasticsearch, Logstash, and Kibana) for powering search and analytics of various kinds of transport logs in different file types. By providing computing, storage, and network infrastructure that can run high performance applications, the cloud-based migration solution could provide a significant speed-up ratio. In addition to this, as container-based virtualization techniques are becoming an important option due to their lightweight operation and cost-effective scalability, I implemented the containerization technique Docker to automate our data processing workflow. These improvements made our data-intensive platform operate better with the growth of data collected everyday and the increase of workload.

Resource Allocation and Scheduling

High-load data processing jobs constitute a significant portion of load on multi-tenant clusters and these jobs are expected to be completed before a predetermined deadline. However, predicting the amount of resources needed to achieve this goal is troublesome, mainly because of the unpredictability in resource availability and application behavior. Common approaches typically mitigate this problem by over-allocation of resources, in turn resulting in poor cluster utilization.

I developed a framework to resolve this tension between predictability and utilization. The framework was built on three key modules: a learning module, a scheduling module, and a dynamic reprovisioning module. First, the learning module automatically derive Service Level Objectives (SLO) and job resource models from log data. Second, the scheduling module enforce SLOs by isolating jobs from sharing-induced performance variability. In this scheduling module, the online arriving jobs with different periods need to be packed efficiently. To achieve this goal, I designed and implemented a online packing algorithm for periodic jobs. Finally, the dynamic reprovisioning module track real-time progress of a job and adapts allocation based on its progress to meet the job's objectives. I validated my design and implementation against historical logs from a 500 node cluster, and showed that my framework could reduce the amount of deadline violations by 5 times to 10 times, while lowering cluster footprint by 10% to 30%.

MapReduce Tuning

MapReduce is a prevalent distributed and parallel computing paradigm for large-scale data-intensive applications. However, tuning the MapReduce job parameters is a challenging task as the parameter configuration space is large with more than 70 parameters, and traditional offline tuning system requires multiple test runs to identify a desirable configuration.

I implemented an online performance tuning system to mitigated the aforementioned problems. The system monitors job performance counters and system logs dynamically, tunes the parameters according to the collected statistics, and changes the configuration dynamically during job execution. Thus, instead of having the same configuration for all tasks, each task will have a different configuration. I implemented the system based on YARN, a second-generation Hadoop implementation. The parameters on which I focused are the container CPU and memory allocation for every map and reduce task, the memory allocation of sort buffer in the map stage, the shuffle buffer and reduce buffer in the reduce stage. Moreover, I designed a hill climbing algorithm that searches the configuration space systematically and converges to a near-optimal parameter configurations. I evaluated this dynamic online tuning system on several benchmarks (text search, wordcount, bigram, etc), comparing the results against offline tuning and default parameter configuration. Experiment results showed that my mechanism can improve performance by 20%.

Heuristic Search and Optimization

Neuromorphic processors with a manycore architecture are emerging as promising alternatives to traditional computing platforms for brain inspired computing. The increasing trend of employing a large number of neurons in neuromorphic networks such as spiking neural networks has made it necessary for chips to contain a few hundred to thousands of cores. Due to the large core count and the exponential number of mappings, it is imperative to study how large spiking neural networks can be mapped efficiently onto a neuromorphic processor so as to achieve optimal system performance.

I developed a network-on-chip simulator and proposed a heuristic-based optimization framework framework for mapping logical cores onto physical cores to achieve an optimal clock ratio metric. The framework can handle different types of network topologies which are employed for inter-core communication. As finding an optimal mapping is an NP-hard problem, I used a heuristic algorithm to rapidly search through the design space to find near-optimal solutions, and where possible, compare them with the optimal solution obtained by a branch and bound method. The heuristic algorithm based on the Kernighan-Lin (KL) partitioning algorithm. The framework consists of an "initial mapping" stage followed by "partitioning" and "customization". In the "initial mapping" stage, an initial throughput-aware mapping is generated. By virtue of the KL partitioning algorithm, the initial mapping is then recursively bi-partitioned based on the "partitioning" stage. Between two partitions, I applied a "customization" stage to find a mapping with lowest possible clock ratio. The "partitioning" and "customization" stages are repeated reaching the leaf node. In the experiments, I showed that my heuristic is able to achieve better solution with lower runtime compared to existing techniques. I also demonstrated the scalability of my solution on large networks containing more than 3500 cores.

Thank you for stopping by my homepage! Wishing you a wonderful day ahead!
Feel free to connect with me on LinkedIn. I am always open to chat and explore new opportunities.