Wednesday, June 8, 2016

Wednesday, May 11, 2016

Urdu Digit 6 related Bug in CRULP Urdu Phonetic Keyboard Layout v1.1 for Windows 10 (x64)


Today, while working on Urdu digits, I stumbled upon a bug related to Urdu Digit 6 in CRULP Urdu Phonetic Keyboard Layout v1.1 for Windows 10 (x64). If you try to type Digit 6, you will end up typing Persian Digit 6 (۶) instead of Urdu Digit 6 (٦).
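One way to confirm which character a key press actually produced is to print its Unicode code point and name. The sketch below is not part of the original bug report; it assumes the two glyphs quoted above are U+06F6 and U+0666 respectively, and the class name `DigitSixCodePoints` is only illustrative.

```java
final class DigitSixCodePoints {

    /** Formats the first code point of the given string as "U+XXXX NAME". */
    static String describe(String s) {
        int codePoint = s.codePointAt(0);
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // Assumed to be the character the keyboard layout produces.
        System.out.println(describe("\u06F6"));
        // Assumed to be the character the post expects.
        System.out.println(describe("\u0666"));
    }
}
```

Pasting the typed character into `describe(...)` instead of the escapes shows which of the two code points the layout emitted on a given machine.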

My Environment:

Operating System: Microsoft Windows 10 Enterprise
System Type: 64-bit Operating System, x64-based processor
Processor: Intel® Core™ i5-2450M CPU @ 2.50GHz 2.50GHz

I’ve reported this bug to the following email addresses, which I found on the CLE website:

sarmad DOT hussain AT kics DOT edu DOT pk
sarmad AT cantab DOT net
kamran DOT khan AT kics DOT edu DOT pk

Can you please help by verifying this bug on your machine? Simply mention in the comments whether or not you were able to reproduce it, along with your system environment details in the format I used above.

Towards Urdu Corpus: Mining Wikipedia Urdu using Wikiforia Parser


Corpus collection is the first step before you even think of machine learning and linguistics. While there have been serious, concerted efforts and progress in compiling and publishing corpora for other languages, Urdu is nowhere to be seen in this context. Why? That warrants a separate detailed post, which I will inshaAllah write some time in the near future.

In this post, I want to share my first-hand experience of collecting a sizable amount of raw Urdu plain text from the open-source Wikipedia, so that I can not only use it in my research but also publish it under an open-source license for others to benefit from.

Wikipedia publishes complete database backup dumps of

“all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available”. “These snapshots are provided at the very least monthly and usually twice a month.”

As I was interested in Wikipedia Urdu, I downloaded the following two files from the Wikipedia Urdu Database Backup Dumps page:

The next task was to extract the actual page content from these dump files, which follow a specific schema. Wikipedia has a comprehensive list of open-source parsers written in different programming languages and published under different types of open-source licenses. Because my goal was to collect open-source Urdu plain “Text” and my programming language of choice was “Java”, I opted for Wikiforia.

“Wikiforia is a library and a tool for parsing Wikipedia XML dumps and converting them into plain text for other tools to use.”

So, I began by cloning the Wikiforia GitHub repository locally on my laptop and then ran the following command in the terminal:

java -jar wikiforia-1.2.1.jar -pages urwiki-20160501-pages-articles-multistream.xml.bz2 -output output.xml

The program worked perfectly and uses concurrency (one thread per logical core) to speed up processing. It took a few minutes to complete the task; however, the output was not pure plain text. In fact, it was a simplified form of XML that looks like:


At that time I had two options:

  1. Run another tool to convert this output file from XML to Plain Text, or
  2. Add custom implementation to Wikiforia to make it output pure Plain Text

I opted for the second one for two reasons: first, I didn’t want to waste additional time and processor cycles processing the generated output once again; second, I thought there might be others who would benefit from this modification.

So I forked Wikiforia and added a new Sink implementation. I also had to modify the main program to add CLI support for an additional switch, “outputformat”, which defaults to “xml” and accepts only two values for now: “xml” and “plain-text”. Once I had done that, I submitted a pull request to the Wikiforia GitHub repository, in case the maintainers decide to merge the patch into the original repository.
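The idea of a pluggable sink selected by an “outputformat” switch can be sketched roughly as follows. This is a self-contained illustration only: `Page`, `PageSink`, `PlainTextSink`, `XmlSink` and `sinkFor` are hypothetical names and do not reproduce Wikiforia's actual classes, and the XML markup here is a placeholder rather than Wikiforia's real output schema.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

final class PlainTextSinkSketch {

    /** Hypothetical parsed page; not Wikiforia's real type. */
    static final class Page {
        final String title;
        final String text;
        Page(String title, String text) { this.title = title; this.text = text; }
    }

    /** Hypothetical sink contract: receives parsed pages one at a time. */
    interface PageSink {
        void accept(Page page);
    }

    /** Writes the title, then the body, then a blank line between pages. */
    static final class PlainTextSink implements PageSink {
        private final PrintWriter out;
        PlainTextSink(PrintWriter out) { this.out = out; }
        @Override public void accept(Page page) {
            out.print(page.title + "\n" + page.text + "\n\n");
        }
    }

    /** Placeholder XML writer (no escaping); stands in for the default format. */
    static final class XmlSink implements PageSink {
        private final PrintWriter out;
        XmlSink(PrintWriter out) { this.out = out; }
        @Override public void accept(Page page) {
            out.print("<page title=\"" + page.title + "\">" + page.text + "</page>\n");
        }
    }

    /** Picks a sink from the switch value; a missing value falls back to "xml". */
    static PageSink sinkFor(String outputFormat, PrintWriter out) {
        String format = (outputFormat == null) ? "xml" : outputFormat;
        switch (format) {
            case "xml":        return new XmlSink(out);
            case "plain-text": return new PlainTextSink(out);
            default:
                throw new IllegalArgumentException("Unsupported outputformat: " + format);
        }
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        sinkFor("plain-text", out).accept(new Page("Title", "Body text"));
        out.flush();
        System.out.print(buffer);
    }
}
```

Keeping the format choice in one factory method means a further format only needs one more `case` and one more sink class.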

Then, I ran the following modified command to extract the “Plain Text” from the Wikipedia Urdu database dumps:

java -jar wikiforia-1.2.1.jar -pages urwiki-20160501-pages-articles-multistream.xml.bz2 -output output.txt -outputformat plain-text

And here’s what the plain-text output.txt looks like:


Finally alhamdolillah! I made my first contribution to the Urdu Corpus Community Project.

Wednesday, April 13, 2016

Java’s utilization of Multiple CPU Cores for Parallelism or Concurrency


While verifying how Java utilizes multiple CPU cores for parallel, concurrent, or multi-threaded programming, I came across some interesting numbers. I wrote a simple program that computes 40,000,000 random integers, first using a single thread and then again using the maximum number of threads, one per available CPU core.
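A program of this kind might look roughly like the sketch below. It is not the code from the GitHub repository; the class and method names are illustrative, and the timing is a rough wall-clock measurement without JIT warm-up, so the numbers are only indicative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

final class RandomNumberBenchmark {

    static final int TOTAL = 40_000_000;

    /** Generates `count` random ints on the calling thread and returns their sum. */
    static long generate(int count) {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += random.nextInt();
        }
        return sum;
    }

    /** Splits `total` into `threads` nearly equal shares that add up to `total`. */
    static int[] split(int total, int threads) {
        int[] counts = new int[threads];
        for (int i = 0; i < threads; i++) {
            counts[i] = total / threads + (i < total % threads ? 1 : 0);
        }
        return counts;
    }

    /** Runs the workload on `threads` threads and returns the elapsed milliseconds. */
    static long timeMillis(int total, int threads) {
        List<Thread> workers = new ArrayList<>();
        long start = System.nanoTime();
        for (int count : split(total, threads)) {
            Thread worker = new Thread(() -> generate(count));
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait until this worker has finished its share
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("1 thread: " + timeMillis(TOTAL, 1) + " ms");
        System.out.println(cores + " threads: " + timeMillis(TOTAL, cores) + " ms");
    }
}
```

`ThreadLocalRandom` is used here so the worker threads do not contend on a single shared generator; a shared `java.util.Random` would also work but could distort the multi-threaded timing.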

In order to find available CPU Cores on a system, Java exposes a method in java.lang.Runtime:

public int availableProcessors()

Returns the number of processors available to the Java virtual machine.

This value may change during a particular invocation of the virtual machine. Applications that are sensitive to the number of available processors should therefore occasionally poll this property and adjust their resource usage appropriately.

Returns: the maximum number of processors available to the virtual machine; never smaller than one
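Calling the method is a one-liner. The snippet below is a minimal sketch; the class name `LogicalCoreCount` is just for illustration.

```java
final class LogicalCoreCount {

    /** Returns the number of processors the JVM currently reports as available. */
    static int logicalCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Processors available to the JVM: " + logicalCores());
    }
}
```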

When I ran the program to print out the number of available CPU cores, I was surprised that it printed “4” instead of “2”, because I have a dual-core laptop:

[Image: dual core]

To verify that further, I opened Task Manager and found this:

[Image: Task Manager]

It turns out that it’s “4” because of Hyper-Threading:

“For each processor core that is physically present, the operating system addresses two virtual or logical cores, and shares the workload between them when possible.”

So, finally, I ran my single-threaded program and observed this:

[Image: all processors busy for a single thread]

Why were all four logical processors busy running a single-threaded program? Shouldn’t it be just one of them?

To dig deeper, I changed my program to run in 4 parallel threads and the result was:

[Image: multi-threaded execution on 4 logical processors]

That didn’t make any sense: both the single-threaded and the multi-threaded versions of the program were clearly using all the available logical processors. Searching the internet for clarification revealed that:

“The OS is responsible for scheduling. It is free to stop a thread and start it again on another CPU. It will do this even if there is nothing else the machine is doing.

The process is moved around the CPUs because the OS doesn't assume there is any reason to continue running the thread on the same CPU each time.”

And here comes the concept of CPU or processor affinity:

The processor affinity is simply a number that every process is associated with. It serves as a bit array that determines on which CPUs in a system the threads of a particular process are allowed to run. For instance a processor affinity of 2 means that the process can only run on CPU 1, because only the bit at index 1 is set (if the processor affinity is regarded as a bit array with indexing starting at the rightmost bit with zero). A processor affinity of 1 means, that the process, or better yet, the threads of that process, can only run on CPU 0. A processor affinity of 3 means that the process may run on both CPUs 0 and 1. A processor affinity of 0 means that there is no CPU that this process may run on, and is therefore not possible. The processor affinity is normally inherited from the parent process that starts a particular process, but it can also be changed at runtime from another process.
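The bit-array reading of an affinity value described above can be sketched in a few lines. This only decodes a mask for illustration; it does not set affinity, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class AffinityMask {

    /** Returns the zero-based CPU indexes whose bits are set in the mask. */
    static List<Integer> allowedCpus(long mask) {
        List<Integer> cpus = new ArrayList<>();
        for (int cpu = 0; cpu < Long.SIZE; cpu++) {
            // Bit 0 is the rightmost bit, matching the indexing in the text.
            if ((mask & (1L << cpu)) != 0) {
                cpus.add(cpu);
            }
        }
        return cpus;
    }

    public static void main(String[] args) {
        System.out.println("affinity 1 -> CPUs " + allowedCpus(1)); // [0]
        System.out.println("affinity 2 -> CPUs " + allowedCpus(2)); // [1]
        System.out.println("affinity 3 -> CPUs " + allowedCpus(3)); // [0, 1]
    }
}
```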

While there are several ways to test processor affinity, the one I found quick and easy to use was ProcAff. I ran the same single-threaded version of the program with procaffx64.exe:

[Image: procaffx64 command]

I observed this:

[Image: single thread with processor affinity]

That’s what the execution of a single-threaded program should look like: it uses only one logical processor.

Furthermore, it is quite interesting that the execution times of the following two cases match (please refer to the Microsoft Excel sheet “analysis.xlsx” uploaded to the GitHub repository along with the code):

Average time to run a single thread with no CPU/processor affinity == Average time to run a single thread with CPU/processor affinity

Task Manager shows visually that the former case uses all 4 logical processors while the latter uses only one, yet both finish their task in almost exactly the same time. This is expected: a single thread executes on only one logical processor at any instant, so without affinity the scheduler merely moves the same amount of work from one processor to another.

Monday, March 28, 2016

Back to Academics - Master of Science (Big Data) program @ DHA Suffa University

Yay! It’s good to be back to blogging. It was Facebook that was keeping me away from maintaining this blog, because I was literally dumping everything on Facebook. But when I thought about sharing my learning experience with my readers, I realized I needed to go back to actual blogging, and here I am!

One question that people around me have asked most frequently of late is “Why enroll in a Masters now? And why DHA Suffa University?”. My answer is more or less the same as “The Statement of Purpose”, which I wrote hastily in a few minutes, just before submitting my application for the MS program.

Statement of Purpose

After spending some years in the industry and gaining some post-bachelor’s experience (I graduated in 2001), I started surveying the post-grad programs available locally to shortlist those that appealed to me. To my bad luck, I ended up with an empty list. Why? Because I felt those programs, back then (I guess 2008), were too theoretical and lacked emphasis on the actual “Application” of those theories. I’m a firm believer that humans are as good Doers as they are Thinkers, and that we always “act” upon the “knowledge” we gain through different experiences in our lives.

Unfortunately, I did similar surveys at least three more times from 2008 to 2016 and always ended up with the same result. None of the post-grad programs was “Practical” enough to make me just roll up my sleeves and join.

I’ve been discussing my experiences with a number of people in my circles and I’ve been persuading people, both in industry and academia, to come up with a program that infuses the theoretical knowledge with the practical implementation in a way that would lead us to achieve breakthroughs in all walks of life and specially in science and technology.

Then I heard about this new Masters program that DSU is launching and I was like, “I’ve got to enroll in it no matter what!”. Although I’m working full time and running short of time, when I heard the vision I was flabbergasted. This is the kind of program that I waited for soooo long!

So, first, thank you to DSU, for being so daring and bold enough to take the initiative in launching a program that is indeed the most pressing need of the time. Secondly, please give me a chance to be part of this game-changing program. I, on my end, will ensure that I’ll add significant value to the program in whatever capacity I can.


Syed Muhammad Humayun

But what’s so “Practical” about this program in particular?

The vision of this program is to work with the people in industry and academia and to form a group who not only research new problems, ideas and techniques but also devise new solutions and newer implementations in the form of complete Working Models, Validated Prototypes, Sophisticated End Products, Re-usable APIs, Components and Platforms. And, through that, they will be able to deliver the “real value” and actually serve the needs of the society (including the industry and the academia).

Besides that, enrollment in this MS program is open to industry people who have significant experience and skills in related fields. These MS students, along with active PhD students, will form a “research group” which will then “Work as a Team in a Lab” to achieve their end goals (mentioned in the paragraph above).

While a lot of institutions are working with similar concepts the world over, in Pakistan we are fortunate to have one of our own, and I hope we will have similar programs in other universities of Pakistan real soon.

MS/PHD Computer Science Programs @ DHA Suffa University