De 5 resource bottlenecks bij Big Data, inclusief de ontastbare 5e

Dit artikel is geschreven door Graham Bolton

Big Data Bottlenecks

“Today’s programmers believe that they have
no need to understand data storage;
tomorrow’s programmers know that they must.”

Whether you’re writing a MapReduce component for a Hadoop Rack-Aware File System or wrestling with a COBOL program from last century, there are five types of bottle neck, each of which concerns a basic resource: Processor, Memory, Storage, Transmission and Contention. Dealing with contention is hard work, so lets start with the easy ones…

Processor

If your program is processor-bound, you need to spend some time looking at how the program has been constructed. Does it contain deeply nested loops? Or deeply nested calls? Or heavy string manipulation? Or is the program just “flabby”, carrying around more data than it should? Or worse still, are you using a stack which makes programming a little faster, and programs much slower?

“Recently, I took a processor-bound program
implemented on a state-of-the-art stack and
re-wrote it in “old-fashioned” C. The task was
to issue a set of five randomized bingo cards to
each of 13 million prospective participants.
This involved generating and storing 65
million bingo cards.

The multi-threaded, state-of-the-art program
took a little over 23 hours.

The single-threaded, “old-fashioned” program
did the same work in just 5 minutes and 45
seconds.”

If you have a program which is CPU-bound, it’s worth considering how much overhead your “all-purpose” framework is adding to your problem.

Memory

How much time does your program spend loading data into memory? And how much time is spent making copies, and moving things from one memory location to another? Consider, if you will, the fact that in the 1960’s and 1970’s memory was measured in kilobytes, and that tapes were used for storage. Do you really need to load all your data into memory as you process it?

A poorly-written data warehouse ETL process
hit a performance wall six months after it was
taken into production. It was clear that the
programmer had not treated memory as a
resource, so I re-designed it to use files rather
than memory.

By using an operating system utility to sort
the data file, writing a variant of the balance
line algorithm to compare it to the file from
the previous day and extract the differences,
we cut processing time from 9 hours to less
than two minutes.

Storage

Storage is as cheap as chips. Solid-state disks are pretty much as fast as memory. Adding an index to a database is quick and easy. So why worry about storage?

At my job interview for the position of
programmer at Logica in Rotterdam, I was
asked: “Graham, how fast is a disk?”

I asked the interviewer what he meant: “Do
you mean the seek time, or the speed at which
data can be transferred from the disk into its
cache, or its latency, the delay before the first
packet of data is put onto the bus?”

I got the job.

All storage is slow, and I mean really slow. The time needed for disk operations is usually expressed in milliseconds: seek time for a solid state disk (SSD) might be as low as 0.1 ms, while for an older disk it might be 100 ms.

A modern CPU can carry out about two hundred million (200,000,000) operations per second. At the storage level, we are doing well if we can average 100 seek operations per second! Yes, that’s right: a CPU operation is about two million times faster than a storage seek operation.

So what can you do to avoid storage seek operations? Index your data. But each index that you add impacts write operations: for example, every time a record is added, it has to be added to every index–that can become a significant overhead.

I was asked to improve the overall
performance of a suite of batch programs
because the threshold of the batch window had
been reached.

The programmers had followed a sequence of
automatically generated optimization
suggestions, like “add an index on columns x, y
and z”. One of the most heavily updated tables
had just 12 columns, and it had 15 indexes
(indices).

I carried out an access path analysis, based
upon the sequence of suggestions, and
designed a minimal set of indices for 8 tables.
After deleting 28 indices, and modifying 7
others, we noted a 40% decrease in processing
time.

In my experience, drawn from a 30-year career in what is now called Big Data, most performance problems are caused by a calamitous lack of understanding of data storage. I believe that today’s programmers believe that they have no need to understand storage; tomorrow’s programmers know that they must.

Transmission

Moving data from one place to another is the fourth bottle neck. The more data that is moved around, the longer it takes. If that data is broken into small packets, it takes even longer, because of the overhead involved: pack, send, wait for acknowledgement, receive, acknowledge, unpack.

It doesn’t matter whether the data is being transferred between two processes on the same machine, or between two processes running on machines in the same rack, or between two processes running on different sides of the globe, the less data that you send, and the fewer packets you divide it into, the faster your process will work.

I was asked to produce a set of test data to
“break” the invoicing process in a system
which was in the acceptance phase. The
process took nearly four hours to produce
comprehesive invoices for 100,000 clients.

The main loop of the program contained 7 SQL
queries, each of which was run 100,000 times:
the program ran 700,001 queries.

I spoke to the programmers, taught them how
to use SQL cursors and how to implement an
8-way merge. They were able to refactor the
program in about 4 hours.

When they ran the new program for the first
time, it stopped after about a minute. The
programmers thought that it had crashed. It
had not. It worked first time. It issued just 8
queries to the database, and a few thousand
data packets were exchanged rather than
about a million and a half.

If you understand how data is moved around within a framework, and the amount of overhead which is incurred, you have a strong chance of being able to refactor an errant program.

Contention

Programs compete for resources. Operating systems provide locking for mutually exclusive operations so that contention can be managed safely.

At university in the UK, we learnt about
“Dijkstra’s semaphore functions”, which were
called p() and v(). Dijkstra wrote his work in
dutch, and now that I speak that language, I
know that the “v” stands for “verhoog”
(increment) en that the “p” stands for
“prolaag–probeer te verlagen” (try to
decrement).

Dijkstra did his best work looking out over the
railway yards near Eindhoven station,
understanding how trains (like computer
processes) were given exclusive use of
particular tracks (like computer resources)
through the use of semaphore signals.

If your program is contending for resources, it will be delayed. If the other programs are slow, they will hold the mutually exclusive resource for longer, and your program will be correspondingly slow. This means that to increase the speed of your program, and you cannot remove the contention as a whole, you will need to optimize the other programs and remove their bottle necks, be they processor-bound, memory-bound, storage-bound, transmission-bound or, worst of all, contention-bound.

The Intangible Fifth

Contention issues are the hardest to detect, the hardest to understand and the hardest to resolve.

I recommend that you try to avoid contention by design, rather than having to resolve it further down the line.

Ook interessant?

BI vs Big Data

Het verschil tussen Business Intelligence (BI) en Big Data

Om te kunnen vertellen wat het verschil is tussen Big Data en BI, is het noodzakelijk om eerst is te ...
Datawarehouse

Wat is een datawarehouse (DWH)?

Een datawarehouse, afgekort DWH, is een database. Het verschil met een gewone database is de architectuur en de opslag van ...
ETL in datawarehouse

Wat is ETL in Datawarehouse?

ETL komt vanuit datawarehousing en staat voor Extract, Transform, Load. ETL dekt het proces van hoe data wordt geladen van ...
dataconversie

Wat is dataconversie?

Een onderdeel van Datamigratie is Dataconversie. Dit is de conversie van data van het ene formaat naar een ander data ...
Blijf op de hoogte van onze nieuwste ontwikkelingen, schrijf je hier in voor de nieuwsbrief!