Friday, July 15, 2011

Engadget Primed: Using benchmarks

Primed goes in-depth on the technobabble you hear on Engadget every day -- we dig deep into each topic's history and how it benefits our lives. Looking to suggest a piece of technology for us to break down? Drop us a line at primed *at* engadget *dawt* com.

Staring at your smartphone, you realize that there's something missing. It does everything you want it to -- very well, we might add -- but what hole is left to fill? We'll help you out with this one: you want bragging rights. There has to be a way to face your friends with confidence, right? All you need is a little nudge in the right direction, and in this edition of Engadget Primed, we'll give you that much-needed shove by explaining benchmarks.

Perhaps you've seen us talk about benchmarks in our product reviews. We'll typically use them to gauge the relative performance of various devices, but discussing a Linpack score doesn't mean much without going deeper into what it actually means. What aspects of performance do these benchmarks measure, and what techniques do they use? How much can we rely on them when making purchasing decisions? Read on after the break for the full scoop.

Table of Contents

What's a benchmark?
How accurate are they?
Why do we still use them, then?
Network Performance
Android
Windows Phone 7
iOS
WebOS
JavaScript
Wrap-up

What's a benchmark?

A benchmark is defined as "a standard by which something can be measured or judged," which couldn't be more true in tech. In our last Primed, we stressed the importance of using a standard: the same set of rules or guidelines that can be followed across the board, no matter what the circumstances are. Universal benchmarks are useful for evaluating performance in an unbiased manner because they use the same tests and methods to achieve results.

Benchmarks can be found everywhere and are used as reference points in several ways: comparing employee output, measuring crude oil prices, and conducting tests (to name just a few). They can even be found on mountain peaks, which survey groups originally used to highlight and compare elevation. The most common benchmarks, however, are used in computers -- which, of course, translates easily into smartphones.

Why do we care about benchmarks at all? As we'll cover in the next section, they aren't the definitive means of comparing phones or other devices, nor do they truly emulate real-life performance. The reason we use them at all is that they're a standardized tool for measuring performance as quickly and easily as possible. With phones that share very similar architectures, it's nigh-impossible to detect any noticeable difference between them from a user's perspective without a little extra assistance.

This is where benchmarks come in handy. They offer an incredibly structured way of measuring a device's performance, and in most cases they're thorough, universal, fast, and efficient. A well-designed benchmark exercises as many of the scenarios a device is likely to run into as it can; relying on a single test that runs one type of process wouldn't let you draw any legitimate conclusion.

Most smartphones are sophisticated computers and use the same types of architecture under the hood. They all have CPUs (usually as part of a system-on-chip), RAM, a camera, a complex operating system, antennas, and various other chips and sensors. They run the same kinds of subsystems. The largest difference lies in who manufactures those components and how well they interact with the rest of the phone's internals.

How accurate are benchmarks?

Frankly, benchmarks aren't as accurate as we'd like. We should be able to run the same benchmark on the exact same phone, time after time, and get identical results. But it doesn't work that way -- not even close, sadly. As an example, two runs on the same HTC Sensation 4G within the space of a few minutes can produce noticeably different numbers, with the second score exceeding the first by a rather significant margin. What can cause inaccurate results?

First, cheating can indeed play a role. For as long as computers have been around, there have been benchmarks to evaluate them... and ways to work the system. It's easy enough for a vendor to manufacture a product, tweak it to be efficient at the processes one given benchmark looks at, and then use the inflated results as a marketing tool to proclaim: "such-and-such program shows that our product is 50 percent faster than X's." All this is typically done at the expense of other (more important) workloads that actually matter to the consumer.

Similarly, benchmarks can also be fooled by users making artificial adjustments to their devices. When a CPU is overclocked (for instance, a 1GHz CPU is manipulated to run at 1.5GHz), results will be far more inflated than they ought to be. Custom ROMs or jailbreaks can also produce artificial results. While we don't adjust the devices we review, these unnatural benchmark results make it more difficult to give an accurate comparison with other devices.

Now, let's assume that no scores have been manipulated so far. Another concern is that benchmarks don't account for the fact that your device performs at different levels depending on what it's doing. If you run a benchmark fresh after a reboot and then again during heavy use, you'll get different results more often than not.

Let's go over a few additional factors that affect benchmark scores: the manufacturer of specific components, the type of CPU / GPU used, clock speed, the amount of RAM and available memory, the speed and version of the phone's compiler, cache size, your display's frame rate, and the ROM (and platform) you're currently running. We could keep going, but we hope this is enough to drive the point home.

To make matters more confusing, no single benchmark will accurately portray the overall performance of the phone nor give real-time results. When comparing X with Y, it's entirely possible that Quadrant gives X a higher score, whereas Y gets the better result in Linpack. Each individual benchmark will measure different systems or components. As such, it's only when we compile multiple benchmarks that we get a more accurate assessment of our phones' performance.

Why do we still use them, then?

Benchmarks aren't perfect, but they're still used all the time. Simply saying "product X runs really fast and smooth, and product Y is roughly the same but slightly less so" is not a comprehensive or valid method of judging a phone's performance. In addition, our reviews are conducted by different editors, and using our own measuring stick each time is subjective -- it's all in the eye of the beholder. Benchmarks can still be mighty useful; we highlight their shortcomings so we can focus on how we overcome them.

First, none of our reviews are done with artificial adjustments that could cause any difference in benchmark scores. Since we perform benchmarks on every smartphone we review, we have our own reference points to compare to without worries of interference from other users jacking up their devices.

Second, we conduct several tests with each benchmark, under a slew of different scenarios. For example, we run them after a hard reset with no apps in the background, while pushing the phone to its limits, while streaming video, in 3G / 4G coverage areas, while in transit, and in any other foreseeable situation we can think of. After running multiple tests, we average out the scores; we want to get as realistic a number as possible, considering everyone uses their phones in so many different ways.

We also use multiple benchmarks to get an overall idea of which handset has the best performance. While vendors may have the ability to inflate certain processes, chances are they're only focused on one or two. By doing several benchmarks that all use different parts of the system, we'll get more accurate results.
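As an illustration of how scores from several benchmarks can be rolled together -- this is a common approach in benchmarking generally, not a description of our own scoring -- you can normalize each result against a reference device and take the geometric mean, so no single test dominates. Here's a minimal Python sketch with made-up, higher-is-better numbers:

```python
from math import prod

def combined_score(device_scores, reference_scores):
    """Geometric mean of this device's per-benchmark ratios against a reference device."""
    ratios = [device_scores[name] / reference_scores[name] for name in reference_scores]
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical higher-is-better scores; a lower-is-better test like SunSpider
# would need its ratio inverted before being folded in.
reference = {"cpu": 1000, "graphics": 30.0, "io": 250}
phone_a = {"cpu": 1900, "graphics": 39.0, "io": 310}
phone_b = {"cpu": 950, "graphics": 42.0, "io": 280}

print(f"Phone A: {combined_score(phone_a, reference):.2f}x the baseline")
print(f"Phone B: {combined_score(phone_b, reference):.2f}x the baseline")
```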

Finally, and this is the most crucial, it's important to not fully rely on these scores. Benchmarks aren't -- and were never intended to be -- indicative of real-life behavior; they're just a common way to compare devices. If you make a purchasing decision based solely on these scores, you may be missing out on some great handsets.

Here's an example of using multiple benchmarks to compare three LTE devices from Verizon that have eerily similar specs, the LG Revolution, HTC Thunderbolt, and Samsung Droid Charge:

Benchmark          LG Revolution    HTC Thunderbolt    Samsung Droid Charge
Quadrant           1913             1886               943
Linpack (MFLOPS)   39.6             40.1               13.6
Nenamark (fps)     39.2             32.7               42.2
Nenamark2 (fps)    13.3             12.7               21.4
Neocore (fps)      65.1             59.5               56.9
SunSpider (ms)     4591             6213               7905

(For SunSpider, lower is better; for every other benchmark here, higher is better.)

Network Performance

Speedtest.net

We review devices that run on WiMAX, LTE, "faux-G," and, from time to time, even plain 3G that feels so 2009. With so many types of mobile broadband available today, we need a way to test the speed of each one. The speed test app we've primarily used is Speedtest.net Mobile by Ookla, a company that provides speed tests not only through its own apps for iOS and Android, but for the FCC as well. Again, it's not a foolproof method of getting real-world results -- manually downloading and uploading files of a known size with a stopwatch is the only way to get truly precise numbers -- but it's the best way to keep scores as objective as possible.

Ookla measures data throughput entirely through HTTP transfers. The app performs its test in three stages: latency, download, and upload. To test latency, it measures the time it takes to receive a response to an HTTP request sent to the web server.

To test download speeds, binary files are downloaded from the web server to the client to estimate the connection speed. The app takes up to 30 samples per second, aggregates them into individual slices, discards the outlying (fastest and slowest) ones, and averages what remains to produce the final download score.

To determine upload speeds, random chunks of data are sent from the client to the server. The chunks are sorted by speed, the slowest half is discarded, and the remaining (fastest) 50 percent is averaged to produce the final result.
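To make those two averaging schemes concrete, here's a minimal sketch in Python. This isn't Ookla's actual code -- the sample values and the exact fraction of outliers discarded are assumptions for illustration; only the general approach (drop outliers and average the rest for downloads, keep the fastest half for uploads) comes from the description above.

```python
def trimmed_average(samples, discard_fraction=0.2):
    """Download: drop the fastest and slowest samples, then average the rest.
    The 20 percent discard fraction is an assumption, not Ookla's published figure."""
    ordered = sorted(samples)
    k = int(len(ordered) * discard_fraction / 2)   # samples to drop from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

def fastest_half_average(samples):
    """Upload: sort chunks by speed, discard the slowest half, average what's left."""
    ordered = sorted(samples, reverse=True)
    kept = ordered[: max(1, len(ordered) // 2)]
    return sum(kept) / len(kept)

# Hypothetical per-slice throughput samples in Mbps.
downloads = [11.8, 12.4, 3.1, 12.9, 13.0, 25.6, 12.2, 12.7, 12.5, 11.9]
uploads = [4.1, 4.4, 1.2, 4.6, 0.9, 4.3]
print(f"Download: {trimmed_average(downloads):.1f} Mbps")
print(f"Upload: {fastest_half_average(uploads):.1f} Mbps")
```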

Android Benchmarks

Conveniently, the platform with the most benchmarks in its application market is Android. We're not sure if that's due to the open-source nature of the OS or simply a massive outpouring of dev support, but since Android powers a greater variety of devices than any other operating system, we need all of the benchmarks we can get our hands on. Here are a few of the benchmarks we use on a regular basis:

1. Quadrant Standard

Quadrant measures the performance of the device's CPU, I/O, memory throughput, and 2D / 3D graphics. The higher the score, the better. Results can range anywhere from 200 all the way up to 3,000. Some of the specific metrics involved include:

12 CPU metrics: the benchmark pushes the device through checks like compression, checksums, arithmetic, floating point, XML parsing, and video / audio decoding (a toy sketch of one such sub-test follows this list).

1 Memory throughput check

4 I/O tests: these cover filesystem access and database operations.

1 2D and 3 3D graphics tests: these consist of several graphics simulations that evaluate the device's OpenGL capabilities, analyzing single-pass and multi-pass rendering with stencil buffers.
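To give a flavor of what one of those CPU sub-tests amounts to, here's a toy sketch: time a fixed chunk of floating-point work and report how long it took. This isn't Quadrant's code; the workload and the iteration count are arbitrary choices for illustration.

```python
import time

def float_workload(iterations=2_000_000):
    """An arbitrary dependent chain of floating-point operations."""
    x = 0.0001
    for _ in range(iterations):
        x = (x * 1.000001 + 0.5) % 97.0
    return x

start = time.perf_counter()
float_workload()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Floating-point sub-test finished in {elapsed_ms:.1f} ms")
```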

2. Neocore

This benchmark runs a graphics-intensive scene of robots and tanks doing battle with each other, so it has to be cool, right? Neocore evaluates the device's GPU performance by tackling its OpenGL ES 1.1 abilities head-on, using 1-pass light maps and bump mapping to get its results. At the end of the test we get a definitive fps score to highlight how fast the GPU can work its magic.

3. Nenamark 1 and 2

When there are two benchmarks with the same name, which one should you choose? Well, if you can only pick one Nenamark, go with the sequel. Nenamark 1 has been around for a little over a year and was designed to test the GPU limits of devices considered state of the art at the time -- the Nexus One and HTC Desire running Adreno 200, for instance. The first iteration used shaders for graphical effects such as reflections, dynamic shadows, parametric surfaces, particles, and different lighting models. It was originally designed to run at around 10-15fps, but in just one year mobile processors have come a remarkably long way; some top-of-the-line phones now score above 60fps, the refresh rate of the LCD screens found in these devices. In other words, results above that mark are essentially meaningless, because the benchmark is generating frames faster than the phone's display can actually show them.

Because of this huge boom in our phones' capabilities, Nenamark 2 was born. The benchmark still targets OpenGL ES 2.0, but adds far more graphics-intensive operations to really push the new processors to their limits. It uses five times as much geometry, tougher shaders, and more HD content; it also deploys dynamic lighting, bump mapping, and cube map reflections. As a result, most scores on the second version are lower and appear to be more representative of what the device can actually do.
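To make those fps figures a bit more concrete, here's a bare-bones sketch of how a frame-rate score is generally produced -- not Nenamark's or Neocore's actual code, just the basic count-frames-and-divide-by-time idea, with a stand-in renderer:

```python
import time

def measure_fps(render_frame, duration=3.0):
    """Render frames in a loop for `duration` seconds and return frames per second."""
    frames = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        render_frame()   # stand-in for one draw call plus buffer swap
        frames += 1
    return frames / (time.perf_counter() - start)

# Pretend each frame takes ~16ms, roughly one refresh interval of a 60Hz display;
# with vsync in play, reported scores can't meaningfully exceed the refresh rate.
print(f"{measure_fps(lambda: time.sleep(0.016)):.1f} fps")
```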

4. Linpack

Linpack has been measuring and comparing the speeds of supercomputers for decades, so it's only fitting for this particular program to be found on Android and iOS devices -- most of which are faster than any supercomputer Linpack evaluated at the beginning of its long history. The program has the processor solve a dense system of linear equations and then rates its subject in MFLOPS -- millions of floating point operations per second -- to demonstrate a device's CPU performance. As with most of the benchmarks discussed so far, higher scores are better.

One drawback in comparing old Linpack scores with recent ones is that results have increased due to enhanced libraries and methods present in some newer smartphones and tablets.
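For a rough illustration of what Linpack measures (not the code used by the mobile apps themselves), the sketch below times a dense solve with NumPy and converts the elapsed time into MFLOPS using the standard 2/3 * n^3 + 2 * n^2 operation count:

```python
import time
import numpy as np

def linpack_mflops(n=1000):
    """Time one dense n x n solve and convert the elapsed time into MFLOPS."""
    a = np.random.rand(n, n)
    b = np.random.rand(n)
    start = time.perf_counter()
    np.linalg.solve(a, b)                        # the dense linear-system solve Linpack times
    elapsed = time.perf_counter() - start
    flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2  # standard Linpack operation count
    return flops / elapsed / 1e6                 # millions of floating point ops per second

print(f"{linpack_mflops():.1f} MFLOPS")
```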

Windows Phone 7

Now that the WP7 Marketplace has finally passed the 25,000-app mark, it's no longer a bottom-feeding, run-down app store. Benchmarks are finally beginning to make their way into the Windows Phone ecosystem, and we don't want to let Android have all of the attention.

WP Bench

WP Bench appears to be one of the most comprehensive benchmarking apps we've seen so far. It can test the CPU (both single-threaded and multithreaded), GPU performance, memory and storage read / write speeds, and display color reproduction, and it will even strain the battery in an endless loop to make battery-life testing easier on us reviewers.

Benchmark Free

Not much new can be said about this particular benchmark, which returns many of the same kinds of results as WP Bench. Benchmark Free, however, places more emphasis on CPU integer and floating point performance and various types of memory. It was also only released at the end of May, which means it doesn't have as much comparison data behind it as other benchmarks. The app is unique in that it assigns a maximum value to each individual property it measures.

iOS

Linpack

Linpack for iOS runs the same calculations as its Android counterpart, making the program cross-comparable on both platforms. This turns out to be quite helpful when we need to pit an iPad against a Honeycomb tablet or an iPhone versus a random single-core Android handset.

WebOS

Particle System

Particle System is one of very few benchmarks in the App Catalog, and its primary purpose is to gauge the HTML5 capabilities of webOS. By manipulating the particles every which way (spinning them, dropping them into a gravity well, and so on), we can see what frame rate the browser can sustain.

JavaScript

None of the previously mentioned benchmarks are truly universal; very few apps are cross-platform, and none run on every commonly used OS. One class of benchmarks that does transcend platform boundaries is JavaScript-based, which means it can run in any web browser with JavaScript enabled. The most popular of these is SunSpider.

SunSpider

Built by the same team responsible for the WebKit browser engine (which both iOS and Android browsers use), SunSpider simulates real-world JavaScript usage found on various websites. It generates a tag cloud and tests the browser's encryption and decompression capabilities, among other tasks. Results are reported in milliseconds, so lower scores are better in this case. If you can't get SunSpider to work in your browser, first check that JavaScript is enabled in the settings.
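SunSpider itself is a JavaScript suite, but the scoring idea is easy to show in a few lines: run each test case, measure the elapsed wall-clock time in milliseconds, and add it all up, with lower totals being better. Here's a hypothetical sketch in Python; the test names and workloads are made up stand-ins, not SunSpider's actual cases.

```python
import time

def run_suite(tests):
    """Run each test, report per-test and total elapsed milliseconds (lower is better)."""
    total_ms = 0.0
    for name, fn in tests.items():
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000
        total_ms += elapsed_ms
        print(f"{name}: {elapsed_ms:.1f} ms")
    print(f"Total: {total_ms:.1f} ms (lower is better)")

# Made-up stand-ins for SunSpider-style workloads (string building, math loops).
run_suite({
    "string-tagcloud": lambda: "".join(str(i) for i in range(20000)),
    "math-loop": lambda: sum(i * i for i in range(200000)),
})
```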

Wrap-up

Benchmarks were never intended to be the deciding factor in a customer's purchasing decision, nor are they an ideal way for us to officially declare the supremacy of one device over another. Instead, by applying identical standards under the same conditions, they give us a chance to predict how these products will behave in real life. While benchmarks will never be a 100 percent indicator of a computer's or smartphone's performance, they at least let us pit similar devices against each other on (somewhat) equal ground. Let's not take them too seriously, of course, but they're a fun -- and necessary -- way for us to deliver the most comprehensive reviews possible.