Improve your game's performance

Players enjoy a game most when it has short loading times, a consistent framerate, and dependable input responsiveness.

If you have experience developing games for a computer or a console, you might be surprised by how much mobile devices differ from these devices in GPU size and flash memory performance. This difference in system architecture can make it difficult to predict your game's performance on Android.

This guide helps you optimize your game so that its performance is as dependable as possible on the variety of Android devices that run your game. In particular, this guide explains how to configure system tracing of an Android game. The guide then describes how you can use the output report from a system trace to examine specific aspects of your game's performance.

Set up a game-based system trace

The Systrace tool, available both as a command-line program and an on-device service, captures a CPU and thread profile of your app over a short period of time. You can use the results shown in a Systrace report to better understand how your game is performing on Android, and to identify how best to improve your game's efficiency and responsiveness.

Systrace is a very low-level tool, which offers the following benefits:

  • Provides ground truth. Systrace captures output directly from the kernel, so the metrics that it captures are nearly identical to those that a series of system calls would report.
  • Consumes few resources. Systrace introduces very low overhead on the device, usually less than 1%, because it streams data into an in-memory buffer.

Optimal settings

Regardless of how you capture the system trace, it's important to give the tool a reasonable set of arguments:

  • Categories: The best set of categories to enable for a game-based system trace is: {sched, freq, idle, am, wm, gfx, view, sync, binder_driver, hal, dalvik}.
  • Buffer size: A general rule is that a buffer size of 10 MB per CPU core allows for a trace that's about 20 seconds long. For example, if a device has two quad-core CPUs (8 cores total), an appropriate value to pass into the systrace program is 80,000 KB (80 MB).

    If your game performs a great deal of context-switching, however, you should increase the buffer to 15 MB per CPU core.

  • Custom events: If you define custom events to capture in your game, enable the -a flag, which allows Systrace to include these custom events in the output report.
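
As a minimal sketch of how a custom event might be defined in native code, the following example wraps one piece of work in a trace section by using ATrace_beginSection() and ATrace_endSection() from the NDK header android/trace.h (available from API level 23). The update_physics() function and its parameters are hypothetical, and the stand-in definitions for non-Android builds exist only so that the sketch compiles off-device:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ANDROID__)
#include <android/trace.h>  /* ATrace_beginSection(), ATrace_endSection() */
#else
/* Stand-ins for host builds only; they record nothing. */
static void ATrace_beginSection(const char *section_name) { (void)section_name; }
static void ATrace_endSection(void) {}
#endif

/* Hypothetical game function: advances each position by velocity * dt.
 * The whole loop appears as one "update_physics" slice in the trace.
 * Returns the number of positions updated. */
int update_physics(float *positions, const float *velocities, int count, float dt) {
    assert(positions != NULL && velocities != NULL);
    ATrace_beginSection("update_physics");
    for (int i = 0; i < count; ++i) {
        positions[i] += velocities[i] * dt;
    }
    ATrace_endSection();  /* Every begin needs a matching end on the same thread. */
    return count;
}
```

Sections like this one show up in the report only when the trace is captured with the -a flag set to the game's package name.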

If you're using the systrace command-line program, use the following command to capture a system trace that applies best practices for category set, buffer size, and custom events:

python systrace.py -a com.example.myapp -b 80000 -o my_systrace_report.html \
  sched freq idle am wm gfx view sync binder_driver hal dalvik
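
The value passed to -b in this command follows from the buffer-size rule described earlier. The following sketch shows that arithmetic; the function name systrace_buffer_kb() and its parameters are illustrative and aren't part of Systrace:

```c
#include <assert.h>

/* Returns a buffer size in KB for the systrace -b option:
 * 10 MB per CPU core, or 15 MB per core for games that
 * perform a great deal of context-switching. */
long systrace_buffer_kb(int cpu_cores, int heavy_context_switching) {
    assert(cpu_cores > 0);
    const long mb_per_core = heavy_context_switching ? 15 : 10;
    /* Uses the same convention as the text: 1 MB = 1,000 KB. */
    return cpu_cores * mb_per_core * 1000L;
}
```

For an 8-core device, systrace_buffer_kb(8, 0) gives the 80,000 KB used in the command, and systrace_buffer_kb(8, 1) gives 120,000 KB.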

If you're using the Systrace system app on a device, complete the following steps to capture a system trace that applies best practices for category set, buffer size, and custom events:

  1. Enable the Trace debuggable applications option.
  2. Under Buffer size, select 65536 (about 64 MB). Note that, in order to use this setting, the device must have 256 MB or 512 MB available (depending on whether the CPU has 4 or 8 cores), and each 64-MB piece of memory must be available as a contiguous chunk.
  3. Choose Categories, then enable the categories in the following list:

    • am: Activity Manager
    • binder_driver: Binder Kernel driver
    • dalvik: Dalvik VM
    • freq: CPU Frequency
    • gfx: Graphics
    • hal: Hardware Modules
    • idle: CPU Idle
    • sched: CPU Scheduling
    • sync: Synchronization
    • view: View System
    • wm: Window Manager
  4. Enable Record tracing.

  5. Load your game.

  6. Perform the interactions in your game corresponding to the gameplay whose device performance you want to measure.

  7. Shortly after you encounter undesirable behavior in your game, turn system tracing off. You've now captured the performance statistics needed to further analyze the issue.

To save disk space, on-device system traces save files in a compressed trace format (*.ctrace). To uncompress this file when generating a report, use the command-line program and include the --from-file option:

python systrace.py --from-file=/data/local/traces/my_game_trace.ctrace \
  -o my_systrace_report.html

Improve specific performance areas

This section highlights several common performance concerns in mobile games and describes how to identify and improve these aspects of your game.

Loading speed

Players want to get into your game's action as quickly as possible, so it's important to keep your game's load times short. The following measures usually help load times:

  • Perform lazy loading. If you use the same assets across consecutive scenes or levels in your game, load these assets only once.
  • Reduce the size of your assets. That way, you can bundle uncompressed versions of these assets with your game's APK.
  • Use a disk-efficient compression method. An example of such a method is zlib.
  • Use IL2CPP instead of Mono. (Applies only if you're using Unity.) IL2CPP provides better execution performance for your C# scripts.
  • Make your game multithreaded. For more details, see the framerate consistency section.

Framerate consistency

One of the most important elements of gameplay experience is achieving a consistent framerate. To make this goal easier to achieve, follow the optimization techniques discussed in this section.

Multithreading

When developing for multiple platforms, it's natural to place all activity within your game in a single thread. Although this method of execution is simple to implement in many game engines, it's far from optimal when running on Android devices. As a result, single-threaded games often load slowly and lack a consistent framerate.

The Systrace shown in Figure 1 displays behavior that's typical of a game running on only one CPU at a time:

Diagram of threads within a system trace

Figure 1. Systrace report for a single-threaded game

To improve your game's performance, make your game multithreaded. Typically, the best model is to have 2 threads:

  • A game thread, which contains your game's main modules and sends render commands.
  • A render thread, which receives render commands and translates them into graphics commands that a device's GPU can use to display a scene.
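
The two-thread model can be sketched with POSIX threads and a small bounded queue. This is an illustration under simplifying assumptions, not an engine design: render commands are plain integers, "rendering" only counts them, and names such as CommandQueue, STOP_COMMAND, and run_frames() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define QUEUE_CAPACITY 8
#define STOP_COMMAND (-1)

/* Bounded queue of render commands shared by the two threads. */
typedef struct {
    int items[QUEUE_CAPACITY];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} CommandQueue;

typedef struct {
    CommandQueue queue;
    int frames;    /* Number of frames the game thread produces. */
    int rendered;  /* Number of commands the render thread consumed. */
} Game;

static void queue_push(CommandQueue *q, int command) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAPACITY) pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = command;
    q->tail = (q->tail + 1) % QUEUE_CAPACITY;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int queue_pop(CommandQueue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0) pthread_cond_wait(&q->not_empty, &q->lock);
    int command = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return command;
}

/* Game thread: runs the game's main modules and sends one render command per frame. */
static void *game_thread(void *arg) {
    Game *game = arg;
    for (int frame = 0; frame < game->frames; ++frame) queue_push(&game->queue, frame);
    queue_push(&game->queue, STOP_COMMAND);
    return NULL;
}

/* Render thread: receives commands; a real one would translate them into graphics commands. */
static void *render_thread(void *arg) {
    Game *game = arg;
    while (queue_pop(&game->queue) != STOP_COMMAND) game->rendered++;
    return NULL;
}

/* Runs both threads to completion and returns the number of rendered frames. */
int run_frames(int frames) {
    assert(frames >= 0);
    Game game = { .frames = frames, .rendered = 0 };
    pthread_mutex_init(&game.queue.lock, NULL);
    pthread_cond_init(&game.queue.not_empty, NULL);
    pthread_cond_init(&game.queue.not_full, NULL);
    game.queue.head = game.queue.tail = game.queue.count = 0;

    pthread_t game_tid, render_tid;
    pthread_create(&game_tid, NULL, game_thread, &game);
    pthread_create(&render_tid, NULL, render_thread, &game);
    pthread_join(game_tid, NULL);
    pthread_join(render_tid, NULL);

    pthread_cond_destroy(&game.queue.not_full);
    pthread_cond_destroy(&game.queue.not_empty);
    pthread_mutex_destroy(&game.queue.lock);
    return game.rendered;
}
```

Because the queue is bounded, the game thread blocks when it gets more than QUEUE_CAPACITY commands ahead of the render thread, which keeps the two threads from drifting far apart.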

The Vulkan API expands upon this model, given its capability to push 2 command buffers in parallel. Using this feature, you can distribute multiple render threads across multiple CPUs, further improving a scene's rendering time.

You can also make some engine-specific changes to enhance your game's multithreading performance:

  • If you're developing your game using the Unity game engine, enable the Multithreaded Rendering and GPU Skinning options.
  • If you're using a custom rendering engine, make sure that the render command pipeline and graphics command pipeline are aligned correctly; otherwise, you could introduce delays in displaying your game's scenes.

After applying these changes, you should see your game occupying at least 2 CPUs simultaneously, as shown in Figure 2:

Diagram of threads within a system trace

Figure 2. Systrace report for a multithreaded game

UI element loading

Diagram of a frame stack within a system trace

Figure 3. Systrace report for a game that's rendering dozens of UI elements at the same time

When creating a feature-rich game, it's tempting to show many different options and actions to the player at the same time. To maintain a consistent framerate, however, it's important to consider the relatively small size of mobile displays and keep your UI as simple as possible.

The Systrace report shown in Figure 3 is an example of a UI frame that's attempting to render too many elements relative to a mobile device's capabilities.

A good goal is to reduce the UI update time to 2-3 milliseconds. You can achieve such quick updates by performing optimizations similar to the following:

  • Update only the elements on screen that have moved.
  • Limit the number of UI textures and layers. Consider combining graphics calls, such as shaders and textures, that use the same material.
  • Defer element animation operations to the GPU.
  • Perform more aggressive frustum and occlusion culling.
  • If possible, perform draw operations using the Vulkan API. The draw call overhead is lower on Vulkan.
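
One common way to update only the elements that have moved is to keep a "dirty" flag on each element. The following sketch assumes a hypothetical UiElement type and an optional, caller-supplied redraw callback; it isn't tied to any particular engine or UI toolkit:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical UI element: a position plus a flag that records
 * whether the element changed since it was last drawn. */
typedef struct {
    float x, y;
    bool dirty;
} UiElement;

/* Moves an element and marks it dirty only if its position actually changes. */
void ui_move(UiElement *element, float x, float y) {
    assert(element != NULL);
    if (element->x != x || element->y != y) {
        element->x = x;
        element->y = y;
        element->dirty = true;
    }
}

/* Redraws only the dirty elements and clears their flags.
 * redraw may be NULL, in which case elements are only counted.
 * Returns the number of elements that needed a redraw. */
int ui_update(UiElement *elements, int count, void (*redraw)(const UiElement *)) {
    int redrawn = 0;
    for (int i = 0; i < count; ++i) {
        if (!elements[i].dirty) continue;  /* Unchanged: skip the draw work. */
        if (redraw != NULL) redraw(&elements[i]);
        elements[i].dirty = false;
        redrawn++;
    }
    return redrawn;
}
```

With this approach, a frame in which nothing moved costs one pass over the flags and no draw work.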

Power consumption

Even after making the optimizations discussed in the previous section, you might find that your game's framerate deteriorates within the first 45-50 minutes of gameplay. Furthermore, the device might begin to heat up and consume more battery over time.

In many cases, this undesirable heating and power consumption is related to how your game's workload is distributed across a device's CPUs. To improve your game's power efficiency, apply the best practices shown in the following sections.

Keep memory-heavy threads on one CPU

On many mobile devices, the L1 caches reside on specific CPUs, and L2 caches reside on the set of CPUs that share a clock. To maximize L1 cache hits, it's generally best to keep your game's main thread, along with any other memory-heavy threads, running on a single CPU.

Defer short-duration work to lower-powered CPUs

Most game engines, including Unity, know to defer worker thread operations onto a different CPU relative to your game's main thread. However, the engine isn't aware of a device's specific architecture and cannot anticipate your game's workload as well as you can.

Most system-on-a-chip devices have at least 2 shared clocks, one for the device's fast CPUs and one for the device's slow CPUs. A consequence of this architecture is that, if one fast CPU needs to operate at maximum speed, all the other fast CPUs also operate at maximum speed.

The example report shown in Figure 4 shows a game that takes advantage of fast CPUs. However, this high activity level consumes a great deal of power and generates heat quickly.

Diagram of threads within a system trace

Figure 4. Systrace report showing a suboptimal assignment of threads to the device's CPUs

To reduce overall power usage, it's best to suggest to the scheduler that shorter-duration work—such as loading audio, running worker threads, and executing the choreographer—be deferred to the set of slow CPUs on a device. Transfer as much of this work onto the slow CPUs as you can while maintaining a desired framerate.

Most devices list the slow CPUs before the fast CPUs, but you cannot assume that your device's SoC uses this order. To check, run commands similar to the ones shown in this CPU topology discovery code on GitHub.
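
As a rough illustration of one way to perform such a check (this isn't the linked GitHub code), the following sketch compares each CPU's maximum frequency. It assumes that the kernel exposes cpufreq data under /sys/devices/system/cpu and that the file is readable by your app, which isn't guaranteed on every device. The function names are hypothetical, and treating every CPU below the top frequency as "slow" is a simplification for SoCs that have more than two clusters.

```c
#include <assert.h>
#include <stdio.h>

/* Reads the maximum frequency of one CPU, in kHz, from sysfs.
 * Returns -1 if the file is missing or unreadable. */
long read_cpu_max_khz(int cpu) {
    char path[96];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
    FILE *file = fopen(path, "r");
    if (file == NULL) return -1;
    long khz = -1;
    if (fscanf(file, "%ld", &khz) != 1) khz = -1;
    fclose(file);
    return khz;
}

/* Sets is_slow[i] to 1 for every CPU whose maximum frequency is known
 * and lower than the highest one, and to 0 otherwise.
 * Returns the number of CPUs marked as slow. */
int mark_slow_cpus(const long *max_khz, int cpu_count, int *is_slow) {
    assert(max_khz != NULL && is_slow != NULL && cpu_count > 0);
    long fastest = max_khz[0];
    for (int i = 1; i < cpu_count; ++i) {
        if (max_khz[i] > fastest) fastest = max_khz[i];
    }
    int slow_count = 0;
    for (int i = 0; i < cpu_count; ++i) {
        is_slow[i] = (max_khz[i] >= 0 && max_khz[i] < fastest) ? 1 : 0;
        slow_count += is_slow[i];
    }
    return slow_count;
}
```

A caller would fill the max_khz array by calling read_cpu_max_khz() once per CPU index, and then pass the array to mark_slow_cpus().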

After you know which CPUs are the slow CPUs on your device, you can declare affinities for your short-duration threads, which the device's scheduler follows. To do so, add the following code within each thread:

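#define _GNU_SOURCE  // Needed in C to expose cpu_set_t and the CPU_* macros.
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

// A thread ID of 0 applies the affinity mask to the calling thread.
pid_t my_tid = 0;

// Assumes that cpu0, cpu1, cpu2, and cpu3 are the "slow CPUs".
cpu_set_t my_cpu_set;
CPU_ZERO(&my_cpu_set);
CPU_SET(0, &my_cpu_set);
CPU_SET(1, &my_cpu_set);
CPU_SET(2, &my_cpu_set);
CPU_SET(3, &my_cpu_set);

// sched_setaffinity() returns -1 and sets errno if the mask is rejected.
if (sched_setaffinity(my_tid, sizeof(cpu_set_t), &my_cpu_set) != 0) {
    perror("sched_setaffinity");
}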

Touch-to-display latency

Games that render frames as quickly as possible create a GPU-bound scenario, where the frame buffer becomes overstuffed. The CPU needs to wait for the GPU, which causes a noticeable delay between a player's input and the input taking effect on screen.

To determine whether you could improve your game's frame pacing, complete the following steps:

  1. Generate a Systrace report that includes the gfx and input categories. These categories comprise particularly useful measurements for determining touch-to-display latency.
  2. Check the SurfaceView section of a Systrace report. An overstuffed buffer causes the number of pending buffer draws to oscillate between 1 and 2, as shown in Figure 5:

    Diagram of buffer queue within a system trace

    Figure 5. Systrace report showing an overstuffed buffer that is periodically too full to accept drawing commands

To mitigate this inconsistency in frame pacing, complete the actions described in the following sections:

Integrate the Android Frame Pacing API into your game

The Android Frame Pacing API helps you perform frame swaps and define a swap interval such that your game maintains a more consistent framerate.

Reduce the resolution of your game's non-UI assets

The displays on modern mobile devices contain many more pixels than a player can process, so it's OK to downsample such that a run of 5 or even 10 pixels all contain one color. Given the structure of most display caches, it's best to reduce the resolution along one dimension only.

However, don't reduce the resolution of your game's UI elements. It's important to preserve the line thickness on these elements to maintain a large enough touch target size for all of your players.

Rendering smoothness

When SurfaceFlinger latches onto a display buffer to show a scene in your game, the CPU activity momentarily increases. If these spikes in CPU activity occur unevenly, it's possible to see stuttering in your game. The diagram in Figure 6 depicts the reason why this occurs:

Diagram of frames missing a Vsync window because they started drawing too late

Figure 6. Systrace report showing how a frame can miss a Vsync

If a frame starts drawing too late, even by a few milliseconds, it might miss the next display window. The frame must then wait until the next Vsync to be displayed (33 milliseconds when running a game at 30 FPS), which causes a noticeable delay from the player's perspective.
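
The cost of a late frame can be worked through with a small calculation. This sketch assumes an idealized display whose Vsync signals fall at exact multiples of the refresh period, starting at time 0; the function name next_vsync_us() is illustrative only:

```c
#include <assert.h>

/* Returns the time, in microseconds, of the first Vsync at or after
 * the moment a frame is ready, for Vsyncs at multiples of period_us. */
long next_vsync_us(long ready_us, long period_us) {
    assert(ready_us >= 0 && period_us > 0);
    /* Ceiling division: count the periods needed to reach ready_us. */
    long periods = (ready_us + period_us - 1) / period_us;
    return periods * period_us;
}
```

At 30 FPS the period is about 33,333 microseconds. A frame that is ready at 33,000 microseconds is displayed at 33,333, but a frame that is ready at 35,000 microseconds, less than 2 milliseconds after that Vsync, isn't displayed until 66,666.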

To address this situation, use the Android Frame Pacing API, which always presents a new frame on a Vsync wavefront.

Memory state

When running your game for an extended period of time, it's possible for the device to experience out-of-memory errors.

In this situation, check the CPU activity in a Systrace report and see how often the system is making calls to the kswapd daemon. If there are many calls during your game's execution, it's best to take a closer look at how your game is managing and cleaning up memory.

For more information, see Manage memory effectively in games.

Thread state

When navigating through the typical elements of a Systrace report, you can view the amount of time that a given thread spent in each possible thread state by selecting the thread within the report, as shown in Figure 7:

Diagram of a Systrace report

Figure 7. Systrace report showing how selecting a thread causes the report to display a state summary for that thread

As Figure 7 shows, you might find that your game's threads aren't in the "running" or "runnable" state as often as they should be. The following list shows several common reasons why a given thread might be periodically transitioning to an unusual state:

  • If a thread is sleeping for an extended period of time, it might be either suffering from lock contention or waiting for GPU activity.
  • If a thread is constantly blocked on I/O, you're either reading too much data from disk at a time, or your game is thrashing.

Additional resources

To learn more about improving your game's performance, see the following additional resources:

Videos