Selective multithreading for sporadic processor workloads
Systems and methods for processing user-interface animations are disclosed. The method may include processing a first frame of a user-interface animation with a first processing core, monitoring a processing time of the first frame of the user-interface animation relative to a first synchronization pulse, and processing, if the elapsed processing time exceeds a threshold, a first portion of the user-interface animation with the first processing core and a second portion of the user-interface animation with a second processing core. Processing of a next frame of the user-interface animation may be initiated with the first processing core while the second processing core is processing the second portion of the user-interface animation.
1. Field
The present disclosed embodiments relate generally to computing devices, and more specifically to multithreading on processors of computing devices.
2. Background
Computing devices such as smartphones, tablet computers, gaming devices, and laptop computers are now ubiquitous. These computing devices are now capable of running a variety of applications (also referred to as “apps”), and many of these devices include multiple processors to process tasks that are associated with apps. In many instances, multiple processors are integrated as a collection of processor cores within a single functional subsystem. It is known that the processing load on a mobile device may be apportioned to the multiple cores. Some sophisticated devices, for example, have multiple core processors that may be operated asynchronously at different frequencies. On these types of devices, the amount of work that is performed on each processor may be monitored and controlled by a CPU governor to meet workloads.
A user's experience on a computing device is generally dictated by how smoothly user-interface (“UI”) animations run on the device. UI animations on Android-based devices, application scrolls (e.g., browser scroll, email scroll, home launcher scrolls, etc.), and the visually attractive animations that are displayed in connection with application launches present an important use-case of a periodic workload on the CPU that is also sporadic in nature. Usually a performance benchmark places fixed-sized loads on the CPU, which allows the system to latch onto the right clock frequency when running the benchmark. If a particular processing core has a heavy load, the frequency of that processing core may be increased. If a processing core has a relatively low load or is idle, the frequency of that core may be decreased (e.g., to reduce power consumption).
The Linux operating system for example, may use an on demand governor, which monitors the workload on each processor and adjusts the corresponding clock frequency based on the workload. The adjustment of the clock frequency may be heuristic based and may provide power benefits when operating properly. This approach to adjusting the CPU clock frequency works well if the workload is overall consistent, which is usually the case for many of the typical benchmarks. But the on demand governor does not scale well when the periodic workload also becomes sporadic because there is no consistent history associated with the sporadic workload.
As a consequence, sporadic workloads are a challenge for the governor, and to finish a periodic workload in a timely manner, a known (and non-optimal) system solution is to run a processor locked at a higher clock frequency for user-interface workloads. Similarly, other CPU governors (e.g., the Interactive governor) respond more aggressively and increase processor clock frequencies when servicing sporadic/interactive workloads. These governor-based approaches that increase processor clock frequencies adversely impact power consumption, are merely “best effort,” and are not deterministic with respect to changing workloads. In short, existing approaches to handling sporadic workloads result in “stuttering,” undesirable power consumption, and/or poor application performance.
SUMMARY
Aspects of the present invention may be characterized as a method for processing user-interface animations on a computing device. The method may include processing a first frame of a user-interface animation with a first processing core and monitoring a processing time of the first frame of the user-interface animation relative to a first synchronization pulse. If the elapsed processing time exceeds a threshold, a first portion of the user-interface animation is processed with the first processing core and a second portion of the user-interface animation is processed with a second processing core. Processing of a next frame of the user-interface animation is then initiated at substantially the same time as a second synchronization pulse while the second processing core is processing the second portion of the first frame.
Other aspects may be characterized as a computing device that includes means for processing a first frame of a user-interface animation with a first processing core and means for monitoring a processing time of the first frame of the user-interface animation relative to a first synchronization pulse. The computing device also includes means for processing, if the elapsed processing time exceeds a threshold, a first portion of the user-interface animation with the first processing core and means for processing a second portion of the user-interface animation with a second processing core. In addition, the computing device includes means for initiating, at substantially a same time as a second synchronization pulse, processing of a next frame of the user-interface animation with the first processing core while the second processing core is processing the second portion of the user-interface animation.
Yet another aspect may be characterized as a non-transitory, tangible processor readable storage medium, encoded with processor readable instructions to perform a method that includes processing a first frame of a user-interface animation with a first processing core and monitoring a processing time of the first frame of the user-interface animation relative to a first synchronization pulse. If the elapsed processing time exceeds a threshold, a first portion of the user-interface animation is processed with the first processing core and a second portion of the user-interface animation is processed with a second processing core. Processing of a next frame of the user-interface animation is then initiated at substantially a same time as a second synchronization pulse while the second processing core is processing the second portion of the user-interface animation.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
Several embodiments disclosed herein provide more optimal handling of sporadic workloads for UI animations to improve the user experience. As discussed herein, many of these embodiments do not rely on system CPU governors to improve the user experience. Referring to
The one or more applications 102 may be realized by a variety of applications that operate via, or run on, the processor 114. For example, the one or more applications 102 may include an email application 103 (e.g., Gmail), a web browser 103 and associated plug-ins, entertainment applications (e.g., video games, video players), productivity applications (e.g., word processing, spreadsheet, publishing applications, video editing, photo editing applications), core applications (e.g., phone, contacts), and augmented reality applications.
As one of ordinary skill in the art will appreciate, the user-space 130 and kernel-space 132 components depicted in
In general, the workload detection component 118 operates to detect how long a frame of animation is taking to process (e.g., on a first core of the processor 114), and if the processing time that elapses exceeds a threshold, the workload detection component 118 signals the workload division component 120 to divide the workload. More specifically, the workload division component 120 will prompt the kernel scheduling component 110 to divide the workload so that a prepare-stage of animation is processed by the first core and a render-stage of animation is executed by a second core of the processor 114. In this way, the prepare-stage may promptly begin in connection with a synchronization pulse to remove undesirable stuttering as discussed below.
In many variations of the embodiment depicted in
In connection with the Android platform, the processor 114 manages the UI elements (Android View Hierarchy) and prepares rendering work (e.g., using OpenGLES) for the GPU to finish. As is known to those of skill in the art, the UI workload on the Android platform may be split between the processor 114 and the GPU 115 (Android uses the GPU to do the UI rendering as an optimization). Based upon how the applications 102 are designed, both the processor 114 and the GPU 115 workloads can vary during the course of an animation sequence. Under ideal conditions, it is desirable for the application to be double buffered such that both the processor 114 and the GPU 115 finish their workloads in the current VSYNC cycle to show the rendered frame in the very next VSYNC cycle on the display.
Assuming the computing device 100 operates at a 60 Hz industry standard display refresh rate, the UI workload (processor 114+GPU 115) will have to be completed within 16.66 ms for a double-buffered solution. In a soft real time system (e.g., utilizing the Android platform) this is often not achievable, so in some modes of operation, triple buffering is utilized to allow for an extra frame of latency on the display and hence allow the workload (processor 114+GPU 115) to complete within 33.33 ms. But to show smooth frame-displacement on the screen, the processor 114 workload must start at the VSYNC pulse and complete before the next VSYNC pulse, but the processing performed by the GPU 115 can extend into the next VSYNC cycle.
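The frame budgets quoted above follow directly from the refresh rate. The following is a minimal sketch of that arithmetic; the function names are hypothetical and are not part of the disclosure. Note that 1000/60 is 16.666... ms, which the text truncates to 16.66 ms.

```python
def vsync_period_ms(refresh_hz: float) -> float:
    """Length of one VSYNC cycle in milliseconds."""
    return 1000.0 / refresh_hz


def ui_workload_budget_ms(refresh_hz: float, buffers: int) -> float:
    """Time available for the combined processor + GPU workload.

    Double buffering (buffers=2) leaves one VSYNC cycle. Triple buffering
    (buffers=3) tolerates one extra frame of latency, leaving two cycles.
    """
    if buffers < 2:
        raise ValueError("at least double buffering is assumed")
    return (buffers - 1) * vsync_period_ms(refresh_hz)


if __name__ == "__main__":
    for buffers in (2, 3):
        print(f"{buffers} buffers: {ui_workload_budget_ms(60, buffers):.2f} ms")
```

Whichever budget applies to the combined workload, the processor-side work is still bound to a single VSYNC period, as the paragraph above states.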
Regardless of whether double-buffering or the more conservative triple-buffering is utilized, the processor 114 workload (albeit sporadic) must start at the VSYNC pulse and finish before the next VSYNC pulse to get smooth time based animations. Due to the sporadic nature of the processor 114 workload, the CPU governor often does not respond in time to increase the clock frequencies of the cores 116 when the workload becomes high. This results in a non-uniform displacement from frame to frame on the screen in the UI animation sequence. This is one of the types of UI stutter prevalent in time-based animations, which several embodiments discussed further herein avoid.
More specifically, embodiments disclosed herein address issues related to periodic workloads, which are also sporadic in nature with no specific history. Applicant has found, for example, that only a portion of the workload needs to be absolutely periodic and a second part can be allowed some extra latency. As a consequence, the workload can be split into two distinct parts and run in two corresponding separate threads on the multicore processor 114. Assuming, for example, the periodic interval for VSYNC is t milliseconds, the first part of the workload (which needs fixed periodicity) can, at times, overlap with the second part of a previous workload in a given periodic interval of t milliseconds.
In addition, the workload may be selectively split so that the processing load on the computing device 100, as a whole, is handled more optimally. In some embodiments for example, if the processing of the workload extends beyond a certain time threshold (e.g., t/2 milliseconds) in a given periodic interval of t ms, the second part of the workload is split into a separate worker thread. This selective division of the workload avoids redundant multithreading overhead, and the workload splitting threshold can be tuned based on the particular use-case.
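One way to picture the selective split is a small detector that is reset at each VSYNC pulse and signals at most once per frame when the elapsed time passes a tunable fraction of the period. This is only a sketch under stated assumptions: the class name, the `on_split` callback (standing in for the signal to a workload division component), and the use of caller-supplied timestamps are all hypothetical.

```python
from typing import Callable, Optional


class WorkloadDetector:
    """Hypothetical detector: signals once per frame when the elapsed
    processing time exceeds fraction * period (t/2 by default)."""

    def __init__(self, period_ms: float,
                 on_split: Callable[[float], None],
                 fraction: float = 0.5) -> None:
        self.threshold_ms = fraction * period_ms  # tunable per use-case
        self._on_split = on_split
        self._frame_start_ms: Optional[float] = None
        self._signalled = False

    def on_vsync(self, now_ms: float) -> None:
        """Mark the start of a new frame at a synchronization pulse."""
        self._frame_start_ms = now_ms
        self._signalled = False

    def poll(self, now_ms: float) -> bool:
        """Return True if this frame's workload should be split."""
        if self._frame_start_ms is None:
            return False
        elapsed = now_ms - self._frame_start_ms
        if elapsed > self.threshold_ms and not self._signalled:
            self._signalled = True
            self._on_split(elapsed)  # request a separate worker thread
        return self._signalled


if __name__ == "__main__":
    requests = []
    detector = WorkloadDetector(period_ms=16.0, on_split=requests.append)
    detector.on_vsync(0.0)
    print(detector.poll(5.0), detector.poll(9.0), requests)
```

Passing timestamps in, rather than reading a clock inside the class, keeps the sketch deterministic; a real implementation would presumably read a monotonic clock.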
Referring next to
On a deferred GPU rendering architecture (e.g., an Adreno GPU rendering architecture) the processor 114 may pack the GL commands into a command buffer during the rendering stage 346, which is executed later by the GPU 115.
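The record-now, execute-later pattern described above can be sketched as a simple command buffer. This is an illustration only: the `CommandBuffer` class and the command names are hypothetical and do not correspond to any OpenGL ES or Adreno API.

```python
from typing import Callable, List


class CommandBuffer:
    """Hypothetical recorder: stores commands now and runs them later."""

    def __init__(self) -> None:
        self._commands: List[Callable[[], None]] = []

    def record(self, command: Callable[[], None]) -> None:
        """Pack a command into the buffer without executing it."""
        self._commands.append(command)

    def execute(self) -> int:
        """Run the recorded commands in order, clear the buffer, and
        return how many commands were executed."""
        count = len(self._commands)
        for command in self._commands:
            command()
        self._commands.clear()
        return count


if __name__ == "__main__":
    log: List[str] = []
    buffer = CommandBuffer()
    buffer.record(lambda: log.append("clear"))
    buffer.record(lambda: log.append("draw_quad"))
    # Nothing has run yet; execution is deferred until execute() is called.
    executed = buffer.execute()
    print(executed, log)
```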
While referring to
If the time threshold has been exceeded (Block 406), then a second portion of the workload, the rendering stage 346, is carried out with a second processing core (Block 410). As depicted, while the rendering stage 346 is occurring, the prepare stage 348 for the next frame is initiated substantially concurrently with the VSYNC pulse so that the prepare stage 348 for the next frame begins while the render stage 346 for the previous frame is still in progress. As depicted, the prepare stage 344 and rendering stage 346 are executed in sequence for a given frame, but they are distinct in that the prepare stage of frame n+1 may overlap with the rendering stage of previous frame n. In addition, the prepare stages 344, 348 are carried out on separate cores from the render stage 346; thus, UI workload is effectively multithreaded in parallel.
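The hand-off between the prepare stage and a separate render worker can be sketched with a queue between two threads. This is a structural illustration under several assumptions: `run_animation`, `prepare`, and `render` are hypothetical names, the sketch always splits (it omits the threshold check and VSYNC pacing), and CPython threads do not execute CPU-bound Python code in parallel, so it shows the pipeline shape rather than a performance gain.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List


def run_animation(frames: Iterable[Any],
                  prepare: Callable[[Any], Any],
                  render: Callable[[Any], Any]) -> List[Any]:
    """Prepare frames on the calling thread and render them on a worker,
    so that prepare(n+1) may begin while render(n) is still in progress."""
    work: "queue.Queue[Any]" = queue.Queue()
    rendered: List[Any] = []
    done = object()  # sentinel that tells the worker to stop

    def render_worker() -> None:
        while True:
            item = work.get()
            if item is done:
                break
            rendered.append(render(item))

    worker = threading.Thread(target=render_worker)
    worker.start()
    for frame in frames:
        work.put(prepare(frame))  # hand the prepared frame to the worker
    work.put(done)
    worker.join()
    return rendered


if __name__ == "__main__":
    result = run_animation(range(3),
                           prepare=lambda n: ("prepared", n),
                           render=lambda p: ("rendered", p[1]))
    print(result)
```

Because a single worker drains a FIFO queue, frames are rendered in the order in which they were prepared.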
The condition that the elapsed processing time exceed a threshold (Block 406) is implemented because multithreading adds its own system overhead; thus multithreading is not always desirable. In general, the goal is to ensure that the prepare stage starts at the beginning of the VSYNC pulse to maintain a smooth animation sequence and the corresponding frame-to-frame displacement. In several embodiments, if the combined time to effectuate the prepare stage and rendering stage of the UI workload is small enough to be completed within a VSYNC cycle (e.g., 16.66 ms for a 60 Hz panel refresh rate), multithreading is not utilized, and in many embodiments, the UI workload may be selectively multithreaded based on heuristics. In one exemplary embodiment, the threshold is one half of the VSYNC cycle period so that if the prepare stage takes more than half the VSYNC cycle (0.5*VSYNC cycle period), a separate worker thread is utilized for the rendering stage.
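The effect of the one-half threshold on prepare-stage timing can be illustrated with a small timeline model. This is a sketch under simplifying assumptions, none of which come from the disclosure: stage durations are known in advance, the prepare stage may only begin on a VSYNC pulse, the second core is always free when needed, and the function name and sample workloads are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def prepare_start_times(frames: Sequence[Tuple[float, float]],
                        period_ms: float,
                        split_fraction: Optional[float] = None) -> List[float]:
    """Return the time at which each frame's prepare stage starts.

    frames holds (prepare_ms, render_ms) pairs. With split_fraction=None
    both stages always run on the first core. Otherwise the render stage
    moves to a second core whenever prepare_ms exceeds
    split_fraction * period_ms.
    """
    starts: List[float] = []
    first_core_free = 0.0
    for prepare_ms, render_ms in frames:
        earliest = first_core_free
        if starts:
            # Never start sooner than one period after the previous frame.
            earliest = max(earliest, starts[-1] + period_ms)
        # Round up to the next VSYNC pulse (small tolerance for floats).
        start = math.ceil(earliest / period_ms - 1e-9) * period_ms
        starts.append(start)
        split = (split_fraction is not None
                 and prepare_ms > split_fraction * period_ms)
        first_core_free = start + prepare_ms + (0.0 if split else render_ms)
    return starts


if __name__ == "__main__":
    workload = [(4, 4), (10, 9), (4, 4), (4, 4)]  # hypothetical durations, ms
    print(prepare_start_times(workload, 16.0))       # no splitting
    print(prepare_start_times(workload, 16.0, 0.5))  # split above t/2
```

With these sample inputs and a 16 ms period, the unsplit run starts the third frame at 48 ms instead of 32 ms because the second frame's 19 ms of combined work overruns its cycle, whereas the split run keeps every prepare stage on consecutive pulses (0, 16, 32, 48 ms). The light frames never trigger a split, which is how the threshold avoids unnecessary multithreading overhead.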
Beneficially, on the multicore processor 114, the prepare stage of frame n+1 can always execute in parallel to the rendering stage of previous frame n. As a consequence, the system CPU governor need not be relied upon to run the processor 114 cores 116 at turbo/high frequency instantaneously to handle the sudden increase in the workload. Instead, the workload is split when necessary to allow the workload to complete on different cores 116 which can run at lower clock frequencies; thus there may be power savings in some instances.
Referring to
In general, the nonvolatile memory 520 functions to store (e.g., persistently store) data and executable code including code that is associated with the functional components depicted in
In many implementations, the nonvolatile memory 520 is realized by flash memory (e.g., NAND or ONENAND™ memory), but it is certainly contemplated that other memory types may be utilized as well. Although it may be possible to execute the non-transitory code from the nonvolatile memory 520, the executable code in the nonvolatile memory 520 is typically loaded into RAM 524 and executed by one or more of the N processing components in the processing portion 526.
The camera actuator 510 in the embodiment depicted in
The N processing components 526 in connection with RAM 524 generally operate to execute the instructions stored in nonvolatile memory 520 to effectuate the functional components depicted in
The depicted transceiver component 528 includes N transceiver chains for communicating with external devices. Each of the N transceiver chains represents a transceiver associated with a particular communication scheme. For example, one transceiver chain may operate according to wireline protocols, another transceiver may communicate according to WiFi communication protocols (e.g., 802.11 protocols), another may communicate according to cellular protocols (e.g., CDMA or GSM protocols), and yet another may operate according to Bluetooth protocols. Although the N transceivers are depicted as a transceiver component 528 for simplicity, it is certainly contemplated that the transceiver chains may be separately disposed about the mobile computing device.
The display 512 generally operates to provide text and non-text content (e.g., UI animations) to a user. Although not depicted for clarity, one of ordinary skill in the art will appreciate that other components including a display driver and backlighting (depending upon the technology of the display) are also associated with the display 512.
The architecture depicted in
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A method for processing user-interface animations on a computing device, the method comprising:
- processing a first frame of a user-interface animation with a first processing core;
- monitoring an elapsed processing time of the first frame of the user-interface animation relative to a first synchronization pulse;
- processing, if the elapsed processing time exceeds a threshold, a first portion of the first frame with the first processing core and processing a second portion of the first frame with a second processing core; and
- initiating, at substantially a same time as a second synchronization pulse, processing of a next frame of the user-interface animation with the first processing core while the second processing core is processing the second portion of the first frame.
2. The method of claim 1, wherein the user-interface animation is an Android-based user-interface animation and the second portion of the user-interface animation is a render-stage.
3. The method of claim 1, wherein the threshold is one-half of a time period of the synchronization pulse.
4. The method of claim 1, wherein the threshold is a tunable threshold.
5. The method of claim 1, wherein the monitoring includes monitoring the elapsed processing time with a workload detection component implemented in connection with a user-level library.
6. The method of claim 1, wherein processing the first portion includes managing user-interface elements, frame-to-frame displacement calculation, and recording graphics library commands.
7. The method of claim 6, wherein processing the second portion includes executing recorded graphics library commands.
8. A non-transitory, tangible processor readable storage medium, encoded with processor readable instructions to perform a method for processing user-interface animations, the method comprising:
- processing a first frame of a user-interface animation with a first processing core;
- monitoring an elapsed processing time of the first frame of the user-interface animation relative to a first synchronization pulse;
- processing, if the elapsed processing time exceeds a threshold, a first portion of the first frame with the first processing core and processing a second portion of the first frame with a second processing core; and
- initiating, at substantially a same time as a second synchronization pulse, processing of a next frame of the user-interface animation with the first processing core while the second processing core is processing the second portion of the first frame.
9. The non-transitory, tangible processor readable storage medium of claim 8, wherein the user-interface animation is an Android-based user-interface animation and the second portion of the user-interface animation is a render-stage.
10. The non-transitory, tangible processor readable storage medium of claim 8, wherein the threshold is one-half of a time period of the synchronization pulse.
11. The non-transitory, tangible processor readable storage medium of claim 8, wherein the threshold is a tunable threshold.
12. The non-transitory, tangible processor readable storage medium of claim 8, wherein the processor readable instructions, when executed, implement a workload detection component that operates in connection with a user-level library.
13. The non-transitory, tangible processor readable storage medium of claim 8, wherein processing the first portion includes managing user-interface elements, frame-to-frame displacement calculation, and recording graphics library commands.
14. The non-transitory, tangible processor readable storage medium of claim 13, wherein processing the second portion includes executing recorded graphics library commands.
15. A computing device comprising:
- means for processing a first frame of a user-interface animation with a first processing core;
- means for monitoring an elapsed processing time of the first frame of the user-interface animation relative to a first synchronization pulse;
- means for processing, if the elapsed processing time exceeds a threshold, a first portion of the first frame with the first processing core and a second portion of the first frame with a second processing core; and
- means for initiating, at substantially a same time as a second synchronization pulse, processing of a next frame of the user-interface animation with the first processing core while the second processing core is processing the second portion of the first frame.
16. The computing device of claim 15, wherein the user-interface animation is an Android-based user-interface animation and the second portion of the user-interface animation is a render-stage.
17. The computing device of claim 15, wherein the threshold is one-half of a time period of the synchronization pulse.
18. The computing device of claim 15, wherein the threshold is a tunable threshold.
19. The computing device of claim 15, wherein the monitoring includes monitoring the processing time with a workload detection component implemented in connection with a user-level library.
Type: Grant
Filed: Jan 7, 2014
Date of Patent: Nov 3, 2015
Patent Publication Number: 20150193959
Assignee: Qualcomm Innovation Center, Inc. (San Diego, CA)
Inventors: Premal Shah (San Diego, CA), Omprakash Dhyade (San Diego, CA)
Primary Examiner: Kee M Tung
Assistant Examiner: Frank Chen
Application Number: 14/149,701
International Classification: G06T 13/00 (20110101); G06T 1/20 (20060101); G06F 3/00 (20060101);