Title of Invention

A METHOD FOR RECOVERING ONE CORE EXCEPTION IN MULTI-CORE SYSTEM

Abstract
The invention relates to multi-core CPU systems, and in particular to a method for recovering from an exception on one core of a multi-core system. When an exception occurs on one core, that core is recovered without pausing the operation of the system. In the exception processing program of the invention, the core in an exceptional state first sets its state value to "exception" and then selects a core in a normal state to assist with recovery; the system scheduling module is informed so that it can reassign system tasks, which ensures that recovery is completed as soon as possible and shortens the recovery time. The invention has the following advantages: the recovery method effectively ensures that system operation is not interrupted and system resources are not lost before and after the exception and recovery of one core; the abnormal core works normally after recovery, which prolongs the possible run time of the system and strengthens its reliability.
Full Text

Specification
A Method for Recovering one Core Exception in Multi-core System
Technical Field
The invention relates to multi-core CPU systems, and in particular to a method for recovering from an exception on one core of a multi-core system.
Background of the Invention
In a multi-core CPU embedded system (hereinafter referred to as a multi-core system), exceptions may occur on one core regardless of whether the system is a symmetrical multi-core system or a master-slave multi-core system. Such exceptions include illegal instructions, non-aligned operations, cache exceptions, data bus errors, etc. There are many possible causes. One is an accidental hardware error or invalid data that causes a program processing exception; another is that the system runs a rarely reached branch of the program. However, most of these errors cause only one-off damage to the system, since regularly occurring errors are detected and fixed during system testing.
In the prior art, the usual approach to an exception on one core is simply to record the exception information and then restart the whole system. Although this approach restores system operation, it pauses all services and shortens the operating time of the system. Existing multi-core systems, in particular, are generally deployed in high-level or core positions, such as a provincial-level core router or a PBX (private branch exchange). Once such equipment malfunctions, the consequences are serious. It also takes a long time for the system to restart and return to normal operation, which may have an enormous impact. Therefore, it is particularly important to extend the operating time of a multi-core system. Moreover, it is not worthwhile to restart the whole system just for some non-fatal errors.
Content of Invention
The technical problem to be solved by the invention is the above-mentioned problem in the existing technology. The invention provides a method for recovering from an exception on one core of a multi-core system. When an exception occurs on one core, that core is recovered without pausing the operation of the system.

The technical scheme adopted in the invention to solve the above-mentioned technical problem is a method for recovering from an exception on one core in a multi-core system that includes a shared memory and a system scheduling module. The method comprises the following steps:
a. Set a storage cell in the shared memory, store the state value of each core in it, and set the initial state value of all cores to "normal";
b. When an exception occurs on one core, this core runs an exception processing program automatically, sets its state value to "exception" and notifies another, selected core in a normal state; the core in an exceptional state then enters an infinite loop automatically;
c. The selected core in a normal state sets the core in an exceptional state to the reset state and notifies the system scheduling module. The module then dispatches the tasks originally assigned to the core in the exceptional state to any other core in a normal state. The selected core in a normal state recycles all resources occupied by the core in the exceptional state and finally releases the core in the exceptional state from reset;
d. After release from reset, the core in an exceptional state restarts and, once start-up is completed, sets its state value to "waiting for recovery";
e. After the selected core in a normal state detects that the state value of the core in an exceptional state is "waiting for recovery", it sets the state value of that core to "normal" and notifies the system scheduling module.
Furthermore, in Step b, an inter-core interrupt is used to send the notification between cores.
Furthermore, the system scheduling module monitors the state of each core according to the state value stored in the storage cell. If the module detects that a core is in an exceptional state, it does not assign further tasks to that core.
Specifically, if the multi-core system is a symmetrical multi-core system, the core selected in Step b can be any core that is in a normal state.
Specifically, if the multi-core system is a master-slave multi-core system, the core selected in Step b is the core that is in the master state.
The invention has the following beneficial effects: when an exception occurs on one core in the multi-core system, the tasks originally assigned to the core in an exceptional state are dispatched to other normal cores so that these tasks run in time, which effectively ensures that the operation of the whole system is not paused and that system resources are not lost before or after the core in an exceptional state is recovered. After recovery, the core in an exceptional state can operate normally, which extends the operating time of the system and enhances its reliability.
Description of the Drawings
Figure 1 is a program flowchart of the embodiment of the invention.
Detailed Description of the Preferred Embodiments
The technical scheme of the invention is described in more detail below with reference to the attached drawing and the embodiment.
In a multi-core system with a shared memory and a system scheduling module, the invention defines a special storage cell in the shared memory and uses a global array to store the state of each core. The array can be indexed by core number, so that each array element holds the state value of one core. The state values of a core are defined as "normal", "exception" and "waiting for recovery". The initial state value of all cores is set to "normal". In a multi-core system, all tasks executed by each core are assigned by the system scheduling module. A state monitoring program is arranged in the system scheduling module. When the system scheduling module assigns tasks, it first checks the current state of each core, and if a core is in an exceptional state, the module does not assign tasks to it. When an exception occurs on one core, the CPU exception processing program runs.
In the exception processing program of the invention, the core in an exceptional state first sets its state value to "exception", then selects a core in a normal state and notifies it by means of an inter-core interrupt. The system scheduling module then dispatches all tasks assigned to the core in exception to other cores in a normal state according to its scheduling algorithm, so that recovery can be accomplished as soon as possible and the recovery time is shortened. After the notification is sent, the core in an exceptional state enters an infinite loop and cannot exit the exception processing program, which prevents further errors and damage.
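The state array and the scheduler check described above can be sketched as follows. This is only an illustrative model in Python, not the implementation of the invention: the core count, the name assign_tasks and the round-robin policy are assumptions, since the specification leaves the scheduling algorithm open.

```python
from enum import Enum


class CoreState(Enum):
    NORMAL = "normal"
    EXCEPTION = "exception"
    WAITING_FOR_RECOVERY = "waiting for recovery"


NUM_CORES = 4  # assumed core count, for illustration only

# Stand-in for the storage cell in shared memory: the index is the core number.
core_state = [CoreState.NORMAL] * NUM_CORES


def assign_tasks(tasks, states):
    """Distribute tasks round-robin (an assumed policy) over the normal cores."""
    eligible = [core for core, s in enumerate(states) if s is CoreState.NORMAL]
    if not eligible:
        raise RuntimeError("no core in a normal state")
    assignment = {core: [] for core in eligible}
    for i, task in enumerate(tasks):
        assignment[eligible[i % len(eligible)]].append(task)
    return assignment


core_state[2] = CoreState.EXCEPTION  # core 2 has reported an exception
plan = assign_tasks(["t0", "t1", "t2", "t3"], core_state)
print(plan)
```

Core 2 receives no tasks while its state value is anything other than "normal".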

In a symmetrical multi-core system, any core can set the state of any other core. Therefore, when an exception occurs on one core, this core can select any other core in a normal state. In addition, any core is able to reset one or more other cores. There are two algorithms for selecting a core in a normal state, namely sequential search and random search. The advantage of sequential search is its simplicity, while its disadvantage is that the selected core is comparatively fixed. The advantage of random search is that the selected core is not fixed, which can increase the probability of successful recovery, while its disadvantage is that the algorithm is comparatively complicated.
In a master-slave multi-core system, only the core in the master state can recover a core in an exceptional state. In other words, when a core is in an exceptional state, it must notify the core in the master state so that the latter can carry out the recovery operation.
A multi-core CPU has several inter-core communication mechanisms, one of which is the inter-core interrupt. Its advantage is that it is very fast and can deliver an event notification immediately. For this reason, the invention adopts the inter-core interrupt to send the notification.
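The selection rules described above can be sketched as follows. The function names are hypothetical, and returning None when no suitable core exists is an assumption, because the specification does not say what happens in that case.

```python
import random

NORMAL, EXCEPTION, WAITING = "normal", "exception", "waiting for recovery"


def select_sequential(states, faulty):
    """Return the lowest-numbered normal core other than `faulty`, or None."""
    for core, state in enumerate(states):
        if core != faulty and state == NORMAL:
            return core
    return None


def select_random(states, faulty, rng=random):
    """Return a randomly chosen normal core other than `faulty`, or None."""
    candidates = [c for c, s in enumerate(states) if c != faulty and s == NORMAL]
    return rng.choice(candidates) if candidates else None


def select_master(states, master):
    """Master-slave case: only the master core may carry out recovery."""
    return master if states[master] == NORMAL else None


states = [NORMAL, EXCEPTION, NORMAL, NORMAL]  # core 1 is the faulty core
print(select_sequential(states, faulty=1))
print(select_random(states, faulty=1, rng=random.Random(0)))
print(select_master(states, master=0))
```

Sequential search always returns the same core for a given state array, which is the "comparatively fixed" behaviour mentioned above, whereas random search spreads the choice over all normal cores.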
Embodiment
In a symmetrical multi-core CPU embedded system, as shown in Figure 1, in Step 101 an illegal operation occurs on core A and causes an exception. Only core A jumps to the exception vector and enters the CPU exception processing program; the other cores continue normal operation. In the exception processing program, core A first records the exception information, including the exception type, the exception PC pointer, the values of all state registers, the stack structure, etc.
In Step 102, in the exception processing program, core A changes its state value in the storage cell in the shared memory to "exception". When the system scheduling module assigns tasks, it first checks the current state of each core. If a core is in an exceptional state, the module does not assign tasks to this core.
In Step 103, in the exception processing program, core A randomly selects core B, which is in a normal state, uses an interrupt to notify core B, and finally enters an infinite loop. Core A therefore never exits the exception processing program, which prevents the faulting instruction from being re-executed and causing a further exception.
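Steps 101 to 103 can be sketched from the point of view of the faulty core as follows. This is a simplified model under stated assumptions: the mailbox of queues stands in for the inter-core interrupt, the names exception_handler, mailbox and exception_log are hypothetical, and the function returns instead of spinning so that the sketch can be run.

```python
import random
from collections import deque

NORMAL, EXCEPTION = "normal", "exception"

core_state = [NORMAL, NORMAL, NORMAL, NORMAL]   # storage cell in shared memory
mailbox = {core: deque() for core in range(4)}  # stand-in for inter-core interrupts
exception_log = []                              # recorded exception information


def exception_handler(core, info, rng=random):
    """Steps 101-103 as run by the faulty core; returns the helper it notified."""
    exception_log.append({"core": core, **info})          # Step 101: record
    core_state[core] = EXCEPTION                          # Step 102: mark state
    helpers = [c for c, s in enumerate(core_state) if s == NORMAL]
    if not helpers:
        return None                                       # assumed: nobody to notify
    helper = rng.choice(helpers)                          # Step 103: random pick
    mailbox[helper].append(("core_exception", core))      # notify the helper
    # On real hardware the core would now spin forever and never leave the
    # handler; the sketch returns here only so that it can be executed.
    return helper


helper = exception_handler(0, {"type": "illegal instruction", "pc": 0x80001234})
print(helper, list(mailbox[helper]))
```

Because core A marks itself as "exception" before selecting a helper, it can never select itself.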

In Step 104, core B, which is in a normal state, receives the interrupt from core A, wakes its daemon process for core exception recovery, searches for the core that is in an exceptional state and prepares for recovery.
In Step 105, core B uses the CPU global control register to set core A to the reset state. In a multi-core CPU, a core that is set to the reset state does not execute any code, which means that it is stopped. Once the core is released from the reset state, it reads and runs instructions from a fixed start address, that is, it performs a restart.
In Step 106, core B notifies the system scheduling module. The module then transfers the tasks originally assigned to core A to other cores in a normal state according to its scheduling algorithm, so as to ensure the timeliness of task execution.
In Step 107, core B returns the resources originally occupied by core A to the system. These resources mainly include the task queue, stack space, interrupts, etc.
In Step 108, core B uses the CPU global control register to release core A from the reset state, and core A restarts. Core B then polls the state value of core A in the storage cell in the shared memory and waits for it to change to "waiting for recovery".
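The reset behaviour described in Step 105 can be modelled roughly as follows. The class names, the reset address and the fixed instruction width are all assumptions made for illustration; a real CPU defines its own global control register layout and start address.

```python
RESET_VECTOR = 0x0000  # assumed fixed start address, for illustration only
INSTRUCTION_SIZE = 4   # assumed fixed instruction width


class Core:
    def __init__(self, pc):
        self.pc = pc
        self.in_reset = False

    def step(self):
        """Execute one instruction; a core held in reset executes nothing."""
        if self.in_reset:
            return False
        self.pc += INSTRUCTION_SIZE
        return True


class GlobalControlRegister:
    """Hypothetical stand-in for the CPU global control register."""

    def __init__(self, cores):
        self.cores = cores

    def hold_in_reset(self, name):
        self.cores[name].in_reset = True

    def release_from_reset(self, name):
        core = self.cores[name]
        core.pc = RESET_VECTOR  # restart from the fixed start address
        core.in_reset = False


cores = {"A": Core(pc=0x1000), "B": Core(pc=0x2000)}
gcr = GlobalControlRegister(cores)
gcr.hold_in_reset("A")
stopped = cores["A"].step()      # core A is held in reset and does not run
gcr.release_from_reset("A")
restarted = cores["A"].step()    # core A runs again, starting at RESET_VECTOR
print(stopped, restarted, hex(cores["A"].pc))
```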
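Steps 106 and 107 can be sketched as follows. The shortest-queue policy and the names redistribute and recycle are assumptions; the specification only states that tasks are transferred according to the scheduling algorithm and that resources are returned to the system.

```python
NORMAL, EXCEPTION = "normal", "exception"


def redistribute(task_queues, states, faulty):
    """Step 106: move the faulty core's tasks to the normal core with the shortest queue."""
    targets = [c for c, s in enumerate(states) if s == NORMAL and c != faulty]
    if not targets:
        raise RuntimeError("no core in a normal state")
    while task_queues[faulty]:
        task = task_queues[faulty].pop(0)
        target = min(targets, key=lambda c: len(task_queues[c]))
        task_queues[target].append(task)


def recycle(resources, free_pool, faulty):
    """Step 107: return everything the faulty core held to the system pool."""
    free_pool.extend(resources.pop(faulty, []))


states = [EXCEPTION, NORMAL, NORMAL]                 # core 0 plays the role of core A
task_queues = {0: ["a", "b", "c"], 1: ["x"], 2: []}
resources = {0: ["stack@0", "irq 5"], 1: ["stack@1"]}
free_pool = []

redistribute(task_queues, states, faulty=0)
recycle(resources, free_pool, faulty=0)
print(task_queues, free_pool)
```

After both calls the faulty core owns neither tasks nor resources, so it can restart from a clean state.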
In Step 201, after the core A is released from the reset state, it starts to read and run instructions from the CPU fixed start address, and then restarts.
In Step 202, core A re-executes its initialization. Because core A uses new resources, it can be expected to restart successfully. After start-up is completed, core A changes its state in the storage cell in the shared memory to "waiting for recovery", which indicates that core A has started up.
In Step 203, core B detects that the state of core A has changed to "waiting for recovery", which indicates that core A has started up. Core B then changes the state of core A in the storage cell in the shared memory to "normal". Next, core B notifies the system scheduling module that it can assign tasks to core A again.
Exception recovery is completed.
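The handshake in Steps 108 to 203 can be sketched with two threads standing in for the two cores. This is a rough model under stated assumptions: the lock-protected dictionary replaces the storage cell in shared memory, the short sleep replaces re-initialization, and the polling timeout is an addition that the specification does not describe.

```python
import threading
import time

NORMAL, EXCEPTION, WAITING = "normal", "exception", "waiting for recovery"

core_state = {"A": EXCEPTION, "B": NORMAL}  # storage cell in shared memory
lock = threading.Lock()
scheduler_notices = []                      # messages sent to the scheduling module


def core_a_restart():
    """Steps 201-202: core A boots, then reports "waiting for recovery"."""
    time.sleep(0.01)  # stands in for re-initialization
    with lock:
        core_state["A"] = WAITING


def core_b_recover(poll_interval=0.001, timeout=2.0):
    """Steps 108 and 203: release core A, poll its state, then mark it normal."""
    boot = threading.Thread(target=core_a_restart)
    boot.start()  # stands in for releasing core A from reset
    deadline = time.monotonic() + timeout  # assumed safeguard, not in the source
    recovered = False
    while time.monotonic() < deadline and not recovered:
        with lock:
            if core_state["A"] == WAITING:
                core_state["A"] = NORMAL
                scheduler_notices.append("core A available")
                recovered = True
        if not recovered:
            time.sleep(poll_interval)
    boot.join()
    return recovered


result = core_b_recover()
print(result, core_state)
```

Only core B moves the state from "waiting for recovery" to "normal", so the scheduling module never sees core A as available before core B has confirmed the restart.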

WE CLAIM
1. A method for recovering one core exception in multi-core system, including the shared memory and the system scheduling module, comprising the following steps:
a. Set a storage cell in the shared memory, store the state value of each core in it, and set the initial state value of all cores to "normal";
b. When an exception occurs on one core, this core runs an exception processing program automatically, sets its state value to "exception" and notifies another, selected core in a normal state; the core in an exceptional state then enters an infinite loop automatically;
c. The selected core in a normal state sets the core in an exceptional state to the reset state and notifies the system scheduling module. The module then dispatches the tasks originally assigned to the core in the exceptional state to any other core in a normal state. The selected core in a normal state recycles all resources occupied by the core in the exceptional state and finally releases the core in the exceptional state from reset;
d. After release from reset, the core in an exceptional state restarts and, once start-up is completed, sets its state value to "waiting for recovery";
e. After the selected core in a normal state detects that the state value of the core in an exceptional state is "waiting for recovery", it sets the state value of that core to "normal" and notifies the system scheduling module.
2. A method for recovering one core exception in multi-core system as in claim 1, wherein in said Step b an inter-core interrupt is used to send the notification between cores.
3. A method for recovering one core exception in multi-core system as in claim 1, wherein said system scheduling module monitors the state of each core according to the state value stored in the storage cell, and, if the module detects that a core is in an exceptional state, does not assign further tasks to that core.
4. A method for recovering one core exception in multi-core system as in claim 1 or 2 or 3, wherein, if the multi-core system is a symmetrical multi-core system, the core selected in Step b can be any core that is in a normal state.

5. A method for recovering one core exception in multi-core system as in claim 1 or 2 or 3, wherein, if the multi-core system is a master-slave multi-core system, the core selected in Step b is the core that is in the master state.


Documents:

http://ipindiaonline.gov.in/patentsearch/GrantedSearch/viewdoc.aspx?id=2uwd0njrzLxr3opmPJ4ekg==&loc=egcICQiyoj82NGgGrC5ChA==


Patent Number: 269764
Indian Patent Application Number: 4501/CHENP/2009
PG Journal Number: 45/2015
Publication Date: 06-Nov-2015
Grant Date: 05-Nov-2015
Date of Filing: 30-Jul-2009
Name of Patentee: MAIPU COMMUNICATION TECHNOLOGY CO., LTD.
Applicant Address: MAIPU MANSION, NO.16, JIU XING AVENUE, HIGH-TECH PARK, CHENGDU, SICHUAN 610041
Inventors (number, name, address):
1. YAN, XIAOQIANG; MAIPU MANSION, NO.16, JIU XING AVENUE, HIGH-TECH PARK, CHENGDU, SICHUAN PROVINCE 610041
2. LI, JIANGNING; MAIPU MANSION, NO.16, JIU XING AVENUE, HIGH-TECH PARK, CHENGDU, SICHUAN PROVINCE 610041
3. XU, FANG; MAIPU MANSION, NO.16, JIU XING AVENUE, HIGH-TECH PARK, CHENGDU, SICHUAN PROVINCE 610041
PCT International Classification Number: G06F11/28
PCT International Application Number: PCT/CN08/00224
PCT International Filing date: 2008-01-30
PCT Conventions (number, PCT application number, date of convention, priority country):
1. 200710048366.6; 2007-01-31; China