1.问题现象

​ 车机开启之后,一直重启,仪表屏幕正常。通过CAN工具可以和仪表交互。

2.问题定位

​ 因为我们车机是QNX和Android双系统,从现象来看,仪表正常,说明QNX是正常的,初步定位是Android系统重启。拼手速导出log/log中的日志开始分析。

 首先一想Android系统重启,必然是系统服务出了问题,一般应用重启不会导致系统重启。Android O中增加了一个新的功能,当核心系统组件发生循环崩溃的时候,就会触发“救援程序”。

搜先搜索:“AndroidRuntime”可以看到哪些进程出问题了,被系统杀掉

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
1 I/CarAudioManager( 1599): reconnect service
1 E/AndroidRuntime( 2632): FATAL EXCEPTION: main
1 E/AndroidRuntime( 2632): Process: com.android.car, PID: 2632
1 E/AndroidRuntime( 2632): java.lang.RuntimeException: Unable to create service com.android.car.CarService: java.lang.IllegalStateException: Vehicle HAL service is not available.
1 E/AndroidRuntime( 2632): at android.app.ActivityThread.handleCreateService(ActivityThread.java:3582)
1 E/AndroidRuntime( 2632): at android.app.ActivityThread.access$1300(ActivityThread.java:200)
1 E/AndroidRuntime( 2632): at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1672)
1 E/AndroidRuntime( 2632): at android.os.Handler.dispatchMessage(Handler.java:106)
1 E/AndroidRuntime( 2632): at android.os.Looper.loop(Looper.java:193)
1 E/AndroidRuntime( 2632): at android.app.ActivityThread.main(ActivityThread.java:6718)
1 E/AndroidRuntime( 2632): at java.lang.reflect.Method.invoke(Native Method)
1 E/AndroidRuntime( 2632): at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:493)
1 E/AndroidRuntime( 2632): at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:858)
1 E/AndroidRuntime( 2632): Caused by: java.lang.IllegalStateException: Vehicle HAL service is not available.
1 E/AndroidRuntime( 2632): at com.android.car.CarService.onCreate(CarService.java:82)
1 E/AndroidRuntime( 2632): at android.app.ActivityThread.handleCreateService(ActivityThread.java:3570)
1 E/AndroidRuntime( 2632): ... 8 more

从上面日志可以看出CarService一直再重启,但是服务一直重启为什么会导致系统重启呢?

为了方便观看,省略时间戳等非必要信息,搜索:”RescueParty”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
1 W/RescueParty( 765): Noticed 1 events for UID 0 in last 7 sec
1 W/RescueParty( 765): Noticed 1 events for UID 1000 in last 12 sec
1 W/RescueParty( 765): Noticed 2 events for UID 1000 in last 13 sec
1 W/RescueParty( 765): Noticed 3 events for UID 1000 in last 13 sec
1 W/RescueParty( 765): Noticed 4 events for UID 1000 in last 13 sec
1 W/RescueParty( 765): Noticed 5 events for UID 1000 in last 14 sec
1 W/RescueParty( 765): Attempting rescue level RESET_SETTINGS_UNTRUSTED_DEFAULTS
1 W/RescueParty( 765): Noticed 1 events for UID 1000 in last 14 sec
1 W/RescueParty( 765): Noticed 2 events for UID 1000 in last 14 sec
1 W/RescueParty( 765): Noticed 3 events for UID 1000 in last 14 sec
1 W/RescueParty( 765): Noticed 4 events for UID 1000 in last 14 sec
1 W/RescueParty( 765): Noticed 5 events for UID 1000 in last 14 sec
1 W/RescueParty( 765): Attempting rescue level RESET_SETTINGS_UNTRUSTED_CHANGES
1 W/RescueParty( 765): Noticed 1 events for UID 1000 in last 15 sec
1 W/RescueParty( 765): Noticed 2 events for UID 1000 in last 15 sec
1 W/RescueParty( 765): Noticed 3 events for UID 1000 in last 15 sec
1 W/RescueParty( 765): Noticed 4 events for UID 1000 in last 15 sec
1 W/RescueParty( 765): Noticed 5 events for UID 1000 in last 15 sec
1 W/RescueParty( 765): Attempting rescue level RESET_SETTINGS_TRUSTED_DEFAULTS
1 W/RescueParty( 765): Noticed 1 events for UID 1000 in last 15 sec
1 W/RescueParty( 765): Noticed 2 events for UID 1000 in last 16 sec
1 W/RescueParty( 765): Noticed 3 events for UID 1000 in last 16 sec
1 W/RescueParty( 765): Noticed 4 events for UID 1000 in last 16 sec
1 W/RescueParty( 765): Noticed 5 events for UID 1000 in last 16 sec
1 W/RescueParty( 765): Attempting rescue level FACTORY_RESET
1 I//system/bin/uncrypt( 2649): --reason=RescueParty

原因就在这里,Android从O版本之后退出了系统救援机制,当系统关键服务或者重启之后都会做一个标记,如果超过阈值了,就会恢复出厂设置,重启车机。

3.RescueParty机制

先把结论说出来,然后一一代码证实。

结论:救援机制把救援事件分成5类,每次应用或者systemserver在一定时间重启5次,救援级别就会加一,当达到4级救援,就恢复出厂设置。

1
2
3
4
5
6
// 救援级别
private static final int LEVEL_NONE = 0; // 默认救援为0级
private static final int LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS = 1;
private static final int LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES = 2;
private static final int LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS = 3;
private static final int LEVEL_FACTORY_RESET = 4;

上述的一定时间是:SystemServer在5分钟内重启五次触发一次救援。persistent app在30秒内重启5次触发一次救援。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
public BootThreshold() {
// We're interested in 5 events in any 300 second period; this
// window is super relaxed because booting can take a long time if
// forced to dexopt things.
// 5代表5次,300代表300s
super(android.os.Process.ROOT_UID, 5, 300 * DateUtils.SECOND_IN_MILLIS);
}

public AppThreshold(int uid) {
// We're interested in 5 events in any 30 second period; apps crash
// pretty quickly so we can keep a tight leash on them.
// 5代表5次,30代表30s
super(uid, 5, 30 * DateUtils.SECOND_IN_MILLIS);
}

好了,阈值我们设置好了,那我们来看看在哪里标记的呢。

1
2
3
4
5
6
7
8
9
// app crash
// frameworks/base/services/core/java/com/android/server/am/AppErrors.java中
// crashApplicationInner()方法,当应用crash了就判断是不是persistent应用,如果是就标记一下

// If a persistent app is stuck in a crash loop, the device isn't very
// usable, so we want to consider sending out a rescue party.
if (r != null && r.persistent) {
RescueParty.notePersistentAppCrash(mContext, r.uid);
}
1
2
3
4
5
6
7
8
9
// SystemServer.java
// 当SystemServer启动的时候,调用救援服务做一个标记
private void startBootstrapServices() {
.....
// Now that we have the bare essentials of the OS up and running, take
// note that we just booted, which might send out a rescue party if
// we're stuck in a runtime restart loop.
RescueParty.noteBoot(mSystemContext);
}

我们可以看到当应用crash和SystemServer重启都会调用到RescueParty的方法取做标记。

下面我们来看一下RescueParty的两个方法。

1
2
3
4
5
6
7
8
9
10
11
public static void noteBoot(Context context) {
if (isDisabled()) return;
// incrementAndTest()判断是否需要提升救援级别。如果在上诉的时间内启动达到5次就提升救援级别
if (sBoot.incrementAndTest()) {
sBoot.reset();
//提升救援级别
incrementRescueLevel(sBoot.uid);
// 执行救援级别,当达到4级就恢复出厂设置
executeRescueLevel(context);
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
public static void notePersistentAppCrash(Context context, int uid) {
if (isDisabled()) return;
Threshold t = sApps.get(uid);
if (t == null) {
t = new AppThreshold(uid);
sApps.put(uid, t);
}
// 无论app还是systemserver都会调到这里面去。
if (t.incrementAndTest()) {
t.reset();
incrementRescueLevel(t.uid);
executeRescueLevel(context);
}
}

我们可以从上面方法看到,无论app crash还是systemserver启动都会调到以下三个方法:

  • incrementAndTest():判断是否需要提升救援级别。如果在上诉的时间内启动达到5次就提升救援级别
  • incrementRescueLevel(t.uid):提升指定UID的救援级别
  • executeRescueLevel(context):执行救援服务

下面我们分别分析上诉三个方法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
public boolean incrementAndTest() {
final long now = SystemClock.elapsedRealtime();
final long window = now - getStart();
// 如果小于阈值时间,更新时间
if (window > triggerWindow) {
setCount(1);
setStart(now);
return false;
} else {
int count = getCount() + 1;
setCount(count); // 设置阈值次数
EventLogTags.writeRescueNote(uid, count, window);
Slog.w(TAG, "Noticed " + count + " events for UID " + uid + " in last "
+ (window / 1000) + " sec");
return (count >= triggerCount); // 如果大于阈值次数返回true,提升救援级别
}
}
1
2
3
4
5
6
7
8
9
10
11
12
private static void incrementRescueLevel(int triggerUid) {
// 获取当前级别并且级别加1
final int level = MathUtils.constrain(
SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE) + 1,
LEVEL_NONE, LEVEL_FACTORY_RESET);
// 设置到属性服务中
SystemProperties.set(PROP_RESCUE_LEVEL, Integer.toString(level));

EventLogTags.writeRescueLevel(level, triggerUid);
logCriticalInfo(Log.WARN, "Incremented rescue level to "
+ levelToString(level) + " triggered by UID " + triggerUid);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
private static void executeRescueLevel(Context context) {
// 从属性服务中获取救援级别
final int level = SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE);
if (level == LEVEL_NONE) return;

Slog.w(TAG, "Attempting rescue level " + levelToString(level));
try {
// 救援级别执行对应任务
executeRescueLevelInternal(context, level);
EventLogTags.writeRescueSuccess(level);
logCriticalInfo(Log.DEBUG,
"Finished rescue level " + levelToString(level));
} catch (Throwable t) {
final String msg = ExceptionUtils.getCompleteMessage(t);
EventLogTags.writeRescueFailure(level, msg);
logCriticalInfo(Log.ERROR,
"Failed rescue level " + levelToString(level) + ": " + msg);
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
private static void executeRescueLevelInternal(Context context, int level) throws Exception {
switch (level) {
case LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS:
// 单纯更新状态
resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_DEFAULTS);
break;
case LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES:
resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_CHANGES);
break;
case LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS:
resetAllSettings(context, Settings.RESET_MODE_TRUSTED_DEFAULTS);
break;
case LEVEL_FACTORY_RESET:
// 达到恢复出厂级别,才真正有事情干
RecoverySystem.rebootPromptAndWipeUserData(context, TAG);
break;
}
}

从executeRescueLevelInternal方法中可以看到,真正干事的是达到LEVEL_FACTORY_RESET级别,才有事情干。因此直接分析rebootPromptAndWipeUserData()方法。

1
2
3
4
5
6
7
8
9
10
// 构建重启指令,调用bootCommand()重启车机
public static void rebootPromptAndWipeUserData(Context context, String reason) throws IOException {
String reasonArg = null;
if (!TextUtils.isEmpty(reason)) {
reasonArg = "--reason=" + sanitizeArg(reason);
}

String localeArg = "--locale=" + Locale.getDefault().toString();
bootCommand(context, null, "--prompt_and_wipe_data", reasonArg, localeArg);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
private static void bootCommand(Context context, String... args) throws IOException {
LOG_FILE.delete();
StringBuilder command = new StringBuilder();
String[] var3 = args;
int var4 = args.length;

for(int var5 = 0; var5 < var4; ++var5) {
String arg = var3[var5];
if (!TextUtils.isEmpty(arg)) {
command.append(arg);
command.append("\n");
}
}
// 重启车机
RecoverySystem rs = (RecoverySystem)context.getSystemService("recovery");
rs.rebootRecoveryWithCommand(command.toString());
throw new IOException("Reboot failed (no permissions?)");
}

4.问题回溯

可以从第二节看到,CarService发生crash,触发了FACTORY_RESET救援,所以车机恢复出厂设置,然后重启了。