
An improved SLAM algorithm for substation inspection robot based on the fusion of IMU and visual information

Abstract

In the past, manual inspection was often used for equipment inspection in indoor environments such as substation rooms and chemical plant rooms. This approach involves high labor intensity, low inspection efficiency, and poor safety, making it difficult to meet the increasingly stringent requirements of indoor equipment operation and maintenance management. To address these issues, this paper proposes a VIORB-SLAM2 algorithm based on the fusion of IMU and visual information. Firstly, IMU data and image data are fused to recover the scale of camera motion, and an error function is established to enhance the algorithm's robustness. Secondly, to improve accuracy, the random sample consensus method is used to eliminate wrong matches during feature point matching, and normalized cross-correlation matching is employed to constrain keyframe matching. Finally, point clouds are stitched with the iterative closest point method to construct a dense map for navigation. The experimental results show that the proposed algorithm overcomes the shortcomings of applying ORB-SLAM2 to indoor inspection robots while achieving high positioning accuracy, and it can be combined with other artificial intelligence algorithms for object detection and semantic map construction in the future.

Introduction

With the development of intelligent devices, equipment maintenance is facing increasingly greater challenges (Yuan et al. 2022). Since environments such as substations, nuclear power plants, and chemical plants contain a large amount of indoor equipment, workers need to check regularly whether the equipment is working properly to ensure production safety (Huang et al. 2024; Liang and Luo 2023). However, traditional inspection methods are usually labor-intensive, inefficient, and unsafe. There is therefore an urgent need for robots that can inspect automatically, replacing manual inspection and completing the corresponding work efficiently and safely. The key to using inspection robots for inspection tasks is autonomous navigation. To navigate autonomously, a robot must first answer two questions: “where am I” and “what kind of environment is around me” (Liu et al. 2023). These two questions correspond to simultaneous localization and mapping, namely SLAM technology (Ebadi et al. 2023).

Within SLAM, visual SLAM started early and has developed rapidly. VSLAM (visual simultaneous localization and mapping) technology mainly uses visual sensors to solve the problems of localizing a robot in an unknown environment and constructing a map of its surroundings.

In 2007, Klein proposed the PTAM (Parallel Tracking and Mapping) algorithm (Klein 2008), which opened the way for research on visual SLAM and immediately attracted many researchers. The framework of front-end tracking and back-end optimization proposed by Klein laid the foundation for subsequent visual SLAM research. Lv (2007) then proposed an image feature extraction method using Harris corner points and Scale Invariant Feature Transform (SIFT) feature points based on the characteristics of different environmental images, combining local image features for image matching. In 2009, Wu (2009) proposed a method for calculating the weights of different texture regions by analyzing the response values of ORB features, which improved the accuracy of image matching. In 2010, Wang (2010) proposed a sampling-based improved SIFT algorithm for image features, which greatly improved the accuracy and efficiency of image matching. In 2014, Endres proposed RGBD SLAM-V2 (Endres 2014), which uses a Kinect depth camera as the sensor, matches all points in the image using direct methods, and can build dense maps; its back end also uses a loop detection framework for error correction. In the same year, Forster et al. proposed the SVO (Semi-direct Visual Odometry) algorithm (Forster et al. 2014), which mixes the direct and feature point methods and is therefore a semi-direct visual odometry. It has strong real-time performance, but it abandons optimization-based estimation and has no mapping function. In 2015, Mur-Artal proposed the ORB-SLAM scheme (Mur-Artal 2017), which innovatively uses three parallel threads to perform tracking, local optimization, and global optimization, and adds map initialization and loop closure detection (Williams et al. 2009). For image matching, it abandons the traditional SIFT and SURF features in favor of the ORB feature, which has higher viewpoint invariance (Rublee et al. 2012). However, the map built by ORB-SLAM is sparse, so the surrounding environment cannot be determined from the map. Moreover, because the algorithm estimates motion from a single camera, the true scale of camera motion cannot be recovered, and because it relies on feature point matching, tracking may fail in environments with few feature points or in dim lighting where image features cannot be extracted. In 2017, the author of ORB-SLAM improved the original algorithm and proposed ORB-SLAM2 (Raúl 2017), which adds interfaces for stereo and depth cameras, but it still does not solve the tracking failure and sparse mapping problems of ORB-SLAM, so the system has so far been applied mainly at the laboratory research stage. Nevertheless, despite these drawbacks, ORB-SLAM2 has a well-developed and extensible framework, which motivates continued development to make it applicable in engineering practice.

This paper proposes a VSLAM algorithm tailored to the navigation requirements of indoor inspection robots, which is used for simultaneous localization and map construction. The contributions of this paper are as follows:

To meet the navigation requirements of indoor inspection robots, a VIORB-SLAM2 algorithm based on the fusion of vision and IMU is proposed. The proposed algorithm fuses inertial sensor (IMU) data with image data to recover the scale of camera motion, and jointly constructs an error function from IMU and image data to optimize the algorithm and improve its robustness.

The rest of the paper is organized as follows. Sect. "Improvement for SLAM algorithm" describes the methodology proposed in this paper in detail. Sect. "Results and discussion" provides experimental validation of the methodology proposed in this paper. Finally, Sect. "Conclusions" presents the conclusions of this paper and future work.

Improvement for SLAM algorithm

The overall architecture of the algorithm in this paper is shown in Fig. 1.

Fig. 1 The overall architecture of the algorithm

Image and IMU data fusion

In ORB-SLAM2, camera localization is achieved mainly through feature matching between images. In the indoor inspection robot, the VIORB-SLAM2 algorithm takes image data, IMU acceleration, and IMU angular measurements as inputs and fuses them. An error function jointly constructed from the image and IMU data is then used for pose estimation. This makes the algorithm less prone to tracking loss in regions with few features and allows the true scale of camera motion to be obtained.

Besides, the final positioning result in VSLAM is usually expressed in the world coordinate system, while the data from the IMU and the camera exist in different coordinate systems. The IMU coordinate system is generally denoted b, the camera coordinate system c, and the world coordinate system w. Because the IMU and camera data are represented in different coordinate systems, they must be converted to the world coordinate system during the VSLAM process; the conversion relationship is shown in Fig. 2. The coordinates obtained in SLAM are finally displayed in the world coordinate system. Before data processing, a three-dimensional spatial point \(\rho_l\) observed by the camera is first projected to a two-dimensional point \(Z_l\) in the camera coordinate system, then transformed to the IMU coordinate system b, and finally to the world coordinate system w.
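To make this chain of transformations concrete, the following minimal Python sketch (the transform values and the sample point are illustrative placeholders, not calibration results from this paper) maps a point from the camera frame c through the IMU frame b into the world frame w using homogeneous transforms.

```python
import numpy as np

def homogeneous(R, t):
    """Build a 4x4 homogeneous transform from a rotation matrix and a translation vector."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Illustrative placeholder transforms (identity rotations, arbitrary offsets).
T_bc = homogeneous(np.eye(3), np.array([0.05, 0.0, 0.0]))  # camera frame c -> IMU frame b
T_wb = homogeneous(np.eye(3), np.array([1.0, 2.0, 0.0]))   # IMU frame b -> world frame w

p_c = np.array([0.3, -0.1, 2.0, 1.0])  # 3D point in the camera frame (homogeneous coordinates)
p_w = T_wb @ T_bc @ p_c                # the same point expressed in the world frame
print(p_w[:3])
```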

Fig. 2 Relationship between various coordinate systems

IMU data preprocessing

The data acquisition frequencies of the IMU and the camera are different: the measurement frequency of an IMU is generally above 100 Hz, while the frame rate of a camera is generally much lower. Therefore, many IMU measurements are collected between two consecutive camera images, as shown in Fig. 3.

Fig. 3 IMU and image acquisition data association diagram

From Fig. 3, it can be seen that associating IMU measurement values directly with an image would cause data misalignment. Therefore, the inter-frame accumulation method is used to integrate all IMU data between image frames i and j to solve the data synchronization issue. The integration formula is shown in Eq. (1):

$$\begin{array}{*{20}c} {P_{{wb_{j} }} = P_{{wb_{i} }} + v_{i}^{w} \Delta t + \iint_{t \in [i,j]} {\left( {q_{{wb_{t} }} a^{{b_{t} }} - g^{w} } \right)\,}\delta t^{2} } \\ {V_{j}^{w} = V_{i}^{w} + \int_{t \in [i,j]} {\left( {q_{{wb_{t} }} a^{{b_{t} }} - g^{w} } \right)} \,\delta t} \\ {q_{{wb_{j} }} = \int_{t \in [i,j]} {q_{{wb_{t} }} } \otimes \left[ {\begin{array}{*{20}c} 0 \\ {\frac{1}{2}\omega^{{b_{t} }} } \\ \end{array} } \right]\,\delta t} \\ \end{array}$$
(1)

where \(P_{{wb_{j} }}\), \(V_{j}^{w}\), \(q_{{wb_{j} }}\) represent the position, velocity, and rotation at time j, \(P_{{wb_{i} }}\), \(V_{i}^{w}\), \(q_{{wb_{i} }}\) represent the position, velocity, and rotation at time i, \(g\) is the gravity vector, \(w\) denotes the world coordinate system, \(b_{t}\) denotes the IMU coordinate system at time t, \(wb\) denotes the conversion from the IMU coordinate system to the world coordinate system, \(\omega\) is the angular velocity, and \(a\) is the acceleration.

At this point, integration using the inter-frame accumulation method can align the IMU and image measurement data. However, since \(q_{{wb_{t} }}\) is continuously optimized and updated during the SLAM process, and position, velocity, and rotation are all obtained by integration, every update of the initial value would require the inter-frame integration to be recomputed step by step. This leads to a large computational workload and accumulating errors. Thus, it is necessary to reduce these errors by converting coordinate systems. The conversion formula is shown in Eq. (2):

$$q_{{wb_{t} }} = q_{{wb_{i} }} \otimes q_{{b_{i} b_{t} }}$$
(2)

where \(q_{{b_{i} b_{t} }}\) represents the relative rotation of the IMU coordinate system from time \(i\) to time \(t\), and \(q_{{wb_{i} }}\) represents the rotation from the IMU coordinate system to the world coordinate system at time \(i\). After substituting Eq. (2) into Eq. (1), the current position, velocity, and rotation of the IMU are obtained as shown in Eq. (3):

$$\begin{array}{*{20}c} {P_{{wb_{j} }} = P_{{wb_{i} }} + v_{i}^{w} \Delta t - \frac{1}{2}g^{w} \Delta t^{2} + q_{{wb_{i} }} \iint_{t \in [i,j]} {\left( {q_{{b_{i} b_{t} }} a^{{b_{t} }} } \right)}\,\delta t^{2} } \\ {V_{j}^{w} = V_{i}^{w} - g^{w} \Delta t + q_{{wb_{i} }} \int_{t \in [i,j]} {\left( {q_{{b_{i} b_{t} }} a^{{b_{t} }} } \right)} \,\delta t} \\ {q_{{wb_{j} }} = q_{{wb_{i} }} \otimes \int_{t \in [i,j]} {q_{{b_{i} b_{t} }} } \otimes \left[ {\begin{array}{*{20}c} 0 \\ {\frac{1}{2}\omega^{{b_{t} }} } \\ \end{array} } \right]\,\delta t} \\ \end{array}$$
(3)

At this point, when performing inter-frame accumulation, it is not necessary to integrate \(q_{{wb_{t} }}\), but only \(q_{{b_{i} b_{t} }}\). The reason is that \(q_{{b_{i} b_{t} }}\) is the relative rotation from time \(i\) to time \(t\), which depends only on the IMU measurements over that interval. Therefore, integrating the data in advance through the above formulas is called pre-integration, and it effectively prepares the IMU data for the subsequent data fusion.
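To illustrate the pre-integration idea, the sketch below (a simplified Euler-integration version that ignores the bias and noise terms, using SciPy's Rotation class for the quaternion algebra) accumulates the relative position, velocity, and rotation between frames i and j purely from IMU samples, independently of the current estimate of \(q_{{wb_{i} }}\). The accumulated terms can then be combined with \(q_{{wb_{i} }}\), \(v_{i}^{w}\), and gravity as in Eq. (3) without re-integrating when the state at time i is updated.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(accel, gyro, dt):
    """Accumulate relative motion between frames i and j from raw IMU samples.

    accel: (N, 3) accelerations in the body frame (bias and noise ignored here)
    gyro:  (N, 3) angular velocities in the body frame
    dt:    IMU sampling period in seconds
    Returns (delta_p, delta_v, delta_R) expressed in the body frame at time i.
    """
    delta_p = np.zeros(3)
    delta_v = np.zeros(3)
    delta_R = Rotation.identity()   # corresponds to q_{b_i b_t}, starts as identity
    for a, w in zip(accel, gyro):
        a_i = delta_R.apply(a)      # rotate the acceleration into frame b_i
        delta_p += delta_v * dt + 0.5 * a_i * dt**2
        delta_v += a_i * dt
        delta_R = delta_R * Rotation.from_rotvec(w * dt)
    return delta_p, delta_v, delta_R
```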

Data fusion

The fusion of image and IMU data involves the camera initialization stage and the camera tracking stage. In the tracking stage, the pose is jointly optimized by fusing IMU and image data. Firstly, an overall state vector to be optimized is constructed, including the current rotation, velocity, position, accelerometer bias, and gyroscope bias. An error equation is then constructed from the visual error and the IMU error, and the state is estimated by minimizing this error equation, as shown in Eq. (4), where i denotes the previous state and j the current state:

$$\begin{array}{*{20}c} {\theta = \left\{ {R_{wb}^{j} ,P_{wb}^{j} ,v_{wb}^{j} ,b_{g}^{j} ,b_{a}^{j} } \right\}} \\ {\theta^{ * } = \mathop {\arg \min }\limits_{\theta } \left( {\sum\limits_{k} {E_{proj} (k,j) + E_{IMU} (i,j)} } \right)} \\ \end{array}$$
(4)

The camera reprojection error is shown in Eq. (5):

$$\begin{array}{*{20}c} {E_{proj} (k,j) = \rho \left( {\left( {x^{k} - \pi \left( {X_{c}^{k} } \right)} \right)^{T} \sum\limits_{k} {\left( {x^{k} - \pi \left( {X_{c}^{k} } \right)} \right)} } \right)} \\ {X_{c}^{k} = R_{cb} R_{bw}^{j} \left( {X_{w}^{k} - P_{wb}^{j} } \right) + P_{cb} } \\ \end{array}$$
(5)

where \(x^{k}\) represents a feature point of a keyframe in the image, \(X_{w}^{k}\) represents the corresponding map point in the world coordinate system, and \(\sum\nolimits_{k} {}\) is the information matrix associated with \(x^{k}\). The IMU error is shown in Eq. (6):

$$\begin{array}{*{20}c} {E_{IMU} (i,j) = \rho \left( {\left[ {e_{R}^{T} e_{v}^{T} e_{p}^{T} } \right]\sum\nolimits_{I} {\left[ {e_{R}^{T} e_{v}^{T} e_{p}^{T} } \right]^{T} } } \right) + \rho \left( {e_{b}^{T} \sum\nolimits_{R} {e_{b} } } \right)} \\ {e_{R} = \log \left( {\left( {\Delta R_{ij} \exp \left( {J_{\Delta R}^{g} b_{g}^{j} } \right)} \right)^{T} R_{bw}^{i} R_{wb}^{j} } \right)} \\ {e_{v} = R_{bw}^{i} \left( {v_{wb}^{j} - v_{wb}^{i} - g_{w} \Delta t_{ij} } \right) - \left( {\Delta v_{ij} + J_{\Delta v}^{g} b_{g}^{j} + J_{\Delta v}^{a} b_{a}^{j} } \right)} \\ {e_{p} = R_{bw}^{i} \left( {P_{wb}^{j} - P_{wb}^{i} - v_{wb}^{i} \Delta t_{ij} - \frac{1}{2}g_{w} \Delta t_{ij}^{2} } \right) - \left( {\Delta P_{ij} + J_{\Delta p}^{g} b_{g}^{j} + J_{\Delta p}^{a} b_{a}^{j} } \right)} \\ {e_{b} = b^{j} - b^{i} } \\ \end{array}$$
(6)

where \(\rho\) is the robust kernel cost function, \(\sum\nolimits_{R} {}\) is the bias information matrix, and \(\sum\nolimits_{I} {}\) is the pre-integration information matrix.
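For intuition, the following sketch (function and variable names are illustrative, and the Jacobian bias-correction terms of Eq. (6) are omitted) evaluates a robustified joint cost of the form of Eq. (4), combining reprojection residuals and an IMU residual through Huber kernels and information matrices.

```python
import numpy as np

def huber(e, delta=1.0):
    """Robust kernel rho() applied to a squared error term e."""
    return e if e <= delta**2 else 2.0 * delta * np.sqrt(e) - delta**2

def joint_cost(reproj_residuals, info_proj, imu_residual, info_imu):
    """Evaluate E = sum_k rho(r_k^T S_k r_k) + rho(r_imu^T S_I r_imu), cf. Eq. (4)."""
    cost = 0.0
    for r, S in zip(reproj_residuals, info_proj):
        cost += huber(float(r @ S @ r))
    cost += huber(float(imu_residual @ info_imu @ imu_residual))
    return cost

# Toy usage with two 2D reprojection residuals and a 9D IMU residual.
r_proj = [np.array([0.5, -0.2]), np.array([1.2, 0.3])]
S_proj = [np.eye(2), np.eye(2)]
r_imu = 0.1 * np.ones(9)
print(joint_cost(r_proj, S_proj, r_imu, np.eye(9)))
```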

Improved ORB feature point matching

Because the errors in the IMU are inherent, the following two sections aim to reduce the system error by further modifying the proposed algorithm. Originally, ORB-SLAM2 uses BF (brute-force) matching for feature matching. BF matching compares the number of differing binary bits between the BRIEF descriptors of two feature points, i.e., their XOR value, also known as the Hamming distance. In ORB-SLAM2, the Hamming distance between two points is compared directly: if it is less than a certain threshold, the two points are considered the same point. Although BF matching is fast, it can produce mismatches.

The RANSAC (random sample consensus) algorithm (Fischler and Bolles 1981) was first proposed by Fischler and Bolles in 1981. After continuous refinement, it is now widely used in the field of computer vision.

Using the RANSAC algorithm to remove erroneous matches in ORB-SLAM2 reduces the occurrence of mismatches and thereby improves the positioning accuracy of the algorithm. The specific steps are as follows:

Step 1: Compute a 3 × 3 homography matrix H, as shown in Eq. (7), that is consistent with the largest number of matched feature points.

$$\left[ {\begin{array}{*{20}c} {x^{\prime}} \\ {y^{\prime}} \\ 1 \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {h_{11} } & {h_{12} } & {h_{13} } \\ {h_{21} } & {h_{22} } & {h_{23} } \\ {h_{31} } & {h_{32} } & {h_{33} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}l} x \hfill \\ y \hfill \\ 1 \hfill \\ \end{array} } \right]$$
(7)

where \((x,y)\) is a corner position in the target image and \((x^{\prime},y^{\prime})\) is the corresponding corner position in the matching image; \(h_{33}\) is usually set to 1. This leaves 8 unknown parameters, and each pair of matching points provides two equations, so at least 4 pairs of matching points are needed.

Step 2: Apply the model H from Step 1 to all matched feature points in the image and compute the projection error for each pair. If the error is less than the threshold, add the point pair to the inlier set Q; otherwise, discard it. The projection error is shown in Eq. (8):

$$\sum\limits_{i = 1}^{n} {\left( {x_{i}^{\prime } - \frac{{h_{11} x_{i} + h_{12} y_{i} + h_{13} }}{{h_{31} x_{i} + h_{32} y_{i} + h_{33} }}} \right)^{2} } + \left( {y_{i}^{\prime } - \frac{{h_{21} x_{i} + h_{22} y_{i} + h_{23} }}{{h_{31} x_{i} + h_{32} y_{i} + h_{33} }}} \right)^{2}$$
(8)

Step 3: If the inlier set is too small, select new matching points and recompute the model H until the number of points in the inlier set Q exceeds the set threshold.

Step 4: When the number of points in the inlier set Q exceeds the threshold, record the number of inliers and repeat the above steps.

Step 5: Terminate the iteration when the number of iterations reaches the preset value.

The number of iterations K is given by Eq. (9):

$$K = \frac{{\log (1 - P)}}{{\log \left( {1 - w^{m} } \right)}}$$
(9)

where P is the probability that the RANSAC algorithm obtains a correct model; following (Fischler and Bolles 1981), P is set to 0.995. w is the proportion of inliers among all feature points, and m is the number of point pairs required to compute the model. The random sample consensus algorithm can identify mismatched points left by brute-force matching, thereby improving the accuracy of image matching.
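A minimal OpenCV-based sketch of this BF + RANSAC scheme is given below; the ORB parameters and the reprojection threshold are illustrative choices rather than the exact values used in VIORB-SLAM2.

```python
import cv2
import numpy as np

def match_with_ransac(img1, img2, reproj_thresh=3.0):
    """Brute-force Hamming matching of ORB features followed by RANSAC filtering."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # BF matching on binary descriptors (Hamming distance), with cross-checking.
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des1, des2)

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC estimates the homography H of Eq. (7) and flags the inliers.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, reproj_thresh)
    inliers = [m for m, keep in zip(matches, inlier_mask.ravel()) if keep]
    return H, inliers
```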

The improved ORB feature matching method was experimentally validated using the TUM dataset. The experimental results are as follows (see Figs. 4 and 5):

Fig. 4 BF matching experiment results

Fig. 5 BF + RANSAC matching experiment results

It can be clearly seen from Fig. 4 that BF matching produces a large number of mismatched points. After the RANSAC algorithm is added to BF matching (Fig. 5), the number of mismatched points is greatly reduced.

From Table 1, it can be seen that BF matching yields many matched pairs but also a large number of mismatches. After RANSAC removes these mismatches, the accuracy of BF + RANSAC matching is significantly higher than that of the BF matching used in ORB-SLAM2.

Table 1 Results of different matching methods

Improved key frame matching

Keyframe matching in ORB-SLAM2 is achieved through ORB feature points, and its accuracy directly affects the positioning accuracy during camera tracking. This section constrains keyframe matching by introducing zero-mean NCC (normalized cross-correlation) matching (Li et al. 2023; Yang et al. 2024; Wu et al. 2024). The idea of applying zero-mean NCC matching in the indoor inspection robot VIORB-SLAM2 is to match image patches from different keyframes, thereby obtaining more accurate matching results and improving the positioning accuracy of the algorithm.

When feature points are matched between keyframes, spatial points P at different depths are projected onto the same two-dimensional position, which introduces depth uncertainty. Back-projecting a feature point onto another keyframe generates an epipolar line, and the points on this epipolar line correspond to spatial points at different depths, as shown in Fig. 6.

Fig. 6 Spatial projection diagram

Because of the camera parameters involved in projection and back-projection, the points sampled along the epipolar line generally fall at sub-pixel positions rather than exactly on pixel centers. These points are referred to as sub-pixels, and their true pixel values must be computed before zero-mean NCC matching. The computation uses bilinear interpolation, a two-dimensional extension of linear interpolation (Gao et al. 2023). Its core idea is to interpolate along the x and y axes: the sub-pixel coordinate is first rounded down, the four nearest surrounding pixels are selected, and the interpolated intensity is computed from them. The keyframes are then matched with the NCC algorithm. The closer the correlation score between two points is to 1, the stronger the correlation, i.e., a score approaching 1 indicates that the two points are a match. The algorithm in this paper sets the threshold to 0.8: if the score is greater than 0.8, the two points are considered the same point to be matched.
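The sketch below illustrates the two ingredients described above under simplifying assumptions (grayscale floating-point images and patches lying fully inside the image): bilinear interpolation of a sub-pixel intensity and the zero-mean NCC score that is compared against the 0.8 threshold.

```python
import numpy as np

def bilinear(img, x, y):
    """Interpolate the intensity at sub-pixel (x, y) from the four surrounding pixels."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1]
            + (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def zero_mean_ncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation between two equally sized patches."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def is_match(patch_a, patch_b, threshold=0.8):
    """Accept two epipolar-line candidates as a match when the score exceeds 0.8."""
    return zero_mean_ncc(patch_a, patch_b) > threshold
```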

The VIORB-SLAM2 algorithm incorporates NCC matching into ORB-SLAM2 keyframe matching to improve the keyframe matching method. By adding this matching constraint, both the keyframe matching results and the camera positioning results become more accurate.

Improved ORB-SLAM2 mapping method

ORB-SLAM2 mapping extracts feature points from keyframe images during the tracking phase and draws a map based on these feature points. Since feature points cover only a small part of the image, only a sparse map composed of feature points can be obtained, which cannot meet the subsequent navigation and path planning needs of indoor inspection robots. Therefore, to establish a navigation map for indoor inspection robots, a dense point cloud map needs to be constructed.

According to the working principle of depth cameras, depth information can be measured with the Light Coding structured-light method or the time-of-flight method. The measurement yields a depth map for each image, which can then be converted into a point cloud map through the PCL open-source library. PCL is a modular C++ library that supports point cloud acquisition, registration, visualization, recognition, and more. The color image is shown in Fig. 7, the depth map in Fig. 8, and the point cloud map obtained through the PCL library in Fig. 9.

Fig. 7 Color map

Fig. 8 Depth map

Fig. 9 Point cloud map

Afterwards, the rotation and translation of each image frame are obtained through camera tracking, and the points in the camera coordinate system are converted to the world coordinate system. The specific steps are as follows:

Step 1: Calculate the rotation and translation matrix of each frame of the image based on the ORB-SLAM2 algorithm for camera tracking;

Step 2: Establish a global coordinate system with the position of the first frame of the image as the initial position;

Step 3: For each pixel \((u,v)\) with depth d, calculate its position in the camera coordinate system, as shown in Eq. (10):

$$Z_{c} \left[ {\begin{array}{*{20}c} u \\ v \\ 1 \\ \end{array} } \right] = KP_{c}$$
(10)

Step 4: According to Eq. (10), calculate \(P_{c} = [X_{c}, Y_{c}, Z_{c}]^{T}\) as shown in Eq. (11):

$$Z_{c} = d,\,X_{c} = \frac{{u - c_{x} }}{{f_{x} }}Z_{c} ,\,Y_{c} = \frac{{v - c_{y} }}{{f_{y} }}Z_{c}$$
(11)

Step 5: Calculate the world coordinates of each point based on the rotation matrix \(R\) and translation vector \(t\), using the relation shown in Eq. (12):

$$\left[ {\begin{array}{*{20}c} {X_{c} } \\ {Y_{c} } \\ {Z_{c} } \\ \end{array} } \right] = R\left[ {\begin{array}{*{20}c} {X_{w} } \\ {Y_{w} } \\ {Z_{w} } \\ \end{array} } \right] + t$$
(12)
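A compact sketch of Steps 3–5 is shown below; the intrinsics are placeholders, and the relation of Eq. (12) is inverted to express the points in the world frame.

```python
import numpy as np

def depth_to_world(depth, K, R, t):
    """Back-project a depth image into a world-frame point cloud (Eqs. 10-12).

    depth: (H, W) depth values in metres; K: 3x3 intrinsics; R, t: pose as in Eq. (12).
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    v, u = np.indices(depth.shape)        # pixel coordinates (row = v, column = u)
    Z = depth
    X = (u - cx) / fx * Z                 # Eq. (11)
    Y = (v - cy) / fy * Z
    P_c = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    P_c = P_c[P_c[:, 2] > 0]              # drop pixels without a valid depth
    # Eq. (12) states P_c = R @ P_w + t, so invert it to obtain world coordinates.
    P_w = (P_c - t) @ R                   # row-wise equivalent of R^T (P_c - t)
    return P_w
```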

Since the final map coordinates are expressed in the world coordinate system, the above steps convert the points from the camera coordinate system to the world coordinate system. Next, the point cloud of each frame needs to be concatenated to form a complete global map. The point cloud concatenation is completed with the iterative closest point (ICP) method (Endres 2014). The ICP method constructs an error function and, under certain constraints, computes the rotation and translation between two point clouds that minimizes this error function. By computing the distances between points in the two point clouds and iteratively minimizing them, the two point clouds are stitched together. The specific process is as follows:

Step 1: Construct an error function as shown in Eq. (13);

$$E(R,t) = \frac{1}{2}\sum\limits_{i = 1}^{n} {\left\| {q_{i} - (Rp_{i} + t)} \right\|^{2} }$$
(13)

Step 2: Obtain point clouds of two consecutive frames of images through a depth camera;

Step 3: For each point \(p_{i}\) in one point cloud, select the closest point \(q_{i}\) in the other;

Step 4: Calculate the rotation matrix R and translation vector t that minimize the error function over these point pairs;

Step 5: Calculate the mean distance d between the paired points:

$$d = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left\| {p_{i} - q_{i} } \right\|^{2} }$$

Step 6: Determine whether d has reached the given threshold. If not, return to Step 3 and continue iterating. If it has, apply the resulting rotation and translation to align the point clouds of the two frames.
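A minimal point-to-point ICP sketch following these steps is given below; it uses brute-force nearest-neighbour search and a closed-form SVD pose update, purely for illustration on small point clouds.

```python
import numpy as np

def icp(source, target, max_iter=30, tol=1e-5):
    """Align `source` (N, 3) to `target` (M, 3) with point-to-point ICP (Eq. 13)."""
    R, t = np.eye(3), np.zeros(3)
    src = source.copy()
    prev_err = np.inf
    for _ in range(max_iter):
        # Step 3: closest target point for every source point (brute force).
        d = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        q = target[d.argmin(axis=1)]
        # Step 4: best rotation and translation in closed form via SVD.
        mu_p, mu_q = src.mean(axis=0), q.mean(axis=0)
        H = (src - mu_p).T @ (q - mu_q)
        U, _, Vt = np.linalg.svd(H)
        R_step = Vt.T @ U.T
        if np.linalg.det(R_step) < 0:      # guard against reflections
            Vt[-1] *= -1
            R_step = Vt.T @ U.T
        t_step = mu_q - R_step @ mu_p
        src = src @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
        # Steps 5-6: mean distance between paired points as the stopping criterion.
        err = np.mean(np.linalg.norm(src - q, axis=1))
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R, t
```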

The stitching of two point clouds is completed through the above steps. The VIORB-SLAM2 algorithm establishes a global dense map by concatenating all point clouds, as shown in the following figures. Figure 10 shows the point cloud of a single frame, and Fig. 11 shows the global dense map obtained by concatenating all single-frame point clouds using the iterative closest point method.

Fig. 10 Point cloud of a single frame image

Fig. 11 Global dense map

Results and discussion

Experimental platform construction

The VIORB-SLAM2 algorithm experiment for the indoor inspection robot requires a depth camera, IMU, and a robot equipped with the above two sensors (Franchi et al. 2023).

The depth camera required for the algorithm in this paper adopts Intel's Realsense D435i sensor as shown in Fig. 12, and the parameters are shown in Table 2.

Fig. 12 Realsense D435i camera

Table 2 Depth camera parameters

The device equipped with Realsense D435i adopts the Pioneer 3DX robot (Wan Zakaria et al. 2024), as shown in Fig. 13.

Fig. 13 Pioneer 3DX robot

In summary, the experimental platform of this paper uses Intel Realsense D435i as the sensor and the Pioneer 3DX robot as the loading device. Intel Realsense D435i is installed on the Pioneer 3DX robot for subsequent experiments.

Calibration of experimental equipment

The experimental equipment in this paper mainly includes a camera and an IMU. The equipment calibration includes camera calibration, IMU calibration, and joint calibration of the camera and IMU. Camera calibration obtains the distortion parameters of the camera, IMU calibration obtains the intrinsic errors of the IMU, and joint camera–IMU calibration obtains the transformation matrix between the camera and the IMU, so that data in the IMU coordinate system can be converted to the camera coordinate system.

The joint calibration of the camera and IMU is carried out using the Kalibr tool (Wu et al. 2023; Wei et al. 2024). Its main steps are:

Step 1: Collect image and IMU data, and align the timestamps of the image and IMU;

Step 2: Compute the angular velocity curve of the IMU and the angular velocity curve of the camera, and obtain the transformation between the camera and the IMU from the two curves;

Step 3: Output calibration results, that is, the rotation translation matrix from IMU to camera coordinate system.

After joint calibration, the following results were obtained:

$$R = \left[ {\begin{array}{*{20}c} {0.9999} & {0.0026} & { - 0.0034} \\ { - 0.0026} & {0.9999} & {0.0020} \\ {0.0034} & { - 0.0020} & {0.9999} \\ \end{array} } \right]$$
$$t = \left[ {\begin{array}{*{20}c} { - 0.1172} & {0.0001} & { - 0.0002} \\ \end{array} } \right]$$

The above calibration results \(R\) and \(t\) form the rotation–translation transform from the IMU coordinate system to the camera coordinate system of the Realsense D435i.
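As an example of using this result, the short sketch below applies the calibrated \(R\) and \(t\) to express a point given in the IMU frame in the camera frame; the sample point itself is illustrative.

```python
import numpy as np

# Calibrated rotation and translation from the IMU frame to the camera frame (Kalibr output above).
R = np.array([[ 0.9999,  0.0026, -0.0034],
              [-0.0026,  0.9999,  0.0020],
              [ 0.0034, -0.0020,  0.9999]])
t = np.array([-0.1172, 0.0001, -0.0002])

p_imu = np.array([0.2, 0.0, 1.0])  # illustrative point expressed in the IMU frame
p_cam = R @ p_imu + t              # the same point expressed in the camera frame
print(p_cam)
```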

Positioning experiment

The positioning experiment requires color image data and IMU data, so the Euroc dataset is used. The Euroc dataset is a publicly available dataset from the Swiss Federal Institute of Technology Zurich, collected with a monocular camera and an IMU, and it includes two types of scenes: a laboratory hall and a regular room at the institute. This experiment ran all 11 Euroc sequences with the ORB-SLAM2 algorithm and the indoor inspection robot VIORB-SLAM2 algorithm. On the V2_03_difficult sequence, ORB-SLAM2 lost camera tracking due to a lack of feature points. The positioning accuracy is evaluated with the open-source tool evo, a Python tool for evaluating SLAM systems; the camera's positioning error is obtained by aligning the ground-truth trajectory with the trajectory produced by the SLAM algorithm.

The positioning accuracy is measured by the root mean square error (RMSE), and its expression is shown in Eq. (14):

$$RMSE = \sqrt {\frac{{\sum\limits_{i = 1}^{m} {(A_{ki(x,y,z)} - B_{gi(x,y,z)} )^{2} } }}{m}}$$
(14)

where \(A_{ki(x,y,z)}\) represents the position of keyframe i, \(B_{gi(x,y,z)}\) represents the corresponding ground-truth (reference) position, and \(m\) represents the number of keyframes. The smaller the RMSE, the higher the accuracy of the algorithm.
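A short sketch of this metric is shown below; it assumes the estimated keyframe trajectory has already been aligned to the ground truth (e.g., by evo).

```python
import numpy as np

def trajectory_rmse(estimated, ground_truth):
    """Root mean square error between aligned keyframe positions, cf. Eq. (14).

    estimated, ground_truth: (m, 3) arrays of (x, y, z) positions.
    """
    diff = estimated - ground_truth
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Toy usage: three keyframes with a small offset in one coordinate.
est = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.1, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(trajectory_rmse(est, gt))
```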

All 11 Euroc datasets were run with ORB-SLAM2, with ORB-SLAM2 fusing only IMU data, and with the indoor inspection robot VIORB-SLAM2 algorithm, and the camera positioning trajectories were output. The evo tool was used to align each algorithm's trajectory with the ground-truth trajectory. The root mean square errors (RMSE) of the final outputs are shown in Table 3:

Table 3 RMSE for different algorithm trajectories

As shown in Table 3, due to camera tracking failure during runtime, the trajectory of ORB-SLAM2 on the V1_03_difficult and V2_03_difficult datasets could not be obtained, so the corresponding root mean square error is unavailable. Although the version that only fuses IMU data solves the tracking loss problem, its positioning error is noticeably larger than that of the original ORB-SLAM2 algorithm because of the inherent error of the IMU. In contrast, the indoor inspection robot VIORB-SLAM2 algorithm, which fuses IMU data and improves the matching of feature points and keyframes, not only solves the tracking loss problem but also maintains the good accuracy of the original ORB-SLAM2 algorithm.

In summary, the indoor inspection robot VIORB-SLAM2 algorithm designed in this paper maintains good positioning accuracy while solving the tracking loss problem of ORB-SLAM2.

Mapping experiment

The mapping experiment requires color image data and depth image data, so the TUM dataset was used. The TUM dataset is a publicly available dataset from the Technical University of Munich, which includes RGB-D datasets and monocular datasets. The following are the results of the mapping experiment on the TUM dataset.

Figure 14 shows the mapping results of ORB-SLAM2, and Fig. 15 shows the mapping results of VIORB-SLAM2. Panels (a) and (b) of each figure show the results on the rgbd_dataset_freiburg2xyz dataset and the rgbd_dataset_freiburg1hroom dataset of the TUM benchmark, respectively. From the two figures, it can be seen that the map points established by ORB-SLAM2 in Fig. 14 are too sparse to distinguish the specific shapes of objects, while the map established by VIORB-SLAM2 in Fig. 15 is a dense map in which the shapes of objects can be clearly seen.

Fig. 14 Mapping results of ORB-SLAM2

Fig. 15 Mapping results of VIORB-SLAM2

Real-world tests were also carried out in an actual substation. The experimental scene is the internal environment of a substation, as shown in Fig. 16, and the robot's trajectory loops around the scene once. The experimental results of ORB-SLAM2 and VIORB-SLAM2 are shown in Figs. 17 and 18, respectively.

Fig. 16 Schematic diagram of experimental scene trajectory

Fig. 17 ORB-SLAM2 experimental results

Fig. 18 VIORB-SLAM2 experimental results

In Figs. 17 and 18, (a) shows the localization result and (b) shows the mapping result. In this scenario, ORB-SLAM2 did not experience tracking failure, and the difference in localization results between the two algorithms is not significant. In the mapping results, the map established by ORB-SLAM2 is sparse, so objects in the scene cannot be distinguished from the map, whereas the dense map established by VIORB-SLAM2 clearly shows the shapes of objects in the experimental scene.

Conclusions

In response to the navigation requirements of indoor inspection robots, this paper proposes the VIORB-SLAM2 algorithm based on the fusion of vision and IMU. Starting from the indoor simulation results of ORB-SLAM2, the shortcomings of applying that algorithm to indoor inspection robots were analyzed, and the algorithm was improved accordingly into the indoor inspection robot VIORB-SLAM2 algorithm. The proposed algorithm fuses IMU data with image data so that camera tracking is less prone to loss and the positioning results carry scale information. Because the inherent errors of the IMU affect system accuracy, the accuracy was improved by refining ORB feature point matching and keyframe matching. Afterwards, the depth maps collected by the depth camera were converted into point clouds, and the iterative closest point method was used to concatenate the single-frame point clouds into a global dense point cloud map. The experimental results show that the proposed algorithm retains good accuracy while solving the problems of camera tracking loss and scale uncertainty, and can establish the dense map required by indoor inspection robots.

The algorithm proposed in this paper does not yet provide navigation for indoor inspection robots, so future work will build on it to enable autonomous navigation. The computational complexity of the proposed algorithm and a comparative analysis against newer SLAM algorithms should also be examined. In addition, the dense map constructed by this algorithm can be combined with other artificial intelligence algorithms, for example to detect objects during the navigation of inspection robots and to construct semantic maps, making the robots more intelligent and enabling more complex functions.

Data availability

The data used to support the findings of this study are available from the corresponding author upon request.

References

  • Ebadi K, Bernreiter L, Biggie H et al (2023) Present and future of SLAM in extreme environments: the DARPA SubT challenge. IEEE Trans Robot. https://doi.org/10.1109/TRO.2023.3323938

  • Endres F (2014) 3D mapping with an RGB-D camera. IEEE Trans Robot. https://doi.org/10.1109/TRO.2013.2279412

  • Fischler MA, Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm ACM 34:381–395

  • Forster C, Pizzoli M, Scaramuzza D (2014) SVO: fast semi-direct monocular visual odometry. In: IEEE International Conference on Robotics and Automation, pp 15–22

  • Franchi GA, Bus JD, Boumans IJMM et al (2023) Estimating body weight in conventional growing pigs using a depth camera. Smart Agric Technol 3:100117

  • Gao C, Zhou RG, Li X (2023) Quantum color image scaling based on bilinear interpolation. Chin Phys B 32(5):050303

  • Huang B, Xie J, Yan J (2024) Inspection robot navigation based on improved TD3 algorithm. Sensors 24(8):2525

  • Klein G (2008) Parallel tracking and mapping for small AR workspaces. In: IEEE and ACM International Symposium on Mixed and Augmented Reality, pp 1–10

  • Li P, Cao G, Wang G (2023) A fast matching method for skip displacement recognition of hoisting system. Measurement 221:113421

  • Liang Q, Luo B (2023) Visual inspection intelligent robot technology for large infusion industry. Open Comput Sci 13(1):20220262

  • Liu L, Wang X, Yang X et al (2023) Path planning techniques for mobile robots: review and prospect. Expert Syst Appl 227:120254

  • Lv Q (2007) Research on monocular vision odometry based on feature point extraction. Zhejiang University, Hangzhou

  • Mur-Artal R (2017) ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans Robot 31(5):1147–1163

  • Raúl M (2017) An open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Trans Robot 23(5):1036–1049

  • Rublee E, Rabaud V, Konolige K et al (2012) ORB: an efficient alternative to SIFT or SURF. In: IEEE International Conference on Computer Vision, pp 2564–2571

  • Wan Zakaria WZE, Ramli A, Misro MY et al (2024) Motion planning of differential drive mobile robot using circular arc. Eng Lett 32(1):160–167

  • Wang YQ (2010) Research on simultaneous localization and mapping of mobile robot with omnidirectional vision. Harbin Engineering University, Harbin

  • Wei P, Fu K, Villacres J et al (2024) A compact handheld sensor package with sensor fusion for comprehensive and robust 3D map. Sensors 24(8):2494

  • Williams B, Cummins M, Newman P et al (2009) A comparison of loop closing techniques in monocular SLAM. Robot Auton Syst 57(12):1188–1197

  • Wu D (2009) Positioning technology of LHD based on stereo visual odometry. Beijing University of Science and Technology, Beijing

  • Wu Q, Xu X, Chen X et al (2023) 360-VIO: a robust visual–inertial odometry using a 360° camera. IEEE Trans Ind Electron. https://doi.org/10.1109/TIE.2023.3337541

  • Wu J, Hu H, Song Y et al (2024) Ultrasonic phased array imaging of multi-layered structures using full-matrix migration and normalized cross-correlation matching technique. NDT E Int 145:103130

  • Yang AW, Peng BJ, Lu CX et al (2024) A novel method for multiple targets localization based on normalized cross-correlation adaptive variable step-size dynamic template matching. AIP Adv. https://doi.org/10.1063/5.0194376

  • Yuan C, Xiong B, Li X et al (2022) A novel intelligent inspection robot with deep stereo vision for three-dimensional concrete damage detection and quantification. Struct Health Monit 21(3):788–802


Acknowledgements

The authors acknowledge the above funds for supporting this research, and thank the editor and reviewers for their comments and suggestions.

Funding

This work was supported in part by the International Joint Research Center of Robots and Intelligence Program under Grant JQZN2022-001; in part by the School Project of Chengdu Technological University under Grant 2023ZR007, Grant 2023ZR006, Grant 2023ZR008, and Grant KCSZ202319; in part by the Chengdu Technical Innovation Research Program under Grant 2022-YF05-01134-SN; and in part by the Collaborative Education Program for Industry and Education Cooperation of the Higher Education Department of the Ministry of Education under Grant 231100882302555.

Author information

Contributions

Ping Wang: Conceptualization, Methodology, Writing - Original Draft. Chuanxue Li: Data Curation, Resources. Fangkai Cai: Visualization, Writing - Review & Editing. Li Zheng: Investigation, Validation.

Corresponding author

Correspondence to Li Zheng.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.



Cite this article

Wang, P., Li, C., Cai, F. et al. An improved SLAM algorithm for substation inspection robot based on the fusion of IMU and visual information. Energy Inform 7, 86 (2024). https://doi.org/10.1186/s42162-024-00390-8
