korena's blog

10. Code in C

Having restructured the project to follow the recommendation outlined in Samsung's iROM document, we need to move to C. Thanks to advanced compilers, the process of moving from assembly to C is extremely simple, provided you stick to calling conventions and take care of your execution paths.

Calling conventions

We won't spend much time on this, since it's a well-covered topic. Just make sure you read your platform's ABI; the one we're following is the Procedure Call Standard for the ARM® Architecture. So let's get right to it.

The following C function:

    uint32_t add(uint32_t a, uint32_t b) {
        return a + b;
    }

Is equivalent to the following assembly function:

    add:
        PUSH	{lr}        @ save the return address of the caller
        ADD     r0,r0,r1    @ r0 = r0+r1
        POP 	{pc}        @ return to the previously saved address

The first argument of the add function is placed in r0, the second in r1, and so on. The returned value is placed in r0. Simple enough. If we needed to pass more arguments, we'd go up to r3 and then use the stack for the rest. If we were returning a value larger than a uint32_t (the register size of our processor), the outcome depends on the type: a 64-bit integer comes back in r0 and r1 (the standard reserves r2 and r3 for certain wider fundamental types), while a larger structure is written to memory at an address supplied by the caller. This is only a brief note. A better understanding can be had by actually reading the document linked above, but for the purpose of this post, this is all you need to know.
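To make the register assignment concrete, here is a small sketch in C. The names sum5, widen_mul, make_pair and struct pair are made up for illustration. The comments describe where the ARM procedure call standard would place each value when the code is compiled for our target; the code itself is plain C that also builds on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Five word-sized arguments: a..d travel in r0-r3,
 * and e is passed on the stack by the caller. */
uint32_t sum5(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e)
{
    return a + b + c + d + e;
}

/* A 64-bit result does not fit in one register,
 * so it comes back in the r0/r1 pair. */
uint64_t widen_mul(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Hypothetical composite type, larger than one register. */
struct pair {
    uint32_t lo;
    uint32_t hi;
    uint32_t tag;
};

static_assert(sizeof(struct pair) > 4, "too big to come back in r0 alone");

/* A composite larger than 4 bytes is returned through memory: the caller
 * reserves the space and passes its address as a hidden first argument. */
struct pair make_pair(uint32_t lo, uint32_t hi)
{
    struct pair p = { lo, hi, 0xABu };
    return p;
}
```

Compiling a file like this with the cross compiler and reading the generated assembly is a quick way to check these placements against the standard.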

First switch to C

In the file startup.s, we have:

/*At this point, bl1 should load bl2 and calculate 
*the checksum for verification, then jump to execute it
*a lot needs to be done at this point, which means we 
* need to go for c ...
*/

@ switching to basic C ...
	ldr     r0,=calling_copy_function
	bl 	uart_print
	bl 	copy_bl2_to_sram
	ldr 	r0,=never_come_here_string	
	bl	uart_print
	b	. @loop forever

The function named copy_bl2_to_sram is our first jump from assembly to C. Pay no attention to the name of the function for now. Just know that it's a C function that takes no arguments, returns no value, and is not expected to ever return. The function can be found in the C source file src/init_bl2.c:

void copy_bl2_to_sram(void)
{
 /* some code (irrelevant here)*/
}

Since we've coded some UART printing routines in assembly, we'd like to use these routines in our C code. As explained above, the calling convention applies in both directions, for assembly-to-C and C-to-assembly calls. As an example, the following code shows how C code makes use of our previously written assembly functions. This code can also be found in the init_bl2.c source file:

void debug_print(char*str){
	uart_print(str);
}

The above code uses the uart_print function defined in the assembly source file uart_mod.s. Remember that we made this assembly routine global in the last post. It is not declared in a header included by init_bl2.c; instead, it is resolved as an external reference by the linker, because the objects built from uart_mod.s and init_bl2.c are linked into the same image.

Copying BL2 code to SRAM

As discussed in the last post, we are going to follow the recommendation of Samsung's iROM documentation. The recommended procedure is to have all initialization done in BL2, which means we have to move our clock and memory initialization code to BL2, load BL2 to SRAM, and then have it execute the initialization functions. To do this, we have to move code around: we move the clock_mod.s and memory_mod.s files we created under the asm/ directory in the last post to the directory bl2/asm/, and we also copy the uart_mod.s module to the same directory. The copy is needed because our Makefile builds BL2 independently, which means it is linked separately from BL1. We could have exposed the uart_mod.s file as a dependency and linked it as a shared library, but we did not, and we're not going to. We then modify our startup.s under asm/ to look like:

/* Standard definitions of mode bits and interrupt (I & F) flags in PSRs */
.equ Mode_USR,   0x10
.equ Mode_FIQ,   0x11
.equ Mode_IRQ,   0x12
.equ Mode_SVC,   0x13
.equ Mode_ABT,   0x17
.equ Mode_UND,   0x1B
.equ Mode_SYS,   0x1F
 
/*useful addresses, fetched from the S5PV210 user manual*/
.equ GPIO_BASE,  0xE0200000
.equ GPJ2CON_OFFSET,  0x280   
.equ GPJ2DAT_OFFSET,   0x284
.equ GPJ2PUD_OFFSET,   0x288
.equ GPJ2DRV_SR_OFFSET,  0x28C
.equ GPJ2CONPDN_OFFSET,  0x290
.equ GPJ2PUDPDN_OFFSET,  0x294
 
 
.equ I_Bit,      0x80 /* when I bit is set, IRQ is disabled*/
.equ F_Bit,      0x40 /* when F bit is set, FIQ is disabled*/
 
.text
.code 32
.global _start
@ the interrupt vectors setup here is useless, this is more suited for cortex-mX 
@ processors running from flash ... I'm just used to it, do not hate! 
_start:

        b Reset_Handler
        b Undefined_Handler
        b SWI_Handler
        b Prefetch_Handler
        b Data_Handler
        nop /* Reserved vector*/
        b IRQ_Handler
/*FIQ handler would go right here ...*/



/*after fast interrupt handler ...*/



.globl _bss_start
_bss_start:
 .word bss_start
 
.globl _bss_end
_bss_end:
 .word bss_end
 
.globl _data_start
_data_start:
 .word data_start
 
.globl _rodata
_rodata:
 .word rodata


Reset_Handler:
 
/* set the cpu to SVC32 mode and disable IRQ & FIQ */
         msr CPSR_c, #Mode_SVC|I_Bit|F_Bit ;
       /* Disable Caches */
        mrc p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
        bic r1, r1, #(0x1 << 12)  /* Disable I Cache*/
        bic r1, r1, #(0x1 << 2)   /* Disable D Cache*/
        mcr p15, 0, r1, c1, c0, 0 /* Write Control Register configuration data*/
 
        /* Disable L2 cache (too specific, not needed now, but useful later)*/
        mrc p15, 0, r0, c1, c0, 1  /* reading auxiliary control register*/
        bic r0, r0, #(1<<1)
        mcr p15, 0, r0, c1, c0, 1  /* writing auxiliary control register*/
 
        /* Disable MMU */
        mrc p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
        bic r1, r1, #0x1
        mcr p15, 0, r1, c1, c0, 0 /* Write Control Register configuration data*/
 
        /* Invalidate L1 Instruction cache */
        mov r1, #0
        mcr p15, 0, r1, c7, c5, 0
 
        /*Invalidate L1 data cache and L2 unified cache*/
        bl invalidate_unified_dcache_all
 
        /*enable L1 cache*/
        mrc     p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
        orr     r1, r1, #(0x1 << 12)  /* enable I Cache*/
        orr     r1, r1, #(0x1 << 2)   /* enable D Cache*/
        mcr     p15, 0, r1, c1, c0, 0 /* Write Control Register configuration data*/
 
        /*enable L2 cache (in addition to I,D cache on all levels)*/
        mrc     p15, 0, r0, c1, c0, 1
        orr     r0, r0, #(1<<1)
        mcr     p15, 0, r0, c1, c0, 1
      
 
        ldr sp, =0xd0037d80 /* SVC stack top, from irom documentation*/
        sub sp, sp, #12 /* set stack */
       @mov fp, #0
 
        ldr r0,=0x0C
        bl flash_led
 
        bl uart_asm_init

	mov r0,#0xF
        bl      flash_led

/*At this point, bl1 should load bl2 and calculate 
*the checksum for verification, then jump to execute it
*a lot needs to be done at this point, which means we 
* need to go for c ... as it would take less time to
*initialize storage devices there ...
*/

@ switching to basic C ...
	ldr     r0,=calling_copy_function
	bl 	uart_print
	bl 	copy_bl2_to_sram
	ldr 	r0,=never_come_here_string	
	bl	uart_print
	b	. @loop forever

Undefined_Handler:
        b .
SWI_Handler:
        b .
Prefetch_Handler:
        b .
Data_Handler:
        b .
IRQ_Handler:  
        b . 



/*==========================================
* useful routines
============================================ */
 
 
 
/*Massive data/unified cache cleaning to the point of coherency routine, loops all available levels!*/
 
clean_unified_dcache_all:
         mrc p15, 1, r0, c0, c0, 1 /* Read CLIDR into R0*/
         ands r3, r0, #0x07000000
         mov r3, r3, lsr #23 /* Cache level value (naturally aligned)*/
         beq Finished
         mov r10, #0
Loop1:
         add r2, r10, r10, lsr #1 /* Work out 3 x cache level*/
         mov r1, r0, lsr r2 /* bottom 3 bits are the Cache type for this level*/
         and r1, r1, #7 /* get those 3 bits alone*/
         cmp r1, #2
         blt Skip /* no cache or only instruction cache at this level*/
         mcr p15, 2, r10, c0, c0, 0 /* write CSSELR from R10*/
         isb /* ISB to sync the change to the CCSIDR*/
         mrc p15, 1, r1, c0, c0, 0 /* read current CCSIDR to R1*/
         and r2, r1, #7 /* extract the line length field*/
         add r2, r2, #4 /* add 4 for the line length offset (log2 16 bytes)*/
         ldr r4, =0x3FF
         ands r4, r4, r1, lsr #3 /* R4 is the max number on the way size (right aligned)*/
         clz r5, r4 /* R5 is the bit position of the way size increment*/
         mov r9, r4 /* R9 working copy of the max way size (right aligned)*/
Loop2:
         ldr r7, =0x00007FFF
         ands r7, r7, r1, lsr #13 /* R7 is the max num of the index size (right aligned)*/
Loop3:
         orr r11, r10, r9, lsl R5 /* factor in the way number and cache number into R11*/
         orr r11, r11, r7, lsl R2 /* factor in the index number*/
         mcr p15, 0, r11, c7, c10, 2 /* DCCSW, clean by set/way*/
         subs r7, r7, #1 /* decrement the index*/
         bge Loop3
         subs r9, r9, #1 /* decrement the way number*/
         bge Loop2
Skip:
         add r10, r10, #2 /* increment the cache number*/
         cmp r3, r10
         bgt Loop1
         dsb
Finished:
         mov pc, lr
 
 
 
/*Massive data/unified cache invalidation, loops all available levels!*/
invalidate_unified_dcache_all:
         mrc p15, 1, r0, c0, c0, 1 /* Read CLIDR into R0*/
         ands r3, r0, #0x07000000
         mov r3, r3, lsr #23 /* Cache level value (naturally aligned)*/
         beq Finished_
         mov r10, #0
Loop_1:
         add r2, r10, r10, lsr #1 /* Work out 3 x cache level*/
         mov r1, r0, lsr r2 /* bottom 3 bits are the Cache type for this level*/
         and r1, r1, #7 /* get those 3 bits alone*/
         cmp r1, #2
         blt Skip_ /* no cache or only instruction cache at this level*/
         mcr p15, 2, r10, c0, c0, 0 /* write CSSELR from R10*/
         isb /* ISB to sync the change to the CCSIDR*/
         mrc p15, 1, r1, c0, c0, 0 /* read current CCSIDR to R1*/
         and r2, r1, #7 /* extract the line length field*/
         add r2, r2, #4 /* add 4 for the line length offset (log2 16 bytes)*/
         ldr r4, =0x3FF
         ands r4, r4, r1, lsr #3 /* R4 is the max number on the way size (right aligned)*/
         clz r5, r4 /* R5 is the bit position of the way size increment*/
         mov r9, r4 /* R9 working copy of the max way size (right aligned)*/
Loop_2:
         ldr r7, =0x00007FFF
         ands r7, r7, r1, lsr #13 /* R7 is the max num of the index size (right aligned)*/
Loop_3:
         orr r11, r10, r9, lsl R5 /* factor in the way number and cache number into R11*/
         orr r11, r11, r7, lsl R2 /* factor in the index number*/
         mcr p15, 0, r11, c7, c6, 2 /* Invalidate line described by r11*/
         subs r7, r7, #1 /* decrement the index*/
         bge Loop_3
         subs r9, r9, #1 /* decrement the way number*/
         bge Loop_2
Skip_:
         add r10, r10, #2 /* increment the cache number*/
         cmp r3, r10
         bgt Loop_1
         dsb
Finished_:
         mov pc, lr
 
 
flash_led:
     ldr r4,=(GPIO_BASE+GPJ2CON_OFFSET)
     ldr r5,[r4]
     ldr r2,=1
     orr r5,r5,r2
     orr r5,r2,lsl#4
     orr r5,r2,lsl#8
     orr r5,r2,lsl#12
     orr r5,r5,r2
     str r5,[r4]
     ldr r4,=(GPIO_BASE+GPJ2DAT_OFFSET)
     ldr r5,[r4]
     ldr r3,=0xF
     orr r5,r5,r3  @ turn them all off ...
     bic r5,r5,r0
     str r5,[r4]
     mov r1, #0x10000  @ this should be passed meh!
1:  subs r1, r1, #1
  bne 1b
     orr r5,r5,r3  @ turn them all off again ...
     str r5,[r4]
     mov pc, lr

.section .rodata
never_come_here_string:
.ascii "you should never see this ...\n\r\0"
calling_copy_function:
.ascii "calling iROM copy function ...\n\r\0"
.end
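
The operand that Loop3 and Loop_3 build for each set/way operation is easier to follow in C. The sketch below mirrors the register arithmetic of the two cache routines above. The function names are hypothetical, and the cache geometry in the note that follows is an assumed example, not a value read from the S5PV210.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 32-bit value; returns 32 for 0,
 * matching the behaviour of the ARM CLZ instruction. */
static uint32_t clz32(uint32_t x)
{
    uint32_t n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Build the operand for one set/way cache operation.
 * ccsidr: value read from CCSIDR for the selected level
 * level : zero-based cache level (0 = L1, 1 = L2, ...)       */
uint32_t set_way_operand(uint32_t ccsidr, uint32_t level,
                         uint32_t way, uint32_t set)
{
    uint32_t line_shift = (ccsidr & 7u) + 4u;        /* r2 in the listing */
    uint32_t max_way    = (ccsidr >> 3) & 0x3FFu;    /* r4 / r9           */
    uint32_t max_set    = (ccsidr >> 13) & 0x7FFFu;  /* r7                */
    uint32_t way_shift  = clz32(max_way);            /* r5                */
    uint32_t way_bits;

    assert(level < 7 && way <= max_way && set <= max_set);

    /* A direct-mapped cache has max_way == 0, so way_shift is 32;
     * shifting by 32 is undefined in C, and the way field is empty anyway. */
    way_bits = (way_shift < 32u) ? (way << way_shift) : 0u;

    /* r10 in the listing holds level * 2, i.e. the level in bits [3:1]. */
    return (level << 1) | way_bits | (set << line_shift);
}
```

For an assumed 4-way cache with 128 sets and 64-byte lines, the CCSIDR fields would be LineSize = 2, Associativity - 1 = 3 and NumSets - 1 = 127. With those fields, way 3 of set 127 at the first cache level gives the operand 0xC0001FC0.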

After initializing the UART module, the bl copy_bl2_to_sram instruction near the end of Reset_Handler jumps to a function called copy_bl2_to_sram. This function resides in the file src/init_bl2.c.

In this function, we'll be copying BL2 from the MMC card to SRAM using the copy function described in the iROM application note.

This function is defined in the Samsung-supplied iROM code (BL0). To use it, we need to define a pointer to its address in memory and call the copy function through that pointer.
The following listing shows the source file init_bl2.c:

#include <stdint.h>
#include <stdbool.h>


#define COPY_BL2_SIZE	(80*1024)   // 80KB (binary Kilo)
extern void uart_print_string(char*str,uint32_t len);
extern void uart_print_hex(uint32_t address);
extern void uart_print(char*str);


typedef uint32_t (*copy_mmc_to_mem)(uint32_t  channel, uint32_t  start_block, uint16_t block_size,
		                                            uint32_t  *target, bool  init);
void debug_print(char*str){
	uart_print(str);
}

void copy_bl2_to_sram(void){
	uint32_t *load_address =(uint32_t*) (0xD0020FB0);  // SRAM BL1 start address + 16k (binary kilo)
	debug_print("Copying BL2 started ...\n\r\0");
	void (*BL2)(void);
	uint32_t channel = 0;
	copy_mmc_to_mem copy_func = (copy_mmc_to_mem) (*(uint32_t *) 0xD0037F98); //SdMccCopyToMem function from iROM documentation
	uint32_t ret = copy_func(channel, 33, COPY_BL2_SIZE/512,load_address, false);
	if(ret == true){
		debug_print("BL2 loading successful, running it ...\n\r\0");
		BL2 = (void*) load_address;
		(*BL2)(); // dereferencing and running ...
	}else{
		debug_print("BL2 loading failed :-(\n\r\0");
	}
}

The copy_mmc_to_mem typedef shows our definition, which follows the iROM function signature. Note that the parameter list in the comment block (in the iROM application note) is missing the first int parameter, which is the SD/MMC channel we're using. This was figured out by looking at open source code contributed by Samsung engineers, and it wasn't a pleasant process.

The rest of the code is short and intuitive. At the top of copy_bl2_to_sram, we define our load address inside the SRAM memory space to be right after BL1. We then pass the starting block address of BL2 on our MMC card (block 33 in the listing). Note that this function copies data by block, so we also have to pass the number of blocks occupied by our BL2 code (COPY_BL2_SIZE/512), so that the copy function knows where to stop. We then call BL2 through the function pointer, effectively jumping to its first instruction.
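
Two details of that listing are worth isolating: the block count is a byte size divided by 512, and the address 0xD0037F98 holds a pointer to the copy routine, not the routine itself. Below is a host-side sketch of both. fake_irom_copy and fake_vector_slot are hypothetical stand-ins for the iROM routine and the word at 0xD0037F98, so the example can run on a PC; blocks_for and load_blocks are illustrative helpers that are not part of the project.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 512u   /* SD/MMC block size used in the listing */

/* Same shape as the typedef in init_bl2.c; the third parameter is
 * named block_count here because it is used as a number of blocks. */
typedef uint32_t (*copy_mmc_to_mem)(uint32_t channel, uint32_t start_block,
                                    uint16_t block_count, uint32_t *target,
                                    bool init);

/* Round a byte count up to whole blocks. */
uint16_t blocks_for(uint32_t bytes)
{
    uint32_t blocks = (bytes + BLOCK_SIZE - 1u) / BLOCK_SIZE;
    assert(blocks <= UINT16_MAX);
    return (uint16_t)blocks;
}

/* Hypothetical stand-in for the iROM routine: fills the target
 * with the start block number so the copy can be observed. */
static uint32_t fake_irom_copy(uint32_t channel, uint32_t start_block,
                               uint16_t block_count, uint32_t *target,
                               bool init)
{
    (void)channel;
    (void)init;
    for (uint32_t i = 0; i < block_count * (BLOCK_SIZE / 4u); i++)
        target[i] = start_block;
    return true;
}

/* Hypothetical stand-in for the word at 0xD0037F98, which stores
 * the address of the copy routine. */
static copy_mmc_to_mem fake_vector_slot = fake_irom_copy;

uint32_t load_blocks(copy_mmc_to_mem *slot, uint32_t start_block,
                     uint32_t bytes, uint32_t *target)
{
    assert(slot && *slot);
    copy_mmc_to_mem copy_func = *slot;  /* read the pointer stored in the slot */
    return copy_func(0, start_block, blocks_for(bytes), target, false);
}
```

With these helpers, blocks_for(80*1024) yields the same 160 blocks that COPY_BL2_SIZE/512 produces in the listing. Rounding up only matters when the image size is not a multiple of 512.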

BL2's entry point is defined in file bl2/asm/bl2_asm.s:


.text
.code 32
.global _bl2_entry

.equ ram_load_address,          0x20000000

_bl2_entry:
        ldr     r0,=starting_bl2_string 
        bl      uart_print

        bl      clock_subsys_init

        bl      mem_ctrl_asm_init

        b       start_linux 

        /* Memory test: copy a block of code from read only memory to ram,
         * and jump to execute it, the executed code should give a
         * certain message if successful*/

@        bl      copy_To_Mem
@        ldr     r0,=0x0C  @ corrupt r0
@        ldr     ip,=ram_load_address
@        mov     lr, pc
@        bx      ip


.section .rodata
starting_bl2_string:
.ascii "bl2 executing ...\n\r\0"
.align 4

The linker script reflects this entry point, linking the code at the expected address.
BL2_linker.lds:

OUTPUT_FORMAT("elf32-littlearm", "elf32-littlearm", "elf32-littlearm")
OUTPUT_ARCH(arm)
ENTRY(_bl2_entry)
SECTIONS
{
 . = 0xD0020FB0;
 . = ALIGN(4);
        .text : {
        */BL2.bin
        *(.text)
        }
 . = ALIGN(4);
        rodata = .;
        .rodata : { *(.rodata) }

. = ALIGN(4);
        data_start = .;
        .data : { *(.data) }

. = ALIGN(4);
        __bss_start__ = .;
        .bss : { *(.bss) }
        __bss_end__ = .;    
}

Note that address 0xD0020FB0 in the linker script is the same address we used in our init_bl2.c code as the load_address for BL2.
At this point, BL2 runs the exact same code that BL1 ran in the previous posts: it initializes the clock system, then the memory system, and finally jumps to a function called start_linux, which is supposed to implement the bulk of our BL2 code and provide the data the Linux kernel needs to boot successfully. This whole process takes place in SRAM. We haven't yet moved any functional code to SDRAM, but we will have to do that with Linux, because it definitely won't fit in our tiny SRAM.
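
Since everything still lives in SRAM, it is worth checking that the region we copy BL2 into stays clear of the stack. The sketch below does that arithmetic with the three constants that appear in this post (the load address, COPY_BL2_SIZE and the stack top loaded in startup.s). The helper names are made up, and the check covers only the copied region and the initial stack pointer value, not the rest of the SRAM map.

```c
#include <assert.h>
#include <stdint.h>

#define BL2_LOAD_ADDR  0xD0020FB0u    /* load_address in init_bl2.c and the linker script */
#define COPY_BL2_SIZE  (80u * 1024u)  /* bytes copied by copy_bl2_to_sram */
#define SVC_STACK_TOP  0xD0037D80u    /* value loaded into sp in startup.s */

/* First address past the region the copy function writes to. */
uint32_t bl2_copy_end(void)
{
    return BL2_LOAD_ADDR + COPY_BL2_SIZE;
}

/* Bytes between the end of the copied region and the stack top.
 * The stack grows downward into this gap. */
uint32_t stack_headroom(void)
{
    assert(bl2_copy_end() <= SVC_STACK_TOP);
    return SVC_STACK_TOP - bl2_copy_end();
}
```

With the values above, the copied region ends at 0xD0034FB0, which leaves 11728 bytes below the stack top constant.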

Conclusion

Some nuances were avoided in this post. You may want to read up on how GCC handles constructors and destructors when initializing a C/C++ program and calling main to get the full picture on this topic, but that is irrelevant to what we're doing here. Next, we'll move on to implement some basic functionality of a bootloader. We'll first go through the ultimate purpose of a bootloader, which is loading an operating system, but before getting there, we'll build some basic debugging and other functionality into our bootloader and build system, to set the stage for loading a (non-functional) Linux image for demonstration.

Reference