korena's blog

4. Bare metal & low level init - part 1

While writing this post, I encountered the grievous misfortune of what appears to be a hardware failure. BL1 code can no longer be loaded, and an error code is produced regardless of what is on the MMC card. I tried in vain rolling back to the most basic setup presented in the second post in this series, yet booting from NAND flash seems to be unaffected, so I have to investigate this issue further. This probably means that most of the code presented in this post is UNTESTED beyond compilation, and you should be very careful if you plan to copy and paste any bits of it. With that said, the development board was never an important part of this series, since it is all about how one would go about finding information and salvaging code from different sources. Besides, once we get far enough from the boot up code, concepts will be more generic and easier to relate to regardless of the underlying hardware.

Introduction

After loading BL1 into SRAM, we have more freedom to move around. We may choose to perform low level initialization right away (in BL1), but we would have to constrain ourselves to the size limit of 16KB, as the iROM code (BL0) expects a BL1 image of this size in order to successfully load it from the boot device (MMC in my case) into iRAM. Or we may choose to perform low level initialization for selected hardware devices in BL1, and delegate the rest to the more spacious 80KB BL2. These choices really depend on the requirements of the application. For example, you may want your system to start logging the boot up process to a serial console at a very early stage, in which case you would initialize the UART peripheral in BL1, before loading BL2. On the other hand, we could choose to completely ignore the recommendation of the S5PV210 iROM application note and skip this BL2 business altogether. This will probably come up later, but it is worth pointing out that the low level initialization u-boot requires can fit into BL1 without problems.

First things first

Before getting to the initialization of peripherals and subsystems, there are some things we need to take care of. We will start by gathering a bit more information about the hardware facilities and the state of the system throughout the boot up process, so we should look for the answers to the following questions:

  • What infrastructure is available for interrupts? What are the capabilities of the interrupt controller in the S5PV210 chip?

  • How do I invalidate the instruction and data caches, given all the rapid context switching that happens in the boot up process?

  • We probably won't have any use for the MMU and TLB at the low level initialization stage. Do I need to turn them off? Or does the initial state of the CP15 coprocessor (the component responsible for controlling the MMU, TLB and cache in the Cortex-A8 ARM core) guarantee that they are off at every reset state? If not, then how do I ensure that they are off?

  • What reset states does this processor feature? And what steps of the boot up process are required/not required at each reset state?

  • What's the frequency at which the processor is clocked as it executes BL1? What PLL setting is used? What's the configured clock source? Do I need to change it? When should I change it? What's the maximum clock frequency possible?

  • What special function registers control clock specific functionality?

Answering these questions requires the following documents:

  • S5PV210 User Manual (The specific SoC user manual).
  • Cortex-A Series Programmer's Guide (core programming documentation).
  • The schematics of the tiny210 CPU board (the development board schematics, provided by FriendlyArm).
  • ARM Cortex-A8 Technical Reference Manual.
  • ARM PrimeCell™ Vectored Interrupt Controller (PL192) Technical Reference Manual (hardware vectored interrupt controller documentation).

These come in addition to the specifications/requirements of the application under development. Since we have access to the five documents, let's define a rough set of requirements to achieve during the low level initialization process:

  1. The early boot loader phase (BL1 and BL2) should cater for all reset possibilities expected, such as wake up from sleep, hardware/software reset, etc.
  2. Logging through UART is to be available at the earliest point possible.
  3. The processor is to hit the maximum clock speed for code execution available at the earliest point possible through the booting process.

Now that we have a road map of sorts, we can start digging for the missing pieces of information. It is worth mentioning that the questions we are about to answer are asked in that specific order for a reason: in a not very security oriented development cycle, these are generally the steps one would take at a cold reset. We will ignore the question about reset states in this particular post, and defer the trouble it will cause to the next post (or the one after!). An elaborate description of the necessary steps a reset routine would take is given in the Cortex-A Series Programmer's Guide, section 13.1 Booting a bare-metal system.

1. Interrupts

Interrupt setup could be considered well within the comfort zone of embedded software programmers. Generally, there are very broad guidelines for setting up an interrupt system that can be applied to most microprocessors.
In a nutshell, the interrupt ecosystem comes in two main setups: vectored and non-vectored interrupts. The specifications of the interrupt controller (a hardware component) that handles the vector table(1) and the execution of interrupts play a big part in defining which one is supported by our system. But before we get into that, let's consider the most basic idea that should cross a programmer's mind when he/she hears the word 'Interrupt' in the context of an ARM based embedded system.

Basics

When an exception of any type occurs (interrupts included), the processor expects the vector table to be at address 0x00000000 or 0xFFFF0000. In both cases, because the vector table entries (called exception vectors) are consecutive, in a fixed order and 4-byte aligned, the processor knows which entry to jump to for each exception type. For example, if an IRQ interrupt occurs, the processor will, after performing some maintenance, jump to execute the instruction at the vector table base address + 0x18, and trust that you, the system programmer, have delivered on your promise and put a branch instruction to the relevant IRQ handler at that address.
For fast interrupt request handling, the FIQ handler code can be placed directly at the position of its exception vector in the vector table. Because it is the last entry, there is no size constraint on the code that follows the end of the table. This way, the time for branching can be saved, resulting in reduced interrupt latency.
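To make the offsets concrete, here is a minimal sketch in C (runnable on a PC, not on the board) that computes where each exception vector lives. The enum and function names are my own, made up for illustration; only the entry order, the 4-byte slot size and the two base addresses come from the text above.

```c
#include <stdint.h>

/* Exception types in vector table order; each entry is one 4-byte slot.
 * The names are hypothetical, the order is the architectural one. */
enum arm_exception {
    EXC_RESET = 0,
    EXC_UNDEFINED,
    EXC_SWI,
    EXC_PREFETCH_ABORT,
    EXC_DATA_ABORT,
    EXC_RESERVED,
    EXC_IRQ,
    EXC_FIQ
};

#define VECTORS_LOW  0x00000000u  /* normal vector table base */
#define VECTORS_HIGH 0xFFFF0000u  /* high vector table base   */

/* Address the core jumps to for a given exception type. */
uint32_t vector_address(uint32_t base, enum arm_exception exc)
{
    return base + 4u * (uint32_t)exc;
}
```

With a base of 0x00000000, EXC_IRQ lands on 0x18 and EXC_FIQ on 0x1C, which is why FIQ code can simply continue past the end of the table.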

1.1 Interrupt Controllers (hardware peripherals)

1.1.0 Generic Interrupt Controller (GIC):

A GIC is a hardware component that is responsible for sorting out, then delivering all types of interrupts to the CPU core. The system programmer sets it up by initializing its internal memory mapped registers, enables it, and forgets it exists for the rest of the system's up time. As explained in the Cortex-A Series Programmer's Guide (12.2 The Generic Interrupt Controller), from a software perspective, a GIC is viewed as having two functional blocks, a Distributor block and a CPU Interface block. The Distributor, to which all interrupt sources in the system are wired, is responsible for determining which interrupt is to be forwarded to a core through the attached CPU interface. The CPU interface, through which a core receives an interrupt, hosts registers to mask, identify and control the states of interrupts forwarded to that core. There is a separate CPU interface for each core in the system. For more thorough documentation, check out the CoreLink™ GIC-400 Generic Interrupt Controller Technical Reference Manual. It's worth mentioning that this controller is found in many implementations of Cortex-A ARM cores, but not alongside our Cortex-A8. It is only referenced here for its importance when developing on very high end Cortex-A based processors. I personally find it a lot more interesting to work with than the VIC we are about to deal with.
![](/content/images/2016/10/milestone-2-GIC.png)

1.1.1 Vectored Interrupt Controller(VIC):

The SoC manufacturer's implementations of certain 'interfaces' in the core architecture govern a large portion of how the interrupt ecosystem is constructed. This is probably a good time to talk about ARM architecture extensions.
As you probably know, ARM cores are basically copyrighted blueprints for a RISC architecture. ARM (the company) designs and outlines the specifications of the architecture, which are then licensed to processor manufacturers like Qualcomm, ST and Samsung, who build their processors around the core ARM design. These manufacturers have to implement the most architecture-defining interfaces for the processor to actually qualify as ARM based. The ARM specifications also define a set of 'Extensions'. Some of these extensions may be substituted by other non-ARM-designed components that support the same implementation standards. These extensions are outlined as hardware/firmware components to be implemented by the chip manufacturers (check out the CoreLink controllers and peripherals). I will address only the extensions relevant to the S5PV210's implementation.

The S5PV210 implements four daisy-chained ARM PrimeCell™ Vectored Interrupt Controllers (PL192), each of which is a powerful system component that takes the burden of determining which interrupt source is requesting service, and where its service routine is loaded, off the processor.
From a programmer's perspective, communicating with the PL192 controller is no different from communicating with any standard AMBA compliant slave peripheral, though it requires a thorough understanding of the functionality of the PL192 controller. Bear in mind that the PL192 is not the newest interrupt controller out there, but the principles of dealing with newer ones are more or less the same. The specific procedure of initializing and using this controller is a topic for another day; we will visit it when we talk about the OS kernel in a future milestone. For now, the explanation of vectored interrupts below describes, broadly, how this VIC helps with interrupt handling.

1.2 Operations

1.2.0 Non vectored Interrupts

The processor performs some maintenance before servicing an interrupt. I am going to address two cases of non-vectored procedures: the first is the simplistic interrupt mechanism, the second is the more practical, but a bit more complex, nested interrupt mechanism.

1.2.0.1 Simplistic Interrupt Mechanism

The processor core receives an external interrupt via the IRQ input, so it does the following :

  1. The return address, derived from the PC of the current execution context, is saved in LR_IRQ.
  2. The CPSR register is copied to SPSR_IRQ.
  3. The CPSR register is updated to reflect IRQ mode.
  4. The I bit in CPSR is set, disabling the processor's ability to receive any more IRQ interrupts.
  5. The PC is set to the IRQ entry in the vector table.
  6. The instruction at the IRQ entry in the vector table (a branch to the interrupt handler) is executed.
  7. The interrupt handler saves the context of the program we branched from in point 1 by pushing the values of the registers that will be corrupted through the execution of the handler onto the IRQ stack (this is implemented by the programmer in assembly, or by efficient use of a compiler).
  8. The interrupt handler determines which interrupt source must be processed and calls the appropriate handler (switch-case or if-else; the difference between vectored and non-vectored interrupt controllers mainly affects this step, we'll see how that happens in a moment).
  9. The handling logic code is executed.
  10. The interrupt handler restores the context saved in point 7.
  11. The core is switched back to the previous execution state by copying SPSR_IRQ to CPSR.
  12. Finally, the PC is restored from LR_IRQ, and normal program execution continues. In practice, steps 11 and 12 are performed by a single instruction (for example SUBS PC, LR, #4), because LR_IRQ and the IRQ stack are no longer accessible once the mode has changed.

The same sequence is also applicable to an FIQ interrupt. There are, however, some subtle differences: depending on your implementation, you might not need to save any registers due to the banked nature of the high registers of FIQ mode, and the jump from the exception vector might also be avoided by placing the handler logic right in the vector table.
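The part of this sequence that the core performs on its own (steps 1 to 5), and the final return, can be sketched as a tiny software model in C. This is not code for the board; the struct and function names are hypothetical, and the model deliberately ignores details such as the Thumb and endianness bits. The one fact worth noticing is that, in ARM state, LR_IRQ ends up holding the address of the next instruction to execute plus 4, which is why the usual return instruction is SUBS PC, LR, #4.

```c
#include <stdint.h>

#define MODE_MASK 0x1Fu
#define MODE_IRQ  0x12u
#define I_BIT     0x80u   /* when set, IRQ is disabled */

/* Hypothetical, heavily simplified view of the registers involved. */
struct cpu_model {
    uint32_t pc;        /* address of the next instruction to execute */
    uint32_t cpsr;
    uint32_t lr_irq;
    uint32_t spsr_irq;
};

/* Steps 1 to 5: what the hardware does when it takes an IRQ. */
void irq_entry(struct cpu_model *cpu, uint32_t vector_base)
{
    cpu->lr_irq   = cpu->pc + 4u;                          /* step 1 */
    cpu->spsr_irq = cpu->cpsr;                             /* step 2 */
    cpu->cpsr     = (cpu->cpsr & ~MODE_MASK) | MODE_IRQ;   /* step 3 */
    cpu->cpsr    |= I_BIT;                                 /* step 4 */
    cpu->pc       = vector_base + 0x18u;                   /* step 5 */
}

/* Model of SUBS PC, LR, #4: restores PC and CPSR together. */
void irq_return(struct cpu_model *cpu)
{
    cpu->pc   = cpu->lr_irq - 4u;
    cpu->cpsr = cpu->spsr_irq;
}
```

Starting from SVC mode with interrupts enabled (CPSR = 0x13), the model ends up in IRQ mode with the I bit set (CPSR = 0x92) and PC = 0x18, and irq_return() brings both registers back to their original values.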

A point to note is that the source of the IRQ exception is a hardware component called an interrupt controller. This component is responsible for asserting an interrupt pin on the processor (internally wired in the SoC) depending on the state of its multiple interrupt inputs. In some cases, it is also responsible for, among other things, delivering the information about the source of the interrupt we discussed in point 8 (vectored). In other cases, the source of the interrupt is determined by checking certain bits in various peripherals, or by checking some register values in the interrupt controller (non-vectored).
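For the non-vectored case, point 8 boils down to a software loop. The following is a minimal sketch in C, assuming a hypothetical 32-bit pending-status word read from the interrupt controller and a table of handlers that we fill ourselves; all names here (isr_table, register_isr, dispatch_irq, uart_isr) are made up for illustration and do not correspond to any real driver API.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_SOURCES 32

typedef void (*isr_t)(void);

/* Handler table, indexed by interrupt source number. */
static isr_t isr_table[NUM_SOURCES];

void register_isr(unsigned source, isr_t isr)
{
    if (source < NUM_SOURCES)
        isr_table[source] = isr;
}

/* Top level dispatcher: scans the pending bits from the lowest source
 * number upwards and calls the first registered handler it finds.
 * Returns the source number serviced, or -1 if nothing was pending. */
int dispatch_irq(uint32_t pending)
{
    for (unsigned source = 0; source < NUM_SOURCES; source++) {
        if (pending & (UINT32_C(1) << source)) {
            if (isr_table[source] != NULL)
                isr_table[source]();
            return (int)source;
        }
    }
    return -1;
}

/* Example handler for a hypothetical UART interrupt on source 5. */
static int uart_hits;
static void uart_isr(void)
{
    uart_hits++;
}
```

The cost of this approach is the scan itself: the more sources there are, the longer it takes to find the one that fired, and that is exactly the work a vectored controller takes off the processor.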

1.2.0.2 Nested Interrupt Mechanism

Interrupt preemption refers to the processor's ability to interrupt a running interrupt handler in case it receives another interrupt with a higher priority. Note that interrupt nesting is enforced by software; it is not 'enabled' by setting a bit in some hardware register (not to be confused with vectored interrupts, which are configured by initializing some hardware registers, in addition to software enforcement). The processor deals with interrupts as described in the simplistic interrupt mechanism above, but the implementation of the interrupt handler is what makes the difference. The following steps are taken in the interrupt handler after the processor jumps to execute it:

  1. It saves the context of the interrupted program by pushing onto the stack of SYS mode* any registers that will be corrupted by the handler, including the return address (LR_IRQ) and SPSR_IRQ.

  2. It determines which interrupt source must be processed and clears the source in the external hardware, typically by poking some register in the interrupt controller (a hardware peripheral), preventing it from immediately triggering another interrupt.

  3. The interrupt handler then changes to SYS mode. At this point, the CPSR I bit is still set, meaning no interrupt can be received yet. Remember that this bit was set automatically by the processor as a part of its usual interrupt maintenance procedure discussed above (specifically, step 4).

  4. While in SYS mode, the interrupt handler saves LR_SYS on the stack of this mode (SYS).

  5. The CPSR I bit is cleared by the interrupt handler at this point, re-enabling interrupts.

  6. The appropriate handler code for the interrupt source determined in step 2 is called using the BL instruction, which has the effect of modifying LR_SYS (this is why we saved it in step 4).

  7. Upon completion, the interrupt handler disables IRQ by setting the I bit in CPSR, restores LR_SYS, pops the exception return address (LR_IRQ) and SPSR_IRQ from the stack of SYS mode (the current mode), restores CPSR from the saved SPSR_IRQ, and jumps back to the main interrupted task.

    *The SVC mode stack could also be used; since it is usually the operating system's mode, stack size should not be an issue. It is also worth mentioning that pushing to the IRQ mode's stack is acceptable too, if the stack size allows for it. The important thing is to save LR_IRQ and SPSR_IRQ in memory rather than trusting the core ARM registers to keep them, because their contents are not guaranteed to survive once another interrupt is taken.

Following the above procedure, the interrupt handling code is guaranteed to be re-entrant, making room for interrupts to be attended to at almost any time they occur, reducing latency and re-introducing prioritization (in hardware) and preemption. We'll have more of a practical application for this explanation soon.

1.2.1 Vectored Interrupts

Vectored interrupts are a more robust way of dealing with interrupts. Let's first consider the procedure a processor has to go through to service a non-vectored interrupt:

  1. Branches to the base address of the vector table plus offset.
  2. Executes the branch instruction at the interrupt vector address (base+0x18 for IRQ); this instruction will look like: LDR PC,[PC,#addressOfTopLevelHandlingDispatcher].
  3. The handler (top level dispatcher) executes some boilerplate code (stack stuff, nested interrupts logic ...).
  4. The top level dispatcher determines which interrupt source is requesting service by interrogating the interrupt controller, or by checking bits in multiple peripheral registers and executing some conditional code.
  5. Branches to the ISR specific to the interrupt source.

With vectored interrupts, the processor can perform these steps more efficiently. When the vectored interrupt controller (VIC hardware peripheral) signals an interrupt to the CPU, the CPU:

  1. Branches to the base address of the vector table plus offset.
  2. Executes the branch instruction at the interrupt vector address. The ISR address is provided by the VIC in a certain memory mapped register, so the branch instruction will look like: LDR PC,[PC,#-0x120]* (the effect of which is to load PC with the contents of VICVectAddr).
  3. The processor jumps to execute the ISR whose address was read from VICVectAddr, which also has to run some boilerplate code (stack stuff, nested interrupts logic, re-entrance assurance ...).
* The address of the current instruction is 0x18, so the value held in PC is 0x20 (0x18 + 8 = 0x20, for historical reasons). The address whose content will be loaded into PC is therefore 0x20 - 0x120 = 0xFFFFFF00 (wrapping around 32 bits), which (in a typical implementation) is the location of the memory mapped register VICVectAddr. The executed branch instruction is therefore equivalent to: LDR PC,[0xFFFFFF00]. This is specific to PL192; other VICs might differ.
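The arithmetic in the footnote is easy to get wrong, so here is a minimal C sketch of it. The function name and the constant are my own; the only facts used are that, in ARM state, reading PC yields the address of the current instruction plus 8, and that addresses wrap around at 32 bits.

```c
#include <stdint.h>

/* In ARM state, reading PC gives the current instruction address + 8. */
#define ARM_PC_READ_AHEAD 8u

/* Address accessed by LDR PC,[PC,#offset] placed at instr_addr.
 * The unsigned addition wraps around modulo 2^32, as the hardware does. */
uint32_t ldr_pc_relative_target(uint32_t instr_addr, int32_t offset)
{
    return instr_addr + ARM_PC_READ_AHEAD + (uint32_t)offset;
}
```

Calling ldr_pc_relative_target(0x18, -0x120) yields 0xFFFFFF00, the register address assumed in the footnote.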

The above steps are more or less how it works. However, in a practical application, more work is involved, and jumping to the ISR right away might not always be the best thing to do; a top level assembly handler might be more practical.
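To see what the VIC contributes, it can help to model it in software. The sketch below is a toy model in C, not PL192 driver code: the struct, its fields and the function are hypothetical, and it assumes that the lowest-numbered pending source always wins, whereas the real controller has its own priority logic. The point is only that the search for the interrupt source happens inside the controller, so the processor receives a ready-made ISR address with a single register read.

```c
#include <stdint.h>

#define VIC_SOURCES 32

/* Toy model of a vectored interrupt controller (hypothetical layout). */
struct vic_model {
    uint32_t irq_status;              /* one pending bit per source       */
    uint32_t vect_addr[VIC_SOURCES];  /* ISR address programmed per source */
};

/* What a read of the controller's vector address register returns in
 * this model: the ISR address of the winning pending source. */
uint32_t vic_read_vect_addr(const struct vic_model *vic)
{
    for (unsigned source = 0; source < VIC_SOURCES; source++) {
        if (vic->irq_status & (UINT32_C(1) << source))
            return vic->vect_addr[source];
    }
    return 0u;  /* nothing pending in this model */
}
```

Compared with a non-vectored dispatcher, the loop has not disappeared; it has moved into hardware, where it costs no instructions.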

1.3 BL1 interrupt related code

Having talked a lot about interrupts, you would think that we are going to implement all the discussed concepts here. The truth is, that would be impractical, the main reason being that we do not want to keep changing the startup code every time we decide to change a relatively high section of our system's software stack, like the OS we intend to use. Our home grown simple kernel might initialize interrupts in a way that the Linux kernel does not agree with, so it is better to leave this sort of specific initialization to the high level code (such as the OS kernel) that will actually use it.
Since we do not expect our processor to service interrupts during the boot up process, initializing the VIC at this point is not necessary. Only when we get to the point of initializing the OS do we need to worry about complex interrupt setup. For now, all we have to do is provide a valid reset handler. If an exception of another type is caught this early, it is an indication of bad hardware, or perhaps a poorly assembled executable binary. Note that the vector table we are providing here is not functional: if one of the exceptions that would normally cause a handler to be called occurs (e.g. an Undefined instruction or SWI exception), the processor will not jump to the table we defined! This is just to illustrate how the table is generally structured; a lot more work has to go into setting these exception vectors up.

BL1.s:


@ --- Standard definitions of mode bits and interrupt (I & F) flags in PSRs
@ (in GNU as for ARM, '@' starts a comment; ';' only separates statements)
.equ Mode_USR,   0x10
.equ Mode_FIQ,   0x11
.equ Mode_IRQ,   0x12
.equ Mode_SVC,   0x13
.equ Mode_ABT,   0x17
.equ Mode_UND,   0x1B
.equ Mode_SYS,   0x1F

.equ I_Bit,      0x80 @ when I bit is set, IRQ is disabled
.equ F_Bit,      0x40 @ when F bit is set, FIQ is disabled

.text
.code 32
.global _start
_start:
  B Reset_Handler
  B Undefined_Handler
  B SWI_Handler
  B Prefetch_Handler
  B Data_Handler
  NOP @ Reserved vector
  B IRQ_Handler
@ FIQ handler would go right here ...
Reset_Handler:
/* Reset handler goes here */
          
Undefined_Handler:
           B .
SWI_Handler:
           B .
Prefetch_Handler:
           B .
Data_Handler:
           B .
IRQ_Handler:  
           B . 
                                      

2. Cache

Cache is one of the processor's most important assets. It's true that as a system programmer, one does not have to worry too much about it beyond boot up and some special conditions like DMA (unless you're a kernel developer, in which case you're doomed), because the processor is generally designed to make the best use of it independently. But we need to take very good care of it at this initial stage, because of all the context and content switching that happens. What do I mean by context and content switching, you say? Well, let's take the example of BL1 code copying BL2 from the MMC card to memory for execution(2), which is the task we intend to achieve next. The processor executes instructions from BL1 code, which means there will be some activity in the instruction and data caches (L1). BL1 code will probably use load and store instructions, looping through loading data blocks from the MMC card and storing them in the designated location in memory. The data that is loaded is essentially the instructions of BL2, which means this data forms executable code that is to be executed later. Now let's look closely at the store operation. A popular way, or 'policy', of using cache in processor-to-memory write operations is called the write-back policy, in which the processor writes data to the cache (by executing an STR instruction) and delegates the task of finishing the job and writing the data to the target memory to the cache controller (typically, the write buffer performs the actual write operation). This delegation is performed by setting the dirty bit of the cache line that was just written, signaling that this cache line holds data that is newer than the data held in main memory. The cache controller later evicts the line to its final destination in main memory. Now, as the Cortex-A Series Programmer's Guide puts it:

"it will be necessary to clean that data from the cache before the code can be executed. This ensures that the instructions stored as data go out into main memory and are then available for the instruction fetch logic. In addition, if the area to which code is written was previously used for some other program, the instruction cache could contain stale code (from before main memory was re-written).
Therefore, it may also be necessary to invalidate the instruction cache before branching to the newly copied code."

Cleaning cache means transferring the data that is in the cache (new data) to the destination memory, and clearing the dirty bits of the cache lines whose contents were just transferred.
Invalidating cache means clearing it of data by clearing the valid bit of one or more cache lines.
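The difference between the two operations is easier to see on a toy model. The C sketch below models a single write-back cache line sitting in front of a single word of main memory; the structs and function names are hypothetical and have nothing to do with how the Cortex-A8 cache is actually organized, but the valid and dirty bits behave as described above.

```c
#include <stdint.h>
#include <stdbool.h>

/* One cache line holding one word (purely illustrative). */
struct cache_line {
    bool     valid;
    bool     dirty;
    uint32_t data;
};

/* The single word of main memory that this line caches. */
struct memory_cell {
    uint32_t value;
};

/* Write-back store: only the cache is updated, the line becomes dirty. */
void cached_store(struct cache_line *line, uint32_t value)
{
    line->data  = value;
    line->valid = true;
    line->dirty = true;
}

/* Load: a valid line is trusted, even if main memory has changed. */
uint32_t cached_load(struct cache_line *line, const struct memory_cell *mem)
{
    if (!line->valid) {          /* miss: fill the line from memory */
        line->data  = mem->value;
        line->valid = true;
        line->dirty = false;
    }
    return line->data;
}

/* Clean: push dirty data out to memory and clear the dirty bit. */
void clean_line(struct cache_line *line, struct memory_cell *mem)
{
    if (line->valid && line->dirty) {
        mem->value  = line->data;
        line->dirty = false;
    }
}

/* Invalidate: forget the contents by clearing the valid bit. */
void invalidate_line(struct cache_line *line)
{
    line->valid = false;
    line->dirty = false;
}
```

After cached_store() the memory cell still holds its old value until clean_line() runs, which is the BL2-copy hazard described above. Conversely, if something else rewrites memory, cached_load() keeps returning the stale word until invalidate_line() is called.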

For the purpose of making this section clearer, I have looked back and updated the bits and pieces post on cache.

We now know that, if caching is enabled, we must clean, then invalidate, the cache when we make the transition from BL1 to BL2. This process ensures flawless execution and prevents inconsistent bugs that would be extremely difficult to debug. The aforementioned procedure is presented for completeness, but a common practice is to disable cache operations throughout BL1, preventing any coherency issues from arising in the first place, at the cost of sacrificing the benefits of caching for a short while.
The question now is whether we have to carry out cache cleaning/invalidation operations when we make the switch from iROM to BL1. We completed this task in a previous milestone, but maybe we should revisit it, since we will be rewriting BL1 for low level initialization anyway.
Following the execution sequence in the boot up process, the S5PV210 user manual (section 6.1 OVERVIEW OF BOOTING SEQUENCE) tells us that the processor is hardwired to start executing the iROM code at reset. The instructions in the iROM code are fed directly to the processor for execution, so cleaning the cache after this operation (as in, inside BL1) does not seem necessary, as one would expect the iROM code to make sure that the code (BL1 instructions and predefined data) is indeed available in SRAM before jumping to execute it. But what about cache invalidation? The Cortex-A Series Programmer's Guide suggests that cache invalidation is required at reset (the iROM code, according to the S5PV210 iROM application note, claims to have done it, although cache invalidation can be done automatically by a hardware state machine at reset in Cortex-A8 cores, as the programmer's guide states). Cache invalidation can also be performed whenever the cache is guaranteed to hold data that is in no way useful in the coming execution context, and which has the potential of introducing coherency issues. So, will the processor ever need the information that was pumped into the cache through the execution of the iROM code when it executes BL1? The answer is probably no, so we can in fact invalidate the cache at the beginning of BL1(3).

Now that we have a good grip on the cache operations we want to perform, let's dig up the information we need to actually perform them. All this information is gracefully presented in the Cortex-A Series Programmer's Guide. After a lengthy, thorough explanation of the principles and practices, the documentation states that cache control is the job of Coprocessor 15, the System Control Coprocessor(4) present in the ARM core. The exact ARM instructions used to access this coprocessor are detailed in the document, and it would be redundant to repeat them here. The exact section in which these bits of information are presented is Chapter 8 (Caches); example code is also presented in Example 13-3 Setting up caches, MMU and branch predictors.

In all the code segments introduced through this milestone, we will be regularly accessing coprocessor 15, specifically the Control Register (c1), so we might want to talk about it a bit. For an in-depth study, check out 3.2.25 c1, Control Register in the Cortex-A8 TRM.

In order to write initialization code for the Cortex-A8, you need to be very comfortable using the MRC (Move to Register from Coprocessor) and MCR (Move to Coprocessor from Register) instructions. These instructions perform reads from and writes to the coprocessors implemented in the ARM core. In our case, we are communicating with registers within the system control coprocessor (CP15). CP15 provides system control functionality. This includes architecture and feature identification, as well as control, status information and configuration support. The details of CP15 can be found in the ARM Architecture Reference Manual ARMv7-A and ARMv7-R Edition, specifically the subsections of B3.18, and in the Cortex-A8 Technical Reference Manual, specifically section 3.1.
To start from the top level and work your way down to the more specific functions of coprocessor 15, start from Table B3-42 in the ARM ARM, and the main functions of the system control coprocessor in the Cortex-A8 TRM. To figure out the actual instruction you need to code for a certain functionality, use Table 3-3 Summary of CP15 registers and operations from the ARM Cortex-A8 TRM.
What we need to know at this point is that CP15 provides control over important parts of the memory system in our Cortex-A8 based processor, among other system configuration functions, and that the MCR/MRC instructions are used to access it. The processor mode and the interrupt disable/enable flags, by contrast, live in the CPSR and are accessed with MRS/MSR.

The syntax of the MCR/MRC instruction is as follows:

<MRC|MCR>{<cond>} <cp_num>,<opc_1>,Rd,CRn,CRm,<opc_2>

where:

cond : Condition Code Specifier.

cp_num: The coprocessor number; in our case, it will always be p15.

opc_1: required coprocessor opcode.

Rd: Destination/Source register from the ARM core register file (r0,r1,r2 ...).

CRn: Coprocessor Source/Dest primary Register.

CRm: Coprocessor additional Register (specifies additional info).

opc_2: Coprocessor additional opcode (specifies additional info).

As an example of the use of MRC (move to register from coprocessor), let us read the configuration data from the control register (c1) of coprocessor 15 (p15) into ARM register r1. To do that, we have to pass the required opcode to the coprocessor, which is 0 in this case. We also have to specify CRm as c0. Why, you say? Well, because section B3.2.1 Register access instructions of the ARM ARM says that if we have no use for CRm in this instruction, we have to pass c0, otherwise the instruction is UNPREDICTABLE. The document also states that if we have no use for opc_2, we can pass 0 or just omit it. I passed a zero for the purpose of matching the above syntax format, so the instruction becomes:

mrc p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
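If you are curious how the fields of this syntax end up in the 32-bit instruction word, the following C sketch packs them according to the ARM (A32) encoding of MRC/MCR, with the condition fixed to AL (always). The function name and parameter names are my own; this is a host-side illustration, handy for checking a disassembly by hand, not something BL1 needs.

```c
#include <stdint.h>
#include <stdbool.h>

/* Builds the A32 instruction word for
 *   MRC/MCR p<cp_num>, <opc1>, r<rd>, c<crn>, c<crm>, <opc2>
 * with condition code AL. is_mrc selects MRC (read) over MCR (write). */
uint32_t encode_coproc_move(bool is_mrc, unsigned cp_num, unsigned opc1,
                            unsigned rd, unsigned crn, unsigned crm,
                            unsigned opc2)
{
    return (UINT32_C(0xE) << 28)                    /* cond = AL          */
         | (UINT32_C(0xE) << 24)                    /* coprocessor move   */
         | ((uint32_t)(opc1 & 0x7u) << 21)          /* opc_1              */
         | ((uint32_t)(is_mrc ? 1u : 0u) << 20)     /* 1 = MRC, 0 = MCR   */
         | ((uint32_t)(crn & 0xFu) << 16)           /* CRn                */
         | ((uint32_t)(rd & 0xFu) << 12)            /* Rd                 */
         | ((uint32_t)(cp_num & 0xFu) << 8)         /* coprocessor number */
         | ((uint32_t)(opc2 & 0x7u) << 5)           /* opc_2              */
         | (UINT32_C(1) << 4)                       /* fixed bit          */
         | (uint32_t)(crm & 0xFu);                  /* CRm                */
}
```

For the instruction above, encode_coproc_move(true, 15, 0, 1, 1, 0, 0) gives 0xEE111F10; flipping the first argument to false gives the matching MCR, 0xEE011F10.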

Dealing with coprocessors in ARM assembly is probably the trickiest job you will have to do. The amount of undefined/unpredictable behavior and obfuscated code segments can be overwhelming; however, the documentation tells you exactly what to do to perform all the known operations you would need. Oh, and by the way, if you ever wondered what the Undefined Instruction exception is for, it is basically part of the mechanism you would use to extend the instruction set of ARM processors. For example, the processor manufacturer could add a coprocessor that specializes in performing floating point arithmetic in hardware. It would have to define instructions that are allocated neither as ARM instructions nor as existing coprocessor instructions. This works because an ARM processor will only flag an undefined instruction exception if the instruction it just fetched isn't recognized by the ARM core and no available coprocessor is capable of executing it. If code that was supposed to run on this extended processor is then used on a processor implementation that does not have a coprocessor which can perform floating point arithmetic, an emulation library can step in, 'catching' the Undefined Instruction exception and implementing the calculation in software as a handler. This is a very deep topic that I might dig into in another series of posts. For now, let's just do our best not to hit an undefined instruction exception.
One last thing: data cache invalidation(5) requires a basic understanding of how cache memory is built in hardware, so you may want to take a look at the cache post under bits and pieces.
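The reason the hardware layout matters is that data cache invalidation is performed line by line, addressing each line by its set and way. As a rough sketch in C, assuming the ARMv7-A set/way operand format and, purely as an example geometry, a 32KB, 4-way L1 data cache with 64-byte lines (check the cache size registers of your own chip rather than trusting these numbers), the value written to the coprocessor for each line can be composed like this. The function names are hypothetical.

```c
#include <stdint.h>

/* ceil(log2(n)) for n >= 1 */
static unsigned ceil_log2(unsigned n)
{
    unsigned bits = 0;
    while ((1u << bits) < n)
        bits++;
    return bits;
}

/* Number of sets in a cache of the given geometry. */
unsigned num_sets(unsigned cache_bytes, unsigned num_ways, unsigned line_bytes)
{
    return cache_bytes / (num_ways * line_bytes);
}

/* Operand for a set/way cache maintenance operation:
 *   way   in the top A bits,      A = ceil(log2(number of ways))
 *   set   starting at bit L,      L = log2(line size in bytes)
 *   level in bits [3:1],          0 = L1, 1 = L2                     */
uint32_t set_way_operand(unsigned way, unsigned set, unsigned level,
                         unsigned num_ways, unsigned line_bytes)
{
    unsigned a = ceil_log2(num_ways);
    unsigned l = ceil_log2(line_bytes);
    /* A direct-mapped cache has no way field; also avoids a shift by 32. */
    uint32_t way_field = (a == 0u) ? 0u : ((uint32_t)way << (32u - a));

    return way_field | ((uint32_t)set << l) | ((uint32_t)level << 1);
}
```

An invalidation routine then loops over every way and every set of each cache level, writing one such operand per line.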

3. Memory Management

Chapter 6 in the ARM Cortex-A8 Technical Reference Manual details how the main hardware part of the memory management system, the MMU (https://info.siphyc.com/mmu/), is to be managed by the operating system. We need not deal with this flood of information at this point; we will only look for the instructions required to disable it during the boot up process.

A quick glance at Table 3-46 Control Register bit functions in the ARM Cortex-A8 TRM tells us that, in control register c1 of coprocessor 15, clearing bit 0 (the M bit) disables the MMU. It also tells us that this is the reset value, but we'll still disable it, because at this point we do not know its state if the reset cause was not a hardware one. It is likely that it is the reset value in all reset states, but it wouldn't hurt to be cautious.
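The read-modify-write that BL1 performs on this register comes down to clearing three bits. Here is the same bit manipulation as a small C sketch; the macro and function names are mine, and the bit positions are the ones used by the bic instructions in BL1.s (bit 0 for the MMU, bit 2 for the data cache, bit 12 for the instruction cache).

```c
#include <stdint.h>

/* Control register (c1) bits touched by the startup code. */
#define SCTLR_M (UINT32_C(1) << 0)   /* MMU enable               */
#define SCTLR_C (UINT32_C(1) << 2)   /* data cache enable        */
#define SCTLR_I (UINT32_C(1) << 12)  /* instruction cache enable */

/* Returns the control register value with the MMU and both L1 caches
 * disabled, leaving every other bit untouched. */
uint32_t sctlr_disable_mmu_and_caches(uint32_t sctlr)
{
    return sctlr & ~(SCTLR_M | SCTLR_C | SCTLR_I);
}
```

Clearing the bits rather than writing a fixed value means whatever else the iROM code configured in this register is preserved.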

BL1.s:

/* Standard definitions of mode bits and interrupt (I & F) flags in PSRs */
.equ Mode_USR,   0x10
.equ Mode_FIQ,   0x11
.equ Mode_IRQ,   0x12
.equ Mode_SVC,   0x13
.equ Mode_ABT,   0x17
.equ Mode_UND,   0x1B
.equ Mode_SYS,   0x1F
.equ I_Bit,      0x80 /* when I bit is set, IRQ is disabled*/
.equ F_Bit,      0x40 /* when F bit is set, FIQ is disabled*/
 
.text
.code 32
.global _start
 
_start:
        b Reset_Handler
        b Undefined_Handler
        b SWI_Handler
        b Prefetch_Handler
        b Data_Handler
        nop /* Reserved vector*/
        b IRQ_Handler
/*FIQ handler would go right here ...*/
 
/*
.globl _bss_start
_bss_start:
 .word bss_start
 
.globl _bss_end
_bss_end:
 .word bss_end
 
.globl _data_start
_data_start:
 .word data_start
 
.globl _rodata
_rodata:
 .word rodata
*/
 
Reset_Handler:
  
/* set the cpu to SVC32 mode and disable IRQ & FIQ */
         msr CPSR_c, #Mode_SVC|I_Bit|F_Bit ;
       /* Disable Caches */
        mrc p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
        bic r1, r1, #(0x1 << 12)  /* Disable I Cache*/
        bic r1, r1, #(0x1 << 2)   /* Disable D Cache*/
        mcr p15, 0, r1, c1, c0, 0 /* Write Control Register configuration data*/
 
        /* Disable L2 cache (too specific, not needed now, but useful later)*/
        mrc p15, 0, r0, c1, c0, 1  /* reading auxiliary control register*/
        bic r0, r0, #(1<<1)
        mcr p15, 0, r0, c1, c0, 1  /* writing auxiliary control register*/
 
        /* Disable MMU */
        mrc p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
        bic r1, r1, #0x1
        mcr p15, 0, r1, c1, c0, 0 /* Write Control Register configuration data*/
 
        /* Invalidate L1 Instruction cache */
        mov r1, #0
        mcr p15, 0, r1, c7, c5, 0
 
        /*Invalidate L1 data cache and L2 unified cache*/
        bl invalidate_unified_dcache_all
 
        /*enable L1 cache*/
        @mrc p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
        @orr r1, r1, #(0x1 << 12)  /* enable I Cache*/
        @orr r1, r1, #(0x1 << 2)   /* enable D Cache*/
        @mcr p15, 0, r1, c1, c0, 0 /* Write Control Register configuration data*/
 
        /*enable L2 cache, only works if the above is commented in ...*/
        @mrc p15, 0, r0, c1, c0, 1
        @orr r0, r0, #(1<<1)
        @mcr p15, 0, r0, c1, c0, 1
         
 
Undefined_Handler:
        b .
SWI_Handler:
        b .
Prefetch_Handler:
        b .
Data_Handler:
        b .
IRQ_Handler:  
        b . 
           
 
 
/*==========================================
* useful routines
============================================ */
/*Massive data/unified cache cleaning to the point of coherency routine, loops all available levels!*/
 
clean_unified_dcache_all:
    mrc p15, 1, r0, c0, c0, 1 /* Read CLIDR into R0*/
    ands r3, r0, #0x07000000
    mov r3, r3, lsr #23 /* Cache level value (naturally aligned)*/
    beq Finished
    mov r10, #0
Loop1:
    add r2, r10, r10, lsr #1 /* Work out 3 x cache level*/
    mov r1, r0, lsr r2 /* bottom 3 bits are the Cache type for this level*/
    and r1, r1, #7 /* get those 3 bits alone*/
    cmp r1, #2
    blt Skip /* no cache or only instruction cache at this level*/
    mcr p15, 2, r10, c0, c0, 0 /* write CSSELR from R10*/
    isb /* ISB to sync the change to the CCSIDR*/
    mrc p15, 1, r1, c0, c0, 0 /* read current CCSIDR to R1*/
    and r2, r1, #7 /* extract the line length field*/
    add r2, r2, #4 /* add 4 for the line length offset (log2 16 bytes)*/
    ldr r4, =0x3FF
    ands r4, r4, r1, lsr #3 /* R4 is the max number on the way size (right aligned)*/
    clz r5, r4 /* R5 is the bit position of the way size increment*/
    mov r9, r4 /* R9 working copy of the max way size (right aligned)*/
Loop2:
    ldr r7, =0x00007FFF
    ands r7, r7, r1, lsr #13 /* R7 is the max num of the index size (right aligned)*/
Loop3:
    orr r11, r10, r9, lsl R5 /* factor in the way number and cache number into R11*/
    orr r11, r11, r7, lsl R2 /* factor in the index number*/
    mcr p15, 0, r11, c7, c10, 2 /* DCCSW, clean by set/way*/
    subs r7, r7, #1 /* decrement the index*/
    bge Loop3
    subs r9, r9, #1 /* decrement the way number*/
    bge Loop2
Skip:
    add r10, r10, #2 /* increment the cache number*/
    cmp r3, r10
    bgt Loop1
    dsb
Finished:
    mov pc, lr
 
 
 
/*Massive data/unified cache invalidation, loops all available levels!*/
invalidate_unified_dcache_all:
    mrc p15, 1, r0, c0, c0, 1 /* Read CLIDR into R0*/
    ands r3, r0, #0x07000000
    mov r3, r3, lsr #23 /* Cache level value (naturally aligned)*/
    beq Finished_
    mov r10, #0
Loop_1:
    add r2, r10, r10, lsr #1 /* Work out 3 x cache level*/
    mov r1, r0, lsr r2 /* bottom 3 bits are the Cache type for this level*/
    and r1, r1, #7 /* get those 3 bits alone*/
    cmp r1, #2
    blt Skip_ /* no cache or only instruction cache at this level*/
    mcr p15, 2, r10, c0, c0, 0 /* write CSSELR from R10*/
    isb /* ISB to sync the change to the CCSIDR*/
    mrc p15, 1, r1, c0, c0, 0 /* read current CCSIDR to R1*/
    and r2, r1, #7 /* extract the line length field*/
    add r2, r2, #4 /* add 4 for the line length offset (log2 16 bytes)*/
    ldr r4, =0x3FF
    ands r4, r4, r1, lsr #3 /* R4 is the max number on the way size (right aligned)*/
    clz r5, r4 /* R5 is the bit position of the way size increment*/
    mov r9, r4 /* R9 working copy of the max way size (right aligned)*/
Loop_2:
    ldr r7, =0x00007FFF
    ands r7, r7, r1, lsr #13 /* R7 is the max num of the index size (right aligned)*/
Loop_3:
    orr r11, r10, r9, lsl R5 /* factor in the way number and cache number into R11*/
    orr r11, r11, r7, lsl R2 /* factor in the index number*/
    mcr p15, 0, r11, c7, c6, 2 /* Invalidate line described by r11*/
    subs r7, r7, #1 /* decrement the index*/
    bge Loop_3
    subs r9, r9, #1 /* decrement the way number*/
    bge Loop_2
Skip_:
    add r10, r10, #2 /* increment the cache number*/
    cmp r3, r10
    bgt Loop_1
    dsb
Finished_:
    mov pc, lr

Some portions of this code were added to take the burden off the code that will, at a later point, re-enable the caches, MMU and TLB. It is troubling to know that the mass invalidation and cleaning routines above are the easy way to go: an OS has to handle cleaning and invalidation in a micro manner, targeting specific sets/ways rather than the whole bunch. This is usually the case for OS responsibilities like DMA maintenance and multi-processor memory management, which is where a programmer is likely to spend hours, sometimes days, to come up with a code segment that is effectively less than 50 lines long. Before touching Linux, I will be demonstrating some of these cases, to a relatively simple extent.
I realized, after reaching this point, that I have not discussed branch prediction, but then I thought meeh!
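
Going back to the set/way loops in the routines above, it may help to see what ends up in r11 on each iteration. The following Python sketch mirrors the shifts the assembly performs; the cache geometry passed in at the bottom (4 ways, 128 sets, 64-byte lines) is an assumption picked purely for illustration, and the function names are mine.

```python
def clz32(x):
    """Count leading zeros of a 32-bit value (models the CLZ instruction)."""
    return 32 - x.bit_length()


def set_way_operands(level, ways, sets, line_bytes):
    """Yield the values the loops would feed to a set/way cache operation.

    level is zero-based (0 = L1). The layout mirrors the two ORR instructions:
    (way << clz(ways - 1)) | (set << log2(line_bytes)) | (level << 1)
    """
    way_shift = clz32(ways - 1)               # r5 in the routine
    set_shift = line_bytes.bit_length() - 1   # r2 in the routine (log2 of the line length)
    for way in range(ways - 1, -1, -1):       # r9 counts down to 0
        for index in range(sets - 1, -1, -1): # r7 counts down to 0
            yield (way << way_shift) | (index << set_shift) | (level << 1)


# Assumed geometry, for illustration only: 4 ways, 128 sets, 64-byte lines, level 0
ops = list(set_way_operands(level=0, ways=4, sets=128, line_bytes=64))
print(len(ops), hex(ops[0]), hex(ops[-1]))  # 512 0xc0001fc0 0x0
```

Every set/way pair is visited exactly once, which is what makes these routines the "whole bunch" approach.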

4 Clock management subsystem

In our quest to seize control of the clock management subsystem, we'll start with the furthest, most extrinsic yet essential system component: the physical oscillator(s) used. Generally, you would expect this piece of information to be presented in writing in the development board's documentation, but no, FriendlyARM would not just pass that to you! They'd rather have you trace the schematic routes.
Luckily enough, the person who created the schematics document knew exactly what he was doing, and figuring out which clocks are used as external input clocks was not at all a painful process.
The external clocks used are:

  • 24MHz across XXTI and XXTO
  • 24MHz across XusbXTI and XusbXTO
  • 27MHz across XhdmiXTI and XhdmiXTO
  • 32.768kHz across XrtcXTI and XrtcXTO

This is just the tip of the iceberg, and the clock management system extends further below. Cortex-A based processors tend to have a sophisticated, rather complex clock and PLL (Phase Locked Loop) subsystem, mainly because it [the clock subsystem] is highly configurable. The details of the clock subsystem can be found in the S5PV210 User Manual (CHAPTER 3 CLOCK CONTROLLER), and it is broadly illustrated in this awesome piece of graphics I created:

Clock sub-system of S5PV210

The directions of the black arrows in the graphic above are important: the blocks at the tips of the arrows represent system components on the receiving end of a clock-related feed, while the blocks at the tails represent feeders, or variables that you would see on the right hand side of an equation used to calculate the value of the blocks at the tips, if that makes any sense. The black arrows represent possible routes you can take to configure the internal parts of the clock subsystem. Junctions represent multiplexing, or other forms of selection (choice) possibilities. There are also clock dividers that are not shown in the illustration.
The documentation (S5PV210's) divides the clock control subsystem into three main domains: Main system domain (MSYS), Display system domain (DSYS) and Peripheral system domain (PSYS).
In contrast, the Cortex-A8 documentation divides the clock system into three domains as well (section 10.1 Clock domains of ARM Cortex-A8 TRM). The first is the high speed core clock domain (CLK), which controls the following internals of the core:

  • instruction fetch unit
  • instruction decode unit
  • instruction execute unit
  • load/store unit
  • L2 cache unit, including AXI interface
  • NEON unit (this thing is fantastic, stay tuned for a future post on it!)
  • ETM unit, not including the ATB interface
  • debug logic, not including the APB interface.

The second clock domain is the APB clock (PCLK), which controls the debug interface of the processor, and the third is the ATB clock (ATCLK), which controls the ATB interface of the processor.
As you may have noticed, the illustration above does not seem to mention the ATB interface clock speed control. That is because, as stated in section 2.3 ETB of the S5PV210 User Manual, the ATB clock is synchronous to the APB bus clock; this is a choice Samsung made. ETB stands for Embedded Trace Buffer, an on-chip buffering area for trace data that becomes important when a debugger is attached to the processor. I happen to have an old JTAG dongle that I used with an older development board from FriendlyARM. I never tried it with this board though, cause I never needed to; I might dig it out and demonstrate its use someday.

4.1 Clock Initialization sequence

The first thing CHAPTER 3 CLOCK CONTROLLER of the S5PV210 UM tells us is that the clock management unit is controlled by the SYSCON register; then it goes on and on about what is wired to what.
The clock generation procedure is described in section 3.5 CLOCK CONFIGURATION PROCEDURE of the user manual. In summary, to initialize the clock subsystem, we will:

  • Bypass the PLL blocks, and feed the clock subsystem internals directly from the external oscillators
  • Turn the PLLs OFF
  • Turn the PLLs ON
  • Change the PLLs’ PMS values
  • Change the MUX values for proper routing of modules within the clock subsystem
  • Change the system clock divider values
  • Change the special clock divider values
  • Re-configure the clock subsystem to use the outputs of the PLLs

The order in which these steps are taken is derived mainly from the clock generation procedure section mentioned above, but some constraints are presented in the second paragraph at the beginning of chapter 3.4 CLOCK GENERATION. This paragraph concerns point 5: it tells us that there are two types of multiplexers available for clock route selection, a glitch-free multiplexer and a non-glitch-free multiplexer. The glitch-free multiplexer does not suffer from glitches when the selection is changed from one clock input source to another, on the condition that BOTH clock inputs of the multiplexer in question are running while the selection change is performed. Otherwise, the selection change might not complete, and the resulting multiplexer output would be in an undefined state.
Non-glitch-free multiplexers can suffer from glitches when changing the clock source from one input to another. To avoid this, the outputs of these multiplexers should be disabled until the change has been performed.
In figure 3-3, glitch-free multiplexers are shown in gray, while non-glitch-free multiplexers are shown in white.
Because of the constraints on the different types of multiplexers, you might notice a mix up in the multiplexer selection code. If it does not seem intuitive, remember that for glitch-free multiplexers we make sure both input clock signals are running but do not care about disabling the output, whereas for non-glitch-free multiplexers we only turn the outputs on after we perform the selection.
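
The two multiplexer rules can be written down as a tiny predicate, which is a handy mental checklist while reading the configuration code. This is a sketch of the rules as stated above, not of any hardware interface; the function and parameter names are hypothetical.

```python
def can_switch(glitch_free, src_running, dst_running, output_enabled):
    """Return True when changing a mux selection is safe under the two rules above."""
    if glitch_free:
        # Glitch-free mux: both inputs must be running while the selection changes;
        # the state of the output does not matter.
        return src_running and dst_running
    # Non-glitch-free mux: the output must be disabled during the change.
    return not output_enabled


# Switching a glitch-free mux to a PLL that has not been started yet is unsafe
print(can_switch(glitch_free=True, src_running=True, dst_running=False, output_enabled=True))
# Switching a non-glitch-free mux is fine once its output is turned off
print(can_switch(glitch_free=False, src_running=True, dst_running=True, output_enabled=False))
```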

4.2 BL1 cache, MMU and clock initialization related code

Having laid down the plan, we'll go through it step by step, but first, here's the updated code, growing a bit too large I fear.

BL1.s:

/* Standard definitions of mode bits and interrupt (I & F) flags in PSRs */
.equ Mode_USR,   0x10
.equ Mode_FIQ,   0x11
.equ Mode_IRQ,   0x12
.equ Mode_SVC,   0x13
.equ Mode_ABT,   0x17
.equ Mode_UND,   0x1B
.equ Mode_SYS,   0x1F
 
/*useful addresses, fetched from the S5PV210 user manual*/
.equ GPIO_BASE,  0xE0200000
.equ GPJ2CON_OFFSET,  0x280   
.equ GPJ2DAT_OFFSET,   0x284
.equ GPJ2PUD_OFFSET,   0x288
.equ GPJ2DRV_SR_OFFSET,  0x28C
.equ GPJ2CONPDN_OFFSET,  0x290
.equ GPJ2PUDPDN_OFFSET,  0x294
 
 
/*clock configuration registers */
.equ ELFIN_CLOCK_POWER_BASE,  0xE0100000
.equ CLK_SRC6_OFFSET,  0x218
.equ CLK_SRC0_OFFSET,  0x200
 
.equ APLL_CON0_OFFSET,  0x100
.equ APLL_CON1_OFFSET,  0x104
.equ MPLL_CON_OFFSET,  0x108
.equ EPLL_CON_OFFSET,  0x110
.equ VPLL_CON_OFFSET,  0x120
 
.equ CLK_DIV0_OFFSET,  0x300
.equ CLK_DIV0_MASK,  0x7fffffff
.equ CLK_DIV6_OFFSET,  0x318
 
.equ CLK_OUT_OFFSET,  0x500
 
 
/* PMS values constants*/
.equ APLL_MDIV,   0x7D  @125
.equ APLL_PDIV,   0x3   @3
.equ APLL_SDIV,   0x1   @1
 
.equ MPLL_MDIV,   0x29b @667
.equ MPLL_PDIV,   0xc   @12
.equ MPLL_SDIV,   0x1   @1
 
.equ EPLL_MDIV,   0x60  @96
.equ EPLL_PDIV,   0x6   @6
.equ EPLL_SDIV,   0x2   @2
 
.equ VPLL_MDIV,   0x6c  @108
.equ VPLL_PDIV,   0x6   @6
.equ VPLL_SDIV,   0x3   @3
 
/* the next constants place MDIV at bit 16, PDIV at bit 8 and SDIV at bit 0 of each PLL control register (xPLL_CON); they also set the highest bit (31) to turn the PLL on */
.equ APLL_VAL,    ((1<<31)|(APLL_MDIV<<16)|(APLL_PDIV<<8)|(APLL_SDIV))
.equ MPLL_VAL,    ((1<<31)|(MPLL_MDIV<<16)|(MPLL_PDIV<<8)|(MPLL_SDIV))
.equ EPLL_VAL,    ((1<<31)|(EPLL_MDIV<<16)|(EPLL_PDIV<<8)|(EPLL_SDIV))
.equ VPLL_VAL,    ((1<<31)|(VPLL_MDIV<<16)|(VPLL_PDIV<<8)|(VPLL_SDIV))
/* Set AFC value */
.equ AFC_ON,    0x00000000
.equ AFC_OFF,    0x10000010
 
 
 
/* CLK_DIV0 constants*/
.equ APLL_RATIO,  0
.equ A2M_RATIO,   4
.equ HCLK_MSYS_RATIO,  8
.equ PCLK_MSYS_RATIO,  12
.equ HCLK_DSYS_RATIO,  16
.equ PCLK_DSYS_RATIO,  20
.equ HCLK_PSYS_RATIO,  24
.equ PCLK_PSYS_RATIO,  28
 
.equ CLK_DIV0_VAL,      ((0<<APLL_RATIO)|(4<<A2M_RATIO)|(4<<HCLK_MSYS_RATIO)|(1<<PCLK_MSYS_RATIO)|(3<<HCLK_DSYS_RATIO)|(1<<PCLK_DSYS_RATIO)|(4<<HCLK_PSYS_RATIO)|(1<<PCLK_PSYS_RATIO))
 
 
.equ I_Bit,      0x80 /* when I bit is set, IRQ is disabled*/
.equ F_Bit,      0x40 /* when F bit is set, FIQ is disabled*/
 
.text
.code 32
.global _start
 
_start:
        b Reset_Handler
        b Undefined_Handler
        b SWI_Handler
        b Prefetch_Handler
        b Data_Handler
        nop /* Reserved vector*/
        b IRQ_Handler
/*FIQ handler would go right here ...*/
 
/*
.globl _bss_start
_bss_start:
 .word bss_start
 
.globl _bss_end
_bss_end:
 .word bss_end
 
.globl _data_start
_data_start:
 .word data_start
 
.globl _rodata
_rodata:
 .word rodata
*/
 
Reset_Handler:
  
/* set the cpu to SVC32 mode and disable IRQ & FIQ */
         msr CPSR_c, #Mode_SVC|I_Bit|F_Bit ;
       /* Disable Caches */
        mrc p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
        bic r1, r1, #(0x1 << 12)  /* Disable I Cache*/
        bic r1, r1, #(0x1 << 2)   /* Disable D Cache*/
        mcr p15, 0, r1, c1, c0, 0 /* Write Control Register configuration data*/
 
        /* Disable L2 cache (too specific, not needed now, but useful later)*/
     mrc p15, 0, r0, c1, c0, 1  /* reading auxiliary control register*/
     bic r0, r0, #(1<<1)
     mcr p15, 0, r0, c1, c0, 1  /* writing auxiliary control register*/
 
        /* Disable MMU */
        mrc p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
        bic r1, r1, #0x1
        mcr p15, 0, r1, c1, c0, 0 /* Write Control Register configuration data*/
 
        /* Invalidate L1 Instruction cache */
        mov r1, #0
        mcr p15, 0, r1, c7, c5, 0
 
        /*Invalidate L1 data cache and L2 unified cache*/
        bl invalidate_unified_dcache_all
 
        /*enable L1 cache*/
        @mrc p15, 0, r1, c1, c0, 0 /* Read Control Register configuration data*/
        @orr r1, r1, #(0x1 << 12)  /* enable I Cache*/
        @orr r1, r1, #(0x1 << 2)   /* enable D Cache*/
        @mcr p15, 0, r1, c1, c0, 0 /* Write Control Register configuration data*/
 
        /*enable L2 cache, only works if the above is commented in ...*/
        @mrc p15, 0, r0, c1, c0, 1
     @orr r0, r0, #(1<<1)
     @mcr p15, 0, r0, c1, c0, 1
      
 
 ldr sp, =0xd0037d80 /* SVC stack top, from irom documentation*/
 sub sp, sp, #12 /* set stack */
 @mov fp, #0
 
 ldr r0,=0x0C
 @bl flash_led
 
 bl clock_subsys_init
 
 ldr r0,=0x0F
 @bl flash_led
 b .
 
 
Undefined_Handler:
        b .
SWI_Handler:
        b .
Prefetch_Handler:
        b .
Data_Handler:
        b .
IRQ_Handler:  
        b . 
           
 
 
/*==========================================
* useful routines
============================================ */
 
 
/*clock subsystem initialization code*/
clock_subsys_init:
 
ldr r0, =ELFIN_CLOCK_POWER_BASE @0xE0100000
 
 ldr r1, =0x0
 str r1, [r0, #CLK_SRC0_OFFSET]
 
 ldr r1, =0x0
 str r1, [r0, #APLL_CON0_OFFSET]
 ldr r1, =0x0
 str r1, [r0, #MPLL_CON_OFFSET]
 ldr r1, =0x0
 str r1, [r0, #MPLL_CON_OFFSET] /* redundant: repeats the MPLL_CON store above; EPLL_CON and VPLL_CON are not cleared here */
  
 /*turn on PLLs and set the PMS values according to the recommendation*/
 ldr r1, =APLL_VAL
 str r1, [r0, #APLL_CON0_OFFSET]
 
 ldr r1, =MPLL_VAL
 str r1, [r0, #MPLL_CON_OFFSET]
 
 ldr r1, =VPLL_VAL
 str r1, [r0, #VPLL_CON_OFFSET]
 
 ldr r1, =AFC_ON
 str r1, [r0, #APLL_CON1_OFFSET]
 
 ldr r1, [r0, #CLK_DIV0_OFFSET]
 ldr r2, =CLK_DIV0_MASK
 bic r1, r1, r2
 
 ldr r2, =CLK_DIV0_VAL
 orr r1, r1, r2
 str r1, [r0, #CLK_DIV0_OFFSET]
 
    /*delay for the PLLs to lock*/
 mov r1, #0x10000
1: subs r1, r1, #1
 bne 1b
     
 /* Set Mux to PLL (Bus clock) */
 
 /* CLK_SRC0 PLLsel -> APLLout(MSYS), MPLLout(DSYS,PSYS), EPLLout, VPLLout (glitch free)*/
 ldr r1, [r0, #CLK_SRC0_OFFSET]
 ldr r2, =0x10001111
 orr r1, r1, r2
 str r1, [r0, #CLK_SRC0_OFFSET]
 
 /* CLK_SRC6[25:24] -> MUXDMC0 clock select = SCLKMPLL (which is running at 667MHz, needs to be divided to a value below 400MHz)*/
 ldr r1, [r0, #CLK_SRC6_OFFSET]
 bic r1, r1, #(0x3<<24)
 orr r1, r1, #0x01000000
 str r1, [r0, #CLK_SRC6_OFFSET]
 
 /* CLK_DIV6[31:28] -> SCLK_DMC0 = MOUTDMC0 / (DMC0_RATIO + 1) -> 667/(3+1) = 166MHz*/
 ldr r1, [r0, #CLK_DIV6_OFFSET]
 bic r1, r1, #(0xF<<28)
 orr r1, r1, #0x30000000
 str r1, [r0, #CLK_DIV6_OFFSET]
 
        /*the clock output routes one of the configured clocks to an output pin, useful if you have a debugger to
         * verify the outcome of your configuration. I am at home, and have no access to such hardware at the moment of writing.
        */
 /* CLK OUT Setting */
 /* DIVVAL[23:20], CLKSEL[16:12] */
 ldr r1, [r0, #CLK_OUT_OFFSET]
 ldr r2, =0x00909000
 orr r1, r1, r2
 str r1, [r0, #CLK_OUT_OFFSET]
 
 mov pc, lr
 
/*Massive data/unified cache cleaning to the point of coherency routine, loops all available levels!*/
 
clean_unified_dcache_all:
mrc p15, 1, r0, c0, c0, 1 /* Read CLIDR into R0*/
ands r3, r0, #0x07000000
mov r3, r3, lsr #23 /* Cache level value (naturally aligned)*/
beq Finished
mov r10, #0
Loop1:
add r2, r10, r10, lsr #1 /* Work out 3 x cache level*/
mov r1, r0, lsr r2 /* bottom 3 bits are the Cache type for this level*/
and r1, r1, #7 /* get those 3 bits alone*/
cmp r1, #2
blt Skip /* no cache or only instruction cache at this level*/
mcr p15, 2, r10, c0, c0, 0 /* write CSSELR from R10*/
isb /* ISB to sync the change to the CCSIDR*/
mrc p15, 1, r1, c0, c0, 0 /* read current CCSIDR to R1*/
and r2, r1, #7 /* extract the line length field*/
add r2, r2, #4 /* add 4 for the line length offset (log2 16 bytes)*/
ldr r4, =0x3FF
ands r4, r4, r1, lsr #3 /* R4 is the max number on the way size (right aligned)*/
clz r5, r4 /* R5 is the bit position of the way size increment*/
mov r9, r4 /* R9 working copy of the max way size (right aligned)*/
Loop2:
ldr r7, =0x00007FFF
ands r7, r7, r1, lsr #13 /* R7 is the max num of the index size (right aligned)*/
Loop3:
orr r11, r10, r9, lsl R5 /* factor in the way number and cache number into R11*/
orr r11, r11, r7, lsl R2 /* factor in the index number*/
mcr p15, 0, r11, c7, c10, 2 /* DCCSW, clean by set/way*/
subs r7, r7, #1 /* decrement the index*/
bge Loop3
subs r9, r9, #1 /* decrement the way number*/
bge Loop2
Skip:
add r10, r10, #2 /* increment the cache number*/
cmp r3, r10
bgt Loop1
dsb
Finished:
mov pc, lr
 
 
 
/*Massive data/unified cache invalidation, loops all available levels!*/
invalidate_unified_dcache_all:
mrc p15, 1, r0, c0, c0, 1 /* Read CLIDR into R0*/
ands r3, r0, #0x07000000
mov r3, r3, lsr #23 /* Cache level value (naturally aligned)*/
beq Finished_
mov r10, #0
Loop_1:
add r2, r10, r10, lsr #1 /* Work out 3 x cache level*/
mov r1, r0, lsr r2 /* bottom 3 bits are the Cache type for this level*/
and r1, r1, #7 /* get those 3 bits alone*/
cmp r1, #2
blt Skip_ /* no cache or only instruction cache at this level*/
mcr p15, 2, r10, c0, c0, 0 /* write CSSELR from R10*/
isb /* ISB to sync the change to the CCSIDR*/
mrc p15, 1, r1, c0, c0, 0 /* read current CCSIDR to R1*/
and r2, r1, #7 /* extract the line length field*/
add r2, r2, #4 /* add 4 for the line length offset (log2 16 bytes)*/
ldr r4, =0x3FF
ands r4, r4, r1, lsr #3 /* R4 is the max number on the way size (right aligned)*/
clz r5, r4 /* R5 is the bit position of the way size increment*/
mov r9, r4 /* R9 working copy of the max way size (right aligned)*/
Loop_2:
ldr r7, =0x00007FFF
ands r7, r7, r1, lsr #13 /* R7 is the max num of the index size (right aligned)*/
Loop_3:
orr r11, r10, r9, lsl R5 /* factor in the way number and cache number into R11*/
orr r11, r11, r7, lsl R2 /* factor in the index number*/
mcr p15, 0, r11, c7, c6, 2 /* Invalidate line described by r11*/
subs r7, r7, #1 /* decrement the index*/
bge Loop_3
subs r9, r9, #1 /* decrement the way number*/
bge Loop_2
Skip_:
add r10, r10, #2 /* increment the cache number*/
cmp r3, r10
bgt Loop_1
dsb
Finished_:
mov pc, lr
 
 
 
.align 4,0x90
flash_led:
     ldr r4,=(GPIO_BASE+GPJ2CON_OFFSET)
     ldr r5,[r4]
     ldr r2,=1
     orr r5,r5,r2
     orr r5,r2,lsl#4
     orr r5,r2,lsl#8
     orr r5,r2,lsl#12
     orr r5,r5,r2
     str r5,[r4]
     ldr r4,=(GPIO_BASE+GPJ2DAT_OFFSET)
     ldr r5,[r4]
     ldr r3,=0xF
     orr r5,r5,r3  @ turn them all off ...
     bic r5,r5,r0
     str r5,[r4]
     mov r1, #0x10000  @ this should be passed meh!
1:  subs r1, r1, #1
  bne 1b
     orr r5,r5,r3  @ turn them all off again ...
     str r5,[r4]
     mov pc, lr


The SFRs with address 0xE010_0XXX control clock-related logic, specifically the output frequency of the PLLs (APLL, MPLL, EPLL and VPLL), clock source selection, clock divider ratios, and clock gating.

The code starts by defining useful SFR addresses using the .equ directive; all these definitions can be found in the user manual's 2.1.2 SPECIAL FUNCTION REGISTER MAP.
Before we go further in explaining what's going on, note that Figure 3-3 S5PV210 Clock Generation Circuit 1 should be referenced often during the configuration procedure. It shows how the clock system flows through the guts of the SoC, and as ugly as it is, I am afraid it is much more useful than the illustration above, so you ought to take a look at it.

At lines 190-191 (the first store in clock_subsys_init), we choose the main clock source, meaning the 24MHz oscillator across the XXTI and XXTO pads. This information was obtained from section 3.7.3.1 Clock Source Control Registers; in short, we store zero in all fields described in the table. Reserved bits should not be touched, for they may cause undefined behavior, but I just didn't care. To my luck, their init-state is 0. What we are doing here is basically making sure that the processor receives a clock signal (hence stays up and running) while we configure the PLLs, clock dividers and multiplexing routes. If our processor dies on us during this time period because of an unstable clock setup, who would run the clock subsystem configuration code we write?
If you take a look at the table and figure 3-3, you can see that the main reason for this step is to bypass the PLL blocks for the time we take to configure all the bits and pieces of the clock subsystem. After bypassing the PLLs, we move on to disable them. Section 3.7.2.1 PLL Control Registers (APLL_LOCK / MPLL_LOCK / EPLL_LOCK / VPLL_LOCK) tells us the required register addresses. In PLL Control Registers (APLL_CON0/APLL_CON1, R/W, Address = 0xE010_0100/0xE010_0104), the first table tells us that setting bit 31 to zero disables the APLL, so we went on and zeroed out the whole register (lines 193-194), cause we're lazy like that. You see, the rest of the bits in the register will have to be re-written anyway, cause according to the equation FOUT = MDIV × FIN / (PDIV × 2^(SDIV-1)) they play a big part in defining the output frequency we desire. The default reset values produce 800MHz, but we want 1GHz, cause we can.
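
To see where the 1GHz comes from, the APLL equation can be evaluated on the host with the PMS constants from BL1.s. This is just arithmetic, sketched in Python; the reset-state PMS triple (200, 6, 1) is my assumption, chosen because it reproduces the 800MHz mentioned above, so check it against the register table before relying on it.

```python
FIN_MHZ = 24  # main oscillator across XXTI/XXTO


def apll_fout(mdiv, pdiv, sdiv, fin=FIN_MHZ):
    """FOUT = MDIV * FIN / (PDIV * 2^(SDIV-1)), the APLL equation quoted in the text."""
    return mdiv * fin / (pdiv * 2 ** (sdiv - 1))


print(apll_fout(125, 3, 1))  # PMS values used in BL1.s -> 1000.0
print(apll_fout(200, 6, 1))  # assumed reset PMS values  -> 800.0
```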

Sections 3.7.2.2 PLL Control Registers (MPLL_CON, R/W, Address = 0xE010_0108) and 3.7.2.3 PLL Control Registers (EPLL_CON0/ EPLL_CON1, R/W, Address = 0xE010_0110/0xE010_0114) of the documentation provide a similar setup for MPLL_CON and EPLL_CON0 respectively. Next, lines 201 to 211 turn on the PLLs and set their PMS values according to the recommendation in the documentation.
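
For reference, this is what the xPLL_VAL constants in BL1.s evaluate to. The Python sketch below simply repeats the shifts from the .equ lines (enable at bit 31, MDIV at bit 16, PDIV at bit 8, SDIV at bit 0); the helper name `pll_con` is mine.

```python
def pll_con(mdiv, pdiv, sdiv, enable=True):
    """Pack PMS values the way the APLL_VAL/MPLL_VAL/EPLL_VAL/VPLL_VAL constants do."""
    return (int(enable) << 31) | (mdiv << 16) | (pdiv << 8) | sdiv


# (MDIV, PDIV, SDIV) triples copied from the .equ constants in BL1.s
PMS = {
    "APLL": (125, 3, 1),
    "MPLL": (667, 12, 1),
    "EPLL": (96, 6, 2),
    "VPLL": (108, 6, 3),
}

for name, pms in PMS.items():
    print(name, hex(pll_con(*pms)))
# APLL 0x807d0301
# MPLL 0x829b0c01
# EPLL 0x80600602
# VPLL 0x806c0603
```

Writing the same value with `enable=False` leaves bit 31 clear, which is the "PLL off" state the earlier zero stores rely on.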

Having enabled the PLLs, we move on to the clock dividers (lines 213-219), controlled, as you would expect, through another set of special function registers. Section 3.7.4 CLOCK DIVIDER CONTROL REGISTER gives us a very important piece of information: we need to set the dividers' values such that the resulting clock frequencies do not exceed their maximum values, which are:

SCLKAPLL: 1GHz, SCLKMPLL: 667MHz, SCLKA2M: 400MHz, HCLK_MSYS: 200MHz, PCLK_MSYS: 100MHz

So we have to set these dividers to match our desired clock speeds. Since we have no hard limitations here, let's just choose the maximum for all. Bear in mind that this is a reckless way of making a choice: a power requirement will have you ponder upon this section for some time, for you would be concerned about the trade-off between power efficiency and instruction cycle time (speed of processing), depending on your application.

The table in 3.7.4.1 Clock Divider Control Register (CLK_DIV0, R/W, Address = 0xE010_0300) gives us the requisite information. As an example, let's take the path leading to ARMCLK, which we intend to run at 1000MHz. If we follow Figure 3-3 S5PV210 Clock Generation Circuit 1, we see that the APLL feeds a multiplexer (MUXAPLL), which feeds another multiplexer (MUXMSYS), which in turn feeds the DIVAPLL divider that produces the final ARMCLK clock signal.
The equation for calculating the value of ARMCLK is given in the CLK_DIV0 control register bits table as:

ARMCLK = MOUT_MSYS / (APLL_RATIO + 1)

where MOUT_MSYS is the output of the MUXMSYS multiplexer. It follows that we have to configure the two consecutive multiplexers so that they deliver the APLL's 1GHz clock frequency, and that APLL_RATIO must be 0, so that we get an ARMCLK of 1GHz. The rest of the CLK_DIV0 register bits are configured in a similar way.
Having configured the PLLs and dividers, we now route the clock system properly, which we do using the multiplexers. This means looking back at section 3.7.3.1 Clock Source Control Registers and filling in the bits carefully, following figure 3-3. Comments in the code explain the process at every stage (lines 229-244).
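
The same arithmetic can be checked on the host. The Python sketch below packs the ratios from BL1.s into CLK_DIV0 and derives the MSYS clocks; the ARMCLK equation is the one quoted above, while the assumption that HCLK_MSYS is divided down from ARMCLK and PCLK_MSYS from HCLK_MSYS (each by ratio + 1) is my reading of the divider chain and should be verified against the CLK_DIV0 table.

```python
# Bit offsets of the ratio fields in CLK_DIV0, taken from the .equ constants in BL1.s
OFFSETS = {
    "APLL": 0, "A2M": 4, "HCLK_MSYS": 8, "PCLK_MSYS": 12,
    "HCLK_DSYS": 16, "PCLK_DSYS": 20, "HCLK_PSYS": 24, "PCLK_PSYS": 28,
}
# Ratio values used to build CLK_DIV0_VAL in BL1.s
RATIOS = {
    "APLL": 0, "A2M": 4, "HCLK_MSYS": 4, "PCLK_MSYS": 1,
    "HCLK_DSYS": 3, "PCLK_DSYS": 1, "HCLK_PSYS": 4, "PCLK_PSYS": 1,
}


def pack_clk_div0(ratios):
    """OR every ratio into its field, like the CLK_DIV0_VAL expression does."""
    value = 0
    for name, ratio in ratios.items():
        value |= ratio << OFFSETS[name]
    return value


def msys_clocks(mout_msys_mhz, ratios):
    """Derive ARMCLK, HCLK_MSYS and PCLK_MSYS (assumed chain, see the note above)."""
    armclk = mout_msys_mhz / (ratios["APLL"] + 1)
    hclk = armclk / (ratios["HCLK_MSYS"] + 1)
    pclk = hclk / (ratios["PCLK_MSYS"] + 1)
    return armclk, hclk, pclk


print(hex(pack_clk_div0(RATIOS)))  # 0x14131440
print(msys_clocks(1000, RATIOS))   # (1000.0, 200.0, 100.0)
```
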
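
To make the 0x10001111 written to CLK_SRC0 a little less magical, the value can be split into its 4-bit slots. The field names in this Python sketch are an assumption on my part (they are not taken from this post), so treat them as labels to be confirmed against section 3.7.3.1; only the nibble arithmetic is certain.

```python
# Assumed names for the eight 4-bit slots of CLK_SRC0, lowest nibble first
CLK_SRC0_FIELDS = [
    "APLL_SEL", "MPLL_SEL", "EPLL_SEL", "VPLL_SEL",
    "MUX_MSYS_SEL", "MUX_DSYS_SEL", "MUX_PSYS_SEL", "ONENAND_SEL",
]


def decode_clk_src0(value):
    """Split a CLK_SRC0 value into one selector per 4-bit slot."""
    return {name: (value >> (4 * i)) & 0xF for i, name in enumerate(CLK_SRC0_FIELDS)}


# The register was zeroed at the start of clock_subsys_init, so the ORR yields the constant itself
selection = decode_clk_src0(0x0 | 0x10001111)
for name, sel in selection.items():
    print(f"{name:13s} = {sel}")
```

With these (assumed) names, the four PLL selectors are switched to 1 while the MSYS/DSYS/PSYS selectors stay at 0, which matches the comment in the code about routing APLL to MSYS and MPLL to DSYS and PSYS.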

Looking at the datasheet of the K4T1G084QQ, which is the memory chip (DDR2) used in the Tiny210 (close enough!), you can tell that the maximum operating frequency is 400MHz (section 2. Key Features). In our Cortex-A8 implementation, however, we see from section 1.1.3 SUPPORTS CLOCK FREQUENCY UP TO 200MHZ BLOCK DIAGRAM that the Dynamic Memory Controller is operated through the AXI bus, which is part of the MSYS domain, and from 1.3 KEY FEATURES OF S5PV210 we can see that the maximum frequency for this domain is 200MHz (check out figure 1.3 KEY FEATURES OF S5PV210 UM), so we have to bring it down to a frequency below 200MHz. The routine then returns by loading LR into PC, easy enough.
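
The DMC0 setup boils down to two read-modify-write operations and one division, which can be modelled on the host. In this Python sketch the `set_field` helper and the all-ones starting register values are hypothetical, picked only to show that neighbouring bits survive; the field positions and the 667MHz input come from the comments in BL1.s.

```python
def set_field(reg, shift, width, value):
    """Clear a bit field (BIC) and write a new value into it (ORR), on a 32-bit register."""
    mask = ((1 << width) - 1) << shift
    return (reg & ~mask & 0xFFFFFFFF) | ((value << shift) & mask)


# Hypothetical previous register contents, chosen to show that other bits are preserved
clk_src6 = set_field(0xFFFFFFFF, shift=24, width=2, value=1)  # CLK_SRC6[25:24] = 1 (SCLKMPLL)
clk_div6 = set_field(0xFFFFFFFF, shift=28, width=4, value=3)  # CLK_DIV6[31:28] = 3

dmc0_ratio = (clk_div6 >> 28) & 0xF
sclk_dmc0 = 667 / (dmc0_ratio + 1)  # SCLK_DMC0 = MOUTDMC0 / (DMC0_RATIO + 1)

print(hex(clk_src6), hex(clk_div6), sclk_dmc0)  # 0xfdffffff 0x3fffffff 166.75
```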

An LED flashing routine was added to help detect stages of execution by flashing a certain pattern at each stage. The implementation of this routine is simple enough to be left unexplained.
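
For the curious, the data register manipulation in flash_led can be modelled in a couple of lines. This Python sketch assumes the four LEDs sit on GPJ2[3:0] and are active low, which is what the "turn them all off" comment next to the ORR with 0xF suggests; the function names are mine.

```python
LED_MASK = 0xF  # the four LED bits in GPJ2DAT


def gpj2dat_for_pattern(dat, pattern):
    """Mirror flash_led: set all four bits (LEDs off), then clear the bits in pattern (BIC)."""
    return (dat | LED_MASK) & ~pattern & 0xFFFFFFFF


def lit_leds(dat):
    """List the LED indices whose bit is low, i.e. lit under the active-low assumption."""
    return [bit for bit in range(4) if not (dat >> bit) & 1]


print(lit_leds(gpj2dat_for_pattern(0x0, 0x0C)))  # pattern passed before clock init -> [2, 3]
print(lit_leds(gpj2dat_for_pattern(0x0, 0x0F)))  # pattern passed after clock init  -> [0, 1, 2, 3]
```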

Final Notes

Reading section B3.1 About the VMSA of the ARM ARM, you can tell that the documentation of the SoC is required to define exactly what sort of memory management scheme is available to you, at which point you can come back to the ARM ARM and figure out how to deal with it. In a real life situation, this process should go the other way around: you would define what memory management scheme your specific application requires, then decide which SoC to use for your project. In my case, I already have an SoC on a development board, so I had to follow the memory scheme implemented by Samsung in the S5PV210 chip.
There are useful subsystems that you might have noticed if you've been reading the documents I've referenced through this series so far (like you should've!), such as SIMD with NEON, the GPU and the likes. We have had no use for them yet, and we're almost out of the dark realm of system boot up. At this early stage we do not have to worry about these systems, but soon, we shall conquer their horrors.
From section 1.3.5 SECURITY SUBSYSTEM of the S5PV210 User Manual, we can see that the chip has features that indicate an implementation of the Security Extensions, an ARM extension that significantly impacts many subsystems, including the memory subsystem.

Conclusion

We're finally done with setting up (arguably) the most basic subsystem in our processor, the clock subsystem. I planned to include the initialization of the memory system, specifically SDRAM, but it would have doubled the size of this post and made things a bit too complicated, so the memory initialization was pushed into the next post. All in all, this wasn't too rough of a ride, though a bit boring. As a software engineer, I regard dealing with registers and hardware components as a necessary evil, and I personally long for the comfort of high level programming whenever I am going through the initial assembly code that forces me to go back and forth between cryptic documents that try so hard to outsmart the reader, or so it appears. The good news is that you do not always have to go through this pain. Someone out there must have done this before, so you'd be better off salvaging bits and pieces of code from here and there. You still need to understand what you're doing though.


(1) A fancy name for a set of branch instructions listed in eight consecutive word-aligned memory locations.
(2) This example is given under the assumption that SRAM is wired through cache. In many cases, the processor would have a scratchpad RAM that is in full sync with the CPU, or at least as fast as cache. This scratchpad memory would not be wired through cache, because it is fast enough to keep up with the CPU, but the user manual for the S5PV210 is not very clear about the internal wiring of its components.
(3) You might think that the fact that we will disable cache in BL1 should be enough reason to ignore cache invalidation. Read section 7.2.3 Cache disabled behavior in the Cortex-A8 Technical Reference Manual to know why you thought wrong.
(4) Coprocessor 15 is a built-in coprocessor which provides control over many processor features, including cache and MMU. (shamelessly copied from the programmer guide document)
(5) Failing to invalidate the data cache never caused me any distress; however, the documentation recommends it. I am yet to investigate the reasons why, and will take a deeper look at it some day.