korena's blog

16. Bootloader implementation - part 6

In this post, we're going to start parsing ARP requests that are sent to us from the host machine, and respond to them with our MAC address. Let's start with a simple primer on what ARP is all about.

What ARP is all about

Instead of going the Wikipedia way and explaining everything, we'll go straight to the point. Say you have two nodes on the same network, node1 and node2, where node1 runs a program that needs to communicate with another program running on node2. Both nodes should first have properly configured IP addresses (network layer addresses). This part is handled by mechanisms such as static IP configuration or DHCP leasing. To communicate, node1 sends an ARP broadcast over the network, and trusts that node2 is listening to this broadcast. Assuming node2 has the address 10.0.0.2 and node1 has the address 10.0.0.20, node1 would say in the ARP request:

"Who has 10.0.0.2 ? tell 10.0.0.20"

It also includes its own MAC address so that node2 is able to respond properly. If you remember, MAC addresses are part of the basic structure of a link layer frame, and have to exist for proper communication using any higher level protocol. Node2 then catches this ARP request (it actually catches all ARP requests broadcast on the network), and compares its own IP address to the IP address (10.0.0.2) in the ARP request. Since they match, it saves a mapping of node1's network layer and link layer addresses in a format like:

IPv4 Address    MAC Address
10.0.0.20       [MAC addr of node1]

There's no hard requirement for this format; how the mapping is structured is implementation specific. After saving this data, node2 proceeds to send back an ARP response to node1 using the information it found in the original ARP request, so node1 gets the MAC address of node2 and moves on to actually exchange useful data with it.
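
As a rough illustration of such a mapping, here is a minimal sketch of a fixed-size table in C. The names arp_entry, arp_cache_put and arp_cache_get, as well as the table size, are made up for this example and are not part of the bootloader code; a real stack would also age entries out.

```c
#include <stdint.h>
#include <string.h>

#define ARP_CACHE_SIZE 8	/* arbitrary size chosen for this sketch */

struct arp_entry {
	uint32_t ip;		/* IPv4 address, in any consistent byte order */
	uint8_t  mac[6];	/* link layer (MAC) address */
	int      used;		/* non-zero when the slot holds a mapping */
};

struct arp_entry arp_cache[ARP_CACHE_SIZE];

/* Store or update the mapping ip -> mac. Returns 0 on success, -1 if full. */
int arp_cache_put(uint32_t ip, const uint8_t mac[6])
{
	int i, free_slot = -1;

	for (i = 0; i < ARP_CACHE_SIZE; i++) {
		if (arp_cache[i].used && arp_cache[i].ip == ip) {
			memcpy(arp_cache[i].mac, mac, 6);	/* refresh existing entry */
			return 0;
		}
		if (!arp_cache[i].used && free_slot < 0)
			free_slot = i;				/* remember first free slot */
	}
	if (free_slot < 0)
		return -1;

	arp_cache[free_slot].ip = ip;
	memcpy(arp_cache[free_slot].mac, mac, 6);
	arp_cache[free_slot].used = 1;
	return 0;
}

/* Look up the MAC for ip. Returns a pointer to 6 bytes, or NULL if unknown. */
const uint8_t *arp_cache_get(uint32_t ip)
{
	int i;

	for (i = 0; i < ARP_CACHE_SIZE; i++)
		if (arp_cache[i].used && arp_cache[i].ip == ip)
			return arp_cache[i].mac;
	return NULL;
}
```

In the scenario above, node2 would call arp_cache_put with 10.0.0.20 and node1's MAC address when the request arrives, and consult arp_cache_get before sending anything back.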

At the driver level

Our driver implementation will not change much, since we are already capable of receiving data; our task today is to actually make sense of it. There is one major change though, which is the way we pass information from the driver, up through the eth.c abstraction layer, to the main network loop in net.c. Unlike how things are done in u-boot, which is the base for our ported code, we're going to break the modularization concept by skipping the abstraction layer when it comes to processing received data, and make a direct call to net.c. This was not strictly necessary, but was done just to get past this point. The reason for this change is that in the best case scenario, where there aren't too many packets flying around in the network, my development board took about 23 ms to respond to an ARP request from the host machine, and missed about 20% of requests. This is mainly because of the pointless loop in my int eth_rx(void) function in eth.c:

/* Process up to 32 packets at one time */
	for (i = 0; i < 32; i++) {
		/*net_rx_packets is the universal packets holster,
		 *the ethernet device will place received packets 
		 *there.
		 * */
//		print_format("calling recv of ethernet device ...\n\r");
		ret = eth_current->recv(eth_current);
		if (ret > 0)
			net_process_received_packet(net_rx_packets[0], ret);
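		/* The original `if (ret >= 0 && 1)` lost its body when the print
		 * under it was commented out, so it swallowed the check below and
		 * the loop only stopped on ret == 0. It is dropped here. */
		if (ret <= 0)
			break;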
	}

This loop assumes a different driver structure than the one we ported from u-boot, one that follows the newer driver ecosystem in u-boot. Looking at this function, you can see that it assumes a buffer of some 32 packets that is filled at the driver level before this function is called. Since only one packet is passed from our driver implementation to the processing layer, the loop basically burns 31 iterations unnecessarily for each packet the ethernet chip picks up from the network, and the more noise the network has, the more waste this loop produces. We could have simply rewritten this function, but I went on and called net_process_received_packet (from net.c) directly in the dm9000_rx driver function:

/*
 *   Received a packet and pass to upper layer
 *   */
static int dm9000_rx(struct eth_device *netdev)
{
	uint8_t rxbyte, *rdptr = (uint8_t *) net_rx_packets[0];
	uint16_t RxStatus=0, RxLen = 0;
	struct board_info *db = &dm9000_info;

	/* Check packet ready or not, we must check
	 * 	   the ISR status first for DM9000A */

	if (!(DM9000_ior(DM9000_ISR) & 0x01)){ /* Rx-ISR bit must be set. */
		return 0;
	}

	DM9000_iow(DM9000_ISR, 0x01); /* clear PR status latched in bit 0 */

	/* There is _at least_ 1 package in the fifo, read them all */
	for (;;) {
		DM9000_ior(DM9000_MRCMDX);	/* Dummy read */
		/* Get most updated data,
		 * 		   only look at bits 0:1, See application notes DM9000 */
		rxbyte = DM9000_inb(DM9000_DATA) & 0x03;
		/* Status check: this byte must be 0 or 1 */
		if (rxbyte > DM9000_PKT_RDY) {
			print_format("DM9000 error: status check fail: 0x%x\n\r",
					rxbyte);
			DM9000_iow(DM9000_RCR, 0x00);	/* Stop Device */
			DM9000_iow(DM9000_ISR, 0x80);	/* Stop INT request */
			// reset device ...
			dm9000_reset();
			dm9000_start();
			return 0;
		}

		if (rxbyte != DM9000_PKT_RDY){
#ifdef CONFIG_DM9000_DEBUG
			print_format("no packets received\n\r");
			dm9000_dump_regs();
#endif
			return 0; /* No packet received, ignore */
		}


#ifdef CONFIG_DM9000_DEBUG
		DM9000_DBG("receiving packet\n\r");
#endif
		/* A packet ready now  & Get status/length */
		(db->rx_status)(&RxStatus, &RxLen);
		
#ifdef CONFIG_DM9000_DEBUG
		DM9000_DBG("rx status: 0x%x rx len: %d\n\r", RxStatus, RxLen);
#endif
		/* Move data from DM9000 */
		/* Read received packet from RX SRAM */
		(db->inblk)(rdptr, RxLen);
#ifdef CONFIG_DM9000_DEBUG
		DM9000_DBG("net_rx_packets filled ...\n\r");
#endif

		if ((RxStatus & 0xbf00) || (RxLen < 0x40)
				|| (RxLen > DM9000_PKT_MAX)) {
			if (RxStatus & 0x100) {
				print_format("rx fifo error\n\r");
#ifdef CONFIG_DM9000_DEBUG
				dm9000_dump_regs();
				dm9000_dump_eth_frame(rdptr,RxLen);
#endif
			}
			if (RxStatus & 0x200) {
				print_format("rx crc error\n\r");
			}
			if (RxStatus & 0x8000) {
				print_format("rx length error\n\r");
			}
			if (RxLen > DM9000_PKT_MAX) {
				print_format("rx length too big\n\r");
				dm9000_reset();
				dm9000_start();
#ifdef CONFIG_DM9000_DEBUG
				dm9000_dump_regs();
#endif
			}
		} else {
			net_process_received_packet(net_rx_packets[0], RxLen);
		}
	}
	return 0;
}

This is bad design. I broke the pattern for no good reason; the problem could have been solved more elegantly, but I chose not to. It's worth noting that since the DM9000AEP chip is quite old, and so is this driver, even u-boot's code breaks the pattern the way I did here.

At the network main loop level

At this point, we'll start following the execution path from the entry point of our network stack, in bootloader.c:

start_linux(void)
{
        /*Irrelevant stuff above...*/
	print_format("Setting up timers next ...\n\r");
	init_timer();	
	net_loop(ARP);
        /*Irrelevant stuff below...*/
}

Passing ARP as the protocol we'd like to handle guarantees that our main net loop inside net.c will handle ARP requests. This is a preliminary setup; in a really functional network stack, ARP has to be one of those invisible things that is always handled, regardless of what we're asking the network stack to do. The networking module function net_loop(enum proto_t protocol) kicks off the initialization process detailed in the previous post at line 14 (the call to net_init()):

/*
 * We want this to handle ARP and TFTP, nothing more!
 * */
int net_loop(enum proto_t protocol)
{
	int ret = -EINVAL;

	net_restarted = 0;
	net_dev_exists = 0;
	net_try_count = 1;
	print_format("--- net_loop Entry\n\r");

	print_format("eth_start, calling net_init\n\r");
	net_init();
restart:
	net_set_state(NETLOOP_CONTINUE);

	/*
	 *	Start the ball rolling with the given start function.  From
	 *	here on, this code is a state machine driven by received
	 *	packets and timer events.
	 */
	print_format("--- net_loop Init\n\r");

	switch (net_check_prereq(protocol)) {
		case 1:
			/* network not configured */
			print_format("-ERR network not configured, halting device ...\n\r");
			eth_halt();
			return -ENODEV;

		case 2:
			/* network device not configured */
			print_format("-ERR network device not configured ...\n\r");
			break;

		case 0:
			net_dev_exists = 1;
			net_boot_file_size = 0;
			print_format("acting on chosen protocol ...\n\r");
			break;
	}

	/*
	 *	Main packet reception loop.  Loop receiving packets until
	 *	someone sets `net_state' to a state that terminates.
	 */
	print_format("looping and polling ethernet recv ...\n\r");
	int ethStatus ;
	for (;;) {
		/*
		 *	Check the ethernet for a new packet.  The ethernet
		 *	receive routine will process it.
		 *	Most drivers return the most recent packet size, but not
		 *	errors that may have happened.
		 */
		ethStatus = eth_rx();

				if (arp_timeout_check() > 0) {
				    time_start = get_timer(0);
				}

		/*
		 *	Check for a timeout, and run the timeout handler
		 *	if we have one.
		 */
				if (time_handler &&
				    ((get_timer(time_start)) > time_delta)) {
					thand_f *x;
					print_format("--- net_loop timeout\n\r");
					x = time_handler;
					time_handler = (thand_f *)0;
					(*x)();
				}

		if (net_state == NETLOOP_FAIL)
			ret = net_start_again();

		switch (net_state) {
			case NETLOOP_RESTART: 
				net_restarted = 1;
				goto restart;

			case NETLOOP_SUCCESS:
				if (net_boot_file_size > 0) {
					print_format("Bytes transferred = %d (%x hex)\n\r",
							net_boot_file_size, net_boot_file_size);
				}
				if (protocol != NETCONS)
					eth_halt();
			//	eth_set_last_protocol(protocol);

				ret = net_boot_file_size;
				print_format("--- net_loop Success!\n\r");
				goto done;

			case NETLOOP_FAIL:
				print_format("--- net_loop Fail!\n");
				goto done;

			case NETLOOP_CONTINUE:
				continue;
		}
	}

done:
	return ret;
}

In order to not repeat ourselves here, we'll focus on ARP initialization.

ARP initialization

Starting with net_init function:

void net_init(void)
{
	static int first_call = 1;
	if (first_call) {
		/*
		 *	Setup packet buffers, aligned correctly.
		 */
		int i;

		net_ip = string_to_ip("10.0.0.2");	 
		net_server_ip = string_to_ip("10.0.0.20");
		net_tx_packet = &net_pkt_buf[0] + (PKTALIGN - 1);
		net_tx_packet -= (unsigned long)net_tx_packet % PKTALIGN;
		for (i = 0; i < PKTBUFSRX; i++) {
			net_rx_packets[i] = net_tx_packet +
				(i + 1) * PKTSIZE_ALIGN;
		}
		print_format("initializing ARP ...\n\r");
		arp_init();
		print_format("Done initializing ARP ...\n\r");
		//	net_clear_handlers();
		/* Only need to setup buffer pointers once. */
		first_call = 0;
	}

	/*initialize ethernet device*/
	if(dm9000_initialize() == 0){
		/* device is initialized! register generic listener for requests...*/
		print_format("dm9000_initialization complete.\n\r");
		eth_current->state = ETH_STATE_ACTIVE;
		return;
	}else{
		print_format("net_init failed at ethernet device initialization, aborting.\n\r");
		return; 
	}

}

We start by statically setting our own IP address to 10.0.0.2, and the server's (host machine) IP address to 10.0.0.20. As you may expect, the host machine has to reflect this setting: it has to actually own the address 10.0.0.20 for this to work. At this point, we have to discuss an important concept: network endianness vs host endianness.

The function string_to_ip is called to convert the string representation of an IP address to a 32-bit unsigned type. It looks like this:

struct in_addr string_to_ip(const char *s)
{
	struct in_addr addr;
	//	unsigned char *e;
	char *e;
	int i;

	addr.s_addr = 0;
	if (s == NULL)
		return addr;

	for (addr.s_addr = 0, i = 0; i < 4; ++i) {
		uint32_t val = s ? simple_strtoul(s, &e, 10) : 0;
		addr.s_addr <<= 8;
		addr.s_addr |= (val & 0xFF);
		if (s) {
			s = (*e) ? e+1 : e;
		}
	}

	addr.s_addr = htonl(addr.s_addr);
	return addr;
}

NOTE: in_addr is just a wrapping structure for a uint32_t.

Since the string format of an IP address is known, you can follow the logic of the for loop on your own. There are two calls to note in the body of this function. The first is the call to simple_strtoul, which converts a number represented as an array of characters to an unsigned long integer. The second is the call to htonl(uint32_t addr), which converts the passed 32-bit integer from the board's endianness (little endian) to the standard network endianness (big endian). This is a very important step, and you can read more about it Here.

In short, whatever data you receive from the network is in big endian format. If your machine uses little endian to represent its data, you have to pass received data through a filter that converts from network endianness to your own machine's endianness, and do the reverse for data you send. You do that by implementing the following functions:

uint32_t htonl (uint32_t x)
{
#if BYTE_ORDER == BIG_ENDIAN
	return x;
#elif BYTE_ORDER == LITTLE_ENDIAN
	return __bswap_32(x);
#else
# error "What kind of system is this?"
#endif
}

uint32_t ntohl (uint32_t x) __attribute__  ((weak,alias("htonl")));


uint16_t htons (uint16_t x)
{
#if BYTE_ORDER == BIG_ENDIAN
	return x;
#elif BYTE_ORDER == LITTLE_ENDIAN
	return __bswap_16(x);
#else
# error "What kind of system is this?"
#endif
}

uint16_t ntohs (uint16_t x) __attribute__  ((weak,alias("htons")));
Endianness handlers

These functions handle 32-bit and 16-bit data respectively, and in our case they rely on the __bswap_32 and __bswap_16 byte-swap helpers that come with our GCC toolchain.
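
To make the byte swapping concrete, here is a small host-side sketch that does the same job with plain shifts and masks instead of toolchain helpers. The names swap32, swap16, my_htonl and my_htons are hypothetical and only exist for this example; the endianness is detected at run time here rather than with the BYTE_ORDER macro used above.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t x)
{
	return ((x & 0x000000FFu) << 24) |
	       ((x & 0x0000FF00u) << 8)  |
	       ((x & 0x00FF0000u) >> 8)  |
	       ((x & 0xFF000000u) >> 24);
}

/* Reverse the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t x)
{
	return (uint16_t)((x << 8) | (x >> 8));
}

/* Returns 1 on a little endian host, 0 on a big endian one. */
int host_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const uint8_t *)&probe == 1;
}

/* Host to network order: swap only when the host is little endian. */
uint32_t my_htonl(uint32_t x)
{
	return host_is_little_endian() ? swap32(x) : x;
}

uint16_t my_htons(uint16_t x)
{
	return host_is_little_endian() ? swap16(x) : x;
}
```

For 10.0.0.2, string_to_ip builds the value 0x0A000002, and after the host to network conversion the bytes sit in memory as 0A 00 00 02 regardless of the host's endianness, which is exactly what goes on the wire.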

After setting up IP addresses in net_init, we move on to define some properties for the tx buffer. What you see happening in this segment is specific to DMA cache-line boundary alignment. From u-boot's documentation:

Buffer Requirements:
- Any buffer that is invalidated(that is, typically the peripheral to memory DMA buffer) should be aligned to cache-line boundary both at the beginning and at the end of the buffer.
- If the buffer is not cache-line aligned invalidation will be restricted to the aligned part. That is, one cache-line at the respective boundary may be left out while doing invalidation.
- A suitable buffer can be alloced on the stack using the ALLOC_CACHE_ALIGN_BUFFER macro.

To us, DMA is just not necessary, so we don't really care about this stuff. Next, we're calling arp_init(), which resides in arp.c:

void arp_init(void){
	arp_wait_packet_ethaddr = NULL;	 
	net_arp_wait_packet_ip.s_addr = 0 ; 
	net_arp_wait_reply_ip.s_addr = 0;
	arp_wait_tx_packet_size = 0;
	arp_tx_packet = &arp_tx_packet_buf[0]+(PKTALIGN - 1);
	arp_tx_packet -= (unsigned long) arp_tx_packet % PKTALIGN;
}

Again, we're initializing some buffers. The usefulness of arp_wait_packet_ethaddr is not obvious, but it was kept for fear of breaking things, as there's a comment claiming it's a fix for some bss issue. net_arp_wait_packet_ip is indeed useful: it's used to save a packet's destination IP before making an ARP request broadcast, so that when the ARP response is received, we can check if net_arp_wait_packet_ip has something in it, which would mean some packet is waiting to be transferred, and we can then restore and send this packet. We'll discuss this in more detail in the next section. The function goes on to initialize buffers with the same cache-line boundary alignment we talked about above.
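
The pointer arithmetic used for net_tx_packet and arp_tx_packet is just "round this address up to the next multiple of PKTALIGN". A minimal sketch of that idea, with a hypothetical helper name align_up that does not exist in the ported code:

```c
#include <stdint.h>

/*
 * Round ptr up to the next address that is a multiple of align.
 * align must be greater than zero. If ptr is already aligned it is
 * returned unchanged.
 */
uint8_t *align_up(uint8_t *ptr, uintptr_t align)
{
	uintptr_t misalignment = (uintptr_t)ptr % align;

	if (misalignment == 0)
		return ptr;
	return ptr + (align - misalignment);
}
```

This is why the backing arrays have to be declared a bit larger than the packets they hold: up to align - 1 bytes at the start may be skipped.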

Once execution returns from net_init() to net_loop(), it proceeds to set the state of the network loop to NETLOOP_CONTINUE. This is used as an execution control that gets the for(;;) loop to keep going, never breaking out until the net_state variable changes.

Immediately after, a call to net_check_prereq function is made, net_check_prereq checks prerequisite conditions for the intended protocol being used in the current execution, such as whether we have a network driver registered and active. Depending on the return value of this function, we proceed or return with an error. We then jump into the for loop.

We now introduce a newly initialized piece of hardware, a timestamp timer. In timer.c, we added:

void init_timer(void){
	uint32_t TCFG0_PRESCALER1_255 = (0xFD << 8); // FE + default 01 = 0xFF = 255
	uint32_t TCFG1_MUX4_16 = (0b0100 << 16);
	uint32_t TCNTB4_MAX_COUNT = 0xFFFFFFFF;

	*TCFG0	|= TCFG0_PRESCALER1_255;
	*TCFG1  |= TCFG1_MUX4_16;
	*TCNTB4  = TCNTB4_MAX_COUNT;
	
#if DEBUG_TIM
	print_format("TCFG0 = 0x%x\n\rTCFG1 = 0x%x\n\rTCNTB4 = 0x%x\n\rTCNTO4 = 0x%x\n\r",
		*TCFG0,
		*TCFG1,
		*TCNTB4,
		*TCNTO4);
#endif
	*PWMTCON   |= (1 << 21); // set manual update bit to load TCNTB4 (bit 22, interval mode, is left untouched)
	udelay(10);
	*PWMTCON   &= ~(1 << 21); // clear update TCNTB4
	udelay(10);
	*PWMTCON   |= (1 << 20); // start timer 4
#if DEBUG_TIM
	print_format("TCON = 0x%x\n\r",*PWMTCON);
#endif
}

/**
* returns increments in multiples of ~0.062 ms
* since timer 4 was started
*/
static uint32_t get_tick(void){
	return *TCNTO4;
}


/**
* returns increments in multiples of ~1 ms since base time.
* this function has an accumulating error of about 5% for every second, 
* but acceptable for timeouts in the network stack.
*/
uint32_t get_timer(uint32_t base){
	return (base == 0?(get_tick()/16UL):base - (get_tick()/16UL));
}
newly initialized timer

This timer differs from our previously initialized timer in that it provides timestamps. Basically, it runs for a very long time before it resets (days!), and whenever we peek into its counter register, we can tell how many ticks have passed since the timer was started. The time between two consecutive ticks depends on our configuration, so let's get to that.

Figure 1-2 PWM TIMER CLOCK Tree Diagram shows what needs to be configured for PWM timers to be used

For Timer 4, you can choose to take PCLK, pass it through prescaler 1, then through a divider (1/1, 1/2, 1/4, 1/8, 1/16), into the MUX that selects the clock source fed into the control block. Or you can choose to run the whole thing from SCLK_PWM, and set up the MUX to use that clock input directly.

To initialize a timer, as stated in section 1.3.5 INITIALIZE TIMER (SETTING MANUAL-UP DATA AND INVERTER):

  1. write the initial value into TCNTBn and TCMPBn (TCMPB not applicable to timer 4)
  2. set the manual update bit and clear only manual update bit of corresponding timer
  3. set the start bit of the corresponding timer to start the timer.

It's also recommended to set the inverter bit regardless of whether it's used or not (not sure if this is applicable to timer 4).

Registers involved:

TCFG0 (Specifies the Timer Configuration Register 0 that configures the two 8-bit Prescaler and DeadZone Length) (0xE250_0000)

TCFG1 (Specifies the Timer Configuration Register 1 that controls 5 MUX Select Bit, use this to configure the MUX) (0xE250_0004)

TCON (I presume this is where I get to start and stop the timer) (0xE250_0008)

TCNTB4 (Timer 4 Count Buffer Register) (0xE250_003C)

TCNTO4 (Timer 4 Count Observation Register) (0xE250_0040)

TINT_CSTAT (Timer interrupt control and status register)

From section 1.5.1.1 Timer Configuration Register (TCFG0, R/W, Address = 0xE250_0000) :

Timer Input Clock Frequency = PCLK / (prescaler_value + 1)/ {divider_value}

where :

prescaler_value = 1~255 (?)
divider_value = 1, 2, 4, 8, 16, TCLK

bits [15:8] of register TCFG0 control the Prescaler_value (prescaler 1).

from 1.5.1.2 Timer Configuration Register (TCFG1, R/W, Address = 0xE250_0004)

bits [19:16] control the MUX input for Timer 4:
0000 = 1/1
0001 = 1/2
0010 = 1/4
0011 = 1/8
0100 = 1/16
0101 = SCLK_PWM

from 1.5.1.3 Timer Control Register (TCON, R/W, Address = 0xE250_0008)

the following concern timer 4:

TCON bit [22]
0 = One-Shot
1 = Interval Mode(Auto-Reload)

TCON bit [21]
0 = No Operation
1 = Update TCNTB4

TCON bit [20]
0 = Stop
1 = Start Timer 4

No need for interrupt stuff, we're only interested in observing the value in TCNTO4
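
The three initialization steps and the TCON bits listed above can be put together as in the following sketch. It takes the register addresses as parameters so it can be exercised against plain variables on a host machine; the macro and function names are made up for this example, and the short delays used in the real init_timer() are left out.

```c
#include <stdint.h>

/* TCON bits that concern timer 4, as listed above. */
#define TCON_T4_START		(1u << 20)
#define TCON_T4_MANUAL_UPDATE	(1u << 21)
#define TCON_T4_AUTO_RELOAD	(1u << 22)

/*
 * Load `count` into timer 4 and start it, leaving every other TCON bit
 * (including the auto-reload bit) as it was.
 */
void timer4_start(volatile uint32_t *tcon, volatile uint32_t *tcntb4,
		  uint32_t count)
{
	*tcntb4 = count;			/* 1. write the initial value     */
	*tcon |= TCON_T4_MANUAL_UPDATE;		/* 2. set the manual update bit   */
	*tcon &= ~TCON_T4_MANUAL_UPDATE;	/*    and clear only that bit     */
	*tcon |= TCON_T4_START;			/* 3. set the start bit           */
}
```

On the board, the two pointers would be the TCON and TCNTB4 registers at 0xE250_0008 and 0xE250_003C.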

Calculations:

the lowest frequency we can achieve using PCLK (66MHz) is

freq = 66MHz / (254 + 1) / (16) = 16176.47 Hz

that's a period of 1 / 16176.47 Hz = 6.18e-05 s, or about 0.062 ms per tick

the TCNTB4 register is 32 bits wide, which means it holds a maximum value of 0xFFFFFFFF (4294967295)

so for the timer to overflow, it has to count through all that.

the counter of this timer will decrement this value by 1 every 0.062 ms, but that's not a very convenient number, so
let's see how many counts will happen when 1 ms has passed (how many 0.062 ms intervals fit in 1 ms)

1 / 0.062 = 16.1290322581 , or just 16 because we don't care about precision.

this means that every 16 counts of timer 4's counter, we know one millisecond has passed.

so let's figure out how many milliseconds will pass before the counter overflows, by dividing 4294967295/16, which gives us
268435455, that is the number of milliseconds that will pass before our counter overflows, or 268435.455 seconds. We can live with that, that's actually three long days, and completely unnecessary, but let's just go with it.
We'll configure the timer, and get ticks by looking at TCNTO4, then calculate the passage of time in milliseconds to provide meaningful timestamps using the get_timer() function. That's basically what is happening in the code above, divided between three functions.
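
The arithmetic above is easy to double check on a host machine. The following sketch reproduces it; the function names are hypothetical, and floating point is used only because this is meant to run on a PC, not on the board.

```c
#include <stdint.h>

/* Timer input clock in Hz: PCLK / (prescaler_value + 1) / divider_value */
double timer_input_hz(double pclk_hz, unsigned prescaler, unsigned divider)
{
	return pclk_hz / (prescaler + 1) / divider;
}

/* Whole timer ticks per millisecond (the fractional part is dropped). */
uint32_t ticks_per_ms(double input_hz)
{
	return (uint32_t)(input_hz / 1000.0);
}

/*
 * Milliseconds it takes a 32-bit down-counter loaded with 0xFFFFFFFF
 * to count all the way down, given the number of ticks per millisecond.
 */
uint32_t ms_until_wrap(uint32_t ticks_ms)
{
	return 0xFFFFFFFFu / ticks_ms;
}
```

With pclk_hz = 66000000, prescaler = 254 and divider = 16, the input clock comes out a little above 16176 Hz and ticks_per_ms gives 16, matching the division by 16 in get_timer().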

Why are we using get_timer()

The reason we need timestamps is to configure timeouts and retries for protocol handlers. If something goes wrong, we want to be able to recover, perhaps restart an operation or print out a message, rather than keep trying without actually analyzing the situation.
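
Here is a minimal sketch of how a polled, one-shot timeout can be built on top of millisecond timestamps. It is a simplified stand-in for the time_handler / time_start / time_delta trio used in net_loop(), with made-up names, and it assumes a timestamp that counts upwards in milliseconds and is passed in by the caller.

```c
#include <stdint.h>
#include <stddef.h>

typedef void timeout_fn(void);

timeout_fn *pending_handler;	/* NULL when no timeout is armed */
uint32_t    timeout_start_ms;	/* timestamp taken when arming   */
uint32_t    timeout_delta_ms;	/* how long to wait before firing */

int timeout_fired;		/* counts handler runs, for demonstration */

/* Example handler: a real one would retransmit or give up. */
void on_timeout(void)
{
	timeout_fired++;
}

/* Arm a one-shot timeout that fires once more than delta_ms have passed. */
void set_timeout(uint32_t now_ms, uint32_t delta_ms, timeout_fn *fn)
{
	pending_handler  = fn;
	timeout_start_ms = now_ms;
	timeout_delta_ms = delta_ms;
}

/* Call once per loop iteration. Returns 1 if the handler ran. */
int poll_timeout(uint32_t now_ms)
{
	if (pending_handler &&
	    (uint32_t)(now_ms - timeout_start_ms) > timeout_delta_ms) {
		timeout_fn *fn = pending_handler;

		pending_handler = NULL;	/* disarm first, the handler may re-arm */
		fn();
		return 1;
	}
	return 0;
}
```

Disarming before calling the handler mirrors what net_loop() does with time_handler, and lets the handler install a fresh timeout for a retry.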

The next section of net_loop() deals with the timeout setup. It's easy to follow, so I'm not going to get into it here. We'll instead move on to the switch statement, which controls the execution path of the whole network module. Depending on the net_state enum, it decides whether to restart some operation because of a failure, continue to loop while waiting for some operation to finish and set net_state, or terminate on success. Remember that there's no multi-threading at work here; everything is sequential.

Having reached the end of the net_loop function, you may be wondering, where the heck is the call to the ARP handler? Well, since we've set up the ethernet driver and hooked its dm9000_rx function to the net_process_received_packet function in net.c, and because we're polling the driver for new data (within net_loop), any ARP request targeted at us will be captured by net_process_received_packet and handled accordingly. Here's net_process_received_packet:


void net_process_received_packet(unsigned char *in_packet, int len)
{
	struct ethernet_hdr *et;
	struct ip_udp_hdr *ip;
	struct in_addr dst_ip;
	struct in_addr src_ip;
	int eth_proto;
	unsigned short cti = 0, 
		       vlanid = VLAN_NONE, 
		       myvlanid, 
		       mynvlanid; 
	print_format("packet received\n\r"); 
	net_rx_packet = in_packet; 
	net_rx_packet_len = len; 
	et = (struct ethernet_hdr *)in_packet;
#ifdef NET_DEBUG
	//TODO: This is where you should check if the ethernet header is actually populated with 
	//      proper data, do this by dumping 
	if(et){
		print_format("pointer et is not NULL\n\r");
		print_format("uint8_t et_dest[0] =0x%x \n\r",et->et_dest[0]);
		print_format("uint8_t et_dest[1] =0x%x \n\r",et->et_dest[1]);
		print_format("uint8_t et_dest[2] =0x%x \n\r",et->et_dest[2]);
		print_format("uint8_t et_dest[3] =0x%x \n\r",et->et_dest[3]);
		print_format("uint8_t et_dest[4] =0x%x \n\r",et->et_dest[4]);
		print_format("uint8_t et_dest[5] =0x%x \n\r",et->et_dest[5]);
		print_format("uint8_t et_src[0] = 0x%x \n\r",et->et_src[0]);
		print_format("uint8_t et_src[1] = 0x%x \n\r",et->et_src[1]);
		print_format("uint8_t et_src[2] = 0x%x \n\r",et->et_src[2]);
		print_format("uint8_t et_src[3] = 0x%x \n\r",et->et_src[3]);
		print_format("uint8_t et_src[4] = 0x%x \n\r",et->et_src[4]);
		print_format("uint8_t et_src[5] = 0x%x \n\r",et->et_src[5]);
		print_format("uint16_t et_protlen = 0x%x\n\r",et->et_protlen);
	}else{
		print_format("the pointer et is NULL\n\r");
		while(1); // something is wrong, and you need to fix the bug.
	}
#endif
	/* too small packet? */
	if (len < ETHER_HDR_SIZE)
		return;


	myvlanid = ntohs(net_our_vlan);
	if (myvlanid == (unsigned short)-1)
		myvlanid = VLAN_NONE;

	mynvlanid = ntohs(net_native_vlan);
	if (mynvlanid == (unsigned short)-1)
		mynvlanid = VLAN_NONE;

	eth_proto = ntohs(et->et_protlen);

	if (eth_proto < 1514) {
		struct e802_hdr *et802 = (struct e802_hdr *)et;
		/*
		 *	Got a 802.2 packet.  Check the other protocol field.
		 *	XXX VLAN over 802.2+SNAP not implemented!
		 */
		eth_proto = ntohs(et802->et_prot);
		ip = (struct ip_udp_hdr *)(in_packet + E802_HDR_SIZE);
		len -= E802_HDR_SIZE;

	} else if (eth_proto != PROT_VLAN) {
		/* normal packet */
		ip = (struct ip_udp_hdr *)(in_packet + ETHER_HDR_SIZE); // padding into the rx_packet by ethernet hdr size ...
		len -= ETHER_HDR_SIZE;
#ifdef NET_DEBUG
		print_format("header length and version: 0x%x\n\r",ip->ip_hl_v);
		print_format("type of service: 0x%x\n\r",ip->ip_tos);
		print_format("total length: 0x%x\n\r",ip->ip_len);
		print_format("identification: 0x%x\n\r",ip->ip_id);
		print_format("fragment offset field: 0x%x\n\r",ip->ip_off);
		print_format("time to live: 0x%x\n\r",ip->ip_ttl);
		print_format("protocol: 0x%x\n\r",ip->ip_p);
		print_format("checksum: 0x%x\n\r",ip->ip_sum);

		print_format("UDP source port: 0x%x\n\r",ip->udp_src);
		print_format("UDP dest port: 0x%x\n\r",ip->udp_dst);
		print_format("length of UDP packet: 0x%x\n\r",ip->udp_len);
		print_format("UDP checksum: 0x%x\n\r",ip->udp_xsum);
		
		// dump the whole IP thing !
//		for(int zip=0;zip<len;zip++){
//			print_format("B_%d = 0x%x\n\r",zip,*((uint8_t*)ip+zip));
//		}

#endif
	} else {			/* VLAN packet */
		struct vlan_ethernet_hdr *vet =
			(struct vlan_ethernet_hdr *)et;

		print_format("VLAN packet received\n\r");
		/* too small packet? */
		if (len < VLAN_ETHER_HDR_SIZE)
			return;

		if ((ntohs(net_our_vlan) & VLAN_IDMASK) == VLAN_NONE)
			return;

		cti = ntohs(vet->vet_tag);
		vlanid = cti & VLAN_IDMASK;
		eth_proto = ntohs(vet->vet_type);

		ip = (struct ip_udp_hdr *)(in_packet + VLAN_ETHER_HDR_SIZE);
		len -= VLAN_ETHER_HDR_SIZE;

	}	

	print_format("Receive from protocol 0x%x\n\r", eth_proto);

	if ((myvlanid & VLAN_IDMASK) != VLAN_NONE) {
		if (vlanid == VLAN_NONE)
			vlanid = (mynvlanid & VLAN_IDMASK);
		/* not matched? */
		if (vlanid != (myvlanid & VLAN_IDMASK))
			return;
	}

	switch (eth_proto) {
		case PROT_ARP:
			arp_receive(et, ip, len);
			break;
		default:
			print_format("I promise to never, ever happen :-)\n\r");
	}

}

After going through all the checks for a healthy packet, we get to line 120 (the switch on eth_proto), which is configured to only handle PROT_ARP, and discards all other packets with an often broken promise.
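
Stripped of the VLAN and 802.2 handling, the dispatch boils down to reading the two EtherType bytes that follow the destination and source MAC addresses. A minimal sketch, with hypothetical names, that reads them byte by byte so that neither alignment nor host endianness matters:

```c
#include <stdint.h>
#include <stddef.h>

#define ETH_HDR_LEN	14	/* 6 bytes dest + 6 bytes src + 2 bytes type */
#define ETHERTYPE_IP	0x0800
#define ETHERTYPE_ARP	0x0806

/*
 * Returns the EtherType of a plain (untagged) Ethernet II frame in host
 * order, or -1 if the frame is too short to hold an Ethernet header.
 */
int frame_ethertype(const uint8_t *frame, size_t len)
{
	if (len < ETH_HDR_LEN)
		return -1;
	/* The field is big endian on the wire: high byte first. */
	return (frame[12] << 8) | frame[13];
}
```

A caller would compare the result against ETHERTYPE_ARP and hand the rest of the frame (starting at offset ETH_HDR_LEN) to the ARP handler.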

Handling received ARP requests

void arp_receive(struct ethernet_hdr *et, struct ip_udp_hdr *ip, int len)
{
	struct arp_hdr *arp;
	struct in_addr reply_ip_addr;
	unsigned char *pkt;
	int eth_hdr_size;

	/*
	 * We have to deal with two types of ARP packets:
	 * - REQUEST packets will be answered by sending  our
	 *   IP address - if we know it.
	 * - REPLY packets are expected only after we asked
	 *   for the TFTP server's or the gateway's ethernet
	 *   address; so if we receive such a packet, we set
	 *   the server ethernet address
	 */

	arp = (struct arp_hdr *)ip;
	if (len < ARP_HDR_SIZE) {
		print_format("bad length %d < %d\n\r", len, ARP_HDR_SIZE);
		return;
	}

	if (ntohs(arp->ar_hrd) != ARP_ETHER)
		return;
	if (ntohs(arp->ar_pro) != PROT_IP)
		return;
	if (arp->ar_hln != ARP_HLEN)
		return;
	if (arp->ar_pln != ARP_PLEN)
		return;

	if (net_ip.s_addr == 0)
		return;

	if (net_read_ip(&arp->ar_tpa).s_addr != net_ip.s_addr)
		return;

	switch (ntohs(arp->ar_op)) {
		case ARPOP_REQUEST:
			/* reply with our IP address */

			pkt = (unsigned char *)et;
			eth_hdr_size = net_update_ether(et, et->et_src, PROT_ARP);
			pkt += eth_hdr_size;
			arp->ar_op = htons(ARPOP_REPLY);
			ul_memcpy(&arp->ar_tha, &arp->ar_sha, ARP_HLEN);
			net_copy_ip(&arp->ar_tpa, &arp->ar_spa);
			ul_memcpy(&arp->ar_sha, net_ethaddr, ARP_HLEN);
			net_copy_ip(&arp->ar_spa, &net_ip);

			net_send_packet((unsigned char *)et, eth_hdr_size + ARP_HDR_SIZE);
			return;

		case ARPOP_REPLY:		/* arp reply */
			/* are we waiting for a reply */
			if (!net_arp_wait_packet_ip.s_addr)
				break;


			reply_ip_addr = net_read_ip(&arp->ar_spa);

			/* matched waiting packet's address */
			if (reply_ip_addr.s_addr == net_arp_wait_reply_ip.s_addr) {
				print_format("Got ARP REPLY, set eth addr\n\r");
				/* save address for later use */
				if (arp_wait_packet_ethaddr != NULL)
					ul_memcpy(arp_wait_packet_ethaddr,
							&arp->ar_sha, ARP_HLEN);


				/* set the mac address in the waiting packet's header
				   and transmit it */
				ul_memcpy(((struct ethernet_hdr *)net_tx_packet)->et_dest,
						&arp->ar_sha, ARP_HLEN);
				net_send_packet(net_tx_packet, arp_wait_tx_packet_size);

				/* no arp request pending now */
				net_arp_wait_packet_ip.s_addr = 0;
				arp_wait_tx_packet_size = 0;
				arp_wait_packet_ethaddr = NULL;
			}
			return;
		default:
			print_format("Unexpected ARP opcode 0x%x\n",
					ntohs(arp->ar_op));
			return;
	}
}

The arp_receive function handles two types of received ARP packets: a request received from an initiating network node (someone is asking about us), and a reply received as a result of an ARP request initiated by us (we're asking about someone). After a couple of checks, the function goes into a switch statement to decide what to do with the received packet. The actual process of responding to each case is rather dull, except maybe for sending out a packet that was waiting for the MAC address of its recipient, in which case the function sets the recipient's MAC address in the waiting packet and sends it off. This is because, as stated earlier, ARP needs to be handled in the background: you won't generally make an ARP request for the sake of making an ARP request, but rather to do something more useful with the network, like sending a TFTP request as we will see in a coming post. You can follow the code yourself, but let's look at some interesting points, starting with the calls to the ul_memcpy function.
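
To see what the ARPOP_REQUEST branch does to the packet, here is a sketch that performs the same field shuffling on a raw 28-byte ARP payload, using byte offsets instead of struct arp_hdr. The names are hypothetical, and rewriting the Ethernet header (the job of net_update_ether above) is left out.

```c
#include <stdint.h>
#include <string.h>

/* Byte offsets inside an Ethernet/IPv4 ARP payload (28 bytes in total). */
enum {
	ARP_OFF_OP  = 6,	/* opcode, 2 bytes, big endian    */
	ARP_OFF_SHA = 8,	/* sender hardware address, 6     */
	ARP_OFF_SPA = 14,	/* sender protocol address, 4     */
	ARP_OFF_THA = 18,	/* target hardware address, 6     */
	ARP_OFF_TPA = 24,	/* target protocol address, 4     */
	ARP_PAYLOAD_LEN = 28
};

/*
 * Turn an ARP request into the matching reply, in place: the requester
 * becomes the target, and we become the sender.
 */
void arp_make_reply(uint8_t *arp, const uint8_t my_mac[6],
		    const uint8_t my_ip[4])
{
	arp[ARP_OFF_OP]     = 0x00;
	arp[ARP_OFF_OP + 1] = 0x02;			/* opcode 2 = reply */

	memcpy(arp + ARP_OFF_THA, arp + ARP_OFF_SHA, 6);	/* requester MAC */
	memcpy(arp + ARP_OFF_TPA, arp + ARP_OFF_SPA, 4);	/* requester IP  */
	memcpy(arp + ARP_OFF_SHA, my_mac, 6);		/* our MAC */
	memcpy(arp + ARP_OFF_SPA, my_ip, 4);		/* our IP  */
}
```

The order matters: the sender fields have to be copied into the target fields before they are overwritten with our own addresses.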

A brief explanation of aligned access is due at this point, since we're using a lot of memcpy to move data around.
Unaligned word access is trouble on ARM (depending on the core and how it is configured, it either faults or is simply not guaranteed to work), so no pointer should be dereferenced as a word without ensuring it is aligned to 4 bytes. Take the following example:


static inline struct in_addr net_read_ip(void *from)
{
	struct in_addr ip;
	if(((uint32_t)from & 0x3)){
		print_format("[NOT COOL] alignment issue detected\n\r");
			// droping to byte alignment and living with it
		 uint8_t* _from = (uint8_t *)from;	
			uint32_t aligned_ul = 0;	
			for(int al_i=0;al_i<4;al_i++){
				print_format("appending value 0x%x\n\r",_from[al_i]);
				aligned_ul |= (((uint32_t)*(_from+al_i)) << (al_i == 0?24:(24-8*al_i)));
				print_format(" (uint32_t) _from[%d] << %d = 0x%x\n\r",
				al_i,
				(al_i == 0?24:(24-8*al_i)),(((uint32_t)*(_from+al_i)) << (al_i == 0?24:(24-8*al_i))));
			}
		print_format("The aligned_ul value is 0x%x\n\r",aligned_ul);
		ip.s_addr = aligned_ul; 
	}else{
		memcpy((void *)&ip, (void *)from, sizeof(ip));
	}
	return ip;
}

The function net_read_ip above checks whether the argument pointer from is word aligned by inspecting whether the address ends with the binary bits 00 (or simply, whether it divides evenly by 4). If that's not the case, the function creates a byte pointer and accesses the contents behind from one byte at a time, copying each byte into the unsigned long variable aligned_ul. The reason for this is that, as mentioned, the processor should never try to load a word from a memory address that is not word aligned. This piece of code could have taken us a long way, but for the sake of getting past this issue without further trouble, I opted for implementing, or rather copying, bl2/asm/memcpy.S from the Linux kernel (or U-boot, can't remember), and using it to handle the alignment issues instead of relying on my compiler's internal implementation of memcpy, which took these alignment issues too lightly and caused trouble.
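
For comparison, here is a small sketch of an alignment-safe 32-bit read that copies the four bytes in memory order, so the result keeps the same byte layout as the source (network order stays network order). The function names are made up for this example and this is not the code used in the bootloader.

```c
#include <stdint.h>

/* Returns 1 if p sits on a 4-byte boundary, 0 otherwise. */
int is_word_aligned(const void *p)
{
	return ((uintptr_t)p & 0x3u) == 0;
}

/*
 * Read 4 bytes from a possibly unaligned address using byte accesses
 * only. The bytes land in `value` in the same order they have in
 * memory, so no endianness conversion happens here.
 */
uint32_t read_u32_unaligned(const void *from)
{
	const uint8_t *src = (const uint8_t *)from;
	uint32_t value;
	uint8_t *dst = (uint8_t *)&value;
	int i;

	for (i = 0; i < 4; i++)
		dst[i] = src[i];
	return value;
}
```

In an untagged Ethernet frame the ARP target IP field starts 38 bytes into the frame (14 bytes of Ethernet header plus 24 bytes into the ARP payload), so with a word aligned packet buffer that field is never word aligned, which is why this kind of care is needed.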

So the calls to ul_memcpy are there to use the assembly code I copied from Linux (or U-Boot) instead of the memcpy implementation that ships with my compiler. The whole if-else block in the listing above could be replaced with a single call to ul_memcpy, which looks like the following if you're interested:


/*
 *  linux/arch/arm/lib/memcpy.S
 *
 *  Author:	Nicolas Pitre
 *  Created:	Sep 28, 2005
 *  Copyright:	MontaVista Software, Inc.
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License version 2 as
 *  published by the Free Software Foundation.
 */

//#include <linkage.h>
#include <assembler.h>

#if defined(CONFIG_SYS_THUMB_BUILD) && !defined(MEMCPY_NO_THUMB_BUILD)
#define W(instr)	instr.w
#else
#define W(instr)	instr
#endif

#define LDR1W_SHIFT	0
#define STR1W_SHIFT	0

	.macro ldr1w ptr reg abort
	W(ldr) \reg, [\ptr], #4
	.endm

	.macro ldr4w ptr reg1 reg2 reg3 reg4 abort
	ldmia \ptr!, {\reg1, \reg2, \reg3, \reg4}
	.endm

	.macro ldr8w ptr reg1 reg2 reg3 reg4 reg5 reg6 reg7 reg8 abort
	ldmia \ptr!, {\reg1, \reg2, \reg3, \reg4, \reg5, \reg6, \reg7, \reg8}
	.endm

	.macro ldr1b ptr reg cond=al abort
	ldrb\cond\() \reg, [\ptr], #1
	.endm

	.macro str1w ptr reg abort
	W(str) \reg, [\ptr], #4
	.endm

	.macro str8w ptr reg1 reg2 reg3 reg4 reg5 reg6 reg7 reg8 abort
	stmia \ptr!, {\reg1, \reg2, \reg3, \reg4, \reg5, \reg6, \reg7, \reg8}
	.endm

	.macro str1b ptr reg cond=al abort
	strb\cond\() \reg, [\ptr], #1
	.endm

	.macro enter reg1 reg2
	stmdb sp!, {r0, \reg1, \reg2}
	.endm

	.macro exit reg1 reg2
	ldmfd sp!, {r0, \reg1, \reg2}
	.endm

	.text

/* Prototype: void *memcpy(void *dest, const void *src, size_t n); */
	.syntax unified
#if defined(CONFIG_SYS_THUMB_BUILD) && !defined(MEMCPY_NO_THUMB_BUILD)
	.thumb
	.thumb_func
#endif


.global ul_memcpy

ul_memcpy:
		cmp	r0, r1
		moveq	pc, lr

		enter	r4, lr

		subs	r2, r2, #4
		blt	8f
		ands	ip, r0, #3
	PLD(	pld	[r1, #0]		)
		bne	9f
		ands	ip, r1, #3
		bne	10f

1:		subs	r2, r2, #(28)
		stmfd	sp!, {r5 - r8}
		blt	5f

	CALGN(	ands	ip, r0, #31		)
	CALGN(	rsb	r3, ip, #32		)
	CALGN(	sbcsne	r4, r3, r2		)  @ C is always set here
	CALGN(	bcs	2f			)
	CALGN(	adr	r4, 6f			)
	CALGN(	subs	r2, r2, r3		)  @ C gets set
	CALGN(	add	pc, r4, ip		)

	PLD(	pld	[r1, #0]		)
2:	PLD(	subs	r2, r2, #96		)
	PLD(	pld	[r1, #28]		)
	PLD(	blt	4f			)
	PLD(	pld	[r1, #60]		)
	PLD(	pld	[r1, #92]		)

3:	PLD(	pld	[r1, #124]		)
4:		ldr8w	r1, r3, r4, r5, r6, r7, r8, ip, lr, abort=20f
		subs	r2, r2, #32
		str8w	r0, r3, r4, r5, r6, r7, r8, ip, lr, abort=20f
		bge	3b
	PLD(	cmn	r2, #96			)
	PLD(	bge	4b			)

5:		ands	ip, r2, #28
		rsb	ip, ip, #32
#if LDR1W_SHIFT > 0
		lsl	ip, ip, #LDR1W_SHIFT
#endif
		addne	pc, pc, ip		@ C is always clear here
		b	7f
6:
		.rept	(1 << LDR1W_SHIFT)
		W(nop)
		.endr
		ldr1w	r1, r3, abort=20f
		ldr1w	r1, r4, abort=20f
		ldr1w	r1, r5, abort=20f
		ldr1w	r1, r6, abort=20f
		ldr1w	r1, r7, abort=20f
		ldr1w	r1, r8, abort=20f
		ldr1w	r1, lr, abort=20f

#if LDR1W_SHIFT < STR1W_SHIFT
		lsl	ip, ip, #STR1W_SHIFT - LDR1W_SHIFT
#elif LDR1W_SHIFT > STR1W_SHIFT
		lsr	ip, ip, #LDR1W_SHIFT - STR1W_SHIFT
#endif
		add	pc, pc, ip
		nop
		.rept	(1 << STR1W_SHIFT)
		W(nop)
		.endr
		str1w	r0, r3, abort=20f
		str1w	r0, r4, abort=20f
		str1w	r0, r5, abort=20f
		str1w	r0, r6, abort=20f
		str1w	r0, r7, abort=20f
		str1w	r0, r8, abort=20f
		str1w	r0, lr, abort=20f

	CALGN(	bcs	2b			)

7:		ldmfd	sp!, {r5 - r8}

8:		movs	r2, r2, lsl #31
		ldr1b	r1, r3, ne, abort=21f
		ldr1b	r1, r4, cs, abort=21f
		ldr1b	r1, ip, cs, abort=21f
		str1b	r0, r3, ne, abort=21f
		str1b	r0, r4, cs, abort=21f
		str1b	r0, ip, cs, abort=21f

		exit	r4, pc

9:		rsb	ip, ip, #4
		cmp	ip, #2
		ldr1b	r1, r3, gt, abort=21f
		ldr1b	r1, r4, ge, abort=21f
		ldr1b	r1, lr, abort=21f
		str1b	r0, r3, gt, abort=21f
		str1b	r0, r4, ge, abort=21f
		subs	r2, r2, ip
		str1b	r0, lr, abort=21f
		blt	8b
		ands	ip, r1, #3
		beq	1b

10:		bic	r1, r1, #3
		cmp	ip, #2
		ldr1w	r1, lr, abort=21f
		beq	17f
		bgt	18f


		.macro	forward_copy_shift pull push

		subs	r2, r2, #28
		blt	14f

	CALGN(	ands	ip, r0, #31		)
	CALGN(	rsb	ip, ip, #32		)
	CALGN(	sbcsne	r4, ip, r2		)  @ C is always set here
	CALGN(	subcc	r2, r2, ip		)
	CALGN(	bcc	15f			)

11:		stmfd	sp!, {r5 - r9}

	PLD(	pld	[r1, #0]		)
	PLD(	subs	r2, r2, #96		)
	PLD(	pld	[r1, #28]		)
	PLD(	blt	13f			)
	PLD(	pld	[r1, #60]		)
	PLD(	pld	[r1, #92]		)

12:	PLD(	pld	[r1, #124]		)
13:		ldr4w	r1, r4, r5, r6, r7, abort=19f
		mov	r3, lr, lspull #\pull
		subs	r2, r2, #32
		ldr4w	r1, r8, r9, ip, lr, abort=19f
		orr	r3, r3, r4, lspush #\push
		mov	r4, r4, lspull #\pull
		orr	r4, r4, r5, lspush #\push
		mov	r5, r5, lspull #\pull
		orr	r5, r5, r6, lspush #\push
		mov	r6, r6, lspull #\pull
		orr	r6, r6, r7, lspush #\push
		mov	r7, r7, lspull #\pull
		orr	r7, r7, r8, lspush #\push
		mov	r8, r8, lspull #\pull
		orr	r8, r8, r9, lspush #\push
		mov	r9, r9, lspull #\pull
		orr	r9, r9, ip, lspush #\push
		mov	ip, ip, lspull #\pull
		orr	ip, ip, lr, lspush #\push
		str8w	r0, r3, r4, r5, r6, r7, r8, r9, ip, , abort=19f
		bge	12b
	PLD(	cmn	r2, #96			)
	PLD(	bge	13b			)

		ldmfd	sp!, {r5 - r9}

14:		ands	ip, r2, #28
		beq	16f

15:		mov	r3, lr, lspull #\pull
		ldr1w	r1, lr, abort=21f
		subs	ip, ip, #4
		orr	r3, r3, lr, lspush #\push
		str1w	r0, r3, abort=21f
		bgt	15b
	CALGN(	cmp	r2, #0			)
	CALGN(	bge	11b			)

16:		sub	r1, r1, #(\push / 8)
		b	8b

		.endm


		forward_copy_shift	pull=8	push=24

17:		forward_copy_shift	pull=16	push=16

18:		forward_copy_shift	pull=24	push=8


/*
 * Abort preamble and completion macros.
 * If a fixup handler is required then those macros must surround it.
 * It is assumed that the fixup code will handle the private part of
 * the exit macro.
 */

	.macro	copy_abort_preamble
19:	ldmfd	sp!, {r5 - r9}
	b	21f
20:	ldmfd	sp!, {r5 - r8}
21:
	.endm

	.macro	copy_abort_end
	ldmfd	sp!, {r4, pc}
	.endm


Using functions like memcpy required modifying the Makefile to link against rdimon (newlib's semihosting library), which provides the syscall implementations the linker was complaining about.

We're not planning on sending any useful requests out yet, so we'll assume that the switch statement in the arp_receive function always hits case ARP_REQUEST. That case responds with our own MAC address by calling net_send_packet, which in turn just calls eth_send from eth.c:

int eth_send(void *packet, int length){
	int ret;
	if (!eth_current)
		return -ENODEV;

	if (!(eth_current->state == ETH_STATE_ACTIVE))
		return -EINVAL;
	ret = eth_current->send(eth_current, packet, length);
	if (ret < 0) {
		/* We cannot completely return the error at present */
		print_format(": send() returned error %d\n\r",ret);
	}
	return ret;
}

On line 8, the send function of our DM9000 driver is called with the packet and its length. Here's the send function from dm9000.c:

/*
 * Hardware start transmission.
 * Send a packet to media from the upper layer.
 */
static int dm9000_send(struct eth_device *netdev, volatile void *packet,
		int length)
{
	int tmo;
	struct board_info *db = &dm9000_info;

	//DM9000_DMP_PACKET(__func__ , packet, length);

	DM9000_iow(DM9000_ISR, IMR_PTM); /* Clear Tx bit in ISR */

	/* Move data to DM9000 TX RAM */
	DM9000_outb(DM9000_MWCMD, DM9000_IO); /* Prepare for TX-data */
#ifdef CONFIG_DM9000_DEBUG
	dm9000_dump_eth_frame(packet,length);
#endif

	/* push the data to the TX-fifo */
	(db->outblk)(packet, length);

	/* Set TX length to DM9000 */
	DM9000_iow(DM9000_TXPLL, length & 0xff);
	DM9000_iow(DM9000_TXPLH, (length >> 8) & 0xff);

	/* Issue TX polling command */
	DM9000_iow(DM9000_TCR, TCR_TXREQ); /* Cleared after TX complete */

	/* wait for end of transmission */
	tmo = get_timer(0) + 5 * CONFIG_SYS_HZ;
	while ( !(DM9000_ior(DM9000_NSR) & (NSR_TX1END | NSR_TX2END)) ||
			!(DM9000_ior(DM9000_ISR) & IMR_PTM) ) {
		// transmission will never timeout (get_timer stubbed) ...
		if (get_timer(0) >= tmo) {
			print_format("transmission timeout\n\r");
			break;
		}
	}
	DM9000_iow(DM9000_ISR, IMR_PTM); /* Clear Tx bit in ISR */
#ifdef CONFIG_DM9000_DEBUG
	DM9000_DBG("transmit done\n\r\n\r");
#endif
	return 0;
}

The comments are enough to explain what this function does. At this point our packet is on the wire, headed towards whoever sent us the ARP request, and our blog post is about to conclude!

Test run

Now that we have everything in place, let's get to testing. After flashing the code, I turned the power on and ran the following command in the terminal:

korena@korena-solid:~$ sudo arping -c 20 -I enp12s0 -s 10.0.0.20 10.0.0.2

enp12s0 is my interface name, because Arch Linux said so. As for the rest: -c 20 sends 20 probes, -I selects the interface, and -s sets the source IP address; the man page has the details.

The results I get:

korena@korena-solid~$ sudo arping -c 20 -I enp12s0 -s 10.0.0.20 10.0.0.2
[sudo] password for korena: 
ARPING 10.0.0.2 from 10.0.0.20 enp12s0
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.715ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.759ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.737ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.738ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.731ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.728ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.725ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.735ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.730ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.727ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.735ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.731ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.728ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.718ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.734ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.731ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.729ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.726ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.730ms
Unicast reply from 10.0.0.2 [00:12:34:56:80:49]  0.741ms
Sent 20 probes (1 broadcast(s))
Received 20 response(s)

Conclusion

We've crossed a good milestone in this post: we now know our network stack is capable of responding to requests. In the next post, we'll make the leap towards setting up TFTP and loading the Linux kernel, then executing its decompressor until it fails with the unrecognized machine error we saw before, when we loaded the kernel from MMC to SDRAM a couple of posts ago.
If you stuck around to read this line, you're awesome!