Coding in the early days
Consider the early days of computing: programs had to be written in machine language (zeroes and ones), using punch cards.
Since the CPU only understands binary, a valid instruction as expected by the CPU would look like this: 0011 0000 0001 0111 (for a 16-bit CPU)
Why Assembly language?
When computers started to have keyboards and monitors, we could for the first time type code into the computer and run it - this led to Assembly.
-> every CPU instruction got a “code word”, i.e. an assembly mnemonic, which could be written as text in the computer
Then a dedicated program - the Assembler - reads this text and replaces every occurrence of a mnemonic, e.g. ADD, with the corresponding zeroes and ones representing the machine language for this instruction.
- addl $0x13, %eax # Add a given value (0x13) to eax (a register, but the destination could also be a memory location, e.g. a variable)
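The mnemonic-to-bits replacement described above can be sketched in a few lines of Python. This is only a toy: the CPU, its instruction layout (4-bit opcode, 4-bit register number, 8-bit immediate) and the opcode values are all invented for illustration and have nothing to do with real x86 encodings. The values are picked so that the result equals the 16-bit pattern quoted at the start of these notes.

```python
# Toy assembler for a made-up 16-bit CPU (NOT x86).
# Hypothetical layout: 4-bit opcode | 4-bit register number | 8-bit immediate.
OPCODES = {"MOV": 0b0001, "SUB": 0b0010, "ADD": 0b0011}  # invented values


def assemble_line(line: str) -> str:
    """Translate e.g. 'ADD r0, 23' into a 16-bit string of zeroes and ones."""
    mnemonic, operands = line.split(maxsplit=1)
    reg_text, imm_text = (part.strip() for part in operands.split(","))
    opcode = OPCODES[mnemonic.upper()]
    reg = int(reg_text.lstrip("rR"))
    imm = int(imm_text, 0)  # base 0 accepts both 23 and 0x17
    if not (0 <= reg < 16 and 0 <= imm < 256):
        raise ValueError(f"operand out of range in: {line!r}")
    word = (opcode << 12) | (reg << 8) | imm
    bits = f"{word:016b}"
    # Group into blocks of four bits for readability.
    return " ".join(bits[i:i + 4] for i in range(0, 16, 4))


print(assemble_line("ADD r0, 23"))  # prints 0011 0000 0001 0111
```

A real assembler does the same kind of table lookup, only with the much more irregular encoding rules of the actual CPU.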
What are the Advantages and Disadvantages of Assembly?
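Much easier for humans to understand than sequences of 0s and 1s
Same performance as hand-written machine code, since each mnemonic maps directly to one CPU instruction
Today: code generated from a higher-level programming language is at best as efficient as well-written assembly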
Need an additional program for translating assembly to machine language
Every CPU architecture only understands its own instruction set
“Language pitfall”: while in English the programming language itself is called Assembly, in German it is referred to as Assembler (like the program which actually translates assembly to machine code)
What are the motivations for Assembly in 2023?
This counters the argument that high-level languages exist now and nobody uses assembly anymore:
Acquire a deeper understanding of programming, computers and hardware
Optimization: for critical routines, hand-written assembly can still beat the output of a compiler/code generator in terms of speed and efficiency (this is still done, e.g. in operating system code)
Embedded systems often require the use of assembly, due to special programming requirements of the hardware, novel ICs, etc.
Reverse Engineering: the art of converting a binary program (unless it’s open source, a program is usually distributed solely in its binary form, e.g. an .exe file consisting entirely of machine language) back to an assembly representation of the code, which we can use to work out what a piece of code actually does.
Security/Exploitation: Security researchers need assembly all the time, either in order to exploit applications (e.g. buffer overflows) or to understand what applications actually do. The “bad guys” need assembly too, e.g. to crack software or remove copy protection
How is an executable program built & run?
Programming: The developer creates one or multiple source files, containing syntactically valid assembly code
Assembling: The Assembler takes these files as input and creates object code, which is a direct translation of assembly to machine code. Each source file leads to a separate object file.
Linking: The Linker takes all object files, and possibly other already assembled object files (e.g. external software libraries), in order to create the executable program
Running: When running the program, the dynamic linker can load additional runtime libraries into memory which have not been statically linked into the program (e.g. DLLs or shared objects)
What are Object code & Executable Object code?
Object Code & Executables
Object code is simply a sequence of bytes, encoding a series of instructions executed by the machine (machine language)
Types of Object Files - Relocatable
Often called “object code” - contains binary machine code and data in a form that can be combined with other relocatable object files to create an executable file. A relocatable file is created for each source file and corresponds 1:1 to this source. Relocatable files cannot be executed and are only used for further processing during the build process.
Types of Object Files - Executable
Contains binary machine code and data in a form that can be copied directly into memory and executed by the Operating System (the CPU just has to set the instruction pointer register to the starting address). Each Operating System defines by itself how an executable file has to be structured (e.g. where is the entry point? How is it stored which/how much memory is needed? etc.) - Linux uses the ELF format, Windows uses PE (PE32), macOS uses Mach-O
Types of Object Files - Shared
A special type of relocatable file that can be loaded into memory and linked dynamically at load time or run time of another program. Typically used for software libraries, often called DLLs (dynamic link libraries) or shared objects.
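The file formats and object file types listed above can be told apart by looking at the first bytes of a file. The following Python sketch checks the well-known magic numbers of ELF, PE and Mach-O and, for ELF, reads the e_type field at byte offset 16 of the header. The function name describe is made up for this example, and the checks are deliberately minimal (e.g. a PE file is only recognized by its leading "MZ" stub).

```python
import struct

MACHO_MAGICS = {
    bytes.fromhex("feedface"), bytes.fromhex("cefaedfe"),  # 32-bit, both byte orders
    bytes.fromhex("feedfacf"), bytes.fromhex("cffaedfe"),  # 64-bit, both byte orders
}
# Values of the ELF e_type field. Position-independent executables
# also report type 3, so "shared object" is not always a library.
ELF_TYPES = {1: "relocatable", 2: "executable", 3: "shared object"}


def describe(header: bytes) -> str:
    """Classify a file from its first bytes (18 or more are needed for ELF)."""
    if header[:4] == b"\x7fELF":
        if len(header) < 18:
            return "ELF (truncated)"
        # Byte 5 of the header: 1 = little-endian, 2 = big-endian
        byte_order = "<" if header[5] == 1 else ">"
        (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
        return "ELF " + ELF_TYPES.get(e_type, "other")
    if header[:4] in MACHO_MAGICS:
        return "Mach-O"
    if header[:2] == b"MZ":
        return "PE"
    return "unknown"


def describe_file(path: str) -> str:
    """Read the start of a file on disk and classify it."""
    with open(path, "rb") as f:
        return describe(f.read(18))
```

On a Linux system, describe_file("exit.o") should report a relocatable ELF file for an object file produced by the assembler, assuming such a file exists in the current directory.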
What is Linking?
Linking is the process of taking one or multiple relocatable or shared object files, combining their machine code, and creating a final executable file according to the format and structure of a particular operating system (e.g. an .exe file for Windows)
What are the two Types of Linking?
Static linking: Linking is done at build time (i.e. all code needed to execute the final program is included directly in the executable) - should be used only in very specific cases.
Dynamic linking: Linking is done at run-time, after loading the program into memory but before starting it. This enables us to use dynamic libraries (.dll on Windows) or shared objects (.so on Linux/Unix, .dylib on macOS), e.g. so we can use functionality from a library without implementing it ourselves (e.g. getting an HTTP resource)
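Using a function from a dynamically loaded library can be tried out from Python with the standard ctypes module. The sketch below assumes a POSIX system (Linux or macOS), where ctypes.CDLL(None) gives access to the libraries already loaded into the running process, including the C standard library. It will not work like this on Windows.

```python
import ctypes

# Handle to the libraries already loaded into this process (POSIX only).
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# strlen is not implemented anywhere in this script; its machine code
# lives in the shared C library and is looked up by name at run time.
print(libc.strlen(b"assembly"))  # prints 8
```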
First steps in Assembly - exit.s
This is the simplest assembly program: it terminates itself.
This code uses 2 types of instructions:
movl: copies data - from the first argument to the second
int: calls the operating system for help, to perform a task for us
eax and ebx are registers, small CPU-internal memory elements
3 (4) Steps - Assemble, link & run
Step 1: Assemble
For 64-bit systems: as exit.s -o exit.o
Create 32-bit object code on a 64-bit system: as --32 exit.s -o exit.o
Starts the “as” Assembler, which translates the human-readable ASCII text file to machine language
Step 2: Link
For 64-bit systems: ld exit.o -o exit
Create 32-bit executable on 64-bit system: ld -m elf_i386 exit.o -o exit
Starts the “ld” Linker, which merges all object files into one binary (although we currently only have one object file), and creates the executable
An executable file contains more information than just a plain object file (e.g. address relocation, etc.)
Step 3: Run
./exit
Step 4: Check exit code (optional)
echo $?
Note: Every program in Unix has an exit code after termination, which indicates success or failure. Exit code “zero” means everything went fine, while a code other than “zero” indicates something went wrong
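The exit code convention is independent of the language a program is written in. As a small illustration, this Python sketch starts a child process that exits with a chosen code and reads that code back, which is what the shell does for us after a program terminates. The helper name run_and_get_exit_code is made up for this example.

```python
import subprocess
import sys


def run_and_get_exit_code(exit_value: int) -> int:
    """Start a child process that exits with exit_value and return its exit code."""
    child_source = f"import sys; sys.exit({exit_value})"
    completed = subprocess.run([sys.executable, "-c", child_source])
    return completed.returncode


print(run_and_get_exit_code(0))  # prints 0 (success)
print(run_and_get_exit_code(3))  # prints 3 (non-zero: something went wrong)
```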
What are Assembly directives?
- Directives begin with a dot, like .section,
and are not translated into code; they are used by the Assembler to know how to assemble the file and build the corresponding object files.
Data section: .section .data
- Tells the assembler that the memory storage section begins
Text section: .section .text
- Marks the section inside the source where the actual instructions (i.e. code) start
Program entry: .global _start (also written .globl)
- Makes the symbol _start visible to the linker, which uses it as the global entry point of the program (i.e. where to start execution once the program is being run), similar to a “main() method” in Java
Data types: .long (32 bits), .int (32 bits on x86, same as .long), .word (16 bits), .byte (8 bits), .ascii (8 bits per character)
- Declares memory space for the programmer to use, e.g. for variables
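What the data directives reserve in memory can be modelled with Python's struct module. The sketch packs a few example values the way the assembler would lay them out on x86, which stores multi-byte numbers in little-endian order (lowest byte first). The values are arbitrary. The `.word` directive (16 bits in GNU as on x86) is included for comparison.

```python
import struct

# "<" selects little-endian byte order, as used by x86.
long_value = struct.pack("<I", 0x12345678)  # like .long 0x12345678 (4 bytes)
word_value = struct.pack("<H", 0x1234)      # like .word 0x1234     (2 bytes)
byte_value = struct.pack("<B", 0x12)        # like .byte 0x12       (1 byte)
text_value = "Hi!".encode("ascii")          # like .ascii "Hi!"     (1 byte per character)

data_section = long_value + word_value + byte_value + text_value
print(len(data_section))      # prints 10
print(data_section.hex(" "))  # prints 78 56 34 12 34 12 12 48 69 21
```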
What are labels?
Define symbolic locations (e.g. where code can jump to)
End with a colon, e.g. “start_loop:”
“_start:” Actual code start
Almost all operations are done in registers, which hold the values and results - word-sized fast memory elements inside the CPU.
x86 has 6 general-purpose registers (eax, ebx, ecx, edx, esi, edi) which can be used arbitrarily by the programmer. The first 4 can be accessed either using 32 bits, 16 bits or 8 bits
eax - 32 bits
ax - 16 bits (lower eax)
ah - 8 bits (higher ax)
al - 8 bits (lower ax)
esp & ebp can be modified by the programmer, but have a distinct meaning - invalid usage crashes the program
eip cannot be directly modified
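That eax, ax, ah and al are different views onto the same 32 bits can be modelled with bit masks. The class and attribute names below (Register32, full, low16, high8, low8) are made up for this sketch. They stand for eax, ax, ah and al respectively.

```python
class Register32:
    """Model of a 32-bit register such as eax with its ax/ah/al views."""

    def __init__(self, value: int = 0) -> None:
        self.full = value & 0xFFFFFFFF  # eax: all 32 bits

    @property
    def low16(self) -> int:  # ax: bits 0-15
        return self.full & 0xFFFF

    @property
    def high8(self) -> int:  # ah: bits 8-15
        return (self.full >> 8) & 0xFF

    @high8.setter
    def high8(self, value: int) -> None:
        self.full = (self.full & 0xFFFF00FF) | ((value & 0xFF) << 8)

    @property
    def low8(self) -> int:  # al: bits 0-7
        return self.full & 0xFF

    @low8.setter
    def low8(self, value: int) -> None:
        self.full = (self.full & 0xFFFFFF00) | (value & 0xFF)


eax = Register32(0x12345678)
print(hex(eax.low16), hex(eax.high8), hex(eax.low8))  # prints 0x5678 0x56 0x78
eax.high8 = 0x01       # like movb $1, %ah
print(hex(eax.full))   # prints 0x12340178
```

Writing to the 8-bit view changes only those 8 bits. The rest of the register keeps its old value.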
Registers and Values
Registers are prefixed with “%” - movl %eax, %ebx
Immediate values are prefixed with “$” - movl $123, %eax
Instruction suffixes determine the data size the command is operating on
“b” - byte (8 bits)
“w” - word (16 bits)
“l” - long (32 bits)
Example: movl $1, %eax - copies 1 as type long (32 bits) into register EAX
movb $1, %ah - copies 1 as type byte (8 bits) into AH, i.e. bits 8-15 of the EAX register (the upper half of AX)
Comments start with “#”
What are some of the most used instructions?
mov - Copies data from first argument to second argument
int - Creates an interrupt, e.g. for signaling the operating system to do work for us - read a file, or print something on the shell
jmp - Jump somewhere else in the code, instead of executing line by line
cmp - Compares two arguments
je, jne, jg, ... - Conditional jumps, taken depending on the result of the preceding cmp
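How mov, cmp, jmp and the conditional jumps work together with labels can be simulated with a tiny interpreter. This is a simplified Python model and not how a CPU is implemented: it knows only two registers and supports, besides the instructions above, a simple add. Only "equal / not equal" is remembered from a cmp. The example program sums the numbers 1 to 5 in a loop.

```python
def run(program, max_steps=1000):
    """Execute a toy program and return the final register values."""
    # Pass 1: strip labels and remember which instruction each one points at.
    labels, instructions = {}, []
    for item in program:
        if isinstance(item, str):  # a label such as "start_loop:"
            labels[item.rstrip(":")] = len(instructions)
        else:
            instructions.append(item)

    regs = {"eax": 0, "ebx": 0}
    equal = False  # result of the last cmp

    def value(operand):
        # Strings name registers, integers are immediate values.
        return regs[operand] if isinstance(operand, str) else operand

    # Pass 2: execute line by line, unless a jump changes pc.
    pc = 0
    for _ in range(max_steps):
        if pc >= len(instructions):
            return regs
        op, *args = instructions[pc]
        pc += 1
        if op == "mov":
            regs[args[1]] = value(args[0])
        elif op == "add":
            regs[args[1]] += value(args[0])
        elif op == "cmp":
            equal = value(args[0]) == value(args[1])
        elif op == "jmp":
            pc = labels[args[0]]
        elif op == "je":
            if equal:
                pc = labels[args[0]]
        elif op == "jne":
            if not equal:
                pc = labels[args[0]]
        else:
            raise ValueError(f"unknown instruction: {op}")
    raise RuntimeError("step limit reached (endless loop?)")


PROGRAM = [
    ("mov", 0, "eax"),       # movl $0, %eax   (running sum)
    ("mov", 1, "ebx"),       # movl $1, %ebx   (counter)
    "start_loop:",
    ("cmp", 6, "ebx"),       # cmpl $6, %ebx
    ("je", "end_loop"),      # leave the loop once the counter reaches 6
    ("add", "ebx", "eax"),   # addl %ebx, %eax
    ("add", 1, "ebx"),       # addl $1, %ebx
    ("jmp", "start_loop"),
    "end_loop:",
]

print(run(PROGRAM))  # prints {'eax': 15, 'ebx': 6}
```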
What are the four types of basic arguments?
IMM: Immediate values, i.e. actual concrete numeric or text values - e.g. $0x80 (numeric immediates are always prefixed with a “$”)
REG: Name of a register which holds a particular value - %eax
MEM: A reference to a memory cell in main memory. This is always a memory address. This address can either be stored in a register, or referenced via a variable name
LBL: a label in the source code
Fetch data from memory into a register: MEM -> REG
Copy an immediate value to a register: IMM -> REG
Write register content back to memory: REG -> MEM
Note: Each instruction can have at most one memory reference, i.e. we can’t copy data directly from one memory address to another memory address
MEM -> MEM - the CPU can’t do this in a single mov
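The operand rules for mov can be written down as a small checker. The Python sketch below classifies operands by their prefix only, which is a simplification: anything that starts with neither "$" nor "%" is treated as a memory reference. The variable names my_var and other_var are hypothetical.

```python
def operand_kind(operand: str) -> str:
    """Classify an AT&T-syntax operand as IMM, REG or MEM."""
    if operand.startswith("$"):
        return "IMM"
    if operand.startswith("%"):
        return "REG"
    return "MEM"  # e.g. a variable name or an address in a register: (%ebx)


def mov_is_valid(source: str, destination: str) -> bool:
    """Check the operand combination of 'mov source, destination'."""
    src, dst = operand_kind(source), operand_kind(destination)
    if dst == "IMM":
        return False  # a constant cannot be written to
    if src == "MEM" and dst == "MEM":
        return False  # at most one memory reference per instruction
    return True


print(mov_is_valid("my_var", "%eax"))      # MEM -> REG: prints True
print(mov_is_valid("$5", "%eax"))          # IMM -> REG: prints True
print(mov_is_valid("%eax", "my_var"))      # REG -> MEM: prints True
print(mov_is_valid("my_var", "other_var")) # MEM -> MEM: prints False
```

To copy from one memory location to another, the value has to take a detour through a register: first MEM -> REG, then REG -> MEM.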
What are System calls?
On their own, processes can only perform instructions which do not interfere with anything outside themselves. They can do arithmetic computations, read from and write to (allowed) memory regions, compare values and jump within the program code
For everything else, a process needs to perform “system calls”, the only available API (application programming interface) that exists to communicate with the operating system. This is the only way to communicate from user-space to kernel-space
What are the 6 categories of System calls?
Process control - start a program, allocate memory
File management - read or write a file
Device management - read from device
Information - set system time
Communication - send message over network
Protection/Security - set file permissions
How to perform a System call on x86 Linux
Write the desired sys-call number into the eax register - movl $1, %eax
A sys-call may need one or multiple arguments, similar to function arguments (e.g. the process exit code for the “exit” sys-call). Arguments are always stored in the registers %ebx, %ecx, %edx, %esi and %edi - the first argument in %ebx, the second argument in %ecx, and so on.
The process creates an interrupt (int $0x80), which forces the CPU to stop process execution and hand over control to the OS. The OS recognizes the interrupt number 0x80, which is used for sys-calls, and evaluates 1. the requested sys-call number and 2. all necessary arguments in the registers
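The decoding step on the operating system side can be sketched as a simulation. The Python function below is a stand-in for the kernel, not a real system call: it looks at a dictionary of register values, picks the sys-call by the number in eax and collects the arguments from ebx, ecx, edx in order. The numbers 1, 3 and 4 are the 32-bit x86 Linux numbers for exit, read and write. The function name and the small tables are made up for this example.

```python
# Simulation only: a stand-in for what the kernel does on "int $0x80".
SYSCALL_NAMES = {1: "exit", 3: "read", 4: "write"}  # 32-bit x86 Linux numbers
ARGUMENT_REGISTERS = ["ebx", "ecx", "edx", "esi", "edi"]
ARGUMENT_COUNT = {"exit": 1, "read": 3, "write": 3}


def handle_int_0x80(registers: dict) -> tuple:
    """Decode a sys-call request from the register contents."""
    name = SYSCALL_NAMES[registers["eax"]]  # 1. the requested sys-call number
    count = ARGUMENT_COUNT[name]
    # 2. the arguments, taken from ebx, ecx, edx, ... in that order
    arguments = [registers[reg] for reg in ARGUMENT_REGISTERS[:count]]
    return name, arguments


# State after: movl $1, %eax / movl $0, %ebx
print(handle_int_0x80({"eax": 1, "ebx": 0}))  # prints ('exit', [0])
```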