diff --git a/debian/patches/bugfix/all/patch-2.6.24-git7 b/debian/patches/bugfix/all/patch-2.6.24-git8
similarity index 86%
rename from debian/patches/bugfix/all/patch-2.6.24-git7
rename to debian/patches/bugfix/all/patch-2.6.24-git8
index b85871cd7..a931de966 100644
--- a/debian/patches/bugfix/all/patch-2.6.24-git7
+++ b/debian/patches/bugfix/all/patch-2.6.24-git8
@@ -973,6 +973,191 @@ index a2ac6d2..8b49302 100644
 
 	FURTHER INFORMATION
 
+diff --git a/Documentation/debugging-via-ohci1394.txt b/Documentation/debugging-via-ohci1394.txt
+new file mode 100644
+index 0000000..de4804e
+--- /dev/null
++++ b/Documentation/debugging-via-ohci1394.txt
+@@ -0,0 +1,179 @@
++
++  Using physical DMA provided by OHCI-1394 FireWire controllers for debugging
++  ---------------------------------------------------------------------------
++
++Introduction
++------------
++
++Basically all FireWire controllers which are in use today are compliant
++to the OHCI-1394 specification which defines the controller to be a PCI
++bus master which uses DMA to offload data transfers from the CPU and has
++a "Physical Response Unit" which executes specific requests by employing
++PCI-Bus master DMA after applying filters defined by the OHCI-1394 driver.
++
++Once properly configured, remote machines can send these requests to
++ask the OHCI-1394 controller to perform read and write requests on
++physical system memory and, for read requests, send the result of
++the physical memory read back to the requester.
++
++With that, it is possible to debug issues by reading interesting memory
++locations such as buffers like the printk buffer or the process table.
++
++Retrieving a full system memory dump is also possible over FireWire,
++using data transfer rates in the order of 10MB/s or more.
++
++Memory access is currently limited to the low 4G of physical address
++space which can be a problem on IA64 machines where memory is located
++mostly above that limit, but it is rarely a problem on more common
++hardware such as hardware based on x86, x86-64 and PowerPC.
++
++Together with an early initialization of the OHCI-1394 controller for debugging,
++this facility proved most useful for examining long debug logs in the printk
++buffer to debug early boot problems in areas like ACPI where the system
++fails to boot and other means for debugging (serial port) are either not
++available (notebooks) or too slow for extensive debug information (like ACPI).
++
++Drivers
++-------
++
++The OHCI-1394 drivers in drivers/firewire and drivers/ieee1394 initialize
++the OHCI-1394 controllers to a working state and can be used to enable
++physical DMA. By default you only have to load the driver, and physical
++DMA access will be granted to all remote nodes, but it can be turned off
++when using the ohci1394 driver.
++
++Because these drivers depend on the PCI enumeration to be completed, an
++initialization routine which can run pretty early (long before console_init(),
++which makes the printk buffer appear on the console, can be called) was written.
++
++To activate it, enable CONFIG_PROVIDE_OHCI1394_DMA_INIT (Kernel hacking menu:
++Provide code for enabling DMA over FireWire early on boot) and pass the
++parameter "ohci1394_dma=early" to the recompiled kernel on boot.
++
++Tools
++-----
++
++firescope - Originally developed by Benjamin Herrenschmidt; Andi Kleen ported
++it from PowerPC to x86 and x86_64 and added functionality. firescope can now
++be used to view the printk buffer of a remote machine, even with live update.
++
++Bernhard Kaindl enhanced firescope to support accessing 64-bit machines
++from 32-bit firescope and vice versa:
++- ftp://ftp.suse.de/private/bk/firewire/tools/firescope-0.2.2.tar.bz2
++
++and he implemented fast system dump (alpha version - read README.txt):
++- ftp://ftp.suse.de/private/bk/firewire/tools/firedump-0.1.tar.bz2
++
++There is also a gdb proxy for firewire which allows using gdb to access
++data which can be referenced from symbols found by gdb in vmlinux:
++- ftp://ftp.suse.de/private/bk/firewire/tools/fireproxy-0.33.tar.bz2
++
++The latest version of this gdb proxy (fireproxy-0.34) can communicate (not
++yet stable) with kgdb over a memory-based communication module (kgdbom).
++
++Getting Started
++---------------
++
++The OHCI-1394 specification regulates that the OHCI-1394 controller must
++disable all physical DMA on each bus reset.
++
++This means that if you want to debug an issue in a system state where
++interrupts are disabled and where no polling of the OHCI-1394 controller
++for bus resets takes place, you have to establish any FireWire cable
++connections and fully initialize all FireWire hardware __before__ the
++system enters such state.
++
++Step-by-step instructions for using firescope with early OHCI initialization:
++
++1) Verify that your hardware is supported:
++
++   Load the ohci1394 or the fw-ohci module and check your kernel logs.
++   You should see a line similar to
++
++   ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[18]  MMIO=[fe9ff800-fe9fffff]
++   ... Max Packet=[2048]  IR/IT contexts=[4/8]
++
++   when loading the driver. If you have no supported controller, many PCI,
++   CardBus and even some Express cards which are fully compliant to the
++   OHCI-1394 specification are available. If a card requires no driver for
++   Windows operating systems, it most likely is compliant. Only specialized
++   shops have cards which are not compliant; they are based on TI PCILynx
++   chips and require drivers for Windows operating systems.
++
++2) Establish a working FireWire cable connection:
++
++   Any FireWire cable, as long as it provides an electrically and mechanically
++   stable connection and has matching connectors (there are small 4-pin and
++   large 6-pin FireWire ports) will do.
++
++   If a driver is running on both machines you should see a line like
++
++   ieee1394: Node added: ID:BUS[0-01:1023]  GUID[0090270001b84bba]
++
++   on both machines in the kernel log when the cable is plugged in
++   and connects the two machines.
++
++3) Test physical DMA using firescope:
++
++   On the debug host,
++   - load the raw1394 module,
++   - make sure that /dev/raw1394 is accessible,
++   then start firescope:
++
++   $ firescope
++   Port 0 (ohci1394) opened, 2 nodes detected
++
++   FireScope
++   ---------
++   Target :
++   Gen    : 1
++   [Ctrl-T] choose target
++   [Ctrl-H] this menu
++   [Ctrl-Q] quit
++
++   ------> Press Ctrl-T now; the output should be similar to:
++
++   2 nodes available, local node is: 0
++    0: ffc0, uuid: 00000000 00000000 [LOCAL]
++    1: ffc1, uuid: 00279000 ba4bb801
++
++   Besides the [LOCAL] node, it must show another node without an error message.
++
++4) Prepare for debugging with early OHCI-1394 initialization:
++
++   4.1) Kernel compilation and installation on debug target
++
++   Compile the kernel to be debugged with CONFIG_PROVIDE_OHCI1394_DMA_INIT
++   (Kernel hacking: Provide code for enabling DMA over FireWire early on boot)
++   enabled and install it on the machine to be debugged (debug target).
++
++ 4.2) Transfer the System.map of the debugged kernel to the debug host
++
++    Copy the System.map of the kernel to be debugged to the debug host (the
++    host which is connected to the debugged machine over the FireWire cable).
++
++5) Retrieving the printk buffer contents:
++
++   With the FireWire cable connected and the OHCI-1394 driver on the debugging
++   host loaded, reboot the debugged machine, booting the kernel which has
++   CONFIG_PROVIDE_OHCI1394_DMA_INIT enabled, with the option ohci1394_dma=early.
++
++   Then, on the debugging host, run firescope, for example by using -A:
++
++   firescope -A System.map-of-debug-target-kernel
++
++   Note: -A automatically attaches to the first non-local node. It only works
++   reliably if only two machines are connected using FireWire.
++
++   After having attached to the debug target, press Ctrl-D to view the
++   complete printk buffer or Ctrl-U to enter auto update mode and get an
++   updated live view of recent kernel messages logged on the debug target.
++
++   Call "firescope -h" to get more information on firescope's options.
++
++Notes
++-----
++Documentation and specifications: ftp://ftp.suse.de/private/bk/firewire/docs
++
++FireWire is a trademark of Apple Inc. - for more information please refer to:
++http://en.wikipedia.org/wiki/FireWire
 diff --git a/Documentation/dontdiff b/Documentation/dontdiff
 index f2d658a..c09a96b 100644
 --- a/Documentation/dontdiff
@@ -1667,7 +1852,7 @@ index 616043a..649cb87 100644
 +$ find . -name Kconfig\* | xargs grep -ns "depends on.*=.*||.*=" | grep -v orig
 +
 diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
-index c417877..880f882 100644
+index c417877..5d171b7 100644
 --- a/Documentation/kernel-parameters.txt
 +++ b/Documentation/kernel-parameters.txt
 @@ -34,6 +34,7 @@ parameter is applicable:
@@ -1701,7 +1886,84 @@ index c417877..880f882 100644
 	clock=		[BUGS=X86-32, HW] gettimeofday clocksource override.
 			[Deprecated]
 			Forces specified clocksource (if available) to be used
-@@ -1123,6 +1131,10 @@ and is between 256 and 4096 characters. It is defined in the file
It is defined in the file + + gamma= [HW,DRM] + ++ gart_fix_e820= [X86_64] disable the fix e820 for K8 GART ++ Format: off | on ++ default: on ++ + gdth= [HW,SCSI] + See header of drivers/scsi/gdth.c. + +@@ -786,6 +817,16 @@ and is between 256 and 4096 characters. It is defined in the file + for translation below 32 bit and if not available + then look in the higher range. + ++ io_delay= [X86-32,X86-64] I/O delay method ++ 0x80 ++ Standard port 0x80 based delay ++ 0xed ++ Alternate port 0xed based delay (needed on some systems) ++ udelay ++ Simple two microseconds delay ++ none ++ No delay ++ + io7= [HW] IO7 for Marvel based alpha systems + See comment before marvel_specify_io7 in + arch/alpha/kernel/core_marvel.c. +@@ -1051,6 +1092,11 @@ and is between 256 and 4096 characters. It is defined in the file + Multi-Function General Purpose Timers on AMD Geode + platforms. + ++ mfgptfix [X86-32] Fix MFGPT timers on AMD Geode platforms when ++ the BIOS has incorrectly applied a workaround. TinyBIOS ++ version 0.98 is known to be affected, 0.99 fixes the ++ problem by letting the user disable the workaround. ++ + mga= [HW,DRM] + + mousedev.tap_time= +@@ -1123,6 +1169,10 @@ and is between 256 and 4096 characters. It is defined in the file of returning the full 64-bit number. The default is to return 64-bit inode numbers. @@ -1712,7 +1974,25 @@ index c417877..880f882 100644 nmi_watchdog= [KNL,BUGS=X86-32] Debugging features for SMP kernels no387 [BUGS=X86-32] Tells the kernel to use the 387 maths -@@ -1593,7 +1605,13 @@ and is between 256 and 4096 characters. It is defined in the file +@@ -1147,6 +1197,8 @@ and is between 256 and 4096 characters. It is defined in the file + + nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. + ++ noefi [X86-32,X86-64] Disable EFI runtime services support. ++ + noexec [IA-64] + + noexec [X86-32,X86-64] +@@ -1157,6 +1209,8 @@ and is between 256 and 4096 characters. It is defined in the file + register save and restore. The kernel will only save + legacy floating-point registers on task switch. + ++ noclflush [BUGS=X86] Don't use the CLFLUSH instruction ++ + nohlt [BUGS=ARM] + + no-hlt [BUGS=X86-32] Tells the kernel that the hlt +@@ -1593,7 +1647,13 @@ and is between 256 and 4096 characters. It is defined in the file Format: :: (flags are integer value) @@ -1727,6 +2007,18 @@ index c417877..880f882 100644 scsi_mod.scan= [SCSI] sync (default) scans SCSI busses as they are discovered. async scans them in kernel threads, +@@ -1960,6 +2020,11 @@ and is between 256 and 4096 characters. It is defined in the file + vdso=1: enable VDSO (default) + vdso=0: disable VDSO mapping + ++ vdso32= [X86-32,X86-64] ++ vdso32=2: enable compat VDSO (default with COMPAT_VDSO) ++ vdso32=1: enable 32-bit VDSO (default) ++ vdso32=0: disable 32-bit VDSO mapping ++ + vector= [IA-64,SMP] + vector=percpu: enable percpu vector domain + diff --git a/Documentation/kobject.txt b/Documentation/kobject.txt index ca86a88..bf3256e 100644 --- a/Documentation/kobject.txt @@ -5159,6 +5451,54 @@ index d17f324..dcf8bcf 100644 Look at the writable files. Writing 1 to them will enable the corresponding debug option. 
All options can be set on a slab that does +diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt +index 9453118..34abae4 100644 +--- a/Documentation/x86_64/boot-options.txt ++++ b/Documentation/x86_64/boot-options.txt +@@ -110,12 +110,18 @@ Idle loop + + Rebooting + +- reboot=b[ios] | t[riple] | k[bd] [, [w]arm | [c]old] ++ reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old] + bios Use the CPU reboot vector for warm reset + warm Don't set the cold reboot flag + cold Set the cold reboot flag + triple Force a triple fault (init) + kbd Use the keyboard controller. cold reset (default) ++ acpi Use the ACPI RESET_REG in the FADT. If ACPI is not configured or the ++ ACPI reset does not work, the reboot path attempts the reset using ++ the keyboard controller. ++ efi Use efi reset_system runtime service. If EFI is not configured or the ++ EFI reset does not work, the reboot path attempts the reset using ++ the keyboard controller. + + Using warm reset will be much faster especially on big memory + systems because the BIOS will not go through the memory check. +diff --git a/Documentation/x86_64/uefi.txt b/Documentation/x86_64/uefi.txt +index 91a98ed..7d77120 100644 +--- a/Documentation/x86_64/uefi.txt ++++ b/Documentation/x86_64/uefi.txt +@@ -19,6 +19,10 @@ Mechanics: + - Build the kernel with the following configuration. + CONFIG_FB_EFI=y + CONFIG_FRAMEBUFFER_CONSOLE=y ++ If EFI runtime services are expected, the following configuration should ++ be selected. ++ CONFIG_EFI=y ++ CONFIG_EFI_VARS=y or m # optional + - Create a VFAT partition on the disk + - Copy the following to the VFAT partition: + elilo bootloader with x86_64 support, elilo configuration file, +@@ -27,3 +31,8 @@ Mechanics: + can be found in the elilo sourceforge project. + - Boot to EFI shell and invoke elilo choosing the kernel image built + in first step. ++- If some or all EFI runtime services don't work, you can try following ++ kernel command line parameters to turn off some or all EFI runtime ++ services. ++ noefi turn off all EFI runtime services ++ reboot_type=k turn off EFI reboot runtime service diff --git a/Documentation/zh_CN/CodingStyle b/Documentation/zh_CN/CodingStyle new file mode 100644 index 0000000..ecd9307 @@ -7225,10 +7565,22 @@ index 6ae2500..0f5520d 100644 /* Slow path */ spin_lock(lock); diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig -index a04f507..de211ac 100644 +index a04f507..77201d3 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig -@@ -180,8 +180,8 @@ config ARCH_AT91 +@@ -91,6 +91,11 @@ config GENERIC_IRQ_PROBE + bool + default y + ++config GENERIC_LOCKBREAK ++ bool ++ default y ++ depends on SMP && PREEMPT ++ + config RWSEM_GENERIC_SPINLOCK + bool + default y +@@ -180,8 +185,8 @@ config ARCH_AT91 bool "Atmel AT91" select GENERIC_GPIO help @@ -7239,7 +7591,7 @@ index a04f507..de211ac 100644 config ARCH_CLPS7500 bool "Cirrus CL-PS7500FE" -@@ -217,6 +217,7 @@ config ARCH_EP93XX +@@ -217,6 +222,7 @@ config ARCH_EP93XX bool "EP93xx-based" select ARM_AMBA select ARM_VIC @@ -7247,7 +7599,7 @@ index a04f507..de211ac 100644 help This enables support for the Cirrus EP93xx series of CPUs. 
-@@ -333,6 +334,16 @@ config ARCH_MXC +@@ -333,6 +339,16 @@ config ARCH_MXC help Support for Freescale MXC/iMX-based family of processors @@ -7264,7 +7616,7 @@ index a04f507..de211ac 100644 config ARCH_PNX4008 bool "Philips Nexperia PNX4008 Mobile" help -@@ -345,6 +356,7 @@ config ARCH_PXA +@@ -345,6 +361,7 @@ config ARCH_PXA select GENERIC_GPIO select GENERIC_TIME select GENERIC_CLOCKEVENTS @@ -7272,7 +7624,7 @@ index a04f507..de211ac 100644 help Support for Intel/Marvell's PXA2xx/PXA3xx processor line. -@@ -366,6 +378,7 @@ config ARCH_SA1100 +@@ -366,6 +383,7 @@ config ARCH_SA1100 select ARCH_DISCONTIGMEM_ENABLE select ARCH_MTD_XIP select GENERIC_GPIO @@ -7280,7 +7632,7 @@ index a04f507..de211ac 100644 help Support for StrongARM 11x0 based boards. -@@ -409,6 +422,17 @@ config ARCH_OMAP +@@ -409,6 +427,17 @@ config ARCH_OMAP help Support for TI's OMAP platform (OMAP1 and OMAP2). @@ -7298,7 +7650,7 @@ index a04f507..de211ac 100644 endchoice source "arch/arm/mach-clps711x/Kconfig" -@@ -441,6 +465,8 @@ source "arch/arm/mach-omap1/Kconfig" +@@ -441,6 +470,8 @@ source "arch/arm/mach-omap1/Kconfig" source "arch/arm/mach-omap2/Kconfig" @@ -7307,7 +7659,7 @@ index a04f507..de211ac 100644 source "arch/arm/plat-s3c24xx/Kconfig" source "arch/arm/plat-s3c/Kconfig" -@@ -477,6 +503,8 @@ source "arch/arm/mach-davinci/Kconfig" +@@ -477,6 +508,8 @@ source "arch/arm/mach-davinci/Kconfig" source "arch/arm/mach-ks8695/Kconfig" @@ -7316,7 +7668,7 @@ index a04f507..de211ac 100644 # Definitions to make life easier config ARCH_ACORN bool -@@ -657,6 +685,7 @@ config HZ +@@ -657,6 +690,7 @@ config HZ default 128 if ARCH_L7200 default 200 if ARCH_EBSA110 || ARCH_S3C2410 default OMAP_32K_TIMER_HZ if ARCH_OMAP && OMAP_32K_TIMER @@ -7324,7 +7676,7 @@ index a04f507..de211ac 100644 default 100 config AEABI -@@ -716,7 +745,7 @@ config LEDS +@@ -716,7 +750,7 @@ config LEDS ARCH_OMAP || ARCH_P720T || ARCH_PXA_IDP || \ ARCH_SA1100 || ARCH_SHARK || ARCH_VERSATILE || \ ARCH_AT91 || MACH_TRIZEPS4 || ARCH_DAVINCI || \ @@ -7333,7 +7685,7 @@ index a04f507..de211ac 100644 help If you say Y here, the LEDs on your machine will be used to provide useful information about your current system status. -@@ -867,7 +896,7 @@ config KEXEC +@@ -867,7 +901,7 @@ config KEXEC endmenu @@ -7342,7 +7694,7 @@ index a04f507..de211ac 100644 menu "CPU Frequency scaling" -@@ -903,6 +932,12 @@ config CPU_FREQ_IMX +@@ -903,6 +937,12 @@ config CPU_FREQ_IMX If in doubt, say N. @@ -7355,7 +7707,7 @@ index a04f507..de211ac 100644 endmenu endif -@@ -951,7 +986,7 @@ config FPE_FASTFPE +@@ -951,7 +991,7 @@ config FPE_FASTFPE config VFP bool "VFP-format floating point maths" @@ -7364,7 +7716,7 @@ index a04f507..de211ac 100644 help Say Y to include VFP support code in the kernel. This is needed if your hardware includes a VFP unit. -@@ -961,6 +996,18 @@ config VFP +@@ -961,6 +1001,18 @@ config VFP Say N if your target does not have VFP hardware. @@ -59044,6 +59396,32 @@ index a2e72d4..43a87b9 100644 #if defined(CONFIG_BLK_DEV_INITRD) . 
= ALIGN(4); ___initramfs_start = .; +diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig +index bef4772..5a41e75 100644 +--- a/arch/ia64/Kconfig ++++ b/arch/ia64/Kconfig +@@ -42,6 +42,11 @@ config MMU + config SWIOTLB + bool + ++config GENERIC_LOCKBREAK ++ bool ++ default y ++ depends on SMP && PREEMPT ++ + config RWSEM_XCHGADD_ALGORITHM + bool + default y +@@ -75,6 +80,9 @@ config GENERIC_TIME_VSYSCALL + bool + default y + ++config ARCH_SETS_UP_PER_CPU_AREA ++ def_bool y ++ + config DMI + bool + default y diff --git a/arch/ia64/hp/sim/simeth.c b/arch/ia64/hp/sim/simeth.c index 08b117e..9898feb 100644 --- a/arch/ia64/hp/sim/simeth.c @@ -59060,6 +59438,33 @@ index 08b117e..9898feb 100644 /* * very simple loop because we get interrupts only when receiving */ +diff --git a/arch/ia64/ia32/binfmt_elf32.c b/arch/ia64/ia32/binfmt_elf32.c +index 3e35987..4f0c30c 100644 +--- a/arch/ia64/ia32/binfmt_elf32.c ++++ b/arch/ia64/ia32/binfmt_elf32.c +@@ -222,7 +222,8 @@ elf32_set_personality (void) + } + + static unsigned long +-elf32_map (struct file *filep, unsigned long addr, struct elf_phdr *eppnt, int prot, int type) ++elf32_map(struct file *filep, unsigned long addr, struct elf_phdr *eppnt, ++ int prot, int type, unsigned long unused) + { + unsigned long pgoff = (eppnt->p_vaddr) & ~IA32_PAGE_MASK; + +diff --git a/arch/ia64/kernel/module.c b/arch/ia64/kernel/module.c +index 1962879..e699eb6 100644 +--- a/arch/ia64/kernel/module.c ++++ b/arch/ia64/kernel/module.c +@@ -947,7 +947,7 @@ percpu_modcopy (void *pcpudst, const void *src, unsigned long size) + { + unsigned int i; + for_each_possible_cpu(i) { +- memcpy(pcpudst + __per_cpu_offset[i], src, size); ++ memcpy(pcpudst + per_cpu_offset(i), src, size); + } + } + #endif /* CONFIG_SMP */ diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c index 4ac2b1f..86028c6 100644 --- a/arch/ia64/kernel/setup.c @@ -59242,6 +59647,22 @@ index 1f38a3a..bb1d249 100644 printk("SGI SAL version %x.%02x\n", version >> 8, version & 0x00FF); /* +diff --git a/arch/m32r/Kconfig b/arch/m32r/Kconfig +index ab9a264..f7237c5 100644 +--- a/arch/m32r/Kconfig ++++ b/arch/m32r/Kconfig +@@ -235,6 +235,11 @@ config IRAM_SIZE + # Define implied options from the CPU selection here + # + ++config GENERIC_LOCKBREAK ++ bool ++ default y ++ depends on SMP && PREEMPT ++ + config RWSEM_GENERIC_SPINLOCK + bool + depends on M32R diff --git a/arch/m32r/kernel/vmlinux.lds.S b/arch/m32r/kernel/vmlinux.lds.S index 942a8c7..41b0785 100644 --- a/arch/m32r/kernel/vmlinux.lds.S @@ -59365,7 +59786,7 @@ index 07a0055..b44edb0 100644 } diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig -index b22c043..6b0f85f 100644 +index b22c043..4fad0a3 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -37,16 +37,6 @@ config BASLER_EXCITE @@ -59585,7 +60006,19 @@ index b22c043..6b0f85f 100644 source "arch/mips/jazz/Kconfig" source "arch/mips/lasat/Kconfig" source "arch/mips/pmc-sierra/Kconfig" -@@ -797,10 +790,6 @@ config DMA_COHERENT +@@ -701,6 +694,11 @@ source "arch/mips/vr41xx/Kconfig" + + endmenu + ++config GENERIC_LOCKBREAK ++ bool ++ default y ++ depends on SMP && PREEMPT ++ + config RWSEM_GENERIC_SPINLOCK + bool + default y +@@ -797,10 +795,6 @@ config DMA_COHERENT config DMA_IP27 bool @@ -59596,7 +60029,7 @@ index b22c043..6b0f85f 100644 config DMA_NONCOHERENT bool select DMA_NEED_PCI_MAP_STATE -@@ -956,16 +945,40 @@ config EMMA2RH +@@ -956,16 +950,40 @@ config EMMA2RH config SERIAL_RM9000 bool @@ -59638,7 +60071,7 @@ index b22c043..6b0f85f 100644 default "4" if PMC_MSP4200_EVAL default "5" 
-@@ -974,7 +987,7 @@ config HAVE_STD_PC_SERIAL_PORT +@@ -974,7 +992,7 @@ config HAVE_STD_PC_SERIAL_PORT config ARC_CONSOLE bool "ARC console support" @@ -59647,7 +60080,7 @@ index b22c043..6b0f85f 100644 config ARC_MEMORY bool -@@ -983,7 +996,7 @@ config ARC_MEMORY +@@ -983,7 +1001,7 @@ config ARC_MEMORY config ARC_PROMLIB bool @@ -59656,7 +60089,7 @@ index b22c043..6b0f85f 100644 default y config ARC64 -@@ -1443,7 +1456,9 @@ config MIPS_MT_SMP +@@ -1443,7 +1461,9 @@ config MIPS_MT_SMP select MIPS_MT select NR_CPUS_DEFAULT_2 select SMP @@ -59666,7 +60099,7 @@ index b22c043..6b0f85f 100644 help This is a kernel model which is also known a VSMP or lately has been marketesed into SMVP. -@@ -1460,6 +1475,7 @@ config MIPS_MT_SMTC +@@ -1460,6 +1480,7 @@ config MIPS_MT_SMTC select NR_CPUS_DEFAULT_8 select SMP select SYS_SUPPORTS_SMP @@ -59674,7 +60107,7 @@ index b22c043..6b0f85f 100644 help This is a kernel model which is known a SMTC or lately has been marketesed into SMVP. -@@ -1469,6 +1485,19 @@ endchoice +@@ -1469,6 +1490,19 @@ endchoice config MIPS_MT bool @@ -59694,7 +60127,7 @@ index b22c043..6b0f85f 100644 config SYS_SUPPORTS_MULTITHREADING bool -@@ -1589,15 +1618,6 @@ config CPU_HAS_SMARTMIPS +@@ -1589,15 +1623,6 @@ config CPU_HAS_SMARTMIPS config CPU_HAS_WB bool @@ -59710,7 +60143,7 @@ index b22c043..6b0f85f 100644 # # Vectored interrupt mode is an R2 feature # -@@ -1619,6 +1639,19 @@ config GENERIC_CLOCKEVENTS_BROADCAST +@@ -1619,6 +1644,19 @@ config GENERIC_CLOCKEVENTS_BROADCAST bool # @@ -59730,7 +60163,7 @@ index b22c043..6b0f85f 100644 # Use the generic interrupt handling code in kernel/irq/: # config GENERIC_HARDIRQS -@@ -1721,6 +1754,9 @@ config SMP +@@ -1721,6 +1759,9 @@ config SMP If you don't know what to do here, say N. @@ -59740,7 +60173,7 @@ index b22c043..6b0f85f 100644 config SYS_SUPPORTS_SMP bool -@@ -1978,9 +2014,6 @@ config MMU +@@ -1978,9 +2019,6 @@ config MMU config I8253 bool @@ -62471,6 +62904,45 @@ index e76a76b..c6ada98 100644 MTC0 k0, CP0_EPC /* I hope three instructions between MTC0 and ERET are enough... 
*/ ori k1, _THREAD_MASK +diff --git a/arch/mips/kernel/i8253.c b/arch/mips/kernel/i8253.c +index c2d497c..fc4aa07 100644 +--- a/arch/mips/kernel/i8253.c ++++ b/arch/mips/kernel/i8253.c +@@ -24,9 +24,7 @@ DEFINE_SPINLOCK(i8253_lock); + static void init_pit_timer(enum clock_event_mode mode, + struct clock_event_device *evt) + { +- unsigned long flags; +- +- spin_lock_irqsave(&i8253_lock, flags); ++ spin_lock(&i8253_lock); + + switch(mode) { + case CLOCK_EVT_MODE_PERIODIC: +@@ -55,7 +53,7 @@ static void init_pit_timer(enum clock_event_mode mode, + /* Nothing to do here */ + break; + } +- spin_unlock_irqrestore(&i8253_lock, flags); ++ spin_unlock(&i8253_lock); + } + + /* +@@ -65,12 +63,10 @@ static void init_pit_timer(enum clock_event_mode mode, + */ + static int pit_next_event(unsigned long delta, struct clock_event_device *evt) + { +- unsigned long flags; +- +- spin_lock_irqsave(&i8253_lock, flags); ++ spin_lock(&i8253_lock); + outb_p(delta & 0xff , PIT_CH0); /* LSB */ + outb(delta >> 8 , PIT_CH0); /* MSB */ +- spin_unlock_irqrestore(&i8253_lock, flags); ++ spin_unlock(&i8253_lock); + + return 0; + } diff --git a/arch/mips/kernel/i8259.c b/arch/mips/kernel/i8259.c index 4710135..197d797 100644 --- a/arch/mips/kernel/i8259.c @@ -70303,6 +70775,22 @@ index 58e4768..7723d20 100644 #ifdef CONFIG_PCI #ifdef CONFIG_ROCKHOPPER ali_m5229_preinit(); +diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig +index b8ef178..2b649c4 100644 +--- a/arch/parisc/Kconfig ++++ b/arch/parisc/Kconfig +@@ -19,6 +19,11 @@ config MMU + config STACK_GROWSUP + def_bool y + ++config GENERIC_LOCKBREAK ++ bool ++ default y ++ depends on SMP && PREEMPT ++ + config RWSEM_GENERIC_SPINLOCK + def_bool y + diff --git a/arch/parisc/kernel/vmlinux.lds.S b/arch/parisc/kernel/vmlinux.lds.S index 40d0ff9..50b4a3a 100644 --- a/arch/parisc/kernel/vmlinux.lds.S @@ -70334,6 +70822,32 @@ index 40d0ff9..50b4a3a 100644 } #ifdef CONFIG_BLK_DEV_INITRD . 
= ALIGN(PAGE_SIZE); +diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig +index 232c298..fb85f6b 100644 +--- a/arch/powerpc/Kconfig ++++ b/arch/powerpc/Kconfig +@@ -42,6 +42,9 @@ config GENERIC_HARDIRQS + bool + default y + ++config ARCH_SETS_UP_PER_CPU_AREA ++ def_bool PPC64 ++ + config IRQ_PER_CPU + bool + default y +@@ -53,6 +56,11 @@ config RWSEM_XCHGADD_ALGORITHM + bool + default y + ++config GENERIC_LOCKBREAK ++ bool ++ default y ++ depends on SMP && PREEMPT ++ + config ARCH_HAS_ILOG2_U32 + bool + default y diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile index 18e3271..4b1d98b 100644 --- a/arch/powerpc/boot/Makefile @@ -70347,6 +70861,90 @@ index 18e3271..4b1d98b 100644 quiet_cmd_copy_zlibheader = COPY $@ cmd_copy_zlibheader = sed "s@]*\).*@\"\1\"@" $< > $@ +diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c +index 3e17d15..8b056d2 100644 +--- a/arch/powerpc/kernel/ptrace.c ++++ b/arch/powerpc/kernel/ptrace.c +@@ -256,7 +256,7 @@ static int set_evrregs(struct task_struct *task, unsigned long *data) + #endif /* CONFIG_SPE */ + + +-static void set_single_step(struct task_struct *task) ++void user_enable_single_step(struct task_struct *task) + { + struct pt_regs *regs = task->thread.regs; + +@@ -271,7 +271,7 @@ static void set_single_step(struct task_struct *task) + set_tsk_thread_flag(task, TIF_SINGLESTEP); + } + +-static void clear_single_step(struct task_struct *task) ++void user_disable_single_step(struct task_struct *task) + { + struct pt_regs *regs = task->thread.regs; + +@@ -313,7 +313,7 @@ static int ptrace_set_debugreg(struct task_struct *task, unsigned long addr, + void ptrace_disable(struct task_struct *child) + { + /* make sure the single step bit is not set. */ +- clear_single_step(child); ++ user_disable_single_step(child); + } + + /* +@@ -445,52 +445,6 @@ long arch_ptrace(struct task_struct *child, long request, long addr, long data) + break; + } + +- case PTRACE_SYSCALL: /* continue and stop at next (return from) syscall */ +- case PTRACE_CONT: { /* restart after signal. */ +- ret = -EIO; +- if (!valid_signal(data)) +- break; +- if (request == PTRACE_SYSCALL) +- set_tsk_thread_flag(child, TIF_SYSCALL_TRACE); +- else +- clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); +- child->exit_code = data; +- /* make sure the single step bit is not set. */ +- clear_single_step(child); +- wake_up_process(child); +- ret = 0; +- break; +- } +- +-/* +- * make the child exit. Best I can do is send it a sigkill. +- * perhaps it should be put in the status that it wants to +- * exit. +- */ +- case PTRACE_KILL: { +- ret = 0; +- if (child->exit_state == EXIT_ZOMBIE) /* already dead */ +- break; +- child->exit_code = SIGKILL; +- /* make sure the single step bit is not set. */ +- clear_single_step(child); +- wake_up_process(child); +- break; +- } +- +- case PTRACE_SINGLESTEP: { /* set the trap flag. */ +- ret = -EIO; +- if (!valid_signal(data)) +- break; +- clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); +- set_single_step(child); +- child->exit_code = data; +- /* give it a chance to run. 
*/ +- wake_up_process(child); +- ret = 0; +- break; +- } +- + case PTRACE_GET_DEBUGREG: { + ret = -EINVAL; + /* We only support one DABR and no IABRS at the moment */ diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c index 25d9a96..c8127f8 100644 --- a/arch/powerpc/kernel/sysfs.c @@ -131142,6 +131740,32 @@ index a8b4200..216147d 100644 *(.exitcall.exit) } +diff --git a/arch/sparc64/Kconfig b/arch/sparc64/Kconfig +index 10b212a..26f5791 100644 +--- a/arch/sparc64/Kconfig ++++ b/arch/sparc64/Kconfig +@@ -66,6 +66,9 @@ config AUDIT_ARCH + bool + default y + ++config ARCH_SETS_UP_PER_CPU_AREA ++ def_bool y ++ + config ARCH_NO_VIRT_TO_BUS + def_bool y + +@@ -200,6 +203,11 @@ config US2E_FREQ + If in doubt, say N. + + # Global things across all Sun machines. ++config GENERIC_LOCKBREAK ++ bool ++ default y ++ depends on SMP && PREEMPT ++ + config RWSEM_GENERIC_SPINLOCK + bool + diff --git a/arch/sparc64/kernel/unaligned.c b/arch/sparc64/kernel/unaligned.c index 953be81..dc7bf1b 100644 --- a/arch/sparc64/kernel/unaligned.c @@ -131301,6 +131925,23 @@ index 3866f49..26090b7 100644 /* Ensure the __preinit_array_start label is properly aligned. We could instead move the label definition inside the section, but +diff --git a/arch/um/kernel/ksyms.c b/arch/um/kernel/ksyms.c +index 1b388b4..7c7142b 100644 +--- a/arch/um/kernel/ksyms.c ++++ b/arch/um/kernel/ksyms.c +@@ -71,10 +71,10 @@ EXPORT_SYMBOL(dump_thread); + + /* required for SMP */ + +-extern void FASTCALL( __write_lock_failed(rwlock_t *rw)); ++extern void __write_lock_failed(rwlock_t *rw); + EXPORT_SYMBOL(__write_lock_failed); + +-extern void FASTCALL( __read_lock_failed(rwlock_t *rw)); ++extern void __read_lock_failed(rwlock_t *rw); + EXPORT_SYMBOL(__read_lock_failed); + + #endif diff --git a/arch/um/kernel/uml.lds.S b/arch/um/kernel/uml.lds.S index 13df191..5828c1d 100644 --- a/arch/um/kernel/uml.lds.S @@ -131323,6 +131964,197 @@ index 13df191..5828c1d 100644 .data : { . 
= ALIGN(KERNEL_STACK_SIZE); /* init_task */ +diff --git a/arch/um/sys-i386/signal.c b/arch/um/sys-i386/signal.c +index 0147227..19053d4 100644 +--- a/arch/um/sys-i386/signal.c ++++ b/arch/um/sys-i386/signal.c +@@ -3,10 +3,10 @@ + * Licensed under the GPL + */ + +-#include "linux/ptrace.h" +-#include "asm/unistd.h" +-#include "asm/uaccess.h" +-#include "asm/ucontext.h" ++#include ++#include ++#include ++#include + #include "frame_kern.h" + #include "skas.h" + +@@ -18,17 +18,17 @@ void copy_sc(struct uml_pt_regs *regs, void *from) + REGS_FS(regs->gp) = sc->fs; + REGS_ES(regs->gp) = sc->es; + REGS_DS(regs->gp) = sc->ds; +- REGS_EDI(regs->gp) = sc->edi; +- REGS_ESI(regs->gp) = sc->esi; +- REGS_EBP(regs->gp) = sc->ebp; +- REGS_SP(regs->gp) = sc->esp; +- REGS_EBX(regs->gp) = sc->ebx; +- REGS_EDX(regs->gp) = sc->edx; +- REGS_ECX(regs->gp) = sc->ecx; +- REGS_EAX(regs->gp) = sc->eax; +- REGS_IP(regs->gp) = sc->eip; ++ REGS_EDI(regs->gp) = sc->di; ++ REGS_ESI(regs->gp) = sc->si; ++ REGS_EBP(regs->gp) = sc->bp; ++ REGS_SP(regs->gp) = sc->sp; ++ REGS_EBX(regs->gp) = sc->bx; ++ REGS_EDX(regs->gp) = sc->dx; ++ REGS_ECX(regs->gp) = sc->cx; ++ REGS_EAX(regs->gp) = sc->ax; ++ REGS_IP(regs->gp) = sc->ip; + REGS_CS(regs->gp) = sc->cs; +- REGS_EFLAGS(regs->gp) = sc->eflags; ++ REGS_EFLAGS(regs->gp) = sc->flags; + REGS_SS(regs->gp) = sc->ss; + } + +@@ -229,18 +229,18 @@ static int copy_sc_to_user(struct sigcontext __user *to, + sc.fs = REGS_FS(regs->regs.gp); + sc.es = REGS_ES(regs->regs.gp); + sc.ds = REGS_DS(regs->regs.gp); +- sc.edi = REGS_EDI(regs->regs.gp); +- sc.esi = REGS_ESI(regs->regs.gp); +- sc.ebp = REGS_EBP(regs->regs.gp); +- sc.esp = sp; +- sc.ebx = REGS_EBX(regs->regs.gp); +- sc.edx = REGS_EDX(regs->regs.gp); +- sc.ecx = REGS_ECX(regs->regs.gp); +- sc.eax = REGS_EAX(regs->regs.gp); +- sc.eip = REGS_IP(regs->regs.gp); ++ sc.di = REGS_EDI(regs->regs.gp); ++ sc.si = REGS_ESI(regs->regs.gp); ++ sc.bp = REGS_EBP(regs->regs.gp); ++ sc.sp = sp; ++ sc.bx = REGS_EBX(regs->regs.gp); ++ sc.dx = REGS_EDX(regs->regs.gp); ++ sc.cx = REGS_ECX(regs->regs.gp); ++ sc.ax = REGS_EAX(regs->regs.gp); ++ sc.ip = REGS_IP(regs->regs.gp); + sc.cs = REGS_CS(regs->regs.gp); +- sc.eflags = REGS_EFLAGS(regs->regs.gp); +- sc.esp_at_signal = regs->regs.gp[UESP]; ++ sc.flags = REGS_EFLAGS(regs->regs.gp); ++ sc.sp_at_signal = regs->regs.gp[UESP]; + sc.ss = regs->regs.gp[SS]; + sc.cr2 = fi->cr2; + sc.err = fi->error_code; +diff --git a/arch/um/sys-x86_64/signal.c b/arch/um/sys-x86_64/signal.c +index 1778d33..7457436 100644 +--- a/arch/um/sys-x86_64/signal.c ++++ b/arch/um/sys-x86_64/signal.c +@@ -4,11 +4,11 @@ + * Licensed under the GPL + */ + +-#include "linux/personality.h" +-#include "linux/ptrace.h" +-#include "asm/unistd.h" +-#include "asm/uaccess.h" +-#include "asm/ucontext.h" ++#include ++#include ++#include ++#include ++#include + #include "frame_kern.h" + #include "skas.h" + +@@ -27,16 +27,16 @@ void copy_sc(struct uml_pt_regs *regs, void *from) + GETREG(regs, R13, sc, r13); + GETREG(regs, R14, sc, r14); + GETREG(regs, R15, sc, r15); +- GETREG(regs, RDI, sc, rdi); +- GETREG(regs, RSI, sc, rsi); +- GETREG(regs, RBP, sc, rbp); +- GETREG(regs, RBX, sc, rbx); +- GETREG(regs, RDX, sc, rdx); +- GETREG(regs, RAX, sc, rax); +- GETREG(regs, RCX, sc, rcx); +- GETREG(regs, RSP, sc, rsp); +- GETREG(regs, RIP, sc, rip); +- GETREG(regs, EFLAGS, sc, eflags); ++ GETREG(regs, RDI, sc, di); ++ GETREG(regs, RSI, sc, si); ++ GETREG(regs, RBP, sc, bp); ++ GETREG(regs, RBX, sc, bx); ++ GETREG(regs, RDX, sc, dx); ++ GETREG(regs, RAX, sc, ax); 
++ GETREG(regs, RCX, sc, cx); ++ GETREG(regs, RSP, sc, sp); ++ GETREG(regs, RIP, sc, ip); ++ GETREG(regs, EFLAGS, sc, flags); + GETREG(regs, CS, sc, cs); + + #undef GETREG +@@ -61,16 +61,16 @@ static int copy_sc_from_user(struct pt_regs *regs, + err |= GETREG(regs, R13, from, r13); + err |= GETREG(regs, R14, from, r14); + err |= GETREG(regs, R15, from, r15); +- err |= GETREG(regs, RDI, from, rdi); +- err |= GETREG(regs, RSI, from, rsi); +- err |= GETREG(regs, RBP, from, rbp); +- err |= GETREG(regs, RBX, from, rbx); +- err |= GETREG(regs, RDX, from, rdx); +- err |= GETREG(regs, RAX, from, rax); +- err |= GETREG(regs, RCX, from, rcx); +- err |= GETREG(regs, RSP, from, rsp); +- err |= GETREG(regs, RIP, from, rip); +- err |= GETREG(regs, EFLAGS, from, eflags); ++ err |= GETREG(regs, RDI, from, di); ++ err |= GETREG(regs, RSI, from, si); ++ err |= GETREG(regs, RBP, from, bp); ++ err |= GETREG(regs, RBX, from, bx); ++ err |= GETREG(regs, RDX, from, dx); ++ err |= GETREG(regs, RAX, from, ax); ++ err |= GETREG(regs, RCX, from, cx); ++ err |= GETREG(regs, RSP, from, sp); ++ err |= GETREG(regs, RIP, from, ip); ++ err |= GETREG(regs, EFLAGS, from, flags); + err |= GETREG(regs, CS, from, cs); + if (err) + return 1; +@@ -108,19 +108,19 @@ static int copy_sc_to_user(struct sigcontext __user *to, + __put_user((regs)->regs.gp[(regno) / sizeof(unsigned long)], \ + &(sc)->regname) + +- err |= PUTREG(regs, RDI, to, rdi); +- err |= PUTREG(regs, RSI, to, rsi); +- err |= PUTREG(regs, RBP, to, rbp); ++ err |= PUTREG(regs, RDI, to, di); ++ err |= PUTREG(regs, RSI, to, si); ++ err |= PUTREG(regs, RBP, to, bp); + /* + * Must use orignal RSP, which is passed in, rather than what's in + * the pt_regs, because that's already been updated to point at the + * signal frame. + */ +- err |= __put_user(sp, &to->rsp); +- err |= PUTREG(regs, RBX, to, rbx); +- err |= PUTREG(regs, RDX, to, rdx); +- err |= PUTREG(regs, RCX, to, rcx); +- err |= PUTREG(regs, RAX, to, rax); ++ err |= __put_user(sp, &to->sp); ++ err |= PUTREG(regs, RBX, to, bx); ++ err |= PUTREG(regs, RDX, to, dx); ++ err |= PUTREG(regs, RCX, to, cx); ++ err |= PUTREG(regs, RAX, to, ax); + err |= PUTREG(regs, R8, to, r8); + err |= PUTREG(regs, R9, to, r9); + err |= PUTREG(regs, R10, to, r10); +@@ -135,8 +135,8 @@ static int copy_sc_to_user(struct sigcontext __user *to, + err |= __put_user(fi->error_code, &to->err); + err |= __put_user(fi->trap_no, &to->trapno); + +- err |= PUTREG(regs, RIP, to, rip); +- err |= PUTREG(regs, EFLAGS, to, eflags); ++ err |= PUTREG(regs, RIP, to, ip); ++ err |= PUTREG(regs, EFLAGS, to, flags); + #undef PUTREG + + err |= __put_user(mask, &to->oldmask); diff --git a/arch/v850/kernel/vmlinux.lds.S b/arch/v850/kernel/vmlinux.lds.S index 6172599..d08cd1d 100644 --- a/arch/v850/kernel/vmlinux.lds.S @@ -131366,6 +132198,3741 @@ index 6172599..d08cd1d 100644 _einittext = .; \ *(.text.init) /* 2.4 convention */ \ INITCALL_CONTENTS \ +diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig +index 80b7ba4..fb3eea3 100644 +--- a/arch/x86/Kconfig ++++ b/arch/x86/Kconfig +@@ -17,81 +17,69 @@ config X86_64 + + ### Arch settings + config X86 +- bool +- default y ++ def_bool y ++ ++config GENERIC_LOCKBREAK ++ def_bool n + + config GENERIC_TIME +- bool +- default y ++ def_bool y + + config GENERIC_CMOS_UPDATE +- bool +- default y ++ def_bool y + + config CLOCKSOURCE_WATCHDOG +- bool +- default y ++ def_bool y + + config GENERIC_CLOCKEVENTS +- bool +- default y ++ def_bool y + + config GENERIC_CLOCKEVENTS_BROADCAST +- bool +- default y ++ def_bool y + depends on 
X86_64 || (X86_32 && X86_LOCAL_APIC) + + config LOCKDEP_SUPPORT +- bool +- default y ++ def_bool y + + config STACKTRACE_SUPPORT +- bool +- default y ++ def_bool y + + config SEMAPHORE_SLEEPERS +- bool +- default y ++ def_bool y + + config MMU +- bool +- default y ++ def_bool y + + config ZONE_DMA +- bool +- default y ++ def_bool y + + config QUICKLIST +- bool +- default X86_32 ++ def_bool X86_32 + + config SBUS + bool + + config GENERIC_ISA_DMA +- bool +- default y ++ def_bool y + + config GENERIC_IOMAP +- bool +- default y ++ def_bool y + + config GENERIC_BUG +- bool +- default y ++ def_bool y + depends on BUG + + config GENERIC_HWEIGHT +- bool +- default y ++ def_bool y ++ ++config GENERIC_GPIO ++ def_bool n + + config ARCH_MAY_HAVE_PC_FDC +- bool +- default y ++ def_bool y + + config DMI +- bool +- default y ++ def_bool y + + config RWSEM_GENERIC_SPINLOCK + def_bool !X86_XADD +@@ -112,6 +100,9 @@ config GENERIC_TIME_VSYSCALL + bool + default X86_64 + ++config HAVE_SETUP_PER_CPU_AREA ++ def_bool X86_64 ++ + config ARCH_SUPPORTS_OPROFILE + bool + default y +@@ -144,9 +135,17 @@ config GENERIC_PENDING_IRQ + + config X86_SMP + bool +- depends on X86_32 && SMP && !X86_VOYAGER ++ depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64) + default y + ++config X86_32_SMP ++ def_bool y ++ depends on X86_32 && SMP ++ ++config X86_64_SMP ++ def_bool y ++ depends on X86_64 && SMP ++ + config X86_HT + bool + depends on SMP +@@ -292,6 +291,18 @@ config X86_ES7000 + Only choose this option if you have such a system, otherwise you + should say N here. + ++config X86_RDC321X ++ bool "RDC R-321x SoC" ++ depends on X86_32 ++ select M486 ++ select X86_REBOOTFIXUPS ++ select GENERIC_GPIO ++ select LEDS_GPIO ++ help ++ This option is needed for RDC R-321x system-on-chip, also known ++ as R-8610-(G). ++ If you don't have one of these chips, you should say N here. ++ + config X86_VSMP + bool "Support for ScaleMP vSMP" + depends on X86_64 && PCI +@@ -303,8 +314,8 @@ config X86_VSMP + endchoice + + config SCHED_NO_NO_OMIT_FRAME_POINTER +- bool "Single-depth WCHAN output" +- default y ++ def_bool y ++ prompt "Single-depth WCHAN output" + depends on X86_32 + help + Calculate simpler /proc//wchan values. If this option +@@ -314,18 +325,8 @@ config SCHED_NO_NO_OMIT_FRAME_POINTER + + If in doubt, say "Y". + +-config PARAVIRT +- bool +- depends on X86_32 && !(X86_VISWS || X86_VOYAGER) +- help +- This changes the kernel so it can modify itself when it is run +- under a hypervisor, potentially improving performance significantly +- over full virtualization. However, when run without a hypervisor +- the kernel is theoretically slower and slightly larger. +- + menuconfig PARAVIRT_GUEST + bool "Paravirtualized guest support" +- depends on X86_32 + help + Say Y here to get to see options related to running Linux under + various hypervisors. This option alone does not add any kernel code. +@@ -339,6 +340,7 @@ source "arch/x86/xen/Kconfig" + config VMI + bool "VMI Guest support" + select PARAVIRT ++ depends on X86_32 + depends on !(X86_VISWS || X86_VOYAGER) + help + VMI provides a paravirtualized interface to the VMware ESX server +@@ -348,40 +350,43 @@ config VMI + + source "arch/x86/lguest/Kconfig" + ++config PARAVIRT ++ bool "Enable paravirtualization code" ++ depends on !(X86_VISWS || X86_VOYAGER) ++ help ++ This changes the kernel so it can modify itself when it is run ++ under a hypervisor, potentially improving performance significantly ++ over full virtualization. 
However, when run without a hypervisor ++ the kernel is theoretically slower and slightly larger. ++ + endif + + config ACPI_SRAT +- bool +- default y ++ def_bool y + depends on X86_32 && ACPI && NUMA && (X86_SUMMIT || X86_GENERICARCH) + select ACPI_NUMA + + config HAVE_ARCH_PARSE_SRAT +- bool +- default y +- depends on ACPI_SRAT ++ def_bool y ++ depends on ACPI_SRAT + + config X86_SUMMIT_NUMA +- bool +- default y ++ def_bool y + depends on X86_32 && NUMA && (X86_SUMMIT || X86_GENERICARCH) + + config X86_CYCLONE_TIMER +- bool +- default y ++ def_bool y + depends on X86_32 && X86_SUMMIT || X86_GENERICARCH + + config ES7000_CLUSTERED_APIC +- bool +- default y ++ def_bool y + depends on SMP && X86_ES7000 && MPENTIUMIII + + source "arch/x86/Kconfig.cpu" + + config HPET_TIMER +- bool ++ def_bool X86_64 + prompt "HPET Timer Support" if X86_32 +- default X86_64 + help + Use the IA-PC HPET (High Precision Event Timer) to manage + time in preference to the PIT and RTC, if a HPET is +@@ -399,9 +404,8 @@ config HPET_TIMER + Choose N to continue using the legacy 8254 timer. + + config HPET_EMULATE_RTC +- bool +- depends on HPET_TIMER && RTC=y +- default y ++ def_bool y ++ depends on HPET_TIMER && (RTC=y || RTC=m) + + # Mark as embedded because too many people got it wrong. + # The code disables itself when not needed. +@@ -441,8 +445,8 @@ config CALGARY_IOMMU + If unsure, say Y. + + config CALGARY_IOMMU_ENABLED_BY_DEFAULT +- bool "Should Calgary be enabled by default?" +- default y ++ def_bool y ++ prompt "Should Calgary be enabled by default?" + depends on CALGARY_IOMMU + help + Should Calgary be enabled by default? if you choose 'y', Calgary +@@ -486,9 +490,9 @@ config SCHED_SMT + N here. + + config SCHED_MC +- bool "Multi-core scheduler support" ++ def_bool y ++ prompt "Multi-core scheduler support" + depends on (X86_64 && SMP) || (X86_32 && X86_HT) +- default y + help + Multi-core scheduler support improves the CPU scheduler's decision + making when dealing with multi-core CPU chips at a cost of slightly +@@ -522,19 +526,16 @@ config X86_UP_IOAPIC + an IO-APIC, then the kernel will still run with no slowdown at all. + + config X86_LOCAL_APIC +- bool ++ def_bool y + depends on X86_64 || (X86_32 && (X86_UP_APIC || ((X86_VISWS || SMP) && !X86_VOYAGER) || X86_GENERICARCH)) +- default y + + config X86_IO_APIC +- bool ++ def_bool y + depends on X86_64 || (X86_32 && (X86_UP_IOAPIC || (SMP && !(X86_VISWS || X86_VOYAGER)) || X86_GENERICARCH)) +- default y + + config X86_VISWS_APIC +- bool ++ def_bool y + depends on X86_32 && X86_VISWS +- default y + + config X86_MCE + bool "Machine Check Exception" +@@ -554,17 +555,17 @@ config X86_MCE + the 386 and 486, so nearly everyone can say Y here. + + config X86_MCE_INTEL +- bool "Intel MCE features" ++ def_bool y ++ prompt "Intel MCE features" + depends on X86_64 && X86_MCE && X86_LOCAL_APIC +- default y + help + Additional support for intel specific MCE features such as + the thermal monitor. + + config X86_MCE_AMD +- bool "AMD MCE features" ++ def_bool y ++ prompt "AMD MCE features" + depends on X86_64 && X86_MCE && X86_LOCAL_APIC +- default y + help + Additional support for AMD specific MCE features such as + the DRAM Error Threshold. +@@ -637,9 +638,9 @@ config I8K + Say N otherwise. 
+ + config X86_REBOOTFIXUPS +- bool "Enable X86 board specific fixups for reboot" ++ def_bool n ++ prompt "Enable X86 board specific fixups for reboot" + depends on X86_32 && X86 +- default n + ---help--- + This enables chipset and/or board specific fixups to be done + in order to get reboot to work correctly. This is only needed on +@@ -648,7 +649,7 @@ config X86_REBOOTFIXUPS + system. + + Currently, the only fixup is for the Geode machines using +- CS5530A and CS5536 chipsets. ++ CS5530A and CS5536 chipsets and the RDC R-321x SoC. + + Say Y if you want to enable the fixup. Currently, it's safe to + enable this option even if you don't need it. +@@ -672,9 +673,8 @@ config MICROCODE + module will be called microcode. + + config MICROCODE_OLD_INTERFACE +- bool ++ def_bool y + depends on MICROCODE +- default y + + config X86_MSR + tristate "/dev/cpu/*/msr - Model-specific register support" +@@ -798,13 +798,12 @@ config PAGE_OFFSET + depends on X86_32 + + config HIGHMEM +- bool ++ def_bool y + depends on X86_32 && (HIGHMEM64G || HIGHMEM4G) +- default y + + config X86_PAE +- bool "PAE (Physical Address Extension) Support" +- default n ++ def_bool n ++ prompt "PAE (Physical Address Extension) Support" + depends on X86_32 && !HIGHMEM4G + select RESOURCES_64BIT + help +@@ -836,10 +835,10 @@ comment "NUMA (Summit) requires SMP, 64GB highmem support, ACPI" + depends on X86_32 && X86_SUMMIT && (!HIGHMEM64G || !ACPI) + + config K8_NUMA +- bool "Old style AMD Opteron NUMA detection" +- depends on X86_64 && NUMA && PCI +- default y +- help ++ def_bool y ++ prompt "Old style AMD Opteron NUMA detection" ++ depends on X86_64 && NUMA && PCI ++ help + Enable K8 NUMA node topology detection. You should say Y here if + you have a multi processor AMD K8 system. This uses an old + method to read the NUMA configuration directly from the builtin +@@ -847,10 +846,10 @@ config K8_NUMA + instead, which also takes priority if both are compiled in. + + config X86_64_ACPI_NUMA +- bool "ACPI NUMA detection" ++ def_bool y ++ prompt "ACPI NUMA detection" + depends on X86_64 && NUMA && ACPI && PCI + select ACPI_NUMA +- default y + help + Enable ACPI SRAT based node topology detection. 
+ +@@ -864,52 +863,53 @@ config NUMA_EMU + + config NODES_SHIFT + int ++ range 1 15 if X86_64 + default "6" if X86_64 + default "4" if X86_NUMAQ + default "3" + depends on NEED_MULTIPLE_NODES + + config HAVE_ARCH_BOOTMEM_NODE +- bool ++ def_bool y + depends on X86_32 && NUMA +- default y + + config ARCH_HAVE_MEMORY_PRESENT +- bool ++ def_bool y + depends on X86_32 && DISCONTIGMEM +- default y + + config NEED_NODE_MEMMAP_SIZE +- bool ++ def_bool y + depends on X86_32 && (DISCONTIGMEM || SPARSEMEM) +- default y + + config HAVE_ARCH_ALLOC_REMAP +- bool ++ def_bool y + depends on X86_32 && NUMA +- default y + + config ARCH_FLATMEM_ENABLE + def_bool y +- depends on (X86_32 && ARCH_SELECT_MEMORY_MODEL && X86_PC) || (X86_64 && !NUMA) ++ depends on X86_32 && ARCH_SELECT_MEMORY_MODEL && X86_PC && !NUMA + + config ARCH_DISCONTIGMEM_ENABLE + def_bool y +- depends on NUMA ++ depends on NUMA && X86_32 + + config ARCH_DISCONTIGMEM_DEFAULT + def_bool y +- depends on NUMA ++ depends on NUMA && X86_32 ++ ++config ARCH_SPARSEMEM_DEFAULT ++ def_bool y ++ depends on X86_64 + + config ARCH_SPARSEMEM_ENABLE + def_bool y +- depends on NUMA || (EXPERIMENTAL && (X86_PC || X86_64)) ++ depends on X86_64 || NUMA || (EXPERIMENTAL && X86_PC) + select SPARSEMEM_STATIC if X86_32 + select SPARSEMEM_VMEMMAP_ENABLE if X86_64 + + config ARCH_SELECT_MEMORY_MODEL + def_bool y +- depends on X86_32 && ARCH_SPARSEMEM_ENABLE ++ depends on ARCH_SPARSEMEM_ENABLE + + config ARCH_MEMORY_PROBE + def_bool X86_64 +@@ -987,42 +987,32 @@ config MTRR + See for more information. + + config EFI +- bool "Boot from EFI support" +- depends on X86_32 && ACPI +- default n ++ def_bool n ++ prompt "EFI runtime service support" ++ depends on ACPI + ---help--- +- This enables the kernel to boot on EFI platforms using +- system configuration information passed to it from the firmware. +- This also enables the kernel to use any EFI runtime services that are ++ This enables the kernel to use EFI runtime services that are + available (such as the EFI variable services). + +- This option is only useful on systems that have EFI firmware +- and will result in a kernel image that is ~8k larger. In addition, +- you must use the latest ELILO loader available at +- in order to take advantage of +- kernel initialization using EFI information (neither GRUB nor LILO know +- anything about EFI). However, even with this option, the resultant +- kernel should continue to boot on existing non-EFI platforms. ++ This option is only useful on systems that have EFI firmware. ++ In addition, you should use the latest ELILO loader available ++ at in order to take advantage ++ of EFI runtime services. However, even with this option, the ++ resultant kernel should continue to boot on existing non-EFI ++ platforms. + + config IRQBALANCE +- bool "Enable kernel irq balancing" ++ def_bool y ++ prompt "Enable kernel irq balancing" + depends on X86_32 && SMP && X86_IO_APIC +- default y + help + The default yes will allow the kernel to do irq load balancing. + Saying no will keep the kernel from doing irq load balancing. + +-# turning this on wastes a bunch of space. 
+-# Summit needs it only when NUMA is on +-config BOOT_IOREMAP +- bool +- depends on X86_32 && (((X86_SUMMIT || X86_GENERICARCH) && NUMA) || (X86 && EFI)) +- default y +- + config SECCOMP +- bool "Enable seccomp to safely compute untrusted bytecode" ++ def_bool y ++ prompt "Enable seccomp to safely compute untrusted bytecode" + depends on PROC_FS +- default y + help + This kernel feature is useful for number crunching applications + that may need to compute untrusted bytecode during their +@@ -1189,11 +1179,11 @@ config HOTPLUG_CPU + suspend. + + config COMPAT_VDSO +- bool "Compat VDSO support" +- default y +- depends on X86_32 ++ def_bool y ++ prompt "Compat VDSO support" ++ depends on X86_32 || IA32_EMULATION + help +- Map the VDSO to the predictable old-style address too. ++ Map the 32-bit VDSO to the predictable old-style address too. + ---help--- + Say N here if you are running a sufficiently recent glibc + version (2.3.3 or later), to remove the high-mapped +@@ -1207,30 +1197,26 @@ config ARCH_ENABLE_MEMORY_HOTPLUG + def_bool y + depends on X86_64 || (X86_32 && HIGHMEM) + +-config MEMORY_HOTPLUG_RESERVE +- def_bool X86_64 +- depends on (MEMORY_HOTPLUG && DISCONTIGMEM) +- + config HAVE_ARCH_EARLY_PFN_TO_NID + def_bool X86_64 + depends on NUMA + +-config OUT_OF_LINE_PFN_TO_PAGE +- def_bool X86_64 +- depends on DISCONTIGMEM +- + menu "Power management options" + depends on !X86_VOYAGER + + config ARCH_HIBERNATION_HEADER +- bool ++ def_bool y + depends on X86_64 && HIBERNATION +- default y + + source "kernel/power/Kconfig" + + source "drivers/acpi/Kconfig" + ++config X86_APM_BOOT ++ bool ++ default y ++ depends on APM || APM_MODULE ++ + menuconfig APM + tristate "APM (Advanced Power Management) BIOS support" + depends on X86_32 && PM_SLEEP && !X86_VISWS +@@ -1371,7 +1357,7 @@ menu "Bus options (PCI etc.)" + config PCI + bool "PCI support" if !X86_VISWS + depends on !X86_VOYAGER +- default y if X86_VISWS ++ default y + select ARCH_SUPPORTS_MSI if (X86_LOCAL_APIC && X86_IO_APIC) + help + Find out whether you have a PCI motherboard. PCI is the name of a +@@ -1418,25 +1404,21 @@ config PCI_GOANY + endchoice + + config PCI_BIOS +- bool ++ def_bool y + depends on X86_32 && !X86_VISWS && PCI && (PCI_GOBIOS || PCI_GOANY) +- default y + + # x86-64 doesn't support PCI BIOS access from long mode so always go direct. + config PCI_DIRECT +- bool ++ def_bool y + depends on PCI && (X86_64 || (PCI_GODIRECT || PCI_GOANY) || X86_VISWS) +- default y + + config PCI_MMCONFIG +- bool ++ def_bool y + depends on X86_32 && PCI && ACPI && (PCI_GOMMCONFIG || PCI_GOANY) +- default y + + config PCI_DOMAINS +- bool ++ def_bool y + depends on PCI +- default y + + config PCI_MMCONFIG + bool "Support mmconfig PCI config space access" +@@ -1453,9 +1435,9 @@ config DMAR + remapping devices. + + config DMAR_GFX_WA +- bool "Support for Graphics workaround" ++ def_bool y ++ prompt "Support for Graphics workaround" + depends on DMAR +- default y + help + Current Graphics drivers tend to use physical address + for DMA and avoid using DMA APIs. Setting this config +@@ -1464,9 +1446,8 @@ config DMAR_GFX_WA + to use physical addresses for DMA. + + config DMAR_FLOPPY_WA +- bool ++ def_bool y + depends on DMAR +- default y + help + Floppy disk drivers are know to bypass DMA API calls + thereby failing to work when IOMMU is enabled. This +@@ -1479,8 +1460,7 @@ source "drivers/pci/Kconfig" + + # x86_64 have no ISA slots, but do have ISA-style DMA. 
+ config ISA_DMA_API +- bool +- default y ++ def_bool y + + if X86_32 + +@@ -1546,9 +1526,9 @@ config SCx200HR_TIMER + other workaround is idle=poll boot option. + + config GEODE_MFGPT_TIMER +- bool "Geode Multi-Function General Purpose Timer (MFGPT) events" ++ def_bool y ++ prompt "Geode Multi-Function General Purpose Timer (MFGPT) events" + depends on MGEODE_LX && GENERIC_TIME && GENERIC_CLOCKEVENTS +- default y + help + This driver provides a clock event source based on the MFGPT + timer(s) in the CS5535 and CS5536 companion chip for the geode. +@@ -1575,6 +1555,7 @@ source "fs/Kconfig.binfmt" + config IA32_EMULATION + bool "IA32 Emulation" + depends on X86_64 ++ select COMPAT_BINFMT_ELF + help + Include code to run 32-bit programs under a 64-bit kernel. You should + likely turn this on, unless you're 100% sure that you don't have any +@@ -1587,18 +1568,16 @@ config IA32_AOUT + Support old a.out binaries in the 32bit emulation. + + config COMPAT +- bool ++ def_bool y + depends on IA32_EMULATION +- default y + + config COMPAT_FOR_U64_ALIGNMENT + def_bool COMPAT + depends on X86_64 + + config SYSVIPC_COMPAT +- bool ++ def_bool y + depends on X86_64 && COMPAT && SYSVIPC +- default y + + endmenu + +diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu +index c301622..e09a6b7 100644 +--- a/arch/x86/Kconfig.cpu ++++ b/arch/x86/Kconfig.cpu +@@ -219,10 +219,10 @@ config MGEODEGX1 + Select this for a Geode GX1 (Cyrix MediaGX) chip. + + config MGEODE_LX +- bool "Geode GX/LX" ++ bool "Geode GX/LX" + depends on X86_32 +- help +- Select this for AMD Geode GX and LX processors. ++ help ++ Select this for AMD Geode GX and LX processors. + + config MCYRIXIII + bool "CyrixIII/VIA-C3" +@@ -258,7 +258,7 @@ config MPSC + Optimize for Intel Pentium 4, Pentium D and older Nocona/Dempsey + Xeon CPUs with Intel 64bit which is compatible with x86-64. + Note that the latest Xeons (Xeon 51xx and 53xx) are not based on the +- Netburst core and shouldn't use this option. You can distinguish them ++ Netburst core and shouldn't use this option. You can distinguish them + using the cpu family field + in /proc/cpuinfo. Family 15 is an older Xeon, Family 6 a newer one. + +@@ -317,81 +317,75 @@ config X86_L1_CACHE_SHIFT + default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MVIAC7 + + config X86_XADD +- bool ++ def_bool y + depends on X86_32 && !M386 +- default y + + config X86_PPRO_FENCE +- bool ++ bool "PentiumPro memory ordering errata workaround" + depends on M686 || M586MMX || M586TSC || M586 || M486 || M386 || MGEODEGX1 +- default y ++ help ++ Old PentiumPro multiprocessor systems had errata that could cause memory ++ operations to violate the x86 ordering standard in rare cases. Enabling this ++ option will attempt to work around some (but not all) occurances of ++ this problem, at the cost of much heavier spinlock and memory barrier ++ operations. ++ ++ If unsure, say n here. Even distro kernels should think twice before enabling ++ this: there are few systems, and an unlikely bug. 
+ + config X86_F00F_BUG +- bool ++ def_bool y + depends on M586MMX || M586TSC || M586 || M486 || M386 +- default y + + config X86_WP_WORKS_OK +- bool ++ def_bool y + depends on X86_32 && !M386 +- default y + + config X86_INVLPG +- bool ++ def_bool y + depends on X86_32 && !M386 +- default y + + config X86_BSWAP +- bool ++ def_bool y + depends on X86_32 && !M386 +- default y + + config X86_POPAD_OK +- bool ++ def_bool y + depends on X86_32 && !M386 +- default y + + config X86_ALIGNMENT_16 +- bool ++ def_bool y + depends on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6 || MCYRIXIII || X86_ELAN || MK6 || M586MMX || M586TSC || M586 || M486 || MVIAC3_2 || MGEODEGX1 +- default y + + config X86_GOOD_APIC +- bool ++ def_bool y + depends on MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || MK8 || MEFFICEON || MCORE2 || MVIAC7 || X86_64 +- default y + + config X86_INTEL_USERCOPY +- bool ++ def_bool y + depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK7 || MEFFICEON || MCORE2 +- default y + + config X86_USE_PPRO_CHECKSUM +- bool ++ def_bool y + depends on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MEFFICEON || MGEODE_LX || MCORE2 +- default y + + config X86_USE_3DNOW +- bool ++ def_bool y + depends on (MCYRIXIII || MK7 || MGEODE_LX) && !UML +- default y + + config X86_OOSTORE +- bool ++ def_bool y + depends on (MWINCHIP3D || MWINCHIP2 || MWINCHIPC6) && MTRR +- default y + + config X86_TSC +- bool ++ def_bool y + depends on ((MWINCHIP3D || MWINCHIP2 || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2) && !X86_NUMAQ) || X86_64 +- default y + + # this should be set for all -march=.. options where the compiler + # generates cmov. + config X86_CMOV +- bool ++ def_bool y + depends on (MK7 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7) +- default y + + config X86_MINIMUM_CPU_FAMILY + int +@@ -399,3 +393,6 @@ config X86_MINIMUM_CPU_FAMILY + default "4" if X86_32 && (X86_XADD || X86_CMPXCHG || X86_BSWAP || X86_WP_WORKS_OK) + default "3" + ++config X86_DEBUGCTLMSR ++ def_bool y ++ depends on !(M586MMX || M586TSC || M586 || M486 || M386) +diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug +index 761ca7b..2e1e3af 100644 +--- a/arch/x86/Kconfig.debug ++++ b/arch/x86/Kconfig.debug +@@ -6,7 +6,7 @@ config TRACE_IRQFLAGS_SUPPORT + source "lib/Kconfig.debug" + + config EARLY_PRINTK +- bool "Early printk" if EMBEDDED && DEBUG_KERNEL && X86_32 ++ bool "Early printk" if EMBEDDED + default y + help + Write kernel log output directly into the VGA buffer or to a serial +@@ -40,22 +40,49 @@ comment "Page alloc debug is incompatible with Software Suspend on i386" + + config DEBUG_PAGEALLOC + bool "Debug page memory allocations" +- depends on DEBUG_KERNEL && !HIBERNATION && !HUGETLBFS +- depends on X86_32 ++ depends on DEBUG_KERNEL && X86_32 + help + Unmap pages from the kernel linear mapping after free_pages(). + This results in a large slowdown, but helps to find certain types + of memory corruptions. + ++config DEBUG_PER_CPU_MAPS ++ bool "Debug access to per_cpu maps" ++ depends on DEBUG_KERNEL ++ depends on X86_64_SMP ++ default n ++ help ++ Say Y to verify that the per_cpu map being accessed has ++ been setup. 
Adds a fair amount of code to kernel memory ++ and decreases performance. ++ ++ Say N if unsure. ++ + config DEBUG_RODATA + bool "Write protect kernel read-only data structures" ++ default y + depends on DEBUG_KERNEL + help + Mark the kernel read-only data as write-protected in the pagetables, + in order to catch accidental (and incorrect) writes to such const +- data. This option may have a slight performance impact because a +- portion of the kernel code won't be covered by a 2MB TLB anymore. +- If in doubt, say "N". ++ data. This is recommended so that we can catch kernel bugs sooner. ++ If in doubt, say "Y". ++ ++config DEBUG_RODATA_TEST ++ bool "Testcase for the DEBUG_RODATA feature" ++ depends on DEBUG_RODATA ++ help ++ This option enables a testcase for the DEBUG_RODATA ++ feature as well as for the change_page_attr() infrastructure. ++ If in doubt, say "N" ++ ++config DEBUG_NX_TEST ++ tristate "Testcase for the NX non-executable stack feature" ++ depends on DEBUG_KERNEL && m ++ help ++ This option enables a testcase for the CPU NX capability ++ and the software setup of this feature. ++ If in doubt, say "N" + + config 4KSTACKS + bool "Use 4Kb for kernel stacks instead of 8Kb" +@@ -75,8 +102,7 @@ config X86_FIND_SMP_CONFIG + + config X86_MPPARSE + def_bool y +- depends on X86_LOCAL_APIC && !X86_VISWS +- depends on X86_32 ++ depends on (X86_32 && (X86_LOCAL_APIC && !X86_VISWS)) || X86_64 + + config DOUBLEFAULT + default y +@@ -112,4 +138,91 @@ config IOMMU_LEAK + Add a simple leak tracer to the IOMMU code. This is useful when you + are debugging a buggy device driver that leaks IOMMU mappings. + ++# ++# IO delay types: ++# ++ ++config IO_DELAY_TYPE_0X80 ++ int ++ default "0" ++ ++config IO_DELAY_TYPE_0XED ++ int ++ default "1" ++ ++config IO_DELAY_TYPE_UDELAY ++ int ++ default "2" ++ ++config IO_DELAY_TYPE_NONE ++ int ++ default "3" ++ ++choice ++ prompt "IO delay type" ++ default IO_DELAY_0XED ++ ++config IO_DELAY_0X80 ++ bool "port 0x80 based port-IO delay [recommended]" ++ help ++ This is the traditional Linux IO delay used for in/out_p. ++ It is the most tested hence safest selection here. ++ ++config IO_DELAY_0XED ++ bool "port 0xed based port-IO delay" ++ help ++ Use port 0xed as the IO delay. This frees up port 0x80 which is ++ often used as a hardware-debug port. ++ ++config IO_DELAY_UDELAY ++ bool "udelay based port-IO delay" ++ help ++ Use udelay(2) as the IO delay method. This provides the delay ++ while not having any side-effect on the IO port space. ++ ++config IO_DELAY_NONE ++ bool "no port-IO delay" ++ help ++ No port-IO delay. Will break on old boxes that require port-IO ++ delay for certain operations. Should work on most new machines. ++ ++endchoice ++ ++if IO_DELAY_0X80 ++config DEFAULT_IO_DELAY_TYPE ++ int ++ default IO_DELAY_TYPE_0X80 ++endif ++ ++if IO_DELAY_0XED ++config DEFAULT_IO_DELAY_TYPE ++ int ++ default IO_DELAY_TYPE_0XED ++endif ++ ++if IO_DELAY_UDELAY ++config DEFAULT_IO_DELAY_TYPE ++ int ++ default IO_DELAY_TYPE_UDELAY ++endif ++ ++if IO_DELAY_NONE ++config DEFAULT_IO_DELAY_TYPE ++ int ++ default IO_DELAY_TYPE_NONE ++endif ++ ++config DEBUG_BOOT_PARAMS ++ bool "Debug boot parameters" ++ depends on DEBUG_KERNEL ++ depends on DEBUG_FS ++ help ++ This option will cause struct boot_params to be exported via debugfs. ++ ++config CPA_DEBUG ++ bool "CPA self test code" ++ depends on DEBUG_KERNEL ++ help ++ Do change_page_attr self tests at boot. 
++ + endmenu +diff --git a/arch/x86/Makefile b/arch/x86/Makefile +index 7aa1dc6..b08f182 100644 +--- a/arch/x86/Makefile ++++ b/arch/x86/Makefile +@@ -7,13 +7,252 @@ else + KBUILD_DEFCONFIG := $(ARCH)_defconfig + endif + +-# No need to remake these files +-$(srctree)/arch/x86/Makefile%: ; ++# BITS is used as extension for files which are available in a 32 bit ++# and a 64 bit version to simplify shared Makefiles. ++# e.g.: obj-y += foo_$(BITS).o ++export BITS + + ifeq ($(CONFIG_X86_32),y) ++ BITS := 32 + UTS_MACHINE := i386 +- include $(srctree)/arch/x86/Makefile_32 ++ CHECKFLAGS += -D__i386__ ++ ++ biarch := $(call cc-option,-m32) ++ KBUILD_AFLAGS += $(biarch) ++ KBUILD_CFLAGS += $(biarch) ++ ++ ifdef CONFIG_RELOCATABLE ++ LDFLAGS_vmlinux := --emit-relocs ++ endif ++ ++ KBUILD_CFLAGS += -msoft-float -mregparm=3 -freg-struct-return ++ ++ # prevent gcc from keeping the stack 16 byte aligned ++ KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=2) ++ ++ # Disable unit-at-a-time mode on pre-gcc-4.0 compilers, it makes gcc use ++ # a lot more stack due to the lack of sharing of stacklots: ++ KBUILD_CFLAGS += $(shell if [ $(call cc-version) -lt 0400 ] ; then \ ++ echo $(call cc-option,-fno-unit-at-a-time); fi ;) ++ ++ # CPU-specific tuning. Anything which can be shared with UML should go here. ++ include $(srctree)/arch/x86/Makefile_32.cpu ++ KBUILD_CFLAGS += $(cflags-y) ++ ++ # temporary until string.h is fixed ++ KBUILD_CFLAGS += -ffreestanding + else ++ BITS := 64 + UTS_MACHINE := x86_64 +- include $(srctree)/arch/x86/Makefile_64 ++ CHECKFLAGS += -D__x86_64__ -m64 ++ ++ KBUILD_AFLAGS += -m64 ++ KBUILD_CFLAGS += -m64 ++ ++ # FIXME - should be integrated in Makefile.cpu (Makefile_32.cpu) ++ cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8) ++ cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona) ++ ++ cflags-$(CONFIG_MCORE2) += \ ++ $(call cc-option,-march=core2,$(call cc-option,-mtune=generic)) ++ cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic) ++ KBUILD_CFLAGS += $(cflags-y) ++ ++ KBUILD_CFLAGS += -mno-red-zone ++ KBUILD_CFLAGS += -mcmodel=kernel ++ ++ # -funit-at-a-time shrinks the kernel .text considerably ++ # unfortunately it makes reading oopses harder. ++ KBUILD_CFLAGS += $(call cc-option,-funit-at-a-time) ++ ++ # this works around some issues with generating unwind tables in older gccs ++ # newer gccs do it by default ++ KBUILD_CFLAGS += -maccumulate-outgoing-args ++ ++ stackp := $(CONFIG_SHELL) $(srctree)/scripts/gcc-x86_64-has-stack-protector.sh ++ stackp-$(CONFIG_CC_STACKPROTECTOR) := $(shell $(stackp) \ ++ "$(CC)" -fstack-protector ) ++ stackp-$(CONFIG_CC_STACKPROTECTOR_ALL) += $(shell $(stackp) \ ++ "$(CC)" -fstack-protector-all ) ++ ++ KBUILD_CFLAGS += $(stackp-y) ++endif ++ ++# Stackpointer is addressed different for 32 bit and 64 bit x86 ++sp-$(CONFIG_X86_32) := esp ++sp-$(CONFIG_X86_64) := rsp ++ ++# do binutils support CFI? ++cfi := $(call as-instr,.cfi_startproc\n.cfi_rel_offset $(sp-y)$(comma)0\n.cfi_endproc,-DCONFIG_AS_CFI=1) ++# is .cfi_signal_frame supported too? 
++cfi-sigframe := $(call as-instr,.cfi_startproc\n.cfi_signal_frame\n.cfi_endproc,-DCONFIG_AS_CFI_SIGNAL_FRAME=1) ++KBUILD_AFLAGS += $(cfi) $(cfi-sigframe) ++KBUILD_CFLAGS += $(cfi) $(cfi-sigframe) ++ ++LDFLAGS := -m elf_$(UTS_MACHINE) ++OBJCOPYFLAGS := -O binary -R .note -R .comment -S ++ ++# Speed up the build ++KBUILD_CFLAGS += -pipe ++# Workaround for a gcc prelease that unfortunately was shipped in a suse release ++KBUILD_CFLAGS += -Wno-sign-compare ++# ++KBUILD_CFLAGS += -fno-asynchronous-unwind-tables ++# prevent gcc from generating any FP code by mistake ++KBUILD_CFLAGS += $(call cc-option,-mno-sse -mno-mmx -mno-sse2 -mno-3dnow,) ++ ++### ++# Sub architecture support ++# fcore-y is linked before mcore-y files. ++ ++# Default subarch .c files ++mcore-y := arch/x86/mach-default/ ++ ++# Voyager subarch support ++mflags-$(CONFIG_X86_VOYAGER) := -Iinclude/asm-x86/mach-voyager ++mcore-$(CONFIG_X86_VOYAGER) := arch/x86/mach-voyager/ ++ ++# VISWS subarch support ++mflags-$(CONFIG_X86_VISWS) := -Iinclude/asm-x86/mach-visws ++mcore-$(CONFIG_X86_VISWS) := arch/x86/mach-visws/ ++ ++# NUMAQ subarch support ++mflags-$(CONFIG_X86_NUMAQ) := -Iinclude/asm-x86/mach-numaq ++mcore-$(CONFIG_X86_NUMAQ) := arch/x86/mach-default/ ++ ++# BIGSMP subarch support ++mflags-$(CONFIG_X86_BIGSMP) := -Iinclude/asm-x86/mach-bigsmp ++mcore-$(CONFIG_X86_BIGSMP) := arch/x86/mach-default/ ++ ++#Summit subarch support ++mflags-$(CONFIG_X86_SUMMIT) := -Iinclude/asm-x86/mach-summit ++mcore-$(CONFIG_X86_SUMMIT) := arch/x86/mach-default/ ++ ++# generic subarchitecture ++mflags-$(CONFIG_X86_GENERICARCH):= -Iinclude/asm-x86/mach-generic ++fcore-$(CONFIG_X86_GENERICARCH) += arch/x86/mach-generic/ ++mcore-$(CONFIG_X86_GENERICARCH) := arch/x86/mach-default/ ++ ++ ++# ES7000 subarch support ++mflags-$(CONFIG_X86_ES7000) := -Iinclude/asm-x86/mach-es7000 ++fcore-$(CONFIG_X86_ES7000) := arch/x86/mach-es7000/ ++mcore-$(CONFIG_X86_ES7000) := arch/x86/mach-default/ ++ ++# RDC R-321x subarch support ++mflags-$(CONFIG_X86_RDC321X) := -Iinclude/asm-x86/mach-rdc321x ++mcore-$(CONFIG_X86_RDC321X) := arch/x86/mach-default ++core-$(CONFIG_X86_RDC321X) += arch/x86/mach-rdc321x/ ++ ++# default subarch .h files ++mflags-y += -Iinclude/asm-x86/mach-default ++ ++# 64 bit does not support subarch support - clear sub arch variables ++fcore-$(CONFIG_X86_64) := ++mcore-$(CONFIG_X86_64) := ++mflags-$(CONFIG_X86_64) := ++ ++KBUILD_CFLAGS += $(mflags-y) ++KBUILD_AFLAGS += $(mflags-y) ++ ++### ++# Kernel objects ++ ++head-y := arch/x86/kernel/head_$(BITS).o ++head-$(CONFIG_X86_64) += arch/x86/kernel/head64.o ++head-y += arch/x86/kernel/init_task.o ++ ++libs-y += arch/x86/lib/ ++ ++# Sub architecture files that needs linking first ++core-y += $(fcore-y) ++ ++# Xen paravirtualization support ++core-$(CONFIG_XEN) += arch/x86/xen/ ++ ++# lguest paravirtualization support ++core-$(CONFIG_LGUEST_GUEST) += arch/x86/lguest/ ++ ++core-y += arch/x86/kernel/ ++core-y += arch/x86/mm/ ++ ++# Remaining sub architecture files ++core-y += $(mcore-y) ++ ++core-y += arch/x86/crypto/ ++core-y += arch/x86/vdso/ ++core-$(CONFIG_IA32_EMULATION) += arch/x86/ia32/ ++ ++# drivers-y are linked after core-y ++drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/ ++drivers-$(CONFIG_PCI) += arch/x86/pci/ ++ ++# must be linked after kernel/ ++drivers-$(CONFIG_OPROFILE) += arch/x86/oprofile/ ++ ++ifeq ($(CONFIG_X86_32),y) ++drivers-$(CONFIG_PM) += arch/x86/power/ ++drivers-$(CONFIG_FB) += arch/x86/video/ + endif ++ ++#### ++# boot loader support. 
Several targets are kept for legacy purposes ++ ++boot := arch/x86/boot ++ ++PHONY += zImage bzImage compressed zlilo bzlilo \ ++ zdisk bzdisk fdimage fdimage144 fdimage288 isoimage install ++ ++# Default kernel to build ++all: bzImage ++ ++# KBUILD_IMAGE specify target image being built ++ KBUILD_IMAGE := $(boot)/bzImage ++zImage zlilo zdisk: KBUILD_IMAGE := arch/x86/boot/zImage ++ ++zImage bzImage: vmlinux ++ $(Q)$(MAKE) $(build)=$(boot) $(KBUILD_IMAGE) ++ $(Q)mkdir -p $(objtree)/arch/$(UTS_MACHINE)/boot ++ $(Q)ln -fsn ../../x86/boot/bzImage $(objtree)/arch/$(UTS_MACHINE)/boot/bzImage ++ ++compressed: zImage ++ ++zlilo bzlilo: vmlinux ++ $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(KBUILD_IMAGE) zlilo ++ ++zdisk bzdisk: vmlinux ++ $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(KBUILD_IMAGE) zdisk ++ ++fdimage fdimage144 fdimage288 isoimage: vmlinux ++ $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(KBUILD_IMAGE) $@ ++ ++install: vdso_install ++ $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(KBUILD_IMAGE) install ++ ++PHONY += vdso_install ++vdso_install: ++ $(Q)$(MAKE) $(build)=arch/x86/vdso $@ ++ ++archclean: ++ $(Q)rm -rf $(objtree)/arch/i386 ++ $(Q)rm -rf $(objtree)/arch/x86_64 ++ $(Q)$(MAKE) $(clean)=$(boot) ++ ++define archhelp ++ echo '* bzImage - Compressed kernel image (arch/x86/boot/bzImage)' ++ echo ' install - Install kernel using' ++ echo ' (your) ~/bin/installkernel or' ++ echo ' (distribution) /sbin/installkernel or' ++ echo ' install to $$(INSTALL_PATH) and run lilo' ++ echo ' fdimage - Create 1.4MB boot floppy image (arch/x86/boot/fdimage)' ++ echo ' fdimage144 - Create 1.4MB boot floppy image (arch/x86/boot/fdimage)' ++ echo ' fdimage288 - Create 2.8MB boot floppy image (arch/x86/boot/fdimage)' ++ echo ' isoimage - Create a boot CD-ROM image (arch/x86/boot/image.iso)' ++ echo ' bzdisk/fdimage*/isoimage also accept:' ++ echo ' FDARGS="..." arguments for the booted kernel' ++ echo ' FDINITRD=file initrd for the booted kernel' ++endef ++ ++CLEAN_FILES += arch/x86/boot/fdimage \ ++ arch/x86/boot/image.iso \ ++ arch/x86/boot/mtools.conf +diff --git a/arch/x86/Makefile_32 b/arch/x86/Makefile_32 +deleted file mode 100644 +index 50394da..0000000 +--- a/arch/x86/Makefile_32 ++++ /dev/null +@@ -1,175 +0,0 @@ +-# +-# i386 Makefile +-# +-# This file is included by the global makefile so that you can add your own +-# architecture-specific flags and dependencies. Remember to do have actions +-# for "archclean" cleaning up for this architecture. +-# +-# This file is subject to the terms and conditions of the GNU General Public +-# License. See the file "COPYING" in the main directory of this archive +-# for more details. +-# +-# Copyright (C) 1994 by Linus Torvalds +-# +-# 19990713 Artur Skawina +-# Added '-march' and '-mpreferred-stack-boundary' support +-# +-# 20050320 Kianusch Sayah Karadji +-# Added support for GEODE CPU +- +-# BITS is used as extension for files which are available in a 32 bit +-# and a 64 bit version to simplify shared Makefiles. 
+-# e.g.: obj-y += foo_$(BITS).o +-BITS := 32 +-export BITS +- +-HAS_BIARCH := $(call cc-option-yn, -m32) +-ifeq ($(HAS_BIARCH),y) +-AS := $(AS) --32 +-LD := $(LD) -m elf_i386 +-CC := $(CC) -m32 +-endif +- +-LDFLAGS := -m elf_i386 +-OBJCOPYFLAGS := -O binary -R .note -R .comment -S +-ifdef CONFIG_RELOCATABLE +-LDFLAGS_vmlinux := --emit-relocs +-endif +-CHECKFLAGS += -D__i386__ +- +-KBUILD_CFLAGS += -pipe -msoft-float -mregparm=3 -freg-struct-return +- +-# prevent gcc from keeping the stack 16 byte aligned +-KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=2) +- +-# CPU-specific tuning. Anything which can be shared with UML should go here. +-include $(srctree)/arch/x86/Makefile_32.cpu +- +-# temporary until string.h is fixed +-cflags-y += -ffreestanding +- +-# this works around some issues with generating unwind tables in older gccs +-# newer gccs do it by default +-cflags-y += -maccumulate-outgoing-args +- +-# Disable unit-at-a-time mode on pre-gcc-4.0 compilers, it makes gcc use +-# a lot more stack due to the lack of sharing of stacklots: +-KBUILD_CFLAGS += $(shell if [ $(call cc-version) -lt 0400 ] ; then echo $(call cc-option,-fno-unit-at-a-time); fi ;) +- +-# do binutils support CFI? +-cflags-y += $(call as-instr,.cfi_startproc\n.cfi_rel_offset esp${comma}0\n.cfi_endproc,-DCONFIG_AS_CFI=1,) +-KBUILD_AFLAGS += $(call as-instr,.cfi_startproc\n.cfi_rel_offset esp${comma}0\n.cfi_endproc,-DCONFIG_AS_CFI=1,) +- +-# is .cfi_signal_frame supported too? +-cflags-y += $(call as-instr,.cfi_startproc\n.cfi_signal_frame\n.cfi_endproc,-DCONFIG_AS_CFI_SIGNAL_FRAME=1,) +-KBUILD_AFLAGS += $(call as-instr,.cfi_startproc\n.cfi_signal_frame\n.cfi_endproc,-DCONFIG_AS_CFI_SIGNAL_FRAME=1,) +- +-KBUILD_CFLAGS += $(cflags-y) +- +-# Default subarch .c files +-mcore-y := arch/x86/mach-default +- +-# Voyager subarch support +-mflags-$(CONFIG_X86_VOYAGER) := -Iinclude/asm-x86/mach-voyager +-mcore-$(CONFIG_X86_VOYAGER) := arch/x86/mach-voyager +- +-# VISWS subarch support +-mflags-$(CONFIG_X86_VISWS) := -Iinclude/asm-x86/mach-visws +-mcore-$(CONFIG_X86_VISWS) := arch/x86/mach-visws +- +-# NUMAQ subarch support +-mflags-$(CONFIG_X86_NUMAQ) := -Iinclude/asm-x86/mach-numaq +-mcore-$(CONFIG_X86_NUMAQ) := arch/x86/mach-default +- +-# BIGSMP subarch support +-mflags-$(CONFIG_X86_BIGSMP) := -Iinclude/asm-x86/mach-bigsmp +-mcore-$(CONFIG_X86_BIGSMP) := arch/x86/mach-default +- +-#Summit subarch support +-mflags-$(CONFIG_X86_SUMMIT) := -Iinclude/asm-x86/mach-summit +-mcore-$(CONFIG_X86_SUMMIT) := arch/x86/mach-default +- +-# generic subarchitecture +-mflags-$(CONFIG_X86_GENERICARCH) := -Iinclude/asm-x86/mach-generic +-mcore-$(CONFIG_X86_GENERICARCH) := arch/x86/mach-default +-core-$(CONFIG_X86_GENERICARCH) += arch/x86/mach-generic/ +- +-# ES7000 subarch support +-mflags-$(CONFIG_X86_ES7000) := -Iinclude/asm-x86/mach-es7000 +-mcore-$(CONFIG_X86_ES7000) := arch/x86/mach-default +-core-$(CONFIG_X86_ES7000) := arch/x86/mach-es7000/ +- +-# Xen paravirtualization support +-core-$(CONFIG_XEN) += arch/x86/xen/ +- +-# lguest paravirtualization support +-core-$(CONFIG_LGUEST_GUEST) += arch/x86/lguest/ +- +-# default subarch .h files +-mflags-y += -Iinclude/asm-x86/mach-default +- +-head-y := arch/x86/kernel/head_32.o arch/x86/kernel/init_task.o +- +-libs-y += arch/x86/lib/ +-core-y += arch/x86/kernel/ \ +- arch/x86/mm/ \ +- $(mcore-y)/ \ +- arch/x86/crypto/ +-drivers-$(CONFIG_MATH_EMULATION) += arch/x86/math-emu/ +-drivers-$(CONFIG_PCI) += arch/x86/pci/ +-# must be linked after kernel/ +-drivers-$(CONFIG_OPROFILE) += 
arch/x86/oprofile/ +-drivers-$(CONFIG_PM) += arch/x86/power/ +-drivers-$(CONFIG_FB) += arch/x86/video/ +- +-KBUILD_CFLAGS += $(mflags-y) +-KBUILD_AFLAGS += $(mflags-y) +- +-boot := arch/x86/boot +- +-PHONY += zImage bzImage compressed zlilo bzlilo \ +- zdisk bzdisk fdimage fdimage144 fdimage288 isoimage install +- +-all: bzImage +- +-# KBUILD_IMAGE specify target image being built +- KBUILD_IMAGE := $(boot)/bzImage +-zImage zlilo zdisk: KBUILD_IMAGE := arch/x86/boot/zImage +- +-zImage bzImage: vmlinux +- $(Q)$(MAKE) $(build)=$(boot) $(KBUILD_IMAGE) +- $(Q)mkdir -p $(objtree)/arch/i386/boot +- $(Q)ln -fsn ../../x86/boot/bzImage $(objtree)/arch/i386/boot/bzImage +- +-compressed: zImage +- +-zlilo bzlilo: vmlinux +- $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(KBUILD_IMAGE) zlilo +- +-zdisk bzdisk: vmlinux +- $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(KBUILD_IMAGE) zdisk +- +-fdimage fdimage144 fdimage288 isoimage: vmlinux +- $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(KBUILD_IMAGE) $@ +- +-install: +- $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(KBUILD_IMAGE) install +- +-archclean: +- $(Q)rm -rf $(objtree)/arch/i386/boot +- $(Q)$(MAKE) $(clean)=arch/x86/boot +- +-define archhelp +- echo '* bzImage - Compressed kernel image (arch/x86/boot/bzImage)' +- echo ' install - Install kernel using' +- echo ' (your) ~/bin/installkernel or' +- echo ' (distribution) /sbin/installkernel or' +- echo ' install to $$(INSTALL_PATH) and run lilo' +- echo ' bzdisk - Create a boot floppy in /dev/fd0' +- echo ' fdimage - Create a boot floppy image' +- echo ' isoimage - Create a boot CD-ROM image' +-endef +- +-CLEAN_FILES += arch/x86/boot/fdimage \ +- arch/x86/boot/image.iso \ +- arch/x86/boot/mtools.conf +diff --git a/arch/x86/Makefile_64 b/arch/x86/Makefile_64 +deleted file mode 100644 +index a804860..0000000 +--- a/arch/x86/Makefile_64 ++++ /dev/null +@@ -1,144 +0,0 @@ +-# +-# x86_64 Makefile +-# +-# This file is included by the global makefile so that you can add your own +-# architecture-specific flags and dependencies. Remember to do have actions +-# for "archclean" and "archdep" for cleaning up and making dependencies for +-# this architecture +-# +-# This file is subject to the terms and conditions of the GNU General Public +-# License. See the file "COPYING" in the main directory of this archive +-# for more details. +-# +-# Copyright (C) 1994 by Linus Torvalds +-# +-# 19990713 Artur Skawina +-# Added '-march' and '-mpreferred-stack-boundary' support +-# 20000913 Pavel Machek +-# Converted for x86_64 architecture +-# 20010105 Andi Kleen, add IA32 compiler. +-# ....and later removed it again.... +-# +-# $Id: Makefile,v 1.31 2002/03/22 15:56:07 ak Exp $ +- +-# BITS is used as extension for files which are available in a 32 bit +-# and a 64 bit version to simplify shared Makefiles. +-# e.g.: obj-y += foo_$(BITS).o +-BITS := 64 +-export BITS +- +-LDFLAGS := -m elf_x86_64 +-OBJCOPYFLAGS := -O binary -R .note -R .comment -S +-LDFLAGS_vmlinux := +-CHECKFLAGS += -D__x86_64__ -m64 +- +-cflags-y := +-cflags-kernel-y := +-cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8) +-cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona) +-# gcc doesn't support -march=core2 yet as of gcc 4.3, but I hope it +-# will eventually. 
Use -mtune=generic as fallback +-cflags-$(CONFIG_MCORE2) += \ +- $(call cc-option,-march=core2,$(call cc-option,-mtune=generic)) +-cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic) +- +-cflags-y += -m64 +-cflags-y += -mno-red-zone +-cflags-y += -mcmodel=kernel +-cflags-y += -pipe +-cflags-y += -Wno-sign-compare +-cflags-y += -fno-asynchronous-unwind-tables +-ifneq ($(CONFIG_DEBUG_INFO),y) +-# -fweb shrinks the kernel a bit, but the difference is very small +-# it also messes up debugging, so don't use it for now. +-#cflags-y += $(call cc-option,-fweb) +-endif +-# -funit-at-a-time shrinks the kernel .text considerably +-# unfortunately it makes reading oopses harder. +-cflags-y += $(call cc-option,-funit-at-a-time) +-# prevent gcc from generating any FP code by mistake +-cflags-y += $(call cc-option,-mno-sse -mno-mmx -mno-sse2 -mno-3dnow,) +-# this works around some issues with generating unwind tables in older gccs +-# newer gccs do it by default +-cflags-y += -maccumulate-outgoing-args +- +-# do binutils support CFI? +-cflags-y += $(call as-instr,.cfi_startproc\n.cfi_rel_offset rsp${comma}0\n.cfi_endproc,-DCONFIG_AS_CFI=1,) +-KBUILD_AFLAGS += $(call as-instr,.cfi_startproc\n.cfi_rel_offset rsp${comma}0\n.cfi_endproc,-DCONFIG_AS_CFI=1,) +- +-# is .cfi_signal_frame supported too? +-cflags-y += $(call as-instr,.cfi_startproc\n.cfi_signal_frame\n.cfi_endproc,-DCONFIG_AS_CFI_SIGNAL_FRAME=1,) +-KBUILD_AFLAGS += $(call as-instr,.cfi_startproc\n.cfi_signal_frame\n.cfi_endproc,-DCONFIG_AS_CFI_SIGNAL_FRAME=1,) +- +-cflags-$(CONFIG_CC_STACKPROTECTOR) += $(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-x86_64-has-stack-protector.sh "$(CC)" -fstack-protector ) +-cflags-$(CONFIG_CC_STACKPROTECTOR_ALL) += $(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-x86_64-has-stack-protector.sh "$(CC)" -fstack-protector-all ) +- +-KBUILD_CFLAGS += $(cflags-y) +-CFLAGS_KERNEL += $(cflags-kernel-y) +-KBUILD_AFLAGS += -m64 +- +-head-y := arch/x86/kernel/head_64.o arch/x86/kernel/head64.o arch/x86/kernel/init_task.o +- +-libs-y += arch/x86/lib/ +-core-y += arch/x86/kernel/ \ +- arch/x86/mm/ \ +- arch/x86/crypto/ \ +- arch/x86/vdso/ +-core-$(CONFIG_IA32_EMULATION) += arch/x86/ia32/ +-drivers-$(CONFIG_PCI) += arch/x86/pci/ +-drivers-$(CONFIG_OPROFILE) += arch/x86/oprofile/ +- +-boot := arch/x86/boot +- +-PHONY += bzImage bzlilo install archmrproper \ +- fdimage fdimage144 fdimage288 isoimage archclean +- +-#Default target when executing "make" +-all: bzImage +- +-BOOTIMAGE := arch/x86/boot/bzImage +-KBUILD_IMAGE := $(BOOTIMAGE) +- +-bzImage: vmlinux +- $(Q)$(MAKE) $(build)=$(boot) $(BOOTIMAGE) +- $(Q)mkdir -p $(objtree)/arch/x86_64/boot +- $(Q)ln -fsn ../../x86/boot/bzImage $(objtree)/arch/x86_64/boot/bzImage +- +-bzlilo: vmlinux +- $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(BOOTIMAGE) zlilo +- +-bzdisk: vmlinux +- $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(BOOTIMAGE) zdisk +- +-fdimage fdimage144 fdimage288 isoimage: vmlinux +- $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(BOOTIMAGE) $@ +- +-install: vdso_install +- $(Q)$(MAKE) $(build)=$(boot) BOOTIMAGE=$(BOOTIMAGE) $@ +- +-vdso_install: +-ifeq ($(CONFIG_IA32_EMULATION),y) +- $(Q)$(MAKE) $(build)=arch/x86/ia32 $@ +-endif +- $(Q)$(MAKE) $(build)=arch/x86/vdso $@ +- +-archclean: +- $(Q)rm -rf $(objtree)/arch/x86_64/boot +- $(Q)$(MAKE) $(clean)=$(boot) +- +-define archhelp +- echo '* bzImage - Compressed kernel image (arch/x86/boot/bzImage)' +- echo ' install - Install kernel using' +- echo ' (your) ~/bin/installkernel or' +- echo ' (distribution) /sbin/installkernel 
or' +- echo ' install to $$(INSTALL_PATH) and run lilo' +- echo ' bzdisk - Create a boot floppy in /dev/fd0' +- echo ' fdimage - Create a boot floppy image' +- echo ' isoimage - Create a boot CD-ROM image' +-endef +- +-CLEAN_FILES += arch/x86/boot/fdimage \ +- arch/x86/boot/image.iso \ +- arch/x86/boot/mtools.conf +- +- +diff --git a/arch/x86/boot/Makefile b/arch/x86/boot/Makefile +index 7a3116c..349b81a 100644 +--- a/arch/x86/boot/Makefile ++++ b/arch/x86/boot/Makefile +@@ -28,9 +28,11 @@ SVGA_MODE := -DSVGA_MODE=NORMAL_VGA + targets := vmlinux.bin setup.bin setup.elf zImage bzImage + subdir- := compressed + +-setup-y += a20.o apm.o cmdline.o copy.o cpu.o cpucheck.o edd.o ++setup-y += a20.o cmdline.o copy.o cpu.o cpucheck.o edd.o + setup-y += header.o main.o mca.o memory.o pm.o pmjump.o +-setup-y += printf.o string.o tty.o video.o version.o voyager.o ++setup-y += printf.o string.o tty.o video.o version.o ++setup-$(CONFIG_X86_APM_BOOT) += apm.o ++setup-$(CONFIG_X86_VOYAGER) += voyager.o + + # The link order of the video-*.o modules can matter. In particular, + # video-vga.o *must* be listed first, followed by video-vesa.o. +@@ -49,10 +51,7 @@ HOSTCFLAGS_build.o := $(LINUXINCLUDE) + + # How to compile the 16-bit code. Note we always compile for -march=i386, + # that way we can complain to the user if the CPU is insufficient. +-cflags-$(CONFIG_X86_32) := +-cflags-$(CONFIG_X86_64) := -m32 + KBUILD_CFLAGS := $(LINUXINCLUDE) -g -Os -D_SETUP -D__KERNEL__ \ +- $(cflags-y) \ + -Wall -Wstrict-prototypes \ + -march=i386 -mregparm=3 \ + -include $(srctree)/$(src)/code16gcc.h \ +@@ -62,6 +61,7 @@ KBUILD_CFLAGS := $(LINUXINCLUDE) -g -Os -D_SETUP -D__KERNEL__ \ + $(call cc-option, -fno-unit-at-a-time)) \ + $(call cc-option, -fno-stack-protector) \ + $(call cc-option, -mpreferred-stack-boundary=2) ++KBUILD_CFLAGS += $(call cc-option,-m32) + KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__ + + $(obj)/zImage: IMAGE_OFFSET := 0x1000 +diff --git a/arch/x86/boot/apm.c b/arch/x86/boot/apm.c +index eab50c5..c117c7f 100644 +--- a/arch/x86/boot/apm.c ++++ b/arch/x86/boot/apm.c +@@ -19,8 +19,6 @@ + + #include "boot.h" + +-#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE) +- + int query_apm_bios(void) + { + u16 ax, bx, cx, dx, di; +@@ -95,4 +93,3 @@ int query_apm_bios(void) + return 0; + } + +-#endif +diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h +index d2b5adf..7822a49 100644 +--- a/arch/x86/boot/boot.h ++++ b/arch/x86/boot/boot.h +@@ -109,7 +109,7 @@ typedef unsigned int addr_t; + static inline u8 rdfs8(addr_t addr) + { + u8 v; +- asm volatile("movb %%fs:%1,%0" : "=r" (v) : "m" (*(u8 *)addr)); ++ asm volatile("movb %%fs:%1,%0" : "=q" (v) : "m" (*(u8 *)addr)); + return v; + } + static inline u16 rdfs16(addr_t addr) +@@ -127,21 +127,21 @@ static inline u32 rdfs32(addr_t addr) + + static inline void wrfs8(u8 v, addr_t addr) + { +- asm volatile("movb %1,%%fs:%0" : "+m" (*(u8 *)addr) : "r" (v)); ++ asm volatile("movb %1,%%fs:%0" : "+m" (*(u8 *)addr) : "qi" (v)); + } + static inline void wrfs16(u16 v, addr_t addr) + { +- asm volatile("movw %1,%%fs:%0" : "+m" (*(u16 *)addr) : "r" (v)); ++ asm volatile("movw %1,%%fs:%0" : "+m" (*(u16 *)addr) : "ri" (v)); + } + static inline void wrfs32(u32 v, addr_t addr) + { +- asm volatile("movl %1,%%fs:%0" : "+m" (*(u32 *)addr) : "r" (v)); ++ asm volatile("movl %1,%%fs:%0" : "+m" (*(u32 *)addr) : "ri" (v)); + } + + static inline u8 rdgs8(addr_t addr) + { + u8 v; +- asm volatile("movb %%gs:%1,%0" : "=r" (v) : "m" (*(u8 *)addr)); ++ asm volatile("movb %%gs:%1,%0" : "=q" 
(v) : "m" (*(u8 *)addr)); + return v; + } + static inline u16 rdgs16(addr_t addr) +@@ -159,15 +159,15 @@ static inline u32 rdgs32(addr_t addr) + + static inline void wrgs8(u8 v, addr_t addr) + { +- asm volatile("movb %1,%%gs:%0" : "+m" (*(u8 *)addr) : "r" (v)); ++ asm volatile("movb %1,%%gs:%0" : "+m" (*(u8 *)addr) : "qi" (v)); + } + static inline void wrgs16(u16 v, addr_t addr) + { +- asm volatile("movw %1,%%gs:%0" : "+m" (*(u16 *)addr) : "r" (v)); ++ asm volatile("movw %1,%%gs:%0" : "+m" (*(u16 *)addr) : "ri" (v)); + } + static inline void wrgs32(u32 v, addr_t addr) + { +- asm volatile("movl %1,%%gs:%0" : "+m" (*(u32 *)addr) : "r" (v)); ++ asm volatile("movl %1,%%gs:%0" : "+m" (*(u32 *)addr) : "ri" (v)); + } + + /* Note: these only return true/false, not a signed return value! */ +@@ -241,6 +241,7 @@ int query_apm_bios(void); + + /* cmdline.c */ + int cmdline_find_option(const char *option, char *buffer, int bufsize); ++int cmdline_find_option_bool(const char *option); + + /* cpu.c, cpucheck.c */ + int check_cpu(int *cpu_level_ptr, int *req_level_ptr, u32 **err_flags_ptr); +diff --git a/arch/x86/boot/cmdline.c b/arch/x86/boot/cmdline.c +index 34bb778..680408a 100644 +--- a/arch/x86/boot/cmdline.c ++++ b/arch/x86/boot/cmdline.c +@@ -95,3 +95,68 @@ int cmdline_find_option(const char *option, char *buffer, int bufsize) + + return len; + } ++ ++/* ++ * Find a boolean option (like quiet,noapic,nosmp....) ++ * ++ * Returns the position of that option (starts counting with 1) ++ * or 0 on not found ++ */ ++int cmdline_find_option_bool(const char *option) ++{ ++ u32 cmdline_ptr = boot_params.hdr.cmd_line_ptr; ++ addr_t cptr; ++ char c; ++ int pos = 0, wstart = 0; ++ const char *opptr = NULL; ++ enum { ++ st_wordstart, /* Start of word/after whitespace */ ++ st_wordcmp, /* Comparing this word */ ++ st_wordskip, /* Miscompare, skip */ ++ } state = st_wordstart; ++ ++ if (!cmdline_ptr || cmdline_ptr >= 0x100000) ++ return -1; /* No command line, or inaccessible */ ++ ++ cptr = cmdline_ptr & 0xf; ++ set_fs(cmdline_ptr >> 4); ++ ++ while (cptr < 0x10000) { ++ c = rdfs8(cptr++); ++ pos++; ++ ++ switch (state) { ++ case st_wordstart: ++ if (!c) ++ return 0; ++ else if (myisspace(c)) ++ break; ++ ++ state = st_wordcmp; ++ opptr = option; ++ wstart = pos; ++ /* fall through */ ++ ++ case st_wordcmp: ++ if (!*opptr) ++ if (!c || myisspace(c)) ++ return wstart; ++ else ++ state = st_wordskip; ++ else if (!c) ++ return 0; ++ else if (c != *opptr++) ++ state = st_wordskip; ++ break; ++ ++ case st_wordskip: ++ if (!c) ++ return 0; ++ else if (myisspace(c)) ++ state = st_wordstart; ++ break; ++ } ++ } ++ ++ return 0; /* Buffer overrun */ ++} +diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile +index 52c1db8..fe24cea 100644 +--- a/arch/x86/boot/compressed/Makefile ++++ b/arch/x86/boot/compressed/Makefile +@@ -1,5 +1,63 @@ ++# ++# linux/arch/x86/boot/compressed/Makefile ++# ++# create a compressed vmlinux image from the original vmlinux ++# ++ ++targets := vmlinux vmlinux.bin vmlinux.bin.gz head_$(BITS).o misc.o piggy.o ++ ++KBUILD_CFLAGS := -m$(BITS) -D__KERNEL__ $(LINUX_INCLUDE) -O2 ++KBUILD_CFLAGS += -fno-strict-aliasing -fPIC ++cflags-$(CONFIG_X86_64) := -mcmodel=small ++KBUILD_CFLAGS += $(cflags-y) ++KBUILD_CFLAGS += $(call cc-option,-ffreestanding) ++KBUILD_CFLAGS += $(call cc-option,-fno-stack-protector) ++ ++KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__ ++ ++LDFLAGS := -m elf_$(UTS_MACHINE) ++LDFLAGS_vmlinux := -T ++ ++$(obj)/vmlinux: $(src)/vmlinux_$(BITS).lds 
$(obj)/head_$(BITS).o $(obj)/misc.o $(obj)/piggy.o FORCE ++ $(call if_changed,ld) ++ @: ++ ++$(obj)/vmlinux.bin: vmlinux FORCE ++ $(call if_changed,objcopy) ++ ++ + ifeq ($(CONFIG_X86_32),y) +-include ${srctree}/arch/x86/boot/compressed/Makefile_32 ++targets += vmlinux.bin.all vmlinux.relocs ++hostprogs-y := relocs ++ ++quiet_cmd_relocs = RELOCS $@ ++ cmd_relocs = $(obj)/relocs $< > $@;$(obj)/relocs --abs-relocs $< ++$(obj)/vmlinux.relocs: vmlinux $(obj)/relocs FORCE ++ $(call if_changed,relocs) ++ ++vmlinux.bin.all-y := $(obj)/vmlinux.bin ++vmlinux.bin.all-$(CONFIG_RELOCATABLE) += $(obj)/vmlinux.relocs ++quiet_cmd_relocbin = BUILD $@ ++ cmd_relocbin = cat $(filter-out FORCE,$^) > $@ ++$(obj)/vmlinux.bin.all: $(vmlinux.bin.all-y) FORCE ++ $(call if_changed,relocbin) ++ ++ifdef CONFIG_RELOCATABLE ++$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin.all FORCE ++ $(call if_changed,gzip) + else +-include ${srctree}/arch/x86/boot/compressed/Makefile_64 ++$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin FORCE ++ $(call if_changed,gzip) + endif ++LDFLAGS_piggy.o := -r --format binary --oformat elf32-i386 -T ++ ++else ++$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin FORCE ++ $(call if_changed,gzip) ++ ++LDFLAGS_piggy.o := -r --format binary --oformat elf64-x86-64 -T ++endif ++ ++ ++$(obj)/piggy.o: $(obj)/vmlinux.scr $(obj)/vmlinux.bin.gz FORCE ++ $(call if_changed,ld) +diff --git a/arch/x86/boot/compressed/Makefile_32 b/arch/x86/boot/compressed/Makefile_32 +deleted file mode 100644 +index e43ff7c..0000000 +--- a/arch/x86/boot/compressed/Makefile_32 ++++ /dev/null +@@ -1,50 +0,0 @@ +-# +-# linux/arch/x86/boot/compressed/Makefile +-# +-# create a compressed vmlinux image from the original vmlinux +-# +- +-targets := vmlinux vmlinux.bin vmlinux.bin.gz head_32.o misc_32.o piggy.o \ +- vmlinux.bin.all vmlinux.relocs +-EXTRA_AFLAGS := -traditional +- +-LDFLAGS_vmlinux := -T +-hostprogs-y := relocs +- +-KBUILD_CFLAGS := -m32 -D__KERNEL__ $(LINUX_INCLUDE) -O2 \ +- -fno-strict-aliasing -fPIC \ +- $(call cc-option,-ffreestanding) \ +- $(call cc-option,-fno-stack-protector) +-LDFLAGS := -m elf_i386 +- +-$(obj)/vmlinux: $(src)/vmlinux_32.lds $(obj)/head_32.o $(obj)/misc_32.o $(obj)/piggy.o FORCE +- $(call if_changed,ld) +- @: +- +-$(obj)/vmlinux.bin: vmlinux FORCE +- $(call if_changed,objcopy) +- +-quiet_cmd_relocs = RELOCS $@ +- cmd_relocs = $(obj)/relocs $< > $@;$(obj)/relocs --abs-relocs $< +-$(obj)/vmlinux.relocs: vmlinux $(obj)/relocs FORCE +- $(call if_changed,relocs) +- +-vmlinux.bin.all-y := $(obj)/vmlinux.bin +-vmlinux.bin.all-$(CONFIG_RELOCATABLE) += $(obj)/vmlinux.relocs +-quiet_cmd_relocbin = BUILD $@ +- cmd_relocbin = cat $(filter-out FORCE,$^) > $@ +-$(obj)/vmlinux.bin.all: $(vmlinux.bin.all-y) FORCE +- $(call if_changed,relocbin) +- +-ifdef CONFIG_RELOCATABLE +-$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin.all FORCE +- $(call if_changed,gzip) +-else +-$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin FORCE +- $(call if_changed,gzip) +-endif +- +-LDFLAGS_piggy.o := -r --format binary --oformat elf32-i386 -T +- +-$(obj)/piggy.o: $(src)/vmlinux_32.scr $(obj)/vmlinux.bin.gz FORCE +- $(call if_changed,ld) +diff --git a/arch/x86/boot/compressed/Makefile_64 b/arch/x86/boot/compressed/Makefile_64 +deleted file mode 100644 +index 7801e8d..0000000 +--- a/arch/x86/boot/compressed/Makefile_64 ++++ /dev/null +@@ -1,30 +0,0 @@ +-# +-# linux/arch/x86/boot/compressed/Makefile +-# +-# create a compressed vmlinux image from the original vmlinux +-# +- +-targets := vmlinux vmlinux.bin vmlinux.bin.gz head_64.o misc_64.o piggy.o +- +-KBUILD_CFLAGS 
:= -m64 -D__KERNEL__ $(LINUXINCLUDE) -O2 \ +- -fno-strict-aliasing -fPIC -mcmodel=small \ +- $(call cc-option, -ffreestanding) \ +- $(call cc-option, -fno-stack-protector) +-KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__ +-LDFLAGS := -m elf_x86_64 +- +-LDFLAGS_vmlinux := -T +-$(obj)/vmlinux: $(src)/vmlinux_64.lds $(obj)/head_64.o $(obj)/misc_64.o $(obj)/piggy.o FORCE +- $(call if_changed,ld) +- @: +- +-$(obj)/vmlinux.bin: vmlinux FORCE +- $(call if_changed,objcopy) +- +-$(obj)/vmlinux.bin.gz: $(obj)/vmlinux.bin FORCE +- $(call if_changed,gzip) +- +-LDFLAGS_piggy.o := -r --format binary --oformat elf64-x86-64 -T +- +-$(obj)/piggy.o: $(obj)/vmlinux_64.scr $(obj)/vmlinux.bin.gz FORCE +- $(call if_changed,ld) +diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c +new file mode 100644 +index 0000000..8182e32 +--- /dev/null ++++ b/arch/x86/boot/compressed/misc.c +@@ -0,0 +1,413 @@ ++/* ++ * misc.c ++ * ++ * This is a collection of several routines from gzip-1.0.3 ++ * adapted for Linux. ++ * ++ * malloc by Hannu Savolainen 1993 and Matthias Urlichs 1994 ++ * puts by Nick Holloway 1993, better puts by Martin Mares 1995 ++ * High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996 ++ */ ++ ++/* ++ * we have to be careful, because no indirections are allowed here, and ++ * paravirt_ops is a kind of one. As it will only run in baremetal anyway, ++ * we just keep it from happening ++ */ ++#undef CONFIG_PARAVIRT ++#ifdef CONFIG_X86_64 ++#define _LINUX_STRING_H_ 1 ++#define __LINUX_BITMAP_H 1 ++#endif ++ ++#include ++#include ++#include ++#include ++#include ++ ++/* WARNING!! ++ * This code is compiled with -fPIC and it is relocated dynamically ++ * at run time, but no relocation processing is performed. ++ * This means that it is not safe to place pointers in static structures. ++ */ ++ ++/* ++ * Getting to provable safe in place decompression is hard. ++ * Worst case behaviours need to be analyzed. ++ * Background information: ++ * ++ * The file layout is: ++ * magic[2] ++ * method[1] ++ * flags[1] ++ * timestamp[4] ++ * extraflags[1] ++ * os[1] ++ * compressed data blocks[N] ++ * crc[4] orig_len[4] ++ * ++ * resulting in 18 bytes of non compressed data overhead. ++ * ++ * Files divided into blocks ++ * 1 bit (last block flag) ++ * 2 bits (block type) ++ * ++ * 1 block occurs every 32K -1 bytes or when there 50% compression has been achieved. ++ * The smallest block type encoding is always used. ++ * ++ * stored: ++ * 32 bits length in bytes. ++ * ++ * fixed: ++ * magic fixed tree. ++ * symbols. ++ * ++ * dynamic: ++ * dynamic tree encoding. ++ * symbols. ++ * ++ * ++ * The buffer for decompression in place is the length of the ++ * uncompressed data, plus a small amount extra to keep the algorithm safe. ++ * The compressed data is placed at the end of the buffer. The output ++ * pointer is placed at the start of the buffer and the input pointer ++ * is placed where the compressed data starts. Problems will occur ++ * when the output pointer overruns the input pointer. ++ * ++ * The output pointer can only overrun the input pointer if the input ++ * pointer is moving faster than the output pointer. A condition only ++ * triggered by data whose compressed form is larger than the uncompressed ++ * form. ++ * ++ * The worst case at the block level is a growth of the compressed data ++ * of 5 bytes per 32767 bytes. ++ * ++ * The worst case internal to a compressed block is very hard to figure. 
++ * The worst case can at least be boundined by having one bit that represents ++ * 32764 bytes and then all of the rest of the bytes representing the very ++ * very last byte. ++ * ++ * All of which is enough to compute an amount of extra data that is required ++ * to be safe. To avoid problems at the block level allocating 5 extra bytes ++ * per 32767 bytes of data is sufficient. To avoind problems internal to a block ++ * adding an extra 32767 bytes (the worst case uncompressed block size) is ++ * sufficient, to ensure that in the worst case the decompressed data for ++ * block will stop the byte before the compressed data for a block begins. ++ * To avoid problems with the compressed data's meta information an extra 18 ++ * bytes are needed. Leading to the formula: ++ * ++ * extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size. ++ * ++ * Adding 8 bytes per 32K is a bit excessive but much easier to calculate. ++ * Adding 32768 instead of 32767 just makes for round numbers. ++ * Adding the decompressor_size is necessary as it musht live after all ++ * of the data as well. Last I measured the decompressor is about 14K. ++ * 10K of actual data and 4K of bss. ++ * ++ */ ++ ++/* ++ * gzip declarations ++ */ ++ ++#define OF(args) args ++#define STATIC static ++ ++#undef memset ++#undef memcpy ++#define memzero(s, n) memset ((s), 0, (n)) ++ ++typedef unsigned char uch; ++typedef unsigned short ush; ++typedef unsigned long ulg; ++ ++#define WSIZE 0x80000000 /* Window size must be at least 32k, ++ * and a power of two ++ * We don't actually have a window just ++ * a huge output buffer so I report ++ * a 2G windows size, as that should ++ * always be larger than our output buffer. ++ */ ++ ++static uch *inbuf; /* input buffer */ ++static uch *window; /* Sliding window buffer, (and final output buffer) */ ++ ++static unsigned insize; /* valid bytes in inbuf */ ++static unsigned inptr; /* index of next byte to be processed in inbuf */ ++static unsigned outcnt; /* bytes in output buffer */ ++ ++/* gzip flag byte */ ++#define ASCII_FLAG 0x01 /* bit 0 set: file probably ASCII text */ ++#define CONTINUATION 0x02 /* bit 1 set: continuation of multi-part gzip file */ ++#define EXTRA_FIELD 0x04 /* bit 2 set: extra field present */ ++#define ORIG_NAME 0x08 /* bit 3 set: original file name present */ ++#define COMMENT 0x10 /* bit 4 set: file comment present */ ++#define ENCRYPTED 0x20 /* bit 5 set: file is encrypted */ ++#define RESERVED 0xC0 /* bit 6,7: reserved */ ++ ++#define get_byte() (inptr < insize ? 
inbuf[inptr++] : fill_inbuf()) ++ ++/* Diagnostic functions */ ++#ifdef DEBUG ++# define Assert(cond,msg) {if(!(cond)) error(msg);} ++# define Trace(x) fprintf x ++# define Tracev(x) {if (verbose) fprintf x ;} ++# define Tracevv(x) {if (verbose>1) fprintf x ;} ++# define Tracec(c,x) {if (verbose && (c)) fprintf x ;} ++# define Tracecv(c,x) {if (verbose>1 && (c)) fprintf x ;} ++#else ++# define Assert(cond,msg) ++# define Trace(x) ++# define Tracev(x) ++# define Tracevv(x) ++# define Tracec(c,x) ++# define Tracecv(c,x) ++#endif ++ ++static int fill_inbuf(void); ++static void flush_window(void); ++static void error(char *m); ++static void gzip_mark(void **); ++static void gzip_release(void **); ++ ++/* ++ * This is set up by the setup-routine at boot-time ++ */ ++static unsigned char *real_mode; /* Pointer to real-mode data */ ++ ++#define RM_EXT_MEM_K (*(unsigned short *)(real_mode + 0x2)) ++#ifndef STANDARD_MEMORY_BIOS_CALL ++#define RM_ALT_MEM_K (*(unsigned long *)(real_mode + 0x1e0)) ++#endif ++#define RM_SCREEN_INFO (*(struct screen_info *)(real_mode+0)) ++ ++extern unsigned char input_data[]; ++extern int input_len; ++ ++static long bytes_out = 0; ++ ++static void *malloc(int size); ++static void free(void *where); ++ ++static void *memset(void *s, int c, unsigned n); ++static void *memcpy(void *dest, const void *src, unsigned n); ++ ++static void putstr(const char *); ++ ++#ifdef CONFIG_X86_64 ++#define memptr long ++#else ++#define memptr unsigned ++#endif ++ ++static memptr free_mem_ptr; ++static memptr free_mem_end_ptr; ++ ++#ifdef CONFIG_X86_64 ++#define HEAP_SIZE 0x7000 ++#else ++#define HEAP_SIZE 0x4000 ++#endif ++ ++static char *vidmem = (char *)0xb8000; ++static int vidport; ++static int lines, cols; ++ ++#ifdef CONFIG_X86_NUMAQ ++void *xquad_portio; ++#endif ++ ++#include "../../../../lib/inflate.c" ++ ++static void *malloc(int size) ++{ ++ void *p; ++ ++ if (size <0) error("Malloc error"); ++ if (free_mem_ptr <= 0) error("Memory error"); ++ ++ free_mem_ptr = (free_mem_ptr + 3) & ~3; /* Align */ ++ ++ p = (void *)free_mem_ptr; ++ free_mem_ptr += size; ++ ++ if (free_mem_ptr >= free_mem_end_ptr) ++ error("Out of memory"); ++ ++ return p; ++} ++ ++static void free(void *where) ++{ /* Don't care */ ++} ++ ++static void gzip_mark(void **ptr) ++{ ++ *ptr = (void *) free_mem_ptr; ++} ++ ++static void gzip_release(void **ptr) ++{ ++ free_mem_ptr = (memptr) *ptr; ++} ++ ++static void scroll(void) ++{ ++ int i; ++ ++ memcpy ( vidmem, vidmem + cols * 2, ( lines - 1 ) * cols * 2 ); ++ for ( i = ( lines - 1 ) * cols * 2; i < lines * cols * 2; i += 2 ) ++ vidmem[i] = ' '; ++} ++ ++static void putstr(const char *s) ++{ ++ int x,y,pos; ++ char c; ++ ++#ifdef CONFIG_X86_32 ++ if (RM_SCREEN_INFO.orig_video_mode == 0 && lines == 0 && cols == 0) ++ return; ++#endif ++ ++ x = RM_SCREEN_INFO.orig_x; ++ y = RM_SCREEN_INFO.orig_y; ++ ++ while ( ( c = *s++ ) != '\0' ) { ++ if ( c == '\n' ) { ++ x = 0; ++ if ( ++y >= lines ) { ++ scroll(); ++ y--; ++ } ++ } else { ++ vidmem [(x + cols * y) * 2] = c; ++ if ( ++x >= cols ) { ++ x = 0; ++ if ( ++y >= lines ) { ++ scroll(); ++ y--; ++ } ++ } ++ } ++ } ++ ++ RM_SCREEN_INFO.orig_x = x; ++ RM_SCREEN_INFO.orig_y = y; ++ ++ pos = (x + cols * y) * 2; /* Update cursor position */ ++ outb(14, vidport); ++ outb(0xff & (pos >> 9), vidport+1); ++ outb(15, vidport); ++ outb(0xff & (pos >> 1), vidport+1); ++} ++ ++static void* memset(void* s, int c, unsigned n) ++{ ++ int i; ++ char *ss = s; ++ ++ for (i=0;i> 8); ++ } ++ crc = c; ++ bytes_out += (ulg)outcnt; ++ 
outcnt = 0; ++} ++ ++static void error(char *x) ++{ ++ putstr("\n\n"); ++ putstr(x); ++ putstr("\n\n -- System halted"); ++ ++ while (1) ++ asm("hlt"); ++} ++ ++asmlinkage void decompress_kernel(void *rmode, memptr heap, ++ uch *input_data, unsigned long input_len, ++ uch *output) ++{ ++ real_mode = rmode; ++ ++ if (RM_SCREEN_INFO.orig_video_mode == 7) { ++ vidmem = (char *) 0xb0000; ++ vidport = 0x3b4; ++ } else { ++ vidmem = (char *) 0xb8000; ++ vidport = 0x3d4; ++ } ++ ++ lines = RM_SCREEN_INFO.orig_video_lines; ++ cols = RM_SCREEN_INFO.orig_video_cols; ++ ++ window = output; /* Output buffer (Normally at 1M) */ ++ free_mem_ptr = heap; /* Heap */ ++ free_mem_end_ptr = heap + HEAP_SIZE; ++ inbuf = input_data; /* Input buffer */ ++ insize = input_len; ++ inptr = 0; ++ ++#ifdef CONFIG_X86_64 ++ if ((ulg)output & (__KERNEL_ALIGN - 1)) ++ error("Destination address not 2M aligned"); ++ if ((ulg)output >= 0xffffffffffUL) ++ error("Destination address too large"); ++#else ++ if ((u32)output & (CONFIG_PHYSICAL_ALIGN -1)) ++ error("Destination address not CONFIG_PHYSICAL_ALIGN aligned"); ++ if (heap > ((-__PAGE_OFFSET-(512<<20)-1) & 0x7fffffff)) ++ error("Destination address too large"); ++#ifndef CONFIG_RELOCATABLE ++ if ((u32)output != LOAD_PHYSICAL_ADDR) ++ error("Wrong destination address"); ++#endif ++#endif ++ ++ makecrc(); ++ putstr("\nDecompressing Linux... "); ++ gunzip(); ++ putstr("done.\nBooting the kernel.\n"); ++ return; ++} +diff --git a/arch/x86/boot/compressed/misc_32.c b/arch/x86/boot/compressed/misc_32.c +deleted file mode 100644 +index b74d60d..0000000 +--- a/arch/x86/boot/compressed/misc_32.c ++++ /dev/null +@@ -1,382 +0,0 @@ +-/* +- * misc.c +- * +- * This is a collection of several routines from gzip-1.0.3 +- * adapted for Linux. +- * +- * malloc by Hannu Savolainen 1993 and Matthias Urlichs 1994 +- * puts by Nick Holloway 1993, better puts by Martin Mares 1995 +- * High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996 +- */ +- +-#undef CONFIG_PARAVIRT +-#include +-#include +-#include +-#include +-#include +-#include +- +-/* WARNING!! +- * This code is compiled with -fPIC and it is relocated dynamically +- * at run time, but no relocation processing is performed. +- * This means that it is not safe to place pointers in static structures. +- */ +- +-/* +- * Getting to provable safe in place decompression is hard. +- * Worst case behaviours need to be analyzed. +- * Background information: +- * +- * The file layout is: +- * magic[2] +- * method[1] +- * flags[1] +- * timestamp[4] +- * extraflags[1] +- * os[1] +- * compressed data blocks[N] +- * crc[4] orig_len[4] +- * +- * resulting in 18 bytes of non compressed data overhead. +- * +- * Files divided into blocks +- * 1 bit (last block flag) +- * 2 bits (block type) +- * +- * 1 block occurs every 32K -1 bytes or when there 50% compression has been achieved. +- * The smallest block type encoding is always used. +- * +- * stored: +- * 32 bits length in bytes. +- * +- * fixed: +- * magic fixed tree. +- * symbols. +- * +- * dynamic: +- * dynamic tree encoding. +- * symbols. +- * +- * +- * The buffer for decompression in place is the length of the +- * uncompressed data, plus a small amount extra to keep the algorithm safe. +- * The compressed data is placed at the end of the buffer. The output +- * pointer is placed at the start of the buffer and the input pointer +- * is placed where the compressed data starts. Problems will occur +- * when the output pointer overruns the input pointer. 
+- * +- * The output pointer can only overrun the input pointer if the input +- * pointer is moving faster than the output pointer. A condition only +- * triggered by data whose compressed form is larger than the uncompressed +- * form. +- * +- * The worst case at the block level is a growth of the compressed data +- * of 5 bytes per 32767 bytes. +- * +- * The worst case internal to a compressed block is very hard to figure. +- * The worst case can at least be boundined by having one bit that represents +- * 32764 bytes and then all of the rest of the bytes representing the very +- * very last byte. +- * +- * All of which is enough to compute an amount of extra data that is required +- * to be safe. To avoid problems at the block level allocating 5 extra bytes +- * per 32767 bytes of data is sufficient. To avoind problems internal to a block +- * adding an extra 32767 bytes (the worst case uncompressed block size) is +- * sufficient, to ensure that in the worst case the decompressed data for +- * block will stop the byte before the compressed data for a block begins. +- * To avoid problems with the compressed data's meta information an extra 18 +- * bytes are needed. Leading to the formula: +- * +- * extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size. +- * +- * Adding 8 bytes per 32K is a bit excessive but much easier to calculate. +- * Adding 32768 instead of 32767 just makes for round numbers. +- * Adding the decompressor_size is necessary as it musht live after all +- * of the data as well. Last I measured the decompressor is about 14K. +- * 10K of actual data and 4K of bss. +- * +- */ +- +-/* +- * gzip declarations +- */ +- +-#define OF(args) args +-#define STATIC static +- +-#undef memset +-#undef memcpy +-#define memzero(s, n) memset ((s), 0, (n)) +- +-typedef unsigned char uch; +-typedef unsigned short ush; +-typedef unsigned long ulg; +- +-#define WSIZE 0x80000000 /* Window size must be at least 32k, +- * and a power of two +- * We don't actually have a window just +- * a huge output buffer so I report +- * a 2G windows size, as that should +- * always be larger than our output buffer. +- */ +- +-static uch *inbuf; /* input buffer */ +-static uch *window; /* Sliding window buffer, (and final output buffer) */ +- +-static unsigned insize; /* valid bytes in inbuf */ +-static unsigned inptr; /* index of next byte to be processed in inbuf */ +-static unsigned outcnt; /* bytes in output buffer */ +- +-/* gzip flag byte */ +-#define ASCII_FLAG 0x01 /* bit 0 set: file probably ASCII text */ +-#define CONTINUATION 0x02 /* bit 1 set: continuation of multi-part gzip file */ +-#define EXTRA_FIELD 0x04 /* bit 2 set: extra field present */ +-#define ORIG_NAME 0x08 /* bit 3 set: original file name present */ +-#define COMMENT 0x10 /* bit 4 set: file comment present */ +-#define ENCRYPTED 0x20 /* bit 5 set: file is encrypted */ +-#define RESERVED 0xC0 /* bit 6,7: reserved */ +- +-#define get_byte() (inptr < insize ? 
inbuf[inptr++] : fill_inbuf()) +- +-/* Diagnostic functions */ +-#ifdef DEBUG +-# define Assert(cond,msg) {if(!(cond)) error(msg);} +-# define Trace(x) fprintf x +-# define Tracev(x) {if (verbose) fprintf x ;} +-# define Tracevv(x) {if (verbose>1) fprintf x ;} +-# define Tracec(c,x) {if (verbose && (c)) fprintf x ;} +-# define Tracecv(c,x) {if (verbose>1 && (c)) fprintf x ;} +-#else +-# define Assert(cond,msg) +-# define Trace(x) +-# define Tracev(x) +-# define Tracevv(x) +-# define Tracec(c,x) +-# define Tracecv(c,x) +-#endif +- +-static int fill_inbuf(void); +-static void flush_window(void); +-static void error(char *m); +-static void gzip_mark(void **); +-static void gzip_release(void **); +- +-/* +- * This is set up by the setup-routine at boot-time +- */ +-static unsigned char *real_mode; /* Pointer to real-mode data */ +- +-#define RM_EXT_MEM_K (*(unsigned short *)(real_mode + 0x2)) +-#ifndef STANDARD_MEMORY_BIOS_CALL +-#define RM_ALT_MEM_K (*(unsigned long *)(real_mode + 0x1e0)) +-#endif +-#define RM_SCREEN_INFO (*(struct screen_info *)(real_mode+0)) +- +-extern unsigned char input_data[]; +-extern int input_len; +- +-static long bytes_out = 0; +- +-static void *malloc(int size); +-static void free(void *where); +- +-static void *memset(void *s, int c, unsigned n); +-static void *memcpy(void *dest, const void *src, unsigned n); +- +-static void putstr(const char *); +- +-static unsigned long free_mem_ptr; +-static unsigned long free_mem_end_ptr; +- +-#define HEAP_SIZE 0x4000 +- +-static char *vidmem = (char *)0xb8000; +-static int vidport; +-static int lines, cols; +- +-#ifdef CONFIG_X86_NUMAQ +-void *xquad_portio; +-#endif +- +-#include "../../../../lib/inflate.c" +- +-static void *malloc(int size) +-{ +- void *p; +- +- if (size <0) error("Malloc error"); +- if (free_mem_ptr <= 0) error("Memory error"); +- +- free_mem_ptr = (free_mem_ptr + 3) & ~3; /* Align */ +- +- p = (void *)free_mem_ptr; +- free_mem_ptr += size; +- +- if (free_mem_ptr >= free_mem_end_ptr) +- error("Out of memory"); +- +- return p; +-} +- +-static void free(void *where) +-{ /* Don't care */ +-} +- +-static void gzip_mark(void **ptr) +-{ +- *ptr = (void *) free_mem_ptr; +-} +- +-static void gzip_release(void **ptr) +-{ +- free_mem_ptr = (unsigned long) *ptr; +-} +- +-static void scroll(void) +-{ +- int i; +- +- memcpy ( vidmem, vidmem + cols * 2, ( lines - 1 ) * cols * 2 ); +- for ( i = ( lines - 1 ) * cols * 2; i < lines * cols * 2; i += 2 ) +- vidmem[i] = ' '; +-} +- +-static void putstr(const char *s) +-{ +- int x,y,pos; +- char c; +- +- if (RM_SCREEN_INFO.orig_video_mode == 0 && lines == 0 && cols == 0) +- return; +- +- x = RM_SCREEN_INFO.orig_x; +- y = RM_SCREEN_INFO.orig_y; +- +- while ( ( c = *s++ ) != '\0' ) { +- if ( c == '\n' ) { +- x = 0; +- if ( ++y >= lines ) { +- scroll(); +- y--; +- } +- } else { +- vidmem [ ( x + cols * y ) * 2 ] = c; +- if ( ++x >= cols ) { +- x = 0; +- if ( ++y >= lines ) { +- scroll(); +- y--; +- } +- } +- } +- } +- +- RM_SCREEN_INFO.orig_x = x; +- RM_SCREEN_INFO.orig_y = y; +- +- pos = (x + cols * y) * 2; /* Update cursor position */ +- outb_p(14, vidport); +- outb_p(0xff & (pos >> 9), vidport+1); +- outb_p(15, vidport); +- outb_p(0xff & (pos >> 1), vidport+1); +-} +- +-static void* memset(void* s, int c, unsigned n) +-{ +- int i; +- char *ss = (char*)s; +- +- for (i=0;i> 8); +- } +- crc = c; +- bytes_out += (ulg)outcnt; +- outcnt = 0; +-} +- +-static void error(char *x) +-{ +- putstr("\n\n"); +- putstr(x); +- putstr("\n\n -- System halted"); +- +- while(1); /* Halt */ +-} +- 
+-asmlinkage void decompress_kernel(void *rmode, unsigned long end, +- uch *input_data, unsigned long input_len, uch *output) +-{ +- real_mode = rmode; +- +- if (RM_SCREEN_INFO.orig_video_mode == 7) { +- vidmem = (char *) 0xb0000; +- vidport = 0x3b4; +- } else { +- vidmem = (char *) 0xb8000; +- vidport = 0x3d4; +- } +- +- lines = RM_SCREEN_INFO.orig_video_lines; +- cols = RM_SCREEN_INFO.orig_video_cols; +- +- window = output; /* Output buffer (Normally at 1M) */ +- free_mem_ptr = end; /* Heap */ +- free_mem_end_ptr = end + HEAP_SIZE; +- inbuf = input_data; /* Input buffer */ +- insize = input_len; +- inptr = 0; +- +- if ((u32)output & (CONFIG_PHYSICAL_ALIGN -1)) +- error("Destination address not CONFIG_PHYSICAL_ALIGN aligned"); +- if (end > ((-__PAGE_OFFSET-(512 <<20)-1) & 0x7fffffff)) +- error("Destination address too large"); +-#ifndef CONFIG_RELOCATABLE +- if ((u32)output != LOAD_PHYSICAL_ADDR) +- error("Wrong destination address"); +-#endif +- +- makecrc(); +- putstr("Uncompressing Linux... "); +- gunzip(); +- putstr("Ok, booting the kernel.\n"); +- return; +-} +diff --git a/arch/x86/boot/compressed/misc_64.c b/arch/x86/boot/compressed/misc_64.c +deleted file mode 100644 +index 6ea015a..0000000 +--- a/arch/x86/boot/compressed/misc_64.c ++++ /dev/null +@@ -1,371 +0,0 @@ +-/* +- * misc.c +- * +- * This is a collection of several routines from gzip-1.0.3 +- * adapted for Linux. +- * +- * malloc by Hannu Savolainen 1993 and Matthias Urlichs 1994 +- * puts by Nick Holloway 1993, better puts by Martin Mares 1995 +- * High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996 +- */ +- +-#define _LINUX_STRING_H_ 1 +-#define __LINUX_BITMAP_H 1 +- +-#include +-#include +-#include +-#include +- +-/* WARNING!! +- * This code is compiled with -fPIC and it is relocated dynamically +- * at run time, but no relocation processing is performed. +- * This means that it is not safe to place pointers in static structures. +- */ +- +-/* +- * Getting to provable safe in place decompression is hard. +- * Worst case behaviours need to be analyzed. +- * Background information: +- * +- * The file layout is: +- * magic[2] +- * method[1] +- * flags[1] +- * timestamp[4] +- * extraflags[1] +- * os[1] +- * compressed data blocks[N] +- * crc[4] orig_len[4] +- * +- * resulting in 18 bytes of non compressed data overhead. +- * +- * Files divided into blocks +- * 1 bit (last block flag) +- * 2 bits (block type) +- * +- * 1 block occurs every 32K -1 bytes or when there 50% compression has been achieved. +- * The smallest block type encoding is always used. +- * +- * stored: +- * 32 bits length in bytes. +- * +- * fixed: +- * magic fixed tree. +- * symbols. +- * +- * dynamic: +- * dynamic tree encoding. +- * symbols. +- * +- * +- * The buffer for decompression in place is the length of the +- * uncompressed data, plus a small amount extra to keep the algorithm safe. +- * The compressed data is placed at the end of the buffer. The output +- * pointer is placed at the start of the buffer and the input pointer +- * is placed where the compressed data starts. Problems will occur +- * when the output pointer overruns the input pointer. +- * +- * The output pointer can only overrun the input pointer if the input +- * pointer is moving faster than the output pointer. A condition only +- * triggered by data whose compressed form is larger than the uncompressed +- * form. +- * +- * The worst case at the block level is a growth of the compressed data +- * of 5 bytes per 32767 bytes. 
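[Editor's note, not part of the patch] The deleted misc_64.c comment above (continued below) derives how much slack in-place gunzip needs so the output pointer never overruns the input pointer, and ends in the formula extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size quoted further down. A minimal sketch of that arithmetic, assuming the figures the comment gives (5 bytes of block overhead per 32767 bytes, one worst-case 32 KiB block, 18 bytes of gzip framing, plus the decompressor image); the helper name is illustrative:

    #include <stddef.h>

    /* Mirrors: extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size */
    static size_t inplace_extra_bytes(size_t uncompressed_size, size_t decompressor_size)
    {
            /* >> 12 adds 1 byte per 4 KiB, i.e. 8 per 32 KiB, which over-covers
             * the true 5/32767 worst case but is cheaper to compute. */
            return (uncompressed_size >> 12) + 32768 + 18 + decompressor_size;
    }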
+- * +- * The worst case internal to a compressed block is very hard to figure. +- * The worst case can at least be boundined by having one bit that represents +- * 32764 bytes and then all of the rest of the bytes representing the very +- * very last byte. +- * +- * All of which is enough to compute an amount of extra data that is required +- * to be safe. To avoid problems at the block level allocating 5 extra bytes +- * per 32767 bytes of data is sufficient. To avoind problems internal to a block +- * adding an extra 32767 bytes (the worst case uncompressed block size) is +- * sufficient, to ensure that in the worst case the decompressed data for +- * block will stop the byte before the compressed data for a block begins. +- * To avoid problems with the compressed data's meta information an extra 18 +- * bytes are needed. Leading to the formula: +- * +- * extra_bytes = (uncompressed_size >> 12) + 32768 + 18 + decompressor_size. +- * +- * Adding 8 bytes per 32K is a bit excessive but much easier to calculate. +- * Adding 32768 instead of 32767 just makes for round numbers. +- * Adding the decompressor_size is necessary as it musht live after all +- * of the data as well. Last I measured the decompressor is about 14K. +- * 10K of actual data and 4K of bss. +- * +- */ +- +-/* +- * gzip declarations +- */ +- +-#define OF(args) args +-#define STATIC static +- +-#undef memset +-#undef memcpy +-#define memzero(s, n) memset ((s), 0, (n)) +- +-typedef unsigned char uch; +-typedef unsigned short ush; +-typedef unsigned long ulg; +- +-#define WSIZE 0x80000000 /* Window size must be at least 32k, +- * and a power of two +- * We don't actually have a window just +- * a huge output buffer so I report +- * a 2G windows size, as that should +- * always be larger than our output buffer. +- */ +- +-static uch *inbuf; /* input buffer */ +-static uch *window; /* Sliding window buffer, (and final output buffer) */ +- +-static unsigned insize; /* valid bytes in inbuf */ +-static unsigned inptr; /* index of next byte to be processed in inbuf */ +-static unsigned outcnt; /* bytes in output buffer */ +- +-/* gzip flag byte */ +-#define ASCII_FLAG 0x01 /* bit 0 set: file probably ASCII text */ +-#define CONTINUATION 0x02 /* bit 1 set: continuation of multi-part gzip file */ +-#define EXTRA_FIELD 0x04 /* bit 2 set: extra field present */ +-#define ORIG_NAME 0x08 /* bit 3 set: original file name present */ +-#define COMMENT 0x10 /* bit 4 set: file comment present */ +-#define ENCRYPTED 0x20 /* bit 5 set: file is encrypted */ +-#define RESERVED 0xC0 /* bit 6,7: reserved */ +- +-#define get_byte() (inptr < insize ? 
inbuf[inptr++] : fill_inbuf()) +- +-/* Diagnostic functions */ +-#ifdef DEBUG +-# define Assert(cond,msg) {if(!(cond)) error(msg);} +-# define Trace(x) fprintf x +-# define Tracev(x) {if (verbose) fprintf x ;} +-# define Tracevv(x) {if (verbose>1) fprintf x ;} +-# define Tracec(c,x) {if (verbose && (c)) fprintf x ;} +-# define Tracecv(c,x) {if (verbose>1 && (c)) fprintf x ;} +-#else +-# define Assert(cond,msg) +-# define Trace(x) +-# define Tracev(x) +-# define Tracevv(x) +-# define Tracec(c,x) +-# define Tracecv(c,x) +-#endif +- +-static int fill_inbuf(void); +-static void flush_window(void); +-static void error(char *m); +-static void gzip_mark(void **); +-static void gzip_release(void **); +- +-/* +- * This is set up by the setup-routine at boot-time +- */ +-static unsigned char *real_mode; /* Pointer to real-mode data */ +- +-#define RM_EXT_MEM_K (*(unsigned short *)(real_mode + 0x2)) +-#ifndef STANDARD_MEMORY_BIOS_CALL +-#define RM_ALT_MEM_K (*(unsigned long *)(real_mode + 0x1e0)) +-#endif +-#define RM_SCREEN_INFO (*(struct screen_info *)(real_mode+0)) +- +-extern unsigned char input_data[]; +-extern int input_len; +- +-static long bytes_out = 0; +- +-static void *malloc(int size); +-static void free(void *where); +- +-static void *memset(void *s, int c, unsigned n); +-static void *memcpy(void *dest, const void *src, unsigned n); +- +-static void putstr(const char *); +- +-static long free_mem_ptr; +-static long free_mem_end_ptr; +- +-#define HEAP_SIZE 0x7000 +- +-static char *vidmem = (char *)0xb8000; +-static int vidport; +-static int lines, cols; +- +-#include "../../../../lib/inflate.c" +- +-static void *malloc(int size) +-{ +- void *p; +- +- if (size <0) error("Malloc error"); +- if (free_mem_ptr <= 0) error("Memory error"); +- +- free_mem_ptr = (free_mem_ptr + 3) & ~3; /* Align */ +- +- p = (void *)free_mem_ptr; +- free_mem_ptr += size; +- +- if (free_mem_ptr >= free_mem_end_ptr) +- error("Out of memory"); +- +- return p; +-} +- +-static void free(void *where) +-{ /* Don't care */ +-} +- +-static void gzip_mark(void **ptr) +-{ +- *ptr = (void *) free_mem_ptr; +-} +- +-static void gzip_release(void **ptr) +-{ +- free_mem_ptr = (long) *ptr; +-} +- +-static void scroll(void) +-{ +- int i; +- +- memcpy ( vidmem, vidmem + cols * 2, ( lines - 1 ) * cols * 2 ); +- for ( i = ( lines - 1 ) * cols * 2; i < lines * cols * 2; i += 2 ) +- vidmem[i] = ' '; +-} +- +-static void putstr(const char *s) +-{ +- int x,y,pos; +- char c; +- +- x = RM_SCREEN_INFO.orig_x; +- y = RM_SCREEN_INFO.orig_y; +- +- while ( ( c = *s++ ) != '\0' ) { +- if ( c == '\n' ) { +- x = 0; +- if ( ++y >= lines ) { +- scroll(); +- y--; +- } +- } else { +- vidmem [ ( x + cols * y ) * 2 ] = c; +- if ( ++x >= cols ) { +- x = 0; +- if ( ++y >= lines ) { +- scroll(); +- y--; +- } +- } +- } +- } +- +- RM_SCREEN_INFO.orig_x = x; +- RM_SCREEN_INFO.orig_y = y; +- +- pos = (x + cols * y) * 2; /* Update cursor position */ +- outb_p(14, vidport); +- outb_p(0xff & (pos >> 9), vidport+1); +- outb_p(15, vidport); +- outb_p(0xff & (pos >> 1), vidport+1); +-} +- +-static void* memset(void* s, int c, unsigned n) +-{ +- int i; +- char *ss = (char*)s; +- +- for (i=0;i> 8); +- } +- crc = c; +- bytes_out += (ulg)outcnt; +- outcnt = 0; +-} +- +-static void error(char *x) +-{ +- putstr("\n\n"); +- putstr(x); +- putstr("\n\n -- System halted"); +- +- while(1); /* Halt */ +-} +- +-asmlinkage void decompress_kernel(void *rmode, unsigned long heap, +- uch *input_data, unsigned long input_len, uch *output) +-{ +- real_mode = rmode; +- +- if 
(RM_SCREEN_INFO.orig_video_mode == 7) { +- vidmem = (char *) 0xb0000; +- vidport = 0x3b4; +- } else { +- vidmem = (char *) 0xb8000; +- vidport = 0x3d4; +- } +- +- lines = RM_SCREEN_INFO.orig_video_lines; +- cols = RM_SCREEN_INFO.orig_video_cols; +- +- window = output; /* Output buffer (Normally at 1M) */ +- free_mem_ptr = heap; /* Heap */ +- free_mem_end_ptr = heap + HEAP_SIZE; +- inbuf = input_data; /* Input buffer */ +- insize = input_len; +- inptr = 0; +- +- if ((ulg)output & (__KERNEL_ALIGN - 1)) +- error("Destination address not 2M aligned"); +- if ((ulg)output >= 0xffffffffffUL) +- error("Destination address too large"); +- +- makecrc(); +- putstr(".\nDecompressing Linux..."); +- gunzip(); +- putstr("done.\nBooting the kernel.\n"); +- return; +-} +diff --git a/arch/x86/boot/compressed/relocs.c b/arch/x86/boot/compressed/relocs.c +index 7a0d00b..d01ea42 100644 +--- a/arch/x86/boot/compressed/relocs.c ++++ b/arch/x86/boot/compressed/relocs.c +@@ -27,11 +27,6 @@ static unsigned long *relocs; + * absolute relocations present w.r.t these symbols. + */ + static const char* safe_abs_relocs[] = { +- "__kernel_vsyscall", +- "__kernel_rt_sigreturn", +- "__kernel_sigreturn", +- "SYSENTER_RETURN", +- "VDSO_NOTE_MASK", + "xen_irq_disable_direct_reloc", + "xen_save_fl_direct_reloc", + }; +@@ -45,6 +40,8 @@ static int is_safe_abs_reloc(const char* sym_name) + /* Match found */ + return 1; + } ++ if (strncmp(sym_name, "VDSO", 4) == 0) ++ return 1; + if (strncmp(sym_name, "__crc_", 6) == 0) + return 1; + return 0; +diff --git a/arch/x86/boot/compressed/vmlinux.scr b/arch/x86/boot/compressed/vmlinux.scr +new file mode 100644 +index 0000000..f02382a +--- /dev/null ++++ b/arch/x86/boot/compressed/vmlinux.scr +@@ -0,0 +1,10 @@ ++SECTIONS ++{ ++ .rodata.compressed : { ++ input_len = .; ++ LONG(input_data_end - input_data) input_data = .; ++ *(.data) ++ output_len = . - 4; ++ input_data_end = .; ++ } ++} +diff --git a/arch/x86/boot/compressed/vmlinux_32.lds b/arch/x86/boot/compressed/vmlinux_32.lds +index cc4854f..bb3c483 100644 +--- a/arch/x86/boot/compressed/vmlinux_32.lds ++++ b/arch/x86/boot/compressed/vmlinux_32.lds +@@ -3,17 +3,17 @@ OUTPUT_ARCH(i386) + ENTRY(startup_32) + SECTIONS + { +- /* Be careful parts of head.S assume startup_32 is at +- * address 0. ++ /* Be careful parts of head_32.S assume startup_32 is at ++ * address 0. + */ +- . = 0 ; ++ . = 0; + .text.head : { + _head = . ; + *(.text.head) + _ehead = . ; + } +- .data.compressed : { +- *(.data.compressed) ++ .rodata.compressed : { ++ *(.rodata.compressed) + } + .text : { + _text = .; /* Text */ +diff --git a/arch/x86/boot/compressed/vmlinux_32.scr b/arch/x86/boot/compressed/vmlinux_32.scr +deleted file mode 100644 +index 707a88f..0000000 +--- a/arch/x86/boot/compressed/vmlinux_32.scr ++++ /dev/null +@@ -1,10 +0,0 @@ +-SECTIONS +-{ +- .data.compressed : { +- input_len = .; +- LONG(input_data_end - input_data) input_data = .; +- *(.data) +- output_len = . - 4; +- input_data_end = .; +- } +-} +diff --git a/arch/x86/boot/compressed/vmlinux_64.lds b/arch/x86/boot/compressed/vmlinux_64.lds +index 94c13e5..f6e5b44 100644 +--- a/arch/x86/boot/compressed/vmlinux_64.lds ++++ b/arch/x86/boot/compressed/vmlinux_64.lds +@@ -3,15 +3,19 @@ OUTPUT_ARCH(i386:x86-64) + ENTRY(startup_64) + SECTIONS + { +- /* Be careful parts of head.S assume startup_32 is at +- * address 0. ++ /* Be careful parts of head_64.S assume startup_64 is at ++ * address 0. + */ + . = 0; +- .text : { ++ .text.head : { + _head = . ; + *(.text.head) + _ehead = . 
; +- *(.text.compressed) ++ } ++ .rodata.compressed : { ++ *(.rodata.compressed) ++ } ++ .text : { + _text = .; /* Text */ + *(.text) + *(.text.*) +diff --git a/arch/x86/boot/compressed/vmlinux_64.scr b/arch/x86/boot/compressed/vmlinux_64.scr +deleted file mode 100644 +index bd1429c..0000000 +--- a/arch/x86/boot/compressed/vmlinux_64.scr ++++ /dev/null +@@ -1,10 +0,0 @@ +-SECTIONS +-{ +- .text.compressed : { +- input_len = .; +- LONG(input_data_end - input_data) input_data = .; +- *(.data) +- output_len = . - 4; +- input_data_end = .; +- } +-} +diff --git a/arch/x86/boot/edd.c b/arch/x86/boot/edd.c +index bd138e4..8721dc4 100644 +--- a/arch/x86/boot/edd.c ++++ b/arch/x86/boot/edd.c +@@ -129,6 +129,7 @@ void query_edd(void) + char eddarg[8]; + int do_mbr = 1; + int do_edd = 1; ++ int be_quiet; + int devno; + struct edd_info ei, *edp; + u32 *mbrptr; +@@ -140,12 +141,21 @@ void query_edd(void) + do_edd = 0; + } + ++ be_quiet = cmdline_find_option_bool("quiet"); ++ + edp = boot_params.eddbuf; + mbrptr = boot_params.edd_mbr_sig_buffer; + + if (!do_edd) + return; + ++ /* Bugs in OnBoard or AddOnCards Bios may hang the EDD probe, ++ * so give a hint if this happens. ++ */ ++ ++ if (!be_quiet) ++ printf("Probing EDD (edd=off to disable)... "); ++ + for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) { + /* + * Scan the BIOS-supported hard disks and query EDD +@@ -162,6 +172,9 @@ void query_edd(void) + if (do_mbr && !read_mbr_sig(devno, &ei, mbrptr++)) + boot_params.edd_mbr_sig_buf_entries = devno-0x80+1; + } ++ ++ if (!be_quiet) ++ printf("ok\n"); + } + + #endif +diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S +index 4cc5b04..64ad901 100644 +--- a/arch/x86/boot/header.S ++++ b/arch/x86/boot/header.S +@@ -195,10 +195,13 @@ cmd_line_ptr: .long 0 # (Header version 0x0202 or later) + # can be located anywhere in + # low memory 0x10000 or higher. + +-ramdisk_max: .long (-__PAGE_OFFSET-(512 << 20)-1) & 0x7fffffff ++ramdisk_max: .long 0x7fffffff + # (Header version 0x0203 or later) + # The highest safe address for + # the contents of an initrd ++ # The current kernel allows up to 4 GB, ++ # but leave it at 2 GB to avoid ++ # possible bootloader bugs. + + kernel_alignment: .long CONFIG_PHYSICAL_ALIGN #physical addr alignment + #required for protected mode +diff --git a/arch/x86/boot/main.c b/arch/x86/boot/main.c +index 1f95750..7828da5 100644 +--- a/arch/x86/boot/main.c ++++ b/arch/x86/boot/main.c +@@ -100,20 +100,32 @@ static void set_bios_mode(void) + #endif + } + +-void main(void) ++static void init_heap(void) + { +- /* First, copy the boot header into the "zeropage" */ +- copy_boot_params(); ++ char *stack_end; + +- /* End of heap check */ + if (boot_params.hdr.loadflags & CAN_USE_HEAP) { +- heap_end = (char *)(boot_params.hdr.heap_end_ptr +- +0x200-STACK_SIZE); ++ asm("leal %P1(%%esp),%0" ++ : "=r" (stack_end) : "i" (-STACK_SIZE)); ++ ++ heap_end = (char *) ++ ((size_t)boot_params.hdr.heap_end_ptr + 0x200); ++ if (heap_end > stack_end) ++ heap_end = stack_end; + } else { + /* Boot protocol 2.00 only, no heap available */ + puts("WARNING: Ancient bootloader, some functionality " + "may be limited!\n"); + } ++} ++ ++void main(void) ++{ ++ /* First, copy the boot header into the "zeropage" */ ++ copy_boot_params(); ++ ++ /* End of heap check */ ++ init_heap(); + + /* Make sure we have all the proper CPU support */ + if (validate_cpu()) { +@@ -131,9 +143,6 @@ void main(void) + /* Set keyboard repeat rate (why?) 
*/ + keyboard_set_repeat(); + +- /* Set the video mode */ +- set_video(); +- + /* Query MCA information */ + query_mca(); + +@@ -154,6 +163,10 @@ void main(void) + #if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE) + query_edd(); + #endif ++ ++ /* Set the video mode */ ++ set_video(); ++ + /* Do the last things and invoke protected mode */ + go_to_protected_mode(); + } +diff --git a/arch/x86/boot/pm.c b/arch/x86/boot/pm.c +index 09fb342..1a0f936 100644 +--- a/arch/x86/boot/pm.c ++++ b/arch/x86/boot/pm.c +@@ -104,7 +104,7 @@ static void reset_coprocessor(void) + (((u64)(base & 0xff000000) << 32) | \ + ((u64)flags << 40) | \ + ((u64)(limit & 0x00ff0000) << 32) | \ +- ((u64)(base & 0x00ffff00) << 16) | \ ++ ((u64)(base & 0x00ffffff) << 16) | \ + ((u64)(limit & 0x0000ffff))) + + struct gdt_ptr { +@@ -121,6 +121,10 @@ static void setup_gdt(void) + [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff), + /* DS: data, read/write, 4 GB, base 0 */ + [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff), ++ /* TSS: 32-bit tss, 104 bytes, base 4096 */ ++ /* We only have a TSS here to keep Intel VT happy; ++ we don't actually use it for anything. */ ++ [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103), + }; + /* Xen HVM incorrectly stores a pointer to the gdt_ptr, instead + of the gdt_ptr contents. Thus, make it static so it will +diff --git a/arch/x86/boot/pmjump.S b/arch/x86/boot/pmjump.S +index fa6bed1..f5402d5 100644 +--- a/arch/x86/boot/pmjump.S ++++ b/arch/x86/boot/pmjump.S +@@ -15,6 +15,7 @@ + */ + + #include ++#include + #include + + .text +@@ -29,28 +30,55 @@ + */ + protected_mode_jump: + movl %edx, %esi # Pointer to boot_params table +- movl %eax, 2f # Patch ljmpl instruction ++ ++ xorl %ebx, %ebx ++ movw %cs, %bx ++ shll $4, %ebx ++ addl %ebx, 2f + + movw $__BOOT_DS, %cx +- xorl %ebx, %ebx # Per the 32-bit boot protocol +- xorl %ebp, %ebp # Per the 32-bit boot protocol +- xorl %edi, %edi # Per the 32-bit boot protocol ++ movw $__BOOT_TSS, %di + + movl %cr0, %edx +- orb $1, %dl # Protected mode (PE) bit ++ orb $X86_CR0_PE, %dl # Protected mode + movl %edx, %cr0 + jmp 1f # Short jump to serialize on 386/486 + 1: + +- movw %cx, %ds +- movw %cx, %es +- movw %cx, %fs +- movw %cx, %gs +- movw %cx, %ss +- +- # Jump to the 32-bit entrypoint ++ # Transition to 32-bit mode + .byte 0x66, 0xea # ljmpl opcode +-2: .long 0 # offset ++2: .long in_pm32 # offset + .word __BOOT_CS # segment + + .size protected_mode_jump, .-protected_mode_jump ++ ++ .code32 ++ .type in_pm32, @function ++in_pm32: ++ # Set up data segments for flat 32-bit mode ++ movl %ecx, %ds ++ movl %ecx, %es ++ movl %ecx, %fs ++ movl %ecx, %gs ++ movl %ecx, %ss ++ # The 32-bit code sets up its own stack, but this way we do have ++ # a valid stack if some debugging hack wants to use it. 
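[Editor's note, not part of the patch] The reworked pmjump.S above no longer patches the ljmpl target with a value passed in %eax; instead it converts the real-mode %cs into a linear base ("movw %cs, %bx; shll $4, %ebx; addl %ebx, 2f") and later adds the same base to %esp so the 32-bit code inherits a usable flat stack. A one-line sketch of that segment:offset arithmetic, with an illustrative helper name:

    #include <stdint.h>

    /* Real-mode segment:offset -> linear address, as done with
     * "shll $4, %ebx; addl %ebx, ..." in protected_mode_jump. */
    static uint32_t real_mode_linear(uint16_t seg, uint16_t off)
    {
            return ((uint32_t)seg << 4) + off;
    }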
++ addl %ebx, %esp ++ ++ # Set up TR to make Intel VT happy ++ ltr %di ++ ++ # Clear registers to allow for future extensions to the ++ # 32-bit boot protocol ++ xorl %ecx, %ecx ++ xorl %edx, %edx ++ xorl %ebx, %ebx ++ xorl %ebp, %ebp ++ xorl %edi, %edi ++ ++ # Set up LDTR to make Intel VT happy ++ lldt %cx ++ ++ jmpl *%eax # Jump to the 32-bit entrypoint ++ ++ .size in_pm32, .-in_pm32 +diff --git a/arch/x86/boot/video-bios.c b/arch/x86/boot/video-bios.c +index ed0672a..ff664a1 100644 +--- a/arch/x86/boot/video-bios.c ++++ b/arch/x86/boot/video-bios.c +@@ -104,6 +104,7 @@ static int bios_probe(void) + + mi = GET_HEAP(struct mode_info, 1); + mi->mode = VIDEO_FIRST_BIOS+mode; ++ mi->depth = 0; /* text */ + mi->x = rdfs16(0x44a); + mi->y = rdfs8(0x484)+1; + nmodes++; +@@ -116,7 +117,7 @@ static int bios_probe(void) + + __videocard video_bios = + { +- .card_name = "BIOS (scanned)", ++ .card_name = "BIOS", + .probe = bios_probe, + .set_mode = bios_set_mode, + .unsafe = 1, +diff --git a/arch/x86/boot/video-vesa.c b/arch/x86/boot/video-vesa.c +index 4716b9a..662dd2f 100644 +--- a/arch/x86/boot/video-vesa.c ++++ b/arch/x86/boot/video-vesa.c +@@ -79,20 +79,28 @@ static int vesa_probe(void) + /* Text Mode, TTY BIOS supported, + supported by hardware */ + mi = GET_HEAP(struct mode_info, 1); +- mi->mode = mode + VIDEO_FIRST_VESA; +- mi->x = vminfo.h_res; +- mi->y = vminfo.v_res; ++ mi->mode = mode + VIDEO_FIRST_VESA; ++ mi->depth = 0; /* text */ ++ mi->x = vminfo.h_res; ++ mi->y = vminfo.v_res; + nmodes++; +- } else if ((vminfo.mode_attr & 0x99) == 0x99) { ++ } else if ((vminfo.mode_attr & 0x99) == 0x99 && ++ (vminfo.memory_layout == 4 || ++ vminfo.memory_layout == 6) && ++ vminfo.memory_planes == 1) { + #ifdef CONFIG_FB + /* Graphics mode, color, linear frame buffer +- supported -- register the mode but hide from +- the menu. Only do this if framebuffer is +- configured, however, otherwise the user will +- be left without a screen. */ ++ supported. Only register the mode if ++ if framebuffer is configured, however, ++ otherwise the user will be left without a screen. ++ We don't require CONFIG_FB_VESA, however, since ++ some of the other framebuffer drivers can use ++ this mode-setting, too. 
*/ + mi = GET_HEAP(struct mode_info, 1); + mi->mode = mode + VIDEO_FIRST_VESA; +- mi->x = mi->y = 0; ++ mi->depth = vminfo.bpp; ++ mi->x = vminfo.h_res; ++ mi->y = vminfo.v_res; + nmodes++; + #endif + } +diff --git a/arch/x86/boot/video-vga.c b/arch/x86/boot/video-vga.c +index aef02f9..7259387 100644 +--- a/arch/x86/boot/video-vga.c ++++ b/arch/x86/boot/video-vga.c +@@ -18,22 +18,22 @@ + #include "video.h" + + static struct mode_info vga_modes[] = { +- { VIDEO_80x25, 80, 25 }, +- { VIDEO_8POINT, 80, 50 }, +- { VIDEO_80x43, 80, 43 }, +- { VIDEO_80x28, 80, 28 }, +- { VIDEO_80x30, 80, 30 }, +- { VIDEO_80x34, 80, 34 }, +- { VIDEO_80x60, 80, 60 }, ++ { VIDEO_80x25, 80, 25, 0 }, ++ { VIDEO_8POINT, 80, 50, 0 }, ++ { VIDEO_80x43, 80, 43, 0 }, ++ { VIDEO_80x28, 80, 28, 0 }, ++ { VIDEO_80x30, 80, 30, 0 }, ++ { VIDEO_80x34, 80, 34, 0 }, ++ { VIDEO_80x60, 80, 60, 0 }, + }; + + static struct mode_info ega_modes[] = { +- { VIDEO_80x25, 80, 25 }, +- { VIDEO_8POINT, 80, 43 }, ++ { VIDEO_80x25, 80, 25, 0 }, ++ { VIDEO_8POINT, 80, 43, 0 }, + }; + + static struct mode_info cga_modes[] = { +- { VIDEO_80x25, 80, 25 }, ++ { VIDEO_80x25, 80, 25, 0 }, + }; + + __videocard video_vga; +diff --git a/arch/x86/boot/video.c b/arch/x86/boot/video.c +index ad9712f..696d08f 100644 +--- a/arch/x86/boot/video.c ++++ b/arch/x86/boot/video.c +@@ -293,13 +293,28 @@ static void display_menu(void) + struct mode_info *mi; + char ch; + int i; ++ int nmodes; ++ int modes_per_line; ++ int col; + +- puts("Mode: COLSxROWS:\n"); ++ nmodes = 0; ++ for (card = video_cards; card < video_cards_end; card++) ++ nmodes += card->nmodes; + ++ modes_per_line = 1; ++ if (nmodes >= 20) ++ modes_per_line = 3; ++ ++ for (col = 0; col < modes_per_line; col++) ++ puts("Mode: Resolution: Type: "); ++ putchar('\n'); ++ ++ col = 0; + ch = '0'; + for (card = video_cards; card < video_cards_end; card++) { + mi = card->modes; + for (i = 0; i < card->nmodes; i++, mi++) { ++ char resbuf[32]; + int visible = mi->x && mi->y; + u16 mode_id = mi->mode ? 
mi->mode : + (mi->y << 8)+mi->x; +@@ -307,8 +322,18 @@ static void display_menu(void) + if (!visible) + continue; /* Hidden mode */ + +- printf("%c %04X %3dx%-3d %s\n", +- ch, mode_id, mi->x, mi->y, card->card_name); ++ if (mi->depth) ++ sprintf(resbuf, "%dx%d", mi->y, mi->depth); ++ else ++ sprintf(resbuf, "%d", mi->y); ++ ++ printf("%c %03X %4dx%-7s %-6s", ++ ch, mode_id, mi->x, resbuf, card->card_name); ++ col++; ++ if (col >= modes_per_line) { ++ putchar('\n'); ++ col = 0; ++ } + + if (ch == '9') + ch = 'a'; +@@ -318,6 +343,8 @@ static void display_menu(void) + ch++; + } + } ++ if (col) ++ putchar('\n'); + } + + #define H(x) ((x)-'a'+10) +diff --git a/arch/x86/boot/video.h b/arch/x86/boot/video.h +index b92447d..d69347f 100644 +--- a/arch/x86/boot/video.h ++++ b/arch/x86/boot/video.h +@@ -83,7 +83,8 @@ void store_screen(void); + + struct mode_info { + u16 mode; /* Mode number (vga= style) */ +- u8 x, y; /* Width, height */ ++ u16 x, y; /* Width, height */ ++ u16 depth; /* Bits per pixel, 0 for text mode */ + }; + + struct card_info { +diff --git a/arch/x86/boot/voyager.c b/arch/x86/boot/voyager.c +index 61c8fe0..6499e32 100644 +--- a/arch/x86/boot/voyager.c ++++ b/arch/x86/boot/voyager.c +@@ -16,8 +16,6 @@ + + #include "boot.h" + +-#ifdef CONFIG_X86_VOYAGER +- + int query_voyager(void) + { + u8 err; +@@ -42,5 +40,3 @@ int query_voyager(void) + copy_from_fs(data_ptr, di, 7); /* Table is 7 bytes apparently */ + return 0; + } +- +-#endif /* CONFIG_X86_VOYAGER */ +diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig +index 54ee176..77562e7 100644 +--- a/arch/x86/configs/i386_defconfig ++++ b/arch/x86/configs/i386_defconfig +@@ -99,9 +99,9 @@ CONFIG_IOSCHED_NOOP=y + CONFIG_IOSCHED_AS=y + CONFIG_IOSCHED_DEADLINE=y + CONFIG_IOSCHED_CFQ=y +-CONFIG_DEFAULT_AS=y ++# CONFIG_DEFAULT_AS is not set + # CONFIG_DEFAULT_DEADLINE is not set +-# CONFIG_DEFAULT_CFQ is not set ++CONFIG_DEFAULT_CFQ=y + # CONFIG_DEFAULT_NOOP is not set + CONFIG_DEFAULT_IOSCHED="anticipatory" + +diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig +index 38a83f9..9e2b0ef 100644 +--- a/arch/x86/configs/x86_64_defconfig ++++ b/arch/x86/configs/x86_64_defconfig +@@ -145,15 +145,6 @@ CONFIG_K8_NUMA=y + CONFIG_NODES_SHIFT=6 + CONFIG_X86_64_ACPI_NUMA=y + CONFIG_NUMA_EMU=y +-CONFIG_ARCH_DISCONTIGMEM_ENABLE=y +-CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y +-CONFIG_ARCH_SPARSEMEM_ENABLE=y +-CONFIG_SELECT_MEMORY_MODEL=y +-# CONFIG_FLATMEM_MANUAL is not set +-CONFIG_DISCONTIGMEM_MANUAL=y +-# CONFIG_SPARSEMEM_MANUAL is not set +-CONFIG_DISCONTIGMEM=y +-CONFIG_FLAT_NODE_MEM_MAP=y + CONFIG_NEED_MULTIPLE_NODES=y + # CONFIG_SPARSEMEM_STATIC is not set + CONFIG_SPLIT_PTLOCK_CPUS=4 diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index 46bb609..3874c2d 100644 --- a/arch/x86/crypto/Makefile @@ -135061,11 +139628,5531 @@ index 0000000..cefaf8b +MODULE_DESCRIPTION ("Twofish Cipher Algorithm, asm optimized"); +MODULE_ALIAS("twofish"); +MODULE_ALIAS("twofish-asm"); +diff --git a/arch/x86/ia32/Makefile b/arch/x86/ia32/Makefile +index e2edda2..52d0ccf 100644 +--- a/arch/x86/ia32/Makefile ++++ b/arch/x86/ia32/Makefile +@@ -2,9 +2,7 @@ + # Makefile for the ia32 kernel emulation subsystem. 
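[Editor's note, not part of the patch] The display_menu() rework above switches to up to three columns once 20 or more modes are registered, formats the second dimension as either "rows" or "rows x bpp" depending on mi->depth, and wraps with a running column counter. A minimal userspace sketch of that wrapping loop, assuming an array of entries shaped like struct mode_info; all names here are illustrative, not the boot code itself:

    #include <stdio.h>

    struct mode {
            unsigned short id, x, y, depth;     /* depth == 0 means text mode */
    };

    static void print_menu(const struct mode *m, int nmodes)
    {
            int per_line = nmodes >= 20 ? 3 : 1;    /* same threshold as video.c */
            int col = 0;

            for (int i = 0; i < nmodes; i++) {
                    char res[32];

                    if (m[i].depth)
                            snprintf(res, sizeof(res), "%dx%d", m[i].y, m[i].depth);
                    else
                            snprintf(res, sizeof(res), "%d", m[i].y);
                    printf("%03X  %4dx%-7s  ", m[i].id, m[i].x, res);
                    if (++col >= per_line) {
                            putchar('\n');
                            col = 0;
                    }
            }
            if (col)
                    putchar('\n');      /* finish a partially filled last row */
    }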
+ # + +-obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_signal.o tls32.o \ +- ia32_binfmt.o fpu32.o ptrace32.o syscall32.o syscall32_syscall.o \ +- mmap32.o ++obj-$(CONFIG_IA32_EMULATION) := ia32entry.o sys_ia32.o ia32_signal.o + + sysv-$(CONFIG_SYSVIPC) := ipc32.o + obj-$(CONFIG_IA32_EMULATION) += $(sysv-y) +@@ -13,40 +11,3 @@ obj-$(CONFIG_IA32_AOUT) += ia32_aout.o + + audit-class-$(CONFIG_AUDIT) := audit.o + obj-$(CONFIG_IA32_EMULATION) += $(audit-class-y) +- +-$(obj)/syscall32_syscall.o: \ +- $(foreach F,sysenter syscall,$(obj)/vsyscall-$F.so) +- +-# Teach kbuild about targets +-targets := $(foreach F,$(addprefix vsyscall-,sysenter syscall),\ +- $F.o $F.so $F.so.dbg) +- +-# The DSO images are built using a special linker script +-quiet_cmd_syscall = SYSCALL $@ +- cmd_syscall = $(CC) -m32 -nostdlib -shared \ +- $(call ld-option, -Wl$(comma)--hash-style=sysv) \ +- -Wl,-soname=linux-gate.so.1 -o $@ \ +- -Wl,-T,$(filter-out FORCE,$^) +- +-$(obj)/%.so: OBJCOPYFLAGS := -S +-$(obj)/%.so: $(obj)/%.so.dbg FORCE +- $(call if_changed,objcopy) +- +-$(obj)/vsyscall-sysenter.so.dbg $(obj)/vsyscall-syscall.so.dbg: \ +-$(obj)/vsyscall-%.so.dbg: $(src)/vsyscall.lds $(obj)/vsyscall-%.o FORCE +- $(call if_changed,syscall) +- +-AFLAGS_vsyscall-sysenter.o = -m32 -Wa,-32 +-AFLAGS_vsyscall-syscall.o = -m32 -Wa,-32 +- +-vdsos := vdso32-sysenter.so vdso32-syscall.so +- +-quiet_cmd_vdso_install = INSTALL $@ +- cmd_vdso_install = cp $(@:vdso32-%.so=$(obj)/vsyscall-%.so.dbg) \ +- $(MODLIB)/vdso/$@ +- +-$(vdsos): +- @mkdir -p $(MODLIB)/vdso +- $(call cmd,vdso_install) +- +-vdso_install: $(vdsos) +diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c +index 91b7b59..5d7b381 100644 +--- a/arch/x86/ia32/audit.c ++++ b/arch/x86/ia32/audit.c +@@ -27,7 +27,7 @@ unsigned ia32_signal_class[] = { + + int ia32_classify_syscall(unsigned syscall) + { +- switch(syscall) { ++ switch (syscall) { + case __NR_open: + return 2; + case __NR_openat: +diff --git a/arch/x86/ia32/fpu32.c b/arch/x86/ia32/fpu32.c +deleted file mode 100644 +index 2c8209a..0000000 +--- a/arch/x86/ia32/fpu32.c ++++ /dev/null +@@ -1,183 +0,0 @@ +-/* +- * Copyright 2002 Andi Kleen, SuSE Labs. +- * FXSAVE<->i387 conversion support. Based on code by Gareth Hughes. +- * This is used for ptrace, signals and coredumps in 32bit emulation. +- */ +- +-#include +-#include +-#include +-#include +-#include +- +-static inline unsigned short twd_i387_to_fxsr(unsigned short twd) +-{ +- unsigned int tmp; /* to avoid 16 bit prefixes in the code */ +- +- /* Transform each pair of bits into 01 (valid) or 00 (empty) */ +- tmp = ~twd; +- tmp = (tmp | (tmp>>1)) & 0x5555; /* 0V0V0V0V0V0V0V0V */ +- /* and move the valid bits to the lower byte. 
*/ +- tmp = (tmp | (tmp >> 1)) & 0x3333; /* 00VV00VV00VV00VV */ +- tmp = (tmp | (tmp >> 2)) & 0x0f0f; /* 0000VVVV0000VVVV */ +- tmp = (tmp | (tmp >> 4)) & 0x00ff; /* 00000000VVVVVVVV */ +- return tmp; +-} +- +-static inline unsigned long twd_fxsr_to_i387(struct i387_fxsave_struct *fxsave) +-{ +- struct _fpxreg *st = NULL; +- unsigned long tos = (fxsave->swd >> 11) & 7; +- unsigned long twd = (unsigned long) fxsave->twd; +- unsigned long tag; +- unsigned long ret = 0xffff0000; +- int i; +- +-#define FPREG_ADDR(f, n) ((void *)&(f)->st_space + (n) * 16); +- +- for (i = 0 ; i < 8 ; i++) { +- if (twd & 0x1) { +- st = FPREG_ADDR( fxsave, (i - tos) & 7 ); +- +- switch (st->exponent & 0x7fff) { +- case 0x7fff: +- tag = 2; /* Special */ +- break; +- case 0x0000: +- if ( !st->significand[0] && +- !st->significand[1] && +- !st->significand[2] && +- !st->significand[3] ) { +- tag = 1; /* Zero */ +- } else { +- tag = 2; /* Special */ +- } +- break; +- default: +- if (st->significand[3] & 0x8000) { +- tag = 0; /* Valid */ +- } else { +- tag = 2; /* Special */ +- } +- break; +- } +- } else { +- tag = 3; /* Empty */ +- } +- ret |= (tag << (2 * i)); +- twd = twd >> 1; +- } +- return ret; +-} +- +- +-static inline int convert_fxsr_from_user(struct i387_fxsave_struct *fxsave, +- struct _fpstate_ia32 __user *buf) +-{ +- struct _fpxreg *to; +- struct _fpreg __user *from; +- int i; +- u32 v; +- int err = 0; +- +-#define G(num,val) err |= __get_user(val, num + (u32 __user *)buf) +- G(0, fxsave->cwd); +- G(1, fxsave->swd); +- G(2, fxsave->twd); +- fxsave->twd = twd_i387_to_fxsr(fxsave->twd); +- G(3, fxsave->rip); +- G(4, v); +- fxsave->fop = v>>16; /* cs ignored */ +- G(5, fxsave->rdp); +- /* 6: ds ignored */ +-#undef G +- if (err) +- return -1; +- +- to = (struct _fpxreg *)&fxsave->st_space[0]; +- from = &buf->_st[0]; +- for (i = 0 ; i < 8 ; i++, to++, from++) { +- if (__copy_from_user(to, from, sizeof(*from))) +- return -1; +- } +- return 0; +-} +- +- +-static inline int convert_fxsr_to_user(struct _fpstate_ia32 __user *buf, +- struct i387_fxsave_struct *fxsave, +- struct pt_regs *regs, +- struct task_struct *tsk) +-{ +- struct _fpreg __user *to; +- struct _fpxreg *from; +- int i; +- u16 cs,ds; +- int err = 0; +- +- if (tsk == current) { +- /* should be actually ds/cs at fpu exception time, +- but that information is not available in 64bit mode. */ +- asm("movw %%ds,%0 " : "=r" (ds)); +- asm("movw %%cs,%0 " : "=r" (cs)); +- } else { /* ptrace. task has stopped. 
*/ +- ds = tsk->thread.ds; +- cs = regs->cs; +- } +- +-#define P(num,val) err |= __put_user(val, num + (u32 __user *)buf) +- P(0, (u32)fxsave->cwd | 0xffff0000); +- P(1, (u32)fxsave->swd | 0xffff0000); +- P(2, twd_fxsr_to_i387(fxsave)); +- P(3, (u32)fxsave->rip); +- P(4, cs | ((u32)fxsave->fop) << 16); +- P(5, fxsave->rdp); +- P(6, 0xffff0000 | ds); +-#undef P +- +- if (err) +- return -1; +- +- to = &buf->_st[0]; +- from = (struct _fpxreg *) &fxsave->st_space[0]; +- for ( i = 0 ; i < 8 ; i++, to++, from++ ) { +- if (__copy_to_user(to, from, sizeof(*to))) +- return -1; +- } +- return 0; +-} +- +-int restore_i387_ia32(struct task_struct *tsk, struct _fpstate_ia32 __user *buf, int fsave) +-{ +- clear_fpu(tsk); +- if (!fsave) { +- if (__copy_from_user(&tsk->thread.i387.fxsave, +- &buf->_fxsr_env[0], +- sizeof(struct i387_fxsave_struct))) +- return -1; +- tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask; +- set_stopped_child_used_math(tsk); +- } +- return convert_fxsr_from_user(&tsk->thread.i387.fxsave, buf); +-} +- +-int save_i387_ia32(struct task_struct *tsk, +- struct _fpstate_ia32 __user *buf, +- struct pt_regs *regs, +- int fsave) +-{ +- int err = 0; +- +- init_fpu(tsk); +- if (convert_fxsr_to_user(buf, &tsk->thread.i387.fxsave, regs, tsk)) +- return -1; +- if (fsave) +- return 0; +- err |= __put_user(tsk->thread.i387.fxsave.swd, &buf->status); +- if (fsave) +- return err ? -1 : 1; +- err |= __put_user(X86_FXSR_MAGIC, &buf->magic); +- err |= __copy_to_user(&buf->_fxsr_env[0], &tsk->thread.i387.fxsave, +- sizeof(struct i387_fxsave_struct)); +- return err ? -1 : 1; +-} +diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c +index f82e1a9..e4c1207 100644 +--- a/arch/x86/ia32/ia32_aout.c ++++ b/arch/x86/ia32/ia32_aout.c +@@ -25,6 +25,7 @@ + #include + #include + #include ++#include + + #include + #include +@@ -36,61 +37,67 @@ + #undef WARN_OLD + #undef CORE_DUMP /* probably broken */ + +-static int load_aout_binary(struct linux_binprm *, struct pt_regs * regs); +-static int load_aout_library(struct file*); ++static int load_aout_binary(struct linux_binprm *, struct pt_regs *regs); ++static int load_aout_library(struct file *); + + #ifdef CORE_DUMP +-static int aout_core_dump(long signr, struct pt_regs *regs, struct file *file, unsigned long limit); ++static int aout_core_dump(long signr, struct pt_regs *regs, struct file *file, ++ unsigned long limit); + + /* + * fill in the user structure for a core dump.. + */ +-static void dump_thread32(struct pt_regs * regs, struct user32 * dump) ++static void dump_thread32(struct pt_regs *regs, struct user32 *dump) + { +- u32 fs,gs; ++ u32 fs, gs; + + /* changed the size calculations - should hopefully work better. 
lbt */ + dump->magic = CMAGIC; + dump->start_code = 0; +- dump->start_stack = regs->rsp & ~(PAGE_SIZE - 1); ++ dump->start_stack = regs->sp & ~(PAGE_SIZE - 1); + dump->u_tsize = ((unsigned long) current->mm->end_code) >> PAGE_SHIFT; +- dump->u_dsize = ((unsigned long) (current->mm->brk + (PAGE_SIZE-1))) >> PAGE_SHIFT; ++ dump->u_dsize = ((unsigned long) ++ (current->mm->brk + (PAGE_SIZE-1))) >> PAGE_SHIFT; + dump->u_dsize -= dump->u_tsize; + dump->u_ssize = 0; +- dump->u_debugreg[0] = current->thread.debugreg0; +- dump->u_debugreg[1] = current->thread.debugreg1; +- dump->u_debugreg[2] = current->thread.debugreg2; +- dump->u_debugreg[3] = current->thread.debugreg3; +- dump->u_debugreg[4] = 0; +- dump->u_debugreg[5] = 0; +- dump->u_debugreg[6] = current->thread.debugreg6; +- dump->u_debugreg[7] = current->thread.debugreg7; +- +- if (dump->start_stack < 0xc0000000) +- dump->u_ssize = ((unsigned long) (0xc0000000 - dump->start_stack)) >> PAGE_SHIFT; +- +- dump->regs.ebx = regs->rbx; +- dump->regs.ecx = regs->rcx; +- dump->regs.edx = regs->rdx; +- dump->regs.esi = regs->rsi; +- dump->regs.edi = regs->rdi; +- dump->regs.ebp = regs->rbp; +- dump->regs.eax = regs->rax; ++ dump->u_debugreg[0] = current->thread.debugreg0; ++ dump->u_debugreg[1] = current->thread.debugreg1; ++ dump->u_debugreg[2] = current->thread.debugreg2; ++ dump->u_debugreg[3] = current->thread.debugreg3; ++ dump->u_debugreg[4] = 0; ++ dump->u_debugreg[5] = 0; ++ dump->u_debugreg[6] = current->thread.debugreg6; ++ dump->u_debugreg[7] = current->thread.debugreg7; ++ ++ if (dump->start_stack < 0xc0000000) { ++ unsigned long tmp; ++ ++ tmp = (unsigned long) (0xc0000000 - dump->start_stack); ++ dump->u_ssize = tmp >> PAGE_SHIFT; ++ } ++ ++ dump->regs.bx = regs->bx; ++ dump->regs.cx = regs->cx; ++ dump->regs.dx = regs->dx; ++ dump->regs.si = regs->si; ++ dump->regs.di = regs->di; ++ dump->regs.bp = regs->bp; ++ dump->regs.ax = regs->ax; + dump->regs.ds = current->thread.ds; + dump->regs.es = current->thread.es; + asm("movl %%fs,%0" : "=r" (fs)); dump->regs.fs = fs; +- asm("movl %%gs,%0" : "=r" (gs)); dump->regs.gs = gs; +- dump->regs.orig_eax = regs->orig_rax; +- dump->regs.eip = regs->rip; ++ asm("movl %%gs,%0" : "=r" (gs)); dump->regs.gs = gs; ++ dump->regs.orig_ax = regs->orig_ax; ++ dump->regs.ip = regs->ip; + dump->regs.cs = regs->cs; +- dump->regs.eflags = regs->eflags; +- dump->regs.esp = regs->rsp; ++ dump->regs.flags = regs->flags; ++ dump->regs.sp = regs->sp; + dump->regs.ss = regs->ss; + + #if 1 /* FIXME */ + dump->u_fpvalid = 0; + #else +- dump->u_fpvalid = dump_fpu (regs, &dump->i387); ++ dump->u_fpvalid = dump_fpu(regs, &dump->i387); + #endif + } + +@@ -128,15 +135,19 @@ static int dump_write(struct file *file, const void *addr, int nr) + return file->f_op->write(file, addr, nr, &file->f_pos) == nr; + } + +-#define DUMP_WRITE(addr, nr) \ ++#define DUMP_WRITE(addr, nr) \ + if (!dump_write(file, (void *)(addr), (nr))) \ + goto end_coredump; + +-#define DUMP_SEEK(offset) \ +-if (file->f_op->llseek) { \ +- if (file->f_op->llseek(file,(offset),0) != (offset)) \ +- goto end_coredump; \ +-} else file->f_pos = (offset) ++#define DUMP_SEEK(offset) \ ++ if (file->f_op->llseek) { \ ++ if (file->f_op->llseek(file, (offset), 0) != (offset)) \ ++ goto end_coredump; \ ++ } else \ ++ file->f_pos = (offset) ++ ++#define START_DATA() (u.u_tsize << PAGE_SHIFT) ++#define START_STACK(u) (u.start_stack) + + /* + * Routine writes a core dump image in the current directory. 
+@@ -148,62 +159,70 @@ if (file->f_op->llseek) { \ + * dumping of the process results in another error.. + */ + +-static int aout_core_dump(long signr, struct pt_regs *regs, struct file *file, unsigned long limit) ++static int aout_core_dump(long signr, struct pt_regs *regs, struct file *file, ++ unsigned long limit) + { + mm_segment_t fs; + int has_dumped = 0; + unsigned long dump_start, dump_size; + struct user32 dump; +-# define START_DATA(u) (u.u_tsize << PAGE_SHIFT) +-# define START_STACK(u) (u.start_stack) + + fs = get_fs(); + set_fs(KERNEL_DS); + has_dumped = 1; + current->flags |= PF_DUMPCORE; +- strncpy(dump.u_comm, current->comm, sizeof(current->comm)); +- dump.u_ar0 = (u32)(((unsigned long)(&dump.regs)) - ((unsigned long)(&dump))); ++ strncpy(dump.u_comm, current->comm, sizeof(current->comm)); ++ dump.u_ar0 = (u32)(((unsigned long)(&dump.regs)) - ++ ((unsigned long)(&dump))); + dump.signal = signr; + dump_thread32(regs, &dump); + +-/* If the size of the dump file exceeds the rlimit, then see what would happen +- if we wrote the stack, but not the data area. */ ++ /* ++ * If the size of the dump file exceeds the rlimit, then see ++ * what would happen if we wrote the stack, but not the data ++ * area. ++ */ + if ((dump.u_dsize + dump.u_ssize + 1) * PAGE_SIZE > limit) + dump.u_dsize = 0; + +-/* Make sure we have enough room to write the stack and data areas. */ ++ /* Make sure we have enough room to write the stack and data areas. */ + if ((dump.u_ssize + 1) * PAGE_SIZE > limit) + dump.u_ssize = 0; + +-/* make sure we actually have a data and stack area to dump */ ++ /* make sure we actually have a data and stack area to dump */ + set_fs(USER_DS); +- if (!access_ok(VERIFY_READ, (void *) (unsigned long)START_DATA(dump), dump.u_dsize << PAGE_SHIFT)) ++ if (!access_ok(VERIFY_READ, (void *) (unsigned long)START_DATA(dump), ++ dump.u_dsize << PAGE_SHIFT)) + dump.u_dsize = 0; +- if (!access_ok(VERIFY_READ, (void *) (unsigned long)START_STACK(dump), dump.u_ssize << PAGE_SHIFT)) ++ if (!access_ok(VERIFY_READ, (void *) (unsigned long)START_STACK(dump), ++ dump.u_ssize << PAGE_SHIFT)) + dump.u_ssize = 0; + + set_fs(KERNEL_DS); +-/* struct user */ +- DUMP_WRITE(&dump,sizeof(dump)); +-/* Now dump all of the user data. Include malloced stuff as well */ ++ /* struct user */ ++ DUMP_WRITE(&dump, sizeof(dump)); ++ /* Now dump all of the user data. Include malloced stuff as well */ + DUMP_SEEK(PAGE_SIZE); +-/* now we start writing out the user space info */ ++ /* now we start writing out the user space info */ + set_fs(USER_DS); +-/* Dump the data area */ ++ /* Dump the data area */ + if (dump.u_dsize != 0) { + dump_start = START_DATA(dump); + dump_size = dump.u_dsize << PAGE_SHIFT; +- DUMP_WRITE(dump_start,dump_size); ++ DUMP_WRITE(dump_start, dump_size); + } +-/* Now prepare to dump the stack area */ ++ /* Now prepare to dump the stack area */ + if (dump.u_ssize != 0) { + dump_start = START_STACK(dump); + dump_size = dump.u_ssize << PAGE_SHIFT; +- DUMP_WRITE(dump_start,dump_size); ++ DUMP_WRITE(dump_start, dump_size); + } +-/* Finally dump the task struct. Not be used by gdb, but could be useful */ ++ /* ++ * Finally dump the task struct. 
Not be used by gdb, but ++ * could be useful ++ */ + set_fs(KERNEL_DS); +- DUMP_WRITE(current,sizeof(*current)); ++ DUMP_WRITE(current, sizeof(*current)); + end_coredump: + set_fs(fs); + return has_dumped; +@@ -217,35 +236,34 @@ end_coredump: + */ + static u32 __user *create_aout_tables(char __user *p, struct linux_binprm *bprm) + { +- u32 __user *argv; +- u32 __user *envp; +- u32 __user *sp; +- int argc = bprm->argc; +- int envc = bprm->envc; ++ u32 __user *argv, *envp, *sp; ++ int argc = bprm->argc, envc = bprm->envc; + + sp = (u32 __user *) ((-(unsigned long)sizeof(u32)) & (unsigned long) p); + sp -= envc+1; + envp = sp; + sp -= argc+1; + argv = sp; +- put_user((unsigned long) envp,--sp); +- put_user((unsigned long) argv,--sp); +- put_user(argc,--sp); ++ put_user((unsigned long) envp, --sp); ++ put_user((unsigned long) argv, --sp); ++ put_user(argc, --sp); + current->mm->arg_start = (unsigned long) p; +- while (argc-->0) { ++ while (argc-- > 0) { + char c; +- put_user((u32)(unsigned long)p,argv++); ++ ++ put_user((u32)(unsigned long)p, argv++); + do { +- get_user(c,p++); ++ get_user(c, p++); + } while (c); + } + put_user(0, argv); + current->mm->arg_end = current->mm->env_start = (unsigned long) p; +- while (envc-->0) { ++ while (envc-- > 0) { + char c; +- put_user((u32)(unsigned long)p,envp++); ++ ++ put_user((u32)(unsigned long)p, envp++); + do { +- get_user(c,p++); ++ get_user(c, p++); + } while (c); + } + put_user(0, envp); +@@ -257,20 +275,18 @@ static u32 __user *create_aout_tables(char __user *p, struct linux_binprm *bprm) + * These are the functions used to load a.out style executables and shared + * libraries. There is no binary dependent code anywhere else. + */ +- +-static int load_aout_binary(struct linux_binprm * bprm, struct pt_regs * regs) ++static int load_aout_binary(struct linux_binprm *bprm, struct pt_regs *regs) + { ++ unsigned long error, fd_offset, rlim; + struct exec ex; +- unsigned long error; +- unsigned long fd_offset; +- unsigned long rlim; + int retval; + + ex = *((struct exec *) bprm->buf); /* exec-header */ + if ((N_MAGIC(ex) != ZMAGIC && N_MAGIC(ex) != OMAGIC && + N_MAGIC(ex) != QMAGIC && N_MAGIC(ex) != NMAGIC) || + N_TRSIZE(ex) || N_DRSIZE(ex) || +- i_size_read(bprm->file->f_path.dentry->d_inode) < ex.a_text+ex.a_data+N_SYMSIZE(ex)+N_TXTOFF(ex)) { ++ i_size_read(bprm->file->f_path.dentry->d_inode) < ++ ex.a_text+ex.a_data+N_SYMSIZE(ex)+N_TXTOFF(ex)) { + return -ENOEXEC; + } + +@@ -291,13 +307,13 @@ static int load_aout_binary(struct linux_binprm * bprm, struct pt_regs * regs) + if (retval) + return retval; + +- regs->cs = __USER32_CS; ++ regs->cs = __USER32_CS; + regs->r8 = regs->r9 = regs->r10 = regs->r11 = regs->r12 = + regs->r13 = regs->r14 = regs->r15 = 0; + + /* OK, This is the point of no return */ + set_personality(PER_LINUX); +- set_thread_flag(TIF_IA32); ++ set_thread_flag(TIF_IA32); + clear_thread_flag(TIF_ABI_PENDING); + + current->mm->end_code = ex.a_text + +@@ -311,7 +327,7 @@ static int load_aout_binary(struct linux_binprm * bprm, struct pt_regs * regs) + + current->mm->mmap = NULL; + compute_creds(bprm); +- current->flags &= ~PF_FORKNOEXEC; ++ current->flags &= ~PF_FORKNOEXEC; + + if (N_MAGIC(ex) == OMAGIC) { + unsigned long text_addr, map_size; +@@ -338,30 +354,31 @@ static int load_aout_binary(struct linux_binprm * bprm, struct pt_regs * regs) + send_sig(SIGKILL, current, 0); + return error; + } +- ++ + flush_icache_range(text_addr, text_addr+ex.a_text+ex.a_data); + } else { + #ifdef WARN_OLD + static unsigned long error_time, 
error_time2; + if ((ex.a_text & 0xfff || ex.a_data & 0xfff) && +- (N_MAGIC(ex) != NMAGIC) && (jiffies-error_time2) > 5*HZ) +- { ++ (N_MAGIC(ex) != NMAGIC) && ++ time_after(jiffies, error_time2 + 5*HZ)) { + printk(KERN_NOTICE "executable not page aligned\n"); + error_time2 = jiffies; + } + + if ((fd_offset & ~PAGE_MASK) != 0 && +- (jiffies-error_time) > 5*HZ) +- { +- printk(KERN_WARNING +- "fd_offset is not page aligned. Please convert program: %s\n", ++ time_after(jiffies, error_time + 5*HZ)) { ++ printk(KERN_WARNING ++ "fd_offset is not page aligned. Please convert " ++ "program: %s\n", + bprm->file->f_path.dentry->d_name.name); + error_time = jiffies; + } + #endif + +- if (!bprm->file->f_op->mmap||((fd_offset & ~PAGE_MASK) != 0)) { ++ if (!bprm->file->f_op->mmap || (fd_offset & ~PAGE_MASK) != 0) { + loff_t pos = fd_offset; ++ + down_write(¤t->mm->mmap_sem); + do_brk(N_TXTADDR(ex), ex.a_text+ex.a_data); + up_write(¤t->mm->mmap_sem); +@@ -376,9 +393,10 @@ static int load_aout_binary(struct linux_binprm * bprm, struct pt_regs * regs) + + down_write(¤t->mm->mmap_sem); + error = do_mmap(bprm->file, N_TXTADDR(ex), ex.a_text, +- PROT_READ | PROT_EXEC, +- MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE | MAP_32BIT, +- fd_offset); ++ PROT_READ | PROT_EXEC, ++ MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | ++ MAP_EXECUTABLE | MAP_32BIT, ++ fd_offset); + up_write(¤t->mm->mmap_sem); + + if (error != N_TXTADDR(ex)) { +@@ -387,9 +405,10 @@ static int load_aout_binary(struct linux_binprm * bprm, struct pt_regs * regs) + } + + down_write(¤t->mm->mmap_sem); +- error = do_mmap(bprm->file, N_DATADDR(ex), ex.a_data, ++ error = do_mmap(bprm->file, N_DATADDR(ex), ex.a_data, + PROT_READ | PROT_WRITE | PROT_EXEC, +- MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE | MAP_32BIT, ++ MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | ++ MAP_EXECUTABLE | MAP_32BIT, + fd_offset + ex.a_text); + up_write(¤t->mm->mmap_sem); + if (error != N_DATADDR(ex)) { +@@ -403,9 +422,9 @@ beyond_if: + set_brk(current->mm->start_brk, current->mm->brk); + + retval = setup_arg_pages(bprm, IA32_STACK_TOP, EXSTACK_DEFAULT); +- if (retval < 0) { +- /* Someone check-me: is this error path enough? */ +- send_sig(SIGKILL, current, 0); ++ if (retval < 0) { ++ /* Someone check-me: is this error path enough? 
*/ ++ send_sig(SIGKILL, current, 0); + return retval; + } + +@@ -414,10 +433,10 @@ beyond_if: + /* start thread */ + asm volatile("movl %0,%%fs" :: "r" (0)); \ + asm volatile("movl %0,%%es; movl %0,%%ds": :"r" (__USER32_DS)); +- load_gs_index(0); +- (regs)->rip = ex.a_entry; +- (regs)->rsp = current->mm->start_stack; +- (regs)->eflags = 0x200; ++ load_gs_index(0); ++ (regs)->ip = ex.a_entry; ++ (regs)->sp = current->mm->start_stack; ++ (regs)->flags = 0x200; + (regs)->cs = __USER32_CS; + (regs)->ss = __USER32_DS; + regs->r8 = regs->r9 = regs->r10 = regs->r11 = +@@ -425,7 +444,7 @@ beyond_if: + set_fs(USER_DS); + if (unlikely(current->ptrace & PT_PTRACED)) { + if (current->ptrace & PT_TRACE_EXEC) +- ptrace_notify ((PTRACE_EVENT_EXEC << 8) | SIGTRAP); ++ ptrace_notify((PTRACE_EVENT_EXEC << 8) | SIGTRAP); + else + send_sig(SIGTRAP, current, 0); + } +@@ -434,9 +453,8 @@ beyond_if: + + static int load_aout_library(struct file *file) + { +- struct inode * inode; +- unsigned long bss, start_addr, len; +- unsigned long error; ++ struct inode *inode; ++ unsigned long bss, start_addr, len, error; + int retval; + struct exec ex; + +@@ -450,7 +468,8 @@ static int load_aout_library(struct file *file) + /* We come in here for the regular a.out style of shared libraries */ + if ((N_MAGIC(ex) != ZMAGIC && N_MAGIC(ex) != QMAGIC) || N_TRSIZE(ex) || + N_DRSIZE(ex) || ((ex.a_entry & 0xfff) && N_MAGIC(ex) == ZMAGIC) || +- i_size_read(inode) < ex.a_text+ex.a_data+N_SYMSIZE(ex)+N_TXTOFF(ex)) { ++ i_size_read(inode) < ++ ex.a_text+ex.a_data+N_SYMSIZE(ex)+N_TXTOFF(ex)) { + goto out; + } + +@@ -467,10 +486,10 @@ static int load_aout_library(struct file *file) + + #ifdef WARN_OLD + static unsigned long error_time; +- if ((jiffies-error_time) > 5*HZ) +- { +- printk(KERN_WARNING +- "N_TXTOFF is not page aligned. Please convert library: %s\n", ++ if (time_after(jiffies, error_time + 5*HZ)) { ++ printk(KERN_WARNING ++ "N_TXTOFF is not page aligned. Please convert " ++ "library: %s\n", + file->f_path.dentry->d_name.name); + error_time = jiffies; + } +@@ -478,11 +497,12 @@ static int load_aout_library(struct file *file) + down_write(¤t->mm->mmap_sem); + do_brk(start_addr, ex.a_text + ex.a_data + ex.a_bss); + up_write(¤t->mm->mmap_sem); +- ++ + file->f_op->read(file, (char __user *)start_addr, + ex.a_text + ex.a_data, &pos); + flush_icache_range((unsigned long) start_addr, +- (unsigned long) start_addr + ex.a_text + ex.a_data); ++ (unsigned long) start_addr + ex.a_text + ++ ex.a_data); + + retval = 0; + goto out; +diff --git a/arch/x86/ia32/ia32_binfmt.c b/arch/x86/ia32/ia32_binfmt.c +deleted file mode 100644 +index 55822d2..0000000 +--- a/arch/x86/ia32/ia32_binfmt.c ++++ /dev/null +@@ -1,285 +0,0 @@ +-/* +- * Written 2000,2002 by Andi Kleen. +- * +- * Loosely based on the sparc64 and IA64 32bit emulation loaders. +- * This tricks binfmt_elf.c into loading 32bit binaries using lots +- * of ugly preprocessor tricks. Talk about very very poor man's inheritance. 
+- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-#undef ELF_ARCH +-#undef ELF_CLASS +-#define ELF_CLASS ELFCLASS32 +-#define ELF_ARCH EM_386 +- +-#undef elfhdr +-#undef elf_phdr +-#undef elf_note +-#undef elf_addr_t +-#define elfhdr elf32_hdr +-#define elf_phdr elf32_phdr +-#define elf_note elf32_note +-#define elf_addr_t Elf32_Off +- +-#define ELF_NAME "elf/i386" +- +-#define AT_SYSINFO 32 +-#define AT_SYSINFO_EHDR 33 +- +-int sysctl_vsyscall32 = 1; +- +-#undef ARCH_DLINFO +-#define ARCH_DLINFO do { \ +- if (sysctl_vsyscall32) { \ +- current->mm->context.vdso = (void *)VSYSCALL32_BASE; \ +- NEW_AUX_ENT(AT_SYSINFO, (u32)(u64)VSYSCALL32_VSYSCALL); \ +- NEW_AUX_ENT(AT_SYSINFO_EHDR, VSYSCALL32_BASE); \ +- } \ +-} while(0) +- +-struct file; +- +-#define IA32_EMULATOR 1 +- +-#undef ELF_ET_DYN_BASE +- +-#define ELF_ET_DYN_BASE (TASK_UNMAPPED_BASE + 0x1000000) +- +-#define jiffies_to_timeval(a,b) do { (b)->tv_usec = 0; (b)->tv_sec = (a)/HZ; }while(0) +- +-#define _GET_SEG(x) \ +- ({ __u32 seg; asm("movl %%" __stringify(x) ",%0" : "=r"(seg)); seg; }) +- +-/* Assumes current==process to be dumped */ +-#undef ELF_CORE_COPY_REGS +-#define ELF_CORE_COPY_REGS(pr_reg, regs) \ +- pr_reg[0] = regs->rbx; \ +- pr_reg[1] = regs->rcx; \ +- pr_reg[2] = regs->rdx; \ +- pr_reg[3] = regs->rsi; \ +- pr_reg[4] = regs->rdi; \ +- pr_reg[5] = regs->rbp; \ +- pr_reg[6] = regs->rax; \ +- pr_reg[7] = _GET_SEG(ds); \ +- pr_reg[8] = _GET_SEG(es); \ +- pr_reg[9] = _GET_SEG(fs); \ +- pr_reg[10] = _GET_SEG(gs); \ +- pr_reg[11] = regs->orig_rax; \ +- pr_reg[12] = regs->rip; \ +- pr_reg[13] = regs->cs; \ +- pr_reg[14] = regs->eflags; \ +- pr_reg[15] = regs->rsp; \ +- pr_reg[16] = regs->ss; +- +- +-#define elf_prstatus compat_elf_prstatus +-#define elf_prpsinfo compat_elf_prpsinfo +-#define elf_fpregset_t struct user_i387_ia32_struct +-#define elf_fpxregset_t struct user32_fxsr_struct +-#define user user32 +- +-#undef elf_read_implies_exec +-#define elf_read_implies_exec(ex, executable_stack) (executable_stack != EXSTACK_DISABLE_X) +- +-#define elf_core_copy_regs elf32_core_copy_regs +-static inline void elf32_core_copy_regs(compat_elf_gregset_t *elfregs, +- struct pt_regs *regs) +-{ +- ELF_CORE_COPY_REGS((&elfregs->ebx), regs) +-} +- +-#define elf_core_copy_task_regs elf32_core_copy_task_regs +-static inline int elf32_core_copy_task_regs(struct task_struct *t, +- compat_elf_gregset_t* elfregs) +-{ +- struct pt_regs *pp = task_pt_regs(t); +- ELF_CORE_COPY_REGS((&elfregs->ebx), pp); +- /* fix wrong segments */ +- elfregs->ds = t->thread.ds; +- elfregs->fs = t->thread.fsindex; +- elfregs->gs = t->thread.gsindex; +- elfregs->es = t->thread.es; +- return 1; +-} +- +-#define elf_core_copy_task_fpregs elf32_core_copy_task_fpregs +-static inline int +-elf32_core_copy_task_fpregs(struct task_struct *tsk, struct pt_regs *regs, +- elf_fpregset_t *fpu) +-{ +- struct _fpstate_ia32 *fpstate = (void*)fpu; +- mm_segment_t oldfs = get_fs(); +- +- if (!tsk_used_math(tsk)) +- return 0; +- if (!regs) +- regs = task_pt_regs(tsk); +- if (tsk == current) +- unlazy_fpu(tsk); +- set_fs(KERNEL_DS); +- save_i387_ia32(tsk, fpstate, regs, 1); +- /* Correct for i386 bug. 
It puts the fop into the upper 16bits of +- the tag word (like FXSAVE), not into the fcs*/ +- fpstate->cssel |= fpstate->tag & 0xffff0000; +- set_fs(oldfs); +- return 1; +-} +- +-#define ELF_CORE_COPY_XFPREGS 1 +-#define ELF_CORE_XFPREG_TYPE NT_PRXFPREG +-#define elf_core_copy_task_xfpregs elf32_core_copy_task_xfpregs +-static inline int +-elf32_core_copy_task_xfpregs(struct task_struct *t, elf_fpxregset_t *xfpu) +-{ +- struct pt_regs *regs = task_pt_regs(t); +- if (!tsk_used_math(t)) +- return 0; +- if (t == current) +- unlazy_fpu(t); +- memcpy(xfpu, &t->thread.i387.fxsave, sizeof(elf_fpxregset_t)); +- xfpu->fcs = regs->cs; +- xfpu->fos = t->thread.ds; /* right? */ +- return 1; +-} +- +-#undef elf_check_arch +-#define elf_check_arch(x) \ +- ((x)->e_machine == EM_386) +- +-extern int force_personality32; +- +-#undef ELF_EXEC_PAGESIZE +-#undef ELF_HWCAP +-#undef ELF_PLATFORM +-#undef SET_PERSONALITY +-#define ELF_EXEC_PAGESIZE PAGE_SIZE +-#define ELF_HWCAP (boot_cpu_data.x86_capability[0]) +-#define ELF_PLATFORM ("i686") +-#define SET_PERSONALITY(ex, ibcs2) \ +-do { \ +- unsigned long new_flags = 0; \ +- if ((ex).e_ident[EI_CLASS] == ELFCLASS32) \ +- new_flags = _TIF_IA32; \ +- if ((current_thread_info()->flags & _TIF_IA32) \ +- != new_flags) \ +- set_thread_flag(TIF_ABI_PENDING); \ +- else \ +- clear_thread_flag(TIF_ABI_PENDING); \ +- /* XXX This overwrites the user set personality */ \ +- current->personality |= force_personality32; \ +-} while (0) +- +-/* Override some function names */ +-#define elf_format elf32_format +- +-#define init_elf_binfmt init_elf32_binfmt +-#define exit_elf_binfmt exit_elf32_binfmt +- +-#define load_elf_binary load_elf32_binary +- +-#undef ELF_PLAT_INIT +-#define ELF_PLAT_INIT(r, load_addr) elf32_init(r) +- +-#undef start_thread +-#define start_thread(regs,new_rip,new_rsp) do { \ +- asm volatile("movl %0,%%fs" :: "r" (0)); \ +- asm volatile("movl %0,%%es; movl %0,%%ds": :"r" (__USER32_DS)); \ +- load_gs_index(0); \ +- (regs)->rip = (new_rip); \ +- (regs)->rsp = (new_rsp); \ +- (regs)->eflags = 0x200; \ +- (regs)->cs = __USER32_CS; \ +- (regs)->ss = __USER32_DS; \ +- set_fs(USER_DS); \ +-} while(0) +- +- +-#include +- +-MODULE_DESCRIPTION("Binary format loader for compatibility with IA32 ELF binaries."); +-MODULE_AUTHOR("Eric Youngdale, Andi Kleen"); +- +-#undef MODULE_DESCRIPTION +-#undef MODULE_AUTHOR +- +-static void elf32_init(struct pt_regs *); +- +-#define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1 +-#define arch_setup_additional_pages syscall32_setup_pages +-extern int syscall32_setup_pages(struct linux_binprm *, int exstack); +- +-#include "../../../fs/binfmt_elf.c" +- +-static void elf32_init(struct pt_regs *regs) +-{ +- struct task_struct *me = current; +- regs->rdi = 0; +- regs->rsi = 0; +- regs->rdx = 0; +- regs->rcx = 0; +- regs->rax = 0; +- regs->rbx = 0; +- regs->rbp = 0; +- regs->r8 = regs->r9 = regs->r10 = regs->r11 = regs->r12 = +- regs->r13 = regs->r14 = regs->r15 = 0; +- me->thread.fs = 0; +- me->thread.gs = 0; +- me->thread.fsindex = 0; +- me->thread.gsindex = 0; +- me->thread.ds = __USER_DS; +- me->thread.es = __USER_DS; +-} +- +-#ifdef CONFIG_SYSCTL +-/* Register vsyscall32 into the ABI table */ +-#include +- +-static ctl_table abi_table2[] = { +- { +- .procname = "vsyscall32", +- .data = &sysctl_vsyscall32, +- .maxlen = sizeof(int), +- .mode = 0644, +- .proc_handler = proc_dointvec +- }, +- {} +-}; +- +-static ctl_table abi_root_table2[] = { +- { +- .ctl_name = CTL_ABI, +- .procname = "abi", +- .mode = 0555, +- .child = abi_table2 +- }, +- {} 
+-}; +- +-static __init int ia32_binfmt_init(void) +-{ +- register_sysctl_table(abi_root_table2); +- return 0; +-} +-__initcall(ia32_binfmt_init); +-#endif +diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c +index 6ea19c2..1c0503b 100644 +--- a/arch/x86/ia32/ia32_signal.c ++++ b/arch/x86/ia32/ia32_signal.c +@@ -29,9 +29,8 @@ + #include + #include + #include +-#include + #include +-#include ++#include + + #define DEBUG_SIG 0 + +@@ -43,7 +42,8 @@ void signal_fault(struct pt_regs *regs, void __user *frame, char *where); + int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from) + { + int err; +- if (!access_ok (VERIFY_WRITE, to, sizeof(compat_siginfo_t))) ++ ++ if (!access_ok(VERIFY_WRITE, to, sizeof(compat_siginfo_t))) + return -EFAULT; + + /* If you change siginfo_t structure, please make sure that +@@ -53,16 +53,19 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from) + 3 ints plus the relevant union member. */ + err = __put_user(from->si_signo, &to->si_signo); + err |= __put_user(from->si_errno, &to->si_errno); +- err |= __put_user((short)from->si_code, &to->si_code); ++ err |= __put_user((short)from->si_code, &to->si_code); + + if (from->si_code < 0) { + err |= __put_user(from->si_pid, &to->si_pid); +- err |= __put_user(from->si_uid, &to->si_uid); +- err |= __put_user(ptr_to_compat(from->si_ptr), &to->si_ptr); ++ err |= __put_user(from->si_uid, &to->si_uid); ++ err |= __put_user(ptr_to_compat(from->si_ptr), &to->si_ptr); + } else { +- /* First 32bits of unions are always present: +- * si_pid === si_band === si_tid === si_addr(LS half) */ +- err |= __put_user(from->_sifields._pad[0], &to->_sifields._pad[0]); ++ /* ++ * First 32bits of unions are always present: ++ * si_pid === si_band === si_tid === si_addr(LS half) ++ */ ++ err |= __put_user(from->_sifields._pad[0], ++ &to->_sifields._pad[0]); + switch (from->si_code >> 16) { + case __SI_FAULT >> 16: + break; +@@ -76,14 +79,15 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from) + err |= __put_user(from->si_uid, &to->si_uid); + break; + case __SI_POLL >> 16: +- err |= __put_user(from->si_fd, &to->si_fd); ++ err |= __put_user(from->si_fd, &to->si_fd); + break; + case __SI_TIMER >> 16: +- err |= __put_user(from->si_overrun, &to->si_overrun); ++ err |= __put_user(from->si_overrun, &to->si_overrun); + err |= __put_user(ptr_to_compat(from->si_ptr), +- &to->si_ptr); ++ &to->si_ptr); + break; +- case __SI_RT >> 16: /* This is not generated by the kernel as of now. */ ++ /* This is not generated by the kernel as of now. 
*/ ++ case __SI_RT >> 16: + case __SI_MESGQ >> 16: + err |= __put_user(from->si_uid, &to->si_uid); + err |= __put_user(from->si_int, &to->si_int); +@@ -97,7 +101,8 @@ int copy_siginfo_from_user32(siginfo_t *to, compat_siginfo_t __user *from) + { + int err; + u32 ptr32; +- if (!access_ok (VERIFY_READ, from, sizeof(compat_siginfo_t))) ++ ++ if (!access_ok(VERIFY_READ, from, sizeof(compat_siginfo_t))) + return -EFAULT; + + err = __get_user(to->si_signo, &from->si_signo); +@@ -112,8 +117,7 @@ int copy_siginfo_from_user32(siginfo_t *to, compat_siginfo_t __user *from) + return err; + } + +-asmlinkage long +-sys32_sigsuspend(int history0, int history1, old_sigset_t mask) ++asmlinkage long sys32_sigsuspend(int history0, int history1, old_sigset_t mask) + { + mask &= _BLOCKABLE; + spin_lock_irq(&current->sighand->siglock); +@@ -128,36 +132,37 @@ sys32_sigsuspend(int history0, int history1, old_sigset_t mask) + return -ERESTARTNOHAND; + } + +-asmlinkage long +-sys32_sigaltstack(const stack_ia32_t __user *uss_ptr, +- stack_ia32_t __user *uoss_ptr, +- struct pt_regs *regs) ++asmlinkage long sys32_sigaltstack(const stack_ia32_t __user *uss_ptr, ++ stack_ia32_t __user *uoss_ptr, ++ struct pt_regs *regs) + { +- stack_t uss,uoss; ++ stack_t uss, uoss; + int ret; +- mm_segment_t seg; +- if (uss_ptr) { ++ mm_segment_t seg; ++ ++ if (uss_ptr) { + u32 ptr; +- memset(&uss,0,sizeof(stack_t)); +- if (!access_ok(VERIFY_READ,uss_ptr,sizeof(stack_ia32_t)) || ++ ++ memset(&uss, 0, sizeof(stack_t)); ++ if (!access_ok(VERIFY_READ, uss_ptr, sizeof(stack_ia32_t)) || + __get_user(ptr, &uss_ptr->ss_sp) || + __get_user(uss.ss_flags, &uss_ptr->ss_flags) || + __get_user(uss.ss_size, &uss_ptr->ss_size)) + return -EFAULT; + uss.ss_sp = compat_ptr(ptr); + } +- seg = get_fs(); +- set_fs(KERNEL_DS); +- ret = do_sigaltstack(uss_ptr ? &uss : NULL, &uoss, regs->rsp); +- set_fs(seg); ++ seg = get_fs(); ++ set_fs(KERNEL_DS); ++ ret = do_sigaltstack(uss_ptr ? 
&uss : NULL, &uoss, regs->sp); ++ set_fs(seg); + if (ret >= 0 && uoss_ptr) { +- if (!access_ok(VERIFY_WRITE,uoss_ptr,sizeof(stack_ia32_t)) || ++ if (!access_ok(VERIFY_WRITE, uoss_ptr, sizeof(stack_ia32_t)) || + __put_user(ptr_to_compat(uoss.ss_sp), &uoss_ptr->ss_sp) || + __put_user(uoss.ss_flags, &uoss_ptr->ss_flags) || + __put_user(uoss.ss_size, &uoss_ptr->ss_size)) + ret = -EFAULT; +- } +- return ret; ++ } ++ return ret; + } + + /* +@@ -186,87 +191,85 @@ struct rt_sigframe + char retcode[8]; + }; + +-static int +-ia32_restore_sigcontext(struct pt_regs *regs, struct sigcontext_ia32 __user *sc, unsigned int *peax) ++#define COPY(x) { \ ++ unsigned int reg; \ ++ err |= __get_user(reg, &sc->x); \ ++ regs->x = reg; \ ++} ++ ++#define RELOAD_SEG(seg,mask) \ ++ { unsigned int cur; \ ++ unsigned short pre; \ ++ err |= __get_user(pre, &sc->seg); \ ++ asm volatile("movl %%" #seg ",%0" : "=r" (cur)); \ ++ pre |= mask; \ ++ if (pre != cur) loadsegment(seg, pre); } ++ ++static int ia32_restore_sigcontext(struct pt_regs *regs, ++ struct sigcontext_ia32 __user *sc, ++ unsigned int *peax) + { +- unsigned int err = 0; +- ++ unsigned int tmpflags, gs, oldgs, err = 0; ++ struct _fpstate_ia32 __user *buf; ++ u32 tmp; ++ + /* Always make any pending restarted system calls return -EINTR */ + current_thread_info()->restart_block.fn = do_no_restart_syscall; + + #if DEBUG_SIG +- printk("SIG restore_sigcontext: sc=%p err(%x) eip(%x) cs(%x) flg(%x)\n", +- sc, sc->err, sc->eip, sc->cs, sc->eflags); ++ printk(KERN_DEBUG "SIG restore_sigcontext: " ++ "sc=%p err(%x) eip(%x) cs(%x) flg(%x)\n", ++ sc, sc->err, sc->ip, sc->cs, sc->flags); + #endif +-#define COPY(x) { \ +- unsigned int reg; \ +- err |= __get_user(reg, &sc->e ##x); \ +- regs->r ## x = reg; \ +-} + +-#define RELOAD_SEG(seg,mask) \ +- { unsigned int cur; \ +- unsigned short pre; \ +- err |= __get_user(pre, &sc->seg); \ +- asm volatile("movl %%" #seg ",%0" : "=r" (cur)); \ +- pre |= mask; \ +- if (pre != cur) loadsegment(seg,pre); } +- +- /* Reload fs and gs if they have changed in the signal handler. +- This does not handle long fs/gs base changes in the handler, but +- does not clobber them at least in the normal case. */ +- +- { +- unsigned gs, oldgs; +- err |= __get_user(gs, &sc->gs); +- gs |= 3; +- asm("movl %%gs,%0" : "=r" (oldgs)); +- if (gs != oldgs) +- load_gs_index(gs); +- } +- RELOAD_SEG(fs,3); +- RELOAD_SEG(ds,3); +- RELOAD_SEG(es,3); ++ /* ++ * Reload fs and gs if they have changed in the signal ++ * handler. This does not handle long fs/gs base changes in ++ * the handler, but does not clobber them at least in the ++ * normal case. 
++ */ ++ err |= __get_user(gs, &sc->gs); ++ gs |= 3; ++ asm("movl %%gs,%0" : "=r" (oldgs)); ++ if (gs != oldgs) ++ load_gs_index(gs); ++ ++ RELOAD_SEG(fs, 3); ++ RELOAD_SEG(ds, 3); ++ RELOAD_SEG(es, 3); + + COPY(di); COPY(si); COPY(bp); COPY(sp); COPY(bx); + COPY(dx); COPY(cx); COPY(ip); +- /* Don't touch extended registers */ +- +- err |= __get_user(regs->cs, &sc->cs); +- regs->cs |= 3; +- err |= __get_user(regs->ss, &sc->ss); +- regs->ss |= 3; +- +- { +- unsigned int tmpflags; +- err |= __get_user(tmpflags, &sc->eflags); +- regs->eflags = (regs->eflags & ~0x40DD5) | (tmpflags & 0x40DD5); +- regs->orig_rax = -1; /* disable syscall checks */ +- } ++ /* Don't touch extended registers */ ++ ++ err |= __get_user(regs->cs, &sc->cs); ++ regs->cs |= 3; ++ err |= __get_user(regs->ss, &sc->ss); ++ regs->ss |= 3; ++ ++ err |= __get_user(tmpflags, &sc->flags); ++ regs->flags = (regs->flags & ~0x40DD5) | (tmpflags & 0x40DD5); ++ /* disable syscall checks */ ++ regs->orig_ax = -1; ++ ++ err |= __get_user(tmp, &sc->fpstate); ++ buf = compat_ptr(tmp); ++ if (buf) { ++ if (!access_ok(VERIFY_READ, buf, sizeof(*buf))) ++ goto badframe; ++ err |= restore_i387_ia32(buf); ++ } else { ++ struct task_struct *me = current; + +- { +- u32 tmp; +- struct _fpstate_ia32 __user * buf; +- err |= __get_user(tmp, &sc->fpstate); +- buf = compat_ptr(tmp); +- if (buf) { +- if (!access_ok(VERIFY_READ, buf, sizeof(*buf))) +- goto badframe; +- err |= restore_i387_ia32(current, buf, 0); +- } else { +- struct task_struct *me = current; +- if (used_math()) { +- clear_fpu(me); +- clear_used_math(); +- } ++ if (used_math()) { ++ clear_fpu(me); ++ clear_used_math(); + } + } + +- { +- u32 tmp; +- err |= __get_user(tmp, &sc->eax); +- *peax = tmp; +- } ++ err |= __get_user(tmp, &sc->ax); ++ *peax = tmp; ++ + return err; + + badframe: +@@ -275,15 +278,16 @@ badframe: + + asmlinkage long sys32_sigreturn(struct pt_regs *regs) + { +- struct sigframe __user *frame = (struct sigframe __user *)(regs->rsp-8); ++ struct sigframe __user *frame = (struct sigframe __user *)(regs->sp-8); + sigset_t set; +- unsigned int eax; ++ unsigned int ax; + + if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + goto badframe; + if (__get_user(set.sig[0], &frame->sc.oldmask) + || (_COMPAT_NSIG_WORDS > 1 +- && __copy_from_user((((char *) &set.sig) + 4), &frame->extramask, ++ && __copy_from_user((((char *) &set.sig) + 4), ++ &frame->extramask, + sizeof(frame->extramask)))) + goto badframe; + +@@ -292,24 +296,24 @@ asmlinkage long sys32_sigreturn(struct pt_regs *regs) + current->blocked = set; + recalc_sigpending(); + spin_unlock_irq(&current->sighand->siglock); +- +- if (ia32_restore_sigcontext(regs, &frame->sc, &eax)) ++ ++ if (ia32_restore_sigcontext(regs, &frame->sc, &ax)) + goto badframe; +- return eax; ++ return ax; + + badframe: + signal_fault(regs, frame, "32bit sigreturn"); + return 0; +-} ++} + + asmlinkage long sys32_rt_sigreturn(struct pt_regs *regs) + { + struct rt_sigframe __user *frame; + sigset_t set; +- unsigned int eax; ++ unsigned int ax; + struct pt_regs tregs; + +- frame = (struct rt_sigframe __user *)(regs->rsp - 4); ++ frame = (struct rt_sigframe __user *)(regs->sp - 4); + + if (!access_ok(VERIFY_READ, frame, sizeof(*frame))) + goto badframe; +@@ -321,28 +325,28 @@ asmlinkage long sys32_rt_sigreturn(struct pt_regs *regs) + current->blocked = set; + recalc_sigpending(); + spin_unlock_irq(&current->sighand->siglock); +- +- if (ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext, 
&ax)) + goto badframe; + + tregs = *regs; + if (sys32_sigaltstack(&frame->uc.uc_stack, NULL, &tregs) == -EFAULT) + goto badframe; + +- return eax; ++ return ax; + + badframe: +- signal_fault(regs,frame,"32bit rt sigreturn"); ++ signal_fault(regs, frame, "32bit rt sigreturn"); + return 0; +-} ++} + + /* + * Set up a signal frame. + */ + +-static int +-ia32_setup_sigcontext(struct sigcontext_ia32 __user *sc, struct _fpstate_ia32 __user *fpstate, +- struct pt_regs *regs, unsigned int mask) ++static int ia32_setup_sigcontext(struct sigcontext_ia32 __user *sc, ++ struct _fpstate_ia32 __user *fpstate, ++ struct pt_regs *regs, unsigned int mask) + { + int tmp, err = 0; + +@@ -356,26 +360,26 @@ ia32_setup_sigcontext(struct sigcontext_ia32 __user *sc, struct _fpstate_ia32 __ + __asm__("movl %%es,%0" : "=r"(tmp): "0"(tmp)); + err |= __put_user(tmp, (unsigned int __user *)&sc->es); + +- err |= __put_user((u32)regs->rdi, &sc->edi); +- err |= __put_user((u32)regs->rsi, &sc->esi); +- err |= __put_user((u32)regs->rbp, &sc->ebp); +- err |= __put_user((u32)regs->rsp, &sc->esp); +- err |= __put_user((u32)regs->rbx, &sc->ebx); +- err |= __put_user((u32)regs->rdx, &sc->edx); +- err |= __put_user((u32)regs->rcx, &sc->ecx); +- err |= __put_user((u32)regs->rax, &sc->eax); ++ err |= __put_user((u32)regs->di, &sc->di); ++ err |= __put_user((u32)regs->si, &sc->si); ++ err |= __put_user((u32)regs->bp, &sc->bp); ++ err |= __put_user((u32)regs->sp, &sc->sp); ++ err |= __put_user((u32)regs->bx, &sc->bx); ++ err |= __put_user((u32)regs->dx, &sc->dx); ++ err |= __put_user((u32)regs->cx, &sc->cx); ++ err |= __put_user((u32)regs->ax, &sc->ax); + err |= __put_user((u32)regs->cs, &sc->cs); + err |= __put_user((u32)regs->ss, &sc->ss); + err |= __put_user(current->thread.trap_no, &sc->trapno); + err |= __put_user(current->thread.error_code, &sc->err); +- err |= __put_user((u32)regs->rip, &sc->eip); +- err |= __put_user((u32)regs->eflags, &sc->eflags); +- err |= __put_user((u32)regs->rsp, &sc->esp_at_signal); ++ err |= __put_user((u32)regs->ip, &sc->ip); ++ err |= __put_user((u32)regs->flags, &sc->flags); ++ err |= __put_user((u32)regs->sp, &sc->sp_at_signal); + +- tmp = save_i387_ia32(current, fpstate, regs, 0); ++ tmp = save_i387_ia32(fpstate); + if (tmp < 0) + err = -EFAULT; +- else { ++ else { + clear_used_math(); + stts(); + err |= __put_user(ptr_to_compat(tmp ? fpstate : NULL), +@@ -392,40 +396,53 @@ ia32_setup_sigcontext(struct sigcontext_ia32 __user *sc, struct _fpstate_ia32 __ + /* + * Determine which stack to use.. + */ +-static void __user * +-get_sigframe(struct k_sigaction *ka, struct pt_regs * regs, size_t frame_size) ++static void __user *get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, ++ size_t frame_size) + { +- unsigned long rsp; ++ unsigned long sp; + + /* Default to using normal stack */ +- rsp = regs->rsp; ++ sp = regs->sp; + + /* This is the X/Open sanctioned signal stack switching. */ + if (ka->sa.sa_flags & SA_ONSTACK) { +- if (sas_ss_flags(rsp) == 0) +- rsp = current->sas_ss_sp + current->sas_ss_size; ++ if (sas_ss_flags(sp) == 0) ++ sp = current->sas_ss_sp + current->sas_ss_size; + } + + /* This is the legacy signal stack switching. */ + else if ((regs->ss & 0xffff) != __USER_DS && + !(ka->sa.sa_flags & SA_RESTORER) && +- ka->sa.sa_restorer) { +- rsp = (unsigned long) ka->sa.sa_restorer; +- } ++ ka->sa.sa_restorer) ++ sp = (unsigned long) ka->sa.sa_restorer; + +- rsp -= frame_size; ++ sp -= frame_size; + /* Align the stack pointer according to the i386 ABI, + * i.e. 
so that on function entry ((sp + 4) & 15) == 0. */ +- rsp = ((rsp + 4) & -16ul) - 4; +- return (void __user *) rsp; ++ sp = ((sp + 4) & -16ul) - 4; ++ return (void __user *) sp; + } + + int ia32_setup_frame(int sig, struct k_sigaction *ka, +- compat_sigset_t *set, struct pt_regs * regs) ++ compat_sigset_t *set, struct pt_regs *regs) + { + struct sigframe __user *frame; ++ void __user *restorer; + int err = 0; + ++ /* copy_to_user optimizes that into a single 8 byte store */ ++ static const struct { ++ u16 poplmovl; ++ u32 val; ++ u16 int80; ++ u16 pad; ++ } __attribute__((packed)) code = { ++ 0xb858, /* popl %eax ; movl $...,%eax */ ++ __NR_ia32_sigreturn, ++ 0x80cd, /* int $0x80 */ ++ 0, ++ }; ++ + frame = get_sigframe(ka, regs, sizeof(*frame)); + + if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) +@@ -443,64 +460,53 @@ int ia32_setup_frame(int sig, struct k_sigaction *ka, + if (_COMPAT_NSIG_WORDS > 1) { + err |= __copy_to_user(frame->extramask, &set->sig[1], + sizeof(frame->extramask)); ++ if (err) ++ goto give_sigsegv; + } +- if (err) +- goto give_sigsegv; + +- /* Return stub is in 32bit vsyscall page */ +- { +- void __user *restorer; ++ if (ka->sa.sa_flags & SA_RESTORER) { ++ restorer = ka->sa.sa_restorer; ++ } else { ++ /* Return stub is in 32bit vsyscall page */ + if (current->binfmt->hasvdso) +- restorer = VSYSCALL32_SIGRETURN; ++ restorer = VDSO32_SYMBOL(current->mm->context.vdso, ++ sigreturn); + else +- restorer = (void *)&frame->retcode; +- if (ka->sa.sa_flags & SA_RESTORER) +- restorer = ka->sa.sa_restorer; +- err |= __put_user(ptr_to_compat(restorer), &frame->pretcode); +- } +- /* These are actually not used anymore, but left because some +- gdb versions depend on them as a marker. */ +- { +- /* copy_to_user optimizes that into a single 8 byte store */ +- static const struct { +- u16 poplmovl; +- u32 val; +- u16 int80; +- u16 pad; +- } __attribute__((packed)) code = { +- 0xb858, /* popl %eax ; movl $...,%eax */ +- __NR_ia32_sigreturn, +- 0x80cd, /* int $0x80 */ +- 0, +- }; +- err |= __copy_to_user(frame->retcode, &code, 8); ++ restorer = &frame->retcode; + } ++ err |= __put_user(ptr_to_compat(restorer), &frame->pretcode); ++ ++ /* ++ * These are actually not used anymore, but left because some ++ * gdb versions depend on them as a marker. 
++ */ ++ err |= __copy_to_user(frame->retcode, &code, 8); + if (err) + goto give_sigsegv; + + /* Set up registers for signal handler */ +- regs->rsp = (unsigned long) frame; +- regs->rip = (unsigned long) ka->sa.sa_handler; ++ regs->sp = (unsigned long) frame; ++ regs->ip = (unsigned long) ka->sa.sa_handler; + + /* Make -mregparm=3 work */ +- regs->rax = sig; +- regs->rdx = 0; +- regs->rcx = 0; ++ regs->ax = sig; ++ regs->dx = 0; ++ regs->cx = 0; + +- asm volatile("movl %0,%%ds" :: "r" (__USER32_DS)); +- asm volatile("movl %0,%%es" :: "r" (__USER32_DS)); ++ asm volatile("movl %0,%%ds" :: "r" (__USER32_DS)); ++ asm volatile("movl %0,%%es" :: "r" (__USER32_DS)); + +- regs->cs = __USER32_CS; +- regs->ss = __USER32_DS; ++ regs->cs = __USER32_CS; ++ regs->ss = __USER32_DS; + + set_fs(USER_DS); +- regs->eflags &= ~TF_MASK; ++ regs->flags &= ~X86_EFLAGS_TF; + if (test_thread_flag(TIF_SINGLESTEP)) + ptrace_notify(SIGTRAP); + + #if DEBUG_SIG +- printk("SIG deliver (%s:%d): sp=%p pc=%lx ra=%u\n", +- current->comm, current->pid, frame, regs->rip, frame->pretcode); ++ printk(KERN_DEBUG "SIG deliver (%s:%d): sp=%p pc=%lx ra=%u\n", ++ current->comm, current->pid, frame, regs->ip, frame->pretcode); + #endif + + return 0; +@@ -511,25 +517,34 @@ give_sigsegv: + } + + int ia32_setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, +- compat_sigset_t *set, struct pt_regs * regs) ++ compat_sigset_t *set, struct pt_regs *regs) + { + struct rt_sigframe __user *frame; ++ struct exec_domain *ed = current_thread_info()->exec_domain; ++ void __user *restorer; + int err = 0; + ++ /* __copy_to_user optimizes that into a single 8 byte store */ ++ static const struct { ++ u8 movl; ++ u32 val; ++ u16 int80; ++ u16 pad; ++ u8 pad2; ++ } __attribute__((packed)) code = { ++ 0xb8, ++ __NR_ia32_rt_sigreturn, ++ 0x80cd, ++ 0, ++ }; ++ + frame = get_sigframe(ka, regs, sizeof(*frame)); + + if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame))) + goto give_sigsegv; + +- { +- struct exec_domain *ed = current_thread_info()->exec_domain; +- err |= __put_user((ed +- && ed->signal_invmap +- && sig < 32 +- ? ed->signal_invmap[sig] +- : sig), +- &frame->sig); +- } ++ err |= __put_user((ed && ed->signal_invmap && sig < 32 ++ ? ed->signal_invmap[sig] : sig), &frame->sig); + err |= __put_user(ptr_to_compat(&frame->info), &frame->pinfo); + err |= __put_user(ptr_to_compat(&frame->uc), &frame->puc); + err |= copy_siginfo_to_user32(&frame->info, info); +@@ -540,73 +555,58 @@ int ia32_setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info, + err |= __put_user(0, &frame->uc.uc_flags); + err |= __put_user(0, &frame->uc.uc_link); + err |= __put_user(current->sas_ss_sp, &frame->uc.uc_stack.ss_sp); +- err |= __put_user(sas_ss_flags(regs->rsp), ++ err |= __put_user(sas_ss_flags(regs->sp), + &frame->uc.uc_stack.ss_flags); + err |= __put_user(current->sas_ss_size, &frame->uc.uc_stack.ss_size); + err |= ia32_setup_sigcontext(&frame->uc.uc_mcontext, &frame->fpstate, +- regs, set->sig[0]); ++ regs, set->sig[0]); + err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set)); + if (err) + goto give_sigsegv; + +- +- { +- void __user *restorer = VSYSCALL32_RTSIGRETURN; +- if (ka->sa.sa_flags & SA_RESTORER) +- restorer = ka->sa.sa_restorer; +- err |= __put_user(ptr_to_compat(restorer), &frame->pretcode); +- } +- +- /* This is movl $,%eax ; int $0x80 */ +- /* Not actually used anymore, but left because some gdb versions +- need it. 
*/ +- { +- /* __copy_to_user optimizes that into a single 8 byte store */ +- static const struct { +- u8 movl; +- u32 val; +- u16 int80; +- u16 pad; +- u8 pad2; +- } __attribute__((packed)) code = { +- 0xb8, +- __NR_ia32_rt_sigreturn, +- 0x80cd, +- 0, +- }; +- err |= __copy_to_user(frame->retcode, &code, 8); +- } ++ if (ka->sa.sa_flags & SA_RESTORER) ++ restorer = ka->sa.sa_restorer; ++ else ++ restorer = VDSO32_SYMBOL(current->mm->context.vdso, ++ rt_sigreturn); ++ err |= __put_user(ptr_to_compat(restorer), &frame->pretcode); ++ ++ /* ++ * Not actually used anymore, but left because some gdb ++ * versions need it. ++ */ ++ err |= __copy_to_user(frame->retcode, &code, 8); + if (err) + goto give_sigsegv; + + /* Set up registers for signal handler */ +- regs->rsp = (unsigned long) frame; +- regs->rip = (unsigned long) ka->sa.sa_handler; ++ regs->sp = (unsigned long) frame; ++ regs->ip = (unsigned long) ka->sa.sa_handler; + + /* Make -mregparm=3 work */ +- regs->rax = sig; +- regs->rdx = (unsigned long) &frame->info; +- regs->rcx = (unsigned long) &frame->uc; ++ regs->ax = sig; ++ regs->dx = (unsigned long) &frame->info; ++ regs->cx = (unsigned long) &frame->uc; + + /* Make -mregparm=3 work */ +- regs->rax = sig; +- regs->rdx = (unsigned long) &frame->info; +- regs->rcx = (unsigned long) &frame->uc; ++ regs->ax = sig; ++ regs->dx = (unsigned long) &frame->info; ++ regs->cx = (unsigned long) &frame->uc; ++ ++ asm volatile("movl %0,%%ds" :: "r" (__USER32_DS)); ++ asm volatile("movl %0,%%es" :: "r" (__USER32_DS)); + +- asm volatile("movl %0,%%ds" :: "r" (__USER32_DS)); +- asm volatile("movl %0,%%es" :: "r" (__USER32_DS)); +- +- regs->cs = __USER32_CS; +- regs->ss = __USER32_DS; ++ regs->cs = __USER32_CS; ++ regs->ss = __USER32_DS; + + set_fs(USER_DS); +- regs->eflags &= ~TF_MASK; ++ regs->flags &= ~X86_EFLAGS_TF; + if (test_thread_flag(TIF_SINGLESTEP)) + ptrace_notify(SIGTRAP); + + #if DEBUG_SIG +- printk("SIG deliver (%s:%d): sp=%p pc=%lx ra=%u\n", +- current->comm, current->pid, frame, regs->rip, frame->pretcode); ++ printk(KERN_DEBUG "SIG deliver (%s:%d): sp=%p pc=%lx ra=%u\n", ++ current->comm, current->pid, frame, regs->ip, frame->pretcode); + #endif + + return 0; +diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S +index df588f0..0db0a62 100644 +--- a/arch/x86/ia32/ia32entry.S ++++ b/arch/x86/ia32/ia32entry.S +@@ -12,7 +12,6 @@ + #include + #include + #include +-#include + #include + #include + +@@ -104,7 +103,7 @@ ENTRY(ia32_sysenter_target) + pushfq + CFI_ADJUST_CFA_OFFSET 8 + /*CFI_REL_OFFSET rflags,0*/ +- movl $VSYSCALL32_SYSEXIT, %r10d ++ movl 8*3-THREAD_SIZE+threadinfo_sysenter_return(%rsp), %r10d + CFI_REGISTER rip,r10 + pushq $__USER32_CS + CFI_ADJUST_CFA_OFFSET 8 +@@ -142,6 +141,8 @@ sysenter_do_call: + andl $~TS_COMPAT,threadinfo_status(%r10) + /* clear IF, that popfq doesn't enable interrupts early */ + andl $~0x200,EFLAGS-R11(%rsp) ++ movl RIP-R11(%rsp),%edx /* User %eip */ ++ CFI_REGISTER rip,rdx + RESTORE_ARGS 1,24,1,1,1,1 + popfq + CFI_ADJUST_CFA_OFFSET -8 +@@ -149,8 +150,6 @@ sysenter_do_call: + popq %rcx /* User %esp */ + CFI_ADJUST_CFA_OFFSET -8 + CFI_REGISTER rsp,rcx +- movl $VSYSCALL32_SYSEXIT,%edx /* User %eip */ +- CFI_REGISTER rip,rdx + TRACE_IRQS_ON + swapgs + sti /* sti only takes effect after the next instruction */ +@@ -644,8 +643,8 @@ ia32_sys_call_table: + .quad compat_sys_futex /* 240 */ + .quad compat_sys_sched_setaffinity + .quad compat_sys_sched_getaffinity +- .quad sys32_set_thread_area +- .quad sys32_get_thread_area ++ .quad 
sys_set_thread_area ++ .quad sys_get_thread_area + .quad compat_sys_io_setup /* 245 */ + .quad sys_io_destroy + .quad compat_sys_io_getevents +diff --git a/arch/x86/ia32/ipc32.c b/arch/x86/ia32/ipc32.c +index 7b3342e..d21991c 100644 +--- a/arch/x86/ia32/ipc32.c ++++ b/arch/x86/ia32/ipc32.c +@@ -9,9 +9,8 @@ + #include + #include + +-asmlinkage long +-sys32_ipc(u32 call, int first, int second, int third, +- compat_uptr_t ptr, u32 fifth) ++asmlinkage long sys32_ipc(u32 call, int first, int second, int third, ++ compat_uptr_t ptr, u32 fifth) + { + int version; + +@@ -19,36 +18,35 @@ sys32_ipc(u32 call, int first, int second, int third, + call &= 0xffff; + + switch (call) { +- case SEMOP: ++ case SEMOP: + /* struct sembuf is the same on 32 and 64bit :)) */ + return sys_semtimedop(first, compat_ptr(ptr), second, NULL); +- case SEMTIMEDOP: ++ case SEMTIMEDOP: + return compat_sys_semtimedop(first, compat_ptr(ptr), second, + compat_ptr(fifth)); +- case SEMGET: ++ case SEMGET: + return sys_semget(first, second, third); +- case SEMCTL: ++ case SEMCTL: + return compat_sys_semctl(first, second, third, compat_ptr(ptr)); + +- case MSGSND: ++ case MSGSND: + return compat_sys_msgsnd(first, second, third, compat_ptr(ptr)); +- case MSGRCV: ++ case MSGRCV: + return compat_sys_msgrcv(first, second, fifth, third, + version, compat_ptr(ptr)); +- case MSGGET: ++ case MSGGET: + return sys_msgget((key_t) first, second); +- case MSGCTL: ++ case MSGCTL: + return compat_sys_msgctl(first, second, compat_ptr(ptr)); + +- case SHMAT: ++ case SHMAT: + return compat_sys_shmat(first, second, third, version, + compat_ptr(ptr)); +- break; +- case SHMDT: ++ case SHMDT: + return sys_shmdt(compat_ptr(ptr)); +- case SHMGET: ++ case SHMGET: + return sys_shmget(first, (unsigned)second, third); +- case SHMCTL: ++ case SHMCTL: + return compat_sys_shmctl(first, second, compat_ptr(ptr)); + } + return -ENOSYS; +diff --git a/arch/x86/ia32/mmap32.c b/arch/x86/ia32/mmap32.c +deleted file mode 100644 +index e4b84b4..0000000 +--- a/arch/x86/ia32/mmap32.c ++++ /dev/null +@@ -1,79 +0,0 @@ +-/* +- * linux/arch/x86_64/ia32/mm/mmap.c +- * +- * flexible mmap layout support +- * +- * Based on the i386 version which was +- * +- * Copyright 2003-2004 Red Hat Inc., Durham, North Carolina. +- * All Rights Reserved. +- * +- * This program is free software; you can redistribute it and/or modify +- * it under the terms of the GNU General Public License as published by +- * the Free Software Foundation; either version 2 of the License, or +- * (at your option) any later version. +- * +- * This program is distributed in the hope that it will be useful, +- * but WITHOUT ANY WARRANTY; without even the implied warranty of +- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +- * GNU General Public License for more details. +- * +- * You should have received a copy of the GNU General Public License +- * along with this program; if not, write to the Free Software +- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +- * +- * +- * Started by Ingo Molnar +- */ +- +-#include +-#include +-#include +-#include +- +-/* +- * Top of mmap area (just below the process stack). +- * +- * Leave an at least ~128 MB hole. 
+- */ +-#define MIN_GAP (128*1024*1024) +-#define MAX_GAP (TASK_SIZE/6*5) +- +-static inline unsigned long mmap_base(struct mm_struct *mm) +-{ +- unsigned long gap = current->signal->rlim[RLIMIT_STACK].rlim_cur; +- unsigned long random_factor = 0; +- +- if (current->flags & PF_RANDOMIZE) +- random_factor = get_random_int() % (1024*1024); +- +- if (gap < MIN_GAP) +- gap = MIN_GAP; +- else if (gap > MAX_GAP) +- gap = MAX_GAP; +- +- return PAGE_ALIGN(TASK_SIZE - gap - random_factor); +-} +- +-/* +- * This function, called very early during the creation of a new +- * process VM image, sets up which VM layout function to use: +- */ +-void ia32_pick_mmap_layout(struct mm_struct *mm) +-{ +- /* +- * Fall back to the standard layout if the personality +- * bit is set, or if the expected stack growth is unlimited: +- */ +- if (sysctl_legacy_va_layout || +- (current->personality & ADDR_COMPAT_LAYOUT) || +- current->signal->rlim[RLIMIT_STACK].rlim_cur == RLIM_INFINITY) { +- mm->mmap_base = TASK_UNMAPPED_BASE; +- mm->get_unmapped_area = arch_get_unmapped_area; +- mm->unmap_area = arch_unmap_area; +- } else { +- mm->mmap_base = mmap_base(mm); +- mm->get_unmapped_area = arch_get_unmapped_area_topdown; +- mm->unmap_area = arch_unmap_area_topdown; +- } +-} +diff --git a/arch/x86/ia32/ptrace32.c b/arch/x86/ia32/ptrace32.c +deleted file mode 100644 +index 4a233ad..0000000 +--- a/arch/x86/ia32/ptrace32.c ++++ /dev/null +@@ -1,404 +0,0 @@ +-/* +- * 32bit ptrace for x86-64. +- * +- * Copyright 2001,2002 Andi Kleen, SuSE Labs. +- * Some parts copied from arch/i386/kernel/ptrace.c. See that file for earlier +- * copyright. +- * +- * This allows to access 64bit processes too; but there is no way to see the extended +- * register contents. +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-/* +- * Determines which flags the user has access to [1 = access, 0 = no access]. +- * Prohibits changing ID(21), VIP(20), VIF(19), VM(17), IOPL(12-13), IF(9). +- * Also masks reserved bits (31-22, 15, 5, 3, 1). 
+- */ +-#define FLAG_MASK 0x54dd5UL +- +-#define R32(l,q) \ +- case offsetof(struct user32, regs.l): stack[offsetof(struct pt_regs, q)/8] = val; break +- +-static int putreg32(struct task_struct *child, unsigned regno, u32 val) +-{ +- int i; +- __u64 *stack = (__u64 *)task_pt_regs(child); +- +- switch (regno) { +- case offsetof(struct user32, regs.fs): +- if (val && (val & 3) != 3) return -EIO; +- child->thread.fsindex = val & 0xffff; +- break; +- case offsetof(struct user32, regs.gs): +- if (val && (val & 3) != 3) return -EIO; +- child->thread.gsindex = val & 0xffff; +- break; +- case offsetof(struct user32, regs.ds): +- if (val && (val & 3) != 3) return -EIO; +- child->thread.ds = val & 0xffff; +- break; +- case offsetof(struct user32, regs.es): +- child->thread.es = val & 0xffff; +- break; +- case offsetof(struct user32, regs.ss): +- if ((val & 3) != 3) return -EIO; +- stack[offsetof(struct pt_regs, ss)/8] = val & 0xffff; +- break; +- case offsetof(struct user32, regs.cs): +- if ((val & 3) != 3) return -EIO; +- stack[offsetof(struct pt_regs, cs)/8] = val & 0xffff; +- break; +- +- R32(ebx, rbx); +- R32(ecx, rcx); +- R32(edx, rdx); +- R32(edi, rdi); +- R32(esi, rsi); +- R32(ebp, rbp); +- R32(eax, rax); +- R32(orig_eax, orig_rax); +- R32(eip, rip); +- R32(esp, rsp); +- +- case offsetof(struct user32, regs.eflags): { +- __u64 *flags = &stack[offsetof(struct pt_regs, eflags)/8]; +- val &= FLAG_MASK; +- *flags = val | (*flags & ~FLAG_MASK); +- break; +- } +- +- case offsetof(struct user32, u_debugreg[4]): +- case offsetof(struct user32, u_debugreg[5]): +- return -EIO; +- +- case offsetof(struct user32, u_debugreg[0]): +- child->thread.debugreg0 = val; +- break; +- +- case offsetof(struct user32, u_debugreg[1]): +- child->thread.debugreg1 = val; +- break; +- +- case offsetof(struct user32, u_debugreg[2]): +- child->thread.debugreg2 = val; +- break; +- +- case offsetof(struct user32, u_debugreg[3]): +- child->thread.debugreg3 = val; +- break; +- +- case offsetof(struct user32, u_debugreg[6]): +- child->thread.debugreg6 = val; +- break; +- +- case offsetof(struct user32, u_debugreg[7]): +- val &= ~DR_CONTROL_RESERVED; +- /* See arch/i386/kernel/ptrace.c for an explanation of +- * this awkward check.*/ +- for(i=0; i<4; i++) +- if ((0x5454 >> ((val >> (16 + 4*i)) & 0xf)) & 1) +- return -EIO; +- child->thread.debugreg7 = val; +- if (val) +- set_tsk_thread_flag(child, TIF_DEBUG); +- else +- clear_tsk_thread_flag(child, TIF_DEBUG); +- break; +- +- default: +- if (regno > sizeof(struct user32) || (regno & 3)) +- return -EIO; +- +- /* Other dummy fields in the virtual user structure are ignored */ +- break; +- } +- return 0; +-} +- +-#undef R32 +- +-#define R32(l,q) \ +- case offsetof(struct user32, regs.l): *val = stack[offsetof(struct pt_regs, q)/8]; break +- +-static int getreg32(struct task_struct *child, unsigned regno, u32 *val) +-{ +- __u64 *stack = (__u64 *)task_pt_regs(child); +- +- switch (regno) { +- case offsetof(struct user32, regs.fs): +- *val = child->thread.fsindex; +- break; +- case offsetof(struct user32, regs.gs): +- *val = child->thread.gsindex; +- break; +- case offsetof(struct user32, regs.ds): +- *val = child->thread.ds; +- break; +- case offsetof(struct user32, regs.es): +- *val = child->thread.es; +- break; +- +- R32(cs, cs); +- R32(ss, ss); +- R32(ebx, rbx); +- R32(ecx, rcx); +- R32(edx, rdx); +- R32(edi, rdi); +- R32(esi, rsi); +- R32(ebp, rbp); +- R32(eax, rax); +- R32(orig_eax, orig_rax); +- R32(eip, rip); +- R32(eflags, eflags); +- R32(esp, rsp); +- +- case offsetof(struct 
user32, u_debugreg[0]): +- *val = child->thread.debugreg0; +- break; +- case offsetof(struct user32, u_debugreg[1]): +- *val = child->thread.debugreg1; +- break; +- case offsetof(struct user32, u_debugreg[2]): +- *val = child->thread.debugreg2; +- break; +- case offsetof(struct user32, u_debugreg[3]): +- *val = child->thread.debugreg3; +- break; +- case offsetof(struct user32, u_debugreg[6]): +- *val = child->thread.debugreg6; +- break; +- case offsetof(struct user32, u_debugreg[7]): +- *val = child->thread.debugreg7; +- break; +- +- default: +- if (regno > sizeof(struct user32) || (regno & 3)) +- return -EIO; +- +- /* Other dummy fields in the virtual user structure are ignored */ +- *val = 0; +- break; +- } +- return 0; +-} +- +-#undef R32 +- +-static long ptrace32_siginfo(unsigned request, u32 pid, u32 addr, u32 data) +-{ +- int ret; +- compat_siginfo_t __user *si32 = compat_ptr(data); +- siginfo_t ssi; +- siginfo_t __user *si = compat_alloc_user_space(sizeof(siginfo_t)); +- if (request == PTRACE_SETSIGINFO) { +- memset(&ssi, 0, sizeof(siginfo_t)); +- ret = copy_siginfo_from_user32(&ssi, si32); +- if (ret) +- return ret; +- if (copy_to_user(si, &ssi, sizeof(siginfo_t))) +- return -EFAULT; +- } +- ret = sys_ptrace(request, pid, addr, (unsigned long)si); +- if (ret) +- return ret; +- if (request == PTRACE_GETSIGINFO) { +- if (copy_from_user(&ssi, si, sizeof(siginfo_t))) +- return -EFAULT; +- ret = copy_siginfo_to_user32(si32, &ssi); +- } +- return ret; +-} +- +-asmlinkage long sys32_ptrace(long request, u32 pid, u32 addr, u32 data) +-{ +- struct task_struct *child; +- struct pt_regs *childregs; +- void __user *datap = compat_ptr(data); +- int ret; +- __u32 val; +- +- switch (request) { +- case PTRACE_TRACEME: +- case PTRACE_ATTACH: +- case PTRACE_KILL: +- case PTRACE_CONT: +- case PTRACE_SINGLESTEP: +- case PTRACE_DETACH: +- case PTRACE_SYSCALL: +- case PTRACE_OLDSETOPTIONS: +- case PTRACE_SETOPTIONS: +- case PTRACE_SET_THREAD_AREA: +- case PTRACE_GET_THREAD_AREA: +- return sys_ptrace(request, pid, addr, data); +- +- default: +- return -EINVAL; +- +- case PTRACE_PEEKTEXT: +- case PTRACE_PEEKDATA: +- case PTRACE_POKEDATA: +- case PTRACE_POKETEXT: +- case PTRACE_POKEUSR: +- case PTRACE_PEEKUSR: +- case PTRACE_GETREGS: +- case PTRACE_SETREGS: +- case PTRACE_SETFPREGS: +- case PTRACE_GETFPREGS: +- case PTRACE_SETFPXREGS: +- case PTRACE_GETFPXREGS: +- case PTRACE_GETEVENTMSG: +- break; +- +- case PTRACE_SETSIGINFO: +- case PTRACE_GETSIGINFO: +- return ptrace32_siginfo(request, pid, addr, data); +- } +- +- child = ptrace_get_task_struct(pid); +- if (IS_ERR(child)) +- return PTR_ERR(child); +- +- ret = ptrace_check_attach(child, request == PTRACE_KILL); +- if (ret < 0) +- goto out; +- +- childregs = task_pt_regs(child); +- +- switch (request) { +- case PTRACE_PEEKDATA: +- case PTRACE_PEEKTEXT: +- ret = 0; +- if (access_process_vm(child, addr, &val, sizeof(u32), 0)!=sizeof(u32)) +- ret = -EIO; +- else +- ret = put_user(val, (unsigned int __user *)datap); +- break; +- +- case PTRACE_POKEDATA: +- case PTRACE_POKETEXT: +- ret = 0; +- if (access_process_vm(child, addr, &data, sizeof(u32), 1)!=sizeof(u32)) +- ret = -EIO; +- break; +- +- case PTRACE_PEEKUSR: +- ret = getreg32(child, addr, &val); +- if (ret == 0) +- ret = put_user(val, (__u32 __user *)datap); +- break; +- +- case PTRACE_POKEUSR: +- ret = putreg32(child, addr, data); +- break; +- +- case PTRACE_GETREGS: { /* Get all gp regs from the child. 
*/ +- int i; +- if (!access_ok(VERIFY_WRITE, datap, 16*4)) { +- ret = -EIO; +- break; +- } +- ret = 0; +- for ( i = 0; i <= 16*4 ; i += sizeof(__u32) ) { +- getreg32(child, i, &val); +- ret |= __put_user(val,(u32 __user *)datap); +- datap += sizeof(u32); +- } +- break; +- } +- +- case PTRACE_SETREGS: { /* Set all gp regs in the child. */ +- unsigned long tmp; +- int i; +- if (!access_ok(VERIFY_READ, datap, 16*4)) { +- ret = -EIO; +- break; +- } +- ret = 0; +- for ( i = 0; i <= 16*4; i += sizeof(u32) ) { +- ret |= __get_user(tmp, (u32 __user *)datap); +- putreg32(child, i, tmp); +- datap += sizeof(u32); +- } +- break; +- } +- +- case PTRACE_GETFPREGS: +- ret = -EIO; +- if (!access_ok(VERIFY_READ, compat_ptr(data), +- sizeof(struct user_i387_struct))) +- break; +- save_i387_ia32(child, datap, childregs, 1); +- ret = 0; +- break; +- +- case PTRACE_SETFPREGS: +- ret = -EIO; +- if (!access_ok(VERIFY_WRITE, datap, +- sizeof(struct user_i387_struct))) +- break; +- ret = 0; +- /* don't check EFAULT to be bug-to-bug compatible to i386 */ +- restore_i387_ia32(child, datap, 1); +- break; +- +- case PTRACE_GETFPXREGS: { +- struct user32_fxsr_struct __user *u = datap; +- init_fpu(child); +- ret = -EIO; +- if (!access_ok(VERIFY_WRITE, u, sizeof(*u))) +- break; +- ret = -EFAULT; +- if (__copy_to_user(u, &child->thread.i387.fxsave, sizeof(*u))) +- break; +- ret = __put_user(childregs->cs, &u->fcs); +- ret |= __put_user(child->thread.ds, &u->fos); +- break; +- } +- case PTRACE_SETFPXREGS: { +- struct user32_fxsr_struct __user *u = datap; +- unlazy_fpu(child); +- ret = -EIO; +- if (!access_ok(VERIFY_READ, u, sizeof(*u))) +- break; +- /* no checking to be bug-to-bug compatible with i386. */ +- /* but silence warning */ +- if (__copy_from_user(&child->thread.i387.fxsave, u, sizeof(*u))) +- ; +- set_stopped_child_used_math(child); +- child->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask; +- ret = 0; +- break; +- } +- +- case PTRACE_GETEVENTMSG: +- ret = put_user(child->ptrace_message,(unsigned int __user *)compat_ptr(data)); +- break; +- +- default: +- BUG(); +- } +- +- out: +- put_task_struct(child); +- return ret; +-} +- +diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c +index bee96d6..abf71d2 100644 +--- a/arch/x86/ia32/sys_ia32.c ++++ b/arch/x86/ia32/sys_ia32.c +@@ -1,29 +1,29 @@ + /* + * sys_ia32.c: Conversion between 32bit and 64bit native syscalls. Based on +- * sys_sparc32 ++ * sys_sparc32 + * + * Copyright (C) 2000 VA Linux Co + * Copyright (C) 2000 Don Dugger +- * Copyright (C) 1999 Arun Sharma +- * Copyright (C) 1997,1998 Jakub Jelinek (jj@sunsite.mff.cuni.cz) +- * Copyright (C) 1997 David S. Miller (davem@caip.rutgers.edu) ++ * Copyright (C) 1999 Arun Sharma ++ * Copyright (C) 1997,1998 Jakub Jelinek (jj@sunsite.mff.cuni.cz) ++ * Copyright (C) 1997 David S. Miller (davem@caip.rutgers.edu) + * Copyright (C) 2000 Hewlett-Packard Co. + * Copyright (C) 2000 David Mosberger-Tang +- * Copyright (C) 2000,2001,2002 Andi Kleen, SuSE Labs (x86-64 port) ++ * Copyright (C) 2000,2001,2002 Andi Kleen, SuSE Labs (x86-64 port) + * + * These routines maintain argument size conversion between 32bit and 64bit +- * environment. In 2.5 most of this should be moved to a generic directory. ++ * environment. In 2.5 most of this should be moved to a generic directory. + * + * This file assumes that there is a hole at the end of user address space. +- * +- * Some of the functions are LE specific currently. These are hopefully all marked. +- * This should be fixed. 
++ * ++ * Some of the functions are LE specific currently. These are ++ * hopefully all marked. This should be fixed. + */ + + #include + #include +-#include +-#include ++#include ++#include + #include + #include + #include +@@ -90,43 +90,44 @@ int cp_compat_stat(struct kstat *kbuf, struct compat_stat __user *ubuf) + if (sizeof(ino) < sizeof(kbuf->ino) && ino != kbuf->ino) + return -EOVERFLOW; + if (!access_ok(VERIFY_WRITE, ubuf, sizeof(struct compat_stat)) || +- __put_user (old_encode_dev(kbuf->dev), &ubuf->st_dev) || +- __put_user (ino, &ubuf->st_ino) || +- __put_user (kbuf->mode, &ubuf->st_mode) || +- __put_user (kbuf->nlink, &ubuf->st_nlink) || +- __put_user (uid, &ubuf->st_uid) || +- __put_user (gid, &ubuf->st_gid) || +- __put_user (old_encode_dev(kbuf->rdev), &ubuf->st_rdev) || +- __put_user (kbuf->size, &ubuf->st_size) || +- __put_user (kbuf->atime.tv_sec, &ubuf->st_atime) || +- __put_user (kbuf->atime.tv_nsec, &ubuf->st_atime_nsec) || +- __put_user (kbuf->mtime.tv_sec, &ubuf->st_mtime) || +- __put_user (kbuf->mtime.tv_nsec, &ubuf->st_mtime_nsec) || +- __put_user (kbuf->ctime.tv_sec, &ubuf->st_ctime) || +- __put_user (kbuf->ctime.tv_nsec, &ubuf->st_ctime_nsec) || +- __put_user (kbuf->blksize, &ubuf->st_blksize) || +- __put_user (kbuf->blocks, &ubuf->st_blocks)) ++ __put_user(old_encode_dev(kbuf->dev), &ubuf->st_dev) || ++ __put_user(ino, &ubuf->st_ino) || ++ __put_user(kbuf->mode, &ubuf->st_mode) || ++ __put_user(kbuf->nlink, &ubuf->st_nlink) || ++ __put_user(uid, &ubuf->st_uid) || ++ __put_user(gid, &ubuf->st_gid) || ++ __put_user(old_encode_dev(kbuf->rdev), &ubuf->st_rdev) || ++ __put_user(kbuf->size, &ubuf->st_size) || ++ __put_user(kbuf->atime.tv_sec, &ubuf->st_atime) || ++ __put_user(kbuf->atime.tv_nsec, &ubuf->st_atime_nsec) || ++ __put_user(kbuf->mtime.tv_sec, &ubuf->st_mtime) || ++ __put_user(kbuf->mtime.tv_nsec, &ubuf->st_mtime_nsec) || ++ __put_user(kbuf->ctime.tv_sec, &ubuf->st_ctime) || ++ __put_user(kbuf->ctime.tv_nsec, &ubuf->st_ctime_nsec) || ++ __put_user(kbuf->blksize, &ubuf->st_blksize) || ++ __put_user(kbuf->blocks, &ubuf->st_blocks)) + return -EFAULT; + return 0; + } + +-asmlinkage long +-sys32_truncate64(char __user * filename, unsigned long offset_low, unsigned long offset_high) ++asmlinkage long sys32_truncate64(char __user *filename, ++ unsigned long offset_low, ++ unsigned long offset_high) + { + return sys_truncate(filename, ((loff_t) offset_high << 32) | offset_low); + } + +-asmlinkage long +-sys32_ftruncate64(unsigned int fd, unsigned long offset_low, unsigned long offset_high) ++asmlinkage long sys32_ftruncate64(unsigned int fd, unsigned long offset_low, ++ unsigned long offset_high) + { + return sys_ftruncate(fd, ((loff_t) offset_high << 32) | offset_low); + } + +-/* Another set for IA32/LFS -- x86_64 struct stat is different due to +- support for 64bit inode numbers. */ +- +-static int +-cp_stat64(struct stat64 __user *ubuf, struct kstat *stat) ++/* ++ * Another set for IA32/LFS -- x86_64 struct stat is different due to ++ * support for 64bit inode numbers. 
++ */ ++static int cp_stat64(struct stat64 __user *ubuf, struct kstat *stat) + { + typeof(ubuf->st_uid) uid = 0; + typeof(ubuf->st_gid) gid = 0; +@@ -134,38 +135,39 @@ cp_stat64(struct stat64 __user *ubuf, struct kstat *stat) + SET_GID(gid, stat->gid); + if (!access_ok(VERIFY_WRITE, ubuf, sizeof(struct stat64)) || + __put_user(huge_encode_dev(stat->dev), &ubuf->st_dev) || +- __put_user (stat->ino, &ubuf->__st_ino) || +- __put_user (stat->ino, &ubuf->st_ino) || +- __put_user (stat->mode, &ubuf->st_mode) || +- __put_user (stat->nlink, &ubuf->st_nlink) || +- __put_user (uid, &ubuf->st_uid) || +- __put_user (gid, &ubuf->st_gid) || +- __put_user (huge_encode_dev(stat->rdev), &ubuf->st_rdev) || +- __put_user (stat->size, &ubuf->st_size) || +- __put_user (stat->atime.tv_sec, &ubuf->st_atime) || +- __put_user (stat->atime.tv_nsec, &ubuf->st_atime_nsec) || +- __put_user (stat->mtime.tv_sec, &ubuf->st_mtime) || +- __put_user (stat->mtime.tv_nsec, &ubuf->st_mtime_nsec) || +- __put_user (stat->ctime.tv_sec, &ubuf->st_ctime) || +- __put_user (stat->ctime.tv_nsec, &ubuf->st_ctime_nsec) || +- __put_user (stat->blksize, &ubuf->st_blksize) || +- __put_user (stat->blocks, &ubuf->st_blocks)) ++ __put_user(stat->ino, &ubuf->__st_ino) || ++ __put_user(stat->ino, &ubuf->st_ino) || ++ __put_user(stat->mode, &ubuf->st_mode) || ++ __put_user(stat->nlink, &ubuf->st_nlink) || ++ __put_user(uid, &ubuf->st_uid) || ++ __put_user(gid, &ubuf->st_gid) || ++ __put_user(huge_encode_dev(stat->rdev), &ubuf->st_rdev) || ++ __put_user(stat->size, &ubuf->st_size) || ++ __put_user(stat->atime.tv_sec, &ubuf->st_atime) || ++ __put_user(stat->atime.tv_nsec, &ubuf->st_atime_nsec) || ++ __put_user(stat->mtime.tv_sec, &ubuf->st_mtime) || ++ __put_user(stat->mtime.tv_nsec, &ubuf->st_mtime_nsec) || ++ __put_user(stat->ctime.tv_sec, &ubuf->st_ctime) || ++ __put_user(stat->ctime.tv_nsec, &ubuf->st_ctime_nsec) || ++ __put_user(stat->blksize, &ubuf->st_blksize) || ++ __put_user(stat->blocks, &ubuf->st_blocks)) + return -EFAULT; + return 0; + } + +-asmlinkage long +-sys32_stat64(char __user * filename, struct stat64 __user *statbuf) ++asmlinkage long sys32_stat64(char __user *filename, ++ struct stat64 __user *statbuf) + { + struct kstat stat; + int ret = vfs_stat(filename, &stat); ++ + if (!ret) + ret = cp_stat64(statbuf, &stat); + return ret; + } + +-asmlinkage long +-sys32_lstat64(char __user * filename, struct stat64 __user *statbuf) ++asmlinkage long sys32_lstat64(char __user *filename, ++ struct stat64 __user *statbuf) + { + struct kstat stat; + int ret = vfs_lstat(filename, &stat); +@@ -174,8 +176,7 @@ sys32_lstat64(char __user * filename, struct stat64 __user *statbuf) + return ret; + } + +-asmlinkage long +-sys32_fstat64(unsigned int fd, struct stat64 __user *statbuf) ++asmlinkage long sys32_fstat64(unsigned int fd, struct stat64 __user *statbuf) + { + struct kstat stat; + int ret = vfs_fstat(fd, &stat); +@@ -184,9 +185,8 @@ sys32_fstat64(unsigned int fd, struct stat64 __user *statbuf) + return ret; + } + +-asmlinkage long +-sys32_fstatat(unsigned int dfd, char __user *filename, +- struct stat64 __user* statbuf, int flag) ++asmlinkage long sys32_fstatat(unsigned int dfd, char __user *filename, ++ struct stat64 __user *statbuf, int flag) + { + struct kstat stat; + int error = -EINVAL; +@@ -221,8 +221,7 @@ struct mmap_arg_struct { + unsigned int offset; + }; + +-asmlinkage long +-sys32_mmap(struct mmap_arg_struct __user *arg) ++asmlinkage long sys32_mmap(struct mmap_arg_struct __user *arg) + { + struct mmap_arg_struct a; + struct file 
*file = NULL; +@@ -233,33 +232,33 @@ sys32_mmap(struct mmap_arg_struct __user *arg) + return -EFAULT; + + if (a.offset & ~PAGE_MASK) +- return -EINVAL; ++ return -EINVAL; + + if (!(a.flags & MAP_ANONYMOUS)) { + file = fget(a.fd); + if (!file) + return -EBADF; + } +- +- mm = current->mm; +- down_write(&mm->mmap_sem); +- retval = do_mmap_pgoff(file, a.addr, a.len, a.prot, a.flags, a.offset>>PAGE_SHIFT); ++ ++ mm = current->mm; ++ down_write(&mm->mmap_sem); ++ retval = do_mmap_pgoff(file, a.addr, a.len, a.prot, a.flags, ++ a.offset>>PAGE_SHIFT); + if (file) + fput(file); + +- up_write(&mm->mmap_sem); ++ up_write(&mm->mmap_sem); + + return retval; + } + +-asmlinkage long +-sys32_mprotect(unsigned long start, size_t len, unsigned long prot) ++asmlinkage long sys32_mprotect(unsigned long start, size_t len, ++ unsigned long prot) + { +- return sys_mprotect(start,len,prot); ++ return sys_mprotect(start, len, prot); + } + +-asmlinkage long +-sys32_pipe(int __user *fd) ++asmlinkage long sys32_pipe(int __user *fd) + { + int retval; + int fds[2]; +@@ -269,13 +268,13 @@ sys32_pipe(int __user *fd) + goto out; + if (copy_to_user(fd, fds, sizeof(fds))) + retval = -EFAULT; +- out: ++out: + return retval; + } + +-asmlinkage long +-sys32_rt_sigaction(int sig, struct sigaction32 __user *act, +- struct sigaction32 __user *oact, unsigned int sigsetsize) ++asmlinkage long sys32_rt_sigaction(int sig, struct sigaction32 __user *act, ++ struct sigaction32 __user *oact, ++ unsigned int sigsetsize) + { + struct k_sigaction new_ka, old_ka; + int ret; +@@ -291,12 +290,17 @@ sys32_rt_sigaction(int sig, struct sigaction32 __user *act, + if (!access_ok(VERIFY_READ, act, sizeof(*act)) || + __get_user(handler, &act->sa_handler) || + __get_user(new_ka.sa.sa_flags, &act->sa_flags) || +- __get_user(restorer, &act->sa_restorer)|| +- __copy_from_user(&set32, &act->sa_mask, sizeof(compat_sigset_t))) ++ __get_user(restorer, &act->sa_restorer) || ++ __copy_from_user(&set32, &act->sa_mask, ++ sizeof(compat_sigset_t))) + return -EFAULT; + new_ka.sa.sa_handler = compat_ptr(handler); + new_ka.sa.sa_restorer = compat_ptr(restorer); +- /* FIXME: here we rely on _COMPAT_NSIG_WORS to be >= than _NSIG_WORDS << 1 */ ++ ++ /* ++ * FIXME: here we rely on _COMPAT_NSIG_WORS to be >= ++ * than _NSIG_WORDS << 1 ++ */ + switch (_NSIG_WORDS) { + case 4: new_ka.sa.sa_mask.sig[3] = set32.sig[6] + | (((long)set32.sig[7]) << 32); +@@ -312,7 +316,10 @@ sys32_rt_sigaction(int sig, struct sigaction32 __user *act, + ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? 
&old_ka : NULL); + + if (!ret && oact) { +- /* FIXME: here we rely on _COMPAT_NSIG_WORS to be >= than _NSIG_WORDS << 1 */ ++ /* ++ * FIXME: here we rely on _COMPAT_NSIG_WORS to be >= ++ * than _NSIG_WORDS << 1 ++ */ + switch (_NSIG_WORDS) { + case 4: + set32.sig[7] = (old_ka.sa.sa_mask.sig[3] >> 32); +@@ -328,23 +335,26 @@ sys32_rt_sigaction(int sig, struct sigaction32 __user *act, + set32.sig[0] = old_ka.sa.sa_mask.sig[0]; + } + if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact)) || +- __put_user(ptr_to_compat(old_ka.sa.sa_handler), &oact->sa_handler) || +- __put_user(ptr_to_compat(old_ka.sa.sa_restorer), &oact->sa_restorer) || ++ __put_user(ptr_to_compat(old_ka.sa.sa_handler), ++ &oact->sa_handler) || ++ __put_user(ptr_to_compat(old_ka.sa.sa_restorer), ++ &oact->sa_restorer) || + __put_user(old_ka.sa.sa_flags, &oact->sa_flags) || +- __copy_to_user(&oact->sa_mask, &set32, sizeof(compat_sigset_t))) ++ __copy_to_user(&oact->sa_mask, &set32, ++ sizeof(compat_sigset_t))) + return -EFAULT; + } + + return ret; + } + +-asmlinkage long +-sys32_sigaction (int sig, struct old_sigaction32 __user *act, struct old_sigaction32 __user *oact) ++asmlinkage long sys32_sigaction(int sig, struct old_sigaction32 __user *act, ++ struct old_sigaction32 __user *oact) + { +- struct k_sigaction new_ka, old_ka; +- int ret; ++ struct k_sigaction new_ka, old_ka; ++ int ret; + +- if (act) { ++ if (act) { + compat_old_sigset_t mask; + compat_uptr_t handler, restorer; + +@@ -359,33 +369,35 @@ sys32_sigaction (int sig, struct old_sigaction32 __user *act, struct old_sigacti + new_ka.sa.sa_restorer = compat_ptr(restorer); + + siginitset(&new_ka.sa.sa_mask, mask); +- } ++ } + +- ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? &old_ka : NULL); ++ ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? &old_ka : NULL); + + if (!ret && oact) { + if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact)) || +- __put_user(ptr_to_compat(old_ka.sa.sa_handler), &oact->sa_handler) || +- __put_user(ptr_to_compat(old_ka.sa.sa_restorer), &oact->sa_restorer) || ++ __put_user(ptr_to_compat(old_ka.sa.sa_handler), ++ &oact->sa_handler) || ++ __put_user(ptr_to_compat(old_ka.sa.sa_restorer), ++ &oact->sa_restorer) || + __put_user(old_ka.sa.sa_flags, &oact->sa_flags) || + __put_user(old_ka.sa.sa_mask.sig[0], &oact->sa_mask)) + return -EFAULT; +- } ++ } + + return ret; + } + +-asmlinkage long +-sys32_rt_sigprocmask(int how, compat_sigset_t __user *set, +- compat_sigset_t __user *oset, unsigned int sigsetsize) ++asmlinkage long sys32_rt_sigprocmask(int how, compat_sigset_t __user *set, ++ compat_sigset_t __user *oset, ++ unsigned int sigsetsize) + { + sigset_t s; + compat_sigset_t s32; + int ret; + mm_segment_t old_fs = get_fs(); +- ++ + if (set) { +- if (copy_from_user (&s32, set, sizeof(compat_sigset_t))) ++ if (copy_from_user(&s32, set, sizeof(compat_sigset_t))) + return -EFAULT; + switch (_NSIG_WORDS) { + case 4: s.sig[3] = s32.sig[6] | (((long)s32.sig[7]) << 32); +@@ -394,13 +406,14 @@ sys32_rt_sigprocmask(int how, compat_sigset_t __user *set, + case 1: s.sig[0] = s32.sig[0] | (((long)s32.sig[1]) << 32); + } + } +- set_fs (KERNEL_DS); ++ set_fs(KERNEL_DS); + ret = sys_rt_sigprocmask(how, + set ? (sigset_t __user *)&s : NULL, + oset ? 
(sigset_t __user *)&s : NULL, +- sigsetsize); +- set_fs (old_fs); +- if (ret) return ret; ++ sigsetsize); ++ set_fs(old_fs); ++ if (ret) ++ return ret; + if (oset) { + switch (_NSIG_WORDS) { + case 4: s32.sig[7] = (s.sig[3] >> 32); s32.sig[6] = s.sig[3]; +@@ -408,52 +421,49 @@ sys32_rt_sigprocmask(int how, compat_sigset_t __user *set, + case 2: s32.sig[3] = (s.sig[1] >> 32); s32.sig[2] = s.sig[1]; + case 1: s32.sig[1] = (s.sig[0] >> 32); s32.sig[0] = s.sig[0]; + } +- if (copy_to_user (oset, &s32, sizeof(compat_sigset_t))) ++ if (copy_to_user(oset, &s32, sizeof(compat_sigset_t))) + return -EFAULT; + } + return 0; + } + +-static inline long +-get_tv32(struct timeval *o, struct compat_timeval __user *i) ++static inline long get_tv32(struct timeval *o, struct compat_timeval __user *i) + { +- int err = -EFAULT; +- if (access_ok(VERIFY_READ, i, sizeof(*i))) { ++ int err = -EFAULT; ++ ++ if (access_ok(VERIFY_READ, i, sizeof(*i))) { + err = __get_user(o->tv_sec, &i->tv_sec); + err |= __get_user(o->tv_usec, &i->tv_usec); + } +- return err; ++ return err; + } + +-static inline long +-put_tv32(struct compat_timeval __user *o, struct timeval *i) ++static inline long put_tv32(struct compat_timeval __user *o, struct timeval *i) + { + int err = -EFAULT; +- if (access_ok(VERIFY_WRITE, o, sizeof(*o))) { ++ ++ if (access_ok(VERIFY_WRITE, o, sizeof(*o))) { + err = __put_user(i->tv_sec, &o->tv_sec); + err |= __put_user(i->tv_usec, &o->tv_usec); +- } +- return err; ++ } ++ return err; + } + +-extern unsigned int alarm_setitimer(unsigned int seconds); +- +-asmlinkage long +-sys32_alarm(unsigned int seconds) ++asmlinkage long sys32_alarm(unsigned int seconds) + { + return alarm_setitimer(seconds); + } + +-/* Translations due to time_t size differences. Which affects all +- sorts of things, like timeval and itimerval. */ +- +-extern struct timezone sys_tz; +- +-asmlinkage long +-sys32_gettimeofday(struct compat_timeval __user *tv, struct timezone __user *tz) ++/* ++ * Translations due to time_t size differences. Which affects all ++ * sorts of things, like timeval and itimerval. 
++ */ ++asmlinkage long sys32_gettimeofday(struct compat_timeval __user *tv, ++ struct timezone __user *tz) + { + if (tv) { + struct timeval ktv; ++ + do_gettimeofday(&ktv); + if (put_tv32(tv, &ktv)) + return -EFAULT; +@@ -465,14 +475,14 @@ sys32_gettimeofday(struct compat_timeval __user *tv, struct timezone __user *tz) + return 0; + } + +-asmlinkage long +-sys32_settimeofday(struct compat_timeval __user *tv, struct timezone __user *tz) ++asmlinkage long sys32_settimeofday(struct compat_timeval __user *tv, ++ struct timezone __user *tz) + { + struct timeval ktv; + struct timespec kts; + struct timezone ktz; + +- if (tv) { ++ if (tv) { + if (get_tv32(&ktv, tv)) + return -EFAULT; + kts.tv_sec = ktv.tv_sec; +@@ -494,8 +504,7 @@ struct sel_arg_struct { + unsigned int tvp; + }; + +-asmlinkage long +-sys32_old_select(struct sel_arg_struct __user *arg) ++asmlinkage long sys32_old_select(struct sel_arg_struct __user *arg) + { + struct sel_arg_struct a; + +@@ -505,50 +514,45 @@ sys32_old_select(struct sel_arg_struct __user *arg) + compat_ptr(a.exp), compat_ptr(a.tvp)); + } + +-extern asmlinkage long +-compat_sys_wait4(compat_pid_t pid, compat_uint_t * stat_addr, int options, +- struct compat_rusage *ru); +- +-asmlinkage long +-sys32_waitpid(compat_pid_t pid, unsigned int *stat_addr, int options) ++asmlinkage long sys32_waitpid(compat_pid_t pid, unsigned int *stat_addr, ++ int options) + { + return compat_sys_wait4(pid, stat_addr, options, NULL); + } + + /* 32-bit timeval and related flotsam. */ + +-asmlinkage long +-sys32_sysfs(int option, u32 arg1, u32 arg2) ++asmlinkage long sys32_sysfs(int option, u32 arg1, u32 arg2) + { + return sys_sysfs(option, arg1, arg2); + } + +-asmlinkage long +-sys32_sched_rr_get_interval(compat_pid_t pid, struct compat_timespec __user *interval) ++asmlinkage long sys32_sched_rr_get_interval(compat_pid_t pid, ++ struct compat_timespec __user *interval) + { + struct timespec t; + int ret; +- mm_segment_t old_fs = get_fs (); +- +- set_fs (KERNEL_DS); ++ mm_segment_t old_fs = get_fs(); ++ ++ set_fs(KERNEL_DS); + ret = sys_sched_rr_get_interval(pid, (struct timespec __user *)&t); +- set_fs (old_fs); ++ set_fs(old_fs); + if (put_compat_timespec(&t, interval)) + return -EFAULT; + return ret; + } + +-asmlinkage long +-sys32_rt_sigpending(compat_sigset_t __user *set, compat_size_t sigsetsize) ++asmlinkage long sys32_rt_sigpending(compat_sigset_t __user *set, ++ compat_size_t sigsetsize) + { + sigset_t s; + compat_sigset_t s32; + int ret; + mm_segment_t old_fs = get_fs(); +- +- set_fs (KERNEL_DS); ++ ++ set_fs(KERNEL_DS); + ret = sys_rt_sigpending((sigset_t __user *)&s, sigsetsize); +- set_fs (old_fs); ++ set_fs(old_fs); + if (!ret) { + switch (_NSIG_WORDS) { + case 4: s32.sig[7] = (s.sig[3] >> 32); s32.sig[6] = s.sig[3]; +@@ -556,30 +560,29 @@ sys32_rt_sigpending(compat_sigset_t __user *set, compat_size_t sigsetsize) + case 2: s32.sig[3] = (s.sig[1] >> 32); s32.sig[2] = s.sig[1]; + case 1: s32.sig[1] = (s.sig[0] >> 32); s32.sig[0] = s.sig[0]; + } +- if (copy_to_user (set, &s32, sizeof(compat_sigset_t))) ++ if (copy_to_user(set, &s32, sizeof(compat_sigset_t))) + return -EFAULT; + } + return ret; + } + +-asmlinkage long +-sys32_rt_sigqueueinfo(int pid, int sig, compat_siginfo_t __user *uinfo) ++asmlinkage long sys32_rt_sigqueueinfo(int pid, int sig, ++ compat_siginfo_t __user *uinfo) + { + siginfo_t info; + int ret; + mm_segment_t old_fs = get_fs(); +- ++ + if (copy_siginfo_from_user32(&info, uinfo)) + return -EFAULT; +- set_fs (KERNEL_DS); ++ set_fs(KERNEL_DS); + ret = 
sys_rt_sigqueueinfo(pid, sig, (siginfo_t __user *)&info); +- set_fs (old_fs); ++ set_fs(old_fs); + return ret; + } + + /* These are here just in case some old ia32 binary calls it. */ +-asmlinkage long +-sys32_pause(void) ++asmlinkage long sys32_pause(void) + { + current->state = TASK_INTERRUPTIBLE; + schedule(); +@@ -599,25 +602,25 @@ struct sysctl_ia32 { + }; + + +-asmlinkage long +-sys32_sysctl(struct sysctl_ia32 __user *args32) ++asmlinkage long sys32_sysctl(struct sysctl_ia32 __user *args32) + { + struct sysctl_ia32 a32; +- mm_segment_t old_fs = get_fs (); ++ mm_segment_t old_fs = get_fs(); + void __user *oldvalp, *newvalp; + size_t oldlen; + int __user *namep; + long ret; + +- if (copy_from_user(&a32, args32, sizeof (a32))) ++ if (copy_from_user(&a32, args32, sizeof(a32))) + return -EFAULT; + + /* +- * We need to pre-validate these because we have to disable address checking +- * before calling do_sysctl() because of OLDLEN but we can't run the risk of the +- * user specifying bad addresses here. Well, since we're dealing with 32 bit +- * addresses, we KNOW that access_ok() will always succeed, so this is an +- * expensive NOP, but so what... ++ * We need to pre-validate these because we have to disable ++ * address checking before calling do_sysctl() because of ++ * OLDLEN but we can't run the risk of the user specifying bad ++ * addresses here. Well, since we're dealing with 32 bit ++ * addresses, we KNOW that access_ok() will always succeed, so ++ * this is an expensive NOP, but so what... + */ + namep = compat_ptr(a32.name); + oldvalp = compat_ptr(a32.oldval); +@@ -636,34 +639,34 @@ sys32_sysctl(struct sysctl_ia32 __user *args32) + unlock_kernel(); + set_fs(old_fs); + +- if (oldvalp && put_user (oldlen, (int __user *)compat_ptr(a32.oldlenp))) ++ if (oldvalp && put_user(oldlen, (int __user *)compat_ptr(a32.oldlenp))) + return -EFAULT; + + return ret; + } + #endif + +-/* warning: next two assume little endian */ +-asmlinkage long +-sys32_pread(unsigned int fd, char __user *ubuf, u32 count, u32 poslo, u32 poshi) ++/* warning: next two assume little endian */ ++asmlinkage long sys32_pread(unsigned int fd, char __user *ubuf, u32 count, ++ u32 poslo, u32 poshi) + { + return sys_pread64(fd, ubuf, count, + ((loff_t)AA(poshi) << 32) | AA(poslo)); + } + +-asmlinkage long +-sys32_pwrite(unsigned int fd, char __user *ubuf, u32 count, u32 poslo, u32 poshi) ++asmlinkage long sys32_pwrite(unsigned int fd, char __user *ubuf, u32 count, ++ u32 poslo, u32 poshi) + { + return sys_pwrite64(fd, ubuf, count, + ((loff_t)AA(poshi) << 32) | AA(poslo)); + } + + +-asmlinkage long +-sys32_personality(unsigned long personality) ++asmlinkage long sys32_personality(unsigned long personality) + { + int ret; +- if (personality(current->personality) == PER_LINUX32 && ++ ++ if (personality(current->personality) == PER_LINUX32 && + personality == PER_LINUX) + personality = PER_LINUX32; + ret = sys_personality(personality); +@@ -672,34 +675,33 @@ sys32_personality(unsigned long personality) + return ret; + } + +-asmlinkage long +-sys32_sendfile(int out_fd, int in_fd, compat_off_t __user *offset, s32 count) ++asmlinkage long sys32_sendfile(int out_fd, int in_fd, ++ compat_off_t __user *offset, s32 count) + { + mm_segment_t old_fs = get_fs(); + int ret; + off_t of; +- ++ + if (offset && get_user(of, offset)) + return -EFAULT; +- ++ + set_fs(KERNEL_DS); + ret = sys_sendfile(out_fd, in_fd, offset ? 
(off_t __user *)&of : NULL, + count); + set_fs(old_fs); +- ++ + if (offset && put_user(of, offset)) + return -EFAULT; +- + return ret; + } + + asmlinkage long sys32_mmap2(unsigned long addr, unsigned long len, +- unsigned long prot, unsigned long flags, +- unsigned long fd, unsigned long pgoff) ++ unsigned long prot, unsigned long flags, ++ unsigned long fd, unsigned long pgoff) + { + struct mm_struct *mm = current->mm; + unsigned long error; +- struct file * file = NULL; ++ struct file *file = NULL; + + flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE); + if (!(flags & MAP_ANONYMOUS)) { +@@ -717,36 +719,35 @@ asmlinkage long sys32_mmap2(unsigned long addr, unsigned long len, + return error; + } + +-asmlinkage long sys32_olduname(struct oldold_utsname __user * name) ++asmlinkage long sys32_olduname(struct oldold_utsname __user *name) + { ++ char *arch = "x86_64"; + int err; + + if (!name) + return -EFAULT; + if (!access_ok(VERIFY_WRITE, name, sizeof(struct oldold_utsname))) + return -EFAULT; +- +- down_read(&uts_sem); +- +- err = __copy_to_user(&name->sysname,&utsname()->sysname, +- __OLD_UTS_LEN); +- err |= __put_user(0,name->sysname+__OLD_UTS_LEN); +- err |= __copy_to_user(&name->nodename,&utsname()->nodename, +- __OLD_UTS_LEN); +- err |= __put_user(0,name->nodename+__OLD_UTS_LEN); +- err |= __copy_to_user(&name->release,&utsname()->release, +- __OLD_UTS_LEN); +- err |= __put_user(0,name->release+__OLD_UTS_LEN); +- err |= __copy_to_user(&name->version,&utsname()->version, +- __OLD_UTS_LEN); +- err |= __put_user(0,name->version+__OLD_UTS_LEN); +- { +- char *arch = "x86_64"; +- if (personality(current->personality) == PER_LINUX32) +- arch = "i686"; +- +- err |= __copy_to_user(&name->machine, arch, strlen(arch)+1); +- } ++ ++ down_read(&uts_sem); ++ ++ err = __copy_to_user(&name->sysname, &utsname()->sysname, ++ __OLD_UTS_LEN); ++ err |= __put_user(0, name->sysname+__OLD_UTS_LEN); ++ err |= __copy_to_user(&name->nodename, &utsname()->nodename, ++ __OLD_UTS_LEN); ++ err |= __put_user(0, name->nodename+__OLD_UTS_LEN); ++ err |= __copy_to_user(&name->release, &utsname()->release, ++ __OLD_UTS_LEN); ++ err |= __put_user(0, name->release+__OLD_UTS_LEN); ++ err |= __copy_to_user(&name->version, &utsname()->version, ++ __OLD_UTS_LEN); ++ err |= __put_user(0, name->version+__OLD_UTS_LEN); ++ ++ if (personality(current->personality) == PER_LINUX32) ++ arch = "i686"; ++ ++ err |= __copy_to_user(&name->machine, arch, strlen(arch) + 1); + + up_read(&uts_sem); + +@@ -755,17 +756,19 @@ asmlinkage long sys32_olduname(struct oldold_utsname __user * name) + return err; + } + +-long sys32_uname(struct old_utsname __user * name) ++long sys32_uname(struct old_utsname __user *name) + { + int err; ++ + if (!name) + return -EFAULT; + down_read(&uts_sem); +- err = copy_to_user(name, utsname(), sizeof (*name)); ++ err = copy_to_user(name, utsname(), sizeof(*name)); + up_read(&uts_sem); +- if (personality(current->personality) == PER_LINUX32) ++ if (personality(current->personality) == PER_LINUX32) + err |= copy_to_user(&name->machine, "i686", 5); +- return err?-EFAULT:0; ++ ++ return err ? 
-EFAULT : 0; + } + + long sys32_ustat(unsigned dev, struct ustat32 __user *u32p) +@@ -773,27 +776,28 @@ long sys32_ustat(unsigned dev, struct ustat32 __user *u32p) + struct ustat u; + mm_segment_t seg; + int ret; +- +- seg = get_fs(); +- set_fs(KERNEL_DS); ++ ++ seg = get_fs(); ++ set_fs(KERNEL_DS); + ret = sys_ustat(dev, (struct ustat __user *)&u); + set_fs(seg); +- if (ret >= 0) { +- if (!access_ok(VERIFY_WRITE,u32p,sizeof(struct ustat32)) || +- __put_user((__u32) u.f_tfree, &u32p->f_tfree) || +- __put_user((__u32) u.f_tinode, &u32p->f_tfree) || +- __copy_to_user(&u32p->f_fname, u.f_fname, sizeof(u.f_fname)) || +- __copy_to_user(&u32p->f_fpack, u.f_fpack, sizeof(u.f_fpack))) +- ret = -EFAULT; +- } ++ if (ret < 0) ++ return ret; ++ ++ if (!access_ok(VERIFY_WRITE, u32p, sizeof(struct ustat32)) || ++ __put_user((__u32) u.f_tfree, &u32p->f_tfree) || ++ __put_user((__u32) u.f_tinode, &u32p->f_tfree) || ++ __copy_to_user(&u32p->f_fname, u.f_fname, sizeof(u.f_fname)) || ++ __copy_to_user(&u32p->f_fpack, u.f_fpack, sizeof(u.f_fpack))) ++ ret = -EFAULT; + return ret; +-} ++} + + asmlinkage long sys32_execve(char __user *name, compat_uptr_t __user *argv, + compat_uptr_t __user *envp, struct pt_regs *regs) + { + long error; +- char * filename; ++ char *filename; + + filename = getname(name); + error = PTR_ERR(filename); +@@ -812,18 +816,19 @@ asmlinkage long sys32_execve(char __user *name, compat_uptr_t __user *argv, + asmlinkage long sys32_clone(unsigned int clone_flags, unsigned int newsp, + struct pt_regs *regs) + { +- void __user *parent_tid = (void __user *)regs->rdx; +- void __user *child_tid = (void __user *)regs->rdi; ++ void __user *parent_tid = (void __user *)regs->dx; ++ void __user *child_tid = (void __user *)regs->di; ++ + if (!newsp) +- newsp = regs->rsp; +- return do_fork(clone_flags, newsp, regs, 0, parent_tid, child_tid); ++ newsp = regs->sp; ++ return do_fork(clone_flags, newsp, regs, 0, parent_tid, child_tid); + } + + /* +- * Some system calls that need sign extended arguments. This could be done by a generic wrapper. +- */ +- +-long sys32_lseek (unsigned int fd, int offset, unsigned int whence) ++ * Some system calls that need sign extended arguments. This could be ++ * done by a generic wrapper. 
++ */ ++long sys32_lseek(unsigned int fd, int offset, unsigned int whence) + { + return sys_lseek(fd, offset, whence); + } +@@ -832,49 +837,52 @@ long sys32_kill(int pid, int sig) + { + return sys_kill(pid, sig); + } +- +-long sys32_fadvise64_64(int fd, __u32 offset_low, __u32 offset_high, ++ ++long sys32_fadvise64_64(int fd, __u32 offset_low, __u32 offset_high, + __u32 len_low, __u32 len_high, int advice) +-{ ++{ + return sys_fadvise64_64(fd, + (((u64)offset_high)<<32) | offset_low, + (((u64)len_high)<<32) | len_low, +- advice); +-} ++ advice); ++} + + long sys32_vm86_warning(void) +-{ ++{ + struct task_struct *me = current; + static char lastcomm[sizeof(me->comm)]; ++ + if (strncmp(lastcomm, me->comm, sizeof(lastcomm))) { +- compat_printk(KERN_INFO "%s: vm86 mode not supported on 64 bit kernel\n", +- me->comm); ++ compat_printk(KERN_INFO ++ "%s: vm86 mode not supported on 64 bit kernel\n", ++ me->comm); + strncpy(lastcomm, me->comm, sizeof(lastcomm)); +- } ++ } + return -ENOSYS; +-} ++} + + long sys32_lookup_dcookie(u32 addr_low, u32 addr_high, +- char __user * buf, size_t len) ++ char __user *buf, size_t len) + { + return sys_lookup_dcookie(((u64)addr_high << 32) | addr_low, buf, len); + } + +-asmlinkage ssize_t sys32_readahead(int fd, unsigned off_lo, unsigned off_hi, size_t count) ++asmlinkage ssize_t sys32_readahead(int fd, unsigned off_lo, unsigned off_hi, ++ size_t count) + { + return sys_readahead(fd, ((u64)off_hi << 32) | off_lo, count); + } + + asmlinkage long sys32_sync_file_range(int fd, unsigned off_low, unsigned off_hi, +- unsigned n_low, unsigned n_hi, int flags) ++ unsigned n_low, unsigned n_hi, int flags) + { + return sys_sync_file_range(fd, + ((u64)off_hi << 32) | off_low, + ((u64)n_hi << 32) | n_low, flags); + } + +-asmlinkage long sys32_fadvise64(int fd, unsigned offset_lo, unsigned offset_hi, size_t len, +- int advice) ++asmlinkage long sys32_fadvise64(int fd, unsigned offset_lo, unsigned offset_hi, ++ size_t len, int advice) + { + return sys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo, + len, advice); +diff --git a/arch/x86/ia32/syscall32.c b/arch/x86/ia32/syscall32.c +deleted file mode 100644 +index 15013ba..0000000 +--- a/arch/x86/ia32/syscall32.c ++++ /dev/null +@@ -1,83 +0,0 @@ +-/* Copyright 2002,2003 Andi Kleen, SuSE Labs */ +- +-/* vsyscall handling for 32bit processes. Map a stub page into it +- on demand because 32bit cannot reach the kernel's fixmaps */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-extern unsigned char syscall32_syscall[], syscall32_syscall_end[]; +-extern unsigned char syscall32_sysenter[], syscall32_sysenter_end[]; +-extern int sysctl_vsyscall32; +- +-static struct page *syscall32_pages[1]; +-static int use_sysenter = -1; +- +-struct linux_binprm; +- +-/* Setup a VMA at program startup for the vsyscall page */ +-int syscall32_setup_pages(struct linux_binprm *bprm, int exstack) +-{ +- struct mm_struct *mm = current->mm; +- int ret; +- +- down_write(&mm->mmap_sem); +- /* +- * MAYWRITE to allow gdb to COW and set breakpoints +- * +- * Make sure the vDSO gets into every core dump. +- * Dumping its contents makes post-mortem fully interpretable later +- * without matching up the same kernel and hardware config to see +- * what PC values meant. 
+- */ +- /* Could randomize here */ +- ret = install_special_mapping(mm, VSYSCALL32_BASE, PAGE_SIZE, +- VM_READ|VM_EXEC| +- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC| +- VM_ALWAYSDUMP, +- syscall32_pages); +- up_write(&mm->mmap_sem); +- return ret; +-} +- +-static int __init init_syscall32(void) +-{ +- char *syscall32_page = (void *)get_zeroed_page(GFP_KERNEL); +- if (!syscall32_page) +- panic("Cannot allocate syscall32 page"); +- syscall32_pages[0] = virt_to_page(syscall32_page); +- if (use_sysenter > 0) { +- memcpy(syscall32_page, syscall32_sysenter, +- syscall32_sysenter_end - syscall32_sysenter); +- } else { +- memcpy(syscall32_page, syscall32_syscall, +- syscall32_syscall_end - syscall32_syscall); +- } +- return 0; +-} +- +-__initcall(init_syscall32); +- +-/* May not be __init: called during resume */ +-void syscall32_cpu_init(void) +-{ +- if (use_sysenter < 0) +- use_sysenter = (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL); +- +- /* Load these always in case some future AMD CPU supports +- SYSENTER from compat mode too. */ +- checking_wrmsrl(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS); +- checking_wrmsrl(MSR_IA32_SYSENTER_ESP, 0ULL); +- checking_wrmsrl(MSR_IA32_SYSENTER_EIP, (u64)ia32_sysenter_target); +- +- wrmsrl(MSR_CSTAR, ia32_cstar_target); +-} +diff --git a/arch/x86/ia32/syscall32_syscall.S b/arch/x86/ia32/syscall32_syscall.S +deleted file mode 100644 +index 933f0f0..0000000 +--- a/arch/x86/ia32/syscall32_syscall.S ++++ /dev/null +@@ -1,17 +0,0 @@ +-/* 32bit VDSOs mapped into user space. */ +- +- .section ".init.data","aw" +- +- .globl syscall32_syscall +- .globl syscall32_syscall_end +- +-syscall32_syscall: +- .incbin "arch/x86/ia32/vsyscall-syscall.so" +-syscall32_syscall_end: +- +- .globl syscall32_sysenter +- .globl syscall32_sysenter_end +- +-syscall32_sysenter: +- .incbin "arch/x86/ia32/vsyscall-sysenter.so" +-syscall32_sysenter_end: +diff --git a/arch/x86/ia32/tls32.c b/arch/x86/ia32/tls32.c +deleted file mode 100644 +index 1cc4340..0000000 +--- a/arch/x86/ia32/tls32.c ++++ /dev/null +@@ -1,163 +0,0 @@ +-#include +-#include +-#include +-#include +- +-#include +-#include +-#include +-#include +-#include +-#include +- +-/* +- * sys_alloc_thread_area: get a yet unused TLS descriptor index. +- */ +-static int get_free_idx(void) +-{ +- struct thread_struct *t = ¤t->thread; +- int idx; +- +- for (idx = 0; idx < GDT_ENTRY_TLS_ENTRIES; idx++) +- if (desc_empty((struct n_desc_struct *)(t->tls_array) + idx)) +- return idx + GDT_ENTRY_TLS_MIN; +- return -ESRCH; +-} +- +-/* +- * Set a given TLS descriptor: +- * When you want addresses > 32bit use arch_prctl() +- */ +-int do_set_thread_area(struct thread_struct *t, struct user_desc __user *u_info) +-{ +- struct user_desc info; +- struct n_desc_struct *desc; +- int cpu, idx; +- +- if (copy_from_user(&info, u_info, sizeof(info))) +- return -EFAULT; +- +- idx = info.entry_number; +- +- /* +- * index -1 means the kernel should try to find and +- * allocate an empty descriptor: +- */ +- if (idx == -1) { +- idx = get_free_idx(); +- if (idx < 0) +- return idx; +- if (put_user(idx, &u_info->entry_number)) +- return -EFAULT; +- } +- +- if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) +- return -EINVAL; +- +- desc = ((struct n_desc_struct *)t->tls_array) + idx - GDT_ENTRY_TLS_MIN; +- +- /* +- * We must not get preempted while modifying the TLS. 
+- */ +- cpu = get_cpu(); +- +- if (LDT_empty(&info)) { +- desc->a = 0; +- desc->b = 0; +- } else { +- desc->a = LDT_entry_a(&info); +- desc->b = LDT_entry_b(&info); +- } +- if (t == ¤t->thread) +- load_TLS(t, cpu); +- +- put_cpu(); +- return 0; +-} +- +-asmlinkage long sys32_set_thread_area(struct user_desc __user *u_info) +-{ +- return do_set_thread_area(¤t->thread, u_info); +-} +- +- +-/* +- * Get the current Thread-Local Storage area: +- */ +- +-#define GET_BASE(desc) ( \ +- (((desc)->a >> 16) & 0x0000ffff) | \ +- (((desc)->b << 16) & 0x00ff0000) | \ +- ( (desc)->b & 0xff000000) ) +- +-#define GET_LIMIT(desc) ( \ +- ((desc)->a & 0x0ffff) | \ +- ((desc)->b & 0xf0000) ) +- +-#define GET_32BIT(desc) (((desc)->b >> 22) & 1) +-#define GET_CONTENTS(desc) (((desc)->b >> 10) & 3) +-#define GET_WRITABLE(desc) (((desc)->b >> 9) & 1) +-#define GET_LIMIT_PAGES(desc) (((desc)->b >> 23) & 1) +-#define GET_PRESENT(desc) (((desc)->b >> 15) & 1) +-#define GET_USEABLE(desc) (((desc)->b >> 20) & 1) +-#define GET_LONGMODE(desc) (((desc)->b >> 21) & 1) +- +-int do_get_thread_area(struct thread_struct *t, struct user_desc __user *u_info) +-{ +- struct user_desc info; +- struct n_desc_struct *desc; +- int idx; +- +- if (get_user(idx, &u_info->entry_number)) +- return -EFAULT; +- if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) +- return -EINVAL; +- +- desc = ((struct n_desc_struct *)t->tls_array) + idx - GDT_ENTRY_TLS_MIN; +- +- memset(&info, 0, sizeof(struct user_desc)); +- info.entry_number = idx; +- info.base_addr = GET_BASE(desc); +- info.limit = GET_LIMIT(desc); +- info.seg_32bit = GET_32BIT(desc); +- info.contents = GET_CONTENTS(desc); +- info.read_exec_only = !GET_WRITABLE(desc); +- info.limit_in_pages = GET_LIMIT_PAGES(desc); +- info.seg_not_present = !GET_PRESENT(desc); +- info.useable = GET_USEABLE(desc); +- info.lm = GET_LONGMODE(desc); +- +- if (copy_to_user(u_info, &info, sizeof(info))) +- return -EFAULT; +- return 0; +-} +- +-asmlinkage long sys32_get_thread_area(struct user_desc __user *u_info) +-{ +- return do_get_thread_area(¤t->thread, u_info); +-} +- +- +-int ia32_child_tls(struct task_struct *p, struct pt_regs *childregs) +-{ +- struct n_desc_struct *desc; +- struct user_desc info; +- struct user_desc __user *cp; +- int idx; +- +- cp = (void __user *)childregs->rsi; +- if (copy_from_user(&info, cp, sizeof(info))) +- return -EFAULT; +- if (LDT_empty(&info)) +- return -EINVAL; +- +- idx = info.entry_number; +- if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) +- return -EINVAL; +- +- desc = (struct n_desc_struct *)(p->thread.tls_array) + idx - GDT_ENTRY_TLS_MIN; +- desc->a = LDT_entry_a(&info); +- desc->b = LDT_entry_b(&info); +- +- return 0; +-} +diff --git a/arch/x86/ia32/vsyscall-sigreturn.S b/arch/x86/ia32/vsyscall-sigreturn.S +deleted file mode 100644 +index b383be0..0000000 +--- a/arch/x86/ia32/vsyscall-sigreturn.S ++++ /dev/null +@@ -1,143 +0,0 @@ +-/* +- * Common code for the sigreturn entry points on the vsyscall page. +- * This code uses SYSCALL_ENTER_KERNEL (either syscall or int $0x80) +- * to enter the kernel. +- * This file is #include'd by vsyscall-*.S to define them after the +- * vsyscall entry point. The addresses we get for these entry points +- * by doing ".balign 32" must match in both versions of the page. 
+- */ +- +- .code32 +- .section .text.sigreturn,"ax" +- .balign 32 +- .globl __kernel_sigreturn +- .type __kernel_sigreturn,@function +-__kernel_sigreturn: +-.LSTART_sigreturn: +- popl %eax +- movl $__NR_ia32_sigreturn, %eax +- SYSCALL_ENTER_KERNEL +-.LEND_sigreturn: +- .size __kernel_sigreturn,.-.LSTART_sigreturn +- +- .section .text.rtsigreturn,"ax" +- .balign 32 +- .globl __kernel_rt_sigreturn +- .type __kernel_rt_sigreturn,@function +-__kernel_rt_sigreturn: +-.LSTART_rt_sigreturn: +- movl $__NR_ia32_rt_sigreturn, %eax +- SYSCALL_ENTER_KERNEL +-.LEND_rt_sigreturn: +- .size __kernel_rt_sigreturn,.-.LSTART_rt_sigreturn +- +- .section .eh_frame,"a",@progbits +-.LSTARTFRAMES: +- .long .LENDCIES-.LSTARTCIES +-.LSTARTCIES: +- .long 0 /* CIE ID */ +- .byte 1 /* Version number */ +- .string "zRS" /* NUL-terminated augmentation string */ +- .uleb128 1 /* Code alignment factor */ +- .sleb128 -4 /* Data alignment factor */ +- .byte 8 /* Return address register column */ +- .uleb128 1 /* Augmentation value length */ +- .byte 0x1b /* DW_EH_PE_pcrel|DW_EH_PE_sdata4. */ +- .byte 0x0c /* DW_CFA_def_cfa */ +- .uleb128 4 +- .uleb128 4 +- .byte 0x88 /* DW_CFA_offset, column 0x8 */ +- .uleb128 1 +- .align 4 +-.LENDCIES: +- +- .long .LENDFDE2-.LSTARTFDE2 /* Length FDE */ +-.LSTARTFDE2: +- .long .LSTARTFDE2-.LSTARTFRAMES /* CIE pointer */ +- /* HACK: The dwarf2 unwind routines will subtract 1 from the +- return address to get an address in the middle of the +- presumed call instruction. Since we didn't get here via +- a call, we need to include the nop before the real start +- to make up for it. */ +- .long .LSTART_sigreturn-1-. /* PC-relative start address */ +- .long .LEND_sigreturn-.LSTART_sigreturn+1 +- .uleb128 0 /* Augmentation length */ +- /* What follows are the instructions for the table generation. +- We record the locations of each register saved. This is +- complicated by the fact that the "CFA" is always assumed to +- be the value of the stack pointer in the caller. This means +- that we must define the CFA of this body of code to be the +- saved value of the stack pointer in the sigcontext. Which +- also means that there is no fixed relation to the other +- saved registers, which means that we must use DW_CFA_expression +- to compute their addresses. It also means that when we +- adjust the stack with the popl, we have to do it all over again. */ +- +-#define do_cfa_expr(offset) \ +- .byte 0x0f; /* DW_CFA_def_cfa_expression */ \ +- .uleb128 1f-0f; /* length */ \ +-0: .byte 0x74; /* DW_OP_breg4 */ \ +- .sleb128 offset; /* offset */ \ +- .byte 0x06; /* DW_OP_deref */ \ +-1: +- +-#define do_expr(regno, offset) \ +- .byte 0x10; /* DW_CFA_expression */ \ +- .uleb128 regno; /* regno */ \ +- .uleb128 1f-0f; /* length */ \ +-0: .byte 0x74; /* DW_OP_breg4 */ \ +- .sleb128 offset; /* offset */ \ +-1: +- +- do_cfa_expr(IA32_SIGCONTEXT_esp+4) +- do_expr(0, IA32_SIGCONTEXT_eax+4) +- do_expr(1, IA32_SIGCONTEXT_ecx+4) +- do_expr(2, IA32_SIGCONTEXT_edx+4) +- do_expr(3, IA32_SIGCONTEXT_ebx+4) +- do_expr(5, IA32_SIGCONTEXT_ebp+4) +- do_expr(6, IA32_SIGCONTEXT_esi+4) +- do_expr(7, IA32_SIGCONTEXT_edi+4) +- do_expr(8, IA32_SIGCONTEXT_eip+4) +- +- .byte 0x42 /* DW_CFA_advance_loc 2 -- nop; popl eax. 
*/ +- +- do_cfa_expr(IA32_SIGCONTEXT_esp) +- do_expr(0, IA32_SIGCONTEXT_eax) +- do_expr(1, IA32_SIGCONTEXT_ecx) +- do_expr(2, IA32_SIGCONTEXT_edx) +- do_expr(3, IA32_SIGCONTEXT_ebx) +- do_expr(5, IA32_SIGCONTEXT_ebp) +- do_expr(6, IA32_SIGCONTEXT_esi) +- do_expr(7, IA32_SIGCONTEXT_edi) +- do_expr(8, IA32_SIGCONTEXT_eip) +- +- .align 4 +-.LENDFDE2: +- +- .long .LENDFDE3-.LSTARTFDE3 /* Length FDE */ +-.LSTARTFDE3: +- .long .LSTARTFDE3-.LSTARTFRAMES /* CIE pointer */ +- /* HACK: See above wrt unwind library assumptions. */ +- .long .LSTART_rt_sigreturn-1-. /* PC-relative start address */ +- .long .LEND_rt_sigreturn-.LSTART_rt_sigreturn+1 +- .uleb128 0 /* Augmentation */ +- /* What follows are the instructions for the table generation. +- We record the locations of each register saved. This is +- slightly less complicated than the above, since we don't +- modify the stack pointer in the process. */ +- +- do_cfa_expr(IA32_RT_SIGFRAME_sigcontext-4 + IA32_SIGCONTEXT_esp) +- do_expr(0, IA32_RT_SIGFRAME_sigcontext-4 + IA32_SIGCONTEXT_eax) +- do_expr(1, IA32_RT_SIGFRAME_sigcontext-4 + IA32_SIGCONTEXT_ecx) +- do_expr(2, IA32_RT_SIGFRAME_sigcontext-4 + IA32_SIGCONTEXT_edx) +- do_expr(3, IA32_RT_SIGFRAME_sigcontext-4 + IA32_SIGCONTEXT_ebx) +- do_expr(5, IA32_RT_SIGFRAME_sigcontext-4 + IA32_SIGCONTEXT_ebp) +- do_expr(6, IA32_RT_SIGFRAME_sigcontext-4 + IA32_SIGCONTEXT_esi) +- do_expr(7, IA32_RT_SIGFRAME_sigcontext-4 + IA32_SIGCONTEXT_edi) +- do_expr(8, IA32_RT_SIGFRAME_sigcontext-4 + IA32_SIGCONTEXT_eip) +- +- .align 4 +-.LENDFDE3: +- +-#include "../../x86/kernel/vsyscall-note_32.S" +- +diff --git a/arch/x86/ia32/vsyscall-syscall.S b/arch/x86/ia32/vsyscall-syscall.S +deleted file mode 100644 +index cf9ef67..0000000 +--- a/arch/x86/ia32/vsyscall-syscall.S ++++ /dev/null +@@ -1,69 +0,0 @@ +-/* +- * Code for the vsyscall page. This version uses the syscall instruction. +- */ +- +-#include +-#include +-#include +- +- .code32 +- .text +- .section .text.vsyscall,"ax" +- .globl __kernel_vsyscall +- .type __kernel_vsyscall,@function +-__kernel_vsyscall: +-.LSTART_vsyscall: +- push %ebp +-.Lpush_ebp: +- movl %ecx, %ebp +- syscall +- movl $__USER32_DS, %ecx +- movl %ecx, %ss +- movl %ebp, %ecx +- popl %ebp +-.Lpop_ebp: +- ret +-.LEND_vsyscall: +- .size __kernel_vsyscall,.-.LSTART_vsyscall +- +- .section .eh_frame,"a",@progbits +-.LSTARTFRAME: +- .long .LENDCIE-.LSTARTCIE +-.LSTARTCIE: +- .long 0 /* CIE ID */ +- .byte 1 /* Version number */ +- .string "zR" /* NUL-terminated augmentation string */ +- .uleb128 1 /* Code alignment factor */ +- .sleb128 -4 /* Data alignment factor */ +- .byte 8 /* Return address register column */ +- .uleb128 1 /* Augmentation value length */ +- .byte 0x1b /* DW_EH_PE_pcrel|DW_EH_PE_sdata4. */ +- .byte 0x0c /* DW_CFA_def_cfa */ +- .uleb128 4 +- .uleb128 4 +- .byte 0x88 /* DW_CFA_offset, column 0x8 */ +- .uleb128 1 +- .align 4 +-.LENDCIE: +- +- .long .LENDFDE1-.LSTARTFDE1 /* Length FDE */ +-.LSTARTFDE1: +- .long .LSTARTFDE1-.LSTARTFRAME /* CIE pointer */ +- .long .LSTART_vsyscall-. /* PC-relative start address */ +- .long .LEND_vsyscall-.LSTART_vsyscall +- .uleb128 0 /* Augmentation length */ +- /* What follows are the instructions for the table generation. +- We have to record all changes of the stack pointer. 
*/ +- .byte 0x40 + .Lpush_ebp-.LSTART_vsyscall /* DW_CFA_advance_loc */ +- .byte 0x0e /* DW_CFA_def_cfa_offset */ +- .uleb128 8 +- .byte 0x85, 0x02 /* DW_CFA_offset %ebp -8 */ +- .byte 0x40 + .Lpop_ebp-.Lpush_ebp /* DW_CFA_advance_loc */ +- .byte 0xc5 /* DW_CFA_restore %ebp */ +- .byte 0x0e /* DW_CFA_def_cfa_offset */ +- .uleb128 4 +- .align 4 +-.LENDFDE1: +- +-#define SYSCALL_ENTER_KERNEL syscall +-#include "vsyscall-sigreturn.S" +diff --git a/arch/x86/ia32/vsyscall-sysenter.S b/arch/x86/ia32/vsyscall-sysenter.S +deleted file mode 100644 +index ae056e5..0000000 +--- a/arch/x86/ia32/vsyscall-sysenter.S ++++ /dev/null +@@ -1,95 +0,0 @@ +-/* +- * Code for the vsyscall page. This version uses the sysenter instruction. +- */ +- +-#include +-#include +- +- .code32 +- .text +- .section .text.vsyscall,"ax" +- .globl __kernel_vsyscall +- .type __kernel_vsyscall,@function +-__kernel_vsyscall: +-.LSTART_vsyscall: +- push %ecx +-.Lpush_ecx: +- push %edx +-.Lpush_edx: +- push %ebp +-.Lenter_kernel: +- movl %esp,%ebp +- sysenter +- .space 7,0x90 +- jmp .Lenter_kernel +- /* 16: System call normal return point is here! */ +- pop %ebp +-.Lpop_ebp: +- pop %edx +-.Lpop_edx: +- pop %ecx +-.Lpop_ecx: +- ret +-.LEND_vsyscall: +- .size __kernel_vsyscall,.-.LSTART_vsyscall +- +- .section .eh_frame,"a",@progbits +-.LSTARTFRAME: +- .long .LENDCIE-.LSTARTCIE +-.LSTARTCIE: +- .long 0 /* CIE ID */ +- .byte 1 /* Version number */ +- .string "zR" /* NUL-terminated augmentation string */ +- .uleb128 1 /* Code alignment factor */ +- .sleb128 -4 /* Data alignment factor */ +- .byte 8 /* Return address register column */ +- .uleb128 1 /* Augmentation value length */ +- .byte 0x1b /* DW_EH_PE_pcrel|DW_EH_PE_sdata4. */ +- .byte 0x0c /* DW_CFA_def_cfa */ +- .uleb128 4 +- .uleb128 4 +- .byte 0x88 /* DW_CFA_offset, column 0x8 */ +- .uleb128 1 +- .align 4 +-.LENDCIE: +- +- .long .LENDFDE1-.LSTARTFDE1 /* Length FDE */ +-.LSTARTFDE1: +- .long .LSTARTFDE1-.LSTARTFRAME /* CIE pointer */ +- .long .LSTART_vsyscall-. /* PC-relative start address */ +- .long .LEND_vsyscall-.LSTART_vsyscall +- .uleb128 0 /* Augmentation length */ +- /* What follows are the instructions for the table generation. +- We have to record all changes of the stack pointer. */ +- .byte 0x04 /* DW_CFA_advance_loc4 */ +- .long .Lpush_ecx-.LSTART_vsyscall +- .byte 0x0e /* DW_CFA_def_cfa_offset */ +- .byte 0x08 /* RA at offset 8 now */ +- .byte 0x04 /* DW_CFA_advance_loc4 */ +- .long .Lpush_edx-.Lpush_ecx +- .byte 0x0e /* DW_CFA_def_cfa_offset */ +- .byte 0x0c /* RA at offset 12 now */ +- .byte 0x04 /* DW_CFA_advance_loc4 */ +- .long .Lenter_kernel-.Lpush_edx +- .byte 0x0e /* DW_CFA_def_cfa_offset */ +- .byte 0x10 /* RA at offset 16 now */ +- .byte 0x85, 0x04 /* DW_CFA_offset %ebp -16 */ +- /* Finally the epilogue. 
*/ +- .byte 0x04 /* DW_CFA_advance_loc4 */ +- .long .Lpop_ebp-.Lenter_kernel +- .byte 0x0e /* DW_CFA_def_cfa_offset */ +- .byte 0x12 /* RA at offset 12 now */ +- .byte 0xc5 /* DW_CFA_restore %ebp */ +- .byte 0x04 /* DW_CFA_advance_loc4 */ +- .long .Lpop_edx-.Lpop_ebp +- .byte 0x0e /* DW_CFA_def_cfa_offset */ +- .byte 0x08 /* RA at offset 8 now */ +- .byte 0x04 /* DW_CFA_advance_loc4 */ +- .long .Lpop_ecx-.Lpop_edx +- .byte 0x0e /* DW_CFA_def_cfa_offset */ +- .byte 0x04 /* RA at offset 4 now */ +- .align 4 +-.LENDFDE1: +- +-#define SYSCALL_ENTER_KERNEL int $0x80 +-#include "vsyscall-sigreturn.S" +diff --git a/arch/x86/ia32/vsyscall.lds b/arch/x86/ia32/vsyscall.lds +deleted file mode 100644 +index 1dc86ff..0000000 +--- a/arch/x86/ia32/vsyscall.lds ++++ /dev/null +@@ -1,80 +0,0 @@ +-/* +- * Linker script for vsyscall DSO. The vsyscall page is an ELF shared +- * object prelinked to its virtual address. This script controls its layout. +- */ +- +-/* This must match . */ +-VSYSCALL_BASE = 0xffffe000; +- +-SECTIONS +-{ +- . = VSYSCALL_BASE + SIZEOF_HEADERS; +- +- .hash : { *(.hash) } :text +- .gnu.hash : { *(.gnu.hash) } +- .dynsym : { *(.dynsym) } +- .dynstr : { *(.dynstr) } +- .gnu.version : { *(.gnu.version) } +- .gnu.version_d : { *(.gnu.version_d) } +- .gnu.version_r : { *(.gnu.version_r) } +- +- /* This linker script is used both with -r and with -shared. +- For the layouts to match, we need to skip more than enough +- space for the dynamic symbol table et al. If this amount +- is insufficient, ld -shared will barf. Just increase it here. */ +- . = VSYSCALL_BASE + 0x400; +- +- .text.vsyscall : { *(.text.vsyscall) } :text =0x90909090 +- +- /* This is an 32bit object and we cannot easily get the offsets +- into the 64bit kernel. Just hardcode them here. This assumes +- that all the stubs don't need more than 0x100 bytes. */ +- . = VSYSCALL_BASE + 0x500; +- +- .text.sigreturn : { *(.text.sigreturn) } :text =0x90909090 +- +- . = VSYSCALL_BASE + 0x600; +- +- .text.rtsigreturn : { *(.text.rtsigreturn) } :text =0x90909090 +- +- .note : { *(.note.*) } :text :note +- .eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame_hdr +- .eh_frame : { KEEP (*(.eh_frame)) } :text +- .dynamic : { *(.dynamic) } :text :dynamic +- .useless : { +- *(.got.plt) *(.got) +- *(.data .data.* .gnu.linkonce.d.*) +- *(.dynbss) +- *(.bss .bss.* .gnu.linkonce.b.*) +- } :text +-} +- +-/* +- * We must supply the ELF program headers explicitly to get just one +- * PT_LOAD segment, and set the flags explicitly to make segments read-only. +- */ +-PHDRS +-{ +- text PT_LOAD FILEHDR PHDRS FLAGS(5); /* PF_R|PF_X */ +- dynamic PT_DYNAMIC FLAGS(4); /* PF_R */ +- note PT_NOTE FLAGS(4); /* PF_R */ +- eh_frame_hdr 0x6474e550; /* PT_GNU_EH_FRAME, but ld doesn't match the name */ +-} +- +-/* +- * This controls what symbols we export from the DSO. +- */ +-VERSION +-{ +- LINUX_2.5 { +- global: +- __kernel_vsyscall; +- __kernel_sigreturn; +- __kernel_rt_sigreturn; +- +- local: *; +- }; +-} +- +-/* The ELF entry point can be used to set the AT_SYSINFO value. */ +-ENTRY(__kernel_vsyscall); +diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile +index 3857334..6f81300 100644 +--- a/arch/x86/kernel/Makefile ++++ b/arch/x86/kernel/Makefile +@@ -1,9 +1,91 @@ +-ifeq ($(CONFIG_X86_32),y) +-include ${srctree}/arch/x86/kernel/Makefile_32 +-else +-include ${srctree}/arch/x86/kernel/Makefile_64 ++# ++# Makefile for the linux kernel. 
++# ++ ++extra-y := head_$(BITS).o init_task.o vmlinux.lds ++extra-$(CONFIG_X86_64) += head64.o ++ ++CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE) ++CFLAGS_vsyscall_64.o := $(PROFILING) -g0 ++ ++obj-y := process_$(BITS).o signal_$(BITS).o entry_$(BITS).o ++obj-y += traps_$(BITS).o irq_$(BITS).o ++obj-y += time_$(BITS).o ioport.o ldt.o ++obj-y += setup_$(BITS).o i8259_$(BITS).o ++obj-$(CONFIG_X86_32) += sys_i386_32.o i386_ksyms_32.o ++obj-$(CONFIG_X86_64) += sys_x86_64.o x8664_ksyms_64.o ++obj-$(CONFIG_X86_64) += syscall_64.o vsyscall_64.o setup64.o ++obj-y += pci-dma_$(BITS).o bootflag.o e820_$(BITS).o ++obj-y += quirks.o i8237.o topology.o kdebugfs.o ++obj-y += alternative.o i8253.o ++obj-$(CONFIG_X86_64) += pci-nommu_64.o bugs_64.o ++obj-y += tsc_$(BITS).o io_delay.o rtc.o ++ ++obj-y += i387.o ++obj-y += ptrace.o ++obj-y += ds.o ++obj-$(CONFIG_X86_32) += tls.o ++obj-$(CONFIG_IA32_EMULATION) += tls.o ++obj-y += step.o ++obj-$(CONFIG_STACKTRACE) += stacktrace.o ++obj-y += cpu/ ++obj-y += acpi/ ++obj-$(CONFIG_X86_BIOS_REBOOT) += reboot.o ++obj-$(CONFIG_X86_64) += reboot.o ++obj-$(CONFIG_MCA) += mca_32.o ++obj-$(CONFIG_X86_MSR) += msr.o ++obj-$(CONFIG_X86_CPUID) += cpuid.o ++obj-$(CONFIG_MICROCODE) += microcode.o ++obj-$(CONFIG_PCI) += early-quirks.o ++obj-$(CONFIG_APM) += apm_32.o ++obj-$(CONFIG_X86_SMP) += smp_$(BITS).o smpboot_$(BITS).o tsc_sync.o ++obj-$(CONFIG_X86_32_SMP) += smpcommon_32.o ++obj-$(CONFIG_X86_64_SMP) += smp_64.o smpboot_64.o tsc_sync.o ++obj-$(CONFIG_X86_TRAMPOLINE) += trampoline_$(BITS).o ++obj-$(CONFIG_X86_MPPARSE) += mpparse_$(BITS).o ++obj-$(CONFIG_X86_LOCAL_APIC) += apic_$(BITS).o nmi_$(BITS).o ++obj-$(CONFIG_X86_IO_APIC) += io_apic_$(BITS).o ++obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o ++obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o ++obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o ++obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o ++obj-$(CONFIG_X86_NUMAQ) += numaq_32.o ++obj-$(CONFIG_X86_SUMMIT_NUMA) += summit_32.o ++obj-$(CONFIG_X86_VSMP) += vsmp_64.o ++obj-$(CONFIG_KPROBES) += kprobes.o ++obj-$(CONFIG_MODULES) += module_$(BITS).o ++obj-$(CONFIG_ACPI_SRAT) += srat_32.o ++obj-$(CONFIG_EFI) += efi.o efi_$(BITS).o efi_stub_$(BITS).o ++obj-$(CONFIG_DOUBLEFAULT) += doublefault_32.o ++obj-$(CONFIG_VM86) += vm86_32.o ++obj-$(CONFIG_EARLY_PRINTK) += early_printk.o ++ ++obj-$(CONFIG_HPET_TIMER) += hpet.o ++ ++obj-$(CONFIG_K8_NB) += k8.o ++obj-$(CONFIG_MGEODE_LX) += geode_32.o mfgpt_32.o ++obj-$(CONFIG_DEBUG_RODATA_TEST) += test_rodata.o ++obj-$(CONFIG_DEBUG_NX_TEST) += test_nx.o ++ ++obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o ++obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o ++ ++ifdef CONFIG_INPUT_PCSPKR ++obj-y += pcspeaker.o + endif + +-# Workaround to delete .lds files with make clean +-# The problem is that we do not enter Makefile_32 with make clean. 
+-clean-files := vsyscall*.lds vsyscall*.so ++obj-$(CONFIG_SCx200) += scx200_32.o ++ ++### ++# 64 bit specific files ++ifeq ($(CONFIG_X86_64),y) ++ obj-y += genapic_64.o genapic_flat_64.o ++ obj-$(CONFIG_X86_PM_TIMER) += pmtimer_64.o ++ obj-$(CONFIG_AUDIT) += audit_64.o ++ obj-$(CONFIG_PM) += suspend_64.o ++ obj-$(CONFIG_HIBERNATION) += suspend_asm_64.o ++ ++ obj-$(CONFIG_GART_IOMMU) += pci-gart_64.o aperture_64.o ++ obj-$(CONFIG_CALGARY_IOMMU) += pci-calgary_64.o tce_64.o ++ obj-$(CONFIG_SWIOTLB) += pci-swiotlb_64.o ++endif +diff --git a/arch/x86/kernel/Makefile_32 b/arch/x86/kernel/Makefile_32 +deleted file mode 100644 +index a7bc93c..0000000 +--- a/arch/x86/kernel/Makefile_32 ++++ /dev/null +@@ -1,88 +0,0 @@ +-# +-# Makefile for the linux kernel. +-# +- +-extra-y := head_32.o init_task.o vmlinux.lds +-CPPFLAGS_vmlinux.lds += -Ui386 +- +-obj-y := process_32.o signal_32.o entry_32.o traps_32.o irq_32.o \ +- ptrace_32.o time_32.o ioport_32.o ldt_32.o setup_32.o i8259_32.o sys_i386_32.o \ +- pci-dma_32.o i386_ksyms_32.o i387_32.o bootflag.o e820_32.o\ +- quirks.o i8237.o topology.o alternative.o i8253.o tsc_32.o +- +-obj-$(CONFIG_STACKTRACE) += stacktrace.o +-obj-y += cpu/ +-obj-y += acpi/ +-obj-$(CONFIG_X86_BIOS_REBOOT) += reboot_32.o +-obj-$(CONFIG_MCA) += mca_32.o +-obj-$(CONFIG_X86_MSR) += msr.o +-obj-$(CONFIG_X86_CPUID) += cpuid.o +-obj-$(CONFIG_MICROCODE) += microcode.o +-obj-$(CONFIG_PCI) += early-quirks.o +-obj-$(CONFIG_APM) += apm_32.o +-obj-$(CONFIG_X86_SMP) += smp_32.o smpboot_32.o tsc_sync.o +-obj-$(CONFIG_SMP) += smpcommon_32.o +-obj-$(CONFIG_X86_TRAMPOLINE) += trampoline_32.o +-obj-$(CONFIG_X86_MPPARSE) += mpparse_32.o +-obj-$(CONFIG_X86_LOCAL_APIC) += apic_32.o nmi_32.o +-obj-$(CONFIG_X86_IO_APIC) += io_apic_32.o +-obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o +-obj-$(CONFIG_KEXEC) += machine_kexec_32.o relocate_kernel_32.o crash.o +-obj-$(CONFIG_CRASH_DUMP) += crash_dump_32.o +-obj-$(CONFIG_X86_NUMAQ) += numaq_32.o +-obj-$(CONFIG_X86_SUMMIT_NUMA) += summit_32.o +-obj-$(CONFIG_KPROBES) += kprobes_32.o +-obj-$(CONFIG_MODULES) += module_32.o +-obj-y += sysenter_32.o vsyscall_32.o +-obj-$(CONFIG_ACPI_SRAT) += srat_32.o +-obj-$(CONFIG_EFI) += efi_32.o efi_stub_32.o +-obj-$(CONFIG_DOUBLEFAULT) += doublefault_32.o +-obj-$(CONFIG_VM86) += vm86_32.o +-obj-$(CONFIG_EARLY_PRINTK) += early_printk.o +-obj-$(CONFIG_HPET_TIMER) += hpet.o +-obj-$(CONFIG_K8_NB) += k8.o +-obj-$(CONFIG_MGEODE_LX) += geode_32.o mfgpt_32.o +- +-obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o +-obj-$(CONFIG_PARAVIRT) += paravirt_32.o +-obj-y += pcspeaker.o +- +-obj-$(CONFIG_SCx200) += scx200_32.o +- +-# vsyscall_32.o contains the vsyscall DSO images as __initdata. +-# We must build both images before we can assemble it. +-# Note: kbuild does not track this dependency due to usage of .incbin +-$(obj)/vsyscall_32.o: $(obj)/vsyscall-int80_32.so $(obj)/vsyscall-sysenter_32.so +-targets += $(foreach F,int80 sysenter,vsyscall-$F_32.o vsyscall-$F_32.so) +-targets += vsyscall-note_32.o vsyscall_32.lds +- +-# The DSO images are built using a special linker script. 
+-quiet_cmd_syscall = SYSCALL $@ +- cmd_syscall = $(CC) -m elf_i386 -nostdlib $(SYSCFLAGS_$(@F)) \ +- -Wl,-T,$(filter-out FORCE,$^) -o $@ +- +-export CPPFLAGS_vsyscall_32.lds += -P -C -Ui386 +- +-vsyscall-flags = -shared -s -Wl,-soname=linux-gate.so.1 \ +- $(call ld-option, -Wl$(comma)--hash-style=sysv) +-SYSCFLAGS_vsyscall-sysenter_32.so = $(vsyscall-flags) +-SYSCFLAGS_vsyscall-int80_32.so = $(vsyscall-flags) +- +-$(obj)/vsyscall-int80_32.so $(obj)/vsyscall-sysenter_32.so: \ +-$(obj)/vsyscall-%.so: $(src)/vsyscall_32.lds \ +- $(obj)/vsyscall-%.o $(obj)/vsyscall-note_32.o FORCE +- $(call if_changed,syscall) +- +-# We also create a special relocatable object that should mirror the symbol +-# table and layout of the linked DSO. With ld -R we can then refer to +-# these symbols in the kernel code rather than hand-coded addresses. +-extra-y += vsyscall-syms.o +-$(obj)/built-in.o: $(obj)/vsyscall-syms.o +-$(obj)/built-in.o: ld_flags += -R $(obj)/vsyscall-syms.o +- +-SYSCFLAGS_vsyscall-syms.o = -r +-$(obj)/vsyscall-syms.o: $(src)/vsyscall_32.lds \ +- $(obj)/vsyscall-sysenter_32.o $(obj)/vsyscall-note_32.o FORCE +- $(call if_changed,syscall) +- +- +diff --git a/arch/x86/kernel/Makefile_64 b/arch/x86/kernel/Makefile_64 +deleted file mode 100644 +index 5a88890..0000000 +--- a/arch/x86/kernel/Makefile_64 ++++ /dev/null +@@ -1,45 +0,0 @@ +-# +-# Makefile for the linux kernel. +-# +- +-extra-y := head_64.o head64.o init_task.o vmlinux.lds +-CPPFLAGS_vmlinux.lds += -Ux86_64 +-EXTRA_AFLAGS := -traditional +- +-obj-y := process_64.o signal_64.o entry_64.o traps_64.o irq_64.o \ +- ptrace_64.o time_64.o ioport_64.o ldt_64.o setup_64.o i8259_64.o sys_x86_64.o \ +- x8664_ksyms_64.o i387_64.o syscall_64.o vsyscall_64.o \ +- setup64.o bootflag.o e820_64.o reboot_64.o quirks.o i8237.o \ +- pci-dma_64.o pci-nommu_64.o alternative.o hpet.o tsc_64.o bugs_64.o \ +- i8253.o +- +-obj-$(CONFIG_STACKTRACE) += stacktrace.o +-obj-y += cpu/ +-obj-y += acpi/ +-obj-$(CONFIG_X86_MSR) += msr.o +-obj-$(CONFIG_MICROCODE) += microcode.o +-obj-$(CONFIG_X86_CPUID) += cpuid.o +-obj-$(CONFIG_SMP) += smp_64.o smpboot_64.o trampoline_64.o tsc_sync.o +-obj-y += apic_64.o nmi_64.o +-obj-y += io_apic_64.o mpparse_64.o genapic_64.o genapic_flat_64.o +-obj-$(CONFIG_KEXEC) += machine_kexec_64.o relocate_kernel_64.o crash.o +-obj-$(CONFIG_CRASH_DUMP) += crash_dump_64.o +-obj-$(CONFIG_PM) += suspend_64.o +-obj-$(CONFIG_HIBERNATION) += suspend_asm_64.o +-obj-$(CONFIG_EARLY_PRINTK) += early_printk.o +-obj-$(CONFIG_GART_IOMMU) += pci-gart_64.o aperture_64.o +-obj-$(CONFIG_CALGARY_IOMMU) += pci-calgary_64.o tce_64.o +-obj-$(CONFIG_SWIOTLB) += pci-swiotlb_64.o +-obj-$(CONFIG_KPROBES) += kprobes_64.o +-obj-$(CONFIG_X86_PM_TIMER) += pmtimer_64.o +-obj-$(CONFIG_X86_VSMP) += vsmp_64.o +-obj-$(CONFIG_K8_NB) += k8.o +-obj-$(CONFIG_AUDIT) += audit_64.o +- +-obj-$(CONFIG_MODULES) += module_64.o +-obj-$(CONFIG_PCI) += early-quirks.o +- +-obj-y += topology.o +-obj-y += pcspeaker.o +- +-CFLAGS_vsyscall_64.o := $(PROFILING) -g0 +diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile +index 1351c39..19d3d6e 100644 +--- a/arch/x86/kernel/acpi/Makefile ++++ b/arch/x86/kernel/acpi/Makefile +@@ -1,5 +1,5 @@ + obj-$(CONFIG_ACPI) += boot.o +-obj-$(CONFIG_ACPI_SLEEP) += sleep_$(BITS).o wakeup_$(BITS).o ++obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o + + ifneq ($(CONFIG_ACPI_PROCESSOR),) + obj-y += cstate.o processor.o +diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c +new file mode 100644 +index 0000000..6bc815c +--- 
/dev/null ++++ b/arch/x86/kernel/acpi/sleep.c +@@ -0,0 +1,87 @@ ++/* ++ * sleep.c - x86-specific ACPI sleep support. ++ * ++ * Copyright (C) 2001-2003 Patrick Mochel ++ * Copyright (C) 2001-2003 Pavel Machek ++ */ ++ ++#include ++#include ++#include ++#include ++ ++#include ++ ++/* address in low memory of the wakeup routine. */ ++unsigned long acpi_wakeup_address = 0; ++unsigned long acpi_realmode_flags; ++extern char wakeup_start, wakeup_end; ++ ++extern unsigned long acpi_copy_wakeup_routine(unsigned long); ++ ++/** ++ * acpi_save_state_mem - save kernel state ++ * ++ * Create an identity mapped page table and copy the wakeup routine to ++ * low memory. ++ */ ++int acpi_save_state_mem(void) ++{ ++ if (!acpi_wakeup_address) { ++ printk(KERN_ERR "Could not allocate memory during boot, S3 disabled\n"); ++ return -ENOMEM; ++ } ++ memcpy((void *)acpi_wakeup_address, &wakeup_start, ++ &wakeup_end - &wakeup_start); ++ acpi_copy_wakeup_routine(acpi_wakeup_address); ++ ++ return 0; ++} ++ ++/* ++ * acpi_restore_state - undo effects of acpi_save_state_mem ++ */ ++void acpi_restore_state_mem(void) ++{ ++} ++ ++ ++/** ++ * acpi_reserve_bootmem - do _very_ early ACPI initialisation ++ * ++ * We allocate a page from the first 1MB of memory for the wakeup ++ * routine for when we come back from a sleep state. The ++ * runtime allocator allows specification of <16MB pages, but not ++ * <1MB pages. ++ */ ++void __init acpi_reserve_bootmem(void) ++{ ++ if ((&wakeup_end - &wakeup_start) > PAGE_SIZE*2) { ++ printk(KERN_ERR ++ "ACPI: Wakeup code way too big, S3 disabled.\n"); ++ return; ++ } ++ ++ acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE*2); ++ if (!acpi_wakeup_address) ++ printk(KERN_ERR "ACPI: Cannot allocate lowmem, S3 disabled.\n"); ++} ++ ++ ++static int __init acpi_sleep_setup(char *str) ++{ ++ while ((str != NULL) && (*str != '\0')) { ++ if (strncmp(str, "s3_bios", 7) == 0) ++ acpi_realmode_flags |= 1; ++ if (strncmp(str, "s3_mode", 7) == 0) ++ acpi_realmode_flags |= 2; ++ if (strncmp(str, "s3_beep", 7) == 0) ++ acpi_realmode_flags |= 4; ++ str = strchr(str, ','); ++ if (str != NULL) ++ str += strspn(str, ", \t"); ++ } ++ return 1; ++} ++ ++__setup("acpi_sleep=", acpi_sleep_setup); +diff --git a/arch/x86/kernel/acpi/sleep_32.c b/arch/x86/kernel/acpi/sleep_32.c +index 1069948..63fe552 100644 +--- a/arch/x86/kernel/acpi/sleep_32.c ++++ b/arch/x86/kernel/acpi/sleep_32.c +@@ -12,76 +12,6 @@ + + #include + +-/* address in low memory of the wakeup routine. */ +-unsigned long acpi_wakeup_address = 0; +-unsigned long acpi_realmode_flags; +-extern char wakeup_start, wakeup_end; +- +-extern unsigned long FASTCALL(acpi_copy_wakeup_routine(unsigned long)); +- +-/** +- * acpi_save_state_mem - save kernel state +- * +- * Create an identity mapped page table and copy the wakeup routine to +- * low memory. +- */ +-int acpi_save_state_mem(void) +-{ +- if (!acpi_wakeup_address) +- return 1; +- memcpy((void *)acpi_wakeup_address, &wakeup_start, +- &wakeup_end - &wakeup_start); +- acpi_copy_wakeup_routine(acpi_wakeup_address); +- +- return 0; +-} +- +-/* +- * acpi_restore_state - undo effects of acpi_save_state_mem +- */ +-void acpi_restore_state_mem(void) +-{ +-} +- +-/** +- * acpi_reserve_bootmem - do _very_ early ACPI initialisation +- * +- * We allocate a page from the first 1MB of memory for the wakeup +- * routine for when we come back from a sleep state. The +- * runtime allocator allows specification of <16MB pages, but not +- * <1MB pages. 
+- */ +-void __init acpi_reserve_bootmem(void) +-{ +- if ((&wakeup_end - &wakeup_start) > PAGE_SIZE) { +- printk(KERN_ERR +- "ACPI: Wakeup code way too big, S3 disabled.\n"); +- return; +- } +- +- acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE); +- if (!acpi_wakeup_address) +- printk(KERN_ERR "ACPI: Cannot allocate lowmem, S3 disabled.\n"); +-} +- +-static int __init acpi_sleep_setup(char *str) +-{ +- while ((str != NULL) && (*str != '\0')) { +- if (strncmp(str, "s3_bios", 7) == 0) +- acpi_realmode_flags |= 1; +- if (strncmp(str, "s3_mode", 7) == 0) +- acpi_realmode_flags |= 2; +- if (strncmp(str, "s3_beep", 7) == 0) +- acpi_realmode_flags |= 4; +- str = strchr(str, ','); +- if (str != NULL) +- str += strspn(str, ", \t"); +- } +- return 1; +-} +- +-__setup("acpi_sleep=", acpi_sleep_setup); +- + /* Ouch, we want to delete this. We already have better version in userspace, in + s2ram from suspend.sf.net project */ + static __init int reset_videomode_after_s3(const struct dmi_system_id *d) +diff --git a/arch/x86/kernel/acpi/sleep_64.c b/arch/x86/kernel/acpi/sleep_64.c +deleted file mode 100644 +index da42de2..0000000 +--- a/arch/x86/kernel/acpi/sleep_64.c ++++ /dev/null +@@ -1,117 +0,0 @@ +-/* +- * acpi.c - Architecture-Specific Low-Level ACPI Support +- * +- * Copyright (C) 2001, 2002 Paul Diefenbaugh +- * Copyright (C) 2001 Jun Nakajima +- * Copyright (C) 2001 Patrick Mochel +- * Copyright (C) 2002 Andi Kleen, SuSE Labs (x86-64 port) +- * Copyright (C) 2003 Pavel Machek, SuSE Labs +- * +- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +- * +- * This program is free software; you can redistribute it and/or modify +- * it under the terms of the GNU General Public License as published by +- * the Free Software Foundation; either version 2 of the License, or +- * (at your option) any later version. +- * +- * This program is distributed in the hope that it will be useful, +- * but WITHOUT ANY WARRANTY; without even the implied warranty of +- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +- * GNU General Public License for more details. +- * +- * You should have received a copy of the GNU General Public License +- * along with this program; if not, write to the Free Software +- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +- * +- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-/* -------------------------------------------------------------------------- +- Low-Level Sleep Support +- -------------------------------------------------------------------------- */ +- +-/* address in low memory of the wakeup routine. */ +-unsigned long acpi_wakeup_address = 0; +-unsigned long acpi_realmode_flags; +-extern char wakeup_start, wakeup_end; +- +-extern unsigned long acpi_copy_wakeup_routine(unsigned long); +- +-/** +- * acpi_save_state_mem - save kernel state +- * +- * Create an identity mapped page table and copy the wakeup routine to +- * low memory. 
+- */ +-int acpi_save_state_mem(void) +-{ +- memcpy((void *)acpi_wakeup_address, &wakeup_start, +- &wakeup_end - &wakeup_start); +- acpi_copy_wakeup_routine(acpi_wakeup_address); +- +- return 0; +-} +- +-/* +- * acpi_restore_state +- */ +-void acpi_restore_state_mem(void) +-{ +-} +- +-/** +- * acpi_reserve_bootmem - do _very_ early ACPI initialisation +- * +- * We allocate a page in low memory for the wakeup +- * routine for when we come back from a sleep state. The +- * runtime allocator allows specification of <16M pages, but not +- * <1M pages. +- */ +-void __init acpi_reserve_bootmem(void) +-{ +- acpi_wakeup_address = (unsigned long)alloc_bootmem_low(PAGE_SIZE*2); +- if ((&wakeup_end - &wakeup_start) > (PAGE_SIZE*2)) +- printk(KERN_CRIT +- "ACPI: Wakeup code way too big, will crash on attempt" +- " to suspend\n"); +-} +- +-static int __init acpi_sleep_setup(char *str) +-{ +- while ((str != NULL) && (*str != '\0')) { +- if (strncmp(str, "s3_bios", 7) == 0) +- acpi_realmode_flags |= 1; +- if (strncmp(str, "s3_mode", 7) == 0) +- acpi_realmode_flags |= 2; +- if (strncmp(str, "s3_beep", 7) == 0) +- acpi_realmode_flags |= 4; +- str = strchr(str, ','); +- if (str != NULL) +- str += strspn(str, ", \t"); +- } +- return 1; +-} +- +-__setup("acpi_sleep=", acpi_sleep_setup); +- +diff --git a/arch/x86/kernel/acpi/wakeup_32.S b/arch/x86/kernel/acpi/wakeup_32.S +index 1e931aa..f53e327 100644 +--- a/arch/x86/kernel/acpi/wakeup_32.S ++++ b/arch/x86/kernel/acpi/wakeup_32.S +@@ -1,4 +1,4 @@ +-.text ++ .section .text.page_aligned + #include + #include + #include +diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S +index 5ed3bc5..2e1b9e0 100644 +--- a/arch/x86/kernel/acpi/wakeup_64.S ++++ b/arch/x86/kernel/acpi/wakeup_64.S +@@ -344,13 +344,13 @@ do_suspend_lowlevel: + call save_processor_state + + movq $saved_context, %rax +- movq %rsp, pt_regs_rsp(%rax) +- movq %rbp, pt_regs_rbp(%rax) +- movq %rsi, pt_regs_rsi(%rax) +- movq %rdi, pt_regs_rdi(%rax) +- movq %rbx, pt_regs_rbx(%rax) +- movq %rcx, pt_regs_rcx(%rax) +- movq %rdx, pt_regs_rdx(%rax) ++ movq %rsp, pt_regs_sp(%rax) ++ movq %rbp, pt_regs_bp(%rax) ++ movq %rsi, pt_regs_si(%rax) ++ movq %rdi, pt_regs_di(%rax) ++ movq %rbx, pt_regs_bx(%rax) ++ movq %rcx, pt_regs_cx(%rax) ++ movq %rdx, pt_regs_dx(%rax) + movq %r8, pt_regs_r8(%rax) + movq %r9, pt_regs_r9(%rax) + movq %r10, pt_regs_r10(%rax) +@@ -360,7 +360,7 @@ do_suspend_lowlevel: + movq %r14, pt_regs_r14(%rax) + movq %r15, pt_regs_r15(%rax) + pushfq +- popq pt_regs_eflags(%rax) ++ popq pt_regs_flags(%rax) + + movq $.L97, saved_rip(%rip) + +@@ -391,15 +391,15 @@ do_suspend_lowlevel: + movq %rbx, %cr2 + movq saved_context_cr0(%rax), %rbx + movq %rbx, %cr0 +- pushq pt_regs_eflags(%rax) ++ pushq pt_regs_flags(%rax) + popfq +- movq pt_regs_rsp(%rax), %rsp +- movq pt_regs_rbp(%rax), %rbp +- movq pt_regs_rsi(%rax), %rsi +- movq pt_regs_rdi(%rax), %rdi +- movq pt_regs_rbx(%rax), %rbx +- movq pt_regs_rcx(%rax), %rcx +- movq pt_regs_rdx(%rax), %rdx ++ movq pt_regs_sp(%rax), %rsp ++ movq pt_regs_bp(%rax), %rbp ++ movq pt_regs_si(%rax), %rsi ++ movq pt_regs_di(%rax), %rdi ++ movq pt_regs_bx(%rax), %rbx ++ movq pt_regs_cx(%rax), %rcx ++ movq pt_regs_dx(%rax), %rdx + movq pt_regs_r8(%rax), %r8 + movq pt_regs_r9(%rax), %r9 + movq pt_regs_r10(%rax), %r10 +diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c +index d6405e0..45d79ea 100644 +--- a/arch/x86/kernel/alternative.c ++++ b/arch/x86/kernel/alternative.c +@@ -273,6 +273,7 @@ struct smp_alt_module { + }; + 
static LIST_HEAD(smp_alt_modules); + static DEFINE_SPINLOCK(smp_alt); ++static int smp_mode = 1; /* protected by smp_alt */ + + void alternatives_smp_module_add(struct module *mod, char *name, + void *locks, void *locks_end, +@@ -341,12 +342,13 @@ void alternatives_smp_switch(int smp) + + #ifdef CONFIG_LOCKDEP + /* +- * A not yet fixed binutils section handling bug prevents +- * alternatives-replacement from working reliably, so turn +- * it off: ++ * Older binutils section handling bug prevented ++ * alternatives-replacement from working reliably. ++ * ++ * If this still occurs then you should see a hang ++ * or crash shortly after this line: + */ +- printk("lockdep: not fixing up alternatives.\n"); +- return; ++ printk("lockdep: fixing up alternatives.\n"); + #endif + + if (noreplace_smp || smp_alt_once) +@@ -354,21 +356,29 @@ void alternatives_smp_switch(int smp) + BUG_ON(!smp && (num_online_cpus() > 1)); + + spin_lock_irqsave(&smp_alt, flags); +- if (smp) { ++ ++ /* ++ * Avoid unnecessary switches because it forces JIT based VMs to ++ * throw away all cached translations, which can be quite costly. ++ */ ++ if (smp == smp_mode) { ++ /* nothing */ ++ } else if (smp) { + printk(KERN_INFO "SMP alternatives: switching to SMP code\n"); +- clear_bit(X86_FEATURE_UP, boot_cpu_data.x86_capability); +- clear_bit(X86_FEATURE_UP, cpu_data(0).x86_capability); ++ clear_cpu_cap(&boot_cpu_data, X86_FEATURE_UP); ++ clear_cpu_cap(&cpu_data(0), X86_FEATURE_UP); + list_for_each_entry(mod, &smp_alt_modules, next) + alternatives_smp_lock(mod->locks, mod->locks_end, + mod->text, mod->text_end); + } else { + printk(KERN_INFO "SMP alternatives: switching to UP code\n"); +- set_bit(X86_FEATURE_UP, boot_cpu_data.x86_capability); +- set_bit(X86_FEATURE_UP, cpu_data(0).x86_capability); ++ set_cpu_cap(&boot_cpu_data, X86_FEATURE_UP); ++ set_cpu_cap(&cpu_data(0), X86_FEATURE_UP); + list_for_each_entry(mod, &smp_alt_modules, next) + alternatives_smp_unlock(mod->locks, mod->locks_end, + mod->text, mod->text_end); + } ++ smp_mode = smp; + spin_unlock_irqrestore(&smp_alt, flags); + } + +@@ -431,8 +441,9 @@ void __init alternative_instructions(void) + if (smp_alt_once) { + if (1 == num_possible_cpus()) { + printk(KERN_INFO "SMP alternatives: switching to UP code\n"); +- set_bit(X86_FEATURE_UP, boot_cpu_data.x86_capability); +- set_bit(X86_FEATURE_UP, cpu_data(0).x86_capability); ++ set_cpu_cap(&boot_cpu_data, X86_FEATURE_UP); ++ set_cpu_cap(&cpu_data(0), X86_FEATURE_UP); ++ + alternatives_smp_unlock(__smp_locks, __smp_locks_end, + _text, _etext); + } +@@ -440,7 +451,10 @@ void __init alternative_instructions(void) + alternatives_smp_module_add(NULL, "core kernel", + __smp_locks, __smp_locks_end, + _text, _etext); +- alternatives_smp_switch(0); ++ ++ /* Only switch to UP mode if we don't immediately boot others */ ++ if (num_possible_cpus() == 1 || setup_max_cpus <= 1) ++ alternatives_smp_switch(0); + } + #endif + apply_paravirt(__parainstructions, __parainstructions_end); +diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c +index 5b69927..608152a 100644 +--- a/arch/x86/kernel/aperture_64.c ++++ b/arch/x86/kernel/aperture_64.c +@@ -1,12 +1,12 @@ +-/* ++/* + * Firmware replacement code. +- * ++ * + * Work around broken BIOSes that don't set an aperture or only set the +- * aperture in the AGP bridge. +- * If all fails map the aperture over some low memory. This is cheaper than +- * doing bounce buffering. The memory is lost. 
This is done at early boot +- * because only the bootmem allocator can allocate 32+MB. +- * ++ * aperture in the AGP bridge. ++ * If all fails map the aperture over some low memory. This is cheaper than ++ * doing bounce buffering. The memory is lost. This is done at early boot ++ * because only the bootmem allocator can allocate 32+MB. ++ * + * Copyright 2002 Andi Kleen, SuSE Labs. + */ + #include +@@ -30,7 +30,7 @@ int gart_iommu_aperture_disabled __initdata = 0; + int gart_iommu_aperture_allowed __initdata = 0; + + int fallback_aper_order __initdata = 1; /* 64MB */ +-int fallback_aper_force __initdata = 0; ++int fallback_aper_force __initdata = 0; + + int fix_aperture __initdata = 1; + +@@ -49,167 +49,270 @@ static void __init insert_aperture_resource(u32 aper_base, u32 aper_size) + /* This code runs before the PCI subsystem is initialized, so just + access the northbridge directly. */ + +-static u32 __init allocate_aperture(void) ++static u32 __init allocate_aperture(void) + { + u32 aper_size; +- void *p; ++ void *p; + +- if (fallback_aper_order > 7) +- fallback_aper_order = 7; +- aper_size = (32 * 1024 * 1024) << fallback_aper_order; ++ if (fallback_aper_order > 7) ++ fallback_aper_order = 7; ++ aper_size = (32 * 1024 * 1024) << fallback_aper_order; + +- /* +- * Aperture has to be naturally aligned. This means an 2GB aperture won't +- * have much chance of finding a place in the lower 4GB of memory. +- * Unfortunately we cannot move it up because that would make the +- * IOMMU useless. ++ /* ++ * Aperture has to be naturally aligned. This means a 2GB aperture ++ * won't have much chance of finding a place in the lower 4GB of ++ * memory. Unfortunately we cannot move it up because that would ++ * make the IOMMU useless. + */ + p = __alloc_bootmem_nopanic(aper_size, aper_size, 0); + if (!p || __pa(p)+aper_size > 0xffffffff) { +- printk("Cannot allocate aperture memory hole (%p,%uK)\n", +- p, aper_size>>10); ++ printk(KERN_ERR ++ "Cannot allocate aperture memory hole (%p,%uK)\n", ++ p, aper_size>>10); + if (p) + free_bootmem(__pa(p), aper_size); + return 0; + } +- printk("Mapping aperture over %d KB of RAM @ %lx\n", +- aper_size >> 10, __pa(p)); ++ printk(KERN_INFO "Mapping aperture over %d KB of RAM @ %lx\n", ++ aper_size >> 10, __pa(p)); + insert_aperture_resource((u32)__pa(p), aper_size); +- return (u32)__pa(p); ++ ++ return (u32)__pa(p); + } + + static int __init aperture_valid(u64 aper_base, u32 aper_size) +-{ +- if (!aper_base) +- return 0; +- if (aper_size < 64*1024*1024) { +- printk("Aperture too small (%d MB)\n", aper_size>>20); ++{ ++ if (!aper_base) + return 0; +- } ++ + if (aper_base + aper_size > 0x100000000UL) { +- printk("Aperture beyond 4GB. Ignoring.\n"); +- return 0; ++ printk(KERN_ERR "Aperture beyond 4GB. Ignoring.\n"); ++ return 0; + } + if (e820_any_mapped(aper_base, aper_base + aper_size, E820_RAM)) { +- printk("Aperture pointing to e820 RAM. Ignoring.\n"); +- return 0; +- } ++ printk(KERN_ERR "Aperture pointing to e820 RAM. 
Ignoring.\n"); ++ return 0; ++ } ++ if (aper_size < 64*1024*1024) { ++ printk(KERN_ERR "Aperture too small (%d MB)\n", aper_size>>20); ++ return 0; ++ } ++ + return 1; +-} ++} + + /* Find a PCI capability */ +-static __u32 __init find_cap(int num, int slot, int func, int cap) +-{ +- u8 pos; ++static __u32 __init find_cap(int num, int slot, int func, int cap) ++{ + int bytes; +- if (!(read_pci_config_16(num,slot,func,PCI_STATUS) & PCI_STATUS_CAP_LIST)) ++ u8 pos; ++ ++ if (!(read_pci_config_16(num, slot, func, PCI_STATUS) & ++ PCI_STATUS_CAP_LIST)) + return 0; +- pos = read_pci_config_byte(num,slot,func,PCI_CAPABILITY_LIST); +- for (bytes = 0; bytes < 48 && pos >= 0x40; bytes++) { ++ ++ pos = read_pci_config_byte(num, slot, func, PCI_CAPABILITY_LIST); ++ for (bytes = 0; bytes < 48 && pos >= 0x40; bytes++) { + u8 id; +- pos &= ~3; +- id = read_pci_config_byte(num,slot,func,pos+PCI_CAP_LIST_ID); ++ ++ pos &= ~3; ++ id = read_pci_config_byte(num, slot, func, pos+PCI_CAP_LIST_ID); + if (id == 0xff) + break; +- if (id == cap) +- return pos; +- pos = read_pci_config_byte(num,slot,func,pos+PCI_CAP_LIST_NEXT); +- } ++ if (id == cap) ++ return pos; ++ pos = read_pci_config_byte(num, slot, func, ++ pos+PCI_CAP_LIST_NEXT); ++ } + return 0; +-} ++} + + /* Read a standard AGPv3 bridge header */ + static __u32 __init read_agp(int num, int slot, int func, int cap, u32 *order) +-{ ++{ + u32 apsize; + u32 apsizereg; + int nbits; + u32 aper_low, aper_hi; + u64 aper; + +- printk("AGP bridge at %02x:%02x:%02x\n", num, slot, func); +- apsizereg = read_pci_config_16(num,slot,func, cap + 0x14); ++ printk(KERN_INFO "AGP bridge at %02x:%02x:%02x\n", num, slot, func); ++ apsizereg = read_pci_config_16(num, slot, func, cap + 0x14); + if (apsizereg == 0xffffffff) { +- printk("APSIZE in AGP bridge unreadable\n"); ++ printk(KERN_ERR "APSIZE in AGP bridge unreadable\n"); + return 0; + } + + apsize = apsizereg & 0xfff; + /* Some BIOS use weird encodings not in the AGPv3 table. */ +- if (apsize & 0xff) +- apsize |= 0xf00; ++ if (apsize & 0xff) ++ apsize |= 0xf00; + nbits = hweight16(apsize); + *order = 7 - nbits; + if ((int)*order < 0) /* < 32MB */ + *order = 0; +- +- aper_low = read_pci_config(num,slot,func, 0x10); +- aper_hi = read_pci_config(num,slot,func,0x14); ++ ++ aper_low = read_pci_config(num, slot, func, 0x10); ++ aper_hi = read_pci_config(num, slot, func, 0x14); + aper = (aper_low & ~((1<<22)-1)) | ((u64)aper_hi << 32); + +- printk("Aperture from AGP @ %Lx size %u MB (APSIZE %x)\n", +- aper, 32 << *order, apsizereg); ++ printk(KERN_INFO "Aperture from AGP @ %Lx size %u MB (APSIZE %x)\n", ++ aper, 32 << *order, apsizereg); + + if (!aperture_valid(aper, (32*1024*1024) << *order)) +- return 0; +- return (u32)aper; +-} +- +-/* Look for an AGP bridge. Windows only expects the aperture in the +- AGP bridge and some BIOS forget to initialize the Northbridge too. +- Work around this here. +- +- Do an PCI bus scan by hand because we're running before the PCI +- subsystem. ++ return 0; ++ return (u32)aper; ++} + +- All K8 AGP bridges are AGPv3 compliant, so we can do this scan +- generically. It's probably overkill to always scan all slots because +- the AGP bridges should be always an own bus on the HT hierarchy, +- but do it here for future safety. */ ++/* ++ * Look for an AGP bridge. Windows only expects the aperture in the ++ * AGP bridge and some BIOS forget to initialize the Northbridge too. ++ * Work around this here. ++ * ++ * Do an PCI bus scan by hand because we're running before the PCI ++ * subsystem. 
++ * ++ * All K8 AGP bridges are AGPv3 compliant, so we can do this scan ++ * generically. It's probably overkill to always scan all slots because ++ * the AGP bridges should be always an own bus on the HT hierarchy, ++ * but do it here for future safety. ++ */ + static __u32 __init search_agp_bridge(u32 *order, int *valid_agp) + { + int num, slot, func; + + /* Poor man's PCI discovery */ +- for (num = 0; num < 256; num++) { +- for (slot = 0; slot < 32; slot++) { +- for (func = 0; func < 8; func++) { ++ for (num = 0; num < 256; num++) { ++ for (slot = 0; slot < 32; slot++) { ++ for (func = 0; func < 8; func++) { + u32 class, cap; + u8 type; +- class = read_pci_config(num,slot,func, ++ class = read_pci_config(num, slot, func, + PCI_CLASS_REVISION); + if (class == 0xffffffff) +- break; +- +- switch (class >> 16) { ++ break; ++ ++ switch (class >> 16) { + case PCI_CLASS_BRIDGE_HOST: + case PCI_CLASS_BRIDGE_OTHER: /* needed? */ + /* AGP bridge? */ +- cap = find_cap(num,slot,func,PCI_CAP_ID_AGP); ++ cap = find_cap(num, slot, func, ++ PCI_CAP_ID_AGP); + if (!cap) + break; +- *valid_agp = 1; +- return read_agp(num,slot,func,cap,order); +- } +- ++ *valid_agp = 1; ++ return read_agp(num, slot, func, cap, ++ order); ++ } ++ + /* No multi-function device? */ +- type = read_pci_config_byte(num,slot,func, ++ type = read_pci_config_byte(num, slot, func, + PCI_HEADER_TYPE); + if (!(type & 0x80)) + break; +- } +- } ++ } ++ } + } +- printk("No AGP bridge found\n"); ++ printk(KERN_INFO "No AGP bridge found\n"); ++ + return 0; + } + ++static int gart_fix_e820 __initdata = 1; ++ ++static int __init parse_gart_mem(char *p) ++{ ++ if (!p) ++ return -EINVAL; ++ ++ if (!strncmp(p, "off", 3)) ++ gart_fix_e820 = 0; ++ else if (!strncmp(p, "on", 2)) ++ gart_fix_e820 = 1; ++ ++ return 0; ++} ++early_param("gart_fix_e820", parse_gart_mem); ++ ++void __init early_gart_iommu_check(void) ++{ ++ /* ++ * in case it is enabled before, esp for kexec/kdump, ++ * previous kernel already enable that. memset called ++ * by allocate_aperture/__alloc_bootmem_nopanic cause restart. ++ * or second kernel have different position for GART hole. and new ++ * kernel could use hole as RAM that is still used by GART set by ++ * first kernel ++ * or BIOS forget to put that in reserved. ++ * try to update e820 to make that region as reserved. 
++ */ ++ int fix, num; ++ u32 ctl; ++ u32 aper_size = 0, aper_order = 0, last_aper_order = 0; ++ u64 aper_base = 0, last_aper_base = 0; ++ int aper_enabled = 0, last_aper_enabled = 0; ++ ++ if (!early_pci_allowed()) ++ return; ++ ++ fix = 0; ++ for (num = 24; num < 32; num++) { ++ if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00))) ++ continue; ++ ++ ctl = read_pci_config(0, num, 3, 0x90); ++ aper_enabled = ctl & 1; ++ aper_order = (ctl >> 1) & 7; ++ aper_size = (32 * 1024 * 1024) << aper_order; ++ aper_base = read_pci_config(0, num, 3, 0x94) & 0x7fff; ++ aper_base <<= 25; ++ ++ if ((last_aper_order && aper_order != last_aper_order) || ++ (last_aper_base && aper_base != last_aper_base) || ++ (last_aper_enabled && aper_enabled != last_aper_enabled)) { ++ fix = 1; ++ break; ++ } ++ last_aper_order = aper_order; ++ last_aper_base = aper_base; ++ last_aper_enabled = aper_enabled; ++ } ++ ++ if (!fix && !aper_enabled) ++ return; ++ ++ if (!aper_base || !aper_size || aper_base + aper_size > 0x100000000UL) ++ fix = 1; ++ ++ if (gart_fix_e820 && !fix && aper_enabled) { ++ if (e820_any_mapped(aper_base, aper_base + aper_size, ++ E820_RAM)) { ++ /* reserved it, so we can resuse it in second kernel */ ++ printk(KERN_INFO "update e820 for GART\n"); ++ add_memory_region(aper_base, aper_size, E820_RESERVED); ++ update_e820(); ++ } ++ return; ++ } ++ ++ /* different nodes have different setting, disable them all at first*/ ++ for (num = 24; num < 32; num++) { ++ if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00))) ++ continue; ++ ++ ctl = read_pci_config(0, num, 3, 0x90); ++ ctl &= ~1; ++ write_pci_config(0, num, 3, 0x90, ctl); ++ } ++ ++} ++ + void __init gart_iommu_hole_init(void) +-{ +- int fix, num; ++{ + u32 aper_size, aper_alloc = 0, aper_order = 0, last_aper_order = 0; + u64 aper_base, last_aper_base = 0; +- int valid_agp = 0; ++ int fix, num, valid_agp = 0; ++ int node; + + if (gart_iommu_aperture_disabled || !fix_aperture || + !early_pci_allowed()) +@@ -218,24 +321,26 @@ void __init gart_iommu_hole_init(void) + printk(KERN_INFO "Checking aperture...\n"); + + fix = 0; +- for (num = 24; num < 32; num++) { ++ node = 0; ++ for (num = 24; num < 32; num++) { + if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00))) + continue; + + iommu_detected = 1; + gart_iommu_aperture = 1; + +- aper_order = (read_pci_config(0, num, 3, 0x90) >> 1) & 7; +- aper_size = (32 * 1024 * 1024) << aper_order; ++ aper_order = (read_pci_config(0, num, 3, 0x90) >> 1) & 7; ++ aper_size = (32 * 1024 * 1024) << aper_order; + aper_base = read_pci_config(0, num, 3, 0x94) & 0x7fff; +- aper_base <<= 25; ++ aper_base <<= 25; ++ ++ printk(KERN_INFO "Node %d: aperture @ %Lx size %u MB\n", ++ node, aper_base, aper_size >> 20); ++ node++; + +- printk("CPU %d: aperture @ %Lx size %u MB\n", num-24, +- aper_base, aper_size>>20); +- + if (!aperture_valid(aper_base, aper_size)) { +- fix = 1; +- break; ++ fix = 1; ++ break; + } + + if ((last_aper_order && aper_order != last_aper_order) || +@@ -245,55 +350,64 @@ void __init gart_iommu_hole_init(void) + } + last_aper_order = aper_order; + last_aper_base = aper_base; +- } ++ } + + if (!fix && !fallback_aper_force) { + if (last_aper_base) { + unsigned long n = (32 * 1024 * 1024) << last_aper_order; ++ + insert_aperture_resource((u32)last_aper_base, n); + } +- return; ++ return; + } + + if (!fallback_aper_force) +- aper_alloc = search_agp_bridge(&aper_order, &valid_agp); +- +- if (aper_alloc) { ++ aper_alloc = search_agp_bridge(&aper_order, &valid_agp); ++ ++ if (aper_alloc) { + /* Got the 
aperture from the AGP bridge */ + } else if (swiotlb && !valid_agp) { + /* Do nothing */ + } else if ((!no_iommu && end_pfn > MAX_DMA32_PFN) || + force_iommu || + valid_agp || +- fallback_aper_force) { +- printk("Your BIOS doesn't leave a aperture memory hole\n"); +- printk("Please enable the IOMMU option in the BIOS setup\n"); +- printk("This costs you %d MB of RAM\n", +- 32 << fallback_aper_order); ++ fallback_aper_force) { ++ printk(KERN_ERR ++ "Your BIOS doesn't leave a aperture memory hole\n"); ++ printk(KERN_ERR ++ "Please enable the IOMMU option in the BIOS setup\n"); ++ printk(KERN_ERR ++ "This costs you %d MB of RAM\n", ++ 32 << fallback_aper_order); + + aper_order = fallback_aper_order; + aper_alloc = allocate_aperture(); +- if (!aper_alloc) { +- /* Could disable AGP and IOMMU here, but it's probably +- not worth it. But the later users cannot deal with +- bad apertures and turning on the aperture over memory +- causes very strange problems, so it's better to +- panic early. */ ++ if (!aper_alloc) { ++ /* ++ * Could disable AGP and IOMMU here, but it's ++ * probably not worth it. But the later users ++ * cannot deal with bad apertures and turning ++ * on the aperture over memory causes very ++ * strange problems, so it's better to panic ++ * early. ++ */ + panic("Not enough memory for aperture"); + } +- } else { +- return; +- } ++ } else { ++ return; ++ } + + /* Fix up the north bridges */ +- for (num = 24; num < 32; num++) { ++ for (num = 24; num < 32; num++) { + if (!early_is_k8_nb(read_pci_config(0, num, 3, 0x00))) +- continue; +- +- /* Don't enable translation yet. That is done later. +- Assume this BIOS didn't initialise the GART so +- just overwrite all previous bits */ +- write_pci_config(0, num, 3, 0x90, aper_order<<1); +- write_pci_config(0, num, 3, 0x94, aper_alloc>>25); +- } +-} ++ continue; ++ ++ /* ++ * Don't enable translation yet. That is done later. 
++ * Assume this BIOS didn't initialise the GART so ++ * just overwrite all previous bits ++ */ ++ write_pci_config(0, num, 3, 0x90, aper_order<<1); ++ write_pci_config(0, num, 3, 0x94, aper_alloc>>25); ++ } ++} diff --git a/arch/x86/kernel/apic_32.c b/arch/x86/kernel/apic_32.c -index edb5108..a56c782 100644 +index edb5108..35a568e 100644 --- a/arch/x86/kernel/apic_32.c +++ b/arch/x86/kernel/apic_32.c -@@ -1530,7 +1530,7 @@ static int lapic_resume(struct sys_device *dev) +@@ -43,12 +43,10 @@ + #include + #include + +-#include "io_ports.h" +- + /* + * Sanity check + */ +-#if (SPURIOUS_APIC_VECTOR & 0x0F) != 0x0F ++#if ((SPURIOUS_APIC_VECTOR & 0x0F) != 0x0F) + # error SPURIOUS_APIC_VECTOR definition error + #endif + +@@ -57,7 +55,7 @@ + * + * -1=force-disable, +1=force-enable + */ +-static int enable_local_apic __initdata = 0; ++static int enable_local_apic __initdata; + + /* Local APIC timer verification ok */ + static int local_apic_timer_verify_ok; +@@ -101,6 +99,8 @@ static DEFINE_PER_CPU(struct clock_event_device, lapic_events); + /* Local APIC was disabled by the BIOS and enabled by the kernel */ + static int enabled_via_apicbase; + ++static unsigned long apic_phys; ++ + /* + * Get the LAPIC version + */ +@@ -110,7 +110,7 @@ static inline int lapic_get_version(void) + } + + /* +- * Check, if the APIC is integrated or a seperate chip ++ * Check, if the APIC is integrated or a separate chip + */ + static inline int lapic_is_integrated(void) + { +@@ -135,9 +135,9 @@ void apic_wait_icr_idle(void) + cpu_relax(); + } + +-unsigned long safe_apic_wait_icr_idle(void) ++u32 safe_apic_wait_icr_idle(void) + { +- unsigned long send_status; ++ u32 send_status; + int timeout; + + timeout = 0; +@@ -154,7 +154,7 @@ unsigned long safe_apic_wait_icr_idle(void) + /** + * enable_NMI_through_LVT0 - enable NMI through local vector table 0 + */ +-void enable_NMI_through_LVT0 (void * dummy) ++void __cpuinit enable_NMI_through_LVT0(void) + { + unsigned int v = APIC_DM_NMI; + +@@ -379,8 +379,10 @@ void __init setup_boot_APIC_clock(void) + */ + if (local_apic_timer_disabled) { + /* No broadcast on UP ! */ +- if (num_possible_cpus() > 1) ++ if (num_possible_cpus() > 1) { ++ lapic_clockevent.mult = 1; + setup_APIC_timer(); ++ } + return; + } + +@@ -434,7 +436,7 @@ void __init setup_boot_APIC_clock(void) + "with PM Timer: %ldms instead of 100ms\n", + (long)res); + /* Correct the lapic counter value */ +- res = (((u64) delta ) * pm_100ms); ++ res = (((u64) delta) * pm_100ms); + do_div(res, deltapm); + printk(KERN_INFO "APIC delta adjusted to PM-Timer: " + "%lu (%ld)\n", (unsigned long) res, delta); +@@ -472,6 +474,19 @@ void __init setup_boot_APIC_clock(void) + + local_apic_timer_verify_ok = 1; + ++ /* ++ * Do a sanity check on the APIC calibration result ++ */ ++ if (calibration_result < (1000000 / HZ)) { ++ local_irq_enable(); ++ printk(KERN_WARNING ++ "APIC frequency too slow, disabling apic timer\n"); ++ /* No broadcast on UP ! */ ++ if (num_possible_cpus() > 1) ++ setup_APIC_timer(); ++ return; ++ } ++ + /* We trust the pm timer based calibration */ + if (!pm_referenced) { + apic_printk(APIC_VERBOSE, "... verify APIC timer\n"); +@@ -563,6 +578,9 @@ static void local_apic_timer_interrupt(void) + return; + } + ++ /* ++ * the NMI deadlock-detector uses this. ++ */ + per_cpu(irq_stat, cpu).apic_timer_irqs++; + + evt->event_handler(evt); +@@ -576,8 +594,7 @@ static void local_apic_timer_interrupt(void) + * [ if a single-CPU system runs an SMP kernel then we call the local + * interrupt as well. 
Thus we cannot inline the local irq ... ] + */ +- +-void fastcall smp_apic_timer_interrupt(struct pt_regs *regs) ++void smp_apic_timer_interrupt(struct pt_regs *regs) + { + struct pt_regs *old_regs = set_irq_regs(regs); + +@@ -616,9 +633,14 @@ int setup_profiling_timer(unsigned int multiplier) + */ + void clear_local_APIC(void) + { +- int maxlvt = lapic_get_maxlvt(); +- unsigned long v; ++ int maxlvt; ++ u32 v; ++ ++ /* APIC hasn't been mapped yet */ ++ if (!apic_phys) ++ return; + ++ maxlvt = lapic_get_maxlvt(); + /* + * Masking an LVT entry can trigger a local APIC error + * if the vector is zero. Mask LVTERR first to prevent this. +@@ -976,7 +998,8 @@ void __cpuinit setup_local_APIC(void) + value |= APIC_LVT_LEVEL_TRIGGER; + apic_write_around(APIC_LVT1, value); + +- if (integrated && !esr_disable) { /* !82489DX */ ++ if (integrated && !esr_disable) { ++ /* !82489DX */ + maxlvt = lapic_get_maxlvt(); + if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */ + apic_write(APIC_ESR, 0); +@@ -1020,7 +1043,7 @@ void __cpuinit setup_local_APIC(void) + /* + * Detect and initialize APIC + */ +-static int __init detect_init_APIC (void) ++static int __init detect_init_APIC(void) + { + u32 h, l, features; + +@@ -1077,7 +1100,7 @@ static int __init detect_init_APIC (void) + printk(KERN_WARNING "Could not enable APIC!\n"); + return -1; + } +- set_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); ++ set_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC); + mp_lapic_addr = APIC_DEFAULT_PHYS_BASE; + + /* The BIOS may have set up the APIC at some other address */ +@@ -1104,8 +1127,6 @@ no_apic: + */ + void __init init_apic_mappings(void) + { +- unsigned long apic_phys; +- + /* + * If no local APIC can be found then set up a fake all + * zeroes page to simulate the local APIC and another +@@ -1164,10 +1185,10 @@ fake_ioapic_page: + * This initializes the IO-APIC and APIC hardware if this is + * a UP kernel. 
+ */ +-int __init APIC_init_uniprocessor (void) ++int __init APIC_init_uniprocessor(void) + { + if (enable_local_apic < 0) +- clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); ++ clear_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC); + + if (!smp_found_config && !cpu_has_apic) + return -1; +@@ -1179,7 +1200,7 @@ int __init APIC_init_uniprocessor (void) + APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) { + printk(KERN_ERR "BIOS bug, local APIC #%d not detected!...\n", + boot_cpu_physical_apicid); +- clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); ++ clear_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC); + return -1; + } + +@@ -1210,50 +1231,6 @@ int __init APIC_init_uniprocessor (void) + } + + /* +- * APIC command line parameters +- */ +-static int __init parse_lapic(char *arg) +-{ +- enable_local_apic = 1; +- return 0; +-} +-early_param("lapic", parse_lapic); +- +-static int __init parse_nolapic(char *arg) +-{ +- enable_local_apic = -1; +- clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); +- return 0; +-} +-early_param("nolapic", parse_nolapic); +- +-static int __init parse_disable_lapic_timer(char *arg) +-{ +- local_apic_timer_disabled = 1; +- return 0; +-} +-early_param("nolapic_timer", parse_disable_lapic_timer); +- +-static int __init parse_lapic_timer_c2_ok(char *arg) +-{ +- local_apic_timer_c2_ok = 1; +- return 0; +-} +-early_param("lapic_timer_c2_ok", parse_lapic_timer_c2_ok); +- +-static int __init apic_set_verbosity(char *str) +-{ +- if (strcmp("debug", str) == 0) +- apic_verbosity = APIC_DEBUG; +- else if (strcmp("verbose", str) == 0) +- apic_verbosity = APIC_VERBOSE; +- return 1; +-} +- +-__setup("apic=", apic_set_verbosity); +- +- +-/* + * Local APIC interrupts + */ + +@@ -1306,7 +1283,7 @@ void smp_error_interrupt(struct pt_regs *regs) + 6: Received illegal vector + 7: Illegal register address + */ +- printk (KERN_DEBUG "APIC error on CPU%d: %02lx(%02lx)\n", ++ printk(KERN_DEBUG "APIC error on CPU%d: %02lx(%02lx)\n", + smp_processor_id(), v , v1); + irq_exit(); + } +@@ -1393,7 +1370,7 @@ void disconnect_bsp_APIC(int virt_wire_setup) + value = apic_read(APIC_LVT0); + value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | + APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | +- APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED ); ++ APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); + value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; + value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT); + apic_write_around(APIC_LVT0, value); +@@ -1530,7 +1507,7 @@ static int lapic_resume(struct sys_device *dev) */ static struct sysdev_class lapic_sysclass = { @@ -135074,11 +145161,927 @@ index edb5108..a56c782 100644 .resume = lapic_resume, .suspend = lapic_suspend, }; +@@ -1565,3 +1542,46 @@ device_initcall(init_lapic_sysfs); + static void apic_pm_activate(void) { } + + #endif /* CONFIG_PM */ ++ ++/* ++ * APIC command line parameters ++ */ ++static int __init parse_lapic(char *arg) ++{ ++ enable_local_apic = 1; ++ return 0; ++} ++early_param("lapic", parse_lapic); ++ ++static int __init parse_nolapic(char *arg) ++{ ++ enable_local_apic = -1; ++ clear_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC); ++ return 0; ++} ++early_param("nolapic", parse_nolapic); ++ ++static int __init parse_disable_lapic_timer(char *arg) ++{ ++ local_apic_timer_disabled = 1; ++ return 0; ++} ++early_param("nolapic_timer", parse_disable_lapic_timer); ++ ++static int __init parse_lapic_timer_c2_ok(char *arg) ++{ ++ local_apic_timer_c2_ok = 1; ++ return 0; ++} ++early_param("lapic_timer_c2_ok", parse_lapic_timer_c2_ok); ++ 
++static int __init apic_set_verbosity(char *str) ++{ ++ if (strcmp("debug", str) == 0) ++ apic_verbosity = APIC_DEBUG; ++ else if (strcmp("verbose", str) == 0) ++ apic_verbosity = APIC_VERBOSE; ++ return 1; ++} ++__setup("apic=", apic_set_verbosity); ++ diff --git a/arch/x86/kernel/apic_64.c b/arch/x86/kernel/apic_64.c -index f28ccb5..fa6cdee 100644 +index f28ccb5..d8d03e0 100644 --- a/arch/x86/kernel/apic_64.c +++ b/arch/x86/kernel/apic_64.c -@@ -639,7 +639,7 @@ static int lapic_resume(struct sys_device *dev) +@@ -23,32 +23,37 @@ + #include + #include + #include +-#include + #include + #include ++#include ++#include + + #include + #include + #include + #include ++#include + #include + #include + #include + #include + #include + #include +-#include + #include + +-int apic_verbosity; + int disable_apic_timer __cpuinitdata; + static int apic_calibrate_pmtmr __initdata; ++int disable_apic; + +-/* Local APIC timer works in C2? */ ++/* Local APIC timer works in C2 */ + int local_apic_timer_c2_ok; + EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok); + +-static struct resource *ioapic_resources; ++/* ++ * Debug level, exported for io_apic.c ++ */ ++int apic_verbosity; ++ + static struct resource lapic_resource = { + .name = "Local APIC", + .flags = IORESOURCE_MEM | IORESOURCE_BUSY, +@@ -60,10 +65,8 @@ static int lapic_next_event(unsigned long delta, + struct clock_event_device *evt); + static void lapic_timer_setup(enum clock_event_mode mode, + struct clock_event_device *evt); +- + static void lapic_timer_broadcast(cpumask_t mask); +- +-static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen); ++static void apic_pm_activate(void); + + static struct clock_event_device lapic_clockevent = { + .name = "lapic", +@@ -78,6 +81,150 @@ static struct clock_event_device lapic_clockevent = { + }; + static DEFINE_PER_CPU(struct clock_event_device, lapic_events); + ++static unsigned long apic_phys; ++ ++/* ++ * Get the LAPIC version ++ */ ++static inline int lapic_get_version(void) ++{ ++ return GET_APIC_VERSION(apic_read(APIC_LVR)); ++} ++ ++/* ++ * Check, if the APIC is integrated or a seperate chip ++ */ ++static inline int lapic_is_integrated(void) ++{ ++ return 1; ++} ++ ++/* ++ * Check, whether this is a modern or a first generation APIC ++ */ ++static int modern_apic(void) ++{ ++ /* AMD systems use old APIC versions, so check the CPU */ ++ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD && ++ boot_cpu_data.x86 >= 0xf) ++ return 1; ++ return lapic_get_version() >= 0x14; ++} ++ ++void apic_wait_icr_idle(void) ++{ ++ while (apic_read(APIC_ICR) & APIC_ICR_BUSY) ++ cpu_relax(); ++} ++ ++u32 safe_apic_wait_icr_idle(void) ++{ ++ u32 send_status; ++ int timeout; ++ ++ timeout = 0; ++ do { ++ send_status = apic_read(APIC_ICR) & APIC_ICR_BUSY; ++ if (!send_status) ++ break; ++ udelay(100); ++ } while (timeout++ < 1000); ++ ++ return send_status; ++} ++ ++/** ++ * enable_NMI_through_LVT0 - enable NMI through local vector table 0 ++ */ ++void __cpuinit enable_NMI_through_LVT0(void) ++{ ++ unsigned int v; ++ ++ /* unmask and set to NMI */ ++ v = APIC_DM_NMI; ++ apic_write(APIC_LVT0, v); ++} ++ ++/** ++ * lapic_get_maxlvt - get the maximum number of local vector table entries ++ */ ++int lapic_get_maxlvt(void) ++{ ++ unsigned int v, maxlvt; ++ ++ v = apic_read(APIC_LVR); ++ maxlvt = GET_APIC_MAXLVT(v); ++ return maxlvt; ++} ++ ++/* ++ * This function sets up the local APIC timer, with a timeout of ++ * 'clocks' APIC bus clock. 
During calibration we actually call ++ * this function twice on the boot CPU, once with a bogus timeout ++ * value, second time for real. The other (noncalibrating) CPUs ++ * call this function only once, with the real, calibrated value. ++ * ++ * We do reads before writes even if unnecessary, to get around the ++ * P5 APIC double write bug. ++ */ ++ ++static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen) ++{ ++ unsigned int lvtt_value, tmp_value; ++ ++ lvtt_value = LOCAL_TIMER_VECTOR; ++ if (!oneshot) ++ lvtt_value |= APIC_LVT_TIMER_PERIODIC; ++ if (!irqen) ++ lvtt_value |= APIC_LVT_MASKED; ++ ++ apic_write(APIC_LVTT, lvtt_value); ++ ++ /* ++ * Divide PICLK by 16 ++ */ ++ tmp_value = apic_read(APIC_TDCR); ++ apic_write(APIC_TDCR, (tmp_value ++ & ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE)) ++ | APIC_TDR_DIV_16); ++ ++ if (!oneshot) ++ apic_write(APIC_TMICT, clocks); ++} ++ ++/* ++ * Setup extended LVT, AMD specific (K8, family 10h) ++ * ++ * Vector mappings are hard coded. On K8 only offset 0 (APIC500) and ++ * MCE interrupts are supported. Thus MCE offset must be set to 0. ++ */ ++ ++#define APIC_EILVT_LVTOFF_MCE 0 ++#define APIC_EILVT_LVTOFF_IBS 1 ++ ++static void setup_APIC_eilvt(u8 lvt_off, u8 vector, u8 msg_type, u8 mask) ++{ ++ unsigned long reg = (lvt_off << 4) + APIC_EILVT0; ++ unsigned int v = (mask << 16) | (msg_type << 8) | vector; ++ ++ apic_write(reg, v); ++} ++ ++u8 setup_APIC_eilvt_mce(u8 vector, u8 msg_type, u8 mask) ++{ ++ setup_APIC_eilvt(APIC_EILVT_LVTOFF_MCE, vector, msg_type, mask); ++ return APIC_EILVT_LVTOFF_MCE; ++} ++ ++u8 setup_APIC_eilvt_ibs(u8 vector, u8 msg_type, u8 mask) ++{ ++ setup_APIC_eilvt(APIC_EILVT_LVTOFF_IBS, vector, msg_type, mask); ++ return APIC_EILVT_LVTOFF_IBS; ++} ++ ++/* ++ * Program the next event, relative to now ++ */ + static int lapic_next_event(unsigned long delta, + struct clock_event_device *evt) + { +@@ -85,6 +232,9 @@ static int lapic_next_event(unsigned long delta, + return 0; + } + ++/* ++ * Setup the lapic timer in periodic or oneshot mode ++ */ + static void lapic_timer_setup(enum clock_event_mode mode, + struct clock_event_device *evt) + { +@@ -127,75 +277,261 @@ static void lapic_timer_broadcast(cpumask_t mask) + #endif + } + +-static void apic_pm_activate(void); ++/* ++ * Setup the local APIC timer for this CPU. Copy the initilized values ++ * of the boot CPU and register the clock event in the framework. ++ */ ++static void setup_APIC_timer(void) ++{ ++ struct clock_event_device *levt = &__get_cpu_var(lapic_events); + +-void apic_wait_icr_idle(void) ++ memcpy(levt, &lapic_clockevent, sizeof(*levt)); ++ levt->cpumask = cpumask_of_cpu(smp_processor_id()); ++ ++ clockevents_register_device(levt); ++} ++ ++/* ++ * In this function we calibrate APIC bus clocks to the external ++ * timer. Unfortunately we cannot use jiffies and the timer irq ++ * to calibrate, since some later bootup code depends on getting ++ * the first irq? Ugh. ++ * ++ * We want to do the calibration only once since we ++ * want to have local timer irqs syncron. CPUs connected ++ * by the same APIC bus have the very same bus frequency. ++ * And we want to have irqs off anyways, no accidental ++ * APIC irq that way. 
++ */ ++ ++#define TICK_COUNT 100000000 ++ ++static void __init calibrate_APIC_clock(void) + { +- while (apic_read(APIC_ICR) & APIC_ICR_BUSY) +- cpu_relax(); ++ unsigned apic, apic_start; ++ unsigned long tsc, tsc_start; ++ int result; ++ ++ local_irq_disable(); ++ ++ /* ++ * Put whatever arbitrary (but long enough) timeout ++ * value into the APIC clock, we just want to get the ++ * counter running for calibration. ++ * ++ * No interrupt enable ! ++ */ ++ __setup_APIC_LVTT(250000000, 0, 0); ++ ++ apic_start = apic_read(APIC_TMCCT); ++#ifdef CONFIG_X86_PM_TIMER ++ if (apic_calibrate_pmtmr && pmtmr_ioport) { ++ pmtimer_wait(5000); /* 5ms wait */ ++ apic = apic_read(APIC_TMCCT); ++ result = (apic_start - apic) * 1000L / 5; ++ } else ++#endif ++ { ++ rdtscll(tsc_start); ++ ++ do { ++ apic = apic_read(APIC_TMCCT); ++ rdtscll(tsc); ++ } while ((tsc - tsc_start) < TICK_COUNT && ++ (apic_start - apic) < TICK_COUNT); ++ ++ result = (apic_start - apic) * 1000L * tsc_khz / ++ (tsc - tsc_start); ++ } ++ ++ local_irq_enable(); ++ ++ printk(KERN_DEBUG "APIC timer calibration result %d\n", result); ++ ++ printk(KERN_INFO "Detected %d.%03d MHz APIC timer.\n", ++ result / 1000 / 1000, result / 1000 % 1000); ++ ++ /* Calculate the scaled math multiplication factor */ ++ lapic_clockevent.mult = div_sc(result, NSEC_PER_SEC, 32); ++ lapic_clockevent.max_delta_ns = ++ clockevent_delta2ns(0x7FFFFF, &lapic_clockevent); ++ lapic_clockevent.min_delta_ns = ++ clockevent_delta2ns(0xF, &lapic_clockevent); ++ ++ calibration_result = result / HZ; + } + +-unsigned int safe_apic_wait_icr_idle(void) ++/* ++ * Setup the boot APIC ++ * ++ * Calibrate and verify the result. ++ */ ++void __init setup_boot_APIC_clock(void) + { +- unsigned int send_status; +- int timeout; ++ /* ++ * The local apic timer can be disabled via the kernel commandline. ++ * Register the lapic timer as a dummy clock event source on SMP ++ * systems, so the broadcast mechanism is used. On UP systems simply ++ * ignore it. ++ */ ++ if (disable_apic_timer) { ++ printk(KERN_INFO "Disabling APIC timer\n"); ++ /* No broadcast on UP ! */ ++ if (num_possible_cpus() > 1) { ++ lapic_clockevent.mult = 1; ++ setup_APIC_timer(); ++ } ++ return; ++ } + +- timeout = 0; +- do { +- send_status = apic_read(APIC_ICR) & APIC_ICR_BUSY; +- if (!send_status) +- break; +- udelay(100); +- } while (timeout++ < 1000); ++ printk(KERN_INFO "Using local APIC timer interrupts.\n"); ++ calibrate_APIC_clock(); + +- return send_status; ++ /* ++ * Do a sanity check on the APIC calibration result ++ */ ++ if (calibration_result < (1000000 / HZ)) { ++ printk(KERN_WARNING ++ "APIC frequency too slow, disabling apic timer\n"); ++ /* No broadcast on UP ! */ ++ if (num_possible_cpus() > 1) ++ setup_APIC_timer(); ++ return; ++ } ++ ++ /* ++ * If nmi_watchdog is set to IO_APIC, we need the ++ * PIT/HPET going. Otherwise register lapic as a dummy ++ * device. ++ */ ++ if (nmi_watchdog != NMI_IO_APIC) ++ lapic_clockevent.features &= ~CLOCK_EVT_FEAT_DUMMY; ++ else ++ printk(KERN_WARNING "APIC timer registered as dummy," ++ " due to nmi_watchdog=1!\n"); ++ ++ setup_APIC_timer(); + } + +-void enable_NMI_through_LVT0 (void * dummy) ++/* ++ * AMD C1E enabled CPUs have a real nasty problem: Some BIOSes set the ++ * C1E flag only in the secondary CPU, so when we detect the wreckage ++ * we already have enabled the boot CPU local apic timer. Check, if ++ * disable_apic_timer is set and the DUMMY flag is cleared. 
If yes, ++ * set the DUMMY flag again and force the broadcast mode in the ++ * clockevents layer. ++ */ ++void __cpuinit check_boot_apic_timer_broadcast(void) + { +- unsigned int v; ++ if (!disable_apic_timer || ++ (lapic_clockevent.features & CLOCK_EVT_FEAT_DUMMY)) ++ return; + +- /* unmask and set to NMI */ +- v = APIC_DM_NMI; +- apic_write(APIC_LVT0, v); ++ printk(KERN_INFO "AMD C1E detected late. Force timer broadcast.\n"); ++ lapic_clockevent.features |= CLOCK_EVT_FEAT_DUMMY; ++ ++ local_irq_enable(); ++ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE, &boot_cpu_id); ++ local_irq_disable(); + } + +-int get_maxlvt(void) ++void __cpuinit setup_secondary_APIC_clock(void) + { +- unsigned int v, maxlvt; ++ check_boot_apic_timer_broadcast(); ++ setup_APIC_timer(); ++} + +- v = apic_read(APIC_LVR); +- maxlvt = GET_APIC_MAXLVT(v); +- return maxlvt; ++/* ++ * The guts of the apic timer interrupt ++ */ ++static void local_apic_timer_interrupt(void) ++{ ++ int cpu = smp_processor_id(); ++ struct clock_event_device *evt = &per_cpu(lapic_events, cpu); ++ ++ /* ++ * Normally we should not be here till LAPIC has been initialized but ++ * in some cases like kdump, its possible that there is a pending LAPIC ++ * timer interrupt from previous kernel's context and is delivered in ++ * new kernel the moment interrupts are enabled. ++ * ++ * Interrupts are enabled early and LAPIC is setup much later, hence ++ * its possible that when we get here evt->event_handler is NULL. ++ * Check for event_handler being NULL and discard the interrupt as ++ * spurious. ++ */ ++ if (!evt->event_handler) { ++ printk(KERN_WARNING ++ "Spurious LAPIC timer interrupt on cpu %d\n", cpu); ++ /* Switch it off */ ++ lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, evt); ++ return; ++ } ++ ++ /* ++ * the NMI deadlock-detector uses this. ++ */ ++ add_pda(apic_timer_irqs, 1); ++ ++ evt->event_handler(evt); + } + + /* +- * 'what should we do if we get a hw irq event on an illegal vector'. +- * each architecture has to answer this themselves. ++ * Local APIC timer interrupt. This is the most natural way for doing ++ * local interrupts, but local timer interrupts can be emulated by ++ * broadcast interrupts too. [in case the hw doesn't support APIC timers] ++ * ++ * [ if a single-CPU system runs an SMP kernel then we call the local ++ * interrupt as well. Thus we cannot inline the local irq ... ] + */ +-void ack_bad_irq(unsigned int irq) ++void smp_apic_timer_interrupt(struct pt_regs *regs) + { +- printk("unexpected IRQ trap at vector %02x\n", irq); ++ struct pt_regs *old_regs = set_irq_regs(regs); ++ + /* +- * Currently unexpected vectors happen only on SMP and APIC. +- * We _must_ ack these because every local APIC has only N +- * irq slots per priority level, and a 'hanging, unacked' IRQ +- * holds up an irq slot - in excessive cases (when multiple +- * unexpected vectors occur) that might lock up the APIC +- * completely. +- * But don't ack when the APIC is disabled. -AK ++ * NOTE! We'd better ACK the irq immediately, ++ * because timer handling can be slow. + */ +- if (!disable_apic) +- ack_APIC_irq(); ++ ack_APIC_irq(); ++ /* ++ * update_process_times() expects us to have done irq_enter(). ++ * Besides, if we don't timer interrupts ignore the global ++ * interrupt lock, which is the WrongThing (tm) to do. 
++ */ ++ exit_idle(); ++ irq_enter(); ++ local_apic_timer_interrupt(); ++ irq_exit(); ++ set_irq_regs(old_regs); ++} ++ ++int setup_profiling_timer(unsigned int multiplier) ++{ ++ return -EINVAL; + } + ++ ++/* ++ * Local APIC start and shutdown ++ */ ++ ++/** ++ * clear_local_APIC - shutdown the local APIC ++ * ++ * This is called, when a CPU is disabled and before rebooting, so the state of ++ * the local APIC has no dangling leftovers. Also used to cleanout any BIOS ++ * leftovers during boot. ++ */ + void clear_local_APIC(void) + { +- int maxlvt; +- unsigned int v; ++ int maxlvt = lapic_get_maxlvt(); ++ u32 v; + +- maxlvt = get_maxlvt(); ++ /* APIC hasn't been mapped yet */ ++ if (!apic_phys) ++ return; + ++ maxlvt = lapic_get_maxlvt(); + /* + * Masking an LVT entry can trigger a local APIC error + * if the vector is zero. Mask LVTERR first to prevent this. +@@ -233,45 +569,9 @@ void clear_local_APIC(void) + apic_read(APIC_ESR); + } + +-void disconnect_bsp_APIC(int virt_wire_setup) +-{ +- /* Go back to Virtual Wire compatibility mode */ +- unsigned long value; +- +- /* For the spurious interrupt use vector F, and enable it */ +- value = apic_read(APIC_SPIV); +- value &= ~APIC_VECTOR_MASK; +- value |= APIC_SPIV_APIC_ENABLED; +- value |= 0xf; +- apic_write(APIC_SPIV, value); +- +- if (!virt_wire_setup) { +- /* +- * For LVT0 make it edge triggered, active high, +- * external and enabled +- */ +- value = apic_read(APIC_LVT0); +- value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | +- APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | +- APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED ); +- value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; +- value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT); +- apic_write(APIC_LVT0, value); +- } else { +- /* Disable LVT0 */ +- apic_write(APIC_LVT0, APIC_LVT_MASKED); +- } +- +- /* For LVT1 make it edge triggered, active high, nmi and enabled */ +- value = apic_read(APIC_LVT1); +- value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | +- APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | +- APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); +- value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; +- value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI); +- apic_write(APIC_LVT1, value); +-} +- ++/** ++ * disable_local_APIC - clear and disable the local APIC ++ */ + void disable_local_APIC(void) + { + unsigned int value; +@@ -333,7 +633,7 @@ int __init verify_local_APIC(void) + reg1 = GET_APIC_VERSION(reg0); + if (reg1 == 0x00 || reg1 == 0xff) + return 0; +- reg1 = get_maxlvt(); ++ reg1 = lapic_get_maxlvt(); + if (reg1 < 0x02 || reg1 == 0xff) + return 0; + +@@ -355,18 +655,20 @@ int __init verify_local_APIC(void) + * compatibility mode, but most boxes are anymore. + */ + reg0 = apic_read(APIC_LVT0); +- apic_printk(APIC_DEBUG,"Getting LVT0: %x\n", reg0); ++ apic_printk(APIC_DEBUG, "Getting LVT0: %x\n", reg0); + reg1 = apic_read(APIC_LVT1); + apic_printk(APIC_DEBUG, "Getting LVT1: %x\n", reg1); + + return 1; + } + ++/** ++ * sync_Arb_IDs - synchronize APIC bus arbitration IDs ++ */ + void __init sync_Arb_IDs(void) + { + /* Unsupported on P4 - see Intel Dev. Manual Vol. 3, Ch. 
8.6.1 */ +- unsigned int ver = GET_APIC_VERSION(apic_read(APIC_LVR)); +- if (ver >= 0x14) /* P4 or higher */ ++ if (modern_apic()) + return; + + /* +@@ -418,9 +720,12 @@ void __init init_bsp_APIC(void) + apic_write(APIC_LVT1, value); + } + +-void __cpuinit setup_local_APIC (void) ++/** ++ * setup_local_APIC - setup the local APIC ++ */ ++void __cpuinit setup_local_APIC(void) + { +- unsigned int value, maxlvt; ++ unsigned int value; + int i, j; + + value = apic_read(APIC_LVR); +@@ -516,30 +821,217 @@ void __cpuinit setup_local_APIC (void) + else + value = APIC_DM_NMI | APIC_LVT_MASKED; + apic_write(APIC_LVT1, value); ++} + +- { +- unsigned oldvalue; +- maxlvt = get_maxlvt(); +- oldvalue = apic_read(APIC_ESR); +- value = ERROR_APIC_VECTOR; // enables sending errors +- apic_write(APIC_LVTERR, value); +- /* +- * spec says clear errors after enabling vector. +- */ +- if (maxlvt > 3) +- apic_write(APIC_ESR, 0); +- value = apic_read(APIC_ESR); +- if (value != oldvalue) +- apic_printk(APIC_VERBOSE, +- "ESR value after enabling vector: %08x, after %08x\n", +- oldvalue, value); +- } ++void __cpuinit lapic_setup_esr(void) ++{ ++ unsigned maxlvt = lapic_get_maxlvt(); ++ ++ apic_write(APIC_LVTERR, ERROR_APIC_VECTOR); ++ /* ++ * spec says clear errors after enabling vector. ++ */ ++ if (maxlvt > 3) ++ apic_write(APIC_ESR, 0); ++} + ++void __cpuinit end_local_APIC_setup(void) ++{ ++ lapic_setup_esr(); + nmi_watchdog_default(); + setup_apic_nmi_watchdog(NULL); + apic_pm_activate(); + } + ++/* ++ * Detect and enable local APICs on non-SMP boards. ++ * Original code written by Keir Fraser. ++ * On AMD64 we trust the BIOS - if it says no APIC it is likely ++ * not correctly set up (usually the APIC timer won't work etc.) ++ */ ++static int __init detect_init_APIC(void) ++{ ++ if (!cpu_has_apic) { ++ printk(KERN_INFO "No local APIC present\n"); ++ return -1; ++ } ++ ++ mp_lapic_addr = APIC_DEFAULT_PHYS_BASE; ++ boot_cpu_id = 0; ++ return 0; ++} ++ ++/** ++ * init_apic_mappings - initialize APIC mappings ++ */ ++void __init init_apic_mappings(void) ++{ ++ /* ++ * If no local APIC can be found then set up a fake all ++ * zeroes page to simulate the local APIC and another ++ * one for the IO-APIC. ++ */ ++ if (!smp_found_config && detect_init_APIC()) { ++ apic_phys = (unsigned long) alloc_bootmem_pages(PAGE_SIZE); ++ apic_phys = __pa(apic_phys); ++ } else ++ apic_phys = mp_lapic_addr; ++ ++ set_fixmap_nocache(FIX_APIC_BASE, apic_phys); ++ apic_printk(APIC_VERBOSE, "mapped APIC to %16lx (%16lx)\n", ++ APIC_BASE, apic_phys); ++ ++ /* Put local APIC into the resource map. */ ++ lapic_resource.start = apic_phys; ++ lapic_resource.end = lapic_resource.start + PAGE_SIZE - 1; ++ insert_resource(&iomem_resource, &lapic_resource); ++ ++ /* ++ * Fetch the APIC ID of the BSP in case we have a ++ * default configuration (or the MP table is broken). ++ */ ++ boot_cpu_id = GET_APIC_ID(apic_read(APIC_ID)); ++} ++ ++/* ++ * This initializes the IO-APIC and APIC hardware if this is ++ * a UP kernel. 
++ */ ++int __init APIC_init_uniprocessor(void) ++{ ++ if (disable_apic) { ++ printk(KERN_INFO "Apic disabled\n"); ++ return -1; ++ } ++ if (!cpu_has_apic) { ++ disable_apic = 1; ++ printk(KERN_INFO "Apic disabled by BIOS\n"); ++ return -1; ++ } ++ ++ verify_local_APIC(); ++ ++ phys_cpu_present_map = physid_mask_of_physid(boot_cpu_id); ++ apic_write(APIC_ID, SET_APIC_ID(boot_cpu_id)); ++ ++ setup_local_APIC(); ++ ++ /* ++ * Now enable IO-APICs, actually call clear_IO_APIC ++ * We need clear_IO_APIC before enabling vector on BP ++ */ ++ if (!skip_ioapic_setup && nr_ioapics) ++ enable_IO_APIC(); ++ ++ end_local_APIC_setup(); ++ ++ if (smp_found_config && !skip_ioapic_setup && nr_ioapics) ++ setup_IO_APIC(); ++ else ++ nr_ioapics = 0; ++ setup_boot_APIC_clock(); ++ check_nmi_watchdog(); ++ return 0; ++} ++ ++/* ++ * Local APIC interrupts ++ */ ++ ++/* ++ * This interrupt should _never_ happen with our APIC/SMP architecture ++ */ ++asmlinkage void smp_spurious_interrupt(void) ++{ ++ unsigned int v; ++ exit_idle(); ++ irq_enter(); ++ /* ++ * Check if this really is a spurious interrupt and ACK it ++ * if it is a vectored one. Just in case... ++ * Spurious interrupts should not be ACKed. ++ */ ++ v = apic_read(APIC_ISR + ((SPURIOUS_APIC_VECTOR & ~0x1f) >> 1)); ++ if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f))) ++ ack_APIC_irq(); ++ ++ add_pda(irq_spurious_count, 1); ++ irq_exit(); ++} ++ ++/* ++ * This interrupt should never happen with our APIC/SMP architecture ++ */ ++asmlinkage void smp_error_interrupt(void) ++{ ++ unsigned int v, v1; ++ ++ exit_idle(); ++ irq_enter(); ++ /* First tickle the hardware, only then report what went on. -- REW */ ++ v = apic_read(APIC_ESR); ++ apic_write(APIC_ESR, 0); ++ v1 = apic_read(APIC_ESR); ++ ack_APIC_irq(); ++ atomic_inc(&irq_err_count); ++ ++ /* Here is what the APIC error bits mean: ++ 0: Send CS error ++ 1: Receive CS error ++ 2: Send accept error ++ 3: Receive accept error ++ 4: Reserved ++ 5: Send illegal vector ++ 6: Received illegal vector ++ 7: Illegal register address ++ */ ++ printk(KERN_DEBUG "APIC error on CPU%d: %02x(%02x)\n", ++ smp_processor_id(), v , v1); ++ irq_exit(); ++} ++ ++void disconnect_bsp_APIC(int virt_wire_setup) ++{ ++ /* Go back to Virtual Wire compatibility mode */ ++ unsigned long value; ++ ++ /* For the spurious interrupt use vector F, and enable it */ ++ value = apic_read(APIC_SPIV); ++ value &= ~APIC_VECTOR_MASK; ++ value |= APIC_SPIV_APIC_ENABLED; ++ value |= 0xf; ++ apic_write(APIC_SPIV, value); ++ ++ if (!virt_wire_setup) { ++ /* ++ * For LVT0 make it edge triggered, active high, ++ * external and enabled ++ */ ++ value = apic_read(APIC_LVT0); ++ value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | ++ APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | ++ APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); ++ value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; ++ value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_EXTINT); ++ apic_write(APIC_LVT0, value); ++ } else { ++ /* Disable LVT0 */ ++ apic_write(APIC_LVT0, APIC_LVT_MASKED); ++ } ++ ++ /* For LVT1 make it edge triggered, active high, nmi and enabled */ ++ value = apic_read(APIC_LVT1); ++ value &= ~(APIC_MODE_MASK | APIC_SEND_PENDING | ++ APIC_INPUT_POLARITY | APIC_LVT_REMOTE_IRR | ++ APIC_LVT_LEVEL_TRIGGER | APIC_LVT_MASKED); ++ value |= APIC_LVT_REMOTE_IRR | APIC_SEND_PENDING; ++ value = SET_APIC_DELIVERY_MODE(value, APIC_MODE_NMI); ++ apic_write(APIC_LVT1, value); ++} ++ ++/* ++ * Power management ++ */ + #ifdef CONFIG_PM + + static struct { +@@ -571,7 +1063,7 @@ static int 
lapic_suspend(struct sys_device *dev, pm_message_t state) + if (!apic_pm_state.active) + return 0; + +- maxlvt = get_maxlvt(); ++ maxlvt = lapic_get_maxlvt(); + + apic_pm_state.apic_id = apic_read(APIC_ID); + apic_pm_state.apic_taskpri = apic_read(APIC_TASKPRI); +@@ -605,7 +1097,7 @@ static int lapic_resume(struct sys_device *dev) + if (!apic_pm_state.active) + return 0; + +- maxlvt = get_maxlvt(); ++ maxlvt = lapic_get_maxlvt(); + + local_irq_save(flags); + rdmsr(MSR_IA32_APICBASE, l, h); +@@ -639,14 +1131,14 @@ static int lapic_resume(struct sys_device *dev) } static struct sysdev_class lapic_sysclass = { @@ -135087,6 +146090,2671 @@ index f28ccb5..fa6cdee 100644 .resume = lapic_resume, .suspend = lapic_suspend, }; + + static struct sys_device device_lapic = { +- .id = 0, +- .cls = &lapic_sysclass, ++ .id = 0, ++ .cls = &lapic_sysclass, + }; + + static void __cpuinit apic_pm_activate(void) +@@ -657,9 +1149,11 @@ static void __cpuinit apic_pm_activate(void) + static int __init init_lapic_sysfs(void) + { + int error; ++ + if (!cpu_has_apic) + return 0; + /* XXX: remove suspend/resume procs if !apic_pm_state.active? */ ++ + error = sysdev_class_register(&lapic_sysclass); + if (!error) + error = sysdev_register(&device_lapic); +@@ -673,423 +1167,6 @@ static void apic_pm_activate(void) { } + + #endif /* CONFIG_PM */ + +-static int __init apic_set_verbosity(char *str) +-{ +- if (str == NULL) { +- skip_ioapic_setup = 0; +- ioapic_force = 1; +- return 0; +- } +- if (strcmp("debug", str) == 0) +- apic_verbosity = APIC_DEBUG; +- else if (strcmp("verbose", str) == 0) +- apic_verbosity = APIC_VERBOSE; +- else { +- printk(KERN_WARNING "APIC Verbosity level %s not recognised" +- " use apic=verbose or apic=debug\n", str); +- return -EINVAL; +- } +- +- return 0; +-} +-early_param("apic", apic_set_verbosity); +- +-/* +- * Detect and enable local APICs on non-SMP boards. +- * Original code written by Keir Fraser. +- * On AMD64 we trust the BIOS - if it says no APIC it is likely +- * not correctly set up (usually the APIC timer won't work etc.) +- */ +- +-static int __init detect_init_APIC (void) +-{ +- if (!cpu_has_apic) { +- printk(KERN_INFO "No local APIC present\n"); +- return -1; +- } +- +- mp_lapic_addr = APIC_DEFAULT_PHYS_BASE; +- boot_cpu_id = 0; +- return 0; +-} +- +-#ifdef CONFIG_X86_IO_APIC +-static struct resource * __init ioapic_setup_resources(void) +-{ +-#define IOAPIC_RESOURCE_NAME_SIZE 11 +- unsigned long n; +- struct resource *res; +- char *mem; +- int i; +- +- if (nr_ioapics <= 0) +- return NULL; +- +- n = IOAPIC_RESOURCE_NAME_SIZE + sizeof(struct resource); +- n *= nr_ioapics; +- +- mem = alloc_bootmem(n); +- res = (void *)mem; +- +- if (mem != NULL) { +- memset(mem, 0, n); +- mem += sizeof(struct resource) * nr_ioapics; +- +- for (i = 0; i < nr_ioapics; i++) { +- res[i].name = mem; +- res[i].flags = IORESOURCE_MEM | IORESOURCE_BUSY; +- sprintf(mem, "IOAPIC %u", i); +- mem += IOAPIC_RESOURCE_NAME_SIZE; +- } +- } +- +- ioapic_resources = res; +- +- return res; +-} +- +-static int __init ioapic_insert_resources(void) +-{ +- int i; +- struct resource *r = ioapic_resources; +- +- if (!r) { +- printk("IO APIC resources could be not be allocated.\n"); +- return -1; +- } +- +- for (i = 0; i < nr_ioapics; i++) { +- insert_resource(&iomem_resource, r); +- r++; +- } +- +- return 0; +-} +- +-/* Insert the IO APIC resources after PCI initialization has occured to handle +- * IO APICS that are mapped in on a BAR in PCI space. 
*/ +-late_initcall(ioapic_insert_resources); +-#endif +- +-void __init init_apic_mappings(void) +-{ +- unsigned long apic_phys; +- +- /* +- * If no local APIC can be found then set up a fake all +- * zeroes page to simulate the local APIC and another +- * one for the IO-APIC. +- */ +- if (!smp_found_config && detect_init_APIC()) { +- apic_phys = (unsigned long) alloc_bootmem_pages(PAGE_SIZE); +- apic_phys = __pa(apic_phys); +- } else +- apic_phys = mp_lapic_addr; +- +- set_fixmap_nocache(FIX_APIC_BASE, apic_phys); +- apic_printk(APIC_VERBOSE, "mapped APIC to %16lx (%16lx)\n", +- APIC_BASE, apic_phys); +- +- /* Put local APIC into the resource map. */ +- lapic_resource.start = apic_phys; +- lapic_resource.end = lapic_resource.start + PAGE_SIZE - 1; +- insert_resource(&iomem_resource, &lapic_resource); +- +- /* +- * Fetch the APIC ID of the BSP in case we have a +- * default configuration (or the MP table is broken). +- */ +- boot_cpu_id = GET_APIC_ID(apic_read(APIC_ID)); +- +- { +- unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0; +- int i; +- struct resource *ioapic_res; +- +- ioapic_res = ioapic_setup_resources(); +- for (i = 0; i < nr_ioapics; i++) { +- if (smp_found_config) { +- ioapic_phys = mp_ioapics[i].mpc_apicaddr; +- } else { +- ioapic_phys = (unsigned long) +- alloc_bootmem_pages(PAGE_SIZE); +- ioapic_phys = __pa(ioapic_phys); +- } +- set_fixmap_nocache(idx, ioapic_phys); +- apic_printk(APIC_VERBOSE, +- "mapped IOAPIC to %016lx (%016lx)\n", +- __fix_to_virt(idx), ioapic_phys); +- idx++; +- +- if (ioapic_res != NULL) { +- ioapic_res->start = ioapic_phys; +- ioapic_res->end = ioapic_phys + (4 * 1024) - 1; +- ioapic_res++; +- } +- } +- } +-} +- +-/* +- * This function sets up the local APIC timer, with a timeout of +- * 'clocks' APIC bus clock. During calibration we actually call +- * this function twice on the boot CPU, once with a bogus timeout +- * value, second time for real. The other (noncalibrating) CPUs +- * call this function only once, with the real, calibrated value. +- * +- * We do reads before writes even if unnecessary, to get around the +- * P5 APIC double write bug. +- */ +- +-static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen) +-{ +- unsigned int lvtt_value, tmp_value; +- +- lvtt_value = LOCAL_TIMER_VECTOR; +- if (!oneshot) +- lvtt_value |= APIC_LVT_TIMER_PERIODIC; +- if (!irqen) +- lvtt_value |= APIC_LVT_MASKED; +- +- apic_write(APIC_LVTT, lvtt_value); +- +- /* +- * Divide PICLK by 16 +- */ +- tmp_value = apic_read(APIC_TDCR); +- apic_write(APIC_TDCR, (tmp_value +- & ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE)) +- | APIC_TDR_DIV_16); +- +- if (!oneshot) +- apic_write(APIC_TMICT, clocks); +-} +- +-static void setup_APIC_timer(void) +-{ +- struct clock_event_device *levt = &__get_cpu_var(lapic_events); +- +- memcpy(levt, &lapic_clockevent, sizeof(*levt)); +- levt->cpumask = cpumask_of_cpu(smp_processor_id()); +- +- clockevents_register_device(levt); +-} +- +-/* +- * In this function we calibrate APIC bus clocks to the external +- * timer. Unfortunately we cannot use jiffies and the timer irq +- * to calibrate, since some later bootup code depends on getting +- * the first irq? Ugh. +- * +- * We want to do the calibration only once since we +- * want to have local timer irqs syncron. CPUs connected +- * by the same APIC bus have the very same bus frequency. +- * And we want to have irqs off anyways, no accidental +- * APIC irq that way. 
+- */ +- +-#define TICK_COUNT 100000000 +- +-static void __init calibrate_APIC_clock(void) +-{ +- unsigned apic, apic_start; +- unsigned long tsc, tsc_start; +- int result; +- +- local_irq_disable(); +- +- /* +- * Put whatever arbitrary (but long enough) timeout +- * value into the APIC clock, we just want to get the +- * counter running for calibration. +- * +- * No interrupt enable ! +- */ +- __setup_APIC_LVTT(250000000, 0, 0); +- +- apic_start = apic_read(APIC_TMCCT); +-#ifdef CONFIG_X86_PM_TIMER +- if (apic_calibrate_pmtmr && pmtmr_ioport) { +- pmtimer_wait(5000); /* 5ms wait */ +- apic = apic_read(APIC_TMCCT); +- result = (apic_start - apic) * 1000L / 5; +- } else +-#endif +- { +- rdtscll(tsc_start); +- +- do { +- apic = apic_read(APIC_TMCCT); +- rdtscll(tsc); +- } while ((tsc - tsc_start) < TICK_COUNT && +- (apic_start - apic) < TICK_COUNT); +- +- result = (apic_start - apic) * 1000L * tsc_khz / +- (tsc - tsc_start); +- } +- +- local_irq_enable(); +- +- printk(KERN_DEBUG "APIC timer calibration result %d\n", result); +- +- printk(KERN_INFO "Detected %d.%03d MHz APIC timer.\n", +- result / 1000 / 1000, result / 1000 % 1000); +- +- /* Calculate the scaled math multiplication factor */ +- lapic_clockevent.mult = div_sc(result, NSEC_PER_SEC, 32); +- lapic_clockevent.max_delta_ns = +- clockevent_delta2ns(0x7FFFFF, &lapic_clockevent); +- lapic_clockevent.min_delta_ns = +- clockevent_delta2ns(0xF, &lapic_clockevent); +- +- calibration_result = result / HZ; +-} +- +-void __init setup_boot_APIC_clock (void) +-{ +- /* +- * The local apic timer can be disabled via the kernel commandline. +- * Register the lapic timer as a dummy clock event source on SMP +- * systems, so the broadcast mechanism is used. On UP systems simply +- * ignore it. +- */ +- if (disable_apic_timer) { +- printk(KERN_INFO "Disabling APIC timer\n"); +- /* No broadcast on UP ! */ +- if (num_possible_cpus() > 1) +- setup_APIC_timer(); +- return; +- } +- +- printk(KERN_INFO "Using local APIC timer interrupts.\n"); +- calibrate_APIC_clock(); +- +- /* +- * If nmi_watchdog is set to IO_APIC, we need the +- * PIT/HPET going. Otherwise register lapic as a dummy +- * device. +- */ +- if (nmi_watchdog != NMI_IO_APIC) +- lapic_clockevent.features &= ~CLOCK_EVT_FEAT_DUMMY; +- else +- printk(KERN_WARNING "APIC timer registered as dummy," +- " due to nmi_watchdog=1!\n"); +- +- setup_APIC_timer(); +-} +- +-/* +- * AMD C1E enabled CPUs have a real nasty problem: Some BIOSes set the +- * C1E flag only in the secondary CPU, so when we detect the wreckage +- * we already have enabled the boot CPU local apic timer. Check, if +- * disable_apic_timer is set and the DUMMY flag is cleared. If yes, +- * set the DUMMY flag again and force the broadcast mode in the +- * clockevents layer. +- */ +-void __cpuinit check_boot_apic_timer_broadcast(void) +-{ +- if (!disable_apic_timer || +- (lapic_clockevent.features & CLOCK_EVT_FEAT_DUMMY)) +- return; +- +- printk(KERN_INFO "AMD C1E detected late. 
Force timer broadcast.\n"); +- lapic_clockevent.features |= CLOCK_EVT_FEAT_DUMMY; +- +- local_irq_enable(); +- clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE, &boot_cpu_id); +- local_irq_disable(); +-} +- +-void __cpuinit setup_secondary_APIC_clock(void) +-{ +- check_boot_apic_timer_broadcast(); +- setup_APIC_timer(); +-} +- +-int setup_profiling_timer(unsigned int multiplier) +-{ +- return -EINVAL; +-} +- +-void setup_APIC_extended_lvt(unsigned char lvt_off, unsigned char vector, +- unsigned char msg_type, unsigned char mask) +-{ +- unsigned long reg = (lvt_off << 4) + K8_APIC_EXT_LVT_BASE; +- unsigned int v = (mask << 16) | (msg_type << 8) | vector; +- apic_write(reg, v); +-} +- +-/* +- * Local timer interrupt handler. It does both profiling and +- * process statistics/rescheduling. +- * +- * We do profiling in every local tick, statistics/rescheduling +- * happen only every 'profiling multiplier' ticks. The default +- * multiplier is 1 and it can be changed by writing the new multiplier +- * value into /proc/profile. +- */ +- +-void smp_local_timer_interrupt(void) +-{ +- int cpu = smp_processor_id(); +- struct clock_event_device *evt = &per_cpu(lapic_events, cpu); +- +- /* +- * Normally we should not be here till LAPIC has been initialized but +- * in some cases like kdump, its possible that there is a pending LAPIC +- * timer interrupt from previous kernel's context and is delivered in +- * new kernel the moment interrupts are enabled. +- * +- * Interrupts are enabled early and LAPIC is setup much later, hence +- * its possible that when we get here evt->event_handler is NULL. +- * Check for event_handler being NULL and discard the interrupt as +- * spurious. +- */ +- if (!evt->event_handler) { +- printk(KERN_WARNING +- "Spurious LAPIC timer interrupt on cpu %d\n", cpu); +- /* Switch it off */ +- lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, evt); +- return; +- } +- +- /* +- * the NMI deadlock-detector uses this. +- */ +- add_pda(apic_timer_irqs, 1); +- +- evt->event_handler(evt); +-} +- +-/* +- * Local APIC timer interrupt. This is the most natural way for doing +- * local interrupts, but local timer interrupts can be emulated by +- * broadcast interrupts too. [in case the hw doesn't support APIC timers] +- * +- * [ if a single-CPU system runs an SMP kernel then we call the local +- * interrupt as well. Thus we cannot inline the local irq ... ] +- */ +-void smp_apic_timer_interrupt(struct pt_regs *regs) +-{ +- struct pt_regs *old_regs = set_irq_regs(regs); +- +- /* +- * NOTE! We'd better ACK the irq immediately, +- * because timer handling can be slow. +- */ +- ack_APIC_irq(); +- /* +- * update_process_times() expects us to have done irq_enter(). +- * Besides, if we don't timer interrupts ignore the global +- * interrupt lock, which is the WrongThing (tm) to do. +- */ +- exit_idle(); +- irq_enter(); +- smp_local_timer_interrupt(); +- irq_exit(); +- set_irq_regs(old_regs); +-} +- + /* + * apic_is_clustered_box() -- Check if we can expect good TSC + * +@@ -1103,21 +1180,34 @@ __cpuinit int apic_is_clustered_box(void) + { + int i, clusters, zeros; + unsigned id; ++ u16 *bios_cpu_apicid = x86_bios_cpu_apicid_early_ptr; + DECLARE_BITMAP(clustermap, NUM_APIC_CLUSTERS); + + bitmap_zero(clustermap, NUM_APIC_CLUSTERS); + + for (i = 0; i < NR_CPUS; i++) { +- id = bios_cpu_apicid[i]; ++ /* are we being called early in kernel startup? 
*/ ++ if (bios_cpu_apicid) { ++ id = bios_cpu_apicid[i]; ++ } ++ else if (i < nr_cpu_ids) { ++ if (cpu_present(i)) ++ id = per_cpu(x86_bios_cpu_apicid, i); ++ else ++ continue; ++ } ++ else ++ break; ++ + if (id != BAD_APICID) + __set_bit(APIC_CLUSTERID(id), clustermap); + } + + /* Problem: Partially populated chassis may not have CPUs in some of + * the APIC clusters they have been allocated. Only present CPUs have +- * bios_cpu_apicid entries, thus causing zeroes in the bitmap. Since +- * clusters are allocated sequentially, count zeros only if they are +- * bounded by ones. ++ * x86_bios_cpu_apicid entries, thus causing zeroes in the bitmap. ++ * Since clusters are allocated sequentially, count zeros only if ++ * they are bounded by ones. + */ + clusters = 0; + zeros = 0; +@@ -1138,96 +1228,33 @@ __cpuinit int apic_is_clustered_box(void) + } + + /* +- * This interrupt should _never_ happen with our APIC/SMP architecture +- */ +-asmlinkage void smp_spurious_interrupt(void) +-{ +- unsigned int v; +- exit_idle(); +- irq_enter(); +- /* +- * Check if this really is a spurious interrupt and ACK it +- * if it is a vectored one. Just in case... +- * Spurious interrupts should not be ACKed. +- */ +- v = apic_read(APIC_ISR + ((SPURIOUS_APIC_VECTOR & ~0x1f) >> 1)); +- if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f))) +- ack_APIC_irq(); +- +- add_pda(irq_spurious_count, 1); +- irq_exit(); +-} +- +-/* +- * This interrupt should never happen with our APIC/SMP architecture ++ * APIC command line parameters + */ +- +-asmlinkage void smp_error_interrupt(void) +-{ +- unsigned int v, v1; +- +- exit_idle(); +- irq_enter(); +- /* First tickle the hardware, only then report what went on. -- REW */ +- v = apic_read(APIC_ESR); +- apic_write(APIC_ESR, 0); +- v1 = apic_read(APIC_ESR); +- ack_APIC_irq(); +- atomic_inc(&irq_err_count); +- +- /* Here is what the APIC error bits mean: +- 0: Send CS error +- 1: Receive CS error +- 2: Send accept error +- 3: Receive accept error +- 4: Reserved +- 5: Send illegal vector +- 6: Received illegal vector +- 7: Illegal register address +- */ +- printk (KERN_DEBUG "APIC error on CPU%d: %02x(%02x)\n", +- smp_processor_id(), v , v1); +- irq_exit(); +-} +- +-int disable_apic; +- +-/* +- * This initializes the IO-APIC and APIC hardware if this is +- * a UP kernel. 
+- */ +-int __init APIC_init_uniprocessor (void) ++static int __init apic_set_verbosity(char *str) + { +- if (disable_apic) { +- printk(KERN_INFO "Apic disabled\n"); +- return -1; ++ if (str == NULL) { ++ skip_ioapic_setup = 0; ++ ioapic_force = 1; ++ return 0; + } +- if (!cpu_has_apic) { +- disable_apic = 1; +- printk(KERN_INFO "Apic disabled by BIOS\n"); +- return -1; ++ if (strcmp("debug", str) == 0) ++ apic_verbosity = APIC_DEBUG; ++ else if (strcmp("verbose", str) == 0) ++ apic_verbosity = APIC_VERBOSE; ++ else { ++ printk(KERN_WARNING "APIC Verbosity level %s not recognised" ++ " use apic=verbose or apic=debug\n", str); ++ return -EINVAL; + } + +- verify_local_APIC(); +- +- phys_cpu_present_map = physid_mask_of_physid(boot_cpu_id); +- apic_write(APIC_ID, SET_APIC_ID(boot_cpu_id)); +- +- setup_local_APIC(); +- +- if (smp_found_config && !skip_ioapic_setup && nr_ioapics) +- setup_IO_APIC(); +- else +- nr_ioapics = 0; +- setup_boot_APIC_clock(); +- check_nmi_watchdog(); + return 0; + } ++early_param("apic", apic_set_verbosity); + + static __init int setup_disableapic(char *str) + { + disable_apic = 1; +- clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability); ++ clear_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC); + return 0; + } + early_param("disableapic", setup_disableapic); +diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c +index af045ca..d4438ef 100644 +--- a/arch/x86/kernel/apm_32.c ++++ b/arch/x86/kernel/apm_32.c +@@ -227,6 +227,7 @@ + #include + #include + #include ++#include + + #include + #include +@@ -235,8 +236,6 @@ + #include + #include + +-#include "io_ports.h" +- + #if defined(CONFIG_APM_DISPLAY_BLANK) && defined(CONFIG_VT) + extern int (*console_blank_hook)(int); + #endif +@@ -324,7 +323,7 @@ extern int (*console_blank_hook)(int); + /* + * Ignore suspend events for this amount of time after a resume + */ +-#define DEFAULT_BOUNCE_INTERVAL (3 * HZ) ++#define DEFAULT_BOUNCE_INTERVAL (3 * HZ) + + /* + * Maximum number of events stored +@@ -336,7 +335,7 @@ extern int (*console_blank_hook)(int); + */ + struct apm_user { + int magic; +- struct apm_user * next; ++ struct apm_user *next; + unsigned int suser: 1; + unsigned int writer: 1; + unsigned int reader: 1; +@@ -372,44 +371,44 @@ struct apm_user { + static struct { + unsigned long offset; + unsigned short segment; +-} apm_bios_entry; +-static int clock_slowed; +-static int idle_threshold __read_mostly = DEFAULT_IDLE_THRESHOLD; +-static int idle_period __read_mostly = DEFAULT_IDLE_PERIOD; +-static int set_pm_idle; +-static int suspends_pending; +-static int standbys_pending; +-static int ignore_sys_suspend; +-static int ignore_normal_resume; +-static int bounce_interval __read_mostly = DEFAULT_BOUNCE_INTERVAL; +- +-static int debug __read_mostly; +-static int smp __read_mostly; +-static int apm_disabled = -1; ++} apm_bios_entry; ++static int clock_slowed; ++static int idle_threshold __read_mostly = DEFAULT_IDLE_THRESHOLD; ++static int idle_period __read_mostly = DEFAULT_IDLE_PERIOD; ++static int set_pm_idle; ++static int suspends_pending; ++static int standbys_pending; ++static int ignore_sys_suspend; ++static int ignore_normal_resume; ++static int bounce_interval __read_mostly = DEFAULT_BOUNCE_INTERVAL; ++ ++static int debug __read_mostly; ++static int smp __read_mostly; ++static int apm_disabled = -1; + #ifdef CONFIG_SMP +-static int power_off; ++static int power_off; + #else +-static int power_off = 1; ++static int power_off = 1; + #endif + #ifdef CONFIG_APM_REAL_MODE_POWER_OFF +-static int 
realmode_power_off = 1; ++static int realmode_power_off = 1; + #else +-static int realmode_power_off; ++static int realmode_power_off; + #endif + #ifdef CONFIG_APM_ALLOW_INTS +-static int allow_ints = 1; ++static int allow_ints = 1; + #else +-static int allow_ints; ++static int allow_ints; + #endif +-static int broken_psr; ++static int broken_psr; + + static DECLARE_WAIT_QUEUE_HEAD(apm_waitqueue); + static DECLARE_WAIT_QUEUE_HEAD(apm_suspend_waitqueue); +-static struct apm_user * user_list; ++static struct apm_user *user_list; + static DEFINE_SPINLOCK(user_list_lock); +-static const struct desc_struct bad_bios_desc = { 0, 0x00409200 }; ++static const struct desc_struct bad_bios_desc = { { { 0, 0x00409200 } } }; + +-static const char driver_version[] = "1.16ac"; /* no spaces */ ++static const char driver_version[] = "1.16ac"; /* no spaces */ + + static struct task_struct *kapmd_task; + +@@ -417,7 +416,7 @@ static struct task_struct *kapmd_task; + * APM event names taken from the APM 1.2 specification. These are + * the message codes that the BIOS uses to tell us about events + */ +-static const char * const apm_event_name[] = { ++static const char * const apm_event_name[] = { + "system standby", + "system suspend", + "normal resume", +@@ -435,14 +434,14 @@ static const char * const apm_event_name[] = { + + typedef struct lookup_t { + int key; +- char * msg; ++ char *msg; + } lookup_t; + + /* + * The BIOS returns a set of standard error codes in AX when the + * carry flag is set. + */ +- ++ + static const lookup_t error_table[] = { + /* N/A { APM_SUCCESS, "Operation succeeded" }, */ + { APM_DISABLED, "Power management disabled" }, +@@ -472,24 +471,25 @@ static const lookup_t error_table[] = { + * Write a meaningful log entry to the kernel log in the event of + * an APM error. + */ +- ++ + static void apm_error(char *str, int err) + { +- int i; ++ int i; + + for (i = 0; i < ERROR_COUNT; i++) +- if (error_table[i].key == err) break; ++ if (error_table[i].key == err) ++ break; + if (i < ERROR_COUNT) + printk(KERN_NOTICE "apm: %s: %s\n", str, error_table[i].msg); + else + printk(KERN_NOTICE "apm: %s: unknown error code %#2.2x\n", +- str, err); ++ str, err); + } + + /* + * Lock APM functionality to physical CPU 0 + */ +- ++ + #ifdef CONFIG_SMP + + static cpumask_t apm_save_cpus(void) +@@ -511,7 +511,7 @@ static inline void apm_restore_cpus(cpumask_t mask) + /* + * No CPU lockdown needed on a uniprocessor + */ +- ++ + #define apm_save_cpus() (current->cpus_allowed) + #define apm_restore_cpus(x) (void)(x) + +@@ -590,7 +590,7 @@ static inline void apm_irq_restore(unsigned long flags) + * code is returned in AH (bits 8-15 of eax) and this function + * returns non-zero. 
+ */ +- ++ + static u8 apm_bios_call(u32 func, u32 ebx_in, u32 ecx_in, + u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, u32 *esi) + { +@@ -602,7 +602,7 @@ static u8 apm_bios_call(u32 func, u32 ebx_in, u32 ecx_in, + struct desc_struct *gdt; + + cpus = apm_save_cpus(); +- ++ + cpu = get_cpu(); + gdt = get_cpu_gdt_table(cpu); + save_desc_40 = gdt[0x40 / 8]; +@@ -616,7 +616,7 @@ static u8 apm_bios_call(u32 func, u32 ebx_in, u32 ecx_in, + gdt[0x40 / 8] = save_desc_40; + put_cpu(); + apm_restore_cpus(cpus); +- ++ + return *eax & 0xff; + } + +@@ -645,7 +645,7 @@ static u8 apm_bios_call_simple(u32 func, u32 ebx_in, u32 ecx_in, u32 *eax) + struct desc_struct *gdt; + + cpus = apm_save_cpus(); +- ++ + cpu = get_cpu(); + gdt = get_cpu_gdt_table(cpu); + save_desc_40 = gdt[0x40 / 8]; +@@ -680,7 +680,7 @@ static u8 apm_bios_call_simple(u32 func, u32 ebx_in, u32 ecx_in, u32 *eax) + + static int apm_driver_version(u_short *val) + { +- u32 eax; ++ u32 eax; + + if (apm_bios_call_simple(APM_FUNC_VERSION, 0, *val, &eax)) + return (eax >> 8) & 0xff; +@@ -704,16 +704,16 @@ static int apm_driver_version(u_short *val) + * that APM 1.2 is in use. If no messges are pending the value 0x80 + * is returned (No power management events pending). + */ +- ++ + static int apm_get_event(apm_event_t *event, apm_eventinfo_t *info) + { +- u32 eax; +- u32 ebx; +- u32 ecx; +- u32 dummy; ++ u32 eax; ++ u32 ebx; ++ u32 ecx; ++ u32 dummy; + + if (apm_bios_call(APM_FUNC_GET_EVENT, 0, 0, &eax, &ebx, &ecx, +- &dummy, &dummy)) ++ &dummy, &dummy)) + return (eax >> 8) & 0xff; + *event = ebx; + if (apm_info.connection_version < 0x0102) +@@ -736,10 +736,10 @@ static int apm_get_event(apm_event_t *event, apm_eventinfo_t *info) + * The state holds the state to transition to, which may in fact + * be an acceptance of a BIOS requested state change. + */ +- ++ + static int set_power_state(u_short what, u_short state) + { +- u32 eax; ++ u32 eax; + + if (apm_bios_call_simple(APM_FUNC_SET_STATE, what, state, &eax)) + return (eax >> 8) & 0xff; +@@ -752,7 +752,7 @@ static int set_power_state(u_short what, u_short state) + * + * Transition the entire system into a new APM power state. + */ +- ++ + static int set_system_power_state(u_short state) + { + return set_power_state(APM_DEVICE_ALL, state); +@@ -766,13 +766,13 @@ static int set_system_power_state(u_short state) + * to handle the idle request. On a success the function returns 1 + * if the BIOS did clock slowing or 0 otherwise. + */ +- ++ + static int apm_do_idle(void) + { +- u32 eax; +- u8 ret = 0; +- int idled = 0; +- int polling; ++ u32 eax; ++ u8 ret = 0; ++ int idled = 0; ++ int polling; + + polling = !!(current_thread_info()->status & TS_POLLING); + if (polling) { +@@ -799,10 +799,9 @@ static int apm_do_idle(void) + /* This always fails on some SMP boards running UP kernels. + * Only report the failure the first 5 times. + */ +- if (++t < 5) +- { ++ if (++t < 5) { + printk(KERN_DEBUG "apm_do_idle failed (%d)\n", +- (eax >> 8) & 0xff); ++ (eax >> 8) & 0xff); + t = jiffies; + } + return -1; +@@ -814,15 +813,15 @@ static int apm_do_idle(void) + /** + * apm_do_busy - inform the BIOS the CPU is busy + * +- * Request that the BIOS brings the CPU back to full performance. ++ * Request that the BIOS brings the CPU back to full performance. 
+ */ +- ++ + static void apm_do_busy(void) + { +- u32 dummy; ++ u32 dummy; + + if (clock_slowed || ALWAYS_CALL_BUSY) { +- (void) apm_bios_call_simple(APM_FUNC_BUSY, 0, 0, &dummy); ++ (void)apm_bios_call_simple(APM_FUNC_BUSY, 0, 0, &dummy); + clock_slowed = 0; + } + } +@@ -833,15 +832,15 @@ static void apm_do_busy(void) + * power management - we probably want + * to conserve power. + */ +-#define IDLE_CALC_LIMIT (HZ * 100) +-#define IDLE_LEAKY_MAX 16 ++#define IDLE_CALC_LIMIT (HZ * 100) ++#define IDLE_LEAKY_MAX 16 + + static void (*original_pm_idle)(void) __read_mostly; + + /** + * apm_cpu_idle - cpu idling for APM capable Linux + * +- * This is the idling function the kernel executes when APM is available. It ++ * This is the idling function the kernel executes when APM is available. It + * tries to do BIOS powermanagement based on the average system idle time. + * Furthermore it calls the system default idle routine. + */ +@@ -882,7 +881,8 @@ recalc: + + t = jiffies; + switch (apm_do_idle()) { +- case 0: apm_idle_done = 1; ++ case 0: ++ apm_idle_done = 1; + if (t != jiffies) { + if (bucket) { + bucket = IDLE_LEAKY_MAX; +@@ -893,7 +893,8 @@ recalc: + continue; + } + break; +- case 1: apm_idle_done = 1; ++ case 1: ++ apm_idle_done = 1; + break; + default: /* BIOS refused */ + break; +@@ -921,10 +922,10 @@ recalc: + * the SMP call on CPU0 as some systems will only honour this call + * on their first cpu. + */ +- ++ + static void apm_power_off(void) + { +- unsigned char po_bios_call[] = { ++ unsigned char po_bios_call[] = { + 0xb8, 0x00, 0x10, /* movw $0x1000,ax */ + 0x8e, 0xd0, /* movw ax,ss */ + 0xbc, 0x00, 0xf0, /* movw $0xf000,sp */ +@@ -935,13 +936,12 @@ static void apm_power_off(void) + }; + + /* Some bioses don't like being called from CPU != 0 */ +- if (apm_info.realmode_power_off) +- { ++ if (apm_info.realmode_power_off) { + (void)apm_save_cpus(); + machine_real_restart(po_bios_call, sizeof(po_bios_call)); ++ } else { ++ (void)set_system_power_state(APM_STATE_OFF); + } +- else +- (void) set_system_power_state(APM_STATE_OFF); + } + + #ifdef CONFIG_APM_DO_ENABLE +@@ -950,17 +950,17 @@ static void apm_power_off(void) + * apm_enable_power_management - enable BIOS APM power management + * @enable: enable yes/no + * +- * Enable or disable the APM BIOS power services. ++ * Enable or disable the APM BIOS power services. + */ +- ++ + static int apm_enable_power_management(int enable) + { +- u32 eax; ++ u32 eax; + + if ((enable == 0) && (apm_info.bios.flags & APM_BIOS_DISENGAGED)) + return APM_NOT_ENGAGED; + if (apm_bios_call_simple(APM_FUNC_ENABLE_PM, APM_DEVICE_BALL, +- enable, &eax)) ++ enable, &eax)) + return (eax >> 8) & 0xff; + if (enable) + apm_info.bios.flags &= ~APM_BIOS_DISABLED; +@@ -983,19 +983,19 @@ static int apm_enable_power_management(int enable) + * if reported is a lifetime in secodnds/minutes at current powwer + * consumption. 
+ */ +- ++ + static int apm_get_power_status(u_short *status, u_short *bat, u_short *life) + { +- u32 eax; +- u32 ebx; +- u32 ecx; +- u32 edx; +- u32 dummy; ++ u32 eax; ++ u32 ebx; ++ u32 ecx; ++ u32 edx; ++ u32 dummy; + + if (apm_info.get_power_status_broken) + return APM_32_UNSUPPORTED; + if (apm_bios_call(APM_FUNC_GET_STATUS, APM_DEVICE_ALL, 0, +- &eax, &ebx, &ecx, &edx, &dummy)) ++ &eax, &ebx, &ecx, &edx, &dummy)) + return (eax >> 8) & 0xff; + *status = ebx; + *bat = ecx; +@@ -1011,11 +1011,11 @@ static int apm_get_power_status(u_short *status, u_short *bat, u_short *life) + static int apm_get_battery_status(u_short which, u_short *status, + u_short *bat, u_short *life, u_short *nbat) + { +- u32 eax; +- u32 ebx; +- u32 ecx; +- u32 edx; +- u32 esi; ++ u32 eax; ++ u32 ebx; ++ u32 ecx; ++ u32 edx; ++ u32 esi; + + if (apm_info.connection_version < 0x0102) { + /* pretend we only have one battery. */ +@@ -1026,7 +1026,7 @@ static int apm_get_battery_status(u_short which, u_short *status, + } + + if (apm_bios_call(APM_FUNC_GET_STATUS, (0x8000 | (which)), 0, &eax, +- &ebx, &ecx, &edx, &esi)) ++ &ebx, &ecx, &edx, &esi)) + return (eax >> 8) & 0xff; + *status = ebx; + *bat = ecx; +@@ -1044,10 +1044,10 @@ static int apm_get_battery_status(u_short which, u_short *status, + * Activate or deactive power management on either a specific device + * or the entire system (%APM_DEVICE_ALL). + */ +- ++ + static int apm_engage_power_management(u_short device, int enable) + { +- u32 eax; ++ u32 eax; + + if ((enable == 0) && (device == APM_DEVICE_ALL) + && (apm_info.bios.flags & APM_BIOS_DISABLED)) +@@ -1074,7 +1074,7 @@ static int apm_engage_power_management(u_short device, int enable) + * all video devices. Typically the BIOS will do laptop backlight and + * monitor powerdown for us. 
+ */ +- ++ + static int apm_console_blank(int blank) + { + int error = APM_NOT_ENGAGED; /* silence gcc */ +@@ -1126,7 +1126,7 @@ static apm_event_t get_queued_event(struct apm_user *as) + + static void queue_event(apm_event_t event, struct apm_user *sender) + { +- struct apm_user * as; ++ struct apm_user *as; + + spin_lock(&user_list_lock); + if (user_list == NULL) +@@ -1174,11 +1174,11 @@ static void reinit_timer(void) + + spin_lock_irqsave(&i8253_lock, flags); + /* set the clock to HZ */ +- outb_p(0x34, PIT_MODE); /* binary, mode 2, LSB/MSB, ch 0 */ ++ outb_pit(0x34, PIT_MODE); /* binary, mode 2, LSB/MSB, ch 0 */ + udelay(10); +- outb_p(LATCH & 0xff, PIT_CH0); /* LSB */ ++ outb_pit(LATCH & 0xff, PIT_CH0); /* LSB */ + udelay(10); +- outb(LATCH >> 8, PIT_CH0); /* MSB */ ++ outb_pit(LATCH >> 8, PIT_CH0); /* MSB */ + udelay(10); + spin_unlock_irqrestore(&i8253_lock, flags); + #endif +@@ -1186,7 +1186,7 @@ static void reinit_timer(void) + + static int suspend(int vetoable) + { +- int err; ++ int err; + struct apm_user *as; + + if (pm_send_all(PM_SUSPEND, (void *)3)) { +@@ -1239,7 +1239,7 @@ static int suspend(int vetoable) + + static void standby(void) + { +- int err; ++ int err; + + local_irq_disable(); + device_power_down(PMSG_SUSPEND); +@@ -1256,8 +1256,8 @@ static void standby(void) + + static apm_event_t get_event(void) + { +- int error; +- apm_event_t event = APM_NO_EVENTS; /* silence gcc */ ++ int error; ++ apm_event_t event = APM_NO_EVENTS; /* silence gcc */ + apm_eventinfo_t info; + + static int notified; +@@ -1275,9 +1275,9 @@ static apm_event_t get_event(void) + + static void check_events(void) + { +- apm_event_t event; +- static unsigned long last_resume; +- static int ignore_bounce; ++ apm_event_t event; ++ static unsigned long last_resume; ++ static int ignore_bounce; + + while ((event = get_event()) != 0) { + if (debug) { +@@ -1289,7 +1289,7 @@ static void check_events(void) + "event 0x%02x\n", event); + } + if (ignore_bounce +- && ((jiffies - last_resume) > bounce_interval)) ++ && (time_after(jiffies, last_resume + bounce_interval))) + ignore_bounce = 0; + + switch (event) { +@@ -1357,7 +1357,7 @@ static void check_events(void) + /* + * We are not allowed to reject a critical suspend. 
+ */ +- (void) suspend(0); ++ (void)suspend(0); + break; + } + } +@@ -1365,12 +1365,12 @@ static void check_events(void) + + static void apm_event_handler(void) + { +- static int pending_count = 4; +- int err; ++ static int pending_count = 4; ++ int err; + + if ((standbys_pending > 0) || (suspends_pending > 0)) { + if ((apm_info.connection_version > 0x100) && +- (pending_count-- <= 0)) { ++ (pending_count-- <= 0)) { + pending_count = 4; + if (debug) + printk(KERN_DEBUG "apm: setting state busy\n"); +@@ -1418,9 +1418,9 @@ static int check_apm_user(struct apm_user *as, const char *func) + + static ssize_t do_read(struct file *fp, char __user *buf, size_t count, loff_t *ppos) + { +- struct apm_user * as; +- int i; +- apm_event_t event; ++ struct apm_user *as; ++ int i; ++ apm_event_t event; + + as = fp->private_data; + if (check_apm_user(as, "read")) +@@ -1459,9 +1459,9 @@ static ssize_t do_read(struct file *fp, char __user *buf, size_t count, loff_t * + return 0; + } + +-static unsigned int do_poll(struct file *fp, poll_table * wait) ++static unsigned int do_poll(struct file *fp, poll_table *wait) + { +- struct apm_user * as; ++ struct apm_user *as; + + as = fp->private_data; + if (check_apm_user(as, "poll")) +@@ -1472,10 +1472,10 @@ static unsigned int do_poll(struct file *fp, poll_table * wait) + return 0; + } + +-static int do_ioctl(struct inode * inode, struct file *filp, ++static int do_ioctl(struct inode *inode, struct file *filp, + u_int cmd, u_long arg) + { +- struct apm_user * as; ++ struct apm_user *as; + + as = filp->private_data; + if (check_apm_user(as, "ioctl")) +@@ -1515,9 +1515,9 @@ static int do_ioctl(struct inode * inode, struct file *filp, + return 0; + } + +-static int do_release(struct inode * inode, struct file * filp) ++static int do_release(struct inode *inode, struct file *filp) + { +- struct apm_user * as; ++ struct apm_user *as; + + as = filp->private_data; + if (check_apm_user(as, "release")) +@@ -1533,11 +1533,11 @@ static int do_release(struct inode * inode, struct file * filp) + if (suspends_pending <= 0) + (void) suspend(1); + } +- spin_lock(&user_list_lock); ++ spin_lock(&user_list_lock); + if (user_list == as) + user_list = as->next; + else { +- struct apm_user * as1; ++ struct apm_user *as1; + + for (as1 = user_list; + (as1 != NULL) && (as1->next != as); +@@ -1553,9 +1553,9 @@ static int do_release(struct inode * inode, struct file * filp) + return 0; + } + +-static int do_open(struct inode * inode, struct file * filp) ++static int do_open(struct inode *inode, struct file *filp) + { +- struct apm_user * as; ++ struct apm_user *as; + + as = kmalloc(sizeof(*as), GFP_KERNEL); + if (as == NULL) { +@@ -1569,7 +1569,7 @@ static int do_open(struct inode * inode, struct file * filp) + as->suspends_read = as->standbys_read = 0; + /* + * XXX - this is a tiny bit broken, when we consider BSD +- * process accounting. If the device is opened by root, we ++ * process accounting. If the device is opened by root, we + * instantly flag that we used superuser privs. 
Who knows, + * we might close the device immediately without doing a + * privileged operation -- cevans +@@ -1652,16 +1652,16 @@ static int proc_apm_show(struct seq_file *m, void *v) + 8) min = minutes; sec = seconds */ + + seq_printf(m, "%s %d.%d 0x%02x 0x%02x 0x%02x 0x%02x %d%% %d %s\n", +- driver_version, +- (apm_info.bios.version >> 8) & 0xff, +- apm_info.bios.version & 0xff, +- apm_info.bios.flags, +- ac_line_status, +- battery_status, +- battery_flag, +- percentage, +- time_units, +- units); ++ driver_version, ++ (apm_info.bios.version >> 8) & 0xff, ++ apm_info.bios.version & 0xff, ++ apm_info.bios.flags, ++ ac_line_status, ++ battery_status, ++ battery_flag, ++ percentage, ++ time_units, ++ units); + return 0; + } + +@@ -1684,8 +1684,8 @@ static int apm(void *unused) + unsigned short cx; + unsigned short dx; + int error; +- char * power_stat; +- char * bat_stat; ++ char *power_stat; ++ char *bat_stat; + + #ifdef CONFIG_SMP + /* 2002/08/01 - WT +@@ -1744,23 +1744,41 @@ static int apm(void *unused) + } + } + +- if (debug && (num_online_cpus() == 1 || smp )) { ++ if (debug && (num_online_cpus() == 1 || smp)) { + error = apm_get_power_status(&bx, &cx, &dx); + if (error) + printk(KERN_INFO "apm: power status not available\n"); + else { + switch ((bx >> 8) & 0xff) { +- case 0: power_stat = "off line"; break; +- case 1: power_stat = "on line"; break; +- case 2: power_stat = "on backup power"; break; +- default: power_stat = "unknown"; break; ++ case 0: ++ power_stat = "off line"; ++ break; ++ case 1: ++ power_stat = "on line"; ++ break; ++ case 2: ++ power_stat = "on backup power"; ++ break; ++ default: ++ power_stat = "unknown"; ++ break; + } + switch (bx & 0xff) { +- case 0: bat_stat = "high"; break; +- case 1: bat_stat = "low"; break; +- case 2: bat_stat = "critical"; break; +- case 3: bat_stat = "charging"; break; +- default: bat_stat = "unknown"; break; ++ case 0: ++ bat_stat = "high"; ++ break; ++ case 1: ++ bat_stat = "low"; ++ break; ++ case 2: ++ bat_stat = "critical"; ++ break; ++ case 3: ++ bat_stat = "charging"; ++ break; ++ default: ++ bat_stat = "unknown"; ++ break; + } + printk(KERN_INFO + "apm: AC %s, battery status %s, battery life ", +@@ -1777,8 +1795,8 @@ static int apm(void *unused) + printk("unknown\n"); + else + printk("%d %s\n", dx & 0x7fff, +- (dx & 0x8000) ? +- "minutes" : "seconds"); ++ (dx & 0x8000) ? ++ "minutes" : "seconds"); + } + } + } +@@ -1803,7 +1821,7 @@ static int apm(void *unused) + #ifndef MODULE + static int __init apm_setup(char *str) + { +- int invert; ++ int invert; + + while ((str != NULL) && (*str != '\0')) { + if (strncmp(str, "off", 3) == 0) +@@ -1828,14 +1846,13 @@ static int __init apm_setup(char *str) + if ((strncmp(str, "power-off", 9) == 0) || + (strncmp(str, "power_off", 9) == 0)) + power_off = !invert; +- if (strncmp(str, "smp", 3) == 0) +- { ++ if (strncmp(str, "smp", 3) == 0) { + smp = !invert; + idle_threshold = 100; + } + if ((strncmp(str, "allow-ints", 10) == 0) || + (strncmp(str, "allow_ints", 10) == 0)) +- apm_info.allow_ints = !invert; ++ apm_info.allow_ints = !invert; + if ((strncmp(str, "broken-psr", 10) == 0) || + (strncmp(str, "broken_psr", 10) == 0)) + apm_info.get_power_status_broken = !invert; +@@ -1881,7 +1898,8 @@ static int __init print_if_true(const struct dmi_system_id *d) + */ + static int __init broken_ps2_resume(const struct dmi_system_id *d) + { +- printk(KERN_INFO "%s machine detected. Mousepad Resume Bug workaround hopefully not needed.\n", d->ident); ++ printk(KERN_INFO "%s machine detected. 
Mousepad Resume Bug " ++ "workaround hopefully not needed.\n", d->ident); + return 0; + } + +@@ -1890,7 +1908,8 @@ static int __init set_realmode_power_off(const struct dmi_system_id *d) + { + if (apm_info.realmode_power_off == 0) { + apm_info.realmode_power_off = 1; +- printk(KERN_INFO "%s bios detected. Using realmode poweroff only.\n", d->ident); ++ printk(KERN_INFO "%s bios detected. " ++ "Using realmode poweroff only.\n", d->ident); + } + return 0; + } +@@ -1900,7 +1919,8 @@ static int __init set_apm_ints(const struct dmi_system_id *d) + { + if (apm_info.allow_ints == 0) { + apm_info.allow_ints = 1; +- printk(KERN_INFO "%s machine detected. Enabling interrupts during APM calls.\n", d->ident); ++ printk(KERN_INFO "%s machine detected. " ++ "Enabling interrupts during APM calls.\n", d->ident); + } + return 0; + } +@@ -1910,7 +1930,8 @@ static int __init apm_is_horked(const struct dmi_system_id *d) + { + if (apm_info.disabled == 0) { + apm_info.disabled = 1; +- printk(KERN_INFO "%s machine detected. Disabling APM.\n", d->ident); ++ printk(KERN_INFO "%s machine detected. " ++ "Disabling APM.\n", d->ident); + } + return 0; + } +@@ -1919,7 +1940,8 @@ static int __init apm_is_horked_d850md(const struct dmi_system_id *d) + { + if (apm_info.disabled == 0) { + apm_info.disabled = 1; +- printk(KERN_INFO "%s machine detected. Disabling APM.\n", d->ident); ++ printk(KERN_INFO "%s machine detected. " ++ "Disabling APM.\n", d->ident); + printk(KERN_INFO "This bug is fixed in bios P15 which is available for \n"); + printk(KERN_INFO "download from support.intel.com \n"); + } +@@ -1931,7 +1953,8 @@ static int __init apm_likes_to_melt(const struct dmi_system_id *d) + { + if (apm_info.forbid_idle == 0) { + apm_info.forbid_idle = 1; +- printk(KERN_INFO "%s machine detected. Disabling APM idle calls.\n", d->ident); ++ printk(KERN_INFO "%s machine detected. 
" ++ "Disabling APM idle calls.\n", d->ident); + } + return 0; + } +@@ -1954,7 +1977,8 @@ static int __init apm_likes_to_melt(const struct dmi_system_id *d) + static int __init broken_apm_power(const struct dmi_system_id *d) + { + apm_info.get_power_status_broken = 1; +- printk(KERN_WARNING "BIOS strings suggest APM bugs, disabling power status reporting.\n"); ++ printk(KERN_WARNING "BIOS strings suggest APM bugs, " ++ "disabling power status reporting.\n"); + return 0; + } + +@@ -1965,7 +1989,8 @@ static int __init broken_apm_power(const struct dmi_system_id *d) + static int __init swab_apm_power_in_minutes(const struct dmi_system_id *d) + { + apm_info.get_power_status_swabinminutes = 1; +- printk(KERN_WARNING "BIOS strings suggest APM reports battery life in minutes and wrong byte order.\n"); ++ printk(KERN_WARNING "BIOS strings suggest APM reports battery life " ++ "in minutes and wrong byte order.\n"); + return 0; + } + +@@ -1990,8 +2015,8 @@ static struct dmi_system_id __initdata apm_dmi_table[] = { + apm_is_horked, "Dell Inspiron 2500", + { DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"), + DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 2500"), +- DMI_MATCH(DMI_BIOS_VENDOR,"Phoenix Technologies LTD"), +- DMI_MATCH(DMI_BIOS_VERSION,"A11"), }, ++ DMI_MATCH(DMI_BIOS_VENDOR, "Phoenix Technologies LTD"), ++ DMI_MATCH(DMI_BIOS_VERSION, "A11"), }, + }, + { /* Allow interrupts during suspend on Dell Inspiron laptops*/ + set_apm_ints, "Dell Inspiron", { +@@ -2014,15 +2039,15 @@ static struct dmi_system_id __initdata apm_dmi_table[] = { + apm_is_horked, "Dell Dimension 4100", + { DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"), + DMI_MATCH(DMI_PRODUCT_NAME, "XPS-Z"), +- DMI_MATCH(DMI_BIOS_VENDOR,"Intel Corp."), +- DMI_MATCH(DMI_BIOS_VERSION,"A11"), }, ++ DMI_MATCH(DMI_BIOS_VENDOR, "Intel Corp."), ++ DMI_MATCH(DMI_BIOS_VERSION, "A11"), }, + }, + { /* Allow interrupts during suspend on Compaq Laptops*/ + set_apm_ints, "Compaq 12XL125", + { DMI_MATCH(DMI_SYS_VENDOR, "Compaq"), + DMI_MATCH(DMI_PRODUCT_NAME, "Compaq PC"), + DMI_MATCH(DMI_BIOS_VENDOR, "Phoenix Technologies LTD"), +- DMI_MATCH(DMI_BIOS_VERSION,"4.06"), }, ++ DMI_MATCH(DMI_BIOS_VERSION, "4.06"), }, + }, + { /* Allow interrupts during APM or the clock goes slow */ + set_apm_ints, "ASUSTeK", +@@ -2064,15 +2089,15 @@ static struct dmi_system_id __initdata apm_dmi_table[] = { + apm_is_horked, "Sharp PC-PJ/AX", + { DMI_MATCH(DMI_SYS_VENDOR, "SHARP"), + DMI_MATCH(DMI_PRODUCT_NAME, "PC-PJ/AX"), +- DMI_MATCH(DMI_BIOS_VENDOR,"SystemSoft"), +- DMI_MATCH(DMI_BIOS_VERSION,"Version R2.08"), }, ++ DMI_MATCH(DMI_BIOS_VENDOR, "SystemSoft"), ++ DMI_MATCH(DMI_BIOS_VERSION, "Version R2.08"), }, + }, + { /* APM crashes */ + apm_is_horked, "Dell Inspiron 2500", + { DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"), + DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 2500"), +- DMI_MATCH(DMI_BIOS_VENDOR,"Phoenix Technologies LTD"), +- DMI_MATCH(DMI_BIOS_VERSION,"A11"), }, ++ DMI_MATCH(DMI_BIOS_VENDOR, "Phoenix Technologies LTD"), ++ DMI_MATCH(DMI_BIOS_VERSION, "A11"), }, + }, + { /* APM idle hangs */ + apm_likes_to_melt, "Jabil AMD", +@@ -2203,11 +2228,11 @@ static int __init apm_init(void) + return -ENODEV; + } + printk(KERN_INFO +- "apm: BIOS version %d.%d Flags 0x%02x (Driver version %s)\n", +- ((apm_info.bios.version >> 8) & 0xff), +- (apm_info.bios.version & 0xff), +- apm_info.bios.flags, +- driver_version); ++ "apm: BIOS version %d.%d Flags 0x%02x (Driver version %s)\n", ++ ((apm_info.bios.version >> 8) & 0xff), ++ (apm_info.bios.version & 0xff), ++ 
apm_info.bios.flags, ++ driver_version); + if ((apm_info.bios.flags & APM_32_BIT_SUPPORT) == 0) { + printk(KERN_INFO "apm: no 32 bit BIOS support\n"); + return -ENODEV; +@@ -2312,9 +2337,9 @@ static int __init apm_init(void) + } + wake_up_process(kapmd_task); + +- if (num_online_cpus() > 1 && !smp ) { ++ if (num_online_cpus() > 1 && !smp) { + printk(KERN_NOTICE +- "apm: disabled - APM is not SMP safe (power off active).\n"); ++ "apm: disabled - APM is not SMP safe (power off active).\n"); + return 0; + } + +@@ -2339,7 +2364,7 @@ static int __init apm_init(void) + + static void __exit apm_exit(void) + { +- int error; ++ int error; + + if (set_pm_idle) { + pm_idle = original_pm_idle; +diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c +index 0e45981..afd8446 100644 +--- a/arch/x86/kernel/asm-offsets_32.c ++++ b/arch/x86/kernel/asm-offsets_32.c +@@ -38,15 +38,15 @@ void foo(void); + + void foo(void) + { +- OFFSET(SIGCONTEXT_eax, sigcontext, eax); +- OFFSET(SIGCONTEXT_ebx, sigcontext, ebx); +- OFFSET(SIGCONTEXT_ecx, sigcontext, ecx); +- OFFSET(SIGCONTEXT_edx, sigcontext, edx); +- OFFSET(SIGCONTEXT_esi, sigcontext, esi); +- OFFSET(SIGCONTEXT_edi, sigcontext, edi); +- OFFSET(SIGCONTEXT_ebp, sigcontext, ebp); +- OFFSET(SIGCONTEXT_esp, sigcontext, esp); +- OFFSET(SIGCONTEXT_eip, sigcontext, eip); ++ OFFSET(IA32_SIGCONTEXT_ax, sigcontext, ax); ++ OFFSET(IA32_SIGCONTEXT_bx, sigcontext, bx); ++ OFFSET(IA32_SIGCONTEXT_cx, sigcontext, cx); ++ OFFSET(IA32_SIGCONTEXT_dx, sigcontext, dx); ++ OFFSET(IA32_SIGCONTEXT_si, sigcontext, si); ++ OFFSET(IA32_SIGCONTEXT_di, sigcontext, di); ++ OFFSET(IA32_SIGCONTEXT_bp, sigcontext, bp); ++ OFFSET(IA32_SIGCONTEXT_sp, sigcontext, sp); ++ OFFSET(IA32_SIGCONTEXT_ip, sigcontext, ip); + BLANK(); + + OFFSET(CPUINFO_x86, cpuinfo_x86, x86); +@@ -70,39 +70,38 @@ void foo(void) + OFFSET(TI_cpu, thread_info, cpu); + BLANK(); + +- OFFSET(GDS_size, Xgt_desc_struct, size); +- OFFSET(GDS_address, Xgt_desc_struct, address); +- OFFSET(GDS_pad, Xgt_desc_struct, pad); ++ OFFSET(GDS_size, desc_ptr, size); ++ OFFSET(GDS_address, desc_ptr, address); + BLANK(); + +- OFFSET(PT_EBX, pt_regs, ebx); +- OFFSET(PT_ECX, pt_regs, ecx); +- OFFSET(PT_EDX, pt_regs, edx); +- OFFSET(PT_ESI, pt_regs, esi); +- OFFSET(PT_EDI, pt_regs, edi); +- OFFSET(PT_EBP, pt_regs, ebp); +- OFFSET(PT_EAX, pt_regs, eax); +- OFFSET(PT_DS, pt_regs, xds); +- OFFSET(PT_ES, pt_regs, xes); +- OFFSET(PT_FS, pt_regs, xfs); +- OFFSET(PT_ORIG_EAX, pt_regs, orig_eax); +- OFFSET(PT_EIP, pt_regs, eip); +- OFFSET(PT_CS, pt_regs, xcs); +- OFFSET(PT_EFLAGS, pt_regs, eflags); +- OFFSET(PT_OLDESP, pt_regs, esp); +- OFFSET(PT_OLDSS, pt_regs, xss); ++ OFFSET(PT_EBX, pt_regs, bx); ++ OFFSET(PT_ECX, pt_regs, cx); ++ OFFSET(PT_EDX, pt_regs, dx); ++ OFFSET(PT_ESI, pt_regs, si); ++ OFFSET(PT_EDI, pt_regs, di); ++ OFFSET(PT_EBP, pt_regs, bp); ++ OFFSET(PT_EAX, pt_regs, ax); ++ OFFSET(PT_DS, pt_regs, ds); ++ OFFSET(PT_ES, pt_regs, es); ++ OFFSET(PT_FS, pt_regs, fs); ++ OFFSET(PT_ORIG_EAX, pt_regs, orig_ax); ++ OFFSET(PT_EIP, pt_regs, ip); ++ OFFSET(PT_CS, pt_regs, cs); ++ OFFSET(PT_EFLAGS, pt_regs, flags); ++ OFFSET(PT_OLDESP, pt_regs, sp); ++ OFFSET(PT_OLDSS, pt_regs, ss); + BLANK(); + + OFFSET(EXEC_DOMAIN_handler, exec_domain, handler); +- OFFSET(RT_SIGFRAME_sigcontext, rt_sigframe, uc.uc_mcontext); ++ OFFSET(IA32_RT_SIGFRAME_sigcontext, rt_sigframe, uc.uc_mcontext); + BLANK(); + + OFFSET(pbe_address, pbe, address); + OFFSET(pbe_orig_address, pbe, orig_address); + OFFSET(pbe_next, pbe, next); + +- /* Offset from 
the sysenter stack to tss.esp0 */ +- DEFINE(TSS_sysenter_esp0, offsetof(struct tss_struct, x86_tss.esp0) - ++ /* Offset from the sysenter stack to tss.sp0 */ ++ DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) - + sizeof(struct tss_struct)); + + DEFINE(PAGE_SIZE_asm, PAGE_SIZE); +@@ -111,8 +110,6 @@ void foo(void) + DEFINE(PTRS_PER_PMD, PTRS_PER_PMD); + DEFINE(PTRS_PER_PGD, PTRS_PER_PGD); + +- DEFINE(VDSO_PRELINK_asm, VDSO_PRELINK); +- + OFFSET(crypto_tfm_ctx_offset, crypto_tfm, __crt_ctx); + + #ifdef CONFIG_PARAVIRT +@@ -123,7 +120,7 @@ void foo(void) + OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable); + OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable); + OFFSET(PV_CPU_iret, pv_cpu_ops, iret); +- OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit); ++ OFFSET(PV_CPU_irq_enable_syscall_ret, pv_cpu_ops, irq_enable_syscall_ret); + OFFSET(PV_CPU_read_cr0, pv_cpu_ops, read_cr0); + #endif + +diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c +index d1b6ed9..494e1e0 100644 +--- a/arch/x86/kernel/asm-offsets_64.c ++++ b/arch/x86/kernel/asm-offsets_64.c +@@ -38,7 +38,6 @@ int main(void) + #define ENTRY(entry) DEFINE(tsk_ ## entry, offsetof(struct task_struct, entry)) + ENTRY(state); + ENTRY(flags); +- ENTRY(thread); + ENTRY(pid); + BLANK(); + #undef ENTRY +@@ -47,6 +46,9 @@ int main(void) + ENTRY(addr_limit); + ENTRY(preempt_count); + ENTRY(status); ++#ifdef CONFIG_IA32_EMULATION ++ ENTRY(sysenter_return); ++#endif + BLANK(); + #undef ENTRY + #define ENTRY(entry) DEFINE(pda_ ## entry, offsetof(struct x8664_pda, entry)) +@@ -59,17 +61,31 @@ int main(void) + ENTRY(data_offset); + BLANK(); + #undef ENTRY ++#ifdef CONFIG_PARAVIRT ++ BLANK(); ++ OFFSET(PARAVIRT_enabled, pv_info, paravirt_enabled); ++ OFFSET(PARAVIRT_PATCH_pv_cpu_ops, paravirt_patch_template, pv_cpu_ops); ++ OFFSET(PARAVIRT_PATCH_pv_irq_ops, paravirt_patch_template, pv_irq_ops); ++ OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable); ++ OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable); ++ OFFSET(PV_CPU_iret, pv_cpu_ops, iret); ++ OFFSET(PV_CPU_irq_enable_syscall_ret, pv_cpu_ops, irq_enable_syscall_ret); ++ OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs); ++ OFFSET(PV_MMU_read_cr2, pv_mmu_ops, read_cr2); ++#endif ++ ++ + #ifdef CONFIG_IA32_EMULATION + #define ENTRY(entry) DEFINE(IA32_SIGCONTEXT_ ## entry, offsetof(struct sigcontext_ia32, entry)) +- ENTRY(eax); +- ENTRY(ebx); +- ENTRY(ecx); +- ENTRY(edx); +- ENTRY(esi); +- ENTRY(edi); +- ENTRY(ebp); +- ENTRY(esp); +- ENTRY(eip); ++ ENTRY(ax); ++ ENTRY(bx); ++ ENTRY(cx); ++ ENTRY(dx); ++ ENTRY(si); ++ ENTRY(di); ++ ENTRY(bp); ++ ENTRY(sp); ++ ENTRY(ip); + BLANK(); + #undef ENTRY + DEFINE(IA32_RT_SIGFRAME_sigcontext, +@@ -81,14 +97,14 @@ int main(void) + DEFINE(pbe_next, offsetof(struct pbe, next)); + BLANK(); + #define ENTRY(entry) DEFINE(pt_regs_ ## entry, offsetof(struct pt_regs, entry)) +- ENTRY(rbx); +- ENTRY(rbx); +- ENTRY(rcx); +- ENTRY(rdx); +- ENTRY(rsp); +- ENTRY(rbp); +- ENTRY(rsi); +- ENTRY(rdi); ++ ENTRY(bx); ++ ENTRY(bx); ++ ENTRY(cx); ++ ENTRY(dx); ++ ENTRY(sp); ++ ENTRY(bp); ++ ENTRY(si); ++ ENTRY(di); + ENTRY(r8); + ENTRY(r9); + ENTRY(r10); +@@ -97,7 +113,7 @@ int main(void) + ENTRY(r13); + ENTRY(r14); + ENTRY(r15); +- ENTRY(eflags); ++ ENTRY(flags); + BLANK(); + #undef ENTRY + #define ENTRY(entry) DEFINE(saved_context_ ## entry, offsetof(struct saved_context, entry)) +@@ -108,7 +124,7 @@ int main(void) + ENTRY(cr8); + BLANK(); + #undef ENTRY +- DEFINE(TSS_ist, offsetof(struct tss_struct, ist)); ++ DEFINE(TSS_ist, 
offsetof(struct tss_struct, x86_tss.ist)); + BLANK(); + DEFINE(crypto_tfm_ctx_offset, offsetof(struct crypto_tfm, __crt_ctx)); + BLANK(); +diff --git a/arch/x86/kernel/bootflag.c b/arch/x86/kernel/bootflag.c +index 0b98605..30f25a7 100644 +--- a/arch/x86/kernel/bootflag.c ++++ b/arch/x86/kernel/bootflag.c +@@ -1,8 +1,6 @@ + /* + * Implement 'Simple Boot Flag Specification 2.0' + */ +- +- + #include + #include + #include +@@ -14,40 +12,38 @@ + + #include + +- + #define SBF_RESERVED (0x78) + #define SBF_PNPOS (1<<0) + #define SBF_BOOTING (1<<1) + #define SBF_DIAG (1<<2) + #define SBF_PARITY (1<<7) + +- + int sbf_port __initdata = -1; /* set via acpi_boot_init() */ + +- + static int __init parity(u8 v) + { + int x = 0; + int i; +- +- for(i=0;i<8;i++) +- { +- x^=(v&1); +- v>>=1; ++ ++ for (i = 0; i < 8; i++) { ++ x ^= (v & 1); ++ v >>= 1; + } ++ + return x; + } + + static void __init sbf_write(u8 v) + { + unsigned long flags; +- if(sbf_port != -1) +- { ++ ++ if (sbf_port != -1) { + v &= ~SBF_PARITY; +- if(!parity(v)) +- v|=SBF_PARITY; ++ if (!parity(v)) ++ v |= SBF_PARITY; + +- printk(KERN_INFO "Simple Boot Flag at 0x%x set to 0x%x\n", sbf_port, v); ++ printk(KERN_INFO "Simple Boot Flag at 0x%x set to 0x%x\n", ++ sbf_port, v); + + spin_lock_irqsave(&rtc_lock, flags); + CMOS_WRITE(v, sbf_port); +@@ -57,33 +53,41 @@ static void __init sbf_write(u8 v) + + static u8 __init sbf_read(void) + { +- u8 v; + unsigned long flags; +- if(sbf_port == -1) ++ u8 v; ++ ++ if (sbf_port == -1) + return 0; ++ + spin_lock_irqsave(&rtc_lock, flags); + v = CMOS_READ(sbf_port); + spin_unlock_irqrestore(&rtc_lock, flags); ++ + return v; + } + + static int __init sbf_value_valid(u8 v) + { +- if(v&SBF_RESERVED) /* Reserved bits */ ++ if (v & SBF_RESERVED) /* Reserved bits */ + return 0; +- if(!parity(v)) ++ if (!parity(v)) + return 0; ++ + return 1; + } + + static int __init sbf_init(void) + { + u8 v; +- if(sbf_port == -1) ++ ++ if (sbf_port == -1) + return 0; ++ + v = sbf_read(); +- if(!sbf_value_valid(v)) +- printk(KERN_WARNING "Simple Boot Flag value 0x%x read from CMOS RAM was invalid\n",v); ++ if (!sbf_value_valid(v)) { ++ printk(KERN_WARNING "Simple Boot Flag value 0x%x read from " ++ "CMOS RAM was invalid\n", v); ++ } + + v &= ~SBF_RESERVED; + v &= ~SBF_BOOTING; +@@ -92,7 +96,7 @@ static int __init sbf_init(void) + v |= SBF_PNPOS; + #endif + sbf_write(v); ++ + return 0; + } +- + module_init(sbf_init); +diff --git a/arch/x86/kernel/bugs_64.c b/arch/x86/kernel/bugs_64.c +index 9a189ce..8f520f9 100644 +--- a/arch/x86/kernel/bugs_64.c ++++ b/arch/x86/kernel/bugs_64.c +@@ -13,7 +13,6 @@ + void __init check_bugs(void) + { + identify_cpu(&boot_cpu_data); +- mtrr_bp_init(); + #if !defined(CONFIG_SMP) + printk("CPU: "); + print_cpu_info(&boot_cpu_data); +diff --git a/arch/x86/kernel/cpu/addon_cpuid_features.c b/arch/x86/kernel/cpu/addon_cpuid_features.c +index 3e91d3e..238468a 100644 +--- a/arch/x86/kernel/cpu/addon_cpuid_features.c ++++ b/arch/x86/kernel/cpu/addon_cpuid_features.c +@@ -45,6 +45,6 @@ void __cpuinit init_scattered_cpuid_features(struct cpuinfo_x86 *c) + ®s[CR_ECX], ®s[CR_EDX]); + + if (regs[cb->reg] & (1 << cb->bit)) +- set_bit(cb->feature, c->x86_capability); ++ set_cpu_cap(c, cb->feature); + } + } +diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c +index 1ff88c7..06fa159 100644 +--- a/arch/x86/kernel/cpu/amd.c ++++ b/arch/x86/kernel/cpu/amd.c +@@ -63,6 +63,15 @@ static __cpuinit int amd_apic_timer_broken(void) + + int force_mwait __cpuinitdata; + ++void __cpuinit early_init_amd(struct 
cpuinfo_x86 *c) ++{ ++ if (cpuid_eax(0x80000000) >= 0x80000007) { ++ c->x86_power = cpuid_edx(0x80000007); ++ if (c->x86_power & (1<<8)) ++ set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability); ++ } ++} ++ + static void __cpuinit init_amd(struct cpuinfo_x86 *c) + { + u32 l, h; +@@ -85,6 +94,8 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c) + } + #endif + ++ early_init_amd(c); ++ + /* + * FIXME: We should handle the K5 here. Set up the write + * range and also turn on MSR 83 bits 4 and 31 (write alloc, +@@ -257,12 +268,6 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c) + c->x86_max_cores = (cpuid_ecx(0x80000008) & 0xff) + 1; + } + +- if (cpuid_eax(0x80000000) >= 0x80000007) { +- c->x86_power = cpuid_edx(0x80000007); +- if (c->x86_power & (1<<8)) +- set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability); +- } +- + #ifdef CONFIG_X86_HT + /* + * On a AMD multi core setup the lower bits of the APIC id +@@ -295,12 +300,12 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c) + local_apic_timer_disabled = 1; + #endif + +- if (c->x86 == 0x10 && !force_mwait) +- clear_bit(X86_FEATURE_MWAIT, c->x86_capability); +- + /* K6s reports MCEs but don't actually have all the MSRs */ + if (c->x86 < 6) + clear_bit(X86_FEATURE_MCE, c->x86_capability); ++ ++ if (cpu_has_xmm) ++ set_bit(X86_FEATURE_MFENCE_RDTSC, c->x86_capability); + } + + static unsigned int __cpuinit amd_size_cache(struct cpuinfo_x86 * c, unsigned int size) +diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c +index 205fd5b..9b95edc 100644 +--- a/arch/x86/kernel/cpu/bugs.c ++++ b/arch/x86/kernel/cpu/bugs.c +@@ -11,6 +11,7 @@ + #include + #include + #include ++#include + #include + #include + #include +@@ -35,7 +36,7 @@ __setup("mca-pentium", mca_pentium); + static int __init no_387(char *s) + { + boot_cpu_data.hard_math = 0; +- write_cr0(0xE | read_cr0()); ++ write_cr0(X86_CR0_TS | X86_CR0_EM | X86_CR0_MP | read_cr0()); + return 1; + } + +@@ -153,7 +154,7 @@ static void __init check_config(void) + * If we configured ourselves for a TSC, we'd better have one! + */ + #ifdef CONFIG_X86_TSC +- if (!cpu_has_tsc && !tsc_disable) ++ if (!cpu_has_tsc) + panic("Kernel compiled for Pentium+, requires TSC feature!"); + #endif + +diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c +index e2fcf20..db28aa9 100644 +--- a/arch/x86/kernel/cpu/common.c ++++ b/arch/x86/kernel/cpu/common.c +@@ -22,43 +22,48 @@ + #include "cpu.h" + + DEFINE_PER_CPU(struct gdt_page, gdt_page) = { .gdt = { +- [GDT_ENTRY_KERNEL_CS] = { 0x0000ffff, 0x00cf9a00 }, +- [GDT_ENTRY_KERNEL_DS] = { 0x0000ffff, 0x00cf9200 }, +- [GDT_ENTRY_DEFAULT_USER_CS] = { 0x0000ffff, 0x00cffa00 }, +- [GDT_ENTRY_DEFAULT_USER_DS] = { 0x0000ffff, 0x00cff200 }, ++ [GDT_ENTRY_KERNEL_CS] = { { { 0x0000ffff, 0x00cf9a00 } } }, ++ [GDT_ENTRY_KERNEL_DS] = { { { 0x0000ffff, 0x00cf9200 } } }, ++ [GDT_ENTRY_DEFAULT_USER_CS] = { { { 0x0000ffff, 0x00cffa00 } } }, ++ [GDT_ENTRY_DEFAULT_USER_DS] = { { { 0x0000ffff, 0x00cff200 } } }, + /* + * Segments used for calling PnP BIOS have byte granularity. + * They code segments and data segments have fixed 64k limits, + * the transfer segment sizes are set at run time. 
+ */ +- [GDT_ENTRY_PNPBIOS_CS32] = { 0x0000ffff, 0x00409a00 },/* 32-bit code */ +- [GDT_ENTRY_PNPBIOS_CS16] = { 0x0000ffff, 0x00009a00 },/* 16-bit code */ +- [GDT_ENTRY_PNPBIOS_DS] = { 0x0000ffff, 0x00009200 }, /* 16-bit data */ +- [GDT_ENTRY_PNPBIOS_TS1] = { 0x00000000, 0x00009200 },/* 16-bit data */ +- [GDT_ENTRY_PNPBIOS_TS2] = { 0x00000000, 0x00009200 },/* 16-bit data */ ++ /* 32-bit code */ ++ [GDT_ENTRY_PNPBIOS_CS32] = { { { 0x0000ffff, 0x00409a00 } } }, ++ /* 16-bit code */ ++ [GDT_ENTRY_PNPBIOS_CS16] = { { { 0x0000ffff, 0x00009a00 } } }, ++ /* 16-bit data */ ++ [GDT_ENTRY_PNPBIOS_DS] = { { { 0x0000ffff, 0x00009200 } } }, ++ /* 16-bit data */ ++ [GDT_ENTRY_PNPBIOS_TS1] = { { { 0x00000000, 0x00009200 } } }, ++ /* 16-bit data */ ++ [GDT_ENTRY_PNPBIOS_TS2] = { { { 0x00000000, 0x00009200 } } }, + /* + * The APM segments have byte granularity and their bases + * are set at run time. All have 64k limits. + */ +- [GDT_ENTRY_APMBIOS_BASE] = { 0x0000ffff, 0x00409a00 },/* 32-bit code */ ++ /* 32-bit code */ ++ [GDT_ENTRY_APMBIOS_BASE] = { { { 0x0000ffff, 0x00409a00 } } }, + /* 16-bit code */ +- [GDT_ENTRY_APMBIOS_BASE+1] = { 0x0000ffff, 0x00009a00 }, +- [GDT_ENTRY_APMBIOS_BASE+2] = { 0x0000ffff, 0x00409200 }, /* data */ ++ [GDT_ENTRY_APMBIOS_BASE+1] = { { { 0x0000ffff, 0x00009a00 } } }, ++ /* data */ ++ [GDT_ENTRY_APMBIOS_BASE+2] = { { { 0x0000ffff, 0x00409200 } } }, + +- [GDT_ENTRY_ESPFIX_SS] = { 0x00000000, 0x00c09200 }, +- [GDT_ENTRY_PERCPU] = { 0x00000000, 0x00000000 }, ++ [GDT_ENTRY_ESPFIX_SS] = { { { 0x00000000, 0x00c09200 } } }, ++ [GDT_ENTRY_PERCPU] = { { { 0x00000000, 0x00000000 } } }, + } }; + EXPORT_PER_CPU_SYMBOL_GPL(gdt_page); + ++__u32 cleared_cpu_caps[NCAPINTS] __cpuinitdata; ++ + static int cachesize_override __cpuinitdata = -1; +-static int disable_x86_fxsr __cpuinitdata; + static int disable_x86_serial_nr __cpuinitdata = 1; +-static int disable_x86_sep __cpuinitdata; + + struct cpu_dev * cpu_devs[X86_VENDOR_NUM] = {}; + +-extern int disable_pse; +- + static void __cpuinit default_init(struct cpuinfo_x86 * c) + { + /* Not much we can do here... */ +@@ -207,16 +212,8 @@ static void __cpuinit get_cpu_vendor(struct cpuinfo_x86 *c, int early) + + static int __init x86_fxsr_setup(char * s) + { +- /* Tell all the other CPUs to not use it... */ +- disable_x86_fxsr = 1; +- +- /* +- * ... and clear the bits early in the boot_cpu_data +- * so that the bootup process doesn't try to do this +- * either. 
+- */ +- clear_bit(X86_FEATURE_FXSR, boot_cpu_data.x86_capability); +- clear_bit(X86_FEATURE_XMM, boot_cpu_data.x86_capability); ++ setup_clear_cpu_cap(X86_FEATURE_FXSR); ++ setup_clear_cpu_cap(X86_FEATURE_XMM); + return 1; + } + __setup("nofxsr", x86_fxsr_setup); +@@ -224,7 +221,7 @@ __setup("nofxsr", x86_fxsr_setup); + + static int __init x86_sep_setup(char * s) + { +- disable_x86_sep = 1; ++ setup_clear_cpu_cap(X86_FEATURE_SEP); + return 1; + } + __setup("nosep", x86_sep_setup); +@@ -281,6 +278,33 @@ void __init cpu_detect(struct cpuinfo_x86 *c) + c->x86_cache_alignment = ((misc >> 8) & 0xff) * 8; + } + } ++static void __cpuinit early_get_cap(struct cpuinfo_x86 *c) ++{ ++ u32 tfms, xlvl; ++ int ebx; ++ ++ memset(&c->x86_capability, 0, sizeof c->x86_capability); ++ if (have_cpuid_p()) { ++ /* Intel-defined flags: level 0x00000001 */ ++ if (c->cpuid_level >= 0x00000001) { ++ u32 capability, excap; ++ cpuid(0x00000001, &tfms, &ebx, &excap, &capability); ++ c->x86_capability[0] = capability; ++ c->x86_capability[4] = excap; ++ } ++ ++ /* AMD-defined flags: level 0x80000001 */ ++ xlvl = cpuid_eax(0x80000000); ++ if ((xlvl & 0xffff0000) == 0x80000000) { ++ if (xlvl >= 0x80000001) { ++ c->x86_capability[1] = cpuid_edx(0x80000001); ++ c->x86_capability[6] = cpuid_ecx(0x80000001); ++ } ++ } ++ ++ } ++ ++} + + /* Do minimum CPU detection early. + Fields really needed: vendor, cpuid_level, family, model, mask, cache alignment. +@@ -300,6 +324,17 @@ static void __init early_cpu_detect(void) + cpu_detect(c); + + get_cpu_vendor(c, 1); ++ ++ switch (c->x86_vendor) { ++ case X86_VENDOR_AMD: ++ early_init_amd(c); ++ break; ++ case X86_VENDOR_INTEL: ++ early_init_intel(c); ++ break; ++ } ++ ++ early_get_cap(c); + } + + static void __cpuinit generic_identify(struct cpuinfo_x86 * c) +@@ -357,8 +392,6 @@ static void __cpuinit generic_identify(struct cpuinfo_x86 * c) + init_scattered_cpuid_features(c); + } + +- early_intel_workaround(c); +- + #ifdef CONFIG_X86_HT + c->phys_proc_id = (cpuid_ebx(1) >> 24) & 0xff; + #endif +@@ -392,7 +425,7 @@ __setup("serialnumber", x86_serial_nr_setup); + /* + * This does the hard work of actually picking apart the CPU stuff... + */ +-static void __cpuinit identify_cpu(struct cpuinfo_x86 *c) ++void __cpuinit identify_cpu(struct cpuinfo_x86 *c) + { + int i; + +@@ -418,20 +451,9 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c) + + generic_identify(c); + +- printk(KERN_DEBUG "CPU: After generic identify, caps:"); +- for (i = 0; i < NCAPINTS; i++) +- printk(" %08lx", c->x86_capability[i]); +- printk("\n"); +- +- if (this_cpu->c_identify) { ++ if (this_cpu->c_identify) + this_cpu->c_identify(c); + +- printk(KERN_DEBUG "CPU: After vendor identify, caps:"); +- for (i = 0; i < NCAPINTS; i++) +- printk(" %08lx", c->x86_capability[i]); +- printk("\n"); +- } +- + /* + * Vendor-specific initialization. In this section we + * canonicalize the feature flags, meaning if there are +@@ -453,23 +475,6 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c) + * we do "generic changes." + */ + +- /* TSC disabled? */ +- if ( tsc_disable ) +- clear_bit(X86_FEATURE_TSC, c->x86_capability); +- +- /* FXSR disabled? */ +- if (disable_x86_fxsr) { +- clear_bit(X86_FEATURE_FXSR, c->x86_capability); +- clear_bit(X86_FEATURE_XMM, c->x86_capability); +- } +- +- /* SEP disabled? */ +- if (disable_x86_sep) +- clear_bit(X86_FEATURE_SEP, c->x86_capability); +- +- if (disable_pse) +- clear_bit(X86_FEATURE_PSE, c->x86_capability); +- + /* If the model name is still unset, do table lookup. 
*/ + if ( !c->x86_model_id[0] ) { + char *p; +@@ -482,13 +487,6 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c) + c->x86, c->x86_model); + } + +- /* Now the feature flags better reflect actual CPU features! */ +- +- printk(KERN_DEBUG "CPU: After all inits, caps:"); +- for (i = 0; i < NCAPINTS; i++) +- printk(" %08lx", c->x86_capability[i]); +- printk("\n"); +- + /* + * On SMP, boot_cpu_data holds the common feature set between + * all CPUs; so make sure that we indicate which features are +@@ -501,8 +499,14 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c) + boot_cpu_data.x86_capability[i] &= c->x86_capability[i]; + } + ++ /* Clear all flags overriden by options */ ++ for (i = 0; i < NCAPINTS; i++) ++ c->x86_capability[i] ^= cleared_cpu_caps[i]; ++ + /* Init Machine Check Exception if available. */ + mcheck_init(c); ++ ++ select_idle_routine(c); + } + + void __init identify_boot_cpu(void) +@@ -510,7 +514,6 @@ void __init identify_boot_cpu(void) + identify_cpu(&boot_cpu_data); + sysenter_setup(); + enable_sep_cpu(); +- mtrr_bp_init(); + } + + void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c) +@@ -567,6 +570,13 @@ void __cpuinit detect_ht(struct cpuinfo_x86 *c) + } + #endif + ++static __init int setup_noclflush(char *arg) ++{ ++ setup_clear_cpu_cap(X86_FEATURE_CLFLSH); ++ return 1; ++} ++__setup("noclflush", setup_noclflush); ++ + void __cpuinit print_cpu_info(struct cpuinfo_x86 *c) + { + char *vendor = NULL; +@@ -590,6 +600,17 @@ void __cpuinit print_cpu_info(struct cpuinfo_x86 *c) + printk("\n"); + } + ++static __init int setup_disablecpuid(char *arg) ++{ ++ int bit; ++ if (get_option(&arg, &bit) && bit < NCAPINTS*32) ++ setup_clear_cpu_cap(bit); ++ else ++ return 0; ++ return 1; ++} ++__setup("clearcpuid=", setup_disablecpuid); ++ + cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE; + + /* This is hacky. :) +@@ -620,21 +641,13 @@ void __init early_cpu_init(void) + nexgen_init_cpu(); + umc_init_cpu(); + early_cpu_detect(); +- +-#ifdef CONFIG_DEBUG_PAGEALLOC +- /* pse is not compatible with on-the-fly unmapping, +- * disable it even if the cpus claim to support it. +- */ +- clear_bit(X86_FEATURE_PSE, boot_cpu_data.x86_capability); +- disable_pse = 1; +-#endif + } + + /* Make sure %fs is initialized properly in idle threads */ + struct pt_regs * __devinit idle_regs(struct pt_regs *regs) + { + memset(regs, 0, sizeof(struct pt_regs)); +- regs->xfs = __KERNEL_PERCPU; ++ regs->fs = __KERNEL_PERCPU; + return regs; + } + +@@ -642,7 +655,7 @@ struct pt_regs * __devinit idle_regs(struct pt_regs *regs) + * it's on the real one. */ + void switch_to_new_gdt(void) + { +- struct Xgt_desc_struct gdt_descr; ++ struct desc_ptr gdt_descr; + + gdt_descr.address = (long)get_cpu_gdt_table(smp_processor_id()); + gdt_descr.size = GDT_SIZE - 1; +@@ -672,12 +685,6 @@ void __cpuinit cpu_init(void) + + if (cpu_has_vme || cpu_has_tsc || cpu_has_de) + clear_in_cr4(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE); +- if (tsc_disable && cpu_has_tsc) { +- printk(KERN_NOTICE "Disabling TSC...\n"); +- /**** FIX-HPA: DOES THIS REALLY BELONG HERE? 
****/ +- clear_bit(X86_FEATURE_TSC, boot_cpu_data.x86_capability); +- set_in_cr4(X86_CR4_TSD); +- } + + load_idt(&idt_descr); + switch_to_new_gdt(); +@@ -691,7 +698,7 @@ void __cpuinit cpu_init(void) + BUG(); + enter_lazy_tlb(&init_mm, curr); + +- load_esp0(t, thread); ++ load_sp0(t, thread); + set_tss_desc(cpu,t); + load_TR_desc(); + load_LDT(&init_mm.context); +diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h +index 2f6432c..ad6527a 100644 +--- a/arch/x86/kernel/cpu/cpu.h ++++ b/arch/x86/kernel/cpu/cpu.h +@@ -24,5 +24,6 @@ extern struct cpu_dev * cpu_devs [X86_VENDOR_NUM]; + extern int get_model_name(struct cpuinfo_x86 *c); + extern void display_cacheinfo(struct cpuinfo_x86 *c); + +-extern void early_intel_workaround(struct cpuinfo_x86 *c); ++extern void early_init_intel(struct cpuinfo_x86 *c); ++extern void early_init_amd(struct cpuinfo_x86 *c); + +diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c +index fea0af0..a962dcb 100644 +--- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c ++++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c +@@ -67,7 +67,8 @@ struct acpi_cpufreq_data { + unsigned int cpu_feature; + }; + +-static struct acpi_cpufreq_data *drv_data[NR_CPUS]; ++static DEFINE_PER_CPU(struct acpi_cpufreq_data *, drv_data); ++ + /* acpi_perf_data is a pointer to percpu data. */ + static struct acpi_processor_performance *acpi_perf_data; + +@@ -218,14 +219,14 @@ static u32 get_cur_val(cpumask_t mask) + if (unlikely(cpus_empty(mask))) + return 0; + +- switch (drv_data[first_cpu(mask)]->cpu_feature) { ++ switch (per_cpu(drv_data, first_cpu(mask))->cpu_feature) { + case SYSTEM_INTEL_MSR_CAPABLE: + cmd.type = SYSTEM_INTEL_MSR_CAPABLE; + cmd.addr.msr.reg = MSR_IA32_PERF_STATUS; + break; + case SYSTEM_IO_CAPABLE: + cmd.type = SYSTEM_IO_CAPABLE; +- perf = drv_data[first_cpu(mask)]->acpi_data; ++ perf = per_cpu(drv_data, first_cpu(mask))->acpi_data; + cmd.addr.io.port = perf->control_register.address; + cmd.addr.io.bit_width = perf->control_register.bit_width; + break; +@@ -325,7 +326,7 @@ static unsigned int get_measured_perf(unsigned int cpu) + + #endif + +- retval = drv_data[cpu]->max_freq * perf_percent / 100; ++ retval = per_cpu(drv_data, cpu)->max_freq * perf_percent / 100; + + put_cpu(); + set_cpus_allowed(current, saved_mask); +@@ -336,7 +337,7 @@ static unsigned int get_measured_perf(unsigned int cpu) + + static unsigned int get_cur_freq_on_cpu(unsigned int cpu) + { +- struct acpi_cpufreq_data *data = drv_data[cpu]; ++ struct acpi_cpufreq_data *data = per_cpu(drv_data, cpu); + unsigned int freq; + + dprintk("get_cur_freq_on_cpu (%d)\n", cpu); +@@ -370,7 +371,7 @@ static unsigned int check_freqs(cpumask_t mask, unsigned int freq, + static int acpi_cpufreq_target(struct cpufreq_policy *policy, + unsigned int target_freq, unsigned int relation) + { +- struct acpi_cpufreq_data *data = drv_data[policy->cpu]; ++ struct acpi_cpufreq_data *data = per_cpu(drv_data, policy->cpu); + struct acpi_processor_performance *perf; + struct cpufreq_freqs freqs; + cpumask_t online_policy_cpus; +@@ -466,7 +467,7 @@ static int acpi_cpufreq_target(struct cpufreq_policy *policy, + + static int acpi_cpufreq_verify(struct cpufreq_policy *policy) + { +- struct acpi_cpufreq_data *data = drv_data[policy->cpu]; ++ struct acpi_cpufreq_data *data = per_cpu(drv_data, policy->cpu); + + dprintk("acpi_cpufreq_verify\n"); + +@@ -570,7 +571,7 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy) + return -ENOMEM; + + data->acpi_data = 
percpu_ptr(acpi_perf_data, cpu); +- drv_data[cpu] = data; ++ per_cpu(drv_data, cpu) = data; + + if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) + acpi_cpufreq_driver.flags |= CPUFREQ_CONST_LOOPS; +@@ -714,20 +715,20 @@ err_unreg: + acpi_processor_unregister_performance(perf, cpu); + err_free: + kfree(data); +- drv_data[cpu] = NULL; ++ per_cpu(drv_data, cpu) = NULL; + + return result; + } + + static int acpi_cpufreq_cpu_exit(struct cpufreq_policy *policy) + { +- struct acpi_cpufreq_data *data = drv_data[policy->cpu]; ++ struct acpi_cpufreq_data *data = per_cpu(drv_data, policy->cpu); + + dprintk("acpi_cpufreq_cpu_exit\n"); + + if (data) { + cpufreq_frequency_table_put_attr(policy->cpu); +- drv_data[policy->cpu] = NULL; ++ per_cpu(drv_data, policy->cpu) = NULL; + acpi_processor_unregister_performance(data->acpi_data, + policy->cpu); + kfree(data); +@@ -738,7 +739,7 @@ static int acpi_cpufreq_cpu_exit(struct cpufreq_policy *policy) + + static int acpi_cpufreq_resume(struct cpufreq_policy *policy) + { +- struct acpi_cpufreq_data *data = drv_data[policy->cpu]; ++ struct acpi_cpufreq_data *data = per_cpu(drv_data, policy->cpu); + + dprintk("acpi_cpufreq_resume\n"); + +diff --git a/arch/x86/kernel/cpu/cpufreq/longhaul.c b/arch/x86/kernel/cpu/cpufreq/longhaul.c +index 749d00c..06fcce5 100644 +--- a/arch/x86/kernel/cpu/cpufreq/longhaul.c ++++ b/arch/x86/kernel/cpu/cpufreq/longhaul.c +@@ -694,7 +694,7 @@ static acpi_status longhaul_walk_callback(acpi_handle obj_handle, + if ( acpi_bus_get_device(obj_handle, &d) ) { + return 0; + } +- *return_value = (void *)acpi_driver_data(d); ++ *return_value = acpi_driver_data(d); + return 1; + } + +diff --git a/arch/x86/kernel/cpu/cpufreq/powernow-k8.c b/arch/x86/kernel/cpu/cpufreq/powernow-k8.c +index 99e1ef9..a052273 100644 +--- a/arch/x86/kernel/cpu/cpufreq/powernow-k8.c ++++ b/arch/x86/kernel/cpu/cpufreq/powernow-k8.c +@@ -52,7 +52,7 @@ + /* serialize freq changes */ + static DEFINE_MUTEX(fidvid_mutex); + +-static struct powernow_k8_data *powernow_data[NR_CPUS]; ++static DEFINE_PER_CPU(struct powernow_k8_data *, powernow_data); + + static int cpu_family = CPU_OPTERON; + +@@ -1018,7 +1018,7 @@ static int transition_frequency_pstate(struct powernow_k8_data *data, unsigned i + static int powernowk8_target(struct cpufreq_policy *pol, unsigned targfreq, unsigned relation) + { + cpumask_t oldmask = CPU_MASK_ALL; +- struct powernow_k8_data *data = powernow_data[pol->cpu]; ++ struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu); + u32 checkfid; + u32 checkvid; + unsigned int newstate; +@@ -1094,7 +1094,7 @@ err_out: + /* Driver entry point to verify the policy and range of frequencies */ + static int powernowk8_verify(struct cpufreq_policy *pol) + { +- struct powernow_k8_data *data = powernow_data[pol->cpu]; ++ struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu); + + if (!data) + return -EINVAL; +@@ -1202,7 +1202,7 @@ static int __cpuinit powernowk8_cpu_init(struct cpufreq_policy *pol) + dprintk("cpu_init done, current fid 0x%x, vid 0x%x\n", + data->currfid, data->currvid); + +- powernow_data[pol->cpu] = data; ++ per_cpu(powernow_data, pol->cpu) = data; + + return 0; + +@@ -1216,7 +1216,7 @@ err_out: + + static int __devexit powernowk8_cpu_exit (struct cpufreq_policy *pol) + { +- struct powernow_k8_data *data = powernow_data[pol->cpu]; ++ struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu); + + if (!data) + return -EINVAL; +@@ -1237,7 +1237,7 @@ static unsigned int powernowk8_get (unsigned int cpu) + cpumask_t oldmask = 
current->cpus_allowed; + unsigned int khz = 0; + +- data = powernow_data[first_cpu(per_cpu(cpu_core_map, cpu))]; ++ data = per_cpu(powernow_data, first_cpu(per_cpu(cpu_core_map, cpu))); + + if (!data) + return -EINVAL; +diff --git a/arch/x86/kernel/cpu/cyrix.c b/arch/x86/kernel/cpu/cyrix.c +index 88d66fb..404a6a2 100644 +--- a/arch/x86/kernel/cpu/cyrix.c ++++ b/arch/x86/kernel/cpu/cyrix.c +@@ -5,6 +5,7 @@ + #include + #include + #include ++#include + #include + #include + #include +@@ -126,15 +127,12 @@ static void __cpuinit set_cx86_reorder(void) + + static void __cpuinit set_cx86_memwb(void) + { +- u32 cr0; +- + printk(KERN_INFO "Enable Memory-Write-back mode on Cyrix/NSC processor.\n"); + + /* CCR2 bit 2: unlock NW bit */ + setCx86(CX86_CCR2, getCx86(CX86_CCR2) & ~0x04); + /* set 'Not Write-through' */ +- cr0 = 0x20000000; +- write_cr0(read_cr0() | cr0); ++ write_cr0(read_cr0() | X86_CR0_NW); + /* CCR2 bit 2: lock NW bit and set WT1 */ + setCx86(CX86_CCR2, getCx86(CX86_CCR2) | 0x14 ); + } +diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c +index cc8c501..d1c372b 100644 +--- a/arch/x86/kernel/cpu/intel.c ++++ b/arch/x86/kernel/cpu/intel.c +@@ -11,6 +11,8 @@ + #include + #include + #include ++#include ++#include + + #include "cpu.h" + +@@ -27,13 +29,14 @@ + struct movsl_mask movsl_mask __read_mostly; + #endif + +-void __cpuinit early_intel_workaround(struct cpuinfo_x86 *c) ++void __cpuinit early_init_intel(struct cpuinfo_x86 *c) + { +- if (c->x86_vendor != X86_VENDOR_INTEL) +- return; + /* Netburst reports 64 bytes clflush size, but does IO in 128 bytes */ + if (c->x86 == 15 && c->x86_cache_alignment == 64) + c->x86_cache_alignment = 128; ++ if ((c->x86 == 0xf && c->x86_model >= 0x03) || ++ (c->x86 == 0x6 && c->x86_model >= 0x0e)) ++ set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC); + } + + /* +@@ -113,6 +116,8 @@ static void __cpuinit init_intel(struct cpuinfo_x86 *c) + unsigned int l2 = 0; + char *p = NULL; + ++ early_init_intel(c); ++ + #ifdef CONFIG_X86_F00F_BUG + /* + * All current models of Pentium and Pentium with MMX technology CPUs +@@ -132,7 +137,6 @@ static void __cpuinit init_intel(struct cpuinfo_x86 *c) + } + #endif + +- select_idle_routine(c); + l2 = init_intel_cacheinfo(c); + if (c->cpuid_level > 9 ) { + unsigned eax = cpuid_eax(10); +@@ -201,16 +205,13 @@ static void __cpuinit init_intel(struct cpuinfo_x86 *c) + } + #endif + ++ if (cpu_has_xmm2) ++ set_bit(X86_FEATURE_LFENCE_RDTSC, c->x86_capability); + if (c->x86 == 15) { + set_bit(X86_FEATURE_P4, c->x86_capability); +- set_bit(X86_FEATURE_SYNC_RDTSC, c->x86_capability); + } + if (c->x86 == 6) + set_bit(X86_FEATURE_P3, c->x86_capability); +- if ((c->x86 == 0xf && c->x86_model >= 0x03) || +- (c->x86 == 0x6 && c->x86_model >= 0x0e)) +- set_bit(X86_FEATURE_CONSTANT_TSC, c->x86_capability); +- + if (cpu_has_ds) { + unsigned int l1; + rdmsr(MSR_IA32_MISC_ENABLE, l1, l2); +@@ -219,6 +220,9 @@ static void __cpuinit init_intel(struct cpuinfo_x86 *c) + if (!(l1 & (1<<12))) + set_bit(X86_FEATURE_PEBS, c->x86_capability); + } ++ ++ if (cpu_has_bts) ++ ds_init_intel(c); + } + + static unsigned int __cpuinit intel_size_cache(struct cpuinfo_x86 * c, unsigned int size) +@@ -342,5 +346,22 @@ unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new) + EXPORT_SYMBOL(cmpxchg_386_u32); + #endif + ++#ifndef CONFIG_X86_CMPXCHG64 ++unsigned long long cmpxchg_486_u64(volatile void *ptr, u64 old, u64 new) ++{ ++ u64 prev; ++ unsigned long flags; ++ ++ /* Poor man's cmpxchg8b for 386 and 486. 
Unsuitable for SMP */ ++ local_irq_save(flags); ++ prev = *(u64 *)ptr; ++ if (prev == old) ++ *(u64 *)ptr = new; ++ local_irq_restore(flags); ++ return prev; ++} ++EXPORT_SYMBOL(cmpxchg_486_u64); ++#endif ++ + // arch_initcall(intel_cpu_init); + diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c index 9f530ff..8b4507b 100644 --- a/arch/x86/kernel/cpu/intel_cacheinfo.c @@ -135146,11 +148814,230 @@ index 9f530ff..8b4507b 100644 cpuid4_cache_sysfs_exit(cpu); } +diff --git a/arch/x86/kernel/cpu/mcheck/k7.c b/arch/x86/kernel/cpu/mcheck/k7.c +index eef63e3..e633c9c 100644 +--- a/arch/x86/kernel/cpu/mcheck/k7.c ++++ b/arch/x86/kernel/cpu/mcheck/k7.c +@@ -16,7 +16,7 @@ + #include "mce.h" + + /* Machine Check Handler For AMD Athlon/Duron */ +-static fastcall void k7_machine_check(struct pt_regs * regs, long error_code) ++static void k7_machine_check(struct pt_regs * regs, long error_code) + { + int recover=1; + u32 alow, ahigh, high, low; +@@ -27,29 +27,32 @@ static fastcall void k7_machine_check(struct pt_regs * regs, long error_code) + if (mcgstl & (1<<0)) /* Recoverable ? */ + recover=0; + +- printk (KERN_EMERG "CPU %d: Machine Check Exception: %08x%08x\n", ++ printk(KERN_EMERG "CPU %d: Machine Check Exception: %08x%08x\n", + smp_processor_id(), mcgsth, mcgstl); + +- for (i=1; i= MCE_LOG_LEN) { +- set_bit(MCE_OVERFLOW, &mcelog.flags); ++ set_bit(MCE_OVERFLOW, (unsigned long *)&mcelog.flags); + return; + } + /* Old left over entry. Skip. */ +@@ -110,12 +110,12 @@ static void print_mce(struct mce *m) + KERN_EMERG + "CPU %d: Machine Check Exception: %16Lx Bank %d: %016Lx\n", + m->cpu, m->mcgstatus, m->bank, m->status); +- if (m->rip) { ++ if (m->ip) { + printk(KERN_EMERG "RIP%s %02x:<%016Lx> ", + !(m->mcgstatus & MCG_STATUS_EIPV) ? " !INEXACT!" : "", +- m->cs, m->rip); ++ m->cs, m->ip); + if (m->cs == __KERNEL_CS) +- print_symbol("{%s}", m->rip); ++ print_symbol("{%s}", m->ip); + printk("\n"); + } + printk(KERN_EMERG "TSC %Lx ", m->tsc); +@@ -156,16 +156,16 @@ static int mce_available(struct cpuinfo_x86 *c) + static inline void mce_get_rip(struct mce *m, struct pt_regs *regs) + { + if (regs && (m->mcgstatus & MCG_STATUS_RIPV)) { +- m->rip = regs->rip; ++ m->ip = regs->ip; + m->cs = regs->cs; + } else { +- m->rip = 0; ++ m->ip = 0; + m->cs = 0; + } + if (rip_msr) { + /* Assume the RIP in the MSR is exact. Is this true? */ + m->mcgstatus |= MCG_STATUS_EIPV; +- rdmsrl(rip_msr, m->rip); ++ rdmsrl(rip_msr, m->ip); + m->cs = 0; + } + } +@@ -192,10 +192,10 @@ void do_machine_check(struct pt_regs * regs, long error_code) + + atomic_inc(&mce_entry); + +- if (regs) +- notify_die(DIE_NMI, "machine check", regs, error_code, 18, +- SIGKILL); +- if (!banks) ++ if ((regs ++ && notify_die(DIE_NMI, "machine check", regs, error_code, ++ 18, SIGKILL) == NOTIFY_STOP) ++ || !banks) + goto out2; + + memset(&m, 0, sizeof(struct mce)); +@@ -288,7 +288,7 @@ void do_machine_check(struct pt_regs * regs, long error_code) + * instruction which caused the MCE. 
+ */ + if (m.mcgstatus & MCG_STATUS_EIPV) +- user_space = panicm.rip && (panicm.cs & 3); ++ user_space = panicm.ip && (panicm.cs & 3); + + /* + * If we know that the error was in user space, send a +@@ -564,7 +564,7 @@ static ssize_t mce_read(struct file *filp, char __user *ubuf, size_t usize, + loff_t *off) + { + unsigned long *cpu_tsc; +- static DECLARE_MUTEX(mce_read_sem); ++ static DEFINE_MUTEX(mce_read_mutex); + unsigned next; + char __user *buf = ubuf; + int i, err; +@@ -573,12 +573,12 @@ static ssize_t mce_read(struct file *filp, char __user *ubuf, size_t usize, + if (!cpu_tsc) + return -ENOMEM; + +- down(&mce_read_sem); ++ mutex_lock(&mce_read_mutex); + next = rcu_dereference(mcelog.next); + + /* Only supports full reads right now */ + if (*off != 0 || usize < MCE_LOG_LEN*sizeof(struct mce)) { +- up(&mce_read_sem); ++ mutex_unlock(&mce_read_mutex); + kfree(cpu_tsc); + return -EINVAL; + } +@@ -621,7 +621,7 @@ static ssize_t mce_read(struct file *filp, char __user *ubuf, size_t usize, + memset(&mcelog.entry[i], 0, sizeof(struct mce)); + } + } +- up(&mce_read_sem); ++ mutex_unlock(&mce_read_mutex); + kfree(cpu_tsc); + return err ? -EFAULT : buf - ubuf; + } +@@ -634,8 +634,7 @@ static unsigned int mce_poll(struct file *file, poll_table *wait) + return 0; + } + +-static int mce_ioctl(struct inode *i, struct file *f,unsigned int cmd, +- unsigned long arg) ++static long mce_ioctl(struct file *f, unsigned int cmd, unsigned long arg) + { + int __user *p = (int __user *)arg; + +@@ -664,7 +663,7 @@ static const struct file_operations mce_chrdev_ops = { + .release = mce_release, + .read = mce_read, + .poll = mce_poll, +- .ioctl = mce_ioctl, ++ .unlocked_ioctl = mce_ioctl, + }; + + static struct miscdevice mce_log_device = { +@@ -745,7 +744,7 @@ static void mce_restart(void) static struct sysdev_class mce_sysclass = { .resume = mce_resume, @@ -135159,8 +149046,28 @@ index 4b21d29..242e866 100644 }; DEFINE_PER_CPU(struct sys_device, device_mce); +@@ -855,8 +854,8 @@ static void mce_remove_device(unsigned int cpu) + } + + /* Get notified when a cpu comes on/off. Be hotplug friendly. 
*/ +-static int +-mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) ++static int __cpuinit mce_cpu_callback(struct notifier_block *nfb, ++ unsigned long action, void *hcpu) + { + unsigned int cpu = (unsigned long)hcpu; + +@@ -873,7 +872,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) + return NOTIFY_OK; + } + +-static struct notifier_block mce_cpu_notifier = { ++static struct notifier_block mce_cpu_notifier __cpuinitdata = { + .notifier_call = mce_cpu_callback, + }; + diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c -index 752fb16..7535887 100644 +index 752fb16..32671da 100644 --- a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c +++ b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c @@ -65,7 +65,7 @@ static struct threshold_block threshold_defaults = { @@ -135172,6 +149079,33 @@ index 752fb16..7535887 100644 struct threshold_block *blocks; cpumask_t cpus; }; +@@ -118,6 +118,7 @@ void __cpuinit mce_amd_feature_init(struct cpuinfo_x86 *c) + { + unsigned int bank, block; + unsigned int cpu = smp_processor_id(); ++ u8 lvt_off; + u32 low = 0, high = 0, address = 0; + + for (bank = 0; bank < NR_BANKS; ++bank) { +@@ -153,14 +154,13 @@ void __cpuinit mce_amd_feature_init(struct cpuinfo_x86 *c) + if (shared_bank[bank] && c->cpu_core_id) + break; + #endif ++ lvt_off = setup_APIC_eilvt_mce(THRESHOLD_APIC_VECTOR, ++ APIC_EILVT_MSG_FIX, 0); ++ + high &= ~MASK_LVTOFF_HI; +- high |= K8_APIC_EXT_LVT_ENTRY_THRESHOLD << 20; ++ high |= lvt_off << 20; + wrmsr(address, low, high); + +- setup_APIC_extended_lvt(K8_APIC_EXT_LVT_ENTRY_THRESHOLD, +- THRESHOLD_APIC_VECTOR, +- K8_APIC_EXT_INT_MSG_FIX, 0); +- + threshold_defaults.address = address; + threshold_restart_bank(&threshold_defaults, 0, 0); + } @@ -432,10 +432,9 @@ static __cpuinit int allocate_threshold_blocks(unsigned int cpu, else per_cpu(threshold_banks, cpu)[bank]->blocks = b; @@ -135186,11 +149120,12 @@ index 752fb16..7535887 100644 if (err) goto out_free; recurse: -@@ -451,11 +450,13 @@ recurse: +@@ -451,11 +450,14 @@ recurse: if (err) goto out_free; -+ kobject_uevent(&b->kobj, KOBJ_ADD); ++ if (b) ++ kobject_uevent(&b->kobj, KOBJ_ADD); + return err; @@ -135201,7 +149136,7 @@ index 752fb16..7535887 100644 kfree(b); } return err; -@@ -489,7 +490,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank) +@@ -489,7 +491,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank) goto out; err = sysfs_create_link(&per_cpu(device_mce, cpu).kobj, @@ -135210,7 +149145,7 @@ index 752fb16..7535887 100644 if (err) goto out; -@@ -505,16 +506,15 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank) +@@ -505,16 +507,15 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank) goto out; } @@ -135231,7 +149166,7 @@ index 752fb16..7535887 100644 per_cpu(threshold_banks, cpu)[bank] = b; -@@ -531,7 +531,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank) +@@ -531,7 +532,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank) continue; err = sysfs_create_link(&per_cpu(device_mce, i).kobj, @@ -135240,7 +149175,16 @@ index 752fb16..7535887 100644 if (err) goto out; -@@ -581,7 +581,7 @@ static void deallocate_threshold_block(unsigned int cpu, +@@ -554,7 +555,7 @@ static __cpuinit int threshold_create_device(unsigned int cpu) + int err = 0; + + for (bank = 0; bank < NR_BANKS; ++bank) { +- if (!(per_cpu(bank_map, 
cpu) & 1 << bank)) ++ if (!(per_cpu(bank_map, cpu) & (1 << bank))) + continue; + err = threshold_create_bank(cpu, bank); + if (err) +@@ -581,7 +582,7 @@ static void deallocate_threshold_block(unsigned int cpu, return; list_for_each_entry_safe(pos, tmp, &head->blocks->miscj, miscj) { @@ -135249,7 +149193,7 @@ index 752fb16..7535887 100644 list_del(&pos->miscj); kfree(pos); } -@@ -627,7 +627,7 @@ static void threshold_remove_bank(unsigned int cpu, int bank) +@@ -627,7 +628,7 @@ static void threshold_remove_bank(unsigned int cpu, int bank) deallocate_threshold_block(cpu, bank); free_out: @@ -135258,11 +149202,474 @@ index 752fb16..7535887 100644 kfree(b); per_cpu(threshold_banks, cpu)[bank] = NULL; } +@@ -637,14 +638,14 @@ static void threshold_remove_device(unsigned int cpu) + unsigned int bank; + + for (bank = 0; bank < NR_BANKS; ++bank) { +- if (!(per_cpu(bank_map, cpu) & 1 << bank)) ++ if (!(per_cpu(bank_map, cpu) & (1 << bank))) + continue; + threshold_remove_bank(cpu, bank); + } + } + + /* get notified when a cpu comes on/off */ +-static int threshold_cpu_callback(struct notifier_block *nfb, ++static int __cpuinit threshold_cpu_callback(struct notifier_block *nfb, + unsigned long action, void *hcpu) + { + /* cpu was unsigned int to begin with */ +@@ -669,7 +670,7 @@ static int threshold_cpu_callback(struct notifier_block *nfb, + return NOTIFY_OK; + } + +-static struct notifier_block threshold_cpu_notifier = { ++static struct notifier_block threshold_cpu_notifier __cpuinitdata = { + .notifier_call = threshold_cpu_callback, + }; + +diff --git a/arch/x86/kernel/cpu/mcheck/p4.c b/arch/x86/kernel/cpu/mcheck/p4.c +index be4dabf..cb03345 100644 +--- a/arch/x86/kernel/cpu/mcheck/p4.c ++++ b/arch/x86/kernel/cpu/mcheck/p4.c +@@ -57,7 +57,7 @@ static void intel_thermal_interrupt(struct pt_regs *regs) + /* Thermal interrupt handler for this CPU setup */ + static void (*vendor_thermal_interrupt)(struct pt_regs *regs) = unexpected_thermal_interrupt; + +-fastcall void smp_thermal_interrupt(struct pt_regs *regs) ++void smp_thermal_interrupt(struct pt_regs *regs) + { + irq_enter(); + vendor_thermal_interrupt(regs); +@@ -141,7 +141,7 @@ static inline void intel_get_extended_msrs(struct intel_mce_extended_msrs *r) + rdmsr (MSR_IA32_MCG_EIP, r->eip, h); + } + +-static fastcall void intel_machine_check(struct pt_regs * regs, long error_code) ++static void intel_machine_check(struct pt_regs * regs, long error_code) + { + int recover=1; + u32 alow, ahigh, high, low; +@@ -152,38 +152,41 @@ static fastcall void intel_machine_check(struct pt_regs * regs, long error_code) + if (mcgstl & (1<<0)) /* Recoverable ? */ + recover=0; + +- printk (KERN_EMERG "CPU %d: Machine Check Exception: %08x%08x\n", ++ printk(KERN_EMERG "CPU %d: Machine Check Exception: %08x%08x\n", + smp_processor_id(), mcgsth, mcgstl); + + if (mce_num_extended_msrs > 0) { + struct intel_mce_extended_msrs dbg; + intel_get_extended_msrs(&dbg); +- printk (KERN_DEBUG "CPU %d: EIP: %08x EFLAGS: %08x\n", +- smp_processor_id(), dbg.eip, dbg.eflags); +- printk (KERN_DEBUG "\teax: %08x ebx: %08x ecx: %08x edx: %08x\n", +- dbg.eax, dbg.ebx, dbg.ecx, dbg.edx); +- printk (KERN_DEBUG "\tesi: %08x edi: %08x ebp: %08x esp: %08x\n", ++ printk(KERN_DEBUG "CPU %d: EIP: %08x EFLAGS: %08x\n" ++ "\teax: %08x ebx: %08x ecx: %08x edx: %08x\n" ++ "\tesi: %08x edi: %08x ebp: %08x esp: %08x\n", ++ smp_processor_id(), dbg.eip, dbg.eflags, ++ dbg.eax, dbg.ebx, dbg.ecx, dbg.edx, + dbg.esi, dbg.edi, dbg.ebp, dbg.esp); + } + +- for (i=0; i The base address of the region. 
+ The size of the region. If this is 0 the region is disabled. + The type of the region. +- If TRUE, do the change safely. If FALSE, safety measures should +- be done externally. + [RETURNS] Nothing. + */ + { +diff --git a/arch/x86/kernel/cpu/mtrr/cyrix.c b/arch/x86/kernel/cpu/mtrr/cyrix.c +index 9964be3..8e139c7 100644 +--- a/arch/x86/kernel/cpu/mtrr/cyrix.c ++++ b/arch/x86/kernel/cpu/mtrr/cyrix.c +@@ -4,6 +4,7 @@ + #include + #include + #include ++#include + #include "mtrr.h" + + int arr3_protected; +@@ -142,7 +143,7 @@ static void prepare_set(void) + + /* Disable and flush caches. Note that wbinvd flushes the TLBs as + a side-effect */ +- cr0 = read_cr0() | 0x40000000; ++ cr0 = read_cr0() | X86_CR0_CD; + wbinvd(); + write_cr0(cr0); + wbinvd(); +diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c +index 992f08d..103d61a 100644 +--- a/arch/x86/kernel/cpu/mtrr/generic.c ++++ b/arch/x86/kernel/cpu/mtrr/generic.c +@@ -9,11 +9,12 @@ + #include + #include + #include ++#include + #include + #include "mtrr.h" + + struct mtrr_state { +- struct mtrr_var_range *var_ranges; ++ struct mtrr_var_range var_ranges[MAX_VAR_RANGES]; + mtrr_type fixed_ranges[NUM_FIXED_RANGES]; + unsigned char enabled; + unsigned char have_fixed; +@@ -85,12 +86,6 @@ void __init get_mtrr_state(void) + struct mtrr_var_range *vrs; + unsigned lo, dummy; + +- if (!mtrr_state.var_ranges) { +- mtrr_state.var_ranges = kmalloc(num_var_ranges * sizeof (struct mtrr_var_range), +- GFP_KERNEL); +- if (!mtrr_state.var_ranges) +- return; +- } + vrs = mtrr_state.var_ranges; + + rdmsr(MTRRcap_MSR, lo, dummy); +@@ -188,7 +183,7 @@ static inline void k8_enable_fixed_iorrs(void) + * \param changed pointer which indicates whether the MTRR needed to be changed + * \param msrwords pointer to the MSR values which the MSR should have + */ +-static void set_fixed_range(int msr, int * changed, unsigned int * msrwords) ++static void set_fixed_range(int msr, bool *changed, unsigned int *msrwords) + { + unsigned lo, hi; + +@@ -200,7 +195,7 @@ static void set_fixed_range(int msr, int * changed, unsigned int * msrwords) + ((msrwords[0] | msrwords[1]) & K8_MTRR_RDMEM_WRMEM_MASK)) + k8_enable_fixed_iorrs(); + mtrr_wrmsr(msr, msrwords[0], msrwords[1]); +- *changed = TRUE; ++ *changed = true; + } + } + +@@ -260,7 +255,7 @@ static void generic_get_mtrr(unsigned int reg, unsigned long *base, + static int set_fixed_ranges(mtrr_type * frs) + { + unsigned long long *saved = (unsigned long long *) frs; +- int changed = FALSE; ++ bool changed = false; + int block=-1, range; + + while (fixed_range_blocks[++block].ranges) +@@ -273,17 +268,17 @@ static int set_fixed_ranges(mtrr_type * frs) + + /* Set the MSR pair relating to a var range. 
Returns TRUE if + changes are made */ +-static int set_mtrr_var_ranges(unsigned int index, struct mtrr_var_range *vr) ++static bool set_mtrr_var_ranges(unsigned int index, struct mtrr_var_range *vr) + { + unsigned int lo, hi; +- int changed = FALSE; ++ bool changed = false; + + rdmsr(MTRRphysBase_MSR(index), lo, hi); + if ((vr->base_lo & 0xfffff0ffUL) != (lo & 0xfffff0ffUL) + || (vr->base_hi & (size_and_mask >> (32 - PAGE_SHIFT))) != + (hi & (size_and_mask >> (32 - PAGE_SHIFT)))) { + mtrr_wrmsr(MTRRphysBase_MSR(index), vr->base_lo, vr->base_hi); +- changed = TRUE; ++ changed = true; + } + + rdmsr(MTRRphysMask_MSR(index), lo, hi); +@@ -292,7 +287,7 @@ static int set_mtrr_var_ranges(unsigned int index, struct mtrr_var_range *vr) + || (vr->mask_hi & (size_and_mask >> (32 - PAGE_SHIFT))) != + (hi & (size_and_mask >> (32 - PAGE_SHIFT)))) { + mtrr_wrmsr(MTRRphysMask_MSR(index), vr->mask_lo, vr->mask_hi); +- changed = TRUE; ++ changed = true; + } + return changed; + } +@@ -350,7 +345,7 @@ static void prepare_set(void) __acquires(set_atomicity_lock) + spin_lock(&set_atomicity_lock); + + /* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */ +- cr0 = read_cr0() | 0x40000000; /* set CD flag */ ++ cr0 = read_cr0() | X86_CR0_CD; + write_cr0(cr0); + wbinvd(); + +@@ -417,8 +412,6 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base, + The base address of the region. + The size of the region. If this is 0 the region is disabled. + The type of the region. +- If TRUE, do the change safely. If FALSE, safety measures should +- be done externally. + [RETURNS] Nothing. + */ + { +diff --git a/arch/x86/kernel/cpu/mtrr/if.c b/arch/x86/kernel/cpu/mtrr/if.c +index c7d8f17..91e150a 100644 +--- a/arch/x86/kernel/cpu/mtrr/if.c ++++ b/arch/x86/kernel/cpu/mtrr/if.c +@@ -11,10 +11,6 @@ + #include + #include "mtrr.h" + +-/* RED-PEN: this is accessed without any locking */ +-extern unsigned int *usage_table; +- +- + #define FILE_FCOUNT(f) (((struct seq_file *)((f)->private_data))->private) + + static const char *const mtrr_strings[MTRR_NUM_TYPES] = +@@ -37,7 +33,7 @@ const char *mtrr_attrib_to_str(int x) + + static int + mtrr_file_add(unsigned long base, unsigned long size, +- unsigned int type, char increment, struct file *file, int page) ++ unsigned int type, bool increment, struct file *file, int page) + { + int reg, max; + unsigned int *fcount = FILE_FCOUNT(file); +@@ -55,7 +51,7 @@ mtrr_file_add(unsigned long base, unsigned long size, + base >>= PAGE_SHIFT; + size >>= PAGE_SHIFT; + } +- reg = mtrr_add_page(base, size, type, 1); ++ reg = mtrr_add_page(base, size, type, true); + if (reg >= 0) + ++fcount[reg]; + return reg; +@@ -141,7 +137,7 @@ mtrr_write(struct file *file, const char __user *buf, size_t len, loff_t * ppos) + size >>= PAGE_SHIFT; + err = + mtrr_add_page((unsigned long) base, (unsigned long) size, i, +- 1); ++ true); + if (err < 0) + return err; + return len; +@@ -217,7 +213,7 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg) + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + err = +- mtrr_file_add(sentry.base, sentry.size, sentry.type, 1, ++ mtrr_file_add(sentry.base, sentry.size, sentry.type, true, + file, 0); + break; + case MTRRIOC_SET_ENTRY: +@@ -226,7 +222,7 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg) + #endif + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; +- err = mtrr_add(sentry.base, sentry.size, sentry.type, 0); ++ err = mtrr_add(sentry.base, sentry.size, sentry.type, false); + break; + case MTRRIOC_DEL_ENTRY: + #ifdef 
CONFIG_COMPAT +@@ -270,7 +266,7 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg) + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + err = +- mtrr_file_add(sentry.base, sentry.size, sentry.type, 1, ++ mtrr_file_add(sentry.base, sentry.size, sentry.type, true, + file, 1); + break; + case MTRRIOC_SET_PAGE_ENTRY: +@@ -279,7 +275,8 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg) + #endif + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; +- err = mtrr_add_page(sentry.base, sentry.size, sentry.type, 0); ++ err = ++ mtrr_add_page(sentry.base, sentry.size, sentry.type, false); + break; + case MTRRIOC_DEL_PAGE_ENTRY: + #ifdef CONFIG_COMPAT +@@ -396,7 +393,7 @@ static int mtrr_seq_show(struct seq_file *seq, void *offset) + for (i = 0; i < max; i++) { + mtrr_if->get(i, &base, &size, &type); + if (size == 0) +- usage_table[i] = 0; ++ mtrr_usage_table[i] = 0; + else { + if (size < (0x100000 >> PAGE_SHIFT)) { + /* less than 1MB */ +@@ -410,7 +407,7 @@ static int mtrr_seq_show(struct seq_file *seq, void *offset) + len += seq_printf(seq, + "reg%02i: base=0x%05lx000 (%4luMB), size=%4lu%cB: %s, count=%d\n", + i, base, base >> (20 - PAGE_SHIFT), size, factor, +- mtrr_attrib_to_str(type), usage_table[i]); ++ mtrr_attrib_to_str(type), mtrr_usage_table[i]); + } + } + return 0; diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c -index 3b20613..beb45c9 100644 +index 3b20613..7159195 100644 --- a/arch/x86/kernel/cpu/mtrr/main.c +++ b/arch/x86/kernel/cpu/mtrr/main.c -@@ -349,7 +349,7 @@ int mtrr_add_page(unsigned long base, unsigned long size, +@@ -38,8 +38,8 @@ + #include + #include + ++#include + #include +- + #include + #include + #include +@@ -47,7 +47,7 @@ + + u32 num_var_ranges = 0; + +-unsigned int *usage_table; ++unsigned int mtrr_usage_table[MAX_VAR_RANGES]; + static DEFINE_MUTEX(mtrr_mutex); + + u64 size_or_mask, size_and_mask; +@@ -121,13 +121,8 @@ static void __init init_table(void) + int i, max; + + max = num_var_ranges; +- if ((usage_table = kmalloc(max * sizeof *usage_table, GFP_KERNEL)) +- == NULL) { +- printk(KERN_ERR "mtrr: could not allocate\n"); +- return; +- } + for (i = 0; i < max; i++) +- usage_table[i] = 1; ++ mtrr_usage_table[i] = 1; + } + + struct set_mtrr_data { +@@ -311,7 +306,7 @@ static void set_mtrr(unsigned int reg, unsigned long base, + */ + + int mtrr_add_page(unsigned long base, unsigned long size, +- unsigned int type, char increment) ++ unsigned int type, bool increment) + { + int i, replace, error; + mtrr_type ltype; +@@ -349,7 +344,7 @@ int mtrr_add_page(unsigned long base, unsigned long size, replace = -1; /* No CPU hotplug when we change MTRR entries */ @@ -135271,7 +149678,37 @@ index 3b20613..beb45c9 100644 /* Search for existing MTRR */ mutex_lock(&mtrr_mutex); for (i = 0; i < num_var_ranges; ++i) { -@@ -405,7 +405,7 @@ int mtrr_add_page(unsigned long base, unsigned long size, +@@ -383,7 +378,7 @@ int mtrr_add_page(unsigned long base, unsigned long size, + goto out; + } + if (increment) +- ++usage_table[i]; ++ ++mtrr_usage_table[i]; + error = i; + goto out; + } +@@ -391,13 +386,15 @@ int mtrr_add_page(unsigned long base, unsigned long size, + i = mtrr_if->get_free_region(base, size, replace); + if (i >= 0) { + set_mtrr(i, base, size, type); +- if (likely(replace < 0)) +- usage_table[i] = 1; +- else { +- usage_table[i] = usage_table[replace] + !!increment; ++ if (likely(replace < 0)) { ++ mtrr_usage_table[i] = 1; ++ } else { ++ mtrr_usage_table[i] = mtrr_usage_table[replace]; ++ if (increment) ++ 
mtrr_usage_table[i]++; + if (unlikely(replace != i)) { + set_mtrr(replace, 0, 0, 0); +- usage_table[replace] = 0; ++ mtrr_usage_table[replace] = 0; + } + } + } else +@@ -405,7 +402,7 @@ int mtrr_add_page(unsigned long base, unsigned long size, error = i; out: mutex_unlock(&mtrr_mutex); @@ -135280,7 +149717,16 @@ index 3b20613..beb45c9 100644 return error; } -@@ -495,7 +495,7 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size) +@@ -460,7 +457,7 @@ static int mtrr_check(unsigned long base, unsigned long size) + + int + mtrr_add(unsigned long base, unsigned long size, unsigned int type, +- char increment) ++ bool increment) + { + if (mtrr_check(base, size)) + return -EINVAL; +@@ -495,7 +492,7 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size) max = num_var_ranges; /* No CPU hotplug when we change MTRR entries */ @@ -135289,7 +149735,18 @@ index 3b20613..beb45c9 100644 mutex_lock(&mtrr_mutex); if (reg < 0) { /* Search for existing MTRR */ -@@ -536,7 +536,7 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size) +@@ -527,16 +524,16 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size) + printk(KERN_WARNING "mtrr: MTRR %d not used\n", reg); + goto out; + } +- if (usage_table[reg] < 1) { ++ if (mtrr_usage_table[reg] < 1) { + printk(KERN_WARNING "mtrr: reg: %d has count=0\n", reg); + goto out; + } +- if (--usage_table[reg] < 1) ++ if (--mtrr_usage_table[reg] < 1) + set_mtrr(reg, 0, 0, 0); error = reg; out: mutex_unlock(&mtrr_mutex); @@ -135298,10 +149755,238 @@ index 3b20613..beb45c9 100644 return error; } /** +@@ -591,16 +588,11 @@ struct mtrr_value { + unsigned long lsize; + }; + +-static struct mtrr_value * mtrr_state; ++static struct mtrr_value mtrr_state[MAX_VAR_RANGES]; + + static int mtrr_save(struct sys_device * sysdev, pm_message_t state) + { + int i; +- int size = num_var_ranges * sizeof(struct mtrr_value); +- +- mtrr_state = kzalloc(size,GFP_ATOMIC); +- if (!mtrr_state) +- return -ENOMEM; + + for (i = 0; i < num_var_ranges; i++) { + mtrr_if->get(i, +@@ -622,7 +614,6 @@ static int mtrr_restore(struct sys_device * sysdev) + mtrr_state[i].lsize, + mtrr_state[i].ltype); + } +- kfree(mtrr_state); + return 0; + } + +@@ -633,6 +624,112 @@ static struct sysdev_driver mtrr_sysdev_driver = { + .resume = mtrr_restore, + }; + ++static int disable_mtrr_trim; ++ ++static int __init disable_mtrr_trim_setup(char *str) ++{ ++ disable_mtrr_trim = 1; ++ return 0; ++} ++early_param("disable_mtrr_trim", disable_mtrr_trim_setup); ++ ++/* ++ * Newer AMD K8s and later CPUs have a special magic MSR way to force WB ++ * for memory >4GB. Check for that here. ++ * Note this won't check if the MTRRs < 4GB where the magic bit doesn't ++ * apply to are wrong, but so far we don't know of any such case in the wild. ++ */ ++#define Tom2Enabled (1U << 21) ++#define Tom2ForceMemTypeWB (1U << 22) ++ ++static __init int amd_special_default_mtrr(void) ++{ ++ u32 l, h; ++ ++ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD) ++ return 0; ++ if (boot_cpu_data.x86 < 0xf || boot_cpu_data.x86 > 0x11) ++ return 0; ++ /* In case some hypervisor doesn't pass SYSCFG through */ ++ if (rdmsr_safe(MSR_K8_SYSCFG, &l, &h) < 0) ++ return 0; ++ /* ++ * Memory between 4GB and top of mem is forced WB by this magic bit. ++ * Reserved before K8RevF, but should be zero there. 
++ */ ++ if ((l & (Tom2Enabled | Tom2ForceMemTypeWB)) == ++ (Tom2Enabled | Tom2ForceMemTypeWB)) ++ return 1; ++ return 0; ++} ++ ++/** ++ * mtrr_trim_uncached_memory - trim RAM not covered by MTRRs ++ * ++ * Some buggy BIOSes don't setup the MTRRs properly for systems with certain ++ * memory configurations. This routine checks that the highest MTRR matches ++ * the end of memory, to make sure the MTRRs having a write back type cover ++ * all of the memory the kernel is intending to use. If not, it'll trim any ++ * memory off the end by adjusting end_pfn, removing it from the kernel's ++ * allocation pools, warning the user with an obnoxious message. ++ */ ++int __init mtrr_trim_uncached_memory(unsigned long end_pfn) ++{ ++ unsigned long i, base, size, highest_addr = 0, def, dummy; ++ mtrr_type type; ++ u64 trim_start, trim_size; ++ ++ /* ++ * Make sure we only trim uncachable memory on machines that ++ * support the Intel MTRR architecture: ++ */ ++ if (!is_cpu(INTEL) || disable_mtrr_trim) ++ return 0; ++ rdmsr(MTRRdefType_MSR, def, dummy); ++ def &= 0xff; ++ if (def != MTRR_TYPE_UNCACHABLE) ++ return 0; ++ ++ if (amd_special_default_mtrr()) ++ return 0; ++ ++ /* Find highest cached pfn */ ++ for (i = 0; i < num_var_ranges; i++) { ++ mtrr_if->get(i, &base, &size, &type); ++ if (type != MTRR_TYPE_WRBACK) ++ continue; ++ base <<= PAGE_SHIFT; ++ size <<= PAGE_SHIFT; ++ if (highest_addr < base + size) ++ highest_addr = base + size; ++ } ++ ++ /* kvm/qemu doesn't have mtrr set right, don't trim them all */ ++ if (!highest_addr) { ++ printk(KERN_WARNING "WARNING: strange, CPU MTRRs all blank?\n"); ++ WARN_ON(1); ++ return 0; ++ } ++ ++ if ((highest_addr >> PAGE_SHIFT) < end_pfn) { ++ printk(KERN_WARNING "WARNING: BIOS bug: CPU MTRRs don't cover" ++ " all of memory, losing %LdMB of RAM.\n", ++ (((u64)end_pfn << PAGE_SHIFT) - highest_addr) >> 20); ++ ++ WARN_ON(1); ++ ++ printk(KERN_INFO "update e820 for mtrr\n"); ++ trim_start = highest_addr; ++ trim_size = end_pfn; ++ trim_size <<= PAGE_SHIFT; ++ trim_size -= trim_start; ++ add_memory_region(trim_start, trim_size, E820_RESERVED); ++ update_e820(); ++ return 1; ++ } ++ ++ return 0; ++} + + /** + * mtrr_bp_init - initialize mtrrs on the boot CPU +diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.h b/arch/x86/kernel/cpu/mtrr/mtrr.h +index 289dfe6..fb74a2c 100644 +--- a/arch/x86/kernel/cpu/mtrr/mtrr.h ++++ b/arch/x86/kernel/cpu/mtrr/mtrr.h +@@ -2,10 +2,8 @@ + * local mtrr defines. + */ + +-#ifndef TRUE +-#define TRUE 1 +-#define FALSE 0 +-#endif ++#include ++#include + + #define MTRRcap_MSR 0x0fe + #define MTRRdefType_MSR 0x2ff +@@ -14,6 +12,7 @@ + #define MTRRphysMask_MSR(reg) (0x200 + 2 * (reg) + 1) + + #define NUM_FIXED_RANGES 88 ++#define MAX_VAR_RANGES 256 + #define MTRRfix64K_00000_MSR 0x250 + #define MTRRfix16K_80000_MSR 0x258 + #define MTRRfix16K_A0000_MSR 0x259 +@@ -34,6 +33,8 @@ + an 8 bit field: */ + typedef u8 mtrr_type; + ++extern unsigned int mtrr_usage_table[MAX_VAR_RANGES]; ++ + struct mtrr_ops { + u32 vendor; + u32 use_intel_if; +diff --git a/arch/x86/kernel/cpu/mtrr/state.c b/arch/x86/kernel/cpu/mtrr/state.c +index 49e20c2..9f8ba92 100644 +--- a/arch/x86/kernel/cpu/mtrr/state.c ++++ b/arch/x86/kernel/cpu/mtrr/state.c +@@ -4,6 +4,7 @@ + #include + #include + #include ++#include + #include "mtrr.h" + + +@@ -25,7 +26,7 @@ void set_mtrr_prepare_save(struct set_mtrr_context *ctxt) + + /* Disable and flush caches. 
Note that wbinvd flushes the TLBs as + a side-effect */ +- cr0 = read_cr0() | 0x40000000; ++ cr0 = read_cr0() | X86_CR0_CD; + wbinvd(); + write_cr0(cr0); + wbinvd(); +diff --git a/arch/x86/kernel/cpu/perfctr-watchdog.c b/arch/x86/kernel/cpu/perfctr-watchdog.c +index c02541e..9b83832 100644 +--- a/arch/x86/kernel/cpu/perfctr-watchdog.c ++++ b/arch/x86/kernel/cpu/perfctr-watchdog.c +@@ -167,7 +167,6 @@ void release_evntsel_nmi(unsigned int msr) + clear_bit(counter, evntsel_nmi_owner); + } + +-EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi); + EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit); + EXPORT_SYMBOL(reserve_perfctr_nmi); + EXPORT_SYMBOL(release_perfctr_nmi); +diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c +index 3900e46..0282132 100644 +--- a/arch/x86/kernel/cpu/proc.c ++++ b/arch/x86/kernel/cpu/proc.c +@@ -188,7 +188,7 @@ static void *c_next(struct seq_file *m, void *v, loff_t *pos) + static void c_stop(struct seq_file *m, void *v) + { + } +-struct seq_operations cpuinfo_op = { ++const struct seq_operations cpuinfo_op = { + .start = c_start, + .next = c_next, + .stop = c_stop, diff --git a/arch/x86/kernel/cpuid.c b/arch/x86/kernel/cpuid.c -index 05c9936..d387c77 100644 +index 05c9936..dec66e4 100644 --- a/arch/x86/kernel/cpuid.c +++ b/arch/x86/kernel/cpuid.c +@@ -50,7 +50,7 @@ struct cpuid_command { + + static void cpuid_smp_cpuid(void *cmd_block) + { +- struct cpuid_command *cmd = (struct cpuid_command *)cmd_block; ++ struct cpuid_command *cmd = cmd_block; + + cpuid(cmd->reg, &cmd->data[0], &cmd->data[1], &cmd->data[2], + &cmd->data[3]); @@ -157,15 +157,15 @@ static int __cpuinit cpuid_class_cpu_callback(struct notifier_block *nfb, switch (action) { @@ -135321,20 +150006,3446 @@ index 05c9936..d387c77 100644 } return err ? NOTIFY_BAD : NOTIFY_OK; } +diff --git a/arch/x86/kernel/doublefault_32.c b/arch/x86/kernel/doublefault_32.c +index 40978af..a47798b 100644 +--- a/arch/x86/kernel/doublefault_32.c ++++ b/arch/x86/kernel/doublefault_32.c +@@ -17,7 +17,7 @@ static unsigned long doublefault_stack[DOUBLEFAULT_STACKSIZE]; + + static void doublefault_fn(void) + { +- struct Xgt_desc_struct gdt_desc = {0, 0}; ++ struct desc_ptr gdt_desc = {0, 0}; + unsigned long gdt, tss; + + store_gdt(&gdt_desc); +@@ -33,14 +33,15 @@ static void doublefault_fn(void) + printk(KERN_EMERG "double fault, tss at %08lx\n", tss); + + if (ptr_ok(tss)) { +- struct i386_hw_tss *t = (struct i386_hw_tss *)tss; ++ struct x86_hw_tss *t = (struct x86_hw_tss *)tss; + +- printk(KERN_EMERG "eip = %08lx, esp = %08lx\n", t->eip, t->esp); ++ printk(KERN_EMERG "eip = %08lx, esp = %08lx\n", ++ t->ip, t->sp); + + printk(KERN_EMERG "eax = %08lx, ebx = %08lx, ecx = %08lx, edx = %08lx\n", +- t->eax, t->ebx, t->ecx, t->edx); ++ t->ax, t->bx, t->cx, t->dx); + printk(KERN_EMERG "esi = %08lx, edi = %08lx\n", +- t->esi, t->edi); ++ t->si, t->di); + } + } + +@@ -50,15 +51,15 @@ static void doublefault_fn(void) + + struct tss_struct doublefault_tss __cacheline_aligned = { + .x86_tss = { +- .esp0 = STACK_START, ++ .sp0 = STACK_START, + .ss0 = __KERNEL_DS, + .ldt = 0, + .io_bitmap_base = INVALID_IO_BITMAP_OFFSET, + +- .eip = (unsigned long) doublefault_fn, ++ .ip = (unsigned long) doublefault_fn, + /* 0x2 bit is always set */ +- .eflags = X86_EFLAGS_SF | 0x2, +- .esp = STACK_START, ++ .flags = X86_EFLAGS_SF | 0x2, ++ .sp = STACK_START, + .es = __USER_DS, + .cs = __KERNEL_CS, + .ss = __KERNEL_DS, +diff --git a/arch/x86/kernel/ds.c b/arch/x86/kernel/ds.c +new file mode 100644 +index 0000000..1c5ca4d +--- /dev/null ++++ 
b/arch/x86/kernel/ds.c +@@ -0,0 +1,464 @@ ++/* ++ * Debug Store support ++ * ++ * This provides a low-level interface to the hardware's Debug Store ++ * feature that is used for last branch recording (LBR) and ++ * precise-event based sampling (PEBS). ++ * ++ * Different architectures use a different DS layout/pointer size. ++ * The below functions therefore work on a void*. ++ * ++ * ++ * Since there is no user for PEBS, yet, only LBR (or branch ++ * trace store, BTS) is supported. ++ * ++ * ++ * Copyright (C) 2007 Intel Corporation. ++ * Markus Metzger , Dec 2007 ++ */ ++ ++#include ++ ++#include ++#include ++#include ++ ++ ++/* ++ * Debug Store (DS) save area configuration (see Intel64 and IA32 ++ * Architectures Software Developer's Manual, section 18.5) ++ * ++ * The DS configuration consists of the following fields; different ++ * architetures vary in the size of those fields. ++ * - double-word aligned base linear address of the BTS buffer ++ * - write pointer into the BTS buffer ++ * - end linear address of the BTS buffer (one byte beyond the end of ++ * the buffer) ++ * - interrupt pointer into BTS buffer ++ * (interrupt occurs when write pointer passes interrupt pointer) ++ * - double-word aligned base linear address of the PEBS buffer ++ * - write pointer into the PEBS buffer ++ * - end linear address of the PEBS buffer (one byte beyond the end of ++ * the buffer) ++ * - interrupt pointer into PEBS buffer ++ * (interrupt occurs when write pointer passes interrupt pointer) ++ * - value to which counter is reset following counter overflow ++ * ++ * On later architectures, the last branch recording hardware uses ++ * 64bit pointers even in 32bit mode. ++ * ++ * ++ * Branch Trace Store (BTS) records store information about control ++ * flow changes. They at least provide the following information: ++ * - source linear address ++ * - destination linear address ++ * ++ * Netburst supported a predicated bit that had been dropped in later ++ * architectures. We do not suppor it. ++ * ++ * ++ * In order to abstract from the actual DS and BTS layout, we describe ++ * the access to the relevant fields. ++ * Thanks to Andi Kleen for proposing this design. ++ * ++ * The implementation, however, is not as general as it might seem. In ++ * order to stay somewhat simple and efficient, we assume an ++ * underlying unsigned type (mostly a pointer type) and we expect the ++ * field to be at least as big as that type. ++ */ ++ ++/* ++ * A special from_ip address to indicate that the BTS record is an ++ * info record that needs to be interpreted or skipped. ++ */ ++#define BTS_ESCAPE_ADDRESS (-1) ++ ++/* ++ * A field access descriptor ++ */ ++struct access_desc { ++ unsigned char offset; ++ unsigned char size; ++}; ++ ++/* ++ * The configuration for a particular DS/BTS hardware implementation. 
++ */ ++struct ds_configuration { ++ /* the DS configuration */ ++ unsigned char sizeof_ds; ++ struct access_desc bts_buffer_base; ++ struct access_desc bts_index; ++ struct access_desc bts_absolute_maximum; ++ struct access_desc bts_interrupt_threshold; ++ /* the BTS configuration */ ++ unsigned char sizeof_bts; ++ struct access_desc from_ip; ++ struct access_desc to_ip; ++ /* BTS variants used to store additional information like ++ timestamps */ ++ struct access_desc info_type; ++ struct access_desc info_data; ++ unsigned long debugctl_mask; ++}; ++ ++/* ++ * The global configuration used by the below accessor functions ++ */ ++static struct ds_configuration ds_cfg; ++ ++/* ++ * Accessor functions for some DS and BTS fields using the above ++ * global ptrace_bts_cfg. ++ */ ++static inline unsigned long get_bts_buffer_base(char *base) ++{ ++ return *(unsigned long *)(base + ds_cfg.bts_buffer_base.offset); ++} ++static inline void set_bts_buffer_base(char *base, unsigned long value) ++{ ++ (*(unsigned long *)(base + ds_cfg.bts_buffer_base.offset)) = value; ++} ++static inline unsigned long get_bts_index(char *base) ++{ ++ return *(unsigned long *)(base + ds_cfg.bts_index.offset); ++} ++static inline void set_bts_index(char *base, unsigned long value) ++{ ++ (*(unsigned long *)(base + ds_cfg.bts_index.offset)) = value; ++} ++static inline unsigned long get_bts_absolute_maximum(char *base) ++{ ++ return *(unsigned long *)(base + ds_cfg.bts_absolute_maximum.offset); ++} ++static inline void set_bts_absolute_maximum(char *base, unsigned long value) ++{ ++ (*(unsigned long *)(base + ds_cfg.bts_absolute_maximum.offset)) = value; ++} ++static inline unsigned long get_bts_interrupt_threshold(char *base) ++{ ++ return *(unsigned long *)(base + ds_cfg.bts_interrupt_threshold.offset); ++} ++static inline void set_bts_interrupt_threshold(char *base, unsigned long value) ++{ ++ (*(unsigned long *)(base + ds_cfg.bts_interrupt_threshold.offset)) = value; ++} ++static inline unsigned long get_from_ip(char *base) ++{ ++ return *(unsigned long *)(base + ds_cfg.from_ip.offset); ++} ++static inline void set_from_ip(char *base, unsigned long value) ++{ ++ (*(unsigned long *)(base + ds_cfg.from_ip.offset)) = value; ++} ++static inline unsigned long get_to_ip(char *base) ++{ ++ return *(unsigned long *)(base + ds_cfg.to_ip.offset); ++} ++static inline void set_to_ip(char *base, unsigned long value) ++{ ++ (*(unsigned long *)(base + ds_cfg.to_ip.offset)) = value; ++} ++static inline unsigned char get_info_type(char *base) ++{ ++ return *(unsigned char *)(base + ds_cfg.info_type.offset); ++} ++static inline void set_info_type(char *base, unsigned char value) ++{ ++ (*(unsigned char *)(base + ds_cfg.info_type.offset)) = value; ++} ++static inline unsigned long get_info_data(char *base) ++{ ++ return *(unsigned long *)(base + ds_cfg.info_data.offset); ++} ++static inline void set_info_data(char *base, unsigned long value) ++{ ++ (*(unsigned long *)(base + ds_cfg.info_data.offset)) = value; ++} ++ ++ ++int ds_allocate(void **dsp, size_t bts_size_in_bytes) ++{ ++ size_t bts_size_in_records; ++ unsigned long bts; ++ void *ds; ++ ++ if (!ds_cfg.sizeof_ds || !ds_cfg.sizeof_bts) ++ return -EOPNOTSUPP; ++ ++ if (bts_size_in_bytes < 0) ++ return -EINVAL; ++ ++ bts_size_in_records = ++ bts_size_in_bytes / ds_cfg.sizeof_bts; ++ bts_size_in_bytes = ++ bts_size_in_records * ds_cfg.sizeof_bts; ++ ++ if (bts_size_in_bytes <= 0) ++ return -EINVAL; ++ ++ bts = (unsigned long)kzalloc(bts_size_in_bytes, GFP_KERNEL); ++ ++ if (!bts) 
++ return -ENOMEM; ++ ++ ds = kzalloc(ds_cfg.sizeof_ds, GFP_KERNEL); ++ ++ if (!ds) { ++ kfree((void *)bts); ++ return -ENOMEM; ++ } ++ ++ set_bts_buffer_base(ds, bts); ++ set_bts_index(ds, bts); ++ set_bts_absolute_maximum(ds, bts + bts_size_in_bytes); ++ set_bts_interrupt_threshold(ds, bts + bts_size_in_bytes + 1); ++ ++ *dsp = ds; ++ return 0; ++} ++ ++int ds_free(void **dsp) ++{ ++ if (*dsp) ++ kfree((void *)get_bts_buffer_base(*dsp)); ++ kfree(*dsp); ++ *dsp = 0; ++ ++ return 0; ++} ++ ++int ds_get_bts_size(void *ds) ++{ ++ int size_in_bytes; ++ ++ if (!ds_cfg.sizeof_ds || !ds_cfg.sizeof_bts) ++ return -EOPNOTSUPP; ++ ++ if (!ds) ++ return 0; ++ ++ size_in_bytes = ++ get_bts_absolute_maximum(ds) - ++ get_bts_buffer_base(ds); ++ return size_in_bytes; ++} ++ ++int ds_get_bts_end(void *ds) ++{ ++ int size_in_bytes = ds_get_bts_size(ds); ++ ++ if (size_in_bytes <= 0) ++ return size_in_bytes; ++ ++ return size_in_bytes / ds_cfg.sizeof_bts; ++} ++ ++int ds_get_bts_index(void *ds) ++{ ++ int index_offset_in_bytes; ++ ++ if (!ds_cfg.sizeof_ds || !ds_cfg.sizeof_bts) ++ return -EOPNOTSUPP; ++ ++ index_offset_in_bytes = ++ get_bts_index(ds) - ++ get_bts_buffer_base(ds); ++ ++ return index_offset_in_bytes / ds_cfg.sizeof_bts; ++} ++ ++int ds_set_overflow(void *ds, int method) ++{ ++ switch (method) { ++ case DS_O_SIGNAL: ++ return -EOPNOTSUPP; ++ case DS_O_WRAP: ++ return 0; ++ default: ++ return -EINVAL; ++ } ++} ++ ++int ds_get_overflow(void *ds) ++{ ++ return DS_O_WRAP; ++} ++ ++int ds_clear(void *ds) ++{ ++ int bts_size = ds_get_bts_size(ds); ++ unsigned long bts_base; ++ ++ if (bts_size <= 0) ++ return bts_size; ++ ++ bts_base = get_bts_buffer_base(ds); ++ memset((void *)bts_base, 0, bts_size); ++ ++ set_bts_index(ds, bts_base); ++ return 0; ++} ++ ++int ds_read_bts(void *ds, int index, struct bts_struct *out) ++{ ++ void *bts; ++ ++ if (!ds_cfg.sizeof_ds || !ds_cfg.sizeof_bts) ++ return -EOPNOTSUPP; ++ ++ if (index < 0) ++ return -EINVAL; ++ ++ if (index >= ds_get_bts_size(ds)) ++ return -EINVAL; ++ ++ bts = (void *)(get_bts_buffer_base(ds) + (index * ds_cfg.sizeof_bts)); ++ ++ memset(out, 0, sizeof(*out)); ++ if (get_from_ip(bts) == BTS_ESCAPE_ADDRESS) { ++ out->qualifier = get_info_type(bts); ++ out->variant.jiffies = get_info_data(bts); ++ } else { ++ out->qualifier = BTS_BRANCH; ++ out->variant.lbr.from_ip = get_from_ip(bts); ++ out->variant.lbr.to_ip = get_to_ip(bts); ++ } ++ ++ return sizeof(*out);; ++} ++ ++int ds_write_bts(void *ds, const struct bts_struct *in) ++{ ++ unsigned long bts; ++ ++ if (!ds_cfg.sizeof_ds || !ds_cfg.sizeof_bts) ++ return -EOPNOTSUPP; ++ ++ if (ds_get_bts_size(ds) <= 0) ++ return -ENXIO; ++ ++ bts = get_bts_index(ds); ++ ++ memset((void *)bts, 0, ds_cfg.sizeof_bts); ++ switch (in->qualifier) { ++ case BTS_INVALID: ++ break; ++ ++ case BTS_BRANCH: ++ set_from_ip((void *)bts, in->variant.lbr.from_ip); ++ set_to_ip((void *)bts, in->variant.lbr.to_ip); ++ break; ++ ++ case BTS_TASK_ARRIVES: ++ case BTS_TASK_DEPARTS: ++ set_from_ip((void *)bts, BTS_ESCAPE_ADDRESS); ++ set_info_type((void *)bts, in->qualifier); ++ set_info_data((void *)bts, in->variant.jiffies); ++ break; ++ ++ default: ++ return -EINVAL; ++ } ++ ++ bts = bts + ds_cfg.sizeof_bts; ++ if (bts >= get_bts_absolute_maximum(ds)) ++ bts = get_bts_buffer_base(ds); ++ set_bts_index(ds, bts); ++ ++ return ds_cfg.sizeof_bts; ++} ++ ++unsigned long ds_debugctl_mask(void) ++{ ++ return ds_cfg.debugctl_mask; ++} ++ ++#ifdef __i386__ ++static const struct ds_configuration ds_cfg_netburst = { ++ .sizeof_ds = 9 * 
4, ++ .bts_buffer_base = { 0, 4 }, ++ .bts_index = { 4, 4 }, ++ .bts_absolute_maximum = { 8, 4 }, ++ .bts_interrupt_threshold = { 12, 4 }, ++ .sizeof_bts = 3 * 4, ++ .from_ip = { 0, 4 }, ++ .to_ip = { 4, 4 }, ++ .info_type = { 4, 1 }, ++ .info_data = { 8, 4 }, ++ .debugctl_mask = (1<<2)|(1<<3) ++}; ++ ++static const struct ds_configuration ds_cfg_pentium_m = { ++ .sizeof_ds = 9 * 4, ++ .bts_buffer_base = { 0, 4 }, ++ .bts_index = { 4, 4 }, ++ .bts_absolute_maximum = { 8, 4 }, ++ .bts_interrupt_threshold = { 12, 4 }, ++ .sizeof_bts = 3 * 4, ++ .from_ip = { 0, 4 }, ++ .to_ip = { 4, 4 }, ++ .info_type = { 4, 1 }, ++ .info_data = { 8, 4 }, ++ .debugctl_mask = (1<<6)|(1<<7) ++}; ++#endif /* _i386_ */ ++ ++static const struct ds_configuration ds_cfg_core2 = { ++ .sizeof_ds = 9 * 8, ++ .bts_buffer_base = { 0, 8 }, ++ .bts_index = { 8, 8 }, ++ .bts_absolute_maximum = { 16, 8 }, ++ .bts_interrupt_threshold = { 24, 8 }, ++ .sizeof_bts = 3 * 8, ++ .from_ip = { 0, 8 }, ++ .to_ip = { 8, 8 }, ++ .info_type = { 8, 1 }, ++ .info_data = { 16, 8 }, ++ .debugctl_mask = (1<<6)|(1<<7)|(1<<9) ++}; ++ ++static inline void ++ds_configure(const struct ds_configuration *cfg) ++{ ++ ds_cfg = *cfg; ++} ++ ++void __cpuinit ds_init_intel(struct cpuinfo_x86 *c) ++{ ++ switch (c->x86) { ++ case 0x6: ++ switch (c->x86_model) { ++#ifdef __i386__ ++ case 0xD: ++ case 0xE: /* Pentium M */ ++ ds_configure(&ds_cfg_pentium_m); ++ break; ++#endif /* _i386_ */ ++ case 0xF: /* Core2 */ ++ ds_configure(&ds_cfg_core2); ++ break; ++ default: ++ /* sorry, don't know about them */ ++ break; ++ } ++ break; ++ case 0xF: ++ switch (c->x86_model) { ++#ifdef __i386__ ++ case 0x0: ++ case 0x1: ++ case 0x2: /* Netburst */ ++ ds_configure(&ds_cfg_netburst); ++ break; ++#endif /* _i386_ */ ++ default: ++ /* sorry, don't know about them */ ++ break; ++ } ++ break; ++ default: ++ /* sorry, don't know about them */ ++ break; ++ } ++} +diff --git a/arch/x86/kernel/e820_32.c b/arch/x86/kernel/e820_32.c +index 18f500d..4e16ef4 100644 +--- a/arch/x86/kernel/e820_32.c ++++ b/arch/x86/kernel/e820_32.c +@@ -7,7 +7,6 @@ + #include + #include + #include +-#include + #include + #include + #include +@@ -17,11 +16,6 @@ + #include + #include + +-#ifdef CONFIG_EFI +-int efi_enabled = 0; +-EXPORT_SYMBOL(efi_enabled); +-#endif +- + struct e820map e820; + struct change_member { + struct e820entry *pbios; /* pointer to original bios entry */ +@@ -37,26 +31,6 @@ unsigned long pci_mem_start = 0x10000000; + EXPORT_SYMBOL(pci_mem_start); + #endif + extern int user_defined_memmap; +-struct resource data_resource = { +- .name = "Kernel data", +- .start = 0, +- .end = 0, +- .flags = IORESOURCE_BUSY | IORESOURCE_MEM +-}; +- +-struct resource code_resource = { +- .name = "Kernel code", +- .start = 0, +- .end = 0, +- .flags = IORESOURCE_BUSY | IORESOURCE_MEM +-}; +- +-struct resource bss_resource = { +- .name = "Kernel bss", +- .start = 0, +- .end = 0, +- .flags = IORESOURCE_BUSY | IORESOURCE_MEM +-}; + + static struct resource system_rom_resource = { + .name = "System ROM", +@@ -111,60 +85,6 @@ static struct resource video_rom_resource = { + .flags = IORESOURCE_BUSY | IORESOURCE_READONLY | IORESOURCE_MEM + }; + +-static struct resource video_ram_resource = { +- .name = "Video RAM area", +- .start = 0xa0000, +- .end = 0xbffff, +- .flags = IORESOURCE_BUSY | IORESOURCE_MEM +-}; +- +-static struct resource standard_io_resources[] = { { +- .name = "dma1", +- .start = 0x0000, +- .end = 0x001f, +- .flags = IORESOURCE_BUSY | IORESOURCE_IO +-}, { +- .name = "pic1", +- .start = 
0x0020, +- .end = 0x0021, +- .flags = IORESOURCE_BUSY | IORESOURCE_IO +-}, { +- .name = "timer0", +- .start = 0x0040, +- .end = 0x0043, +- .flags = IORESOURCE_BUSY | IORESOURCE_IO +-}, { +- .name = "timer1", +- .start = 0x0050, +- .end = 0x0053, +- .flags = IORESOURCE_BUSY | IORESOURCE_IO +-}, { +- .name = "keyboard", +- .start = 0x0060, +- .end = 0x006f, +- .flags = IORESOURCE_BUSY | IORESOURCE_IO +-}, { +- .name = "dma page reg", +- .start = 0x0080, +- .end = 0x008f, +- .flags = IORESOURCE_BUSY | IORESOURCE_IO +-}, { +- .name = "pic2", +- .start = 0x00a0, +- .end = 0x00a1, +- .flags = IORESOURCE_BUSY | IORESOURCE_IO +-}, { +- .name = "dma2", +- .start = 0x00c0, +- .end = 0x00df, +- .flags = IORESOURCE_BUSY | IORESOURCE_IO +-}, { +- .name = "fpu", +- .start = 0x00f0, +- .end = 0x00ff, +- .flags = IORESOURCE_BUSY | IORESOURCE_IO +-} }; +- + #define ROMSIGNATURE 0xaa55 + + static int __init romsignature(const unsigned char *rom) +@@ -260,10 +180,9 @@ static void __init probe_roms(void) + * Request address space for all standard RAM and ROM resources + * and also for regions reported as reserved by the e820. + */ +-static void __init +-legacy_init_iomem_resources(struct resource *code_resource, +- struct resource *data_resource, +- struct resource *bss_resource) ++void __init init_iomem_resources(struct resource *code_resource, ++ struct resource *data_resource, ++ struct resource *bss_resource) + { + int i; + +@@ -305,35 +224,6 @@ legacy_init_iomem_resources(struct resource *code_resource, + } + } + +-/* +- * Request address space for all standard resources +- * +- * This is called just before pcibios_init(), which is also a +- * subsys_initcall, but is linked in later (in arch/i386/pci/common.c). +- */ +-static int __init request_standard_resources(void) +-{ +- int i; +- +- printk("Setting up standard PCI resources\n"); +- if (efi_enabled) +- efi_initialize_iomem_resources(&code_resource, +- &data_resource, &bss_resource); +- else +- legacy_init_iomem_resources(&code_resource, +- &data_resource, &bss_resource); +- +- /* EFI systems may still have VGA */ +- request_resource(&iomem_resource, &video_ram_resource); +- +- /* request I/O space for devices used on all i[345]86 PCs */ +- for (i = 0; i < ARRAY_SIZE(standard_io_resources); i++) +- request_resource(&ioport_resource, &standard_io_resources[i]); +- return 0; +-} +- +-subsys_initcall(request_standard_resources); +- + #if defined(CONFIG_PM) && defined(CONFIG_HIBERNATION) + /** + * e820_mark_nosave_regions - Find the ranges of physical addresses that do not +@@ -370,19 +260,17 @@ void __init add_memory_region(unsigned long long start, + { + int x; + +- if (!efi_enabled) { +- x = e820.nr_map; +- +- if (x == E820MAX) { +- printk(KERN_ERR "Ooops! Too many entries in the memory map!\n"); +- return; +- } ++ x = e820.nr_map; + +- e820.map[x].addr = start; +- e820.map[x].size = size; +- e820.map[x].type = type; +- e820.nr_map++; ++ if (x == E820MAX) { ++ printk(KERN_ERR "Ooops! Too many entries in the memory map!\n"); ++ return; + } ++ ++ e820.map[x].addr = start; ++ e820.map[x].size = size; ++ e820.map[x].type = type; ++ e820.nr_map++; + } /* add_memory_region */ + + /* +@@ -598,29 +486,6 @@ int __init copy_e820_map(struct e820entry * biosmap, int nr_map) + } + + /* +- * Callback for efi_memory_walk. 
+- */ +-static int __init +-efi_find_max_pfn(unsigned long start, unsigned long end, void *arg) +-{ +- unsigned long *max_pfn = arg, pfn; +- +- if (start < end) { +- pfn = PFN_UP(end -1); +- if (pfn > *max_pfn) +- *max_pfn = pfn; +- } +- return 0; +-} +- +-static int __init +-efi_memory_present_wrapper(unsigned long start, unsigned long end, void *arg) +-{ +- memory_present(0, PFN_UP(start), PFN_DOWN(end)); +- return 0; +-} +- +-/* + * Find the highest page frame number we have available + */ + void __init find_max_pfn(void) +@@ -628,11 +493,6 @@ void __init find_max_pfn(void) + int i; + + max_pfn = 0; +- if (efi_enabled) { +- efi_memmap_walk(efi_find_max_pfn, &max_pfn); +- efi_memmap_walk(efi_memory_present_wrapper, NULL); +- return; +- } + + for (i = 0; i < e820.nr_map; i++) { + unsigned long start, end; +@@ -650,34 +510,12 @@ void __init find_max_pfn(void) + } + + /* +- * Free all available memory for boot time allocation. Used +- * as a callback function by efi_memory_walk() +- */ +- +-static int __init +-free_available_memory(unsigned long start, unsigned long end, void *arg) +-{ +- /* check max_low_pfn */ +- if (start >= (max_low_pfn << PAGE_SHIFT)) +- return 0; +- if (end >= (max_low_pfn << PAGE_SHIFT)) +- end = max_low_pfn << PAGE_SHIFT; +- if (start < end) +- free_bootmem(start, end - start); +- +- return 0; +-} +-/* + * Register fully available low RAM pages with the bootmem allocator. + */ + void __init register_bootmem_low_pages(unsigned long max_low_pfn) + { + int i; + +- if (efi_enabled) { +- efi_memmap_walk(free_available_memory, NULL); +- return; +- } + for (i = 0; i < e820.nr_map; i++) { + unsigned long curr_pfn, last_pfn, size; + /* +@@ -785,56 +623,12 @@ void __init print_memory_map(char *who) + } + } + +-static __init __always_inline void efi_limit_regions(unsigned long long size) +-{ +- unsigned long long current_addr = 0; +- efi_memory_desc_t *md, *next_md; +- void *p, *p1; +- int i, j; +- +- j = 0; +- p1 = memmap.map; +- for (p = p1, i = 0; p < memmap.map_end; p += memmap.desc_size, i++) { +- md = p; +- next_md = p1; +- current_addr = md->phys_addr + +- PFN_PHYS(md->num_pages); +- if (is_available_memory(md)) { +- if (md->phys_addr >= size) continue; +- memcpy(next_md, md, memmap.desc_size); +- if (current_addr >= size) { +- next_md->num_pages -= +- PFN_UP(current_addr-size); +- } +- p1 += memmap.desc_size; +- next_md = p1; +- j++; +- } else if ((md->attribute & EFI_MEMORY_RUNTIME) == +- EFI_MEMORY_RUNTIME) { +- /* In order to make runtime services +- * available we have to include runtime +- * memory regions in memory map */ +- memcpy(next_md, md, memmap.desc_size); +- p1 += memmap.desc_size; +- next_md = p1; +- j++; +- } +- } +- memmap.nr_map = j; +- memmap.map_end = memmap.map + +- (memmap.nr_map * memmap.desc_size); +-} +- + void __init limit_regions(unsigned long long size) + { + unsigned long long current_addr; + int i; + + print_memory_map("limit_regions start"); +- if (efi_enabled) { +- efi_limit_regions(size); +- return; +- } + for (i = 0; i < e820.nr_map; i++) { + current_addr = e820.map[i].addr + e820.map[i].size; + if (current_addr < size) +@@ -955,3 +749,14 @@ static int __init parse_memmap(char *arg) + return 0; + } + early_param("memmap", parse_memmap); ++void __init update_e820(void) ++{ ++ u8 nr_map; ++ ++ nr_map = e820.nr_map; ++ if (sanitize_e820_map(e820.map, &nr_map)) ++ return; ++ e820.nr_map = nr_map; ++ printk(KERN_INFO "modified physical RAM map:\n"); ++ print_memory_map("modified"); ++} +diff --git a/arch/x86/kernel/e820_64.c 
b/arch/x86/kernel/e820_64.c +index 04698e0..c617174 100644 +--- a/arch/x86/kernel/e820_64.c ++++ b/arch/x86/kernel/e820_64.c +@@ -1,4 +1,4 @@ +-/* ++/* + * Handle the memory map. + * The functions here do the job until bootmem takes over. + * +@@ -26,80 +26,87 @@ + #include + #include + #include ++#include + + struct e820map e820; + +-/* ++/* + * PFN of last memory page. + */ +-unsigned long end_pfn; +-EXPORT_SYMBOL(end_pfn); ++unsigned long end_pfn; + +-/* ++/* + * end_pfn only includes RAM, while end_pfn_map includes all e820 entries. + * The direct mapping extends to end_pfn_map, so that we can directly access + * apertures, ACPI and other tables without having to play with fixmaps. +- */ +-unsigned long end_pfn_map; ++ */ ++unsigned long end_pfn_map; + +-/* ++/* + * Last pfn which the user wants to use. + */ + static unsigned long __initdata end_user_pfn = MAXMEM>>PAGE_SHIFT; + +-extern struct resource code_resource, data_resource, bss_resource; +- +-/* Check for some hardcoded bad areas that early boot is not allowed to touch */ +-static inline int bad_addr(unsigned long *addrp, unsigned long size) +-{ +- unsigned long addr = *addrp, last = addr + size; +- +- /* various gunk below that needed for SMP startup */ +- if (addr < 0x8000) { +- *addrp = PAGE_ALIGN(0x8000); +- return 1; +- } +- +- /* direct mapping tables of the kernel */ +- if (last >= table_start<= ramdisk_image && addr < ramdisk_end) { +- *addrp = PAGE_ALIGN(ramdisk_end); +- return 1; +- } +- } ++/* ++ * Early reserved memory areas. ++ */ ++#define MAX_EARLY_RES 20 ++ ++struct early_res { ++ unsigned long start, end; ++}; ++static struct early_res early_res[MAX_EARLY_RES] __initdata = { ++ { 0, PAGE_SIZE }, /* BIOS data page */ ++#ifdef CONFIG_SMP ++ { SMP_TRAMPOLINE_BASE, SMP_TRAMPOLINE_BASE + 2*PAGE_SIZE }, + #endif +- /* kernel code */ +- if (last >= __pa_symbol(&_text) && addr < __pa_symbol(&_end)) { +- *addrp = PAGE_ALIGN(__pa_symbol(&_end)); +- return 1; ++ {} ++}; ++ ++void __init reserve_early(unsigned long start, unsigned long end) ++{ ++ int i; ++ struct early_res *r; ++ for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) { ++ r = &early_res[i]; ++ if (end > r->start && start < r->end) ++ panic("Overlapping early reservations %lx-%lx to %lx-%lx\n", ++ start, end, r->start, r->end); + } ++ if (i >= MAX_EARLY_RES) ++ panic("Too many early reservations"); ++ r = &early_res[i]; ++ r->start = start; ++ r->end = end; ++} + +- if (last >= ebda_addr && addr < ebda_addr + ebda_size) { +- *addrp = PAGE_ALIGN(ebda_addr + ebda_size); +- return 1; ++void __init early_res_to_bootmem(void) ++{ ++ int i; ++ for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) { ++ struct early_res *r = &early_res[i]; ++ reserve_bootmem_generic(r->start, r->end - r->start); + } ++} + +-#ifdef CONFIG_NUMA +- /* NUMA memory to node map */ +- if (last >= nodemap_addr && addr < nodemap_addr + nodemap_size) { +- *addrp = nodemap_addr + nodemap_size; +- return 1; ++/* Check for already reserved areas */ ++static inline int bad_addr(unsigned long *addrp, unsigned long size) ++{ ++ int i; ++ unsigned long addr = *addrp, last; ++ int changed = 0; ++again: ++ last = addr + size; ++ for (i = 0; i < MAX_EARLY_RES && early_res[i].end; i++) { ++ struct early_res *r = &early_res[i]; ++ if (last >= r->start && addr < r->end) { ++ *addrp = addr = r->end; ++ changed = 1; ++ goto again; ++ } + } +-#endif +- /* XXX ramdisk image here? 
*/ +- return 0; +-} ++ return changed; ++} + + /* + * This function checks if any part of the range is mapped +@@ -107,16 +114,18 @@ static inline int bad_addr(unsigned long *addrp, unsigned long size) + */ + int + e820_any_mapped(unsigned long start, unsigned long end, unsigned type) +-{ ++{ + int i; +- for (i = 0; i < e820.nr_map; i++) { +- struct e820entry *ei = &e820.map[i]; +- if (type && ei->type != type) ++ ++ for (i = 0; i < e820.nr_map; i++) { ++ struct e820entry *ei = &e820.map[i]; ++ ++ if (type && ei->type != type) + continue; + if (ei->addr >= end || ei->addr + ei->size <= start) +- continue; +- return 1; +- } ++ continue; ++ return 1; ++ } + return 0; + } + EXPORT_SYMBOL_GPL(e820_any_mapped); +@@ -127,11 +136,14 @@ EXPORT_SYMBOL_GPL(e820_any_mapped); + * Note: this function only works correct if the e820 table is sorted and + * not-overlapping, which is the case + */ +-int __init e820_all_mapped(unsigned long start, unsigned long end, unsigned type) ++int __init e820_all_mapped(unsigned long start, unsigned long end, ++ unsigned type) + { + int i; ++ + for (i = 0; i < e820.nr_map; i++) { + struct e820entry *ei = &e820.map[i]; ++ + if (type && ei->type != type) + continue; + /* is the region (part) in overlap with the current region ?*/ +@@ -143,65 +155,73 @@ int __init e820_all_mapped(unsigned long start, unsigned long end, unsigned type + */ + if (ei->addr <= start) + start = ei->addr + ei->size; +- /* if start is now at or beyond end, we're done, full coverage */ ++ /* ++ * if start is now at or beyond end, we're done, full ++ * coverage ++ */ + if (start >= end) +- return 1; /* we're done */ ++ return 1; + } + return 0; + } + +-/* +- * Find a free area in a specific range. +- */ +-unsigned long __init find_e820_area(unsigned long start, unsigned long end, unsigned size) +-{ +- int i; +- for (i = 0; i < e820.nr_map; i++) { +- struct e820entry *ei = &e820.map[i]; +- unsigned long addr = ei->addr, last; +- if (ei->type != E820_RAM) +- continue; +- if (addr < start) ++/* ++ * Find a free area in a specific range. ++ */ ++unsigned long __init find_e820_area(unsigned long start, unsigned long end, ++ unsigned size) ++{ ++ int i; ++ ++ for (i = 0; i < e820.nr_map; i++) { ++ struct e820entry *ei = &e820.map[i]; ++ unsigned long addr = ei->addr, last; ++ ++ if (ei->type != E820_RAM) ++ continue; ++ if (addr < start) + addr = start; +- if (addr > ei->addr + ei->size) +- continue; ++ if (addr > ei->addr + ei->size) ++ continue; + while (bad_addr(&addr, size) && addr+size <= ei->addr+ei->size) + ; + last = PAGE_ALIGN(addr) + size; + if (last > ei->addr + ei->size) + continue; +- if (last > end) ++ if (last > end) + continue; +- return addr; +- } +- return -1UL; +-} ++ return addr; ++ } ++ return -1UL; ++} + + /* + * Find the highest page frame number we have available + */ + unsigned long __init e820_end_of_ram(void) + { +- unsigned long end_pfn = 0; ++ unsigned long end_pfn; ++ + end_pfn = find_max_pfn_with_active_regions(); +- +- if (end_pfn > end_pfn_map) ++ ++ if (end_pfn > end_pfn_map) + end_pfn_map = end_pfn; + if (end_pfn_map > MAXMEM>>PAGE_SHIFT) + end_pfn_map = MAXMEM>>PAGE_SHIFT; + if (end_pfn > end_user_pfn) + end_pfn = end_user_pfn; +- if (end_pfn > end_pfn_map) +- end_pfn = end_pfn_map; ++ if (end_pfn > end_pfn_map) ++ end_pfn = end_pfn_map; + +- printk("end_pfn_map = %lu\n", end_pfn_map); +- return end_pfn; ++ printk(KERN_INFO "end_pfn_map = %lu\n", end_pfn_map); ++ return end_pfn; + } + + /* + * Mark e820 reserved areas as busy for the resource manager. 
+ */ +-void __init e820_reserve_resources(void) ++void __init e820_reserve_resources(struct resource *code_resource, ++ struct resource *data_resource, struct resource *bss_resource) + { + int i; + for (i = 0; i < e820.nr_map; i++) { +@@ -219,13 +239,13 @@ void __init e820_reserve_resources(void) + request_resource(&iomem_resource, res); + if (e820.map[i].type == E820_RAM) { + /* +- * We don't know which RAM region contains kernel data, +- * so we try it repeatedly and let the resource manager +- * test it. ++ * We don't know which RAM region contains kernel data, ++ * so we try it repeatedly and let the resource manager ++ * test it. + */ +- request_resource(res, &code_resource); +- request_resource(res, &data_resource); +- request_resource(res, &bss_resource); ++ request_resource(res, code_resource); ++ request_resource(res, data_resource); ++ request_resource(res, bss_resource); + #ifdef CONFIG_KEXEC + if (crashk_res.start != crashk_res.end) + request_resource(res, &crashk_res); +@@ -322,9 +342,9 @@ e820_register_active_regions(int nid, unsigned long start_pfn, + add_active_range(nid, ei_startpfn, ei_endpfn); + } + +-/* ++/* + * Add a memory region to the kernel e820 map. +- */ ++ */ + void __init add_memory_region(unsigned long start, unsigned long size, int type) + { + int x = e820.nr_map; +@@ -349,9 +369,7 @@ unsigned long __init e820_hole_size(unsigned long start, unsigned long end) + { + unsigned long start_pfn = start >> PAGE_SHIFT; + unsigned long end_pfn = end >> PAGE_SHIFT; +- unsigned long ei_startpfn; +- unsigned long ei_endpfn; +- unsigned long ram = 0; ++ unsigned long ei_startpfn, ei_endpfn, ram = 0; + int i; + + for (i = 0; i < e820.nr_map; i++) { +@@ -363,28 +381,31 @@ unsigned long __init e820_hole_size(unsigned long start, unsigned long end) + return end - start - (ram << PAGE_SHIFT); + } + +-void __init e820_print_map(char *who) ++static void __init e820_print_map(char *who) + { + int i; + + for (i = 0; i < e820.nr_map; i++) { + printk(KERN_INFO " %s: %016Lx - %016Lx ", who, +- (unsigned long long) e820.map[i].addr, +- (unsigned long long) (e820.map[i].addr + e820.map[i].size)); ++ (unsigned long long) e820.map[i].addr, ++ (unsigned long long) ++ (e820.map[i].addr + e820.map[i].size)); + switch (e820.map[i].type) { +- case E820_RAM: printk("(usable)\n"); +- break; ++ case E820_RAM: ++ printk(KERN_CONT "(usable)\n"); ++ break; + case E820_RESERVED: +- printk("(reserved)\n"); +- break; ++ printk(KERN_CONT "(reserved)\n"); ++ break; + case E820_ACPI: +- printk("(ACPI data)\n"); +- break; ++ printk(KERN_CONT "(ACPI data)\n"); ++ break; + case E820_NVS: +- printk("(ACPI NVS)\n"); +- break; +- default: printk("type %u\n", e820.map[i].type); +- break; ++ printk(KERN_CONT "(ACPI NVS)\n"); ++ break; ++ default: ++ printk(KERN_CONT "type %u\n", e820.map[i].type); ++ break; + } + } + } +@@ -392,11 +413,11 @@ void __init e820_print_map(char *who) + /* + * Sanitize the BIOS e820 map. + * +- * Some e820 responses include overlapping entries. The following ++ * Some e820 responses include overlapping entries. The following + * replaces the original e820 map with a new one, removing overlaps. 
+ * + */ +-static int __init sanitize_e820_map(struct e820entry * biosmap, char * pnr_map) ++static int __init sanitize_e820_map(struct e820entry *biosmap, char *pnr_map) + { + struct change_member { + struct e820entry *pbios; /* pointer to original bios entry */ +@@ -416,7 +437,8 @@ static int __init sanitize_e820_map(struct e820entry * biosmap, char * pnr_map) + int i; + + /* +- Visually we're performing the following (1,2,3,4 = memory types)... ++ Visually we're performing the following ++ (1,2,3,4 = memory types)... + + Sample memory map (w/overlaps): + ____22__________________ +@@ -458,22 +480,23 @@ static int __init sanitize_e820_map(struct e820entry * biosmap, char * pnr_map) + old_nr = *pnr_map; + + /* bail out if we find any unreasonable addresses in bios map */ +- for (i=0; iaddr = biosmap[i].addr; + change_point[chgidx++]->pbios = &biosmap[i]; +- change_point[chgidx]->addr = biosmap[i].addr + biosmap[i].size; ++ change_point[chgidx]->addr = biosmap[i].addr + ++ biosmap[i].size; + change_point[chgidx++]->pbios = &biosmap[i]; + } + } +@@ -483,75 +506,106 @@ static int __init sanitize_e820_map(struct e820entry * biosmap, char * pnr_map) + still_changing = 1; + while (still_changing) { + still_changing = 0; +- for (i=1; i < chg_nr; i++) { +- /* if > , swap */ +- /* or, if current= & last=, swap */ +- if ((change_point[i]->addr < change_point[i-1]->addr) || +- ((change_point[i]->addr == change_point[i-1]->addr) && +- (change_point[i]->addr == change_point[i]->pbios->addr) && +- (change_point[i-1]->addr != change_point[i-1]->pbios->addr)) +- ) +- { ++ for (i = 1; i < chg_nr; i++) { ++ unsigned long long curaddr, lastaddr; ++ unsigned long long curpbaddr, lastpbaddr; ++ ++ curaddr = change_point[i]->addr; ++ lastaddr = change_point[i - 1]->addr; ++ curpbaddr = change_point[i]->pbios->addr; ++ lastpbaddr = change_point[i - 1]->pbios->addr; ++ ++ /* ++ * swap entries, when: ++ * ++ * curaddr > lastaddr or ++ * curaddr == lastaddr and curaddr == curpbaddr and ++ * lastaddr != lastpbaddr ++ */ ++ if (curaddr < lastaddr || ++ (curaddr == lastaddr && curaddr == curpbaddr && ++ lastaddr != lastpbaddr)) { + change_tmp = change_point[i]; + change_point[i] = change_point[i-1]; + change_point[i-1] = change_tmp; +- still_changing=1; ++ still_changing = 1; + } + } + } + + /* create a new bios memory map, removing overlaps */ +- overlap_entries=0; /* number of entries in the overlap table */ +- new_bios_entry=0; /* index for creating new bios map entries */ ++ overlap_entries = 0; /* number of entries in the overlap table */ ++ new_bios_entry = 0; /* index for creating new bios map entries */ + last_type = 0; /* start with undefined memory type */ + last_addr = 0; /* start with 0 as last starting address */ ++ + /* loop through change-points, determining affect on the new bios map */ +- for (chgidx=0; chgidx < chg_nr; chgidx++) +- { ++ for (chgidx = 0; chgidx < chg_nr; chgidx++) { + /* keep track of all overlapping bios entries */ +- if (change_point[chgidx]->addr == change_point[chgidx]->pbios->addr) +- { +- /* add map entry to overlap list (> 1 entry implies an overlap) */ +- overlap_list[overlap_entries++]=change_point[chgidx]->pbios; +- } +- else +- { +- /* remove entry from list (order independent, so swap with last) */ +- for (i=0; ipbios) +- overlap_list[i] = overlap_list[overlap_entries-1]; ++ if (change_point[chgidx]->addr == ++ change_point[chgidx]->pbios->addr) { ++ /* ++ * add map entry to overlap list (> 1 entry ++ * implies an overlap) ++ */ ++ overlap_list[overlap_entries++] = ++ 
change_point[chgidx]->pbios; ++ } else { ++ /* ++ * remove entry from list (order independent, ++ * so swap with last) ++ */ ++ for (i = 0; i < overlap_entries; i++) { ++ if (overlap_list[i] == ++ change_point[chgidx]->pbios) ++ overlap_list[i] = ++ overlap_list[overlap_entries-1]; + } + overlap_entries--; + } +- /* if there are overlapping entries, decide which "type" to use */ +- /* (larger value takes precedence -- 1=usable, 2,3,4,4+=unusable) */ ++ /* ++ * if there are overlapping entries, decide which ++ * "type" to use (larger value takes precedence -- ++ * 1=usable, 2,3,4,4+=unusable) ++ */ + current_type = 0; +- for (i=0; itype > current_type) + current_type = overlap_list[i]->type; +- /* continue building up new bios map based on this information */ ++ /* ++ * continue building up new bios map based on this ++ * information ++ */ + if (current_type != last_type) { + if (last_type != 0) { + new_bios[new_bios_entry].size = + change_point[chgidx]->addr - last_addr; +- /* move forward only if the new size was non-zero */ ++ /* ++ * move forward only if the new size ++ * was non-zero ++ */ + if (new_bios[new_bios_entry].size != 0) ++ /* ++ * no more space left for new ++ * bios entries ? ++ */ + if (++new_bios_entry >= E820MAX) +- break; /* no more space left for new bios entries */ ++ break; + } + if (current_type != 0) { +- new_bios[new_bios_entry].addr = change_point[chgidx]->addr; ++ new_bios[new_bios_entry].addr = ++ change_point[chgidx]->addr; + new_bios[new_bios_entry].type = current_type; +- last_addr=change_point[chgidx]->addr; ++ last_addr = change_point[chgidx]->addr; + } + last_type = current_type; + } + } +- new_nr = new_bios_entry; /* retain count for new bios entries */ ++ /* retain count for new bios entries */ ++ new_nr = new_bios_entry; + + /* copy new bios mapping into original location */ +- memcpy(biosmap, new_bios, new_nr*sizeof(struct e820entry)); ++ memcpy(biosmap, new_bios, new_nr * sizeof(struct e820entry)); + *pnr_map = new_nr; + + return 0; +@@ -566,7 +620,7 @@ static int __init sanitize_e820_map(struct e820entry * biosmap, char * pnr_map) + * will have given us a memory map that we can use to properly + * set up memory. If we aren't, we'll fake a memory map. + */ +-static int __init copy_e820_map(struct e820entry * biosmap, int nr_map) ++static int __init copy_e820_map(struct e820entry *biosmap, int nr_map) + { + /* Only one memory region (or negative)? Ignore it */ + if (nr_map < 2) +@@ -583,18 +637,20 @@ static int __init copy_e820_map(struct e820entry * biosmap, int nr_map) + return -1; + + add_memory_region(start, size, type); +- } while (biosmap++,--nr_map); ++ } while (biosmap++, --nr_map); + return 0; + } + +-void early_panic(char *msg) ++static void early_panic(char *msg) + { + early_printk(msg); + panic(msg); + } + +-void __init setup_memory_region(void) ++/* We're not void only for x86 32-bit compat */ ++char * __init machine_specific_memory_setup(void) + { ++ char *who = "BIOS-e820"; + /* + * Try to copy the BIOS-supplied E820-map. + * +@@ -605,7 +661,10 @@ void __init setup_memory_region(void) + if (copy_e820_map(boot_params.e820_map, boot_params.e820_entries) < 0) + early_panic("Cannot find a valid memory map"); + printk(KERN_INFO "BIOS-provided physical RAM map:\n"); +- e820_print_map("BIOS-e820"); ++ e820_print_map(who); ++ ++ /* In case someone cares... 
*/ ++ return who; + } + + static int __init parse_memopt(char *p) +@@ -613,9 +672,9 @@ static int __init parse_memopt(char *p) + if (!p) + return -EINVAL; + end_user_pfn = memparse(p, &p); +- end_user_pfn >>= PAGE_SHIFT; ++ end_user_pfn >>= PAGE_SHIFT; + return 0; +-} ++} + early_param("mem", parse_memopt); + + static int userdef __initdata; +@@ -627,9 +686,9 @@ static int __init parse_memmap_opt(char *p) + + if (!strcmp(p, "exactmap")) { + #ifdef CONFIG_CRASH_DUMP +- /* If we are doing a crash dump, we +- * still need to know the real mem +- * size before original memory map is ++ /* ++ * If we are doing a crash dump, we still need to know ++ * the real mem size before original memory map is + * reset. + */ + e820_register_active_regions(0, 0, -1UL); +@@ -646,6 +705,8 @@ static int __init parse_memmap_opt(char *p) + mem_size = memparse(p, &p); + if (p == oldp) + return -EINVAL; ++ ++ userdef = 1; + if (*p == '@') { + start_at = memparse(p+1, &p); + add_memory_region(start_at, mem_size, E820_RAM); +@@ -665,11 +726,29 @@ early_param("memmap", parse_memmap_opt); + void __init finish_e820_parsing(void) + { + if (userdef) { ++ char nr = e820.nr_map; ++ ++ if (sanitize_e820_map(e820.map, &nr) < 0) ++ early_panic("Invalid user supplied memory map"); ++ e820.nr_map = nr; ++ + printk(KERN_INFO "user-defined physical RAM map:\n"); + e820_print_map("user"); + } + } + ++void __init update_e820(void) ++{ ++ u8 nr_map; ++ ++ nr_map = e820.nr_map; ++ if (sanitize_e820_map(e820.map, &nr_map)) ++ return; ++ e820.nr_map = nr_map; ++ printk(KERN_INFO "modified physical RAM map:\n"); ++ e820_print_map("modified"); ++} ++ + unsigned long pci_mem_start = 0xaeedbabe; + EXPORT_SYMBOL(pci_mem_start); + +@@ -713,8 +792,10 @@ __init void e820_setup_gap(void) + + if (!found) { + gapstart = (end_pfn << PAGE_SHIFT) + 1024*1024; +- printk(KERN_ERR "PCI: Warning: Cannot find a gap in the 32bit address range\n" +- KERN_ERR "PCI: Unassigned devices with 32bit resource registers may break!\n"); ++ printk(KERN_ERR "PCI: Warning: Cannot find a gap in the 32bit " ++ "address range\n" ++ KERN_ERR "PCI: Unassigned devices with 32bit resource " ++ "registers may break!\n"); + } + + /* +@@ -727,8 +808,9 @@ __init void e820_setup_gap(void) + /* Fun with two's complement */ + pci_mem_start = (gapstart + round) & -round; + +- printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n", +- pci_mem_start, gapstart, gapsize); ++ printk(KERN_INFO ++ "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n", ++ pci_mem_start, gapstart, gapsize); + } + + int __init arch_get_ram_range(int slot, u64 *addr, u64 *size) +diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c +index 88bb83e..9f51e1e 100644 +--- a/arch/x86/kernel/early-quirks.c ++++ b/arch/x86/kernel/early-quirks.c +@@ -21,7 +21,33 @@ + #include + #endif + +-static void __init via_bugs(void) ++static void __init fix_hypertransport_config(int num, int slot, int func) ++{ ++ u32 htcfg; ++ /* ++ * we found a hypertransport bus ++ * make sure that we are broadcasting ++ * interrupts to all cpus on the ht bus ++ * if we're using extended apic ids ++ */ ++ htcfg = read_pci_config(num, slot, func, 0x68); ++ if (htcfg & (1 << 18)) { ++ printk(KERN_INFO "Detected use of extended apic ids " ++ "on hypertransport bus\n"); ++ if ((htcfg & (1 << 17)) == 0) { ++ printk(KERN_INFO "Enabling hypertransport extended " ++ "apic interrupt broadcast\n"); ++ printk(KERN_INFO "Note this is a bios bug, " ++ "please contact your hw vendor\n"); ++ htcfg |= (1 << 
17); ++ write_pci_config(num, slot, func, 0x68, htcfg); ++ } ++ } ++ ++ ++} ++ ++static void __init via_bugs(int num, int slot, int func) + { + #ifdef CONFIG_GART_IOMMU + if ((end_pfn > MAX_DMA32_PFN || force_iommu) && +@@ -44,7 +70,7 @@ static int __init nvidia_hpet_check(struct acpi_table_header *header) + #endif /* CONFIG_X86_IO_APIC */ + #endif /* CONFIG_ACPI */ + +-static void __init nvidia_bugs(void) ++static void __init nvidia_bugs(int num, int slot, int func) + { + #ifdef CONFIG_ACPI + #ifdef CONFIG_X86_IO_APIC +@@ -72,7 +98,7 @@ static void __init nvidia_bugs(void) + + } + +-static void __init ati_bugs(void) ++static void __init ati_bugs(int num, int slot, int func) + { + #ifdef CONFIG_X86_IO_APIC + if (timer_over_8254 == 1) { +@@ -83,18 +109,67 @@ static void __init ati_bugs(void) + #endif + } + ++#define QFLAG_APPLY_ONCE 0x1 ++#define QFLAG_APPLIED 0x2 ++#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED) + struct chipset { +- u16 vendor; +- void (*f)(void); ++ u32 vendor; ++ u32 device; ++ u32 class; ++ u32 class_mask; ++ u32 flags; ++ void (*f)(int num, int slot, int func); + }; + + static struct chipset early_qrk[] __initdata = { +- { PCI_VENDOR_ID_NVIDIA, nvidia_bugs }, +- { PCI_VENDOR_ID_VIA, via_bugs }, +- { PCI_VENDOR_ID_ATI, ati_bugs }, ++ { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, ++ PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, nvidia_bugs }, ++ { PCI_VENDOR_ID_VIA, PCI_ANY_ID, ++ PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, via_bugs }, ++ { PCI_VENDOR_ID_ATI, PCI_ANY_ID, ++ PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, ati_bugs }, ++ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB, ++ PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, fix_hypertransport_config }, + {} + }; + ++static void __init check_dev_quirk(int num, int slot, int func) ++{ ++ u16 class; ++ u16 vendor; ++ u16 device; ++ u8 type; ++ int i; ++ ++ class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE); ++ ++ if (class == 0xffff) ++ return; ++ ++ vendor = read_pci_config_16(num, slot, func, PCI_VENDOR_ID); ++ ++ device = read_pci_config_16(num, slot, func, PCI_DEVICE_ID); ++ ++ for (i = 0; early_qrk[i].f != NULL; i++) { ++ if (((early_qrk[i].vendor == PCI_ANY_ID) || ++ (early_qrk[i].vendor == vendor)) && ++ ((early_qrk[i].device == PCI_ANY_ID) || ++ (early_qrk[i].device == device)) && ++ (!((early_qrk[i].class ^ class) & ++ early_qrk[i].class_mask))) { ++ if ((early_qrk[i].flags & ++ QFLAG_DONE) != QFLAG_DONE) ++ early_qrk[i].f(num, slot, func); ++ early_qrk[i].flags |= QFLAG_APPLIED; ++ } ++ } ++ ++ type = read_pci_config_byte(num, slot, func, ++ PCI_HEADER_TYPE); ++ if (!(type & 0x80)) ++ return; ++} ++ + void __init early_quirks(void) + { + int num, slot, func; +@@ -103,36 +178,8 @@ void __init early_quirks(void) + return; + + /* Poor man's PCI discovery */ +- for (num = 0; num < 32; num++) { +- for (slot = 0; slot < 32; slot++) { +- for (func = 0; func < 8; func++) { +- u32 class; +- u32 vendor; +- u8 type; +- int i; +- class = read_pci_config(num,slot,func, +- PCI_CLASS_REVISION); +- if (class == 0xffffffff) +- break; +- +- if ((class >> 16) != PCI_CLASS_BRIDGE_PCI) +- continue; +- +- vendor = read_pci_config(num, slot, func, +- PCI_VENDOR_ID); +- vendor &= 0xffff; +- +- for (i = 0; early_qrk[i].f; i++) +- if (early_qrk[i].vendor == vendor) { +- early_qrk[i].f(); +- return; +- } +- +- type = read_pci_config_byte(num, slot, func, +- PCI_HEADER_TYPE); +- if (!(type & 0x80)) +- break; +- } +- } +- } ++ for (num = 0; num < 32; num++) ++ for (slot = 0; slot < 32; slot++) ++ for (func = 0; func < 8; 
func++) ++ check_dev_quirk(num, slot, func); + } +diff --git a/arch/x86/kernel/efi.c b/arch/x86/kernel/efi.c +new file mode 100644 +index 0000000..1411324 +--- /dev/null ++++ b/arch/x86/kernel/efi.c +@@ -0,0 +1,512 @@ ++/* ++ * Common EFI (Extensible Firmware Interface) support functions ++ * Based on Extensible Firmware Interface Specification version 1.0 ++ * ++ * Copyright (C) 1999 VA Linux Systems ++ * Copyright (C) 1999 Walt Drummond ++ * Copyright (C) 1999-2002 Hewlett-Packard Co. ++ * David Mosberger-Tang ++ * Stephane Eranian ++ * Copyright (C) 2005-2008 Intel Co. ++ * Fenghua Yu ++ * Bibo Mao ++ * Chandramouli Narayanan ++ * Huang Ying ++ * ++ * Copied from efi_32.c to eliminate the duplicated code between EFI ++ * 32/64 support code. --ying 2007-10-26 ++ * ++ * All EFI Runtime Services are not implemented yet as EFI only ++ * supports physical mode addressing on SoftSDV. This is to be fixed ++ * in a future version. --drummond 1999-07-20 ++ * ++ * Implemented EFI runtime services and virtual mode calls. --davidm ++ * ++ * Goutham Rao: ++ * Skip non-WB memory and ignore empty memory ranges. ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++#include ++#include ++#include ++#include ++#include ++ ++#define EFI_DEBUG 1 ++#define PFX "EFI: " ++ ++int efi_enabled; ++EXPORT_SYMBOL(efi_enabled); ++ ++struct efi efi; ++EXPORT_SYMBOL(efi); ++ ++struct efi_memory_map memmap; ++ ++struct efi efi_phys __initdata; ++static efi_system_table_t efi_systab __initdata; ++ ++static int __init setup_noefi(char *arg) ++{ ++ efi_enabled = 0; ++ return 0; ++} ++early_param("noefi", setup_noefi); ++ ++static efi_status_t virt_efi_get_time(efi_time_t *tm, efi_time_cap_t *tc) ++{ ++ return efi_call_virt2(get_time, tm, tc); ++} ++ ++static efi_status_t virt_efi_set_time(efi_time_t *tm) ++{ ++ return efi_call_virt1(set_time, tm); ++} ++ ++static efi_status_t virt_efi_get_wakeup_time(efi_bool_t *enabled, ++ efi_bool_t *pending, ++ efi_time_t *tm) ++{ ++ return efi_call_virt3(get_wakeup_time, ++ enabled, pending, tm); ++} ++ ++static efi_status_t virt_efi_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm) ++{ ++ return efi_call_virt2(set_wakeup_time, ++ enabled, tm); ++} ++ ++static efi_status_t virt_efi_get_variable(efi_char16_t *name, ++ efi_guid_t *vendor, ++ u32 *attr, ++ unsigned long *data_size, ++ void *data) ++{ ++ return efi_call_virt5(get_variable, ++ name, vendor, attr, ++ data_size, data); ++} ++ ++static efi_status_t virt_efi_get_next_variable(unsigned long *name_size, ++ efi_char16_t *name, ++ efi_guid_t *vendor) ++{ ++ return efi_call_virt3(get_next_variable, ++ name_size, name, vendor); ++} ++ ++static efi_status_t virt_efi_set_variable(efi_char16_t *name, ++ efi_guid_t *vendor, ++ unsigned long attr, ++ unsigned long data_size, ++ void *data) ++{ ++ return efi_call_virt5(set_variable, ++ name, vendor, attr, ++ data_size, data); ++} ++ ++static efi_status_t virt_efi_get_next_high_mono_count(u32 *count) ++{ ++ return efi_call_virt1(get_next_high_mono_count, count); ++} ++ ++static void virt_efi_reset_system(int reset_type, ++ efi_status_t status, ++ unsigned long data_size, ++ efi_char16_t *data) ++{ ++ efi_call_virt4(reset_system, reset_type, status, ++ data_size, data); ++} ++ ++static efi_status_t virt_efi_set_virtual_address_map( ++ unsigned long memory_map_size, ++ unsigned long descriptor_size, ++ u32 descriptor_version, ++ efi_memory_desc_t *virtual_map) ++{ ++ return efi_call_virt4(set_virtual_address_map, ++ 
memory_map_size, descriptor_size, ++ descriptor_version, virtual_map); ++} ++ ++static efi_status_t __init phys_efi_set_virtual_address_map( ++ unsigned long memory_map_size, ++ unsigned long descriptor_size, ++ u32 descriptor_version, ++ efi_memory_desc_t *virtual_map) ++{ ++ efi_status_t status; ++ ++ efi_call_phys_prelog(); ++ status = efi_call_phys4(efi_phys.set_virtual_address_map, ++ memory_map_size, descriptor_size, ++ descriptor_version, virtual_map); ++ efi_call_phys_epilog(); ++ return status; ++} ++ ++static efi_status_t __init phys_efi_get_time(efi_time_t *tm, ++ efi_time_cap_t *tc) ++{ ++ efi_status_t status; ++ ++ efi_call_phys_prelog(); ++ status = efi_call_phys2(efi_phys.get_time, tm, tc); ++ efi_call_phys_epilog(); ++ return status; ++} ++ ++int efi_set_rtc_mmss(unsigned long nowtime) ++{ ++ int real_seconds, real_minutes; ++ efi_status_t status; ++ efi_time_t eft; ++ efi_time_cap_t cap; ++ ++ status = efi.get_time(&eft, &cap); ++ if (status != EFI_SUCCESS) { ++ printk(KERN_ERR "Oops: efitime: can't read time!\n"); ++ return -1; ++ } ++ ++ real_seconds = nowtime % 60; ++ real_minutes = nowtime / 60; ++ if (((abs(real_minutes - eft.minute) + 15)/30) & 1) ++ real_minutes += 30; ++ real_minutes %= 60; ++ eft.minute = real_minutes; ++ eft.second = real_seconds; ++ ++ status = efi.set_time(&eft); ++ if (status != EFI_SUCCESS) { ++ printk(KERN_ERR "Oops: efitime: can't write time!\n"); ++ return -1; ++ } ++ return 0; ++} ++ ++unsigned long efi_get_time(void) ++{ ++ efi_status_t status; ++ efi_time_t eft; ++ efi_time_cap_t cap; ++ ++ status = efi.get_time(&eft, &cap); ++ if (status != EFI_SUCCESS) ++ printk(KERN_ERR "Oops: efitime: can't read time!\n"); ++ ++ return mktime(eft.year, eft.month, eft.day, eft.hour, ++ eft.minute, eft.second); ++} ++ ++#if EFI_DEBUG ++static void __init print_efi_memmap(void) ++{ ++ efi_memory_desc_t *md; ++ void *p; ++ int i; ++ ++ for (p = memmap.map, i = 0; ++ p < memmap.map_end; ++ p += memmap.desc_size, i++) { ++ md = p; ++ printk(KERN_INFO PFX "mem%02u: type=%u, attr=0x%llx, " ++ "range=[0x%016llx-0x%016llx) (%lluMB)\n", ++ i, md->type, md->attribute, md->phys_addr, ++ md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT), ++ (md->num_pages >> (20 - EFI_PAGE_SHIFT))); ++ } ++} ++#endif /* EFI_DEBUG */ ++ ++void __init efi_init(void) ++{ ++ efi_config_table_t *config_tables; ++ efi_runtime_services_t *runtime; ++ efi_char16_t *c16; ++ char vendor[100] = "unknown"; ++ int i = 0; ++ void *tmp; ++ ++#ifdef CONFIG_X86_32 ++ efi_phys.systab = (efi_system_table_t *)boot_params.efi_info.efi_systab; ++ memmap.phys_map = (void *)boot_params.efi_info.efi_memmap; ++#else ++ efi_phys.systab = (efi_system_table_t *) ++ (boot_params.efi_info.efi_systab | ++ ((__u64)boot_params.efi_info.efi_systab_hi<<32)); ++ memmap.phys_map = (void *) ++ (boot_params.efi_info.efi_memmap | ++ ((__u64)boot_params.efi_info.efi_memmap_hi<<32)); ++#endif ++ memmap.nr_map = boot_params.efi_info.efi_memmap_size / ++ boot_params.efi_info.efi_memdesc_size; ++ memmap.desc_version = boot_params.efi_info.efi_memdesc_version; ++ memmap.desc_size = boot_params.efi_info.efi_memdesc_size; ++ ++ efi.systab = early_ioremap((unsigned long)efi_phys.systab, ++ sizeof(efi_system_table_t)); ++ if (efi.systab == NULL) ++ printk(KERN_ERR "Couldn't map the EFI system table!\n"); ++ memcpy(&efi_systab, efi.systab, sizeof(efi_system_table_t)); ++ early_iounmap(efi.systab, sizeof(efi_system_table_t)); ++ efi.systab = &efi_systab; ++ ++ /* ++ * Verify the EFI Table ++ */ ++ if (efi.systab->hdr.signature != 
EFI_SYSTEM_TABLE_SIGNATURE) ++ printk(KERN_ERR "EFI system table signature incorrect!\n"); ++ if ((efi.systab->hdr.revision >> 16) == 0) ++ printk(KERN_ERR "Warning: EFI system table version " ++ "%d.%02d, expected 1.00 or greater!\n", ++ efi.systab->hdr.revision >> 16, ++ efi.systab->hdr.revision & 0xffff); ++ ++ /* ++ * Show what we know for posterity ++ */ ++ c16 = tmp = early_ioremap(efi.systab->fw_vendor, 2); ++ if (c16) { ++ for (i = 0; i < sizeof(vendor) && *c16; ++i) ++ vendor[i] = *c16++; ++ vendor[i] = '\0'; ++ } else ++ printk(KERN_ERR PFX "Could not map the firmware vendor!\n"); ++ early_iounmap(tmp, 2); ++ ++ printk(KERN_INFO "EFI v%u.%.02u by %s \n", ++ efi.systab->hdr.revision >> 16, ++ efi.systab->hdr.revision & 0xffff, vendor); ++ ++ /* ++ * Let's see what config tables the firmware passed to us. ++ */ ++ config_tables = early_ioremap( ++ efi.systab->tables, ++ efi.systab->nr_tables * sizeof(efi_config_table_t)); ++ if (config_tables == NULL) ++ printk(KERN_ERR "Could not map EFI Configuration Table!\n"); ++ ++ printk(KERN_INFO); ++ for (i = 0; i < efi.systab->nr_tables; i++) { ++ if (!efi_guidcmp(config_tables[i].guid, MPS_TABLE_GUID)) { ++ efi.mps = config_tables[i].table; ++ printk(" MPS=0x%lx ", config_tables[i].table); ++ } else if (!efi_guidcmp(config_tables[i].guid, ++ ACPI_20_TABLE_GUID)) { ++ efi.acpi20 = config_tables[i].table; ++ printk(" ACPI 2.0=0x%lx ", config_tables[i].table); ++ } else if (!efi_guidcmp(config_tables[i].guid, ++ ACPI_TABLE_GUID)) { ++ efi.acpi = config_tables[i].table; ++ printk(" ACPI=0x%lx ", config_tables[i].table); ++ } else if (!efi_guidcmp(config_tables[i].guid, ++ SMBIOS_TABLE_GUID)) { ++ efi.smbios = config_tables[i].table; ++ printk(" SMBIOS=0x%lx ", config_tables[i].table); ++ } else if (!efi_guidcmp(config_tables[i].guid, ++ HCDP_TABLE_GUID)) { ++ efi.hcdp = config_tables[i].table; ++ printk(" HCDP=0x%lx ", config_tables[i].table); ++ } else if (!efi_guidcmp(config_tables[i].guid, ++ UGA_IO_PROTOCOL_GUID)) { ++ efi.uga = config_tables[i].table; ++ printk(" UGA=0x%lx ", config_tables[i].table); ++ } ++ } ++ printk("\n"); ++ early_iounmap(config_tables, ++ efi.systab->nr_tables * sizeof(efi_config_table_t)); ++ ++ /* ++ * Check out the runtime services table. We need to map ++ * the runtime services table so that we can grab the physical ++ * address of several of the EFI runtime functions, needed to ++ * set the firmware into virtual mode. ++ */ ++ runtime = early_ioremap((unsigned long)efi.systab->runtime, ++ sizeof(efi_runtime_services_t)); ++ if (runtime != NULL) { ++ /* ++ * We will only need *early* access to the following ++ * two EFI runtime services before set_virtual_address_map ++ * is invoked. ++ */ ++ efi_phys.get_time = (efi_get_time_t *)runtime->get_time; ++ efi_phys.set_virtual_address_map = ++ (efi_set_virtual_address_map_t *) ++ runtime->set_virtual_address_map; ++ /* ++ * Make efi_get_time can be called before entering ++ * virtual mode. 
++ */ ++ efi.get_time = phys_efi_get_time; ++ } else ++ printk(KERN_ERR "Could not map the EFI runtime service " ++ "table!\n"); ++ early_iounmap(runtime, sizeof(efi_runtime_services_t)); ++ ++ /* Map the EFI memory map */ ++ memmap.map = early_ioremap((unsigned long)memmap.phys_map, ++ memmap.nr_map * memmap.desc_size); ++ if (memmap.map == NULL) ++ printk(KERN_ERR "Could not map the EFI memory map!\n"); ++ memmap.map_end = memmap.map + (memmap.nr_map * memmap.desc_size); ++ if (memmap.desc_size != sizeof(efi_memory_desc_t)) ++ printk(KERN_WARNING "Kernel-defined memdesc" ++ "doesn't match the one from EFI!\n"); ++ ++ /* Setup for EFI runtime service */ ++ reboot_type = BOOT_EFI; ++ ++#if EFI_DEBUG ++ print_efi_memmap(); ++#endif ++} ++ ++#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) ++static void __init runtime_code_page_mkexec(void) ++{ ++ efi_memory_desc_t *md; ++ unsigned long end; ++ void *p; ++ ++ if (!(__supported_pte_mask & _PAGE_NX)) ++ return; ++ ++ /* Make EFI runtime service code area executable */ ++ for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { ++ md = p; ++ end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT); ++ if (md->type == EFI_RUNTIME_SERVICES_CODE && ++ (end >> PAGE_SHIFT) <= max_pfn_mapped) { ++ set_memory_x(md->virt_addr, md->num_pages); ++ set_memory_uc(md->virt_addr, md->num_pages); ++ } ++ } ++ __flush_tlb_all(); ++} ++#else ++static inline void __init runtime_code_page_mkexec(void) { } ++#endif ++ ++/* ++ * This function will switch the EFI runtime services to virtual mode. ++ * Essentially, look through the EFI memmap and map every region that ++ * has the runtime attribute bit set in its memory descriptor and update ++ * that memory descriptor with the virtual address obtained from ioremap(). ++ * This enables the runtime services to be called without having to ++ * thunk back into physical mode for every invocation. ++ */ ++void __init efi_enter_virtual_mode(void) ++{ ++ efi_memory_desc_t *md; ++ efi_status_t status; ++ unsigned long end; ++ void *p; ++ ++ efi.systab = NULL; ++ for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { ++ md = p; ++ if (!(md->attribute & EFI_MEMORY_RUNTIME)) ++ continue; ++ end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT); ++ if ((md->attribute & EFI_MEMORY_WB) && ++ ((end >> PAGE_SHIFT) <= max_pfn_mapped)) ++ md->virt_addr = (unsigned long)__va(md->phys_addr); ++ else ++ md->virt_addr = (unsigned long) ++ efi_ioremap(md->phys_addr, ++ md->num_pages << EFI_PAGE_SHIFT); ++ if (!md->virt_addr) ++ printk(KERN_ERR PFX "ioremap of 0x%llX failed!\n", ++ (unsigned long long)md->phys_addr); ++ if ((md->phys_addr <= (unsigned long)efi_phys.systab) && ++ ((unsigned long)efi_phys.systab < end)) ++ efi.systab = (efi_system_table_t *)(unsigned long) ++ (md->virt_addr - md->phys_addr + ++ (unsigned long)efi_phys.systab); ++ } ++ ++ BUG_ON(!efi.systab); ++ ++ status = phys_efi_set_virtual_address_map( ++ memmap.desc_size * memmap.nr_map, ++ memmap.desc_size, ++ memmap.desc_version, ++ memmap.phys_map); ++ ++ if (status != EFI_SUCCESS) { ++ printk(KERN_ALERT "Unable to switch EFI into virtual mode " ++ "(status=%lx)!\n", status); ++ panic("EFI call to SetVirtualAddressMap() failed!"); ++ } ++ ++ /* ++ * Now that EFI is in virtual mode, update the function ++ * pointers in the runtime service table to the new virtual addresses. ++ * ++ * Call EFI services through wrapper functions. 
++ */ ++ efi.get_time = virt_efi_get_time; ++ efi.set_time = virt_efi_set_time; ++ efi.get_wakeup_time = virt_efi_get_wakeup_time; ++ efi.set_wakeup_time = virt_efi_set_wakeup_time; ++ efi.get_variable = virt_efi_get_variable; ++ efi.get_next_variable = virt_efi_get_next_variable; ++ efi.set_variable = virt_efi_set_variable; ++ efi.get_next_high_mono_count = virt_efi_get_next_high_mono_count; ++ efi.reset_system = virt_efi_reset_system; ++ efi.set_virtual_address_map = virt_efi_set_virtual_address_map; ++ runtime_code_page_mkexec(); ++ early_iounmap(memmap.map, memmap.nr_map * memmap.desc_size); ++ memmap.map = NULL; ++} ++ ++/* ++ * Convenience functions to obtain memory types and attributes ++ */ ++u32 efi_mem_type(unsigned long phys_addr) ++{ ++ efi_memory_desc_t *md; ++ void *p; ++ ++ for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { ++ md = p; ++ if ((md->phys_addr <= phys_addr) && ++ (phys_addr < (md->phys_addr + ++ (md->num_pages << EFI_PAGE_SHIFT)))) ++ return md->type; ++ } ++ return 0; ++} ++ ++u64 efi_mem_attributes(unsigned long phys_addr) ++{ ++ efi_memory_desc_t *md; ++ void *p; ++ ++ for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { ++ md = p; ++ if ((md->phys_addr <= phys_addr) && ++ (phys_addr < (md->phys_addr + ++ (md->num_pages << EFI_PAGE_SHIFT)))) ++ return md->attribute; ++ } ++ return 0; ++} +diff --git a/arch/x86/kernel/efi_32.c b/arch/x86/kernel/efi_32.c +index e2be78f..cb91f98 100644 +--- a/arch/x86/kernel/efi_32.c ++++ b/arch/x86/kernel/efi_32.c +@@ -20,40 +20,15 @@ + */ + + #include +-#include +-#include + #include +-#include +-#include +-#include + #include +-#include + #include +-#include + +-#include + #include + #include + #include +-#include +-#include + #include + +-#define EFI_DEBUG 0 +-#define PFX "EFI: " +- +-extern efi_status_t asmlinkage efi_call_phys(void *, ...); +- +-struct efi efi; +-EXPORT_SYMBOL(efi); +-static struct efi efi_phys; +-struct efi_memory_map memmap; +- +-/* +- * We require an early boot_ioremap mapping mechanism initially +- */ +-extern void * boot_ioremap(unsigned long, unsigned long); +- + /* + * To make EFI call EFI runtime service in physical addressing mode we need + * prelog/epilog before/after the invocation to disable interrupt, to +@@ -62,16 +37,14 @@ extern void * boot_ioremap(unsigned long, unsigned long); + */ + + static unsigned long efi_rt_eflags; +-static DEFINE_SPINLOCK(efi_rt_lock); + static pgd_t efi_bak_pg_dir_pointer[2]; + +-static void efi_call_phys_prelog(void) __acquires(efi_rt_lock) ++void efi_call_phys_prelog(void) + { + unsigned long cr4; + unsigned long temp; +- struct Xgt_desc_struct gdt_descr; ++ struct desc_ptr gdt_descr; + +- spin_lock(&efi_rt_lock); + local_irq_save(efi_rt_eflags); + + /* +@@ -101,17 +74,17 @@ static void efi_call_phys_prelog(void) __acquires(efi_rt_lock) + /* + * After the lock is released, the original page table is restored. + */ +- local_flush_tlb(); ++ __flush_tlb_all(); + + gdt_descr.address = __pa(get_cpu_gdt_table(0)); + gdt_descr.size = GDT_SIZE - 1; + load_gdt(&gdt_descr); + } + +-static void efi_call_phys_epilog(void) __releases(efi_rt_lock) ++void efi_call_phys_epilog(void) + { + unsigned long cr4; +- struct Xgt_desc_struct gdt_descr; ++ struct desc_ptr gdt_descr; + + gdt_descr.address = (unsigned long)get_cpu_gdt_table(0); + gdt_descr.size = GDT_SIZE - 1; +@@ -132,586 +105,7 @@ static void efi_call_phys_epilog(void) __releases(efi_rt_lock) + /* + * After the lock is released, the original page table is restored. 
+ */ +- local_flush_tlb(); ++ __flush_tlb_all(); + + local_irq_restore(efi_rt_eflags); +- spin_unlock(&efi_rt_lock); +-} +- +-static efi_status_t +-phys_efi_set_virtual_address_map(unsigned long memory_map_size, +- unsigned long descriptor_size, +- u32 descriptor_version, +- efi_memory_desc_t *virtual_map) +-{ +- efi_status_t status; +- +- efi_call_phys_prelog(); +- status = efi_call_phys(efi_phys.set_virtual_address_map, +- memory_map_size, descriptor_size, +- descriptor_version, virtual_map); +- efi_call_phys_epilog(); +- return status; +-} +- +-static efi_status_t +-phys_efi_get_time(efi_time_t *tm, efi_time_cap_t *tc) +-{ +- efi_status_t status; +- +- efi_call_phys_prelog(); +- status = efi_call_phys(efi_phys.get_time, tm, tc); +- efi_call_phys_epilog(); +- return status; +-} +- +-inline int efi_set_rtc_mmss(unsigned long nowtime) +-{ +- int real_seconds, real_minutes; +- efi_status_t status; +- efi_time_t eft; +- efi_time_cap_t cap; +- +- spin_lock(&efi_rt_lock); +- status = efi.get_time(&eft, &cap); +- spin_unlock(&efi_rt_lock); +- if (status != EFI_SUCCESS) +- panic("Ooops, efitime: can't read time!\n"); +- real_seconds = nowtime % 60; +- real_minutes = nowtime / 60; +- +- if (((abs(real_minutes - eft.minute) + 15)/30) & 1) +- real_minutes += 30; +- real_minutes %= 60; +- +- eft.minute = real_minutes; +- eft.second = real_seconds; +- +- if (status != EFI_SUCCESS) { +- printk("Ooops: efitime: can't read time!\n"); +- return -1; +- } +- return 0; +-} +-/* +- * This is used during kernel init before runtime +- * services have been remapped and also during suspend, therefore, +- * we'll need to call both in physical and virtual modes. +- */ +-inline unsigned long efi_get_time(void) +-{ +- efi_status_t status; +- efi_time_t eft; +- efi_time_cap_t cap; +- +- if (efi.get_time) { +- /* if we are in virtual mode use remapped function */ +- status = efi.get_time(&eft, &cap); +- } else { +- /* we are in physical mode */ +- status = phys_efi_get_time(&eft, &cap); +- } +- +- if (status != EFI_SUCCESS) +- printk("Oops: efitime: can't read time status: 0x%lx\n",status); +- +- return mktime(eft.year, eft.month, eft.day, eft.hour, +- eft.minute, eft.second); +-} +- +-int is_available_memory(efi_memory_desc_t * md) +-{ +- if (!(md->attribute & EFI_MEMORY_WB)) +- return 0; +- +- switch (md->type) { +- case EFI_LOADER_CODE: +- case EFI_LOADER_DATA: +- case EFI_BOOT_SERVICES_CODE: +- case EFI_BOOT_SERVICES_DATA: +- case EFI_CONVENTIONAL_MEMORY: +- return 1; +- } +- return 0; +-} +- +-/* +- * We need to map the EFI memory map again after paging_init(). +- */ +-void __init efi_map_memmap(void) +-{ +- memmap.map = NULL; +- +- memmap.map = bt_ioremap((unsigned long) memmap.phys_map, +- (memmap.nr_map * memmap.desc_size)); +- if (memmap.map == NULL) +- printk(KERN_ERR PFX "Could not remap the EFI memmap!\n"); +- +- memmap.map_end = memmap.map + (memmap.nr_map * memmap.desc_size); +-} +- +-#if EFI_DEBUG +-static void __init print_efi_memmap(void) +-{ +- efi_memory_desc_t *md; +- void *p; +- int i; +- +- for (p = memmap.map, i = 0; p < memmap.map_end; p += memmap.desc_size, i++) { +- md = p; +- printk(KERN_INFO "mem%02u: type=%u, attr=0x%llx, " +- "range=[0x%016llx-0x%016llx) (%lluMB)\n", +- i, md->type, md->attribute, md->phys_addr, +- md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT), +- (md->num_pages >> (20 - EFI_PAGE_SHIFT))); +- } +-} +-#endif /* EFI_DEBUG */ +- +-/* +- * Walks the EFI memory map and calls CALLBACK once for each EFI +- * memory descriptor that has memory that is available for kernel use. 
+- */ +-void efi_memmap_walk(efi_freemem_callback_t callback, void *arg) +-{ +- int prev_valid = 0; +- struct range { +- unsigned long start; +- unsigned long end; +- } uninitialized_var(prev), curr; +- efi_memory_desc_t *md; +- unsigned long start, end; +- void *p; +- +- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { +- md = p; +- +- if ((md->num_pages == 0) || (!is_available_memory(md))) +- continue; +- +- curr.start = md->phys_addr; +- curr.end = curr.start + (md->num_pages << EFI_PAGE_SHIFT); +- +- if (!prev_valid) { +- prev = curr; +- prev_valid = 1; +- } else { +- if (curr.start < prev.start) +- printk(KERN_INFO PFX "Unordered memory map\n"); +- if (prev.end == curr.start) +- prev.end = curr.end; +- else { +- start = +- (unsigned long) (PAGE_ALIGN(prev.start)); +- end = (unsigned long) (prev.end & PAGE_MASK); +- if ((end > start) +- && (*callback) (start, end, arg) < 0) +- return; +- prev = curr; +- } +- } +- } +- if (prev_valid) { +- start = (unsigned long) PAGE_ALIGN(prev.start); +- end = (unsigned long) (prev.end & PAGE_MASK); +- if (end > start) +- (*callback) (start, end, arg); +- } +-} +- +-void __init efi_init(void) +-{ +- efi_config_table_t *config_tables; +- efi_runtime_services_t *runtime; +- efi_char16_t *c16; +- char vendor[100] = "unknown"; +- unsigned long num_config_tables; +- int i = 0; +- +- memset(&efi, 0, sizeof(efi) ); +- memset(&efi_phys, 0, sizeof(efi_phys)); +- +- efi_phys.systab = +- (efi_system_table_t *)boot_params.efi_info.efi_systab; +- memmap.phys_map = (void *)boot_params.efi_info.efi_memmap; +- memmap.nr_map = boot_params.efi_info.efi_memmap_size/ +- boot_params.efi_info.efi_memdesc_size; +- memmap.desc_version = boot_params.efi_info.efi_memdesc_version; +- memmap.desc_size = boot_params.efi_info.efi_memdesc_size; +- +- efi.systab = (efi_system_table_t *) +- boot_ioremap((unsigned long) efi_phys.systab, +- sizeof(efi_system_table_t)); +- /* +- * Verify the EFI Table +- */ +- if (efi.systab == NULL) +- printk(KERN_ERR PFX "Woah! Couldn't map the EFI system table.\n"); +- if (efi.systab->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE) +- printk(KERN_ERR PFX "Woah! EFI system table signature incorrect\n"); +- if ((efi.systab->hdr.revision >> 16) == 0) +- printk(KERN_ERR PFX "Warning: EFI system table version " +- "%d.%02d, expected 1.00 or greater\n", +- efi.systab->hdr.revision >> 16, +- efi.systab->hdr.revision & 0xffff); +- +- /* +- * Grab some details from the system table +- */ +- num_config_tables = efi.systab->nr_tables; +- config_tables = (efi_config_table_t *)efi.systab->tables; +- runtime = efi.systab->runtime; +- +- /* +- * Show what we know for posterity +- */ +- c16 = (efi_char16_t *) boot_ioremap(efi.systab->fw_vendor, 2); +- if (c16) { +- for (i = 0; i < (sizeof(vendor) - 1) && *c16; ++i) +- vendor[i] = *c16++; +- vendor[i] = '\0'; +- } else +- printk(KERN_ERR PFX "Could not map the firmware vendor!\n"); +- +- printk(KERN_INFO PFX "EFI v%u.%.02u by %s \n", +- efi.systab->hdr.revision >> 16, +- efi.systab->hdr.revision & 0xffff, vendor); +- +- /* +- * Let's see what config tables the firmware passed to us. 
+- */ +- config_tables = (efi_config_table_t *) +- boot_ioremap((unsigned long) config_tables, +- num_config_tables * sizeof(efi_config_table_t)); +- +- if (config_tables == NULL) +- printk(KERN_ERR PFX "Could not map EFI Configuration Table!\n"); +- +- efi.mps = EFI_INVALID_TABLE_ADDR; +- efi.acpi = EFI_INVALID_TABLE_ADDR; +- efi.acpi20 = EFI_INVALID_TABLE_ADDR; +- efi.smbios = EFI_INVALID_TABLE_ADDR; +- efi.sal_systab = EFI_INVALID_TABLE_ADDR; +- efi.boot_info = EFI_INVALID_TABLE_ADDR; +- efi.hcdp = EFI_INVALID_TABLE_ADDR; +- efi.uga = EFI_INVALID_TABLE_ADDR; +- +- for (i = 0; i < num_config_tables; i++) { +- if (efi_guidcmp(config_tables[i].guid, MPS_TABLE_GUID) == 0) { +- efi.mps = config_tables[i].table; +- printk(KERN_INFO " MPS=0x%lx ", config_tables[i].table); +- } else +- if (efi_guidcmp(config_tables[i].guid, ACPI_20_TABLE_GUID) == 0) { +- efi.acpi20 = config_tables[i].table; +- printk(KERN_INFO " ACPI 2.0=0x%lx ", config_tables[i].table); +- } else +- if (efi_guidcmp(config_tables[i].guid, ACPI_TABLE_GUID) == 0) { +- efi.acpi = config_tables[i].table; +- printk(KERN_INFO " ACPI=0x%lx ", config_tables[i].table); +- } else +- if (efi_guidcmp(config_tables[i].guid, SMBIOS_TABLE_GUID) == 0) { +- efi.smbios = config_tables[i].table; +- printk(KERN_INFO " SMBIOS=0x%lx ", config_tables[i].table); +- } else +- if (efi_guidcmp(config_tables[i].guid, HCDP_TABLE_GUID) == 0) { +- efi.hcdp = config_tables[i].table; +- printk(KERN_INFO " HCDP=0x%lx ", config_tables[i].table); +- } else +- if (efi_guidcmp(config_tables[i].guid, UGA_IO_PROTOCOL_GUID) == 0) { +- efi.uga = config_tables[i].table; +- printk(KERN_INFO " UGA=0x%lx ", config_tables[i].table); +- } +- } +- printk("\n"); +- +- /* +- * Check out the runtime services table. We need to map +- * the runtime services table so that we can grab the physical +- * address of several of the EFI runtime functions, needed to +- * set the firmware into virtual mode. +- */ +- +- runtime = (efi_runtime_services_t *) boot_ioremap((unsigned long) +- runtime, +- sizeof(efi_runtime_services_t)); +- if (runtime != NULL) { +- /* +- * We will only need *early* access to the following +- * two EFI runtime services before set_virtual_address_map +- * is invoked. +- */ +- efi_phys.get_time = (efi_get_time_t *) runtime->get_time; +- efi_phys.set_virtual_address_map = +- (efi_set_virtual_address_map_t *) +- runtime->set_virtual_address_map; +- } else +- printk(KERN_ERR PFX "Could not map the runtime service table!\n"); +- +- /* Map the EFI memory map for use until paging_init() */ +- memmap.map = boot_ioremap(boot_params.efi_info.efi_memmap, +- boot_params.efi_info.efi_memmap_size); +- if (memmap.map == NULL) +- printk(KERN_ERR PFX "Could not map the EFI memory map!\n"); +- +- memmap.map_end = memmap.map + (memmap.nr_map * memmap.desc_size); +- +-#if EFI_DEBUG +- print_efi_memmap(); +-#endif +-} +- +-static inline void __init check_range_for_systab(efi_memory_desc_t *md) +-{ +- if (((unsigned long)md->phys_addr <= (unsigned long)efi_phys.systab) && +- ((unsigned long)efi_phys.systab < md->phys_addr + +- ((unsigned long)md->num_pages << EFI_PAGE_SHIFT))) { +- unsigned long addr; +- +- addr = md->virt_addr - md->phys_addr + +- (unsigned long)efi_phys.systab; +- efi.systab = (efi_system_table_t *)addr; +- } +-} +- +-/* +- * Wrap all the virtual calls in a way that forces the parameters on the stack. +- */ +- +-#define efi_call_virt(f, args...) 
\ +- ((efi_##f##_t __attribute__((regparm(0)))*)efi.systab->runtime->f)(args) +- +-static efi_status_t virt_efi_get_time(efi_time_t *tm, efi_time_cap_t *tc) +-{ +- return efi_call_virt(get_time, tm, tc); +-} +- +-static efi_status_t virt_efi_set_time (efi_time_t *tm) +-{ +- return efi_call_virt(set_time, tm); +-} +- +-static efi_status_t virt_efi_get_wakeup_time (efi_bool_t *enabled, +- efi_bool_t *pending, +- efi_time_t *tm) +-{ +- return efi_call_virt(get_wakeup_time, enabled, pending, tm); +-} +- +-static efi_status_t virt_efi_set_wakeup_time (efi_bool_t enabled, +- efi_time_t *tm) +-{ +- return efi_call_virt(set_wakeup_time, enabled, tm); +-} +- +-static efi_status_t virt_efi_get_variable (efi_char16_t *name, +- efi_guid_t *vendor, u32 *attr, +- unsigned long *data_size, void *data) +-{ +- return efi_call_virt(get_variable, name, vendor, attr, data_size, data); +-} +- +-static efi_status_t virt_efi_get_next_variable (unsigned long *name_size, +- efi_char16_t *name, +- efi_guid_t *vendor) +-{ +- return efi_call_virt(get_next_variable, name_size, name, vendor); +-} +- +-static efi_status_t virt_efi_set_variable (efi_char16_t *name, +- efi_guid_t *vendor, +- unsigned long attr, +- unsigned long data_size, void *data) +-{ +- return efi_call_virt(set_variable, name, vendor, attr, data_size, data); +-} +- +-static efi_status_t virt_efi_get_next_high_mono_count (u32 *count) +-{ +- return efi_call_virt(get_next_high_mono_count, count); +-} +- +-static void virt_efi_reset_system (int reset_type, efi_status_t status, +- unsigned long data_size, +- efi_char16_t *data) +-{ +- efi_call_virt(reset_system, reset_type, status, data_size, data); +-} +- +-/* +- * This function will switch the EFI runtime services to virtual mode. +- * Essentially, look through the EFI memmap and map every region that +- * has the runtime attribute bit set in its memory descriptor and update +- * that memory descriptor with the virtual address obtained from ioremap(). +- * This enables the runtime services to be called without having to +- * thunk back into physical mode for every invocation. +- */ +- +-void __init efi_enter_virtual_mode(void) +-{ +- efi_memory_desc_t *md; +- efi_status_t status; +- void *p; +- +- efi.systab = NULL; +- +- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { +- md = p; +- +- if (!(md->attribute & EFI_MEMORY_RUNTIME)) +- continue; +- +- md->virt_addr = (unsigned long)ioremap(md->phys_addr, +- md->num_pages << EFI_PAGE_SHIFT); +- if (!(unsigned long)md->virt_addr) { +- printk(KERN_ERR PFX "ioremap of 0x%lX failed\n", +- (unsigned long)md->phys_addr); +- } +- /* update the virtual address of the EFI system table */ +- check_range_for_systab(md); +- } +- +- BUG_ON(!efi.systab); +- +- status = phys_efi_set_virtual_address_map( +- memmap.desc_size * memmap.nr_map, +- memmap.desc_size, +- memmap.desc_version, +- memmap.phys_map); +- +- if (status != EFI_SUCCESS) { +- printk (KERN_ALERT "You are screwed! " +- "Unable to switch EFI into virtual mode " +- "(status=%lx)\n", status); +- panic("EFI call to SetVirtualAddressMap() failed!"); +- } +- +- /* +- * Now that EFI is in virtual mode, update the function +- * pointers in the runtime service table to the new virtual addresses. 
+- */ +- +- efi.get_time = virt_efi_get_time; +- efi.set_time = virt_efi_set_time; +- efi.get_wakeup_time = virt_efi_get_wakeup_time; +- efi.set_wakeup_time = virt_efi_set_wakeup_time; +- efi.get_variable = virt_efi_get_variable; +- efi.get_next_variable = virt_efi_get_next_variable; +- efi.set_variable = virt_efi_set_variable; +- efi.get_next_high_mono_count = virt_efi_get_next_high_mono_count; +- efi.reset_system = virt_efi_reset_system; +-} +- +-void __init +-efi_initialize_iomem_resources(struct resource *code_resource, +- struct resource *data_resource, +- struct resource *bss_resource) +-{ +- struct resource *res; +- efi_memory_desc_t *md; +- void *p; +- +- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { +- md = p; +- +- if ((md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT)) > +- 0x100000000ULL) +- continue; +- res = kzalloc(sizeof(struct resource), GFP_ATOMIC); +- switch (md->type) { +- case EFI_RESERVED_TYPE: +- res->name = "Reserved Memory"; +- break; +- case EFI_LOADER_CODE: +- res->name = "Loader Code"; +- break; +- case EFI_LOADER_DATA: +- res->name = "Loader Data"; +- break; +- case EFI_BOOT_SERVICES_DATA: +- res->name = "BootServices Data"; +- break; +- case EFI_BOOT_SERVICES_CODE: +- res->name = "BootServices Code"; +- break; +- case EFI_RUNTIME_SERVICES_CODE: +- res->name = "Runtime Service Code"; +- break; +- case EFI_RUNTIME_SERVICES_DATA: +- res->name = "Runtime Service Data"; +- break; +- case EFI_CONVENTIONAL_MEMORY: +- res->name = "Conventional Memory"; +- break; +- case EFI_UNUSABLE_MEMORY: +- res->name = "Unusable Memory"; +- break; +- case EFI_ACPI_RECLAIM_MEMORY: +- res->name = "ACPI Reclaim"; +- break; +- case EFI_ACPI_MEMORY_NVS: +- res->name = "ACPI NVS"; +- break; +- case EFI_MEMORY_MAPPED_IO: +- res->name = "Memory Mapped IO"; +- break; +- case EFI_MEMORY_MAPPED_IO_PORT_SPACE: +- res->name = "Memory Mapped IO Port Space"; +- break; +- default: +- res->name = "Reserved"; +- break; +- } +- res->start = md->phys_addr; +- res->end = res->start + ((md->num_pages << EFI_PAGE_SHIFT) - 1); +- res->flags = IORESOURCE_MEM | IORESOURCE_BUSY; +- if (request_resource(&iomem_resource, res) < 0) +- printk(KERN_ERR PFX "Failed to allocate res %s : " +- "0x%llx-0x%llx\n", res->name, +- (unsigned long long)res->start, +- (unsigned long long)res->end); +- /* +- * We don't know which region contains kernel data so we try +- * it repeatedly and let the resource manager test it. 
+- */ +- if (md->type == EFI_CONVENTIONAL_MEMORY) { +- request_resource(res, code_resource); +- request_resource(res, data_resource); +- request_resource(res, bss_resource); +-#ifdef CONFIG_KEXEC +- request_resource(res, &crashk_res); +-#endif +- } +- } +-} +- +-/* +- * Convenience functions to obtain memory types and attributes +- */ +- +-u32 efi_mem_type(unsigned long phys_addr) +-{ +- efi_memory_desc_t *md; +- void *p; +- +- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { +- md = p; +- if ((md->phys_addr <= phys_addr) && (phys_addr < +- (md->phys_addr + (md-> num_pages << EFI_PAGE_SHIFT)) )) +- return md->type; +- } +- return 0; +-} +- +-u64 efi_mem_attributes(unsigned long phys_addr) +-{ +- efi_memory_desc_t *md; +- void *p; +- +- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { +- md = p; +- if ((md->phys_addr <= phys_addr) && (phys_addr < +- (md->phys_addr + (md-> num_pages << EFI_PAGE_SHIFT)) )) +- return md->attribute; +- } +- return 0; + } +diff --git a/arch/x86/kernel/efi_64.c b/arch/x86/kernel/efi_64.c +new file mode 100644 +index 0000000..4b73992 +--- /dev/null ++++ b/arch/x86/kernel/efi_64.c +@@ -0,0 +1,134 @@ ++/* ++ * x86_64 specific EFI support functions ++ * Based on Extensible Firmware Interface Specification version 1.0 ++ * ++ * Copyright (C) 2005-2008 Intel Co. ++ * Fenghua Yu ++ * Bibo Mao ++ * Chandramouli Narayanan ++ * Huang Ying ++ * ++ * Code to convert EFI to E820 map has been implemented in elilo bootloader ++ * based on a EFI patch by Edgar Hucek. Based on the E820 map, the page table ++ * is setup appropriately for EFI runtime code. ++ * - mouli 06/14/2007. ++ * ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++static pgd_t save_pgd __initdata; ++static unsigned long efi_flags __initdata; ++ ++static void __init early_mapping_set_exec(unsigned long start, ++ unsigned long end, ++ int executable) ++{ ++ pte_t *kpte; ++ int level; ++ ++ while (start < end) { ++ kpte = lookup_address((unsigned long)__va(start), &level); ++ BUG_ON(!kpte); ++ if (executable) ++ set_pte(kpte, pte_mkexec(*kpte)); ++ else ++ set_pte(kpte, __pte((pte_val(*kpte) | _PAGE_NX) & \ ++ __supported_pte_mask)); ++ if (level == 4) ++ start = (start + PMD_SIZE) & PMD_MASK; ++ else ++ start = (start + PAGE_SIZE) & PAGE_MASK; ++ } ++} ++ ++static void __init early_runtime_code_mapping_set_exec(int executable) ++{ ++ efi_memory_desc_t *md; ++ void *p; ++ ++ if (!(__supported_pte_mask & _PAGE_NX)) ++ return; ++ ++ /* Make EFI runtime service code area executable */ ++ for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) { ++ md = p; ++ if (md->type == EFI_RUNTIME_SERVICES_CODE) { ++ unsigned long end; ++ end = md->phys_addr + (md->num_pages << PAGE_SHIFT); ++ early_mapping_set_exec(md->phys_addr, end, executable); ++ } ++ } ++} ++ ++void __init efi_call_phys_prelog(void) ++{ ++ unsigned long vaddress; ++ ++ local_irq_save(efi_flags); ++ early_runtime_code_mapping_set_exec(1); ++ vaddress = (unsigned long)__va(0x0UL); ++ save_pgd = *pgd_offset_k(0x0UL); ++ set_pgd(pgd_offset_k(0x0UL), *pgd_offset_k(vaddress)); ++ __flush_tlb_all(); ++} ++ ++void __init efi_call_phys_epilog(void) ++{ ++ /* ++ * After the lock is released, the original page table is restored. 
++ */ ++ set_pgd(pgd_offset_k(0x0UL), save_pgd); ++ early_runtime_code_mapping_set_exec(0); ++ __flush_tlb_all(); ++ local_irq_restore(efi_flags); ++} ++ ++void __init efi_reserve_bootmem(void) ++{ ++ reserve_bootmem_generic((unsigned long)memmap.phys_map, ++ memmap.nr_map * memmap.desc_size); ++} ++ ++void __iomem * __init efi_ioremap(unsigned long offset, ++ unsigned long size) ++{ ++ static unsigned pages_mapped; ++ unsigned long last_addr; ++ unsigned i, pages; ++ ++ last_addr = offset + size - 1; ++ offset &= PAGE_MASK; ++ pages = (PAGE_ALIGN(last_addr) - offset) >> PAGE_SHIFT; ++ if (pages_mapped + pages > MAX_EFI_IO_PAGES) ++ return NULL; ++ ++ for (i = 0; i < pages; i++) { ++ __set_fixmap(FIX_EFI_IO_MAP_FIRST_PAGE - pages_mapped, ++ offset, PAGE_KERNEL_EXEC_NOCACHE); ++ offset += PAGE_SIZE; ++ pages_mapped++; ++ } ++ ++ return (void __iomem *)__fix_to_virt(FIX_EFI_IO_MAP_FIRST_PAGE - \ ++ (pages_mapped - pages)); ++} +diff --git a/arch/x86/kernel/efi_stub_64.S b/arch/x86/kernel/efi_stub_64.S +new file mode 100644 +index 0000000..99b47d4 +--- /dev/null ++++ b/arch/x86/kernel/efi_stub_64.S +@@ -0,0 +1,109 @@ ++/* ++ * Function calling ABI conversion from Linux to EFI for x86_64 ++ * ++ * Copyright (C) 2007 Intel Corp ++ * Bibo Mao ++ * Huang Ying ++ */ ++ ++#include ++ ++#define SAVE_XMM \ ++ mov %rsp, %rax; \ ++ subq $0x70, %rsp; \ ++ and $~0xf, %rsp; \ ++ mov %rax, (%rsp); \ ++ mov %cr0, %rax; \ ++ clts; \ ++ mov %rax, 0x8(%rsp); \ ++ movaps %xmm0, 0x60(%rsp); \ ++ movaps %xmm1, 0x50(%rsp); \ ++ movaps %xmm2, 0x40(%rsp); \ ++ movaps %xmm3, 0x30(%rsp); \ ++ movaps %xmm4, 0x20(%rsp); \ ++ movaps %xmm5, 0x10(%rsp) ++ ++#define RESTORE_XMM \ ++ movaps 0x60(%rsp), %xmm0; \ ++ movaps 0x50(%rsp), %xmm1; \ ++ movaps 0x40(%rsp), %xmm2; \ ++ movaps 0x30(%rsp), %xmm3; \ ++ movaps 0x20(%rsp), %xmm4; \ ++ movaps 0x10(%rsp), %xmm5; \ ++ mov 0x8(%rsp), %rsi; \ ++ mov %rsi, %cr0; \ ++ mov (%rsp), %rsp ++ ++ENTRY(efi_call0) ++ SAVE_XMM ++ subq $32, %rsp ++ call *%rdi ++ addq $32, %rsp ++ RESTORE_XMM ++ ret ++ ++ENTRY(efi_call1) ++ SAVE_XMM ++ subq $32, %rsp ++ mov %rsi, %rcx ++ call *%rdi ++ addq $32, %rsp ++ RESTORE_XMM ++ ret ++ ++ENTRY(efi_call2) ++ SAVE_XMM ++ subq $32, %rsp ++ mov %rsi, %rcx ++ call *%rdi ++ addq $32, %rsp ++ RESTORE_XMM ++ ret ++ ++ENTRY(efi_call3) ++ SAVE_XMM ++ subq $32, %rsp ++ mov %rcx, %r8 ++ mov %rsi, %rcx ++ call *%rdi ++ addq $32, %rsp ++ RESTORE_XMM ++ ret ++ ++ENTRY(efi_call4) ++ SAVE_XMM ++ subq $32, %rsp ++ mov %r8, %r9 ++ mov %rcx, %r8 ++ mov %rsi, %rcx ++ call *%rdi ++ addq $32, %rsp ++ RESTORE_XMM ++ ret ++ ++ENTRY(efi_call5) ++ SAVE_XMM ++ subq $48, %rsp ++ mov %r9, 32(%rsp) ++ mov %r8, %r9 ++ mov %rcx, %r8 ++ mov %rsi, %rcx ++ call *%rdi ++ addq $48, %rsp ++ RESTORE_XMM ++ ret ++ ++ENTRY(efi_call6) ++ SAVE_XMM ++ mov (%rsp), %rax ++ mov 8(%rax), %rax ++ subq $48, %rsp ++ mov %r9, 32(%rsp) ++ mov %rax, 40(%rsp) ++ mov %r8, %r9 ++ mov %rcx, %r8 ++ mov %rsi, %rcx ++ call *%rdi ++ addq $48, %rsp ++ RESTORE_XMM ++ ret +diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S +index dc7f938..be5c31d 100644 +--- a/arch/x86/kernel/entry_32.S ++++ b/arch/x86/kernel/entry_32.S +@@ -58,7 +58,7 @@ + * for paravirtualization. The following will never clobber any registers: + * INTERRUPT_RETURN (aka. "iret") + * GET_CR0_INTO_EAX (aka. "movl %cr0, %eax") +- * ENABLE_INTERRUPTS_SYSEXIT (aka "sti; sysexit"). ++ * ENABLE_INTERRUPTS_SYSCALL_RET (aka "sti; sysexit"). 
+ * + * For DISABLE_INTERRUPTS/ENABLE_INTERRUPTS (aka "cli"/"sti"), you must + * specify what registers can be overwritten (CLBR_NONE, CLBR_EAX/EDX/ECX/ANY). +@@ -283,12 +283,12 @@ END(resume_kernel) + the vsyscall page. See vsyscall-sysentry.S, which defines the symbol. */ + + # sysenter call handler stub +-ENTRY(sysenter_entry) ++ENTRY(ia32_sysenter_target) + CFI_STARTPROC simple + CFI_SIGNAL_FRAME + CFI_DEF_CFA esp, 0 + CFI_REGISTER esp, ebp +- movl TSS_sysenter_esp0(%esp),%esp ++ movl TSS_sysenter_sp0(%esp),%esp + sysenter_past_esp: + /* + * No need to follow this irqs on/off section: the syscall +@@ -351,7 +351,7 @@ sysenter_past_esp: + xorl %ebp,%ebp + TRACE_IRQS_ON + 1: mov PT_FS(%esp), %fs +- ENABLE_INTERRUPTS_SYSEXIT ++ ENABLE_INTERRUPTS_SYSCALL_RET + CFI_ENDPROC + .pushsection .fixup,"ax" + 2: movl $0,PT_FS(%esp) +@@ -360,7 +360,7 @@ sysenter_past_esp: + .align 4 + .long 1b,2b + .popsection +-ENDPROC(sysenter_entry) ++ENDPROC(ia32_sysenter_target) + + # system call handler stub + ENTRY(system_call) +@@ -583,7 +583,7 @@ END(syscall_badsys) + * Build the entry stubs and pointer table with + * some assembler magic. + */ +-.data ++.section .rodata,"a" + ENTRY(interrupt) + .text + +@@ -743,7 +743,7 @@ END(device_not_available) + * that sets up the real kernel stack. Check here, since we can't + * allow the wrong stack to be used. + * +- * "TSS_sysenter_esp0+12" is because the NMI/debug handler will have ++ * "TSS_sysenter_sp0+12" is because the NMI/debug handler will have + * already pushed 3 words if it hits on the sysenter instruction: + * eflags, cs and eip. + * +@@ -755,7 +755,7 @@ END(device_not_available) + cmpw $__KERNEL_CS,4(%esp); \ + jne ok; \ + label: \ +- movl TSS_sysenter_esp0+offset(%esp),%esp; \ ++ movl TSS_sysenter_sp0+offset(%esp),%esp; \ + CFI_DEF_CFA esp, 0; \ + CFI_UNDEFINED eip; \ + pushfl; \ +@@ -768,7 +768,7 @@ label: \ + + KPROBE_ENTRY(debug) + RING0_INT_FRAME +- cmpl $sysenter_entry,(%esp) ++ cmpl $ia32_sysenter_target,(%esp) + jne debug_stack_correct + FIX_STACK(12, debug_stack_correct, debug_esp_fix_insn) + debug_stack_correct: +@@ -799,7 +799,7 @@ KPROBE_ENTRY(nmi) + popl %eax + CFI_ADJUST_CFA_OFFSET -4 + je nmi_espfix_stack +- cmpl $sysenter_entry,(%esp) ++ cmpl $ia32_sysenter_target,(%esp) + je nmi_stack_fixup + pushl %eax + CFI_ADJUST_CFA_OFFSET 4 +@@ -812,7 +812,7 @@ KPROBE_ENTRY(nmi) + popl %eax + CFI_ADJUST_CFA_OFFSET -4 + jae nmi_stack_correct +- cmpl $sysenter_entry,12(%esp) ++ cmpl $ia32_sysenter_target,12(%esp) + je nmi_debug_stack_check + nmi_stack_correct: + /* We have a RING0_INT_FRAME here */ +@@ -882,10 +882,10 @@ ENTRY(native_iret) + .previous + END(native_iret) + +-ENTRY(native_irq_enable_sysexit) ++ENTRY(native_irq_enable_syscall_ret) + sti + sysexit +-END(native_irq_enable_sysexit) ++END(native_irq_enable_syscall_ret) + #endif + + KPROBE_ENTRY(int3) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S -index 3a058bb..e70f388 100644 +index 3a058bb..bea8474 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S -@@ -283,7 +283,7 @@ sysret_careful: +@@ -50,6 +50,7 @@ + #include + #include + #include ++#include + + .code64 + +@@ -57,6 +58,13 @@ + #define retint_kernel retint_restore_args + #endif + ++#ifdef CONFIG_PARAVIRT ++ENTRY(native_irq_enable_syscall_ret) ++ movq %gs:pda_oldrsp,%rsp ++ swapgs ++ sysretq ++#endif /* CONFIG_PARAVIRT */ ++ + + .macro TRACE_IRQS_IRETQ offset=ARGOFFSET + #ifdef CONFIG_TRACE_IRQFLAGS +@@ -216,14 +224,21 @@ ENTRY(system_call) + CFI_DEF_CFA rsp,PDA_STACKOFFSET + CFI_REGISTER 
rip,rcx + /*CFI_REGISTER rflags,r11*/ +- swapgs ++ SWAPGS_UNSAFE_STACK ++ /* ++ * A hypervisor implementation might want to use a label ++ * after the swapgs, so that it can do the swapgs ++ * for the guest and jump here on syscall. ++ */ ++ENTRY(system_call_after_swapgs) ++ + movq %rsp,%gs:pda_oldrsp + movq %gs:pda_kernelstack,%rsp + /* + * No need to follow this irqs off/on section - it's straight + * and short: + */ +- sti ++ ENABLE_INTERRUPTS(CLBR_NONE) + SAVE_ARGS 8,1 + movq %rax,ORIG_RAX-ARGOFFSET(%rsp) + movq %rcx,RIP-ARGOFFSET(%rsp) +@@ -246,7 +261,7 @@ ret_from_sys_call: + sysret_check: + LOCKDEP_SYS_EXIT + GET_THREAD_INFO(%rcx) +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + TRACE_IRQS_OFF + movl threadinfo_flags(%rcx),%edx + andl %edi,%edx +@@ -260,9 +275,7 @@ sysret_check: + CFI_REGISTER rip,rcx + RESTORE_ARGS 0,-ARG_SKIP,1 + /*CFI_REGISTER rflags,r11*/ +- movq %gs:pda_oldrsp,%rsp +- swapgs +- sysretq ++ ENABLE_INTERRUPTS_SYSCALL_RET + + CFI_RESTORE_STATE + /* Handle reschedules */ +@@ -271,7 +284,7 @@ sysret_careful: + bt $TIF_NEED_RESCHED,%edx + jnc sysret_signal + TRACE_IRQS_ON +- sti ++ ENABLE_INTERRUPTS(CLBR_NONE) + pushq %rdi + CFI_ADJUST_CFA_OFFSET 8 + call schedule +@@ -282,8 +295,8 @@ sysret_careful: + /* Handle a signal */ sysret_signal: TRACE_IRQS_ON - sti +- sti - testl $(_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx ++ ENABLE_INTERRUPTS(CLBR_NONE) + testl $_TIF_DO_NOTIFY_MASK,%edx jz 1f /* Really a signal */ -@@ -377,7 +377,7 @@ int_very_careful: +@@ -295,7 +308,7 @@ sysret_signal: + 1: movl $_TIF_NEED_RESCHED,%edi + /* Use IRET because user could have changed frame. This + works because ptregscall_common has called FIXUP_TOP_OF_STACK. */ +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + TRACE_IRQS_OFF + jmp int_with_check + +@@ -327,7 +340,7 @@ tracesys: + */ + .globl int_ret_from_sys_call + int_ret_from_sys_call: +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + TRACE_IRQS_OFF + testl $3,CS-ARGOFFSET(%rsp) + je retint_restore_args +@@ -349,20 +362,20 @@ int_careful: + bt $TIF_NEED_RESCHED,%edx + jnc int_very_careful + TRACE_IRQS_ON +- sti ++ ENABLE_INTERRUPTS(CLBR_NONE) + pushq %rdi + CFI_ADJUST_CFA_OFFSET 8 + call schedule + popq %rdi + CFI_ADJUST_CFA_OFFSET -8 +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + TRACE_IRQS_OFF + jmp int_with_check + + /* handle signals and tracing -- both require a full stack frame */ + int_very_careful: + TRACE_IRQS_ON +- sti ++ ENABLE_INTERRUPTS(CLBR_NONE) + SAVE_REST + /* Check for syscall exit trace */ + testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP),%edx +@@ -377,7 +390,7 @@ int_very_careful: jmp int_restore_rest int_signal: @@ -135343,7 +153454,95 @@ index 3a058bb..e70f388 100644 jz 1f movq %rsp,%rdi # &ptregs -> arg1 xorl %esi,%esi # oldset -> arg2 -@@ -603,7 +603,7 @@ retint_careful: +@@ -385,7 +398,7 @@ int_signal: + 1: movl $_TIF_NEED_RESCHED,%edi + int_restore_rest: + RESTORE_REST +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + TRACE_IRQS_OFF + jmp int_with_check + CFI_ENDPROC +@@ -506,7 +519,7 @@ END(stub_rt_sigreturn) + CFI_DEF_CFA_REGISTER rbp + testl $3,CS(%rdi) + je 1f +- swapgs ++ SWAPGS + /* irqcount is used to check if a CPU is already on an interrupt + stack or not. 
While this is essentially redundant with preempt_count + it is a little cheaper to use a separate counter in the PDA +@@ -527,7 +540,7 @@ ENTRY(common_interrupt) + interrupt do_IRQ + /* 0(%rsp): oldrsp-ARGOFFSET */ + ret_from_intr: +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + TRACE_IRQS_OFF + decl %gs:pda_irqcount + leaveq +@@ -556,13 +569,13 @@ retint_swapgs: /* return to user-space */ + /* + * The iretq could re-enable interrupts: + */ +- cli ++ DISABLE_INTERRUPTS(CLBR_ANY) + TRACE_IRQS_IRETQ +- swapgs ++ SWAPGS + jmp restore_args + + retint_restore_args: /* return to kernel space */ +- cli ++ DISABLE_INTERRUPTS(CLBR_ANY) + /* + * The iretq could re-enable interrupts: + */ +@@ -570,10 +583,14 @@ retint_restore_args: /* return to kernel space */ + restore_args: + RESTORE_ARGS 0,8,0 + iret_label: ++#ifdef CONFIG_PARAVIRT ++ INTERRUPT_RETURN ++#endif ++ENTRY(native_iret) + iretq + + .section __ex_table,"a" +- .quad iret_label,bad_iret ++ .quad native_iret, bad_iret + .previous + .section .fixup,"ax" + /* force a signal here? this matches i386 behaviour */ +@@ -581,39 +598,39 @@ iret_label: + bad_iret: + movq $11,%rdi /* SIGSEGV */ + TRACE_IRQS_ON +- sti +- jmp do_exit +- .previous +- ++ ENABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI)) ++ jmp do_exit ++ .previous ++ + /* edi: workmask, edx: work */ + retint_careful: + CFI_RESTORE_STATE + bt $TIF_NEED_RESCHED,%edx + jnc retint_signal + TRACE_IRQS_ON +- sti ++ ENABLE_INTERRUPTS(CLBR_NONE) + pushq %rdi + CFI_ADJUST_CFA_OFFSET 8 + call schedule + popq %rdi + CFI_ADJUST_CFA_OFFSET -8 + GET_THREAD_INFO(%rcx) +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + TRACE_IRQS_OFF jmp retint_check retint_signal: @@ -135351,7 +153550,1874 @@ index 3a058bb..e70f388 100644 + testl $_TIF_DO_NOTIFY_MASK,%edx jz retint_swapgs TRACE_IRQS_ON - sti +- sti ++ ENABLE_INTERRUPTS(CLBR_NONE) + SAVE_REST + movq $-1,ORIG_RAX(%rsp) + xorl %esi,%esi # oldset + movq %rsp,%rdi # &pt_regs + call do_notify_resume + RESTORE_REST +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + TRACE_IRQS_OFF + movl $_TIF_NEED_RESCHED,%edi + GET_THREAD_INFO(%rcx) +@@ -731,7 +748,7 @@ END(spurious_interrupt) + rdmsr + testl %edx,%edx + js 1f +- swapgs ++ SWAPGS + xorl %ebx,%ebx + 1: + .if \ist +@@ -747,7 +764,7 @@ END(spurious_interrupt) + .if \ist + addq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp) + .endif +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + .if \irqtrace + TRACE_IRQS_OFF + .endif +@@ -776,10 +793,10 @@ paranoid_swapgs\trace: + .if \trace + TRACE_IRQS_IRETQ 0 + .endif +- swapgs ++ SWAPGS_UNSAFE_STACK + paranoid_restore\trace: + RESTORE_ALL 8 +- iretq ++ INTERRUPT_RETURN + paranoid_userspace\trace: + GET_THREAD_INFO(%rcx) + movl threadinfo_flags(%rcx),%ebx +@@ -794,11 +811,11 @@ paranoid_userspace\trace: + .if \trace + TRACE_IRQS_ON + .endif +- sti ++ ENABLE_INTERRUPTS(CLBR_NONE) + xorl %esi,%esi /* arg2: oldset */ + movq %rsp,%rdi /* arg1: &pt_regs */ + call do_notify_resume +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + .if \trace + TRACE_IRQS_OFF + .endif +@@ -807,9 +824,9 @@ paranoid_schedule\trace: + .if \trace + TRACE_IRQS_ON + .endif +- sti ++ ENABLE_INTERRUPTS(CLBR_ANY) + call schedule +- cli ++ DISABLE_INTERRUPTS(CLBR_ANY) + .if \trace + TRACE_IRQS_OFF + .endif +@@ -862,7 +879,7 @@ KPROBE_ENTRY(error_entry) + testl $3,CS(%rsp) + je error_kernelspace + error_swapgs: +- swapgs ++ SWAPGS + error_sti: + movq %rdi,RDI(%rsp) + CFI_REL_OFFSET rdi,RDI +@@ -874,7 +891,7 @@ error_sti: + error_exit: + movl %ebx,%eax + RESTORE_REST +- cli ++ DISABLE_INTERRUPTS(CLBR_NONE) + TRACE_IRQS_OFF + 
GET_THREAD_INFO(%rcx) + testl %eax,%eax +@@ -911,12 +928,12 @@ ENTRY(load_gs_index) + CFI_STARTPROC + pushf + CFI_ADJUST_CFA_OFFSET 8 +- cli +- swapgs ++ DISABLE_INTERRUPTS(CLBR_ANY | ~(CLBR_RDI)) ++ SWAPGS + gs_change: + movl %edi,%gs + 2: mfence /* workaround */ +- swapgs ++ SWAPGS + popf + CFI_ADJUST_CFA_OFFSET -8 + ret +@@ -930,7 +947,7 @@ ENDPROC(load_gs_index) + .section .fixup,"ax" + /* running with kernelgs */ + bad_gs: +- swapgs /* switch back to user gs */ ++ SWAPGS /* switch back to user gs */ + xorl %eax,%eax + movl %eax,%gs + jmp 2b +diff --git a/arch/x86/kernel/genapic_64.c b/arch/x86/kernel/genapic_64.c +index ce703e2..4ae7b64 100644 +--- a/arch/x86/kernel/genapic_64.c ++++ b/arch/x86/kernel/genapic_64.c +@@ -24,18 +24,11 @@ + #include + #endif + +-/* +- * which logical CPU number maps to which CPU (physical APIC ID) +- * +- * The following static array is used during kernel startup +- * and the x86_cpu_to_apicid_ptr contains the address of the +- * array during this time. Is it zeroed when the per_cpu +- * data area is removed. +- */ +-u8 x86_cpu_to_apicid_init[NR_CPUS] __initdata ++/* which logical CPU number maps to which CPU (physical APIC ID) */ ++u16 x86_cpu_to_apicid_init[NR_CPUS] __initdata + = { [0 ... NR_CPUS-1] = BAD_APICID }; +-void *x86_cpu_to_apicid_ptr; +-DEFINE_PER_CPU(u8, x86_cpu_to_apicid) = BAD_APICID; ++void *x86_cpu_to_apicid_early_ptr; ++DEFINE_PER_CPU(u16, x86_cpu_to_apicid) = BAD_APICID; + EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid); + + struct genapic __read_mostly *genapic = &apic_flat; +diff --git a/arch/x86/kernel/geode_32.c b/arch/x86/kernel/geode_32.c +index f12d8c5..9c7f7d3 100644 +--- a/arch/x86/kernel/geode_32.c ++++ b/arch/x86/kernel/geode_32.c +@@ -1,6 +1,7 @@ + /* + * AMD Geode southbridge support code + * Copyright (C) 2006, Advanced Micro Devices, Inc. ++ * Copyright (C) 2007, Andres Salomon + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public License +@@ -51,45 +52,62 @@ EXPORT_SYMBOL_GPL(geode_get_dev_base); + + /* === GPIO API === */ + +-void geode_gpio_set(unsigned int gpio, unsigned int reg) ++void geode_gpio_set(u32 gpio, unsigned int reg) + { + u32 base = geode_get_dev_base(GEODE_DEV_GPIO); + + if (!base) + return; + +- if (gpio < 16) +- outl(1 << gpio, base + reg); +- else +- outl(1 << (gpio - 16), base + 0x80 + reg); ++ /* low bank register */ ++ if (gpio & 0xFFFF) ++ outl(gpio & 0xFFFF, base + reg); ++ /* high bank register */ ++ gpio >>= 16; ++ if (gpio) ++ outl(gpio, base + 0x80 + reg); + } + EXPORT_SYMBOL_GPL(geode_gpio_set); + +-void geode_gpio_clear(unsigned int gpio, unsigned int reg) ++void geode_gpio_clear(u32 gpio, unsigned int reg) + { + u32 base = geode_get_dev_base(GEODE_DEV_GPIO); + + if (!base) + return; + +- if (gpio < 16) +- outl(1 << (gpio + 16), base + reg); +- else +- outl(1 << gpio, base + 0x80 + reg); ++ /* low bank register */ ++ if (gpio & 0xFFFF) ++ outl((gpio & 0xFFFF) << 16, base + reg); ++ /* high bank register */ ++ gpio &= (0xFFFF << 16); ++ if (gpio) ++ outl(gpio, base + 0x80 + reg); + } + EXPORT_SYMBOL_GPL(geode_gpio_clear); + +-int geode_gpio_isset(unsigned int gpio, unsigned int reg) ++int geode_gpio_isset(u32 gpio, unsigned int reg) + { + u32 base = geode_get_dev_base(GEODE_DEV_GPIO); ++ u32 val; + + if (!base) + return 0; + +- if (gpio < 16) +- return (inl(base + reg) & (1 << gpio)) ? 1 : 0; +- else +- return (inl(base + 0x80 + reg) & (1 << (gpio - 16))) ? 
1 : 0; ++ /* low bank register */ ++ if (gpio & 0xFFFF) { ++ val = inl(base + reg) & (gpio & 0xFFFF); ++ if ((gpio & 0xFFFF) == val) ++ return 1; ++ } ++ /* high bank register */ ++ gpio >>= 16; ++ if (gpio) { ++ val = inl(base + 0x80 + reg) & gpio; ++ if (gpio == val) ++ return 1; ++ } ++ return 0; + } + EXPORT_SYMBOL_GPL(geode_gpio_isset); + +diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c +index 6b34693..a317336 100644 +--- a/arch/x86/kernel/head64.c ++++ b/arch/x86/kernel/head64.c +@@ -10,6 +10,7 @@ + #include + #include + #include ++#include + + #include + #include +@@ -19,12 +20,14 @@ + #include + #include + #include ++#include ++#include + + static void __init zap_identity_mappings(void) + { + pgd_t *pgd = pgd_offset_k(0UL); + pgd_clear(pgd); +- __flush_tlb(); ++ __flush_tlb_all(); + } + + /* Don't add a printk in there. printk relies on the PDA which is not initialized +@@ -46,6 +49,35 @@ static void __init copy_bootdata(char *real_mode_data) + } + } + ++#define EBDA_ADDR_POINTER 0x40E ++ ++static __init void reserve_ebda(void) ++{ ++ unsigned ebda_addr, ebda_size; ++ ++ /* ++ * there is a real-mode segmented pointer pointing to the ++ * 4K EBDA area at 0x40E ++ */ ++ ebda_addr = *(unsigned short *)__va(EBDA_ADDR_POINTER); ++ ebda_addr <<= 4; ++ ++ if (!ebda_addr) ++ return; ++ ++ ebda_size = *(unsigned short *)__va(ebda_addr); ++ ++ /* Round EBDA up to pages */ ++ if (ebda_size == 0) ++ ebda_size = 1; ++ ebda_size <<= 10; ++ ebda_size = round_up(ebda_size + (ebda_addr & ~PAGE_MASK), PAGE_SIZE); ++ if (ebda_size > 64*1024) ++ ebda_size = 64*1024; ++ ++ reserve_early(ebda_addr, ebda_addr + ebda_size); ++} ++ + void __init x86_64_start_kernel(char * real_mode_data) + { + int i; +@@ -56,8 +88,13 @@ void __init x86_64_start_kernel(char * real_mode_data) + /* Make NULL pointers segfault */ + zap_identity_mappings(); + +- for (i = 0; i < IDT_ENTRIES; i++) ++ for (i = 0; i < IDT_ENTRIES; i++) { ++#ifdef CONFIG_EARLY_PRINTK ++ set_intr_gate(i, &early_idt_handlers[i]); ++#else + set_intr_gate(i, early_idt_handler); ++#endif ++ } + load_idt((const struct desc_ptr *)&idt_descr); + + early_printk("Kernel alive\n"); +@@ -67,8 +104,24 @@ void __init x86_64_start_kernel(char * real_mode_data) + + pda_init(0); + copy_bootdata(__va(real_mode_data)); +-#ifdef CONFIG_SMP +- cpu_set(0, cpu_online_map); +-#endif ++ ++ reserve_early(__pa_symbol(&_text), __pa_symbol(&_end)); ++ ++ /* Reserve INITRD */ ++ if (boot_params.hdr.type_of_loader && boot_params.hdr.ramdisk_image) { ++ unsigned long ramdisk_image = boot_params.hdr.ramdisk_image; ++ unsigned long ramdisk_size = boot_params.hdr.ramdisk_size; ++ unsigned long ramdisk_end = ramdisk_image + ramdisk_size; ++ reserve_early(ramdisk_image, ramdisk_end); ++ } ++ ++ reserve_ebda(); ++ ++ /* ++ * At this point everything still needed from the boot loader ++ * or BIOS or kernel text should be early reserved or marked not ++ * RAM in e820. All other memory is free game. ++ */ ++ + start_kernel(); + } +diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S +index fbad51f..5d8c573 100644 +--- a/arch/x86/kernel/head_32.S ++++ b/arch/x86/kernel/head_32.S +@@ -9,6 +9,7 @@ + + .text + #include ++#include + #include + #include + #include +@@ -151,7 +152,9 @@ WEAK(xen_entry) + /* Unknown implementation; there's really + nothing we can do at this point. 
*/ + ud2a +-.data ++ ++ __INITDATA ++ + subarch_entries: + .long default_entry /* normal x86/PC */ + .long lguest_entry /* lguest hypervisor */ +@@ -199,7 +202,6 @@ default_entry: + addl $0x67, %eax /* 0x67 == _PAGE_TABLE */ + movl %eax, 4092(%edx) + +- xorl %ebx,%ebx /* This is the boot CPU (BSP) */ + jmp 3f + /* + * Non-boot CPU entry point; entered from trampoline.S +@@ -222,6 +224,8 @@ ENTRY(startup_32_smp) + movl %eax,%es + movl %eax,%fs + movl %eax,%gs ++#endif /* CONFIG_SMP */ ++3: + + /* + * New page tables may be in 4Mbyte page mode and may +@@ -268,12 +272,6 @@ ENTRY(startup_32_smp) + wrmsr + + 6: +- /* This is a secondary processor (AP) */ +- xorl %ebx,%ebx +- incl %ebx +- +-#endif /* CONFIG_SMP */ +-3: + + /* + * Enable paging +@@ -297,7 +295,7 @@ ENTRY(startup_32_smp) + popfl + + #ifdef CONFIG_SMP +- andl %ebx,%ebx ++ cmpb $0, ready + jz 1f /* Initial CPU cleans BSS */ + jmp checkCPUtype + 1: +@@ -502,6 +500,7 @@ early_fault: + call printk + #endif + #endif ++ call dump_stack + hlt_loop: + hlt + jmp hlt_loop +diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S +index b6167fe..1d5a7a3 100644 +--- a/arch/x86/kernel/head_64.S ++++ b/arch/x86/kernel/head_64.S +@@ -19,6 +19,13 @@ + #include + #include + ++#ifdef CONFIG_PARAVIRT ++#include ++#include ++#else ++#define GET_CR2_INTO_RCX movq %cr2, %rcx ++#endif ++ + /* we are not able to switch in one step to the final KERNEL ADRESS SPACE + * because we need identity-mapped pages. + * +@@ -260,14 +267,43 @@ init_rsp: + bad_address: + jmp bad_address + ++#ifdef CONFIG_EARLY_PRINTK ++.macro early_idt_tramp first, last ++ .ifgt \last-\first ++ early_idt_tramp \first, \last-1 ++ .endif ++ movl $\last,%esi ++ jmp early_idt_handler ++.endm ++ ++ .globl early_idt_handlers ++early_idt_handlers: ++ early_idt_tramp 0, 63 ++ early_idt_tramp 64, 127 ++ early_idt_tramp 128, 191 ++ early_idt_tramp 192, 255 ++#endif ++ + ENTRY(early_idt_handler) ++#ifdef CONFIG_EARLY_PRINTK + cmpl $2,early_recursion_flag(%rip) + jz 1f + incl early_recursion_flag(%rip) ++ GET_CR2_INTO_RCX ++ movq %rcx,%r9 ++ xorl %r8d,%r8d # zero for error code ++ movl %esi,%ecx # get vector number ++ # Test %ecx against mask of vectors that push error code. 
++ cmpl $31,%ecx ++ ja 0f ++ movl $1,%eax ++ salq %cl,%rax ++ testl $0x27d00,%eax ++ je 0f ++ popq %r8 # get error code ++0: movq 0(%rsp),%rcx # get ip ++ movq 8(%rsp),%rdx # get cs + xorl %eax,%eax +- movq 8(%rsp),%rsi # get rip +- movq (%rsp),%rdx +- movq %cr2,%rcx + leaq early_idt_msg(%rip),%rdi + call early_printk + cmpl $2,early_recursion_flag(%rip) +@@ -278,15 +314,19 @@ ENTRY(early_idt_handler) + movq 8(%rsp),%rsi # get rip again + call __print_symbol + #endif ++#endif /* EARLY_PRINTK */ + 1: hlt + jmp 1b ++ ++#ifdef CONFIG_EARLY_PRINTK + early_recursion_flag: + .long 0 + + early_idt_msg: +- .asciz "PANIC: early exception rip %lx error %lx cr2 %lx\n" ++ .asciz "PANIC: early exception %02lx rip %lx:%lx error %lx cr2 %lx\n" + early_idt_ripmsg: + .asciz "RIP %s\n" ++#endif /* CONFIG_EARLY_PRINTK */ + + .balign PAGE_SIZE + +diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c +index 2f99ee2..429d084 100644 +--- a/arch/x86/kernel/hpet.c ++++ b/arch/x86/kernel/hpet.c +@@ -6,7 +6,6 @@ + #include + #include + #include +-#include + + #include + #include +@@ -16,7 +15,8 @@ + #define HPET_MASK CLOCKSOURCE_MASK(32) + #define HPET_SHIFT 22 + +-/* FSEC = 10^-15 NSEC = 10^-9 */ ++/* FSEC = 10^-15 ++ NSEC = 10^-9 */ + #define FSEC_PER_NSEC 1000000 + + /* +@@ -107,6 +107,7 @@ int is_hpet_enabled(void) + { + return is_hpet_capable() && hpet_legacy_int_enabled; + } ++EXPORT_SYMBOL_GPL(is_hpet_enabled); + + /* + * When the hpet driver (/dev/hpet) is enabled, we need to reserve +@@ -132,16 +133,13 @@ static void hpet_reserve_platform_timers(unsigned long id) + #ifdef CONFIG_HPET_EMULATE_RTC + hpet_reserve_timer(&hd, 1); + #endif +- + hd.hd_irq[0] = HPET_LEGACY_8254; + hd.hd_irq[1] = HPET_LEGACY_RTC; + +- for (i = 2; i < nrtimers; timer++, i++) +- hd.hd_irq[i] = (timer->hpet_config & Tn_INT_ROUTE_CNF_MASK) >> +- Tn_INT_ROUTE_CNF_SHIFT; +- ++ for (i = 2; i < nrtimers; timer++, i++) ++ hd.hd_irq[i] = (timer->hpet_config & Tn_INT_ROUTE_CNF_MASK) >> ++ Tn_INT_ROUTE_CNF_SHIFT; + hpet_alloc(&hd); +- + } + #else + static void hpet_reserve_platform_timers(unsigned long id) { } +@@ -478,6 +476,7 @@ void hpet_disable(void) + */ + #include + #include ++#include + + #define DEFAULT_RTC_INT_FREQ 64 + #define DEFAULT_RTC_SHIFT 6 +@@ -492,6 +491,38 @@ static unsigned long hpet_default_delta; + static unsigned long hpet_pie_delta; + static unsigned long hpet_pie_limit; + ++static rtc_irq_handler irq_handler; ++ ++/* ++ * Registers a IRQ handler. ++ */ ++int hpet_register_irq_handler(rtc_irq_handler handler) ++{ ++ if (!is_hpet_enabled()) ++ return -ENODEV; ++ if (irq_handler) ++ return -EBUSY; ++ ++ irq_handler = handler; ++ ++ return 0; ++} ++EXPORT_SYMBOL_GPL(hpet_register_irq_handler); ++ ++/* ++ * Deregisters the IRQ handler registered with hpet_register_irq_handler() ++ * and does cleanup. ++ */ ++void hpet_unregister_irq_handler(rtc_irq_handler handler) ++{ ++ if (!is_hpet_enabled()) ++ return; ++ ++ irq_handler = NULL; ++ hpet_rtc_flags = 0; ++} ++EXPORT_SYMBOL_GPL(hpet_unregister_irq_handler); ++ + /* + * Timer 1 for RTC emulation. We use one shot mode, as periodic mode + * is not supported by all HPET implementations for timer 1. +@@ -533,6 +564,7 @@ int hpet_rtc_timer_init(void) + + return 1; + } ++EXPORT_SYMBOL_GPL(hpet_rtc_timer_init); + + /* + * The functions below are called from rtc driver. 
+@@ -547,6 +579,7 @@ int hpet_mask_rtc_irq_bit(unsigned long bit_mask) + hpet_rtc_flags &= ~bit_mask; + return 1; + } ++EXPORT_SYMBOL_GPL(hpet_mask_rtc_irq_bit); + + int hpet_set_rtc_irq_bit(unsigned long bit_mask) + { +@@ -562,6 +595,7 @@ int hpet_set_rtc_irq_bit(unsigned long bit_mask) + + return 1; + } ++EXPORT_SYMBOL_GPL(hpet_set_rtc_irq_bit); + + int hpet_set_alarm_time(unsigned char hrs, unsigned char min, + unsigned char sec) +@@ -575,6 +609,7 @@ int hpet_set_alarm_time(unsigned char hrs, unsigned char min, + + return 1; + } ++EXPORT_SYMBOL_GPL(hpet_set_alarm_time); + + int hpet_set_periodic_freq(unsigned long freq) + { +@@ -593,11 +628,13 @@ int hpet_set_periodic_freq(unsigned long freq) + } + return 1; + } ++EXPORT_SYMBOL_GPL(hpet_set_periodic_freq); + + int hpet_rtc_dropped_irq(void) + { + return is_hpet_enabled(); + } ++EXPORT_SYMBOL_GPL(hpet_rtc_dropped_irq); + + static void hpet_rtc_timer_reinit(void) + { +@@ -641,9 +678,10 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id) + unsigned long rtc_int_flag = 0; + + hpet_rtc_timer_reinit(); ++ memset(&curr_time, 0, sizeof(struct rtc_time)); + + if (hpet_rtc_flags & (RTC_UIE | RTC_AIE)) +- rtc_get_rtc_time(&curr_time); ++ get_rtc_time(&curr_time); + + if (hpet_rtc_flags & RTC_UIE && + curr_time.tm_sec != hpet_prev_update_sec) { +@@ -665,8 +703,10 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id) + + if (rtc_int_flag) { + rtc_int_flag |= (RTC_IRQF | (RTC_NUM_INTS << 8)); +- rtc_interrupt(rtc_int_flag, dev_id); ++ if (irq_handler) ++ irq_handler(rtc_int_flag, dev_id); + } + return IRQ_HANDLED; + } ++EXPORT_SYMBOL_GPL(hpet_rtc_interrupt); + #endif +diff --git a/arch/x86/kernel/i386_ksyms_32.c b/arch/x86/kernel/i386_ksyms_32.c +index 02112fc..0616278 100644 +--- a/arch/x86/kernel/i386_ksyms_32.c ++++ b/arch/x86/kernel/i386_ksyms_32.c +@@ -22,12 +22,5 @@ EXPORT_SYMBOL(__put_user_8); + + EXPORT_SYMBOL(strstr); + +-#ifdef CONFIG_SMP +-extern void FASTCALL( __write_lock_failed(rwlock_t *rw)); +-extern void FASTCALL( __read_lock_failed(rwlock_t *rw)); +-EXPORT_SYMBOL(__write_lock_failed); +-EXPORT_SYMBOL(__read_lock_failed); +-#endif +- + EXPORT_SYMBOL(csum_partial); + EXPORT_SYMBOL(empty_zero_page); +diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c +new file mode 100644 +index 0000000..26719bd +--- /dev/null ++++ b/arch/x86/kernel/i387.c +@@ -0,0 +1,479 @@ ++/* ++ * Copyright (C) 1994 Linus Torvalds ++ * ++ * Pentium III FXSR, SSE support ++ * General FPU state handling cleanups ++ * Gareth Hughes , May 2000 ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++#ifdef CONFIG_X86_64 ++ ++#include ++#include ++ ++#else ++ ++#define save_i387_ia32 save_i387 ++#define restore_i387_ia32 restore_i387 ++ ++#define _fpstate_ia32 _fpstate ++#define user_i387_ia32_struct user_i387_struct ++#define user32_fxsr_struct user_fxsr_struct ++ ++#endif ++ ++#ifdef CONFIG_MATH_EMULATION ++#define HAVE_HWFP (boot_cpu_data.hard_math) ++#else ++#define HAVE_HWFP 1 ++#endif ++ ++unsigned int mxcsr_feature_mask __read_mostly = 0xffffffffu; ++ ++void mxcsr_feature_mask_init(void) ++{ ++ unsigned long mask = 0; ++ clts(); ++ if (cpu_has_fxsr) { ++ memset(¤t->thread.i387.fxsave, 0, ++ sizeof(struct i387_fxsave_struct)); ++ asm volatile("fxsave %0" : : "m" (current->thread.i387.fxsave)); ++ mask = current->thread.i387.fxsave.mxcsr_mask; ++ if (mask == 0) ++ mask = 0x0000ffbf; ++ } ++ mxcsr_feature_mask &= mask; ++ stts(); ++} ++ ++#ifdef CONFIG_X86_64 ++/* ++ * Called at 
bootup to set up the initial FPU state that is later cloned ++ * into all processes. ++ */ ++void __cpuinit fpu_init(void) ++{ ++ unsigned long oldcr0 = read_cr0(); ++ extern void __bad_fxsave_alignment(void); ++ ++ if (offsetof(struct task_struct, thread.i387.fxsave) & 15) ++ __bad_fxsave_alignment(); ++ set_in_cr4(X86_CR4_OSFXSR); ++ set_in_cr4(X86_CR4_OSXMMEXCPT); ++ ++ write_cr0(oldcr0 & ~((1UL<<3)|(1UL<<2))); /* clear TS and EM */ ++ ++ mxcsr_feature_mask_init(); ++ /* clean state in init */ ++ current_thread_info()->status = 0; ++ clear_used_math(); ++} ++#endif /* CONFIG_X86_64 */ ++ ++/* ++ * The _current_ task is using the FPU for the first time ++ * so initialize it and set the mxcsr to its default ++ * value at reset if we support XMM instructions and then ++ * remeber the current task has used the FPU. ++ */ ++void init_fpu(struct task_struct *tsk) ++{ ++ if (tsk_used_math(tsk)) { ++ if (tsk == current) ++ unlazy_fpu(tsk); ++ return; ++ } ++ ++ if (cpu_has_fxsr) { ++ memset(&tsk->thread.i387.fxsave, 0, ++ sizeof(struct i387_fxsave_struct)); ++ tsk->thread.i387.fxsave.cwd = 0x37f; ++ if (cpu_has_xmm) ++ tsk->thread.i387.fxsave.mxcsr = MXCSR_DEFAULT; ++ } else { ++ memset(&tsk->thread.i387.fsave, 0, ++ sizeof(struct i387_fsave_struct)); ++ tsk->thread.i387.fsave.cwd = 0xffff037fu; ++ tsk->thread.i387.fsave.swd = 0xffff0000u; ++ tsk->thread.i387.fsave.twd = 0xffffffffu; ++ tsk->thread.i387.fsave.fos = 0xffff0000u; ++ } ++ /* ++ * Only the device not available exception or ptrace can call init_fpu. ++ */ ++ set_stopped_child_used_math(tsk); ++} ++ ++int fpregs_active(struct task_struct *target, const struct user_regset *regset) ++{ ++ return tsk_used_math(target) ? regset->n : 0; ++} ++ ++int xfpregs_active(struct task_struct *target, const struct user_regset *regset) ++{ ++ return (cpu_has_fxsr && tsk_used_math(target)) ? regset->n : 0; ++} ++ ++int xfpregs_get(struct task_struct *target, const struct user_regset *regset, ++ unsigned int pos, unsigned int count, ++ void *kbuf, void __user *ubuf) ++{ ++ if (!cpu_has_fxsr) ++ return -ENODEV; ++ ++ unlazy_fpu(target); ++ ++ return user_regset_copyout(&pos, &count, &kbuf, &ubuf, ++ &target->thread.i387.fxsave, 0, -1); ++} ++ ++int xfpregs_set(struct task_struct *target, const struct user_regset *regset, ++ unsigned int pos, unsigned int count, ++ const void *kbuf, const void __user *ubuf) ++{ ++ int ret; ++ ++ if (!cpu_has_fxsr) ++ return -ENODEV; ++ ++ unlazy_fpu(target); ++ set_stopped_child_used_math(target); ++ ++ ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, ++ &target->thread.i387.fxsave, 0, -1); ++ ++ /* ++ * mxcsr reserved bits must be masked to zero for security reasons. ++ */ ++ target->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask; ++ ++ return ret; ++} ++ ++#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION ++ ++/* ++ * FPU tag word conversions. ++ */ ++ ++static inline unsigned short twd_i387_to_fxsr(unsigned short twd) ++{ ++ unsigned int tmp; /* to avoid 16 bit prefixes in the code */ ++ ++ /* Transform each pair of bits into 01 (valid) or 00 (empty) */ ++ tmp = ~twd; ++ tmp = (tmp | (tmp>>1)) & 0x5555; /* 0V0V0V0V0V0V0V0V */ ++ /* and move the valid bits to the lower byte. 
*/ ++ tmp = (tmp | (tmp >> 1)) & 0x3333; /* 00VV00VV00VV00VV */ ++ tmp = (tmp | (tmp >> 2)) & 0x0f0f; /* 0000VVVV0000VVVV */ ++ tmp = (tmp | (tmp >> 4)) & 0x00ff; /* 00000000VVVVVVVV */ ++ return tmp; ++} ++ ++#define FPREG_ADDR(f, n) ((void *)&(f)->st_space + (n) * 16); ++#define FP_EXP_TAG_VALID 0 ++#define FP_EXP_TAG_ZERO 1 ++#define FP_EXP_TAG_SPECIAL 2 ++#define FP_EXP_TAG_EMPTY 3 ++ ++static inline u32 twd_fxsr_to_i387(struct i387_fxsave_struct *fxsave) ++{ ++ struct _fpxreg *st; ++ u32 tos = (fxsave->swd >> 11) & 7; ++ u32 twd = (unsigned long) fxsave->twd; ++ u32 tag; ++ u32 ret = 0xffff0000u; ++ int i; ++ ++ for (i = 0; i < 8; i++, twd >>= 1) { ++ if (twd & 0x1) { ++ st = FPREG_ADDR(fxsave, (i - tos) & 7); ++ ++ switch (st->exponent & 0x7fff) { ++ case 0x7fff: ++ tag = FP_EXP_TAG_SPECIAL; ++ break; ++ case 0x0000: ++ if (!st->significand[0] && ++ !st->significand[1] && ++ !st->significand[2] && ++ !st->significand[3]) ++ tag = FP_EXP_TAG_ZERO; ++ else ++ tag = FP_EXP_TAG_SPECIAL; ++ break; ++ default: ++ if (st->significand[3] & 0x8000) ++ tag = FP_EXP_TAG_VALID; ++ else ++ tag = FP_EXP_TAG_SPECIAL; ++ break; ++ } ++ } else { ++ tag = FP_EXP_TAG_EMPTY; ++ } ++ ret |= tag << (2 * i); ++ } ++ return ret; ++} ++ ++/* ++ * FXSR floating point environment conversions. ++ */ ++ ++static void convert_from_fxsr(struct user_i387_ia32_struct *env, ++ struct task_struct *tsk) ++{ ++ struct i387_fxsave_struct *fxsave = &tsk->thread.i387.fxsave; ++ struct _fpreg *to = (struct _fpreg *) &env->st_space[0]; ++ struct _fpxreg *from = (struct _fpxreg *) &fxsave->st_space[0]; ++ int i; ++ ++ env->cwd = fxsave->cwd | 0xffff0000u; ++ env->swd = fxsave->swd | 0xffff0000u; ++ env->twd = twd_fxsr_to_i387(fxsave); ++ ++#ifdef CONFIG_X86_64 ++ env->fip = fxsave->rip; ++ env->foo = fxsave->rdp; ++ if (tsk == current) { ++ /* ++ * should be actually ds/cs at fpu exception time, but ++ * that information is not available in 64bit mode. 
++ */ ++ asm("mov %%ds,%0" : "=r" (env->fos)); ++ asm("mov %%cs,%0" : "=r" (env->fcs)); ++ } else { ++ struct pt_regs *regs = task_pt_regs(tsk); ++ env->fos = 0xffff0000 | tsk->thread.ds; ++ env->fcs = regs->cs; ++ } ++#else ++ env->fip = fxsave->fip; ++ env->fcs = fxsave->fcs; ++ env->foo = fxsave->foo; ++ env->fos = fxsave->fos; ++#endif ++ ++ for (i = 0; i < 8; ++i) ++ memcpy(&to[i], &from[i], sizeof(to[0])); ++} ++ ++static void convert_to_fxsr(struct task_struct *tsk, ++ const struct user_i387_ia32_struct *env) ++ ++{ ++ struct i387_fxsave_struct *fxsave = &tsk->thread.i387.fxsave; ++ struct _fpreg *from = (struct _fpreg *) &env->st_space[0]; ++ struct _fpxreg *to = (struct _fpxreg *) &fxsave->st_space[0]; ++ int i; ++ ++ fxsave->cwd = env->cwd; ++ fxsave->swd = env->swd; ++ fxsave->twd = twd_i387_to_fxsr(env->twd); ++ fxsave->fop = (u16) ((u32) env->fcs >> 16); ++#ifdef CONFIG_X86_64 ++ fxsave->rip = env->fip; ++ fxsave->rdp = env->foo; ++ /* cs and ds ignored */ ++#else ++ fxsave->fip = env->fip; ++ fxsave->fcs = (env->fcs & 0xffff); ++ fxsave->foo = env->foo; ++ fxsave->fos = env->fos; ++#endif ++ ++ for (i = 0; i < 8; ++i) ++ memcpy(&to[i], &from[i], sizeof(from[0])); ++} ++ ++int fpregs_get(struct task_struct *target, const struct user_regset *regset, ++ unsigned int pos, unsigned int count, ++ void *kbuf, void __user *ubuf) ++{ ++ struct user_i387_ia32_struct env; ++ ++ if (!HAVE_HWFP) ++ return fpregs_soft_get(target, regset, pos, count, kbuf, ubuf); ++ ++ unlazy_fpu(target); ++ ++ if (!cpu_has_fxsr) ++ return user_regset_copyout(&pos, &count, &kbuf, &ubuf, ++ &target->thread.i387.fsave, 0, -1); ++ ++ if (kbuf && pos == 0 && count == sizeof(env)) { ++ convert_from_fxsr(kbuf, target); ++ return 0; ++ } ++ ++ convert_from_fxsr(&env, target); ++ return user_regset_copyout(&pos, &count, &kbuf, &ubuf, &env, 0, -1); ++} ++ ++int fpregs_set(struct task_struct *target, const struct user_regset *regset, ++ unsigned int pos, unsigned int count, ++ const void *kbuf, const void __user *ubuf) ++{ ++ struct user_i387_ia32_struct env; ++ int ret; ++ ++ if (!HAVE_HWFP) ++ return fpregs_soft_set(target, regset, pos, count, kbuf, ubuf); ++ ++ unlazy_fpu(target); ++ set_stopped_child_used_math(target); ++ ++ if (!cpu_has_fxsr) ++ return user_regset_copyin(&pos, &count, &kbuf, &ubuf, ++ &target->thread.i387.fsave, 0, -1); ++ ++ if (pos > 0 || count < sizeof(env)) ++ convert_from_fxsr(&env, target); ++ ++ ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &env, 0, -1); ++ if (!ret) ++ convert_to_fxsr(target, &env); ++ ++ return ret; ++} ++ ++/* ++ * Signal frame handlers. 
++ */ ++ ++static inline int save_i387_fsave(struct _fpstate_ia32 __user *buf) ++{ ++ struct task_struct *tsk = current; ++ ++ unlazy_fpu(tsk); ++ tsk->thread.i387.fsave.status = tsk->thread.i387.fsave.swd; ++ if (__copy_to_user(buf, &tsk->thread.i387.fsave, ++ sizeof(struct i387_fsave_struct))) ++ return -1; ++ return 1; ++} ++ ++static int save_i387_fxsave(struct _fpstate_ia32 __user *buf) ++{ ++ struct task_struct *tsk = current; ++ struct user_i387_ia32_struct env; ++ int err = 0; ++ ++ unlazy_fpu(tsk); ++ ++ convert_from_fxsr(&env, tsk); ++ if (__copy_to_user(buf, &env, sizeof(env))) ++ return -1; ++ ++ err |= __put_user(tsk->thread.i387.fxsave.swd, &buf->status); ++ err |= __put_user(X86_FXSR_MAGIC, &buf->magic); ++ if (err) ++ return -1; ++ ++ if (__copy_to_user(&buf->_fxsr_env[0], &tsk->thread.i387.fxsave, ++ sizeof(struct i387_fxsave_struct))) ++ return -1; ++ return 1; ++} ++ ++int save_i387_ia32(struct _fpstate_ia32 __user *buf) ++{ ++ if (!used_math()) ++ return 0; ++ ++ /* This will cause a "finit" to be triggered by the next ++ * attempted FPU operation by the 'current' process. ++ */ ++ clear_used_math(); ++ ++ if (HAVE_HWFP) { ++ if (cpu_has_fxsr) { ++ return save_i387_fxsave(buf); ++ } else { ++ return save_i387_fsave(buf); ++ } ++ } else { ++ return fpregs_soft_get(current, NULL, ++ 0, sizeof(struct user_i387_ia32_struct), ++ NULL, buf) ? -1 : 1; ++ } ++} ++ ++static inline int restore_i387_fsave(struct _fpstate_ia32 __user *buf) ++{ ++ struct task_struct *tsk = current; ++ clear_fpu(tsk); ++ return __copy_from_user(&tsk->thread.i387.fsave, buf, ++ sizeof(struct i387_fsave_struct)); ++} ++ ++static int restore_i387_fxsave(struct _fpstate_ia32 __user *buf) ++{ ++ int err; ++ struct task_struct *tsk = current; ++ struct user_i387_ia32_struct env; ++ clear_fpu(tsk); ++ err = __copy_from_user(&tsk->thread.i387.fxsave, &buf->_fxsr_env[0], ++ sizeof(struct i387_fxsave_struct)); ++ /* mxcsr reserved bits must be masked to zero for security reasons */ ++ tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask; ++ if (err || __copy_from_user(&env, buf, sizeof(env))) ++ return 1; ++ convert_to_fxsr(tsk, &env); ++ return 0; ++} ++ ++int restore_i387_ia32(struct _fpstate_ia32 __user *buf) ++{ ++ int err; ++ ++ if (HAVE_HWFP) { ++ if (cpu_has_fxsr) { ++ err = restore_i387_fxsave(buf); ++ } else { ++ err = restore_i387_fsave(buf); ++ } ++ } else { ++ err = fpregs_soft_set(current, NULL, ++ 0, sizeof(struct user_i387_ia32_struct), ++ NULL, buf) != 0; ++ } ++ set_used_math(); ++ return err; ++} ++ ++/* ++ * FPU state for core dumps. ++ * This is only used for a.out dumps now. ++ * It is declared generically using elf_fpregset_t (which is ++ * struct user_i387_struct) but is in fact only used for 32-bit ++ * dumps, so on 64-bit it is really struct user_i387_ia32_struct. 
++ */ ++int dump_fpu(struct pt_regs *regs, struct user_i387_struct *fpu) ++{ ++ int fpvalid; ++ struct task_struct *tsk = current; ++ ++ fpvalid = !!used_math(); ++ if (fpvalid) ++ fpvalid = !fpregs_get(tsk, NULL, ++ 0, sizeof(struct user_i387_ia32_struct), ++ fpu, NULL); ++ ++ return fpvalid; ++} ++EXPORT_SYMBOL(dump_fpu); ++ ++#endif /* CONFIG_X86_32 || CONFIG_IA32_EMULATION */ +diff --git a/arch/x86/kernel/i387_32.c b/arch/x86/kernel/i387_32.c +deleted file mode 100644 +index 7d2e12f..0000000 +--- a/arch/x86/kernel/i387_32.c ++++ /dev/null +@@ -1,544 +0,0 @@ +-/* +- * Copyright (C) 1994 Linus Torvalds +- * +- * Pentium III FXSR, SSE support +- * General FPU state handling cleanups +- * Gareth Hughes , May 2000 +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-#ifdef CONFIG_MATH_EMULATION +-#define HAVE_HWFP (boot_cpu_data.hard_math) +-#else +-#define HAVE_HWFP 1 +-#endif +- +-static unsigned long mxcsr_feature_mask __read_mostly = 0xffffffff; +- +-void mxcsr_feature_mask_init(void) +-{ +- unsigned long mask = 0; +- clts(); +- if (cpu_has_fxsr) { +- memset(¤t->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct)); +- asm volatile("fxsave %0" : : "m" (current->thread.i387.fxsave)); +- mask = current->thread.i387.fxsave.mxcsr_mask; +- if (mask == 0) mask = 0x0000ffbf; +- } +- mxcsr_feature_mask &= mask; +- stts(); +-} +- +-/* +- * The _current_ task is using the FPU for the first time +- * so initialize it and set the mxcsr to its default +- * value at reset if we support XMM instructions and then +- * remeber the current task has used the FPU. +- */ +-void init_fpu(struct task_struct *tsk) +-{ +- if (cpu_has_fxsr) { +- memset(&tsk->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct)); +- tsk->thread.i387.fxsave.cwd = 0x37f; +- if (cpu_has_xmm) +- tsk->thread.i387.fxsave.mxcsr = 0x1f80; +- } else { +- memset(&tsk->thread.i387.fsave, 0, sizeof(struct i387_fsave_struct)); +- tsk->thread.i387.fsave.cwd = 0xffff037fu; +- tsk->thread.i387.fsave.swd = 0xffff0000u; +- tsk->thread.i387.fsave.twd = 0xffffffffu; +- tsk->thread.i387.fsave.fos = 0xffff0000u; +- } +- /* only the device not available exception or ptrace can call init_fpu */ +- set_stopped_child_used_math(tsk); +-} +- +-/* +- * FPU lazy state save handling. +- */ +- +-void kernel_fpu_begin(void) +-{ +- struct thread_info *thread = current_thread_info(); +- +- preempt_disable(); +- if (thread->status & TS_USEDFPU) { +- __save_init_fpu(thread->task); +- return; +- } +- clts(); +-} +-EXPORT_SYMBOL_GPL(kernel_fpu_begin); +- +-/* +- * FPU tag word conversions. +- */ +- +-static inline unsigned short twd_i387_to_fxsr( unsigned short twd ) +-{ +- unsigned int tmp; /* to avoid 16 bit prefixes in the code */ +- +- /* Transform each pair of bits into 01 (valid) or 00 (empty) */ +- tmp = ~twd; +- tmp = (tmp | (tmp>>1)) & 0x5555; /* 0V0V0V0V0V0V0V0V */ +- /* and move the valid bits to the lower byte. 
*/ +- tmp = (tmp | (tmp >> 1)) & 0x3333; /* 00VV00VV00VV00VV */ +- tmp = (tmp | (tmp >> 2)) & 0x0f0f; /* 0000VVVV0000VVVV */ +- tmp = (tmp | (tmp >> 4)) & 0x00ff; /* 00000000VVVVVVVV */ +- return tmp; +-} +- +-static inline unsigned long twd_fxsr_to_i387( struct i387_fxsave_struct *fxsave ) +-{ +- struct _fpxreg *st = NULL; +- unsigned long tos = (fxsave->swd >> 11) & 7; +- unsigned long twd = (unsigned long) fxsave->twd; +- unsigned long tag; +- unsigned long ret = 0xffff0000u; +- int i; +- +-#define FPREG_ADDR(f, n) ((void *)&(f)->st_space + (n) * 16); +- +- for ( i = 0 ; i < 8 ; i++ ) { +- if ( twd & 0x1 ) { +- st = FPREG_ADDR( fxsave, (i - tos) & 7 ); +- +- switch ( st->exponent & 0x7fff ) { +- case 0x7fff: +- tag = 2; /* Special */ +- break; +- case 0x0000: +- if ( !st->significand[0] && +- !st->significand[1] && +- !st->significand[2] && +- !st->significand[3] ) { +- tag = 1; /* Zero */ +- } else { +- tag = 2; /* Special */ +- } +- break; +- default: +- if ( st->significand[3] & 0x8000 ) { +- tag = 0; /* Valid */ +- } else { +- tag = 2; /* Special */ +- } +- break; +- } +- } else { +- tag = 3; /* Empty */ +- } +- ret |= (tag << (2 * i)); +- twd = twd >> 1; +- } +- return ret; +-} +- +-/* +- * FPU state interaction. +- */ +- +-unsigned short get_fpu_cwd( struct task_struct *tsk ) +-{ +- if ( cpu_has_fxsr ) { +- return tsk->thread.i387.fxsave.cwd; +- } else { +- return (unsigned short)tsk->thread.i387.fsave.cwd; +- } +-} +- +-unsigned short get_fpu_swd( struct task_struct *tsk ) +-{ +- if ( cpu_has_fxsr ) { +- return tsk->thread.i387.fxsave.swd; +- } else { +- return (unsigned short)tsk->thread.i387.fsave.swd; +- } +-} +- +-#if 0 +-unsigned short get_fpu_twd( struct task_struct *tsk ) +-{ +- if ( cpu_has_fxsr ) { +- return tsk->thread.i387.fxsave.twd; +- } else { +- return (unsigned short)tsk->thread.i387.fsave.twd; +- } +-} +-#endif /* 0 */ +- +-unsigned short get_fpu_mxcsr( struct task_struct *tsk ) +-{ +- if ( cpu_has_xmm ) { +- return tsk->thread.i387.fxsave.mxcsr; +- } else { +- return 0x1f80; +- } +-} +- +-#if 0 +- +-void set_fpu_cwd( struct task_struct *tsk, unsigned short cwd ) +-{ +- if ( cpu_has_fxsr ) { +- tsk->thread.i387.fxsave.cwd = cwd; +- } else { +- tsk->thread.i387.fsave.cwd = ((long)cwd | 0xffff0000u); +- } +-} +- +-void set_fpu_swd( struct task_struct *tsk, unsigned short swd ) +-{ +- if ( cpu_has_fxsr ) { +- tsk->thread.i387.fxsave.swd = swd; +- } else { +- tsk->thread.i387.fsave.swd = ((long)swd | 0xffff0000u); +- } +-} +- +-void set_fpu_twd( struct task_struct *tsk, unsigned short twd ) +-{ +- if ( cpu_has_fxsr ) { +- tsk->thread.i387.fxsave.twd = twd_i387_to_fxsr(twd); +- } else { +- tsk->thread.i387.fsave.twd = ((long)twd | 0xffff0000u); +- } +-} +- +-#endif /* 0 */ +- +-/* +- * FXSR floating point environment conversions. 
+- */ +- +-static int convert_fxsr_to_user( struct _fpstate __user *buf, +- struct i387_fxsave_struct *fxsave ) +-{ +- unsigned long env[7]; +- struct _fpreg __user *to; +- struct _fpxreg *from; +- int i; +- +- env[0] = (unsigned long)fxsave->cwd | 0xffff0000ul; +- env[1] = (unsigned long)fxsave->swd | 0xffff0000ul; +- env[2] = twd_fxsr_to_i387(fxsave); +- env[3] = fxsave->fip; +- env[4] = fxsave->fcs | ((unsigned long)fxsave->fop << 16); +- env[5] = fxsave->foo; +- env[6] = fxsave->fos; +- +- if ( __copy_to_user( buf, env, 7 * sizeof(unsigned long) ) ) +- return 1; +- +- to = &buf->_st[0]; +- from = (struct _fpxreg *) &fxsave->st_space[0]; +- for ( i = 0 ; i < 8 ; i++, to++, from++ ) { +- unsigned long __user *t = (unsigned long __user *)to; +- unsigned long *f = (unsigned long *)from; +- +- if (__put_user(*f, t) || +- __put_user(*(f + 1), t + 1) || +- __put_user(from->exponent, &to->exponent)) +- return 1; +- } +- return 0; +-} +- +-static int convert_fxsr_from_user( struct i387_fxsave_struct *fxsave, +- struct _fpstate __user *buf ) +-{ +- unsigned long env[7]; +- struct _fpxreg *to; +- struct _fpreg __user *from; +- int i; +- +- if ( __copy_from_user( env, buf, 7 * sizeof(long) ) ) +- return 1; +- +- fxsave->cwd = (unsigned short)(env[0] & 0xffff); +- fxsave->swd = (unsigned short)(env[1] & 0xffff); +- fxsave->twd = twd_i387_to_fxsr((unsigned short)(env[2] & 0xffff)); +- fxsave->fip = env[3]; +- fxsave->fop = (unsigned short)((env[4] & 0xffff0000ul) >> 16); +- fxsave->fcs = (env[4] & 0xffff); +- fxsave->foo = env[5]; +- fxsave->fos = env[6]; +- +- to = (struct _fpxreg *) &fxsave->st_space[0]; +- from = &buf->_st[0]; +- for ( i = 0 ; i < 8 ; i++, to++, from++ ) { +- unsigned long *t = (unsigned long *)to; +- unsigned long __user *f = (unsigned long __user *)from; +- +- if (__get_user(*t, f) || +- __get_user(*(t + 1), f + 1) || +- __get_user(to->exponent, &from->exponent)) +- return 1; +- } +- return 0; +-} +- +-/* +- * Signal frame handlers. +- */ +- +-static inline int save_i387_fsave( struct _fpstate __user *buf ) +-{ +- struct task_struct *tsk = current; +- +- unlazy_fpu( tsk ); +- tsk->thread.i387.fsave.status = tsk->thread.i387.fsave.swd; +- if ( __copy_to_user( buf, &tsk->thread.i387.fsave, +- sizeof(struct i387_fsave_struct) ) ) +- return -1; +- return 1; +-} +- +-static int save_i387_fxsave( struct _fpstate __user *buf ) +-{ +- struct task_struct *tsk = current; +- int err = 0; +- +- unlazy_fpu( tsk ); +- +- if ( convert_fxsr_to_user( buf, &tsk->thread.i387.fxsave ) ) +- return -1; +- +- err |= __put_user( tsk->thread.i387.fxsave.swd, &buf->status ); +- err |= __put_user( X86_FXSR_MAGIC, &buf->magic ); +- if ( err ) +- return -1; +- +- if ( __copy_to_user( &buf->_fxsr_env[0], &tsk->thread.i387.fxsave, +- sizeof(struct i387_fxsave_struct) ) ) +- return -1; +- return 1; +-} +- +-int save_i387( struct _fpstate __user *buf ) +-{ +- if ( !used_math() ) +- return 0; +- +- /* This will cause a "finit" to be triggered by the next +- * attempted FPU operation by the 'current' process. 
+- */ +- clear_used_math(); +- +- if ( HAVE_HWFP ) { +- if ( cpu_has_fxsr ) { +- return save_i387_fxsave( buf ); +- } else { +- return save_i387_fsave( buf ); +- } +- } else { +- return save_i387_soft( ¤t->thread.i387.soft, buf ); +- } +-} +- +-static inline int restore_i387_fsave( struct _fpstate __user *buf ) +-{ +- struct task_struct *tsk = current; +- clear_fpu( tsk ); +- return __copy_from_user( &tsk->thread.i387.fsave, buf, +- sizeof(struct i387_fsave_struct) ); +-} +- +-static int restore_i387_fxsave( struct _fpstate __user *buf ) +-{ +- int err; +- struct task_struct *tsk = current; +- clear_fpu( tsk ); +- err = __copy_from_user( &tsk->thread.i387.fxsave, &buf->_fxsr_env[0], +- sizeof(struct i387_fxsave_struct) ); +- /* mxcsr reserved bits must be masked to zero for security reasons */ +- tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask; +- return err ? 1 : convert_fxsr_from_user( &tsk->thread.i387.fxsave, buf ); +-} +- +-int restore_i387( struct _fpstate __user *buf ) +-{ +- int err; +- +- if ( HAVE_HWFP ) { +- if ( cpu_has_fxsr ) { +- err = restore_i387_fxsave( buf ); +- } else { +- err = restore_i387_fsave( buf ); +- } +- } else { +- err = restore_i387_soft( ¤t->thread.i387.soft, buf ); +- } +- set_used_math(); +- return err; +-} +- +-/* +- * ptrace request handlers. +- */ +- +-static inline int get_fpregs_fsave( struct user_i387_struct __user *buf, +- struct task_struct *tsk ) +-{ +- return __copy_to_user( buf, &tsk->thread.i387.fsave, +- sizeof(struct user_i387_struct) ); +-} +- +-static inline int get_fpregs_fxsave( struct user_i387_struct __user *buf, +- struct task_struct *tsk ) +-{ +- return convert_fxsr_to_user( (struct _fpstate __user *)buf, +- &tsk->thread.i387.fxsave ); +-} +- +-int get_fpregs( struct user_i387_struct __user *buf, struct task_struct *tsk ) +-{ +- if ( HAVE_HWFP ) { +- if ( cpu_has_fxsr ) { +- return get_fpregs_fxsave( buf, tsk ); +- } else { +- return get_fpregs_fsave( buf, tsk ); +- } +- } else { +- return save_i387_soft( &tsk->thread.i387.soft, +- (struct _fpstate __user *)buf ); +- } +-} +- +-static inline int set_fpregs_fsave( struct task_struct *tsk, +- struct user_i387_struct __user *buf ) +-{ +- return __copy_from_user( &tsk->thread.i387.fsave, buf, +- sizeof(struct user_i387_struct) ); +-} +- +-static inline int set_fpregs_fxsave( struct task_struct *tsk, +- struct user_i387_struct __user *buf ) +-{ +- return convert_fxsr_from_user( &tsk->thread.i387.fxsave, +- (struct _fpstate __user *)buf ); +-} +- +-int set_fpregs( struct task_struct *tsk, struct user_i387_struct __user *buf ) +-{ +- if ( HAVE_HWFP ) { +- if ( cpu_has_fxsr ) { +- return set_fpregs_fxsave( tsk, buf ); +- } else { +- return set_fpregs_fsave( tsk, buf ); +- } +- } else { +- return restore_i387_soft( &tsk->thread.i387.soft, +- (struct _fpstate __user *)buf ); +- } +-} +- +-int get_fpxregs( struct user_fxsr_struct __user *buf, struct task_struct *tsk ) +-{ +- if ( cpu_has_fxsr ) { +- if (__copy_to_user( buf, &tsk->thread.i387.fxsave, +- sizeof(struct user_fxsr_struct) )) +- return -EFAULT; +- return 0; +- } else { +- return -EIO; +- } +-} +- +-int set_fpxregs( struct task_struct *tsk, struct user_fxsr_struct __user *buf ) +-{ +- int ret = 0; +- +- if ( cpu_has_fxsr ) { +- if (__copy_from_user( &tsk->thread.i387.fxsave, buf, +- sizeof(struct user_fxsr_struct) )) +- ret = -EFAULT; +- /* mxcsr reserved bits must be masked to zero for security reasons */ +- tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask; +- } else { +- ret = -EIO; +- } +- return ret; +-} +- +-/* +- * FPU 
state for core dumps. +- */ +- +-static inline void copy_fpu_fsave( struct task_struct *tsk, +- struct user_i387_struct *fpu ) +-{ +- memcpy( fpu, &tsk->thread.i387.fsave, +- sizeof(struct user_i387_struct) ); +-} +- +-static inline void copy_fpu_fxsave( struct task_struct *tsk, +- struct user_i387_struct *fpu ) +-{ +- unsigned short *to; +- unsigned short *from; +- int i; +- +- memcpy( fpu, &tsk->thread.i387.fxsave, 7 * sizeof(long) ); +- +- to = (unsigned short *)&fpu->st_space[0]; +- from = (unsigned short *)&tsk->thread.i387.fxsave.st_space[0]; +- for ( i = 0 ; i < 8 ; i++, to += 5, from += 8 ) { +- memcpy( to, from, 5 * sizeof(unsigned short) ); +- } +-} +- +-int dump_fpu( struct pt_regs *regs, struct user_i387_struct *fpu ) +-{ +- int fpvalid; +- struct task_struct *tsk = current; +- +- fpvalid = !!used_math(); +- if ( fpvalid ) { +- unlazy_fpu( tsk ); +- if ( cpu_has_fxsr ) { +- copy_fpu_fxsave( tsk, fpu ); +- } else { +- copy_fpu_fsave( tsk, fpu ); +- } +- } +- +- return fpvalid; +-} +-EXPORT_SYMBOL(dump_fpu); +- +-int dump_task_fpu(struct task_struct *tsk, struct user_i387_struct *fpu) +-{ +- int fpvalid = !!tsk_used_math(tsk); +- +- if (fpvalid) { +- if (tsk == current) +- unlazy_fpu(tsk); +- if (cpu_has_fxsr) +- copy_fpu_fxsave(tsk, fpu); +- else +- copy_fpu_fsave(tsk, fpu); +- } +- return fpvalid; +-} +- +-int dump_task_extended_fpu(struct task_struct *tsk, struct user_fxsr_struct *fpu) +-{ +- int fpvalid = tsk_used_math(tsk) && cpu_has_fxsr; +- +- if (fpvalid) { +- if (tsk == current) +- unlazy_fpu(tsk); +- memcpy(fpu, &tsk->thread.i387.fxsave, sizeof(*fpu)); +- } +- return fpvalid; +-} +diff --git a/arch/x86/kernel/i387_64.c b/arch/x86/kernel/i387_64.c +deleted file mode 100644 +index bfaff28..0000000 +--- a/arch/x86/kernel/i387_64.c ++++ /dev/null +@@ -1,150 +0,0 @@ +-/* +- * Copyright (C) 1994 Linus Torvalds +- * Copyright (C) 2002 Andi Kleen, SuSE Labs +- * +- * Pentium III FXSR, SSE support +- * General FPU state handling cleanups +- * Gareth Hughes , May 2000 +- * +- * x86-64 rework 2002 Andi Kleen. +- * Does direct fxsave in and out of user space now for signal handlers. +- * All the FSAVE<->FXSAVE conversion code has been moved to the 32bit emulation, +- * the 64bit user space sees a FXSAVE frame directly. +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-unsigned int mxcsr_feature_mask __read_mostly = 0xffffffff; +- +-void mxcsr_feature_mask_init(void) +-{ +- unsigned int mask; +- clts(); +- memset(¤t->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct)); +- asm volatile("fxsave %0" : : "m" (current->thread.i387.fxsave)); +- mask = current->thread.i387.fxsave.mxcsr_mask; +- if (mask == 0) mask = 0x0000ffbf; +- mxcsr_feature_mask &= mask; +- stts(); +-} +- +-/* +- * Called at bootup to set up the initial FPU state that is later cloned +- * into all processes. 
+- */ +-void __cpuinit fpu_init(void) +-{ +- unsigned long oldcr0 = read_cr0(); +- extern void __bad_fxsave_alignment(void); +- +- if (offsetof(struct task_struct, thread.i387.fxsave) & 15) +- __bad_fxsave_alignment(); +- set_in_cr4(X86_CR4_OSFXSR); +- set_in_cr4(X86_CR4_OSXMMEXCPT); +- +- write_cr0(oldcr0 & ~((1UL<<3)|(1UL<<2))); /* clear TS and EM */ +- +- mxcsr_feature_mask_init(); +- /* clean state in init */ +- current_thread_info()->status = 0; +- clear_used_math(); +-} +- +-void init_fpu(struct task_struct *child) +-{ +- if (tsk_used_math(child)) { +- if (child == current) +- unlazy_fpu(child); +- return; +- } +- memset(&child->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct)); +- child->thread.i387.fxsave.cwd = 0x37f; +- child->thread.i387.fxsave.mxcsr = 0x1f80; +- /* only the device not available exception or ptrace can call init_fpu */ +- set_stopped_child_used_math(child); +-} +- +-/* +- * Signal frame handlers. +- */ +- +-int save_i387(struct _fpstate __user *buf) +-{ +- struct task_struct *tsk = current; +- int err = 0; +- +- BUILD_BUG_ON(sizeof(struct user_i387_struct) != +- sizeof(tsk->thread.i387.fxsave)); +- +- if ((unsigned long)buf % 16) +- printk("save_i387: bad fpstate %p\n",buf); +- +- if (!used_math()) +- return 0; +- clear_used_math(); /* trigger finit */ +- if (task_thread_info(tsk)->status & TS_USEDFPU) { +- err = save_i387_checking((struct i387_fxsave_struct __user *)buf); +- if (err) return err; +- task_thread_info(tsk)->status &= ~TS_USEDFPU; +- stts(); +- } else { +- if (__copy_to_user(buf, &tsk->thread.i387.fxsave, +- sizeof(struct i387_fxsave_struct))) +- return -1; +- } +- return 1; +-} +- +-/* +- * ptrace request handlers. +- */ +- +-int get_fpregs(struct user_i387_struct __user *buf, struct task_struct *tsk) +-{ +- init_fpu(tsk); +- return __copy_to_user(buf, &tsk->thread.i387.fxsave, +- sizeof(struct user_i387_struct)) ? -EFAULT : 0; +-} +- +-int set_fpregs(struct task_struct *tsk, struct user_i387_struct __user *buf) +-{ +- if (__copy_from_user(&tsk->thread.i387.fxsave, buf, +- sizeof(struct user_i387_struct))) +- return -EFAULT; +- return 0; +-} +- +-/* +- * FPU state for core dumps. +- */ +- +-int dump_fpu( struct pt_regs *regs, struct user_i387_struct *fpu ) +-{ +- struct task_struct *tsk = current; +- +- if (!used_math()) +- return 0; +- +- unlazy_fpu(tsk); +- memcpy(fpu, &tsk->thread.i387.fxsave, sizeof(struct user_i387_struct)); +- return 1; +-} +- +-int dump_task_fpu(struct task_struct *tsk, struct user_i387_struct *fpu) +-{ +- int fpvalid = !!tsk_used_math(tsk); +- +- if (fpvalid) { +- if (tsk == current) +- unlazy_fpu(tsk); +- memcpy(fpu, &tsk->thread.i387.fxsave, sizeof(struct user_i387_struct)); +-} +- return fpvalid; +-} diff --git a/arch/x86/kernel/i8237.c b/arch/x86/kernel/i8237.c index 2931383..dbd6c1d 100644 --- a/arch/x86/kernel/i8237.c @@ -135365,11 +155431,161 @@ index 2931383..dbd6c1d 100644 .suspend = i8237A_suspend, .resume = i8237A_resume, }; +diff --git a/arch/x86/kernel/i8253.c b/arch/x86/kernel/i8253.c +index a42c807..ef62b07 100644 +--- a/arch/x86/kernel/i8253.c ++++ b/arch/x86/kernel/i8253.c +@@ -13,10 +13,17 @@ + #include + #include + #include ++#include + + DEFINE_SPINLOCK(i8253_lock); + EXPORT_SYMBOL(i8253_lock); + ++#ifdef CONFIG_X86_32 ++static void pit_disable_clocksource(void); ++#else ++static inline void pit_disable_clocksource(void) { } ++#endif ++ + /* + * HPET replaces the PIT, when enabled. 
So we need to know, which of + * the two timers is used +@@ -31,38 +38,38 @@ struct clock_event_device *global_clock_event; + static void init_pit_timer(enum clock_event_mode mode, + struct clock_event_device *evt) + { +- unsigned long flags; +- +- spin_lock_irqsave(&i8253_lock, flags); ++ spin_lock(&i8253_lock); + + switch(mode) { + case CLOCK_EVT_MODE_PERIODIC: + /* binary, mode 2, LSB/MSB, ch 0 */ +- outb_p(0x34, PIT_MODE); +- outb_p(LATCH & 0xff , PIT_CH0); /* LSB */ +- outb(LATCH >> 8 , PIT_CH0); /* MSB */ ++ outb_pit(0x34, PIT_MODE); ++ outb_pit(LATCH & 0xff , PIT_CH0); /* LSB */ ++ outb_pit(LATCH >> 8 , PIT_CH0); /* MSB */ + break; + + case CLOCK_EVT_MODE_SHUTDOWN: + case CLOCK_EVT_MODE_UNUSED: + if (evt->mode == CLOCK_EVT_MODE_PERIODIC || + evt->mode == CLOCK_EVT_MODE_ONESHOT) { +- outb_p(0x30, PIT_MODE); +- outb_p(0, PIT_CH0); +- outb_p(0, PIT_CH0); ++ outb_pit(0x30, PIT_MODE); ++ outb_pit(0, PIT_CH0); ++ outb_pit(0, PIT_CH0); + } ++ pit_disable_clocksource(); + break; + + case CLOCK_EVT_MODE_ONESHOT: + /* One shot setup */ +- outb_p(0x38, PIT_MODE); ++ pit_disable_clocksource(); ++ outb_pit(0x38, PIT_MODE); + break; + + case CLOCK_EVT_MODE_RESUME: + /* Nothing to do here */ + break; + } +- spin_unlock_irqrestore(&i8253_lock, flags); ++ spin_unlock(&i8253_lock); + } + + /* +@@ -72,12 +79,10 @@ static void init_pit_timer(enum clock_event_mode mode, + */ + static int pit_next_event(unsigned long delta, struct clock_event_device *evt) + { +- unsigned long flags; +- +- spin_lock_irqsave(&i8253_lock, flags); +- outb_p(delta & 0xff , PIT_CH0); /* LSB */ +- outb(delta >> 8 , PIT_CH0); /* MSB */ +- spin_unlock_irqrestore(&i8253_lock, flags); ++ spin_lock(&i8253_lock); ++ outb_pit(delta & 0xff , PIT_CH0); /* LSB */ ++ outb_pit(delta >> 8 , PIT_CH0); /* MSB */ ++ spin_unlock(&i8253_lock); + + return 0; + } +@@ -148,15 +153,15 @@ static cycle_t pit_read(void) + * count), it cannot be newer. + */ + jifs = jiffies; +- outb_p(0x00, PIT_MODE); /* latch the count ASAP */ +- count = inb_p(PIT_CH0); /* read the latched count */ +- count |= inb_p(PIT_CH0) << 8; ++ outb_pit(0x00, PIT_MODE); /* latch the count ASAP */ ++ count = inb_pit(PIT_CH0); /* read the latched count */ ++ count |= inb_pit(PIT_CH0) << 8; + + /* VIA686a test code... reset the latch if count > max + 1 */ + if (count > LATCH) { +- outb_p(0x34, PIT_MODE); +- outb_p(LATCH & 0xff, PIT_CH0); +- outb(LATCH >> 8, PIT_CH0); ++ outb_pit(0x34, PIT_MODE); ++ outb_pit(LATCH & 0xff, PIT_CH0); ++ outb_pit(LATCH >> 8, PIT_CH0); + count = LATCH - 1; + } + +@@ -195,9 +200,28 @@ static struct clocksource clocksource_pit = { + .shift = 20, + }; + ++static void pit_disable_clocksource(void) ++{ ++ /* ++ * Use mult to check whether it is registered or not ++ */ ++ if (clocksource_pit.mult) { ++ clocksource_unregister(&clocksource_pit); ++ clocksource_pit.mult = 0; ++ } ++} ++ + static int __init init_pit_clocksource(void) + { +- if (num_possible_cpus() > 1) /* PIT does not scale! 
*/ ++ /* ++ * Several reasons not to register PIT as a clocksource: ++ * ++ * - On SMP PIT does not scale due to i8253_lock ++ * - when HPET is enabled ++ * - when local APIC timer is active (PIT is switched off) ++ */ ++ if (num_possible_cpus() > 1 || is_hpet_enabled() || ++ pit_clockevent.mode != CLOCK_EVT_MODE_PERIODIC) + return 0; + + clocksource_pit.mult = clocksource_hz2mult(CLOCK_TICK_RATE, 20); diff --git a/arch/x86/kernel/i8259_32.c b/arch/x86/kernel/i8259_32.c -index f634fc7..5f3496d 100644 +index f634fc7..2d25b77 100644 --- a/arch/x86/kernel/i8259_32.c +++ b/arch/x86/kernel/i8259_32.c -@@ -258,7 +258,7 @@ static int i8259A_shutdown(struct sys_device *dev) +@@ -21,8 +21,6 @@ + #include + #include + +-#include +- + /* + * This is the 'legacy' 8259A Programmable Interrupt Controller, + * present in the majority of PC/AT boxes. +@@ -258,7 +256,7 @@ static int i8259A_shutdown(struct sys_device *dev) } static struct sysdev_class i8259_sysdev_class = { @@ -135378,11 +155594,242 @@ index f634fc7..5f3496d 100644 .suspend = i8259A_suspend, .resume = i8259A_resume, .shutdown = i8259A_shutdown, +@@ -291,20 +289,20 @@ void init_8259A(int auto_eoi) + outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */ + + /* +- * outb_p - this has to work on a wide range of PC hardware. ++ * outb_pic - this has to work on a wide range of PC hardware. + */ +- outb_p(0x11, PIC_MASTER_CMD); /* ICW1: select 8259A-1 init */ +- outb_p(0x20 + 0, PIC_MASTER_IMR); /* ICW2: 8259A-1 IR0-7 mapped to 0x20-0x27 */ +- outb_p(1U << PIC_CASCADE_IR, PIC_MASTER_IMR); /* 8259A-1 (the master) has a slave on IR2 */ ++ outb_pic(0x11, PIC_MASTER_CMD); /* ICW1: select 8259A-1 init */ ++ outb_pic(0x20 + 0, PIC_MASTER_IMR); /* ICW2: 8259A-1 IR0-7 mapped to 0x20-0x27 */ ++ outb_pic(1U << PIC_CASCADE_IR, PIC_MASTER_IMR); /* 8259A-1 (the master) has a slave on IR2 */ + if (auto_eoi) /* master does Auto EOI */ +- outb_p(MASTER_ICW4_DEFAULT | PIC_ICW4_AEOI, PIC_MASTER_IMR); ++ outb_pic(MASTER_ICW4_DEFAULT | PIC_ICW4_AEOI, PIC_MASTER_IMR); + else /* master expects normal EOI */ +- outb_p(MASTER_ICW4_DEFAULT, PIC_MASTER_IMR); ++ outb_pic(MASTER_ICW4_DEFAULT, PIC_MASTER_IMR); + +- outb_p(0x11, PIC_SLAVE_CMD); /* ICW1: select 8259A-2 init */ +- outb_p(0x20 + 8, PIC_SLAVE_IMR); /* ICW2: 8259A-2 IR0-7 mapped to 0x28-0x2f */ +- outb_p(PIC_CASCADE_IR, PIC_SLAVE_IMR); /* 8259A-2 is a slave on master's IR2 */ +- outb_p(SLAVE_ICW4_DEFAULT, PIC_SLAVE_IMR); /* (slave's support for AEOI in flat mode is to be investigated) */ ++ outb_pic(0x11, PIC_SLAVE_CMD); /* ICW1: select 8259A-2 init */ ++ outb_pic(0x20 + 8, PIC_SLAVE_IMR); /* ICW2: 8259A-2 IR0-7 mapped to 0x28-0x2f */ ++ outb_pic(PIC_CASCADE_IR, PIC_SLAVE_IMR); /* 8259A-2 is a slave on master's IR2 */ ++ outb_pic(SLAVE_ICW4_DEFAULT, PIC_SLAVE_IMR); /* (slave's support for AEOI in flat mode is to be investigated) */ + if (auto_eoi) + /* + * In AEOI mode we just have to mask the interrupt +@@ -341,7 +339,7 @@ static irqreturn_t math_error_irq(int cpl, void *dev_id) + outb(0,0xF0); + if (ignore_fpu_irq || !boot_cpu_data.hard_math) + return IRQ_NONE; +- math_error((void __user *)get_irq_regs()->eip); ++ math_error((void __user *)get_irq_regs()->ip); + return IRQ_HANDLED; + } + diff --git a/arch/x86/kernel/i8259_64.c b/arch/x86/kernel/i8259_64.c -index 3f27ea0..ba6d572 100644 +index 3f27ea0..fa57a15 100644 --- a/arch/x86/kernel/i8259_64.c +++ b/arch/x86/kernel/i8259_64.c -@@ -370,7 +370,7 @@ static int i8259A_shutdown(struct sys_device *dev) +@@ -21,6 +21,7 @@ + #include + #include + #include ++#include 
+ + /* + * Common place to define all x86 IRQ vectors +@@ -48,7 +49,7 @@ + */ + + /* +- * The IO-APIC gives us many more interrupt sources. Most of these ++ * The IO-APIC gives us many more interrupt sources. Most of these + * are unused but an SMP system is supposed to have enough memory ... + * sometimes (mostly wrt. hw bugs) we get corrupted vectors all + * across the spectrum, so we really want to be prepared to get all +@@ -76,7 +77,7 @@ BUILD_16_IRQS(0xc) BUILD_16_IRQS(0xd) BUILD_16_IRQS(0xe) BUILD_16_IRQS(0xf) + IRQ(x,c), IRQ(x,d), IRQ(x,e), IRQ(x,f) + + /* for the irq vectors */ +-static void (*interrupt[NR_VECTORS - FIRST_EXTERNAL_VECTOR])(void) = { ++static void (*__initdata interrupt[NR_VECTORS - FIRST_EXTERNAL_VECTOR])(void) = { + IRQLIST_16(0x2), IRQLIST_16(0x3), + IRQLIST_16(0x4), IRQLIST_16(0x5), IRQLIST_16(0x6), IRQLIST_16(0x7), + IRQLIST_16(0x8), IRQLIST_16(0x9), IRQLIST_16(0xa), IRQLIST_16(0xb), +@@ -114,11 +115,7 @@ static struct irq_chip i8259A_chip = { + /* + * This contains the irq mask for both 8259A irq controllers, + */ +-static unsigned int cached_irq_mask = 0xffff; +- +-#define __byte(x,y) (((unsigned char *)&(y))[x]) +-#define cached_21 (__byte(0,cached_irq_mask)) +-#define cached_A1 (__byte(1,cached_irq_mask)) ++unsigned int cached_irq_mask = 0xffff; + + /* + * Not all IRQs can be routed through the IO-APIC, eg. on certain (older) +@@ -139,9 +136,9 @@ void disable_8259A_irq(unsigned int irq) + spin_lock_irqsave(&i8259A_lock, flags); + cached_irq_mask |= mask; + if (irq & 8) +- outb(cached_A1,0xA1); ++ outb(cached_slave_mask, PIC_SLAVE_IMR); + else +- outb(cached_21,0x21); ++ outb(cached_master_mask, PIC_MASTER_IMR); + spin_unlock_irqrestore(&i8259A_lock, flags); + } + +@@ -153,9 +150,9 @@ void enable_8259A_irq(unsigned int irq) + spin_lock_irqsave(&i8259A_lock, flags); + cached_irq_mask &= mask; + if (irq & 8) +- outb(cached_A1,0xA1); ++ outb(cached_slave_mask, PIC_SLAVE_IMR); + else +- outb(cached_21,0x21); ++ outb(cached_master_mask, PIC_MASTER_IMR); + spin_unlock_irqrestore(&i8259A_lock, flags); + } + +@@ -167,9 +164,9 @@ int i8259A_irq_pending(unsigned int irq) + + spin_lock_irqsave(&i8259A_lock, flags); + if (irq < 8) +- ret = inb(0x20) & mask; ++ ret = inb(PIC_MASTER_CMD) & mask; + else +- ret = inb(0xA0) & (mask >> 8); ++ ret = inb(PIC_SLAVE_CMD) & (mask >> 8); + spin_unlock_irqrestore(&i8259A_lock, flags); + + return ret; +@@ -196,14 +193,14 @@ static inline int i8259A_irq_real(unsigned int irq) + int irqmask = 1<> 8); +- outb(0x0A,0xA0); /* back to the IRR register */ ++ outb(0x0B,PIC_SLAVE_CMD); /* ISR register */ ++ value = inb(PIC_SLAVE_CMD) & (irqmask >> 8); ++ outb(0x0A,PIC_SLAVE_CMD); /* back to the IRR register */ + return value; + } + +@@ -240,14 +237,17 @@ static void mask_and_ack_8259A(unsigned int irq) + + handle_real_irq: + if (irq & 8) { +- inb(0xA1); /* DUMMY - (do we need this?) */ +- outb(cached_A1,0xA1); +- outb(0x60+(irq&7),0xA0);/* 'Specific EOI' to slave */ +- outb(0x62,0x20); /* 'Specific EOI' to master-IRQ2 */ ++ inb(PIC_SLAVE_IMR); /* DUMMY - (do we need this?) */ ++ outb(cached_slave_mask, PIC_SLAVE_IMR); ++ /* 'Specific EOI' to slave */ ++ outb(0x60+(irq&7),PIC_SLAVE_CMD); ++ /* 'Specific EOI' to master-IRQ2 */ ++ outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); + } else { +- inb(0x21); /* DUMMY - (do we need this?) */ +- outb(cached_21,0x21); +- outb(0x60+irq,0x20); /* 'Specific EOI' to master */ ++ inb(PIC_MASTER_IMR); /* DUMMY - (do we need this?) 
*/ ++ outb(cached_master_mask, PIC_MASTER_IMR); ++ /* 'Specific EOI' to master */ ++ outb(0x60+irq,PIC_MASTER_CMD); + } + spin_unlock_irqrestore(&i8259A_lock, flags); + return; +@@ -270,7 +270,8 @@ spurious_8259A_irq: + * lets ACK and report it. [once per IRQ] + */ + if (!(spurious_irq_mask & irqmask)) { +- printk(KERN_DEBUG "spurious 8259A interrupt: IRQ%d.\n", irq); ++ printk(KERN_DEBUG ++ "spurious 8259A interrupt: IRQ%d.\n", irq); + spurious_irq_mask |= irqmask; + } + atomic_inc(&irq_err_count); +@@ -283,51 +284,6 @@ spurious_8259A_irq: + } + } + +-void init_8259A(int auto_eoi) +-{ +- unsigned long flags; +- +- i8259A_auto_eoi = auto_eoi; +- +- spin_lock_irqsave(&i8259A_lock, flags); +- +- outb(0xff, 0x21); /* mask all of 8259A-1 */ +- outb(0xff, 0xA1); /* mask all of 8259A-2 */ +- +- /* +- * outb_p - this has to work on a wide range of PC hardware. +- */ +- outb_p(0x11, 0x20); /* ICW1: select 8259A-1 init */ +- outb_p(IRQ0_VECTOR, 0x21); /* ICW2: 8259A-1 IR0-7 mapped to 0x30-0x37 */ +- outb_p(0x04, 0x21); /* 8259A-1 (the master) has a slave on IR2 */ +- if (auto_eoi) +- outb_p(0x03, 0x21); /* master does Auto EOI */ +- else +- outb_p(0x01, 0x21); /* master expects normal EOI */ +- +- outb_p(0x11, 0xA0); /* ICW1: select 8259A-2 init */ +- outb_p(IRQ8_VECTOR, 0xA1); /* ICW2: 8259A-2 IR0-7 mapped to 0x38-0x3f */ +- outb_p(0x02, 0xA1); /* 8259A-2 is a slave on master's IR2 */ +- outb_p(0x01, 0xA1); /* (slave's support for AEOI in flat mode +- is to be investigated) */ +- +- if (auto_eoi) +- /* +- * in AEOI mode we just have to mask the interrupt +- * when acking. +- */ +- i8259A_chip.mask_ack = disable_8259A_irq; +- else +- i8259A_chip.mask_ack = mask_and_ack_8259A; +- +- udelay(100); /* wait for 8259A to initialize */ +- +- outb(cached_21, 0x21); /* restore master IRQ mask */ +- outb(cached_A1, 0xA1); /* restore slave IRQ mask */ +- +- spin_unlock_irqrestore(&i8259A_lock, flags); +-} +- + static char irq_trigger[2]; + /** + * ELCR registers (0x4d0, 0x4d1) control edge/level of IRQ +@@ -364,13 +320,13 @@ static int i8259A_shutdown(struct sys_device *dev) + * the kernel initialization code can get it + * out of. + */ +- outb(0xff, 0x21); /* mask all of 8259A-1 */ +- outb(0xff, 0xA1); /* mask all of 8259A-1 */ ++ outb(0xff, PIC_MASTER_IMR); /* mask all of 8259A-1 */ ++ outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-1 */ + return 0; } static struct sysdev_class i8259_sysdev_class = { @@ -135391,11 +155838,155 @@ index 3f27ea0..ba6d572 100644 .suspend = i8259A_suspend, .resume = i8259A_resume, .shutdown = i8259A_shutdown, +@@ -391,6 +347,58 @@ static int __init i8259A_init_sysfs(void) + + device_initcall(i8259A_init_sysfs); + ++void init_8259A(int auto_eoi) ++{ ++ unsigned long flags; ++ ++ i8259A_auto_eoi = auto_eoi; ++ ++ spin_lock_irqsave(&i8259A_lock, flags); ++ ++ outb(0xff, PIC_MASTER_IMR); /* mask all of 8259A-1 */ ++ outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */ ++ ++ /* ++ * outb_pic - this has to work on a wide range of PC hardware. 
++ */ ++ outb_pic(0x11, PIC_MASTER_CMD); /* ICW1: select 8259A-1 init */ ++ /* ICW2: 8259A-1 IR0-7 mapped to 0x30-0x37 */ ++ outb_pic(IRQ0_VECTOR, PIC_MASTER_IMR); ++ /* 8259A-1 (the master) has a slave on IR2 */ ++ outb_pic(0x04, PIC_MASTER_IMR); ++ if (auto_eoi) /* master does Auto EOI */ ++ outb_pic(MASTER_ICW4_DEFAULT | PIC_ICW4_AEOI, PIC_MASTER_IMR); ++ else /* master expects normal EOI */ ++ outb_pic(MASTER_ICW4_DEFAULT, PIC_MASTER_IMR); ++ ++ outb_pic(0x11, PIC_SLAVE_CMD); /* ICW1: select 8259A-2 init */ ++ /* ICW2: 8259A-2 IR0-7 mapped to 0x38-0x3f */ ++ outb_pic(IRQ8_VECTOR, PIC_SLAVE_IMR); ++ /* 8259A-2 is a slave on master's IR2 */ ++ outb_pic(PIC_CASCADE_IR, PIC_SLAVE_IMR); ++ /* (slave's support for AEOI in flat mode is to be investigated) */ ++ outb_pic(SLAVE_ICW4_DEFAULT, PIC_SLAVE_IMR); ++ ++ if (auto_eoi) ++ /* ++ * In AEOI mode we just have to mask the interrupt ++ * when acking. ++ */ ++ i8259A_chip.mask_ack = disable_8259A_irq; ++ else ++ i8259A_chip.mask_ack = mask_and_ack_8259A; ++ ++ udelay(100); /* wait for 8259A to initialize */ ++ ++ outb(cached_master_mask, PIC_MASTER_IMR); /* restore master IRQ mask */ ++ outb(cached_slave_mask, PIC_SLAVE_IMR); /* restore slave IRQ mask */ ++ ++ spin_unlock_irqrestore(&i8259A_lock, flags); ++} ++ ++ ++ ++ + /* + * IRQ2 is cascade interrupt to second interrupt controller + */ +@@ -448,7 +456,9 @@ void __init init_ISA_irqs (void) + } + } + +-void __init init_IRQ(void) ++void init_IRQ(void) __attribute__((weak, alias("native_init_IRQ"))); ++ ++void __init native_init_IRQ(void) + { + int i; + +diff --git a/arch/x86/kernel/init_task.c b/arch/x86/kernel/init_task.c +index 468c9c4..5b3ce79 100644 +--- a/arch/x86/kernel/init_task.c ++++ b/arch/x86/kernel/init_task.c +@@ -15,7 +15,6 @@ static struct files_struct init_files = INIT_FILES; + static struct signal_struct init_signals = INIT_SIGNALS(init_signals); + static struct sighand_struct init_sighand = INIT_SIGHAND(init_sighand); + struct mm_struct init_mm = INIT_MM(init_mm); +-EXPORT_SYMBOL(init_mm); + + /* + * Initial thread structure. diff --git a/arch/x86/kernel/io_apic_32.c b/arch/x86/kernel/io_apic_32.c -index a6b1490..ab77f19 100644 +index a6b1490..4ca5486 100644 --- a/arch/x86/kernel/io_apic_32.c +++ b/arch/x86/kernel/io_apic_32.c -@@ -2401,7 +2401,7 @@ static int ioapic_resume(struct sys_device *dev) +@@ -35,6 +35,7 @@ + #include + #include + #include ++#include /* time_after() */ + + #include + #include +@@ -48,8 +49,6 @@ + #include + #include + +-#include "io_ports.h" +- + int (*ioapic_renumber_irq)(int ioapic, int irq); + atomic_t irq_mis_count; + +@@ -351,7 +350,7 @@ static void set_ioapic_affinity_irq(unsigned int irq, cpumask_t cpumask) + # include /* kernel_thread() */ + # include /* kstat */ + # include /* kmalloc() */ +-# include /* time_after() */ ++# include + + #define IRQBALANCE_CHECK_ARCH -999 + #define MAX_BALANCED_IRQ_INTERVAL (5*HZ) +@@ -727,7 +726,7 @@ late_initcall(balanced_irq_init); + #endif /* CONFIG_SMP */ + + #ifndef CONFIG_SMP +-void fastcall send_IPI_self(int vector) ++void send_IPI_self(int vector) + { + unsigned int cfg; + +@@ -1900,7 +1899,7 @@ static int __init timer_irq_works(void) + * might have cached one ExtINT interrupt. Finally, at + * least one tick may be lost due to delays. 
+ */ +- if (jiffies - t1 > 4) ++ if (time_after(jiffies, t1 + 4)) + return 1; + + return 0; +@@ -2080,7 +2079,7 @@ static struct irq_chip lapic_chip __read_mostly = { + .eoi = ack_apic, + }; + +-static void setup_nmi (void) ++static void __init setup_nmi(void) + { + /* + * Dirty trick to enable the NMI watchdog ... +@@ -2093,7 +2092,7 @@ static void setup_nmi (void) + */ + apic_printk(APIC_VERBOSE, KERN_INFO "activating NMI Watchdog ..."); + +- on_each_cpu(enable_NMI_through_LVT0, NULL, 1, 1); ++ enable_NMI_through_LVT0(); + + apic_printk(APIC_VERBOSE, " done.\n"); + } +@@ -2401,7 +2400,7 @@ static int ioapic_resume(struct sys_device *dev) } static struct sysdev_class ioapic_sysdev_class = { @@ -135405,10 +155996,106 @@ index a6b1490..ab77f19 100644 .resume = ioapic_resume, }; diff --git a/arch/x86/kernel/io_apic_64.c b/arch/x86/kernel/io_apic_64.c -index cbac167..23a3ac0 100644 +index cbac167..1627c0d 100644 --- a/arch/x86/kernel/io_apic_64.c +++ b/arch/x86/kernel/io_apic_64.c -@@ -1850,7 +1850,7 @@ static int ioapic_resume(struct sys_device *dev) +@@ -32,9 +32,11 @@ + #include + #include + #include ++#include + #ifdef CONFIG_ACPI + #include + #endif ++#include + + #include + #include +@@ -1069,7 +1071,7 @@ void __apicdebuginit print_local_APIC(void * dummy) + v = apic_read(APIC_LVR); + printk(KERN_INFO "... APIC VERSION: %08x\n", v); + ver = GET_APIC_VERSION(v); +- maxlvt = get_maxlvt(); ++ maxlvt = lapic_get_maxlvt(); + + v = apic_read(APIC_TASKPRI); + printk(KERN_DEBUG "... APIC TASKPRI: %08x (%02x)\n", v, v & APIC_TPRI_MASK); +@@ -1171,7 +1173,7 @@ void __apicdebuginit print_PIC(void) + + #endif /* 0 */ + +-static void __init enable_IO_APIC(void) ++void __init enable_IO_APIC(void) + { + union IO_APIC_reg_01 reg_01; + int i8259_apic, i8259_pin; +@@ -1298,7 +1300,7 @@ static int __init timer_irq_works(void) + */ + + /* jiffies wrap? */ +- if (jiffies - t1 > 4) ++ if (time_after(jiffies, t1 + 4)) + return 1; + return 0; + } +@@ -1411,7 +1413,7 @@ static void irq_complete_move(unsigned int irq) + if (likely(!cfg->move_in_progress)) + return; + +- vector = ~get_irq_regs()->orig_rax; ++ vector = ~get_irq_regs()->orig_ax; + me = smp_processor_id(); + if ((vector == cfg->vector) && cpu_isset(me, cfg->domain)) { + cpumask_t cleanup_mask; +@@ -1438,7 +1440,7 @@ static void ack_apic_level(unsigned int irq) + int do_unmask_irq = 0; + + irq_complete_move(irq); +-#if defined(CONFIG_GENERIC_PENDING_IRQ) || defined(CONFIG_IRQBALANCE) ++#ifdef CONFIG_GENERIC_PENDING_IRQ + /* If we are moving the irq we need to mask it */ + if (unlikely(irq_desc[irq].status & IRQ_MOVE_PENDING)) { + do_unmask_irq = 1; +@@ -1565,7 +1567,7 @@ static struct hw_interrupt_type lapic_irq_type __read_mostly = { + .end = end_lapic_irq, + }; + +-static void setup_nmi (void) ++static void __init setup_nmi(void) + { + /* + * Dirty trick to enable the NMI watchdog ... +@@ -1578,7 +1580,7 @@ static void setup_nmi (void) + */ + printk(KERN_INFO "activating NMI Watchdog ..."); + +- enable_NMI_through_LVT0(NULL); ++ enable_NMI_through_LVT0(); + + printk(" done.\n"); + } +@@ -1654,7 +1656,7 @@ static inline void unlock_ExtINT_logic(void) + * + * FIXME: really need to revamp this for modern platforms only. 
+ */ +-static inline void check_timer(void) ++static inline void __init check_timer(void) + { + struct irq_cfg *cfg = irq_cfg + 0; + int apic1, pin1, apic2, pin2; +@@ -1788,7 +1790,10 @@ __setup("no_timer_check", notimercheck); + + void __init setup_IO_APIC(void) + { +- enable_IO_APIC(); ++ ++ /* ++ * calling enable_IO_APIC() is moved to setup_local_APIC for BP ++ */ + + if (acpi_ioapic) + io_apic_irqs = ~0; /* all IRQs go through IOAPIC */ +@@ -1850,7 +1855,7 @@ static int ioapic_resume(struct sys_device *dev) } static struct sysdev_class ioapic_sysdev_class = { @@ -135417,11 +156104,4304 @@ index cbac167..23a3ac0 100644 .suspend = ioapic_suspend, .resume = ioapic_resume, }; +@@ -2288,3 +2293,92 @@ void __init setup_ioapic_dest(void) + } + #endif + ++#define IOAPIC_RESOURCE_NAME_SIZE 11 ++ ++static struct resource *ioapic_resources; ++ ++static struct resource * __init ioapic_setup_resources(void) ++{ ++ unsigned long n; ++ struct resource *res; ++ char *mem; ++ int i; ++ ++ if (nr_ioapics <= 0) ++ return NULL; ++ ++ n = IOAPIC_RESOURCE_NAME_SIZE + sizeof(struct resource); ++ n *= nr_ioapics; ++ ++ mem = alloc_bootmem(n); ++ res = (void *)mem; ++ ++ if (mem != NULL) { ++ memset(mem, 0, n); ++ mem += sizeof(struct resource) * nr_ioapics; ++ ++ for (i = 0; i < nr_ioapics; i++) { ++ res[i].name = mem; ++ res[i].flags = IORESOURCE_MEM | IORESOURCE_BUSY; ++ sprintf(mem, "IOAPIC %u", i); ++ mem += IOAPIC_RESOURCE_NAME_SIZE; ++ } ++ } ++ ++ ioapic_resources = res; ++ ++ return res; ++} ++ ++void __init ioapic_init_mappings(void) ++{ ++ unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0; ++ struct resource *ioapic_res; ++ int i; ++ ++ ioapic_res = ioapic_setup_resources(); ++ for (i = 0; i < nr_ioapics; i++) { ++ if (smp_found_config) { ++ ioapic_phys = mp_ioapics[i].mpc_apicaddr; ++ } else { ++ ioapic_phys = (unsigned long) ++ alloc_bootmem_pages(PAGE_SIZE); ++ ioapic_phys = __pa(ioapic_phys); ++ } ++ set_fixmap_nocache(idx, ioapic_phys); ++ apic_printk(APIC_VERBOSE, ++ "mapped IOAPIC to %016lx (%016lx)\n", ++ __fix_to_virt(idx), ioapic_phys); ++ idx++; ++ ++ if (ioapic_res != NULL) { ++ ioapic_res->start = ioapic_phys; ++ ioapic_res->end = ioapic_phys + (4 * 1024) - 1; ++ ioapic_res++; ++ } ++ } ++} ++ ++static int __init ioapic_insert_resources(void) ++{ ++ int i; ++ struct resource *r = ioapic_resources; ++ ++ if (!r) { ++ printk(KERN_ERR ++ "IO APIC resources could be not be allocated.\n"); ++ return -1; ++ } ++ ++ for (i = 0; i < nr_ioapics; i++) { ++ insert_resource(&iomem_resource, r); ++ r++; ++ } ++ ++ return 0; ++} ++ ++/* Insert the IO APIC resources after PCI initialization has occured to handle ++ * IO APICS that are mapped in on a BAR in PCI space. */ ++late_initcall(ioapic_insert_resources); ++ +diff --git a/arch/x86/kernel/io_delay.c b/arch/x86/kernel/io_delay.c +new file mode 100644 +index 0000000..bd49321 +--- /dev/null ++++ b/arch/x86/kernel/io_delay.c +@@ -0,0 +1,114 @@ ++/* ++ * I/O delay strategies for inb_p/outb_p ++ * ++ * Allow for a DMI based override of port 0x80, needed for certain HP laptops ++ * and possibly other systems. Also allow for the gradual elimination of ++ * outb_p/inb_p API uses. ++ */ ++#include ++#include ++#include ++#include ++#include ++#include ++ ++int io_delay_type __read_mostly = CONFIG_DEFAULT_IO_DELAY_TYPE; ++EXPORT_SYMBOL_GPL(io_delay_type); ++ ++static int __initdata io_delay_override; ++ ++/* ++ * Paravirt wants native_io_delay to be a constant. 
++ */ ++void native_io_delay(void) ++{ ++ switch (io_delay_type) { ++ default: ++ case CONFIG_IO_DELAY_TYPE_0X80: ++ asm volatile ("outb %al, $0x80"); ++ break; ++ case CONFIG_IO_DELAY_TYPE_0XED: ++ asm volatile ("outb %al, $0xed"); ++ break; ++ case CONFIG_IO_DELAY_TYPE_UDELAY: ++ /* ++ * 2 usecs is an upper-bound for the outb delay but ++ * note that udelay doesn't have the bus-level ++ * side-effects that outb does, nor does udelay() have ++ * precise timings during very early bootup (the delays ++ * are shorter until calibrated): ++ */ ++ udelay(2); ++ case CONFIG_IO_DELAY_TYPE_NONE: ++ break; ++ } ++} ++EXPORT_SYMBOL(native_io_delay); ++ ++static int __init dmi_io_delay_0xed_port(const struct dmi_system_id *id) ++{ ++ if (io_delay_type == CONFIG_IO_DELAY_TYPE_0X80) { ++ printk(KERN_NOTICE "%s: using 0xed I/O delay port\n", ++ id->ident); ++ io_delay_type = CONFIG_IO_DELAY_TYPE_0XED; ++ } ++ ++ return 0; ++} ++ ++/* ++ * Quirk table for systems that misbehave (lock up, etc.) if port ++ * 0x80 is used: ++ */ ++static struct dmi_system_id __initdata io_delay_0xed_port_dmi_table[] = { ++ { ++ .callback = dmi_io_delay_0xed_port, ++ .ident = "Compaq Presario V6000", ++ .matches = { ++ DMI_MATCH(DMI_BOARD_VENDOR, "Quanta"), ++ DMI_MATCH(DMI_BOARD_NAME, "30B7") ++ } ++ }, ++ { ++ .callback = dmi_io_delay_0xed_port, ++ .ident = "HP Pavilion dv9000z", ++ .matches = { ++ DMI_MATCH(DMI_BOARD_VENDOR, "Quanta"), ++ DMI_MATCH(DMI_BOARD_NAME, "30B9") ++ } ++ }, ++ { ++ .callback = dmi_io_delay_0xed_port, ++ .ident = "HP Pavilion tx1000", ++ .matches = { ++ DMI_MATCH(DMI_BOARD_VENDOR, "Quanta"), ++ DMI_MATCH(DMI_BOARD_NAME, "30BF") ++ } ++ }, ++ { } ++}; ++ ++void __init io_delay_init(void) ++{ ++ if (!io_delay_override) ++ dmi_check_system(io_delay_0xed_port_dmi_table); ++} ++ ++static int __init io_delay_param(char *s) ++{ ++ if (!strcmp(s, "0x80")) ++ io_delay_type = CONFIG_IO_DELAY_TYPE_0X80; ++ else if (!strcmp(s, "0xed")) ++ io_delay_type = CONFIG_IO_DELAY_TYPE_0XED; ++ else if (!strcmp(s, "udelay")) ++ io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY; ++ else if (!strcmp(s, "none")) ++ io_delay_type = CONFIG_IO_DELAY_TYPE_NONE; ++ else ++ return -EINVAL; ++ ++ io_delay_override = 1; ++ return 0; ++} ++ ++early_param("io_delay", io_delay_param); +diff --git a/arch/x86/kernel/ioport.c b/arch/x86/kernel/ioport.c +new file mode 100644 +index 0000000..50e5e4a +--- /dev/null ++++ b/arch/x86/kernel/ioport.c +@@ -0,0 +1,154 @@ ++/* ++ * This contains the io-permission bitmap code - written by obz, with changes ++ * by Linus. 32/64 bits code unification by Miguel Botón. ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++/* Set EXTENT bits starting at BASE in BITMAP to value TURN_ON. */ ++static void set_bitmap(unsigned long *bitmap, unsigned int base, ++ unsigned int extent, int new_value) ++{ ++ unsigned int i; ++ ++ for (i = base; i < base + extent; i++) { ++ if (new_value) ++ __set_bit(i, bitmap); ++ else ++ __clear_bit(i, bitmap); ++ } ++} ++ ++/* ++ * this changes the io permissions bitmap in the current task. 
++ */ ++asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int turn_on) ++{ ++ struct thread_struct * t = ¤t->thread; ++ struct tss_struct * tss; ++ unsigned int i, max_long, bytes, bytes_updated; ++ ++ if ((from + num <= from) || (from + num > IO_BITMAP_BITS)) ++ return -EINVAL; ++ if (turn_on && !capable(CAP_SYS_RAWIO)) ++ return -EPERM; ++ ++ /* ++ * If it's the first ioperm() call in this thread's lifetime, set the ++ * IO bitmap up. ioperm() is much less timing critical than clone(), ++ * this is why we delay this operation until now: ++ */ ++ if (!t->io_bitmap_ptr) { ++ unsigned long *bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL); ++ ++ if (!bitmap) ++ return -ENOMEM; ++ ++ memset(bitmap, 0xff, IO_BITMAP_BYTES); ++ t->io_bitmap_ptr = bitmap; ++ set_thread_flag(TIF_IO_BITMAP); ++ } ++ ++ /* ++ * do it in the per-thread copy and in the TSS ... ++ * ++ * Disable preemption via get_cpu() - we must not switch away ++ * because the ->io_bitmap_max value must match the bitmap ++ * contents: ++ */ ++ tss = &per_cpu(init_tss, get_cpu()); ++ ++ set_bitmap(t->io_bitmap_ptr, from, num, !turn_on); ++ ++ /* ++ * Search for a (possibly new) maximum. This is simple and stupid, ++ * to keep it obviously correct: ++ */ ++ max_long = 0; ++ for (i = 0; i < IO_BITMAP_LONGS; i++) ++ if (t->io_bitmap_ptr[i] != ~0UL) ++ max_long = i; ++ ++ bytes = (max_long + 1) * sizeof(unsigned long); ++ bytes_updated = max(bytes, t->io_bitmap_max); ++ ++ t->io_bitmap_max = bytes; ++ ++#ifdef CONFIG_X86_32 ++ /* ++ * Sets the lazy trigger so that the next I/O operation will ++ * reload the correct bitmap. ++ * Reset the owner so that a process switch will not set ++ * tss->io_bitmap_base to IO_BITMAP_OFFSET. ++ */ ++ tss->x86_tss.io_bitmap_base = INVALID_IO_BITMAP_OFFSET_LAZY; ++ tss->io_bitmap_owner = NULL; ++#else ++ /* Update the TSS: */ ++ memcpy(tss->io_bitmap, t->io_bitmap_ptr, bytes_updated); ++#endif ++ ++ put_cpu(); ++ ++ return 0; ++} ++ ++/* ++ * sys_iopl has to be used when you want to access the IO ports ++ * beyond the 0x3ff range: to get the full 65536 ports bitmapped ++ * you'd need 8kB of bitmaps/process, which is a bit excessive. ++ * ++ * Here we just change the flags value on the stack: we allow ++ * only the super-user to do it. This depends on the stack-layout ++ * on system-call entry - see also fork() and the signal handling ++ * code. ++ */ ++static int do_iopl(unsigned int level, struct pt_regs *regs) ++{ ++ unsigned int old = (regs->flags >> 12) & 3; ++ ++ if (level > 3) ++ return -EINVAL; ++ /* Trying to gain more privileges? */ ++ if (level > old) { ++ if (!capable(CAP_SYS_RAWIO)) ++ return -EPERM; ++ } ++ regs->flags = (regs->flags & ~X86_EFLAGS_IOPL) | (level << 12); ++ ++ return 0; ++} ++ ++#ifdef CONFIG_X86_32 ++asmlinkage long sys_iopl(unsigned long regsp) ++{ ++ struct pt_regs *regs = (struct pt_regs *)®sp; ++ unsigned int level = regs->bx; ++ struct thread_struct *t = ¤t->thread; ++ int rc; ++ ++ rc = do_iopl(level, regs); ++ if (rc < 0) ++ goto out; ++ ++ t->iopl = level << 12; ++ set_iopl_mask(t->iopl); ++out: ++ return rc; ++} ++#else ++asmlinkage long sys_iopl(unsigned int level, struct pt_regs *regs) ++{ ++ return do_iopl(level, regs); ++} ++#endif +diff --git a/arch/x86/kernel/ioport_32.c b/arch/x86/kernel/ioport_32.c +deleted file mode 100644 +index 4ed48dc..0000000 +--- a/arch/x86/kernel/ioport_32.c ++++ /dev/null +@@ -1,151 +0,0 @@ +-/* +- * This contains the io-permission bitmap code - written by obz, with changes +- * by Linus. 
+- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-/* Set EXTENT bits starting at BASE in BITMAP to value TURN_ON. */ +-static void set_bitmap(unsigned long *bitmap, unsigned int base, unsigned int extent, int new_value) +-{ +- unsigned long mask; +- unsigned long *bitmap_base = bitmap + (base / BITS_PER_LONG); +- unsigned int low_index = base & (BITS_PER_LONG-1); +- int length = low_index + extent; +- +- if (low_index != 0) { +- mask = (~0UL << low_index); +- if (length < BITS_PER_LONG) +- mask &= ~(~0UL << length); +- if (new_value) +- *bitmap_base++ |= mask; +- else +- *bitmap_base++ &= ~mask; +- length -= BITS_PER_LONG; +- } +- +- mask = (new_value ? ~0UL : 0UL); +- while (length >= BITS_PER_LONG) { +- *bitmap_base++ = mask; +- length -= BITS_PER_LONG; +- } +- +- if (length > 0) { +- mask = ~(~0UL << length); +- if (new_value) +- *bitmap_base++ |= mask; +- else +- *bitmap_base++ &= ~mask; +- } +-} +- +- +-/* +- * this changes the io permissions bitmap in the current task. +- */ +-asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int turn_on) +-{ +- unsigned long i, max_long, bytes, bytes_updated; +- struct thread_struct * t = ¤t->thread; +- struct tss_struct * tss; +- unsigned long *bitmap; +- +- if ((from + num <= from) || (from + num > IO_BITMAP_BITS)) +- return -EINVAL; +- if (turn_on && !capable(CAP_SYS_RAWIO)) +- return -EPERM; +- +- /* +- * If it's the first ioperm() call in this thread's lifetime, set the +- * IO bitmap up. ioperm() is much less timing critical than clone(), +- * this is why we delay this operation until now: +- */ +- if (!t->io_bitmap_ptr) { +- bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL); +- if (!bitmap) +- return -ENOMEM; +- +- memset(bitmap, 0xff, IO_BITMAP_BYTES); +- t->io_bitmap_ptr = bitmap; +- set_thread_flag(TIF_IO_BITMAP); +- } +- +- /* +- * do it in the per-thread copy and in the TSS ... +- * +- * Disable preemption via get_cpu() - we must not switch away +- * because the ->io_bitmap_max value must match the bitmap +- * contents: +- */ +- tss = &per_cpu(init_tss, get_cpu()); +- +- set_bitmap(t->io_bitmap_ptr, from, num, !turn_on); +- +- /* +- * Search for a (possibly new) maximum. This is simple and stupid, +- * to keep it obviously correct: +- */ +- max_long = 0; +- for (i = 0; i < IO_BITMAP_LONGS; i++) +- if (t->io_bitmap_ptr[i] != ~0UL) +- max_long = i; +- +- bytes = (max_long + 1) * sizeof(long); +- bytes_updated = max(bytes, t->io_bitmap_max); +- +- t->io_bitmap_max = bytes; +- +- /* +- * Sets the lazy trigger so that the next I/O operation will +- * reload the correct bitmap. +- * Reset the owner so that a process switch will not set +- * tss->io_bitmap_base to IO_BITMAP_OFFSET. +- */ +- tss->x86_tss.io_bitmap_base = INVALID_IO_BITMAP_OFFSET_LAZY; +- tss->io_bitmap_owner = NULL; +- +- put_cpu(); +- +- return 0; +-} +- +-/* +- * sys_iopl has to be used when you want to access the IO ports +- * beyond the 0x3ff range: to get the full 65536 ports bitmapped +- * you'd need 8kB of bitmaps/process, which is a bit excessive. +- * +- * Here we just change the eflags value on the stack: we allow +- * only the super-user to do it. This depends on the stack-layout +- * on system-call entry - see also fork() and the signal handling +- * code. 
+- */ +- +-asmlinkage long sys_iopl(unsigned long unused) +-{ +- volatile struct pt_regs * regs = (struct pt_regs *) &unused; +- unsigned int level = regs->ebx; +- unsigned int old = (regs->eflags >> 12) & 3; +- struct thread_struct *t = ¤t->thread; +- +- if (level > 3) +- return -EINVAL; +- /* Trying to gain more privileges? */ +- if (level > old) { +- if (!capable(CAP_SYS_RAWIO)) +- return -EPERM; +- } +- t->iopl = level << 12; +- regs->eflags = (regs->eflags & ~X86_EFLAGS_IOPL) | t->iopl; +- set_iopl_mask(t->iopl); +- return 0; +-} +diff --git a/arch/x86/kernel/ioport_64.c b/arch/x86/kernel/ioport_64.c +deleted file mode 100644 +index 5f62fad..0000000 +--- a/arch/x86/kernel/ioport_64.c ++++ /dev/null +@@ -1,117 +0,0 @@ +-/* +- * This contains the io-permission bitmap code - written by obz, with changes +- * by Linus. +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-/* Set EXTENT bits starting at BASE in BITMAP to value TURN_ON. */ +-static void set_bitmap(unsigned long *bitmap, unsigned int base, unsigned int extent, int new_value) +-{ +- int i; +- if (new_value) +- for (i = base; i < base + extent; i++) +- __set_bit(i, bitmap); +- else +- for (i = base; i < base + extent; i++) +- clear_bit(i, bitmap); +-} +- +-/* +- * this changes the io permissions bitmap in the current task. +- */ +-asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int turn_on) +-{ +- unsigned int i, max_long, bytes, bytes_updated; +- struct thread_struct * t = ¤t->thread; +- struct tss_struct * tss; +- unsigned long *bitmap; +- +- if ((from + num <= from) || (from + num > IO_BITMAP_BITS)) +- return -EINVAL; +- if (turn_on && !capable(CAP_SYS_RAWIO)) +- return -EPERM; +- +- /* +- * If it's the first ioperm() call in this thread's lifetime, set the +- * IO bitmap up. ioperm() is much less timing critical than clone(), +- * this is why we delay this operation until now: +- */ +- if (!t->io_bitmap_ptr) { +- bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL); +- if (!bitmap) +- return -ENOMEM; +- +- memset(bitmap, 0xff, IO_BITMAP_BYTES); +- t->io_bitmap_ptr = bitmap; +- set_thread_flag(TIF_IO_BITMAP); +- } +- +- /* +- * do it in the per-thread copy and in the TSS ... +- * +- * Disable preemption via get_cpu() - we must not switch away +- * because the ->io_bitmap_max value must match the bitmap +- * contents: +- */ +- tss = &per_cpu(init_tss, get_cpu()); +- +- set_bitmap(t->io_bitmap_ptr, from, num, !turn_on); +- +- /* +- * Search for a (possibly new) maximum. This is simple and stupid, +- * to keep it obviously correct: +- */ +- max_long = 0; +- for (i = 0; i < IO_BITMAP_LONGS; i++) +- if (t->io_bitmap_ptr[i] != ~0UL) +- max_long = i; +- +- bytes = (max_long + 1) * sizeof(long); +- bytes_updated = max(bytes, t->io_bitmap_max); +- +- t->io_bitmap_max = bytes; +- +- /* Update the TSS: */ +- memcpy(tss->io_bitmap, t->io_bitmap_ptr, bytes_updated); +- +- put_cpu(); +- +- return 0; +-} +- +-/* +- * sys_iopl has to be used when you want to access the IO ports +- * beyond the 0x3ff range: to get the full 65536 ports bitmapped +- * you'd need 8kB of bitmaps/process, which is a bit excessive. +- * +- * Here we just change the eflags value on the stack: we allow +- * only the super-user to do it. This depends on the stack-layout +- * on system-call entry - see also fork() and the signal handling +- * code. 
+- */ +- +-asmlinkage long sys_iopl(unsigned int level, struct pt_regs *regs) +-{ +- unsigned int old = (regs->eflags >> 12) & 3; +- +- if (level > 3) +- return -EINVAL; +- /* Trying to gain more privileges? */ +- if (level > old) { +- if (!capable(CAP_SYS_RAWIO)) +- return -EPERM; +- } +- regs->eflags = (regs->eflags &~ X86_EFLAGS_IOPL) | (level << 12); +- return 0; +-} +diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c +index d3fde94..cef054b 100644 +--- a/arch/x86/kernel/irq_32.c ++++ b/arch/x86/kernel/irq_32.c +@@ -66,11 +66,11 @@ static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly; + * SMP cross-CPU interrupts have their own specific + * handlers). + */ +-fastcall unsigned int do_IRQ(struct pt_regs *regs) ++unsigned int do_IRQ(struct pt_regs *regs) + { + struct pt_regs *old_regs; + /* high bit used in ret_from_ code */ +- int irq = ~regs->orig_eax; ++ int irq = ~regs->orig_ax; + struct irq_desc *desc = irq_desc + irq; + #ifdef CONFIG_4KSTACKS + union irq_ctx *curctx, *irqctx; +@@ -88,13 +88,13 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) + #ifdef CONFIG_DEBUG_STACKOVERFLOW + /* Debugging check for stack overflow: is there less than 1KB free? */ + { +- long esp; ++ long sp; + + __asm__ __volatile__("andl %%esp,%0" : +- "=r" (esp) : "0" (THREAD_SIZE - 1)); +- if (unlikely(esp < (sizeof(struct thread_info) + STACK_WARN))) { ++ "=r" (sp) : "0" (THREAD_SIZE - 1)); ++ if (unlikely(sp < (sizeof(struct thread_info) + STACK_WARN))) { + printk("do_IRQ: stack overflow: %ld\n", +- esp - sizeof(struct thread_info)); ++ sp - sizeof(struct thread_info)); + dump_stack(); + } + } +@@ -112,7 +112,7 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) + * current stack (which is the irq stack already after all) + */ + if (curctx != irqctx) { +- int arg1, arg2, ebx; ++ int arg1, arg2, bx; + + /* build the stack frame on the IRQ stack */ + isp = (u32*) ((char*)irqctx + sizeof(*irqctx)); +@@ -128,10 +128,10 @@ fastcall unsigned int do_IRQ(struct pt_regs *regs) + (curctx->tinfo.preempt_count & SOFTIRQ_MASK); + + asm volatile( +- " xchgl %%ebx,%%esp \n" +- " call *%%edi \n" +- " movl %%ebx,%%esp \n" +- : "=a" (arg1), "=d" (arg2), "=b" (ebx) ++ " xchgl %%ebx,%%esp \n" ++ " call *%%edi \n" ++ " movl %%ebx,%%esp \n" ++ : "=a" (arg1), "=d" (arg2), "=b" (bx) + : "0" (irq), "1" (desc), "2" (isp), + "D" (desc->handle_irq) + : "memory", "cc" +diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c +index 6b5c730..3aac154 100644 +--- a/arch/x86/kernel/irq_64.c ++++ b/arch/x86/kernel/irq_64.c +@@ -20,6 +20,26 @@ + + atomic_t irq_err_count; + ++/* ++ * 'what should we do if we get a hw irq event on an illegal vector'. ++ * each architecture has to answer this themselves. ++ */ ++void ack_bad_irq(unsigned int irq) ++{ ++ printk(KERN_WARNING "unexpected IRQ trap at vector %02x\n", irq); ++ /* ++ * Currently unexpected vectors happen only on SMP and APIC. ++ * We _must_ ack these because every local APIC has only N ++ * irq slots per priority level, and a 'hanging, unacked' IRQ ++ * holds up an irq slot - in excessive cases (when multiple ++ * unexpected vectors occur) that might lock up the APIC ++ * completely. ++ * But don't ack when the APIC is disabled. 
-AK ++ */ ++ if (!disable_apic) ++ ack_APIC_irq(); ++} ++ + #ifdef CONFIG_DEBUG_STACKOVERFLOW + /* + * Probabilistic stack overflow check: +@@ -33,11 +53,11 @@ static inline void stack_overflow_check(struct pt_regs *regs) + u64 curbase = (u64)task_stack_page(current); + static unsigned long warned = -60*HZ; + +- if (regs->rsp >= curbase && regs->rsp <= curbase + THREAD_SIZE && +- regs->rsp < curbase + sizeof(struct thread_info) + 128 && ++ if (regs->sp >= curbase && regs->sp <= curbase + THREAD_SIZE && ++ regs->sp < curbase + sizeof(struct thread_info) + 128 && + time_after(jiffies, warned + 60*HZ)) { +- printk("do_IRQ: %s near stack overflow (cur:%Lx,rsp:%lx)\n", +- current->comm, curbase, regs->rsp); ++ printk("do_IRQ: %s near stack overflow (cur:%Lx,sp:%lx)\n", ++ current->comm, curbase, regs->sp); + show_stack(NULL,NULL); + warned = jiffies; + } +@@ -142,7 +162,7 @@ asmlinkage unsigned int do_IRQ(struct pt_regs *regs) + struct pt_regs *old_regs = set_irq_regs(regs); + + /* high bit used in ret_from_ code */ +- unsigned vector = ~regs->orig_rax; ++ unsigned vector = ~regs->orig_ax; + unsigned irq; + + exit_idle(); +diff --git a/arch/x86/kernel/kdebugfs.c b/arch/x86/kernel/kdebugfs.c +new file mode 100644 +index 0000000..7335430 +--- /dev/null ++++ b/arch/x86/kernel/kdebugfs.c +@@ -0,0 +1,65 @@ ++/* ++ * Architecture specific debugfs files ++ * ++ * Copyright (C) 2007, Intel Corp. ++ * Huang Ying ++ * ++ * This file is released under the GPLv2. ++ */ ++ ++#include ++#include ++#include ++ ++#include ++ ++#ifdef CONFIG_DEBUG_BOOT_PARAMS ++static struct debugfs_blob_wrapper boot_params_blob = { ++ .data = &boot_params, ++ .size = sizeof(boot_params), ++}; ++ ++static int __init boot_params_kdebugfs_init(void) ++{ ++ int error; ++ struct dentry *dbp, *version, *data; ++ ++ dbp = debugfs_create_dir("boot_params", NULL); ++ if (!dbp) { ++ error = -ENOMEM; ++ goto err_return; ++ } ++ version = debugfs_create_x16("version", S_IRUGO, dbp, ++ &boot_params.hdr.version); ++ if (!version) { ++ error = -ENOMEM; ++ goto err_dir; ++ } ++ data = debugfs_create_blob("data", S_IRUGO, dbp, ++ &boot_params_blob); ++ if (!data) { ++ error = -ENOMEM; ++ goto err_version; ++ } ++ return 0; ++err_version: ++ debugfs_remove(version); ++err_dir: ++ debugfs_remove(dbp); ++err_return: ++ return error; ++} ++#endif ++ ++static int __init arch_kdebugfs_init(void) ++{ ++ int error = 0; ++ ++#ifdef CONFIG_DEBUG_BOOT_PARAMS ++ error = boot_params_kdebugfs_init(); ++#endif ++ ++ return error; ++} ++ ++arch_initcall(arch_kdebugfs_init); +diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c +new file mode 100644 +index 0000000..a99e764 +--- /dev/null ++++ b/arch/x86/kernel/kprobes.c +@@ -0,0 +1,1066 @@ ++/* ++ * Kernel Probes (KProbes) ++ * ++ * This program is free software; you can redistribute it and/or modify ++ * it under the terms of the GNU General Public License as published by ++ * the Free Software Foundation; either version 2 of the License, or ++ * (at your option) any later version. ++ * ++ * This program is distributed in the hope that it will be useful, ++ * but WITHOUT ANY WARRANTY; without even the implied warranty of ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++ * GNU General Public License for more details. ++ * ++ * You should have received a copy of the GNU General Public License ++ * along with this program; if not, write to the Free Software ++ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. 
++ * ++ * Copyright (C) IBM Corporation, 2002, 2004 ++ * ++ * 2002-Oct Created by Vamsi Krishna S Kernel ++ * Probes initial implementation ( includes contributions from ++ * Rusty Russell). ++ * 2004-July Suparna Bhattacharya added jumper probes ++ * interface to access function arguments. ++ * 2004-Oct Jim Keniston and Prasanna S Panchamukhi ++ * adapted for x86_64 from i386. ++ * 2005-Mar Roland McGrath ++ * Fixed to handle %rip-relative addressing mode correctly. ++ * 2005-May Hien Nguyen , Jim Keniston ++ * and Prasanna S Panchamukhi ++ * added function-return probes. ++ * 2005-May Rusty Lynch ++ * Added function return probes functionality ++ * 2006-Feb Masami Hiramatsu added ++ * kprobe-booster and kretprobe-booster for i386. ++ * 2007-Dec Masami Hiramatsu added kprobe-booster ++ * and kretprobe-booster for x86-64 ++ * 2007-Dec Masami Hiramatsu , Arjan van de Ven ++ * and Jim Keniston ++ * unified x86 kprobes code. ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++#include ++#include ++#include ++#include ++#include ++ ++void jprobe_return_end(void); ++ ++DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL; ++DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk); ++ ++#ifdef CONFIG_X86_64 ++#define stack_addr(regs) ((unsigned long *)regs->sp) ++#else ++/* ++ * "®s->sp" looks wrong, but it's correct for x86_32. x86_32 CPUs ++ * don't save the ss and esp registers if the CPU is already in kernel ++ * mode when it traps. So for kprobes, regs->sp and regs->ss are not ++ * the [nonexistent] saved stack pointer and ss register, but rather ++ * the top 8 bytes of the pre-int3 stack. So ®s->sp happens to ++ * point to the top of the pre-int3 stack. ++ */ ++#define stack_addr(regs) ((unsigned long *)®s->sp) ++#endif ++ ++#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\ ++ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \ ++ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \ ++ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \ ++ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \ ++ << (row % 32)) ++ /* ++ * Undefined/reserved opcodes, conditional jump, Opcode Extension ++ * Groups, and some special opcodes can not boost. 
++ */ ++static const u32 twobyte_is_boostable[256 / 32] = { ++ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ ++ /* ---------------------------------------------- */ ++ W(0x00, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0) | /* 00 */ ++ W(0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 10 */ ++ W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 20 */ ++ W(0x30, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */ ++ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */ ++ W(0x50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 50 */ ++ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1) | /* 60 */ ++ W(0x70, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */ ++ W(0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 80 */ ++ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */ ++ W(0xa0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) | /* a0 */ ++ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1) , /* b0 */ ++ W(0xc0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */ ++ W(0xd0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) , /* d0 */ ++ W(0xe0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) | /* e0 */ ++ W(0xf0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0) /* f0 */ ++ /* ----------------------------------------------- */ ++ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ ++}; ++static const u32 onebyte_has_modrm[256 / 32] = { ++ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ ++ /* ----------------------------------------------- */ ++ W(0x00, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 00 */ ++ W(0x10, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) , /* 10 */ ++ W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 20 */ ++ W(0x30, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) , /* 30 */ ++ W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */ ++ W(0x50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 50 */ ++ W(0x60, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0) | /* 60 */ ++ W(0x70, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 70 */ ++ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */ ++ W(0x90, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 90 */ ++ W(0xa0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* a0 */ ++ W(0xb0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* b0 */ ++ W(0xc0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0) | /* c0 */ ++ W(0xd0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */ ++ W(0xe0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* e0 */ ++ W(0xf0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) /* f0 */ ++ /* ----------------------------------------------- */ ++ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ ++}; ++static const u32 twobyte_has_modrm[256 / 32] = { ++ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ ++ /* ----------------------------------------------- */ ++ W(0x00, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1) | /* 0f */ ++ W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0) , /* 1f */ ++ W(0x20, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 2f */ ++ W(0x30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 3f */ ++ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 4f */ ++ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 5f */ ++ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 6f */ ++ W(0x70, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1) , /* 7f */ ++ W(0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 8f */ ++ W(0x90, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 9f */ ++ W(0xa0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1) | /* af */ ++ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1) , /* bf */ ++ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0) | /* cf */ ++ W(0xd0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* df */ ++ W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* ef */ ++ W(0xf0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0) /* ff */ ++ /* ----------------------------------------------- */ ++ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ ++}; ++#undef W ++ ++struct kretprobe_blackpoint kretprobe_blacklist[] = { ++ {"__switch_to", }, /* This function switches only current task, but ++ doesn't switch kernel stack.*/ ++ {NULL, NULL} /* Terminator */ ++}; ++const int kretprobe_blacklist_size = ARRAY_SIZE(kretprobe_blacklist); ++ ++/* Insert a jump instruction at address 'from', which jumps to address 'to'.*/ ++static void __kprobes set_jmp_op(void *from, void *to) ++{ ++ struct __arch_jmp_op { ++ char op; ++ s32 raddr; ++ } __attribute__((packed)) * jop; ++ jop = (struct __arch_jmp_op *)from; ++ jop->raddr = (s32)((long)(to) - ((long)(from) + 5)); ++ jop->op = RELATIVEJUMP_INSTRUCTION; ++} ++ ++/* ++ * Check for the REX prefix which can only exist on X86_64 ++ * X86_32 always returns 0 ++ */ ++static int __kprobes is_REX_prefix(kprobe_opcode_t *insn) ++{ ++#ifdef CONFIG_X86_64 ++ if ((*insn & 0xf0) == 0x40) ++ return 1; ++#endif ++ return 0; ++} ++ ++/* ++ * Returns non-zero if opcode is boostable. ++ * RIP relative instructions are adjusted at copying time in 64 bits mode ++ */ ++static int __kprobes can_boost(kprobe_opcode_t *opcodes) ++{ ++ kprobe_opcode_t opcode; ++ kprobe_opcode_t *orig_opcodes = opcodes; ++ ++retry: ++ if (opcodes - orig_opcodes > MAX_INSN_SIZE - 1) ++ return 0; ++ opcode = *(opcodes++); ++ ++ /* 2nd-byte opcode */ ++ if (opcode == 0x0f) { ++ if (opcodes - orig_opcodes > MAX_INSN_SIZE - 1) ++ return 0; ++ return test_bit(*opcodes, ++ (unsigned long *)twobyte_is_boostable); ++ } ++ ++ switch (opcode & 0xf0) { ++#ifdef CONFIG_X86_64 ++ case 0x40: ++ goto retry; /* REX prefix is boostable */ ++#endif ++ case 0x60: ++ if (0x63 < opcode && opcode < 0x67) ++ goto retry; /* prefixes */ ++ /* can't boost Address-size override and bound */ ++ return (opcode != 0x62 && opcode != 0x67); ++ case 0x70: ++ return 0; /* can't boost conditional jump */ ++ case 0xc0: ++ /* can't boost software-interruptions */ ++ return (0xc1 < opcode && opcode < 0xcc) || opcode == 0xcf; ++ case 0xd0: ++ /* can boost AA* and XLAT */ ++ return (opcode == 0xd4 || opcode == 0xd5 || opcode == 0xd7); ++ case 0xe0: ++ /* can boost in/out and absolute jmps */ ++ return ((opcode & 0x04) || opcode == 0xea); ++ case 0xf0: ++ if ((opcode & 0x0c) == 0 && opcode != 0xf1) ++ goto retry; /* lock/rep(ne) prefix */ ++ /* clear and set flags are boostable */ ++ return (opcode == 0xf5 || (0xf7 < opcode && opcode < 0xfe)); ++ default: ++ /* segment override prefixes are boostable */ ++ if (opcode == 0x26 || opcode == 0x36 || opcode == 0x3e) ++ goto retry; /* prefixes */ ++ /* CS override prefix and call are not boostable */ ++ return (opcode != 0x2e && opcode != 0x9a); ++ } ++} ++ ++/* ++ * Returns non-zero if opcode modifies the interrupt flag. 
++ */ ++static int __kprobes is_IF_modifier(kprobe_opcode_t *insn) ++{ ++ switch (*insn) { ++ case 0xfa: /* cli */ ++ case 0xfb: /* sti */ ++ case 0xcf: /* iret/iretd */ ++ case 0x9d: /* popf/popfd */ ++ return 1; ++ } ++ ++ /* ++ * on X86_64, 0x40-0x4f are REX prefixes so we need to look ++ * at the next byte instead.. but of course not recurse infinitely ++ */ ++ if (is_REX_prefix(insn)) ++ return is_IF_modifier(++insn); ++ ++ return 0; ++} ++ ++/* ++ * Adjust the displacement if the instruction uses the %rip-relative ++ * addressing mode. ++ * If it does, Return the address of the 32-bit displacement word. ++ * If not, return null. ++ * Only applicable to 64-bit x86. ++ */ ++static void __kprobes fix_riprel(struct kprobe *p) ++{ ++#ifdef CONFIG_X86_64 ++ u8 *insn = p->ainsn.insn; ++ s64 disp; ++ int need_modrm; ++ ++ /* Skip legacy instruction prefixes. */ ++ while (1) { ++ switch (*insn) { ++ case 0x66: ++ case 0x67: ++ case 0x2e: ++ case 0x3e: ++ case 0x26: ++ case 0x64: ++ case 0x65: ++ case 0x36: ++ case 0xf0: ++ case 0xf3: ++ case 0xf2: ++ ++insn; ++ continue; ++ } ++ break; ++ } ++ ++ /* Skip REX instruction prefix. */ ++ if (is_REX_prefix(insn)) ++ ++insn; ++ ++ if (*insn == 0x0f) { ++ /* Two-byte opcode. */ ++ ++insn; ++ need_modrm = test_bit(*insn, ++ (unsigned long *)twobyte_has_modrm); ++ } else ++ /* One-byte opcode. */ ++ need_modrm = test_bit(*insn, ++ (unsigned long *)onebyte_has_modrm); ++ ++ if (need_modrm) { ++ u8 modrm = *++insn; ++ if ((modrm & 0xc7) == 0x05) { ++ /* %rip+disp32 addressing mode */ ++ /* Displacement follows ModRM byte. */ ++ ++insn; ++ /* ++ * The copied instruction uses the %rip-relative ++ * addressing mode. Adjust the displacement for the ++ * difference between the original location of this ++ * instruction and the location of the copy that will ++ * actually be run. The tricky bit here is making sure ++ * that the sign extension happens correctly in this ++ * calculation, since we need a signed 32-bit result to ++ * be sign-extended to 64 bits when it's added to the ++ * %rip value and yield the same 64-bit result that the ++ * sign-extension of the original signed 32-bit ++ * displacement would have given. ++ */ ++ disp = (u8 *) p->addr + *((s32 *) insn) - ++ (u8 *) p->ainsn.insn; ++ BUG_ON((s64) (s32) disp != disp); /* Sanity check. */ ++ *(s32 *)insn = (s32) disp; ++ } ++ } ++#endif ++} ++ ++static void __kprobes arch_copy_kprobe(struct kprobe *p) ++{ ++ memcpy(p->ainsn.insn, p->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t)); ++ ++ fix_riprel(p); ++ ++ if (can_boost(p->addr)) ++ p->ainsn.boostable = 0; ++ else ++ p->ainsn.boostable = -1; ++ ++ p->opcode = *p->addr; ++} ++ ++int __kprobes arch_prepare_kprobe(struct kprobe *p) ++{ ++ /* insn: must be on special executable page on x86. 
*/ ++ p->ainsn.insn = get_insn_slot(); ++ if (!p->ainsn.insn) ++ return -ENOMEM; ++ arch_copy_kprobe(p); ++ return 0; ++} ++ ++void __kprobes arch_arm_kprobe(struct kprobe *p) ++{ ++ text_poke(p->addr, ((unsigned char []){BREAKPOINT_INSTRUCTION}), 1); ++} ++ ++void __kprobes arch_disarm_kprobe(struct kprobe *p) ++{ ++ text_poke(p->addr, &p->opcode, 1); ++} ++ ++void __kprobes arch_remove_kprobe(struct kprobe *p) ++{ ++ mutex_lock(&kprobe_mutex); ++ free_insn_slot(p->ainsn.insn, (p->ainsn.boostable == 1)); ++ mutex_unlock(&kprobe_mutex); ++} ++ ++static void __kprobes save_previous_kprobe(struct kprobe_ctlblk *kcb) ++{ ++ kcb->prev_kprobe.kp = kprobe_running(); ++ kcb->prev_kprobe.status = kcb->kprobe_status; ++ kcb->prev_kprobe.old_flags = kcb->kprobe_old_flags; ++ kcb->prev_kprobe.saved_flags = kcb->kprobe_saved_flags; ++} ++ ++static void __kprobes restore_previous_kprobe(struct kprobe_ctlblk *kcb) ++{ ++ __get_cpu_var(current_kprobe) = kcb->prev_kprobe.kp; ++ kcb->kprobe_status = kcb->prev_kprobe.status; ++ kcb->kprobe_old_flags = kcb->prev_kprobe.old_flags; ++ kcb->kprobe_saved_flags = kcb->prev_kprobe.saved_flags; ++} ++ ++static void __kprobes set_current_kprobe(struct kprobe *p, struct pt_regs *regs, ++ struct kprobe_ctlblk *kcb) ++{ ++ __get_cpu_var(current_kprobe) = p; ++ kcb->kprobe_saved_flags = kcb->kprobe_old_flags ++ = (regs->flags & (X86_EFLAGS_TF | X86_EFLAGS_IF)); ++ if (is_IF_modifier(p->ainsn.insn)) ++ kcb->kprobe_saved_flags &= ~X86_EFLAGS_IF; ++} ++ ++static void __kprobes clear_btf(void) ++{ ++ if (test_thread_flag(TIF_DEBUGCTLMSR)) ++ wrmsrl(MSR_IA32_DEBUGCTLMSR, 0); ++} ++ ++static void __kprobes restore_btf(void) ++{ ++ if (test_thread_flag(TIF_DEBUGCTLMSR)) ++ wrmsrl(MSR_IA32_DEBUGCTLMSR, current->thread.debugctlmsr); ++} ++ ++static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs *regs) ++{ ++ clear_btf(); ++ regs->flags |= X86_EFLAGS_TF; ++ regs->flags &= ~X86_EFLAGS_IF; ++ /* single step inline if the instruction is an int3 */ ++ if (p->opcode == BREAKPOINT_INSTRUCTION) ++ regs->ip = (unsigned long)p->addr; ++ else ++ regs->ip = (unsigned long)p->ainsn.insn; ++} ++ ++/* Called with kretprobe_lock held */ ++void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri, ++ struct pt_regs *regs) ++{ ++ unsigned long *sara = stack_addr(regs); ++ ++ ri->ret_addr = (kprobe_opcode_t *) *sara; ++ ++ /* Replace the return addr with trampoline addr */ ++ *sara = (unsigned long) &kretprobe_trampoline; ++} ++ ++static void __kprobes setup_singlestep(struct kprobe *p, struct pt_regs *regs, ++ struct kprobe_ctlblk *kcb) ++{ ++#if !defined(CONFIG_PREEMPT) || defined(CONFIG_PM) ++ if (p->ainsn.boostable == 1 && !p->post_handler) { ++ /* Boost up -- we can execute copied instructions directly */ ++ reset_current_kprobe(); ++ regs->ip = (unsigned long)p->ainsn.insn; ++ preempt_enable_no_resched(); ++ return; ++ } ++#endif ++ prepare_singlestep(p, regs); ++ kcb->kprobe_status = KPROBE_HIT_SS; ++} ++ ++/* ++ * We have reentered the kprobe_handler(), since another probe was hit while ++ * within the handler. We save the original kprobes variables and just single ++ * step on the instruction of the new probe without calling any user handlers. 
++ */ ++static int __kprobes reenter_kprobe(struct kprobe *p, struct pt_regs *regs, ++ struct kprobe_ctlblk *kcb) ++{ ++ switch (kcb->kprobe_status) { ++ case KPROBE_HIT_SSDONE: ++#ifdef CONFIG_X86_64 ++ /* TODO: Provide re-entrancy from post_kprobes_handler() and ++ * avoid exception stack corruption while single-stepping on ++ * the instruction of the new probe. ++ */ ++ arch_disarm_kprobe(p); ++ regs->ip = (unsigned long)p->addr; ++ reset_current_kprobe(); ++ preempt_enable_no_resched(); ++ break; ++#endif ++ case KPROBE_HIT_ACTIVE: ++ save_previous_kprobe(kcb); ++ set_current_kprobe(p, regs, kcb); ++ kprobes_inc_nmissed_count(p); ++ prepare_singlestep(p, regs); ++ kcb->kprobe_status = KPROBE_REENTER; ++ break; ++ case KPROBE_HIT_SS: ++ if (p == kprobe_running()) { ++ regs->flags &= ~TF_MASK; ++ regs->flags |= kcb->kprobe_saved_flags; ++ return 0; ++ } else { ++ /* A probe has been hit in the codepath leading up ++ * to, or just after, single-stepping of a probed ++ * instruction. This entire codepath should strictly ++ * reside in .kprobes.text section. Raise a warning ++ * to highlight this peculiar case. ++ */ ++ } ++ default: ++ /* impossible cases */ ++ WARN_ON(1); ++ return 0; ++ } ++ ++ return 1; ++} ++ ++/* ++ * Interrupts are disabled on entry as trap3 is an interrupt gate and they ++ * remain disabled thorough out this function. ++ */ ++static int __kprobes kprobe_handler(struct pt_regs *regs) ++{ ++ kprobe_opcode_t *addr; ++ struct kprobe *p; ++ struct kprobe_ctlblk *kcb; ++ ++ addr = (kprobe_opcode_t *)(regs->ip - sizeof(kprobe_opcode_t)); ++ if (*addr != BREAKPOINT_INSTRUCTION) { ++ /* ++ * The breakpoint instruction was removed right ++ * after we hit it. Another cpu has removed ++ * either a probepoint or a debugger breakpoint ++ * at this address. In either case, no further ++ * handling of this interrupt is appropriate. ++ * Back up over the (now missing) int3 and run ++ * the original instruction. ++ */ ++ regs->ip = (unsigned long)addr; ++ return 1; ++ } ++ ++ /* ++ * We don't want to be preempted for the entire ++ * duration of kprobe processing. We conditionally ++ * re-enable preemption at the end of this function, ++ * and also in reenter_kprobe() and setup_singlestep(). ++ */ ++ preempt_disable(); ++ ++ kcb = get_kprobe_ctlblk(); ++ p = get_kprobe(addr); ++ ++ if (p) { ++ if (kprobe_running()) { ++ if (reenter_kprobe(p, regs, kcb)) ++ return 1; ++ } else { ++ set_current_kprobe(p, regs, kcb); ++ kcb->kprobe_status = KPROBE_HIT_ACTIVE; ++ ++ /* ++ * If we have no pre-handler or it returned 0, we ++ * continue with normal processing. If we have a ++ * pre-handler and it returned non-zero, it prepped ++ * for calling the break_handler below on re-entry ++ * for jprobe processing, so get out doing nothing ++ * more here. ++ */ ++ if (!p->pre_handler || !p->pre_handler(p, regs)) ++ setup_singlestep(p, regs, kcb); ++ return 1; ++ } ++ } else if (kprobe_running()) { ++ p = __get_cpu_var(current_kprobe); ++ if (p->break_handler && p->break_handler(p, regs)) { ++ setup_singlestep(p, regs, kcb); ++ return 1; ++ } ++ } /* else: not a kprobe fault; let the kernel handle it */ ++ ++ preempt_enable_no_resched(); ++ return 0; ++} ++ ++/* ++ * When a retprobed function returns, this code saves registers and ++ * calls trampoline_handler() runs, which calls the kretprobe's handler. 
++ */ ++void __kprobes kretprobe_trampoline_holder(void) ++{ ++ asm volatile ( ++ ".global kretprobe_trampoline\n" ++ "kretprobe_trampoline: \n" ++#ifdef CONFIG_X86_64 ++ /* We don't bother saving the ss register */ ++ " pushq %rsp\n" ++ " pushfq\n" ++ /* ++ * Skip cs, ip, orig_ax. ++ * trampoline_handler() will plug in these values ++ */ ++ " subq $24, %rsp\n" ++ " pushq %rdi\n" ++ " pushq %rsi\n" ++ " pushq %rdx\n" ++ " pushq %rcx\n" ++ " pushq %rax\n" ++ " pushq %r8\n" ++ " pushq %r9\n" ++ " pushq %r10\n" ++ " pushq %r11\n" ++ " pushq %rbx\n" ++ " pushq %rbp\n" ++ " pushq %r12\n" ++ " pushq %r13\n" ++ " pushq %r14\n" ++ " pushq %r15\n" ++ " movq %rsp, %rdi\n" ++ " call trampoline_handler\n" ++ /* Replace saved sp with true return address. */ ++ " movq %rax, 152(%rsp)\n" ++ " popq %r15\n" ++ " popq %r14\n" ++ " popq %r13\n" ++ " popq %r12\n" ++ " popq %rbp\n" ++ " popq %rbx\n" ++ " popq %r11\n" ++ " popq %r10\n" ++ " popq %r9\n" ++ " popq %r8\n" ++ " popq %rax\n" ++ " popq %rcx\n" ++ " popq %rdx\n" ++ " popq %rsi\n" ++ " popq %rdi\n" ++ /* Skip orig_ax, ip, cs */ ++ " addq $24, %rsp\n" ++ " popfq\n" ++#else ++ " pushf\n" ++ /* ++ * Skip cs, ip, orig_ax. ++ * trampoline_handler() will plug in these values ++ */ ++ " subl $12, %esp\n" ++ " pushl %fs\n" ++ " pushl %ds\n" ++ " pushl %es\n" ++ " pushl %eax\n" ++ " pushl %ebp\n" ++ " pushl %edi\n" ++ " pushl %esi\n" ++ " pushl %edx\n" ++ " pushl %ecx\n" ++ " pushl %ebx\n" ++ " movl %esp, %eax\n" ++ " call trampoline_handler\n" ++ /* Move flags to cs */ ++ " movl 52(%esp), %edx\n" ++ " movl %edx, 48(%esp)\n" ++ /* Replace saved flags with true return address. */ ++ " movl %eax, 52(%esp)\n" ++ " popl %ebx\n" ++ " popl %ecx\n" ++ " popl %edx\n" ++ " popl %esi\n" ++ " popl %edi\n" ++ " popl %ebp\n" ++ " popl %eax\n" ++ /* Skip ip, orig_ax, es, ds, fs */ ++ " addl $20, %esp\n" ++ " popf\n" ++#endif ++ " ret\n"); ++} ++ ++/* ++ * Called from kretprobe_trampoline ++ */ ++void * __kprobes trampoline_handler(struct pt_regs *regs) ++{ ++ struct kretprobe_instance *ri = NULL; ++ struct hlist_head *head, empty_rp; ++ struct hlist_node *node, *tmp; ++ unsigned long flags, orig_ret_address = 0; ++ unsigned long trampoline_address = (unsigned long)&kretprobe_trampoline; ++ ++ INIT_HLIST_HEAD(&empty_rp); ++ spin_lock_irqsave(&kretprobe_lock, flags); ++ head = kretprobe_inst_table_head(current); ++ /* fixup registers */ ++#ifdef CONFIG_X86_64 ++ regs->cs = __KERNEL_CS; ++#else ++ regs->cs = __KERNEL_CS | get_kernel_rpl(); ++#endif ++ regs->ip = trampoline_address; ++ regs->orig_ax = ~0UL; ++ ++ /* ++ * It is possible to have multiple instances associated with a given ++ * task either because multiple functions in the call path have ++ * return probes installed on them, and/or more then one ++ * return probe was registered for a target function. ++ * ++ * We can handle this because: ++ * - instances are always pushed into the head of the list ++ * - when multiple return probes are registered for the same ++ * function, the (chronologically) first instance's ret_addr ++ * will be the real return address, and all the rest will ++ * point to kretprobe_trampoline. 
++ */ ++ hlist_for_each_entry_safe(ri, node, tmp, head, hlist) { ++ if (ri->task != current) ++ /* another task is sharing our hash bucket */ ++ continue; ++ ++ if (ri->rp && ri->rp->handler) { ++ __get_cpu_var(current_kprobe) = &ri->rp->kp; ++ get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE; ++ ri->rp->handler(ri, regs); ++ __get_cpu_var(current_kprobe) = NULL; ++ } ++ ++ orig_ret_address = (unsigned long)ri->ret_addr; ++ recycle_rp_inst(ri, &empty_rp); ++ ++ if (orig_ret_address != trampoline_address) ++ /* ++ * This is the real return address. Any other ++ * instances associated with this task are for ++ * other calls deeper on the call stack ++ */ ++ break; ++ } ++ ++ kretprobe_assert(ri, orig_ret_address, trampoline_address); ++ ++ spin_unlock_irqrestore(&kretprobe_lock, flags); ++ ++ hlist_for_each_entry_safe(ri, node, tmp, &empty_rp, hlist) { ++ hlist_del(&ri->hlist); ++ kfree(ri); ++ } ++ return (void *)orig_ret_address; ++} ++ ++/* ++ * Called after single-stepping. p->addr is the address of the ++ * instruction whose first byte has been replaced by the "int 3" ++ * instruction. To avoid the SMP problems that can occur when we ++ * temporarily put back the original opcode to single-step, we ++ * single-stepped a copy of the instruction. The address of this ++ * copy is p->ainsn.insn. ++ * ++ * This function prepares to return from the post-single-step ++ * interrupt. We have to fix up the stack as follows: ++ * ++ * 0) Except in the case of absolute or indirect jump or call instructions, ++ * the new ip is relative to the copied instruction. We need to make ++ * it relative to the original instruction. ++ * ++ * 1) If the single-stepped instruction was pushfl, then the TF and IF ++ * flags are set in the just-pushed flags, and may need to be cleared. ++ * ++ * 2) If the single-stepped instruction was a call, the return address ++ * that is atop the stack is the address following the copied instruction. ++ * We need to make it the address following the original instruction. ++ * ++ * If this is the first time we've single-stepped the instruction at ++ * this probepoint, and the instruction is boostable, boost it: add a ++ * jump instruction after the copied instruction, that jumps to the next ++ * instruction after the probepoint. ++ */ ++static void __kprobes resume_execution(struct kprobe *p, ++ struct pt_regs *regs, struct kprobe_ctlblk *kcb) ++{ ++ unsigned long *tos = stack_addr(regs); ++ unsigned long copy_ip = (unsigned long)p->ainsn.insn; ++ unsigned long orig_ip = (unsigned long)p->addr; ++ kprobe_opcode_t *insn = p->ainsn.insn; ++ ++ /*skip the REX prefix*/ ++ if (is_REX_prefix(insn)) ++ insn++; ++ ++ regs->flags &= ~X86_EFLAGS_TF; ++ switch (*insn) { ++ case 0x9c: /* pushfl */ ++ *tos &= ~(X86_EFLAGS_TF | X86_EFLAGS_IF); ++ *tos |= kcb->kprobe_old_flags; ++ break; ++ case 0xc2: /* iret/ret/lret */ ++ case 0xc3: ++ case 0xca: ++ case 0xcb: ++ case 0xcf: ++ case 0xea: /* jmp absolute -- ip is correct */ ++ /* ip is already adjusted, no more changes required */ ++ p->ainsn.boostable = 1; ++ goto no_change; ++ case 0xe8: /* call relative - Fix return addr */ ++ *tos = orig_ip + (*tos - copy_ip); ++ break; ++#ifdef CONFIG_X86_32 ++ case 0x9a: /* call absolute -- same as call absolute, indirect */ ++ *tos = orig_ip + (*tos - copy_ip); ++ goto no_change; ++#endif ++ case 0xff: ++ if ((insn[1] & 0x30) == 0x10) { ++ /* ++ * call absolute, indirect ++ * Fix return addr; ip is correct. 
++ * But this is not boostable ++ */ ++ *tos = orig_ip + (*tos - copy_ip); ++ goto no_change; ++ } else if (((insn[1] & 0x31) == 0x20) || ++ ((insn[1] & 0x31) == 0x21)) { ++ /* ++ * jmp near and far, absolute indirect ++ * ip is correct. And this is boostable ++ */ ++ p->ainsn.boostable = 1; ++ goto no_change; ++ } ++ default: ++ break; ++ } ++ ++ if (p->ainsn.boostable == 0) { ++ if ((regs->ip > copy_ip) && ++ (regs->ip - copy_ip) + 5 < MAX_INSN_SIZE) { ++ /* ++ * These instructions can be executed directly if it ++ * jumps back to correct address. ++ */ ++ set_jmp_op((void *)regs->ip, ++ (void *)orig_ip + (regs->ip - copy_ip)); ++ p->ainsn.boostable = 1; ++ } else { ++ p->ainsn.boostable = -1; ++ } ++ } ++ ++ regs->ip += orig_ip - copy_ip; ++ ++no_change: ++ restore_btf(); ++} ++ ++/* ++ * Interrupts are disabled on entry as trap1 is an interrupt gate and they ++ * remain disabled thoroughout this function. ++ */ ++static int __kprobes post_kprobe_handler(struct pt_regs *regs) ++{ ++ struct kprobe *cur = kprobe_running(); ++ struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); ++ ++ if (!cur) ++ return 0; ++ ++ if ((kcb->kprobe_status != KPROBE_REENTER) && cur->post_handler) { ++ kcb->kprobe_status = KPROBE_HIT_SSDONE; ++ cur->post_handler(cur, regs, 0); ++ } ++ ++ resume_execution(cur, regs, kcb); ++ regs->flags |= kcb->kprobe_saved_flags; ++ trace_hardirqs_fixup_flags(regs->flags); ++ ++ /* Restore back the original saved kprobes variables and continue. */ ++ if (kcb->kprobe_status == KPROBE_REENTER) { ++ restore_previous_kprobe(kcb); ++ goto out; ++ } ++ reset_current_kprobe(); ++out: ++ preempt_enable_no_resched(); ++ ++ /* ++ * if somebody else is singlestepping across a probe point, flags ++ * will have TF set, in which case, continue the remaining processing ++ * of do_debug, as if this is not a probe hit. ++ */ ++ if (regs->flags & X86_EFLAGS_TF) ++ return 0; ++ ++ return 1; ++} ++ ++int __kprobes kprobe_fault_handler(struct pt_regs *regs, int trapnr) ++{ ++ struct kprobe *cur = kprobe_running(); ++ struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); ++ ++ switch (kcb->kprobe_status) { ++ case KPROBE_HIT_SS: ++ case KPROBE_REENTER: ++ /* ++ * We are here because the instruction being single ++ * stepped caused a page fault. We reset the current ++ * kprobe and the ip points back to the probe address ++ * and allow the page fault handler to continue as a ++ * normal page fault. ++ */ ++ regs->ip = (unsigned long)cur->addr; ++ regs->flags |= kcb->kprobe_old_flags; ++ if (kcb->kprobe_status == KPROBE_REENTER) ++ restore_previous_kprobe(kcb); ++ else ++ reset_current_kprobe(); ++ preempt_enable_no_resched(); ++ break; ++ case KPROBE_HIT_ACTIVE: ++ case KPROBE_HIT_SSDONE: ++ /* ++ * We increment the nmissed count for accounting, ++ * we can also use npre/npostfault count for accounting ++ * these specific fault cases. ++ */ ++ kprobes_inc_nmissed_count(cur); ++ ++ /* ++ * We come here because instructions in the pre/post ++ * handler caused the page_fault, this could happen ++ * if handler tries to access user space by ++ * copy_from_user(), get_user() etc. Let the ++ * user-specified handler try to fix it first. ++ */ ++ if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr)) ++ return 1; ++ ++ /* ++ * In case the user-specified fault handler returned ++ * zero, try to fix up. ++ */ ++ if (fixup_exception(regs)) ++ return 1; ++ ++ /* ++ * fixup routine could not handle it, ++ * Let do_page_fault() fix it. 
++ */ ++ break; ++ default: ++ break; ++ } ++ return 0; ++} ++ ++/* ++ * Wrapper routine for handling exceptions. ++ */ ++int __kprobes kprobe_exceptions_notify(struct notifier_block *self, ++ unsigned long val, void *data) ++{ ++ struct die_args *args = data; ++ int ret = NOTIFY_DONE; ++ ++ if (args->regs && user_mode_vm(args->regs)) ++ return ret; ++ ++ switch (val) { ++ case DIE_INT3: ++ if (kprobe_handler(args->regs)) ++ ret = NOTIFY_STOP; ++ break; ++ case DIE_DEBUG: ++ if (post_kprobe_handler(args->regs)) ++ ret = NOTIFY_STOP; ++ break; ++ case DIE_GPF: ++ /* ++ * To be potentially processing a kprobe fault and to ++ * trust the result from kprobe_running(), we have ++ * be non-preemptible. ++ */ ++ if (!preemptible() && kprobe_running() && ++ kprobe_fault_handler(args->regs, args->trapnr)) ++ ret = NOTIFY_STOP; ++ break; ++ default: ++ break; ++ } ++ return ret; ++} ++ ++int __kprobes setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs) ++{ ++ struct jprobe *jp = container_of(p, struct jprobe, kp); ++ unsigned long addr; ++ struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); ++ ++ kcb->jprobe_saved_regs = *regs; ++ kcb->jprobe_saved_sp = stack_addr(regs); ++ addr = (unsigned long)(kcb->jprobe_saved_sp); ++ ++ /* ++ * As Linus pointed out, gcc assumes that the callee ++ * owns the argument space and could overwrite it, e.g. ++ * tailcall optimization. So, to be absolutely safe ++ * we also save and restore enough stack bytes to cover ++ * the argument area. ++ */ ++ memcpy(kcb->jprobes_stack, (kprobe_opcode_t *)addr, ++ MIN_STACK_SIZE(addr)); ++ regs->flags &= ~X86_EFLAGS_IF; ++ trace_hardirqs_off(); ++ regs->ip = (unsigned long)(jp->entry); ++ return 1; ++} ++ ++void __kprobes jprobe_return(void) ++{ ++ struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); ++ ++ asm volatile ( ++#ifdef CONFIG_X86_64 ++ " xchg %%rbx,%%rsp \n" ++#else ++ " xchgl %%ebx,%%esp \n" ++#endif ++ " int3 \n" ++ " .globl jprobe_return_end\n" ++ " jprobe_return_end: \n" ++ " nop \n"::"b" ++ (kcb->jprobe_saved_sp):"memory"); ++} ++ ++int __kprobes longjmp_break_handler(struct kprobe *p, struct pt_regs *regs) ++{ ++ struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); ++ u8 *addr = (u8 *) (regs->ip - 1); ++ struct jprobe *jp = container_of(p, struct jprobe, kp); ++ ++ if ((addr > (u8 *) jprobe_return) && ++ (addr < (u8 *) jprobe_return_end)) { ++ if (stack_addr(regs) != kcb->jprobe_saved_sp) { ++ struct pt_regs *saved_regs = &kcb->jprobe_saved_regs; ++ printk(KERN_ERR ++ "current sp %p does not match saved sp %p\n", ++ stack_addr(regs), kcb->jprobe_saved_sp); ++ printk(KERN_ERR "Saved registers for jprobe %p\n", jp); ++ show_registers(saved_regs); ++ printk(KERN_ERR "Current registers\n"); ++ show_registers(regs); ++ BUG(); ++ } ++ *regs = kcb->jprobe_saved_regs; ++ memcpy((kprobe_opcode_t *)(kcb->jprobe_saved_sp), ++ kcb->jprobes_stack, ++ MIN_STACK_SIZE(kcb->jprobe_saved_sp)); ++ preempt_enable_no_resched(); ++ return 1; ++ } ++ return 0; ++} ++ ++int __init arch_init_kprobes(void) ++{ ++ return 0; ++} ++ ++int __kprobes arch_trampoline_kprobe(struct kprobe *p) ++{ ++ return 0; ++} +diff --git a/arch/x86/kernel/kprobes_32.c b/arch/x86/kernel/kprobes_32.c +deleted file mode 100644 +index 3a020f7..0000000 +--- a/arch/x86/kernel/kprobes_32.c ++++ /dev/null +@@ -1,756 +0,0 @@ +-/* +- * Kernel Probes (KProbes) +- * +- * This program is free software; you can redistribute it and/or modify +- * it under the terms of the GNU General Public License as published by +- * the Free Software Foundation; either version 2 of the 
License, or +- * (at your option) any later version. +- * +- * This program is distributed in the hope that it will be useful, +- * but WITHOUT ANY WARRANTY; without even the implied warranty of +- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +- * GNU General Public License for more details. +- * +- * You should have received a copy of the GNU General Public License +- * along with this program; if not, write to the Free Software +- * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. +- * +- * Copyright (C) IBM Corporation, 2002, 2004 +- * +- * 2002-Oct Created by Vamsi Krishna S Kernel +- * Probes initial implementation ( includes contributions from +- * Rusty Russell). +- * 2004-July Suparna Bhattacharya added jumper probes +- * interface to access function arguments. +- * 2005-May Hien Nguyen , Jim Keniston +- * and Prasanna S Panchamukhi +- * added function-return probes. +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-void jprobe_return_end(void); +- +-DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL; +-DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk); +- +-struct kretprobe_blackpoint kretprobe_blacklist[] = { +- {"__switch_to", }, /* This function switches only current task, but +- doesn't switch kernel stack.*/ +- {NULL, NULL} /* Terminator */ +-}; +-const int kretprobe_blacklist_size = ARRAY_SIZE(kretprobe_blacklist); +- +-/* insert a jmp code */ +-static __always_inline void set_jmp_op(void *from, void *to) +-{ +- struct __arch_jmp_op { +- char op; +- long raddr; +- } __attribute__((packed)) *jop; +- jop = (struct __arch_jmp_op *)from; +- jop->raddr = (long)(to) - ((long)(from) + 5); +- jop->op = RELATIVEJUMP_INSTRUCTION; +-} +- +-/* +- * returns non-zero if opcodes can be boosted. +- */ +-static __always_inline int can_boost(kprobe_opcode_t *opcodes) +-{ +-#define W(row,b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,ba,bb,bc,bd,be,bf) \ +- (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \ +- (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \ +- (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \ +- (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \ +- << (row % 32)) +- /* +- * Undefined/reserved opcodes, conditional jump, Opcode Extension +- * Groups, and some special opcodes can not be boost. 
+- */ +- static const unsigned long twobyte_is_boostable[256 / 32] = { +- /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ +- /* ------------------------------- */ +- W(0x00, 0,0,1,1,0,0,1,0,1,1,0,0,0,0,0,0)| /* 00 */ +- W(0x10, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), /* 10 */ +- W(0x20, 1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0)| /* 20 */ +- W(0x30, 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0), /* 30 */ +- W(0x40, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)| /* 40 */ +- W(0x50, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), /* 50 */ +- W(0x60, 1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1)| /* 60 */ +- W(0x70, 0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,1), /* 70 */ +- W(0x80, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)| /* 80 */ +- W(0x90, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1), /* 90 */ +- W(0xa0, 1,1,0,1,1,1,0,0,1,1,0,1,1,1,0,1)| /* a0 */ +- W(0xb0, 1,1,1,1,1,1,1,1,0,0,0,1,1,1,1,1), /* b0 */ +- W(0xc0, 1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1)| /* c0 */ +- W(0xd0, 0,1,1,1,0,1,0,0,1,1,0,1,1,1,0,1), /* d0 */ +- W(0xe0, 0,1,1,0,0,1,0,0,1,1,0,1,1,1,0,1)| /* e0 */ +- W(0xf0, 0,1,1,1,0,1,0,0,1,1,1,0,1,1,1,0) /* f0 */ +- /* ------------------------------- */ +- /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ +- }; +-#undef W +- kprobe_opcode_t opcode; +- kprobe_opcode_t *orig_opcodes = opcodes; +-retry: +- if (opcodes - orig_opcodes > MAX_INSN_SIZE - 1) +- return 0; +- opcode = *(opcodes++); +- +- /* 2nd-byte opcode */ +- if (opcode == 0x0f) { +- if (opcodes - orig_opcodes > MAX_INSN_SIZE - 1) +- return 0; +- return test_bit(*opcodes, twobyte_is_boostable); +- } +- +- switch (opcode & 0xf0) { +- case 0x60: +- if (0x63 < opcode && opcode < 0x67) +- goto retry; /* prefixes */ +- /* can't boost Address-size override and bound */ +- return (opcode != 0x62 && opcode != 0x67); +- case 0x70: +- return 0; /* can't boost conditional jump */ +- case 0xc0: +- /* can't boost software-interruptions */ +- return (0xc1 < opcode && opcode < 0xcc) || opcode == 0xcf; +- case 0xd0: +- /* can boost AA* and XLAT */ +- return (opcode == 0xd4 || opcode == 0xd5 || opcode == 0xd7); +- case 0xe0: +- /* can boost in/out and absolute jmps */ +- return ((opcode & 0x04) || opcode == 0xea); +- case 0xf0: +- if ((opcode & 0x0c) == 0 && opcode != 0xf1) +- goto retry; /* lock/rep(ne) prefix */ +- /* clear and set flags can be boost */ +- return (opcode == 0xf5 || (0xf7 < opcode && opcode < 0xfe)); +- default: +- if (opcode == 0x26 || opcode == 0x36 || opcode == 0x3e) +- goto retry; /* prefixes */ +- /* can't boost CS override and call */ +- return (opcode != 0x2e && opcode != 0x9a); +- } +-} +- +-/* +- * returns non-zero if opcode modifies the interrupt flag. +- */ +-static int __kprobes is_IF_modifier(kprobe_opcode_t opcode) +-{ +- switch (opcode) { +- case 0xfa: /* cli */ +- case 0xfb: /* sti */ +- case 0xcf: /* iret/iretd */ +- case 0x9d: /* popf/popfd */ +- return 1; +- } +- return 0; +-} +- +-int __kprobes arch_prepare_kprobe(struct kprobe *p) +-{ +- /* insn: must be on special executable page on i386. 
*/ +- p->ainsn.insn = get_insn_slot(); +- if (!p->ainsn.insn) +- return -ENOMEM; +- +- memcpy(p->ainsn.insn, p->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t)); +- p->opcode = *p->addr; +- if (can_boost(p->addr)) { +- p->ainsn.boostable = 0; +- } else { +- p->ainsn.boostable = -1; +- } +- return 0; +-} +- +-void __kprobes arch_arm_kprobe(struct kprobe *p) +-{ +- text_poke(p->addr, ((unsigned char []){BREAKPOINT_INSTRUCTION}), 1); +-} +- +-void __kprobes arch_disarm_kprobe(struct kprobe *p) +-{ +- text_poke(p->addr, &p->opcode, 1); +-} +- +-void __kprobes arch_remove_kprobe(struct kprobe *p) +-{ +- mutex_lock(&kprobe_mutex); +- free_insn_slot(p->ainsn.insn, (p->ainsn.boostable == 1)); +- mutex_unlock(&kprobe_mutex); +-} +- +-static void __kprobes save_previous_kprobe(struct kprobe_ctlblk *kcb) +-{ +- kcb->prev_kprobe.kp = kprobe_running(); +- kcb->prev_kprobe.status = kcb->kprobe_status; +- kcb->prev_kprobe.old_eflags = kcb->kprobe_old_eflags; +- kcb->prev_kprobe.saved_eflags = kcb->kprobe_saved_eflags; +-} +- +-static void __kprobes restore_previous_kprobe(struct kprobe_ctlblk *kcb) +-{ +- __get_cpu_var(current_kprobe) = kcb->prev_kprobe.kp; +- kcb->kprobe_status = kcb->prev_kprobe.status; +- kcb->kprobe_old_eflags = kcb->prev_kprobe.old_eflags; +- kcb->kprobe_saved_eflags = kcb->prev_kprobe.saved_eflags; +-} +- +-static void __kprobes set_current_kprobe(struct kprobe *p, struct pt_regs *regs, +- struct kprobe_ctlblk *kcb) +-{ +- __get_cpu_var(current_kprobe) = p; +- kcb->kprobe_saved_eflags = kcb->kprobe_old_eflags +- = (regs->eflags & (TF_MASK | IF_MASK)); +- if (is_IF_modifier(p->opcode)) +- kcb->kprobe_saved_eflags &= ~IF_MASK; +-} +- +-static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs *regs) +-{ +- regs->eflags |= TF_MASK; +- regs->eflags &= ~IF_MASK; +- /*single step inline if the instruction is an int3*/ +- if (p->opcode == BREAKPOINT_INSTRUCTION) +- regs->eip = (unsigned long)p->addr; +- else +- regs->eip = (unsigned long)p->ainsn.insn; +-} +- +-/* Called with kretprobe_lock held */ +-void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri, +- struct pt_regs *regs) +-{ +- unsigned long *sara = (unsigned long *)®s->esp; +- +- ri->ret_addr = (kprobe_opcode_t *) *sara; +- +- /* Replace the return addr with trampoline addr */ +- *sara = (unsigned long) &kretprobe_trampoline; +-} +- +-/* +- * Interrupts are disabled on entry as trap3 is an interrupt gate and they +- * remain disabled thorough out this function. +- */ +-static int __kprobes kprobe_handler(struct pt_regs *regs) +-{ +- struct kprobe *p; +- int ret = 0; +- kprobe_opcode_t *addr; +- struct kprobe_ctlblk *kcb; +- +- addr = (kprobe_opcode_t *)(regs->eip - sizeof(kprobe_opcode_t)); +- +- /* +- * We don't want to be preempted for the entire +- * duration of kprobe processing +- */ +- preempt_disable(); +- kcb = get_kprobe_ctlblk(); +- +- /* Check we're not actually recursing */ +- if (kprobe_running()) { +- p = get_kprobe(addr); +- if (p) { +- if (kcb->kprobe_status == KPROBE_HIT_SS && +- *p->ainsn.insn == BREAKPOINT_INSTRUCTION) { +- regs->eflags &= ~TF_MASK; +- regs->eflags |= kcb->kprobe_saved_eflags; +- goto no_kprobe; +- } +- /* We have reentered the kprobe_handler(), since +- * another probe was hit while within the handler. +- * We here save the original kprobes variables and +- * just single step on the instruction of the new probe +- * without calling any user handlers. 
+- */ +- save_previous_kprobe(kcb); +- set_current_kprobe(p, regs, kcb); +- kprobes_inc_nmissed_count(p); +- prepare_singlestep(p, regs); +- kcb->kprobe_status = KPROBE_REENTER; +- return 1; +- } else { +- if (*addr != BREAKPOINT_INSTRUCTION) { +- /* The breakpoint instruction was removed by +- * another cpu right after we hit, no further +- * handling of this interrupt is appropriate +- */ +- regs->eip -= sizeof(kprobe_opcode_t); +- ret = 1; +- goto no_kprobe; +- } +- p = __get_cpu_var(current_kprobe); +- if (p->break_handler && p->break_handler(p, regs)) { +- goto ss_probe; +- } +- } +- goto no_kprobe; +- } +- +- p = get_kprobe(addr); +- if (!p) { +- if (*addr != BREAKPOINT_INSTRUCTION) { +- /* +- * The breakpoint instruction was removed right +- * after we hit it. Another cpu has removed +- * either a probepoint or a debugger breakpoint +- * at this address. In either case, no further +- * handling of this interrupt is appropriate. +- * Back up over the (now missing) int3 and run +- * the original instruction. +- */ +- regs->eip -= sizeof(kprobe_opcode_t); +- ret = 1; +- } +- /* Not one of ours: let kernel handle it */ +- goto no_kprobe; +- } +- +- set_current_kprobe(p, regs, kcb); +- kcb->kprobe_status = KPROBE_HIT_ACTIVE; +- +- if (p->pre_handler && p->pre_handler(p, regs)) +- /* handler has already set things up, so skip ss setup */ +- return 1; +- +-ss_probe: +-#if !defined(CONFIG_PREEMPT) || defined(CONFIG_PM) +- if (p->ainsn.boostable == 1 && !p->post_handler){ +- /* Boost up -- we can execute copied instructions directly */ +- reset_current_kprobe(); +- regs->eip = (unsigned long)p->ainsn.insn; +- preempt_enable_no_resched(); +- return 1; +- } +-#endif +- prepare_singlestep(p, regs); +- kcb->kprobe_status = KPROBE_HIT_SS; +- return 1; +- +-no_kprobe: +- preempt_enable_no_resched(); +- return ret; +-} +- +-/* +- * For function-return probes, init_kprobes() establishes a probepoint +- * here. When a retprobed function returns, this probe is hit and +- * trampoline_probe_handler() runs, calling the kretprobe's handler. 
+- */ +- void __kprobes kretprobe_trampoline_holder(void) +- { +- asm volatile ( ".global kretprobe_trampoline\n" +- "kretprobe_trampoline: \n" +- " pushf\n" +- /* skip cs, eip, orig_eax */ +- " subl $12, %esp\n" +- " pushl %fs\n" +- " pushl %ds\n" +- " pushl %es\n" +- " pushl %eax\n" +- " pushl %ebp\n" +- " pushl %edi\n" +- " pushl %esi\n" +- " pushl %edx\n" +- " pushl %ecx\n" +- " pushl %ebx\n" +- " movl %esp, %eax\n" +- " call trampoline_handler\n" +- /* move eflags to cs */ +- " movl 52(%esp), %edx\n" +- " movl %edx, 48(%esp)\n" +- /* save true return address on eflags */ +- " movl %eax, 52(%esp)\n" +- " popl %ebx\n" +- " popl %ecx\n" +- " popl %edx\n" +- " popl %esi\n" +- " popl %edi\n" +- " popl %ebp\n" +- " popl %eax\n" +- /* skip eip, orig_eax, es, ds, fs */ +- " addl $20, %esp\n" +- " popf\n" +- " ret\n"); +-} +- +-/* +- * Called from kretprobe_trampoline +- */ +-fastcall void *__kprobes trampoline_handler(struct pt_regs *regs) +-{ +- struct kretprobe_instance *ri = NULL; +- struct hlist_head *head, empty_rp; +- struct hlist_node *node, *tmp; +- unsigned long flags, orig_ret_address = 0; +- unsigned long trampoline_address =(unsigned long)&kretprobe_trampoline; +- +- INIT_HLIST_HEAD(&empty_rp); +- spin_lock_irqsave(&kretprobe_lock, flags); +- head = kretprobe_inst_table_head(current); +- /* fixup registers */ +- regs->xcs = __KERNEL_CS | get_kernel_rpl(); +- regs->eip = trampoline_address; +- regs->orig_eax = 0xffffffff; +- +- /* +- * It is possible to have multiple instances associated with a given +- * task either because an multiple functions in the call path +- * have a return probe installed on them, and/or more then one return +- * return probe was registered for a target function. +- * +- * We can handle this because: +- * - instances are always inserted at the head of the list +- * - when multiple return probes are registered for the same +- * function, the first instance's ret_addr will point to the +- * real return address, and all the rest will point to +- * kretprobe_trampoline +- */ +- hlist_for_each_entry_safe(ri, node, tmp, head, hlist) { +- if (ri->task != current) +- /* another task is sharing our hash bucket */ +- continue; +- +- if (ri->rp && ri->rp->handler){ +- __get_cpu_var(current_kprobe) = &ri->rp->kp; +- get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE; +- ri->rp->handler(ri, regs); +- __get_cpu_var(current_kprobe) = NULL; +- } +- +- orig_ret_address = (unsigned long)ri->ret_addr; +- recycle_rp_inst(ri, &empty_rp); +- +- if (orig_ret_address != trampoline_address) +- /* +- * This is the real return address. Any other +- * instances associated with this task are for +- * other calls deeper on the call stack +- */ +- break; +- } +- +- kretprobe_assert(ri, orig_ret_address, trampoline_address); +- spin_unlock_irqrestore(&kretprobe_lock, flags); +- +- hlist_for_each_entry_safe(ri, node, tmp, &empty_rp, hlist) { +- hlist_del(&ri->hlist); +- kfree(ri); +- } +- return (void*)orig_ret_address; +-} +- +-/* +- * Called after single-stepping. p->addr is the address of the +- * instruction whose first byte has been replaced by the "int 3" +- * instruction. To avoid the SMP problems that can occur when we +- * temporarily put back the original opcode to single-step, we +- * single-stepped a copy of the instruction. The address of this +- * copy is p->ainsn.insn. +- * +- * This function prepares to return from the post-single-step +- * interrupt. 
We have to fix up the stack as follows: +- * +- * 0) Except in the case of absolute or indirect jump or call instructions, +- * the new eip is relative to the copied instruction. We need to make +- * it relative to the original instruction. +- * +- * 1) If the single-stepped instruction was pushfl, then the TF and IF +- * flags are set in the just-pushed eflags, and may need to be cleared. +- * +- * 2) If the single-stepped instruction was a call, the return address +- * that is atop the stack is the address following the copied instruction. +- * We need to make it the address following the original instruction. +- * +- * This function also checks instruction size for preparing direct execution. +- */ +-static void __kprobes resume_execution(struct kprobe *p, +- struct pt_regs *regs, struct kprobe_ctlblk *kcb) +-{ +- unsigned long *tos = (unsigned long *)®s->esp; +- unsigned long copy_eip = (unsigned long)p->ainsn.insn; +- unsigned long orig_eip = (unsigned long)p->addr; +- +- regs->eflags &= ~TF_MASK; +- switch (p->ainsn.insn[0]) { +- case 0x9c: /* pushfl */ +- *tos &= ~(TF_MASK | IF_MASK); +- *tos |= kcb->kprobe_old_eflags; +- break; +- case 0xc2: /* iret/ret/lret */ +- case 0xc3: +- case 0xca: +- case 0xcb: +- case 0xcf: +- case 0xea: /* jmp absolute -- eip is correct */ +- /* eip is already adjusted, no more changes required */ +- p->ainsn.boostable = 1; +- goto no_change; +- case 0xe8: /* call relative - Fix return addr */ +- *tos = orig_eip + (*tos - copy_eip); +- break; +- case 0x9a: /* call absolute -- same as call absolute, indirect */ +- *tos = orig_eip + (*tos - copy_eip); +- goto no_change; +- case 0xff: +- if ((p->ainsn.insn[1] & 0x30) == 0x10) { +- /* +- * call absolute, indirect +- * Fix return addr; eip is correct. +- * But this is not boostable +- */ +- *tos = orig_eip + (*tos - copy_eip); +- goto no_change; +- } else if (((p->ainsn.insn[1] & 0x31) == 0x20) || /* jmp near, absolute indirect */ +- ((p->ainsn.insn[1] & 0x31) == 0x21)) { /* jmp far, absolute indirect */ +- /* eip is correct. And this is boostable */ +- p->ainsn.boostable = 1; +- goto no_change; +- } +- default: +- break; +- } +- +- if (p->ainsn.boostable == 0) { +- if ((regs->eip > copy_eip) && +- (regs->eip - copy_eip) + 5 < MAX_INSN_SIZE) { +- /* +- * These instructions can be executed directly if it +- * jumps back to correct address. +- */ +- set_jmp_op((void *)regs->eip, +- (void *)orig_eip + (regs->eip - copy_eip)); +- p->ainsn.boostable = 1; +- } else { +- p->ainsn.boostable = -1; +- } +- } +- +- regs->eip = orig_eip + (regs->eip - copy_eip); +- +-no_change: +- return; +-} +- +-/* +- * Interrupts are disabled on entry as trap1 is an interrupt gate and they +- * remain disabled thoroughout this function. +- */ +-static int __kprobes post_kprobe_handler(struct pt_regs *regs) +-{ +- struct kprobe *cur = kprobe_running(); +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- +- if (!cur) +- return 0; +- +- if ((kcb->kprobe_status != KPROBE_REENTER) && cur->post_handler) { +- kcb->kprobe_status = KPROBE_HIT_SSDONE; +- cur->post_handler(cur, regs, 0); +- } +- +- resume_execution(cur, regs, kcb); +- regs->eflags |= kcb->kprobe_saved_eflags; +- trace_hardirqs_fixup_flags(regs->eflags); +- +- /*Restore back the original saved kprobes variables and continue. 
*/ +- if (kcb->kprobe_status == KPROBE_REENTER) { +- restore_previous_kprobe(kcb); +- goto out; +- } +- reset_current_kprobe(); +-out: +- preempt_enable_no_resched(); +- +- /* +- * if somebody else is singlestepping across a probe point, eflags +- * will have TF set, in which case, continue the remaining processing +- * of do_debug, as if this is not a probe hit. +- */ +- if (regs->eflags & TF_MASK) +- return 0; +- +- return 1; +-} +- +-int __kprobes kprobe_fault_handler(struct pt_regs *regs, int trapnr) +-{ +- struct kprobe *cur = kprobe_running(); +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- +- switch(kcb->kprobe_status) { +- case KPROBE_HIT_SS: +- case KPROBE_REENTER: +- /* +- * We are here because the instruction being single +- * stepped caused a page fault. We reset the current +- * kprobe and the eip points back to the probe address +- * and allow the page fault handler to continue as a +- * normal page fault. +- */ +- regs->eip = (unsigned long)cur->addr; +- regs->eflags |= kcb->kprobe_old_eflags; +- if (kcb->kprobe_status == KPROBE_REENTER) +- restore_previous_kprobe(kcb); +- else +- reset_current_kprobe(); +- preempt_enable_no_resched(); +- break; +- case KPROBE_HIT_ACTIVE: +- case KPROBE_HIT_SSDONE: +- /* +- * We increment the nmissed count for accounting, +- * we can also use npre/npostfault count for accouting +- * these specific fault cases. +- */ +- kprobes_inc_nmissed_count(cur); +- +- /* +- * We come here because instructions in the pre/post +- * handler caused the page_fault, this could happen +- * if handler tries to access user space by +- * copy_from_user(), get_user() etc. Let the +- * user-specified handler try to fix it first. +- */ +- if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr)) +- return 1; +- +- /* +- * In case the user-specified fault handler returned +- * zero, try to fix up. +- */ +- if (fixup_exception(regs)) +- return 1; +- +- /* +- * fixup_exception() could not handle it, +- * Let do_page_fault() fix it. +- */ +- break; +- default: +- break; +- } +- return 0; +-} +- +-/* +- * Wrapper routine to for handling exceptions. +- */ +-int __kprobes kprobe_exceptions_notify(struct notifier_block *self, +- unsigned long val, void *data) +-{ +- struct die_args *args = (struct die_args *)data; +- int ret = NOTIFY_DONE; +- +- if (args->regs && user_mode_vm(args->regs)) +- return ret; +- +- switch (val) { +- case DIE_INT3: +- if (kprobe_handler(args->regs)) +- ret = NOTIFY_STOP; +- break; +- case DIE_DEBUG: +- if (post_kprobe_handler(args->regs)) +- ret = NOTIFY_STOP; +- break; +- case DIE_GPF: +- /* kprobe_running() needs smp_processor_id() */ +- preempt_disable(); +- if (kprobe_running() && +- kprobe_fault_handler(args->regs, args->trapnr)) +- ret = NOTIFY_STOP; +- preempt_enable(); +- break; +- default: +- break; +- } +- return ret; +-} +- +-int __kprobes setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs) +-{ +- struct jprobe *jp = container_of(p, struct jprobe, kp); +- unsigned long addr; +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- +- kcb->jprobe_saved_regs = *regs; +- kcb->jprobe_saved_esp = ®s->esp; +- addr = (unsigned long)(kcb->jprobe_saved_esp); +- +- /* +- * TBD: As Linus pointed out, gcc assumes that the callee +- * owns the argument space and could overwrite it, e.g. +- * tailcall optimization. So, to be absolutely safe +- * we also save and restore enough stack bytes to cover +- * the argument area. 
+- */ +- memcpy(kcb->jprobes_stack, (kprobe_opcode_t *)addr, +- MIN_STACK_SIZE(addr)); +- regs->eflags &= ~IF_MASK; +- trace_hardirqs_off(); +- regs->eip = (unsigned long)(jp->entry); +- return 1; +-} +- +-void __kprobes jprobe_return(void) +-{ +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- +- asm volatile (" xchgl %%ebx,%%esp \n" +- " int3 \n" +- " .globl jprobe_return_end \n" +- " jprobe_return_end: \n" +- " nop \n"::"b" +- (kcb->jprobe_saved_esp):"memory"); +-} +- +-int __kprobes longjmp_break_handler(struct kprobe *p, struct pt_regs *regs) +-{ +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- u8 *addr = (u8 *) (regs->eip - 1); +- unsigned long stack_addr = (unsigned long)(kcb->jprobe_saved_esp); +- struct jprobe *jp = container_of(p, struct jprobe, kp); +- +- if ((addr > (u8 *) jprobe_return) && (addr < (u8 *) jprobe_return_end)) { +- if (®s->esp != kcb->jprobe_saved_esp) { +- struct pt_regs *saved_regs = &kcb->jprobe_saved_regs; +- printk("current esp %p does not match saved esp %p\n", +- ®s->esp, kcb->jprobe_saved_esp); +- printk("Saved registers for jprobe %p\n", jp); +- show_registers(saved_regs); +- printk("Current registers\n"); +- show_registers(regs); +- BUG(); +- } +- *regs = kcb->jprobe_saved_regs; +- memcpy((kprobe_opcode_t *) stack_addr, kcb->jprobes_stack, +- MIN_STACK_SIZE(stack_addr)); +- preempt_enable_no_resched(); +- return 1; +- } +- return 0; +-} +- +-int __kprobes arch_trampoline_kprobe(struct kprobe *p) +-{ +- return 0; +-} +- +-int __init arch_init_kprobes(void) +-{ +- return 0; +-} +diff --git a/arch/x86/kernel/kprobes_64.c b/arch/x86/kernel/kprobes_64.c +deleted file mode 100644 +index 5df19a9..0000000 +--- a/arch/x86/kernel/kprobes_64.c ++++ /dev/null +@@ -1,749 +0,0 @@ +-/* +- * Kernel Probes (KProbes) +- * +- * This program is free software; you can redistribute it and/or modify +- * it under the terms of the GNU General Public License as published by +- * the Free Software Foundation; either version 2 of the License, or +- * (at your option) any later version. +- * +- * This program is distributed in the hope that it will be useful, +- * but WITHOUT ANY WARRANTY; without even the implied warranty of +- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +- * GNU General Public License for more details. +- * +- * You should have received a copy of the GNU General Public License +- * along with this program; if not, write to the Free Software +- * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. +- * +- * Copyright (C) IBM Corporation, 2002, 2004 +- * +- * 2002-Oct Created by Vamsi Krishna S Kernel +- * Probes initial implementation ( includes contributions from +- * Rusty Russell). +- * 2004-July Suparna Bhattacharya added jumper probes +- * interface to access function arguments. +- * 2004-Oct Jim Keniston and Prasanna S Panchamukhi +- * adapted for x86_64 +- * 2005-Mar Roland McGrath +- * Fixed to handle %rip-relative addressing mode correctly. 
+- * 2005-May Rusty Lynch +- * Added function return probes functionality +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-#include +-#include +-#include +- +-void jprobe_return_end(void); +-static void __kprobes arch_copy_kprobe(struct kprobe *p); +- +-DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL; +-DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk); +- +-struct kretprobe_blackpoint kretprobe_blacklist[] = { +- {"__switch_to", }, /* This function switches only current task, but +- doesn't switch kernel stack.*/ +- {NULL, NULL} /* Terminator */ +-}; +-const int kretprobe_blacklist_size = ARRAY_SIZE(kretprobe_blacklist); +- +-/* +- * returns non-zero if opcode modifies the interrupt flag. +- */ +-static int __kprobes is_IF_modifier(kprobe_opcode_t *insn) +-{ +- switch (*insn) { +- case 0xfa: /* cli */ +- case 0xfb: /* sti */ +- case 0xcf: /* iret/iretd */ +- case 0x9d: /* popf/popfd */ +- return 1; +- } +- +- if (*insn >= 0x40 && *insn <= 0x4f && *++insn == 0xcf) +- return 1; +- return 0; +-} +- +-int __kprobes arch_prepare_kprobe(struct kprobe *p) +-{ +- /* insn: must be on special executable page on x86_64. */ +- p->ainsn.insn = get_insn_slot(); +- if (!p->ainsn.insn) { +- return -ENOMEM; +- } +- arch_copy_kprobe(p); +- return 0; +-} +- +-/* +- * Determine if the instruction uses the %rip-relative addressing mode. +- * If it does, return the address of the 32-bit displacement word. +- * If not, return null. +- */ +-static s32 __kprobes *is_riprel(u8 *insn) +-{ +-#define W(row,b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,ba,bb,bc,bd,be,bf) \ +- (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \ +- (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \ +- (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \ +- (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \ +- << (row % 64)) +- static const u64 onebyte_has_modrm[256 / 64] = { +- /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ +- /* ------------------------------- */ +- W(0x00, 1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0)| /* 00 */ +- W(0x10, 1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0)| /* 10 */ +- W(0x20, 1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0)| /* 20 */ +- W(0x30, 1,1,1,1,0,0,0,0,1,1,1,1,0,0,0,0), /* 30 */ +- W(0x40, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)| /* 40 */ +- W(0x50, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)| /* 50 */ +- W(0x60, 0,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0)| /* 60 */ +- W(0x70, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), /* 70 */ +- W(0x80, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)| /* 80 */ +- W(0x90, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)| /* 90 */ +- W(0xa0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)| /* a0 */ +- W(0xb0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), /* b0 */ +- W(0xc0, 1,1,0,0,1,1,1,1,0,0,0,0,0,0,0,0)| /* c0 */ +- W(0xd0, 1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1)| /* d0 */ +- W(0xe0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)| /* e0 */ +- W(0xf0, 0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1) /* f0 */ +- /* ------------------------------- */ +- /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ +- }; +- static const u64 twobyte_has_modrm[256 / 64] = { +- /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ +- /* ------------------------------- */ +- W(0x00, 1,1,1,1,0,0,0,0,0,0,0,0,0,1,0,1)| /* 0f */ +- W(0x10, 1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0)| /* 1f */ +- W(0x20, 1,1,1,1,1,0,1,0,1,1,1,1,1,1,1,1)| /* 2f */ +- W(0x30, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), /* 3f */ +- W(0x40, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)| /* 4f */ +- W(0x50, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)| /* 5f */ +- W(0x60, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)| /* 6f */ +- W(0x70, 1,1,1,1,1,1,1,0,0,0,0,0,1,1,1,1), /* 
7f */ +- W(0x80, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)| /* 8f */ +- W(0x90, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)| /* 9f */ +- W(0xa0, 0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1)| /* af */ +- W(0xb0, 1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1), /* bf */ +- W(0xc0, 1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0)| /* cf */ +- W(0xd0, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)| /* df */ +- W(0xe0, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)| /* ef */ +- W(0xf0, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0) /* ff */ +- /* ------------------------------- */ +- /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */ +- }; +-#undef W +- int need_modrm; +- +- /* Skip legacy instruction prefixes. */ +- while (1) { +- switch (*insn) { +- case 0x66: +- case 0x67: +- case 0x2e: +- case 0x3e: +- case 0x26: +- case 0x64: +- case 0x65: +- case 0x36: +- case 0xf0: +- case 0xf3: +- case 0xf2: +- ++insn; +- continue; +- } +- break; +- } +- +- /* Skip REX instruction prefix. */ +- if ((*insn & 0xf0) == 0x40) +- ++insn; +- +- if (*insn == 0x0f) { /* Two-byte opcode. */ +- ++insn; +- need_modrm = test_bit(*insn, twobyte_has_modrm); +- } else { /* One-byte opcode. */ +- need_modrm = test_bit(*insn, onebyte_has_modrm); +- } +- +- if (need_modrm) { +- u8 modrm = *++insn; +- if ((modrm & 0xc7) == 0x05) { /* %rip+disp32 addressing mode */ +- /* Displacement follows ModRM byte. */ +- return (s32 *) ++insn; +- } +- } +- +- /* No %rip-relative addressing mode here. */ +- return NULL; +-} +- +-static void __kprobes arch_copy_kprobe(struct kprobe *p) +-{ +- s32 *ripdisp; +- memcpy(p->ainsn.insn, p->addr, MAX_INSN_SIZE); +- ripdisp = is_riprel(p->ainsn.insn); +- if (ripdisp) { +- /* +- * The copied instruction uses the %rip-relative +- * addressing mode. Adjust the displacement for the +- * difference between the original location of this +- * instruction and the location of the copy that will +- * actually be run. The tricky bit here is making sure +- * that the sign extension happens correctly in this +- * calculation, since we need a signed 32-bit result to +- * be sign-extended to 64 bits when it's added to the +- * %rip value and yield the same 64-bit result that the +- * sign-extension of the original signed 32-bit +- * displacement would have given. +- */ +- s64 disp = (u8 *) p->addr + *ripdisp - (u8 *) p->ainsn.insn; +- BUG_ON((s64) (s32) disp != disp); /* Sanity check. 
*/ +- *ripdisp = disp; +- } +- p->opcode = *p->addr; +-} +- +-void __kprobes arch_arm_kprobe(struct kprobe *p) +-{ +- text_poke(p->addr, ((unsigned char []){BREAKPOINT_INSTRUCTION}), 1); +-} +- +-void __kprobes arch_disarm_kprobe(struct kprobe *p) +-{ +- text_poke(p->addr, &p->opcode, 1); +-} +- +-void __kprobes arch_remove_kprobe(struct kprobe *p) +-{ +- mutex_lock(&kprobe_mutex); +- free_insn_slot(p->ainsn.insn, 0); +- mutex_unlock(&kprobe_mutex); +-} +- +-static void __kprobes save_previous_kprobe(struct kprobe_ctlblk *kcb) +-{ +- kcb->prev_kprobe.kp = kprobe_running(); +- kcb->prev_kprobe.status = kcb->kprobe_status; +- kcb->prev_kprobe.old_rflags = kcb->kprobe_old_rflags; +- kcb->prev_kprobe.saved_rflags = kcb->kprobe_saved_rflags; +-} +- +-static void __kprobes restore_previous_kprobe(struct kprobe_ctlblk *kcb) +-{ +- __get_cpu_var(current_kprobe) = kcb->prev_kprobe.kp; +- kcb->kprobe_status = kcb->prev_kprobe.status; +- kcb->kprobe_old_rflags = kcb->prev_kprobe.old_rflags; +- kcb->kprobe_saved_rflags = kcb->prev_kprobe.saved_rflags; +-} +- +-static void __kprobes set_current_kprobe(struct kprobe *p, struct pt_regs *regs, +- struct kprobe_ctlblk *kcb) +-{ +- __get_cpu_var(current_kprobe) = p; +- kcb->kprobe_saved_rflags = kcb->kprobe_old_rflags +- = (regs->eflags & (TF_MASK | IF_MASK)); +- if (is_IF_modifier(p->ainsn.insn)) +- kcb->kprobe_saved_rflags &= ~IF_MASK; +-} +- +-static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs *regs) +-{ +- regs->eflags |= TF_MASK; +- regs->eflags &= ~IF_MASK; +- /*single step inline if the instruction is an int3*/ +- if (p->opcode == BREAKPOINT_INSTRUCTION) +- regs->rip = (unsigned long)p->addr; +- else +- regs->rip = (unsigned long)p->ainsn.insn; +-} +- +-/* Called with kretprobe_lock held */ +-void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri, +- struct pt_regs *regs) +-{ +- unsigned long *sara = (unsigned long *)regs->rsp; +- +- ri->ret_addr = (kprobe_opcode_t *) *sara; +- /* Replace the return addr with trampoline addr */ +- *sara = (unsigned long) &kretprobe_trampoline; +-} +- +-int __kprobes kprobe_handler(struct pt_regs *regs) +-{ +- struct kprobe *p; +- int ret = 0; +- kprobe_opcode_t *addr = (kprobe_opcode_t *)(regs->rip - sizeof(kprobe_opcode_t)); +- struct kprobe_ctlblk *kcb; +- +- /* +- * We don't want to be preempted for the entire +- * duration of kprobe processing +- */ +- preempt_disable(); +- kcb = get_kprobe_ctlblk(); +- +- /* Check we're not actually recursing */ +- if (kprobe_running()) { +- p = get_kprobe(addr); +- if (p) { +- if (kcb->kprobe_status == KPROBE_HIT_SS && +- *p->ainsn.insn == BREAKPOINT_INSTRUCTION) { +- regs->eflags &= ~TF_MASK; +- regs->eflags |= kcb->kprobe_saved_rflags; +- goto no_kprobe; +- } else if (kcb->kprobe_status == KPROBE_HIT_SSDONE) { +- /* TODO: Provide re-entrancy from +- * post_kprobes_handler() and avoid exception +- * stack corruption while single-stepping on +- * the instruction of the new probe. +- */ +- arch_disarm_kprobe(p); +- regs->rip = (unsigned long)p->addr; +- reset_current_kprobe(); +- ret = 1; +- } else { +- /* We have reentered the kprobe_handler(), since +- * another probe was hit while within the +- * handler. We here save the original kprobe +- * variables and just single step on instruction +- * of the new probe without calling any user +- * handlers. 
+- */ +- save_previous_kprobe(kcb); +- set_current_kprobe(p, regs, kcb); +- kprobes_inc_nmissed_count(p); +- prepare_singlestep(p, regs); +- kcb->kprobe_status = KPROBE_REENTER; +- return 1; +- } +- } else { +- if (*addr != BREAKPOINT_INSTRUCTION) { +- /* The breakpoint instruction was removed by +- * another cpu right after we hit, no further +- * handling of this interrupt is appropriate +- */ +- regs->rip = (unsigned long)addr; +- ret = 1; +- goto no_kprobe; +- } +- p = __get_cpu_var(current_kprobe); +- if (p->break_handler && p->break_handler(p, regs)) { +- goto ss_probe; +- } +- } +- goto no_kprobe; +- } +- +- p = get_kprobe(addr); +- if (!p) { +- if (*addr != BREAKPOINT_INSTRUCTION) { +- /* +- * The breakpoint instruction was removed right +- * after we hit it. Another cpu has removed +- * either a probepoint or a debugger breakpoint +- * at this address. In either case, no further +- * handling of this interrupt is appropriate. +- * Back up over the (now missing) int3 and run +- * the original instruction. +- */ +- regs->rip = (unsigned long)addr; +- ret = 1; +- } +- /* Not one of ours: let kernel handle it */ +- goto no_kprobe; +- } +- +- set_current_kprobe(p, regs, kcb); +- kcb->kprobe_status = KPROBE_HIT_ACTIVE; +- +- if (p->pre_handler && p->pre_handler(p, regs)) +- /* handler has already set things up, so skip ss setup */ +- return 1; +- +-ss_probe: +- prepare_singlestep(p, regs); +- kcb->kprobe_status = KPROBE_HIT_SS; +- return 1; +- +-no_kprobe: +- preempt_enable_no_resched(); +- return ret; +-} +- +-/* +- * For function-return probes, init_kprobes() establishes a probepoint +- * here. When a retprobed function returns, this probe is hit and +- * trampoline_probe_handler() runs, calling the kretprobe's handler. +- */ +- void kretprobe_trampoline_holder(void) +- { +- asm volatile ( ".global kretprobe_trampoline\n" +- "kretprobe_trampoline: \n" +- "nop\n"); +- } +- +-/* +- * Called when we hit the probe point at kretprobe_trampoline +- */ +-int __kprobes trampoline_probe_handler(struct kprobe *p, struct pt_regs *regs) +-{ +- struct kretprobe_instance *ri = NULL; +- struct hlist_head *head, empty_rp; +- struct hlist_node *node, *tmp; +- unsigned long flags, orig_ret_address = 0; +- unsigned long trampoline_address =(unsigned long)&kretprobe_trampoline; +- +- INIT_HLIST_HEAD(&empty_rp); +- spin_lock_irqsave(&kretprobe_lock, flags); +- head = kretprobe_inst_table_head(current); +- +- /* +- * It is possible to have multiple instances associated with a given +- * task either because an multiple functions in the call path +- * have a return probe installed on them, and/or more then one return +- * return probe was registered for a target function. +- * +- * We can handle this because: +- * - instances are always inserted at the head of the list +- * - when multiple return probes are registered for the same +- * function, the first instance's ret_addr will point to the +- * real return address, and all the rest will point to +- * kretprobe_trampoline +- */ +- hlist_for_each_entry_safe(ri, node, tmp, head, hlist) { +- if (ri->task != current) +- /* another task is sharing our hash bucket */ +- continue; +- +- if (ri->rp && ri->rp->handler) +- ri->rp->handler(ri, regs); +- +- orig_ret_address = (unsigned long)ri->ret_addr; +- recycle_rp_inst(ri, &empty_rp); +- +- if (orig_ret_address != trampoline_address) +- /* +- * This is the real return address. 
Any other +- * instances associated with this task are for +- * other calls deeper on the call stack +- */ +- break; +- } +- +- kretprobe_assert(ri, orig_ret_address, trampoline_address); +- regs->rip = orig_ret_address; +- +- reset_current_kprobe(); +- spin_unlock_irqrestore(&kretprobe_lock, flags); +- preempt_enable_no_resched(); +- +- hlist_for_each_entry_safe(ri, node, tmp, &empty_rp, hlist) { +- hlist_del(&ri->hlist); +- kfree(ri); +- } +- /* +- * By returning a non-zero value, we are telling +- * kprobe_handler() that we don't want the post_handler +- * to run (and have re-enabled preemption) +- */ +- return 1; +-} +- +-/* +- * Called after single-stepping. p->addr is the address of the +- * instruction whose first byte has been replaced by the "int 3" +- * instruction. To avoid the SMP problems that can occur when we +- * temporarily put back the original opcode to single-step, we +- * single-stepped a copy of the instruction. The address of this +- * copy is p->ainsn.insn. +- * +- * This function prepares to return from the post-single-step +- * interrupt. We have to fix up the stack as follows: +- * +- * 0) Except in the case of absolute or indirect jump or call instructions, +- * the new rip is relative to the copied instruction. We need to make +- * it relative to the original instruction. +- * +- * 1) If the single-stepped instruction was pushfl, then the TF and IF +- * flags are set in the just-pushed eflags, and may need to be cleared. +- * +- * 2) If the single-stepped instruction was a call, the return address +- * that is atop the stack is the address following the copied instruction. +- * We need to make it the address following the original instruction. +- */ +-static void __kprobes resume_execution(struct kprobe *p, +- struct pt_regs *regs, struct kprobe_ctlblk *kcb) +-{ +- unsigned long *tos = (unsigned long *)regs->rsp; +- unsigned long copy_rip = (unsigned long)p->ainsn.insn; +- unsigned long orig_rip = (unsigned long)p->addr; +- kprobe_opcode_t *insn = p->ainsn.insn; +- +- /*skip the REX prefix*/ +- if (*insn >= 0x40 && *insn <= 0x4f) +- insn++; +- +- regs->eflags &= ~TF_MASK; +- switch (*insn) { +- case 0x9c: /* pushfl */ +- *tos &= ~(TF_MASK | IF_MASK); +- *tos |= kcb->kprobe_old_rflags; +- break; +- case 0xc2: /* iret/ret/lret */ +- case 0xc3: +- case 0xca: +- case 0xcb: +- case 0xcf: +- case 0xea: /* jmp absolute -- ip is correct */ +- /* ip is already adjusted, no more changes required */ +- goto no_change; +- case 0xe8: /* call relative - Fix return addr */ +- *tos = orig_rip + (*tos - copy_rip); +- break; +- case 0xff: +- if ((insn[1] & 0x30) == 0x10) { +- /* call absolute, indirect */ +- /* Fix return addr; ip is correct. */ +- *tos = orig_rip + (*tos - copy_rip); +- goto no_change; +- } else if (((insn[1] & 0x31) == 0x20) || /* jmp near, absolute indirect */ +- ((insn[1] & 0x31) == 0x21)) { /* jmp far, absolute indirect */ +- /* ip is correct. 
*/ +- goto no_change; +- } +- default: +- break; +- } +- +- regs->rip = orig_rip + (regs->rip - copy_rip); +-no_change: +- +- return; +-} +- +-int __kprobes post_kprobe_handler(struct pt_regs *regs) +-{ +- struct kprobe *cur = kprobe_running(); +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- +- if (!cur) +- return 0; +- +- if ((kcb->kprobe_status != KPROBE_REENTER) && cur->post_handler) { +- kcb->kprobe_status = KPROBE_HIT_SSDONE; +- cur->post_handler(cur, regs, 0); +- } +- +- resume_execution(cur, regs, kcb); +- regs->eflags |= kcb->kprobe_saved_rflags; +- trace_hardirqs_fixup_flags(regs->eflags); +- +- /* Restore the original saved kprobes variables and continue. */ +- if (kcb->kprobe_status == KPROBE_REENTER) { +- restore_previous_kprobe(kcb); +- goto out; +- } +- reset_current_kprobe(); +-out: +- preempt_enable_no_resched(); +- +- /* +- * if somebody else is singlestepping across a probe point, eflags +- * will have TF set, in which case, continue the remaining processing +- * of do_debug, as if this is not a probe hit. +- */ +- if (regs->eflags & TF_MASK) +- return 0; +- +- return 1; +-} +- +-int __kprobes kprobe_fault_handler(struct pt_regs *regs, int trapnr) +-{ +- struct kprobe *cur = kprobe_running(); +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- const struct exception_table_entry *fixup; +- +- switch(kcb->kprobe_status) { +- case KPROBE_HIT_SS: +- case KPROBE_REENTER: +- /* +- * We are here because the instruction being single +- * stepped caused a page fault. We reset the current +- * kprobe and the rip points back to the probe address +- * and allow the page fault handler to continue as a +- * normal page fault. +- */ +- regs->rip = (unsigned long)cur->addr; +- regs->eflags |= kcb->kprobe_old_rflags; +- if (kcb->kprobe_status == KPROBE_REENTER) +- restore_previous_kprobe(kcb); +- else +- reset_current_kprobe(); +- preempt_enable_no_resched(); +- break; +- case KPROBE_HIT_ACTIVE: +- case KPROBE_HIT_SSDONE: +- /* +- * We increment the nmissed count for accounting, +- * we can also use npre/npostfault count for accouting +- * these specific fault cases. +- */ +- kprobes_inc_nmissed_count(cur); +- +- /* +- * We come here because instructions in the pre/post +- * handler caused the page_fault, this could happen +- * if handler tries to access user space by +- * copy_from_user(), get_user() etc. Let the +- * user-specified handler try to fix it first. +- */ +- if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr)) +- return 1; +- +- /* +- * In case the user-specified fault handler returned +- * zero, try to fix up. +- */ +- fixup = search_exception_tables(regs->rip); +- if (fixup) { +- regs->rip = fixup->fixup; +- return 1; +- } +- +- /* +- * fixup() could not handle it, +- * Let do_page_fault() fix it. +- */ +- break; +- default: +- break; +- } +- return 0; +-} +- +-/* +- * Wrapper routine for handling exceptions. 
+- */ +-int __kprobes kprobe_exceptions_notify(struct notifier_block *self, +- unsigned long val, void *data) +-{ +- struct die_args *args = (struct die_args *)data; +- int ret = NOTIFY_DONE; +- +- if (args->regs && user_mode(args->regs)) +- return ret; +- +- switch (val) { +- case DIE_INT3: +- if (kprobe_handler(args->regs)) +- ret = NOTIFY_STOP; +- break; +- case DIE_DEBUG: +- if (post_kprobe_handler(args->regs)) +- ret = NOTIFY_STOP; +- break; +- case DIE_GPF: +- /* kprobe_running() needs smp_processor_id() */ +- preempt_disable(); +- if (kprobe_running() && +- kprobe_fault_handler(args->regs, args->trapnr)) +- ret = NOTIFY_STOP; +- preempt_enable(); +- break; +- default: +- break; +- } +- return ret; +-} +- +-int __kprobes setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs) +-{ +- struct jprobe *jp = container_of(p, struct jprobe, kp); +- unsigned long addr; +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- +- kcb->jprobe_saved_regs = *regs; +- kcb->jprobe_saved_rsp = (long *) regs->rsp; +- addr = (unsigned long)(kcb->jprobe_saved_rsp); +- /* +- * As Linus pointed out, gcc assumes that the callee +- * owns the argument space and could overwrite it, e.g. +- * tailcall optimization. So, to be absolutely safe +- * we also save and restore enough stack bytes to cover +- * the argument area. +- */ +- memcpy(kcb->jprobes_stack, (kprobe_opcode_t *)addr, +- MIN_STACK_SIZE(addr)); +- regs->eflags &= ~IF_MASK; +- trace_hardirqs_off(); +- regs->rip = (unsigned long)(jp->entry); +- return 1; +-} +- +-void __kprobes jprobe_return(void) +-{ +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- +- asm volatile (" xchg %%rbx,%%rsp \n" +- " int3 \n" +- " .globl jprobe_return_end \n" +- " jprobe_return_end: \n" +- " nop \n"::"b" +- (kcb->jprobe_saved_rsp):"memory"); +-} +- +-int __kprobes longjmp_break_handler(struct kprobe *p, struct pt_regs *regs) +-{ +- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk(); +- u8 *addr = (u8 *) (regs->rip - 1); +- unsigned long stack_addr = (unsigned long)(kcb->jprobe_saved_rsp); +- struct jprobe *jp = container_of(p, struct jprobe, kp); +- +- if ((addr > (u8 *) jprobe_return) && (addr < (u8 *) jprobe_return_end)) { +- if ((unsigned long *)regs->rsp != kcb->jprobe_saved_rsp) { +- struct pt_regs *saved_regs = &kcb->jprobe_saved_regs; +- printk("current rsp %p does not match saved rsp %p\n", +- (long *)regs->rsp, kcb->jprobe_saved_rsp); +- printk("Saved registers for jprobe %p\n", jp); +- show_registers(saved_regs); +- printk("Current registers\n"); +- show_registers(regs); +- BUG(); +- } +- *regs = kcb->jprobe_saved_regs; +- memcpy((kprobe_opcode_t *) stack_addr, kcb->jprobes_stack, +- MIN_STACK_SIZE(stack_addr)); +- preempt_enable_no_resched(); +- return 1; +- } +- return 0; +-} +- +-static struct kprobe trampoline_p = { +- .addr = (kprobe_opcode_t *) &kretprobe_trampoline, +- .pre_handler = trampoline_probe_handler +-}; +- +-int __init arch_init_kprobes(void) +-{ +- return register_kprobe(&trampoline_p); +-} +- +-int __kprobes arch_trampoline_kprobe(struct kprobe *p) +-{ +- if (p->addr == (kprobe_opcode_t *)&kretprobe_trampoline) +- return 1; +- +- return 0; +-} +diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c +new file mode 100644 +index 0000000..8a7660c +--- /dev/null ++++ b/arch/x86/kernel/ldt.c +@@ -0,0 +1,260 @@ ++/* ++ * Copyright (C) 1992 Krishna Balasubramanian and Linus Torvalds ++ * Copyright (C) 1999 Ingo Molnar ++ * Copyright (C) 2002 Andi Kleen ++ * ++ * This handles calls from both 32bit and 64bit mode. 
++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++ ++#include ++#include ++#include ++#include ++#include ++ ++#ifdef CONFIG_SMP ++static void flush_ldt(void *null) ++{ ++ if (current->active_mm) ++ load_LDT(¤t->active_mm->context); ++} ++#endif ++ ++static int alloc_ldt(mm_context_t *pc, int mincount, int reload) ++{ ++ void *oldldt, *newldt; ++ int oldsize; ++ ++ if (mincount <= pc->size) ++ return 0; ++ oldsize = pc->size; ++ mincount = (mincount + 511) & (~511); ++ if (mincount * LDT_ENTRY_SIZE > PAGE_SIZE) ++ newldt = vmalloc(mincount * LDT_ENTRY_SIZE); ++ else ++ newldt = (void *)__get_free_page(GFP_KERNEL); ++ ++ if (!newldt) ++ return -ENOMEM; ++ ++ if (oldsize) ++ memcpy(newldt, pc->ldt, oldsize * LDT_ENTRY_SIZE); ++ oldldt = pc->ldt; ++ memset(newldt + oldsize * LDT_ENTRY_SIZE, 0, ++ (mincount - oldsize) * LDT_ENTRY_SIZE); ++ ++#ifdef CONFIG_X86_64 ++ /* CHECKME: Do we really need this ? */ ++ wmb(); ++#endif ++ pc->ldt = newldt; ++ wmb(); ++ pc->size = mincount; ++ wmb(); ++ ++ if (reload) { ++#ifdef CONFIG_SMP ++ cpumask_t mask; ++ ++ preempt_disable(); ++ load_LDT(pc); ++ mask = cpumask_of_cpu(smp_processor_id()); ++ if (!cpus_equal(current->mm->cpu_vm_mask, mask)) ++ smp_call_function(flush_ldt, NULL, 1, 1); ++ preempt_enable(); ++#else ++ load_LDT(pc); ++#endif ++ } ++ if (oldsize) { ++ if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE) ++ vfree(oldldt); ++ else ++ put_page(virt_to_page(oldldt)); ++ } ++ return 0; ++} ++ ++static inline int copy_ldt(mm_context_t *new, mm_context_t *old) ++{ ++ int err = alloc_ldt(new, old->size, 0); ++ ++ if (err < 0) ++ return err; ++ memcpy(new->ldt, old->ldt, old->size * LDT_ENTRY_SIZE); ++ return 0; ++} ++ ++/* ++ * we do not have to muck with descriptors here, that is ++ * done in switch_mm() as needed. ++ */ ++int init_new_context(struct task_struct *tsk, struct mm_struct *mm) ++{ ++ struct mm_struct *old_mm; ++ int retval = 0; ++ ++ mutex_init(&mm->context.lock); ++ mm->context.size = 0; ++ old_mm = current->mm; ++ if (old_mm && old_mm->context.size > 0) { ++ mutex_lock(&old_mm->context.lock); ++ retval = copy_ldt(&mm->context, &old_mm->context); ++ mutex_unlock(&old_mm->context.lock); ++ } ++ return retval; ++} ++ ++/* ++ * No need to lock the MM as we are the last user ++ * ++ * 64bit: Don't touch the LDT register - we're already in the next thread. ++ */ ++void destroy_context(struct mm_struct *mm) ++{ ++ if (mm->context.size) { ++#ifdef CONFIG_X86_32 ++ /* CHECKME: Can this ever happen ? 
*/ ++ if (mm == current->active_mm) ++ clear_LDT(); ++#endif ++ if (mm->context.size * LDT_ENTRY_SIZE > PAGE_SIZE) ++ vfree(mm->context.ldt); ++ else ++ put_page(virt_to_page(mm->context.ldt)); ++ mm->context.size = 0; ++ } ++} ++ ++static int read_ldt(void __user *ptr, unsigned long bytecount) ++{ ++ int err; ++ unsigned long size; ++ struct mm_struct *mm = current->mm; ++ ++ if (!mm->context.size) ++ return 0; ++ if (bytecount > LDT_ENTRY_SIZE * LDT_ENTRIES) ++ bytecount = LDT_ENTRY_SIZE * LDT_ENTRIES; ++ ++ mutex_lock(&mm->context.lock); ++ size = mm->context.size * LDT_ENTRY_SIZE; ++ if (size > bytecount) ++ size = bytecount; ++ ++ err = 0; ++ if (copy_to_user(ptr, mm->context.ldt, size)) ++ err = -EFAULT; ++ mutex_unlock(&mm->context.lock); ++ if (err < 0) ++ goto error_return; ++ if (size != bytecount) { ++ /* zero-fill the rest */ ++ if (clear_user(ptr + size, bytecount - size) != 0) { ++ err = -EFAULT; ++ goto error_return; ++ } ++ } ++ return bytecount; ++error_return: ++ return err; ++} ++ ++static int read_default_ldt(void __user *ptr, unsigned long bytecount) ++{ ++ /* CHECKME: Can we use _one_ random number ? */ ++#ifdef CONFIG_X86_32 ++ unsigned long size = 5 * sizeof(struct desc_struct); ++#else ++ unsigned long size = 128; ++#endif ++ if (bytecount > size) ++ bytecount = size; ++ if (clear_user(ptr, bytecount)) ++ return -EFAULT; ++ return bytecount; ++} ++ ++static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode) ++{ ++ struct mm_struct *mm = current->mm; ++ struct desc_struct ldt; ++ int error; ++ struct user_desc ldt_info; ++ ++ error = -EINVAL; ++ if (bytecount != sizeof(ldt_info)) ++ goto out; ++ error = -EFAULT; ++ if (copy_from_user(&ldt_info, ptr, sizeof(ldt_info))) ++ goto out; ++ ++ error = -EINVAL; ++ if (ldt_info.entry_number >= LDT_ENTRIES) ++ goto out; ++ if (ldt_info.contents == 3) { ++ if (oldmode) ++ goto out; ++ if (ldt_info.seg_not_present == 0) ++ goto out; ++ } ++ ++ mutex_lock(&mm->context.lock); ++ if (ldt_info.entry_number >= mm->context.size) { ++ error = alloc_ldt(¤t->mm->context, ++ ldt_info.entry_number + 1, 1); ++ if (error < 0) ++ goto out_unlock; ++ } ++ ++ /* Allow LDTs to be cleared by the user. */ ++ if (ldt_info.base_addr == 0 && ldt_info.limit == 0) { ++ if (oldmode || LDT_empty(&ldt_info)) { ++ memset(&ldt, 0, sizeof(ldt)); ++ goto install; ++ } ++ } ++ ++ fill_ldt(&ldt, &ldt_info); ++ if (oldmode) ++ ldt.avl = 0; ++ ++ /* Install the new entry ... 
*/ ++install: ++ write_ldt_entry(mm->context.ldt, ldt_info.entry_number, &ldt); ++ error = 0; ++ ++out_unlock: ++ mutex_unlock(&mm->context.lock); ++out: ++ return error; ++} ++ ++asmlinkage int sys_modify_ldt(int func, void __user *ptr, ++ unsigned long bytecount) ++{ ++ int ret = -ENOSYS; ++ ++ switch (func) { ++ case 0: ++ ret = read_ldt(ptr, bytecount); ++ break; ++ case 1: ++ ret = write_ldt(ptr, bytecount, 1); ++ break; ++ case 2: ++ ret = read_default_ldt(ptr, bytecount); ++ break; ++ case 0x11: ++ ret = write_ldt(ptr, bytecount, 0); ++ break; ++ } ++ return ret; ++} +diff --git a/arch/x86/kernel/ldt_32.c b/arch/x86/kernel/ldt_32.c +deleted file mode 100644 +index 9ff90a2..0000000 +--- a/arch/x86/kernel/ldt_32.c ++++ /dev/null +@@ -1,248 +0,0 @@ +-/* +- * Copyright (C) 1992 Krishna Balasubramanian and Linus Torvalds +- * Copyright (C) 1999 Ingo Molnar +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-#include +-#include +-#include +-#include +-#include +- +-#ifdef CONFIG_SMP /* avoids "defined but not used" warnig */ +-static void flush_ldt(void *null) +-{ +- if (current->active_mm) +- load_LDT(¤t->active_mm->context); +-} +-#endif +- +-static int alloc_ldt(mm_context_t *pc, int mincount, int reload) +-{ +- void *oldldt; +- void *newldt; +- int oldsize; +- +- if (mincount <= pc->size) +- return 0; +- oldsize = pc->size; +- mincount = (mincount+511)&(~511); +- if (mincount*LDT_ENTRY_SIZE > PAGE_SIZE) +- newldt = vmalloc(mincount*LDT_ENTRY_SIZE); +- else +- newldt = kmalloc(mincount*LDT_ENTRY_SIZE, GFP_KERNEL); +- +- if (!newldt) +- return -ENOMEM; +- +- if (oldsize) +- memcpy(newldt, pc->ldt, oldsize*LDT_ENTRY_SIZE); +- oldldt = pc->ldt; +- memset(newldt+oldsize*LDT_ENTRY_SIZE, 0, (mincount-oldsize)*LDT_ENTRY_SIZE); +- pc->ldt = newldt; +- wmb(); +- pc->size = mincount; +- wmb(); +- +- if (reload) { +-#ifdef CONFIG_SMP +- cpumask_t mask; +- preempt_disable(); +- load_LDT(pc); +- mask = cpumask_of_cpu(smp_processor_id()); +- if (!cpus_equal(current->mm->cpu_vm_mask, mask)) +- smp_call_function(flush_ldt, NULL, 1, 1); +- preempt_enable(); +-#else +- load_LDT(pc); +-#endif +- } +- if (oldsize) { +- if (oldsize*LDT_ENTRY_SIZE > PAGE_SIZE) +- vfree(oldldt); +- else +- kfree(oldldt); +- } +- return 0; +-} +- +-static inline int copy_ldt(mm_context_t *new, mm_context_t *old) +-{ +- int err = alloc_ldt(new, old->size, 0); +- if (err < 0) +- return err; +- memcpy(new->ldt, old->ldt, old->size*LDT_ENTRY_SIZE); +- return 0; +-} +- +-/* +- * we do not have to muck with descriptors here, that is +- * done in switch_mm() as needed. 
+- */ +-int init_new_context(struct task_struct *tsk, struct mm_struct *mm) +-{ +- struct mm_struct * old_mm; +- int retval = 0; +- +- mutex_init(&mm->context.lock); +- mm->context.size = 0; +- old_mm = current->mm; +- if (old_mm && old_mm->context.size > 0) { +- mutex_lock(&old_mm->context.lock); +- retval = copy_ldt(&mm->context, &old_mm->context); +- mutex_unlock(&old_mm->context.lock); +- } +- return retval; +-} +- +-/* +- * No need to lock the MM as we are the last user +- */ +-void destroy_context(struct mm_struct *mm) +-{ +- if (mm->context.size) { +- if (mm == current->active_mm) +- clear_LDT(); +- if (mm->context.size*LDT_ENTRY_SIZE > PAGE_SIZE) +- vfree(mm->context.ldt); +- else +- kfree(mm->context.ldt); +- mm->context.size = 0; +- } +-} +- +-static int read_ldt(void __user * ptr, unsigned long bytecount) +-{ +- int err; +- unsigned long size; +- struct mm_struct * mm = current->mm; +- +- if (!mm->context.size) +- return 0; +- if (bytecount > LDT_ENTRY_SIZE*LDT_ENTRIES) +- bytecount = LDT_ENTRY_SIZE*LDT_ENTRIES; +- +- mutex_lock(&mm->context.lock); +- size = mm->context.size*LDT_ENTRY_SIZE; +- if (size > bytecount) +- size = bytecount; +- +- err = 0; +- if (copy_to_user(ptr, mm->context.ldt, size)) +- err = -EFAULT; +- mutex_unlock(&mm->context.lock); +- if (err < 0) +- goto error_return; +- if (size != bytecount) { +- /* zero-fill the rest */ +- if (clear_user(ptr+size, bytecount-size) != 0) { +- err = -EFAULT; +- goto error_return; +- } +- } +- return bytecount; +-error_return: +- return err; +-} +- +-static int read_default_ldt(void __user * ptr, unsigned long bytecount) +-{ +- int err; +- unsigned long size; +- +- err = 0; +- size = 5*sizeof(struct desc_struct); +- if (size > bytecount) +- size = bytecount; +- +- err = size; +- if (clear_user(ptr, size)) +- err = -EFAULT; +- +- return err; +-} +- +-static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode) +-{ +- struct mm_struct * mm = current->mm; +- __u32 entry_1, entry_2; +- int error; +- struct user_desc ldt_info; +- +- error = -EINVAL; +- if (bytecount != sizeof(ldt_info)) +- goto out; +- error = -EFAULT; +- if (copy_from_user(&ldt_info, ptr, sizeof(ldt_info))) +- goto out; +- +- error = -EINVAL; +- if (ldt_info.entry_number >= LDT_ENTRIES) +- goto out; +- if (ldt_info.contents == 3) { +- if (oldmode) +- goto out; +- if (ldt_info.seg_not_present == 0) +- goto out; +- } +- +- mutex_lock(&mm->context.lock); +- if (ldt_info.entry_number >= mm->context.size) { +- error = alloc_ldt(¤t->mm->context, ldt_info.entry_number+1, 1); +- if (error < 0) +- goto out_unlock; +- } +- +- /* Allow LDTs to be cleared by the user. */ +- if (ldt_info.base_addr == 0 && ldt_info.limit == 0) { +- if (oldmode || LDT_empty(&ldt_info)) { +- entry_1 = 0; +- entry_2 = 0; +- goto install; +- } +- } +- +- entry_1 = LDT_entry_a(&ldt_info); +- entry_2 = LDT_entry_b(&ldt_info); +- if (oldmode) +- entry_2 &= ~(1 << 20); +- +- /* Install the new entry ... 
*/ +-install: +- write_ldt_entry(mm->context.ldt, ldt_info.entry_number, entry_1, entry_2); +- error = 0; +- +-out_unlock: +- mutex_unlock(&mm->context.lock); +-out: +- return error; +-} +- +-asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount) +-{ +- int ret = -ENOSYS; +- +- switch (func) { +- case 0: +- ret = read_ldt(ptr, bytecount); +- break; +- case 1: +- ret = write_ldt(ptr, bytecount, 1); +- break; +- case 2: +- ret = read_default_ldt(ptr, bytecount); +- break; +- case 0x11: +- ret = write_ldt(ptr, bytecount, 0); +- break; +- } +- return ret; +-} +diff --git a/arch/x86/kernel/ldt_64.c b/arch/x86/kernel/ldt_64.c +deleted file mode 100644 +index 60e57ab..0000000 +--- a/arch/x86/kernel/ldt_64.c ++++ /dev/null +@@ -1,250 +0,0 @@ +-/* +- * Copyright (C) 1992 Krishna Balasubramanian and Linus Torvalds +- * Copyright (C) 1999 Ingo Molnar +- * Copyright (C) 2002 Andi Kleen +- * +- * This handles calls from both 32bit and 64bit mode. +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-#include +-#include +-#include +-#include +-#include +- +-#ifdef CONFIG_SMP /* avoids "defined but not used" warnig */ +-static void flush_ldt(void *null) +-{ +- if (current->active_mm) +- load_LDT(¤t->active_mm->context); +-} +-#endif +- +-static int alloc_ldt(mm_context_t *pc, unsigned mincount, int reload) +-{ +- void *oldldt; +- void *newldt; +- unsigned oldsize; +- +- if (mincount <= (unsigned)pc->size) +- return 0; +- oldsize = pc->size; +- mincount = (mincount+511)&(~511); +- if (mincount*LDT_ENTRY_SIZE > PAGE_SIZE) +- newldt = vmalloc(mincount*LDT_ENTRY_SIZE); +- else +- newldt = kmalloc(mincount*LDT_ENTRY_SIZE, GFP_KERNEL); +- +- if (!newldt) +- return -ENOMEM; +- +- if (oldsize) +- memcpy(newldt, pc->ldt, oldsize*LDT_ENTRY_SIZE); +- oldldt = pc->ldt; +- memset(newldt+oldsize*LDT_ENTRY_SIZE, 0, (mincount-oldsize)*LDT_ENTRY_SIZE); +- wmb(); +- pc->ldt = newldt; +- wmb(); +- pc->size = mincount; +- wmb(); +- if (reload) { +-#ifdef CONFIG_SMP +- cpumask_t mask; +- +- preempt_disable(); +- mask = cpumask_of_cpu(smp_processor_id()); +- load_LDT(pc); +- if (!cpus_equal(current->mm->cpu_vm_mask, mask)) +- smp_call_function(flush_ldt, NULL, 1, 1); +- preempt_enable(); +-#else +- load_LDT(pc); +-#endif +- } +- if (oldsize) { +- if (oldsize*LDT_ENTRY_SIZE > PAGE_SIZE) +- vfree(oldldt); +- else +- kfree(oldldt); +- } +- return 0; +-} +- +-static inline int copy_ldt(mm_context_t *new, mm_context_t *old) +-{ +- int err = alloc_ldt(new, old->size, 0); +- if (err < 0) +- return err; +- memcpy(new->ldt, old->ldt, old->size*LDT_ENTRY_SIZE); +- return 0; +-} +- +-/* +- * we do not have to muck with descriptors here, that is +- * done in switch_mm() as needed. +- */ +-int init_new_context(struct task_struct *tsk, struct mm_struct *mm) +-{ +- struct mm_struct * old_mm; +- int retval = 0; +- +- mutex_init(&mm->context.lock); +- mm->context.size = 0; +- old_mm = current->mm; +- if (old_mm && old_mm->context.size > 0) { +- mutex_lock(&old_mm->context.lock); +- retval = copy_ldt(&mm->context, &old_mm->context); +- mutex_unlock(&old_mm->context.lock); +- } +- return retval; +-} +- +-/* +- * +- * Don't touch the LDT register - we're already in the next thread. 
+- */ +-void destroy_context(struct mm_struct *mm) +-{ +- if (mm->context.size) { +- if ((unsigned)mm->context.size*LDT_ENTRY_SIZE > PAGE_SIZE) +- vfree(mm->context.ldt); +- else +- kfree(mm->context.ldt); +- mm->context.size = 0; +- } +-} +- +-static int read_ldt(void __user * ptr, unsigned long bytecount) +-{ +- int err; +- unsigned long size; +- struct mm_struct * mm = current->mm; +- +- if (!mm->context.size) +- return 0; +- if (bytecount > LDT_ENTRY_SIZE*LDT_ENTRIES) +- bytecount = LDT_ENTRY_SIZE*LDT_ENTRIES; +- +- mutex_lock(&mm->context.lock); +- size = mm->context.size*LDT_ENTRY_SIZE; +- if (size > bytecount) +- size = bytecount; +- +- err = 0; +- if (copy_to_user(ptr, mm->context.ldt, size)) +- err = -EFAULT; +- mutex_unlock(&mm->context.lock); +- if (err < 0) +- goto error_return; +- if (size != bytecount) { +- /* zero-fill the rest */ +- if (clear_user(ptr+size, bytecount-size) != 0) { +- err = -EFAULT; +- goto error_return; +- } +- } +- return bytecount; +-error_return: +- return err; +-} +- +-static int read_default_ldt(void __user * ptr, unsigned long bytecount) +-{ +- /* Arbitrary number */ +- /* x86-64 default LDT is all zeros */ +- if (bytecount > 128) +- bytecount = 128; +- if (clear_user(ptr, bytecount)) +- return -EFAULT; +- return bytecount; +-} +- +-static int write_ldt(void __user * ptr, unsigned long bytecount, int oldmode) +-{ +- struct task_struct *me = current; +- struct mm_struct * mm = me->mm; +- __u32 entry_1, entry_2, *lp; +- int error; +- struct user_desc ldt_info; +- +- error = -EINVAL; +- +- if (bytecount != sizeof(ldt_info)) +- goto out; +- error = -EFAULT; +- if (copy_from_user(&ldt_info, ptr, bytecount)) +- goto out; +- +- error = -EINVAL; +- if (ldt_info.entry_number >= LDT_ENTRIES) +- goto out; +- if (ldt_info.contents == 3) { +- if (oldmode) +- goto out; +- if (ldt_info.seg_not_present == 0) +- goto out; +- } +- +- mutex_lock(&mm->context.lock); +- if (ldt_info.entry_number >= (unsigned)mm->context.size) { +- error = alloc_ldt(¤t->mm->context, ldt_info.entry_number+1, 1); +- if (error < 0) +- goto out_unlock; +- } +- +- lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->context.ldt); +- +- /* Allow LDTs to be cleared by the user. */ +- if (ldt_info.base_addr == 0 && ldt_info.limit == 0) { +- if (oldmode || LDT_empty(&ldt_info)) { +- entry_1 = 0; +- entry_2 = 0; +- goto install; +- } +- } +- +- entry_1 = LDT_entry_a(&ldt_info); +- entry_2 = LDT_entry_b(&ldt_info); +- if (oldmode) +- entry_2 &= ~(1 << 20); +- +- /* Install the new entry ... 
*/ +-install: +- *lp = entry_1; +- *(lp+1) = entry_2; +- error = 0; +- +-out_unlock: +- mutex_unlock(&mm->context.lock); +-out: +- return error; +-} +- +-asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount) +-{ +- int ret = -ENOSYS; +- +- switch (func) { +- case 0: +- ret = read_ldt(ptr, bytecount); +- break; +- case 1: +- ret = write_ldt(ptr, bytecount, 1); +- break; +- case 2: +- ret = read_default_ldt(ptr, bytecount); +- break; +- case 0x11: +- ret = write_ldt(ptr, bytecount, 0); +- break; +- } +- return ret; +-} +diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c +index 11b935f..c1cfd60 100644 +--- a/arch/x86/kernel/machine_kexec_32.c ++++ b/arch/x86/kernel/machine_kexec_32.c +@@ -32,7 +32,7 @@ static u32 kexec_pte1[1024] PAGE_ALIGNED; + + static void set_idt(void *newidt, __u16 limit) + { +- struct Xgt_desc_struct curidt; ++ struct desc_ptr curidt; + + /* ia32 supports unaliged loads & stores */ + curidt.size = limit; +@@ -44,7 +44,7 @@ static void set_idt(void *newidt, __u16 limit) + + static void set_gdt(void *newgdt, __u16 limit) + { +- struct Xgt_desc_struct curgdt; ++ struct desc_ptr curgdt; + + /* ia32 supports unaligned loads & stores */ + curgdt.size = limit; +diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c +index aa3d2c8..a1fef42 100644 +--- a/arch/x86/kernel/machine_kexec_64.c ++++ b/arch/x86/kernel/machine_kexec_64.c +@@ -234,10 +234,5 @@ NORET_TYPE void machine_kexec(struct kimage *image) + void arch_crash_save_vmcoreinfo(void) + { + VMCOREINFO_SYMBOL(init_level4_pgt); +- +-#ifdef CONFIG_ARCH_DISCONTIGMEM_ENABLE +- VMCOREINFO_SYMBOL(node_data); +- VMCOREINFO_LENGTH(node_data, MAX_NUMNODES); +-#endif + } + +diff --git a/arch/x86/kernel/mfgpt_32.c b/arch/x86/kernel/mfgpt_32.c +index 3960ab7..219f86e 100644 +--- a/arch/x86/kernel/mfgpt_32.c ++++ b/arch/x86/kernel/mfgpt_32.c +@@ -63,6 +63,21 @@ static int __init mfgpt_disable(char *s) + } + __setup("nomfgpt", mfgpt_disable); + ++/* Reset the MFGPT timers. This is required by some broken BIOSes which already ++ * do the same and leave the system in an unstable state. TinyBIOS 0.98 is ++ * affected at least (0.99 is OK with MFGPT workaround left to off). ++ */ ++static int __init mfgpt_fix(char *s) ++{ ++ u32 val, dummy; ++ ++ /* The following udocumented bit resets the MFGPT timers */ ++ val = 0xFF; dummy = 0; ++ wrmsr(0x5140002B, val, dummy); ++ return 1; ++} ++__setup("mfgptfix", mfgpt_fix); ++ + /* + * Check whether any MFGPTs are available for the kernel to use. 
In most + * cases, firmware that uses AMD's VSA code will claim all timers during diff --git a/arch/x86/kernel/microcode.c b/arch/x86/kernel/microcode.c -index 09c3152..40cfd54 100644 +index 09c3152..6ff447f 100644 --- a/arch/x86/kernel/microcode.c +++ b/arch/x86/kernel/microcode.c -@@ -436,7 +436,7 @@ static ssize_t microcode_write (struct file *file, const char __user *buf, size_ +@@ -244,8 +244,8 @@ static int microcode_sanity_check(void *mc) + return 0; + /* check extended signature checksum */ + for (i = 0; i < ext_sigcount; i++) { +- ext_sig = (struct extended_signature *)((void *)ext_header +- + EXT_HEADER_SIZE + EXT_SIGNATURE_SIZE * i); ++ ext_sig = (void *)ext_header + EXT_HEADER_SIZE + ++ EXT_SIGNATURE_SIZE * i; + sum = orig_sum + - (mc_header->sig + mc_header->pf + mc_header->cksum) + + (ext_sig->sig + ext_sig->pf + ext_sig->cksum); +@@ -279,11 +279,9 @@ static int get_maching_microcode(void *mc, int cpu) + if (total_size <= get_datasize(mc_header) + MC_HEADER_SIZE) + return 0; + +- ext_header = (struct extended_sigtable *)(mc + +- get_datasize(mc_header) + MC_HEADER_SIZE); ++ ext_header = mc + get_datasize(mc_header) + MC_HEADER_SIZE; + ext_sigcount = ext_header->count; +- ext_sig = (struct extended_signature *)((void *)ext_header +- + EXT_HEADER_SIZE); ++ ext_sig = (void *)ext_header + EXT_HEADER_SIZE; + for (i = 0; i < ext_sigcount; i++) { + if (microcode_update_match(cpu, mc_header, + ext_sig->sig, ext_sig->pf)) +@@ -436,7 +434,7 @@ static ssize_t microcode_write (struct file *file, const char __user *buf, size_ return -EINVAL; } @@ -135430,7 +160410,7 @@ index 09c3152..40cfd54 100644 mutex_lock(µcode_mutex); user_buffer = (void __user *) buf; -@@ -447,7 +447,7 @@ static ssize_t microcode_write (struct file *file, const char __user *buf, size_ +@@ -447,7 +445,7 @@ static ssize_t microcode_write (struct file *file, const char __user *buf, size_ ret = (ssize_t)len; mutex_unlock(µcode_mutex); @@ -135439,7 +160419,16 @@ index 09c3152..40cfd54 100644 return ret; } -@@ -658,14 +658,14 @@ static ssize_t reload_store(struct sys_device *dev, const char *buf, size_t sz) +@@ -539,7 +537,7 @@ static int cpu_request_microcode(int cpu) + pr_debug("ucode data file %s load failed\n", name); + return error; + } +- buf = (void *)firmware->data; ++ buf = firmware->data; + size = firmware->size; + while ((offset = get_next_ucode_from_buffer(&mc, buf, size, offset)) + > 0) { +@@ -658,14 +656,14 @@ static ssize_t reload_store(struct sys_device *dev, const char *buf, size_t sz) old = current->cpus_allowed; @@ -135456,7 +160445,7 @@ index 09c3152..40cfd54 100644 set_cpus_allowed(current, old); } if (err) -@@ -817,9 +817,9 @@ static int __init microcode_init (void) +@@ -817,9 +815,9 @@ static int __init microcode_init (void) return PTR_ERR(microcode_pdev); } @@ -135468,7 +160457,7 @@ index 09c3152..40cfd54 100644 if (error) { microcode_dev_exit(); platform_device_unregister(microcode_pdev); -@@ -839,9 +839,9 @@ static void __exit microcode_exit (void) +@@ -839,9 +837,9 @@ static void __exit microcode_exit (void) unregister_hotcpu_notifier(&mc_cpu_notifier); @@ -135480,6 +160469,178 @@ index 09c3152..40cfd54 100644 platform_device_unregister(microcode_pdev); } +diff --git a/arch/x86/kernel/mpparse_32.c b/arch/x86/kernel/mpparse_32.c +index 7a05a7f..67009cd 100644 +--- a/arch/x86/kernel/mpparse_32.c ++++ b/arch/x86/kernel/mpparse_32.c +@@ -68,7 +68,7 @@ unsigned int def_to_bigsmp = 0; + /* Processor that is doing the boot up */ + unsigned int boot_cpu_physical_apicid = -1U; + /* Internal processor 
count */ +-unsigned int __cpuinitdata num_processors; ++unsigned int num_processors; + + /* Bitmask of physically existing CPUs */ + physid_mask_t phys_cpu_present_map; +@@ -258,7 +258,7 @@ static void __init MP_ioapic_info (struct mpc_config_ioapic *m) + if (!(m->mpc_flags & MPC_APIC_USABLE)) + return; + +- printk(KERN_INFO "I/O APIC #%d Version %d at 0x%lX.\n", ++ printk(KERN_INFO "I/O APIC #%d Version %d at 0x%X.\n", + m->mpc_apicid, m->mpc_apicver, m->mpc_apicaddr); + if (nr_ioapics >= MAX_IO_APICS) { + printk(KERN_CRIT "Max # of I/O APICs (%d) exceeded (found %d).\n", +@@ -405,9 +405,9 @@ static int __init smp_read_mpc(struct mp_config_table *mpc) + + mps_oem_check(mpc, oem, str); + +- printk("APIC at: 0x%lX\n",mpc->mpc_lapic); ++ printk("APIC at: 0x%X\n", mpc->mpc_lapic); + +- /* ++ /* + * Save the local APIC address (it might be non-default) -- but only + * if we're not using ACPI. + */ +@@ -721,7 +721,7 @@ static int __init smp_scan_config (unsigned long base, unsigned long length) + unsigned long *bp = phys_to_virt(base); + struct intel_mp_floating *mpf; + +- Dprintk("Scan SMP from %p for %ld bytes.\n", bp,length); ++ printk(KERN_INFO "Scan SMP from %p for %ld bytes.\n", bp,length); + if (sizeof(*mpf) != 16) + printk("Error: MPF size\n"); + +@@ -734,8 +734,8 @@ static int __init smp_scan_config (unsigned long base, unsigned long length) + || (mpf->mpf_specification == 4)) ) { + + smp_found_config = 1; +- printk(KERN_INFO "found SMP MP-table at %08lx\n", +- virt_to_phys(mpf)); ++ printk(KERN_INFO "found SMP MP-table at [%p] %08lx\n", ++ mpf, virt_to_phys(mpf)); + reserve_bootmem(virt_to_phys(mpf), PAGE_SIZE); + if (mpf->mpf_physptr) { + /* +@@ -918,14 +918,14 @@ void __init mp_register_ioapic(u8 id, u32 address, u32 gsi_base) + */ + mp_ioapic_routing[idx].apic_id = mp_ioapics[idx].mpc_apicid; + mp_ioapic_routing[idx].gsi_base = gsi_base; +- mp_ioapic_routing[idx].gsi_end = gsi_base + ++ mp_ioapic_routing[idx].gsi_end = gsi_base + + io_apic_get_redir_entries(idx); + +- printk("IOAPIC[%d]: apic_id %d, version %d, address 0x%lx, " +- "GSI %d-%d\n", idx, mp_ioapics[idx].mpc_apicid, +- mp_ioapics[idx].mpc_apicver, mp_ioapics[idx].mpc_apicaddr, +- mp_ioapic_routing[idx].gsi_base, +- mp_ioapic_routing[idx].gsi_end); ++ printk("IOAPIC[%d]: apic_id %d, version %d, address 0x%x, " ++ "GSI %d-%d\n", idx, mp_ioapics[idx].mpc_apicid, ++ mp_ioapics[idx].mpc_apicver, mp_ioapics[idx].mpc_apicaddr, ++ mp_ioapic_routing[idx].gsi_base, ++ mp_ioapic_routing[idx].gsi_end); + } + + void __init +@@ -1041,15 +1041,16 @@ void __init mp_config_acpi_legacy_irqs (void) + } + + #define MAX_GSI_NUM 4096 ++#define IRQ_COMPRESSION_START 64 + + int mp_register_gsi(u32 gsi, int triggering, int polarity) + { + int ioapic = -1; + int ioapic_pin = 0; + int idx, bit = 0; +- static int pci_irq = 16; ++ static int pci_irq = IRQ_COMPRESSION_START; + /* +- * Mapping between Global System Interrups, which ++ * Mapping between Global System Interrupts, which + * represent all possible interrupts, and IRQs + * assigned to actual devices. + */ +@@ -1086,12 +1087,16 @@ int mp_register_gsi(u32 gsi, int triggering, int polarity) + if ((1<= 64, use IRQ compression ++ */ ++ if ((gsi >= IRQ_COMPRESSION_START) ++ && (triggering == ACPI_LEVEL_SENSITIVE)) { + /* + * For PCI devices assign IRQs in order, avoiding gaps + * due to unused I/O APIC pins. 
+diff --git a/arch/x86/kernel/mpparse_64.c b/arch/x86/kernel/mpparse_64.c +index ef4aab1..72ab140 100644 +--- a/arch/x86/kernel/mpparse_64.c ++++ b/arch/x86/kernel/mpparse_64.c +@@ -60,14 +60,18 @@ unsigned int boot_cpu_id = -1U; + EXPORT_SYMBOL(boot_cpu_id); + + /* Internal processor count */ +-unsigned int num_processors __cpuinitdata = 0; ++unsigned int num_processors; + + unsigned disabled_cpus __cpuinitdata; + + /* Bitmask of physically existing CPUs */ + physid_mask_t phys_cpu_present_map = PHYSID_MASK_NONE; + +-u8 bios_cpu_apicid[NR_CPUS] = { [0 ... NR_CPUS-1] = BAD_APICID }; ++u16 x86_bios_cpu_apicid_init[NR_CPUS] __initdata ++ = { [0 ... NR_CPUS-1] = BAD_APICID }; ++void *x86_bios_cpu_apicid_early_ptr; ++DEFINE_PER_CPU(u16, x86_bios_cpu_apicid) = BAD_APICID; ++EXPORT_PER_CPU_SYMBOL(x86_bios_cpu_apicid); + + + /* +@@ -118,24 +122,22 @@ static void __cpuinit MP_processor_info(struct mpc_config_processor *m) + physid_set(m->mpc_apicid, phys_cpu_present_map); + if (m->mpc_cpuflag & CPU_BOOTPROCESSOR) { + /* +- * bios_cpu_apicid is required to have processors listed ++ * x86_bios_cpu_apicid is required to have processors listed + * in same order as logical cpu numbers. Hence the first + * entry is BSP, and so on. + */ + cpu = 0; + } +- bios_cpu_apicid[cpu] = m->mpc_apicid; +- /* +- * We get called early in the the start_kernel initialization +- * process when the per_cpu data area is not yet setup, so we +- * use a static array that is removed after the per_cpu data +- * area is created. +- */ +- if (x86_cpu_to_apicid_ptr) { +- u8 *x86_cpu_to_apicid = (u8 *)x86_cpu_to_apicid_ptr; +- x86_cpu_to_apicid[cpu] = m->mpc_apicid; ++ /* are we being called early in kernel startup? */ ++ if (x86_cpu_to_apicid_early_ptr) { ++ u16 *cpu_to_apicid = x86_cpu_to_apicid_early_ptr; ++ u16 *bios_cpu_apicid = x86_bios_cpu_apicid_early_ptr; ++ ++ cpu_to_apicid[cpu] = m->mpc_apicid; ++ bios_cpu_apicid[cpu] = m->mpc_apicid; + } else { + per_cpu(x86_cpu_to_apicid, cpu) = m->mpc_apicid; ++ per_cpu(x86_bios_cpu_apicid, cpu) = m->mpc_apicid; + } + + cpu_set(cpu, cpu_possible_map); diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c index ee6eba4..21f6e3c 100644 --- a/arch/x86/kernel/msr.c @@ -135504,10 +160665,50 @@ index ee6eba4..21f6e3c 100644 return err ? NOTIFY_BAD : NOTIFY_OK; } diff --git a/arch/x86/kernel/nmi_32.c b/arch/x86/kernel/nmi_32.c -index 852db29..4f4bfd3 100644 +index 852db29..edd4136 100644 --- a/arch/x86/kernel/nmi_32.c +++ b/arch/x86/kernel/nmi_32.c -@@ -176,7 +176,7 @@ static int lapic_nmi_resume(struct sys_device *dev) +@@ -51,13 +51,13 @@ static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu); + + static int endflag __initdata = 0; + ++#ifdef CONFIG_SMP + /* The performance counters used by NMI_LOCAL_APIC don't trigger when + * the CPU is idle. To make sure the NMI watchdog really ticks on all + * CPUs during the test make them busy. + */ + static __init void nmi_cpu_busy(void *data) + { +-#ifdef CONFIG_SMP + local_irq_enable_in_hardirq(); + /* Intentionally don't use cpu_relax here. This is + to make sure that the performance counter really ticks, +@@ -67,8 +67,8 @@ static __init void nmi_cpu_busy(void *data) + care if they get somewhat less cycles. */ + while (endflag == 0) + mb(); +-#endif + } ++#endif + + static int __init check_nmi_watchdog(void) + { +@@ -87,11 +87,13 @@ static int __init check_nmi_watchdog(void) + + printk(KERN_INFO "Testing NMI watchdog ... 
"); + ++#ifdef CONFIG_SMP + if (nmi_watchdog == NMI_LOCAL_APIC) + smp_call_function(nmi_cpu_busy, (void *)&endflag, 0, 0); ++#endif + + for_each_possible_cpu(cpu) +- prev_nmi_count[cpu] = per_cpu(irq_stat, cpu).__nmi_count; ++ prev_nmi_count[cpu] = nmi_count(cpu); + local_irq_enable(); + mdelay((20*1000)/nmi_hz); // wait 20 ticks + +@@ -176,7 +178,7 @@ static int lapic_nmi_resume(struct sys_device *dev) static struct sysdev_class nmi_sysclass = { @@ -135516,11 +160717,158 @@ index 852db29..4f4bfd3 100644 .resume = lapic_nmi_resume, .suspend = lapic_nmi_suspend, }; +@@ -237,10 +239,10 @@ void acpi_nmi_disable(void) + on_each_cpu(__acpi_nmi_disable, NULL, 0, 1); + } + +-void setup_apic_nmi_watchdog (void *unused) ++void setup_apic_nmi_watchdog(void *unused) + { + if (__get_cpu_var(wd_enabled)) +- return; ++ return; + + /* cheap hack to support suspend/resume */ + /* if cpu0 is not active neither should the other cpus */ +@@ -329,7 +331,7 @@ __kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) + unsigned int sum; + int touched = 0; + int cpu = smp_processor_id(); +- int rc=0; ++ int rc = 0; + + /* check for other users first */ + if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) diff --git a/arch/x86/kernel/nmi_64.c b/arch/x86/kernel/nmi_64.c -index 4253c4e..c3d1476 100644 +index 4253c4e..fb99484 100644 --- a/arch/x86/kernel/nmi_64.c +++ b/arch/x86/kernel/nmi_64.c -@@ -211,7 +211,7 @@ static int lapic_nmi_resume(struct sys_device *dev) +@@ -39,7 +39,7 @@ static cpumask_t backtrace_mask = CPU_MASK_NONE; + * 0: the lapic NMI watchdog is disabled, but can be enabled + */ + atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */ +-int panic_on_timeout; ++static int panic_on_timeout; + + unsigned int nmi_watchdog = NMI_DEFAULT; + static unsigned int nmi_hz = HZ; +@@ -78,22 +78,22 @@ static __init void nmi_cpu_busy(void *data) + } + #endif + +-int __init check_nmi_watchdog (void) ++int __init check_nmi_watchdog(void) + { +- int *counts; ++ int *prev_nmi_count; + int cpu; + +- if ((nmi_watchdog == NMI_NONE) || (nmi_watchdog == NMI_DISABLED)) ++ if ((nmi_watchdog == NMI_NONE) || (nmi_watchdog == NMI_DISABLED)) + return 0; + + if (!atomic_read(&nmi_active)) + return 0; + +- counts = kmalloc(NR_CPUS * sizeof(int), GFP_KERNEL); +- if (!counts) ++ prev_nmi_count = kmalloc(NR_CPUS * sizeof(int), GFP_KERNEL); ++ if (!prev_nmi_count) + return -1; + +- printk(KERN_INFO "testing NMI watchdog ... "); ++ printk(KERN_INFO "Testing NMI watchdog ... 
"); + + #ifdef CONFIG_SMP + if (nmi_watchdog == NMI_LOCAL_APIC) +@@ -101,30 +101,29 @@ int __init check_nmi_watchdog (void) + #endif + + for (cpu = 0; cpu < NR_CPUS; cpu++) +- counts[cpu] = cpu_pda(cpu)->__nmi_count; ++ prev_nmi_count[cpu] = cpu_pda(cpu)->__nmi_count; + local_irq_enable(); + mdelay((20*1000)/nmi_hz); // wait 20 ticks + + for_each_online_cpu(cpu) { + if (!per_cpu(wd_enabled, cpu)) + continue; +- if (cpu_pda(cpu)->__nmi_count - counts[cpu] <= 5) { ++ if (cpu_pda(cpu)->__nmi_count - prev_nmi_count[cpu] <= 5) { + printk(KERN_WARNING "WARNING: CPU#%d: NMI " + "appears to be stuck (%d->%d)!\n", +- cpu, +- counts[cpu], +- cpu_pda(cpu)->__nmi_count); ++ cpu, ++ prev_nmi_count[cpu], ++ cpu_pda(cpu)->__nmi_count); + per_cpu(wd_enabled, cpu) = 0; + atomic_dec(&nmi_active); + } + } ++ endflag = 1; + if (!atomic_read(&nmi_active)) { +- kfree(counts); ++ kfree(prev_nmi_count); + atomic_set(&nmi_active, -1); +- endflag = 1; + return -1; + } +- endflag = 1; + printk("OK.\n"); + + /* now that we know it works we can reduce NMI frequency to +@@ -132,11 +131,11 @@ int __init check_nmi_watchdog (void) + if (nmi_watchdog == NMI_LOCAL_APIC) + nmi_hz = lapic_adjust_nmi_hz(1); + +- kfree(counts); ++ kfree(prev_nmi_count); + return 0; + } + +-int __init setup_nmi_watchdog(char *str) ++static int __init setup_nmi_watchdog(char *str) + { + int nmi; + +@@ -159,34 +158,6 @@ int __init setup_nmi_watchdog(char *str) + + __setup("nmi_watchdog=", setup_nmi_watchdog); + +- +-static void __acpi_nmi_disable(void *__unused) +-{ +- apic_write(APIC_LVT0, APIC_DM_NMI | APIC_LVT_MASKED); +-} +- +-/* +- * Disable timer based NMIs on all CPUs: +- */ +-void acpi_nmi_disable(void) +-{ +- if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) +- on_each_cpu(__acpi_nmi_disable, NULL, 0, 1); +-} +- +-static void __acpi_nmi_enable(void *__unused) +-{ +- apic_write(APIC_LVT0, APIC_DM_NMI); +-} +- +-/* +- * Enable timer based NMIs on all CPUs: +- */ +-void acpi_nmi_enable(void) +-{ +- if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) +- on_each_cpu(__acpi_nmi_enable, NULL, 0, 1); +-} + #ifdef CONFIG_PM + + static int nmi_pm_active; /* nmi_active before suspend */ +@@ -211,13 +182,13 @@ static int lapic_nmi_resume(struct sys_device *dev) } static struct sysdev_class nmi_sysclass = { @@ -135529,11 +160877,9819 @@ index 4253c4e..c3d1476 100644 .resume = lapic_nmi_resume, .suspend = lapic_nmi_suspend, }; + + static struct sys_device device_lapic_nmi = { +- .id = 0, ++ .id = 0, + .cls = &nmi_sysclass, + }; + +@@ -231,7 +202,7 @@ static int __init init_lapic_nmi_sysfs(void) + if (nmi_watchdog != NMI_LOCAL_APIC) + return 0; + +- if ( atomic_read(&nmi_active) < 0 ) ++ if (atomic_read(&nmi_active) < 0) + return 0; + + error = sysdev_class_register(&nmi_sysclass); +@@ -244,9 +215,37 @@ late_initcall(init_lapic_nmi_sysfs); + + #endif /* CONFIG_PM */ + ++static void __acpi_nmi_enable(void *__unused) ++{ ++ apic_write(APIC_LVT0, APIC_DM_NMI); ++} ++ ++/* ++ * Enable timer based NMIs on all CPUs: ++ */ ++void acpi_nmi_enable(void) ++{ ++ if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) ++ on_each_cpu(__acpi_nmi_enable, NULL, 0, 1); ++} ++ ++static void __acpi_nmi_disable(void *__unused) ++{ ++ apic_write(APIC_LVT0, APIC_DM_NMI | APIC_LVT_MASKED); ++} ++ ++/* ++ * Disable timer based NMIs on all CPUs: ++ */ ++void acpi_nmi_disable(void) ++{ ++ if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC) ++ on_each_cpu(__acpi_nmi_disable, NULL, 0, 1); ++} ++ + void setup_apic_nmi_watchdog(void *unused) + { +- 
if (__get_cpu_var(wd_enabled) == 1) ++ if (__get_cpu_var(wd_enabled)) + return; + + /* cheap hack to support suspend/resume */ +@@ -311,8 +310,9 @@ void touch_nmi_watchdog(void) + } + } + +- touch_softlockup_watchdog(); ++ touch_softlockup_watchdog(); + } ++EXPORT_SYMBOL(touch_nmi_watchdog); + + int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason) + { +@@ -479,4 +479,3 @@ void __trigger_all_cpu_backtrace(void) + + EXPORT_SYMBOL(nmi_active); + EXPORT_SYMBOL(nmi_watchdog); +-EXPORT_SYMBOL(touch_nmi_watchdog); +diff --git a/arch/x86/kernel/numaq_32.c b/arch/x86/kernel/numaq_32.c +index 9000d82..e65281b 100644 +--- a/arch/x86/kernel/numaq_32.c ++++ b/arch/x86/kernel/numaq_32.c +@@ -82,7 +82,7 @@ static int __init numaq_tsc_disable(void) + { + if (num_online_nodes() > 1) { + printk(KERN_DEBUG "NUMAQ: disabling TSC\n"); +- tsc_disable = 1; ++ setup_clear_cpu_cap(X86_FEATURE_TSC); + } + return 0; + } +diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c +new file mode 100644 +index 0000000..075962c +--- /dev/null ++++ b/arch/x86/kernel/paravirt.c +@@ -0,0 +1,440 @@ ++/* Paravirtualization interfaces ++ Copyright (C) 2006 Rusty Russell IBM Corporation ++ ++ This program is free software; you can redistribute it and/or modify ++ it under the terms of the GNU General Public License as published by ++ the Free Software Foundation; either version 2 of the License, or ++ (at your option) any later version. ++ ++ This program is distributed in the hope that it will be useful, ++ but WITHOUT ANY WARRANTY; without even the implied warranty of ++ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++ GNU General Public License for more details. ++ ++ You should have received a copy of the GNU General Public License ++ along with this program; if not, write to the Free Software ++ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA ++ ++ 2007 - x86_64 support added by Glauber de Oliveira Costa, Red Hat Inc ++*/ ++ ++#include ++#include ++#include ++#include ++#include ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++/* nop stub */ ++void _paravirt_nop(void) ++{ ++} ++ ++static void __init default_banner(void) ++{ ++ printk(KERN_INFO "Booting paravirtualized kernel on %s\n", ++ pv_info.name); ++} ++ ++char *memory_setup(void) ++{ ++ return pv_init_ops.memory_setup(); ++} ++ ++/* Simple instruction patching code. */ ++#define DEF_NATIVE(ops, name, code) \ ++ extern const char start_##ops##_##name[], end_##ops##_##name[]; \ ++ asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":") ++ ++/* Undefined instruction for dealing with missing ops pointers. 
*/ ++static const unsigned char ud2a[] = { 0x0f, 0x0b }; ++ ++unsigned paravirt_patch_nop(void) ++{ ++ return 0; ++} ++ ++unsigned paravirt_patch_ignore(unsigned len) ++{ ++ return len; ++} ++ ++struct branch { ++ unsigned char opcode; ++ u32 delta; ++} __attribute__((packed)); ++ ++unsigned paravirt_patch_call(void *insnbuf, ++ const void *target, u16 tgt_clobbers, ++ unsigned long addr, u16 site_clobbers, ++ unsigned len) ++{ ++ struct branch *b = insnbuf; ++ unsigned long delta = (unsigned long)target - (addr+5); ++ ++ if (tgt_clobbers & ~site_clobbers) ++ return len; /* target would clobber too much for this site */ ++ if (len < 5) ++ return len; /* call too long for patch site */ ++ ++ b->opcode = 0xe8; /* call */ ++ b->delta = delta; ++ BUILD_BUG_ON(sizeof(*b) != 5); ++ ++ return 5; ++} ++ ++unsigned paravirt_patch_jmp(void *insnbuf, const void *target, ++ unsigned long addr, unsigned len) ++{ ++ struct branch *b = insnbuf; ++ unsigned long delta = (unsigned long)target - (addr+5); ++ ++ if (len < 5) ++ return len; /* call too long for patch site */ ++ ++ b->opcode = 0xe9; /* jmp */ ++ b->delta = delta; ++ ++ return 5; ++} ++ ++/* Neat trick to map patch type back to the call within the ++ * corresponding structure. */ ++static void *get_call_destination(u8 type) ++{ ++ struct paravirt_patch_template tmpl = { ++ .pv_init_ops = pv_init_ops, ++ .pv_time_ops = pv_time_ops, ++ .pv_cpu_ops = pv_cpu_ops, ++ .pv_irq_ops = pv_irq_ops, ++ .pv_apic_ops = pv_apic_ops, ++ .pv_mmu_ops = pv_mmu_ops, ++ }; ++ return *((void **)&tmpl + type); ++} ++ ++unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf, ++ unsigned long addr, unsigned len) ++{ ++ void *opfunc = get_call_destination(type); ++ unsigned ret; ++ ++ if (opfunc == NULL) ++ /* If there's no function, patch it with a ud2a (BUG) */ ++ ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a)); ++ else if (opfunc == paravirt_nop) ++ /* If the operation is a nop, then nop the callsite */ ++ ret = paravirt_patch_nop(); ++ else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) || ++ type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_syscall_ret)) ++ /* If operation requires a jmp, then jmp */ ++ ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len); ++ else ++ /* Otherwise call the function; assume target could ++ clobber any caller-save reg */ ++ ret = paravirt_patch_call(insnbuf, opfunc, CLBR_ANY, ++ addr, clobbers, len); ++ ++ return ret; ++} ++ ++unsigned paravirt_patch_insns(void *insnbuf, unsigned len, ++ const char *start, const char *end) ++{ ++ unsigned insn_len = end - start; ++ ++ if (insn_len > len || start == NULL) ++ insn_len = len; ++ else ++ memcpy(insnbuf, start, insn_len); ++ ++ return insn_len; ++} ++ ++void init_IRQ(void) ++{ ++ pv_irq_ops.init_IRQ(); ++} ++ ++static void native_flush_tlb(void) ++{ ++ __native_flush_tlb(); ++} ++ ++/* ++ * Global pages have to be flushed a bit differently. Not a real ++ * performance problem because this does not happen often. 
++ */ ++static void native_flush_tlb_global(void) ++{ ++ __native_flush_tlb_global(); ++} ++ ++static void native_flush_tlb_single(unsigned long addr) ++{ ++ __native_flush_tlb_single(addr); ++} ++ ++/* These are in entry.S */ ++extern void native_iret(void); ++extern void native_irq_enable_syscall_ret(void); ++ ++static int __init print_banner(void) ++{ ++ pv_init_ops.banner(); ++ return 0; ++} ++core_initcall(print_banner); ++ ++static struct resource reserve_ioports = { ++ .start = 0, ++ .end = IO_SPACE_LIMIT, ++ .name = "paravirt-ioport", ++ .flags = IORESOURCE_IO | IORESOURCE_BUSY, ++}; ++ ++static struct resource reserve_iomem = { ++ .start = 0, ++ .end = -1, ++ .name = "paravirt-iomem", ++ .flags = IORESOURCE_MEM | IORESOURCE_BUSY, ++}; ++ ++/* ++ * Reserve the whole legacy IO space to prevent any legacy drivers ++ * from wasting time probing for their hardware. This is a fairly ++ * brute-force approach to disabling all non-virtual drivers. ++ * ++ * Note that this must be called very early to have any effect. ++ */ ++int paravirt_disable_iospace(void) ++{ ++ int ret; ++ ++ ret = request_resource(&ioport_resource, &reserve_ioports); ++ if (ret == 0) { ++ ret = request_resource(&iomem_resource, &reserve_iomem); ++ if (ret) ++ release_resource(&reserve_ioports); ++ } ++ ++ return ret; ++} ++ ++static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode) = PARAVIRT_LAZY_NONE; ++ ++static inline void enter_lazy(enum paravirt_lazy_mode mode) ++{ ++ BUG_ON(__get_cpu_var(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE); ++ BUG_ON(preemptible()); ++ ++ __get_cpu_var(paravirt_lazy_mode) = mode; ++} ++ ++void paravirt_leave_lazy(enum paravirt_lazy_mode mode) ++{ ++ BUG_ON(__get_cpu_var(paravirt_lazy_mode) != mode); ++ BUG_ON(preemptible()); ++ ++ __get_cpu_var(paravirt_lazy_mode) = PARAVIRT_LAZY_NONE; ++} ++ ++void paravirt_enter_lazy_mmu(void) ++{ ++ enter_lazy(PARAVIRT_LAZY_MMU); ++} ++ ++void paravirt_leave_lazy_mmu(void) ++{ ++ paravirt_leave_lazy(PARAVIRT_LAZY_MMU); ++} ++ ++void paravirt_enter_lazy_cpu(void) ++{ ++ enter_lazy(PARAVIRT_LAZY_CPU); ++} ++ ++void paravirt_leave_lazy_cpu(void) ++{ ++ paravirt_leave_lazy(PARAVIRT_LAZY_CPU); ++} ++ ++enum paravirt_lazy_mode paravirt_get_lazy_mode(void) ++{ ++ return __get_cpu_var(paravirt_lazy_mode); ++} ++ ++struct pv_info pv_info = { ++ .name = "bare hardware", ++ .paravirt_enabled = 0, ++ .kernel_rpl = 0, ++ .shared_kernel_pmd = 1, /* Only used when CONFIG_X86_PAE is set */ ++}; ++ ++struct pv_init_ops pv_init_ops = { ++ .patch = native_patch, ++ .banner = default_banner, ++ .arch_setup = paravirt_nop, ++ .memory_setup = machine_specific_memory_setup, ++}; ++ ++struct pv_time_ops pv_time_ops = { ++ .time_init = hpet_time_init, ++ .get_wallclock = native_get_wallclock, ++ .set_wallclock = native_set_wallclock, ++ .sched_clock = native_sched_clock, ++ .get_cpu_khz = native_calculate_cpu_khz, ++}; ++ ++struct pv_irq_ops pv_irq_ops = { ++ .init_IRQ = native_init_IRQ, ++ .save_fl = native_save_fl, ++ .restore_fl = native_restore_fl, ++ .irq_disable = native_irq_disable, ++ .irq_enable = native_irq_enable, ++ .safe_halt = native_safe_halt, ++ .halt = native_halt, ++}; ++ ++struct pv_cpu_ops pv_cpu_ops = { ++ .cpuid = native_cpuid, ++ .get_debugreg = native_get_debugreg, ++ .set_debugreg = native_set_debugreg, ++ .clts = native_clts, ++ .read_cr0 = native_read_cr0, ++ .write_cr0 = native_write_cr0, ++ .read_cr4 = native_read_cr4, ++ .read_cr4_safe = native_read_cr4_safe, ++ .write_cr4 = native_write_cr4, ++#ifdef CONFIG_X86_64 ++ .read_cr8 = 
native_read_cr8, ++ .write_cr8 = native_write_cr8, ++#endif ++ .wbinvd = native_wbinvd, ++ .read_msr = native_read_msr_safe, ++ .write_msr = native_write_msr_safe, ++ .read_tsc = native_read_tsc, ++ .read_pmc = native_read_pmc, ++ .read_tscp = native_read_tscp, ++ .load_tr_desc = native_load_tr_desc, ++ .set_ldt = native_set_ldt, ++ .load_gdt = native_load_gdt, ++ .load_idt = native_load_idt, ++ .store_gdt = native_store_gdt, ++ .store_idt = native_store_idt, ++ .store_tr = native_store_tr, ++ .load_tls = native_load_tls, ++ .write_ldt_entry = native_write_ldt_entry, ++ .write_gdt_entry = native_write_gdt_entry, ++ .write_idt_entry = native_write_idt_entry, ++ .load_sp0 = native_load_sp0, ++ ++ .irq_enable_syscall_ret = native_irq_enable_syscall_ret, ++ .iret = native_iret, ++ .swapgs = native_swapgs, ++ ++ .set_iopl_mask = native_set_iopl_mask, ++ .io_delay = native_io_delay, ++ ++ .lazy_mode = { ++ .enter = paravirt_nop, ++ .leave = paravirt_nop, ++ }, ++}; ++ ++struct pv_apic_ops pv_apic_ops = { ++#ifdef CONFIG_X86_LOCAL_APIC ++ .apic_write = native_apic_write, ++ .apic_write_atomic = native_apic_write_atomic, ++ .apic_read = native_apic_read, ++ .setup_boot_clock = setup_boot_APIC_clock, ++ .setup_secondary_clock = setup_secondary_APIC_clock, ++ .startup_ipi_hook = paravirt_nop, ++#endif ++}; ++ ++struct pv_mmu_ops pv_mmu_ops = { ++#ifndef CONFIG_X86_64 ++ .pagetable_setup_start = native_pagetable_setup_start, ++ .pagetable_setup_done = native_pagetable_setup_done, ++#endif ++ ++ .read_cr2 = native_read_cr2, ++ .write_cr2 = native_write_cr2, ++ .read_cr3 = native_read_cr3, ++ .write_cr3 = native_write_cr3, ++ ++ .flush_tlb_user = native_flush_tlb, ++ .flush_tlb_kernel = native_flush_tlb_global, ++ .flush_tlb_single = native_flush_tlb_single, ++ .flush_tlb_others = native_flush_tlb_others, ++ ++ .alloc_pt = paravirt_nop, ++ .alloc_pd = paravirt_nop, ++ .alloc_pd_clone = paravirt_nop, ++ .release_pt = paravirt_nop, ++ .release_pd = paravirt_nop, ++ ++ .set_pte = native_set_pte, ++ .set_pte_at = native_set_pte_at, ++ .set_pmd = native_set_pmd, ++ .pte_update = paravirt_nop, ++ .pte_update_defer = paravirt_nop, ++ ++#ifdef CONFIG_HIGHPTE ++ .kmap_atomic_pte = kmap_atomic, ++#endif ++ ++#if PAGETABLE_LEVELS >= 3 ++#ifdef CONFIG_X86_PAE ++ .set_pte_atomic = native_set_pte_atomic, ++ .set_pte_present = native_set_pte_present, ++ .pte_clear = native_pte_clear, ++ .pmd_clear = native_pmd_clear, ++#endif ++ .set_pud = native_set_pud, ++ .pmd_val = native_pmd_val, ++ .make_pmd = native_make_pmd, ++ ++#if PAGETABLE_LEVELS == 4 ++ .pud_val = native_pud_val, ++ .make_pud = native_make_pud, ++ .set_pgd = native_set_pgd, ++#endif ++#endif /* PAGETABLE_LEVELS >= 3 */ ++ ++ .pte_val = native_pte_val, ++ .pgd_val = native_pgd_val, ++ ++ .make_pte = native_make_pte, ++ .make_pgd = native_make_pgd, ++ ++ .dup_mmap = paravirt_nop, ++ .exit_mmap = paravirt_nop, ++ .activate_mm = paravirt_nop, ++ ++ .lazy_mode = { ++ .enter = paravirt_nop, ++ .leave = paravirt_nop, ++ }, ++}; ++ ++EXPORT_SYMBOL_GPL(pv_time_ops); ++EXPORT_SYMBOL (pv_cpu_ops); ++EXPORT_SYMBOL (pv_mmu_ops); ++EXPORT_SYMBOL_GPL(pv_apic_ops); ++EXPORT_SYMBOL_GPL(pv_info); ++EXPORT_SYMBOL (pv_irq_ops); +diff --git a/arch/x86/kernel/paravirt_32.c b/arch/x86/kernel/paravirt_32.c +deleted file mode 100644 +index f500079..0000000 +--- a/arch/x86/kernel/paravirt_32.c ++++ /dev/null +@@ -1,472 +0,0 @@ +-/* Paravirtualization interfaces +- Copyright (C) 2006 Rusty Russell IBM Corporation +- +- This program is free software; you can redistribute it and/or 
modify +- it under the terms of the GNU General Public License as published by +- the Free Software Foundation; either version 2 of the License, or +- (at your option) any later version. +- +- This program is distributed in the hope that it will be useful, +- but WITHOUT ANY WARRANTY; without even the implied warranty of +- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +- GNU General Public License for more details. +- +- You should have received a copy of the GNU General Public License +- along with this program; if not, write to the Free Software +- Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +-*/ +-#include +-#include +-#include +-#include +-#include +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-/* nop stub */ +-void _paravirt_nop(void) +-{ +-} +- +-static void __init default_banner(void) +-{ +- printk(KERN_INFO "Booting paravirtualized kernel on %s\n", +- pv_info.name); +-} +- +-char *memory_setup(void) +-{ +- return pv_init_ops.memory_setup(); +-} +- +-/* Simple instruction patching code. */ +-#define DEF_NATIVE(ops, name, code) \ +- extern const char start_##ops##_##name[], end_##ops##_##name[]; \ +- asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":") +- +-DEF_NATIVE(pv_irq_ops, irq_disable, "cli"); +-DEF_NATIVE(pv_irq_ops, irq_enable, "sti"); +-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf"); +-DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax"); +-DEF_NATIVE(pv_cpu_ops, iret, "iret"); +-DEF_NATIVE(pv_cpu_ops, irq_enable_sysexit, "sti; sysexit"); +-DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax"); +-DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3"); +-DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax"); +-DEF_NATIVE(pv_cpu_ops, clts, "clts"); +-DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc"); +- +-/* Undefined instruction for dealing with missing ops pointers. 
*/ +-static const unsigned char ud2a[] = { 0x0f, 0x0b }; +- +-static unsigned native_patch(u8 type, u16 clobbers, void *ibuf, +- unsigned long addr, unsigned len) +-{ +- const unsigned char *start, *end; +- unsigned ret; +- +- switch(type) { +-#define SITE(ops, x) \ +- case PARAVIRT_PATCH(ops.x): \ +- start = start_##ops##_##x; \ +- end = end_##ops##_##x; \ +- goto patch_site +- +- SITE(pv_irq_ops, irq_disable); +- SITE(pv_irq_ops, irq_enable); +- SITE(pv_irq_ops, restore_fl); +- SITE(pv_irq_ops, save_fl); +- SITE(pv_cpu_ops, iret); +- SITE(pv_cpu_ops, irq_enable_sysexit); +- SITE(pv_mmu_ops, read_cr2); +- SITE(pv_mmu_ops, read_cr3); +- SITE(pv_mmu_ops, write_cr3); +- SITE(pv_cpu_ops, clts); +- SITE(pv_cpu_ops, read_tsc); +-#undef SITE +- +- patch_site: +- ret = paravirt_patch_insns(ibuf, len, start, end); +- break; +- +- default: +- ret = paravirt_patch_default(type, clobbers, ibuf, addr, len); +- break; +- } +- +- return ret; +-} +- +-unsigned paravirt_patch_nop(void) +-{ +- return 0; +-} +- +-unsigned paravirt_patch_ignore(unsigned len) +-{ +- return len; +-} +- +-struct branch { +- unsigned char opcode; +- u32 delta; +-} __attribute__((packed)); +- +-unsigned paravirt_patch_call(void *insnbuf, +- const void *target, u16 tgt_clobbers, +- unsigned long addr, u16 site_clobbers, +- unsigned len) +-{ +- struct branch *b = insnbuf; +- unsigned long delta = (unsigned long)target - (addr+5); +- +- if (tgt_clobbers & ~site_clobbers) +- return len; /* target would clobber too much for this site */ +- if (len < 5) +- return len; /* call too long for patch site */ +- +- b->opcode = 0xe8; /* call */ +- b->delta = delta; +- BUILD_BUG_ON(sizeof(*b) != 5); +- +- return 5; +-} +- +-unsigned paravirt_patch_jmp(void *insnbuf, const void *target, +- unsigned long addr, unsigned len) +-{ +- struct branch *b = insnbuf; +- unsigned long delta = (unsigned long)target - (addr+5); +- +- if (len < 5) +- return len; /* call too long for patch site */ +- +- b->opcode = 0xe9; /* jmp */ +- b->delta = delta; +- +- return 5; +-} +- +-/* Neat trick to map patch type back to the call within the +- * corresponding structure. 
*/ +-static void *get_call_destination(u8 type) +-{ +- struct paravirt_patch_template tmpl = { +- .pv_init_ops = pv_init_ops, +- .pv_time_ops = pv_time_ops, +- .pv_cpu_ops = pv_cpu_ops, +- .pv_irq_ops = pv_irq_ops, +- .pv_apic_ops = pv_apic_ops, +- .pv_mmu_ops = pv_mmu_ops, +- }; +- return *((void **)&tmpl + type); +-} +- +-unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf, +- unsigned long addr, unsigned len) +-{ +- void *opfunc = get_call_destination(type); +- unsigned ret; +- +- if (opfunc == NULL) +- /* If there's no function, patch it with a ud2a (BUG) */ +- ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a)); +- else if (opfunc == paravirt_nop) +- /* If the operation is a nop, then nop the callsite */ +- ret = paravirt_patch_nop(); +- else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) || +- type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit)) +- /* If operation requires a jmp, then jmp */ +- ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len); +- else +- /* Otherwise call the function; assume target could +- clobber any caller-save reg */ +- ret = paravirt_patch_call(insnbuf, opfunc, CLBR_ANY, +- addr, clobbers, len); +- +- return ret; +-} +- +-unsigned paravirt_patch_insns(void *insnbuf, unsigned len, +- const char *start, const char *end) +-{ +- unsigned insn_len = end - start; +- +- if (insn_len > len || start == NULL) +- insn_len = len; +- else +- memcpy(insnbuf, start, insn_len); +- +- return insn_len; +-} +- +-void init_IRQ(void) +-{ +- pv_irq_ops.init_IRQ(); +-} +- +-static void native_flush_tlb(void) +-{ +- __native_flush_tlb(); +-} +- +-/* +- * Global pages have to be flushed a bit differently. Not a real +- * performance problem because this does not happen often. +- */ +-static void native_flush_tlb_global(void) +-{ +- __native_flush_tlb_global(); +-} +- +-static void native_flush_tlb_single(unsigned long addr) +-{ +- __native_flush_tlb_single(addr); +-} +- +-/* These are in entry.S */ +-extern void native_iret(void); +-extern void native_irq_enable_sysexit(void); +- +-static int __init print_banner(void) +-{ +- pv_init_ops.banner(); +- return 0; +-} +-core_initcall(print_banner); +- +-static struct resource reserve_ioports = { +- .start = 0, +- .end = IO_SPACE_LIMIT, +- .name = "paravirt-ioport", +- .flags = IORESOURCE_IO | IORESOURCE_BUSY, +-}; +- +-static struct resource reserve_iomem = { +- .start = 0, +- .end = -1, +- .name = "paravirt-iomem", +- .flags = IORESOURCE_MEM | IORESOURCE_BUSY, +-}; +- +-/* +- * Reserve the whole legacy IO space to prevent any legacy drivers +- * from wasting time probing for their hardware. This is a fairly +- * brute-force approach to disabling all non-virtual drivers. +- * +- * Note that this must be called very early to have any effect. 
+- */ +-int paravirt_disable_iospace(void) +-{ +- int ret; +- +- ret = request_resource(&ioport_resource, &reserve_ioports); +- if (ret == 0) { +- ret = request_resource(&iomem_resource, &reserve_iomem); +- if (ret) +- release_resource(&reserve_ioports); +- } +- +- return ret; +-} +- +-static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode) = PARAVIRT_LAZY_NONE; +- +-static inline void enter_lazy(enum paravirt_lazy_mode mode) +-{ +- BUG_ON(x86_read_percpu(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE); +- BUG_ON(preemptible()); +- +- x86_write_percpu(paravirt_lazy_mode, mode); +-} +- +-void paravirt_leave_lazy(enum paravirt_lazy_mode mode) +-{ +- BUG_ON(x86_read_percpu(paravirt_lazy_mode) != mode); +- BUG_ON(preemptible()); +- +- x86_write_percpu(paravirt_lazy_mode, PARAVIRT_LAZY_NONE); +-} +- +-void paravirt_enter_lazy_mmu(void) +-{ +- enter_lazy(PARAVIRT_LAZY_MMU); +-} +- +-void paravirt_leave_lazy_mmu(void) +-{ +- paravirt_leave_lazy(PARAVIRT_LAZY_MMU); +-} +- +-void paravirt_enter_lazy_cpu(void) +-{ +- enter_lazy(PARAVIRT_LAZY_CPU); +-} +- +-void paravirt_leave_lazy_cpu(void) +-{ +- paravirt_leave_lazy(PARAVIRT_LAZY_CPU); +-} +- +-enum paravirt_lazy_mode paravirt_get_lazy_mode(void) +-{ +- return x86_read_percpu(paravirt_lazy_mode); +-} +- +-struct pv_info pv_info = { +- .name = "bare hardware", +- .paravirt_enabled = 0, +- .kernel_rpl = 0, +- .shared_kernel_pmd = 1, /* Only used when CONFIG_X86_PAE is set */ +-}; +- +-struct pv_init_ops pv_init_ops = { +- .patch = native_patch, +- .banner = default_banner, +- .arch_setup = paravirt_nop, +- .memory_setup = machine_specific_memory_setup, +-}; +- +-struct pv_time_ops pv_time_ops = { +- .time_init = hpet_time_init, +- .get_wallclock = native_get_wallclock, +- .set_wallclock = native_set_wallclock, +- .sched_clock = native_sched_clock, +- .get_cpu_khz = native_calculate_cpu_khz, +-}; +- +-struct pv_irq_ops pv_irq_ops = { +- .init_IRQ = native_init_IRQ, +- .save_fl = native_save_fl, +- .restore_fl = native_restore_fl, +- .irq_disable = native_irq_disable, +- .irq_enable = native_irq_enable, +- .safe_halt = native_safe_halt, +- .halt = native_halt, +-}; +- +-struct pv_cpu_ops pv_cpu_ops = { +- .cpuid = native_cpuid, +- .get_debugreg = native_get_debugreg, +- .set_debugreg = native_set_debugreg, +- .clts = native_clts, +- .read_cr0 = native_read_cr0, +- .write_cr0 = native_write_cr0, +- .read_cr4 = native_read_cr4, +- .read_cr4_safe = native_read_cr4_safe, +- .write_cr4 = native_write_cr4, +- .wbinvd = native_wbinvd, +- .read_msr = native_read_msr_safe, +- .write_msr = native_write_msr_safe, +- .read_tsc = native_read_tsc, +- .read_pmc = native_read_pmc, +- .load_tr_desc = native_load_tr_desc, +- .set_ldt = native_set_ldt, +- .load_gdt = native_load_gdt, +- .load_idt = native_load_idt, +- .store_gdt = native_store_gdt, +- .store_idt = native_store_idt, +- .store_tr = native_store_tr, +- .load_tls = native_load_tls, +- .write_ldt_entry = write_dt_entry, +- .write_gdt_entry = write_dt_entry, +- .write_idt_entry = write_dt_entry, +- .load_esp0 = native_load_esp0, +- +- .irq_enable_sysexit = native_irq_enable_sysexit, +- .iret = native_iret, +- +- .set_iopl_mask = native_set_iopl_mask, +- .io_delay = native_io_delay, +- +- .lazy_mode = { +- .enter = paravirt_nop, +- .leave = paravirt_nop, +- }, +-}; +- +-struct pv_apic_ops pv_apic_ops = { +-#ifdef CONFIG_X86_LOCAL_APIC +- .apic_write = native_apic_write, +- .apic_write_atomic = native_apic_write_atomic, +- .apic_read = native_apic_read, +- .setup_boot_clock = setup_boot_APIC_clock, +- 
.setup_secondary_clock = setup_secondary_APIC_clock, +- .startup_ipi_hook = paravirt_nop, +-#endif +-}; +- +-struct pv_mmu_ops pv_mmu_ops = { +- .pagetable_setup_start = native_pagetable_setup_start, +- .pagetable_setup_done = native_pagetable_setup_done, +- +- .read_cr2 = native_read_cr2, +- .write_cr2 = native_write_cr2, +- .read_cr3 = native_read_cr3, +- .write_cr3 = native_write_cr3, +- +- .flush_tlb_user = native_flush_tlb, +- .flush_tlb_kernel = native_flush_tlb_global, +- .flush_tlb_single = native_flush_tlb_single, +- .flush_tlb_others = native_flush_tlb_others, +- +- .alloc_pt = paravirt_nop, +- .alloc_pd = paravirt_nop, +- .alloc_pd_clone = paravirt_nop, +- .release_pt = paravirt_nop, +- .release_pd = paravirt_nop, +- +- .set_pte = native_set_pte, +- .set_pte_at = native_set_pte_at, +- .set_pmd = native_set_pmd, +- .pte_update = paravirt_nop, +- .pte_update_defer = paravirt_nop, +- +-#ifdef CONFIG_HIGHPTE +- .kmap_atomic_pte = kmap_atomic, +-#endif +- +-#ifdef CONFIG_X86_PAE +- .set_pte_atomic = native_set_pte_atomic, +- .set_pte_present = native_set_pte_present, +- .set_pud = native_set_pud, +- .pte_clear = native_pte_clear, +- .pmd_clear = native_pmd_clear, +- +- .pmd_val = native_pmd_val, +- .make_pmd = native_make_pmd, +-#endif +- +- .pte_val = native_pte_val, +- .pgd_val = native_pgd_val, +- +- .make_pte = native_make_pte, +- .make_pgd = native_make_pgd, +- +- .dup_mmap = paravirt_nop, +- .exit_mmap = paravirt_nop, +- .activate_mm = paravirt_nop, +- +- .lazy_mode = { +- .enter = paravirt_nop, +- .leave = paravirt_nop, +- }, +-}; +- +-EXPORT_SYMBOL_GPL(pv_time_ops); +-EXPORT_SYMBOL (pv_cpu_ops); +-EXPORT_SYMBOL (pv_mmu_ops); +-EXPORT_SYMBOL_GPL(pv_apic_ops); +-EXPORT_SYMBOL_GPL(pv_info); +-EXPORT_SYMBOL (pv_irq_ops); +diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c +new file mode 100644 +index 0000000..82fc5fc +--- /dev/null ++++ b/arch/x86/kernel/paravirt_patch_32.c +@@ -0,0 +1,49 @@ ++#include ++ ++DEF_NATIVE(pv_irq_ops, irq_disable, "cli"); ++DEF_NATIVE(pv_irq_ops, irq_enable, "sti"); ++DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf"); ++DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax"); ++DEF_NATIVE(pv_cpu_ops, iret, "iret"); ++DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "sti; sysexit"); ++DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax"); ++DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3"); ++DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax"); ++DEF_NATIVE(pv_cpu_ops, clts, "clts"); ++DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc"); ++ ++unsigned native_patch(u8 type, u16 clobbers, void *ibuf, ++ unsigned long addr, unsigned len) ++{ ++ const unsigned char *start, *end; ++ unsigned ret; ++ ++#define PATCH_SITE(ops, x) \ ++ case PARAVIRT_PATCH(ops.x): \ ++ start = start_##ops##_##x; \ ++ end = end_##ops##_##x; \ ++ goto patch_site ++ switch(type) { ++ PATCH_SITE(pv_irq_ops, irq_disable); ++ PATCH_SITE(pv_irq_ops, irq_enable); ++ PATCH_SITE(pv_irq_ops, restore_fl); ++ PATCH_SITE(pv_irq_ops, save_fl); ++ PATCH_SITE(pv_cpu_ops, iret); ++ PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret); ++ PATCH_SITE(pv_mmu_ops, read_cr2); ++ PATCH_SITE(pv_mmu_ops, read_cr3); ++ PATCH_SITE(pv_mmu_ops, write_cr3); ++ PATCH_SITE(pv_cpu_ops, clts); ++ PATCH_SITE(pv_cpu_ops, read_tsc); ++ ++ patch_site: ++ ret = paravirt_patch_insns(ibuf, len, start, end); ++ break; ++ ++ default: ++ ret = paravirt_patch_default(type, clobbers, ibuf, addr, len); ++ break; ++ } ++#undef PATCH_SITE ++ return ret; ++} +diff --git a/arch/x86/kernel/paravirt_patch_64.c 
b/arch/x86/kernel/paravirt_patch_64.c +new file mode 100644 +index 0000000..7d904e1 +--- /dev/null ++++ b/arch/x86/kernel/paravirt_patch_64.c +@@ -0,0 +1,57 @@ ++#include ++#include ++#include ++ ++DEF_NATIVE(pv_irq_ops, irq_disable, "cli"); ++DEF_NATIVE(pv_irq_ops, irq_enable, "sti"); ++DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq"); ++DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax"); ++DEF_NATIVE(pv_cpu_ops, iret, "iretq"); ++DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax"); ++DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax"); ++DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3"); ++DEF_NATIVE(pv_mmu_ops, flush_tlb_single, "invlpg (%rdi)"); ++DEF_NATIVE(pv_cpu_ops, clts, "clts"); ++DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd"); ++ ++/* the three commands give us more control to how to return from a syscall */ ++DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "movq %gs:" __stringify(pda_oldrsp) ", %rsp; swapgs; sysretq;"); ++DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs"); ++ ++unsigned native_patch(u8 type, u16 clobbers, void *ibuf, ++ unsigned long addr, unsigned len) ++{ ++ const unsigned char *start, *end; ++ unsigned ret; ++ ++#define PATCH_SITE(ops, x) \ ++ case PARAVIRT_PATCH(ops.x): \ ++ start = start_##ops##_##x; \ ++ end = end_##ops##_##x; \ ++ goto patch_site ++ switch(type) { ++ PATCH_SITE(pv_irq_ops, restore_fl); ++ PATCH_SITE(pv_irq_ops, save_fl); ++ PATCH_SITE(pv_irq_ops, irq_enable); ++ PATCH_SITE(pv_irq_ops, irq_disable); ++ PATCH_SITE(pv_cpu_ops, iret); ++ PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret); ++ PATCH_SITE(pv_cpu_ops, swapgs); ++ PATCH_SITE(pv_mmu_ops, read_cr2); ++ PATCH_SITE(pv_mmu_ops, read_cr3); ++ PATCH_SITE(pv_mmu_ops, write_cr3); ++ PATCH_SITE(pv_cpu_ops, clts); ++ PATCH_SITE(pv_mmu_ops, flush_tlb_single); ++ PATCH_SITE(pv_cpu_ops, wbinvd); ++ ++ patch_site: ++ ret = paravirt_patch_insns(ibuf, len, start, end); ++ break; ++ ++ default: ++ ret = paravirt_patch_default(type, clobbers, ibuf, addr, len); ++ break; ++ } ++#undef PATCH_SITE ++ return ret; ++} +diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c +index 6bf1f71..21f34db 100644 +--- a/arch/x86/kernel/pci-calgary_64.c ++++ b/arch/x86/kernel/pci-calgary_64.c +@@ -30,7 +30,6 @@ + #include + #include + #include +-#include + #include + #include + #include +@@ -183,7 +182,7 @@ static struct calgary_bus_info bus_info[MAX_PHB_BUS_NUM] = { { NULL, 0, 0 }, }; + + /* enable this to stress test the chip's TCE cache */ + #ifdef CONFIG_IOMMU_DEBUG +-int debugging __read_mostly = 1; ++static int debugging = 1; + + static inline unsigned long verify_bit_range(unsigned long* bitmap, + int expected, unsigned long start, unsigned long end) +@@ -202,7 +201,7 @@ static inline unsigned long verify_bit_range(unsigned long* bitmap, + return ~0UL; + } + #else /* debugging is disabled */ +-int debugging __read_mostly = 0; ++static int debugging; + + static inline unsigned long verify_bit_range(unsigned long* bitmap, + int expected, unsigned long start, unsigned long end) +diff --git a/arch/x86/kernel/pci-dma_64.c b/arch/x86/kernel/pci-dma_64.c +index 5552d23..a82473d 100644 +--- a/arch/x86/kernel/pci-dma_64.c ++++ b/arch/x86/kernel/pci-dma_64.c +@@ -13,7 +13,6 @@ + #include + + int iommu_merge __read_mostly = 0; +-EXPORT_SYMBOL(iommu_merge); + + dma_addr_t bad_dma_address __read_mostly; + EXPORT_SYMBOL(bad_dma_address); +@@ -230,7 +229,7 @@ EXPORT_SYMBOL(dma_set_mask); + * See for the iommu kernel parameter + * documentation. 
+ */ +-__init int iommu_setup(char *p) ++static __init int iommu_setup(char *p) + { + iommu_merge = 1; + +diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c +index 06bcba5..4d5cc71 100644 +--- a/arch/x86/kernel/pci-gart_64.c ++++ b/arch/x86/kernel/pci-gart_64.c +@@ -1,12 +1,12 @@ + /* + * Dynamic DMA mapping support for AMD Hammer. +- * ++ * + * Use the integrated AGP GART in the Hammer northbridge as an IOMMU for PCI. + * This allows to use PCI devices that only support 32bit addresses on systems +- * with more than 4GB. ++ * with more than 4GB. + * + * See Documentation/DMA-mapping.txt for the interface specification. +- * ++ * + * Copyright 2002 Andi Kleen, SuSE Labs. + * Subject to the GNU General Public License v2 only. + */ +@@ -37,23 +37,26 @@ + #include + + static unsigned long iommu_bus_base; /* GART remapping area (physical) */ +-static unsigned long iommu_size; /* size of remapping area bytes */ ++static unsigned long iommu_size; /* size of remapping area bytes */ + static unsigned long iommu_pages; /* .. and in pages */ + +-static u32 *iommu_gatt_base; /* Remapping table */ ++static u32 *iommu_gatt_base; /* Remapping table */ + +-/* If this is disabled the IOMMU will use an optimized flushing strategy +- of only flushing when an mapping is reused. With it true the GART is flushed +- for every mapping. Problem is that doing the lazy flush seems to trigger +- bugs with some popular PCI cards, in particular 3ware (but has been also +- also seen with Qlogic at least). */ ++/* ++ * If this is disabled the IOMMU will use an optimized flushing strategy ++ * of only flushing when an mapping is reused. With it true the GART is ++ * flushed for every mapping. Problem is that doing the lazy flush seems ++ * to trigger bugs with some popular PCI cards, in particular 3ware (but ++ * has been also also seen with Qlogic at least). ++ */ + int iommu_fullflush = 1; + +-/* Allocation bitmap for the remapping area */ ++/* Allocation bitmap for the remapping area: */ + static DEFINE_SPINLOCK(iommu_bitmap_lock); +-static unsigned long *iommu_gart_bitmap; /* guarded by iommu_bitmap_lock */ ++/* Guarded by iommu_bitmap_lock: */ ++static unsigned long *iommu_gart_bitmap; + +-static u32 gart_unmapped_entry; ++static u32 gart_unmapped_entry; + + #define GPTE_VALID 1 + #define GPTE_COHERENT 2 +@@ -61,10 +64,10 @@ static u32 gart_unmapped_entry; + (((x) & 0xfffff000) | (((x) >> 32) << 4) | GPTE_VALID | GPTE_COHERENT) + #define GPTE_DECODE(x) (((x) & 0xfffff000) | (((u64)(x) & 0xff0) << 28)) + +-#define to_pages(addr,size) \ ++#define to_pages(addr, size) \ + (round_up(((addr) & ~PAGE_MASK) + (size), PAGE_SIZE) >> PAGE_SHIFT) + +-#define EMERGENCY_PAGES 32 /* = 128KB */ ++#define EMERGENCY_PAGES 32 /* = 128KB */ + + #ifdef CONFIG_AGP + #define AGPEXTERN extern +@@ -77,130 +80,152 @@ AGPEXTERN int agp_memory_reserved; + AGPEXTERN __u32 *agp_gatt_table; + + static unsigned long next_bit; /* protected by iommu_bitmap_lock */ +-static int need_flush; /* global flush state. set for each gart wrap */ ++static int need_flush; /* global flush state. 
set for each gart wrap */ + +-static unsigned long alloc_iommu(int size) +-{ ++static unsigned long alloc_iommu(int size) ++{ + unsigned long offset, flags; + +- spin_lock_irqsave(&iommu_bitmap_lock, flags); +- offset = find_next_zero_string(iommu_gart_bitmap,next_bit,iommu_pages,size); ++ spin_lock_irqsave(&iommu_bitmap_lock, flags); ++ offset = find_next_zero_string(iommu_gart_bitmap, next_bit, ++ iommu_pages, size); + if (offset == -1) { + need_flush = 1; +- offset = find_next_zero_string(iommu_gart_bitmap,0,iommu_pages,size); ++ offset = find_next_zero_string(iommu_gart_bitmap, 0, ++ iommu_pages, size); + } +- if (offset != -1) { +- set_bit_string(iommu_gart_bitmap, offset, size); +- next_bit = offset+size; +- if (next_bit >= iommu_pages) { ++ if (offset != -1) { ++ set_bit_string(iommu_gart_bitmap, offset, size); ++ next_bit = offset+size; ++ if (next_bit >= iommu_pages) { + next_bit = 0; + need_flush = 1; +- } +- } ++ } ++ } + if (iommu_fullflush) + need_flush = 1; +- spin_unlock_irqrestore(&iommu_bitmap_lock, flags); ++ spin_unlock_irqrestore(&iommu_bitmap_lock, flags); ++ + return offset; +-} ++} + + static void free_iommu(unsigned long offset, int size) +-{ ++{ + unsigned long flags; ++ + spin_lock_irqsave(&iommu_bitmap_lock, flags); + __clear_bit_string(iommu_gart_bitmap, offset, size); + spin_unlock_irqrestore(&iommu_bitmap_lock, flags); +-} ++} + +-/* ++/* + * Use global flush state to avoid races with multiple flushers. + */ + static void flush_gart(void) +-{ ++{ + unsigned long flags; ++ + spin_lock_irqsave(&iommu_bitmap_lock, flags); + if (need_flush) { + k8_flush_garts(); + need_flush = 0; +- } ++ } + spin_unlock_irqrestore(&iommu_bitmap_lock, flags); +-} ++} + + #ifdef CONFIG_IOMMU_LEAK + +-#define SET_LEAK(x) if (iommu_leak_tab) \ +- iommu_leak_tab[x] = __builtin_return_address(0); +-#define CLEAR_LEAK(x) if (iommu_leak_tab) \ +- iommu_leak_tab[x] = NULL; ++#define SET_LEAK(x) \ ++ do { \ ++ if (iommu_leak_tab) \ ++ iommu_leak_tab[x] = __builtin_return_address(0);\ ++ } while (0) ++ ++#define CLEAR_LEAK(x) \ ++ do { \ ++ if (iommu_leak_tab) \ ++ iommu_leak_tab[x] = NULL; \ ++ } while (0) + + /* Debugging aid for drivers that don't free their IOMMU tables */ +-static void **iommu_leak_tab; ++static void **iommu_leak_tab; + static int leak_trace; + static int iommu_leak_pages = 20; ++ + static void dump_leak(void) + { + int i; +- static int dump; +- if (dump || !iommu_leak_tab) return; ++ static int dump; ++ ++ if (dump || !iommu_leak_tab) ++ return; + dump = 1; +- show_stack(NULL,NULL); +- /* Very crude. dump some from the end of the table too */ +- printk("Dumping %d pages from end of IOMMU:\n", iommu_leak_pages); +- for (i = 0; i < iommu_leak_pages; i+=2) { +- printk("%lu: ", iommu_pages-i); +- printk_address((unsigned long) iommu_leak_tab[iommu_pages-i]); +- printk("%c", (i+1)%2 == 0 ? '\n' : ' '); +- } +- printk("\n"); ++ show_stack(NULL, NULL); ++ ++ /* Very crude. dump some from the end of the table too */ ++ printk(KERN_DEBUG "Dumping %d pages from end of IOMMU:\n", ++ iommu_leak_pages); ++ for (i = 0; i < iommu_leak_pages; i += 2) { ++ printk(KERN_DEBUG "%lu: ", iommu_pages-i); ++ printk_address((unsigned long) iommu_leak_tab[iommu_pages-i], 0); ++ printk(KERN_CONT "%c", (i+1)%2 == 0 ? 
'\n' : ' '); ++ } ++ printk(KERN_DEBUG "\n"); + } + #else +-#define SET_LEAK(x) +-#define CLEAR_LEAK(x) ++# define SET_LEAK(x) ++# define CLEAR_LEAK(x) + #endif + + static void iommu_full(struct device *dev, size_t size, int dir) + { +- /* ++ /* + * Ran out of IOMMU space for this operation. This is very bad. + * Unfortunately the drivers cannot handle this operation properly. +- * Return some non mapped prereserved space in the aperture and ++ * Return some non mapped prereserved space in the aperture and + * let the Northbridge deal with it. This will result in garbage + * in the IO operation. When the size exceeds the prereserved space +- * memory corruption will occur or random memory will be DMAed ++ * memory corruption will occur or random memory will be DMAed + * out. Hopefully no network devices use single mappings that big. +- */ +- +- printk(KERN_ERR +- "PCI-DMA: Out of IOMMU space for %lu bytes at device %s\n", +- size, dev->bus_id); ++ */ ++ ++ printk(KERN_ERR ++ "PCI-DMA: Out of IOMMU space for %lu bytes at device %s\n", ++ size, dev->bus_id); + + if (size > PAGE_SIZE*EMERGENCY_PAGES) { + if (dir == PCI_DMA_FROMDEVICE || dir == PCI_DMA_BIDIRECTIONAL) + panic("PCI-DMA: Memory would be corrupted\n"); +- if (dir == PCI_DMA_TODEVICE || dir == PCI_DMA_BIDIRECTIONAL) +- panic(KERN_ERR "PCI-DMA: Random memory would be DMAed\n"); +- } +- ++ if (dir == PCI_DMA_TODEVICE || dir == PCI_DMA_BIDIRECTIONAL) ++ panic(KERN_ERR ++ "PCI-DMA: Random memory would be DMAed\n"); ++ } + #ifdef CONFIG_IOMMU_LEAK +- dump_leak(); ++ dump_leak(); + #endif +-} ++} + +-static inline int need_iommu(struct device *dev, unsigned long addr, size_t size) +-{ ++static inline int ++need_iommu(struct device *dev, unsigned long addr, size_t size) ++{ + u64 mask = *dev->dma_mask; + int high = addr + size > mask; + int mmu = high; +- if (force_iommu) +- mmu = 1; +- return mmu; ++ ++ if (force_iommu) ++ mmu = 1; ++ ++ return mmu; + } + +-static inline int nonforced_iommu(struct device *dev, unsigned long addr, size_t size) +-{ ++static inline int ++nonforced_iommu(struct device *dev, unsigned long addr, size_t size) ++{ + u64 mask = *dev->dma_mask; + int high = addr + size > mask; + int mmu = high; +- return mmu; ++ ++ return mmu; + } + + /* Map a single continuous physical area into the IOMMU. 
+@@ -208,13 +233,14 @@ static inline int nonforced_iommu(struct device *dev, unsigned long addr, size_t + */ + static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem, + size_t size, int dir) +-{ ++{ + unsigned long npages = to_pages(phys_mem, size); + unsigned long iommu_page = alloc_iommu(npages); + int i; ++ + if (iommu_page == -1) { + if (!nonforced_iommu(dev, phys_mem, size)) +- return phys_mem; ++ return phys_mem; + if (panic_on_overflow) + panic("dma_map_area overflow %lu bytes\n", size); + iommu_full(dev, size, dir); +@@ -229,35 +255,39 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem, + return iommu_bus_base + iommu_page*PAGE_SIZE + (phys_mem & ~PAGE_MASK); + } + +-static dma_addr_t gart_map_simple(struct device *dev, char *buf, +- size_t size, int dir) ++static dma_addr_t ++gart_map_simple(struct device *dev, char *buf, size_t size, int dir) + { + dma_addr_t map = dma_map_area(dev, virt_to_bus(buf), size, dir); ++ + flush_gart(); ++ + return map; + } + + /* Map a single area into the IOMMU */ +-static dma_addr_t gart_map_single(struct device *dev, void *addr, size_t size, int dir) ++static dma_addr_t ++gart_map_single(struct device *dev, void *addr, size_t size, int dir) + { + unsigned long phys_mem, bus; + + if (!dev) + dev = &fallback_dev; + +- phys_mem = virt_to_phys(addr); ++ phys_mem = virt_to_phys(addr); + if (!need_iommu(dev, phys_mem, size)) +- return phys_mem; ++ return phys_mem; + + bus = gart_map_simple(dev, addr, size, dir); +- return bus; ++ ++ return bus; + } + + /* + * Free a DMA mapping. + */ + static void gart_unmap_single(struct device *dev, dma_addr_t dma_addr, +- size_t size, int direction) ++ size_t size, int direction) + { + unsigned long iommu_page; + int npages; +@@ -266,6 +296,7 @@ static void gart_unmap_single(struct device *dev, dma_addr_t dma_addr, + if (dma_addr < iommu_bus_base + EMERGENCY_PAGES*PAGE_SIZE || + dma_addr >= iommu_bus_base + iommu_size) + return; ++ + iommu_page = (dma_addr - iommu_bus_base)>>PAGE_SHIFT; + npages = to_pages(dma_addr, size); + for (i = 0; i < npages; i++) { +@@ -278,7 +309,8 @@ static void gart_unmap_single(struct device *dev, dma_addr_t dma_addr, + /* + * Wrapper for pci_unmap_single working with scatterlists. + */ +-static void gart_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, int dir) ++static void ++gart_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, int dir) + { + struct scatterlist *s; + int i; +@@ -303,12 +335,13 @@ static int dma_map_sg_nonforce(struct device *dev, struct scatterlist *sg, + + for_each_sg(sg, s, nents, i) { + unsigned long addr = sg_phys(s); +- if (nonforced_iommu(dev, addr, s->length)) { ++ ++ if (nonforced_iommu(dev, addr, s->length)) { + addr = dma_map_area(dev, addr, s->length, dir); +- if (addr == bad_dma_address) { +- if (i > 0) ++ if (addr == bad_dma_address) { ++ if (i > 0) + gart_unmap_sg(dev, sg, i, dir); +- nents = 0; ++ nents = 0; + sg[0].dma_length = 0; + break; + } +@@ -317,15 +350,16 @@ static int dma_map_sg_nonforce(struct device *dev, struct scatterlist *sg, + s->dma_length = s->length; + } + flush_gart(); ++ + return nents; + } + + /* Map multiple scatterlist entries continuous into the first. 
*/ + static int __dma_map_cont(struct scatterlist *start, int nelems, +- struct scatterlist *sout, unsigned long pages) ++ struct scatterlist *sout, unsigned long pages) + { + unsigned long iommu_start = alloc_iommu(pages); +- unsigned long iommu_page = iommu_start; ++ unsigned long iommu_page = iommu_start; + struct scatterlist *s; + int i; + +@@ -335,32 +369,33 @@ static int __dma_map_cont(struct scatterlist *start, int nelems, + for_each_sg(start, s, nelems, i) { + unsigned long pages, addr; + unsigned long phys_addr = s->dma_address; +- ++ + BUG_ON(s != start && s->offset); + if (s == start) { + sout->dma_address = iommu_bus_base; + sout->dma_address += iommu_page*PAGE_SIZE + s->offset; + sout->dma_length = s->length; +- } else { +- sout->dma_length += s->length; ++ } else { ++ sout->dma_length += s->length; + } + + addr = phys_addr; +- pages = to_pages(s->offset, s->length); +- while (pages--) { +- iommu_gatt_base[iommu_page] = GPTE_ENCODE(addr); ++ pages = to_pages(s->offset, s->length); ++ while (pages--) { ++ iommu_gatt_base[iommu_page] = GPTE_ENCODE(addr); + SET_LEAK(iommu_page); + addr += PAGE_SIZE; + iommu_page++; + } +- } +- BUG_ON(iommu_page - iommu_start != pages); ++ } ++ BUG_ON(iommu_page - iommu_start != pages); ++ + return 0; + } + +-static inline int dma_map_cont(struct scatterlist *start, int nelems, +- struct scatterlist *sout, +- unsigned long pages, int need) ++static inline int ++dma_map_cont(struct scatterlist *start, int nelems, struct scatterlist *sout, ++ unsigned long pages, int need) + { + if (!need) { + BUG_ON(nelems != 1); +@@ -370,22 +405,19 @@ static inline int dma_map_cont(struct scatterlist *start, int nelems, + } + return __dma_map_cont(start, nelems, sout, pages); + } +- ++ + /* + * DMA map all entries in a scatterlist. +- * Merge chunks that have page aligned sizes into a continuous mapping. ++ * Merge chunks that have page aligned sizes into a continuous mapping. + */ +-static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents, +- int dir) ++static int ++gart_map_sg(struct device *dev, struct scatterlist *sg, int nents, int dir) + { +- int i; +- int out; +- int start; +- unsigned long pages = 0; +- int need = 0, nextneed; + struct scatterlist *s, *ps, *start_sg, *sgmap; ++ int need = 0, nextneed, i, out, start; ++ unsigned long pages = 0; + +- if (nents == 0) ++ if (nents == 0) + return 0; + + if (!dev) +@@ -397,15 +429,19 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents, + ps = NULL; /* shut up gcc */ + for_each_sg(sg, s, nents, i) { + dma_addr_t addr = sg_phys(s); ++ + s->dma_address = addr; +- BUG_ON(s->length == 0); ++ BUG_ON(s->length == 0); + +- nextneed = need_iommu(dev, addr, s->length); ++ nextneed = need_iommu(dev, addr, s->length); + + /* Handle the previous not yet processed entries */ + if (i > start) { +- /* Can only merge when the last chunk ends on a page +- boundary and the new one doesn't have an offset. */ ++ /* ++ * Can only merge when the last chunk ends on a ++ * page boundary and the new one doesn't have an ++ * offset. 
++ */ + if (!iommu_merge || !nextneed || !need || s->offset || + (ps->offset + ps->length) % PAGE_SIZE) { + if (dma_map_cont(start_sg, i - start, sgmap, +@@ -436,6 +472,7 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents, + error: + flush_gart(); + gart_unmap_sg(dev, sg, out, dir); ++ + /* When it was forced or merged try again in a dumb way */ + if (force_iommu || iommu_merge) { + out = dma_map_sg_nonforce(dev, sg, nents, dir); +@@ -444,64 +481,68 @@ error: + } + if (panic_on_overflow) + panic("dma_map_sg: overflow on %lu pages\n", pages); ++ + iommu_full(dev, pages << PAGE_SHIFT, dir); + for_each_sg(sg, s, nents, i) + s->dma_address = bad_dma_address; + return 0; +-} ++} + + static int no_agp; + + static __init unsigned long check_iommu_size(unsigned long aper, u64 aper_size) +-{ +- unsigned long a; +- if (!iommu_size) { +- iommu_size = aper_size; +- if (!no_agp) +- iommu_size /= 2; +- } +- +- a = aper + iommu_size; ++{ ++ unsigned long a; ++ ++ if (!iommu_size) { ++ iommu_size = aper_size; ++ if (!no_agp) ++ iommu_size /= 2; ++ } ++ ++ a = aper + iommu_size; + iommu_size -= round_up(a, LARGE_PAGE_SIZE) - a; + +- if (iommu_size < 64*1024*1024) ++ if (iommu_size < 64*1024*1024) { + printk(KERN_WARNING +- "PCI-DMA: Warning: Small IOMMU %luMB. Consider increasing the AGP aperture in BIOS\n",iommu_size>>20); +- ++ "PCI-DMA: Warning: Small IOMMU %luMB." ++ " Consider increasing the AGP aperture in BIOS\n", ++ iommu_size >> 20); ++ } ++ + return iommu_size; +-} ++} + +-static __init unsigned read_aperture(struct pci_dev *dev, u32 *size) +-{ +- unsigned aper_size = 0, aper_base_32; ++static __init unsigned read_aperture(struct pci_dev *dev, u32 *size) ++{ ++ unsigned aper_size = 0, aper_base_32, aper_order; + u64 aper_base; +- unsigned aper_order; + +- pci_read_config_dword(dev, 0x94, &aper_base_32); ++ pci_read_config_dword(dev, 0x94, &aper_base_32); + pci_read_config_dword(dev, 0x90, &aper_order); +- aper_order = (aper_order >> 1) & 7; ++ aper_order = (aper_order >> 1) & 7; + +- aper_base = aper_base_32 & 0x7fff; ++ aper_base = aper_base_32 & 0x7fff; + aper_base <<= 25; + +- aper_size = (32 * 1024 * 1024) << aper_order; +- if (aper_base + aper_size > 0x100000000UL || !aper_size) ++ aper_size = (32 * 1024 * 1024) << aper_order; ++ if (aper_base + aper_size > 0x100000000UL || !aper_size) + aper_base = 0; + + *size = aper_size; + return aper_base; +-} ++} + +-/* ++/* + * Private Northbridge GATT initialization in case we cannot use the +- * AGP driver for some reason. ++ * AGP driver for some reason. 
+ */ + static __init int init_k8_gatt(struct agp_kern_info *info) +-{ ++{ ++ unsigned aper_size, gatt_size, new_aper_size; ++ unsigned aper_base, new_aper_base; + struct pci_dev *dev; + void *gatt; +- unsigned aper_base, new_aper_base; +- unsigned aper_size, gatt_size, new_aper_size; + int i; + + printk(KERN_INFO "PCI-DMA: Disabling AGP.\n"); +@@ -509,75 +550,75 @@ static __init int init_k8_gatt(struct agp_kern_info *info) + dev = NULL; + for (i = 0; i < num_k8_northbridges; i++) { + dev = k8_northbridges[i]; +- new_aper_base = read_aperture(dev, &new_aper_size); +- if (!new_aper_base) +- goto nommu; +- +- if (!aper_base) { ++ new_aper_base = read_aperture(dev, &new_aper_size); ++ if (!new_aper_base) ++ goto nommu; ++ ++ if (!aper_base) { + aper_size = new_aper_size; + aper_base = new_aper_base; +- } +- if (aper_size != new_aper_size || aper_base != new_aper_base) ++ } ++ if (aper_size != new_aper_size || aper_base != new_aper_base) + goto nommu; + } + if (!aper_base) +- goto nommu; ++ goto nommu; + info->aper_base = aper_base; +- info->aper_size = aper_size>>20; ++ info->aper_size = aper_size >> 20; + +- gatt_size = (aper_size >> PAGE_SHIFT) * sizeof(u32); +- gatt = (void *)__get_free_pages(GFP_KERNEL, get_order(gatt_size)); +- if (!gatt) ++ gatt_size = (aper_size >> PAGE_SHIFT) * sizeof(u32); ++ gatt = (void *)__get_free_pages(GFP_KERNEL, get_order(gatt_size)); ++ if (!gatt) + panic("Cannot allocate GATT table"); +- if (change_page_attr_addr((unsigned long)gatt, gatt_size >> PAGE_SHIFT, PAGE_KERNEL_NOCACHE)) ++ if (set_memory_uc((unsigned long)gatt, gatt_size >> PAGE_SHIFT)) + panic("Could not set GART PTEs to uncacheable pages"); +- global_flush_tlb(); + +- memset(gatt, 0, gatt_size); ++ memset(gatt, 0, gatt_size); + agp_gatt_table = gatt; + + for (i = 0; i < num_k8_northbridges; i++) { +- u32 ctl; +- u32 gatt_reg; ++ u32 gatt_reg; ++ u32 ctl; + + dev = k8_northbridges[i]; +- gatt_reg = __pa(gatt) >> 12; +- gatt_reg <<= 4; ++ gatt_reg = __pa(gatt) >> 12; ++ gatt_reg <<= 4; + pci_write_config_dword(dev, 0x98, gatt_reg); +- pci_read_config_dword(dev, 0x90, &ctl); ++ pci_read_config_dword(dev, 0x90, &ctl); + + ctl |= 1; + ctl &= ~((1<<4) | (1<<5)); + +- pci_write_config_dword(dev, 0x90, ctl); ++ pci_write_config_dword(dev, 0x90, ctl); + } + flush_gart(); +- +- printk("PCI-DMA: aperture base @ %x size %u KB\n",aper_base, aper_size>>10); ++ ++ printk(KERN_INFO "PCI-DMA: aperture base @ %x size %u KB\n", ++ aper_base, aper_size>>10); + return 0; + + nommu: +- /* Should not happen anymore */ ++ /* Should not happen anymore */ + printk(KERN_ERR "PCI-DMA: More than 4GB of RAM and no IOMMU\n" + KERN_ERR "PCI-DMA: 32bit PCI IO may malfunction.\n"); +- return -1; +-} ++ return -1; ++} + + extern int agp_amd64_init(void); + + static const struct dma_mapping_ops gart_dma_ops = { +- .mapping_error = NULL, +- .map_single = gart_map_single, +- .map_simple = gart_map_simple, +- .unmap_single = gart_unmap_single, +- .sync_single_for_cpu = NULL, +- .sync_single_for_device = NULL, +- .sync_single_range_for_cpu = NULL, +- .sync_single_range_for_device = NULL, +- .sync_sg_for_cpu = NULL, +- .sync_sg_for_device = NULL, +- .map_sg = gart_map_sg, +- .unmap_sg = gart_unmap_sg, ++ .mapping_error = NULL, ++ .map_single = gart_map_single, ++ .map_simple = gart_map_simple, ++ .unmap_single = gart_unmap_single, ++ .sync_single_for_cpu = NULL, ++ .sync_single_for_device = NULL, ++ .sync_single_range_for_cpu = NULL, ++ .sync_single_range_for_device = NULL, ++ .sync_sg_for_cpu = NULL, ++ .sync_sg_for_device = NULL, ++ 
.map_sg = gart_map_sg, ++ .unmap_sg = gart_unmap_sg, + }; + + void gart_iommu_shutdown(void) +@@ -588,23 +629,23 @@ void gart_iommu_shutdown(void) + if (no_agp && (dma_ops != &gart_dma_ops)) + return; + +- for (i = 0; i < num_k8_northbridges; i++) { +- u32 ctl; ++ for (i = 0; i < num_k8_northbridges; i++) { ++ u32 ctl; + +- dev = k8_northbridges[i]; +- pci_read_config_dword(dev, 0x90, &ctl); ++ dev = k8_northbridges[i]; ++ pci_read_config_dword(dev, 0x90, &ctl); + +- ctl &= ~1; ++ ctl &= ~1; + +- pci_write_config_dword(dev, 0x90, ctl); +- } ++ pci_write_config_dword(dev, 0x90, ctl); ++ } + } + + void __init gart_iommu_init(void) +-{ ++{ + struct agp_kern_info info; +- unsigned long aper_size; + unsigned long iommu_start; ++ unsigned long aper_size; + unsigned long scratch; + long i; + +@@ -614,14 +655,14 @@ void __init gart_iommu_init(void) + } + + #ifndef CONFIG_AGP_AMD64 +- no_agp = 1; ++ no_agp = 1; + #else + /* Makefile puts PCI initialization via subsys_initcall first. */ + /* Add other K8 AGP bridge drivers here */ +- no_agp = no_agp || +- (agp_amd64_init() < 0) || ++ no_agp = no_agp || ++ (agp_amd64_init() < 0) || + (agp_copy_info(agp_bridge, &info) < 0); +-#endif ++#endif + + if (swiotlb) + return; +@@ -643,77 +684,78 @@ void __init gart_iommu_init(void) + } + + printk(KERN_INFO "PCI-DMA: using GART IOMMU.\n"); +- aper_size = info.aper_size * 1024 * 1024; +- iommu_size = check_iommu_size(info.aper_base, aper_size); +- iommu_pages = iommu_size >> PAGE_SHIFT; +- +- iommu_gart_bitmap = (void*)__get_free_pages(GFP_KERNEL, +- get_order(iommu_pages/8)); +- if (!iommu_gart_bitmap) +- panic("Cannot allocate iommu bitmap\n"); ++ aper_size = info.aper_size * 1024 * 1024; ++ iommu_size = check_iommu_size(info.aper_base, aper_size); ++ iommu_pages = iommu_size >> PAGE_SHIFT; ++ ++ iommu_gart_bitmap = (void *) __get_free_pages(GFP_KERNEL, ++ get_order(iommu_pages/8)); ++ if (!iommu_gart_bitmap) ++ panic("Cannot allocate iommu bitmap\n"); + memset(iommu_gart_bitmap, 0, iommu_pages/8); + + #ifdef CONFIG_IOMMU_LEAK +- if (leak_trace) { +- iommu_leak_tab = (void *)__get_free_pages(GFP_KERNEL, ++ if (leak_trace) { ++ iommu_leak_tab = (void *)__get_free_pages(GFP_KERNEL, + get_order(iommu_pages*sizeof(void *))); +- if (iommu_leak_tab) +- memset(iommu_leak_tab, 0, iommu_pages * 8); ++ if (iommu_leak_tab) ++ memset(iommu_leak_tab, 0, iommu_pages * 8); + else +- printk("PCI-DMA: Cannot allocate leak trace area\n"); +- } ++ printk(KERN_DEBUG ++ "PCI-DMA: Cannot allocate leak trace area\n"); ++ } + #endif + +- /* ++ /* + * Out of IOMMU space handling. +- * Reserve some invalid pages at the beginning of the GART. +- */ +- set_bit_string(iommu_gart_bitmap, 0, EMERGENCY_PAGES); ++ * Reserve some invalid pages at the beginning of the GART. ++ */ ++ set_bit_string(iommu_gart_bitmap, 0, EMERGENCY_PAGES); + +- agp_memory_reserved = iommu_size; ++ agp_memory_reserved = iommu_size; + printk(KERN_INFO + "PCI-DMA: Reserving %luMB of IOMMU area in the AGP aperture\n", +- iommu_size>>20); ++ iommu_size >> 20); + +- iommu_start = aper_size - iommu_size; +- iommu_bus_base = info.aper_base + iommu_start; ++ iommu_start = aper_size - iommu_size; ++ iommu_bus_base = info.aper_base + iommu_start; + bad_dma_address = iommu_bus_base; + iommu_gatt_base = agp_gatt_table + (iommu_start>>PAGE_SHIFT); + +- /* ++ /* + * Unmap the IOMMU part of the GART. The alias of the page is + * always mapped with cache enabled and there is no full cache + * coherency across the GART remapping. 
The unmapping avoids + * automatic prefetches from the CPU allocating cache lines in + * there. All CPU accesses are done via the direct mapping to + * the backing memory. The GART address is only used by PCI +- * devices. ++ * devices. + */ + clear_kernel_mapping((unsigned long)__va(iommu_bus_base), iommu_size); + +- /* +- * Try to workaround a bug (thanks to BenH) +- * Set unmapped entries to a scratch page instead of 0. ++ /* ++ * Try to workaround a bug (thanks to BenH) ++ * Set unmapped entries to a scratch page instead of 0. + * Any prefetches that hit unmapped entries won't get an bus abort + * then. + */ +- scratch = get_zeroed_page(GFP_KERNEL); +- if (!scratch) ++ scratch = get_zeroed_page(GFP_KERNEL); ++ if (!scratch) + panic("Cannot allocate iommu scratch page"); + gart_unmapped_entry = GPTE_ENCODE(__pa(scratch)); +- for (i = EMERGENCY_PAGES; i < iommu_pages; i++) ++ for (i = EMERGENCY_PAGES; i < iommu_pages; i++) + iommu_gatt_base[i] = gart_unmapped_entry; + + flush_gart(); + dma_ops = &gart_dma_ops; +-} ++} + + void __init gart_parse_options(char *p) + { + int arg; + + #ifdef CONFIG_IOMMU_LEAK +- if (!strncmp(p,"leak",4)) { ++ if (!strncmp(p, "leak", 4)) { + leak_trace = 1; + p += 4; + if (*p == '=') ++p; +@@ -723,18 +765,18 @@ void __init gart_parse_options(char *p) + #endif + if (isdigit(*p) && get_option(&p, &arg)) + iommu_size = arg; +- if (!strncmp(p, "fullflush",8)) ++ if (!strncmp(p, "fullflush", 8)) + iommu_fullflush = 1; +- if (!strncmp(p, "nofullflush",11)) ++ if (!strncmp(p, "nofullflush", 11)) + iommu_fullflush = 0; +- if (!strncmp(p,"noagp",5)) ++ if (!strncmp(p, "noagp", 5)) + no_agp = 1; +- if (!strncmp(p, "noaperture",10)) ++ if (!strncmp(p, "noaperture", 10)) + fix_aperture = 0; + /* duplicated from pci-dma.c */ +- if (!strncmp(p,"force",5)) ++ if (!strncmp(p, "force", 5)) + gart_iommu_aperture_allowed = 1; +- if (!strncmp(p,"allowed",7)) ++ if (!strncmp(p, "allowed", 7)) + gart_iommu_aperture_allowed = 1; + if (!strncmp(p, "memaper", 7)) { + fallback_aper_force = 1; +diff --git a/arch/x86/kernel/pci-swiotlb_64.c b/arch/x86/kernel/pci-swiotlb_64.c +index 102866d..82a0a67 100644 +--- a/arch/x86/kernel/pci-swiotlb_64.c ++++ b/arch/x86/kernel/pci-swiotlb_64.c +@@ -10,7 +10,6 @@ + #include + + int swiotlb __read_mostly; +-EXPORT_SYMBOL(swiotlb); + + const struct dma_mapping_ops swiotlb_dma_ops = { + .mapping_error = swiotlb_dma_mapping_error, +diff --git a/arch/x86/kernel/pmtimer_64.c b/arch/x86/kernel/pmtimer_64.c +index ae8f912..b112406 100644 +--- a/arch/x86/kernel/pmtimer_64.c ++++ b/arch/x86/kernel/pmtimer_64.c +@@ -19,13 +19,13 @@ + #include + #include + #include ++#include ++ + #include + #include + #include + #include + +-#define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */ +- + static inline u32 cyc2us(u32 cycles) + { + /* The Power Management Timer ticks at 3.579545 ticks per microsecond. 
+diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c +index 46d391d..968371a 100644 +--- a/arch/x86/kernel/process_32.c ++++ b/arch/x86/kernel/process_32.c +@@ -55,6 +55,7 @@ + + #include + #include ++#include + + asmlinkage void ret_from_fork(void) __asm__("ret_from_fork"); + +@@ -74,7 +75,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_number); + */ + unsigned long thread_saved_pc(struct task_struct *tsk) + { +- return ((unsigned long *)tsk->thread.esp)[3]; ++ return ((unsigned long *)tsk->thread.sp)[3]; + } + + /* +@@ -113,10 +114,19 @@ void default_idle(void) + smp_mb(); + + local_irq_disable(); +- if (!need_resched()) ++ if (!need_resched()) { ++ ktime_t t0, t1; ++ u64 t0n, t1n; ++ ++ t0 = ktime_get(); ++ t0n = ktime_to_ns(t0); + safe_halt(); /* enables interrupts racelessly */ +- else +- local_irq_enable(); ++ local_irq_disable(); ++ t1 = ktime_get(); ++ t1n = ktime_to_ns(t1); ++ sched_clock_idle_wakeup_event(t1n - t0n); ++ } ++ local_irq_enable(); + current_thread_info()->status |= TS_POLLING; + } else { + /* loop is done by the caller */ +@@ -132,7 +142,7 @@ EXPORT_SYMBOL(default_idle); + * to poll the ->work.need_resched flag instead of waiting for the + * cross-CPU IPI to arrive. Use this option with caution. + */ +-static void poll_idle (void) ++static void poll_idle(void) + { + cpu_relax(); + } +@@ -188,6 +198,9 @@ void cpu_idle(void) + rmb(); + idle = pm_idle; + ++ if (rcu_pending(cpu)) ++ rcu_check_callbacks(cpu, 0); ++ + if (!idle) + idle = default_idle; + +@@ -255,13 +268,13 @@ EXPORT_SYMBOL_GPL(cpu_idle_wait); + * New with Core Duo processors, MWAIT can take some hints based on CPU + * capability. + */ +-void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) ++void mwait_idle_with_hints(unsigned long ax, unsigned long cx) + { + if (!need_resched()) { + __monitor((void *)¤t_thread_info()->flags, 0, 0); + smp_mb(); + if (!need_resched()) +- __mwait(eax, ecx); ++ __mwait(ax, cx); + } + } + +@@ -272,19 +285,37 @@ static void mwait_idle(void) + mwait_idle_with_hints(0, 0); + } + ++static int __cpuinit mwait_usable(const struct cpuinfo_x86 *c) ++{ ++ if (force_mwait) ++ return 1; ++ /* Any C1 states supported? */ ++ return c->cpuid_level >= 5 && ((cpuid_edx(5) >> 4) & 0xf) > 0; ++} ++ + void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) + { +- if (cpu_has(c, X86_FEATURE_MWAIT)) { +- printk("monitor/mwait feature present.\n"); ++ static int selected; ++ ++ if (selected) ++ return; ++#ifdef CONFIG_X86_SMP ++ if (pm_idle == poll_idle && smp_num_siblings > 1) { ++ printk(KERN_WARNING "WARNING: polling idle and HT enabled," ++ " performance may degrade.\n"); ++ } ++#endif ++ if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) { + /* + * Skip, if setup has overridden idle. 
+ * One CPU supports mwait => All CPUs supports mwait + */ + if (!pm_idle) { +- printk("using mwait in idle threads.\n"); ++ printk(KERN_INFO "using mwait in idle threads.\n"); + pm_idle = mwait_idle; + } + } ++ selected = 1; + } + + static int __init idle_setup(char *str) +@@ -292,10 +323,6 @@ static int __init idle_setup(char *str) + if (!strcmp(str, "poll")) { + printk("using polling idle threads.\n"); + pm_idle = poll_idle; +-#ifdef CONFIG_X86_SMP +- if (smp_num_siblings > 1) +- printk("WARNING: polling idle and HT enabled, performance may degrade.\n"); +-#endif + } else if (!strcmp(str, "mwait")) + force_mwait = 1; + else +@@ -310,15 +337,15 @@ void __show_registers(struct pt_regs *regs, int all) + { + unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L; + unsigned long d0, d1, d2, d3, d6, d7; +- unsigned long esp; ++ unsigned long sp; + unsigned short ss, gs; + + if (user_mode_vm(regs)) { +- esp = regs->esp; +- ss = regs->xss & 0xffff; ++ sp = regs->sp; ++ ss = regs->ss & 0xffff; + savesegment(gs, gs); + } else { +- esp = (unsigned long) (®s->esp); ++ sp = (unsigned long) (®s->sp); + savesegment(ss, ss); + savesegment(gs, gs); + } +@@ -331,17 +358,17 @@ void __show_registers(struct pt_regs *regs, int all) + init_utsname()->version); + + printk("EIP: %04x:[<%08lx>] EFLAGS: %08lx CPU: %d\n", +- 0xffff & regs->xcs, regs->eip, regs->eflags, ++ 0xffff & regs->cs, regs->ip, regs->flags, + smp_processor_id()); +- print_symbol("EIP is at %s\n", regs->eip); ++ print_symbol("EIP is at %s\n", regs->ip); + + printk("EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n", +- regs->eax, regs->ebx, regs->ecx, regs->edx); ++ regs->ax, regs->bx, regs->cx, regs->dx); + printk("ESI: %08lx EDI: %08lx EBP: %08lx ESP: %08lx\n", +- regs->esi, regs->edi, regs->ebp, esp); ++ regs->si, regs->di, regs->bp, sp); + printk(" DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x\n", +- regs->xds & 0xffff, regs->xes & 0xffff, +- regs->xfs & 0xffff, gs, ss); ++ regs->ds & 0xffff, regs->es & 0xffff, ++ regs->fs & 0xffff, gs, ss); + + if (!all) + return; +@@ -369,12 +396,12 @@ void __show_registers(struct pt_regs *regs, int all) + void show_regs(struct pt_regs *regs) + { + __show_registers(regs, 1); +- show_trace(NULL, regs, ®s->esp); ++ show_trace(NULL, regs, ®s->sp, regs->bp); + } + + /* +- * This gets run with %ebx containing the +- * function to call, and %edx containing ++ * This gets run with %bx containing the ++ * function to call, and %dx containing + * the "args". + */ + extern void kernel_thread_helper(void); +@@ -388,16 +415,16 @@ int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags) + + memset(®s, 0, sizeof(regs)); + +- regs.ebx = (unsigned long) fn; +- regs.edx = (unsigned long) arg; ++ regs.bx = (unsigned long) fn; ++ regs.dx = (unsigned long) arg; + +- regs.xds = __USER_DS; +- regs.xes = __USER_DS; +- regs.xfs = __KERNEL_PERCPU; +- regs.orig_eax = -1; +- regs.eip = (unsigned long) kernel_thread_helper; +- regs.xcs = __KERNEL_CS | get_kernel_rpl(); +- regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2; ++ regs.ds = __USER_DS; ++ regs.es = __USER_DS; ++ regs.fs = __KERNEL_PERCPU; ++ regs.orig_ax = -1; ++ regs.ip = (unsigned long) kernel_thread_helper; ++ regs.cs = __KERNEL_CS | get_kernel_rpl(); ++ regs.flags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2; + + /* Ok, create the new process.. 
*/ + return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, ®s, 0, NULL, NULL); +@@ -435,7 +462,12 @@ void flush_thread(void) + { + struct task_struct *tsk = current; + +- memset(tsk->thread.debugreg, 0, sizeof(unsigned long)*8); ++ tsk->thread.debugreg0 = 0; ++ tsk->thread.debugreg1 = 0; ++ tsk->thread.debugreg2 = 0; ++ tsk->thread.debugreg3 = 0; ++ tsk->thread.debugreg6 = 0; ++ tsk->thread.debugreg7 = 0; + memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array)); + clear_tsk_thread_flag(tsk, TIF_DEBUG); + /* +@@ -460,7 +492,7 @@ void prepare_to_copy(struct task_struct *tsk) + unlazy_fpu(tsk); + } + +-int copy_thread(int nr, unsigned long clone_flags, unsigned long esp, ++int copy_thread(int nr, unsigned long clone_flags, unsigned long sp, + unsigned long unused, + struct task_struct * p, struct pt_regs * regs) + { +@@ -470,15 +502,15 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long esp, + + childregs = task_pt_regs(p); + *childregs = *regs; +- childregs->eax = 0; +- childregs->esp = esp; ++ childregs->ax = 0; ++ childregs->sp = sp; + +- p->thread.esp = (unsigned long) childregs; +- p->thread.esp0 = (unsigned long) (childregs+1); ++ p->thread.sp = (unsigned long) childregs; ++ p->thread.sp0 = (unsigned long) (childregs+1); + +- p->thread.eip = (unsigned long) ret_from_fork; ++ p->thread.ip = (unsigned long) ret_from_fork; + +- savesegment(gs,p->thread.gs); ++ savesegment(gs, p->thread.gs); + + tsk = current; + if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) { +@@ -491,32 +523,15 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long esp, + set_tsk_thread_flag(p, TIF_IO_BITMAP); + } + ++ err = 0; ++ + /* + * Set a new TLS for the child thread? + */ +- if (clone_flags & CLONE_SETTLS) { +- struct desc_struct *desc; +- struct user_desc info; +- int idx; +- +- err = -EFAULT; +- if (copy_from_user(&info, (void __user *)childregs->esi, sizeof(info))) +- goto out; +- err = -EINVAL; +- if (LDT_empty(&info)) +- goto out; +- +- idx = info.entry_number; +- if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) +- goto out; +- +- desc = p->thread.tls_array + idx - GDT_ENTRY_TLS_MIN; +- desc->a = LDT_entry_a(&info); +- desc->b = LDT_entry_b(&info); +- } ++ if (clone_flags & CLONE_SETTLS) ++ err = do_set_thread_area(p, -1, ++ (struct user_desc __user *)childregs->si, 0); + +- err = 0; +- out: + if (err && p->thread.io_bitmap_ptr) { + kfree(p->thread.io_bitmap_ptr); + p->thread.io_bitmap_max = 0; +@@ -529,62 +544,52 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long esp, + */ + void dump_thread(struct pt_regs * regs, struct user * dump) + { +- int i; ++ u16 gs; + + /* changed the size calculations - should hopefully work better. 
lbt */ + dump->magic = CMAGIC; + dump->start_code = 0; +- dump->start_stack = regs->esp & ~(PAGE_SIZE - 1); ++ dump->start_stack = regs->sp & ~(PAGE_SIZE - 1); + dump->u_tsize = ((unsigned long) current->mm->end_code) >> PAGE_SHIFT; + dump->u_dsize = ((unsigned long) (current->mm->brk + (PAGE_SIZE-1))) >> PAGE_SHIFT; + dump->u_dsize -= dump->u_tsize; + dump->u_ssize = 0; +- for (i = 0; i < 8; i++) +- dump->u_debugreg[i] = current->thread.debugreg[i]; ++ dump->u_debugreg[0] = current->thread.debugreg0; ++ dump->u_debugreg[1] = current->thread.debugreg1; ++ dump->u_debugreg[2] = current->thread.debugreg2; ++ dump->u_debugreg[3] = current->thread.debugreg3; ++ dump->u_debugreg[4] = 0; ++ dump->u_debugreg[5] = 0; ++ dump->u_debugreg[6] = current->thread.debugreg6; ++ dump->u_debugreg[7] = current->thread.debugreg7; + + if (dump->start_stack < TASK_SIZE) + dump->u_ssize = ((unsigned long) (TASK_SIZE - dump->start_stack)) >> PAGE_SHIFT; + +- dump->regs.ebx = regs->ebx; +- dump->regs.ecx = regs->ecx; +- dump->regs.edx = regs->edx; +- dump->regs.esi = regs->esi; +- dump->regs.edi = regs->edi; +- dump->regs.ebp = regs->ebp; +- dump->regs.eax = regs->eax; +- dump->regs.ds = regs->xds; +- dump->regs.es = regs->xes; +- dump->regs.fs = regs->xfs; +- savesegment(gs,dump->regs.gs); +- dump->regs.orig_eax = regs->orig_eax; +- dump->regs.eip = regs->eip; +- dump->regs.cs = regs->xcs; +- dump->regs.eflags = regs->eflags; +- dump->regs.esp = regs->esp; +- dump->regs.ss = regs->xss; ++ dump->regs.bx = regs->bx; ++ dump->regs.cx = regs->cx; ++ dump->regs.dx = regs->dx; ++ dump->regs.si = regs->si; ++ dump->regs.di = regs->di; ++ dump->regs.bp = regs->bp; ++ dump->regs.ax = regs->ax; ++ dump->regs.ds = (u16)regs->ds; ++ dump->regs.es = (u16)regs->es; ++ dump->regs.fs = (u16)regs->fs; ++ savesegment(gs,gs); ++ dump->regs.orig_ax = regs->orig_ax; ++ dump->regs.ip = regs->ip; ++ dump->regs.cs = (u16)regs->cs; ++ dump->regs.flags = regs->flags; ++ dump->regs.sp = regs->sp; ++ dump->regs.ss = (u16)regs->ss; + + dump->u_fpvalid = dump_fpu (regs, &dump->i387); + } + EXPORT_SYMBOL(dump_thread); + +-/* +- * Capture the user space registers if the task is not running (in user space) +- */ +-int dump_task_regs(struct task_struct *tsk, elf_gregset_t *regs) +-{ +- struct pt_regs ptregs = *task_pt_regs(tsk); +- ptregs.xcs &= 0xffff; +- ptregs.xds &= 0xffff; +- ptregs.xes &= 0xffff; +- ptregs.xss &= 0xffff; +- +- elf_core_copy_regs(regs, &ptregs); +- +- return 1; +-} +- + #ifdef CONFIG_SECCOMP +-void hard_disable_TSC(void) ++static void hard_disable_TSC(void) + { + write_cr4(read_cr4() | X86_CR4_TSD); + } +@@ -599,7 +604,7 @@ void disable_TSC(void) + hard_disable_TSC(); + preempt_enable(); + } +-void hard_enable_TSC(void) ++static void hard_enable_TSC(void) + { + write_cr4(read_cr4() & ~X86_CR4_TSD); + } +@@ -609,18 +614,32 @@ static noinline void + __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p, + struct tss_struct *tss) + { +- struct thread_struct *next; ++ struct thread_struct *prev, *next; ++ unsigned long debugctl; + ++ prev = &prev_p->thread; + next = &next_p->thread; + ++ debugctl = prev->debugctlmsr; ++ if (next->ds_area_msr != prev->ds_area_msr) { ++ /* we clear debugctl to make sure DS ++ * is not in use when we change it */ ++ debugctl = 0; ++ wrmsrl(MSR_IA32_DEBUGCTLMSR, 0); ++ wrmsr(MSR_IA32_DS_AREA, next->ds_area_msr, 0); ++ } ++ ++ if (next->debugctlmsr != debugctl) ++ wrmsr(MSR_IA32_DEBUGCTLMSR, next->debugctlmsr, 0); ++ + if (test_tsk_thread_flag(next_p, TIF_DEBUG)) { +- 
set_debugreg(next->debugreg[0], 0); +- set_debugreg(next->debugreg[1], 1); +- set_debugreg(next->debugreg[2], 2); +- set_debugreg(next->debugreg[3], 3); ++ set_debugreg(next->debugreg0, 0); ++ set_debugreg(next->debugreg1, 1); ++ set_debugreg(next->debugreg2, 2); ++ set_debugreg(next->debugreg3, 3); + /* no 4 and 5 */ +- set_debugreg(next->debugreg[6], 6); +- set_debugreg(next->debugreg[7], 7); ++ set_debugreg(next->debugreg6, 6); ++ set_debugreg(next->debugreg7, 7); + } + + #ifdef CONFIG_SECCOMP +@@ -634,6 +653,13 @@ __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p, + } + #endif + ++ if (test_tsk_thread_flag(prev_p, TIF_BTS_TRACE_TS)) ++ ptrace_bts_take_timestamp(prev_p, BTS_TASK_DEPARTS); ++ ++ if (test_tsk_thread_flag(next_p, TIF_BTS_TRACE_TS)) ++ ptrace_bts_take_timestamp(next_p, BTS_TASK_ARRIVES); ++ ++ + if (!test_tsk_thread_flag(next_p, TIF_IO_BITMAP)) { + /* + * Disable the bitmap via an invalid offset. We still cache +@@ -687,11 +713,11 @@ __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p, + * More important, however, is the fact that this allows us much + * more flexibility. + * +- * The return value (in %eax) will be the "prev" task after ++ * The return value (in %ax) will be the "prev" task after + * the task-switch, and shows up in ret_from_fork in entry.S, + * for example. + */ +-struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct task_struct *next_p) ++struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct *next_p) + { + struct thread_struct *prev = &prev_p->thread, + *next = &next_p->thread; +@@ -710,7 +736,7 @@ struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct tas + /* + * Reload esp0. + */ +- load_esp0(tss, next); ++ load_sp0(tss, next); + + /* + * Save away %gs. 
No need to save %fs, as it was saved on the +@@ -774,7 +800,7 @@ struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct tas + + asmlinkage int sys_fork(struct pt_regs regs) + { +- return do_fork(SIGCHLD, regs.esp, ®s, 0, NULL, NULL); ++ return do_fork(SIGCHLD, regs.sp, ®s, 0, NULL, NULL); + } + + asmlinkage int sys_clone(struct pt_regs regs) +@@ -783,12 +809,12 @@ asmlinkage int sys_clone(struct pt_regs regs) + unsigned long newsp; + int __user *parent_tidptr, *child_tidptr; + +- clone_flags = regs.ebx; +- newsp = regs.ecx; +- parent_tidptr = (int __user *)regs.edx; +- child_tidptr = (int __user *)regs.edi; ++ clone_flags = regs.bx; ++ newsp = regs.cx; ++ parent_tidptr = (int __user *)regs.dx; ++ child_tidptr = (int __user *)regs.di; + if (!newsp) +- newsp = regs.esp; ++ newsp = regs.sp; + return do_fork(clone_flags, newsp, ®s, 0, parent_tidptr, child_tidptr); + } + +@@ -804,7 +830,7 @@ asmlinkage int sys_clone(struct pt_regs regs) + */ + asmlinkage int sys_vfork(struct pt_regs regs) + { +- return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, ®s, 0, NULL, NULL); ++ return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.sp, ®s, 0, NULL, NULL); + } + + /* +@@ -815,18 +841,15 @@ asmlinkage int sys_execve(struct pt_regs regs) + int error; + char * filename; + +- filename = getname((char __user *) regs.ebx); ++ filename = getname((char __user *) regs.bx); + error = PTR_ERR(filename); + if (IS_ERR(filename)) + goto out; + error = do_execve(filename, +- (char __user * __user *) regs.ecx, +- (char __user * __user *) regs.edx, ++ (char __user * __user *) regs.cx, ++ (char __user * __user *) regs.dx, + ®s); + if (error == 0) { +- task_lock(current); +- current->ptrace &= ~PT_DTRACE; +- task_unlock(current); + /* Make sure we don't return using sysenter.. */ + set_thread_flag(TIF_IRET); + } +@@ -840,145 +863,37 @@ out: + + unsigned long get_wchan(struct task_struct *p) + { +- unsigned long ebp, esp, eip; ++ unsigned long bp, sp, ip; + unsigned long stack_page; + int count = 0; + if (!p || p == current || p->state == TASK_RUNNING) + return 0; + stack_page = (unsigned long)task_stack_page(p); +- esp = p->thread.esp; +- if (!stack_page || esp < stack_page || esp > top_esp+stack_page) ++ sp = p->thread.sp; ++ if (!stack_page || sp < stack_page || sp > top_esp+stack_page) + return 0; +- /* include/asm-i386/system.h:switch_to() pushes ebp last. */ +- ebp = *(unsigned long *) esp; ++ /* include/asm-i386/system.h:switch_to() pushes bp last. */ ++ bp = *(unsigned long *) sp; + do { +- if (ebp < stack_page || ebp > top_ebp+stack_page) ++ if (bp < stack_page || bp > top_ebp+stack_page) + return 0; +- eip = *(unsigned long *) (ebp+4); +- if (!in_sched_functions(eip)) +- return eip; +- ebp = *(unsigned long *) ebp; ++ ip = *(unsigned long *) (bp+4); ++ if (!in_sched_functions(ip)) ++ return ip; ++ bp = *(unsigned long *) bp; + } while (count++ < 16); + return 0; + } + +-/* +- * sys_alloc_thread_area: get a yet unused TLS descriptor index. 
+- */ +-static int get_free_idx(void) +-{ +- struct thread_struct *t = ¤t->thread; +- int idx; +- +- for (idx = 0; idx < GDT_ENTRY_TLS_ENTRIES; idx++) +- if (desc_empty(t->tls_array + idx)) +- return idx + GDT_ENTRY_TLS_MIN; +- return -ESRCH; +-} +- +-/* +- * Set a given TLS descriptor: +- */ +-asmlinkage int sys_set_thread_area(struct user_desc __user *u_info) +-{ +- struct thread_struct *t = ¤t->thread; +- struct user_desc info; +- struct desc_struct *desc; +- int cpu, idx; +- +- if (copy_from_user(&info, u_info, sizeof(info))) +- return -EFAULT; +- idx = info.entry_number; +- +- /* +- * index -1 means the kernel should try to find and +- * allocate an empty descriptor: +- */ +- if (idx == -1) { +- idx = get_free_idx(); +- if (idx < 0) +- return idx; +- if (put_user(idx, &u_info->entry_number)) +- return -EFAULT; +- } +- +- if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) +- return -EINVAL; +- +- desc = t->tls_array + idx - GDT_ENTRY_TLS_MIN; +- +- /* +- * We must not get preempted while modifying the TLS. +- */ +- cpu = get_cpu(); +- +- if (LDT_empty(&info)) { +- desc->a = 0; +- desc->b = 0; +- } else { +- desc->a = LDT_entry_a(&info); +- desc->b = LDT_entry_b(&info); +- } +- load_TLS(t, cpu); +- +- put_cpu(); +- +- return 0; +-} +- +-/* +- * Get the current Thread-Local Storage area: +- */ +- +-#define GET_BASE(desc) ( \ +- (((desc)->a >> 16) & 0x0000ffff) | \ +- (((desc)->b << 16) & 0x00ff0000) | \ +- ( (desc)->b & 0xff000000) ) +- +-#define GET_LIMIT(desc) ( \ +- ((desc)->a & 0x0ffff) | \ +- ((desc)->b & 0xf0000) ) +- +-#define GET_32BIT(desc) (((desc)->b >> 22) & 1) +-#define GET_CONTENTS(desc) (((desc)->b >> 10) & 3) +-#define GET_WRITABLE(desc) (((desc)->b >> 9) & 1) +-#define GET_LIMIT_PAGES(desc) (((desc)->b >> 23) & 1) +-#define GET_PRESENT(desc) (((desc)->b >> 15) & 1) +-#define GET_USEABLE(desc) (((desc)->b >> 20) & 1) +- +-asmlinkage int sys_get_thread_area(struct user_desc __user *u_info) +-{ +- struct user_desc info; +- struct desc_struct *desc; +- int idx; +- +- if (get_user(idx, &u_info->entry_number)) +- return -EFAULT; +- if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) +- return -EINVAL; +- +- memset(&info, 0, sizeof(info)); +- +- desc = current->thread.tls_array + idx - GDT_ENTRY_TLS_MIN; +- +- info.entry_number = idx; +- info.base_addr = GET_BASE(desc); +- info.limit = GET_LIMIT(desc); +- info.seg_32bit = GET_32BIT(desc); +- info.contents = GET_CONTENTS(desc); +- info.read_exec_only = !GET_WRITABLE(desc); +- info.limit_in_pages = GET_LIMIT_PAGES(desc); +- info.seg_not_present = !GET_PRESENT(desc); +- info.useable = GET_USEABLE(desc); +- +- if (copy_to_user(u_info, &info, sizeof(info))) +- return -EFAULT; +- return 0; +-} +- + unsigned long arch_align_stack(unsigned long sp) + { + if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space) + sp -= get_random_int() % 8192; + return sp & ~0xf; + } ++ ++unsigned long arch_randomize_brk(struct mm_struct *mm) ++{ ++ unsigned long range_end = mm->brk + 0x02000000; ++ return randomize_range(mm->brk, range_end, 0) ? : mm->brk; ++} +diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c +index ab79e1d..137a861 100644 +--- a/arch/x86/kernel/process_64.c ++++ b/arch/x86/kernel/process_64.c +@@ -3,7 +3,7 @@ + * + * Pentium III FXSR, SSE support + * Gareth Hughes , May 2000 +- * ++ * + * X86-64 port + * Andi Kleen. 
+ * +@@ -19,19 +19,19 @@ + #include + #include + #include ++#include + #include + #include +-#include + #include + #include + #include + #include +-#include + #include + #include ++#include + #include ++#include + #include +-#include + #include + #include + #include +@@ -72,13 +72,6 @@ void idle_notifier_register(struct notifier_block *n) + { + atomic_notifier_chain_register(&idle_notifier, n); + } +-EXPORT_SYMBOL_GPL(idle_notifier_register); +- +-void idle_notifier_unregister(struct notifier_block *n) +-{ +- atomic_notifier_chain_unregister(&idle_notifier, n); +-} +-EXPORT_SYMBOL(idle_notifier_unregister); + + void enter_idle(void) + { +@@ -106,7 +99,7 @@ void exit_idle(void) + * We use this if we don't have any better + * idle routine.. + */ +-static void default_idle(void) ++void default_idle(void) + { + current_thread_info()->status &= ~TS_POLLING; + /* +@@ -116,11 +109,18 @@ static void default_idle(void) + smp_mb(); + local_irq_disable(); + if (!need_resched()) { +- /* Enables interrupts one instruction before HLT. +- x86 special cases this so there is no race. */ +- safe_halt(); +- } else +- local_irq_enable(); ++ ktime_t t0, t1; ++ u64 t0n, t1n; ++ ++ t0 = ktime_get(); ++ t0n = ktime_to_ns(t0); ++ safe_halt(); /* enables interrupts racelessly */ ++ local_irq_disable(); ++ t1 = ktime_get(); ++ t1n = ktime_to_ns(t1); ++ sched_clock_idle_wakeup_event(t1n - t0n); ++ } ++ local_irq_enable(); + current_thread_info()->status |= TS_POLLING; + } + +@@ -129,54 +129,12 @@ static void default_idle(void) + * to poll the ->need_resched flag instead of waiting for the + * cross-CPU IPI to arrive. Use this option with caution. + */ +-static void poll_idle (void) ++static void poll_idle(void) + { + local_irq_enable(); + cpu_relax(); + } + +-static void do_nothing(void *unused) +-{ +-} +- +-void cpu_idle_wait(void) +-{ +- unsigned int cpu, this_cpu = get_cpu(); +- cpumask_t map, tmp = current->cpus_allowed; +- +- set_cpus_allowed(current, cpumask_of_cpu(this_cpu)); +- put_cpu(); +- +- cpus_clear(map); +- for_each_online_cpu(cpu) { +- per_cpu(cpu_idle_state, cpu) = 1; +- cpu_set(cpu, map); +- } +- +- __get_cpu_var(cpu_idle_state) = 0; +- +- wmb(); +- do { +- ssleep(1); +- for_each_online_cpu(cpu) { +- if (cpu_isset(cpu, map) && +- !per_cpu(cpu_idle_state, cpu)) +- cpu_clear(cpu, map); +- } +- cpus_and(map, map, cpu_online_map); +- /* +- * We waited 1 sec, if a CPU still did not call idle +- * it may be because it is in idle and not waking up +- * because it has nothing to do. +- * Give all the remaining CPUS a kick. 
+- */ +- smp_call_function_mask(map, do_nothing, 0, 0); +- } while (!cpus_empty(map)); +- +- set_cpus_allowed(current, tmp); +-} +-EXPORT_SYMBOL_GPL(cpu_idle_wait); +- + #ifdef CONFIG_HOTPLUG_CPU + DECLARE_PER_CPU(int, cpu_state); + +@@ -207,19 +165,18 @@ static inline void play_dead(void) + * low exit latency (ie sit in a loop waiting for + * somebody to say that they'd like to reschedule) + */ +-void cpu_idle (void) ++void cpu_idle(void) + { + current_thread_info()->status |= TS_POLLING; + /* endless idle loop with no priority at all */ + while (1) { ++ tick_nohz_stop_sched_tick(); + while (!need_resched()) { + void (*idle)(void); + + if (__get_cpu_var(cpu_idle_state)) + __get_cpu_var(cpu_idle_state) = 0; + +- tick_nohz_stop_sched_tick(); +- + rmb(); + idle = pm_idle; + if (!idle) +@@ -247,6 +204,47 @@ void cpu_idle (void) + } + } + ++static void do_nothing(void *unused) ++{ ++} ++ ++void cpu_idle_wait(void) ++{ ++ unsigned int cpu, this_cpu = get_cpu(); ++ cpumask_t map, tmp = current->cpus_allowed; ++ ++ set_cpus_allowed(current, cpumask_of_cpu(this_cpu)); ++ put_cpu(); ++ ++ cpus_clear(map); ++ for_each_online_cpu(cpu) { ++ per_cpu(cpu_idle_state, cpu) = 1; ++ cpu_set(cpu, map); ++ } ++ ++ __get_cpu_var(cpu_idle_state) = 0; ++ ++ wmb(); ++ do { ++ ssleep(1); ++ for_each_online_cpu(cpu) { ++ if (cpu_isset(cpu, map) && !per_cpu(cpu_idle_state, cpu)) ++ cpu_clear(cpu, map); ++ } ++ cpus_and(map, map, cpu_online_map); ++ /* ++ * We waited 1 sec, if a CPU still did not call idle ++ * it may be because it is in idle and not waking up ++ * because it has nothing to do. ++ * Give all the remaining CPUS a kick. ++ */ ++ smp_call_function_mask(map, do_nothing, 0, 0); ++ } while (!cpus_empty(map)); ++ ++ set_cpus_allowed(current, tmp); ++} ++EXPORT_SYMBOL_GPL(cpu_idle_wait); ++ + /* + * This uses new MONITOR/MWAIT instructions on P4 processors with PNI, + * which can obviate IPI to trigger checking of need_resched. +@@ -257,13 +255,13 @@ void cpu_idle (void) + * New with Core Duo processors, MWAIT can take some hints based on CPU + * capability. + */ +-void mwait_idle_with_hints(unsigned long eax, unsigned long ecx) ++void mwait_idle_with_hints(unsigned long ax, unsigned long cx) + { + if (!need_resched()) { + __monitor((void *)¤t_thread_info()->flags, 0, 0); + smp_mb(); + if (!need_resched()) +- __mwait(eax, ecx); ++ __mwait(ax, cx); + } + } + +@@ -282,25 +280,41 @@ static void mwait_idle(void) + } + } + ++ ++static int __cpuinit mwait_usable(const struct cpuinfo_x86 *c) ++{ ++ if (force_mwait) ++ return 1; ++ /* Any C1 states supported? */ ++ return c->cpuid_level >= 5 && ((cpuid_edx(5) >> 4) & 0xf) > 0; ++} ++ + void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c) + { +- static int printed; +- if (cpu_has(c, X86_FEATURE_MWAIT)) { ++ static int selected; ++ ++ if (selected) ++ return; ++#ifdef CONFIG_X86_SMP ++ if (pm_idle == poll_idle && smp_num_siblings > 1) { ++ printk(KERN_WARNING "WARNING: polling idle and HT enabled," ++ " performance may degrade.\n"); ++ } ++#endif ++ if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) { + /* + * Skip, if setup has overridden idle. 
+ * One CPU supports mwait => All CPUs supports mwait + */ + if (!pm_idle) { +- if (!printed) { +- printk(KERN_INFO "using mwait in idle threads.\n"); +- printed = 1; +- } ++ printk(KERN_INFO "using mwait in idle threads.\n"); + pm_idle = mwait_idle; + } + } ++ selected = 1; + } + +-static int __init idle_setup (char *str) ++static int __init idle_setup(char *str) + { + if (!strcmp(str, "poll")) { + printk("using polling idle threads.\n"); +@@ -315,13 +329,13 @@ static int __init idle_setup (char *str) + } + early_param("idle", idle_setup); + +-/* Prints also some state that isn't saved in the pt_regs */ ++/* Prints also some state that isn't saved in the pt_regs */ + void __show_regs(struct pt_regs * regs) + { + unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L, fs, gs, shadowgs; + unsigned long d0, d1, d2, d3, d6, d7; +- unsigned int fsindex,gsindex; +- unsigned int ds,cs,es; ++ unsigned int fsindex, gsindex; ++ unsigned int ds, cs, es; + + printk("\n"); + print_modules(); +@@ -330,16 +344,16 @@ void __show_regs(struct pt_regs * regs) + init_utsname()->release, + (int)strcspn(init_utsname()->version, " "), + init_utsname()->version); +- printk("RIP: %04lx:[<%016lx>] ", regs->cs & 0xffff, regs->rip); +- printk_address(regs->rip); +- printk("RSP: %04lx:%016lx EFLAGS: %08lx\n", regs->ss, regs->rsp, +- regs->eflags); ++ printk("RIP: %04lx:[<%016lx>] ", regs->cs & 0xffff, regs->ip); ++ printk_address(regs->ip, 1); ++ printk("RSP: %04lx:%016lx EFLAGS: %08lx\n", regs->ss, regs->sp, ++ regs->flags); + printk("RAX: %016lx RBX: %016lx RCX: %016lx\n", +- regs->rax, regs->rbx, regs->rcx); ++ regs->ax, regs->bx, regs->cx); + printk("RDX: %016lx RSI: %016lx RDI: %016lx\n", +- regs->rdx, regs->rsi, regs->rdi); ++ regs->dx, regs->si, regs->di); + printk("RBP: %016lx R08: %016lx R09: %016lx\n", +- regs->rbp, regs->r8, regs->r9); ++ regs->bp, regs->r8, regs->r9); + printk("R10: %016lx R11: %016lx R12: %016lx\n", + regs->r10, regs->r11, regs->r12); + printk("R13: %016lx R14: %016lx R15: %016lx\n", +@@ -379,7 +393,7 @@ void show_regs(struct pt_regs *regs) + { + printk("CPU %d:", smp_processor_id()); + __show_regs(regs); +- show_trace(NULL, regs, (void *)(regs + 1)); ++ show_trace(NULL, regs, (void *)(regs + 1), regs->bp); + } + + /* +@@ -390,7 +404,7 @@ void exit_thread(void) + struct task_struct *me = current; + struct thread_struct *t = &me->thread; + +- if (me->thread.io_bitmap_ptr) { ++ if (me->thread.io_bitmap_ptr) { + struct tss_struct *tss = &per_cpu(init_tss, get_cpu()); + + kfree(t->io_bitmap_ptr); +@@ -426,7 +440,7 @@ void flush_thread(void) + tsk->thread.debugreg3 = 0; + tsk->thread.debugreg6 = 0; + tsk->thread.debugreg7 = 0; +- memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array)); ++ memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array)); + /* + * Forget coprocessor state.. 
+ */ +@@ -449,26 +463,21 @@ void release_thread(struct task_struct *dead_task) + + static inline void set_32bit_tls(struct task_struct *t, int tls, u32 addr) + { +- struct user_desc ud = { ++ struct user_desc ud = { + .base_addr = addr, + .limit = 0xfffff, + .seg_32bit = 1, + .limit_in_pages = 1, + .useable = 1, + }; +- struct n_desc_struct *desc = (void *)t->thread.tls_array; ++ struct desc_struct *desc = t->thread.tls_array; + desc += tls; +- desc->a = LDT_entry_a(&ud); +- desc->b = LDT_entry_b(&ud); ++ fill_ldt(desc, &ud); + } + + static inline u32 read_32bit_tls(struct task_struct *t, int tls) + { +- struct desc_struct *desc = (void *)t->thread.tls_array; +- desc += tls; +- return desc->base0 | +- (((u32)desc->base1) << 16) | +- (((u32)desc->base2) << 24); ++ return get_desc_base(&t->thread.tls_array[tls]); + } + + /* +@@ -480,7 +489,7 @@ void prepare_to_copy(struct task_struct *tsk) + unlazy_fpu(tsk); + } + +-int copy_thread(int nr, unsigned long clone_flags, unsigned long rsp, ++int copy_thread(int nr, unsigned long clone_flags, unsigned long sp, + unsigned long unused, + struct task_struct * p, struct pt_regs * regs) + { +@@ -492,14 +501,14 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long rsp, + (THREAD_SIZE + task_stack_page(p))) - 1; + *childregs = *regs; + +- childregs->rax = 0; +- childregs->rsp = rsp; +- if (rsp == ~0UL) +- childregs->rsp = (unsigned long)childregs; ++ childregs->ax = 0; ++ childregs->sp = sp; ++ if (sp == ~0UL) ++ childregs->sp = (unsigned long)childregs; + +- p->thread.rsp = (unsigned long) childregs; +- p->thread.rsp0 = (unsigned long) (childregs+1); +- p->thread.userrsp = me->thread.userrsp; ++ p->thread.sp = (unsigned long) childregs; ++ p->thread.sp0 = (unsigned long) (childregs+1); ++ p->thread.usersp = me->thread.usersp; + + set_tsk_thread_flag(p, TIF_FORK); + +@@ -520,7 +529,7 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long rsp, + memcpy(p->thread.io_bitmap_ptr, me->thread.io_bitmap_ptr, + IO_BITMAP_BYTES); + set_tsk_thread_flag(p, TIF_IO_BITMAP); +- } ++ } + + /* + * Set a new TLS for the child thread? 
+@@ -528,7 +537,8 @@ int copy_thread(int nr, unsigned long clone_flags, unsigned long rsp, + if (clone_flags & CLONE_SETTLS) { + #ifdef CONFIG_IA32_EMULATION + if (test_thread_flag(TIF_IA32)) +- err = ia32_child_tls(p, childregs); ++ err = do_set_thread_area(p, -1, ++ (struct user_desc __user *)childregs->si, 0); + else + #endif + err = do_arch_prctl(p, ARCH_SET_FS, childregs->r8); +@@ -547,17 +557,30 @@ out: + /* + * This special macro can be used to load a debugging register + */ +-#define loaddebug(thread,r) set_debugreg(thread->debugreg ## r, r) ++#define loaddebug(thread, r) set_debugreg(thread->debugreg ## r, r) + + static inline void __switch_to_xtra(struct task_struct *prev_p, +- struct task_struct *next_p, +- struct tss_struct *tss) ++ struct task_struct *next_p, ++ struct tss_struct *tss) + { + struct thread_struct *prev, *next; ++ unsigned long debugctl; + + prev = &prev_p->thread, + next = &next_p->thread; + ++ debugctl = prev->debugctlmsr; ++ if (next->ds_area_msr != prev->ds_area_msr) { ++ /* we clear debugctl to make sure DS ++ * is not in use when we change it */ ++ debugctl = 0; ++ wrmsrl(MSR_IA32_DEBUGCTLMSR, 0); ++ wrmsrl(MSR_IA32_DS_AREA, next->ds_area_msr); ++ } ++ ++ if (next->debugctlmsr != debugctl) ++ wrmsrl(MSR_IA32_DEBUGCTLMSR, next->debugctlmsr); ++ + if (test_tsk_thread_flag(next_p, TIF_DEBUG)) { + loaddebug(next, 0); + loaddebug(next, 1); +@@ -581,12 +604,18 @@ static inline void __switch_to_xtra(struct task_struct *prev_p, + */ + memset(tss->io_bitmap, 0xff, prev->io_bitmap_max); + } ++ ++ if (test_tsk_thread_flag(prev_p, TIF_BTS_TRACE_TS)) ++ ptrace_bts_take_timestamp(prev_p, BTS_TASK_DEPARTS); ++ ++ if (test_tsk_thread_flag(next_p, TIF_BTS_TRACE_TS)) ++ ptrace_bts_take_timestamp(next_p, BTS_TASK_ARRIVES); + } + + /* + * switch_to(x,y) should switch tasks from x to y. + * +- * This could still be optimized: ++ * This could still be optimized: + * - fold all the options into a flag word and test it with a single test. + * - could test fs/gs bitsliced + * +@@ -597,7 +626,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) + { + struct thread_struct *prev = &prev_p->thread, + *next = &next_p->thread; +- int cpu = smp_processor_id(); ++ int cpu = smp_processor_id(); + struct tss_struct *tss = &per_cpu(init_tss, cpu); + + /* we're going to use this soon, after a few expensive things */ +@@ -607,7 +636,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) + /* + * Reload esp0, LDT and the page table pointer: + */ +- tss->rsp0 = next->rsp0; ++ load_sp0(tss, next); + + /* + * Switch DS and ES. +@@ -666,8 +695,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) + /* + * Switch the PDA and FPU contexts. 
+ */ +- prev->userrsp = read_pda(oldrsp); +- write_pda(oldrsp, next->userrsp); ++ prev->usersp = read_pda(oldrsp); ++ write_pda(oldrsp, next->usersp); + write_pda(pcurrent, next_p); + + write_pda(kernelstack, +@@ -684,8 +713,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) + /* + * Now maybe reload the debug registers and handle I/O bitmaps + */ +- if (unlikely((task_thread_info(next_p)->flags & _TIF_WORK_CTXSW)) +- || test_tsk_thread_flag(prev_p, TIF_IO_BITMAP)) ++ if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT || ++ task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV)) + __switch_to_xtra(prev_p, next_p, tss); + + /* If the task has used fpu the last 5 timeslices, just do a full +@@ -700,7 +729,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) + /* + * sys_execve() executes a new program. + */ +-asmlinkage ++asmlinkage + long sys_execve(char __user *name, char __user * __user *argv, + char __user * __user *envp, struct pt_regs regs) + { +@@ -712,11 +741,6 @@ long sys_execve(char __user *name, char __user * __user *argv, + if (IS_ERR(filename)) + return error; + error = do_execve(filename, argv, envp, ®s); +- if (error == 0) { +- task_lock(current); +- current->ptrace &= ~PT_DTRACE; +- task_unlock(current); +- } + putname(filename); + return error; + } +@@ -726,18 +750,18 @@ void set_personality_64bit(void) + /* inherit personality from parent */ + + /* Make sure to be in 64bit mode */ +- clear_thread_flag(TIF_IA32); ++ clear_thread_flag(TIF_IA32); + + /* TBD: overwrites user setup. Should have two bits. + But 64bit processes have always behaved this way, + so it's not too bad. The main problem is just that +- 32bit childs are affected again. */ ++ 32bit childs are affected again. 
*/ + current->personality &= ~READ_IMPLIES_EXEC; + } + + asmlinkage long sys_fork(struct pt_regs *regs) + { +- return do_fork(SIGCHLD, regs->rsp, regs, 0, NULL, NULL); ++ return do_fork(SIGCHLD, regs->sp, regs, 0, NULL, NULL); + } + + asmlinkage long +@@ -745,7 +769,7 @@ sys_clone(unsigned long clone_flags, unsigned long newsp, + void __user *parent_tid, void __user *child_tid, struct pt_regs *regs) + { + if (!newsp) +- newsp = regs->rsp; ++ newsp = regs->sp; + return do_fork(clone_flags, newsp, regs, 0, parent_tid, child_tid); + } + +@@ -761,29 +785,29 @@ sys_clone(unsigned long clone_flags, unsigned long newsp, + */ + asmlinkage long sys_vfork(struct pt_regs *regs) + { +- return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->rsp, regs, 0, ++ return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->sp, regs, 0, + NULL, NULL); + } + + unsigned long get_wchan(struct task_struct *p) + { + unsigned long stack; +- u64 fp,rip; ++ u64 fp,ip; + int count = 0; + + if (!p || p == current || p->state==TASK_RUNNING) + return 0; + stack = (unsigned long)task_stack_page(p); +- if (p->thread.rsp < stack || p->thread.rsp > stack+THREAD_SIZE) ++ if (p->thread.sp < stack || p->thread.sp > stack+THREAD_SIZE) + return 0; +- fp = *(u64 *)(p->thread.rsp); ++ fp = *(u64 *)(p->thread.sp); + do { + if (fp < (unsigned long)stack || + fp > (unsigned long)stack+THREAD_SIZE) + return 0; +- rip = *(u64 *)(fp+8); +- if (!in_sched_functions(rip)) +- return rip; ++ ip = *(u64 *)(fp+8); ++ if (!in_sched_functions(ip)) ++ return ip; + fp = *(u64 *)fp; + } while (count++ < 16); + return 0; +@@ -824,19 +848,19 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr) + /* Not strictly needed for fs, but do it for symmetry + with gs */ + if (addr >= TASK_SIZE_OF(task)) +- return -EPERM; ++ return -EPERM; + cpu = get_cpu(); +- /* handle small bases via the GDT because that's faster to ++ /* handle small bases via the GDT because that's faster to + switch. 
*/ +- if (addr <= 0xffffffff) { ++ if (addr <= 0xffffffff) { + set_32bit_tls(task, FS_TLS, addr); +- if (doit) { +- load_TLS(&task->thread, cpu); ++ if (doit) { ++ load_TLS(&task->thread, cpu); + asm volatile("movl %0,%%fs" :: "r"(FS_TLS_SEL)); + } + task->thread.fsindex = FS_TLS_SEL; + task->thread.fs = 0; +- } else { ++ } else { + task->thread.fsindex = 0; + task->thread.fs = addr; + if (doit) { +@@ -848,24 +872,24 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr) + } + put_cpu(); + break; +- case ARCH_GET_FS: { +- unsigned long base; ++ case ARCH_GET_FS: { ++ unsigned long base; + if (task->thread.fsindex == FS_TLS_SEL) + base = read_32bit_tls(task, FS_TLS); + else if (doit) + rdmsrl(MSR_FS_BASE, base); + else + base = task->thread.fs; +- ret = put_user(base, (unsigned long __user *)addr); +- break; ++ ret = put_user(base, (unsigned long __user *)addr); ++ break; + } +- case ARCH_GET_GS: { ++ case ARCH_GET_GS: { + unsigned long base; + unsigned gsindex; + if (task->thread.gsindex == GS_TLS_SEL) + base = read_32bit_tls(task, GS_TLS); + else if (doit) { +- asm("movl %%gs,%0" : "=r" (gsindex)); ++ asm("movl %%gs,%0" : "=r" (gsindex)); + if (gsindex) + rdmsrl(MSR_KERNEL_GS_BASE, base); + else +@@ -873,39 +897,21 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr) + } + else + base = task->thread.gs; +- ret = put_user(base, (unsigned long __user *)addr); ++ ret = put_user(base, (unsigned long __user *)addr); + break; + } + + default: + ret = -EINVAL; + break; +- } ++ } + +- return ret; +-} ++ return ret; ++} + + long sys_arch_prctl(int code, unsigned long addr) + { + return do_arch_prctl(current, code, addr); +-} +- +-/* +- * Capture the user space registers if the task is not running (in user space) +- */ +-int dump_task_regs(struct task_struct *tsk, elf_gregset_t *regs) +-{ +- struct pt_regs *pp, ptregs; +- +- pp = task_pt_regs(tsk); +- +- ptregs = *pp; +- ptregs.cs &= 0xffff; +- ptregs.ss &= 0xffff; +- +- elf_core_copy_regs(regs, &ptregs); +- +- return 1; + } + + unsigned long arch_align_stack(unsigned long sp) +@@ -914,3 +920,9 @@ unsigned long arch_align_stack(unsigned long sp) + sp -= get_random_int() % 8192; + return sp & ~0xf; + } ++ ++unsigned long arch_randomize_brk(struct mm_struct *mm) ++{ ++ unsigned long range_end = mm->brk + 0x02000000; ++ return randomize_range(mm->brk, range_end, 0) ? : mm->brk; ++} +diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c +new file mode 100644 +index 0000000..96286df +--- /dev/null ++++ b/arch/x86/kernel/ptrace.c +@@ -0,0 +1,1545 @@ ++/* By Ross Biro 1/23/92 */ ++/* ++ * Pentium III FXSR, SSE support ++ * Gareth Hughes , May 2000 ++ * ++ * BTS tracing ++ * Markus Metzger , Dec 2007 ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++#include "tls.h" ++ ++enum x86_regset { ++ REGSET_GENERAL, ++ REGSET_FP, ++ REGSET_XFP, ++ REGSET_TLS, ++}; ++ ++/* ++ * does not yet catch signals sent when the child dies. ++ * in exit.c or in signal.c. ++ */ ++ ++/* ++ * Determines which flags the user has access to [1 = access, 0 = no access]. 
++ */ ++#define FLAG_MASK_32 ((unsigned long) \ ++ (X86_EFLAGS_CF | X86_EFLAGS_PF | \ ++ X86_EFLAGS_AF | X86_EFLAGS_ZF | \ ++ X86_EFLAGS_SF | X86_EFLAGS_TF | \ ++ X86_EFLAGS_DF | X86_EFLAGS_OF | \ ++ X86_EFLAGS_RF | X86_EFLAGS_AC)) ++ ++/* ++ * Determines whether a value may be installed in a segment register. ++ */ ++static inline bool invalid_selector(u16 value) ++{ ++ return unlikely(value != 0 && (value & SEGMENT_RPL_MASK) != USER_RPL); ++} ++ ++#ifdef CONFIG_X86_32 ++ ++#define FLAG_MASK FLAG_MASK_32 ++ ++static long *pt_regs_access(struct pt_regs *regs, unsigned long regno) ++{ ++ BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0); ++ regno >>= 2; ++ if (regno > FS) ++ --regno; ++ return ®s->bx + regno; ++} ++ ++static u16 get_segment_reg(struct task_struct *task, unsigned long offset) ++{ ++ /* ++ * Returning the value truncates it to 16 bits. ++ */ ++ unsigned int retval; ++ if (offset != offsetof(struct user_regs_struct, gs)) ++ retval = *pt_regs_access(task_pt_regs(task), offset); ++ else { ++ retval = task->thread.gs; ++ if (task == current) ++ savesegment(gs, retval); ++ } ++ return retval; ++} ++ ++static int set_segment_reg(struct task_struct *task, ++ unsigned long offset, u16 value) ++{ ++ /* ++ * The value argument was already truncated to 16 bits. ++ */ ++ if (invalid_selector(value)) ++ return -EIO; ++ ++ if (offset != offsetof(struct user_regs_struct, gs)) ++ *pt_regs_access(task_pt_regs(task), offset) = value; ++ else { ++ task->thread.gs = value; ++ if (task == current) ++ /* ++ * The user-mode %gs is not affected by ++ * kernel entry, so we must update the CPU. ++ */ ++ loadsegment(gs, value); ++ } ++ ++ return 0; ++} ++ ++static unsigned long debugreg_addr_limit(struct task_struct *task) ++{ ++ return TASK_SIZE - 3; ++} ++ ++#else /* CONFIG_X86_64 */ ++ ++#define FLAG_MASK (FLAG_MASK_32 | X86_EFLAGS_NT) ++ ++static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long offset) ++{ ++ BUILD_BUG_ON(offsetof(struct pt_regs, r15) != 0); ++ return ®s->r15 + (offset / sizeof(regs->r15)); ++} ++ ++static u16 get_segment_reg(struct task_struct *task, unsigned long offset) ++{ ++ /* ++ * Returning the value truncates it to 16 bits. ++ */ ++ unsigned int seg; ++ ++ switch (offset) { ++ case offsetof(struct user_regs_struct, fs): ++ if (task == current) { ++ /* Older gas can't assemble movq %?s,%r?? */ ++ asm("movl %%fs,%0" : "=r" (seg)); ++ return seg; ++ } ++ return task->thread.fsindex; ++ case offsetof(struct user_regs_struct, gs): ++ if (task == current) { ++ asm("movl %%gs,%0" : "=r" (seg)); ++ return seg; ++ } ++ return task->thread.gsindex; ++ case offsetof(struct user_regs_struct, ds): ++ if (task == current) { ++ asm("movl %%ds,%0" : "=r" (seg)); ++ return seg; ++ } ++ return task->thread.ds; ++ case offsetof(struct user_regs_struct, es): ++ if (task == current) { ++ asm("movl %%es,%0" : "=r" (seg)); ++ return seg; ++ } ++ return task->thread.es; ++ ++ case offsetof(struct user_regs_struct, cs): ++ case offsetof(struct user_regs_struct, ss): ++ break; ++ } ++ return *pt_regs_access(task_pt_regs(task), offset); ++} ++ ++static int set_segment_reg(struct task_struct *task, ++ unsigned long offset, u16 value) ++{ ++ /* ++ * The value argument was already truncated to 16 bits. ++ */ ++ if (invalid_selector(value)) ++ return -EIO; ++ ++ switch (offset) { ++ case offsetof(struct user_regs_struct,fs): ++ /* ++ * If this is setting fs as for normal 64-bit use but ++ * setting fs_base has implicitly changed it, leave it. 
++ */ ++ if ((value == FS_TLS_SEL && task->thread.fsindex == 0 && ++ task->thread.fs != 0) || ++ (value == 0 && task->thread.fsindex == FS_TLS_SEL && ++ task->thread.fs == 0)) ++ break; ++ task->thread.fsindex = value; ++ if (task == current) ++ loadsegment(fs, task->thread.fsindex); ++ break; ++ case offsetof(struct user_regs_struct,gs): ++ /* ++ * If this is setting gs as for normal 64-bit use but ++ * setting gs_base has implicitly changed it, leave it. ++ */ ++ if ((value == GS_TLS_SEL && task->thread.gsindex == 0 && ++ task->thread.gs != 0) || ++ (value == 0 && task->thread.gsindex == GS_TLS_SEL && ++ task->thread.gs == 0)) ++ break; ++ task->thread.gsindex = value; ++ if (task == current) ++ load_gs_index(task->thread.gsindex); ++ break; ++ case offsetof(struct user_regs_struct,ds): ++ task->thread.ds = value; ++ if (task == current) ++ loadsegment(ds, task->thread.ds); ++ break; ++ case offsetof(struct user_regs_struct,es): ++ task->thread.es = value; ++ if (task == current) ++ loadsegment(es, task->thread.es); ++ break; ++ ++ /* ++ * Can't actually change these in 64-bit mode. ++ */ ++ case offsetof(struct user_regs_struct,cs): ++#ifdef CONFIG_IA32_EMULATION ++ if (test_tsk_thread_flag(task, TIF_IA32)) ++ task_pt_regs(task)->cs = value; ++#endif ++ break; ++ case offsetof(struct user_regs_struct,ss): ++#ifdef CONFIG_IA32_EMULATION ++ if (test_tsk_thread_flag(task, TIF_IA32)) ++ task_pt_regs(task)->ss = value; ++#endif ++ break; ++ } ++ ++ return 0; ++} ++ ++static unsigned long debugreg_addr_limit(struct task_struct *task) ++{ ++#ifdef CONFIG_IA32_EMULATION ++ if (test_tsk_thread_flag(task, TIF_IA32)) ++ return IA32_PAGE_OFFSET - 3; ++#endif ++ return TASK_SIZE64 - 7; ++} ++ ++#endif /* CONFIG_X86_32 */ ++ ++static unsigned long get_flags(struct task_struct *task) ++{ ++ unsigned long retval = task_pt_regs(task)->flags; ++ ++ /* ++ * If the debugger set TF, hide it from the readout. ++ */ ++ if (test_tsk_thread_flag(task, TIF_FORCED_TF)) ++ retval &= ~X86_EFLAGS_TF; ++ ++ return retval; ++} ++ ++static int set_flags(struct task_struct *task, unsigned long value) ++{ ++ struct pt_regs *regs = task_pt_regs(task); ++ ++ /* ++ * If the user value contains TF, mark that ++ * it was not "us" (the debugger) that set it. ++ * If not, make sure it stays set if we had. ++ */ ++ if (value & X86_EFLAGS_TF) ++ clear_tsk_thread_flag(task, TIF_FORCED_TF); ++ else if (test_tsk_thread_flag(task, TIF_FORCED_TF)) ++ value |= X86_EFLAGS_TF; ++ ++ regs->flags = (regs->flags & ~FLAG_MASK) | (value & FLAG_MASK); ++ ++ return 0; ++} ++ ++static int putreg(struct task_struct *child, ++ unsigned long offset, unsigned long value) ++{ ++ switch (offset) { ++ case offsetof(struct user_regs_struct, cs): ++ case offsetof(struct user_regs_struct, ds): ++ case offsetof(struct user_regs_struct, es): ++ case offsetof(struct user_regs_struct, fs): ++ case offsetof(struct user_regs_struct, gs): ++ case offsetof(struct user_regs_struct, ss): ++ return set_segment_reg(child, offset, value); ++ ++ case offsetof(struct user_regs_struct, flags): ++ return set_flags(child, value); ++ ++#ifdef CONFIG_X86_64 ++ case offsetof(struct user_regs_struct,fs_base): ++ if (value >= TASK_SIZE_OF(child)) ++ return -EIO; ++ /* ++ * When changing the segment base, use do_arch_prctl ++ * to set either thread.fs or thread.fsindex and the ++ * corresponding GDT slot. 
++ */ ++ if (child->thread.fs != value) ++ return do_arch_prctl(child, ARCH_SET_FS, value); ++ return 0; ++ case offsetof(struct user_regs_struct,gs_base): ++ /* ++ * Exactly the same here as the %fs handling above. ++ */ ++ if (value >= TASK_SIZE_OF(child)) ++ return -EIO; ++ if (child->thread.gs != value) ++ return do_arch_prctl(child, ARCH_SET_GS, value); ++ return 0; ++#endif ++ } ++ ++ *pt_regs_access(task_pt_regs(child), offset) = value; ++ return 0; ++} ++ ++static unsigned long getreg(struct task_struct *task, unsigned long offset) ++{ ++ switch (offset) { ++ case offsetof(struct user_regs_struct, cs): ++ case offsetof(struct user_regs_struct, ds): ++ case offsetof(struct user_regs_struct, es): ++ case offsetof(struct user_regs_struct, fs): ++ case offsetof(struct user_regs_struct, gs): ++ case offsetof(struct user_regs_struct, ss): ++ return get_segment_reg(task, offset); ++ ++ case offsetof(struct user_regs_struct, flags): ++ return get_flags(task); ++ ++#ifdef CONFIG_X86_64 ++ case offsetof(struct user_regs_struct, fs_base): { ++ /* ++ * do_arch_prctl may have used a GDT slot instead of ++ * the MSR. To userland, it appears the same either ++ * way, except the %fs segment selector might not be 0. ++ */ ++ unsigned int seg = task->thread.fsindex; ++ if (task->thread.fs != 0) ++ return task->thread.fs; ++ if (task == current) ++ asm("movl %%fs,%0" : "=r" (seg)); ++ if (seg != FS_TLS_SEL) ++ return 0; ++ return get_desc_base(&task->thread.tls_array[FS_TLS]); ++ } ++ case offsetof(struct user_regs_struct, gs_base): { ++ /* ++ * Exactly the same here as the %fs handling above. ++ */ ++ unsigned int seg = task->thread.gsindex; ++ if (task->thread.gs != 0) ++ return task->thread.gs; ++ if (task == current) ++ asm("movl %%gs,%0" : "=r" (seg)); ++ if (seg != GS_TLS_SEL) ++ return 0; ++ return get_desc_base(&task->thread.tls_array[GS_TLS]); ++ } ++#endif ++ } ++ ++ return *pt_regs_access(task_pt_regs(task), offset); ++} ++ ++static int genregs_get(struct task_struct *target, ++ const struct user_regset *regset, ++ unsigned int pos, unsigned int count, ++ void *kbuf, void __user *ubuf) ++{ ++ if (kbuf) { ++ unsigned long *k = kbuf; ++ while (count > 0) { ++ *k++ = getreg(target, pos); ++ count -= sizeof(*k); ++ pos += sizeof(*k); ++ } ++ } else { ++ unsigned long __user *u = ubuf; ++ while (count > 0) { ++ if (__put_user(getreg(target, pos), u++)) ++ return -EFAULT; ++ count -= sizeof(*u); ++ pos += sizeof(*u); ++ } ++ } ++ ++ return 0; ++} ++ ++static int genregs_set(struct task_struct *target, ++ const struct user_regset *regset, ++ unsigned int pos, unsigned int count, ++ const void *kbuf, const void __user *ubuf) ++{ ++ int ret = 0; ++ if (kbuf) { ++ const unsigned long *k = kbuf; ++ while (count > 0 && !ret) { ++ ret = putreg(target, pos, *k++); ++ count -= sizeof(*k); ++ pos += sizeof(*k); ++ } ++ } else { ++ const unsigned long __user *u = ubuf; ++ while (count > 0 && !ret) { ++ unsigned long word; ++ ret = __get_user(word, u++); ++ if (ret) ++ break; ++ ret = putreg(target, pos, word); ++ count -= sizeof(*u); ++ pos += sizeof(*u); ++ } ++ } ++ return ret; ++} ++ ++/* ++ * This function is trivial and will be inlined by the compiler. ++ * Having it separates the implementation details of debug ++ * registers from the interface details of ptrace. 
++ */ ++static unsigned long ptrace_get_debugreg(struct task_struct *child, int n) ++{ ++ switch (n) { ++ case 0: return child->thread.debugreg0; ++ case 1: return child->thread.debugreg1; ++ case 2: return child->thread.debugreg2; ++ case 3: return child->thread.debugreg3; ++ case 6: return child->thread.debugreg6; ++ case 7: return child->thread.debugreg7; ++ } ++ return 0; ++} ++ ++static int ptrace_set_debugreg(struct task_struct *child, ++ int n, unsigned long data) ++{ ++ int i; ++ ++ if (unlikely(n == 4 || n == 5)) ++ return -EIO; ++ ++ if (n < 4 && unlikely(data >= debugreg_addr_limit(child))) ++ return -EIO; ++ ++ switch (n) { ++ case 0: child->thread.debugreg0 = data; break; ++ case 1: child->thread.debugreg1 = data; break; ++ case 2: child->thread.debugreg2 = data; break; ++ case 3: child->thread.debugreg3 = data; break; ++ ++ case 6: ++ if ((data & ~0xffffffffUL) != 0) ++ return -EIO; ++ child->thread.debugreg6 = data; ++ break; ++ ++ case 7: ++ /* ++ * Sanity-check data. Take one half-byte at once with ++ * check = (val >> (16 + 4*i)) & 0xf. It contains the ++ * R/Wi and LENi bits; bits 0 and 1 are R/Wi, and bits ++ * 2 and 3 are LENi. Given a list of invalid values, ++ * we do mask |= 1 << invalid_value, so that ++ * (mask >> check) & 1 is a correct test for invalid ++ * values. ++ * ++ * R/Wi contains the type of the breakpoint / ++ * watchpoint, LENi contains the length of the watched ++ * data in the watchpoint case. ++ * ++ * The invalid values are: ++ * - LENi == 0x10 (undefined), so mask |= 0x0f00. [32-bit] ++ * - R/Wi == 0x10 (break on I/O reads or writes), so ++ * mask |= 0x4444. ++ * - R/Wi == 0x00 && LENi != 0x00, so we have mask |= ++ * 0x1110. ++ * ++ * Finally, mask = 0x0f00 | 0x4444 | 0x1110 == 0x5f54. ++ * ++ * See the Intel Manual "System Programming Guide", ++ * 15.2.4 ++ * ++ * Note that LENi == 0x10 is defined on x86_64 in long ++ * mode (i.e. even for 32-bit userspace software, but ++ * 64-bit kernel), so the x86_64 mask value is 0x5454. ++ * See the AMD manual no. 
24593 (AMD64 System Programming) ++ */ ++#ifdef CONFIG_X86_32 ++#define DR7_MASK 0x5f54 ++#else ++#define DR7_MASK 0x5554 ++#endif ++ data &= ~DR_CONTROL_RESERVED; ++ for (i = 0; i < 4; i++) ++ if ((DR7_MASK >> ((data >> (16 + 4*i)) & 0xf)) & 1) ++ return -EIO; ++ child->thread.debugreg7 = data; ++ if (data) ++ set_tsk_thread_flag(child, TIF_DEBUG); ++ else ++ clear_tsk_thread_flag(child, TIF_DEBUG); ++ break; ++ } ++ ++ return 0; ++} ++ ++static int ptrace_bts_get_size(struct task_struct *child) ++{ ++ if (!child->thread.ds_area_msr) ++ return -ENXIO; ++ ++ return ds_get_bts_index((void *)child->thread.ds_area_msr); ++} ++ ++static int ptrace_bts_read_record(struct task_struct *child, ++ long index, ++ struct bts_struct __user *out) ++{ ++ struct bts_struct ret; ++ int retval; ++ int bts_end; ++ int bts_index; ++ ++ if (!child->thread.ds_area_msr) ++ return -ENXIO; ++ ++ if (index < 0) ++ return -EINVAL; ++ ++ bts_end = ds_get_bts_end((void *)child->thread.ds_area_msr); ++ if (bts_end <= index) ++ return -EINVAL; ++ ++ /* translate the ptrace bts index into the ds bts index */ ++ bts_index = ds_get_bts_index((void *)child->thread.ds_area_msr); ++ bts_index -= (index + 1); ++ if (bts_index < 0) ++ bts_index += bts_end; ++ ++ retval = ds_read_bts((void *)child->thread.ds_area_msr, ++ bts_index, &ret); ++ if (retval < 0) ++ return retval; ++ ++ if (copy_to_user(out, &ret, sizeof(ret))) ++ return -EFAULT; ++ ++ return sizeof(ret); ++} ++ ++static int ptrace_bts_write_record(struct task_struct *child, ++ const struct bts_struct *in) ++{ ++ int retval; ++ ++ if (!child->thread.ds_area_msr) ++ return -ENXIO; ++ ++ retval = ds_write_bts((void *)child->thread.ds_area_msr, in); ++ if (retval) ++ return retval; ++ ++ return sizeof(*in); ++} ++ ++static int ptrace_bts_clear(struct task_struct *child) ++{ ++ if (!child->thread.ds_area_msr) ++ return -ENXIO; ++ ++ return ds_clear((void *)child->thread.ds_area_msr); ++} ++ ++static int ptrace_bts_drain(struct task_struct *child, ++ long size, ++ struct bts_struct __user *out) ++{ ++ int end, i; ++ void *ds = (void *)child->thread.ds_area_msr; ++ ++ if (!ds) ++ return -ENXIO; ++ ++ end = ds_get_bts_index(ds); ++ if (end <= 0) ++ return end; ++ ++ if (size < (end * sizeof(struct bts_struct))) ++ return -EIO; ++ ++ for (i = 0; i < end; i++, out++) { ++ struct bts_struct ret; ++ int retval; ++ ++ retval = ds_read_bts(ds, i, &ret); ++ if (retval < 0) ++ return retval; ++ ++ if (copy_to_user(out, &ret, sizeof(ret))) ++ return -EFAULT; ++ } ++ ++ ds_clear(ds); ++ ++ return end; ++} ++ ++static int ptrace_bts_realloc(struct task_struct *child, ++ int size, int reduce_size) ++{ ++ unsigned long rlim, vm; ++ int ret, old_size; ++ ++ if (size < 0) ++ return -EINVAL; ++ ++ old_size = ds_get_bts_size((void *)child->thread.ds_area_msr); ++ if (old_size < 0) ++ return old_size; ++ ++ ret = ds_free((void **)&child->thread.ds_area_msr); ++ if (ret < 0) ++ goto out; ++ ++ size >>= PAGE_SHIFT; ++ old_size >>= PAGE_SHIFT; ++ ++ current->mm->total_vm -= old_size; ++ current->mm->locked_vm -= old_size; ++ ++ if (size == 0) ++ goto out; ++ ++ rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT; ++ vm = current->mm->total_vm + size; ++ if (rlim < vm) { ++ ret = -ENOMEM; ++ ++ if (!reduce_size) ++ goto out; ++ ++ size = rlim - current->mm->total_vm; ++ if (size <= 0) ++ goto out; ++ } ++ ++ rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT; ++ vm = current->mm->locked_vm + size; ++ if (rlim < vm) { ++ ret = -ENOMEM; ++ ++ if (!reduce_size) ++ goto 
out; ++ ++ size = rlim - current->mm->locked_vm; ++ if (size <= 0) ++ goto out; ++ } ++ ++ ret = ds_allocate((void **)&child->thread.ds_area_msr, ++ size << PAGE_SHIFT); ++ if (ret < 0) ++ goto out; ++ ++ current->mm->total_vm += size; ++ current->mm->locked_vm += size; ++ ++out: ++ if (child->thread.ds_area_msr) ++ set_tsk_thread_flag(child, TIF_DS_AREA_MSR); ++ else ++ clear_tsk_thread_flag(child, TIF_DS_AREA_MSR); ++ ++ return ret; ++} ++ ++static int ptrace_bts_config(struct task_struct *child, ++ long cfg_size, ++ const struct ptrace_bts_config __user *ucfg) ++{ ++ struct ptrace_bts_config cfg; ++ int bts_size, ret = 0; ++ void *ds; ++ ++ if (cfg_size < sizeof(cfg)) ++ return -EIO; ++ ++ if (copy_from_user(&cfg, ucfg, sizeof(cfg))) ++ return -EFAULT; ++ ++ if ((int)cfg.size < 0) ++ return -EINVAL; ++ ++ bts_size = 0; ++ ds = (void *)child->thread.ds_area_msr; ++ if (ds) { ++ bts_size = ds_get_bts_size(ds); ++ if (bts_size < 0) ++ return bts_size; ++ } ++ cfg.size = PAGE_ALIGN(cfg.size); ++ ++ if (bts_size != cfg.size) { ++ ret = ptrace_bts_realloc(child, cfg.size, ++ cfg.flags & PTRACE_BTS_O_CUT_SIZE); ++ if (ret < 0) ++ goto errout; ++ ++ ds = (void *)child->thread.ds_area_msr; ++ } ++ ++ if (cfg.flags & PTRACE_BTS_O_SIGNAL) ++ ret = ds_set_overflow(ds, DS_O_SIGNAL); ++ else ++ ret = ds_set_overflow(ds, DS_O_WRAP); ++ if (ret < 0) ++ goto errout; ++ ++ if (cfg.flags & PTRACE_BTS_O_TRACE) ++ child->thread.debugctlmsr |= ds_debugctl_mask(); ++ else ++ child->thread.debugctlmsr &= ~ds_debugctl_mask(); ++ ++ if (cfg.flags & PTRACE_BTS_O_SCHED) ++ set_tsk_thread_flag(child, TIF_BTS_TRACE_TS); ++ else ++ clear_tsk_thread_flag(child, TIF_BTS_TRACE_TS); ++ ++ ret = sizeof(cfg); ++ ++out: ++ if (child->thread.debugctlmsr) ++ set_tsk_thread_flag(child, TIF_DEBUGCTLMSR); ++ else ++ clear_tsk_thread_flag(child, TIF_DEBUGCTLMSR); ++ ++ return ret; ++ ++errout: ++ child->thread.debugctlmsr &= ~ds_debugctl_mask(); ++ clear_tsk_thread_flag(child, TIF_BTS_TRACE_TS); ++ goto out; ++} ++ ++static int ptrace_bts_status(struct task_struct *child, ++ long cfg_size, ++ struct ptrace_bts_config __user *ucfg) ++{ ++ void *ds = (void *)child->thread.ds_area_msr; ++ struct ptrace_bts_config cfg; ++ ++ if (cfg_size < sizeof(cfg)) ++ return -EIO; ++ ++ memset(&cfg, 0, sizeof(cfg)); ++ ++ if (ds) { ++ cfg.size = ds_get_bts_size(ds); ++ ++ if (ds_get_overflow(ds) == DS_O_SIGNAL) ++ cfg.flags |= PTRACE_BTS_O_SIGNAL; ++ ++ if (test_tsk_thread_flag(child, TIF_DEBUGCTLMSR) && ++ child->thread.debugctlmsr & ds_debugctl_mask()) ++ cfg.flags |= PTRACE_BTS_O_TRACE; ++ ++ if (test_tsk_thread_flag(child, TIF_BTS_TRACE_TS)) ++ cfg.flags |= PTRACE_BTS_O_SCHED; ++ } ++ ++ cfg.bts_size = sizeof(struct bts_struct); ++ ++ if (copy_to_user(ucfg, &cfg, sizeof(cfg))) ++ return -EFAULT; ++ ++ return sizeof(cfg); ++} ++ ++void ptrace_bts_take_timestamp(struct task_struct *tsk, ++ enum bts_qualifier qualifier) ++{ ++ struct bts_struct rec = { ++ .qualifier = qualifier, ++ .variant.jiffies = jiffies_64 ++ }; ++ ++ ptrace_bts_write_record(tsk, &rec); ++} ++ ++/* ++ * Called by kernel/ptrace.c when detaching.. ++ * ++ * Make sure the single step bit is not set. 
++ */ ++void ptrace_disable(struct task_struct *child) ++{ ++ user_disable_single_step(child); ++#ifdef TIF_SYSCALL_EMU ++ clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); ++#endif ++ if (child->thread.ds_area_msr) { ++ ptrace_bts_realloc(child, 0, 0); ++ child->thread.debugctlmsr &= ~ds_debugctl_mask(); ++ if (!child->thread.debugctlmsr) ++ clear_tsk_thread_flag(child, TIF_DEBUGCTLMSR); ++ clear_tsk_thread_flag(child, TIF_BTS_TRACE_TS); ++ } ++} ++ ++#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION ++static const struct user_regset_view user_x86_32_view; /* Initialized below. */ ++#endif ++ ++long arch_ptrace(struct task_struct *child, long request, long addr, long data) ++{ ++ int ret; ++ unsigned long __user *datap = (unsigned long __user *)data; ++ ++ switch (request) { ++ /* read the word at location addr in the USER area. */ ++ case PTRACE_PEEKUSR: { ++ unsigned long tmp; ++ ++ ret = -EIO; ++ if ((addr & (sizeof(data) - 1)) || addr < 0 || ++ addr >= sizeof(struct user)) ++ break; ++ ++ tmp = 0; /* Default return condition */ ++ if (addr < sizeof(struct user_regs_struct)) ++ tmp = getreg(child, addr); ++ else if (addr >= offsetof(struct user, u_debugreg[0]) && ++ addr <= offsetof(struct user, u_debugreg[7])) { ++ addr -= offsetof(struct user, u_debugreg[0]); ++ tmp = ptrace_get_debugreg(child, addr / sizeof(data)); ++ } ++ ret = put_user(tmp, datap); ++ break; ++ } ++ ++ case PTRACE_POKEUSR: /* write the word at location addr in the USER area */ ++ ret = -EIO; ++ if ((addr & (sizeof(data) - 1)) || addr < 0 || ++ addr >= sizeof(struct user)) ++ break; ++ ++ if (addr < sizeof(struct user_regs_struct)) ++ ret = putreg(child, addr, data); ++ else if (addr >= offsetof(struct user, u_debugreg[0]) && ++ addr <= offsetof(struct user, u_debugreg[7])) { ++ addr -= offsetof(struct user, u_debugreg[0]); ++ ret = ptrace_set_debugreg(child, ++ addr / sizeof(data), data); ++ } ++ break; ++ ++ case PTRACE_GETREGS: /* Get all gp regs from the child. */ ++ return copy_regset_to_user(child, ++ task_user_regset_view(current), ++ REGSET_GENERAL, ++ 0, sizeof(struct user_regs_struct), ++ datap); ++ ++ case PTRACE_SETREGS: /* Set all gp regs in the child. */ ++ return copy_regset_from_user(child, ++ task_user_regset_view(current), ++ REGSET_GENERAL, ++ 0, sizeof(struct user_regs_struct), ++ datap); ++ ++ case PTRACE_GETFPREGS: /* Get the child FPU state. */ ++ return copy_regset_to_user(child, ++ task_user_regset_view(current), ++ REGSET_FP, ++ 0, sizeof(struct user_i387_struct), ++ datap); ++ ++ case PTRACE_SETFPREGS: /* Set the child FPU state. */ ++ return copy_regset_from_user(child, ++ task_user_regset_view(current), ++ REGSET_FP, ++ 0, sizeof(struct user_i387_struct), ++ datap); ++ ++#ifdef CONFIG_X86_32 ++ case PTRACE_GETFPXREGS: /* Get the child extended FPU state. */ ++ return copy_regset_to_user(child, &user_x86_32_view, ++ REGSET_XFP, ++ 0, sizeof(struct user_fxsr_struct), ++ datap); ++ ++ case PTRACE_SETFPXREGS: /* Set the child extended FPU state. 
*/ ++ return copy_regset_from_user(child, &user_x86_32_view, ++ REGSET_XFP, ++ 0, sizeof(struct user_fxsr_struct), ++ datap); ++#endif ++ ++#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION ++ case PTRACE_GET_THREAD_AREA: ++ if (addr < 0) ++ return -EIO; ++ ret = do_get_thread_area(child, addr, ++ (struct user_desc __user *) data); ++ break; ++ ++ case PTRACE_SET_THREAD_AREA: ++ if (addr < 0) ++ return -EIO; ++ ret = do_set_thread_area(child, addr, ++ (struct user_desc __user *) data, 0); ++ break; ++#endif ++ ++#ifdef CONFIG_X86_64 ++ /* normal 64bit interface to access TLS data. ++ Works just like arch_prctl, except that the arguments ++ are reversed. */ ++ case PTRACE_ARCH_PRCTL: ++ ret = do_arch_prctl(child, data, addr); ++ break; ++#endif ++ ++ case PTRACE_BTS_CONFIG: ++ ret = ptrace_bts_config ++ (child, data, (struct ptrace_bts_config __user *)addr); ++ break; ++ ++ case PTRACE_BTS_STATUS: ++ ret = ptrace_bts_status ++ (child, data, (struct ptrace_bts_config __user *)addr); ++ break; ++ ++ case PTRACE_BTS_SIZE: ++ ret = ptrace_bts_get_size(child); ++ break; ++ ++ case PTRACE_BTS_GET: ++ ret = ptrace_bts_read_record ++ (child, data, (struct bts_struct __user *) addr); ++ break; ++ ++ case PTRACE_BTS_CLEAR: ++ ret = ptrace_bts_clear(child); ++ break; ++ ++ case PTRACE_BTS_DRAIN: ++ ret = ptrace_bts_drain ++ (child, data, (struct bts_struct __user *) addr); ++ break; ++ ++ default: ++ ret = ptrace_request(child, request, addr, data); ++ break; ++ } ++ ++ return ret; ++} ++ ++#ifdef CONFIG_IA32_EMULATION ++ ++#include ++#include ++#include ++#include ++ ++#define R32(l,q) \ ++ case offsetof(struct user32, regs.l): \ ++ regs->q = value; break ++ ++#define SEG32(rs) \ ++ case offsetof(struct user32, regs.rs): \ ++ return set_segment_reg(child, \ ++ offsetof(struct user_regs_struct, rs), \ ++ value); \ ++ break ++ ++static int putreg32(struct task_struct *child, unsigned regno, u32 value) ++{ ++ struct pt_regs *regs = task_pt_regs(child); ++ ++ switch (regno) { ++ ++ SEG32(cs); ++ SEG32(ds); ++ SEG32(es); ++ SEG32(fs); ++ SEG32(gs); ++ SEG32(ss); ++ ++ R32(ebx, bx); ++ R32(ecx, cx); ++ R32(edx, dx); ++ R32(edi, di); ++ R32(esi, si); ++ R32(ebp, bp); ++ R32(eax, ax); ++ R32(orig_eax, orig_ax); ++ R32(eip, ip); ++ R32(esp, sp); ++ ++ case offsetof(struct user32, regs.eflags): ++ return set_flags(child, value); ++ ++ case offsetof(struct user32, u_debugreg[0]) ... 
++ offsetof(struct user32, u_debugreg[7]): ++ regno -= offsetof(struct user32, u_debugreg[0]); ++ return ptrace_set_debugreg(child, regno / 4, value); ++ ++ default: ++ if (regno > sizeof(struct user32) || (regno & 3)) ++ return -EIO; ++ ++ /* ++ * Other dummy fields in the virtual user structure ++ * are ignored ++ */ ++ break; ++ } ++ return 0; ++} ++ ++#undef R32 ++#undef SEG32 ++ ++#define R32(l,q) \ ++ case offsetof(struct user32, regs.l): \ ++ *val = regs->q; break ++ ++#define SEG32(rs) \ ++ case offsetof(struct user32, regs.rs): \ ++ *val = get_segment_reg(child, \ ++ offsetof(struct user_regs_struct, rs)); \ ++ break ++ ++static int getreg32(struct task_struct *child, unsigned regno, u32 *val) ++{ ++ struct pt_regs *regs = task_pt_regs(child); ++ ++ switch (regno) { ++ ++ SEG32(ds); ++ SEG32(es); ++ SEG32(fs); ++ SEG32(gs); ++ ++ R32(cs, cs); ++ R32(ss, ss); ++ R32(ebx, bx); ++ R32(ecx, cx); ++ R32(edx, dx); ++ R32(edi, di); ++ R32(esi, si); ++ R32(ebp, bp); ++ R32(eax, ax); ++ R32(orig_eax, orig_ax); ++ R32(eip, ip); ++ R32(esp, sp); ++ ++ case offsetof(struct user32, regs.eflags): ++ *val = get_flags(child); ++ break; ++ ++ case offsetof(struct user32, u_debugreg[0]) ... ++ offsetof(struct user32, u_debugreg[7]): ++ regno -= offsetof(struct user32, u_debugreg[0]); ++ *val = ptrace_get_debugreg(child, regno / 4); ++ break; ++ ++ default: ++ if (regno > sizeof(struct user32) || (regno & 3)) ++ return -EIO; ++ ++ /* ++ * Other dummy fields in the virtual user structure ++ * are ignored ++ */ ++ *val = 0; ++ break; ++ } ++ return 0; ++} ++ ++#undef R32 ++#undef SEG32 ++ ++static int genregs32_get(struct task_struct *target, ++ const struct user_regset *regset, ++ unsigned int pos, unsigned int count, ++ void *kbuf, void __user *ubuf) ++{ ++ if (kbuf) { ++ compat_ulong_t *k = kbuf; ++ while (count > 0) { ++ getreg32(target, pos, k++); ++ count -= sizeof(*k); ++ pos += sizeof(*k); ++ } ++ } else { ++ compat_ulong_t __user *u = ubuf; ++ while (count > 0) { ++ compat_ulong_t word; ++ getreg32(target, pos, &word); ++ if (__put_user(word, u++)) ++ return -EFAULT; ++ count -= sizeof(*u); ++ pos += sizeof(*u); ++ } ++ } ++ ++ return 0; ++} ++ ++static int genregs32_set(struct task_struct *target, ++ const struct user_regset *regset, ++ unsigned int pos, unsigned int count, ++ const void *kbuf, const void __user *ubuf) ++{ ++ int ret = 0; ++ if (kbuf) { ++ const compat_ulong_t *k = kbuf; ++ while (count > 0 && !ret) { ++ ret = putreg(target, pos, *k++); ++ count -= sizeof(*k); ++ pos += sizeof(*k); ++ } ++ } else { ++ const compat_ulong_t __user *u = ubuf; ++ while (count > 0 && !ret) { ++ compat_ulong_t word; ++ ret = __get_user(word, u++); ++ if (ret) ++ break; ++ ret = putreg(target, pos, word); ++ count -= sizeof(*u); ++ pos += sizeof(*u); ++ } ++ } ++ return ret; ++} ++ ++static long ptrace32_siginfo(unsigned request, u32 pid, u32 addr, u32 data) ++{ ++ siginfo_t __user *si = compat_alloc_user_space(sizeof(siginfo_t)); ++ compat_siginfo_t __user *si32 = compat_ptr(data); ++ siginfo_t ssi; ++ int ret; ++ ++ if (request == PTRACE_SETSIGINFO) { ++ memset(&ssi, 0, sizeof(siginfo_t)); ++ ret = copy_siginfo_from_user32(&ssi, si32); ++ if (ret) ++ return ret; ++ if (copy_to_user(si, &ssi, sizeof(siginfo_t))) ++ return -EFAULT; ++ } ++ ret = sys_ptrace(request, pid, addr, (unsigned long)si); ++ if (ret) ++ return ret; ++ if (request == PTRACE_GETSIGINFO) { ++ if (copy_from_user(&ssi, si, sizeof(siginfo_t))) ++ return -EFAULT; ++ ret = copy_siginfo_to_user32(si32, &ssi); ++ } ++ return ret; ++} 
++ ++asmlinkage long sys32_ptrace(long request, u32 pid, u32 addr, u32 data) ++{ ++ struct task_struct *child; ++ struct pt_regs *childregs; ++ void __user *datap = compat_ptr(data); ++ int ret; ++ __u32 val; ++ ++ switch (request) { ++ case PTRACE_TRACEME: ++ case PTRACE_ATTACH: ++ case PTRACE_KILL: ++ case PTRACE_CONT: ++ case PTRACE_SINGLESTEP: ++ case PTRACE_SINGLEBLOCK: ++ case PTRACE_DETACH: ++ case PTRACE_SYSCALL: ++ case PTRACE_OLDSETOPTIONS: ++ case PTRACE_SETOPTIONS: ++ case PTRACE_SET_THREAD_AREA: ++ case PTRACE_GET_THREAD_AREA: ++ case PTRACE_BTS_CONFIG: ++ case PTRACE_BTS_STATUS: ++ case PTRACE_BTS_SIZE: ++ case PTRACE_BTS_GET: ++ case PTRACE_BTS_CLEAR: ++ case PTRACE_BTS_DRAIN: ++ return sys_ptrace(request, pid, addr, data); ++ ++ default: ++ return -EINVAL; ++ ++ case PTRACE_PEEKTEXT: ++ case PTRACE_PEEKDATA: ++ case PTRACE_POKEDATA: ++ case PTRACE_POKETEXT: ++ case PTRACE_POKEUSR: ++ case PTRACE_PEEKUSR: ++ case PTRACE_GETREGS: ++ case PTRACE_SETREGS: ++ case PTRACE_SETFPREGS: ++ case PTRACE_GETFPREGS: ++ case PTRACE_SETFPXREGS: ++ case PTRACE_GETFPXREGS: ++ case PTRACE_GETEVENTMSG: ++ break; ++ ++ case PTRACE_SETSIGINFO: ++ case PTRACE_GETSIGINFO: ++ return ptrace32_siginfo(request, pid, addr, data); ++ } ++ ++ child = ptrace_get_task_struct(pid); ++ if (IS_ERR(child)) ++ return PTR_ERR(child); ++ ++ ret = ptrace_check_attach(child, request == PTRACE_KILL); ++ if (ret < 0) ++ goto out; ++ ++ childregs = task_pt_regs(child); ++ ++ switch (request) { ++ case PTRACE_PEEKUSR: ++ ret = getreg32(child, addr, &val); ++ if (ret == 0) ++ ret = put_user(val, (__u32 __user *)datap); ++ break; ++ ++ case PTRACE_POKEUSR: ++ ret = putreg32(child, addr, data); ++ break; ++ ++ case PTRACE_GETREGS: /* Get all gp regs from the child. */ ++ return copy_regset_to_user(child, &user_x86_32_view, ++ REGSET_GENERAL, ++ 0, sizeof(struct user_regs_struct32), ++ datap); ++ ++ case PTRACE_SETREGS: /* Set all gp regs in the child. */ ++ return copy_regset_from_user(child, &user_x86_32_view, ++ REGSET_GENERAL, 0, ++ sizeof(struct user_regs_struct32), ++ datap); ++ ++ case PTRACE_GETFPREGS: /* Get the child FPU state. */ ++ return copy_regset_to_user(child, &user_x86_32_view, ++ REGSET_FP, 0, ++ sizeof(struct user_i387_ia32_struct), ++ datap); ++ ++ case PTRACE_SETFPREGS: /* Set the child FPU state. */ ++ return copy_regset_from_user( ++ child, &user_x86_32_view, REGSET_FP, ++ 0, sizeof(struct user_i387_ia32_struct), datap); ++ ++ case PTRACE_GETFPXREGS: /* Get the child extended FPU state. */ ++ return copy_regset_to_user(child, &user_x86_32_view, ++ REGSET_XFP, 0, ++ sizeof(struct user32_fxsr_struct), ++ datap); ++ ++ case PTRACE_SETFPXREGS: /* Set the child extended FPU state. 
*/ ++ return copy_regset_from_user(child, &user_x86_32_view, ++ REGSET_XFP, 0, ++ sizeof(struct user32_fxsr_struct), ++ datap); ++ ++ default: ++ return compat_ptrace_request(child, request, addr, data); ++ } ++ ++ out: ++ put_task_struct(child); ++ return ret; ++} ++ ++#endif /* CONFIG_IA32_EMULATION */ ++ ++#ifdef CONFIG_X86_64 ++ ++static const struct user_regset x86_64_regsets[] = { ++ [REGSET_GENERAL] = { ++ .core_note_type = NT_PRSTATUS, ++ .n = sizeof(struct user_regs_struct) / sizeof(long), ++ .size = sizeof(long), .align = sizeof(long), ++ .get = genregs_get, .set = genregs_set ++ }, ++ [REGSET_FP] = { ++ .core_note_type = NT_PRFPREG, ++ .n = sizeof(struct user_i387_struct) / sizeof(long), ++ .size = sizeof(long), .align = sizeof(long), ++ .active = xfpregs_active, .get = xfpregs_get, .set = xfpregs_set ++ }, ++}; ++ ++static const struct user_regset_view user_x86_64_view = { ++ .name = "x86_64", .e_machine = EM_X86_64, ++ .regsets = x86_64_regsets, .n = ARRAY_SIZE(x86_64_regsets) ++}; ++ ++#else /* CONFIG_X86_32 */ ++ ++#define user_regs_struct32 user_regs_struct ++#define genregs32_get genregs_get ++#define genregs32_set genregs_set ++ ++#endif /* CONFIG_X86_64 */ ++ ++#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION ++static const struct user_regset x86_32_regsets[] = { ++ [REGSET_GENERAL] = { ++ .core_note_type = NT_PRSTATUS, ++ .n = sizeof(struct user_regs_struct32) / sizeof(u32), ++ .size = sizeof(u32), .align = sizeof(u32), ++ .get = genregs32_get, .set = genregs32_set ++ }, ++ [REGSET_FP] = { ++ .core_note_type = NT_PRFPREG, ++ .n = sizeof(struct user_i387_struct) / sizeof(u32), ++ .size = sizeof(u32), .align = sizeof(u32), ++ .active = fpregs_active, .get = fpregs_get, .set = fpregs_set ++ }, ++ [REGSET_XFP] = { ++ .core_note_type = NT_PRXFPREG, ++ .n = sizeof(struct user_i387_struct) / sizeof(u32), ++ .size = sizeof(u32), .align = sizeof(u32), ++ .active = xfpregs_active, .get = xfpregs_get, .set = xfpregs_set ++ }, ++ [REGSET_TLS] = { ++ .core_note_type = NT_386_TLS, ++ .n = GDT_ENTRY_TLS_ENTRIES, .bias = GDT_ENTRY_TLS_MIN, ++ .size = sizeof(struct user_desc), ++ .align = sizeof(struct user_desc), ++ .active = regset_tls_active, ++ .get = regset_tls_get, .set = regset_tls_set ++ }, ++}; ++ ++static const struct user_regset_view user_x86_32_view = { ++ .name = "i386", .e_machine = EM_386, ++ .regsets = x86_32_regsets, .n = ARRAY_SIZE(x86_32_regsets) ++}; ++#endif ++ ++const struct user_regset_view *task_user_regset_view(struct task_struct *task) ++{ ++#ifdef CONFIG_IA32_EMULATION ++ if (test_tsk_thread_flag(task, TIF_IA32)) ++#endif ++#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION ++ return &user_x86_32_view; ++#endif ++#ifdef CONFIG_X86_64 ++ return &user_x86_64_view; ++#endif ++} ++ ++#ifdef CONFIG_X86_32 ++ ++void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs, int error_code) ++{ ++ struct siginfo info; ++ ++ tsk->thread.trap_no = 1; ++ tsk->thread.error_code = error_code; ++ ++ memset(&info, 0, sizeof(info)); ++ info.si_signo = SIGTRAP; ++ info.si_code = TRAP_BRKPT; ++ ++ /* User-mode ip? */ ++ info.si_addr = user_mode_vm(regs) ? 
(void __user *) regs->ip : NULL; ++ ++ /* Send us the fake SIGTRAP */ ++ force_sig_info(SIGTRAP, &info, tsk); ++} ++ ++/* notification of system call entry/exit ++ * - triggered by current->work.syscall_trace ++ */ ++__attribute__((regparm(3))) ++int do_syscall_trace(struct pt_regs *regs, int entryexit) ++{ ++ int is_sysemu = test_thread_flag(TIF_SYSCALL_EMU); ++ /* ++ * With TIF_SYSCALL_EMU set we want to ignore TIF_SINGLESTEP for syscall ++ * interception ++ */ ++ int is_singlestep = !is_sysemu && test_thread_flag(TIF_SINGLESTEP); ++ int ret = 0; ++ ++ /* do the secure computing check first */ ++ if (!entryexit) ++ secure_computing(regs->orig_ax); ++ ++ if (unlikely(current->audit_context)) { ++ if (entryexit) ++ audit_syscall_exit(AUDITSC_RESULT(regs->ax), ++ regs->ax); ++ /* Debug traps, when using PTRACE_SINGLESTEP, must be sent only ++ * on the syscall exit path. Normally, when TIF_SYSCALL_AUDIT is ++ * not used, entry.S will call us only on syscall exit, not ++ * entry; so when TIF_SYSCALL_AUDIT is used we must avoid ++ * calling send_sigtrap() on syscall entry. ++ * ++ * Note that when PTRACE_SYSEMU_SINGLESTEP is used, ++ * is_singlestep is false, despite his name, so we will still do ++ * the correct thing. ++ */ ++ else if (is_singlestep) ++ goto out; ++ } ++ ++ if (!(current->ptrace & PT_PTRACED)) ++ goto out; ++ ++ /* If a process stops on the 1st tracepoint with SYSCALL_TRACE ++ * and then is resumed with SYSEMU_SINGLESTEP, it will come in ++ * here. We have to check this and return */ ++ if (is_sysemu && entryexit) ++ return 0; ++ ++ /* Fake a debug trap */ ++ if (is_singlestep) ++ send_sigtrap(current, regs, 0); ++ ++ if (!test_thread_flag(TIF_SYSCALL_TRACE) && !is_sysemu) ++ goto out; ++ ++ /* the 0x80 provides a way for the tracing parent to distinguish ++ between a syscall stop and SIGTRAP delivery */ ++ /* Note that the debugger could change the result of test_thread_flag!*/ ++ ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD) ? 0x80:0)); ++ ++ /* ++ * this isn't the same as continuing with a signal, but it will do ++ * for normal use. strace only continues with a signal if the ++ * stopping signal is not SIGTRAP. -brl ++ */ ++ if (current->exit_code) { ++ send_sig(current->exit_code, current, 1); ++ current->exit_code = 0; ++ } ++ ret = is_sysemu; ++out: ++ if (unlikely(current->audit_context) && !entryexit) ++ audit_syscall_entry(AUDIT_ARCH_I386, regs->orig_ax, ++ regs->bx, regs->cx, regs->dx, regs->si); ++ if (ret == 0) ++ return 0; ++ ++ regs->orig_ax = -1; /* force skip of syscall restarting */ ++ if (unlikely(current->audit_context)) ++ audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax); ++ return 1; ++} ++ ++#else /* CONFIG_X86_64 */ ++ ++static void syscall_trace(struct pt_regs *regs) ++{ ++ ++#if 0 ++ printk("trace %s ip %lx sp %lx ax %d origrax %d caller %lx tiflags %x ptrace %x\n", ++ current->comm, ++ regs->ip, regs->sp, regs->ax, regs->orig_ax, __builtin_return_address(0), ++ current_thread_info()->flags, current->ptrace); ++#endif ++ ++ ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD) ++ ? 0x80 : 0)); ++ /* ++ * this isn't the same as continuing with a signal, but it will do ++ * for normal use. strace only continues with a signal if the ++ * stopping signal is not SIGTRAP. 
-brl ++ */ ++ if (current->exit_code) { ++ send_sig(current->exit_code, current, 1); ++ current->exit_code = 0; ++ } ++} ++ ++asmlinkage void syscall_trace_enter(struct pt_regs *regs) ++{ ++ /* do the secure computing check first */ ++ secure_computing(regs->orig_ax); ++ ++ if (test_thread_flag(TIF_SYSCALL_TRACE) ++ && (current->ptrace & PT_PTRACED)) ++ syscall_trace(regs); ++ ++ if (unlikely(current->audit_context)) { ++ if (test_thread_flag(TIF_IA32)) { ++ audit_syscall_entry(AUDIT_ARCH_I386, ++ regs->orig_ax, ++ regs->bx, regs->cx, ++ regs->dx, regs->si); ++ } else { ++ audit_syscall_entry(AUDIT_ARCH_X86_64, ++ regs->orig_ax, ++ regs->di, regs->si, ++ regs->dx, regs->r10); ++ } ++ } ++} ++ ++asmlinkage void syscall_trace_leave(struct pt_regs *regs) ++{ ++ if (unlikely(current->audit_context)) ++ audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax); ++ ++ if ((test_thread_flag(TIF_SYSCALL_TRACE) ++ || test_thread_flag(TIF_SINGLESTEP)) ++ && (current->ptrace & PT_PTRACED)) ++ syscall_trace(regs); ++} ++ ++#endif /* CONFIG_X86_32 */ +diff --git a/arch/x86/kernel/ptrace_32.c b/arch/x86/kernel/ptrace_32.c +deleted file mode 100644 +index ff5431c..0000000 +--- a/arch/x86/kernel/ptrace_32.c ++++ /dev/null +@@ -1,717 +0,0 @@ +-/* By Ross Biro 1/23/92 */ +-/* +- * Pentium III FXSR, SSE support +- * Gareth Hughes , May 2000 +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-/* +- * does not yet catch signals sent when the child dies. +- * in exit.c or in signal.c. +- */ +- +-/* +- * Determines which flags the user has access to [1 = access, 0 = no access]. +- * Prohibits changing ID(21), VIP(20), VIF(19), VM(17), NT(14), IOPL(12-13), IF(9). +- * Also masks reserved bits (31-22, 15, 5, 3, 1). +- */ +-#define FLAG_MASK 0x00050dd5 +- +-/* set's the trap flag. */ +-#define TRAP_FLAG 0x100 +- +-/* +- * Offset of eflags on child stack.. +- */ +-#define EFL_OFFSET offsetof(struct pt_regs, eflags) +- +-static inline struct pt_regs *get_child_regs(struct task_struct *task) +-{ +- void *stack_top = (void *)task->thread.esp0; +- return stack_top - sizeof(struct pt_regs); +-} +- +-/* +- * This routine will get a word off of the processes privileged stack. +- * the offset is bytes into the pt_regs structure on the stack. +- * This routine assumes that all the privileged stacks are in our +- * data space. +- */ +-static inline int get_stack_long(struct task_struct *task, int offset) +-{ +- unsigned char *stack; +- +- stack = (unsigned char *)task->thread.esp0 - sizeof(struct pt_regs); +- stack += offset; +- return (*((int *)stack)); +-} +- +-/* +- * This routine will put a word on the processes privileged stack. +- * the offset is bytes into the pt_regs structure on the stack. +- * This routine assumes that all the privileged stacks are in our +- * data space. 
+- */ +-static inline int put_stack_long(struct task_struct *task, int offset, +- unsigned long data) +-{ +- unsigned char * stack; +- +- stack = (unsigned char *)task->thread.esp0 - sizeof(struct pt_regs); +- stack += offset; +- *(unsigned long *) stack = data; +- return 0; +-} +- +-static int putreg(struct task_struct *child, +- unsigned long regno, unsigned long value) +-{ +- switch (regno >> 2) { +- case GS: +- if (value && (value & 3) != 3) +- return -EIO; +- child->thread.gs = value; +- return 0; +- case DS: +- case ES: +- case FS: +- if (value && (value & 3) != 3) +- return -EIO; +- value &= 0xffff; +- break; +- case SS: +- case CS: +- if ((value & 3) != 3) +- return -EIO; +- value &= 0xffff; +- break; +- case EFL: +- value &= FLAG_MASK; +- value |= get_stack_long(child, EFL_OFFSET) & ~FLAG_MASK; +- break; +- } +- if (regno > FS*4) +- regno -= 1*4; +- put_stack_long(child, regno, value); +- return 0; +-} +- +-static unsigned long getreg(struct task_struct *child, +- unsigned long regno) +-{ +- unsigned long retval = ~0UL; +- +- switch (regno >> 2) { +- case GS: +- retval = child->thread.gs; +- break; +- case DS: +- case ES: +- case FS: +- case SS: +- case CS: +- retval = 0xffff; +- /* fall through */ +- default: +- if (regno > FS*4) +- regno -= 1*4; +- retval &= get_stack_long(child, regno); +- } +- return retval; +-} +- +-#define LDT_SEGMENT 4 +- +-static unsigned long convert_eip_to_linear(struct task_struct *child, struct pt_regs *regs) +-{ +- unsigned long addr, seg; +- +- addr = regs->eip; +- seg = regs->xcs & 0xffff; +- if (regs->eflags & VM_MASK) { +- addr = (addr & 0xffff) + (seg << 4); +- return addr; +- } +- +- /* +- * We'll assume that the code segments in the GDT +- * are all zero-based. That is largely true: the +- * TLS segments are used for data, and the PNPBIOS +- * and APM bios ones we just ignore here. +- */ +- if (seg & LDT_SEGMENT) { +- u32 *desc; +- unsigned long base; +- +- seg &= ~7UL; +- +- mutex_lock(&child->mm->context.lock); +- if (unlikely((seg >> 3) >= child->mm->context.size)) +- addr = -1L; /* bogus selector, access would fault */ +- else { +- desc = child->mm->context.ldt + seg; +- base = ((desc[0] >> 16) | +- ((desc[1] & 0xff) << 16) | +- (desc[1] & 0xff000000)); +- +- /* 16-bit code segment? */ +- if (!((desc[1] >> 22) & 1)) +- addr &= 0xffff; +- addr += base; +- } +- mutex_unlock(&child->mm->context.lock); +- } +- return addr; +-} +- +-static inline int is_setting_trap_flag(struct task_struct *child, struct pt_regs *regs) +-{ +- int i, copied; +- unsigned char opcode[15]; +- unsigned long addr = convert_eip_to_linear(child, regs); +- +- copied = access_process_vm(child, addr, opcode, sizeof(opcode), 0); +- for (i = 0; i < copied; i++) { +- switch (opcode[i]) { +- /* popf and iret */ +- case 0x9d: case 0xcf: +- return 1; +- /* opcode and address size prefixes */ +- case 0x66: case 0x67: +- continue; +- /* irrelevant prefixes (segment overrides and repeats) */ +- case 0x26: case 0x2e: +- case 0x36: case 0x3e: +- case 0x64: case 0x65: +- case 0xf0: case 0xf2: case 0xf3: +- continue; +- +- /* +- * pushf: NOTE! We should probably not let +- * the user see the TF bit being set. But +- * it's more pain than it's worth to avoid +- * it, and a debugger could emulate this +- * all in user space if it _really_ cares. 
+- */ +- case 0x9c: +- default: +- return 0; +- } +- } +- return 0; +-} +- +-static void set_singlestep(struct task_struct *child) +-{ +- struct pt_regs *regs = get_child_regs(child); +- +- /* +- * Always set TIF_SINGLESTEP - this guarantees that +- * we single-step system calls etc.. This will also +- * cause us to set TF when returning to user mode. +- */ +- set_tsk_thread_flag(child, TIF_SINGLESTEP); +- +- /* +- * If TF was already set, don't do anything else +- */ +- if (regs->eflags & TRAP_FLAG) +- return; +- +- /* Set TF on the kernel stack.. */ +- regs->eflags |= TRAP_FLAG; +- +- /* +- * ..but if TF is changed by the instruction we will trace, +- * don't mark it as being "us" that set it, so that we +- * won't clear it by hand later. +- */ +- if (is_setting_trap_flag(child, regs)) +- return; +- +- child->ptrace |= PT_DTRACE; +-} +- +-static void clear_singlestep(struct task_struct *child) +-{ +- /* Always clear TIF_SINGLESTEP... */ +- clear_tsk_thread_flag(child, TIF_SINGLESTEP); +- +- /* But touch TF only if it was set by us.. */ +- if (child->ptrace & PT_DTRACE) { +- struct pt_regs *regs = get_child_regs(child); +- regs->eflags &= ~TRAP_FLAG; +- child->ptrace &= ~PT_DTRACE; +- } +-} +- +-/* +- * Called by kernel/ptrace.c when detaching.. +- * +- * Make sure the single step bit is not set. +- */ +-void ptrace_disable(struct task_struct *child) +-{ +- clear_singlestep(child); +- clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); +-} +- +-/* +- * Perform get_thread_area on behalf of the traced child. +- */ +-static int +-ptrace_get_thread_area(struct task_struct *child, +- int idx, struct user_desc __user *user_desc) +-{ +- struct user_desc info; +- struct desc_struct *desc; +- +-/* +- * Get the current Thread-Local Storage area: +- */ +- +-#define GET_BASE(desc) ( \ +- (((desc)->a >> 16) & 0x0000ffff) | \ +- (((desc)->b << 16) & 0x00ff0000) | \ +- ( (desc)->b & 0xff000000) ) +- +-#define GET_LIMIT(desc) ( \ +- ((desc)->a & 0x0ffff) | \ +- ((desc)->b & 0xf0000) ) +- +-#define GET_32BIT(desc) (((desc)->b >> 22) & 1) +-#define GET_CONTENTS(desc) (((desc)->b >> 10) & 3) +-#define GET_WRITABLE(desc) (((desc)->b >> 9) & 1) +-#define GET_LIMIT_PAGES(desc) (((desc)->b >> 23) & 1) +-#define GET_PRESENT(desc) (((desc)->b >> 15) & 1) +-#define GET_USEABLE(desc) (((desc)->b >> 20) & 1) +- +- if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) +- return -EINVAL; +- +- desc = child->thread.tls_array + idx - GDT_ENTRY_TLS_MIN; +- +- info.entry_number = idx; +- info.base_addr = GET_BASE(desc); +- info.limit = GET_LIMIT(desc); +- info.seg_32bit = GET_32BIT(desc); +- info.contents = GET_CONTENTS(desc); +- info.read_exec_only = !GET_WRITABLE(desc); +- info.limit_in_pages = GET_LIMIT_PAGES(desc); +- info.seg_not_present = !GET_PRESENT(desc); +- info.useable = GET_USEABLE(desc); +- +- if (copy_to_user(user_desc, &info, sizeof(info))) +- return -EFAULT; +- +- return 0; +-} +- +-/* +- * Perform set_thread_area on behalf of the traced child. 
+- */ +-static int +-ptrace_set_thread_area(struct task_struct *child, +- int idx, struct user_desc __user *user_desc) +-{ +- struct user_desc info; +- struct desc_struct *desc; +- +- if (copy_from_user(&info, user_desc, sizeof(info))) +- return -EFAULT; +- +- if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) +- return -EINVAL; +- +- desc = child->thread.tls_array + idx - GDT_ENTRY_TLS_MIN; +- if (LDT_empty(&info)) { +- desc->a = 0; +- desc->b = 0; +- } else { +- desc->a = LDT_entry_a(&info); +- desc->b = LDT_entry_b(&info); +- } +- +- return 0; +-} +- +-long arch_ptrace(struct task_struct *child, long request, long addr, long data) +-{ +- struct user * dummy = NULL; +- int i, ret; +- unsigned long __user *datap = (unsigned long __user *)data; +- +- switch (request) { +- /* when I and D space are separate, these will need to be fixed. */ +- case PTRACE_PEEKTEXT: /* read word at location addr. */ +- case PTRACE_PEEKDATA: +- ret = generic_ptrace_peekdata(child, addr, data); +- break; +- +- /* read the word at location addr in the USER area. */ +- case PTRACE_PEEKUSR: { +- unsigned long tmp; +- +- ret = -EIO; +- if ((addr & 3) || addr < 0 || +- addr > sizeof(struct user) - 3) +- break; +- +- tmp = 0; /* Default return condition */ +- if(addr < FRAME_SIZE*sizeof(long)) +- tmp = getreg(child, addr); +- if(addr >= (long) &dummy->u_debugreg[0] && +- addr <= (long) &dummy->u_debugreg[7]){ +- addr -= (long) &dummy->u_debugreg[0]; +- addr = addr >> 2; +- tmp = child->thread.debugreg[addr]; +- } +- ret = put_user(tmp, datap); +- break; +- } +- +- /* when I and D space are separate, this will have to be fixed. */ +- case PTRACE_POKETEXT: /* write the word at location addr. */ +- case PTRACE_POKEDATA: +- ret = generic_ptrace_pokedata(child, addr, data); +- break; +- +- case PTRACE_POKEUSR: /* write the word at location addr in the USER area */ +- ret = -EIO; +- if ((addr & 3) || addr < 0 || +- addr > sizeof(struct user) - 3) +- break; +- +- if (addr < FRAME_SIZE*sizeof(long)) { +- ret = putreg(child, addr, data); +- break; +- } +- /* We need to be very careful here. We implicitly +- want to modify a portion of the task_struct, and we +- have to be selective about what portions we allow someone +- to modify. */ +- +- ret = -EIO; +- if(addr >= (long) &dummy->u_debugreg[0] && +- addr <= (long) &dummy->u_debugreg[7]){ +- +- if(addr == (long) &dummy->u_debugreg[4]) break; +- if(addr == (long) &dummy->u_debugreg[5]) break; +- if(addr < (long) &dummy->u_debugreg[4] && +- ((unsigned long) data) >= TASK_SIZE-3) break; +- +- /* Sanity-check data. Take one half-byte at once with +- * check = (val >> (16 + 4*i)) & 0xf. It contains the +- * R/Wi and LENi bits; bits 0 and 1 are R/Wi, and bits +- * 2 and 3 are LENi. Given a list of invalid values, +- * we do mask |= 1 << invalid_value, so that +- * (mask >> check) & 1 is a correct test for invalid +- * values. +- * +- * R/Wi contains the type of the breakpoint / +- * watchpoint, LENi contains the length of the watched +- * data in the watchpoint case. +- * +- * The invalid values are: +- * - LENi == 0x10 (undefined), so mask |= 0x0f00. +- * - R/Wi == 0x10 (break on I/O reads or writes), so +- * mask |= 0x4444. +- * - R/Wi == 0x00 && LENi != 0x00, so we have mask |= +- * 0x1110. +- * +- * Finally, mask = 0x0f00 | 0x4444 | 0x1110 == 0x5f54. +- * +- * See the Intel Manual "System Programming Guide", +- * 15.2.4 +- * +- * Note that LENi == 0x10 is defined on x86_64 in long +- * mode (i.e. 
even for 32-bit userspace software, but +- * 64-bit kernel), so the x86_64 mask value is 0x5454. +- * See the AMD manual no. 24593 (AMD64 System +- * Programming)*/ +- +- if(addr == (long) &dummy->u_debugreg[7]) { +- data &= ~DR_CONTROL_RESERVED; +- for(i=0; i<4; i++) +- if ((0x5f54 >> ((data >> (16 + 4*i)) & 0xf)) & 1) +- goto out_tsk; +- if (data) +- set_tsk_thread_flag(child, TIF_DEBUG); +- else +- clear_tsk_thread_flag(child, TIF_DEBUG); +- } +- addr -= (long) &dummy->u_debugreg; +- addr = addr >> 2; +- child->thread.debugreg[addr] = data; +- ret = 0; +- } +- break; +- +- case PTRACE_SYSEMU: /* continue and stop at next syscall, which will not be executed */ +- case PTRACE_SYSCALL: /* continue and stop at next (return from) syscall */ +- case PTRACE_CONT: /* restart after signal. */ +- ret = -EIO; +- if (!valid_signal(data)) +- break; +- if (request == PTRACE_SYSEMU) { +- set_tsk_thread_flag(child, TIF_SYSCALL_EMU); +- clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); +- } else if (request == PTRACE_SYSCALL) { +- set_tsk_thread_flag(child, TIF_SYSCALL_TRACE); +- clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); +- } else { +- clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); +- clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); +- } +- child->exit_code = data; +- /* make sure the single step bit is not set. */ +- clear_singlestep(child); +- wake_up_process(child); +- ret = 0; +- break; +- +-/* +- * make the child exit. Best I can do is send it a sigkill. +- * perhaps it should be put in the status that it wants to +- * exit. +- */ +- case PTRACE_KILL: +- ret = 0; +- if (child->exit_state == EXIT_ZOMBIE) /* already dead */ +- break; +- child->exit_code = SIGKILL; +- /* make sure the single step bit is not set. */ +- clear_singlestep(child); +- wake_up_process(child); +- break; +- +- case PTRACE_SYSEMU_SINGLESTEP: /* Same as SYSEMU, but singlestep if not syscall */ +- case PTRACE_SINGLESTEP: /* set the trap flag. */ +- ret = -EIO; +- if (!valid_signal(data)) +- break; +- +- if (request == PTRACE_SYSEMU_SINGLESTEP) +- set_tsk_thread_flag(child, TIF_SYSCALL_EMU); +- else +- clear_tsk_thread_flag(child, TIF_SYSCALL_EMU); +- +- clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); +- set_singlestep(child); +- child->exit_code = data; +- /* give it a chance to run. */ +- wake_up_process(child); +- ret = 0; +- break; +- +- case PTRACE_GETREGS: { /* Get all gp regs from the child. */ +- if (!access_ok(VERIFY_WRITE, datap, FRAME_SIZE*sizeof(long))) { +- ret = -EIO; +- break; +- } +- for ( i = 0; i < FRAME_SIZE*sizeof(long); i += sizeof(long) ) { +- __put_user(getreg(child, i), datap); +- datap++; +- } +- ret = 0; +- break; +- } +- +- case PTRACE_SETREGS: { /* Set all gp regs in the child. */ +- unsigned long tmp; +- if (!access_ok(VERIFY_READ, datap, FRAME_SIZE*sizeof(long))) { +- ret = -EIO; +- break; +- } +- for ( i = 0; i < FRAME_SIZE*sizeof(long); i += sizeof(long) ) { +- __get_user(tmp, datap); +- putreg(child, i, tmp); +- datap++; +- } +- ret = 0; +- break; +- } +- +- case PTRACE_GETFPREGS: { /* Get the child FPU state. */ +- if (!access_ok(VERIFY_WRITE, datap, +- sizeof(struct user_i387_struct))) { +- ret = -EIO; +- break; +- } +- ret = 0; +- if (!tsk_used_math(child)) +- init_fpu(child); +- get_fpregs((struct user_i387_struct __user *)data, child); +- break; +- } +- +- case PTRACE_SETFPREGS: { /* Set the child FPU state. 
*/ +- if (!access_ok(VERIFY_READ, datap, +- sizeof(struct user_i387_struct))) { +- ret = -EIO; +- break; +- } +- set_stopped_child_used_math(child); +- set_fpregs(child, (struct user_i387_struct __user *)data); +- ret = 0; +- break; +- } +- +- case PTRACE_GETFPXREGS: { /* Get the child extended FPU state. */ +- if (!access_ok(VERIFY_WRITE, datap, +- sizeof(struct user_fxsr_struct))) { +- ret = -EIO; +- break; +- } +- if (!tsk_used_math(child)) +- init_fpu(child); +- ret = get_fpxregs((struct user_fxsr_struct __user *)data, child); +- break; +- } +- +- case PTRACE_SETFPXREGS: { /* Set the child extended FPU state. */ +- if (!access_ok(VERIFY_READ, datap, +- sizeof(struct user_fxsr_struct))) { +- ret = -EIO; +- break; +- } +- set_stopped_child_used_math(child); +- ret = set_fpxregs(child, (struct user_fxsr_struct __user *)data); +- break; +- } +- +- case PTRACE_GET_THREAD_AREA: +- ret = ptrace_get_thread_area(child, addr, +- (struct user_desc __user *) data); +- break; +- +- case PTRACE_SET_THREAD_AREA: +- ret = ptrace_set_thread_area(child, addr, +- (struct user_desc __user *) data); +- break; +- +- default: +- ret = ptrace_request(child, request, addr, data); +- break; +- } +- out_tsk: +- return ret; +-} +- +-void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs, int error_code) +-{ +- struct siginfo info; +- +- tsk->thread.trap_no = 1; +- tsk->thread.error_code = error_code; +- +- memset(&info, 0, sizeof(info)); +- info.si_signo = SIGTRAP; +- info.si_code = TRAP_BRKPT; +- +- /* User-mode eip? */ +- info.si_addr = user_mode_vm(regs) ? (void __user *) regs->eip : NULL; +- +- /* Send us the fake SIGTRAP */ +- force_sig_info(SIGTRAP, &info, tsk); +-} +- +-/* notification of system call entry/exit +- * - triggered by current->work.syscall_trace +- */ +-__attribute__((regparm(3))) +-int do_syscall_trace(struct pt_regs *regs, int entryexit) +-{ +- int is_sysemu = test_thread_flag(TIF_SYSCALL_EMU); +- /* +- * With TIF_SYSCALL_EMU set we want to ignore TIF_SINGLESTEP for syscall +- * interception +- */ +- int is_singlestep = !is_sysemu && test_thread_flag(TIF_SINGLESTEP); +- int ret = 0; +- +- /* do the secure computing check first */ +- if (!entryexit) +- secure_computing(regs->orig_eax); +- +- if (unlikely(current->audit_context)) { +- if (entryexit) +- audit_syscall_exit(AUDITSC_RESULT(regs->eax), +- regs->eax); +- /* Debug traps, when using PTRACE_SINGLESTEP, must be sent only +- * on the syscall exit path. Normally, when TIF_SYSCALL_AUDIT is +- * not used, entry.S will call us only on syscall exit, not +- * entry; so when TIF_SYSCALL_AUDIT is used we must avoid +- * calling send_sigtrap() on syscall entry. +- * +- * Note that when PTRACE_SYSEMU_SINGLESTEP is used, +- * is_singlestep is false, despite his name, so we will still do +- * the correct thing. +- */ +- else if (is_singlestep) +- goto out; +- } +- +- if (!(current->ptrace & PT_PTRACED)) +- goto out; +- +- /* If a process stops on the 1st tracepoint with SYSCALL_TRACE +- * and then is resumed with SYSEMU_SINGLESTEP, it will come in +- * here. 
We have to check this and return */ +- if (is_sysemu && entryexit) +- return 0; +- +- /* Fake a debug trap */ +- if (is_singlestep) +- send_sigtrap(current, regs, 0); +- +- if (!test_thread_flag(TIF_SYSCALL_TRACE) && !is_sysemu) +- goto out; +- +- /* the 0x80 provides a way for the tracing parent to distinguish +- between a syscall stop and SIGTRAP delivery */ +- /* Note that the debugger could change the result of test_thread_flag!*/ +- ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD) ? 0x80:0)); +- +- /* +- * this isn't the same as continuing with a signal, but it will do +- * for normal use. strace only continues with a signal if the +- * stopping signal is not SIGTRAP. -brl +- */ +- if (current->exit_code) { +- send_sig(current->exit_code, current, 1); +- current->exit_code = 0; +- } +- ret = is_sysemu; +-out: +- if (unlikely(current->audit_context) && !entryexit) +- audit_syscall_entry(AUDIT_ARCH_I386, regs->orig_eax, +- regs->ebx, regs->ecx, regs->edx, regs->esi); +- if (ret == 0) +- return 0; +- +- regs->orig_eax = -1; /* force skip of syscall restarting */ +- if (unlikely(current->audit_context)) +- audit_syscall_exit(AUDITSC_RESULT(regs->eax), regs->eax); +- return 1; +-} +diff --git a/arch/x86/kernel/ptrace_64.c b/arch/x86/kernel/ptrace_64.c +deleted file mode 100644 +index 607085f..0000000 +--- a/arch/x86/kernel/ptrace_64.c ++++ /dev/null +@@ -1,621 +0,0 @@ +-/* By Ross Biro 1/23/92 */ +-/* +- * Pentium III FXSR, SSE support +- * Gareth Hughes , May 2000 +- * +- * x86-64 port 2000-2002 Andi Kleen +- */ +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +- +-/* +- * does not yet catch signals sent when the child dies. +- * in exit.c or in signal.c. +- */ +- +-/* +- * Determines which flags the user has access to [1 = access, 0 = no access]. +- * Prohibits changing ID(21), VIP(20), VIF(19), VM(17), IOPL(12-13), IF(9). +- * Also masks reserved bits (63-22, 15, 5, 3, 1). +- */ +-#define FLAG_MASK 0x54dd5UL +- +-/* set's the trap flag. */ +-#define TRAP_FLAG 0x100UL +- +-/* +- * eflags and offset of eflags on child stack.. +- */ +-#define EFLAGS offsetof(struct pt_regs, eflags) +-#define EFL_OFFSET ((int)(EFLAGS-sizeof(struct pt_regs))) +- +-/* +- * this routine will get a word off of the processes privileged stack. +- * the offset is how far from the base addr as stored in the TSS. +- * this routine assumes that all the privileged stacks are in our +- * data space. +- */ +-static inline unsigned long get_stack_long(struct task_struct *task, int offset) +-{ +- unsigned char *stack; +- +- stack = (unsigned char *)task->thread.rsp0; +- stack += offset; +- return (*((unsigned long *)stack)); +-} +- +-/* +- * this routine will put a word on the processes privileged stack. +- * the offset is how far from the base addr as stored in the TSS. +- * this routine assumes that all the privileged stacks are in our +- * data space. 
+- */ +-static inline long put_stack_long(struct task_struct *task, int offset, +- unsigned long data) +-{ +- unsigned char * stack; +- +- stack = (unsigned char *) task->thread.rsp0; +- stack += offset; +- *(unsigned long *) stack = data; +- return 0; +-} +- +-#define LDT_SEGMENT 4 +- +-unsigned long convert_rip_to_linear(struct task_struct *child, struct pt_regs *regs) +-{ +- unsigned long addr, seg; +- +- addr = regs->rip; +- seg = regs->cs & 0xffff; +- +- /* +- * We'll assume that the code segments in the GDT +- * are all zero-based. That is largely true: the +- * TLS segments are used for data, and the PNPBIOS +- * and APM bios ones we just ignore here. +- */ +- if (seg & LDT_SEGMENT) { +- u32 *desc; +- unsigned long base; +- +- seg &= ~7UL; +- +- mutex_lock(&child->mm->context.lock); +- if (unlikely((seg >> 3) >= child->mm->context.size)) +- addr = -1L; /* bogus selector, access would fault */ +- else { +- desc = child->mm->context.ldt + seg; +- base = ((desc[0] >> 16) | +- ((desc[1] & 0xff) << 16) | +- (desc[1] & 0xff000000)); +- +- /* 16-bit code segment? */ +- if (!((desc[1] >> 22) & 1)) +- addr &= 0xffff; +- addr += base; +- } +- mutex_unlock(&child->mm->context.lock); +- } +- +- return addr; +-} +- +-static int is_setting_trap_flag(struct task_struct *child, struct pt_regs *regs) +-{ +- int i, copied; +- unsigned char opcode[15]; +- unsigned long addr = convert_rip_to_linear(child, regs); +- +- copied = access_process_vm(child, addr, opcode, sizeof(opcode), 0); +- for (i = 0; i < copied; i++) { +- switch (opcode[i]) { +- /* popf and iret */ +- case 0x9d: case 0xcf: +- return 1; +- +- /* CHECKME: 64 65 */ +- +- /* opcode and address size prefixes */ +- case 0x66: case 0x67: +- continue; +- /* irrelevant prefixes (segment overrides and repeats) */ +- case 0x26: case 0x2e: +- case 0x36: case 0x3e: +- case 0x64: case 0x65: +- case 0xf2: case 0xf3: +- continue; +- +- case 0x40 ... 0x4f: +- if (regs->cs != __USER_CS) +- /* 32-bit mode: register increment */ +- return 0; +- /* 64-bit mode: REX prefix */ +- continue; +- +- /* CHECKME: f2, f3 */ +- +- /* +- * pushf: NOTE! We should probably not let +- * the user see the TF bit being set. But +- * it's more pain than it's worth to avoid +- * it, and a debugger could emulate this +- * all in user space if it _really_ cares. +- */ +- case 0x9c: +- default: +- return 0; +- } +- } +- return 0; +-} +- +-static void set_singlestep(struct task_struct *child) +-{ +- struct pt_regs *regs = task_pt_regs(child); +- +- /* +- * Always set TIF_SINGLESTEP - this guarantees that +- * we single-step system calls etc.. This will also +- * cause us to set TF when returning to user mode. +- */ +- set_tsk_thread_flag(child, TIF_SINGLESTEP); +- +- /* +- * If TF was already set, don't do anything else +- */ +- if (regs->eflags & TRAP_FLAG) +- return; +- +- /* Set TF on the kernel stack.. */ +- regs->eflags |= TRAP_FLAG; +- +- /* +- * ..but if TF is changed by the instruction we will trace, +- * don't mark it as being "us" that set it, so that we +- * won't clear it by hand later. +- */ +- if (is_setting_trap_flag(child, regs)) +- return; +- +- child->ptrace |= PT_DTRACE; +-} +- +-static void clear_singlestep(struct task_struct *child) +-{ +- /* Always clear TIF_SINGLESTEP... */ +- clear_tsk_thread_flag(child, TIF_SINGLESTEP); +- +- /* But touch TF only if it was set by us.. 
*/ +- if (child->ptrace & PT_DTRACE) { +- struct pt_regs *regs = task_pt_regs(child); +- regs->eflags &= ~TRAP_FLAG; +- child->ptrace &= ~PT_DTRACE; +- } +-} +- +-/* +- * Called by kernel/ptrace.c when detaching.. +- * +- * Make sure the single step bit is not set. +- */ +-void ptrace_disable(struct task_struct *child) +-{ +- clear_singlestep(child); +-} +- +-static int putreg(struct task_struct *child, +- unsigned long regno, unsigned long value) +-{ +- unsigned long tmp; +- +- switch (regno) { +- case offsetof(struct user_regs_struct,fs): +- if (value && (value & 3) != 3) +- return -EIO; +- child->thread.fsindex = value & 0xffff; +- return 0; +- case offsetof(struct user_regs_struct,gs): +- if (value && (value & 3) != 3) +- return -EIO; +- child->thread.gsindex = value & 0xffff; +- return 0; +- case offsetof(struct user_regs_struct,ds): +- if (value && (value & 3) != 3) +- return -EIO; +- child->thread.ds = value & 0xffff; +- return 0; +- case offsetof(struct user_regs_struct,es): +- if (value && (value & 3) != 3) +- return -EIO; +- child->thread.es = value & 0xffff; +- return 0; +- case offsetof(struct user_regs_struct,ss): +- if ((value & 3) != 3) +- return -EIO; +- value &= 0xffff; +- return 0; +- case offsetof(struct user_regs_struct,fs_base): +- if (value >= TASK_SIZE_OF(child)) +- return -EIO; +- child->thread.fs = value; +- return 0; +- case offsetof(struct user_regs_struct,gs_base): +- if (value >= TASK_SIZE_OF(child)) +- return -EIO; +- child->thread.gs = value; +- return 0; +- case offsetof(struct user_regs_struct, eflags): +- value &= FLAG_MASK; +- tmp = get_stack_long(child, EFL_OFFSET); +- tmp &= ~FLAG_MASK; +- value |= tmp; +- break; +- case offsetof(struct user_regs_struct,cs): +- if ((value & 3) != 3) +- return -EIO; +- value &= 0xffff; +- break; +- } +- put_stack_long(child, regno - sizeof(struct pt_regs), value); +- return 0; +-} +- +-static unsigned long getreg(struct task_struct *child, unsigned long regno) +-{ +- unsigned long val; +- switch (regno) { +- case offsetof(struct user_regs_struct, fs): +- return child->thread.fsindex; +- case offsetof(struct user_regs_struct, gs): +- return child->thread.gsindex; +- case offsetof(struct user_regs_struct, ds): +- return child->thread.ds; +- case offsetof(struct user_regs_struct, es): +- return child->thread.es; +- case offsetof(struct user_regs_struct, fs_base): +- return child->thread.fs; +- case offsetof(struct user_regs_struct, gs_base): +- return child->thread.gs; +- default: +- regno = regno - sizeof(struct pt_regs); +- val = get_stack_long(child, regno); +- if (test_tsk_thread_flag(child, TIF_IA32)) +- val &= 0xffffffff; +- return val; +- } +- +-} +- +-long arch_ptrace(struct task_struct *child, long request, long addr, long data) +-{ +- long i, ret; +- unsigned ui; +- +- switch (request) { +- /* when I and D space are separate, these will need to be fixed. */ +- case PTRACE_PEEKTEXT: /* read word at location addr. */ +- case PTRACE_PEEKDATA: +- ret = generic_ptrace_peekdata(child, addr, data); +- break; +- +- /* read the word at location addr in the USER area. */ +- case PTRACE_PEEKUSR: { +- unsigned long tmp; +- +- ret = -EIO; +- if ((addr & 7) || +- addr > sizeof(struct user) - 7) +- break; +- +- switch (addr) { +- case 0 ... 
sizeof(struct user_regs_struct) - sizeof(long): +- tmp = getreg(child, addr); +- break; +- case offsetof(struct user, u_debugreg[0]): +- tmp = child->thread.debugreg0; +- break; +- case offsetof(struct user, u_debugreg[1]): +- tmp = child->thread.debugreg1; +- break; +- case offsetof(struct user, u_debugreg[2]): +- tmp = child->thread.debugreg2; +- break; +- case offsetof(struct user, u_debugreg[3]): +- tmp = child->thread.debugreg3; +- break; +- case offsetof(struct user, u_debugreg[6]): +- tmp = child->thread.debugreg6; +- break; +- case offsetof(struct user, u_debugreg[7]): +- tmp = child->thread.debugreg7; +- break; +- default: +- tmp = 0; +- break; +- } +- ret = put_user(tmp,(unsigned long __user *) data); +- break; +- } +- +- /* when I and D space are separate, this will have to be fixed. */ +- case PTRACE_POKETEXT: /* write the word at location addr. */ +- case PTRACE_POKEDATA: +- ret = generic_ptrace_pokedata(child, addr, data); +- break; +- +- case PTRACE_POKEUSR: /* write the word at location addr in the USER area */ +- { +- int dsize = test_tsk_thread_flag(child, TIF_IA32) ? 3 : 7; +- ret = -EIO; +- if ((addr & 7) || +- addr > sizeof(struct user) - 7) +- break; +- +- switch (addr) { +- case 0 ... sizeof(struct user_regs_struct) - sizeof(long): +- ret = putreg(child, addr, data); +- break; +- /* Disallows to set a breakpoint into the vsyscall */ +- case offsetof(struct user, u_debugreg[0]): +- if (data >= TASK_SIZE_OF(child) - dsize) break; +- child->thread.debugreg0 = data; +- ret = 0; +- break; +- case offsetof(struct user, u_debugreg[1]): +- if (data >= TASK_SIZE_OF(child) - dsize) break; +- child->thread.debugreg1 = data; +- ret = 0; +- break; +- case offsetof(struct user, u_debugreg[2]): +- if (data >= TASK_SIZE_OF(child) - dsize) break; +- child->thread.debugreg2 = data; +- ret = 0; +- break; +- case offsetof(struct user, u_debugreg[3]): +- if (data >= TASK_SIZE_OF(child) - dsize) break; +- child->thread.debugreg3 = data; +- ret = 0; +- break; +- case offsetof(struct user, u_debugreg[6]): +- if (data >> 32) +- break; +- child->thread.debugreg6 = data; +- ret = 0; +- break; +- case offsetof(struct user, u_debugreg[7]): +- /* See arch/i386/kernel/ptrace.c for an explanation of +- * this awkward check.*/ +- data &= ~DR_CONTROL_RESERVED; +- for(i=0; i<4; i++) +- if ((0x5554 >> ((data >> (16 + 4*i)) & 0xf)) & 1) +- break; +- if (i == 4) { +- child->thread.debugreg7 = data; +- if (data) +- set_tsk_thread_flag(child, TIF_DEBUG); +- else +- clear_tsk_thread_flag(child, TIF_DEBUG); +- ret = 0; +- } +- break; +- } +- break; +- } +- case PTRACE_SYSCALL: /* continue and stop at next (return from) syscall */ +- case PTRACE_CONT: /* restart after signal. */ +- +- ret = -EIO; +- if (!valid_signal(data)) +- break; +- if (request == PTRACE_SYSCALL) +- set_tsk_thread_flag(child,TIF_SYSCALL_TRACE); +- else +- clear_tsk_thread_flag(child,TIF_SYSCALL_TRACE); +- clear_tsk_thread_flag(child, TIF_SINGLESTEP); +- child->exit_code = data; +- /* make sure the single step bit is not set. */ +- clear_singlestep(child); +- wake_up_process(child); +- ret = 0; +- break; +- +-#ifdef CONFIG_IA32_EMULATION +- /* This makes only sense with 32bit programs. Allow a +- 64bit debugger to fully examine them too. Better +- don't use it against 64bit processes, use +- PTRACE_ARCH_PRCTL instead. 
*/ +- case PTRACE_SET_THREAD_AREA: { +- struct user_desc __user *p; +- int old; +- p = (struct user_desc __user *)data; +- get_user(old, &p->entry_number); +- put_user(addr, &p->entry_number); +- ret = do_set_thread_area(&child->thread, p); +- put_user(old, &p->entry_number); +- break; +- case PTRACE_GET_THREAD_AREA: +- p = (struct user_desc __user *)data; +- get_user(old, &p->entry_number); +- put_user(addr, &p->entry_number); +- ret = do_get_thread_area(&child->thread, p); +- put_user(old, &p->entry_number); +- break; +- } +-#endif +- /* normal 64bit interface to access TLS data. +- Works just like arch_prctl, except that the arguments +- are reversed. */ +- case PTRACE_ARCH_PRCTL: +- ret = do_arch_prctl(child, data, addr); +- break; +- +-/* +- * make the child exit. Best I can do is send it a sigkill. +- * perhaps it should be put in the status that it wants to +- * exit. +- */ +- case PTRACE_KILL: +- ret = 0; +- if (child->exit_state == EXIT_ZOMBIE) /* already dead */ +- break; +- clear_tsk_thread_flag(child, TIF_SINGLESTEP); +- child->exit_code = SIGKILL; +- /* make sure the single step bit is not set. */ +- clear_singlestep(child); +- wake_up_process(child); +- break; +- +- case PTRACE_SINGLESTEP: /* set the trap flag. */ +- ret = -EIO; +- if (!valid_signal(data)) +- break; +- clear_tsk_thread_flag(child,TIF_SYSCALL_TRACE); +- set_singlestep(child); +- child->exit_code = data; +- /* give it a chance to run. */ +- wake_up_process(child); +- ret = 0; +- break; +- +- case PTRACE_GETREGS: { /* Get all gp regs from the child. */ +- if (!access_ok(VERIFY_WRITE, (unsigned __user *)data, +- sizeof(struct user_regs_struct))) { +- ret = -EIO; +- break; +- } +- ret = 0; +- for (ui = 0; ui < sizeof(struct user_regs_struct); ui += sizeof(long)) { +- ret |= __put_user(getreg(child, ui),(unsigned long __user *) data); +- data += sizeof(long); +- } +- break; +- } +- +- case PTRACE_SETREGS: { /* Set all gp regs in the child. */ +- unsigned long tmp; +- if (!access_ok(VERIFY_READ, (unsigned __user *)data, +- sizeof(struct user_regs_struct))) { +- ret = -EIO; +- break; +- } +- ret = 0; +- for (ui = 0; ui < sizeof(struct user_regs_struct); ui += sizeof(long)) { +- ret = __get_user(tmp, (unsigned long __user *) data); +- if (ret) +- break; +- ret = putreg(child, ui, tmp); +- if (ret) +- break; +- data += sizeof(long); +- } +- break; +- } +- +- case PTRACE_GETFPREGS: { /* Get the child extended FPU state. */ +- if (!access_ok(VERIFY_WRITE, (unsigned __user *)data, +- sizeof(struct user_i387_struct))) { +- ret = -EIO; +- break; +- } +- ret = get_fpregs((struct user_i387_struct __user *)data, child); +- break; +- } +- +- case PTRACE_SETFPREGS: { /* Set the child extended FPU state. */ +- if (!access_ok(VERIFY_READ, (unsigned __user *)data, +- sizeof(struct user_i387_struct))) { +- ret = -EIO; +- break; +- } +- set_stopped_child_used_math(child); +- ret = set_fpregs(child, (struct user_i387_struct __user *)data); +- break; +- } +- +- default: +- ret = ptrace_request(child, request, addr, data); +- break; +- } +- return ret; +-} +- +-static void syscall_trace(struct pt_regs *regs) +-{ +- +-#if 0 +- printk("trace %s rip %lx rsp %lx rax %d origrax %d caller %lx tiflags %x ptrace %x\n", +- current->comm, +- regs->rip, regs->rsp, regs->rax, regs->orig_rax, __builtin_return_address(0), +- current_thread_info()->flags, current->ptrace); +-#endif +- +- ptrace_notify(SIGTRAP | ((current->ptrace & PT_TRACESYSGOOD) +- ? 
0x80 : 0)); +- /* +- * this isn't the same as continuing with a signal, but it will do +- * for normal use. strace only continues with a signal if the +- * stopping signal is not SIGTRAP. -brl +- */ +- if (current->exit_code) { +- send_sig(current->exit_code, current, 1); +- current->exit_code = 0; +- } +-} +- +-asmlinkage void syscall_trace_enter(struct pt_regs *regs) +-{ +- /* do the secure computing check first */ +- secure_computing(regs->orig_rax); +- +- if (test_thread_flag(TIF_SYSCALL_TRACE) +- && (current->ptrace & PT_PTRACED)) +- syscall_trace(regs); +- +- if (unlikely(current->audit_context)) { +- if (test_thread_flag(TIF_IA32)) { +- audit_syscall_entry(AUDIT_ARCH_I386, +- regs->orig_rax, +- regs->rbx, regs->rcx, +- regs->rdx, regs->rsi); +- } else { +- audit_syscall_entry(AUDIT_ARCH_X86_64, +- regs->orig_rax, +- regs->rdi, regs->rsi, +- regs->rdx, regs->r10); +- } +- } +-} +- +-asmlinkage void syscall_trace_leave(struct pt_regs *regs) +-{ +- if (unlikely(current->audit_context)) +- audit_syscall_exit(AUDITSC_RESULT(regs->rax), regs->rax); +- +- if ((test_thread_flag(TIF_SYSCALL_TRACE) +- || test_thread_flag(TIF_SINGLESTEP)) +- && (current->ptrace & PT_PTRACED)) +- syscall_trace(regs); +-} +diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c +index fab30e1..150ba29 100644 +--- a/arch/x86/kernel/quirks.c ++++ b/arch/x86/kernel/quirks.c +@@ -162,6 +162,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_31, + ich_force_enable_hpet); + DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH8_1, + ich_force_enable_hpet); ++DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH9_7, ++ ich_force_enable_hpet); + + + static struct pci_dev *cached_dev; +diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c +new file mode 100644 +index 0000000..5818dc2 +--- /dev/null ++++ b/arch/x86/kernel/reboot.c +@@ -0,0 +1,451 @@ ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++#include ++ ++#ifdef CONFIG_X86_32 ++# include ++# include ++# include ++# include ++#else ++# include ++#endif ++ ++/* ++ * Power off function, if any ++ */ ++void (*pm_power_off)(void); ++EXPORT_SYMBOL(pm_power_off); ++ ++static long no_idt[3]; ++static int reboot_mode; ++enum reboot_type reboot_type = BOOT_KBD; ++int reboot_force; ++ ++#if defined(CONFIG_X86_32) && defined(CONFIG_SMP) ++static int reboot_cpu = -1; ++#endif ++ ++/* reboot=b[ios] | s[mp] | t[riple] | k[bd] | e[fi] [, [w]arm | [c]old] ++ warm Don't set the cold reboot flag ++ cold Set the cold reboot flag ++ bios Reboot by jumping through the BIOS (only for X86_32) ++ smp Reboot by executing reset on BSP or other CPU (only for X86_32) ++ triple Force a triple fault (init) ++ kbd Use the keyboard controller. cold reset (default) ++ acpi Use the RESET_REG in the FADT ++ efi Use efi reset_system runtime service ++ force Avoid anything that could hang. 
++ */ ++static int __init reboot_setup(char *str) ++{ ++ for (;;) { ++ switch (*str) { ++ case 'w': ++ reboot_mode = 0x1234; ++ break; ++ ++ case 'c': ++ reboot_mode = 0; ++ break; ++ ++#ifdef CONFIG_X86_32 ++#ifdef CONFIG_SMP ++ case 's': ++ if (isdigit(*(str+1))) { ++ reboot_cpu = (int) (*(str+1) - '0'); ++ if (isdigit(*(str+2))) ++ reboot_cpu = reboot_cpu*10 + (int)(*(str+2) - '0'); ++ } ++ /* we will leave sorting out the final value ++ when we are ready to reboot, since we might not ++ have set up boot_cpu_id or smp_num_cpu */ ++ break; ++#endif /* CONFIG_SMP */ ++ ++ case 'b': ++#endif ++ case 'a': ++ case 'k': ++ case 't': ++ case 'e': ++ reboot_type = *str; ++ break; ++ ++ case 'f': ++ reboot_force = 1; ++ break; ++ } ++ ++ str = strchr(str, ','); ++ if (str) ++ str++; ++ else ++ break; ++ } ++ return 1; ++} ++ ++__setup("reboot=", reboot_setup); ++ ++ ++#ifdef CONFIG_X86_32 ++/* ++ * Reboot options and system auto-detection code provided by ++ * Dell Inc. so their systems "just work". :-) ++ */ ++ ++/* ++ * Some machines require the "reboot=b" commandline option, ++ * this quirk makes that automatic. ++ */ ++static int __init set_bios_reboot(const struct dmi_system_id *d) ++{ ++ if (reboot_type != BOOT_BIOS) { ++ reboot_type = BOOT_BIOS; ++ printk(KERN_INFO "%s series board detected. Selecting BIOS-method for reboots.\n", d->ident); ++ } ++ return 0; ++} ++ ++static struct dmi_system_id __initdata reboot_dmi_table[] = { ++ { /* Handle problems with rebooting on Dell E520's */ ++ .callback = set_bios_reboot, ++ .ident = "Dell E520", ++ .matches = { ++ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."), ++ DMI_MATCH(DMI_PRODUCT_NAME, "Dell DM061"), ++ }, ++ }, ++ { /* Handle problems with rebooting on Dell 1300's */ ++ .callback = set_bios_reboot, ++ .ident = "Dell PowerEdge 1300", ++ .matches = { ++ DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"), ++ DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 1300/"), ++ }, ++ }, ++ { /* Handle problems with rebooting on Dell 300's */ ++ .callback = set_bios_reboot, ++ .ident = "Dell PowerEdge 300", ++ .matches = { ++ DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"), ++ DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 300/"), ++ }, ++ }, ++ { /* Handle problems with rebooting on Dell Optiplex 745's SFF*/ ++ .callback = set_bios_reboot, ++ .ident = "Dell OptiPlex 745", ++ .matches = { ++ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."), ++ DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 745"), ++ DMI_MATCH(DMI_BOARD_NAME, "0WF810"), ++ }, ++ }, ++ { /* Handle problems with rebooting on Dell 2400's */ ++ .callback = set_bios_reboot, ++ .ident = "Dell PowerEdge 2400", ++ .matches = { ++ DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"), ++ DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 2400"), ++ }, ++ }, ++ { /* Handle problems with rebooting on HP laptops */ ++ .callback = set_bios_reboot, ++ .ident = "HP Compaq Laptop", ++ .matches = { ++ DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"), ++ DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq"), ++ }, ++ }, ++ { } ++}; ++ ++static int __init reboot_init(void) ++{ ++ dmi_check_system(reboot_dmi_table); ++ return 0; ++} ++core_initcall(reboot_init); ++ ++/* The following code and data reboots the machine by switching to real ++ mode and jumping to the BIOS reset entry point, as if the CPU has ++ really been reset. The previous version asked the keyboard ++ controller to pulse the CPU reset line, which is more thorough, but ++ doesn't work with at least one type of 486 motherboard. It is easy ++ to stop this code working; hence the copious comments. 
*/ ++static unsigned long long ++real_mode_gdt_entries [3] = ++{ ++ 0x0000000000000000ULL, /* Null descriptor */ ++ 0x00009a000000ffffULL, /* 16-bit real-mode 64k code at 0x00000000 */ ++ 0x000092000100ffffULL /* 16-bit real-mode 64k data at 0x00000100 */ ++}; ++ ++static struct desc_ptr ++real_mode_gdt = { sizeof (real_mode_gdt_entries) - 1, (long)real_mode_gdt_entries }, ++real_mode_idt = { 0x3ff, 0 }; ++ ++/* This is 16-bit protected mode code to disable paging and the cache, ++ switch to real mode and jump to the BIOS reset code. ++ ++ The instruction that switches to real mode by writing to CR0 must be ++ followed immediately by a far jump instruction, which set CS to a ++ valid value for real mode, and flushes the prefetch queue to avoid ++ running instructions that have already been decoded in protected ++ mode. ++ ++ Clears all the flags except ET, especially PG (paging), PE ++ (protected-mode enable) and TS (task switch for coprocessor state ++ save). Flushes the TLB after paging has been disabled. Sets CD and ++ NW, to disable the cache on a 486, and invalidates the cache. This ++ is more like the state of a 486 after reset. I don't know if ++ something else should be done for other chips. ++ ++ More could be done here to set up the registers as if a CPU reset had ++ occurred; hopefully real BIOSs don't assume much. */ ++static unsigned char real_mode_switch [] = ++{ ++ 0x66, 0x0f, 0x20, 0xc0, /* movl %cr0,%eax */ ++ 0x66, 0x83, 0xe0, 0x11, /* andl $0x00000011,%eax */ ++ 0x66, 0x0d, 0x00, 0x00, 0x00, 0x60, /* orl $0x60000000,%eax */ ++ 0x66, 0x0f, 0x22, 0xc0, /* movl %eax,%cr0 */ ++ 0x66, 0x0f, 0x22, 0xd8, /* movl %eax,%cr3 */ ++ 0x66, 0x0f, 0x20, 0xc3, /* movl %cr0,%ebx */ ++ 0x66, 0x81, 0xe3, 0x00, 0x00, 0x00, 0x60, /* andl $0x60000000,%ebx */ ++ 0x74, 0x02, /* jz f */ ++ 0x0f, 0x09, /* wbinvd */ ++ 0x24, 0x10, /* f: andb $0x10,al */ ++ 0x66, 0x0f, 0x22, 0xc0 /* movl %eax,%cr0 */ ++}; ++static unsigned char jump_to_bios [] = ++{ ++ 0xea, 0x00, 0x00, 0xff, 0xff /* ljmp $0xffff,$0x0000 */ ++}; ++ ++/* ++ * Switch to real mode and then execute the code ++ * specified by the code and length parameters. ++ * We assume that length will aways be less that 100! ++ */ ++void machine_real_restart(unsigned char *code, int length) ++{ ++ local_irq_disable(); ++ ++ /* Write zero to CMOS register number 0x0f, which the BIOS POST ++ routine will recognize as telling it to do a proper reboot. (Well ++ that's what this book in front of me says -- it may only apply to ++ the Phoenix BIOS though, it's not clear). At the same time, ++ disable NMIs by setting the top bit in the CMOS address register, ++ as we're about to do peculiar things to the CPU. I'm not sure if ++ `outb_p' is needed instead of just `outb'. Use it to be on the ++ safe side. (Yes, CMOS_WRITE does outb_p's. - Paul G.) ++ */ ++ spin_lock(&rtc_lock); ++ CMOS_WRITE(0x00, 0x8f); ++ spin_unlock(&rtc_lock); ++ ++ /* Remap the kernel at virtual address zero, as well as offset zero ++ from the kernel segment. This assumes the kernel segment starts at ++ virtual address PAGE_OFFSET. */ ++ memcpy(swapper_pg_dir, swapper_pg_dir + USER_PGD_PTRS, ++ sizeof(swapper_pg_dir [0]) * KERNEL_PGD_PTRS); ++ ++ /* ++ * Use `swapper_pg_dir' as our page directory. ++ */ ++ load_cr3(swapper_pg_dir); ++ ++ /* Write 0x1234 to absolute memory location 0x472. The BIOS reads ++ this on booting to tell it to "Bypass memory test (also warm ++ boot)". 
This seems like a fairly standard thing that gets set by ++ REBOOT.COM programs, and the previous reset routine did this ++ too. */ ++ *((unsigned short *)0x472) = reboot_mode; ++ ++ /* For the switch to real mode, copy some code to low memory. It has ++ to be in the first 64k because it is running in 16-bit mode, and it ++ has to have the same physical and virtual address, because it turns ++ off paging. Copy it near the end of the first page, out of the way ++ of BIOS variables. */ ++ memcpy((void *)(0x1000 - sizeof(real_mode_switch) - 100), ++ real_mode_switch, sizeof (real_mode_switch)); ++ memcpy((void *)(0x1000 - 100), code, length); ++ ++ /* Set up the IDT for real mode. */ ++ load_idt(&real_mode_idt); ++ ++ /* Set up a GDT from which we can load segment descriptors for real ++ mode. The GDT is not used in real mode; it is just needed here to ++ prepare the descriptors. */ ++ load_gdt(&real_mode_gdt); ++ ++ /* Load the data segment registers, and thus the descriptors ready for ++ real mode. The base address of each segment is 0x100, 16 times the ++ selector value being loaded here. This is so that the segment ++ registers don't have to be reloaded after switching to real mode: ++ the values are consistent for real mode operation already. */ ++ __asm__ __volatile__ ("movl $0x0010,%%eax\n" ++ "\tmovl %%eax,%%ds\n" ++ "\tmovl %%eax,%%es\n" ++ "\tmovl %%eax,%%fs\n" ++ "\tmovl %%eax,%%gs\n" ++ "\tmovl %%eax,%%ss" : : : "eax"); ++ ++ /* Jump to the 16-bit code that we copied earlier. It disables paging ++ and the cache, switches to real mode, and jumps to the BIOS reset ++ entry point. */ ++ __asm__ __volatile__ ("ljmp $0x0008,%0" ++ : ++ : "i" ((void *)(0x1000 - sizeof (real_mode_switch) - 100))); ++} ++#ifdef CONFIG_APM_MODULE ++EXPORT_SYMBOL(machine_real_restart); ++#endif ++ ++#endif /* CONFIG_X86_32 */ ++ ++static inline void kb_wait(void) ++{ ++ int i; ++ ++ for (i = 0; i < 0x10000; i++) { ++ if ((inb(0x64) & 0x02) == 0) ++ break; ++ udelay(2); ++ } ++} ++ ++void machine_emergency_restart(void) ++{ ++ int i; ++ ++ /* Tell the BIOS if we want cold or warm reboot */ ++ *((unsigned short *)__va(0x472)) = reboot_mode; ++ ++ for (;;) { ++ /* Could also try the reset bit in the Hammer NB */ ++ switch (reboot_type) { ++ case BOOT_KBD: ++ for (i = 0; i < 10; i++) { ++ kb_wait(); ++ udelay(50); ++ outb(0xfe, 0x64); /* pulse reset low */ ++ udelay(50); ++ } ++ ++ case BOOT_TRIPLE: ++ load_idt((const struct desc_ptr *)&no_idt); ++ __asm__ __volatile__("int3"); ++ ++ reboot_type = BOOT_KBD; ++ break; ++ ++#ifdef CONFIG_X86_32 ++ case BOOT_BIOS: ++ machine_real_restart(jump_to_bios, sizeof(jump_to_bios)); ++ ++ reboot_type = BOOT_KBD; ++ break; ++#endif ++ ++ case BOOT_ACPI: ++ acpi_reboot(); ++ reboot_type = BOOT_KBD; ++ break; ++ ++ ++ case BOOT_EFI: ++ if (efi_enabled) ++ efi.reset_system(reboot_mode ? 
EFI_RESET_WARM : EFI_RESET_COLD, ++ EFI_SUCCESS, 0, NULL); ++ ++ reboot_type = BOOT_KBD; ++ break; ++ } ++ } ++} ++ ++void machine_shutdown(void) ++{ ++ /* Stop the cpus and apics */ ++#ifdef CONFIG_SMP ++ int reboot_cpu_id; ++ ++ /* The boot cpu is always logical cpu 0 */ ++ reboot_cpu_id = 0; ++ ++#ifdef CONFIG_X86_32 ++ /* See if there has been given a command line override */ ++ if ((reboot_cpu != -1) && (reboot_cpu < NR_CPUS) && ++ cpu_isset(reboot_cpu, cpu_online_map)) ++ reboot_cpu_id = reboot_cpu; ++#endif ++ ++ /* Make certain the cpu I'm about to reboot on is online */ ++ if (!cpu_isset(reboot_cpu_id, cpu_online_map)) ++ reboot_cpu_id = smp_processor_id(); ++ ++ /* Make certain I only run on the appropriate processor */ ++ set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id)); ++ ++ /* O.K Now that I'm on the appropriate processor, ++ * stop all of the others. ++ */ ++ smp_send_stop(); ++#endif ++ ++ lapic_shutdown(); ++ ++#ifdef CONFIG_X86_IO_APIC ++ disable_IO_APIC(); ++#endif ++ ++#ifdef CONFIG_HPET_TIMER ++ hpet_disable(); ++#endif ++ ++#ifdef CONFIG_X86_64 ++ pci_iommu_shutdown(); ++#endif ++} ++ ++void machine_restart(char *__unused) ++{ ++ printk("machine restart\n"); ++ ++ if (!reboot_force) ++ machine_shutdown(); ++ machine_emergency_restart(); ++} ++ ++void machine_halt(void) ++{ ++} ++ ++void machine_power_off(void) ++{ ++ if (pm_power_off) { ++ if (!reboot_force) ++ machine_shutdown(); ++ pm_power_off(); ++ } ++} ++ ++struct machine_ops machine_ops = { ++ .power_off = machine_power_off, ++ .shutdown = machine_shutdown, ++ .emergency_restart = machine_emergency_restart, ++ .restart = machine_restart, ++ .halt = machine_halt ++}; +diff --git a/arch/x86/kernel/reboot_32.c b/arch/x86/kernel/reboot_32.c +deleted file mode 100644 +index bb1a0f8..0000000 +--- a/arch/x86/kernel/reboot_32.c ++++ /dev/null +@@ -1,413 +0,0 @@ +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include +-#include "mach_reboot.h" +-#include +-#include +- +-/* +- * Power off function, if any +- */ +-void (*pm_power_off)(void); +-EXPORT_SYMBOL(pm_power_off); +- +-static int reboot_mode; +-static int reboot_thru_bios; +- +-#ifdef CONFIG_SMP +-static int reboot_cpu = -1; +-#endif +-static int __init reboot_setup(char *str) +-{ +- while(1) { +- switch (*str) { +- case 'w': /* "warm" reboot (no memory testing etc) */ +- reboot_mode = 0x1234; +- break; +- case 'c': /* "cold" reboot (with memory testing etc) */ +- reboot_mode = 0x0; +- break; +- case 'b': /* "bios" reboot by jumping through the BIOS */ +- reboot_thru_bios = 1; +- break; +- case 'h': /* "hard" reboot by toggling RESET and/or crashing the CPU */ +- reboot_thru_bios = 0; +- break; +-#ifdef CONFIG_SMP +- case 's': /* "smp" reboot by executing reset on BSP or other CPU*/ +- if (isdigit(*(str+1))) { +- reboot_cpu = (int) (*(str+1) - '0'); +- if (isdigit(*(str+2))) +- reboot_cpu = reboot_cpu*10 + (int)(*(str+2) - '0'); +- } +- /* we will leave sorting out the final value +- when we are ready to reboot, since we might not +- have set up boot_cpu_id or smp_num_cpu */ +- break; +-#endif +- } +- if((str = strchr(str,',')) != NULL) +- str++; +- else +- break; +- } +- return 1; +-} +- +-__setup("reboot=", reboot_setup); +- +-/* +- * Reboot options and system auto-detection code provided by +- * Dell Inc. so their systems "just work". :-) +- */ +- +-/* +- * Some machines require the "reboot=b" commandline option, this quirk makes that automatic. 
+- */
+-static int __init set_bios_reboot(const struct dmi_system_id *d)
+-{
+-	if (!reboot_thru_bios) {
+-		reboot_thru_bios = 1;
+-		printk(KERN_INFO "%s series board detected. Selecting BIOS-method for reboots.\n", d->ident);
+-	}
+-	return 0;
+-}
+-
+-static struct dmi_system_id __initdata reboot_dmi_table[] = {
+-	{	/* Handle problems with rebooting on Dell E520's */
+-		.callback = set_bios_reboot,
+-		.ident = "Dell E520",
+-		.matches = {
+-			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+-			DMI_MATCH(DMI_PRODUCT_NAME, "Dell DM061"),
+-		},
+-	},
+-	{	/* Handle problems with rebooting on Dell 1300's */
+-		.callback = set_bios_reboot,
+-		.ident = "Dell PowerEdge 1300",
+-		.matches = {
+-			DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"),
+-			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 1300/"),
+-		},
+-	},
+-	{	/* Handle problems with rebooting on Dell 300's */
+-		.callback = set_bios_reboot,
+-		.ident = "Dell PowerEdge 300",
+-		.matches = {
+-			DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"),
+-			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 300/"),
+-		},
+-	},
+-	{	/* Handle problems with rebooting on Dell Optiplex 745's SFF*/
+-		.callback = set_bios_reboot,
+-		.ident = "Dell OptiPlex 745",
+-		.matches = {
+-			DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+-			DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 745"),
+-			DMI_MATCH(DMI_BOARD_NAME, "0WF810"),
+-		},
+-	},
+-	{	/* Handle problems with rebooting on Dell 2400's */
+-		.callback = set_bios_reboot,
+-		.ident = "Dell PowerEdge 2400",
+-		.matches = {
+-			DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"),
+-			DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 2400"),
+-		},
+-	},
+-	{	/* Handle problems with rebooting on HP laptops */
+-		.callback = set_bios_reboot,
+-		.ident = "HP Compaq Laptop",
+-		.matches = {
+-			DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+-			DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq"),
+-		},
+-	},
+-	{ }
+-};
+-
+-static int __init reboot_init(void)
+-{
+-	dmi_check_system(reboot_dmi_table);
+-	return 0;
+-}
+-
+-core_initcall(reboot_init);
+-
+-/* The following code and data reboots the machine by switching to real
+-   mode and jumping to the BIOS reset entry point, as if the CPU has
+-   really been reset. The previous version asked the keyboard
+-   controller to pulse the CPU reset line, which is more thorough, but
+-   doesn't work with at least one type of 486 motherboard. It is easy
+-   to stop this code working; hence the copious comments. */
+-
+-static unsigned long long
+-real_mode_gdt_entries [3] =
+-{
+-	0x0000000000000000ULL,	/* Null descriptor */
+-	0x00009a000000ffffULL,	/* 16-bit real-mode 64k code at 0x00000000 */
+-	0x000092000100ffffULL	/* 16-bit real-mode 64k data at 0x00000100 */
+-};
+-
+-static struct Xgt_desc_struct
+-real_mode_gdt = { sizeof (real_mode_gdt_entries) - 1, (long)real_mode_gdt_entries },
+-real_mode_idt = { 0x3ff, 0 },
+-no_idt = { 0, 0 };
+-
+-
+-/* This is 16-bit protected mode code to disable paging and the cache,
+-   switch to real mode and jump to the BIOS reset code.
+-
+-   The instruction that switches to real mode by writing to CR0 must be
+-   followed immediately by a far jump instruction, which set CS to a
+-   valid value for real mode, and flushes the prefetch queue to avoid
+-   running instructions that have already been decoded in protected
+-   mode.
+-
+-   Clears all the flags except ET, especially PG (paging), PE
+-   (protected-mode enable) and TS (task switch for coprocessor state
+-   save). Flushes the TLB after paging has been disabled. Sets CD and
+-   NW, to disable the cache on a 486, and invalidates the cache. This
+-   is more like the state of a 486 after reset. I don't know if
+-   something else should be done for other chips.
+-
+-   More could be done here to set up the registers as if a CPU reset had
+-   occurred; hopefully real BIOSs don't assume much. */
+-
+-static unsigned char real_mode_switch [] =
+-{
+-	0x66, 0x0f, 0x20, 0xc0,			/* movl %cr0,%eax */
+-	0x66, 0x83, 0xe0, 0x11,			/* andl $0x00000011,%eax */
+-	0x66, 0x0d, 0x00, 0x00, 0x00, 0x60,	/* orl $0x60000000,%eax */
+-	0x66, 0x0f, 0x22, 0xc0,			/* movl %eax,%cr0 */
+-	0x66, 0x0f, 0x22, 0xd8,			/* movl %eax,%cr3 */
+-	0x66, 0x0f, 0x20, 0xc3,			/* movl %cr0,%ebx */
+-	0x66, 0x81, 0xe3, 0x00, 0x00, 0x00, 0x60,	/* andl $0x60000000,%ebx */
+-	0x74, 0x02,				/* jz f */
+-	0x0f, 0x09,				/* wbinvd */
+-	0x24, 0x10,				/* f: andb $0x10,al */
+-	0x66, 0x0f, 0x22, 0xc0			/* movl %eax,%cr0 */
+-};
+-static unsigned char jump_to_bios [] =
+-{
+-	0xea, 0x00, 0x00, 0xff, 0xff		/* ljmp $0xffff,$0x0000 */
+-};
+-
+-/*
+- * Switch to real mode and then execute the code
+- * specified by the code and length parameters.
+- * We assume that length will aways be less that 100!
+- */
+-void machine_real_restart(unsigned char *code, int length)
+-{
+-	local_irq_disable();
+-
+-	/* Write zero to CMOS register number 0x0f, which the BIOS POST
+-	   routine will recognize as telling it to do a proper reboot. (Well
+-	   that's what this book in front of me says -- it may only apply to
+-	   the Phoenix BIOS though, it's not clear). At the same time,
+-	   disable NMIs by setting the top bit in the CMOS address register,
+-	   as we're about to do peculiar things to the CPU. I'm not sure if
+-	   `outb_p' is needed instead of just `outb'. Use it to be on the
+-	   safe side. (Yes, CMOS_WRITE does outb_p's. - Paul G.)
+-	 */
+-
+-	spin_lock(&rtc_lock);
+-	CMOS_WRITE(0x00, 0x8f);
+-	spin_unlock(&rtc_lock);
+-
+-	/* Remap the kernel at virtual address zero, as well as offset zero
+-	   from the kernel segment. This assumes the kernel segment starts at
+-	   virtual address PAGE_OFFSET. */
+-
+-	memcpy (swapper_pg_dir, swapper_pg_dir + USER_PGD_PTRS,
+-		sizeof (swapper_pg_dir [0]) * KERNEL_PGD_PTRS);
+-
+-	/*
+-	 * Use `swapper_pg_dir' as our page directory.
+-	 */
+-	load_cr3(swapper_pg_dir);
+-
+-	/* Write 0x1234 to absolute memory location 0x472. The BIOS reads
+-	   this on booting to tell it to "Bypass memory test (also warm
+-	   boot)". This seems like a fairly standard thing that gets set by
+-	   REBOOT.COM programs, and the previous reset routine did this
+-	   too. */
+-
+-	*((unsigned short *)0x472) = reboot_mode;
+-
+-	/* For the switch to real mode, copy some code to low memory. It has
+-	   to be in the first 64k because it is running in 16-bit mode, and it
+-	   has to have the same physical and virtual address, because it turns
+-	   off paging. Copy it near the end of the first page, out of the way
+-	   of BIOS variables. */
+-
+-	memcpy ((void *) (0x1000 - sizeof (real_mode_switch) - 100),
+-		real_mode_switch, sizeof (real_mode_switch));
+-	memcpy ((void *) (0x1000 - 100), code, length);
+-
+-	/* Set up the IDT for real mode. */
+-
+-	load_idt(&real_mode_idt);
+-
+-	/* Set up a GDT from which we can load segment descriptors for real
+-	   mode. The GDT is not used in real mode; it is just needed here to
+-	   prepare the descriptors. */
+-
+-	load_gdt(&real_mode_gdt);
+-
+-	/* Load the data segment registers, and thus the descriptors ready for
+-	   real mode. The base address of each segment is 0x100, 16 times the
+-	   selector value being loaded here. This is so that the segment
+-	   registers don't have to be reloaded after switching to real mode:
+-	   the values are consistent for real mode operation already. */
+-
+-	__asm__ __volatile__ ("movl $0x0010,%%eax\n"
+-		"\tmovl %%eax,%%ds\n"
+-		"\tmovl %%eax,%%es\n"
+-		"\tmovl %%eax,%%fs\n"
+-		"\tmovl %%eax,%%gs\n"
+-		"\tmovl %%eax,%%ss" : : : "eax");
+-
+-	/* Jump to the 16-bit code that we copied earlier. It disables paging
+-	   and the cache, switches to real mode, and jumps to the BIOS reset
+-	   entry point. */
+-
+-	__asm__ __volatile__ ("ljmp $0x0008,%0"
+-		:
+-		: "i" ((void *) (0x1000 - sizeof (real_mode_switch) - 100)));
+-}
+-#ifdef CONFIG_APM_MODULE
+-EXPORT_SYMBOL(machine_real_restart);
+-#endif
+-
+-static void native_machine_shutdown(void)
+-{
+-#ifdef CONFIG_SMP
+-	int reboot_cpu_id;
+-
+-	/* The boot cpu is always logical cpu 0 */
+-	reboot_cpu_id = 0;
+-
+-	/* See if there has been given a command line override */
+-	if ((reboot_cpu != -1) && (reboot_cpu < NR_CPUS) &&
+-		cpu_isset(reboot_cpu, cpu_online_map)) {
+-		reboot_cpu_id = reboot_cpu;
+-	}
+-
+-	/* Make certain the cpu I'm rebooting on is online */
+-	if (!cpu_isset(reboot_cpu_id, cpu_online_map)) {
+-		reboot_cpu_id = smp_processor_id();
+-	}
+-
+-	/* Make certain I only run on the appropriate processor */
+-	set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+-
+-	/* O.K. Now that I'm on the appropriate processor, stop
+-	 * all of the others, and disable their local APICs.
+-	 */
+-
+-	smp_send_stop();
+-#endif /* CONFIG_SMP */
+-
+-	lapic_shutdown();
+-
+-#ifdef CONFIG_X86_IO_APIC
+-	disable_IO_APIC();
+-#endif
+-#ifdef CONFIG_HPET_TIMER
+-	hpet_disable();
+-#endif
+-}
+-
+-void __attribute__((weak)) mach_reboot_fixups(void)
+-{
+-}
+-
+-static void native_machine_emergency_restart(void)
+-{
+-	if (!reboot_thru_bios) {
+-		if (efi_enabled) {
+-			efi.reset_system(EFI_RESET_COLD, EFI_SUCCESS, 0, NULL);
+-			load_idt(&no_idt);
+-			__asm__ __volatile__("int3");
+-		}
+-		/* rebooting needs to touch the page at absolute addr 0 */
+-		*((unsigned short *)__va(0x472)) = reboot_mode;
+-		for (;;) {
+-			mach_reboot_fixups(); /* for board specific fixups */
+-			mach_reboot();
+-			/* That didn't work - force a triple fault.. */
+-			load_idt(&no_idt);
+-			__asm__ __volatile__("int3");
+-		}
+-	}
+-	if (efi_enabled)
+-		efi.reset_system(EFI_RESET_WARM, EFI_SUCCESS, 0, NULL);
+-
+-	machine_real_restart(jump_to_bios, sizeof(jump_to_bios));
+-}
+-
+-static void native_machine_restart(char * __unused)
+-{
+-	machine_shutdown();
+-	machine_emergency_restart();
+-}
+-
+-static void native_machine_halt(void)
+-{
+-}
+-
+-static void native_machine_power_off(void)
+-{
+-	if (pm_power_off) {
+-		machine_shutdown();
+-		pm_power_off();
+-	}
+-}
+-
+-
+-struct machine_ops machine_ops = {
+-	.power_off = native_machine_power_off,
+-	.shutdown = native_machine_shutdown,
+-	.emergency_restart = native_machine_emergency_restart,
+-	.restart = native_machine_restart,
+-	.halt = native_machine_halt,
+-};
+-
+-void machine_power_off(void)
+-{
+-	machine_ops.power_off();
+-}
+-
+-void machine_shutdown(void)
+-{
+-	machine_ops.shutdown();
+-}
+-
+-void machine_emergency_restart(void)
+-{
+-	machine_ops.emergency_restart();
+-}
+-
+-void machine_restart(char *cmd)
+-{
+-	machine_ops.restart(cmd);
+-}
+-
+-void machine_halt(void)
+-{
+-	machine_ops.halt();
+-}
+diff --git a/arch/x86/kernel/reboot_64.c b/arch/x86/kernel/reboot_64.c
+deleted file mode 100644
+index 53620a9..0000000
+--- a/arch/x86/kernel/reboot_64.c
++++ /dev/null
+@@ -1,176 +0,0 @@
+-/* Various gunk just to reboot the machine. */
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-#include
+-
+-/*
+- * Power off function, if any
+- */
+-void (*pm_power_off)(void);
+-EXPORT_SYMBOL(pm_power_off);
+-
+-static long no_idt[3];
+-static enum {
+-	BOOT_TRIPLE = 't',
+-	BOOT_KBD = 'k'
+-} reboot_type = BOOT_KBD;
+-static int reboot_mode = 0;
+-int reboot_force;
+-
+-/* reboot=t[riple] | k[bd] [, [w]arm | [c]old]
+-   warm   Don't set the cold reboot flag
+-   cold   Set the cold reboot flag
+-   triple Force a triple fault (init)
+-   kbd    Use the keyboard controller. cold reset (default)
+-   force  Avoid anything that could hang.
+- */
+-static int __init reboot_setup(char *str)
+-{
+-	for (;;) {
+-		switch (*str) {
+-		case 'w':
+-			reboot_mode = 0x1234;
+-			break;
+-
+-		case 'c':
+-			reboot_mode = 0;
+-			break;
+-
+-		case 't':
+-		case 'b':
+-		case 'k':
+-			reboot_type = *str;
+-			break;
+-		case 'f':
+-			reboot_force = 1;
+-			break;
+-		}
+-		if((str = strchr(str,',')) != NULL)
+-			str++;
+-		else
+-			break;
+-	}
+-	return 1;
+-}
+-
+-__setup("reboot=", reboot_setup);
+-
+-static inline void kb_wait(void)
+-{
+-	int i;
+-
+-	for (i=0; i<0x10000; i++)
+-		if ((inb_p(0x64) & 0x02) == 0)
+-			break;
+-}
+-
+-void machine_shutdown(void)
+-{
+-	unsigned long flags;
+-
+-	/* Stop the cpus and apics */
+-#ifdef CONFIG_SMP
+-	int reboot_cpu_id;
+-
+-	/* The boot cpu is always logical cpu 0 */
+-	reboot_cpu_id = 0;
+-
+-	/* Make certain the cpu I'm about to reboot on is online */
+-	if (!cpu_isset(reboot_cpu_id, cpu_online_map)) {
+-		reboot_cpu_id = smp_processor_id();
+-	}
+-
+-	/* Make certain I only run on the appropriate processor */
+-	set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+-
+-	/* O.K Now that I'm on the appropriate processor,
+-	 * stop all of the others.
+-	 */
+-	smp_send_stop();
+-#endif
+-
+-	local_irq_save(flags);
+-
+-#ifndef CONFIG_SMP
+-	disable_local_APIC();
+-#endif
+-
+-	disable_IO_APIC();
+-
+-#ifdef CONFIG_HPET_TIMER
+-	hpet_disable();
+-#endif
+-	local_irq_restore(flags);
+-
+-	pci_iommu_shutdown();
+-}
+-
+-void machine_emergency_restart(void)
+-{
+-	int i;
+-
+-	/* Tell the BIOS if we want cold or warm reboot */
+-	*((unsigned short *)__va(0x472)) = reboot_mode;
+-
+-	for (;;) {
+-		/* Could also try the reset bit in the Hammer NB */
+-		switch (reboot_type) {
+-		case BOOT_KBD:
+-			for (i=0; i<10; i++) {
+-				kb_wait();
+-				udelay(50);
+-				outb(0xfe,0x64); /* pulse reset low */
+-				udelay(50);
+-			}
+-
+-		case BOOT_TRIPLE:
+-			load_idt((const struct desc_ptr *)&no_idt);
+-			__asm__ __volatile__("int3");
+-
+-			reboot_type = BOOT_KBD;
+-			break;
+-		}
+-	}
+-}
+-
+-void machine_restart(char * __unused)
+-{
+-	printk("machine restart\n");
+-
+-	if (!reboot_force) {
+-		machine_shutdown();
+-	}
+-	machine_emergency_restart();
+-}
+-
+-void machine_halt(void)
+-{
+-}
+-
+-void machine_power_off(void)
+-{
+-	if (pm_power_off) {
+-		if (!reboot_force) {
+-			machine_shutdown();
+-		}
+-		pm_power_off();
+-	}
+-}
+-
+diff --git a/arch/x86/kernel/reboot_fixups_32.c b/arch/x86/kernel/reboot_fixups_32.c
+index f452726..dec0b5e 100644
+--- a/arch/x86/kernel/reboot_fixups_32.c
++++ b/arch/x86/kernel/reboot_fixups_32.c
+@@ -30,6 +30,19 @@ static void cs5536_warm_reset(struct pci_dev *dev)
+ 	udelay(50); /* shouldn't get here but be safe and spin a while */
+ }
+ 
++static void rdc321x_reset(struct pci_dev *dev)
++{
++	unsigned i;
++	/* Voluntary reset the watchdog timer */
++	outl(0x80003840, 0xCF8);
++	/* Generate a CPU reset on next tick */
++	i = inl(0xCFC);
++	/* Use the minimum timer resolution */
++	i |= 0x1600;
++	outl(i, 0xCFC);
++	outb(1, 0x92);
++}
++
+ struct device_fixup {
+ 	unsigned int vendor;
+ 	unsigned int device;
+@@ -40,6 +53,7 @@ static struct device_fixup fixups_table[] = {
+ { PCI_VENDOR_ID_CYRIX, PCI_DEVICE_ID_CYRIX_5530_LEGACY, cs5530a_warm_reset },
+ { PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_CS5536_ISA, cs5536_warm_reset },
+ { PCI_VENDOR_ID_NS, PCI_DEVICE_ID_NS_SC1100_BRIDGE, cs5530a_warm_reset },
++{ PCI_VENDOR_ID_RDC, PCI_DEVICE_ID_RDC_R6030, rdc321x_reset },
+ };
+ 
+ /*
+diff --git a/arch/x86/kernel/rtc.c b/arch/x86/kernel/rtc.c
+new file mode 100644
+index 0000000..eb9b1a1
+--- /dev/null
++++ b/arch/x86/kernel/rtc.c
+@@ -0,0 +1,204 @@
++/*
++ * RTC related functions
++ */
++#include
++#include
++#include
++
++#include
++#include
++
++#ifdef CONFIG_X86_32
++# define CMOS_YEARS_OFFS 1900
++/*
++ * This is a special lock that is owned by the CPU and holds the index
++ * register we are working with. It is required for NMI access to the
++ * CMOS/RTC registers. See include/asm-i386/mc146818rtc.h for details.
++ */
++volatile unsigned long cmos_lock = 0;
++EXPORT_SYMBOL(cmos_lock);
++#else
++/*
++ * x86-64 systems only exists since 2002.
++ * This will work up to Dec 31, 2100
++ */
++# define CMOS_YEARS_OFFS 2000
++#endif
++
++DEFINE_SPINLOCK(rtc_lock);
++EXPORT_SYMBOL(rtc_lock);
++
++/*
++ * In order to set the CMOS clock precisely, set_rtc_mmss has to be
++ * called 500 ms after the second nowtime has started, because when
++ * nowtime is written into the registers of the CMOS clock, it will
++ * jump to the next second precisely 500 ms later. Check the Motorola
++ * MC146818A or Dallas DS12887 data sheet for details.
++ *
++ * BUG: This routine does not handle hour overflow properly; it just
++ * sets the minutes. Usually you'll only notice that after reboot!
++ */
++int mach_set_rtc_mmss(unsigned long nowtime)
++{
++	int retval = 0;
++	int real_seconds, real_minutes, cmos_minutes;
++	unsigned char save_control, save_freq_select;
++
++	/* tell the clock it's being set */
++	save_control = CMOS_READ(RTC_CONTROL);
++	CMOS_WRITE((save_control|RTC_SET), RTC_CONTROL);
++
++	/* stop and reset prescaler */
++	save_freq_select = CMOS_READ(RTC_FREQ_SELECT);
++	CMOS_WRITE((save_freq_select|RTC_DIV_RESET2), RTC_FREQ_SELECT);
++
++	cmos_minutes = CMOS_READ(RTC_MINUTES);
++	if (!(save_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD)
++		BCD_TO_BIN(cmos_minutes);
++
++	/*
++	 * since we're only adjusting minutes and seconds,
++	 * don't interfere with hour overflow. This avoids
++	 * messing with unknown time zones but requires your
++	 * RTC not to be off by more than 15 minutes
++	 */
++	real_seconds = nowtime % 60;
++	real_minutes = nowtime / 60;
++	/* correct for half hour time zone */
++	if (((abs(real_minutes - cmos_minutes) + 15)/30) & 1)
++		real_minutes += 30;
++	real_minutes %= 60;
++
++	if (abs(real_minutes - cmos_minutes) < 30) {
++		if (!(save_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) {
++			BIN_TO_BCD(real_seconds);
++			BIN_TO_BCD(real_minutes);
++		}
++		CMOS_WRITE(real_seconds,RTC_SECONDS);
++		CMOS_WRITE(real_minutes,RTC_MINUTES);
++	} else {
++		printk(KERN_WARNING
++		       "set_rtc_mmss: can't update from %d to %d\n",
++		       cmos_minutes, real_minutes);
++		retval = -1;
++	}
++
++	/* The following flags have to be released exactly in this order,
++	 * otherwise the DS12887 (popular MC146818A clone with integrated
++	 * battery and quartz) will not reset the oscillator and will not
++	 * update precisely 500 ms later. You won't find this mentioned in
++	 * the Dallas Semiconductor data sheets, but who believes data
++	 * sheets anyway ... -- Markus Kuhn
++	 */
++	CMOS_WRITE(save_control, RTC_CONTROL);
++	CMOS_WRITE(save_freq_select, RTC_FREQ_SELECT);
++
++	return retval;
++}
++
++unsigned long mach_get_cmos_time(void)
++{
++	unsigned int year, mon, day, hour, min, sec, century = 0;
++
++	/*
++	 * If UIP is clear, then we have >= 244 microseconds before
++	 * RTC registers will be updated. Spec sheet says that this
++	 * is the reliable way to read RTC - registers. If UIP is set
++	 * then the register access might be invalid.
++	 */
++	while ((CMOS_READ(RTC_FREQ_SELECT) & RTC_UIP))
++		cpu_relax();
++
++	sec = CMOS_READ(RTC_SECONDS);
++	min = CMOS_READ(RTC_MINUTES);
++	hour = CMOS_READ(RTC_HOURS);
++	day = CMOS_READ(RTC_DAY_OF_MONTH);
++	mon = CMOS_READ(RTC_MONTH);
++	year = CMOS_READ(RTC_YEAR);
++
++#if defined(CONFIG_ACPI) && defined(CONFIG_X86_64)
++	/* CHECKME: Is this really 64bit only ??? */
++	if (acpi_gbl_FADT.header.revision >= FADT2_REVISION_ID &&
++	    acpi_gbl_FADT.century)
++		century = CMOS_READ(acpi_gbl_FADT.century);
++#endif
++
++	if (RTC_ALWAYS_BCD || !(CMOS_READ(RTC_CONTROL) & RTC_DM_BINARY)) {
++		BCD_TO_BIN(sec);
++		BCD_TO_BIN(min);
++		BCD_TO_BIN(hour);
++		BCD_TO_BIN(day);
++		BCD_TO_BIN(mon);
++		BCD_TO_BIN(year);
++	}
++
++	if (century) {
++		BCD_TO_BIN(century);
++		year += century * 100;
++		printk(KERN_INFO "Extended CMOS year: %d\n", century * 100);
++	} else {
++		year += CMOS_YEARS_OFFS;
++		if (year < 1970)
++			year += 100;
++	}
++
++	return mktime(year, mon, day, hour, min, sec);
++}
++
++/* Routines for accessing the CMOS RAM/RTC. */
++unsigned char rtc_cmos_read(unsigned char addr)
++{
++	unsigned char val;
++
++	lock_cmos_prefix(addr);
++	outb_p(addr, RTC_PORT(0));
++	val = inb_p(RTC_PORT(1));
++	lock_cmos_suffix(addr);
++	return val;
++}
++EXPORT_SYMBOL(rtc_cmos_read);
++
++void rtc_cmos_write(unsigned char val, unsigned char addr)
++{
++	lock_cmos_prefix(addr);
++	outb_p(addr, RTC_PORT(0));
++	outb_p(val, RTC_PORT(1));
++	lock_cmos_suffix(addr);
++}
++EXPORT_SYMBOL(rtc_cmos_write);
++
++static int set_rtc_mmss(unsigned long nowtime)
++{
++	int retval;
++	unsigned long flags;
++
++	spin_lock_irqsave(&rtc_lock, flags);
++	retval = set_wallclock(nowtime);
++	spin_unlock_irqrestore(&rtc_lock, flags);
++
++	return retval;
++}
++
++/* not static: needed by APM */
++unsigned long read_persistent_clock(void)
++{
++	unsigned long retval, flags;
++
++	spin_lock_irqsave(&rtc_lock, flags);
++	retval = get_wallclock();
++	spin_unlock_irqrestore(&rtc_lock, flags);
++
++	return retval;
++}
++
++int update_persistent_clock(struct timespec now)
++{
++	return set_rtc_mmss(now.tv_sec);
++}
++
++unsigned long long native_read_tsc(void)
++{
++	return __native_read_tsc();
++}
++EXPORT_SYMBOL(native_read_tsc);
++
+diff --git a/arch/x86/kernel/setup64.c b/arch/x86/kernel/setup64.c
+index 3558ac7..309366f 100644
+--- a/arch/x86/kernel/setup64.c
++++ b/arch/x86/kernel/setup64.c
+@@ -24,7 +24,11 @@
+ #include
+ #include
+ 
++#ifndef CONFIG_DEBUG_BOOT_PARAMS
+ struct boot_params __initdata boot_params;
++#else
++struct boot_params boot_params;
++#endif
+ 
+ cpumask_t cpu_initialized __cpuinitdata = CPU_MASK_NONE;
+ 
+@@ -37,6 +41,8 @@ struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
+ char boot_cpu_stack[IRQSTACKSIZE] __attribute__((section(".bss.page_aligned")));
+ 
+ unsigned long __supported_pte_mask __read_mostly = ~0UL;
++EXPORT_SYMBOL_GPL(__supported_pte_mask);
++
+ static int do_not_nx __cpuinitdata = 0;
+ 
+ /* noexec=on|off
+@@ -80,6 +86,43 @@ static int __init nonx32_setup(char *str)
+ __setup("noexec32=", nonx32_setup);
+ 
+ /*
++ * Copy data used in early init routines from the initial arrays to the
++ * per cpu data areas. These arrays then become expendable and the
++ * *_early_ptr's are zeroed indicating that the static arrays are gone.
++ */
++static void __init setup_per_cpu_maps(void)
++{
++	int cpu;
++
++	for_each_possible_cpu(cpu) {
++#ifdef CONFIG_SMP
++		if (per_cpu_offset(cpu)) {
++#endif
++			per_cpu(x86_cpu_to_apicid, cpu) =
++					x86_cpu_to_apicid_init[cpu];
++			per_cpu(x86_bios_cpu_apicid, cpu) =
++					x86_bios_cpu_apicid_init[cpu];
++#ifdef CONFIG_NUMA
++			per_cpu(x86_cpu_to_node_map, cpu) =
++					x86_cpu_to_node_map_init[cpu];
++#endif
++#ifdef CONFIG_SMP
++		}
++		else
++			printk(KERN_NOTICE "per_cpu_offset zero for cpu %d\n",
++					cpu);
++#endif
++	}
++
++	/* indicate the early static arrays will soon be gone */
++	x86_cpu_to_apicid_early_ptr = NULL;
++	x86_bios_cpu_apicid_early_ptr = NULL;
++#ifdef CONFIG_NUMA
++	x86_cpu_to_node_map_early_ptr = NULL;
++#endif
++}
++
++/*
+  * Great future plan:
+  * Declare PDA itself and support (irqstack,tss,pgd) as per cpu data.
+  * Always point %gs to its beginning
+@@ -100,18 +143,21 @@ void __init setup_per_cpu_areas(void)
+ 	for_each_cpu_mask (i, cpu_possible_map) {
+ 		char *ptr;
+ 
+-		if (!NODE_DATA(cpu_to_node(i))) {
++		if (!NODE_DATA(early_cpu_to_node(i))) {
+ 			printk("cpu with no node %d, num_online_nodes %d\n",
+ 				i, num_online_nodes());
+ 			ptr = alloc_bootmem_pages(size);
+ 		} else {
+-			ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)), size);
++			ptr = alloc_bootmem_pages_node(NODE_DATA(early_cpu_to_node(i)), size);
+ 		}
+ 		if (!ptr)
+ 			panic("Cannot allocate cpu data for CPU %d\n", i);
+ 		cpu_pda(i)->data_offset = ptr - __per_cpu_start;
+ 		memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+ 	}
++
++	/* setup percpu data maps early */
++	setup_per_cpu_maps();
+ }
+ 
+ void pda_init(int cpu)
+@@ -169,7 +215,8 @@ void syscall_init(void)
+ #endif
+ 
+ 	/* Flags to clear on syscall */
+-	wrmsrl(MSR_SYSCALL_MASK, EF_TF|EF_DF|EF_IE|0x3000);
++	wrmsrl(MSR_SYSCALL_MASK,
++	       X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|X86_EFLAGS_IOPL);
+ }
+ 
+ void __cpuinit check_efer(void)
+@@ -227,7 +274,7 @@ void __cpuinit cpu_init (void)
+ 	 * and set up the GDT descriptor:
+ 	 */
+ 	if (cpu)
+-		memcpy(cpu_gdt(cpu), cpu_gdt_table, GDT_SIZE);
++		memcpy(get_cpu_gdt_table(cpu), cpu_gdt_table, GDT_SIZE);
+ 
+ 	cpu_gdt_descr[cpu].size = GDT_SIZE;
+ 	load_gdt((const struct desc_ptr *)&cpu_gdt_descr[cpu]);
+@@ -257,10 +304,10 @@ void __cpuinit cpu_init (void)
+ 				v, cpu);
+ 		}
+ 		estacks += PAGE_SIZE << order[v];
+-		orig_ist->ist[v] = t->ist[v] = (unsigned long)estacks;
++		orig_ist->ist[v] = t->x86_tss.ist[v] = (unsigned long)estacks;
+ 	}
+ 
+-	t->io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
++	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+ 	/*
+ 	 * <= is required because the CPU will access up to
+ 	 * 8 bits beyond the end of the IO permission bitmap.
+diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
+index 9c24b45..62adc5f 100644
+--- a/arch/x86/kernel/setup_32.c
++++ b/arch/x86/kernel/setup_32.c
+@@ -44,9 +44,12 @@
+ #include
+ #include
+ #include
++#include
++#include
+ 
+ #include