8/26/2017
#########
- How do we write kernel space services?
- How do they differ from user space code?
1. Statically compiling a service into the kernel, as part of bzImage.
2. Dynamically loading a module - module programming.
Most service developers (drivers, filesystems, protocols etc.) prefer the
dynamic interface of adding kmods. Without knowing much of the existing
source layout, and by following a specific set of rules, we can insert a
new service into the kernel.
A kmod runs in ring 0. So no user-space APIs, no syscalls, no library
invocations. Include files for modules should come from the kernel
headers; user-space includes typically come from /usr/include.
The files being included are in the following directory:
/lib/modules/4.12.0-041200-generic/build/include/linux/module.h
The kmod subsystem is responsible for loading, maintaining and cleaning up
modules. Adding descriptive comments is not mandatory. A module is just a
mechanism (a package) to insert code into kernel space.
The module Makefile is not a standalone Makefile; it hands the build over
to the kernel build system (kbuild), which takes mod.c (mod.o) and builds
a module out of it.
EXPORT_SYMBOL_GPL - symbols exported this way can't be used by proprietary
modules.
sysfs is a logical (virtual) file system; loaded modules appear under a
directory called "module" (/sys/module).
We are mainly interested in the module body itself.
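A minimal module body, as a sketch (names are illustrative; the Makefile
line is the usual kbuild one-liner):

```c
/* hello_mod.c - minimal kernel module skeleton (illustrative sketch) */
#include <linux/module.h>
#include <linux/init.h>

static int __init hello_init(void)
{
        pr_info("hello_mod: inserted into ring 0\n");
        return 0;               /* 0 = success; module stays loaded */
}

static void __exit hello_exit(void)
{
        pr_info("hello_mod: removed\n");
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");          /* needed to use EXPORT_SYMBOL_GPL symbols */

/*
 * Makefile (not standalone - it hands over to kbuild):
 *   obj-m := hello_mod.o
 * Build with: make -C /lib/modules/$(uname -r)/build M=$PWD modules
 */
```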
LDD-1
#####
Kernel services are inserted as drivers, not plain functions.
Drivers are virtualized services through which applications interact with
devices. A device driver in a GPOS kernel is basically 2 pieces of code:
- Interface to the application (kernel specific).
  Needs to be aware of the kernel architecture.
- Hardware specific (device/bus specific).
  Needs to be aware of how the h/w is physically connected to the CPU:
  what protocol to use, directly connected, or via a bus?
The kernel-specific part of a driver can be implemented in 3 different
ways. These are 3 different models of writing code - in other words,
3 ways in which an application can interact with the driver:
1. Character driver approach (synchronous devices).
2. Block driver approach (storage devices, flash, USB).
3. Network driver approach (Ethernet, WiFi etc).
1. Char drivers: Implemented for all devices where data exchange happens
synchronously, e.g. mouse, keyboard. Not used for bulk data transfer.
Create a device file and allow the application to interact with the driver
using file operations. The kernel views the driver like a filesystem
(vfs ops), so the common file* API can be used.
Device files - 2 types:
1. Character devices
2. Block devices
Check /proc/devices for the major numbers already in use.
sudo mknod /dev/veda_cdrv c 300 0
crw-r--r-- 1 root root 300, 0 Aug 26 13:49 /dev/veda_cdrv
Acquiring a major number should be done dynamically; otherwise the driver
may fail while porting. Probe the /proc/devices list and use whichever
maj,min is free to create the device file. This is a requirement for
driver portability.
8/27/2017
#########
alloc_chrdev_region() dynamically reserves an available <maj,min> tuple
and returns it. Use it instead of register_chrdev_region().
The kernel allows 'n' devices to share the same major number. For example,
a hard drive can have logical partitions, so we have one device file per
logical partition: /dev/sda1, /dev/sda2 etc. refer to multiple logical
partitions with the same major number but different minor numbers. Minor
numbers identify the individual device files/inodes.
Depending on the type of device, <maj,min> can be categorized.
Device nodes can also be created and deleted using 'devfs' (older kernels).
Udev is a generic framework (suite of scripts) that provides creation of
device files, based on events from the hotplug subsystem.
class_create() / device_create() and device_destroy() / class_destroy()
are the 4 udev-related APIs used to dynamically create and remove device
files.
These calls change the device tree and hotplug generates an event; udev
picks up the event and creates the device file. In the exit routine the
device tree is updated (the virtual device is deleted), the hotplug
subsystem generates an event, udev picks it up, and the device file is
deleted.
Check /sys/class/VIRTUAL for the device tree.
hald - the hald daemon maintains the list of devices. lshal lists the
devices that are currently plugged in.
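The dynamic <maj,min> reservation and udev-driven node creation fit
together roughly like this (a sketch; veda_cdrv/veda_class are
illustrative names, API as in 4.x kernels):

```c
/* Sketch: alloc_chrdev_region() plus udev node creation. */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/device.h>

static dev_t devno;
static struct class *veda_class;

static int __init cdrv_init(void)
{
        /* reserve 1 free <major,minor> tuple dynamically */
        if (alloc_chrdev_region(&devno, 0, 1, "veda_cdrv") < 0)
                return -1;

        /* device tree updated -> hotplug event -> udev makes /dev/veda_cdrv */
        veda_class = class_create(THIS_MODULE, "veda_class");
        device_create(veda_class, NULL, devno, NULL, "veda_cdrv");
        return 0;
}

static void __exit cdrv_exit(void)
{
        device_destroy(veda_class, devno);   /* udev deletes the node */
        class_destroy(veda_class);
        unregister_chrdev_region(devno, 1);
}

module_init(cdrv_init);
module_exit(cdrv_exit);
MODULE_LICENSE("GPL");
```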
8/28/2017
#########
#What is a kernel?
The kernel is a set of routines that service the 2 kinds of generated
events. Kernel code can be grouped in 2:
1. Process events/context.
2. Device/interrupt events/context. Higher priority; the cpu scheduler
is disabled.
Process context routines are low priority and can be pre-empted.
In the example chr_drv_skel.c, the open is in process context. It is a
single-threaded driver.
#Deep dive of char dd open.
open -> sys_open -> driver_open -> char_dev_open()
(normal file open) > vfs open > driver specific open.
When drivers are implemented, identify whether the driver is concurrent,
i.e. available to 'n' applications in parallel. The policy has to be
coded that way.
What driver open should consider during the implementation of the service:
1. Verify the caller request and enforce driver policy (concurrent or not).
2. Verify whether the target device (status) is ready or not.
(Push to wakeup if needed).
3. Check whether resources need to be allocated within the driver
(data structures, interrupt service routines, memory etc).
4. Keep a use-count to track the # of driver invocations if needed. This
may be useful while debugging drivers. "current" is a pointer to the
current pcb, so use it to print the pid of the caller etc.
#Deep dive of char dd close.
close -> sys_close -> driver_close -> char_dev_release()
(normal file close) > vfs close > driver specific close.
1. Release the resources.
2. Reduce the use-count etc.
3. Basically undo all the operations done in open.
Devices can be accessed via ports or via memory (memory mapped, graphics
cards etc).
#Deep dive of char dd read.
read -> sys_read -> driver_read -> char_dev_read()
(normal file read) > vfs read > driver specific read.
1. Verify the read arguments; validate the read request.
2. Read data from the device (from a port, device memory, bus etc).
3. Whatever data the device returns may need to be processed before it is
returned to the application.
4. Transfer the data to the app buffer.
copy_to_user transfers the data from kernel to user address space.
5. Return the # of bytes transferred.
#Deep dive of char dd write.
write -> sys_write -> driver_write -> char_dev_write()
(normal file write) > vfs write > driver specific write.
1. Verify / validate the write request.
2. Copy data from app/userspace to a driver specific buffer
(copy_from_user transfers the data from user to kernel address space).
3. Process the data (maybe format it).
4. Write the data to the hardware.
5. Increment *ppos.
6. Return nbytes (number of bytes that were written).
Many (char) frame buffer drivers implement seek/lseek. It is not possible
to seek beyond the device space.
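The read/write steps above can be sketched for a driver backed by a small
kernel buffer (illustrative names; error handling trimmed):

```c
/* Sketch of the read/write steps for a kbuf-backed char driver. */
#include <linux/fs.h>
#include <linux/uaccess.h>

#define KBUF_SIZE 1024
static char kbuf[KBUF_SIZE];

static ssize_t my_read(struct file *filp, char __user *ubuf,
                       size_t count, loff_t *ppos)
{
        if (*ppos >= KBUF_SIZE)                 /* 1. validate request */
                return 0;                       /* EOF */
        if (count > KBUF_SIZE - *ppos)
                count = KBUF_SIZE - *ppos;
        /* 2-3. device data would be read and processed here */
        if (copy_to_user(ubuf, kbuf + *ppos, count))  /* 4. kernel->user */
                return -EFAULT;
        *ppos += count;
        return count;                           /* 5. bytes transferred */
}

static ssize_t my_write(struct file *filp, const char __user *ubuf,
                        size_t count, loff_t *ppos)
{
        if (*ppos >= KBUF_SIZE)                 /* 1. validate request */
                return -ENOSPC;
        if (count > KBUF_SIZE - *ppos)
                count = KBUF_SIZE - *ppos;
        if (copy_from_user(kbuf + *ppos, ubuf, count))  /* 2. user->kernel */
                return -EFAULT;
        /* 3-4. process/format and push to hardware here */
        *ppos += count;                         /* 5. advance position */
        return count;                           /* 6. bytes written */
}
```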
# concurrency
When drivers are concurrent, there is a question of parallel access. Take
care of the safety of shared data/resources: driver routines need to be
made re-entrant, using locking functions etc.
With drivers and kernel services, multiple applications may request the
same driver functions in parallel. When a concurrent function guarantees
the safety of shared data, the function is called re-entrant.
Even n++ is not guaranteed to be atomic on all processors. Atomic
instructions take a memory lock and they are CPU specific; the atomic
macros expand into CPU specific instructions.
2 main types of locking mechanisms:
1. Wait type, which in turn has 2 kinds - semaphore and mutex.
If the lock is held and there is contention on the lock by another
thread, that thread is put into a wait state.
2. Poll type - spinlocks.
They keep spinning (polling) without any wait.
A mutex is more lightweight than a semaphore.
If there is contention on a shared resource between process context code
of the driver and interrupt context code of the driver (e.g. an interrupt
handler), use a special variant of spin_lock: spin_lock_irqsave,
spin_lock_irq, or spin_lock_bh.
The normal spin_lock should be used only for contention between driver
code paths that are all in process context.
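A sketch of the process-context vs interrupt-context contention case
(names are illustrative):

```c
/* Sketch: data shared between process context and an ISR.
 * spin_lock_irqsave disables local interrupts while the lock is held,
 * so the ISR on this cpu cannot deadlock against us. */
#include <linux/fs.h>
#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(dev_lock);
static int shared_count;            /* touched by both contexts */

static ssize_t my_read(struct file *filp, char __user *ubuf,
                       size_t count, loff_t *ppos)
{
        unsigned long flags;

        spin_lock_irqsave(&dev_lock, flags);    /* process context side */
        shared_count++;                         /* n++ alone is not atomic */
        spin_unlock_irqrestore(&dev_lock, flags);
        return 0;
}

static irqreturn_t my_isr(int irq, void *dev_id)
{
        spin_lock(&dev_lock);                   /* IRQs already off here */
        shared_count--;
        spin_unlock(&dev_lock);
        return IRQ_HANDLED;
}
```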
8/29/2017
#########
#ioctl - Used for special operations on a device file.
Special operations are device and vendor specific, so function pointers
for each special operation can't be pre-declared.
Up to 255 different cases/functionalities for a single ioctl.
ioctl_list - lists all the available ioctls.
To implement ioctl on the device, do the following:
1. Identify the special operations on the device.
2. For each special op, create a request command.
Also create a header file so that applications can use the ioctl.
3. Implement all the cases (one per request).
Use the ioctl encoding macros to create a unique ioctl request/command
tuple. There are 4 encoding macros that can be used.
To arrive at unique numbers easily we use the following macros:
_IO(type, nr);
_IOW(type, nr, dataitem) - the size field is encoded using sizeof(dataitem).
_IOR(type, nr, dataitem)
_IOWR(type, nr, dataitem)
inparam - application writes, driver reads (_IOW, application writes).
outparam - application reads, driver writes (_IOR, application reads).
8/30/2017
#########
#ioctl - continued.
ioctl > sys_ioctl > do_ioctl [acquire big kernel lock] > fops->ioctl
(chr_drv_ioctl) > [unlock]
Because do_ioctl() acquires the big kernel lock, ioctl is rendered single
threaded: there is no way 2 apps/threads will enter the ioctl function in
parallel.
unlocked_ioctl - this function pointer in file_operations is used to make
ioctl concurrent.
/usr/src/linux-headers-4.12.0-041200/include/linux/capability.h defines
the various privilege levels, called CAP_* constants.
The capable() kernel macro, used within the ioctl implementation, can
check whether a given privilege level is allowed or not.
8/31/2017
#########
#Semaphores.
Semaphore value = 0 means the semaphore is locked.
Value = 1 means the semaphore is unlocked.
down_interruptible() tries to acquire the lock; if it doesn't get the
lock, the caller goes into an interruptible wait state. A wait-queue is
created and the pcb of the calling context is enqueued on it.
up() - e.g. in the write path - makes the semaphore available (the unlock
call of the semaphore).
The wait-queues that are part of a semaphore are FIFO. This style of
locking with semaphores is discouraged because it can lead to confusion.
#Completion locks
Another way to let callers block when a resource is not available.
#Wait-queues
wait_event_interruptible() pushes the caller into an interruptible wait
state; the calling process's pcb is enqueued on the wait-queue.
wake_up_interruptible() is the routine to wake it up.
This wakes up all the PCBs that are in the wait-queue.
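A sketch of a reader blocking on a wait-queue until a writer wakes it
(illustrative names):

```c
/* Sketch: reader sleeps until a writer produces data. The pcb of the
 * sleeping reader is enqueued and woken by wake_up_interruptible(). */
#include <linux/fs.h>
#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int data_ready;

static ssize_t my_read(struct file *filp, char __user *ubuf,
                       size_t count, loff_t *ppos)
{
        /* sleeps interruptibly until data_ready != 0; fails on a signal */
        if (wait_event_interruptible(my_wq, data_ready != 0))
                return -ERESTARTSYS;
        data_ready = 0;
        /* ... copy_to_user() the data here ... */
        return 0;
}

static ssize_t my_write(struct file *filp, const char __user *ubuf,
                        size_t count, loff_t *ppos)
{
        /* ... copy_from_user() into the driver buffer here ... */
        data_ready = 1;
        wake_up_interruptible(&my_wq);   /* wake the sleeping readers */
        return count;
}
```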
9/1/2017
########
Linux provides a way for a driver to deliver messages to applications
asynchronously. Poll and select are 2 ways in which applications can get
the status of a device.
#Async notifications:
Async notification is not used in all applications; it is used when IO is
not high priority. It requires applications to register with the driver
for async messages to be delivered.
Applications get SIGIO from the driver and can handle it with a special
signal handler for SIGIO.
kill_fasync(), which sends the SIGIO, can be called from the Interrupt
Service Routine (ISR).
poll_wait() is a kernel routine that puts the calling application into a
wait state (on the wait queue).
#Introducing delays in kernel routines.
For kernel debugging we may need to introduce delays.
Busy wait - an infinite loop; the cpu stays busy.
Yield cpu time - schedule() relinquishes the cpu and is often used;
this doesn't waste cpu cycles.
Timeouts - specify how long to delay by providing an expiry timer; the
process resumes after the timer expires.
#### Interrupts
Hardware devices are connected to the interrupt controller via special
lines called IRQ lines. Devices trigger interrupts using their IRQ line.
X86 provides 16 IRQ lines; there can be up to 256 interrupt vectors.
The IRQ line assigned to a particular device needs to be known to the
driver writer.
The IRQ descriptor table is a linked list of IRQ descriptors.
9/2/2017
########
lspci -v lists the pci devices.
request_irq() is a kernel function that registers an IRQ handler with a
specified IRQ line. It basically creates an instance of 'irqaction' that
hangs off the IRQ descriptor table.
__init my_init() will call request_irq();
__exit my_exit() will call free_irq();
do_irq() is responsible for running the ISR routines.
Refer to interrupt.h - it lists the possible flags for writing ISRs.
If we want our ISR routine to be high priority, IRQF_DISABLED can be used
so that the kernel won't pre-empt it.
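A sketch of ISR registration (IRQ 19 and the names are illustrative; a
real driver learns its line from the bus):

```c
/* Sketch: hanging an irqaction off the IRQ descriptor table. */
#include <linux/interrupt.h>
#include <linux/module.h>

#define MY_IRQ 19
static int dev_id;                      /* cookie for shared lines */

static irqreturn_t my_isr(int irq, void *dev)
{
        /* acknowledge the device, grab the data, defer the rest */
        return IRQ_HANDLED;
}

static int __init my_init(void)
{
        /* creates the irqaction instance for this line */
        return request_irq(MY_IRQ, my_isr, IRQF_SHARED, "my_dev", &dev_id);
}

static void __exit my_exit(void)
{
        free_irq(MY_IRQ, &dev_id);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```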
Exceptions are synchronous interrupts at the CPU level.
Device interrupts are asynchronous.
The vector table contains the list of descriptors:
0-31 are exceptions (page faults etc);
0-19 are the architecturally defined exceptions;
20-31 are Intel reserved.
IRQ0 = 32, IRQ1 = 33 (external device interrupts).
For interrupts between 32-127, Linux has one response function called
do_irq(). do_irq() is a low level routine. For a particular interrupt,
do_irq() checks whether a handler exists or not. ISRs are called from
do_irq().
#How interrupts work in Linux.
do_irq() is a function of process 0. Process 0 (the kernel) responds to
interrupts. Process 1 (init) is responsible for the creation and
management of processes.
When an interrupt occurs, do_irq() queries the interrupt controller and
finds which IRQ line was triggered. The Linux kernel configures the
do_irq() routine as the default response function for all external
interrupts.
do_irq(), a routine of process 0, is responsible for allocating the
interrupt stack and invoking the appropriate ISR routines. The interrupt
stack comes in 2 flavors. If the kernel is configured to use 8K stacks,
there is no separate interrupt stack. If the kernel is configured with 4K
stacks, a separate interrupt stack is allocated; with 4K there is a
performance hit. By default the kernel stack size is 8K. In embedded
Linux, where there may be a memory constraint, 4K may be needed.
1. Find the interrupt request line on which the interrupt signal was
triggered (by querying the interrupt controller).
2. Look up the IRQ descriptor table for the addresses of the registered
interrupt service routines.
3. Invoke the registered ISRs.
4. Enable the IRQ line.
5. Execute other priority work.
[More to come here]
6. Invoke the process scheduler. Until step 6, the process scheduler is
disabled.
Interrupt latency is the total amount of time the system spends in
response to an interrupt. If interrupt latency is high, application
performance may be impacted: high priority tasks starve because the
system spends more time in interrupts, and one device may block other
devices. When the timer interrupt goes off, other interrupts are disabled.
#Factors contributing to interrupt latency.
a] H/W latency: the time the processor takes to ack the interrupt and
invoke the ISR.
b] Kernel latency: in Linux/Windows/etc, when process 0 is responding, the
time process 0 takes to start the ISR. This is called kernel latency.
c] ISR latency: the time the ISR routine itself takes once invoked.
ISRs are usually referred to as interrupt handlers.
d] Soft interrupt latency (bottom half).
e] Scheduler latency:
e.1] Check if any high-pri tasks are waiting in the queue.
e.2] Run signal handlers for the pending signals.
e.3] Give the cpu to the high-pri task.
An RTOS has a fixed-time latency for interrupts; a GPOS does not have a
fixed-time latency.
#For a NIC, both reception and transmission of a pkt trigger an interrupt.
#Sample pseudo-code for the interrupt handler for a NIC on
#reception of a pkt (on the network device):
1. Allocate a buffer to hold the packet.
2. Copy from the NIC buffer to kernel memory (or vice-versa for transmit).
3. Process the packet, especially the physical (link-layer) header.
4. Queue the pkt, handing it over to the upper protocol layers.
9/3/2017
########
While designing ISRs, the following issues are to be considered.
#Don'ts while writing an ISR routine:
1. Avoid calling dynamic memory allocation routines.
2. Avoid transferring data (synchronously) between 2 buffer blocks.
3. Avoid contending for access to global data structures, because that
means going through locks.
4. Avoid operations on user space addresses.
5. Avoid calls to the scheduler. While an ISR is running the scheduler is
disabled, so a call to the scheduler may result in a deadlock and must be
avoided.
6. Avoid operations which are non-atomic.
#Do's while writing an ISR routine:
1. Use pre-allocated buffers (e.g. skb buffers in network drivers).
2. Consider using DMA whenever data needs to be transferred between
device and memory.
3. Consider using per-CPU data wherever needed.
4. Identify non-critical work and use the appropriate deferred routines
to execute it when the system is idle, or at another scheduled time.
Anything h/w specific done within the ISR is critical; anything other
than that is non-critical from the interrupt perspective.
Linux has bottom halves (RTOSes call BHs "deferred functions"). Basically
steps 3 and 4 above (processing and enqueuing the packets to higher
layers) can be deferred; they need not run with interrupts disabled on
the cpu.
ISR
{
        schedule_bh(func);
}
func() /* Soft IRQ */
{
        body;
}
Soft IRQ == a piece of code that runs with IRQs enabled but the scheduler
disabled. It runs in interrupt context.
ISR terminates, IRQs enabled, scheduler still disabled - that is the
bottom half.
Bottom halves can be of 2 types: those running in soft-interrupt context
(called soft IRQs) and work queues.
9/4/2017
########
#SOFTIRQs
Soft IRQs, tasklets and workqueues are the 3 ways to do deferred execution
of a routine while servicing an interrupt. Non-critical work can be
scheduled this way.
#Linux 2.6 bottom halves.
1. Soft IRQ: a way in which a routine can be scheduled for deferred
execution. It is available to static drivers only, not to dynamic drivers
(modules). This is a bottom half which runs with interrupts enabled but
pre-emption disabled.
Refer to include/linux/interrupt.h.
All softirqs are instances of the softirq_action structure.
There is a maximum of 32 soft irqs.
#Execution of softirqs#
1. There is a maximum of 32 allowed softirqs, and a per-cpu pending
softirq list. If the ISR runs on cpu1 then its softirq will also run on
cpu1, and so on. When raise_softirq() is called, the specified softirq
instance is enqueued onto the (per-cpu) list of pending softirqs.
Both the top half and the bottom half need to be on the same cpu,
otherwise there would be a cache-line problem.
2. The pending softirq list is cleared (run one by one) by the do_irq()
routine immediately AFTER the ISR terminates, with the interrupt lines
enabled. Softirqs run with pre-emption disabled but IRQs enabled. If an
interrupt goes off during a softirq, do_irq() runs re-entrantly and the
softirq gets pre-empted. Softirq pre-emption happens for nested
interrupts.
3. When can softirqs run? In 3 different ways:
3.1. A softirq can run in the context of do_irq().
3.2. A softirq can also run right after spin_unlock_bh().
do_irq() can't run the softirq while spin_lock_bh() is held.
3.3. A softirq may execute in the context of ksoftirqd (a per-cpu kernel
thread). This is the process context of the ksoftirqd daemon.
static int __init my_init(void)
{
        open_softirq(MY_SOFTIRQ, func);
        return request_irq(...);
}
ISR {
        raise_softirq(MY_SOFTIRQ);
}
func() {
        ..
        ..
}
# Another example of a softirq.
# Both need to run on the same cpu.
func()
{
        /* write to a list */
        while (1)
                raise_softirq(mysoftirq);  <<< This is legal, though bad.
}
read() {
        spin_lock_bh();
        /* read from the list */
        spin_unlock_bh();
}
The Linux kernel allows a softirq routine to reschedule itself from
within itself: it is possible to call raise_softirq() within the softirq.
ksoftirqd will also clear the pending softirqs when it gets a cpu slice.
#Question1: Why do we allow a softirq to reschedule itself?
Consider 2 cpus as below:
cpu1:                          cpu2:
read()                         func()
{                              {
    spin_lock_bh()                 spin_lock_bh()
    /* read from list */           /* write to the list */
    spin_unlock_bh()               spin_unlock_bh()
}                              }
Suppose cpu1 has acquired the lock and is reading from the list, and an
interrupt is delivered to cpu2. cpu2 has some bh (softirq) func() and it
tries to acquire the lock. The spin_lock_bh() in func() would spin on
cpu2 waiting for the lock to become available. But cpu2 is in interrupt
context, so it would waste cpu cycles, which is not correct. Hence
rescheduling of the softirq from within the bh is permitted.
Rescheduling a bh is required to relinquish the cpu from within the
bottom half when a critical resource (like a lock) required for bottom
half execution is not available.
#Question2: Can the same softirq (bh) run on 2 cpus in parallel?
#Yes, because interrupts can be delivered to different cpus.
Use appropriate mutual exclusion locks while writing softirqs.
#Limitations of softirqs.
1. Softirqs are concurrent, i.e. the same softirq could run on 'n' cpus
in parallel.
2. While implementing softirq code, using mutual exclusion locks is
mandatory wherever needed.
3. Using locks in interrupt context code results in variable interrupt
latency.
These are the reasons why softirqs are not available to modules. Consider
softirqs only when you need concurrent execution in the bottom half.
2. Tasklets.
A tasklet is a softirq whose execution is guaranteed to be serial.
Because of the serial execution there is no need for locks (with respect
to the tasklet itself). Tasklets are dynamic softirqs that can be used
from module drivers, without concurrency. They are always executed
serially.
Tasklets = dynamic softirqs - concurrency.
3. Work queues.
These are the 3 different ways of deferring routines to run later during
interrupt processing.
9/5/2017
########
#TASKLETS
Built on top of softirqs.
Represented by 2 softirqs (HI_SOFTIRQ and TASKLET_SOFTIRQ).
Represented by the tasklet_struct structure.
Can be created both statically and dynamically.
Less restrictive synchronization rules.
#Steps involved in using tasklets.
1) Declare the tasklet:
DECLARE_TASKLET(name, func, data);
OR
struct tasklet_struct mytasklet;
tasklet_init(&mytasklet, func, data);
2) Implement the BH routine:
void func(unsigned long data);
3) Schedule the tasklet:
tasklet_schedule(&mytasklet); <<< Normal priority
OR
tasklet_hi_schedule(&mytasklet); <<< High priority - use if you need it.
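The 3 steps above, sketched together with an ISR (illustrative names;
the classic unsigned-long tasklet API as in 4.x kernels):

```c
/* Sketch: the top half raises a tasklet; the bottom half runs later
 * with IRQs enabled. */
#include <linux/interrupt.h>

static void my_bh(unsigned long data)
{
        /* non-critical work deferred from the ISR,
         * e.g. protocol processing */
}

static DECLARE_TASKLET(my_tasklet, my_bh, 0);

static irqreturn_t my_isr(int irq, void *dev)
{
        /* critical, h/w-specific work only, then defer */
        tasklet_schedule(&my_tasklet);     /* or tasklet_hi_schedule() */
        return IRQ_HANDLED;
}
```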
# When are tasklets run? - Execution policy. Tasklets are executed using
the same policy that applies to softirqs, since the interrupt subsystem
of the kernel views a tasklet as an instance of either HI_SOFTIRQ or
TASKLET_SOFTIRQ. The interrupt subsystem guarantees the following with
regard to the execution of tasklets.
# Tasklets --- multithreaded analogue of BHs (from the interrupt.h file):
Main feature differing them from generic softirqs: a tasklet runs on only
one CPU at a time.
Main feature differing them from BHs: different tasklets may be run
simultaneously on different CPUs.
Properties:
* If tasklet_schedule() is called, the tasklet is guaranteed to be
executed on some cpu at least once after this.
* If the tasklet is already scheduled but its execution has not yet
started, it will be executed only once.
* If the tasklet is already running on another CPU (or schedule is called
from the tasklet itself), it is rescheduled for later.
* A tasklet is strictly serialized with respect to itself, but not with
respect to other tasklets. If the client needs some inter-tasklet
synchronization, it is done with spinlocks.
2 different tasklets in 2 different drivers can run in parallel, so there
may be a need for synchronization: inside a tasklet, if you access a
global data structure, locking is required.
Like softirqs, tasklets can run in the same 3 ways listed earlier: in the
context of do_irq(), after spin_unlock_bh(), or in the context of the
per-cpu ksoftirqd kernel thread.
# Workqueues
Workqueues are 2.6-kernel only; tasklets and softirqs existed in 2.0.
Workqueue items are instances of the work_struct structure. A workqueue
bottom half runs in process context (a kernel worker thread), so it may
sleep.
# Timer bottom half - executes whenever the assigned timeout elapses.
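A workqueue sketch (illustrative names); the handler runs in the process
context of a kernel worker thread, so it may sleep:

```c
/* Sketch: deferring work to a workqueue from an ISR. */
#include <linux/interrupt.h>
#include <linux/workqueue.h>

static void my_work_fn(struct work_struct *work)
{
        /* deferred, sleep-capable work: can kmalloc(GFP_KERNEL),
         * take mutexes, do synchronous I/O, etc */
}

static DECLARE_WORK(my_work, my_work_fn);

static irqreturn_t my_isr(int irq, void *dev)
{
        schedule_work(&my_work);   /* enqueue onto the shared worker pool */
        return IRQ_HANDLED;
}
```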
9/6/2017
########
#Memory management in Linux
#There are 2 source directories for memory management:
1] /usr/src/linux/mm - (memory manager)
2] /usr/src/linux/arch/i386/mm - (memory initializer, runs at boot time)
ppc/mm, mips/mm, alpha/mm - architecture specific source code in the
kernel.
The processor views memory in real mode as a single array.
The processor views memory in protected mode as a set of arrays/vectors.
Each frame is 4K in size. The view changes depending on the mode.
When does the shift from real to protected mode happen? (Early at boot.)
# Different types of memory allocation algorithms.
Page allocator: this is the primary memory allocator and the source of
all memory.
Slab allocator: odd-size memory allocation; always returns physically
contiguous memory.
kmalloc() returns an address; kfree() frees the address.
Can be called from a driver or a kernel service.
/proc/slabinfo - view the slab allocation details here.
Slab allocation allows private caches (per driver/kernel service).
The default allocators can be called from the cache list:
kmem_cache_create()
kmem_cache_alloc()
kmem_cache_free()
kmem_cache_destroy()
Cache allocator: when data structures need to be allocated frequently,
reserve some pages as a cache of objects so that drivers and filesystems
can pre-create the objects and reuse them. These are not available to
applications directly.
Fragmented memory allocator (e.g. vmalloc): odd-size requests where the
source of allocation is various fragments. Used when the allocation
request size is large; not called from applications directly.
Boot memory allocator: startup drivers and BSP drivers acquire memory
using the boot mem allocator.
9/7/2017
########
#include <linux/mempool.h>
pcb (task_struct), cdev etc - most of these are pre-allocated in pools.
A cache reserves pages which will be used for allocating memory later.
Memory pool: create a cache and allocate instances from it. A mempool is
based on a cache. SCSI drivers, USB drivers, network drivers etc create
their frequently used data-transfer structures using memory pools.
The slab layer allows kernel services to create memory pools that can be
used for pre-allocation of specific objects:
1. Create a memory pool.
2. Acquire an object from the pool (mempool_alloc).
3. Release the object (mempool_free).
4. Destroy the pool (mempool_destroy).
5. Destroy the cache (AFTER the pool destruction) - kmem_cache_destroy.
FS and protocol stacks typically use this facility.
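Steps 1-5 sketched (struct my_obj and the minimum of 8 pre-allocated
objects are illustrative):

```c
/* Sketch: a mempool of 'struct my_obj' built on a private slab cache. */
#include <linux/slab.h>
#include <linux/mempool.h>

struct my_obj { int id; char payload[60]; };

static struct kmem_cache *my_cache;
static mempool_t *my_pool;

static int __init pool_init(void)
{
        my_cache = kmem_cache_create("my_obj", sizeof(struct my_obj),
                                     0, 0, NULL);
        /* 1. create the pool on top of the cache */
        my_pool = mempool_create_slab_pool(8, my_cache);
        return my_pool ? 0 : -ENOMEM;
}

static void use_pool(void)
{
        /* 2. acquire and 3. release an object */
        struct my_obj *o = mempool_alloc(my_pool, GFP_KERNEL);
        if (o)
                mempool_free(o, my_pool);
}

static void __exit pool_exit(void)
{
        mempool_destroy(my_pool);        /* 4. pool first... */
        kmem_cache_destroy(my_cache);    /* 5. ...then the cache */
}
```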
Any request beyond 128K may make kmalloc() fail, because kmalloc returns
a physically contiguous block and 128K is its typical upper limit.
#mmap, munmap
Map an IO cache buffer (from the filesystem) into memory.
If the application uses MAP_ANONYMOUS, the file descriptor is not
considered and anonymous memory is mapped.
mmap can also be used on device files:
open(/dev/file1) and then mmap().
(low memory zone, normal zone, highmem zone)
Anything above 896M is called the high memory zone.
The process address space is also in the normal zone (< 896M).
Implementing the mmap callback in a character driver:
1. Each process in user space acquires a set of pages into which the
process code, data and stack segments are mapped.
2. The process pcb in kernel space carries details about the pages
allocated to the process and the segments to which they are mapped.
(Refer mm18)
For each process (application):
code segments go into a few pages and need not be contiguous;
stack segments go into a few pages;
data segments go into a few pages.
Each vm_area_struct corresponds to one contiguous segment (block).
/proc/pid/maps shows each of these vm_area_structs.
block = set of pages.
# How Linux tracks mappings:
1. The mm_struct instance contains a reference to a list of virtual
memory blocks (vm_area_struct) that are mapped to the application's
code/data/stack segments.
2. Each instance of vm_area_struct represents one block of the process
address space.
3. mm_struct also carries a reference to the process page table, with
valid page table entries (PTEs). For protection reasons, we map into
different segments.
# What happens when mmap is called?
1. The application's mmap request on a file invokes the do_mmap kernel
routine via sys_mmap.
2. do_mmap() allocates a new instance of struct vm_area_struct and fills
it with the attributes of the new block, based on the arguments passed to
mmap() by the application. malloc() also ends up calling do_mmap(). This
is a key routine.
3. do_mmap() invokes the mmap support routine assigned to the file
operations (fops) instance, e.g. fops->mmap().
# How does the driver's mmap work?
1. Identify the physical memory region (frames) that needs to be mapped.
2. Map the physical memory region to the kernel's logical address space, which
is page+offset. [Physical memory == frame+offset.]
3. Mark the pages reserved so the memory manager leaves them alone (they are
not paged out).
4. Map the physical address region to the VMA instance.
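The four steps above can be sketched as an mmap fop. Assumptions (not from the notes): demo_mmap is a hypothetical callback, and buf is a driver-owned, page-aligned kmalloc'd buffer of BUF_SIZE bytes whose pages were marked reserved at allocation time; kernel fragment, not standalone:

```c
#include <linux/mm.h>
#include <linux/fs.h>
#include <asm/io.h>

static int demo_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;     /* step 1 */
	unsigned long pfn  = virt_to_phys(buf) >> PAGE_SHIFT; /* step 2 */

	if (size > BUF_SIZE)
		return -EINVAL;

	/* Steps 3+4: the backing pages were set reserved at alloc time
	 * (SetPageReserved); remap_pfn_range() now maps the physical
	 * frames into the caller's VMA. */
	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}
```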
9/8/2017
########
# Direct Memory Access (DMA) - memory allocation and management.
Bus-specific modes etc. require DMA-capable allocations.
#Address translation using page tables.
Newer Intel PAE extensions provide 36-bit physical addresses. There are 3
patches available which break up the virtual address space.
1. Each process carries its own page table, allocated by the kernel at
process load time.
2. Page tables contain entries mapping each page to a valid physical frame.
Valid   virtual-page   modified   protection   page-frame
  1         140            1          RW           31
  1          20            0          R X          38
Each entry is called a page table entry (PTE).
3. The processor's MMU (memory management unit) looks up the page table at
runtime to translate a logical address to a physical address. A reference to
the page table is loaded into a processor register on every context switch.
# Multi-level paging: where is it used?
Linux uses 3-level paging on desktop/server architectures and 4-level paging
on NUMA/enterprise architectures.
#Overhead / limitations of page tables.
1. As the number of processes increases, the kernel needs to set aside around
3MB of physical memory per process just to hold PTEs.
2. Page table indirection is one approach that avoids this wastage of memory:
swap page tables out to disk when not required and swap them in when needed.
This approach is implemented in many ways; 2-level, 3-level paging etc. are
different ways to manage page tables. The objective here is to make page
tables dynamically extensible.
3. The processor needs to spend 'n' cycles looking up page tables
before the actual operations on memory can be executed.
#How to optimize
1. To optimize translation time, CPUs provide CPU-local buffers
called Translation Lookaside Buffers (TLBs).
2. Processor < L1 < L2 < L3 < Memory -
the processor's registers are the fastest to access.
3. TLBs can be managed in 2 ways
3.1. Kernel / software managed: in s/w managed mode each TLB-miss
event triggers an exception, which in turn is handled by the kernel by
updating TLB entries from the page tables.
3.2. Hardware managed: in h/w managed mode each TLB-miss event makes the
processor walk the page table in physical memory and load the appropriate
entries into the TLB.
4. Processors also provide high-speed data/instruction caches to optimize
program execution by mirroring a program's data/instructions into cache.
9/10/2017
########
#Memory Mapped IO (MMIO) vs Port Mapped IO (PIO)
PIO: x86 has port-mapped IO, with a separate set of instructions to
read/write ports. The PC platform uses 16-bit port addressing.
MMIO: RISC machines (ARM etc.) have memory-mapped IO; the memory bus is used
for IO as well.
/proc/iomem - details of memory-mapped IO (devices that are memory mapped).
/proc/ioports - details of port-mapped IO.
#Accessing port-mapped devices from user space.
1. Port-mapped addresses can be accessed by Linux applications and by
kernel-space services (modules).
2. Apps can access port-mapped devices using either of the following ways.
2.1] Running as a root application, using the ioperm()/iopl() routines.
2.2] Using a special device file /dev/<..> - will be disabled in
future.
Refer /home/abk/Desktop/Work/Linux/veda/Code-2.6.34/devio/ioports.c
man ioperm, iopl - check 2 manpages.
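In the spirit of the ioports.c example referenced above, a minimal user-space sketch of way 2.1 (x86 only; must run as root; the parallel-port base 0x378 and its register layout are assumptions for illustration):

```c
#include <stdio.h>
#include <sys/io.h>   /* ioperm(), inb(), outb() - x86 glibc */

int main(void)
{
	/* Ask for access to 4 ports starting at 0x378 (legacy parallel
	 * port base - illustrative only). Fails without root. */
	if (ioperm(0x378, 4, 1) < 0) {
		perror("ioperm");
		return 1;
	}

	outb(0xAA, 0x378);                    /* write the data register  */
	printf("status=0x%x\n", inb(0x379));  /* read the status register */

	ioperm(0x378, 4, 0);                  /* drop the permission */
	return 0;
}
```

ioperm() covers ports below 0x400; iopl() is the (coarser) alternative for the full port range - see the two manpages mentioned above.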
# Kernel-space port-mapped device access.
1. From the kernel, do request_region() to seek permission. Modules
invoke the request_region() routine to check for port access permissions and
acquire the port resource.
2. Use the kernel-mode in/out family of functions for reads and writes on the
IO port.
#Kernel APIs are:
in[bwl], out[bwl] and string variants ins[bwl] and outs[bwl]
linux/ioport.h and kernel/resource.c - Audit files.
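A kernel-side sketch of the same flow: claim the range with request_region(), use the in/out family, then release it. DEMO_BASE and the register layout are assumptions; kernel fragment, not standalone:

```c
#include <linux/ioport.h>
#include <linux/io.h>

#define DEMO_BASE 0x378   /* hypothetical port base */
#define DEMO_LEN  4

static int demo_port_io(void)
{
	/* Acquire the port resource; fails if another driver owns it. */
	if (!request_region(DEMO_BASE, DEMO_LEN, "demo"))
		return -EBUSY;

	outb(0xAA, DEMO_BASE);                              /* write a register */
	pr_info("demo: status=0x%x\n", inb(DEMO_BASE + 1)); /* read one back */

	release_region(DEMO_BASE, DEMO_LEN);
	return 0;
}
```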
# What is a bus?
Bus = data path for transferring data between the CPU and devices.
3 kinds of buses:
1. Address bus = used to generate addresses.
2. Data bus (parallel lines to carry data) = used to transfer data.
3. Control bus = used to transfer control signals.
The IO bus is the connection between IO devices and the CPU.
# Summary of PIO
# PIO can be done in 2 ways
From user space:
ioperm/iopl with the in/out family of functions
/dev/port - using the driver's read/write operations
From kernel space:
Using the in/out family with request_region()
Using ioport_map() - the newer way.
# MMIO
No way to do memory mapped IO from userspace.
# Memory alignment.
Refer Documentation/unaligned-memory-access.txt in the kernel source.
Memory alignment is storing a data value at an address that is evenly
divisible by its size. Unaligned accesses cause exceptions on some
architectures and are not supported on others. Intel x86 handles unaligned
accesses in hardware (at a performance cost) rather than faulting.
9/11/2017
########
#Block Device Drivers
Bulk and mass storage devices use the block driver model for asynchronous I/O.
Buffer/page cache - where requested file data is looked up first. An I/O
request from the application is synchronous if the data is found in the
buffer cache.
If the I/O request doesn't find the data in the buffer/page cache, the
application is put to wait (blocked) and a request is made to the block layer.
A block driver is per physical disk.
There are 2 types of block drivers:
1. Logical block drivers - RAMDISK; emulation without a physical device.
2. Physical block drivers with an I/O scheduler. A request queue sits
between the block layer and the physical block driver that actually
talks to the disk.
The VFS identifies a block device as an instance of gendisk for a block driver.
The VFS identifies a char device as an instance of cdev for a char device driver.
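For the logical (RAMDISK-style) case, registration in the 2.6-era API looks roughly like this. demo_request, demo_fops and NSECTORS are assumptions; error handling omitted; kernel fragment, not standalone:

```c
#include <linux/blkdev.h>
#include <linux/genhd.h>

static int major;
static struct request_queue *queue;
static struct gendisk *disk;

static int demo_blk_init(void)
{
	major = register_blkdev(0, "demoblk");      /* 0 = dynamic major */
	queue = blk_init_queue(demo_request, NULL); /* queue + I/O scheduler */

	disk = alloc_disk(1);                       /* 1 minor, no partitions */
	disk->major       = major;
	disk->first_minor = 0;
	disk->fops        = &demo_fops;
	disk->queue       = queue;
	set_capacity(disk, NSECTORS);               /* in 512-byte sectors */
	add_disk(disk);                             /* gendisk now live */
	return 0;
}
```

A purely logical driver can instead bypass the request queue (and I/O scheduler) with blk_queue_make_request(), since there is no physical seek cost to optimize.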
9/12/2017
########
#Block Device Drivers - Continued
The number of bios involved in a request corresponds to the number of
physically contiguous runs of sectors. If the whole file falls in physically
contiguous sectors, there will be just one bio, and its bio_vecs will cover
the N physically contiguous sectors.
9/13/2017
########
#PCI and network drivers.
- PCI is typically found in PCs but not in SoCs or embedded systems.
2 types of bridges exist:
- PCI-to-PCI bridge
- PCI-to-ISA bridge (PCI2eISA, PCI2PCMCIA etc.).
Each PCI device carries 256 bytes of configuration information, with some
registers held in an EEPROM/firmware.
The configuration data is stored inside the device structure.
The first 64-byte region carries a general header (structure) that
provides details about the device: class, vendor, status register, command
register, port and memory information, and interrupt line number (IRQ
number). This header is common/required for all devices; the rest is
device-specific registers.
For each PCI bus, there is a struct called pci_bus.
For each PCI device, there is a struct called pci_dev.
lspci -v will show the details of the PCI devices.
00:02.0 VGA compatible controller
bus-id:device-id.function-id. Each PCI device can provide up to 8 functions
(function numbers 0-7).
#Steps involved in interaction with PCI devices
1. Register with PCI bios.
2. Enable the device. (Push to wakeup state).
3. Probe device configuration. (IRQs, IO ports/memory regions).
4. Allocate resources.
5. Communicate with the device.
# Implementation of PCI NIC drivers (PCI and network code is new).
1. Register the driver with PCI subsystem (PCI bios).
pci_register_driver() - This function registers driver with PCI bios.
2. static struct pci_driver nic_driver = {
	.name     = "nicdriver",
	.id_table = nic_idtable,
	.probe    = nic_probe,
	.remove   = nic_remove,
};
static int __init nic_init(void)
{
	return pci_register_driver(&nic_driver);
}
static void __exit nic_cleanup(void)
{
	pci_unregister_driver(&nic_driver);
}
The PCI bios invokes the probe routine of the registered driver when the
specified physical device is found. probe is a callback, invoked when the
device is physically present.
What do we do in the probe function?
# Major steps in a NIC driver.
1. Carry out device initialization operations.
1.a] Enable the device using pci_enable_device().
1.b] Enable bus mastering (if available).
Bus mastering means the device has DMA capability.
pci_set_master()
1.c] Extract IO/memory information.
pci_resource_start()
pci_resource_len()
pci_request_regions()
Verify that the device's IO/memory region is not already in use; 2 drivers
should not use the same registers.
Perform memory / IO mapping:
pci_iomap() etc.
2. Register the driver with the appropriate subsystem.
In the case of a network driver, register it with the common net device layer.
A char driver registers with the VFS.
A block driver registers with the block layer and the VFS (optional).
A network driver registers with the Common Netdevice Layer. In n/w
terminology this is the DLL (data link layer); it comprises code for the MAC
layer and the physical layer.
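Steps 1.a-1.c plus the subsystem registration can be sketched as a probe routine. nic_probe matches the pci_driver skeleton shown earlier; the BAR number (0) and the error paths are assumptions; kernel fragment, not standalone:

```c
#include <linux/pci.h>

static void __iomem *regs;

static int nic_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int err;

	err = pci_enable_device(pdev);   /* 1.a: wake the device up */
	if (err)
		return err;

	pci_set_master(pdev);            /* 1.b: enable bus mastering (DMA) */

	err = pci_request_regions(pdev, "nicdriver"); /* claim IO/memory */
	if (err)
		goto disable;

	/* 1.c: map BAR 0 (device registers) into kernel address space. */
	regs = pci_iomap(pdev, 0, pci_resource_len(pdev, 0));
	if (!regs) {
		err = -ENOMEM;
		goto release;
	}

	/* Step 2 (not shown): allocate a net_device and register it
	 * with the common net device layer via register_netdev(). */
	return 0;

release:
	pci_release_regions(pdev);
disable:
	pci_disable_device(pdev);
	return err;
}
```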
# Network drivers
1. Network drivers are not enumerated as device files
(no VFS registration, no major/minor numbers).
2. Net drivers register as an instance of net_device with the common net
device layer (CNDL).