
The Linux SG_IO ioctl in the 2.6 series

2013-07-09 09:30
http://gmd20.blog.163.com/blog/static/1684392320100227396270/

Original article: http://sg.danny.cz/sg/sg_io.html



Introduction
SCSI and related command sets
SG_IO ioctl overview
SG_IO ioctl in the sg driver
SG_IO ioctl differences
open() considerations
SCSI command permissions
CAP_SYS_RAWIO from a user process
SG_IO and the st driver
Maximum transfer size per command
Conclusion


Introduction

The SG_IO ioctl permits user applications to send SCSI commands to a device. In the Linux 2.4 series this ioctl was only available via the SCSI generic (sg) driver. In the Linux 2.6 series the SG_IO ioctl is additionally available for block devices and SCSI tape (st) devices. So there are multiple implementations of this ioctl within the kernel with slightly different characteristics, and describing these is the purpose of this document.

The information in this page is valid for Linux kernel 2.6.16.


SCSI and related command sets

All SCSI devices should respond to an INQUIRY command, and part of their response is the so-called peripheral device type. This is used by the Linux kernel to decide which upper level driver controls the device. There are also devices that belong to other (i.e. not considered SCSI) transports that use SCSI command sets; the primary examples are (S-)ATAPI CD and DVD drives. Not all peripheral device types map to upper level drivers, and devices of these types are usually accessed via the SCSI generic (sg) driver.

SCSI (draft) standards are found at www.t10.org. SCSI commands common to all SCSI devices are found in SPC-4, while those specific to block devices are found in SBC-2, those for CD/DVD drives are found in MMC-5 and those for SCSI tape drives are found in SSC-3.

The major non-SCSI command set in the storage area is for ATA non-packet devices, which are typically disks. ATA packet devices use ATAPI, which in the vast majority of cases carries a SCSI command set. The most recent draft ATA command set standard is ATA8-ACS and can be found at www.t13.org. To complicate things, (non-packet) ATA devices may have their native command set translated into SCSI. This can happen in the kernel (e.g. libata in Linux) or in an intermediate device (e.g. in a USB external disk enclosure). Yet another possibility is disks whose firmware can be changed to allow them to use either the SCSI or ATA command set; this may happen in the SAS/SATA area since the physical (cabling) and phy (electrical signalling) levels are so similar.


SG_IO ioctl overview

The third argument given to the SG_IO ioctl is a pointer to an instance of the sg_io_hdr structure, which is defined in the <scsi/sg.h> header file. The execution of the SG_IO ioctl can be viewed as going through three phases:

1. do sanity checks on the metadata in the sg_io_hdr instance; read the input fields and the data pointed to by some of those fields; build a SCSI command and issue it to the device
2. wait for either a response from the device, the command to time out or the user to terminate the process (or thread) that invoked the SG_IO ioctl
3. write the output fields and in some cases write data to locations pointed to by some fields, then return

Only phase 1 returns an ioctl error (i.e. a return value of -1 and a value set in errno). In phase 2, command timeouts should be used sparingly as the device (and some others on the same interconnect) may end up being reset. If the user terminates the process or thread that invoked the SG_IO ioctl then obviously phase 3 never occurs, but the command execution runs to completion (or timeout) and the kernel "throws away" the results. If the command yields a SCSI status of CHECK CONDITION (in field "status") then sense data is written out in phase 3.

Now we will assume that the SCSI command involves user data being transferred to or from the device. The SCSI subsystem does not support true bidirectional data transfers to a device. All data DMA transfers (assuming the hardware supports DMA) occur in phase 2. However, if indirect IO is being used (i.e. neither direct IO nor mmap-ed transfers) then either:

data is read from the user space in phase 1 into kernel buffers and DMA-ed to the device in phase 2, or
data is read from the device into kernel buffers in phase 2 and written into the user space in phase 3

When direct IO or mmap-ed transfers are being used then all user data is moved in phase 2. If a process is terminated during such a data transfer then the kernel gracefully handles this (by pinning the associated memory pages until the transfer is complete).
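As a rough illustration of the direct IO case, the sketch below asks for direct IO through the flags field and then inspects the info field to see which path was actually taken. It assumes the SG_FLAG_DIRECT_IO and SG_INFO_DIRECT_IO_MASK definitions from <scsi/sg.h>; the helper names are made up for this example, and the caller is still expected to fill in cmdp, cmd_len, sbp and timeout before invoking the ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <scsi/sg.h>

/* Hypothetical helper: prepare a read of 'len' bytes into 'buf' and ask
 * the sg driver to use direct IO (the block layer ignores 'flags'). */
static void request_direct_read(struct sg_io_hdr *hdr, void *buf,
                                unsigned int len)
{
    assert(hdr != NULL);
    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->dxferp = buf;
    hdr->dxfer_len = len;
    hdr->flags = SG_FLAG_DIRECT_IO;   /* a request, not a guarantee */
}

/* Hypothetical helper: after the ioctl has returned, report whether the
 * transfer really went directly to user memory (1) or not (0). */
static int used_direct_io(const struct sg_io_hdr *hdr)
{
    assert(hdr != NULL);
    return (hdr->info & SG_INFO_DIRECT_IO_MASK) == SG_INFO_DIRECT_IO;
}
```

Because the flag is only a request, code written this way should work whether or not direct IO was granted and use used_direct_io() purely for diagnostics.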

The sg_io_hdr structure has 22 fields (members) but typically only a small number of them need to be set. The following code fragment shows the setup for a simple TEST UNIT READY SCSI command which has no associated data transfers:

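    /* needs <stdio.h>, <string.h>, <sys/ioctl.h> and <scsi/sg.h> */
    #define TUR_CMD 0x00        /* TEST UNIT READY opcode */
    #define DEF_TIMEOUT 30000   /* assumed default: 30 seconds, in milliseconds */

    unsigned char sense_b[32];
    unsigned char turCmbBlk[] = {TUR_CMD, 0, 0, 0, 0, 0};
    struct sg_io_hdr io_hdr;

    memset(&io_hdr, 0, sizeof(struct sg_io_hdr));
    io_hdr.interface_id = 'S';
    io_hdr.cmd_len = sizeof(turCmbBlk);
    io_hdr.mx_sb_len = sizeof(sense_b);
    io_hdr.dxfer_direction = SG_DXFER_NONE;
    io_hdr.cmdp = turCmbBlk;
    io_hdr.sbp = sense_b;
    io_hdr.timeout = DEF_TIMEOUT;

    if (ioctl(fd, SG_IO, &io_hdr) < 0) {
        /* phase 1 failure: errno is set */
        perror("SG_IO ioctl");
    }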

The memset() call is pretty important, setting unused input fields to safe values. Setting the timeout field to zero is not a good idea; 30,000 (for 30 seconds) is a reasonable default for most SCSI commands. As always, good error processing consumes a lot more code. This is especially the case with SCSI commands that yield "sense data" when something goes wrong. For example, if there is a medium error during a disk read, the sense data will contain the logical block address (lba) of the failure. Another error processing example is a SCSI command that the device considers an "illegal request": the sense data may show the byte and bit position of the field in the command block (usually referred to as a "cdb") that it objects to. For examples of error processing please refer to the sg3_utils package, its "examples" directory and its library components: sg_lib.c (SCSI error processing and tables) and sg_cmds.c (common SCSI commands).
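To give a feel for what the first layer of such error processing might look like, here is a minimal sketch that sorts a completed sg_io_hdr into a few coarse outcomes and pulls the sense key out of the sense buffer. It is not taken from sg3_utils: the enum, the function names and the EX_* constants are local to this example (0x02 is the CHECK CONDITION status byte and 0x08 the DRIVER_SENSE bit), and real code would go on to decode the additional sense code, the lba and the field pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <scsi/sg.h>

/* Local definitions for this sketch (values assumed, see text above). */
#define EX_STATUS_CHECK_CONDITION 0x02  /* SCSI status: CHECK CONDITION */
#define EX_DRIVER_SENSE           0x08  /* driver_status: sense available */

enum sg_outcome {
    OUTCOME_OK,         /* no error flagged in 'info' */
    OUTCOME_SENSE,      /* device returned sense data, decode it */
    OUTCOME_TRANSPORT,  /* host_status (DID_*) error */
    OUTCOME_DRIVER,     /* driver_status (DRIVER_*) error */
    OUTCOME_STATUS      /* non-GOOD SCSI status without sense data */
};

static enum sg_outcome classify_sg_result(const struct sg_io_hdr *hdr)
{
    assert(hdr != NULL);
    if ((hdr->info & SG_INFO_OK_MASK) == SG_INFO_OK)
        return OUTCOME_OK;
    if (hdr->sb_len_wr > 0 &&
        (hdr->status == EX_STATUS_CHECK_CONDITION ||
         (hdr->driver_status & EX_DRIVER_SENSE) != 0))
        return OUTCOME_SENSE;
    if (hdr->host_status != 0)
        return OUTCOME_TRANSPORT;
    if (hdr->driver_status != 0)
        return OUTCOME_DRIVER;
    return OUTCOME_STATUS;
}

/* Return the sense key (e.g. 0x3 MEDIUM ERROR, 0x5 ILLEGAL REQUEST),
 * or -1 if the buffer is too short or in an unknown format. */
static int sense_key(const unsigned char *sb, int len)
{
    if (sb == NULL || len < 1)
        return -1;
    switch (sb[0] & 0x7f) {
    case 0x70: case 0x71:           /* fixed format sense data */
        return (len >= 3) ? (sb[2] & 0x0f) : -1;
    case 0x72: case 0x73:           /* descriptor format sense data */
        return (len >= 2) ? (sb[1] & 0x0f) : -1;
    default:
        return -1;
    }
}
```

A caller would typically run classify_sg_result() only after the ioctl itself has returned 0, and pass sbp and sb_len_wr to sense_key() when the outcome is OUTCOME_SENSE.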

Below is a grouping of important sg_io_hdr structure fields with brief summaries:

Command block (historically referred to as the "cdb"):

cmdp - pointer to cdb (the SCSI command block)
cmd_len - length (in bytes) of cdb

Data transfer:

dxferp - pointer to user data to start reading from or start writing to
dxfer_len - number of bytes to transfer
dxfer_direction - whether to read from device (into user memory) or write to device (from user memory) or transfer no data: SG_DXFER_FROM_DEV, SG_DXFER_TO_DEV or SG_DXFER_NONE respectively
resid - requested number of bytes to transfer (i.e. dxfer_len) less the actual number transferred

Error indication:

status - SCSI status returned from the device
host_status - error from Host Bus Adapter including initiator (port)
driver_status - driver (mid level or low level driver) error and suggestion mask

Sense data (only used when 'status' is CHECK CONDITION or (driver_status & DRIVER_SENSE) is true):

sbp - pointer to start writing sense data to
mx_sb_len - maximum number of bytes to write to sbp
sb_len_wr - actual number of bytes written to sbp

The fields in the sg_io_hdr structure are defined in more detail in the SCSI-Generic-HOWTO document.
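The field groups above can be seen together in a command that does transfer data. The following sketch fills in an sg_io_hdr for a standard 6-byte INQUIRY (opcode 0x12) that reads its response into a caller-supplied buffer, and extracts the peripheral device type from the first response byte. The function names, the EX_* constants and the 30 second timeout are choices made for this example rather than anything mandated by the interface.

```c
#include <assert.h>
#include <string.h>
#include <scsi/sg.h>

#define EX_INQUIRY_OPCODE  0x12
#define EX_INQUIRY_CDB_LEN 6
#define EX_TIMEOUT_MS      30000   /* 30 seconds, as suggested in the text */

/* Hypothetical helper: set up 'hdr' and 'cdb' for a standard INQUIRY that
 * returns up to 'resp_len' bytes in 'resp'; sense data goes to 'sense'. */
static void build_inquiry(struct sg_io_hdr *hdr,
                          unsigned char cdb[EX_INQUIRY_CDB_LEN],
                          unsigned char *resp, unsigned short resp_len,
                          unsigned char *sense, unsigned char sense_len)
{
    assert(hdr != NULL && cdb != NULL && resp != NULL && sense != NULL);

    /* Command block */
    memset(cdb, 0, EX_INQUIRY_CDB_LEN);
    cdb[0] = EX_INQUIRY_OPCODE;
    cdb[3] = (unsigned char)(resp_len >> 8);    /* allocation length, MSB */
    cdb[4] = (unsigned char)(resp_len & 0xff);  /* allocation length, LSB */

    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id = 'S';
    hdr->cmdp = cdb;
    hdr->cmd_len = EX_INQUIRY_CDB_LEN;

    /* Data transfer: device to user memory */
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->dxferp = resp;
    hdr->dxfer_len = resp_len;

    /* Sense data */
    hdr->sbp = sense;
    hdr->mx_sb_len = sense_len;

    hdr->timeout = EX_TIMEOUT_MS;
}

/* The peripheral device type is the low 5 bits of response byte 0. */
static int peripheral_device_type(const unsigned char *resp)
{
    assert(resp != NULL);
    return resp[0] & 0x1f;
}
```

After ioctl(fd, SG_IO, &hdr) succeeds, the number of valid response bytes is dxfer_len less resid, so peripheral_device_type() should only be consulted when at least one byte arrived.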


SG_IO ioctl in the sg driver

Linux kernel 2.4.0 was the first production kernel in which the SG_IO ioctl appeared in the SCSI generic (sg) driver. The sg driver itself has been in Linux since around 1993. An instance of the sg_io_hdr structure in the sg driver can either be:

pointed to by the third argument of the SG_IO ioctl
pointed to by the second argument of UNIX write() or read() system calls which have a file descriptor of a sg device node as their first argument

The SCSI-Generic-HOWTO document describes the sg driver in the lk 2.4 series including its use of the SG_IO ioctl. Prior to the lk 2.4 series the sg driver only had the sg_header structure. It was used as an asynchronous command interface in which command, metadata and optionally user data were sent via a Unix write() system call. The corresponding response, which included error information (e.g. sense data) or optionally user data, was received via a Unix read() system call. Two major additions were made to the sg driver at the beginning of the lk 2.4 series:

a new metadata structure (sg_io_hdr) as an alternative to the original mixed metadata and data structure (sg_header)
the SG_IO ioctl that used the new metadata structure and was synchronous: it sent a SCSI command and waited for its reply

The sg_io_hdr only contains metadata in the sense that it contains pointers to locations of where data will come from (command or data in) or go to (sense data or data out). These pointers have caused problems in mixed 32/64 bit environments, especially when the user application (e.g. cdrecord) is built for 32 bits and the kernel is 64 bits. The lk 2.6 series has a compatibility layer to cope with this via code specialized for the SG_IO ioctl. Unfortunately this problem was not foreseen when the sg_io_hdr structure was designed.
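The layout problem can be made concrete with a small sketch. The structure below mirrors the 22 fields of sg_io_hdr as a 32-bit process would lay them out, with each of the four pointers stored as a 4-byte integer; a compatibility layer has to widen those fields one by one because a 64-bit kernel's own structure has 8-byte pointers at different offsets. The structure and function names here are invented for the illustration and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit view of sg_io_hdr: pointers are 4-byte values. */
struct sg_io_hdr_32bit_view {
    int32_t  interface_id;
    int32_t  dxfer_direction;
    uint8_t  cmd_len;
    uint8_t  mx_sb_len;
    uint16_t iovec_count;
    uint32_t dxfer_len;
    uint32_t dxferp;        /* user pointer */
    uint32_t cmdp;          /* user pointer */
    uint32_t sbp;           /* user pointer */
    uint32_t timeout;
    uint32_t flags;
    int32_t  pack_id;
    uint32_t usr_ptr;       /* user pointer */
    uint8_t  status;
    uint8_t  masked_status;
    uint8_t  msg_status;
    uint8_t  sb_len_wr;
    uint16_t host_status;
    uint16_t driver_status;
    int32_t  resid;
    uint32_t duration;
    uint32_t info;
};

/* Every field is naturally aligned, so the 32-bit layout has no padding. */
_Static_assert(sizeof(struct sg_io_hdr_32bit_view) == 64,
               "32-bit sg_io_hdr layout should be 64 bytes");

/* Widen one 32-bit user pointer value to a native pointer. */
static void *widen_user_ptr(uint32_t p)
{
    return (void *)(uintptr_t)p;
}
```

On a 64-bit build the native structure is larger than 64 bytes, which is why the two layouts cannot simply be copied over each other.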

A significant feature of the SG_IO ioctl in the sg driver is that it is user interruptible. This means that between issuing a command (e.g. a long duration command like a disk format) and its response arriving, a user could hit control-C on the associated application. The kernel would remain stable and resources would be cleared up at the appropriate time. The sg driver does not attempt to abort such a command that is "in flight"; it simply throws away the response and cleans up. Naturally the user has no direct way of finding out whether an interrupted command succeeded or not, but there may be indirect ways.

A warning may also be in order here: a long duration command such as format would typically be given a long timeout value. If the user interrupted the application that sent the format command then the device may remain busy doing the format (especially if the IMMED bit is not set). So if the user then sent a short duration command such as TEST UNIT READY or REQUEST SENSE to see what the device was doing, these commands may time out. This would invoke the SCSI subsystem error handler, which would most likely send a device reset, thus aborting the format, to get the device's attention. This is probably not what the user had in mind!


SG_IO ioctl differences

In the following table, sg_io_hdr structure fields are listed in the order they appear in that structure. Basically the "in" fields appear at the top of the structure and are read in phase 1. The latter fields are termed "out" and are written by the SG_IO implementation in phase 3.

Table 1. sg_io_hdr structure summary and implementation differences

| sg_io_hdr field | in or out | type | different | brief description including differences between implementations |
|---|---|---|---|---|
| interface_id | in | int | | guard field. Current implementations only accept " (int)'S' ". If not set, the sg driver sets errno to ENOSYS while the block layer sets it to EINVAL |
| dxfer_direction | in | (-ve) int | minor | direction of data transfer. SG_DXFER_NONE and friends are defined as negative integers so the sg driver can discriminate between sg_io_hdr instances and those of sg_header. This nuance is irrelevant to non-sg driver usage of SG_IO. See below. |
| cmd_len | in | unsigned char | | limits command length to 255 bytes. No SCSI commands (even variable length ones in OSD) are this long (yet) |
| mx_sb_len | in | unsigned char | | maximum number of bytes of sense data that the driver can output via the sbp pointer |
| iovec_count | in | unsigned short | yes | if not sg driver and greater than zero then the SG_IO ioctl fails with errno set to EOPNOTSUPP; sg driver treats dxferp as a pointer to an array of struct sg_iovec when this field is greater than zero |
| dxfer_len | in | unsigned int | minor | number of bytes of data to transfer to or from the device. Upper limit for block devices related to /sys/block/<device>/queue/max_sectors_kb |
| dxferp | in [*in or *out] | void * | minor | pointer to (user space) data to transfer to (if reading from device) or transfer from (if writing to device). Further level of indirection in the sg driver when iovec_count is greater than 0. |
| cmdp | in [*in] | unsigned char * | | pointer to SCSI command. The SG_IO ioctl in the sg driver fails with errno set to EMSGSIZE if cmdp is NULL and EFAULT if it is invalid; the block layer sets errno to EFAULT in both cases. |
| sbp | in [*out] | unsigned char * | | pointer to user data area where no more than mx_sb_len bytes of sense data from the device will be written if the SCSI status is CHECK CONDITION. |
| timeout | in | unsigned int | yes (if = 0) | time in milliseconds that the SCSI mid-level will wait for a response. If that timer expires before the command finishes, then the command may be aborted, and the device (and maybe others on the same interconnect) may be reset depending on error handler settings. Dangerous stuff; the SG_IO ioctl has no control (through this interface) of exactly what happens. In the sg driver a timeout value of 0 means 0 milliseconds, in the block layer (currently) it means 60 seconds. |
| flags | in | unsigned int | yes | Block layer SG_IO ioctl ignores this field; the sg driver uses it to request special services like direct IO or mmap-ed transfers. It is a bit mask. |
| pack_id | in -> out | int | | unused (for user space program tag) |
| usr_ptr | in -> out | void * | | unused (for user space pointer tag) |
| status | out | unsigned char | | SCSI command status, zero implies GOOD |
| masked_status | out | unsigned char | | Logically: masked_status == ((status & 0x3e) >> 1). Old Linux SCSI subsystem usage, deprecated. |
| msg_status | out | unsigned char | | SCSI parallel interface (SPI) message status (very old, deprecated) |
| sb_len_wr | out | unsigned char | | actual length of sense data (in bytes) output via sbp pointer. |
| host_status | out | unsigned short | | error reported by the initiator (port). These are the "DID_*" error codes in scsi.h |
| driver_status | out | unsigned short | | bit mask: error and suggestion reported by the low level driver (LLD). These are the "DRIVER_*" error codes in scsi.h |
| resid | out | int | | (dxfer_len - number_of_bytes_actually_transferred). Typically only set when there is a shortened DMA transfer from the device. Not necessarily an error. Older LLDs always yield zero. |
| duration | out | unsigned int | | number of milliseconds that elapsed between when the command was injected into the SCSI mid level and the corresponding "done" callback was invoked. Roughly the duration of the SCSI command in milliseconds. |
| info | out | unsigned int | minor | bit mask indicating what was done (or not) and whether any error was detected. Block layer SG_IO ioctl only sets SG_INFO_CHECK if an error was detected |

The DID_* and DRIVER_* error and suggestion codes (associated with host_status and driver_status) are discussed in more detail in the SCSI-Generic-HOWTO document.
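A few of the relationships in Table 1 are simple enough to write down as code. The sketch below derives masked_status from status, computes the number of bytes actually transferred from dxfer_len and resid, and substitutes a default for a zero timeout so that the same value behaves identically in the sg driver and the block layer. The helper names and the 30 second default are assumptions made for this example.

```c
#include <assert.h>

/* masked_status as described in Table 1: ((status & 0x3e) >> 1). */
static unsigned char masked_status_from(unsigned char status)
{
    return (unsigned char)((status & 0x3e) >> 1);
}

/* Bytes actually moved: dxfer_len less resid.
 * Returns -1 if resid is negative or larger than dxfer_len. */
static long long bytes_transferred(unsigned int dxfer_len, int resid)
{
    if (resid < 0 || (unsigned int)resid > dxfer_len)
        return -1;
    return (long long)dxfer_len - resid;
}

/* A zero timeout means "0 ms" to the sg driver but "60 s" to the block
 * layer, so this hypothetical helper never passes zero through. */
static unsigned int safe_timeout_ms(unsigned int requested_ms)
{
    const unsigned int default_ms = 30000;   /* assumed default */
    assert(default_ms != 0);
    return requested_ms != 0 ? requested_ms : default_ms;
}
```

For instance, a CHECK CONDITION status byte of 0x02 corresponds to a masked_status of 0x01, and a 512 byte request that comes back with a resid of 112 moved 400 bytes.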


open() considerations

Various drivers have different characteristics when a device node is opened. One problem with the ioctl system call is that a user only needs read permissions to execute it but may, with ioctls like SG_IO, write to a device (e.g. format it). Command (operation code) sniffing logic is used to overcome this security problem. Also users of the SG_IO ioctl need to be aware when they "share" a device with the sd, st or a cdrom driver that state machines within those drivers may be tricked. This may be unavoidable but the users of the SG_IO ioctl should take appropriate care.

Opening a file in Linux with flags of zero implies the O_RDONLY flag and hence read only access. All open() system calls can yield ENOENT (no such file or directory), ENODEV (no such device) if the file exists but there is no attached device, and EACCES (permission denied) if the user doesn't have appropriate permissions.

A user with CAP_SYS_RAWIO capability (normally associated with the "root" user) bypasses all command sniffing and other access controls that would otherwise lead to EACCES or EPERM errors. With the sg driver such a user may still need to open() a device node with O_RDWR (rather than O_RDONLY) to use all SCSI commands.

Table 2. open() flags for SG_IO ioctl usage

| open() flags | sg notes | sd notes | st notes | cdrom notes | Comments |
|---|---|---|---|---|---|
| <none> or O_RDONLY | 1, 2 | 3, 4 | 3, 5 | 3, 6 | best to add O_NONBLOCK. For a device with removable media (e.g. tape drive) that depends on whether the drive or its media is being accessed. |
| O_RDONLY \| O_NONBLOCK | 1, 7 | 3 | 3, 13 | 3 | recommended when SCSI commands are recognized as reading information from the device |
| O_RDWR | 2 | 4, 8, 9 | 5, 8, 9 | 6, 8, 9 | again, could be better to add O_NONBLOCK |
| O_RDWR \| O_NONBLOCK | 7 | 8, 9 | 8, 9, 13 | 8, 9 | recommended when arbitrary (including vendor specific) SCSI commands are to be sent |
| << interaction with O_EXCL >> | 10 | 11 | 12 | 11 | only use when sure that no other application may want to access the device (or partition). A surprising number of applications do "poke around" devices. |
| << interaction with O_DIRECT >> | - | --> | - | --> | requires sector alignment on data transfers (ignored by sg and st) |

Notes:

1. on subsequent SG_IO ioctl calls, the sg driver will only allow SCSI commands in its allow_ops array; others result in EPERM (operation not permitted) in errno. See below.
2. if previous open() of this sg device node still holds O_EXCL then this open() waits until it clears.
3. on subsequent SG_IO ioctl calls, the block layer will only allow SCSI commands listed as "safe_for_read" in the verify_command() function in the drivers/block/scsi_ioctl.c file; others result in EPERM (operation not permitted) in errno. See below.
4. if removable media and it is not present then yields ENOMEDIUM (no medium found)
5. if a tape is not present in drive then yields EIO (input/output error); if tape is "in use" then yields EBUSY (resource busy). Only one open file descriptor is allowed per st device node at a time (although dup() can be used).
6. if tray closed and media is not present then yields ENOMEDIUM (no medium found); if tray open then tries to close it and if no media present then yields ENOMEDIUM
7. if previous open() of this sg device node still holds O_EXCL then yields EBUSY (resource busy).
8. on subsequent SG_IO ioctl calls, the block layer will allow SCSI commands listed as either "safe_for_read" or "safe_for_write". For other SCSI commands the user requires the CAP_SYS_RAWIO capability (usually associated with the "root" user); if not, yields EPERM (operation not permitted). The first instance of other SCSI commands since boot sends an annoying "scsi: unknown opcode" message to the log.
9. if the media or drive is marked as not writable then yields EROFS (read-only file system).
10. if sg device node already has exclusive lock then a subsequent attempt to open(O_EXCL) will wait unless O_NONBLOCK is given, in which case it yields EBUSY (resource busy)
11. implemented at block device level (which knows about partitions within devices). If a previous open(O_EXCL) is active then a subsequent open(O_EXCL) yields EBUSY (resource busy). Mounted file systems typically open a device/partition with O_EXCL; as long as an application using the SG_IO ioctl does not also try to use the O_EXCL flag then it will be allowed access to the device.
12. the st driver does not support (i.e. ignores) the O_EXCL flag. However the fact that it only permits one active open() per tape device is similar functionality.
13. if tape is "in use" then yields EBUSY (resource busy). Only one open file descriptor is allowed per st device node at a time.

The O_EXCL flag has a different effect in the sg driver and the block layer. In the sg driver, once O_EXCL is held on a device, all subsequent open() attempts will either wait or yield EBUSY (irrespective of whether they attempt to use the O_EXCL flag). Once a partition/device is opened successfully in the block layer (with the sd or cdrom driver) only subsequent open() attempts that also use the O_EXCL flag are rejected (with EBUSY). An O_EXCL lock held on a device in the block layer has no effect on accessing the same device via the sg driver (and vice versa).
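The recommendations in Table 2 and the errno values in its notes can be gathered into a small open() wrapper. The sketch below always adds O_NONBLOCK, chooses O_RDWR only when the caller intends to send commands that need it, and maps the errno values mentioned in this section to short hints. The function names and the hint strings are made up for this example; the errno meanings are the ones given in the text above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Flags recommended in Table 2: add O_NONBLOCK in both cases. */
static int sg_open_flags(int need_write)
{
    return (need_write ? O_RDWR : O_RDONLY) | O_NONBLOCK;
}

/* Short explanation for the open() failures discussed in this section. */
static const char *open_errno_hint(int err)
{
    switch (err) {
    case ENOENT:    return "no such device node";
    case ENODEV:    return "node exists but no device is attached";
    case EACCES:    return "insufficient permissions";
    case ENOMEDIUM: return "removable medium not present";
    case EBUSY:     return "device in use or held with O_EXCL";
    case EROFS:     return "medium or drive is not writable";
    default:        return strerror(err);
    }
}

/* Hypothetical wrapper: returns a file descriptor, or -1 with errno set. */
static int open_for_sg_io(const char *path, int need_write)
{
    assert(path != NULL);
    int fd = open(path, sg_open_flags(need_write));
    if (fd < 0) {
        int saved = errno;              /* fprintf() may change errno */
        fprintf(stderr, "%s: %s\n", path, open_errno_hint(saved));
        errno = saved;
    }
    return fd;
}
```

O_EXCL is deliberately left out of the wrapper, following the advice to use it only when no other application may want the device.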

The first successful open on a sd or a cdrom device node that has removable media will send a PREVENT ALLOW MEDIUM REMOVAL (prevent) SCSI command to the device. If successful, this will inhibit a subsequent START STOP UNIT (eject) SCSI command and de-activate the eject button on the drive. In emergencies, the SG_IO ioctl can be used to defeat this action; an example of this is the sdparm utility, specifically "sdparm --command=unlock".
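For reference, the command block involved is tiny. The sketch below builds a 6-byte PREVENT ALLOW MEDIUM REMOVAL cdb (opcode 0x1e) with the PREVENT field in byte 4 either set or cleared; it is only meant to show the shape of the command, not how sdparm itself is implemented, and the function name is invented for this example.

```c
#include <assert.h>
#include <string.h>

#define EX_PREVENT_ALLOW_OPCODE 0x1e
#define EX_PREVENT_ALLOW_LEN    6

/* prevent != 0: lock the medium in the drive; prevent == 0: allow removal. */
static void build_prevent_allow(unsigned char cdb[EX_PREVENT_ALLOW_LEN],
                                int prevent)
{
    assert(cdb != NULL);
    memset(cdb, 0, EX_PREVENT_ALLOW_LEN);
    cdb[0] = EX_PREVENT_ALLOW_OPCODE;
    cdb[4] = prevent ? 0x01 : 0x00;   /* PREVENT field, low bits of byte 4 */
}
```

The cdb would be sent with dxfer_direction set to SG_DXFER_NONE, in the same way as the TEST UNIT READY fragment earlier in this document.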

The open() flag O_NDELAY has the same value and meaning as O_NONBLOCK. Other flags such as O_DIRECT, O_TRUNC and O_APPEND have no effect on the SG_IO ioctl.


SCSI command permissions

In Linux a user only needs read permissions on a file descriptor to execute an ioctl() system call. In the case of the SG_IO ioctl, a SCSI command could be sent that obviously changes the state of a device (e.g. WRITE to a disk). So both implementations of the SG_IO ioctl require more than read permissions for some commands, especially those that are known to change the state of a device or those that have some unknown action (e.g. vendor specific commands).

Here is a table of SCSI commands showing the minimum each implementation requires: O_RDONLY entries don't need the user to have write permissions, O_RDWR entries do, and some entries need the CAP_SYS_RAWIO capability (which usually equates to the "root" user):

Table 3. SCSI command minimum permission requirements

| SCSI command | (draft) standard | sg driver requires | block layer SG_IO requires (except st) | Comments |
|---|---|---|---|---|
| BLANK | MMC-4 | O_RDWR | O_RDWR | |
| CLOSE TRACK/SESSION | MMC-4 | O_RDWR | O_RDWR | |
| ERASE | MMC-4 | O_RDWR | O_RDWR | |
| FLUSH CACHE | SBC-3, MMC-4 | O_RDWR | O_RDWR | Really SYNCHRONIZE CACHE command |
| FORMAT UNIT | SBC-3, MMC-4 | O_RDWR | O_RDWR | default command timeout may not be long enough |
| GET CONFIGURATION | MMC-4 | O_RDWR | O_RDONLY | reads CD/DVD metadata |
| GET EVENT STATUS NOTIFICATION | MMC-4 | O_RDWR | O_RDONLY | |
| GET PERFORMANCE | MMC-4 | O_RDWR | O_RDONLY | |
| INQUIRY | SPC-4 | O_RDONLY | O_RDONLY | All SCSI devices should respond to this command |
| LOAD UNLOAD MEDIUM | MMC-4 | O_RDWR | O_RDWR | MEDIUM may be replaced by CD, DVD or nothing |
| LOG SELECT | SPC-4 | O_RDWR | O_RDWR | used to change logging or clear logged data |
| LOG SENSE | SPC-4 | O_RDONLY | O_RDONLY | used to fetch logged data |
| MAINTENANCE COMMAND IN | SPC-4 | O_RDONLY | CAP_SYS_RAWIO | various "REPORT ..." commands such as REPORT SUPPORTED OPERATION CODES in here |
| MODE SELECT (6+10) | SPC-4 | O_RDWR | O_RDWR | Used to change SCSI device metadata |
| MODE SENSE (6+10) | SPC-4 | O_RDONLY | O_RDONLY | Used to read SCSI device metadata |
| PAUSE RESUME | MMC-4 | O_RDWR | O_RDONLY | |
| PLAY AUDIO (10) | MMC-4 | O_RDWR | O_RDONLY | |
| PLAY AUDIO MSF | MMC-4 | O_RDWR | O_RDONLY | |
| PLAY AUDIO TI | ?? | O_RDWR | O_RDONLY | opcode 0x48, unassigned to any spec in SPC-4 |
| PLAY CD | MMC-2 | O_RDWR | O_RDONLY | old, now SPARE IN in SPC-4 |
| PREVENT ALLOW MEDIUM REMOVAL | SPC-4, MMC-4 | O_RDWR | O_RDWR | sd, st and cdrom drivers use this internally |
| READ (6+10+12+16) | SBC-3 | O_RDONLY | O_RDONLY | READ(16) requires O_RDWR with the sg driver before lk 2.6.11 |
| READ BUFFER | SPC-4 | O_RDONLY | O_RDONLY | |
| READ BUFFER CAPACITY | MMC-4 | O_RDWR | O_RDONLY | |
| READ CAPACITY(10) | SBC-3, MMC-4 | O_RDONLY | O_RDONLY | |
| READ CAPACITY(16) | SBC-3, MMC-4 | O_RDONLY | CAP_SYS_RAWIO | within SERVICE ACTION IN command. Needed for RAIDs larger than 2 TB |
| READ CD | MMC-4 | O_RDWR | O_RDONLY | |
| READ CD MSF | MMC-4 | O_RDWR | O_RDONLY | |
| READ CDVD CAPACITY | SBC-3, MMC-4 | O_RDONLY | O_RDONLY | Strange (old ?) name from cdrom.h. Actually is READ CAPACITY. |
| READ DEFECT (10) | SBC-3 | O_RDWR | O_RDONLY | |
| READ DISC INFO | MMC-4 | O_RDWR | O_RDONLY | |
| READ DVD STRUCTURE | MMC-4 | O_RDWR | O_RDONLY | |
| READ FORMAT CAPACITIES | MMC-4 | O_RDWR | O_RDONLY | |
| READ HEADER | MMC-2 | O_RDWR | O_RDONLY | |
| READ LONG (10) | SBC-3 | O_RDONLY | O_RDONLY | but not READ LONG (16) |
| READ SUB-CHANNEL | MMC-4 | O_RDWR | O_RDONLY | |
| READ TOC/PMA/ATIP | MMC-4 | O_RDWR | O_RDONLY | |
| READ TRACK (RZONE) INFO | MMC-4 | O_RDWR | O_RDONLY | In MMC-4 called READ TRACK INFO |
| RECEIVE DIAGNOSTIC | SPC-4 | O_RDONLY | CAP_SYS_RAWIO | the SES command set uses this command a lot. An SES device is only accessible via an sg device node |
| REPAIR (RZONE) TRACK | MMC-4 | O_RDWR | O_RDWR | |
| REPORT KEY | MMC-4 | O_RDWR | O_RDONLY | |
| REPORT LUNS | SPC-4 | O_RDONLY | CAP_SYS_RAWIO | mandatory since SPC-3 |
| REQUEST SENSE | SPC-4 | O_RDONLY | O_RDONLY | has uses other than those displaced by autosense |
| RESERVE (RZONE) TRACK | MMC-4 | O_RDWR | O_RDWR | |
| SCAN | MMC-4 | O_RDWR | O_RDONLY | |
| SEEK | MMC-4 | O_RDWR | O_RDONLY | |
| SEND CUE SHEET | MMC-4 | O_RDWR | O_RDWR | |
| SEND DVD STRUCTURE | MMC-4 | O_RDWR | O_RDWR | |
| [SEND EVENT] | MMC-2 | O_RDWR | | cdrom.h associates opcode 0xa2 but MMC-2 uses opcode 0x5d ?? |
| SEND KEY | MMC-4 | O_RDWR | O_RDWR | |
| SEND OPC INFORMATION | MMC-4 | O_RDWR | O_RDWR | |
| SERVICE ACTION IN | SPC-4, SBC-3 | O_RDONLY | CAP_SYS_RAWIO | READ CAPACITY (16) service action in here |
| SET CD SPEED | MMC-4 | O_RDWR | O_RDWR | cdrom.h calls this SET SPEED |
| SET STREAMING | MMC-4 | O_RDWR | O_RDWR | |
| START STOP UNIT | SBC-3, MMC-4 | O_RDWR | O_RDONLY | hmm |
| STOP PLAY/SCAN | MMC-4 | O_RDWR | O_RDONLY | |
| SYNCHRONIZE CACHE | SBC-3, MMC-4 | O_RDWR | O_RDWR | cdrom.h calls this FLUSH CACHE |
| TEST UNIT READY | SPC-4 | O_RDONLY | O_RDONLY | All SCSI devices should respond to this command |
| VERIFY (10+16) | SBC-3, MMC-4 | O_RDWR | O_RDONLY | |
| WRITE (6+10+12+16) | SBC-3 | O_RDWR | O_RDWR | |
| WRITE LONG (10+16) | SBC-3 | O_RDWR | O_RDWR | |
| WRITE VERIFY (10+16) | SBC-3, MMC-4 | O_RDWR | O_RDWR | only WRITE VERIFY(10) is in MMC-4 |

Any other SCSI command (opcode) not mentioned for the sg driver needs O_RDWR. Any other SCSI command (opcode) not mentioned for the block layer SG_IO ioctl needs a user with CAP_SYS_RAWIO capability. All "block" SG_IO ioctl calls on st device nodes need a user with CAP_SYS_RAWIO capability. If a user does not have sufficient permissions to execute a SCSI command via the SG_IO ioctl then the system call fails (i.e. no SCSI command is sent) and errno is set to EPERM (operation not permitted).

Both the sg driver and the block layer SG_IO code use internal tables to enforce the permissions shown in the above table (allow_ops and cmd_type [safe_for_read and safe_for_write] respectively). This technique doesn't scale well, since more advanced command sets (e.g. OSD) use service actions (and one opcode: 0x7f in the case of OSD). There may also be overlap in opcode usage between command sets, for example between SBC, MMC and SSC.
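The table-driven approach can be sketched in a few lines. The example below is not the kernel's code: it uses a 256-entry array indexed by opcode, populated with a small hand-picked subset of Table 3 for the block layer, and decides whether a command may be sent given how the device node was opened and whether the caller has CAP_SYS_RAWIO. All names are invented for this illustration. It also shows why the scheme is coarse: a single opcode with many service actions gets only one entry.

```c
#include <assert.h>
#include <errno.h>

enum access_needed { NEED_READ, NEED_WRITE, NEED_RAWIO };

/* Illustrative subset of Table 3 (block layer column), keyed by opcode. */
static enum access_needed block_layer_access(unsigned char opcode)
{
    /* 0 = not listed, 1 = safe for read, 2 = safe for write */
    static const unsigned char cmd_type[256] = {
        [0x00] = 1,  /* TEST UNIT READY */
        [0x03] = 1,  /* REQUEST SENSE */
        [0x12] = 1,  /* INQUIRY */
        [0x1a] = 1,  /* MODE SENSE(6) */
        [0x25] = 1,  /* READ CAPACITY(10) */
        [0x28] = 1,  /* READ(10) */
        [0x4d] = 1,  /* LOG SENSE */
        [0x04] = 2,  /* FORMAT UNIT */
        [0x15] = 2,  /* MODE SELECT(6) */
        [0x2a] = 2,  /* WRITE(10) */
        [0x35] = 2,  /* SYNCHRONIZE CACHE */
    };
    switch (cmd_type[opcode]) {
    case 1:  return NEED_READ;
    case 2:  return NEED_WRITE;
    default: return NEED_RAWIO;   /* anything unlisted needs CAP_SYS_RAWIO */
    }
}

/* Returns 0 if the command may be sent, -EPERM otherwise. */
static int may_send(unsigned char opcode, int opened_rdwr, int has_rawio)
{
    assert(opened_rdwr == 0 || opened_rdwr == 1);
    if (has_rawio)
        return 0;                 /* capability bypasses command sniffing */
    switch (block_layer_access(opcode)) {
    case NEED_READ:  return 0;
    case NEED_WRITE: return opened_rdwr ? 0 : -EPERM;
    default:         return -EPERM;
    }
}
```

With this table, REPORT LUNS (0xa0) and SERVICE ACTION IN (0x9e) fall into the unlisted case and are refused without CAP_SYS_RAWIO, matching the corresponding rows of Table 3.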


CAP_SYS_RAWIO from a user process

While root processes usually have CAP_SYS_RAWIO, processes running under a user's ID (i.e. non-root) typically don't. Hence non-root processes may not be able to use SG_IO to send SCSI commands that require CAP_SYS_RAWIO. This may occur even if the permission bits of the device node file allow read or write access: user processes will still receive EPERM when using SG_IO.

By default the capability to assign capabilities to other processes (CAP_SETPCAP) is limited to very few processes, such as certain kernel threads. Changing this default would require changing and recompiling the kernel.

Processes which are forked by a root process and call setuid later will lose the CAP_SYS_RAWIO capability the parent root process (and the child before the setuid) had. However, the child can preserve the capabilities of the root process in the permitted set and raise it after the call of setuid:

    /* needs <sys/prctl.h>, <unistd.h> and <sys/capability.h> (libcap) */
    /* ... in child after fork(), still running as root ... */
    prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
    setuid(...);
    cap_set_proc(cap_from_text("cap_sys_rawio+ep"));

This way a user process with a parent root process can 'get back' the required capabilities to directly send SCSI commands to a device via SG_IO.

The above technique may be of use to daemons that are started with root permissions (most are) and then change to another user after a fork(). It is not obvious to the author how utilities that use the SG_IO ioctl on device nodes that require CAP_SYS_RAWIO for some or all SCSI commands (e.g. nodes associated with the sd and st drivers) can use the above technique.
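A utility can at least find out in advance whether it holds the capability, and warn the user instead of failing later with EPERM. One way to do that without linking against libcap, sketched below, is to read the CapEff line of /proc/self/status and test bit 17, which is the value of CAP_SYS_RAWIO in <linux/capability.h>. The function names are invented for this example and the parsing assumes the usual "CapEff:" followed by a hexadecimal mask.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define EX_CAP_SYS_RAWIO 17   /* bit number of CAP_SYS_RAWIO */

/* Look for "CapEff:" in the text of a /proc/<pid>/status file.
 * Returns 1 if capability 'cap' is in the effective set, 0 if it is not,
 * and -1 if no CapEff line is present. */
static int status_text_has_cap(const char *status_text, int cap)
{
    assert(status_text != NULL && cap >= 0 && cap < 64);
    const char *line = strstr(status_text, "CapEff:");
    if (line == NULL)
        return -1;
    /* strtoull() skips the whitespace between the label and the mask */
    unsigned long long mask = strtoull(line + strlen("CapEff:"), NULL, 16);
    return (int)((mask >> cap) & 1ULL);
}

/* Check the calling process; -1 means the answer could not be determined. */
static int process_has_rawio(void)
{
    char buf[4096];
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return status_text_has_cap(buf, EX_CAP_SYS_RAWIO);
}
```

A result of 0 from process_has_rawio() suggests that commands marked CAP_SYS_RAWIO in Table 3 will be refused on sd, st and cdrom device nodes.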


SG_IO and the st driver

In order to implement its user space API, the st driver has to maintain information about where the read head is with respect to the structural elements of the tape (filemarks, beginning of tape, end of data). Because the streaming device SCSI commands don't have addresses, the st driver has to know what commands have been sent. When reading, the filemarks are noticed when a read fails and sense data is fetched. If SG_IO is mixed with tape commands, the st driver may lose information (it does not look at the SG_IO commands and results). Because of this, the st driver may not implement the semantics the user expects. If the user accepts this or knows when using SG_IO does not cause information loss, then using SG_IO is OK.

So mixing st driver read, write and ioctl commands with SCSI commands sent via SG_IO that change the state of the tape is not recommended. This applies whether the SG_IO SCSI commands are sent via st or sg device nodes.


Maximum transfer size per command

The largest amount of data that can be transferred by a single SCSI command is often a concern. Various SCSI command sets (e.g. SBC-3 for disk READs and WRITEs, SSC-3 for tape READs and WRITEs, and SPC-4 for READ+WRITE BUFFER) allow very large data transfer sizes but Linux is not so accommodating. The Host Bus Adapter (HBA) could have transfer size limits, as could the transport and finally the SCSI device itself. In the latter case SBC-3 defines a "Block Limits" Vital Product Data (VPD) page while SSC has the READ BLOCK LIMITS SCSI command. SBC-3's optional Block Limits VPD page contains both maximum and optimal counts. In the author's opinion that latter distinction is very important: the block subsystem should try to use optimal sizes while pass through users should only be constrained by maximum sizes. Also if a pass through user exceeds a maximum transfer size imposed by a SCSI device, then the device can report an error. There is an underlying assumption that the applications using a pass through interface know what they are doing, or at least know more than the various kernel subsystems. On the other hand, the kernel has the responsibility to allocate critical shared resources such as memory.

In the past, Linux used a single, "big-enough", block of memory for the source or destination of large data transfers. Then scatter-gather lists were added to break transfers up into smaller (often "page" size (4 KB on i386 architecture)) chunks which made memory management easier for the kernel. Now, in the lk 2.6 series, the single block of memory option is being phased out.

The Linux SCSI subsystem imposes a 128 element limit on scatter-gather lists via its SCSI_MAX_PHYS_SEGMENTS define. Given the way various memory pools are allocated by the Linux SCSI subsystem, SCSI_MAX_PHYS_SEGMENTS could be increased to 256. Associated with each type of HBA there is normally a low level driver (LLD). Each LLD can further limit the maximum number of elements with the scsi_host_template::sg_tablesize field. Prior to lk 2.6.16 the sg and st drivers used the .sg_tablesize field only; since lk 2.6.16 those drivers are also constrained by SCSI_MAX_PHYS_SEGMENTS. This leads to a potential halving of the maximum transfer size. Many LLDs set the .sg_tablesize field to SG_ALL (which is 255) but they may as well set that field to 256 unless the HBA hardware has a constraint.

User space memory may be allocated as the source and/or destination for DMA transfers from the HBA (i.e. direct IO). Even if the user space allocated a large amount of memory with a single malloc(), the HBA DMA element typically has a different view of memory. This view may well contain many "page" size discontinuous pieces. This has the effect of using up, or perhaps exhausting, scatter-gather elements.

The sg driver attempts to build scatter-gather lists with each element up to SG_SCATTER_SZ bytes large. This define is found in include/scsi/sg.h and has been set to 32 KB for some years. That is 8 times the page size (of 4 KB) on the i386 architecture. Some users who need really large transfers increase this define (and it is best to keep it a power of 2). However since lk 2.6.16 another limit comes into play: the MAX_SEGMENT_SIZE define which is set to 64 KB. MAX_SEGMENT_SIZE is a default and can be overridden by the LLD calling blk_queue_max_segment_size().

In lk 2.6.16 two further LLD parameters come into play even when the sg (and st) driver is used. These are scsi_host_template::max_sectors and scsi_host_template::use_clustering.

The .max_sectors setting in the LLD is the maximum number of 512 byte sectors allowed in a single SCSI command's scatter-gather lists (for data transfers). Yes, that is a strange limit when trying to send a SCSI WRITE BUFFER command to upload firmware. Sysfs makes the LLD's .max_sectors setting visible (converted to kilobytes) in /sys/block/sd<x>/queue/max_hw_sectors_kb. The maximum allowable value in an LLD's .max_sectors seems to be 65535 (0xffff in hexadecimal). This limits the maximum transfer size to (32*1024*1024 - 512) bytes, assuming other limitations have been overcome. [The 65535 sector limit is because Scsi_Host::max_sectors has type "unsigned short". Hopefully this type is expanded to "int" in the future (or removed).]

The .use_clustering field should be set to ENABLE_CLUSTERING. If not, the block subsystem rebuilds the scatter-gather list it gets from the sg driver with page size (e.g. 4 KB) elements. [Actually it does that anyway, but when ENABLE_CLUSTERING is set, it coalesces them again!]
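The interaction of these limits is easier to see as arithmetic. The sketch below is a simplified model, not kernel code: it assumes the transfer is bounded by the number of scatter-gather elements times the size of each element on one side, and by .max_sectors times 512 bytes on the other, and takes the smaller of the two. The function name is invented for this example, and the figures plugged in afterwards are the ones quoted in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified upper bound on a single command's data transfer, in bytes.
 *   sg_elements   - usable scatter-gather elements (e.g. 128)
 *   segment_bytes - size of each element (e.g. 4096 unclustered, 32768
 *                   for SG_SCATTER_SZ, at most 65536 by default)
 *   max_sectors   - LLD's .max_sectors, in 512 byte sectors */
static uint64_t max_transfer_bytes(unsigned int sg_elements,
                                   unsigned int segment_bytes,
                                   unsigned int max_sectors)
{
    assert(sg_elements > 0 && segment_bytes > 0 && max_sectors > 0);
    uint64_t by_segments = (uint64_t)sg_elements * segment_bytes;
    uint64_t by_sectors = (uint64_t)max_sectors * 512u;
    return by_segments < by_sectors ? by_segments : by_sectors;
}
```

With 128 elements of 32 KB and .max_sectors at 65535 the bound is 4 MB; if clustering is off and each element shrinks to a 4 KB page it drops to 512 KB. Only when the element limits are out of the way does the .max_sectors ceiling of 65535 * 512 = 33553920 bytes, i.e. (32*1024*1024 - 512), become the deciding factor.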


Conclusion

In some situations, sending commands via the SG_IO ioctl may interfere with a higher level driver's use of a device. Users of the SG_IO ioctl should be aware that they are using a powerful, but low level facility, and write code accordingly. An example of this would be a utility to perform self tests on a disk: "background" self tests should be preferred over "foreground" self tests if there is a chance the computer may be using a file system on that disk at the time. Even a short foreground self test may take up to two minutes, which is a long time to lock out a file system.


Last updated: 26th July 2008