Samba

Planet Samba

Here you will find the personal blogs of Samba developers (for those that keep them). More information about members can also be found on the Samba Team page.

March 09, 2017

Rusty

Quick Stats on zstandard (zstd) Performance

I was looking at using zstd for backup, and wanted to see the effect of different compression levels. I backed up my (built) bitcoin source, which is a decent representation of my home directory, but weighs in at only 2.3GB. zstd -1 compressed it 71.3%, zstd -22 compressed it 78.6%, and here’s a graph showing runtime (on my laptop) and the resulting size:

zstandard compression (bitcoin source code, object files and binaries) times and sizes

For this corpus, sweet spots are 3 (the default), 6 (2.5x slower, 7% smaller), 14 (10x slower, 13% smaller) and 20 (46x slower, 22% smaller). Spreadsheet with results here.
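A sweep like this is easy to script; here is a rough sketch of the approach (assumes zstd is installed, and CORPUS is a placeholder path for the archive being tested):

```shell
#!/bin/sh
# Sweep zstd compression levels over a corpus, printing elapsed time
# and space saved per level. CORPUS is a hypothetical example path.
CORPUS=${CORPUS:-home.tar}

# percent_saved ORIG_BYTES COMPRESSED_BYTES -> space saving in percent
percent_saved() {
    awk -v o="$1" -v c="$2" 'BEGIN { printf "%.1f", (1 - c / o) * 100 }'
}

sweep() {
    orig=$(wc -c < "$CORPUS")
    for level in 1 3 6 14 20 22; do
        start=$(date +%s)
        zstd -q -f "-$level" "$CORPUS" -o "$CORPUS.zst"
        end=$(date +%s)
        size=$(wc -c < "$CORPUS.zst")
        echo "level $level: $((end - start))s, $(percent_saved "$orig" "$size")% smaller"
    done
}
```

For example, percent_saved 1000 287 prints 71.3, matching the zstd -1 figure above.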

March 09, 2017 12:53 AM

January 06, 2017

David

Rapido: A Glorified Wrapper for Dracut and QEMU

Introduction


I've blogged a few times about how Dracut and QEMU can be combined to greatly improve Linux kernel dev/test turnaround.
  • My first post covered the basics of building the kernel, running dracut, and booting the resultant image with qemu-kvm.
  • A later post took a closer look at network configuration, and focused on bridging VMs with the hypervisor.
  • Finally, my third post looked at how this technique could be combined with Ceph, to provide a similarly efficient workflow for Ceph development.
    In bringing this series to a conclusion, I'd like to introduce the newly released Rapido project. Rapido combines all of the procedures and techniques described in the articles above into a handful of scripts, which can be used to test specific Linux kernel functionality, standalone or alongside other technologies such as Ceph.

     

     

    Usage - Standalone Linux VM


    The following procedure was tested on openSUSE Leap 42.2 and SLES 12SP2, but should work fine on many other Linux distributions.

     

    Step 1: Checkout and Build


    Checkout the Linux kernel and Rapido source repositories:
    ~/> cd ~
    ~/> git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    ~/> git clone https://github.com/ddiss/rapido.git

    Build the kernel (using a config provided with the Rapido source):
    ~/> cp rapido/kernel/vanilla_config linux/.config
    ~/> cd linux
    ~/linux/> make -j6
    ~/linux/> make modules
    ~/linux/> INSTALL_MOD_PATH=./mods make modules_install
    ~/linux/> sudo ln -s $PWD/mods/lib/modules/$(make kernelrelease) \
    /lib/modules/$(make kernelrelease)

    Step 2: Configuration 


    Install Rapido dependencies: dracut, qemu, brctl (bridge-utils) and tunctl.

    Edit rapido.conf, the master Rapido configuration file:
    ~/linux/> cd ~/rapido
    ~/rapido/> vi rapido.conf
    • set KERNEL_SRC="/home/<user>/linux"
    • set TAP_USER="<user>"
    • set MAC_ADDR1 to a valid MAC address, e.g. "b8:ac:24:45:c5:01"
    • set MAC_ADDR2 to a valid MAC address, e.g. "b8:ac:24:45:c5:02"
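    Put together, the edited rapido.conf lines might look like the following (a sketch; the username, path and MAC addresses are placeholders):

```shell
# rapido.conf excerpt (values are examples)
KERNEL_SRC="/home/user/linux"
TAP_USER="user"
MAC_ADDR1="b8:ac:24:45:c5:01"
MAC_ADDR2="b8:ac:24:45:c5:02"
```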

    Configure the bridge and tap network devices. This must be done as root:
    ~/rapido/> sudo tools/br_setup.sh
    ~/rapido/> ip addr show br0
    4: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    ...
    inet 192.168.155.1/24 scope global br0


    Step 3: Image Generation 


    Generate a minimal Linux VM image which includes binaries, libraries and kernel modules for filesystem testing:
    ~/rapido/> ./cut_fstests_local.sh
    ...
     dracut: *** Creating initramfs image file 'initrds/myinitrd' done ***
    ~/rapido/> ls -lah initrds/myinitrd
    -rw-r--r-- 1 ddiss users 30M Dec 13 18:17 initrds/myinitrd

    Step 4 - Boot!

     ~/rapido/> ./vm.sh
    + mount -t btrfs /dev/zram1 /mnt/scratch
    [ 3.542927] BTRFS info (device zram1): disk space caching is enabled
    ...
    btrfs filesystem mounted at /mnt/test and /mnt/scratch
    rapido1:/# 

    In a whopping four seconds, or thereabouts, the VM should have booted to a rapido1:/# bash prompt, leaving you with two zram-backed Btrfs filesystems mounted at /mnt/test and /mnt/scratch.

    Everything, including the VM's root filesystem, is in memory, so any changes will not persist across reboot. Use the rapido.conf QEMU_EXTRA_ARGS parameter if you wish to add persistent storage to a VM.
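    For instance, a raw disk image could be attached via rapido.conf (a sketch; the image path is hypothetical, and the -drive syntax is standard QEMU):

```shell
# rapido.conf excerpt: attach a persistent virtio disk to the VM
QEMU_EXTRA_ARGS="-drive file=/var/lib/rapido/persist.img,format=raw,if=virtio"
```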

    Although the network isn't used in this case, you should be able to reach the VM's network adapter from the hypervisor, and vice versa.
    rapido1:/# ip a show dev eth0
    ...
    inet 192.168.155.101/24 brd 192.168.155.255 scope global eth0
    ...
    rapido1:/# ping 192.168.155.1
    PING 192.168.155.1 (192.168.155.1) 56(84) bytes of data.
    64 bytes from 192.168.155.1: icmp_seq=1 ttl=64 time=1.97 ms

    Once you're done playing around, you can shutdown:
    rapido1:/# shutdown
    [ 267.304313] sysrq: SysRq : sysrq: Power Off
    rapido1:/# [ 268.168447] ACPI: Preparing to enter system sleep state S5
    [ 268.169493] reboot: Power down
    + exit 0

     

     

    Usage - Ceph vstart.sh cluster and CephFS client VM

    This usage guide builds on the previous standalone Linux VM procedure, but this time adds Ceph to the mix. If you're not interested in Ceph (how could you not be!) then feel free to skip to the next section.

     

    Step I - Checkout and Build


    We already have a clone of the Rapido and Linux kernel repositories. All that's needed for CephFS testing is a Ceph build:
    ~/> git clone https://github.com/ceph/ceph
    ~/> cd ceph
    <install Ceph build dependencies>
    ~/ceph/> ./do_cmake.sh -DWITH_MANPAGE=0 -DWITH_OPENLDAP=0 -DWITH_FUSE=0 -DWITH_NSS=0 -DWITH_LTTNG=0
    ~/ceph/> cd build
    ~/ceph/build/> make -j4

     

    Step II - Start a vstart.sh Ceph "cluster"


    Once Ceph has finished compiling, vstart.sh can be run with the following parameters to configure and locally start three OSDs, one monitor process, and one MDS.
    ~/ceph/build/> OSD=3 MON=1 RGW=0 MDS=1 ../src/vstart.sh -i 192.168.155.1 -n
    ...
    ~/ceph/build/> bin/ceph -c ceph.conf status
    ...
    health HEALTH_OK
    monmap e2: 1 mons at {a=192.168.155.1:40160/0}
    election epoch 4, quorum 0 a
    fsmap e5: 1/1/1 up {0=a=up:active}
    mgr no daemons active
    osdmap e10: 3 osds: 3 up, 3 in

     

    Step III - Rapido configuration


    Edit rapido.conf, the master Rapido configuration file:
    ~/ceph/build/> cd ~/rapido
    ~/rapido/> vi rapido.conf
    • set CEPH_SRC="/home/<user>/ceph/src"
    • KERNEL_SRC and network parameters were configured earlier

    Step IV - Image Generation


    The cut_cephfs.sh script generates a VM image with the Ceph configuration and keyring from the vstart.sh cluster, as well as the CephFS kernel module.
    ~/rapido/> ./cut_cephfs.sh
    ...
     dracut: *** Creating initramfs image file 'initrds/myinitrd' done ***

     

    Step V - Boot!


    Booting the newly generated image should bring you to a shell prompt, with the vstart.sh provisioned CephFS filesystem mounted under /mnt/cephfs:
    ~/rapido/> ./vm.sh
    ...
    + mount -t ceph 192.168.155.1:40160:/ /mnt/cephfs -o name=admin,secret=...
    [ 3.492742] libceph: mon0 192.168.155.1:40160 session established
    ...
    rapido1:/# df -h /mnt/cephfs
    Filesystem Size Used Avail Use% Mounted on
    192.168.155.1:40160:/ 1.3T 611G 699G 47% /mnt/cephfs
    CephFS is a clustered filesystem, so testing from multiple clients is also of interest. From another window, boot a second VM:
    ~/rapido/> ./vm.sh

     

     

    Further Use Cases


    Rapido ships with a bunch of scripts for testing different kernel components:
    • cut_cephfs.sh (shown above)
      • Image: includes Ceph config, credentials and CephFS kernel module
      • Boot: mounts CephFS filesystem
    • cut_cifs.sh
      • Image: includes CIFS (SMB client) kernel module
      • Boot: mounts share using details and credentials specified in rapido.conf
    • cut_dropbear.sh
      • Image: includes dropbear SSH server
      • Boot: starts an SSH server with SSH_AUTHORIZED_KEY
    • cut_fstests_cephfs.sh
      • Image: includes xfstests and CephFS kernel client
      • Boot: mounts CephFS filesystem and runs FSTESTS_AUTORUN_CMD
    • cut_fstests_local.sh (shown above)
      • Image: includes xfstests and local Btrfs and XFS dependencies
      • Boot: provisions local xfstests zram devices and runs FSTESTS_AUTORUN_CMD
    • cut_lio_local.sh
      • Image: includes LIO, loopback dev and dm-delay kernel modules
      • Boot: provisions an iSCSI target, with three LUs exposed
    • cut_lio_rbd.sh
      • Image: includes LIO and Ceph RBD kernel modules
      • Boot: provisions an iSCSI target backed by CEPH_RBD_IMAGE, using target_core_rbd
    • cut_qemu_rbd.sh
      • Image: CEPH_RBD_IMAGE is attached to the VM using qemu-block-rbd
      • Boot: runs shell only
    • cut_rbd.sh
      • Image: includes Ceph config, credentials and Ceph RBD kernel module
      • Boot: maps CEPH_RBD_IMAGE using the RBD kernel client
    • cut_tcmu_rbd_loop.sh
      • Image: includes Ceph config, librados, librbd, and pulls in tcmu-runner from TCMU_RUNNER_SRC
      • Boot: starts tcmu-runner and configures a tcmu+rbd backstore exposing CEPH_RBD_IMAGE via the LIO loopback fabric
    • cut_usb_rbd.sh (see https://github.com/ddiss/rbd-usb)
      • Image: usb_f_mass_storage, zram, dm-crypt, and RBD_USB_SRC
      • Boot: starts the conf-fs.sh script from RBD_USB_SRC

     

     

    Conclusion


      • Dracut and QEMU can be combined for super-fast Linux kernel testing and development.
      • Rapido is mostly just a glorified wrapper around these utilities, but does provide some useful tools for automated testing of specific Linux kernel functionality.

      If you run into any problems, or wish to provide any kind of feedback (always appreciated), please feel free to leave a message below, or raise a ticket in the Rapido issue tracker.

      Update 20170106:
      • Add cut_tcmu_rbd_loop.sh details and fix the example CEPH_SRC path.
        

      January 06, 2017 11:29 PM

      December 27, 2016

      David

      Adding Reviewed-by and Acked-by Tags with Git

      This week's "Git Rocks!" moment came while I was investigating how I could automatically add Reviewed-by, Acked-by, Tested-by, etc. tags to a given commit message.

      Git's interpret-trailers command is capable of testing for and manipulating arbitrary Key: Value tags in commit messages.

      For example, appending Reviewed-by: MY NAME <my@email.com> to the top commit message is as simple as running:

      > GIT_EDITOR='git interpret-trailers --trailer \
      "Reviewed-by: $(git config user.name) <$(git config user.email)>" \
      --in-place' git commit --amend 

      Or with the help of a "git rb" alias, via:
      > git config alias.rb "interpret-trailers --trailer \
      \"Reviewed-by: $(git config user.name) <$(git config user.email)>\" \
      --in-place"
      > GIT_EDITOR="git rb" git commit --amend

      The above examples work by replacing the normal git commit editor with a call to git interpret-trailers, which appends the desired tag to the commit message and then exits.
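      To see the trailer manipulation in isolation, git interpret-trailers can also be run against a message file directly, without any editor wiring (a sketch; the reviewer name and email are example values):

```shell
# add_rb MSG_FILE: append a Reviewed-by trailer to a commit message
# file in place. The name/email here are placeholders.
add_rb() {
    git interpret-trailers --in-place \
        --trailer 'Reviewed-by: Jane Dev <jane@example.com>' "$1"
}
```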

      My specific use case is to add Reviewed-by: tags to specific commits during interactive rebase, e.g.:
      > git rebase --interactive HEAD~3

       This brings up an editor with a list of the top three commits in the current branch. Assuming the aforementioned rb alias has been configured, an individual commit will be given a Reviewed-by tag when the following line is added after it:

      exec GIT_EDITOR="git rb" git commit --amend

      As an example, the following will see three commits applied, with the commit message for two of them (d9e994e and 5f8c115) appended with my Reviewed-by tag.

      pick d9e994e ctdb: Fix CID 1398179 Argument cannot be negative
      exec GIT_EDITOR="git rb" git commit --amend
      pick 0fb313c ctdb: Fix CID 1398178 Argument cannot be negative
      # ^^^^^^^ don't add a Reviewed-by tag for this one just yet
      pick 5f8c115 ctdb: Fix CID 1398175 Dereference after null check
      exec GIT_EDITOR="git rb" git commit --amend
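       Editing the todo list by hand gets tedious for long patch series. When every commit should be tagged, the exec line can be appended after each pick automatically (a sketch assuming GNU sed and the rb alias above):

```shell
# Append the exec line after every pick in a rebase todo list read on
# stdin. Pointing GIT_SEQUENCE_EDITOR at an equivalent sed call, e.g.
#   GIT_SEQUENCE_EDITOR='sed -i "/^pick /a exec GIT_EDITOR=\"git rb\" git commit --amend"' \
#       git rebase --interactive HEAD~3
# tags every rebased commit in one go.
tag_all_picks() {
    sed '/^pick /a exec GIT_EDITOR="git rb" git commit --amend'
}
```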

      Bonus: By default, the vim editor includes git rebase --interactive syntax highlighting and key-bindings - if you press K while hovering over a commit hash (e.g. d9e994e from above), vim will call git show <commit-hash>, making reviewing and tagging even faster!



      Thanks to:
      • Upstream Git developers, especially those who implemented the interpret-trailers functionality.
      • My employer, SUSE.

      December 27, 2016 06:22 PM

      December 23, 2016

      David

      QEMU/KVM Bridged Network with TAP interfaces

       In my previous post, Rapid Linux Kernel Dev/Test with QEMU, KVM and Dracut, I described how to build and boot a Linux kernel quickly, making use of port forwarding between hypervisor and guest VM for virtual network traffic.

      This post describes how to plumb the Linux VM directly into a hypervisor network, through the use of a bridge.

      Start by creating a bridge on the hypervisor system:

      > sudo ip link add br0 type bridge

      Clear the IP address on the network interface that you'll be bridging (e.g. eth0).
      Note: This will disable network traffic on eth0!
      > sudo ip addr flush dev eth0
      Add the interface to the bridge:
      > sudo ip link set eth0 master br0

      Next up, create a TAP interface:
      > sudo /sbin/tunctl -u $(whoami)
      Set 'tap0' persistent and owned by uid 1001
      The -u parameter ensures that the current user will be able to connect to the TAP interface.

      Add the TAP interface to the bridge:
      > sudo ip link set tap0 master br0

      Make sure everything is up:
      > sudo ip link set dev br0 up
      > sudo ip link set dev tap0 up

      The TAP interface is now ready for use. Assuming that a DHCP server is available on the bridged network, the VM can now obtain an IP address during boot via:
      > qemu-kvm -kernel arch/x86/boot/bzImage \
      -initrd initramfs \
      -device e1000,netdev=network0,mac=52:55:00:d1:55:01 \
      -netdev tap,id=network0,ifname=tap0,script=no,downscript=no \
      -append "ip=dhcp rd.shell=1 console=ttyS0" -nographic

      The MAC address is explicitly specified, so care should be taken to ensure its uniqueness.

      The DHCP server response details are printed alongside network interface configuration. E.g.
      [    3.792570] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
      [ 3.796085] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
      [ 3.812083] Sending DHCP requests ., OK
      [ 4.824174] IP-Config: Got DHCP answer from 10.155.0.42, my address is 10.155.0.1
      [ 4.825119] IP-Config: Complete:
      [ 4.825476] device=eth0, hwaddr=52:55:00:d1:55:01, ipaddr=10.155.0.1, mask=255.255.0.0, gw=10.155.0.254
      [ 4.826546] host=rocksolid-sles, domain=suse.de, nis-domain=suse.de
      ...

      Didn't get an IP address? There are a few things to check:
      • Confirm that the kernel is built with boot-time DHCP client (CONFIG_IP_PNP_DHCP=y) and E1000 network driver (CONFIG_E1000=y) support.
      • Check the -device and -netdev arguments specify a valid e1000 TAP interface.
      • Ensure that ip=dhcp is provided as a kernel boot parameter, and that the DHCP server is up and running.
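       The kernel options from the first bullet correspond to this .config fragment:

```
CONFIG_E1000=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
```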
      Happy hacking

       Update 20161223:
      • Use 'ip' instead of 'brctl' to manipulate the bridge device - thanks Yagamy Light!

      December 23, 2016 06:48 PM

      November 03, 2016

      David

      Git send-email PATCH version subject prefix

      I use git send-email to submit patches to developer mailing lists. A reviewer may request a series of changes, in which case I find it easiest to make and test those changes locally, before sending a new round of patches to the mailing list with a new version number:

      git send-email --subject-prefix="PATCH v4" --compose -14

      Assigning a version number to each round of patches allows me to add a change log for the entire patch-set to the introductory mail, e.g.:
      From: David Disseldorp 
      Subject: [PATCH v4 00/14] add compression ioctl support

      This patch series adds support for the FSCTL_GET_COMPRESSION and
      FSCTL_SET_COMPRESSION ioctls, as well as the reporting of the current
      compression state via the FILE_ATTRIBUTE_COMPRESSED flag.

      Hooks are added to the Btrfs VFS module, which translates such requests
      into corresponding FS_IOC_GETFLAGS and FS_IOC_SETFLAGS ioctls.

      Changes since v3 (thanks for the feedback Jeremy):
      - fixed and split copy-chunk dest unlock change into separate commit

      Changes since v2:
      - Check for valid fsp file descriptors
      - Rebase atop filesystem specific selftest changes
      - Change compression fsctl permission checks to match Windows behaviour
      + Add corresponding smbtorture test

      Changes since v1:
      - Only use smb_fname and fsp args with GET_COMPRESSION. The smb_fname
      argument is only needed for the dosmode() code-path, fsp is used
      everywhere else.
      - Add an extra SetInfo(FILE_ATTRIBUTE_COMPRESSED) test.

      GIT: [PATCH v4 01/14] selftest/s3: expose share with FS applicable config
      ...

      Change logs can also be added to individual patches using the --annotate parameter:

       git send-email --annotate --subject-prefix="PATCH v2" -1
       
       Subject: [PATCH v2] common: add CephFS support
      ...
      ---
      Changes since version 1:
      - Remove -ceph parameter for check and rely on $FSTYP instead
      ...
      diff --git a/README.config-sections b/README.config-sections

      Putting the change log between the first "---" and "diff --git a/..." lines ensures that it will be dropped when applying the patch via git-am.

      [update 2016-11-03]
      - Describe the --annotate parameter 

      November 03, 2016 01:38 PM

      October 25, 2016

      Andreas

      Microsoft Catalog Files and Digital Signatures decoded

      TL;DR: Parse and print .cat files: parsemscat

      Introduction

       Günther Deschner and I are looking into the new Microsoft Printing Protocol [MS-PAR]. Printing always means dealing with drivers. Microsoft package-aware v3 print drivers and v4 print drivers contain Microsoft Catalog files.

       A Catalog file (.cat) is a digitally-signed file. To be more precise, it is a PKCS7 certificate with embedded data. Before I started trying to understand them, I searched the web to see whether someone had already decoded them. I found a post by Richard Hughes: Building a better catalog file. Richard described some of the things we had already discovered, plus some new details, but it looks like he gave up when it came to understanding the embedded data and writing an ASN.1 description for it. Over the last two weeks I set out to decode the mystery of Catalog files, and created a tool for parsing them and printing what they contain in human-readable form.

      Details

      The embedded data in the PKCS7 signature of a Microsoft Catalog is a Certificate Trust List (CTL). Nikos Mavrogiannopoulos taught me ASN.1 and helped to create an ASN.1 description for the CTL. With this description I was able to start parsing Catalog files.

      CATALOG {}
      DEFINITIONS IMPLICIT TAGS ::=
      
      BEGIN
      
      -- CATALOG_NAME_VALUE
      CatalogNameValue ::= SEQUENCE {
          name       BMPString, -- UCS2-BE
          flags      INTEGER,
          value      OCTET STRING -- UCS2-LE
      }
      
      ...
      
      END

      mscat.asn

      The PKCS7 part of the .cat-file is the signature for the CTL. Nikos implemented support to get the embedded raw data from the PKCS7 Signature with GnuTLS. It is also possible to verify the signature using GnuTLS now!
       The CTL includes members and attributes. A member holds information about a file included in the driver package: the file name, OS attributes, and often a hash of the file contents, either SHA1 or SHA256. I've written abstracted functions so that it is possible to create a library and a simple command line tool called dumpmscat.

      Here is an example of the output:

      CATALOG MEMBER COUNT=1
      CATALOG MEMBER
        CHECKSUM: E5221540DC4B974F54DB4E390BFF4132399C8037
      
        FILE: sambap1000.inf, FLAGS=0x10010001
        OSATTR: 2:6.0,2:6.1,2:6.4, FLAGS=0x10010001
        MAC: SHA1, DIGEST: E5221540DC4B974F54DB4E39BFF4132399C8037

       In addition, the CTL normally has a list of attributes, which typically hold OS flags, version information and hardware IDs.

      CATALOG ATTRIBUTE COUNT=2
        NAME=OS, FLAGS=0x10010001, VALUE=VistaX86,7X86,10X86
        NAME=HWID1, FLAGS=0x10010001, VALUE=usb\\vid_0ff0&pid;_ff00&mi;_01

       Currently the project only has a command line tool called dumpmscat, and it can only print the CTL for now. I plan to add options to verify the signature, dump only parts, etc. When this is done I will create a library so it can easily be consumed by other software. If someone is interested and wants to contribute, something like signtool.exe would be nice to have.

      October 25, 2016 03:17 PM

      September 21, 2016

      Andreas

      A new cmocka release version 1.1.0

      It took more than a year but finally Jakub and I released a new version of cmocka today. If you don’t know it yet, cmocka is a unit testing framework for C with support for mock objects!

      We set the version number to 1.1.0 because we have some new features:

      • Support to catch multiple exceptions
      • Support to verify call ordering (for mocking)
      • Support to pass initial data to test cases
      • A will_return_maybe() function for ignoring mock returns
      • Subtests for groups using TAP output
      • Support to write multiple XML output files if you have several groups in a test
      • and improved documentation

       We have some more features we are working on. I hope the next release will not take as long.

      September 21, 2016 03:15 PM

      June 28, 2016

      David

      Linux USB Gadget Application Testing

      Developing a USB gadget application that runs on Linux?
      Following a recent Ceph USB gateway project, I was looking at ways to test a Linux USB device without the need to fiddle with cables, or deal with slow embedded board boot times.

      Ideally USB gadget testing could be performed by running the USB device code within a virtual machine, and attaching the VM's virtual USB device port to an emulated USB host controller on the hypervisor system.


      I was unfortunately unable to find support for virtual USB device ports in QEMU, so I abandoned the above architecture, and discovered dummy_hcd.ko instead.


      The dummy_hcd Linux kernel module is an excellent utility for USB device testing from within a standalone system or VM.



      dummy_hcd.ko offers the following features:
      • Re-route USB device traffic back to the local system
        • Effectively providing device loopback functionality
      • USB high-speed and super-speed connection simulation
      It can be enabled via the USB_DUMMY_HCD kernel config parameter. Once the module is loaded, no further configuration is required.
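       As an example, the loopback setup might be exercised with the mass-storage gadget (a sketch; the module names are the upstream ones, the backing file path is an example, and root is required for the modprobe steps):

```shell
# Create a zero-filled backing file for the gadget's storage.
make_backing() {  # usage: make_backing PATH SIZE_MB
    dd if=/dev/zero of="$1" bs=1M count="$2" status=none
}

# make_backing /tmp/usb.img 64
# sudo modprobe dummy_hcd            # emulated host controller (loopback)
# sudo modprobe g_mass_storage file=/tmp/usb.img
# lsblk                              # the gadget appears as a local USB disk
```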

      June 28, 2016 01:53 PM

      June 15, 2016

      Rusty

      Minor update on transaction fees: users still don’t care.

      I ran some quick numbers on the last retargeting period (blocks 415296 through 416346 inclusive) which is roughly a week’s worth.

       Blocks were full: median 998k, mean 818k (some miners were blind mining on top of unknown blocks). Yet of the 1,618,170 non-coinbase transactions, 48% were still paying dumb, round fees (like 5000 satoshis). Another 5% were paying dumb, round-numbered per-byte fees (like 80 satoshi per byte).

      The mean fee was 24051 satoshi (~16c), the mean fee rate 60 satoshi per byte. But if we look at the amount you needed to pay to get into a block (using the second cheapest tx which got in), the mean was 16.81 satoshis per byte, or about 5c.

      tl;dr: It’s like a tollbridge charging vehicles 7c per ton, but half the drivers are just throwing a quarter as they drive past and hoping it’s enough. It really shows fees aren’t high enough to notice, and transactions don’t get stuck often enough to notice. That’s surprising; at what level will they notice? What wallets or services are they using?

      June 15, 2016 03:00 AM

      May 11, 2016

      David

      Rapid Ceph Kernel Module Testing with vstart.sh

      Introduction

      Ceph's vstart.sh utility is very useful for deploying and testing a mock cluster directly from the Ceph source repository. It can:
      • Generate a cluster configuration file and authentication keys
      • Provision and deploy a number of OSDs
        • Backed by local disk, or memory using the --memstore parameter
      • Deploy an arbitrary number of monitor, MDS or rados-gateway nodes
       All services are deployed as the running user, i.e. root access is not needed.

      Once deployed, the mock cluster can be used with any of the existing Ceph client utilities, or exercised with the unit tests in the Ceph src/test directory.

      When developing or testing Linux kernel changes for CephFS or RBD, it's useful to also be able to use these kernel clients against a vstart.sh deployed Ceph cluster.

      Test Environment Overview - image based on content by Sage Weil

      The instructions below walk through configuration and deployment of all components needed to test Linux kernel RBD and CephFS modules against a mock Ceph cluster. The procedure was performed on openSUSE Leap 42.1, but should also be applicable for other Linux distributions.

      Network Setup

      First off, configure a bridge interface to connect the Ceph cluster with a kernel client VM network:

      > sudo /sbin/brctl addbr br0
      > sudo ip addr add 192.168.155.1/24 dev br0
      > sudo ip link set dev br0 up

      br0 will not be bridged with any physical adapters, just the kernel VM via a TAP interface which is configured with:

      > sudo /sbin/tunctl -u $(whoami) -t tap0
      > sudo /sbin/brctl addif br0 tap0
      > sudo ip link set tap0 up

      For more information on the bridge setup, see:
      http://blog.elastocloud.org/2015/07/qemukvm-bridged-network-with-tap.html

      Ceph Cluster Deployment

      The Ceph cluster can now be deployed, with all nodes accepting traffic on the bridge network:

      > cd $ceph_source_dir
      <build Ceph>
      > cd src
      > OSD=3 MON=1 RGW=0 MDS=1 ./vstart.sh -i 192.168.155.1 -n --memstore

      $ceph_source_dir should be replaced with the actual path. Be sure to specify the same IP address with -i as was assigned to the br0 interface.

      More information about vstart.sh usage can be found at:
       http://docs.ceph.com/docs/hammer/dev/dev_cluster_deployement/

      Kernel VM Deployment

      Build a kernel:
       
      > cd $kernel_source_dir
      > make menuconfig
      $kernel_source_dir should be replaced with the actual path. Ensure CONFIG_BLK_DEV_RBD=m, CONFIG_CEPH_FS=y, CONFIG_CEPH_LIB=y, CONFIG_E1000=y and CONFIG_IP_PNP=y are set in the kernel config. A sample can be found here.
       
      > make
      > INSTALL_MOD_PATH=./mods make modules_install
       
      Create a link to the modules directory ./mods, so that Dracut can find them:
       
      > sudo ln -s $PWD/mods/lib/modules/$(make kernelrelease) \
      /lib/modules/$(make kernelrelease)

      Generate an initramfs with Dracut. This image will be used as the test VM.
       
      > export CEPH_SRC=$ceph_source_dir/src
      > dracut --no-compress --kver "$(cat include/config/kernel.release)" \
      --install "tail blockdev ps rmdir resize dd vim grep find df sha256sum \
      strace mkfs.xfs /lib64/libkeyutils.so.1" \
      --include "$CEPH_SRC/mount.ceph" "/sbin/mount.ceph" \
      --include "$CEPH_SRC/ceph.conf" "/etc/ceph/ceph.conf" \
      --add-drivers "rbd" \
      --no-hostonly --no-hostonly-cmdline \
      --modules "bash base network ifcfg" \
      --force myinitrd

      Boot the kernel and initramfs directly using QEMU/KVM:
       
      > qemu-kvm -smp cpus=2 -m 512 \
      -kernel arch/x86/boot/bzImage -initrd myinitrd \
      -device e1000,netdev=network1,mac=b8:ac:6f:31:45:70 \
      -netdev tap,id=network1,script=no,downscript=no,ifname=tap0 \
      -append "ip=192.168.155.2:::255.255.255.0:myhostname \
      rd.shell=1 console=ttyS0 rd.lvm=0 rd.luks=0" \
      -nographic

       This should bring up a Dracut debug shell in the VM, with a network configuration matching the values passed in via the ip= kernel parameter.

      dracut:/# ip a
      ...
      2: eth0: ... mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
      link/ether b8:ac:6f:31:45:70 brd ff:ff:ff:ff:ff:ff
      inet 192.168.155.2/24 brd 192.168.155.255 scope global eth0

      For more information on kernel setup, see:
      http://blog.elastocloud.org/2015/06/rapid-linux-kernel-devtest-with-qemu.html

      RBD Image Provisioning

      An RBD volume can be provisioned using the regular Ceph utilities in the Ceph source directory:

      > cd $ceph_source_dir/src
      > ./rados lspools
      rbd
      ...

      By default, an rbd pool is created by vstart.sh, which can be used for RBD images:
       
      > ./rbd create --image-format 1 --size 1024 1g_vstart_img
      > ./rbd ls -l
      NAME SIZE PARENT FMT PROT LOCK
      1g_vstart_img 1024M 1

      Note: "--image-format 1" is specified to ensure that the kernel supports all features of the provisioned RBD image.

      Kernel RBD Usage

      From the Dracut shell, the newly provisioned 1g_vstart_img image can be mapped locally using the sysfs filesystem:
      dracut:/# modprobe rbd
      [ 9.031056] rbd: loaded
      dracut:/# echo -n "192.168.155.1:6789 name=admin,secret=AQBPiuhd9389dh28djASE32Ceiojc234AF345w== rbd 1g_vstart_img -" > /sys/bus/rbd/add
      [ 347.743272] libceph: mon0 192.168.155.1:6789 session established
      [ 347.744284] libceph: client4121 fsid 234b432f-a895-43d2-23fd-9127a1837b32
      [ 347.749516] rbd: rbd0: added with size 0x40000000

      Note: The monitor address and admin credentials can be retrieved from the ceph.conf and keyring files respectively, located in the Ceph source directory.
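       The string written to /sys/bus/rbd/add has a fixed layout: monitor address(es), options, pool, image, and snapshot ("-" for none). A small helper (hypothetical) makes that explicit:

```shell
# rbd_add_spec MON OPTS POOL IMAGE -> the line expected by /sys/bus/rbd/add
rbd_add_spec() {
    printf '%s %s %s %s -' "$1" "$2" "$3" "$4"
}

# Usage (as root, with the secret taken from the vstart keyring):
# echo -n "$(rbd_add_spec 192.168.155.1:6789 \
#     name=admin,secret=$SECRET rbd 1g_vstart_img)" > /sys/bus/rbd/add
```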

      The /dev/rbd0 mapped image can now be used like any other block device:
      dracut:/# mkfs.xfs /dev/rbd0 
      ...
      dracut:/# mkdir -p /mnt/rbdfs
      dracut:/# mount /dev/rbd0 /mnt/rbdfs
      [ 415.841757] XFS (rbd0): Mounting V4 Filesystem
      [ 415.917595] XFS (rbd0): Ending clean mount
      dracut:/# df -h /mnt/rbdfs
      Filesystem Size Used Avail Use% Mounted on
      /dev/rbd0 1014M 33M 982M 4% /mnt/rbdfs


      Kernel CephFS Usage

      vstart.sh already goes to the effort of deploying a filesystem:
      > cd $ceph_source_dir/src
      > ./ceph fs ls
       name: cephfs_a, metadata pool: cephfs_metadata_a, data pools: [cephfs_data_a ]

      All that's left is to mount it from the kernel VM using the mount.ceph binary that was copied into the initramfs:
      dracut:/# mkdir -p /mnt/mycephfs
      dracut:/# mount.ceph 192.168.155.1:6789:/ /mnt/mycephfs \
      -o name=admin,secret=AQBPiuhd9389dh28djASE32Ceiojc234AF345w==
      [ 723.103153] libceph: mon0 192.168.155.1:6789 session established
      [ 723.184978] libceph: client4122 fsid 234b432f-a895-43d2-23fd-9127a1837b32

      dracut:/# df -h /mnt/mycephfs/
      Filesystem Size Used Avail Use% Mounted on
      192.168.155.1:6789:/ 3.0G 4.0M 3.0G 1% /mnt/mycephfs


      Note: The monitor address and admin credentials can be retrieved from the ceph.conf and keyring files respectively, located in the Ceph source directory.

      Cleanup

      Unmount CephFS:
      dracut:/# umount /mnt/mycephfs

      Unmount the RBD image:
      dracut:/# umount /dev/rbd0
      [ 1592.592510] XFS (rbd0): Unmounting Filesystem

      Unmap the RBD image (0 is derived from /dev/rbdX):
      dracut:/# echo -n 0 > /sys/bus/rbd/remove

      Power-off the VM:
      dracut:/# echo 1 > /proc/sys/kernel/sysrq && echo o > /proc/sysrq-trigger
      [ 1766.387417] sysrq: SysRq : Power Off
      dracut:/# [ 1766.811686] ACPI: Preparing to enter system sleep state S5
      [ 1766.812217] reboot: Power down

      Shutdown the Ceph cluster:
      > cd $ceph_source_dir/src
      > ./stop.sh

      Conclusion

      A mock Ceph cluster can be deployed from source in a matter of seconds using the vstart.sh utility.
      Likewise, a kernel can be booted directly from source alongside a throwaway VM and connected to the mock Ceph cluster in a couple of minutes with Dracut and QEMU/KVM.

      This environment is ideal for rapid development and integration testing of Ceph user-space and kernel components, including RBD and CephFS.

      May 11, 2016 02:40 PM

      April 08, 2016

      David

      Efficient Microsoft Azure Uploads and Downloads

      With the release of version 0.7.1, Elasto is now capable of efficient (sparse aware) uploads and downloads to/from Microsoft Azure, using the Blob and File services.


      Example of a Microsoft Azure Page Blob Download


      This is done by determining which regions of a Page Blob, File Service file, or local file are allocated and only transferring those regions, which improves both network and storage utilisation.
      • For Azure Page Blobs, the Get Page Ranges API request is used to obtain a list of allocated regions.
      • For Azure File Service files, the List Ranges API request is used.
      • For local files, SEEK_DATA and SEEK_HOLE are used to determine which regions of a file are allocated.
• Amazon S3 Objects and Azure Block Blobs are still downloaded and uploaded in their entirety.
        • Sparse regions are unsupported by these services.
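The local-file side of the detection above can be sketched with plain lseek(2), alternating SEEK_DATA and SEEK_HOLE to walk the allocated extents. This is only an illustration, not Elasto's actual code; data_extents is a made-up name:

```c
#define _GNU_SOURCE     /* for SEEK_DATA / SEEK_HOLE */
#include <stdlib.h>     /* mkstemp() for callers creating test files */
#include <unistd.h>

/* Walk an open file with SEEK_DATA/SEEK_HOLE, storing up to max allocated
 * [start, end) extents into ranges[]; returns the number found. */
static int data_extents(int fd, off_t ranges[][2], int max)
{
    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;
    int n = 0;

    while (pos < end && n < max) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0 || data >= end)
            break;                      /* no more data regions */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        ranges[n][0] = data;
        ranges[n][1] = hole;
        n++;
        pos = hole;
    }
    return n;
}
```

On filesystems without sparse-file support, the kernel's generic fallback reports the whole file as a single data extent, so a caller simply degrades to a full transfer.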
      Elasto is free software, and can be obtained for openSUSE and many other Linux distributions from the openSUSE Build Service. Be safe, take backups before experimenting with this new feature.

      April 08, 2016 05:46 AM

      Rusty

      Bitcoin Generic Address Format Proposal

      I’ve been implementing segregated witness support for c-lightning; it’s interesting that there’s no address format for the new form of addresses.  There’s a segregated-witness-inside-p2sh which uses the existing p2sh format, but if you want raw segregated witness (which is simply a “0” followed by a 20-byte or 32-byte hash), the only proposal is BIP142 which has been deferred.

      If we’re going to have a new address format, I’d like to make the case for shifting away from bitcoin’s base58 (eg. 1At1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2):

      1. base58 is not trivial to parse.  I used the bignum library to do it, though you can open-code it as bitcoin-core does.
      2. base58 addresses are variable-length.  That makes webforms and software mildly harder, but also eliminates a simple sanity check.
      3. base58 addresses are hard to read over the phone.  Greg Maxwell points out that the upper and lower case mix is particularly annoying.
4. The 4-byte SHA check is not guaranteed to catch the most common forms of error (transposed or single incorrect letters), though it’s pretty good (a 1 in 4 billion chance of random errors passing).
      5. At around 34 letters, it’s fairly compact (36 for the BIP141 P2WPKH).

      This is my proposal for a generic replacement (thanks to CodeShark for generalizing my previous proposal) which covers all possible future address types (as well as being usable for current ones):

1. Prefix for type, followed by colon.  Currently “btc:” or “testnet:”.
      2. The full scriptPubkey using base 32 encoding as per http://philzimmermann.com/docs/human-oriented-base-32-encoding.txt.
      3. At least 30 bits for crc64-ecma, up to a multiple of 5 to reach a letter boundary.  This covers the prefix (as ascii), plus the scriptPubKey.
      4. The final letter is the Damm algorithm check digit of the entire previous string, using this 32-way quasigroup. This protects against single-letter errors as well as single transpositions.
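As an illustration of step 2, here is a sketch of the base-32 encoding using the alphabet from the linked document (which is also the filler visible in the example addresses below). zb32_encode is a made-up name, and the most-significant-bit-first packing with zero padding of the tail is my assumption:

```c
#include <stddef.h>

/* The z-base-32 alphabet from the human-oriented base-32 document. */
static const char zb32[32 + 1] = "ybndrfg8ejkmcpqxot1uwisza345h769";

/* Encode len bytes as base-32, most-significant bits first, zero-padding
 * the final group; out must hold (len*8+4)/5 + 1 chars.  Returns the
 * number of characters written. */
static size_t zb32_encode(const unsigned char *in, size_t len, char *out)
{
    size_t i, o = 0;
    unsigned int acc = 0, bits = 0;

    for (i = 0; i < len; i++) {
        acc = (acc << 8) | in[i];
        bits += 8;
        while (bits >= 5) {
            bits -= 5;
            out[o++] = zb32[(acc >> bits) & 31];
        }
    }
    if (bits)           /* zero-pad the final partial group */
        out[o++] = zb32[(acc << (5 - bits)) & 31];
    out[o] = '\0';
    return o;
}
```

Under those assumptions, a single 0x00 byte encodes to “yy” and a single 0xff byte to “9h”.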

      These addresses look like btc:ybndrfg8ejkmcpqxot1uwisza345h769ybndrrfg (41 digits for a P2WPKH) or btc:yybndrfg8ejkmcpqxot1uwisza345h769ybndrfg8ejkmcpqxot1uwisza34 (60 digits for a P2WSH) (note: neither of these has the correct CRC or check letter, I just made them up).  A classic P2PKH would be 45 digits, like btc:ybndrfg8ejkmcpqxot1uwisza345h769wiszybndrrfg, and a P2SH would be 42 digits.

While manually copying addresses is something which should be avoided, it does happen, and the cost of making them robust against common typographic errors is small.  The CRC is a good idea even for machine-based systems: it will let through less than 1 in a billion mistakes.  Distinguishing which blockchain the address belongs to is a nice catchall for mistakes, too.

      We can, of course, bikeshed this forever, but I wanted to anchor the discussion with something I consider fairly sane.

      April 08, 2016 01:50 AM

      April 01, 2016

      Rusty

      BIP9: versionbits In a Nutshell

      Hi, I was one of the authors/bikeshedders of BIP9, which Pieter Wuille recently refined (and implemented) into its final form.  The bitcoin core plan is to use BIP9 for activations from now on, so let’s look at how it works!

      Some background:

      • Blocks have a 32-bit “version” field.  If the top three bits are “001”, the other 29 bits represent possible soft forks.
      • BIP9 uses the same 2016-block periods (roughly 2 weeks) as the difficulty adjustment does.

      So, let’s look at BIP68 & 112 (Sequence locks and OP_CHECKSEQUENCEVERIFY) which are being activated together:

      • Every soft fork chooses an unused bit: these are using bit 1 (not bit 0), so expect to see blocks with version 536870914.
• Every soft fork chooses a start date: these use May 1st, 2016, and time out a year later if it fails.
      • Every period, we look back to see if 95% have a bit set (75% for testnet).
  • If so, and that bit is for a known soft fork, and we’re within its start time, that soft fork is locked-in: it will activate after another 2016 blocks, giving the stragglers time to upgrade.
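As a quick sanity check of the version number mentioned earlier, the “001” top bits plus a signalling bit compose like this (bip9_version is just an illustrative helper, not a real API):

```c
#include <stdint.h>

/* BIP9 block version: top three bits "001" plus one of the 29
 * soft-fork signalling bits. */
static uint32_t bip9_version(int bit)
{
    return UINT32_C(0x20000000) | (UINT32_C(1) << bit);
}
```

bip9_version(1) gives 0x20000002, i.e. the 536870914 quoted above.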

      There are also two alerts in the bitcoin core implementation:

      • If at any stage 50 of the last 100 blocks have unexpected bits set, you get Warning: Unknown block versions being mined! It’s possible unknown rules are in effect.
      • If we see an unknown softfork bit activate: you get Warning: unknown new rules activated (versionbit X).

      Now, when could the OP_CSV soft forks activate? bitcoin-core will only start setting the bit in the first period after the start date, so somewhere between 1st and 15th of May[1], then will take another period to lock-in (even if 95% of miners are already upgraded), then another period to activate.  So early June would be the earliest possible date, but we’ll get two weeks notice for sure.

      The Old Algorithm

      For historical purposes, I’ll describe how the old soft-fork code worked.  It used version as a simple counter, eg. 3 or above meant BIP66, 4 or above meant BIP65 support.  Every block, it examined the last 1000 blocks to see if more than 75% had the new version.  If so, then the new softfork rules were enforced on new version blocks: old version blocks would still be accepted, and use the old rules.  If more than 95% had the new version, old version blocks would be rejected outright.
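A sketch of that counting rule (illustrative only, not bitcoin-core's actual code; the enum and function names are made up):

```c
/* Old-style soft-fork activation: count how many of the last 1000 block
 * versions are at or above the new version.  >75% enforces the new rules
 * on new-version blocks; >95% rejects old-version blocks outright. */
enum sf_state { SF_INACTIVE, SF_ENFORCE_NEW, SF_REJECT_OLD };

static enum sf_state softfork_state(const int versions[1000], int newver)
{
    int i, count = 0;

    for (i = 0; i < 1000; i++)
        if (versions[i] >= newver)
            count++;
    if (count > 950)
        return SF_REJECT_OLD;
    if (count > 750)
        return SF_ENFORCE_NEW;
    return SF_INACTIVE;
}
```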

      I remember Gregory Maxwell and other core devs stayed up late several nights because BIP66 was almost activated, but not quite.  And as a miner there was no guarantee on how long before you had to upgrade: one smaller miner kept producing invalid blocks for weeks after the BIP66 soft fork.  Now you get two weeks’ notice (probably more if you’re watching the network).

      Finally, this change allows for miners to reject a particular soft fork without rejecting them all.  If we’re going to see more contentious or competing proposals in the future, this kind of plumbing allows it.

      Hope that answers all your questions!


       

      [1] It would be legal for an implementation to start setting it on the very first block past the start date, though it’s easier to only think about version bits once every two weeks as bitcoin-core does.

      April 01, 2016 01:28 AM

      January 21, 2016

      Andreas

      Testing PAM modules and PAM-aware applications in the Matrix

      Jakub Hrozek and I are proud to announce the first release of pam_wrapper. This tool allows you to either simplify testing PAM modules or your application using PAM to authenticate users. PAM (Pluggable Authentication Modules) is a layer of abstraction on top of Unix authentication.

      For testing PAM-aware applications we have written a simple PAM module called pam_matrix. If you plan to test a PAM module you can use the pamtest library we have implemented. It simplifies testing of modules. You can combine it with the cmocka unit testing framework or you can use the provided Python bindings to write tests for your module in Python.

Jakub and I have written an article for LWN.net to provide more details on how to use it. You can find it here.

      Now start testing your PAM module or application!

      January 21, 2016 07:21 AM

      January 03, 2016

      Rusty

      Bitcoin And Stuck Transactions?

      One problem of filling blocks is that transactions with too-low fees will get “stuck”; I’ve read about such things happening on Reddit.  Then one of my coworkers told me that those he looked at were simply never broadcast properly, and broadcasting them manually fixed it.  Which lead both of us to wonder how often it’s really happening…

      My approach is to look at the last 2 years of block data, and make a simple model:

      1. I assume the tx is not a priority tx (some miners reserve space for these; default 50k).
      2. I judge the “minimum feerate to get into a block” as the smallest feerate for any transaction after the first 50k beyond the coinbase (this is an artifact of how bitcoin core builds transactions; priority area first).
      3. I assume the tx won’t be included in “empty” blocks with only a coinbase or a single non-coinbase tx (SPV mining); their feerate is “infinite”.
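That model could be sketched as follows (illustrative only; the struct and function names are made up, and I treat the transaction that crosses the 50k boundary as still inside the priority area):

```c
#include <float.h>
#include <stddef.h>

/* Transactions in block order, coinbase excluded. */
struct tx { size_t bytes; double feerate; };

/* "Minimum feerate to get into a block": the smallest feerate seen once
 * 50k of priority space beyond the coinbase has been skipped.  Blocks
 * with at most one non-coinbase tx (SPV mining) get an "infinite" rate. */
static double min_feerate_for_block(const struct tx *txs, size_t ntx)
{
    double min = DBL_MAX;
    size_t i, skipped = 0;

    if (ntx <= 1)
        return DBL_MAX;
    for (i = 0; i < ntx; i++) {
        if (skipped < 50000) {      /* still in the priority area */
            skipped += txs[i].bytes;
            continue;
        }
        if (txs[i].feerate < min)
            min = txs[i].feerate;
    }
    return min;
}
```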

Now, what feerate do we assume?  The default “dumb wallet” fee is 10000 satoshi per kilobyte: bitcoin-core doesn’t do this pro-rata, so a median 300-byte transaction still pays 10000 satoshi by default (fee-per-byte 33.33).  The worst case is a transaction of exactly 1000 bytes (or, a wallet which does pro-rata fees), which would have a fee-per-byte of 10.

So let’s consider the last two years (since block 277918), and count how many blocks in a row we see with a fee-per-byte > 33.33, and how many with a fee-per-byte > 10:

      Conclusion

      In the last two years you would never have experienced a delay of more than 10 blocks for a median-size transaction with a 10,000 satoshi fee.

For a 1000-byte transaction paying the same fee, you would have experienced a 10 block delay 0.7% of the time, with a 20+ block delay on eight occasions: the worst being a 26 block delay at block 382918 (just under 5 hours).  But note that this fee is insufficient to be included in 40% of blocks during the last two years, too; if your wallet is generating such things without warning you, it’s time to switch wallets!

      Stuck low-fee transactions are not a real user problem yet.  It’s good to see adoption of smarter wallets, though, because it’s expected that they will be in the near future…

      January 03, 2016 10:24 PM

      December 22, 2015

      Rusty

      Bitcoin: Mixed Signs of A Fee Market

Six months ago in a previous post I showed that 45% of transactions have an output of less than $1, and estimated that they would get squeezed out first as blocks filled.  It’s time to review that prediction, and also to see several things:

      1. Are fees rising?
      2. Are fees detached from magic (default) numbers of satoshi?
      3. Are low value transactions getting squeezed out?
      4. Are transactions starting to shrink in response to fee pressure?

Here are some scenarios: low-value transactions might be vanishing even if nothing else changes, because people’s expectations (“free global microtransactions!”) are changing.  Fees might be rising but still on magic numbers, because miners and nodes increased their relayfee due to spam attacks (most commonly, the rate was increased from 1000 satoshi per kb to 5000 satoshi per kb).  Finally, we’d eventually expect wallets which produce large transactions (eg. using uncompressed signatures) to lose popularity, and wallets to get smarter about transaction generation (particularly once Segregated Witness makes it fairly easy).

      Fees For The Last 2 Years

      The full 4 year graph is very noisy, so I only plotted the mean txfee/kb for each day for the last two years, in Satoshi and USD (thanks to the Coindesk BPI data for the conversion):

       

      Conclusion: Too noisy to be conclusive: they seem to be rising recently, but some of that reflects the exchange rate changes.

      Are Fees on Magic Boundaries?

      Wallets should be estimating fees: in a real fee market they’d need to.

Dumb wallets pay a fixed fee per kb: eg. the bitcoin-core wallet pays 1,000 (now 5,000) satoshi per kb by default; even if the transaction is 300 bytes, it will pay 5,000 satoshi.  Some wallets use (slightly more sensible) scaling-by-size, so they’d pay 1,500 satoshi.  So if a transaction fee ends in “000”, or the scaled transaction fee does (+/- 2), we can categorize them as “fixed fee”.  We assume others are using a variable fee (about 0.6% will be erroneously marked as fixed):
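My reading of that categorisation, as a sketch (near_magic and is_fixed_fee are made-up names):

```c
#include <stdbool.h>

/* A fee is "on a magic boundary" if it sits within +/- 2 satoshi
 * of a multiple of 1000. */
static bool near_magic(long fee)
{
    long r = fee % 1000;
    return r <= 2 || r >= 998;
}

/* Classify a tx as "fixed fee" if either the raw fee, or the fee scaled
 * back up to a whole kilobyte, lands on a magic boundary. */
static bool is_fixed_fee(long fee, long tx_bytes)
{
    return near_magic(fee) || near_magic(fee * 1000 / tx_bytes);
}
```

For example, a 300-byte transaction paying 1,500 satoshi scales to 5,000 satoshi/kb and is classed as fixed fee, while one paying 4,113 satoshi is not.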

      This graph is a bit dense, so we thin it by grouping into weeks:

       

      Conclusion: Wallets are starting to adapt to fee pressure, though the majority are still using a fixed fee.

      Low Value Transactions For Last 4 Years

      We categorize 4 distinct types of transactions: ones which have an output below 25c, ones which have an output between 25c and $1, ones which have an output between $1 and $5, and ones which have no output below $5, and graph the trends for each for the last four years:

      Conclusion: 25c transactions are flat (ignoring those spam attack spikes).  < $1 and <$5 are growing, but most growth is coming from transactions >= $5.

      Transaction Size For Last 4 Years

      Here are the transaction sizes for the last 4 years:

      Conclusion: There seems to be a slight decline in transaction sizes, but it’s not clear the cause, and it might be just noise.

      Conclusion

      There are signs of a nascent fee market, but it’s still very early. I’d expect something conclusive in the next 6 months.

      The majority of fees should be variable, and they’re not: wallets remain poor, but users will migrate as blocks fill and more transactions get stuck.

A fee rate of over 10c per kb (2.5c per median transaction) hasn’t suppressed 25c transactions: perhaps it’s not high enough yet, or perhaps wallets aren’t making the relative fees clear enough (eg. my Trezor gives fees in BTC, as well as only offering fixed fee rates).

The slight dip in mean transaction sizes and lack of growth in 25c transactions may point to early market pressure, however.

      Six months ago I showed that 45% of transactions were less than a dollar.  In the last six months that has declined to 38%.  I previously estimated that we would want larger blocks within two years, and need them within three.  That still seems a reasonable estimate.

      Data

      I used bitcoin-iterate and a really crappy Makefile to generate CSVs with the data.  You can see the result on github or go straight to downloading the Gnumeric spreadsheet with the graphs.

      Disclaimer: I Work For Blockstream

      On lightning.  Not on drawing pretty graphs.  But I wanted to see the data…

       

      December 22, 2015 05:53 AM

      December 15, 2015

      David

      Ceph USB Storage Gateway


      Last week was Hackweek, a week full of fun and innovation at SUSE. I decided to use the time to work on a USB storage gateway for Ceph.



      The concept is simple - create a USB device that, when configured and connected, exposes remote Ceph RADOS Block Device (RBD) images for access as USB mass storage, allowing for:
      • Ceph storage usage by almost any system with a USB port
        • Including dumb systems such as TVs, MP3 players and mobile phones
      • Boot from RBD images
        • Many systems are capable of booting from a USB mass storage device
      • Minimal configuration
        • Network, Ceph credentials and image details should be all that's needed for configuration


      Hardware

      I already own a Cubietruck, which has the following desirable characteristics for this project:
      • Works with a mainline Linux Kernel
      • Is reasonably small and portable
      • Supports power and data transfer via a single mini-USB port
      • Performs relatively well
        • Dual-core 1GHz processor and 2GB RAM
        • Gigabit network adapter and WiFi 802.11 b/g/n

Possible alternatives worth evaluating include C.H.I.P (smaller and cheaper), NanoPi2, and UP (faster). I should take this opportunity to mention that I do gladly accept hardware donations!


      Base System

I decided on using openSUSE Tumbleweed as the base operating system for this project. An openSUSE Tumbleweed ARM port for the Cubietruck is available for download at:
      http://download.opensuse.org/ports/armv7hl/factory/images/openSUSE-Tumbleweed-ARM-JeOS-cubietruck.armv7l-Current.raw.xz

      Installation is as straightforward as copying the image to an SD card and booting - I documented the installation procedure on the openSUSE Wiki.
      Releases prior to Build350 exhibit boot issues due to the U-Boot device-tree path. However, this has been fixed in recent builds.


      Kernel

      The Linux kernel currently shipped with the openSUSE image does not include support for acting as a USB mass storage gadget, nor does it include Ceph RBD support. In order to obtain these features, and also reduce the size of the base image, I built a mainline Linux kernel (4.4-rc4) using a minimal custom kernel configuration:
      ~/> sudo zypper install --no-recommends git-core gcc make ncurses-devel bc
      ~/> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
      ~/> cd linux
      ~/linux/> wget https://raw.githubusercontent.com/ddiss/ceph_usb_gateway/master/.config
      # or `make sunxi_defconfig menuconfig`
      # ->enable Ceph, sunxi and USB gadget modules
      ~/linux/> make oldconfig
      ~/linux/> make -j2 zImage dtbs modules
      ~/linux/> sudo make install modules_install
      ~/linux/> sudo cp arch/arm/boot/zImage /boot/zImage-$(make kernelrelease)
      ~/linux/> sudo cp arch/arm/boot/dts/sun7i-a20-cubietruck.dtb /boot/dtb-4.3.0-2/
      ~/linux/> sudo cp arch/arm/boot/dts/sun7i-a20-cubietruck.dtb /boot/dtb/

      This build procedure takes a long time to complete. Cross compilation could be used to improve build times.
      I plan on publishing a USB gadget enabled ARM kernel on the Open Build Service in the future, which would allow for simple installation via zypper - watch this space!


      Ceph RADOS Block Device (RBD) mapping

      To again save space, I avoided installation of user-space Ceph packages by using the bare kernel sysfs interface for RBD image mapping.
      The Ceph RBD kernel module must be loaded prior to use:
      # modprobe rbd

      Ceph RADOS block devices can be mapped using the following command:
      # echo -n "${MON_IP}:6789 name=${AUTH_NAME},secret=${AUTH_SECRET} " \
                "${CEPH_POOL} ${CEPH_IMG} -" > /sys/bus/rbd/add

      $MON_IP can be obtained from ceph.conf. Similarly, the $AUTH_NAME and $AUTH_SECRET credentials can be retrieved from a regular Ceph keyring file.
      $CEPH_POOL and $CEPH_IMG correspond to the location of the RBD image.

      A locally mapped RBD block device can be subsequently removed via:
      # echo -n "${DEV_ID}" > /sys/bus/rbd/remove

      $DEV_ID can be determined from the numeric suffix assigned to the /dev/rbdX device path.

Images can't be provisioned without the Ceph user-space utilities installed, so provisioning should be performed on a separate system (e.g. an OSD) prior to mapping on the Cubietruck. For example, to provision a 10GB image:
      # rbd create --size=10240 --pool ${CEPH_POOL} ${CEPH_IMG}

      With my Cubietruck connected to the network via the ethernet adapter, I observed streaming read (/dev/rbd -> /dev/null) throughput at ~37MB/s, and the same value for streaming writes (/dev/zero -> /dev/rbd). Performance appears to be constrained by limitations of the Cubietruck hardware.


      USB Mass Storage Gadget

      The Linux kernel mass storage gadget module is configured via configfs. A device can be exposed as a USB mass storage device with the following procedure:
      # modprobe sunxi configfs libcomposite usb_f_mass_storage

      # mount -t configfs configfs /sys/kernel/config
      # cd /sys/kernel/config/usb_gadget/
      # mkdir -p ceph
      # cd ceph

      # mkdir -p strings/0x409
      # echo "fedcba9876543210" > strings/0x409/serialnumber
      # echo "openSUSE" > strings/0x409/manufacturer
      # echo "Ceph USB Drive" > strings/0x409/product

      # mkdir -p functions/mass_storage.usb0
      # echo 1 > functions/mass_storage.usb0/stall
      # echo 0 > functions/mass_storage.usb0/lun.0/cdrom
      # echo 0 > functions/mass_storage.usb0/lun.0/ro
      # echo 0 > functions/mass_storage.usb0/lun.0/nofua
      # echo "$DEV" > functions/mass_storage.usb0/lun.0/file

      # mkdir -p configs/c.1/strings/0x409
      # echo "Config 1: mass-storage" > configs/c.1/strings/0x409/configuration
      # echo 250 > configs/c.1/MaxPower
      # ln -s functions/mass_storage.usb0 configs/c.1/

      # ls /sys/class/udc > UDC
      $DEV corresponds to a /dev/X device path, which should be a locally mapped RBD device path. The module can however also use local files as backing for USB mass storage.


      Boot-Time Automation

      By default, Cubietruck boots when the board is connected to a USB host via the mini-USB connection.
      With RBD image mapping and USB mass storage exposure now working, the process can be run automatically on boot via a simple script: rbd_usb_gw.sh
      Furthermore, a systemd service can be added:
      [Unit]
      Wants=network-online.target
      After=network-online.target

      [Service]
      # XXX assume that rbd_usb_gw.sh is present in /bin
      ExecStart=/bin/rbd_usb_gw.sh %i
      Type=oneshot
      RemainAfterExit=yes

      Finally, this service can be triggered by Wicked when the network interface comes online, with the following entry added to /etc/sysconfig/network/config:
      POST_UP_SCRIPT="systemd:rbd-mapper@.service"


      Boot Performance Optimisation

      A significant reduction in boot time can be achieved by running everything from initramfs, rather than booting to the full Linux distribution.
      Generating a minimal initramfs image, with support for mapping and exposing RBD images is straightforward, thanks to the Dracut utility:
      # dracut --no-compress  \
--kver "$(uname -r)" \
      --install "ps rmdir dd vim grep find df modinfo" \
      --add-drivers "rbd musb_hdrc sunxi configfs" \
      --no-hostonly --no-hostonly-cmdline \
      --modules "bash base network ifcfg" \
      --include /bin/rbd_usb_gw.sh /lib/dracut/hooks/emergency/02_rbd_usb_gw.sh \
      myinitrd

      The rbd_usb_gw.sh script is installed into the initramfs image as a Dracut emergency hook, which sees it executed as soon as initramfs has booted.

      To ensure that the network is up prior to the launch of rbd_usb_gw.sh, the kernel DHCP client (CONFIG_IP_PNP_DHCP) can be used by appending ip=dhcp to the boot-time kernel parameters. This can be set from the U-Boot bootloader prompt:
      => setenv append 'ip=dhcp'
      => boot

      The new initramfs image must be committed to the boot partition via:

      # cp myinitrd /boot/
      # rm /boot/initrd
      # sudo ln -s /boot/myinitrd /boot/initrd

      Note: In order to boot back to the full Linux distribution, you will have to mount the /boot partition and revert the /boot/initrd symlink to its previous target.


      Future Improvements

      • Support configuration of the device without requiring console access
        • Run an embedded web-server, or expose a configuration filesystem via USB 
• Install the operating system onto on-board NAND storage
      • Further improve boot times
        • Avoid U-Boot device probes
      • Experiment with the new f_tcm USB gadget module
        • Expose RBD images via USB and iSCSI


      Credits

      Many thanks to:
      • My employer, SUSE Linux, for encouraging me to work on projects like this during Hackweek.
      • The linux-sunxi community, for their excellent contributions to the mainline Linux kernel.
      • Colleagues Dirk, Bernhard, Alex and Andreas for their help in bringing up openSUSE Tumbleweed on my Cubietruck board.

      December 15, 2015 12:18 PM

      October 29, 2015

      Andreas

      uid_wrapper-1.2.0 released!

      I’ve just released uid_wrapper-1.2.0, a testing tool to fake privilege separation!

      uid_wrapper

      The new version correctly checks privileges when changing IDs and has a lot more tests! Learn more at https://cwrap.org.

      October 29, 2015 12:52 PM

      October 20, 2015

      Rusty

      ccan/mem’s memeqzero iteration

      On Thursday I was writing some code, and I wanted to test if an array was all zero.  First I checked if ccan/mem had anything, in case I missed it, then jumped on IRC to ask the author (and overall CCAN co-maintainer) David Gibson about it.

      We bikeshedded around names: memallzero? memiszero? memeqz? memeqzero() won by analogy with the already-extant memeq and memeqstr. Then I asked:

      rusty: dwg: now, how much time do I waste optimizing?
      dwg: rusty, in the first commit, none

      Exactly five minutes later I had it implemented and tested.

      The Naive Approach: Times: 1/7/310/37064 Bytes: 50

      bool memeqzero(const void *data, size_t length)
      {
          const unsigned char *p = data;
      
          while (length) {
              if (*p)
                  return false;
              p++;
              length--;
          }
          return true;
      }

As a summary, I’ll give the nanoseconds for searching through 1, 8, 512 and 65536 bytes only.

      Another 20 minutes, and I had written that benchmark, and an optimized version.

      128-byte Static Buffer: Times: 6/8/48/5872 Bytes: 108

Here’s my first attempt at optimization: using a static array of 128 bytes of zeroes and assuming memcmp is well-optimized for fixed-length comparisons.  Worse for small sizes, much better for big.

       const unsigned char *p = data;
       static unsigned long zeroes[16];
      
       while (length > sizeof(zeroes)) {
           if (memcmp(zeroes, p, sizeof(zeroes)))
               return false;
           p += sizeof(zeroes);
           length -= sizeof(zeroes);
       }
       return memcmp(zeroes, p, length) == 0;

      Using a 64-bit Constant: Times: 12/12/84/6418 Bytes: 169

      dwg: but blowing a cacheline (more or less) on zeroes for comparison, which isn’t necessarily a win

      Using a single zero uint64_t for comparison is pretty messy:

      bool memeqzero(const void *data, size_t length)
      {
          const unsigned char *p = data;
          const unsigned long zero = 0;
          size_t pre;
          pre = (size_t)p % sizeof(unsigned long);
          if (pre) {
              size_t n = sizeof(unsigned long) - pre;
              if (n > length)
                  n = length;
              if (memcmp(p, &zero, n) != 0)
                  return false;
              p += n;
              length -= n;
          }
          while (length > sizeof(zero)) {
              if (*(unsigned long *)p != zero)
                  return false;
              p += sizeof(zero);
              length -= sizeof(zero);
          }
          return memcmp(&zero, p, length) == 0;
      }

      And, worse in every way!

      Using a 64-bit Constant With Open-coded Ends: Times: 4/9/68/6444 Bytes: 165

      dwg: rusty, what colour is the bikeshed if you have an explicit char * loop for the pre and post?

      That’s slightly better, but memcmp still wins over large distances, perhaps due to prefetching or other tricks.

      Epiphany #1: We Already Have Zeroes: Times 3/5/92/5801 Bytes: 422

      Then I realized that we don’t need a static buffer: we know everything we’ve already tested is zero!  So I open coded the first 16 byte compare, then memcmp()ed against the previous bytes, doubling each time.  Then a final memcmp for the tail.  Clever huh?

But it was no faster than the static buffer case on the high end, and much bigger.
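For reference, here is the doubling idea sketched from the description above (my reconstruction, not the exact code that was benchmarked):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool memeqzero_doubling(const void *data, size_t length)
{
    const unsigned char *p = data;
    size_t len;

    /* Open-coded check of the first 16 bytes. */
    for (len = 0; len < 16; len++) {
        if (!length)
            return true;
        if (*p)
            return false;
        p++;
        length--;
    }

    /* Invariant: the len bytes before p are known zero, so compare the
     * next len bytes against them, doubling len each time. */
    while (length > len) {
        if (memcmp(data, p, len) != 0)
            return false;
        p += len;
        length -= len;
        len *= 2;
    }

    /* Final tail compare against the (known zero) start of the buffer. */
    return memcmp(data, p, length) == 0;
}
```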

      dwg: rusty, that is brilliant. but being brilliant isn’t enough to make things work, necessarily :p

      Epiphany #2: memcmp can overlap: Times 3/5/37/2823 Bytes: 307

      My doubling logic above was because my brain wasn’t completely in phase: unlike memcpy, memcmp arguments can happily overlap!  It’s still worth doing an open-coded loop to start (gcc unrolls it here with -O3), but after 16 it’s worth memcmping with the previous 16 bytes.  This is as fast as naive with as little as 2 bytes, and the fastest solution by far with larger numbers:

       const unsigned char *p = data;
       size_t len;
      
       /* Check first 16 bytes manually */
       for (len = 0; len < 16; len++) {
           if (!length)
               return true;
           if (*p)
               return false;
           p++;
           length--;
       }
      
       /* Now we know that's zero, memcmp with self. */
       return memcmp(data, p, length) == 0;

      You can find the final code in CCAN (or on Github) including the benchmark code.

      Finally, after about 4 hours of random yak shaving, it turns out lightning doesn’t even want to use memeqzero() any more!  Hopefully someone else will benefit.

      October 20, 2015 12:09 AM

      September 24, 2015

      Andreas

      libssh is running in the Matrix now

Since I joined the libssh project we started to write tests to find regressions and make development easier. This has been achieved using a unit testing framework called cmocka, which I maintain and develop. The problem is that to run these tests you need to modify the sshd configuration and set up a test user so that the tests can be executed successfully. This is something contributors normally don’t do, so we need to rely on our testing infrastructure.

In 2013 I started the cwrap project. cwrap is a set of tools that makes full network server/client testing easy. These tools are used to make it possible to run the Samba Testsuite easily on every machine without setting anything up. Some time ago I started to use cwrap for libssh testing. Finally I found the time to finish the task.

      libssh in the Matrix

Now the libssh client tests set up an artificial test environment. We have a passwd, shadow and group file so we can use two users to authenticate (nss_wrapper). sshd runs as the user starting the testcase, but as it is part of the Matrix it thinks it is root (uid_wrapper). The client and server think they communicate on a real network (socket_wrapper) but it is again the Matrix!

It took me a while to get it working and I needed to implement new features in the wrapper libraries of cwrap. socket_wrapper needed support to report TCP_NODELAY in getsockopt(). nss_wrapper needed shadow file support for password authentication, so I had to add support for getspnam(). And as sshd is paranoid, uid_wrapper needed checks for whether it is privileged to actually change to the user. After it drops privileges, it checks that it really can’t go back.

With all of this implemented and new releases of the wrappers, which I’m preparing at the moment, all you have to do is install cmocka, socket_wrapper, nss_wrapper and uid_wrapper and run ‘make test’. The Matrix will be created and libssh tested. You can find the cwrap libssh branch here.

      There is one test for a feature missing right now. We do not test keyboard-interactive authentication, but the cwrap project is working on a new wrapper to fix this. Stay tuned!

      September 24, 2015 05:02 PM

      Last updated: April 26, 2017 07:01 PM
