#8

discuss using RLIMIT_DATA instead of RLIMIT_AS

    • Status: Invalid
    • Priority: Normal (3)
    • Component: -
    • Estimate: None
    I suggest using RLIMIT_DATA instead of RLIMIT_AS. I'll first show the difference, from the man pages, and then explain why I think we should change it:

    RLIMIT_AS
    The maximum size of the process's virtual memory (address space) in bytes. This limit affects calls to brk(2), mmap(2) and mremap(2),
    which fail with the error ENOMEM upon exceeding this limit. Also automatic stack expansion will fail (and generate a SIGSEGV that kills
    the process when no alternate stack has been made available). Since the value is a long, on machines with a 32-bit long either this limit
    is at most 2 GiB, or this resource is unlimited.

    RLIMIT_DATA
    The maximum size of the process's data segment (initialized data, uninitialized data, and heap). This limit affects calls to brk() and
    sbrk(), which fail with the error ENOMEM upon encountering the soft limit of this resource.

    So, basically, I don't think we should include mmap in the memory limit. You already know the size of the files you're going to memory-map, so you can write the job to mmap only files that fit in memory. Also, if you mmap data that you don't write to, the pages are shared by all the MapReduce jobs running on the machine. For example, with 8 mappers configured on a machine, each mmapping the same 2GB file and each using 250MB of other, anonymous memory, the total is 2GB + 8*250MB = 4GB, not 8*(2GB + 250MB) = 18GB. Because of this sharing, the limits for anonymous memory and for mmap memory really need to be specified separately.
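The arithmetic above can be written out as a small sketch. The function names are made up for illustration, and sizes are in decimal MB so that 2GB = 2000MB, matching the figures in the ticket.

```python
def aggregate_memory_mb(shared_mmap_mb, anon_per_task_mb, num_tasks):
    """Physical memory needed when all tasks share one read-only mapping."""
    # The mapped file is counted once; only anonymous memory scales with tasks.
    return shared_mmap_mb + num_tasks * anon_per_task_mb


def per_task_address_space_mb(shared_mmap_mb, anon_per_task_mb):
    """Address space a single task needs, which is what RLIMIT_AS measures."""
    # RLIMIT_AS is per process, so every task is charged the full mapping.
    return shared_mmap_mb + anon_per_task_mb


if __name__ == "__main__":
    print(aggregate_memory_mb(2000, 250, 8))
    print(per_task_address_space_mb(2000, 250))
```

Summing the per-task address spaces over 8 tasks overstates the machine's real memory use, which is the mismatch the ticket is pointing at.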

    If you only read the mmapped file data, it will never be written to the swap file: the file itself acts as the backing store. Under memory pressure the kernel just discards the pages from RAM and re-reads them from the file later.
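For reference, a minimal sketch of the kind of read-only mapping being discussed, using Python's standard mmap module. The helper name read_via_mmap is hypothetical and not part of any existing code in this project.

```python
import mmap


def read_via_mmap(path, offset, length):
    """Return `length` bytes at `offset` from a read-only mapping of `path`."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping
        # read-only, so its pages stay backed by the file itself.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]
```

Since the mapping is never written to, several processes mapping the same file this way can share the same cached pages.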
  • Followers: Klaas Bosteels (Assigned To), Daniel Lescohier
    Activity
    On Mar 19, 2009 @ 09:39am UTC, by Klaas Bosteels:

    Status changed from New to Invalid
    We use RLIMIT_AS because RLIMIT_DATA doesn't seem to work in practice:
    $ python
    Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52) 
    [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import resource
    >>> resource.setrlimit(resource.RLIMIT_AS, (10000, 10000))
    >>> try: manyints = range(100000)
    ... except MemoryError: print "memerror"
    ... 
    memerror
    >>>
    $ python
    Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52) 
    [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import resource
    >>> resource.setrlimit(resource.RLIMIT_DATA, (10000, 10000))
    >>> try: manyints = range(100000)
    ... except MemoryError: print "memerror"
    ... 
    >>> len(manyints)
    100000
    >>>
    Feel free to reopen if you know a way to make RLIMIT_DATA work, but for now I'm closing this ticket...
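To repeat the comparison on another machine, here is a rough Python 3 sketch, assuming Linux. Each limit is applied in a child process so the calling interpreter is not affected. The try_alloc helper and the sizes are illustrative choices, not part of this project. The RLIMIT_DATA result may depend on the kernel: as far as I know, the getrlimit(2) man page says RLIMIT_DATA also affects mmap(2) since Linux 4.7, while on older kernels a large allocation served by mmap slips past it, which would explain the session above.

```python
import subprocess
import sys

# Child program: lower the soft limit of one resource, then try one allocation.
CHILD = r"""
import resource, sys
which = getattr(resource, sys.argv[1])
limit = int(sys.argv[2])
soft, hard = resource.getrlimit(which)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)  # the soft limit may not exceed the hard limit
resource.setrlimit(which, (limit, hard))
try:
    buf = bytearray(int(sys.argv[3]))
    print("allocated")
except MemoryError:
    print("MemoryError")
"""


def try_alloc(limit_name, limit_bytes, alloc_bytes):
    """Run the child with the given limit; return 'allocated' or 'MemoryError'."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD,
         limit_name, str(limit_bytes), str(alloc_bytes)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    MIB = 1024 ** 2
    for name in ("RLIMIT_AS", "RLIMIT_DATA"):
        print(name, try_alloc(name, 256 * MIB, 512 * MIB))
```

The limit is set well above what the interpreter needs at startup, so a failure comes from the test allocation itself and not from Python running out of room to report it.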