How do I use regexec with memory-mapped files?

Date: 2012-06-14 16:35:45

Tags: c regex unix mmap

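I am trying to find a regular expression in a large memory-mapped file by using the regexec() function. I found that the program crashes when the file size is a multiple of the page size.

Is there a regexec() function that takes the length of the string as an additional argument?

Or:

How do I find a regular expression in a memory-mapped file?

Here is a minimal example that ALWAYS crashes (if I run fewer than 3 threads, the program does not crash):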

ls -la ttt.txt 
-rwx------ 1 bob bob 409600 Jun 14 18:16 ttt.txt

gcc -Wall mal.c -o mal -lpthread -g && ./mal
[1]    11364 segmentation fault (core dumped)  ./mal

The program is:

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#include <stdio.h>
#include <assert.h>
#include <pthread.h>
#include <regex.h>

void* f(void*arg) {
  int size = 409600;
  int fd = open("ttt.txt", O_RDONLY);
  char* text = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);

  fd = open("/dev/zero", O_RDONLY);
  char* end = mmap(text + size, 4096, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
  close(fd);

  assert(text+size == end);

  regex_t myre;
  regcomp(&myre, "XXXXX", REG_EXTENDED);
  regexec(&myre, text, 0, NULL, 0);
  regfree(&myre);
  return NULL;
}

int main(int argc, char* argv[]) {
  int n = 10;
  int i;
  pthread_t t[n];
  for (i = 0; i < n; ++i) {
    pthread_create(&t[i], NULL, f, NULL);
  }
  for (i = 0; i < n; ++i) {
    pthread_join(t[i], NULL);
  }
  return 0;
}

P.S. Here is the gdb output:

gdb ./mal 
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/bob/prog/c/mal...done.
(gdb) r

Starting program: /home/srdjan/prog/c/mal 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff77ff700 (LWP 11817)]
[New Thread 0x7ffff6ffe700 (LWP 11818)]
[New Thread 0x7ffff6799700 (LWP 11819)]
[New Thread 0x7fffeffff700 (LWP 11820)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6799700 (LWP 11819)]
__strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:72
72  ../sysdeps/x86_64/multiarch/../strlen.S: No such file or directory.
(gdb) bt
#0  __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:72
#1  0x00007ffff78df254 in __regexec (preg=0x7ffff6798e80, string=0x7fffef79b000 'a' <repeats 200 times>..., nmatch=<optimized out>, 
pmatch=0x0, eflags=<optimized out>) at regexec.c:245
#2  0x00000000004008e6 in f (arg=0x0) at mal.c:24
#3  0x00007ffff7bc4e9a in start_thread (arg=0x7ffff6799700) at pthread_create.c:308
#4  0x00007ffff78f24bd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()
(gdb) 

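2 Answers:

Answer 0: (score: 2)

The problem is that regexec() is meant for matching a null-terminated string against a precompiled pattern buffer, but an mmap'ed file is not necessarily (in fact, usually is not) null-terminated. So regexec() reads past the end of the file looking for a NUL character (a 0 byte). When the file size is not a multiple of the page size, the rest of the last page is zero-filled, which happens to supply a terminator. When the size is an exact multiple of the page size there is no such padding, so the scan runs into whatever follows the mapping.

You would need a version of regexec() that takes a buffer and a size argument instead of a null-terminated string, but there does not seem to be one.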
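One possible way to get a length-bounded search is a minimal sketch that assumes the C library offers the non-POSIX REG_STARTEND flag (glibc and the BSDs document it, but it is an extension and may be missing elsewhere). With that flag, regexec() takes the region to search from pmatch[0] instead of scanning for a terminator. The helper name regexec_buf is hypothetical, chosen only for this illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Search buf[0..len) for a compiled pattern without needing a NUL terminator.
 * Returns 0 on a match, REG_NOMATCH when nothing matches, or another
 * regexec() error code.
 *
 * ASSUMPTION: the C library provides the non-POSIX REG_STARTEND extension.
 * regexec_buf is a hypothetical helper name, not a library function.
 * Note that regoff_t may be a plain int, so len must fit in that type.
 */
int regexec_buf(const regex_t *re, const char *buf, size_t len)
{
#ifdef REG_STARTEND
    regmatch_t m[1];
    m[0].rm_so = 0;              /* offset where the search region starts */
    m[0].rm_eo = (regoff_t)len;  /* offset one past the end of the region */
    return regexec(re, buf, 1, m, REG_STARTEND);
#else
#error "REG_STARTEND is not available on this platform"
#endif
}
```

In the reproducer from the question, this would stand in for the call regexec(&myre, text, 0, NULL, 0), becoming regexec_buf(&myre, text, size).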

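Answer 1: (score: 2)

Celada correctly identified the problem: the file data does not necessarily contain a null terminator.

You can work around the problem by mapping a page of zeros immediately after the file: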

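int fd;
char *text;

/* Map the file itself (409600 bytes, an exact multiple of the page size). */
fd = open("ttt.txt", O_RDONLY);
text = mmap(NULL, 409600, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);

/* Map one zero-filled page directly after it to act as the terminator. */
fd = open("/dev/zero", O_RDONLY);
mmap(text + 409600, 4096, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
close(fd);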

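(Note that you can close fd immediately after mmap(), because mmap() adds a reference to the open file description.)

You should of course add error checking to the above. Also, many UNIX systems support the MAP_ANONYMOUS flag, which you can use instead of opening /dev/zero (but this is not in POSIX).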
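As a rough sketch of what the error-checked version could look like, the helper below reverses the order of the two mappings: it first reserves the whole range (file pages plus one extra page) as zero-filled anonymous memory, then maps the file over the front of that range. MAP_FIXED replaces whatever is already mapped at the target address, so applying it only inside a range the caller already owns avoids touching memory that may belong to something else, which may matter in a multi-threaded program like the one in the question. This assumes MAP_ANONYMOUS is available; the name map_file_nul_terminated and its out-parameters are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map a file read-only so that at least one zero byte follows its contents.
 * On success, returns the mapping, stores the file size in *size_out and the
 * total mapped length (to pass to munmap) in *map_len_out.
 * Returns NULL on failure or for an empty file.
 * ASSUMPTION: MAP_ANONYMOUS is supported; this helper is hypothetical.
 */
char *map_file_nul_terminated(const char *path, size_t *size_out,
                              size_t *map_len_out)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;

    struct stat st;
    if (page <= 0 || fstat(fd, &st) == -1 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    size_t pg = (size_t)page;
    size_t size = (size_t)st.st_size;
    /* Round the file up to whole pages, then add one extra page of zeros. */
    size_t map_len = (size + pg - 1) / pg * pg + pg;

    /* Step 1: reserve the whole range as zero-filled anonymous memory. */
    char *base = mmap(NULL, map_len, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Step 2: overlay the file on the front of the range we already own. */
    char *text = mmap(base, size, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
    close(fd);  /* the mapping keeps its own reference to the file */
    if (text == MAP_FAILED) {
        munmap(base, map_len);
        return NULL;
    }

    *size_out = size;
    *map_len_out = map_len;
    return text;
}
```

The returned pointer can then be passed to regexec() as an ordinary string, and released with munmap(text, map_len) once the search is done.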
