Robert Seacord, author of Effective C, The CERT C Coding Standard, and Secure Coding in C and C++, discusses the top 5 security issues and the tools and techniques you can employ to write secure code in C. Host Gavin Henry spoke with Seacord about the C standards, strings, arrays of chars, null pointers, buffer overflows, memory leaks, corrupt memory, how this can be exploited, bad inputs, dangling pointers, the stack, the heap, memory allocators, data structures, enum surprises, C23, compilers, committee meetings, Annex K secure function options, static and dynamic analysis tools, good IDEs, fuzzing, gcc and clang options, MISRA C, CERT C and making sure you understand C so you can write C programs correctly to begin with, rather than relying on trial and error techniques.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Gavin Henry 00:01:06 Welcome to Software Engineering Radio. I’m your host, Gavin Henry, and today my guest is Robert Seacord. Robert Seacord is a Technical Director at NCC Group where he develops and delivers secure coding training in C and C++ and other languages. Seacord is an expert on the C Standards. His six previous books include CERT C Coding Standard and Secure Coding in C and C++. Robert, welcome to Software Engineering Radio. Is there anything I missed in your bio that you’d like to add?
Robert Seacord 00:01:36 No, that was quite complete. Thanks for having me here.
Gavin Henry 00:01:40 A pleasure. So, I’d like to start off with a brief history of the C language and then touch on why programming in C can be insecure. We’re also going to move on to the top five security issues. And then the last bit of the show is going to be talking about the various ways and tools we can use to help us write secure C programs. Okay? Small disclosure, I might mention an open-source project I’m working on called SentryPeer, which is written in C, for various things that have come up while I’ve been writing the code and tools. I found security issues in things I thought weren’t an issue, and things I found in your books and the sections on how to improve your code. I think it’ll be a nice bit. So, let’s lay down some foundations: when was C created?
Robert Seacord 00:02:35 I had to look this up because I’m actually not quite that old, but it first appeared in 1972. And it was developed by Dennis Ritchie at Bell Laboratories in New Jersey. So, it’s had a very long history. It was based on a typeless language called B, as you might imagine, because programmers have never been very good at naming things.
Gavin Henry 00:03:01 Cool, is there such a thing as versions, or how does that work?
Robert Seacord 00:03:05 Well yeah, there’s a lot of variation in what we call C, right? So, there was K&R C, which was Kernighan and Ritchie, and kind of corresponded to their book back in the 70s. And back in the 70s, ANSI started a committee to standardize the language. So they published their first standard in 1989. So that’s often referred to as C89, and the next year that was published by ISO. So it was fast-tracked to the international standards organization as C90, and a lot of people have assured me, including John Benito, who was the previous convenor of the C standards committee, that those two standards are exactly the same. There’s just a different cover page. But it’s actually quite hard to find copies of those original standards. But a lot of embedded code is still written in C90, and then there have been several major versions of the standard released since.
Robert Seacord 00:04:10 So the next one was C99. And C99 was a little slow on adoption, but it had a bunch of features. C11 was the first standard that I worked on from beginning to end, and C11 primarily introduced parallel programming, concurrent programming, threads, a thread library, atomics. And it was meant to also address security. I’m not sure it did as good a job of addressing security as it did addressing parallel execution, but we did add things like Annex K, which is the bounds-checking interfaces, or the underbar-s (_s) functions. Many people think that the underbar-s suffix stands for Security, but it actually stands for Bounds Controlled Interface. And we added an Annex L, which was the analyzability annex, and we made some other small improvements here and there to address security. That was C11. Yeah. 2011. We just had the single digits I mean, I guess eventually we’ll wrap around, but I hope to be dead by then and expect it to be someone else’s problem.
Gavin Henry 00:05:31 You mentioned ANSI, that’s the American…?
Robert Seacord 00:05:33 Yeah, that’s the American National Standards Institute, but nowadays there’s actually a group called INCITS, which is sort of under the umbrella of ANSI. And so, if you are in the US, you’re a member of the INCITS committee, and INCITS gets a single vote in ISO. So ISO is the international standards body, and it’s one nation, one vote at ISO. And the committee is actually very US-centric. We had a meeting some years ago in Delft in the Netherlands, and there’s a portion of the meeting where we just handle INCITS business. So we asked people who aren’t part of the US body to leave. And the only person to leave was the host of the meeting. And this was a meeting taking place in Europe. There was only one European there, and it was the host. So usually we get better participation from Northern Europe and Canada, but not much beyond there; there hasn’t been a lot of participation from Asia or elsewhere lately.
Gavin Henry 00:06:50 And is that because C’s not used there, or they don’t participate?
Robert Seacord 00:06:54 It might be used there, but all the compiler vendors are in the US, primarily. There’s IBM; their compiler group is in Markham in Canada. And so the Canadian representation is actually from IBM, the well-known Canadian company, of course.
Gavin Henry 00:07:16 So there’s not really versions; it’s the standard and that changes each…?
Robert Seacord 00:07:24 Yeah, so there are the versions of the standard, and then that sort of drives the baseline. So there’s C11, C17. C17, some people mistakenly call it C18 because it was published by ISO in 2018, but it is actually the 2017 standard. And that was really an unusual one. It was just really bug fixes of C11. So, no one really should be using C11. C11 is like C17 with bugs, and C17 is C11 without bugs, but, of course, all the compiler vendors fixed all the bugs in C11. So you won’t see them anymore, regardless of which standard you specify. And so C17 is the current version and we are currently working on C23, and the deadline for papers to introduce new features has come and gone as of this past, I think it was as of November. And so, we know what’s not going to be in C23 right now, which is anything we haven’t got a paper on, and what’s going to be in it is still up in the air, because we’ll have to see if we can get consensus on the remaining proposals that are in front of the committee.
Gavin Henry 00:08:40 And that’s what ends up in the compiler, doesn’t it — the version or standard the compiler supports?
Robert Seacord 00:08:47 Well, it could, right? So first of all, when we create the standard, in C there’s a strong requirement for existing implementations, right? So, the C committee, more than most committees, does not like to invent things. We’d like to find things that are being used in practice that could benefit from standardization, because that might increase portability over a number of platforms, and then get it into the standard. And sometimes the committee will, I’ll use the term “make improvements” to existing practice. They do like to fiddle, and that’s good and bad. I mean, it’s nice to maybe make some improvements, but at the same time, now it’s not exactly existing practice anymore, because you made some changes to it. And with some things like Annex K, the committee fiddled with that a bit and got to the point where the existing implementation from Microsoft became non-conforming to the standard, and they weren’t really up for changing it. And so, the standard (I’m trying to find a different word than “standard”) sets a standard.
Gavin Henry 00:10:11 No, but you touched on a good point in there that the standard is there to reinforce portability. I think that’s what you’re trying to get to.
Robert Seacord 00:10:19 Yeah. But each compiler implementation exists over a continuum, right? So, you’ll have a compiler that, say, has fully implemented C99, but they’re working towards implementing all the C11 or C17 features, right? And so it’s somewhere in between. And then most compilers have compiler-specific extensions that you can use, right? Which are not standardized. And so across implementations there’s a lot of variation; each standard version is sort of a different flavor of the language. And then the actual compiler implementations will fall into different areas in terms of which standards they implement and which additional features. So, there’s a consistent sort of portable backbone, but there’s a certain amount of variation kind of built on top of that.
Gavin Henry 00:11:30 Yeah. Just touching on the bit where you spoke about it really being C17 and not C18: in my open source project that I mentioned, when I was getting the continuous integration tasks set up to build my project, the compiler flag I put on for GCC was -std=c18, because I’m running the latest Fedora Linux on my desktop, which I develop on. But the runners where this code was built on GitHub use Ubuntu 20 LTS, and they didn’t have that flag supported in those GCCs, I think it was. Or when I was testing on NetBSD and OpenBSD, they only support C11. So even with things not even that many years old, they haven’t caught up, or it’s just the version of the compiler that was released with the operating system. So, I understand what you mean by it depending on how the compilers were implemented and who’s rolled them out.
Robert Seacord 00:12:31 Yeah. And you know, Microsoft has always been an interesting case because they’ve always been sort of comfortable, partially supporting standards. So supporting parts of standards they like, but ignoring parts they don’t.
Gavin Henry 00:12:46 But then it’s not really a standard, is it? You either do it all, or you don’t.
Robert Seacord 00:12:50 Yeah, that’s true. So for a long time, they didn’t like all the parts of C99, and they just kind of took a pass on those bits, but they’ve sort of announced a direction where they want to sort of become more aligned with the C standard. They haven’t been sending anyone to the committee meetings, so it’s hard to tell exactly what their future relationship with the language is. But compilers like Clang and GCC do a very good job kind of keeping up with the latest version of the standards. And you can get some, even C23 kind of features supported in those compilers as well.
Gavin Henry 00:13:36 Excellent. Well, I’m going to move us onto the next section of the show, which was really about the top five security issues that I’ve come up with a bit of research, and I want you to correct me on them. So before we dig into these five, if we could spend a minute or two to understand why a C program can be insecure and then we’ll dig into the five issues I have listed?
Robert Seacord 00:14:02 Well, yeah, so all programming languages are insecure, and they’re all general-purpose programming languages. So they all can sort of achieve the same things, right? They’re all Turing complete, and they’ve got different abstractions, different idioms for programming in those languages. But the ways languages are broken can be quite different, right? Because that’s not an intentional design; it’s sort of the defect surface of the language, or however you want to describe it. And so, if you look at a language like Java, which had been billed as a secure language for many years, it’s got some serious problems with things like deserialization, which basically allows an attacker to execute their own code inside your virtual machine.
Gavin Henry 00:15:07 Very topical language at the moment, isn’t it, with everything that’s been going on the past two weeks. We have to be careful of timelines on this type of show, but the big issue with Log4j.
Robert Seacord 00:15:22 Yeah. And I mean, that’s, I haven’t studied that carefully. I mean, that mostly seems like a design flaw.
Gavin Henry 00:15:29 It’s kind of, like you said, where code can be injected and it runs where it shouldn’t be.
Robert Seacord 00:15:34 So yeah, Java has got a pretty significant attack surface, and it’s at a certain level where it’s sort of in the libraries and in the features, and the ways that those features can be misused to exploit the code. C, being a simpler compiled language, doesn’t have that attack surface. But C and C++ are well known for memory safety issues. And these are things where, basically, you read or write outside the bounds of an object. C and C++ are designed to be optimally efficient, so they sort of trust that the programmer’s not going to make these types of mistakes. And it turns out that trust has been very misplaced, because programmers make these mistakes all the time. And if you write outside the bounds of an object, that is undefined behavior and can have various consequences. Depending on what that write does, it could overwrite data, it could overwrite function pointers, it could overwrite the return address on the stack. And attackers can exploit that kind of problem by, among other things, injecting code into your process and overwriting the return address on the stack with the address of that malicious code. So when a function goes to return, instead of returning to the caller, it executes code that’s been injected by the attacker, and then that code runs with the permissions of the vulnerable process. So that’s a pretty significant style of attack.
Gavin Henry 00:17:25 Okay. Well that’s a good overview of a few things that would be insecure. Let me break down some of them before we start on this next bit, when we’re talking about C you mentioned the word object, which always makes me think of an object-oriented program, like a JavaScript object or a Java one, what do we mean in C when we talk about an object?
Robert Seacord 00:17:49 Oh, I didn’t know this was going to be a really deeply technical conversation.
Gavin Henry 00:17:54 Well, I suppose you can make it just a single-sentence definition of an object.
Robert Seacord 00:18:02 Yeah. We have a memory management study group that’s trying to answer that question.
Gavin Henry 00:18:07 Maybe we can’t do a simple answer then?
Robert Seacord 00:18:10 But basically an object is, okay, I mean, in C you have functions and you have objects, right? So an object is everything that’s not a function. So a variable would be an object, or you can have an object in dynamically allocated storage. So yeah, it’s basically a…
Gavin Henry 00:18:31 Yes, that’s correct. That is exactly what I was just going to read from your book. So in your book, Effective C, you say “an object is storage in which you can represent values. To be precise, an object is defined by the C standard as a region of data storage in the execution environment, the contents of which can represent values,” with the added note, “when referenced, an object can be interpreted as having a particular type.” So yeah, that is a big tick for that answer. Thank you.
Robert Seacord 00:19:03 Thanks, I’m glad I’m consistent.
Gavin Henry 00:19:06 So yeah, you touched on a couple of things that I was going to pull apart shortly, to do with how these memory issues are actually exploited. We’ll start off from my list. So in my own project and other things like that, whenever I save something or I’m working in an IDE and I push it to GitHub, I’ve got all sorts of static analysis on it that we’ll mention in the next section, but it usually comes back with something like a string issue. So I’ve always understood strings to be a security issue, as in not terminated, or an array of characters that people treat as a string when it’s not a string. Could you give us some information on why a string can be insecure?
Robert Seacord 00:19:56 Yeah. Strings are kind of rough. So strings are not a primitive type in either C or C++. They’re constructed on top of arrays, and C arrays are problematic in and of themselves, right? And so for starters, we know that there’s no implicit bounds checking, and there are a lot of functions such as strcpy where you’re copying a string from a source to a destination, and it’s going to copy the entire length of the string, but there’s no indication in that function of the size of, say, the destination array. And so strcpy will just do what you ask it to do, which is copy from this source to this destination, without checking to see if there’s room to make that copy of the string inside the bounds of that destination object.
Robert Seacord 00:21:00 And so one of the problems with arrays is that when you pass them to a function, they decay to a pointer to the first element of the array. And so once you’re inside the function, there is no way to determine the size of the entire array. So that size information has to be passed to be available. So functions like strcpy that aren’t passed the size are library functions that trust you, the programmer, to pass an object which will fit in the destination. Right. And if it doesn’t, you’ll have this undefined behavior and this potentially vulnerable code.
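To make the point about array decay concrete, here is a minimal sketch in C. It is not code from the episode: the helper name copy_string is hypothetical, and it simply shows one way to pass the destination size alongside the pointer so the copy can be refused when it would not fit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst only if it fits, including the
 * terminating null character. dst_size has to be passed in because dst has
 * decayed to a pointer and carries no size information of its own. */
bool copy_string(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);

    size_t needed = strlen(src) + 1;   /* characters plus the terminator */
    if (needed > dst_size) {
        return false;                  /* copying would write out of bounds */
    }
    memcpy(dst, src, needed);
    return true;
}
```

A caller that still has the array in scope can pass sizeof for the size: with char name[8], copy_string(name, sizeof name, "Robert") succeeds, while a source that needs nine bytes is rejected instead of overflowing name.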
Gavin Henry 00:21:45 I always remember that the name of an array is also a pointer. So when you pass it into a function, like you said, it decays to just the pointer. You can still find out what type of pointer that is inside your function? Is that correct?
Robert Seacord 00:22:05 Well, I mean, the type of the pointer is the pointer’s type. So, I mean, you could have void pointers in C, but that’s not particularly a great idea. So typically a string would be a char pointer, I mean, typically, I mean, correctly, it would be a char pointer. But you don’t know how long it is. And even the idea that it’s an array is not necessarily the case, right? It could just be a pointer to a single character.
Gavin Henry 00:22:41 So do you have to think about, well, I’ve seen code where they pass in the length. They usually use one of the standard functions, strlen, but again, that function has to figure out how long the string is. So do you have to take an extra step and make sure it’s null-terminated? Or is there something that we can reach for so we don’t have to think about any of this for strings? What do you recommend?
Robert Seacord 00:23:08 So again, there’s no string type, there’s no primitive string type. So it’s an array, and the definition of a string is that basically there’s a null character before the bound, right? So if there is no null character before the bound, it’s not actually a string, it’s a character array, right? And that’s okay. It’s okay to have a character array in C, it’s defined behavior. But it becomes undefined behavior if you pass a character array into a string function, because it’s going to examine each element of that character array for a null character. And it’s going to continue looking for a null character until it finds one. So with strlen, which again doesn’t take a size, it doesn’t know what size the string is that it’s examining. If it doesn’t find a null character before the bound, it’s going to continue to march through memory looking for a null character. And as soon as that function accesses storage beyond the bounds of the array, it’s now undefined behavior, right? And once you have undefined behavior in your code, all bets are off. That program can now exhibit any type of behavior. So there’s certainly a requirement to ensure that any string you pass to a string function is actually a string, meaning that it has null termination before the bound.
Gavin Henry 00:24:41 Yeah. I’ve seen some of the documentation on some of the string functions that look to work around this. They say, if there’s no null character found within the length that you pass, then we’ll make sure there’s one there.
Robert Seacord 00:24:58 Right, and a lot of newer, sort of more secure functions will ensure that when they create a string, it will be properly null-terminated. If you perhaps give it more data than it has room to store in whatever sized object you have, then it will overwrite the last character it’s trying to store with a null character. So you’ve got a properly null-terminated string. And so, I mean, this choice of a data type was made early on and could very well be the wrong data type. I mean, maybe having a size followed by the string and not using null termination would have been a better, more efficient, more secure design, but it’s not something that’s likely to change at this point in the evolution of these languages.
Gavin Henry 00:26:02 And I think to move on to number two on my list now, I think we’ve touched a little bit on it, and I’ve called this buffer overruns and underruns. And I think you’ve helped me understand the question I was going to ask in this section, where in my project, SentryPeer, I’ve got some errors in my IDE where I’m doing, I think it’s a strncmp, basically checking a URL that comes in to see if it matches one of my functions. So I’ve got the URL and I’ve got the size of how long I want it to look along the array of char to find a match, basically. So I’ve given it a max path length, I think it’s 1,024 or something. But my IDE says I shouldn’t check that URL string for longer than the string is, even though it finds a match. So my tests all work, because I think that’s just what you’ve explained there. Once it gets past the array of char, which might not be a string, i.e., not null-terminated, all bets are off, because it’s undefined behavior when it gets to, say, char 101 of a URL that’s a hundred chars long.
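As a rough illustration of the distinction between a character array and a string, the sketch below checks whether a null character appears before the bound without ever reading past it. The function name has_terminator is an assumption made for this example; memchr is the standard library call doing the bounded search.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check: returns true if a null character occurs within the
 * first `bound` bytes of buf, meaning buf holds a string rather than just a
 * character array. memchr examines at most `bound` bytes, so the search
 * stays inside the array as long as `bound` is the array's real size. */
bool has_terminator(const char *buf, size_t bound)
{
    assert(buf != NULL);
    return memchr(buf, '\0', bound) != NULL;
}
```

Only when this returns true is it safe to hand buf to functions such as strlen that keep reading until they find a null character.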
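One standard function that behaves the way described here is snprintf, which stops at the size it is given and always writes a terminating null character when that size is nonzero. The wrapper below is a minimal sketch, with the hypothetical name copy_truncating, that also reports whether truncation happened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copies as much of src as fits and always leaves dst
 * null-terminated. Returns true if the whole string fit, false if it was
 * truncated (or if snprintf reported an error). */
bool copy_truncating(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);

    /* snprintf writes at most dst_size - 1 characters plus the terminator
     * and returns the length the complete output would have had. */
    int wanted = snprintf(dst, dst_size, "%s", src);
    return wanted >= 0 && (size_t)wanted < dst_size;
}
```

Silent truncation can be a bug in its own right (a shortened path or URL may mean something different), which is why the sketch returns a flag rather than hiding the fact that data was dropped.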
Robert Seacord 00:27:21 You definitely can’t examine characters beyond the bounds of that object, beyond the bounds of that character array.
Gavin Henry 00:27:31 Yes, so I think when the URL comes in, you need to do a size check on it and then make sure you’re not checking past that for a match. Is that the correct way?
Robert Seacord 00:27:40 Yes. I mean, so you’ve got a max path buffer that you’re storing it in. So you’ve got that amount of room for that array, but you’re comparing it to another string. And so you don’t want to exceed the bounds of either of those character arrays.
Gavin Henry 00:28:04 Actually, it’s strncmp. So I’ve got the URL and the length of the string that I want to compare against. So, like, for /home, I want to make sure that goes to the right place, or /about, or some about page, and I’ve got a max length. So it’s going along that string for as long as the length I passed, and that’s when it says that’s bad, but you don’t know how long the path is until you’ve calculated the path. We kind of get into this chicken-and-egg type situation. But yeah. So when we talk about going past the end of an array, that would be an overrun? Is that a buffer overrun? Or is that an underrun?
Robert Seacord 00:28:46 So there are these terms that get kicked around in security, like buffer overflow and buffer underrun and overrun. And I don’t know what any of those words mean. I mean, they’re kind of loosely used terms in security, but they don’t have very precise definitions. So in the C language, really, we just talk about an access outside of the bounds of an object. And we don’t care about what that access looks like, right? So you could start at the beginning of an array and you can increment a pointer or an index and then run off the end of the array, right? And that’s an out-of-bounds access. You might call that a buffer overflow. And then you could start at the end of an array and you could decrement the pointer and you can run off that end of the memory.
Robert Seacord 00:29:42 Sometimes you’ll just sort of arbitrarily jump; you might have some sort of integer error and jump from accessing an array to some random place in memory. And again, I don’t know what that’s called. Maybe that’s a buffer overflow or a buffer overrun, but it’s definitely an access outside of the bounds of that object, which is undefined behavior. You can take a pointer into an array and you can add or subtract an integer value, as long as the pointer still refers to the same array or to one past that array, the too-far element. But if the pointer you form from that pointer arithmetic is outside of that bound, it’s just undefined behavior. And what you call it kind of varies. It’s a little bit unrelated, but people like to talk about integer overflow and integer underflow in C, but there’s actually no such thing as integer underflow. That’s just someone’s creation. If you have an integer operation that forms a value that can’t be represented, that’s integer overflow. There is no such thing as integer underflow, but people like to use that term for whatever reason.
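One way to stay inside the bounds of both character arrays is to compare lengths first and only then compare bytes. The sketch below is an illustration under assumptions, not SentryPeer’s actual code: path_equals is a hypothetical name, the incoming URL is described by a pointer plus a known length (it need not be null-terminated), and the route is an ordinary string literal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical route check. url points to url_len valid bytes; route is a
 * null-terminated string. Because the lengths are compared first, memcmp
 * never reads beyond url_len bytes of url or beyond the end of route. */
bool path_equals(const char *url, size_t url_len, const char *route)
{
    assert(url != NULL && route != NULL);

    size_t route_len = strlen(route);
    if (url_len != route_len) {
        return false;
    }
    return memcmp(url, route, route_len) == 0;
}
```

The design choice is to bound the comparison by the data actually received rather than by the capacity of the buffer it was stored in, which sidesteps the problem of a maximum path length that is larger than the string being examined.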
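The two rules mentioned here can be sketched in a few lines of C. The function names contains and checked_add are hypothetical. The first uses a one-past-the-end pointer only for comparison, never for access; the second tests for integer overflow before performing the addition, since the overflowing operation itself would be undefined behavior for signed integers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Searches n ints for value. `end` points one past the last element:
 * forming and comparing that pointer is allowed, dereferencing it is not. */
bool contains(const int *arr, size_t n, int value)
{
    assert(arr != NULL);
    const int *end = arr + n;
    for (const int *p = arr; p != end; ++p) {
        if (*p == value) {
            return true;
        }
    }
    return false;
}

/* Stores a + b in *result only if the sum is representable as an int.
 * Both directions (too large, too negative) count as integer overflow. */
bool checked_add(int a, int b, int *result)
{
    assert(result != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *result = a + b;
    return true;
}
```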
Gavin Henry 00:31:07 Well, it’s a good explanation. Thank you. So we’ve done something here where we’ve gone outside the bounds of what we’re trying to do. The third thing on my list is what I’ve called memory leaks. So when you request some memory from the operating system with one of the allocation functions and you don’t free it, you get what I think is called a runtime leak, or corrupt memory. So runtime would be where you’re continually asking for this memory, but you’re not freeing it. So you’re using more than you should be. Is that a correct definition?
Robert Seacord 00:31:47 There’s a lot of stuff that was slightly wrong in that question.
Gavin Henry 00:31:53 That’s what I want to hear. Correct me.
Robert Seacord 00:31:55 Yeah. So for starters there are the memory allocation functions, right? malloc, calloc, realloc, aligned_alloc, and none of these directly request memory from the operating system. Right? So the process has a memory allocator that runs as part of the same process space, right? And so your memory allocator will request a very large block of memory from the operating system, and then it will manage that. And so when you make a call to malloc, it’s allocating a piece of storage from this large block of memory that the memory manager’s managing within the process, right?
Gavin Henry 00:32:38 So is it part of the kernel that’s doing this memory management?
Robert Seacord 00:32:42 No, it’s all in your process. So the memory management, you’re going to link to a library, and that library has implementations of strcpy and malloc, and all of these functions run as part of your executable, in your process.
Gavin Henry 00:32:58 So this isn’t like a memory pool that I’ve created. This is something to do with how my executable has been created?
Robert Seacord 00:33:05 So I mean, when you start up, the memory manager is going to go to the operating system, and it’s going to get a block of memory. But then once it gets this large block, which is basically the heap, your memory manager is now going to manage that heap storage for you. So, when you make a request to malloc, that’s going to execute the malloc function, which is part of this memory manager implementation. And it’s going to say, what’s the next available block of memory that’s at least this number of bytes large, and carve that off this bigger block and return that to the user. So that entire process doesn’t involve the kernel at that point, right? That block’s been carved out. The only time the kernel might become involved again is if you completely use all the memory allocated from the operating system; you might then seek to sort of extend that. But an implementation doesn’t necessarily do that. I mean, the other possibility is that at that point, that allocation would fail for inadequate memory.
Gavin Henry 00:34:23 Okay, and so when we’re talking about these bounds things that happen, I’m not going to use the word overrun or underrun, okay? Does it make a difference if something goes out of bounds into memory that we haven’t freed, or are we contained within what the memory allocator has given us? Or is it just undefined? Is there a difference between, say, we’ve corrupted some of our own memory that we’ve not freed, and then one of these array operations we’re doing ends up trying to go into that, or is it just undefined? What I’m trying to ask is, when you see exploits of these types of things, and they know that we’re not cleaning up memory, or there’s some type of memory they can get to with this exploit to run their own code, how do they predictably get at that if these things are all quite undefined and random?
Robert Seacord 00:35:23 Well, undefined is a term used by the standard, right? So, the standard says, simply, we haven’t defined what happens here. And so a particular implementation, of course, is going to do something. And because what it does is not defined by the standard, you as a programmer don’t really know what it does, right? So sometimes the implementation will sort of align with your expectations as a programmer of what sort of behavior you’re going to get, in which case you could have an executable generated from code containing undefined behavior which is actually correct. But more commonly, if you’re invoking undefined behavior, that suggests that you don’t have a correct understanding of the language with regards to that behavior, and most likely the code is in error. Now when we talk about memory, heap memory, there are several classes of potential errors which can lead to vulnerabilities. The first one, which we’ve kind of discussed with arrays, is buffer overflows, right?
Robert Seacord 00:36:38 So buffer overflows can occur in any memory segment, so they can occur in the stack, in the data segment, or in the heap. And the consequence is, so, an overflow in the heap, and anytime you write outside the bounds of an object, it’s undefined behavior.
Gavin Henry 00:36:57 Can you define the stack and the heap briefly, just in context?
Robert Seacord 00:37:01 So the stack and the heap, I mean, I’ll start out by saying that neither concept is defined in the C standard. So these are kind of like implementation concepts, but typically a stack is a data structure which supports program execution by allowing you to have a function that calls another function and then creates a stack frame for the function that it’s calling, where it preserves all the local variables and arguments that are being passed to that function and so forth.
Robert Seacord 00:37:42 And then that function could call another function, and that function could recurse, right? So you could wind up with multiple instances of the same function on the stack. And then once the function returns, the stack sort of unwinds. So you would return back to the calling function and re-establish that function’s stack frame so it has access to its local variables. And so the execution stack is a data structure to allow for this basically functional style of programming. So that’s the stack, and a typical variable that you would declare inside of a function, a non-static variable, if you just have a function and you declare a variable in it, that variable is an automatic variable that’s declared in the scope of that function. And what happens is, when you call that function, a stack frame gets created for that function and an instance of that variable gets allocated on the stack, right?
Robert Seacord 00:38:44 And so once that function returns, the lifetime of that variable ends, and it can no longer be accessed. So you’ve got two other data segments. You have the data segment, which is where static variables go, and static variables have the same lifetime as that of the program. So they’re always accessible. And that’s where you might keep a counter or something, right? Where you’ll call a function, you’ll increment this counter, the function will exit, but the count will still remain because it’s a global variable. And global variables have their uses and they have their problems. But the next segment is the heap. And the heap is where dynamically allocated storage exists. And the heap allows you to allocate storage as you need it during program execution.
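The idea of carving requests out of one larger block, entirely inside the process, can be illustrated with a deliberately tiny allocator. This is a teaching sketch under loud assumptions, not how any real malloc works: the type arena and function arena_alloc are hypothetical names, the large block is a fixed-size array rather than memory obtained from the operating system, and nothing is ever freed or reused.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define ARENA_SIZE 1024

/* Hypothetical stand-in for the large block a memory manager controls. */
typedef struct {
    alignas(max_align_t) unsigned char bytes[ARENA_SIZE];
    size_t used;   /* how much of bytes[] has been handed out so far */
} arena;

/* Carves the next `size` bytes off the block, rounded up so every returned
 * pointer is suitably aligned. Returns NULL when the block is exhausted,
 * which mirrors an allocation failing for inadequate memory. */
void *arena_alloc(arena *a, size_t size)
{
    assert(a != NULL);
    const size_t align = alignof(max_align_t);

    if (size == 0 || size > ARENA_SIZE) {
        return NULL;
    }
    size_t rounded = (size + align - 1) / align * align;
    if (rounded > ARENA_SIZE - a->used) {
        return NULL;
    }
    void *block = a->bytes + a->used;
    a->used += rounded;
    return block;
}
```

No system call happens in arena_alloc, which is the point being made about malloc: the request is satisfied by library code running in the same process, and the kernel is only involved when the manager needs more memory than it already holds.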
Robert Seacord 00:39:52 And those objects persist until they’re explicitly de-allocated or destroyed. So, those have their own kind of lifetime. It’s based on you, allocating and de-allocating.
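As a rough illustration of the three lifetimes described above, here is a minimal C sketch. The names `call_count`, `count_call`, and `make_counter` are made up for this example and do not come from the episode; the stack and heap themselves are implementation concepts, so the comments use the standard’s terms (automatic, static, and allocated storage duration).

```c
#include <stdlib.h>

/* Static storage duration: exists for the whole run of the program. */
static unsigned int call_count = 0;

/* Returns how many times this function has been called so far. */
unsigned int count_call(void)
{
    int step = 1;   /* automatic: its lifetime ends when this call returns */

    call_count += (unsigned int)step;
    return call_count;
}

/* Allocated storage duration: the int outlives this call and stays valid
 * until the caller passes it to free(). Returns NULL if allocation fails. */
int *make_counter(int start)
{
    int *counter = malloc(sizeof *counter);

    if (counter != NULL) {
        *counter = start;
    }
    return counter;
}
```

The caller of `make_counter` owns the returned object and is responsible for eventually calling `free` on it, which is exactly where the leaks and double frees discussed next come from.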
Gavin Henry 00:40:08 So that’s where the leak could happen. Corrupt.
Robert Seacord 00:39:12 Yeah. So there’s the buffer overflows on the heap, and those are exploitable, and how they’re exploited depends on the implementation of your memory manager. Some memory managers implement the Knuth algorithm, which uses boundary tags, where you’ll have control structures before and after each allocated block. So if you write beyond the bounds of the allocated object, you’ll start overwriting these control structures in the heap, corrupting the heap, and an attacker could overwrite those structures, basically, again, to execute arbitrary code. And the specifics of that depend on the implementation of the allocator.
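To make the boundary-tag idea a little more concrete, here is a toy model in C. It is only a sketch under loud assumptions: `struct tag`, the `arena` array, and the `toy_*` functions are invented for illustration and do not match the layout of any real allocator. The “heap” is a single static array, so the simulated overrun stays inside one object and the demonstration itself does not invoke undefined behavior.

```c
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE ((size_t)256)

/* Toy control structure; real allocators use their own, different layouts. */
struct tag {
    size_t size;    /* payload size in bytes */
    size_t in_use;  /* 1 if the block is allocated */
};

/* One array object standing in for the heap. */
static unsigned char arena[ARENA_SIZE];

/* Lays out [tag][payload][tag] at the start of the arena.
 * Returns the payload offset, or 0 if the block does not fit. */
size_t toy_place_block(size_t payload_size)
{
    struct tag t;

    if (payload_size > ARENA_SIZE - 2 * sizeof t) {
        return 0;
    }
    t.size = payload_size;
    t.in_use = 1;
    memcpy(arena, &t, sizeof t);
    memcpy(arena + sizeof t + payload_size, &t, sizeof t);
    return sizeof t;
}

/* Returns 1 if the tags before and after the payload still agree.
 * payload_offset must be a value returned by toy_place_block(). */
int toy_tags_intact(size_t payload_offset)
{
    struct tag head, foot;

    memcpy(&head, arena + payload_offset - sizeof head, sizeof head);
    memcpy(&foot, arena + payload_offset + head.size, sizeof foot);
    return head.size == foot.size && head.in_use == foot.in_use;
}

/* Simulates a copy into the payload that is checked only against the
 * arena, not against the payload size recorded in the tag. */
void toy_unchecked_fill(size_t payload_offset, size_t n)
{
    if (payload_offset <= ARENA_SIZE && n <= ARENA_SIZE - payload_offset) {
        memset(arena + payload_offset, 'A', n);
    }
}
```

Placing a 16-byte block and filling exactly 16 bytes leaves the tags intact; filling 17 bytes clobbers the first byte of the trailing tag, which is the kind of control-structure corruption described above.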
Robert Seacord 00:40:58 But there are also at least two other classes of problems with allocated memory. So, one is you allocate memory, and then you fail to deallocate it, to release it. That’s a memory leak. And a memory leak can be benign if you have a short-running program and you don’t ever exhaust memory. But if you have something like a server that’s going to run for extended periods of time, if it’s continually leaking memory as it runs, that memory is no longer available to the memory allocator to allocate to the process. So eventually that system is going to exhaust memory, and once that happens, your server’s not going to be very effective at serving, because it’s going to start having memory failures and constantly be in a state of trying to recover from memory errors.
Robert Seacord 00:42:05 And so that situation is sort of known as resource exhaustion. And one form of attack is a denial-of-service attack by resource exhaustion, right? Where an attacker finds a memory leak in your system and exploits that to exhaust your memory. And now it appears that your server is operational, but actually it’s no longer serving requests because it’s out of memory and it can’t function properly. So failing to properly deallocate storage when it’s no longer required can lead to those sorts of denial-of-service attacks. The other problem is you can accidentally release the same storage multiple times. And that’s often referred to as a double-free vulnerability. A double-free vulnerability looks a little bit different, but it can have the same consequence as a buffer overflow in the heap, which is that an attacker could exploit that to execute arbitrary code. So double free is also quite a dangerous sort of coding error.
Gavin Henry 00:43:17 Would you be able to give an example? I know it’s hard because it depends on the program, the implementation, where it’s running, and things, as far as I understand it. But how can an attacker exploit what you just explained with a double free, or an overflow or underflow? How do they get this code in? Is it assembly language that they put in the code and they inject that into this area of memory? Or what does that look like?
Robert Seacord 00:43:45 So if we just discussed just sort of a basic exploit
Gavin Henry 00:43:53 Put in your name or something, I don’t know, something really.
Robert Seacord 00:43:57 Yeah. Independent of the error, what can happen is an attacker can inject executable instructions into your process memory, and it can really do that on any input operation. And there’s executable code that looks like valid ASCII, executable code that looks like valid UTF strings. So whatever type of string you’re inputting, it’s always a good idea to validate that string to the extent possible, but sometimes you just can’t; sometimes it’s just kind of string data.
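As one hedged example of validating a string “to the extent possible”, the sketch below parses a TCP port number from untrusted text such as a command-line argument. The function name `parse_port` and the accepted range are assumptions chosen for illustration; the library calls (`strtol`, `isdigit`) are standard C.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Parses a decimal port number in the range 1..65535 from untrusted text.
 * Returns 0 and stores the value on success, -1 on any invalid input. */
int parse_port(const char *text, unsigned short *port_out)
{
    char *end = NULL;
    long value;

    if (text == NULL || port_out == NULL) {
        return -1;
    }
    /* strtol would accept leading spaces and a sign; reject them up front.
     * This also rejects the empty string. */
    if (!isdigit((unsigned char)text[0])) {
        return -1;
    }
    errno = 0;
    value = strtol(text, &end, 10);
    if (errno == ERANGE || *end != '\0') {
        return -1;   /* out of range for long, or trailing junk */
    }
    if (value < 1 || value > 65535L) {
        return -1;   /* not a usable port number */
    }
    *port_out = (unsigned short)value;
    return 0;
}
```

With this helper, `parse_port("8080", &p)` succeeds, while inputs such as `"80abc"`, `"-1"`, `"70000"`, or an empty string are rejected before the value is used anywhere else.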
Gavin Henry 00:44:38 I think you’ve really got a good section in your Effective C book on validating the program arguments on the command line. I find it really extensive.
Robert Seacord 00:44:48 Oh, thanks. And I mean, Secure Coding in C and C++ really goes into these exploits more. The Effective C book is meant more as an introductory text. So I don’t try to go too in depth into exploits or how to write exploits. But I tried to write that book to provide kind of a strong foundation to programmers.
Gavin Henry 00:45:15 I think that’s why I like it so much.
Robert Seacord 00:45:18 Thank you. I mean, in a way, if you code correctly and you avoid undefined behaviors, your code is secure. You don’t need to understand how it might be exploited, but the study of sort of how code is exploited is really motivational. It’s for people who say, oh, I’ve got a legacy code base with tens of thousands of errors. So how do I prioritize that? And so you kind of talk about what the various errors are, how they can be exploited, and how you might mitigate against these problems, sometimes with sort of runtime strategies which would protect against exploits of all of these. And then also about secure coding practices, how to code correctly so it’s not exploitable. But getting a poorly written legacy system to be secure can be a significant investment in rewriting and improving the code.
Gavin Henry 00:46:23 Yeah. I think you’ve touched nicely on number four on my list, which is inputs. So I’ve got some questions to do with processing command line arguments, environment variables, defensive programming, how network traffic is processed at runtime into data structures, things like that. I think, just really understanding, listening to what you explained with the memory leaks and attack vectors, it just depends on how the input is coming into your program and whether you processed it correctly. That could be how it is when you see the CVE exploit list, and it says there’s a double free or a buffer overflow or something in certain situations doing this: if the wind’s blowing northwest and you’re wearing your favorite jumper, this might get exploited, type thing. It just depends on how it’s coming into that program and what the program does. Is that a fair summary?
Robert Seacord 00:47:24 Yeah. Some of it is quite tricky, right? I mean, so you’ll look at some source code and it will have some undefined behavior, and it might be that on this platform, under these circumstances, with whatever runtime protections are available, this particular coding error won’t be exploitable, right? But you could port that to a different system. You could run on a different platform, you could change something about the runtime environment, or you could upgrade your compiler, where the compiler used to do one thing with an undefined behavior, but now they’ve developed an optimization that takes advantage of that undefined behavior to improve your performance. And now, the error was always present in the source code, but because of this new optimization, that executable has been changed.
Robert Seacord 00:48:28 And it’s now vulnerable to attacks. So many times it’s easier to repair the code than it is to understand all the potential security consequences of an exploit. So in cases where it’s cheap to fix, it usually just makes sense to fix it. I mean, there are some cases where, if you put some code on the Mars Rover and you landed it on Mars, right? It’s a bit more involved to repair that code, right? So you want to analyze that defect more. You want to analyze that vulnerability more to find out how much of a security risk it is, whether it’s worth repairing or not. But in many cases it’s just easier to make the repair to the source code, because once the undefined behaviors are eliminated, you shouldn’t have vulnerabilities in most cases. Now there are vulnerabilities which can exist absent undefined behavior. These can be logical errors or just simple things like a memory leak, right? So if your program accidentally prints out or logs some personally identifiable information, it doesn’t necessarily have to have undefined behavior to do that. Right? So you could have, I almost want to use the phrase insecure by design, where there’s not,
Gavin Henry 00:50:05 This has nothing to do with C, that’s just software engineering, is it not?
Robert Seacord 00:50:10 Right.
Gavin Henry 00:50:12 Okay. And I think that was a good summary. And so with an improved compiler, could that catch a double free, if it’s tracking the number of times you freed something, or what? A garbage collection system?
Robert Seacord 00:50:28 Oh yeah. Well, C doesn’t really have garbage collection.
Gavin Henry 00:50:33 That was just an example.
Robert Seacord 00:50:34 Yeah. So double free, those types of errors, there are ways to catch them. Right? So the compiler does some analysis, right? It doesn’t do a lot of analysis. So there are static analysis tools that do more in-depth analysis.
Gavin Henry 00:50:56 So that I’m going to touch on in the next section. I’ve really enjoyed this middle section, but let me move us on because we’re over our time on this. So just the last thing I have on my list, because I think we’ve done a really good job. And I didn’t say at the time, but I really enjoyed your description of the stack and the heap; that made everything really clear. So the last point is, sorry, that was a bad pun. It’s dangling pointers. What are these and what problems do they cause? Just a minute or two, and then we’ll move on to the tools to help you be a better programmer.
Robert Seacord 00:51:30 Well, they certainly cause bad puns, but the problem with a dangling pointer is that it could lead kind of directly to two classes of exploitable defects, right? One being double free, which we’ve just discussed, right? So if you free a pointer and you don’t set it to null, you could free that pointer a second time, and we’ve already discussed that that can be vulnerable. If you do set it to null, and you free a null pointer, that’s a no-op. So that has no effect on the code. The other problem with a dangling pointer is that it’s now pointing to memory which has been deallocated, or possibly deallocated and then reallocated. So writing to that pointer is now undefined behavior. And say that storage is deallocated and you write to it: once you deallocate storage, the memory manager takes it over, and it might use that kind of user space to insert control structures in order to keep track of free blocks of storage. So if you write to these dangling pointers, again, you could overwrite these control structures, corrupting the heap, and potentially doing that in a way which, again, makes it possible to execute arbitrary code.
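The “set it to null after you free it” habit can be sketched as follows. The macro name `FREE_AND_NULL` and the function `release_twice_safely` are hypothetical names for this illustration; the only library behavior relied on is that `free(NULL)` does nothing, which the C standard guarantees.

```c
#include <stdlib.h>

/* Hypothetical convenience macro: release the storage and clear the
 * pointer in one step. The argument must be a modifiable pointer lvalue
 * without side effects, because it is evaluated twice. */
#define FREE_AND_NULL(ptr) \
    do {                   \
        free(ptr);         \
        (ptr) = NULL;      \
    } while (0)

/* Small demonstration: returns 1 if the pointer is null after being
 * released twice. */
int release_twice_safely(void)
{
    char *name = malloc(16);

    FREE_AND_NULL(name);   /* storage released; name no longer dangles */
    FREE_AND_NULL(name);   /* free(NULL) is a no-op, so no double free */
    return name == NULL;
}
```

Note the limit of this technique: it clears only the one pointer variable it is given. Any other copies of the same address elsewhere in the program still dangle and must be handled separately.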
Gavin Henry 00:52:50 Yeah. I’ve seen that a lot in something that I do in my code, and in Jens Gustedt’s book, who I had on the show in Episode 414 and who you know because you work with him on the standards. And also a shout-out to your articles for the IEEE on secure coding in C and C++, on strings and integers, and your other article on Effective C. I’ve got those links in the show notes. But in all his, and I think in your, code examples, after free, the pointer is set to zero, which is null. Excellent, that was really good coverage. In the last section, I don’t have as much time as I hoped, but we’ve done some good crossovers here. So we’ve got IDEs and things that we use as we’re writing the code that try and give us as much help as possible. We’ve got sort of build tools, but you mentioned earlier static and dynamic analysis. I think you mentioned dynamic analysis, but I’ll mention it in here anyway. So what are static analysis and dynamic analysis, and how do they help?
Robert Seacord 00:54:00 These are just kind of tools and approaches to analyze the code and understand what it does and what potential defects it might have.
Gavin Henry 00:54:14 So it looks at the source code, the physical files. Well, not physical, the text files.
Robert Seacord 00:54:19 So static code analysis looks a bit like a compiler, right? So it builds your source code, typically into an abstract syntax tree. So it creates a structure and then it might build some additional graphs that can be analyzed. And then you’ll have a series of rules where you say, I don’t want to free a pointer and then free a pointer a second time. And so the static analysis will examine the graphs of the source code, the abstract syntax tree. And it will look for different structural defects in the code, or potentially do some path analysis or some data flow analysis. So static analysis tends to be very good at finding, say, structural problems in a program; it’s not as good at data flow and control flow type problems.
Gavin Henry 00:55:19 The things that have caught me on this are where you return from the function because there is an error, but you haven’t freed what you’ve allocated beforehand. That’s always something that I find in my stuff.
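One common way to avoid the “returned early on error but forgot to free” mistake is to funnel every failure through a single cleanup label. The sketch below is illustrative only: `lowercase_alnum_copy` is a made-up function, and the `goto fail` layout is one of several idioms that work.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated lowercase copy of input, or NULL if input is
 * null, contains a non-alphanumeric character, or allocation fails.
 * The caller must free() a non-null result. */
char *lowercase_alnum_copy(const char *input)
{
    char *copy = NULL;
    size_t len, i;

    if (input == NULL) {
        return NULL;            /* nothing allocated yet */
    }
    len = strlen(input);
    copy = malloc(len + 1);
    if (copy == NULL) {
        goto fail;
    }
    for (i = 0; i < len; i++) {
        unsigned char ch = (unsigned char)input[i];

        if (!isalnum(ch)) {
            goto fail;          /* error after allocation: must not leak */
        }
        copy[i] = (char)tolower(ch);
    }
    copy[len] = '\0';
    return copy;

fail:
    free(copy);                 /* free(NULL) is a no-op */
    return NULL;
}
```

Because every error path after the allocation goes through `fail`, adding a new validation step later does not require remembering to add another `free` call.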
Robert Seacord 00:55:34 There are some problems that are fairly amenable to static analysis, but frequently memory management and concurrency problems aren’t always discoverable through static analysis. So often dynamic analysis is more effective at finding these types of problems. And so you do have things like AddressSanitizer and ThreadSanitizer that are available in Clang and GCC, and a variety of other products, and these allow you to instrument the executable. And then once it’s instrumented, you’ll exercise it using whatever variety of tests you have available, perhaps using fuzzers to drive the code with various inputs. And this instrumented executable will now be able to basically trap on any sort of violation. So dynamic analysis is more effective at discovering things like dynamic memory issues and concurrency issues, basically at run time.
Gavin Henry 00:56:52 Some of the things that you’ve mentioned in your book that I’ve played with and used in my projects are the sanitizers: TSan, which is the thread one; ASan, which you mentioned as well, the address sanitizer for memory things; and then UBSan, which is the undefined behavior one. Where I seem to find errors using those is when I’m running my test suite, because I’m not as careful as when I’m actually running the core product, as it were. I always find issues where I’ve set up the test case but I haven’t torn it down or something, you know. Which is kind of a biggie, and you should sort those out as you find them. And then some of the other tools I see other people use are the sanitizers, the Clang sanitizer one that you mentioned, and then there’s loads of, I think you mentioned a few in your book, but if you’ve got an open source project, it’s quite easy to get access to all these free tools. But I think most of them are commercial. I’ll put the links into my show notes for that.
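For the “set up the test case but never tore it down” situation, a paired setup and teardown helper is one way to keep sanitizer runs clean. This is a sketch with hypothetical names (`struct fixture`, `fixture_setup`, `fixture_teardown`, and the file name `tests.c`); the `-fsanitize=` options in the comment are real GCC and Clang flags, but whether leaks are reported at exit depends on the platform and sanitizer configuration.

```c
/* One possible way to build a test program with sanitizers enabled
 * (tests.c is a placeholder file name):
 *
 *   cc -g -fsanitize=address,undefined -fno-omit-frame-pointer tests.c
 *
 * ThreadSanitizer cannot be combined with AddressSanitizer, so it needs
 * its own build:
 *
 *   cc -g -fsanitize=thread tests.c
 */
#include <stdlib.h>

struct fixture {
    unsigned char *buffer;
    size_t length;
};

/* Allocates a zeroed buffer for one test case.
 * Returns 0 on success, -1 on failure. */
int fixture_setup(struct fixture *fx, size_t length)
{
    if (fx == NULL) {
        return -1;
    }
    fx->buffer = NULL;
    fx->length = 0;
    if (length == 0) {
        return -1;
    }
    fx->buffer = calloc(length, 1);
    if (fx->buffer == NULL) {
        return -1;
    }
    fx->length = length;
    return 0;
}

/* Releases everything fixture_setup() acquired. Safe to call twice. */
void fixture_teardown(struct fixture *fx)
{
    if (fx == NULL) {
        return;
    }
    free(fx->buffer);
    fx->buffer = NULL;
    fx->length = 0;
}
```

Calling `fixture_teardown` at the end of every test, including the ones that fail an expectation partway through, keeps leak reports focused on the code under test rather than on the test harness.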
Robert Seacord 00:57:56 And I don’t know where to go with this. I mean, really, C is a difficult language.
Gavin Henry 00:58:05 It’s simple, but it’s simply hard as well. Isn’t it?
Robert Seacord 00:58:10 Simple, I’m not sure. It’s smaller than other languages. And so I guess in that respect, you could say it’s simple, but there are so many layers to it that I’m still peeling after I started programming C in ’95. So I’m still peeling.
Gavin Henry 00:58:33 And what sort of things have you come across recently that surprised you?
Robert Seacord 00:58:38 So here’s a good one. This was probably the most recent thing that surprised me. So you can define an enum, and you can have an enumeration constant which has a type which is different from the base type of the enumeration type.
Gavin Henry 00:58:57 Aren’t enums just supposed to be a thing that meant something to you?
Robert Seacord 00:59:04 Well, there’s this question. There’s always this question of what is the type of these things, right? So you write enum color, red, green, blue. Okay. So what type are those things?
Robert Seacord 00:59:12 So there’s a strong tendency to, well, the standard will say that the enumeration constants, the red, green, blue, those should all be int. But you could, for example, pass GCC or Clang a flag which says to use short enumerations. So in a case like that, for red, green, blue, GCC or Clang might say, oh, I’ve only got three values, 0, 1, 2. I can easily fit that in an unsigned char. So I’m going to save quite a lot of storage and make this type unsigned char. So now you’ve got the base type of this object as unsigned char, but the type of each enumeration constant is int. And mostly you don’t notice this, but there are cases where, say, you’re doing generic programming and you’re trying to execute some particular code based on the type of something. It might come as a surprise to people to discover that the type of the constant is different from the type of the enum object. That’s somewhat surprising. That’s the one that’s got me most recently.
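The enum surprise can be observed with C11’s `_Generic`. In the sketch below, the macro `TYPE_NAME` and the three functions are invented for illustration. The type chosen for an object of type `enum color` is implementation-defined, so the result of `enum_object_type_name` varies between compilers and options (for GCC and Clang, the `-fshort-enums` option is the kind of flag being described), while the constant `red` has type `int` here.

```c
enum color { red, green, blue };

/* Maps an expression to the name of its type, for a handful of
 * candidate integer types. */
#define TYPE_NAME(x) _Generic((x),        \
    char: "char",                         \
    signed char: "signed char",           \
    unsigned char: "unsigned char",       \
    short: "short",                       \
    unsigned short: "unsigned short",     \
    int: "int",                           \
    unsigned int: "unsigned int",         \
    default: "other")

/* Returns 1 if the enumeration constant red has type int. */
int constant_is_int(void)
{
    return _Generic(red, int: 1, default: 0);
}

/* Returns the name of the integer type that an object of type
 * enum color is compatible with on this implementation. */
const char *enum_object_type_name(void)
{
    enum color c = green;

    return TYPE_NAME(c);
}

/* Returns the storage size of an enum color object in bytes. */
unsigned long enum_object_size(void)
{
    return (unsigned long)sizeof(enum color);
}
```

Printing `enum_object_type_name()` with and without a short-enumeration option is a quick way to see the mismatch between the constant’s type and the object’s type on a given toolchain.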
Gavin Henry 01:00:36 You mentioned something there: what’s the point of a signed char and an unsigned char? Just because you mentioned it.
Robert Seacord 01:00:43 Well, signed char and unsigned char are basically small integer types. If you want to represent a character, you should use char, plain char, and all three of those types are different and incompatible types.
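That `char`, `signed char`, and `unsigned char` are three distinct types can be checked with `_Generic`, which would not even compile if two of the listed types were compatible. The macro `CHAR_KIND` and the function names below are made up for this sketch; whether plain `char` behaves as signed or unsigned is implementation-defined, which is why `plain_char_is_signed` consults `CHAR_MIN` rather than assuming.

```c
#include <limits.h>

/* Selects a label based on which of the three character types x has. */
#define CHAR_KIND(x) _Generic((x),        \
    char: "char",                         \
    signed char: "signed char",           \
    unsigned char: "unsigned char",       \
    default: "not a character type")

const char *kind_of_plain(void)
{
    char c = 'a';           /* use plain char for character data */

    return CHAR_KIND(c);
}

const char *kind_of_signed(void)
{
    signed char c = 1;      /* a small signed integer */

    return CHAR_KIND(c);
}

const char *kind_of_unsigned(void)
{
    unsigned char c = 1;    /* a small unsigned integer, or a raw byte */

    return CHAR_KIND(c);
}

/* Returns 1 if plain char has a negative minimum on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```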
Gavin Henry 01:01:00 Perfect. Okay. Just before we start wrapping up the show, just to put some more meat into the tools section: that was good coverage of static and dynamic analysis. We’ve mentioned TSan and ASan and UBSan.
Gavin Henry 01:01:18 But over the show we spoke about Annex K; is that something that we can actually use today? It’s been out for a while. You mentioned that in your book and Jens mentioned it in his. Do you recommend it?
Robert Seacord 01:01:34 Yeah. I like it. There are two schools of thought there, and we voted on this in the committee a couple of times, and the committee is equally divided on this: half the committee hates it, half the committee likes it. And because it’s in the standard, you can’t eliminate it. You can’t change the standard without consensus, right. It’s the status quo; you can’t add anything, you can’t remove anything without consensus. And some of the history of this: it started with Microsoft back in the ’90s as a reaction to some well-publicized vulnerabilities. And basically it sort of improves upon the existing string library functions by typically adding an additional argument which specifies the size of the destination array. So now when you call these functions, they can determine that there’s not enough room in this destination array to make a copy of this string.
Robert Seacord 01:02:40 And so rather than write beyond the bounds of the object, I’m just going to indicate an error, either by invoking a runtime constraint handler or returning an error value. And so I like these. I think they make it easier for novice programmers to avoid buffer overflows and undefined behaviors. Companies like Cisco have used these extensively and swear by them. They claim to have had significant improvement in quality and security as a result of using these functions. So they aren’t available in Clang and GCC. A lot of the vendors sort of don’t like these libraries; that might be because they originated from Microsoft, or it could be for other reasons. But there are third-party versions of these libraries that you can download and use, and they are a standard API. So I like them. I would recommend their use.
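The “extra argument for the destination size” idea can be sketched as below. The wrapper name `copy_checked` is hypothetical. The Annex K branch uses the standard `strcpy_s` and `set_constraint_handler_s` interfaces and is compiled only when the implementation defines `__STDC_LIB_EXT1__`; on the many toolchains that do not ship Annex K, the fallback branch performs an equivalent check by hand and is this example’s own code, not part of any library.

```c
#define __STDC_WANT_LIB_EXT1__ 1
#include <stdlib.h>
#include <string.h>

/* Copies src into dst only if it fits, including the terminating null.
 * Returns 0 on success, -1 on failure. On failure dst is left as an
 * empty string when dst is usable. */
int copy_checked(char *dst, size_t dst_size, const char *src)
{
    if (dst == NULL || src == NULL || dst_size == 0) {
        return -1;
    }
#ifdef __STDC_LIB_EXT1__
    /* Annex K path. The default runtime-constraint handler is
     * implementation-defined and may abort, so select the handler that
     * simply returns and lets strcpy_s report the error code. */
    set_constraint_handler_s(ignore_handler_s);
    return strcpy_s(dst, dst_size, src) == 0 ? 0 : -1;
#else
    {
        /* Fallback for implementations without Annex K. */
        size_t len = strlen(src);

        if (len >= dst_size) {
            dst[0] = '\0';
            return -1;
        }
        memcpy(dst, src, len + 1);
        return 0;
    }
#endif
}
```

Either branch turns what would have been a write past the end of `dst` into an error return that the caller can test.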
Gavin Henry 01:03:52 To finish off this section, there are standards that we talk about. There’s the CERT C guidelines, right. I remember listening to a show by SQLite, how they spent a year getting their C code up to some medical standards. Can’t remember what it was. Is that a thing? Is that something you’ve heard of? Some type of medical standards where that code is suitable to be deployed in medical equipment? I’ll have to do some more research for that. Okay, so I think that was really good to start wrapping up. So obviously C is a very powerful language with a strong history and deployment base. But if there was one thing a software engineer should remember from our show, what would you like that to be? If we haven’t covered that, or just something you wanted to bring to the top?
Robert Seacord 01:04:43 Okay, I’ll say this, we didn’t spend a lot of time talking about IDEs, right? But there’s an interesting thing people say about C programmers, which is that C programmers are a little frustrated by sort of compiler diagnostics and they want to get past that so they can get to the real job of debugging the program, right? And there’s one style of programming, which is this trial and error, right? So you have a bit of a problem. You Google, you go to Stack Overflow, you find a code example, you copy-paste that into your system and you tweak it. You compile it. It doesn’t compile; there are some diagnostics. Oh yeah, this variable name is misspelled. You make improvements, it compiles, and then you run it, and it doesn’t quite run.
Robert Seacord 01:05:49 So you change something and now you get a run that succeeds and you’re like, cool, that’s working, on to the next thing. And so this kind of technique of trial and error, it can get you to a program which works in a kind of optimal scenario, right? But it doesn’t mean that program’s correct, right? You don’t know how that program’s going to deal with kind of unexpected data. And we talked about input validation briefly, but really your code has to work with all possible data values, right? There can’t be any inputs for which the program’s going to exhibit incorrect behavior. So that’s the point of input validation and programming in general, right? Make sure that you deal with all possible combinations of data. So to do this, trial and error is really insufficient. You need to understand the language, you need to understand the code you’re writing, and make sure you understand all possible cases: that you’re considering type conversions, you’re considering integer overflow, and all these.
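As a small example of handling “all possible data values” rather than just the ones that happened to work in a trial run, the sketch below adds two `int` values without ever performing an overflowing signed addition, which would be undefined behavior. The function name `add_checked` is an assumption for this illustration; the precondition test is the widely used compare-against-the-limits pattern.

```c
#include <limits.h>

/* Adds a and b. Returns 0 and stores the sum on success, or -1 if the
 * mathematical result does not fit in an int (sum is left untouched). */
int add_checked(int a, int b, int *sum)
{
    if (sum == 0) {
        return -1;
    }
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b)) {
        return -1;   /* would overflow: refuse instead of computing a + b */
    }
    *sum = a + b;
    return 0;
}
```

The check runs before the addition, so the overflowing expression is never evaluated; testing the result afterwards would be too late, because the undefined behavior would already have happened.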
Gavin Henry 01:07:14 I switched to [unclear], and I think whatever you use, there’s loads of C plugins, and the amount of time you save by just looking at what gets highlighted before you’ve even clicked build or run a command. Most of your problems are solved if you just pay attention to the,
Robert Seacord 01:07:35 Yeah, it helps a lot, but it’s still definitely inadequate because all the tooling isn’t going to find all the problems. So it is helpful to understand the language you’re using. And you could achieve that through training classes. You can achieve that through reading. One thing I did when I sort of transitioned from being a programmer to a secure coder is I spent some time, mostly in Visual Studio, and I’d write a little bit of C source code and I would sort of predict in my head what sort of assembly would be generated from that code. And then I would compile it, and then I would be surprised. I would go back and read the standard, like, okay, now I understand. And so, eventually I got to the point where I could successfully predict the assembly code that is being generated. Until you get to that point, your understanding of the language is sort of falling short, right?
Gavin Henry 01:08:41 Yeah, there’s something to be said for just actually experimenting, and I like to call it “proving it to yourself”: basically have the assumption and write a test or something.
Robert Seacord 01:08:55 Yeah. And what I do is perfect some code, right? Where I’ve gained a lot of confidence: I understand this, I know what this is, I can use this, and now I’ve got kind of a reusable component I can use. But it’s quite dangerous to sort of just throw in a bunch of things because they’re there without really understanding them. So, I mean, maybe it’s more fun, but it doesn’t necessarily produce secure systems.
Gavin Henry 01:09:28 So, just to summarize before we wrap up the podcast, what one thing would you like them to remember? Is it: be good with your IDE, pick a good one, or prove your assumptions, or what would you like them to remember out of that?
Robert Seacord 01:09:48 I would say the best time to avoid the defect is when you’re coding. It’s better to write correct code initially than it is to try to find and repair defects downstream. I mean, correct coding, quality code, secure code, it’s difficult to achieve. And you really need to use all the available tools and processes and discipline to get close to achieving that. But yeah, the most important thing is sort of writing code securely to begin with.
Gavin Henry 01:10:39 Thank you. If people want to find out more and explore some of these things we’ve chatted about, where’s the best place to get in touch? You’re pretty active on Twitter, is that the best place?
Robert Seacord 01:10:49 Well, I can be found on Twitter. I have a website, RobertSeacord.com, I think where I’ve got some errata for the Effective C book.
Gavin Henry 01:11:04 I think you need to update your SSL certificate as I was looking at it last week and it was complaining that it was insecure of all things. Okay. So your Twitter account and your website.
Robert Seacord 01:11:16 You can look there. I’m on LinkedIn, as well. I’m not very hard to find, I don’t have any handles anywhere.
Gavin Henry 01:11:25 I guess it’s @RCS on Twitter for those that want to go there straight away. Okay. Robert, thank you for coming on the show. It’s been a real pleasure. This is Gavin Henry for Software Engineering Radio. Thank you for listening. [End of Audio]