I am usually amused by the way really competent people judge others' context.
This post assumes understanding of:
- emacs (what it is, and terminology like buffers)
- strace
- linux directories and "everything is a file"
- environment variables
- grep and similar
- what git is
- the fact that 'git whatever' works to run a custom script if git-whatever exists in the path (this one was a TIL for me!)
- irc
- CVEs
- dynamic loaders
- file privileges
but then feels it important to explain to the audience that:
>A socket is a facility that enables interprocess communication
Juniors know how much they have learned, whereas a 10+ year senior (like the author) forgets that most people don't know all this stuff intuitively.
I still will say stuff like "yeah it's just a string" forgetting everyone else thinks a "string" is a bit of thread/cord.
I think we're seeing a similar disconnect here. Some people think a string is a contiguous block of bytes, perhaps with a sentinel value on the end (C string) or a fixed-size count on the front. Others think of it as an API for storing text. These used to overlap, but in recent decades they have diverged substantially. The argument here is about the meaning of the word, not the technical reality of the thing it refers to, so it has no objective resolution.
Strings are very very not sequences of bytes. Strings are a semantic thing. There may be a sequence of bytes in some representation of a particular string, but even then those bytes are not enough to define a string without other stuff. An encoding, at the very least. But even then, there are many things that could be described as a "string". A sequence of code points, perhaps? Or scalar values? Grapheme clusters?
Not to mention that you may not even have a linear sequence of bytes at the bottom level. You might have a rope (cons cell), or an intern pointer, or...
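To make that concrete, here is a tiny C sketch (the byte values are simply the standard UTF-8 encoding of "é"; the same text could equally be stored as the two code points U+0065 U+0301, "e" plus combining acute):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* One user-perceived character, two bytes in UTF-8. The byte
           count alone tells you almost nothing about the text. */
        const char *s = "\xC3\xA9";        /* "é" encoded as UTF-8 */
        printf("bytes: %zu\n", strlen(s)); /* prints: bytes: 2 */
        return 0;
    }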
This is a profoundly stupid kind of argument. There isn't even an objective truth you could conceivably convince someone of. There's just how you're choosing to use the word in conflict with a preexisting convention, which marks you as part of some social group, just like "this slaps", "skibidi", "rad", or "whenever". The preexisting convention isn't some apprehension of objective truth either. It's just an arbitrary tradition, like the meaning of any word.
People who are using the word in the older sense are usually not mistaken. At worst, they're your political enemies, but often they aren't even that; they just have experiences you don't. Attempting to persuade them, as you are doing, can only have the effect of further narrowing your intellectual horizons—even in the unlikely case that you are successful, but especially in the far more common case where they try to avoid you after that.
I recommend more curiosity and less crusading.
(In the rare case where someone is mistaken, it's sufficient to say "I meant a Unicode string" or "but we're iterating over codepoints, not bytes," but such mere clarification is not what you're up to.)
Strings are sequences of bytes only in the sense that everything stored in memory is a sequence of bytes. The semantics matter far more, and they aren’t the same as a sequence of bytes.
Also many languages make strings immutable and byte arrays mutable.
That is wonderfully ironic.
Anyway, coming from a C background, sure, strings are kind of just sequences of bytes. For people coming from other backgrounds, they'll have different understandings of what a string is (probably more based on semantics of the language they learnt first than on the underlying representation in memory). I'm not trying to persuade you of one definition or another. Nor am I redefining the meaning of a string, as it's clearly subjective by background and/or by context.
To that end, take my point as merely "you need to know the context", and I happen to believe the context that matters is the semantics of the programming language you're using (as opposed to the underlying representation of an instance of the type in memory).
My comments are also for the benefit of the many folks (particularly junior members of our community) that perhaps don't have exposure to this way of looking at things.
In most programming languages, strings carry more semantics than a plain sequence of bytes.
For example, in Rust all Strings are UTF-8, so every Rust string is a byte sequence, but not every byte sequence can be a Rust string.
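For contrast, C happily treats arbitrary bytes as a string; a minimal sketch (0xFF can never appear in valid UTF-8, so Rust's String constructors would reject this sequence):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Any NUL-terminated bytes are a C "string", even bytes that
           are not valid text in any encoding. */
        char bytes[] = { (char)0xFF, (char)0xFE, 0 };
        printf("strlen: %zu\n", strlen(bytes)); /* prints: strlen: 2 */
        return 0;
    }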
As someone younger, ports and sockets appeared very early in my learning. I'd say they appeared in passing before programming even, as we had to deal with router issues to get some online games or p2p programs to work.
And conversely, some of the other topics are in the 'completely optional' category. Many of my colleagues work on IDEs from the start, and some may not even have used git in its command line form at all, though I think that extreme is more rare.
>The term socket dates to the publication of RFC 147 in 1971, when it was used in the ARPANET. Most modern implementations of sockets are based on Berkeley sockets (1983), and other stacks such as Winsock (1991).
[1] https://en.wikipedia.org/wiki/Berkeley_sockets
[2] https://medium.com/theconsole/40-years-of-berkeley-sockets-8...
> Initially we intend to add the facilities described here to UNIX. We will then begin to implement portions of UNIX itself using the IPC [inter-process communication] as an implementation tool. This will involve layering structure on top of the IPC facilities. The eventual result will be a distributed UNIX kernel based on the IPC framework.
> The IPC mechanism is based on an abstraction of a space of communicating entities communicating through one or more sockets. Each socket has a type and an address. Information is transmitted between sockets by send and receive operations. Sockets of specific type may provide other control operations related to the specific protocol of the socket.
They did deliver sockets more or less as described in 4.1BSD later that year, but the distributed Unix kernel never materialized. The closest thing was what Joy would later bring about at Sun: NFS and YP (later NIS). They clarify that they had a prototype working already:
> A more complete description of the IPC architecture described here, measurements of a prototype implementation, comparisons with other work and a complete bibliography are given in CSRG TR/3: "An IPC Architecture for UNIX."
And they give a definition for struct in_addr, though not today's definition. Similarly they use SOCK_DG and SOCK_VC rather than today's SOCK_DGRAM and SOCK_STREAM, offering this sample bit of source:
s = socket(SOCK_DG, &addr, &pref);
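For comparison, the call as it settled down by 4.2BSD and still looks today, with the address supplied separately through bind(2) or connect(2) rather than to socket() itself:

    #include <sys/socket.h>

    int main(void)
    {
        /* Today's SOCK_DG: address family and type are arguments;
           no address or preference pointer is passed here. */
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        return s < 0;
    }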
CSRG TR/3 does not seem to have been promoted to an EECS TR, because I cannot find anything similar in https://www2.eecs.berkeley.edu/Pubs/TechRpts/. And they evidently didn't check their "prototype" socket implementation in to source control until November 01981: https://github.com/robohack/ucb-csrg-bsd/commit/9a54bb7a2aa0... In theory that's four months after the 4.1BSD release in http://bitsavers.trailing-edge.com/bits/UCB_CSRG/4.1_BSD_198..., linked from https://gunkies.org/wiki/4.1_BSD, which does seem to have sockets in some minimal form. I don't understand the tape image format, but the string "socket" occurs: "Protocol wrong type for socket^@Protocol not available^@Protocol not supported^@Socket type not supported^@Operation not supported on socket^@Protocol family not supported^@Address family not supported by protocol family^@Address already in use^@Can't assign requested address^@".
This is presumably compiled from lib/libc/gen/errlst.c or its moral equivalent (e.g., there was an earlier version that was part of the ex editor source code). But those messages were not added to the checked-in version of that file until Charlie Root checked in "get rid of mpx stuff" in February of 01982: https://github.com/robohack/ucb-csrg-bsd/commit/96df46d72642...
The 4.1 tape image I linked above does not contain man pages for sockets. Evidently those weren't added until 4.2! The file listings in burst/00002.txt mention finger and biff, but those could have been non-networked versions (although Finger was a documented service on the ARPANet for several years at that point, with no sign of growing into a networked hypertext platform with mobile code). Delivermail, the predecessor of sendmail, evidently had cmd/delivermail/arpa-mailer.8, cmd/delivermail/arpa.c, etc.
That release was actually the month before Joy and Fabry's proposal, so perhaps sockets were still a "prototype" in that release?
The current sockaddr_in structure was checked in to source control as a patch to sys/netinet/in.h on November 18, 01981: https://github.com/robohack/ucb-csrg-bsd/commit/b5bb9400a15e...
Kirk McKusick's "Twenty Years of Berkeley Unix" https://www.oreilly.com/openbook/opensources/book/kirkmck.ht... says:
> When Rob Gurwitz released an early implementation of the TCP/IP protocols to Berkeley, Joy integrated it into the system and tuned its performance. During this work, it became clear to Joy and Leffler that the new system would need to provide support for more than just the DARPA standard network protocols. Thus, they redesigned the internal structuring of the software, refining the interfaces so that multiple network protocols could be used simultaneously.
> With the internal restructuring completed and the TCP/IP protocols integrated with the prototype IPC facilities, several simple applications were created to provide local users access to remote resources. These programs, rcp, rsh, rlogin, and rwho were intended to be temporary tools that would eventually be replaced by more reasonable facilities (hence the use of the distinguishing "r" prefix). This system, called 4.1a, was first distributed in April 1982 for local use; it was never intended that it would have wide circulation, though bootleg copies of the system proliferated as sites grew impatient waiting for the 4.2 release.
rcmd, rexec, rsh, rlogin, and rlogind were checked into SCCS on April 2, 01982. At first glance, this socket code looks like it would compile today: https://github.com/robohack/ucb-csrg-bsd/commit/58a2fc8197d0...
Telnet, also using sockets, had been checked in earlier on February 28: https://github.com/robohack/ucb-csrg-bsd/commit/0dd802d6a649...
Though one explanation is that, for the other stuff the writer doesn't explain, one can just guess and be half right, and even if the reader guesses wrong, it isn't critical to the bug — but sockets and capabilities are the concepts required to understand the post.
It still is amusing, and I wouldn't even have realized it until you pointed it out.
The author is both an example of and an example for how we can get caught in "bubbles" of tools/things we know and use and don't, and blog posts like this are great for discovery (I didn't know about git invoking a binary in the path like his "git re-edit", for example, until today).
It’s not that I was unaware that’s how Unix worked here, just that I rarely think of sockets in that context.
I would expect a person with 10+ years of Unix sysadmin experience — but who has never programmed directly against any OS APIs, “merely” scripting together invocations of userland CLI tools — to have exactly this kind of lopsided knowledge.
(And that pattern is more common than you might think; if you remember installing early SuSE or Slackware on a random beige box, it probably applies to you!)
Then he didn't go back to clean it up afterwards.
I agree that it's amusing.
Years ago I worked on contract for a large blue three-letter company doing outsourced server management for the fancy credit card company. The incident in question happened before my time on the team, but I heard about it firsthand from the server admin (let's call him Ben) who had been at the center of it.
The data center in question was (IIRC) 160K sqft of raised floor spread across multiple floors in a major metropolitan downtown area. It isn't there anymore. Windows, Unix, Linux, mainframe, SAN, all the associated fun stuff.
Ben was working the day after Thanksgiving, decommissioning a system. Full software and physical decommission. Approved through all the proper change management procedures.
As part of the decommission, Ben removed the network cables from under the raised floor. Standard snip the connector off and pull it back. Easy. Little did he know that network cable was ever so slightly entangled with another cable. Not enough to give him pause when pulling it, though. It wouldn't have been an issue if the other cable had been properly latched in its port. It wasn't. That little pull ended up yanking the network connection out of a completely unrelated system. A system managed by a completely different group. A system responsible for credit card processing. On USA Black Friday.
Oops. CC processing went down. It took far too long to resolve. Amazingly, Ben didn't lose his job. After all, he had followed all the processes and procedures. Kudos to the management team who kept him protected.
Change management and change freezes were far more stringent by the time I joined the team. There was also now a raised floor infrastructure group and no one pulled a tile without their involvement.
Be careful what you tug on!
I wonder what could be done to make this type of problem less hidden and easier to diagnose.
The one thing that comes to mind is to have the loader fail fast. For security reasons, the loader needs to ensure TMPDIR isn't set. Right now it accomplishes this by unsetting TMPDIR, which leads to silent failures. Instead, it could check whether TMPDIR is set and, if so, give a fatal error.
This would force you to unset TMPDIR yourself before you run a privileged program, which would be tedious, but at least you'd know it was happening because you'd be the one doing it.
(To be clear, I'm not proposing actually doing this. It would break compatibility. It's just interesting to think about alternative designs.)
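Still, a rough sketch of the fail-fast idea (hypothetical code, not how glibc actually behaves):

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical alternative to the loader silently scrubbing the
       environment: for a privileged (setuid/setcap) process, refuse
       to run if TMPDIR is set at all. */
    static void reject_unsecure_env(int privileged)
    {
        if (privileged && getenv("TMPDIR") != NULL) {
            fprintf(stderr, "fatal: TMPDIR is set; unset it before "
                            "running a privileged program\n");
            exit(127);
        }
    }

    int main(void)
    {
        reject_unsecure_env(1); /* pretend we are privileged */
        puts("TMPDIR not set; continuing");
        return 0;
    }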
(This is about UNSECURE_ENVVARS, if someone needs to find the source location.)
Making these things more transparent is a good idea, of course, but it is somewhat hard. Maybe we could add SystemTap probes that fire when environment variables are removed or ignored.
A related issue is that people stick LD_LIBRARY_PATH and LD_PRELOAD settings into shell profiles/login scripts and forget about them, leading to hard-to-diagnose failures. More transparency there would help, but again it's hard to see how to accomplish that.
I remember when this was necessary and used it myself quite a bit. But today, couldn't we just open up a mount namespace and bind-mount something else to /tmp, like systemd's private tmp directories? (Those broke a lot of assumptions about tmpdirs and caused a bit of a ruckus, but on the other hand, I see their point by now.)
I'm honestly starting to wonder about a lot of these really weird, prickly and fragile environment variables which cause security vulnerabilities, if low-overhead virtualization and namespacing/containers are available. This would also raise the security floor.
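Something like this minimal sketch, for instance ("/path/to/private-tmp" is a placeholder, and as the reply below points out, creating the namespace in the first place requires privilege):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Enter a private mount namespace... */
        if (unshare(CLONE_NEWNS) != 0) { perror("unshare"); return 1; }
        /* ...keep our mounts from propagating back out... */
        if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) {
            perror("mount private"); return 1;
        }
        /* ...and shadow /tmp with a directory only we can see. */
        if (mount("/path/to/private-tmp", "/tmp", NULL, MS_BIND, NULL) != 0) {
            perror("bind mount"); return 1;
        }
        return 0;
    }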
No, because unless you're already root (in which case you wouldn't have needed the binary with the capability in the first place), you can't make a mount namespace without also making a user namespace, and the counterproductive risk-averse craziness has led to removing unprivileged users' ability to make user namespaces.
Are we just shittier engineers, is it more complex, or is the culture such that we output lower quality? Does building a bridge require less cognitive load than a complex software project?
We're better at encapsulating lower-level complexities in e.g. bridge building than we are at software.
All the complexities of, say, martensite grain boundaries and what-not are implicit in how we use steel to reinforce concrete. But we've got enough of it in a given project that the statistical summaries are adequate. It's a member with such-and-such strength in tension, and such-and-such in compression, and we put a 200% safety factor in and soldier on.
And nobody can take over the ownership of leftpad and suddenly falsify all our assumptions about how steel is supposed to act when we next deploy ibeam.js ...
The most well understood and dependable components of our electronic infrastructure are the ones we cordially loathe because they're composed in *shudder* COBOL, or CICS transactions, or whatever.
Both IMO: first, anybody could buy a computer during the last three decades, dabble in programming without learning basic concepts of software construction and/or user-interface design and get a job.
And copying bad libraries was (and is) easy. I still get angry when software tells me "this isn't a valid phone number" when I cut/copy/paste a number with a blank or a hyphen between digits. Or worse, libraries which expect the local part of an email address to only consist of alphanumeric characters and maybe a hyphen.
Second, writing software definitely is more complex than building physical objects, because there are "no laws" of physics which limit what can be done. In the physical world, physics tells you that you need to follow certain rules to get a stable building or a bridge capable of withstanding rain, wind, etc.
Given the hardware available to an average modern Linux box, it is hardly surprising that these bells and whistles were added - someone will find them useful in some scenario, and the additional resource usage is negligible. It does, however, make understanding the whole beast much, much harder...
I'd say it comes from some of (order of most to least imo, but I'm only mid level so take what I say accordingly):
* physical processes have a fuzzy "good enough": the bridge stands with thrice its expected max load, so it is good enough.
* most software doesn't have life safety behind it. In construction, life safety systems receive orders of magnitude more scrutiny.
* physical projects don't have more than 20 different interdependencies; there's an upper limit on arbitrary complexity
* physical projects usually have clearish deadlines (they lie, but by a constant factor)
* The industries are old enough that they check juniors before they give them big decisions.
* Similarly, there exists PE accountability in construction
There are no big wins left in bridge building, so there is no justification for taking big risks. Also, in most software project failures, the only cost is people's time; no animals are harmed, no irreplaceable antique guitars are smashed, no ecosystems are damaged, and no buses of schoolchildren plunge screaming into an abyss.
Your software startup didn't get funded? Well, you can go back and finish college.
Also, “direct” link: https://blog.plover.com/tech/tmpdir.html (This doesn't really matter, as the posted link is to https://blog.plover.com/2016/07/01/#tmpdir i.e. the blog post named “tmpdir” posted on 2016-07-01, and there is only one post posted on that date, so the content of the page is basically the same.)
https://www.youtube.com/watch?v=aWXuDNmO7j8
Peter Weller, playing Buckaroo Banzai, is late for his military-particle-physics-interdimensional-jet-car test because he's helping Jeff Goldblum's character with neurosurgery. Later that day he will go play lead guitar in an ensemble.
Scriptwriting gurus advise that your protagonist should have flaws and character progression. The writers of this movie disagree.
Setting TMPDIR to /mnt/tmp also seems to come from that.
I would guess both were the result of someone who didn't really know what they were doing trying things until they found something that got what they needed to work, then pushed that out without understanding the broader implications.
Also, computers in 2015 were not meaningfully less complex than today. Certainly not when the topic is weird emacs and perl interactions.
I was pleased that it was more interesting than that, and I want people to write more twitchy-detail-post-mortems like this :-)
You have your terminal window and your .bashrc (or equivalent), and that sets a bunch of environment variables. But your GUI runs with, most likely, different environment variables. It sucks.
And here’s my controversial take on things—the “correct” resolution is to reify some higher-level concept of environment. Each process should not have its own separate copy of environment variables. Some things should be handled… ugh, I hate to say it… through RPC to some centralized system like systemd.
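A minimal illustration of those copy semantics, using nothing beyond POSIX: every process mutates only its own environment, so your shell, your GUI session, and everything they spawn can silently disagree.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        if (fork() == 0) {
            /* The child changes only its own copy of the environment... */
            setenv("TMPDIR", "/somewhere/else", 1);
            _exit(0);
        }
        wait(NULL);
        /* ...and the parent never sees the change. */
        const char *t = getenv("TMPDIR");
        printf("parent TMPDIR: %s\n", t ? t : "(unset)");
        return 0;
    }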