#archiveteam-bs 2017-06-05,Mon

Time	Nickname	Message
00:00 ^🔗	timmc	The WARC had the space in it?
00:00 ^🔗	JAA	Yes.
00:00 ^🔗	timmc	Now I wonder what browsers do when confronted with that.
00:00 ^🔗	JAA	They'll handle it fine, probably.
00:01 ^🔗	JAA	Browsers are developed to handle all sorts of crap thrown at them by badly written web servers.
00:02 ^🔗	JAA	(Unfortunately, because that means that the web servers never get fixed to conform to the standards.)
00:02 ^🔗	JAA	But I guess the IA library which handles this stuff is more strict.
00:06 ^🔗	timmc	OK, can confirm, Firefox is fine with it.
00:09 ^🔗	voidsta	vvvvvvv/13
00:09 ^🔗	voidsta	oops
00:14 ^🔗	xmc	hi
00:14 ^🔗	voidsta	hello
00:14 ^🔗	JAA	Looks like others have experienced this problem of web servers including whitespace in the chunk size before: https://webcache.googleusercontent.com/search?q=cache:https%3A%2F%2Fjava.net%2Fjira%2Fbrowse%2FGRIZZLY%2D1684 (java.net shut down recently :-| )
00:15 ^🔗	joepie91	JAA: timmc: chunk sizes in the WARCs are padded to multiples of 3 hex chars
00:15 ^🔗	joepie91	using spaces
00:16 ^🔗	joepie91	(in my report, they're represented as dots)
00:17 ^🔗	JAA	Yeah, but not always.
00:17 ^🔗	JAA	Hmm, or maybe it is always.
00:19 ^🔗	joepie91	from what I could see, it's always
00:19 ^🔗	joepie91	just not all numbers in the source are chunk sizes :)
00:21 ^🔗	JAA	Found the Apache bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=41364
00:21 ^🔗	JAA	Although it seems unlikely that they were still using that version last year. :-P
00:23 ^🔗	JAA	Someone claims there that "The spaces padding the hex value are ok according to rfc2616"
00:25 ^🔗	timmc	joepie91: Yeah, I tested it with the added spaces and a fake HTTP server.
00:26 ^🔗	JAA	I'm glad that there are standards, but I hate the standards. They're so hard to read at times.
00:26 ^🔗	timmc	ugh implied LWS
00:26 ^🔗	JAA	Yes, but...
00:27 ^🔗	JAA	RFC 2616 was obsoleted by 7230. 7230 uses ABNF from RFC 5234.
00:27 ^🔗	JAA	5234 specifically states that "This specification for ABNF does not provide for implicit specification of linear white space." and "Any grammar that wishes to permit linear white space around delimiters or string segments must specify it explicitly."
00:28 ^🔗	JAA	And I can't find that in 7230 anywhere.
00:29 ^🔗	JAA	Actually, it states explicitly: "Rules about implicit linear whitespace between certain grammar productions have been removed; now whitespace is only allowed where specifically defined in the ABNF."
00:29 ^🔗	JRWR	man the scaleway API is strange
00:30 ^🔗		Ravenloft has quit IRC (Read error: Operation timed out)
00:45 ^🔗		VADemon has quit IRC (Read error: Connection reset by peer)
00:45 ^🔗	joepie91	timmc: LWS?
00:45 ^🔗	JAA	Linear white space
00:46 ^🔗	JAA	Meaning white space (0x20) and horizontal tabs (0x09).
00:48 ^🔗	timmc	JAA: Too little too late, I suppose.
00:48 ^🔗	JAA	So here's what I think is happening on those pages: Apache, portalgraphics's web server, for some reason pads the chunk sizes with spaces to multiples of three. Since almost all clients probably support RFC 2616 (backwards compatibility etc.), this isn't actually a problem, although it isn't exactly conformant with the most up-to-date standards. (portalgraphics may have been using an old version of Apache though from before the release of RFC 7230.)
00:49 ^🔗	JAA	However, the IA library handling HTTP responses uses RFC 7230 and therefore doesn't allow whitespace after the chunk size. It fails to decode it and handles it as raw data instead, in effect simply dropping the "Transfer-Encoding: chunked" header.
00:50 ^🔗	JAA	Which then leads to "garbage" showing up in the final response from the IA.
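
The failure mode JAA describes above can be sketched in a few lines. This is only an illustration under stated assumptions: `decode_chunked` and its `lenient` flag are hypothetical names, not the IA library's API, and the sample bytes are made up to mimic sizes padded with spaces to three characters.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def decode_chunked(raw: bytes, lenient: bool = True) -> bytes:
    """Decode an HTTP/1.1 chunked body (hypothetical helper, trailers ignored).

    With lenient=True, spaces and tabs around the chunk size are tolerated,
    as older RFC 2616-era clients tend to do. With lenient=False, any
    whitespace in the size field is rejected, mimicking a strict parser.
    """
    out = bytearray()
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)                  # end of the chunk-size line
        size_field = raw[pos:eol].split(b";", 1)[0]    # drop any chunk extension
        if lenient:
            size_field = size_field.strip(b" \t")
        if not size_field or any(c not in HEX_DIGITS for c in size_field):
            raise ValueError(f"invalid chunk size: {size_field!r}")
        size = int(size_field, 16)
        pos = eol + 2
        if size == 0:                                  # last-chunk marker
            return bytes(out)
        out += raw[pos:pos + size]
        if raw[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("chunk data not terminated by CRLF")
        pos += size + 2


# Made-up sample: sizes "5" and "6" padded with spaces to three characters.
padded = b"5  \r\nhello\r\n6  \r\n world\r\n0\r\n\r\n"
print(decode_chunked(padded))                          # lenient decoding works
try:
    decode_chunked(padded, lenient=False)
except ValueError as exc:
    print("strict parser gives up:", exc)
```

A strict parser that gives up here and passes the bytes through unchanged would produce exactly the kind of stray hex digits seen in the replayed pages.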
00:53 ^🔗	JRWR	Anyone want my scaleway script
00:53 ^🔗	JRWR	it deploys 7 grab scripts at a time
00:54 ^🔗	JRWR	using the arm64 instances
00:54 ^🔗	JRWR	(2.99Euro)
00:54 ^🔗	voidsta	sure, share it :)
00:54 ^🔗	JAA	By the way, if you request the id_ resource for that link I gave above, the IA sends it in chunked transfer encoding again; the raw traffic back from IA to the browser then looks like double-chunk-encoded: https://gist.githubusercontent.com/anonymous/accf1455050dcf01f19a3b6d1f7cf658/raw/89f5ab19945c49e3770bb6571e36b9f2ae8f1594/gistfile1.txt
00:55 ^🔗	JRWR	and voidsta a script to clean up all the servers as well
00:55 ^🔗	voidsta	JRWR: cool :)
00:55 ^🔗	JAA	JRWR: I'd love to see how you automated it, although I won't be using it directly. I've been meaning to look into how to make the whole process of joining a new project a bit easier across multiple machines.
00:55 ^🔗	JRWR	ya
00:56 ^🔗	JRWR	its simple as fuck really
00:56 ^🔗	voidsta	same
01:00 ^🔗	JRWR	https://gist.github.com/JRWR/4b1cdbe0f55f00d92c10ff1e2355c5b7
01:00 ^🔗	JRWR	there you go
01:00 ^🔗	JRWR	thats both scripts
01:01 ^🔗	JRWR	updated to show my script.sh it sent to the servers
01:01 ^🔗		ajft has joined #archiveteam-bs
01:01 ^🔗	JRWR	mostly its the default one with a screen -dm on the run-pipeline
01:02 ^🔗	voidsta	cool, thanks for sharing
01:03 ^🔗	JRWR	its very shotgun style
01:03 ^🔗	JRWR	but it gets the job done
01:05 ^🔗		ndiddy has quit IRC ()
01:05 ^🔗	JAA	Thanks, I'll have a look at it tomorrow.
01:06 ^🔗	JAA	"Some twats drove a van into pedestrians and stabbed people. But don't despair, this will never happen again once we start regulating the internet." FFS, Theresa...
01:12 ^🔗		j08nY has quit IRC (Quit: Leaving)
01:14 ^🔗	xmc	seems reasonable
01:14 ^🔗	xmc	worked here in the usa
01:19 ^🔗	joepie91	JAA: interestingly, the chunk cutoffs happened in specific places a lot. I wonder whether you can infer where variables were used (in string-concatenation, in PHP) from the chunk cutoff points
01:20 ^🔗	joepie91	JAA: also, if that theory is correct, it'd be relatively simple to fix all the WARCs
01:21 ^🔗	xmc	if modifying warcs is acceptable.
01:22 ^🔗	xmc	tbh i'm on the fence about that
01:22 ^🔗	xmc	even in this case
01:22 ^🔗	JAA	I guess it might be possible to infer something about internal buffer sizes etc.
01:23 ^🔗	JAA	Agreed. The WARCs contain an accurate representation of what the web server delivered to clients. The fact that some clients or libraries can't handle it is secondary to preserving the original data in my opinion.
01:23 ^🔗		ndiddy has joined #archiveteam-bs
01:25 ^🔗	JAA	We definitely need to get in touch with the IA though so they can fix their software if my assumptions above are correct.
01:25 ^🔗		BlueMaxim has joined #archiveteam-bs
01:25 ^🔗	JAA	I'm sure they have tons of other pages with the same "bug".
01:27 ^🔗	xmc	quite possibly!
01:36 ^🔗		schbirid2 has joined #archiveteam-bs
01:39 ^🔗		schbirid has quit IRC (Read error: Operation timed out)
01:42 ^🔗	joepie91	okay, that was poorly worded
01:42 ^🔗	joepie91	it'd be relatively simple to fix the wayback output*
01:42 ^🔗	joepie91	:p
01:42 ^🔗	joepie91	cc JAA xmc
01:42 ^🔗	joepie91	definitely not advocating for [irrevocably] modifying source data
01:42 ^🔗	xmc	ah yep
01:43 ^🔗	joepie91	but eg. storing a 'fixed' copy of the WARC can be desirable for perf purposes (over fixing stuff on-the-fly in the wayback)
01:43 ^🔗	joepie91	without touching the original
01:53 ^🔗		icedice has quit IRC (Ping timeout: 250 seconds)
01:54 ^🔗	*	JRWR spins up 100 instances
01:54 ^🔗	JRWR	oops
02:03 ^🔗	voidsta	:)
02:04 ^🔗		pizzaiolo has quit IRC (Ping timeout: 260 seconds)
02:26 ^🔗		JRWR has quit IRC (Quit: Page closed)
03:15 ^🔗		superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
03:24 ^🔗		Aranje has quit IRC (Quit: Three sheets to the wind)
03:34 ^🔗		Sk1d has joined #archiveteam-bs
03:44 ^🔗		ndiddy has quit IRC ()
03:51 ^🔗		ajft has left
04:08 ^🔗		slyphic has quit IRC (Read error: Operation timed out)
04:08 ^🔗		slyphic has joined #archiveteam-bs
04:13 ^🔗	MrRadar	Over in #outofsteam DoomTay noticed that SPUF was returning data with chunked encoding when he used wpull to grab their front page
04:14 ^🔗	MrRadar	I checked on my end and found out that they did that for both wpull and wget-lua if a custom user-agent is not specified
04:14 ^🔗	MrRadar	With the ArchiveTeam user-agent SPUF returns data without chunking
04:15 ^🔗	MrRadar	I've also verified that both wpull and wget-lua are producing WARCs with the same corruption as portalgraphics when SPUF returns data with chunked transfers
04:24 ^🔗	MrRadar	We've also figured out that using a browser User-agent still results in chunked transfers, but adding an Accept header like an actual browser would will cause it to switch back to non-chunked transfers
04:29 ^🔗	MrRadar	Can someone with access to the tracker please stop Pixiv?
04:29 ^🔗	MrRadar	The chunked transfer issue is 100% affecting grabs from them
04:29 ^🔗	MrRadar	xmc, arkiver, SketchCow: ^^^
04:37 ^🔗	MrRadar	I'm grabbing the latest unpatched wget to see if that has the same issue as wpull and wget-lua
04:46 ^🔗	MrRadar	This issue affects the official wget 1.19 release
04:52 ^🔗	MrRadar	OK, I've tracked down the bug in the wget source code.
04:53 ^🔗	MrRadar	The way WARC writing works in wget is there are two output files passed to the fd_read_body() function
04:53 ^🔗	MrRadar	The first gets only the main content, the second gets both the content and headers
04:53 ^🔗	MrRadar	WARC output stores the data from the second stream into the WARC.
04:54 ^🔗	MrRadar	However, as the comment on the function says: "If OUT2 is non-NULL, the contents is also written to OUT2. OUT2 will get an exact copy of the response: if this is a chunked response, everything -- including the chunk headers -- is written to OUT2. (OUT will only get the unchunked response.)"
04:54 ^🔗	MrRadar	So it's a deliberate design decision to dump the chunked transfer size as part of the WARC output
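
The two-output design MrRadar describes can be sketched as follows. This is not wget's C code: `read_body` is a hypothetical Python stand-in for the idea behind `fd_read_body()`, where `out` receives only the unchunked content and `out2` receives an exact copy of what came over the wire.

```python
import io


def read_body(stream, out, out2=None):
    """Read a chunked body from `stream` (hypothetical sketch, no error handling).

    `out` gets the decoded payload only. `out2`, if given, gets every byte
    exactly as received, including chunk-size lines and CRLF separators.
    """
    while True:
        size_line = stream.readline()                  # e.g. b"5\r\n"
        if out2 is not None:
            out2.write(size_line)
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            rest = stream.read()                       # final CRLF (and any trailers)
            if out2 is not None:
                out2.write(rest)
            return
        data = stream.read(size)
        crlf = stream.read(2)                          # CRLF that ends the chunk
        out.write(data)
        if out2 is not None:
            out2.write(data + crlf)


wire = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
content, exact_copy = io.BytesIO(), io.BytesIO()
read_body(io.BytesIO(wire), content, exact_copy)
print(content.getvalue())      # unchunked content, as saved to disk
print(exact_copy.getvalue())   # wire bytes, as written to the WARC
```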
04:55 ^🔗	voidsta	so, not a bug?
04:56 ^🔗	MrRadar	Actually, looks like it is a bug
04:56 ^🔗	voidsta	hm
04:56 ^🔗	MrRadar	According to the "WARC_ISO_28500_final_draft v018" document I found: "The payload of a 'response' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616]), with any transfer-encoding removed. If a truncated 'response' record block contains less than the full entity-body, the payload is considered truncated at the same position."
04:56 ^🔗	MrRadar	The "with any transfer-encoding" removed bit indicates that this is non-compliant behavior on the part of wget
04:57 ^🔗	MrRadar	As the chunked-transfer header would count as part of the transfer-encoding
04:57 ^🔗	MrRadar	Unless they mean something completely different than the HTTP spec when they are referring to "transfer-encoding"
04:58 ^🔗	MrRadar	(The document can be found here: http://archive-access.sourceforge.net/warc/WARC_ISO_28500_final_draft%20v018%20Zentveld%20080618.doc)
05:00 ^🔗	MrRadar	Well, I need to get to bed. See you in the morning
05:00 ^🔗	*	MrRadar is AFK
05:01 ^🔗		Sk1d has quit IRC (Ping timeout: 194 seconds)
05:06 ^🔗		ItsYoda has joined #archiveteam-bs
05:07 ^🔗		Sk1d has joined #archiveteam-bs
05:15 ^🔗	godane	so i have only uploaded 37k items this year so far
05:31 ^🔗		JRWR has joined #archiveteam-bs
05:31 ^🔗	JRWR	something is going down
05:31 ^🔗	JRWR	MrRadar: do you confirm?
05:32 ^🔗	MrRadar	I'm not 100% sure
05:32 ^🔗	JRWR	ill point my webserver to the ingress folder if you want to start checking
05:32 ^🔗	MrRadar	wget is saving the chunked transfer headers into the WARCs but I'm pretty sure that's against the WARC spec
05:32 ^🔗	MrRadar	But I'm definitely not an expert on the WARC spec
05:33 ^🔗	MrRadar	We'd probably need to hear for sure from somebody at the IA who knows the spec very well
05:34 ^🔗	MrRadar	I did confirm the WARCs I was uploading for Pixiv contained the hex garbage
05:35 ^🔗	JRWR	Shit
05:35 ^🔗	JRWR	I want a confirm with a OP
05:36 ^🔗	JRWR	but I will keep the ingress in case of something crazy happening
05:37 ^🔗	JRWR	MrRadar: http://spacescience.tech/warc/incoming-uploads/
05:37 ^🔗	JRWR	you can start checking if you want
05:38 ^🔗	MrRadar	Picking at random, this file has chunked transfer headers in the roomtop.php response body: http://spacescience.tech/warc/incoming-uploads/JRWR/pixiv-roomtop_100594-20170605-044020.warc.gz
05:39 ^🔗	JRWR	ya I see them too
05:39 ^🔗	JRWR	Interesting
05:40 ^🔗	JRWR	im looking to see if there are any issues with the dumps
05:42 ^🔗	JRWR	Yep found some
05:42 ^🔗	JRWR	FUCK
05:42 ^🔗	JRWR	http://spacescience.tech/warc/incoming-uploads/Abel_LF/pixiv-roomtop_618874-20170604-145848.warc.gz
05:42 ^🔗	JRWR	Line 426
05:43 ^🔗	JRWR	Shit
05:43 ^🔗	JRWR	there is some in the AMFs
05:43 ^🔗	JRWR	fffffffffffffff
05:44 ^🔗	JRWR	I extracted all the static files
05:44 ^🔗	JRWR	out of the 20, only 2 matched their SHA1s
05:44 ^🔗	JRWR	These are bad dumps
05:45 ^🔗	JRWR	Who do we ping MrRadar
05:45 ^🔗	JRWR	http://imgur.com/6NfmQ
05:46 ^🔗	MrRadar	I already tried pinging everyone who has tracker access, but none of them are online at the moment
05:46 ^🔗	MrRadar	In the mean time you could reduce your rsync to 1 connection max
05:47 ^🔗	MrRadar	Or just turn it off altogether
05:49 ^🔗	JRWR	rsync is OFFLINE
05:49 ^🔗	MrRadar	JRWR: Which AMF files are you seeing with this issue? In the one you linked none of the AMF files were transferred with chunked encoding
05:51 ^🔗	JRWR	my bad it was the PNGs
05:51 ^🔗	MrRadar	OK, yeah some of those are definitely affected
05:53 ^🔗	JRWR	felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened.
05:54 ^🔗	JRWR	ok
05:54 ^🔗	JRWR	We got to fix this in the meantime
05:55 ^🔗	JRWR	its def Wget-lua doing this
05:56 ^🔗	MrRadar	Yeah, I actually tracked down the cause of the bug while you weren't online
05:56 ^🔗	MrRadar	Inside wget
05:56 ^🔗	JRWR	Good
05:56 ^🔗	JRWR	a simple fix is to disable http1.1
05:56 ^🔗	JRWR	and ask for HTTP/1.0
05:56 ^🔗	JRWR	but that does disable keepalives
05:57 ^🔗	JRWR	wait, how many dumps have been going on over the years with this issue?
05:57 ^🔗	JRWR	I wonder if anyone ever checked
05:58 ^🔗	MrRadar	While I haven't verified with the git history, this looks like it's been a problem since WARC support was first added to wget
05:58 ^🔗	godane	so i got a 256gb usb stick Saturday
05:58 ^🔗	godane	for $45
05:59 ^🔗	JRWR	so..
05:59 ^🔗	JRWR	thats ALL the dumps?
06:00 ^🔗	MrRadar	Any ones that have data transferred with the chunked transfer-encoding
06:00 ^🔗	MrRadar	Assuming my interpretation of the WARC spec is correct
06:00 ^🔗	JRWR	hrm
06:00 ^🔗	MrRadar	Given how extensive the issue is, it may be easier to just update the WARC spec to allow chunked transfer headers inside WARC response records
06:00 ^🔗	JRWR	true
06:02 ^🔗	JRWR	so the hex we are seeing are the headers for the next chunk?
06:02 ^🔗	godane	so the wget WARC code was screwing things up?
06:02 ^🔗	MrRadar	If I'm right, yes
06:02 ^🔗	MrRadar	But I'm not sure I am
06:03 ^🔗	MrRadar	When data is transferred with the HTTP "chunked" transfer-encoding, wget is writing the chunk headers into the WARC
06:03 ^🔗	godane	but wouldn't that cause the last few years of archiving to have problems
06:03 ^🔗	pikhq	Unless everyone's been misreading the spec the same way.
06:04 ^🔗	MrRadar	The WARC spec says "The payload of a 'response' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616]), with any transfer-encoding removed."
06:04 ^🔗	godane	but whatever this bug is its not with everything
06:05 ^🔗	MrRadar	Yes, only when the web server uses "Transfer-encoding: chunked"
06:05 ^🔗	pikhq	Not a lot of things used chunked encoding.
06:05 ^🔗	ranma	is there a "best way" to back up a reddit post
06:05 ^🔗	godane	ok
06:05 ^🔗	ranma	with a lot of collapsed comment threads?
06:05 ^🔗	MrRadar	Locally or with e.g. archivebot?
06:05 ^🔗	ranma	for archive bat
06:05 ^🔗	ranma	bot
06:06 ^🔗	ranma	lol
06:06 ^🔗	ranma	e.g. https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12/dienjss/?context=3
06:06 ^🔗	ranma	er
06:06 ^🔗	ranma	https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12
06:06 ^🔗	godane	that at least means we shouldn't have alot of corrupt data
06:06 ^🔗	MrRadar	!a https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12/ without an ignore set should do the trick I think?
06:06 ^🔗	MrRadar	(Make sure to have the trailing slash)
06:07 ^🔗	pikhq	godane: It also implies it should be possible to find all of the data corrupted by this bug.
06:07 ^🔗	pikhq	Though the act of finding all of it is definitely a big one just because of how much data there is to sift through...
06:10 ^🔗	ranma	MrRadar: isn't that going to hit all the linked sites
06:10 ^🔗	ranma	and then maybe a stupid number of other sites?
06:10 ^🔗	ranma	!a scares me
06:10 ^🔗	MrRadar	!a only recurses into URLs with the same prefix
06:11 ^🔗	MrRadar	URLs with a different prefix will be visited but not recursively
06:11 ^🔗		Igloo has quit IRC (Read error: Operation timed out)
06:11 ^🔗	MrRadar	That's why the trailing slash would be so important, to limit the scope of the recursion
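
The prefix rule MrRadar describes can be illustrated with a plain string-prefix check. This is a simplified sketch of the rule as stated in the chat, not ArchiveBot's actual implementation; `should_recurse` and the example.com URLs are made up for illustration.

```python
def should_recurse(start_url: str, candidate: str) -> bool:
    """Hypothetical scope check: recurse only into URLs that share the start prefix."""
    return candidate.startswith(start_url)


with_slash = "https://example.com/blog/"
without_slash = "https://example.com/blog"

# With the trailing slash, only pages under /blog/ are in scope.
print(should_recurse(with_slash, "https://example.com/blog/post-1"))        # True
print(should_recurse(with_slash, "https://example.com/blog-archive/2009"))  # False

# Without it, sibling paths that merely start with the same characters match too.
print(should_recurse(without_slash, "https://example.com/blog-archive/2009"))  # True
```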
06:11 ^🔗	ranma	i've seen !a of example.com start to crawl marthastewart.com
06:11 ^🔗	ranma	hm
06:12 ^🔗	ranma	not sure if i used trailing slash
06:19 ^🔗	SketchCow	What's the upshot of the bug
06:20 ^🔗	MrRadar	SketchCow: When web servers return data with "Transfer-encoding: chunked" wget is saving information into the WARC that (I think?) the spec says should be stripped
06:20 ^🔗	MrRadar	Specifically, the size of each data chunk
06:21 ^🔗	pikhq	Everything sent from servers using chunked transfer encoding will have spurious hex digits and \r\n sequences in the data that were on the wire, but apparently WARC says aren't supposed to be there.
06:21 ^🔗	pikhq	(that is, in the file itself)
06:21 ^🔗	MrRadar	You should ask someone at the IA who is familiar with the WARC format about what the right way to handle chunked transfers is
06:21 ^🔗	MrRadar	It's possible I'm just reading the spec wrong and wget is doing it right
06:22 ^🔗	pikhq	https://github.com/iipc/warc-specifications/issues/22
06:22 ^🔗	pikhq	That seems to imply you're reading the spec wrong.
06:24 ^🔗	MrRadar	pikhq: Reading that discussion I think you're right
06:25 ^🔗	pikhq	At the least, it's clear the intention is wget's behavior.
06:25 ^🔗	MrRadar	Yes, they're very deliberately including the headers in the WARC
06:26 ^🔗	pikhq	So, if you want to process WARC stuff (for rendering or what have you) you should probably be careful to take into account the transfer encoding, or else you'll get the spurious hex digits and such.
06:26 ^🔗	pikhq	But if you're generating a WARC, that's supposed to be there.
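
What pikhq describes for consumers of WARC data can be sketched like this: look at the stored HTTP headers and undo the chunked framing before using the body. This is a minimal illustration, assuming the record block holds the raw HTTP response bytes; `payload_from_http_response` is a hypothetical helper, not part of any WARC library, and it ignores trailers and other transfer codings.

```python
def payload_from_http_response(block: bytes) -> bytes:
    """Return the body of a stored HTTP response with chunked framing removed."""
    head, _, body = block.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:               # skip the status line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip().lower()
    if b"chunked" not in headers.get(b"transfer-encoding", b""):
        return body                                    # nothing to undo
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        # strip() also tolerates sizes padded with spaces or tabs
        size = int(body[pos:eol].split(b";", 1)[0].strip(), 16)
        if size == 0:
            return bytes(out)
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2                         # skip data and its CRLF


chunked = (b"HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n"
           b"4\r\nWARC\r\n0\r\n\r\n")
plain = b"HTTP/1.1 200 OK\r\nContent-Length: 4\r\n\r\nWARC"
print(payload_from_http_response(chunked))
print(payload_from_http_response(plain))
```

Skipping this step and treating the stored bytes as the payload is what makes the chunk sizes show up as stray hex digits.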
06:26 ^🔗	MrRadar	That makes sense
06:27 ^🔗	MrRadar	Sorry for the false alarm everyone
06:27 ^🔗	MrRadar	JRWR: If you're still around, please restart your rsync target
06:27 ^🔗	pikhq	No worries. The standard text is genuinely confusing, and your interpretation is a valid one.
06:27 ^🔗	JRWR	Been done already
06:27 ^🔗	pikhq	(at least, if you're not reading the exact same way they are)
06:31 ^🔗		JRWR_ has joined #archiveteam-bs
06:34 ^🔗		JRWR has quit IRC (Ping timeout: 268 seconds)
06:36 ^🔗		JRWR_ is now known as JRWR
06:36 ^🔗	JRWR	So overall that means IA's Wayback Machine doesn't follow the spec as well then
06:37 ^🔗	MrRadar	I think the issue with portalgraphics was they were sending slightly malformed chunked encoding headers
06:37 ^🔗	MrRadar	With extra padding?
06:37 ^🔗	MrRadar	That the IA didn't handle but browsers did
06:37 ^🔗	MrRadar	If my review of the logs is correct
06:37 ^🔗	MrRadar	*chat logs
06:50 ^🔗	JRWR	SketchCow: Looks like we got blacklisted at pixiv
06:51 ^🔗	MrRadar	arkiver: ^^^
06:52 ^🔗	MrRadar	It's not by IP since I can view URLs that fail through wget-lua just fine in my browser
06:55 ^🔗	MrRadar	Pixiv appears to be running again
06:55 ^🔗	JRWR	Ya
06:55 ^🔗	JRWR	It looks like we got funneled
07:10 ^🔗		Whopper_ has joined #archiveteam-bs
07:13 ^🔗		Whopper has quit IRC (Ping timeout: 268 seconds)
08:00 ^🔗		SHODAN_UI has joined #archiveteam-bs
08:44 ^🔗		Nazca_ has joined #archiveteam-bs
08:45 ^🔗	Nazca_	funneled is good or bad?
08:45 ^🔗		Nazca has quit IRC (Read error: Operation timed out)
08:45 ^🔗		Nazca_ is now known as Nazca
08:55 ^🔗		Igloo has joined #archiveteam-bs
09:17 ^🔗	godane	Donald Trump on Charlie Rose: https://archive.org/details/Charlie-Rose-1992-11-06
09:24 ^🔗		kristian_ has joined #archiveteam-bs
09:25 ^🔗		jtn2 has joined #archiveteam-bs
09:29 ^🔗		jtn2 has quit IRC (Read error: Operation timed out)
09:31 ^🔗		jtn2 has joined #archiveteam-bs
09:33 ^🔗		SHODAN_UI has quit IRC (Remote host closed the connection)
09:35 ^🔗	godane	i'm close to half way point of uploads from last month
09:36 ^🔗	godane	i only uploaded 955 items last month
09:36 ^🔗	godane	i was grabbing the Mister Rogers stream and ripping tape this past month
09:40 ^🔗		jtn2 has quit IRC (Read error: Operation timed out)
09:42 ^🔗		jtn2 has joined #archiveteam-bs
10:07 ^🔗		jtn2 has quit IRC (Read error: Operation timed out)
10:12 ^🔗	JAA	06-05 06:37:06 < MrRadar> I think the issue with portalgraphics was they were sending slightly malformed chunked encoding headers -- Yes, that's how I understand it. Interesting that the WARC should have transfer encoding stripped. I guess it makes sense in a way though.
10:18 ^🔗		jtn2 has joined #archiveteam-bs
10:23 ^🔗	JAA	But all in all, I don't think we need to stop current projects or anything like that. It wouldn't be hard to fix WARCs retroactively at some point if we want to do that.
10:25 ^🔗	JAA	joepie91: Fixing it in the Wayback Machine should be easy. IA's library for handling HTTP responses in WARC files already deals with chunked encoding, just not with this "malformed" variant. No need to update WARCs or anything; instead, the library should be modified to handle the whitespace padding.
10:27 ^🔗		j08nY has joined #archiveteam-bs
10:27 ^🔗	Sanqui	JAA: can you make some sort of writeup so this information doesn't get lost if somebody doesn't get to it right away?
10:29 ^🔗	JAA	Sanqui: Yeah, sure.
10:42 ^🔗		jtn2 has quit IRC (Ping timeout: 250 seconds)
10:43 ^🔗		jtn2 has joined #archiveteam-bs
10:43 ^🔗		BlueMaxim has quit IRC (Read error: Operation timed out)
10:44 ^🔗		BlueMaxim has joined #archiveteam-bs
11:29 ^🔗		BlueMaxim has quit IRC (Ping timeout: 600 seconds)
11:30 ^🔗		BlueMaxim has joined #archiveteam-bs
11:55 ^🔗		SHODAN_UI has joined #archiveteam-bs
12:07 ^🔗		kristian_ has quit IRC (Quit: Leaving)
12:43 ^🔗		tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
12:52 ^🔗	JAA	Anyone want to archive this? ;-) https://www.bleepingcomputer.com/news/security/hadoop-servers-expose-over-5-petabytes-of-data/
12:53 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
13:49 ^🔗		superkuh has joined #archiveteam-bs
13:57 ^🔗	joepie91	"To put things in perspective, HDFS servers leak 200 times more data compared to MongoDB servers, which are ten times more prevalent."
13:57 ^🔗	joepie91	~big data~
13:58 ^🔗	joepie91	JAA: hmm. the WARC stores the original chunked data in the WARC?
13:58 ^🔗	joepie91	ie. the stream of bytes as it appeared over the wire
13:58 ^🔗	joepie91	(as opposed to it being turned into just the content)
13:58 ^🔗	JRWR	I do find that strange for a format like WARC
14:02 ^🔗	joepie91	MrRadar: JRWR: please make sure to confirm intended WARC behaviour with somebody who has access to the final WARC spec, to ensure that nothing was changed from the draft
14:03 ^🔗	JRWR	We did
14:03 ^🔗	joepie91	JRWR: does something still need to be disabled on the tracker?
14:03 ^🔗	*	joepie91 has tracker access
14:03 ^🔗	joepie91	(I'm still reading backlog)
14:03 ^🔗	JRWR	there is a issue open on the WARC Spec Github that explains the issue
14:03 ^🔗	JRWR	and currently wget is correct in its saving
14:04 ^🔗	JRWR	right now we are being throttled HARD by pixiv
14:05 ^🔗	JRWR	442053done + 94722out + 463227to do
14:05 ^🔗	Kalroth	they hit the anti-DDoS panic button
14:07 ^🔗	joepie91	JRWR: right, if something needs to be changed on the tracker and nobody is around, ping me :P
14:07 ^🔗	joepie91	(pinging me on Freenode results in faster responses)
14:07 ^🔗	JRWR	Ah
14:07 ^🔗	JRWR	Its OK for now, kind of wish pixiv had not throttled us
14:08 ^🔗	joepie91	I'm going to be pretty busy today though, so preferably include a very precise request of what needs changing so that it's just a few clicks for me and doesn't require extra thinking :P
14:11 ^🔗	JRWR	of course joepie91
14:12 ^🔗	JRWR	The only warning I've got on my dash right now is my storage is now half full
14:14 ^🔗		pizzaiolo has joined #archiveteam-bs
14:15 ^🔗		icedice has joined #archiveteam-bs
14:22 ^🔗	MrRadar	joepie91: Yeah, after reading the spec issue on Github I realized I was initially reading the spec wrong and wget is doing the right thing
14:22 ^🔗	MrRadar	I was confused about what the portalgraphics issue was earlier
14:23 ^🔗	MrRadar	I missed that it was due to extra whitespace in their chunked transfer headers that was the issue
14:23 ^🔗	MrRadar	Not the headers themselves
15:08 ^🔗		SHODAN_UI has quit IRC (Quit: zzz)
15:08 ^🔗		SHODAN_UI has joined #archiveteam-bs
15:27 ^🔗	JAA	joepie91: Yes, as far as I can tell, wget and wpull store the raw data stream in the WARCs. In a way, that's exactly what I'd expect, although I can also see some arguments for stripping transfer encoding first.
15:28 ^🔗	JAA	On a related note, I find it interesting that TLS certificates aren't stored in WARCs.
15:39 ^🔗	joepie91	JAA: that might just be a wget thing? I know that Heritrix stores a lot more stuff in WARCs than wget does, even down to DNS requests and responses
15:40 ^🔗	JAA	Oh yeah, DNS as well.
15:40 ^🔗	JAA	That's very well possible.
15:44 ^🔗	JAA	joepie91: Do you have an example Heritrix WARC? I'd like to know how they store those things.
15:46 ^🔗	joepie91	JAA: I don't, unfortunately. somebody in here has made some in the past
15:46 ^🔗	joepie91	but that was a few years ago :)
15:56 ^🔗		icedice has quit IRC (Ping timeout: 245 seconds)
16:28 ^🔗		icedice has joined #archiveteam-bs
17:11 ^🔗		JRWR has quit IRC (Ping timeout: 268 seconds)
17:16 ^🔗		ZexaronS has joined #archiveteam-bs
17:50 ^🔗		dashcloud has quit IRC (Ping timeout: 260 seconds)
17:54 ^🔗		fie has quit IRC (Read error: Operation timed out)
18:31 ^🔗		za3k has joined #archiveteam-bs
18:31 ^🔗	za3k	#internetarchive
18:32 ^🔗	za3k	i'm an idiot, ignore
18:33 ^🔗		Rai-chan has quit IRC (Ping timeout: 268 seconds)
18:33 ^🔗	za3k	What I meant to say is: https://za3k.com/github/ is back up and actively archiving the summary metadata of github projects (mostly names and ids)
18:33 ^🔗	za3k	ghtorrent.org is pretty much strictly better, does anyone already have a copy?
18:34 ^🔗		Jon has quit IRC (Ping timeout: 268 seconds)
18:35 ^🔗		Jon has joined #archiveteam-bs
18:37 ^🔗		Aoede has quit IRC (Ping timeout: 268 seconds)
18:37 ^🔗		purplebot has quit IRC (Ping timeout: 268 seconds)
18:37 ^🔗		Aoede has joined #archiveteam-bs
18:38 ^🔗		fie has joined #archiveteam-bs
18:43 ^🔗		purplebot has joined #archiveteam-bs
18:43 ^🔗		Rai-chan has joined #archiveteam-bs
19:01 ^🔗		SHODAN_UI has quit IRC (Remote host closed the connection)
19:06 ^🔗		xmc has quit IRC (Read error: Operation timed out)
19:09 ^🔗		xmc has joined #archiveteam-bs
19:09 ^🔗		swebb sets mode: +o xmc
19:28 ^🔗	SketchCow	FOS is now back to half-full, although you maniacs could probably fill it if you tried
19:28 ^🔗		JRWR has joined #archiveteam-bs
19:33 ^🔗		za3k has quit IRC (Quit: http://chat.efnet.org (EOF))
20:03 ^🔗	*	zino whistles innocently.
20:37 ^🔗		gui7 has joined #archiveteam-bs
20:37 ^🔗		gui7 has left LIST
20:38 ^🔗		gui7 has joined #archiveteam-bs
20:39 ^🔗		gui7 has quit IRC (Remote host closed the connection)
20:39 ^🔗		gui7 has joined #archiveteam-bs
20:40 ^🔗		SHODAN_UI has joined #archiveteam-bs
21:48 ^🔗		icedice has quit IRC (Quit: Leaving)
21:49 ^🔗		gui7 has quit IRC (Leaving.)
21:53 ^🔗	deathy	SketchCow: maybe update http://www.archiveteam.org/index.php?title=Rescuing_optical_media in case you know of better tools now? I'm also working through a backlog of personal CD/DVDs now...
22:42 ^🔗		dashcloud has joined #archiveteam-bs
23:01 ^🔗		yakfish has quit IRC (Operation timed out)
23:06 ^🔗		SHODAN_UI has quit IRC (Remote host closed the connection)
23:10 ^🔗	wp494	deathy: anyone can
23:20 ^🔗		twigfoot has joined #archiveteam-bs
23:37 ^🔗	Odd0002	I used readom for my ISOs and they all seem to work fine in a windows 98SE VM
23:39 ^🔗		ndiddy has joined #archiveteam-bs
23:42 ^🔗		GLaDOS has joined #archiveteam-bs
