#archiveteam-bs 2017-06-05,Mon


Time Nickname Message
00:00 🔗 timmc The WARC had the space in it?
00:00 🔗 JAA Yes.
00:00 🔗 timmc Now I wonder what browsers do when confronted with that.
00:00 🔗 JAA They'll handle it fine, probably.
00:01 🔗 JAA Browsers are developed to handle all sorts of crap thrown at them by badly written web servers.
00:02 🔗 JAA (Unfortunately, because that means that the web servers never get fixed to conform to the standards.)
00:02 🔗 JAA But I guess the IA library which handles this stuff is more strict.
00:06 🔗 timmc OK, can confirm, Firefox is fine with it.
00:09 🔗 voidsta vvvvvvv/13
00:09 🔗 voidsta oops
00:14 🔗 xmc hi
00:14 🔗 voidsta hello
00:14 🔗 JAA Looks like others have experienced this problem of web servers including whitespace in the chunk size before: https://webcache.googleusercontent.com/search?q=cache:https%3A%2F%2Fjava.net%2Fjira%2Fbrowse%2FGRIZZLY%2D1684 (java.net shut down recently :-| )
00:15 🔗 joepie91 JAA: timmc: chunk sizes in the WARCs are padded to multiples of 3 hex chars
00:15 🔗 joepie91 using spaces
00:16 🔗 joepie91 (in my report, they're represented as dots)
00:17 🔗 JAA Yeah, but not always.
00:17 🔗 JAA Hmm, or maybe it is always.
00:19 🔗 joepie91 from what I could see, it's always
00:19 🔗 joepie91 just not all numbers in the source are chunk sizes :)
00:21 🔗 JAA Found the Apache bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=41364
00:21 🔗 JAA Although it seems unlikely that they were still using that version last year. :-P
00:23 🔗 JAA Someone claims there that "The spaces padding the hex value are ok according to rfc2616"
00:25 🔗 timmc joepie91: Yeah, I tested it with the added spaces and a fake HTTP server.
00:26 🔗 JAA I'm glad that there are standards, but I hate the standards. They're so hard to read at times.
00:26 🔗 timmc ugh implied LWS
00:26 🔗 JAA Yes, but...
00:27 🔗 JAA RFC 2616 was obsoleted by 7230. 7230 uses ABNF from RFC 5234.
00:27 🔗 JAA 5234 specifically states that "This specification for ABNF does not provide for implicit specification of linear white space." and "Any grammar that wishes to permit linear white space around delimiters or string segments must specify it explicitly."
00:28 🔗 JAA And I can't find that in 7230 anywhere.
00:29 🔗 JAA Actually, it states explicitly: "Rules about implicit linear whitespace between certain grammar productions have been removed; now whitespace is only allowed where specifically defined in the ABNF."
00:29 🔗 JRWR man the scaleway API is strange
00:30 🔗 Ravenloft has quit IRC (Read error: Operation timed out)
00:45 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
00:45 🔗 joepie91 timmc: LWS?
00:45 🔗 JAA Linear white space
00:46 🔗 JAA Meaning spaces (0x20) and horizontal tabs (0x09).
00:48 🔗 timmc JAA: Too little too late, I suppose.
00:48 🔗 JAA So here's what I think is happening on those pages: Apache, portalgraphics's web server, for some reason pads the chunk sizes with spaces to multiples of three. Since almost all clients probably support RFC 2616 (backwards compatibility etc.), this isn't actually a problem, although it isn't exactly conformant with the most up-to-date standards. (portalgraphics may have been using an old version of Apache though, from before the release of RFC 7230.)
00:49 🔗 JAA However, the IA library handling HTTP responses uses RFC 7230 and therefore doesn't allow whitespace after the chunk size. It fails to decode it and handles it as raw data instead, in effect simply dropping the "Transfer-Encoding: chunked" header.
00:50 🔗 JAA Which then leads to "garbage" showing up in the final response from the IA.
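The failure mode JAA describes can be sketched as the difference between a strict and a lenient chunk-size parser. This is a minimal illustration, not the IA library's or any browser's actual code: a strict RFC 7230 parser rejects the space-padded size field, while a lenient RFC 2616-era parser (with implied LWS) accepts it.

```python
import re

def parse_chunk_size(line: bytes, strict: bool = True) -> int:
    """Parse one chunked-transfer-encoding size line, e.g. b'1a4  \r\n'.

    strict=True follows RFC 7230, which allows no whitespace around
    the hex size; strict=False tolerates the trailing-space padding
    that portalgraphics's server emitted (arguably legal under
    RFC 2616's implied LWS, which RFC 7230 removed).
    """
    field = line.rstrip(b"\r\n")
    field = field.split(b";", 1)[0]  # drop any chunk extensions
    if strict:
        if not re.fullmatch(rb"[0-9A-Fa-f]+", field):
            raise ValueError("malformed chunk size: %r" % line)
    else:
        field = field.strip(b" \t")
    return int(field, 16)

# A lenient parser (like browsers) handles the padded size fine;
# a strict parser fails, so the body falls through undecoded and
# the chunk headers show up as "garbage" in the response.
print(parse_chunk_size(b"1a4  \r\n", strict=False))  # 420
try:
    parse_chunk_size(b"1a4  \r\n", strict=True)
except ValueError:
    print("strict parser rejects padded chunk size")
```

This matches the observed behavior: when decoding fails, the library treats the body as raw data, effectively dropping the "Transfer-Encoding: chunked" semantics.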
00:53 🔗 JRWR Anyone want my scaleway script
00:53 🔗 JRWR it deploys 7 grab scripts at a time
00:54 🔗 JRWR using the arm64 instances
00:54 🔗 JRWR (2.99Euro)
00:54 🔗 voidsta sure, share it :)
00:54 🔗 JAA By the way, if you request the id_ resource for that link I gave above, the IA sends it in chunked transfer encoding again; the raw traffic back from IA to the browser then looks like double-chunk-encoded: https://gist.githubusercontent.com/anonymous/accf1455050dcf01f19a3b6d1f7cf658/raw/89f5ab19945c49e3770bb6571e36b9f2ae8f1594/gistfile1.txt
00:55 🔗 JRWR and voidsta a script to clean up all the servers as well
00:55 🔗 voidsta JRWR: cool :)
00:55 🔗 JAA JRWR: I'd love to see how you automated it, although I won't be using it directly. I've been meaning to look into how to make the whole process of joining a new project a bit easier across multiple machines.
00:55 🔗 JRWR ya
00:56 🔗 JRWR its simple as fuck really
00:56 🔗 voidsta same
01:00 🔗 JRWR https://gist.github.com/JRWR/4b1cdbe0f55f00d92c10ff1e2355c5b7
01:00 🔗 JRWR there you go
01:00 🔗 JRWR thats both scripts
01:01 🔗 JRWR updated to show my script.sh it sent to the servers
01:01 🔗 ajft has joined #archiveteam-bs
01:01 🔗 JRWR mostly its the default one with a screen -dm on the run-pipeline
01:02 🔗 voidsta cool, thanks for sharing
01:03 🔗 JRWR its very shotgun style
01:03 🔗 JRWR but it gets the job done
01:05 🔗 ndiddy has quit IRC ()
01:05 🔗 JAA Thanks, I'll have a look at it tomorrow.
01:06 🔗 JAA "Some twats drove a van into pedestrians and stabbed people. But don't despair, this will never happen again once we start regulating the internet." FFS, Theresa...
01:12 🔗 j08nY has quit IRC (Quit: Leaving)
01:14 🔗 xmc seems reasonable
01:14 🔗 xmc worked here in the usa
01:19 🔗 joepie91 JAA: interestingly, the chunk cutoffs happened in specific places a lot. I wonder whether you can infer where variables were used (in string-concatenation, in PHP) from the chunk cutoff points
01:20 🔗 joepie91 JAA: also, if that theory is correct, it'd be relatively simple to fix all the WARCs
01:21 🔗 xmc if modifying warcs is acceptable.
01:22 🔗 xmc tbh i'm on the fence about that
01:22 🔗 xmc even in this case
01:22 🔗 JAA I guess it might be possible to infer something about internal buffer sizes etc.
01:23 🔗 JAA Agreed. The WARCs contain an accurate representation of what the web server delivered to clients. The fact that some clients or libraries can't handle it is secondary to preserving the original data in my opinion.
01:23 🔗 ndiddy has joined #archiveteam-bs
01:25 🔗 JAA We definitely need to get in touch with the IA though so they can fix their software if my assumptions above are correct.
01:25 🔗 BlueMaxim has joined #archiveteam-bs
01:25 🔗 JAA I'm sure they have tons of other pages with the same "bug".
01:27 🔗 xmc quite possibly!
01:36 🔗 schbirid2 has joined #archiveteam-bs
01:39 🔗 schbirid has quit IRC (Read error: Operation timed out)
01:42 🔗 joepie91 okay, that was poorly worded
01:42 🔗 joepie91 it'd be relatively simple to fix the wayback output*
01:42 🔗 joepie91 :p
01:42 🔗 joepie91 cc JAA xmc
01:42 🔗 joepie91 definitely not advocating for [irrevocably] modifying source data
01:42 🔗 xmc ah yep
01:43 🔗 joepie91 but eg. storing a 'fixed' copy of the WARC can be desirable for perf purposes (over fixing stuff on-the-fly in the wayback)
01:43 🔗 joepie91 without touching the original
01:53 🔗 icedice has quit IRC (Ping timeout: 250 seconds)
01:54 🔗 * JRWR spins up 100 instances
01:54 🔗 JRWR oops
02:03 🔗 voidsta :)
02:04 🔗 pizzaiolo has quit IRC (Ping timeout: 260 seconds)
02:26 🔗 JRWR has quit IRC (Quit: Page closed)
03:15 🔗 superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye)
03:24 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
03:34 🔗 Sk1d has joined #archiveteam-bs
03:44 🔗 ndiddy has quit IRC ()
03:51 🔗 ajft has left
04:08 🔗 slyphic has quit IRC (Read error: Operation timed out)
04:08 🔗 slyphic has joined #archiveteam-bs
04:13 🔗 MrRadar Over in #outofsteam DoomTay noticed that SPUF was returning data with chunked encoding when he used wpull to grab their front page
04:14 🔗 MrRadar I checked on my end and found out that they did that for both wpull and wget-lua if a custom user-agent is not specified
04:14 🔗 MrRadar With the ArchiveTeam user-agent SPUF returns data without chunking
04:15 🔗 MrRadar I've also verified that both wpull and wget-lua are producing WARCs with the same corruption as portalgraphics when SPUF returns data with chunked transfers
04:24 🔗 MrRadar We've also figured out that using a browser User-agent still results in chunked transfers, but adding an Accept header like an actual browser would will cause it to switch back to non-chunked transfers
04:29 🔗 MrRadar Can someone with access to the tracker please stop Pixiv?
04:29 🔗 MrRadar The chunked transfer issue is 100% affecting grabs from them
04:29 🔗 MrRadar xmc, arkiver, SketchCow: ^^^
04:37 🔗 MrRadar I'm grabbing the latest unpatched wget to see if that has the same issue as wpull and wget-lua
04:46 🔗 MrRadar This issue affects the official wget 1.19 release
04:52 🔗 MrRadar OK, I've tracked down the bug in the wget source code.
04:53 🔗 MrRadar The way WARC writing works in wget is there are two output files passed to the fd_read_body() function
04:53 🔗 MrRadar The first gets only the main content, the second gets both the content and headers
04:53 🔗 MrRadar WARC output stores the data from the second stream into the WARC.
04:54 🔗 MrRadar However, as the comment on the function says: "If OUT2 is non-NULL, the contents is also written to OUT2. OUT2 will get an exact copy of the response: if this is a chunked response, everything -- including the chunk headers -- is written to OUT2. (OUT will only get the unchunked response.)"
04:54 🔗 MrRadar So it's a deliberate design decision to dump the chunked transfer size as part of the WARC output
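The two-output design MrRadar describes can be modeled in a few lines. This is a simplified sketch of the behavior of wget's fd_read_body(), not its actual C code (trailers and chunk extensions are ignored): OUT2 receives the exact wire bytes, chunk headers included, which is what lands in the WARC, while OUT receives only the de-chunked entity body.

```python
import io

def read_chunked_body(raw: bytes):
    """Split a chunked HTTP body into (decoded, raw-copy) streams,
    mirroring wget's OUT / OUT2 distinction."""
    out, out2 = io.BytesIO(), io.BytesIO()
    stream = io.BytesIO(raw)
    while True:
        size_line = stream.readline()      # e.g. b"4\r\n"
        out2.write(size_line)              # raw copy keeps the chunk header
        size = int(size_line.split(b";", 1)[0], 16)
        chunk = stream.read(size)
        out.write(chunk)                   # decoded copy: payload only
        out2.write(chunk)
        out2.write(stream.read(2))         # chunk-terminating CRLF
        if size == 0:                      # last-chunk marker
            break
    return out.getvalue(), out2.getvalue()

wire = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
body, warc_copy = read_chunked_body(wire)
print(body)               # b'Wikipedia'
print(warc_copy == wire)  # True
```

So the WARC record ends up containing the `warc_copy` form, hex sizes and CRLFs included, exactly as the comment in the wget source says.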
04:55 🔗 voidsta so, not a bug?
04:56 🔗 MrRadar Actually, looks like it is a bug
04:56 🔗 voidsta hm
04:56 🔗 MrRadar According to the "WARC_ISO_28500_final_draft v018" document I found: "The payload of a 'response' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616]), with any transfer-encoding removed. If a truncated 'response' record block contains less than the full entity-body, the payload is considered truncated at the same position."
04:56 🔗 MrRadar The "with any transfer-encoding removed" bit indicates that this is non-compliant behavior on the part of wget
04:57 🔗 MrRadar As the chunked-transfer header would count as part of the transfer-encoding
04:57 🔗 MrRadar Unless they mean something completely different than the HTTP spec when they are referring to "transfer-encoding"
04:58 🔗 MrRadar (The document can be found here: http://archive-access.sourceforge.net/warc/WARC_ISO_28500_final_draft%20v018%20Zentveld%20080618.doc)
05:00 🔗 MrRadar Well, I need to get to bed. See you in the morning
05:00 🔗 * MrRadar is AFK
05:01 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
05:06 🔗 ItsYoda has joined #archiveteam-bs
05:07 🔗 Sk1d has joined #archiveteam-bs
05:15 🔗 godane so i have only uploaded 37k items this year so far
05:31 🔗 JRWR has joined #archiveteam-bs
05:31 🔗 JRWR something is going down
05:31 🔗 JRWR MrRadar: do you confirm?
05:32 🔗 MrRadar I'm not 100% sure
05:32 🔗 JRWR ill point my webserver to the ingress folder if you want to start checking
05:32 🔗 MrRadar wget is saving the chunked transfer headers into the WARCs but I'm pretty sure that's against the WARC spec
05:32 🔗 MrRadar But I'm definitely not an expert on the WARC spec
05:33 🔗 MrRadar We'd probably need to hear for sure from somebody at the IA who knows the spec very well
05:34 🔗 MrRadar I did confirm the WARCs I was uploading for Pixiv contained the hex garbage
05:35 🔗 JRWR Shit
05:35 🔗 JRWR I want a confirm with a OP
05:36 🔗 JRWR but I will keep the ingress in case of something crazy happening
05:37 🔗 JRWR MrRadar: http://spacescience.tech/warc/incoming-uploads/
05:37 🔗 JRWR you can start checking if you want
05:38 🔗 MrRadar Picking at random, this file has chunked transfer headers in the roomtop.php response body: http://spacescience.tech/warc/incoming-uploads/JRWR/pixiv-roomtop_100594-20170605-044020.warc.gz
05:39 🔗 JRWR ya I see them too
05:39 🔗 JRWR Interesting
05:40 🔗 JRWR im looking to see if there are any issues with the dumps
05:42 🔗 JRWR Yep found some
05:42 🔗 JRWR FUCK
05:42 🔗 JRWR http://spacescience.tech/warc/incoming-uploads/Abel_LF/pixiv-roomtop_618874-20170604-145848.warc.gz
05:42 🔗 JRWR Line 426
05:43 🔗 JRWR Shit
05:43 🔗 JRWR there is some in the AMFs
05:43 🔗 JRWR fffffffffffffff
05:44 🔗 JRWR I extracted all the static files
05:44 🔗 JRWR out of the 20, only 2 matched their SHA1s
05:44 🔗 JRWR These are bad dumps
05:45 🔗 JRWR Who do we ping MrRadar
05:45 🔗 JRWR http://imgur.com/6NfmQ
05:46 🔗 MrRadar I already tried pinging everyone who has tracker access, but none of them are online at the moment
05:46 🔗 MrRadar In the mean time you could reduce your rsync to 1 connection max
05:47 🔗 MrRadar Or just turn it off altogether
05:49 🔗 JRWR rsync is OFFLINE
05:49 🔗 MrRadar JRWR: Which AMF files are you seeing with this issue? In the one you linked none of the AMF files were transferred with chunked encoding
05:51 🔗 JRWR my bad it was the PNGs
05:51 🔗 MrRadar OK, yeah some of those are definitely affected
05:53 🔗 JRWR felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened.
05:54 🔗 JRWR ok
05:54 🔗 JRWR We got to fix this in the meantime
05:55 🔗 JRWR its def Wget-lua doing this
05:56 🔗 MrRadar Yeah, I actually tracked down the cause of the bug while you weren't online
05:56 🔗 MrRadar Inside wget
05:56 🔗 JRWR Good
05:56 🔗 JRWR a simple fix is to disable http1.1
05:56 🔗 JRWR and ask for HTTP/1.0
05:56 🔗 JRWR but that does disable keepalives
05:57 🔗 JRWR wait, how many dumps have been going on over the years with this issue?
05:57 🔗 JRWR I wonder if anyone ever checked
05:58 🔗 MrRadar While I haven't verified with the git history, this looks like it's been a problem since WARC support was first added to wget
05:58 🔗 godane so i got a 256gb usb stick Saturday
05:58 🔗 godane for $45
05:59 🔗 JRWR so..
05:59 🔗 JRWR thats ALL the dumps?
06:00 🔗 MrRadar Any ones that have data transferred with the chunked transfer-encoding
06:00 🔗 MrRadar Assuming my interpretation of the WARC spec is correct
06:00 🔗 JRWR hrm
06:00 🔗 MrRadar Given how extensive the issue is, it may be easier to just update the WARC spec to allow chunked transfer headers inside WARC response records
06:00 🔗 JRWR true
06:02 🔗 JRWR so the hex we are seeing are the headers for the next chunk?
06:02 🔗 godane so the wget WARC code was screwing things up?
06:02 🔗 MrRadar If I'm right, yes
06:02 🔗 MrRadar But I'm not sure I am
06:03 🔗 MrRadar When data is transferred with the HTTP "chunked" transfer-encoding, wget is writing the chunk headers into the WARC
06:03 🔗 godane but wouldn't that cause the last few years of archiving to have problems
06:03 🔗 pikhq Unless everyone's been misreading the spec the same way.
06:04 🔗 MrRadar The WARC spec says "The payload of a 'response' record with a target-URI of scheme 'http' or 'https' is defined as its 'entity-body' (per [RFC2616]), with any transfer-encoding removed."
06:04 🔗 godane but whatever this bug is its not with everything
06:05 🔗 MrRadar Yes, only when the web server uses "Transfer-encoding: chunked"
06:05 🔗 pikhq Not a *lot* of things used chunked encoding.
06:05 🔗 ranma is there a "best way" to back up a reddit post
06:05 🔗 godane ok
06:05 🔗 ranma with a lot of collapsed comment threads?
06:05 🔗 MrRadar Locally or with e.g. archivebot?
06:05 🔗 ranma for archive bat
06:05 🔗 ranma bot
06:06 🔗 ranma lol
06:06 🔗 ranma e.g. https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12/dienjss/?context=3
06:06 🔗 ranma er
06:06 🔗 ranma https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12
06:06 🔗 godane that at least means we shouldn't have a lot of corrupt data
06:06 🔗 MrRadar !a https://www.reddit.com/r/apple/comments/6ezhwm/iama_foxconn_insider_with_information_on_next_12/ without an ignore set should do the trick I think?
06:06 🔗 MrRadar (Make sure to have the trailing slash)
06:07 🔗 pikhq godane: It also implies it should be possible to find all of the data corrupted by this bug.
06:07 🔗 pikhq Though the act of finding all of it is definitely a big one just because of how much data there is to sift through...
06:10 🔗 ranma MrRadar: isn't that going to hit all the linked sites
06:10 🔗 ranma and then maybe a stupid number of other sites?
06:10 🔗 ranma !a scares me
06:10 🔗 MrRadar !a only recurses into URLs with the same prefix
06:11 🔗 MrRadar URLs with a different prefix will be visited but not recursively
06:11 🔗 Igloo has quit IRC (Read error: Operation timed out)
06:11 🔗 MrRadar That's why the trailing slash would be so important, to limit the scope of the recursion
06:11 🔗 ranma i've seen !a of example.com start to crawl marthastewart.com
06:11 🔗 ranma hm
06:12 🔗 ranma not sure if i used trailing slash
06:19 🔗 SketchCow What's the upshot of the bug
06:20 🔗 MrRadar SketchCow: When web servers return data with "Transfer-encoding: chunked" wget is saving information into the WARC that (I think?) the spec says should be stripped
06:20 🔗 MrRadar Specifically, the size of each data chunk
06:21 🔗 pikhq Everything sent from servers using chunked transfer encoding will have spurious hex digits and \r\n sequences in the data that were on the wire, but apparently WARC says aren't supposed to be there.
06:21 🔗 pikhq (that is, in the file itself)
06:21 🔗 MrRadar You should ask someone at the IA who is familiar with the WARC format about what the right way to handle chunked transfers is
06:21 🔗 MrRadar It's possible I'm just reading the spec wrong and wget is doing it right
06:22 🔗 pikhq https://github.com/iipc/warc-specifications/issues/22
06:22 🔗 pikhq That seems to imply you're reading the spec wrong.
06:24 🔗 MrRadar pikhq: Reading that discussion I think you're right
06:25 🔗 pikhq At the least, it's clear the *intention* is wget's behavior.
06:25 🔗 MrRadar Yes, they're very deliberately including the headers in the WARC
06:26 🔗 pikhq So, if you want to process WARC stuff (for rendering or what have you) you should probably be careful to take into account the transfer encoding, or else you'll get the spurious hex digits and such.
06:26 🔗 pikhq But if you're generating a WARC, that's supposed to be there.
06:26 🔗 MrRadar That makes sense
06:27 🔗 MrRadar Sorry for the false alarm everyone
06:27 🔗 MrRadar JRWR: If you're still around, please restart your rsync target
06:27 🔗 pikhq No worries. The standard text is genuinely confusing, and your interpretation is a valid one.
06:27 🔗 JRWR Been done already
06:27 🔗 pikhq (at least, if you're not reading the exact same way they are)
06:31 🔗 JRWR_ has joined #archiveteam-bs
06:34 🔗 JRWR has quit IRC (Ping timeout: 268 seconds)
06:36 🔗 JRWR_ is now known as JRWR
06:36 🔗 JRWR So overall that means IA's Wayback Machine doesn't follow the spec as well then
06:37 🔗 MrRadar I think the issue with portalgraphics was they were sending slightly malformed chunked encoding headers
06:37 🔗 MrRadar With extra padding?
06:37 🔗 MrRadar That the IA didn't handle but browsers did
06:37 🔗 MrRadar If my review of the logs is correct
06:37 🔗 MrRadar *chat logs
06:50 🔗 JRWR SketchCow: Looks like we got blacklisted at pixiv
06:51 🔗 MrRadar arkiver: ^^^
06:52 🔗 MrRadar It's not by IP since I can view URLs that fail through wget-lua just fine in my browser
06:55 🔗 MrRadar Pixiv appears to be running again
06:55 🔗 JRWR Ya
06:55 🔗 JRWR It looks like we got funneled
07:10 🔗 Whopper_ has joined #archiveteam-bs
07:13 🔗 Whopper has quit IRC (Ping timeout: 268 seconds)
08:00 🔗 SHODAN_UI has joined #archiveteam-bs
08:44 🔗 Nazca_ has joined #archiveteam-bs
08:45 🔗 Nazca_ funneled is good or bad?
08:45 🔗 Nazca has quit IRC (Read error: Operation timed out)
08:45 🔗 Nazca_ is now known as Nazca
08:55 🔗 Igloo has joined #archiveteam-bs
09:17 🔗 godane Donald Trump on Charlie Rose: https://archive.org/details/Charlie-Rose-1992-11-06
09:24 🔗 kristian_ has joined #archiveteam-bs
09:25 🔗 jtn2 has joined #archiveteam-bs
09:29 🔗 jtn2 has quit IRC (Read error: Operation timed out)
09:31 🔗 jtn2 has joined #archiveteam-bs
09:33 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
09:35 🔗 godane i'm close to half way point of uploads from last month
09:36 🔗 godane i only uploaded 955 items last month
09:36 🔗 godane i was grabbing the Mister Rogers stream and ripping tape this past month
09:40 🔗 jtn2 has quit IRC (Read error: Operation timed out)
09:42 🔗 jtn2 has joined #archiveteam-bs
10:07 🔗 jtn2 has quit IRC (Read error: Operation timed out)
10:12 🔗 JAA 06-05 06:37:06 < MrRadar> I think the issue with portalgraphics was they were sending slightly malformed chunked encoding headers -- Yes, that's how I understand it. Interesting that the WARC should have transfer encoding stripped. I guess it makes sense in a way though.
10:18 🔗 jtn2 has joined #archiveteam-bs
10:23 🔗 JAA But all in all, I don't think we need to stop current projects or anything like that. It wouldn't be hard to fix WARCs retroactively at some point if we want to do that.
10:25 🔗 JAA joepie91: Fixing it in the Wayback Machine should be easy. IA's library for handling HTTP responses in WARC files already deals with chunked encoding, just not with this "malformed" variant. No need to update WARCs or anything; instead, the library should be modified to handle the whitespace padding.
10:27 🔗 j08nY has joined #archiveteam-bs
10:27 🔗 Sanqui JAA: can you make some sort of writeup so this information doesn't get lost if somebody doesn't get to it right away?
10:29 🔗 JAA Sanqui: Yeah, sure.
10:42 🔗 jtn2 has quit IRC (Ping timeout: 250 seconds)
10:43 🔗 jtn2 has joined #archiveteam-bs
10:43 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
10:44 🔗 BlueMaxim has joined #archiveteam-bs
11:29 🔗 BlueMaxim has quit IRC (Ping timeout: 600 seconds)
11:30 🔗 BlueMaxim has joined #archiveteam-bs
11:55 🔗 SHODAN_UI has joined #archiveteam-bs
12:07 🔗 kristian_ has quit IRC (Quit: Leaving)
12:43 🔗 tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
12:52 🔗 JAA Anyone want to archive this? ;-) https://www.bleepingcomputer.com/news/security/hadoop-servers-expose-over-5-petabytes-of-data/
12:53 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:49 🔗 superkuh has joined #archiveteam-bs
13:57 🔗 joepie91 "To put things in perspective, HDFS servers leak 200 times more data compared to MongoDB servers, which are ten times more prevalent."
13:57 🔗 joepie91 ~big data~
13:58 🔗 joepie91 JAA: hmm. so the original chunked data is stored in the WARC?
13:58 🔗 joepie91 ie. the stream of bytes as it appeared over the wire
13:58 🔗 joepie91 (as opposed to it beiing turned into just the content)
13:58 🔗 JRWR I do find that strange for a format like WARC
14:02 🔗 joepie91 MrRadar: JRWR: please make sure to confirm intended WARC behaviour with somebody who has access to the *final* WARC spec, to ensure that nothing was changed from the draft
14:03 🔗 JRWR We did
14:03 🔗 joepie91 JRWR: does something still need to be disabled on the tracker?
14:03 🔗 * joepie91 has tracker access
14:03 🔗 joepie91 (I'm still reading backlog)
14:03 🔗 JRWR there is a issue open on the WARC Spec Github that explains the issue
14:03 🔗 JRWR and currently wget is correct in its saving
14:04 🔗 JRWR right now we are being throttled HARD by pixiv
14:05 🔗 JRWR 442053done + 94722out + 463227to do
14:05 🔗 Kalroth they hit the anti-DDoS panic button
14:07 🔗 joepie91 JRWR: right, if something needs to be changed on the tracker and nobody is around, ping me :P
14:07 🔗 joepie91 (pinging me on Freenode results in faster responses)
14:07 🔗 JRWR Ah
14:07 🔗 JRWR Its OK for now, kind of wish pixiv had not throttled us
14:08 🔗 joepie91 I'm going to be pretty busy today though, so preferably include a very precise request of what needs changing so that it's just a few clicks for me and doesn't require extra thinking :P
14:11 🔗 JRWR of course joepie91
14:12 🔗 JRWR The only warning I've got on my dash right now is my storage is now half full
14:14 🔗 pizzaiolo has joined #archiveteam-bs
14:15 🔗 icedice has joined #archiveteam-bs
14:22 🔗 MrRadar joepie91: Yeah, after reading the spec issue on Github I see I was initially reading the spec wrong and wget is doing the right thing
14:22 🔗 MrRadar I was confused about what the portalgraphics issue was earlier
14:23 🔗 MrRadar I missed that the issue was the *extra whitespace* in their chunked transfer headers
14:23 🔗 MrRadar Not the headers themselves
15:08 🔗 SHODAN_UI has quit IRC (Quit: zzz)
15:08 🔗 SHODAN_UI has joined #archiveteam-bs
15:27 🔗 JAA joepie91: Yes, as far as I can tell, wget and wpull store the raw data stream in the WARCs. In a way, that's exactly what I'd expect, although I can also see some arguments for stripping transfer encoding first.
15:28 🔗 JAA On a related note, I find it interesting that TLS certificates aren't stored in WARCs.
15:39 🔗 joepie91 JAA: that might just be a wget thing? I know that Heritrix stores a lot more stuff in WARCs than wget does, even down to DNS requests and responses
15:40 🔗 JAA Oh yeah, DNS as well.
15:40 🔗 JAA That's very well possible.
15:44 🔗 JAA joepie91: Do you have an example Heritrix WARC? I'd like to know how they store those things.
15:46 🔗 joepie91 JAA: I don't, unfortunately. somebody in here has made some in the past
15:46 🔗 joepie91 but that was a few years ago :)
15:56 🔗 icedice has quit IRC (Ping timeout: 245 seconds)
16:28 🔗 icedice has joined #archiveteam-bs
17:11 🔗 JRWR has quit IRC (Ping timeout: 268 seconds)
17:16 🔗 ZexaronS has joined #archiveteam-bs
17:50 🔗 dashcloud has quit IRC (Ping timeout: 260 seconds)
17:54 🔗 fie has quit IRC (Read error: Operation timed out)
18:31 🔗 za3k has joined #archiveteam-bs
18:31 🔗 za3k #internetarchive
18:32 🔗 za3k i'm an idiot, ignore
18:33 🔗 Rai-chan has quit IRC (Ping timeout: 268 seconds)
18:33 🔗 za3k What I meant to say is: https://za3k.com/github/ is back up and actively archiving the summary metadata of github projects (mostly names and ids)
18:33 🔗 za3k ghtorrent.org is pretty much strictly better, does anyone already have a copy?
18:34 🔗 Jon has quit IRC (Ping timeout: 268 seconds)
18:35 🔗 Jon has joined #archiveteam-bs
18:37 🔗 Aoede has quit IRC (Ping timeout: 268 seconds)
18:37 🔗 purplebot has quit IRC (Ping timeout: 268 seconds)
18:37 🔗 Aoede has joined #archiveteam-bs
18:38 🔗 fie has joined #archiveteam-bs
18:43 🔗 purplebot has joined #archiveteam-bs
18:43 🔗 Rai-chan has joined #archiveteam-bs
19:01 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
19:06 🔗 xmc has quit IRC (Read error: Operation timed out)
19:09 🔗 xmc has joined #archiveteam-bs
19:09 🔗 swebb sets mode: +o xmc
19:28 🔗 SketchCow FOS is now back to half-full, although you maniacs could probably fill it if you tried
19:28 🔗 JRWR has joined #archiveteam-bs
19:33 🔗 za3k has quit IRC (Quit: http://chat.efnet.org (EOF))
20:03 🔗 * zino whistles innocently.
20:37 🔗 gui7 has joined #archiveteam-bs
20:37 🔗 gui7 has left LIST
20:38 🔗 gui7 has joined #archiveteam-bs
20:39 🔗 gui7 has quit IRC (Remote host closed the connection)
20:39 🔗 gui7 has joined #archiveteam-bs
20:40 🔗 SHODAN_UI has joined #archiveteam-bs
21:48 🔗 icedice has quit IRC (Quit: Leaving)
21:49 🔗 gui7 has quit IRC (Leaving.)
21:53 🔗 deathy SketchCow: maybe update http://www.archiveteam.org/index.php?title=Rescuing_optical_media in case you know of better tools now? I'm also working through a backlog of personal CD/DVDs now...
22:42 🔗 dashcloud has joined #archiveteam-bs
23:01 🔗 yakfish has quit IRC (Operation timed out)
23:06 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
23:10 🔗 wp494 deathy: anyone can
23:20 🔗 twigfoot has joined #archiveteam-bs
23:37 🔗 Odd0002 I used readom for my ISOs and they all seem to work fine in a windows 98SE VM
23:39 🔗 ndiddy has joined #archiveteam-bs
23:42 🔗 GLaDOS has joined #archiveteam-bs