[00:05] *** BlueMaxim has joined #archiveteam-bs
[00:35] *** icedice2 has joined #archiveteam-bs
[00:36] *** icedice2 has quit IRC (Client Quit)
[00:38] *** icedice has quit IRC (Ping timeout: 245 seconds)
[00:49] *** j08nY has quit IRC (Quit: Leaving)
[01:24] *** brayden__ has quit IRC (Read error: Operation timed out)
[01:26] *** ndiddy has joined #archiveteam-bs
[01:31] just got done reading Posterous/Story
[01:31] Vincent is the hero all closures need
[01:36] SketchCow: Wiki Still Broken :(
[01:36] Could not store file "/tmp/phpBzwZUM" at "mwstore://local-backend/local-public/3/3f/Chatpivixlogo.gif".
[03:06] *** SHODAN_UI has joined #archiveteam-bs
[03:07] *** ndiddy has quit IRC (Read error: Operation timed out)
[03:27] I'm finding this Delicious acquisition far more amusing and fascinating than I probably should
[03:27] at this point I'm willing to accept the theory that his acquisition was purely as a "fuck you" to the Hacker News 'community' and chance of archival
[03:47] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[04:02] *** jrwr has quit IRC (Quit: WeeChat 1.4)
[04:15] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:22] *** Sk1d has joined #archiveteam-bs
[04:26] joepie91: the acquisition is just awesome, and it makes a huge amount of sense for Pinboard to own it- he gets to claim he's bought his largest competitor out, and gets to keep all the parts there (and continue mocking them)
[06:02] joepie91, why is this a "fuck you" to HN folk? what did I miss? :P
[06:07] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[06:09] *** brayden has joined #archiveteam-bs
[06:09] *** swebb sets mode: +o brayden
[06:18] Sanqui: FWIW, here's the nanog-l post: https://mailman.nanog.org/pipermail/nanog/2017-June/091273.html
[06:32] Hello.
[06:32] grumble, wrong channel
[07:24] *** schbirid has joined #archiveteam-bs
[07:53] *** antomatic has quit IRC (Read error: Operation timed out)
[08:21] *** wp494 has quit IRC (Ping timeout: 506 seconds)
[08:49] *** brayden has quit IRC (Quit: Leaving)
[08:49] *** brayden has joined #archiveteam-bs
[08:49] *** swebb sets mode: +o brayden
[08:53] *** dashcloud has quit IRC (Read error: Operation timed out)
[08:59] *** dashcloud has joined #archiveteam-bs
[09:24] *** j08nY has joined #archiveteam-bs
[09:26] *** antomatic has joined #archiveteam-bs
[09:26] *** swebb sets mode: +o antomatic
[09:41] *** schbirid has quit IRC (Quit: Leaving)
[10:00] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[10:08] *** schbirid has joined #archiveteam-bs
[10:23] *** ZexaronS has joined #archiveteam-bs
[10:33] *** wp494 has joined #archiveteam-bs
[11:19] *** DFJustin has quit IRC (Remote host closed the connection)
[11:19] *** DFJustin has joined #archiveteam-bs
[11:19] *** swebb sets mode: +o DFJustin
[12:21] *** mls_ has joined #archiveteam-bs
[13:21] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[13:52] install raspbian, upgrade packages, shutdown, never get your pi back online again \o/
[14:09] sounds good.
[14:32] here we go
[14:39] FYI the grab of Pixiv chatrooms has started. Join us in #savepixiv
[14:50] alembic: Probably because Maciej was always mocking the startup crowd with their $5M investors and terrible business models.
[14:50] worshippers of rapid growth and more rapid demise
[15:04] Lol fair, although a lot of bootstrappers like Colin Percival are prominent there.
Also, pinboard has an untouchable rep there :P
[15:10] if you remember, Pinboard also ran a nano VC fund once- 5 or so projects got $37 each in the Co-Prosperity Fund, which also attracted some bigger names in the Hacker News VC space who gave add-on rounds to some of the projects
[15:15] *** zino has quit IRC (Remote host closed the connection)
[15:29] *** pizzaiolo has joined #archiveteam-bs
[15:35] *** pizzaiolo has quit IRC (Read error: Connection reset by peer)
[15:42] *** pizzaiolo has joined #archiveteam-bs
[16:06] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[16:15] *** pipt has joined #archiveteam-bs
[17:16] *** BartoCH has joined #archiveteam-bs
[17:32] *** SHODAN_UI has joined #archiveteam-bs
[18:12] *** pipt has quit IRC (Quit: leaving)
[18:13] *** antomatic has quit IRC (Read error: Operation timed out)
[18:32] alembic: Maciej is an outspoken critic of the "throw mountains of money at it, get acquired and fuck everybody who gets screwed in the process" culture of Silicon Valley, which is pretty much the centerpiece of Hacker News and its community
[18:33] He's been offering to acquire Delicious for some time now.
[18:34] alembic: and if I'm not mistaken, that same HN-y crowd has over the years repeatedly made grandiose claims about Pinboard being unsustainable because it didn't have the backing of Delicious and such (though this probably predates HN the site)
[18:34] so... yeah :P
[18:34] nowadays though Pinboard seems to be liked on HNM
[18:34] HN*
[18:34] although that may just be a different part of the community responding to it
[18:39] *** Aranje has joined #archiveteam-bs
[18:44] HN != HN
[18:44] just like not all archiveteam are furry
[18:49] wat
[18:53] *** antomatic has joined #archiveteam-bs
[18:53] *** swebb sets mode: +o antomatic
[19:12] *** icedice has joined #archiveteam-bs
[19:12] *** Metruptio has quit IRC (Remote host closed the connection)
[19:12] joepie91 you're one of the people behind adios-hola.org, right?
[19:12] Did Hola ever fix their shit?
[19:25] *** ndiddy has joined #archiveteam-bs
[19:29] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[19:48] *** JRWR_ has joined #archiveteam-bs
[19:48] *** JRWR_ is now known as JRWR-Work
[20:00] *** JRWR-Work has left
[20:01] *** JRWR-Work has joined #archiveteam-bs
[20:16] *** pizzaiolo has quit IRC (Read error: Connection reset by peer)
[20:17] *** qwebirc28 has joined #archiveteam-bs
[20:18] *** JRWR-Work has quit IRC (Ping timeout: 268 seconds)
[21:06] *** pipt has joined #archiveteam-bs
[21:20] *** SHODAN_UI has joined #archiveteam-bs
[21:21] Bit past coming in:
[21:21] Some crawls of images at http://www.portalgraphics.net/ appear to have garbage characters in them at the WARC level.
[21:22] http://web.archive.org/web/20160724223147/http://www.portalgraphics.net/pg/illust/?image_id=23
[21:22] http://web.archive.org/web/20160725181823/http://www.portalgraphics.net/pg/illust/?image_id=25
[21:22] http://web.archive.org/web/20160725222705/http://www.portalgraphics.net/pg/illust/?image_id=28
[21:22] http://web.archive.org/web/20160723205134/http://www.portalgraphics.net/pg/illust/?image_id=31
[21:22] http://web.archive.org/web/20160725091406/http://www.portalgraphics.net/pg/illust/?image_id=1332
[21:22] http://web.archive.org/web/20160724001629/http://www.portalgraphics.net/pg/illust/?image_id=10575
[21:22] http://web.archive.org/web/20160615222159/http://www.portalgraphics.net/pg/illust/?image_id=10575
[21:22] http://web.archive.org/web/20160725195224/http://www.portalgraphics.net/pg/illust/?image_id=10577
[21:22] http://web.archive.org/web/20160725161940/http://www.portalgraphics.net/pg/illust/?image_id=10578
[21:22] http://web.archive.org/web/20160723213647/http://www.portalgraphics.net/pg/illust/?image_id=10581
[21:22] http://web.archive.org/web/20160725154248/http://www.portalgraphics.net/pg/illust/?image_id=28089
[21:22] This is by no means the complete list, but it should be enough to show how big
this is.
[21:22] The fact other crawls of this same page do not have these "garbage" characters, coupled with the fact that they appear to be present in the WARCs themselves tell me that these pages were not actually like this at the time of crawling, and that the WARCs are, for lack of a better word, corrupted. Can anything be done about this?
[21:22] -----
[21:22] If anyone has thoughts, let me know.
[21:28] icedice: yes, I am, and I don't know
[21:28] *** Ravenloft has quit IRC (Ping timeout: 250 seconds)
[21:29] ok
[21:29] SketchCow: this reminds me of something... mirrors that were run off this WARC that I produced (using wget + warc) used to display some garbage at the top of the page: https://cryptoanarchy.freed0m4all.net/at-cryptoanarchy.warc.gz
[21:30] SketchCow: here's the original: https://archive.org/details/CryptoAnarchyWarc
[21:30] SketchCow: the mirrors were set up within days after I uploaded that WARC, so I doubt it was some kind of corruption that happens over a longer period of time
[21:31] Yeah, nothing corrupts over time
[21:31] I always assumed that the issue was in the source data, but perhaps not
[21:31] SketchCow: well, replication with non-ECC memory *can* cause corruption over time if there's insufficient integrity checks
[21:31] but then it'd be random garbage, this looks too consistent for that
[21:32] could it be encoding?
[21:32] unlikely
[21:33] qwebirc28: https://i.imgur.com/SJKTZfg.png
[21:33] that's not an encoding issue :P
[21:33] Interesting!
[21:33] *** qwebirc28 is now known as JRWR-Work
[21:34] Maybe it came off a scraper with some bad ram / hardware?
[21:34] SketchCow: was the Portal Graphics grab produced using wget-warc or wpull?
[21:35] I don't know.
[21:35] JRWR-Work: too consistent like joepie91 said
[21:35] I've seen entire windows license files get intermuxed with a database log before
[21:36] hm
[21:36] this is a heavy UTF8/16 site
[21:36] software: Wget/1.14.lua.20130523-9a5c (linux-gnu)
[21:36] SketchCow: ^
[21:36] based on https://archive.org/download/archiveteam_portalgraphics_20160727140857/portalgraphics_20160727140857.megawarc.warc.gz
[21:37] is it possible that there's an implementation error in wget, perhaps?
[21:37] Maybe
[21:37] I know for a fact that my cryptoanarchy crawl was done using wget-warc, iirc before it was merged into mainline
[21:37] I had that issue in PHP working on Pixiv at one point
[21:38] hold on
[21:38] all the garbage on https://web.archive.org/web/20160724001629/http://www.portalgraphics.net/pg/illust/?image_id=10575 is within hex character range
[21:38] 0-f
[21:39] Hrm
[21:40] inb4 we find that it's an encoding bug in gunzip
[21:40] looking up the cryptoanarchy wiki in the wayback, it doesn't show garbage there
[21:40] unsure why it showed on the mirrors back then
[21:41] JRWR-Work: that'd be... unpleasant :)
[21:42] I work in Infosec, wouldn't be the first time
[21:42] what every garbage insertion has in common is that it has newlines on both ends
[21:42] so \ngarbage\n
[21:43] let's run a little test...
[21:47] okay, so all the concatenated hex garbage for that portalgraphics URL is: ffb7db13a20211058e2b31ae20c20025925339d10b1541af28811c2cd1973b70011000000330000448320
[21:47] obviously not valid ASCII
[21:47] generally looks like garbage
[21:48] not even valid hex apparently
[21:49] ah, length of 85
[21:50] for image_id=28089, the resulting hex is:
[21:50] ffb82e1122808919fccccccccccccccccccccccccccbbccccbbccccbbccccbbccddddddddddddddddddddddddeeeeeeeeeeeeeeeeeeeeeeeeeeddddeeddddeeddddeedddd84079128089280893135172808919e28089280002808919728000280892808920f1012808928089280891012815ffb100552b320442044580ccccccddddddeedddd19f6930
[21:50] and the pattern is very different
[21:51] *** zino has joined #archiveteam-bs
[21:51] it also seems to regularly fail around the same types of values - for example, in https://web.archive.org/web/20160725154248/http://www.portalgraphics.net/pg/illust/?image_id=28089, if you look at the source, you'll see that every time there's an ?image_id= link, it will insert numeric garbage around the ID
[21:51] oh, no, hex garbage*, not just numeric
[21:52] where is the lua for this
[21:52] tl;dr there's definitely a pattern here
[21:52] JRWR-Work: this = ?
[21:52] I would guess it was pulled with a warrior project at one point?
[21:52] JRWR-Work: yeah, likely
[21:52] sec
[21:53] JRWR-Work: that'd be https://github.com/ArchiveTeam/portalgraphics-grab then
[21:54] SketchCow: okay, so, as far as I can tell, all the garbage is additive; it does not *replace* any content, it just adds garbage to it. it seems that a regex of /\n([0-9a-f]+)\n/g successfully matches all of the garbage -- however, there's no guarantee that there's not legitimate content on some pages that might match this, although unlikely
[21:54] (especially given indentation usually being involved)
[21:54] Ugly
[21:54] Can you put this somewhere?
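The test described above (collect every run of hex digits that sits alone between two newlines, then concatenate the runs) can be sketched roughly as follows. This is only an illustration, not the script that was actually run in the channel: the function names and the sample string are made up, and the pattern uses a lookahead for the closing newline, a small deviation from the regex quoted in the log, so that two runs on consecutive lines are both found.

```python
import re

# A run of lowercase hex digits alone on a line. The closing newline is a
# lookahead (assumption, differs from the quoted regex) so that runs on
# consecutive lines do not hide each other.
HEX_LINE = re.compile(r"\n([0-9a-f]+)(?=\n)")


def find_hex_runs(body: str) -> list[str]:
    """Return every hex-only line found in body, in document order."""
    return HEX_LINE.findall(body)


def concatenated_garbage(body: str) -> str:
    """Join all suspected garbage runs into one string for inspection."""
    return "".join(find_hex_runs(body))


if __name__ == "__main__":
    # Hypothetical page fragment, loosely modelled on the ?image_id= case.
    sample = '<html>\nffb\n<a href="?image_id=\n28089\n">x</a>\n1f\n</html>\n'
    print(find_hex_runs(sample))
    print(concatenated_garbage(sample))
```

As noted in the discussion, a legitimate line consisting only of hex digits would match too, so the output is a list of suspects rather than proof of corruption.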
[21:54] This is all part of us assessing this stuff anyway
[21:54] so you *could* probably produce a 'fixed' WARC by replacing everything matching that regex with nothing, but it still doesn't explain why the garbage is there in the first place
[21:54] A project we REALLY SHOULD take up
[21:54] SketchCow: as in, in a more permanent medium?
[21:57] SketchCow: like, is there any place in particular you'd like me to put this?
[21:58] this is bad :/
[21:58] isn't it possible that the site had a problem and returned garbage at some point in time for a short time?
[21:59] nvm, the timestamps are hours away from each other
[21:59] my current bet is that it originates from some sort of rewrite code
[21:59] hold on
[21:59] let me generate a report
[22:00] But it IS in the megawarc, right?
[22:00] yes
[22:00] so that's before the Wayback machine rewrite then
[22:00] aye
[22:00] hmm or not in the megawarcs?
[22:00] wayback machine rewrite problem would be strange
[22:01] but understandable, DOM is hard
[22:01] could be an issue with the wget bindings with lua, a nasty edgecase
[22:01] this doesn't look like something that'd be produced by a proper DOM stringifier
[22:01] prime suspect right now is the urlencoding stuff in the Lua
[22:01] which deals with hex
[22:02] anyway, sec
[22:02] Ah, if the page has strange control codes maybe
[22:02] https://web.archive.org/web/20160723205134id_/http://www.portalgraphics.net/pg/illust/?image_id=31 is the not rewritten version
[22:02] still garbage
[22:02] yeah
[22:03] Might be strange control codes, urlencoded
[22:04] based off they seem to be repeating, but these are not hand made pages, they are generated
[22:04] Does the WARC's Content-Length match before or after stripping the junk? (New to this...)
[22:04] after
[22:04] but that's very likely not the problem
[22:04] it would not be able to parse the records if the Content-Length is incorrect
[22:05] But that's good for identifying whether you've stripped too much, I guess.
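A rough sketch of the 'fixed' body idea discussed above: replace everything matching the quoted regex with nothing, then compare the cleaned length against a declared Content-Length to see whether too much was stripped. Everything here is an assumption for illustration: the function names are invented, the declared length is passed in by the caller rather than read from a real WARC record, and no WARC library is involved.

```python
import re

# The regex quoted in the log, applied to raw bytes. Both surrounding
# newlines are removed together with the hex run, as proposed there.
GARBAGE = re.compile(rb"\n[0-9a-f]+\n")


def strip_garbage(body: bytes) -> bytes:
    """Replace every match of the garbage pattern with nothing."""
    return GARBAGE.sub(b"", body)


def length_matches_after_strip(body: bytes, declared_length: int) -> bool:
    """True if the cleaned body has exactly the declared length.

    declared_length is a caller-supplied number (hypothetical input); a
    mismatch suggests the pattern removed legitimate content or missed
    some garbage.
    """
    return len(strip_garbage(body)) == declared_length


if __name__ == "__main__":
    clean = b"<p>hello</p><p>world</p>"
    corrupted = b"<p>hello</p>\n1a2b\n<p>world</p>"
    print(strip_garbage(corrupted) == clean)
    print(length_matches_after_strip(corrupted, len(clean)))

    # A legitimate hex-only line is stripped as well, so the check fails
    # when the declared length already described the untouched body.
    legit = b"<pre>\ncafe\n</pre>"
    print(length_matches_after_strip(legit, len(legit)))
```

The last case shows the length check acting as the safety net mentioned in the chat: it cannot explain where the garbage came from, only flag bodies where stripping changed more than expected.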
[22:05] but we're not really stripping anything
[22:05] it doesn't cut a record off after a certain length
[22:09] interesting
[22:10] sometimes there's whitespace after the hex values
[22:10] seems to be padded to 6 chars often but not always?
[22:10] er
[22:10] to multiples of 3 chars *
[22:10] actually, no, it's always padded to a multiple of 3 chars, my regex just misdetects...
[22:11] * joepie91 returns to drawing table
[22:12] let me check to be sure
[22:13] oops
[22:13] that was for another chat
[22:15] aw, fuck
[22:15]