#archiveteam-bs 2017-06-03,Sat

Time Nickname Message
00:05 🔗 BlueMaxim has joined #archiveteam-bs
00:35 🔗 icedice2 has joined #archiveteam-bs
00:36 🔗 icedice2 has quit IRC (Client Quit)
00:38 🔗 icedice has quit IRC (Ping timeout: 245 seconds)
00:49 🔗 j08nY has quit IRC (Quit: Leaving)
01:24 🔗 brayden__ has quit IRC (Read error: Operation timed out)
01:26 🔗 ndiddy has joined #archiveteam-bs
01:31 🔗 jrwr just got done reading Posterous/Story
01:31 🔗 jrwr Vincent is the hero all closures need
01:36 🔗 jrwr SketchCow: Wiki Still Broken :(
01:36 🔗 jrwr Could not store file "/tmp/phpBzwZUM" at "mwstore://local-backend/local-public/3/3f/Chatpivixlogo.gif".
03:06 🔗 SHODAN_UI has joined #archiveteam-bs
03:07 🔗 ndiddy has quit IRC (Read error: Operation timed out)
03:27 🔗 joepie91 I'm finding this Delicious acquisition far more amusing and fascinating than I probably should
03:27 🔗 joepie91 at this point I'm willing to accept the theory that his acquisition was purely as a "fuck you" to the Hacker News 'community' and chance of archival
03:47 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
04:02 🔗 jrwr has quit IRC (Quit: WeeChat 1.4)
04:15 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:22 🔗 Sk1d has joined #archiveteam-bs
04:26 🔗 dashcloud joepie91: the acquisition is just awesome, and it makes a huge amount of sense for Pinboard to own it- he gets to claim he's bought his largest competitor out, and gets to keep all the parts there (and continue mocking them)
06:02 🔗 alembic joepie91, why is this a "fuck you" to the HN folk? What did I miss? :P
06:07 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
06:09 🔗 brayden has joined #archiveteam-bs
06:09 🔗 swebb sets mode: +o brayden
06:18 🔗 Somebody2 Sanqui: FWIW, here's the nanog-l post: https://mailman.nanog.org/pipermail/nanog/2017-June/091273.html
06:32 🔗 Somebody2 Hello.
06:32 🔗 Somebody2 grumble, wrong channel
07:24 🔗 schbirid has joined #archiveteam-bs
07:53 🔗 antomatic has quit IRC (Read error: Operation timed out)
08:21 🔗 wp494 has quit IRC (Ping timeout: 506 seconds)
08:49 🔗 brayden has quit IRC (Quit: Leaving)
08:49 🔗 brayden has joined #archiveteam-bs
08:49 🔗 swebb sets mode: +o brayden
08:53 🔗 dashcloud has quit IRC (Read error: Operation timed out)
08:59 🔗 dashcloud has joined #archiveteam-bs
09:24 🔗 j08nY has joined #archiveteam-bs
09:26 🔗 antomatic has joined #archiveteam-bs
09:26 🔗 swebb sets mode: +o antomatic
09:41 🔗 schbirid has quit IRC (Quit: Leaving)
10:00 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
10:08 🔗 schbirid has joined #archiveteam-bs
10:23 🔗 ZexaronS has joined #archiveteam-bs
10:33 🔗 wp494 has joined #archiveteam-bs
11:19 🔗 DFJustin has quit IRC (Remote host closed the connection)
11:19 🔗 DFJustin has joined #archiveteam-bs
11:19 🔗 swebb sets mode: +o DFJustin
12:21 🔗 mls_ has joined #archiveteam-bs
13:21 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
13:52 🔗 schbirid install raspbian, upgrade packages, shutdown, never get your pi back online again \o/
14:09 🔗 SmileyG sounds good.
14:32 🔗 Frogging here we go
14:39 🔗 MrRadar FYI the grab of Pixiv chatrooms has started. Join us in #savepixiv
14:50 🔗 timmc alembic: Probably because Maciej was always mocking the startup crowd with their $5M investors and terrible business models.
14:50 🔗 timmc worshippers of rapid growth and more rapid demise
15:04 🔗 alembic Lol fair, although a lot of bootstrappers like Colin Percival are prominent there. Also, pinboard has an untouchable rep there :P
15:10 🔗 dashcloud if you remember, Pinboard also ran a nano VC fund once- 5 or so projects got $37 each in the Co-Prosperity Fund, which also attracted some bigger names in the Hacker News VC space who gave add-on rounds to some of the projects
15:15 🔗 zino has quit IRC (Remote host closed the connection)
15:29 🔗 pizzaiolo has joined #archiveteam-bs
15:35 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
15:42 🔗 pizzaiolo has joined #archiveteam-bs
16:06 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
16:15 🔗 pipt has joined #archiveteam-bs
17:16 🔗 BartoCH has joined #archiveteam-bs
17:32 🔗 SHODAN_UI has joined #archiveteam-bs
18:12 🔗 pipt has quit IRC (Quit: leaving)
18:13 🔗 antomatic has quit IRC (Read error: Operation timed out)
18:32 🔗 joepie91 alembic: Maciej is an outspoken critic of the "throw mountains of money at it, get acquired and fuck everybody who gets screwed in the process" culture of Silicon Valley, which is pretty much the centerpiece of Hacker News and its community
18:33 🔗 timmc He's been offering to acquire Delicious for some time now.
18:34 🔗 joepie91 alembic: and if I'm not mistaken, that same HN-y crowd has over the years repeatedly made grandiose claims about Pinboard being unsustainable because it didn't have the backing of Delicious and such (though this probably predates HN the site)
18:34 🔗 joepie91 so... yeah :P
18:34 🔗 joepie91 nowadays though Pinboard seems to be liked on HNM
18:34 🔗 joepie91 HN*
18:34 🔗 joepie91 although that may just be a different part of the community responding to it
18:39 🔗 Aranje has joined #archiveteam-bs
18:44 🔗 schbirid HN != HN
18:44 🔗 schbirid just like not all archiveteam are furry
18:49 🔗 Frogging wat
18:53 🔗 antomatic has joined #archiveteam-bs
18:53 🔗 swebb sets mode: +o antomatic
19:12 🔗 icedice has joined #archiveteam-bs
19:12 🔗 Metruptio has quit IRC (Remote host closed the connection)
19:12 🔗 icedice joepie91 you're one of the people behind adios-hola.org, right?
19:12 🔗 icedice Did Hola ever fix their shit?
19:25 🔗 ndiddy has joined #archiveteam-bs
19:29 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
19:48 🔗 JRWR_ has joined #archiveteam-bs
19:48 🔗 JRWR_ is now known as JRWR-Work
20:00 🔗 JRWR-Work has left
20:01 🔗 JRWR-Work has joined #archiveteam-bs
20:16 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
20:17 🔗 qwebirc28 has joined #archiveteam-bs
20:18 🔗 JRWR-Work has quit IRC (Ping timeout: 268 seconds)
21:06 🔗 pipt has joined #archiveteam-bs
21:20 🔗 SHODAN_UI has joined #archiveteam-bs
21:21 🔗 SketchCow Big paste coming in:
21:21 🔗 SketchCow Some crawls of images at http://www.portalgraphics.net/ appear to have garbage characters in them at the WARC level.
21:22 🔗 SketchCow http://web.archive.org/web/20160724223147/http://www.portalgraphics.net/pg/illust/?image_id=23
21:22 🔗 SketchCow http://web.archive.org/web/20160725181823/http://www.portalgraphics.net/pg/illust/?image_id=25
21:22 🔗 SketchCow http://web.archive.org/web/20160725222705/http://www.portalgraphics.net/pg/illust/?image_id=28
21:22 🔗 SketchCow http://web.archive.org/web/20160723205134/http://www.portalgraphics.net/pg/illust/?image_id=31
21:22 🔗 SketchCow http://web.archive.org/web/20160725091406/http://www.portalgraphics.net/pg/illust/?image_id=1332
21:22 🔗 SketchCow http://web.archive.org/web/20160724001629/http://www.portalgraphics.net/pg/illust/?image_id=10575
21:22 🔗 SketchCow http://web.archive.org/web/20160615222159/http://www.portalgraphics.net/pg/illust/?image_id=10575
21:22 🔗 SketchCow http://web.archive.org/web/20160725195224/http://www.portalgraphics.net/pg/illust/?image_id=10577
21:22 🔗 SketchCow http://web.archive.org/web/20160725161940/http://www.portalgraphics.net/pg/illust/?image_id=10578
21:22 🔗 SketchCow http://web.archive.org/web/20160723213647/http://www.portalgraphics.net/pg/illust/?image_id=10581
21:22 🔗 SketchCow http://web.archive.org/web/20160725154248/http://www.portalgraphics.net/pg/illust/?image_id=28089
21:22 🔗 SketchCow This is by no means the complete list, but it should be enough to show how big this is.
21:22 🔗 SketchCow The fact that other crawls of this same page do not have these "garbage" characters, coupled with the fact that they appear to be present in the WARCs themselves, tells me that these pages were not actually like this at the time of crawling, and that the WARCs are, for lack of a better word, corrupted. Can anything be done about this?
21:22 🔗 SketchCow -----
21:22 🔗 SketchCow If anyone has thoughts, let me know.
21:28 🔗 joepie91 icedice: yes, I am, and I don't know
21:28 🔗 Ravenloft has quit IRC (Ping timeout: 250 seconds)
21:29 🔗 icedice ok
21:29 🔗 joepie91 SketchCow: this reminds me of something... mirrors that were run off this WARC that I produced (using wget + warc) used to display some garbage at the top of the page: https://cryptoanarchy.freed0m4all.net/at-cryptoanarchy.warc.gz
21:30 🔗 joepie91 SketchCow: here's the original: https://archive.org/details/CryptoAnarchyWarc
21:30 🔗 joepie91 SketchCow: the mirrors were set up within days after I uploaded that WARC, so I doubt it was some kind of corruption that happens over a longer period of time
21:31 🔗 SketchCow Yeah, nothing corrupts over time
21:31 🔗 joepie91 I always assumed that the issue was in the source data, but perhaps not
21:31 🔗 joepie91 SketchCow: well, replication with non-ECC memory *can* cause corruption over time if there are insufficient integrity checks
21:31 🔗 joepie91 but then it'd be random garbage, this looks too consistent for that
21:32 🔗 qwebirc28 could it be encoding?
21:32 🔗 joepie91 unlikely
21:33 🔗 joepie91 qwebirc28: https://i.imgur.com/SJKTZfg.png
21:33 🔗 joepie91 that's not an encoding issue :P
21:33 🔗 qwebirc28 Interesting!
21:33 🔗 qwebirc28 is now known as JRWR-Work
21:34 🔗 JRWR-Work Maybe it came off a scraper with some bad ram / hardware?
21:34 🔗 joepie91 SketchCow: was the Portal Graphics grab produced using wget-warc or wpull?
21:35 🔗 SketchCow I don't know.
21:35 🔗 Frogging JRWR-Work: too consistent like joepie91 said
21:35 🔗 JRWR-Work I've seen entire Windows license files get intermixed with a database log before
21:36 🔗 Frogging hm
21:36 🔗 JRWR-Work this is a heavy UTF8/16 site
21:36 🔗 joepie91 software: Wget/1.14.lua.20130523-9a5c (linux-gnu)
21:36 🔗 joepie91 SketchCow: ^
21:36 🔗 joepie91 based on https://archive.org/download/archiveteam_portalgraphics_20160727140857/portalgraphics_20160727140857.megawarc.warc.gz
21:37 🔗 joepie91 is it possible that there's an implementation error in wget, perhaps?
21:37 🔗 SketchCow Maybe
21:37 🔗 joepie91 I know for a fact that my cryptoanarchy crawl was done using wget-warc, iirc before it was merged into mainline
21:37 🔗 JRWR-Work I had that issue in PHP working on Pixiv at one point
21:38 🔗 joepie91 hold on
21:38 🔗 joepie91 all the garbage on https://web.archive.org/web/20160724001629/http://www.portalgraphics.net/pg/illust/?image_id=10575 is within hex character range
21:38 🔗 joepie91 0-f
21:39 🔗 JRWR-Work Hrm
21:40 🔗 JRWR-Work inb4 we find that it's an encoding bug in gunzip
21:40 🔗 joepie91 looking up the cryptoanarchy wiki in the wayback, it doesn't show garbage there
21:40 🔗 joepie91 unsure why it showed on the mirrors back then
21:41 🔗 joepie91 JRWR-Work: that'd be... unpleasant :)
21:42 🔗 JRWR-Work I work in Infosec, wouldn't be the first time
21:42 🔗 joepie91 what every garbage insertion has in common is that it has newlines on both ends
21:42 🔗 joepie91 so \ngarbage\n
21:43 🔗 joepie91 let's run a little test...
21:47 🔗 joepie91 okay, so all the concatenated hex garbage for that portalgraphics URL is: ffb7db13a20211058e2b31ae20c20025925339d10b1541af28811c2cd1973b70011000000330000448320
21:47 🔗 joepie91 obviously not valid ASCII
21:47 🔗 joepie91 generally looks like garbage
21:48 🔗 joepie91 not even valid hex apparently
21:49 🔗 joepie91 ah, length of 85
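Editor's sketch of the check described above: a run of hex digits can only be decoded as whole bytes if its length is even, which is one way to read "not even valid hex". The string is the concatenated value quoted in the log; the function name `decodes_as_bytes` is made up for illustration and is not part of any tool mentioned here.

```python
# Concatenated garbage as quoted in the log above.
concatenated = "ffb7db13a20211058e2b31ae20c20025925339d10b1541af28811c2cd1973b70011000000330000448320"

def decodes_as_bytes(hex_string: str) -> bool:
    """Return True if the string is a whole number of hex-encoded bytes."""
    try:
        bytes.fromhex(hex_string)
        return True
    except ValueError:
        # Raised for an odd number of digits or a non-hex character.
        return False

print(len(concatenated), decodes_as_bytes(concatenated))
```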
21:50 🔗 joepie91 for image_id=28089, the resulting hex is:
21:50 🔗 joepie91 ffb82e1122808919fccccccccccccccccccccccccccbbccccbbccccbbccccbbccddddddddddddddddddddddddeeeeeeeeeeeeeeeeeeeeeeeeeeddddeeddddeeddddeedddd84079128089280893135172808919e28089280002808919728000280892808920f1012808928089280891012815ffb100552b320442044580ccccccddddddeedddd19f6930
21:50 🔗 joepie91 and the pattern is very different
21:51 🔗 zino has joined #archiveteam-bs
21:51 🔗 joepie91 it also seems to regularly fail around the same types of values - for example, in https://web.archive.org/web/20160725154248/http://www.portalgraphics.net/pg/illust/?image_id=28089, if you look at the source, you'll see that every time there's an ?image_id= link, it will insert numeric garbage around the ID
21:51 🔗 joepie91 oh, no, hex garbage*, not just numeric
21:52 🔗 JRWR-Work where is the lua for this
21:52 🔗 joepie91 tl;dr there's definitely a pattern here
21:52 🔗 joepie91 JRWR-Work: this = ?
21:52 🔗 JRWR-Work I would guess it was pulled with a warrior project at one point?
21:52 🔗 joepie91 JRWR-Work: yeah, likely
21:52 🔗 joepie91 sec
21:53 🔗 joepie91 JRWR-Work: that'd be https://github.com/ArchiveTeam/portalgraphics-grab then
21:54 🔗 joepie91 SketchCow: okay, so, as far as I can tell, all the garbage is additive; it does not *replace* any content, it just adds garbage to it. it seems that a regex of /\n([0-9a-f]+)\n/g successfully matches all of the garbage -- however, there's no guarantee that there's not legitimate content on some pages that might match this, although unlikely
21:54 🔗 joepie91 (especially given indentation usually being involved)
21:54 🔗 SketchCow Ugly
21:54 🔗 SketchCow Can you put this somewhere?
21:54 🔗 SketchCow This is all part of us assessing this stuff anyway
21:54 🔗 joepie91 so you *could* probably produce a 'fixed' WARC by replacing everything matching that regex with nothing, but it still doesn't explain why the garbage is there in the first place
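Editor's sketch of the replacement idea above, assuming the garbage is always a run of lowercase hex digits with a newline on each side. The sample strings are hypothetical, not taken from the actual WARC, and the second one illustrates the false-positive risk mentioned in the channel.

```python
import re

# Pattern quoted in the channel: a run of lowercase hex digits on its own line.
GARBAGE = re.compile(r"\n([0-9a-f]+)\n")

def strip_garbage(text: str) -> str:
    """Remove every newline-delimited hex run, including both newlines."""
    return GARBAGE.sub("", text)

# Hypothetical corrupted fragment: garbage inserted in the middle of an ID.
sample = '<a href="?image_id=\n1f\n28089">link</a>'
print(strip_garbage(sample))

# Legitimate content that happens to be all hex letters is removed as well.
print(strip_garbage("menu:\nbeef\nsalad"))
```

Because each match consumes its trailing newline, two garbage lines that sit directly next to each other would need a second pass; this sketch does not handle that case.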
21:54 🔗 SketchCow A project we REALLY SHOULD take up
21:54 🔗 joepie91 SketchCow: as in, in a more permanent medium?
21:57 🔗 joepie91 SketchCow: like, is there any place in particular you'd like me to put this?
21:58 🔗 arkiver this is bad :/
21:58 🔗 arkiver isn't it possible that the site had a problem and returned garbage for a short time?
21:59 🔗 arkiver nvm, the timestamps are hours away from each other
21:59 🔗 joepie91 my current bet is that it originates from some sort of rewrite code
21:59 🔗 joepie91 hold on
21:59 🔗 joepie91 let me generate a report
22:00 🔗 JRWR-Work But it IS in the megawarc, right?
22:00 🔗 arkiver yes
22:00 🔗 JRWR-Work so that's before the Wayback Machine rewrite then
22:00 🔗 joepie91 aye
22:00 🔗 arkiver hmm or not in the megawarcs?
22:00 🔗 arkiver wayback machine rewrite problem would be strange
22:01 🔗 JRWR-Work but understandable, DOM is hard
22:01 🔗 JRWR-Work could be an issue with the wget bindings with lua, a nasty edge case
22:01 🔗 joepie91 this doesn't look like something that'd be produced by a proper DOM stringifier
22:01 🔗 joepie91 prime suspect right now is the urlencoding stuff in the Lua
22:01 🔗 joepie91 which deals with hex
22:02 🔗 joepie91 anyway, sec
22:02 🔗 JRWR-Work Ah, if the page has strange control codes maybe
22:02 🔗 arkiver https://web.archive.org/web/20160723205134id_/http://www.portalgraphics.net/pg/illust/?image_id=31 is the not rewritten version
22:02 🔗 joepie91 still garbage
22:02 🔗 arkiver yeah
22:03 🔗 JRWR-Work Might be strange control codes urlencoded
22:04 🔗 JRWR-Work based on the fact that they seem to be repeating, but these are not hand-made pages, they are generated
22:04 🔗 timmc Does the WARC's Content-Length match before or after stripping the junk? (New to this...)
22:04 🔗 arkiver after
22:04 🔗 arkiver but that's very likely not the problem
22:04 🔗 arkiver it would not be able to parse the records if the Content-Length is incorrect
22:05 🔗 timmc But that's good for identifying whether you've stripped too much, I guess.
22:05 🔗 arkiver but we're not really stripping anything
22:05 🔗 arkiver it doesn't cut a record off after a certain length
22:09 🔗 joepie91 interesting
22:10 🔗 joepie91 sometimes there's whitespace after the hex values
22:10 🔗 joepie91 seems to be padded to 6 chars often but not always?
22:10 🔗 joepie91 er
22:10 🔗 joepie91 to multiples of 3 chars *
22:10 🔗 joepie91 actually, no, it's always padded to a multiple of 3 chars, my regex just misdetects...
22:11 🔗 * joepie91 returns to drawing table
22:12 🔗 arkiver let me check to be sure
22:13 🔗 arkiver oops
22:13 🔗 arkiver that was for another chat
22:15 🔗 joepie91 aw, fuck
22:15 🔗 joepie91 <s db cript type="text/javascript" src="http://platform.twitter.com/widgets.js"></script><!--<a href="http://twitter.com/share" class="twitter-share-button" data-count="none" data-via="portalgraphics" data-lang="ja">Tweet</a>
22:15 🔗 joepie91 :/
22:16 🔗 arkiver that " db " is not a typo?
22:16 🔗 joepie91 that's in the source...
22:16 🔗 joepie91 there's a newline-less instance of corruption
22:16 🔗 joepie91 using spaces
22:16 🔗 arkiver ugh :(
22:16 🔗 joepie91 but only one, it appears
22:16 🔗 arkiver I'll get the logs of this project
22:17 🔗 arkiver see if it was just random users who downloaded these
22:19 🔗 joepie91 weird
22:19 🔗 joepie91 <s 69 cript type="text/javascript" src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
22:19 🔗 joepie91 <s c3 cript type="text/javascript"><!--
22:19 🔗 joepie91 always in the same spot
22:19 🔗 joepie91 oh, or not
22:19 🔗 joepie91 <s c1 pan>ポタグラ</span>&nbsp;|&nbsp;<a href="https://web.archive.org/web/20160725154248/http://www.portalgraphics.net/oc/">openCanvas</a>&nbsp;|&nbsp;<a href="https://web.archive.org/web/20160725154248/http://www.portalgraphics.net/cl/">コミラボ</a></li>
22:20 🔗 arkiver not always in both?
22:20 🔗 joepie91 only seems to happen after <s though?
22:20 🔗 arkiver ah
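Editor's sketch for the space-delimited variant seen above, assuming it always has the shape `<s`, a space, a hex run, a space, then the rest of the tag name. That shape is inferred only from the four examples pasted in the channel; `repair_inline` is a made-up name.

```python
import re

# "<s", space, hex run, space, followed by the remainder of a tag name.
INLINE_GARBAGE = re.compile(r"<s ([0-9a-f]+) (?=[a-z])")

def repair_inline(text: str) -> str:
    """Rejoin a tag name that was split after '<s' by an inline hex run."""
    return INLINE_GARBAGE.sub("<s", text)

print(repair_inline('<s db cript type="text/javascript">'))
print(repair_inline("<s c1 pan>x</span>"))
```

A genuine `<s>` (strikethrough) element whose first attribute is a short all-hex word would also match, so any hit is worth reviewing by hand.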
22:23 🔗 joepie91 arkiver: here's a report for the last URL: http://sprunge.us/RjWi -- all instances using newlines (not spaces), where the middle line for each case is the garbage value, with a dot for each trailing space, and the first and last lines are the surrounding context
22:23 🔗 joepie91 there are definite patterns here
22:23 🔗 arkiver :(
22:24 🔗 joepie91 arkiver: the upside is that patterns are detectable :P
22:27 🔗 arkiver yeah
22:27 🔗 joepie91 but yeah, this won't be a simple one to fix
22:30 🔗 joepie91 arkiver: https://git.cryto.net/joepie91/garbagechecker
22:31 🔗 joepie91 arkiver: locate-snippets.js produces a report like I linked above, concatenate-garbage does what it says on the tin, `npm install` to install deps, then run either script with the URL as the first and only argr
22:31 🔗 joepie91 arg *
22:31 🔗 joepie91 in case you want to do more digging :P
22:32 🔗 joepie91 ./lib/extract-garbage obtains all the garbage (or well, as best as I can match so far, only newlines supported so far) from a given string
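Editor's sketch of the report format described above (context line, garbage line with a dot per trailing space, context line). This is a Python approximation written for illustration, not the code in the linked repository, and it only covers the newline-delimited case.

```python
import re

# A line made up only of lowercase hex digits, optionally padded with spaces.
HEX_LINE = re.compile(r"([0-9a-f]+)( *)")

def locate_snippets(text: str) -> list[tuple[str, str, str]]:
    """Return (line before, marked garbage, line after) for each hex-only line."""
    lines = text.split("\n")
    found = []
    # Skip the first and last line: garbage must have a newline on both sides.
    for i in range(1, len(lines) - 1):
        match = HEX_LINE.fullmatch(lines[i])
        if match:
            marked = match.group(1) + "." * len(match.group(2))
            found.append((lines[i - 1], marked, lines[i + 1]))
    return found

# Hypothetical fragment with two trailing spaces after the hex run.
for before, garbage, after in locate_snippets('<a href="?image_id=\n280  \n89">x</a>'):
    print(before, garbage, after, sep="\n")
```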
22:56 🔗 SHODAN_UI has quit IRC (Remote host closed the connection)
23:35 🔗 alembic (gogs looks a lot better than I remember...)
23:43 🔗 fie has joined #archiveteam-bs