#archiveteam-bs 2017-06-03,Sat


Who What When
***BlueMaxim has joined #archiveteam-bs [00:05]
....... (idle for 30mn)
icedice2 has joined #archiveteam-bs
icedice2 has quit IRC (Client Quit)
icedice has quit IRC (Ping timeout: 245 seconds)
[00:35]
j08nY has quit IRC (Quit: Leaving) [00:49]
........ (idle for 35mn)
brayden__ has quit IRC (Read error: Operation timed out)
ndiddy has joined #archiveteam-bs
[01:24]
<jrwr> just got done reading Posterous/Story
Vincent is the hero all closures need
[01:31]
SketchCow: Wiki Still Broken :(
Could not store file "/tmp/phpBzwZUM" at "mwstore://local-backend/local-public/3/3f/Chatpivixlogo.gif".
[01:36]
................... (idle for 1h30mn)
***SHODAN_UI has joined #archiveteam-bs
ndiddy has quit IRC (Read error: Operation timed out)
[03:06]
..... (idle for 20mn)
<joepie91> I'm finding this Delicious acquisition far more amusing and fascinating than I probably should
at this point I'm willing to accept the theory that his acquisition was purely as a "fuck you" to the Hacker News 'community' and chance of archival
[03:27]
..... (idle for 20mn)
***SHODAN_UI has quit IRC (Remote host closed the connection) [03:47]
.... (idle for 15mn)
jrwr has quit IRC (Quit: WeeChat 1.4) [04:02]
Sk1d has quit IRC (Ping timeout: 250 seconds) [04:15]
Sk1d has joined #archiveteam-bs [04:22]
<dashcloud> joepie91: the acquisition is just awesome, and it makes a huge amount of sense for Pinboard to own it - he gets to claim he's bought his largest competitor out, and gets to keep all the parts there (and continue mocking them) [04:26]
.................... (idle for 1h36mn)
<alembic> joepie91, why is this a "fuck you" to HN folk? What did I miss? :P [06:02]
***Aranje has quit IRC (Quit: Three sheets to the wind)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
[06:07]
<Somebody2> Sanqui: FWIW, here's the nanog-l post: https://mailman.nanog.org/pipermail/nanog/2017-June/091273.html [06:18]
Hello.
grumble, wrong channel
[06:32]
........... (idle for 52mn)
***schbirid has joined #archiveteam-bs [07:24]
...... (idle for 29mn)
antomatic has quit IRC (Read error: Operation timed out) [07:53]
...... (idle for 28mn)
wp494 has quit IRC (Ping timeout: 506 seconds) [08:21]
...... (idle for 28mn)
brayden has quit IRC (Quit: Leaving)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
dashcloud has quit IRC (Read error: Operation timed out)
[08:49]
dashcloud has joined #archiveteam-bs [08:59]
...... (idle for 25mn)
j08nY has joined #archiveteam-bs
antomatic has joined #archiveteam-bs
swebb sets mode: +o antomatic
[09:24]
.... (idle for 15mn)
schbirid has quit IRC (Quit: Leaving) [09:41]
.... (idle for 19mn)
Sk1d has quit IRC (Ping timeout: 194 seconds) [10:00]
schbirid has joined #archiveteam-bs [10:08]
.... (idle for 15mn)
ZexaronS has joined #archiveteam-bs [10:23]
wp494 has joined #archiveteam-bs [10:33]
.......... (idle for 46mn)
DFJustin has quit IRC (Remote host closed the connection)
DFJustin has joined #archiveteam-bs
swebb sets mode: +o DFJustin
[11:19]
............. (idle for 1h2mn)
mls_ has joined #archiveteam-bs [12:21]
............. (idle for 1h0mn)
BlueMaxim has quit IRC (Read error: Operation timed out) [13:21]
....... (idle for 31mn)
<schbirid> install raspbian, upgrade packages, shutdown, never get your pi back online again \o/ [13:52]
.... (idle for 17mn)
<SmileyG> sounds good. [14:09]
..... (idle for 23mn)
<Frogging> here we go [14:32]
<MrRadar> FYI the grab of Pixiv chatrooms has started. Join us in #savepixiv [14:39]
<timmc> alembic: Probably because Maciej was always mocking the startup crowd with their $5M investors and terrible business models.
worshippers of rapid growth and more rapid demise
[14:50]
<alembic> Lol fair, although a lot of bootstrappers like Colin Percival are prominent there. Also, pinboard has an untouchable rep there :P [15:04]
<dashcloud> if you remember, Pinboard also ran a nano VC fund once - 5 or so projects got $37 each in the Co-Prosperity Fund, which also attracted some bigger names in the Hacker News VC space who gave add-on rounds to some of the projects [15:10]
***zino has quit IRC (Remote host closed the connection) [15:15]
pizzaiolo has joined #archiveteam-bs [15:29]
pizzaiolo has quit IRC (Read error: Connection reset by peer) [15:35]
pizzaiolo has joined #archiveteam-bs [15:42]
..... (idle for 24mn)
BartoCH has quit IRC (Ping timeout: 260 seconds) [16:06]
pipt has joined #archiveteam-bs [16:15]
............. (idle for 1h1mn)
BartoCH has joined #archiveteam-bs [17:16]
.... (idle for 16mn)
SHODAN_UI has joined #archiveteam-bs [17:32]
......... (idle for 40mn)
pipt has quit IRC (Quit: leaving)
antomatic has quit IRC (Read error: Operation timed out)
[18:12]
.... (idle for 19mn)
<joepie91> alembic: Maciej is an outspoken critic of the "throw mountains of money at it, get acquired and fuck everybody who gets screwed in the process" culture of Silicon Valley, which is pretty much the centerpiece of Hacker News and its community [18:32]
<timmc> He's been offering to acquire Delicious for some time now. [18:33]
<joepie91> alembic: and if I'm not mistaken, that same HN-y crowd has over the years repeatedly made grandiose claims about Pinboard being unsustainable because it didn't have the backing of Delicious and such (though this probably predates HN the site)
so... yeah :P
nowadays though Pinboard seems to be liked on HNM
HN*
although that may just be a different part of the community responding to it
[18:34]
***Aranje has joined #archiveteam-bs [18:39]
<schbirid> HN != HN
just like not all archiveteam are furry
[18:44]
<Frogging> wat [18:49]
***antomatic has joined #archiveteam-bs
swebb sets mode: +o antomatic
[18:53]
.... (idle for 19mn)
icedice has joined #archiveteam-bs
Metruptio has quit IRC (Remote host closed the connection)
[19:12]
<icedice> joepie91 you're one of the people behind adios-hola.org, right?
Did Hola ever fix their shit?
[19:12]
***ndiddy has joined #archiveteam-bs
SHODAN_UI has quit IRC (Remote host closed the connection)
[19:25]
.... (idle for 19mn)
JRWR_ has joined #archiveteam-bs
JRWR_ is now known as JRWR-Work
[19:48]
JRWR-Work has left
JRWR-Work has joined #archiveteam-bs
[20:00]
.... (idle for 15mn)
pizzaiolo has quit IRC (Read error: Connection reset by peer)
qwebirc28 has joined #archiveteam-bs
JRWR-Work has quit IRC (Ping timeout: 268 seconds)
[20:16]
.......... (idle for 48mn)
pipt has joined #archiveteam-bs [21:06]
SHODAN_UI has joined #archiveteam-bs [21:20]
<SketchCow> Bit past coming in:
Some crawls of images at http://www.portalgraphics.net/ appear to have garbage characters in them at the WARC level.
http://web.archive.org/web/20160724223147/http://www.portalgraphics.net/pg/illust/?image_id=23
http://web.archive.org/web/20160725181823/http://www.portalgraphics.net/pg/illust/?image_id=25
http://web.archive.org/web/20160725222705/http://www.portalgraphics.net/pg/illust/?image_id=28
http://web.archive.org/web/20160723205134/http://www.portalgraphics.net/pg/illust/?image_id=31
http://web.archive.org/web/20160725091406/http://www.portalgraphics.net/pg/illust/?image_id=1332
http://web.archive.org/web/20160724001629/http://www.portalgraphics.net/pg/illust/?image_id=10575
http://web.archive.org/web/20160615222159/http://www.portalgraphics.net/pg/illust/?image_id=10575
http://web.archive.org/web/20160725195224/http://www.portalgraphics.net/pg/illust/?image_id=10577
http://web.archive.org/web/20160725161940/http://www.portalgraphics.net/pg/illust/?image_id=10578
http://web.archive.org/web/20160723213647/http://www.portalgraphics.net/pg/illust/?image_id=10581
http://web.archive.org/web/20160725154248/http://www.portalgraphics.net/pg/illust/?image_id=28089
This is by no means the complete list, but it should be enough to show how big this is.
The fact other crawls of this same page do not have these "garbage" characters, coupled with the fact that they appear to be present in the WARCs themselves tell me that these pages were not actually like this at the time of crawling, and that the WARCs are, for lack of a better word, corrupted. Can anything be done about this?
-----
If anyone has thoughts, let me know.
[21:21]
<joepie91> icedice: yes, I am, and I don't know [21:28]
***Ravenloft has quit IRC (Ping timeout: 250 seconds) [21:28]
<icedice> ok [21:29]
<joepie91> SketchCow: this reminds me of something... mirrors that were run off this WARC that I produced (using wget + warc) used to display some garbage at the top of the page: https://cryptoanarchy.freed0m4all.net/at-cryptoanarchy.warc.gz
SketchCow: here's the original: https://archive.org/details/CryptoAnarchyWarc
SketchCow: the mirrors were set up within days after I uploaded that WARC, so I doubt it was some kind of corruption that happens over a longer period of time
[21:29]
<SketchCow> Yeah, nothing corrupts over time [21:31]
<joepie91> I always assumed that the issue was in the source data, but perhaps not
SketchCow: well, replication with non-ECC memory *can* cause corruption over time if there's insufficient integrity checks
but then it'd be random garbage, this looks too consistent for that
[21:31]
<qwebirc28> could it be encoding? [21:32]
<joepie91> unlikely
qwebirc28: https://i.imgur.com/SJKTZfg.png
that's not an encoding issue :P
[21:32]
<qwebirc28> Interesting! [21:33]
***qwebirc28 is now known as JRWR-Work [21:33]
<JRWR-Work> Maybe it came off a scraper with some bad ram / hardware? [21:34]
<joepie91> SketchCow: was the Portal Graphics grab produced using wget-warc or wpull? [21:34]
<SketchCow> I don't know. [21:35]
<Frogging> JRWR-Work: too consistent like joepie91 said [21:35]
<JRWR-Work> I've seen entire Windows license files get intermuxed with a database log before [21:35]
<Frogging> hm [21:36]
<JRWR-Work> this is a heavy UTF8/16 site [21:36]
<joepie91> software: Wget/1.14.lua.20130523-9a5c (linux-gnu)
SketchCow: ^
based on https://archive.org/download/archiveteam_portalgraphics_20160727140857/portalgraphics_20160727140857.megawarc.warc.gz
is it possible that there's an implementation error in wget, perhaps?
[21:36]
<SketchCow> Maybe [21:37]
<joepie91> I know for a fact that my cryptoanarchy crawl was done using wget-warc, iirc before it was merged into mainline [21:37]
<JRWR-Work> I had that issue in PHP working on Pixiv at one point [21:37]
<joepie91> hold on
all the garbage on https://web.archive.org/web/20160724001629/http://www.portalgraphics.net/pg/illust/?image_id=10575 is within hex character range
0-f
[21:38]
<JRWR-Work> Hrm
inb4 we find that it's an encoding bug in gunzip
[21:39]
<joepie91> looking up the cryptoanarchy wiki in the wayback, it doesn't show garbage there
unsure why it showed on the mirrors back then
JRWR-Work: that'd be... unpleasant :)
[21:40]
<JRWR-Work> I work in Infosec, wouldn't be the first time [21:42]
<joepie91> what every garbage insertion has in common is that it has newlines on both ends
so \ngarbage\n
let's run a little test...
okay, so all the concatenated hex garbage for that portalgraphics URL is: ffb7db13a20211058e2b31ae20c20025925339d10b1541af28811c2cd1973b70011000000330000448320
obviously not valid ASCII
generally looks like garbage
not even valid hex apparently
ah, length of 85
for image_id=28089, the resulting hex is:
ffb82e1122808919fccccccccccccccccccccccccccbbccccbbccccbbccccbbccddddddddddddddddddddddddeeeeeeeeeeeeeeeeeeeeeeeeeeddddeeddddeeddddeedddd84079128089280893135172808919e28089280002808919728000280892808920f1012808928089280891012815ffb100552b320442044580ccccccddddddeedddd19f6930
and the pattern is very different
[21:42]
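A minimal sketch of why the first concatenated string is "not even valid hex" as a byte sequence: hex decodes two characters per byte, so an odd-length run cannot be decoded. The helper name `decodes_as_hex` is made up for illustration; `bytes.fromhex` is the standard Python call.

```python
def decodes_as_hex(s: str) -> bool:
    """Return True if s decodes as a whole number of hex-encoded bytes."""
    try:
        bytes.fromhex(s)
        return True
    except ValueError:
        # Raised for non-hex characters and for an odd number of hex digits.
        return False


# Concatenated garbage for the image_id=10575 capture, copied from the log.
garbage = (
    "ffb7db13a20211058e2b31ae20c20025925339d10b1541af28811c2cd1973b7001"
    "1000000330000448320"
)

# Prints the length, whether it is odd, and whether it decodes as bytes.
print(len(garbage), len(garbage) % 2 == 1, decodes_as_hex(garbage))
```

This only checks the shape of the string; it says nothing about where the hex runs come from.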
***zino has joined #archiveteam-bs [21:51]
<joepie91> it also seems to regularly fail around the same types of values - for example, in https://web.archive.org/web/20160725154248/http://www.portalgraphics.net/pg/illust/?image_id=28089, if you look at the source, you'll see that every time there's an ?image_id= link, it will insert numeric garbage around the ID
oh, no, hex garbage*, not just numeric
[21:51]
<JRWR-Work> where is the lua for this [21:52]
<joepie91> tl;dr there's definitely a pattern here
JRWR-Work: this = ?
[21:52]
<JRWR-Work> I would guess it was pulled with a warrior project at one point? [21:52]
<joepie91> JRWR-Work: yeah, likely
sec
JRWR-Work: that'd be https://github.com/ArchiveTeam/portalgraphics-grab then
SketchCow: okay, so, as far as I can tell, all the garbage is additive; it does not *replace* any content, it just adds garbage to it. it seems that a regex of /\n([0-9a-f]+)\n/g successfully matches all of the garbage -- however, there's no guarantee that there's not legitimate content on some pages that might match this, although unlikely
(especially given indentation usually being involved)
[21:52]
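The regex joepie91 quotes, /\n([0-9a-f]+)\n/g, carries over to Python almost unchanged. A small sketch on an invented fragment (the `sample` string is made up, not taken from the WARC). One assumption beyond the original: the closing newline is written as a lookahead, so two garbage lines in a row are both reported.

```python
import re

# joepie91's pattern, with the closing newline as a lookahead so that
# back-to-back garbage lines do not consume each other's delimiters.
GARBAGE = re.compile(r"\n([0-9a-f]+)(?=\n)")

# Invented fragment: a link whose image_id has been split by a hex run,
# followed by two more hex runs on consecutive lines.
sample = '<a href="?image_id=\n280\n89">\nffb\ncc\n</a>'

print(GARBAGE.findall(sample))
```

As noted in the log, a legitimate line made up only of the characters 0-9 and a-f would also match, so hits still need a human look.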
<SketchCow> Ugly
Can you put this somewhere?
This is all part of us assessing this stuff anyway
[21:54]
<joepie91> so you *could* probably produce a 'fixed' WARC by replacing everything matching that regex with nothing, but it still doesn't explain why the garbage is there in the first place [21:54]
<SketchCow> A project we REALLY SHOULD take up [21:54]
<joepie91> SketchCow: as in, in a more permanent medium?
SketchCow: like, is there any place in particular you'd like me to put this?
[21:54]
<arkiver> this is bad :/
isn't it possible that the site had a problem and returned garbage for a short time at some point?
nvm, the timestamps are hours away from each other
[21:58]
<joepie91> my current bet is that it originates from some sort of rewrite code
hold on
let me generate a report
[21:59]
<JRWR-Work> But it IS in the megawarc, right? [22:00]
<arkiver> yes [22:00]
<JRWR-Work> so that's before the Wayback Machine rewrite then [22:00]
<joepie91> aye [22:00]
<arkiver> hmm or not in the megawarcs?
wayback machine rewrite problem would be strange
[22:00]
<JRWR-Work> but understandable, DOM is hard
could be an issue with the wget bindings for Lua, a nasty edge case
[22:01]
<joepie91> this doesn't look like something that'd be produced by a proper DOM stringifier
prime suspect right now is the urlencoding stuff in the Lua
which deals with hex
anyway, sec
[22:01]
<JRWR-Work> Ah, if the page has strange control codes maybe [22:02]
<arkiver> https://web.archive.org/web/20160723205134id_/http://www.portalgraphics.net/pg/illust/?image_id=31 is the not rewritten version [22:02]
<joepie91> still garbage [22:02]
<arkiver> yeah [22:02]
<JRWR-Work> Might be strange control codes, urlencoded
based on the fact that they seem to be repeating, but these are not hand-made pages, they are generated
[22:03]
<timmc> Does the WARC's Content-Length match before or after stripping the junk? (New to this...) [22:04]
<arkiver> after
but that's very likely not the problem
it would not be able to parse the records if the Content-Length is incorrect
[22:04]
<timmc> But that's good for identifying whether you've stripped too much, I guess. [22:05]
<arkiver> but we're not really stripping anything
it doesn't cut a record off after a certain length
[22:05]
<joepie91> interesting
sometimes there's whitespace after the hex values
seems to be padded to 6 chars often but not always?
er
to multiples of 3 chars *
actually, no, it's always padded to a multiple of 3 chars, my regex just misdetects...
* joepie91 returns to drawing table
[22:09]
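A sketch of how the trailing-whitespace observation could be checked: extend the pattern to capture spaces after the hex run and test whether run plus padding is a multiple of 3. The `sample` text and the `padded_lengths` helper are invented for illustration; the real captures may behave differently.

```python
import re

# Hex run, then optional trailing spaces, delimited by newlines. The
# closing newline is a lookahead so consecutive runs are all matched.
PADDED = re.compile(r"\n([0-9a-f]+)( *)(?=\n)")


def padded_lengths(text: str) -> list:
    """Return (hex_run, width_including_trailing_spaces) for each match."""
    return [
        (m.group(1), len(m.group(1)) + len(m.group(2)))
        for m in PADDED.finditer(text)
    ]


# Invented sample: three runs, two of them padded with trailing spaces.
sample = "x\nffb\ny\n2808  \nz\n19 \nw"

for run, width in padded_lengths(sample):
    print(repr(run), width, width % 3 == 0)
```

Any width that is not a multiple of 3 would point to either a false positive or a hole in the padding theory.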
<arkiver> let me check to be sure
oops
that was for another chat
[22:12]
<joepie91> aw, fuck
<s db cript type="text/javascript" src="http://platform.twitter.com/widgets.js"></script><!--<a href="http://twitter.com/share" class="twitter-share-button" data-count="none" data-via="portalgraphics" data-lang="ja">Tweet</a>
:/
[22:15]
<arkiver> that " db " is not a typo? [22:16]
<joepie91> that's in the source...
there's a newline-less instance of corruption
using spaces
[22:16]
<arkiver> ugh :( [22:16]
<joepie91> but only one, it appears [22:16]
<arkiver> I'll get the logs of this project
see if it was just random users who downloaded these
[22:16]
<joepie91> weird
<s 69 cript type="text/javascript" src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
<s c3 cript type="text/javascript"><!--
always in the same spot
oh, or not
<s c1 pan>ポタグラ</span>&nbsp;|&nbsp;<a href="https://web.archive.org/web/20160725154248/http://www.portalgraphics.net/oc/">openCanvas</a>&nbsp;|&nbsp;<a href="https://web.archive.org/web/20160725154248/http://www.portalgraphics.net/cl/">コミラボ</a></li>
[22:19]
<arkiver> not always in both? [22:20]
<joepie91> only seems to happen after <s though? [22:20]
<arkiver> ah [22:20]
<joepie91> arkiver: here's a report for the last URL: http://sprunge.us/RjWi -- all instances using newlines (not spaces), where the middle line for each case is the garbage value, with a dot for each trailing space, and the first and last lines are the surrounding context
there are definite patterns here
[22:23]
<arkiver> :( [22:23]
<joepie91> arkiver: the upside is that patterns are detectable :P [22:24]
<arkiver> yeah [22:27]
<joepie91> but yeah, this won't be a simple one to fix
arkiver: https://git.cryto.net/joepie91/garbagechecker
arkiver: locate-snippets.js produces a report like I linked above, concatenate-garbage does what it says on the tin, `npm install` to install deps, then run either script with the URL as the first and only argr
arg *
in case you want to do more digging :P
./lib/extract-garbage obtains all the garbage (or well, as best as I can match so far, only newlines supported so far) from a given string
[22:27]
..... (idle for 24mn)
***SHODAN_UI has quit IRC (Remote host closed the connection) [22:56]
........ (idle for 39mn)
<alembic> (gogs looks a lot better than I remember...) [23:35]
***fie has joined #archiveteam-bs [23:43]
