[00:28] *** BlueMaxim has joined #archiveteam-bs
[00:54] *** dashcloud has quit IRC (Ping timeout: 260 seconds)
[00:56] *** dashcloud has joined #archiveteam-bs
[01:10] *** JesseW has joined #archiveteam-bs
[01:14] so i'm doing a different brute force method for SBS
[01:15] https://archive.org/details/www.sbs.com.au-news-node-190k-20160820
[01:16] i'm now doing like 7k url sets at once
[01:16] i had to do this cause in the 190k area it's going from odd to even back to odd numbers
[01:19] also i'm close to being done with nasa docs for 1983
[01:20] turns out +100 pdfs didn't get uploaded
[01:34] deals.kinja.com is saved: https://archive.org/details/@chris85?and[]=subject:%22deals.kinja.com%22
[01:58] *** username1 has joined #archiveteam-bs
[02:02] *** schbirid2 has quit IRC (Read error: Operation timed out)
[02:17] *** tomwsmf has joined #archiveteam-bs
[02:34] *** REiN^ has quit IRC ()
[02:54] *** tomaspark has joined #archiveteam-bs
[03:06] *** JesseW has quit IRC (Quit: Leaving.)
[03:07] *** JesseW has joined #archiveteam-bs
[03:40] ez.gizmodo.com is saved and is being uploaded
[03:41] *es.gizmodo.com
[03:46] *** DFJustin has quit IRC (Remote host closed the connection)
[03:48] *** zyphlar has quit IRC (Quit: Connection closed for inactivity)
[04:17] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:25] *** Sk1d has joined #archiveteam-bs
[04:39] *** Start has quit IRC (Quit: Disconnected.)
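A note on the 01:16 messages: since the node IDs in the 190k area do not stay on one parity, stepping by 2 would miss pages, so every ID in the range has to be tried. Below is a minimal Python sketch of splitting an ID range into sets of about 7k URLs. The URL pattern is only guessed from the item name and is an assumption, as are the function name and the example range; this is not the script actually used in the channel.

```python
from typing import Iterator, List

# Hypothetical URL pattern, guessed from the archive.org item name
# "www.sbs.com.au-news-node-190k-20160820"; not confirmed by the log.
NODE_URL = "http://www.sbs.com.au/news/node/{node_id}"


def url_batches(start: int, stop: int, batch_size: int = 7000) -> Iterator[List[str]]:
    """Yield lists of node URLs covering every ID in [start, stop).

    Every ID is included (odd and even), so a change in parity
    partway through the range cannot cause pages to be skipped.
    """
    for first in range(start, stop, batch_size):
        last = min(first + batch_size, stop)
        yield [NODE_URL.format(node_id=n) for n in range(first, last)]


if __name__ == "__main__":
    # Example range only; the real ranges are not given in the log.
    for index, batch in enumerate(url_batches(190000, 204000)):
        print(index, len(batch), batch[0], batch[-1])
```

Each yielded list could be written to its own URL file and handed to whatever crawler is in use, one set per run.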
[04:40] *** Start has joined #archiveteam-bs
[06:04] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:08] *** dashcloud has joined #archiveteam-bs
[06:11] *** RichardG has quit IRC (Read error: Connection reset by peer)
[06:12] *** RichardG has joined #archiveteam-bs
[06:19] *** tomwsmf has quit IRC (Ping timeout: 255 seconds)
[07:40] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[08:14] *** DFJustin has joined #archiveteam-bs
[08:23] *** Honno has joined #archiveteam-bs
[08:59] *** GE has joined #archiveteam-bs
[09:06] *** RichardG has quit IRC (Read error: Connection reset by peer)
[09:09] *** GE_ has joined #archiveteam-bs
[09:10] *** GE has quit IRC (Ping timeout: 255 seconds)
[09:10] *** GE_ is now known as GE
[09:16] *** wp494 has quit IRC (Read error: Connection reset by peer)
[10:11] *** GE_ has joined #archiveteam-bs
[10:13] *** GE has quit IRC (Ping timeout: 255 seconds)
[10:13] *** GE_ is now known as GE
[10:44] *** i0npulse has quit IRC (Ping timeout: 244 seconds)
[10:55] *** i0npulse has joined #archiveteam-bs
[11:14] *** tuankiet has quit IRC (Quit: Leaving)
[11:16] *** GE has quit IRC (Ping timeout: 255 seconds)
[11:25] *** wp494 has joined #archiveteam-bs
[11:26] *** tuankiet6 has joined #archiveteam-bs
[11:31] *** tuankiet6 has quit IRC (Quit: Leaving)
[11:31] *** tuankiet6 has joined #archiveteam-bs
[11:31] *** tuankiet6 has quit IRC (Remote host closed the connection)
[11:32] *** tuankiet6 has joined #archiveteam-bs
[11:32] *** tuankiet6 is now known as tuankiet
[11:48] *** GE has joined #archiveteam-bs
[12:03] *** REiN^ has joined #archiveteam-bs
[12:03] *** GE has quit IRC (Ping timeout: 255 seconds)
[12:17] *** GE has joined #archiveteam-bs
[12:22] *** REiN^ has quit IRC (Read error: Connection reset by peer)
[12:24] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:29] *** dashcloud has joined #archiveteam-bs
[12:32] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:40] *** kristian_ has joined #archiveteam-bs
[12:42] *** dashcloud has joined #archiveteam-bs
[12:54] *** GE has quit IRC (Ping timeout: 255 seconds)
[13:00] *** REiN^ has joined #archiveteam-bs
[13:10] *** GE has joined #archiveteam-bs
[13:42] *** RichardG has joined #archiveteam-bs
[13:43] *** GE has quit IRC (Ping timeout: 255 seconds)
[15:40] *** GE has joined #archiveteam-bs
[15:41] *** BlueMaxim has quit IRC (Quit: Leaving)
[15:47] *** username1 has quit IRC (Remote host closed the connection)
[16:41] *** kristian_ has quit IRC (Leaving)
[16:43] *** tuankiet has quit IRC (Remote host closed the connection)
[17:05] *** JesseW has joined #archiveteam-bs
[17:27] *** GE_ has joined #archiveteam-bs
[17:29] *** GE has quit IRC (Ping timeout: 255 seconds)
[17:29] *** GE_ is now known as GE
[17:43] *** bzc6p has joined #archiveteam-bs
[17:43] *** swebb sets mode: +o bzc6p
[17:44] Igloo^: can you please look at your dnshistory crawlers? Strange that you return only xn--ses554g (tiny) items.
[17:45] Sure
[17:46] It's reporting 403s, bzc6p
[17:46] Though opening in a browser works
[17:46] But that browser is from a different IP I guess.
[17:46] Yeah just trying from same IP 1 mo
[17:47] You must have been banned. My question is whether it was now (recently) or earlier, in the beginning.
[17:48] In the beginning it was fine
[17:48] Oh yep
[17:48] Banned.
[17:48] I mean, when did you restart it? Or haven't you stopped it at all?
[17:48] I restarted it when the jobs became available the other day
[17:49] But then you were already banned I guess.
[17:49] Possibly.
[17:49] Then they are not banning *now*. That's good.
[17:49] We've had the exact same situation with another member yesterday.
[17:50] I can only apologise I didn't notice
[17:50] Igloo^: Unless you can change IP, please stop your pipeline, because you're taking away all items
[17:50] I've stopped my pipeline
[17:50] Thanks
[17:50] Going to check the other server, see if it is also banned.
[17:50] http://imgur.com/a/cdDBv
[17:51] Is the error you get BTW.
[17:52] yes, they used to be assholes
[17:53] They implemented Cloudflare after the shutdown
[17:53] They were still being assholes iirc
[17:54] They kept the site up and haven't banned recently, so they are pending
[17:54] Assholity Pending
[17:55] Now we just need to find which others of us left their pipelines on and take all the yummy items away
[17:57] Do we need more pipelines? I've got one that isn't banned
[17:57] (It never ran dnshistory)
[17:58] I think yes we could have some more
[17:58] But you don't have any other banned one on, do you?
[17:58] No
[17:58] I only ran it on one pipeline
[17:58] ok
[17:58] We suffered really slow crawl rates last time
[17:59] Their site couldn't handle the load
[17:59] Let's move to #greatlookup
[18:14] Since when does pastebin show captchas when VIEWING content?
[18:17] can't say I've ever seen that but I guess it might be a rate limit thing?
[18:18] I've just seen it now. It says spam filter. But that used to be used when uploading, not when viewing. Can't see the logic but annoying.
[18:20] Pastebin.com is ad supported -- making sure entities worth money to their advertisers are the only ones initiating page loads seems consistent with that
[18:21] Yeah, blocking scrapers. But if I must select store fronts every time I want to see a paste, I'll rapidly stop using their service.
[18:21] as long as they have enough storage space -- *hosting* content uploaded by bots is fine for them (some advertising-vulnerable entities might even load pages with such content, which is a net win). It's *displaying* pages to non-advertising-vulnerable entities that they want to avoid
[18:21] *have to
[18:22] there are a LOT of pastebins -- I certainly wouldn't use pastebin.com anymore (and I haven't for a while)
[18:22] Which is not a net win
[18:23] yep, they have to balance refusing service to non-advertising-vulnerable entities with providing enough value to entities whose attention they *can* sell to get them to participate
[18:23] I don't use it either. I'd like a simple one
[18:24] I like termbin for stuff I have on the terminal
[18:24] One day I'll start my own one
[18:24] I don't remember one offhand for actual pastes
[18:24] oh, 0bin
[18:25] Yes, problem is sharing a paste is expected to be a very prompt thing, shouldn't take more than a few seconds. This captcha thing makes it too long, that's why I think it's not a good idea, at least for such a service.
[18:26] (I'm already accustomed to the fact that letting archivists do their job is far off the table)
[18:27] :-P
[18:28] I don't disagree
[18:28] It's just my opinion. We are different in terms of patience.
[18:29] (In fact, I'm usually patient but I don't like needless work)
[18:33] https://ybin.me/ is pretty nice for pastes... don't think it does syntax highlighting though
[18:34] *** bzc6p sets mode: +oooo achip Atluxity chfoo closure
[18:34] *** bzc6p sets mode: +oooo Coderjoe dashcloud DFJustin FalconK
[18:35] *** bzc6p sets mode: +oooo GLaDOS godane Infreq JesseW
[18:35] *** bzc6p sets mode: +oooo JW_work Kaz luckcolor midas
[18:35] *** bzc6p sets mode: +oooo PurpleSym Sanqui Smiley Start
[18:35] *** bzc6p sets mode: +oo wp494 yipdw
[18:36] What happened to aaaaaaaaa? He's been away, at least with this nickname, since New Year's Eve.
[18:38] A sudden influx of op...
[18:39] I have no idea what's up with aaaaaaaa
[18:48] *** schbirid has joined #archiveteam-bs
[18:50] I just found he had github activity in May so he's okay, just stays away from IRC.
[18:51] good :-)
[18:57] *** bzc6p has left
[19:13] *** GE_ has joined #archiveteam-bs
[19:13] *** tomwsmf has joined #archiveteam-bs
[19:14] *** GE has quit IRC (Ping timeout: 255 seconds)
[19:14] *** GE_ is now known as GE
[19:42] *** JesseW has quit IRC (Read error: Operation timed out)
[20:09] *** schbirid has quit IRC (Ping timeout: 1208 seconds)
[20:20] *** bzc6p has joined #archiveteam-bs
[20:20] *** swebb sets mode: +o bzc6p
[20:21] *** bzc6p has left
[20:31] *** kristian_ has joined #archiveteam-bs
[20:47] *** dashcloud has quit IRC (Read error: Operation timed out)
[20:51] looks like my first web archive to fail derive: https://catalogd.archive.org/log/553682276
[20:51] *** dashcloud has joined #archiveteam-bs
[20:53] SketchCow: i figure you would want to know about my first web archive to fail derive: https://archive.org/details/www.sbs.com.au-news-node-201k-20160820
[21:12] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[21:17] *** dashcloud has joined #archiveteam-bs
[21:30] *** RichardG has quit IRC (Ping timeout: 244 seconds)
[21:35] *** alembic has quit IRC (Read error: Operation timed out)
[21:37] *** alembic has joined #archiveteam-bs
[21:37] *** Honno has quit IRC (Read error: Operation timed out)
[21:45] *** alembic has quit IRC (Read error: Operation timed out)
[21:46] *** alembic has joined #archiveteam-bs
[22:04] gzip: 201k/www.sbs.com.au-news-node-201k-20160820.warc.gz: decompression OK, trailing garbage ignored
[22:05] i now see the problem
[22:05] md5sum is fine for everything in that item
[22:05] so may try a re-download of those urls
[22:05] !ao http://populationpyramid.net/static/data/mainData_en.json
[22:05] oops
[22:06] *** GE has quit IRC (Ping timeout: 255 seconds)
[22:09] *** GE has joined #archiveteam-bs
[22:09] *** JesseW has joined #archiveteam-bs
[22:11] *** kristian_ has quit IRC (Leaving)
[22:12] *** kristian_ has joined #archiveteam-bs
[23:09] *** RichardG has joined #archiveteam-bs
[23:16] *** kristian_ has quit IRC (Leaving)
[23:38] *** OpticalSw has joined #archiveteam-bs
[23:39] Hi Joe
[23:39] ohai :)
[23:39] many more big projects are coming yp
[23:39] up*
[23:39] flickr, tumblr
[23:40] http://pastebin.com/MxxTj9Lf
[23:40] Oooh might buy a ton of VMs then
[23:40] Some sentris ones likely
[23:43] joepie91?
[23:43] Any luck?
[23:48] hold on
[23:48] patience, I'm multitasking :)
[23:48] errr
[23:48] that log doesn't contain an error...
[23:49] chfoo: arkiver: who is currently responsible for seesaw?
[23:49] Hangon
[23:50] I was a bit of a retard I think
[23:52] I followed an oldish tutorial
[23:52] for livejournal
[23:52] Nope still failed
[23:53] OpticalSw: always follow the instructions for the thing you're setting up, in the README :P
[23:53] I was doing livejournal then you said Orkut haha
[23:56] python's easy_install worked
[23:57] that error looks like you're running some ancient Python component
[23:57] .egg as an archive format isn't new
[23:58] Fresh install on Jessie
[23:58] I guess it's pip then
[23:58] Will reinstall pip
[23:58] reinstalling from packages might not help; Debian ships an old version for some reason
[23:59] ah crap. Recommendation?
[23:59] virtualenv may make it possible to install one that isn't that old
[23:59] Could you give me some pointers?
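A note on the 22:04 gzip warning: a .warc.gz is typically a series of concatenated gzip members, so "trailing garbage" means there are bytes after the last member that decodes cleanly. Below is a minimal Python sketch, using only the standard `zlib` module, that counts the complete members in a file and reports the offset where valid gzip data ends. The function name and chunk size are assumptions made for illustration, and the log does not confirm that trailing bytes are what made the derive fail.

```python
import gzip
import io
import zlib
from typing import BinaryIO, Tuple


def scan_gzip_members(stream: BinaryIO, chunk_size: int = 65536) -> Tuple[int, int]:
    """Return (complete_members, end_offset_of_last_complete_member).

    If end_offset is smaller than the file size, the remaining bytes are
    either a truncated member or data that is not gzip at all.
    """
    members = 0
    good_end = 0   # offset just past the last fully decoded member
    offset = 0     # offset of the first byte of `buf` in the stream
    buf = b""
    decoder = None
    while True:
        if not buf:
            buf = stream.read(chunk_size)
            if not buf:
                break  # end of input
        if decoder is None:
            # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
            decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
        try:
            decoder.decompress(buf)  # decompressed output is discarded
        except zlib.error:
            break  # bytes at this point are not valid gzip data
        if decoder.eof:
            # Member finished; whatever was not consumed starts the next one.
            offset += len(buf) - len(decoder.unused_data)
            good_end = offset
            members += 1
            buf = decoder.unused_data
            decoder = None
        else:
            offset += len(buf)
            buf = b""
    return members, good_end


if __name__ == "__main__":
    # Self-contained demo on in-memory data: two members plus junk bytes.
    sample = gzip.compress(b"record one") + gzip.compress(b"record two") + b"junk"
    count, end = scan_gzip_members(io.BytesIO(sample))
    print(count, end, len(sample) - end)
```

On a real file, opening it with `open(path, "rb")` and comparing the returned offset with the file size shows how many trailing bytes there are and which records would be worth re-downloading.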