[00:12] so looks like i can now brute force the NASA docs [00:12] using the real url vs the url that redirects [00:17] I remember you said that a load seemed to be gone. Now might be a good time to double-check those? [00:28] *** venture37 has left [00:36] *** Ravenloft has joined #archiveteam [00:43] *** ris has quit IRC () [00:48] *** ccordova has quit IRC (Remote host closed the connection) [01:02] *** zhongfu has quit IRC (Quit: cya losers) [01:03] *** zhongfu has joined #archiveteam [01:06] *** j08nY has quit IRC (Quit: Leaving) [01:22] *** Start has quit IRC (Quit: Disconnected.) [01:23] *** Emcy_ has joined #archiveteam [01:28] *** Start has joined #archiveteam [01:32] *** davidar_ has joined #archiveteam [01:35] *** Start has quit IRC (Quit: Disconnected.) [01:42] *** nertzy has quit IRC (Read error: Operation timed out) [02:02] *** tfgbd_znc has quit IRC (Ping timeout: 633 seconds) [02:16] *** Fake-Name has joined #archiveteam [02:17] *** Fake-Nam1 has quit IRC (Read error: Operation timed out) [02:23] *** dashcloud has quit IRC (Ping timeout: 250 seconds) [02:26] *** dashcloud has joined #archiveteam [02:42] *** nertzy has joined #archiveteam [02:51] *** JesseW has joined #archiveteam [03:08] *** antomati_ has joined #archiveteam [03:08] *** swebb sets mode: +o antomati_ [03:10] *** oli_ has joined #archiveteam [03:16] *** TC01 has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** Igloo has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** godane has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** remsen has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** botpie91 has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** khaoohs_ has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** nwf_ has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** Coderjoe has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** MMovie has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** luckcolor has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** oli has 
quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** ploop has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** Lord_Nigh has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** antomatic has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** sivoais_ has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** SirCmpwn has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** bwn has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** Atom-- has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** mhazinsk has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** phuzion has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** rossdylan has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** aMunster has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** beardicus has quit IRC (hub.efnet.us ircd.choopa.net) [03:16] *** khaoohs_ has joined #archiveteam [03:16] *** TC01 has joined #archiveteam [03:16] *** Igloo has joined #archiveteam [03:16] *** godane has joined #archiveteam [03:16] *** ploop has joined #archiveteam [03:16] *** sivoais_ has joined #archiveteam [03:16] *** remsen has joined #archiveteam [03:16] *** SirCmpwn has joined #archiveteam [03:16] *** bwn has joined #archiveteam [03:16] *** rossdylan has joined #archiveteam [03:16] *** beardicus has joined #archiveteam [03:16] *** ircd.choopa.net sets mode: +o beardicus [03:16] *** swebb sets mode: +o beardicus [03:16] *** LordNigh2 has joined #archiveteam [03:18] *** aMunster has joined #archiveteam [03:29] *** remsen1 has joined #archiveteam [03:32] *** oli_ is now known as oli [03:32] *** LordNigh2 is now known as Lord_Nigh [03:32] *** luckcolor has joined #archiveteam [03:33] *** Coderjoe has joined #archiveteam [03:35] *** TC01 has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** Igloo has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** godane has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** remsen has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** aMunster has quit IRC (hub.efnet.us 
ircd.choopa.net) [03:35] *** khaoohs_ has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** ploop has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** sivoais_ has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** SirCmpwn has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** bwn has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** rossdylan has quit IRC (hub.efnet.us ircd.choopa.net) [03:35] *** beardicus has quit IRC (hub.efnet.us ircd.choopa.net) [03:37] *** bwn_ has joined #archiveteam [03:38] *** khaoohs_ has joined #archiveteam [03:38] *** TC01 has joined #archiveteam [03:38] *** Igloo has joined #archiveteam [03:38] *** godane has joined #archiveteam [03:38] *** ploop has joined #archiveteam [03:38] *** sivoais_ has joined #archiveteam [03:38] *** SirCmpwn has joined #archiveteam [03:38] *** rossdylan has joined #archiveteam [03:38] *** beardicus has joined #archiveteam [03:38] *** ircd.choopa.net sets mode: +o beardicus [03:38] *** swebb sets mode: +o beardicus [03:38] *** nwf_ has joined #archiveteam [03:39] *** aMunster has joined #archiveteam [03:43] *** jmad980 has quit IRC (Ping timeout: 633 seconds) [03:46] *** nwf_ has quit IRC (Read error: Connection reset by peer) [03:46] *** nwf_ has joined #archiveteam [03:50] *** bwn_ is now known as bwn [03:56] *** jmad980 has joined #archiveteam [03:56] *** Start has joined #archiveteam [04:53] *** Sk1d has quit IRC (Ping timeout: 194 seconds) [05:01] *** Sk1d has joined #archiveteam [05:20] *** VADemon has quit IRC (Quit: left4dead) [05:23] *** ndizzle has quit IRC (Read error: Operation timed out) [05:28] *** JesseW has quit IRC (Ping timeout: 370 seconds) [06:17] *** tomwsmf-a has quit IRC (Ping timeout: 258 seconds) [06:23] *** BartoCH has quit IRC (Read error: Connection reset by peer) [06:32] *** BartoCH has joined #archiveteam [06:41] *** Aranje has quit IRC (Quit: Three sheets to the wind) [07:16] *** DoomTay has quit IRC (Quit: Page closed) [07:17] *** Wuked has joined #archiveteam [07:29] 
*** Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) [07:33] *** atomotic has joined #archiveteam [08:02] *** aMunster has quit IRC (Read error: Operation timed out) [08:10] *** aMunster has joined #archiveteam [08:13] *** schbirid has joined #archiveteam [08:13] *** phuzion has joined #archiveteam [08:52] *** pikhq has quit IRC (Ping timeout: 506 seconds) [09:11] *** pikhq has joined #archiveteam [09:21] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) [09:30] *** dashcloud has quit IRC (Read error: Operation timed out) [09:30] *** Wuked has joined #archiveteam [09:42] *** BartoCH has quit IRC (Ping timeout: 260 seconds) [09:48] *** kristian_ has joined #archiveteam [09:49] *** BartoCH has joined #archiveteam [09:50] *** dashcloud has joined #archiveteam [09:50] *** pfallenop has quit IRC (Ping timeout: 260 seconds) [09:58] *** pfallenop has joined #archiveteam [09:58] *** mhazinsk has joined #archiveteam [10:19] *** metal_cam has joined #archiveteam [10:20] *** metalcamp has quit IRC (Ping timeout: 244 seconds) [10:25] *** dashcloud has quit IRC (Read error: Operation timed out) [10:28] *** dashcloud has joined #archiveteam [10:33] *** WinterFox has joined #archiveteam [10:33] *** WinterFox has quit IRC (Read error: Connection reset by peer) [10:33] *** W1nterFox has joined #archiveteam [10:44] strange [10:44] coursera works in webarchiveplayer [10:45] but looks like it doesn't exist in the wayback machine https://wayback-beta.archive.org/web/20160627062439/https://class.coursera.org/virology-001 [10:45] I'll be writing a little tool anyway to export full courses from the Wayback Machine [10:51] *** Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) [10:53] SketchCow: currently everything that is being uploaded from FOS to IA is not deriving. 
[10:57] *** Wuked has joined #archiveteam [11:35] *** brayden has quit IRC (Read error: Connection reset by peer) [11:36] *** brayden has joined #archiveteam [11:36] *** swebb sets mode: +o brayden [11:42] Yeah, I'll ask today. [11:45] Other than coursera are there any other active projects? [11:52] *** dashcloud has quit IRC (Read error: Operation timed out) [11:55] *** dashcloud has joined #archiveteam [12:03] *** kristian_ has quit IRC (Leaving) [12:22] *** ndiddy has joined #archiveteam [12:37] The FOS is focused on Coursera uploads and ArchiveBot uploads. Once it pushes through both backlogs, it should be much more effective very quickly. [12:41] *** rolfb has joined #archiveteam [12:42] *** BlueMaxim has quit IRC (Quit: Leaving) [12:45] *** dashcloud has quit IRC (Read error: Operation timed out) [12:48] Finally added date stamps to the homer shover page [12:49] *** dashcloud has joined #archiveteam [13:47] SketchCow: do you think you guys will be able to get everything before it closes? The deadline is damn close. [13:48] chfoo: can you send me the logs of coursera as soon as possible? [13:49] We have about 30 to grab - But I think they're all issued at the moment so I don't know if we can reissue, arkiver? [13:49] they're mostly issued at the moment [13:49] I'll have a look at what we can requeue [13:49] you only have the aboriginaled item right? [13:51] Yep [13:51] ok [13:59] *** Wuked has quit IRC (Ping timeout: 258 seconds) [13:59] *** VADemon has joined #archiveteam [14:00] *** Wuked has joined #archiveteam [14:10] *** dashcloud has quit IRC (Read error: Operation timed out) [14:11] arkiver: for the last round pls update the scripts to ignore 500 errors as fatal [14:13] which project [14:13] coursera [14:13] that's with the retrfinished right? 
[14:14] yeah the one i told you yesterday [14:14] that is fixed [14:14] currently only Igloo, Medowar and HCross have items though [14:14] and we're almost done [14:14] *** dashcloud has joined #archiveteam [14:15] main issue we are running into is FOS, and as the saying goes "too many cooks spoil the broth" [14:15] arkiver if you need me to run some items let me know [14:15] HCross: wat? XD [14:16] hold on, I'll add zino's target [14:17] Uploads are taking forever, luckcolor [14:18] understandable [14:19] luckcolor, we are talking several hours per upload [14:19] *** W1nterFox has quit IRC (Remote host closed the connection) [14:19] arkiver, that's more like it [14:19] zino: we are using your target, when the project is finished, can you sync it to FOS? [14:19] I'll give you a target on FOS by then [14:25] *** trs80 has quit IRC (Ping timeout: 190 seconds) [14:32] *** rolfb has quit IRC (Leaving...) [15:06] *** Piet0r has left [15:22] *** nertzy has quit IRC (Read error: Operation timed out) [15:26] *** Aranje has joined #archiveteam [15:41] no reply from email to dnshistory.com about getting a copy of their database, so I filled out the support webform on their site... [15:41] We'll do a project for the site [15:55] *** trs80 has joined #archiveteam [16:00] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) [16:04] arkiver: Sure. [16:06] Who admins FOS? When I start syncing there I might want to talk TCP settings. [16:09] *** JesseW has joined #archiveteam [16:14] *** DoomTay has joined #archiveteam [16:16] *** metal_cam is now known as metalcamp [16:25] *** RichardG has joined #archiveteam [16:26] I've reached #271 on the URLTeam leaderboard... go me! 
[16:33] *** JesseW has quit IRC (Ping timeout: 370 seconds) [16:54] zino: fos is run by SketchCow [16:54] TCP settings, you say [16:56] clearly, zino wants you to help him fill the disk faster than you can empty it [17:06] *** Ravenloft has quit IRC (Ping timeout: 244 seconds) [17:20] Under "Three people cared", the fos.textfiles.com/ARCHIVETEAM page now accurately shows the timestamp of archivebot uploads. So no more gaps in the "Uploaded" column going forward. [17:21] *** Tomcat_ has joined #archiveteam [17:22] *** Wuked has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) [17:33] *** rolfb has joined #archiveteam [17:34] *** arrith has quit IRC (Read error: Operation timed out) [17:42] *** rolfb has quit IRC (Ping timeout: 506 seconds) [18:02] SketchCow: Based on my attempt at upload to IA:s s3 servers I suspect they aren't configured for long-haul TCP. I'll annoy you about that when I start uploading to FOS later. [18:08] I'm happy to discuss networking with you, zino, I work at IA and am familiar with our setup. [18:08] (We have a LOT of people uploading from far away.) [18:08] * HCross hides in a corner [18:08] yea. sorry about the constant 300Mbps from france [18:09] We have 40 gigabits [18:09] same :p [18:09] re: the big france pipe [18:10] Our "network weathermap" is down today because we're adding a new ISP, but in general we only have a few gigabits incoming out of 40 max. [18:10] *** vitzli has joined #archiveteam [18:10] wumpus, who have you got coming in now? [18:11] Our friends at ISC, mostly. [18:11] might I suggest #archiveteam-bs [18:14] Or #internetarchive [18:15] wumpus: So what I'm talking about is just regular TCP window scaling. I have a hard time getting above more than a few megabytes per connection uploading to IA from Sweden. To sustain anything reasonable I have to use 30-50 parallel uploads. In contrast to Amazon S3 US West and US East where I can push quite a bit more. 
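(Editor's note on the window-scaling point just above: a single TCP connection can keep at most one receive window of data in flight per round trip, so its throughput is bounded by window / RTT. The sketch below works through that arithmetic. The 180 ms round-trip time, the window sizes, and the 300 Mbit/s target are illustrative assumptions, not measurements from the log, and the function names are hypothetical.)

```python
import math


def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP connection's throughput in bits per second.

    At most one window of data can be unacknowledged per round trip,
    so the ceiling is window / RTT.
    """
    return window_bytes * 8 / rtt_seconds


def connections_needed(target_bps: float, window_bytes: int, rtt_seconds: float) -> int:
    """Number of parallel connections needed to reach target_bps."""
    per_connection = max_throughput_bps(window_bytes, rtt_seconds)
    return math.ceil(target_bps / per_connection)


if __name__ == "__main__":
    rtt = 0.180            # assumed long-haul round-trip time in seconds
    target = 300_000_000   # assumed target rate: 300 Mbit/s

    # 64 KiB is roughly the largest window possible without window scaling;
    # the larger sizes are illustrative buffer settings.
    for window in (64 * 1024, 1024 * 1024, 4 * 1024 * 1024):
        mbps = max_throughput_bps(window, rtt) / 1e6
        n = connections_needed(target, window, rtt)
        print(f"window={window // 1024} KiB: {mbps:.1f} Mbit/s per connection, "
              f"{n} connection(s) for {target / 1e6:.0f} Mbit/s")
```

(Under these assumptions a small effective window would explain why dozens of parallel uploads are needed to fill a long-haul link, while larger buffers would let far fewer connections do the same work. Packet loss lowers the real figure further, which this sketch does not model.)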
[18:16] Spoiler is I'm not going to modify FOS settings [18:16] Aw. :-( [18:17] But feel free to work through what possible bottlenecks are in place, see what possible solutions there are. [18:18] Well. I haven't tried uploading to FOS yet. So maybe it will magically work without problems... [18:19] *** ndiddy has quit IRC (Read error: Connection reset by peer) [18:21] You literally wanted me to make network changes without interfacing with the network first? [18:22] Bold move, soldier [18:22] highly parallel uploads are the way to go, given all of the places that packets can be lost that neither of us control. [18:23] SketchCow: Nope, I wanted to talk with you later about maybe modifying the TCP buffers and window scaling after starting uploads. [18:24] I mean, that won't happen. [18:24] Noted. [18:24] But really, next time do a thing and find the thing not working before coming up with potential solutions or announcing your intention to demand a change. [18:25] No demand. I wanted a contact for when it inevitably fucked up. [18:46] *** dashcloud has quit IRC (Read error: Operation timed out) [18:50] *** dashcloud has joined #archiveteam [18:50] *** Tomcat_ has quit IRC (Remote host closed the connection) [18:54] *** vitzli has quit IRC (Quit: Leaving) [19:08] *** MMovie has joined #archiveteam [19:10] *** tomwsmf-a has joined #archiveteam [19:15] *** tfgbd_znc has joined #archiveteam [19:15] *** dashcloud has quit IRC (Read error: Operation timed out) [19:18] *** MMovie has quit IRC (Leaving.) [19:19] *** dashcloud has joined #archiveteam [19:21] scripts for arto are updated for the final run [19:21] now skipping any bad URLs. 
[19:29] *** superkuh has quit IRC (Quit: the neuronal action potential is an electrical manipulation of reversible abrupt phase changes in the lipid bilaye) [19:37] *** superkuh has joined #archiveteam [19:48] *** dashcloud has quit IRC (Read error: Operation timed out) [19:54] *** dashcloud has joined #archiveteam [19:54] My gawker crawl using the latest heritrix is 1.3M urls in (250GB of data downloaded) with 1.6M urls queued [19:54] *** Wuked has joined #archiveteam [19:54] *** Wuked has quit IRC (Client Quit) [19:57] swebb: how is heritrix in comparison of other crawlers? [19:57] i'm curious cause i haven't used it [19:58] Heritrix is the Internet Archive crawler. It's grown easier to use over the years, but still takes a little to get it working properly. For our grabs, we want to not swamp the site, but we don't use the defaults either to crawl the site in a few weeks (depending on the size of the site). It generates warc files which the IA and wayback machine use. [20:00] Once it's up and running, it has a (newly improved) web interface where you can monitor your jobs and start new ones. [20:00] yeah i saw it once [20:00] *** ris has joined #archiveteam [20:00] i mean what do you mean about the defaults [20:00] what do you usually change [20:01] Oh, the crawl waits are pretty slow - like 5-30 seconds between urls per hostname, but I change things around a bit. The config that I'm using for the gawker crawl is: https://gist.github.com/scumola/6c2dc8c96d2165e9fb608d49c15e0ebf [20:06] I also have heritrix crawls for doc2doc and portalgraphics.net running at the same time. [20:06] nice [20:06] Oh? [20:07] I had no idea anyone was doing that second one [20:07] not so sure about portalgraphics though [20:07] Uh? [20:07] did you change it to grab the URLs that a 'normal' crawler wouldn't get? [20:07] https://www.evernote.com/l/ACn1_ZZ6tDtGeID-OC6Jh8eZWsAh0kzMV5U [20:07] Like ignore robots.txt? No. 
[20:08] It still honors robots.txt [20:08] I think he means find a specific XML which points to other parts of a flash movie [20:08] That is basically a making-of of a given image [20:08] It'll parse urls from all kinds of different types of files. It'll even parse javascript to render urls, I think. [20:09] I doubt if it's grabbing multi-part flash video. [20:10] The kind of crawls that I do with heritrix are frequently the kind where IA already has several copies of the site over time, but people here want a 'full' crawl. I guess that IA does incremental crawls or something? [20:11] It's structured something like http://www.portalgraphics.net/pg/movie/pg_player/res_movie_data.php?mid=80728&lang=en though that URL is "hidden" in comments so I don't know if that can be picked up [20:11] Another thing to look out for is that if the site is overloaded, a page will be rendered as a message basically saying the lines are full [20:12] Yea, it is not smart enough to catch those. [20:12] Should be pretty easy to notice afterwards with its small content-length [20:13] Then again, a "doesn't exist anymore" substitute will probably be even smaller [20:13] Can you tell when I started the gawker crawl? 
:) https://www.evernote.com/l/AClf9BO53KVARaJs_jTGzQ59jXrrnpvJhao [20:13] *** bauruine has quit IRC (Ping timeout: 260 seconds) [20:23] *** bauruine has joined #archiveteam [20:42] *** MMovie has joined #archiveteam [20:47] *** ohhdemgir has joined #archiveteam [20:59] *** Wuked has joined #archiveteam [21:01] *** Wuked has quit IRC (Client Quit) [21:02] *** Wuked has joined #archiveteam [21:03] *** Wuked has quit IRC (Client Quit) [21:14] *** maseck has quit IRC (Remote host closed the connection) [21:14] *** Wuked has joined #archiveteam [21:22] *** maseck has joined #archiveteam [21:29] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD [21:30] yahoosucks [21:30] (that felt a little silly) [21:30] you think it makes YOU silly [21:48] *** ohhdemgir has quit IRC (Read error: Operation timed out) [21:50] *** metalcamp has quit IRC (Ping timeout: 250 seconds) [22:07] *** redlob has quit IRC (Ping timeout: 260 seconds) [22:11] *** ohhdemgir has joined #archiveteam [22:13] *** redlob has joined #archiveteam [22:33] *** Wuked has quit IRC (My Mac has gone to sleep. ZZZzzz…) [22:36] lol [22:46] *** RichardG has quit IRC (Ping timeout: 260 seconds) [22:47] *** j08nY has joined #archiveteam [23:11] *** oituniet has joined #archiveteam [23:13] *** oituniet has quit IRC (Client Quit) [23:30] *** RichardG has joined #archiveteam [23:40] *** antonizoo has quit IRC (Ping timeout: 260 seconds) [23:41] *** arkiver has quit IRC (Ping timeout: 260 seconds) [23:42] *** lesderid has quit IRC (Ping timeout: 260 seconds) [23:42] *** Sanqui has quit IRC (Ping timeout: 260 seconds) [23:42] *** lesderid has joined #archiveteam [23:52] *** remsen1 has quit IRC (ZNC 1.6.2 - http://znc.in) [23:52] *** Sanqui has joined #archiveteam [23:52] *** remsen has joined #archiveteam [23:53] *** arkiver has joined #archiveteam [23:53] *** swebb sets mode: +o arkiver