[00:04] question: when im using wget to archive something, a lot of times it doesnt pull something because the directory that its in is forbidden, even though the actual files arent
[00:04] making a very incomplete archive
[00:04] is there any way to get around that?
[00:05] the problem of course being that the only way to get the files is through a link that uses javascript
[00:05] have you tried using wpull instead of wget
[00:07] *** myself has quit IRC (Read error: Operation timed out)
[00:07] *** aMunster has quit IRC (Read error: Operation timed out)
[00:07] *** beardicus has quit IRC (Read error: Operation timed out)
[00:07] *** sep332 has quit IRC (Read error: Operation timed out)
[00:07] *** robink has quit IRC (Read error: Operation timed out)
[00:07] *** mutoso has quit IRC (Read error: Operation timed out)
[00:07] *** nwf has quit IRC (Read error: Operation timed out)
[00:07] *** yakfish has quit IRC (Read error: Operation timed out)
[00:07] *** xmc has quit IRC (Read error: Operation timed out)
[00:07] *** vegbrasil has quit IRC (Read error: Operation timed out)
[00:08] it has the ability to find link urls in javascript in some cases
[00:11] *** megaminxw has quit IRC (Read error: Operation timed out)
[00:12] *** Ghost_of_ has quit IRC (Remote host closed the connection)
[00:12] *** mutoso has joined #archiveteam
[00:17] *** phuzion has quit IRC (Read error: Connection reset by peer)
[00:18] *** phuzion has joined #archiveteam
[01:16] *** Ravenloft has joined #archiveteam
[01:36] *** vegbrasil has joined #archiveteam
[01:43] *** vegbrasil has quit IRC (Read error: Operation timed out)
[01:54] *** BlueMaxim has joined #archiveteam
[02:23] *** schbirid2 has joined #archiveteam
[02:25] *** schbirid has quit IRC (Read error: Operation timed out)
[02:25] *** Ghost_of_ has joined #archiveteam
[02:37] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[02:45] *** wyatt8740 has joined #archiveteam
[03:08] *** robink has joined #archiveteam
[03:08] *** sep332 has joined #archiveteam
[03:08] *** vegbrasil has joined #archiveteam
[03:08] *** aMunster has joined #archiveteam
[03:09] *** beardicus has joined #archiveteam
[03:09] *** yakfish has joined #archiveteam
[03:12] *** myself has joined #archiveteam
[03:16] *** megaminxw has joined #archiveteam
[03:16] *** kris33 has joined #archiveteam
[03:18] *** xmc has joined #archiveteam
[03:18] *** swebb sets mode: +o xmc
[03:23] *** yakfish has quit IRC (Read error: Operation timed out)
[03:24] *** myself has quit IRC (Read error: Operation timed out)
[03:24] *** robink has quit IRC (Write error: Broken pipe)
[03:24] *** sep332 has quit IRC (Write error: Broken pipe)
[03:24] *** aMunster has quit IRC (Write error: Broken pipe)
[03:24] *** xmc has quit IRC (Read error: Operation timed out)
[03:25] *** vegbrasil has quit IRC (Read error: Operation timed out)
[03:25] *** beardicus has quit IRC (Read error: Operation timed out)
[03:26] *** redlob has quit IRC (Ping timeout: 483 seconds)
[03:29] *** megaminxw has quit IRC (Read error: Operation timed out)
[03:37] *** redlob has joined #archiveteam
[03:41] *** nertzy has joined #archiveteam
[03:54] *** kris33 has quit IRC (Textual IRC Client: www.textualapp.com)
[04:00] *** cechk01 has joined #archiveteam
[04:28] *** Ghost_of_ has quit IRC (Quit: Leaving)
[04:46] *** megaminxw has joined #archiveteam
[04:46] okay, wpull didnt work either
[04:46] gragh
[04:47] if the link is generated by Javascript code and can't be derived by wpull's heuristics, you can also just get the link via Chromium or whatever and feed it in
[04:48] alternatively, capture e.g. Chromium's network activity to a HAR and then find or write a har2warc utility, if you care about having things as a WARC
[04:49] that gets a bit tedious after 99 times
[04:49] please do investigate automation
[04:50] yeah im doing that
[04:50] I mean there's tools like selenium and selenium-webdriver to drive browsers to make them do what you want, and they usually work well, but it's not the lightest thing
[04:51] they're intended for integration testing, though perhaps you may find something like capybara useful
[04:52] also what are these resources, things stored on Mega?
[04:52] *** vegbrasil has joined #archiveteam
[04:53] *** yakfish has joined #archiveteam
[04:53] *** beardicus has joined #archiveteam
[04:53] just a particular homepage of a guy who hasnt updated in a while and which might go offline
[04:53] better safe than sorry anyway
[04:53] *** robink has joined #archiveteam
[04:53] if you have a URL it lets others figure out ways to get a copy
[04:53] *** megaminx1 has joined #archiveteam
[04:54] *** megaminxw has quit IRC (Quit: Page closed)
[04:54] yeah sure
[04:54] http://www.jaapsch.net/puzzles/
[04:55] *** aMunster has joined #archiveteam
[04:55] *** sep332 has joined #archiveteam
[04:56] *** JesseW has joined #archiveteam
[04:57] *** myself has joined #archiveteam
[05:01] one of the links being if you click on the big "javascript simulation" link on... this page, say http://www.jaapsch.net/puzzles/starlet.htm
[05:01] megaminx1: oh that's not that bad, you can do a wpull crawl and extract /javascripts/.*
[05:01] and then append the results
[05:02] er /javascript/.*
[05:03] *** xmc has joined #archiveteam
[05:03] *** swebb sets mode: +o xmc
[05:03] or do one crawl for the main site and a second crawl for the /javascript/.* URLs, and then upload the separate WARCs + a combined WARC via megawarc etc
[05:05] *** yakfish has quit IRC (Read error: Operation timed out)
[05:05] *** beardicus has quit IRC (Read error: Operation timed out)
[05:05] *** JesseW has quit IRC (Read error: Operation timed out)
[05:05] *** xmc has quit IRC (Write error: Broken pipe)
[05:05] *** vegbrasil has quit IRC (Read error: Operation timed out)
[05:06] *** aMunster has quit IRC (Read error: Operation timed out)
[05:08] *** myself has quit IRC (Read error: Operation timed out)
[05:09] *** megaminx1 has quit IRC (Read error: Operation timed out)
[05:09] *** sep332 has quit IRC (Read error: Operation timed out)
[05:10] *** megaminxw has joined #archiveteam
[05:10] argh my connection keeps dying
[05:13] *** nertzy has quit IRC (This computer has gone to sleep)
[05:14] *** robink has quit IRC (Read error: Operation timed out)
[06:27] *** jmad980 has quit IRC (Ping timeout: 369 seconds)
[06:34] *** vegbrasil has joined #archiveteam
[06:34] *** myself has joined #archiveteam
[06:34] *** beardicus has joined #archiveteam
[06:34] *** yakfish has joined #archiveteam
[06:35] *** robink has joined #archiveteam
[06:36] *** sep332 has joined #archiveteam
[06:37] *** aMunster has joined #archiveteam
[06:37] *** nwf has joined #archiveteam
[06:39] *** JesseW has joined #archiveteam
[06:40] *** megaminx1 has joined #archiveteam
[06:40] *** megaminxw has quit IRC (Quit: Page closed)
[06:41] ping'ing chfoo or someone who can check if the cron job for exporting URLteam results has gotten turned off or stuck. There hasn't been an export since the 24th, and there are 10 million results waiting to go out.
[06:43] *** xmc has joined #archiveteam
[06:43] *** swebb sets mode: +o xmc
[06:53] *** jmad980 has joined #archiveteam
[06:55] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:55] yipdw: okay ive tried to crawl just the javascript urls, and it wont let me because the directory is forbidden even though the files arent
[06:55] any ideas?
[06:57] *** scyther has joined #archiveteam
[06:58] *** dashcloud has joined #archiveteam
[07:03] *** JesseW has quit IRC (Leaving.)
[07:22] *** Emcy_ has quit IRC (Read error: Operation timed out)
[07:23] *** Emcy_ has joined #archiveteam
[08:10] *** scyther has quit IRC (Quit: Leaving)
[08:13] *** Sanqui is now known as Sanky|gon
[08:13] *** Sanky|gon is now known as SankyGONE
[08:25] megaminx1: I mean do a crawl first, and then get the Javascript URLs from the crawl content
[08:25] via grep
[08:25] or whatever
[08:25] if you do that then you don't need to access the index of /javascript/
[08:48] *** Emcy has joined #archiveteam
[08:48] *** Emcy_ has quit IRC (Ping timeout: 252 seconds)
[09:16] yipdw: im a bit confused, sorry
[09:17] noob here
[09:58] *** Froggypwn has joined #archiveteam
[10:22] *** megaminx1 has quit IRC (Ping timeout: 615 seconds)
[10:31] *** megaminxw has joined #archiveteam
[11:11] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:00] *** Boltsie__ has quit IRC (Quit: Connection closed for inactivity)
[12:15] *** vtyl has quit IRC (Read error: Operation timed out)
[12:35] *** lytv has joined #archiveteam
[12:37] *** vOYtEC has joined #archiveteam
[13:02] *** primus104 has joined #archiveteam
[13:02] *** primus104 has left
[13:38] *** Ghost_of_ has joined #archiveteam
[13:42] *** primus104 has joined #archiveteam
[13:43] *** primus104 has quit IRC (Client Quit)
[13:55] *** primus104 has joined #archiveteam
[13:55] *** primus104 has left
[14:18] *** VADemon has joined #archiveteam
[14:20] *** megaminxw has quit IRC (Leaving.)
[14:45] *** atomotic has joined #archiveteam
[14:58] *** Ghost_of_ has quit IRC (Quit: Leaving)
[15:00] *** atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
[15:01] *** atomotic has joined #archiveteam
[15:05] *** logan2 has quit IRC (Read error: Connection reset by peer)
[15:05] *** logan has joined #archiveteam
[15:06] *** aliz has quit IRC (Read error: Operation timed out)
[15:07] *** aliz has joined #archiveteam
[15:39] *** atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
[15:41] *** JetBalsa has quit IRC (Remote host closed the connection)
[15:51] *** Start has quit IRC (Excess Flood)
[15:52] *** Start has joined #archiveteam
[15:53] *** yipdw has quit IRC (Ping timeout: 311 seconds)
[15:53] *** atomotic has joined #archiveteam
[15:54] *** JetBalsa has joined #archiveteam
[15:54] *** Smiley has quit IRC (Remote host closed the connection)
[15:54] *** vitzli has joined #archiveteam
[15:54] *** yipdw_ has joined #archiveteam
[15:54] *** SmileyG has joined #archiveteam
[15:58] *** Nertsy has quit IRC (Ping timeout: 364 seconds)
[16:00] *** Start has quit IRC (Excess Flood)
[16:01] *** Start has joined #archiveteam
[16:01] *** Nertsy has joined #archiveteam
[16:01] *** signius has quit IRC (Ping timeout: 311 seconds)
[16:06] *** lukeman_ has quit IRC (Ping timeout: 311 seconds)
[16:07] *** lukeman has joined #archiveteam
[16:11] *** signius has joined #archiveteam
[16:21] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[16:26] *** vitzli has quit IRC (Quit: Leaving)
[16:44] *** Rickster has joined #archiveteam
[16:50] *** JesseW has joined #archiveteam
[17:09] *** nertzy has joined #archiveteam
[17:10] *** DSs has joined #archiveteam
[17:11] *** DSs has quit IRC (Client Quit)
[17:20] *** JesseW has quit IRC (Leaving.)
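The [08:25] suggestion (crawl the main site first, then pull the /javascript/ URLs out of the pages that were fetched) can be sketched roughly as below. This is only an illustration in Python rather than grep: the function name `extract_prefixed_urls` and the sample link in the usage block are made up for the example, and the regex only looks at `href`/`src` attributes, so links assembled at runtime by script code would still be missed.

```python
import re
from urllib.parse import urljoin, urlparse

# Captures the value of any href="..." or src="..." attribute (either quote
# style), stopping before a "#fragment". Not a full HTML parser.
ATTR_RE = re.compile(r'''(?:href|src)\s*=\s*["']([^"'#]+)''', re.IGNORECASE)


def extract_prefixed_urls(html, page_url, prefix="/javascript/"):
    """Return sorted absolute URLs referenced in html whose path starts with prefix.

    page_url is the address the page was fetched from; relative links are
    resolved against it, so the forbidden directory index is never requested.
    """
    found = set()
    for raw in ATTR_RE.findall(html):
        absolute = urljoin(page_url, raw.strip())
        if urlparse(absolute).path.startswith(prefix):
            found.add(absolute)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical markup: the real page's link target is not shown in the log.
    sample = '<a href="../javascript/starlet/starlet.htm">javascript simulation</a>'
    for url in extract_prefixed_urls(sample, "http://www.jaapsch.net/puzzles/starlet.htm"):
        print(url)
```

Written one per line to a text file, the resulting URLs could be handed to a second crawl (wget, for instance, reads such a list via `--input-file`), which matches the "second crawl for the /javascript/.* URLs" idea from [05:03].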
[17:47] *** nertzy has quit IRC (This computer has gone to sleep)
[17:52] https://news.ycombinator.com/item?id=10797201 (pointing to http://www.roger-pearse.com/weblog/2015/12/26/when-will-the-librarians-start-to-throw-offline-literature-away/ )
[17:59] *** Morbus has joined #archiveteam
[18:08] I forgot people don't get much done during this week.
[18:14] Going to get some food and try to close out some tickets and archiveteamery
[18:14] And also, maybe (maybe) start setting up some digitization stations here.
[18:30] I'm going to try to get the oldfriends project ready tomorrow for the first items
[18:31] SketchCow: to get everything from oldfriends we need an account
[18:31] We can get the photos, schools, institutions and others
[18:31] but we can't get the member profiles, for example http://www.oldfriends.co.nz/MemberProfile.aspx?oldfriends_member_id=347583 redirects to a login
[18:32] Unfortunately the site doesn't allow creating accounts anymore, so we would have to use an existing account
[18:33] SketchCow: I'd also like to talk with you about the wiki grabs.
[18:34] We have currently done external link grabs for a lot of wikis in early November.
[18:34] Are we going to regrab external links periodically? maybe once/twice every year?
[18:34] We'll soon start to grab the actual wikis too
[18:49] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[19:07] *** thefinn93 has quit IRC (Remote host closed the connection)
[19:09] *** wyatt8740 has joined #archiveteam
[19:14] *** yipdw_ is now known as yipdw
[19:30] *** thefinn93 has joined #archiveteam
[19:34] *** Ravenloft has quit IRC (Ping timeout: 250 seconds)
[19:52] *** icedice has joined #archiveteam
[19:53] psdisasters.com apparently closed down a few months ago. Does anyone here have it archived?
[19:55] and the website used robots.txt, so this is all that remains of it: https://research.domaintools.com/research/screenshot-history/psdisasters.com/
[19:55] Shutdown announcement: https://thumbnails.domaintools.com/domaintools/2015-12-28T19:36:58.000Z/FKIhrwARxXcg3dZ6np-Xs-341qI=/psdisasters.com/fullsize/457b7cf657624df203706914a278e72c/1444080895.jpg
[20:00] I wonder if the webmaster(s) would hand over a site backup and upload it to Archive.org if we could contact him/her/them somehow?
[20:06] *** Stiletto has quit IRC ()
[20:15] *** philpem has joined #archiveteam
[20:16] ze germans are moving https://de.wikipedia.org/wiki/Spezial:Beitr%C3%A4ge/GiftBot
[20:23] moving what
[20:23] ?
[20:29] *** Stiletto has joined #archiveteam
[20:38] Bombbot?
[20:38] Hey, so a book company has released all their 10+ year old books for free.
[20:38] I'm going to do something about that.
[20:39] It's 35 books but hey
[20:40] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[20:43] Wait, no.
[20:43] It's thousands of books.
[20:43] Well then.
[20:45] Physical books?
[20:48] *** wyatt8740 has joined #archiveteam
[20:52] "The Multi-Level Marketing Strategies that Doctors DON'T WANT YOU TO KNOW ABOUT -- 2008 edition eBook free resale rights MLM ebay ebay ebay ebay"
[20:52] Or something good?
[20:52] :)
[20:59] *** icedice2 has joined #archiveteam
[21:01] *** aMunster has quit IRC (Read error: Operation timed out)
[21:01] *** robink has quit IRC (Write error: Broken pipe)
[21:01] *** sep332 has quit IRC (Read error: Operation timed out)
[21:01] *** nwf has quit IRC (Read error: Operation timed out)
[21:01] *** yakfish has quit IRC (Read error: Operation timed out)
[21:01] *** xmc has quit IRC (Read error: Operation timed out)
[21:01] *** myself has quit IRC (Write error: Broken pipe)
[21:02] *** beardicus has quit IRC (Read error: Operation timed out)
[21:02] *** vegbrasil has quit IRC (Read error: Operation timed out)
[21:02] *** icedice has quit IRC (Ping timeout: 364 seconds)
[21:04] You all relax.
[21:04] I'm writing scrapers.
[21:04] Also, I bought an Ultra HD TV
[21:04] These are good books. very good.
[21:04] I'm working to ensure every possible drop of metadata ends up in the books.
[21:04] Spend an hour or two on it, and the result will be jam-packed with goodness.
[21:05] *** xmc has joined #archiveteam
[21:05] *** swebb sets mode: +o xmc
[21:16] *** REiN has quit IRC (Read error: Operation timed out)
[21:30] *** Stiletto has quit IRC (Read error: Connection reset by peer)
[21:33] nice
[21:45] *** Stiletto has joined #archiveteam
[21:53] http://link.springer.com/book/10.1007/b98860
[21:53] https://archive.org/details/springer_10.1007-b98860
[21:53] I LIKE it
[21:54] sexy
[21:57] *** toad1 has joined #archiveteam
[22:12] *** Ghost_of_ has joined #archiveteam
[22:31] *** vegbrasil has joined #archiveteam
[22:31] *** robink has joined #archiveteam
[22:31] *** yakfish has joined #archiveteam
[22:32] *** beardicus has joined #archiveteam
[22:32] *** aMunster has joined #archiveteam
[22:32] *** sep332 has joined #archiveteam
[22:32] They're happening.
[22:32] 56,000 books.
[22:32] that should.... hold up, generally.
[22:32] Are Springer making all of these books free, forever?
[22:33] *** myself has joined #archiveteam
[22:34] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:34] amazing if so
[22:35] *** MRX3 has joined #archiveteam
[22:36] *** dashcloud has joined #archiveteam
[22:37] *** icedice2 has quit IRC (Ping timeout: 250 seconds)
[22:45] iirc o'reilly has done that for a long time, are those mirrored
[22:52] *** nwf has joined #archiveteam
[22:53] *** Boppen has quit IRC (Ping timeout: 200 seconds)
[22:55] *** Boppen has joined #archiveteam
[22:58] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[23:16] *** redlob has quit IRC (Read error: Operation timed out)
[23:16] *** redlob has joined #archiveteam
[23:20] *** wyatt8740 has joined #archiveteam
[23:27] *** dashcloud has quit IRC (Remote host closed the connection)
[23:28] That's way cool. Are there WARCs of the scrapes?
[23:32] I avoid O'Reilly
[23:32] If someone else wants to
[23:32] That guy and I don't get along
[23:33] Looks like there are a lot more than 56k. A total of 121,186 books
[23:33] Whatever criteria I chose, I chose.
[23:33] https://gist.github.com/bishboria/8326b17bbd652f34566a
[23:33] Oh, nope 110,041
[23:37] Looks like 110,183 to me
[23:49] Oh I see
[23:49] I added "and put it in english"
[23:49] So I could globally set the OCR correctly.
[23:49] Otherwise I have to guess or we use an analyzer
[23:51] Also, something broke-ass with the searcher they have
[23:51] http://link.springer.com/search/page/1?facet-language=%22En%22&facet-content-type=%22Book%22&showAll=false
[23:52] most are en and de; could do those two as separate packs with correct language, then the rest set to whatever. Also their web site sucks
[23:52] http://link.springer.com/search/page/999?facet-language=%22En%22&facet-content-type=%22Book%22&showAll=false
[23:52] See, if you switch to the next page... server error
[23:55] ...fail
[23:56] This allowed me to get the name of 19,000 books, and I'll be adding a thing to allow me to keep retrying (and it will just not re-upload)
[23:57] so it's not possible to get any books past that page
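The "keep retrying (and it will just not re-upload)" idea from [23:56] might look something like the sketch below. Nothing here is the actual scraper: `fetch`, `fetch_with_retry` and `collect_new_titles` are hypothetical names, `fetch(page)` stands in for whatever returns the book titles on one search-results page, and the set of already-handled titles stands in for whatever check prevents a re-upload. Retrying only helps with intermittent failures; if the server always errors past a given page, as [23:57] suggests, the last attempt will still raise.

```python
import time


def fetch_with_retry(fetch, page, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(page), retrying on any exception with exponential backoff.

    The final failure is re-raised so the caller can see that the page
    really is unreachable.
    """
    for attempt in range(attempts):
        try:
            return fetch(page)
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def collect_new_titles(fetch, pages, already_done, sleep=time.sleep):
    """Walk pages in order and return titles not yet in already_done.

    already_done is updated in place, so running the job again skips
    everything handled earlier instead of uploading it twice.
    """
    new_titles = []
    for page in pages:
        for title in fetch_with_retry(fetch, page, sleep=sleep):
            if title not in already_done:
                already_done.add(title)
                new_titles.append(title)
    return new_titles
```

Passing `fetch` in as a parameter keeps the retry and skip logic separate from the HTTP details, which makes it easy to try out against a fake fetcher before pointing it at a real site.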