#archiveteam 2015-12-28,Mon

Time Nickname Message
00:04 🔗 megaminxw question: when im using wget to archive something, a lot of times it doesnt pull something because the directory that its in is forbidden, even though the actual files arent
00:04 🔗 megaminxw making a very incomplete archive
00:04 🔗 megaminxw is there any way to get around that?
00:05 🔗 megaminxw the problem of course being that the only way to get the files is through a link that uses javascript
00:05 🔗 DFJustin have you tried using wpull instead of wget
00:07 🔗 myself has quit IRC (Read error: Operation timed out)
00:07 🔗 aMunster has quit IRC (Read error: Operation timed out)
00:07 🔗 beardicus has quit IRC (Read error: Operation timed out)
00:07 🔗 sep332 has quit IRC (Read error: Operation timed out)
00:07 🔗 robink has quit IRC (Read error: Operation timed out)
00:07 🔗 mutoso has quit IRC (Read error: Operation timed out)
00:07 🔗 nwf has quit IRC (Read error: Operation timed out)
00:07 🔗 yakfish has quit IRC (Read error: Operation timed out)
00:07 🔗 xmc has quit IRC (Read error: Operation timed out)
00:07 🔗 vegbrasil has quit IRC (Read error: Operation timed out)
00:08 🔗 DFJustin it has the ability to find link urls in javascript in some cases
00:11 🔗 megaminxw has quit IRC (Read error: Operation timed out)
00:12 🔗 Ghost_of_ has quit IRC (Remote host closed the connection)
00:12 🔗 mutoso has joined #archiveteam
00:17 🔗 phuzion has quit IRC (Read error: Connection reset by peer)
00:18 🔗 phuzion has joined #archiveteam
01:16 🔗 Ravenloft has joined #archiveteam
01:36 🔗 vegbrasil has joined #archiveteam
01:43 🔗 vegbrasil has quit IRC (Read error: Operation timed out)
01:54 🔗 BlueMaxim has joined #archiveteam
02:23 🔗 schbirid2 has joined #archiveteam
02:25 🔗 schbirid has quit IRC (Read error: Operation timed out)
02:25 🔗 Ghost_of_ has joined #archiveteam
02:37 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
02:45 🔗 wyatt8740 has joined #archiveteam
03:08 🔗 robink has joined #archiveteam
03:08 🔗 sep332 has joined #archiveteam
03:08 🔗 vegbrasil has joined #archiveteam
03:08 🔗 aMunster has joined #archiveteam
03:09 🔗 beardicus has joined #archiveteam
03:09 🔗 yakfish has joined #archiveteam
03:12 🔗 myself has joined #archiveteam
03:16 🔗 megaminxw has joined #archiveteam
03:16 🔗 kris33 has joined #archiveteam
03:18 🔗 xmc has joined #archiveteam
03:18 🔗 swebb sets mode: +o xmc
03:23 🔗 yakfish has quit IRC (Read error: Operation timed out)
03:24 🔗 myself has quit IRC (Read error: Operation timed out)
03:24 🔗 robink has quit IRC (Write error: Broken pipe)
03:24 🔗 sep332 has quit IRC (Write error: Broken pipe)
03:24 🔗 aMunster has quit IRC (Write error: Broken pipe)
03:24 🔗 xmc has quit IRC (Read error: Operation timed out)
03:25 🔗 vegbrasil has quit IRC (Read error: Operation timed out)
03:25 🔗 beardicus has quit IRC (Read error: Operation timed out)
03:26 🔗 redlob has quit IRC (Ping timeout: 483 seconds)
03:29 🔗 megaminxw has quit IRC (Read error: Operation timed out)
03:37 🔗 redlob has joined #archiveteam
03:41 🔗 nertzy has joined #archiveteam
03:54 🔗 kris33 has quit IRC (Textual IRC Client: www.textualapp.com)
04:00 🔗 cechk01 has joined #archiveteam
04:28 🔗 Ghost_of_ has quit IRC (Quit: Leaving)
04:46 🔗 megaminxw has joined #archiveteam
04:46 🔗 megaminxw okay, wpull didnt work either
04:46 🔗 megaminxw gragh
04:47 🔗 yipdw if the link is generated by Javascript code and can't be derived by wpull's heuristics, you can also just get the link via Chromium or whatever and feed it in
04:48 🔗 yipdw alternatively, capture e.g. Chromium's network activity to a HAR and then find or write a har2warc utility, if you care about having things as a WARC
04:49 🔗 megaminxw that gets a bit tedious after 99 times
04:49 🔗 yipdw please do investigate automation
04:50 🔗 megaminxw yeah im doing that
04:50 🔗 yipdw I mean there's tools like selenium and selenium-webdriver to drive browsers to make them do what you want, and they usually work well, but it's not the lightest thing
04:51 🔗 yipdw they're intended for integration testing, though perhaps you may find something like capybara useful
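A rough sketch of the HAR route yipdw mentions, assuming the browser's network log was exported as a standard HAR file (JSON with a top-level `log.entries` list). The function name `urls_from_har` is made up for illustration. It only pulls out the request URLs so they can be fed to another tool; it is not a har2warc converter.

```python
import json


def urls_from_har(har_text):
    """Return the unique request URLs found in a HAR document, in first-seen order."""
    har = json.loads(har_text)
    seen = set()
    urls = []
    # Each HAR entry records one request/response pair made by the browser.
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The resulting list could be written one URL per line to a text file and handed to a crawler as its input list.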
04:52 🔗 yipdw also what are these resources, things stored on Mega?
04:52 🔗 vegbrasil has joined #archiveteam
04:53 🔗 yakfish has joined #archiveteam
04:53 🔗 beardicus has joined #archiveteam
04:53 🔗 megaminxw just a particular homepage of a guy who hasnt updated in a while and which might go offline
04:53 🔗 megaminxw better safe than sorry anyway
04:53 🔗 robink has joined #archiveteam
04:53 🔗 yipdw if you have a URL it lets others figure out ways to get a copy
04:53 🔗 megaminx1 has joined #archiveteam
04:54 🔗 megaminxw has quit IRC (Quit: Page closed)
04:54 🔗 megaminx1 yeah sure
04:54 🔗 megaminx1 http://www.jaapsch.net/puzzles/
04:55 🔗 aMunster has joined #archiveteam
04:55 🔗 sep332 has joined #archiveteam
04:56 🔗 JesseW has joined #archiveteam
04:57 🔗 myself has joined #archiveteam
05:01 🔗 megaminx1 one of the links being if you click on the big "javascript simulation" link on... this page, say http://www.jaapsch.net/puzzles/starlet.htm
05:01 🔗 yipdw megaminx1: oh that's not that bad, you can do a wpull crawl and extract /javascripts/.*
05:01 🔗 yipdw and then append the results
05:02 🔗 yipdw er /javascript/.*
05:03 🔗 xmc has joined #archiveteam
05:03 🔗 swebb sets mode: +o xmc
05:03 🔗 yipdw or do one crawl for the main site and a second crawl for the /javascript/.* URLs, and then upload the separate WARCs + a combined WARC via megawarc etc
05:05 🔗 yakfish has quit IRC (Read error: Operation timed out)
05:05 🔗 beardicus has quit IRC (Read error: Operation timed out)
05:05 🔗 JesseW has quit IRC (Read error: Operation timed out)
05:05 🔗 xmc has quit IRC (Write error: Broken pipe)
05:05 🔗 vegbrasil has quit IRC (Read error: Operation timed out)
05:06 🔗 aMunster has quit IRC (Read error: Operation timed out)
05:08 🔗 myself has quit IRC (Read error: Operation timed out)
05:09 🔗 megaminx1 has quit IRC (Read error: Operation timed out)
05:09 🔗 sep332 has quit IRC (Read error: Operation timed out)
05:10 🔗 megaminxw has joined #archiveteam
05:10 🔗 megaminxw argh my connection keeps dying
05:13 🔗 nertzy has quit IRC (This computer has gone to sleep)
05:14 🔗 robink has quit IRC (Read error: Operation timed out)
06:27 🔗 jmad980 has quit IRC (Ping timeout: 369 seconds)
06:34 🔗 vegbrasil has joined #archiveteam
06:34 🔗 myself has joined #archiveteam
06:34 🔗 beardicus has joined #archiveteam
06:34 🔗 yakfish has joined #archiveteam
06:35 🔗 robink has joined #archiveteam
06:36 🔗 sep332 has joined #archiveteam
06:37 🔗 aMunster has joined #archiveteam
06:37 🔗 nwf has joined #archiveteam
06:39 🔗 JesseW has joined #archiveteam
06:40 🔗 megaminx1 has joined #archiveteam
06:40 🔗 megaminxw has quit IRC (Quit: Page closed)
06:41 🔗 JesseW ping'ing chfoo or someone who can check if the cron job for exporting URLteam results has gotten turned off or stuck. There hasn't been an export since the 24th, and there are 10 million results waiting to go out.
06:43 🔗 xmc has joined #archiveteam
06:43 🔗 swebb sets mode: +o xmc
06:53 🔗 jmad980 has joined #archiveteam
06:55 🔗 dashcloud has quit IRC (Read error: Operation timed out)
06:55 🔗 megaminx1 yipdw: okay ive tried to crawl just the javascript urls, and it wont let me because the directory is forbidden even though the files arent
06:55 🔗 megaminx1 any ideas?
06:57 🔗 scyther has joined #archiveteam
06:58 🔗 dashcloud has joined #archiveteam
07:03 🔗 JesseW has quit IRC (Leaving.)
07:22 🔗 Emcy_ has quit IRC (Read error: Operation timed out)
07:23 🔗 Emcy_ has joined #archiveteam
08:10 🔗 scyther has quit IRC (Quit: Leaving)
08:13 🔗 Sanqui is now known as Sanky|gon
08:13 🔗 Sanky|gon is now known as SankyGONE
08:25 🔗 yipdw megaminx1: I mean do a crawl first, and then get the Javascript URLs from the crawl content
08:25 🔗 yipdw via grep
08:25 🔗 yipdw or whatever
08:25 🔗 yipdw if you do that then you don't need to access the index of /javascript/
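One possible way to do the "grep the crawl content" step in Python rather than with grep itself. This is a sketch under assumptions: the pages from the first crawl are available as HTML text, and the simulation links appear as quoted strings containing a `javascript/` path segment. The helper name `javascript_urls` and the sample file name in the usage note are hypothetical.

```python
import re
from urllib.parse import urljoin

# Quoted strings that contain a "javascript/" path segment.
# "javascript:" pseudo-links do not match because they lack the slash.
JS_PATH = re.compile(r"""["']([^"'\s]*javascript/[^"'\s]+)["']""")


def javascript_urls(html, page_url):
    """Return absolute URLs of links under a javascript/ directory found in html."""
    found = []
    for match in JS_PATH.finditer(html):
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, match.group(1))
        if absolute not in found:
            found.append(absolute)
    return found
```

For example, a page at http://www.jaapsch.net/puzzles/starlet.htm containing `href="javascript/example.htm"` (a made-up file name) would yield a URL under http://www.jaapsch.net/puzzles/javascript/. Collecting these across all crawled pages gives a URL list for the second crawl, without ever requesting the forbidden directory index.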
08:48 🔗 Emcy has joined #archiveteam
08:48 🔗 Emcy_ has quit IRC (Ping timeout: 252 seconds)
09:16 🔗 megaminx1 yipdw: im a bit confused, sorry
09:17 🔗 megaminx1 noob here
09:58 🔗 Froggypwn has joined #archiveteam
10:22 🔗 megaminx1 has quit IRC (Ping timeout: 615 seconds)
10:31 🔗 megaminxw has joined #archiveteam
11:11 🔗 BlueMaxim has quit IRC (Quit: Leaving)
12:00 🔗 Boltsie__ has quit IRC (Quit: Connection closed for inactivity)
12:15 🔗 vtyl has quit IRC (Read error: Operation timed out)
12:35 🔗 lytv has joined #archiveteam
12:37 🔗 vOYtEC has joined #archiveteam
13:02 🔗 primus104 has joined #archiveteam
13:02 🔗 primus104 has left
13:38 🔗 Ghost_of_ has joined #archiveteam
13:42 🔗 primus104 has joined #archiveteam
13:43 🔗 primus104 has quit IRC (Client Quit)
13:55 🔗 primus104 has joined #archiveteam
13:55 🔗 primus104 has left
14:18 🔗 VADemon has joined #archiveteam
14:20 🔗 megaminxw has quit IRC (Leaving.)
14:45 🔗 atomotic has joined #archiveteam
14:58 🔗 Ghost_of_ has quit IRC (Quit: Leaving)
15:00 🔗 atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
15:01 🔗 atomotic has joined #archiveteam
15:05 🔗 logan2 has quit IRC (Read error: Connection reset by peer)
15:05 🔗 logan has joined #archiveteam
15:06 🔗 aliz has quit IRC (Read error: Operation timed out)
15:07 🔗 aliz has joined #archiveteam
15:39 🔗 atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
15:41 🔗 JetBalsa has quit IRC (Remote host closed the connection)
15:51 🔗 Start has quit IRC (Excess Flood)
15:52 🔗 Start has joined #archiveteam
15:53 🔗 yipdw has quit IRC (Ping timeout: 311 seconds)
15:53 🔗 atomotic has joined #archiveteam
15:54 🔗 JetBalsa has joined #archiveteam
15:54 🔗 Smiley has quit IRC (Remote host closed the connection)
15:54 🔗 vitzli has joined #archiveteam
15:54 🔗 yipdw_ has joined #archiveteam
15:54 🔗 SmileyG has joined #archiveteam
15:58 🔗 Nertsy has quit IRC (Ping timeout: 364 seconds)
16:00 🔗 Start has quit IRC (Excess Flood)
16:01 🔗 Start has joined #archiveteam
16:01 🔗 Nertsy has joined #archiveteam
16:01 🔗 signius has quit IRC (Ping timeout: 311 seconds)
16:06 🔗 lukeman_ has quit IRC (Ping timeout: 311 seconds)
16:07 🔗 lukeman has joined #archiveteam
16:11 🔗 signius has joined #archiveteam
16:21 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
16:26 🔗 vitzli has quit IRC (Quit: Leaving)
16:44 🔗 Rickster has joined #archiveteam
16:50 🔗 JesseW has joined #archiveteam
17:09 🔗 nertzy has joined #archiveteam
17:10 🔗 DSs has joined #archiveteam
17:11 🔗 DSs has quit IRC (Client Quit)
17:20 🔗 JesseW has quit IRC (Leaving.)
17:47 🔗 nertzy has quit IRC (This computer has gone to sleep)
17:52 🔗 JW_work https://news.ycombinator.com/item?id=10797201 (pointing to http://www.roger-pearse.com/weblog/2015/12/26/when-will-the-librarians-start-to-throw-offline-literature-away/ )
17:59 🔗 Morbus has joined #archiveteam
18:08 🔗 Sketchcow I forgot people don't get much done during this week.
18:14 🔗 Sketchcow Going to get some food and try to close out some tickets and archiveteamery
18:14 🔗 Sketchcow And also, maybe (maybe) start setting up some digitization stations here.
18:30 🔗 arkiver I'm going to try to get the oldfriends project ready tomorrow for the first items
18:31 🔗 arkiver SketchCow: to get everything from oldfriends we need an account
18:31 🔗 arkiver We can get the photos, schools, institutions and others
18:31 🔗 arkiver but we can't get the member profiles, for example http://www.oldfriends.co.nz/MemberProfile.aspx?oldfriends_member_id=347583 redirects to a login
18:32 🔗 arkiver Unfortunately the site doesn't allow creating accounts anymore, so we would have to use an existing account
18:33 🔗 arkiver SketchCow: I'd also like to talk with you about the wiki grabs.
18:34 🔗 arkiver We did external link grabs for a lot of wikis in early November.
18:34 🔗 arkiver Are we going to regrab external links periodically? maybe once/twice every year?
18:34 🔗 arkiver We'll soon start to grab the actual wikis too
18:49 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
19:07 🔗 thefinn93 has quit IRC (Remote host closed the connection)
19:09 🔗 wyatt8740 has joined #archiveteam
19:14 🔗 yipdw_ is now known as yipdw
19:30 🔗 thefinn93 has joined #archiveteam
19:34 🔗 Ravenloft has quit IRC (Ping timeout: 250 seconds)
19:52 🔗 icedice has joined #archiveteam
19:53 🔗 icedice psdisasters.com apparently closed down a few months ago. Does anyone here have it archived?
19:55 🔗 icedice and the website used robots.txt, so this is all that remains of it: https://research.domaintools.com/research/screenshot-history/psdisasters.com/
19:55 🔗 icedice Shutdown announcement: https://thumbnails.domaintools.com/domaintools/2015-12-28T19:36:58.000Z/FKIhrwARxXcg3dZ6np-Xs-341qI=/psdisasters.com/fullsize/457b7cf657624df203706914a278e72c/1444080895.jpg
20:00 🔗 icedice I wonder if the webmaster(s) would hand over a site backup and upload it to Archive.org if we could contact him/her/them somehow?
20:06 🔗 Stiletto has quit IRC ()
20:15 🔗 philpem has joined #archiveteam
20:16 🔗 Nemo_bis ze germans are moving https://de.wikipedia.org/wiki/Spezial:Beitr%C3%A4ge/GiftBot
20:23 🔗 JW_work moving what
20:23 🔗 JW_work ?
20:29 🔗 Stiletto has joined #archiveteam
20:38 🔗 Sketchcow Bombbot?
20:38 🔗 Sketchcow Hey, so a book company has released all their 10+ year old books for free.
20:38 🔗 Sketchcow I'm going to do something about that.
20:39 🔗 Sketchcow It's 35 books but hey
20:40 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
20:43 🔗 Sketchcow Wait, no.
20:43 🔗 Sketchcow It's thousands of books.
20:43 🔗 Sketchcow Well then.
20:45 🔗 HCross Physical books?
20:48 🔗 wyatt8740 has joined #archiveteam
20:52 🔗 antomatic "The Multi-Level Marketing Strategies that Doctors DON'T WANT YOU TO KNOW ABOUT -- 2008 edition eBook free resale rights MLM ebay ebay ebay ebay"
20:52 🔗 antomatic Or something good?
20:52 🔗 antomatic :)
20:59 🔗 icedice2 has joined #archiveteam
21:01 🔗 aMunster has quit IRC (Read error: Operation timed out)
21:01 🔗 robink has quit IRC (Write error: Broken pipe)
21:01 🔗 sep332 has quit IRC (Read error: Operation timed out)
21:01 🔗 nwf has quit IRC (Read error: Operation timed out)
21:01 🔗 yakfish has quit IRC (Read error: Operation timed out)
21:01 🔗 xmc has quit IRC (Read error: Operation timed out)
21:01 🔗 myself has quit IRC (Write error: Broken pipe)
21:02 🔗 beardicus has quit IRC (Read error: Operation timed out)
21:02 🔗 vegbrasil has quit IRC (Read error: Operation timed out)
21:02 🔗 icedice has quit IRC (Ping timeout: 364 seconds)
21:04 🔗 Sketchcow You all relax.
21:04 🔗 Sketchcow I'm writing scrapers.
21:04 🔗 Sketchcow Also, I bought an Ultra HD TV
21:04 🔗 Sketchcow These are good books. very good.
21:04 🔗 Sketchcow I'm working to ensure every possible drop of metadata ends up in the books.
21:04 🔗 Sketchcow Spend an hour or two on it, and the result will be jam-packed of goodness.
21:05 🔗 xmc has joined #archiveteam
21:05 🔗 swebb sets mode: +o xmc
21:16 🔗 REiN has quit IRC (Read error: Operation timed out)
21:30 🔗 Stiletto has quit IRC (Read error: Connection reset by peer)
21:33 🔗 antomatic nice
21:45 🔗 Stiletto has joined #archiveteam
21:53 🔗 Sketchcow http://link.springer.com/book/10.1007/b98860
21:53 🔗 Sketchcow https://archive.org/details/springer_10.1007-b98860
21:53 🔗 Sketchcow I LIKE it
21:54 🔗 DFJustin sexy
21:57 🔗 toad1 has joined #archiveteam
22:12 🔗 Ghost_of_ has joined #archiveteam
22:31 🔗 vegbrasil has joined #archiveteam
22:31 🔗 robink has joined #archiveteam
22:31 🔗 yakfish has joined #archiveteam
22:32 🔗 beardicus has joined #archiveteam
22:32 🔗 aMunster has joined #archiveteam
22:32 🔗 sep332 has joined #archiveteam
22:32 🔗 Sketchcow They're happening.
22:32 🔗 Sketchcow 56,000 books.
22:32 🔗 Sketchcow that should.... hold up, generally.
22:32 🔗 antomatic Are Springer making all of these books free, forever?
22:33 🔗 myself has joined #archiveteam
22:34 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:34 🔗 antomatic amazing if so
22:35 🔗 MRX3 has joined #archiveteam
22:36 🔗 dashcloud has joined #archiveteam
22:37 🔗 icedice2 has quit IRC (Ping timeout: 250 seconds)
22:45 🔗 DFJustin iirc o'reilly has done that for a long time, are those mirrored
22:52 🔗 nwf has joined #archiveteam
22:53 🔗 Boppen has quit IRC (Ping timeout: 200 seconds)
22:55 🔗 Boppen has joined #archiveteam
22:58 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
23:16 🔗 redlob has quit IRC (Read error: Operation timed out)
23:16 🔗 redlob has joined #archiveteam
23:20 🔗 wyatt8740 has joined #archiveteam
23:27 🔗 dashcloud has quit IRC (Remote host closed the connection)
23:28 🔗 kyan That's way cool. Are there WARCs of the scrapes?
23:32 🔗 Sketchcow I avoid O'Reilly
23:32 🔗 Sketchcow If someone else wants to
23:32 🔗 Sketchcow That guy and I don't get along
23:33 🔗 kyan Looks like there are a lot more than 56k. A total of 121,186 books
23:33 🔗 Sketchcow Whatever criteria I chose, I chose.
23:33 🔗 DFJustin https://gist.github.com/bishboria/8326b17bbd652f34566a
23:33 🔗 kyan Oh, nope 110,041
23:37 🔗 tobbez Looks like 110,183 to me
23:49 🔗 Sketchcow Oh I see
23:49 🔗 Sketchcow I added "and put it in english"
23:49 🔗 Sketchcow So I could globally set the OCR correctly.
23:49 🔗 Sketchcow Otherwise I have to guess or we use an analyzer
23:51 🔗 Sketchcow Also, something broke-ass with the searcher they have
23:51 🔗 Sketchcow http://link.springer.com/search/page/1?facet-language=%22En%22&facet-content-type=%22Book%22&showAll=false
23:52 🔗 kyan most are en and de; could do those two as separate packs with correct language, then the rest set to whatever. Also their web site sucks
23:52 🔗 Sketchcow http://link.springer.com/search/page/999?facet-language=%22En%22&facet-content-type=%22Book%22&showAll=false
23:52 🔗 Sketchcow See, if you switch to the next page... server error
23:55 🔗 kyan ...fail
23:56 🔗 Sketchcow This allowed me to get the name of 19,000 books, and I'll be adding a thing to allow me to keep retrying (and it will just not re-upload)
23:57 🔗 arkiver so it's not possible to get any books past that page
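A minimal sketch of the "keep retrying and skip what is already done" idea, not Sketchcow's actual scraper. The URL pattern is copied from the search links pasted above; the function names, the `fetch` callable, and the `done` set are assumptions made for illustration. Retrying only helps with intermittent failures; if the server always errors past a certain page, this will still give up.

```python
import time
from urllib.parse import urlencode

BASE = "http://link.springer.com/search/page/{page}"


def search_page_url(page):
    """Build the English-language book search URL for one result page."""
    query = urlencode({
        "facet-language": '"En"',
        "facet-content-type": '"Book"',
        "showAll": "false",
    })
    return BASE.format(page=page) + "?" + query


def pending_pages(first, last, done):
    """List the page numbers in [first, last] that are not already in done."""
    return [page for page in range(first, last + 1) if page not in done]


def fetch_with_retry(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url), retrying on any exception up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # e.g. a server error surfaced by the fetcher
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

Here `fetch` is any function that takes a URL and returns the page body (for instance a thin wrapper around an HTTP client), and `done` would be loaded from wherever finished pages or uploaded items are recorded, so reruns do not repeat earlier work.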
