[01:15] *** VADemon has joined #archiveteam
[01:26] *** Meeh_ has joined #archiveteam
[01:27] *** raylee has quit IRC (hub.dk irc.underworld.no)
[01:27] *** wm_ has quit IRC (hub.dk irc.underworld.no)
[01:27] *** Atluxity has quit IRC (hub.dk irc.underworld.no)
[01:32] *** philpem has quit IRC (Ping timeout: 252 seconds)
[01:40] *** primus104 has quit IRC (Leaving.)
[01:45] *** Aranje has joined #archiveteam
[02:00] *** DopefishJ is now known as DFJustin
[02:14] *** VADemon has quit IRC (Quit: left4dead)
[03:17] *** machinedr has joined #archiveteam
[03:52] *** Jonimus has quit IRC (Ping timeout: 252 seconds)
[03:53] *** T31M has quit IRC (Read error: Connection reset by peer)
[03:54] *** T31M has joined #archiveteam
[04:05] has archiveteam dealt with this mess yet: https://gist.githubusercontent.com/sebadoom/f0eedcba2f39e3e07a1c/raw/c168b48210bf7f85029545743891e7e4f8c95df4/gistfile1.txt
[04:05] lots of stuff to mirror
[04:32] *** aaaaaaaaa has quit IRC (Leaving)
[05:10] *** mistym has quit IRC (Remote host closed the connection)
[05:14] *** machinedr has quit IRC (Quit: ChatZilla 0.9.91.1 [Firefox 39.0/20150630154324])
[05:24] *** machinedr has joined #archiveteam
[05:25] how is PhantomJS working out on wpull?
[05:28] I created a project similar to phantomjs, but based on java
[05:41] *** mistym has joined #archiveteam
[05:45] machinedr: the output is generally pretty good, but it imposes significant system load and we have seen phantomjs processes that don't terminate as expected
[05:45] this in the archivebot+wpull setup, so it may not be wpull's fault
[05:46] ok
[05:46] the future for archivebot+wpull is probably Chrome-as-crawler
[05:47] as in selenium chrome driver?
[05:48] as in the webkit remote debugging protocol
[05:48] that may be what Selenium+chromedriver uses; I haven't kept up with that
[05:48] anyway that is not a short-term thing
[05:50] yeah, not sure. I saw this issue which mentioned selenium, https://github.com/chfoo/wpull/issues/248
[05:50] ah
[05:50] that would be nice also
[05:53] it would also get us exactly what we need, which is a web thing that can deal with JS without bombing
[05:54] the rest of wpull (WARC generation, link identification, bookkeeping) seems to do fine
[05:54] oh and scripting, concurrency, queue management, etc
[05:56] yeah I experienced bad performance in crashes using selenium's firefox driver. It motivated me to try making my own driver using only java
[05:56] javafx has a webkit embedded
[06:09] With the major browsers dropping Java Applet, support, I was thinking it was time for a new "hotjava" browser.
[06:11] https://en.wikipedia.org/wiki/HotJava
[06:11] ironically java's webview does not support applets :) ... at least not out of the box
[06:13] oh wait... maybe http://stackoverflow.com/questions/27949881/java-applet-in-webview
[06:28] *** Fusl has quit IRC (Ping timeout: 255 seconds)
[06:33] *** _0x2A has quit IRC (Read error: Operation timed out)
[06:44] *** bentpins has joined #archiveteam
[06:59] *** machinedr has quit IRC (Quit: ChatZilla 0.9.91.1 [Firefox 39.0/20150630154324])
[07:07] *** ruukasu_ has quit IRC (Read error: No route to host)
[07:12] *** ruukasu has joined #archiveteam
[07:20] SketchCow: we're going to do a grab of Reddit. We'll save all posts
[07:20] #deaddit
[07:21] users are starting to remove all their posts, some subreddits are going private and some subreddits have announced they're going to delete themselves
[07:31] *** schbirid has joined #archiveteam
[07:51] *** Aranje has quit IRC (Remote host closed the connection)
[07:52] *** bzc6p_ has joined #archiveteam
[07:57] *** bzc6p has quit IRC (Ping timeout: 600 seconds)
[07:57] *** bzc6p_ is now known as bzc6p
[08:01] *** ohhdemgir has quit IRC (Quit: Leaving)
[08:07] *** mistym has quit IRC (Remote host closed the connection)
[08:12] *** primus104 has joined #archiveteam
[08:15] *** wm_ has joined #archiveteam
[08:15] *** raylee has joined #archiveteam
[08:17] *** WinterFox has quit IRC (Ping timeout: 483 seconds)
[08:22] *** ohhdemgir has joined #archiveteam
[08:22] *** WinterFox has joined #archiveteam
[08:23] *** habi has joined #archiveteam
[08:23] *** primus104 has quit IRC (Leaving.)
[08:32] *** philpem has joined #archiveteam
[08:32] *** habi has left
[08:59] *** primus104 has joined #archiveteam
[09:08] *** mistym has joined #archiveteam
[09:11] *** WinterFox has quit IRC (Remote host closed the connection)
[09:13] *** WinterFox has joined #archiveteam
[09:14] *** Ungstein has joined #archiveteam
[09:17] *** mistym has quit IRC (Ping timeout: 483 seconds)
[09:23] *** alt40409 has quit IRC (Ping timeout: 370 seconds)
[09:29] *** WinterFox has quit IRC (Ping timeout: 483 seconds)
[09:39] *** WinterFox has joined #archiveteam
[09:40] SketchCow: last.fm user discovery has started
[10:48] *** vitzli has joined #archiveteam
[11:10] *** mistym has joined #archiveteam
[11:18] *** mistym has quit IRC (Read error: Operation timed out)
[11:25] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:41] *** Fusl has joined #archiveteam
[11:42] *** thewalrus has joined #archiveteam
[11:42] *** thewalrus has left
[11:44] *** _0x2A has joined #archiveteam
[11:52] *** szalwia has joined #archiveteam
[12:25] *** VADemon has joined #archiveteam
[12:28] *** RichardG has joined #archiveteam
[12:30] *** Ungstein1 has joined #archiveteam
[12:30] *** Ungstein has quit IRC (Ping timeout: 265 seconds)
[12:46] *** Rickster has quit IRC (Ping timeout: 252 seconds)
[12:47] *** Muad-Dib has quit IRC (Ping timeout: 252 seconds)
[13:10] *** mistym has joined #archiveteam
[13:11] *** Rickster has joined #archiveteam
[13:18] *** mistym has quit IRC (Read error: Operation timed out)
[13:19] *** signius has quit IRC (Read error: Operation timed out)
[13:32] *** signius has joined #archiveteam
[13:41] *** Emcy_ has joined #archiveteam
[13:44] *** Emcy has quit IRC (Ping timeout: 306 seconds)
[13:45] *** VADemon has quit IRC (Read error: Connection reset by peer)
[13:46] *** VADemon has joined #archiveteam
[14:30] *** WinterFox has quit IRC (Remote host closed the connection)
[14:32] Take a shot
[14:32] (Reddit)
[15:12] *** mistym has joined #archiveteam
[15:16] *** mistym has quit IRC (Ping timeout: 252 seconds)
[15:17] Awesome, let's grab reddit
[15:22] *** Ungstein has joined #archiveteam
[15:25] *** Ungstein1 has quit IRC (Ping timeout: 265 seconds)
[15:40] *** bentpins has quit IRC (Read error: Connection reset by peer)
[15:46] *** primus104 has quit IRC (Leaving.)
[16:21] *** primus104 has joined #archiveteam
[16:23] *** ruukasu has quit IRC (Ping timeout: 265 seconds)
[16:27] *** ruukasu has joined #archiveteam
[16:36] *** ruukasu has quit IRC (Ping timeout: 265 seconds)
[16:46] *** godane has quit IRC (Read error: Operation timed out)
[16:50] *** mistym has joined #archiveteam
[16:54] *** primus104 has quit IRC (Leaving.)
[16:56] *** godane has joined #archiveteam
[16:57] https://twitter.com/renesugar/status/617736740044836864
[16:57] *** SN4T14 has joined #archiveteam
[17:20] oh my
[17:21] *** primus104 has joined #archiveteam
[17:22] *** vitzli has quit IRC (Quit: Leaving)
[17:28] *** ruukasu has joined #archiveteam
[18:04] *** aaaaaaaaa has joined #archiveteam
[18:14] Dear Archive Team,
[18:14] It appears from your Wiki that you successfully archived Windows Live Spaces. I am trying to access my old space and have tried the wayback machine with no success.
[18:14] Did you succeed in archiving Live Spaces? Is there a way I might be able to access my old 'space'?
[18:14] Thanks for your excellent work.
[18:14] Parag
[18:44] *** Stiletto has quit IRC (Ping timeout: 258 seconds)
[18:47] *** Stiletto has joined #archiveteam
[19:07] *** bzc6p has quit IRC (Ping timeout: 600 seconds)
[19:08] *** bzc6p has joined #archiveteam
[19:53] *** habi has joined #archiveteam
[19:53] *** habi has left
[19:54] Channel of our full grab of reddit: #deaddit
[20:03] *** jbaumgart has joined #archiveteam
[20:03] hello
[20:10] Hey
[20:12] *** primus104 has quit IRC (Leaving.)
[20:15] did you get the link to the reddit_data torrent?
[20:15] News to me. Others might know.
[20:15] here's the thread I made for it -- https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
[20:17] *** jbaumgart has quit IRC (Leaving)
[20:18] *** bzc6p_ has joined #archiveteam
[20:21] magnet:?xt=urn:btih:7690f71ea949b868080401c749e878f98de34d3d&dn=reddit%5Fdata&tr=http%3A%2F%2Ftracker.pushshift.io%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80
[20:21] the important bit ;)
[20:21] Yes, I'm grabbing it and will put it on archive.org.
[20:23] *** bzc6p has quit IRC (Read error: Operation timed out)
[20:24] :)
[20:58] *** primus104 has joined #archiveteam
[21:22] *** bzc6p_ is now known as bzc6p
[21:38] who can figure out how to search for licenseurl containing by-nc on https://archive.org/advancedsearch.php ?
[21:38] oh my god that website suck
[21:45] :)
[21:45] Bring it on
[21:47] in a moment, i have to wait for the search field expanding animation to finish
[21:48] *** nox2 has quit IRC (Ping timeout: 252 seconds)
[21:48] What are you on, a tin can connected to a windmill?
[21:49] Sounds like someone should be using https://pypi.python.org/pypi/internetarchive
[21:49] And utilizing ia search
[21:49] not sure i want to tell that to the friend who was asking
[21:49] I mean, keep trashing it
[21:50] Because as you know, my gentle and supplicant personality is legendary
[21:50] *** philpem has quit IRC (Remote host closed the connection)
[21:51] Also, remember our money-back guarantee
[21:51] you can dwell in trash talk as much as you like, the site is not becoming better
[21:51] i guess patches are welcome
[21:52] Well, as you know, we work day in and day out to make your experience as terrible as possible.
[21:52] We wait and look over every feature, and if we find it has use or utility, we strip it out
[21:52] That's what we do.
[21:52] schbirid: licenseurl doesn't look like it's set up for substring search, or the tokens (e.g. by-nc) are too short. using a full URL, e.g. http://creativecommons.org/licenses/by-nc-nd/3.0/, works fine
[21:53] But, I mean, sure, nothing gets the job done quicker than whining like your subscription copy of Dark Plunders III has an overpriced optional weapon you couldn't hack on a F2P server.
[21:53] That's how the job gets done.
[21:53] yipdw: zero results here -> https://archive.org/search.php?query=licenseurl%3A%28http%3A%2F%2Fcreativecommons.org%2Flicenses%2Fby-nc-nd%2F3.0%2F%29
[21:53] https://archive.org/search.php?query=licenseurl%3A%22http%3A%2F%2Fcreativecommons.org%2Flicenses%2Fby-nc-nd%2F3.0%2F%22
[21:53] nonzero cardinality there
[21:54] SketchCow: i love the search bar animation, it really adds usability. also image galleries for music!
[21:54] IA's search engine is Solr, or at least it seems Lucene-based
[21:54] it helps to know Lucene syntax
[21:54] ah, nice
[21:54] SketchCow: when you get a chance, there's a lot of spam when you search for "Microsoft Office"
[21:54] () is what "contains" from the ui got me
[21:54] that may or may not be right, I forget what that means in Lucene
[21:55] I don't actually know what IA uses for search except that a lot of what I used to do when tweaking Solr installs seems to apply
[21:56] those were dark times
[22:02] *** oldcad has joined #archiveteam
[22:03] *** schbirid has quit IRC (Leaving)
[22:38] *** nox has quit IRC (Read error: Connection reset by peer)
[22:47] http://pastebin.com/raw.php?i=qYh8E841
[22:47] awww yis
[23:02] nice!
[23:10] *** WinterFox has joined #archiveteam
[23:17] *** VADemon_ has joined #archiveteam
[23:20] *** VADemon has quit IRC (Read error: Operation timed out)
[23:50] *** DopefishJ has joined #archiveteam
[23:59] *** DFJustin has quit IRC (Ping timeout: 740 seconds)
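
Editor's note on the 21:52-21:54 licenseurl exchange: the two search links in the log differ only in how the URL value is wrapped, parentheses (zero results) versus double quotes (results). In Lucene-style syntax, double quotes normally mark an exact phrase, so the unquoted colons and slashes in the first link were probably being interpreted by the query parser. That is a guess, and the log itself is unsure what engine archive.org runs. Below is a minimal Python sketch that rebuilds both links; the helper name `search_link` and its `phrase` parameter are made up for illustration and are not part of any archive.org tool.

```python
from urllib.parse import quote

# Base URL and license value copied from the links in the log.
SEARCH_URL = "https://archive.org/search.php?query="
LICENSE = "http://creativecommons.org/licenses/by-nc-nd/3.0/"


def search_link(field: str, value: str, phrase: bool = True) -> str:
    """Build a search link for field:value (hypothetical helper).

    phrase=True wraps the value in double quotes, the form that
    returned results in the log. phrase=False wraps it in
    parentheses, the form the UI's "contains" option produced.
    """
    wrapped = f'"{value}"' if phrase else f"({value})"
    # safe="" so that ':' and '/' are percent-encoded as in the log.
    return SEARCH_URL + quote(f"{field}:{wrapped}", safe="")


quoted_link = search_link("licenseurl", LICENSE)
paren_link = search_link("licenseurl", LICENSE, phrase=False)
print(quoted_link)
print(paren_link)
```

The unencoded query string, `licenseurl:"http://creativecommons.org/licenses/by-nc-nd/3.0/"`, is presumably also what one would hand to the `ia search` command mentioned at 21:49, though nobody tried that in the channel.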