[00:04] *** nightpool has quit IRC (Ping timeout: 260 seconds)
[00:05] *** espes__ has joined #archiveteam-bs
[00:09] *** nightpool has joined #archiveteam-bs
[00:13] *** DoomTay has quit IRC (Quit: Page closed)
[00:18] *** DoomTay has joined #archiveteam-bs
[00:34] so i found the San Francisco Bay Area Television Archive
[00:35] https://diva.sfsu.edu/
[00:38] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[00:40] Looks like a job for ArchiveBot
[00:41] Oh, never mind. There's stuff behind an accountwall
[00:46] Also, just to make sure the word gets out real soon, here's what will likely be a REAL fun situation: http://laurapinto.tripod.com/andykim/
[00:46] Wait, no
[00:46] http://blog.bioware.com/2016/07/29/concerning-our-forums/
[01:10] SketchCo2: i will be giving the CBS 1960-11-08 Election Coverage sometime next week
[01:10] its 4 hours of it on dvd
[01:10] *** SketchCo2 is now known as SketchCow
[01:11] godane: nice find on that archive!
[01:11] Are you planning on getting those videos into IA?
[01:11] also, I love the NASA uploads
[01:11] i upload them to FOS
[01:12] The videos from the archive you found?
[01:12] yes
[01:12] i upload the dvd videos i find to FOS
[01:12] awesome!
[01:13] i found this: http://www.cyclenews.com/cycle-news-archives/
[01:13] but it looks like it's paywalled
[01:14] that's really nice
[01:14] the paywall sucks though
[01:14] that archives magazines that go back to the 1960s
[01:14] well
[01:14] * arkiver is afk for the night
[01:14] keep up the awesome work godane :D
[01:14] i will
[01:14] i'm up to 2008-09-05 with the funny or die archive
[01:15] for example http://magazine.cyclenews.com/i/84166-cycle-news-1972-issue-27-jul-18 doesn't seem paywalled
[01:15] ok then
[01:15] i was only looking at the 1960s ones
[01:15] alright
[01:15] I'm off anyway
[01:16] have a good day :)
[01:17] *** username1 has joined #archiveteam-bs
[01:21] *** schbirid2 has quit IRC (Ping timeout: 244 seconds)
[01:25] *** GLaDOS has joined #archiveteam-bs
[02:19] *** JesseW has joined #archiveteam-bs
[02:23] *** nightpool has quit IRC (Ping timeout: 260 seconds)
[03:31] *** nightpool has joined #archiveteam-bs
[03:36] *** nightpool has quit IRC (Ping timeout: 250 seconds)
[03:48] *** DoomTay has quit IRC (Quit: Page closed)
[03:51] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[04:14] *** DoomTay has joined #archiveteam-bs
[04:26] *** dashcloud has quit IRC (Read error: Operation timed out)
[04:29] *** dashcloud has joined #archiveteam-bs
[04:39] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:45] *** Sk1d has joined #archiveteam-bs
[04:50] *** DoomTay has quit IRC (Quit: Page closed)
[05:01] *** GLaDOS has joined #archiveteam-bs
[05:27] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[05:28] *** dashcloud has joined #archiveteam-bs
[05:39] *** robink has quit IRC (Ping timeout: 246 seconds)
[05:43] *** robink has joined #archiveteam-bs
[05:52] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:00] *** dashcloud has joined #archiveteam-bs
[06:14] i have 787k items now
[06:14] more like 788k if you include my godanefunnyordie account
[06:32] *** zgrant has left
[06:40] *** Honno has joined #archiveteam-bs
[06:43] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:47] *** dashcloud has joined #archiveteam-bs
[06:54] godane: what would happen if we tried to put it through archivebot?
[07:03] *** Honno has quit IRC (Ping timeout: 1208 seconds)
[07:21] hook54321: what site are you talking about?
[07:39] For anyone who gives a total shit, I have been gearing up a side project to turn the Apple II emulated software collection on the Internet Archive into a world-class collection
[07:39] Currently, I'm doing a sweep of all redundant items. It's taking a while, because of the 10,000, there's 1,000 or so dupes.
[07:39] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:40] *** dashcloud has joined #archiveteam-bs
[07:40] * JesseW isn't particularly interested in Apple II's, but is always pleased by pretty metadata
[07:41] Once the redundants are removed, I will drill against the remaining collection and metadata it beyond belief.
[07:43] we need to find old scans of the Scholastic Arrow Book Club News Paper: https://www.flickr.com/photos/annainca/5659359459/in/album-72157626587471194/
[07:43] *** JesseW has quit IRC (Remote host closed the connection)
[07:44] something i loved to look at when i was in grade 1 to 5
[07:44] *** Sanqui is now known as SanquiGON
[07:44] *** SanquiGON is now known as SanquiAFK
[07:49] After that, I'll end up adding more stuff, but everything gets scanned and only new things are added
[08:00] *** dashcloud has quit IRC (Read error: Operation timed out)
[08:05] *** dashcloud has joined #archiveteam-bs
[08:21] Anyone else having issues with livestream.com videos?
[08:24] what kind of issues HCross ?
[08:25] my first issue is that i need flash.
[08:26] I hit play.. and nothing happens. youtube-dl also gets a content too short error
[08:27] Trying to watch Jason's HOPE talk
[08:27] link?
[08:27] http://livestream.com/internetsociety/hopeconf/videos/130749038
[08:28] seems broken indeed
[08:28] firefox says it's processing - please come back later.
[08:29] the lockpicking one works
[08:30] so yeah, seems broken on their side
[08:33] SketchCow, do you have another copy of the video please?
[08:35] *** mksplg has quit IRC (Quit: WeeChat 0.4.2)
[08:37] *** Yoshimura has joined #archiveteam-bs
[08:51] iirc hope recordings get properly published at some point
[08:53] Hoping is useless.
[08:53] hope is a conference
[08:55] Oh yeah, that one.
[08:55] Efnet does not +q? Instead of answering why is it offtopic using ban hammer. I am no longer surprised here. It is a hobby not a serious thing.
[08:56] ugh, stop complaining. you're disruptive and irritating.
[08:57] you have accomplished nothing with archiveteam, just complaining that we are doing things wrong
[08:57] that is why you are not welcome here
[08:59] *** Yoshimura was kicked by xmc (out)
[09:08] *** Yoshimura has joined #archiveteam-bs
[09:08] Raising a concern and a question is not complaining.
[09:08] *** dashcloud has quit IRC (Read error: Operation timed out)
[09:09] *** xmc sets mode: +b *!4f8dff3d@ag-255-61.sta.ji.cz
[09:09] *** Yoshimura was kicked by xmc (Yoshimura)
[09:12] *** dashcloud has joined #archiveteam-bs
[09:16] why do the idiots always find these channels :p
[09:17] i could suggest a reason but i would probably get banned as well
[09:17] gah, xchat...
[09:17] *** username1 is now known as schbirid
[09:17] *** fie has joined #archiveteam-bs
[09:18] lol :p
[09:18] ah, it's you :P
[09:18] you have a history of not being a useless twat who gets in the way
[09:21] oh i get in my own way alright!
[09:22] *** wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
[09:46] \o/
[10:07] *** dashcloud has quit IRC (Read error: Operation timed out)
[10:10] *** dashcloud has joined #archiveteam-bs
[10:29] so in theory i'm at 2008-09-10 with funny or die videos
[10:42] *** mismatch has joined #archiveteam-bs
[10:44] would it be possible to download/backup ~200,000 ISP hosting sites already in the Wayback Machine to a warc?
[10:45] robots.txt keeps changing, meaning they're sometimes inaccessible.
[10:45] I have a txt file with all the sites listed
[10:58] I guess my question should be, can archivebot download a list of websites from a txt file?
[11:00] Is it just myVIP that is underway at the moment, or is there anything else that needs help?
[11:04] mismatch: yes, it can. but only in ao mode
[11:05] but 200k individual sites might be a bit much
[11:07] midas: thanks. HCross: that's fair enough, I'll maybe try with just 50 to start with and see how it performs
[11:12] in !a/recursive mode, if a site links to an external url such as google.com - is that also downloaded?
[11:15] god https://www.youtube.com/watch?v=UqVYWP4wk3I "6 Years of Hard Work Erased in 5 Clicks"
[11:22] that's just wow...
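Since mismatch plans to start with a batch of 50 out of the 200k-line txt file, here is a minimal sketch of the pre-processing side: deduplicating the list and splitting it into fixed-size batches that could each be handed to a job. The file name `urls.txt`, the `batch-NNN.txt` output names, and the batch size of 50 are made-up examples, not anything ArchiveBot requires.

```python
# Sketch: dedupe a large URL list and split it into small batches,
# e.g. to test a handful of sites before committing to all 200k.

def batch_urls(lines, batch_size=50):
    """Dedupe URLs (preserving order) and yield them in fixed-size batches."""
    seen = set()
    unique = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    for i in range(0, len(unique), batch_size):
        yield unique[i:i + batch_size]

def write_batches(in_path="urls.txt"):
    # Write each batch to its own file (batch-000.txt, batch-001.txt, ...).
    with open(in_path) as f:
        for n, batch in enumerate(batch_urls(f)):
            with open("batch-%03d.txt" % n, "w") as out:
                out.write("\n".join(batch) + "\n")
```

Call `write_batches()` on the real list; starting with a single small batch file matches the "try with just 50" approach from the discussion.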
[11:24] ^ it'll be interesting to see if YouTube solve this
[11:25] if they even will try
[11:26] HCross i remember it worked with !a too
[11:26] mismatch
[11:26] join #archivebot
[11:27] luckcolor: already there :) - thanks though, I need to experiment a bit
[11:29] in theory it might just work with 200k urls
[11:29] i think ao makes one warc per url
[11:29] also internet is slow
[11:30] and sketchy
[11:31] mismatch: if you need voice or you want to schedule the job feel free to message
[11:31] true, all the urls are web.archive.org/web/[date]/[site] which I'm guessing might also cause recursion issues for archivebot
[11:31] luckcolor: <3
[11:36] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[11:36] *** BartoCH has joined #archiveteam-bs
[11:45] *** joepie91 has quit IRC (Read error: Operation timed out)
[11:46] *** botpie91 has quit IRC (Read error: Operation timed out)
[11:48] *** arkiver has quit IRC (Ping timeout: 370 seconds)
[12:11] *** joepie91 has joined #archiveteam-bs
[12:11] *** arkiver has joined #archiveteam-bs
[12:19] *** schbirid has quit IRC (Quit: Leaving)
[12:24] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:31] *** dashcloud has joined #archiveteam-bs
[12:59] *** metalcamp has joined #archiveteam-bs
[13:18] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:30] *** bzc6p has joined #archiveteam-bs
[13:30] *** swebb sets mode: +o bzc6p
[13:33] *** bzc6p sets mode: +oooo arkiver Atluxity chfoo closure
[13:34] *** bzc6p sets mode: +oooo Coderjoe dashcloud FalconK Fletcher
[13:34] *** bzc6p sets mode: +oooo GLaDOS godane HCross HCross2
[13:34] *** bzc6p sets mode: +oooo joepie91 JW_work1 Kaz Kenshin
[13:34] *** bzc6p sets mode: +oooo luckcolor midas PurpleSym Start
[13:34] *** bzc6p sets mode: +o yipdw
[13:40] *** bzc6p has left
[13:44] *** fie_ has joined #archiveteam-bs
[13:44] *** fie has quit IRC (Read error: Connection reset by peer)
[13:51] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:55] *** dashcloud has joined #archiveteam-bs
[14:11] *** nightpool has joined #archiveteam-bs
[14:15] *** DoomTay has joined #archiveteam-bs
[14:20] *** zino has quit IRC (Remote host closed the connection)
[14:23] *** bzc6p has joined #archiveteam-bs
[14:23] *** swebb sets mode: +o bzc6p
[14:24] DoomTay: livejournal discovery items are usually 1,299 bytes in size, so it's fine.
[14:24] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[14:24] Huh.
[14:24] And I think I can guess why items usually aren't in KB
[14:25] I mean DISPLAYED in KB
[14:25] Feel free to improve the software.
[14:27] *** BartoCH has joined #archiveteam-bs
[14:32] *** nightpool has quit IRC (Read error: Operation timed out)
[14:41] *** Honno has joined #archiveteam-bs
[14:53] *** bzc6p has left
[14:54] *** SanquiAFK has quit IRC (Ping timeout: 260 seconds)
[14:55] *** nightpool has joined #archiveteam-bs
[15:07] *** MrRadar has quit IRC (Quit: Restarting)
[15:08] *** Honno has quit IRC (Ping timeout: 1208 seconds)
[15:34] lol Yoshimura was back I see
[15:35] that was amusing
[15:35] *** Coderjoe has quit IRC (Read error: Operation timed out)
[15:39] *** Coderjoe has joined #archiveteam-bs
[15:52] *** Rye has quit IRC (Ping timeout: 244 seconds)
[15:53] *** DoomTay has quit IRC (Quit: Page closed)
[15:54] *** MrRadar has joined #archiveteam-bs
[15:54] *** Rye has joined #archiveteam-bs
[16:01] *** Sanqui has joined #archiveteam-bs
[16:10] *** kristian_ has joined #archiveteam-bs
[16:43] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[16:44] *** ndiddy has joined #archiveteam-bs
[17:15] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:18] *** dashcloud has joined #archiveteam-bs
[19:02] *** Coderjoe has quit IRC (Ping timeout: 260 seconds)
[19:03] *** Coderjoe has joined #archiveteam-bs
[19:04] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:09] *** BartoCH has joined #archiveteam-bs
[19:18] *** tomwsmf has joined #archiveteam-bs
[19:27] *** JesseW has joined #archiveteam-bs
[19:28] I just noticed that, in the year (!) since I put together a pile of metadata about sourceforge projects, six people have forked the repo I put it in. None have *done* anything with the forks -- they merely made them. One only forked one other repo.
[19:28] The world is strange.
[19:30] *** kristian_ has quit IRC (Leaving)
[19:41] joepie91: regarding newww, I found a copy of the code here: https://github.com/rafaeljesus/newww (albeit not particularly up to date)
[19:42] exactly *one* of the issues from the repo was saved in the wayback machine: https://web.archive.org/web/20150224214155/https://github.com/npm/newww/issues/190
[19:48] godane: I was talking about the cyclenews site
[19:50] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[19:51] joepie91: https://github.com/npm/www/issues/9 -- I just asked them to make the old issues available. We'll see what the response might be.
[19:52] I'm not sure whether having you support that request would be a positive or a negative. :-)
[19:54] It would probably be good to grab copies of the issues from all the other npm repos now, just in case.
[19:56] Did you check Google cache for any of the old issues?
[19:56] hook54321: good idea. Please do so, and dump them into archive.is if you find any
[19:57] There's a script that I think does stuff like that automatically, I still haven't gotten it to work though. Do you know the URL of the old repo?
[19:57] https://github.com/npm/newww
[19:58] there's an even older repo, https://github.com/npm/npm-www with over 900 issues, that it would be good to grab
[20:02] *** BartoCH has joined #archiveteam-bs
[20:07] Have we put any of the second one through archivebot yet?
[20:08] you don't put GitHub repos in ArchiveBot, by the way
[20:13] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[20:13] What if we just put in the issues page?
[20:14] And used lots of ignores
[20:15] oh, issues
[20:16] yeah, probably
[20:17] there's also this: https://github.com/joeyh/github-backup
[20:17] that's what yipdw recommended be used for github stuff
[20:18] it produces more usable results, yes
[20:18] a WARC copy of a github repo is IMO pointless
[20:18] maybe for issues only it would be less useless?
[20:18] it's pointless
[20:18] if you want the issues github-backup gets it too
[20:18] kk
[20:20] a derivation of the issue data into static HTML probably has merit but the github interface has so many links that an automated crawl is a mess
[20:21] !ao tends to be okay if you're trying to be a jerk
[20:25] My main concern is that there should be some sort of central place where people can browse and upload backups of repositories.
[20:25] people usually do that on github yes
[20:26] google code for example
[20:26] GitLab has a Github import function which works reasonably well, also
[20:27] I do wonder how someone will fund this central place
[20:28] I mean we could upload backups of github repos to the Internet archive, but it would still be useless until the user downloaded the whole archive, which could potentially be huge.
[20:29] yep
[20:29] or you do a shallow clone
[20:31] if you're interested in saving copies of git repositories and their associated data, I think self-hosted gitlab is a good choice
[20:31] I run an instance that does more or less that
[20:31] then I have my backups of dependent libraries and I only need to worry about keeping my instance healthy
[20:34] for github backups, codearchive.org is doing a backup of every repo with 10 stars or more (and less than 250 MB in size, unless whitelisted) and every change
[20:34] oh, actually, correction
[20:35] github->gitlab import requires your github OAuth tokens and so you can only do that backup for repositories you're authorized on
[20:35] so, yes, if you're a project member it's a good choice :P
[20:35] here's the project site: https://the-code-archive.launchrock.com/
[20:36] wow
[20:36] nice
[20:37] oh cool, Filippo Valsorda is involved in that
[20:37] (he also wrote the Warrior Dockerfile)
[20:38] Interesting. About what percentage of repos are 10 stars or more and less than 250 MB?
[20:39] https://github.com/search?utf8=%E2%9C%93&q=stars%3A%22%3E+10%22+size%3A%22%3C%3D+262144000%22&type=Repositories&ref=advsearch&l=&l=
[20:40] there's 3,466,477 with at least one star
[20:41] oh, oops, that's supposed to be size in kB
[20:41] https://github.com/search?utf8=%E2%9C%93&q=stars%3A%22%3E+10%22+size%3A%22%3C%3D+262144%22&type=Repositories&ref=searchresults is fixed
[20:42] and it should be >= but no matter what you're looking at 10-11%
[20:42] *** robink has quit IRC (Ping timeout: 501 seconds)
[20:43] which rounds pretty nicely with Sturgeon's Law
[20:49] Not to mention, there are probably tons of repositories that are just empty.
[20:49] in case you're wondering, the repo itself is backed up because it got 10 stars by the end of the talk (which is here: http://livestream.com/internetsociety2/hopeconf/videos/130613964)
[20:50] Which repo?
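The [20:41] correction comes down to GitHub's search `size:` qualifier measuring repository size in kB, while the first query used a byte count. A quick arithmetic check on the two numbers from the log:

```python
# GitHub search's size: qualifier is in kB, not bytes.
# Values taken from the two search URLs in the log above.

wrong_query_value = 262144000   # first attempt: ~250 MiB expressed in bytes
fixed_query_value = 262144      # corrected query: 256 MiB expressed in kB

# 250 MiB in bytes happens to equal the first (wrong) value:
assert 250 * 1024 * 1024 == wrong_query_value

# 256 MiB in kB equals the corrected value:
assert 256 * 1024 == fixed_query_value
```

So the wrong query was effectively asking for repos under ~250 GB (interpreting the byte count as kB), which is why fixing the unit mattered.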
[21:02] *** Honno has joined #archiveteam-bs
[21:14] *** robink has joined #archiveteam-bs
[21:18] *** Honno has quit IRC (Read error: Operation timed out)
[21:34] *** DoomTay has joined #archiveteam-bs
[21:52] *** whydomain has joined #archiveteam-bs
[21:59] Can anyone recommend a (preferably python) browser emulator? Basically all I want to do is navigate to a web page with a cookie preloaded, and then click on two div elements.
[22:13] whydomain: PhantomJS
[22:13] It's basically a scriptable headless version of Chromium
[22:15] *** DoomTay_ has joined #archiveteam-bs
[22:15] Are there any examples of usage for loading in pages an manipulating elements? I can't find any.
[22:16] ^and manipulating
[22:16] *** DoomTay has quit IRC (Ping timeout: 268 seconds)
[22:16] This is apparently an example for injecting JS into a page: https://github.com/ariya/phantomjs/blob/master/examples/injectme.js
[22:18] It looks like this talks about manipulating the DOM: http://phantomjs.org/page-automation.html
[22:19] I've never used PhantomJS myself so I'm not too knowledgeable about it
[22:22] Ah, thanks. That was just the example I was looking for.
[22:59] *** JesseW has joined #archiveteam-bs
[23:10] *** ItsYoda has quit IRC (Ping timeout: 260 seconds)
[23:12] is archivebot kept more up to date than grab-site?
[23:13] ha, I doubt it.
[23:13] hook54321: what do you mean by "up to date"?
[23:14] Developed more I guess?
[23:15] archivebot isn't developed at all currently. Effort is explicitly directed to grab-site (or the wrapper around it, I can't remember) instead.
[23:15] (I may be mistaken about this -- if so, yipdw should be able to correct it)
[23:19] *** ItsYoda has joined #archiveteam-bs
[23:43] *** Coderjoe has quit IRC (Read error: Operation timed out)
[23:45] *** DoomTay_ is now known as DoomTay
[23:50] *** Coderjoe has joined #archiveteam-bs
[23:54] Think grab-site would run on a computer with an Atom N270 processor?
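Since whydomain preferred Python, one route to the "cookie preloaded, click two divs" task is driving PhantomJS through Selenium's (2016-era, since deprecated) Python bindings rather than writing raw PhantomJS JavaScript. This is only a sketch: the URL, cookie values, and CSS selectors are made-up placeholders, and running it requires `pip install selenium` plus a `phantomjs` binary on the PATH.

```python
# Sketch: preload a cookie and click two div elements with Selenium+PhantomJS.
# All names below (URL, cookie, selectors) are illustrative placeholders.

def make_cookie(name, value, domain):
    """Build the cookie dict that Selenium's add_cookie() expects."""
    return {"name": name, "value": value, "domain": domain, "path": "/"}

def run():
    from selenium import webdriver  # selenium 2/3-era API

    driver = webdriver.PhantomJS()
    # add_cookie() only works for the domain currently loaded, so visit
    # the page once, set the cookie, then load it again with the cookie.
    driver.get("http://example.com/page")
    driver.add_cookie(make_cookie("session", "abc123", "example.com"))
    driver.get("http://example.com/page")

    # Click the two target divs (placeholder selectors).
    for selector in ("div.first-target", "div.second-target"):
        driver.find_element_by_css_selector(selector).click()

    driver.quit()
```

Call `run()` to execute; the injectme.js example linked above shows the equivalent approach in PhantomJS's own JavaScript API (`page.evaluate`).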
[23:56] I don't see why it wouldn't
[23:56] I've run wpull (the core component of grab-site/ArchiveBot that actually does the scraping) on a 1st gen Raspberry Pi
[23:56] Which is almost certainly slower
[23:56] *** wp494 has joined #archiveteam-bs
[23:57] ah, right -- it's wpull that's the kernel, and grab-site/ArchiveBot that are the wrappers