#archiveteam-bs 2016-07-30,Sat

Time Nickname Message
00:04 🔗 nightpool has quit IRC (Ping timeout: 260 seconds)
00:05 🔗 espes__ has joined #archiveteam-bs
00:09 🔗 nightpool has joined #archiveteam-bs
00:13 🔗 DoomTay has quit IRC (Quit: Page closed)
00:18 🔗 DoomTay has joined #archiveteam-bs
00:34 🔗 godane so i found the San Francisco Bay Area Television Archive
00:35 🔗 godane https://diva.sfsu.edu/
00:38 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
00:40 🔗 DoomTay Looks like a job for ArchiveBot
00:41 🔗 DoomTay Oh, never mind. There's stuff behind an accountwall
00:46 🔗 DoomTay Also, just to make sure the word gets out real soon, here's what will likely be a REAL fun situation: http://laurapinto.tripod.com/andykim/
00:46 🔗 DoomTay Wait, no
00:46 🔗 DoomTay http://blog.bioware.com/2016/07/29/concerning-our-forums/
01:10 🔗 godane SketchCo2: i will be give the CBS 1960-11-08 Election Coverage sometime next week
01:10 🔗 godane its 4 hours of it on dvd
01:10 🔗 SketchCo2 is now known as SketchCow
01:11 🔗 arkiver godane: nice find on that archive!
01:11 🔗 arkiver Are you planning on getting those videos into IA?
01:11 🔗 arkiver also, I love the NASA uploads
01:11 🔗 godane i upload them to FOS
01:12 🔗 arkiver The videos from the archive you found?
01:12 🔗 godane yes
01:12 🔗 godane i upload the dvds videos i find to FOS
01:12 🔗 arkiver awesome!
01:13 🔗 godane i found this: http://www.cyclenews.com/cycle-news-archives/
01:13 🔗 godane but looks like it's paywalled
01:14 🔗 arkiver that's really nice
01:14 🔗 arkiver the paywall sucks though
01:14 🔗 godane that archive has magazines that go back to the 1960s
01:14 🔗 arkiver well
01:14 🔗 * arkiver is afk for the night
01:14 🔗 arkiver keep up the awesome work godane :D
01:14 🔗 godane i will
01:14 🔗 godane i'm up to 2008-09-05 with funny or die archive
01:15 🔗 arkiver for example http://magazine.cyclenews.com/i/84166-cycle-news-1972-issue-27-jul-18 doesn't seem paywalled
01:15 🔗 godane ok then
01:15 🔗 godane i was only looking at the 1960s ones
01:15 🔗 arkiver alright
01:15 🔗 arkiver I'm off anyway
01:16 🔗 arkiver have a good day :)
01:17 🔗 username1 has joined #archiveteam-bs
01:21 🔗 schbirid2 has quit IRC (Ping timeout: 244 seconds)
01:25 🔗 GLaDOS has joined #archiveteam-bs
02:19 🔗 JesseW has joined #archiveteam-bs
02:23 🔗 nightpool has quit IRC (Ping timeout: 260 seconds)
03:31 🔗 nightpool has joined #archiveteam-bs
03:36 🔗 nightpool has quit IRC (Ping timeout: 250 seconds)
03:48 🔗 DoomTay has quit IRC (Quit: Page closed)
03:51 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
04:14 🔗 DoomTay has joined #archiveteam-bs
04:26 🔗 dashcloud has quit IRC (Read error: Operation timed out)
04:29 🔗 dashcloud has joined #archiveteam-bs
04:39 🔗 Sk1d has quit IRC (Ping timeout: 194 seconds)
04:45 🔗 Sk1d has joined #archiveteam-bs
04:50 🔗 DoomTay has quit IRC (Quit: Page closed)
05:01 🔗 GLaDOS has joined #archiveteam-bs
05:27 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
05:28 🔗 dashcloud has joined #archiveteam-bs
05:39 🔗 robink has quit IRC (Ping timeout: 246 seconds)
05:43 🔗 robink has joined #archiveteam-bs
05:52 🔗 dashcloud has quit IRC (Read error: Operation timed out)
06:00 🔗 dashcloud has joined #archiveteam-bs
06:14 🔗 godane i have 787k items now
06:14 🔗 godane more like 788k if you include my godanefunnyordie account
06:32 🔗 zgrant has left
06:40 🔗 Honno has joined #archiveteam-bs
06:43 🔗 dashcloud has quit IRC (Read error: Operation timed out)
06:47 🔗 dashcloud has joined #archiveteam-bs
06:54 🔗 hook54321 godane: what would happen if we tried to put it through archivebot?
07:03 🔗 Honno has quit IRC (Ping timeout: 1208 seconds)
07:21 🔗 godane hook54321: what site are you talking about?
07:39 🔗 SketchCow For anyone who gives a total shit, I have been gearing a side project to turn the Apple II emulated software collection on the Internet Archive into a world-class collection
07:39 🔗 SketchCow Currently, I'm doing a sweep of all redundant items. It's taking a while, because of the 10,000, there's 1,000 or so dupes.
07:39 🔗 dashcloud has quit IRC (Read error: Operation timed out)
07:40 🔗 dashcloud has joined #archiveteam-bs
07:40 🔗 * JesseW isn't particularly interested in Apple II's, but is always pleased by pretty metadata
07:41 🔗 SketchCow Once the redundants are removed, I will drill against the remaining collection and metadata it beyond belief.
07:43 🔗 godane we need to find old scans of Scholastic Arrow Book Club News Paper: https://www.flickr.com/photos/annainca/5659359459/in/album-72157626587471194/
07:43 🔗 JesseW has quit IRC (Remote host closed the connection)
07:44 🔗 godane something i loved to look at when i was in grades 1 to 5
07:44 🔗 Sanqui is now known as SanquiGON
07:44 🔗 SanquiGON is now known as SanquiAFK
07:49 🔗 SketchCow After that, I'll end up adding more stuff, but everything gets scanned and only new things are added
08:00 🔗 dashcloud has quit IRC (Read error: Operation timed out)
08:05 🔗 dashcloud has joined #archiveteam-bs
08:21 🔗 HCross Anyone else having issues with livestream.com videos?
08:24 🔗 midas what kind of issues HCross ?
08:25 🔗 midas my first issue is that i need flash.
08:26 🔗 HCross I hit play.. and nothing happens. youtube-dl also gets a content too short error
08:27 🔗 HCross Trying to watch Jason's HOPE talk
08:27 🔗 midas link?
08:27 🔗 HCross http://livestream.com/internetsociety/hopeconf/videos/130749038
08:28 🔗 midas seems broken indeed
08:28 🔗 midas firefox says it's processing - please come back later.
08:29 🔗 midas the lockpicking one works
08:30 🔗 midas so yeah, seems broken on their side
08:33 🔗 HCross SketchCow, do you have another copy of the video please?
08:35 🔗 mksplg has quit IRC (Quit: WeeChat 0.4.2)
08:37 🔗 Yoshimura has joined #archiveteam-bs
08:51 🔗 username1 iirc hope recordings get properly published at some point
08:53 🔗 Yoshimura Hoping is useless.
08:53 🔗 xmc hope is a conference
08:55 🔗 Yoshimura Oh yeah, that one.
08:55 🔗 Yoshimura Efnet does not +q? Instead of answering why is it offtopic using ban hammer. I am no longer surprised here. It is a hobby not a serious thing.
08:56 🔗 xmc ugh, stop complaining. you're disruptive and irritating.
08:57 🔗 xmc you have accomplished nothing with archiveteam, just complaining that we are doing things wrong
08:57 🔗 xmc that is why you are not welcome here
08:59 🔗 Yoshimura was kicked by xmc (out)
09:08 🔗 Yoshimura has joined #archiveteam-bs
09:08 🔗 Yoshimura Raising a concern and a question is not complaining.
09:08 🔗 dashcloud has quit IRC (Read error: Operation timed out)
09:09 🔗 xmc sets mode: +b *!4f8dff3d@ag-255-61.sta.ji.cz
09:09 🔗 Yoshimura was kicked by xmc (Yoshimura)
09:12 🔗 dashcloud has joined #archiveteam-bs
09:16 🔗 midas why do the idiots always find these channels :p
09:17 🔗 username1 i could suggest a reason but i would probably get banned as well
09:17 🔗 username1 gah, xchat...
09:17 🔗 username1 is now known as schbirid
09:17 🔗 fie has joined #archiveteam-bs
09:18 🔗 midas lol :p
09:18 🔗 xmc ah, it's you :P
09:18 🔗 xmc you have a history of not being a useless twat who gets in the way
09:21 🔗 schbirid oh i get in my own way alright!
09:22 🔗 wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
09:46 🔗 SmileyG \o/
10:07 🔗 dashcloud has quit IRC (Read error: Operation timed out)
10:10 🔗 dashcloud has joined #archiveteam-bs
10:29 🔗 godane so in theory i'm at 2008-09-10 with funny or die videos
10:42 🔗 mismatch has joined #archiveteam-bs
10:44 🔗 mismatch would it be possible to download/backup ~200,000 ISP hosting sites already in the Wayback Machine to a warc?
10:45 🔗 mismatch robots.txt keeps changing meaning they're sometimes inaccessible.
10:45 🔗 mismatch I have a txt file with all the sites listed
10:58 🔗 mismatch I guess my question should be, can archivebot download a list of websites from a txt file?
11:00 🔗 HCross Is it just myVIP that is underway at the moment, or is there anything else that needs help?
11:04 🔗 midas mismatch: yes, it can. but only in ao mode
11:05 🔗 HCross but 200k individual sites might be a bit much
11:07 🔗 mismatch midas: thanks. HCross: that's fair enough, I'll maybe try with just 50 to start with and see how it performs
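The "try 50 first" idea above could be sketched roughly as follows. This is only an illustration, assuming the txt file holds one URL per line; the helper names `clean_urls` and `first_batch` and the file name `sites.txt` are hypothetical and not part of ArchiveBot.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def clean_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield unique URLs in order, skipping blank lines and '#' comments."""
    seen = set()
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#") or url in seen:
            continue
        seen.add(url)
        yield url


def first_batch(lines: Iterable[str], size: int = 50) -> List[str]:
    """Return at most `size` cleaned URLs for a small trial run."""
    return list(islice(clean_urls(lines), size))
```

A trial list could then be produced with `with open("sites.txt") as fh: batch = first_batch(fh)` and submitted however the channel operators prefer.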
11:12 🔗 mismatch in !a/recursive mode, if a site links to an external url such as google.com - is that also downloaded?
11:15 🔗 schbirid god https://www.youtube.com/watch?v=UqVYWP4wk3I "6 Years of Hard Work Erased in 5 Clicks"
11:22 🔗 midas that's just wow...
11:24 🔗 mismatch ^ it'll be interesting to see if YouTube solve this
11:25 🔗 midas if they even will try
11:26 🔗 luckcolor HCross: i remember it worked with !a too
11:26 🔗 luckcolor mismatch
11:26 🔗 luckcolor join #archivebot
11:27 🔗 mismatch luckcolor: already there :) - thanks though, I need to experiment a bit
11:29 🔗 midas in theory it might just work with 200k urls
11:29 🔗 midas i think ao makes one per url (warc)
11:29 🔗 midas also internet is slow
11:30 🔗 midas and sketchy
11:31 🔗 luckcolor mismatch: if you need voiced or you want to schedule the job feel free to message
11:31 🔗 mismatch true, all the urls are web.archive.org/web/[date]/[site] which I'm guessing might also cause recursion issues for archivebot
11:31 🔗 mismatch luckcolor: <3
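For URLs of the shape web.archive.org/web/[date]/[site] mentioned above, a small parser can separate the capture timestamp from the original site address, which may help when deciding what a crawl should stay within. This is a sketch under assumptions: the function name `split_wayback_url` is hypothetical, and the pattern only covers the common form, with an optional two-letter modifier such as `id_` after the timestamp.

```python
import re
from typing import Optional, Tuple

# Matches http(s)://web.archive.org/web/<timestamp>[<modifier>_]/<original URL>
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/"
    r"(?P<ts>\d{4,14})(?:[a-z]{2}_)?/"
    r"(?P<url>.+)$"
)


def split_wayback_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (timestamp, original_url), or None if `url` is not a capture URL."""
    match = WAYBACK_RE.match(url.strip())
    if match is None:
        return None
    return match.group("ts"), match.group("url")
```

Running every line of the list through such a check before submitting it would at least flag entries that are not capture URLs at all.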
11:36 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
11:36 🔗 BartoCH has joined #archiveteam-bs
11:45 🔗 joepie91 has quit IRC (Read error: Operation timed out)
11:46 🔗 botpie91 has quit IRC (Read error: Operation timed out)
11:48 🔗 arkiver has quit IRC (Ping timeout: 370 seconds)
12:11 🔗 joepie91 has joined #archiveteam-bs
12:11 🔗 arkiver has joined #archiveteam-bs
12:19 🔗 schbirid has quit IRC (Quit: Leaving)
12:24 🔗 dashcloud has quit IRC (Read error: Operation timed out)
12:31 🔗 dashcloud has joined #archiveteam-bs
12:59 🔗 metalcamp has joined #archiveteam-bs
13:18 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:30 🔗 bzc6p has joined #archiveteam-bs
13:30 🔗 swebb sets mode: +o bzc6p
13:33 🔗 bzc6p sets mode: +oooo arkiver Atluxity chfoo closure
13:34 🔗 bzc6p sets mode: +oooo Coderjoe dashcloud FalconK Fletcher
13:34 🔗 bzc6p sets mode: +oooo GLaDOS godane HCross HCross2
13:34 🔗 bzc6p sets mode: +oooo joepie91 JW_work1 Kaz Kenshin
13:34 🔗 bzc6p sets mode: +oooo luckcolor midas PurpleSym Start
13:34 🔗 bzc6p sets mode: +o yipdw
13:40 🔗 bzc6p has left
13:44 🔗 fie_ has joined #archiveteam-bs
13:44 🔗 fie has quit IRC (Read error: Connection reset by peer)
13:51 🔗 dashcloud has quit IRC (Read error: Operation timed out)
13:55 🔗 dashcloud has joined #archiveteam-bs
14:11 🔗 nightpool has joined #archiveteam-bs
14:15 🔗 DoomTay has joined #archiveteam-bs
14:20 🔗 zino has quit IRC (Remote host closed the connection)
14:23 🔗 bzc6p has joined #archiveteam-bs
14:23 🔗 swebb sets mode: +o bzc6p
14:24 🔗 bzc6p DoomTay: livejournal discovery items are usually 1,299 bytes in size, so it's fine.
14:24 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
14:24 🔗 DoomTay Huh.
14:24 🔗 DoomTay And I think I can guess why items usually aren't in KB
14:25 🔗 DoomTay I mean DISPLAYED in KB
14:25 🔗 bzc6p Feel free to improve the software.
14:27 🔗 BartoCH has joined #archiveteam-bs
14:32 🔗 nightpool has quit IRC (Read error: Operation timed out)
14:41 🔗 Honno has joined #archiveteam-bs
14:53 🔗 bzc6p has left
14:54 🔗 SanquiAFK has quit IRC (Ping timeout: 260 seconds)
14:55 🔗 nightpool has joined #archiveteam-bs
15:07 🔗 MrRadar has quit IRC (Quit: Restarting)
15:08 🔗 Honno has quit IRC (Ping timeout: 1208 seconds)
15:34 🔗 Frogging lol Yoshimura was back I see
15:35 🔗 Frogging that was amusing
15:35 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
15:39 🔗 Coderjoe has joined #archiveteam-bs
15:52 🔗 Rye has quit IRC (Ping timeout: 244 seconds)
15:53 🔗 DoomTay has quit IRC (Quit: Page closed)
15:54 🔗 MrRadar has joined #archiveteam-bs
15:54 🔗 Rye has joined #archiveteam-bs
16:01 🔗 Sanqui has joined #archiveteam-bs
16:10 🔗 kristian_ has joined #archiveteam-bs
16:43 🔗 ndiddy has quit IRC (Read error: Connection reset by peer)
16:44 🔗 ndiddy has joined #archiveteam-bs
17:15 🔗 dashcloud has quit IRC (Read error: Operation timed out)
17:18 🔗 dashcloud has joined #archiveteam-bs
19:02 🔗 Coderjoe has quit IRC (Ping timeout: 260 seconds)
19:03 🔗 Coderjoe has joined #archiveteam-bs
19:04 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
19:09 🔗 BartoCH has joined #archiveteam-bs
19:18 🔗 tomwsmf has joined #archiveteam-bs
19:27 🔗 JesseW has joined #archiveteam-bs
19:28 🔗 JesseW I just noticed that, in the year (!) since I put together a pile of metadata about sourceforge projects, six people have forked the repo I put it in. None have *done* anything with the forks -- they merely made them. One only forked one other repo.
19:28 🔗 JesseW The world is strange.
19:30 🔗 kristian_ has quit IRC (Leaving)
19:41 🔗 JesseW joepie91: regarding newww, I found a copy of the code here: https://github.com/rafaeljesus/newww (albeit not particularly up to date)
19:42 🔗 JesseW exactly *one* of the issues from the repo was saved in the wayback machine: https://web.archive.org/web/20150224214155/https://github.com/npm/newww/issues/190
19:48 🔗 hook54321 godane: I was talking about the cyclenews site
19:50 🔗 BartoCH has quit IRC (Ping timeout: 260 seconds)
19:51 🔗 JesseW joepie91: https://github.com/npm/www/issues/9 -- I just asked them to make the old issues available. We'll see what the response might be.
19:52 🔗 JesseW I'm not sure whether having you support that request would be a positive or a negative. :-)
19:54 🔗 JesseW It would probably be good to grab copies of the issues from all the other npm repos now, just in case.
19:56 🔗 hook54321 Did you check Google cache for any of the old issues?
19:56 🔗 JesseW hook54321: good idea. Please do so, and dump them into archive.is if you find any
19:57 🔗 hook54321 There's a script that I think does stuff like that automatically, I still haven't gotten it to work though. Do you know the URL of the old repo?
19:57 🔗 JesseW https://github.com/npm/newww
19:58 🔗 JesseW there's an even older repo, https://github.com/npm/npm-www with over 900 issues, that it would be good to grab
20:02 🔗 BartoCH has joined #archiveteam-bs
20:07 🔗 hook54321 Have we put any of the second one through archivebot yet?
20:08 🔗 Frogging you don't put GitHub repos in ArchiveBot, by the way
20:13 🔗 JesseW has quit IRC (Ping timeout: 370 seconds)
20:13 🔗 hook54321 What if we just put in the issues page?
20:14 🔗 hook54321 And used lots of ignores
20:15 🔗 Frogging oh, issues
20:16 🔗 Frogging yeah, probably
20:17 🔗 Frogging there's also this. https://github.com/joeyh/github-backup
20:17 🔗 Frogging that's what yipdw recommended be used for github stuff
20:18 🔗 yipdw it produces more usable results, yes
20:18 🔗 yipdw a WARC copy of a github repo is IMO pointless
20:18 🔗 Frogging maybe for issues only it would be less useless?
20:18 🔗 yipdw it's pointless
20:18 🔗 yipdw if you want the issues github-backup gets it too
20:18 🔗 Frogging kk
20:20 🔗 yipdw a derivation of the issue data into static HTML probably has merit but the github interface has so many links that an automated crawl is a mess
20:21 🔗 yipdw !ao tends to be okay if you're trying to be a jerk
20:25 🔗 hook54321 My main concern is that there should be some sort of central place where people can browse and upload backups of repositories.
20:25 🔗 yipdw people usually do that on github yes
20:26 🔗 yipdw google code for example
20:26 🔗 yipdw GitLab has a Github import function which works reasonably well, also
20:27 🔗 yipdw I do wonder however someone will fund this central place
20:28 🔗 hook54321 I mean we could upload backups of github repos to the Internet archive, but it would still be useless until the user downloaded the whole archive, which could potentially be huge.
20:29 🔗 yipdw yep
20:29 🔗 yipdw or you do a shallow clone
20:31 🔗 yipdw if you're interested in saving copies of git repositories and their associated data, I think self-hosted gitlab is a good choice
20:31 🔗 yipdw I run an instance that does more or less that
20:31 🔗 yipdw then I have my backups of dependent libraries and I only need to worry about keeping my instance healthy
20:34 🔗 dashcloud for github backups, codearchive.org is doing a backup of every repo with 10 stars or more (and less than 250 MB in size, unless whitelisted) and every change
20:34 🔗 yipdw oh, actually, correction
20:35 🔗 yipdw github->gitlab import requires your github OAuth tokens and so you can only do that backup for repositories you're authorized on
20:35 🔗 yipdw so, yes, if you're a project member it's a good choice :P
20:35 🔗 dashcloud here's the project site: https://the-code-archive.launchrock.com/
20:36 🔗 yipdw wow
20:36 🔗 yipdw nice
20:37 🔗 yipdw oh cool, Filippo Valsorda is involved in that
20:37 🔗 yipdw (he also wrote the Warrior Dockerfile)
20:38 🔗 hook54321 Interesting. About what percentage of repos are 10 stars or more and less than 250 MB?
20:39 🔗 yipdw https://github.com/search?utf8=%E2%9C%93&q=stars%3A%22%3E+10%22+size%3A%22%3C%3D+262144000%22&type=Repositories&ref=advsearch&l=&l=
20:40 🔗 yipdw there's 3,466,477 with at least one star
20:41 🔗 yipdw oh, oops, that's supposed to be size in kB
20:41 🔗 yipdw https://github.com/search?utf8=%E2%9C%93&q=stars%3A%22%3E+10%22+size%3A%22%3C%3D+262144%22&type=Repositories&ref=searchresults is fixed
20:42 🔗 yipdw and it should be >= but no matter what you're looking at 10-11%
20:42 🔗 robink has quit IRC (Ping timeout: 501 seconds)
20:43 🔗 yipdw which rounds pretty nicely with Sturgeon's Law
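The unit slip above (the search `size:` qualifier is in kilobytes, not bytes) is easy to avoid by computing the limit in code. The sketch below builds a search URL for "10 stars or more and less than 250 MB". The function names are hypothetical, and treating 1 MB as 1024 kB is an assumption, so the factor is left as a parameter.

```python
from urllib.parse import urlencode


def size_limit_kb(megabytes: int, kb_per_mb: int = 1024) -> int:
    """Convert a size limit in MB to the kB value the size: qualifier expects."""
    return megabytes * kb_per_mb


def build_search_url(min_stars: int, max_mb: int) -> str:
    """Build a GitHub repository search URL with star and size limits."""
    query = f"stars:>={min_stars} size:<{size_limit_kb(max_mb)}"
    params = urlencode({"q": query, "type": "Repositories"})
    return "https://github.com/search?" + params
```

Calling `build_search_url(10, 250)` gives a query of `stars:>=10 size:<256000`, URL-encoded for the search page.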
20:49 🔗 hook54321 Not to mention, there are probably tons of repositories that are just empty.
20:49 🔗 dashcloud in case you're wondering, the repo itself is backed up because it got 10 stars by the end of the talk (which is here: http://livestream.com/internetsociety2/hopeconf/videos/130613964)
20:50 🔗 hook54321 Which repo?
21:02 🔗 Honno has joined #archiveteam-bs
21:14 🔗 robink has joined #archiveteam-bs
21:18 🔗 Honno has quit IRC (Read error: Operation timed out)
21:34 🔗 DoomTay has joined #archiveteam-bs
21:52 🔗 whydomain has joined #archiveteam-bs
21:59 🔗 whydomain Can anyone recommend a (preferably python) browser emulator? Basically all I want to do is navigate to a web page with a cookie preloaded, and then click on two div elements.
22:13 🔗 MrRadar whydomain: PhantomJS
22:13 🔗 MrRadar It's basically a scriptable headless version of Chromium
22:15 🔗 DoomTay_ has joined #archiveteam-bs
22:15 🔗 whydomain Are there any examples of usage for loading in pages an manipulating elements? I can't find any.
22:16 🔗 whydomain ^and manipulating
22:16 🔗 DoomTay has quit IRC (Ping timeout: 268 seconds)
22:16 🔗 MrRadar This is apparently an example for injecting JS into a page: https://github.com/ariya/phantomjs/blob/master/examples/injectme.js
22:18 🔗 MrRadar It looks like this talks about manipulating the DOM: http://phantomjs.org/page-automation.html
22:19 🔗 MrRadar I've never used PhantomJS myself so I'm not too knowledgeable about it
22:22 🔗 whydomain Ah, thanks. That was just the example I was looking for.
22:59 🔗 JesseW has joined #archiveteam-bs
23:10 🔗 ItsYoda has quit IRC (Ping timeout: 260 seconds)
23:12 🔗 hook54321 is archivebot kept more up to date than grab-site?
23:13 🔗 JesseW ha, I doubt it.
23:13 🔗 JesseW hook54321: what do you mean by "up to date"?
23:14 🔗 hook54321 Developed more I guess?
23:15 🔗 JesseW archivebot isn't developed at all currently. Effort is explicitly directed to grab-site (or the wrapper around it, I can't remember) instead.
23:15 🔗 JesseW (I may be mistaken about this -- if so, yipdw should be able to correct it)
23:19 🔗 ItsYoda has joined #archiveteam-bs
23:43 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
23:45 🔗 DoomTay_ is now known as DoomTay
23:50 🔗 Coderjoe has joined #archiveteam-bs
23:54 🔗 hook54321 Think grab-site would run on a computer with an Atom N270 processor?
23:56 🔗 MrRadar I don't see why it wouldn't
23:56 🔗 MrRadar I've run wpull (the core component of grab-site/ArchiveBot that actually does the scraping) on a 1st gen Raspberry Pi
23:56 🔗 MrRadar Which is almost certainly slower
23:56 🔗 wp494 has joined #archiveteam-bs
23:57 🔗 JesseW ah, right -- it's wpull that's the kernel, and grab-site/Archivebot that are the wrappers
