[00:04] *** Ymgve has quit IRC ()
[00:18] I'll get cow.net back up
[00:40] Sanqui do you mean like parsing html?
[01:01] lynx -dump internic.net|grep "[0-9]\. .t"|awk '{print $2}'
[01:20] *** BiggieJon has quit IRC (Read error: Operation timed out)
[01:27] *** BiggieJon has joined #archiveteam
[01:44] *** Start has joined #archiveteam
[02:51] *** signius has quit IRC (Read error: Operation timed out)
[02:54] *** kyan has joined #archiveteam
[02:56] *** primus104 has quit IRC (Leaving.)
[03:00] *** BiggieJon has quit IRC (Read error: Operation timed out)
[03:04] *** mistym has quit IRC (Remote host closed the connection)
[03:04] *** signius has joined #archiveteam
[03:05] *** mistym has joined #archiveteam
[03:05] *** mistym has quit IRC (Remote host closed the connection)
[03:09] *** mistym has joined #archiveteam
[03:12] *** mistym has quit IRC (Remote host closed the connection)
[03:14] *** BiggieJon has joined #archiveteam
[03:42] *** BlueMaxim has joined #archiveteam
[03:43] *** aschmitz has quit IRC (Quit: Leaving)
[03:44] *** mistym has joined #archiveteam
[04:01] Google is about to take down, in one month, all Blogger.com blogs that are "secually exlicit"
[04:01] Sorry, keyboard
[04:01] They are just sending letters.
[04:02] I think we should just download Blogger.
[04:07] i'll add it to the current projects list
[04:07] any ideas for an irc channel?
[04:07] #frogger
[04:08] #prairiedogger
[04:17] i've put #frogger as the channel on the wiki
[04:26] Well, they're removing nudity that lacks educational, documentary or academic standing
[04:27] So something good
[04:29] pornogger
[04:30] hornblogger
[04:30] pornblogger
[04:30] horndogger
[04:38] *** C-apple-a is now known as C-apple
[04:38] Google, but for prudes
[04:38] OH WAIT
[04:39] fiftyshadesofblogger - without the "iftyshadesofb" in the middle
[04:43] " We’ll still allow nudity if the content offers a substantial public benefit, for example in artistic, educational, documentary, or scientific contexts."
[04:43] sexual arousal is not a substantial public benefit evidently
[04:44] Google Pills, to prevent Stirrings
[04:44] well thats depressing
[04:44] oh well theres still tumblr
[04:45] owned by yahoo, so not really
[04:45] I thought tumblr already took steps to clean up the more prurient ones
[04:50] oh they hide them from the search feature
[04:52] *** BlueMaxim has quit IRC (Ping timeout: 370 seconds)
[04:53] *** BlueMaxim has joined #archiveteam
[05:01] i wonder if we should archive google sites sometime soon
[05:05] google hasn't been giving much attention to it recently and i have the feeling that it might be one of their next products on the chopping block
[05:06] *** Nertsy has quit IRC (Quit: Nertsy)
[05:09] what happened to picasa web albums?
[05:09] *** mistym has quit IRC (Remote host closed the connection)
[05:09] https://picasaweb.google.com redirects to google plus
[05:10] with a link to https://picasaweb.google.com/lh/myphotos?noredirect=1
[05:11] looks like it's being merged into google plus
[05:13] looks like content on picasaweb.google.com won't be there for much longer
[05:13] warrior project?
[05:14] I think Google Sites might survive simply because it's one of the services advertised as being available in Google Apps for Work (which is really just Google stuff with your own domain name instead of @gmail.com, plus some enterprise control).
[05:15] *** aaaaaaaaa has quit IRC (Leaving)
[05:16] Just like Google Talk is supposedly dead, but I'm still using it (both on Google Talk and on XMPP)-- they already marketed it as the chat integration for (paid) Google Apps for Work, so they'd annoy actual paying customers if they yanked XMPP.
[05:18] I think the 18+ bloggers age-gate so you probably need to grab with cookies
[05:24] i've added picasa to the current projects list
[05:30] Picasa is online and seems fine
[05:30] it doesn't redirect to Google+
[05:30] it requires that you're logged in
[05:31] If you aren't logged in, it redirects to Google Plus?
[05:31] if you aren't logged in, you get a login prompt
[05:31] I logged in and it was there
[05:31] I don't have a Google+ account
[05:33] Ah, OK, I remember this...
[05:33] they started redirecting google plus users away from it
[05:34] OK, I'm logged in, and I go to https://picasaweb.google.com and it sends me to my personal Google+ photo album, and across the top is a "Click here to go back to Picasa Web Albums." message.
[05:34] *** mistym has joined #archiveteam
[05:35] For what it's worth, at least one my photos showing there pre-dates Google+ I believe; it was a PicasaWeb test, not a Google+ upload, and was uploaded directly. not from the Picasa software.
[05:45] ArchiveTeam isn't too concerned with site that the Wayback Machine already managed to crawl, right? (I'm thinking of AllGame, for example.)
[05:45] *** ben_ has quit IRC (Read error: Connection reset by peer)
[05:45] *** ex-parrot has joined #archiveteam
[05:46] C-apple: we usually crawl for ourselves if it's changed much, or we can find basically anything that wbm systematically didn't get
[05:50] C-apple: that said, re: allgame, we did get a copy -> http://archive.fart.website/archivebot/viewer/job/aeefs
[05:51] it should be part of wayback
[05:52] in practice, wbm always misses things
[05:53] xmc: Because it only pays attention to stuff that's linked from something it already knew about, normally?
[05:53] also it doesn't crawl every link it finds
[05:54] Ew. Any theory on why?
[05:55] think about it for a second
[05:56] Theory 1: So much stuff that it queues stuff up and never gets to it by the time it's dead.
[05:57] Theory 2: Lame-o JavaScript hrefs. (But that's usually a site-wide disease, not specific pages.)
[05:58] #1 is it
[06:03] *** bsmith093 has joined #archiveteam
[06:04] Does archive.org have any history of accepting and importing WARC files into the Wayback Machine, or do they only accept what they crawled themselves?
[06:05] Ah-- I'm reading more-- so WARC is the IA's format?
[06:05] Archive Team web crawls have made it into wayback, as have archivebot inputs
[06:05] WARC's an ISO standard based in part on IA's ARC, there's a few tools out there that read it
[06:05] read and write it
[06:05] OK, cool.
[06:06] well #1 except it's not the site that goes dead, they have to eventually pull the plug on the crawl so they can go get some new content
[06:07] brewster explained in one of his interviews that the web is infinite, for example there's a website where you can play chess move by move so there's a url for every possible chessboard
[06:08] so they have to have some reasonable limits on how deep they go on any one site so it doesn't get completely bogged down
[06:08] DFJustin: Does IA add links to the "back" of the queue, behind any other web address that has already been found in a link? And as for those infinite/recursive links: You're saying that IA just crawls so much before it figures that site has gone down the rabbit hole and de-prioritizes it?
[06:09] I don't work there so I don't have the gospel on how exactly it works
[06:10] but I'm sure you have to make some tough decisions if you wanna try and get the entire web
[06:10] Yeah, I can imagine.
[06:10] also a lot of the content in the wayback machine was actually crawled by third parties and later donated to the archive, so there's no one consistent methodology
[06:11] I would hope the general algorithm would figure that sequential URLs were not dead-ends.
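The "reasonable limits on how deep they go on any one site" idea from 06:08 can be illustrated with a small frontier that caps both link depth and pages fetched per host. This is only a sketch under stated assumptions: `bounded_crawl`, `max_depth`, `per_host_budget` and the `chess.example` host are hypothetical names made up for illustration, and nothing here describes how the Internet Archive's crawlers actually work (the speakers in the log say they do not know the real algorithm).

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def bounded_crawl(seeds, get_links, max_depth=3, per_host_budget=100):
    """Return the URLs visited by a breadth-first crawl that stops expanding
    a site once it is too deep or has used up its per-host page budget.

    get_links is a stand-in for "fetch the page and extract its links".
    """
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    fetched_per_host = defaultdict(int)
    visited = []
    while queue:
        url, depth = queue.popleft()
        host = urlsplit(url).netloc
        if fetched_per_host[host] >= per_host_budget:
            continue  # this host has had its share; move on to other sites
        fetched_per_host[host] += 1
        visited.append(url)
        if depth >= max_depth:
            continue  # fetch the page but do not follow its links any deeper
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def endless_links(url):
    """Hypothetical 'infinite' site: every page links to two new pages."""
    return [url + "a/", url + "b/"]


if __name__ == "__main__":
    pages = bounded_crawl(["http://chess.example/"], endless_links,
                          max_depth=50, per_host_budget=10)
    print(len(pages), "pages fetched before the budget ran out")
```

Without either limit the loop above would never finish on `endless_links`; with them, the crawl gives up on the bottomless site and is free to spend its time elsewhere.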
[06:11] *** ikreymer has joined #archiveteam
[06:11] C-apple: they can be
[06:12] improperly configured PHP photo galleries will accept negative indices and just render something with a 200 OK
[06:12] they'll also render links to those indices
[06:12] so you'll end up crawling billions in both directions without some sort of limit or fuzzy matcher
[06:12] Yuck.
[06:12] yeah there are a lot of sites with &page=4 etc. where it will keep linking to &page=97174981 way after you've actually run out of results
[06:13] Ah, yes, that.
[06:13] I mention that because we're hitting that problem in archivebot right now
[06:13] also, calendars
[06:13] fuck them
[06:13] They'll give actual "useful" content, or just show a 0-result kind of gallery?
[06:13] it'll either show nothing or the last page over and over
[06:13] hi archiveteam, this may be useful to folks here, i've made a point-and-click tool for opening WARC files locally on your computer: https://github.com/ikreymer/webarchiveplayer
[06:14] sweet
[06:14] Oh, yeah, calendars are a pain in the ass even for manual calendar users-- sometimes there's no damned way to figure out if you scheduled something for the year 2024 or what.
[06:14] ikreymer: oh cool
[06:14] there are downloadable windows and os x versions, you can just run and then select one or more WARC files, and it'll open the browser for you allowing you to browse them
[06:15] this is pretty neat, now I don't need a full pywb instance to verify things :)
[06:15] feel free to add it to http://archiveteam.org/index.php?title=The_WARC_Ecosystem
[06:15] yes, it just pywb wrapped into an executable, hopefully a bit simpler to use.
[06:15] "how do I view this crap" is a pretty common question we get from people
[06:15] ikreymer: Oo! Yummy!
[06:17] *** mistym_ has joined #archiveteam
[06:18] *** ionpulse has joined #archiveteam
[06:19] DFJustin: great, i will list it on the wiki. feel free to open issues on github. definitely needs a bit more testing, especially the windows build. i tested it mostly on windows 7.
[06:19] *** mistym has quit IRC (Read error: Operation timed out)
[06:27] *** ikreymer has quit IRC (Quit: http://chat.efnet.org )
[06:27] *** ikreymer has joined #archiveteam
[06:30] *** antomatic has quit IRC (Read error: Operation timed out)
[06:30] *** antomatic has joined #archiveteam
[06:33] OK, here's a more complicated one: Let's say I'm a hosting admin and there are some sites that are no longer publicly accessible under their old URLs, but I want to provide an archive. Any tools to do that, either as a convert-file-directory-to-WARC type of thing, or a script that rewrites the base URL and archive dates if I crawl it under a temporary revival URL?
[06:35] *** mistym_ has quit IRC (Remote host closed the connection)
[06:36] *** mistym has joined #archiveteam
[06:37] *** sep332 has quit IRC (Read error: Operation timed out)
[06:42] (Asking that after midnight on a Tuesday night probably isn't the best way to get a response to that...)
[06:43] *** ikreymer has quit IRC (Quit: http://chat.efnet.org )
[06:43] *** ikreymer has joined #archiveteam
[06:45] You could crawl it under the temporary URL and leave it alone... faking the dates/URLs on them seems like a highly dubious prospect to me?
[06:46] then just have the WARC be available for browsing with reference to the temporary URLs
[06:46] (I don't know what I'm talking about though, so don't trust me :P)
[06:46] C-apple ^
[06:46] But then people looking for archives of the original site wouldn't see that as a snapshot.
[06:47] true true
[06:47] IDK :( maybe someone else knows
[06:48] the idea of forging the WARCs really rubs me the wrong way hard though
[06:48] I was thinking, crawl it/whatever privately, then date it to the last time in the logs that the public could reach it-- so if there is a later site on that URL it doesn't look in a timeline like the site throttled between content.
[06:50] Well, yes, I think diddling the WARC metadata is touchy-- but worse than having data with a far-too-late date and a pointless URL that the site never used while live?
[06:52] *** GLaDOS has quit IRC (Ping timeout: 246 seconds)
[07:08] https://support.google.com/blogger/answer/6170671?p=policy_update&rd=1? -- Blogger is closing nude blogs. or atleast making them less easy to access.
[07:20] *** ikreymer has quit IRC ()
[07:20] *** ikreymer has joined #archiveteam
[07:37] *** nertzy has quit IRC (Read error: Connection reset by peer)
[07:39] *** nertzy has joined #archiveteam
[07:43] *** signius has quit IRC (ircd.choopa.net irc.teksavvy.ca)
[07:43] *** Baljem has quit IRC (ircd.choopa.net irc.teksavvy.ca)
[07:43] *** Fusl has quit IRC (ircd.choopa.net irc.teksavvy.ca)
[07:43] *** Kazzy has quit IRC (ircd.choopa.net irc.teksavvy.ca)
[07:43] *** MMovie has quit IRC (ircd.choopa.net irc.teksavvy.ca)
[07:43] *** closure has quit IRC (ircd.choopa.net irc.teksavvy.ca)
[07:43] *** Baljem_ has joined #archiveteam
[07:44] *** sep332 has joined #archiveteam
[07:44] *** Kazzy_ has joined #archiveteam
[07:50] *** mutoso has quit IRC (Read error: Operation timed out)
[07:50] *** Fusl_ has joined #archiveteam
[07:55] *** signius has joined #archiveteam
[07:56] *** mutoso has joined #archiveteam
[07:56] *** ikreymer has quit IRC (Remote host closed the connection)
[07:58] *** Fusl_ is now known as Fusl
[07:59] C-apple: crawl it and provide a separate URL mapping
[07:59] *** Kazzy_ is now known as Kazzy
[07:59] don't touch the WARC records
[07:59] it is easy (easier) to write software that understands the mapping and transforms input WARC records online
[07:59] altering the WARC is irreversible and undetectable
[07:59] there's probably already some standard for a mapping like that, I can't name one offhand unfortunately
[08:00] Does AT or Wayback accept WARCs that need to be converted that way?
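The 07:59 suggestion (keep the crawled records untouched and apply a separate URL mapping when reading them) might look roughly like the sketch below. Everything in it is an assumption for illustration: `StoredRecord`, `Presented`, `present`, the example hostnames and the example date are hypothetical, the records are plain Python objects rather than real WARC records, and no existing mapping standard or WARC library is being described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredRecord:
    """Stand-in for one crawled record, exactly as it was captured."""
    target_uri: str
    crawl_date: str
    payload: bytes


@dataclass(frozen=True)
class Presented:
    """What a reader is shown: a display URL plus the untouched record."""
    display_uri: str
    record: StoredRecord


# Sidecar mapping kept next to the archive, not inside it (hypothetical hosts).
URL_MAP = {
    "http://revival.example.net/oldsite/": "http://www.oldsite.example/",
}


def present(record, url_map):
    """Translate the crawl-time URL to the original URL at read time.

    The stored record is never modified, so the mapping can be corrected
    or dropped later and the real crawl URL and date stay on file.
    """
    for crawled_prefix, original_prefix in url_map.items():
        if record.target_uri.startswith(crawled_prefix):
            suffix = record.target_uri[len(crawled_prefix):]
            return Presented(original_prefix + suffix, record)
    return Presented(record.target_uri, record)


if __name__ == "__main__":
    stored = StoredRecord(
        "http://revival.example.net/oldsite/about.html",
        "2015-02-25T06:33:00Z",  # made-up date, for illustration only
        b"<html>about</html>",
    )
    shown = present(stored, URL_MAP)
    print(shown.display_uri, "(crawled as", shown.record.target_uri + ")")
```

Because the translation happens at read time, the objection raised in the log ("altering the WARC is irreversible and undetectable") does not apply: anyone can compare the display URL with the stored one.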
[08:00] I mean, rather than just mapping them to their (private/useless) crawl parameters?
[08:00] I can't think of a situation in which it has happened before
[08:01] that said easiest way to go forward is to have the data on hand
[08:01] *** signius has quit IRC (ircd.choopa.net irc.teksavvy.ca)
[08:05] Another silly question: Is there any reason we use .gz for WARC instead of .bz2 or .xz, since it's stuff that's going to be compressed once, stored, and retrieved a lot?
[08:07] .warc.gz is a concatenation of gzipped WARC records
[08:07] it typically achieves 2:1 compression ratio
[08:07] C-apple: this is getting lengthy, it should be in #archiveteam-bs
[08:08] Ah.
[08:09] *** MMovie has joined #archiveteam
[08:09] *** mistym has quit IRC (Remote host closed the connection)
[08:13] *** signius has joined #archiveteam
[08:16] *** primus104 has joined #archiveteam
[08:27] *** MMovie has quit IRC (Ping timeout: 306 seconds)
[08:28] *** MMovie has joined #archiveteam
[08:36] So, Blogger...
[08:37] Have we got a channel yet?
[08:37] #flogger perhaps ;)
[08:37] damn Google
[08:38] And where is Vint Cerf now?
[08:39] "Oh but they're not deleting anything, they're just making sure that nobody can access it, so that's alright, right?"
[08:39] Grr
[08:40] Vint Cerf is off writing joke RFCs about how the Internet is for everyone
[08:41] While backing up his secret blog full of pictures of ladies' ankles.
[08:41] (Which was the style at the time)
[08:42] not sure about that but https://tools.ietf.org/html/rfc3271 does exist
[08:42] *** godane has quit IRC (Read error: Operation timed out)
[08:43] And what will the martians think of our human race if there are no sexy blogs for them to assess?
[08:43] "No, it's alright, because they will still be visible if you explicitly choose to share the blog with the martians."
[08:44] "Not deleting anything."
[08:44] "Just making it so that nobody can see. Different, of course."
[08:44] it's ok, we have tumblr and ic.cz
[08:44] well sort of the latter
[08:44] anyway
[08:44] So.. Blogger..
[08:45] Brute-force username discovery? Crawls, dictionaries, blog-to-blog links, etc..?
[08:47] Could 'social' it up, maybe - e.g. 'Add +ArchiveTeam to your GoogleConnectFriendsPlusWhatever circle and we'll archive your blog" etc
[08:47] * antomatic thinks
[08:57] *** godane has joined #archiveteam
[09:02] *** zenguy_pc has joined #archiveteam
[09:04] *** primus104 has quit IRC (Leaving.)
[09:06] *** boozehoun has quit IRC (Ping timeout: 512 seconds)
[09:29] *** rejon has quit IRC (Ping timeout: 512 seconds)
[09:30] * ex-parrot has recovered from the excitement of Hyves and is ready to back up some blogs
[09:33] Could be a nice chunky project.
[09:38] *** rejon has joined #archiveteam
[10:00] *** schbirid has joined #archiveteam
[10:03] *** MMovie has quit IRC (Read error: Operation timed out)
[10:03] *** MMovie has joined #archiveteam
[11:18] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[11:25] *** Sk1d has joined #archiveteam
[11:30] *** Sk2d has joined #archiveteam
[11:32] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:32] *** Sk2d is now known as Sk1d
[11:33] Reposurgeon might have been overkill a suggestion for the Toolserver SVN I asked about earlier.
[11:34] For now I'm just merging all repos into one for ease (certainly not cleanliness) per https://stackoverflow.com/a/267307/4145951 ; then I'll just look for some place where to dump the merged repo for the sake of downloadability and history
[11:35] Just sourceforge shell maybe https://sourceforge.net/p/forge/community-docs/svn%20import/
[11:37] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[11:39] *** Ymgve has joined #archiveteam
[11:41] *** Sk1d has joined #archiveteam
[11:47] *** Sk2d has joined #archiveteam
[11:48] *** Sk1d has quit IRC (Read error: Operation timed out)
[11:53] *** Sk2d has quit IRC (Ping timeout: 265 seconds)
[11:54] *** Sk1d has joined #archiveteam
[11:58] *** dashcloud has quit IRC (Ping timeout: 246 seconds)
[12:01] *** dashcloud has joined #archiveteam
[12:01] *** Sk2d has joined #archiveteam
[12:04] *** Sk1d has quit IRC (Read error: Operation timed out)
[12:04] *** Sk2d is now known as Sk1d
[12:07] *** primus104 has joined #archiveteam
[12:09] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[12:09] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:11] *** Sk1d has joined #archiveteam
[12:12] *** dashcloud has joined #archiveteam
[12:17] *** Sk2d has joined #archiveteam
[12:18] sad day https://www.facebook.com/groups/105586892805903/permalink/897231230308128/
[12:18] *** Sk1d has quit IRC (Read error: Operation timed out)
[12:18] *** Sk2d is now known as Sk1d
[12:37] (for those who don't use facebook: http://www.gamasutra.com/view/news/237129/Obituary_Atari_pioneer_Steve_Bristow.php)
[12:42] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[12:43] thanks garyrh :)
[12:45] *** Sk1d has joined #archiveteam
[12:52] *** uwe has quit IRC (Ping timeout: 240 seconds)
[12:53] *** Sk1d has quit IRC (Read error: Operation timed out)
[12:54] *** Sk1d has joined #archiveteam
[13:01] *** Sk1d has quit IRC (Read error: Operation timed out)
[13:02] *** Sk1d has joined #archiveteam
[13:06] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:10] *** dashcloud has joined #archiveteam
[13:17] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[13:20] *** Sk1d has joined #archiveteam
[13:25] *** closure has joined #archiveteam
[13:26] *** Sk2d has joined #archiveteam
[13:30] *** Sk1d has quit IRC (Read error: Operation timed out)
[13:30] *** Sk2d is now known as Sk1d
[13:31] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:36] *** dashcloud has joined #archiveteam
[13:47] *** sankin has joined #archiveteam
[13:49] *** rejon has quit IRC (Remote host closed the connection)
[13:51] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[13:53] *** primus104 has quit IRC (Leaving.)
[13:55] *** Sk1d has joined #archiveteam
[14:00] *** Sk2d has joined #archiveteam
[14:02] *** Sk1d has quit IRC (Read error: Operation timed out)
[14:06] *** Sk2d has quit IRC (Ping timeout: 265 seconds)
[14:07] *** Sk1d has joined #archiveteam
[14:09] *** russss has joined #archiveteam
[14:10] *** Sk1d has quit IRC (Read error: Operation timed out)
[14:14] *** Sk1d has joined #archiveteam
[14:19] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[14:21] *** sankin has quit IRC (Leaving.)
[14:21] *** Sk1d has joined #archiveteam
[14:21] *** rolfb has joined #archiveteam
[14:24] *** Sk1d has quit IRC (Read error: Operation timed out)
[14:25] Hi. I am considering uploading a huge archive of data to the Internet Archive for preservation. Is there anyone that would be able to walk us through how to best go about it?
[14:26] how big?
[14:28] midas: around 4.5 terabytes
[14:29] http://qz.com/349569/google-will-ban-adult-content-on-its-blogging-platform/
[14:29] is it a website rolfb or software?
[14:29] *** Sk1d has joined #archiveteam
[14:29] :o
[14:30] depending if you can split the data into multiple files because 1 4.5TB file is going to be horrible to upload
[14:32] More precisely, impossible
[14:32] yeah that too
[14:32] :p
[14:33] midas: it is not so much about the website, but the content which it serves
[14:33] but i can not go into further details before next week, so perhaps I should return then :)
[14:34] rolfb: usually we archive websites in chunks of ~40 GB WARC files
[14:34] ok, anyway rolfb, dont save it into 1 file, 4.5TB will not upload to IA
[14:34] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[14:34] midas: torrent is an option?
[14:35] yes
[14:35] torrent with an archive then?
[14:35] size limits?
[14:35] could try yeah
[14:35] same limits
[14:35] Nemo_bis: meaning chunks of 40gb WARC files?
[14:35] yes, one per torrent
[14:36] i'd go that course yeah
[14:36] If you have a mass of data you can't process yourself, you can ask here and someone can give you an rsync target
[14:36] But of course it's better if you clean your data yourself ;)
[14:37] Nemo_bis: thanks, i think i will return next week with some more intricate details of the data set
[14:37] but it is nice to know that it is possible
[14:38] *** Sk1d has joined #archiveteam
[14:38] *** Ymgve__ has joined #archiveteam
[14:39] http://archiveteam.org/index.php?title=Blogger looks like the XML trick is still the best we have, some work needed here
[14:40] Is there a reasonably big hosted blogs platform which handles conversion from Blogger?
[14:40] wordpress maybe?
[14:41] WordPress comes with a built-in importer tool for Blogger. It is good enough to import posts & comments which are a major part of your blog.
[14:42] *** Ymgve has quit IRC (Ping timeout: 506 seconds)
[14:43] Yes, but WordPress is unlikely to engage in a campaign "No worries, we'll absorb all adult blogspost subdomains"
[14:43] The folks I know at Automattic are rather prudish
[14:44] lol
[14:44] maybe we can use the tools they use to get all the data
[14:44] True. Not *all* of automattic code is free software/open source, but the importer might be
[14:46] Hm, is there a way to make wget -p work with XML files too
[14:46] *** Sk1d has quit IRC (Read error: Operation timed out)
[14:46] rolfb: if it's data it needs a little different treatment
[14:47] 4.5tb is a bit much, should be doable, but I'd probably suggest talking to someone from IA about it
[14:47] (like SketchCow)
[14:47] balrog: how so?
[14:47] well, it's not warcs so it will have to be uploaded as items tagged with metadata
[14:48] Nemo_bis: this thing: https://wordpress.org/plugins/blogger-importer/ ?
[14:48] which timezone is SketchCow in?
[14:49] Eastern Time, but he usually shows up a little bit later in the day
[14:49] *** Sk1d has joined #archiveteam
[14:49] *** dashcloud has quit IRC (Ping timeout: 240 seconds)
[14:50] balrog: is it possible to reach out by email?
[14:51] you can try jason@textfiles.com but he's rather busy
[14:51] if it's mostly content why not ship drives like ye olden days?
[14:52] sneakernet!
[14:53] thechip: is that an option? :-)
[14:54] *** Sk1d has quit IRC (Ping timeout: 265 seconds)
[14:54] I would think so, if it were me I'd rather go through the trouble of dealing with HDDs than pay for bandwidth
[14:55] but ask IA they might be up for it
[14:55] *** dashcloud has joined #archiveteam
[14:56] *** BiggieJo1 has joined #archiveteam
[14:56] lots of good ideas, i will return next week to discuss more :) thanks all
[14:57] *** rolfb has left
[14:58] *** Sk1d has joined #archiveteam
[15:01] *** BiggieJon has quit IRC (Ping timeout: 600 seconds)
[15:06] balrog: GPLv2 or later, sounds good
[15:08] *** Emcy_ has quit IRC (Ping timeout: 512 seconds)
[15:21] *** sankin has joined #archiveteam
[15:24] *** Start has quit IRC (Disconnected.)
[15:32] *** mistym has joined #archiveteam
[15:35] *** primus104 has joined #archiveteam
[15:36] *** BiggieJon has joined #archiveteam
[15:37] *** mistym has quit IRC (Read error: Operation timed out)
[15:37] *** mistym has joined #archiveteam
[15:39] *** BiggieJo1 has quit IRC (Ping timeout: 600 seconds)
[15:40] *** mistym has quit IRC (Remote host closed the connection)
[15:46] *** dashcloud has quit IRC (Read error: Operation timed out)
[15:49] *** dashcloud has joined #archiveteam
[16:01] *** mistym has joined #archiveteam
[16:02] *** Start has joined #archiveteam
[16:15] *** Smiley has joined #archiveteam
[16:18] *** tephra has joined #archiveteam
[16:18] *** Peetz0r_ has joined #archiveteam
[16:18] *** nertzy2 has joined #archiveteam
[16:19] *** lukeman_ has joined #archiveteam
[16:20] *** primus104 has quit IRC (hub.se irc.efnet.pl)
[16:20] *** schbirid has quit IRC (hub.se irc.efnet.pl)
[16:20] *** nertzy has quit IRC (hub.se irc.efnet.pl)
[16:20] *** miljo has quit IRC (hub.se irc.efnet.pl)
[16:20] *** S[h]O[r]T has quit IRC (hub.se irc.efnet.pl)
[16:20] *** edsu_ has quit IRC (hub.se irc.efnet.pl)
[16:20] *** primus has quit IRC (hub.se irc.efnet.pl)
[16:20] *** Coderjoe has quit IRC (hub.se irc.efnet.pl)
[16:20] *** tephra_ has quit IRC (hub.se irc.efnet.pl)
[16:20] *** SmileyG has quit IRC (hub.se irc.efnet.pl)
[16:20] *** Peetz0r has quit IRC (hub.se irc.efnet.pl)
[16:20] *** lukeman has quit IRC (hub.se irc.efnet.pl)
[16:20] *** altlabel has quit IRC (hub.se irc.efnet.pl)
[16:23] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[16:24] *** BlueMaxim has joined #archiveteam
[16:25] *** primus_ has joined #archiveteam
[16:26] *** altlabel_ has joined #archiveteam
[16:26] *** aaaaaaaaa has joined #archiveteam
[16:29] *** dashcloud has quit IRC (Ping timeout: 306 seconds)
[16:29] *** edsu has joined #archiveteam
[16:30] *** dashcloud has joined #archiveteam
[16:35] *** ikreymer has joined #archiveteam
[16:35] Technically, Sketchcow is in PST this week.
[16:35] *** S[h]O[r]T has joined #archiveteam
[16:35] What 4.5tb of material is rolfb offering
[16:36] SketchCow: I don't know
[16:36] he said he can't tell us until next week
[16:36] probably some website is shutting down
[16:36] and they can't announce now for legal reasons
[16:36] Pbbbbbbbbbb
[16:38] *** schbirid has joined #archiveteam
[16:40] *** closure has quit IRC (Ping timeout: 306 seconds)
[16:41] *** closure has joined #archiveteam
[16:51] *** mistym has quit IRC (Remote host closed the connection)
[16:52] *** Start has quit IRC (Disconnected.)
[16:55] *** Coderjoe has joined #archiveteam
[16:55] *** Jonimus has quit IRC (Write error: Broken pipe)
[16:56] *** Laverne has quit IRC (Read error: Operation timed out)
[16:57] *** atlogbot has quit IRC (Ping timeout: 369 seconds)
[16:58] *** primus104 has joined #archiveteam
[17:00] *** miljo has joined #archiveteam
[17:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:01] *** Jonimus has joined #archiveteam
[17:06] *** lemonkey has quit IRC (Read error: Operation timed out)
[17:07] *** mistym has joined #archiveteam
[17:08] *** dashcloud has joined #archiveteam
[17:11] *** Lord_Nigh has quit IRC (Ping timeout: 246 seconds)
[17:22] *** Lord_Nigh has joined #archiveteam
[17:23] *** balrog sets mode: +o Lord_Nigh
[17:23] *** chazchaz has quit IRC (Ping timeout: 369 seconds)
[17:23] *** C-apple has quit IRC (Quit: Woohoo.)
[17:24] *** dcmorton has quit IRC (Read error: Operation timed out)
[17:26] *** dserodio has quit IRC (Read error: Operation timed out)
[17:35] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[17:36] *** dashcloud has joined #archiveteam
[17:39] looks like this is broken too: mms://125.60.61.137/e_history/StreamRoot/MH/MH_0801_1984_01.wmv
[17:40] it gives a 'Error while reading network steam' error
[17:40] i did it twice and i get the same md5sum 0299d9acfde242a2ac41b396df14b287
[17:41] *** chazchaz has joined #archiveteam
[17:41] *** atlogbot has joined #archiveteam
[17:42] *** Jonimus has quit IRC (Read error: Operation timed out)
[17:45] *** dcmorton has joined #archiveteam
[17:47] *** Nertsy has joined #archiveteam
[17:49] *** lemonkey has joined #archiveteam
[17:50] *** Laverne has joined #archiveteam
[17:50] *** dserodio has joined #archiveteam
[17:50] *** Nertsy has quit IRC (Client Quit)
[17:52] *** Nertsy has joined #archiveteam
[17:55] *** Start has joined #archiveteam
[17:58] *** yotta has joined #archiveteam
[18:04] *** Emcy has joined #archiveteam
[18:05] So, Blogger...
[18:06] any ideas?
[18:06] #flogger
[18:06] Archive ALL of the naughty bits!
[18:07] And everything else, for balance.
[18:07] And because it's easier.
[18:07] We have 28 days
[18:18] https://imgur.com/a/SjcgE
[18:18] looks like we have an easy way to discovery all the blogs: https://www.blogger.com/profile/5618947
[18:18] we'll have to grab all the profile pages first
[18:22] here's the highest one i could find: https://www.blogger.com/profile/35217655
[18:23] nice!
[18:30] Start: that highest one says "since April 2007" to me
[18:39] *** mschfr has joined #archiveteam
[18:42] *** Start has quit IRC (Disconnected.)
[18:50] google switched to using 20 digit numbers later on
[18:50] ouch
[18:53] and integrated blogger + google+
[18:58] Still, there's a good base to start from. And each blog discovered will link to other blogs, so a worthwhile self-perpetutating crawl+download+discovery cycle soon gets going
[18:59] (hopefully)
[18:59] *** thechip has quit IRC (Ping timeout: 252 seconds)
[19:01] *** Ymgve__ is now known as Ymgve
[19:02] well, if we dump an initial list of blogs
[19:02] then for every blog, extract the next-blog url and inject into tracker if it doesn't already exist
[19:02] the project could be self-sustaining until there's no more new items
[19:05] Medium invented the blog post.
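The discovery plan sketched between 18:18 and 19:02 (enumerate numeric profile pages, then keep injecting newly found blogs into the tracker until nothing new turns up) can be illustrated as follows. The profile URL pattern is the one quoted in the log; everything else is an assumption: `Tracker`, `drain`, `profile_urls` and the `*.blogspot.example` names are hypothetical, `find_linked_blogs` stands in for whatever fetching and parsing a real project would do, and this is not the actual Archive Team tracker or warrior code.

```python
from collections import deque

# URL pattern as quoted in the log at 18:18.
PROFILE_URL = "https://www.blogger.com/profile/{}"


def profile_urls(first_id, last_id):
    """Profile pages for an inclusive range of numeric profile IDs."""
    return [PROFILE_URL.format(i) for i in range(first_id, last_id + 1)]


class Tracker:
    """Minimal work queue that refuses items it has already seen."""

    def __init__(self):
        self._known = set()
        self._todo = deque()

    def inject(self, item):
        """Queue an item; return False if it was already known."""
        if item in self._known:
            return False
        self._known.add(item)
        self._todo.append(item)
        return True

    def claim(self):
        """Hand out the next item, or None when the queue is empty."""
        return self._todo.popleft() if self._todo else None


def drain(tracker, find_linked_blogs):
    """Process items until no new ones are discovered.

    find_linked_blogs(blog) is a placeholder for fetching a blog and
    extracting its next-blog and blog-to-blog links.
    """
    processed = []
    item = tracker.claim()
    while item is not None:
        processed.append(item)
        for linked in find_linked_blogs(item):
            tracker.inject(linked)  # duplicates are silently ignored
        item = tracker.claim()
    return processed


if __name__ == "__main__":
    # Tiny made-up link graph with a cycle between the first two blogs.
    links = {
        "a.blogspot.example": ["b.blogspot.example"],
        "b.blogspot.example": ["a.blogspot.example", "c.blogspot.example"],
        "c.blogspot.example": [],
    }
    tracker = Tracker()
    tracker.inject("a.blogspot.example")
    print(drain(tracker, lambda blog: links.get(blog, [])))
```

The duplicate check in `inject` is what keeps the cycle from looping forever, which is the "inject into tracker if it doesn't already exist" step from 19:02.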
[19:09] All we need now is a rockstar to express this project in beautiful, running warrior code
[19:10] *** wp494_ has joined #archiveteam
[19:10] *** wp494_ has quit IRC (Excess Flood)
[19:10] *** wp494_ has joined #archiveteam
[19:10] *** wp494_ has quit IRC (Excess Flood)
[19:11] *** edsu has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** BlueMaxim has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** xtr-201 has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** pikhq has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** offby1 has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** underscor has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** wp494 has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** ats_ has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** sivoais has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** lytv has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** dx has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** rduser has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** DFJustin has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** SadDM has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** will__ has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** matthusby has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** maltris has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** useretail has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** Marc has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** torvik has quit IRC (ircd.shaw.ca irc.shaw.ca)
[19:11] *** wp494_ has joined #archiveteam
[19:13] *** ats has joined #archiveteam
[19:13] *** edsu_ has joined #archiveteam
[19:14] *** nertzy3 has joined #archiveteam
[19:14] *** nertzy2 has quit IRC (Ping timeout: 240 seconds)
[19:15] 1 month, god knows how many blogs
[19:17] i'm quite curious how is google going to decide what's sexual and what isn't
[19:17] *** Nertsy has quit IRC (Ping timeout: 512 seconds)
[19:19] *** signius has quit IRC (Read error: Operation timed out)
[19:19] select top 25000000 blogs FROM tbl_allblogs ORDER BY hits DESC
[19:19] *** Nertsy has joined #archiveteam
[19:20] Apparently at the moment it's any blog that the owner (or someone else) has already flagged as adult
[19:21] Archive Team: We don't wait for the DROP TABLE.
[19:21] Interestingly Google says 'explicit images and video'
[19:21] so presuambly the text can be as rumpatious as you like
[19:22] 'sexually explicit or graphic nude images or video'
[19:27] *** BlueMaxim has joined #archiveteam
[19:29] *** lytv has joined #archiveteam
[19:32] *** raylee has quit IRC (Ping timeout: 240 seconds)
[19:34] *** sivoais has joined #archiveteam
[19:34] *** signius has joined #archiveteam
[19:36] *** raylee has joined #archiveteam
[19:38] *** DFJustin has joined #archiveteam
[19:38] *** swebb sets mode: +o DFJustin
[19:41] *** Marc has joined #archiveteam
[19:45] *** Marc has quit IRC (Ping timeout: 240 seconds)
[19:46] *** rduser has joined #archiveteam
[19:49] *** raylee has quit IRC (Ping timeout: 240 seconds)
[19:50] Hey. I'm trying to archive various tumblr blogs and I'm wondering if you guys know of any tools to do that. I would prefer it to be an incrimental backup so I don't have duplicate files everywhere.
[19:51] *** raylee has joined #archiveteam
[19:51] archivebot does that
[19:52] http://boutofcontext.com/tumblr_backup.php ... now that looks interesting
[19:52] *** Marc has joined #archiveteam
[19:52] hm yeah
[19:52] Nertsy: you can also grab from the rss feed
[19:52] (archivebot takes longer than usual for tumblr)
[19:53] *** useretail has joined #archiveteam
[19:53] *** underscor has joined #archiveteam
[19:53] *** dx has joined #archiveteam
[19:53] *** matthusby has joined #archiveteam
[19:53] *** torvik has joined #archiveteam
[19:53] *** irc.shaw.ca sets mode: +o underscor
[19:53] *** swebb sets mode: +o underscor
[19:54] *** xtr-201 has joined #archiveteam
[19:54] >http://boutofcontext.com/tumblr_backup.php
[19:54] Source anywhere?
[19:55] *** SadDM has joined #archiveteam
[19:55] *** swebb sets mode: +o SadDM
[19:56] *** pikhq has joined #archiveteam
[19:57] *** will__ has joined #archiveteam
[19:58] *** mschfr has quit IRC (Ping timeout: 240 seconds)
[20:24] *** thechip has joined #archiveteam
[20:53] *** bolex83 has joined #archiveteam
[20:54] *** bolex83 has quit IRC (Client Quit)
[20:55] *** schbirid has quit IRC (Quit: Leaving)
[21:12] *** atlogbot has quit IRC (Remote host closed the connection)
[21:12] *** swebb has quit IRC (badcheese.com - where crap sometimes gets done)
[21:14] *** swebb has joined #archiveteam
[21:24] *** BlueMaxim has quit IRC (Quit: Leaving)
[21:32] *** xmc sets mode: +o swebb
[22:00] *** BlueMaxim has joined #archiveteam
[22:07] *** sankin has quit IRC (Leaving.)
[22:52] *** wp494_ is now known as wp494
[23:33] *** xk_id has joined #archiveteam