#archiveteam 2015-02-24,Tue

Time Nickname Message
00:04 🔗 Ymgve has quit IRC ()
00:18 🔗 SketchCow I'll get cow.net back up
00:40 🔗 S[h]O[r]T Sanqui do you mean like parsing html?
01:01 🔗 dugo lynx -dump internic.net|grep "[0-9]\. .t"|awk '{print $2}'
01:20 🔗 BiggieJon has quit IRC (Read error: Operation timed out)
01:27 🔗 BiggieJon has joined #archiveteam
01:44 🔗 Start has joined #archiveteam
02:51 🔗 signius has quit IRC (Read error: Operation timed out)
02:54 🔗 kyan has joined #archiveteam
02:56 🔗 primus104 has quit IRC (Leaving.)
03:00 🔗 BiggieJon has quit IRC (Read error: Operation timed out)
03:04 🔗 mistym has quit IRC (Remote host closed the connection)
03:04 🔗 signius has joined #archiveteam
03:05 🔗 mistym has joined #archiveteam
03:05 🔗 mistym has quit IRC (Remote host closed the connection)
03:09 🔗 mistym has joined #archiveteam
03:12 🔗 mistym has quit IRC (Remote host closed the connection)
03:14 🔗 BiggieJon has joined #archiveteam
03:42 🔗 BlueMaxim has joined #archiveteam
03:43 🔗 aschmitz has quit IRC (Quit: Leaving)
03:44 🔗 mistym has joined #archiveteam
04:01 🔗 SketchCow Google is about to take down, in one month, all Blogger.com blogs that are "secually exlicit"
04:01 🔗 SketchCow Sorry, keyboard
04:01 🔗 SketchCow They are just sending letters.
04:02 🔗 SketchCow I think we should just download Blogger.
04:07 🔗 Start i'll add it to the current projects list
04:07 🔗 Start any ideas for an irc channel?
04:07 🔗 xmc #frogger
04:08 🔗 xmc #prairiedogger
04:17 🔗 Start i've put #frogger as the channel on the wiki
04:26 🔗 SketchCow Well, they're removing nudity that lacks educational, documentary or academic standing
04:27 🔗 SketchCow So something good
04:29 🔗 xmc pornogger
04:30 🔗 SketchCow hornblogger
04:30 🔗 SketchCow pornblogger
04:30 🔗 xmc horndogger
04:38 🔗 C-apple-a is now known as C-apple
04:38 🔗 yipdw Google, but for prudes
04:38 🔗 yipdw OH WAIT
04:39 🔗 C-apple fiftyshadesofblogger - without the "iftyshadesofb" in the middle
04:43 🔗 yipdw " We’ll still allow nudity if the content offers a substantial public benefit, for example in artistic, educational, documentary, or scientific contexts."
04:43 🔗 yipdw sexual arousal is not a substantial public benefit evidently
04:44 🔗 yipdw Google Pills, to prevent Stirrings
04:44 🔗 Emcy_ well thats depressing
04:44 🔗 Emcy_ oh well theres still tumblr
04:45 🔗 yipdw owned by yahoo, so not really
04:45 🔗 aaaaaaaaa I thought tumblr already took steps to clean up the more prurient ones
04:50 🔗 aaaaaaaaa oh they hide them from the search feature
04:52 🔗 BlueMaxim has quit IRC (Ping timeout: 370 seconds)
04:53 🔗 BlueMaxim has joined #archiveteam
05:01 🔗 Start i wonder if we should archive google sites sometime soon
05:05 🔗 Start google hasn't been giving much attention to it recently and i have the feeling that it might be one of their next products on the chopping block
05:06 🔗 Nertsy has quit IRC (Quit: Nertsy)
05:09 🔗 Start what happened to picasa web albums?
05:09 🔗 mistym has quit IRC (Remote host closed the connection)
05:09 🔗 Start https://picasaweb.google.com redirects to google plus
05:10 🔗 Start with a link to https://picasaweb.google.com/lh/myphotos?noredirect=1
05:11 🔗 Start looks like it's being merged into google plus
05:13 🔗 Start looks like content on picasaweb.google.com won't be there for much longer
05:13 🔗 Start warrior project?
05:14 🔗 C-apple I think Google Sites might survive simply because it's one of the services advertised as being available in Google Apps for Work (which is really just Google stuff with your own domain name instead of @gmail.com, plus some enterprise control).
05:15 🔗 aaaaaaaaa has quit IRC (Leaving)
05:16 🔗 C-apple Just like Google Talk is supposedly dead, but I'm still using it (both on Google Talk and on XMPP)-- they already marketed it as the chat integration for (paid) Google Apps for Work, so they'd annoy actual paying customers if they yanked XMPP.
05:18 🔗 DFJustin I think the 18+ bloggers age-gate so you probably need to grab with cookies
05:24 🔗 Start i've added picasa to the current projects list
05:30 🔗 yipdw Picasa is online and seems fine
05:30 🔗 yipdw it doesn't redirect to Google+
05:30 🔗 yipdw it requires that you're logged in
05:31 🔗 C-apple If you aren't logged in, it redirects to Google Plus?
05:31 🔗 yipdw if you aren't logged in, you get a login prompt
05:31 🔗 yipdw I logged in and it was there
05:31 🔗 yipdw I don't have a Google+ account
05:33 🔗 C-apple Ah, OK, I remember this...
05:33 🔗 Start they started redirecting google plus users away from it
05:34 🔗 C-apple OK, I'm logged in, and I go to https://picasaweb.google.com and it sends me to my personal Google+ photo album, and across the top is a "Click here to go back to Picasa Web Albums." message.
05:34 🔗 mistym has joined #archiveteam
05:35 🔗 C-apple For what it's worth, at least one of my photos showing there pre-dates Google+, I believe; it was a PicasaWeb test, not a Google+ upload, and was uploaded directly, not from the Picasa software.
05:45 🔗 C-apple ArchiveTeam isn't too concerned with sites that the Wayback Machine already managed to crawl, right? (I'm thinking of AllGame, for example.)
05:45 🔗 ben_ has quit IRC (Read error: Connection reset by peer)
05:45 🔗 ex-parrot has joined #archiveteam
05:46 🔗 xmc C-apple: we usually crawl for ourselves if it's changed much, or we can find basically anything that wbm systematically didn't get
05:50 🔗 yipdw C-apple: that said, re: allgame, we did get a copy -> http://archive.fart.website/archivebot/viewer/job/aeefs
05:51 🔗 yipdw it should be part of wayback
05:52 🔗 xmc in practice, wbm always misses things
05:53 🔗 C-apple xmc: Because it only pays attention to stuff that's linked from something it already knew about, normally?
05:53 🔗 xmc also it doesn't crawl every link it finds
05:54 🔗 C-apple Ew. Any theory on why?
05:55 🔗 xmc think about it for a second
05:56 🔗 C-apple Theory 1: So much stuff that it queues stuff up and never gets to it by the time it's dead.
05:57 🔗 C-apple Theory 2: Lame-o JavaScript hrefs. (But that's usually a site-wide disease, not specific pages.)
05:58 🔗 xmc #1 is it
06:03 🔗 bsmith093 has joined #archiveteam
06:04 🔗 C-apple Does archive.org have any history of accepting and importing WARC files into the Wayback Machine, or do they only accept what they crawled themselves?
06:05 🔗 C-apple Ah-- I'm reading more-- so WARC is the IA's format?
06:05 🔗 yipdw Archive Team web crawls have made it into wayback, as have archivebot inputs
06:05 🔗 yipdw WARC's an ISO standard based in part on IA's ARC, there's a few tools out there that read it
06:05 🔗 yipdw read and write it
06:05 🔗 C-apple OK, cool.
06:06 🔗 DFJustin well #1 except it's not the site that goes dead, they have to eventually pull the plug on the crawl so they can go get some new content
06:07 🔗 DFJustin brewster explained in one of his interviews that the web is infinite, for example there's a website where you can play chess move by move so there's a url for every possible chessboard
06:08 🔗 DFJustin so they have to have some reasonable limits on how deep they go on any one site so it doesn't get completely bogged down
06:08 🔗 C-apple DFJustin: Does IA add links to the "back" of the queue, behind any other web address that has already been found in a link? And as for those infinite/recursive links: You're saying that IA just crawls so much before it figures that site has gone down the rabbit hole and de-prioritizes it?
06:09 🔗 DFJustin I don't work there so I don't have the gospel on how exactly it works
06:10 🔗 DFJustin but I'm sure you have to make some tough decisions if you wanna try and get the entire web
06:10 🔗 C-apple Yeah, I can imagine.
06:10 🔗 DFJustin also a lot of the content in the wayback machine was actually crawled by third parties and later donated to the archive, so there's no one consistent methodology
06:11 🔗 C-apple I would hope the general algorithm would figure that sequential URLs were not dead-ends.
06:11 🔗 ikreymer has joined #archiveteam
06:11 🔗 yipdw C-apple: they can be
06:12 🔗 yipdw improperly configured PHP photo galleries will accept negative indices and just render something with a 200 OK
06:12 🔗 yipdw they'll also render links to those indices
06:12 🔗 yipdw so you'll end up crawling billions in both directions without some sort of limit or fuzzy matcher
06:12 🔗 C-apple Yuck.
06:12 🔗 DFJustin yeah there are a lot of sites with &page=4 etc. where it will keep linking to &page=97174981 way after you've actually run out of results
06:13 🔗 C-apple Ah, yes, that.
06:13 🔗 yipdw I mention that because we're hitting that problem in archivebot right now
06:13 🔗 yipdw also, calendars
06:13 🔗 yipdw fuck them
06:13 🔗 C-apple They'll give actual "useful" content, or just show a 0-result kind of gallery?
06:13 🔗 DFJustin it'll either show nothing or the last page over and over
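The "some sort of limit or fuzzy matcher" idea mentioned above could look roughly like the sketch below. It is only an illustration, not how archivebot or the Wayback crawler actually works: the names `TrapGuard`, `url_pattern` and `MAX_PER_PATTERN` are hypothetical, and the limit of 3 is an arbitrary assumption. Every run of digits in a URL is collapsed into a placeholder, so `page=4` and `page=97174981` count against the same budget.

```python
import re
from collections import Counter

# Hypothetical budget: how many URLs of one "shape" to fetch before stopping.
MAX_PER_PATTERN = 3


def url_pattern(url: str) -> str:
    """Collapse each run of digits (with an optional leading minus) into {n}."""
    return re.sub(r"-?\d+", "{n}", url)


class TrapGuard:
    """Counts URLs per pattern and refuses new ones once the budget is spent."""

    def __init__(self, limit: int = MAX_PER_PATTERN):
        self.limit = limit
        self.seen = Counter()

    def allow(self, url: str) -> bool:
        pattern = url_pattern(url)
        self.seen[pattern] += 1
        return self.seen[pattern] <= self.limit


guard = TrapGuard()
# Negative and positive page indices all share one pattern.
decisions = [guard.allow(f"http://example.com/gallery.php?page={i}")
             for i in range(-2, 3)]
print(decisions)
```

A real crawler would probably want a much larger budget and a smarter similarity check (for example comparing page content), since a fixed cap also cuts off legitimately long paginated archives.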
06:13 🔗 ikreymer hi archiveteam, this may be useful to folks here, i've made a point-and-click tool for opening WARC files locally on your computer: https://github.com/ikreymer/webarchiveplayer
06:14 🔗 DFJustin sweet
06:14 🔗 C-apple Oh, yeah, calendars are a pain in the ass even for manual calendar users-- sometimes there's no damned way to figure out if you scheduled something for the year 2024 or what.
06:14 🔗 yipdw ikreymer: oh cool
06:14 🔗 ikreymer there are downloadable windows and os x versions, you can just run and then select one or more WARC files, and it'll open the browser for you allowing you to browse them
06:15 🔗 yipdw this is pretty neat, now I don't need a full pywb instance to verify things :)
06:15 🔗 DFJustin feel free to add it to http://archiveteam.org/index.php?title=The_WARC_Ecosystem
06:15 🔗 ikreymer yes, it's just pywb wrapped into an executable, hopefully a bit simpler to use.
06:15 🔗 DFJustin "how do I view this crap" is a pretty common question we get from people
06:15 🔗 C-apple ikreymer: Oo! Yummy!
06:17 🔗 mistym_ has joined #archiveteam
06:18 🔗 ionpulse has joined #archiveteam
06:19 🔗 ikreymer DFJustin: great, i will list it on the wiki. feel free to open issues on github. definitely needs a bit more testing, especially the windows build. i tested it mostly on windows 7.
06:19 🔗 mistym has quit IRC (Read error: Operation timed out)
06:27 🔗 ikreymer has quit IRC (Quit: http://chat.efnet.org )
06:27 🔗 ikreymer has joined #archiveteam
06:30 🔗 antomatic has quit IRC (Read error: Operation timed out)
06:30 🔗 antomatic has joined #archiveteam
06:33 🔗 C-apple OK, here's a more complicated one: Let's say I'm a hosting admin and there are some sites that are no longer publicly accessible under their old URLs, but I want to provide an archive. Any tools to do that, either as a convert-file-directory-to-WARC type of thing, or a script that rewrites the base URL and archive dates if I crawl it under a temporary revival URL?
06:35 🔗 mistym_ has quit IRC (Remote host closed the connection)
06:36 🔗 mistym has joined #archiveteam
06:37 🔗 sep332 has quit IRC (Read error: Operation timed out)
06:42 🔗 C-apple (Asking that after midnight on a Tuesday night probably isn't the best way to get a response to that...)
06:43 🔗 ikreymer has quit IRC (Quit: http://chat.efnet.org )
06:43 🔗 ikreymer has joined #archiveteam
06:45 🔗 kyan You could crawl it under the temporary URL and leave it alone... faking the dates/URLs on them seems like a highly dubious prospect to me?
06:46 🔗 kyan then just have the WARC be available for browsing with reference to the temporary URLs
06:46 🔗 kyan (I don't know what I'm talking about though, so don't trust me :P)
06:46 🔗 kyan C-apple ^
06:46 🔗 C-apple But then people looking for archives of the original site wouldn't see that as a snapshot.
06:47 🔗 kyan true true
06:47 🔗 kyan IDK :( maybe someone else knows
06:48 🔗 kyan the idea of forging the WARCs really rubs me the wrong way hard though
06:48 🔗 C-apple I was thinking, crawl it/whatever privately, then date it to the last time in the logs that the public could reach it-- so if there is a later site on that URL it doesn't look in a timeline like the site throttled between content.
06:50 🔗 C-apple Well, yes, I think diddling the WARC metadata is touchy-- but worse than having data with a far-too-late date and a pointless URL that the site never used while live?
06:52 🔗 GLaDOS has quit IRC (Ping timeout: 246 seconds)
07:08 🔗 midas https://support.google.com/blogger/answer/6170671?p=policy_update&rd=1? -- Blogger is closing nude blogs. or at least making them less easy to access.
07:20 🔗 ikreymer has quit IRC ()
07:20 🔗 ikreymer has joined #archiveteam
07:37 🔗 nertzy has quit IRC (Read error: Connection reset by peer)
07:39 🔗 nertzy has joined #archiveteam
07:43 🔗 signius has quit IRC (ircd.choopa.net irc.teksavvy.ca)
07:43 🔗 Baljem has quit IRC (ircd.choopa.net irc.teksavvy.ca)
07:43 🔗 Fusl has quit IRC (ircd.choopa.net irc.teksavvy.ca)
07:43 🔗 Kazzy has quit IRC (ircd.choopa.net irc.teksavvy.ca)
07:43 🔗 MMovie has quit IRC (ircd.choopa.net irc.teksavvy.ca)
07:43 🔗 closure has quit IRC (ircd.choopa.net irc.teksavvy.ca)
07:43 🔗 Baljem_ has joined #archiveteam
07:44 🔗 sep332 has joined #archiveteam
07:44 🔗 Kazzy_ has joined #archiveteam
07:50 🔗 mutoso has quit IRC (Read error: Operation timed out)
07:50 🔗 Fusl_ has joined #archiveteam
07:55 🔗 signius has joined #archiveteam
07:56 🔗 mutoso has joined #archiveteam
07:56 🔗 ikreymer has quit IRC (Remote host closed the connection)
07:58 🔗 Fusl_ is now known as Fusl
07:59 🔗 yipdw C-apple: crawl it and provide a separate URL mapping
07:59 🔗 Kazzy_ is now known as Kazzy
07:59 🔗 yipdw don't touch the WARC records
07:59 🔗 yipdw it is easy (easier) to write software that understands the mapping and transforms input WARC records online
07:59 🔗 yipdw altering the WARC is irreversible and undetectable
07:59 🔗 yipdw there's probably already some standard for a mapping like that, I can't name one offhand unfortunately
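A minimal sketch of the "separate URL mapping" approach, assuming a simple prefix-to-prefix table kept alongside the untouched WARC. The hostnames, the `URL_MAP` table and the `original_url` helper are all hypothetical; no existing standard or tool is implied. The point is only that the rewrite happens at read time, so the stored records stay exactly as crawled.

```python
# Hypothetical sidecar mapping: crawl-time prefix -> original public prefix.
# The WARC itself is never modified; this table travels next to it.
URL_MAP = {
    "http://revival.example.net/oldsite/": "http://www.example.com/",
}


def original_url(crawled_url: str, mapping: dict = URL_MAP) -> str:
    """Translate a URL recorded during the crawl to the URL the site used live."""
    for crawl_prefix, public_prefix in mapping.items():
        if crawled_url.startswith(crawl_prefix):
            return public_prefix + crawled_url[len(crawl_prefix):]
    # URLs outside the mapping are returned unchanged.
    return crawled_url


print(original_url("http://revival.example.net/oldsite/about.html"))
```

Because the mapping is a separate file, it can be corrected or discarded later, which is not possible once the records themselves have been rewritten.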
08:00 🔗 C-apple Does AT or Wayback accept WARCs that need to be converted that way?
08:00 🔗 C-apple I mean, rather than just mapping them to their (private/useless) crawl parameters?
08:00 🔗 yipdw I can't think of a situation in which it has happened before
08:01 🔗 yipdw that said easiest way to go forward is to have the data on hand
08:01 🔗 signius has quit IRC (ircd.choopa.net irc.teksavvy.ca)
08:05 🔗 C-apple Another silly question: Is there any reason we use .gz for WARC instead of .bz2 or .xz, since it's stuff that's going to be compressed once, stored, and retrieved a lot?
08:07 🔗 yipdw .warc.gz is a concatenation of gzipped WARC records
08:07 🔗 yipdw it typically achieves 2:1 compression ratio
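One practical consequence of "a concatenation of gzipped WARC records" is that a single record can be decompressed on its own if its byte offset is known. The sketch below shows this with Python's standard `gzip` and `zlib` modules; the three byte strings are stand-ins, not real WARC records, and `read_member` is a hypothetical helper name.

```python
import gzip
import zlib

# Stand-in payloads; real WARC records carry headers defined by the standard.
records = [b"record one\n", b"record two\n", b"record three\n"]

# Compress each record as its own gzip member and note where each one starts.
offsets = []
blob = b""
for rec in records:
    offsets.append(len(blob))
    blob += gzip.compress(rec)

# The concatenation is still a valid gzip stream when read front to back.
whole = gzip.decompress(blob)


def read_member(data: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at the given offset."""
    d = zlib.decompressobj(wbits=31)  # wbits=31 means "expect a gzip wrapper"
    return d.decompress(data[offset:])


second = read_member(blob, offsets[1])
print(second)
```

With a single solid `.bz2` or `.xz` stream, getting at one record would generally mean decompressing everything before it, so the per-record layout trades some compression ratio for random access.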
08:07 🔗 xmc C-apple: this is getting lengthy, it should be in #archiveteam-bs
08:08 🔗 C-apple Ah.
08:09 🔗 MMovie has joined #archiveteam
08:09 🔗 mistym has quit IRC (Remote host closed the connection)
08:13 🔗 signius has joined #archiveteam
08:16 🔗 primus104 has joined #archiveteam
08:27 🔗 MMovie has quit IRC (Ping timeout: 306 seconds)
08:28 🔗 MMovie has joined #archiveteam
08:36 🔗 antomatic So, Blogger...
08:37 🔗 antomatic Have we got a channel yet?
08:37 🔗 antomatic #flogger perhaps ;)
08:37 🔗 antomatic damn Google
08:38 🔗 antomatic And where is Vint Cerf now?
08:39 🔗 antomatic "Oh but they're not deleting anything, they're just making sure that nobody can access it, so that's alright, right?"
08:39 🔗 antomatic Grr
08:40 🔗 yipdw Vint Cerf is off writing joke RFCs about how the Internet is for everyone
08:41 🔗 antomatic While backing up his secret blog full of pictures of ladies' ankles.
08:41 🔗 antomatic (Which was the style at the time)
08:42 🔗 yipdw not sure about that but https://tools.ietf.org/html/rfc3271 does exist
08:42 🔗 godane has quit IRC (Read error: Operation timed out)
08:43 🔗 antomatic And what will the martians think of our human race if there are no sexy blogs for them to assess?
08:43 🔗 antomatic "No, it's alright, because they will still be visible if you explicitly choose to share the blog with the martians."
08:44 🔗 antomatic "Not deleting anything."
08:44 🔗 antomatic "Just making it so that nobody can see. Different, of course."
08:44 🔗 yipdw it's ok, we have tumblr and ic.cz
08:44 🔗 yipdw well sort of the latter
08:44 🔗 yipdw anyway
08:44 🔗 antomatic So.. Blogger..
08:45 🔗 antomatic Brute-force username discovery? Crawls, dictionaries, blog-to-blog links, etc..?
08:47 🔗 antomatic Could 'social' it up, maybe - e.g. 'Add +ArchiveTeam to your GoogleConnectFriendsPlusWhatever circle and we'll archive your blog" etc
08:47 🔗 * antomatic thinks
08:57 🔗 godane has joined #archiveteam
09:02 🔗 zenguy_pc has joined #archiveteam
09:04 🔗 primus104 has quit IRC (Leaving.)
09:06 🔗 boozehoun has quit IRC (Ping timeout: 512 seconds)
09:29 🔗 rejon has quit IRC (Ping timeout: 512 seconds)
09:30 🔗 * ex-parrot has recovered from the excitement of Hyves and is ready to back up some blogs
09:33 🔗 antomatic Could be a nice chunky project.
09:38 🔗 rejon has joined #archiveteam
10:00 🔗 schbirid has joined #archiveteam
10:03 🔗 MMovie has quit IRC (Read error: Operation timed out)
10:03 🔗 MMovie has joined #archiveteam
11:18 🔗 Sk1d has quit IRC (Ping timeout: 265 seconds)
11:25 🔗 Sk1d has joined #archiveteam
11:30 🔗 Sk2d has joined #archiveteam
11:32 🔗 Sk1d has quit IRC (Read error: Operation timed out)
11:32 🔗 Sk2d is now known as Sk1d
11:33 🔗 Nemo_bis Reposurgeon might have been an overkill suggestion for the Toolserver SVN I asked about earlier.
11:34 🔗 Nemo_bis For now I'm just merging all repos into one for ease (certainly not cleanliness) per https://stackoverflow.com/a/267307/4145951 ; then I'll just look for some place to dump the merged repo for the sake of downloadability and history
11:35 🔗 Nemo_bis Just sourceforge shell maybe https://sourceforge.net/p/forge/community-docs/svn%20import/
11:37 🔗 Sk1d has quit IRC (Ping timeout: 265 seconds)
11:39 🔗 Ymgve has joined #archiveteam
11:41 🔗 Sk1d has joined #archiveteam
11:47 🔗 Sk2d has joined #archiveteam
11:48 🔗 Sk1d has quit IRC (Read error: Operation timed out)
11:53 🔗 Sk2d has quit IRC (Ping timeout: 265 seconds)
11:54 🔗 Sk1d has joined #archiveteam
11:58 🔗 dashcloud has quit IRC (Ping timeout: 246 seconds)
12:01 🔗 dashcloud has joined #archiveteam
12:01 🔗 Sk2d has joined #archiveteam
12:04 🔗 Sk1d has quit IRC (Read error: Operation timed out)
12:04 🔗 Sk2d is now known as Sk1d
12:07 🔗 primus104 has joined #archiveteam
12:09 🔗 Sk1d has quit IRC (Ping timeout: 265 seconds)
12:09 🔗 dashcloud has quit IRC (Read error: Operation timed out)
12:11 🔗 Sk1d has joined #archiveteam
12:12 🔗 dashcloud has joined #archiveteam
12:17 🔗 Sk2d has joined #archiveteam
12:18 🔗 midas sad day https://www.facebook.com/groups/105586892805903/permalink/897231230308128/
12:18 🔗 Sk1d has quit IRC (Read error: Operation timed out)
12:18 🔗 Sk2d is now known as Sk1d
12:37 🔗 garyrh (for those who don't use facebook: http://www.gamasutra.com/view/news/237129/Obituary_Atari_pioneer_Steve_Bristow.php)
12:42 🔗 Sk1d has quit IRC (Ping timeout: 265 seconds)
12:43 🔗 midas thanks garyrh :)
12:45 🔗 Sk1d has joined #archiveteam
12:52 🔗 uwe has quit IRC (Ping timeout: 240 seconds)
12:53 🔗 Sk1d has quit IRC (Read error: Operation timed out)
12:54 🔗 Sk1d has joined #archiveteam
13:01 🔗 Sk1d has quit IRC (Read error: Operation timed out)
13:02 🔗 Sk1d has joined #archiveteam
13:06 🔗 dashcloud has quit IRC (Read error: Operation timed out)
13:10 🔗 dashcloud has joined #archiveteam
13:17 🔗 Sk1d has quit IRC (Ping timeout: 265 seconds)
13:20 🔗 Sk1d has joined #archiveteam
13:25 🔗 closure has joined #archiveteam
13:26 🔗 Sk2d has joined #archiveteam
13:30 🔗 Sk1d has quit IRC (Read error: Operation timed out)
13:30 🔗 Sk2d is now known as Sk1d
13:31 🔗 dashcloud has quit IRC (Read error: Operation timed out)
13:36 🔗 dashcloud has joined #archiveteam
13:47 🔗 sankin has joined #archiveteam
13:49 🔗 rejon has quit IRC (Remote host closed the connection)
13:51 🔗 Sk1d has quit IRC (Ping timeout: 265 seconds)
13:53 🔗 primus104 has quit IRC (Leaving.)
13:55 🔗 Sk1d has joined #archiveteam
14:00 🔗 Sk2d has joined #archiveteam
14:02 🔗 Sk1d has quit IRC (Read error: Operation timed out)
14:06 🔗 Sk2d has quit IRC (Ping timeout: 265 seconds)
14:07 🔗 Sk1d has joined #archiveteam
14:09 🔗 russss has joined #archiveteam
14:10 🔗 Sk1d has quit IRC (Read error: Operation timed out)
14:14 🔗 Sk1d has joined #archiveteam
14:19 🔗 Sk1d has quit IRC (Ping timeout: 265 seconds)
14:21 🔗 sankin has quit IRC (Leaving.)
14:21 🔗 Sk1d has joined #archiveteam
14:21 🔗 rolfb has joined #archiveteam
14:24 🔗 Sk1d has quit IRC (Read error: Operation timed out)
14:25 🔗 rolfb Hi. I am considering uploading a huge archive of data to the Internet Archive for preservation. Is there anyone that would be able to walk us through how to best go about it?
14:26 🔗 midas how big?
14:28 🔗 rolfb midas: around 4.5 terabytes
14:29 🔗 balrog http://qz.com/349569/google-will-ban-adult-content-on-its-blogging-platform/
14:29 🔗 midas is it a website rolfb or software?
14:29 🔗 Sk1d has joined #archiveteam
14:29 🔗 Nemo_bis :o
14:30 🔗 midas depends on whether you can split the data into multiple files, because a single 4.5TB file is going to be horrible to upload
14:32 🔗 Nemo_bis More precisely, impossible
14:32 🔗 midas yeah that too
14:32 🔗 midas :p
14:33 🔗 rolfb midas: it is not so much about the website, but the content which it serves
14:33 🔗 rolfb but i can not go into further details before next week, so perhaps I should return then :)
14:34 🔗 Nemo_bis rolfb: usually we archive websites in chunks of ~40 GB WARC files
14:34 🔗 midas ok, anyway rolfb, dont save it into 1 file, 4.5TB will not upload to IA
14:34 🔗 Sk1d has quit IRC (Ping timeout: 265 seconds)
14:34 🔗 rolfb midas: torrent is an option?
14:35 🔗 Nemo_bis yes
14:35 🔗 rolfb torrent with an archive then?
14:35 🔗 rolfb size limits?
14:35 🔗 midas could try yeah
14:35 🔗 Nemo_bis same limits
14:35 🔗 rolfb Nemo_bis: meaning chunks of 40gb WARC files?
14:35 🔗 Nemo_bis yes, one per torrent
14:36 🔗 midas i'd go that course yeah
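For a sense of scale, the arithmetic behind "4.5 terabytes in chunks of ~40 GB" can be worked out as below. This assumes decimal units (1 TB = 1000 GB) and treats 40 GB as a hard chunk size, which the discussion only gives as an approximate figure.

```python
import math

TOTAL_GB = 4500   # 4.5 TB, assuming 1 TB = 1000 GB
CHUNK_GB = 40     # approximate chunk size mentioned in the channel

# Round up: a partly filled final chunk still needs its own file/torrent.
chunks = math.ceil(TOTAL_GB / CHUNK_GB)
print(chunks)
```

So the data set would come to a bit over a hundred items at that chunk size, under these assumptions.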
14:36 🔗 Nemo_bis If you have a mass of data you can't process yourself, you can ask here and someone can give you an rsync target
14:36 🔗 Nemo_bis But of course it's better if you clean your data yourself ;)
14:37 🔗 rolfb Nemo_bis: thanks, i think i will return next week with some more intricate details of the data set
14:37 🔗 rolfb but it is nice to know that it is possible
14:38 🔗 Sk1d has joined #archiveteam
14:38 🔗 Ymgve__ has joined #archiveteam
14:39 🔗 Nemo_bis http://archiveteam.org/index.php?title=Blogger looks like the XML trick is still the best we have, some work needed here
14:40 🔗 Nemo_bis Is there a reasonably big hosted blogs platform which handles conversion from Blogger?
14:40 🔗 midas wordpress maybe?
14:41 🔗 midas WordPress comes with a built-in importer tool for Blogger. It is good enough to import posts & comments which are a major part of your blog.
14:42 🔗 Ymgve has quit IRC (Ping timeout: 506 seconds)
14:43 🔗 Nemo_bis Yes, but WordPress is unlikely to engage in a campaign "No worries, we'll absorb all adult blogspot subdomains"
14:43 🔗 Nemo_bis The folks I know at Automattic are rather prudish
14:44 🔗 midas lol
14:44 🔗 midas maybe we can use the tools they use to get all the data
14:44 🔗 Nemo_bis True. Not *all* of automattic code is free software/open source, but the importer might be
14:46 🔗 Nemo_bis Hm, is there a way to make wget -p work with XML files too
14:46 🔗 Sk1d has quit IRC (Read error: Operation timed out)
14:46 🔗 balrog rolfb: if it's data it needs a little different treatment
14:47 🔗 balrog 4.5tb is a bit much, should be doable, but I'd probably suggest talking to someone from IA about it
14:47 🔗 balrog (like SketchCow)
14:47 🔗 rolfb balrog: how so?
14:47 🔗 balrog well, it's not warcs so it will have to be uploaded as items tagged with metadata
14:48 🔗 balrog Nemo_bis: this thing: https://wordpress.org/plugins/blogger-importer/ ?
14:48 🔗 rolfb which timezone is SketchCow in?
14:49 🔗 balrog Eastern Time, but he usually shows up a little bit later in the day
14:49 🔗 Sk1d has joined #archiveteam
14:49 🔗 dashcloud has quit IRC (Ping timeout: 240 seconds)
14:50 🔗 rolfb balrog: is it possible to reach out by email?
14:51 🔗 balrog you can try jason@textfiles.com but he's rather busy
14:51 🔗 thechip if it's mostly content why not ship drives like ye olden days?
14:52 🔗 Sanqui sneakernet!
14:53 🔗 rolfb thechip: is that an option? :-)
14:54 🔗 Sk1d has quit IRC (Ping timeout: 265 seconds)
14:54 🔗 thechip I would think so, if it were me I'd rather go through the trouble of dealing with HDDs than pay for bandwidth
14:55 🔗 thechip but ask IA they might be up for it
14:55 🔗 dashcloud has joined #archiveteam
14:56 🔗 BiggieJo1 has joined #archiveteam
14:56 🔗 rolfb lots of good ideas, i will return next week to discuss more :) thanks all
14:57 🔗 rolfb has left
14:58 🔗 Sk1d has joined #archiveteam
15:01 🔗 BiggieJon has quit IRC (Ping timeout: 600 seconds)
15:06 🔗 Nemo_bis balrog: GPLv2 or later, sounds good
15:08 🔗 Emcy_ has quit IRC (Ping timeout: 512 seconds)
15:21 🔗 sankin has joined #archiveteam
15:24 🔗 Start has quit IRC (Disconnected.)
15:32 🔗 mistym has joined #archiveteam
15:35 🔗 primus104 has joined #archiveteam
15:36 🔗 BiggieJon has joined #archiveteam
15:37 🔗 mistym has quit IRC (Read error: Operation timed out)
15:37 🔗 mistym has joined #archiveteam
15:39 🔗 BiggieJo1 has quit IRC (Ping timeout: 600 seconds)
15:40 🔗 mistym has quit IRC (Remote host closed the connection)
15:46 🔗 dashcloud has quit IRC (Read error: Operation timed out)
15:49 🔗 dashcloud has joined #archiveteam
16:01 🔗 mistym has joined #archiveteam
16:02 🔗 Start has joined #archiveteam
16:15 🔗 Smiley has joined #archiveteam
16:18 🔗 tephra has joined #archiveteam
16:18 🔗 Peetz0r_ has joined #archiveteam
16:18 🔗 nertzy2 has joined #archiveteam
16:19 🔗 lukeman_ has joined #archiveteam
16:20 🔗 primus104 has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 schbirid has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 nertzy has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 miljo has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 S[h]O[r]T has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 edsu_ has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 primus has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 Coderjoe has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 tephra_ has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 SmileyG has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 Peetz0r has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 lukeman has quit IRC (hub.se irc.efnet.pl)
16:20 🔗 altlabel has quit IRC (hub.se irc.efnet.pl)
16:23 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
16:24 🔗 BlueMaxim has joined #archiveteam
16:25 🔗 primus_ has joined #archiveteam
16:26 🔗 altlabel_ has joined #archiveteam
16:26 🔗 aaaaaaaaa has joined #archiveteam
16:29 🔗 dashcloud has quit IRC (Ping timeout: 306 seconds)
16:29 🔗 edsu has joined #archiveteam
16:30 🔗 dashcloud has joined #archiveteam
16:35 🔗 ikreymer has joined #archiveteam
16:35 🔗 SketchCow Technically, Sketchcow is in PST this week.
16:35 🔗 S[h]O[r]T has joined #archiveteam
16:35 🔗 SketchCow What 4.5tb of material is rolfb offering
16:36 🔗 balrog SketchCow: I don't know
16:36 🔗 balrog he said he can't tell us until next week
16:36 🔗 balrog probably some website is shutting down
16:36 🔗 balrog and they can't announce now for legal reasons
16:36 🔗 SketchCow Pbbbbbbbbbb
16:38 🔗 schbirid has joined #archiveteam
16:40 🔗 closure has quit IRC (Ping timeout: 306 seconds)
16:41 🔗 closure has joined #archiveteam
16:51 🔗 mistym has quit IRC (Remote host closed the connection)
16:52 🔗 Start has quit IRC (Disconnected.)
16:55 🔗 Coderjoe has joined #archiveteam
16:55 🔗 Jonimus has quit IRC (Write error: Broken pipe)
16:56 🔗 Laverne has quit IRC (Read error: Operation timed out)
16:57 🔗 atlogbot has quit IRC (Ping timeout: 369 seconds)
16:58 🔗 primus104 has joined #archiveteam
17:00 🔗 miljo has joined #archiveteam
17:01 🔗 dashcloud has quit IRC (Read error: Operation timed out)
17:01 🔗 Jonimus has joined #archiveteam
17:06 🔗 lemonkey has quit IRC (Read error: Operation timed out)
17:07 🔗 mistym has joined #archiveteam
17:08 🔗 dashcloud has joined #archiveteam
17:11 🔗 Lord_Nigh has quit IRC (Ping timeout: 246 seconds)
17:22 🔗 Lord_Nigh has joined #archiveteam
17:23 🔗 balrog sets mode: +o Lord_Nigh
17:23 🔗 chazchaz has quit IRC (Ping timeout: 369 seconds)
17:23 🔗 C-apple has quit IRC (Quit: Woohoo.)
17:24 🔗 dcmorton has quit IRC (Read error: Operation timed out)
17:26 🔗 dserodio has quit IRC (Read error: Operation timed out)
17:35 🔗 dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
17:36 🔗 dashcloud has joined #archiveteam
17:39 🔗 godane looks like this is broken too: mms://125.60.61.137/e_history/StreamRoot/MH/MH_0801_1984_01.wmv
17:40 🔗 godane it gives a 'Error while reading network steam' error
17:40 🔗 godane i did it twice and i get the same md5sum 0299d9acfde242a2ac41b396df14b287
17:41 🔗 chazchaz has joined #archiveteam
17:41 🔗 atlogbot has joined #archiveteam
17:42 🔗 Jonimus has quit IRC (Read error: Operation timed out)
17:45 🔗 dcmorton has joined #archiveteam
17:47 🔗 Nertsy has joined #archiveteam
17:49 🔗 lemonkey has joined #archiveteam
17:50 🔗 Laverne has joined #archiveteam
17:50 🔗 dserodio has joined #archiveteam
17:50 🔗 Nertsy has quit IRC (Client Quit)
17:52 🔗 Nertsy has joined #archiveteam
17:55 🔗 Start has joined #archiveteam
17:58 🔗 yotta has joined #archiveteam
18:04 🔗 Emcy has joined #archiveteam
18:05 🔗 antomatic So, Blogger...
18:06 🔗 antomatic any ideas?
18:06 🔗 antomatic #flogger
18:06 🔗 antomatic Archive ALL of the naughty bits!
18:07 🔗 antomatic And everything else, for balance.
18:07 🔗 antomatic And because it's easier.
18:07 🔗 antomatic We have 28 days
18:18 🔗 ohhdemgir https://imgur.com/a/SjcgE
18:18 🔗 Start looks like we have an easy way to discover all the blogs: https://www.blogger.com/profile/5618947
18:18 🔗 Start we'll have to grab all the profile pages first
18:22 🔗 Start here's the highest one i could find: https://www.blogger.com/profile/35217655
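If the numeric profile IDs really are enumerable, the ID space could be cut into work items along the lines of the sketch below. The URL pattern is taken from the links above; everything else (the helper names `work_items` and `urls_for`, the chunk size, and the assumption that IDs are dense sequential integers) is hypothetical and unverified.

```python
PROFILE_URL = "https://www.blogger.com/profile/{}"


def work_items(first_id: int, last_id: int, size: int):
    """Split the inclusive ID range into (start, end) chunks of at most `size` IDs."""
    return [(start, min(start + size - 1, last_id))
            for start in range(first_id, last_id + 1, size)]


def urls_for(item):
    """Expand one (start, end) work item into the profile URLs it covers."""
    start, end = item
    return [PROFILE_URL.format(n) for n in range(start, end + 1)]


# Small demonstration range; a real run would use far larger bounds.
items = work_items(1, 10, 4)
print(items)
print(urls_for((5618947, 5618948)))
```

Many IDs in such a range may not resolve to a profile at all, so each fetched page would still need to be checked before any blogs are extracted from it.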
18:23 🔗 antomatic nice!
18:30 🔗 schbirid Start: that highest one says "since April 2007" to me
18:39 🔗 mschfr has joined #archiveteam
18:42 🔗 Start has quit IRC (Disconnected.)
18:50 🔗 Kenshin google switched to using 20 digit numbers later on
18:50 🔗 antomatic ouch
18:53 🔗 Kenshin and integrated blogger + google+
18:58 🔗 antomatic Still, there's a good base to start from. And each blog discovered will link to other blogs, so a worthwhile self-perpetuating crawl+download+discovery cycle soon gets going
18:59 🔗 antomatic (hopefully)
18:59 🔗 thechip has quit IRC (Ping timeout: 252 seconds)
19:01 🔗 Ymgve__ is now known as Ymgve
19:02 🔗 Kenshin well, if we dump an initial list of blogs
19:02 🔗 Kenshin then for every blog, extract the next-blog url and inject into tracker if it doesn't already exist
19:02 🔗 Kenshin the project could be self-sustaining until there's no more new items
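The self-sustaining loop described above is essentially a breadth-first traversal with a "seen" set. The sketch below is a toy model, not tracker code: `discover` is a hypothetical function, and the `LINKS` dictionary stands in for actually fetching a blog and extracting its next-blog URLs.

```python
from collections import deque


def discover(seed_blogs, next_blogs):
    """Breadth-first discovery; next_blogs(blog) returns blogs linked from it."""
    seen = set(seed_blogs)
    queue = deque(seed_blogs)
    order = []
    while queue:
        blog = queue.popleft()
        order.append(blog)
        for linked in next_blogs(blog):
            # Only inject blogs the tracker has not seen before.
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return order


# Toy link graph with made-up blog names.
LINKS = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["c"]}
found = discover(["a"], lambda blog: LINKS.get(blog, []))
print(found)
```

The loop ends when the queue is empty, which matches "until there's no more new items"; blogs that nothing links to are never reached, so the quality of the initial list still matters.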
19:05 🔗 SketchCow Medium invented the blog post.
19:09 🔗 antomatic All we need now is a rockstar to express this project in beautiful, running warrior code
19:10 🔗 wp494_ has joined #archiveteam
19:10 🔗 wp494_ has quit IRC (Excess Flood)
19:10 🔗 wp494_ has joined #archiveteam
19:10 🔗 wp494_ has quit IRC (Excess Flood)
19:11 🔗 edsu has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 BlueMaxim has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 xtr-201 has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 pikhq has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 offby1 has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 underscor has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 wp494 has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 ats_ has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 sivoais has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 lytv has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 dx has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 rduser has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 DFJustin has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 SadDM has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 will__ has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 matthusby has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 maltris has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 useretail has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 Marc has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 torvik has quit IRC (ircd.shaw.ca irc.shaw.ca)
19:11 🔗 wp494_ has joined #archiveteam
19:13 🔗 ats has joined #archiveteam
19:13 🔗 edsu_ has joined #archiveteam
19:14 🔗 nertzy3 has joined #archiveteam
19:14 🔗 nertzy2 has quit IRC (Ping timeout: 240 seconds)
19:15 🔗 Kenshin 1 month, god knows how many blogs
19:17 🔗 Kenshin i'm quite curious how google is going to decide what's sexual and what isn't
19:17 🔗 Nertsy has quit IRC (Ping timeout: 512 seconds)
19:19 🔗 signius has quit IRC (Read error: Operation timed out)
19:19 🔗 antomatic select top 25000000 blogs FROM tbl_allblogs ORDER BY hits DESC
19:19 🔗 Nertsy has joined #archiveteam
19:20 🔗 antomatic Apparently at the moment it's any blog that the owner (or someone else) has already flagged as adult
19:21 🔗 garyrh Archive Team: We don't wait for the DROP TABLE.
19:21 🔗 antomatic Interestingly Google says 'explicit images and video'
19:21 🔗 antomatic so presumably the text can be as rumpatious as you like
19:22 🔗 antomatic 'sexually explicit or graphic nude images or video'
19:27 🔗 BlueMaxim has joined #archiveteam
19:29 🔗 lytv has joined #archiveteam
19:32 🔗 raylee has quit IRC (Ping timeout: 240 seconds)
19:34 🔗 sivoais has joined #archiveteam
19:34 🔗 signius has joined #archiveteam
19:36 🔗 raylee has joined #archiveteam
19:38 🔗 DFJustin has joined #archiveteam
19:38 🔗 swebb sets mode: +o DFJustin
19:41 🔗 Marc has joined #archiveteam
19:45 🔗 Marc has quit IRC (Ping timeout: 240 seconds)
19:46 🔗 rduser has joined #archiveteam
19:49 🔗 raylee has quit IRC (Ping timeout: 240 seconds)
19:50 🔗 Nertsy Hey. I'm trying to archive various tumblr blogs and I'm wondering if you guys know of any tools to do that. I would prefer it to be an incremental backup so I don't have duplicate files everywhere.
19:51 🔗 raylee has joined #archiveteam
19:51 🔗 xmc archivebot does that
19:52 🔗 balrog http://boutofcontext.com/tumblr_backup.php ... now that looks interesting
19:52 🔗 Marc has joined #archiveteam
19:52 🔗 xmc hm yeah
19:52 🔗 xmc Nertsy: you can also grab from the rss feed
19:52 🔗 balrog (archivebot takes longer than usual for tumblr)
19:53 🔗 useretail has joined #archiveteam
19:53 🔗 underscor has joined #archiveteam
19:53 🔗 dx has joined #archiveteam
19:53 🔗 matthusby has joined #archiveteam
19:53 🔗 torvik has joined #archiveteam
19:53 🔗 irc.shaw.ca sets mode: +o underscor
19:53 🔗 swebb sets mode: +o underscor
19:54 🔗 xtr-201 has joined #archiveteam
19:54 🔗 Nertsy >http://boutofcontext.com/tumblr_backup.php
19:54 🔗 Nertsy Source anywhere?
19:55 🔗 SadDM has joined #archiveteam
19:55 🔗 swebb sets mode: +o SadDM
19:56 🔗 pikhq has joined #archiveteam
19:57 🔗 will__ has joined #archiveteam
19:58 🔗 mschfr has quit IRC (Ping timeout: 240 seconds)
20:24 🔗 thechip has joined #archiveteam
20:53 🔗 bolex83 has joined #archiveteam
20:54 🔗 bolex83 has quit IRC (Client Quit)
20:55 🔗 schbirid has quit IRC (Quit: Leaving)
21:12 🔗 atlogbot has quit IRC (Remote host closed the connection)
21:12 🔗 swebb has quit IRC (badcheese.com - where crap sometimes gets done)
21:14 🔗 swebb has joined #archiveteam
21:24 🔗 BlueMaxim has quit IRC (Quit: Leaving)
21:32 🔗 xmc sets mode: +o swebb
22:00 🔗 BlueMaxim has joined #archiveteam
22:07 🔗 sankin has quit IRC (Leaving.)
22:52 🔗 wp494_ is now known as wp494
23:33 🔗 xk_id has joined #archiveteam
