[00:03] *** mutoso has joined #archiveteam
[00:05] I've got GameTrailers items which will finish uploading in just under two weeks. [Which is not actually a problem, obvs.]
[00:11] *** kyan has joined #archiveteam
[00:37] *** RichardG has quit IRC (Ping timeout: 250 seconds)
[00:54] *** n00b972 has joined #archiveteam
[00:55] not sure if this has been posted already but http://www.factmag.com/2016/02/11/soundcloud-financial-report-44m-losses/
[00:56] Being discussed in #soundclowns, come on by
[00:56] ah, thanks!
[00:56] *** n00b972 is now known as mhazinsk_
[01:00] snape: mhazinsk_: #soundclown, not #soundclowns :)
[01:01] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[01:02] *** wyatt8740 has joined #archiveteam
[01:06] *** mhazinsk has joined #archiveteam
[01:06] *** mhazinsk_ has quit IRC (Quit: http://chat.efnet.org )
[01:38] *** RichardG has joined #archiveteam
[01:46] *** superkuh has quit IRC (Ping timeout: 252 seconds)
[01:50] *** RichardG has quit IRC (Ping timeout: 633 seconds)
[01:52] *** RichardG has joined #archiveteam
[01:52] *** JesseW has joined #archiveteam
[02:07] *** lunG has quit IRC (Ping timeout: 250 seconds)
[02:12] *** brayden_ has joined #archiveteam
[02:12] *** swebb sets mode: +o brayden_
[02:14] since gametrailers is (currently) exhausted, where should i direct my warrior?
[02:15] *** vitzli has joined #archiveteam
[02:17] *** brayden has quit IRC (Read error: Operation timed out)
[02:26] Coderjoe: fotolog
[04:08] I have a rather large proxmox dedicated server sitting idle right now, hmmmm
[04:08] it's not mine, but I admin it
[04:19] *** kyan has quit IRC (Leaving)
[04:27] *** kyan has joined #archiveteam
[04:41] *** megaminxw has joined #archiveteam
[04:41] *** Microguru has quit IRC (Read error: Connection reset by peer)
[04:51] TheKiwi: feel free to point it at urlteam or fotolog...
[04:54] *** kyan has quit IRC (Leaving)
[04:59] *** superkuh has joined #archiveteam
[05:06] *** JetBalsa has quit IRC (Read error: Connection reset by peer)
[05:08] *** Atom__ has quit IRC (Read error: Connection reset by peer)
[05:11] *** Infreq_ has quit IRC (Ping timeout: 252 seconds)
[05:12] *** Infreq has joined #archiveteam
[05:24] *** acridAxid has quit IRC (Quit: marauder)
[05:26] *** acridAxid has joined #archiveteam
[05:40] *** Sk1d has quit IRC (Ping timeout: 200 seconds)
[05:46] *** Sk1d has joined #archiveteam
[05:52] *** vitzli has quit IRC (Leaving)
[05:56] *** WinterFox has joined #archiveteam
[06:31] *** nickname_ has quit IRC (Ping timeout: 300 seconds)
[06:40] *** GLaDOS has quit IRC (Read error: Operation timed out)
[06:41] *** GLaDOS has joined #archiveteam
[07:15] *** fpoee has joined #archiveteam
[07:20] *** plog99 has quit IRC (Read error: Operation timed out)
[07:27] *** vitzli has joined #archiveteam
[07:39] *** yipdw has quit IRC (Ping timeout: 633 seconds)
[07:57] *** JesseW has quit IRC (Quit: Leaving.)
[08:06] *** yipdw has joined #archiveteam
[08:12] *** yipdw has quit IRC (Ping timeout: 246 seconds)
[08:19] *** JesseW has joined #archiveteam
[08:43] Load on FOS is now up to 250
[08:43] This is what happens when I increase rsync connections.
[08:43] *** RichardG has quit IRC (Ping timeout: 250 seconds)
[09:01] *** schbirid has joined #archiveteam
[09:01] *** yipdw has joined #archiveteam
[10:35] *** JesseW has quit IRC (Quit: Leaving.)
[10:36] *** zino_ is now known as zino
[11:19] Faster, harder, scooter
[11:35] *** dan- has quit IRC (Read error: Operation timed out)
[11:49] *** oldcad has joined #archiveteam
[12:15] *** dserodio has quit IRC (Read error: Operation timed out)
[12:18] *** dserodio has joined #archiveteam
[12:19] *** dserodio has quit IRC (Read error: Operation timed out)
[12:22] *** dserodio has joined #archiveteam
[12:26] *** megaminxw has quit IRC (Quit: Leaving.)
[12:29] *** megaminxw has joined #archiveteam
[13:00] Can someone do an ao on http://www.bbc.co.uk/news/uk-35561645 for me please
[13:03] *** dan- has joined #archiveteam
[13:07] done
[13:08] Thanks. Normally NewsBuddy would get it
[13:11] *** WinterFox has quit IRC (Remote host closed the connection)
[13:13] *** weles has joined #archiveteam
[13:35] *** nickname_ has joined #archiveteam
[13:42] *** Stilett0 has joined #archiveteam
[13:44] *** Stiletto has quit IRC (Read error: Operation timed out)
[13:56] *** Atom__ has joined #archiveteam
[14:00] *** RichardG has joined #archiveteam
[14:09] *** Jon has quit IRC (Ping timeout: 260 seconds)
[14:10] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[14:10] *** GLaDOS has joined #archiveteam
[14:12] *** jmtd has joined #archiveteam
[14:29] *** philpem has joined #archiveteam
[14:30] *** bzc6p has joined #archiveteam
[14:30] *** swebb sets mode: +o bzc6p
[14:34] SketchCow: I don't know what you've been doing exactly with the rsync limit, but a bit after your last announcement the limit started to visibly affect all projects.
[14:35] A suggestion of mine would be suspending the myVIP project, even though it's not the core of the problem.
[14:35] arkiver: ^
[14:35] *** GLaDOS has quit IRC (Read error: Operation timed out)
[14:37] *** GLaDOS has joined #archiveteam
[14:44] *** megaminxw has quit IRC (Quit: Leaving.)
[14:54] *** zyphlar_ has quit IRC (Quit: Connection closed for inactivity)
[14:59] *** Start has quit IRC (Quit: Disconnected.)
[15:02] *** redlob has quit IRC (Ping timeout: 260 seconds)
[15:05] *** rspn has joined #archiveteam
[15:06] If I wanted to make sure to mirror an entire website (but no outgoing links), would I simply use wget -m or something else?
[15:07] rspn: We recommend using the WARC format for archiving web sites
[15:07] It saves the HTTP headers in addition to the content
[15:07] And there are many programs like pywb that let you browse the WARC like archived pages in the Wayback Machine
[15:07] MrRadar: so this http://www.archiveteam.org/index.php?title=Wget#Creating_WARC_with_wget
[15:07] WARC is what I want anyway
[15:08] Yes, exactly
[15:08] I personally use wpull but wget is good too
[15:08] Alright, great! Thanks
[15:11] *** redlob has joined #archiveteam
[15:11] *** morbus_ has quit IRC (Quit: http://www.disobey.com/)
[15:30] *** PuppyCock has quit IRC (Ping timeout: 300 seconds)
[15:32] *** PuppyCock has joined #archiveteam
[15:32] *** rspn has quit IRC (Ping timeout: 255 seconds)
[15:48] *** Start has joined #archiveteam
[15:48] *** PuppyCock has quit IRC (Ping timeout: 300 seconds)
[15:54] *** WapCapLet has joined #archiveteam
[16:56] *** vitzli has quit IRC (Leaving)
[17:01] *** JesseW has joined #archiveteam
[17:04] *** VADemon has joined #archiveteam
[17:05] My AlJazeraAmerica crawl is 4GB in and still rolling.
[17:07] *** Start has quit IRC (Quit: Disconnected.)
[17:08] How many requests has it captured so far?
[17:09] *** JesseW has quit IRC (Quit: Leaving.)
[17:11] https://www.evernote.com/l/ACnalPqDNFNIu73LL1oFZR5COgQgEbyEp64
[17:12] Nice
[17:12] Yup. Nicely chugging along. :)
[17:12] I'm not sure if it's grabbing everything
[17:13] what are your seed URLs?
[17:13] http://america.aljazeera.com/
[17:14] Does Heritrix also use the sitemap?
[17:14] I believe so - and it honors robots.txt
[17:14] Here's the 3rd party domains that it's also hitting from the crawl report: https://www.evernote.com/l/ACkp16aTeyxLxJEABrMhAXIhWR5EbjgC9hc
[17:38] *** JesseW has joined #archiveteam
[17:57] *** JesseW has quit IRC (Quit: Leaving.)
[17:58] FOS is on track to fill.
[17:59] I killed off a few things, and the load average is (slowly) dropping, but the time it takes to make and upload a chunk, people are just uploading so much.
[18:01] *** notjack has joined #archiveteam
[18:05] Hey everyone, I'm back for the time being because work has ended.
[18:06] Just wondering if anyone's tracking SoundCloud, because it's been reported as in financial trouble, and contains lots of music, a large amount of which is not available elsewhere on the internet.
[18:07] Yep, head over to #soundclown
[18:07] We're planning to do a discovery crawl and possibly grab the most popular tracks
[18:07] Since SoundCloud is too big even for us to grab everything
[18:08] "even for us" - cute :)
[18:09] lol
[18:09] We should develop a new scheme to turn data files into high speed barcodes, which can be uploaded and stored forever on the Infinite Storage System that is YouTube. :)
[18:10] hmm
[18:10] antomatic, encoding data into video - interesting
[18:11] hmmm :D
[18:11] Back in the 80s and 90s there were cards that let you back up data to VHS tapes
[18:11] can we just upload all the audio to youtube, privately? :D
[18:12] I had one, MrRadar. Was an interesting old thing.
[18:12] Maybe time to move over to #archiveteam-bs
[18:12] Nod
[18:12] SmileyG: Doesn't stop it being contentID flagged, sadly. :)
[18:12] [nods]
[18:14] *** lunG has joined #archiveteam
[18:24] I took the rsync connections count for FOS down to 100 for a while.
[18:35] *** xekc has joined #archiveteam
[18:36] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[18:36] why
[18:37] Oh, let him in.
[18:37] xekc: yahoosucks
[18:37] HUZZAH, KNAVE, THE WORD IS YAHOOSUCKS
[18:37] thanks! :)
[18:45] *** Stilett0 has quit IRC (Read error: Operation timed out)
[18:46] Do us proud
[18:46] SketchCow: How long has the secret word been yahoosucks?
[18:47] since forever
[18:47] why?
[18:47] because yahoo sucked forever
[18:47] Because they're still living up to that.
[18:47] it's been that since we started having antispam on the wiki
[18:47] so, basically forever
[18:48] remember, archiveteam more or less started with geocities
[18:48] *** Start has joined #archiveteam
[18:48] we have always been at war with eurasia
[18:48] Yeah, true.
[18:48] -> -bs
[18:50] *** beardicus has quit IRC (bye)
[18:54] *** beardicus has joined #archiveteam
[18:58] *** beardicus has quit IRC (Client Quit)
[19:00] *** mismatch has joined #archiveteam
[19:01] *** beardicus has joined #archiveteam
[19:01] *** phuzion has quit IRC (Remote host closed the connection)
[19:02] *** phuzion has joined #archiveteam
[19:02] *** Start has quit IRC (Quit: Disconnected.)
[19:11] NewsBuddy is back, and ERROR FREE!
[19:12] *** bzc6p has left
[19:17] famous last words
[19:25] *** Start has joined #archiveteam
[19:28] added a section in the wiki on FreeFeed.net - hope didn't write anything too stupid there http://www.archiveteam.org/index.php?title=FriendFeed#Archiving
[19:36] *** RichardG has quit IRC (Ping timeout: 492 seconds)
[19:41] I deleted a 330gb project I was working on. That puts it at 84% filled, with 1.2tb of space free.
[19:46] *** Stiletto has joined #archiveteam
[19:54] chfoo, its looking like the rsync server is maxing out on connection preventing friendsreunited grab from being able to upload data
[19:55] @ERROR: max connections (100) reached -- try again later
[20:02] *** RichardG has joined #archiveteam
[20:12] HI.
[20:12] --------------------------------------------------------------------------------------
[20:12] HERE IS THE DEAL. FOS IS OVERLOADED BECAUSE WE ARE DOING 5 MAJOR THINGS TO IT.
[20:12] RSYNC IS SET TO 100 CONNECTIONS WHILE (HOPEFULLY) IT DOESN'T RUN OUT OF DISK SPACE
[20:13] BUT HOO BOY, WE ARE AT 84 PERCENT FULL AND 1.2T LEFT ON THE DRIVE AND YOU GUYS ARE
[20:13] BLOWING THAT SHIT OUT OF THE WATER
[20:13] --------------------------------------------------------------------------------------
[20:14] I have to travel now. I will periodically check but the chances of a nightmare scenario are non-zero
[20:14] In the future, we need to see about a buffer box
[20:14] (again)
[20:14] And then it either does the work itself or it gives it to FOS later.
[20:14] FOS is an OK machine, but it's a VM and it hates everybody
[20:15] SketchCow: Want me to see if I can track down someone who can offer a few TB of space in the meantime as an alternative rsync target?
[20:15] It better be a crapload
[20:15] coordinate with arkiver
[20:15] okie dokie
[20:18] *** xekc has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
[20:18] I'm new to running ATWarrior.. I'm getting No item received... after selecting "current project". is that normal?
[20:20] weles: WELCOME!
[20:20] what project is current project?
[20:21] sometimes there is downtime and projects will either end or get paused, so sometimes you'll hit no items received
[20:21] yeah
[20:21] looks like gametrailers is selected
[20:21] weles: http://tracker.archiveteam.org/gametrailers/
[20:21] this is the tracker view
[20:22] top left it says "items"
[20:22] on the "to do" is 0
[20:22] so no items left in tracker
[20:23] thanks.. is any other project currently active?
[20:23] or what's the priority
[20:23] fotolog, it looks to me
[20:23] no...
[20:23] thats not running, it looks
[20:24] I dont know, weles
[20:24] hang around
[20:24] fos has been capped so some projects have slowed down
[20:24] weles: If you're looking to put your warrior to use for the time being, feel free to set it to URLTeam
[20:25] phuzion: i tried that but i'm getting 404 , no items available :)
[20:26] oh right
[20:26] * phuzion checks the dashboard
[20:28] btw. does the archiveteam archive its irc channel? are there any logs anywhere?
[20:29] http://www.pcworld.com/article/3032490/internet/new-sourceforge-owners-kill-contentious-devshare-bloatware-program.html oooh err
[20:31] *** jut has joined #archiveteam
[20:38] weles: http://archive.fart.website/bin/irclogger_logs
[20:38] MrRadar: thanks
[20:39] i wouldnt trust new sourceforge owners
[20:40] I have the old archive here: https://badcheese.com/~steve/atlogs/
[20:40] *** BubuAnabe has joined #archiveteam
[20:43] *** atlogbot has joined #archiveteam
[20:44] SimpBrain: I wouldn't either. They also own SlashDot which we should probably grab as well if we ever decide to go after SF
[20:44] we partly did, till we were told not to
[20:45] *** Start has quit IRC (Quit: Disconnected.)
[20:46] BTW, there is already an archive.org archive in the wayback machine for america.aljazeera.com: https://web.archive.org/web/*/america.aljazeera.com
[20:46] and it's updated daily from the looks of things.
[20:46] Same with slashdot: https://web.archive.org/web/*/http://slashdot.org
[20:48] *** signius has quit IRC (Read error: Operation timed out)
[20:49] Same with Soundcloud.
[20:51] aljazeera and slashdot look fantastically covered in the wayback machine, but soundcloud doesn't - probably because of the amount of javascript and whatnot - soundcloud is probably a single-page-app which is not easy for a traditional crawler to grab.
[20:55] Google's closing Picasa in favor of Google Photos. http://googlephotos.blogspot.com.ar/2016/02/moving-on-from-picasa.html
[20:56] oh joy
[20:56] ruh oh
[20:57] If you have photos or videos in a Picasa Web Album today, the easiest way to still access, modify and share most of that content is to log in to Google Photos, and all your photos and videos will already be there.
[20:57] nothing being nuked
[20:57] Looks like they're merging
[20:58] Which is what they did when they shut down Google Video too, so nothing to worry about.
[20:58] ISTR that they didn't have a migration plan for Google Video at first, though..
[20:58] Didn't it take a fair bit of cajoling to get them to fold the Google Video library into YouTube instead of deleting it?
[20:59] Wasn't it all "Get your shit before this date, no we won't upload it to youtube for you"?
[20:59] *** signius has joined #archiveteam
[20:59] Yea, initially they were going to nuke it all, but there was enough pushback that they switched gears.
[20:59] I'm certain Sketchcow makes that exact point in one of his talks
[20:59] *** weles has quit IRC (Read error: Operation timed out)
[20:59] ArchiveTeam made Google look bad, questions raised internally, policy changed
[20:59] result.
[21:00] Google customers probably complained too.
[21:00] That's what the AT page for Google Video says http://archiveteam.org/index.php?title=Google_Video
[21:00] [nods]
[21:04] SimpBrain: still, Picasa Web is a source for freely licensed images which 1) will disappear in most cases, 2) once migrated to Google Photos lose their copyright license marking.
[21:06] *** mismatch has quit IRC (Ping timeout: 260 seconds)
[21:11] *** WinterFox has joined #archiveteam
[21:11] *** W1nterFox has joined #archiveteam
[21:11] *** WinterFox has quit IRC (Read error: Connection reset by peer)
[21:11] *** W1nterFox has quit IRC (Client Quit)
[21:12] *** WinterFox has joined #archiveteam
[21:31] *** megaminxw has joined #archiveteam
[21:46] *** jut has quit IRC (Read error: Connection reset by peer)
[22:08] *** PuppyCock has joined #archiveteam
[22:09] *** WapCapLet has quit IRC (Read error: Operation timed out)
[22:15] *** espes__ has quit IRC (Read error: Operation timed out)
[22:15] does anyone know what's up with archiving Quora?
[22:22] *** SN4T14 has joined #archiveteam
[22:22] *** SN4T14 has quit IRC (Connection closed)
[22:25] *** Stiletto has quit IRC (Read error: Connection reset by peer)
[22:26] *** Stiletto has joined #archiveteam
[22:32] *** casdr has joined #archiveteam
[22:47] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[23:02] fpoee: it's pretty annoying because they deploy IP bans for crawling too many pages/sec, where it isn't very many pages at all
[23:02] they might also do them manually, it's hard to tell
[23:10] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:11] can it even be crawled properly? it seems pages are loaded in parts
[23:13] maybe i'm lucky that my ips didnt get banned yet. my crawl is running at about 30GB now
[23:14] does anyone crawl it?
[23:14] *did or does anyone else?
[23:19] ArchiveBot has partially crawled it a few times
[23:19] yeah, you need some kind of browser engine to get all the content and nobody has done that afaik
[23:20] but there is plenty of static HTML
[23:20] you can try it in Firefox with NoScript blocking JS
[23:24] its a very messy site to grab. how big have those crawls gotten?
[23:25] i guess that was a long time ago
[23:27] *** nickname_ has quit IRC (Ping timeout: 300 seconds)
[23:39] *** nickname_ has joined #archiveteam
[23:41] *** [phire] has quit IRC (Quit: ZNC - http://znc.in)
[23:50] *** SN4T14 has joined #archiveteam
[23:50] *** SN4T14 has quit IRC (Connection closed)
[23:53] *** [phire] has joined #archiveteam