[00:06] *** jmad980 has quit IRC (Ping timeout: 369 seconds)
[00:08] Still uploading.
[00:08] We're past 675
[00:15] *** RichardG has joined #archiveteam
[00:25] *** megaminxw has joined #archiveteam
[00:28] *** jmad980 has joined #archiveteam
[00:42] *** JetBalsa has joined #archiveteam
[01:09] *** MMovie2 has joined #archiveteam
[01:11] *** MMovie has quit IRC (Read error: Operation timed out)
[01:15] *** Ravenloft has quit IRC (Read error: Connection reset by peer)
[01:47] *** JesseW has quit IRC (Leaving.)
[01:49] *** JesseW has joined #archiveteam
[01:54] so i gathered a list of all the items in jux's s3 bucket (http://user-zip-files.s3.amazonaws.com) and put it into archivebot
[01:54] jux is being saved over a year after it died
[01:55] what is jux?
[01:56] http://www.archiveteam.org/index.php?title=Jux
[01:56] it was a blogging website
[01:58] nice. What's the format of the zip files?
[01:59] it appears that each zip file contains all the images on a user's blog
[02:00] most of the posts were grabbed before it shut down: https://archive.org/details/jux_posts_to_nov_24
[02:01] nice of them to keep paying for AWS for another year. :-)
[02:05] *** JesseW has quit IRC (Leaving.)
[02:12] *** rctbeast has joined #archiveteam
[02:16] *** schbirid2 has joined #archiveteam
[02:19] *** schbirid has quit IRC (Read error: Operation timed out)
[02:34] *** JesseW has joined #archiveteam
[02:37] *** philpem has quit IRC (Ping timeout: 260 seconds)
[02:41] *** Ravenloft has joined #archiveteam
[02:49] *** Emcy has quit IRC (Ping timeout: 252 seconds)
[03:06] *** ohhdemgir has quit IRC (Remote host closed the connection)
[03:11] *** ohhdemgir has joined #archiveteam
[03:29] *** megaminxw has quit IRC (Quit: Leaving.)
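Editor's note on the 01:54 message: the log does not say how the bucket list was gathered. One plausible approach, assuming the bucket allows anonymous listing, is to page through the XML listing that S3 returns at the bucket root and turn each key into a URL. The function names below (parse_listing, key_url, list_bucket) are hypothetical and this is only a sketch, not the method actually used.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 ListBucketResult documents.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def parse_listing(xml_text):
    """Return (keys, is_truncated) from one page of a bucket listing."""
    root = ET.fromstring(xml_text)
    keys = [el.text for el in root.iter(S3_NS + "Key")]
    truncated = root.findtext(S3_NS + "IsTruncated", default="false") == "true"
    return keys, truncated


def key_url(base_url, key):
    """Build the download URL for one key (hypothetical helper)."""
    return base_url.rstrip("/") + "/" + urllib.parse.quote(key)


def list_bucket(base_url):
    """Yield every key, following pagination with the 'marker' parameter.

    Assumes the bucket permits anonymous listing; needs network access.
    """
    marker = ""
    while True:
        url = base_url
        if marker:
            url += "?" + urllib.parse.urlencode({"marker": marker})
        with urllib.request.urlopen(url) as resp:
            keys, truncated = parse_listing(resp.read().decode("utf-8"))
        yield from keys
        if not truncated or not keys:
            break
        # Without a delimiter, the last key of a page is the next marker.
        marker = keys[-1]
```

Writing one key_url(...) result per line would give a plain URL list of the kind that could be handed to a crawler.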
[03:57] *** acridAxid has joined #archiveteam
[04:04] *** dashcloud has quit IRC (Remote host closed the connection)
[04:06] *** dashcloud has joined #archiveteam
[04:07] *** dashcloud has quit IRC (Remote host closed the connection)
[04:09] *** dashcloud has joined #archiveteam
[04:59] *** JetBalsa has quit IRC (Read error: Connection reset by peer)
[05:11] Start: nice!
[05:22] *** rctbeast has quit IRC (Ping timeout: 240 seconds)
[05:26] *** acridAxid has quit IRC (marauder)
[05:29] *** acridAxid has joined #archiveteam
[05:34] *** VADemon has joined #archiveteam
[05:42] Great work, Start.
[05:59] *** megaminxw has joined #archiveteam
[07:01] *** WinterFox has joined #archiveteam
[07:30] *** FAMAS has joined #archiveteam
[07:32] greetings to all, as this group is dedicated for purposes of data archival, this user is posting requests for volunteers who will participate in actions of video screenshotting contents displayed via digital devices
[08:16] *** FAMAS has quit IRC (Quit: http://chat.efnet.org (EOF))
[08:32] *** REiN^ has joined #archiveteam
[08:33] *** JesseW has quit IRC (Read error: Operation timed out)
[08:39] maybe, if you can interest someone in your project
[08:40] *** Ghost_of_ has joined #archiveteam
[09:12] *** BlueMaxim has quit IRC (Quit: Leaving)
[09:13] *** philpem has joined #archiveteam
[09:22] *** vOYtEC_ has quit IRC (Read error: Connection reset by peer)
[09:36] *** scyther has joined #archiveteam
[10:13] *** vOYtEC has joined #archiveteam
[11:53] *** Emcy has joined #archiveteam
[12:02] *** VADemon_ has joined #archiveteam
[12:07] *** VADemon has quit IRC (hub.se irc.efnet.pl)
[12:07] *** dashcloud has quit IRC (hub.se irc.efnet.pl)
[12:07] *** godane has quit IRC (hub.se irc.efnet.pl)
[12:18] *** lytv has quit IRC (Read error: Connection reset by peer)
[12:21] *** lytv has joined #archiveteam
[12:24] *** SimpBrain has quit IRC (Read error: Operation timed out)
[12:27] *** dashcloud has joined #archiveteam
[12:32] *** SimpBrain has joined #archiveteam
[12:39] *** godane has joined #archiveteam
[12:47] *** godane has quit IRC (Excess Flood)
[12:49] *** godane has joined #archiveteam
[13:04] *** zino_ has joined #archiveteam
[13:08] Hmm. Just realized the combined potential disk size of the warrior images for VirtualBox is 68GiB. That seems excessive. Mine has ballooned to 62G so far.
[13:12] *** nertzy2 has joined #archiveteam
[13:30] zino_: what seems excessive about it? You need 8GB for the system partition, and 60GB for the data partition.
[13:30] ps, warrior questions generally go in #warrior
[13:34] *** nertzy2 has quit IRC (Quit: This computer has gone to sleep)
[13:36] *** megaminxw has quit IRC (Quit: Leaving.)
[14:19] *** HarryCros has joined #archiveteam
[14:19] *** wp494_ has joined #archiveteam
[14:20] *** Emcy_ has joined #archiveteam
[14:20] *** RichardG_ has joined #archiveteam
[14:20] *** WinterFox has quit IRC (Remote host closed the connection)
[14:21] *** Microguru has quit IRC (Ping timeout: 250 seconds)
[14:21] *** wp494 has quit IRC (Ping timeout: 250 seconds)
[14:21] *** lytv has quit IRC (Ping timeout: 250 seconds)
[14:21] *** RichardG has quit IRC (Ping timeout: 250 seconds)
[14:21] *** Gfy has quit IRC (Ping timeout: 250 seconds)
[14:21] *** alard has quit IRC (Ping timeout: 250 seconds)
[14:21] *** diacope has quit IRC (Ping timeout: 250 seconds)
[14:22] *** Emcy has quit IRC (Ping timeout: 250 seconds)
[14:22] *** HCross has quit IRC (Ping timeout: 250 seconds)
[14:22] *** superkuh_ has quit IRC (Ping timeout: 250 seconds)
[14:22] *** lytv has joined #archiveteam
[14:28] *** Gfy has joined #archiveteam
[14:32] *** alard has joined #archiveteam
[14:32] *** swebb sets mode: +o alard
[14:35] *** superkuh_ has joined #archiveteam
[14:36] *** Microguru has joined #archiveteam
[14:49] *** Ghost_of_ has quit IRC (Quit: Leaving)
[15:29] phuzion: Excessive as in there is no need for it. In what situation should a casual user cache 60GiB of data before uploading?
[15:31] *** RichardG_ is now known as RichardG
[15:34] zino_: probably because 60 GB is enough to cover any kind of project, rather than needing to have separate images for different kinds of projects
[15:35] for text or image projects, 60 GB is very likely overkill, but you'll never get there, whereas for video projects, 60 GB is large enough to keep you from constantly running out of space
[15:36] the issue with large sizes like that is that people with slow uploads like me will eventually have a large queue of files to upload
[15:38] I consider it a bigger problem that a casual contributor suddenly finds his 240G C: SSD maxed out and deletes the whole thing. That will eventually happen even if he never caches more than a gig at any time.
[15:39] (This is not an actual problem for me. I'm mostly bike-shedding.)
[15:39] I don't think anyone is doing this casually
[15:39] it was discussed previously
[15:39] tho i wonder why it's expanding to use the entire 60GB from the off
[15:39] I thought it expanded as needed.
[15:41] SmileyG: In a perfect world the same file structure/blocks would be used every time, but in practice you eventually write to most blocks. Deleting data from blocks will not shrink the image.
[15:41] yah
[15:42] oh right, you've been running it awhile then?
[15:42] there is an option to 'shrink' it again btw
[15:42] It's probably been running for 2 years or so.
[15:42] VirtualBox's CLI has something for that I think. Will check when I get some time.
[15:52] Meh. Waiting for a compile now anyway. I'll do it and pester #warrior with the result.
[15:54] *** Atom__ has joined #archiveteam
[16:16] Nope. Giving up on that. The warrior is a vmdk, not vdi. So VirtualBox's tools can't shrink it. Would involve converting it or installing VMware tools.
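Editor's note on the 16:16 message: the "converting it" route would roughly mean cloning the VMDK to a VDI and then compacting the VDI. The sketch below only builds the two VBoxManage command lines and does not run them. The file names are placeholders and shrink_commands is a hypothetical helper. Older VirtualBox releases spell the subcommands clonehd and modifyhd. Compaction generally reclaims only blocks that were zeroed inside the guest first, and the VM would still need the new disk attached in place of the old one.

```python
import shlex


def shrink_commands(vmdk_path, vdi_path):
    """Build the VBoxManage invocations for convert-then-compact.

    Returns a list of argument lists; nothing is executed here.
    """
    return [
        # Step 1: clone the VMDK into a new VDI image.
        ["VBoxManage", "clonemedium", "disk", vmdk_path, vdi_path,
         "--format", "VDI"],
        # Step 2: compact the VDI (effective only for zeroed blocks).
        ["VBoxManage", "modifymedium", "disk", vdi_path, "--compact"],
    ]


if __name__ == "__main__":
    # Placeholder file names; print the commands for manual review.
    for cmd in shrink_commands("warrior-data.vmdk", "warrior-data.vdi"):
        print(shlex.join(cmd))
```

Printing the commands instead of running them keeps the sketch safe to try. The simpler alternative suggested later in the channel is to shut the warrior down cleanly, delete the data partition file, and recreate an empty one.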
[16:17] *** Ravenloft has quit IRC (Ping timeout: 606 seconds)
[17:28] *** JesseW has joined #archiveteam
[17:56] if you shut the warrior down properly such that no tasks are running, you can just delete the data partition file and recreate an empty one
[18:14] *** atomotic has joined #archiveteam
[18:23] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[18:39] *** scyther has quit IRC (Read error: Connection reset by peer)
[19:04] *** diacope has joined #archiveteam
[19:29] *** dserodio has joined #archiveteam
[19:46] *** Microguru has quit IRC (Read error: Connection reset by peer)
[19:59] Hey, News grabbing people - I was informed about this project that Archive interacts with: http://gdeltproject.org/
[20:03] Please look at it, make sure we're doing something different.
[20:07] Yes, we're doing something different
[20:08] The GDELT project misses a lot of non-English websites
[20:08] as well as more local news sites
[20:09] The GDELT grab also gets videos if a regex for them is provided
[20:09] We can also control the news grabs better than GDELT
[20:10] As far as I know GDELT does the front page and the subpages (not sure if they always do subpages) and scrapes new articles they find
[20:10] We can discover items through better manual settings and URLs, like RSS feeds, etc.
[20:10] This probably makes our crawls a bit more complete
[20:11] I'm not sure about this but I think GDELT covers 2 Afghanistan news websites
[20:11] NewsGrabber currently covers more than 20
[20:11] The GDELT grab also gets videos if a regex for them is provided
[20:11] *** dserodio has quit IRC (Read error: Operation timed out)
[20:12] ^ for that I meant to say the NewsGrabber grab also gets videos if a regex for them is provided
[20:12] But we're currently only grabbing videos if the website is supported by youtube-dl
[20:12] So that's why NewsGrabber is different from the GDELT grab
[20:21] Somebody, please add any of the above that isn't already on http://archiveteam.org/index.php?title=NewsGrabber
[20:41] *** dserodio has joined #archiveteam
[20:56] and if youtube-dl has decided to behave
[21:09] *** WinterFox has joined #archiveteam
[21:10] *** WinterFox has quit IRC (Client Quit)
[21:10] *** WinterFox has joined #archiveteam
[21:19] *** SilSte has quit IRC (Remote host closed the connection)
[21:21] *** schbirid2 has quit IRC (Quit: Leaving)
[21:47] *** HarryCros is now known as HCross
[21:52] *** JesseW has quit IRC (Read error: Operation timed out)
[22:25] *** acridAxid has quit IRC (marauder)
[22:28] *** acridAxid has joined #archiveteam
[23:00] *** megaminxw has joined #archiveteam
[23:08] *** nertzy2 has joined #archiveteam
[23:09] *** Emcy_ has quit IRC (Ping timeout: 252 seconds)
[23:10] *** nertzy2 has quit IRC (Client Quit)
[23:12] *** Emcy has joined #archiveteam
[23:22] *** JesseW has joined #archiveteam
[23:26] The NewsGrabber project now covers more than 30 sites from the UAE!
[23:26] We now fully cover the UAE
[23:27] awesome!
[23:27] Join NewsGrabber: #newsgrabber
[23:27] Is there a nicely formatted, automatically updated, list of sites included in newsgrabber written yet?
[23:28] https://github.com/ArchiveTeam/NewsGrabber/tree/master/services
[23:28] that looks nice I think
[23:28] The bot of NewsGrabber, newsbuddy, gives updates on new found links and uploads and can be followed here: #newsgrabberbot
[23:34] Hm. I think I'll write something more like what I was thinking of, then.
[23:35] what are you thinking of?
[23:36] I think it'll be easier to write it than explain it. :-)
[23:36] making an outdated list
[23:36] :-)
[23:36] ersi: :-P
[23:36] thing with github is that it updates all the time, and is the list the server feeds off
[23:40] I think what I was thinking of is the writehtmllist function in main.py
[23:41] So http://newsgrabber.harrycross.me
[23:42] yes, but apparently with one row per file in services/ and different columns.
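Editor's note on the 23:42 message: the log does not show what writehtmllist does or which columns were wanted. Purely as an illustration of "one row per file in services/", the sketch below builds an HTML table from a directory listing. It assumes the service definitions are .py files, and the two columns (file name and size) are invented for the example. services_table is a hypothetical name, not a function from the NewsGrabber repository.

```python
import html
from pathlib import Path


def services_table(services_dir):
    """Return an HTML table with one row per .py file in services_dir.

    The columns are placeholders; the real list may show other fields.
    """
    rows = []
    for path in sorted(Path(services_dir).glob("*.py")):
        rows.append(
            "<tr><td>{}</td><td>{}</td></tr>".format(
                html.escape(path.name), path.stat().st_size
            )
        )
    header = "<tr><th>Service file</th><th>Size (bytes)</th></tr>"
    return "<table>\n" + "\n".join([header] + rows) + "\n</table>"
```

Regenerating the table from the directory on every run means it stays in step with the repository, which addresses the "outdated list" worry raised at 23:36.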