[00:04] *** kristian_ has quit IRC (Quit: Leaving)
[00:12] *** ZexaronS has quit IRC (Quit: Leaving)
[00:26] *** ZexaronS has joined #archiveteam-bs
[00:31] *** Ravenloft has quit IRC ()
[00:44] imgbox.com and abload.de have a pretty good track record (though Imgbox announced they were shutting down a while ago and then retracted it later saying they "have partnered with a new team that have extensive experience in large-scale hosting")
[00:44] but yeah, image hosts are dropping like flies
[00:44] IPFS could maybe be a solution to that in the future
[00:45] Just chiming in to say that I think doing much more regular scans of imgur would be peachy keen,
[00:46] https://ipfs.io/
[00:47] Doing an !a archive job of https://www.reddit.com/domain/imgur.com/ would be a great start
[00:48] icedice: once again: IPFS *does not provide persistence*
[00:48] there is absolutely zero guarantee that a copy of a given file will remain available
[00:48] ok
[00:48] didn't know that
[00:48] first time discussing it here or anywhere else online for that matter
[00:48] icedice: unfortunately IPFS markets itself as 'the permanent web', and per the authors 'permanent' is meant to refer to 'immutable', not 'persistent'
[00:48] (which I still think is grossly misleading)
[00:49] so I understand the confusion but I still want to point it out very clearly and unambiguously :P
[00:49] yeah
[00:49] ok
[00:49] icedice: basically, think of IPFS as "if a filesystem were based on torrent technology"
[00:49] IPFS is great if you understand its limitations; it's just not an archival medium nor a reliable hosting platform
[00:49] and it doesn't implement any 'assure availability' mechanics like Freenet does
[00:50] the moment there are no seeds, data is gone
[00:50] so it's kind of like Freenet minus the anonymity?
[00:52] Have you guys crawled https://www.reddit.com/domain/imgur.com/ btw?
[00:55] With some exclusion rules that limit the crawl to imgur.com it should do a pretty good job at archiving a lot of popular content from Imgur
[00:55] icedice: it's *not* like Freenet at all :)
[00:56] (that's half the point)
[00:56] icedice: it's like torrents, if anything.
[00:56] ok
[00:56] has all the same technical characteristics
[00:56] just more suitable for filesystem-y tasks
[00:56] So maybe more like ZeroNet
[00:56] but generally, any assumption that holds true for torrents also holds true for IPFS
[00:56] I don't know enough about ZeroNet architecture to meaningfully answer that
[00:57] https://zeronet.io/
[00:57] "Open, free and uncensorable websites,
[00:57] using Bitcoin cryptography and BitTorrent network"
[00:57] ^ BitTorrent powered there as well
[00:57] icedice: yes, but that's the marketing slogan, it doesn't tell me what its actual design or guarantees are :)
[00:58] ok
[00:59] icedice: !a https://www.reddit.com/domain/imgur.com/ wouldn't work. /domain pages are limited to 1000 results.
[00:59] Same for the search, for that matter.
[01:00] *** BlueMaxim has joined #archiveteam-bs
[01:00] You can work around it by using the "cloudsearch" syntax and timestamps, but it's annoying.
[01:02] And obviously, it won't cover any Imgur links used outside of Reddit.
[01:03] But yes, it might be a good idea to start a low-priority project for this. We might be able to reuse some of the code from Eroshare for the link extraction part.
[01:05] *** dashcloud has quit IRC (Ping timeout: 245 seconds)
[01:06] *** dashcloud has joined #archiveteam-bs
[01:21] *** j08nY has quit IRC (Quit: Leaving)
[01:25] *** fie has quit IRC (Ping timeout: 246 seconds)
[01:32] *** pizzaiolo has quit IRC (Remote host closed the connection)
[02:10] is there any kind of standardized database/format for content-addressable data storage
[02:11] I know there's magnet links and IPFS and so on, but none of them seem either standard or interconnected?
[02:14] I'm not talking distribution, just metadata/indexing/cross-references
[02:26] *** ZexaronS- has joined #archiveteam-bs
[02:32] *** ZexaronS- has quit IRC (Ping timeout: 260 seconds)
[02:32] *** ZexaronS- has joined #archiveteam-bs
[02:33] *** ZexaronS has quit IRC (Read error: Operation timed out)
[02:34] *** Odd0002 has quit IRC (Remote host closed the connection)
[02:34] http://archivisthings.eieidoh.net:8880/DataHoarder/Comics/
[02:35] *** ZexaronS- has quit IRC (Client Quit)
[02:36] *** ZexaronS has joined #archiveteam-bs
[02:44] *** ReimuHaku has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[02:44] *** ReimuHaku has joined #archiveteam-bs
[02:48] *** ReimuHaku has quit IRC (Client Quit)
[02:55] *** icedice has quit IRC (Read error: Operation timed out)
[02:56] *** SilSte has quit IRC (Read error: Operation timed out)
[02:57] *** ReimuHaku has joined #archiveteam-bs
[02:57] *** ReimuHaku has quit IRC (Client Quit)
[03:00] *** SilSte has joined #archiveteam-bs
[03:02] *** ReimuHaku has joined #archiveteam-bs
[03:49] *** qw3rty has joined #archiveteam-bs
[03:56] *** qw3rty2 has quit IRC (Read error: Operation timed out)
[04:29] *** BubuAnabe has quit IRC (Ping timeout: 268 seconds)
[04:33] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:36] *** BubuAnabe has joined #archiveteam-bs
[04:40] *** Sk1d has joined #archiveteam-bs
[05:02] *** zhongfu has joined #archiveteam-bs
[05:17] *** BubuAnabe has quit IRC (Ping timeout: 268 seconds)
[06:38] *** ZexaronS- has joined #archiveteam-bs
[06:40] *** ZexaronS has quit IRC (Read error: Operation timed out)
[06:45] *** Honno has joined #archiveteam-bs
[07:04] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[07:07] *** ZexaronS- has quit IRC (Read error: Operation timed out)
[07:08] *** ZexaronS has joined #archiveteam-bs
[07:12] *** Famicoman has joined #archiveteam-bs
[07:18] *** ZexaronS has quit IRC (Quit: Leaving)
[07:24] *** ZexaronS has joined #archiveteam-bs
[07:33] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[07:40] *** Famicoman has joined #archiveteam-bs
[07:46] so i'm up to 1995-06-30 with tagesschau 20:00 news
[07:57] *** kyounko has joined #archiveteam-bs
[08:03] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[08:10] *** Famicoman has joined #archiveteam-bs
[08:28] *** BlueMaxim has quit IRC (Quit: Leaving)
[08:28] *** BlueMaxim has joined #archiveteam-bs
[08:33] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[08:39] *** Famicoman has joined #archiveteam-bs
[08:47] just noticed that electronic gaming monthly went dark 36 days ago
[08:51] *** kristian_ has joined #archiveteam-bs
[09:00] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[09:05] *** kyounko|2 has joined #archiveteam-bs
[09:06] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[09:07] *** Famicoman has joined #archiveteam-bs
[09:08] *** BlueMaxim has joined #archiveteam-bs
[09:11] *** kyounko has quit IRC (Read error: Operation timed out)
[09:11] *** SHODAN_UI has joined #archiveteam-bs
[09:31] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[09:36] *** Famicoman has joined #archiveteam-bs
[09:53] *** kyounko|2 has quit IRC (Read error: Connection reset by peer)
[09:59] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[10:00] *** kristian_ has quit IRC (Quit: Leaving)
[10:08] *** BlueMaxim has quit IRC (Quit: Leaving)
[10:15] *** j08nY has joined #archiveteam-bs
[10:29] *** Honno has quit IRC (Read error: Operation timed out)
[11:06] i'm uploading newer eric archive docs: https://archive.org/details/ERIC_ED565342
[12:16] *** SHODAN_UI has joined #archiveteam-bs
[12:28] *** Honno has joined #archiveteam-bs
[12:41] *** kristian_ has joined #archiveteam-bs
[13:29] *** icedice has joined #archiveteam-bs
[13:52] odemg: http://archivisthings.eieidoh.net:8880/DataHoarder/Comics/ gives me a 403
[13:53] arkiver, server went down, I've redirected dns, just populating /DataHoarder/Comics as fast as I can
[13:54] thanks odemg
[13:56] arkiver, 1.1TB of anime stuff in the mean time? http://archivisthings.eieidoh.net:8880/DataHoarder/
[13:58] :)
[13:59] odemg: what is this VR Content?
[13:59] from the README
[13:59] it was 1TB of VR related games etc mirrored from ultimategamer.club after the hack
[14:00] very nice
[14:00] definitely grabbing a copy of that
[14:01] arkiver, I'll let you know when it's back up
[14:01] thanks
[14:11] odemg: is that a complete Naruto collection?
[14:12] I've been looking for this for a while
[14:12] yes
[14:14] Thank you so much
[14:15] HCross2, get it as fast as you can :p
[14:20] odemg: is there a nicer way than doing a wget -r?
[14:21] feed aria the file list: aria2c -j 25 -c -i list
[14:22] *** pizzaiolo has joined #archiveteam-bs
[14:23] HCross2, http://archivisthings.eieidoh.net:8880/DataHoarder/Anime/Naruto%20Complete%20Series/list
[14:23] tyvm
[14:24] there you go, 50-70MB/s
[14:32] *** yaMatt has joined #archiveteam-bs
[14:33] *** yaMatt has quit IRC (Client Quit)
[14:46] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[14:50] *** Honno has quit IRC (Read error: Operation timed out)
[14:52] *** Smiley has quit IRC (Read error: Connection reset by peer)
[14:52] *** Smiley has joined #archiveteam-bs
[14:53] *** Famicoman has joined #archiveteam-bs
[15:06] *** SHODAN_UI has quit IRC (Ping timeout: 255 seconds)
[15:07] *** kristian_ has quit IRC (Ping timeout: 370 seconds)
[15:08] *** winr4r has quit IRC (Remote host closed the connection)
[15:11] *** SHODAN_UI has joined #archiveteam-bs
[15:11] *** SHODAN_UI has quit IRC (Read error: Connection reset by peer)
[15:13] *** SHODAN_UI has joined #archiveteam-bs
[15:15] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[15:16] *** SHODAN_UI has quit IRC (Read error: Connection reset by peer)
[15:18] *** SHODAN_UI has joined #archiveteam-bs
[15:24] *** Famicoman has joined #archiveteam-bs
[15:31] *** dashcloud has quit IRC (Ping timeout: 260 seconds)
[15:34] *** dashcloud has joined #archiveteam-bs
[15:40] Do any of you know how to install grab-site on archlinux?
[15:44] hey guys, is there some tripod archive?
[15:45] wayback says that it's excluded
[16:10] *** BubuAnabe has joined #archiveteam-bs
[16:35] HCross2, anime and comics dirs updated
[16:36] odemg: can you do me a favour and make a list of every URL please?
[16:37] I'm going to mirror it to some HDDs locally
[16:37] and I want to copy it to my own Online.net box first so I can let it download at its own pace
[16:38] hmm I wonder if I have space for any of this myself
[16:40] HCross2, https://chrome.google.com/webstore/detail/link-grabber/caodelkhipncidmoebgbbeemedohcdma
[16:40] ty
[16:42] *** simsy has joined #archiveteam-bs
[16:42] hi
[16:46] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[16:55] *** RichardG has joined #archiveteam-bs
[16:55] *** RichardG_ has quit IRC (Read error: Connection reset by peer)
[17:11] How do I import cookies into a grab-site/archivebot instance?
[17:19] *** BartoCH has joined #archiveteam-bs
[17:36] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[17:39] i actually figured out the cookie thing.
[17:39] For grab-site, what is the format of the ignore file like?
[17:42] *** Honno has joined #archiveteam-bs
[17:45] *** Famicoman has joined #archiveteam-bs
[17:46] *** simsy has quit IRC (Read error: Connection reset by peer)
[17:47] *** Ravenloft has joined #archiveteam-bs
[17:57] hook54321: https://github.com/ludios/grab-site/blob/master/libgrabsite/ignore_sets/forums
[17:58] K. got that working. I imported a cookies.txt file, but it's not logged into the website for some reason.
[18:03] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[18:04] *** Ravenloft has quit IRC (Ping timeout: 250 seconds)
[18:05] Different IP or user agent from when you logged in?
[18:07] Useragent yeah. I'll try to set it to the same and see what happens.
[18:08] Note that it's possible your session already got invalidated on the server side, so you may need to log in again.
[18:13] *** Famicoman has joined #archiveteam-bs
[18:17] It just keeps on crashing about 3 or 4 urls in
[18:23] https://gist.githubusercontent.com/hook54321a/71f8224b4e15d0ec23eb378f6474fcee/raw/eeada89d724f7941bf3708b31509905cc2d3aac2/gistfile1.txt
[18:34] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[18:51] hook54321: please make an arch grab-site package :)
[19:06] kisspunch: If there were one, I wouldn't be trying to run it through the Ubuntu Windows bash thing.
[19:06] *** Stilett0 has quit IRC (Read error: Operation timed out)
[19:07] *** Honno has quit IRC (Read error: Operation timed out)
[19:10] i have no idea what you're trying to describe but it sounds horrifying
[19:10] learn to make packages, it's pretty easy
[19:10] go read a random PKGBUILD
[19:11] I did get through part of the installation process, but then it said something about missing OpenSSL libraries.
[19:11] yeah, you'd have to manage the manual installation process as step 1
[20:18] *** Honno has joined #archiveteam-bs
[20:25] *** marvinw is now known as ivan
[20:25] hook54321: segfault might imply a problem with lmdb, try grab-site --no-dupespotter
[20:26] *** SHODAN_UI has joined #archiveteam-bs
[20:33] I think it's working now. Thank you so much
[20:33] cool
[20:41] I'm using grab-site for some pretty huge crawls and it's coping really well
[20:41] In fact, I'm currently capturing every .london homepage and it's not falling over
[20:41] Nice
[20:42] I split it in 6 in case it did have issues
[20:42] but each pack is still around 15k homepages
[20:42] plus whatever other assets it needs
[20:42] HCross2: I'm looking to make a Tor Version of ArchiveBot
[20:42] oh nice
[20:42] I just need something with Diskspace, all I have access to is 50GB
[20:42] Can the wayback handle .onion sites?
[20:43] I think so
[20:43] even then, archive now, worry about it later
[20:43] jrwr: use your 50GB as a testbed, but talk to me when you have it working
[20:44] I had one setup
[20:45] pretty easy, just do Tor in a transparent method
[20:45] abused LXC a little to do it as well
[20:45] I'm running it through the Ubuntu bash thing in Windows 10... Which probably has something to do with it.
[20:48] use a VM or actual linux
[20:50] hook54321: Can you send me a warc from your Windows 10 setup please? I would like to run a few validation checks on it
[21:19] *** bmcginty has quit IRC (Ping timeout: 250 seconds)
[21:21] *** bmcginty has joined #archiveteam-bs
[21:47] 6 days into my Tilt API grab: 4.36M URLs retrieved for 11.5 GiB of warc.gz, 5.87M queued (rising again, unfortunately); 779k users, 104k campaigns, 1.67M URLs discovered
[22:12] *** Honno has quit IRC (Read error: Operation timed out)
[22:14] *** j08nY has quit IRC (Read error: Operation timed out)
[22:14] *** j08nY has joined #archiveteam-bs
[22:26] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[22:30] I freaked out briefly because I found a corrupted photo on my NAS despite the RAID check telling me everything was fine
[22:31] turns out it was corrupted at the source. phew
[22:33] the source being an old external HDD. it's a good thing I cloned that disk when I did because clearly it wasn't trustworthy
[22:33] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[22:39] *** mundus201 has joined #archiveteam-bs
[22:40] *** Famicoman has joined #archiveteam-bs
[23:09] HCross2: It's not done yet.
[23:09] What are validation checks?
[23:23] *** BubuAnabe has quit IRC (Ping timeout: 268 seconds)
[23:25] *** Ravenloft has joined #archiveteam-bs
[23:28] Frogging: obligatory "RAID is an availability measure, not an integrity measure"
[23:28] (ie. not a backup)
[23:29] oh I know, I just use it in my NAS, which I use to back up my PC. I was comparing the files in my PC with those on the NAS. but I still run a monthly check just to catch anything odd
[23:30] *** BubuAnabe has joined #archiveteam-bs
[23:30] the comparison led me to believe corruption occurred on the NAS but really it was because I was comparing my PC with a backup of a backup that got corrupted long ago
[23:31] if that sounds dumb it's because it is, and that's why I'm sorting all this stuff out so it can actually make sense :p
[23:33] I ran rsync with -ni and saw this
[23:33] the checksum changing but not the size or the time is a red flag :p
[23:56] *** pizzaiolo has quit IRC (Remote host closed the connection)
[23:59] *** pizzaiolo has joined #archiveteam-bs
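
Editor's note on the 23:33 remark: a file whose checksum differs between two copies while its size and modification time still match is the signature of silent corruption, not of an ordinary edit. The sketch below is one rough way to scan for that pattern across two directory trees. It is not the rsync invocation used in the channel; the function names `sha256_of` and `find_silent_mismatches` are hypothetical, and it assumes both trees are mounted locally and that whole-second mtime resolution is good enough.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_silent_mismatches(source: Path, backup: Path) -> list[str]:
    """Return relative paths whose size and mtime match but whose content differs.

    Hypothetical helper: files missing from the backup are skipped, because
    they are a different kind of problem from silent corruption.
    """
    suspects = []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = src_file.relative_to(source)
        dst_file = backup / relative
        if not dst_file.is_file():
            continue
        src_stat, dst_stat = src_file.stat(), dst_file.stat()
        same_size = src_stat.st_size == dst_stat.st_size
        # Compare whole seconds; filesystems differ in timestamp precision.
        same_mtime = int(src_stat.st_mtime) == int(dst_stat.st_mtime)
        if same_size and same_mtime and sha256_of(src_file) != sha256_of(dst_file):
            suspects.append(relative.as_posix())
    return suspects
```

A hit only says that the two copies disagree, not which one is bad. As the 23:30 message shows, the corrupted copy may turn out to be the older source, so a third copy or a stored checksum is needed to decide.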