#archiveteam-bs 2017-07-01,Sat


***kristian_ has quit IRC (Quit: Leaving) [00:04]
ZexaronS has quit IRC (Quit: Leaving) [00:12]
ZexaronS has joined #archiveteam-bs [00:26]
Ravenloft has quit IRC () [00:31]
icediceimgbox.com and abload.de have a pretty good track record (though Imgbox announced they were shutting down a while ago and then retracted it later saying they "have partnered with a new team that have extensive experience in large-scale hosting")
but yeah, image hosts are dropping like flies
IPFS could maybe be a solution to that in the future
[00:44]
AsparagirJust chiming in to say that I think doing much more regular scans of imgur would be peachy keen, [00:45]
icedicehttps://ipfs.io/
Doing an !a archival job of https://www.reddit.com/domain/imgur.com/ would be a great start
[00:46]
joepie91icedice: once again: IPFS *does not provide persistence*
there is absolutely zero guarantee that a copy of a given file will remain available
[00:48]
icediceok
didn't know that
first time discussing it here or anywhere else online for that matter
[00:48]
joepie91icedice: unfortunately IPFS markets itself as 'the permanent web', and per the authors 'permanent' is meant to refer to 'immutable', not 'persistent'
(which I still think is grossly misleading)
so I understand the confusion but I still want to point it out very clearly and unambiguously :P
[00:48]
icediceyeah
ok
[00:49]
joepie91icedice: basically, think of IPFS as "if a filesystem were based on torrent technology"
IPFS is great if you understand its limitations; it's just not an archival medium nor a reliable hosting platform
and it doesn't implement any 'assure availability' mechanics like Freenet does
the moment there are no seeds, data is gone
[00:49]
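(The point joepie91 is making is that on IPFS, data only survives while some node keeps a pinned copy. A minimal Python sketch of what "persisting your own copy" looks like against a local daemon's HTTP API follows; the port, endpoint names, parameter names and the placeholder CID are assumptions based on go-ipfs defaults, not anything stated in this log.)

    # Sketch: on IPFS, "persistence" means pinning content on a node you control.
    # Assumes a local IPFS daemon exposing its HTTP API on 127.0.0.1:5001 (the
    # go-ipfs default); endpoint and parameter names may differ between versions.
    import requests

    API = "http://127.0.0.1:5001/api/v0"

    def pin(cid: str) -> None:
        """Ask the local node to keep a copy of `cid` so it isn't garbage-collected."""
        r = requests.post(f"{API}/pin/add", params={"arg": cid}, timeout=60)
        r.raise_for_status()
        print("pinned:", r.json().get("Pins"))

    def provider_records(cid: str, limit: int = 20) -> list:
        """List DHT provider records for `cid`; no providers means the data is
        effectively gone, exactly like a torrent with zero seeds."""
        r = requests.post(f"{API}/dht/findprovs",
                          params={"arg": cid, "num-providers": limit},
                          stream=True, timeout=60)
        r.raise_for_status()
        # The API streams newline-delimited JSON; counting lines is a rough proxy.
        return [line for line in r.iter_lines() if line]

    if __name__ == "__main__":
        example_cid = "QmExampleCidGoesHere"  # hypothetical placeholder
        pin(example_cid)
        print(f"{len(provider_records(example_cid))} provider records found")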
icediceso it's like kind of like Freenet minus the anonymity?
Have you guys crawled https://www.reddit.com/domain/imgur.com/ btw?
With some exclusion rules that limit the crawl to imgur.com it should do a pretty good job at archiving a lot of popular content from Imgur
[00:50]
joepie91icedice: it's *not* like Freenet at all :)
(that's half the point)
icedice: it's like torrents, if anything.
[00:55]
icediceok [00:56]
joepie91has all the same technical characteristics
just more suitable for filesystem-y tasks
[00:56]
icediceSo maybe more like ZeroNet [00:56]
joepie91but generally, any assumption that holds true for torrents also holds true for IPFS
I don't know enough about ZeroNet architecture to meaningfully answer that
[00:56]
icedicehttps://zeronet.io/
"Open, free and uncensorable websites,
using Bitcoin cryptography and BitTorrent network"
^ BitTorrent powered there as well
[00:57]
joepie91icedice: yes, but that's the marketing slogan, it doesn't tell me what its actual design or guarantees are :) [00:57]
icediceok [00:58]
JAAicedice: !a https://www.reddit.com/domain/imgur.com/ wouldn't work. /domain pages are limited to 1000 results.
Same for the search, for that matter.
[00:59]
***BlueMaxim has joined #archiveteam-bs [01:00]
JAAYou can work around it by using the "cloudsearch" syntax and timestamps, but it's annoying.
And obviously, it won't cover any Imgur links used outside of Reddit.
But yes, it might be a good idea to start a low-priority project for this. We might be able to reuse some of the code from Eroshare for the link extraction part.
[01:00]
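(JAA's workaround, cloudsearch syntax plus timestamp windows, can be sketched roughly as below. The exact query grammar, the search.json endpoint, the JSON shape, the User-Agent string and the window size are all assumptions about the 2017-era Reddit search API, and each window would still need normal "after" pagination if it held more than 100 posts; treat this as an illustration of the windowing idea, not a working scraper.)

    # Rough sketch of the "cloudsearch + timestamp windows" workaround for the
    # 1000-result cap on /domain/ and search pages. Query syntax, endpoint,
    # response shape and window size are assumptions; adjust before relying on it.
    import time
    import requests

    UA = {"User-Agent": "archiveteam-link-lister/0.1 (contact: example@example.org)"}  # hypothetical UA
    WINDOW = 6 * 3600  # 6-hour slices, assumed small enough to stay under the cap

    def imgur_links(start_ts: int, end_ts: int):
        for lo in range(start_ts, end_ts, WINDOW):
            hi = min(lo + WINDOW, end_ts)
            q = f"(and site:'imgur.com' timestamp:{lo}..{hi})"
            r = requests.get("https://www.reddit.com/search.json",
                             params={"q": q, "syntax": "cloudsearch",
                                     "sort": "new", "limit": 100},
                             headers=UA, timeout=60)
            r.raise_for_status()
            # assumed response shape: {"data": {"children": [{"data": {...}}, ...]}}
            for child in r.json()["data"]["children"]:
                url = child["data"].get("url", "")
                if "imgur.com" in url:
                    yield url
            time.sleep(2)  # stay polite with the API

    if __name__ == "__main__":
        for link in imgur_links(1483228800, 1483315200):  # one day in Jan 2017
            print(link)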
***dashcloud has quit IRC (Ping timeout: 245 seconds)
dashcloud has joined #archiveteam-bs
[01:05]
.... (idle for 15mn)
j08nY has quit IRC (Quit: Leaving)
fie has quit IRC (Ping timeout: 246 seconds)
[01:21]
pizzaiolo has quit IRC (Remote host closed the connection) [01:32]
........ (idle for 38mn)
kisspunchis there any kind of standardized database/format for content-addressable data storage
I know there's magnet links and IPFS and so on, but none of them seem either standard or interconnected?
I'm not talking distribution, just metadata/indexing/cross-references
[02:10]
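(There isn't a single agreed-on index format, but the common core of magnet links, IPFS CIDs and the rest is "address = hash of the bytes", so any index is a digest-to-location map. A tiny Python sketch of that idea; the 0x12/0x20 multihash prefix is the documented sha2-256 encoding, while the example locations and index structure are made up for illustration.)

    # The shared core of content addressing: the identifier is a digest of the
    # content itself, so it stays stable no matter where the file lives.
    import hashlib

    def content_address(data: bytes) -> str:
        """Plain hex sha256 of the bytes."""
        return hashlib.sha256(data).hexdigest()

    def multihash_sha256(data: bytes) -> bytes:
        """The same digest wrapped IPFS-style: 0x12 = sha2-256, 0x20 = 32-byte
        length, then the digest (base58-encoding this yields the familiar Qm... form)."""
        return bytes([0x12, 0x20]) + hashlib.sha256(data).digest()

    if __name__ == "__main__":
        blob = b"hello archiveteam"
        # hypothetical digest -> locations index; the cross-referencing kisspunch asks about
        index = {content_address(blob): ["/mnt/store/blob1", "http://mirror.example/blob1"]}
        print(index)
        print(multihash_sha256(blob).hex())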
***ZexaronS- has joined #archiveteam-bs [02:26]
ZexaronS- has quit IRC (Ping timeout: 260 seconds)
ZexaronS- has joined #archiveteam-bs
ZexaronS has quit IRC (Read error: Operation timed out)
Odd0002 has quit IRC (Remote host closed the connection)
[02:32]
odemghttp://archivisthings.eieidoh.net:8880/DataHoarder/Comics/ [02:34]
***ZexaronS- has quit IRC (Client Quit)
ZexaronS has joined #archiveteam-bs
[02:35]
ReimuHaku has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
ReimuHaku has joined #archiveteam-bs
ReimuHaku has quit IRC (Client Quit)
[02:44]
icedice has quit IRC (Read error: Operation timed out)
SilSte has quit IRC (Read error: Operation timed out)
ReimuHaku has joined #archiveteam-bs
ReimuHaku has quit IRC (Client Quit)
SilSte has joined #archiveteam-bs
ReimuHaku has joined #archiveteam-bs
[02:55]
.......... (idle for 47mn)
qw3rty has joined #archiveteam-bs [03:49]
qw3rty2 has quit IRC (Read error: Operation timed out) [03:56]
....... (idle for 33mn)
BubuAnabe has quit IRC (Ping timeout: 268 seconds)
Sk1d has quit IRC (Ping timeout: 250 seconds)
BubuAnabe has joined #archiveteam-bs
Sk1d has joined #archiveteam-bs
[04:29]
..... (idle for 22mn)
zhongfu has joined #archiveteam-bs [05:02]
.... (idle for 15mn)
BubuAnabe has quit IRC (Ping timeout: 268 seconds) [05:17]
................. (idle for 1h21mn)
ZexaronS- has joined #archiveteam-bs
ZexaronS has quit IRC (Read error: Operation timed out)
[06:38]
Honno has joined #archiveteam-bs [06:45]
.... (idle for 19mn)
Famicoman has quit IRC (Ping timeout: 260 seconds)
ZexaronS- has quit IRC (Read error: Operation timed out)
ZexaronS has joined #archiveteam-bs
Famicoman has joined #archiveteam-bs
[07:04]
ZexaronS has quit IRC (Quit: Leaving) [07:18]
ZexaronS has joined #archiveteam-bs [07:24]
Famicoman has quit IRC (Ping timeout: 260 seconds) [07:33]
Famicoman has joined #archiveteam-bs [07:40]
godaneso i'm up to 1995-06-30 with the 20:00 tagesschau news [07:46]
***kyounko has joined #archiveteam-bs [07:57]
Famicoman has quit IRC (Ping timeout: 260 seconds) [08:03]
Famicoman has joined #archiveteam-bs [08:10]
.... (idle for 18mn)
BlueMaxim has quit IRC (Quit: Leaving)
BlueMaxim has joined #archiveteam-bs
[08:28]
Famicoman has quit IRC (Ping timeout: 260 seconds) [08:33]
Famicoman has joined #archiveteam-bs [08:39]
godanejust noticed that electronic gaming monthly went dark 36 days ago [08:47]
***kristian_ has joined #archiveteam-bs [08:51]
Famicoman has quit IRC (Ping timeout: 260 seconds) [09:00]
kyounko|2 has joined #archiveteam-bs
BlueMaxim has quit IRC (Read error: Operation timed out)
Famicoman has joined #archiveteam-bs
BlueMaxim has joined #archiveteam-bs
kyounko has quit IRC (Read error: Operation timed out)
SHODAN_UI has joined #archiveteam-bs
[09:05]
..... (idle for 20mn)
Famicoman has quit IRC (Ping timeout: 260 seconds) [09:31]
Famicoman has joined #archiveteam-bs [09:36]
.... (idle for 17mn)
kyounko|2 has quit IRC (Read error: Connection reset by peer) [09:53]
SHODAN_UI has quit IRC (Remote host closed the connection)
kristian_ has quit IRC (Quit: Leaving)
[09:59]
BlueMaxim has quit IRC (Quit: Leaving) [10:08]
j08nY has joined #archiveteam-bs [10:15]
Honno has quit IRC (Read error: Operation timed out) [10:29]
........ (idle for 37mn)
godanei'm uploading newer eric archive docs: https://archive.org/details/ERIC_ED565342 [11:06]
............... (idle for 1h10mn)
***SHODAN_UI has joined #archiveteam-bs [12:16]
Honno has joined #archiveteam-bs [12:28]
kristian_ has joined #archiveteam-bs [12:41]
.......... (idle for 48mn)
icedice has joined #archiveteam-bs [13:29]
..... (idle for 23mn)
arkiverodemg: http://archivisthings.eieidoh.net:8880/DataHoarder/Comics/ gives me a 403 [13:52]
odemgarkiver, server went down, I've redirected dns, just populating /DataHoarder/Comics as fast as I can [13:53]
arkiverthanks odemg [13:54]
odemgarkiver, 1.1TB of anime stuff in the meantime? http://archivisthings.eieidoh.net:8880/DataHoarder/ [13:56]
arkiver:)
odemg: what is this VR Content?
from the README
[13:58]
odemgit was 1TB of VR related games etc mirrored from ultimategamer.club after the hack [13:59]
arkiververy nice
definitely grabbing a copy of that
[14:00]
odemgarkiver, I'll let you know when it's back up [14:01]
arkiverthanks [14:01]
HCross2odemg: is that a complete Naruto collection?
I've been looking for this for a while
[14:11]
odemgyes [14:12]
HCross2Thank you so much [14:14]
odemgHCross2, get it as fast as you can :p [14:15]
HCross2odemg: is there a nicer way than doing a wget -r? [14:20]
odemgfeed aria the file list aria2c -j 25 -c -i list [14:21]
***pizzaiolo has joined #archiveteam-bs [14:22]
odemgHCross2, http://archivisthings.eieidoh.net:8880/DataHoarder/Anime/Naruto%20Complete%20Series/list [14:23]
HCross2tyvm [14:23]
odemgthere you go, 50-70MB/s [14:24]
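(The "feed aria the file list" step amounts to turning a directory listing into one URL per line. A rough Python sketch of building such a list file follows; the base URL is just the one mentioned in the channel, and the autoindex-style listing format it assumes is a guess, not a description of odemg's actual server.)

    # Build a one-URL-per-line file for `aria2c -j 25 -c -i list` from a plain
    # HTML directory listing. Assumes a generic autoindex-style page.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    BASE = "http://archivisthings.eieidoh.net:8880/DataHoarder/Anime/Naruto%20Complete%20Series/"

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href", "")
                # skip sort links and sub-directories (including "../"), keep files only
                if href and not href.startswith("?") and not href.endswith("/"):
                    self.links.append(urljoin(BASE, href))

    if __name__ == "__main__":
        parser = LinkCollector()
        parser.feed(urlopen(BASE, timeout=60).read().decode("utf-8", "replace"))
        with open("list", "w") as fh:
            fh.write("\n".join(parser.links) + "\n")
        print(f"wrote {len(parser.links)} URLs; now run: aria2c -j 25 -c -i list")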
***yaMatt has joined #archiveteam-bs
yaMatt has quit IRC (Client Quit)
[14:32]
Famicoman has quit IRC (Ping timeout: 260 seconds)
Honno has quit IRC (Read error: Operation timed out)
Smiley has quit IRC (Read error: Connection reset by peer)
Smiley has joined #archiveteam-bs
Famicoman has joined #archiveteam-bs
[14:46]
SHODAN_UI has quit IRC (Ping timeout: 255 seconds)
kristian_ has quit IRC (Ping timeout: 370 seconds)
winr4r has quit IRC (Remote host closed the connection)
SHODAN_UI has joined #archiveteam-bs
SHODAN_UI has quit IRC (Read error: Connection reset by peer)
SHODAN_UI has joined #archiveteam-bs
Famicoman has quit IRC (Ping timeout: 260 seconds)
SHODAN_UI has quit IRC (Read error: Connection reset by peer)
SHODAN_UI has joined #archiveteam-bs
[15:06]
Famicoman has joined #archiveteam-bs [15:24]
dashcloud has quit IRC (Ping timeout: 260 seconds)
dashcloud has joined #archiveteam-bs
[15:31]
hook54321Do any of you know how to install grab-site on archlinux? [15:40]
useretailhey guys, is there some tripod archive?
wayback says that it's excluded
[15:44]
...... (idle for 25mn)
***BubuAnabe has joined #archiveteam-bs [16:10]
...... (idle for 25mn)
odemgHCross2, anime and comics dirs updated [16:35]
HCross2odemg: can you do me a favour and make a list of every URL please?
Im going to mirror it to some HDDs locally
and I want to copy it to my own Online.net box first so I can let it download at its own pace
[16:36]
Frogginghmm I wonder if I have space for any of this myself [16:38]
odemgHCross2, https://chrome.google.com/webstore/detail/link-grabber/caodelkhipncidmoebgbbeemedohcdma [16:40]
HCross2ty [16:40]
***simsy has joined #archiveteam-bs [16:42]
simsyhi [16:42]
***BartoCH has quit IRC (Ping timeout: 260 seconds) [16:46]
RichardG has joined #archiveteam-bs
RichardG_ has quit IRC (Read error: Connection reset by peer)
[16:55]
.... (idle for 16mn)
hook54321How do I import cookies into a grab-site/archivebot instance? [17:11]
***BartoCH has joined #archiveteam-bs [17:19]
.... (idle for 17mn)
Famicoman has quit IRC (Ping timeout: 260 seconds) [17:36]
hook54321i actually figured out the cookie thing.
For grab-site, what does the ignore file format look like?
[17:39]
***Honno has joined #archiveteam-bs
Famicoman has joined #archiveteam-bs
simsy has quit IRC (Read error: Connection reset by peer)
Ravenloft has joined #archiveteam-bs
[17:42]
Aoedehook54321: https://github.com/ludios/grab-site/blob/master/libgrabsite/ignore_sets/forums [17:57]
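(The bundled ignore sets like the one linked above appear to be plain text files with one regular expression per line, matched against every discovered URL. A small Python sketch of that matching idea follows; the comment-line handling, the file name and the sample patterns are assumptions for illustration, not grab-site's actual parsing code.)

    # Conceptual sketch of an ignore set: one regex per line, and any candidate
    # URL matching one of them is skipped.
    import re

    def load_ignores(path: str):
        patterns = []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#"):  # assumption: '#' starts a comment
                    patterns.append(re.compile(line))
        return patterns

    def should_ignore(url: str, patterns) -> bool:
        return any(p.search(url) for p in patterns)

    if __name__ == "__main__":
        # hypothetical ignore file, loosely modelled on the linked "forums" set
        with open("my_ignores", "w") as fh:
            fh.write(r"/cgi-bin/" + "\n")
            fh.write(r"\?action=(login|register)" + "\n")
        pats = load_ignores("my_ignores")
        print(should_ignore("http://forum.example/index.php?action=login", pats))  # True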
hook54321K. got that working. I imported a cookies.txt file, but it's not logged into the website for some reason. [17:58]
***Famicoman has quit IRC (Ping timeout: 260 seconds)
Ravenloft has quit IRC (Ping timeout: 250 seconds)
[18:03]
JAADifferent IP or user agent from when you logged in? [18:05]
hook54321Useragent yeah. I'll try to set it to the same and see what happens. [18:07]
JAANote that it's possible your session already got invalidated on the server side, so you may need to log in again. [18:08]
***Famicoman has joined #archiveteam-bs [18:13]
hook54321It just keeps on crashing about 3 or 4 urls in [18:17]
https://gist.githubusercontent.com/hook54321a/71f8224b4e15d0ec23eb378f6474fcee/raw/eeada89d724f7941bf3708b31509905cc2d3aac2/gistfile1.txt [18:23]
***SHODAN_UI has quit IRC (Remote host closed the connection) [18:34]
.... (idle for 17mn)
kisspunchhook54321: please make an arch grab-site package :) [18:51]
.... (idle for 15mn)
hook54321kisspunch: If there were one, I wouldn't be trying to run it through the Ubuntu Windows bash thing. [19:06]
***Stilett0 has quit IRC (Read error: Operation timed out)
Honno has quit IRC (Read error: Operation timed out)
[19:06]
kisspunchi have no idea what you're trying to describe but it sounds horrifying
learn to make packages, it's pretty easy
go read a random PKGBUILD
[19:10]
hook54321I did get through part of the installation process, but then it said something about missing OpenSSL libraries. [19:11]
kisspunchyeah, you'd have to manage the manual installation process as step 1 [19:11]
.............. (idle for 1h7mn)
***Honno has joined #archiveteam-bs [20:18]
marvinw is now known as ivan [20:25]
ivanhook54321: segfault might imply a problem with lmdb, try grab-site --no-dupespotter [20:25]
***SHODAN_UI has joined #archiveteam-bs [20:26]
hook54321I think it's working now. Thank you so much [20:33]
ivancool [20:33]
HCross2I'm using grab-site for some pretty huge crawls and it's coping really well
In fact, I'm currently capturing every .london homepage and it's not falling over
[20:41]
jrwrNice [20:41]
HCross2I split it in 6 in case it did have issues
but each pack is still around 15k homepages
plus whatever other assets it needs
[20:42]
jrwrHCross2: Im looking to make a Tor Version of ArchiveBot [20:42]
HCross2oh nice [20:42]
jrwrI just need something with Diskspace, all I have access to is 50GB [20:42]
HCross2Can the wayback handle .onion sites? [20:42]
jrwrI think so
even then, archive now, worry about it later
[20:43]
HCross2jrwr: use your 50GB as a testbed, but talk to me when you have it working [20:43]
jrwrI had one setup
pretty easy, just do Tor in a transparent method
abused LXC a little to do it as well
[20:44]
hook54321I'm running it through the Ubuntu bash thing in Windows 10... Which probably has something to do with it. [20:45]
Frogginguse a VM or actual linux [20:48]
HCross2hook54321: Can you send me a warc from your Windows 10 setup please? I would like to run a few validation checks on it [20:50]
...... (idle for 29mn)
***bmcginty has quit IRC (Ping timeout: 250 seconds)
bmcginty has joined #archiveteam-bs
[21:19]
...... (idle for 26mn)
JAA6 days into my Tilt API grab: 4.36M URLs retrieved for 11.5 GiB of warc.gz, 5.87M queued (rising again, unfortunately); 779k users, 104k campaigns, 1.67M URLs discovered [21:47]
...... (idle for 25mn)
***Honno has quit IRC (Read error: Operation timed out)
j08nY has quit IRC (Read error: Operation timed out)
j08nY has joined #archiveteam-bs
[22:12]
SHODAN_UI has quit IRC (Remote host closed the connection) [22:26]
FroggingI freaked out briefly because I found a corrupted photo on my NAS despite the RAID check telling me everything was fine
turns out it was corrupted at the source. phew
the source being an old external HDD. it's a good thing I cloned that disk when I did because clearly it wasn't trustworthy
[22:30]
***Famicoman has quit IRC (Ping timeout: 260 seconds) [22:33]
mundus201 has joined #archiveteam-bs
Famicoman has joined #archiveteam-bs
[22:39]
...... (idle for 29mn)
hook54321HCross2: It's not done yet.
What are validation checks?
[23:09]
***BubuAnabe has quit IRC (Ping timeout: 268 seconds)
Ravenloft has joined #archiveteam-bs
[23:23]
joepie91Frogging: obligatory "RAID is an availability measure, not an integrity measure"
(ie. not a backup)
[23:28]
Froggingoh I know, I just use it in my NAS, which I use to back up my PC. I was comparing the files in my PC with those on the NAS. but I still run a monthly check just to catch anything odd [23:29]
***BubuAnabe has joined #archiveteam-bs [23:30]
Froggingthe comparison led me to believe corruption occurred on the NAS but really it was because I was comparing my PC with a backup of a backup that got corrupted long ago
if that sounds dumb it's because it is, and that's why I'm sorting all this stuff out so it can actually make sense :p
I ran rsync with -ni and saw this
<fc........ Panorama 1.JPG
the checksum changing but not the size or the time is a red flag :p
[23:30]
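(Frogging's check, verifying a backup by checksum rather than trusting RAID, which only guarantees availability, boils down to hashing corresponding files in two trees. A small self-contained sketch follows; the two directory paths are hypothetical placeholders.)

    # Compare two directory trees by content hash: this is the kind of check
    # that catches a file whose size and mtime match but whose bytes differ.
    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def compare_trees(live: Path, backup: Path) -> None:
        for src in live.rglob("*"):
            if not src.is_file():
                continue
            dst = backup / src.relative_to(live)
            if not dst.exists():
                print(f"missing in backup: {src}")
            elif sha256_of(src) != sha256_of(dst):
                print(f"checksum mismatch: {src}")

    if __name__ == "__main__":
        compare_trees(Path("/home/user/photos"), Path("/mnt/nas/photos"))  # hypothetical paths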
..... (idle for 22mn)
***pizzaiolo has quit IRC (Remote host closed the connection)
pizzaiolo has joined #archiveteam-bs
[23:56]
