[00:06] JensRex: Ever considered using Scaleway? Cheaper and you can also get dedicated baremetal VPS servers from them
[00:10] Nah, I like DO better.
[00:15] dedicated baremetal VPS?
[00:16] how does that work?
[00:16] That sounds contradictory
[00:19] there are single-tenant VPS but they're not baremetal. and the other kind of VPS wouldn't be dedicated either
[00:21] *** username1 is now known as schbirid
[00:22] is ftp.cs.princeton.edu known and archived already? i grabbed it the other day when it was posted on opendirectories
[00:22] ~90G
[00:51] He means their actual baremetal offerings
[00:51] People have gotten used to calling everything VPSs
[00:52] *** odemg has quit IRC (Remote host closed the connection)
[00:55] *** schbirid has quit IRC (Quit: Leaving)
[01:02] *** alfie has quit IRC (Ping timeout: 260 seconds)
[01:05] *** alfie has joined #archiveteam-bs
[02:21] *** j08nY has quit IRC (Quit: Leaving)
[02:57] *** pizzaiolo has quit IRC (Remote host closed the connection)
[03:13] *** kristian_ has quit IRC (Quit: Leaving)
[04:10] *** DFJustin has quit IRC (Remote host closed the connection)
[04:13] *** DFJustin has joined #archiveteam-bs
[04:26] *** pnJay has quit IRC (Read error: Connection reset by peer)
[04:27] *** pnJay has joined #archiveteam-bs
[04:33] *** beardicus has quit IRC (Read error: Operation timed out)
[04:50] *** beardicus has joined #archiveteam-bs
[04:54] *** ndiddy has quit IRC ()
[04:58] *** icedice has quit IRC (Quit: Leaving)
[05:10] *** beardicus has quit IRC (Read error: Operation timed out)
[05:12] i found 3 more episodes of call for help canada
[05:12] 2 from 2004 and one from 2005
[05:27] *** kristian_ has joined #archiveteam-bs
[05:28] *** beardicus has joined #archiveteam-bs
[05:48] *** kristian_ has quit IRC (Quit: Leaving)
[05:49] godane: I had someone look at the japanese scans collection
[05:49] His assessment is they're really, really super bad
[05:49] But I was glad to be tipped off
[05:57] *** Sk1d has quit IRC (Ping
timeout: 194 seconds)
[06:00] *** Aranje has quit IRC (Ping timeout: 506 seconds)
[06:03] *** Sk1d has joined #archiveteam-bs
[06:04] *** Honno_ has joined #archiveteam-bs
[06:23] i sort of figure they are bad scans
[06:23] but i only seen a few images of it
[06:24] and it didn't look bad from those images
[06:30] *** Stilett0- has joined #archiveteam-bs
[08:08] *** GE has joined #archiveteam-bs
[08:55] *** dashcloud has quit IRC (Read error: Operation timed out)
[08:57] *** midas has quit IRC (Read error: Operation timed out)
[08:57] *** midas has joined #archiveteam-bs
[08:59] *** dashcloud has joined #archiveteam-bs
[09:31] *** GE has quit IRC (Quit: zzz)
[09:31] *** GE has joined #archiveteam-bs
[09:44] *** j08nY has joined #archiveteam-bs
[09:58] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[10:32] *** schbirid has joined #archiveteam-bs
[11:51] *** GE has quit IRC (Remote host closed the connection)
[11:55] I wouldn't touch anything Scaleway/Online.net as far as i could throw it
[12:21] *** icedice has joined #archiveteam-bs
[12:22] *** pizzaiolo has joined #archiveteam-bs
[12:35] *** schbirid2 has joined #archiveteam-bs
[12:38] *** schbirid has quit IRC (Read error: Operation timed out)
[12:50] *** j08nY has quit IRC (Quit: Leaving)
[13:33] *** odemg has joined #archiveteam-bs
[14:05] *** icedice has quit IRC (Quit: Leaving)
[14:07] *** pizzaiol1 has joined #archiveteam-bs
[14:08] *** pizzaiolo has quit IRC (Ping timeout: 245 seconds)
[14:20] The guy who scanned Computer Gaming World contacted me and said "Hey, you know, I always put up the middle of the road resolutions. Want the full resolutions?"
[14:20] Yes, yes I do. He's uploading gigs of scans now.
[14:21] Also, I am happy to say that FOS is in the realm of getting normal
[14:21] Mostly because I made a second independent channel/drive for the IMDB nightmare
[14:32] *** Roelandus has joined #archiveteam-bs
[14:32] Roelandus, have you already read through http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem ?
[14:33] yeah, I think so
[14:33] Are you just intending to read a warc file, or host a warc file?
[14:33] But I don't have the skills to use python
[14:33] I'm intending to read a .warc file
[14:34] Then I would recommend https://github.com/ikreymer/webarchiveplayer
[14:34] I'm using that program
[14:34] So I downloaded a test warc file from https://archive.org/details/testWARCfiles
[14:35] but it only shows me the links. When I click on those links it shows me a bunch of text instead of a web page
[14:36] Are you afk?
[14:37] I am looking into your issue, please be patient.
[14:37] Okay, thank you
[14:40] I have an image: https://gyazo.com/da2effe8c3e83612e7480987c69809a5
[14:42] You are opening the warc file, not the meta warc or cdx file?
[14:42] the 1gb file?
[14:43] I just tested it with one of my uploads: https://archive.org/download/allthingsrat.ditb.org-2017-03-01-6475a329/allthingsrat.ditb.org-2017-03-01-6475a329-00000.warc.gz and it works fine.
[14:43] the warc file, I downloaded the test torrent but it doesnt contain a .gz or .cdx file
[14:44] Try clicking on my link and using that file. It is only 53mb.
[14:44] awh shit, my download speed is fucking slow
[14:44] im at 5 MB :(
[14:45] 11 MB
[14:46] Why is this channel called archiveteam-bs (bullshit) seems a bit offensive to me
[14:46] 20.6 MB
[14:47] I have a 100 GB archive of a social-networking website on my PC it does have a .cdx but I can't open it with webarchiveplayer
[14:48] webarchiveplayer only opens .arc , .warc, .gz
[14:49] Because cdx is only an index of URLs that were grabbed.
[14:49] The warc file is the actual content.
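Because the .cdx is only an index, it can be searched on its own without loading the 100 GB WARC. The Python below is a minimal sketch of that idea, not a tool from this log. It assumes a space-delimited CDX whose first line is a legend such as " CDX N b a m s k r M S V g", where "a" is the original URL, "V" the compressed offset into the WARC and "g" the file name. The names parse_cdx and find_records are hypothetical.

```python
def parse_cdx(lines):
    """Yield one dict per CDX record, keyed by the legend's field letters.

    Assumption: fields are separated by spaces and the first line is the
    legend, e.g. " CDX N b a m s k r M S V g".
    """
    lines = iter(lines)
    legend = next(lines, "").split()
    if not legend or legend[0] != "CDX":
        raise ValueError("first line is not a CDX legend")
    letters = legend[1:]
    for line in lines:
        fields = line.split()
        if len(fields) == len(letters):  # skip malformed or blank lines
            yield dict(zip(letters, fields))


def find_records(lines, needle):
    """Return (original URL, compressed offset, file name) for matching URLs.

    Assumption: the legend contains the "a" and "V" fields.
    """
    hits = []
    for rec in parse_cdx(lines):
        if needle in rec.get("a", ""):
            hits.append((rec["a"], int(rec["V"]), rec.get("g")))
    return hits
```

Passing an open file object (for example one opened on the .cdx with errors="replace") as lines and a username as needle would list the matching URLs together with the byte offsets where their records start.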
[14:49] im almost done downloading your file
[14:50] alright
[14:52] that one seems to work
[14:52] *** SketchCow sets mode: +o rocode
[14:53] yeah, your site works
[14:53] that test file is a bit weird though
[14:53] I have another question though
[14:54] Go for it. :)
[14:54] I got this 100GB .warc file which I mentioned previously. It takes forever to load it in.
[14:55] *** Darkstar has quit IRC (Ping timeout: 370 seconds)
[14:55] Isn't there a way to just load the index URLs and then load in the indiv. pages
[14:55] I forgot the question mark
[14:55] *** GE has joined #archiveteam-bs
[14:55] This unfortunately requires a tiny bit of that python you hated. Several tools exist to split warc files into smaller files. Warcat is the most popular. https://pypi.python.org/pypi/Warcat/
[14:56] I don't hate it, I just can't use it
[14:56] I have python installed on my pc
[14:56] You are on windows I am assuming?
[14:57] Windows 10, yes
[14:58] I have Python 3.5 installed
[14:58] In your bottom left Cortana search button, type "Powershell"
[14:58] cmds?
[14:59] Negatory. The program you need to launch is called "Windows Powershell"
[14:59] I already launched it
[14:59] I was asking for which commands I shall put in
[15:00] Get this script: https://bootstrap.pypa.io/get-pip.py
[15:00] Save it to your computer.
[15:01] woops
[15:01] already failed there
[15:01] how do I save it to my PC? just open notepad and paste the code in and save it as .py?
[15:02] Correct.
[15:02] done
[15:04] If you saved it in your home directory (i.e. C:\Users\\), and powershell shows you are in that directory, simply type python
[15:04] I gotta cd to my desktop first
[15:04] I saved it there
[15:04] k
[15:05] I just moved it to that folder because that is the easiest way I think
[15:06] I got an error
[15:06] my input: "python code.py"
[15:07] output error: "python : The term 'python' is not recognized as the name of a cmdlet, function, script file, or operable program.
Check the spelling of the name, or if a path was included, verify that the path is correct and try again. At line:1 char:1 + python code.py + ~~~~~~ + CategoryInfo : ObjectNotFound: (python:String) [], CommandNotFoundException + FullyQualifiedErrorId : CommandNotFoundException"
[15:07] I have python 3.5 32-bt
[15:07] *32-bit
[15:07] *** Darkstar has joined #archiveteam-bs
[15:08] hold on
[15:08] np
[15:10] Type this: $env:Path = "C:\Python35\";
[15:11] https://dl.dropboxusercontent.com/u/53753604/screenshots/2017.03.12_16-08-29__explorer.png
[15:11] hm yeah
[15:11] (If that is where your python is installed, please check first)
[15:11] VADemon, yeah, the installer includes the option to add to path, but it is off by default. :P
[15:12] ...great
[15:12] Python35 is not in the C: folder
[15:13] Oh, I remember
[15:13] What was that environment table thing?
[15:13] $env:Path = "C:\Python35\"; replacing the stuff in the quotations with the actual path.
[15:13] So you can access python from anywhere by just typing "python"
[15:14] I'll try to fix the environment table
[15:15] https://gyazo.com/b2f4b7888e956a6e694feab23e953f6e
[15:16] This is what it looks like with nothing changed
[15:16] I changed something in the past though
[15:16] Did you enter rocod3's command though?
[15:17] No, because my python is located here "C:\Users\roels\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Python 3.5"
[15:17] No it isn't. Your start menu shortcut is located there.
[15:17] This should be just the link to your installation
[15:18] woops
[15:18] Your actual install, if you left the options the same, should be C:\Python35
[15:18] If I do properties on that .exe it gives the path "C:\Users\roels\AppData\Local\Programs\Python\Python35-32\"
[15:19] You checked install for only me in the installer process then.
[15:19] probably
[15:19] this should be it
[15:19] $env:Path = "C:\Users\roels\AppData\Local\Programs\Python\Python35-32\";
[15:19] Copy paste that into your powershell, then hit enter
[15:20] it shows nothing as output
[15:20] just a new line
[15:20] idk if that's good
[15:20] That's good.
[15:20] Now type python code.py
[15:21] yay
[15:21] oh shit
[15:21] it says it uninstalled pip :(
[15:21] it was already installed for some reason
[15:22] Standby
[15:22] now it's installed
[15:22] again
[15:22] Yep, it just upgraded your pip.
[15:23] Let it finish, then type pip install warcat
[15:25] error: "pip : The term 'pip' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again. At line:1 char:1 + pip install warcat + ~~~ + CategoryInfo : ObjectNotFound: (pip:String) [], CommandNotFoundException + FullyQualifiedErrorId : CommandNotFoundException"
[15:26] Standby
[15:29] Roelandus, do you still have the Python3.5 installer in your downloads?
[15:30] I got some kind of folder named platform-tools
[15:31] should I download a new installer?
[15:31] Standby
[15:31] Download this: https://www.python.org/ftp/python/3.6.1/python-3.6.1rc1-amd64.exe
[15:31] downloaded, open>
[15:31] ?
[15:31] During the installation process, you will get a list of components being installed.
[15:32] so, I should install it?
[15:32] Negate previous, at the bottom of the installer is a checkbox that says "Add Python3.6 to path"
[15:33] Check it.
[15:33] customize installation or install now button?
[15:33] Once you have that checked, install now button
[15:33] Restart your powershell
[15:33] (close and reopen)
[15:34] it's still installing
[15:34] Once it finishes, at the last screen is "Disable path limits"
[15:35] Click it
[15:35] it's almost done
[15:35] I'm still on HDD
[15:36] I clicked it
[15:36] but it doesn't show a new install of some sort
[15:36] *install->installer
[15:36] that's fine
[15:37] okay, what's next? :)
[15:37] Open up powershell and type pip install warcat
[15:38] it's done
[15:38] Standby.
[15:41] what keyboard do you use btw?
[15:41] Type: python -m warcat split
[15:41] Replacing with the one you wish to split
[15:42] should I cd to my warcfile location first?
[15:42] Yes
[15:42] I use this due to severe wrist pain: https://www.trulyergonomic.com/store/truly-ergonomic-mechanical-keyboard-soft-tactile-kailh-cherry-mx-compatible-brown-keyswitches-227-english
[15:43] uhm it seems like cd'ing in cmd is not the same as cd'ing in ms powershell
[15:43] that keyboard is more expensive than hhkb
[15:44] I see the problem
[15:46] In the future, to bypass all the windows bullshit we just had to go through, you may want to consider installing the Windows Subsystem for Linux, which makes all of this 100x easier. https://msdn.microsoft.com/en-us/commandline/wsl/install_guide
[15:46] If you end up doing a lot of CLI work.
[15:46] *** GE has quit IRC (Quit: zzz)
[15:46] *** GE has joined #archiveteam-bs
[15:47] I'll save that link
[15:48] sorry I had a folder name that powershell responded to in the wrong way
[15:48] I had a folder named "test environment"
[15:48] It saw environment as a command of some sort
[15:49] it's splitting them
[15:50] not in the way I wanted to (wrong folder) but I can post fix that
[15:51] how can I check the progress in %
[15:52] Windows does not have a method of generating progress bars for CLI commands that do not have them built in like linux does.
[15:52] (This is my subtle conversion process)
[15:53] that truly ergonomic company is pretty expensive
[15:53] $65 numpad with cherry mx
[15:54] I am not saying it is for everyone. I use them because they solved my wrist pain. I know there are cheaper options. But you asked what I used, not what I recommended. ;)
[15:54] yeah
[15:55] what do you think of hhkb excluding the lite2?
[15:57] can I already open a split .warc file with webarchiveplayer?
[15:57] Yes
[15:58] Can you tell me your age?
[15:59] If you are considering topre keyboards, I can only point you to /r/MechanicalKeyboards . I may have an expensive keyboard, but that stuff is beyond my purview. :P However, if you are going to spend that kind of money, I recommend you also look at https://ergodox-ez.com
[16:00] I need a new keyboard but I'm probably gonna buy a membrane or a clone mechanical
[16:00] I'm just 16 y/o
[16:00] So I can't afford the finest mx clears or topre
[16:01] If money is an issue, I recommend what I had at your age: http://www.daskeyboard.com/products/mechanical-keyboards/
[16:03] daskeyboard is by no means cheap
[16:03] I got €1800 on my bank account
[16:03] And I'm in like the highest 5 percent in my school
[16:03] http://www.ebay.com/sch/i.html?_from=R40&_trksid=p2050601.m570.l1313.TR12.TRC2.A0.H0.Xibm+model+m.TRS0&_nkw=ibm+model+m&_sacat=0
[16:04] Find a cheap, used IBM Model M
[16:04] yeah, that's a good idea they're like €40
[16:04] A good keyboard, properly maintained, will last you 10 years.
[16:04] And I get dem geekhack respect with my model M
[16:04] but I have a problem with the split warc files
[16:05] Oh?
[16:05] https://gyazo.com/7eec1d45473e55f8bce45a908b34ca51
[16:06] when I click on a random file it shows basically nothing useful
[16:06] it's from the hyves.nl archive
[16:10] Roelandus, select multiple warc files with the webarchiveplayer
[16:11] Actually, I have committed a cardinal sin. What are you actually trying to accomplish with this warc file?
[16:11] Find a specific record?
[16:11] Or just randomly browse?
[16:11] hyves.nl
[16:11] I understand, but what are you trying to achieve?
[16:11] I need to find a couple of pages so I can grab their images/videos
[16:12] I already have the username/groupname list
[16:13] Roelandus, did you get this warc file from IA?
[16:13] yes I believe
[16:14] got it from here: http://www.archiveteam.org/index.php?title=Hyves but it points to internet archive I think
[16:14] I see. And you can't use the wayback machine because of their shitty robots.txt
[16:14] yeah, I don't get the robots.txt nonsense
[16:14] Unfortunately, you will need to use the built in searcher on the ENTIRE warc file, as slow as it is.
[16:15] The only thing I know is that it is used by search engines
[16:15] Jason Scott put it pretty well here: http://www.archiveteam.org/index.php?title=Robots.txt
[16:16] I already read half of that
[16:17] still I don't get the IA robots.txt
[16:17] IA follows robots.txt instructions. http://hyvesgames.nl/robots.txt is set up as a whitelist instead of a blacklist.
[16:17] I think the robots.txt works like this: search engine crawls -> sees robots.txt and acts accordingly
[16:18] IA follows robots.txt for several reasons, all of which I am not qualified to really elaborate on. It is the unfortunate reality of the situation.
[16:18] has it to do with legal stuff?
[16:18] Yes.
[16:18] yeah the website doesn't want their data to be 'stolen'
[16:19] Anyway, if you are looking for specific stuff within a warc, as far as I know, you will need to use the warc as a whole.
[16:19] Sorry.
[16:19] shit
[16:19] well...
[16:19] Anyway, I really should be getting back to work. If you need anything else, please feel free to ask in this channel. :)
[16:19] okay, but is there no way to speed up the loading progress?
[16:20] because I only want to see one page as a starting point
[16:21] As far as I know, no.
Some others in this channel are way more experienced with actually using the actual warc file and may be able to help you, but we have reached the end of my expertise. :)
[16:21] okay, thanks for your time. Have a nice day sir :)
[16:23] Roelandus: you can grep through the index (CDX) file and then use the offsets to extract the corresponding WARC partially
[16:23] this will still take a good while though
[16:23] *** tapedrive has quit IRC (Read error: Operation timed out)
[16:24] *** tapedrive has joined #archiveteam-bs
[16:26] SketchCow: i have a collection of 321 Contact
[16:26] i may be uploading that soon
[16:27] joepie91 nederlands?
[16:29] Schools are encouraged to videotape the program, without fear of copyright violation, and to build the segments into the curriculum at convenient times. In the past, 26 countries have broadcast the series. West Germany, France and Spain have produced their own versions with native casts.
[16:29] source about 321 Contact: http://www.nytimes.com/1984/10/02/science/about-education-forgotten-tv-audience-children.html
[16:30] i hope that will be enough so we don't have to take collection down
[16:42] joepie91
[16:57] Roelandus: yes, but the default language here is English :)
[16:57] Can you help me?
[16:58] Roelandus: well, see my advice above - there's not really a faster way to do this, and I don't have the time atm to go through it step by step
[17:00] is there a specific program to partially extract the warc by the .cdx offset?
[17:05] *** whydomain has quit IRC (Read error: Operation timed out)
[17:06] *** whydomain has joined #archiveteam-bs
[17:18] *** icedice has joined #archiveteam-bs
[17:22] *** odemg has quit IRC (Remote host closed the connection)
[17:40] *** whydomain has quit IRC (Read error: Operation timed out)
[17:41] *** whydomain has joined #archiveteam-bs
[17:43] *** kyounko has quit IRC (Read error: Connection reset by peer)
[17:46] How do I install wikiteam?
[18:00] *** whydomain has quit IRC (Read error: Operation timed out)
[18:01] *** whydomain has joined #archiveteam-bs
[18:08] Roelandus, http://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior
[18:48] *** RichardG_ has joined #archiveteam-bs
[18:49] *** RichardG has quit IRC (Ping timeout: 260 seconds)
[18:53] *** RichardG has joined #archiveteam-bs
[18:56] *** odemg has joined #archiveteam-bs
[18:56] *** odemg has quit IRC (Connection closed)
[18:58] *** RichardG_ has quit IRC (Read error: Operation timed out)
[19:07] *** spiko has quit IRC (Read error: Connection reset by peer)
[19:12] *** bwn has quit IRC (Ping timeout: 244 seconds)
[19:18] *** Aranje has joined #archiveteam-bs
[19:34] *** j08nY has joined #archiveteam-bs
[19:42] [18:00] is there a specific program to partially extract the warc by the .cdx offset?
[19:42] there might be, but warc files are just big gzipped balls of text
[19:42] so if you grab the offset, you can just read the specified segment from the file (using `head` and `tail` maybe?) and ungzip that
[19:42] and you get text and headers
[19:43] and the resource
[19:47] I'll probably let my server(old dell/hp pc) extract it overnight
[19:47] Because my PC is in my bedroom
[19:49] This would be a lot easier if the robots.txt from IA wouldn't block me
[19:50] *** bwn has joined #archiveteam-bs
[20:00] Roelandus: were you able to get what you were after?
[20:01] three
[20:01] Roelandus, IA is not supplying the robots.txt. The website you are trying to access is.
[20:01] there's only a couple of directives in https://archive.org/robots.txt so I dunno what's up with that
[20:01] or see above
[20:01] IA is just following the request of the website owner (as misguided as it may be.)
[20:02] I do wonder what caused that item to be added to the archive.org robots.txt
[20:03] Probably a legal order if I had to guess.
[20:05] as far as extraction from an offset goes, yeah it's possible.
there might be better tools for this, but one way to do it is to pull offsets out of the cdxes and issue range requests
[20:05] https://gitlab.peach-bun.com/snippets/34
[20:06] this only works if each record is compressed individually but that's what we do anyway
[20:08] i've had success using warctools' warcfilter utility to grab specific urls from a warc: https://github.com/internetarchive/warctools
[20:11] *** odemg has joined #archiveteam-bs
[20:12] does warctools support retrieving only a small portion of a large WARC? from https://github.com/internetarchive/warctools/blob/master/hanzo/warctools/stream.py, it's not clear to me if it does
[20:42] *** phuzion has quit IRC (Read error: Operation timed out)
[20:46] *** phuzion has joined #archiveteam-bs
[20:52] *** Roelandus has quit IRC (Ping timeout: 268 seconds)
[20:54] *** phuzion has quit IRC (Read error: Operation timed out)
[20:57] *** phuzion has joined #archiveteam-bs
[21:19] *** odemg has quit IRC (Remote host closed the connection)
[21:22] *** Honno_ has quit IRC (Ping timeout: 370 seconds)
[21:31] *** kristian_ has joined #archiveteam-bs
[21:43] *** odemg has joined #archiveteam-bs
[21:46] *** icedice has quit IRC (Quit: Leaving)
[21:47] *** odemg has quit IRC (Remote host closed the connection)
[21:56] *** BlueMaxim has joined #archiveteam-bs
[22:06] *** kristian_ has quit IRC (Quit: Leaving)
[22:12] *** no1spod has quit IRC (Quit: reboot)
[22:31] *** Roelandus has joined #archiveteam-bs
[22:33] *** odemg has joined #archiveteam-bs
[22:38] hello
[22:44] *** GE has quit IRC (Ping timeout: 255 seconds)
[22:45] *** GE has joined #archiveteam-bs
[23:07] *** amiiboh has joined #archiveteam-bs
[23:09] *** t2t2 has quit IRC (Read error: Operation timed out)
[23:10] *** t2t2 has joined #archiveteam-bs
[23:31] *** amiiboh has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[23:54] *** schbirid2 has quit IRC (Quit: Leaving)
[23:59] *** Stiletto has joined #archiveteam-bs
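The offset-based extraction discussed at 19:42 and 20:05 can be sketched in a few lines of Python for a WARC stored on local disk. This is a minimal sketch rather than the linked snippet or any warctools API. It assumes, as noted at 20:06, that every record is its own gzip member, and that the offset comes from the CDX. The function name read_record is hypothetical.

```python
import zlib


def read_record(path, offset, chunk_size=65536):
    """Decompress the single gzip member that starts at `offset` in `path`.

    Assumption: each WARC record was compressed individually, so a gzip
    member begins exactly at the offset recorded in the CDX.
    """
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        # Stop as soon as this member ends; the rest of the file is never read.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # file ended before the member did
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

The returned bytes are the WARC headers, then the HTTP headers, then the resource itself, which matches the "text and headers and the resource" description above. Only the bytes of the requested record are read, so the size of the whole file does not matter much.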