[00:26] *** Geekonoci has joined #archiveteam-bs
[00:34] *** Geekonoci has quit IRC (Quit: Page closed)
[00:46] How should I organize that? A table? or just a plain list?
[00:48] What information do you have?
[00:49] If it's really just a list of account names plus maybe some comments/notes, then I'd say plain list (alphabetically sorted, I guess). But if you have additional data for many of the accounts, a table might be better.
[00:58] *** BlueMaxim has joined #archiveteam-bs
[01:02] A lot of the time I discover it because people ask for it to be excluded, publicly on twitter...
[01:11] balrog, JAA
[01:14] *** qw3rty116 has joined #archiveteam-bs
[01:17] *** qw3rty115 has quit IRC (Read error: Operation timed out)
[01:30] i'm at 1083k items
[01:31] hook54321: nice
[01:33] arkiver: I'm supposed to work with you on OpenNIC stuff
[01:33] I just read that yeah :P
[01:34] o/
[01:34] *** Aranje has quit IRC (Remote host closed the connection)
[01:34] hook54321: thanks for taking care of and getting AT involved in this
[01:34] np
[01:37] I'm not sure what we're going to do about the .free tld. At this point it makes the most sense to just grab the .libre sites, since the .free sites were moved over there. However, I'm not sure if we should be worried about other webpages still using .free URLs.
[01:38] I also have no idea what we'll do if ICANN decides to create a TLD that's already an OpenNIC tld.
[01:38] again
[01:39] There's nothing to do about it. The DNS has no provisions for conflicting roots
[01:39] (does it?)
[01:40] In the case of .free, OpenNIC is basically just moving all of the .free domains over to .libre
[01:42] Personally, I'm hoping that's what they're planning to do if this happens again.
[01:42] what else can be done?
[01:44] This is their position on it:
[01:44] a subdomain? .free.opennic?
[01:44] "What is OpenNIC's relationship with the other alternative roots and ICANN?
[01:44] OpenNIC currently recognizes and peers all of the existing ICANN TLDs (.com, .uk, etc.). Therefore, if you configure your computer to resolve OpenNIC domains, you'll also be able to resolve all of the ICANN TLDs automatically.
[01:44] OpenNIC has not yet evaluated nor does it hold a formal position on the current/future ICANN TLDs."
[01:45] Somebody2: I don't think they would like that
[01:45] hook54321: who?
[01:45] ICANN or OpenNIC, or someone else?
[01:45] OpenNIC
[01:46] why not? It would make it clear the source of the domains...?
[01:46] today in "wtf Google": https://twitter.com/joepie91/status/900534232296161284
[01:46] they could even do the same with ICANN domains: .com.icann
[01:46] That would break ssl certificates though
[01:47] joepie91: http://i.imgur.com/By95Lva.jpg
[01:47] and it kinda defeats the purpose of what they're trying to do
[01:47] lol
[01:56] hook54321: between converting domains previously at .free into domains at .libre vs converting them into domains at .free.opennic -- I'm not sure why .libre is better... can you clarify?
[01:57] and how does adding .opennic or .icann at the end of domains defeat what they are trying to do?
[01:59] because then it's a subdomain
[02:00] example.com is a subdomain of .com
[02:00] .com is the TLD though
[02:00] DNS doesn't distinguish
[02:01] http://com/ is valid
[02:01] DNS doesn't, but various protocols on top do in various subtle ways
[02:01] and as a side note, I am now immensely confused at the result of visiting that URL
[02:01] like various browsers do different things depending on whether a DNS segment is a toplevel or not
[02:02] oh I see. someone has a sense of humour. ".XYZ is the next .COM. .XYZ is the #1 new domain in the world"
[02:03] ok, so rather than .free.opennic, you could do performance horrible things by using a different delimiter, e.g. .free-opennic, or even .free;opennic
[02:05] Frogging: oddly, neither dig nor curl are able to follow that redirect.
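Editorial aside on the "example.com is a subdomain of .com" exchange above: a minimal sketch, assuming only the Python standard library, of how a hostname breaks into dot-separated labels, with every suffix of the name being a parent domain. The helper names `parent_chain` and `is_single_label` are hypothetical and do not come from the log; no DNS lookup is performed.

```python
def parent_chain(hostname: str) -> list[str]:
    """Return the name followed by each of its parent domains, ending at the TLD."""
    # A trailing dot (the DNS root) is dropped; names are case-insensitive.
    labels = hostname.rstrip(".").lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def is_single_label(hostname: str) -> bool:
    """True for a bare single-label host such as 'dk' or 'com'."""
    return len(hostname.rstrip(".").split(".")) == 1


if __name__ == "__main__":
    # Under the proposed scheme, the old TLD would become an ordinary
    # second-level label beneath a new 'opennic' TLD.
    print(parent_chain("example.free.opennic"))
    print(is_single_label("dk"))
```

In this view `example.free.opennic` is simply one label deeper than `example.free`; the structure itself treats a TLD like any other label, which is the "DNS doesn't distinguish" point, while software layered on top (browsers, certificate handling) may not.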
[02:08] indeed, I think that one does not actually work, it was my browser adding .com to it automatically
[02:08] Ah, this is because the browser automatically converts "http://com/" into a request for http://www.com.com/
[02:09] try this one http://dk/
[02:10] yes, that one works in dig and curl
[02:10] 301 redirecting to https://www.dk-hostmaster.dk/
[02:10] yup
[02:34] I was thinking about that line ".XYZ is the next .COM" and it occurred to me that I've never in my life seen a legitimate .xyz website
[02:34] so I went to one of the ones linked on the page
[02:34] https://www.goinnovate.xyz/
[02:35] this is... frightful.
[02:35] "fog computing"
[02:36] http://www.exponentials.xyz/posts/the-roles-of-cloud-computing-and-fog-computing-in-the-internet-of-things-revolut-6205388
[02:53] *** drumstick has quit IRC (Ping timeout: 268 seconds)
[03:00] *** drumstick has joined #archiveteam-bs
[03:14] *** qw3rty117 has joined #archiveteam-bs
[03:18] I'm surprised that IA actually complied with this guy's request: https://twitter.com/darthodius/status/658731881626783745
[03:20] *** qw3rty116 has quit IRC (Read error: Operation timed out)
[03:41] *** Fletcher has quit IRC (Remote host closed the connection)
[03:56] *** Fletcher has joined #archiveteam-bs
[04:15] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:16] public.resource.org refers to the Internet Archive building as "the Church of the Internet Archive"
[04:20] Well, the bulding really was an old church once.
[04:20] *building
[04:20] It still has pews and all that.
[04:22] *** Sk1d has joined #archiveteam-bs
[04:38] Asparagir: yeah, i know.
[04:39] I wonder if anyone has ever gone there in person and angrily demanded their site get removed from the wayback machine
[04:58] *** Asparagir has quit IRC (Asparagir)
[05:38] this guy is my history hero https://www.youtube.com/watch?v=ZqUm1YXTxNc
[05:38] Steve1989MREInfo
[06:12] *** Honno has joined #archiveteam-bs
[06:15] *** RichardG has quit IRC (Read error: Connection reset by peer)
[06:17] *** RichardG has joined #archiveteam-bs
[06:31] *** soja92 has joined #archiveteam-bs
[07:08] *** kristian_ has joined #archiveteam-bs
[07:44] *** drumstick has quit IRC (Read error: Operation timed out)
[07:44] *** drumstick has joined #archiveteam-bs
[08:07] *** tuluu has quit IRC (Ping timeout: 245 seconds)
[08:17] *** tuluu has joined #archiveteam-bs
[08:52] *** Boppen has quit IRC (Ping timeout: 194 seconds)
[08:53] *** drumstick has quit IRC (Ping timeout: 268 seconds)
[08:54] *** kristian_ has quit IRC (Read error: Operation timed out)
[09:00] *** etudier has joined #archiveteam-bs
[09:05] *** drumstick has joined #archiveteam-bs
[09:15] *** Boppen has joined #archiveteam-bs
[09:40] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[09:40] *** dashcloud has joined #archiveteam-bs
[11:11] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[11:26] *** drumstick has quit IRC (Read error: Operation timed out)
[12:15] *** brayden has quit IRC (Read error: Connection reset by peer)
[12:17] *** brayden has joined #archiveteam-bs
[12:17] *** swebb sets mode: +o brayden
[12:49] *** kurt has quit IRC (Read error: Operation timed out)
[13:44] The Archive has had to deal with a lot of crazy walking right in and demanding things, yes.
[13:44] Two levels: The one you think of, someone coming in and demanding something related to content.
[13:44] The other: Since it looks like a church, church-related things, like homeless or people asking for sanctuary, etc.
[13:44] There are between 3-5 people who sleep at the Archive at night, since it looks like a church
[13:45] There was one who was across the street for years, Thomas. He died last October, and even the night he died, we'd brought over extra food from the celebration going on, which he had.
[13:47] http://richmondsfblog.com/2016/10/27/thomas-resident-homeless-man-at-funston-clement-passed-away-wednesday-night/
[13:54] *** BartoCH has quit IRC (Quit: WeeChat 1.9)
[14:08] *** vitzli has joined #archiveteam-bs
[14:21] *** TheLovina has quit IRC (Quit: Leaving)
[14:30] til
[14:31] *** Mateon1 has quit IRC (Read error: Operation timed out)
[14:31] *** Mateon1 has joined #archiveteam-bs
[14:36] *** REiN^ has quit IRC (Max SendQ exceeded)
[14:36] *** REiN^ has joined #archiveteam-bs
[14:36] hook54321: That's for pages which are explicitly blocked due to a direct request to IA, I assume?
[14:37] I'm not exactly sure yet.
[14:38] Whatever we make it out to be I guess
[14:38] Maybe both, idk
[14:39] I just think that a list of pages blocked by robots.txt is probably *massive*.
[14:40] yeah. on the other hand, it could be used for us to have some sort of database of what IA might not be crawling.
[14:40] *could be useful
[16:48] *** vitzli has quit IRC (Quit: Leaving)
[17:25] another page is missing for 01-01-2014
[17:26] *2014-01-01
[17:26] anyways is page 24
[17:35] *** Asparagir has joined #archiveteam-bs
[18:05] *** namespace has joined #archiveteam-bs
[18:06] So I'm trying to OCR/type up/something this PDF so it can be put on a public website as an actually readable/searchable/etc historical archive: https://archive.org/details/8BBSArchiveP1V1
[18:06] (Also to generate more interest in it through sex appeal, because it's actually a fairly important bit of phreaker history.)
[18:07] And it is just the nastiest PDF out there (blurry monospace font, holepunched so that sometimes entire bits of text are missing, scuff marks, etc).
[18:08] The automatic archive.org OCR is actually better than what I could get using OCRFeeder: https://archive.org/stream/8BBSArchiveP1V1/8BBS_Archive_P1V1_djvu.txt
[18:08] Is there any way to do better, or?
[18:08] *** schbirid has joined #archiveteam-bs
[18:29] wow that is nasty yeah
[18:29] i have a very flakey ocr program that works well on single-font fixed-width material
[18:29] https://github.com/chronomex/ess-ocr if you want to poke at it
[18:30] all the magic bits are hardcoded, so youd have to hardcode new constants to describe the pages you want to ocr
[18:30] and that sort of thing
[18:34] Yeah here's the strategies I've thought up so far:
[18:35] - Retype the whole thing (would prefer not to, takes TON of time, solitary)
[18:35] i mean, this is the sort of thing that i wrote my ocr program for
[18:35] this will ocr quite well
[18:35] - Highlight all the actual posts and then extract them as images, put up on a wiki and let other people help.
[18:35] - OCR it, which yeah I'll try that thing you just posted thanks.
[18:35] it's extremely fragile but this is fixed width enough to work super well
[18:36] i am planning to do a total rewrite of the tool to make it useful, but for now you will be able to get it to work
[18:36] watch out, tool is designed to work with pages that have a black frame around them, so you might need to remove some code about that
[18:36] (it used the black frame for fiducial registration)
[18:37] Eheheh.
[18:37] crop_to_rect is the function to comment out calls to
[18:37] where's the PDF you're trying to OCR?
[18:37] https://archive.org/stream/8BBSArchiveP1V1/8BBS_Archive_P1V1#page/n3/mode/1up
[18:39] i have pdf ocr software at work, i'll try running it on it, lol
[18:41] downloading at like 300KiB/s, ugh
[18:41] wow lol
[18:44] astrid: So stupid question, how do I compile this?
[18:44] There's two files, no makefile, do I just compile them both separately and then execute one?
[18:46] they both have a comment at the beginning saying the magic compiler spells to cast
[18:46] deskew comes from the leptonica library
[18:46] Ah, k.
[18:46] Thanks.
[18:46] so you might need to set that up
[18:46] also you need to feed it 'pgm' images
[18:47] Any dependencies? ^^;;
[18:47] ess-ocr needs you to create a directory 'training' or it'll core
[18:47] yeah libnetpnm among others
[18:47] as i said
[18:47] very rough :)
[18:49] Yeah I'm smelling a goose chase, thanks anyway. :p
[18:49] once you get it running, call it over one of your images
[18:49] Yes see that first bit.
[18:49] Is the bit that is very unlikely to happen.
[18:50] it'll write out 'crop.pgm' which should be cropped to include the content and include the grid it's using for segmentation
[18:50] oh :(
[18:50] ok
[18:50] This needs like, a readme.txt
[18:50] well my plan for it involves a rewrite, because proper monospace ocr is a thing the world needs
[18:50] desperately
[18:50] Otherwise I'll just be asking you questions all day.
[18:50] as we all know
[18:50] Yes, yes it does.
[18:50] yeah :|
[18:50] Here's my advice.
[18:50] If you want to make this usable to others.
[18:50] Make a debian/ubuntu/etc vm.
[18:51] And set the thing up from scratch, and write down each step as you do.
[18:51] That's your readme.txt
[18:51] it's not currently intended to be useful, it's a quick hack
[18:51] yeah
[18:51] It's how I do all my setup readme files and they always come out excellent, whereas people who just write them from memory always end up skipping steps and stuff.
[18:51] "if you want to do monospace ocr, here's a very fiddly thing that gets good results"
[18:52] i usually wind up with 2 or 3 misrecognized characters per page at most
[18:52] Yeah that would be incredible.
[18:52] it doesn't assume it knows what letters look like
[18:52] instead it's like "hm, idk what this is, hey user, what is it?"
and you say "that's a B, and you can use it as an example of other Bs"
[18:53] so it builds training data as you go
[18:53] but i'm planning a rewrite with maintainability and also smartness
[18:53] next version will lay down a grid over the page that can be skewed and bent, so that photographs of monospace text can be recognized f.ex
[18:54] and it'll be much much less manual fiddly process
[18:54] Yes well, the next version is in the indeterminate future and I'd like this to be on the net now. :P
[18:54] because i have about 500 pages of photographs of typewritten text that i'd like to get into readable form
[18:54] yes i know :
[18:54] :)
[18:58] i have my work OCR running, it's going VERY slowly, lol
[18:59] Unsurprising. XD
[19:01] i feel like i should've just ran it on the first 10 pages or something instead of all 700, lol
[19:02] so now page 36 and 37 of 2014-04-30 of east bay express is giving me 404
[19:02] *** HarryCros has quit IRC (Read error: Connection reset by peer)
[19:03] *** HarryCros has joined #archiveteam-bs
[19:06] ok looks there was a middle booklet for 'Bike to work day'
[19:06] on may 8 that year
[19:06] after that it continues from page 28
[19:07] so its really page 8 and 9 of the 'bike to work day' booklet
[19:09] *** HarryCros has quit IRC (Remote host closed the connection)
[19:09] *** HarryCros has joined #archiveteam-bs
[19:10] so there are pages 31 to 33 missing from 2014-06-04
[19:11] http://meddl.com/temp/ocrtest.txt
[19:11] that's what my work OCR app did, lol
[19:11] the first 10 pages at least
[19:11] *** HarryCros has quit IRC (Remote host closed the connection)
[19:12] Not that bad, tbh. But only of about comparable quality to the archive.org OCR scan.
[19:13] yeah, lol
[19:13] doesn't help that there's holes all through it
[19:13] It really *really* doesn't.
[19:13] Like I said, nasty.
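Editorial aside on the "ask the user, then reuse the answer as an example" approach described above: a minimal sketch, assuming a page that is already binarized into 0/1 pixels and a fixed cell size. This is not ess-ocr's actual code (that tool is written for a C compiler and reads 'pgm' images); `split_grid`, `distance`, `GlyphTrainer`, and the `max_distance` threshold are hypothetical names chosen for illustration.

```python
from typing import Callable

Cell = tuple[tuple[int, ...], ...]  # one character cell: rows of 0/1 pixels


def split_grid(page: list[list[int]], cell_w: int, cell_h: int) -> list[list[Cell]]:
    """Cut a binarized page into rows of fixed-size character cells."""
    rows = []
    for top in range(0, len(page) - cell_h + 1, cell_h):
        row = []
        for left in range(0, len(page[0]) - cell_w + 1, cell_w):
            row.append(tuple(tuple(page[y][left:left + cell_w])
                             for y in range(top, top + cell_h)))
        rows.append(row)
    return rows


def distance(a: Cell, b: Cell) -> int:
    """Hamming distance: number of pixels that differ between two cells."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


class GlyphTrainer:
    """Builds training data as it goes instead of assuming letter shapes."""

    def __init__(self, max_distance: int) -> None:
        self.max_distance = max_distance
        self.examples: list[tuple[Cell, str]] = []

    def classify(self, cell: Cell, ask: Callable[[Cell], str]) -> str:
        # Find the closest labeled example seen so far, if any.
        best = min(self.examples, key=lambda ex: distance(cell, ex[0]), default=None)
        if best is not None and distance(cell, best[0]) <= self.max_distance:
            return best[1]
        # Unknown shape: ask the user and keep the answer as a new example.
        label = ask(cell)
        self.examples.append((cell, label))
        return label
```

In an interactive run, `ask` would display the cell and read the user's answer; each answer reduces how often later cells need a prompt. A fixed grid like this only works when the page is deskewed and truly monospaced, which matches the caveats given in the conversation.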
[19:15] lol
[19:26] *** C4K3_ has quit IRC (leaving)
[19:30] *** C4K3 has joined #archiveteam-bs
[19:36] *** kristian_ has joined #archiveteam-bs
[19:40] *** odemg has quit IRC (Read error: Operation timed out)
[19:55] *** odemg has joined #archiveteam-bs
[20:41] *** bitBaron has joined #archiveteam-bs
[20:41] *** bitBaron has quit IRC (Client Quit)
[20:48] yeah, why not move the "WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD" bs to a specific channel?
[20:48] if the main channel is too holy to even discuss itself
[20:51] I would not give password to this person >> and I give you sucky sucky
[20:52] Sanqui Sanqui?
[21:02] $5
[21:07] i'm with Sanqui here
[21:09] *** Honno has quit IRC (Read error: Operation timed out)
[21:12] *** HarryCros has joined #archiveteam-bs
[21:20] *** Pudsey has joined #archiveteam-bs
[21:21] *** HarryCros has quit IRC (Remote host closed the connection)
[21:21] *** kristian_ has quit IRC (Ping timeout: 370 seconds)
[21:24] *** HarryCros has joined #archiveteam-bs
[21:25] *** HarryCros has quit IRC (Read error: Connection reset by peer)
[21:25] *** HarryCros has joined #archiveteam-bs
[21:38] *** Pudsey has quit IRC (Remote host closed the connection)
[21:40] *** ZexaronS has joined #archiveteam-bs
[21:42] *** kristian_ has joined #archiveteam-bs
[21:50] *** REiN^ has quit IRC (Max SendQ exceeded)
[21:51] *** REiN^ has joined #archiveteam-bs
[21:51] *** REiN^ has quit IRC (Max SendQ exceeded)
[21:52] *** REiN^ has joined #archiveteam-bs
[21:55] *** drumstick has joined #archiveteam-bs
[21:56] I'm not sure that this is what was meant with "steal from work": https://twitter.com/SeamusHughes/status/900790149017219073
[21:56] :p
[21:59] a guy put this up on r/opendirectories : http://95.211.186.214/Incoming/
[22:00] ohh, lots of oldish stuff there
[22:00] ha
[22:00] yeah I'm pretty sure this guy went to... 33C3?
[22:00] i can't grab it but some one here would want it
[22:00] fairly certain that the Leaks directory came off one of the FTPs there
[22:01] godane: definitely, thanks :P
[22:01] i got from here: https://www.reddit.com/r/opendirectories/comments/6vri47/large_dj_sets_directory/
[22:01] i know that SketchCow is looking for older dj sets
[22:02] godane: oh, is he? any particular type?
[22:02] I may have a pile of my own
[22:02] that is, sets from a specific internet radio channel and some of their live events
[22:02] (afterhoursdjs.org, but it's stuff that isn't yet in the collection on IA)
[22:02] he is doing the hip hop mixtapes collection
[22:03] ah yeah, this is def not hip hop :p
[22:03] he will take it
[22:05] this is what I have laying around: https://gist.githubusercontent.com/joepie91/1afb987f86a2b417c33a61a35a6c0f29/raw/84a200d991cf846e79133b78e046558b8104a822/gistfile1.txt (cc SketchCow - let me know if you want me to ship them to FOS or such)
[22:05] haven't gotten around to sorting out the metadata yet
[22:06] goes back to 2002 in some places :P
[22:28] uploaded : https://archive.org/details/forum.kingsnake.com-1997-to-2003-archives-20161203:
[22:28] *uploaded : https://archive.org/details/forum.kingsnake.com-1997-to-2003-archives-20161203
[22:52] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:28] *** kristian_ has quit IRC (Quit: Leaving)
[23:38] That DJ directory looks like crap.
[23:40] *** BlueMaxim has joined #archiveteam-bs
[23:52] *** Stilett0 has joined #archiveteam-bs
[23:59] *** TC01 has quit IRC (Remote host closed the connection)